# A Gentle Introduction to Deep Learning in Medical Image Processing

###### Abstract

This paper tries to give a gentle introduction to deep learning in medical image processing, proceeding from theoretical foundations to applications. We first discuss general reasons for the popularity of deep learning, including several major breakthroughs in computer science. Next, we review the fundamentals of the perceptron and neural networks, along with some basic theory that is often omitted. Doing so allows us to understand the reasons for the rise of deep learning in many application domains. Medical image processing is one of the areas that have been strongly affected by this rapid progress, in particular in image detection and recognition, image segmentation, image registration, and computer-aided diagnosis. There are also recent trends in physical simulation, modelling, and reconstruction that have led to astonishing results. Yet, some of these approaches neglect prior knowledge and hence bear the risk of producing implausible results. These apparent weaknesses highlight current limitations of deep learning. However, we also briefly discuss promising approaches that might be able to resolve these problems in the future.

###### keywords:

Introduction, Deep Learning, Machine Learning, Medical Imaging, Image Classification, Image Segmentation, Image Registration, Computer-aided Diagnosis, Physical Simulation, Image Reconstruction^{†}

^{†}journal: Journal of Medical Physics

## 1 Introduction

Over recent years, Deep Learning (DL) lecun2015deep has had a tremendous impact on various fields in science. It has led to significant improvements in speech recognition dahl2012context and image recognition krizhevsky2012imagenet, it is able to train artificial agents that beat human players in Go silver2016mastering and ATARI games mnih2015human, and it creates artistic new images mordvintsev2015inceptionism; tan2017artgan and music DBLP:journals/corr/abs-1709-01620. Before the advent of deep learning, many of these tasks were considered impossible for computers to solve, even in science fiction literature.

Obviously this technology is also highly relevant for medical imaging. Various introductions to the topic can be found in the literature, ranging from short tutorials and reviews seebock2015deep; shen2017deep; pawlowski2017dltk; litjens2017survey; erickson2017machine; suzuki2017survey; hagerty2017medical; lakhani2018hello; kim2018prospects; ker2018deep over blog posts and Jupyter notebooks rajchl2018introduction; breininger2018tutorial; cornelisse2018 to entire books zhou2017deep; lu2017deep. All of them serve a different purpose and offer a different view on this quickly evolving topic. A very good review paper is, for example, the work of Litjens et al. litjens2017survey, who made the incredible effort to review more than 300 papers in their article. Since then, however, many more noteworthy works have appeared, almost on a daily basis, which makes it difficult to create a review paper that matches the current pace in the field. Hence, it is important to select methods of significance and describe them in detail. Zhou et al. zhou2017deep do so for the state of the art of deep learning in medical image analysis and found an excellent selection of topics. Still, deep learning is being quickly adopted in other fields of medical image processing, and the book misses, for example, topics such as image reconstruction. While an overview of important methods in the field is crucial, the actual implementation is equally important to move the field ahead. Hence, works like the short tutorial by Breininger et al. breininger2018tutorial are highly relevant because they also introduce the topic at the code level. Their Jupyter notebook framework creates an interactive experience in the web browser to implement deep learning basics in Python. In summary, we observe that the topic is too complex and evolves too quickly to be summarized in a single document. Yet, over the past few months there have already been so many exciting developments in the field of medical image processing that we believe it is worthwhile to point them out and to connect them in a single introduction.

Readers of this article do not have to be closely acquainted with deep learning and its terminology. We will summarize the relevant theory and present it at a level of detail that is sufficient to follow the major concepts in deep learning. Furthermore, we connect these observations with traditional concepts in pattern recognition and machine learning. In addition, we put these foundations into the context of emerging approaches in medical image processing and analysis, including applications in physical simulation and image reconstruction. As a last aim of this introduction, we also clearly indicate potential weaknesses of the current technology and outline potential remedies.

## 2 Materials and Methods

### 2.1 Introduction to machine learning and pattern recognition

Machine learning and pattern recognition essentially deal with the problem of automatically finding a decision, for example, separating apples from pears. In traditional literature niemann2013pattern, this process is outlined using the pattern recognition system (cf. Fig. 1). During a training phase, the so-called training data set is preprocessed and meaningful features are extracted. While the preprocessing is understood to remain in the original space of the data and comprises operations such as noise reduction and image rectification, feature extraction faces the task of determining an algorithm that is able to extract a distinctive and complete feature representation, for example, color or length of the semi-axes of a surrounding ellipse for our apples and pears example. This task is truly difficult to generalize, and it is necessary to design such features anew for essentially every new application. In the deep learning literature, this process is often also referred to as “hand-crafting” features. Based on the feature vector $\boldsymbol{x} \in \mathbb{R}^n$, the classifier has to predict the correct class $y$, which is typically estimated by a function $\hat{y} = \hat{f}(\boldsymbol{x})$ that directly results in the classification result $\hat{y}$. The classifier’s parameter vector $\boldsymbol{\theta}$ is determined during the training phase and later evaluated on an independent test data set.

### 2.2 Neural networks

In this context, we can now follow neural networks and associated methods in their role as classifiers. The fundamental unit of a neural network is a neuron. It takes a bias $w_0$ and a weight vector $\boldsymbol{w} = (w_1, \ldots, w_n)^\top$ as parameters to model a decision

$$\hat{f}(\boldsymbol{x}) = h(\boldsymbol{w}^\top \boldsymbol{x} + w_0) \tag{1}$$

using a non-linear activation function $h(x)$. Hence, a single neuron itself can already be interpreted as a classifier, if the activation function is chosen such that it is monotonic, bounded, and continuous. In this case, the maximum and the minimum can be interpreted as a decision for the one or the other class. Typical representatives for such activation functions in classical literature are the sign function $\text{sign}(x)$ resulting in Rosenblatt’s perceptron rosenblatt1957perceptron, the sigmoid function $\sigma(x) = \frac{1}{1 + e^{-x}}$, or the hyperbolic tangent $\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$ (cf. Fig. 5). A major disadvantage of individual neurons is that they can only model linear decision boundaries, resulting in the well-known fact that they are not able to solve the XOR problem. Fig. 2 summarizes the considerations towards the computational neuron graphically.
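
The limitation to linear decision boundaries can be checked numerically. The following is a minimal sketch in plain Python, not taken from the original work: the helper names `neuron` and `solvable` and the coarse search grid are illustrative assumptions. It evaluates Eq. 1 with the sign activation and searches the grid for parameters that reproduce the AND and the XOR truth tables.

```python
import itertools

def sign(a):
    """Sign activation as used in the perceptron (value 1 at a == 0)."""
    return 1 if a >= 0 else -1

def neuron(x, w, w0, h):
    """Evaluate h(w^T x + w0) for one input vector x (Eq. 1)."""
    return h(sum(wi * xi for wi, xi in zip(w, x)) + w0)

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
and_labels = [-1, -1, -1, 1]
xor_labels = [-1, 1, 1, -1]

def solvable(labels, grid):
    """Return True if some (w1, w2, w0) on the grid reproduces all labels."""
    for w1, w2, w0 in itertools.product(grid, repeat=3):
        if all(neuron(x, (w1, w2), w0, sign) == y
               for x, y in zip(inputs, labels)):
            return True
    return False

# Hypothetical search grid: -4.0 ... 4.0 in steps of 0.5.
grid = [i / 2 for i in range(-8, 9)]
and_ok = solvable(and_labels, grid)
xor_ok = solvable(xor_labels, grid)
print("AND solvable:", and_ok, "| XOR solvable:", xor_ok)
```

The grid search is only an illustration. That XOR cannot be solved by a single neuron holds for all real-valued weights, because the four constraints on $w_1$, $w_2$, and $w_0$ contradict each other.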

In combination with other neurons, modelling capabilities increase dramatically. Arranged in a single layer, it can already be shown that neural networks can approximate any continuous function $f(\boldsymbol{x})$ on a compact subset of $\mathbb{R}^n$ cybenko1989approximation. A single layer network is conveniently summarized as a linear combination of $N$ individual neurons

$$\hat{f}(\boldsymbol{x}) = \sum_{i=0}^{N-1} v_i \, h(\boldsymbol{w}_i^\top \boldsymbol{x} + w_{0,i}) \tag{2}$$

using combination weights $v_i$. All trainable parameters of this network can be summarized as

$$\boldsymbol{\theta} = (v_0, w_{0,0}, \boldsymbol{w}_0, \ldots, v_{N-1}, w_{0,N-1}, \boldsymbol{w}_{N-1})^\top.$$

The difference between the true function $f(\boldsymbol{x})$ and its approximation $\hat{f}(\boldsymbol{x})$ is bounded by

$$|f(\boldsymbol{x}) - \hat{f}(\boldsymbol{x})| < \epsilon \tag{3}$$

where $\epsilon$ decreases with increasing $N$ for activation functions that satisfy the criteria that we mentioned earlier (monotonicity, boundedness, continuity) hornik1991approximation. Hence, given a large number of neurons, any such function can be approximated using a single layer network only. At first glance, this contradicts all recent developments in deep learning and therefore requires additional attention.
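
To make the linear combination of neurons in Eq. 2 concrete, the sketch below builds a single layer of two neurons whose weighted sum reproduces XOR on binary inputs. The parameters are hand-picked for illustration rather than trained, and the sign activation is used for simplicity even though it is not continuous; both choices are assumptions of this example.

```python
def sign(a):
    return 1.0 if a >= 0 else -1.0

def single_layer(x, hidden, v):
    """Eq. 2: sum_i v_i * h(w_i^T x + w_{0,i}) with h = sign."""
    total = 0.0
    for (w, w0), vi in zip(hidden, v):
        total += vi * sign(sum(wj * xj for wj, xj in zip(w, x)) + w0)
    return total

# Hand-picked parameters: neuron 0 behaves like OR, neuron 1 like AND.
hidden = [((1.0, 1.0), -0.5), ((1.0, 1.0), -1.5)]
v = [0.5, -0.5]

xor_outputs = {x: single_layer(x, hidden, v)
               for x in [(0, 0), (0, 1), (1, 0), (1, 1)]}
print(xor_outputs)
```

With these weights the output is 1 exactly where one of the two inputs is set, which a single neuron cannot achieve.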

In the literature, many arguments are found why a deep structure has benefits for feature representation, including the argument that by recombination of the weights along the different paths through the network, features may be re-used exponentially bengio2013representation. Instead of summarizing this long line of arguments, we look into a slightly simpler example that is summarized graphically in Fig. 3. Decision trees are also able to describe general decision boundaries in $\mathbb{R}^n$. A simple example is shown on the top left of the figure, and the associated partition of a two-dimensional space is shown below, where black indicates class $y = 1$ and white $y = 0$. According to the universal approximation theorem, we should be able to map this function into a single layer network. In the center column, we attempt to do so using the inner nodes of the tree and their inverses to construct a six neuron basis. At the bottom of the column, we show the basis functions that are constructed at every node projected into the input space, and the resulting network’s approximation, also shown in the input space. Here, we chose the output weights to minimize $\lVert y - \hat{y} \rVert_2$. As can be seen in the result, not all areas can be recovered correctly. In fact, the maximal error $\epsilon$ is close to 0.7 for a function that is bounded by 0 and 1. In order to improve this approximation, we can choose to introduce a second layer. As shown in the right column, we can choose the strategy to map all inner nodes to a first layer and all leaf nodes of the tree to a second layer. Doing so effectively encodes every partition that is described by the respective leaf node in the second layer. This approach is able to map our tree correctly with $\epsilon = 0$. In fact, this approach is general, holds for all decision trees, and was already described by Ivanova et al. in 1995 ivanova1995initialization. As such, we can now understand why deeper networks may have more modelling capacity.

### 2.3 Network training

Having gained basic insights into neural networks and their basic topology, we still need to discuss how their parameters are actually determined. The answer is fairly easy: gradient descent. In order to compute a gradient, we need to define a function that measures the quality of our parameter set $\boldsymbol{\theta}$, the so-called loss function $L(\boldsymbol{\theta})$. In the following, we will work with simple examples for loss functions to introduce the concept of back-propagation, which is the algorithm that is commonly used to efficiently compute gradients for neural network training.

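We can represent a single-layer fully connected network with linear activations simply as $\hat{\boldsymbol{y}} = \hat{\boldsymbol{f}}(\boldsymbol{x}) = \boldsymbol{W}\boldsymbol{x}$, i.e., a matrix multiplication. Note that the network’s output is now multidimensional with $\hat{\boldsymbol{y}}, \boldsymbol{y} \in \mathbb{R}^m$. Using an L2-loss, we end up with the following objective function:

$$L(\boldsymbol{\theta}) = \frac{1}{2} \lVert \boldsymbol{W}\boldsymbol{x} - \boldsymbol{y} \rVert_2^2 \tag{4}$$

In order to update the parameters $\boldsymbol{\theta} = \boldsymbol{W}$ in this example, we need to compute

$$\frac{\partial L}{\partial \boldsymbol{W}} = \frac{\partial L}{\partial \hat{\boldsymbol{f}}} \cdot \frac{\partial \hat{\boldsymbol{f}}}{\partial \boldsymbol{W}} = (\boldsymbol{W}\boldsymbol{x} - \boldsymbol{y}) \cdot (\boldsymbol{x}^\top) \tag{5}$$

using the chain rule. Note that $\cdot$ indicates the operator’s side, as matrix vector multiplications generally do not commute. The final weight update is then obtained as

$$\boldsymbol{W}^{j+1} = \boldsymbol{W}^{j} - \eta \, (\boldsymbol{W}^{j}\boldsymbol{x} - \boldsymbol{y}) \, \boldsymbol{x}^\top \tag{6}$$

where $\eta$ is the so-called learning rate and $j$ is used to index the iteration number.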
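
The update rule can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the input `x`, the target `y`, the zero initialization, the learning rate `eta = 0.1`, and the 20 iterations are all arbitrary choices for this example, and only a single training sample is used.

```python
def matvec(W, x):
    """Matrix-vector product W x for a matrix stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def l2_loss(W, x, y):
    """L2 objective 0.5 * ||W x - y||^2 of the linear single-layer network."""
    return 0.5 * sum((o - t) ** 2 for o, t in zip(matvec(W, x), y))

def gradient_step(W, x, y, eta):
    """One update W <- W - eta * (W x - y) x^T."""
    residual = [o - t for o, t in zip(matvec(W, x), y)]
    return [[w - eta * r * xi for w, xi in zip(row, x)]
            for row, r in zip(W, residual)]

# Illustrative data: one sample with n = 3 inputs and m = 2 outputs.
x = [1.0, 2.0, -1.0]
y = [0.5, -1.0]
W = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
eta = 0.1

losses = [l2_loss(W, x, y)]
for j in range(20):
    W = gradient_step(W, x, y, eta)
    losses.append(l2_loss(W, x, y))
print(losses[0], losses[-1])
```

For this single sample, each step scales the residual by $1 - \eta \lVert \boldsymbol{x} \rVert_2^2 = 0.4$, so the loss shrinks geometrically. A learning rate with $\eta \lVert \boldsymbol{x} \rVert_2^2 > 2$ would make the same iteration diverge.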

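Now, let us consider a slightly more complicated network structure with three layers $\hat{\boldsymbol{f}}_3(\hat{\boldsymbol{f}}_2(\hat{\boldsymbol{f}}_1(\boldsymbol{x}))) = \boldsymbol{W}_3 \boldsymbol{W}_2 \boldsymbol{W}_1 \boldsymbol{x}$, again using linear activations. This yields the following objective function:

$$L(\boldsymbol{\theta}) = \frac{1}{2} \lVert \boldsymbol{W}_3 \boldsymbol{W}_2 \boldsymbol{W}_1 \boldsymbol{x} - \boldsymbol{y} \rVert_2^2 \tag{7}$$

Note that this example is academic, as $\boldsymbol{\theta} = \{\boldsymbol{W}_1, \boldsymbol{W}_2, \boldsymbol{W}_3\}$ could simply be collapsed to a single matrix. Computing the gradient with respect to the parameters of the last layer $\boldsymbol{W}_3$ follows the same recipe as in the previous network:

$$\frac{\partial L}{\partial \boldsymbol{W}_3} = \frac{\partial L}{\partial \hat{\boldsymbol{f}}_3} \cdot \frac{\partial \hat{\boldsymbol{f}}_3}{\partial \boldsymbol{W}_3} = (\boldsymbol{W}_3 \boldsymbol{W}_2 \boldsymbol{W}_1 \boldsymbol{x} - \boldsymbol{y}) \cdot (\boldsymbol{W}_2 \boldsymbol{W}_1 \boldsymbol{x})^\top \tag{8}$$

For the computation of the gradient with respect to the second layer $\boldsymbol{W}_2$, we already need to apply the chain rule twice:

$$\frac{\partial L}{\partial \boldsymbol{W}_2} = \frac{\partial L}{\partial \hat{\boldsymbol{f}}_3} \cdot \frac{\partial \hat{\boldsymbol{f}}_3}{\partial \hat{\boldsymbol{f}}_2} \cdot \frac{\partial \hat{\boldsymbol{f}}_2}{\partial \boldsymbol{W}_2} = \boldsymbol{W}_3^\top (\boldsymbol{W}_3 \boldsymbol{W}_2 \boldsymbol{W}_1 \boldsymbol{x} - \boldsymbol{y}) \cdot (\boldsymbol{W}_1 \boldsymbol{x})^\top \tag{9}$$

This leads us to the input layer gradient that is determined as

$$\frac{\partial L}{\partial \boldsymbol{W}_1} = \frac{\partial L}{\partial \hat{\boldsymbol{f}}_3} \cdot \frac{\partial \hat{\boldsymbol{f}}_3}{\partial \hat{\boldsymbol{f}}_2} \cdot \frac{\partial \hat{\boldsymbol{f}}_2}{\partial \hat{\boldsymbol{f}}_1} \cdot \frac{\partial \hat{\boldsymbol{f}}_1}{\partial \boldsymbol{W}_1} = \boldsymbol{W}_2^\top \boldsymbol{W}_3^\top (\boldsymbol{W}_3 \boldsymbol{W}_2 \boldsymbol{W}_1 \boldsymbol{x} - \boldsymbol{y}) \cdot (\boldsymbol{x})^\top \tag{10}$$

The matrix derivatives above are also visualized graphically in Fig. 4. Note that many intermediate results can be reused during the computation of the gradient, which is one of the reasons why back-propagation is efficient in computing updates. Also note that the forward pass through the net is part of $\frac{\partial L}{\partial \hat{\boldsymbol{f}}_3}$, which is contained in all gradients of the net. The other partial derivatives are only partial derivatives either with respect to the input or the parameters of the respective layer. Hence, back-propagation can be used if both operations are known for every layer in the net. Having determined the gradients, each parameter can now be updated analogous to Eq. 6.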
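
The reuse of intermediate results can be illustrated with a short sketch for the three-layer linear network. The code below is an assumption-laden example rather than a reference implementation: matrix sizes, the random initialization, and the helper names are chosen freely. It computes the three gradients in one backward sweep and compares them against central finite differences.

```python
import random

def matvec(A, v):
    return [sum(a * b for a, b in zip(row, v)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def outer(u, v):
    return [[ui * vj for vj in v] for ui in u]

def loss(W1, W2, W3, x, y):
    out = matvec(W3, matvec(W2, matvec(W1, x)))
    return 0.5 * sum((o - t) ** 2 for o, t in zip(out, y))

def backprop(W1, W2, W3, x, y):
    # Forward pass; the intermediate activations are kept for reuse.
    a1 = matvec(W1, x)
    a2 = matvec(W2, a1)
    a3 = matvec(W3, a2)
    d3 = [o - t for o, t in zip(a3, y)]   # dL/df3, the residual
    g3 = outer(d3, a2)                    # gradient w.r.t. W3
    d2 = matvec(transpose(W3), d3)        # residual propagated through W3
    g2 = outer(d2, a1)                    # gradient w.r.t. W2
    d1 = matvec(transpose(W2), d2)        # residual propagated through W2
    g1 = outer(d1, x)                     # gradient w.r.t. W1
    return g1, g2, g3

def numeric_grad(k, Ws, x, y, eps=1e-6):
    """Central finite differences for every entry of Ws[k]."""
    W = Ws[k]
    grad = [[0.0] * len(W[0]) for _ in W]
    for i in range(len(W)):
        for j in range(len(W[0])):
            old = W[i][j]
            W[i][j] = old + eps
            lp = loss(*Ws, x, y)
            W[i][j] = old - eps
            lm = loss(*Ws, x, y)
            W[i][j] = old
            grad[i][j] = (lp - lm) / (2 * eps)
    return grad

def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
x, y = [0.3, -0.7], [1.0, 0.5]
Ws = [rand_matrix(3, 2), rand_matrix(3, 3), rand_matrix(2, 3)]

analytic = backprop(*Ws, x, y)
max_err = 0.0
for k in range(3):
    numeric = numeric_grad(k, Ws, x, y)
    for row_a, row_n in zip(analytic[k], numeric):
        for a, n in zip(row_a, row_n):
            max_err = max(max_err, abs(a - n))
print("largest deviation from finite differences:", max_err)
```

Note how `d3`, `d2`, and `d1` are each computed once and passed on to the next, earlier layer. This is the reuse of intermediate results that makes back-propagation efficient.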

### 2.4 Deep Learning

With the knowledge summarized in the previous sections, networks can be constructed and trained. However, this alone does not yet make deep learning possible. One important element was the establishment of additional activation functions that are displayed in Fig. 5. In contrast to classical bounded activations like $\text{sign}(x)$, $\sigma(x)$, and $\tanh(x)$, the new functions such as the Rectified Linear Unit

$$\text{ReLU}(x) = \begin{cases} x & \text{if } x \geq 0 \\ 0 & \text{else,} \end{cases}$$

and many others, of which we only mention the Leaky ReLU

$$\text{LReLU}(x) = \begin{cases} x & \text{if } x \geq 0 \\ \alpha x & \text{else} \end{cases}$$

with a small positive slope $\alpha$, were identified to be useful to train deeper networks. Contrary to the classical activation functions, many of the new activation functions are convex and have large areas with non-zero gradients. This alleviates the numerical issues behind vanishing gradients, which previously did not allow training of networks that were much deeper than about three layers. Also note that each neuron does not lose its interpretation as a classifier, if we consider 0 as the classification boundary. Furthermore, the universal approximation theorem still holds for a single-layer network with ReLUs sonoda2017neural. Hence, several useful and desirable properties are attained using such modern activation functions.

One disadvantage is, of course, that the ReLU is not differentiable over the entire domain of $x$. At $x = 0$, a kink is found that does not allow determining a unique gradient. For optimization, an important property of the gradient of a function is that it points in the direction of the steepest ascent. Hence, following the negative direction allows minimization of the function. For a differentiable function, this direction is unique. If this constraint is relaxed to allow multiple directions that lead to an extremum, we arrive at sub-gradient theory rockafellar. It allows us to still use gradient descent algorithms to optimize such problems, if it is possible to determine a sub-gradient, i.e., at least one instance of a valid direction towards the optimum. For the ReLU, any value between 0 and -1 would be acceptable at $x = 0$ for the descent operation. If such a direction can be obtained, convergence is guaranteed for convex problems by application of specific optimization programs, such as using a fixed step size in the gradient descent bertsekas2015convex. This allows us to remain with back-propagation for optimization, while using non-differentiable activation functions.
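
As a small illustration of these activation functions and of the choice at the kink, consider the following sketch. The default slope `alpha = 0.01` and the parameter name `at_zero` are assumptions made for this example only.

```python
def relu(x):
    """Rectified Linear Unit."""
    return x if x >= 0 else 0.0

def leaky_relu(x, alpha=0.01):
    """Leaky ReLU; alpha is an illustrative default, not a prescribed value."""
    return x if x >= 0 else alpha * x

def relu_subgradient(x, at_zero=0.0):
    """Return one valid sub-gradient of the ReLU at x.

    For x != 0 the derivative is unique. At the kink x = 0, every value in
    [0, 1] is a valid sub-gradient; `at_zero` selects which one is used.
    """
    if not 0.0 <= at_zero <= 1.0:
        raise ValueError("at_zero must lie in [0, 1]")
    if x > 0:
        return 1.0
    if x < 0:
        return 0.0
    return at_zero

for value in (-2.0, 0.0, 3.0):
    print(value, relu(value), leaky_relu(value), relu_subgradient(value, 0.5))
```

The function returns the sub-gradient itself. The descent step uses its negative, which is why the admissible values at the kink for the descent operation lie between 0 and -1.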

Another significant advance towards deep learning is the use of specialized layers. In particular, the so-called convolution and pooling layers make it possible to model locality and abstraction (cf. Fig. 6). The major advantage of the convolution layers is that they only consider a local neighborhood for each neuron, and that all neurons of the same layer share the same weights, which dramatically reduces the amount of memory required to store such a layer. These restrictions are identical to limiting the matrix multiplication to a matrix with circulant structure, which exactly models the operation of convolution. As the operation is generally of the form of a matrix multiplication, the gradients introduced in Section 2.3 still apply. Pooling is an operation that is used to reduce the scale of the input. For images, typically areas of $2 \times 2$ or $3 \times 3$ are analyzed and summarized to a single value. The average operation can again be expressed as a matrix with hard-coded weights, and gradient computation essentially follows the previous section. Non-linear operations, such as maximum or median, however, require more attention. Again, we can exploit the sub-gradient approach. During the forward pass through the net, the maximum or median can easily be determined. Once this is known, a matrix is constructed that simply selects the correct elements that would also have been selected by the non-linear methods. The transpose of the same matrix is then employed during the backward pass to determine an appropriate sub-gradient miccai:schirrmacher. Fig. 6 shows both operations graphically and highlights an example for a convolutional neural network (CNN). If we now compare this network with Fig. 1, we see that the original interpretation as only a classifier is no longer valid. Instead, the deep network now models all steps directly from the signal up to the classification stage. Hence, many authors claim that feature “hand-crafting” is no longer required because everything is learned by the network in a data-driven manner.
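
Both ideas, convolution as a structured matrix multiplication and max-pooling via a selection matrix, can be sketched in one dimension. The example below is a simplified illustration under several assumptions: the signal is 1-D, the convolution is circular, and the kernel, the input, and the upstream gradient are made-up values.

```python
def matvec(A, v):
    return [sum(a * b for a, b in zip(row, v)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def circulant_conv_matrix(kernel, n):
    """n x n circulant matrix C with (C x)[i] = sum_k kernel[k] * x[(i - k) % n]."""
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k, w in enumerate(kernel):
            C[i][(i - k) % n] += w
    return C

def circular_conv(kernel, x):
    """Direct circular convolution, used as a reference."""
    n = len(x)
    return [sum(w * x[(i - k) % n] for k, w in enumerate(kernel))
            for i in range(n)]

def max_pool_selection(x, size):
    """Selection matrix that picks the maximum of each non-overlapping window."""
    n_out = len(x) // size
    S = [[0.0] * len(x) for _ in range(n_out)]
    for o in range(n_out):
        window = range(o * size, (o + 1) * size)
        S[o][max(window, key=lambda i: x[i])] = 1.0
    return S

x = [1, 3, 2, 5, 4, 0]
kernel = [1, -1]                      # only two shared weights for the layer

C = circulant_conv_matrix(kernel, len(x))
conv_out = matvec(C, x)               # convolution as matrix multiplication

S = max_pool_selection(conv_out, 2)   # built during the forward pass
pooled = matvec(S, conv_out)

upstream = [10.0, 20.0, 30.0]         # hypothetical gradient arriving from above
backward = matvec(transpose(S), upstream)
print(conv_out, pooled, backward)
```

The transpose of the selection matrix routes each incoming gradient value back to the position that produced the maximum and assigns zero to all other positions.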

The last missing remark towards deep learning concerns three enabling factors: the availability of large amounts of data and annotations that could be gathered over the internet, the immense compute power that became available by using graphics cards for general purpose computations, and, last but not least, the positive trend towards open source software that enables users world-wide to download and extend deep learning methods very quickly. All three elements were crucial to enable this extremely fast rise of deep learning.

### 2.5 Noteworthy architectures and concepts

With the developments of the previous section, much progress was made towards improved signal, image, video, and audio processing, as already detailed earlier. In this introduction, we are not able to highlight all developments, because this would go well beyond the scope of this document, and there are other sources that are more suited for this purpose bengio2013representation; goodfellow2016deep; litjens2017survey. Instead, we will only briefly discuss some advanced network architectures and methods that we believe had, or will have, an impact on medical image processing.

Autoencoders use a contracting and an expanding branch to find representations of the input of a lower dimensionality vincent2008extracting. They do not require annotations, as the network is trained to predict the original input using loss functions such as $L(\boldsymbol{\theta}) = \lVert \hat{\boldsymbol{f}}(\boldsymbol{x}) - \boldsymbol{x} \rVert_2^2$. Variants use convolutional networks holden2015learning, add noise to the input vincent2010stacked, or aim at finding sparse representations huang2007unsupervised.

Google’s inception network is an advanced and deep architecture that was applied successfully for several tasks szegedy2015going. Its main highlight is the introduction of the so-called inception block that essentially allows convolutions and pooling operations to be computed in parallel. By repeating this block in a network, the network can select by itself in which sequence convolution and pooling layers should be combined in order to solve the task at hand effectively.

Ronneberger’s U-net is a breakthrough towards automatic image segmentation ronneberger2015u and has been applied successfully in many tasks that require image-to-image transforms, for example, images to segmentation masks. Like the autoencoder, it consists of a contracting and an expanding branch, and it enables multi-resolution analysis. In addition, U-net features skip connections that connect the matching resolution levels of the encoder and the decoder stage. Doing so, the architecture is able to model general high-resolution multi-scale image-to-image transforms. Originally proposed in 2-D, many extensions, such as 3-D versions, exist cciccek20163d; milletari2016v.

ResNets have been designed to enable training of very deep networks he2016deep. Even with the methods described earlier in this paper, networks will not benefit from more than 30 to 50 layers, as the gradient flow becomes numerically unstable in such deep networks. In order to alleviate the problem, a so-called residual block is introduced, and layers take the form $\hat{\boldsymbol{f}}(\boldsymbol{x}) = \boldsymbol{x} + \hat{\boldsymbol{f}}'(\boldsymbol{x})$, where $\hat{\boldsymbol{f}}'(\boldsymbol{x})$ contains the actual network layer. Doing so has the advantage that the addition introduces a second parallel branch into the network that lets the gradient flow from end to end. ResNets also have other interesting properties, e.g., their residual blocks behave like ensembles of classifiers veit2016residual.
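
The effect of the parallel identity branch on the gradient can be illustrated with a deliberately simplified scalar model. In the sketch below, each layer is assumed to be a scalar linear map with weight `w`, so a plain layer contributes the factor `w` to the end-to-end derivative and a residual layer contributes `1 + w`. The depth of 50 and the weight 0.01 are arbitrary illustrative values.

```python
def plain_chain_gradient(weights):
    """d(output)/d(input) for stacked scalar layers f(x) = w * x."""
    g = 1.0
    for w in weights:
        g *= w
    return g

def residual_chain_gradient(weights):
    """d(output)/d(input) for stacked residual layers f(x) = x + w * x."""
    g = 1.0
    for w in weights:
        g *= 1.0 + w
    return g

weights = [0.01] * 50
plain = plain_chain_gradient(weights)
residual = residual_chain_gradient(weights)
print("plain:", plain, "| residual:", residual)
```

In the plain stack the derivative collapses to practically zero, while the identity branch keeps it on the order of one. Real networks use matrices and non-linearities, so this is only a caricature of the gradient flow argument.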

Variational Networks enable the conversion of an energy minimization problem into a neural network structure kobler2017variational. We consider this type of network as particularly interesting, as many problems in traditional medical image processing are expressed as energy minimization problems. The main idea is as follows: The energy function is typically minimized by optimization programs such as gradient descent. Thus, we are able to use the gradient of the original problem to construct a so-called variational unit that describes exactly one update step of the optimization program. A succession of such units then describes the complete variational network. Two observations are noteworthy: First, this type of framework allows operators to be learned within one variational unit, such as a sparsifying transform for compressed sensing problems. Second, the variational units generally form residual blocks, and thus variational networks are always ResNets as well.
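
The construction of a variational unit from a gradient descent step can be sketched for a toy problem. Everything specific in the following example is an assumption made for illustration: the energy is a 1-D denoising energy $E(\boldsymbol{x}) = \frac{1}{2}\lVert \boldsymbol{x} - \boldsymbol{y} \rVert_2^2 + \frac{\lambda}{2} \sum_i (x_{i+1} - x_i)^2$, the operators are fixed rather than learned, and the values of `lam`, `eta`, and the signal `y` are made up.

```python
def energy(x, y, lam):
    """Toy energy: data fidelity plus quadratic smoothness penalty."""
    data = 0.5 * sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    smooth = 0.5 * lam * sum((x[i + 1] - x[i]) ** 2 for i in range(len(x) - 1))
    return data + smooth

def energy_gradient(x, y, lam):
    """Gradient of the toy energy with respect to x."""
    grad = [xi - yi for xi, yi in zip(x, y)]
    for i in range(len(x) - 1):
        d = x[i + 1] - x[i]
        grad[i] -= lam * d
        grad[i + 1] += lam * d
    return grad

def variational_unit(x, y, lam, eta):
    """One unit: x_next = x - eta * grad E(x), i.e. identity plus a residual."""
    g = energy_gradient(x, y, lam)
    return [xi - eta * gi for xi, gi in zip(x, g)]

y = [0.0, 1.2, 0.8, 1.1, 0.1, 0.9]   # made-up noisy measurement
lam, eta = 1.0, 0.2

x = list(y)
energies = [energy(x, y, lam)]
for unit in range(10):                # a succession of ten units
    x = variational_unit(x, y, lam, eta)
    energies.append(energy(x, y, lam))
print(energies[0], energies[-1])
```

Each unit has the form of a residual block, since the input is passed through unchanged and a correction term is added. In an actual variational network, parts of the gradient term, such as filters or step sizes, would be trainable parameters of each unit.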

Precision Learning is a strategy to include known operators into the learning process maier2018precision. While this idea is counter-intuitive for most recognition tasks, where we want to learn the optimal representation, the approach is actually very useful for signal processing tasks in which we know a priori that a certain operator must be present in the processing chain. Embedding the operator in the network reduces the maximal training error, reduces the number of unknowns and therefore the number of required training samples, and enables mixing of most signal processing methods with deep learning. The approach is applicable to a broad range of operators. The main requirement is that a gradient or sub-gradient must exist.

Recurrent neural networks (RNNs) enable the processing of sequences with long-term dependencies mandic2001recurrent. Furthermore, recurrent nets introduce state variables that allow the cells to carry memory and essentially model any finite state machine. Extensions are long short-term memory (LSTM) networks hochreiter1997long and gated recurrent units (GRU) chung2014empirical that can model explicit read and write memory transactions similar to a computer.

Adversarial examples consider the input to a neural network as a possible weak spot that could be exploited by an attacker yuan2017adversarial. Attacks range from generating a special kind of noise that will mislead the network, but not a human observer, to specialized inputs that will even mislead networks after printing and re-digitization of the attack pattern brown2017adversarial.

Generative adversarial networks (GANs) employ two networks to learn a representative distribution from the training data goodfellow2016nips. A generator network creates new images from a noise input, while a discriminator network tries to differentiate real images from generated images. Both are trained in an alternating manner such that both gradually improve for their respective tasks. GANs are known to be difficult to train. However, they also generate plausible and realistic-looking images. Conditional GANs gauthier2014conditional allow states to be encoded in the process such that images with desired properties can be generated. CycleGANs zhu2017unpaired drive this even further, as they allow an image to be converted from one domain to another, for example from day to night, without directly corresponding images in the training data.

Deep reinforcement learning is a technique that allows training an artificial agent to perform actions given inputs from an environment and expands on traditional reinforcement learning theory sutton1998reinforcement. In this context, deep networks are often used as flexible function approximators representing value functions and/or policies silver2016mastering. In order to enable time-series processing, sequences of environmental observations can be employed mnih2015human.

## 3 Results

As can be seen in the last few paragraphs, deep learning now offers a large set of new tools that are applicable to many problems in the world of medical image processing. In fact, these tools have already been widely employed. In particular, perceptual tasks are well suited for deep learning. Fig. 7 presents some highlights that are discussed later in this section. At the International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI) in 2018, approximately 70% of all accepted publications were related to the topic of deep learning. Given this fast pace of progress, we are not able to describe all relevant publications here. Hence, this overview is far from being complete. Still, we want to highlight some publications that are representative of the current developments in the field. In terms of structure and organization, we follow zhou2017deep here, but add recent developments in physical simulation and image reconstruction.

### 3.1 Image detection and recognition

Image detection and recognition deals with the problem of detecting a certain element in a medical image. In many cases, the images are volumetric. Therefore, efficient parsing is a must. A popular strategy to do so is marginal space learning zheng2014marginal, as it is efficient and allows organs to be detected robustly. Its deep learning counterpart ghesu2016marginal is even more efficient, as its probabilistic boosting trees are replaced by a neural network-based boosting cascade. Still, the entire volume has to be processed to detect anatomical structures reliably. The approach in ghesuDL drives efficiency even further by replacing the search process with an artificial agent that follows anatomy to detect anatomical landmarks using deep reinforcement learning. The method is able to detect hundreds of landmarks in a complete CT volume in a few seconds.

Bier et al. proposed an interesting method in which they detect anatomical landmarks in 2-D X-ray projection images bier2018miccai. In their method, they train projection-invariant feature descriptors from 3-D annotated landmarks using a deep network. Another popular method for detection is the so-called region proposal convolutional neural network. In akselrod2016region, it is applied to robustly detect tumors in mammographic images.

Detection and recognition are obviously also applied in many other modalities and a great body of literature exists. Here, we only report two more applications. In histology, cell detection and classification is an important task, which is tackled by Aubreville et al. using guided spatial transformer networks aubreville that allow refinement of the detection before the actual classification is done. The task of mitosis classification benefits from this procedure. Convolutional neural networks are also very effective for other image classification tasks. In aubreville2018IJCARS they are employed to automatically detect images containing motion artifacts in confocal laser-endoscopy images.

### 3.2 Image segmentation

Image segmentation has also benefited greatly from the recent developments in deep learning. In image segmentation, we aim to determine the outline of an organ or anatomical structure as accurately as possible. Again, approaches based on convolutional neural networks seem to dominate. Here, we only report Holger Roth’s Deeporgan roth2015deeporgan, the brain MR segmentation using CNN by Moeskops et al. moeskops2016automatic, a fully convolutional multi-energy 3-D U-net presented by Chen et al. univis91841629, and a U-net-based stent segmentation in X-ray projection domain by Breininger et al. Breininger2018 as representative examples. Obviously, segmentation using deep convolutional networks also works in 2-D, as shown by Nirschl et al. for histopathologic images nirschl2017deep.

Middleton et al. already experimented with the fusion of neural networks and active contour models in 2004, well before the advent of deep learning middleton2004segmentation. Yet, their approach uses neither deep nets nor end-to-end training, which would be desirable for a state-of-the-art method. Hence, revisiting traditional segmentation approaches and fusing them with deep learning in an end-to-end fashion seems a promising direction of research. Fu et al. follow a similar idea by mapping Frangi’s vesselness into a neural network ArXivWeilin. They demonstrate that they are able to adjust the convolution kernels in the first step of the algorithm towards the specific task of vessel segmentation in ophthalmic fundus imaging.

Yet another interesting class of segmentation algorithms is the use of recurrent networks for medical image segmentation. Poudel et al. demonstrate this for a recurrent fully convolutional neural network on multi-slice MRI cardiac data poudel2016recurrent, while Andermatt et al. show effectiveness of GRUs for brain segmentation andermatt2016multi.

### 3.3 Image registration

While the perceptual tasks of image detection and classification have been receiving a lot of attention with respect to applications of deep learning, image registration has not seen this large boost yet. However, there are several promising works found in the literature that clearly indicate that there are also a lot of opportunities.

One typical problem in point-based registration is to find good feature descriptors that allow correct identification of corresponding points. Wu et al. propose to do so using autoencoders to mine good features in an unsupervised way wu2016scalable. Schaffert et al. drive this even further and use the registration metric itself as loss function for learning good feature representations schaffert2018metric. Another option to solve 2-D/3-D registration problems is to estimate the 3-D pose directly from the 2-D point features miao2017convolutional.

For full volumetric registration, examples of deep learning-based approaches are also found. The quicksilver algorithm is able to model a deformable registration and uses a patch-wise prediction directly from the image appearance yang2017quicksilver. Another approach is to model the registration problem as a control problem that is dealt with using an agent and reinforcement learning. Liao et al. propose to do so for rigid registration predicting the next optimal movement in order to align both volumes liao2017artificial. This approach can also be applied to non-rigid registration using a statistical deformation model univis91731175. In this case, the actions are movements in the vector space of the deformation model. Obviously, agent-based approaches are also applicable for point-based registration problems. Zhong et al. demonstrate this for intra-operative brain shift using imitation learning univis91890067.

### 3.4 Computer-aided diagnosis

Computer-aided diagnosis is regarded as one of the most challenging problems in the field of medical image processing. Here, we are not only acting in a supportive role quantifying evidence towards the diagnosis. Instead, the diagnosis itself is to be predicted. Hence, decisions have to be made with utmost care and have to be reliable.

The analysis of chest radiographs comprises a significant amount of work for radiologists and is performed routinely. Hence, reliable support to prevent human error is highly desirable. An example of such support is given in diamant2017chest by Diamant et al. using transfer learning techniques.

A similar workload is imposed on ophthalmologists in the reading of volumetric optical coherence tomography data. Google’s DeepMind recently proposed to support this process in terms of referral decision support de2018clinically.

There are many other studies found in this line, for example, automatic cancer assessment in confocal laser endoscopy in different tissues of the head and neck aubreville2017epithelialcancer, deep learning for mammogram analysis carneiro2017deep, and classification of skin cancer esteva2017dermatologist.

### 3.5 Physical simulation

A new field of deep learning is the support of physical modelling. So far, this has been exploited in the gaming industry to build physics engines with realistic appearance wu2015galileo, or for real-time smoke simulation chu2017data.

Based on such observations, researchers started to bring such methods into the field of medical imaging. One example is the deep scatter estimation by Maier et al. maier2018deep. Unberath et al. drive this even further to emulate the complete X-ray formation process in their DeepDRR 10.1007/978-3-030-00937-3_12. In horger2018towards, Horger et al. demonstrate that even noise of unknown distributions can be learned, leading to an efficient generative noise model for realistic physical simulations.

Other physical processes have also been investigated using deep learning. In maier2018precision, a material decomposition is proposed that embeds prior physical operators into deep learning by means of precision learning. Physically less plausible interrelations are attempted as well. In han2017mr, Han et al. attempt to convert MR volumes to CT volumes. Stimpel et al. drive this even further, predicting X-ray projections from MR projection images univis91895709. While these observations seem promising, one has to follow such endeavors with care. Schiffers et al. demonstrate that cycleGANs may create correctly appearing fluorescence images from fundus images in ophthalmology schiffers2018cyclegan. Yet, undesired effects appear, as occasionally drusen are mapped onto microaneurysms in this process. Cohen et al. demonstrate even worse effects Cohen2018distribution. In their study, cancers disappeared or were created during the modality-to-modality mapping. Yet, they also demonstrate that GANs are quite effective for artificially increasing the amount of training data, which is commonly referred to as data augmentation.

### 3.6 Image Reconstruction

Also the field of medical image reconstruction has been affected by deep learning and was just recently the topic of a special issue in the IEEE Transactions on Medical Imaging. The editorial actually gives an excellent overview on the latest developments wang2018image that we will summarize in the next few lines.

One group of deep learning algorithms omits the actual problem of reconstruction and formulates the inverse as image-to-image transforms with different initialization techniques before processing with a neural network. Recent developments in this image-to-image reconstruction are summarized in mccann2017review. Still, there is continuous progress in the field, e.g. by application of the latest network architectures zhang2018sparse or cascading of U-nets kofler2018u.

A recent paper by Zhu et al. proposes to learn the entire reconstruction operation only from raw data and corresponding images zhu2018image. The basic idea is to model an autoencoder-like dimensionality reduction in raw data and reconstruction domain. Then both are linked using a non-linear correlation model. The entire model can then be converted into a single network and trained in an end-to-end manner. In the paper, they show that this is possible for 2-D MR and PET imaging and largely outperforms traditional approaches.

Learning operators in a completely data-driven manner carries the risk that undesired effects may occur huang2018considerations, as is shown in Fig. 8. Hence, integration of prior knowledge and the structure of the operators seems beneficial, as already described in the concept of precision learning in the previous section. Ye et al. embed a multi-scale transform into the encoder and decoder of a U-net-like network, which gives rise to the concept of deep convolutional framelets ye2018deep. Using wavelets for the multi-scale transform has been successfully applied in many applications ranging from denoising kang2018deep to sparse view computed tomography han2018framing.

If we design a neural network inspired by iterative algorithms that minimize an energy function step by step, the concept of variational networks is useful. Doing so allows us to map virtually all iterative reconstruction algorithms onto deep networks, e.g., by using a fixed number of iterations. There are several impressive works found in the literature, of which we only name the MRI reconstruction by Hammernik et al. hammernik2018learning and the sound speed reconstruction by Vishnevskiy et al. vishnevskiy2018image at this point. The concept can be expanded even further, as Adler et al. demonstrate by learning an entire primal-dual reconstruction adler2018learned.
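
The idea of unrolling an iterative algorithm into a fixed number of steps can be sketched as follows. The example is a simplification under our own assumptions: it minimizes a plain least-squares energy without regularizer, the system matrix `A` is a small toy matrix, and the per-iteration step sizes are set by hand. In a variational network, such step sizes and the filters of a regularizer would be trainable parameters; the function names are hypothetical.

```python
def matvec(A, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def energy(A, x, y):
    """Least-squares data term E(x) = 0.5 * ||A x - y||^2."""
    return 0.5 * sum((p - q) ** 2 for p, q in zip(matvec(A, x), y))


def unrolled_reconstruction(A, y, step_sizes):
    """Gradient descent unrolled into len(step_sizes) 'layers'.

    Each loop iteration corresponds to one layer of the network. The step
    sizes play the role of the parameters that would be learned.
    """
    At = transpose(A)
    x = [0.0] * len(A[0])
    for step in step_sizes:
        residual = [p - q for p, q in zip(matvec(A, x), y)]
        gradient = matvec(At, residual)      # gradient of the energy
        x = [xi - step * gi for xi, gi in zip(x, gradient)]
    return x


# Toy forward operator and noise-free measurement (illustrative values).
A = [[2.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
x_true = [1.0, -2.0]
y = matvec(A, x_true)

step_sizes = [0.2] * 20                      # 20 unrolled iterations
x_rec = unrolled_reconstruction(A, y, step_sizes)
```

Because the number of iterations is fixed, the whole procedure is a feed-forward computation and can be trained end-to-end like any other network.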

Würfl et al. also follow the idea of using prior operators deeplearningct; wurfl2018deep. Their network is inspired by the classical filtered back-projection that can be retrained to better approximate limited angle geometries that typically cannot be solved by classical analytic inversion models. Interestingly, as the approach is described in an end-to-end fashion, errors in the discretization or initialization of the filtering steps are intrinsically corrected by the learning process ISBIArchiveSyben. They also show that their method is compatible with other approaches, such as variational networks that are able to learn an additional de-streaking sparsifying transform hammernik2017dlct. Syben et al. drive these efforts even further and demonstrate that the concept of precision learning is able to mathematically derive a neural network structure syben2018deriving. In their work, they demonstrate that they are able to postulate that an expensive matrix inverse is a circulant matrix and hence can be replaced by a convolution operation. Doing so leads to the derivation of a previously unknown filtering, back-projection, re-projection-style rebinning algorithm that intrinsically suffers less from resolution loss than traditional interpolation-based rebinning methods.
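
The step from a circulant matrix to a convolution can be verified numerically. The sketch below is not the derivation of the cited work; it only illustrates the underlying linear-algebra fact with a small, arbitrarily chosen kernel. The inverse of a circulant matrix is again circulant, so applying it amounts to a circular convolution with an inverse kernel that is obtained in the Fourier domain. A naive discrete Fourier transform is used to keep the example self-contained; in practice one would use an FFT.

```python
import cmath


def circulant(first_column):
    """Dense circulant matrix with C[i][j] = c[(i - j) mod n]."""
    n = len(first_column)
    return [[first_column[(i - j) % n] for j in range(n)] for i in range(n)]


def matvec(matrix, x):
    return [sum(a * b for a, b in zip(row, x)) for row in matrix]


def circular_convolution(kernel, x):
    n = len(x)
    return [sum(kernel[(i - j) % n] * x[j] for j in range(n)) for i in range(n)]


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (an FFT would be used in practice)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [
        sum(x[k] * cmath.exp(sign * 2j * cmath.pi * i * k / n) for k in range(n))
        for i in range(n)
    ]
    return [v / n for v in out] if inverse else out


def inverse_kernel(first_column):
    """Kernel of the inverse circulant matrix; requires a non-zero spectrum."""
    spectrum = dft(first_column)
    return [v.real for v in dft([1.0 / s for s in spectrum], inverse=True)]


kernel = [4.0, 1.0, 0.0, 1.0]      # illustrative first column of C
x_true = [1.0, 2.0, 3.0, 4.0]

# Multiplying with the circulant matrix equals a circular convolution.
y_dense = matvec(circulant(kernel), x_true)
y_conv = circular_convolution(kernel, x_true)

# Applying the matrix inverse equals convolving with the inverse kernel.
x_rec = circular_convolution(inverse_kernel(kernel), y_dense)
```

The dense matrix needs n² entries, whereas the convolution only needs the n entries of the kernel, which is why such a structural assumption reduces the number of parameters to be learned considerably.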

As noted earlier, all networks are prone to adversarial attacks. Huang et al. demonstrate this huang2018considerations in their work, showing that incorrect noise modelling alone may already distort the entire image. Yet, the networks reconstruct visually pleasing results and artifacts cannot be as easily identified as in classical methods. One possible remedy is to follow the precision learning paradigm and fix as much of the network as possible, such that it can be analyzed with classical methods as demonstrated in wurfl2018deep. Another promising approach is Bayesian deep learning schlemper2018bayesianrecon. Here the network output is two-fold: the reconstructed image plus a confidence map on how accurate the content of the reconstructed image was actually measured.
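
A two-fold output of this kind can be illustrated with Monte Carlo sampling, which is one common way to estimate uncertainty and not necessarily the procedure of the cited work. In the sketch below, the stochastic network is replaced by a hypothetical stand-in that adds pixel-dependent Gaussian noise to its input. The per-pixel mean over many samples serves as the image and the per-pixel standard deviation as the uncertainty map. All names and values are assumptions for illustration.

```python
import random
import statistics


def stochastic_reconstruction(measurement, noise_scale, rng):
    """Hypothetical stand-in for one stochastic forward pass of a network."""
    return [m + rng.gauss(0.0, s) for m, s in zip(measurement, noise_scale)]


def reconstruct_with_uncertainty(measurement, noise_scale, n_samples=200, seed=0):
    """Return the per-pixel mean (image) and standard deviation (uncertainty)."""
    rng = random.Random(seed)
    samples = [
        stochastic_reconstruction(measurement, noise_scale, rng)
        for _ in range(n_samples)
    ]
    per_pixel = list(zip(*samples))          # group the samples by pixel
    image = [statistics.mean(p) for p in per_pixel]
    uncertainty = [statistics.stdev(p) for p in per_pixel]
    return image, uncertainty


measurement = [1.0, 2.0, 3.0, 4.0]
noise_scale = [0.01, 0.01, 1.0, 0.01]        # the third pixel is unreliable
image, uncertainty = reconstruct_with_uncertainty(measurement, noise_scale)
```

In this toy example, the uncertainty map is largest at the third pixel, which signals to the reader that the reconstructed value there should be trusted less than its neighbours.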

Obviously, deep learning also plays a role in suppression of artifacts. In zhang2018convolutional, Zhang et al. demonstrate this effectively for metal artifacts. As a last example, we list Bier et al. here, as they show that deep learning-based motion tracking is also feasible for motion compensated reconstruction bier2018detecting.

## 4 Discussion

In this introduction, we reviewed the latest developments in deep learning. In particular, detection, recognition, and segmentation tasks are well solved by the deep learning algorithms. Those tasks are clearly linked to perception and there is essentially no prior knowledge present. Hence, state-of-the-art architectures from other fields, such as computer vision, can often be easily adapted to medical tasks. In order to gain better understanding of the black box, reinforcement learning and modelling of artificial agents seem well suited.

In image registration, deep learning is not that broadly used. Yet, interesting approaches already exist that are able to either predict deformations directly from the image input, or take advantage of reinforcement learning-based techniques that model registration as an optimal control problem. Further benefits are obtained using deep networks for learning representations, which is done either in an unsupervised fashion or using the registration metric itself.

Computer-aided diagnosis is a hot topic that is addressed by many recent publications. We expect that simpler standard tasks that typically result in a high workload for medical doctors will be solved first. For more complex diagnoses, the current deep nets that immediately result in a decision are not that well suited, as it is difficult to understand the evidence. Hence, approaches are needed that link observations to evidence to construct a line of argument towards a decision. It is the strong belief of the authors that only if such evidence-based decision making is achieved, the new methodology will make a significant impact on computer-aided diagnosis.

Physical simulation can be accelerated dramatically with realistic outcomes as shown in the field of computer games and graphics. Therefore, the methods are highly relevant, in particular for interventional applications, in which real-time processing is mandatory. First approaches exist, yet there is considerable room for more new developments. In particular, precision learning and variational networks seem to be well suited for such tasks, as they provide some guarantees to prediction outcomes. Hence, we believe that there are many new developments to follow, in particular in radiation therapy and real-time interventional dose tracking.

Reconstruction based on data-driven methods yields impressive results. Yet, these methods may suffer from a “new kind” of deep learning artifacts. In particular, the work by Huang et al. huang2018considerations shows these effects in great detail. Both precision learning and Bayesian approaches seem well suited to tackle the problem in the future. Yet, it is unclear how to benefit best from the data-driven methods while maintaining intuitive and safe image reading.

A great advantage of all the deep learning methods is that they are inherently compatible with each other and with many classical approaches. This fusion will spark many new developments in the future. In particular, the fusion on network-level using either the direct connection of networks or precision learning allows end-to-end training of algorithms. The only requirement for this deep fusion is that each operation in the hybrid net has a gradient or sub-gradient for the optimization. In fact, there are already efforts to design whole programming languages to be compatible with this kind of differentiable programming li2018differentiable. With such integrated networks, multi-task learning is enabled, for example, training of networks that deliver optimal reconstruction quality and the best volumetric overlap of the resulting segmentation at the same time, as already conjectured in Wang2016DeepImaging. This point may even be expanded to computer-aided diagnosis or patient benefit.
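
The gradient requirement can be made tangible with a tiny hybrid pipeline. The sketch below chains a fixed, non-trainable operator (a forward difference, chosen by us as a stand-in for a known physical operator), trainable weights, and a ReLU, for which a sub-gradient is used at the kink. The gradient with respect to the weights follows from the chain rule and is compared against central finite differences. All names and numbers are illustrative assumptions.

```python
def known_operator(x):
    """Fixed forward difference; a stand-in for a known, non-trainable operator."""
    return [x[i + 1] - x[i] for i in range(len(x) - 1)]


def relu(values):
    return [max(0.0, v) for v in values]


def loss(w, x, target):
    """Hybrid pipeline: known operator -> trainable weights -> ReLU -> L2 loss."""
    z = known_operator(x)
    a = [wi * zi for wi, zi in zip(w, z)]
    out = relu(a)
    return 0.5 * sum((o - t) ** 2 for o, t in zip(out, target))


def analytic_gradient(w, x, target):
    """Back-propagate through every operation using the chain rule."""
    z = known_operator(x)
    a = [wi * zi for wi, zi in zip(w, z)]
    out = relu(a)
    d_out = [o - t for o, t in zip(out, target)]
    # Sub-gradient of the ReLU: 1 for positive inputs, 0 otherwise.
    d_a = [g if ai > 0 else 0.0 for g, ai in zip(d_out, a)]
    return [g * zi for g, zi in zip(d_a, z)]


def numerical_gradient(w, x, target, eps=1e-6):
    """Central finite differences as an independent check."""
    grads = []
    for i in range(len(w)):
        up = w[:i] + [w[i] + eps] + w[i + 1:]
        down = w[:i] + [w[i] - eps] + w[i + 1:]
        grads.append((loss(up, x, target) - loss(down, x, target)) / (2 * eps))
    return grads


x = [0.0, 1.0, 3.0, 2.0]
w = [0.5, 1.5, 2.0]                 # trainable weights
target = [1.0, 1.0, 1.0]

g_analytic = analytic_gradient(w, x, target)
g_numeric = numerical_gradient(w, x, target)
```

As long as every block in the chain provides such a (sub-)gradient, classical operators and learned layers can be mixed freely and the complete pipeline can be optimized jointly.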

Deep learning is extremely data hungry. This is one of the main limitations that the field is currently facing, and performance grows only logarithmically with the amount of data used googlepaper. Approaches like weakly supervised training oquab2015object will only partially be able to close this gap. Hence, one hospital or one group of researchers will not be able to gather a competitive amount of data in the near future. As such, we welcome initiatives such as the grand challenges (https://grand-challenge.org) or medical data donors (http://www.medicaldatadonors.org), and hope that they will be successful with their mission.

## 5 Conclusion

In this short introduction to deep learning in medical image processing, we were aiming at two objectives at the same time. On the one hand, we wanted to introduce the field of deep learning and the associated theory. On the other hand, we wanted to provide a general overview of the field and potential future applications. In particular, perceptual tasks have been studied most so far. However, with the set of tools presented here, we believe many more problems can be tackled. So far, many problems could be solved better than with the classical state of the art alone, which also sparked significant interest in the public media. Generally, safety and understanding of networks are still a large concern, but methods to deal with this are currently being developed. Hence, we believe that deep learning will probably remain an active research field for the coming years.

If you enjoyed this introduction, we recommend that you have a look at our video lecture that is available at https://www.video.uni-erlangen.de/course/id/662.

## Acknowledgements

We express our thanks to Katharina Breininger, Tobias Würfl, and Vincent Christlein, who did a tremendous job when we created the deep learning course at the University of Erlangen-Nuremberg. Furthermore, we would like to thank Florin Ghesu, Bastian Bier, Yixing Huang, and again Katharina Breininger for the permission to highlight their work and images in this introduction. Last but not least, we also express our gratitude to the participants of the course “Computational Medical Imaging” (https://www5.cs.fau.de/lectures/sarntal-2018/), who were essentially the test audience of this article during the summer school “Ferienakademie 2018”.