Global Pixel Transformers for Virtual Staining of Microscopy Images
Visualizing the details of different cellular structures is of great importance for elucidating cellular functions. However, it is challenging to obtain high-quality images of different structures directly due to complex cellular environments. Fluorescence staining is a popular technique for labeling different structures, but it has several drawbacks. In particular, staining is time consuming and may affect cell morphology, and the number of simultaneous labels is inherently limited. This raises the need for computational models that learn relationships between unlabeled microscopy images and labeled fluorescence images, and then infer fluorescence labels for other microscopy images without the physical staining process. We propose a novel deep model for virtual staining of unlabeled microscopy images. We first propose a novel network layer, known as the global pixel transformer layer, that effectively fuses global information from its inputs. The proposed global pixel transformer layer can generate outputs with arbitrary dimensions and can serve as a regular, down-sampling, or up-sampling operator. We then incorporate our proposed global pixel transformer layers and dense blocks to build a U-Net-like network. We believe such a design can promote feature reuse between layers. In addition, we propose a multi-scale input strategy to encourage the network to capture features at different scales. We conduct evaluations across various fluorescence image prediction tasks to demonstrate the effectiveness of our approach. Both quantitative and qualitative results show that our method outperforms the state-of-the-art approach significantly. We also show that our proposed global pixel transformer layer improves fluorescence image prediction results.
Capturing and visualizing the details of different sub-cellular structures is an important but challenging problem in cellular biology [1, 2]. Detailed information on the shapes and locations of cellular structures plays an important role in investigating cellular functions [3, 4, 5]. The widely used transmitted light microscopy can only provide low-contrast images, and it is difficult to study certain structures or functional characteristics from such images [6, 7]. One popular technique to overcome these limitations is fluorescence staining, which labels different structures with dyes or dye-conjugated antibodies. For example, cell nuclei can be labeled and visualized after being stained with DAPI [8, 9]. However, fluorescence staining is time consuming, especially when cell structures are complex. In addition, due to spectral overlap, there is a limit on the number of fluorescence labels that can be applied simultaneously to the same microscopy image [10, 11]. Furthermore, labeling may interfere with regular physiological processes in live cells, resulting in changes in cell morphology [2, 8]. These limitations raise the need for advanced methods to label cellular structures more effectively and efficiently.
With the rapid development of deep learning methods, recent studies [8, 9, 12] propose to formulate such problems as dense image prediction tasks using deep neural networks. In such a dense prediction task, we wish to predict whether each pixel of the input microscopy image belongs to a fluorescence label. Given microscopy images and corresponding fluorescence-stained images, the models are trained to capture the relationship between them. Then, for any newly obtained microscopy image, the fluorescence image can be predicted by the models based on the learned relationships. Such a virtual staining process allows us to obtain fluorescence labels from microscopy images without physical labeling.
A recent study proposes to use convolutional neural networks (CNNs) [14, 15, 16] for this task and obtains promising results for the prediction of fluorescence images. It stacks multiple convolutional layers to enlarge the receptive field and employs inception modules to facilitate training. However, only local operators, such as convolution, pooling, and deconvolution, are used in the model. Hence, global information cannot be captured effectively and efficiently, while such information may be important for determining certain fluorescence labels. Meanwhile, another work employs a vanilla U-Net framework for the prediction of fluorescence images. For each type of fluorescence label, it builds a separate model to learn the relationships between microscopy images and the corresponding fluorescence label. However, such a design learns different fluorescence types separately, thereby ignoring important relationships among different fluorescence labels. In addition, it only employs local operators, so global information cannot be effectively captured. Other studies on fluorescence image super-resolution, fluorescence image restoration, and image missing-modality prediction tasks [20, 21, 22, 23] employ similar network operators.
In this work, we propose a novel deep learning model, known as the global pixel transformers (GPTs), for virtual staining of microscopy images. As a departure from previous studies that employ only local operators, we develop a novel network layer, known as the global pixel transformer layer, to fuse global information efficiently and effectively. The global pixel transformer layer is inspired by attention operators [24, 25], and each position of its output fuses information from all input positions. In particular, our proposed layer can be flexibly generalized to produce outputs of any dimensions. We build a U-Net-like architecture based on our proposed global pixel transformer layer. We further incorporate dense blocks in our network to promote feature reuse between layers. To capture both global contextual and local subtle features, we propose a multi-scale input strategy in our model to incorporate information at different scales. Additionally, our model is designed in a multi-task manner to predict several target fluorescence labels simultaneously. We conduct extensive experiments to evaluate our proposed approach across various fluorescence label prediction tasks. Both quantitative and qualitative results show that our model outperforms the existing approach significantly. Our ablation analysis shows that the proposed global pixel transformer layer is useful for improving model performance.
II Background and Related Work
We describe the attention operator in this section. The inputs to an attention operator include three matrices: a query matrix $Q = [q_1, \ldots, q_m] \in \mathbb{R}^{d \times m}$ with query vectors $q_i$, a key matrix $K = [k_1, \ldots, k_n] \in \mathbb{R}^{d \times n}$ with key vectors $k_j$, and a value matrix $V = [v_1, \ldots, v_n] \in \mathbb{R}^{p \times n}$ with value vectors $v_j$. An attention operator computes the output at each position by performing a weighted sum over all value vectors in $V$, where the weights are obtained by attending the corresponding query vector to all key vectors in $K$. Formally, to compute the response at position $i$, the attention operator first computes the weight vector $w_i$ as

$$w_i = \mathrm{softmax}(K^T q_i) \in \mathbb{R}^n,$$

where $\mathrm{softmax}(\cdot)$ ensures that the elements of $w_i$ sum to 1. Each element of $w_i$ measures the importance of the corresponding vector in $V$ through the inner product between its key vector and $q_i$. The response $o_i$ at position $i$ is then computed by using the weight vector $w_i$ to perform a weighted sum over all vectors in $V$:

$$o_i = V w_i = \sum_{j=1}^{n} w_{ij} v_j.$$

In this way, the response at position $i$ fuses the global information in $V$ by assigning an importance to each value vector with reference to $q_i$. Following the same procedure for every position, the full output is

$$O = [o_1, \ldots, o_m] \in \mathbb{R}^{p \times m}.$$

We can rewrite the outputs of an attention operator compactly as

$$O = V \, \mathrm{softmax}(K^T Q),$$

where $\mathrm{softmax}(\cdot)$ here denotes a column-wise softmax that makes every column sum to 1. We can easily see that the number of vectors in the output matrix $O$ is determined by the number of vectors in the query matrix $Q$. In self-attention operators, the query, key, and value matrices are all derived from the same input. Thus, the response at a position is computed as a weighted average of the features at all positions, thereby fusing global information from the input feature maps. Note that a fully connected (FC) layer also fuses global information from the whole receptive field. However, the self-attention operator computes responses based on similarities between feature vectors at different positions, whereas an FC layer connects every pair of neurons with learnable weights. Moreover, a self-attention operator can handle inputs of variable sizes, while an FC layer requires the input size to be fixed.
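The attention operator above can be sketched in a few lines of NumPy. This is an illustrative re-implementation of the formula $O = V\,\mathrm{softmax}(K^T Q)$, not the authors' code; the matrix shapes are arbitrary examples.

```python
import numpy as np

def softmax_cols(x):
    """Column-wise softmax: each column of the result sums to 1."""
    e = np.exp(x - x.max(axis=0, keepdims=True))  # subtract max for numerical stability
    return e / e.sum(axis=0, keepdims=True)

def attention(Q, K, V):
    """Attention operator O = V * softmax(K^T Q).

    Q: (d, m) query matrix, one query vector per column.
    K: (d, n) key matrix.
    V: (p, n) value matrix.
    Returns O: (p, m); column i is a weighted sum of all value vectors,
    weighted by the similarity of query q_i to every key.
    """
    W = softmax_cols(K.T @ Q)   # (n, m) weights; each column sums to 1
    return V @ W                # (p, m)

rng = np.random.default_rng(0)
Q = rng.normal(size=(8, 5))    # 5 queries -> 5 output vectors
K = rng.normal(size=(8, 12))   # 12 key/value positions
V = rng.normal(size=(4, 12))
O = attention(Q, K, V)
print(O.shape)  # (4, 5): output length is set by the number of queries
```

Note how the output has as many columns as there are queries, independent of the number of key/value positions; this is the property the GPT layer exploits to change spatial sizes.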
III Global Pixel Transformers
In this section, we introduce a novel model for prediction of fluorescence images, known as the multi-scale global pixel transformers with dense blocks.
III-A Global Pixel Transformer Layer
Traditional deep learning models for dense prediction tasks contain several key operators, such as convolution, pooling, and deconvolution. These operators are all performed within a local neighborhood, restricting the capacity of networks to fuse global context information. To overcome this limitation, we propose a novel network layer, known as the global pixel transformer (GPT) layer, which is based on the attention operator and captures dependencies between each output position and all input positions, thereby fusing global information from the input feature maps. Unlike the self-attention operator, which generates outputs with the same dimensions as the inputs, our proposed GPT layer can generate output feature maps of arbitrary dimensions and can be employed as a regular, down-sampling, or up-sampling operator. Specifically, we investigate three types of global pixel transformer layers, namely the global down transformer (GDT) layer, the global up transformer (GUT) layer, and the global same transformer (GST) layer. The spatial sizes of the feature maps are halved by a GDT layer, doubled by a GUT layer, and kept the same by a GST layer.
Although the three types of global pixel transformer layers generate outputs of different sizes, they share a similar structure and computational pipeline. An illustration of our proposed GPT layer is provided in Figure 1. Let $X \in \mathbb{R}^{H \times W \times C}$ denote the input of the GPT layer. The first step is to compute the query tensor $Q$, key tensor $K$, and value tensor $V$ based on $X$. We employ a generator layer to obtain the query tensor, and two convolution layers to obtain the key and value tensors:

$$Q = \mathrm{Generator}(X), \quad K = \mathrm{Conv}(X), \quad V = \mathrm{Conv}(X),$$

where $\mathrm{Generator}(\cdot)$ denotes a query generator layer, and $\mathrm{Conv}(\cdot)$ denotes a convolution layer with stride 1. Hence, the spatial sizes of $K$ and $V$ are equal to those of $X$. The choice of the query generator depends on the type of global pixel transformer layer. For GDT layers, we employ a convolutional layer with stride 2 to generate $Q$, halving the spatial sizes. For GUT layers, we employ a deconvolutional layer with stride 2 to generate $Q$, doubling the spatial sizes. For GST layers, we employ a convolutional layer with stride 1 to generate $Q$, keeping the spatial sizes unchanged.
We then convert each of the third-order tensors into a matrix by unfolding along mode 3. In this way, the query tensor $Q \in \mathbb{R}^{H' \times W' \times d}$ is converted into a matrix $\tilde{Q} \in \mathbb{R}^{d \times H'W'}$. Similarly, $K$ is converted into a matrix $\tilde{K} \in \mathbb{R}^{d \times HW}$, and $V$ is converted into a matrix $\tilde{V} \in \mathbb{R}^{c \times HW}$. These three matrices serve as the query, key, and value matrices of the attention operator described in Section II. To ensure that the attention operator is valid, we set the query and key tensors to have the same number of feature maps $d$. The output of the attention operator is computed as

$$\tilde{O} = \tilde{V} \, \mathrm{softmax}(\tilde{K}^T \tilde{Q}).$$
Finally, the output matrix $\tilde{O}$ is converted back to a third-order tensor $O \in \mathbb{R}^{H' \times W' \times c}$, which serves as the output of the GPT layer. In this way, the feature at each position of the output tensor is computed as a weighted sum of all feature vectors in $V$, which is obtained directly from the input tensor $X$. Clearly, global information from the input features is captured and fused to generate the output of our GPT layers. In addition, the spatial sizes ($H' \times W'$) of the output feature maps are determined by the spatial sizes of the query tensor $Q$, while the number of output feature maps depends on the value tensor $V$. In principle, our proposed GPT layer can generate feature maps of arbitrary dimensions. In practice, the commonly used local operators either keep the spatial sizes of feature maps, double them for up-sampling, or halve them for down-sampling. Hence, in this work, we propose to substitute these local operators with the three types of global pixel transformer layers.
Traditional local operators, such as max pooling and convolution with a stride of 2, may also capture global information when the same operator is stacked many times. However, such stacking is not efficient. For example, to capture the global information in an $n \times n$ area, a stride-2 max pooling needs to be repeated on the order of $\log_2 n$ times. In contrast, our proposed GPT layers can capture the relationship between any two positions using only one layer. Therefore, our proposed method is more efficient and effective than traditional local operators.
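The pipeline above (query generator, mode-3 unfolding, attention, folding) can be sketched end to end. This is a minimal NumPy illustration, not the authors' implementation: the 1x1 convolutions are replaced by per-position linear projections, and the query generator is simulated by average pooling (GDT), nearest-neighbour upsampling (GUT), or the identity (GST); the learned strided convolution/deconvolution generators of the paper are stand-ins here.

```python
import numpy as np

def softmax_cols(x):
    """Column-wise softmax so every attention weight column sums to 1."""
    e = np.exp(x - x.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def gpt_layer(X, Wq, Wk, Wv, mode="same"):
    """Sketch of a global pixel transformer layer on input X of shape (H, W, C).

    The query generator sets the output spatial size: 2x average pooling for
    the down-sampling (GDT) variant, 2x nearest-neighbour upsampling for the
    up-sampling (GUT) variant, identity for the same-size (GST) variant.
    Wq, Wk (C x d) and Wv (C x c_out) stand in for the 1x1 convolutions.
    """
    H, W, C = X.shape
    if mode == "down":   # GDT: halve spatial sizes
        G = 0.25 * (X[0::2, 0::2] + X[1::2, 0::2] + X[0::2, 1::2] + X[1::2, 1::2])
    elif mode == "up":   # GUT: double spatial sizes
        G = X.repeat(2, axis=0).repeat(2, axis=1)
    else:                # GST: keep spatial sizes
        G = X
    Q = G @ Wq                            # (H', W', d)
    K = X @ Wk                            # (H, W, d)
    V = X @ Wv                            # (H, W, c_out)
    h, w, d = Q.shape
    Qm = Q.reshape(h * w, d).T            # mode-3 unfolding: (d, h*w)
    Km = K.reshape(H * W, d).T            # (d, H*W)
    Vm = V.reshape(H * W, -1).T           # (c_out, H*W)
    Om = Vm @ softmax_cols(Km.T @ Qm)     # attention: (c_out, h*w)
    return Om.T.reshape(h, w, -1)         # fold back to (h, w, c_out)

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 8, 6))
Wq, Wk, Wv = (rng.normal(size=(6, 4)) for _ in range(3))
print(gpt_layer(X, Wq, Wk, Wv, "down").shape)  # (4, 4, 4)
print(gpt_layer(X, Wq, Wk, Wv, "up").shape)    # (16, 16, 4)
print(gpt_layer(X, Wq, Wk, Wv, "same").shape)  # (8, 8, 4)
```

The three modes differ only in the query generator, which is exactly why the output spatial size is controlled by the query tensor while every output position still attends to all input positions.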
III-B Global Pixel Transformers
It is well known that encoder-decoder architectures like U-Nets have achieved state-of-the-art performance in various dense prediction tasks. However, these networks employ local operators such as convolution, pooling, and deconvolution, which cannot efficiently capture global information. Based on our GPT layer, we propose a novel network for dense prediction tasks, known as the global pixel transformers (GPTs).
In U-Nets, down-sampling layers are employed to reduce spatial sizes and obtain high-level features, while up-sampling layers are used to recover spatial dimensions. The commonly used convolution, pooling, and deconvolution operators are performed in local neighborhoods on feature maps. We propose to substitute these local operators with our proposed GPT layers. By setting different sizes for the query tensor $Q$, our proposed GPT layers can be employed for both down-sampling and up-sampling while using global information to build output features. Suppose an input feature map has a spatial size of $H \times W$. For the down-sampling operator, a GDT layer halves the spatial sizes of the input feature maps, which is achieved by setting the spatial sizes of the query tensor to $H/2 \times W/2$. For the up-sampling operator, the spatial sizes of feature maps are doubled by setting the spatial sizes of the query tensor to $2H \times 2W$ in a GUT layer. In addition, GST layers are employed to transmit information from the encoder to the decoder in the bottom block of U-Nets.
In addition, due to the multiple down-sampling and up-sampling operators in U-Nets, spatial information, such as the shapes and locations of cellular structures, is largely lost in the information flow. Since the decoder recovers the spatial sizes from high-level features, the prediction may not fully incorporate all spatial information, while such spatial information is important for dense prediction. Hence, we adopt the idea of building skip connections between the encoder and the decoder, as in U-Nets. Such connections are expected to enable the sharing of spatial information and high-level features between the encoder and decoder, and hence improve the performance of dense prediction.
|Condition||Cell Type||Label 1||Label 2||Label 3||# Training Images||# Test Images||Image Size||Laboratory|
|A||human motor neurons||DAPI (Wide Field)||TuJ1 (Wide Field)||Islet1 (Wide Field)||286||39||1900x2600||Rubin|
|B||human motor neurons||DAPI (Confocal)||MAP2 (Confocal)||NFH (Confocal)||273||52||4600x4600||Finkbeiner|
|C||primary rat cortical cultures||DAPI (Confocal)||DEAD (Confocal)||-||936||273||2400x2400||Finkbeiner|
|D||primary rat cortical cultures||DAPI (Confocal)||MAP2 (Confocal)||NFH (Confocal)||26||13||4600x4600||Finkbeiner|
|E||human breast cancer line||DAPI (Confocal)||CellMask (Confocal)||-||13||13||3500x3500||-|
III-C Global Pixel Transformers with Dense Blocks
To perform dense prediction on images, deep networks are usually required to extract high-level features. However, a known problem in training very deep CNNs is that the gradient flow may vanish as depth increases. Residual connections have been shown to be effective against this problem in popular networks such as ResNets and DenseNets. In ResNets, residual connections are employed in residual blocks to combine the non-linear transformation of the input with an identity mapping, sharing features at different levels. They benefit the convergence of very deep neural networks by providing a highway for gradients to back-propagate. Recently, the residual U-Net [29, 30] was proposed to inherit the benefits of both long-range skip connections and short-range residual connections, and it obtains more precise results on dense prediction tasks without increasing the number of parameters. Since DenseNets employ extreme residual connections, also known as dense connections, to build dense blocks and achieve state-of-the-art performance on image classification tasks, we follow a similar idea and use dense blocks in our proposed global pixel transformers.
The general structure of our model is shown in Figure 2. We combine the dense block and the GPT layer to better incorporate dense connections. For the encoder part, each dense block is followed by a GDT layer, since the dense block retains the spatial sizes of the input while the GDT layer performs down-sampling. The reduction of spatial sizes is compensated by the growth in feature map number generated by the dense block. Correspondingly, each GUT layer in the decoder is followed by a dense block, and the GUT layer recovers the spatial sizes and reduces the number of feature maps.
For each dense block in our model, residual connections connect every layer to all of its subsequent layers. A typical $L$-layer dense block can be defined as

$$x_l = H_l([x_0, x_1, \ldots, x_{l-1}]), \quad l = 1, \ldots, L,$$

where $x_0$ is the input to the dense block, $x_l$ is the output of the $l$-th layer, and $[\cdot]$ represents the concatenation operator. $H_l(\cdot)$ denotes a series of operators, including convolution, batch normalization (BN), ReLU activation, and dropout. Each layer in a dense block generates $k$ new feature maps, which are concatenated with all previously generated feature maps. Note that $k$ is also called the growth rate of the dense block. Hence, the output of the dense block contains information from both the input feature maps and the newly generated feature maps. A general illustration of our employed dense block is shown in Figure 2. Note that we add a convolution layer before the output to make the dense block more flexible, so that the number of output feature maps can be controlled. Intuitively, a dense block encourages feature reuse between layers. In addition, compared with traditional networks of the same capacity, it can significantly reduce the number of parameters, since each layer in a dense block only contributes $k$ new feature maps.
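The channel bookkeeping of a dense block can be made concrete with a short worked example. This sketch only tracks feature-map counts (the actual convolutions are omitted); the input width of 64 channels is an illustrative assumption, and the growth rate 16 matches the setting reported later in Section IV-B.

```python
def dense_block_channels(c_in, num_layers, growth_rate):
    """Channels seen by each layer of a dense block and the final concatenation.

    Layer l receives the block input plus all previously generated feature
    maps, and contributes `growth_rate` new maps; a trailing 1x1 convolution
    can then map the concatenation to any desired output width.
    """
    per_layer_inputs = [c_in + l * growth_rate for l in range(num_layers)]
    concat_out = c_in + num_layers * growth_rate
    return per_layer_inputs, concat_out

inputs, out = dense_block_channels(c_in=64, num_layers=4, growth_rate=16)
print(inputs)  # [64, 80, 96, 112]: each layer sees 16 more channels
print(out)     # 128: input channels plus 4 * 16 newly generated maps
```

This is why the parameter count stays small: every layer only produces `growth_rate` maps, however wide its concatenated input becomes.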
III-D Multi-Scale Input Strategy
One training strategy for dense prediction tasks is to feed the whole image as input and produce predictions for all input pixels. However, such a strategy requires excessive memory on the training hardware, and on modern hardware such as GPUs, memory is limited. This data feeding strategy therefore becomes inefficient for large inputs, which are common in biological image processing tasks. One common solution is to crop small patches from the original image and train the networks on these patches. To predict a whole image, an overlap-tile strategy can be used to allow seamless prediction. However, such a divide-and-conquer strategy imposes a natural constraint on networks: when predicting small patches, only the local information within these patches can be captured, while global information is ignored. Furthermore, subtle local details may be overlooked when the areas of interest are small relative to the patch size. To overcome these limitations, we propose a multi-scale input strategy that incorporates both global and local information into the prediction.
Assume the image patches for network training have size $s \times s$. Let $(c_x, c_y)$ denote the center of an $s \times s$ patch $I_1$ cropped for training. To incorporate global information, we crop another image $I_2$ of size $2s \times 2s$ with the same center to provide a larger receptive field. This image is re-scaled to $s \times s$ but contains more global information, which is particularly useful when the original patch contains pixels lying on incomplete edges. In addition, we crop a third image $I_3$ of size $s/2 \times s/2$ to capture subtle local information; it is also re-scaled to $s \times s$. Compared with $I_1$, small subtle areas are up-scaled in $I_3$, which encourages the network to capture important details. We then concatenate $I_1$, $I_2$, and $I_3$ along the channel dimension and use them as the network input. As the corresponding label of such an input, we use the target fluorescence image of $I_1$. Intuitively, we incorporate information at different scales to make predictions for one particular area. Notably, such an input strategy can be flexibly generalized to multiple levels to incorporate information at further scales. Our proposed multi-scale input strategy is illustrated in the left part of Figure 2.
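The cropping scheme above can be sketched as follows. This is an illustrative implementation under stated simplifications: nearest-neighbour resizing stands in for whatever interpolation the authors use, and the hypothetical helper `multi_scale_patch` assumes the center lies far enough from the image border that all three crops fit.

```python
import numpy as np

def downscale2(a):
    """Nearest-neighbour 2x downscale of a 2-D array (stand-in for resizing)."""
    return a[::2, ::2]

def upscale2(a):
    """Nearest-neighbour 2x upscale of a 2-D array."""
    return a.repeat(2, axis=0).repeat(2, axis=1)

def multi_scale_patch(img, cy, cx, s):
    """Crop three concentric patches around (cy, cx) and stack them as channels.

    I1: an s x s crop; I2: a 2s x 2s crop rescaled to s x s for global
    context; I3: an s/2 x s/2 crop rescaled to s x s for local detail.
    """
    h = s // 2
    i1 = img[cy - h:cy + h, cx - h:cx + h]
    i2 = downscale2(img[cy - s:cy + s, cx - s:cx + s])
    i3 = upscale2(img[cy - h // 2:cy + h // 2, cx - h // 2:cx + h // 2])
    return np.stack([i1, i2, i3], axis=-1)   # (s, s, 3)

img = np.arange(256 * 256, dtype=float).reshape(256, 256)
x = multi_scale_patch(img, 128, 128, 64)
print(x.shape)  # (64, 64, 3): three scales concatenated along channels
```

The prediction target remains the fluorescence image of the central `I1` crop; the two extra channels only supply context.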
|Stage||Layers||Spatial Sizes||Feature Maps|
|Multi-Scale Input||Input||128x128||3|
|Encoder||DB(2 layers) + GDT||64x64||64|
| ||DB(4 layers) + GDT||32x32||128|
| ||DB(8 layers) + GDT||16x16||256|
|Bottom Block||DB(8 layers) + GST||16x16||384|
|Decoder||GUT + DB(4 layers)||32x32||288|
| ||GUT + DB(2 layers)||64x64||165|
| ||GUT + DB(1 layer)||128x128||90|
| ||Cell Nuclei|| || || ||Cell Viability||Cell Type|
| ||Condition A||Condition B||Condition C||Condition D||Condition C||Condition A|
|Ours||0.948 ± 0.0027||0.896 ± 0.0019||0.944 ± 0.0033||0.915 ± 0.0031||0.859 ± 0.0022||0.860 ± 0.0026|
IV Experimental Studies
We use both quantitative and qualitative evaluations to demonstrate the effectiveness of our proposed model. The dataset used for evaluation and the experimental settings are presented in Sections IV-A and IV-B. We compare our experimental results with the existing approach in Section IV-C. Finally, we provide an ablation analysis in Section IV-D.
We use the dataset from the existing work. The dataset contains 2D high-resolution microscopy images from five different laboratories. Note that each set of such 2D microscopy images is originally a z-stack of transmitted-light images collected from one 3D biological sample. Specifically, the z-stack 2D images are collected from several planes at equidistant intervals along the z-axis of a 3D sample, with 13 2D images collected per sample. Thus, all 13 2D images from the same set share the same fluorescence image for each fluorescence label. Different laboratories obtained the microscopy images under different conditions using different methods. Two imaging modalities, namely confocal and wide field, are used during microscopy imaging. In addition, three different types of cells are collected by the laboratories: human motor neurons derived from induced pluripotent stem cells (iPSCs), primary rat cortical cultures, and a human breast cancer line. Detailed information on this dataset is given in Table I.
|Model||Condition A||Condition B||Condition C||Condition D|
|Multi-scale U-Nets with DBs||0.941||0.887||0.935||0.902|
|Multi-scale GPTs with DBs||0.948||0.896||0.944||0.915|
IV-B Experimental Setup
The architecture of our model is shown in Table II, which traces the feature maps through the information flow of our network. The growth rate of our dense blocks is set to 16. We employ three GDT layers with dense blocks in our encoder to perform down-sampling and extract high-level features. Correspondingly, there are three GUT layers with dense blocks in the decoder to recover the spatial sizes. For the bottom block connecting the encoder and the decoder, we employ one GST layer and one dense block. Note that the depths of the dense blocks differ across stages.
Training examples are obtained by randomly cropping from the raw images. Since we employ the multi-scale input strategy, we crop images at three different scales, namely 256x256, 128x128, and 64x64. The network predicts fluorescence maps of size 128x128. We train our proposed model across all training examples in a multi-task learning manner. Since there are eight fluorescence labels, our model learns eight tasks simultaneously to capture and refine common features across all training samples. Specifically, for each input image, our model generates eight fluorescence maps, each corresponding to one fluorescence label. In addition, for each pixel in the predicted maps, the network outputs a probability distribution over 256 pixel values. Cross-entropy loss is employed for network training. Note that there are at most three fluorescence labels available for a given input; the loss is calculated over the target labels only, while irrelevant labels are ignored. During training, we apply dropout with a rate of 0.5 in our dense blocks to avoid over-fitting. To optimize the model, we employ the Adam optimizer with a batch size of 4. During the prediction stage, test patches are cropped in a sliding-window fashion. We extract patches from test images with the same sizes as those in training (128x128) by sliding a window with a constant step size, which is set to 64 in our experiments. We then build predictions for the original test images from the predictions of the small patches.
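The sliding-window extraction at prediction time can be sketched as follows. This is an illustrative helper, not the authors' code; the border-clamping behaviour (pulling the last patch back so every pixel is covered) is our assumption.

```python
def sliding_window_origins(height, width, patch, step):
    """Top-left corners of patches covering an image, stepping by `step`.

    The last row/column of patches is clamped to the image border so that
    every pixel is covered even when the image size is not a multiple of
    the step size.
    """
    ys = list(range(0, max(height - patch, 0) + 1, step))
    xs = list(range(0, max(width - patch, 0) + 1, step))
    if ys[-1] != height - patch:
        ys.append(height - patch)
    if xs[-1] != width - patch:
        xs.append(width - patch)
    return [(y, x) for y in ys for x in xs]

# 128x128 patches with step 64, as in the experiments; image size from Table I.
origins = sliding_window_origins(1900, 2600, patch=128, step=64)
print(len(origins))
print(origins[-1])  # (1772, 2472): last patch flush with the image border
```

With step 64 and patch 128, adjacent patches overlap by half, so overlapping predictions can be blended when the full-size fluorescence image is assembled.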
|Task||Model||Bin 1||Bin 2||Bin 3||Bin 4||Bin 5||Bin 6||Bin 7||Bin 8||Bin 9||Bin 10||OA|
|Cell Nuclei-Condition A||Baseline||-||0.932||0.805||0.493||0.539||0.386||0.320||0.143||-||-||0.777|
|Cell Nuclei-Condition B||Baseline||-||0.716||0.937||0.339||0.480||0.200||0.250||0.200||1.000||-||0.818|
|Cell Nuclei-Condition C||Baseline||-||0.801||0.898||0.425||0.429||0.286||0.500||0.400||0.500||0.500||0.825|
|Cell Nuclei-Condition D||Baseline||-||0.902||0.498||0.529||0.478||0.559||0.652||0.556||0.333||0.333||0.635|
IV-C Comparison with the Baseline
We compare our approach with the existing model, as it achieves state-of-the-art performance on the dataset we use. To demonstrate the effectiveness of our proposed approach, we conduct comparisons with this baseline on three different tasks:
Prediction of Cell Nuclei: Given an image, the task is to predict the nuclei of live cells. The nuclei of live cells are labeled using DAPI in both confocal and wide-field modalities. Examples created under conditions A, B, C, and D have fluorescence labels for investigating cell nuclei.
Prediction of Cell Viability: Given an image, this task predicts dead cells with cell nuclei as visual background. Dead cells are labeled with propidium iodide (PI) in the confocal modality. These images are obtained under condition C.
Prediction of Cell Type: Given an image, this task predicts neurons with cell nuclei as visual background. Two other types of cells may also exist in the image, namely astrocytes and immature dividing cells. Neurons are labeled using TuJ1 under condition A.
We first compare our approach with the baseline method quantitatively, using Pearson's correlation values calculated for each task. Following the existing work, one million pixels are randomly sampled from all the test images of a task, and we collect the predicted values at these pixels. These predictions can be represented as a one-million-dimensional vector, and another one-million-dimensional vector is obtained from the ground truth at the same pixels. We then calculate Pearson's correlation between the two vectors as a measure of their similarity; higher correlation values indicate that the predictions are closer to the ground truth. The results are reported in Table III. Note that for both our method and the baseline approach, we repeat the calculation 30 times and report the mean and standard deviation. We observe that the proposed model significantly outperforms the baseline on all three tasks. These results indicate that the proposed model better captures the relationships between microscopy images and the corresponding fluorescence labels.
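The evaluation protocol above reduces to a short computation. This is a minimal sketch assuming both images are already aligned arrays; the sample count is reduced here for illustration.

```python
import numpy as np

def pearson_on_sampled_pixels(pred, truth, n_samples=1_000_000, seed=0):
    """Pearson correlation between predicted and ground-truth fluorescence
    values at randomly sampled pixel locations."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, pred.size, size=n_samples)  # random pixel indices
    p = pred.ravel()[idx]
    t = truth.ravel()[idx]
    return np.corrcoef(p, t)[0, 1]

# Synthetic check: a prediction that is the truth plus small noise
# should score close to 1.
truth = np.random.default_rng(1).random((512, 512))
pred = truth + 0.05 * np.random.default_rng(2).normal(size=truth.shape)
r = pearson_on_sampled_pixels(pred, truth, n_samples=10_000)
print(round(r, 3))  # close to 1 for a near-perfect prediction
```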
In addition, we compare the prediction results qualitatively. We present the prediction results for the cell nuclei task in Figure 3. Visual comparison of the areas in white boxes shows that our model makes more accurate predictions in many small regions, demonstrating its ability to capture detailed information. Furthermore, confusion matrices are reported for these images to visualize true versus predicted pixel values in each bin. The pixel values are normalized to [0, 1] and divided into 10 bins, such that the $i$-th bin contains the pixels with values in the range $[(i-1)/10, i/10)$. The overall accuracies (OAs) of the confusion matrices indicate how many pixels are classified into the same bin as the ground truth. As shown in Figure 4, our model predicts more accurate pixel values than the baseline model. Similarly, we report the prediction results and the corresponding confusion matrices for the dead cell task in Figure 5. The white boxes show that the baseline misclassifies dead cells as other labels, while our model makes correct predictions. We also show the results of the cell type task in Figure 6, where our model clearly achieves more accurate predictions on neurons. Finally, we report the prediction accuracies for the different bins and the overall accuracies in Table V. For all three tasks, we obtain more accurate predictions. Overall, both qualitative and quantitative results indicate that our model performs significantly better than the baseline approach.
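The binned confusion matrix and its overall accuracy can be computed as follows. This is an illustrative sketch assuming pixel values are already normalized to [0, 1]; the tiny toy arrays are for demonstration only.

```python
import numpy as np

def bin_confusion(pred, truth, n_bins=10):
    """Confusion matrix over binned pixel values in [0, 1].

    Bin i covers [i/n_bins, (i+1)/n_bins); the overall accuracy (OA) is the
    fraction of pixels whose predicted value falls in the same bin as the
    ground truth (the diagonal of the matrix).
    """
    pb = np.minimum((pred * n_bins).astype(int), n_bins - 1)   # clamp 1.0 into last bin
    tb = np.minimum((truth * n_bins).astype(int), n_bins - 1)
    cm = np.zeros((n_bins, n_bins), dtype=int)
    np.add.at(cm, (tb.ravel(), pb.ravel()), 1)                 # accumulate counts
    oa = np.trace(cm) / cm.sum()
    return cm, oa

truth = np.array([0.05, 0.15, 0.95, 0.31])
pred = np.array([0.07, 0.24, 0.99, 0.33])
cm, oa = bin_confusion(pred, truth)
print(oa)  # 0.75: three of the four pixels land in the matching bin
```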
IV-D Ablation Analysis
We conduct an ablation analysis on the cell nuclei prediction task to show the effectiveness of each proposed module. All models are trained under the same conditions and compared in fair settings. As shown in Table IV, when employing the multi-scale input strategy, even classic U-Nets achieve better results than the baseline approach. Adding dense blocks further improves performance, and the best performance is achieved by incorporating all of our proposed modules. These results indicate that each of our proposed modules contributes to improving predictive performance.
Visualizing cellular structures is important for understanding cellular functions. Fluorescence staining is a popular technique for this purpose but has key limitations. Here, we develop a novel deep learning model to directly predict fluorescence images from unlabeled microscopy images. To fuse global information efficiently and effectively, we propose a novel global pixel transformer layer and build a U-Net-like network by incorporating our proposed global pixel transformer layers and dense blocks. A novel multi-scale input strategy is also proposed to combine both global and local features for more accurate predictions. Experimental results on various fluorescence image prediction tasks indicate that our model outperforms the baseline model significantly. In addition, the ablation study shows that all of our proposed modules are effective in improving performance.
This work was supported by National Science Foundation [IIS-1633359, IIS-1615035, and DBI-1641223].
-  S. Koho, E. Fazeli, J. E. Eriksson, and P. E. Hänninen, “Image quality ranking method for microscopy,” Scientific reports, vol. 6, p. 28962, 2016.
-  Y. Jo, H. Cho, S. Y. Lee, G. Choi, G. Kim, H.-s. Min, and Y. Park, “Quantitative phase imaging and artificial intelligence: A review,” IEEE Journal of Selected Topics in Quantum Electronics, vol. 25, no. 1, pp. 1–14, 2019.
-  M. Held, M. H. Schmitz, B. Fischer, T. Walter, B. Neumann, M. H. Olma, M. Peter, J. Ellenberg, and D. W. Gerlich, “Cellcognition: time-resolved phenotype annotation in high-throughput live cell imaging,” Nature methods, vol. 7, no. 9, p. 747, 2010.
-  E. Glory and R. F. Murphy, “Automated subcellular location determination and high-throughput microscopy,” Developmental cell, vol. 12, no. 1, pp. 7–16, 2007.
-  K.-C. Chou and H.-B. Shen, “Cell-ploc: a package of web servers for predicting subcellular localization of proteins in various organisms,” Nature protocols, vol. 3, no. 2, p. 153, 2008.
-  M.-A. Bray, A. N. Fraser, T. P. Hasaka, and A. E. Carpenter, “Workflow and metrics for image quality control in large-scale high-content screens,” Journal of biomolecular screening, vol. 17, no. 2, pp. 266–274, 2012.
-  W. Buchser, M. Collins, T. Garyantes, R. Guha, S. Haney, V. Lemmon, Z. Li, and O. J. Trask, “Assay development guidelines for image-based high content screening, high content analysis and high content imaging,” 2014.
-  C. Ounkomol, S. Seshamani, M. M. Maleckar, F. Collman, and G. R. Johnson, “Label-free prediction of three-dimensional fluorescence images from transmitted-light microscopy,” Nature methods, vol. 15, no. 11, p. 917, 2018.
-  E. M. Christiansen, S. J. Yang, D. M. Ando, A. Javaherian, G. Skibinski, S. Lipnick, E. Mount, A. O’Neil, K. Shah, A. K. Lee et al., “In silico labeling: Predicting fluorescent labels in unlabeled images,” Cell, vol. 173, no. 3, pp. 792–803, 2018.
-  P. I. Bastiaens and A. Squire, “Fluorescence lifetime imaging microscopy: spatial resolution of biochemical processes in the cell,” Trends in cell biology, vol. 9, no. 2, pp. 48–52, 1999.
-  Q. Wang, J. Niemi, C.-M. Tan, L. You, and M. West, “Image segmentation and dynamic lineage analysis in single-cell fluorescence microscopy,” Cytometry Part A: The Journal of the International Society for Advancement of Cytometry, vol. 77, no. 1, pp. 101–110, 2010.
-  H. Yuan, L. Cai, Z. Wang, X. Hu, S. Zhang, and S. Ji, “Computational modeling of cellular structures using conditional deep generative networks,” Bioinformatics, vol. 35, no. 12, pp. 2141–2149, 2018.
-  Y. Rivenson, H. Wang, Z. Wei, K. de Haan, Y. Zhang, Y. Wu, H. Günaydın, J. E. Zuckerman, T. Chong, A. E. Sisk et al., “Virtual histological staining of unlabelled tissue-autofluorescence images via deep learning,” Nature biomedical engineering, vol. 3, no. 6, p. 466, 2019.
-  Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
-  K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
-  C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1–9.
-  H. Wang, Y. Rivenson, Y. Jin, Z. Wei, R. Gao, H. Günaydın, L. A. Bentolila, C. Kural, and A. Ozcan, “Deep learning enables cross-modality super-resolution in fluorescence microscopy,” Nature Methods, vol. 16, pp. 103–110, 2019.
-  M. Weigert, U. Schmidt, T. Boothe, A. Müller, A. Dibrov, A. Jain, B. Wilhelm, D. Schmidt, C. Broaddus, S. Culley et al., “Content-aware image restoration: pushing the limits of fluorescence microscopy,” Nature methods, vol. 15, no. 12, p. 1090, 2018.
-  L. Cai, Z. Wang, H. Gao, D. Shen, and S. Ji, “Deep adversarial learning for multi-modality missing data completion,” in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2018, pp. 1158–1166.
-  W. Zhang, R. Li, H. Deng, L. Wang, W. Lin, S. Ji, and D. Shen, “Deep convolutional neural networks for multi-modality isointense infant brain image segmentation,” NeuroImage, vol. 108, pp. 214–224, 2015.
-  R. Li, W. Zhang, H.-I. Suk, L. Wang, J. Li, D. Shen, and S. Ji, “Deep learning based imaging data completion for improved brain disease diagnosis,” in Proceedings of the 17th International Conference on Medical Image Computing and Computer Assisted Intervention, 2014, pp. 305–312.
-  Y. Chen, H. Gao, L. Cai, M. Shi, D. Shen, and S. Ji, “Voxel deconvolutional networks for 3D brain image labeling,” in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2018, pp. 1226–1234.
-  A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, 2017, pp. 6000–6010.
-  Z. Wang, N. Zou, D. Shen, and S. Ji, “Global deep learning methods for multimodality isointense infant brain image segmentation,” arXiv preprint arXiv:1812.04103, 2018.
-  T. G. Kolda and B. W. Bader, “Tensor decompositions and applications,” SIAM review, vol. 51, no. 3, pp. 455–500, 2009.
-  O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention. Springer, 2015, pp. 234–241.
-  G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 4700–4708.
-  T. M. Quan, D. G. Hildebrand, and W.-K. Jeong, “Fusionnet: A deep fully residual convolutional neural network for image segmentation in connectomics,” arXiv preprint arXiv:1612.05360, 2016.
-  A. Fakhry, T. Zeng, and S. Ji, “Residual deconvolutional networks for brain electron microscopy image segmentation,” IEEE transactions on medical imaging, vol. 36, no. 2, pp. 447–456, 2017.
-  S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in International Conference on Machine Learning, 2015, pp. 448–456.
-  N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
-  D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.