Diagnosis of Celiac Disease and Environmental Enteropathy on Biopsy Images Using Color Balancing on Convolutional Neural Networks
Celiac Disease (CD) and Environmental Enteropathy (EE) are common causes of malnutrition and adversely affect normal childhood development. CD is an autoimmune disorder that is prevalent worldwide and is caused by an increased sensitivity to gluten. Gluten exposure damages the small intestinal epithelial barrier, resulting in nutrient malabsorption and childhood under-nutrition. EE also results in barrier dysfunction but is thought to be caused by an increased vulnerability to infections. EE has been implicated as the predominant cause of under-nutrition, oral vaccine failure, and impaired cognitive development in low- and middle-income countries. Both conditions require a tissue biopsy for diagnosis, and a major challenge in interpreting clinical biopsy images to differentiate between these gastrointestinal diseases is the striking histopathologic overlap between them. In the current study, we propose a convolutional neural network (CNN) to classify duodenal biopsy images from subjects with CD, subjects with EE, and healthy controls. We evaluated the performance of the proposed model using a large cohort containing 1000 biopsy images. Our evaluations show that the proposed model achieves an area under the ROC curve of 0.99, 1.00, and 0.97 for CD, EE, and healthy controls, respectively. These results demonstrate the discriminative power of the proposed model in duodenal biopsy classification.
1 Introduction and Related Works
Under-nutrition is the underlying cause of approximately % of the million under -year-old childhood deaths annually in low- and middle-income countries (LMICs) [WHO.Children] and is a major cause of mortality in this population. Linear growth failure (or stunting) is a major complication of under-nutrition and is associated with irreversible physical and cognitive deficits, with profound developmental implications [syed2016environmental]. A common cause of stunting in LMICs is EE, for which there are no universally accepted, clear diagnostic algorithms or non-invasive biomarkers for accurate diagnosis [syed2016environmental], making this a critical priority [naylor2015environmental]. EE has been described as being caused by chronic exposure to enteropathogens, which results in a vicious cycle of constant mucosal inflammation, villous blunting, and a damaged epithelium [syed2016environmental]. These deficiencies contribute to markedly reduced nutrient absorption and thus to under-nutrition and stunting [syed2016environmental]. Interestingly, CD, a common cause of stunting in the United States with an estimated % prevalence, is an autoimmune disorder caused by gluten sensitivity [husby2012european] and shares many histological features with EE (such as increased inflammatory cells and villous blunting) [syed2016environmental]. This resemblance makes it a major challenge to differentiate clinical biopsy images of these similar but distinct diseases. Therefore, there is major clinical interest in developing new, innovative methods to automate and enhance the detection of morphological features of EE versus CD, and to differentiate between diseased and healthy small intestinal tissue [bejnordi2017diagnostic].
In this paper, we propose a CNN-based model for classification of biopsy images. In recent years, Deep Learning architectures have received great attention after achieving state-of-the-art results in a wide variety of fundamental tasks such as classification [Heidarysafa2018RMDL, kowsari2017hdltex, kowsari2018rmdl, info10040150, litjens2017survey, nobles2018identification, zhai2016doubly] and in other medical domains [hegde2019comparison, zhang2018patient2vec]. CNNs in particular have proven to be very effective in medical image processing. CNNs preserve local image relations while reducing dimensionality, and for this reason they are the most popular machine learning algorithm in image recognition and visual learning tasks [ker2018deep]. CNNs have been widely used for classification and segmentation in various types of medical applications such as histopathological images of breast tissues, lung images, MRI images, medical X-Ray images, etc. [gulshan2016development, litjens2017survey]. Researchers have produced advanced results on duodenal biopsy classification using CNNs [Mohammad_al_boni], but those models are only robust to a single type of image stain or color distribution. Many researchers apply a stain normalization technique to both the training and validation datasets as part of the image pre-processing stage [nawaz2018classification]. In this paper, varying levels of color balancing were applied during image pre-processing in order to account for multiple stain variations.
The rest of this paper is organized as follows: Section 2 describes the data sets used in this work, and Section 3 covers the required pre-processing steps. The architecture of the model is explained in Section 4. Empirical results are elaborated in Section 5. Finally, Section 6 concludes the paper and outlines future directions.
2 Data Source
For this project, Hematoxylin and Eosin (H&E) stained duodenal biopsy glass slides were retrieved from patients. The slides were converted into whole slide images, and labeled as either EE, CD, or normal. The biopsy slides for EE patients were from the Aga Khan University Hospital (AKUH) in Karachi, Pakistan ( slides from patients) and the University of Zambia Medical Center in Lusaka, Zambia (). The slides for CD patients () and normal () were retrieved from archives at the University of Virginia (UVa). The CD and normal slides were converted into whole slide images at x magnification using the Leica SCN slide scanner (Meyer Instruments, Houston, TX) at UVa, and the digitized EE slides were of 20x magnification and shared via the Environmental Enteric Dysfunction Biopsy Investigators (EEDBI) Consortium shared WUPAX server. Characteristics of our patient population are as follows: the median (, ) age of our entire study population was (, ) months, and we had a roughly equal distribution of males (%, ) and females (%, ). The majority of our study population were histologically normal controls , followed by CD patients , and EE patients .
3 Pre-Processing
In this section, we cover all of the pre-processing steps, which include image patching, image clustering, and color balancing. The biopsy images are unstructured (varying image sizes) and too large to process with deep neural networks, so each image must be split into multiple smaller images. After the split, some of the resulting images do not contain much useful information; for instance, some contain only the mostly blank border region of the original image. The image clustering section describes the process used to select useful images. Finally, color balancing is used to correct for varying color stains, which is a common issue in histological image processing.
3.1 Image Patching
Although the effectiveness of CNNs in image classification has been shown in various studies in different domains, training on high-resolution Whole Slide Tissue Images (WSI) is not commonly preferred because of its high computational cost. Moreover, applying CNNs to WSIs results in the loss of a large amount of discriminative information because of the extensive downsampling required [hou2016patch]. Because Celiac Disease, Environmental Enteropathy, and normal cases differ at the cellular level, a classifier trained on image patches is likely to perform as well as or even better than a classifier trained at the WSI level. Many researchers in pathology image analysis have considered classification or feature extraction on image patches [hou2016patch]. In this project, after generating patches from each image, each patch was labeled according to its associated original image. A CNN was trained to generate predictions on each individual patch.
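The patching and labeling step described above can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the function names, the patch size, and the choice to skip partial patches at the border are assumptions, and the image is a plain nested list so that the example has no dependencies.

```python
def split_into_patches(image, patch_size, stride=None):
    """Split a 2-D image (list of rows) into square patches.

    Patches that would run past the image border are skipped, so every
    returned patch is exactly patch_size x patch_size.
    """
    if stride is None:
        stride = patch_size  # non-overlapping patches (an assumption)
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - patch_size + 1, stride):
        for left in range(0, width - patch_size + 1, stride):
            patch = [row[left:left + patch_size]
                     for row in image[top:top + patch_size]]
            patches.append(((top, left), patch))
    return patches


def label_patches(patches, slide_label):
    """Give every patch the label of the slide it was cut from."""
    return [(patch, slide_label) for _, patch in patches]


# Toy 4x6 "slide" whose pixel values are running indices.
slide = [[r * 6 + c for c in range(6)] for r in range(4)]
patches = split_into_patches(slide, patch_size=2)
labeled = label_patches(patches, "CD")
print(len(patches))   # 6 patches: 2 rows of patches x 3 columns of patches
print(patches[0])     # ((0, 0), [[0, 1], [6, 7]])
```

Because every patch inherits the slide-level label, background-only patches end up with a disease label too, which is why a filtering step is needed afterwards.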
3.2 Image Clustering
After image patching, some of the created patches do not contain any useful information about the biopsies and should be removed from the data. These patches were created mostly from background regions of the WSIs. A two-step clustering process was applied to identify the unimportant patches. In the first step, a convolutional autoencoder was used to learn embedded features of each patch; in the second step, we used k-means to cluster the embedded features into two clusters: useful and not useful. Figure 2 shows the pipeline of our clustering technique, which contains both the autoencoder and k-means clustering.
An autoencoder is a type of neural network that is trained to reproduce its inputs at its outputs [goodfellow2016deep]. Autoencoders have achieved great success as a dimensionality reduction method owing to the strong representational power of neural networks [wang2014generalized]. The first version of the autoencoder was introduced by D. E. Rumelhart et al. [rumelhart1985learning] in 1985. The main idea is that a hidden layer between the input and output layers has far fewer units [liang2017text] and can therefore be used to reduce the dimensions of a feature space. For medical images, which typically contain many features, an autoencoder allows faster, more efficient data processing.
A CNN-based autoencoder can be divided into two main steps [masci2011stacked] : encoding and decoding.
Where is a convolutional filter, with convolution among an input volume defined by which it learns to represent the input by combining non-linear functions:
|Total||Cluster 1||Cluster 2|
|Celiac Disease (CD)|
|Environmental Enteropathy (EE)|
where is the bias, and the number of zeros we want to pad the input with is such that: dim(I) = dim(decode(encode(I))) Finally, the encoding convolution is equal to:
The decoding convolution step produces feature maps . The reconstructed result is the convolution between the volume of feature maps and this volume of convolutional filters [chen2015page, geng2015high].
Where Equation 5 shows the decoding convolution with dimensions. The input’s dimensions are equal to the output’s dimensions.
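The requirement that the reconstruction has the same dimensions as the input constrains how much zero padding each convolution needs. The sketch below shows the standard output-size arithmetic for a convolution; it assumes stride 1 and an odd kernel size, neither of which is stated in the text, and the function names are illustrative.

```python
def conv_output_size(input_size, kernel_size, padding, stride=1):
    """Spatial size of a convolution output along one dimension."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


def same_padding(kernel_size):
    """Zeros to pad on each side so a stride-1 convolution keeps the size."""
    if kernel_size % 2 == 0:
        raise ValueError("this sketch only handles odd kernel sizes")
    return (kernel_size - 1) // 2


# With 'same' padding, an encode convolution followed by a decode
# convolution returns to the original size: dim(I) == dim(decode(encode(I))).
size = 28
encoded = conv_output_size(size, kernel_size=3, padding=same_padding(3))
decoded = conv_output_size(encoded, kernel_size=3, padding=same_padding(3))
print(encoded, decoded)  # 28 28
```

Without padding, each 3x3 convolution would shrink the map by two pixels per dimension, so the decoded output could not match the input.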
The results of patch clustering are summarized in Table 1. Patches in cluster , which were deemed useful, are used for the analysis in this paper.
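The second step of the clustering pipeline, k-means with two clusters on the embedded features, can be sketched in plain Python. This is an illustrative implementation under stated assumptions: the embeddings are toy 2-D points rather than autoencoder outputs, and the deterministic farthest-point initialization is a choice made here for reproducibility, not something the paper specifies.

```python
import math


def kmeans(points, k=2, iterations=50):
    """Cluster points (tuples of floats) into k groups with Lloyd's algorithm."""
    # Deterministic init: first point, then the point farthest from
    # all centroids chosen so far.
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(
            points, key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(list(farthest))

    assignments = None
    for _ in range(iterations):
        # Assignment step: nearest centroid for every point.
        new_assignments = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:
                centroids[j] = [sum(dim) / len(members)
                                for dim in zip(*members)]
    return assignments, centroids


# Hypothetical 2-D embeddings: three near the origin, three far away.
embeddings = [(0.1, 0.2), (0.0, 0.1), (0.2, 0.0),
              (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
assignments, centroids = kmeans(embeddings, k=2)
print(assignments)  # [0, 0, 0, 1, 1, 1]
```

k-means only returns cluster indices; deciding which of the two clusters is the "useful" one still requires inspecting sample patches from each.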
3.3 Color Balancing
The concept of color balancing for this paper is to convert all images to the same color space to account for variations in H&E staining. The images can be represented with the illuminant spectral power distribution , the surface spectral reflectance , and the sensor spectral sensitivities [bianco2017improving, bianco2014error]. Using this notation [bianco2014error], the sensor responses at the pixel with coordinates can be thus described as:
where is the wavelength range of the visible light spectrum, ρ and are three-component vectors.
where is raw images from biopsy and is results for CNN input. In the following, a more compact version of Equation 7 is used:
where is the exposure compensation gain, refers to the diagonal matrix for the illuminant compensation, and indicates the color matrix transformation.
Figure 4 shows the results of color balancing for three classes (Celiac Disease (CD), Normal and Environmental Enteropathy (EE)) with different color balancing percentages between and .
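To make the three ingredients named above concrete (a scalar exposure gain, a diagonal illuminant-compensation matrix, and a 3x3 color matrix), the following sketch applies them to a single RGB pixel. It is only a linear illustration under assumptions: the parameter names and all numeric values are made up, any non-linear step in the original equations is omitted, and how the paper's color balancing percentages map onto these quantities is not modeled.

```python
def matvec(matrix, vector):
    """Multiply a 3x3 matrix (list of rows) by a 3-vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def color_balance_pixel(rgb, gain, illuminant_gains, color_matrix):
    """Apply gain * color_matrix * diag(illuminant_gains) to one pixel.

    rgb values are assumed to lie in [0, 1]; the result is clipped
    back into that range.
    """
    # A diagonal matrix times a vector is an element-wise scaling.
    compensated = [g * c for g, c in zip(illuminant_gains, rgb)]
    transformed = matvec(color_matrix, compensated)
    return [min(1.0, max(0.0, gain * c)) for c in transformed]


identity = [[1.0, 0.0, 0.0],
            [0.0, 1.0, 0.0],
            [0.0, 0.0, 1.0]]

# Hypothetical pixel with a color cast; per-channel gains neutralize it.
pixel = (0.4, 0.8, 0.2)
balanced = color_balance_pixel(pixel, gain=1.0,
                               illuminant_gains=(1.0, 0.5, 2.0),
                               color_matrix=identity)
print(balanced)  # [0.4, 0.4, 0.4]
```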
4 Method
In this section, we describe Convolutional Neural Networks (CNN), including the convolutional layers, pooling layers, activation functions, and optimizer. Then, we discuss our network architecture for diagnosis of Celiac Disease and Environmental Enteropathy. As shown in Figure 5, the input layer takes image patches () and is connected to the convolutional layer (Conv ). Conv is connected to the pooling layer (MaxPooling), which is then connected to Conv . Finally, the last convolutional layer (Conv ) is flattened and connected to a fully connected perceptron layer. The output layer contains three nodes, each of which represents one class.
4.1 Convolutional Layer
A CNN is a deep learning architecture that can be employed for hierarchical image classification. Originally, CNNs were built for image processing with an architecture similar to the visual cortex. CNNs have been used effectively for medical image processing. In a basic CNN used for image processing, an image tensor is convolved with a set of kernels of size . The outputs of these convolutions are called feature maps and can be stacked to provide multiple filters on the input. The element (feature) of input and output matrices can be different [li2014medical]. The process to compute a single output matrix is defined as follows:
Each input matrix is convolved with a corresponding kernel matrix , and summed with a bias value at each element. Finally, a non-linear activation function (See Section 4.3) is applied to each element [li2014medical].
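The computation of a single output matrix described above (convolve each input matrix with its kernel, sum the results, add a bias, apply the activation) can be sketched in plain Python. The names are illustrative, ReLU is used as the activation for the example, and, as in most deep learning libraries, the "convolution" is implemented as cross-correlation without flipping the kernel.

```python
def relu(x):
    return max(0.0, x)


def conv2d_valid(image, kernel):
    """Slide the kernel over the image without padding (cross-correlation)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + u][j + v] * kernel[u][v]
                 for u in range(kh) for v in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]


def feature_map(channels, kernels, bias, activation=relu):
    """One output matrix: activation(sum over inputs of conv + bias)."""
    maps = [conv2d_valid(c, k) for c, k in zip(channels, kernels)]
    rows, cols = len(maps[0]), len(maps[0][0])
    return [[activation(sum(m[i][j] for m in maps) + bias)
             for j in range(cols)]
            for i in range(rows)]


# Two toy 3x3 input channels, each with its own 2x2 kernel.
channel_a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
channel_b = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
kernel_a = [[1, 0], [0, -1]]   # responds with -4 everywhere on channel_a
kernel_b = [[1, 1], [1, 1]]    # responds with +4 everywhere on channel_b
output = feature_map([channel_a, channel_b], [kernel_a, kernel_b], bias=0.5)
print(output)  # [[0.5, 0.5], [0.5, 0.5]]
```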
In general, during the back propagation step of a CNN, the weights and biases are adjusted to create effective feature detection filters . The filters in the convolution layer are applied across all three ’channels’ or (size of the color space) [Heidarysafa2018RMDL].
4.2 Pooling Layer
To reduce the computational complexity, CNNs utilize the concept of pooling to reduce the size of the output from one layer to the next in the network. Different pooling techniques are used to reduce outputs while preserving important features [scherer2010evaluation]. The most common pooling method is max pooling where the maximum element is selected in the pooling window.
In order to feed the pooled output from the stacked feature maps to the next layer, the maps are flattened into one column. The final layers in a CNN are typically fully connected [kowsari2018rmdl].
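Max pooling and the subsequent flattening can be illustrated with a short sketch. The pool size of 2 and the function names are assumptions chosen for the example; rows or columns that do not fill a complete pooling window are dropped.

```python
def max_pool2d(feature_map, pool_size=2):
    """Keep the maximum of each non-overlapping pool_size x pool_size window."""
    out_h = len(feature_map) // pool_size
    out_w = len(feature_map[0]) // pool_size
    return [[max(feature_map[i * pool_size + u][j * pool_size + v]
                 for u in range(pool_size) for v in range(pool_size))
             for j in range(out_w)]
            for i in range(out_h)]


def flatten(feature_maps):
    """Concatenate a stack of 2-D maps into one flat vector."""
    return [value for fmap in feature_maps for row in fmap for value in row]


fmap = [[1, 3, 2, 0],
        [4, 2, 1, 5],
        [0, 1, 9, 8],
        [6, 2, 3, 7]]
pooled = max_pool2d(fmap)
print(pooled)                     # [[4, 5], [6, 9]]
print(flatten([pooled, pooled]))  # [4, 5, 6, 9, 4, 5, 6, 9]
```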
4.3 Neuron Activation
The CNN is a discriminatively trained model that uses the standard back-propagation algorithm with a sigmoid (Equation 10) or Rectified Linear Unit (ReLU) [nair2010rectified] (Equation 11) as the activation function. The output layer for multi-class classification uses a Softmax function (as shown in Equation 12).
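For reference, the standard textbook definitions of the activation functions named in this subsection can be written as follows. The softmax subtracts the largest logit before exponentiating, a common numerical-stability choice made here rather than something stated in the paper.

```python
import math


def sigmoid(x):
    """Logistic function: 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def relu(x):
    """Rectified Linear Unit: max(0, x)."""
    return max(0.0, x)


def softmax(logits):
    """Turn a list of scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


print(sigmoid(0.0))   # 0.5
print(relu(-2.0))     # 0.0
probabilities = softmax([2.0, 1.0, 0.1])  # e.g. scores for three classes
```

With three output nodes, the index of the largest softmax probability is taken as the predicted class.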
4.4 Optimizer
For this CNN architecture, we use the Adam optimizer [kingma2014adam], a stochastic gradient optimizer that uses only the averages of the first two moments of the gradient ( and , shown in Equations 13, 14, 15, and 16). It can handle non-stationarity of the objective function as in RMSProp, while overcoming the sparse gradient limitation of RMSProp [kingma2014adam].
where is the first moment and indicates second moment that both are estimated. and .
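The Adam update can be sketched for a single scalar parameter. The update rule follows the published algorithm [kingma2014adam]; the hyperparameter defaults shown are the commonly used ones and are an assumption here, since the values used in this work are not recoverable from the text. The quadratic objective is a toy example.

```python
import math


def adam_step(theta, grad, m, v, t,
              lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter at time step t (t >= 1)."""
    m = beta1 * m + (1.0 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)                # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


def minimize_quadratic(steps, lr=0.05, start=0.0):
    """Minimize f(theta) = (theta - 3)^2 with Adam, starting from `start`."""
    theta, m, v = start, 0.0, 0.0
    for t in range(1, steps + 1):
        grad = 2.0 * (theta - 3.0)
        theta, m, v = adam_step(theta, grad, m, v, t, lr=lr)
    return theta


# After the bias correction, the very first step has magnitude close to lr,
# regardless of how large the gradient is.
first = minimize_quadratic(steps=1)
final = minimize_quadratic(steps=2000)
```

The normalization by the square root of the second moment is what makes the step size insensitive to the gradient scale, which is the property contrasted with RMSProp above.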
4.5 Network Architecture
As shown in Table 2 and Figure 6, our CNN architecture consists of three convolutional layers, each followed by a pooling layer. The model receives RGB image patches with dimensions of as input. The first convolutional layer has filters with a kernel size of . It is followed by a pooling layer with a size of , which reduces the feature maps from to . The second convolutional layer has filters with a kernel size of . A pooling layer (MaxPooling ) with a size of then reduces the feature maps from to . The third convolutional layer has filters with a kernel size of , and the final pooling layer (MaxPooling ) scales the feature maps down to . The feature maps, as shown in Table 2, are flattened and connected to a fully connected layer with nodes. The output layer has three nodes to represent the three classes: Environmental Enteropathy, Celiac Disease, and Normal.
The optimizer used is Adam (see Section 4.4) with a learning rate of , , and the loss is sparse categorical cross-entropy [chollet2015keras]. All layers use the Rectified Linear Unit (ReLU) as the activation function, except the output layer, which uses Softmax (see Section 4.3).
|Layer (type)||Output Shape||
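The way feature-map shapes evolve through three convolution and pooling stages can be traced with a few lines of Python. Every number below is hypothetical: the input size, filter counts, and pool size are placeholders chosen for illustration, and the sketch assumes convolutions that preserve height and width ('same' padding), which the text suggests but does not state.

```python
def trace_shapes(input_shape, conv_filters, pool_size=2):
    """List (layer name, output shape) for conv + max-pool stages."""
    height, width, channels = input_shape
    shapes = [("input", (height, width, channels))]
    for index, filters in enumerate(conv_filters, start=1):
        channels = filters  # assumed 'same' padding keeps height and width
        shapes.append((f"conv_{index}", (height, width, channels)))
        height //= pool_size
        width //= pool_size
        shapes.append((f"max_pool_{index}", (height, width, channels)))
    shapes.append(("flatten", (height * width * channels,)))
    return shapes


# Hypothetical configuration: 64x64 RGB patches and 8, 16, 32 filters.
for name, shape in trace_shapes((64, 64, 3), conv_filters=[8, 16, 32]):
    print(f"{name:12s} {shape}")
```

Each pooling stage halves the spatial size, so three stages reduce a 64x64 patch to 8x8 before flattening in this example.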
5 Empirical Results
5.1 Evaluation Setup
In the research community, comparable and shareable performance measures to evaluate algorithms are preferable. However, in reality such measures may only exist for a handful of methods. The major problem when evaluating image classification methods is the absence of standard data collection protocols. Even if a common collection method existed, simply choosing different training and test sets can introduce inconsistencies in model performance [yang1999evaluation]. Another challenge with respect to method evaluation is being able to compare different performance measures used in separate experiments. Performance measures generally evaluate specific aspects of classification task performance, and thus do not always present identical information. In this section, we discuss evaluation metrics and performance measures and highlight ways in which the performance of classifiers can be compared.
Since the underlying mechanics of different evaluation metrics may vary, understanding what exactly each of these metrics represents and what kind of information they are trying to convey is crucial for comparability. Some examples of these metrics include recall, precision, accuracy, F-measure, micro-average, and macro-average. These metrics are based on a “confusion matrix” that comprises true positives (TP), false positives (FP), false negatives (FN) and true negatives (TN) [lever2016points]. The significance of these four elements may vary based on the classification application. The fraction of correct predictions over all predictions is called accuracy (Eq. 17). The proportion of correctly predicted positives to all predicted positives is called precision, i.e. positive predictive value (Eq. 18).
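The metrics named above follow directly from the four confusion-matrix counts. The sketch below uses the standard definitions; the function name and the example counts are illustrative, and for a three-class problem the counts would be computed per class in a one-vs-rest fashion.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0   # positive predictive value
    recall = tp / (tp + fn) if tp + fn else 0.0      # true positive rate
    if precision + recall:
        f1 = 2.0 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up counts for one class versus the rest.
metrics = classification_metrics(tp=40, fp=10, fn=5, tn=45)
print(metrics["accuracy"])   # 0.85
print(metrics["precision"])  # 0.8
```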
5.2 Experimental Setup
The following results were obtained using a combination of central processing units (CPUs) and graphical processing units (GPUs). The processing was done on a with cores and memory, and the GPU cards were two and a . We implemented our approaches in Python using the Compute Unified Device Architecture (CUDA), which is a parallel computing platform and Application Programming Interface (API) model created by . We also used Keras and TensorFlow libraries for creating the neural networks [abadi2016tensorflow, chollet2015keras].
5.3 Experimental Results
In this section we show that a CNN with color balancing can improve the robustness of medical image classification. The results for the model trained on different color balancing values are shown in Table 3. The results shown in Table 4 are based on the same trained model, but the test set uses a different set of color balancing values: and . By testing on a different set of color balancing values, these results suggest that this technique can address the issue of multiple stain variations in histological image analysis.
As shown in Table 3, the F1-scores of the three classes (Environmental Enteropathy (EE), Celiac Disease (CD), and Normal) are , , and , respectively. In Table 4, the F1-scores are reduced, but not by a significant amount: the F1-scores of the three classes (Environmental Enteropathy (EE), Celiac Disease (CD), and Normal) are , , and , respectively. This result is comparable to that of MA. Boni et al. [Mohammad_al_boni], who achieved 90.59% accuracy with their model but did not use color balancing to accommodate differently stained images.
|Celiac Disease (CD)|
|Celiac Disease (CD)|
Receiver operating characteristic (ROC) curves, shown in Figure 7, are valuable graphical tools for evaluating classifiers. However, class imbalances (i.e. differences in prior class probabilities) can cause ROC curves to poorly represent classifier performance. A ROC curve plots the true positive rate (TPR) against the false positive rate (FPR). The ROC curves show that the AUC for Environmental Enteropathy (EE) is 1.00, for Celiac Disease (CD) is 0.99, and for Normal is 0.97.
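A per-class ROC curve and its AUC can be computed from predicted scores by treating one class as positive and the others as negative. The sketch below is a plain-Python illustration with made-up scores; it is not the evaluation code used for Figure 7, and it assumes the data contain at least one positive and one negative example.

```python
def roc_points(scores, labels):
    """ROC points (FPR, TPR) for binary labels (1 = positive, 0 = rest)."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Lower the threshold past all examples that share this score.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Made-up scores for one class versus the rest.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
area = auc(roc_points(scores, labels))  # 8 of 9 positive/negative pairs are ranked correctly
```

The AUC equals the fraction of positive/negative pairs in which the positive example receives the higher score, which is why a value near 1.00 indicates almost perfect separation of that class from the rest.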
|Method||Color staining considered||Model||Accuracy|
|Shifting and Reflections [Mohammad_al_boni]||No||CNN||85.13%|
|Fine-tuned ALEXNET [nawaz2018classification]||Yes||ALEXNET||89.95%|
As shown in Table 5, our model performs better than some other models in terms of accuracy. Among the compared models, only the fine-tuned ALEXNET [nawaz2018classification] considers the color staining problem. That work proposes a transfer-learning-based approach for the classification of stained histology images and applies stain normalization before using the images to fine-tune the model.
6 Conclusion
In this paper, we proposed a data-driven model for the diagnosis of diseased duodenal architecture on biopsy images using color balancing on convolutional neural networks. Validation results show that the model can be used by pathologists in diagnostic work on CD and EE. Furthermore, color consistency is an issue in digital histology images, because different imaging systems reproduce the colors of a histological slide differently. Our results demonstrate that applying the color balancing technique can attenuate the effect of this issue on image classification.
The methods described here can be improved in multiple ways. Additional training and testing with other color balancing techniques and data sets will help identify the architectures that work best for these problems. It is also possible to extend the model to more than four color balance percentages to capture more of the complexity in medical image classification.
This research was supported by University of Virginia, Engineering in Medicine SEED Grant , the University of Virginia Translational Health Research Institute of Virginia () Mentored Career Development Award , and the Bill and Melinda Gates Foundation (; , ; )