Deep interpretable architecture for plant diseases classification
Recently, many works on plant diseases classification have been inspired by the success of deep learning in computer vision. Unfortunately, these end-to-end deep classifiers lack transparency, which can limit their adoption in practice. In this paper, we propose a new trainable visualization method for plant diseases classification based on a Convolutional Neural Network (CNN) architecture composed of two deep classifiers. The first one is named Teacher and the second one Student. This architecture leverages multitask learning to train the Teacher and the Student jointly. Then, the representation communicated between the Teacher and the Student is used as a proxy to visualize the most important image regions for classification. This new architecture produces sharper visualizations than the existing methods in the plant diseases context. All experiments are conducted on the PlantVillage dataset, which contains 54306 plant images.
July 1, 2019
Keywords: Plant diseases classification; Deep visualization algorithms; Convolutional Neural Networks (CNNs)
1 Introduction
Plant diseases cause great damage to agricultural crops by significantly decreasing production. Protecting plants from diseases is vital to guarantee the quality and the quantity of crops. A successful protection strategy starts with early detection of the disease and the right treatment to prevent its spread. Many studies have proposed the use of Convolutional Neural Networks (CNNs) to detect and classify diseases. This new trend has produced more accurate classifiers than shallow machine learning approaches based on hand-crafted features [3, 4, 2, 5]. Despite these successes, CNNs still suffer from a lack of transparency that limits their adoption in many domains. These CNNs are complex deep models that yield good results at the expense of explainability and interpretability. High accuracy is not sufficient for plant diseases classification. Users also need to be informed how the detection is achieved and which symptoms are present in the plant.
In this paper, we propose a classification and visualization architecture based on multitask learning of two classifiers: Teacher and Student. This architecture represents a trainable visualization method for plant diseases classification. The main contribution is the design of an interpretable deep architecture able to perform classification and visualization simultaneously. The visualization algorithm is embedded directly in the network design instead of being applied after training as a post-processing step.
2 Related works
Visualization algorithms are used to explain a CNN's decision using a heatmap. This heatmap highlights the importance of each pixel for the classifier's decision. These methods backpropagate to the input image the discriminant features used by the network for classification. Most of these methods use heuristics during the backpropagation. For example, gradient-based methods like deconvolution [6] and guided backpropagation [7] filter some signals during the backpropagation according to heuristic rules. These filtered signals are ignored to produce sharp visualizations. Furthermore, Layer-wise Relevance Propagation (LRP) methods [8, 9] are based on choosing a root point for each layer, which determines the propagation rule of that layer in the network. The global visualization method Grad-CAM [10] projects the features from the last convolution layer to the input using a linear interpolation, which degrades the precision of the produced heatmaps. These visualization algorithms can produce different heatmaps according to the chosen heuristics and propagation rules, which makes the classifier difficult to understand. In the present paper, we combine two classifiers to extract discriminant features from images. These discriminant features are extracted by the first classifier and projected as an input for the second classifier. This trainable visualization is similar to segmentation architectures like U-Net [11], where the supervision masks are replaced with a second classifier.
3 Proposed method
We propose a generic classification and visualization architecture, named the Teacher/Student architecture, based on learning transfer from a first classifier (Teacher) to a second one (Student). This learning transfer is achieved using an autoencoder that has the Teacher as its encoder. The decoder consumes the Teacher's latent representations to reconstruct an image with the same dimensions as the input image. This image is used as the input of the Student. The whole network (Teacher + Decoder + Student) is trained to jointly minimize the losses of the two classifiers (Teacher and Student). More formally, the network has two outputs: the Teacher output and the Student output. During training, the loss function (1) is minimised. A hyperparameter sets the tradeoff between the Teacher loss (2) and the Student loss (3).
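As an illustration of how a single tradeoff hyperparameter can balance two classification losses, the following minimal sketch combines a Teacher loss and a Student loss for one sample. It is not the paper's exact formulation: the convex combination, the name `alpha`, the use of categorical cross-entropy and the function names are all assumptions made for this example.

```python
import math

def cross_entropy(probs, true_class):
    """Categorical cross-entropy for one sample, given predicted class probabilities."""
    eps = 1e-12  # avoids log(0)
    return -math.log(max(probs[true_class], eps))

def joint_loss(teacher_probs, student_probs, true_class, alpha=0.5):
    """Weighted sum of the Teacher and Student losses.

    ASSUMPTION: the tradeoff is a convex combination with weight `alpha`;
    the text only says that a hyperparameter balances the two losses.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    teacher_loss = cross_entropy(teacher_probs, true_class)
    student_loss = cross_entropy(student_probs, true_class)
    return alpha * teacher_loss + (1.0 - alpha) * student_loss

# Hypothetical softmax outputs of the two classifiers for a 3-class problem.
teacher_out = [0.7, 0.2, 0.1]
student_out = [0.5, 0.3, 0.2]
loss = joint_loss(teacher_out, student_out, true_class=0, alpha=0.5)
```

Under this assumption, `alpha = 1` trains only the Teacher and `alpha = 0` trains only through the Student, so intermediate values control how strongly the Student's error shapes the representation communicated by the decoder.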
The Teacher/Student network is designed to reconstruct an image containing the discriminant features formed by the Teacher to help the Student's training. As a side effect of this design, the reconstructed image can be used as a visualization of the important regions for the classification. This architecture acts as an autoencoder that removes from the image the features that are irrelevant from the classification viewpoint. The difference between the usual denoising autoencoder and this architecture lies in the loss function design. The denoising autoencoder minimizes a reconstruction loss, while this architecture minimizes the classification losses of two classifiers. The proposed architecture is also designed to extract the important regions for classification without using masks. For this reason, any segmentation architecture can be adapted to the proposed architecture by modifying the loss function to include a Student classifier in place of the segmentation masks.
Fig. 1 details the Teacher/Student architecture. For the sake of simplicity, the VGG16 [12] architecture is used as Teacher and Student. Nevertheless, the Teacher/Student architecture is flexible and other classification architectures can be used as Teacher or Student. To use another architecture, the decoder must be adapted to invert the Teacher's layers in order to reconstruct the input of the Student.
The Teacher/Student architecture is composed of the following components:
3.1 Teacher/Student architecture
The Teacher and the Student architectures are identical to the standard VGG16 architecture [12]. Skip connections (blue arrows) are used from the Teacher to the decoder. These skip connections concatenate the input tensor of the pooling layer of each convolution block with the tensor of the corresponding deconvolution block.
3.2 Reversed fully connected layers
Convolutional autoencoders reverse only the convolution layers to perform the reconstruction task. Here, the decoder requires discriminant features from the fully connected layers. Therefore, two fully connected layers are used to reverse the Teacher's fully connected layers. Furthermore, a skip connection (red arrow) is used to reinforce the decoder by adding the vector of the Teacher's first fully connected layer.
3.3 Deconvolution blocks
Table 1: Details of a deconvolution block (columns: Layer, Input tensor, Output tensor).
Deconvolution blocks reverse the flow of tensors to form the reconstructed image. The details of this block are shown in Tab. 1. Each block consumes an input tensor and produces an output tensor with doubled spatial dimensions. In each deconvolution block, the tensor is upsampled, concatenated with the corresponding skip-connection tensor, and the depth of the resulting tensor is then reduced. The first two deconvolution blocks (DECONV Block1, DECONV Block2) double the input tensor spatially without changing its depth, whereas the other deconvolution blocks double the tensor spatially and reduce its depth.
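The three steps of a deconvolution block (upsample, concatenate, reduce depth) can be followed by tracing tensor shapes through the decoder. The sketch below is illustrative only: the 224x224 input size, the 7x7x512 starting tensor, and the rule that later blocks halve the depth are assumptions based on the standard VGG16 layout, not values taken from Tab. 1.

```python
def deconv_block_shape(in_shape, skip_depth, out_depth):
    """Trace tensor shapes through one deconvolution block.

    Steps (from the text): upsample x2, concatenate the skip tensor,
    then reduce the depth of the concatenated tensor.
    """
    h, w, d = in_shape
    upsampled = (2 * h, 2 * w, d)
    concatenated = (2 * h, 2 * w, d + skip_depth)
    reduced = (2 * h, 2 * w, out_depth)
    return upsampled, concatenated, reduced

# ASSUMPTION: 224x224 input, so the Teacher's last pooling output is 7x7x512.
# Skip depths are the standard VGG16 block depths, deepest block first.
skip_depths = [512, 512, 256, 128, 64]
# ASSUMPTION: the first two blocks keep depth 512; later blocks halve it.
out_depths = [512, 512, 256, 128, 64]

shape = (7, 7, 512)
trace = []
for skip_d, out_d in zip(skip_depths, out_depths):
    _, _, shape = deconv_block_shape(shape, skip_d, out_d)
    trace.append(shape)
```

Under these assumptions, five blocks bring the tensor back to the spatial size of the input image, which is the precondition for the refinement step that follows.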
3.4 Reconstructed image refinement
After several stages of deconvolution, the spatial dimensions of the produced tensor are equal to those of the input image. However, the depth of the tensor must be reduced to match the depth of the input image. To do this, two convolution layers are applied. The first one (red in Fig. 1) reduces the depth to two channels, which puts pressure on the data flow by discarding unnecessary details. The second convolution layer (green in Fig. 1) expands the tensor to three channels so that it can be used as the Student's input. The decoder's last convolution layer uses a sigmoid activation function to scale the values to the interval [0, 1]. This final tensor is used as the Student's input and can also be used as a proxy to understand the communication between the Teacher and the Student.
4 Experimental results
Experiments are conducted using the segmented version of the PlantVillage dataset with black backgrounds [13]. This dataset includes 54306 images of 14 crop species with 38 classes of diseases and healthy plants. The dataset is split into a training set that contains 32572 images and a validation set that contains 21734 images.
4.1 Classification results
In this section, we present the results of training and validation of the proposed architecture. The training is based on the gradient descent algorithm with the following hyperparameters: (learning rate , momentum , batch size = 16, number of iterations , multitask hyperparameter ). Training and validation are executed on a workstation equipped with an Nvidia GTX 1080 Graphics Processing Unit (GPU).
At the beginning of training, the Teacher is more accurate than the Student. This may be explained by the dependency of the Student on the representation constructed by the Teacher. However, the loss and the accuracy of the Teacher and the Student converge at the end of the training because the communicated representation becomes stable. Furthermore, the Student's loss is more stable than the Teacher's loss in validation, which reinforces the hypothesis of transfer learning from the Teacher to the Student. This transfer learning is achieved through the quality of the reconstructed image. This reconstructed image focuses on discriminant regions and filters out non-discriminant regions. To assess this information filtering mechanism, we analyze the architecture as a visualization method in the following sections.
4.2 Visualization results
The visualizations depicted in Fig. 3 represent the reconstructed three-channel images used as input for the Student. In Fig. 4, important regions are segmented using a simple binary thresholding algorithm (threshold = 0.9). The thresholding algorithm is applied after a simple aggregation across the channels of the reconstructed image to produce a one-channel heatmap. To produce this heatmap from the reconstructed image, formula (4) is applied. This formula measures the distance between each pixel's color and black.
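The aggregation and thresholding steps can be sketched as follows. This is an illustrative reading of the description, not formula (4) itself: the Euclidean distance, the division by sqrt(3) that keeps values in [0, 1], the `>=` comparison and the function names are assumptions.

```python
import math

def heatmap_from_reconstruction(image):
    """Collapse an H x W x 3 image (values in [0, 1]) into an H x W heatmap.

    Each heatmap value is the Euclidean distance between the pixel colour and
    black (0, 0, 0). ASSUMPTION: the distance is divided by sqrt(3) so that
    the result stays in [0, 1]; the text does not state the normalization.
    """
    scale = math.sqrt(3.0)
    return [[math.sqrt(sum(c * c for c in pixel)) / scale for pixel in row]
            for row in image]

def binary_mask(heatmap, threshold=0.9):
    """Mark pixels whose heatmap value is at least `threshold` (0.9 in the text)."""
    return [[1 if value >= threshold else 0 for value in row] for row in heatmap]

# Tiny hypothetical 2 x 2 reconstructed image.
reconstructed = [
    [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)],
    [(0.2, 0.1, 0.0), (0.95, 0.9, 1.0)],
]
heatmap = heatmap_from_reconstruction(reconstructed)
mask = binary_mask(heatmap)
```

Because the background of the reconstructed image is black, a distance-to-black measure assigns it a value of zero, so only the regions the decoder chose to keep can pass the threshold.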
The produced heatmaps (Fig. 4) show clearly the symptoms of the plant disease. Furthermore, these heatmaps are sharp and precise. The healthy regions of the leaves are filtered and only the important regions are highlighted. In the next section, the proposed method is compared quantitatively to other methods using perturbation curves to show its effectiveness.
4.3 Comparison with visualization algorithms
In this section, the proposed method is compared to the following visualization algorithms: visualization based on gradient [14], Grad-CAM [10] and Layer-wise Relevance Propagation (LRP) methods (Deep Taylor [9], LRP-Epsilon [8], LRP-Z [8]).
4.3.1 Heatmaps comparison
Fig. 5 shows the difference between the heatmaps of the visualization algorithms. The proposed algorithm's heatmaps are sharper than the other heatmaps. Indeed, gradient, LRP-Z and LRP-Epsilon heatmaps are noisy and difficult to explain. The gradient heatmaps are noisy because the gradients measure the pixels' sensitivities instead of their contributions. Besides, the presence of both negative and positive contributions in LRP heatmaps makes them difficult to understand. On the other hand, the Deep Taylor algorithm produces good and clear heatmaps compared to the other algorithms. The heatmaps of the Deep Taylor algorithm highlight almost the same regions as our algorithm, but with some activated regions in the background. In the proposed algorithm, the background is completely deactivated, which gives clean heatmaps.
Fig. 6 shows the difference between the proposed method and Grad-CAM. The Grad-CAM algorithm localizes the important regions only coarsely. Furthermore, the Grad-CAM visualizations miss some important regions highlighted by the proposed method. This inaccurate localization is due to the resizing used by Grad-CAM to propagate the contributions from the last convolution layer to the input image. In the Teacher/Student architecture, this propagation is ensured by a trainable decoder, which makes the visualizations more precise.
4.3.2 Histograms of heatmaps values
All the produced heatmaps are normalized to the interval [0, 1] in order to analyze the distribution of values in each method. This normalization is based on equation (5), which uses the minimum and the maximum values of each heatmap.
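A normalization to [0, 1] that uses only the minimum and the maximum of a heatmap is, in its usual form, min-max scaling. The expression below is a sketch of that standard form; the symbols H, H_min, H_max and the hat notation are assumptions introduced here, since the original notation is not available.

```latex
\hat{H}(i, j) = \frac{H(i, j) - H_{\min}}{H_{\max} - H_{\min}},
\qquad
H_{\min} = \min_{i, j} H(i, j),
\quad
H_{\max} = \max_{i, j} H(i, j)
```

This form is undefined for a constant heatmap (H_max = H_min), a case that would need separate handling in practice.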
Histograms of the methods are shown in Fig. 7. We notice that the distribution of heatmap values of the proposed method is different from that of the other methods. Grad-CAM, gradient and Deep Taylor distributions are concentrated in very small values, and the density decreases gradually as values increase. LRP-Epsilon and LRP-Z have Gaussian-like distributions centered on 0.5. In contrast, the distribution of the proposed method has three clusters:
Cluster1: deactivated pixels, where the heatmap values are zero. This cluster represents the background and non-discriminant pixels.
Cluster2: pixels having small heatmap values.
Cluster3: pixels having high heatmap values, which can be considered as important pixels for the classifier.
4.3.3 Perturbation curves
To measure the quality of a visualization method quantitatively, the produced heatmap is considered as a ranking function over pixels. A good heatmap ranks the pixels correctly according to their importance for the classification. Therefore, if we start by erasing the pixels with the highest heatmap values, the classifier output should decrease rapidly. To evaluate this ranking, the heatmap values are discretized into intervals of equal size. Afterwards, pixels are erased iteratively according to their heatmap values in descending order. This erasing procedure is formulated in Algorithm 1. To erase one pixel, a small black square with a size of three pixels, centered on the pixel of interest, is used. All dataset images have a black background, which motivates the use of black as the reference color for erasing. The classifier output tracked during erasing is its estimate of the probability that the perturbed image has the same class as the initial image. The perturbation curve is traced by recording this output after each erasing step. To produce the perturbation curve that characterizes a visualization method, the perturbation curves of the validation images are averaged.
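The erasing procedure described above can be sketched for a single image as follows. This is a hedged illustration rather than the paper's Algorithm 1: the number of intervals (`n_bins`), the single-channel toy image, the stand-in `toy_classifier` and all function names are assumptions made for the example.

```python
def erase_square(image, row, col, size=3):
    """Paint a size x size black square centred on (row, col), in place."""
    half = size // 2
    for r in range(max(0, row - half), min(len(image), row + half + 1)):
        for c in range(max(0, col - half), min(len(image[0]), col + half + 1)):
            image[r][c] = 0.0

def perturbation_curve(image, heatmap, classifier, n_bins=10):
    """Erase pixels interval by interval, highest heatmap values first.

    ASSUMPTIONS: the heatmap is normalized to [0, 1], `n_bins` equal-width
    intervals are used, and `classifier(image)` returns the probability of
    the original class. None of these values is given in the text.
    """
    perturbed = [row[:] for row in image]  # leave the caller's image untouched
    curve = [classifier(perturbed)]
    for b in range(n_bins - 1, -1, -1):
        low, high = b / n_bins, (b + 1) / n_bins
        for r, row in enumerate(heatmap):
            for c, value in enumerate(row):
                # The top interval is closed so that a value of exactly 1.0 is included.
                in_bin = low <= value < high or (b == n_bins - 1 and value == 1.0)
                if in_bin:
                    erase_square(perturbed, r, c)
        curve.append(classifier(perturbed))
    return curve

# Toy grayscale image with one bright pixel, and a heatmap that ranks it first.
toy_image = [[0.0] * 5 for _ in range(5)]
toy_image[2][2] = 1.0
toy_heatmap = [[0.0] * 5 for _ in range(5)]
toy_heatmap[2][2] = 1.0

def toy_classifier(img):
    """Stand-in for a real classifier: mean brightness of the image."""
    return sum(map(sum, img)) / 25.0

curve = perturbation_curve(toy_image, toy_heatmap, toy_classifier, n_bins=4)
```

In this toy case the output drops to zero as soon as the highest interval is erased, which is the behaviour expected of a heatmap that ranks the decisive pixels first. Averaging such curves over the validation images would give the curve that characterizes a method.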
Fig. 8 shows the perturbation curves of the tested methods. The proposed method's curve decreases rapidly after erasing the pixels of Cluster3. Afterwards, the curve is stationary until the pixels of Cluster2 start to be erased. Erasing the pixels of Cluster2 slightly decreases the classifier output. Cluster1 does not contribute to the decrease of the classifier output because it contains the background and non-discriminant pixels.
Fig. 8 also shows that the perturbation curves of the other methods decrease gradually because of the distribution of values in their heatmaps. On the other hand, the perturbation curve of the proposed method decreases steeply because Cluster3 contains the most important pixels with respect to the classifier.
The proposed method performs better than the other methods on this criterion because of the concentration of important pixels in Cluster3. In the proposed method, erasing only a small fraction of the image is enough to substantially decrease the classifier output.
5 Conclusion and further research
In this work, we have proposed an interpretable Teacher/Student architecture for plant diseases classification. This architecture leverages multitask learning to produce a trainable visualization method. Our experiments demonstrate the benefit of adding the Student classifier to guide the architecture towards reconstructing sharp visualization images. These reconstructed images contain the discriminant regions, which helps to explain the classifier's decision. In the future, our objective is to test the Teacher/Student architecture on other classification problems. Besides, we will work on reducing the computational cost of this architecture.
References
[1] Inge M. Hanssen and Moshe Lapidot. Chapter 2 - Major tomato viruses in the Mediterranean basin. In Gad Loebenstein and Hervé Lecoq, editors, Viruses and Virus Diseases of Vegetables in the Mediterranean Basin, volume 84 of Advances in Virus Research, pages 31–66. Academic Press, 2012.
[2] Mohammed Brahimi, Kamel Boukhalfa, and Abdelouahab Moussaoui. Deep learning for tomato diseases: Classification and symptoms visualization. Appl. Artif. Intell., 31(4):299–315, April 2017.
[3] E. Fujita, Y. Kawasaki, H. Uga, S. Kagiwada, and H. Iyatomi. Basic investigation on a robust and practical plant diagnostic system. In 2016 15th IEEE International Conference on Machine Learning and Applications (ICMLA), pages 989–992, Dec 2016.
[4] Yusuke Kawasaki, Hiroyuki Uga, Satoshi Kagiwada, and Hitoshi Iyatomi. Basic study of automated diagnosis of viral plant diseases using convolutional neural networks. In George Bebis, Richard Boyle, Bahram Parvin, Darko Koracin, Ioannis Pavlidis, Rogerio Feris, Tim McGraw, Mark Elendt, Regis Kopper, Eric Ragan, Zhao Ye, and Gunther Weber, editors, Advances in Visual Computing, pages 638–645, Cham, 2015. Springer International Publishing.
[5] L. G. Nachtigall, R. M. Araujo, and G. R. Nachtigall. Classification of apple tree disorders using convolutional neural networks. In 2016 IEEE 28th International Conference on Tools with Artificial Intelligence (ICTAI), pages 472–476, Nov 2016.
[6] Matthew D. Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. CoRR, abs/1311.2901, 2013.
[7] Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin A. Riedmiller. Striving for simplicity: The all convolutional net. In ICLR (Workshop), 2015.
[8] Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller, and Wojciech Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLOS ONE, 10(7):1–46, July 2015.
[9] Grégoire Montavon, Sebastian Lapuschkin, Alexander Binder, Wojciech Samek, and Klaus-Robert Müller. Explaining nonlinear classification decisions with deep Taylor decomposition. Pattern Recognition, 65:211–222, 2017.
[10] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 618–626, Oct 2017.
[11] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In Nassir Navab, Joachim Hornegger, William M. Wells, and Alejandro F. Frangi, editors, Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, pages 234–241, Cham, 2015. Springer International Publishing.
[12] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.
[13] David P. Hughes and Marcel Salathé. An open access repository of images on plant health to enable the development of mobile disease diagnostics through machine learning and crowdsourcing. CoRR, abs/1511.08060, 2015.
[14] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. In ICLR (Workshop), 2014.