This looks like that: deep learning for interpretable image recognition

Chaofan Chen¹* (cfchen@cs.duke.edu)
Oscar Li¹* (runliang.li@duke.edu)
Alina Barnett¹ (abarnett@cs.duke.edu)
Jonathan Su³ (su@ll.mit.edu)
Cynthia Rudin¹,² (cynthia@cs.duke.edu)

¹Department of Computer Science, Duke University, Durham, NC, USA 27708
²Department of Electrical and Computer Engineering, Duke University, Durham, NC, USA 27708
³MIT Lincoln Laboratory, Lexington, MA 02421-6426

*Contributed equally

DISTRIBUTION STATEMENT A. Approved for public release. Distribution is unlimited. This material is based upon work supported by the Assistant Secretary of Defense for Research and Engineering under Air Force Contract No. FA8702-15-D-0001. Any opinions, findings, conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Assistant Secretary of Defense for Research and Engineering.
Abstract

When we are faced with challenging image classification tasks, we often explain our reasoning by dissecting the image and pointing out prototypical aspects of one class or another. The mounting evidence for each of the classes helps us make our final decision. In this work, we introduce a deep network architecture that reasons in a similar way: the network dissects the image by finding prototypical parts, and combines evidence from the prototypes to make a final classification. The algorithm thus reasons in a way that is qualitatively similar to the way ornithologists, physicians, geologists, architects, and others would explain to each other how to solve challenging image classification tasks. The network uses only image-level labels for training, meaning that there are no labels for parts of images. We demonstrate the method on the CIFAR-10 dataset and on ten classes from the CUB-200-2011 dataset.

 

Preprint. Work in progress.

1 Introduction

How would you describe why the image in Figure 1 looks like a Florida jay? Perhaps the bird’s head looks like that of a prototypical Florida jay, even though its tail might look like that of either a blue jay or a Florida jay. When we describe how we classify images, we might focus on parts of the image and compare them with prototypical parts of images from a given class. For other images, we might look at them holistically and compare them with prototypical objects of similar overall shape. This method of reasoning is commonly used in difficult identification tasks: radiologists compare suspected tumors in X-ray scans with prototypical tumor images when diagnosing cancer, and an art historian can identify a painter by looking both at fine-grained details of a painting, such as the brush-stroke style, and at coarse-grained details like subject matter and color palette. Beyond radiology and art, this type of image dissection is used in geology (rock/mineral identification), architecture (style identification), fashion, zoology, and entomology. The question is whether we can ask a machine learning algorithm to imitate this way of thinking, in order to explain its reasoning process to humans in a way that they might understand.

Figure 1: Image of a Florida jay and the learned prototypical parts of a Florida jay used to classify the bird’s species. The smaller images on the right are the prototypical parts of a Florida jay learned by our algorithm. They correspond to parts of a test image (left) as shown.

The goal of this work is to define a form of interpretability in image processing (this looks like that) that agrees with the way humans describe their own thinking in classification tasks. We introduce a network architecture that accommodates this definition of interpretability, where the comparison of image parts to learned prototypes is integral to the way our network reasons about new examples. Given the image of a Florida jay in Figure 1, our learning algorithm is able to identify several parts of the image where it thought that this part of the image looked like that prototypical part of some training image. The algorithm selects a limited number of prototypical parts for each class that are useful in identifying the class of a new image, and it uses an internal notion of distance from a part of the new image to these learned prototypes to provide a predicted class label. In this way, the models introduced here are interpretable, rather than merely explainable: the reasoning process is actually used by the algorithm, rather than generated afterwards as an explanation.

Our experiments indicate that the accuracy of our interpretable network is comparable with that of analogous standard (non-interpretable) deep networks on datasets of natural images such as CIFAR-10 [Krizhevsky and Hinton, 2009]. The interpretability is gained without losing substantial accuracy.

1.1 Related work

Our work relates to those that perform posthoc interpretability analysis for trained networks. In posthoc analysis, one interprets a trained network by fitting explanations to how the network performs classification; a separate modeling effort is required to generate such explanations. A classic approach to understanding networks posthoc is activation maximization (AM), in which the goal is to find an input pattern that maximizes a particular class score [Erhan et al., 2009]. Other works perform AM using regularized optimization [Hinton, 2012, Lee et al., 2009, van den Oord et al., 2016, Nguyen et al., 2016, Simonyan et al., 2013, Yosinski et al., 2015] to improve the interpretability of the images from AM. However, the images from regularized AM may not faithfully represent input patterns that maximally activate a network unit, because they are produced by a separate optimization procedure that is not part of training [Montavon et al., 2017]. There is no reason that any network should be inherently interpretable, and so this optimization does not generally lead to meaningful explanations. Alternatives to AM are provided by input-specific (image-specific) posthoc visualization methods, including deconvolution [Zeiler and Fergus, 2014] and gradient-based saliency visualizations [Simonyan et al., 2013, Sundararajan et al., 2017, Smilkov et al., 2017, Selvaraju et al., 2017]. None of these posthoc visualization methods explains the reasoning process by which a network actually makes its decisions. In contrast, our network has a built-in case-based reasoning process, and the explanations generated by our network are actually used during classification rather than created posthoc.

Our work relates closely to works that build interpretability into deep neural networks without using posthoc analysis. Attention mechanisms that attempt to identify the most relevant parts of an input for various tasks have been integrated into neural networks: Pinheiro and Collobert [2015] trained a classification network that highlights important pixels belonging to an object of interest for weakly supervised image segmentation, and Zhou et al. [2016] introduced class activation maps that highlight the regions of an image most responsible for classifying the image into a particular class. Both of these works learn class-specific attention maps that are jointly trained with the proposed networks. Xiao et al. [2015] proposed object-level and part-level attention models that select image patches of interest for fine-grained image classification. Lei et al. [2016] proposed a network architecture for natural language processing that extracts important phrases and uses them as rationales for predictions. Attention mechanisms have also been used in deep learning for speech recognition [Chorowski et al., 2015], image captioning [Xu et al., 2015], visual question answering [Chen et al., 2015], and contour estimation [Xu et al., 2017]. All of these works build interpretability into neural networks by learning which parts of an input are important for their respective tasks. Our work differs from all of them in that our model not only identifies parts of images that are important for classification, but also compares those parts to learned prototypical cases during prediction. Knowing which pixels are important (saliency, attention) tells us only which pixels are used, not how those pixels are used for reasoning (consider, for instance, a correct saliency map paired with a wrong class label, and how challenging that would be to troubleshoot).

Recently there have also been attempts to quantify the interpretability of visual representations learned by a convolutional neural network (CNN). Bau et al. [2017] proposed the network dissection framework, which uses the overlap between the receptive field of top activations and regions corresponding to labeled visual concepts as a measure of the interpretability of a convolutional unit. Zhang et al. [2017] used this measure of interpretability and proposed modifications to traditional CNNs to make them interpretable, by introducing template masks into the network architecture and adding regularization terms that encourage filters to be activated by a single class and on a single region. These are useful, but the notion of interpretability considered in this work is different. We do not aim to interpret units inside the network; we are looking instead at explanations that are similar to those humans give to each other. We do not aim to compare everything identified in the image to a known, labeled visual class. Instead, we aim to pinpoint parts of the image that are important and similar to prototypical parts of images from a class. Our network can automatically identify, for instance, that a prototypical part of a Florida jay’s head is important for identifying it. It can do this without having seen that part labeled on any image of a Florida jay. It is not limited by what labels have been assigned to parts of training images, and does not need any parts labeled at all.

Our work also relates closely to other prototype classification techniques in machine learning [Bien and Tibshirani, 2011, Kim et al., 2014, Priebe et al., 2003, Wu and Tabak, 2017]. It relates most closely to Li et al. [2018], who proposed a network architecture that builds case-based reasoning into a neural network. However, their model requires a decoder for visualizing prototypes, and when trained on datasets of natural images such as CIFAR-10, the decoder fails to produce realistic-looking prototype images. In contrast, our model does not require a decoder for prototype visualization: it “pushes” the latent representation of each prototype onto the closest latent representation of a training image patch, and uses those training image patches for prototype visualization. Unlike Li et al. [2018], whose model requires the prototypes to have exactly the same shape as the latent representations of images, the prototypes in our model can have much smaller spatial dimensions than the latent representations of images, which means that our prototypes are prototypical parts of images. This allows for more fine-grained comparisons, because different parts of an image can now be compared to different prototypes. Our approach also improves over Li et al. [2018] through easier training due to the removal of the decoder, leading to better explanations.

2 Methodology

2.1 Network architecture

Figure 2: The network architecture.

Our network architecture consists of a regular convolutional neural network (with filters and biases collectively denoted by $w_{\mathrm{conv}}$), followed by a prototype layer and a fully connected layer (with weight matrix $w_h$ and no bias). Given an input image $x$, the convolutional layers of our model extract useful features $f(x)$ for prediction. Let $H \times W \times D$ be the shape of the convolutional output $z = f(x)$. The network learns $m$ prototypes $p_1, \dots, p_m$, whose shape is $H_1 \times W_1 \times D$ with $H_1 \leq H$ and $W_1 \leq W$. Since the depth of each prototype is the same as that of the convolutional output, but the height and width of each prototype are less than or equal to those of the convolutional output, each prototype will be used to represent some prototypical activation pattern in a patch of the convolutional output, which in turn corresponds to some prototypical patch in the image space. Hence, each prototype $p_j$ can be understood as the latent representation of some typical part of an image. Given a convolutional output $z$, the $j$-th prototype unit in the prototype layer computes the squared $L^2$ distances between the $j$-th prototype $p_j$ and all patches of $z$ that have the same shape as $p_j$, and transforms the distances into similarity scores using the negative logarithm function. The result is an activation map of similarity scores that preserves the spatial relation of the convolutional output: the upper-left value in the resulting activation map is the similarity score between the upper-left patch of $z$ and the prototype. This map of similarity scores is then reduced to a single value using max pooling for each prototype unit $g_{p_j}$. Mathematically, the prototype unit performs the following computation:

$$g_{p_j}(z) = \max_{\tilde{z} \,\in\, \mathrm{patches}(z)} \; -\log\!\left( \lVert \tilde{z} - p_j \rVert_2^2 \right).$$
If the output $g_{p_j}(z)$ of the $j$-th prototype unit is large, then there is a patch in the convolutional output $z$ that is very close to the $j$-th prototype in the latent space, and this in turn means that there is a patch in the original input image that has a similar concept to what the $j$-th prototype represents.

Hence, given the convolutional output $z$, the prototype layer produces $m$ similarity scores, one between each prototype and the patch of $z$ that is most similar to that prototype. These scores are then multiplied by the weight matrix $w_h$ in the fully connected layer to produce the output logits for classification.
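To make this computation concrete, below is a minimal PyTorch sketch of a prototype layer of this kind. The class and variable names (`PrototypeLayer`, `num_prototypes`, and so on) are hypothetical, and the small constant `eps` inside the logarithm is our own addition to keep the sketch numerically stable; the squared distances to every patch are computed at once via convolutions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrototypeLayer(nn.Module):
    def __init__(self, num_prototypes, depth, proto_h, proto_w, eps=1e-4):
        super().__init__()
        # m prototypes of shape H1 x W1 x D, stored as (m, D, H1, W1) for conv2d
        self.prototypes = nn.Parameter(torch.randn(num_prototypes, depth, proto_h, proto_w))
        self.eps = eps  # small constant keeping the logarithm finite (an assumption, not from the text)

    def forward(self, z):
        # z: convolutional output of shape (batch, D, H, W)
        # ||z_patch - p||^2 = sum(z_patch^2) - 2 <z_patch, p> + sum(p^2), evaluated
        # for every patch of z at once via convolutions
        ones = torch.ones_like(self.prototypes)
        patch_sq = F.conv2d(z ** 2, ones)                    # sum of squares over each patch
        cross = F.conv2d(z, self.prototypes)                 # inner product with each prototype
        proto_sq = (self.prototypes ** 2).sum(dim=(1, 2, 3)).view(1, -1, 1, 1)
        distances = F.relu(patch_sq - 2 * cross + proto_sq)  # (batch, m, H-H1+1, W-W1+1)
        similarities = -torch.log(distances + self.eps)      # negative-log similarity map
        return similarities.amax(dim=(2, 3))                 # max-pool to one score per prototype
```

In the full model, the resulting vector of per-prototype similarity scores would then be multiplied by the fully connected weight matrix $w_h$ to produce the class logits.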

2.2 Cost function

The cost function for training our network takes into account both classification accuracy and learning interpretable prototypes. Let $D = \{(x_1, y_1), \dots, (x_n, y_n)\}$ be the training set of images $x_i$, with labels $y_i$ for $i = 1, \dots, n$. We use the cross-entropy loss to penalize misclassification on the training data. The optimization problem we aim to solve is as follows:

$$\min_{w_{\mathrm{conv}},\, p_1, \dots, p_m,\, w_h} \;\; \frac{1}{n} \sum_{i=1}^{n} \mathrm{CrossEnt}\big(h(x_i), y_i\big) \;+\; \lambda_1 R_1 \;+\; \lambda_2 \lVert w_{\mathrm{conv}} \rVert_2^2 \;+\; \lambda_3 \lVert w_h \rVert_1$$
$$\text{subject to: for each } j \in \{1, \dots, m\},\; p_j = \tilde{z} \text{ for some } \tilde{z} \in \bigcup_{i=1}^{n} \mathrm{patches}(f(x_i)),$$

where $h(x_i)$ represents the network’s (unnormalized) prediction on training set observation $x_i$. The cross entropy compares these predictions to the training labels and encourages accuracy. The term $R_1$ in our objective is modified from Li et al. [2018]. In their work, $R_1$ was used to encourage the latent representation of each training image to be close to some prototype. The prototypes in our work are not required to have the same spatial dimensions as latent representations of images, unlike Li et al. [2018]. In this work, we define $R_1$ as:

$$R_1 = \frac{1}{n} \sum_{i=1}^{n} \; \min_{j \in \{1, \dots, m\}} \; \min_{\tilde{z} \,\in\, \mathrm{patches}(f(x_i))} \lVert \tilde{z} - p_j \rVert_2^2.$$

The minimization of $R_1$ requires each training image to have some patch whose feature representation is close to at least one prototype. This ensures that the latent space has a clustering structure in which the most important patches from the training images will be clustered around the prototypes, which facilitates the distance-based classification performed by our network.
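For illustration, here is a small sketch (with a hypothetical helper name) of how $R_1$ could be computed from the map of squared patch-to-prototype distances already formed inside the prototype layer; `distances` is assumed to have shape (batch, m, H', W').

```python
import torch

def cluster_cost_r1(distances: torch.Tensor) -> torch.Tensor:
    # distances: (batch, m, H', W') squared patch-to-prototype distances
    min_over_patches = distances.amin(dim=(2, 3))   # closest patch for each prototype, (batch, m)
    min_over_protos = min_over_patches.amin(dim=1)  # closest prototype for each image, (batch,)
    return min_over_protos.mean()                   # average over the images
```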

In our cost function, we use weight decay on the convolutional filters and $L^1$ regularization on the weight matrix $w_h$. The $L^1$ regularization encourages the weight connections between the prototype layer and the output logits to be sparse, so that each prototype contributes to the output logits of only a few classes. This makes it easier for humans to identify the most significant contributions of each prototype to class predictions.

Finally, the constraint in the optimization problem requires each prototype $p_j$ to be equal to the latent representation of some training image patch. In this way, each prototype can represent the semantic concept of the corresponding patch in that training image.

2.3 Training and prototype visualization

Our optimization technique uses both gradient descent on a relaxed objective (defined shortly) and projection steps. In order to relax the constraint into a differentiable penalty, we define:

$$R_2 = \frac{1}{m} \sum_{j=1}^{m} \; \min_{i \in \{1, \dots, n\}} \; \min_{\tilde{z} \,\in\, \mathrm{patches}(f(x_i))} \lVert p_j - \tilde{z} \rVert_2^2,$$

which is similar to a term of Li et al. [2018], but modified to handle image parts. Adding a multiple of this term to our objective and removing the explicit constraint yields the relaxed objective that we minimize with gradient descent. Minimizing the relaxed objective allows sufficient exploration of the latent space while ensuring that we do not step too far from the feasible region.
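A matching sketch of the $R_2$ penalty, reusing the same hypothetical `distances` map, is shown below; note that the inner minimum in $R_2$ runs over all training images, so evaluating it on a single mini-batch as done here is an approximation made for brevity.

```python
import torch

def relaxation_r2(distances: torch.Tensor) -> torch.Tensor:
    # distances: (batch, m, H', W') squared patch-to-prototype distances
    min_over_patches = distances.amin(dim=(2, 3))   # (batch, m)
    return min_over_patches.amin(dim=0).mean()      # closest patch in the batch for each prototype, averaged
```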

After every few epochs (5 in our experiments) of gradient descent on the relaxed objective, we compute a projection step by projecting the $p_j$’s onto the feasible set while only minimally increasing the objective function. To do this, we set each $p_j$ to its nearest training image patch (in $L^2$ distance) in the latent space. Since the last iteration is a projection step, the prototypes in the final model are exactly the latent representations of some training image patches. The prototypes can then be visualized by finding the receptive fields of those patches in the pixel space.
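The projection ("push") step can be sketched as follows, under the assumption that `conv_net` maps an image to its (D, H, W) feature map and `extract_patches` enumerates all H1 x W1 patches of that feature map; both helpers, like the function name itself, are hypothetical.

```python
import torch

@torch.no_grad()
def push_prototypes(prototypes, training_images, conv_net, extract_patches):
    # prototypes: (m, D, H1, W1) tensor of current prototype parameters
    m = prototypes.shape[0]
    best_dist = torch.full((m,), float("inf"))
    best_patch = prototypes.clone()
    for x in training_images:                    # iterate over the training set
        z = conv_net(x.unsqueeze(0))[0]          # latent representation, (D, H, W)
        for patch in extract_patches(z):         # every H1 x W1 patch, shape (D, H1, W1)
            d = ((prototypes - patch) ** 2).sum(dim=(1, 2, 3))  # squared L2 to each prototype
            closer = d < best_dist
            best_dist = torch.where(closer, d, best_dist)
            best_patch[closer] = patch
        # (a full implementation would also record which image and patch location won,
        #  so the prototype can later be visualized via its receptive field)
    prototypes.copy_(best_patch)                 # project the prototypes onto the feasible set
```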

3 Experiment 1: CIFAR-10

CIFAR-10 is a dataset of 32x32 color images in 10 classes, with 50,000 training images and 10,000 test images. We use this dataset to demonstrate that our model can achieve test accuracy comparable with that of analogous standard convolutional networks, and that it can learn meaningful prototypes that correspond to typical (parts of) objects.

3.1 Accuracy

Our network for CIFAR-10 has several convolutional layers before the prototype layer and the fully connected layer. The convolutional layers have the same architecture as the first layers in the ALL-CNN-C model described by Springenberg et al. [2014]. We replaced the last convolutional layer and the global average pooling layer in the ALL-CNN-C model with our prototype layer and a fully connected layer. During training, we used only three simple data augmentation techniques to improve the generalizability of our network: vertical shift, horizontal shift, and random horizontal flip. The highest test accuracy of our model was achieved after the prototypes had been pushed onto the nearest patches of training images in the latent space. We also performed an ablation study by training a standard convolutional network that has a similar architecture to our model; in particular, we replaced the prototype layer with a convolutional layer. With the same data augmentation techniques, the highest test accuracy achieved by this standard convolutional network in our experiment was comparable to that of our model.
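A hedged sketch of such an augmentation pipeline using torchvision is shown below; the shift fraction is a placeholder, since the exact value used in our experiments is not reproduced here.

```python
import torchvision.transforms as T

shift_frac = 0.1  # placeholder fraction of image height/width, not the value used in the experiments
train_transform = T.Compose([
    T.RandomAffine(degrees=0, translate=(shift_frac, shift_frac)),  # random horizontal/vertical shift
    T.RandomHorizontalFlip(),                                       # random left-right flip
    T.ToTensor(),
])
```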

Overall, our network achieved accuracy comparable to that of the non-interpretable analog of our method; we did not lose accuracy to gain interpretability.¹

¹Note that there is some difference between the best test accuracy we achieved on this dataset with any deep network we tried and the best previously reported accuracy for this dataset, reported for the ALL-CNN-C model by Springenberg et al. [2014]; however, no source code is available, and we were not able to replicate that level of accuracy.

3.2 Visualization and analysis

Figure 3: The learned prototypes.
Figure 4: Classifying a new test image (blue car on the left): The class scores for every class (e.g. automobile, truck, deer) are calculated from the similarity to each prototype. The final prediction is the argmax of the class scores; for this example it is “automobile.”

Figure 3 shows the prototypes visualized in the pixel space. Since the learned prototypes are precisely the latent representations of some patches from training images, we can visualize the prototypes by mapping them back to the receptive fields of the corresponding training images.

To understand how much each prototype contributes to class predictions, we analyze the weight connections between the prototype layer and the output logits. Table 1 shows part of the weight matrix from our trained model on the CIFAR-10 dataset; the full weight matrix can be found in the supplementary material. The leftmost column in the table displays six of the prototypes learned by our model, and the remaining columns show the contributions of each prototype unit to the prediction scores of the ten classes. A positive entry in the weight matrix means that a high similarity to the corresponding prototype contributes positively to the prediction score of the corresponding class: given an image, if it has a patch whose latent representation is very close to the prototype, the corresponding prototype unit will be highly activated, and when the output of this prototype unit is multiplied by a positive value in the weight connection, the result is a positive contribution to the class score. Conversely, a negative entry in the weight matrix means that a high similarity to the corresponding prototype makes a negative contribution to the prediction score of the corresponding class. Thus, a dog-face prototype should contribute positively to the prediction score of the dog class but negatively to the prediction score of the bird class, for example, because the presence of a dog face should reduce the possibility that the image is that of a bird. This is exactly the case in our weight matrix: prototype 2 in Table 1 shows the face of a dog, and it contributes positively to the prediction score of the dog class but negatively to the prediction score of the bird class. An interesting observation is that this dog-face prototype also contributes somewhat positively to the prediction score of the cat class; this should not be too surprising, because dogs look more similar to cats than to birds. In our weight matrix, there is a significant number of entries with values equal or very close to 0, which means that the corresponding prototype has no contribution to the prediction score of the corresponding class. This shows the effectiveness of the $L^1$ regularization in achieving sparsity of the weight matrix.

We now look at how our model reaches a classification decision on the test image of an automobile shown on the left of Figure 4. Given this image, our model computes its similarity score toward each prototype. The most similar prototypes are prototype 0 in Table 1 (the car prototype), prototype 4 in Table 1 (the front of a car), and prototype 3 in Table 1 (the truck prototype). For each prototype, the similarity score is then multiplied by the row of the weight matrix associated with that prototype to obtain the class score contributions from that prototype. Prototype 0 in Table 1 has a weight of 3.302 for the automobile class and 0 for all other classes, so its contribution goes entirely to the automobile class score, as shown in Figure 4. The interpretability of our model comes from both the distance-based similarity scores toward meaningful prototypes and an understanding of how these similarity scores contribute to the final class prediction. This model resembles a classification scoring system used by humans, where each score is produced by a sum of similarities to semantic concepts, weighted by the importance of those concepts in classification. In contrast, a standard convolutional neural network can only produce class scores that have no explicit explanation of where they come from.
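As a worked illustration of this scoring step, the snippet below multiplies a vector of similarity scores by the six prototype rows of Table 1; the similarity values themselves are made up for illustration, since the exact scores from our experiment are not reproduced here.

```python
import numpy as np

# rows: prototypes 0-5 from Table 1; columns: airplane, automobile, bird, cat,
# deer, dog, frog, horse, ship, truck
W = np.array([
    [0, 3.302,  0,      0,      0,      0,     0,     0,     0,  0     ],
    [0, 0,      0,      2.598, -0.012,  0.243, 0,     0,     0, -0.260 ],
    [0, 0,     -0.383,  0.920,  0,      1.356, 0.040, 0,     0,  0     ],
    [0, 1.078,  0,      0,      0,      0,     0,     0,     0,  1.491 ],
    [0, 1.212,  0,      0,     -0.321,  0,     0.886, 0,     0,  1.027 ],
    [0, 0,      0,      0,      0,      0.001, 0,     2.562, 0,  0     ],
])

similarity = np.array([5.0, 0.5, 0.3, 3.0, 4.0, 0.2])  # hypothetical prototype activations
class_scores = similarity @ W                           # one score per class
classes = ["airplane", "automobile", "bird", "cat", "deer",
           "dog", "frog", "horse", "ship", "truck"]
print(classes[int(np.argmax(class_scores))])            # prints "automobile" for these values
```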

Prototype   airplane   automobile   bird   cat   deer   dog   frog   horse   ship   truck
0       0 3.302       0       0       0       0       0       0       0       0
1       0       0       0 2.598 -0.012 0.243       0       0       0 -0.260
2       0       0 -0.383 0.920       0 1.356 0.040       0       0       0
3       0 1.078       0       0       0       0       0       0       0 1.491
4       0 1.212       0       0 -0.321       0 0.886       0       0 1.027
5       0       0       0       0       0 0.001       0 2.562       0       0
Table 1: A subset of the weight matrix showing how similarity to each prototype contributes to class score. Values are rounded to the nearest thousandth. The full weight matrix can be found in the supplementary material.
Figure 5: First column: 5 prototypes learned from the CIFAR-10 dataset. Second to sixth columns: the 5 closest ($L^2$ distance in the latent space) patches from the training set for each prototype (excluding the patch itself). Seventh to eleventh columns: the 5 closest patches from the test set for each prototype. The closest patches for every prototype can be found in the supplementary material.

To understand the cluster structure of the latent space, we find the closest training and test image patches to each prototype in the latent space. Figure 5 shows the closest training and test patches for 5 of the learned prototypes; the closest patches for all of the prototypes can be found in the supplementary material. As shown in Figure 5, each prototype is surrounded in the latent space by image patches of the same semantic concept. For example, the (side-facing) horse prototype shown in Figure 5 is surrounded mostly by training and test image patches of side-facing horses in the latent space. We also observe that the nearest patches for each prototype come from distinct object instances with somewhat different viewing angles and colors. This shows that our network is able to learn highly invariant representations that capture the high-level semantic concepts used for clustering in the latent space.
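The nearest-patch analysis behind Figure 5 can be sketched as follows, reusing the hypothetical `conv_net` and `extract_patches` helpers from the projection-step sketch; `prototype` would be one of the learned prototypes, e.g. `prototype_layer.prototypes[j]`.

```python
import torch

@torch.no_grad()
def k_nearest_patches(prototype, images, conv_net, extract_patches, k=5):
    # prototype: (D, H1, W1); images: an iterable of training or test images
    dists, patches = [], []
    for x in images:
        z = conv_net(x.unsqueeze(0))[0]                    # latent representation, (D, H, W)
        for patch in extract_patches(z):                   # each patch has shape (D, H1, W1)
            dists.append(((patch - prototype) ** 2).sum()) # squared L2 distance in latent space
            patches.append(patch)
    top = torch.topk(-torch.stack(dists), k).indices       # indices of the k smallest distances
    return [patches[i] for i in top.tolist()]
```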

4 Experiment 2: Bird Identification

CUB-200-2011 is a dataset of color images of 200 bird species [Wah et al., 2011]. In our second experiment, we used the same network architecture as we did for CIFAR-10, and trained our network on 10 bird species from CUB-200-2011: parakeet auklet, indigo bunting, cardinal, gray catbird, Florida jay, song sparrow, barn sparrow, cedar waxwing, downy woodpecker, and common yellowthroat. Despite the fact that this subset has a small number of training images per class and we trained our network from scratch, we were able to achieve good test accuracy. We use this experiment to demonstrate the potential of our network in learning prototypical representations of parts of birds that are important for distinguishing different species, and in comparing these prototypes with the relevant parts of an unseen test image.

In Figure 6, we show how our trained network classifies a test image of a barn sparrow. The top of Figure 6(b) displays the learned prototypes from the barn sparrow class, and the bottom of the same figure displays the corresponding heat maps that highlight where in the test image each prototype is activated. Figure 6(c) shows the patches of the original image that produced the highest activations (i.e., had the smallest $L^2$ distances in the latent space) for the learned prototypes. For example, the first prototype shows the crown of the head of a bird in a training image, and the crown of the head of the test bird is highlighted in the first heat map. The last prototype centers around the eye and the throat of the bird, and the corresponding heat map has the highest activation at the throat and eye of the test bird.
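One way such a heat map can be produced is sketched below: evaluate a single prototype’s similarity map on the test image’s convolutional output and upsample it to the input resolution. The `conv_net` and `prototype_layer` modules are the hypothetical ones from the earlier sketches, and `x` is assumed to be a preprocessed image tensor of shape (1, 3, height, width).

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def prototype_heatmap(x, conv_net, prototype_layer, proto_idx):
    z = conv_net(x)                                        # convolutional output, (1, D, H, W)
    p = prototype_layer.prototypes[proto_idx:proto_idx + 1]
    ones = torch.ones_like(p)
    dist = F.relu(F.conv2d(z ** 2, ones) - 2 * F.conv2d(z, p)
                  + (p ** 2).sum())                        # squared L2 to each patch, (1, 1, H', W')
    sim = -torch.log(dist + prototype_layer.eps)           # similarity map for this prototype
    return F.interpolate(sim, size=x.shape[-2:], mode="bilinear",
                         align_corners=False)[0, 0]        # upsampled to image resolution
```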

Even though we did not use any bounding boxes in training our network, nor did we constrain where the network could select the prototypes, it is striking that our network is able to learn meaningful prototypical representations of relevant parts of birds, and that it can perform comparisons based on these relevant parts.

Figure 6: (a) A test image of a barn sparrow. (b) Top: four learned prototypes from our network. Bottom: for each prototype, the corresponding heat map shows where in the test image the prototype is activated; yellow shows high activation, black shows low activation. (c) The most activated patches of the test image for each of the four learned prototypes in (b).

5 Conclusion

There are challenges in designing image classifiers that provide explanations faithful to what the network computes and similar to those a human might provide. The supplementary material illustrates how humans typically analyze images to help each other with challenging classification tasks; the explanations produced by our network agree with this reasoning style. These networks could be more useful than previous approaches in high-stakes applications, in troubleshooting by human-machine teams on challenging image classification tasks, and in training humans to identify objects in images. The accuracy provided by our network is comparable with that of analogous but standard (uninterpretable) deep networks; nothing was sacrificed to gain interpretability.

References

Supplementary material

Appendix A Diagrams created by people to explain classifications to other people

Classification of images is often done by hand using prototypical parts. These prototypical parts are labeled with arrows, and descriptions or figures are provided for comparison with prototypical cases. For instance, the book Gray’s Anatomy [Gray et al., 1995] has diagrams showing which parts of the image one should consider to diagnose a specific disease; in the case of Figure 7 (left), it is a Chiari I malformation. Another example, in Figure 7 (right), is from the field of architecture, illustrating prototypical aspects of a Victorian house.

Figure 7: Left: Figure reproduced from Gray et al. [1995], illustrating prototypical parts useful for demonstrating how a human would classify a Chiari I malformation. The descriptions of A-F are in the book. Right: An image in which prototypical parts of a Victorian house are labeled for purposes of genre identification. Image reproduced from the Newburyport Preservation Trust.

The types of images produced by the method in this paper are also similar to those produced by ornithologists. Figure 8 illustrates an image of a bird with prototypes, as well as a labeled image produced by ornithologists.

Figure 8: Left: Image of a Florida jay and the learned prototypical parts of a Florida jay used to classify the bird’s species. The smaller images on the right are the prototypical parts of a Florida jay learned by our algorithm. They correspond to parts of a test image (left) as shown. Right: Human-labeled image of a sparrow, reproduced from Mayntz [2016].

Appendix B More detailed results

Table 2 shows the full weight matrix from our algorithm on the CIFAR-10 dataset. Each entry shows how similarity to a prototype contributes to the class score. If the entry is positive, looking like the prototype increases the class score for that class; if the entry is negative, looking like that prototype decreases the class score for that class.

Figures 9 and 10 show the prototypes learned from the CIFAR-10 dataset in the left-hand column. For each prototype, the five closest (smallest $L^2$ distance in the latent space) patches from the training set are shown in the middle, and the five closest patches from the test set are shown on the right.

Prototype   airplane   automobile   bird   cat   deer   dog   frog   horse   ship   truck
0       0       0       0 -0.092       0 3.381       0       0       0       0
1       0 -0.036       0 1.282       0       0       0       0       0 2.480
2 1.092       0       0       0 0.188       0       0       0 -0.003       0
3       0       0 0.990       0 0.380       0       0       0       0       0
4       0 3.302       0       0       0       0       0       0       0       0
5       0       0       0       0       0       0       0       0 1.000       0
6       0       0 2.846       0       0       0       0       0       0       0
7       0       0       0 2.598 -0.012 0.243       0       0       0 -0.260
8       0       0       0       0       0 0.183       0 1.450       0       0
9       0       0       0       0       0       0 2.034       0       0       0
10 1.125       0       0       0       0       0       0       0       0       0
11       0       0       0       0       0       0 1.058       0       0       0
12       0       0 -0.383 0.920       0 1.356 0.040       0       0       0
13 1.541       0       0       0       0       0       0       0       0       0
14       0       0       0       0 2.030       0       0       0       0       0
15 -0.085       0       0       0 1.161 0.359       0 0.150       0       0
16       0       0       0       0 -0.001       0       0       0 0.716       0
17       0 1.078       0       0       0       0       0       0       0 1.491
18       0       0       0       0       0       0 0.764       0       0       0
19       0       0       0       0       0 0.001 -0.013 0.380       0       0
20       0       0       0       0 1.800       0       0       0       0       0
21       0 1.212       0       0 -0.321       0 0.886       0       0 1.027
22 0.399 -0.105 0.015 -0.137       0 -0.205 -0.027       0 0.555 0.147
23       0       0       0       0 -0.055       0       0       0 0.532       0
24       0       0       0       0 0.433 0.004       0 0.707       0       0
25       0       0       0 0.162       0       0 0.933       0       0       0
26       0       0       0       0       0       0       0       0 2.015       0
27       0       0 1.397       0 -0.233 0.126       0 0.081       0       0
28 1.032       0       0       0       0       0       0       0       0       0
29       0       0       0       0       0 0.001       0 2.562       0       0
Table 2: The full weight matrix showing how similarity to each prototype contributes to class score. Values are rounded to the nearest thousandth.
Figure 9: First column: the first 15 prototypes learned from the CIFAR-10 dataset. Second to sixth columns: the patches from the training set that are closest to each prototype (excluding the prototype itself), in order of distance, with the leftmost image closest to the prototype (smallest $L^2$ distance in the latent space). Seventh to eleventh columns: the patches from the test set that are closest to each prototype, in order of distance, with the leftmost image closest to the prototype.
Figure 10: First column: the second 15 prototypes learned from the CIFAR-10 dataset. Second to sixth columns: the patches from the training set that are closest to each prototype (excluding the prototype itself), in order of distance, with the leftmost image closest to the prototype (smallest $L^2$ distance in the latent space). Seventh to eleventh columns: the patches from the test set that are closest to each prototype, in order of distance, with the leftmost image closest to the prototype.