Now You See Me (CME): Concept-based Model Extraction
Deep Neural Networks (DNNs) have achieved remarkable performance on a range of tasks. A key step to further empowering DNN-based approaches is improving their explainability. In this work we present CME: a concept-based model extraction framework, used for analysing DNN models via concept-based extracted models. Using two case studies (dSprites, and Caltech UCSD Birds), we demonstrate how CME can be used to (i) analyse the concept information learned by a DNN model (ii) analyse how a DNN uses this concept information when predicting output labels (iii) identify key concept information that can further improve DNN predictive performance (for one of the case studies, we showed how model accuracy can be improved by over , using only of the available concepts).
Copyright 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
Title of the Proceedings: “Proceedings of the CIKM 2020 Workshops” Editors of the Proceedings: Stefan Conrad, Ilaria Tiddi
Dmitry Kazhdan
Botty Dimanov
Keywords: interpretability, concept extraction, concept-based explanations, model extraction, latent space analysis, XAI
The black-box nature of Deep Neural Networks (DNNs) hinders their widespread adoption, especially in heavily regulated industries with a high cost of error. As a result, there has recently been a dramatic increase in research on Explainable AI (XAI), focusing on improving explainability of Deep Learning (DL) systems [4, 1].
Currently, the most widely used XAI methods are feature importance methods (also referred to as saliency methods). For a given data point, these methods provide scores showing the importance of each feature (e.g., pixel, patch, or word vector) to the algorithm’s decision. Unfortunately, feature importance methods have been shown to be fragile to input perturbations [18, 26] or model parameter perturbations [2, 8]. Human experiments also demonstrate that feature importance explanations do not necessarily increase human understanding, trust, or ability to correct mistakes in a model [28, 16].
As a consequence, two other types of XAI approaches are receiving increasing attention: model extraction approaches, and concept-based explanation approaches. Model extraction methods (also referred to as model translation methods) approximate black-box models with simpler models to increase model explainability. Concept-based explanation approaches provide model explanations in terms of human-understandable units, rather than individual features, pixels, or characters (e.g., the concepts of a wheel and a door are important for the detection of cars) [16, 35, 10].
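In this paper we introduce CME, a concept-based model extraction framework for analysing DNN models via concept-based extracted models.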
In particular, we make the following contributions:
We present the novel CME framework, capable of analysing DNN models via concept-based extracted models
We demonstrate, using two case-studies, how CME can analyse (both quantitatively and qualitatively) the concept information a DNN model has learned, and how this information is represented across the DNN layers
We propose a novel metric for evaluating the quality of concept extraction methods
We demonstrate, using two case-studies, how CME can analyse (both quantitatively and qualitatively) how a DNN uses concept information when predicting output labels
We demonstrate how CME can identify key concept information that can further improve DNN predictive performance
2 Related Work
2.1 Concept-based Explanations
Concept-based explanations have been used in a wide range of different ways, including: inspecting what a model has learned [10, 33], providing class-specific explanations [17, 16], and discovering causal relations between concepts. Similarly to CME, these approaches typically seek to explain model behaviour in terms of high-level concepts, extracting this concept information from a model’s latent space.
Importantly, existing concept-based explanation approaches are typically capable of handling binary-valued concepts only, which implies that multi-valued concepts have to be binarised first. For instance, given a concept such as “shape”, with possible values ‘square’ and ‘circle’, these approaches have to convert “shape” into two binary concepts ‘is_square’, and ‘is_circle’. This makes such approaches (i) computationally expensive, since the binarised concept space usually has a high cardinality, (ii) error-prone, since mutual exclusivity of concept values is now not enforced (e.g., a single data point can now have both ‘is_square’ and ‘is_circle’ concepts being true). In contrast, our approach is capable of handling multi-valued concepts directly, without binarisation.
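The mutual-exclusivity problem described above can be illustrated with a small sketch. The scores, the threshold, and the function names below are hypothetical and chosen only for illustration; they do not correspond to any particular concept extraction method.

```python
# Hypothetical per-value scores for the multi-valued concept "shape".
SHAPE_VALUES = ["square", "circle"]


def binarised_prediction(scores, threshold=0.5):
    """Treat every concept value as an independent binary concept."""
    return {f"is_{v}": s >= threshold for v, s in zip(SHAPE_VALUES, scores)}


def multi_valued_prediction(scores):
    """Treat 'shape' as a single concept and return exactly one value."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return SHAPE_VALUES[best]


scores = [0.8, 0.7]  # assumed scores: both independent detectors fire
binary = binarised_prediction(scores)
single = multi_valued_prediction(scores)
print(binary)  # both 'is_square' and 'is_circle' are flagged as true
print(single)  # exactly one shape value is returned
```

With independent binary detectors nothing prevents both flags from being true for one sample, whereas the multi-valued formulation returns a single value by construction.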
Furthermore, concept-based explanation approaches typically rely on the latent space of a single layer when extracting concept information. DNNs have been shown to perform hierarchical feature extraction, with layers closer to the output utilising higher-level data representations, compared to layers closer to the input [13, 34]. This implies that choosing a single layer imposes an unnecessary trade-off between low- and high-level concepts. On the other hand, CME is capable of efficiently combining latent space information from multiple layers, thereby avoiding this constraint.
Finally, existing methods typically represent concept explanations as a list of concepts, with their relative importance with respect to the classification task. In contrast, our approach describes the functional relationship between concepts and outputs, thereby showing in more detail how the model utilises concept information when making predictions.
2.2 Concept Bottleneck Models
Recent work on concept-based explanations relies on models that use an intermediate concept-based representation when making predictions [19, 14]. Prior work refers to these types of models as concept bottleneck models (CBMs). A concept bottleneck model is a model which, given an input, first predicts an intermediate set of human-specified concepts, and then uses only this concept information to predict the output task label. Prior work also proposes a method for turning any DNN into a concept bottleneck model given concept annotations at training time. This is achieved by resizing one of the layers to match the number of concepts provided, and re-training with an added intermediate loss that encourages the neurons in that layer to align component-wise to the provided concepts.
Crucially, CBM approaches provide ways for generating DNN models, which are explicitly encouraged to rely on specified concept information. In contrast, our approach is used for analysing DNN models (and is much cheaper computationally).
Furthermore, CBM approaches require concept annotations to be available at training time for all of the training data, which is often expensive to produce. In contrast, CME can be used with partially-labelled datasets in a semi-supervised fashion, as will be described in Section 3.
Finally, CBM approaches require the concepts themselves to be known beforehand. On the other hand, CME can efficiently utilise knowledge contained in pre-trained DNNs, in order to learn about which concepts are/aren’t required for a given task. Further details on CME/CBM comparison can be found in Appendix A.
2.3 Model Extraction
Model extraction techniques use rules [3, 37, 6], decision trees [20, 29], or other more readily explainable models to approximate complex models, in order to study their behaviour. Provided the approximation quality (referred to as fidelity) is high enough, an extracted model can preserve many statistical properties of the original model, while remaining open to interpretation.
However, extracted models generated by existing methods represent their decision-making using the same input representation as the original model, which is typically difficult for the user to understand directly. Instead, our extracted models represent decision-making via human-understandable concepts, making them easier to interpret.
In this section we present our CME approach, describing how it can be used to analyse DNN models using concept-based extracted models.
We consider a pre-trained DNN classifier $f: X \rightarrow Y$ ($X \subset \mathbb{R}^n$, $Y \subset \mathbb{R}^o$), where $f(x) = y$ maps an input $x \in X$ to an output class $y \in Y$. For every DNN layer $l$, we denote by $f^l: X \rightarrow H^l$ ($H^l \subset \mathbb{R}^m$) the mapping from the input space $X$ to the hidden representation space $H^l$, where $m$ denotes the number of hidden units, and can be different for each layer.
Similarly to [19, 14], we assume the existence of a concept representation $C \subset \mathbb{R}^k$, defining $k$ distinct concepts associated with the input data. $C$ is defined such that every basis vector in $C$ spans the space of possible values for one particular concept. We further assume the existence of a function $p^*: X \rightarrow C$, where $p^*(x) = c$ maps an input $x$ to its concept representation $c$. Thus, $p^*$ defines the concepts and their values (referred to as the ground truth concepts) for every input point.
In this work, we define a DNN $f$ as being concept-decomposable if it can be well-approximated by a composition of functions $p$ and $q$, such that $f(x) \approx q(p(x))$. In this definition, the function $p: X \rightarrow C$ is an input-to-concept function, mapping data-points from their input representation $x \in X$ to their concept representation $c \in C$. The function $q: C \rightarrow Y$ is a concept-to-output function, mapping data-points in their concept representation $C$ to the output space $Y$. Thus, when processing an input $x$, a DNN $f$ can be seen as converting this input into an interpretable concept representation using $p$, and using $q$ to predict the output from this representation. The significance of this decomposition is further discussed in Appendix A.
CME explores whether a given DNN $f$ is concept-decomposable, by attempting to approximate $f$ with an extracted model $\hat{f}: X \rightarrow Y$. In this case, $\hat{f}$ is defined as $\hat{f}(x) = \hat{q}(\hat{p}(x))$, using an input-to-concept function $\hat{p}: X \rightarrow C$ and a concept-to-output function $\hat{q}: C \rightarrow Y$ extracted by CME from the original DNN. We describe our approach to extracting $\hat{p}$ and $\hat{q}$ in the remainder of this section.
3.3 Input-to-Concept ($\hat{p}$)
When extracting the input-to-concept function $\hat{p}$ from a pre-trained DNN, we assume we have access to the DNN training data and labels $\{(x^{(j)}, y^{(j)})\}$. Furthermore, we assume partial access to the ground truth concept function $p^*$, such that a small set of $n_l$ training points $\{x^{(1)}, \dots, x^{(n_l)}\}$ have concept labels $\{c^{(1)}, \dots, c^{(n_l)}\}$ associated with them, while the remaining $n_u$ points $\{x^{(n_l+1)}, \dots, x^{(n_l+n_u)}\}$ do not (in this case $n_l \ll n_u$). We refer to these subsets respectively as the concept labelled dataset and concept unlabelled dataset. Using these datasets, we generate $\hat{p}$ by aggregating concept label predictions across multiple layers of the given DNN model, as described below.
Given a DNN layer $l$ with $m$ hidden units, we compute the layer’s representation of the input data $h^{(j)} = f^l(x^{(j)})$, obtaining $\{h^{(1)}, \dots, h^{(n_l+n_u)}\}$. Using this data and the concept labels, we construct a semi-supervised dataset, consisting of labelled data $\{(h^{(1)}, c^{(1)}), \dots, (h^{(n_l)}, c^{(n_l)})\}$, and unlabelled data $\{h^{(n_l+1)}, \dots, h^{(n_l+n_u)}\}$.
Next, we rely on Semi-Supervised Multi-Task Learning (SSMTL), in order to extract a function $g^l: H^l \rightarrow C$, which predicts concept labels from layer $l$’s hidden space. In this work, we treat each concept as a separate, independent task. Hence, $g^l$ is decomposed into $k$ separate tasks (one per concept), and is defined as $g^l(h) = (g^l_1(h), \dots, g^l_k(h))$, where each $g^l_i(h)$ ($i \in \{1, \dots, k\}$) predicts the value of concept $i$ from $h$.
Repeating this process for all model layers $l \in \{1, \dots, L\}$, we obtain a set of functions $G = \{g^l \mid l \in \{1, \dots, L\}\}$. For every concept $i$, we define the “best” layer $l^i$ for predicting that concept as shown in (1):
$$l^i = \underset{l \in \{1, \dots, L\}}{\arg\min}\ \ell(g^l_i, i) \qquad (1)$$
Here, $\ell$ is a loss function (in this case the error rate), computing the predictive loss of function $g^l_i$ with respect to a concept $i$. Finally, we define $\hat{p}$ as shown in (2):
$$\hat{p}(x) = \left(g^{l^1}_1(f^{l^1}(x)), \dots, g^{l^k}_k(f^{l^k}(x))\right) \qquad (2)$$
Thus, given an input $x$, the value computed by $\hat{p}$ for every concept $i$ is equal to the value computed by $g^{l^i}_i$ from that input’s representation in layer $l^i$. Overall, $\hat{p}$ encapsulates concept information contained in a given DNN model, and can be used to analyse how this information is represented, as well as to predict concept values for new inputs.
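The per-concept layer selection described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the names `error_rate`, `select_best_layers`, and `make_input_to_concept` are hypothetical, the per-layer predictors are hand-written stand-ins for the semi-supervised models, and the loss is computed on the concept-labelled points themselves (an assumption).

```python
def error_rate(predict, hidden_points, true_values):
    """Fraction of points whose concept value is predicted incorrectly."""
    wrong = sum(predict(h) != c for h, c in zip(hidden_points, true_values))
    return wrong / len(true_values)


def select_best_layers(layer_predictors, layer_hidden, concept_labels):
    """For each concept, pick the layer whose predictor has the lowest error.

    layer_predictors[l][i]: callable predicting concept i from a hidden vector of layer l
    layer_hidden[l]: hidden representations of the labelled points at layer l
    concept_labels[j][i]: true value of concept i for labelled point j
    """
    best = []
    for i in range(len(concept_labels[0])):
        truth = [c[i] for c in concept_labels]
        losses = {l: error_rate(preds[i], layer_hidden[l], truth)
                  for l, preds in layer_predictors.items()}
        best.append(min(losses, key=losses.get))
    return best


def make_input_to_concept(layer_fns, layer_predictors, best_layers):
    """Compose, per concept, the chosen layer's mapping with its predictor."""
    def p_hat(x):
        return tuple(layer_predictors[l][i](layer_fns[l](x))
                     for i, l in enumerate(best_layers))
    return p_hat


# Toy setup (all assumed): two "layers" and two binary concepts.
layer_fns = {"conv": lambda x: x, "dense": lambda x: (x[0],)}
layer_predictors = {
    "conv": [lambda h: 0, lambda h: int(h[1] > 0)],
    "dense": [lambda h: int(h[0] > 0), lambda h: 0],
}
xs = [(1, 1), (1, -1), (-1, 1), (-1, -1)]
concepts = [(int(a > 0), int(b > 0)) for a, b in xs]
hidden = {l: [fn(x) for x in xs] for l, fn in layer_fns.items()}

best_layers = select_best_layers(layer_predictors, hidden, concepts)
p_hat = make_input_to_concept(layer_fns, layer_predictors, best_layers)
print(best_layers, p_hat((2, -3)))
```

In this toy setup each concept is predicted from a different layer, which is the situation that selecting a layer per concept is meant to handle.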
3.4 Concept-to-Label ($\hat{q}$)
We set up extraction of the concept-to-label function $\hat{q}$ as a classification problem, in which we train $\hat{q}$ to predict output labels $y$ from concept labels predicted by $\hat{p}$. We use $\hat{p}$ to generate concept labels for all $N$ training data points, obtaining a set of concept labels $\{\hat{c}^{(1)}, \dots, \hat{c}^{(N)}\}$. Next, we produce a labelled dataset, consisting of concept labels and corresponding DNN output labels $\{(\hat{c}^{(1)}, y^{(1)}), \dots, (\hat{c}^{(N)}, y^{(N)})\}$, and use it to train $\hat{q}$ in a supervised manner. We experimented with using Decision Trees (DTs), and Logistic Regression (LR) models for representing $\hat{q}$, as will be discussed in Section 5. Overall, $\hat{q}$ can be used to analyse how a DNN uses concept information when making predictions.
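The concept-to-label extraction and the final composition can be sketched as follows. To keep the example dependency-free, the classifier here is a majority-vote lookup table, which is a deliberate stand-in for the decision tree and logistic regression models mentioned above; all function names and the toy DNN are hypothetical.

```python
from collections import Counter, defaultdict


def fit_concept_to_label(concept_vectors, dnn_labels):
    """Fit a lookup-table classifier from predicted concepts to DNN output labels."""
    votes = defaultdict(Counter)
    for c, y in zip(concept_vectors, dnn_labels):
        votes[tuple(c)][y] += 1
    table = {c: counts.most_common(1)[0][0] for c, counts in votes.items()}
    # Unseen concept combinations fall back to the most frequent DNN label.
    fallback = Counter(dnn_labels).most_common(1)[0][0]

    def q_hat(c):
        return table.get(tuple(c), fallback)
    return q_hat


def compose(p_hat, q_hat):
    """Extracted model: predict concepts first, then the label from concepts."""
    return lambda x: q_hat(p_hat(x))


# Toy stand-ins (assumed): a "DNN" and an already-extracted input-to-concept function.
dnn = lambda x: "A" if x[0] > 0 else "B"
p_hat = lambda x: (int(x[0] > 0), int(x[1] > 0))

train_xs = [(1, 1), (1, -1), (-1, 1), (-1, -1)]
q_hat = fit_concept_to_label([p_hat(x) for x in train_xs], [dnn(x) for x in train_xs])
f_hat = compose(p_hat, q_hat)
print(f_hat((3, -1)), f_hat((-2, 5)))
```

Note that the training targets are the DNN's own outputs rather than the ground truth task labels, since the goal is to approximate the DNN.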
4 Experimental Setup
dSprites is a well-established dataset used for evaluating unsupervised latent factor disentanglement approaches. dSprites consists of 2D 64 × 64 pixel black-and-white shape images, procedurally generated from all possible combinations of 6 ground truth independent concepts (color, shape, scale, rotation, x and y position). Further details can be found in Appendix B, and the official dSprites repository.
We define two classification tasks, used to evaluate our framework:
Task 1: This task consists of determining the shape concept value from an input image. For every image sample, we define its task label as the shape concept label of that sample.
Task 2: This task consists of discriminating between all possible shape and scale concept value combinations. We assign a distinct identifier to each possible combination of the shape and scale concept labels. For every image sample, we define its task label as the identifier corresponding to this sample’s shape and scale concept values.
Overall, Task 1 explores a scenario in which a DNN has to learn to recognise a specific concept from an input image. Task 2 explores a relatively more complex scenario, in which a DNN has to learn to recognise combinations of concepts from an input image.
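The two task label definitions can be sketched as below. The sketch assumes integer-encoded concept labels and the standard dSprites cardinalities of 3 shape values and 6 scale values; the particular identifier encoding is one possible choice, not necessarily the one used in the experiments.

```python
from itertools import product

N_SHAPES = 3  # assumed: square, ellipse, heart
N_SCALES = 6  # assumed: six scale values


def task1_label(shape, scale):
    """Task 1: the label is the shape concept value itself."""
    return shape


def task2_label(shape, scale):
    """Task 2: one distinct identifier per (shape, scale) combination."""
    return shape * N_SCALES + scale


task2_classes = {task2_label(s, c)
                 for s, c in product(range(N_SHAPES), range(N_SCALES))}
print(len(task2_classes))  # number of distinct Task 2 classes
```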
We trained a Convolutional Neural Network (CNN) model for each task. Both models had the same architecture, consisting of 3 convolutional layers, 2 dense layers with ReLUs, 50% dropout and a softmax output layer. The models were trained using categorical cross-entropy loss, and achieved 100 ± 0% classification accuracies on their respective held-out test sets. We refer to these models as the Task 1 model and the Task 2 model in the rest of this work.
Ground-truth Concept Information
Importantly, the task and dataset definitions described in this section imply that we know precisely which concepts the models had to learn, in order to achieve these task performances (shape for Task 1, and shape and scale for Task 2). We refer to this as the ground truth concept information learned by these models.
4.2 Caltech-UCSD Birds (CUB)
For our second dataset, we used Caltech-UCSD Birds 200 2011 (CUB). This dataset consists of 11,788 images of 200 bird species with every image annotated using 312 binary concept labels (e.g. beak and wing colour, shape, and pattern). We relied on previously defined concept pre-processing steps (used for de-noising concept annotations, and filtering out outlier concepts), which produce a refined set of binary concept labels for every image sample.
We relied on the standard CUB classification task, which consists of predicting the bird species from an input image.
We used the Inception-v3 architecture, pretrained on ImageNet (except for the fully-connected layers) and fine-tuned end-to-end on the CUB dataset, following previously described preprocessing practices. The model achieved 82.7 ± 0.4% classification accuracy on a held-out test set. We refer to this model as the CUB model in the rest of this work.
Ground-truth Concept Information
Unlike dSprites, the CUB dataset does not explicitly define how the available concepts relate to the output task. Thus, we do not have access to the ground truth concept information learned by the CUB model.
We compare performance of our CME approach to two other benchmarks, described in the remainder of this section.
We rely on Net2Vec for defining benchmark input-to-concept ($\hat{p}$) functions for the three tasks. Net2Vec attempts to predict presence/absence of concepts from spatially-averaged hidden layer activations of convolutional layers of a CNN model. Given a binary concept, this approach trains a logistic regressor, predicting the presence/absence of this concept in an input image from the latent representation of a given CNN layer. In case of multi-valued concepts, the concept space has to be binarised, as discussed in Section 2.1. In this case, the binarised concept value with the highest likelihood is returned.
Unlike CME, Net2Vec does not provide a way of selecting the convolutional layer to use for concept extraction. We consider the best-case scenario by selecting, for all tasks, the convolutional layers yielding the best concept extraction performance. For all tasks, these layers were convolutional layers closest to the output (the 3rd conv. layer in case of dSprites tasks, and the final inception block output layer in case of the CUB task).
As discussed in Section 4.2.3, we do not have access to ground truth concept information learned by the CUB model. Instead, we rely on the pre-trained sequential bottleneck model from the CBM work discussed in Section 2.2 (referred to as CBM in the rest of this work). CBM is a bottleneck model, obtained by resizing one of the layers of the CUB model to match the number of concepts provided (we refer to this as the bottleneck layer), and training the model in two steps. First, the submodel consisting of the layers between the input layer and the bottleneck layer (inclusive) is trained to predict concept values from input data. Next, the submodel consisting of the layers between the layer following the bottleneck layer and the output layer is trained to predict task labels from the concept values predicted by the first submodel. Hence, this bottleneck model is guaranteed to solely rely on concept information that is learnable from the data, when making task label predictions. Thus, this benchmark serves as an upper bound for the concept information learnable from the dataset, and for the task performance achievable using this information. Importantly, CBM does not attempt to approximate/analyse the CUB model, but instead attempts to solve the same classification task using concept information only.
We use the first CBM submodel as a $\hat{p}$ benchmark, representing the upper bound of concept information learnable from the data. We use the second submodel as a $\hat{q}$ benchmark, representing the upper bound of task performance achievable from predicted concept information only. Finally, we use the entire model as an $\hat{f}$ benchmark. We make use of the saved trained model available in the official repository of that work.
We present the results obtained by evaluating our approach using the two case studies described above.
We obtain the concept labelled dataset by returning the ground-truth concept values for a random set of samples in the model training data. For dSprites, we found that a concept labelled dataset of a samples or more worked well in practice for both tasks. Thus, we fix the size of the concept labelled dataset to in all of the dSprites experiments. For CUB, we found that a concept labelled dataset containing or more samples per class worked well in practice. Thus, we fix the size of the concept labelled dataset to samples per class in all of the CUB experiments. In the future, we intend to explore the variation of model extraction performance with the size of the concept labelled dataset in more detail.
5.1 Concept Prediction Performance
First, we evaluate the quality of the input-to-concept ($\hat{p}$) functions produced by CME, Net2Vec, and CBM. For both dSprites tasks, we relied on the Label Spreading semi-supervised model, provided in scikit-learn, when learning the per-layer concept prediction functions ($g^l$) for CME. For CUB, we used logistic regression functions instead, as they gave better performance.
Figure 2 shows predictive performance of the $\hat{p}$ functions on all concepts for the two dSprites tasks (averaged over 5 runs). As discussed in Section 4.1.1, we have access to the ground truth concept information learned by these models (shape concept information for Task 1, and shape and scale concept information for Task 2). For both tasks, $\hat{p}$ functions extracted by CME successfully achieved high predictive accuracy on concepts relevant to the tasks, whilst achieving a low performance on concepts irrelevant to the tasks. Thus, CME was able to successfully extract the concept information contained in the task models. For both tasks, $\hat{p}$ functions extracted by Net2Vec achieved a much lower performance on the relevant concepts.
As discussed in Section 4.2.3, the CUB dataset does not explicitly define how the concepts relate to the output task labels. Thus, we do not know how relevant/important different concepts are, with respect to task label prediction. In this section, we make the conservative assumption that all concepts are relevant, when evaluating $\hat{p}$ functions, and explore relative concept importance in more detail in Section 5.3.
Firstly, we relied on the previously introduced ‘average-per-concept’ metrics when evaluating the $\hat{p}$ function performances, by computing their predictive scores for each concept, and then averaging over all concepts. We obtained scores of 92 ± 0.5%, 86.3 ± 2.0%, and 85.9 ± 2.3% for CBM, CME, and Net2Vec $\hat{p}$ functions, respectively (averaged over 5 runs).
Importantly, we argue that in case of a large number of concepts, it is crucial to measure how concept mispredictions are distributed across the test samples. For instance, consider a dSprites Task 2 $\hat{p}$ function that achieves predictive accuracy on both shape and scale concepts. The average predictive accuracy on relevant concepts achieved by this $\hat{p}$ will therefore be . However, if the two concepts are mispredicted for strictly different samples (i.e. none of the samples have both shape and scale predicted incorrectly at the same time), this means that of the test samples will have one relevant concept predicted incorrectly. Given that both concepts need to be predicted correctly when using them for task label prediction, this implies that consequent task label prediction will not be able to achieve over task label accuracy. This effect becomes even more pronounced in case of a larger number of relevant concepts.
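The argument above can be written out for a generic per-concept accuracy. The symbol $a$ below is an illustrative assumption, not a value taken from the experiments.

```latex
% Assumptions: two relevant concepts, each predicted with accuracy a (0.5 <= a <= 1),
% mispredictions fall on strictly different samples, and a correct task label
% requires both concepts to be predicted correctly.
\begin{align*}
\text{average concept accuracy} &= \tfrac{1}{2}(a + a) = a, \\
\text{fraction of samples with at least one misprediction} &= (1 - a) + (1 - a) = 2(1 - a), \\
\text{task label accuracy} &\le 1 - 2(1 - a) = 2a - 1 .
\end{align*}
```

For example, under these assumptions a per-concept accuracy of $a = 0.9$ gives an average concept accuracy of 0.9 but bounds task label accuracy at 0.8, and the gap widens as more relevant concepts are added.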
Consequently, we defined a novel cumulative misprediction error metric, which we refer to as the ‘mis-prediction-overlap’ (MPO) metric. Given a test set consisting of $n$ input samples with corresponding concept labels $\{(x^{(j)}, c^{(j)})\}_{j=1}^{n}$, and a prediction set $\{\hat{c}^{(j)}\}_{j=1}^{n}$, MPO computes the fraction of samples in the test set that have at least $m$ relevant concepts predicted incorrectly, as shown in Equation 3 (where $\mathbb{1}$ denotes the indicator function):
$$\mathrm{MPO}(m) = \frac{1}{n} \sum_{j=1}^{n} \mathbb{1}\left(\rho(c^{(j)}, \hat{c}^{(j)}) \geq m\right) \qquad (3)$$
Here, $\rho$ can be used to specify which concepts to measure the mis-prediction error on (i.e. in case some of the provided concepts are irrelevant). Under our assumption of all concepts being relevant, we defined $\rho$ as shown in Equation 4 (where $k$ denotes the number of concepts):
$$\rho(c, \hat{c}) = \sum_{i=1}^{k} \mathbb{1}(c_i \neq \hat{c}_i) \qquad (4)$$
Using a held-out test set, we plot the MPO metric values for a range of values of $m$, as shown in Figure 3 (averaged over 5 runs). Importantly, $\hat{p}$ function performances can be evaluated by observing their MPO scores for different values of $m$. A larger MPO score implies a bigger proportion of samples had at least $m$ relevant concepts predicted incorrectly.
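A minimal sketch of the MPO computation described above is given below. The function names, the optional `relevant` argument, and the toy labels are hypothetical; concept labels are assumed to be stored as equal-length tuples.

```python
def mispredicted_count(true_concepts, predicted_concepts, relevant=None):
    """Number of (relevant) concepts predicted incorrectly for one sample."""
    idx = range(len(true_concepts)) if relevant is None else relevant
    return sum(true_concepts[i] != predicted_concepts[i] for i in idx)


def mpo(true_set, predicted_set, m, relevant=None):
    """Fraction of samples with at least m relevant concepts mispredicted."""
    hits = sum(mispredicted_count(c, c_hat, relevant) >= m
               for c, c_hat in zip(true_set, predicted_set))
    return hits / len(true_set)


# Toy test set (assumed): four samples, three binary concepts each.
true_set = [(0, 1, 1), (1, 0, 0), (1, 1, 0), (0, 0, 1)]
pred_set = [(0, 1, 1), (1, 1, 0), (0, 0, 0), (1, 1, 0)]

curve = [mpo(true_set, pred_set, m) for m in range(4)]
print(curve)  # MPO for m = 0, 1, 2, 3
```

The resulting curve is non-increasing in the threshold, and its value at a threshold of 1 is the fraction of samples with any misprediction at all, which is the quantity that per-concept averages hide.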
Overall, CME performed almost identically to Net2Vec, and worse than CBM according to the MPO metric. Similar performance to Net2Vec is likely caused by (i) concepts being binary (requiring no binarisation), and (ii) the Inception-v3 model having a relatively large number of convolutional layers, implying that the final convolutional layer likely learned higher-level features, relevant to concept prediction.
Importantly, MPO showed that both CBM and CME $\hat{p}$ functions had a significant proportion of test samples with incorrectly-predicted relevant concepts (e.g. CME had an MPO score of 25% at $m = 4$, implying that 25% of all test samples have at least 4 concepts predicted incorrectly). In practice, these mispredictions can have a significant impact on consequent task label predictive performance, as will be further explored in the next section.
5.2 Task Performance
In this section, we evaluate the fidelity and performance of the extracted $\hat{f}$ models. For all CME and Net2Vec $\hat{p}$ functions evaluated in the previous section, we trained concept-to-output functions $\hat{q}$, predicting class labels from the $\hat{p}$ concept predictions. Next, for every $\hat{q}$, we defined its corresponding $\hat{f}$ as discussed in Section 3, via a composition of $\hat{q}$ and its associated $\hat{p}$. For every $\hat{f}$, we evaluated its fidelity and its task performance, using a held-out test set. Table 1 shows the fidelity of extracted models, and Table 2 shows the task performance for these models (averaged over 5 runs). The original Task 1, Task 2, and CUB models achieved task performances of 100 ± 0%, 100 ± 0%, and 82.7 ± 0.4%, respectively, as described in Section 4.
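The difference between the two quantities reported here can be sketched as follows, assuming the common definition of fidelity as the fraction of samples on which the extracted model agrees with the original model. The function names and toy predictions are hypothetical.

```python
def agreement(a, b):
    """Fraction of positions at which two label sequences agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def fidelity(extracted_preds, original_preds):
    """Agreement between the extracted model and the original DNN."""
    return agreement(extracted_preds, original_preds)


def task_accuracy(extracted_preds, true_labels):
    """Agreement between the extracted model and the ground truth labels."""
    return agreement(extracted_preds, true_labels)


# Toy predictions (assumed) on five test samples.
true_labels = ["a", "b", "c", "a", "b"]
original = ["a", "b", "c", "b", "b"]   # original DNN predictions
extracted = ["a", "b", "a", "b", "b"]  # extracted model predictions

fid = fidelity(extracted, original)
acc = task_accuracy(extracted, true_labels)
print(fid, acc)
```

Fidelity rewards reproducing the original model's mistakes as well as its correct predictions, so the two scores can differ whenever the original model is imperfect.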
For both dSprites tasks, CME models achieved high (99%+) fidelity and task performance scores, indicating that CME successfully approximated the original dSprites models. Furthermore, these scores were considerably higher than those produced by the Net2Vec models.
For the CUB task, both CME and Net2Vec models achieved relatively lower fidelity and task performance scores (in this case, performance of CME was very similar to that of Net2Vec). Crucially, the CBM model also achieved relatively low fidelity and accuracy scores (as anticipated from our MPO metric analysis). This implies that concept information learnable from the data is insufficient for achieving high task accuracy. Hence the relatively high CUB model accuracy has to be caused by the CUB model relying on other non-concept information. Thus, the low fidelity of CME and Net2Vec is a consequence of the CUB model being non-concept-decomposable, implying that its behaviour cannot be explained by the desired concepts. The next section discusses possible approaches to fixing this issue.
In the previous section, we demonstrated how CME can be used to identify whether a model relies on desired concepts during decision-making. In this section, we demonstrate how CME can be used to suggest model improvements, aligning model behaviour with the desired concepts.
We trained a logistic regression model predicting task labels from ground-truth concept labels for the CUB task, obtaining an accuracy score of 96.4 ± 0.5% on a held-out test set (averaged over 5 runs). Using this model’s coefficient magnitudes as a measure of concept importance, we discovered that the most important concepts identified this way were sufficient for achieving over 96% task accuracy using logistic regression.
Using this reduced concept set, we inspected how our CUB $\hat{f}$ function performances would change, if their corresponding $\hat{p}$ functions extracted these concepts perfectly. This was achieved by taking the $\hat{p}$ concept predictions of these concepts on the test and training sets, setting the values of the top-ranked most important concepts to their ground truth values, training logistic regression $\hat{q}$ functions on these modified training sets, and measuring their accuracies on the modified test sets (this approach is referred to as concept intervention in the rest of this work). The results are shown in Figure 4 for varying numbers of intervened concepts.
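The intervention procedure can be sketched as below. This is an illustration under stated assumptions: concept importance is taken to be the sum of absolute logistic regression coefficients across classes (one possible reading of "coefficient magnitudes"), and the coefficient matrix and concept values are made-up toy data.

```python
def rank_concepts(coefficients):
    """Rank concept indices by importance, most important first.

    coefficients[class][concept] holds hypothetical linear-model weights.
    """
    n = len(coefficients[0])
    importance = [sum(abs(row[i]) for row in coefficients) for i in range(n)]
    return sorted(range(n), key=lambda i: importance[i], reverse=True)


def intervene(predicted, ground_truth, ranked, top):
    """Overwrite the `top` most important predicted concepts with ground truth."""
    fixed = set(ranked[:top])
    return [tuple(g[i] if i in fixed else p[i] for i in range(len(p)))
            for p, g in zip(predicted, ground_truth)]


# Toy data (assumed): two classes, three binary concepts, two samples.
coefficients = [[0.1, -2.0, 0.5], [0.2, 1.5, -0.4]]
predicted = [(0, 0, 1), (1, 1, 1)]
ground_truth = [(0, 1, 0), (1, 0, 1)]

ranked = rank_concepts(coefficients)
after_one = intervene(predicted, ground_truth, ranked, top=1)
after_two = intervene(predicted, ground_truth, ranked, top=2)
print(ranked, after_one, after_two)
```

A concept-to-label model would then be retrained on the modified training concepts and scored on the modified test concepts for each number of intervened concepts.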
These results demonstrate that concept information from only concepts is sufficient for achieving over task performance. Thus, predictive performance of the CUB model can be significantly improved (up to ) by ensuring that the model is able to learn and use this concept information. Crucially, these results show that CME concept intervention also significantly improves CBM model performance, indicating that the necessary concept information is not learnable from the data. Hence, undesired CUB model behaviour is likely arising due to data properties (e.g. the data not being representative with respect to key concepts), not model properties (e.g. architecture, or training regime).
Overall, we demonstrated how CME can be used to identify the key concept information that can be used to improve performance of DNN models, and ensure that they are closer aligned with the desired concept-based behaviour. Furthermore, we demonstrated how CME can be used to identify whether undesired model behaviour is caused by model properties, or data properties.
By studying CME-extracted $\hat{p}$ and $\hat{q}$ functions separately, we can gain additional insights into what concept information the original model learned and how this concept information is used to make predictions. We give examples of how these sub-models can be inspected in the remainder of this section.
CME extraction of $\hat{p}$ functions from a DNN model is highly complementary to existing approaches to latent space analysis. For example, Figure 5 shows a t-SNE 2D projected plot of every layer’s hidden space of the dSprites Task 2 model, highlighting different concept values of the two relevant concepts, as well as the layers used by CME to predict them. Figure 5 demonstrates several important ways in which CME concept extraction can be combined with existing latent space analysis approaches, which will be discussed in the remainder of this section. Further examples are given in Appendix C.
Manifold Types: Using ground-truth concept information and hidden space visualisation, it is possible to inspect the nature of latent space manifolds, with respect to specific concepts. Firstly, this inspection allows one to build an intuition of how concept information is represented in a particular latent space. Secondly, it is possible to use this information when selecting the types of functions to use during concept extraction. For instance, some manifolds consist of “blobs” encoding distinct concept values (e.g. for layer dense_1), suggesting that the latent space is clustered with respect to a concept’s values.
Variation Across Layers: Using ground-truth concept information and hidden space visualisation, it is also possible to inspect how concept information representation varies across layers of a DNN model. Firstly, this inspection allows one to build an intuition of how concept-related information is transformed by the DNN. Secondly, it is possible to use this information to identify the ‘best’ layers to extract concept information from. For instance, both the shape and scale rows illustrate that the manifolds of higher layers become more unimodal (separating concept values) with respect to the relevant concepts. Importantly, this analysis, together with the definition of $\hat{p}$, allows using different layers for extracting different concepts.
Overall, we argue that CME concept extraction can be well-integrated with existing latent space analysis approaches, in order to study which concept information is learned by a DNN, and how this information is represented across DNN layers. This type of inspection can have numerous applications, including: (i) inspecting which concepts a model has learned, and verifying whether it has learned the desired concepts (useful for model explanations and model verification), (ii) inspecting how concept information is represented across different layers (useful for fine-grained model analysis), (iii) extracting concept predictions from a DNN (useful for knowledge extraction). Further examples and analysis of extracted $\hat{p}$ functions can be found in Appendix C.
Concept-to-label ($\hat{q}$) functions encapsulate how a DNN uses concept information when making predictions. Hence, these functions can be inspected directly, in order to analyse model behaviour represented in terms of concepts. An example is given in Figure 6, in which we plot the decision tree $\hat{q}$ function extracted by CME from the Task 1 model. Further examples are given in Appendix D.
Overall, inspection of $\hat{q}$ functions can be used for (i) verifying that a DNN uses concept information correctly during decision-making, and that its high-level behaviour is consistent with user expectations (model verification), (ii) identifying specific concepts or concept interactions (if any) causing incorrect behaviour (model debugging), (iii) extracting new knowledge about how concept information can be used for solving a particular task (knowledge extraction). Further examples and analysis of extracted $\hat{q}$ functions can be found in Appendix D.
We present CME: a concept-based model extraction framework, used for analysing DNN models via concept-based extracted models. Using two case-studies, we demonstrate how CME can be used to (i) analyse concept information learned by DNN models, (ii) analyse how DNNs use concept information when making predictions, and (iii) identify key concept information that can further improve DNN predictive performance. CME is a model-agnostic, general-purpose framework, which can be combined with a wide variety of different DNN models and corresponding tasks.
In this work, we assume a fixed set of concept labels available to CME before model extraction begins (i.e. the concept-labelled dataset). In the future, we intend to explore active-learning based approaches to obtaining maximally-informative concept labels in an interactive fashion. Consequently, these approaches will improve extracted model fidelity by retrieving the most informative concept labels, and reduce manual concept labelling effort.
Given the rapidly-increasing interest in concept-based explanations of DNN models, we believe our approach can play an important role in providing granular concept-based analyses of DNN models.
AW acknowledges support from the David MacKay Newton research fellowship at Darwin College, The Alan Turing Institute under EPSRC grant EP/N510129/1 & TU/B/000074, and the Leverhulme Trust via the Leverhulme Centre for the Future of Intelligence (CFI). BD acknowledges support from EPSRC Award #1778323. DK acknowledges support from EPSRC ICASE scholarship and GSK. DK and BD acknowledge the experience at Tenyks as fundamental to developing this research idea.
Appendix A Concept Decomposition
The results and findings presented in existing work on concept-based explanations suggest that users often think of tasks in terms of concepts and concept interactions (see Section 2.1 for further details). For instance, consider the task of determining the species of a bird from an image. A user will typically perform this task by first identifying relevant concepts (e.g. wing color, head color, and beak length) present in a given image, and then using the values of these concepts to infer the bird species, in a bottom-up fashion.
On the other hand, Machine Learning (ML) models usually rely on high-dimensional data representations, and infer task labels directly from these high-dimensional inputs (e.g. a CNN produces a class label from raw input pixels of an image).
Consequently, Concept Decomposition (CD) approaches attempt to explain the behaviour of such ML models by decomposing their processing into two distinct steps: concept extraction, and label prediction. In concept extraction, concept information is extracted from the high-dimensional input data. In label prediction, concept information is used to produce the output label. Hence, CD approaches attempt to explain ML model behaviour in terms of human-understandable concepts and their interactions in a bottom-up fashion, paralleling human-like reasoning more closely.
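The two-step structure described above can be sketched as a composition of two functions. The example below is a minimal toy, not the models used in this work: the binary "images", the `scale` and `side` concepts, and all function and class names are hypothetical and chosen only to make the decomposition concrete.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Image = List[List[int]]      # toy binary image: rows of 0/1 pixels
Concepts = Dict[str, str]    # concept name -> concept value


def extract_concepts(image: Image) -> Concepts:
    """Step 1 (concept extraction): high-dimensional input -> concept values."""
    on = [(r, c) for r, row in enumerate(image) for c, v in enumerate(row) if v]
    mean_col = sum(c for _, c in on) / len(on)
    centre = (len(image[0]) - 1) / 2
    return {
        "scale": "large" if len(on) > 4 else "small",
        "side": "left" if mean_col < centre else "right",
    }


def predict_label(concepts: Concepts) -> str:
    """Step 2 (label prediction): concept values -> task label."""
    return f"{concepts['scale']}-{concepts['side']}"


@dataclass
class ConceptDecomposedModel:
    """A model that is concept-decomposable by construction."""
    concept_extractor: Callable[[Image], Concepts]
    label_predictor: Callable[[Concepts], str]

    def __call__(self, image: Image) -> str:
        return self.label_predictor(self.concept_extractor(image))


model = ConceptDecomposedModel(extract_concepts, predict_label)
image = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
print(extract_concepts(image))  # {'scale': 'small', 'side': 'left'}
print(model(image))             # small-left
```

Because the intermediate concept values are exposed, each prediction can be explained by first reading off the extracted concepts and then checking how they map to the label.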
Importantly, whilst this work focuses on CNN models and tasks, the notion of CD can in principle be applied to any ML model and task.
CBMs can be seen as a special case of models performing CD, in which CD behaviour is enforced by design. Hence, these models explicitly consist of two submodels, with the first submodel extracting concept information, and the second submodel using this concept information for producing task labels. Importantly, non-CBM models can still demonstrate CD behaviour. For instance, the dSprites Task 2 model was shown to have CD behaviour, with relevant concept information extracted in the dense layers, and used for classification decisions.
A.2 CBMs & CME
The utility of CBMs is that they produce models explicitly encouraged to use CD. Consequently, these models are much more likely to rely on the desired concepts during decision-making, and be more aligned with a user’s mental model of the corresponding task.
However, a given DNN model can already exhibit CD behaviour, and use the desired concept information (e.g. as was the case with both dSprites task models). In this case, costly modifications and model re-training are unnecessary. As discussed in Section 3, CME can extract concept information from pre-trained DNNs by training concept predictors (where denotes the number of DNN layers used in concept extraction, and denotes the number of concepts). As demonstrated in Section 5, these concept predictors can consist of simpler models (e.g. LRs), trained on only a fraction of the DNN training data. Thus, the computational cost of training these concept predictors is significantly smaller, compared to training a bottleneck model on all the training data, as done in the case of CBMs.
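To make the idea of layer-wise concept predictors more concrete, the sketch below trains one logistic regression per layer for a single concept, using only a small labelled fraction of the data, and keeps the layer whose predictor scores best on the remaining samples. This is a simplified, fully supervised stand-in rather than the procedure from Section 3: the activations are synthetic, the layer names `early` and `late` are hypothetical, and selecting the layer by held-out accuracy is an assumption made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 600

# Hypothetical concept with 3 possible values (e.g. a shape concept).
concept = rng.integers(0, 3, size=n)

# Synthetic hidden activations for two hypothetical layers:
# "early" carries no concept information, "late" encodes it in one unit.
early = rng.normal(size=(n, 8))
late = rng.normal(size=(n, 8))
late[:, 0] += 4.0 * concept
activations = {"early": early, "late": late}

# Use only 20% of the samples as concept-labelled training data.
idx_labelled, idx_heldout = train_test_split(
    np.arange(n), train_size=0.2, random_state=0, stratify=concept
)

# One simple predictor per layer for this concept.
scores = {}
for layer_name, hidden in activations.items():
    predictor = LogisticRegression(max_iter=1000)
    predictor.fit(hidden[idx_labelled], concept[idx_labelled])
    scores[layer_name] = predictor.score(hidden[idx_heldout], concept[idx_heldout])

# Keep the layer from which the concept is most accurately predicted.
best_layer = max(scores, key=scores.get)
print(scores, best_layer)
```

With several concepts, the same loop would be repeated per concept, giving one predictor per (layer, concept) pair, each of which is cheap to train compared with re-training a full bottleneck model.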
More importantly, CBM models require knowledge of existing concepts and available concept annotations. In practice, these annotations are often expensive to produce, especially for large datasets and/or a large number of concepts. Furthermore, information about which concepts are relevant and/or sufficient for solving a given task is often not fully available either. Instead, CME is capable of using existing DNN models to extract this information automatically in a semi-supervised fashion, making concept discovery (identifying the relevant concepts), and concept annotation both faster and cheaper.
Overall, CME permits efficient interaction with pre-trained DNN models, which can be used to leverage concept-related knowledge stored in these models. Consequently, we believe that CME will be invaluable in situations where concept-related information is expensive/difficult to obtain, or is only partially-known. In these cases, a user may interact with existing DNN models via CME, in order to refine existing concept-related knowledge.
It should be noted that a CBM can trivially be approximated using CME, by defining as the output of a CBM’s concept bottleneck layer, and defining as the CBM’s submodel producing task labels from the bottleneck layer output.
A.3 Further Discussion
As discussed in Section 3, CME explores whether a DNN is concept-decomposable, by attempting to approximate it with an extracted model that is concept-decomposable by design (i.e. explicitly consists of two separate stages). Intuitively, if a given DNN learns and relies on concept information of the specified concepts during label prediction, this concept information will be contained in the DNN latent space. Hence, the DNN decision process could be separated into two steps: concept information extraction, and consequent task label prediction.
Importantly, in order to achieve high task performance, existing CD-based approaches (such as those discussed in Section 2.2) require the set of concepts and their values to be (i) sufficient to solve the corresponding classification task (i.e. the class labels can be predicted from concept information with high accuracy), and (ii) learnable from the data (i.e. the DNN model will be able to learn concept information from the given dataset).
However, these works do not discuss how to handle cases where these assumptions do not hold (e.g. as was the case with the CUB task). Thus, exploring ways of efficiently discovering relevant concepts sufficient for solving a given task, as well as ways of verifying that this concept information is learnable from the data, are both important research directions for future work.
Appendix B dSprites Dataset
dSprites is a dataset of 2D shapes, procedurally generated from 6 ground truth independent concepts (color, shape, scale, rotation, x and y position). Table 3 lists the concepts, and corresponding values. dSprites consists of pixel black-and-white images, generated from all possible combinations of these concepts, for a total of total images.
Table 3: dSprites concepts and corresponding values.
|Concept|Values|
|Shape|square, ellipse, heart|
|Scale|6 values linearly spaced in|
|Rotation|40 values in|
|Position X|32 values in|
|Position Y|32 values in|
We select half of the values for Position X and Position Y (keeping every other value only), and one fifth of the values for Rotation (retaining every 5th value). This step makes the dataset size more manageable (reducing it from to samples), whilst preserving its characteristics and properties, such as concept value ranges and diversity.
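The effect of this subsampling on dataset size can be checked with a short calculation. The sketch below uses the per-concept value counts from Table 3 and assumes that color takes a single value (its row is not shown in the table above). The dictionary keys and placeholder values are illustrative names only.

```python
from math import prod

# Values per concept, following Table 3 (placeholders stand in for actual values).
# Assumption: color has a single value, since its row is not listed above.
values = {
    "color": list(range(1)),
    "shape": ["square", "ellipse", "heart"],
    "scale": list(range(6)),
    "rotation": list(range(40)),
    "position_x": list(range(32)),
    "position_y": list(range(32)),
}

# Keep every other position value and every 5th rotation value.
steps = {"rotation": 5, "position_x": 2, "position_y": 2}
reduced = {name: vals[::steps.get(name, 1)] for name, vals in values.items()}

# One image per combination of concept values.
full_size = prod(len(v) for v in values.values())
reduced_size = prod(len(v) for v in reduced.values())
print(full_size, reduced_size)  # 737280 36864
```

Under these assumptions the subsampled dataset is 20 times smaller, while every shape and scale value, and the full range of positions and rotations, remains represented.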
Appendix C Input-to-Concept Functions
Figure 7 shows a t-SNE 2D projected plot of every layer’s hidden space of the dSprites Task 1 model, highlighting different concept values of the relevant shape concept, and which layers were used by CME to predict it.
The CUB model has a considerably larger number of layers, and a considerably larger number of task concepts. Hence, for the sake of space, we demonstrate an example here using only 6 different model layers of the CUB model, and showing only the top important concepts identified in Section 5.3. In this Figure, the concepts are named using their indices, and the layers are named following the naming convention used in . Further details regarding layer naming and/or concept naming can be found in
Mixed_7c layer. However, the figure shows that concept values are still quite mixed together for some of the points, even for later layers. This low separability indicates that concept values will still be mis-predicted for some of the points, and that concept extraction for the CUB task will likely perform suboptimally.
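The projections discussed in this appendix can be produced along the following lines. The sketch uses scikit-learn's `TSNE` to project one layer's hidden activations to 2D and then summarises where each concept value lands. The activations, the concept labels, and the cluster structure are all synthetic assumptions, and no plotting library is used so that the example stays self-contained.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Hypothetical concept labels: 3 concept values, 30 samples each.
shape = np.repeat([0, 1, 2], 30)

# Synthetic hidden activations of one layer (90 samples, 16 units),
# shifted so that each concept value forms its own cluster.
hidden = rng.normal(size=(90, 16))
hidden[:, :3] += 6.0 * np.eye(3)[shape]

# Project the hidden space to 2D, as done for each layer in the figures.
embedding = TSNE(
    n_components=2, perplexity=15, init="pca", random_state=0
).fit_transform(hidden)

# Mean 2D position of each concept value, a crude summary of the plot.
centroids = np.stack([embedding[shape == v].mean(axis=0) for v in range(3)])
print(embedding.shape)
print(centroids)
```

In practice the 2D points would be scatter-plotted and coloured by concept value. Layers in which the colours overlap heavily are those where concept extraction is likely to be less reliable.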
Appendix D Concept-to-Output Functions
Figure 9 shows the decision tree extracted for dSprites Task 2. Overall, this model has correctly learned to differentiate between classes based on the shape and scale concepts (note: there are 3 shape and 6 scale concept values, for a total of 18 output classes).
- Pronounced “See Me.”
- All relevant code is available at https://github.com/dmitrykazhdan/CME
- (2018) Peeking inside the black-box: a survey on explainable artificial intelligence (xai). IEEE Access 6. Cited by: §1.
- (2018) Sanity checks for saliency maps. In Advances in Neural Information Processing Systems, pp. 9505–9515. Cited by: §1.
- (1995) Survey and critique of techniques for extracting rules from trained artificial neural networks. Knowledge-based systems 8 (6), pp. 373–389. Cited by: §2.3.
- (2020) Explainable artificial intelligence (xai): concepts, taxonomies, opportunities and challenges toward responsible ai. Information Fusion 58. Cited by: §1.
- (2020) Explainable machine learning in deployment. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pp. 648–657. Cited by: §1.
- (2017) Enhancing transparency and control when drawing data-driven inferences about individuals. Big data 5 (3), pp. 197–212. Cited by: §2.3.
- (2018) Large scale fine-grained categorization and domain-specific transfer learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4109–4118. Cited by: §4.2.2.
- (2020) You shouldn’t trust me: learning models which conceal unfairness from multiple explanation methods. In European Conference on Artificial Intelligence, Cited by: §1.
- (2018) Net2vec: quantifying and explaining how concepts are encoded by filters in deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8730–8738. Cited by: §4.3.1, §4.3.1.
- (2019) Towards automatic concept-based explanations. In Advances in Neural Information Processing Systems, Cited by: §1, §2.1.
- (2017) European union regulations on algorithmic decision-making and a “right to explanation”. AI magazine 38 (3), pp. 50–57. Cited by: §1.
- (2019) Explaining classifiers with causal concept effect (cace). arXiv preprint arXiv:1907.07165. Cited by: §2.1.
- (2007) Learning multiple layers of representation. Trends in cognitive sciences 11 (10), pp. 428–434. Cited by: §2.1.
- (2020) Human-in-the-loop learning of interpretable and intuitive representations. In ICML Workshop on Human Interpretability, Cited by: §2.2, §3.1.
- (2020) MARLeME: a multi-agent reinforcement learning model extraction library. arXiv preprint arXiv:2004.07928. Cited by: §2.3.
- (2018) Interpretability beyond feature attribution: quantitative testing with concept activation vectors (TCAV). In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, J. G. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, pp. 2673–2682. Cited by: §1, §1, §2.1.
- (2017) Interpretability beyond feature attribution: quantitative testing with concept activation vectors (tcav). arXiv preprint arXiv:1711.11279. Cited by: §2.1.
- (2019) The (un) reliability of saliency methods. In Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, pp. 267–280. Cited by: §1.
- (2020) Concept bottleneck models. In Proceedings of Machine Learning and Systems 2020, pp. 11313–11323. Cited by: Appendix C, §2.2, §3.1, §4.2, §4.3.2, §4.3.2, §5.1.2.
- (1999) Extracting decision trees from trained neural networks. Pattern recognition 32 (12). Cited by: §2.3.
- (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §4.2.2.
- (1990) Handwritten digit recognition with a back-propagation network. In Advances in neural information processing systems, pp. 396–404. Cited by: §4.1.2.
- (2008) Semi-supervised multitask learning. In Advances in Neural Information Processing Systems, Cited by: §3.3.
- (2008) Visualizing data using t-sne. Journal of Machine Learning Research 9 (Nov), pp. 2579–2605. Cited by: §5.4.1.
- (2017) DSprites: disentanglement testing sprites dataset. Note: https://github.com/deepmind/dsprites-dataset/ Cited by: §4.
- (2018) Towards robust interpretability with self-explaining neural networks. In Advances in Neural Information Processing Systems, pp. 7775–7784. Cited by: §1.
- (2011) Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12. Cited by: §5.1.
- (2018) Manipulating and measuring model interpretability. arXiv preprint arXiv:1802.07810. Cited by: §1.
- (2001) Rule extraction from neural networks via decision tree induction. In IJCNN’01. International Joint Conference on Neural Networks. Proceedings (Cat. No. 01CH37222), Vol. 3, pp. 1870–1875. Cited by: §2.3.
- (2014) Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15 (56), pp. 1929–1958. Cited by: §4.1.2.
- (2016) Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2818–2826. Cited by: §4.2.2.
- (2011) The caltech-ucsd birds-200-2011 dataset. Cited by: §4.
- (2019) On concept-based explanations in deep neural networks. arXiv preprint arXiv:1910.07969. Cited by: §2.1.
- (2014) Object detectors emerge in deep scene cnns. arXiv preprint arXiv:1412.6856. Cited by: §2.1.
- (2018) Interpretable basis decomposition for visual explanation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 119–134. Cited by: §1.
- (2004) Learning with local and global consistency. In Advances in Neural Information Processing Systems 16, Cited by: §5.1.
- (2016) Deepred–rule extraction from deep neural networks. In International Conference on Discovery Science, pp. 457–473. Cited by: §2.3.