Modeling Human Categorization of Natural Images
Using Deep Feature Representations
Abstract
Over the last few decades, psychologists have developed sophisticated formal models of human categorization using simple artificial stimuli. In this paper, we use modern machine learning methods to extend this work into the realm of naturalistic stimuli, enabling human categorization to be studied over the complex visual domain in which it evolved and developed. We show that representations derived from a convolutional neural network can be used to model behavior over a database of 300,000 human natural image classifications, and find that a group of models based on these representations perform well, near the reliability of human judgments. Interestingly, this group includes both exemplar and prototype models, contrasting with the dominance of exemplar models in previous work. We are able to improve the performance of the remaining models by preprocessing neural network representations to more closely capture human similarity judgments.
Keywords: Artificial Intelligence, Cognition, Categorization, Classification, Neural Networks, Inference.
Ruairidh M. Battleday (battleday@berkeley.edu)* Helen Wills Neuroscience Institute, University of California, Berkeley Joshua C. Peterson (peterson.c.joshua@gmail.com) Thomas L. Griffiths (tom_griffiths@berkeley.edu) Department of Psychology, University of California, Berkeley * Corresponding author
Introduction
The problem of categorization (how an intelligent agent should group stimuli into discrete concepts) is an intriguing and valuable target for psychological research: it extends many influential themes in Western classical thought (see Aristotle, trans. 1984), has clear interpretations at multiple levels of analysis (Marr, 1982), and is likely fundamental to understanding human minds and advancing artificial ones (Cohen and Lefebvre, 2005). Previous categorization research has had many successes, in particular the development of high-precision statistical models of human behavior. In this literature, human categorization data has often been accounted for with respect to either category summaries or abstractions (“prototype” models) or stored examples in memory (“exemplar” models) (Maddox and Ashby, 1993; Reed, 1972; McKinley and Nosofsky, 1995). These seemingly disparate models can be unified mathematically as strategies for density estimation (parametric and nonparametric, respectively; see Ashby and Alfonso-Reese, 1995), an interpretation that enables interpolation between them, most notably in mixture density estimators (Rosseel, 2002). Fully extrapolating the probabilistic reframing of categorization allows one to explain the rational choice among these estimators using Bayesian nonparametric methods, tying the complexity of the strategy to the availability of data to the learner (Griffiths et al., 2007).
While this work has been insightful and theoretically productive, we know little about how it relates to the complex visual world it was meant to describe: it derives almost exclusively from laboratory experiments using highly controlled and simplified perceptual stimuli (Figure 1, top row), represented mathematically by hand-coded descriptions of obvious features or multidimensional-scaling (MDS) solutions of similarity judgments (Figure 1, bottom row). Human categorization abilities, by contrast, emerge from contact with the natural world and the problems it poses. As the category divisions that result may be best understood in this context, a central challenge is to extend existing theory to account for behavior over such domains. Recent work has begun to take up this challenge (Nosofsky et al., 2017); however, a fundamental problem remains: finding appropriate psychological representations for large numbers of varied naturalistic stimuli.
Developments in machine learning suggest one means to solve this problem. Tackling natural-image classification from an engineering perspective, computer scientists have achieved human-level accuracy using deep neural networks loosely inspired by the structure of the human brain. These networks learn representations that are used to optimize classification of large sets of natural images (LeCun et al., 2015), and hence provide a source of representations of complex naturalistic stimulus structure that can be used as input to psychological models of categorization. While it is unclear whether such models resemble human categorization or feature learning, the representations they learn are nevertheless apparently relevant to information that humans use to judge stimulus similarity, and have been shown to provide a reasonable basis for approximating human representations in psychological experiments (Lake et al., 2015; Peterson et al., 2016).
In this study, we show that the representations extracted from a convolutional neural network (CNN) can be used as input for formal prototype and exemplar models of categorization, paving the way towards a wider and more nuanced exploration of human behavior. Moreover, we do so outside of the traditional laboratory setting, using representations from three layers of a canonical CNN to model human behavior over a massive dataset of crowdsourced natural-image category judgments (see Figure 2). We find that models based on CNN representations perform well, close to the reliability of human judgments. Surprisingly, although an exemplar model performs best overall, several variants of the prototype model are nearly as accurate, a finding that contrasts with what might be expected based on previous research, and one that highlights the importance of the representational space on the relative performance of categorization models. Models making more rigid assumptions about category structure, based on CNN representations alone, perform less well. However, we show that over some layers of the neural network their performance can be improved by integrating salient information about human behavior in an alternate way: pre-transforming CNN representations to more closely approximate human ones. These results demonstrate a promising route by which modern machine learning methods can be developed as a tool to extend traditional cognitive modeling of categorization into more representative domains.
Modeling categorization of natural images
We begin with a brief review of categorization models and convolutional neural networks. More mathematical details may be found in the Methods section.
Categorization models
It seems intuitive that we categorize a novel stimulus based on its similarity to previously learned concepts and categories. This motivates the comparison of formal models within a common framework: categorization as the assignment of a novel stimulus, $y$, to a category, $C$, based on some measure of similarity, $S(y, t)$, between the feature vectors of $y$ and those of existing category members (expressed in a summary statistic, $t$). We may now specify a model by a summary statistic, $t$, the similarity computation, $S$, and a function that links similarity scores for each category to the probability of selecting that category.
The summary statistic, $t$, represents the properties of categories that are necessary inputs for the similarity calculation under different strategies. In the psychological literature, two canonical strategies have been developed regarding these properties. In a prototype model (for example, Reed, 1972), a category prototype (the average of category members) is used for comparison: $t$ becomes the central tendency of the members of category $C$, $\mu_C$. In an exemplar model (for example, Nosofsky, 1986), all existing category members are used. Accordingly, $t$ represents all existing members or “exemplars” of category $C$. For the similarity calculation, we follow Shepard (1987) and use an exponentially decreasing function to relate distance in stimulus feature space to similarity. We also take $S$ to be an additive function: if $t$ is a vector, $S(y, t)$ becomes the summation of the similarities between $y$ and each element of $t$. We then use the Luce-Shepard choice rule (Luce, 1959; Shepard, 1957) to determine the likelihood of a single categorization, made over our two categories (plane ($C_1$) and bird ($C_2$)), with $t_1$ and $t_2$ their summary statistics and $\gamma$ a response-scaling parameter:
$$P(C_1 \mid y) = \frac{S(y, t_1)^{\gamma}}{S(y, t_1)^{\gamma} + S(y, t_2)^{\gamma}} \tag{1}$$
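The framework above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the toy feature vectors, the use of squared Euclidean distance inside the exponential, and all function names are assumptions made for illustration. It contrasts the two summary statistics (a category mean versus all stored members) and applies the Luce-Shepard choice rule to each.

```python
import numpy as np

def similarity(y, x):
    """Exponentially decreasing similarity (assumed: squared Euclidean distance)."""
    return np.exp(-np.sum((y - x) ** 2, axis=-1))

def prototype_similarity(y, members):
    # Summary statistic: the category mean (the prototype).
    return similarity(y, members.mean(axis=0))

def exemplar_similarity(y, members):
    # Summary statistic: all stored members; similarities are summed.
    return similarity(y, members).sum()

def choice_probability(s_plane, s_bird, gamma=1.0):
    """Luce-Shepard choice rule over two categories."""
    return s_plane ** gamma / (s_plane ** gamma + s_bird ** gamma)

rng = np.random.default_rng(0)
planes = rng.normal(loc=+1.0, size=(20, 3))   # toy "plane" feature vectors
birds = rng.normal(loc=-1.0, size=(20, 3))    # toy "bird" feature vectors
y = np.array([0.8, 1.1, 0.9])                 # novel stimulus near the planes

p_proto = choice_probability(prototype_similarity(y, planes),
                             prototype_similarity(y, birds))
p_exem = choice_probability(exemplar_similarity(y, planes),
                            exemplar_similarity(y, birds))
```

Both models assign the toy stimulus to the plane category with probability above one half; they differ only in which summary statistic feeds the same choice rule.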
Convolutional Neural Networks
Deep CNNs provide rich, transferable representations of natural images that enable state-of-the-art performance on many core problems in machine vision (LeCun et al., 2015). They pass pixel-level input data through a series of processing layers, which either apply a convolutional filter to the activation of nodes in the previous layer, or pool a subset of them. Node activations at each layer form a vector representation of the image that is increasingly abstract, and eventually input to a simple parametric classifier (see Figure 3 for a schematic of the CNN we use in this paper). Representation vectors from any layer can then be directly input into the categorization models described above. Beyond their use for flat-object classification, these representations have been shown to best predict brain activity in visual cortices (Agrawal et al., 2014; Mur et al., 2013) and human similarity judgments for natural images (Peterson et al., 2016).
Study 1: Fitting categorization models to human behavior using CNN representations
Design
In our first study, we fit several variants of traditional prototype and exemplar categorization models to a large dataset of human categorization decisions over natural images, using stimulus representations from multiple layers of our CNN.
Methods
Stimuli
The human decisions and accompanying CNN representations we investigate are based on the CIFAR-10 dataset (Krizhevsky and Hinton, 2009), which comprises 60,000 color images from 10 categories of natural objects. Human judgments were collected for a subset of two categories: birds and planes. The particular images were chosen based on uncertainty sampling: a method of increasing sample value by using intermediate models to present the stimuli they were least certain about to participants (for details, see Haas et al., 2015).
Human behavioral data
Our behavioral dataset consists of human categorization decisions made over this stimulus set—to our knowledge, the largest reported in a single study to date. Participants saw an image, and were asked whether it was a bird or a plane. These data were originally collected as part of a large project to improve crowdsourcing latency, and have not been explored in a psychological context (Haas et al., 2015). The mean number of judgments per image was (range: ).
Deep representations
We extract feature representations for each of our stimuli from all three major layers of a simplified version of the popular AlexNet CNN (Krizhevsky et al., 2012), pretrained on the CIFAR-10 dataset to an overall 10-class classification accuracy of 82% using Caffe (Jia et al., 2014); this network is depicted in Figure 3. Two-dimensional principal component projections of these representations are shown in Figure 4, colored according to human judgments for the corresponding images. We use this network because it has a simple architecture that allows for easier exploration of layers while maintaining classification accuracy in the ballpark of much larger, state-of-the-art variants.
Categorization models
We can reduce the likelihood of each judgment, given in Equation 1, as follows:
$$P(C_1 \mid y) = \frac{1}{1 + \exp\left(-\gamma \left[\log S(y, t_1) - \log S(y, t_2)\right]\right)} \tag{2}$$
This defines a sigmoid function around the classification boundary, in which $\gamma$ is a freely estimated response-scaling parameter that controls its slope, and therefore degree of determinism. As $\gamma \to \infty$ it becomes deterministic, and as $\gamma \to 0$ it reduces to random responding. When formulated in this manner, the prototype model is equivalent to a multivariate-Gaussian classifier, and the exemplar model to a k-nearest-neighbors classifier with distance weighting.
To evaluate the predictions of these models against human data, we record the category label, $c_{ij}$, that participant $j$ gives to the stimulus $y_i$. We then compute the log-likelihood of the human guesses under the model:
$$\log L = \sum_{i} \sum_{j} \log \frac{1}{1 + \exp\left(-\gamma\, c_{ij} \left[\log S(y_i, t_1) - \log S(y_i, t_2)\right]\right)} \tag{3}$$
where $c_{ij}$ takes the value $1$ for $C_1$, and $-1$ for $C_2$, acting to invert the difference of distances appropriately. Prototype and exemplar models differ in how their similarity to a category, $S(y, t)$, is calculated.
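As a hedged numerical check of the reduction described above, the sketch below compares the ratio form of the choice rule with its sigmoid form and evaluates a log-likelihood with labels coded as +1 (plane) and -1 (bird). The function names and toy similarity values are assumptions for illustration, not taken from the paper.

```python
import numpy as np

def p_plane_ratio(s1, s2, gamma):
    """Choice rule written as a ratio of exponentiated similarities."""
    return s1 ** gamma / (s1 ** gamma + s2 ** gamma)

def p_plane_sigmoid(s1, s2, gamma):
    """The same rule written as a sigmoid of the log-similarity difference."""
    return 1.0 / (1.0 + np.exp(-gamma * (np.log(s1) - np.log(s2))))

def log_likelihood(s1, s2, labels, gamma):
    """Sum of log sigmoid(z), with labels +1 (plane) or -1 (bird) per judgment."""
    z = labels * gamma * (np.log(s1) - np.log(s2))
    return -np.sum(np.logaddexp(0.0, -z))   # log sigmoid(z) = -log(1 + exp(-z))

s1 = np.array([0.9, 0.2, 0.5])      # toy similarities to the plane category
s2 = np.array([0.1, 0.7, 0.5])      # toy similarities to the bird category
labels = np.array([1, -1, 1])       # toy human responses

ratio = p_plane_ratio(s1, s2, gamma=2.5)
sigmoid = p_plane_sigmoid(s1, s2, gamma=2.5)
ll_random = log_likelihood(s1, s2, labels, gamma=0.0)
```

With `gamma=0.0` every judgment has probability one half, which corresponds to the random-responding limit mentioned in the text.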
Prototype model variants
For prototype models, similarity to a category is taken to be an exponential function of the negative squared Mahalanobis distance between a stimulus vector and the category prototype:
$$S(y, t_C) = \exp\left(-d(y, \mu_C)^2\right) \tag{4}$$
leading to the following general log-likelihood for a prototype model:
$$\log L = \sum_{i} \sum_{j} \log \frac{1}{1 + \exp\left(-\gamma\, c_{ij} \left[d(y_i, \mu_2)^2 - d(y_i, \mu_1)^2\right]\right)} \tag{5}$$
The Mahalanobis distance itself is given by the following equation:
$$d(y, \mu_C) = \sqrt{(y - \mu_C)^{\top} \Sigma_C^{-1} (y - \mu_C)} \tag{6}$$
where $\mu_C$ and $\Sigma_C$ are the mean (or prototype) and covariance matrix of category $C$. We can define a number of prototype models by using different strategies to estimate these two parameters for each ground-truth image category, resulting in five linear and four quadratic prototype models (see Table 1).
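A small sketch of the prototype similarity computation follows, assuming NumPy and toy two-dimensional feature vectors; the function names are hypothetical. It shows how the choice of covariance matrix (identity versus a diagonal of empirical variances) changes the squared Mahalanobis distance to the same prototype.

```python
import numpy as np

def mahalanobis_sq(y, mu, cov):
    """Squared Mahalanobis distance (y - mu)^T cov^{-1} (y - mu)."""
    diff = y - mu
    return float(diff @ np.linalg.solve(cov, diff))

def prototype_similarity(y, members, cov=None):
    mu = members.mean(axis=0)                 # category prototype
    if cov is None:
        cov = np.eye(members.shape[1])        # identity: squared Euclidean distance
    return np.exp(-mahalanobis_sq(y, mu, cov))

# Toy category members and a novel stimulus (illustrative values only).
members = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 2.0], [4.0, 2.0]])
y = np.array([4.0, 3.0])
mu = members.mean(axis=0)

d2_identity = mahalanobis_sq(y, mu, np.eye(2))
d2_diag = mahalanobis_sq(y, mu, np.diag(members.var(axis=0)))
```

Scaling by per-dimension variance shrinks the contribution of the high-variance first dimension, so the stimulus ends up closer to the prototype than under the identity matrix.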
If $\Sigma_C$ is the same for all categories, then the boundary at which a stimulus changes which prototype it is closest to is a hyperplane, resulting in a linear model (Duda et al., 2000). Taking the mean of category representations as prototype $\mu_C$, we test the following models, which define different linear decision boundaries in feature space:

Identity: $\Sigma$ is the identity matrix, $I$, for both cases. In this case, the Mahalanobis distance reduces to the (squared) Euclidean distance;

Common Variance: $\Sigma$ is a diagonal matrix, with the empirically estimated variance of all vectors (both plane and bird) as its diagonal, i.e., $\Sigma = \mathrm{diag}(\sigma^2)$;

Common Vector Variance: $\Sigma = \mathrm{diag}(\mathbf{v})$, where $\mathbf{v}$ is a vector fitted on training set data for both categories $C_1$ and $C_2$.
The first of these models is the simplest, and is equivalent to most prototype models employed throughout the literature. We may also reduce the above equation as follows, allowing us to posit two additional “Hyperplane” models:
$$\log L = \sum_{i} \sum_{j} \log \frac{1}{1 + \exp\left(-\gamma\, c_{ij} \left[\mathbf{a}^{\top} y_i + b\right]\right)} \tag{7}$$
Here, $\mathbf{a} = 2\Sigma^{-1}(\mu_1 - \mu_2)$ is the normal vector of an $(n-1)$-dimensional decision hyperplane, where $n$ is the number of feature dimensions; in the identity case this hyperplane is perpendicular to the line linking the prototypes. It is offset by a bias term representing the difference in squared length of the means, $b = \mu_2^{\top}\Sigma^{-1}\mu_2 - \mu_1^{\top}\Sigma^{-1}\mu_1$. This method corresponds to dropping the estimation of prototypes from category representations, and instead learning the projection of the line connecting the prototypes in human consensus space into the CNN representational space; equally, it can be thought of as learning a linear transformation of CNN representational space based on behavioral data.
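The claim that a shared covariance matrix yields a linear decision boundary can be checked numerically. The sketch below uses random toy prototypes and a random positive-definite covariance (all assumptions, as are the variable names) to confirm that the difference of squared Mahalanobis distances equals a linear function of the stimulus.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4                                         # toy number of feature dimensions
mu_plane, mu_bird = rng.normal(size=n), rng.normal(size=n)
A = rng.normal(size=(n, n))
cov = A @ A.T + n * np.eye(n)                 # shared positive-definite covariance
cov_inv = np.linalg.inv(cov)

def mahalanobis_sq(y, mu):
    diff = y - mu
    return diff @ cov_inv @ diff

# Linear reduction: normal vector and bias of the decision hyperplane.
normal = 2.0 * cov_inv @ (mu_plane - mu_bird)
bias = mu_bird @ cov_inv @ mu_bird - mu_plane @ cov_inv @ mu_plane

y = rng.normal(size=n)                        # arbitrary test stimulus
quadratic_form = mahalanobis_sq(y, mu_bird) - mahalanobis_sq(y, mu_plane)
linear_form = normal @ y + bias
```

The quadratic terms in the stimulus cancel because both distances use the same covariance, which is why the hyperplane models can fit the normal vector and bias directly instead of estimating prototypes.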
If $\Sigma_C$ is allowed to vary across categories, then this classification boundary can take more complex nonlinear forms (Duda et al., 2000). Taking the mean of category representations as prototype $\mu_C$, we test the following models, which define different quadratic decision boundaries in feature space:

Category Pooled Variance: $\Sigma_C$ is the empirically estimated scalar-valued mean of category $C$’s variance terms. This is also known as “pooled” or “spherical” variance;

Category Variance: $\Sigma_C$ is a diagonal matrix with the empirically estimated variance of category $C$’s vectors as its terms;

Category Scalar Variance: $\Sigma_C$ is a scalar fitted on the training set data for category $C$;

Category Vector Variance: $\Sigma_C$ is a vector fitted on training set data for category $C$.
Table 1. Prototype models.

Model name                  $k$
Identity                    1
Common Variance             1
Common Vector Variance      1 + $n$
Hyperplane (no bias)        1 + $n$
Hyperplane (bias)           2 + $n$
Category Pooled Variance    1
Category Variance           1
Category Scalar Variance    3
Category Vector Variance    1 + 2$n$

Note: $k$ = number of model parameters, $n$ = number of feature dimensions.
Exemplar model variants
In exemplar models, the similarity between $y$ and an exemplar $x$ is given by
$$S(y, x) = \exp\left(-\beta\, d(y, x)^{q}\right) \tag{8}$$
where $q$ is a shape parameter, and $\beta$ is a “specificity” parameter. This results in the following general log-likelihood for an exemplar model:
$$\log L = \sum_{i} \sum_{j} \log \frac{1}{1 + \exp\left(-\gamma\, c_{ij} \left[\log \sum_{x \in C_1} S(y_i, x) - \log \sum_{x \in C_2} S(y_i, x)\right]\right)} \tag{9}$$
The distance between two vectors is given by:
$$d(y, x) = \left(\sum_{k} w_k \left|y_k - x_k\right|^{r}\right)^{1/r} \tag{10}$$
where the $w_k$’s are positive dimensional-scaling parameters called “attentional weights” that must sum to one, and we use $r = 2$ (the Euclidean norm). These attentional weights serve to modify the importance of each dimension in each distance calculation, and we build two exemplar models based on them. In the “no attention” model we eliminate them from the calculation; in the “attention” model we learn them from the data (see Table 2).
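The exemplar similarity can be sketched as follows, assuming NumPy. The default values of the specificity and shape parameters (`beta=1.0`, `q=1.0`) and the function names are placeholders chosen for illustration, not values reported in the paper.

```python
import numpy as np

def weighted_distance(y, exemplars, weights, r=2):
    """Attention-weighted Minkowski distance from y to each stored exemplar."""
    return np.sum(weights * np.abs(y - exemplars) ** r, axis=-1) ** (1.0 / r)

def exemplar_similarity(y, exemplars, weights, beta=1.0, q=1.0):
    """Summed similarity of y to all exemplars of one category."""
    d = weighted_distance(y, exemplars, weights)
    return np.sum(np.exp(-beta * d ** q))

y = np.array([0.0, 0.0])
exemplars = np.array([[3.0, 4.0]])            # a single toy exemplar

no_attention = np.ones(2)                     # weights dropped from the distance
attention = np.array([0.5, 0.5])              # example weights that sum to one

d_plain = weighted_distance(y, exemplars, no_attention)[0]
d_attn = weighted_distance(y, exemplars, attention)[0]
s_plain = exemplar_similarity(y, exemplars, no_attention)
```

In a fitted attention model the weight vector would be learned from the behavioral data; here it is fixed by hand only to show how it rescales each dimension's contribution to the distance.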
Table 2. Exemplar models.

Model name      $w_k$’s       $k$
Attention       learned       2 + $n$
No attention    eliminated    2

Note: $k$ = number of model parameters, $n$ = number of feature dimensions.
Optimization
We learned all model parameters with 5-fold cross-validation and early stopping, using the Adam variant of stochastic gradient descent (Kingma and Ba, 2014) and a batch size of 256 images. For each fold, we generated a log-likelihood score for the held-out validation set every 10 batches. The early-stopping point for each model was the trial index at which the average validation log-likelihood score across folds was minimized (scores are reported as negative log-likelihoods, so lower is better). For each model, we conducted a grid search over Adam’s learning rate (alpha) hyperparameter, selecting the final model parameter set based on which gave the lowest cross-validated average log-likelihood at the model’s early-stopping point.
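The selection logic described above can be sketched independently of any particular optimizer. The code below assumes that validation scores have already been recorded as negative log-likelihoods in an array of shape (folds, checkpoints); the function names and toy numbers are hypothetical.

```python
import numpy as np

def early_stopping_point(val_nll):
    """val_nll: array (n_folds, n_checkpoints) of validation negative log-likelihoods.

    Returns the checkpoint index with the lowest fold-averaged score, and that score.
    """
    mean_curve = val_nll.mean(axis=0)
    best = int(np.argmin(mean_curve))
    return best, float(mean_curve[best])

def select_learning_rate(curves_by_lr):
    """curves_by_lr: dict mapping a learning rate to its (n_folds, n_checkpoints) array."""
    results = {lr: early_stopping_point(c) for lr, c in curves_by_lr.items()}
    best_lr = min(results, key=lambda lr: results[lr][1])
    return best_lr, results[best_lr][0]

# Toy curves for two candidate learning rates and two folds.
curves = {
    0.01: np.array([[5.0, 3.0, 2.0, 4.0], [6.0, 4.0, 3.0, 3.0]]),
    0.10: np.array([[5.0, 2.0, 6.0], [5.0, 1.0, 7.0]]),
}
best_lr, best_checkpoint = select_learning_rate(curves)
```

Averaging across folds before taking the minimum gives one shared stopping point per model rather than one per fold.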
Model comparison
For each of our models, we present the following three measures of performance: log-likelihood, correlation with human response proportions, and Akaike Information Criterion (Akaike, 1998). As a baseline model, we use the raw output probabilities of the neural network for each image to give $P(C_1 \mid y)$ and $P(C_2 \mid y)$, normalized to sum to one for each image. For each of our models, including the CNN softmax baseline, we computed final log-likelihood scores by generating predictions for all images in our stimulus set using the averaged cross-validated parameters taken at the early-stopping point described above.
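The baseline's renormalization step can be sketched as below. The column indices for the plane and bird classes are an assumption (they follow the usual CIFAR-10 label order and may differ for a given network), and the probability values are toy numbers.

```python
import numpy as np

PLANE, BIRD = 0, 2   # assumed class columns; adjust to the network's label order

def two_class_baseline(softmax_probs):
    """Keep the plane and bird probabilities and renormalize each row to sum to one."""
    pair = softmax_probs[:, [PLANE, BIRD]]
    return pair / pair.sum(axis=1, keepdims=True)

# Toy 10-class softmax outputs for two images.
probs = np.array([
    [0.60, 0.05, 0.20, 0.15, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.10, 0.00, 0.30, 0.60, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
])
baseline = two_class_baseline(probs)
```

The second toy image shows why renormalization matters: most of its probability mass sits on a third class, yet the two-class baseline still yields a usable plane-versus-bird prediction.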
As an ‘ideal’ model, we use split-half reliability with the Spearman-Brown correction (Spearman, 1910; Brown, 1910), which gives an indication of the inter-participant consistency and a ceiling on model performance. To do this, we generate 100 random half-splits of the human judgments, where each half-split contains half the human guesses for each image. For each split, we compute the correlation between the two halves and take the mean of all 100 of these correlations to get a final reliability estimate, to which we then apply the Spearman-Brown correction. To compare our categorization models to this ideal model, we again take 100 random splits of the data and compute the correlation between the average of each half and our model predictions. We then average the results for predicting the two halves and average the values for all 100 splits; these values are reported as “correlation”.
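A minimal sketch of the split-half procedure is given below, assuming binary judgments stored as one array per image. The simulated data, the seed, and the function names are illustrative assumptions; only the overall recipe (random halves, correlation, averaging, Spearman-Brown correction) follows the description above.

```python
import numpy as np

def spearman_brown(r):
    """Spearman-Brown correction for a split-half correlation."""
    return 2.0 * r / (1.0 + r)

def split_half_reliability(judgments, n_splits=100, seed=0):
    """judgments: list of 1-D arrays of binary responses, one array per image."""
    rng = np.random.default_rng(seed)
    correlations = []
    for _ in range(n_splits):
        first, second = [], []
        for votes in judgments:
            shuffled = rng.permutation(votes)
            half = len(shuffled) // 2
            first.append(shuffled[:half].mean())    # response proportion, half 1
            second.append(shuffled[half:].mean())   # response proportion, half 2
        correlations.append(np.corrcoef(first, second)[0, 1])
    return spearman_brown(np.mean(correlations))

# Simulated data: 200 images, 40 binary judgments each, image-specific rates.
rng = np.random.default_rng(42)
rates = rng.uniform(size=200)
judgments = [rng.binomial(1, p, size=40) for p in rates]
reliability = split_half_reliability(judgments)
```

The correction compensates for each half containing only half of the judgments, which would otherwise understate the consistency of the full dataset.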
We also use the Akaike Information Criterion to compare models, as, under limiting assumptions, it gives a score for each model that takes into account the relationship between the number of parameters it employs and the log-likelihood score it produces (Akaike, 1998):
$$\mathrm{AIC} = 2k - 2\hat{\ell} \tag{11}$$
where $k$ is the number of parameters in the model, and $\hat{\ell}$ is the maximum log-likelihood. (Equivalently, it can be thought of as estimating the relative information lost by using that model to represent the true underlying generative process.)
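As a worked check of the AIC formula, the sketch below recomputes the baseline entry of Table 3, under the assumption that the table's LL column reports a negative log-likelihood (so the log-likelihood itself is -168,152). The function name is hypothetical.

```python
def aic(n_params, log_likelihood):
    """Akaike Information Criterion: 2k minus twice the maximized log-likelihood."""
    return 2 * n_params - 2 * log_likelihood

# CNN softmax baseline: one parameter, LL column value 168,152
# (assumed to be a negative log-likelihood).
baseline_aic = aic(n_params=1, log_likelihood=-168_152)
```

The result matches the AIC of 336,306 listed for the baseline, which supports reading the LL column as a positive magnitude where lower is better.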
Results
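Table 3. Baseline and ceiling measures.

Model                       LL         AIC        Correlation    $k$
Baseline
  NN Softmax                168,152    336,306    0.67           1
Ceiling
  Split-half reliability    -          -          0.77           -

Note: NN = neural network, LL = log-likelihood (reported as a positive magnitude, so lower is better), AIC = Akaike Information Criterion, $k$ = number of parameters.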
Baseline and ceiling measures
Our ceiling and baseline measures are shown in Table 3. At 0.77, our split-half reliability indicates a large amount of inter-subject variability in image classification: beyond, perhaps, what would be expected from laboratory experiments, but understandable given the complex nature and small size of images, and the inherently decreased precision from crowdsourcing data and using uncertainty sampling to select stimuli (see Haas et al., 2015).
We also consider the CNN softmax output as a baseline model. For each image, the softmax function takes the inner product between a matrix of learned weights and the rasterized output of the final pooling layer, and returns a probability distribution over all of the CIFAR-10 classes. These weights are learned based on minimizing classification loss over the whole CIFAR-10 dataset, which comprises 50,000 training images over 10 categories. Thus, the CNN softmax is a generous “baseline”, as it includes many extra parameters learned over a much larger dataset for the related task of ground-truth image categorization. Consistent with this, it achieves a low log-likelihood score and high correlation.
Categorization models
Categorization model performance using the untransformed CNN representations is shown in Figure 5, with numerical scores in the Appendix. In terms of log-likelihood, five models consistently outperform the CNN baseline: the hyperplane models from the linear prototype class, with and without a bias term, the category vector variance model from the quadratic prototype class, and the exemplar models, with and without attentional weights. In general, these models have more parameters, meaning they are able to alter CNN representations using the human behavioral data. Although the exemplar model with attentional weights performs best overall, it is striking that the simple decision bounds formed by these prototype models allow them to perform nearly as well, and over all three layers. Indeed, when reviewing correlation scores, we can see that these five models are all performing close to the ceiling provided by the split-half reliability. Prototype models with fewer parameters, however, performed less well, and consistently below the baseline. These models do not incorporate information about human behavior to alter the shape of their decision boundary during training, instead estimating this from the CNN representations alone.
All models perform better using more abstract and lower-dimensional representations from higher layers, with the exception of the exemplar model without attentional weights. Again, this difference is largest in the prototype models with fewer parameters. While this result may have cognitive implications, it can be explained in machine learning terms. During training, the CNN has learned a feature representation that allows it to disambiguate stimuli easily in the deepest layer (Layer 3), meaning features associated with different categories become increasingly well separated with depth. The greater degree of category overlap in more superficial layers therefore penalizes models that directly estimate categorical structure from CNN representations. In addition, with an increase in feature dimensions the generalization of solutions found during training is likely to worsen, as the ratio of stimuli to dimensions, and therefore available information to constrain solutions, decreases from Layer 3 to Layer 1.
In order to offer an alternate evaluation of models that takes this risk of overfitting into account, we also present model AIC scores, which penalize more highly parameterized models. This analysis does not affect model rank in the deepest two layers, but does indicate that in Layer 1 the risk of over-parameterization outweighs the benefit in log-likelihood for all models, except the exemplar model without attentional weights, compared to the CNN softmax.
Study 2: Fitting models using transformed CNN representations
Design
Recent work demonstrates that human similarity judgments can be used to transform vector representations of stimuli to more closely correspond to human ones (Peterson et al., 2016). The core strategy of these techniques is to use a learned transformation of the underlying space to increase the correlation between the human similarity scores of stimuli and some measure of vector similarity, for example the inner product. There are two reasons for doing so. First, this approach extracts and amplifies information in stimulus representations that is relevant to the human behavior being modeled, resulting in a more faithful and interpretable conceptual structure. Second, in previous categorization research, low-dimensional MDS solutions have themselves been used as representations of more complex stimuli (for example, Palmeri and Nosofsky, 2001). Using transformation techniques complements and extends this approach by retaining the information content of higher-dimensional spaces, but doing so in a way that directly improves the quality of conceptual structure in lower-dimensional MDS projections. In our second set of analyses, we first transform CNN representations to more closely approximate human similarity judgments and then re-evaluate our categorization models using these improved approximations.
Method
Stimuli
We collected similarity judgments for a randomly selected subset of 60 birds and 60 planes from the CIFAR-10-based categorization stimuli.
Behavioral data
We collected 10 similarity judgments between each of the 7,140 unique pairs of these stimuli on Amazon Mechanical Turk, giving a total of 71,400 ratings from different participants. Participants were instructed to rate the similarity of four pairs of bird and plane images on a scale from 0 (not similar at all) to 10 (very similar). We paid workers $0.02 per set of four comparisons. Before each task, eight example pairs were shown to help prevent bias in early judgments. Amazon workers could repeat the task with new pairs as many times as they wanted. The result was a 120 × 120 similarity matrix after averaging over judgments.
Transforming CNN representations
To transform our representations, we follow the method introduced in Peterson et al. (2016), which uses regularized linear regression to increase the correlation between vector inner products and the average human similarity judgment between their corresponding images. A similarity matrix $S$ can be expressed as the matrix product of a feature matrix $F$ and a diagonal weight matrix $W$ (Shepard and Arabie, 1979),
$$S = F W F^{\top} \tag{12}$$
Given an existing feature matrix, the diagonal of $W$ can be obtained through ordinary least squares. This is more evident when expressing the entries of the $S$ matrix algebraically:
$$s_{ij} = \sum_{k} w_k f_{ik} f_{jk} \tag{13}$$
In the context of our categorization models, we require that these weights reflect a linear transformation of squared distances; therefore, they can be further constrained to be nonnegative. We use the nonnegative least squares algorithm from the scipy python module, enforcing regularization by augmenting the row space of the matrix with orthogonal vectors, whose length is controlled via the ridge parameter (in other words, by manually implementing ridge regression; see van Wieringen, 2015). We find the optimal regularization parameter using a grid search over values with 5-fold cross-validation. We then retrain the model with these parameters on the whole dataset to yield the final diagonal coefficient matrix, $W$. We use this transformation to generate a second set of representations for our images by premultiplying the representation matrix with the elementwise square root of the weight matrix, which is equivalent to the calculation described above.
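The regression described above can be sketched with `scipy.optimize.nnls`. This is an illustrative reconstruction under stated assumptions, not the authors' code: the feature matrix is taken to have one row per image, only unique off-diagonal pairs are used, the data are synthetic, and the ridge penalty is implemented by appending scaled identity rows to the design matrix.

```python
import numpy as np
from scipy.optimize import nnls

def fit_similarity_weights(F, S, ridge=1.0):
    """Nonnegative ridge fit of w in s_ij = sum_k w_k * f_ik * f_jk."""
    n_items, n_feats = F.shape
    i, j = np.triu_indices(n_items, k=1)       # unique pairs of stimuli
    X = F[i] * F[j]                            # one row per pair, one column per feature
    y = S[i, j]
    # Ridge by augmentation: extra rows sqrt(ridge) * I with zero targets.
    X_aug = np.vstack([X, np.sqrt(ridge) * np.eye(n_feats)])
    y_aug = np.concatenate([y, np.zeros(n_feats)])
    w, _ = nnls(X_aug, y_aug)
    return w

# Synthetic check: recover known nonnegative weights from noise-free similarities.
rng = np.random.default_rng(0)
F = rng.random((30, 5))                        # toy features: 30 items, 5 dimensions
w_true = np.array([2.0, 0.0, 1.0, 0.5, 0.0])
S = (F * w_true) @ F.T                         # S = F W F^T with W = diag(w_true)

w_hat = fit_similarity_weights(F, S, ridge=1e-8)
F_transformed = F * np.sqrt(w_hat)             # rescale each feature dimension
```

After rescaling, plain inner products between rows of `F_transformed` reproduce the weighted similarities, so downstream models can use the transformed vectors without any change to their distance calculations. In practice the ridge value would be chosen by cross-validation rather than fixed as it is here.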
Results
Transforming representations
Using the linear transformation described above, we are able to substantially increase the correlation between the CNN representations and human similarity ratings, especially in the deepest layers (see Table 4). In improving pairwise correlation, this transformation also recovers key global structure in stimulus organization; see Figure 6.
Table 4. Correspondence between CNN representations and human similarity judgments.

Representation type    Layer 1    Layer 2    Layer 3
Untransformed          0.004      0.03       0.12
Transformed            0.18       0.28       0.37

Note: Values shown are the squared correlation between stimulus vector inner products and mean human similarity judgments.
Categorization models
The results from evaluating our categorization models on these transformed CNN representations are shown in Figure 7, with numerical scores shown in the Appendix. For the prototype models, the same general pattern holds across layers: models that are able to augment computation on CNN representations with more information about human behavior perform better, with the same models performing best. With the exemplar models, the picture is more complex. The exemplar model without attentional weights performs well in the deepest two layers, scoring higher than baseline. The exemplar model with attentional weights, however, does not, and only performs best in the most superficial layer.
When comparing these results to Figure 5, an interesting pattern emerges. The transformation improves the performance of prototype models with fewer parameters, which form their decision boundaries based on the CNN representations alone, in Layers 1 and 2. The Category Scalar Variance model could also be included in this class, as it only incorporates very coarse information from human behavior; and with this model the pattern also holds. The opposite effect holds with the top-performing, more highly parameterized models, including the exemplar model with attentional weights: scores are negatively impacted by the transformation, over all layers. The most obvious explanation for these findings is that the transformation eliminates some dimensions that are not important for capturing similarity relations, in addition to emphasizing those that are. This may have the effect of regularizing basic models, especially in higher-dimensional feature spaces, but simultaneously penalizing more highly parameterized ones. These can then no longer exploit information in the eliminated dimensions that is nonetheless relevant to categorization. Evidence for this theory comes from inspection of the transformation weight matrices (data not shown) in conjunction with the MDS solutions of human similarity judgments and CNN inner-product spaces. For example, in Figure 6 it is clear the CNN has already formed an inner-product space to optimize categorization. The transformation increases the separation of stimuli in different categories, but it does so by eliminating the majority of dimensions.
Discussion
Our analysis goes beyond typical evaluations of categorization models in several ways. First, we use a large collection of natural images as stimuli, enabling us to study human categorization in a domain that is representative of the environment we have evolved and learned within, and theorized about.
Second, we are able to do so because we use state-of-the-art methods from computer vision to estimate the structure of these stimuli. This contrasts with previous work, in which a small number of a priori-identified features were manipulated to define and differentiate categories, and which as a consequence was limited to simple artificial stimuli.
Finally, we offset the modelling uncertainty these advances introduce by using large crowdsourced behavioral datasets to more finely assess graded category membership over stimuli and improve the utility of our representations for them.
Taken together, our results show that using representations derived from CNNs makes it possible to apply psychological models of categorization to complex naturalistic stimuli, and that the resultant models make competitive predictions about human behavior. This approach naturally complements and extends related work seeking to apply these models to natural images and categories, but relying on low-dimensional similarity-based or explicitly given feature spaces (Nosofsky et al., 2017; Storms et al., 2000).
Our most general finding is that categorization models that incorporate CNN representations predict human categorizations over natural images well, in particular those that are able to augment CNN representations with information about human categorizations through free parameters. However, we are still able to use information about human behavior to improve the performance of less-flexible models by transforming their representational substrate to more closely reflect properties of its human counterpart. This is theoretically interesting because it indicates there is enough latent information and flexibility in the ground-truth-trained CNN representations to harness for such a related task. Further, it is practically encouraging because it shows we can successfully draw on pre-existing representations and behavioral datasets for our model inputs, rather than procuring them for individual experiments at heavy computational cost. Preprocessing machine learning representations in this manner is a field in its infancy, and we are likely to see further benefits as more complex transformations are taken into account, along with classical considerations about the relative timing of similarity judgments with respect to the main task (Palmeri and Nosofsky, 2001).
Working with these complex, naturalistic stimuli reveals a potentially more nuanced view of human categorization. The broad consensus from decades of laboratory studies using simple artificial stimuli was that people could learn complex category boundaries of a kind that could only be captured by an exemplar model (for example, McKinley and Nosofsky, 1995). Extrapolating from these results, we might imagine that human categorization should be thought of in terms of learning complex category boundaries in simple feature-based representations. Our results outline a different perspective. The representations formed by the CNN are complex, and within those complex representational spaces, simple category boundaries seem sufficient to capture human behavior. When we think about the cases that inspire us to theorize about categorization (children learning to categorize furry animals as cats and dogs, say), it seems plausible that this story, developed with much more realistic stimuli, might be a reasonable alternative.
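The contrast between simple and complex category boundaries can be illustrated with a minimal sketch of the two model families operating on feature vectors. This is a simplified illustration under stated assumptions, not the exact models fit in this paper: it assumes squared Euclidean distance, an exponential similarity function, and the Luce choice rule, and the function names, the specificity parameter `c`, and the synthetic "bird" and "plane" features are hypothetical.

```python
import numpy as np

def prototype_probs(x, exemplars_by_cat):
    """Prototype model: compare x with each category's mean feature vector.
    With equal-variance categories this yields a linear boundary."""
    d2 = np.array([np.sum((x - ex.mean(axis=0)) ** 2) for ex in exemplars_by_cat])
    s = np.exp(-d2)               # similarity decays with squared distance
    return s / s.sum()            # Luce choice rule over categories

def exemplar_probs(x, exemplars_by_cat, c=1.0):
    """Exemplar model: sum the similarity of x to every stored category member.
    The boundary can bend around individual exemplars."""
    s = np.array([np.exp(-c * np.sum((ex - x) ** 2, axis=1)).sum()
                  for ex in exemplars_by_cat])
    return s / s.sum()

# Synthetic stand-ins for feature vectors of two categories.
rng = np.random.default_rng(0)
birds = rng.normal(loc=-1.0, scale=1.0, size=(50, 4))
planes = rng.normal(loc=1.0, scale=1.0, size=(50, 4))

x = np.ones(4)                    # a stimulus near the "plane" cluster
p_proto = prototype_probs(x, [birds, planes])
p_exemp = exemplar_probs(x, [birds, planes])
```

When the categories are well separated in feature space, as in this synthetic case, both models assign the stimulus to the same category; they diverge mainly when category structure is irregular, which is where exemplar models have their advantage.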
Despite the attractiveness of this account, there are important limitations to our analysis. One caveat comes from the source of the representations of images that we used to obtain our results: the CNN that generated them was explicitly trained to classify images into categories, including the categories of birds and planes. It does so by trying to form a representation in which a simple boundary is sufficient to pick out one category from another. In this sense, these representations should be expected to favor prototype models. While this is a worthy concern, we don’t think it significantly detracts from our results. First, as illustrated in Figure 2, people’s judgments often don’t agree with the ground truth that the network was trained on, so capturing human performance using these representations is nontrivial. Second, we regard the primary contribution of our results to be an existence proof that representations exist for complex natural stimuli that allow prototype models to perform similarly to exemplar models. This illustrates that complex representations and simple boundaries provide a reasonable alternative to simple representations and complex boundaries for capturing how people reason about natural categories. We simply don’t have other representations of these images that lead to better performance in predicting human behavior. Given this, an important direction for future work will be obtaining representations from other state-of-the-art machine learning algorithms applied to images, including from unsupervised learning models (for example, Yu et al., 2016), to evaluate the impact of classification training on our results.
Categorization has traditionally been regarded as distinct from feature learning. However, our findings suggest these dual processes be considered together. When thinking about humans, feature representations are likely to have been learned early on, through a slow, data-driven learning process. Given these considerations, one might expect psychological representations to reflect the natural world, such that categorization of natural stimuli is made as efficient and as simple as possible. On the other hand, artificial or unlikely stimuli may at times carve out awkward boundaries in these spaces, which perhaps underlies the success of exemplar models up to this point. Including feature learning in the evaluation of human categorization has been called for before (Schyns et al., 1998), and developing a deeper understanding of how these processes interact is an important next step towards more fully characterizing human categorization.
References
Appendix A
Below we present the numerical results for our models, using untransformed and transformed representations (Tables 5 and 6, respectively).
Table 5: Model results using untransformed representations.
Model                       Metric        Layer 1   Layer 2   Layer 3
Prototype: Linear
Identity                    LL            199,716   183,976   173,412
                            AIC           399,435   367,953   346,826
                            Correlation   0.35      0.54      0.63
Common Variance             LL            191,063   183,701   174,959
                            AIC           382,127   367,404   349,919
                            Correlation   0.47      0.55      0.62
Common Vector Variance      LL            175,596   172,123   167,787
                            AIC           367,579   348,346   337,626
                            Correlation   0.62      0.64      0.68
Hyperplane (no bias)        LL            168,223   166,855   160,062
                            AIC           352,834   337,809   322,177
                            Correlation   0.64      0.65      0.70
Hyperplane (bias)           LL            165,480   163,492   160,195
                            AIC           347,349   331,087   322,442
                            Correlation   0.66      0.67      0.70
Prototype: Quadratic
Category Pooled Variance    LL            202,026   182,958   180,102
                            AIC           404,054   365,917   360,206
                            Correlation   0.36      0.55      0.63
Category Variance           LL            194,672   180,047   176,816
                            AIC           389,346   360,097   353,633
                            Correlation   0.48      0.55      0.60
Category Scalar Variance    LL            197,974   181,409   169,942
                            AIC           395,954   362,823   339,889
                            Correlation   0.35      0.54      0.63
Category Vector Variance    LL            166,784   164,728   160,278
                            AIC           366,342   337,654   324,658
                            Correlation   0.65      0.67      0.70
Exemplar: Nonparametric
Exemplar (no attention)     LL            167,756   161,890   162,430
                            AIC           335,516   323,784   324,864
                            Correlation   0.69      0.71      0.71
Exemplar (attention)        LL            162,942   158,882   156,442
                            AIC           342,272   321,863   314,935
                            Correlation   0.70      0.72      0.73
Note: LL = log-likelihood, AIC = Akaike Information Criterion.
Bold font indicates best in each class of models. 
Table 6: Model results using transformed representations.
Model                       Metric        Layer 1   Layer 2   Layer 3
Prototype: Linear
Identity                    LL            186,265   181,615   174,405
                            AIC           372,533   363,233   348,813
                            Correlation   0.52      0.57      0.63
Common Variance             LL            185,900   180,501   174,567
                            AIC           371,802   361,005   349,136
                            Correlation   0.53      0.58      0.63
Vector Common Variance      LL            181,017   175,658   170,049
                            AIC           378,423   355,416   342,149
                            Correlation   0.57      0.62      0.66
Hyperplane (no bias)        LL            174,911   170,924   164,027
                            AIC           366,211   345,948   330,106
                            Correlation   0.59      0.62      0.67
Hyperplane (bias)           LL            175,942   171,293   163,717
                            AIC           368,273   346,687   329,485
                            Correlation   0.58      0.62      0.67
Prototype: Quadratic
Category Pooled Variance    LL            188,054   178,647   171,802
                            AIC           376,110   357,296   343,606
                            Correlation   0.53      0.56      0.62
Category Variance           LL            191,562   179,941   178,081
                            AIC           383,126   359,884   356,164
                            Correlation   0.50      0.55      0.59
Scalar Category Variance    LL            184,156   178,482   170,681
                            AIC           368,318   356,969   341,368
                            Correlation   0.51      0.56      0.62
Vector Category Variance    LL            173,636   169,158   164,707
                            AIC           380,046   346,514   333,516
                            Correlation   0.60      0.63      0.67
Exemplar: Nonparametric
Exemplar                    LL            207,919   164,062   162,628
                            AIC           415,841   328,128   325,259
                            Correlation   0.60      0.69      0.70
Exemplar (attn)             LL            186,544   181,384   173,510
                            AIC           389,477   366,869   349,073
                            Correlation   0.49      0.54      0.60
Note: LL = log-likelihood, AIC = Akaike Information Criterion.
Bold font indicates best in each class of models. 
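
For readers comparing the LL and AIC rows, the relationship between them can be sketched as follows. The AIC is defined as 2k - 2 ln L, where k is the number of free parameters and ln L is the log-likelihood. The sketch assumes that the LL rows above report the magnitude of a negative log-likelihood and that all values are rounded, so any parameter count recovered this way is only approximate. The function names are hypothetical.

```python
def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike Information Criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood

def implied_n_params(aic_score: float, neg_log_likelihood: float) -> float:
    """Invert the AIC formula to estimate k from a reported AIC and the
    magnitude of the log-likelihood (assumed to be a negative LL)."""
    return (aic_score - 2 * neg_log_likelihood) / 2

# Exemplar (attention), Layer 1, untransformed representations:
k_attention = implied_n_params(342_272, 162_942)      # 8194.0
# Exemplar (no attention), Layer 1, untransformed representations:
k_no_attention = implied_n_params(335_516, 167_756)   # 2.0
```

Under these assumptions, the gap between the two exemplar models reflects the AIC penalty for the additional attention weights: the attention model achieves a better likelihood at Layer 1 but a worse AIC score.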