Single Unit Status in Deep Convolutional Neural Network Codes for Face Identification: Sparseness Redefined

Abstract

Deep convolutional neural networks (DCNNs) trained for face identification develop representations that generalize over variable images, while retaining subject (e.g., gender) and image (e.g., viewpoint) information. Identity, gender, and viewpoint codes were studied at the “neural unit” and ensemble levels of a face-identification network. At the unit level, identification, gender classification, and viewpoint estimation were measured by deleting units to create variably-sized, randomly-sampled subspaces at the top network layer. Identification of 3,531 identities remained high (area under the ROC approximately 1.0) as dimensionality decreased from 512 units to 16 (0.95), 4 (0.80), and 2 (0.72) units. Individual identities separated statistically on every top-layer unit. Cross-unit responses were minimally correlated, indicating that units code non-redundant identity cues. This “distributed” code requires only a sparse, random sample of units to identify faces accurately. Gender classification declined gradually and viewpoint estimation fell steeply as dimensionality decreased. Individual units were weakly predictive of gender and viewpoint, but ensembles proved effective predictors. Therefore, distributed and sparse codes co-exist in the network units to represent different face attributes. At the ensemble level, principal component analysis of face representations showed that identity, gender, and viewpoint information separated into high-dimensional subspaces, ordered by explained variance. Identity, gender, and viewpoint information contributed to all individual unit responses, undercutting a neural tuning analogy for face attributes. Interpretation of neural-like codes from DCNNs, and by analogy, high-level visual codes, cannot be inferred from single unit responses. Instead, “meaning” is encoded by directions in the high-dimensional space.

Keywords: face identification, machine learning, visual features, sparse codes

1 Introduction

The concept of a feature is at the core of psychological and neural theories of visual perception. The link between perceptual features and neurons has been a fundamental axiom of visual neuroscience since Lettvin et al. (1959) first described the receptive fields of ganglion cells as “bug perceivers” (p. 258 [16]). At low levels of visual processing, features can be defined in precise terms and located in a retinal image. They can also be interpreted semantically (e.g., vertical line, retinal location). These codes are sparse, because they rely on the responses of a small number of specific neurons [19, 20]. At higher levels of visual processing, where retinotopy gives way to categorical codes, the connection between receptive fields and features is unclear. Although “face-selective” may be an accurate description of a neuron’s receptive field, it provides no information about the features used to encode a face.

A fundamental difference between retinotopic representations in early visual areas and the categorical representations that emerge in ventral temporal cortex is that the latter generalize across image variation (e.g., viewpoint). Historically, face recognition algorithms relied on feature detection strategies analogous to those used in low-level vision (e.g., [25, 26]). These algorithms operated accurately only on controlled face images with limited variation in viewing conditions. Since 2014, face-identification algorithms based on deep convolutional neural networks (DCNNs) have largely overcome the limit of recognizing faces using image-based similarity [30, 32, 27, 28, 6, 24]. Similar to face codes in high-level visual cortex, DCNN codes generalize over substantial image variation. Indeed, the response properties of neurons in inferior temporal cortex can be simulated using appropriately weighted combinations of output units from a DCNN trained for object recognition [35].

The parallels between primate vision and deep-learning networks are by design [18, 29]. DCNNs employ computational strategies similar to those used in the primate visual system [10, 15] and are trained extensively with real-world images. For face identification, effective training sets consist of variable images of a large number of identities. DCNNs process images through cascaded layers of non-linear convolution and pooling operations. The face representation that emerges at the top of the network is a compact vector with an impressive capacity for robust face recognition.

Although deep networks have made significant progress on the problem of generalized face recognition, the face representation they create is poorly understood [29]. Approaches to dissecting this representation have been aimed at: a.) uncovering the information retained in the descriptor [21, 13], b.) probing the robustness of individual unit responses to image variation [21], c.) visualizing the receptive fields of individual units in the network code [22], and d.) visualizing the similarity structure of a population of ensemble face representations for images and identities [12]. We consider each in turn.

First, it is now clear that face descriptors from DCNNs retain a surprising amount of information about the original input image [21]. Specifically, the output representation from DCNNs trained for face identification can be used to predict the viewpoint (yaw and pitch) and media type (still or video image) of the input image with high accuracy [21, 18]. Therefore, deep networks achieve robust identification, not by filtering out image-based information across layers of the network, but by effectively managing it (cf. also [7, 13]).

Second, given that DCNN descriptors contain image information, it is possible that the top-layer units separate identity and image information across different units of the face descriptor. Parde et al. (2017) tested this by probing the response properties of the top-layer units in a DCNN trained for face identification to either front-facing or three-quarter-view images of faces [21]. Individual units did not respond consistently in either a view-specific or view-independent manner.

The third approach is to visualize the response preferences of units in the network [22] with the goal of translating them into perceptible images. This is done with techniques such as Activation Maximization [8] or deconvolution [37]. This approach is useful for interpreting hidden units at lower layers of DCNNs, where unit activations can be linked to locations within an input image. At higher levels of the network, however, unit responses are not bound to image locations and so image-based visualization is of limited value.

The fourth approach is to visualize the similarity structure of ensembles of DCNN unit activations. This reveals a highly organized face space, cf. [33, 18]. Specifically, visualization was applied to highly controlled face images of multiple identities that vary in viewpoint and illumination [12]. The resulting face space showed that images clustered by identity, identities separated into regions of male and female faces, illumination conditions (ambient vs. spotlight) nested within identity clusters, and viewpoint (frontal to profile) nested within illumination conditions. Therefore, from deep networks trained with in-the-wild images, a highly structured representation of identity and image variation emerges.

Approaches to examining DCNN codes have focused either on single-unit responses or on the full ensemble of units. Neither provides a complete account of how unit responses interact in a representational space to code semantic information about faces. Here, we examined the juxtaposition of unit and ensemble codes of face identity, gender, and viewpoint in a deep network. At the unit level, we probed the distribution of information about faces and images across individual units in the network’s face descriptor. Specifically, we tested identification, gender classification, and viewpoint estimation in variably-sized, randomly sampled, subspaces of top-layer units from a face-identification DCNN. We examined the minimum number of units needed to perform each of these tasks, as well as the predictive power of each individual unit. At the ensemble level, we examined identity, gender, and viewpoint codes by interpreting them as directions in the representational space created at the top-layer of a DCNN. We did this by performing principal component analysis (PCA) on the face-descriptor vectors and then analyzing the quality of identity, gender, and viewpoint information coded by each PC. The results indicate that identity and image information separate in the ensemble space, but are confounded in individual unit responses. This challenges classical tuning analogies for neural units at high levels of visual processing.

Figure 1: Identification accuracy is plotted as a function of subspace dimensionality, measured as area under the ROC curve (AUC) (A). Performance is nearly perfect (AUC ≈ 1.0) with the full 512-dimensional descriptor and shows negligible declines until subspace dimensionality reaches 16 units. Performance with as few as two units remains above chance. Correlation histogram for unit responses across images indicates that units capture non-redundant information for identification (B).

2 Results

2.1 Unit-level Face Information

Face representations were obtained by processing face images (n = 22,357) of 3,531 identities through a state-of-the-art DCNN trained for face identification. The final face-image representation was defined as the 512-dimensional, penultimate layer of the network. The distribution of identity, gender, and viewpoint was examined across units in randomly sampled subspaces of varying dimensionalities (512, 256, 128, 64, 32, 16, 8, 4, and 2 units). For each dimensionality, 50 random samples were selected.

Identity

Identification accuracy is robust in low-dimensional subspaces. Figure 1A shows face-identification accuracy as a function of the number of randomly-selected units sampled from the face representation. Face-identification accuracy is near-perfect in the full-dimensional space. A substantial number of units can be deleted with almost no effect on accuracy for identifying the 3,000+ individuals in the test set. The first substantial drop in performance is seen at 16 units (16/512, or about 3% of the full dimensionality). Accuracy remains high (AUC = 0.80) with as few as four units, and is well above chance (AUC = 0.72) with only two units. DCNN performance is robust, therefore, with very small numbers of units. Further, performance does not depend on the particular units sampled.
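The unit-deletion procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the reported results: the descriptors are synthetic, the helper names `sample_subspace` and `cosine_similarity_matrix` are hypothetical, and cosine similarity is assumed as the comparison measure because the text does not name one.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_subspace(descriptors, n_units, rng):
    """Keep a random subset of units (columns); all other units are deleted."""
    keep = rng.choice(descriptors.shape[1], size=n_units, replace=False)
    return descriptors[:, keep]

def cosine_similarity_matrix(a, b):
    """Cosine similarity between every row of a and every row of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

# Synthetic stand-in for 512-unit top-layer descriptors (assumption).
descriptors = rng.standard_normal((20, 512))

# Dimensionalities tested in the text; one random sample per size is drawn here.
dims = [512, 256, 128, 64, 32, 16, 8, 4, 2]
subspaces = {d: sample_subspace(descriptors, d, rng) for d in dims}

# Image-to-image similarities computed inside the 16-unit subspace.
sims = cosine_similarity_matrix(subspaces[16], subspaces[16])
```

Repeating the sampling step 50 times per dimensionality, as described above, would give the spread of performance across random subspaces.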

The remarkable stability of identification performance with random selections of very few top-layer units is consistent with two types of codes. First, it is possible that individual units provide diverse cues to identity. Combined, these cues could accumulate to provide a powerful code for identification. If this is the case, we would expect many individual units to show a measurable capacity for separating identities. Moreover, the identity information captured by individual units would be uncorrelated. Alternatively, it may be that many different units capture redundant, but effective, information for face identification. By this account, we expect the response patterns of a subset of units to be highly correlated.

Individual units yield diverse, not redundant, solutions for identity. Figure 1B shows the distribution of response correlations for all possible pairs of top-level units across all images in the test set. The distribution is centered at zero, with 95% of correlations falling below an absolute value of 0.17. Therefore, units in the DCNN capture non-redundant identity information.
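A pairwise-correlation analysis of this kind might look like the following sketch. The response matrix is synthetic and the function name is hypothetical; Pearson correlation is assumed.

```python
import numpy as np

def pairwise_unit_correlations(responses):
    """Pearson correlation for every distinct pair of units (columns)."""
    corr = np.corrcoef(responses, rowvar=False)
    upper = np.triu_indices(corr.shape[0], k=1)  # each pair once, no diagonal
    return corr[upper]

rng = np.random.default_rng(0)
responses = rng.standard_normal((2000, 64))  # synthetic: images x units
r = pairwise_unit_correlations(responses)

# Absolute correlation below which 95% of unit pairs fall.
cutoff = np.quantile(np.abs(r), 0.95)
```

For 512 units the same call would return 512 × 511 / 2 = 130,816 correlations.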

Highly-distributed identity information across units. We quantified the identification capacity of individual units in the DCNN. Units with high identification capacity support maximal identity separation while simultaneously minimizing the distance between same-identity images. Therefore, a unit has identity-separation power when its responses vary more between images of different identities than within images of the same identity. We applied analysis of variance (ANOVA) to each unit’s responses to all images in the test set. For each ANOVA, identity was the independent variable, image was the random (observation) variable, and the unit responses were the dependent variable. The resulting F ratios provide an index of between-identity variance to within-identity variance. All units had sufficient power to separate identities (p < 0.000098, Bonferroni corrected for α = 0.05).

Next, we calculated the proportion of variance in a unit’s response explained by identity variation (the effect size). Figure 2A (purple) shows the distribution of effect sizes across units, with an average of 0.691 (minimum 0.6573, maximum 0.7611). Thus, on average, 69.1% of the variance in individual unit response is due to variation in identity. All units have a substantial capacity for separating identities.
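The per-unit analysis can be illustrated with a small one-way ANOVA. This sketch assumes the effect size is the ratio of the between-identity sum of squares to the total sum of squares; the function name and the synthetic unit are illustrative only.

```python
import numpy as np

def one_way_anova(values, groups):
    """Return (F ratio, proportion of variance explained) for one unit."""
    values = np.asarray(values, dtype=float)
    labels, idx = np.unique(np.asarray(groups), return_inverse=True)
    counts = np.bincount(idx)
    means = np.bincount(idx, weights=values) / counts
    grand = values.mean()
    ss_between = np.sum(counts * (means - grand) ** 2)
    ss_within = np.sum((values - means[idx]) ** 2)
    df_between = len(labels) - 1
    df_within = len(values) - len(labels)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    effect = ss_between / (ss_between + ss_within)
    return f_ratio, effect

# Bonferroni threshold for 512 unit-wise tests at alpha = 0.05.
bonferroni = 0.05 / 512

# Synthetic unit: identity shifts the mean response, images add noise.
rng = np.random.default_rng(0)
identity = np.repeat(np.arange(100), 6)
unit = 1.5 * rng.standard_normal(100)[identity] + rng.standard_normal(600)
f_ratio, effect = one_way_anova(unit, identity)
```

The threshold 0.05 / 512 ≈ 0.000098 matches the corrected criterion quoted for the unit-wise tests.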

Figure 2: Effect sizes for units (A) and principal components (B) for identity, gender, and viewpoint. For units and principal components, the top panels illustrate the dominance of identity over gender and viewpoint. Lower panels show an approximately uniform distribution of effect sizes for units (A) and differentiated effect sizes for principal components (B) for all three attributes.

Gender

Gender-prediction accuracy was measured in the abbreviated subspaces sampled for the face-identification experiments. For each sample, linear discriminant analysis (LDA) was applied to predict the labeled gender (male or female) of all images in the test set from the unit responses. Using all units, gender classification was 91.1% correct. Classification accuracy declined steadily as the number of units sampled decreased (Figure 3A).

Next, the gender-separating capacity of each individual unit was measured. An ANOVA was performed for each unit, using gender as the independent variable. Overall, 71.5% of the units were able to separate images according to gender (p < 0.000098, Bonferroni corrected for α = 0.05). However, gender accounted for only a very small amount of the variance in unit responses. Figure 2A (pink) shows effect size across units (mean = 0.0045, minimum 0, maximum = 0.041). Notwithstanding the small effect sizes, the finding that 71.5% of the units’ responses differed significantly as a function of gender is meaningful. If individual units did not possess predictive power for gender, approximately 5% (α level) of units would reach significance.

Consequently, far more units are needed to predict face gender than to predict identity. This is because fewer units have predictive power for gender than for identity, and because the predictive value of these units for gender is weaker.

Figure 3: Gender and viewpoint prediction with variable numbers of sampled features. Gender classification declines gradually (A) and viewpoint prediction declines rapidly (B) as sample size decreases. Mean performance for samples (n = 50) is shown with a diamond. Different sample sizes appear in different colors.

Viewpoint

Viewpoint was predicted using linear regression, cross-validated by identity subgroups, and assessed in the same samples tested for identification and gender classification. Prediction error was defined as the difference between predicted and true yaw (in degrees). Figure 3B shows prediction error as a function of the number of randomly-sampled units. Using all units, viewpoint was predicted to within 7.35 degrees. Prediction accuracy was at chance when subspace dimensionality fell to 32 units. Accurate prediction required nearly half of the units.

Viewpoint separation capacity for each unit was assessed with ANOVA, using viewpoint as the independent variable. Effect size measures the proportion of variance explained by viewpoint in each unit’s response. Figure 2A (orange) shows small effect sizes for viewpoint (average = 0.0020, minimum 0, maximum = 0.018). However, overall, 54.7% of units separated images according to viewpoint (p < 0.000098, Bonferroni corrected for α = 0.05). Therefore, viewpoint prediction requires far more units than identity or gender prediction.

Single Unit Summary

Multiple, qualitatively different codes co-exist within the same set of DCNN top-layer units. These codes are differentiated by the number of units needed to perform a task, and by the predictive power of individual units for the task. First, all units provide strong cues to identity that are largely uncorrelated. Therefore, small numbers of randomly chosen units can achieve robust face identification. Second, gender is coded weakly in approximately 72% of the units. Accurate gender prediction requires a larger number of units, because the set must include gender-predictive units, and these units must be combined to supply sufficient power for classification. Third, even fewer units (about 50%) code viewpoint—each very weakly. Therefore, a large number of units is needed for accurate viewpoint estimation.

2.2 Ensemble Coding of Identity, Gender, and Viewpoint

How do ensemble face representations encode identity, gender, and viewpoint in the high-dimensional space created by the DCNN? To understand unit-based face-image codes in the context of directions in this space requires a change in vantage point. A face space representation [33, 18] was generated by applying PCA to the ensemble unit responses. The axes of the space (PCs) are ordered according to the proportion of variance explained by the ensemble face-image descriptors. We re-expressed each face-image descriptor as a vector of PC coordinates. This captures a face-image representation in terms of its relationship to principal directions in the ensemble space.
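The change of basis described here can be sketched with PCA computed from a singular value decomposition. The data are synthetic and the function name `pca` is a placeholder; the text does not specify how the PCA was computed, so the SVD route is an assumption.

```python
import numpy as np

def pca(descriptors):
    """PCA via SVD of the mean-centered descriptors.

    Returns the PC directions (rows, ordered by explained variance), the
    proportion of variance explained by each PC, and the PC coordinates
    of every descriptor.
    """
    centered = descriptors - descriptors.mean(axis=0)
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    scores = centered @ vt.T
    return vt, explained, scores

rng = np.random.default_rng(0)
# Synthetic descriptors with unequal variance across directions (assumption).
descriptors = rng.standard_normal((300, 32)) * np.linspace(3.0, 0.5, 32)
components, explained, scores = pca(descriptors)
```

Each row of `scores` is one face-image descriptor re-expressed as a vector of PC coordinates.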

Figure 4: (A) Sliding windows of PCs were used to predict identity (purple), gender (teal), and yaw (yellow) across the full set of PC subspaces. Identification accuracy is highest when using early PCs. Gender and viewpoint prediction accuracy is highest when using subspaces with the highest effect sizes for gender and viewpoint separation, respectively. (B) Similarity between principal component eigenvectors and the directions for identity (purple), gender (teal), and yaw (yellow). The identity direction is the average similarity between identity templates and PCs. The gender direction is the linear discriminant line from an LDA for gender classification. The viewpoint direction is the vector of weights from a linear regression for predicting viewpoint.

For each PC, we measured identity, gender, and viewpoint separation using the ensemble-based image code. Effect sizes were computed for the PC-based codes using ANOVA, as was done for the unit-based codes. These appear in Figure 2B. Identity (purple) dominates gender (teal) and viewpoint (yellow) information, consistent with the unit-based code (cf. Fig. 2A). In contrast to the unit-based codes, effect sizes for individual PCs are strongly differentiated by face attribute. Effect sizes for identity are highest in PCs that explain the most variance in the ensemble space. Gender information peaks in two ranges of PCs (2–10 and 164–202). For viewpoint, effect sizes peak between PCs 220 and 230. Therefore, face-image attributes are filtered into multiple subspaces and are ordered roughly according to explained variance.

Next, we show that these subspaces differ in their functional capacity to classify identity, gender, and viewpoint. Moreover, these subspaces align with directions in the representational space diagnostic of face attributes.

Face Attributes Predicted from Ensemble Codes. To test the ability of different subspaces to separate attributes, we predicted identity, gender, and viewpoint from different ranges of PCs. Starting with PCs 1 to 30, we used sliding windows of 30 PCs (1–30, 2–31, 3–32, etc.) to predict each face-image attribute. Figure 4A shows that the accuracy of the predictions for the three attributes differs with the PC range. Identification accuracy is best in the subspaces that explain the most variance. Gender-classification accuracy is highest when using ranges of PCs that encompass the highest effect sizes for gender separation. Similarly, viewpoint prediction is most accurate with ranges of PCs that encompass the highest effect sizes for viewpoint separation.
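The window construction can be written out explicitly. In this sketch the helper names are hypothetical, and it is assumed that the windows continue in steps of one PC until the last window ends at PC 512.

```python
def sliding_windows(n_components, width=30):
    """1-based inclusive PC ranges: (1, 30), (2, 31), (3, 32), ..."""
    return [(start, start + width - 1)
            for start in range(1, n_components - width + 2)]

def window_slice(window):
    """Convert a 1-based inclusive PC range to a 0-based column slice."""
    start, end = window
    return slice(start - 1, end)

windows = sliding_windows(512)
first_columns = window_slice(windows[0])  # selects PCs 1 to 30
```

Under that assumption there are 483 windows, and each one selects 30 columns of the PC-coordinate matrix for the identity, gender, and viewpoint predictions.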

Face Attributes Align with Directions in the Space. At a more general level, it is possible to compare the PC directions in the space to the directions diagnostic of identity, gender, and viewpoint. We compared PCs to: a.) the directions of identity codes, b.) the direction in the space that maximally separated faces by gender (gender direction), and c.) the direction that best supported viewpoint prediction (viewpoint direction). Identity codes were created by averaging the face descriptors for all images of an identity. The gender direction was the linear discriminant line from the LDA used for gender classification. The viewpoint direction was the vector of regression coefficients for viewpoint prediction.

Figure 4B (purple) shows the average of the absolute value of cosine similarities between each PC and all identity codes. Figure 4B (teal) shows the similarity between each PC and the gender direction, and Figure 4B (yellow) shows the similarity between each PC and the viewpoint direction. These plots reveal that identity information is distributed primarily across the first 150 PCs, gender information is distributed primarily across PCs ranked 150–200, and viewpoint information is distributed primarily across PCs ranked greater than 200.
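The identity comparison can be sketched as follows. The data are synthetic, the helper names are hypothetical, and computing the identity codes from mean-centered descriptors is an assumption that the text does not confirm.

```python
import numpy as np

def identity_templates(descriptors, identity):
    """Average the descriptors of all images of each identity."""
    labels, idx = np.unique(identity, return_inverse=True)
    sums = np.zeros((len(labels), descriptors.shape[1]))
    np.add.at(sums, idx, descriptors)
    return sums / np.bincount(idx)[:, None]

def abs_cosine(directions, vectors):
    """|cosine similarity| between each direction and each vector."""
    d = directions / np.linalg.norm(directions, axis=1, keepdims=True)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return np.abs(d @ v.T)

rng = np.random.default_rng(0)
identity = np.repeat(np.arange(40), 5)
descriptors = (rng.standard_normal((40, 16))[identity]
               + 0.3 * rng.standard_normal((200, 16)))
centered = descriptors - descriptors.mean(axis=0)
_, _, pcs = np.linalg.svd(centered, full_matrices=False)  # rows are PCs

templates = identity_templates(centered, identity)
# One value per PC: mean |cosine| between that PC and all identity codes.
identity_profile = abs_cosine(pcs, templates).mean(axis=1)
```

The gender and viewpoint curves would follow from the same `abs_cosine` call, with the single LDA direction or regression weight vector passed as a one-row array in place of `templates`.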

Consistent with the effect sizes computed for each PC, as well as the attribute predictions, this result shows that identity, gender, and viewpoint are filtered roughly into subspaces ordered according to explained variance in the DCNN-generated ensemble space. This filtering reflects a prioritization of identity over gender, and of gender over viewpoint.

Figure 5: For a single example unit, absolute value of similarities between unit direction and each principal component shows confounding of unit response with identity, gender, and viewpoint (top). Density plot of similarities between the example unit and principal components associated with identity (purple), gender (blue), and viewpoint (yellow) (bottom). The distributions overlap almost completely, indicating that each type of information contributes to the unit’s activation. This finding was consistent across all unit basis vectors.

2.3 Juxtaposed Unit and Ensemble Codes

PCs capture directions that can be interpreted in terms of identity, gender, and viewpoint. How do these directions relate to the basis vectors that define the DCNN units? This will tell us whether individual units “respond preferentially” to directions that can be interpreted in terms of identity, gender, or viewpoint.

We calculated the cosine similarity between the PC directions and the unit directions (unit 1: [1, 0, 0, ..., 0]; unit 2: [0, 1, 0, ..., 0]; etc.). If a unit responds preferentially to viewpoint, gender, or identity, it will align closely with PCs related to a specific attribute. Confounding of semantically relevant information (identity, gender, viewpoint) in a unit’s response will yield a uniform distribution of similarities across the PCs.
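A small sketch makes the comparison concrete. It assumes that the unit directions are the standard basis vectors of the descriptor space and uses synthetic data. Because PCs and unit axes are both unit-length, the cosine similarity between unit j and PC k reduces to the j-th entry of PC k.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.standard_normal((200, 8))  # synthetic: images x units
centered = data - data.mean(axis=0)
_, _, pcs = np.linalg.svd(centered, full_matrices=False)  # rows: unit-length PCs

unit_axes = np.eye(8)  # unit 1 is [1, 0, ...], unit 2 is [0, 1, ...]

# |cosine similarity| between every unit axis (rows) and every PC (columns).
similarity = np.abs(unit_axes @ pcs.T)

# For each unit, the squared similarities across all PCs sum to one.
row_totals = (similarity**2).sum(axis=1)
```

Since each unit has a fixed total of squared similarity to distribute over the PCs, a unit tuned to one attribute would concentrate that total on the PCs carrying the attribute, whereas a confounded unit spreads it evenly.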

Unit responses confound identity, gender, and viewpoint. Figure 5 (top) shows a uniform distribution of similarities across PCs for a single unit. We found this for all of the 512 units (see Supplemental Information). Figure 5 (bottom) shows the histogram of these similarities, separated by attribute. Identity, gender, and viewpoint information, which are separated in the high dimensional space, are confounded in unit responses. This undermines a classic tuning analogy for units. In isolation, individual units cannot be interpreted in terms of a specific identity, gender, or viewpoint.

3 Discussion

Historically, neural codes have been characterized as either sparse or distributed. Sparse codes signal a stimulus via the activity of a small number of highly-predictive units. Distributed codes signal a stimulus via the combined activity of many weakly-predictive units. The DCNN’s identity code encompasses fundamental mechanisms of both sparse (highly predictive single units) and distributed (powerful combinations of units) codes. This unusual combination of characteristics accounts for the DCNN’s remarkable resilience to unit deletion. Superimposed on the identity representation are standard distributed codes for gender and viewpoint, and likely other subject and image variables. For these codes, ensembles, not individual units, make accurate attribute predictions.

The results reveal three distinct attribute codes (identity, gender, view) in one set of units. These codes vary in the extent to which they distribute information across units. Because multiple attribute codes share the same units, the labels “sparse” or “distributed” must specify a particular attribute. In deep layers of DCNNs, where units respond to complex combinations of low-level visual features, these shared codes may be common. If these codes exist in the primate visual system, it would likely be at higher levels of the visual processing hierarchy. In low-level visual areas (e.g., V1), neural receptive fields refer to locations in the retinotopic image, and are more likely to act as single-attribute “feature detector” codes.

Much of what appears complex in individual units is clear in the ensemble space. PCs separate attributes in the DCNN representation according to explained variance. This reflects network prioritization (identity > gender > viewpoint). PCs comprise a “special”, interpretable, rotation of the unit axes, because the face attributes are not represented equally in the face descriptors. The juxtaposition of unit and ensemble characteristics indicates that information coded by a deep network is in the representational space, not in any given projection of the space onto a particular set of axes, cf. [31].

How then are we to understand the units? The DCNN is optimized to separate identity, not to maximize the interpretability of the information that achieves identity separation. From a computational perspective, any orthogonal basis is as good as any other. Given the high dimensionality of the system and the stochastic nature of training, the likelihood of a DCNN converging on a semantically interpretable basis set is exceedingly low. Units serve the sole purpose of providing a set of basis axes that support maximal separation of identities in the space. There should be no expectation that the response of individual units be “tuned” to semantic features. In isolation, units provide little or no information about the visual code that operates in the high-dimensional space. Instead, units must be interpreted in an appropriate population-based computational framework [4, 13, 36].

How does this affect the way we interpret neural data? The literature is replete with reports of preferentially-tuned neurons in face-selective cortex. Electrophysiological recordings differentiate face patches, based on the tuning characteristics of neurons (e.g., PL: eyes, eye region, face outlines, [14]; ML: iris size, inter-eye distance, face shape, and face views [9]; AM: view-invariant identity [4, 9]). The problem with interpreting the responses of single neurons is evident when we consider what a neurophysiologist would conclude by recording from top-layer units in the network we analyzed.

First, most of these units would appear to be “identity-tuned”, preferring some identities (high activation) over others (low activation). However, our data show that each unit exhibits substantial identity-separation capacity (cf. effect sizes). Effect sizes consider the full range of responses, instead of making only a “high” versus “low” response comparison. The neural-tuning analogy obscures the possibility that individual units can contribute to identity coding with a relatively low-magnitude response. This response, in the context of the responses of other units, is information in a distributed code. A neurophysiologist would find “identity-tuned units” here (in what is, essentially, a distributed code), only because identity modulates the individual unit responses so saliently. These are not identity-tuned units; they are identity-separator units. Moreover, what separates identity in these units is not likely to be interpretable. This is due to the uncertain relationship between meaningful directions in the representational space and the arbitrary directions of the unit axes.

Second, no units would appear to be tuned to gender or viewpoint, because these attributes modulate the response of a unit only weakly in comparison to identity. As a consequence, the distributed codes that specify gender and viewpoint would be hidden, despite the fact that the ensemble of units contains enough information for accurate classification of both attributes. The hidden modulation of unit responses by viewpoint would, from a neural tuning perspective, imply that the units signal identity in a viewpoint-invariant way. This is, in fact, correct, but provides a misleading characterization of the importance of these units for encoding viewpoint.

Neurophysiological investigations of visual codes typically rely on neural-tuning data from single units in conjunction with population decoding methods. However, if DCNN-like codes exist in primate visual cortex, over-emphasis on neural-tuning functions in high-level areas may be counter-productive. Rather than characterizing neural units by the features or stimuli to which they respond [3], we should instead consider units as organizational axes in a representational space [31]. The importance of a unit lies in its utility for separating items within a class, not in interpreting the attributes for which it has a high or low activation. This requires a shift in perspective from the principles of sparse and distributed coding that, although helpful for understanding early visual processing, might not be appropriate in high-level vision.

4 Methods

4.1 Network

All reported data are from a 101-layered face-identification DCNN [24]. This network performs with high accuracy across changes in viewpoint, illumination, and expression (cf. performance on IARPA Janus Benchmark-C [IJB-C] [17]). Specifically, the network is based on the ResNet-101 [34] architecture. It was trained with the Universe dataset [1, 23], comprised of three smaller datasets (UMDFaces [2], UMDVideos [1], and MS-Celeb-1M [11]). The dataset includes 5,714,444 images of 58,020 identities. The network employs Crystal Loss (L2 Softmax) for training [23]. Crystal Loss scale factor was set to 50. ResNet-101 employs skip connections to retain the strength of the error signal over its 101-layer architecture. Once the training is complete, the final layer of the network is removed and the penultimate layer (512 units) is used as the identity descriptor. This penultimate layer is considered the “top layer” face-image representation.
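The role of the scale factor can be illustrated with a small sketch of an L2-constrained softmax loss. This is not the training code of the network: it assumes a bias-free linear classifier on top of the descriptor, and the function and argument names are hypothetical.

```python
import numpy as np

def crystal_loss(features, weights, labels, alpha=50.0):
    """Softmax cross-entropy on features rescaled to have L2 norm alpha."""
    scaled = alpha * features / np.linalg.norm(features, axis=1, keepdims=True)
    logits = scaled @ weights
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

rng = np.random.default_rng(0)
features = rng.standard_normal((8, 512))      # synthetic 512-unit descriptors
weights = 0.01 * rng.standard_normal((512, 5))  # 5 hypothetical identities
labels = np.array([0, 1, 2, 3, 4, 0, 1, 2])
loss = crystal_loss(features, weights, labels)
```

Because every descriptor is rescaled to the same norm before classification, the loss depends only on the direction of the descriptor and not on its length.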

4.2 Test Set

The test set was comprised of images from the IJB-C dataset, which contains 3,531 subjects portrayed in 31,334 still images (10,040 non-face images) and frames from 11,779 videos. For the present experiments, we used all still images in which our network could detect at least one face, and for which viewpoint information was available. In total, we selected 22,248 (9,592 female; 12,656 male) faces of 3,531 (1,503 female; 2,028 male) subjects. Note that several images contain multiple detectable faces.

4.3 Identity AUC Calculation

For face-identification, images from the test set were assigned randomly to Group A or B. AUCs were computed by comparing every image in set A (5,562 images of 3,056 identities) to every image in set B (5,562 images of 3,053 identities). In total, each AUC was computed from 30,935,844 comparisons.
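The AUC computation can be sketched directly from its definition as the probability that a same-identity pair receives a higher similarity score than a different-identity pair. The similarity values below are made up, the function name is hypothetical, and the all-pairs comparison is practical only for small examples.

```python
import numpy as np

def auc(similarity, same_identity):
    """AUC from pairwise similarity scores and same/different identity labels."""
    similarity = np.asarray(similarity, dtype=float)
    same_identity = np.asarray(same_identity, dtype=bool)
    pos = similarity[same_identity]    # same-identity comparisons
    neg = similarity[~same_identity]   # different-identity comparisons
    greater = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return greater + 0.5 * ties

# Every image in set A is compared with every image in set B.
n_comparisons = 5562 * 5562

# Toy example: two images per set, with identity labels 7 and 9.
id_a = np.array([7, 9])
id_b = np.array([7, 9])
similarity = np.array([[0.9, 0.2],
                       [0.1, 0.8]])
same = id_a[:, None] == id_b[None, :]
toy_auc = auc(similarity, same)
```

The product 5,562 × 5,562 reproduces the 30,935,844 comparisons behind each reported AUC.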

4.4 Classification

Gender

Linear discriminant analysis (LDA) was used to classify face gender for each image in the dataset. For each subset of features, an LDA classifier was trained using all images of 3,231 identities and tested on the remaining 300 identities. This process was repeated, holding out a different set of 300 identities each time, until all images were classified. Gender labels for each identity were verified by human raters. The final output values were categorical gender labels that could be compared directly to the ground-truth data.
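A minimal sketch of this hold-out scheme is given below, assuming a standard two-class LDA with a pooled covariance matrix. The helper names (`identity_folds`, `fit_lda`, `predict_lda`) and the synthetic data are hypothetical; the point is that folds are formed over identities, so no identity contributes images to both training and test sets.

```python
import numpy as np

def identity_folds(identity_ids, fold_size=300, seed=0):
    """Yield boolean test masks, each holding out up to `fold_size` identities."""
    rng = np.random.default_rng(seed)
    unique_ids = rng.permutation(np.unique(identity_ids))
    for start in range(0, len(unique_ids), fold_size):
        held_out = unique_ids[start:start + fold_size]
        yield np.isin(identity_ids, held_out)

def fit_lda(x, y):
    """Two-class LDA with a shared covariance; returns weights and bias."""
    x0, x1 = x[y == 0], x[y == 1]
    mu0, mu1 = x0.mean(axis=0), x1.mean(axis=0)
    centered = np.vstack([x0 - mu0, x1 - mu1])
    cov = centered.T @ centered / (len(x) - 2)     # pooled covariance
    w = np.linalg.pinv(cov) @ (mu1 - mu0)
    b = -0.5 * (mu0 + mu1) @ w + np.log(len(x1) / len(x0))
    return w, b

def predict_lda(x, w, b):
    return (x @ w + b > 0).astype(int)

# Synthetic stand-in: 60 identities, 4 images each, 8 sampled units.
rng = np.random.default_rng(0)
identity_ids = np.repeat(np.arange(60), 4)
gender = identity_ids % 2                          # one label per identity
features = rng.normal(size=(len(identity_ids), 8)) + 1.5 * gender[:, None]

predicted = np.full(len(identity_ids), -1)
for test_mask in identity_folds(identity_ids, fold_size=10):
    w, b = fit_lda(features[~test_mask], gender[~test_mask])
    predicted[test_mask] = predict_lda(features[test_mask], w, b)
accuracy = (predicted == gender).mean()
```

In this sketch the last fold is smaller whenever the number of identities is not a multiple of the fold size.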

Viewpoint

Viewpoint was predicted using linear regression. Linear regression models were computed using the Moore-Penrose pseudo-inverse. For each subset of features, a regression model was trained using all images of 3,231 identities and tested on the remaining 300 identities. This process was repeated, each time holding out a different set of 300 identities, until viewpoint predictions had been assigned to each image. Ground truth for viewpoint was produced by the Hyperface system [5] and was defined as the deviation from a frontal pose, measured in degrees yaw (i.e., 0 = frontal, 90 = right profile, -90 = left profile). Output predictions were continuous values corresponding to the predicted viewpoint in degrees yaw.
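The regression step can be sketched with NumPy's Moore-Penrose pseudo-inverse. Including an intercept column is an assumption, and the data below are synthetic (yaw is generated as a noisy linear function of 16 hypothetical unit responses), so the example only illustrates the fitting and held-out prediction mechanics.

```python
import numpy as np

def fit_linear(x_train, yaw_train):
    """Least-squares coefficients via the Moore-Penrose pseudo-inverse."""
    design = np.hstack([x_train, np.ones((len(x_train), 1))])  # intercept (assumed)
    return np.linalg.pinv(design) @ yaw_train

def predict_linear(x, coef):
    return np.hstack([x, np.ones((len(x), 1))]) @ coef

# Synthetic stand-in: 50 identities, 4 images each, 16 sampled units.
rng = np.random.default_rng(0)
identity_ids = np.repeat(np.arange(50), 4)
x = rng.normal(size=(len(identity_ids), 16))
true_w = rng.normal(size=16) * 10.0
yaw = x @ true_w + 5.0 + rng.normal(scale=1.0, size=len(identity_ids))

# Hold out the images of 10 identities; train on the rest.
test_mask = np.isin(identity_ids, np.arange(10))
coef = fit_linear(x[~test_mask], yaw[~test_mask])
yaw_pred = predict_linear(x[test_mask], coef)
mean_abs_error = np.abs(yaw_pred - yaw[test_mask]).mean()
```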

Permutations

Permutation tests were used to evaluate the statistical significance of the viewpoint and gender predictions. A null distribution was generated from the original data by randomly permuting values within each unit. Predictions made from the resulting permutations () were compared to the true values from each classification test. All permutation tests were significant at , with no overlap between test value and null distribution.
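The permutation scheme can be sketched as follows. Values are shuffled independently within each unit (column), which breaks the link between unit responses and labels while preserving each unit's marginal distribution. The number of permutations (`n_perm=200`) and the scoring function (held-out correlation from a least-squares fit) are placeholders chosen for this example.

```python
import numpy as np

def permute_within_units(x, rng):
    """Shuffle each column (unit) independently across images."""
    return np.column_stack([rng.permutation(x[:, j]) for j in range(x.shape[1])])

def heldout_correlation(x, y):
    """Fit on the first half, correlate predictions with truth on the second."""
    half = len(y) // 2
    design = np.hstack([x, np.ones((len(x), 1))])
    coef = np.linalg.pinv(design[:half]) @ y[:half]
    return np.corrcoef(design[half:] @ coef, y[half:])[0, 1]

def permutation_test(x, y, score_fn, n_perm=200, seed=0):
    rng = np.random.default_rng(seed)
    observed = score_fn(x, y)
    null = np.array([score_fn(permute_within_units(x, rng), y)
                     for _ in range(n_perm)])
    # One-sided p-value with the usual +1 correction.
    p_value = (1 + np.sum(null >= observed)) / (n_perm + 1)
    return observed, null, p_value

# Synthetic stand-in: 200 images, 8 units, target linearly related to units.
rng = np.random.default_rng(1)
x = rng.normal(size=(200, 8))
y = x @ np.ones(8) + rng.normal(scale=0.5, size=200)
observed, null, p_value = permutation_test(x, y, heldout_correlation)
```

When the observed score exceeds every value in the null distribution, the p-value reaches its minimum of 1 / (n_perm + 1).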

4.5 Analysis of Variance

For each face-image attribute (identity, gender, viewpoint), an ANOVA was computed for each of the 512 units in the top-level DCNN output, as well as for all 512 PCs. For each ANOVA, the independent variable was either the vector of a unit’s responses or the vector of a PC’s factor scores. Face images were used as the random variable. Identity, gender, and viewpoint were used in separate analyses as the dependent variable. For viewpoint, the absolute yaw value was binned into the following five categories: frontal, [0, 18]; near-frontal, (18, 36]; half profile, (36, 54]; near profile, (54, 72]; and profile, (72, 150]. To account for unequal group sizes, pooled sums-of-squares were used as the error term.
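The viewpoint binning and a per-unit ANOVA can be sketched as below. The sketch assumes a plain one-way design in which the attribute defines the groups, a unit's responses are the measured values, and the pooled within-group sum of squares serves as the error term; the function names and the toy numbers are illustrative only.

```python
import numpy as np

# Upper bin edges in degrees; intervals are closed on the right, e.g. (18, 36].
VIEW_EDGES = np.array([18, 36, 54, 72])
VIEW_NAMES = ["frontal", "near-frontal", "half profile", "near profile", "profile"]

def bin_viewpoint(yaw_deg):
    """Map signed yaw (degrees) to a category index 0..4 using absolute yaw."""
    return np.digitize(np.abs(yaw_deg), VIEW_EDGES, right=True)

def one_way_anova(values, groups):
    """Return (F, eta squared) for one unit's responses grouped by an attribute."""
    labels = np.unique(groups)
    grand_mean = values.mean()
    ss_between = 0.0
    ss_within = 0.0                                # pooled across groups
    for g in labels:
        v = values[groups == g]
        ss_between += len(v) * (v.mean() - grand_mean) ** 2
        ss_within += ((v - v.mean()) ** 2).sum()
    df_between = len(labels) - 1
    df_within = len(values) - len(labels)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, ss_between / (ss_between + ss_within)

bins = bin_viewpoint(np.array([0, 18, -18.5, 36, 54.1, 72, -90]))

# Tiny worked example: two groups of three responses.
f_stat, eta_sq = one_way_anova(np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0]),
                               np.array([0, 0, 0, 1, 1, 1]))
```

Applied to the real data, `one_way_anova` would be called once per unit (or PC) and per attribute, with identity, gender, or binned viewpoint as `groups`.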

Acknowledgment

Funding provided by National Eye Institute Grant R01EY029692-01 and by the Intelligence Advanced Research Projects Activity (IARPA) to AOT. This research is based upon work supported by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via IARPA R&D Contract No. 2014-14071600012 and 2019-022600002. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon.

References

  1. A. Bansal, C. D. Castillo, R. Ranjan and R. Chellappa (2017) The do’s and don’ts for cnn-based face verification. In ICCV Workshops, pp. 2545–2554. Cited by: §4.1.
  2. A. Bansal, A. Nanduri, C. D. Castillo, R. Ranjan and R. Chellappa (2017) Umdfaces: An annotated face dataset for training deep networks. In Biometrics (IJCB), 2017 IEEE International Joint Conference on, pp. 464–473. Cited by: §4.1.
  3. P. Bashivan, K. Kar and J. J. DiCarlo (2019) Neural population control via deep image synthesis. Science 364 (6439). Cited by: §3.
  4. L. Chang and D. Y. Tsao (2017) The code for facial identity in the primate brain. Cell 169 (6), pp. 1013–1028. Cited by: §3, §3.
  5. J. Chen, V. M. Patel and R. Chellappa (2016) Unconstrained face verification using deep cnn features. In Applications of Computer Vision (WACV), 2016 IEEE Winter Conference on, pp. 1–9. Cited by: §4.4.2.
  6. J. Chen, R. Ranjan, A. Kumar, C. Chen, V. M. Patel and R. Chellappa (2015) An end-to-end system for unconstrained face verification with deep convolutional neural networks. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 118–126. Cited by: §1.
  7. J. J. DiCarlo and D. D. Cox (2007) Untangling invariant object recognition. Trends in cognitive sciences 11 (8), pp. 333–341. Cited by: §1.
  8. D. Erhan, Y. Bengio, A. Courville and P. Vincent (2009) Visualizing higher-layer features of a deep network. University of Montreal 1341 (3), pp. 1. Cited by: §1.
  9. W. A. Freiwald and D. Y. Tsao (2010) Functional compartmentalization and viewpoint generalization within the macaque face-processing system. Science 330 (6005), pp. 845–851. Cited by: §3.
  10. K. Fukushima (1988) Neocognitron: a hierarchical neural network capable of visual pattern recognition. Neural networks 1 (2), pp. 119–130. Cited by: §1.
  11. Y. Guo, L. Zhang, Y. Hu, X. He and J. Gao (2016) Ms-celeb-1m: a dataset and benchmark for large-scale face recognition. In European Conference on Computer Vision, pp. 87–102. Cited by: §4.1.
  12. M. Q. Hill, C. J. Parde, C. D. Castillo, Y. I. Colon, R. Ranjan, J. Chen, V. Blanz and A. J. O’Toole (2019) Deep convolutional neural networks in the face of caricature. Nature Machine Intelligence 1 (11), pp. 522–529. Cited by: §1, §1.
  13. H. Hong, D. L. Yamins, N. J. Majaj and J. J. DiCarlo (2016) Explicit information for category-orthogonal object properties increases along the ventral stream. Nature neuroscience 19 (4), pp. 613. Cited by: §1, §1, §3.
  14. E. B. Issa and J. J. DiCarlo (2012) Precedence of the eye region in neural processing of faces. Journal of Neuroscience 32 (47), pp. 16666–16682. Cited by: §3.
  15. A. Krizhevsky, I. Sutskever and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §1.
  16. J. Y. Lettvin, H. R. Maturana, W. S. McCulloch and W. H. Pitts (1959) What the frog’s eye tells the frog’s brain. Proceedings of the IRE 47 (11), pp. 1940–1951. Cited by: §1.
  17. B. Maze, J. C. Adams, J. A. Duncan, N. D. Kalka, T. Miller, C. Otto, A. K. Jain, W. T. Niggel, J. Anderson, J. Cheney and P. Grother (2018) IARPA janus benchmark - c: face dataset and protocol. 2018 International Conference on Biometrics (ICB), pp. 158–165. Cited by: §4.1.
  18. A. J. O’Toole, C. D. Castillo, C. J. Parde, M. Q. Hill and R. Chellappa (2018) Face Space Representations in Deep Convolutional Neural Networks. Trends in cognitive sciences. Cited by: §1, §1, §1, §2.2.
  19. B. A. Olshausen and D. J. Field (1996) Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature 381 (6583), pp. 607. Cited by: §1.
  20. B. A. Olshausen and D. J. Field (1997) Sparse coding with an overcomplete basis set: a strategy employed by v1?. Vision research 37 (23), pp. 3311–3325. Cited by: §1.
  21. C. J. Parde, C. Castillo, M. Q. Hill, Y. I. Colon, S. Sankaranarayanan, J. Chen and A. J. O’Toole (2017) Face and image representation in deep cnn features. In Automatic Face & Gesture Recognition (FG 2017), 2017 12th IEEE International Conference on, pp. 673–680. Cited by: §1, §1, §1.
  22. Z. Qin, F. Yu, C. Liu and X. Chen (2018) How convolutional neural network see the world-a survey of convolutional neural network visualization methods. arXiv preprint arXiv:1804.11191. Cited by: §1, §1.
  23. R. Ranjan, A. Bansal, H. Xu, S. Sankaranarayanan, J. Chen, C. D. Castillo and R. Chellappa (2018) Crystal loss and quality pooling for unconstrained face verification and recognition. arXiv preprint arXiv:1804.01159. Cited by: §4.1.
  24. R. Ranjan, S. Sankaranarayanan, C. D. Castillo and R. Chellappa (2017) An all-in-one convolutional neural network for face analysis. In Automatic Face & Gesture Recognition (FG 2017), 2017 12th IEEE International Conference on, pp. 17–24. Cited by: §1, §4.1.
  25. M. Riesenhuber and T. Poggio (1999) Hierarchical models of object recognition in cortex. Nature neuroscience 2 (11), pp. 1019. Cited by: §1.
  26. M. Riesenhuber and T. Poggio (2000) Models of object recognition. Nature neuroscience 3 (11s), pp. 1199. Cited by: §1.
  27. S. Sankaranarayanan, A. Alavi, C. Castillo and R. Chellappa (2016) Triplet probabilistic embedding for face verification and clustering. arXiv preprint arXiv:1604.05417. Cited by: §1.
  28. F. Schroff, D. Kalenichenko and J. Philbin (2015) Facenet: a unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 815–823. Cited by: §1.
  29. T. J. Sejnowski (2020) The unreasonable effectiveness of deep learning in artificial intelligence. Proceedings of the National Academy of Sciences. ISSN 0027-8424. https://www.pnas.org/content/early/2020/01/23/1907373117.full.pdf Cited by: §1, §1.
  30. Y. Sun, X. Wang and X. Tang (2014) Deep learning face representation from predicting 10,000 classes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1891–1898. Cited by: §1.
  31. C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow and R. Fergus (2013) Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199. Cited by: §3, §3.
  32. Y. Taigman, M. Yang, M. Ranzato and L. Wolf (2014) Deepface: closing the gap to human-level performance in face verification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1701–1708. Cited by: §1.
  33. T. Valentine (1991) A unified account of the effects of distinctiveness, inversion, and race in face recognition. The Quarterly Journal of Experimental Psychology Section A 43 (2), pp. 161–204. Cited by: §1, §2.2.
  34. Y. Wen, K. Zhang, Z. Li and Y. Qiao (2016) A discriminative feature learning approach for deep face recognition. In European Conference on Computer Vision, pp. 499–515. Cited by: §4.1.
  35. D. L. Yamins, H. Hong, C. F. Cadieu, E. A. Solomon, D. Seibert and J. J. DiCarlo (2014) Performance-optimized hierarchical models predict neural responses in higher visual cortex. Proceedings of the National Academy of Sciences 111 (23), pp. 8619–8624. Cited by: §1.
  36. D. L. Yamins, H. Hong, C. F. Cadieu, E. A. Solomon, D. Seibert and J. J. DiCarlo (2014) Performance-optimized hierarchical models predict neural responses in higher visual cortex. Proceedings of the National Academy of Sciences 111 (23), pp. 8619–8624. Cited by: §3.
  37. M. D. Zeiler, G. W. Taylor and R. Fergus (2011) Adaptive deconvolutional networks for mid and high level feature learning. In 2011 International Conference on Computer Vision, pp. 2018–2025. Cited by: §1.