# Counting Everyday Objects in Everyday Scenes

## Abstract

We are interested in counting the number of instances of object classes in natural, *everyday* images. Previous counting approaches tackle the problem in restricted domains such as counting pedestrians in surveillance videos. Counts can also be estimated from outputs of other vision tasks like object detection. In this work, we build dedicated models for counting designed to tackle the large variance in counts, appearances, and scales of objects found in natural scenes. Our approach is inspired by the phenomenon of subitizing – the ability of humans to make quick assessments of counts given a perceptual signal, for small count values. Given a natural scene, we employ a divide and conquer strategy while incorporating context across the scene to adapt the subitizing idea to counting. Our approach offers consistent improvements over numerous baseline approaches for counting on the PASCAL VOC 2007 and COCO datasets. Subsequently, we study how counting can be used to improve object detection. We then show a proof of concept application of our counting methods to the task of Visual Question Answering, by studying the ‘how many?’ questions in the VQA and COCO-QA datasets.

## 1 Introduction

We study the scene understanding problem of counting common objects in natural scenes. That is, given for example the image in Figure 1, we want to count the number of everyday object categories present in it: for example 4 *chairs*, 1 *oven*, 1 *dining table*, 1 *potted plant* and 3 *spoons*. Such an ability to count seems innate in humans (and even in some animals [10]). Thus, as a stepping stone towards Artificial Intelligence (AI), it is desirable to have intelligent machines that can count.

Similar to scene understanding tasks such as object detection [43] and segmentation [5] which require a fine-grained understanding of the scene, object counting is a challenging problem that requires us to reason about the number of instances of objects present while tackling scale and appearance variations.

Another closely related vision task is visual question answering (VQA), where the task is to answer free-form natural language questions about an image. Interestingly, questions related to the count of a particular object (e.g., *How many red cars do you see?*) form a significant portion of the questions asked in common visual question answering datasets [3]. Moreover, we observe that end-to-end networks [3] trained for this task do not perform well on such counting questions. This is not surprising, since the objective is often set up to minimize the cross-entropy classification loss for the correct answer to a question, which ignores the ordinal structure inherent to counting. In this work we systematically benchmark how well current VQA models do at counting, and study the benefits of dedicated counting models on a subset of counting questions in VQA datasets in Section 5.4.

Counts can also be used as complementary signals to aid other vision tasks like detection. If we had an estimate of how many objects were present in an image, we could use that information on a per-image basis to detect that many objects. Indeed, we find that our object counting models improve object detection performance.

We first describe some baseline approaches to counting and subsequently build towards our proposed model.

**Counting by Detection:** It is easy to see that perfect detection of objects would imply a perfect count. While detection is sufficient for counting, localizing objects is not necessary. Imagine a scene containing a number of mugs kept on a table, where the objects occlude each other. In order to count the number of mugs, we need not determine where they are with pixel-accurate segmentations or detections (which is hard in the presence of occlusions), as long as, say, we can determine the number of handles. Relieving the burden of detecting objects is also effective for counting when objects occur at smaller scales where detection is hard [18]. However, counting by detection, or `detect`, still forms a natural approach for counting.

**Counting by Glancing:** Representations extracted from deep Convolutional Neural Networks [42] trained on image classification have been successfully applied to a number of scene understanding tasks such as fine-grained recognition [12], scene classification [12], object detection [12], etc. We explore how well features from a deep CNN perform at counting through instantiations of our glancing (`glance`) models, which estimate a global count for the entire scene in a single forward pass. This can be thought of as estimating the count in one shot, or at a *glance*. This is in contrast with `detect`, which sequentially increments its count with each detected object (Figure 2). Note that unlike detection, which optimizes for a localization objective, the `glance` models explicitly *learn* to count.

**Counting by Subitizing:** Subitizing is a widely studied phenomenon in developmental psychology [8] which indicates that children have an ability to directly map a perceptual signal to a numerical estimate, for a small number of objects (typically 1-4). Subitizing is crucial for development and assists arithmetic and reasoning skills. An example of subitizing is how we are able to figure out the number of pips on a face of a die without having to count them or how we are able to reason about tally marks.

Inspired by subitizing, we devise a new counting approach which adopts a divide and conquer strategy, using the additive nature of counts. Note that `glance` can be thought of as an attempt to subitize from a glance of the image. However, as illustrated in Figure 2 (center), subitizing is difficult for humans at high counts. We therefore divide the image into non-overlapping cells (Figure 2, right), subitize in each cell, and add the cell estimates to get the total count. We call this method associative subitizing, or `aso-sub`.

In practice, to implement this idea on real images, we incorporate context across the cells while sequentially subitizing in each one of them. We call this sequential subitizing, or `seq-sub`. For each of these cells we curate real-valued ground truth, which helps us deal with scale variations. Interestingly, we find that by incorporating context, `seq-sub` significantly outperforms the naive subitizing model `aso-sub` described above (see Section 5.1 for more details).

**Counting by Ensembling:** It is well known that when humans are given counting problems with large ground truth counts (e.g., counting the number of pebbles in a jar), individual guesses have high variance, but an average across multiple responses tends to be surprisingly close to the ground truth. This phenomenon is popularly known as the wisdom of the crowd [16]. Inspired by this, we create an ensemble of counting methods (`ens`).

In summary, we evaluate several natural approaches to counting, and propose a novel context and subitizing based counting model. Then we investigate how counting can improve detection. Finally, we study counting questions (‘how many?’) in the Visual Question Answering (VQA) [3] and COCO-QA [35] datasets and provide some comparisons with the state-of-the-art VQA models.

## 2 Related Work

Counting problems in niche settings have been studied extensively in computer vision [45]. [7] explores a Bayesian Poisson regression method on low-level features for counting in crowds. [6] segments a surveillance video into components of homogeneous motion and regresses to counts in each region using Gaussian Process regression. Since surveillance scenes tend to be constrained and highly occluded, counting by detection is infeasible; thus density based approaches are popular. Lempitsky and Zisserman [27] count people by estimating object density using low-level features, with applications to surveillance and cell counting in biological images. Anchovi Labs provided users with interactive services to count specific objects such as swimming pools in satellite images, cells in biological images, etc. More recent work constructs CNN-based models for crowd counting [45] and penguin counting [4] using lower level convolutional features from shallower CNN models.

Counting problems in constrained settings have a fundamentally different set of challenges from the counting problem we study in this paper. In surveillance, for example, the challenge is to estimate counts accurately when the ground truth counts are large and there might be significant occlusion. In the counting problem on everyday scenes, a larger challenge is the intra-class variance in everyday objects, and high sparsity (most images will have a count of 0 for most object classes). Thus we need a qualitatively different set of tools to solve this problem.

Other recent work [46] studies the problem of salient object subitizing (SOS). This is the task of counting the number of salient objects in the image (independent of the category). In contrast, we are interested in counting the number of instances of objects per category. Unlike Zhang et al. [46], who use SOS to improve salient object detection, we propose to improve generic object detection using counts. Our VQA experiments to diagnose counting performance are also similar in spirit to recent work that studies how well models perform on specific question categories (counting, attribute comparison, etc.) [22] or on compositional generalization [2].

## 3 Approach

Our task is to accurately count the number of instances of different object classes in an image. For training, we use datasets where we have access to object annotations such as object bounding boxes and category-wise counts. The count predictions from the models are evaluated using the metrics described in Section 4.2. The inputs to the `glance`, `aso-sub` and `seq-sub` models are `fc7` features from a VGG-16 [42] CNN model. We experiment with both off-the-shelf classification weights from ImageNet [38] and the detection fine-tuned weights from our `detect` models.

### 3.1 Detection (`detect`)

We use the Fast R-CNN [18] object detector to count. Detectors typically perform two post-processing steps on a set of preliminary boxes: non-maximum suppression (NMS) and score thresholding. NMS discards highly overlapping, likely redundant detections (using an overlap threshold), whereas the score threshold filters out all detections with low scores.

We steer the detector to count better by varying these two hyperparameters to find the setting where the counting error is lowest. We pick these parameters per category using grid search on a held-out val set: we first fix the NMS threshold to 0.3 for all classes and vary the score threshold between 0 and 1; we then fix the score threshold at its best value and vary the NMS threshold from 0 to 1.
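The sketch below illustrates the score-threshold part of this search; it is a minimal illustration under our own naming conventions, not the paper's implementation, and assumes post-NMS detection scores for one category are available per image.

```python
import numpy as np

def count_rmse(pred_counts, gt_counts):
    """Root mean squared error between predicted and ground-truth counts."""
    pred = np.asarray(pred_counts, dtype=float)
    gt = np.asarray(gt_counts, dtype=float)
    return np.sqrt(np.mean((pred - gt) ** 2))

def tune_score_threshold(det_scores, gt_counts, thresholds=np.linspace(0.0, 1.0, 101)):
    """Pick the score threshold that minimizes counting RMSE on a val set.

    det_scores: list over images; each entry is an array of post-NMS
                detection scores for one category.
    gt_counts:  ground-truth count of that category in each image.
    """
    best_t, best_err = None, np.inf
    for t in thresholds:
        counts = [np.sum(scores >= t) for scores in det_scores]
        err = count_rmse(counts, gt_counts)
        if err < best_err:
            best_t, best_err = t, err
    return best_t, best_err

# Toy detections for a single category on three val images.
det_scores = [np.array([0.9, 0.7, 0.2]), np.array([0.95]), np.array([])]
gt_counts = [2, 1, 0]
t, err = tune_score_threshold(det_scores, gt_counts)
print(f"best score threshold {t:.2f}, val RMSE {err:.3f}")
```

The NMS threshold can be tuned in exactly the same way once the score threshold is fixed.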

### 3.2 Glancing (`glance`)

Our `glance` approach repurposes a generic CNN architecture for counting by training a multi-layer perceptron (MLP) with an L2 loss to regress to image-level counts from deep representations extracted from the CNN. The MLP has batch normalization [20] and Rectified Linear Unit (ReLU) activations between hidden layers. The models are trained with a fixed learning rate and weight decay set to 0.95. We experiment with a single hidden layer and with two hidden layers for the MLP, as well as with the sizes of the hidden units. More details and ablation studies can be found in the appendix.
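A minimal sketch of such a glance regressor is shown below (in PyTorch, whereas the original models were built in Torch; the layer sizes, the number of categories, and the optimizer settings are illustrative assumptions, not the exact configuration used in the paper).

```python
import torch
import torch.nn as nn

class GlanceMLP(nn.Module):
    """Regress per-category counts from a global fc7 feature (one hidden layer)."""
    def __init__(self, feat_dim=4096, hidden=1024, n_classes=20):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden),
            nn.BatchNorm1d(hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, n_classes),   # one count per category
        )

    def forward(self, x):
        return self.net(x)

model = GlanceMLP()
criterion = nn.MSELoss()                    # L2 regression loss on counts
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# One toy training step on random features / counts.
fc7 = torch.randn(8, 4096)                  # batch of image-level fc7 features
counts = torch.randint(0, 5, (8, 20)).float()
loss = criterion(model(fc7), counts)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```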

### 3.3 Subitizing (`aso-sub`, `seq-sub`)

In our *subitizing* inspired methods, we divide our counting problem into sub-problems on each cell in a non-overlapping grid, and add the predicted counts across the grid.

In practice, since objects in real images occur at different scales, such cells might contain fractions of an object. We adjust for this by allowing for real-valued ground truth. If a cell overlapping an object is very small compared to the object, the small fractional count of the cell might be hard to estimate. On the other hand, if a cell is too large compared to the objects present, it might be hard to estimate the large integer count of the cell (see Figure 3). This trade-off suggests that at some canonical resolution, we would be able to count the smaller objects more easily by subitizing them, as well as predict the partial counts for larger objects. More concretely, we divide the image $I$ into a set of non-overlapping cells $\{S_1, \dots, S_k\}$ such that $\bigcup_{t} S_t = I$ and $S_t \cap S_{t'} = \emptyset$ for $t \neq t'$. Given such a partition of the image and the associated per-cell CNN features $\{x_1, \dots, x_k\}$, we now explain our models based on this approach:

**`aso-sub`:** Our naive `aso-sub` model treats each cell independently to regress to the real-valued ground truth. We train on an augmented version of the dataset whose size is $k$-fold larger ($k$ cells per image). Unlike `glance`, where a feature extracted on the full image is used to regress to integer-valued counts, `aso-sub` models regress to real-valued counts on non-overlapping cells from features extracted per cell. Given class instance annotations as bounding boxes $B = \{b_1, \dots, b_m\}$ for a category in an image $I$, we compute the ground truth partial counts $y_t$ for the grid cells $S_t$ to be used for training as follows:

$$y_t = \sum_{j=1}^{m} \frac{|b_j \cap S_t|}{|b_j|}$$

That is, we compute the intersection of each box with the cell and add up the intersections, each normalized by the area $|b_j|$ of its box. Further, given the cell-level count predictions $\tilde{y}_t$, the image-level count prediction is computed as $\tilde{y} = \sum_{t} \max(0, \tilde{y}_t)$, where the max filters out negative predictions.
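Below is a small sketch of this ground-truth construction, assuming axis-aligned boxes and an equally sized grid; the variable names and the example boxes are illustrative.

```python
import numpy as np

def partial_counts(boxes, grid=(3, 3), img_w=640, img_h=480):
    """Real-valued ground-truth counts for each cell of a non-overlapping grid.

    Each box contributes |box ∩ cell| / |box| to a cell, so its contributions
    over the whole grid sum to 1 and the grid total equals the image count.
    boxes: list of (x1, y1, x2, y2) for one category.
    """
    gy, gx = grid
    counts = np.zeros((gy, gx))
    xs = np.linspace(0, img_w, gx + 1)
    ys = np.linspace(0, img_h, gy + 1)
    for (x1, y1, x2, y2) in boxes:
        area = max((x2 - x1) * (y2 - y1), 1e-6)
        for i in range(gy):
            for j in range(gx):
                iw = max(0.0, min(x2, xs[j + 1]) - max(x1, xs[j]))
                ih = max(0.0, min(y2, ys[i + 1]) - max(y1, ys[i]))
                counts[i, j] += iw * ih / area
    return counts

cells = partial_counts([(100, 100, 300, 400), (500, 50, 620, 200)])
print(cells.round(2), "image-level count:", cells.sum().round(2))
```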

We experiment with dividing the image into equally sized non-overlapping grids at several resolutions. The architecture of the models trained on the augmented dataset is the same as `glance`. For more details, refer to the appendix.

**`seq-sub`:** We motivate our proposed `seq-sub` (sequential subitizing) approach by identifying a potential flaw in the naive `aso-sub` approach. Figure 4 reveals the limitation of the `aso-sub` model: if the cells are treated independently, the model is unaware of the partial presence of the concerned object in other cells. This leads to situations where similar visual signals need to be mapped to both partial and whole presence of the object in a cell (see Figure 4). This is especially pathological since Huber or L2 losses cannot capture this multi-modality in the output space: the implicit density associated with such losses is either Laplacian or Gaussian.

Interestingly, a simple way to mitigate this issue is to model context, which resolves the ambiguity in counts. That is, if we knew about the partial class presence in other cells, we could use that information to predict the correct cell count. Thus, although the independence assumption in `aso-sub` is convenient, it ignores the fact that the augmented dataset is not IID. While it is important to reason at a cell level, it is also necessary to be aware of the global image context to produce meaningful predictions. In essence, we propose `seq-sub`, which takes the best of both worlds from `glance` and `aso-sub`.

The architecture of `seq-sub` is shown in Figure 5. It consists of a pair of 2-layer stacked bi-directional sequence-to-sequence LSTMs [40]. We incorporate context across cells as

$$\tilde{y}_t = f\big(\mathcal{G}(h_1, \dots, h_k)_t\big),$$

where the individual $h_t = g(x_t; W)$ are hidden-layer representations of each cell feature $x_t$ with parameters $W$, and $\mathcal{G}$ is the mechanism that captures context. This can be broken down as follows. Let $H$ be the set containing the $h_t$'s, and let $H_1$ and $H_2$ be two ordered sets which are permutations of $H$ based on two particular sequence structures. The (traversal) sequences, as we move across the grid in the feature column, are decided on the basis of nearness of cells (see Figure 5); one of the two orderings we experiment with is a "Z"-shaped raster traversal of the grid. Each of these feature sequences is then fed to a pair of stacked Bi-LSTMs, and the corresponding cell output states are concatenated to obtain a context vector $v_t$ for each cell. The cell counts are then obtained as $\tilde{y}_t = f(v_t)$; the composition of the Bi-LSTMs and the concatenation implements $\mathcal{G}$.

We use a Huber loss objective to regress to the count values, with a fixed learning rate and weight decay set to 0.95. For optimization, we use Adam [24] with a minibatch size of 64. The ground-truth construction procedure for training and the count aggregation procedure for evaluation are as defined for `aso-sub`.

## 4 Experimental Setup

### 4.1 Datasets

We experiment with two datasets depicting everyday objects in everyday scenes: the PASCAL VOC 2007 [13] and COCO [28].

The PASCAL VOC 2007 dataset provides train, val and test splits and has 20 object categories; the COCO dataset provides train and val splits and has 80 object categories. On PASCAL, we use the val set as our Count-val set and the test set as our Count-test set. On COCO, we use the first half of val as the Count-val set and the second half of val as the Count-test set. The most frequent count per object category is 0 (as one would expect in everyday scenes). Figure 6 shows a histogram of non-zero counts across all object categories. It can be clearly seen that although the two datasets have a fair amount of count variability, there is a clear bias towards lower count values. Note that this is unlike crowd-counting datasets, in particular [19], where the mean count is much higher and, unlike PASCAL and COCO, the images have very little scale and appearance variation among objects.

### 4.2 Evaluation

We adopt the root mean squared error (RMSE) as our metric. We also evaluate on a variant of RMSE that might be better suited to human perception. The intuition behind this metric is as follows. In a real-world scenario, humans tend to perceive counts on a logarithmic scale [11]. That is, a mistake of 1 for a ground truth count of 2 might seem egregious, but the same mistake for a ground truth count of 25 might seem reasonable. Hence we scale each deviation by a function of the ground truth count.

We first post-process the count predictions from each method by thresholding counts at 0 and rounding predictions to the closest integers to get predictions $\hat{c}_i$. Given these predictions and ground truth counts $c_i$ for a category and image $i$, we compute RMSE as:

$$\text{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(c_i - \hat{c}_i\right)^2}$$

and relative RMSE as:

$$\text{relRMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\frac{\left(c_i - \hat{c}_i\right)^2}{c_i + 1}}$$

where $N$ is the number of images in the dataset. We then average the error across all categories to report numbers on the dataset (**mRMSE** and **m-relRMSE**).

We also evaluate the above metrics only on ground truth instances with non-zero counts. This reflects more clearly how accurate the counts produced by a method are, beyond predicting absence.
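A small sketch of these metrics for a single category is given below; it assumes the $(c_i + 1)$ normalizer written in the relRMSE equation above.

```python
import numpy as np

def count_metrics(pred, gt):
    """RMSE and relative RMSE for one category over N images.

    pred, gt: arrays of per-image counts; predictions are clipped at 0 and
    rounded to the nearest integer before scoring, as described above.
    """
    p = np.rint(np.clip(np.asarray(pred, dtype=float), 0, None))
    g = np.asarray(gt, dtype=float)
    rmse = np.sqrt(np.mean((p - g) ** 2))
    rel_rmse = np.sqrt(np.mean((p - g) ** 2 / (g + 1.0)))   # dampen error at large counts
    return rmse, rel_rmse

rmse, rel = count_metrics([0.4, 2.7, 10.0], [0, 3, 14])
print(rmse, rel)
```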

### 4.3 Methods and Baselines

We compare our approaches to the following baselines:

- `always-0`: predict the most frequent ground truth count (0).
- `mean`: predict the average ground truth count on the Count-val set.
- `always-1`: predict the most frequent non-zero value (1) for all classes.
- `category-mean`: predict the average count *per* category on Count-val.
- `gt-class`: treat the *ground truth* counts as classes and predict the counts using a classification model trained with cross-entropy loss.

We evaluate the following variants of counting approaches (see Section 3 for more details):

- `detect`: We compare two methods for `detect`. The first finds the best NMS and score thresholds as explained in Section 3.1. The second uses vanilla Fast R-CNN as it comes out of the box, with the default NMS and score thresholds.
- `glance`: We explore the following choices of features: (1) vanilla classification `fc7` features (`noft`), (2) detection fine-tuned `fc7` features (`ft`), (3) `fc7` features from a CNN trained to perform Salient Object Subitizing (`sos`) [46], and (4) flattened `conv-3` features from a CNN trained for classification.
- `aso-sub`, `seq-sub`: We examine three choices of grid sizes (Section 3.3) and the `noft` and `ft` features as above.
- `ens`: We take the best performing subset of methods and average their predictions to perform counting by ensembling (a minimal sketch of this averaging follows below).
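A minimal sketch of the ensembling step, assuming per-image, per-category count predictions from each method (the array shapes below are illustrative):

```python
import numpy as np

def ensemble_counts(*method_preds):
    """Average per-image, per-category count predictions from several methods."""
    return np.mean(np.stack(method_preds, axis=0), axis=0)

glance_pred = np.array([[1.2, 0.1], [3.4, 0.0]])   # 2 images x 2 categories
seqsub_pred = np.array([[0.8, 0.0], [4.1, 0.2]])
print(ensemble_counts(glance_pred, seqsub_pred))
```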

## 5 Results

All the results presented in the paper are averaged over 10 random splits of the test set sampled with replacement.

### 5.1 Counting Results

**PASCAL VOC 2007:** We first present results (Table ?) for the best performing variants (picked based on the val set) of each method. We see that `seq-sub` outperforms all other methods. `glance` and `detect` perform equally well as per both metrics, while `glance` does slightly better on both metrics when evaluated on non-zero ground truth counts. To put these numbers in perspective, we find that the difference between `seq-sub` and `aso-sub` leads to a difference of 0.19% mean F-measure in our counting-to-improve-detection application (Section 5.3). We also experiment with `conv3` features to regress to the counts, similar to Zhang et al. [45]. We find that `conv3` gets an mRMSE of 0.63, which is much worse than `fc7`. We also tried PCA on the `conv3` features, but that did not improve performance. This indicates that our counting task is indeed more high level and needs to reason about objects rather than low-level textures. We also compare our approach with the SOS model [46] by extracting fc7 features from a model trained to perform category-independent salient object subitizing. We observe that our best performing `glance` setup using ImageNet-trained VGG-16 features outperforms the one using SOS features. This is intuitive since SOS is a category-independent task, while we want to count the number of object instances of each category. Finally, we observe that the performance increment from `aso-sub` to `seq-sub` is not statistically significant. We hypothesize that this is because of the smaller size of the PASCAL dataset. Note that we get more consistent improvements on COCO (Table ?), which is not only a larger dataset, but also contains scenes that are contextually richer.^{1}

**COCO:** We present results for the best performing variants (picked based on the val set) of each method. The results are summarized in Table ?. We find that `seq-sub` does the best on both mRMSE and m-relRMSE, as well as their non-zero variants, by a significant margin. A comparison indicates that the `always-0` baseline does better on COCO than on PASCAL. This is because COCO has many more categories than PASCAL; thus, the chances of any particular object being present in an image decrease compared to PASCAL. The performance jump from `aso-sub` to `seq-sub` here is much larger than on PASCAL. Recent work by Ren and Zemel [36] on instance segmentation also reports counting performance on two COCO categories, *person* and *zebra*.^{2}

For both PASCAL and COCO, we observe that while `ens` outperforms other approaches in some cases, it does not always do so. We hypothesize that this is due to the poor performance of `glance`. For detailed ablation studies on `ens`, see the appendix.

### 5.2 Analysis of the Predicted Counts

**Count versus Count Error:** We analyze the performance of each of the methods at different count values on the COCO Count-test set (Figure 7). We pick each count value on the x-axis and compute the mRMSE over all the instances at that count value. Interestingly, we find that the *subitizing* approaches work well across a range of count values. This supports our intuition that `aso-sub` and `seq-sub` are better able to capture partial counts (from larger objects) as well as integer counts (from smaller objects), which is intuitive since larger counts are likely to occur at smaller scales. Of the two approaches, `seq-sub` works better, likely because reasoning about global context helps capture part-like features better than `aso-sub`. This is quite clear when we look at the performance of `seq-sub` compared to `aso-sub` in the count range 11 to 15. For lower count values, `ens` does the best (Figure 7). We can see that for larger counts, the `glance` and `detect` performances start tailing off.

**Detection:** We tune the hyperparameters of Fast R-CNN to find the setting where the counting error is lowest on the Count-val splits of the datasets. We show some qualitative examples of the detection ground truth, the performance without tuning for counting (using black-box Fast R-CNN), and the performance after tuning for counting on the PASCAL dataset in Figure 8. We use untuned Fast R-CNN at a score threshold of 0.8 and an NMS threshold of 0.3, as used by Girshick [18] in their demo. At this configuration, it achieves an mRMSE of 0.52 on the Count-test split of COCO. We find that we achieve a gain of 0.02 by tuning the hyperparameters for `detect`.

**Subitizing:** We next analyze how different design choices in `aso-sub` affect performance on PASCAL. We pick the best performing `aso-sub``-ft-1L-3\times3` model and vary the grid sizes (as explained in Section 4). We observe that for `aso-sub` an intermediate grid size performs best, and that performance deteriorates significantly as the grid becomes finer (Figure 9)^{3}, approaching the `glance` and `detect` settings. However, we notice that for `seq-sub` this sweet spot lies farther out to the right, at finer grids.

### 5.3 Counting to Improve Detection

We now explore whether counting can help *improve* detection performance (on the PASCAL dataset). Detectors are typically evaluated via the Average Precision (AP) metric, which involves a full sweep over the range of score thresholds for the detector. While this is a useful investigative tool, in any real application (say autonomous driving), the detector must make hard decisions at some fixed threshold. This threshold could be chosen on a per-image or per-category basis. Interestingly, if we knew *how many* objects of a category are present, we could simply set the threshold so that that many objects are detected, similar to Zhang et al. [46]. Thus, we could use per-image, per-category counts as a prior to improve detection.

Note that since our goal is to intelligently pick a threshold for the detector, computing AP (which involves a sweep over thresholds) is not applicable. Hence, to quantify detection performance, we first assign to each detected box the ground truth box with which it has the highest overlap. Then, for each ground truth box, we check if any detection box has greater than 0.5 overlap. If so, we assign a match between the ground truth and the detection, and take them out of the pool of detections and ground truths. Through this procedure, we obtain a set of true positive and false positive detection outputs. With these outputs we compute the precision and recall values for the detector. Finally, we compute the F-measure as the harmonic mean of these precision and recall values, and average the F-measure values across images and categories. We call this the **mF** (mean F-measure) metric. As a baseline, we use the Fast R-CNN detector after NMS and sweep over the score thresholds for each category on the validation set to find the threshold that maximizes F-measure for that category. We call this the `base` detector.
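The sketch below illustrates one simple greedy variant of this matching and F-measure computation, along with using a predicted count to decide how many post-NMS boxes to keep; it is an illustration under our own naming, not the exact evaluation code.

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def f_measure(dets, gts, iou_thresh=0.5):
    """Greedy matching of detections to ground truth, then F-measure.
    dets: boxes kept after thresholding (highest score first)."""
    unmatched = list(gts)
    tp = 0
    for d in dets:
        if not unmatched:
            break
        best = max(range(len(unmatched)), key=lambda i: iou(d, unmatched[i]))
        if iou(d, unmatched[best]) >= iou_thresh:
            tp += 1
            unmatched.pop(best)            # each ground truth matched at most once
    prec = tp / max(len(dets), 1)
    rec = tp / max(len(gts), 1)
    return 0.0 if tp == 0 else 2 * prec * rec / (prec + rec)

def keep_top_k(boxes, scores, k):
    """Use a predicted count k to decide how many (post-NMS) boxes to keep."""
    order = np.argsort(scores)[::-1][:k]
    return [boxes[i] for i in order]
```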

With a fixed per-category score threshold, the `base` detector gets a performance of 15.26% mF. Using ground truth counts to select thresholds, we get a best-case `oracle` performance of 20.17% mF. Finally, we pick the outputs of the `ens` and `seq-sub``-ft` models and use the counts from each of these to set separate thresholds. Our counting methods undercount more often than they overcount^{4}; hence, in such cases we retain the `base` thresholds, and for the other predicted counts we use the counts to set the thresholds. With this procedure, we get gains of 1.64% mF and 1.74% mF over the `base` performance using `ens` and `seq-sub``-ft` predictions respectively. Thus, counting can be used as a complementary signal to aid detector performance, by intelligently picking the detector threshold in an image-specific manner.

### 5.4 VQA Experiment

We explore how well our counting approaches do on simple counting questions. Recent work [3] has explored the problem of answering free-form natural language questions about images. One of the large-scale datasets in this space is the Visual Question Answering (VQA) [3] dataset. We also evaluate on the COCO-QA dataset from [35], which automatically generates questions from human captions. Around 10.28% and 7.07% of the questions in VQA and COCO-QA, respectively, are "how many" questions related to counting objects. Note that both datasets use images from the COCO [28] dataset. We apply our counting models, along with some basic natural language pre-processing, to answer some of these questions.

Given the question "how many bottles are there in the fridge?", we need to reason about the object of interest (bottles) and understand referring expressions (in the fridge). Note that since these questions are free form, the category of interest might not exactly correspond to a COCO category. We tackle this ambiguity by using word2vec embeddings [32]. Given a free-form natural language question, we extract the noun from the question and compute the closest COCO category by checking the similarity of the noun with the categories in the word2vec embedding space. In case of multiple nouns, we retain the first noun in the sentence (since "how many" questions typically have the subject noun first). We then run the counting method for that COCO category (see Figure 10). More details can be found in the supplementary. Note that parsing referring expressions is still an open research problem [23]. Thus, we filter questions based on an "oracle" for resolving referring expressions. This oracle is constructed by checking whether the ground truth count of the COCO category we resolve using word2vec matches the answer to the question. Evaluating only on these questions allows us to isolate errors due to inaccurate counts. We evaluate our outputs using the RMSE metric (Section 4.2). We use this procedure to compile lists of 1774 and 513 questions (**Count-QA**) from the VQA and COCO-QA datasets, respectively, to evaluate on. We will publicly release our Count-QA subsets to help future work.
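A toy sketch of the noun-to-category mapping is shown below; the noun extraction here is deliberately crude (the actual pipeline uses NLTK part-of-speech tagging, see the appendix), and the embeddings and category list are made-up placeholders.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))

def question_to_category(question, word_vecs, coco_categories):
    """Map a 'how many ...?' question to the closest COCO category.

    word_vecs: dict word -> embedding (e.g. loaded from a word2vec model);
    the first noun of the question is used as the query, as described above.
    """
    # crude noun pick: first word after "how many" that has an embedding
    tokens = question.lower().replace("?", "").split()
    nouns = [t for t in tokens[2:] if t in word_vecs]   # skip "how many"
    if not nouns:
        return None
    query = word_vecs[nouns[0]]
    return max(coco_categories, key=lambda c: cosine(query, word_vecs[c]))

# toy 3-d embeddings for illustration only
vecs = {"bottles": np.array([1.0, 0.1, 0.0]), "bottle": np.array([1.0, 0.0, 0.0]),
        "cup": np.array([0.2, 1.0, 0.0]), "fridge": np.array([0.0, 0.2, 1.0])}
print(question_to_category("how many bottles are there in the fridge?",
                           vecs, ["bottle", "cup"]))
```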

We report performances in Table ?. The trend of increasing performance from `glance` to `ens` is visible. We find that `seq-sub` significantly outperforms the other approaches. We also evaluate a state-of-the-art VQA model [15] on the Count-QA VQA subset and find that even `glance` does better by a substantial margin.^{5}

## 6 Conclusion

We study the problem of counting *everyday* objects in *everyday* scenes. We evaluate some baseline approaches to this problem using object detection, regression using global image features, and associative subitizing, which involves regression on non-overlapping image cells. We propose sequential subitizing, a variant of the associative subitizing model which incorporates context across cells using a pair of stacked bi-directional LSTMs. We find that our proposed models lead to improved performance on the PASCAL VOC 2007 and COCO datasets. We thoroughly evaluate the relative strengths, weaknesses and biases of our approaches, providing a benchmark for future approaches to counting, and show that an ensemble of our proposed approaches performs the best. Further, we show that counting can be used to improve object detection and present proof-of-concept experiments on answering 'how many?' questions in visual question answering tasks. Our code and datasets will be made publicly available.

**Acknowledgements.** We are grateful to the developers of Torch [9] for building an excellent framework. This work was funded in part by NSF CAREER awards to DB and DP, ONR YIP awards to DP and DB, ONR Grant N00014-14-1-0679 to DB, a Sloan Fellowship to DP, ARO YIP awards to DB and DP, an Allen Distinguished Investigator award to DP from the Paul G. Allen Family Foundation, Google Faculty Research Awards to DP and DB, Amazon Academic Research Awards to DP and DB, and NVIDIA GPU donations to DB. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the U.S. Government, or any sponsor.

## Appendix

- In Section 8, we report results of ablation studies conducted on the Count-val split for the `glance`, `aso-sub` and `seq-sub` models.
- In Section 9, we report some analyses of the count predictions generated by our models, specifically comparing object sizes with count performance and overcounting versus undercounting statistics.
- In Section 10, we show results of occlusion studies performed to identify the regions of interest in the scene while estimating counts.
- In Section 11, we present some details of the VQA experiment performed in Section 5.4 of the main paper.
- In Section 12, we present some qualitative examples of the predictions generated by our models.

## 8 Ablation Studies

We explain the architectures for `glance`, `aso-sub` and `seq-sub` in Section 3. Here we report results of some ablation studies conducted on these architectures.

For `glance` and `aso-sub`, we search over the following architecture space. First, we vary the hidden layer sizes. Second, we vary the number of hidden layers in the model between 1 and 2 (with the previously selected hidden layer size). Corresponding to these settings, we search for the best performing architecture for `ft` (detection fine-tuned fc7) and `noft` (classification fc7) features extracted from PASCAL images. For `aso-sub`, in addition to this, we look for the best performing architecture across different grid sizes. We narrow down to some design choices and then compare the different grid sizes.

For `seq-sub`, we vary the number of Bi-LSTM (context aggregator) units per sequence. Subsequently, we vary the grid size. We report studies on both PASCAL and COCO.

All results are reported on the Count-val splits of the concerned datasets.

**`glance`:** We find that the performance for `ft-1L` remains more or less constant as we change the size of the hidden layers (Figure 11). In contrast, the `noft-1L` model does best at smaller hidden layer sizes. A two-hidden-layer `noft` model does better than both `1L` models. Intuitively, this makes sense since the `noft` features are better suited to global image statistics than the detection fine-tuned `ft` features.

**`aso-sub`-`3\times3`:** We next contrast different design choices for `aso-sub`-`3\times3`. Details of how performance changes with different grid sizes in `aso-sub` are discussed in the main paper. In particular, just as in the previous section, we study the impact of hidden layer sizes and the number of hidden layers, as well as the choice of features (`ft` vs `noft`) for the `aso-sub`-`3\times3` model (Figure 12). We find that the detection fine-tuned `ft` features do much better than the classification features `noft` for `aso-sub`. This is likely because the `ft` features are better adapted to the statistics of local image regions than the `noft` image classification features. We also find that increasing the number of hidden layers does not improve performance over using a single hidden layer, unlike `glance`.

**`aso-sub`:** We next compare how the performance of `aso-sub` varies as we change the size of the grid. We pick the best performing `aso-sub`-`3\times3` features (`ft`) and number of hidden layers (1). We then vary the size of the hidden layer and compare the performance of the different grid sizes for `aso-sub` (Figure 13). We find that two of the grid sizes do much better than the third, with only a small difference between the two best. A similar comparison on the Count-test set can be found in the main paper.

**`seq-sub`:** In Figure 14 and Figure 15, we compare the effect of changing the grid size for the `seq-sub` models, using both `ft` and `noft` features extracted from PASCAL and COCO images. On PASCAL (Figure 14), we observe that increasing the grid size gives a slight improvement in performance for both `ft` and `noft` features, unlike COCO (Figure 15), where increasing the grid size leads to a drop in performance for both `ft` and `noft` features. We should also note that, in general, `ft` features perform better than `noft` features, similar to `aso-sub`.

We also varied the number of Bi-LSTM (context aggregator) units per sequence in the `seq-sub` architectures. We observe that for `ft` features, changing the number of Bi-LSTM units does not make a difference on either PASCAL or COCO. However, for `noft` features, the number of units does affect performance on both COCO and PASCAL.

## 9 Count Analysis

**Size versus Count Error:** We compare `seq-sub`, `glance`, `aso-sub` and `detect` on their performance for object categories of various sizes on PASCAL (Figure 16) and COCO (Figure 17). To get the object size, we sum the number of pixels occupied by an object category across images where it occurs in the Count-val set, and divide this number by the average number of (non-zero) instances of the object. This gives us an estimate of the expected size occupied per instance of an object. The x-axes of Figure 16 and Figure 17 are sorted from smaller to larger categories.
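A short sketch of this size estimate, following the procedure literally as described above (per-image pixel areas and counts are assumed to be precomputed):

```python
import numpy as np

def expected_instance_size(pixel_areas, counts):
    """Expected pixels per instance for one category: total category pixels
    over images containing the object, divided by the average (non-zero)
    instance count of the object, as described above."""
    areas = np.asarray(pixel_areas, dtype=float)
    cnts = np.asarray(counts, dtype=float)
    present = cnts > 0
    return areas[present].sum() / cnts[present].mean()

# toy example: three images containing the category, with areas in pixels
print(expected_instance_size([12000, 30000, 8000], [1, 3, 2]))
```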

We find that `ens`, `seq-sub` and `aso-sub` perform consistently well across the spectrum of object sizes on both PASCAL and COCO. On PASCAL, as the object size increases the error keeps decreasing. This trend is not consistent over the entire spectrum of sizes for COCO. Another interesting observation is that as the object size increases, the methods start performing competitively. This also indicates that `aso-sub` and `seq-sub` are able to capture partial ground truth counts well, since the cell-level counts for larger categories will necessarily be partial.

**Undercounting versus Overcounting:** We study whether the models proposed in the paper undercount or overcount. Specifically, we report the number of times the approaches overcount, undercount, or predict the ground truth count exactly on PASCAL (Figure 18) and COCO (Figure 19). To do this, we first filter out all instances where the ground truth count is 0 (since we cannot undercount 0). We then check whether the predicted count is greater than the ground truth (overcounting), less than the ground truth (undercounting), or equal to it. We perform this analysis on the Count-test split.
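A minimal sketch of this bookkeeping, assuming arrays of predicted and ground-truth counts for one category:

```python
import numpy as np

def over_under_stats(pred, gt):
    """Frequency of overcounting / undercounting / exact matches,
    restricted to instances with non-zero ground-truth count."""
    p = np.rint(np.asarray(pred, dtype=float))
    g = np.asarray(gt, dtype=float)
    p, g = p[g > 0], g[g > 0]
    return {"over": int(np.sum(p > g)),
            "under": int(np.sum(p < g)),
            "equal": int(np.sum(p == g))}

print(over_under_stats([2, 1, 4, 0], [2, 3, 3, 1]))
```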

On PASCAL (Figure 18), we observe a clear increase in the number of times we get the count right as we go from `detect` to `ens`. The models, in general, undercount more often than they overcount. As we go from `detect` to `ens`, the improvement in performance can be attributed to the increase in the frequency of exact counts relative to undercounts. Interestingly, for `ens` we get the count right more often than we undercount the ground truth. The number of times we overcount stays more or less the same.

On COCO (Figure 19), we observe that although there is an increase in the number of times we get the count right as we go from `detect` to `ens`, the frequency of exact counts is much lower than the frequency of undercounting for all the models. This is understandable, as COCO has more categories, and objects of different categories have a lower chance of appearing in the same image.

**Ensemble:** We study different combinations of predictions for constructing the ensemble on the Count-test set.

On PASCAL, when we compose an ensemble of `seq-sub` and `aso-sub`, we get an mRMSE of 0.427, as opposed to an mRMSE of 0.438 with `seq-sub` and `glance`. One can think of combining global and local context by taking an ensemble of `glance` and `aso-sub`; however, we observe that such an ensemble underperforms `seq-sub` by 0.02 mRMSE. We also consider including the `detect` baseline in the ensemble. We see that an ensemble of `detect`, `glance`, `aso-sub` and `seq-sub` gives an error of 0.43, as opposed to an ensemble of `glance`, `aso-sub` and `seq-sub`, which gives 0.42. Thus `detect`, when included in the ensemble, hurts the counting performance.

On COCO, when we compose an ensemble of `seq-sub` and `aso-sub`, we get an mRMSE of 0.351, as opposed to an mRMSE of 0.363 with `seq-sub` and `glance`. We observe that an ensemble of `glance` and `aso-sub` underperforms `seq-sub` by 0.02 mRMSE. When `detect` is included, we see that an ensemble of `detect`, `glance`, `aso-sub` and `seq-sub` gives an error of 0.38, as opposed to an ensemble of `glance`, `aso-sub` and `seq-sub`, which gives 0.36. Just as on PASCAL, `detect`, when included in the ensemble, hurts the counting performance.

## 10 Occlusion Studies

In Figure 20, we perform occlusion studies to understand where `glance`, `aso-sub` and `seq-sub` look in the image while estimating the counts of different objects.

For this analysis, we pick images with a spread in counts, from 10 (top row) down to 1. For each count, we identified images where all three approaches agreed on the counts, so that we could analyze where each method "looks" in order to derive the corresponding counts; that is, we pick images from the COCO Count-test split where the predicted counts for `glance`, `aso-sub` and `seq-sub` are equal. The `aso-sub` and `seq-sub` models are trained on a grid discretization of the images. We move fixed-size masks across the image with a non-overlapping stride to get occlusion maps. It is interesting to observe that for the image with a large ground truth count, the occlusion maps from `seq-sub` are very similar to those from `aso-sub`. This confirms our intuition that for larger counts, one needs access to local texture-like patterns to accumulate count densities across the image. For smaller counts (rows 2 and 3), we notice that the maps from `glance` and `seq-sub` are more similar, indicating that global cues such as the number of parts appearing in the image (say, the number of tails of elephants), potentially captured by the distributed CNN representation, are sufficient for counting. Thus, this experiment confirms our intuition that `seq-sub` captures the best of both the `glance` and `aso-sub` approaches, providing us a way to "interpolate" between these approaches based on the counts.
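The occlusion maps can be produced with a simple sliding-mask procedure like the sketch below; the mask size, fill value, and `count_fn` (any of the trained counting models restricted to one category) are illustrative assumptions.

```python
import numpy as np

def occlusion_map(image, count_fn, mask_size=32, fill=0.0):
    """Sensitivity map: drop in predicted count when a square patch is masked.

    image:    H x W x 3 array; count_fn: image -> predicted count (scalar).
    The mask is slid with a non-overlapping stride equal to its size.
    """
    h, w = image.shape[:2]
    base = count_fn(image)
    heat = np.zeros((h // mask_size, w // mask_size))
    for i in range(heat.shape[0]):
        for j in range(heat.shape[1]):
            occluded = image.copy()
            occluded[i * mask_size:(i + 1) * mask_size,
                     j * mask_size:(j + 1) * mask_size] = fill
            heat[i, j] = base - count_fn(occluded)   # large value => region important
    return heat
```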

## 11 VQA Experiment

We next elaborate on the details of the VQA experiment described in Section 5.4. Specifically, we discuss how we pre-process the ground truth to make it numeric, and how we resolve the correspondence between a noun in a question and the counts of COCO categories.

As reported in the paper, we use the VQA [3] and COCO-QA [35] datasets for our counting experiments. We extract the *how many?* type questions which have numerical answers. This includes both integers (VQA) and numbers written as text (COCO-QA); we parse the latter into the corresponding numbers on the COCO-QA dataset, i.e., *five* is mapped to 5. From the selected questions, we extract the nouns (singular, plural, and proper) and convert them to their singular form using the Natural Language Toolkit (NLTK) [1].

We train word2vec word embeddings on Wikipedia^{6} and use them to map the extracted noun to the closest COCO category. For nouns that refer to a super-category, such as *animal*, we sum the counts for *horse*, *giraffe*, *cat*, *dog*, *zebra*, *sheep*, *cow*, *elephant*, *bear*, and *bird* and use the output as our predicted count.

## 12 Qualitative Results

We finally show some qualitative examples of our predictions on COCO images in Figure 21, where `ens` performs best. We observe that whenever the objects present in the image are sufficiently salient, `seq-sub` and `aso-sub` do a considerably better job of estimating the count of objects compared to `glance`. This is because `seq-sub` and `aso-sub` estimate partial counts at the cell level, unlike `glance`, which has to regress to the count of the entire image. Even in some cases where the objects present are highly occluded, we see that `seq-sub` and `aso-sub` do a much better job of estimating the count. In summary, we find that `ens`, as a combination of `glance`, `aso-sub` and `seq-sub`, gets the count right most often.

### Footnotes

1. When the Count-val split is considered, PASCAL has fewer annotated objects per scene on average than COCO.
2. We compare our best performing `seq-sub` model with their approach. On *person*, `seq-sub` outperforms their method; on *zebra*, [36] outperforms `seq-sub`. A recent exchange with the authors suggested anomalies in their experimental setup, which may have resulted in their reported numbers being optimistic estimates of the true performance.
3. One might argue that the gain in performance of `aso-sub` at finer grids is due to more (augmented) training data. However, since performance diminishes when the grid size is increased further (which provides even more data to train on), we hypothesize that this is not the case.
4. See appendix for more details.
5. For the column corresponding to VQA, all methods are evaluated on the subset of the predictions where [21] and [15] both produced numerical answers. For [21], there were 11 non-numerical answers and for [15] there were 3 (e.g., "many", "few", "lot").
6. https://www.wikipedia.org/

### References

1. NLTK. http://www.nltk.org/.
2. A. Agrawal, D. Batra, and D. Parikh. Analyzing the behavior of visual question answering models. *CoRR*, abs/1606.07356, 2016.
3. S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh. VQA: Visual question answering. In *ICCV*, pages 2425–2433, 2015.
4. C. Arteta, V. Lempitsky, and A. Zisserman. Counting in the wild. In *ECCV*, 2016.
5. J. Carreira, R. Caseiro, J. Batista, and C. Sminchisescu. Semantic segmentation with second-order pooling. In *ECCV (LNCS 7578)*, pages 430–443, 2012.
6. A. B. Chan and N. Vasconcelos. Privacy preserving crowd monitoring: Counting people without people models or tracking. In *CVPR*, pages 1–7, 2008.
7. A. B. Chan and N. Vasconcelos. Bayesian Poisson regression for crowd counting. In *ICCV*, pages 545–551, 2009.
8. D. H. Clements. Subitizing: What is it? Why teach it? *Teaching Children Mathematics*, 5(7):400, 1999.
9. R. Collobert, K. Kavukcuoglu, and C. Farabet. Torch7: A Matlab-like environment for machine learning. In *BigLearn, NIPS Workshop*, 2011.
10. S. Cutini and M. Bonato. Subitizing and visual short-term memory in human and non-human species: a common shared system? *Frontiers in Psychology*, 3, 2012.
11. S. Dehaene, V. Izard, E. Spelke, and P. Pica. Log or linear? Distinct intuitions of the number scale in Western and Amazonian indigene cultures. *Science*, 320(5880):1217–1220, 2008.
12. J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. 2013.
13. M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes challenge: A retrospective. *International Journal of Computer Vision*, 111(1):98–136, 2015.
14. P. Felzenszwalb, D. McAllester, and D. Ramanan. A discriminatively trained, multiscale, deformable part model. In *CVPR*, 2008.
15. A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach. Multimodal compact bilinear pooling for visual question answering and visual grounding. In *EMNLP*, pages 457–468, 2016.
16. F. Galton. One vote, one value. 75:414, Feb. 1907.
17. S. Gidaris and N. Komodakis. Object detection via a multi-region and semantic segmentation-aware CNN model. In *ICCV*, pages 1134–1142, 2015.
18. R. Girshick. Fast R-CNN. In *ICCV*, 2015.
19. H. Idrees, I. Saleemi, C. Seibert, and M. Shah. Multi-source multi-scale counting in extremely dense crowd images. In *CVPR*, pages 2547–2554, 2013.
20. S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In *ICML*, pages 448–456, 2015.
21. J. Lu, X. Lin, D. Batra, and D. Parikh. Deeper LSTM and normalized CNN visual question answering model. https://github.com/VT-vision-lab/VQA_LSTM_CNN, 2015.
22. J. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei, C. L. Zitnick, and R. Girshick. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In *CVPR*, 2017.
23. S. Kazemzadeh, V. Ordonez, M. Matten, and T. L. Berg. ReferItGame: Referring to objects in photographs of natural scenes. In *EMNLP*, 2014.
24. D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. *CoRR*, abs/1412.6980, 2014.
25. A. Klein and P. Starkey. Universals in the development of early arithmetic cognition. *New Directions for Child and Adolescent Development*, 1988(41):5–26, 1988.
26. A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In *NIPS*, pages 1097–1105, 2012.
27. V. Lempitsky and A. Zisserman. Learning to count objects in images. In *NIPS*, pages 1324–1332, 2010.
28. T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In *ECCV*, 2014.
29. W. Liu, D. Anguelov, D. Erhan, C. Szegedy, and S. E. Reed. SSD: Single shot multibox detector. *CoRR*, abs/1512.02325, 2015.
30. J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In *CVPR*, pages 3431–3440, 2015.
31. M. Malinowski, M. Rohrbach, and M. Fritz. Ask your neurons: A neural-based approach to answering questions about images. In *ICCV*, pages 1–9, 2015.
32. T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In *NIPS*, pages 3111–3119, 2013.
33. D. Oñoro Rubio and R. J. López-Sastre. Towards perspective-free object counting with deep learning. In *ECCV*, 2016.
34. J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In *CVPR*, 2016.
35. M. Ren, R. Kiros, and R. Zemel. Exploring models and data for image question answering. In *NIPS*, pages 2953–2961, 2015.
36. M. Ren and R. S. Zemel. End-to-end instance segmentation and counting with recurrent attention. *CoRR*, abs/1605.09410, 2016.
37. S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In *NIPS*, pages 91–99, 2015.
38. O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. *International Journal of Computer Vision*, 115(3):211–252, 2015.
39. A. Sadovnik, A. C. Gallagher, and T. Chen. It's not polite to point: Describing people with uncertain attributes. In *CVPR*, pages 3089–3096, 2013.
40. M. Schuster and K. K. Paliwal. Bidirectional recurrent neural networks. *IEEE Transactions on Signal Processing*, 45:2673–2681, 1997.
41. S. Seguí, O. Pujol, and J. Vitrià. Learning to count with deep object features. 2015.
42. K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. 2014.
43. P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In *CVPR*, 2001.
44. L. Wan, D. Eigen, and R. Fergus. End-to-end integration of a convolution network, deformable parts model and non-maximum suppression. In *CVPR*, pages 851–859, 2015.
45. C. Zhang, H. Li, X. Wang, and X. Yang. Cross-scene crowd counting via deep convolutional neural networks. In *CVPR*, pages 833–841, 2015.
46. J. Zhang, S. Ma, M. Sameki, S. Sclaroff, M. Betke, Z. Lin, X. Shen, B. Price, and R. Mĕch. Salient object subitizing. In *CVPR*, 2015.