Counting Everyday Objects in Everyday Scenes
Abstract
We are interested in counting the number of instances of object classes in natural, everyday images. Previous counting approaches tackle the problem in restricted domains such as counting pedestrians in surveillance videos. Counts can also be estimated from outputs of other vision tasks like object detection. In this work, we build dedicated models for counting designed to tackle the large variance in counts, appearances, and scales of objects found in natural scenes. Our approach is inspired by the phenomenon of subitizing – the ability of humans to make quick assessments of counts given a perceptual signal, for small count values. Given a natural scene, we employ a divide and conquer strategy while incorporating context across the scene to adapt the subitizing idea to counting. Our approach offers consistent improvements over numerous baseline approaches for counting on the PASCAL VOC 2007 and COCO datasets. Subsequently, we study how counting can be used to improve object detection. We then show a proof of concept application of our counting methods to the task of Visual Question Answering, by studying the ‘how many?’ questions in the VQA and COCOQA datasets.
1 Introduction
We study the scene understanding problem of counting common objects in natural scenes. That is, given the image in Fig. 1, we want to count the instances of each everyday object category present in it: for example, 4 chairs, 1 oven, 1 dining table, 1 potted plant and 3 spoons. Such an ability to count seems innate in humans (and even in some animals [10]). Thus, as a stepping stone towards Artificial Intelligence (AI), it is desirable to have intelligent machines that can count.
Similar to scene understanding tasks such as object detection [43, 14, 18, 37, 17, 44, 34, 29] and segmentation [5, 30, 36] which require a fine-grained understanding of the scene, object counting is a challenging problem that requires us to reason about the number of instances of objects present while tackling scale and appearance variations.
Another closely related vision task is visual question answering (VQA), where the task is to answer free-form natural language questions about an image. Interestingly, questions about the count of a particular object (e.g., ‘How many red cars do you see?’) form a significant portion of the questions asked in common visual question answering datasets [3, 35]. Moreover, we observe that end-to-end networks [3, 35, 31, 15] trained for this task do not perform well on such counting questions. This is not surprising, since the objective is often set up to minimize the cross-entropy classification loss for the correct answer to a question, which ignores the ordinal structure inherent to counting. In this work we systematically benchmark how well current VQA models do at counting, and study the benefits of dedicated counting models on a subset of counting questions in VQA datasets in Sec. 5.4.
Counts can also be used as complementary signals to aid other vision tasks like detection. If we had an estimate of how many objects were present in the image, we could use that information on a per-image basis to detect that many objects. Indeed, we find that our object counting models improve object detection performance.
We first describe some baseline approaches to counting and subsequently build towards our proposed model.
Counting by Detection: It is easy to see that perfect detection of objects would imply a perfect count. While detection is sufficient for counting, localizing objects is not necessary. Imagine a scene containing a number of mugs on a table where the objects occlude each other. In order to count the mugs, we need not determine where they are with pixel-accurate segmentations or detections (which is hard in the presence of occlusions), as long as we can, say, determine the number of handles. Relieving the burden of detecting objects is also effective for counting when objects occur at smaller scales where detection is hard [18]. Nevertheless, counting by detection (detect) remains a natural approach to counting.
Counting by Glancing: Representations extracted from Deep Convolutional Neural Networks [42, 26] trained on image classification have been successfully applied to a number of scene understanding tasks such as fine-grained recognition [12], scene classification [12], object detection [12], etc. We explore how well features from a deep CNN perform at counting through instantiations of our glancing (glance) models, which estimate a global count for the entire scene in a single forward pass. This can be thought of as estimating the count in one shot or glance. This is in contrast with detect, which sequentially increments its count with each detected object (Fig. 2). Note that unlike detection, which optimizes for a localization objective, the glance models explicitly learn to count.
Counting by Subitizing: Subitizing is a widely studied phenomenon in developmental psychology [8, 25, 10] which indicates that children have an ability to directly map a perceptual signal to a numerical estimate, for a small number of objects (typically 1-4). Subitizing is crucial for development and assists arithmetic and reasoning skills. An example of subitizing is how we are able to figure out the number of pips on a face of a die without having to count them or how we are able to reason about tally marks.
Inspired by subitizing, we devise a new counting approach which adopts a divide and conquer strategy, using the additive nature of counts. Note that glance can be thought of as an attempt to subitize from a glance of the image. However, as illustrated in Fig. 2 (center), subitizing is difficult at high counts for humans.
Motivated by this difficulty, we divide the image into non-overlapping cells (Fig. 2 right). We then subitize in each cell and use addition to get the total count. We call this method associative subitizing or asosub.
In practice, to implement this idea on real images, we incorporate context across the cells while sequentially subitizing in each one of them. We call this sequential subitizing or seqsub. For each of these cells we curate real-valued ground truth, which helps us deal with scale variations. Interestingly, we found that by incorporating context, seqsub significantly outperforms the naive subitizing model asosub described above (see Sec. 5.1 for more details).
Counting by Ensembling: It is well known that when humans are given counting problems with large ground truth counts (e.g. counting the number of pebbles in a jar), individual guesses have high variance, but an average across multiple responses tends to be surprisingly close to the ground truth. This phenomenon is popularly known as the wisdom of the crowd [16]. Inspired by this, we create an ensemble of counting methods (ens).
In summary, we evaluate several natural approaches to counting, and propose a novel context and subitizing based counting model. Then we investigate how counting can improve detection. Finally, we study counting questions (‘how many?’) in the Visual Question Answering (VQA) [3] and COCOQA [35] datasets and provide some comparisons with the state-of-the-art VQA models.
2 Related Work
Counting problems in niche settings have been studied extensively in computer vision [45, 41, 7, 27]. [7] explores a Bayesian Poisson regression method on low-level features for counting in crowds. [6] segments a surveillance video into components of homogeneous motion and regresses to counts in each region using Gaussian Process regression. Since surveillance scenes tend to be constrained and highly occluded, counting by detection is infeasible. Thus density-based approaches are popular. Lempitsky and Zisserman [27] count people by estimating object density using low-level features. They show applications on surveillance and cell counting in biological images. Anchovi Labs provided users interactive services to count specific objects such as swimming pools in satellite images, cells in biological images, etc. More recent work constructs CNN-based models for crowd counting [45, 33] and penguin counting [4] using lower level convolutional features from shallower CNN models.
Counting problems in constrained settings have a fundamentally different set of challenges from the counting problem we study in this paper. In surveillance, for example, the challenge is to estimate counts accurately when ground truth counts are large and there may be significant occlusions. In the counting problem on everyday scenes, a larger challenge is the intra-class variance in everyday objects, and high sparsity (most images will have 0 count for most object classes). Thus we need a qualitatively different set of tools to solve this problem.
Other recent work [46] studies the problem of salient object subitizing (SOS). This is the task of counting the number of salient objects in the image (independent of the category). In contrast, we are interested in counting the number of instances of objects per category. Unlike Zhang et al. [46], who use SOS to improve salient object detection, we propose to improve generic object detection using counts. Our VQA experiments to diagnose counting performance are also similar in spirit to recent work that studies how well models perform on specific question categories (counting, attribute comparison, etc.) [22] or on compositional generalization [2].
3 Approach
Our task is to accurately count the number of instances of different object classes in an image. For training, we use datasets where we have access to object annotations such as object bounding boxes and category-wise counts. The count predictions from the models are evaluated using the metrics described in Sec. 4.2. The inputs to the glance, asosub and seqsub models are fc7 features from a VGG16 [42] CNN model. We experiment using both off-the-shelf classification weights from ImageNet [38] and the detection fine-tuned weights from our detect models.
3.1 Detection (detect)
We use the Fast RCNN [18] object detector to count. Detectors typically perform two post-processing steps on a set of preliminary boxes: non-maximal suppression (NMS) and score thresholding. NMS discards highly overlapping and likely redundant detections (using a threshold to control the overlap), whereas the score threshold filters out all detections with low scores.
We steer the detector to count better by varying these two hyperparameters to find the setting where counting error is the least. We pick these parameters using grid search on a held-out val set. We first fix the NMS threshold to 0.3 for all the classes and, for each category, vary the score threshold between 0 and 1. We then fix the score threshold to its best value and vary the NMS threshold from 0 to 1.
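The two-stage search described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the greedy `count_detections` helper, the toy boxes and scores, the single-category setting, and the squared-error criterion summed over images are all choices made for the example.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def count_detections(dets, nms_thr, score_thr):
    """Count boxes that survive score thresholding and greedy NMS.

    dets is a list of (score, box) pairs for one image and one category.
    """
    kept = []
    for score, box in sorted(dets, key=lambda d: -d[0]):
        if score < score_thr:
            continue
        # Keep a box only if it does not overlap a higher-scoring kept box too much.
        if all(iou(box, k) <= nms_thr for k in kept):
            kept.append(box)
    return len(kept)


def squared_error(images, nms_thr, score_thr):
    """Sum of squared counting errors over (detections, true_count) pairs."""
    return sum((count_detections(d, nms_thr, score_thr) - gt) ** 2 for d, gt in images)


def tune_thresholds(images, grid):
    """Stage 1: NMS fixed at 0.3, sweep the score threshold.
    Stage 2: score threshold fixed at its best value, sweep the NMS threshold."""
    best_score = min(grid, key=lambda s: squared_error(images, 0.3, s))
    best_nms = min(grid, key=lambda n: squared_error(images, n, best_score))
    return best_nms, best_score


# Toy validation image (hypothetical data): two objects, one duplicate box
# on the first object, and one low-scoring false positive.
images = [([(0.9, (0, 0, 10, 10)), (0.7, (1, 0, 11, 10)),
            (0.8, (20, 0, 30, 10)), (0.2, (40, 0, 50, 10))], 2)]
grid = [i / 10 for i in range(11)]
best_nms, best_score = tune_thresholds(images, grid)
```

On this toy image, a score threshold of 0 counts the false positive as a third object, while the tuned setting recovers the true count of 2.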
3.2 Glancing (glance)
Our glance approach repurposes a generic CNN architecture for counting by training a multi-layer perceptron (MLP) with an L2 loss to regress to image-level counts from deep representations extracted from the CNN. The MLP has batch normalization [20] and Rectified Linear Unit (ReLU) activations between hidden layers. The models were trained with weight decay set to 0.95. We experiment with a single hidden layer and with two hidden layers for the MLP, as well as with the sizes of the hidden layers. More details and ablation studies can be found in the appendix.
3.3 Subitizing (asosub, seqsub)
In our subitizing-inspired methods, we divide our counting problem into sub-problems on each cell in a non-overlapping grid, and add the predicted counts across the grid.
In practice, since objects in real images occur at different scales, such cells might contain fractions of an object. We adjust for this by allowing for real-valued ground truth. If a cell overlapping an object is very small compared to the object, the small fractional count of the cell might be hard to estimate. On the other hand, if a cell is too large compared to the objects present, it might be hard to estimate the large integer count of the cell (see Fig. 3). This tradeoff suggests that at some canonical resolution, we would be able to count the smaller objects more easily by subitizing them, as well as predict the partial counts for larger objects. More concretely, we divide the image $I$ into a set of $n$ non-overlapping cells $P = \{p_1, \dots, p_n\}$ such that $I = \bigcup_{i=1}^{n} p_i$ and $p_i \cap p_j = \emptyset$ for $i \neq j$. Given such a partition $P$ of the image and associated CNN features $X = \{x_1, \dots, x_n\}$, we now explain our models based on this approach:
asosub : Our naive asosub model treats each cell independently to regress to the real-valued ground truth. We train on an augmented version of the dataset where the dataset size is $n$-fold ($n$ cells per image). Unlike glance, where features extracted on the full image are used to regress to integer-valued counts, asosub models regress to real-valued counts on non-overlapping cells from features extracted per cell. Given class instance annotations as bounding boxes $b \in B$ for a category $k$ in an image $I$, we compute the ground truth partial counts ($t_{p_i}$) for the grid-cells ($p_i$) to be used for training as follows:
$$t_{p_i} = \sum_{b \in B} \frac{|b \cap p_i|}{|b|} \qquad (1)$$
where $|\cdot|$ denotes area. We compute the intersection of each box $b$ with the cell $p_i$ and add up the intersections normalized by the box area $|b|$. Further, given the cell-level count predictions $\hat{c}_{p_i}$, the image-level count prediction is computed as $\hat{c} = \sum_{i=1}^{n} \max(0, \hat{c}_{p_i})$. We use max to filter out negative predictions.
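The ground truth construction and count aggregation for the grid can be sketched as below. This is a minimal illustration assuming axis-aligned boxes given as (x1, y1, x2, y2) with non-zero area and an equally sized square grid; the helper names are hypothetical and not taken from the paper's code.

```python
def area(box):
    """Area of an axis-aligned box (x1, y1, x2, y2)."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def intersection_area(a, b):
    """Area of the overlap between two axis-aligned boxes."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    return max(0.0, iw) * max(0.0, ih)


def grid_cells(width, height, n):
    """Split a width x height image into n x n equal cells, in row-major order."""
    cw, ch = width / n, height / n
    return [(c * cw, r * ch, (c + 1) * cw, (r + 1) * ch)
            for r in range(n) for c in range(n)]


def partial_counts(boxes, cells):
    """Real-valued ground truth per cell: each box contributes the fraction
    of its own area that falls inside the cell."""
    return [sum(intersection_area(b, p) / area(b) for b in boxes) for p in cells]


def image_count(cell_predictions):
    """Image-level count: sum of cell predictions, with negatives clipped to 0."""
    return sum(max(0.0, c) for c in cell_predictions)


# A 4x4 image with a 2x2 grid: one box straddles all four cells,
# another lies entirely inside the top-left cell.
boxes = [(1, 1, 3, 3), (0, 0, 1, 1)]
cells = grid_cells(4, 4, 2)
targets = partial_counts(boxes, cells)  # [1.25, 0.25, 0.25, 0.25]
```

Because every box is normalized by its own area, the per-cell targets of an image sum to the integer object count (here 2), which is what makes simple addition a valid aggregation rule.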
We experiment with dividing the image into equally sized grid-cells at three grid resolutions. The architecture of the models trained on the augmented dataset is the same as that of glance. For more details, refer to the appendix.
seqsub : We motivate our proposed seqsub (Sequential Subitizing) approach by identifying a potential flaw in the naive asosub approach, illustrated in Fig. 4. If the cells are treated independently, the naive asosub model will be unaware of the partial presence of the concerned object in other cells. This leads to situations where similar visual signals need to be mapped to both partial and whole presence of the object in the cells (see Fig. 4). This is especially pathological because Huber or L2 losses cannot capture this multimodality in the output space: the implicit density associated with such losses is either Laplacian or Gaussian.
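A small worked example may make the multimodality argument concrete. The target values below are illustrative assumptions, not numbers from the paper. Suppose two cells yield the same feature $x$, but one contains a whole object (target 1) and the other half of an object (target 0.5). The L2-optimal prediction for $x$ is then the mean of the two targets, which matches neither:

```latex
\hat{c}(x) \;=\; \arg\min_{c}\,\big[(c - 1)^2 + (c - 0.5)^2\big]
\quad\Longrightarrow\quad
2(c - 1) + 2(c - 0.5) = 0
\quad\Longrightarrow\quad
\hat{c}(x) = 0.75 .
```

A single point estimate is forced to sit between the two modes unless additional input, such as context from neighboring cells, separates the two cases.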
Interestingly, a simple solution to mitigate this issue is to model context, which resolves this ambiguity in counts. That is, if we knew about the partial class presence in other cells, we could use that information to predict the correct cell count. Thus, although the independence assumption in asosub is convenient, it ignores the fact that the augmented dataset is not IID. While it is important to reason at a cell level, it is also necessary to be aware of the global image context to produce meaningful predictions. In essence, we propose seqsub, which takes the best of both worlds from glance and asosub.
The architecture of seqsub is shown in Fig. 5. It consists of a pair of 2 stacked bidirectional sequence-to-sequence LSTMs [40]. We incorporate context across cells as
$$\hat{c}_{p_i} = f\big(h_1(x_1; \theta_1), h_2(x_2; \theta_2), \dots, h_n(x_n; \theta_n)\big)_i \qquad (2)$$
where the individual $h_i$ are hidden layer representations of each cell feature $x_i$ with respective parameters $\theta_i$, and $f$ is the mechanism that captures context. This can be broken down as follows. Let $H$ be the set containing the $h_i$s. Let $H_N$ and $H_Z$ be 2 ordered sets which are permutations of $H$ based on 2 particular sequence structures. The (traversal) sequences, as we move across the grid in the feature column, are decided on the basis of nearness of cells (see Fig. 5). We experiment with the sequence structures best described for a grid as N and Z, which correspond to $H_N$ and $H_Z$ respectively. Each of these feature sequences is then fed to a pair of stacked BiLSTMs ($L_N$, $L_Z$) and the corresponding cell output states are concatenated to obtain a context vector ($v_i$) for each cell as $v_i = [L_N(H_N)_i ; L_Z(H_Z)_i]$. The cell counts are then obtained as $\hat{c}_{p_i} = g(v_i)$. The composition of the BiLSTMs and $g$ implements $f$.
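The traversal and re-alignment logic can be sketched as follows. This is only an illustration under explicit assumptions: the "Z" order is taken to be row-by-row and the "N" order column-by-column (read off the letter shapes, not confirmed by the text), and `run_n` / `run_z` are hypothetical callables standing in for the two stacked BiLSTMs.

```python
def z_order(n):
    """Assumed 'Z' traversal: left to right within a row, rows top to bottom."""
    return [r * n + c for r in range(n) for c in range(n)]


def n_order(n):
    """Assumed 'N' traversal: top to bottom within a column, columns left to right."""
    return [r * n + c for c in range(n) for r in range(n)]


def context_vectors(hidden, n, run_n, run_z):
    """Build one context vector per cell.

    hidden: per-cell feature lists in row-major cell order.
    run_n, run_z: sequence models (stand-ins for the BiLSTMs) mapping a
    sequence of features to a same-length sequence of output lists.
    """
    aligned_outputs = []
    for order, run in ((n_order(n), run_n), (z_order(n), run_z)):
        seq_out = run([hidden[i] for i in order])
        # Undo the permutation so outputs line up with the original cell index.
        aligned = [None] * len(order)
        for pos, cell in enumerate(order):
            aligned[cell] = seq_out[pos]
        aligned_outputs.append(aligned)
    # Concatenate the two per-cell outputs into a single context vector.
    return [a + b for a, b in zip(aligned_outputs[0], aligned_outputs[1])]


identity = lambda seq: [list(v) for v in seq]  # trivial stand-in sequence model
hidden = [[i] for i in range(9)]               # one scalar feature per cell, 3x3 grid
ctx = context_vectors(hidden, 3, identity, identity)
```

With the identity stand-in, each cell's context vector simply repeats its own feature twice, which confirms that the outputs of both traversals are mapped back to the right cell before concatenation.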
We use a Huber loss objective to regress to the count values, with weight decay set to 0.95. For optimization, we use Adam [24] with a mini-batch size of 64. The ground truth construction procedure for training and the count aggregation procedure for evaluation are as defined in asosub.
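For reference, the standard Huber loss on a single predicted count looks like the sketch below. The transition point `delta=1.0` is an assumption made for illustration; the text does not state which value was used.

```python
def huber(pred, target, delta=1.0):
    """Huber loss: quadratic for small residuals, linear for large ones."""
    r = abs(pred - target)
    if r <= delta:
        return 0.5 * r * r
    return delta * (r - 0.5 * delta)


small = huber(0.5, 0.0)  # 0.125, same as 0.5 * residual^2
large = huber(3.0, 0.0)  # 2.5, grows linearly instead of reaching 4.5
```

The linear tail makes the objective less sensitive than L2 to the occasional cell with a large count error.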
4 Experimental Setup
4.1 Datasets
We experiment with two datasets depicting everyday objects in everyday scenes: the PASCAL VOC 2007 [13] and COCO [28].
The PASCAL VOC dataset contains a train set of 2501 images, a val set of 2510 images and a test set of 4952 images, and has 20 object categories. The COCO dataset contains a train set of 82,783 images and a val set of 40,504 images, with 80 object categories. On PASCAL, we use the val set as our Countval set and the test set as our Counttest set. On COCO, we use the first half of val as the Countval set and the second half of val as the Counttest set. The most frequent count per object category (as one would expect in everyday scenes) is 0. Fig. 6 shows a histogram of nonzero counts across all object categories. It can be clearly seen that although the two datasets have a fair amount of count variability, there is a clear bias towards lower count values. Note that this is unlike the crowd-counting datasets, in particular [19], where the mean count is far higher and where, unlike PASCAL and COCO, the images have very little scale and appearance variation in terms of objects.
4.2 Evaluation
We adopt the root mean squared error (RMSE) as our metric. We also evaluate on a variant of RMSE that might be better suited to human perception. The intuition behind this metric is as follows. In a real-world scenario, humans tend to perceive counts in the logarithmic scale [11]. That is, a mistake of 1 for a ground truth count of 2 might seem egregious but the same mistake for a ground truth count of 25 might seem reasonable. Hence we scale each deviation by a function of the ground truth count.
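We first post-process the count predictions from each method by thresholding counts at 0, and rounding predictions to the closest integers to get predictions $\hat{c}_{ik}$. Given these predictions and ground truth counts $c_{ik}$ for a category $k$ and image $i$, we compute RMSE as follows:
$$\mathrm{RMSE}_k = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left(\hat{c}_{ik} - c_{ik}\right)^2} \qquad (3)$$
and relative RMSE as:
$$\mathrm{relRMSE}_k = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \frac{\left(\hat{c}_{ik} - c_{ik}\right)^2}{c_{ik} + 1}} \qquad (4)$$
where $N$ is the number of images in the dataset. We then average the error across all categories to report numbers on the dataset (mRMSE and mrelRMSE).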
We also evaluate the above metrics for ground truth instances with nonzero counts. This reflects more clearly how accurate the counts produced by a method (beyond predicting absence) are.
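The metrics above can be sketched in a few lines. Two points in this sketch are assumptions rather than facts from the text: ties at .5 are rounded up, and the relative variant divides each squared deviation by the ground truth count plus one, a scaling that stays finite when the count is zero.

```python
import math


def postprocess(pred):
    """Threshold at 0 and round to the nearest integer (ties rounded up)."""
    return max(0, int(math.floor(pred + 0.5)))


def rmse(preds, gts, relative=False, nonzero_only=False):
    """Per-category RMSE over images, with optional relative scaling and
    optional restriction to images whose ground truth count is nonzero."""
    pairs = [(postprocess(p), g) for p, g in zip(preds, gts)
             if g > 0 or not nonzero_only]
    if not pairs:
        return float("nan")
    total = sum((p - g) ** 2 / ((g + 1) if relative else 1) for p, g in pairs)
    return math.sqrt(total / len(pairs))


def mean_over_categories(per_category_errors):
    """Average per-category errors to obtain the dataset-level number."""
    return sum(per_category_errors) / len(per_category_errors)


# One category, four images (toy numbers).
preds = [0.2, 1.6, 3.4, -0.3]
gts = [0, 2, 5, 0]
plain = rmse(preds, gts)                    # 1.0
nonzero = rmse(preds, gts, nonzero_only=True)
rel = rmse(preds, gts, relative=True)
```

In this toy case the only error is an undercount of 2 on the image with 5 objects, so restricting to nonzero images raises the error from 1.0 to about 1.41, while the relative variant shrinks it to about 0.41.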
4.3 Methods and Baselines
We compare our approaches to the following baselines:
always0: predict the most frequent ground truth count (0).
mean: predict the average ground truth count on the Countval set.
always1: predict the most frequent nonzero value (1) for all classes.
categorymean: predict the average count per category on Countval.
gtclass: treat the ground truth counts as classes and predict the counts using a classification model trained with cross-entropy loss.
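The constant baselines in this list are simple enough to state in code. The sketch below assumes validation counts are stored as one list per image with one entry per category; that layout and the function names are hypothetical. The gtclass baseline is omitted because it requires a trained classifier.

```python
def always0(num_categories):
    """Predict a count of 0 for every category."""
    return [0] * num_categories


def always1(num_categories):
    """Predict a count of 1 for every category."""
    return [1] * num_categories


def mean_baseline(val_counts):
    """Predict the overall average count, shared by all categories."""
    k = len(val_counts[0])
    overall = sum(sum(row) for row in val_counts) / (len(val_counts) * k)
    return [overall] * k


def category_mean(val_counts):
    """Predict each category's own average count."""
    n = len(val_counts)
    return [sum(column) / n for column in zip(*val_counts)]


# Three validation images, two categories (toy numbers).
val = [[0, 2], [0, 0], [1, 4]]
per_category = category_mean(val)  # [1/3, 2.0]
overall = mean_baseline(val)       # [7/6, 7/6]
```

Each function returns the prediction vector that would be emitted for every test image.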
We evaluate the following variants of counting approaches (see Sec. 3 for more details):
detect: We compare two methods for detect. The first method finds the best NMS and score thresholds as explained in Sec. 3.1. The second method uses vanilla Fast RCNN as it comes out of the box, with the default NMS and score thresholds.
glance: We explore the following choices of features: (1) vanilla classification fc7 features noft, (2) detection fine-tuned fc7 features ft, (3) fc7 features from a CNN trained to perform Salient Object Subitizing sos [46] and (4) flattened conv3 features from a CNN trained for classification.
asosub, seqsub: We examine three choices of grid sizes (Sec. 3.3), with noft and ft features as above.
ens: We take the best performing subset of methods and average their predictions to perform counting by ensembling (ens).
5 Results
All the results presented in the paper are averaged on 10 random splits of the test set sampled with replacement.
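The resampling protocol can be sketched as below. Several details are assumptions, since the text does not specify them: each resample has the same size as the test set, the spread is reported as a sample standard deviation, and the seed is fixed only to make the sketch reproducible.

```python
import math
import random
import statistics


def rmse(preds, gts):
    """Plain RMSE between predicted and ground truth counts."""
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(preds, gts)) / len(gts))


def resampled_metric(preds, gts, metric, num_splits=10, seed=0):
    """Evaluate a metric on num_splits resamples drawn with replacement and
    return the mean and the sample standard deviation of the metric."""
    rng = random.Random(seed)
    n = len(gts)
    values = []
    for _ in range(num_splits):
        idx = [rng.randrange(n) for _ in range(n)]
        values.append(metric([preds[i] for i in idx], [gts[i] for i in idx]))
    return statistics.mean(values), statistics.stdev(values)


# Toy check: predictions that are always off by one give RMSE 1 on every resample.
mean_err, spread = resampled_metric([2, 3, 4], [1, 2, 3], rmse)
```

The two returned numbers correspond to the "value ± spread" pairs reported in the result tables.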
5.1 Counting Results
PASCAL VOC 2007 : We first present results (Table 1) for the best performing variants (picked based on the val set) of each method. We see that seqsub outperforms all other methods. glance and detect perform equally well on both metrics, while glance does slightly better on both metrics when evaluated on nonzero ground truth counts. To put these numbers in perspective, we find that the difference of 0.01 mRMSE between seqsub and asosub leads to a difference of 0.19% mean F-measure performance in our counting to improve detection application (Sec. 5.3). We also experiment with conv3 features to regress to the counts, similar to Zhang et al. [45]. We find that conv3 gets an mRMSE of 0.63, which is much worse than fc7. We also tried PCA on the conv3 features but that did not improve performance. This indicates that our counting task is indeed more high level and needs to reason about objects rather than low-level textures. We also compare our approach with the SOS model [46] by extracting fc7 features from a model trained to perform category-independent salient object subitizing. We observe that our best performing glance setup using ImageNet-trained VGG16 features outperforms the one using SOS features. This is also intuitive since SOS is a category-independent task, while we want to count the number of object instances of each category. Finally, we observe that the performance increment from asosub to seqsub is not statistically significant. We hypothesize that this is because of the smaller size of the PASCAL dataset. Note that we get more consistent improvements on COCO (Table 2), which is not only a larger dataset, but also contains scenes that are contextually richer. (Footnote 1: When the Countval split is considered, PASCAL has fewer annotated objects per scene on average than COCO.)
Table 1: Counting results on PASCAL VOC 2007.
Approach        mRMSE         mRMSEnz       mrelRMSE      mrelRMSEnz
always0         0.66 ± 0.02   1.96 ± 0.03   0.28 ± 0.03   0.59 ± 0.00
mean            0.65 ± 0.02   1.81 ± 0.03   0.31 ± 0.01   0.52 ± 0.00
always1         1.14 ± 0.01   0.96 ± 0.03   0.98 ± 0.00   0.17 ± 0.03
categorymean    0.64 ± 0.02   1.60 ± 0.03   0.30 ± 0.00   0.45 ± 0.00
gtclass         0.55 ± 0.02   2.12 ± 0.07   0.24 ± 0.00   0.88 ± 0.01
detect          0.50 ± 0.01   1.92 ± 0.08   0.26 ± 0.01   0.85 ± 0.02
glancenoft2L    0.50 ± 0.02   1.83 ± 0.09   0.27 ± 0.00   0.73 ± 0.00
glancesos2L     0.51 ± 0.02   1.87 ± 0.08   0.29 ± 0.01   0.75 ± 0.02
asosubft1L      0.43 ± 0.01   1.65 ± 0.07   0.22 ± 0.01   0.68 ± 0.02
seqsubft        0.42 ± 0.01   1.65 ± 0.07   0.21 ± 0.01   0.68 ± 0.02
ens             0.42 ± 0.17   1.68 ± 0.08   0.20 ± 0.00   0.65 ± 0.01
COCO : We present results for the best performing variants (picked based on the val set) of each method. The results are summarized in Table 2. We find that seqsub does the best on both mRMSE and mrelRMSE as well as their nonzero variants by a significant margin. A comparison indicates that the always0 baseline does better on COCO than on PASCAL. This is because COCO has many more categories than PASCAL. Thus, the chances of any particular object being present in an image decrease compared to PASCAL. The performance jump from asosub to seqsub here is much larger than on PASCAL. Recent work by Ren and Zemel [36] on instance segmentation also reports counting performance on two COCO categories: person and zebra. (Footnote 2: We compare our best performing seqsub model with their approach. On person, seqsub outperforms [36]. On zebra, [36] outperforms seqsub. A recent exchange with the authors suggested anomalies in their experimental setup, which may have resulted in their reported numbers being optimistic estimates of the true performance.)
For both PASCAL and COCO we observe that while ens outperforms other approaches in some cases, it does not always do so. We hypothesize that this is due to the poor performance of glance. For detailed ablation studies on ens see appendix.
Table 2: Counting results on COCO.
Approach        mRMSE         mRMSEnz       mrelRMSE      mrelRMSEnz
always0         0.54 ± 0.01   3.03 ± 0.03   0.21 ± 0.00   1.22 ± 0.01
mean            0.54 ± 0.00   2.96 ± 0.03   0.23 ± 0.00   1.17 ± 0.01
always1         1.12 ± 0.00   2.39 ± 0.03   1.00 ± 0.00   0.80 ± 0.00
categorymean    0.52 ± 0.01   2.97 ± 0.03   0.22 ± 0.00   1.18 ± 0.01
gtclass         0.47 ± 0.00   2.70 ± 0.03   0.20 ± 0.00   1.08 ± 0.00
detect          0.49 ± 0.00   2.78 ± 0.03   0.20 ± 0.00   1.13 ± 0.01
glanceft1L      0.42 ± 0.00   2.25 ± 0.02   0.23 ± 0.00   0.91 ± 0.00
glancesos1L     0.44 ± 0.00   2.32 ± 0.03   0.24 ± 0.00   0.92 ± 0.01
asosubft1L      0.38 ± 0.00   2.08 ± 0.02   0.24 ± 0.00   0.87 ± 0.01
seqsubft        0.35 ± 0.00   1.96 ± 0.02   0.18 ± 0.00   0.82 ± 0.01
ens             0.36 ± 0.00   1.98 ± 0.02   0.18 ± 0.00   0.81 ± 0.01
5.2 Analysis of the Predicted Counts
Count versus Count Error : We analyze the performance of each of the methods at different count values on the COCO Counttest set (Fig. 7). We pick each count value on the x-axis and compute the error over all the instances at that count value. Interestingly, we find that the subitizing approaches work really well across a range of count values. This supports our intuition that asosub and seqsub are better able to capture partial counts (from larger objects) as well as integer counts (from smaller objects), which matters since larger counts are likely to occur at a smaller scale. Of the two approaches, seqsub works better, likely because reasoning about global context helps us capture part-like features better compared to asosub. This is quite clear when we look at the performance of seqsub compared to asosub in the count range 11 to 15. For lower count values, ens does the best (Fig. 7). We can see that at larger count values, glance and detect performances start tailing off.
Detection : We tune the hyperparameters of Fast RCNN in order to find the setting where the mean squared error is the lowest, on the Countval splits of the datasets. We show some qualitative examples of the detection ground truth, the performance without tuning for counting (using black-box Fast RCNN), and the performance after tuning for counting on the PASCAL dataset in Fig. 8. We use untuned Fast RCNN at a score threshold of 0.8 and NMS threshold of 0.3, as used by Girshick et al. [18] in their demo. At this configuration, it achieves an mRMSE of 0.52 on the Counttest split of COCO. We find that we achieve a gain of 0.02 by tuning the hyperparameters for detect.
Subitizing : We next analyze how different design choices in asosub affect performance on PASCAL. We pick the best performing asosubft1L model and vary the grid sizes (as explained in Sec. 4). We experiment with the three grid sizes. We observe that for asosub performance peaks at one grid size and deteriorates significantly as the grid becomes finer (Fig. 9). (Footnote 3: One might argue that the gain in performance in asosub when moving from a coarser grid to the best-performing one is due to more (augmented) training data. However, since performance diminishes on increasing the grid size further (which provides even more data to train from), we hypothesize that this is not the case.) This indicates that there is indeed a sweet spot in the discretization as we interpolate between the glance and detect settings. However, we notice that for seqsub this sweet spot lies farther out to the right.
5.3 Counting to Improve Detection
We now explore whether counting can help improve detection performance (on the PASCAL dataset). Detectors are typically evaluated via the Average Precision (AP) metric, which involves a full sweep over the range of score thresholds for the detector. While this is a useful investigative tool, in any real application (say autonomous driving), the detector must make hard decisions at some fixed threshold. This threshold could be chosen on a per-image or per-category basis. Interestingly, if we knew how many objects of a category are present, we could simply set the threshold so that that many objects are detected, similar to Zhang et al. [46]. Thus, we could use per-image, per-category counts as a prior to improve detection.
Note that since our goal is to intelligently pick a threshold for the detector, computing AP (which involves a sweep over the thresholds) is not possible. Hence, to quantify detection performance, we first assign to each detected box one ground truth box with which it has the highest overlap. Then for each ground truth box, we check if any detection box has greater than 0.5 overlap. If so, we assign a match between the ground truth and detection, and take them out of the pool of detections and ground truths. Through this procedure, we obtain a set of true positive and false positive detection outputs. With these outputs we compute the precision and recall values for the detector. Finally, we compute the F-measure as the harmonic mean of these precision and recall values, and average the F-measure values across images and categories. We call this the mF (mean F-measure) metric. As a baseline, we use the Fast RCNN detector after NMS to do a sweep over the thresholds for each category on the validation set to find the threshold that maximizes F-measure for that category. We call this the base detector.
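The matching procedure and F-measure for a single image and category can be sketched as below. This is a minimal illustration assuming overlap is measured by intersection over union and that an image with no true positives scores 0; both are assumptions of the sketch, and the helper names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def true_positives(dets, gts, min_overlap=0.5):
    """Assign each detection to its best-overlapping ground truth box, then
    count ground truth boxes that have an assigned detection above min_overlap."""
    assigned = {}
    for d_idx, det in enumerate(dets):
        if gts:
            g_idx = max(range(len(gts)), key=lambda g: iou(det, gts[g]))
            assigned.setdefault(g_idx, []).append(d_idx)
    return sum(1 for g_idx, d_list in assigned.items()
               if any(iou(dets[d], gts[g_idx]) > min_overlap for d in d_list))


def f_measure(dets, gts, min_overlap=0.5):
    """Harmonic mean of precision (tp / #detections) and recall (tp / #ground truths)."""
    tp = true_positives(dets, gts, min_overlap)
    if tp == 0:
        return 0.0
    # 2PR / (P + R) simplifies to 2 * tp / (#detections + #ground truths).
    return 2.0 * tp / (len(dets) + len(gts))


gts = [(0, 0, 10, 10), (20, 0, 30, 10)]
dets = [(1, 0, 11, 10), (50, 0, 60, 10)]  # one good detection, one false positive
score = f_measure(dets, gts)              # precision 0.5, recall 0.5
```

Averaging `f_measure` over all images and categories would give the mF number used in this section.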
With a fixed per-category score threshold, the base detector gets a performance of 15.26% mF. With ground truth counts to select thresholds, we get a best-case oracle performance of 20.17%. Finally, we pick the outputs of ens and seqsubft models and use the counts from each of these to set separate thresholds. Since our counting methods undercount more often than they overcount (see appendix for more details), a high predicted count implies that the ground truth count is likely to be even higher. Thus, for counts of 0, we default to the base thresholds and for the other predicted counts, we use the counts to set the thresholds. With this procedure, we get gains of 1.64% mF and 1.74% mF over the base performance using ens and seqsubft predictions respectively. Thus, counting can be used as a complementary signal to aid detector performance, by intelligently picking the detector threshold in an image-specific manner.
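One way to read the count-driven thresholding rule is as a top-k selection over the detector's scored boxes, sketched below. Treating "set the threshold so that that many objects are detected" as keeping the k highest-scoring boxes after NMS is an assumption of this sketch, as are the function and parameter names.

```python
def select_detections(dets, predicted_count, base_threshold):
    """Pick detections for one image and category.

    dets: list of (score, box) pairs, already filtered by NMS.
    predicted_count: integer count from a counting model.
    base_threshold: per-category fallback threshold of the base detector.
    """
    ranked = sorted(dets, key=lambda d: d[0], reverse=True)
    if predicted_count <= 0:
        # A predicted count of 0 falls back to the base threshold.
        return [d for d in ranked if d[0] >= base_threshold]
    # Otherwise keep exactly as many boxes as the counter predicts (if available).
    return ranked[:predicted_count]


dets = [(0.9, "a"), (0.6, "b"), (0.4, "c"), (0.2, "d")]
fallback = select_detections(dets, 0, 0.5)  # base threshold keeps two boxes
guided = select_detections(dets, 3, 0.5)    # count of 3 keeps a third box
```

In the example, a predicted count of 3 recovers a box that the fixed threshold of 0.5 would have discarded.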
5.4 VQA Experiment
We explore how well our counting approaches do on simple counting questions. Recent work [3, 35, 31, 15] has explored the problem of answering free-form natural language questions for images. One of the large-scale datasets in the space is the Visual Question Answering [3] dataset. We also evaluate using the COCOQA dataset from [35] which automatically generates questions from human captions. Around 10.28% and 7.07% of the questions in VQA and COCOQA are “how many” questions related to counting objects. Note that both the datasets use images from the COCO [28] dataset. We apply our counting models, along with some basic natural language preprocessing to answer some of these questions.
Table 3: Counting results on the CountQA subsets of VQA and COCOQA.
Approach            mRMSE (VQA)   mRMSE (COCOQA)
detect              2.72 ± 0.09   2.59 ± 0.12
glanceft1L          2.19 ± 0.05   1.86 ± 0.12
asosubft1L          1.94 ± 0.07   1.47 ± 0.04
seqsubft            1.81 ± 0.09   1.34 ± 0.07
ens                 1.80 ± 0.07   1.40 ± 0.08
Deeper LSTM [21]    2.71 ± 0.23   N/A
SOTA VQA [15]       3.25 ± 0.94   N/A
Given the question “how many bottles are there in the fridge?” we need to reason about the object of interest (bottles), understand referring expressions (in the fridge) etc. Note that since these questions are free-form, the category of interest might not exactly correspond to a COCO category. We tackle this ambiguity by using word2vec embeddings [32]. Given a free-form natural language question, we extract the noun from the question and compute the closest COCO category by checking similarity of the noun with the categories in the word2vec embedding space. In case of multiple nouns, we just retain the first noun in the sentence (since “how many” questions typically have the subject noun first). We then run the counting method for the COCO category (see Fig. 10). More details can be found in the supplementary material. Note that parsing referring expressions is still an open research problem [23, 39]. Thus, we filter questions based on an “oracle” for resolving referring expressions. This oracle is constructed by checking if the ground truth count of the COCO category we resolve using word2vec matches the answer for the question. Evaluating only on these questions allows us to isolate errors due to inaccurate counts. We evaluate our outputs using the RMSE metric (Sec. 4.2). We use this procedure to compile a list of 1774 and 513 questions (CountQA) from the VQA and COCOQA datasets respectively, to evaluate on. We will publicly release our CountQA subsets to help future work.
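The category resolution and oracle filtering steps can be sketched as below. The three-dimensional vectors are made-up stand-ins for word2vec embeddings, cosine similarity is an assumed similarity measure, and noun extraction is taken as given (in practice it would need a part-of-speech tagger); none of these details are specified by the text.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def closest_category(noun, categories, embeddings):
    """Map a question noun to the most similar category in embedding space."""
    return max(categories, key=lambda c: cosine(embeddings[noun], embeddings[c]))


def passes_oracle(answer, gt_counts, category):
    """Keep a question only if the resolved category's ground truth count
    in the image equals the annotated answer."""
    return gt_counts.get(category, 0) == answer


# Hypothetical toy embeddings, NOT real word2vec vectors.
embeddings = {
    "bottles": (0.9, 0.1, 0.0),
    "bottle": (1.0, 0.0, 0.0),
    "cup": (0.6, 0.6, 0.0),
    "dog": (0.0, 0.0, 1.0),
}
categories = ["bottle", "cup", "dog"]
category = closest_category("bottles", categories, embeddings)
keep = passes_oracle(3, {"bottle": 3, "cup": 1}, category)
```

A question that survives the oracle is then answered by running the counting model for the resolved category.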
We report performances in Table 3. The trend of increasing performance is visible from glance to ens. We find that seqsub significantly outperforms the other approaches. We also evaluate a state-of-the-art VQA model [15] on the CountQA VQA subset and find that even glance does better by a substantial margin.^{5}
^{5} For the column corresponding to VQA, all methods are evaluated on the subset of the predictions where [21] and [15] both produced numerical answers. For [21], there were 11 non-numerical answers and for [15] there were 3 (e.g., “many”, “few”, “lot”).
6 Conclusion
We study the problem of counting everyday objects in everyday scenes. We evaluate some baseline approaches to this problem using object detection, regression using global image features, and associative subitizing, which involves regression on non-overlapping image cells. We propose sequential subitizing, a variant of the associative subitizing model which incorporates context across cells using a pair of stacked bidirectional LSTMs. We find that our proposed models lead to improved performance on the PASCAL VOC 2007 and COCO datasets. We thoroughly evaluate the relative strengths, weaknesses and biases of our approaches, providing a benchmark for future approaches on counting, and show that an ensemble of our proposed approaches performs the best. Further, we show that counting can be used to improve object detection and present proof-of-concept experiments on answering ‘how many?’ questions in visual question answering tasks. Our code and datasets will be made publicly available.
Acknowledgements. We are grateful to the developers of Torch [9] for building an excellent framework. This work was funded in part by NSF CAREER awards to DB and DP, ONR YIP awards to DP and DB, ONR Grant N000141410679 to DB, a Sloan Fellowship to DP, ARO YIP awards to DB and DP, an Allen Distinguished Investigator award to DP from the Paul G. Allen Family Foundation, Google Faculty Research Awards to DP and DB, Amazon Academic Research Awards to DP and DB, and NVIDIA GPU donations to DB. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the U.S. Government, or any sponsor.
References
 [1] NLTK. http://www.nltk.org/.
 [2] A. Agrawal, D. Batra, and D. Parikh. Analyzing the behavior of visual question answering models. CoRR, abs/1606.07356, 2016.
 [3] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh. VQA: visual question answering. In 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7–13, 2015, pages 2425–2433, 2015.
 [4] C. Arteta, V. Lempitsky, and A. Zisserman. Counting in the wild. In European Conference on Computer Vision, 2016.
 [5] J. Carreira, R. Caseiro, J. Batista, and C. Sminchisescu. Semantic segmentation with second-order pooling. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), volume 7578 LNCS, pages 430–443, 2012.
 [6] A. B. Chan and N. Vasconcelos. Privacy preserving crowd monitoring: Counting people without people models or tracking. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–7. IEEE, 6 2008.
 [7] A. B. Chan and N. Vasconcelos. Bayesian poisson regression for crowd counting. In 2009 IEEE 12th International Conference on Computer Vision, pages 545–551. IEEE, 9 2009.
 [8] D. H. Clements. Subitizing: What is it? why teach it? Teaching children mathematics, 5(7):400, 1999.
 [9] R. Collobert, K. Kavukcuoglu, and C. Farabet. Torch7: A Matlab-like environment for machine learning. In BigLearn, NIPS Workshop, 2011.
 [10] S. Cutini and M. Bonato. Subitizing and visual short-term memory in human and non-human species: a common shared system? Frontiers in Psychology, 3, 2012.
 [11] S. Dehaene, V. Izard, E. Spelke, and P. Pica. Log or linear? distinct intuitions of the number scale in western and amazonian indigene cultures. Science, 320(5880):1217–1220, 2008.
 [12] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. Decaf: A deep convolutional activation feature for generic visual recognition. 10 2013.
 [13] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal visual object classes challenge: A retrospective. International Journal of Computer Vision, 111(1):98–136, Jan. 2015.
 [14] P. Felzenszwalb, D. McAllester, and D. Ramanan. A discriminatively trained, multiscale, deformable part model. In 26th IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2008.
 [15] A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach. Multimodal compact bilinear pooling for visual question answering and visual grounding. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1–4, 2016, pages 457–468, 2016.
 [16] F. Galton. One Vote, One Value. 75:414, Feb. 1907.
 [17] S. Gidaris and N. Komodakis. Object detection via a multi-region and semantic segmentation-aware CNN model. In Proceedings of the IEEE International Conference on Computer Vision, pages 1134–1142, 2015.
 [18] R. Girshick. Fast R-CNN. In International Conference on Computer Vision (ICCV), 2015.
 [19] H. Idrees, I. Saleemi, C. Seibert, and M. Shah. Multi-source multi-scale counting in extremely dense crowd images. In Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition, CVPR ’13, pages 2547–2554, Washington, DC, USA, 2013. IEEE Computer Society.
 [20] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of The 32nd International Conference on Machine Learning, pages 448–456, 2015.
 [21] J. Lu, X. Lin, D. Batra, and D. Parikh. Deeper LSTM and normalized CNN visual question answering model. https://github.com/VT-vision-lab/VQA_LSTM_CNN, 2015.
 [22] J. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei, C. L. Zitnick, and R. Girshick. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In CVPR, 2017.
 [23] S. Kazemzadeh, V. Ordonez, M. Matten, and T. L. Berg. Referit game: Referring to objects in photographs of natural scenes. In EMNLP, 2014.
 [24] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
 [25] A. Klein and P. Starkey. Universals in the development of early arithmetic cognition. New Directions for Child and Adolescent Development, 1988(41):5–26, 1988.
 [26] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
 [27] V. Lempitsky and A. Zisserman. Learning To Count Objects in Images. In Advances in Neural Information Processing Systems, pages 1324–1332, 2010.
 [28] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
 [29] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, and S. E. Reed. SSD: single shot multibox detector. CoRR, abs/1512.02325, 2015.
 [30] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
 [31] M. Malinowski, M. Rohrbach, and M. Fritz. Ask your neurons: A neural-based approach to answering questions about images. In Proceedings of the IEEE International Conference on Computer Vision, pages 1–9, 2015.
 [32] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.
 [33] D. Oñoro-Rubio and R. J. López-Sastre. Towards perspective-free object counting with deep learning. In ECCV, 2016.
 [34] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
 [35] M. Ren, R. Kiros, and R. Zemel. Exploring models and data for image question answering. In Advances in Neural Information Processing Systems, pages 2953–2961, 2015.
 [36] M. Ren and R. S. Zemel. End-to-end instance segmentation and counting with recurrent attention. CoRR, abs/1605.09410, 2016.
 [37] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 91–99. Curran Associates, Inc., 2015.
 [38] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
 [39] A. Sadovnik, A. C. Gallagher, and T. Chen. It’s not polite to point: Describing people with uncertain attributes. In CVPR, pages 3089–3096. IEEE, 2013.
 [40] M. Schuster and K. K. Paliwal. Bidirectional recurrent neural networks. IEEE Trans. Signal Processing, 45:2673–2681, 1997.
 [41] S. Seguí, O. Pujol, and J. Vitrià. Learning to count with deep object features. may 2015.
 [42] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. 9 2014.
 [43] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. Computer Vision and Pattern Recognition (CVPR), 1:I-511–I-518, 2001.
 [44] L. Wan, D. Eigen, and R. Fergus. End-to-end integration of a convolution network, deformable parts model and non-maximum suppression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 851–859, 2015.
 [45] C. Zhang, H. Li, X. Wang, and X. Yang. Cross-Scene Crowd Counting via Deep Convolutional Neural Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 833–841, 2015.
 [46] J. Zhang, S. Ma, M. Sameki, S. Sclaroff, M. Betke, Z. Lin, X. Shen, B. Price, and R. Měch. Salient object subitizing. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
Appendix

In Sec. A, we report results of ablation studies conducted on the Countval split for glance, asosub and seqsub models

In Sec. B, we report some analyses of the count predictions generated by our models, specifically comparing object sizes with count performance and overcounting/undercounting statistics

In Sec. C, we show results of occlusion studies performed to identify the regions of interest in the scene while estimating the counts

In Sec. E, we present some qualitative examples of the predictions generated by our models
Appendix A Ablation Studies
We explain the architectures for glance, asosub and seqsub in Sec. 3. Here we report results of some ablation studies conducted on these architectures.
For glance and asosub, we search over the following architecture space. Firstly, we vary the hidden layer sizes in the set . Secondly, we vary the number of hidden layers in the model between and (with the previously selected hidden layer size). Corresponding to these settings, we search for the best performing architecture for ft (detection finetuned fc7) and noft (classification fc7) features extracted from PASCAL images. For asosub, in addition to this, we look for the best performing architecture across different grid sizes (, , ). We narrow down to some design choices with and then compare different grid sizes.
For seqsub, we vary the number of BiLSTM (context aggregator) units per sequence. Subsequently, we vary the grid size from to . We report studies on both PASCAL and COCO.
All results are reported on the Countval splits of the concerned datasets.
glance : We find that the performance for ft1L remains more or less constant as we change the size of the hidden layers (Fig. 11). In contrast, the noft1L model does best at smaller hidden layer sizes. A two hidden layer noft model does better than both 1L models. Intuitively, this makes sense since the noft features are better suited to global image statistics than the detection finetuned ft features.
asosub : We next contrast different design choices for asosub. Details of how the performance changes with different grid sizes in asosub have been discussed in the main paper. In particular, just like the previous section, we study the impact of hidden layer sizes and number of hidden layers, as well as the choice of features (ft vs noft) for the asosub model (Fig. 12). We find that the detection finetuned ft features do much better than the classification features noft for asosub. This is likely because the ft features are better adapted to statistics of local image regions than the noft image classification features. We also find that increasing the number of hidden layers does not improve performance over using a single hidden layer, unlike glance.
asosub : We next compare how the performance of asosub varies as we change the size of the grids. We pick the best performing asosub features (ft) and number of hidden layers (1). We then vary the size of the hidden layer and compare the performance of , , and asosub approaches (Fig. 13). We find that and models do much better than the model. The performance of is slightly better than the model. A similar comparison on the Counttest set can be found in the main paper.
seqsub : In Fig. 14 and Fig. 15, we compare the effect of changing the grid size from to for the seqsub models. We use both ft and noft features extracted from PASCAL and COCO images. On PASCAL (Fig. 14), we observe that increasing the grid size yields a slight improvement in performance for both ft and noft features, unlike COCO (Fig. 15), where going from to there is a drop in performance for both ft and noft features. We should also note that, in general, ft features perform better than noft features, similar to asosub.
We also varied the number of BiLSTM (context aggregator) units from to per sequence in the seqsub architectures for a grid size of . We observe that for ft features, change in the number of BiLSTM units does not make a difference on both PASCAL and COCO. However, for noft features, going from to leads to a drop of on COCO and PASCAL.
Appendix B Count Analysis
Size versus Count Error : We compare the performance of seqsub, glance, asosub and detect for object categories of various sizes on PASCAL (Fig. 16) and COCO (Fig. 17). To get the object size, we sum the number of pixels occupied by an object across images where the object occurs in the Countval set and divide this number by the average number of (non-zero) instances of the object. This gives us an estimate of the expected size occupied per instance of an object. We show a sorting of smaller to larger categories on the x-axis in Fig. 16 and Fig. 17.
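As a literal reading of the size estimate above, the following sketch divides the total pixel area of a category by its average non-zero instance count. The function name `size_per_instance` and the `(pixel_area, instance_count)` record layout are assumptions made for illustration.

```python
def size_per_instance(records):
    """Estimate the size of a category from per-image records.

    records: list of (pixel_area, instance_count) tuples, one per image.
    Images where the category does not occur (count 0) are ignored.
    """
    present = [(area, n) for area, n in records if n > 0]
    if not present:
        return 0.0
    total_pixels = sum(area for area, _ in present)
    mean_count = sum(n for _, n in present) / len(present)
    return total_pixels / mean_count

# Hypothetical per-category records, for illustration only.
RECORDS = {
    "spoon": [(200, 2), (100, 1), (0, 0)],
    "sofa": [(9000, 1), (12000, 2)],
}

# Categories ordered from smaller to larger, as on the x-axis of the figures.
ORDERED = sorted(RECORDS, key=lambda c: size_per_instance(RECORDS[c]))
```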
We find that ens, seqsub and asosub perform consistently well across the spectrum of object sizes on both PASCAL and COCO. On PASCAL, the error keeps decreasing as the object size increases. This trend is not consistent over the entire spectrum of sizes for COCO. Another interesting observation is that as the object size increases, the methods start performing competitively. This also indicates that asosub and seqsub are able to capture partial ground truth counts well, since the counts for larger categories will necessarily be partial.
Undercounting versus Overcounting :
We study whether the models proposed in the paper undercount or overcount. Specifically, we report the number of times the approaches overcount, undercount or predict the ground truth count on PASCAL (Fig. 18) and COCO (Fig. 19). To do this, we first filter out all instances where the ground truth count is 0 (since we cannot undercount 0). We then check if the predicted count is greater than the ground truth (overcounting), less than the ground truth (undercounting), or equal to the ground truth (equal). We perform this analysis on the Counttest split.
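The tallying procedure above can be sketched as follows. The function name `tally_counts` is hypothetical, and the predictions are assumed to have already been converted to integers before comparison.

```python
from collections import Counter

def tally_counts(predictions, ground_truths):
    """Count overcounting, undercounting and exact matches.

    Instances with a ground truth count of 0 are skipped,
    since a count of 0 cannot be undercounted.
    """
    tally = Counter({"over": 0, "under": 0, "equal": 0})
    for pred, gt in zip(predictions, ground_truths):
        if gt == 0:
            continue
        if pred > gt:
            tally["over"] += 1
        elif pred < gt:
            tally["under"] += 1
        else:
            tally["equal"] += 1
    return tally

# Toy predictions and ground truths, for illustration only.
TALLY = tally_counts([1, 2, 3, 0, 5], [0, 2, 4, 1, 3])
```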
On PASCAL (Fig. 18), we observe a clear increase in the number of times we get the count right as we go from detect to ens. The models in general undercount more often than they overcount. As we go from detect to ens, the improvement in performance can be attributed to the increase in the frequency of equal relative to undercount. Interestingly, for ens we get the count right more often than we undercount the ground truth. The number of times we overcount stays more or less the same.
On COCO (Fig. 19), we observe that although there is an increase in the number of times we get the count right as we go from detect to ens, the frequency of equal is much lower than the frequency of undercounting for all the models. This is understandable, as COCO has more categories and objects of different categories have a lower chance of being in the same image.
Ensemble : We study different combinations of the predictions for constructing the ensemble on the Counttest set.
On PASCAL, an ensemble of seqsub and asosub gives an error of 0.427, as opposed to 0.438 for an ensemble of seqsub and glance. One can think of combining global and local context by taking an ensemble of glance and asosub. However, we observe that such an ensemble underperforms seqsub by 0.02. We also consider including the detect baseline in the ensemble. We see that an ensemble of detect, glance, asosub and seqsub gives an error of 0.43, as opposed to an ensemble of glance, asosub and seqsub, which gives 0.42. Thus detect, when included in the ensemble, hurts the counting performance.
On COCO, an ensemble of seqsub and asosub gives an error of 0.351, as opposed to 0.363 for an ensemble of seqsub and glance. We observe that an ensemble of glance and asosub underperforms seqsub by 0.02. When detect is included, we see that an ensemble of detect, glance, asosub and seqsub gives an error of 0.38, as opposed to an ensemble of glance, asosub and seqsub, which gives 0.36. Just like on PASCAL, detect, when included in the ensemble, hurts the counting performance.
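Comparing ensemble compositions of this kind can be sketched as below. This appendix does not state how member predictions are combined, so the unweighted mean used here is an assumption, and a plain RMSE stands in for the error metric. The names `ensemble` and `rmse` and all numbers are hypothetical.

```python
import math

def ensemble(*model_predictions):
    """Combine per-model prediction lists by an unweighted mean (assumed rule)."""
    return [sum(values) / len(values) for values in zip(*model_predictions)]

def rmse(predictions, ground_truths):
    """Root mean squared error between predicted and true counts."""
    squared = [(p - g) ** 2 for p, g in zip(predictions, ground_truths)]
    return math.sqrt(sum(squared) / len(squared))

# Toy predictions from three hypothetical models on four images.
GT = [2, 0, 5, 1]
PREDS = {
    "glance": [3.0, 0.0, 3.0, 1.0],
    "asosub": [2.0, 1.0, 4.0, 1.0],
    "seqsub": [2.0, 0.0, 6.0, 1.0],
}

# Error of each candidate pairing, for side-by-side comparison.
ERRORS = {
    pair: rmse(ensemble(PREDS[pair[0]], PREDS[pair[1]]), GT)
    for pair in [("seqsub", "asosub"), ("seqsub", "glance"), ("glance", "asosub")]
}
```

Looping over candidate subsets in this way makes it straightforward to check whether adding a weaker member, such as a detection baseline, raises or lowers the ensemble error.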
Appendix C Occlusion Studies
In Fig. 20, we perform occlusion studies to understand where glance, asosub and seqsub look in the image while estimating the counts of different objects.
For this analysis, we pick images with a spread in counts from 10 (top row) to 1 in the middle row. For each count, we identified images where all three approaches agreed on the counts so that we could analyze where each method “looks” in order to derive the corresponding counts. We pick images from the COCO Counttest split where the predicted counts for glance, asosub and seqsub are equal. The asosub and seqsub models are trained on discretization of the images. We move sized masks across the image with a non-overlapping stride to get occlusion maps. It is interesting to observe that for the image with a large ground truth count, the occlusion maps from seqsub are very similar to those from asosub. This confirms our intuition that for larger counts, one needs access to local texture-like patterns to accumulate count densities across the image. For smaller counts (rows 2 and 3), we notice that the maps from glance and seqsub are more similar, indicating that global cues such as the number of parts appearing in the image (say the number of tails of elephants), potentially captured by the distributed CNN representation, are sufficient for counting. Thus, this experiment confirms our intuition that seqsub captures the best of both the glance and asosub approaches, providing us a way to “interpolate” between these approaches based on the counts.
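The occlusion procedure can be sketched as follows. The text does not specify what is recorded at each mask position, so this sketch assumes the map stores the drop in predicted count when a region is masked out. The function `occlusion_map`, the `predict` callable, and the toy counter are all hypothetical.

```python
def occlusion_map(image, predict, mask_size, fill=0):
    """Slide a square mask over the image with a non-overlapping stride.

    image: 2-D list of pixel values.
    predict: callable mapping an image to a predicted count.
    Returns a 2-D list with the drop in predicted count per mask position.
    """
    height, width = len(image), len(image[0])
    baseline = predict(image)
    heatmap = []
    for top in range(0, height, mask_size):
        row = []
        for left in range(0, width, mask_size):
            occluded = [list(r) for r in image]  # copy so the input is untouched
            for y in range(top, min(top + mask_size, height)):
                for x in range(left, min(left + mask_size, width)):
                    occluded[y][x] = fill
            row.append(baseline - predict(occluded))
        heatmap.append(row)
    return heatmap

def toy_counter(image):
    """Stand-in for a counting model: counts non-zero pixels."""
    return sum(1 for r in image for value in r if value != 0)

IMAGE = [
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 1],
]
HEATMAP = occlusion_map(IMAGE, toy_counter, mask_size=2)
```

Regions with large values in the resulting map are those the model relies on most for its count, which is what the qualitative comparison between glance, asosub and seqsub inspects.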
Appendix D VQA Experiment
We next elaborate on the VQA experiment described in Sec. 5.4. More specifically, we discuss how we preprocess the ground truth to make it numeric and give details of how we resolve the correspondence between a noun in a question and counts of COCO categories.
As reported in the paper, we use the VQA [3] and COCOQA [35] datasets for our counting experiments. We extract the “how many?” type questions, which have numerical answers. This includes both integers (VQA) and numbers written in the form of text (COCOQA). We parse the latter into corresponding numbers on the COCOQA dataset. That is, “five” is mapped to 5. From the selected questions, we extract the nouns (singular, plural, and proper), and convert them to their singular form using the Stanford Natural Language Parser (NLTK) [1].
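Mapping textual answers to numbers can be sketched as below. The vocabulary `NUMBER_WORDS` and the function `parse_answer` are hypothetical, and the range of number words covered is an assumption; noun extraction and singularization are not shown.

```python
# Assumed vocabulary of number words; the range actually needed may differ.
NUMBER_WORDS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12,
}

def parse_answer(answer):
    """Return the integer for a digit string or number word, else None."""
    token = answer.strip().lower()
    if token.isdigit():
        return int(token)
    return NUMBER_WORDS.get(token)
```

Answers that cannot be parsed (for example “many”) return `None` and can be dropped from the numeric evaluation.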
We train word2vec word embeddings on Wikipedia^{6} and use cosine similarities in the embedding space as word similarity. Using these, we find the COCO category or COCO supercategory that best matches the extracted nouns. These supercategory annotations are available as part of the COCO dataset. We run our models for the COCO category, and consider the output the answer. For the extracted nouns, if the best match is with a COCO supercategory, we sum the counts obtained by our counting methods for each of the COCO subcategories belonging to that supercategory. For example, if the resolved noun is animal, we sum the counts for horse, giraffe, cat, dog, zebra, sheep, cow, elephant, bear, and bird and use the output as our predicted count.
^{6} https://www.wikipedia.org/
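The supercategory rule above can be sketched as follows, using the animal example from the text. The mapping `SUPERCATEGORIES` lists only that one supercategory, and the function name `predicted_count` and the toy counts are hypothetical.

```python
# Only the "animal" supercategory from the example is listed here.
SUPERCATEGORIES = {
    "animal": ["horse", "giraffe", "cat", "dog", "zebra",
               "sheep", "cow", "elephant", "bear", "bird"],
}

def predicted_count(resolved, category_counts):
    """Sum member counts if the resolved noun is a supercategory,
    otherwise return the count of the category itself."""
    members = SUPERCATEGORIES.get(resolved, [resolved])
    return sum(category_counts.get(member, 0) for member in members)

# Toy per-category counts for one image, for illustration only.
COUNTS = {"horse": 2, "dog": 1, "chair": 4}
```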
Appendix E Qualitative Results
We finally show some qualitative examples of our predictions on COCO images in Fig. 21, where ens performs best. We observe that whenever the objects present in the image are sufficiently salient, seqsub and asosub do a noticeably better job of estimating the count of objects than glance. This is because seqsub and asosub estimate partial counts at the cell level, unlike glance, which has to regress to the count of the entire image. In some cases where the objects present are highly occluded, we see that seqsub and asosub do a much better job at estimating the count. In summary, we find that ens, as a combination of glance, asosub and seqsub, gets the count right most often.