Deployment of Customized Deep Learning based Video Analytics On Surveillance Cameras

Pratik Dubal, Rohan Mahadev, Suraj Kothawade, Kunal Dargan, Rishabh Iyer
AitoeLabs

This paper demonstrates the effectiveness of our customized deep learning based video analytics system in various applications focused on security, safety, customer analytics and process compliance. We describe our video analytics system, comprising Search, Summarize, Statistics and real-time alerting, and outline its building blocks. These building blocks include object detection, tracking, face detection and recognition, and human and face sub-attribute analytics. In each case, we demonstrate how custom models trained using data from the deployment scenarios provide considerably higher accuracy than off-the-shelf models. Towards this end, we describe our data processing and model training pipeline, which can train and fine-tune models from videos with a quick turnaround time. Finally, since most of these models are deployed on-site, it is important to have resource constrained models which do not require GPUs. We demonstrate how we custom train resource constrained models and deploy them on embedded devices without significant loss in accuracy. To our knowledge, this is the first work which provides a comprehensive evaluation of different deep learning models on various real-world customer deployment scenarios of surveillance video analytics. By sharing our implementation details and the lessons learned from deploying customized deep learning models for various customers, we hope that customized deep learning based video analytics will be widely incorporated in commercial products around the world.

Keywords: Deep Learning, Convolutional Neural Networks, Computer Vision, Customized Video Analytics

1 Introduction

Visual data, in the form of images, videos and live streams, has been growing at an unprecedented rate in the last few years. While this massive amount of data is a blessing for Data Science, as it helps in improving predictive accuracy, it is also a curse, since humans are unable to consume this much data. Moreover, machine-generated videos (via drones, dash-cams, body-cams, surveillance cameras etc.) are now being generated at a rate higher than what we as humans can process. Among machine-generated videos, surveillance videos are one of the largest contributors to this growth. Surveillance cameras are deployed in several verticals, including office facilities, road intersections for traffic monitoring, ATMs and banks, hospitals, manufacturing facilities, industrial plants, construction sites, educational institutions, retail stores and malls, hotels and restaurants etc. Each of these verticals has its own unique video analytics applications. In most scenarios, video analytics is used for security purposes (detecting loitering and intrusion, asset tampering, suspicious activity or object detection). In other scenarios, video analytics is used for process compliance, e.g. checking whether an event in a manufacturing plant happened on time, or whether it was done as desired. In retail scenarios and hotels, the information from video analytics is used for gaining insights into customer patterns (e.g. heat-maps, flow-maps, counts, dwell-times etc.). While all these applications sound very different, the analytics building blocks are the same.

Figure 1: End-to-End process for analytics

Figure 1 demonstrates the process clearly. The analytics engine consists of several building blocks, including object detection, tracking, face and human detection, human and face sub-attribute recognition, vehicle detection and vehicle sub-attribute recognition etc. The information from the analytics engine is then passed on to a business logic layer, which applies rules based on the analytics output. For example, using human detection (localizing where a human is in the video frame) if a human enters a demarcated area, it sends out a real-time alert. Similarly, by tracking the paths of the human in the video, we can compute the heat-map and flow-map of human movement.
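The intrusion rule described above can be sketched in a few lines. This is only an illustration of a business logic rule, not the production implementation: the function names, the polygon representation of the demarcated area, and the use of the bottom-centre of a bounding box as the person's position are all assumptions made for this example.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels


def point_in_polygon(point: Point, polygon: List[Point]) -> bool:
    """Ray-casting test: True if the point lies inside the polygon."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray starting at the point cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def foot_point(box: Box) -> Point:
    """Bottom-centre of a box (assumed proxy for where the person stands)."""
    x_min, _, x_max, y_max = box
    return ((x_min + x_max) / 2.0, y_max)


def intrusion_alerts(person_boxes: List[Box], zone: List[Point]) -> List[Box]:
    """Return the person detections whose foot point falls inside the zone."""
    return [box for box in person_boxes if point_in_polygon(foot_point(box), zone)]


# Hypothetical demarcated area and two person detections from one frame.
zone = [(100, 100), (300, 100), (300, 300), (100, 300)]
boxes = [(150, 120, 200, 250), (400, 50, 450, 200)]
alerts = intrusion_alerts(boxes, zone)  # only the first box is inside the zone
```

In a deployment, the detections would come from the analytics engine for every processed frame, and each non-empty result would trigger the real-time alert.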

The following sections outline the advancement of deep learning in computer vision, followed by the recent advances and challenges of video analytics for surveillance applications. Finally, we outline the main contributions of this paper.

1.1 Advancement of Deep Learning in Computer Vision

Current approaches to all but a few Computer Vision tasks involve the use of Deep Convolutional Neural Networks (CNNs). CNNs generated a lot of interest after the successful performance of AlexNet [15] in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2012 competition [29]. Following its triumph, there was an upsurge in the number of deep CNN models being used across the Computer Vision community. The winner of ILSVRC 2014 was the even deeper GoogLeNet, which was the first CNN model to have a fundamentally different architecture from AlexNet. It was followed by ResNet [10], the winner of ILSVRC 2015, which was an astonishing 152 layers deep. It won the competition with a top-5 error rate of 3.57%, surpassing human performance on the image classification task.

Similarly, for Object Detection tasks, there has been a significant advancement in the use of CNNs in the last five years. It started with the introduction of the Region based family of networks [7, 6, 27]. Recently, the Single Pass family of networks, consisting of YOLO and Tiny YOLO [26, 25], along with the Single Shot MultiBox Detector (SSD) [21], have emerged as the state-of-the-art models in object detection. Their extremely fast inference times allow object detection to take place in real-time, thus broadening the areas of application where Deep Learning can be used for vision tasks.

While there has been a remarkable improvement in the performance of Object Detection models for real-time tasks, an important role is still played by Object Tracking algorithms. Tracking algorithms are much faster than detection algorithms and help preserve the identity of an object being tracked when detection fails. Object Tracking is highly dependent on the quality of object detections. With good detections, the performance of a simple tracking algorithm increases drastically.

State-of-the-art face recognition techniques such as DeepFace [32] and Deep Face Recognition [22] both consist of CNNs. Some of the highest results on the ’Labeled Faces in the Wild’ (LFW) dataset [12] have been achieved by supervised CNNs [16]. Recently, a ResNet based face embeddings model has been proposed by King [14]. In almost all comparisons, deep face recognition models have outperformed older hand-crafted face embedding models.

1.2 Challenges of Video Analytics in Security and Surveillance

A lot of research has gone into video analytics systems for security and surveillance applications. Over the past two decades, video analytics companies have been providing solutions and products for video analytics in several domains. Gong et al. [8] and Gouaillier and Fleurant [9] provide a good summary of the technology, the problems, and the companies building analytics products in this space. Most surveillance cameras have a fixed angle of view, and for this reason, video analytics on surveillance cameras is slightly easier than video analytics on moving cameras. Many video analytics problems can therefore be solved by background subtraction algorithms, which essentially use motion information to generate contours and motion blobs. Sobral and Vacavant [31] provide a very comprehensive survey of background subtraction algorithms for motion analytics. These algorithms work well in low traffic situations, and where one wants highly sensitive alerts for problems like intrusion detection, motion detection and asset tampering. However, background subtraction algorithms are mostly unsupervised and are not trained to specifically detect humans or other objects of interest. As a result, they cannot distinguish motion caused by shadows or leaf movements from a human or animal intrusion, and often generate a lot of false alarms. On the other hand, background subtraction algorithms are extremely fast and scale very well in embedded applications. Due to privacy and bandwidth issues, it is often infeasible or prohibitively costly to deploy video analytics solutions on the cloud. As a result, it is essential to develop resource constrained video analytics solutions which can be deployed on premise. Deep learning has dominated the landscape of computer vision for the past few years, and almost all video analytics applications can be solved with high accuracies via deep learning. However, deep learning algorithms are resource hungry and require expensive GPU cloud servers to deploy. Given this, developing resource constrained, locally deployable and embedded deep learning solutions is critical. Several very recent advances like the MobileNet family of models [11], XNOR-Net [24] and Tiny YOLO [25] have enabled the deployment of resource constrained models on embedded devices.

1.3 Our Contributions

The following are the main contributions of this paper:

  • A systematic overview of what it takes to build an end-to-end video analytics system for the surveillance domain.

  • A data collection and training pipeline to ensure fast turnaround times, along with data sampling and augmentation tricks used.

  • A comprehensive data analysis, in terms of the number of images used for training in each deployment scenario, and other subtle tricks and lessons learned to get these to work.

  • A comprehensive evaluation of how custom models based on deployment scenarios provide considerably superior accuracies than off-the-shelf models.

  • A demonstration of how powerful deep learning models can be run at reasonable frame-rates on edge devices, along with a comparison of the accuracies of resource constrained models vis-à-vis cloud enabled models.

  • Lastly, we show how the resource constrained edge models perform considerably better than off-the-shelf GPU enabled models, thereby emphasizing the power of model customization for deployments.

To our knowledge, this paper provides the first comprehensive evaluation of various computer vision tasks such as object detection and localization, face detection and face recognition, face and human sub-attribute recognition etc. In each case, we provide comprehensive evaluation of deep learning models on real-world customer data and deployment scenarios.

2 Video Analytics System Overview

For a robust video analytics system in the surveillance domain, the accuracy of the models and their inference times play a pivotal role. False positives in detections and delays in the transmission of real-time alerts may hinder the effective utilization of the system. Thus, our research emphasizes the creation of deep learning based models which are capable of achieving high accuracy without compromising on inference times.

Building on top of the recent advancements in deep learning, we propose our multi-faceted video analytics system, which focuses on performing real-time analytics such as object detection, face analytics, human and face sub-attribute recognition, all on the edge, while achieving near state-of-the-art accuracies. We divide up our analytics into four main components:

2.1 Object Detection

Object Detection is a key component and the starting point of our analytics pipeline. Tremendous progress has been made on this problem by the region-based family [6, 27] and the single-pass family, i.e. YOLO [26, 25] and SSD [21]. Even though the region based family (see Section 1.1) provides high detection accuracy, its earlier members rely on ’Selective Search’ for region proposals, which hampers the detection speed. Even the fastest, highest accuracy region based detection algorithm, Faster R-CNN [27], can achieve only 7 FPS, which is not viable for scenarios that require real-time object detection. On the other hand, single-pass detectors like YOLO and SSD do not rely on a separate bounding box proposal stage and still give significantly better results in terms of both speed and accuracy. Recently, Redmon and Farhadi released YOLOv3 [25], which claims to be 3x faster than SSD with the same accuracy. Moreover, YOLO has a smaller derived network called Tiny YOLO which is capable of operating on a CPU. This paper builds upon the YOLO family of networks for performing object detection in real-time. Depending on the use case, we custom train the object detectors on the classes of objects relevant to the business needs of the deployment. For example, for monitoring safety in construction sites, we might care about objects such as humans, helmets, safety shoes etc. In these cases, we do not need to consider other objects such as cars, buses or bags. On the other hand, in a traffic scenario, the focus will be on vehicle classes, such as cars, trucks, buses, motorbikes etc.

2.2 Face Detection and Recognition

Another important part of our pipeline is face detection and recognition. For a long time, the detection framework laid out by Viola and Jones [34] was the go-to method for face detection. Though it was fast, it produced quite a lot of false positives. Another commonly used face detection algorithm was proposed by Liao et al. [19]. They extracted a feature from an image, computed as the difference-to-sum ratio between two pixel values, which they called the ’Normalized Pixel Difference’ (NPD) [19]. They used this feature along with a soft-cascade classifier to detect faces in the given image. The NPD face detector was reported to be fast and achieved state-of-the-art performance on standard datasets. However, we found that NPD was slow and required a lot of tuning to work on surveillance videos, due to inconsistent frame dimensions. As a result, we use a Single Shot Detector [21] model based on ResNet [3]. As illustrated in Section 4.2.1, the ResNet-SSD model outperforms NPD and Haar cascades in terms of both speed and accuracy. For face recognition, we use the ResNet [10] based face embedding model trained by King [14] on about three million images. In Section 4.2.2, we compare the accuracy of various face embeddings on surveillance face recognition datasets.

2.3 Sub-Attribute Recognition

Based on the detected objects, we then perform sub-attribute recognition. The ability to recognize sub-attributes for a localized object from a larger image allows us to index an object in multiple ways. So, first we need to detect the position of the people by running the captured frames through an object detector, as discussed in Section 4.1, after which we classify the object on the basis of its sub-attributes.

On the localized people, we run human sub-attribute recognition, which consists of recognizing the age, gender, apparel type and color etc. Similarly, in the case of vehicles, we might be interested in the make and type of the vehicle. In the case of faces, we care about the recognition of face sub-attributes such as age, gender and emotion. In the case of other objects (e.g. bags, helmets etc.) we might be interested in properties like the color and size of the object. For tackling most of these sub-attribute recognition problems, we utilize two methodologies, viz. ’Transfer Learning’ and ’Fine Tuning’.

2.3.1 Transfer Learning

In this approach, we choose a pre-trained CNN model, ideally trained for a contextually similar problem. We then choose a layer of the model, whose activations are used as the features of an image when it is forward passed through the network. This extraction of feature vectors is performed for all images in the training set. Once the feature extraction is complete, we train a multinomial logistic regression model on the extracted features. On the positive side, Transfer Learning allows us to quickly train models without a GPU. It generally achieves a high accuracy on a held-out test set, even when trained on a small training set. However, Transfer Learning requires domain specific knowledge and familiarity with the intricacies of the base CNN model in order to identify the feature extraction layer. Moreover, identifying this layer involves a fair amount of experimentation and heuristics.
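The second stage of this approach, training a multinomial logistic regression model on fixed feature vectors, can be sketched as follows. This is a minimal illustration under stated assumptions: the feature vectors are synthetic 2-D points standing in for the activations of a pre-trained CNN layer, the classifier is trained with plain batch gradient descent, and all function names and hyper-parameter values are chosen for this example only.

```python
import math
import random
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def class_scores(weights, biases, x):
    """Linear score of feature vector x for every class."""
    return [sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
            for w, b in zip(weights, biases)]


def train_softmax_regression(features, labels, num_classes, epochs=400, lr=0.3):
    """Batch gradient descent on the multinomial logistic (cross-entropy) loss."""
    dim, n = len(features[0]), len(features)
    weights = [[0.0] * dim for _ in range(num_classes)]
    biases = [0.0] * num_classes
    for _ in range(epochs):
        grad_w = [[0.0] * dim for _ in range(num_classes)]
        grad_b = [0.0] * num_classes
        for x, y in zip(features, labels):
            probs = softmax(class_scores(weights, biases, x))
            for c in range(num_classes):
                err = probs[c] - (1.0 if c == y else 0.0)
                grad_b[c] += err
                for j in range(dim):
                    grad_w[c][j] += err * x[j]
        for c in range(num_classes):
            biases[c] -= lr * grad_b[c] / n
            for j in range(dim):
                weights[c][j] -= lr * grad_w[c][j] / n
    return weights, biases


def predict(weights, biases, x):
    scores = class_scores(weights, biases, x)
    return max(range(len(scores)), key=lambda c: scores[c])


# Synthetic stand-in for extracted CNN features: three separated 2-D clusters.
rng = random.Random(0)
centers = [(0.0, 0.0), (3.0, 0.0), (0.0, 3.0)]
features, labels = [], []
for label, (cx, cy) in enumerate(centers):
    for _ in range(30):
        features.append([cx + rng.gauss(0, 0.3), cy + rng.gauss(0, 0.3)])
        labels.append(label)

weights, biases = train_softmax_regression(features, labels, num_classes=3)
accuracy = sum(predict(weights, biases, x) == y
               for x, y in zip(features, labels)) / len(labels)
```

Because only the small linear classifier is trained, the whole procedure runs on a CPU; in practice the feature vectors would have hundreds or thousands of dimensions and an off-the-shelf solver would replace the hand-written loop.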

2.3.2 Fine Tuning

While Fine Tuning a model, the original network architecture is modified to be compliant with our training set. The weights of the original network act as the initial weights for the model, and we then train it as usual. This allows us to use the embeddings that the original network may have learned and build on top of them. Thus, the trained model is more robust and better suited to our task. Unfortunately, the hyper parameters need to be tuned carefully in order to obtain good results. Fine Tuning also requires a considerably larger number of images than Transfer Learning, and it requires a GPU to train the network in a feasible amount of time.

2.4 Tracking

Finally, a very important piece of video analytics is tracking the detected objects and faces. Tracking is the process of locating the position of an entity across sequential frames in a video. In multi-object tracking, we are required to associate each entity detected in a frame with its location in the subsequent frames. Traditional position based algorithms fail when the detected entities are close to each other. We overcome this difficulty in our system by implementing the SORT algorithm [2]. The SORT algorithm is very fast and performs much better than position based tracking.
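To make the association step concrete, the sketch below matches existing tracks to new detections by bounding box overlap. It is a deliberate simplification and not SORT itself: SORT additionally predicts each track's position with a Kalman filter and solves the assignment with the Hungarian algorithm, whereas this example uses the last known box and a greedy assignment. The function names and the `min_iou` value are assumptions for illustration.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def associate(tracks: Dict[int, Box], detections: List[Box],
              min_iou: float = 0.3) -> Dict[int, int]:
    """Greedily map track ids to detection indices, best overlap first."""
    candidates = sorted(
        ((iou(box, det), track_id, det_index)
         for track_id, box in tracks.items()
         for det_index, det in enumerate(detections)),
        reverse=True,
    )
    matches: Dict[int, int] = {}
    used_detections = set()
    for overlap, track_id, det_index in candidates:
        if overlap < min_iou:
            break  # remaining candidates overlap even less
        if track_id in matches or det_index in used_detections:
            continue
        matches[track_id] = det_index
        used_detections.add(det_index)
    return matches


# Two people standing close together; both shift slightly in the next frame.
tracks = {1: (10, 10, 50, 100), 2: (45, 10, 85, 100)}
detections = [(48, 12, 88, 102), (13, 11, 53, 101)]
matches = associate(tracks, detections)
```

Overlap based matching keeps the two identities apart here even though the boxes are adjacent, which is the situation where purely position based matching tends to swap identities.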

3 Data Collection and Training Pipeline for Model Customization

Surveillance cameras have a fixed field of view and their orientations largely remain unchanged. Thus, in order to train highly accurate models, we obtain videos directly from the deployment locations. Videos, however, largely contain redundant data. Since each frame needs to be labeled by a human labeler, labeling every frame would needlessly increase the labeling cost. To tackle this issue, we collect a set of diverse frames by summarizing the video using submodular functions [33, 35, 30]. This drastically reduces the number of images that need to be annotated in order to train the model, thus decreasing the overall turnaround time without compromising on the accuracy of the model.

3.1 Removing Redundant Frames via Diversity Models

Given a set $V = \{1, 2, \dots, n\}$ of items, which we also call the Ground Set, define a utility function (set function) $f: 2^V \rightarrow \mathbb{R}$, which measures how good a subset $A \subseteq V$ is. In our case, the ground set comprises frames from the video sampled at a particular rate (say 1 frame per second). A special class of set functions, called submodular functions, forms a very natural model for diversity. Submodular functions exhibit a property that intuitively formalizes the idea of “diminishing returns”. That is, adding some instance $x$ to the set $A$ provides at least as much gain in terms of the target function as adding $x$ to a larger set $B$, where $A \subseteq B$. Informally, since $B$ is a superset of $A$ and already contains more information, adding $x$ will not help as much. For more examples of submodular functions, see [33, 35, 30].
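Written out, the diminishing returns property reads as follows. The symbols are our notation for this statement (ground set $V$, set function $f$, subsets $A \subseteq B$ and an element $x$ outside $B$). The second line gives facility location as one commonly used submodular diversity model, with $s_{ij}$ an assumed pairwise similarity between frames $i$ and $j$; it is shown as an example and is not necessarily the function used in our pipeline.

```latex
f(A \cup \{x\}) - f(A) \;\ge\; f(B \cup \{x\}) - f(B),
\qquad A \subseteq B \subseteq V,\quad x \in V \setminus B

f_{\mathrm{FL}}(A) \;=\; \sum_{i \in V} \max_{j \in A} s_{ij}
```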

Given a budget $k$ (say, 500 frames from the video), we would like to choose the most diverse frames, based on a diversity model $f$, so as to ensure the best coverage of the entire video. Consider a simple greedy algorithm, which starts with $A = \emptyset$ and iteratively adds an element $x \notin A$ which maximizes the gain $f(A \cup \{x\}) - f(A)$. We stop the greedy algorithm when the budget constraint is satisfied, i.e. when $|A| = k$. One can show that this is a near optimal solution to the problem of maximizing the diversity model subject to a budget constraint: for a monotone submodular $f$, the greedy solution achieves at least a $1 - 1/e$ fraction of the optimal value. Given several videos from deployment locations, we summarize these videos to extract the most diverse frames, and then label the frames.
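The greedy procedure can be sketched as below. The diversity model here is a facility location function over a pairwise similarity matrix, which is an assumption made for this example (one common submodular choice, not necessarily the model used in our pipeline); the one-dimensional frame descriptors and the function names are likewise illustrative.

```python
from typing import Callable, List, Sequence, Set


def facility_location(similarity: Sequence[Sequence[float]]) -> Callable[[Set[int]], float]:
    """f(A) = sum over all frames i of the best similarity to a chosen frame in A."""
    def f(subset: Set[int]) -> float:
        if not subset:
            return 0.0
        return sum(max(row[j] for j in subset) for row in similarity)
    return f


def greedy_select(n: int, f: Callable[[Set[int]], float], budget: int) -> List[int]:
    """Repeatedly add the frame with the largest marginal gain f(A + x) - f(A)."""
    selected: List[int] = []
    chosen: Set[int] = set()
    current_value = f(chosen)
    while len(selected) < min(budget, n):
        best_gain, best_item = None, None
        for x in range(n):
            if x in chosen:
                continue
            gain = f(chosen | {x}) - current_value
            if best_gain is None or gain > best_gain:
                best_gain, best_item = gain, x
        selected.append(best_item)
        chosen.add(best_item)
        current_value += best_gain
    return selected


# Toy frame descriptors: three near-duplicates, two near-duplicates, one outlier.
descriptors = [0.0, 0.1, 0.2, 5.0, 5.1, 9.0]
similarity = [[1.0 / (1.0 + abs(a - b)) for b in descriptors] for a in descriptors]
selected = greedy_select(len(descriptors), facility_location(similarity), budget=3)
```

With a budget of three, the selection contains one frame from each group of similar frames, so near-duplicate frames are not sent for labeling. This naive version re-evaluates $f$ for every candidate; faster variants exist but are omitted here.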

3.2 Data Augmentation for Generalization

Image classification tasks often suffer from a lack of sufficient labeled data, which is the most important commodity in solving a supervised learning problem. This hinders the use of such models on a wide variety of real world problems. In addition, a model trained on a small training set is prone to over-fitting and will not generalize well.

As a solution to this, the idea of data augmentation is put forward, which in essence is to create more training samples from the information which is already present in the training set. Hence, by generating a larger training set, we counter the problem of over-fitting and also help the model to generalize better. Early implementations of successful data augmentation techniques can be seen on the MNIST dataset [17].

Traditionally, the techniques which are applied as a part of data augmentation include rotation, flipping, shearing and changing the color of the image. These affine transformations follow the format $y = Wx + b$, where $x$ is the representation of the original image, $W$ and $b$ are transformation parameters and $y$ is the representation of the transformed image.

In our approach, we implement an augmentation pipeline which performs these affine transformations on images by randomly choosing the transformation parameter values, hence creating a diverse set of new images. This pipeline generates a training set with an equal number of samples for each class, where the per-class count is the average number of samples per class before augmentation. To avoid adding extra noise to the dataset, transformation parameters are drawn only from a restricted range; for example, the image rotation may not exceed 10 degrees.
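A sketch of the balancing logic is given below. It only builds a plan of which source image to transform with which parameters; applying the transformation to pixels is left to an image library. The handling of classes that already exceed the target (random subsampling), the choice of rotation and horizontal flip as the only parameters, and all names are assumptions made for this example.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

Sample = Tuple[str, str]                   # (image_id, class_label)
Planned = Tuple[str, str, float, bool]     # (image_id, label, rotation_deg, flip)


def balanced_augmentation_plan(samples: List[Sample],
                               max_rotation_deg: float = 10.0,
                               seed: int = 0) -> List[Planned]:
    """Plan a training set with the same number of samples for every class."""
    rng = random.Random(seed)
    by_class: Dict[str, List[str]] = defaultdict(list)
    for image_id, label in samples:
        by_class[label].append(image_id)

    # Target size per class: average class size before augmentation.
    target = round(len(samples) / len(by_class))

    plan: List[Planned] = []
    for label, image_ids in by_class.items():
        if len(image_ids) >= target:
            # Assumption: over-represented classes are randomly subsampled.
            kept = rng.sample(image_ids, target)
            plan.extend((image_id, label, 0.0, False) for image_id in kept)
            continue
        # Keep every original, then add randomly transformed copies.
        plan.extend((image_id, label, 0.0, False) for image_id in image_ids)
        for _ in range(target - len(image_ids)):
            source = rng.choice(image_ids)
            angle = rng.uniform(-max_rotation_deg, max_rotation_deg)
            flip = rng.random() < 0.5
            plan.append((source, label, angle, flip))
    return plan


# Hypothetical imbalanced dataset: 6 coats, 3 shirts, 3 suits (average 4).
samples = ([(f"coat_{i}", "coat") for i in range(6)]
           + [(f"shirt_{i}", "shirt") for i in range(3)]
           + [(f"suit_{i}", "suit") for i in range(3)])
plan = balanced_augmentation_plan(samples)
per_class = {label: sum(1 for p in plan if p[1] == label)
             for label in ("coat", "shirt", "suit")}
```

Restricting `max_rotation_deg` to a small value mirrors the bounded parameter ranges described above, so that augmented images stay close to what the camera would actually capture.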

4 Video Analytics Results

This section goes over the results for the different video analytics building blocks discussed above in various customer deployments. Due to shortage of space, we focus mainly on results for object detection, face recognition, and human and face sub-attribute recognition. The pattern of the results, however, holds for the other analytics not discussed in this paper as well.

4.1 Object Detection

In this paper, we use YOLO for illustrating the importance of custom training such networks, by providing a head-to-head comparison between off-the-shelf models trained on generic datasets and models trained on custom datasets. Also, we show that there is not a huge gap in performance between custom trained YOLO and Tiny YOLO models from a deployment perspective by providing a similar head-to-head comparison.

These state-of-the-art networks are extensively used to solve object detection problems in various scenarios like counting students in a classroom, detecting vehicles running on the highway, etc. These problems get more complicated when subcategories need to be identified, such as boy/girl student in the first case and vehicle make/model/color in the second. Generally, these models are trained on ImageNet [29], PASCAL VOC [5], Microsoft COCO [20] or a similar standard dataset, and the pre-trained models give good results when it comes to detecting objects in generic scenarios. Huang et al. [13] provide a detailed comparison of these networks trained on the COCO dataset. However, they might not work well in all real world scenarios, because the datasets used for training these models are very generic.

Below we describe four datasets that have been used throughout our experiments.

  1. Classroom Dataset (736 images): This dataset consists of classroom images with varied seating arrangements, surroundings, class strength, etc. The main customer requirements for this deployment were getting accurate student counts, detecting whether class has started or not, and uniform compliance.

  2. Community Center Dataset (5336 images): This dataset consists of indoor and outdoor images of a community place, encompassing dense and sparse crowds of different age groups, thereby providing aggregated data for detecting the person class. The customer requirement was to get various statistics including counts, age/gender distribution, heat-map/flow-map etc.

  3. Traffic Dataset (999 images): This dataset showcases running roads and highways consisting of various vehicles (car, bus, truck, bicycle, motorbike, three-wheelers), with different perspectives and densities.

  4. Multi-Scenario Surveillance Dataset (8191 images): The Multi-Scenario Surveillance dataset is a blend of data from several customers, including the ones above. This dataset consists of the most common classes seen in surveillance videos, including persons, cars, buses, trucks, motorbikes etc. This dataset is similar to the PASCAL VOC dataset, except that it is focused on data from surveillance videos, instead of images downloaded from the Internet.

Table 1 illustrates the performance of YOLO and Tiny YOLO models trained on the PASCAL VOC dataset, Multi-Scenario Surveillance dataset, and the above mentioned custom datasets in terms of their Mean Average Precision (mAP) at an Intersection over Union threshold (IoU) of 0.5, along with their inference times in milliseconds.

Target Dataset                Train     YOLO                             Tiny YOLO
                                        mAP     Inference Time (in ms)   mAP     Inference Time (in ms)
Classroom                     VOC       9.04    53      371              4.6     11      60
                              MSS       58.79   43      310              36.7    14      98
                              Custom    66.16   43      314              54.3    14      96
Community Center              VOC       40.9    34      276              26.8    10      80
                              MSS       75.2    26      343              59.8    11      60
                              Custom    80.72   25      343              70.36   12      61
Traffic                       VOC       18.29   14      271              16.61   5       45
                              MSS       59.1    12      110              40.10   5       30
                              Custom    71.45   10      106              72.54   5       31
Multi-Scenario Surveillance   VOC       9.10    55      410              4.28    16      84
                              Custom    47.22   51      334              32.47   6       27
Table 1: Comparison between models trained on PASCAL VOC, Multi-Scenario Surveillance (MSS) and Customized Datasets
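The IoU threshold of 0.5 used for the mAP values in Table 1 decides whether a detection counts as correct. The sketch below shows that matching step only, in the style of the PASCAL VOC protocol: detections are visited in order of decreasing confidence, and each ground-truth box can be claimed once. Computing mAP additionally requires accumulating precision over recall levels and averaging over classes, which is omitted here; the function names and example boxes are assumptions for illustration.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def match_detections(detections: List[Tuple[Box, float]],
                     ground_truth: List[Box],
                     iou_threshold: float = 0.5) -> Tuple[int, int, int]:
    """Return (true positives, false positives, false negatives) for one class."""
    claimed = [False] * len(ground_truth)
    tp = fp = 0
    for box, _score in sorted(detections, key=lambda d: d[1], reverse=True):
        overlaps = [iou(box, gt) for gt in ground_truth]
        best = max(range(len(overlaps)), key=lambda i: overlaps[i], default=None)
        if best is not None and overlaps[best] >= iou_threshold and not claimed[best]:
            claimed[best] = True
            tp += 1
        else:
            fp += 1  # poor overlap, or the ground-truth box is already claimed
    fn = claimed.count(False)
    return tp, fp, fn


ground_truth = [(0, 0, 10, 10), (20, 20, 30, 30)]
detections = [((1, 1, 10, 10), 0.9),      # IoU 0.81 with the first box
              ((21, 21, 35, 35), 0.8),    # IoU below 0.5 with the second box
              ((50, 50, 60, 60), 0.7)]    # no overlap at all
tp, fp, fn = match_detections(detections, ground_truth)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
```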

The following are the main takeaways from the results:

  1. In all cases, we see that the customized models perform better on held-out test sets compared to the off-the-shelf PASCAL VOC and Multi-Scenario Surveillance models, even though they are trained only with a fraction of the data.

  2. As expected, models trained on the Multi-Scenario Surveillance dataset perform much better than models trained on the PASCAL VOC dataset, since the Multi-Scenario Surveillance dataset consists of surveillance images, rather than images downloaded from the Internet.

  3. Even though a custom trained YOLO model performs the best, the CPU latency is too high to run it in real-time. On the other hand, a custom trained Tiny YOLO model performs well from an accuracy perspective and yet takes about one fifth the time for inference compared to YOLO (on CPU).
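The latency figures translate into frame rates as follows. This worked example assumes that the second timing column of Table 1 is the CPU latency (the flattened table does not label the two timing columns, so this reading is our assumption) and uses the custom-trained rows only.

```python
# (YOLO ms, Tiny YOLO ms) for custom-trained models, taken from the second
# timing column of Table 1 (assumed here to be CPU latency).
cpu_latency_ms = {
    "Classroom": (314, 96),
    "Community Center": (343, 61),
    "Traffic": (106, 31),
    "Multi-Scenario Surveillance": (334, 27),
}


def frames_per_second(latency_ms: float) -> float:
    """Frames processed per second when each frame takes latency_ms."""
    return 1000.0 / latency_ms


summary = {
    name: {
        "yolo_fps": round(frames_per_second(yolo_ms), 1),
        "tiny_yolo_fps": round(frames_per_second(tiny_ms), 1),
        "speedup": round(yolo_ms / tiny_ms, 1),
    }
    for name, (yolo_ms, tiny_ms) in cpu_latency_ms.items()
}

for name, row in summary.items():
    print(f"{name}: YOLO {row['yolo_fps']} FPS, "
          f"Tiny YOLO {row['tiny_yolo_fps']} FPS, {row['speedup']}x faster")
```

Under this reading, custom-trained YOLO runs at roughly 3 to 9 FPS, while Tiny YOLO reaches roughly 10 to 37 FPS, a speedup of about 3x to 12x depending on the dataset.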

4.2 Face Detection and Recognition

Face detection and recognition in surveillance videos has long been of utmost importance. However, in today's world we are required to identify many more fine-grained attributes, such as age and gender, from a person’s face. To be able to recognize these sub-attributes, it is necessary to obtain discriminative features from a face in order to classify it. For this purpose, we use a CNN for extracting facial features by passing the image of a detected face through the model.

4.2.1 Face Detection

Table 2 compares the face detection precision and inference times of Haar cascades, NPD and the ResNet-SSD detector. The ResNet-SSD model wins in terms of both speed and accuracy on the FERET dataset. The timings for NPD and Haar cascades are obtained without tuning any hyper parameters such as the minimum and maximum face size. Though tuning these parameters may decrease the inference time (and make it comparable to that of SSD), they are very difficult to calibrate, as the right values vary with the circumstances of each deployment.

Detection Algorithm          Precision (in %)    Average Inference Time (in ms)
Viola-Jones Haar Cascade     53.39               114.94
NPD Detector                 73.03               148.61
ResNet-SSD                   97.81               60.29
Table 2: Detection Accuracy and Average Inference Times on the FERET Dataset

4.2.2 Face Recognition

In order to perform face recognition on a detected face, we first need to train a model on the set of faces which may be detected. As we would only have a few images of every distinct person, we chose the Transfer Learning approach, elaborated in Section 2.3.1, to train our face recognition models. We extracted features from multiple pre-trained CNNs, namely Deep Face Recognition [22], DLib-ResNet [14] and OpenFace [1], and compared the accuracy of the resulting models on the FERET [23] and Community Center datasets. The results are shown in Table 3.

Dataset             CNN Model Used           Recognition Accuracy (in %)
FERET               Deep Face Recognition    95
                    DLib-ResNet              99.76
                    OpenFace                 77.52
Community Center    Deep Face Recognition    92.55
                    DLib-ResNet              92.62
                    OpenFace                 68.70
Table 3: Face Recognition Accuracies on the FERET [23] and Community Center Datasets

4.3 Human Sub-Attribute

For our experimentation, we classify the detected people on the basis of four sub-attributes: Full Body Age, Full Body Gender, Upper Body Apparel Type, and Upper Body Apparel Color.

We use the aforementioned techniques of transfer learning and fine-tuning to establish a baseline using out-of-the-box models trained on the PETA dataset. We progressively demonstrate the use and results of techniques such as data augmentation, transfer learning and fine-tuning on customized models in the following sections.

4.3.1 Off-the-shelf model

For our baseline, we use transfer learning and fine-tuning on AlexNet and GoogLeNet, trained on the PETA dataset [4]. The PETA dataset contains a wide variety of generic surveillance cases. We modify the dataset to only use the classes which were the most common. The model accuracy is evaluated over a held-out test set. The results obtained are given in Table 4.

Training Approach    CNN Model    Age     Gender    Apparel Type    Apparel Color
Transfer Learning    AlexNet      67.2    75.1      59.8            53
                     GoogLeNet    65.8    76.7      64.1            52.5
Fine Tuning          AlexNet      77.4    83.8      63              65.24
Table 4: Accuracy Results for Human Sub-Attribute Recognition (in %)

For simplicity, we only look at AlexNet and GoogLeNet models. From Table 4, we observe that the fine-tuned AlexNet performs best on three of the four attributes (age, gender and apparel color), while the transfer learned GoogLeNet is marginally better on apparel type. Since the generic dataset is quite big (consisting of 10-20k images per class), it is expected that the fine-tuned model works better than transfer learning, as transfer learning essentially learns only the last layer, whereas fine-tuning fits the entire CNN to the data.

Figure 2: Test Accuracies (in %) on the Scraped Image Set.

4.3.2 Data Augmentation

The first enhancement we can make to this off-the-shelf model is to augment the training set in each case. The motivation behind augmentation is that we want each class to be equally represented in the dataset for any given task. So, using the data augmentation techniques mentioned in Section 3.2, we create a new training set from the training split of the PETA dataset. Since the original test set was created as a 20% fraction of the dataset, there is a bias towards certain classes in the test set, whereas the augmented training set has the same sample size for each class. Thus, in order to avoid distribution bias in our evaluation, we create a separate test set, called the ’Scraped Image Set’, which contains images scraped from the web and labeled manually. From Figure 2, we can see that a model trained on an augmented training set generalizes better than one trained without augmentation. The lack of difference in the age classification case can be explained by the fact that the original dataset has a nearly equal distribution amongst its classes, so augmentation did not make a significant difference to the training set.

4.3.3 Model Customization

In this section, we compare a custom trained model against the off-the-shelf models generated in the above section. Towards this end, we perform transfer learning and fine-tuning on a dataset consisting of images from the deployment scenario. This mitigates the problem of attributes that matter in the deployment being absent from generic datasets. For example, generic datasets such as the PETA dataset contain classes for clothing such as Coat, Suit, Shirt, etc., which would be a problem if the model is deployed in a place with different clothing norms, such as in Asia.

For our experiments, we used the Community Center dataset described in Section 4.1. We compare the accuracies of the off-the-shelf model against customized models on the test set of the Community Center dataset. For ease of experimentation, we only perform transfer learning and fine-tuning on AlexNet. The results can be seen in Figure 3. We see that both the transfer-learned and the fine-tuned customized models perform better than the generic one.

Figure 3: Accuracies of models trained on the Community Center dataset

The following are the insights from the results:

  1. Generally, fine-tuning a model works better than transfer learning. However, if the training dataset is small, fine-tuned models tend to over-fit and need more careful hyper-parameter tuning. This is costly both in terms of resources used and time spent. As can be seen from Figure 3, the performance of transfer learning is just as good as, if not better than, that of fine-tuning. Transfer learning is also much faster to perform and works better with less data. Hence, in a custom deployment scenario, where procuring a large custom dataset might be difficult, the use of transfer learning on custom datasets is recommended.

  2. In cases where multiple sub-attributes need to be recognized, loading fine-tuned models is resource-intensive, since multiple full models must be held in memory. In such scenarios, transfer learning is beneficial, as a single base CNN can be used as a feature extractor for multiple sub-attribute models.

  3. Data augmentation can be used not only to increase the size of the training set, but also to increase the generalizability of the models. Due to space constraints, the comparison between augmented and non-augmented custom trained models could not be shown, but we observe similar patterns.

  4. Even with a relatively small amount of data, performing transfer learning on a custom model is better than using a generic model. So, a customized model should be used over a generic model whenever possible.
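Insight 2 can be illustrated with a small sketch: one frozen base network computes features once per image, and several lightweight classifier heads reuse them. Everything here is an assumption made for illustration: `base_features` is a hypothetical stand-in for a forward pass through a pretrained CNN, and the head names, labels, and weights are invented rather than taken from the deployed system.

```python
import math

# Counts how often the (expensive) base network is evaluated.
calls = {"base": 0}


def base_features(image):
    """Hypothetical frozen base CNN mapping an image to a feature vector."""
    calls["base"] += 1
    return [sum(image) / len(image), max(image), min(image)]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def head_predict(head, features):
    """Apply one linear softmax head (one weight row and bias per class)."""
    scores = [
        sum(w * f for w, f in zip(row, features)) + b
        for row, b in zip(head["weights"], head["bias"])
    ]
    probs = softmax(scores)
    return head["labels"][probs.index(max(probs))]


# Illustrative heads; in practice each would be trained on custom data.
heads = {
    "gender": {"labels": ["male", "female"],
               "weights": [[1, 0, 0], [0, 1, 0]], "bias": [0, 0]},
    "age": {"labels": ["young", "adult", "senior"],
            "weights": [[1, 0, 0], [0, 0, 1], [0, 1, 0]], "bias": [0, 0, 0]},
    "clothing": {"labels": ["shirt", "coat"],
                 "weights": [[0, 0, 1], [1, 0, 0]], "bias": [0.5, 0]},
}


def predict_all(image):
    """Run the base network once and reuse its features for every head."""
    feats = base_features(image)
    return {name: head_predict(head, feats) for name, head in heads.items()}


predictions = predict_all([0.2, 0.8, 0.5])
print(predictions, "base forward passes:", calls["base"])
```

The design point is that adding another sub-attribute only adds one small weight matrix, whereas a fine-tuned model per attribute would require loading and running a full CNN for each.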

4.4 Face Sub-Attribute

Recognizing fine-grained attributes such as a person’s age or gender from their face has become an important task in recent times. To tackle this problem, we use our Transfer Learning approach to train a multinomial logistic regression model on the localized faces of people. Similar to the human sub-attribute recognition, we compare customized transfer learned models against off-the-shelf models for age and gender. As off-the-shelf models, we directly use the Deep Expectation (DEX) models [28] and the AgeNet and GenderNet models by Levi and Hassner [18]. Both sets of models have been trained to recognize the age and gender of a detected face. For customization, we use transfer learning on the Deep Face Recognition model [22], along with AgeNet, GenderNet, DEX-Age and DEX-Gender. All the results are obtained on the Community Center dataset.

In Table 5, we observe that the off-the-shelf models generalize poorly on the Community Center dataset, since these models are not customized for faces detected from surveillance cameras. Using the same Transfer Learning approach mentioned in the human sub-attribute case, we train multinomial logistic regression models on the different base CNNs discussed above.
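Training a multinomial logistic regression model on fixed embeddings can be sketched as below. This is a minimal illustration under stated assumptions, not the training code used for Table 5: the 2-D "embeddings" are toy values standing in for features from a frozen base CNN, the class ids 0, 1, 2 are hypothetical, and plain batch gradient descent is used for simplicity.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def train_softmax_regression(X, y, num_classes, lr=0.1, epochs=300):
    """Fit weights W and biases b by batch gradient descent on cross-entropy."""
    dim = len(X[0])
    n = len(X)
    W = [[0.0] * dim for _ in range(num_classes)]
    b = [0.0] * num_classes
    for _ in range(epochs):
        grad_W = [[0.0] * dim for _ in range(num_classes)]
        grad_b = [0.0] * num_classes
        for x, label in zip(X, y):
            scores = [sum(w * v for w, v in zip(W[k], x)) + b[k]
                      for k in range(num_classes)]
            probs = softmax(scores)
            for k in range(num_classes):
                # Gradient of cross-entropy w.r.t. score k: p_k - 1[k == label].
                err = probs[k] - (1.0 if k == label else 0.0)
                grad_b[k] += err / n
                for j in range(dim):
                    grad_W[k][j] += err * x[j] / n
        for k in range(num_classes):
            b[k] -= lr * grad_b[k]
            for j in range(dim):
                W[k][j] -= lr * grad_W[k][j]
    return W, b


def predict(W, b, x):
    """Return the class id with the highest score."""
    scores = [sum(w * v for w, v in zip(row, x)) + bias
              for row, bias in zip(W, b)]
    return scores.index(max(scores))


# Toy embeddings: three well-separated clusters, one per class.
X = [[2.0, 0.0], [2.2, 0.1], [1.8, -0.1],
     [-1.0, 2.0], [-1.1, 2.2], [-0.9, 1.8],
     [-1.0, -2.0], [-1.2, -1.9], [-0.8, -2.1]]
y = [0, 0, 0, 1, 1, 1, 2, 2, 2]

W, b = train_softmax_regression(X, y, num_classes=3)
accuracy = sum(predict(W, b, x) == t for x, t in zip(X, y)) / len(X)
print("training accuracy:", accuracy)
```

Because the base CNN is frozen, only `W` and `b` are learned, which is why this step is fast and needs comparatively little custom data.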

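Category             CNN Model               Classification Accuracy (in %)
Age, Generic         DEX-Age                 28.03
Age, Generic         AgeNet                  26.43
Age, Customized      DEX-Age                 73.57
Age, Customized      DEX-Gender              68.11
Age, Customized      AgeNet                  63.86
Age, Customized      GenderNet               62.09
Age, Customized      Deep Face Recognition   91.00
Gender, Generic      DEX-Gender              72.08
Gender, Generic      GenderNet               75.00
Gender, Customized   DEX-Age                 91.24
Gender, Customized   DEX-Gender              92.52
Gender, Customized   AgeNet                  72.08
Gender, Customized   GenderNet               83.21
Gender, Customized   Deep Face Recognition   97.26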
Table 5: Off-the-shelf and Customized Age and Gender Classification Results on the Community Center Dataset

Firstly, we observe that the transfer learned models have considerably superior accuracies for both the age and gender recognition problems. The Deep Face Recognition model performs the best on both tasks. It is worth noting that the Deep Face Recognition model learns embeddings to distinguish between people. Thus, it provides a better embedding representation than the other models for the age and gender recognition tasks. Moreover, we should also note that the age models perform better on age recognition (compared to their gender counterparts) and the gender models perform better on gender recognition (compared to their age counterparts). This also matches our intuitions.
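To make the size of the improvement concrete, the short sketch below computes the absolute accuracy gain of each customized model over the off-the-shelf model built on the same base CNN. The numbers are copied from Table 5; pairing rows by base CNN is our reading of the table, and the variable names are illustrative.

```python
# Accuracies (in %) from Table 5, keyed by (task, base CNN).
generic = {
    ("age", "DEX-Age"): 28.03,
    ("age", "AgeNet"): 26.43,
    ("gender", "DEX-Gender"): 72.08,
    ("gender", "GenderNet"): 75.00,
}
customized = {
    ("age", "DEX-Age"): 73.57,
    ("age", "AgeNet"): 63.86,
    ("gender", "DEX-Gender"): 92.52,
    ("gender", "GenderNet"): 83.21,
}

# Absolute gain in percentage points from customization, per task and base CNN.
gains = {key: round(customized[key] - generic[key], 2) for key in generic}

for (task, model), gain in gains.items():
    print(f"{task:6s} {model:10s} +{gain:.2f} points")
```

Under this pairing, the gains range from 8.21 points (GenderNet on gender) to 45.54 points (DEX-Age on age), with the age task benefiting most from customization.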

5 Conclusions and Lessons learned

This paper provides an overview of what it takes to build an end-to-end video analytics system which is capable of performing deep learning in real-time on a CPU. We go over the data collection tricks and training pipelines for fast experimental turnaround. We also share the significant amount of practical experience we have gained by deploying models at customer locations, including where to use data augmentation, the effectiveness of transfer learning vs. fine-tuning, and the amount of data required for custom training.
The major takeaways from this paper are as follows:

  • Deep Learning does not necessarily require GPU cloud servers. It is possible to get high accuracy using on-premise CPU deployments.

  • In our experiments, customization consistently gave better results than off-the-shelf models.

  • Customization provides superior accuracies compared to off-the-shelf models even with a fraction of the training dataset.

  • Tricks such as data summarization and data augmentation in our customized training pipeline ensure we obtain high accuracy with a smaller training set.


References

  • Baltrušaitis et al. [2016] T. Baltrušaitis, P. Robinson, and L.-P. Morency. Openface: an open source facial behavior analysis toolkit. In IEEE Winter Conference on Applications of Computer Vision, 2016.
  • Bewley et al. [2016] A. Bewley, Z. Ge, L. Ott, F. Ramos, and B. Upcroft. Simple online and realtime tracking. In 2016 IEEE International Conference on Image Processing (ICIP), pages 3464–3468, 2016. doi: 10.1109/ICIP.2016.7533003.
  • Chi et al. [2017] L. Chi, H. Zhang, and M. Chen. End-to-end face detection and recognition. arXiv preprint arXiv:1703.10818, 2017.
  • Deng et al. [2014] Y. Deng, P. Luo, C. C. Loy, and X. Tang. Pedestrian attribute recognition at far distance. In Proceedings of the 22nd ACM International Conference on Multimedia, MM ’14, pages 789–792, 2014. ISBN 978-1-4503-3063-3.
  • Everingham et al. [2010] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 88(2):303–338, June 2010.
  • Girshick [2015] R. Girshick. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440–1448, 2015.
  • Girshick et al. [2014] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 580–587, 2014.
  • Gong et al. [2011] S. Gong, C. C. Loy, and T. Xiang. Security and surveillance. In Visual Analysis of Humans, pages 455–472. Springer, 2011.
  • Gouaillier and Fleurant [2009] V. Gouaillier and A. Fleurant. Intelligent video surveillance: Promises and challenges. Technological and commercial intelligence report, CRIM and Technôpole Defence and Security, 456:468, 2009.
  • He et al. [2015] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
  • Howard et al. [2017] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
  • Huang et al. [2007] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, October 2007.
  • Huang et al. [2016] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, and K. Murphy. Speed/accuracy trade-offs for modern convolutional object detectors. CoRR, abs/1611.10012, 2016.
  • King [2009] D. E. King. Dlib-ml: A machine learning toolkit. Journal of Machine Learning Research, 10:1755–1758, 2009.
  • Krizhevsky et al. [2012] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1097–1105. 2012.
  • Learned-Miller et al. [2016] E. Learned-Miller, G. B. Huang, A. RoyChowdhury, H. Li, and G. Hua. Labeled Faces in the Wild: A Survey. 2016.
  • LeCun and Cortes [2010] Y. LeCun and C. Cortes. MNIST handwritten digit database. 2010.
  • Levi and Hassner [2015] G. Levi and T. Hassner. Age and gender classification using convolutional neural networks. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR) workshops, June 2015.
  • Liao et al. [2014] S. Liao, A. K. Jain, and S. Z. Li. A fast and accurate unconstrained face detector. CoRR, abs/1408.1656, 2014.
  • Lin et al. [2014] T. Lin et al. Microsoft COCO: common objects in context. CoRR, abs/1405.0312, 2014.
  • Liu et al. [2016] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. Ssd: Single shot multibox detector. 2016.
  • Parkhi et al. [2015] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. In British Machine Vision Conference, 2015.
  • Phillips et al. [2000] P. J. Phillips, H. Moon, S. A. Rizvi, and P. J. Rauss. The feret evaluation methodology for face-recognition algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(10):1090–1104, 2000.
  • Rastegari et al. [2016] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. In European Conference on Computer Vision, pages 525–542. Springer, 2016.
  • Redmon and Farhadi [2018] J. Redmon and A. Farhadi. Yolov3: An incremental improvement. arXiv, 2018.
  • Redmon et al. [2016] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 779–788, 2016.
  • Ren et al. [2015] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
  • Rothe et al. [2016] R. Rothe, R. Timofte, and L. V. Gool. Deep expectation of real and apparent age from a single image without facial landmarks. International Journal of Computer Vision (IJCV), July 2016.
  • Russakovsky et al. [2015] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015. doi: 10.1007/s11263-015-0816-y.
  • Sahoo et al. [2017] A. Sahoo, V. Kaushal, K. Doctor, S. Shetty, R. Iyer, and G. Ramakrishnan. A unified multi-faceted video summarization system. arXiv preprint arXiv:1704.01466, 2017.
  • Sobral and Vacavant [2014] A. Sobral and A. Vacavant. A comprehensive review of background subtraction algorithms evaluated with synthetic and real videos. Computer Vision and Image Understanding, 122:4–21, 2014.
  • Taigman et al. [2014] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. Deepface: Closing the gap to human-level performance in face verification. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, pages 1701–1708, 2014.
  • Tschiatschek et al. [2014] S. Tschiatschek, R. K. Iyer, H. Wei, and J. A. Bilmes. Learning mixtures of submodular functions for image collection summarization. In Advances in Neural Information Processing Systems 27, pages 1413–1421. 2014.
  • Viola and Jones [2001] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001, volume 1, pages 511–518, 2001. doi: 10.1109/CVPR.2001.990517.
  • Wei et al. [2015] K. Wei, R. Iyer, and J. Bilmes. Submodularity in data subset selection and active learning. In International Conference on Machine Learning, pages 1954–1963, 2015.