OmniDetector: With Neural Networks to Bounding Boxes
We propose a person detector for omnidirectional images, an accurate method to generate minimal enclosing rectangles of persons. The basic idea is to adapt the detection performance of a convolutional neural network based method, namely YOLOv2, to fish-eye images. Our approach combines a state-of-the-art object detector with highly overlapping image areas and their regions of interest. This overlap reduces the number of false negatives. Starting from the raw bounding boxes of the detector, we refine overlapping bounding boxes with three approaches: non-maximum suppression, soft non-maximum suppression and soft non-maximum suppression with Gaussian smoothing. The evaluation was done on the PIROPO database, which we supplemented with bounding boxes on omnidirectional images. We achieve an average precision of 64.4 % with YOLOv2 for the class person. For this purpose we fine-tuned the soft non-maximum suppression with Gaussian smoothing.
Roman Seidel Department of Information Technology Chemnitz University of Technology Chemnitz, Germany email@example.com André Apitzsch Department of Information Technology Chemnitz University of Technology Chemnitz, Germany firstname.lastname@example.org Gangolf Hirtz Department of Information Technology Chemnitz University of Technology Chemnitz, Germany email@example.com
Preprint. Work in progress.
Convolutional neural networks (CNNs) have been used for several tasks in computer vision in recent years. Finding objects in images (i.e. object detection) is one of these tasks. A main requirement of current CNNs for detecting objects in images is accurate real-world training data. In this paper we propose a method to detect objects in fish-eye images of indoor scenes using a state-of-the-art object detector.
Object detection in indoor scenes with a limited number of image sensors can be achieved with images from omnidirectional cameras. These cameras are suited for capturing one room with a single sensor due to a field of view of about . Our goal is to detect objects in indoor scenes in omnidirectional data with a detector trained on perspective images.
Besides our application field, active assisted living, the detection of objects in omnidirectional image data can be used in mobile robots and in autonomous driving.
The remainder of this paper is structured as follows: Section 2 presents previous research activities in object detection. Section 3 illustrates the working principle of a neural network based object detector. Section 4 explains how our virtual cameras are generated. Section 5 shows the theoretical background for different variants of non-maximum suppression (NMS). Section 6 describes our experiments for the generation of bounding boxes and the evaluation of our results on common error metrics. Section 7 summarizes the paper’s content, concludes our observations and gives ideas for future work. The results of our work, the image data and the evaluation of the results, can be found at https://gitlab.com/omnidetector/omnidetector.
2 Related work
State-of-the-art object detectors predict bounding boxes on perspective images over several classes. A region-based, fully convolutional network for accurate and efficient object detection is R-FCN Dai et al. (2016). As a standard practice, the results of the detector based on the ResNet-101 architecture He et al. (2016) are post-processed with non-maximum suppression (NMS) using a threshold of 0.3 on the intersection over union (IoU) Girshick et al. (2013). The single shot multi-box detector (SSD) by Liu et al. (2016) improves the network architecture by adding extra feature layers on top of VGGNet-16, combined with the idea of using predictions from multiple feature maps with different resolutions to handle objects of various sizes. The SSD leads to competitive results on common object detection benchmark datasets, namely MS COCO Lin et al. (2014), ImageNet Russakovsky et al. (2015) and PASCAL VOC Everingham et al. (2010). The approach we follow is YOLOv2 Redmon and Farhadi (2017). It increases mean average precision (mAP) significantly through models that run at variable input sizes, multi-scale training and a joint training that allows it to predict detections for object classes without labeled detection data.
Our application is object detection, so we concentrate on datasets where labels are minimal enclosing rectangles (bounding boxes). Common real-world benchmark datasets with labeled objects on perspective images are presented in (Everingham et al., 2010; Krasin et al., 2017; Li et al., 2017; Russakovsky et al., 2015). Omnidirectional images with multiple sequences in two different indoor rooms were created in the work of del Blanco and Carballeira (2016).
In contrast to our work, the authors of Spherical CNN Cohen et al. (2018) modify the architecture of ResNet. Their goal is to build a collection of spherical layers which are rotation-equivariant and expressive.
3 Object detection
Based on an excellent mAP of 73.4% (10 classes, VOC2007test) and an average precision (AP) of 81.3% (VOC2007test) for the class person, we use the You Only Look Once (YOLO) Redmon et al. (2016) approach in its second version, called YOLOv2 Redmon and Farhadi (2017). For detecting objects in input images, YOLOv2 offers a good compromise between detection accuracy and speed. The model is trained on ImageNet Russakovsky et al. (2015) and the COCO dataset Lin et al. (2014). The approach outperforms state-of-the-art methods like Faster R-CNN Ren et al. (2015) with ResNet He et al. (2016) and SSD Liu et al. (2016), while still running significantly faster. The original YOLO predicts the coordinates of bounding boxes directly with fully connected layers added on top of the convolutional feature extractor. In YOLOv2 these fully connected layers are removed and the bounding boxes are predicted with anchor boxes. A further change to the network architecture is the elimination of a pooling layer to obtain a higher-resolution output from the convolutional layers. The input size of the network is shrunk to operate on 416 × 416 input images instead of 448 × 448. To counteract the tendency to detect objects of only a fixed size, the input size of the model is randomly re-selected during training every 10 batches. The smallest input is 320 × 320 and the largest 608 × 608.
4 Creating virtual views from an omnidirectional image
In this section we describe the transformation for generating virtual perspective views from omnidirectional image data based on Findeisen et al. (2013). We assume that the omnidirectional camera is calibrated both intrinsically and extrinsically.
The camera model describes how the coordinates of a 3D scene point are transformed into the coordinates of a 2D image point. We concentrate on the central camera model, i.e. all light rays, originating from the scene points, travel through a single point in space, called the single effective viewpoint. For the transformation between the omnidirectional and the perspective images a mathematical description is necessary for both camera models.
4.1 Perspective camera model
The perspective camera model uses the pinhole camera model as an approximation. The perspective projection of the spatial coordinates given in the camera coordinate system is stated and in normalized image coordinates . After applying an affine transformation it is possible to get pixel coordinates . For the linear mapping between the source and target camera model we use homogeneous coordinates (denoted as ). The relation between and is given by
where K is the upper-triangular calibration matrix containing the camera intrinsics:
As shown in (2) the five intrinsic parameters of a pinhole camera are: the scale factors in x- and y-direction, the skewness factor and the principal point of the image.
In general a scene point is modeled in a world coordinate system, which is different from the camera coordinate system. The orientation between these coordinate systems consists of two parts, namely a rotation R and a translation t (or equivalently t = −RC, where C is the camera center).
The relationship between the scene point in the world coordinate system and an image point in the image coordinate system is given by
where is a homogeneous matrix, called the camera projection matrix Hartley and Zisserman (2006). The matrix contains the parameters of the extrinsic and intrinsic calibration with
There are several approaches to extend the camera model defined above with a description of lens imperfections. As long as our target virtual camera is perfectly perspective and free of lens distortions, we do not discuss this issue.
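For reference, the relations described in this section take the following form in the standard pinhole notation of Hartley and Zisserman. Only K, R and t appear by name in the surrounding text; the symbols for the homogeneous pixel, normalized and world coordinates, the entries of K, the projection matrix and the camera center are assumptions chosen for illustration and need not match the authors' notation.

```latex
\begin{align*}
\tilde{\mathbf{u}} &= \mathbf{K}\,\tilde{\mathbf{x}}
  && \text{(normalized image coordinates to pixel coordinates)} \\
\mathbf{K} &=
  \begin{pmatrix}
    f_x & s   & c_x \\
    0   & f_y & c_y \\
    0   & 0   & 1
  \end{pmatrix}
  && \text{(scale factors } f_x, f_y \text{, skew } s \text{, principal point } (c_x, c_y)\text{)} \\
\tilde{\mathbf{u}} &\simeq \mathbf{P}\,\tilde{\mathbf{X}}
  && \text{(world point to image point, up to scale)} \\
\mathbf{P} &= \mathbf{K}\,[\,\mathbf{R} \mid \mathbf{t}\,],
  \qquad \mathbf{t} = -\mathbf{R}\,\mathbf{C}
  && \text{(} 3 \times 4 \text{ camera projection matrix)}
\end{align*}
```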
4.2 Omnidirectional to perspective image mapping
Because it is mathematically impossible to transform the whole omnidirectional image into one perspective image, we transform a region of 2D image points from the omnidirectional into the perspective view. We determine the perspective images through n virtual perspective cameras Cam0, Cam1, …, Camn, which are described by their extrinsic parameters R and t (6 degrees of freedom (DOF)) and intrinsic parameters (5 DOF). Instead of determining the parameters of the perspective camera through a calibration, we model the virtual camera and determine the extrinsic parameters (R and t) empirically.
To create the virtual perspective views, we change the extrinsic camera parameter R by varying the Euler angles of the rotations about the x-, y- and z-axes. More specifically, we rotate about the x-axis and the z-axis.
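The rotation of a virtual camera from the two angles can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the composition order (first the rotation about the x-axis, then the rotation about the z-axis) is an assumption, since the text does not state it.

```python
import math


def rot_x(angle_rad):
    """Rotation matrix about the x-axis (elevation)."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [[1.0, 0.0, 0.0],
            [0.0, c, -s],
            [0.0, s, c]]


def rot_z(angle_rad):
    """Rotation matrix about the z-axis (azimuth)."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [[c, -s, 0.0],
            [s, c, 0.0],
            [0.0, 0.0, 1.0]]


def matmul3(a, b):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def virtual_camera_rotation(elevation_deg, azimuth_deg):
    """R = R_z(azimuth) * R_x(elevation); the order is an assumption."""
    return matmul3(rot_z(math.radians(azimuth_deg)),
                   rot_x(math.radians(elevation_deg)))
```

Each pair of elevation and azimuth then yields one matrix R for one virtual camera, while t can stay fixed if all virtual cameras share the viewpoint of the omnidirectional camera.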
The extrinsic calibration parameters of the omnidirectional camera form the world reference with respect to the virtual perspective cameras. As K contains the scale factors in the horizontal and vertical directions (), K determines the field of view (FOV) of the target images. For perspective images with a resolution of the horizontal and vertical FOVs are:
Equation (5) allows us to define the FOV of the perspective camera and to build at least one virtual perspective camera, which is able to generate perspective images, from the omnidirectional camera. Derived from the horizontal FOV and vertical FOV we determine the diagonal FOV with:
To obtain the FOV of a common perspective camera, we set the focal length equal to the diagonal image size with respect to the sensor. This leads to a simplification of (6) with:
The simplification leads to a diagonal FOV of about 53° and allows us to choose the focal length and the diagonal image size freely, as long as they are equal.
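As a minimal numerical sketch of the field-of-view relations in this section, the snippet below assumes the usual pinhole relation FOV = 2 · arctan(extent / (2 · f)), with image extent and focal length both given in pixels. The function names and the 640 × 480 example resolution are hypothetical and not taken from the paper's setup.

```python
import math


def fov_deg(extent_px, focal_px):
    """Full opening angle in degrees along one image extent (pinhole model)."""
    return math.degrees(2.0 * math.atan(extent_px / (2.0 * focal_px)))


def diagonal_fov_deg(width_px, height_px, focal_px):
    """Diagonal FOV, assuming equal focal lengths in x and y."""
    return fov_deg(math.hypot(width_px, height_px), focal_px)


# Hypothetical example: a 640 x 480 virtual image whose focal length
# equals the image diagonal (800 px).
width, height = 640, 480
focal = math.hypot(width, height)
print(fov_deg(width, focal))                   # horizontal FOV
print(fov_deg(height, focal))                  # vertical FOV
print(diagonal_fov_deg(width, height, focal))  # 2 * atan(1/2)
```

When the focal length equals the image diagonal, the diagonal FOV reduces to 2 · arctan(1/2), independent of the chosen resolution.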
5 Non-maximum suppression
Our goal is to find the most likely position of the minimal enclosing rectangle of the object. Therefore we disable the two final steps of YOLOv2, which occur at the last layers of the network: first, the reduction of the number of bounding boxes based on their confidence, and second, the union of multiple bounding boxes of one particular object through soft non-maximum suppression (Soft-NMS).
In general, NMS is necessary because the areas of the perspective images overlap heavily after they are transformed back to the omnidirectional image. To obtain the raw detections of YOLOv2 with confidences between 0 and 1, we set the confidence threshold to zero. To group the resulting bounding boxes, one suitable measure is the intersection over union (IoU). The IoU of two boxes is defined by the Jaccard index as:
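The Jaccard index of two axis-aligned boxes can be sketched as follows. The corner format (x1, y1, x2, y2) and the function name are assumptions made for this illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the intersection rectangle (zero if disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0.0 else 0.0
```

Two 2 × 2 boxes shifted by one pixel in each direction share an area of 1 and cover an area of 7, so their IoU is 1/7.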
Our next step for the refinement of the back-projected bounding boxes is applying Soft-NMS inspired by Bodla et al. (2017). In this approach Soft-NMS is used to separate bounding boxes, to distinguish between different objects that are close to each other, and to prune multiple detections of one unambiguous object that are back-projected from highly overlapping perspective views. Bounding boxes which are close together and fulfill IoU > 0.5 are considered as a unique region of interest (RoI) proposal for each object. To update the confidences of the bounding boxes, the pruning step in the NMS can be formulated as a rescoring function:
Here, each bounding box carries a score assigned by the detector, and the reference box is the detection box with the maximum score. The NMS threshold removes boxes from the list of detections as long as their IoU with the maximum-score box is greater than or equal to the threshold. The result of (10) is a confidence score between zero and one, which is used to decide what is kept or removed in the neighborhood of the maximum-score box.
The Soft-NMS approach is able to weight the scores of boxes in the neighborhood of the maximum-score box.
Equation (11) describes the rescoring function for Soft-NMS. The goal is to decay the scores of boxes whose overlap exceeds a threshold, modeled with a linear function. Bounding boxes with a higher overlap with the maximum-score box are more likely to be false positives. As a result we obtain a rating of the bounding boxes with respect to the maximum-score box without changing the number of boxes. With increasing overlap between a detection box and the maximum-score box the penalty increases. At low overlap the scores are not affected. To penalize more strongly as the IoU approaches one, the pruning step can also be modeled as a Gaussian penalty function:
where is the set of back-projected raw detections of YOLOv2 and is a growing set of final detections.
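The three rescoring variants can be sketched in one greedy loop, following the algorithm of Bodla et al. (2017). This is an illustrative sketch under stated assumptions rather than the implementation used in the experiments: the function names, the default parameter values and the final `score_threshold` used to discard boxes are hypothetical.

```python
import math


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0.0 else 0.0


def rescore(score, overlap, method, nms_threshold, sigma):
    """Rescoring functions: hard NMS, linear Soft-NMS, Gaussian Soft-NMS."""
    if method == "hard":
        return 0.0 if overlap >= nms_threshold else score
    if method == "linear":
        return score * (1.0 - overlap) if overlap >= nms_threshold else score
    if method == "gaussian":
        return score * math.exp(-(overlap ** 2) / sigma)
    raise ValueError("unknown method: " + method)


def soft_nms(boxes, scores, method="gaussian", nms_threshold=0.3,
             sigma=0.5, score_threshold=0.001):
    """Greedy NMS; returns the kept (box, score) pairs in selection order."""
    remaining = list(zip(boxes, scores))  # raw detections
    kept = []                             # growing set of final detections
    while remaining:
        # Select the remaining box with the maximum score.
        best = max(range(len(remaining)), key=lambda i: remaining[i][1])
        m_box, m_score = remaining.pop(best)
        kept.append((m_box, m_score))
        # Rescore all other boxes by their overlap with the selected box.
        rescored = []
        for box, score in remaining:
            new_score = rescore(score, iou(m_box, box), method,
                                nms_threshold, sigma)
            if new_score > score_threshold:  # assumed final pruning step
                rescored.append((box, new_score))
        remaining = rescored
    return kept
```

With `method="hard"` a strongly overlapping neighbor is removed, whereas the linear and Gaussian variants keep it with a reduced score, so a later confidence threshold decides whether it survives.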
6 Experimental results
Our input data are single images captured with an omnidirectional camera of an indoor scene. To qualitatively evaluate our detection results we use a labeled image dataset from omnidirectional camera geometry, namely the PIROPO database (People in Indoor ROoms with Perspective and Omnidirectional cameras). The input images have a resolution of pixels, are undistorted and captured with a ceiling-mounted omnidirectional camera. The image data contain point labels on the head of persons. To compare the results of the detection with respect to the ground truth, we manually create bounding box ground truth for the class person in 638 images. The subset of the labeled data of the PIROPO database is available on the website mentioned in Section 1.
We assume that our starting point is an image from a virtual perspective camera. The creation of virtual perspective views from omnidirectional images is described in Section 4.2. Furthermore, we assume that the confidence computation of YOLOv2 is deterministic. While the location of the bounding box in the image varies across repeated runs, the confidence value is constant for a true positive detection.
We create the perspective images from our omnidirectional image data as follows: we vary the rotation around both the x-axis and the z-axis. The rotation around the z-axis corresponds to the azimuth of the omnidirectional camera model. The rotation around the x-axis corresponds to the elevation of the omnidirectional camera model. The elevation is changed from to with a step size of . We choose the four different perspective views to avoid the black image regions at the boundaries of the omnidirectional image, which do not contain additional information. The azimuth is changed from to with a step size of , for covering the whole room with perspective views.
As an additional constraint, we assume in our configuration (camera’s mounting height with respect to the room size) that the person fits in one perspective image. After the calculation of the detection results in the perspective images, we transform these detections to the omnidirectional source image.
Using a look-up table (LUT) to back-project the perspective images into the omnidirectional image restores their original position in the source image. Additionally, the corners of the bounding boxes are transformed with the help of the LUT. Through the back transformation of the bounding box corners the new boxes become larger.
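The growth of the boxes can be illustrated with a toy example. In the sketch below the LUT is assumed to be a nested list that maps each perspective pixel (u, v) to a position (x, y) in the omnidirectional image. A real LUT would come from the camera calibration; here a plain 45° rotation stands in for it, and all names are hypothetical.

```python
import math


def make_toy_lut(width, height, angle_deg):
    """Stand-in LUT: rotates perspective pixel coordinates by a fixed angle."""
    c = math.cos(math.radians(angle_deg))
    s = math.sin(math.radians(angle_deg))
    return [[(u * c - v * s, u * s + v * c) for u in range(width)]
            for v in range(height)]


def backproject_box(box, lut):
    """Map the four corners through the LUT and return the axis-aligned
    enclosing rectangle (x1, y1, x2, y2) in the target image."""
    u1, v1, u2, v2 = box
    corners = [lut[v1][u1], lut[v1][u2], lut[v2][u2], lut[v2][u1]]
    xs = [p[0] for p in corners]
    ys = [p[1] for p in corners]
    return (min(xs), min(ys), max(xs), max(ys))


lut = make_toy_lut(5, 5, 45.0)
x1, y1, x2, y2 = backproject_box((0, 0, 4, 4), lut)
print((x2 - x1) * (y2 - y1))  # area of the enclosing box; the source box has area 16
```

Because the back-projected corners are no longer axis-parallel, the axis-aligned rectangle around them covers twice the area of the 4 × 4 source box in this toy case.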
6.1 Bounding box refinement
For grouping bounding boxes based on their confidences, the YOLOv2 object detector has a built-in NMS, as described in Section 5. If the IoU is higher than a threshold, multiple boxes of an object are merged. With the help of a small test set, we evaluate YOLOv2's confidence with both the internal NMS and an external NMS, which produce the same confidence values for equal thresholds. To refine multiple bounding boxes projected from the perspective views into the omnidirectional image we use three variants of NMS.
NMS First, we apply the classical NMS (see (10)) to reduce bounding boxes with a predefined overlap threshold . We vary the overlap threshold from to with a step size of .
Soft-NMS Second, we use Soft-NMS (see (11)). Its advantage is that it penalizes detection boxes with a higher overlap with the maximum-score box, as they are likely false positives. Because the overlap with the maximum-score box is modeled as a linear function, the threshold controls the detection scores. More precisely, detection boxes far away from the maximum-score box are not influenced by the function in (11), whereas boxes that are close to it receive a high penalty.
Soft-NMS with Gaussian smoothing Third, to counter the problem of abrupt changes to the ranked list of detections, we consider the Gaussian penalty function as shown in (12). The Gaussian penalty function is a continuous exponential function, which delivers no penalty when the boxes do not overlap and a high penalty for highly overlapping boxes. The update is applied iteratively to all scores of the remaining detection boxes. Starting from the detector's raw data, we vary the confidence threshold with the values , , and and the Gaussian smoothing factor with the values , , , and . The corresponding results in Figure 1 show a single image from the PIROPO database with the above-mentioned variations of thresholds in the rows and columns, respectively.
An easily visible effect is the number of bounding boxes in the images. In the top right corner of the matrix ( and ) the number of boxes for possible candidates of true positives is high. The opposite effect, fewer true positives with a high accuracy, is observable in the bottom left corner of Figure 1 (values of or and ). Using the Gaussian smoothing factor to steer the smoothness of the merging of the bounding boxes makes the effects explainable. The higher we select the smoothing factor, the closer the exponential function in (12) comes to 1. If the exponential function is close to or equal to 1, the number of boxes does not change. Since the exponential function cannot become zero, the smaller we set the smoothing factor, the smaller the number of bounding boxes in the final set. The Gaussian smoothing function in the Soft-NMS delivers the best results compared to the other variants of NMS.
6.2 Ground truth evaluation
A well-working example of our approach is shown in Figure 2. In Figure 1(a) we show an omnidirectional input image from the PIROPO database. The raw detections of YOLOv2, with a high number of possible true positive candidates and without the NMS step, are visualized in Figure 1(b). The final detection result after the bounding box refinement is shown in Figure 1(c). We apply Soft-NMS with a Gaussian smoothing function. The ground truth evaluation is done with manually annotated bounding boxes as shown in Figure 1(c).
As scalar evaluation metrics for the detector's results we choose precision and recall Szeliski (2010), which lead to precision-recall (PR) curves. Additionally, we determine the AP Szeliski (2010). Because of our application we concentrate on the class person, which makes the use of mAP unnecessary for the evaluation.
The precision and recall are based on the three basic error rates, namely the true positives (TP), the false positives (FP) and the false negatives (FN). Based on the number of these values per frame in the dataset the precision and recall are given by:
Ideally, the precision and recall values in (13) are both close to one. The higher the values of the evaluation metrics and the larger the area under the PR curve, the better the performance of the detector.
The PR curves in Figure 3 show the evaluation of our method with manually generated ground truth. The steepest curve in Figures 2(a) to 2(d), here referenced as NMS Omni, reaches in and a precision of 1 at small recall. This confirms our observation that the YOLOv2 detector localizes the objects in omnidirectional images accurately, but with a high number of false negatives.
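Precision and recall follow directly from the three counts. A minimal sketch, in which the function name and the convention of returning 0.0 for an empty denominator are assumptions:

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP), recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall


# Hypothetical counts: 8 correct detections, 2 false alarms, 8 missed persons.
print(precision_recall(tp=8, fp=2, fn=8))
```

In this hypothetical case the detector is precise (0.8) but misses half of the persons (recall 0.5), the pattern of accurate localization combined with many false negatives.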
For further quantitative evaluation we compute the AP, which is the area under the PR curves of Figure 3, and list it in Table 1. We vary from to with a step size of , where the value follows the PASCAL VOC notation. Additionally, we determine the weighted mean values of precision for NMS, Soft-NMS with Gaussian smoothing and Soft-NMS, and apply YOLOv2 to the omnidirectional images directly. The best (i.e. highest) values of AP for each are highlighted in bold. At we reach an AP for the class person of 64.4 % through Soft-NMS with a Gaussian smoothing function. For from up to the AP increases.
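One common way to compute the area under a PR curve is sketched below. It assumes that each detection has already been matched against the ground truth at a fixed IoU threshold, and it uses the all-point interpolation known from the PASCAL VOC evaluation; whether the experiments use exactly this interpolation is an assumption, and the function name is hypothetical.

```python
def average_precision(scores, is_true_positive, num_ground_truth):
    """Area under the PR curve with a monotonically decreasing precision envelope."""
    if num_ground_truth <= 0 or not scores:
        return 0.0
    # Rank detections by descending confidence.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    recalls, precisions = [], []
    for i in order:
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_ground_truth)
        precisions.append(tp / (tp + fp))
    # Replace each precision by the maximum precision at any higher recall.
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    # Sum the rectangles between consecutive recall values.
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


# Hypothetical ranked detections for one class with four ground-truth boxes.
print(average_precision([0.9, 0.8, 0.7, 0.6], [True, False, True, True], 4))
```

Missed ground-truth boxes lower the AP because the recall never reaches one, which is why a high number of false negatives limits the achievable value.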
Salient points of the PR curves in Figure 3 are intersections of the worst performing and the highest performing approach. Considering the NMS Omni graph and the Soft-NMS Gauss graph in Figure 2(a) we observe an intersection at a precision of and recall of . From this point up to recall of the bounding box refinement method with Soft-NMS Gauss outperforms all other curves without significant decrease of precision.
To the best of our knowledge, no similar approaches for omnidirectional images exist, therefore it is impossible to compare our results to other object detection approaches. Because no public training datasets with labeled fish-eye images are available, we did not fine-tune YOLOv2 from initial weights with omnidirectional image data.
We make the following observations: After the back projection from the perspective to the omnidirectional image, the bounding boxes are oversized, because the axis parallelism is not preserved. By forcing the box edges to be parallel to the axes of the omnidirectional image coordinate system, we do not obtain minimal enclosing rectangles. For most of the recall and precision values the graphs of NMS and Soft-NMS are equal. Only at precision smaller than we observe different trends as shown in Figure 2(a).
In this work we present a method to detect persons in omnidirectional images based on CNNs. We apply a state-of-the-art object detector, namely YOLOv2, to virtual perspective views and transform the detections back to the omnidirectional source images. For the transformation, the step size of the two angles, azimuth and elevation, was selected such that the perspective images overlap heavily. In contrast to the standard implementation of YOLOv2 we use the raw detection boxes instead of applying NMS as bounding box refinement at the end of the network. After back projection from perspective to omnidirectional images we apply three different NMS methods for pruning the back-projected bounding boxes based on confidence and overlap.
We evaluated the bounding box refinement methods NMS, Soft-NMS with a threshold and Soft-NMS with Gaussian smoothing on our manually generated ground truth on the PIROPO database using PR curves and AP. At a we reach an AP for the class person of 64.4 % through Soft-NMS with Gaussian smoothing.
Based on the transformation from omnidirectional to perspective images and vice versa, there are a couple of ideas for future work. One of our central questions is: how does the detection rate of the object detector change if we consider the lens distortion parameters?
To close the gap of missing omnidirectional ground truth, we will create labeled synthetic and real-world data. To simplify the data generation we can use our approach followed by manual refinement of the detections to create ground truth on omnidirectional images. To improve the projection of bounding boxes from the perspective to the omnidirectional model, it is necessary to minimize the effect of oversized boxes in omnidirectional images.
This work is funded by the European Regional Development Fund (ERDF) and the Free State of Saxony under the grant number 100-241-945.
- Bodla et al.  Navaneeth Bodla, Bharat Singh, Rama Chellappa, and Larry S. Davis. Soft-nms - improving object detection with one line of code. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 5562–5570, 2017. doi: 10.1109/ICCV.2017.593. URL https://doi.org/10.1109/ICCV.2017.593.
- Cohen et al.  Taco S. Cohen, Mario Geiger, Jonas Köhler, and Max Welling. Spherical CNNs. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=Hkbd5xZRb.
- Dai et al.  Jifeng Dai, Yi Li, Kaiming He, and Jian Sun. R-FCN: object detection via region-based fully convolutional networks. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 379–387, 2016. URL http://papers.nips.cc/paper/6465-r-fcn-object-detection-via-region-based-fully-convolutional-networks.
- del Blanco and Carballeira  Carlos R. del Blanco and Pablo Carballeira. The piropo database (people in indoor rooms with perspective and omnidirectional cameras). https://sites.google.com/site/piropodatabase/, unpublished dataset, 2016.
- Everingham et al.  Mark Everingham, Luc J. Van Gool, Christopher K. I. Williams, John M. Winn, and Andrew Zisserman. The pascal visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338, 2010. doi: 10.1007/s11263-009-0275-4. URL https://doi.org/10.1007/s11263-009-0275-4.
- Findeisen et al.  Michel Findeisen, Lars Meinel, Markus Heß, André Apitzsch, and Gangolf Hirtz. A fast approach for omnidirectional surveillance with multiple virtual perspective views. In Proceedings of Eurocon 2013, International Conference on Computer as a Tool, Zagreb, Croatia, July 1-4, 2013, pages 1578–1585, 2013. doi: 10.1109/EUROCON.2013.6625187. URL https://doi.org/10.1109/EUROCON.2013.6625187.
- Girshick et al.  Ross B. Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. CoRR, abs/1311.2524, 2013. URL http://arxiv.org/abs/1311.2524.
- Hartley and Zisserman  Richard Hartley and Andrew Zisserman. Multiple view geometry in computer vision (2. ed.). Cambridge University Press, 2006. ISBN 978-0-521-54051-3.
- He et al.  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 770–778, 2016. doi: 10.1109/CVPR.2016.90. URL https://doi.org/10.1109/CVPR.2016.90.
- Krasin et al.  Ivan Krasin, Tom Duerig, Neil Alldrin, Vittorio Ferrari, Sami Abu-El-Haija, Alina Kuznetsova, Hassan Rom, Jasper Uijlings, Stefan Popov, Andreas Veit, Serge Belongie, Victor Gomes, Abhinav Gupta, Chen Sun, Gal Chechik, David Cai, Zheyun Feng, Dhyanesh Narayanan, and Kevin Murphy. Openimages: A public dataset for large-scale multi-label and multi-class image classification. Dataset available from https://github.com/openimages, 2017.
- Li et al.  Wen Li, Limin Wang, Wei Li, Eirikur Agustsson, and Luc Van Gool. Webvision database: Visual learning and understanding from web data. CoRR, abs/1708.02862, 2017. URL http://arxiv.org/abs/1708.02862.
- Lin et al.  Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: common objects in context. In Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V, pages 740–755, 2014. doi: 10.1007/978-3-319-10602-1_48. URL https://doi.org/10.1007/978-3-319-10602-1_48.
- Liu et al.  Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott E. Reed, Cheng-Yang Fu, and Alexander C. Berg. SSD: Single Shot MultiBox Detector, pages 21–37. 2016. doi: 10.1007/978-3-319-46448-0_2. URL https://doi.org/10.1007/978-3-319-46448-0_2.
- Redmon and Farhadi  Joseph Redmon and Ali Farhadi. YOLO9000: better, faster, stronger. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 6517–6525, 2017. doi: 10.1109/CVPR.2017.690. URL https://doi.org/10.1109/CVPR.2017.690.
- Redmon et al.  Joseph Redmon, Santosh Kumar Divvala, Ross B. Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 779–788, 2016. doi: 10.1109/CVPR.2016.91. URL https://doi.org/10.1109/CVPR.2016.91.
- Ren et al.  Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. Faster R-CNN: towards real-time object detection with region proposal networks. In Corinna Cortes, Neil D. Lawrence, Daniel D. Lee, Masashi Sugiyama, and Roman Garnett, editors, Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 91–99, 2015.
- Russakovsky et al.  Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael S. Bernstein, Alexander C. Berg, and Fei-Fei Li. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015. doi: 10.1007/s11263-015-0816-y. URL https://doi.org/10.1007/s11263-015-0816-y.
- Szeliski  Richard Szeliski. Computer Vision: Algorithms and Applications. Springer-Verlag New York, Inc., New York, NY, USA, 1st edition, 2010. ISBN 1848829345, 9781848829343.