S-DOD-CNN: Doubly Injecting Spatially-Preserved Object Information for Event Recognition
We present a novel event recognition approach called Spatially-preserved Doubly-injected Object Detection CNN (S-DOD-CNN), which incorporates spatially preserved object detection information in both a direct and an indirect way. Indirect injection is carried out by simply sharing the weights between the object detection modules and the event recognition module. Our novelty lies in preserving the spatial information for the direct injection. Once multiple regions-of-interest (RoIs) are acquired, their feature maps are computed and then projected onto a spatially-preserving combined feature map using one of the four RoI Projection approaches we present. In our architecture, combined feature maps are generated for object detection and are directly injected into the event recognition module. Our method provides state-of-the-art accuracy for malicious event recognition.
Hyungtae Lee Sungmin Eum Heesung Kwon
US Army Research Laboratory; Booz Allen Hamilton Inc.
Object information provides crucial evidence for identifying the events shown in still images. There have been several attempts which make use of the object information in improving event recognition performance. Most methods perform event recognition with the aid of object detection results via feature-level fusion [1, 2] or score-level fusion [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13].
Recently, Lee et al. [14] introduced Doubly-injected Object Detection CNN (DOD-CNN), which incorporates object detection information in a direct and an indirect way within a CNN architecture for the task of event recognition. DOD-CNN consists of three connected networks responsible for event recognition, rigid object detection, and non-rigid object detection. The three networks are co-trained while object detection information is indirectly passed onto event recognition via the shared portion of the architecture.
DOD-CNN achieves further performance improvement by directly passing intermediate output of the rigid and non-rigid object detection onto the event recognition module. More specifically, each of the two feature maps from rigid and non-rigid object detection is generated by pooling multiple per-RoI feature maps (i.e., feature maps for each region-of-interest) via batch pooling. The two feature maps are then directly injected into the event recognition module at the end of the last convolutional layer. Note that the batch pooling simply aggregates multiple feature maps along the batch direction without considering their spatial coordinates in the original image.
In this paper, we present an approach to generate a single combined feature map which preserves the original spatial location of the per-RoI feature maps provided by the object detection process. Per-RoI feature maps are first projected onto separate projected feature maps using a novel RoI Projection, and these are then aggregated into a single combined feature map. In the RoI projection, each per-RoI feature map is weighted by its object detection probability. Although our approach follows the spirit of DOD-CNN by incorporating the object detection information in two ways (i.e., doubly injecting), the rigid and non-rigid object detection information is used differently, by preserving the spatial context of each per-RoI feature map. Therefore, we call our new architecture Spatially-Preserved and Doubly-injected Object Detection CNN (S-DOD-CNN), which is depicted in Figure 1.
When projecting the per-RoI feature maps into one single projected feature map, we adopt one of two interpolation methods: MAX interpolation or Linear interpolation. In MAX interpolation, the maximum value among the input points is projected onto the output point. In Linear interpolation, a linearly interpolated value of the four nearest input points is projected onto the output point. These interpolation methods can be applied with either class-specific or class-agnostic RoI selection. While class-specific selection carries out the RoI projection separately for each object class, class-agnostic selection considers only a small subset of RoIs among all the RoIs, disregarding the object classes. Therefore, the RoI projection can be conducted in four different combinations.
To demonstrate the effectiveness of using spatially-preserved object detection feature maps for event recognition, we conducted several experiments on malicious event classification. All four combinations of the novel RoI projection within S-DOD-CNN provide higher accuracy than all the baselines.
DOD-CNN. DOD-CNN [14] consists of five shared convolutional layers, one RoI pooling layer, and three separate modules responsible for event recognition, rigid object detection, and non-rigid object detection, respectively. Each module consists of two convolutional layers, one average pooling layer, and one fully connected layer, where the output dimension of the last layer is set to match the number of events or objects.
DOD-CNN takes one image and multiple RoIs (approximately 2000 for rigid objects and 5 for non-rigid objects per image) as input. Selective search [15] and multi-scale sliding windows [16, 17, 18, 19] are used to generate the RoIs for rigid and non-rigid objects, respectively. For each RoI, a per-RoI feature map is computed via RoI pooling and then fed into its corresponding task-specific module.
For rigid or non-rigid object detection, the output of the last convolutional layer (denoted as the per-RoI feature map) is pooled into a single map along the batch direction, which is referred to as batch pooling. The two single feature maps are then concatenated with the output of the last convolutional layer of the event recognition module. The concatenated map is fed into the remaining event recognition layers, which are the average pooling and fully connected layers.
Batch pooling does not preserve the spatial information of the feature maps, since these maps are aligned and pooled without considering their spatial coordinates in the original input image. For instance, consider selecting feature points at the same location from two different feature maps which are aligned for batch pooling. These points do not necessarily correspond to the same location in the input image, as each feature map is tied to a different RoI.
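The misalignment described above can be illustrated with a toy sketch. It assumes an element-wise maximum as the pooling operator (the text only says that maps are pooled along the batch direction), and the helper names, RoI coordinates, and map values are hypothetical.

```python
def batch_pool(maps):
    """Element-wise max over equally sized 2D maps (assumed pooling operator)."""
    rows, cols = len(maps[0]), len(maps[0][0])
    return [[max(m[r][c] for m in maps) for c in range(cols)] for r in range(rows)]


def cell_to_image_xy(roi, r, c, size):
    """Image-space centre of cell (r, c) in a size x size map pooled from roi = (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = roi
    return (x0 + (c + 0.5) * (x1 - x0) / size, y0 + (r + 0.5) * (y1 - y0) / size)


# Two hypothetical RoIs in a 100 x 100 image and their 2 x 2 per-RoI maps.
roi_a, roi_b = (0, 0, 40, 40), (60, 60, 100, 100)
map_a = [[0.9, 0.1], [0.2, 0.3]]
map_b = [[0.4, 0.8], [0.7, 0.6]]

pooled = batch_pool([map_a, map_b])

# Cell (0, 0) of both maps is pooled together, yet the two cells
# cover different locations in the input image.
loc_a = cell_to_image_xy(roi_a, 0, 0, size=2)
loc_b = cell_to_image_xy(roi_b, 0, 0, size=2)
```

Here `pooled[0][0]` mixes a value centred at `loc_a` with one centred at `loc_b`, which is the loss of spatial correspondence that the paragraph describes.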
S-DOD-CNN. We introduce a novel method that aggregates multiple feature maps which come from different regions in the input image while preserving the spatial information. The spatial information for each per-RoI feature map is preserved by projecting each feature map onto a location on a projected feature map which corresponds to its original spatial location within the input image.
Figure 2 illustrates how per-RoI feature maps are processed through RoI Projection (RoIProj) to generate corresponding projected feature maps. Note that before the per-RoI feature maps are fed into RoIProj, each is multiplied by its detection probability to incorporate the reliability of each detection result. The projected feature maps are then max-pooled to build a combined feature map. In our experiment, the five per-RoI feature maps with the highest probability values are chosen to build the combined feature map.
Our network generates two separate combined feature maps, one for rigid and another for non-rigid object detection. These two combined feature maps are concatenated with the event recognition feature map as in Figure 1. The two combined feature maps share the same-sized and aligned receptive field with the event recognition feature map, and thus they are ‘spatially-preserved’. The event recognition feature map is the output of the layer right before RoI pooling. The event recognition module takes this concatenated map as input to compute the event recognition probability. As we are constructing our network based on DOD-CNN, but with ‘spatially-preserved’ object detection information for event recognition, we call it Spatially-preserved DOD-CNN (S-DOD-CNN).
RoI Projection. When projecting the per-RoI feature maps into one projected feature map (denoted as RoIProj in Figure 2), we adopt one of two interpolation methods: MAX interpolation or Linear interpolation. Examples of the two interpolations are shown in Figure 3. When multiple points on an input map are projected onto a single point on an output map, the point is filled with the maximum of those points (MAX) or a linearly interpolated value of the four nearest input points (Linear).
The RoI projection can be performed in two different ways. These two methods differ in how the subset of RoIs that is actually used for projection is selected from the overall set of RoIs. Both selection methods use the N probability scores which are generated for each RoI after AVG & FC (see Figure 2), where N is the number of classes. For class-specific selection, the 5 RoIs with the highest probability scores are chosen for each class. For class-agnostic selection, the 5 RoIs with the highest probability scores are chosen from all the RoIs without regard to which classes they come from. Therefore, the RoI projection is executed either N times or just once, depending on which selection method is chosen. In addition, if a per-RoI feature map has C channels, the combined map has N × C channels under class-specific selection, while it keeps C channels under class-agnostic selection. Overall, the RoI projection can be performed as one of four combinations, as there are two interpolation methods (MAX/Linear) and two RoI selection methods (class-specific/class-agnostic).
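The weighting, selection, and max-pooling steps can be sketched as follows. This is a minimal single-channel illustration, not the authors' implementation: `combine` and `toy_project` are hypothetical names, and `toy_project` is only a stand-in for RoIProj that drops each 1 x 1 per-RoI map into a fixed cell of a 2 x 2 canvas.

```python
def combine(per_roi_maps, probs, project, k=5):
    """Weight the k most probable per-RoI maps, project each, and max-pool the results."""
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    projected = []
    for i in top:
        # Multiply the per-RoI map by its detection probability before projection.
        weighted = [[probs[i] * v for v in row] for row in per_roi_maps[i]]
        projected.append(project(i, weighted))
    rows, cols = len(projected[0]), len(projected[0][0])
    # Element-wise max over the projected maps gives the combined map.
    return [[max(p[r][c] for p in projected) for c in range(cols)] for r in range(rows)]


# Toy stand-in for RoIProj: RoI i lands in cell cells[i] of a 2 x 2 canvas.
cells = [(0, 0), (0, 1), (1, 1)]


def toy_project(i, weighted):
    canvas = [[0.0, 0.0], [0.0, 0.0]]
    r, c = cells[i]
    canvas[r][c] = weighted[0][0]
    return canvas


maps = [[[1.0]], [[2.0]], [[4.0]]]
probs = [0.5, 0.9, 0.25]
combined = combine(maps, probs, toy_project, k=2)
```

With `k=2` the least probable RoI is dropped even though its raw activation is the largest, which shows how the detection probability controls what reaches the combined map.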
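A minimal single-channel sketch of the two interpolations is given below. The coordinate convention is an assumption made for illustration: the RoI is given as an end-exclusive cell box `(r0, c0, r1, c1)` inside the output map, cells outside the box are left at zero, and Linear interpolation samples the input at the centre of each output cell. The function names are hypothetical.

```python
def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


def bilinear(m, y, x):
    """Linearly interpolate the four input points nearest to (y, x), clamped to the map."""
    s = len(m)
    y = min(max(y, 0.0), s - 1.0)
    x = min(max(x, 0.0), s - 1.0)
    y0, x0 = int(y), int(x)
    y1, x1 = min(y0 + 1, s - 1), min(x0 + 1, s - 1)
    dy, dx = y - y0, x - x0
    top = m[y0][x0] * (1 - dx) + m[y0][x1] * dx
    bottom = m[y1][x0] * (1 - dx) + m[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def roi_project(roi_map, box, out_h, out_w, mode="max"):
    """Project a square per-RoI map into box = (r0, c0, r1, c1) of an out_h x out_w map."""
    s = len(roi_map)
    r0, c0, r1, c1 = box
    h, w = r1 - r0, c1 - c0
    out = [[0.0] * out_w for _ in range(out_h)]
    for r in range(r0, r1):
        for c in range(c0, c1):
            if mode == "max":
                # All input cells whose extent overlaps this output cell.
                rows = range((r - r0) * s // h, ceil_div((r + 1 - r0) * s, h))
                cols = range((c - c0) * s // w, ceil_div((c + 1 - c0) * s, w))
                out[r][c] = max(roi_map[i][j] for i in rows for j in cols)
            else:
                # Centre of the output cell expressed in input coordinates.
                y = (r + 0.5 - r0) * s / h - 0.5
                x = (c + 0.5 - c0) * s / w - 0.5
                out[r][c] = bilinear(roi_map, y, x)
    return out


roi_map = [[1.0, 2.0, 3.0, 4.0],
           [5.0, 6.0, 7.0, 8.0],
           [9.0, 10.0, 11.0, 12.0],
           [13.0, 14.0, 15.0, 16.0]]
max_out = roi_project(roi_map, (1, 1, 3, 3), 4, 4, mode="max")
lin_out = roi_project(roi_map, (1, 1, 3, 3), 4, 4, mode="linear")
```

In this example a 4 x 4 per-RoI map is squeezed into a 2 x 2 region, so each output cell receives either the maximum or the average-like interpolation of a 2 x 2 block of input points.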
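The two selection schemes can be sketched as below. The function names are hypothetical, and two assumptions are made that the text does not state: the score table excludes any background class, and class-agnostic selection ranks each RoI by its best class score.

```python
def top_k(indices, key, k):
    """Indices of the k largest items under key."""
    return sorted(indices, key=key, reverse=True)[:k]


def select_class_specific(scores, k=5):
    """scores[i][n] is the probability of class n for RoI i; returns one top-k list per class."""
    rois = range(len(scores))
    return [top_k(rois, lambda i, n=n: scores[i][n], k) for n in range(len(scores[0]))]


def select_class_agnostic(scores, k=5):
    """One top-k list over all RoIs, ranked by each RoI's best class score (assumption)."""
    return top_k(range(len(scores)), lambda i: max(scores[i]), k)


def combined_channels(num_classes, channels, class_specific):
    """Channel count of the combined map: one projection per class, or a single projection."""
    return num_classes * channels if class_specific else channels


# Four hypothetical RoIs scored for two classes.
scores = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.7], [0.3, 0.4]]
per_class = select_class_specific(scores, k=2)
agnostic = select_class_agnostic(scores, k=2)
```

Class-specific selection yields one RoI subset (and hence one projection) per class, so the channel count grows with the number of classes, whereas class-agnostic selection yields a single subset.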
S-DOD-CNN is trained by using a mini-batch stochastic gradient descent (SGD) optimization approach. Event recognition and rigid object detection modules are optimized by minimizing their softmax loss while cross entropy loss is used for non-rigid object detection optimization. Each batch contains two images consisting of one malicious image and one benign image. For event recognition and non-rigid object detection, 1 and 5 RoIs are generated for each image, respectively. For rigid object detection, a batch takes 64 RoIs randomly selected from approximately 2000 RoIs generated by selective search. Accordingly, we need to prepare a large number of batches to cover the entire RoI set for training rigid object detection. A batch (which contains 2 images) consists of 2, 128, and 10 RoIs for event recognition, rigid object detection, and non-rigid object detection, respectively.
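The batch composition above reduces to simple arithmetic, sketched here. The names are hypothetical, and sampling the 64 rigid-object RoIs without replacement is an assumption, since the text only says they are randomly selected.

```python
import random

# RoIs contributed by each image to one batch, per task.
ROIS_PER_IMAGE = {"event": 1, "rigid": 64, "non_rigid": 5}
IMAGES_PER_BATCH = 2  # one malicious image and one benign image


def batch_roi_counts():
    """Total number of RoIs per task in one batch."""
    return {task: n * IMAGES_PER_BATCH for task, n in ROIS_PER_IMAGE.items()}


def sample_rigid_rois(all_rois, n=64, rng=random):
    """Randomly pick n of an image's selective-search RoIs (without replacement, assumed)."""
    return rng.sample(all_rois, n)


counts = batch_roi_counts()
picked = sample_rigid_rois(list(range(2000)), rng=random.Random(0))
```

Since each batch sees only 64 of roughly 2000 rigid-object RoIs per image, many batches are needed before every RoI has been visited.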
In preparing the positive and negative samples for training, we have used 0.5 and 0.1 as the rigid and non-rigid object detection thresholds for the intersection-over-union (IOU) metric, respectively. Any RoI whose IOU with respect to the ground truth bounding box is larger than the threshold is treated as a positive example. RoIs whose IOU is lower than 0.1 are treated as negative examples.
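The labelling rule can be sketched as follows, using the thresholds stated above. The function names are hypothetical, and treating RoIs that fall between the two thresholds as "ignored" is an assumption, since the text does not say how they are handled.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x0, y0, x1, y1)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def label_roi(roi, gt_boxes, pos_thresh, neg_thresh=0.1):
    """Label an RoI by its best IoU against the ground-truth boxes."""
    best = max((iou(roi, g) for g in gt_boxes), default=0.0)
    if best > pos_thresh:
        return "positive"
    if best < neg_thresh:
        return "negative"
    return "ignored"  # assumption: in-between RoIs are not used


gt = [(0, 0, 10, 10)]
rigid_label = label_roi((0, 0, 10, 5), gt, pos_thresh=0.5)       # IoU is exactly 0.5
non_rigid_label = label_roi((0, 0, 10, 5), gt, pos_thresh=0.1)
far_label = label_roi((20, 20, 30, 30), gt, pos_thresh=0.5)
```

The same RoI can therefore be unused for a rigid object yet positive for a non-rigid object, reflecting the looser threshold applied to objects without a well-defined extent.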
The weights of the shared convolutional layers are initially inherited from AlexNet [20] pre-trained on the large-scale Places dataset [21], and the remaining layers (the convolutional and fully connected layers of all three modules) are initialized from a Gaussian distribution with zero mean and a standard deviation of 0.01.
Two-stage Cascaded Optimization. To allow more batches for training rigid object detection, we use a two-stage cascaded optimization strategy. In the first stage, only the layers used to perform rigid object detection are trained. Then, in the second stage, all three tasks are jointly optimized in an end-to-end fashion. Figure 4 shows the second stage of the training process. For each training iteration in the second stage, two processes ((a) and (b) in Figure 4) are executed in order. In process (a), all the layers of the two object detection modules are optimized with a batch containing 128 rigid-object RoIs and 10 non-rigid-object RoIs. After process (a) is done, the full set of RoIs (i.e., approximately 4000 RoIs for rigid objects and 10 RoIs for non-rigid objects) is fed into the object detection modules. The resulting combined feature maps are injected into the event recognition module for optimization. We use a learning rate of 0.001, 50k iterations, and a step size of 30k for the first stage, and a learning rate of 0.0001, 20k iterations, and a step size of 12k for the second stage.
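The two learning-rate schedules can be sketched as below, reading "step size" as the interval of a step-decay schedule. The decay factor `gamma=0.1` is an assumption (the text does not give it), and the names `STAGES` and `step_lr` are hypothetical.

```python
# Hyperparameters as stated in the text for the two training stages.
STAGES = {
    "stage1": {"base_lr": 0.001, "iterations": 50_000, "step_size": 30_000},
    "stage2": {"base_lr": 0.0001, "iterations": 20_000, "step_size": 12_000},
}


def step_lr(base_lr, step_size, iteration, gamma=0.1):
    """Step decay: multiply the rate by gamma every step_size iterations (gamma is assumed)."""
    return base_lr * gamma ** (iteration // step_size)


def num_decays(stage):
    """How many times the rate is decayed before the stage's final iteration."""
    cfg = STAGES[stage]
    return (cfg["iterations"] - 1) // cfg["step_size"]


final_lrs = {
    name: step_lr(cfg["base_lr"], cfg["step_size"], cfg["iterations"] - 1)
    for name, cfg in STAGES.items()
}
```

Under this reading, each stage decays its learning rate exactly once, at 30k and 12k iterations respectively.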
Malicious Crowd Dataset [12, 22] is selected as it provides the appropriate components to evaluate the effects of using object information for event recognition. It contains 1133 images and is equally divided into malicious classes and benign classes. Half of the dataset is used for training and the rest is used for testing. In addition to the label of the event class, bounding box annotations of three rigid objects (police, helmet, car) and two non-rigid objects (fire, smoke) are provided.  provides details on how these objects are selected.
3.2 Performance Evaluation
To demonstrate the effectiveness of our approach, we compared the event recognition accuracy of S-DOD-CNN with two baselines: DOD-CNN which does not include direct injection and DOD-CNN which incorporates both direct and indirect injection. The accuracy is measured with average precision (AP) as shown in Table 1. S-DOD-CNN, which adopts one of the four RoI projections, provides at least 1.1% higher accuracy than both of the baselines. This verifies the effectiveness of using object detection information spatially preserved via RoI projection. RoI projection using linear interpolation and class-specific RoI selection shows the highest accuracy among all the methods but the differences are marginal.
|Method||RoI Projection||AP (%)|
|No Direct Inject. ||90.7|
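For readers unfamiliar with the metric, a minimal sketch of average precision is given below. It uses the non-interpolated definition (mean of the precision values at the rank of each positive example), which is an assumption, since the text does not specify which AP variant is used. The scores and labels are made up for illustration.

```python
def average_precision(scores, labels):
    """Non-interpolated AP: mean precision at the rank of each positive example."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / hits if hits else 0.0


# Hypothetical classifier scores and ground-truth labels (1 = malicious).
ap = average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 1])
```

In this toy case the positives sit at ranks 1, 3, and 4, so the AP is the mean of 1, 2/3, and 3/4.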
In Table 2, we also analyze how each task performs when it is individually optimized (Single-task) or co-optimized. For the No Direct Injection and DOD-CNN cases, non-rigid object detection performs better when optimized simultaneously with other tasks. However, in S-DOD-CNN, the performance of the two sub-tasks (rigid and non-rigid object detection) is degraded. This indicates that the two sub-tasks are sacrificed solely to improve event recognition performance.
|Task||Single-task||No Direct Injection||DOD-CNN||S-DOD-CNN|
3.3 Ablation Study: Location of Building and Injecting Combined Feature Map
Applying a convolutional layer after the concatenation may not be effective if the combined feature maps (coming from object detection) are not aligned properly with the event recognition feature map. One advantage of constructing combined feature maps using our approach is that the map can be injected at any position in event recognition. Table 3 shows how the performance varies according to the locations at which the combined feature map is built and injected. DOD-CNN, which loses the RoIs' spatial information while building its feature map, shows performance degradation when the injection location is placed before any convolutional layer (see Table 3). In contrast, S-DOD-CNN does not lose any performance regardless of the injection position.
The performance of S-DOD-CNN depends greatly on the building location of the combined feature map. The best accuracy is achieved when it is constructed after . Passing the input image through more convolutional layers before building the combined feature maps may have provided a richer representation.
We have devised an event recognition approach, referred to as S-DOD-CNN, in which object detection is exploited while preserving the spatial information. Multiple per-RoI feature maps within an object detection module are projected onto a combined feature map using one of the newly presented RoI Projections, preserving the spatial location of each RoI with respect to the original image. These maps are then injected into the event recognition module. Our approach provides state-of-the-art accuracy for malicious event recognition.
- thanks: Copyright 2020 IEEE. Published in the IEEE 2020 International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2020), scheduled for 4-9 May, 2020, in Barcelona, Spain. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works, must be obtained from the IEEE. Contact: Manager, Copyrights and Permissions / IEEE Service Center / 445 Hoes Lane / P.O. Box 1331 / Piscataway, NJ 08855-1331, USA. Telephone: + Intl. 908-562-3966.
[1] Limin Wang, Zhe Wang, Wenbin Du, and Yu Qiao, “Object-scene convolutional neural networks for event recognition in images,” in CVPRW, 2015.
[2] Limin Wang, Zhe Wang, Sheng Guo, and Yu Qiao, “Better exploiting OS-CNNs for better event recognition in images,” in ICCVW, 2015.
[3] Li-Jia Li and Li Fei-Fei, “What, where and who? Classifying event by scene and object recognition,” in ICCV, 2007.
[4] Tim Althoff, Hyun Oh Song, and Trevor Darrell, “Detection bank: An object detection based video representation for multimedia event recognition,” in ACM MM, 2012.
[5] Ryan M. Robinson, Hyungtae Lee, Michael McCourt, Amar R. Marathe, Heesung Kwon, Chau Ton, and William D. Nothwang, “Human-autonomy sensor fusion for rapid object detection,” in IROS, 2015.
[6] Mihir Jain, Jan C. van Gemert, and Cees G. M. Snoek, “What do 15,000 object categories tell us about classifying and localizing actions?,” in CVPR, 2015.
[7] Hyungtae Lee, Heesung Kwon, Ryan M. Robinson, Daniel Donavanik, William D. Nothwang, and Amar R. Marathe, “Task-conversions for integrating human and machine perception in a unified task,” in IROS, 2016.
[8] Hyungtae Lee, Heesung Kwon, Ryan M. Robinson, William D. Nothwang, and Amar R. Marathe, “An efficient fusion approach for combining human and machine decisions,” in SPIE Defense+Commercial Sensing (DCS), 2016.
[9] Hyungtae Lee, Heesung Kwon, Ryan M. Robinson, William D. Nothwang, and Amar R. Marathe, “Dynamic belief fusion for object detection,” in WACV, 2016.
[10] Sungmin Eum*, Hyungtae Lee*, Heesung Kwon, and David Doermann, “IOD-CNN: Integrating object detection networks for event recognition,” in ICIP, 2017, (* indicates equal contribution).
[11] Yilun Cao*, Hyungtae Lee*, and Heesung Kwon, “Enhanced object detection via fusion with prior beliefs from image classification,” in ICIP, 2017, (* indicates equal contribution).
[12] Hyungtae Lee*, Sungmin Eum*, Joel Levis*, Heesung Kwon, James Michaelis, and Michael Kolodny, “Exploitation of semantic keywords for malicious event classification,” in ICASSP, 2018, (* indicates equal contribution).
[13] Hyungtae Lee and Heesung Kwon, “DBF: Dynamic belief fusion for combining multiple object detectors,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2019.
[14] Hyungtae Lee, Sungmin Eum, and Heesung Kwon, “DOD-CNN: Doubly-injecting object information for event recognition,” in ICASSP, 2019.
[15] J.R.R. Uijlings, K.E.A. van de Sande, T. Gevers, and A.W.M. Smeulders, “Selective search for object recognition,” International Journal of Computer Vision (IJCV), vol. 104, no. 2, pp. 154–171, September 2013.
[16] Paul Viola and Michael Jones, “Rapid object detection using a boosted cascade of simple features,” in CVPR, 2001.
[17] Navneet Dalal and Bill Triggs, “Histograms of oriented gradients for human detection,” in CVPR, 2005.
[18] Pedro Felzenszwalb, Ross Girshick, David McAllester, and Deva Ramanan, “Object detection with discriminatively trained part-based models,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 32, no. 9, pp. 1627–1645, September 2010.
[19] Hyungtae Lee, Vlad I. Morariu, and Larry S. Davis, “Qualitative pose estimation by discriminative deformable part models,” in ACCV, 2012.
[20] Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, “ImageNet classification with deep convolutional neural networks,” in NIPS, 2012.
[21] Bolei Zhou, Agata Lapedriza, Jianxiong Xiao, Antonio Torralba, and Aude Oliva, “Learning deep features for scene recognition using Places database,” in NIPS, 2014.
[22] Sungmin Eum, Hyungtae Lee, and Heesung Kwon, “Going deeper with CNN in malicious crowd event classification,” in SPIE Defense+Commercial Sensing (DCS), 2018.