PVSS: A Progressive Vehicle Search System for Video Surveillance Networks

Xin-Chen Liu, Wu Liu, Hua-Dong Ma, and Shuang-Qun Li


JD AI Research, JD.com, Beijing 100101, China

Beijing Key Laboratory of Intelligent Telecommunication Software and Multimedia, Beijing University of Posts and Telecommunications, Beijing 100876, China

 preprint for arXiv

Abstract This paper focuses on the task of searching for a specific vehicle that has appeared in a surveillance network. Existing methods usually assume that the vehicle images are well cropped from the surveillance videos, and then use visual attributes, such as colors and types, or license plate numbers to match the target vehicle in the image set. However, a complete vehicle search system should also consider vehicle detection, representation, indexing, storage, matching, and so on. Besides, attribute-based search cannot accurately find the same vehicle due to intra-instance changes across cameras and the extremely uncertain environment. Moreover, license plates may be misrecognized in surveillance scenes due to low resolution and noise. In this paper, a Progressive Vehicle Search System, named PVSS, is designed to solve the above problems. PVSS consists of three modules: the crawler, the indexer, and the searcher. The vehicle crawler detects and tracks vehicles in surveillance videos and transfers the captured vehicle images, metadata, and contextual information to the server or cloud. Then multi-grained attributes, such as visual features and license plate fingerprints, are extracted and indexed by the vehicle indexer. Finally, the vehicle searcher takes a query triplet as input, consisting of a vehicle image, a time range, and a spatial scope, and searches for the target vehicle in the database through a progressive process. Extensive experiments on a public dataset from a real surveillance network validate the effectiveness of PVSS.

Keywords: Vehicle Search, Video Surveillance Network, Progressive Search System, Multi-modal Data Analysis

1 Introduction

Physical object search, which aims to find an object sensed by ubiquitous sensor networks such as surveillance networks, is one of the most important services provided by the Internet of Things (IoT) [1]. Vehicles, including cars, buses, and trucks, are among the most common objects in video surveillance networks, so a vehicle search system has many potential applications in the era of IoT. Internet search engines, e.g., Google, YouTube, and Amazon's search engine, help us look for webpages, images, videos, and products in the information space or cyberspace [2, 3], while the task of a vehicle search engine is to find the target vehicle in the physical space. A vehicle search system can support pervasive applications such as intelligent transportation [4] and automatic surveillance [5]. Fig. 1 shows an example in which the user inputs a query vehicle, a search area, and a time interval, and the system returns the locations and timestamps of the target.

Fig. 1: A typical example of vehicle search. Given a specific vehicle, a time interval, and the spatial scope, the system returns where and when the vehicle appeared in the surveillance network.

Early vehicle retrieval methods and systems mainly focused on the attribute-based framework [6, 7, 8]. They first classify vehicles by types, models, and colors, then index and retrieve them with the assigned attributes. Recently, vehicle search research has focused on content-based vehicle matching, also known as vehicle Re-Identification (Re-Id), which uses the content of images to find vehicles in the database [9, 10]. Besides, multi-modal contextual information such as spatiotemporal information is also explored to assist vehicle Re-Id [11, 12, 13, 14]. With the development of representation models, such as hand-crafted descriptors and Convolutional Neural Networks (CNNs), these methods have obtained significant improvements. However, it is difficult to precisely find a specific vehicle based only on attributes because of the intra-instance changes across cameras and the minor inter-instance differences between similar vehicles. Furthermore, existing vehicle Re-Id approaches assume that the vehicle images have been well cropped and aligned from the video frames. Therefore, they only consider feature extraction and one-to-N matching for the vehicle images. Nevertheless, a vehicle search engine, as a complex system, must consist of many components such as vehicle extraction, representation, indexing, and retrieval. Moreover, both accuracy and efficiency should be considered when designing the system.

Fig. 2: The architecture of the progressive vehicle search system.

To this end, we design a progressive vehicle search system, named PVSS, in this paper. PVSS contains three key modules: the crawler of vehicle data, the vehicle indexer based on multi-grained features, and the progressive vehicle searcher. To guarantee high accuracy and efficiency during search, a series of data structures are designed for the vehicle search system. In the crawler, not only visual contents but also contextual information are extracted from the surveillance networks. Then the multimodal data is exploited by deep learning based models to obtain discriminative and robust features of vehicles, which are then organized by the multi-level indexes. In the search process, the vehicle is searched in a progressive manner, including the from-coarse-to-fine search in the feature domain and the from-near-to-distant search in the physical space. Finally, extensive experiments on a large-scale vehicle search dataset collected from a real-world surveillance network show the state-of-the-art results of the proposed system.

Compared with our previous conference paper [15], we provide more analysis of contextual information such as the spatiotemporal information in surveillance networks. For example, we discuss the temporal distance between neighboring cameras in the surveillance network by analyzing the travel time of vehicles in our collected data. We also compare the characteristics of vehicles with those of persons, which have been studied in related work. Based on the analysis of the spatiotemporal information of vehicles in surveillance networks, we propose a new camera neighboring graph compared to [15]. In particular, in [15] we only adopted the fixed spatial distance between neighboring cameras as the weights of edges in the graph, which is too simple to model the spatiotemporal cues. In this manuscript, we also use the temporal distance between neighboring cameras learned from training data to model the spatiotemporal relations, which further improves the performance of the system.

2 Related Work

2.1 Multimedia Retrieval

In the past two decades, content based multimedia retrieval (CBMR) has been extensively studied [16, 17, 18, 19, 20, 21, 3, 22]. CBMR methods usually extract visual features from images or videos and estimate the similarity between the query and the sources in the database. For example, Video Google was proposed by Sivic and Zisserman to achieve object search in videos with the idea of webpage retrieval [16]. Lin et al. [23] exploited 3-D representation models for content based vehicle search. Farhadi et al. [24] proposed to represent the appearance of objects by their attributes for image retrieval. Zheng et al. [25] proposed a large-scale image retrieval method with an effective visual model and efficient index structures. Liu et al. [20] designed an instant video search system for movie search on mobile devices. However, unlike in existing CBMR tasks, depending only on visual features, i.e., the appearance of vehicles, cannot give precise results for vehicle search because of the minor inter-class differences between very similar vehicles and the varied intra-instance changes across cameras.

2.2 Person Re-Id and Search

Content based person Re-Id has been studied for several years [26, 27, 28]. Existing person Re-Id approaches usually assume that the persons have been detected and extracted from the video frames. The main topics include visual feature learning from images and discriminative metrics for feature embedding [29]. Besides person Re-Id, attributes and context information are also used for person retrieval. For example, Feris et al. [30] proposed a system for attribute-based people search in surveillance environments. Xu et al. [31] designed an object browsing and retrieval system, which integrated vision features and spatial-temporal cues by a graph model for retrieval of pedestrians and cyclists.

2.3 Vehicle Re-Id and Search

In recent years, vehicle search has mainly focused on content based vehicle Re-Id, which aims to find the target vehicle in the database with a query vehicle image [10, 9]. For example, Liu et al. [10] proposed a deep CNN based method, named Deep Relative Distance Learning, to jointly learn visual features and metric mapping for vehicle Re-Id. Besides appearance features, contextual information such as license plates and spatiotemporal records is also used for vehicle Re-Id. For example, Liu et al. [11] proposed a progressive vehicle search method which exploits image features, license plates, and contextual information in a progressive manner. Wang et al. [13] proposed a framework to learn local landmarks and global features of vehicles and refine the results with a spatiotemporal regularization model. Similar to person Re-Id, existing vehicle Re-Id methods also assume that the vehicle images have been detected and well aligned from video frames. Therefore, they only consider the feature representation and similarity metrics for image matching. However, to build a complete search system, we consider not only the problems of content based vehicle Re-Id but also the tasks of data acquisition, organization, and retrieval.

| Name | Type | Description |
|---|---|---|
| Camera ID | int | The unique ID of the camera that captures the track. |
| Vehicle ID | long | The unique ID of the vehicle track. |
| Frame ID | long | The ID of the first frame in the vehicle track. |
| Track Length | int | The frame count of the vehicle track. |
| Trajectory | point[] | The point sequence of the vehicle track. |
| Visual Features | float[] | The multi-grained visual features extracted from the vehicle track. |
| Duration | float | The time duration of the vehicle track. |
| Plate | string | The license plate string of the object (if recognized). |

Table 1: Vehicle Track Metadata.
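
The record in Table 1 could be represented roughly as follows. This is a minimal sketch, not the system's actual schema: the class name `VehicleTrack`, the field names, the `(x, y)` point type, and the use of `None` for a missing plate are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Point = Tuple[float, float]  # assumed (x, y) pixel coordinates


@dataclass
class VehicleTrack:
    """One vehicle track record, mirroring the fields of Table 1."""
    camera_id: int                # camera that captured the track
    vehicle_id: int               # unique ID of the track under that camera
    frame_id: int                 # ID of the first frame in the track
    track_length: int             # number of frames in the track
    trajectory: List[Point] = field(default_factory=list)
    visual_features: List[float] = field(default_factory=list)
    duration: float = 0.0         # seconds (the unit is an assumption)
    plate: Optional[str] = None   # None when no plate was recognized

    def key(self) -> Tuple[int, int]:
        """(Camera ID, Vehicle ID) pair that identifies the track."""
        return (self.camera_id, self.vehicle_id)


track = VehicleTrack(camera_id=3, vehicle_id=1042, frame_id=15230,
                     track_length=48, duration=1.92)
```

Keeping the identifying pair behind a single `key()` method makes it easy to use the record as a dictionary key in the indexes discussed later.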

3 Overview

Fig. 2 illustrates the overall architecture of PVSS. It contains three modules:

  • The offline vehicle crawler receives the video streams from surveillance cameras and crops vehicle image sequences from video frames.

  • The vehicle indexer extracts multi-grained visual features from vehicle tracks and constructs the multi-level indexes for efficient search.

  • The online vehicle searcher performs the progressive search process with the multi-level indexes in both the feature domain and the spatiotemporal space.

Before introducing the details of each component, we first present the main data structures of PVSS in the next section.

4 Data Structures

The data that we can utilize is diverse and in multiple modalities. Various semantic contents like vehicle plates, types, colors, and visual features can be extracted in online or offline manner as in [32, 11]. The data modalities include text, digits, coordinates, structures, and so on. The topology and spatiotemporal context of surveillance networks can be more complex data structures such as graphs. Therefore, these data should be described in proper structures, which are effective for retrieval and flexible for extension. In this section, we first introduce the vehicle track metadata, which is to describe the image sequences of vehicles captured by surveillance cameras. Then, the camera table is designed to index the vehicle track metadata for each camera. At last, we build a camera neighboring graph to represent the spatial topology of the surveillance networks.

4.1 Vehicle Track Metadata

Given the variety of video contents and extraction approaches, the vehicle track metadata is proposed to describe the vehicle image sequences obtained from cameras. Table 1 lists the attributes and descriptions of the metadata in detail. In our system, the vehicle tracks are extracted by the vehicle crawler frame by frame, as presented in Section 5.1. An object tracking method is used to group the images of the same vehicle in neighboring frames into one instance of a vehicle track. As shown in Table 1, the Camera ID and Vehicle ID together specify a unique vehicle track. Among these attributes, the visual features are the most important, since they carry the multi-grained visual representation of each vehicle that is used in the indexing and search procedures. The extraction of visual features is described in Section 5.2.

4.2 Camera Table

After the generation of vehicle track metadata, the storing and indexing of these data should be considered. In our system, the camera table is designed to index instances of vehicle track metadata for each camera.

For each camera, we allocate a camera table to index the vehicle track metadata extracted from this camera. The videos are processed in time order, so the metadata instances are also generated in time order and appended to the tails of the camera tables. This keeps the entries of each camera table in ascending order of time. Fig. 3 shows the structure of the camera table. In a real deployment, the camera table can be implemented with relational databases like MySQL or distributed databases like HBase in the data center. When the scale of the camera tables grows, the tables will be organized in a tree-like structure for efficient indexing and search.

Fig. 3: The structure of the camera table.
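
Because entries are appended in ascending time order, a time-range lookup on one camera table can be done with binary search. The sketch below illustrates this with an in-memory list; the class `CameraTable`, its methods, and the use of a floating-point timestamp per entry are assumptions for illustration, and a production system would rely on a database as described above.

```python
import bisect
from typing import List, Tuple


class CameraTable:
    """Append-only table of (timestamp, vehicle_id) entries for one camera."""

    def __init__(self, camera_id: int) -> None:
        self.camera_id = camera_id
        self._times: List[float] = []
        self._vehicle_ids: List[int] = []

    def append(self, timestamp: float, vehicle_id: int) -> None:
        # Videos are processed in time order, so entries must arrive sorted.
        if self._times and timestamp < self._times[-1]:
            raise ValueError("entries must be appended in ascending time order")
        self._times.append(timestamp)
        self._vehicle_ids.append(vehicle_id)

    def range_query(self, start: float, end: float) -> List[Tuple[float, int]]:
        """Return entries with start <= timestamp <= end using binary search."""
        lo = bisect.bisect_left(self._times, start)
        hi = bisect.bisect_right(self._times, end)
        return list(zip(self._times[lo:hi], self._vehicle_ids[lo:hi]))


table = CameraTable(camera_id=7)
for t, vid in [(10.0, 1), (12.5, 2), (20.0, 3), (31.0, 4)]:
    table.append(t, vid)
hits = table.range_query(12.0, 25.0)
```

The sorted-append invariant is what makes the query logarithmic in the table size; the check in `append` guards that invariant.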

4.3 Camera Neighboring Graph

4.3.1 Topology Construction

The camera neighboring graph records the geo-locations of cameras and the topology of the surveillance network, which are obtained from the infrastructure companies and the map services.

We define the graph as a directed graph $G = (V, E, W)$. The graph is composed of the node set $V$, the edge set $E$, and the weight set $W$. Fig. 4 illustrates an example of the camera neighboring graph, which is built from a subset of a real-world surveillance network. The nodes represent the cameras, and each node stores the GPS coordinates and settings of a camera. The edges are the directed connections between neighboring cameras. The edges are determined not only by the topology of the city roads but also by the heading directions and fields of view (FOV) of the cameras. So we define the view-connected edge as below:

Definition 1 (View-connected edge). A view-connected edge connects a pair of cameras in $V$: if a vehicle can reappear in the FOV of camera $v_j$ directly after appearing in the FOV of camera $v_i$, then there is a view-connected edge from $v_i$ to $v_j$.

Fig. 4: An example of the camera neighboring graph. The left image is the camera locations and the city map of a real-world surveillance network. The right is the graph abstracted from the network.

4.3.2 Weight Modeling

The weight set $W$ of $G$ contains two parts. The first part is $W_s$. It stores the spatial distances between neighboring cameras, which can be obtained from map services like Google Maps. The second part is $W_t$, which contains the temporal distances between neighboring cameras learned from training data. Here we give details about the learning of $W_t$.
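
A directed camera neighboring graph with spatial edge weights might be stored as an adjacency list, as in the following sketch. The class `CameraGraph`, its method names, and the choice of metres as the distance unit are assumptions for illustration; the temporal weights are learned separately and are not included here.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class CameraGraph:
    """Directed camera neighboring graph with per-edge spatial weights."""
    # camera_id -> (latitude, longitude)
    nodes: Dict[int, Tuple[float, float]] = field(default_factory=dict)
    # camera_id -> cameras reachable through a view-connected edge
    adjacency: Dict[int, List[int]] = field(default_factory=dict)
    # (src, dst) -> spatial distance, assumed to be in metres
    spatial: Dict[Tuple[int, int], float] = field(default_factory=dict)

    def add_camera(self, camera_id: int, lat: float, lon: float) -> None:
        self.nodes[camera_id] = (lat, lon)
        self.adjacency.setdefault(camera_id, [])

    def add_edge(self, src: int, dst: int, distance_m: float) -> None:
        # Directed: a vehicle seen at src can next appear at dst.
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both cameras must be added before connecting them")
        self.adjacency[src].append(dst)
        self.spatial[(src, dst)] = distance_m

    def neighbors(self, camera_id: int) -> List[int]:
        return list(self.adjacency[camera_id])


graph = CameraGraph()
graph.add_camera(1, 39.96, 116.35)
graph.add_camera(2, 39.97, 116.35)
graph.add_camera(3, 39.97, 116.36)
graph.add_edge(1, 2, 420.0)
graph.add_edge(2, 1, 420.0)
graph.add_edge(2, 3, 610.0)
```

Edges are stored per direction, so a one-way road or a camera facing only one direction is expressed by simply omitting the reverse edge, as with cameras 2 and 3 here.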

Several works have proposed models to estimate the travel time in surveillance networks. The author of [33] proposed a graph-based vehicle search model. In this model, the weight of an edge is the mean time cost of all vehicles that traveled the edge during the search time. Given a search time interval, the history records in that interval are used to compute the mean time cost. Xu et al. [31] proposed a graph model for related object search on a campus. This model estimates the time delay between cameras using object reappearance. It assumes that the speed of an object changes only slightly, so the time delay is negatively and linearly correlated with the travel speed. Using labeled data collected from the surveillance network, a linear model of time delay and optical flow is learned with a standard regression method.

However, according to the statistics on our dataset shown in Fig. 5, the above two models cannot be directly applied to our scenario. We select 5 sequential edges in the surveillance network and plot the records in about one hour, from 15:59:58 to 16:59:58. In Fig. 5, the top row shows the time cost vs. object speed plots. We find that the time costs are not linearly correlated with the speed of objects, because we can only observe the speed at the cameras and cannot know the speed between the cameras. The behaviors of vehicles between cameras are unpredictable from surveillance videos alone. Traffic lights, pedestrians, and traffic jams make the actual model very complex. So a linear model of time cost and speed would fail in our scenario.

Fig. 5: Time cost statistics scatter plots. The top part shows the time cost vs. object speed plots. The bottom part shows the time cost vs. record time plots. The red lines in the bottom part are the mean time costs in each 600-second time slot.

The bottom part of Fig. 5 illustrates the time cost vs. record time plots. From this part, we observe that within a given time interval the travel times of different vehicles vary only slightly. We therefore use a slot-mean model to build the weights. We segment the whole timeline into time slots of fixed length. Suppose that the set $T_e^k$ contains the time cost records on edge $e$ that fall in time slot $k$. We have the mean time cost $\mu_e^k$:

$$\mu_e^k = \frac{1}{|T_e^k|} \sum_{t \in T_e^k} t$$

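In each time slot $k$, the mean time cost $\mu_e^k$ is used as one parameter of the weight function. In addition, we use the standard deviation $\sigma_e^k$ as the other parameter of the weight, which is computed as follows:

$$\sigma_e^k = \sqrt{\frac{1}{|T_e^k|} \sum_{t \in T_e^k} \left(t - \mu_e^k\right)^2}$$

After computing $\mu_e^k$ and $\sigma_e^k$ on all time slots, we have a step function for the weight vector on the edge:

$$w_e(o) = \begin{cases} (\mu_e^1, \sigma_e^1), & t_o \in \text{slot } 1 \\ \quad\vdots \\ (\mu_e^K, \sigma_e^K), & t_o \in \text{slot } K \end{cases}$$

where $o$ is an object metadata instance in the start camera of edge $e$, $t_o$ is its record time, $T_e^k$ is the set of time cost records on edge $e$ in slot $k$, and $K$ is the total number of time slots. All weight functions on the edges constitute the temporal weight set $W_t$ of graph $G$.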
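
The slot-mean model can be sketched for a single edge as below. The 600-second slot length follows Fig. 5; taking the population standard deviation as the second per-slot parameter is an assumption of this sketch, as are the function names and the `(record_time, time_cost)` record layout.

```python
import statistics
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple

SLOT_SECONDS = 600  # slot length used in Fig. 5


def slot_weights(records: Iterable[Tuple[float, float]],
                 slot_seconds: int = SLOT_SECONDS
                 ) -> Dict[int, Tuple[float, float]]:
    """Map slot index -> (mean, std) of the time costs recorded in that slot.

    Each record is (record_time_s, time_cost_s) for one edge. The standard
    deviation as second parameter is an assumption of this sketch.
    """
    by_slot = defaultdict(list)
    for record_time, time_cost in records:
        by_slot[int(record_time // slot_seconds)].append(time_cost)
    return {slot: (statistics.fmean(costs), statistics.pstdev(costs))
            for slot, costs in by_slot.items()}


def edge_weight(weights: Dict[int, Tuple[float, float]], record_time: float,
                slot_seconds: int = SLOT_SECONDS
                ) -> Optional[Tuple[float, float]]:
    """Step function: parameters of the slot containing record_time."""
    return weights.get(int(record_time // slot_seconds))


records = [(30.0, 40.0), (200.0, 50.0), (590.0, 60.0), (700.0, 80.0)]
weights = slot_weights(records)
```

Looking up the weight for a vehicle then only needs the vehicle's record time; slots without any training record return `None`, which a caller would have to handle, for example by falling back to a neighboring slot.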

5 Functional Modules

5.1 Vehicle Crawler

The vehicle crawler aims to detect and crop vehicle images from video frames streamed by the surveillance network. It plays a similar role to the conventional web crawler of the Internet search engines, which crawls and downloads webpages from the World Wide Web.

To effectively locate the vehicles in the video frames, we adopt a state-of-the-art deep learning based object detection model, i.e., Faster R-CNN [34]. Faster R-CNN contains two Convolutional Neural Network (CNN) based parts. The first is the Region Proposal Network, which is a Fully Convolutional Network (FCN) that generates object proposals from the input frames. The second is a fully connected network that regresses the bounding boxes of objects and predicts the corresponding categories. To achieve precise vehicle detection, we adopt a ResNet-50 [35] based Faster R-CNN structure which is pretrained on the ImageNet dataset [36]. Then, the network is finetuned on large-scale vehicle bounding boxes from surveillance videos annotated by ourselves. After detection, a nearest neighbor tracking algorithm is adopted to associate the bounding boxes of the same vehicle between neighboring frames. In our implementation, Faster R-CNN is deployed on GPU servers to achieve efficient vehicle detection.
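
The nearest neighbor association between consecutive frames can be illustrated with a greedy matcher on box centers. This is only one simple way to realize the idea: the greedy strategy, the center-distance criterion, the `max_dist` gate of 50 pixels, and the function names are all assumptions of this sketch rather than details of the system.

```python
import math
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def center(box: Box) -> Tuple[float, float]:
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def associate(prev: Dict[int, Box], detections: List[Box],
              next_id: int, max_dist: float = 50.0
              ) -> Tuple[Dict[int, Box], int]:
    """Greedily link each detection to the nearest unmatched previous box.

    Returns the current frame's boxes keyed by track ID and the next unused
    track ID. A detection farther than max_dist from every unmatched
    previous box starts a new track.
    """
    current: Dict[int, Box] = {}
    unmatched = dict(prev)
    for det in detections:
        best_id, best_dist = None, max_dist
        for track_id, box in unmatched.items():
            d = math.dist(center(det), center(box))
            if d <= best_dist:
                best_id, best_dist = track_id, d
        if best_id is None:
            best_id, next_id = next_id, next_id + 1  # open a new track
        else:
            del unmatched[best_id]                   # each track matches once
        current[best_id] = det
    return current, next_id


frame1, next_id = associate({}, [(0, 0, 10, 10), (100, 100, 120, 120)], next_id=0)
frame2, next_id = associate(frame1, [(102, 101, 122, 121), (3, 2, 13, 12)], next_id)
```

In the usage above, the two detections of the second frame are listed in a different order than in the first, yet each keeps the track ID of the box it is closest to.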

Each track is assigned a unique vehicle ID under the corresponding camera. The first frame of the track, the track length, and the sequence of pixel coordinates are recorded in the metadata, while tracks shorter than 5 frames are discarded. After that, we use an off-the-shelf plate recognition tool to extract the plate numbers with a confidence measure. If the tool cannot recognize the plate or returns a very low confidence, the plate is assigned UNAVAL, which means unavailable. Finally, the vehicle track metadata is appended to the camera table, while the image sequence of the track is stored on the vehicle storage server.
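
The two filtering rules in this step can be written down compactly. In the sketch below the minimum track length of 5 and the `UNAVAL` marker come from the text, whereas the confidence threshold of 0.8 and the function names are hypothetical values chosen only for illustration.

```python
MIN_TRACK_LENGTH = 5        # from the text: shorter tracks are discarded
PLATE_MIN_CONFIDENCE = 0.8  # hypothetical threshold, not given in the text
UNAVAILABLE = "UNAVAL"      # marker for an unavailable plate


def keep_track(track_length: int, min_length: int = MIN_TRACK_LENGTH) -> bool:
    """A track is kept only if it spans at least min_length frames."""
    return track_length >= min_length


def finalize_plate(plate_text: str, confidence: float,
                   min_conf: float = PLATE_MIN_CONFIDENCE) -> str:
    """Return the recognized plate, or UNAVAL if missing or low confidence."""
    if not plate_text or confidence < min_conf:
        return UNAVAILABLE
    return plate_text


kept = [length for length in (3, 5, 48) if keep_track(length)]
plates = [finalize_plate("ABC123", 0.95),
          finalize_plate("ABC1Z3", 0.40),
          finalize_plate("", 0.0)]
```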

5.2 Vehicle Indexer

The vehicle indexer contains two functions: the first is multi-grained visual feature extraction, the second is multi-level index construction.

For the vehicle tracks, we extract an appearance based coarse representation and a license plate based fine-grained feature. To learn discriminative and robust features of vehicle appearance, we adopt the ResNet-50 [35] pretrained on ImageNet [36] as the basic network. The network is finetuned on the VeRi dataset [9] with a multi-task loss function, which contains a cross entropy loss and a contrastive loss [37]. To learn an effective plate feature, a ResNet-18 based siamese neural network for plate verification is trained on massive license plate pairs as in [11]. The above two feature extractors are deployed on GPU servers for efficiency. In the implementation, we use the 2048-D “pool5” layer of ResNet-50 and the 1024-D “conv3” layer of ResNet-18 as the appearance feature and plate feature, respectively. For the images in a track, the features are extracted separately and fused by average pooling, which means that each vehicle track has a 2048-D coarse-grained feature and a 1024-D fine-grained feature.
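
Fusing per-image features into one track-level feature by average pooling amounts to an element-wise mean. The following sketch uses tiny 3-D vectors in place of the 2048-D and 1024-D features; the function name `average_pool` is an assumption.

```python
from typing import List, Sequence


def average_pool(frame_features: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of the per-image feature vectors of one track."""
    if not frame_features:
        raise ValueError("a track needs at least one feature vector")
    dim = len(frame_features[0])
    if any(len(f) != dim for f in frame_features):
        raise ValueError("all feature vectors must have the same dimension")
    count = len(frame_features)
    return [sum(f[i] for f in frame_features) / count for i in range(dim)]


# Two images of the same track, each with a toy 3-D feature.
track_feature = average_pool([[1.0, 0.0, 2.0], [3.0, 4.0, 2.0]])
```

The pooled vector has the same dimension as each input, so tracks of different lengths remain directly comparable in the index.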

After feature extraction, we build a two-level index for vehicle tracks with the state-of-the-art approximate nearest neighbor index algorithm, i.e., FLANN [38], due to its high efficiency. The level-1 index is built on the appearance feature vectors, while the level-2 is built on the plate feature vectors.
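
The interface of the two-level index can be illustrated with an exhaustive nearest neighbor search. Note that this sketch deliberately replaces FLANN with a brute-force scan and Euclidean distance, so it shows only how the two levels are organized, not the approximate search itself; the class `BruteForceIndex` and the track keys are hypothetical.

```python
import heapq
import math
from typing import Hashable, List, Sequence, Tuple


class BruteForceIndex:
    """Exact nearest neighbor index; a stand-in for an approximate one."""

    def __init__(self) -> None:
        self._items: List[Tuple[Hashable, List[float]]] = []

    def add(self, key: Hashable, vector: Sequence[float]) -> None:
        self._items.append((key, list(vector)))

    def search(self, query: Sequence[float], k: int
               ) -> List[Tuple[Hashable, float]]:
        """Return the k stored keys closest to query, nearest first."""
        scored = ((math.dist(query, vec), key) for key, vec in self._items)
        nearest = heapq.nsmallest(k, scored, key=lambda pair: pair[0])
        return [(key, dist) for dist, key in nearest]


# Level 1 holds appearance features, level 2 holds plate features.
level1, level2 = BruteForceIndex(), BruteForceIndex()
for key, appearance, plate in [((1, 10), [0.0, 1.0], [1.0, 1.0]),
                               ((1, 11), [5.0, 5.0], [0.0, 0.2]),
                               ((2, 7), [0.5, 0.0], [4.0, 4.0])]:
    level1.add(key, appearance)
    level2.add(key, plate)

coarse = level1.search([0.0, 0.0], k=2)
fine = level2.search([0.0, 0.0], k=1)
```

Both levels share the same `(camera ID, vehicle ID)` keys, so the scores returned by the two indexes for one track can be joined later.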

5.3 Vehicle Searcher

In this section, we discuss the main procedures of online vehicle search. Given a vehicle image cropped by a user and a time interval, a list of candidate target vehicles and their states will be returned, as shown in Fig. 1. As mentioned before, the progressive search contains two aspects:

5.3.1 From-coarse-to-fine feature matching

Vehicle search is generally a one-to-N feature matching problem, in which the similarity between the query and each gallery item is estimated and ranked to find the target vehicle most similar to the query. During searching, the query image or track is fed into the feature extraction module to extract its visual feature and plate feature as in Section 5.2. Then the visual feature of the query is searched with the level-1 index to obtain the coarse similarity, $S_c(q, g)$, between the query vehicle and the gallery vehicle. Similarly, the fine similarity, $S_f(q, g)$, is obtained with the level-2 index using the plate feature. With the above two similarity scores, the visual similarity between the query vehicle, $q$, and one gallery vehicle, $g$, is:

$$S_v(q, g) = \alpha \, S_c(q, g) + (1 - \alpha) \, S_f(q, g)$$

where $\alpha$ is a hyper-parameter to balance the two scores.

In addition to the visual similarity, we also explore the spatiotemporal similarity between the query and the gallery. Given the metadata of $q$ and $g$, we can obtain their spatial distance, $D_s(q, g)$, and temporal distance, $D_t(q, g)$, as

$$D_s(q, g) = \left\| L(C(q)) - L(C(g)) \right\|, \qquad D_t(q, g) = \left| T(q) - T(g) \right|$$

where $C(\cdot)$ is the operation to get the camera ID of a vehicle, $L(\cdot)$ is the location of a camera, and $T(\cdot)$ is the timestamp of a vehicle track. Then, we adopt a two-layer fully connected neural network, i.e., a multi-layer perceptron (MLP), $f_{st}$, to model the spatiotemporal similarity of $q$ and $g$. The input and output dimensions of the two fully connected layers are (2, 64) and (64, 1), respectively. The activation functions of the two layers are ReLU and Sigmoid, respectively. The spatiotemporal similarity, $S_{st}(q, g)$, is denoted as

$$S_{st}(q, g) = f_{st}\left([D_s(q, g), D_t(q, g)]\right)$$

where $[\cdot, \cdot]$ is the concatenation of two elements.

Finally, to effectively integrate the visual similarity, $S_v$, and the spatiotemporal similarity, $S_{st}$, we exploit a fully connected layer with sigmoid activation, $f_{fuse}$, to learn a suitable fusion parameter. So, the final similarity can be computed by

$$S(q, g) = f_{fuse}\left([S_v(q, g), S_{st}(q, g)]\right)$$

The neural networks $f_{st}$ and $f_{fuse}$ are trained with the binary cross entropy loss, which guides the model to determine whether the query and a gallery item are the same vehicle or not. During searching, the results are ranked by the similarity scores between the query and the set of gallery vehicles.
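
The forward pass of the similarity computation can be sketched in plain Python. The layer sizes (2, 64) and (64, 1) and the ReLU and Sigmoid activations follow the text; everything else is an assumption made for illustration: the balancing of the coarse and fine scores as a convex combination with a hypothetical `alpha`, the hand-picked weights (a trained model would learn them), and all function names.

```python
import math
from typing import Callable, List, Sequence


def relu(x: float) -> float:
    return max(0.0, x)


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def dense(inputs: Sequence[float], weights: Sequence[Sequence[float]],
          biases: Sequence[float], activation: Callable[[float], float]
          ) -> List[float]:
    """Fully connected layer; weights[j] is the row for output unit j."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def visual_similarity(s_coarse: float, s_fine: float, alpha: float) -> float:
    # Convex combination is an assumed form of the balancing.
    return alpha * s_coarse + (1.0 - alpha) * s_fine


def spatiotemporal_similarity(d_s, d_t, w1, b1, w2, b2) -> float:
    hidden = dense([d_s, d_t], w1, b1, relu)  # layer 1: 2 -> 64, ReLU
    return dense(hidden, w2, b2, sigmoid)[0]  # layer 2: 64 -> 1, Sigmoid


def final_similarity(s_v: float, s_st: float, w, b: float) -> float:
    return dense([s_v, s_st], [w], [b], sigmoid)[0]  # fusion: 2 -> 1, Sigmoid


# Hand-picked (untrained) weights, chosen so larger distances score lower.
W1, B1 = [[-0.001, -0.002]] * 64, [0.5] * 64
W2, B2 = [[0.05] * 64], [-1.0]

s_v = visual_similarity(0.8, 0.6, alpha=0.5)
s_near = spatiotemporal_similarity(100.0, 50.0, W1, B1, W2, B2)
s_far = spatiotemporal_similarity(1000.0, 500.0, W1, B1, W2, B2)
score_near = final_similarity(s_v, s_near, w=[2.0, 2.0], b=-2.0)
score_far = final_similarity(s_v, s_far, w=[2.0, 2.0], b=-2.0)
```

With these illustrative weights, a gallery vehicle that is close to the query in space and time receives a higher final score than one with the same visual similarity that is far away.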

5.3.2 From-near-to-distant search

To achieve efficient vehicle search, we utilize the camera neighboring graph, $G$, to perform the from-near-to-distant search. Given the camera ID of the query, we traverse $G$ in a breadth-first manner. This means that the query vehicle is matched first against the vehicles in the nearest neighboring cameras and then against those in more distant ones. After each traversal of the current neighboring cameras, a list of candidate results is returned. The results are updated as the traversal of $G$ proceeds, but the length of the list remains constant, which guarantees that the most similar results are always shown to users.
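
The breadth-first traversal with a constant-length result list can be sketched with a queue per level and a bounded min-heap. The function `progressive_search`, the `score_camera` callback that returns `(score, track)` pairs for one camera, and the toy scores are assumptions for illustration.

```python
import heapq
from collections import deque
from typing import Callable, Dict, Iterator, List, Tuple

Result = Tuple[float, str]  # (similarity score, track label)


def progressive_search(adjacency: Dict[int, List[int]], start: int,
                       score_camera: Callable[[int], List[Result]],
                       top_k: int = 3) -> Iterator[List[Result]]:
    """Yield the current top-k results after each breadth-first level."""
    visited = {start}
    frontier = deque([start])
    best: List[Result] = []  # min-heap holding the k best results so far
    while frontier:
        next_frontier = deque()
        for cam in frontier:
            for nb in adjacency.get(cam, []):
                if nb in visited:
                    continue
                visited.add(nb)
                next_frontier.append(nb)
                for item in score_camera(nb):
                    if len(best) < top_k:
                        heapq.heappush(best, item)
                    else:
                        heapq.heappushpop(best, item)  # drop the weakest
        if next_frontier:
            yield sorted(best, reverse=True)
        frontier = next_frontier


adjacency = {1: [2, 3], 2: [4], 3: [4], 4: []}
gallery = {2: [(0.40, "c2-v1")],
           3: [(0.90, "c3-v5"), (0.10, "c3-v6")],
           4: [(0.95, "c4-v2")]}
rounds = list(progressive_search(adjacency, 1,
                                 lambda cam: gallery.get(cam, []), top_k=2))
```

Each yielded list corresponds to one ring of cameras around the query camera, so a user interface could display early results from nearby cameras while more distant ones are still being searched.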

6 Experiments

6.1 Dataset

In this paper, we compare the proposed PVSS with different vehicle search methods on the VeRi dataset [39]. The VeRi dataset was collected from 20 surveillance cameras in a real-world surveillance network and contains about 50,000 images and 9,000 tracks of 776 vehicles. Each vehicle in the VeRi dataset is labeled with various attributes, such as 10 types of colors and 9 categories. Moreover, the license plates of vehicles are annotated for more precise vehicle search. Furthermore, contextual information, such as the spatiotemporal information, the topology of the surveillance network, and the distances, is annotated. Therefore, the dataset is suitable for evaluating the proposed progressive vehicle search system.

6.2 Experimental Settings

Following the settings in [39], cross-camera matching is performed, which means that one vehicle image from one camera is used as the query to search for images of the same vehicle captured by other cameras. Vehicle matching is performed in a track-to-track manner, which means that the units of the query set and the gallery are both tracks of vehicles cropped from surveillance videos. In our experiments, we use 1,678 query tracks and 2,021 testing tracks as in [39].

To evaluate the accuracy of the methods, HIT@1 (precision at rank 1) and HIT@5 (precision at rank 5) are adopted. In addition, since a query has more than one ground truth, both precision and recall should be considered in our experiments. Hence, we also use mean average precision to evaluate the comprehensive performance as in [39]. The average precision (AP) is computed for each query as

$$AP = \frac{\sum_{k=1}^{n} P(k) \times gt(k)}{N_{gt}}$$

where $n$ and $N_{gt}$ are the numbers of tests and ground truths, respectively, $P(k)$ is the precision at the $k$-th position of the results, and $gt(k)$ is an indicator function that equals 1 if the $k$-th result is correctly matched and 0 otherwise. Over all queries, the mean Average Precision (mAP) is formulated as

$$mAP = \frac{\sum_{q=1}^{Q} AP(q)}{Q}$$

in which $Q$ is the number of queries.
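
The metrics can be computed from a ranked list of per-result correctness flags, as in the sketch below. The function names are assumptions, and HIT@k is implemented here as "at least one correct match within the top k", which is one common reading of the measure rather than a definition taken from the text.

```python
from typing import List, Sequence, Tuple


def average_precision(ranked_hits: Sequence[bool], num_ground_truth: int) -> float:
    """AP: sum of precision at each correct rank, divided by the number of ground truths."""
    if num_ground_truth == 0:
        return 0.0
    correct, total = 0, 0.0
    for k, hit in enumerate(ranked_hits, start=1):
        if hit:
            correct += 1
            total += correct / k  # precision at rank k
    return total / num_ground_truth


def mean_average_precision(queries: Sequence[Tuple[Sequence[bool], int]]) -> float:
    """mAP: mean of AP over all (ranked_hits, num_ground_truth) queries."""
    return sum(average_precision(h, n) for h, n in queries) / len(queries)


def hit_at_k(ranked_hits: Sequence[bool], k: int) -> bool:
    """True if any of the top-k results is a correct match (assumed reading)."""
    return any(ranked_hits[:k])


queries: List[Tuple[List[bool], int]] = [
    ([True, False, True, False], 2),  # AP = (1/1 + 2/3) / 2
    ([False, True], 1),               # AP = (1/2) / 1
]
m_ap = mean_average_precision(queries)
```

For the first query the correct results sit at ranks 1 and 3, giving precisions 1/1 and 2/3; for the second the only correct result is at rank 2, giving 1/2.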

6.3 Comparison with Vehicle Re-Id Methods

In this section, we first compare the appearance based search component in PVSS with four appearance-based Re-Id methods. Among them, methods 1) and 2) are two vehicle Re-Id methods, while methods 3) and 4) are two state-of-the-art approaches for video-based person Re-Id. Then we compare the complete progressive vehicle search system with three state-of-the-art multi-modal data based approaches, which utilize visual features, plate features, and spatiotemporal data. The details of all methods are as follows:

1) Fusion of color and attribute (FACT) [9]. This method is the baseline method on the VeRi dataset, which integrates hand-crafted features, e.g., SIFT and Color Name, with attributes extracted by GoogleNet.

2) Progressive vehicle search (Progressive) [11]. This is a progressive vehicle search framework, which uses appearance features and plate verification for vehicle matching and refines the results with spatiotemporal information.

3) Identity feature with LSTM (ResNet + LSTM). This approach adopts the CNN+LSTM which is the state-of-the-art method for video-based person Re-Id [40]. It can model dynamic patterns of persons like actions and gaits for person Re-Id.

4) Top-push Distance Learning (TDL) [41]. This method is one of the state-of-the-art metric learning methods for video-based person Re-Id. We use the identity features extracted by ResNet as the basic features. Then the TDL method is used to aggregate and map the original features into the latent space.

5) Appearance-based search in PVSS (PVSS-App). This is a part of PVSS, which uses only the appearance features for vehicle search.

6) Orientation Invariant Feature Embedding and Spatial Temporal Regularization (OIFE + STR) [13]. This method proposes an Orientation Invariant Feature Embedding model to learn 20 landmarks and extract both local and global features from vehicle images.

7) Siamese-CNN and Path-LSTM (SC + Path-LSTM) [12]. This approach exploits two ResNets [35] in a siamese structure to learn visual features of vehicles and a one-layer LSTM to model the spatiotemporal context.

8) PROgressive Vehicle re-ID (PROVID) [39]. This progressive vehicle search framework searches for vehicles in three steps: appearance-based coarse filtering, license plate-based fine search, and spatiotemporal re-ranking.

9) PVSS-App-Plate. This is a part of the proposed PVSS, which uses the appearance and plate features for vehicle search.

10) PVSS. This is the complete progressive vehicle search system proposed in our paper.

| Methods | mAP | HIT@1 | HIT@5 |
|---|---|---|---|
| FACT [9] | 18.00 | 52.44 | 72.29 |
| Progressive [11] | 25.11 | 61.26 | 75.98 |
| ResNet + LSTM [40] | 28.11 | 56.20 | 79.14 |
| TDL [41] | 35.65 | 69.61 | 88.02 |
| PVSS-App | 51.00 | 85.64 | 95.35 |
| OIFE + STR [13] | 51.42 | 68.30 | 89.07 |
| SC + Path-LSTM [12] | 58.27 | 83.49 | 90.04 |
| PROVID [39] | 53.42 | 81.56 | 95.11 |
| PVSS-App-Plate | 61.12 | 89.69 | 96.31 |
| PVSS | 62.62 | 90.58 | 97.14 |

Table 2: The results of vehicle Re-Id methods on the VeRi dataset (all values in %).

Table 2 lists the mAP, HIT@1, and HIT@5 of all approaches. For appearance-only methods, we find that the traditional methods, i.e., FACT and Progressive, are worse than the deep learning based methods. This is because hand-crafted features cannot effectively model the appearance of a vehicle or comprehensively represent vehicles. Comparing the LSTM-based method with the other deep learning-based models, we see that the LSTM-based method obtains worse results. Although an LSTM can model dynamic representations of actions or gaits for video-based person Re-Id, it may fail for video-based vehicle Re-Id. TDL performs better than the LSTM-based method, while our appearance-based part, PVSS-App, achieves the best results among the appearance-only methods. For the multi-modal methods, OIFE + STR and SC + Path-LSTM obtain worse results than the proposed PVSS-App-Plate, because these two methods neglect the license plates, which uniquely identify vehicles. Moreover, by incorporating the spatiotemporal context, PVSS outperforms the other multi-modal search methods and achieves the best results.

7 Conclusions

This paper proposes PVSS, a progressive vehicle search system which can crawl and index vehicles captured by large-scale surveillance networks and provide vehicle search services for users. For the vehicle crawler, the vehicle detection and tracking algorithms are adopted to crop vehicle images from surveillance videos. Then, vehicle images are fed into the vehicle indexer to extract multi-grained visual features, which are utilized to build a multi-level index for vehicle search. In the online search stage, the target vehicle is searched in a from-coarse-to-fine manner with the multi-level index and in a from-near-to-distant way based on the spatiotemporal context of the surveillance network. Extensive evaluations on the VeRi dataset show the excellent performance of the PVSS.


This work was supported in part by the Funds for International Cooperation and Exchange of the National Natural Science Foundation of China (No. 61720106007), the NSFC-Guangdong Joint Fund (No. U1501254), the National Key Research and Development Plan (No. 2016YFC0801005), the National Natural Science Foundation of China (No. 61602049), and the 111 Project (B18008).


  • [1] Huadong Ma and Wu Liu. Progressive search paradigm for internet of things. IEEE MultiMedia, 2017.
  • [2] Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual web search engine. Computer networks and ISDN systems, 30(1):107–117, 1998.
  • [3] Chuang Gan, Ting Yao, Kuiyuan Yang, Yi Yang, and Tao Mei. You lead, we exceed: Labor-free video concept learning by jointly exploiting web videos and images. In CVPR, pages 923–932, 2016.
  • [4] Junping Zhang, Fei-Yue Wang, Kunfeng Wang, Wei-Hua Lin, Xin Xu, and Cheng Chen. Data-driven intelligent transportation systems: A survey. IEEE Transactions on Intelligent Transportation Systems, 12(4):1624–1639, 2011.
  • [5] Maria Valera and Sergio A Velastin. Intelligent distributed surveillance systems: a review. IEE Proceedings - Vision, Image and Signal Processing, 152(2):192–204, 2005.
  • [6] Rogerio Feris, Behjat Siddiquie, Yun Zhai, James Petterson, Lisa Brown, and Sharath Pankanti. Attribute-based vehicle search in crowded surveillance videos. In ACM International Conference on Multimedia Retrieval, page 18, 2011.
  • [7] Rogerio Schmidt Feris, Behjat Siddiquie, James Petterson, Yun Zhai, Ankur Datta, Lisa M Brown, and Sharath Pankanti. Large-scale vehicle detection, indexing, and search in urban surveillance videos. IEEE Transactions on Multimedia, 14(1):28–42, 2012.
  • [8] Chuang Gan, Tianbao Yang, and Boqing Gong. Learning attributes equals multi-source domain generalization. In CVPR, pages 87–97, 2016.
  • [9] Xinchen Liu, Wu Liu, Huadong Ma, and Huiyuan Fu. Large-scale vehicle re-identification in urban surveillance videos. In IEEE International Conference on Multimedia and Expo, pages 1–6, 2016.
  • [10] Hongye Liu, Yonghong Tian, Yaowei Yang, Lu Pang, and Tiejun Huang. Deep relative distance learning: Tell the difference between similar vehicles. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2167–2175, 2016.
  • [11] Xinchen Liu, Wu Liu, Tao Mei, and Huadong Ma. A deep learning-based approach to progressive vehicle re-identification for urban surveillance. In European Conference on Computer Vision, pages 869–884, 2016.
  • [12] Yantao Shen, Tong Xiao, Hongsheng Li, Shuai Yi, and Xiaogang Wang. Learning deep neural networks for vehicle re-id with visual-spatio-temporal path proposals. In IEEE International Conference on Computer Vision, 2017.
  • [13] Zhongdao Wang, Luming Tang, Xihui Liu, Zhuliang Yao, Shuai Yi, Jing Shao, Junjie Yan, Shengjin Wang, Hongsheng Li, and Xiaogang Wang. Orientation invariant feature embedding and spatial temporal regularization for vehicle re-identification. In IEEE International Conference on Computer Vision, 2017.
  • [14] Chuang Gan, Yandong Li, Haoxiang Li, Chen Sun, and Boqing Gong. Vqs: Linking segmentations to questions and answers for supervised attention in vqa and question-focused semantic segmentation.
  • [15] Xinchen Liu, Wu Liu, Huadong Ma, and Shuangqun Li. A progressive vehicle search system for video surveillance networks. In IEEE International Conference on Multimedia Big Data, 2018.
  • [16] Josef Sivic and Andrew Zisserman. Video google: A text retrieval approach to object matching in videos. In IEEE International Conference on Computer Vision, pages 1470–1477, 2003.
  • [17] Michael S Lew, Nicu Sebe, Chabane Djeraba, and Ramesh Jain. Content-based multimedia information retrieval: State of the art and challenges. ACM Transactions on Multimedia Computing, Communications, and Applications, 2(1):1–19, 2006.
  • [18] Weiming Hu, Nianhua Xie, Li Li, Xianglin Zeng, and Stephen Maybank. A survey on visual content-based video indexing and retrieval. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 41(6):797–819, 2011.
  • [19] Tao Mei, Yong Rui, Shipeng Li, and Qi Tian. Multimedia search reranking: A literature survey. ACM Computing Surveys, 46(3):38:1–38:38, 2014.
  • [20] Wu Liu, Tao Mei, and Yongdong Zhang. Instant mobile video search with layered audio-video indexing and progressive transmission. IEEE Transactions on Multimedia, 16(8):2242–2255, 2014.
  • [21] Liang Zheng, Yi Yang, and Qi Tian. SIFT meets CNN: A decade survey of instance retrieval. IEEE Trans. Pattern Anal. Mach. Intell., 40(5):1224–1244, 2018.
  • [22] Chuang Gan, Chen Sun, Lixin Duan, and Boqing Gong. Webly-supervised video recognition by mutually voting for relevant web images and web video frames. In ECCV, pages 849–866, 2016.
  • [23] Yen-Liang Lin, Ming-Kuang Tsai, Winston H Hsu, and Chih-Wei Chen. Investigating 3-d model and part information for improving content-based vehicle retrieval. IEEE Transactions on Circuits and Systems for Video Technology, 23(3):401–413, 2013.
  • [24] Ali Farhadi, Ian Endres, Derek Hoiem, and David Forsyth. Describing objects by their attributes. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1778–1785, 2009.
  • [25] Liang Zheng, Shengjin Wang, Ziqiong Liu, and Qi Tian. Fast image retrieval: Query pruning and early termination. IEEE Trans. Multimedia, 17(5):648–659, 2015.
  • [26] Michela Farenzena, Loris Bazzani, Alessandro Perina, Vittorio Murino, and Marco Cristani. Person re-identification by symmetry-driven accumulation of local features. In IEEE Computer Vision and Pattern Recognition, pages 2360–2367, 2010.
  • [27] Shaogang Gong, Marco Cristani, Shuicheng Yan, and Chen Change Loy. Person re-identification. Springer, 2014.
  • [28] Liang Zheng, Yi Yang, and Alexander G. Hauptmann. Person re-identification: Past, present and future. CoRR, abs/1610.02984, 2016.
  • [29] Liang Zheng, Liyue Shen, Lu Tian, Shengjin Wang, Jingdong Wang, and Qi Tian. Scalable person re-identification: A benchmark. In IEEE International Conference on Computer Vision, pages 1116–1124, 2015.
  • [30] Rogerio Feris, Russel Bobbitt, Lisa Brown, and Sharath Pankanti. Attribute-based people search: Lessons learnt from a practical surveillance system. In ACM International Conference on Multimedia Retrieval, page 153, 2014.
  • [31] Jiejun Xu, Vignesh Jagadeesh, Zefeng Ni, Santhoshkumar Sunderrajan, and BS Manjunath. Graph-based topic-focused retrieval in distributed camera network. IEEE Transactions on Multimedia, 15(8):2046–2057, 2013.
  • [32] Linjie Yang, Ping Luo, Chen Change Loy, and Xiaoou Tang. A large-scale car dataset for fine-grained categorization and verification. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3973–3981, 2015.
  • [33] Xinchen Liu, Huadong Ma, Huiyuan Fu, and Mo Zhou. Vehicle retrieval and trajectory inference in urban traffic surveillance scene. In ACM/IEEE International Conference on Distributed Smart Cameras, page 26, 2014.
  • [34] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.
  • [35] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
  • [36] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
  • [37] Sumit Chopra, Raia Hadsell, and Yann LeCun. Learning a similarity metric discriminatively, with application to face verification. In IEEE Conference on Computer Vision and Pattern Recognition, pages 539–546, 2005.
  • [38] Marius Muja and David G Lowe. Scalable nearest neighbor algorithms for high dimensional data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(11):2227–2240, 2014.
  • [39] Xinchen Liu, Wu Liu, Tao Mei, and Huadong Ma. Provid: Progressive and multi-modal vehicle re-identification for large-scale urban surveillance. IEEE Transactions on Multimedia, 20(3):645–658, 2018.
  • [40] Yichao Yan, Bingbing Ni, Zhichao Song, Chao Ma, Yan Yan, and Xiaokang Yang. Person re-identification via recurrent feature aggregation. In European Conference on Computer Vision, pages 701–716, 2016.
  • [41] Jinjie You, Ancong Wu, Xiang Li, and Wei-Shi Zheng. Top-push video-based person re-identification. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1345–1353, 2016.