A Causal And-Or Graph Model for Visibility Fluent Reasoning in Human-Object Interactions (* equal contributors)
Tracking humans who are interacting with other subjects or the environment remains unsolved in visual tracking, because the visibility of the humans of interest in videos is unknown and might vary over time. In particular, it is still difficult for state-of-the-art human trackers to recover complete human trajectories in crowded scenes with frequent human interactions. In this work, we consider the visibility status of a subject as a fluent variable, whose changes are mostly attributed to the subject’s interactions with the surroundings, e.g., crossing behind another object, entering a building, or getting into a vehicle. We introduce a Causal And-Or Graph (C-AOG) to represent the causal-effect relations between an object’s visibility fluents and its activities, and develop a probabilistic graph model to jointly reason about visibility fluent changes (e.g., from visible to invisible) and track humans in videos. We formulate this joint task as an iterative search for a feasible causal graph structure, which admits fast search algorithms, e.g., dynamic programming. We apply the proposed method to challenging video sequences to evaluate its capability to estimate visibility fluent changes of subjects and to track subjects of interest over time. Comparative results demonstrate that our method outperforms the alternative trackers and can recover complete trajectories of humans in complicated scenarios with frequent human interactions.
Lei Qin, Yuanlu Xu, Xiaobai Liu, Song-Chun Zhu Dept. Computer Science and Statistics, University of California, Los Angeles (UCLA) Dept. Computer Science, San Diego State University (SDSU) Inst. Computing Technology, Chinese Academy of Sciences email@example.com, firstname.lastname@example.org, email@example.com, firstname.lastname@example.org
Copyright © 2018, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
Tracking objects of interest in videos is a fundamental computer vision task that has great potential in many video-based applications, e.g., security surveillance, disaster response, and border patrol. In these applications, a critical problem is how to obtain the complete trajectory of the object of interest while observing it moving in the scene through camera views. This is a challenging problem since an object of interest might undergo frequent interactions with the surroundings, e.g., entering a vehicle or a building, or with other objects, e.g., passing behind another subject. With these interactions, the visibility status of a subject varies over time, e.g., it changes from invisible to visible and vice versa. In the literature, most state-of-the-art trackers employ appearance cues or motion cues to localize subjects in video sequences and are likely to fail to track subjects whose visibility status changes. To deal with these challenges, we propose to explicitly reason about subjects’ visibility status over time while tracking the subjects of interest in surveillance videos. The developed techniques, with slight modifications, can be generalized to other scenarios, e.g., hand-held cameras and driverless vehicles.
The key idea of our method is to introduce a fluent variable for each subject of interest to explicitly indicate his/her visibility status in videos. The term fluent was first used by Newton to denote the time-varying status of an object. It is also used to represent varying object status in commonsense reasoning (?). In this paper, the visibility status of objects is described as fluents varying over time. As illustrated in Fig. 1, two people are walking through the parking lot, while two others are entering a sedan. The visibility status of the latter two changes first from “visible” to “occluded”, and then to “contained”. This group of video frames directly demonstrates how objects’ visibility fluents change over time along with their interactions with the surroundings.
We introduce a graphical model to represent the causal relationships between object’s activities (actions/sub-events) and object’s visibility fluent changes.
Fig. 3 visualizes the action-fluent relationship using a Causal And-Or Graph (C-AOG). The occlusion status of an object might be caused by multiple actions, and we need to reason about the actual causality from videos. These actions are alternative choices that lead to the same occlusion status, and form the Or-nodes. Each leaf node indicates an action or sub-event that can be described by And-nodes. Taking the videos shown in Fig. 1 for instance, the status “occluded” can be caused by the following actions: i) walking behind a vehicle; ii) walking behind a person; or iii) an inertial action that keeps the fluent unchanged.
The basic hypothesis of this model is that, for a particular scenario (e.g., parking-lot), there are only a limited number of actions that can cause the fluent to change. Given a video sequence, we need to create the optimal C-AOG and for each Or-node select the best choice in order to get the optimal causal parse graph, shown as red lines in Fig. 3.
We develop a probabilistic graph model to reason about objects’ visibility fluent changes using the C-AOG representation. Our formulation also integrates object tracking, enabling a joint solution of tracking and fluent change reasoning, which are mutually beneficial. In particular, for each subject of interest, our method uses two variables to represent i) the subject’s positions in videos; and ii) the visibility status as well as the best causal parse graph. We utilize a Markov chain prior model to describe the transitions of these variables, i.e., the current state of a subject depends only on the previous state. We then reformulate the problem as an Integer Linear Programming model, and utilize dynamic programming to search for the optimal states over time. In evaluations, we apply the proposed method to a set of challenging sequences that include frequent human-vehicle or human-human interactions. Results show that our method can predict the correct visibility statuses and recover the complete trajectories. In contrast, most of the alternative trackers can only recover part of the trajectories due to occlusion or containment.
Contributions. There are three major contributions of the proposed framework: i) a Causal And-Or Graph (C-AOG) model to represent object visibility fluents varying over time; ii) a joint probabilistic formulation for object tracking and fluent reasoning; and iii) a new occlusion reasoning dataset to cover objects with diverse fluent changes.
2 Related Work
The proposed research is closely related to the following three research streams in computer vision and AI.
Multiple object tracking has been extensively studied in the last decades. In the past literature, tracking-by-detection has become the mainstream framework (?; ?). Specifically, a general detector (?; ?) is first applied to obtain detection proposals, then data association techniques (?; ?; ?) are employed to link detection proposals over time in order to obtain object trajectories. Our approach follows this pipeline as well but focuses on reasoning about object visibility status.
Occlusion handling is perhaps the most fundamental problem in visual tracking, and has been extensively studied in the past literature. These methods can be roughly divided into three categories: i) using depth information (?; ?), ii) modeling partial occlusion relations (?), iii) modeling object appearing and disappearing globally (?; ?; ?). These methods all make strong assumptions on the appearance or motion cues and mainly deal with short-term occlusions. In contrast, this paper presents a principled way to explicitly reason about object visibility status during both short-term interactions, e.g., passing behind another object, and long-term interactions, e.g., entering a vehicle and moving together.
Causal-effect reasoning is a popular topic in AI but has not received much attention in the field of computer vision. It studies, for instance, the differences between co-occurrence and causality and aims to learn causal knowledge automatically from low-level observations, e.g., images or videos. There are two popular causality models: Bayesian networks (?; ?) and grammar models (?; ?). Notably, Fire and Zhu (?) introduced a causal grammar to infer the causal-effect relationships between an object’s status, e.g., door open/closed, and an agent’s actions, e.g., pushing the door. They studied this problem using manually designed rules and video sequences in lab settings. In this work, we extend the causal grammar models to infer objects’ visibility fluents and ground the task on challenging videos in surveillance systems.
In this paper, we define three states for visibility fluent reasoning: visible, (partially/fully) occluded, and contained. Most multiple object tracking methods are based on the tracking-by-detection framework, which obtains good performance in visible and partially occluded situations. However, when full occlusions take place, these trackers usually regard the disappearing-and-reappearing objects as new objects. Although objects in the fully occluded and contained states could be invisible, there is still evidence from which to infer the object’s location and fill in the complete trajectory. We can distinguish an object being fully occluded from an object being contained using three empirical observations.
Firstly, motion independence. In the fully occluded state, such as a person staying behind a pillar, the motion of the person is independent of the pillar, whereas in the contained state, such as a person in a vehicle or a bag in a trunk, the position and motion of the person or bag are the same as those of the vehicle. It is therefore important to infer the visibility fluent of an object if we want to track objects accurately in a complex environment.
Secondly, coupling of actions and object fluent changes. For example, as illustrated in Fig. 2, if a person gets into a vehicle, the related sequential atomic actions are: approaching the vehicle, opening the vehicle door, getting into the vehicle, and closing the vehicle door; the related object fluent changes are vehicle door closed → open → closed. The fluent change is a consequence of agent actions. If the fluent-changing actions do not happen, the object should maintain its current fluent. For example, a person who is contained in a vehicle will remain contained unless he/she opens the vehicle door and gets out of the vehicle.
Thirdly, visibility in alternative camera views. In the fully occluded state, such as a person occluded by a pillar, the person cannot be observed from the current viewpoint but could be seen from other viewpoints; in the contained state, such as a person in a vehicle, the person cannot be seen from any viewpoint.
In this work, we mainly study the interactions of humans; the developed methods can be extended to other objects (e.g., animals) as well.
3.1 Causal And-Or Graph
In this paper, we propose a Causal And-Or Graph (C-AOG) to represent the action-fluent relationship, as illustrated in Fig. 3(a). A C-AOG has two types of nodes: i) Or-nodes that represent the variations or choices, and ii) And-nodes that capture the decomposition of the top-level entity. The arrow represents the causal relation between action and fluent transition. For example, we can use a C-AOG to expressively model a series of action-fluent relations.
The C-AOG is capable of representing multiple alternatives for causes of occlusion and potential transitions. There are four levels in our C-AOG: visibility fluents, possible states, state transitions, and agent actions. Or-nodes represent alternative causes at the visibility fluent and state levels; that is, one fluent can have multiple states and one state can have multiple transitions. An event can be decomposed into possible atomic actions and represented by And-nodes, e.g., opening the vehicle door, opening the vehicle trunk, loading baggage.
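As an illustration of the And/Or semantics described above, the following is a minimal Python sketch of selecting a parse graph: an Or-node keeps its lowest-energy alternative, and an And-node sums the energies of all its components. The `Node` class, the node names, and the energy values are hypothetical and chosen only for illustration; they are not learned parameters or the actual graph of Fig. 3.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """A node in a toy And-Or graph (hypothetical structure for illustration)."""
    name: str
    kind: str = "leaf"            # "and", "or", or "leaf"
    energy: float = 0.0           # only used by leaves (lower is better)
    children: List["Node"] = field(default_factory=list)


def best_parse(node: Node) -> Tuple[float, List[str]]:
    """Return (energy, selected leaf names) of the lowest-energy parse graph."""
    if node.kind == "leaf":
        return node.energy, [node.name]
    results = [best_parse(child) for child in node.children]
    if node.kind == "or":
        # An Or-node selects exactly one alternative cause.
        return min(results, key=lambda r: r[0])
    # An And-node requires all of its components.
    total = sum(energy for energy, _ in results)
    leaves = [leaf for _, names in results for leaf in names]
    return total, leaves


# "occluded" may be explained by one of several alternative actions (Or-node).
occluded = Node("occluded", "or", children=[
    Node("walk behind vehicle", energy=1.2),
    Node("walk behind person", energy=0.4),
    Node("inertial (no change)", energy=2.0),
])
# Entering a vehicle decomposes into atomic actions (And-node).
enter_vehicle = Node("enter vehicle", "and", children=[
    Node("open vehicle door", energy=0.3),
    Node("get into vehicle", energy=0.5),
    Node("close vehicle door", energy=0.2),
])

occ_energy, occ_choice = best_parse(occluded)
enter_energy, enter_leaves = best_parse(enter_vehicle)
```

With these toy energies, the Or-node explains the occlusion by its cheapest alternative, while the And-node accumulates the cost of every atomic action in the event.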
Given a video sequence with length , we represent the scene as
where denotes all the objects at time and is the size of , i.e., the number of objects at time . is unknown and will be inferred from observations. Each object is represented with its location (i.e., bounding boxes in the image) and appearance features . To study the visibility fluent of a subject, we further incorporate a state variable and an action label , that is,
Thus, the state of a subject is defined as
We define a series of atomic actions that could possibly change the visibility status, e.g., walking, opening vehicle door. As illustrated in Fig. 3(b), a small set of actions could cover most common interactions among people and vehicles.
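The representation above can be pictured as a per-frame record for each object. The sketch below is one possible encoding, assuming hypothetical field names (`location`, `appearance`, `visibility`, `action`) and a bounding box given as (x, y, width, height); it only mirrors the four components named in the text.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class Visibility(Enum):
    """The three visibility fluent values considered in this paper."""
    VISIBLE = "visible"
    OCCLUDED = "occluded"
    CONTAINED = "contained"


@dataclass(frozen=True)
class ObjectState:
    """Per-frame description of one tracked object (field names are hypothetical)."""
    location: Tuple[float, float, float, float]  # bounding box (x, y, w, h)
    appearance: Tuple[float, ...]                # appearance feature vector
    visibility: Visibility                       # visibility fluent value
    action: str                                  # atomic action label, e.g. "walking"


# A scene is a list over time of the objects present in each frame; the
# number of objects per frame is not fixed and has to be inferred.
scene = [
    [ObjectState((10, 20, 30, 60), (0.1, 0.9), Visibility.VISIBLE, "walking")],
    [ObjectState((14, 20, 30, 60), (0.1, 0.9), Visibility.OCCLUDED, "walking")],
]
num_objects_per_frame = [len(objects) for objects in scene]
```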
Our task is to jointly find subject locations in video frames and estimate their visibility fluents from the video sequence . Formally, we have,
where can be determined by the optimal causal parse graph at time t.
4 Problem Formulation
According to Bayes’ rule, we can solve our joint object tracking and fluent reasoning problem by maximum a posteriori (MAP) estimation,
The prior term measures the temporal consistency between successive parse graphs. Assuming is a Markov Chain structure, we can decompose as
The first term measures the location displacement. It calculates the transition distance between two successive frames and is defined as:
where is the Euclidean distance between two locations, is the speed threshold and is an indicator function. The location displacement term measures the motion consistency of object in successive frames.
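A displacement term of this kind can be sketched as a gated distance. The exact functional form is not reproduced here, so the sketch assumes one plausible reading: the cost equals the Euclidean distance between successive locations, and a move faster than the speed threshold is forbidden (infinite cost). The function name and the gating rule are assumptions.

```python
import math


def displacement_energy(loc_prev, loc_curr, speed_threshold):
    """Hypothetical gated displacement cost between two successive frames.

    Returns the Euclidean distance when the move is within the speed
    threshold, and infinity otherwise (the indicator rules the move out).
    """
    distance = math.dist(loc_prev, loc_curr)
    if distance > speed_threshold:
        return math.inf
    return distance


slow_move = displacement_energy((0.0, 0.0), (3.0, 4.0), speed_threshold=10.0)
fast_move = displacement_energy((0.0, 0.0), (30.0, 40.0), speed_threshold=10.0)
```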
The second term measures the state transition energy and is defined as:
where is the action-state transition probability, which could be learned from the training data.
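To illustrate how such a transition probability could be learned from training data, the sketch below counts annotated (previous state, action, next state) triples and turns the resulting probability into an energy via the negative logarithm. The add-one smoothing, the function names, and the toy samples are assumptions made for this example only.

```python
import math
from collections import Counter

STATES = ("visible", "occluded", "contained")


def learn_transitions(samples, smoothing=1.0):
    """Estimate P(next_state | prev_state, action) from annotated triples."""
    counts = Counter(samples)
    totals = Counter((prev, action) for prev, action, _ in samples)

    def prob(prev, action, nxt):
        # Additive smoothing keeps unseen transitions at a small non-zero probability.
        numerator = counts[(prev, action, nxt)] + smoothing
        denominator = totals[(prev, action)] + smoothing * len(STATES)
        return numerator / denominator

    return prob


def transition_energy(prob, prev, action, nxt):
    """Energy of a state transition: negative log of its probability."""
    return -math.log(prob(prev, action, nxt))


# Toy training triples (hypothetical, not from the paper's training clips).
samples = [
    ("visible", "entering vehicle", "contained"),
    ("visible", "entering vehicle", "contained"),
    ("visible", "walking", "visible"),
    ("visible", "walking", "occluded"),
]
prob = learn_transitions(samples)
p_contained = prob("visible", "entering vehicle", "contained")
```

Under this reading, an action that was frequently followed by a given state change in training yields a low transition energy for that change, and the inertial case is handled the same way through its own counts.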
The likelihood term measures how well each parse graph explains the data, which can be decomposed as
where measures the likelihood between data and object fluents and measures the likelihood between data and object actions. Given each object , the energy function is defined as:
where indicates the object detection score, indicates the container (i.e., vehicles) detection score and is the sigmoid function. When an object is in either visible or contained state, appearance information can describe the probability of the existence of itself or the object containing it (i.e., container) at this location. When an object is occluded, there is no visual evidence to determine its state. Therefore, we utilize temporal information to generate candidate locations. We employ the SSP algorithm (?) to generate trajectory fragments (i.e., tracklets). The candidate locations are identified as misses in complete object trajectories. The energy is thus defined as the cost of generating a virtual trajectory at this location. We compute this energy by computing the visual discrepancy between a neighboring tracklet before this moment and a neighboring tracklet after this moment. The appearance descriptor of a tracklet is computed as the average pooling of image descriptor over time. If the distance is below a threshold , a virtual path is generated to connect these two tracklets using B-spline fitting.
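The tracklet comparison described above can be sketched as follows: per-frame descriptors are average-pooled into one descriptor per tracklet, and two tracklets become candidates for a virtual path when the distance between their pooled descriptors is below a threshold. The Euclidean distance, the function names, and the threshold value are assumptions; the B-spline fitting of the connecting path is omitted.

```python
import math


def average_pool(descriptors):
    """Average per-frame descriptors into a single tracklet descriptor."""
    count = len(descriptors)
    return [sum(values) / count for values in zip(*descriptors)]


def should_link(tracklet_before, tracklet_after, max_distance):
    """Decide whether a virtual path should bridge two tracklets.

    Returns (link_decision, descriptor_distance).
    """
    pooled_before = average_pool(tracklet_before)
    pooled_after = average_pool(tracklet_after)
    distance = math.dist(pooled_before, pooled_after)
    return distance < max_distance, distance


# Toy 2-D descriptors (hypothetical); real descriptors are much longer.
before = [[0.2, 0.8], [0.4, 0.6]]
same_person_after = [[0.3, 0.7]]
other_person_after = [[0.9, 0.1]]

link_same, _ = should_link(before, same_person_after, max_distance=0.5)
link_other, _ = should_link(before, other_person_after, max_distance=0.5)
```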
The term is defined over the object actions observed from data. In this work, we study the fluents of human and vehicles, that is,
where is the sigmoid function. The definitions of the two data-likelihood terms and are introduced in the rest of this section.
A human is represented by his/her skeleton, which consists of multiple joints estimated by a sequential prediction technique (?). The feature of each joint is defined as the relative distances of this joint to four saddle points (the two shoulders, the center of the body, and the midpoint between the two hipbones). The relative distances are normalized by dividing by the length of the head to eliminate the influence of scale. A feature vector concatenating the features of all joints is extracted, which is assumed to follow a Gaussian distribution:
where and are the mean and the covariance of the action respectively, which are obtained from the training data.
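The skeleton feature and its Gaussian score can be sketched as below. For brevity the sketch assumes a diagonal covariance, which is a simplification of the general covariance mentioned in the text; the function names and the toy coordinates are hypothetical.

```python
import math


def joint_features(joints, anchors, head_length):
    """Distances from every joint to every anchor point, divided by head length."""
    features = []
    for joint in joints:
        for anchor in anchors:
            features.append(math.dist(joint, anchor) / head_length)
    return features


def gaussian_log_likelihood(x, mean, variance):
    """Log-density of a Gaussian with diagonal covariance (simplifying assumption)."""
    log_likelihood = 0.0
    for xi, mi, vi in zip(x, mean, variance):
        log_likelihood += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return log_likelihood


# The same pose observed at two image scales yields the same feature vector.
near = joint_features([(0.0, 0.0), (3.0, 4.0)], [(0.0, 0.0)], head_length=1.0)
far = joint_features([(0.0, 0.0), (6.0, 8.0)], [(0.0, 0.0)], head_length=2.0)

# Score a feature vector against a (hypothetical) per-action mean and variance.
score_at_mean = gaussian_log_likelihood(near, mean=[0.0, 5.0], variance=[1.0, 1.0])
score_off_mean = gaussian_log_likelihood(near, mean=[1.0, 3.0], variance=[1.0, 1.0])
```

Dividing by the head length is what makes the two feature vectors agree, which is the scale-invariance property the normalization is meant to provide.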
A vehicle is described with its viewpoint, semantic vehicle parts, and vehicle part fluents. The vehicle fluent is represented by a Hierarchical And-Or Graph, as illustrated in Fig. 4. The feature vector of vehicle fluent is obtained by computing fluent scores on each vehicle part and concatenating them together. We compute the average pooling feature for each action over the training data as the vehicle fluent template. Given vehicle fluent computed on image , the distance is defined as
We cast the intractable optimization of Eq. (5) as an Integer Linear Formulation (ILF) in order to derive a scalable and efficient inference algorithm. We denote by the locations of vehicles, the edges between all possible pairs of nodes, whose time is consecutive and locations are close. The whole transition graph is shown in Fig. 5. The energy function in Eq. (5) can then be rewritten as:
where is the number of objects moving from node to node , is the corresponding cost.
Since the subject of interest can only enter a nearby container (e.g., vehicle), to discover the optimal causal parse graph, we need to jointly track the container and the subject of interest. Similar to Equation (14), the energy function of the container is as follows:
where is the container detection score at location . We then add the containment constraints as:
This is the objective function we want to optimize.
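To give a feel for the dynamic programming search, the following sketch finds the minimum-energy visibility state sequence for a single subject with a Viterbi-style recursion. It is a single-chain simplification that ignores candidate locations and the container constraints, so it is not the full formulation above; the cost tables are hypothetical toy values.

```python
def viterbi(unary, pairwise):
    """Minimum-energy state sequence for one chain.

    unary[t][s]    : data cost of state s at frame t
    pairwise[p][s] : transition cost from state p to state s
    Returns (total_energy, state_path).
    """
    num_states = len(unary[0])
    cost = list(unary[0])
    back_pointers = []
    for t in range(1, len(unary)):
        new_cost, pointers = [], []
        for s in range(num_states):
            # Best predecessor for state s at frame t.
            best_prev = min(range(num_states), key=lambda p: cost[p] + pairwise[p][s])
            new_cost.append(cost[best_prev] + pairwise[best_prev][s] + unary[t][s])
            pointers.append(best_prev)
        cost = new_cost
        back_pointers.append(pointers)
    # Backtrack from the cheapest final state.
    state = min(range(num_states), key=lambda s: cost[s])
    total = cost[state]
    path = [state]
    for pointers in reversed(back_pointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return total, path


# States: 0 = visible, 1 = occluded, 2 = contained (toy costs, hypothetical).
pairwise = [
    [0.1, 0.5, 5.0],  # from visible: jumping straight to contained is expensive
    [0.5, 0.1, 0.5],  # from occluded
    [5.0, 0.5, 0.1],  # from contained
]
unary = [
    [0.1, 2.0, 2.0],  # frame 0: evidence favors visible
    [2.0, 0.3, 2.0],  # frame 1: evidence favors occluded
    [2.0, 2.0, 0.2],  # frame 2: evidence favors contained
]
total_energy, state_path = viterbi(unary, pairwise)
```

The recursion is linear in the number of frames and quadratic in the number of states, which is what makes a Markov chain prior attractive for this kind of search.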
We apply the proposed method over two multiple object-tracking datasets and evaluate the improvement in visual tracking brought by the outcomes of visibility status reasoning.
6.1 Implementation Details
For person and suitcase detection, we utilize the Faster R-CNN models (?) trained on the MS COCO dataset. The network used is VGG-16, with a score threshold of 0.4 and an NMS threshold of 0.3. The tracklet similarity threshold is set to 0.8. The containment distance threshold is set to the width of the container. For appearance descriptors , we employ the densely sampled ColorNames descriptor (?), which applies the square root operator (?) and Bag-of-Words encoding to the original ColorNames descriptors. For human skeleton estimation, we use the public implementation of (?). For vehicle detection and semantic part status estimation, we use the implementation provided by (?) with the default parameters mentioned in their paper.
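The containment distance threshold can be illustrated with a small check. The sketch assumes that the distance is measured between box centers and that boxes are given as (x, y, width, height); both conventions, as well as the function names, are assumptions for illustration.

```python
import math


def box_center(box):
    """Center of an (x, y, width, height) bounding box."""
    x, y, width, height = box
    return (x + width / 2.0, y + height / 2.0)


def can_be_contained(last_seen_box, container_box):
    """True if the subject's last visible position lies within one container width."""
    distance = math.dist(box_center(last_seen_box), box_center(container_box))
    container_width = container_box[2]
    return distance <= container_width


car = (120.0, 100.0, 200.0, 80.0)
nearby_person = (100.0, 100.0, 20.0, 60.0)
distant_person = (600.0, 100.0, 20.0, 60.0)

near_ok = can_be_contained(nearby_person, car)
far_ok = can_be_contained(distant_person, car)
```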
We adopt the widely used CLEAR metrics (?) to measure the performance of tracking methods. They include four metrics, i.e., Multiple Object Detection Accuracy (MODA), Detection Precision (MODP), Multiple Object Tracking Accuracy (MOTA) and Tracking Precision (MOTP), which take into account three kinds of tracking errors: false positives, false negatives and identity switches. We also report the number of false positives (FP), false negatives (FN), identity switches (IDS) and fragments (Frag). Higher values are better for the accuracy and precision metrics, while lower values are better for FP, FN, IDS and Frag. If the Intersection-over-Union (IoU) ratio of a tracking result to the ground truth is above 0.5, we accept the tracking result as a correct hit.
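The hit criterion and the tracking accuracy can be sketched as follows. The `mota` function uses the commonly quoted unweighted form, 1 − (FN + FP + IDS) / GT; the cited evaluation protocol may weight these terms differently, so the sketch should be read as an approximation. Boxes are assumed to be given as (x1, y1, x2, y2) corners.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two (x1, y1, x2, y2) boxes."""
    inter_x1, inter_y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    inter_x2, inter_y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    intersection = max(0.0, inter_x2 - inter_x1) * max(0.0, inter_y2 - inter_y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def is_hit(tracked_box, ground_truth_box, threshold=0.5):
    """A tracking result counts as correct when its IoU exceeds the threshold."""
    return iou(tracked_box, ground_truth_box) > threshold


def mota(num_false_negatives, num_false_positives, num_id_switches, num_ground_truth):
    """Unweighted tracking accuracy (assumed form, see the note above)."""
    errors = num_false_negatives + num_false_positives + num_id_switches
    return 1.0 - errors / num_ground_truth


good_overlap = is_hit((0, 0, 10, 10), (2, 0, 12, 10))   # IoU = 80 / 120
poor_overlap = is_hit((0, 0, 10, 10), (5, 0, 15, 10))   # IoU = 50 / 150
accuracy = mota(10, 5, 1, 100)
```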
People-Car dataset (?) (available at cvlab.epfl.ch/research/surv/interacting-objects). This dataset consists of sequences on a parking lot with two synchronized bird’s-eye-view cameras, with length frames. In this dataset, there are many instances of people getting in and out of cars. This dataset is challenging because of the frequent interactions, lighting variation, and low object resolution.
Tracking Interacting Objects (TIO) dataset. In currently popular multiple object tracking datasets (e.g., PETS09 (?), KITTI (?)), most tracked objects are pedestrians and there are no evident visibility fluent changes caused by interactions. We therefore collect two new scenarios with typical human-object interactions among persons, suitcases, and vehicles at several locations.
Plaza. We capture video sequences in a plaza that show people walking around and getting in and out of vehicles.
ParkingLot. We capture video sequences in a parking lot that show vehicles entering and exiting the parking lot, people getting in and out of vehicles, and people interacting with trunks and suitcases.
All the sequences are captured by a GoPro camera, with frame rate 30fps and resolution . The total number of frames of TIO dataset is more than 30K. There exist severe occlusions and large scale changes, making this dataset very challenging for traditional tracking methods.
Besides the above testing data, we collect another set of video clips for training. To avoid over-fitting, we use camera positions, people, and vehicles that differ from those in the testing settings. The training data consists of video clips covering 9 events: walking, opening vehicle door, entering vehicle, exiting vehicle, closing vehicle door, opening vehicle trunk, loading baggage, unloading baggage, closing vehicle trunk. Each action category contains 42 video clips on average.
Both the datasets and the short clips are annotated with bounding boxes for people, suitcases, and vehicles, and with the visibility fluents of people and suitcases. The types of status are “visible”, “occluded”, and “contained”. We utilize VATIC (?) to annotate the videos.
6.3 Results and Comparisons
For the People-Car dataset, we compare our proposed method with state-of-the-art methods: the successive shortest path algorithm (SSP) (?), the K-Shortest Paths algorithm (KSP) (?), Probabilistic Occupancy Map (POM) (?), Linear Programming (LP2D) (?), and Tracklet-Based Intertwined Flows (TIF-MIP) (?). The quantitative results are reported in Table 1. From the results, the proposed method obtains better performance than the baseline methods.
For the TIO dataset, we compare the proposed method with state-of-the-art methods: the successive shortest path algorithm (SSP) (?), multiple hypothesis tracking with a distinctive appearance model (MHT_D) (?), Markov Decision Processes with Reinforcement Learning (MDP) (?), Discrete-Continuous Energy Minimization (DCEM) (?), Discrete-Continuous Optimization (DCO) (?) and Joint Probabilistic Data Association (JPDA_m) (?). We use the public implementations of these methods. We set up a baseline, Our-1, to analyze the effectiveness of different components. Our-1 only utilizes the human data-likelihood term, while Our-full utilizes both the human and vehicle data-likelihood terms. Note that we also utilize the estimated human joint key-points to refine the detected bounding boxes of people.
We report quantitative results and comparisons in Table 2 for the TIO dataset. From the results, we can observe that our method obtains superior performance to the other methods on most metrics. This validates that the proposed method can not only track visible objects correctly, but also reason about the locations of occluded or contained objects. The alternative methods do not work well mainly because they lack the ability to track objects under long-term occlusion or containment in other objects. Based on the comparison of Our-1 and Our-full, we can also conclude that each type of fluent plays its role in improving the final tracking results. Some qualitative results are visualized in Fig. 6.
Empirical Studies. We further report fluent estimation results on the TIO-Plaza and TIO-ParkingLot sequences in Fig. 8. From the results, we can see that our method can successfully reason about the visibility status of subjects. Note that the precision of containment estimation is not high, since some people get in or out of the vehicle on the side facing away from the camera, as shown in Fig. 7. In such situations, a multi-view setting might be a better way to reduce the ambiguities.
In this paper, we propose a Causal And-Or Graph model to represent the causal-effect relations between object visibility fluents and various human interactions. By jointly modeling short-term and long-term occlusions, our method can explicitly reason about the visibility of subjects as well as their locations in videos. Our method outperforms the alternative methods in complicated scenarios with frequent human interactions. In this work, we focus on human interactions as a running case of the proposed technique, which can be extended to other types of objects (e.g., animals, drones).
- [Andriyenko, Schindler, and Roth 2012] Andriyenko, A.; Schindler, K.; and Roth, S. 2012. Discrete-continuous optimization for multi-target tracking. In IEEE Conference on Computer Vision and Pattern Recognition.
- [Arandjelovic and Zisserman 2012] Arandjelovic, R., and Zisserman, A. 2012. Three things everyone should know to improve object retrieval. In IEEE Conference on Computer Vision and Pattern Recognition.
- [Berclaz et al. 2011] Berclaz, J.; Fleuret, F.; Turetken, E.; and Fua, P. 2011. Multiple object tracking using k-shortest paths optimization. IEEE Transactions on Pattern Analysis and Machine Intelligence 33(9):1806–1819.
- [Dehghan, Assari, and Shah 2015] Dehghan, A.; Assari, S.; and Shah, M. 2015. Gmmcp-tracker: Globally optimal generalized maximum multi clique problem for multiple object tracking. In IEEE Conference on Computer Vision and Pattern Recognition.
- [Dehghan et al. 2015] Dehghan, A.; Tian, Y.; Torr, P.; and Shah, M. 2015. Target identity-aware network flow for online multiple target tracking. In IEEE Conference on Computer Vision and Pattern Recognition.
- [Ess et al. 2009] Ess, A.; Schindler, K.; Leibe, B.; and Gool, L. V. 2009. Improved multi-person tracking with active occlusion handling. In IEEE ICRA Workshop.
- [Felzenszwalb et al. 2010] Felzenszwalb, P.; Girshick, R. B.; McAllester, D.; and Ramanan, D. 2010. Object detection with discriminatively trained part based models. IEEE Transactions on Pattern Analysis and Machine Intelligence 32(9):1627–1645.
- [Ferryman and Shahrokni 2009] Ferryman, J., and Shahrokni, A. 2009. Pets2009: Dataset and challenge. In IEEE International Workshop on Performance Evaluation of Tracking and Surveillance.
- [Fire and Zhu 2016] Fire, A., and Zhu, S.-C. 2016. Learning perceptual causality from video. ACM Transactions on Intelligent Systems and Technology 7(2).
- [Fleuret et al. 2008] Fleuret, F.; Berclaz, J.; Lengagne, R.; and Fua, P. 2008. Multicamera people tracking with a probabilistic occupancy map. IEEE Transactions on Pattern Analysis and Machine Intelligence 30(2):267–282.
- [Geiger, Lenz, and Urtasun 2012] Geiger, A.; Lenz, P.; and Urtasun, R. 2012. Are we ready for autonomous driving? the kitti vision benchmark suite. In IEEE Conference on Computer Vision and Pattern Recognition.
- [Griffiths and Tenenbaum 2005] Griffiths, T., and Tenenbaum, J. 2005. Structure and strength in causal induction. Cognitive Psychology 51(4):334–384.
- [Griffiths and Tenenbaum 2007] Griffiths, T., and Tenenbaum, J. 2007. Two proposals for causal grammars. Causal learning: Psychology, philosophy, and computation 323–345.
- [Hamid Rezatofighi et al. 2015] Hamid Rezatofighi, S.; Milan, A.; Zhang, Z.; Shi, Q.; Dick, A.; and Reid, I. 2015. Joint probabilistic data association revisited. In IEEE International Conference on Computer Vision.
- [Kasturi et al. 2009] Kasturi, R.; Goldgof, D.; Soundararajan, P.; Manohar, V.; Garofolo, J.; Bowers, R.; Boonstra, M.; Korzhova, V.; and Zhang, J. 2009. Framework for performance evaluation of face, text, and vehicle detection and tracking in video: Data, metrics, and protocol. IEEE Transactions on Pattern Analysis and Machine Intelligence 31(2):319–336.
- [Kim et al. 2015] Kim, C.; Li, F.; Ciptadi, A.; and Rehg, J. M. 2015. Multiple hypothesis tracking revisited. In IEEE International Conference on Computer Vision.
- [Leal-Taixé et al. 2014] Leal-Taixé, L.; Fenzi, M.; Kuznetsova, A.; Rosenhahn, B.; and Savarese, S. 2014. Learning an image-based motion context for multiple people tracking. In IEEE Conference on Computer Vision and Pattern Recognition.
- [Li et al. 2016] Li, B.; Wu, T.; Xiong, C.; and Zhu, S.-C. 2016. Recognizing car fluents from video. In IEEE Conference on Computer Vision and Pattern Recognition.
- [Liang et al. 2016] Liang, W.; Zhao, Y.; Zhu, Y.; and Zhu, S.-C. 2016. What are where: Inferring containment relations from videos. In International Joint Conference on Artificial Intelligence.
- [Maksai, Wang, and Fua 2016] Maksai, A.; Wang, X.; and Fua, P. 2016. What players do with the ball: A physically constrained interaction modeling. In IEEE Conference on Computer Vision and Pattern Recognition.
- [Milan, Schindler, and Roth 2016] Milan, A.; Schindler, K.; and Roth, S. 2016. Multi-target tracking by discrete-continuous energy minimization. IEEE Transactions on Pattern Analysis and Machine Intelligence 38(10):2054–2068.
- [Mueller 2014] Mueller, E. T. 2014. Commonsense reasoning: An event calculus based approach. Morgan Kaufmann.
- [Pearl 2009] Pearl, J. 2009. Causality: Models, reasoning and inference. Cambridge University Press.
- [Pirsiavash, Ramanan, and Fowlkes 2011] Pirsiavash, H.; Ramanan, D.; and Fowlkes, C. C. 2011. Globally-optimal greedy algorithms for tracking a variable number of objects. In IEEE Conference on Computer Vision and Pattern Recognition.
- [Ren et al. 2015] Ren, S.; He, K.; Girshick, R.; and Sun, J. 2015. Faster r-cnn: Towards real-time object detection with region proposal networks. In Conference on Neural Information Processing Systems.
- [Tang, Andriluka, and Schiele 2014] Tang, S.; Andriluka, M.; and Schiele, B. 2014. Detection and tracking of occluded people. International Journal of Computer Vision 110(1):58–69.
- [Vondrick, Patterson, and Ramanan 2013] Vondrick, C.; Patterson, D.; and Ramanan, D. 2013. Efficiently scaling up crowdsourced video annotation. International Journal of Computer Vision 101(1):184–204.
- [Wang et al. 2014] Wang, X.; Turetken, E.; Fleuret, F.; and Fua, P. 2014. Tracking interacting objects optimally using integer programming. In European Conference on Computer Vision.
- [Wang et al. 2016] Wang, X.; Turetken, E.; Fleuret, F.; and Fua, P. 2016. Tracking interacting objects using intertwined flows. IEEE Transactions on Pattern Analysis and Machine Intelligence 38(11):2312–2326.
- [Wei et al. 2015] Wei, S.-E.; Ramakrishna, V.; Kanade, T.; and Sheikh, Y. 2015. Convolutional pose machines. In IEEE Conference on Computer Vision and Pattern Recognition.
- [Wen et al. 2014] Wen, L.; Li, W.; Yan, J.; and Lei, Z. 2014. Multiple target tracking based on undirected hierarchical relation hypergraph. In IEEE Conference on Computer Vision and Pattern Recognition.
- [Xiang, Alahi, and Savarese 2015] Xiang, Y.; Alahi, A.; and Savarese, S. 2015. Learning to track: Online multi-object tracking by decision making. In IEEE International Conference on Computer Vision.
- [Yilmaz, Xin, and Shah 2004] Yilmaz, A.; Xin, L.; and Shah, M. 2004. Contour based object tracking with occlusion handling in video acquired using mobile cameras. IEEE Transactions on Pattern Analysis and Machine Intelligence 26(11):1532–1536.
- [Yu et al. 2016] Yu, S.-I.; Meng, D.; Zuo, W.; and Hauptmann, A. 2016. The solution path algorithm for identity-aware multi-object tracking. In IEEE Conference on Computer Vision and Pattern Recognition.
- [Zheng et al. 2015] Zheng, L.; Shen, L.; Tian, L.; Wang, S.; Wang, J.; and Tian, Q. 2015. Scalable person re-identification: A benchmark. In IEEE International Conference on Computer Vision.