CrowdMOT: Crowdsourcing Strategies for Tracking Multiple Objects in Videos

Abstract.

Crowdsourcing is a valuable approach for tracking objects in videos in a more scalable manner than is possible with domain experts. However, existing frameworks do not produce high quality results with non-expert crowdworkers, especially for scenarios where objects split. To address this shortcoming, we introduce a crowdsourcing platform called CrowdMOT, and investigate two micro-task design decisions: (1) whether to decompose the task so that each worker annotates all objects in a sub-segment of the video versus a single object across the entire video, and (2) whether to show annotations from previous workers to the next individuals working on the task. We conduct experiments on a diversity of videos showing both familiar objects (i.e., people) and unfamiliar objects (i.e., cells). Our results highlight strategies for efficiently collecting higher quality annotations than those obtained with the strategies employed by today's state-of-the-art crowdsourcing system.

Crowdsourcing, Computer Vision, Video Annotation

1. Introduction

Videos provide a unique setting for studying objects in a temporal manner, which cannot be achieved with 2D images. They reveal each object’s actions and interactions, which is valuable for applications including self-driving vehicles, security surveillance, shopping behavior analysis, and activity recognition. Videos also are important for biomedical researchers who study cell lineage to learn about processes such as viral infections, tissue damage, cancer progression, and wound healing.

Many data annotation companies have emerged to meet the demand for high quality, labelled video datasets (14; 55; 22; 63; 15; 2). Some companies employ in-house, trained labellers, while others employ crowdsourcing strategies. Despite this progress, which is paving the way for new applications in society, their methodologies remain proprietary. In other words, potentially available knowledge of how to successfully crowdsource video annotations is not in the public domain. Consequently, it is not clear whether such companies' successes derive from novel crowdsourcing interfaces, novel worker training protocols, or other mechanisms.

Towards filling this gap, we focus on identifying successful crowdsourcing strategies for video annotation in order to establish a scalable approach for tracking objects. A key component in analyzing videos is examining how each object behaves over time. Commonly, this is achieved by localizing each object in the video (detection) and then following all objects as they move (tracking). This task is commonly referred to as multiple object tracking (MOT) (Milan et al., 2016). One less-studied aspect of MOT is the fact that an object can split into multiple objects. This can arise, for example, for exploding objects such as ballistics, balloons, or meteors, and for self-reproducing organisms such as cells in the human body (exemplified in Figure 1). We refer to the task of tracking all fragments coming from the original object as lineage tracking.

While crowdsourcing exists as a powerful option for leveraging human workers to annotate a large number of videos (Vondrick et al., 2013; Yuen et al., 2009), existing crowdsourcing research about MOT has two key limitations. First, our analysis shows that today's state-of-the-art crowdsourcing system and its employed strategies (Vondrick et al., 2013) do not consistently produce high quality results with non-expert crowdworkers (Sections 3 and 6.1). As noted in prior work (Vondrick et al., 2013), its success likely stems from employing expert workers identified through task qualification tests, a step which shrinks the worker pool and so limits the extent to which such approaches can scale up. Second, prior work has only evaluated MOT solutions for specific video domains; e.g., only videos showing familiar content like people (Vondrick et al., 2013) or only videos showing unfamiliar content like biological cells (Sameki et al., 2016). This begs a question of how well MOT strategies will generalize across such distinct video domains, which can manifest unique annotation challenges such as the need for lineage tracking.

To address these concerns, we focus on (1) proposing strategies for decomposing the complex MOT into microtasks that can be completed by non-expert crowdworkers, and (2) supporting MOT annotation for both familiar (people) and unfamiliar (cell) content, thereby bridging two domains related to MOT.

We analyze two crowdsourcing strategies for collecting MOT annotations from a pool of non-expert crowdworkers for both familiar and unfamiliar video content. First, we compare two choices for decomposing the task of tracking multiple objects in videos, i.e., track all objects in a segment of the video (time-based approach that we call SingSeg) or track one object across the entire video (object-based approach that we call SingObj). Second, we examine if creating iterative tasks, where crowdworkers see the results from a previous worker on the same video, improves annotation performance.

To evaluate these strategies, we introduce a new video annotation platform for MOT, which we call CrowdMOT. CrowdMOT is designed to support lineage tracking as well as to engage non-expert workers for video annotation. Using CrowdMOT, we conduct experiments to quantify the efficacy of the two aforementioned design strategies when crowdsourcing annotations on videos with multiple objects. Our analysis with respect to several evaluation metrics on diverse videos, showing people and cells, highlights strategies for collecting much higher quality annotations from non-expert crowdworkers than is observed from strategies employed by today's state-of-the-art system, VATIC (Vondrick et al., 2013).

To summarize, our main contribution is a detailed analysis of two crowdsourcing strategy decisions: (a) which microtask design and (b) whether to use an iterative task design. Studies demonstrate the efficacy of these strategies on a variety of videos showing familiar (people) and unfamiliar (cell) content. Our findings reveal which strategies result in higher quality results when collecting MOT annotations from non-expert crowdworkers. We will publicly share the crowdsourcing system, CrowdMOT, that incorporates these strategies.

2. Related Work

2.1. Crowdsourcing Annotations for Images vs. Videos

Since the onset of crowdsourcing, much research has centered on annotating visual content. Early on, crowdsourcing approaches were proposed for simple tasks such as tagging objects in images (Von Ahn and Dabbish, 2004), localizing objects in images with bounding rectangles (Von Ahn et al., 2006b), and describing salient information in images (Von Ahn et al., 2006a; Salisbury et al., 2017; Kohler et al., 2017). More recently, a key focus has been on developing crowdsourcing frameworks to address more complex visual analysis tasks such as counting the number of objects in an image (Sarma et al., 2015), creating stories to link collections of distinct images (Mandal et al., 2018), critiquing visual design (Luther et al., 2014), investigating the dissonance between human and machine understanding in visual tasks (Zhang et al., 2019), and tracking all objects in a video (Vondrick et al., 2013). Our work contributes to the more recent effort of developing strategies to decompose complex visual analysis tasks into simpler ones that can be completed by non-expert crowdworkers. The complexity of video annotation arises in part from the large amount of data, since even short videos consist of several thousand images that must be annotated; e.g., a typical one-minute video clip at 29 frames per second contains 1,740 images. Our work offers new insights into how to collect high quality video annotations from an anonymous, non-expert crowd.

2.2. Crowdsourcing Video Annotations

Within the scope of crowdsourcing video annotations, there are a broad range of valuable tasks. Some crowdsourcing platforms promote learning by improving content of educational videos (Cross et al., 2014) and editing captions in videos to learn foreign languages (Culbertson et al., 2017). Other crowdsourcing systems employ crowdworkers to flag frames where events of interest begin and/or end. Examples include activity recognition (Nguyen-Dinh et al., 2013), event detection (Steiner et al., 2011), behavior detection (Park et al., 2012), and more (Abu-El-Haija et al., 2016; Heilbron et al., 2015). Our work most closely relates to the body of work that requires crowdworkers to not only identify frames of interest in a video, but also to localize all objects in every video frame (Vondrick et al., 2013; Yuen et al., 2009; Sameki et al., 2016). The most popular tools that complete this MOT task are VATIC (Vondrick et al., 2013) and LabelMe Video (Yuen et al., 2009). In general, these tools exploit temporal redundancy between frames in a video to reduce the human effort involved by asking users to annotate only key frames, and have the tool interpolate annotations for the intermediate frames (Vondrick et al., 2013; Yuen et al., 2009). Our work differs in part because we propose a different strategy for decomposing the task into microtasks. Our experiments on videos showing both familiar and unfamiliar content demonstrate the advantage of our strategies over strategies employed in today's state-of-the-art crowdsourcing system (Vondrick et al., 2013).

Crowdsourcing approaches can be broadly divided into two types: parallel, in which workers solve a problem alone, and iterative, in which workers serially build on the results of other workers (Little et al., 2010). Examples of iterative tasks include interdependent tasks (Kim et al., 2017) and engaging workers in multi-turn discussions (Chen et al., 2019), which can lead to improved outcomes such as increased worker retention (Gadiraju and Dietze, 2017). Prior work has also demonstrated workers perform better on their own work after reviewing others’ work (Kobayashi et al., 2018; Zhu et al., 2014). The iterative approach has been shown to produce better results for the tasks of image description, writing, and brainstorming (Little et al., 2010, 2009; Zhang et al., 2012). More recently, an iterative process has been leveraged to crowdsource complex tasks such as masking private content in images (Kaur et al., 2017). Our work complements prior work by demonstrating the advantage of exposing workers to previous workers’ high quality annotations on the same video in order to obtain higher quality results for the MOT task.

2.5. Tracking Cells in Videos

As evidence of the importance of the cell tracking problem, many publicly-available biological tools are designed to support this type of video annotation: CellTracker (Piccinini et al., 2016), TACTICS (Shimoni et al., 2013), BioImageXD (Kankaanpää et al., 2012), eDetect (Han et al., 2019), LEVER (Winter et al., 2016), tTt (Hilsenbeck et al., 2016), NucliTrack (Cooper et al., 2017), TrackMate (Tinevez et al., 2017), and Icy (De Chaumont et al., 2012). However, only one tool (Sameki et al., 2016) is designed for use in a crowdsourcing environment, and it was evaluated for tracking cells based on more costly object segmentations (rather than less costly, more efficient bounding boxes). Our work aims to bridge this gap by seeking strategies that not only work for cell annotation but also generalize more broadly to support videos of familiar everyday content. CrowdMOT is designed to support cell tracking, because it features lineage tracking by recognizing when a cell undergoes mitosis and so splits into children cells (exemplified in Figure 1, row 3).

3. Pilot Study: Evaluation of State-of-the-Art Crowdsourcing System

Our work was inspired by our observation that we obtained poor quality results when we used today's state-of-the-art crowdsourcing system, VATIC (Vondrick et al., 2013), to employ non-expert crowdworkers to track biological cells. Based on this initial observation, we conducted follow-up experiments to assess the reasons for the poor quality results. We chose to conduct these and subsequent experiments on both familiar everyday content and unfamiliar biological content showing cells in order to ensure our findings are more generalizable.

3.1. Experimental Design

Dataset. We conducted experiments with 35 videos containing 49,501 frames showing both familiar content (people) and unfamiliar content (cells). Of these, 15 videos (11,720 frames) show people [11] and the remaining 20 videos (37,781 frames) show cells [12].

VATIC Configuration. We collected annotations with the default parameters, where each video was split into smaller segments of 320 frames with 20 overlapping frames. This resulted in a total of 181 segments. A new crowdsourcing job was created for each segment and assigned to a crowdworker. VATIC then merged the tracking results from consecutive segments using the Hungarian algorithm (Munkres, 1957).
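The segment-merging step can be sketched as follows. This is an illustrative reconstruction rather than VATIC's actual code; for brevity it uses a brute-force optimal assignment over (1 - IoU) costs, which returns the same matching as the Hungarian algorithm for small track counts, and all function names are our own:

```python
from itertools import permutations

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def merge_tracks(prev_boxes, next_boxes):
    """Match each track from the previous segment to one in the next segment
    by minimizing total (1 - IoU) within the overlapping frames.

    Brute force here for clarity; the Hungarian algorithm finds the same
    optimum in O(n^3). Assumes len(next_boxes) >= len(prev_boxes).
    Returns a list of (prev_track_index, next_track_index) pairs.
    """
    best, best_cost = None, float("inf")
    for perm in permutations(range(len(next_boxes)), len(prev_boxes)):
        cost = sum(1.0 - iou(prev_boxes[i], next_boxes[j])
                   for i, j in enumerate(perm))
        if cost < best_cost:
            best, best_cost = list(enumerate(perm)), cost
    return best
```

In practice, the cost would be accumulated over all 20 overlapping frames rather than a single representative box per track.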

The VATIC instructions direct workers to mark all objects of interest, annotating one object at a time to avoid confusion. That is, workers were asked to finish annotating one object across the entire video, and then rewind the video to begin annotating the next object. For each object, workers were asked to draw the rectangle tightly such that it completely encloses the object.

To assist workers with tracking multiple objects, the interface enabled them to freely navigate between frames to view their annotations at any given time. Each object is marked with a unique color along with a unique label on the top right corner of the bounding box to visually aid the worker with tracking that object.

Our implementation had only one difference from that discussed in prior work (Vondrick et al., 2013). We did not filter out workers using a “gold standard challenge”. The original implementation, in contrast, prevented workers from completing the jobs unless they passed an initial annotation test.

6.2. Study 2: Effect of Iteration on Video Annotation

We next examine the influence of iterative tasks on MOT performance for the CrowdMOT-SingObj design. To do so, we evaluate the quality of the annotations when a crowdworker does versus does not observe other object tracking results on the same video.

CrowdMOT Implementation:

We deployed the same crowdsourcing system design as used in study 1 for the CrowdMOT-SingObj microtask design. We assigned each HIT to five workers, and evaluated two rounds of consecutive HITs as described in Steps 1 and 3 below.

• Step 1: We conducted the first round of HITs on all 66 videos, in which workers were asked to annotate only one object per video. The choice of which object to annotate was left to the worker’s discretion. We refer to the results obtained on this set of videos as NonIterative.

• Step 2: After retrieving the results from step 1, we chose those videos in which workers did a good job in tracking an object for use in creating subsequent tasks. To do so, we emulated human supervision by excluding the videos with an AUC score less than 0.4, which indicates that they have poor tracking results. We refer to the remaining list of videos with good tracking results as NonIterative-Filtered. These filtered videos are used in successive tasks for workers to build on the previous annotations.

• Step 3: For the second round of HITs, we used all the videos from the NonIterative-Filtered list (i.e., Step 2 results), because they each consist of good tracking results. Workers were shown the previous object tracking results (i.e., that were collected in Step 1) and asked to choose another object for their task that was not previously annotated in the video. We refer to the results obtained in this set of videos as Iterative.

• Step 4: We finally identified those videos from the Iterative HIT (i.e., Step 3 results) that contained good results and so are suitable for further task propagation. To do so, as done for Step 2, we again emulated human supervision by excluding the videos with an AUC score less than 0.4. We refer to the remaining list of videos as Iterative-Filtered.
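The per-video filtering used in Steps 2 and 4 amounts to a simple threshold on the best available annotation per video. A minimal sketch (the function name and data layout are our own, hypothetical choices):

```python
AUC_CUTOFF = 0.4  # videos scoring below this are treated as poor tracking results

def filter_good_videos(auc_by_video):
    """Keep only videos whose best annotation meets the AUC cutoff.

    `auc_by_video` maps a video id to the list of per-worker AUC scores
    collected for that video; the best score per kept video is returned.
    """
    return {vid: max(scores)
            for vid, scores in auc_by_video.items()
            if max(scores) >= AUC_CUTOFF}
```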

Dataset:

We conducted this study on a larger collection of 116,394 frames that came from 66 live cell videos from the CTMC Cell Motility dataset (Anjum and Gurari, 2020). The average number of frames per video is 1,764. The number of cells per video varies between 3 and 185.

Evaluation Metrics:

We used the same evaluation metrics as used in study 1. Specifically, the quality of crowdsourced results was evaluated using the following three metrics: AUC, TrAcc, and Precision. In addition, human effort was calculated in terms of the number of key frames annotated per HIT and the time taken to complete each HIT.

Results - Work Quality:

We compare the results obtained in NonIterative HITs (Step 1) with those obtained in Iterative HITs (Step 3). Table 2 shows the average AUC, TrAcc, and Precision scores, while Figure 7 shows the distribution of these scores. For completeness, we include the scores across all four sets of videos described in Steps 1-4 in Appendix (Table 3 and Figure 8).

As shown in Table 2, we found that workers performed better in the Iterative HITs as compared to the NonIterative HITs. Observing this overall improvement of the worker performance, we hypothesize that the existing annotation may have helped guide the workers of the second HIT to better understand the requirements and annotate accordingly. This improvement occurred despite the fact that sample videos already were provided in the instructions to show how to annotate using the tool for both scenarios (Iterative and NonIterative), as mentioned in Section 4. This suggests that observing a prior annotation on the same video offers greater guidance than only having access to a video within the instructions.

Our findings lead us to believe that observing previous annotations has more of an impact on the resulting size of the bounding box than the center location of the box. After excluding annotations with low AUC scores (i.e., AUC < 0.4), we found that for the NonIterative HIT, 43 out of 66 videos contained satisfactory annotations, which accounted for 65%. However, using the same AUC cutoff with the Iterative HIT annotations, crowdworkers provided significantly better results on 41 out of 43 videos (p value = 0.0020 using Student's t-test). This resulted in 95% of the videos achieving better annotations. There was a slight improvement in the Precision scores as well, though it was not significant.
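The reported percentages follow directly from the counts:

```python
# 43 of 66 NonIterative videos met the AUC cutoff; 41 of the 43 videos
# carried into the Iterative round met it again.
non_iterative_pct = round(43 / 66 * 100)  # 65% satisfactory
iterative_pct = round(41 / 43 * 100)      # 95% satisfactory
```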

Across both NonIterative and Iterative results, the average TrAcc scores were above 97%. This shows that, for both approaches, workers persisted and remained reliable in terms of marking the object throughout its lifetime in the video. While the TrAcc scores of annotations were generally high, the low scores of some objects were attributed to the temporal offset of frames for objects that either split or left the frame.

Next, we assessed the impact of video difficulty, leveraging the difficulty ratings assigned by our in-house annotators during ground truth generation. Out of 28 easy, 24 medium, and 14 hard videos, those removed in the first round included 5 easy, 10 medium, and 8 hard videos. While, predictably, more easy videos passed to the second round of HITs, about 50% of the videos in the second round belonged to the medium or hard categories. In other words, workers in the second round annotated videos from a mixed bag of all three difficulty levels.

Results - Human Effort:

A total of 78 unique workers completed the 545 tasks for the 66 videos. 24 unique workers participated in the NonIterative HITs and 66 unique workers participated in the Iterative HITs, with 12 workers participating on both sets of HITs.

Crowdworkers annotated a total of 2,468 frames, which is about 2.1% of the total number of frames, with an average of 14 key frames per object in a video [19]. Of these, 647 key frames were annotated for the NonIterative HITs, while 1,761 key frames were annotated for the Iterative HITs. This suggests that workers who were shown prior annotations invested more effort into submitting high quality annotations. This finding is reinforced when examining the time taken to complete the tasks. On average, NonIterative tasks took 4.6 minutes per object, while the Iterative tasks were completed in 5.6 minutes [20].

7. Discussion

7.1. Implications

We focused on designing effective strategies for employing crowdsourcing to track multiple objects in videos. Rather than relying upon expert workers (Vondrick et al., 2013), which ignores the potential of a large pool of workers, our strategies aim at leveraging the non-expert crowdsourcing market by (1) designing the microtask to be simple (i.e., SingObj) and (2) providing additional guidance in the form of prior annotations (i.e., iterative tasks). Our experiments reveal benefits of implementing these two strategies when collecting annotations on a complex task like tracking multiple objects in videos, as discussed below.

Our first strategy decomposes the video annotation task into a simpler microtask in order to facilitate gathering better results. Our experiments show that simplifying the annotation task by assigning a crowdworker the entire lifetime of one object in a video yielded higher quality annotations than assigning a crowdworker ownership of all objects over an entire segment of the video. Subsequent to our analysis, we learned that our findings complement those found in studies that examine the human attention level in performing the MOT task. For instance, prior work showed that human performance in object tracking decreases as the number of objects grows, with the best performance observed for the first annotated object (Alvarez and Franconeri, 2007; Holcombe and Chen, 2012). Another study showed that humans can accurately track up to four objects in a video (Intriligator and Cavanagh, 2001). An additional study demonstrated that tracking performance depends on factors such as object speed as well as spatial proximity between objects (Alvarez and Franconeri, 2007). For example, they showed that if the objects were moving at a sufficiently slow speed, humans could track up to eight objects. While our experiments demonstrate that SingObj ensures higher accuracy, further exploration could determine the limits of the conditions under which SingObj is preferable (e.g., possibly for a large number of objects).

Our second strategy demonstrates that exposing crowdworkers to prior annotations by creating iterative tasks can have a positive influence on their performance. An iterative workflow of microtasks through consecutive rounds of crowdsourcing can provide crowdworkers a more holistic understanding of how their work contributes to the bigger goal of the project. This, in turn, can improve the quality of their performance, as noted by related prior work (Bigham et al., 2015).

More broadly, this work can have implications in designing crowdsourcing microtasks for other applications that similarly leverage spatial-temporal information, such as ecology (Estes et al., 2018), wireless communications (Raleigh and Cioffi, 1998), and tracking tectonic activity (Miller et al., 2006). Similar to MOT, these applications can also choose to decompose tasks either temporally (SingSeg) or spatially (SingObj). Our findings paired with the constraints imposed from our experimental design, with respect to the length of videos and total number of objects, underscore certain conditions for which we anticipate the SingObj design will yield higher-quality and more consistent results.

By releasing the CrowdMOT code publicly, we aim to encourage MOT crowdsourcing in a greater diversity of domains, including data that is both familiar and typically unfamiliar to lay people. Much crowdsourcing work examines how to involve non-experts in annotating data that is uncommon or unfamiliar to lay people. For example, researchers have relied on crowdsourcing to annotate lung nodule images (Boorboor et al., 2018), colonoscopy videos (Park et al., 2017), hip joints in MRI images (Chávez-Aragón et al., 2013), and cell images (Gurari et al., 2014, 2015; Sameki et al., 2015; Gurari et al., 2016). The scope of such efforts has been accelerated in part because of the Zooniverse platform, which simplifies creating crowdsourcing systems (Borne and Team, 2011). Our work intends to complement this existing effort, and we anticipate users may benefit from using CrowdMOT to crowdsource high-quality annotations for their biological videos showing cells (that exhibit a splitting behavior). Providing a web version of the tool empowers users and researchers to more easily annotate videos by reducing the overhead of tool installation and setup. This can be valuable for many potential users, especially those lacking domain expertise.

7.2. Limitations and Future Work

An important issue we observed with the annotations was that users are often unable to annotate the object from the correct starting frame. A useful enhancement would be to ease this process by having an algorithm seed each object in the first frame it appears, thereby guiding the worker into annotating only that object.

While we are encouraged by the considerable improvements in the tracking annotation results obtained from workers using our system, it is possible more sophisticated interpolation schemes could lead to further improvements. The current framework uses linear interpolation to fill the intermediate frames between user-defined key frames with annotations, as is done in popular video annotation systems (Vondrick et al., 2013; Yuen et al., 2009). An interesting area for future work is to explore how changing the interpolation schemes (e.g., level set methods) will impact crowdworker effort and annotation quality. In the future, we also plan to increase the size of our video collection to assess the versatility of our framework on different types of videos.
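As a concrete illustration, linear interpolation of a bounding box between two key frames reduces to per-coordinate blending. This is a minimal sketch under our own naming conventions; CrowdMOT's actual implementation may differ:

```python
def interpolate_box(key_frames, frame):
    """Linearly interpolate a bounding box (x1, y1, x2, y2) at `frame`
    from the two surrounding user-annotated key frames.

    `key_frames` maps frame index -> box; assumes `frame` lies within
    the annotated range (the object's lifetime in the video).
    """
    if frame in key_frames:
        return key_frames[frame]
    prev = max(f for f in key_frames if f < frame)
    nxt = min(f for f in key_frames if f > frame)
    t = (frame - prev) / (nxt - prev)  # fractional position between key frames
    return tuple(a + t * (b - a)
                 for a, b in zip(key_frames[prev], key_frames[nxt]))
```

Swapping in a different scheme (e.g., a level set method) would only change how the intermediate boxes are generated from the same key frames.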

Although our crowdsourcing approach yields significant improvements, future work is needed to address certain settings where we believe this approach may not be viable. One example is very long videos, since every user has to watch the entire video to mark one object. In such scenarios, it may be beneficial to design a microtask that integrates both the SingObj and SingSeg strategies. For example, a long video can be divided into smaller segments, where each worker is then asked to annotate one object per segment. In addition, our current framework supports objects that split into two children, as in the case of cells in biomedical research. This can be extended to support objects that undergo any number of splits, for example, in videos depicting ballistic testing that involve an object breaking into multiple pieces. Finally, future work will need to examine how to generalize MOT solutions for videos that show tens, hundreds, or more objects that need to be tracked.

8. Conclusion

We introduce a general-purpose crowdsourcing system for multiple object tracking that also supports lineage tracking. Our experiments demonstrate significant flaws in the existing state-of-the-art crowdsourcing task design. We quantitatively demonstrate the advantage of two key micro-task design options in collecting much higher quality video annotations: (1) have a single worker annotate a single object for the entire video and (2) show workers the results of previously annotated objects on the video. To encourage further development and extension of our framework, we will publicly share our code.

Acknowledgements

This project was supported in part by grant number 2018-182764 from the Chan Zuckerberg Initiative DAF, an advised fund of Silicon Valley Community Foundation. We thank Matthias Müller, Philip Jones and the crowdworkers for their valuable contributions in this research. We also thank the anonymous reviewers for their valuable feedback and suggestions to improve this work.

Appendix

This section includes supplementary material to Sections 3 and 6.

• Section A provides details of a pilot study conducted using the CrowdMOT-SingObj framework, which was used for comparison to the VATIC system analyzed in Section 3.

• We report evaluation results that supplement the analysis conducted in study 2 of Section 6.

Appendix A Pilot Study: Evaluation of CrowdMOT

This pilot study was motivated by our observation that we received poor quality results from using the state-of-the-art crowdsourcing system, VATIC (Section 3), which employs the SingSeg design. We conducted a follow-up pilot experiment to assess the quality of results obtained using our alternative design: CrowdMOT with the SingObj design. We observed a considerable improvement in the quality of results using our CrowdMOT-SingObj system over those obtained with VATIC (Section 3), which led us to conduct subsequent experiments (Section 6).

A.1. Experimental Design

Dataset. We conducted this study on the same dataset as used in Section 3, which included 35 videos containing 49,501 frames, with 15 videos showing familiar content (i.e., people) and the remaining 20 videos showing unfamiliar content (i.e., cells).

CrowdMOT Configuration. We deployed the same crowdsourcing design as used in Study 1 of Section 6 for the CrowdMOT-SingObj microtask design. We created a new HIT for each object. Workers were asked to mark only one object in the entire video and could only submit the task after both detecting and tracking an object. Our goal was to study the trend of the worker performance, so we collected annotations for only two objects per video rather than for all objects in the entire video.

Each HIT was assigned to five workers. This resulted in a total of 350 jobs for 35 videos. Out of the five annotations, we picked the annotation with the highest AUC score per video to use as input for the subsequent, second posted HIT. This choice of using the annotation with the highest AUC score as input is intended to simulate human supervision in selecting the best annotation. Evaluation was conducted for the two objects and any of their progeny for all videos.

Crowdsourcing Environment and Parameters. As was done in Section 3, we employed crowdworkers from Amazon Mechanical Turk (AMT) who had completed at least 500 HITs and had at least a 95% approval rating. Each worker was paid $0.50 per HIT and given 30 minutes to complete that HIT.

Evaluation Metrics. We used the same three metrics as used in Section 3 to evaluate the results, namely, AUC, TrAcc, and Precision.

A.2. Results

We observe considerable improvement using CrowdMOT with the SingObj microtask design compared to VATIC; i.e., it results in higher scores for all three evaluation metrics. Specifically, with CrowdMOT-SingObj, the AUC score was 0.50, TrAcc was 0.96, and Precision was 0.63, as compared to 0.06, 0.42, and 0.03 with VATIC. The higher AUC and Precision scores indicate that CrowdMOT-SingObj is substantially better for tracking both the bounding boxes and their center locations. The higher TrAcc scores demonstrate that it better captures each object's trajectory across its entire lifetime in the video.

Appendix B Evaluation of NonIterative versus Iterative Task Designs

Supplementing Section 6.2 of the main paper, we report additional results comparing the performance of our NonIterative versus Iterative task designs. The distribution of AUC, TrAcc, and Precision scores for all four sets of videos described in Steps 1-4 (NonIterative, NonIterative-Filtered, Iterative, and Iterative-Filtered) are illustrated in Figure 8, and the averages of those scores are summarized in Table 3. The filtered sets contain those annotations that were of high quality (i.e., AUC >= 0.4) from the NonIterative and Iterative results respectively. We observe a considerable difference in the scores between the NonIterative results and its corresponding filtered set. This observation contrasts what is observed for the Iterative results and its corresponding filtered set. We attribute this distinction to the fact that more annotations are discarded for the NonIterative tasks (i.e., 23 out of 66 HITs) than the Iterative tasks (i.e., 2 out of 43 HITs). This finding offers promising evidence that Iterative tasks yield better results for collecting MOT annotations.

Footnotes

9. ccs: Information systems Crowdsourcing
10. ccs: Computing methodologies Computer vision
11. These videos came from the MOT dataset: https://motchallenge.net/
12. These videos came from the CTMC dataset (Anjum and Gurari, 2020). We collected ground truth data for all videos from two in-house experts whom we trained to perform video annotation.
13. Our experts were two graduate students who had successfully completed a course about crowdsourcing visual content.
14. Of note, we conducted this experiment before June 2019; since then, the AMT API used by the VATIC system has been deprecated, making VATIC unusable for crowdsourcing via AMT.
15. https://viratdata.org/
16. Our experts were two graduate students who had successfully completed a course about crowdsourcing visual content, and we trained them to complete video annotation.
17. Our analysis is based on a single annotation per segment rather than all three crowdsourced results per segment.
18. Our analysis is based on a single annotation per object rather than all three crowdsourced results per object.
19. Our analysis is based on a single annotation per object rather than all five crowdsourced results per object.
20. Overall, we found the time taken by crowdworkers to annotate each object using CrowdMOT-SingObj is consistent with that in study 1, with the average being 5.02 minutes per job.

References

1. Youtube-8m: a large-scale video classification benchmark. arXiv preprint arXiv:1609.08675. Cited by: §2.2.
2. (2012) Alegion. Note: https://www.alegion.com Cited by: §1.
3. How many objects can you track?: evidence for a resource-limited attentive tracking mechanism. Journal of vision 7 (13), pp. 14–14. Cited by: §7.1.
4. CTMC: cell tracking with mitosis detection dataset challenge. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Cited by: §6.1.2, §6.2.2, footnote 3.
5. Soylent: a word processor with a crowd inside. In Proceedings of the 23nd annual ACM symposium on User interface software and technology, pp. 313–322. Cited by: §2.3.
6. Human-computer interaction and collective intelligence. Handbook of collective intelligence 57. Cited by: §7.1.
7. Crowdsourcing lung nodules detection and annotation. In Medical Imaging 2018: Imaging Informatics for Healthcare, Research, and Applications, Vol. 10579, pp. 105791D. Cited by: §7.1.
8. The zooniverse: a framework for knowledge discovery from citizen science data. In AGU Fall Meeting Abstracts, Cited by: §7.1.
9. Slow motion increases perceived intent. Proceedings of the National Academy of Sciences 113 (33), pp. 9250–9255. Cited by: item 4.
10. A crowdsourcing web platform-hip joint segmentation by non-expert contributors. In 2013 IEEE International Symposium on Medical Measurements and Applications (MeMeA), pp. 350–354. Cited by: §7.1.
11. Cicero: multi-turn, contextual argumentation for accurate crowdsourcing. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, pp. 1–14. Cited by: §2.4.
12. Break it down: a comparison of macro-and microtasks. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, pp. 4061–4064. Cited by: §2.3.
13. Cascade: crowdsourcing taxonomy creation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 1999–2008. Cited by: §2.3.
14. (2016) Clay sciences. Note: https://www.claysciences.com Cited by: §1.
15. (2010) CloudFactory. Note: https://www.cloudfactory.com Cited by: §1.
16. NucliTrack: an integrated nuclei tracking application. Bioinformatics 33 (20), pp. 3320–3322. Cited by: §2.5.
17. VidWiki: enabling the crowd to improve the legibility of online educational videos. In Proceedings of the 17th ACM conference on Computer supported cooperative work & social computing, pp. 1167–1175. Cited by: §2.2.
18. Have your cake and eat it too: foreign language learning with a crowdsourced video captioning system. In Proceedings of the 2017 ACM Conference on Computer Supported Cooperative Work and Social Computing, pp. 286–296. Cited by: §2.2.
19. Paying crowd workers for collaborative work. Proceedings of the ACM on Human-Computer Interaction 3 (CSCW), pp. 1–24. Cited by: §6.1.6.
20. Icy: an open bioimage informatics platform for extended reproducible research. Nature methods 9 (7), pp. 690. Cited by: §2.5.
21. The spatial and temporal domains of modern ecology. Nature ecology & evolution 2 (5), pp. 819. Cited by: §7.1.
22. (2007) Figure-eight. Note: https://www.figure-eight.com Cited by: §1.
23. Improving learning through achievement priming in crowdsourced information finding microtasks. In Proceedings of the Seventh International Learning Analytics & Knowledge Conference, pp. 105–114. Cited by: §2.4.
24. Investigating the influence of data familiarity to improve the design of a crowdsourcing image annotation system. In Fourth AAAI Conference on Human Computation and Crowdsourcing, Cited by: §7.1.
25. How to use level set methods to accurately find boundaries of cells in biomedical images? evaluation of six methods paired with automated and crowdsourced initial contours. In Conference on medical image computing and computer assisted intervention (MICCAI): Interactive medical image computation (IMIC) workshop, pp. 9. Cited by: §7.1.
26. How to collect segmentations for biomedical images? a benchmark evaluating the performance of experts, crowdsourced non-experts, and algorithms. In 2015 IEEE winter conference on applications of computer vision, pp. 1169–1176. Cited by: §7.1.
27. eDetect: A Fast Error Detection and Correction Tool for Live Cell Imaging Data Analysis. iScience 13, pp. 1–8. External Links: ISSN 2589-0042, Link, Document Cited by: §2.5.
28. ActivityNet: A large-scale video benchmark for human activity understanding. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, pp. 961–970 (en). External Links: ISBN 978-1-4673-6964-0, Link, Document Cited by: §2.2.
29. Software tools for single-cell tracking and quantification of cellular and molecular properties. Nature Biotechnology 34 (7), pp. 703–706. External Links: ISSN 1087-0156, 1546-1696, Link, Document Cited by: §2.5.
30. Exhausting attentional tracking resources with a single fast-moving object. Cognition 123 (2), pp. 218–228. Cited by: §7.1.
31. The spatial resolution of visual attention. Cognitive psychology 43 (3), pp. 171–216. Cited by: §7.1.
32. Analyzing the amazon mechanical turk marketplace. XRDS: Crossroads, The ACM Magazine for Students, Forthcoming. Cited by: §5.2.
33. Efficient task decomposition in crowdsourcing. In International Conference on Principles and Practice of Multi-Agent Systems, pp. 65–73. Cited by: §5.2.
34. BioImageXD: an open, general-purpose and high-throughput image-processing platform. Nature methods 9 (7), pp. 683. Cited by: §2.5.
35. Crowdmask: using crowds to preserve privacy in crowd-powered systems via progressive filtering. In Fifth AAAI Conference on Human Computation and Crowdsourcing, Cited by: §2.4.
36. Creating better action plans for writing tasks via vocabulary-based planning. Proceedings of the ACM on Human-Computer Interaction 2 (CSCW), pp. 86. Cited by: §2.3.
37. Mechanical novel: crowdsourcing complex work through reflection and revision. In Proceedings of the 2017 ACM Conference on Computer Supported Cooperative Work and Social Computing, pp. 233–245. Cited by: §2.4.
38. Crowdforge: crowdsourcing complex work. In Proceedings of the 24th annual ACM symposium on User interface software and technology, pp. 43–52. Cited by: §2.3.
39. An empirical study on short-and long-term effects of self-correction in crowdsourced microtasks. In Sixth AAAI Conference on Human Computation and Crowdsourcing, Cited by: §2.4.
40. Supporting image geolocation with diagramming and crowdsourcing. In Fifth AAAI Conference on Human Computation and Crowdsourcing, Cited by: §2.1.
41. Collaboratively crowdsourcing workflows with turkomatic. In Proceedings of the acm 2012 conference on computer supported cooperative work, pp. 1003–1012. Cited by: §2.3.
42. Crowd development. In 2013 6th International Workshop on Cooperative and Human Aspects of Software Engineering (CHASE), pp. 85–88. Cited by: §2.3.
43. Turkit: tools for iterative tasks on mechanical turk. In Proceedings of the ACM SIGKDD workshop on human computation, pp. 29–30. Cited by: §2.3, §2.4.
44. Exploring iterative and parallel human computation processes. In Proceedings of the ACM SIGKDD workshop on human computation, pp. 68–76. Cited by: §2.4.
45. CrowdCrit: crowdsourcing and aggregating visual design critique. In Proceedings of the companion publication of the 17th ACM conference on Computer supported cooperative work & social computing, pp. 21–24. Cited by: §2.1.
46. Collective story writing through linking images. In Third AAAI Conference on Human Computation and Crowdsourcing, Cited by: §2.1.
47. MOT16: a benchmark for multi-object tracking. arXiv preprint arXiv:1603.00831. Cited by: §1.
48. Spatial and temporal evolution of the subducting pacific plate structure along the western pacific margin. Journal of Geophysical Research: Solid Earth 111 (B2). Cited by: §7.1.
49. Trackingnet: a large-scale dataset and benchmark for object tracking in the wild. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 300–317. Cited by: §3.2.
50. Algorithms for the assignment and transportation problems. Journal of the society for industrial and applied mathematics 5 (1), pp. 32–38. Cited by: §3.1, §6.1.1.
51. Tagging human activities in video by crowdsourcing. In Proceedings of the 3rd ACM conference on International conference on multimedia retrieval, pp. 263–270. Cited by: §2.2.
52. Crowdsourcing for identification of polyp-free segments in virtual colonoscopy videos. In Medical Imaging 2017: Imaging Informatics for Healthcare, Research, and Applications, Vol. 10138, pp. 101380V. Cited by: §7.1.
53. Crowdsourcing micro-level multimedia annotations: the challenges of evaluation and interface. In Proceedings of the ACM multimedia 2012 workshop on Crowdsourcing for multimedia, pp. 29–34. Cited by: §2.2.
54. CellTracker (not only) for dummies. Bioinformatics 32 (6), pp. 955–957 (en). External Links: ISSN 1367-4803, Link, Document Cited by: §2.5.
55. (2015) Playment. Note: https://playment.io/video-annotation-tool/ Cited by: §1.
56. Spatio-temporal coding for wireless communication. IEEE Transactions on communications 46 (3), pp. 357–366. Cited by: §7.1.
57. Imagenet large scale visual recognition challenge. International journal of computer vision 115 (3), pp. 211–252. Cited by: §3.2.
58. Best of both worlds: human-machine collaboration for object annotation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2121–2131. Cited by: §2.3.
59. Toward scalable social alt text: conversational crowdsourcing as a tool for refining vision-to-language technology for the blind. In Fifth AAAI Conference on Human Computation and Crowdsourcing, Cited by: §2.1.
60. CrowdTrack: Interactive Tracking of Cells in Microscopy Image Sequences with Crowdsourcing Support. Cited by: §1, §2.2, §2.5.
61. Characterizing image segmentation behavior of the crowd. Collective Intelligence. Cited by: §7.1.
62. Surpassing humans and computers with jellybean: crowd-vision-hybrid counting algorithms. In Third AAAI Conference on Human Computation and Crowdsourcing, Cited by: §2.1.
63. (2016) Scale. Note: https://www.scale.com Cited by: §1.
64. TACTICS, an interactive platform for customized high-content bioimaging analysis. Bioinformatics 29 (6), pp. 817–818. Cited by: §2.5.
65. Tool diversity as a means of improving aggregate crowd performance on image segmentation tasks. Cited by: §2.3.
66. Popup: reconstructing 3d video using particle filtering to aggregate crowd responses. In Proceedings of the 24th International Conference on Intelligent User Interfaces, pp. 558–569. Cited by: §2.3.
67. Crowdsourcing event detection in youtube video. In 10th International Semantic Web Conference (ISWC 2011); 1st Workshop on Detection, Representation, and Exploitation of Events in the Semantic Web, pp. 58–67. Cited by: §2.2.
68. Productivity decomposed: getting big things done with little microtasks. In Proceedings of the 2016 CHI Conference Extended Abstracts on Human Factors in Computing Systems, pp. 3500–3507. Cited by: §2.3.
69. TrackMate: An open and extensible platform for single-particle tracking. Methods 115, pp. 80–90. External Links: ISSN 1046-2023, Link, Document Cited by: §2.5.
70. SLADE: a smart large-scale task decomposer in crowdsourcing. IEEE Transactions on Knowledge and Data Engineering 30 (8), pp. 1588–1601. Cited by: §2.3.
71. GroundTruth: augmenting expert image geolocation with crowdsourcing and shared representations. Proceedings of the ACM on Human-Computer Interaction 3 (CSCW), pp. 1–30. Cited by: §2.3.
72. Labeling images with a computer game. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 319–326. Cited by: §2.1.
73. Improving accessibility of the web with a computer game. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 79–82. Cited by: §2.1.
74. Peekaboom: a game for locating objects in images. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 55–64. Cited by: §2.1.
75. Efficiently Scaling up Crowdsourced Video Annotation: A Set of Best Practices for High Quality, Economical Video Labeling. International Journal of Computer Vision 101 (1), pp. 184–204 (en). External Links: ISSN 0920-5691, 1573-1405, Link, Document Cited by: §1, §1, §2.1, §2.2, §3.1, §3.1, §3, item 2, item 3, item 4, item 1, §4.1, §4.2, §4.3, §4, §5.1, §6.1, §7.1, §7.2.
76. LEVER: software tools for segmentation, tracking and lineaging of proliferating cells. Bioinformatics, pp. btw406 (en). External Links: ISSN 1367-4803, 1460-2059, Link, Document Cited by: §2.5.
77. Confusing the crowd: task instruction quality on amazon mechanical turk. In Fifth AAAI Conference on Human Computation and Crowdsourcing, Cited by: item 1.
78. Online object tracking: a benchmark. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2411–2418. Cited by: §3.1, §6.1.4.
79. Labelme video: building a video database with human annotations. In 2009 IEEE 12th International Conference on Computer Vision, pp. 1451–1458. Cited by: §1, §2.2, §7.2.
80. Human computation tasks with global constraints. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 217–226. Cited by: §2.4.
81. Dissonance between human and machine understanding. Proceedings of the ACM on Human-Computer Interaction 3 (CSCW), pp. 1–23. Cited by: §2.1.
82. Reviewing versus doing: learning and performance in crowd assessment. In Proceedings of the 17th ACM conference on Computer supported cooperative work & social computing, pp. 1445–1455. Cited by: §2.4.