Joint Attention in Autonomous Driving (JAAD)

Iuliia Kotseruba, Amir Rasouli and John K. Tsotsos
{yulia_k, aras, tsotsos}

In this paper we present a novel dataset for a critical aspect of autonomous driving: the joint attention that must occur between drivers and pedestrians, cyclists or other drivers. This dataset is produced with the intention of demonstrating the behavioral variability of traffic participants. We also show how the visual complexity of behaviors and scene understanding is affected by factors such as weather conditions, geographical location, traffic and the demographics of the people involved. The ground truth data conveys information regarding the location of participants (bounding boxes), the physical conditions (e.g. lighting and speed) and the behavior of the parties involved.



I Introduction

Autonomous driving has been a topic of interest for decades. Implementing autonomous vehicles can have great economic and social impacts, including reducing the cost of driving, increasing fuel efficiency and safety, enabling transportation for non-drivers and reducing the stress of driving by allowing motorists to rest and work while traveling [1]. As for the macroeconomic impacts, it is estimated that the autonomous vehicle industry and related software and hardware technologies will account for a market size of more than 40 billion dollars by 2030 [2].

Partial autonomy has long been used in commercial vehicles in the form of technologies such as cruise control, park assist and automatic braking. Fully autonomous vehicles have also been successfully developed and tested under certain conditions. For example, the 2005 DARPA Grand Challenge set the task of autonomously driving a predefined 7.32-mile course in the deserts of Nevada. Out of 23 final contestants, 4 cars successfully completed the course within the allowable time limit (10 hours) while driving fully autonomously [3].

Despite such success stories in autonomous control systems, designing fully autonomous vehicles for urban environments still remains an unsolved problem. Aside from challenges associated with developing suitable infrastructures and regulating the autonomous behaviors [1], in order to be usable in urban environments autonomous cars must have a high level of precision and meet very high safety standards [4].

Today one of the major dilemmas faced by autonomous vehicles is how to interact with the environment including infrastructure, cars, drivers or pedestrians [5, 6, 7]. The lapses in communication can be a source of numerous erroneous behaviors [8] such as failure to predict the movement of other vehicles [9, 10] or to respond to unexpected behaviors of other drivers [11].

The impact of perceptual failures on the behavior of an autonomous car is also evident in the 2015 annual report on Google’s self-driving car [12]. This report is based on testing self-driving cars for more than miles of driving on public roads including both highways and streets. Throughout these trials, a total of 341 disengagements occurred in which the driver had to take over the car, and about of the cases occurred in busy streets. The interesting implication here is that over of the disengagements were due to “perception discrepancy” in which the vehicle was unable to understand its environment and about 10% of the cases were due to incorrect prediction of traffic participants and inability to respond to reckless behaviors.

There have been a number of recent developments to address these issues. A natural solution is establishing wireless communication between traffic participants. This approach has been tested for a number of years using cellular technology [7, 13]. This technique enables vehicle-to-vehicle (V2V) and vehicle-to-infrastructure (V2I) communication, allowing tasks such as Cooperative Adaptive Cruise Control (CACC), improving positioning technologies such as GPS, and intelligent speed adaptation on various roads. Peer-to-peer traffic communication is expected to enter the market by 2019.

Although V2V and V2I communications are deemed to solve a number of issues in autonomous driving, they also have a number of drawbacks. This technology relies heavily on cellular technology which is costly and has much lower reliability compared to traditional sensors such as radars and cameras. In addition, communication highly depends on all parties functioning properly. A malfunction in any communication device in any of the systems involved can lead to catastrophic safety issues.

Maintaining communication with pedestrians is even more important for safe autonomous driving. Inattention from both drivers and pedestrians regardless of their knowledge of the vehicle code is one of the major reasons for traffic accidents, most of which involve pedestrians at crosswalk locations [14].

Honda recently released a new technology similar to V2V communication that attempts to establish a connection with pedestrians through their cellular phones [15]. Using this method, the pedestrian’s phone broadcasts its position, warning the autonomous car that a pedestrian is about to cross the street so the car can respond accordingly. This technology can also go one step further and inform the car about the state of the pedestrian, for instance, whether he/she is listening to music, texting or on a call. Given the technological and regulatory obstacles to developing such technologies, using them in the near future does not seem feasible.

In late 2015 Google patented a different technology to communicate with pedestrians using a visual interface called pedestrian notifications [16]. In this approach, the Google car estimates the trajectories of pedestrian movements. If the car finds the behavior of a pedestrian to be uncertain (i.e. cannot decide whether or not he/she is crossing the street), it notifies the corresponding pedestrian about the action it is about to take using a screen installed on the front hood of the car. Another proposed option for communication is via a sound device or other kinds of physical devices (possibly a robotic arm). This technology has been criticized for being distracting and lacking the ability to efficiently communicate if more than one pedestrian is involved.

Given the problems associated with establishing explicit communication with other vehicles and pedestrians, Nissan, in their latest development, announced a passive method of dealing with uncertainties in the behavior of pedestrians [17, 18], in an attempt to understand the behaviors of human drivers and pedestrians in various traffic scenarios. The main objective of this work is to passively predict pedestrian behavior using visual input and only use an interface, e.g. a green light, to inform them about the intention of the autonomous car.

Toyota in partnership with MIT and Stanford recently announced using a similar passive approach toward autonomous driving [19]. Information such as the type of equipment the pedestrian carries, his/her pose, direction of motion and behavior as well as the human driver’s reactions to events are extracted from videos of traffic situations and their 3D reconstructions. This information is used to design a control system for determining what course of action to take in given situations. At present no data resulting from this study has been released and very little information is available about the type of autonomous behavior the scientists are seeking to obtain.

Given the importance of pedestrian safety, multiple studies were conducted in the last several decades to find factors that influence the decisions and behavior of traffic participants. For example, a driver’s pedestrian awareness can be measured based on whether the driver is decelerating upon seeing a pedestrian [20, 21, 22]. Several recent studies also point out that a pedestrian’s behavior, such as establishing eye contact, smiling and other forms of non-verbal communication, can have a significant impact on the driver’s actions [23, 24]. Although a majority of these studies are aimed at developing better infrastructure and traffic regulations, their conclusions are relevant for autonomous driving as well.

In an attempt to better understand the problem of vehicle to vehicle (V2V) and vehicle to pedestrian (V2P) communication in the autonomous driving context we suggest viewing it as an instance of joint attention and discuss why existing approaches may not be adequate in this context. We propose a novel dataset that highlights the visual and behavioral complexity of traffic scene understanding and is potentially valuable for studying the joint attention issues.

II Autonomous driving and joint attention

According to a common definition, joint attention is the ability to detect and influence an observable attentional behavior of another agent in social interaction and acknowledge them as an intentional agent [25]. However, it is important to note that joint attention is more than simultaneous looking, attention detection and social coordination; it also includes an intentional understanding of the observed behavior of others.

Since joint attention is a prerequisite for efficient communication, it has been gaining increasing interest in the fields of robotics and human-robot interaction. Kismet [26] and Cog [27], both built at MIT in the late 1990s, were some of the first successes in social robotics. These robots were able to maintain and follow eye gaze, reacted to the behavior of their caregivers and recognized simple gestures such as declarative pointing. More recent work in this area is likewise concerned with gaze following [28, 29, 27], pointing [30, 27] and reaching [31], turn-taking [32] and social referencing [33]. With a few exceptions [34, 35], almost all joint attention scenarios are implemented with stationary robots or robotic heads according to a recent comprehensive survey [36].

Surprisingly, despite increasing interest in joint attention in the fields of robotics and human-robot interaction, it has not been explicitly mentioned in the context of autonomous driving. For example, communication between the driver and pedestrian is an instance of joint attention. Consider the following scenario: a pedestrian crossing the street (shown in Figure 1). Initially she is looking at her cell phone, but as she approaches the curb, she looks up and slows down because the vehicle is still moving. When the car slows down, she speeds up and crosses the street. In this scenario all elements of joint attention are apparent. Looking at the car and walking slower is an observable attentional behavior. The driver slowing down the car indicates that he noticed the pedestrian and is yielding. His intention is clearly interpreted as such, as the pedestrian speeds up and continues to cross. A similar scene, also shown in Figure 1, has the pedestrian standing at the crossing and looking both ways to find a gap in traffic. Again, once she notices that the driver is slowing down, she begins to cross.

While these are fairly typical behaviors for marked crossings, there are many more possible scenarios of communication between traffic participants. Humans recognize a myriad of “social cues” in everyday traffic situations. Apart from establishing eye contact or waving hands, people may be making assumptions about the way a driver would behave based on visual characteristics such as the car’s make and model [6]. Understanding these social cues is not always straightforward. Aside from visual processing challenges such as variation in lighting conditions, weather or scene clutter, there is also a need to understand the context in which the social cue is observed. For instance, if the autonomous car sees someone waving his hand, it needs to know whether it is a policeman directing traffic, a pedestrian attempting to cross the street or someone hailing a taxi. Consider two further examples in Figure 1: in one, a man crossing the street makes a slight gesture as a signal of yielding to the driver; in the other, a man is jaywalking and acknowledges the driver with a hand gesture. The appropriate response from the driver’s perspective can be quite different in each of these scenarios and would require high-level reasoning and deep scene analysis.

Figure 1: Examples of joint attention

Today, automotive industry giants such as BMW, Tesla, Ford and Volkswagen, who are actively working on autonomous driving systems, rely on visual analysis technologies developed by Mobileye to handle obstacle avoidance, pedestrian detection or traffic scene understanding. Mobileye’s approach to solving visual tasks is to use deep learning techniques which require a large amount of data collected from hundreds of hours of driving. This system has been successfully tested and is currently being used in semi-autonomous vehicles. However, the question remains open whether deep learning suffices for achieving full autonomy, in which tasks are not limited to detection of pedestrians, cars or obstacles (which is still not fully reliable [37, 38]), but also involve merging with ongoing traffic, dealing with unexpected behaviors such as jaywalking, responding to emergency vehicles, and yielding to other vehicles or pedestrians at intersections.

To answer this question we need to consider the following characteristics of deep learning algorithms. First, even though deep learning algorithms perform very well in tasks such as object recognition, they lack the ability to establish causal relationships between what is observed and the context in which it has occurred [39, 40]. This problem has also been empirically demonstrated by training neural networks over various types of data [39].

The second limitation of deep learning is the lack of robustness to changes in visual input [41]. This problem can occur when a deep neural network misclassifies an object due to minor changes (at a pixel level) to an image [42] or even recognizes an object from a randomly generated image [43].

III Existing datasets

The autonomous driving datasets currently available to the public are primarily intended for applications such as 3D mapping, navigation, and car and pedestrian detection. Only a limited number of these datasets contain data that can be used for behavioral studies. Some of them are listed below.

  • KITTI [44]: This is perhaps one of the best-known publicly available datasets for autonomous driving. It contains data collected from various locations such as residential areas and city streets, highways and gated environments. The main applications are 3D reconstruction, 3D detection, tracking and visual odometry. Some of the videos in KITTI show pedestrians, other vehicles and cyclists moving alongside the car. The data has no annotation of their behaviors.

  • Caltech pedestrian detection benchmark [45]: This is a very large dataset of pedestrians consisting of approximately 10 hours of driving in regular traffic in urban environments. The annotations include temporal correspondence between bounding boxes around pedestrians and detailed occlusion labels.

  • Berkeley pedestrian dataset [46]: This dataset consists of a large number of videos of pedestrians collected from a stationary car at street intersections. Bounding boxes around pedestrians are provided for pedestrian detection and tracking.

  • Semantic Structure From Motion (SSFM) [47]: As the name implies, this dataset is collected for scene understanding. The annotation is limited to bounding boxes around the objects of interest and name tags for the purpose of detection. This dataset includes a number of street view videos of cars and pedestrians walking.

  • The German Traffic Sign Detection Benchmark [48]: This dataset consists of 900 high-resolution images of roads and streets some of which show pedestrians crossing and cars. The ground truth for the dataset only specifies the positions of traffic signs in the images.

  • The .enpeda.. (environment perception and driver assistance) Image Sequence Analysis Test Site (EISATS) [49]: EISATS contains short synthetic and real videos of cars driving on roads and streets. The sole purpose of this dataset is comparative performance evaluation of stereo vision and motion analysis. The available annotation is limited to the camera’s intrinsic parameters.

  • Daimler Pedestrian Benchmark Datasets [50]: These are particularly useful datasets for various scenarios of pedestrian detection such as segmentation, classification, and path prediction. The sensors of choice are monocular and binocular cameras and the datasets contain both color and grayscale images. The ground truth data is limited to the detection applications and does not include any behavioral analysis.

  • UvA Person Tracking from Overlapping Cameras Datasets [51]: These datasets are mainly concerned with the tasks of tracking, and pose and trajectory estimation using multiple cameras. The ground truth is likewise limited to facilitating tracking applications.

In recent years the traffic behavior of drivers and pedestrians has become a widely studied topic for collision prevention and traffic safety. Several large-scale naturalistic driving studies have been conducted in the USA [52, 53, 54], which accumulated over 4 petabytes of data (video, audio, instrumental, traffic, weather, etc.) from hundreds of volunteer drivers in multiple locations. However, only some depersonalized general statistics are available to the general public [55], while only qualified researchers have access to the raw video and sensor data.

IV The JAAD Dataset

The JAAD dataset was created to facilitate studying the behavior of traffic participants. The data consists of 346 high-resolution video clips (5-15s) with annotations showing various situations typical for urban driving. These clips were extracted from approx. 240 hours of driving videos collected in several locations. Two vehicles equipped with wide-angle video cameras were used for data collection (Table I). Cameras were mounted inside the cars in the center of the windshield below the rear view mirror.

Figure 2: A timeline of events recovered from the behavioral data. Here a single pedestrian is crossing the parking lot. Initially the driver is moving slow and, as he notices the pedestrian ahead, slows down to let her pass. At the same time the pedestrian crosses without looking first, then turns to check if the road is safe, and as she sees the driver yielding, continues to cross. The difference in resolution between the images is due to the changes in distance to the pedestrian as the car moves forward.

The video clips represent a wide variety of scenarios involving pedestrians and other drivers. Most of the data was collected in urban areas (downtown and suburban); only a few clips were filmed in rural locations. Many of the situations resemble the ones we have described earlier, where pedestrians wait at designated crossings. In other samples pedestrians may be walking along the road and looking back to see if there is a gap in traffic (Figure 4), peeking from behind an obstacle to see if it is safe to cross (Figure 4), waiting to cross on a divider between the lanes, carrying heavy objects or walking with children or pets. Our dataset captures pedestrians of various ages walking alone and in groups, which may be a factor affecting their behavior. For example, elderly people and parents with children may walk slower and be more cautious. The dataset contains fewer clips of interactions with other drivers; most of them occur at uncontrolled intersections, in parking lots or when another driver is moving across several lanes to make a turn.

| # of clips | Location | Resolution | Camera model |
|---|---|---|---|
| 55 | North York, ON, Canada | | GoPro HERO+ |
| 287 | Kremenchuk, Ukraine | | Highscreen Black Box Connect |
| 6 | Hamburg, Germany | | Highscreen Black Box Connect |
| 5 | New York, USA | | GoPro HERO+ |
| 4 | Lviv, Ukraine | | Garmin GDR-35 |

Table I: Locations and equipment used to capture videos in the JAAD dataset

Most of the videos in the dataset were recorded during the daytime and only a few clips were filmed at night, sunset and sunrise. The last two conditions are particularly challenging, as the sun is glaring directly into the camera (Figure 5b). We also tried to capture a variety of weather conditions (Figure 5), as yet another factor affecting the behavior of traffic participants. For example, during heavy snow or rain, people wearing hooded jackets or carrying umbrellas may have limited visibility of the road. Since their faces are obstructed, it is also harder to tell from the driver’s perspective whether they are paying attention to the traffic.

We attempted to capture all of these conditions for further analysis by providing two kinds of annotations for the data: bounding boxes and textual annotations. Bounding boxes are provided only for cars and pedestrians that interact with or require the attention of the driver (e.g. another car yielding to the driver, a pedestrian waiting to cross the street, etc.). Bounding boxes for each video are written into an XML file with frame number, coordinates, width, height and occlusion flag.
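To make the bounding box record concrete, the following is a minimal sketch of how such a file might be read. The text specifies only which fields are stored (frame number, coordinates, width, height and occlusion flag); the element name `box`, the attribute names and the sample values below are hypothetical and chosen purely for illustration, so the actual files may be organized differently.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Box:
    """One bounding box record with the fields listed in the text."""
    frame: int
    x: float
    y: float
    width: float
    height: float
    occluded: bool


# Hypothetical layout: the tag and attribute names are assumptions,
# not the dataset's documented schema.
SAMPLE_XML = """<video name="example_clip">
  <box frame="0" x="412" y="230" width="38" height="96" occluded="0"/>
  <box frame="1" x="415" y="231" width="38" height="97" occluded="1"/>
</video>"""


def load_boxes(xml_text):
    """Parse every <box> element into a Box record, in document order."""
    root = ET.fromstring(xml_text)
    boxes = []
    for node in root.iter("box"):
        boxes.append(
            Box(
                frame=int(node.get("frame")),
                x=float(node.get("x")),
                y=float(node.get("y")),
                width=float(node.get("width")),
                height=float(node.get("height")),
                occluded=node.get("occluded") == "1",
            )
        )
    return boxes


if __name__ == "__main__":
    for box in load_boxes(SAMPLE_XML):
        print(box)
```

Keeping the occlusion flag as a boolean makes it easy to filter out occluded boxes before training or evaluating a detector.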

| Categorical variable | Values |
|---|---|
| time_of_day | day/night |
| weather | clear/snow/rain/cloudy |
| location | street/indoor/parking_lot |
| designated_crossing | yes/no |
| age_gender | Child/Young/Adult/Senior Male/Female |

| Behavior event | Type |
|---|---|
| Crossing | state |
| Stopped | state |
| Moving fast | state |
| Moving slow | state |
| Speed up | state |
| Slow down | state |
| Clear path | state |
| Looking | state |
| Look | point |
| Signal | point |
| Handwave | point |
Table II: Variables associated with each video and types of events represented in the dataset. There are two types of behavior events: state and point. State events may have an arbitrary duration, while point events last a short fixed amount of time (0.1 sec) and signify a quick glance or gesture made by pedestrians.

Observation id GOPR0103_528_542

Media file(s)

Player #1 GOPR0103_528_542.MP4

Observation date 2016-07-15 15:15:38


Time offset (s) 0.000

Independent variables

variable value
weather rain
age_gender AF
designated no
location plaza
time_of_day daytime
| Time | Media file path | Media total length | FPS | Subject | Behavior | Comment | Status |
|---|---|---|---|---|---|---|---|
| 0.19 | GOPR0088_335_344.MP4 | 9.01 | 29.97 | Driver | moving slow | | START |
| 0.208 | GOPR0088_335_344.MP4 | 9.01 | 29.97 | pedestrian | crossing | | START |
| 0.308 | GOPR0088_335_344.MP4 | 9.01 | 29.97 | pedestrian | looking | | START |
| 1.301 | GOPR0088_335_344.MP4 | 9.01 | 29.97 | Driver | moving slow | | STOP |
| 1.302 | GOPR0088_335_344.MP4 | 9.01 | 29.97 | Driver | slow down | | START |
| 1.892 | GOPR0088_335_344.MP4 | 9.01 | 29.97 | pedestrian | looking | | STOP |
| 8.351 | GOPR0088_335_344.MP4 | 9.01 | 29.97 | pedestrian | crossing | | STOP |
| 8.99 | GOPR0088_335_344.MP4 | 9.01 | 29.97 | Driver | slow down | | STOP |
Figure 3: Example of textual annotation for a video created using BORIS. The file contains the id and the name of the video file, a tab-separated list of independent variables (weather, age and gender of pedestrians, whether the crossing is designated or not, location and time of the day) and a tab-separated list of events. Each event has an associated time stamp, subject, behavior and status, which may be used to recover sequence of events for analysis.

Textual annotations are created using BORIS [56], an event-logging software for video observations. It allows assigning predefined behaviors to different subjects seen in the video, and can also save some additional data, such as the video file id, the location where the observation was made, etc.

A list of all behaviors, independent variables and their values is shown in Table II. We save the following data for each video clip: weather, time of the day, age and gender of the pedestrians, location and whether it is a designated crosswalk. Each pedestrian is assigned a label (pedestrian1, pedestrian2, etc.). We also distinguish between the driver inside the car and other drivers, who are labeled as “Driver” and “Driver_other” respectively. This is necessary for situations where two or more drivers are interacting. Finally, a range of behaviors is defined for drivers and pedestrians: walking, standing, looking, moving, etc.

An example of textual annotation is shown in Figure 3. The sequence of events recovered from this data is shown in Figure 2.
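As an illustration of how a sequence of events might be recovered from the tab-separated event list, the sketch below pairs each START row with the matching STOP row for the same subject and behavior. It assumes the column order shown in Figure 3 (time, media file, total length, FPS, subject, behavior, comment, status); the function name `recover_intervals` is hypothetical, and only state events are handled, not point events.

```python
import csv
import io

# Sample rows copied from the event list in Figure 3, with tab separators.
# The column order is assumed to match that figure.
SAMPLE_EVENTS = (
    "0.19\tGOPR0088_335_344.MP4\t9.01\t29.97\tDriver\tmoving slow\t\tSTART\n"
    "0.208\tGOPR0088_335_344.MP4\t9.01\t29.97\tpedestrian\tcrossing\t\tSTART\n"
    "0.308\tGOPR0088_335_344.MP4\t9.01\t29.97\tpedestrian\tlooking\t\tSTART\n"
    "1.301\tGOPR0088_335_344.MP4\t9.01\t29.97\tDriver\tmoving slow\t\tSTOP\n"
    "1.302\tGOPR0088_335_344.MP4\t9.01\t29.97\tDriver\tslow down\t\tSTART\n"
    "1.892\tGOPR0088_335_344.MP4\t9.01\t29.97\tpedestrian\tlooking\t\tSTOP\n"
    "8.351\tGOPR0088_335_344.MP4\t9.01\t29.97\tpedestrian\tcrossing\t\tSTOP\n"
    "8.99\tGOPR0088_335_344.MP4\t9.01\t29.97\tDriver\tslow down\t\tSTOP\n"
)


def recover_intervals(tsv_text):
    """Pair START/STOP rows into (subject, behavior, start, stop) tuples."""
    open_events = {}  # (subject, behavior) -> start time of the open state
    intervals = []
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if len(row) != 8:
            continue  # skip blank or malformed lines
        time, _file, _length, _fps, subject, behavior, _comment, status = row
        key = (subject, behavior)
        if status == "START":
            open_events[key] = float(time)
        elif status == "STOP" and key in open_events:
            intervals.append((subject, behavior, open_events.pop(key), float(time)))
    # Order the timeline by the moment each state began.
    return sorted(intervals, key=lambda interval: interval[2])


if __name__ == "__main__":
    for subject, behavior, start, stop in recover_intervals(SAMPLE_EVENTS):
        print(f"{subject:<10} {behavior:<12} {start:6.3f} -> {stop:6.3f}")
```

Representing each state as an interval makes overlaps explicit: in this sample the pedestrian's "looking" interval begins while the driver is still "moving slow" and ends after the driver starts to "slow down", which is the kind of exchange the timeline is meant to expose.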

Figure 4: More examples of joint attention
(a) Sunset
(b) Sunrise
(c) After a heavy snowfall
(d) During a heavy rain
(e) Multiple pedestrians crossing
(f) At the parking lot
Figure 5: Sample frames from the dataset showing different weather conditions and locations.
Figure 6: A selection of images of pedestrians from the dataset

The dataset is available to download at

V Conclusion

In this paper we presented a new dataset for the purpose of studying joint attention in the context of autonomous driving. Two types of annotations accompanying each video clip in the dataset make it suitable for pedestrian and car detection, as well as other areas of research, which could benefit from studying joint attention and human non-verbal communication, such as social robotics.


We thank Mr. Viktor Kotseruba for assistance with processing videos for this dataset.

