DFKI Cabin Simulator: A Test Platform for Visual In-Cabin Monitoring Functions


We present a test platform for visual in-cabin scene analysis and occupant monitoring functions. The test platform is based on a driving simulator developed at the DFKI, consisting of a realistic in-cabin mock-up and a wide-angle projection system for a realistic driving experience. The platform has been equipped with a wide-angle 2D/3D camera system monitoring the entire interior of the vehicle mock-up of the simulator. It is also supplemented with a ground truth reference sensor system that tracks and records the occupant’s body movements synchronously with the 2D and 3D video streams of the camera. The resulting test platform will thus serve as a basis for validating numerous in-cabin monitoring functions, which are important for the realization of novel human-vehicle interfaces, advanced driver assistance systems, and automated driving. Among the considered functions are occupant presence detection, size and 3D-pose estimation, and driver intention recognition. In addition, our platform will be the basis for the creation of large-scale in-cabin benchmark datasets.


Keywords: In-Cabin Monitoring, Driving Simulator, Pose Estimation, Benchmark Dataset Creation

1 Introduction

In-cabin monitoring of vehicle occupants is a topic of increasing interest, driven by the ongoing development of advanced driver assistance systems (ADAS) and automated driving, up to driverless vehicles and shared mobility. The requirements on the monitoring functions are manifold and change with the level of driving automation. The demand for these novel monitoring functions also comes from new requirements for safety and comfort functions, as well as from novel human-vehicle interfaces.

The full monitoring and understanding of the scene in the vehicle cabin comprises not only the automatic detection, classification, and recognition of all occupants and objects, but also the estimation of the occupants’ pose and state, as well as the recognition of their activities, interactions, and intentions.

Of particular interest is the monitoring of the driver’s state and intention. Many research activities have recently concentrated on camera systems that monitor the driver’s face and infer the driver’s state, such as awareness and focus of attention, in order to realize novel warning functions and to support ADAS functions and future automated driving.

Although full-body pose tracking of humans has become an intensive research domain since the launch of the first Kinect camera [5, 18], full-body pose detection in vehicles is rarely investigated. The benefits of monitoring the full-body pose in a vehicle are manifold. They range from the recognition of arm gestures and intentions for comfort and advanced human-vehicle interfaces to a robust analysis of the driver’s activities and availability. The latter function is crucial for automated driving of levels 3 and 4, in which the hand-over of the vehicle control to the driver has to be managed.

Existing benchmark datasets for in-cabin monitoring functions, e.g. the VIVA challenge [21], often provide videos of confined areas inside the vehicle, where either the driver’s head or hand is located. Moreover, the annotations do not contain 3D ground truth data. Recently, several in-cabin benchmark datasets with precise ground truth 3D measurements have been published [15, 16], but they are restricted to the detection of the head pose. Full-body pose datasets exist, e.g. the MPII dataset [1], but this benchmark contains a very wide range of scenes, of which only a few are labeled “driving automobile”.

In this work, we go further by developing a test platform that allows recording large-scale datasets of complete in-cabin scenes. In this way, we close the gap between systems monitoring only the head or hands of the driver. Recorded data will comprise annotations for a wide range of monitoring functions, including also ground truth 3D body pose measurements of the vehicle occupants.

2 The DFKI in-cabin test platform

The in-cabin test platform is based on a driving simulator developed at the DFKI, consisting of a realistic in-cabin mock-up and a wide-angle projection system for a realistic driving experience. The test platform has been equipped with a wide-angle 2D/3D camera system for monitoring the entire interior of the vehicle mock-up and an optical ground truth reference sensor system that tracks and records the occupant’s body movements synchronously with the 2D and 3D video streams of the camera. Moreover, the precise positioning of the front seats can be controlled and registered via a CAN interface.

Fig. 1 shows the testing process in action. In addition to the three-screen projection of the simulator software, another monitor is mounted on the side of the simulator for the test engineer in order to control the recording or testing process. With this setup, it is possible to visualize data while recording and to test and demonstrate functionalities in real-time. The individual components of the test setup are described in more detail in the following sub-sections.

Figure 1: Driving actions recorded with a 2D/3D camera system mounted above the dashboard. The simulated street scene can be recognized on the right; the computer screen in the background displays the data stream from the camera.
(a) Full view with running simulation
(b) Simulation projectors
(c) Camera mounting front view
(d) Camera mounting back view
(e) Ground truth sensor system
(f) Ground truth sensor camera
(g) Mounting of calibration targets
(h) Seat positioning software
Figure 2: Images of the driving simulator.

2.1 Driving Simulation

We use the OpenDS simulator software [6], which provides a near-realistic driving environment. It offers over 20 different driving scenarios and driving tasks in urban and rural environments, including other traffic participants such as cars and pedestrians, as well as traffic lights. The software is also capable of simulating different environmental and weather conditions.

Three projectors (Fig. 2(b)) are used to project the simulated scene on a wide-angle screen covering almost the driver’s entire field of vision (Fig. 2(a)). This leads to a realistic driving experience and, most importantly, to realistic driver movements (e.g. head movement in turning maneuvers). Together with the simulation, the software also comprises a drive analyzer that gathers the data of the USB controller sensing driving actions such as steering, braking, and acceleration.

It is also possible to install other simulation software, for instance to cover special testing needs such as commercial or agricultural vehicles.

2.2 2D/3D camera system mounting

The test platform has been equipped with a wide-angle 2D/3D camera system for monitoring the entire interior of the vehicle mock-up of the simulator. The mounting position of the camera corresponds to the overhead module of the car close to the rear mirror (see Fig. 2(c) and 2(d)). A wide field of view is crucial in order to monitor the full vehicle cabin, including both the driver and the front-seat passenger, from that position. The metal fixation bar also allows mounting several cameras in parallel in similar positions. Section 3 discusses several 2D/3D camera systems that have been mounted and evaluated in this test setup.

Thanks to the multi-purpose aluminum profile framework, additional cameras and other sensors can also be mounted anywhere on the platform according to specific needs.

2.3 Ground truth sensor system

The test platform is supplemented with a ground truth sensor system that tracks and records the occupant’s movement at high frame rates, synchronously with the 2D and 3D video stream of the camera system monitoring the cabin of the driving simulator. The 3D tracking system chosen for that purpose is OptiTrack [12]. OptiTrack is a motion capture system that works with IR cameras detecting small reflective markers on the subject’s body joints. It has often been used as a reference system in the past (e.g. [22]). Fig. 2(e) and 2(f) show the OptiTrack cameras mounted around the driving simulator such that each tracking point on the subject is visible at all times. With this tracking mechanism, it is possible to record the position of each joint at every point in time with sub-millimeter accuracy. This enables easy and precise automated testing of algorithms. Fig. 3 shows a skeleton preview of the OptiTrack software.

Figure 3: Driving actions recorded with the OptiTrack [12] sensor system to extract ground truth joint positions. The rectangular pyramids indicate the position and rotation of the 12 cameras in the room. The detected 3D positions of the markers are shown as white dots around a schematic skeleton.

2.4 Sensor Calibration and Synchronization

A metal frame holding the simulator projectors is rigidly fixed to the vehicle mock-up and defines the so-called world coordinate system. Fig. 2(g) shows the dedicated calibration board that can be mounted on the simulator frame. It offers a checkerboard for camera calibration as well as a calibration square for the ground truth sensor system. This setup enables extrinsic calibration of both the installed camera and the ground truth sensor system in relation to the world coordinate system. The precisely mounted board allows precise calibration and position verification. The calibration board can be unmounted from the frame to make way for undisturbed data recording.
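Because both sensors are calibrated against the same world coordinate system, a marker position reported by the ground truth system can be expressed in the camera frame by chaining the two extrinsic transforms. The following is a minimal sketch of that chaining with NumPy. The function names and the example extrinsics are hypothetical and chosen only for illustration; they are not calibration results from the platform.

```python
import numpy as np


def make_pose(rotation, translation):
    """Build a 4x4 homogeneous transform from a 3x3 rotation and a 3-vector."""
    pose = np.eye(4)
    pose[:3, :3] = rotation
    pose[:3, 3] = translation
    return pose


def invert_pose(pose):
    """Invert a rigid transform using the transpose of its rotation part."""
    rotation = pose[:3, :3]
    translation = pose[:3, 3]
    inverse = np.eye(4)
    inverse[:3, :3] = rotation.T
    inverse[:3, 3] = -rotation.T @ translation
    return inverse


def tracker_to_camera(points_tracker, world_from_tracker, world_from_camera):
    """Map Nx3 points from the tracker frame into the camera frame."""
    camera_from_tracker = invert_pose(world_from_camera) @ world_from_tracker
    ones = np.ones((len(points_tracker), 1))
    homogeneous = np.hstack([points_tracker, ones])
    return (camera_from_tracker @ homogeneous.T).T[:, :3]


# Hypothetical extrinsics: the camera is rotated 90 degrees about the world
# z-axis and shifted; the tracker frame coincides with the world frame.
rot_z_90 = np.array([[0.0, -1.0, 0.0],
                     [1.0, 0.0, 0.0],
                     [0.0, 0.0, 1.0]])
world_from_camera = make_pose(rot_z_90, [0.5, 0.0, 1.2])
world_from_tracker = make_pose(np.eye(3), [0.0, 0.0, 0.0])

joint_in_tracker = np.array([[0.5, 1.0, 1.2]])
joint_in_camera = tracker_to_camera(
    joint_in_tracker, world_from_tracker, world_from_camera
)
print(joint_in_camera)
```

Keeping the world frame as the common reference means that either sensor can be remounted and recalibrated without touching the other sensor’s extrinsics.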

Synchronization between the camera and the ground truth sensor system is realized via a synchronization wire. The ground truth system sends periodic pulses at the exposure frequency as soon as data recording is started. The camera is triggered by these pulses and is thereby continuously re-synchronized to avoid drift.
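The hardware trigger already aligns the two streams. When working with recorded data, it can still be useful to verify that each camera frame has a ground truth sample close in time. The sketch below pairs timestamps by nearest neighbor and rejects pairs that are too far apart. It assumes that both streams carry timestamps on a common clock and that the tracker timestamps are sorted and contain at least two samples; the function name and the tolerance are hypothetical.

```python
import numpy as np


def match_frames(camera_times, tracker_times, max_offset):
    """For each camera timestamp, return the index of the closest tracker
    sample, or -1 if the closest sample is further away than max_offset."""
    camera_times = np.asarray(camera_times, dtype=float)
    tracker_times = np.asarray(tracker_times, dtype=float)

    # Index of the first tracker sample that is not earlier than each frame,
    # clipped so that both neighbors (left and right) are valid indices.
    right = np.searchsorted(tracker_times, camera_times)
    right = np.clip(right, 1, len(tracker_times) - 1)
    left = right - 1

    # Pick whichever neighbor is closer in time.
    left_gap = camera_times - tracker_times[left]
    right_gap = tracker_times[right] - camera_times
    nearest = np.where(left_gap <= right_gap, left, right)

    offsets = np.abs(tracker_times[nearest] - camera_times)
    return np.where(offsets <= max_offset, nearest, -1)


# Hypothetical timestamps in seconds.
tracker_times = [0.00, 0.01, 0.02, 0.03, 0.04]
camera_times = [0.0005, 0.0195, 0.1]
matches = match_frames(camera_times, tracker_times, max_offset=0.002)
print(matches)
```

A frame that maps to -1 indicates a dropped trigger or a gap in the recording and can be excluded from evaluation.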

2.5 Seat positioning

The positioning of the driver and passenger seats in the mock-up is controlled via the CAN bus system of the seats, which originate from a real automobile (see Fig. 2(h)). This access via CAN guarantees full traceability of the seat position while testing and also enables a precise seat adjustment according to predefined positions. Another very important aspect of testing or developing deep learning algorithms is to test and train them with a wide variability of data. For depth imagery, the most interesting and realistic way to add data variability is to change the seat position. The CAN control of the seats provides the full range of seat positions with a set of four degrees of freedom: seat height, seat position, backrest position, and seat tilt. Fig. 4 schematically shows these variables. In addition, the steering wheel position can also be altered, though it is not controlled via the CAN interface. With that variability, it is possible to cover the full range of possible real-world seat positions by a comprehensive pre-defined set of parameter combinations. The influence of the seat pose variation on the in-cabin scenery recorded by the selected 2D/3D camera is discussed further below in Section 3 (see also Fig. 6).

Figure 4: Variables for the seat positioning mechanism.
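A pre-defined set of parameter combinations over the four degrees of freedom can be enumerated as a simple grid. The following sketch assumes that each degree of freedom is expressed as a normalized value between 0.0 and 1.0 (the two end stops); this normalization, the axis names, and the choice of three levels are illustrative assumptions, and the sketch does not model the actual CAN messages.

```python
from itertools import product

# The four seat degrees of freedom named in the text.
SEAT_AXES = ("height", "position", "backrest", "tilt")


def seat_grid(levels):
    """Enumerate every combination of the given levels over the four axes."""
    return [
        dict(zip(SEAT_AXES, combo))
        for combo in product(levels, repeat=len(SEAT_AXES))
    ]


# Assumed normalized levels: both end stops plus the mid position.
configs = seat_grid([0.0, 0.5, 1.0])
print(len(configs))  # 3 levels on 4 axes -> 81 configurations
print(configs[0])
```

The number of configurations grows as the number of levels to the fourth power, which is why a reduced test matrix is attractive for recordings with human subjects.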

3 In-cabin monitoring with a 2D/3D camera system

3.1 Selection of a 2D/3D camera system for in-cabin monitoring

In order to monitor the entire interior of the vehicle mock-up from a mounting position corresponding to the overhead module of the car close to the rear mirror (see Section 2.2), a camera field of view of at least 120° is required. The depth-sensing range has to cover 25 cm to 200 cm. The camera also needs to capture data at a frame rate of at least 30 Hz in order to track a person’s movements.

Several depth cameras available on the market have been evaluated against these requirements. Tab. 1 summarizes the main optical characteristics of the reviewed cameras.

Camera Name         | Type                | Depth Field of View (°) | Depth Range (m)
Kinect v1 [7]       | IR Pattern and RGB  | 57 x 43                 | 1.2 - 3.5
Kinect v2 [8]       | ToF and RGB         | 70 x 60                 | 0.5 - 4.5
Azure Kinect [9]    | ToF and RGB         | 120 x 120               | 0.25 - 2.88
MYNT EYE S [11]     | Stereo + IR Pattern | 120 x 75                | 0.7 - 3 (IR range)
Stereolabs ZED [19] | Stereo RGB          | 90 x 60                 | 0.5 - 20
Structure Core [13] | IR Pattern and RGB  | 58 x 45                 | 0.4 - 5
Table 1: Reviewed cameras

Fig. 5 shows example RGB-D images of the various generations of Kinect cameras. The Kinect cameras are consumer cameras that are also commonly used in the research community, e.g. to record benchmark datasets [4]. The cameras have been mounted at the rear mirror position as described in Section 2.2 (see Fig. 2(c) and 2(d)).

Fig. 5 shows that only the latest generation, the Azure Kinect, provides a complete view of the cabin interior from the selected rear mirror perspective. The field of view of the Kinect v1 is too small to monitor the driver fully: the left arm is not visible. The Kinect v2 has a larger field of view, but the driver’s hands are too close to the camera for valid depth measurement. The other available depth sensors show limitations similar to those of the first two Kinect generations, as can already be seen from their datasheets (see Tab. 1), and were therefore not considered further for the in-cabin test setup.
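The datasheet screening can be reproduced by checking the values of Tab. 1 against the field-of-view and depth-range requirements stated at the beginning of this section. The sketch below is only an illustration: it assumes that the first field-of-view value in Tab. 1 is the horizontal one and that the 120° requirement applies to it, and it does not check the frame rate, which Tab. 1 does not list. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DepthCamera:
    name: str
    fov_h_deg: float    # first field-of-view value from Tab. 1 (assumed horizontal)
    fov_v_deg: float    # second field-of-view value from Tab. 1
    range_min_m: float  # shortest valid depth from Tab. 1
    range_max_m: float  # longest valid depth from Tab. 1


CAMERAS = [
    DepthCamera("Kinect v1", 57, 43, 1.2, 3.5),
    DepthCamera("Kinect v2", 70, 60, 0.5, 4.5),
    DepthCamera("Azure Kinect", 120, 120, 0.25, 2.88),
    DepthCamera("MYNT EYE S", 120, 75, 0.7, 3.0),
    DepthCamera("Stereolabs ZED", 90, 60, 0.5, 20.0),
    DepthCamera("Structure Core", 58, 45, 0.4, 5.0),
]


def meets_requirements(cam, min_fov_deg=120.0, near_m=0.25, far_m=2.0):
    """Check the field of view and the 25 cm to 200 cm depth range."""
    wide_enough = cam.fov_h_deg >= min_fov_deg
    covers_range = cam.range_min_m <= near_m and cam.range_max_m >= far_m
    return wide_enough and covers_range


suitable = [cam.name for cam in CAMERAS if meets_requirements(cam)]
print(suitable)  # ['Azure Kinect']
```

Under these assumptions the MYNT EYE S passes the field-of-view check but fails on the minimum depth range, which matches the observation that hands close to the camera are the limiting case.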

Figure 5: Comparison of depth imagery from three generations of Microsoft Kinect cameras. All images were recorded with the respective cameras in the same position. Depth values are shown in false-color representation where the range was limited to 1300 millimeters for better visualization. From top to bottom: Kinect v1 [7], Kinect v2 [8], Azure Kinect [9].

Based on these results, we selected the Azure Kinect. Although it has some limitations, it is the only camera that is capable of recording the driver’s whole body. On the downside, the IR transmitter is not strong enough to provide sufficient signal everywhere on the low-remitting black leather seat for valid depth measurement.

3.2 Influence of seat adjustment on the depth image

As discussed in Section 2.5, the position of the mock-up seats can be varied over the full range of four degrees of freedom. In order to demonstrate the influence of the seat positioning on the depth image, we have recorded a scene with a driver and an empty seat with a set of pre-defined seat positions (see Fig. 6). For that purpose, each of the four parameters was set to both extreme positions while the others were kept at mean positions. The backrest inclination (top row in Fig. 6) has a large influence on the depth image: the measured depth values, as well as the scale and shape of both driver and empty seat, change significantly when the backrest inclination is varied. Varying the height or position of the seat mainly changes the depth and thus the scale of the objects in the image, while a change of the tilt of the seat base has almost no influence on the depth image. These observations are important for designing an optimal test matrix covering the full range of scene variations with a manageable number of test cases.

Figure 6: Depth map comparison of extreme seat positions. In these images, the depth of each pixel is represented in a false-color map ranging from far away (red) to very close (blue). Each row shows the two extreme positions of each seat positioning parameter, with the others kept at mean value. From top to bottom: backrest, height, position, tilt.
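The recording scheme used for Fig. 6, in which one parameter at a time is set to each extreme while the others stay at their mean, can be written down as a short generator. This is a sketch under the assumption that each parameter is normalized to the range 0.0 to 1.0 with 0.5 as the mean position; the normalization and the function name are illustrative and not taken from the seat control software.

```python
# Seat parameters in the row order of Fig. 6.
SEAT_AXES = ("backrest", "height", "position", "tilt")


def one_at_a_time(axes, low=0.0, mid=0.5, high=1.0):
    """Set each axis to both extremes in turn, keeping the others at mid."""
    configs = []
    for axis in axes:
        for extreme in (low, high):
            config = {name: mid for name in axes}
            config[axis] = extreme
            configs.append(config)
    return configs


test_matrix = one_at_a_time(SEAT_AXES)
print(len(test_matrix))  # 4 axes x 2 extremes -> 8 configurations
for config in test_matrix:
    print(config)
```

This design needs two recordings per parameter instead of a full grid, at the cost of not capturing interactions between parameters.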

4 Proposed Usage

4.1 Seat occupancy classification

By monitoring the entire vehicle cabin, the DFKI test platform supports the development of multi-seat occupancy detection, including occupant classification, driver recognition, and object detection functions. The robustness of novel deep learning approaches for occupancy classification based on augmented and fully synthetic data will be tested against the variability of the scene in driving scenarios.

We use SVIRO [2], a large-scale synthetic dataset that provides multi-modal image data for scenarios within the passenger compartment of ten different vehicles. Fig. 7 shows an example depth image from the dataset. To show the promise of our test platform, we have trained a simple object detector for the task of detecting people on SVIRO synthetic depth images and evaluated it on real depth images captured in our driving simulator. Our model is based on the popular Faster R-CNN [14] architecture, with all training parameters the same as in the original paper. We used 10k synthetic images for training and let the model train until no further decrease in the loss was observed. To test our model, we recorded a driving scenario with one person in the driver’s seat in our driving simulator. We achieve a mean AP score of 88.6% for class ’person’ with our simple detector. Fig. 7(b) shows the results of our trained detector on an example image captured with our test platform.

(a) Depth Image from SVIRO [2] dataset
(b) Recorded depth image
Figure 7: Depth image from SVIRO dataset (left) with detected person on an image recorded with test platform (right). The depth values are coded as gray values, the detection is indicated by the green bounding box.
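Average precision for a detector is computed from matches between predicted and annotated bounding boxes, and such matches are commonly decided by the intersection over union (IoU) of the two boxes. The text does not state which overlap criterion or threshold was used for the reported score, so the following is only a generic sketch of the IoU computation; the box format and the example boxes are assumptions.

```python
def box_iou(box_a, box_b):
    """Intersection over union of two boxes given as
    (x_min, y_min, x_max, y_max) in pixel coordinates."""
    # Corners of the overlapping rectangle, if there is one.
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Width and height are clamped at zero for non-overlapping boxes.
    intersection = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


# Hypothetical detection and annotation for one depth image.
detected = (0, 0, 10, 10)
annotated = (5, 0, 15, 10)
print(box_iou(detected, annotated))
```

A detection would then count as correct when its IoU with an annotated box exceeds a chosen threshold; 0.5 is a common choice, but it is an assumption here.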

4.2 AutoPOSE - Driver head pose ground truth acquisition

The driving simulator was used in the acquisition of the AutoPOSE dataset [17]. AutoPOSE is a new dataset of head pose and eye gaze targets. An IR camera was located at the driver’s dashboard, giving an almost frontal view of the driver’s face. A Kinect v2 was placed at the location of the center mirror (rear mirror) of the car, providing three image types: IR, depth, and RGB images. A total of 21 subjects participated (10 female and 11 male). The dashboard IR camera subset consists of 1,018,885 IR images and the Kinect subset consists of 316,497 synchronized RGB, depth, and IR images. As mentioned in Section 2.3, we used the sub-millimeter accurate OptiTrack motion capture system to acquire accurate and reliable tracking data that can be used as ground truth. The driver’s head coordinate system and the cameras were calibrated and synchronized with the tracking system. In addition to the driver’s head pose, the eye gaze targets were acquired. The subjects were asked to gaze at reflective markers placed at driving-related locations, for example, side mirrors, center rear mirror, dashboard, road view, and media center. Fig. 8 shows samples of the dataset from the IR camera perspective and the Kinect camera perspective. Please refer to the dataset paper for further details.

(a) IR Camera - Dashboard. Row 1: raw images with head target reflective markers visible. Row 2: post-processing with markers covered (accurately localized). The second column shows the gaze annotation lamp.
(b) Kinect v2 [8] - Center mirror. Color, IR, and depth (color-mapped) images. Note: intensity was improved for visibility and printing purposes.
Figure 8: Sample dataset images from [17].

4.3 3D body pose tracking

3D body pose tracking is the basis for a comprehensive activity recognition of the vehicle occupants, and also supports a robust analysis of the driver’s state and availability. Since it is possible to equip the subject with as many tracking markers as needed, a broad range of applications can be tested with individual joint positioning.

A first deep learning approach towards full 3D body pose tracking in a car based on a spherical camera was published in [10] and [3] and has already been demonstrated in real-time in the driving simulator (Fig. 9). Comprehensive recording of driving scenarios with ground truth data will allow a quantitative evaluation of these algorithms.

Figure 9: Running 3D pose tracking algorithm in the driving simulator
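One common way to evaluate a pose tracker quantitatively against recorded ground truth is the mean Euclidean distance between estimated and reference joint positions. The text does not prescribe a particular metric, so the sketch below is one possible choice rather than the platform’s evaluation protocol. It assumes that both arrays are already expressed in the same coordinate frame and unit, are synchronized frame by frame, and use the same joint order; the function name is hypothetical.

```python
import numpy as np


def mean_per_joint_error(predicted, reference):
    """Mean Euclidean distance between predicted and reference joints.

    Both inputs have shape (frames, joints, 3) and share frame and unit.
    """
    predicted = np.asarray(predicted, dtype=float)
    reference = np.asarray(reference, dtype=float)
    if predicted.shape != reference.shape:
        raise ValueError("predicted and reference must have the same shape")
    distances = np.linalg.norm(predicted - reference, axis=-1)
    return float(distances.mean())


# Hypothetical data: 2 frames, 3 joints, every estimate offset by (3, 4, 0) mm.
reference = np.zeros((2, 3, 3))
predicted = reference + np.array([3.0, 4.0, 0.0])
print(mean_per_joint_error(predicted, reference))  # 5.0
```

Per-joint averages (taking the mean over frames only) can be more informative than a single number when some joints, such as the hands, are frequently occluded.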

4.4 Gesture, activity, and intention recognition

Another recent research activity has been the development of novel deep-learning-based hand pose and gesture recognition in a car [20]. It was shown that special recurrent neural networks allow the recognition of more complex hand gestures than state-of-the-art sensor systems currently support. The setup for this study was a time-of-flight camera installed in the rear mirror of a vehicle. The test platform allows an extension of this study towards the recognition of arm gestures and activities. It also offers the possibility to investigate approaches that fuse the visual information with other sensor modalities, e.g. vehicle driving parameters, for a robust recognition of the driver’s intention.

4.5 Benchmark dataset

In the future, the DFKI in-cabin test platform will be used to record and annotate a large-scale database of in-cabin scenes under realistically simulated driving scenarios. The database will support the benchmark testing of a range of monitoring functions such as those described above, primarily seat occupancy classification, occupant size and 3D-pose estimation, as well as driver activity and intention recognition. Apart from the ground truth data recording for the pose estimation, the data needs to be annotated with various application-specific attributes, including 2D and 3D labeling of the object location, size, and orientation. Semi-automated annotation tools are therefore under development.

5 Conclusion

In this paper, we presented a multi-purpose in-cabin test platform. It provides a near-realistic driving experience and is equipped with a 2D/3D camera setup and a ground truth sensor system. Thus it can be used to accurately evaluate driver monitoring algorithms and also to create in-cabin datasets. The construction was explained in detail, various use cases have been described and first results were shown. We lay great emphasis on high adaptability for different scenarios and we hope to use this platform for years to come in various projects.


This project has received funding within the Electronic Components and Systems for European Leadership (ECSEL) Joint Undertaking in collaboration with the European Union’s H2020 Framework Program and National Authorities, under grant agreement No. 826600 (VIZTA).




  1. M. Andriluka, L. Pishchulin, P. Gehler and B. Schiele (2014) 2D human pose estimation: new benchmark and state of the art analysis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  2. S. D. D. Cruz, O. Wasenmüller, H. Beise, T. Stifter and D. Stricker (2020) SVIRO: synthetic vehicle interior rear seat occupancy dataset and benchmark. In IEEE Winter Conference on Applications of Computer Vision (WACV).
  3. A. Elhayek, O. Kovalenko, P. Murthy, J. Malik and D. Stricker (2018) Fully automatic multi-person human motion capture for VR applications. In Virtual Reality and Augmented Reality - 15th EuroVR International Conference, EuroVR 2018, London, UK, October 22-23, 2018, Proceedings, pp. 28–47.
  4. M. Firman (2016) RGBD datasets: past, present and future. In CVPR Workshop on Large Scale 3D Data: Acquisition, Modelling and Analysis.
  5. I. Habibie, W. Xu, D. Mehta, G. Pons-Moll and C. Theobalt (2019) In the wild human pose estimation using explicit 2D features and intermediate 3D representations. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  6. R. Math, A. Mahr, M. M. Moniri and C. Muller (2013) OpenDS: a new open-source driving simulator for research. In GMM-Fachbericht-AmE 2013.
  7. Microsoft (2010) Kinect camera. http://www.xbox.com/en-US/kinect/default.htm, last accessed 2019/12/04 via https://web.archive.org.
  8. Microsoft (2014) Kinect for Windows v2. https://support.xbox.com/en-US/xbox-on-windows/accessories/kinect-for-windows-v2-info, last accessed 2019/12/04.
  9. Microsoft (2019) Azure Kinect camera. https://azure.microsoft.com/en-in/services/kinect-dk/, last accessed 2019/12/04.
  10. P. N. Murthy, O. Kovalenko, A. Elhayek, C. Couto Gava and D. Stricker (2017) 3D human pose tracking inside car using single RGB spherical camera. In ACM Chapters Computer Science in Cars Symposium (CSCS).
  11. MYNTAI (2018) MYNT EYE S camera. https://www.mynteye.com/products/mynt-eye-stereo-camera, last accessed 2019/12/04.
  12. NaturalPoint. OptiTrack 3D tracking system. https://optitrack.com/, last accessed 2019/12/04.
  13. Occipital (2018) Structure Core camera. https://structure.io/structure-core, last accessed 2019/12/04.
  14. S. Ren, K. He, R. Girshick and J. Sun (2017) Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (6), pp. 1137–1149. ISSN 2160-9292.
  15. M. Roth and D. M. Gavrila (2019) DD-Pose - a large-scale driver head pose benchmark. In IEEE Intelligent Vehicles Symposium.
  16. A. Schwarz, M. Haurilet, M. Martinez and R. Stiefelhagen (2017) DriveAHead - a large-scale driver head pose dataset. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  17. M. Selim, A. Firintepe, A. Pagani and D. Stricker (2020) AutoPOSE: large-scale automotive driver head pose and gaze dataset with deep head pose baseline. In Proceedings of the 15th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - VISAPP.
  18. J. Shotton, A. Fitzgibbon, A. Blake, A. Kipman, M. Finocchio, B. Moore and T. Sharp (2011) Real-time human pose recognition in parts from a single depth image. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  19. Stereolabs (2015) Stereolabs ZED camera. https://www.stereolabs.com/zed/, last accessed 2019/12/04.
  20. A. Tewari, B. Taetz, F. Grandidier and D. Stricker (2017) A probabilistic combination of CNN and RNN estimates for hand gesture based interaction in car. In IEEE International Symposium on Mixed and Augmented Reality (ISMAR).
  21. VIVA (2016) Vision for intelligent vehicles and applications. http://cvrr.ucsd.edu/vivachallenge/, last accessed 2019/07/25.
  22. O. Wasenmüller, M. Meyer and D. Stricker (2016) CoRBS: comprehensive RGB-D benchmark for SLAM using Kinect v2. In IEEE Winter Conference on Applications of Computer Vision (WACV).