SLAMCast: Large-Scale, Real-Time 3D Reconstruction and Streaming
for Immersive Multi-Client Live Telepresence

Patrick Stotko, Stefan Krumpen,
Matthias B. Hullin, Michael Weinmann, Reinhard Klein
Institute of Computer Science II, University of Bonn, Germany

Real-time 3D scene reconstruction from RGB-D sensor data, as well as the exploration of such data in VR/AR settings, has seen tremendous progress in recent years. The combination of both these components into telepresence systems, however, comes with significant technical challenges. All approaches proposed so far are extremely demanding on input and output devices, compute resources and transmission bandwidth, and they do not reach the level of immediacy required for applications such as remote collaboration. Here, we introduce what we believe is the first practical client-server system for real-time capture and many-user exploration of static 3D scenes. Our system is based on the observation that interactive frame rates are sufficient for capturing and reconstruction, and real-time performance is only required on the client site to achieve lag-free view updates when rendering the 3D model. Starting from this insight, we extend previous voxel block hashing frameworks by overcoming internal dependencies and introducing, to the best of our knowledge, the first thread-safe GPU hash map data structure that is robust under massively concurrent retrieval, insertion and removal of entries on a thread level. We further propose a novel transmission scheme for volume data that is specifically targeted to Marching Cubes geometry reconstruction and enables a 90% reduction in bandwidth between server and exploration clients. The resulting system poses very moderate requirements on network bandwidth, latency and client-side computation, which enables it to rely entirely on consumer-grade hardware, including mobile devices. We demonstrate that our technique achieves state-of-the-art representation accuracy while providing, for any number of clients, an immersive and fluid lag-free viewing experience even during network outages.

Keywords: remote collaboration, live telepresence, real-time reconstruction, voxel hashing, RGB-D, real-time streaming
Journal year: 2018. Publication month: 5. Copyright: rights retained.
CCS concepts: Computing methodologies → Virtual reality; Human-centered computing → Virtual reality; Human-centered computing → Collaborative interaction; Computing methodologies → Reconstruction
Figure 1. Illustration of our novel multi-client live telepresence framework for remote collaboration: RGB-D data captured with standard consumer-grade depth sensors represents the input to our real-time large-scale reconstruction technique that is based on a novel thread-safe GPU hash map data structure. Efficient data streaming is achieved by transmitting a novel compact representation of the reconstructed model in terms of Marching Cubes indices. Multi-client live telepresence is achieved by the server’s independent handling of the requests by the clients.

1. Introduction

One of the main motivations behind virtual reality research has always been to allow users to immersively and subjectively explore remote places or environments. An experience of telepresence could benefit applications as diverse as remote collaboration, entertainment, advertisement, teaching, hazard site exploration, or rehabilitation. Thanks to advances in display technology and the emergence of high-resolution head-mounted devices, we have seen a recent surge in virtual reality solutions. However, it has long been known that traditional display parameters like resolution, frame rate and contrast are not the only factors contributing to an immersive viewing experience. The presentation of the data, its consistency, low-latency control to avoid motion sickness, the degree of awareness and the suitability of controller devices are just as important [Fontaine:1992; Held:1992; Witmer:1998]. For applications such as remote exploration, remote collaboration or teleconferencing, these conditions are not easily met, as the scene is not pre-built but needs to be reconstructed on-the-fly from 3D input data captured by a person or robotic device. A particular challenge, therefore, is to find a suitable coupling between the acquisition and viewing stages that respects the practical limitations imposed by available network bandwidth and client-side compute hardware. The data flow in a well-designed system should give multiple remote users the freedom to individually explore, for instance using head-mounted displays (HMD), the current state of reconstruction in the most responsive way possible.

Every modern 3D telepresence system has at its heart a powerful real-time 3D reconstruction framework. We distinguish two main use cases. Systems for transmitting dynamic 3D models of their users typically rely on massive well-calibrated acquisition setups such as rooms full of cameras and tracking sensors [Vasudevan:2011; Orts-Escolano:2016; Fairchild:2016], or they use sparse 3D input to gradually assemble template meshes (e.g.  [Casas:2016]) or drive avatars with a relatively low number of degrees of freedom (e.g.  [Hu:2017]). In this work, we direct our attention to the remote exploration of places using portable, consumer-grade acquisition devices, for instance in scenarios of remote inspection or consulting. This implies solving a simultaneous localization and mapping (SLAM) problem as addressed by solutions such as KinectFusion [kinectfusion2; Whelan:2012], DynamicFusion [Newcombe:2015], Fusion4D [Dou:2016] or voxel block hashing [niessner; infinitam] and variants thereof. The low-cost multi-client system proposed in this paper enables efficient remote exploration of quasi-static scenes by multiple independent clients. On the acquisition site, a user digitizes their physical environment using consumer-grade 3D capture hardware. Remote clients can perform immersive and interactive live inspection of that environment using off-the-shelf VR devices even while it is acquired and progressively refined. While comparable settings have been addressed by others before [mossel], their systems are targeted to a single user, they have high bandwidth requirements for streaming the model, they require the reconstructed scene to be tightly synchronized between server and client, and they are highly sensitive to network interruptions. In contrast, our novel remote collaboration system is designed to handle arbitrary numbers of exploration clients under real-world network conditions (including the recovery from full outages) and using consumer-grade hardware. 
These features are enabled by the following key innovations:

  • We propose a novel efficient large-scale 3D reconstruction suitable for multi-client telepresence that is enabled by – to the best of our knowledge – the first thread-safe GPU hash map that guarantees successful concurrent retrieval, insertion and removal of millions of entries on the fly on a thread level while preserving key uniqueness without any prior knowledge about the data.

  • A novel scene representation and transmission protocol enables the system to operate in low-bandwidth remote connection scenarios. Rather than reconstructing geometry on the server site or even performing server-side rendering, our system encodes the scene as a compressed sequence of voxel block indices and values, leaving the geometry reconstruction using Marching Cubes [Lorensen:1987:MCH] to the exploration client.

  • To overcome the inherently limited resolution of voxel-based scene representations, we propose a lightweight projective texture mapping approach that enables the visualization of texture detail at the full resolution of the depth camera on demand.

We provide a discussion of the respective challenges and design choices, and evaluate the proposed system regarding latency, visual quality and accuracy. Furthermore, we demonstrate its practicality in a multi-client remote servicing and inspection role-play scenario with non-expert users.

2. Related Work

Data Representation | Flexibility (Individual Exploration, Re-Connection) | Data Management | Compactness
RGB-D Data | - - - | easy | good
Voxel Block Model | | easy | bad
Mesh | | hard | good
MC index based Model | | easy | very good
Table 1. Advantages and disadvantages of different scene representations for remote collaboration systems.

In this section, we provide an overview of previous efforts related to our novel large-scale, real-time 3D reconstruction and streaming framework for immersive multi-client telepresence categorized according to the developments in the domains of telepresence, 3D reconstruction and hashing.


Telepresence

Real-time 3D reconstruction is a central prerequisite for many immersive telepresence applications. Early multi-camera telepresence systems could not acquire and transmit high-quality 3D models to remote users in real time, either because of the hardware limitations of the time [Fuchs:1994; Kanade:1997; Mulligan:2000; Towles:2002; Tanikawa:2005; Kurillo:2008] or because of the applied techniques, such as shape-from-silhouette approaches, which lack reconstruction accuracy in concave surface regions [Petit:2010; Loop:2013]. Later, the growing availability of affordable commodity depth sensors such as the Microsoft Kinect led to the development of several 3D reconstruction approaches at room scale [Maimone:2012; Maimone:2012b; Molyneaux:2012; Jones:2014]. However, high sensor noise and temporal inconsistency limited the quality of these reconstructions. In contrast, the Holoportation system [Orts-Escolano:2016] is built on top of the accurate real-time 3D reconstruction pipeline Fusion4D [Dou:2016] and involves real-time data transmission as well as AR and VR technology to achieve an end-to-end immersive teleconferencing experience. However, achieving real-time performance required massive hardware resources, i.e. several high-end GPUs running on multiple desktop computers, with most of the expensive hardware components located at the local user’s side. In the context of static scene telepresence, Mossel and Kröger [mossel] developed an interactive single-exploration-client VR application based on current voxel block hashing techniques [infinitam]. While previous approaches are only designed for single-client telepresence or do not support interactive collaboration, our approach overcomes these limitations and enables a variety of new applications.

3D Reconstruction

The key to success of the recently emerging high-quality real-time reconstruction frameworks is the underlying data representation that is used to fuse the incoming sensor measurements. In particular, modeling surfaces in terms of implicit truncated signed distance fields (TSDFs) has become well-established for high-quality reconstructions. Early volumetric reconstruction frameworks such as KinectFusion [kinectfusion2] rely on a uniform grid, so that the memory requirement scales linearly with the overall grid size rather than with the significantly smaller subset of surface areas. As this is impractical for handling large-scale scenes, follow-up work focused on the development of efficient data structures for real-time volumetric data fusion by exploiting sparsity in the TSDF. This has been achieved by using moving volume techniques [Roth:2012; Whelan:2015], by representing scenes in terms of blocks of volumes that follow dominant planes [Henry:2013] or height maps that are parameterized over planes [Schoeps:2014], or by using dense volumes only in the vicinity of the actual surface areas to store the TSDF [niessner; infinitam; Dai:2017]. The allocated blocks that need to be indexed may be addressed based on tree structures or hash maps. Tree structures model the spatial hierarchy in the representation at the cost of a complex parallelization and a time-consuming tree traversal. Both can be avoided with hash functions, which, however, discard the hierarchy. Nießner et al. [niessner] proposed real-time 3D reconstruction based on a spatial voxel block hashing framework that was later optimized [infinitam]. Drift that may lead to the accumulation of errors in the reconstructed model [niessner] can be counteracted by implementing loop closure [Kaehler:2016; Dai:2017]. Due to its efficiency, we build our remote collaboration system on top of the voxel block hashing approach and adapt the latter to the requirements discussed before.
Very recently, Golodetz et al. [Golodetz2018Collaborative] presented a system for multi-client collaborative acquisition and reconstruction of static scenes with smartphones. Our approach is also prepared to support multi-client acquisition. However, in this paper we did not focus on this aspect but on the development of a practical collaboration system supporting an arbitrary number of exploration clients.


Hashing

Lossless packing of sparse data into a dense map can be achieved via hashing. However, developing such data structures on the GPU with the reliability of their CPU-side counterparts is highly challenging. Current voxel block hashing techniques need relatively weak guarantees regarding insertion, removal, and retrieval. Only key uniqueness is strictly required to prevent duplicate blocks from being allocated and integrated during fusion, whereas insertion and removal failures are cleaned up in subsequent frames [niessner; infinitam]. Hierarchical voxel block hashing [Kaehler:2016:Hierarchical] applies spatial hashing to different voxel resolution levels for a reduced memory footprint and also only requires weak guarantees. To achieve more reliable GPU hashing, perfect hashing approaches [Lefebvre:2006; Botelho:2013; Tran:2015] have been proposed that aim at collision-free hashing, but are hardly applicable to online reconstruction. In the context of collision handling, minimizing the maximum age of the hash map, i.e. the maximum number of required lookups during retrieval, by reordering key-value pairs similar to Cuckoo Hashing improves the robustness of the hash map construction [Garcia:2011]. Similar to Alcantara [Alcantara:2011:Thesis], who analyzed different collision resolving strategies, the entry size is restricted to 64 bits due to the limited support size of atomic exchange operations. However, these approaches do not support entry removal, and insertion is allowed to fail in case the defined upper bound on the maximum age is not achieved. Stadium Hashing [Khorasani:2015] supports concurrent insertion and retrieval by avoiding entry reordering that would otherwise lead to synchronization issues, but lacks removal. Recently, Ashkiani et al. [ashkiani2017dynamic] presented a fully dynamic hash map supporting concurrent insertion, retrieval, and also removal, based on chaining to resolve collisions.
However, their data structure cannot enforce key uniqueness, which is an essential property required by voxel block hashing frameworks to preserve model consistency. In contrast, our hash map data structure overcomes all of the aforementioned limitations and is specifically suited for continuously updated reconstruction and telepresence scenarios.

3. Remote Communication and Collaboration

Figure 2. Our novel 3D reconstruction and streaming framework for multi-client remote collaboration. RGB-D images acquired by consumer cameras, e.g. smartphones or the Kinect device, are streamed to the reconstruction client (red arrows) which updates the virtual model and transfers it to the server (blue arrows). The server converts the received data to a novel bandwidth-optimized representation based on Marching Cubes (MC) indices and manages a set of updated blocks that are queued for streaming for each connected exploration client. By design, our system supports an arbitrary number of exploration clients that can independently request the currently relevant updated parts of the model (green arrows) and integrate them into their locally generated mesh from which images are rendered in real-time and displayed on devices such as VR headsets or screens. For an immersive lag-free experience, the computational load during streaming is distributed using our novel hash map and set data structures. Red arrows are used to represent the streaming of images, while blue and green arrows are used to represent the streaming of TSDF and MC voxel blocks.

A practical remote communication and collaboration system relies on efficient data processing and transmission as well as fast and compact data structures to allow reconstructing and providing a virtual 3D model in real time to remote users. In this section, we provide a discussion of design choices relevant for such systems as well as an in-depth discussion of our proposed framework.

3.1. Design Choices

Before describing our framework in detail, we discuss the crucial design choices that have to be taken into account in order to meet the requirements regarding usability, latency, and stability. We focus on a flexible system design that benefits a variety of applications while allowing the computational burden to be distributed according to the available hardware, i.e. to the cloud or to the remote expert’s equipment. In particular, the major challenges are efficient data processing, representation (see Table 1) and streaming, as well as efficient and compact data structures.

Centralized Data Processing

We focus on the development of a system that is particularly designed for collaboration tasks where users can explore and interact with the captured scene while at the same time being able to observe the other clients’ interactions. For this purpose, a central server is placed between the individual clients to simplify the communication between clients and move shared computational work away from the clients. This avoids complicated and error-prone dense mesh networks between the clients and lowers the system’s hardware requirements at the user site, making the system suitable for a much broader variety of users. Powerful hardware, required to scale to a large number of clients, can be provided through cloud or similar services.

Data Transmission

An obvious strategy for the interactive exploration of a live-captured scene by the user is given by the transmission of the RGB-D input sequence and the reconstruction of the scene model at the exploration client’s site. Whereas the current state of the art in image and video compression techniques as well as real-time reconstruction would certainly be sufficient for the development of such systems, the flexibility of these approaches is limited regarding network outage handling and support for reconnecting to re-explore the scene. In such cases, the exploration would be forced to follow the exact reconstruction trajectory and order and is unable to skip already explored scene parts before the disconnection happened. Depending on the network quality and on how far the capturing process has proceeded, the user would have to wait a couple of minutes until novel parts of the scene can be explored. In contrast, this problem can be significantly reduced by instead streaming parts of the fused models independently from the acquisition order.

Alternatively, the full reconstruction including triangulation could be performed on the reconstruction client site and only mesh updates are streamed. Whereas a mesh-based representation is compact in comparison to the voxel block model, the number of triangles in each block largely differs depending on the amount of surface inside resulting in a significantly more complicated and less efficient data management (including updating and transmission). Furthermore, the vertex positions, which are given in the global coordinate system, are much harder to compress due to their irregular and arbitrary bit pattern. Instead, we propose a novel bandwidth-optimized representation based on Marching Cubes indices (see Section 3.4) that is even more compact after compression due to its more regular nature, and faster to manage and update in parallel. This results in a lower latency even in less optimized implementations.

Hash Data Structure

Efficient data structures are crucial for system performance and therefore have to be adequately taken into account during the design phase. While recently developed hashing approaches work well with high-frame-rate online 3D reconstruction techniques, their lack of strong guarantees regarding hash operations makes them hardly applicable to use cases with high reliability requirements such as telepresence systems, where redundant transmission and holes due to missing data must be avoided. With a novel hash map data structure that supports concurrent insertion, removal, and retrieval including key uniqueness preservation while running on a thread level, we directly address these requirements. This not only simplifies the design of the overall system but also avoids synchronization errors related to data management. A detailed evaluation regarding run time and further relevant design choices is provided in the supplemental material.

3.2. Proposed Remote Collaboration System

The overall server-client architecture of our novel framework for efficient large-scale 3D reconstruction and streaming for immersive remote collaboration based on consumer hardware is illustrated in Figure 2, and the tasks of the involved components are shown in Figure 3. RGB-D data acquired with commodity 3D depth sensors, as present in a growing number of smartphones or the Kinect device, are sent to the reconstruction client, where the 3D model of the scene is updated in real time and transmitted to the server. The server manages a copy of the reconstructed model, a corresponding novel bandwidth-optimized voxel block representation, and the further communication with connected exploration clients. Finally, at the exploration client, the transmitted scene parts are triangulated to update the locally generated mesh, which can be immersively explored, e.g. with VR devices. Clients can connect at any time before or after the capturing process has started. In the following sections, we provide more detailed descriptions of the individual components of our framework, i.e. the reconstruction client, the server, and the exploration client, followed by an in-depth discussion of the novel data structure (see Section 4). Additional implementation details for each of the components are provided in the supplemental material.

Figure 3. Components of our framework and their respective tasks. Images are partially provided by PresenterMedia.

3.3. Reconstruction Client

The reconstruction client receives a stream of RGB-D images acquired by a user and is responsible for the reconstruction and streaming of the virtual model. We use voxel block hashing [niessner; infinitam] to reconstruct a virtual 3D model from the image data. Since the bandwidth is limited, efficient data handling during reconstruction is of great importance. For this purpose, we follow the approach of Mossel and Kröger [mossel] and consider only voxel blocks that have already been fully reconstructed and for which no further immediate updates have to be considered, i.e. blocks that are no longer visible in the current sensor’s view and have been streamed out to CPU memory. In contrast, blocks that are still being actively reconstructed will change over time, and transmitting them results in an undesirable visualization experience for exploration clients. Furthermore, continuously transmitting these individual blocks during the reconstruction process drastically increases the bandwidth requirements, which makes this approach infeasible for real-world scenarios. In contrast to Mossel and Kröger [mossel], we add the streamed-out voxel blocks to a hash set, which allows us to control the number of blocks per package that are streamed and avoids lags by distributing the work across multiple frames, similar to the transfer buffer approach of the InfiniTAM system [infinitam]. To mitigate the delay caused by transmitting only fully reconstructed parts of the scene, we add the currently visible blocks at the very end of the acquisition process as well as when the user stops moving during capturing or when the hardware, including the network connection, is powerful enough to stream the complete set of queued entries. In particular, we check whether the exponential moving average (EMA) of the stream set size over a period of 5 seconds [zumbach2001operators] is below a given threshold and the last such prefetching operation was at least 5 seconds ago. The EMA is updated as




This ensures that the delayed but complete model is available to the server and the exploration clients at all times. After fetching the voxel data from the model, we compress them including the voxel block list using lossless compression [Collet:2017:zstd] and send them to the server. In addition to the pure voxel data, the reconstruction client and the exploration clients send their camera intrinsics and current camera pose to the server where they are forwarded to each connected exploration client to enable interactive collaboration. Furthermore, requests for high-resolution textures on the model by the exploration clients, required e.g. for reading text or measurement instruments, are handled by transmitting the sensor’s current RGB image to the reconstruction client where it is forwarded to the server and the exploration clients. To make our framework also capable of handling quasi-static scenes, where the scene is allowed to change between two discrete timestamps, as e.g. occurring when an instrument cabinet has to be opened before being able to read the instruments, our framework also comprises a reset function that allows the exploration client to request scene updates for selected regions. This can be achieved by deleting the reconstructed parts of the virtual model that are currently visible and propagating the list of these blocks to the server.
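The EMA-based prefetch check described in this section can be sketched as follows. This is a minimal illustration, not the system's implementation: it assumes the common exponentially weighted update for irregularly spaced samples with weight exp(-Δt/τ), and the class name, the method names and the threshold value are hypothetical.

```python
import math


class StreamSetMonitor:
    """Tracks an EMA of the stream set size and decides when to prefetch."""

    def __init__(self, tau=5.0, threshold=100.0, min_interval=5.0):
        self.tau = tau                    # averaging period in seconds (5 s in the text)
        self.threshold = threshold        # hypothetical threshold on the averaged size
        self.min_interval = min_interval  # minimum time between prefetches (5 s in the text)
        self.ema = None
        self.last_time = None
        self.last_prefetch = -math.inf

    def update(self, now, set_size):
        """Fold a stream-set size sample taken at time `now` (seconds) into the EMA."""
        if self.ema is None:
            self.ema = float(set_size)
        else:
            # Assumed weighting: older samples decay with exp(-dt / tau).
            alpha = math.exp(-(now - self.last_time) / self.tau)
            self.ema = alpha * self.ema + (1.0 - alpha) * set_size
        self.last_time = now
        return self.ema

    def should_prefetch(self, now):
        """True if the averaged queue is short and the last prefetch is old enough."""
        if self.ema is None:
            return False
        if self.ema < self.threshold and now - self.last_prefetch >= self.min_interval:
            self.last_prefetch = now
            return True
        return False
```

Averaging the queue length rather than testing its instantaneous value keeps a briefly empty queue from triggering a prefetch while the network is still catching up.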

3.4. Server

The server component is responsible for managing the global voxel block model and the list of queued blocks for each connected exploration client. Furthermore, it forwards messages between clients and distributes camera pose data. In order to allow efficient communication of the current model, the streamed data should be as compact as possible. Instead of transferring the model in the TSDF voxel block representation of the voxel block hashing technique [mossel] to the exploration clients, we compute and transmit a bandwidth-optimized representation based on Marching Cubes (MC) [Lorensen:1987:MCH]. Thus, a TSDF voxel (12 bytes), composed of a truncated signed distance field (TSDF) value (4 bytes), a fusion weight (4 bytes), and a color (3 bytes + 1 byte alignment), is reduced to an MC voxel (4 bytes), i.e. a Marching Cubes index (1 byte) and a color value (3 bytes). Newly arriving data are integrated first into the TSDF voxel block model and then used to update the corresponding blocks and their seven neighbors in negative direction in the MC voxel block representation. Although only parts of the neighbors need to be recomputed, we recompute the whole blocks to avoid branch divergence and inefficient handling of such special cases. Updating the neighbors is crucial to avoid cuts in the mesh due to outdated and inconsistent MC indices. Furthermore, we cut off those voxel indices and colors where no triangles will be created, i.e. for voxels whose MC index is 0 or 255, by setting their index and color values to zero. While omitting the interpolation weights might seem drastic in terms of reconstruction quality, we show that the achieved increase in compression ratio and the decrease in network bandwidth requirement easily compensate for the loss of accuracy in the reconstruction (see Section 6). The list of updated MC voxel blocks is then added to each exploration client’s stream hash set. Maintaining such a set for each connected client not only enables advanced streaming strategies but also allows clients to reconnect at any point in time and still explore the complete model, since a client’s stream set is initially filled with the complete list of voxel blocks. An exploration client can request data according to three different strategies: it can request voxel blocks in the order of the reconstruction, request the voxel blocks currently visible depending on its current pose and intrinsic camera parameters, or request arbitrary voxel blocks. The first two strategies provide direct feedback regarding the scene parts currently observed by the sensor or those in the user’s current view. The latter strategy can be used to prefetch and update the remaining parts of the model if no updates for the visible part are currently queued. After selecting all relevant blocks, a random subset of at most the size limit is chosen and the corresponding voxel data are retrieved, compressed [Collet:2017:zstd] and sent to the exploration client.
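The conversion from TSDF voxels to MC voxels can be illustrated with a short sketch. Several details are assumptions made for illustration only: the corner ordering, the convention that negative TSDF values lie inside the surface, the block size of 8×8×8 voxels, and all function names. `zlib` stands in for the zstd compressor [Collet:2017:zstd], which is not part of the Python standard library.

```python
import struct
import zlib  # stand-in for zstd, used only to show that zeroed voxels compress well

TSDF_VOXEL_BYTES = 4 + 4 + 3 + 1  # TSDF value, fusion weight, color, alignment
MC_VOXEL_BYTES = 1 + 3            # MC index, color


def mc_index(corner_sdf):
    """Marching Cubes index: bit i is set if corner i is inside (sdf < 0, assumed)."""
    index = 0
    for i, sdf in enumerate(corner_sdf):
        if sdf < 0.0:
            index |= 1 << i
    return index


def pack_mc_voxel(index, color):
    """Pack one MC voxel into 4 bytes: 1-byte index followed by 3-byte RGB color."""
    if index in (0, 255):
        # No triangles are created for these two cases, so index and color are zeroed.
        return bytes(MC_VOXEL_BYTES)
    r, g, b = color
    return struct.pack("4B", index, r, g, b)


def blocks_to_update(block):
    """The updated block plus its seven neighbors in negative direction."""
    bx, by, bz = block
    return [(bx - dx, by - dy, bz - dz)
            for dx in (0, 1) for dy in (0, 1) for dz in (0, 1)]


# A block without any surface (assumed 8^3 = 512 voxels) packs to all zeros.
empty_block = b"".join(pack_mc_voxel(0, (200, 180, 160)) for _ in range(512))
compressed = zlib.compress(empty_block)
```

Per voxel, the payload shrinks from 12 to 4 bytes before compression. The neighbors in negative direction are included because the cells on their positive border read corner values from the updated block, assuming each cell is anchored at its minimum corner.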

3.5. Exploration Client

The exploration client’s tasks comprise generating surface geometry from the transmitted compact representation in terms of MC indices, updating the current version of the reconstructed model at the remote site, and the respective rendering of the model in real-time. Therefore, the exploration clients are allowed to request reconstructed voxel blocks according to the order of their generation during reconstruction, depending on whether they are visible in the current view of the exploration client, or in a random order which is particularly useful in the case that the currently visible parts of the model are already complete, and thus, other parts of the scene can be prefetched.
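The per-client stream set and the three request strategies can be sketched as below. This is a simplified, single-threaded illustration on the CPU. The class and method names are hypothetical, and so is the choice to serve the reconstruction-order strategy from the front of the queue while drawing a random subset for the other two.

```python
import random


class ClientStreamSet:
    """Blocks queued for one exploration client, kept in reconstruction order."""

    def __init__(self, all_blocks=()):
        # A (re)connecting client starts with every block of the model queued.
        self.queued = dict.fromkeys(all_blocks)

    def add_updates(self, blocks):
        """Queue updated blocks; a block that is already queued is not duplicated."""
        for block in blocks:
            self.queued.setdefault(block, None)

    def request(self, strategy, limit, is_visible=None, rng=random):
        """Select at most `limit` queued blocks and remove them from the queue."""
        if strategy == "order":
            chosen = list(self.queued)[:limit]
        elif strategy == "visible":
            candidates = [b for b in self.queued if is_visible(b)]
            chosen = rng.sample(candidates, min(limit, len(candidates)))
        elif strategy == "arbitrary":
            candidates = list(self.queued)
            chosen = rng.sample(candidates, min(limit, len(candidates)))
        else:
            raise ValueError("unknown strategy: " + str(strategy))
        for block in chosen:
            del self.queued[block]
        return chosen
```

Because each client owns its own set, a slow or disconnected client only delays its own queue, and a reconnect amounts to refilling that set with all known blocks.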

The received MC voxel blocks are decompressed in a dedicated thread, and the block data is passed to a set of reconstruction threads which generate the scene geometry from the MC indices and colors of the voxels. Similar to Mossel and Kröger [mossel], we reduce the number of draw calls to the graphics API by merging voxel blocks into a mesh block instead of rendering each voxel block separately. To reduce the number of primitives rendered each frame, we compute three levels of detail (LoDs) from the triangle mesh, where one voxel, eight voxels or 64 voxels, respectively, are represented by a point and the point colors are averaged over the voxels. During the rendering pass, all visible mesh blocks are rendered, and their level of detail is chosen according to the distance from the camera. We refer to the supplemental material for more details.
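The level-of-detail scheme can be sketched as follows. The sketch assumes that surface voxels are given as a mapping from integer voxel coordinates to RGB colors, and the distance thresholds are hypothetical placeholders, since the text does not state them.

```python
from collections import defaultdict


def build_lod_points(voxel_colors, cell_size):
    """Merge surface voxels into cells of cell_size^3 voxels with averaged colors.

    cell_size 1, 2 and 4 correspond to one, eight and 64 voxels per point.
    """
    sums = defaultdict(lambda: [0, 0, 0, 0])  # r, g, b sums and voxel count
    for (x, y, z), (r, g, b) in voxel_colors.items():
        acc = sums[(x // cell_size, y // cell_size, z // cell_size)]
        acc[0] += r
        acc[1] += g
        acc[2] += b
        acc[3] += 1
    return {cell: (acc[0] / acc[3], acc[1] / acc[3], acc[2] / acc[3])
            for cell, acc in sums.items()}


def select_lod(distance, thresholds=(2.0, 5.0)):
    """Pick LoD 0, 1 or 2 from the camera distance (threshold values are assumed)."""
    for level, limit in enumerate(thresholds):
        if distance < limit:
            return level
    return len(thresholds)
```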

To improve the communication experience, each exploration client additionally sends its own pose to the server, which forwards it to other exploration clients, so that each user can observe the poses and movements of other exploration clients within the scene. Analogously, the current pose of the reconstruction client is visualized in terms of the respectively positioned and oriented camera frustum. Furthermore, users can interactively explore the reconstructed environment beyond pure navigation by measuring 3D distances between interactively selected scene points. For the purpose of depicting structures below the resolution of the voxel hashing pipeline as e.g. required for reading measurement instruments or texts, the exploration client can send requests to the server upon which the RGB image currently captured by the sensor is directly projected onto the respective scene part and additionally visualized on a virtual measurement display.
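Projecting the sensor's RGB image onto the model requires mapping scene points into that image. A minimal sketch using a standard pinhole camera model is given below. It assumes that the pose is a camera-to-world rotation matrix plus translation and that the intrinsics are the focal lengths and the principal point; the function name and parameter layout are hypothetical.

```python
def project_to_image(point, rotation, translation, fx, fy, cx, cy, width, height):
    """Return pixel coordinates (u, v) of a world-space point, or None if not visible.

    rotation (3x3 nested lists) and translation describe the camera-to-world pose.
    """
    d = [point[i] - translation[i] for i in range(3)]
    # World to camera: multiply by the transposed rotation matrix.
    x, y, z = (sum(rotation[r][c] * d[r] for r in range(3)) for c in range(3))
    if z <= 0.0:
        return None  # point lies behind the camera
    u = fx * x / z + cx
    v = fy * y / z + cy
    if 0.0 <= u < width and 0.0 <= v < height:
        return (u, v)
    return None
```

A renderer would evaluate such a projection per fragment and sample the RGB image at (u, v), which is what allows texture detail finer than the voxel resolution to be displayed.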

4. Hash Map and Set Data Structures

Figure 4. Illustration of thread-safe hash map/set modifications on the GPU by maintaining the proposed invariant. The importance of thread safety has its origin in the guarantees for successful concurrent retrieval, insertion and removal while preserving key uniqueness. For this example, where four operations are triggered in parallel, the figure depicts one possible order for the operations to resolve the requested task. In the resulting structure, dead links and empty buckets might occur which, however, are not problematic and automatically cleaned up during further operations.

In order to allow the scalability of our system to an arbitrary number of remote exploration clients, we introduce thread-safe GPU hash data structures allowing fast and simple management including concurrent insertion, removal and retrieval of millions of entries dynamically with strong success guarantees. In comparison to pure 3D reconstruction, maintaining consistency in multi-client telepresence is much more challenging since streaming data between clients requires that updates are not lost e.g. due to synchronization failures. Whereas previous approaches either allow failures [niessner; infinitam; Garcia:2011] or do not ensure key uniqueness [Khorasani:2015; ashkiani2017dynamic], our robust hash data structure is not limited in this regard and represents the key to realize our real-time remote collaboration system. A detailed evaluation in terms of design choices and run-time performance can be found in the supplemental material.

Figure 5. Exemplary application scenarios supported by our system. Depending on the configuration, the involved components, given by the reconstruction client (RC), the server (S), and the exploration client (EC), can share a single computer (left) or can be run on completely independent machines (middle, right) or cloud services (right).

General Design

Our streaming pipeline is built upon two different hash data structures. The individual server and client components use an internal map structure, that stores unique keys and maps a value to each of them, whereas the server-client streaming protocol relies on a set structure, which only considers the keys. Thus, the major difference lies in the kind of stored data whereas the proposed algorithm for retrieval, insertion and removal is shared among them. These operations maintain an invariant to achieve thread-safety and ensure correctness: At any time, the entry positions and the links to colliding values are preserved. For this purpose, the required data members of the structure are an array of values, i.e. key-value pairs for the map structure and keys for the set, an array of offsets to maintain the linked list, indicators to determine whether an entry is occupied, locks for synchronization, a stack of available linked list positions needed for offset computation, the current size and the capacity of the container. Similar to the approach by infinitam, the array of values, offsets, indicators and locks is divided implicitly into an ordered and an unordered part. The former part is managed through the hashing function whereas the latter one contains the linked list entries. Figure 4 demonstrates mixed insertion and removal operations on our thread-safe hash data structure. A detailed description of the stack data structure and further design remarks are provided in the supplemental material.


Retrieval

Finding an element in the hash map or set is a read-only operation, i.e. there is no need for synchronization since entry positions are preserved. First, the bucket of a given key value is computed according to the hashing function. In case of spatial hashing, this function could be defined as

$$h(x, y, z) = (x \cdot p_1 \oplus y \cdot p_2 \oplus z \cdot p_3) \bmod n$$

where $x, y, z$ are the voxel block coordinates, $p_1, p_2, p_3$ represent prime numbers, and $n$ denotes the total number of buckets [niessner; infinitam]. Next, we check whether the entry is occupied and its key matches the query key. If both conditions are true, we have found the key and return the current position. Otherwise, we traverse the linked list through the offsets and check each entry in a similar manner.
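As a minimal sketch, the spatial hashing function can be written as follows. The XOR-of-products form follows the voxel hashing literature; the concrete prime values are an assumption (they are commonly used for spatial hashing) and the function name is hypothetical.

```python
# Large primes commonly used for spatial hashing; the concrete values are an assumption.
P1, P2, P3 = 73856093, 19349669, 83492791

def spatial_hash(x, y, z, num_buckets):
    """Bucket index for the integer voxel block coordinates (x, y, z)."""
    return ((x * P1) ^ (y * P2) ^ (z * P3)) % num_buckets
```

Note that Python integers are unbounded, so the bucket indices may differ from those of a GPU implementation based on 32-bit arithmetic; the sketch only illustrates the structure of the function.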


Insertion

We add a value by looping over a non-blocking insertion function until the value is found in the data structure. In the non-blocking version, we first check whether the value is already contained and can safely return in this case. Otherwise, there are two possible scenarios: The entry at the bucket position may be occupied or not. If it is free, we try to lock the entry and, in case we got the lock, check whether the bucket is still free and the value is still not inserted. The additional check is important to prevent inconsistencies in the data structure since other threads might also try to insert values with the same key. If the scenario has not changed, the value is stored, the entry is marked as occupied and the size is atomically incremented. Finally, the entry is unlocked. If the bucket entry is occupied, there is a hash collision that has to be resolved. Thus, we first find the end of the linked list by traversing the offsets. Afterwards, we try to lock the entry and check whether the linked list end has changed or the entry has been inserted by another thread during locking. If this is not the case, we extract a new linked list position from the stack, write the value at this position and reset the possibly non-zero offset. The insertion function is responsible for this reset to guarantee that the non-synchronizing retrieval function works correctly and is free of any race conditions even when the linked list is modified. Then, the offset from the old linked list end to the new one is computed and stored at the entry of the old end. Finally, the new linked list end is marked as occupied, the size is atomically incremented, and the acquired lock is released.


Removal

Removing elements from the hash data structure is similar to insertion. The blocking version of this function loops over the non-blocking one until the value can no longer be found. In the non-blocking function, we first check whether the entry is contained and immediately stop if it was not found. Otherwise, there are two scenarios: The entry may be located at the bucket position or inside the linked list. In the former case, we try to lock the bucket and then check whether the entry has already been removed during locking. In case the scenario has not changed, we mark the entry as unoccupied, decrement the size atomically and then reset the value. Finally, the bucket is unlocked. In contrast to the approach by niessner, the next entry is not moved to the bucket position since other threads that try to erase this value might fail to find it otherwise. We also tested manual pruning of the linked lists but this did not affect the runtime performance. In case the value is inside the linked list, we first search for the previous entry and try to lock it together with the current entry where the value is located. If there are no changes after locking, the offset of the previous entry is updated to point to the next one in case it exists or set to zero otherwise. Then, the current entry is marked as unoccupied, the size is atomically decremented, and only the value is reset but not the offset. As already mentioned, this ensures that the retrieval is free of race conditions since other threads that try to find a value inside the linked list might otherwise fail to traverse the full linked list if they visit the entry that is currently being removed. Finally, both locks for the previous and current entry are released.
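The memory layout and the order of the individual steps of retrieval, insertion and removal can be summarized in a sequential sketch of the set variant. This is a simplified model and not our GPU implementation: locks, atomic operations and the retry loops of the blocking versions are intentionally omitted, the class and member names are hypothetical, and Python's built-in `hash` serves as a placeholder for the hashing function.

```python
class HashSetSketch:
    """Sequential model of the set layout; locks and atomics are omitted."""

    def __init__(self, num_buckets, list_capacity):
        total = num_buckets + list_capacity
        self.num_buckets = num_buckets
        self.keys = [None] * total       # ordered part first, linked list part after
        self.offsets = [0] * total       # relative offset to the next entry, 0 = end
        self.occupied = [False] * total
        # Stack of available linked list positions (unordered part).
        self.free = list(range(total - 1, num_buckets - 1, -1))
        self.size = 0

    def bucket(self, key):
        return hash(key) % self.num_buckets  # placeholder hashing function

    def find(self, key):
        i = self.bucket(key)
        while True:
            if self.occupied[i] and self.keys[i] == key:
                return i
            if self.offsets[i] == 0:
                return None
            i += self.offsets[i]

    def insert(self, key):
        if self.find(key) is not None:
            return False                     # key uniqueness: already contained
        i = self.bucket(key)
        if self.occupied[i]:                 # collision: append to the linked list
            end = i
            while self.offsets[end] != 0:
                end += self.offsets[end]
            i = self.free.pop()
            self.keys[i] = key
            self.offsets[i] = 0              # reset a possibly stale offset
            self.offsets[end] = i - end      # link the old end to the new end
        else:
            self.keys[i] = key
        self.occupied[i] = True
        self.size += 1
        return True

    def remove(self, key):
        pos = self.find(key)
        if pos is None:
            return False
        head = self.bucket(key)
        if pos != head:                      # unlink from the linked list
            prev = head
            while prev + self.offsets[prev] != pos:
                prev += self.offsets[prev]
            nxt = self.offsets[pos]
            self.offsets[prev] = pos + nxt - prev if nxt != 0 else 0
        self.occupied[pos] = False
        self.keys[pos] = None                # the value is reset, the offset is kept
        self.size -= 1
        if pos != head:
            self.free.append(pos)            # position becomes available again
        return True
```

After a removal, the offset of the erased entry still points into the list (a dead link), and an emptied bucket keeps its link to the remaining colliding entries, so a traversal that started earlier can still reach the list end.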

5. Application Scenarios

By design, our novel framework can handle various combinations of devices and server-client-configurations as illustrated in Figure 5.

In the simplest configuration, the reconstruction client, the server, and the exploration client run concurrently as separate processes on the same computer. The RGB-D stream needed by the reconstruction client can be provided by a Kinect sensor via a USB cable or by smartphones with built-in depth cameras via the Internet. Then, the streaming of the RGB-D images and the voxel block model is performed locally on the same computer. However, such a scenario requires powerful graphics hardware to process all the tasks, i.e. reconstruction, data management and rendering, interactively.

If less powerful hardware is available, the tasks can be split across two computers so that the reconstruction client and the server are executed on one computer whereas the exploration client runs on another machine. The communication and streaming of the model between these two components is then realized through the Internet.

The last configuration we want to highlight is specifically suited for end users. Relying on RGB-D streams provided by built-in depth cameras of smartphones, the image data can be streamed via the Internet to the reconstruction client which runs on the same computer as the server and communicates with it indirectly through the local network or directly through shared memory. This kind of configuration can be implemented using cloud-based computing and moves the burden regarding hardware requirements from the end user to the involved service. The exploration client is implemented as above. In future work, the server component could be extended to allow an arbitrary number of reconstruction clients similar to the recent approach by Golodetz2018Collaborative, which would further benefit this configuration.

6. Evaluation

| Dataset | Voxel Size [mm] | MC 128 [MBit/s] | MC 256 [MBit/s] | MC 512 [MBit/s] | MC 1024 [MBit/s] | TSDF 512 [MBit/s] | Total Voxel Blocks |
|---|---|---|---|---|---|---|---|
| heating_room | 5 | 4.5 (8.0) | 8.8 (12.3) | 17.5 (30.9) | 32.7 (71.3) | 561.5 (938.8) | 897 |
| pool | 5 | 4.6 (7.1) | 9.0 (14.0) | 17.8 (29.7) | 29.3 (54.5) | 489.3 (937.0) | 637 |
| fr1/desk2 | 5 | 8.1 (11.6) | 16.2 (23.8) | 32.6 (46.8) | 61.0 (95.0) | 764.0 (938.6) | 134 |
| fr1/room | 5 | 12.3 (23.6) | 16.4 (23.6) | 32.1 (42.2) | 57.6 (87.9) | 739.7 (938.0) | 467 |
| heating_room | 10 | 5.1 (7.6) | 9.2 (14.4) | 14.6 (27.8) | 20.2 (63.7) | 216.8 (937.1) | 147 |
| pool | 10 | 5.6 (8.5) | 9.9 (16.0) | 13.6 (27.2) | 16.9 (52.3) | 176.3 (937.0) | 104 |
| fr1/desk2 | 10 | 8.7 (11.2) | 14.3 (21.8) | 19.6 (39.2) | 24.4 (71.3) | 170.1 (436.4) | 23 |
| fr1/room | 10 | 9.2 (12.5) | 15.7 (23.5) | 22.9 (46.1) | 28.5 (88.8) | 207.8 (936.6) | 86 |

Table 2. Bandwidth measurements of our system for various scenes, given as mean (maximum) bandwidth. We compared the bandwidths of our optimized MC voxel structure with 128-1024 blocks/request to the standard TSDF voxel one with 512 blocks/request. Whereas the request rate of the latter has not been bounded to simulate a perfect network, we used a rate of 100Hz for the former. Across all scenes, our optimized representation saved more than 90% of the bandwidth (MC 512 vs. TSDF 512) and scales linearly with the request size.
| Dataset | Voxel Size [mm] | MC 128 [min] | MC 256 [min] | MC 512 [min] | MC 1024 [min] | TSDF 512 [min] | Total Voxel Blocks |
|---|---|---|---|---|---|---|---|
| heating_room | 5 | 4:06 | 3:08 | 2:40 | 2:32 | 2:31 | 897 |
| pool | 5 | 2:14 | 1:32 | 1:12 | 1:09 | 1:08 | 637 |
| fr1/desk2 | 5 | 0:39 | 0:31 | 0:27 | 0:24 | 0:22 | 134 |
| fr1/room | 5 | 1:46 | 1:14 | 1:01 | 0:57 | 0:56 | 467 |
| heating_room | 10 | 1:49 | 1:44 | 1:44 | 1:44 | 1:44 | 147 |
| pool | 10 | 0:54 | 0:50 | 0:50 | 0:50 | 0:50 | 104 |
| fr1/desk2 | 10 | 0:21 | 0:19 | 0:19 | 0:19 | 0:18 | 23 |
| fr1/room | 10 | 0:46 | 0:42 | 0:41 | 0:41 | 0:41 | 86 |

Table 3. Time measurements of our system for various scenes. We compared the time needed to stream the whole model represented by our optimized MC voxel structure with 128-1024 blocks/request to the standard TSDF voxel one with 512 blocks/request. Whereas the latter request rate has not been bounded to simulate a perfect network, we used a rate of 100Hz for the former. The reconstruction speed is given by TSDF 512 and serves as a lower bound for all streaming times. For a voxel resolution of 5mm, a request size of 512 voxel blocks results in the best trade-off between required bandwidth and total streaming time. Increasing the size leads to slightly better results with less latency, but substantially higher bandwidths. For a resolution of 10mm, the optimal streaming time is reached with even smaller request sizes.

After providing implementation details, we perform an analysis regarding bandwidth requirements and the visual quality of our compact scene representation. This is accompanied by the description of the usage of our framework in a live remote collaboration scenario as well as a discussion of the respective limitations.

6.1. Implementation

We implemented our framework using up to four desktop computers taking the roles of one reconstruction client, one server, and two exploration clients. Each of the computers has been equipped with an Intel Core i7-4930K CPU and 32GB RAM. Furthermore, three of them have been equipped with an NVIDIA GTX 1080 GPU with 8GB VRAM, whereas the fourth computer made use of an NVIDIA GTX TITAN X GPU with 12GB VRAM. For acquisition, we tested two different RGB-D sensors by using the Microsoft Kinect v2, which delivered data with a resolution of 512 × 424 pixels at 30Hz, and by using an ASUS Zenfone AR, which captured RGB-D data with a resolution of 224 × 172 pixels at 10Hz. Although the ASUS device is, in principle, capable of performing measurements at frame rates of 5-15Hz, we used 10Hz as a compromise between data completeness and speed. Each of the exploration client users was equipped with an HTC Vive HMD, which has a native resolution of 1080 × 1200 pixels per eye whereas the recommended rendering resolution (reported by the VR driver) is 1512 × 1680 pixels per eye, leading to a total resolution of 3024 × 1680 pixels. Please note that the higher recommended resolution (in comparison to the display resolution) originates from the lens distortion applied by the VR system. All computers were connected via a local network.

6.2. Bandwidth Analysis

(a) Comparison of bandwidth requirements between reconstruction client (RC) with 512 blocks/request and exploration client (EC) with 256 blocks/request.
(b) Comparison of bandwidth requirements with package sizes of 256, 512, and 1024 blocks/request.
Figure 6. Bandwidth measurements of our system over time for the pool dataset.
Figure 7. Streaming progress over time for the pool dataset. Choosing different package sizes influences the total transmission time of the virtual model to the exploration client (EC). To save bandwidth, only fully reconstructed blocks are streamed from the reconstruction client (RC) to the server (S) causing a noticeable delay. This gap becomes smaller when our prefetching approach queues the currently visible scene parts to the reconstruction client’s stream set (RC SS) and transmits them to the server.

In the following, we provide a detailed quantitative evaluation of the bandwidth requirements of our novel collaboration system. For the purpose of comparison, we recorded two datasets, heating_room (Figure 11(a)) and pool (Figure 11(b)), with the Kinect v2, and also used two further publicly available standard datasets that were captured with the Kinect v1 [sturm12iros]. Throughout the experiment, we loaded a dataset and performed the reconstruction on the computer equipped with the NVIDIA GTX TITAN X. The model is then streamed to the server (second computer) and further transmitted to a benchmark client (third computer). In contrast to the exploration client, the benchmark client is started simultaneously with the reconstruction client, requests voxel blocks at a fixed predefined rate of 100Hz, and directly discards the received data to avoid overheads during benchmarking. Using this setup, we measured the mean and maximum bandwidth required for streaming the TSDF voxel block model from the reconstruction client to the server and the MC voxel block model from the server to the benchmark client. Furthermore, we also measured the time until the model has been completely streamed between the server and the two clients. Our parameter choices of 5mm and 10mm for the voxel size, and 60mm for the truncation region follow common choices in the context of voxel block hashing. The results of our experiment are shown in Table 2 and Table 3.

Across all scenes and voxel sizes, the measured mean and maximum bandwidths for our novel MC voxel structure scale linearly with the request size and are over one order of magnitude smaller compared to the standard TSDF voxel representation. We measured higher bandwidths at 10mm voxel size than at 5mm for request sizes of 128 and 256 blocks. Our stream hash set automatically avoids duplicates, which saves bandwidth in case the system works at its limits and can be considered a form of adaptive streaming. At 10mm, this duplicate avoidance triggers substantially less often, and thus more updates are sent to the server and exploration clients. We also observed bandwidth values that are larger by a factor of two for the datasets captured with the Kinect v1 in comparison to the ones recorded by us with the Kinect v2. This is mainly caused by the lower reliability of the RGB-D data, which contains more sensor noise as well as holes and, in turn, results in a larger number of allocated voxel blocks that need to be streamed. Furthermore, the faster motion induces an increased motion blur within the images, and thus leads to larger misalignments in the reconstructed model as well as even more block allocations. However, this problem is solely related to the reconstruction pipeline and does not affect the scalability of our collaboration system.

The overall latency of the system is determined by the duration until newly seen parts of the scene are queued for transmission, i.e. until they are streamed out to CPU memory, the latency of the network, and the package size of the exploration client’s requests. The computational burden on the reconstruction client, the server and the exploration clients is processed in real time, i.e. in the order of tens of milliseconds. Therefore, the runtime spent within the individual components has a negligible impact on the total latency of the system. In order to evaluate the bandwidth requirements and the latency of our system, we performed further measurements as depicted in Figure 6 and Figure 7. Whereas the bandwidth for transmitting the TSDF voxel block representation has a high variance and ranges up to our network’s limit of 1Gbit/s, our bandwidth-optimized representation has not only lower requirements, i.e. a reduction by more than 90%, but also a significantly lower variance. For a request size of 256 blocks by the exploration client, the model is only slowly streamed to the exploration client, which results in a significant delay until the complete model has been transmitted. Larger request sizes such as 512 blocks affect both the mean bandwidth and the variance, while further increases mainly affect the variance since fewer blocks than the package size need to be streamed (see Figure 6(b)). This effect also becomes apparent in Figure 7, where lower package sizes lead to smooth streaming and larger delays whereas higher values reduce the latency. Furthermore, the delay between the reconstruction client and the server, which is in the order of seconds, is directly related to our choice of only transmitting blocks that have been streamed out to save bandwidth. Note that directly streaming the actively reconstructed voxel blocks is infeasible due to the drastically increased bandwidth requirements (see Section 3.3).
Once our automatic streaming of the visible parts triggers, which can be seen in the rapid increases of the reconstruction client’s stream set (RC SS), the gap between the current model at the reconstruction client and the streamed copy at the server becomes smaller. Since the visible blocks are streamed in an arbitrary order, this results in many updates for already existing neighboring MC voxel blocks at the server site that need to be streamed to the exploration client. Therefore, the exploration client’s model grows more slowly than the server’s model, but this gap is closed shortly after the server has received all visible blocks. Note that the effects of this prefetching approach can also be seen in the reconstruction client’s bandwidth requirements, where high values are typically observed when this mechanism is triggered.

6.3. Scene Model Completeness and Visual Quality

In addition to the bandwidth analysis, we have also evaluated the model completeness during transmission for our novel hash map data structure in comparison to previous techniques that allow failures [niessner]. Thus, we measured the model size in terms of voxel blocks at the reconstruction client, where the streaming starts, and at the exploration client, where the data is finally transmitted to. To reduce side effects caused by distributing the computational load, we have chosen a package size of 1024 blocks (see Table 3). The results are shown in Figure 8. Whereas previous GPU hashing techniques work well for 3D reconstruction and failures can be cleaned up in subsequent frames, they are not suitable for large-scale collaboration scenarios where blocks are often sent only once to save bandwidth.

We also provide a qualitative visual comparison of our bandwidth-optimized scene representation based on Marching Cubes indices. In order to reduce the bandwidth requirements by over 90%, we omitted the interpolation of vertex positions and colors. Figure 9 shows a comparison between our approximation and the interpolated mesh, where both representations have been reconstructed using a voxel resolution of 5mm. While the interpolated model has a very smooth appearance, the quality of our approximation is slightly lower at edges but, otherwise, resembles the overall visual quality quite well. However, the quality suffers more at highly textured objects as shown in Figure 10. Please note that our system allows compensating this issue by using our projective texture mapping approach to enable higher resolution information on demand.

6.4. Live Remote Collaboration

(a) Original Voxel Block Hashing Data Structure [niessner].
(b) Our Hash Data Structure.
Figure 8. Visual comparison of model completeness for the pool dataset: While previous hash maps allow failures, our hash data structure ensures hole-free reconstructions during transmission to an exploration client.
(a) With Color Interpolation.
(b) Without Color Interpolation.
Figure 9. Visual comparison of our scene encoding for the heating_room dataset: Compared to standard mesh generation techniques that use linear interpolation, our scene encoding achieves a similar quality without interpolation in real-world scenes.

To verify the usability of our framework, we conducted a live remote collaboration experiment where a local user and two remotely connected users collaboratively inspect the local user’s environment supported by audio communication (i.e. via Voice over IP (VoIP)). For this experiment, we selected people who were unfamiliar with our framework and who received a briefing regarding the controls. Furthermore, these users had never been in the respective room before.

While one person took the role of a local user operating the acquisition device, two different remotely connected exploration clients provided support regarding maintenance and safety. The exploration clients can interactively inspect the acquired scene, i.e. the maintenance expert guides the person operating the acquisition device to allow the observation of measurement instruments. By allowing scene resets, where parts of the scene can be updated on demand, our system allows certain scene manipulations such as opening the door to a switch board that has to be checked by the maintenance expert. By requesting the current RGB image of the used sensor in addition to its current pose, higher texture resolutions can be directly achieved on the acquired scene geometry or in a separate virtual 2D display. This allows checking instruments or even reading text. Measurements performed based on the controllers belonging to the HMD devices are of sufficient accuracy to allow detecting safety issues or selecting respective components for replacement. The interaction flow of this experiment is also showcased in the supplemental video. We used the Kinect v2 and an ASUS Zenfone AR for RGB-D acquisition. However, the limited resolution and frame rate of the ASUS Zenfone AR sensor (224 × 172 pixels, up to 15Hz) affect the reconstruction quality obtained with the smartphone.

Furthermore, the users testing our framework particularly liked the options to reset certain scene parts to get an updated scene model as well as the possibility of interacting with the scene by performing measurements and inspecting details like instrument values. After network outages or intentional disconnections from the collaboration process, the capability of re-connecting to re-explore the parts of the scene reconstructed in the meantime was also highly appreciated and improved the overall experience significantly. In fact, they reported a good spatial understanding of the environment.

6.5. Limitations

(a) With Color Interpolation.
(b) Without Color Interpolation.
Figure 10. Challenging cases: For highly textured objects and sharp edges with high contrasts, our approximation introduces artifacts caused by missing interpolation.

Despite allowing an immersive live collaboration between an arbitrary number of clients, our system still faces some limitations. In particular, the acquisition and reconstruction of a scene with an RGB-D camera may be challenging for inexperienced users, who tend to move and turn relatively fast, resulting in high angular and linear velocities as well as potential motion blur. As a consequence, the reconstruction is more susceptible to misalignments. Whereas loop-closure techniques [Kaehler:2016; Dai:2017] can reduce the susceptibility to misalignments, their uncontrollable update scheme during loop closing would cause nearly the entire model to be queued for streaming. This would impose much higher bandwidth requirements on the remote client connections and prohibit remote collaboration over the Internet. Furthermore, we stream the virtual model in the standard TSDF voxel representation between the reconstruction client and the server, which requires both to be in a local network. However, the increasing thrust in cloud services could fill this gap. While we believe that the usability of our novel system significantly benefits from mobile devices with built-in depth cameras, the current quality and especially the frame rate of the provided RGB-D data are inferior in comparison to the Kinect sensor family, resulting in low-quality reconstructions.

(a) heating_room.
(b) pool.
Figure 11. Reconstructed models of our datasets.

7. Conclusion

We presented a novel large-scale 3D reconstruction and streaming framework for immersive multi-client live telepresence that is especially suited for remote collaboration and consulting scenarios. Our framework takes RGB-D inputs acquired by a local user with commodity hardware such as smartphones or the Kinect device, from which a 3D model is updated in real time. This model is streamed to the server, which further manages and controls the streaming process to the, theoretically, arbitrary number of connected remote exploration clients. As such a system needs to access and process the data in a highly asynchronous manner, we have built our framework upon, to the best of our knowledge, the first thread-safe GPU hash map data structure that guarantees successful concurrent insertion, retrieval and removal on a thread level while preserving key uniqueness required by current voxel block hashing techniques. Efficient streaming is achieved by transmitting a novel, compact representation in terms of Marching Cubes indices. In addition, the inherently limited resolution of voxel-based scene representations can be overcome with a lightweight projective texture mapping approach which enables the visualization of textures at the resolution of the depth sensor of the input device. As demonstrated by a variety of qualitative and quantitative experiments, our framework is efficient regarding bandwidth requirements, and allows a high degree of immersion into the live captured environments.

This work was supported by the DFG projects KL 1142/11-1 (DFG Research Unit FOR 2535 Anticipating Human Behavior) and KL 1142/9-2 (DFG Research Unit FOR 1505 Mapping on Demand).



In the scope of this supplemental material, we provide further implementation details of the components of our framework in order to facilitate its reproducibility. In this context, we also extend the discussion of the hash map serving as underlying data structure in our technique and provide an evaluation of our hashing approach with respect to approaches followed in recent literature.

Appendix A Implementation Details

As mentioned in the accompanying paper, the main components of our framework are given by the reconstruction client, the server and the exploration client. In the following sections, we provide an in-depth discussion of implementation details for these components.

A.1. Reconstruction Client

To allow a transmission in the order of reconstruction, we attach a unique counter to each block added to the stream set. The transmission of the data is performed by a separate worker thread which extracts voxel block entries from the stream set up to the package size limit and collects the corresponding voxel data until the capturing process has ended and all data has been transmitted. Since InfiniTAM’s streaming component allows blocks to be allocated in both volumes, which results in a delayed internal streaming of the block, we first look up and copy their data from the active visible volume and afterwards from the passive one. Already retrieved voxel data are simply overwritten as we prefer the passive version over the active one due to its better robustness and lower susceptibility to noise. Note that this decision has no impact on the quality of the final model since at least one further update will be triggered once the active part is streamed out and merged with the passive one.
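The counter-based ordering of the stream set can be sketched as follows. This is a simplified CPU model under stated assumptions: the class and method names are hypothetical, a dictionary stands in for the GPU hash set, and keeping the existing counter of an already queued block is an assumption about the client-side behavior.

```python
import itertools

class StreamSet:
    """Queued voxel blocks, each tagged with a unique counter."""

    def __init__(self):
        self._counter = itertools.count()
        self._entries = {}  # block position -> unique counter

    def add(self, block):
        # A block that is already queued is not duplicated (assumed behavior).
        if block not in self._entries:
            self._entries[block] = next(self._counter)

    def extract(self, package_size):
        """Remove and return up to package_size blocks in reconstruction order."""
        oldest = sorted(self._entries, key=self._entries.get)[:package_size]
        for block in oldest:
            del self._entries[block]
        return oldest
```

A worker thread would call `extract` repeatedly with the configured package size and collect the voxel data for the returned block positions.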

Although our system is developed to reconstruct static scenes, object interactions causing changes in the scene can still be handled to provide an updated virtual model. On demand, the user performing the acquisition can trigger a reset of the part of the scene that is currently visible to the sensor under its current pose. The list of visible blocks is then generated on the fly and all corresponding voxel data including the queued blocks in the stream set are erased. To maintain consistency across all users, the list is sent to the server which updates its model accordingly, and again forwards the message to all exploration clients which can then also reset the relevant parts of their models.
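A minimal sketch of such a partial reset is given below, assuming that the model and the stream set are both represented as dictionaries keyed by block position; the function names are hypothetical.

```python
def erase_blocks(container, blocks):
    """Erase the given blocks from a model or stream set, ignoring missing ones."""
    for block in blocks:
        container.pop(block, None)

def reset_visible(model, stream_set, visible_blocks):
    """Reset the visible scene part and return the list to be sent to the server."""
    reset_list = list(visible_blocks)
    erase_blocks(model, reset_list)
    erase_blocks(stream_set, reset_list)
    return reset_list
```

The returned list corresponds to the message that the server and, after forwarding, the exploration clients apply to their own models, e.g. via `erase_blocks`.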

A.2. Server

The server component maintains both a copy of the transmitted TSDF voxel block model and a bandwidth-optimized representation based on Marching Cubes indices. Since only data from this optimized representation is transmitted to exploration clients, we store it in CPU memory whereas the received TSDF voxel block model resides in managed memory to allow fast processing with relaxed memory size restrictions. Recent NVIDIA GPUs and versions of the CUDA toolkit [CUDA:2016:managed] support managed memory which not only allows direct accesses from both the CPU and GPU but also uses fast GPU-CPU paging to effectively relax the memory limit to the CPU rather than the GPU memory size.

When a new package from the reconstruction client arrives, the server first decompresses the data and integrates them into the TSDF voxel block model by allocating new blocks for not yet inserted parts and overwriting already existing ones. In a second pass, we compute the MC voxel block data of the received blocks and their seven neighbors in negative direction, and add these blocks to each exploration client’s stream hash set. In case a block is already inserted, the minimum of both unique counters is stored to make the update order-independent and to avoid holes in the client’s model due to delayed streaming.
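The second pass can be illustrated with a short sketch, assuming integer block coordinates and a dictionary from block position to counter as a stand-in for an exploration client's stream hash set; the function names are hypothetical.

```python
from itertools import product

def affected_blocks(block):
    """The block itself plus its seven neighbors in negative direction."""
    x, y, z = block
    return [(x - dx, y - dy, z - dz) for dx, dy, dz in product((0, 1), repeat=3)]

def queue_update(stream_set, block, counter):
    """Queue all affected blocks; for duplicates, keep the minimum counter."""
    for b in affected_blocks(block):
        stream_set[b] = min(counter, stream_set.get(b, counter))
```

Keeping the minimum counter means that a block queued earlier retains its position in the transmission order regardless of the order in which later updates arrive.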

A.3. Exploration Client

For HMD-based visualization, the scene has to be rendered twice each frame, which results in a high memory bandwidth for reading vertex data, and a high computational burden for the vertex shader. To cope with this, we store the position of each vertex relative to the bottom-left corner of the mesh block. Since we limit the number of voxel blocks inside a mesh block, we can encode the position using one byte per dimension. This leads to 8 bytes per vertex, i.e. 3 bytes per position, 4 bytes per color which is stored in RGBA format, and 1 byte to align the structure. The client also stores all received voxel blocks, together with the indices of the triangles and points generated for them, in the CPU memory. This is needed for the case that a voxel block is received twice, and the structure inside the block changes or is deleted entirely, e.g. in case of partial resets of the model. If primitives need to be removed from a mesh block, we simply set the alpha value of the color to 0, which makes the primitives invisible and marks them as removed. However, as these invisible primitives also need to pass the vertex shader, we prune them by rebuilding the whole mesh block if their number exceeds a certain threshold.
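The 8-byte vertex layout and the alpha-based removal can be sketched with Python's `struct` module. The field order, the function names and the rebuild threshold of 50% are assumptions made for illustration only.

```python
import struct

# 3 position bytes, 4 RGBA color bytes, 1 padding byte (assumed field order).
VERTEX = struct.Struct("<3B4Bx")

def pack_vertex(pos, color, alpha=255):
    """Pack a block-relative position and an RGB color into 8 bytes."""
    return VERTEX.pack(*pos, *color, alpha)

def hide_vertex(data):
    """Mark a packed vertex as removed by setting its alpha byte to 0."""
    return data[:6] + b"\x00" + data[7:]

def needs_rebuild(vertices, max_hidden_ratio=0.5):
    """True if the share of hidden vertices exceeds the hypothetical threshold."""
    hidden = sum(1 for v in vertices if v[6] == 0)
    return hidden > max_hidden_ratio * len(vertices)
```

For example, `pack_vertex((1, 2, 255), (10, 20, 30))` yields exactly 8 bytes, and a mesh block would be rebuilt once `needs_rebuild` reports too many hidden vertices.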

After a reconstruction thread has processed the update or rebuild for a mesh block, the block is added to the queue of mesh blocks whose geometry data needs to be uploaded to the GPU. When rendering a frame, we upload mesh data based on a predefined time window of 0.5 milliseconds. This time limit has proven to be sufficient to ensure proper uploading to the GPU with low latency while providing a smooth visual experience when rendering at 90Hz on an HMD.
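
A time-budgeted upload loop of this kind might look like the following Python sketch. The function name, the `upload` callback and the injectable `clock` are hypothetical; only the 0.5 ms window is taken from the text. Note that the budget is checked between uploads, so a single large upload can still overshoot it.

```python
import time
from collections import deque


def upload_within_budget(pending, upload, budget_s=0.0005, clock=time.perf_counter):
    """Upload queued mesh blocks until the per-frame time window is used up.

    Returns the number of uploaded blocks; the remaining blocks stay
    queued and are handled in subsequent frames.
    """
    start = clock()
    uploaded = 0
    while pending and clock() - start < budget_s:
        upload(pending.popleft())
        uploaded += 1
    return uploaded
```

Called once per rendered frame with a `deque` of pending mesh blocks, this keeps the upload cost bounded while the queue drains over several frames.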

Appendix B Hash Map and Set Data Structures

In this section, we want to elaborate on the requirements and design choices of our hash map data structure to realize a real-time remote collaboration system. Key to an interactive user experience is a fast and reliable data management that scales across multiple clients. Due to its constant amortized run-time complexity for insertion, retrieval, and removal and especially the successful integration into real-time 3D reconstruction GPU frameworks [niessner; infinitam; Kaehler:2016; Dai:2017], we have chosen spatial hashing techniques to manage our data workflow.

b.1. General Design

Since the hash map and set data structure is heavily used in the whole telepresence system, it must be highly reliable regarding insertion, retrieval, and removal to avoid artifacts such as holes in the exploration client’s virtual model during transmission. In order to maintain key uniqueness to avoid duplicate and inconsistent voxel data, the retrieval operation must be performed – at least internally – inside insertion and removal such that those duplicates can be detected and correctly handled. Therefore, both concurrent insertion and retrieval, and concurrent removal and retrieval must be supported by the underlying data structure. Concurrent insertion and removal is not strictly required by our telepresence system but also supported by extending the stack structure. In contrast, all previous techniques either do not reliably support concurrency by allowing failures, or lack support for at least one of these operations.

Considering the design choices to implement such a hash map and set data structure on the GPU, the actual implementation can be performed either on a thread level or a kernel level. We designed it on a thread level as this provides many important advantages compared to a kernel-leveled version. On the thread level, each operation is hidden behind a function and enables very simple usage and a high re-usability across our system. Furthermore, data management can be easily done in any scenario and does not rely on additional synchronization steps. The most important aspect, however, is that function or thread-leveled operations are immune to synchronization errors from outside as the whole management is performed inside the function. While kernel-leveled versions can instead be hand-tuned to the particular scenario and are thus faster, they are less generic and more susceptible to subtle errors. Since a kernel-leveled implementation that also enforces the required guarantees is provided in the open-source implementation of the original voxel block hashing framework, which we consider an extension to the originally proposed technique [niessner], we will evaluate the implications in terms of run-time regarding this design choice.

b.2. Stack

One core element of our hash data structures is a stack structure that manages the available linked list positions. This stack structure is capable of adding and removing elements to and from its end in parallel. Although a simple implementation based on an atomic counter is sufficient to make the insertion and removal operations thread-safe, we extended it to support concurrent insertion and removal as this property directly propagates to the hash map and set and requires no further modification there. Thus, we need to store the elements, indicators for each element determining whether the entry is occupied, locks for synchronization, the current size and the capacity of the container. Since the underlying arrays are all preallocated, the maximum size is limited.


Adding one element to the stack is performed by the following steps. First, the insertion position is obtained by atomically incrementing the size. In particular, this operation reads the current value, increments and writes it back, and finally returns the old read value. This minimizes synchronization overhead since determining the position is decoupled from the actual insertion. Next, we try to lock the entry at the acquired position and check whether it is not yet occupied. In case the non-blocking locking operation succeeds and the entry is free, we write the given value into the entry, mark it as occupied and release the lock. Otherwise we stop this attempt and repeat the step until success. This guarantees that mixed insertion and removal is also supported.


Returning and erasing the last element of the stack is performed similarly to insertion. First, we obtain the removal position by atomically decrementing the size and, in contrast to insertion, decrementing the returned value once more to obtain the correct index, since the atomic operation returns the size before the decrement. Then we try to lock the corresponding entry, check whether it is occupied and get its stored value. In case it is not occupied, we retry this step until success.
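
The two operations can be illustrated with a CPU-side Python sketch that emulates the atomic counter with a lock. This is not the GPU implementation: the class and method names are hypothetical, the bounds check inside `_fetch_add` is a convenience added here, and marking the entry as free again during removal is our reading of the description rather than something stated explicitly.

```python
import threading
import time


class ConcurrentStack:
    """Fixed-capacity stack following the reserve-then-lock protocol."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.values = [None] * capacity
        self.occupied = [False] * capacity
        self.locks = [threading.Lock() for _ in range(capacity)]
        self.size = 0
        self._size_guard = threading.Lock()  # stands in for a hardware atomic

    def _fetch_add(self, delta):
        """Emulate an atomic add that returns the value read before the update."""
        with self._size_guard:
            old = self.size
            if not 0 <= old + delta <= self.capacity:
                raise IndexError("stack overflow or underflow")
            self.size = old + delta
            return old

    def push(self, value):
        pos = self._fetch_add(+1)  # reserve the insertion position first
        while True:
            if self.locks[pos].acquire(blocking=False):  # non-blocking lock
                try:
                    if not self.occupied[pos]:
                        self.values[pos] = value
                        self.occupied[pos] = True
                        return
                finally:
                    self.locks[pos].release()
            time.sleep(0)  # yield, then retry until the entry is free

    def pop(self):
        pos = self._fetch_add(-1) - 1  # old size minus one is the last index
        while True:
            if self.locks[pos].acquire(blocking=False):
                try:
                    if self.occupied[pos]:
                        self.occupied[pos] = False  # assumption: free the entry
                        return self.values[pos]
                finally:
                    self.locks[pos].release()
            time.sleep(0)  # a concurrent push has not written the entry yet
```

Decoupling the position reservation from the write is what allows a removal to wait for a pending insertion at the same position, and vice versa, instead of failing.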

b.3. Applications

Besides the heavy usage within our telepresence system, our novel hash map data structure is beneficial for several applications such as the ones mentioned below.

Voxel Block Streaming

Figure 12. Our novel two-hash-map streaming approach does not require any knowledge about the hash map’s implementation details and allows adjusting the size of each component independently of the others.
| Dataset | Voxel Size [mm] | Thread Level, Multi Entry | Thread Level, Single Entry | Kernel Level, Multi Entry | Kernel Level, Single Entry | Ours | Total Voxel Blocks |
|---|---|---|---|---|---|---|---|
| heating_room | 5 | 0.27 (0.06) | 0.36 (0.08) | 0.58 (0.15) | 0.74 (0.18) | 0.35 (0.07) | 897 |
| pool | 5 | 0.28 (0.06) | 0.37 (0.08) | 0.63 (0.16) | 0.80 (0.20) | 0.37 (0.08) | 637 |
| fr1/desk2 | 5 | 0.25 (0.04) | 0.29 (0.04) | 0.57 (0.16) | 0.65 (0.17) | 0.29 (0.05) | 134 |
| fr1/room | 5 | 0.29 (0.06) | 0.37 (0.08) | 0.67 (0.18) | 0.82 (0.22) | 0.37 (0.08) | 467 |
| heating_room | 10 | 0.14 (0.04) | 0.16 (0.04) | 0.32 (0.14) | 0.37 (0.14) | 0.16 (0.04) | 147 |
| pool | 10 | 0.15 (0.04) | 0.18 (0.05) | 0.37 (0.14) | 0.43 (0.15) | 0.18 (0.05) | 104 |
| fr1/desk2 | 10 | 0.18 (0.04) | 0.19 (0.04) | 0.42 (0.16) | 0.45 (0.16) | 0.20 (0.04) | 23 |
| fr1/room | 10 | 0.19 (0.04) | 0.22 (0.04) | 0.46 (0.15) | 0.51 (0.15) | 0.22 (0.04) | 86 |
Table 4. Runtime measurements of our hash map data structure for various scenes during reconstruction. We compared the mean (and standard deviation) runtime in milliseconds to similar techniques that either operate on a thread level and allow failures, or on a kernel level with guarantees. While our data structure guarantees successful insertion, it is much faster than kernel-based approaches and only slightly slower than unsafe thread-leveled techniques.
Figure 13. Runtime comparison between hash data structures for the heating_room scene at 5mm voxel resolution (see (a) and (b)) and for the fr1/desk2 scene at 10mm voxel resolution (see (c) and (d)). Panels (a) and (c) show the kernel-leveled data structures, panels (b) and (d) the thread-leveled data structures.

Classical swapping techniques, which are part of several out-of-core scenarios, are used to relax memory size restrictions and enable applications that scale much better, such as the voxel block hashing pipeline with its CPU-GPU streaming component. Whereas the original approach by Nießner et al. [niessner] uses a simple list on the CPU side to manage the streamed data, the InfiniTAM system [infinitam] reuses the GPU hash map to implicitly manage both the GPU and the CPU volume using an index-based mapping and special flags to indicate the streaming state. The coupling this introduces between the hash map size and the CPU volume size has been relaxed by Mossel and Kroeter [mossel], who added an auxiliary index array between these two to reduce the memory footprint. However, all these approaches either shift at least one part to the CPU side or effectively use only a single GPU hash map. In contrast, we use two hash maps: one for the active GPU volume and one for the passive CPU volume (see Figure 12). Both volumes are implemented by pool data structures and both hash maps are allocated on the GPU so that all data management can be performed efficiently in parallel. Only the voxel block data buffer of the passive pool is stored in CPU memory whereas the remaining parts reside in GPU memory, which has several advantages. First, there is no need for managing streaming-related logic in the raycasting or fusion step to differentiate between active and passive voxel blocks, which completely decouples the streaming component from the rest of the pipeline. Furthermore, we can drop the requirement that the passive voxel block pool must have the same size as the hash map since no index hacks or streaming state management are needed anymore. This also avoids the auxiliary index buffer approach of Mossel and Kroeter [mossel] and moves and unifies all voxel block data management to the hash map data structure.
A direct consequence of our approach is that during streaming, a voxel block might be allocated in both the active and the passive volume when limiting the size of the transfer buffer [infinitam]. Therefore, we simplify the two-step copy-and-merge streaming technique of InfiniTAM and consider merging as the only needed operation. In particular, our transfer buffers do not store indices to hash map entries but the actual inserted pointers to voxel blocks together with the voxel block data. This also decouples the streaming from the actual hash map implementation.
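
The merge-only transfer can be illustrated with a small Python sketch in which both volumes are plain dictionaries. This is a simplified model rather than the actual streaming code: the function names are hypothetical, the transfer buffer holds keys together with block data, and the merge rule (a weighted average of `(tsdf, weight)` pairs, as in standard TSDF fusion) is an assumption.

```python
def merge_weighted(a, b):
    """Assumed merge rule: weighted average of two (tsdf, weight) pairs."""
    (va, wa), (vb, wb) = a, b
    w = wa + wb
    return ((va * wa + vb * wb) / w, w) if w > 0 else (0.0, 0)


def transfer(src, dst, keys, buffer_size, merge=merge_weighted):
    """Move at most buffer_size blocks from src to dst, merging duplicates.

    The buffer stores keys and block data instead of hash map indices, so
    the transfer does not depend on how either map is implemented.
    """
    buffer = [(k, src.pop(k)) for k in list(keys)[:buffer_size] if k in src]
    for key, block in buffer:
        dst[key] = merge(dst[key], block) if key in dst else block
    return len(buffer)
```

Streaming out corresponds to `transfer(active, passive, ...)` and streaming in to `transfer(passive, active, ...)`; a block present in both volumes is handled by the same merge path in either direction.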

Beyond Spatial Hashing

While we follow the specific hash function definition for 3D spatial hashing used by Nießner et al. [niessner] and InfiniTAM [infinitam], our data structure has no limitations regarding key-value pair size or exchangeability of the actual hash function. Thus, it can be applied to various problems beyond computer graphics that need proper and reliable on-the-fly data management of millions of entries on the GPU with enforced key uniqueness preservation. This includes file indexing in data centers or large databases with non-standard data in economics and other fields.
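
To illustrate the exchangeability of the hash function, the following Python sketch passes it as a parameter. The prime constants are those commonly used in voxel hashing implementations and should be treated as an assumption here, as should the function names. Python integers are unbounded, so results for large or negative coordinates can differ from a 32-bit GPU implementation.

```python
import zlib

# Constants commonly used for 3D spatial hashing (assumed, not taken from the text).
P1, P2, P3 = 73856093, 19349669, 83492791


def spatial_hash(key):
    """XOR of the integer block coordinates, each scaled by a large constant."""
    x, y, z = key
    return (x * P1) ^ (y * P2) ^ (z * P3)


def bucket_index(key, num_buckets, hash_fn=spatial_hash):
    """Map a key to a bucket; the hash function is freely exchangeable."""
    return hash_fn(key) % num_buckets


voxel_bucket = bucket_index((-3, 5, -7), 1000)
file_bucket = bucket_index("scan_0042.bin", 1000,
                           hash_fn=lambda name: zlib.crc32(name.encode()))
```

The second call uses a CRC32 of a hypothetical file name as the hash, hinting at the non-spatial use cases mentioned above.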

b.4. Evaluation

In the following, we provide a detailed evaluation of the measured runtimes for our hash data structure. Similar to the bandwidth analysis provided in the paper, we used the scenes heating_room and pool recorded with the Kinect v2 and two further datasets captured with the Kinect v1 [sturm12iros]. For the purpose of a fair comparison, we reimplemented and evaluated thread-leveled versions of the data structures following the descriptions provided by Nießner et al. [niessner], i.e. a multi-entry hash map which resolves collisions through a neighborhood search, and by InfiniTAM [infinitam], i.e. a single-entry hash map with a stack data structure implemented via a simple atomic counter. While neither of these approaches provides the strong guarantees that are beneficial in the context of a remote collaboration system, they are still suitable for high-frame-rate 3D reconstruction. The comparison of our technique and these two approaches is shown in Table 4 and Figure 13. Since it is also possible to ensure successful insertion and removal by looping over the kernel and testing whether the size has changed, we further compare our thread-leveled data structure with kernel-leveled multi-entry and single-entry versions as included, e.g., in the extended voxel hashing framework accompanying the work by Nießner et al. [niessner]. We used a bucket size of buckets for all hash maps and two entries per bucket for the multi-entry ones. In order to evaluate the insertion performance of each approach and minimize side effects by other parts of the voxel hashing pipeline, we measured the runtime of the voxel block allocation step.

Across all scenes, we observed that the kernel-leveled approaches are significantly slower, i.e. exhibit runtimes of more than a factor of , in comparison to their thread-leveled counterparts. This is a result of the need for at least two calls of the kernel where in the second run all insertion failures are corrected. Furthermore, obtaining the hash map size involves additional costly and inefficient memory copies from GPU to CPU memory. Note that kernel performance optimizations could reduce or even close the gap, but this requires careful manual tuning. When comparing single and multi-entry approaches, we observe that the multi-entry technique is approximately faster since first-order collisions are directly handled and no additional stack data structure is required. However, this comes at the cost of an increased memory footprint where most secondary entries remain empty and unused. Our data structure’s performance is approximately on par with the other thread-leveled approaches but provides the reliability of their kernel-leveled counterparts without further hand-tuning.

We also observed that the runtime scales almost linearly with the voxel size since the allocation step traverses over all visible blocks in the view frustum. For datasets captured with the Kinect v1, the difference between 5mm and 10mm resolution is less than for the datasets recorded with the Kinect v2. This is mainly caused by the lower field of view in conjunction with the higher image resolution of the Kinect v1 sensor. Thus, the impact of the volume traversal is higher at 10mm since more pixels try to insert a single block. The hash map efficiently handles this by immediately returning if the block has already been inserted, thus keeping the cost in such cases as low as possible. Over the course of time, the runtimes of all hash map data structures remain constant (see (c) and (d)). At higher load factors where more collisions are observed and also over the course of time (see (a) and (b)), we observed slightly increasing values which are caused by traversing colliding entries in the linked lists. However, this impact compared to the total runtime is rather small and underlines the constant amortized asymptotic complexity of hash data structures.
