Optimization and Manipulation of Contextual Mutual Spaces for Multi-User Virtual and Augmented Reality Interaction
Abstract
Spatial computing experiences are physically constrained by the geometry and semantics of the local user environment. This limitation is elevated in remote multi-user interaction scenarios, where finding a common virtual ground physically accessible to all participants becomes challenging. Locating such a common accessible virtual ground is difficult for the users themselves, particularly if they are not aware of the spatial properties of the other participants' environments. In this paper, we introduce a framework to generate an optimal mutual virtual space for a multi-user interaction setting. The framework further recommends movements of the surrounding furniture objects that expand the size of the mutual space with minimal physical effort. Finally, we demonstrate the performance of our solution on real-world datasets as well as a real HoloLens application. Results show that the proposed algorithm can effectively discover an optimal shareable space for multi-user virtual interaction and hence facilitate remote spatial computing communication in various collaborative workflows.
© 2020 ACM. ISBN 978-1-4503-6708-0/20/04. DOI: https://doi.org/10.1145/3313831.XXXXXXX
CCS Concepts: • Human-centered computing → Collaborative and social computing; Virtual reality; Collaborative and social computing systems and tools; Mixed / augmented reality; Contextual design; • Applied computing → Multi-criterion optimization and decision-making; • Theory of computation → Evolutionary algorithms
1 Introduction
The emerging fields of augmented reality (AR) and virtual reality (VR) have introduced a large number of exciting applications in telecommunication, immersive collaboration, and social media where multiple users can share a virtual environment. While much work has been done on 3D capturing methods, real-life avatar modeling, and virtual social platforms, one key challenge in AR/VR immersion is understanding the users' surrounding spaces and determining how to optimally utilize them for immersion tasks.
More specifically, acquiring an accessible 3D workspace is a prerequisite for a virtual or augmented immersion experience. Furthermore, the augmentation of virtual data in the physical space must be compatible with the contextual properties of that space, such as a floor that is standable, a chair that is sittable, and a wall that acts as a physical barrier to virtual interactions. For many six-degrees-of-freedom (6-DOF) VR applications, the user is often asked to manually designate a block of free space in which the VR immersion can be assumed to be safe. Such a space is called a VR workspace and is typically assumed to be standable. Inferring the above contextual information for both AR and VR can be readily done using several well-established 3D modeling algorithms in computer vision. Current AR devices, such as the HoloLens or Magic Leap, integrate such algorithms to estimate the layout of the space, including floors, walls, and ceilings, as well as typical furniture objects such as tables and chairs. In this paper, we assume such contextual information about individual spaces is available via either a manual or algorithmic process.
However, in scenarios where an immersive experience involves multiple users, the understanding of spatial constraints must extend to all involved users. Since different users may participate in the immersive experience from their own spaces, which can have very contrasting contextual properties, a consensus must be established to identify a mutual space that respects the spatial constraints of all participants. Yet having users manually identify such a mutual space would be imprecise and labor-intensive, especially considering that it is difficult for a user to be aware of the contextual properties of the other users' spaces. Without more effective and efficient solutions, the establishment of a contextual mutual space will remain a bottleneck for multi-user immersion experiences.
Motivated by this challenge, we present in this paper a novel method to optimize contextual mutual space in a multi-user immersion setting. Our method relies on existing semantic scene maps to identify shareable functional spaces. For illustration purposes, we use standable and sittable as the two functions to walk through our method, although the solution is also compatible with other contextual functions. The method formulates an optimization problem to seek the maximal mutual spaces. Additionally, if one can assume the users have the freedom to rearrange furniture objects on the floor, we introduce a more delicate optimization problem that further increases the mutual space's size while considering the users' effort to physically move the objects as an additional constraint.
In this paper, we propose a genetic algorithm approach to solve the two optimization problems above; naturally, other comparable algorithms for these NP-hard problems could be equally effective. The end result is a new solution capable of automatically recommending a contextual mutual space to multiple participants of virtual immersion experiences in AR/VR applications.
2 Related Work
Immersive AR/VR systems have been widely explored for remote telepresence applications, providing real-time capture, transmission, and display between participants of the platform [16, 6, 30]. Using an array of cameras [56, 29, 54] or depth sensors monitoring the capture space [48, 22, 41, 39], holographic replicas or avatars of the virtual participants are projected into predefined local spaces. Such projections have been extensively developed using situated auto-stereo [43, 41], volumetric [21], light-field [4], cylindrical [26], and holographic [9] displays. However, participants of such systems are mainly stationed in predefined spaces [61, 8, 38, 58] to avoid any geometrical conflicts with surrounding features in the projected space. Such an approach limits free-form motion of the participants within each other's location, an important factor for achieving co-located presence.
The importance of free-form user movement and the ability to preserve mobility-based communication features such as walking, gestures, and head movement have been studied extensively in the context of co-located collaboration [5, 36, 25]. Another vital aspect of sharing mutual space is described in Clark's work as grounding [14]. Grounding in communication (or common ground) comprises the collection of "mutual knowledge, mutual beliefs, and mutual assumptions" that is essential for communication between two people. Successful grounding in communication requires parties to coordinate both content and process [31]. As content in spatial computing can also involve the surrounding space itself, providing a common virtual ground can be critical to allow all communication features to be reflected correctly.
More recent examples have explored how telepresence can be conducted with fewer spatial constraints, allowing fluid user motion at both ends of the communication. The works of [17] and [7] are examples of such systems, where users and their local interaction spaces are continuously captured using a cluster of registered depth and color cameras. However, these systems use stereoscopic projection, which limits the ability of remote and local users to access each other's space. Instead, spaces are virtually disconnected and interaction occurs through a window from one space into the other. Meanwhile, the Holoportation system introduced by Orts-Escolano et al. allows bilateral telepresence in which participants share a common virtual ground [46]. Their system renders the remote user into the local user's space as an avatar, while the local user likewise appears as an avatar in the remote user's space. Such an approach is also seen in [40], where the remote and local rooms do not share the same functional layout but are calibrated in order to provide the required mutual virtual ground between users.
While telepresence systems via shared spaces present novel workflows for capturing and projecting virtual avatars, avoiding physical and virtual conflicts within the shared spaces remains an open challenge. In this regard, the work of Lehment et al. [32] may be the closest to this paper: it proposes an automated method to align remote environments so that discrepancies in room obstacles and physical barriers are minimized. However, the method is limited to two spaces and uses a brute-force search to calculate the consensus space between participants. Our method formulates rigorous optimization problems to search and manipulate a potentially unlimited number of spaces in order to find a mutual spatial boundary.
The practice of determining an optimal arrangement of discrete spatial elements is often referred to as floorplanning [15]. Automated floorplanning methodologies have been widely investigated in architectural space layouts, construction [13, 55, 47], electronic design [44, 11, 19], and industrial operations research [2]. Floorplanning aims to achieve a defined functional goal by efficiently generating and evaluating possible spatial combinations while addressing the geometrical and topological constraints of the spatial elements [20]. In electronic physical design floorplanning, proposed methodologies mostly aim at optimizing chip area and wire lengths to reduce interconnections and improve timing [23]. In construction site layout and planning, optimizing the interaction between facilities, such as total inter-facility transportation costs and the frequency of inter-facility trips, can also be implemented as objective functions [47]. In our proposed framework, we similarly integrate an objective function whose goal is to minimize the amount of effort required to move surrounding furniture while maximizing the area of the mutual virtual ground among all participants.
In floorplanning, various representation methods of spatial arrangements are coupled with optimization engines to efficiently search through all possible combinations of spatial elements. Floorplanning representations are generally divided into two main categories: slicing and non-slicing [57]. In slicing methodologies, the floor plan is recursively bisected until each part consists of a single module [59]. Non-slicing representations are utilized for more general use cases where no recursive bisection of a certain area takes place [18, 37, 34]. Multiple studies have integrated these representations with various optimization algorithms such as Simulated Annealing (SA) [27, 28, 59], Genetic Algorithms (GA) [51, 45, 33, 19, 60], and Particle Swarm Optimization (PSO) [53, 12, 24, 52, 42]. More recently, by applying learning-based algorithms, hybrid neural networks [7] and annealed neural networks [8] have been used to identify optimal site layouts and solve construction site-level problems.
3 Methodology
Our solution consists of the following four steps: (i) semantic segmentation of surrounding environments; (ii) topological scene graph generation; (iii) mutual space identification; and (iv) optionally, manipulation of ground objects to further maximize the mutual space. In this section, we elaborate on the details of these four steps. To start, we define the terminologies and notations used in the paper.
Given a closed 3D room space in $\mathbb{R}^3$, one can project its enclosure, i.e., floors, ceilings, and walls, via an orthographic projection to form a 2D projection, commonly known as the floor plan of the space. If we assign $x$ and $y$ as the coordinates on the floor-plan plane and $z$ as the coordinate perpendicular to it, then restricting our optimization problems to the $x$-$y$ plane significantly reduces the complexity of our algorithms. It also implies the assumption that no two objects overlap on the $x$-$y$ plane while occupying different $z$ ranges. Nevertheless, we believe such simplification is reasonable for analyzing the majority of room structures and thus does not compromise the generality of the analysis provided herein.
Hence, for each user $i$ we define their own room space, expressed as a 2D floor plan, as $R_i$. Each $j$th object (e.g., furniture) in $R_i$ is denoted as $O_i^j$. The collection of all objects in $R_i$ is denoted as $\mathcal{O}_i$. $\partial O_i^j$ represents the boundary of the object $O_i^j$; similarly, $\partial R_i$ represents the boundary of the room $R_i$. Finally, we define the area function as $A(\cdot)$.
3.1 Semantic Segmentation
Given the measurement of the surrounding physical environment as large sets of point cloud data, one can take advantage of the semantic segmentation methods widely investigated in the computer vision literature [49, 35, 3] to segment spatial boundaries and obtain geometric properties such as dimensions, position and orientation, object classification, functional shapes, and weights. In doing so, we convert the 3D point cloud data to labeled objects, each with a bounding box, denoted $O_i^j$.
Additionally, in this paper we exclude lightweight objects (such as pillows, alarm clocks, laptops, etc.) positioned on larger furniture. This is to simplify our calculations in the next steps as we assume these lightweight objects can be easily moved by the users and do not need to be considered in the optimization criteria. Such classification is dependent on the output labeled object categories above.
In the experiment section below, since the implementation of a computer vision algorithm for semantic segmentation is not the main focus of this paper, we directly integrate a modified version of the MatterPort3D [10] object classifier in our system. This module can be replaced with any other robust semantic segmentation system, as long as it provides bounding box coordinates for each object category. In the companion MatterPort3D [10] dataset, out of 1,659 unique text labels, we classify 134 of the labels as lightweight objects and filter their corresponding bounding boxes from our workflow.
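As a concrete sketch of this filtering step, the snippet below drops lightweight objects by label before later processing. The label set and the dictionary layout are illustrative placeholders, not the actual MatterPort3D categories:

```python
# Illustrative light-weight-object filter. The label set below is a made-up
# stand-in for the 134 lightweight MatterPort3D labels mentioned in the text.
LIGHTWEIGHT_LABELS = {"pillow", "alarm clock", "laptop", "book", "cup"}

def filter_objects(objects):
    """Drop objects whose label marks them as easily movable by hand.

    `objects` is a list of dicts with a "label" key and a "bbox" key holding
    ((xmin, ymin), (xmax, ymax)) floor-plan coordinates.
    """
    return [o for o in objects if o["label"] not in LIGHTWEIGHT_LABELS]

scene = [
    {"label": "table",  "bbox": ((1.0, 1.0), (2.5, 2.0))},
    {"label": "pillow", "bbox": ((1.2, 1.2), (1.5, 1.5))},
]
kept = filter_objects(scene)  # only the table survives the filter
```

Only the heavier furniture is passed on to the scene-graph and optimization stages.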
Figure 2(a) illustrates the result of semantic segmentation of two room spaces projected onto the $x$-$y$ plane.
3.2 Topological Scene Graph
After identifying the bounding box, orientation, and category type of each object in the scene, a topological graph is readily generated that describes the relationships and constraints among the objects within $R_i$. This step allows us to identify usable spatial functions, such as standing during virtual immersion, located between the objects. We categorize this type of function as a standalone spatial function, and the corresponding spaces are called standalone spaces.
A topological scene graph also allows us to identify other spatial functions on the objects themselves, such as sitting on a chair or working at a table. Note that functions such as sitting or working are further constrained by the distances between the object that performs the function and its adjacent objects. For example, one side of a table cannot be utilized for working purposes if that side is adjacent to other furniture or building elements (such as walls, doors, etc.). We categorize this type of function as an auxiliary spatial function, and the corresponding spaces are called auxiliary spaces.
In this paper, we use two spatial functions, standable and sittable, as examples to demonstrate how to integrate both standalone and auxiliary spatial functions in the optimization of contextual mutual spaces for multi-user interaction in AR/VR.
Finally, we emphasize that standalone spaces and auxiliary spaces are not mutually exclusive. For example, in this paper, we assume that a standable space is sittable as well. However, the converse may not be true: a portion of a sittable space may involve part of a bed object, which we do not assume to be standable. Such contextual constraints are highly customizable based on the content of the AR/VR application, but the framework we introduce is general enough to accommodate other contextual interpretations of standalone and auxiliary spatial functions.
In our implementation, we use a doubly-linked data structure to construct the graph. For each side face of an object's bounding box, we determine the closest adjacent object to that face and calculate the distance between the two. This information is stored at the object level, where topological distances and constraints are referenced using pointers.
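A minimal sketch of such a structure is shown below, assuming axis-aligned bounding boxes on the floor plan; the face naming and the clearance computation are our own illustrative choices, not the paper's exact implementation:

```python
from dataclasses import dataclass, field

@dataclass
class SceneObject:
    name: str
    bbox: tuple  # ((xmin, ymin), (xmax, ymax)) on the floor plan
    # per-face nearest neighbour, filled in by build_graph: face -> (name, gap)
    neighbours: dict = field(default_factory=dict)

def face_gap(a, b, face):
    """Axis-aligned clearance from a face of box `a` to box `b`;
    inf if `b` does not lie in front of that face."""
    (ax0, ay0), (ax1, ay1) = a
    (bx0, by0), (bx1, by1) = b
    if face in ("+x", "-x"):
        if by1 < ay0 or by0 > ay1:      # no overlap along y: not in front
            return float("inf")
        return (bx0 - ax1) if face == "+x" else (ax0 - bx1)
    else:
        if bx1 < ax0 or bx0 > ax1:      # no overlap along x: not in front
            return float("inf")
        return (by0 - ay1) if face == "+y" else (ay0 - by1)

def build_graph(objects, room_bbox):
    """For each face of each object, store the closest obstacle
    (another object or the room boundary) and the gap to it."""
    (rx0, ry0), (rx1, ry1) = room_bbox
    for a in objects:
        (ax0, ay0), (ax1, ay1) = a.bbox
        wall = {"+x": rx1 - ax1, "-x": ax0 - rx0,
                "+y": ry1 - ay1, "-y": ay0 - ry0}
        for face in ("+x", "-x", "+y", "-y"):
            best = ("wall", wall[face])
            for b in objects:
                if b is a:
                    continue
                d = face_gap(a.bbox, b.bbox, face)
                if 0 <= d < best[1]:
                    best = (b.name, d)
            a.neighbours[face] = best
    return objects
```

For a table at ((4, 4), (6, 6)) and a chair at ((7, 4), (8, 6)) inside a 10 x 10 room, the table's +x face links to the chair with a gap of 1 while its -x face links to the wall with a gap of 4, mirroring the per-face distance functions defined next.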
Mathematically, for each object $O_i^j$, we define the function $d_{x^+}(O_i^j)$ as the shortest distance between the points in $O_i^j$ that have the maximal $x$ value and the other objects, including $\partial R_i$. Similarly, we define the functions $d_{x^-}$, $d_{y^+}$, and $d_{y^-}$.
3.3 Mutual Space Identification
In this step, we will identify the geometrical boundaries of available spaces in each room and then align the calculated boundaries of all rooms to achieve maximum consensus on mutual spaces.
First, using the geometrical and topological properties extracted in previous steps, we are ready to calculate available spaces in each room based on two categories, namely, the standalone spaces and auxiliary spaces. Specifically, we will formulate the calculation of the two most typical spatial functions as examples again, namely, standable and sittable.
3.3.1 Standable Spaces
Standing spaces consist of the volume of the room in which no object is present within a human user's height range. In such spaces, the user can move freely without any risk of colliding with an object in the surrounding physical environment. Activities such as intense gaming or performative arts can be safely executed within these boundaries. Such spaces are also suitable for virtual reality experiences, where users may not be aware of their physical surroundings.
We calculate the available standable space $S_i$ for room $R_i$ simply as follows:

$$S_i = R_i \setminus \bigcup_{j} O_i^j. \qquad (1)$$
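Assuming axis-aligned bounding boxes, Eq. (1) can be approximated on a regular grid as in the sketch below; an exact implementation would use polygon boolean operations instead:

```python
def standable_area(room, objects, step=0.05):
    """Grid estimate of Eq. (1): the room footprint minus the union of all
    object footprints. Boxes are ((xmin, ymin), (xmax, ymax)); the regular
    grid stands in for the exact polygon boolean used in the paper."""
    (rx0, ry0), (rx1, ry1) = room

    def inside(box, x, y):
        (x0, y0), (x1, y1) = box
        return x0 <= x <= x1 and y0 <= y <= y1

    nx = int(round((rx1 - rx0) / step))
    ny = int(round((ry1 - ry0) / step))
    free = 0
    for i in range(nx):
        for j in range(ny):
            x = rx0 + (i + 0.5) * step     # cell centre
            y = ry0 + (j + 0.5) * step
            if not any(inside(o, x, y) for o in objects):
                free += 1                  # cell belongs to S_i
    return free * step * step
```

For a 4 x 4 room with a single 2 x 2 object, the estimate converges to the expected 12 square units of standable space as the grid is refined.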
3.3.2 Sittable Spaces
The calculation of maximal sittable spaces is more involved than that of the standable spaces above. As mentioned before, sittable spaces normally extend the standable spaces by adding areas where humans are able to sit. Furniture types such as sofas, chairs, and beds include sitting areas that can extend the usable spaces of a room for social functions such as general meetings, design reviews, and conference calls.
To start, we define a sittable threshold $\delta^j$ to calculate the sittable area within the bounding box of the object $O_i^j$. In other words, $\delta^j$ is the maximum distance inward from an edge of the object's bounding box that can be comfortably sat on. We use measurements from [50] to define $\delta^j$ for each furniture type. If an object is classified as non-sittable, then $\delta^j = 0$.
Therefore, we can first calculate the non-sittable area $N_i^j$ of an object as

$$N_i^j = \big\{\, p \in O_i^j \;\big|\; B(p, \delta^j) \subseteq O_i^j \,\big\}, \qquad (2)$$

where $B(p, \delta^j)$ is a sphere in $\mathbb{R}^2$ centered at $p$ with radius $\delta^j$.
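For an axis-aligned bounding box, the non-sittable region of Eq. (2) has a simple closed form, sketched below; real furniture footprints would require a general polygon erosion:

```python
def nonsittable_area(bbox, delta):
    """Eq. (2) specialised to an axis-aligned bounding box: a point p is
    non-sittable when the disk B(p, delta) stays inside the box, i.e. p lies
    at least `delta` from every edge. For a w x h box that interior is a
    (w - 2*delta) x (h - 2*delta) rectangle (empty when delta is large)."""
    (x0, y0), (x1, y1) = bbox
    w = max(0.0, (x1 - x0) - 2 * delta)
    h = max(0.0, (y1 - y0) - 2 * delta)
    return w * h
```

For a 2 x 1 bed with a 0.25 m threshold, the non-sittable core is the inner 1.5 x 0.5 rectangle; a small stool whose sides are shorter than twice the threshold has no non-sittable core at all.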
We note that sittable spaces do not necessarily comprise only objects to be sat on, but rather describe an area in which a sittable object can be placed. For example, while an individual may not be able to comfortably sit on the top of a table, the foot space below the table can be considered sittable space. Therefore, in this context the sittable area of a room is always larger than its standable area.
Moreover, the sittable areas of each object in the room are constrained by the topological positioning of the object. If any of the object's boundaries is adjacent to a non-sittable object (such as a wall, bookshelf, etc.) or does not contain enough standable area between itself and a non-sittable object, the sittable area on that side should be excluded. For instance, if a table is positioned in the center of a room with no non-sittable object around it, the sittable area is calculated by applying the sittable threshold to all four sides of the table's boundary. However, if the table is positioned in the corner of the room, no sittable area is accumulated for the sides adjacent to the walls.
To simplify our calculation, we define a surrounding boundary threshold $\epsilon^j$ for object $O_i^j$, which measures the distance outward from an object boundary point within which that point remains part of the sittable space of the object. In other words, if a boundary point is within distance $\epsilon^j$ of other objects or the room boundary, then that point cannot be sat on. The set $E_i^j$, defined below, collects all such points for exclusion from $O_i^j$ in room $R_i$:

$$E_i^j = \Big\{\, p \in O_i^j \;\Big|\; B(p, \epsilon^j) \cap \Big( \partial R_i \cup \bigcup_{k \neq j} O_i^k \Big) \neq \emptyset \,\Big\}. \qquad (3)$$
Therefore, the sittable space of each object is simply defined as

$$T_i^j = O_i^j \setminus \big( N_i^j \cup E_i^j \big). \qquad (4)$$
Finally, the total sittable space for the room $R_i$ is

$$T_i = S_i \cup \bigcup_{j} T_i^j. \qquad (5)$$
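The sittable-space equations can be combined into one grid-based estimate, sketched below under the assumption of axis-aligned boxes; `delta` and `eps` play the roles of the sittable and surrounding boundary thresholds:

```python
def sittable_area_of_object(bbox, obstacles, room, delta, eps, step=0.05):
    """Grid sketch of Eqs. (2)-(4): a point of the object is sittable when it
    lies within `delta` of the object's own edge (outside the non-sittable
    core) and at least `eps` away from every obstacle and the room boundary."""
    (x0, y0), (x1, y1) = bbox
    (rx0, ry0), (rx1, ry1) = room

    def dist_to_box(box, x, y):
        # Euclidean distance from point (x, y) to an axis-aligned box
        (bx0, by0), (bx1, by1) = box
        dx = max(bx0 - x, 0.0, x - bx1)
        dy = max(by0 - y, 0.0, y - by1)
        return (dx * dx + dy * dy) ** 0.5

    area = 0.0
    nx = int(round((x1 - x0) / step))
    ny = int(round((y1 - y0) / step))
    for i in range(nx):
        for j in range(ny):
            x = x0 + (i + 0.5) * step
            y = y0 + (j + 0.5) * step
            edge = min(x - x0, x1 - x, y - y0, y1 - y)  # distance to own edge
            if edge >= delta:                # non-sittable core, Eq. (2)
                continue
            near_wall = min(x - rx0, rx1 - x, y - ry0, ry1 - y) < eps
            near_obstacle = any(dist_to_box(o, x, y) < eps for o in obstacles)
            if not (near_wall or near_obstacle):   # exclusion set, Eq. (3)
                area += step * step
    return area
```

For a 2 x 1 table centered in a 4 x 4 room with no nearby obstacles, the sittable band around all four sides amounts to roughly 1.25 square units for a 0.25 m threshold.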
Figure 3 illustrates two example rooms and compares their standing and sitting areas.
3.3.3 Maximizing Mutual Spaces
Now we consider an immersive experience with $n$ subjects and therefore $n$ room spaces $R_1, \ldots, R_n$. Then, in the $x$-$y$ coordinates, we define a rigid-body motion in $\mathbb{R}^2$ as $g = (t, \theta)$, where $t \in \mathbb{R}^2$ describes a translation and $\theta$ a rotation.
If we want to maximize the mutual standable space, we can apply one rigid-body motion $g_i$ to each individual standable space $S_i$ of the $i$th user. The optimal rigid-body motions then maximize the area of the interaction space:

$$\{g_i^*\} = \arg\max_{\{g_i\}} \, A\Big( \bigcap_{i=1}^{n} g_i(S_i) \Big). \qquad (6)$$
Then the maximal mutual standable space can be calculated as

$$S^* = \bigcap_{i=1}^{n} g_i^*(S_i). \qquad (7)$$
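A toy version of this mutual-space search can be sketched with plain random sampling over rigid-body motions; the paper uses an evolutionary algorithm (SPEA2) instead, and the grid-based area estimate below is only an approximation:

```python
import math
import random

def make_standable(room, objects):
    """Membership predicate for the standable space S_i of Eq. (1)."""
    (rx0, ry0), (rx1, ry1) = room
    def inside(box, x, y):
        (x0, y0), (x1, y1) = box
        return x0 <= x <= x1 and y0 <= y <= y1
    def standable(x, y):
        return (rx0 <= x <= rx1 and ry0 <= y <= ry1
                and not any(inside(o, x, y) for o in objects))
    return standable

def mutual_area(spaces, motions, bound=5.0, step=0.2):
    """Grid estimate of A( intersection of g_i(S_i) ): a world point lies in
    the mutual space iff its pre-image under every g_i = (tx, ty, theta)
    is standable in the corresponding room."""
    def in_space(k, x, y):
        tx, ty, th = motions[k]
        ux, uy = x - tx, y - ty              # undo the translation
        c, s = math.cos(-th), math.sin(-th)  # undo the rotation
        return spaces[k](c * ux - s * uy, s * ux + c * uy)
    n = int(round(2 * bound / step))
    area = 0.0
    for i in range(n):
        for j in range(n):
            x = -bound + (i + 0.5) * step
            y = -bound + (j + 0.5) * step
            if all(in_space(k, x, y) for k in range(len(spaces))):
                area += step * step
    return area

def random_search(spaces, iters=50, seed=0):
    """Stand-in for the evolutionary search over Eq. (6): sample random
    rigid-body motions and keep the best mutual area found."""
    rng = random.Random(seed)
    best, best_g = -1.0, None
    for _ in range(iters):
        g = [(rng.uniform(-2, 2), rng.uniform(-2, 2), rng.uniform(0, 2 * math.pi))
             for _ in spaces]
        a = mutual_area(spaces, g)
        if a > best:
            best, best_g = a, g
    return best, best_g
```

With two empty rooms of 4 x 4 and 3 x 3 and identity motions, the mutual area estimate is the 9 square units of the smaller room, which is also the upper bound any motion search can reach.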
3.4 Furniture movement optimization
In the event that individual spaces include movable furniture, additional optimization can be considered to potentially increase the maximal mutual spaces. Diverging from merely applying rigid-body motions that transform only the coordinate representation of the spaces, we consider moving furniture objects within a space, which incurs an additional cost of human effort. Consequently, we formulate this effort as part of our optimization objective.
More specifically, given a rigid-body motion $g = (t, \theta)$, we define $\|t\|$ as the Euclidean length of its translation vector. Then we define the effort of moving object $O_i^j$ by motion $g_i^j$ as

$$e(O_i^j, g_i^j) = w_i^j \, \|t_i^j\|, \qquad (8)$$
where $w_i^j$ is a given parameter that approximates the weight of each object. Note that such weight estimates can be looked up in architecture standards such as [50]. Hence, if a room space $R_i$ has $m_i$ objects, then the total effort to rearrange the space is

$$E_i = \sum_{j=1}^{m_i} e(O_i^j, g_i^j), \qquad (9)$$

where $G_i = \{g_i^j\}_{j=1}^{m_i}$ denotes the collection of rigid-body motion parameters.
Since solving for the optimal object transformations is an NP-hard problem, in this paper we demonstrate a heuristic-based but practical algorithm that optimizes it in a step-by-step greedy fashion:

$$\max_{\{g_i\}, \{G_i\}} \, A^{(k)}\big(\{g_i\}, \{G_i\}\big) \quad \text{s.t.} \quad \sum_{i=1}^{n} E_i \le E^{(k)}, \qquad (10)$$

where $A^{(k)}$ indicates the area value at the $k$th step with respect to the transformation coefficients $\{g_i\}$ and $\{G_i\}$, and $E^{(k)}$ is the effort budget allowed at that step. The iteration stops when the optimization cannot further increase the area of the mutual space.
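The greedy step-by-step idea can be sketched as below; the candidate move set, the per-step effort budget, and the pluggable `area_fn` scorer are our own simplifications of the paper's evolutionary search:

```python
import math

def effort(weight, move):
    """Eq. (8): effort of one move = object weight x translation length."""
    dx, dy = move
    return weight * math.hypot(dx, dy)

def greedy_expand(objects, weights, area_fn, budget_per_step, steps=5):
    """Greedy sketch of the step-wise optimization: at each step, try a small
    candidate set of translations for every object and apply the single
    admissible move with the largest area gain; stop when no move increases
    the area. `area_fn(layout)` scores a layout of object centres; in the
    paper this role is played by the genetic search, but any callable works."""
    candidates = [(0.5, 0.0), (-0.5, 0.0), (0.0, 0.5), (0.0, -0.5)]
    total_effort = 0.0
    area = area_fn(objects)
    for _ in range(steps):
        best = None  # (new_area, effort, object_index, move)
        for i, (x, y) in enumerate(objects):
            for dx, dy in candidates:
                e = effort(weights[i], (dx, dy))
                if e > budget_per_step:
                    continue           # respect the per-step effort budget
                trial = list(objects)
                trial[i] = (x + dx, y + dy)
                a = area_fn(trial)
                if a > area and (best is None or a > best[0]):
                    best = (a, e, i, (dx, dy))
        if best is None:
            break                      # no move increases the area any more
        area, e, i, (dx, dy) = best
        objects = list(objects)
        x, y = objects[i]
        objects[i] = (x + dx, y + dy)
        total_effort += e
    return objects, area, total_effort
```

With a toy scorer that rewards pushing a single 10 kg object along +x, three greedy steps move it 1.5 units for a total effort of 15, illustrating how the area gain and the accumulated effort are traded off step by step.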
4 Experiments
We created two sets of experiments to evaluate the performance of our workflow. First, to comprehensively observe how the search and recommendation system performs given various room types with different spatial organizations, we take advantage of available 3D datasets to experiment with large quantities of real-world case studies. We randomly sample subsets of varying sizes of 3D scanned scenes from the MatterPort3D dataset and perform the search and recommendation procedure on each subset to observe how mutual spaces are identified and maximized by our system. Second, we integrate our system into an Augmented Reality experience on the Microsoft HoloLens. This allows us to demonstrate how AR users can take advantage of our proposed system in a real-world scenario.
4.1 3D Scanned Datasets
Matterport3D [10] is a large-scale RGB-D dataset containing 90 building-scale scenes. The dataset consists of various building types with diverse architecture styles, each including numerous spatial functionalities and furniture layouts. Annotations of building elements and furniture are provided with surface reconstructions as well as 2D and 3D semantic segmentation. For our experiments, we initially exclude spaces that are not generally used for multi-user interaction (bathrooms, small corridors, stairs, closets, etc.). Furthermore, we randomly group the available rooms into groups of 2, 3, and 4. We utilize the object category labels (mpcat40) as the ground truth for our semantic labeling purposes.
We implement our framework using the Rhinoceros3D (R3D) platform and its development libraries. For each room, we convert the labeling data structure provided by the dataset to our proposed topological scene graph. This provides the system with bounding boxes for each object and the topological constraints for their potential rearrangement. Using such a structure, we are able to extract the standable and sittable spaces for each room based on our proposed methodology. Figure 3 illustrates the available standable and sittable boundaries for two sample rooms processed by our system. We define a constant sittable threshold for all sittable objects.
Next, we integrate our system with a robust Strength Pareto Evolutionary Algorithm 2 (SPEA2) [62], available through the Octopus multi-objective optimization tool in R3D. The fitness function (6) is used to maximize the mutual space for the calculated standable spaces. Our genotype comprises the transformation parameters of each room, allowing free movement and orientation to achieve maximum spatial consensus; three genes per room (two for translation, one for rotation) are therefore allocated for the search process. This process results in the shape, position, and orientation of the maximum mutual boundary of the assigned rooms. We use a population size of 100, mutation probability of 10%, mutation rate of 50%, and crossover rate of 80% for our search. As our system integrates a genetic search, we expect the solution to gradually converge toward the global optimum. Figure 4 shows how the mutual space boundary is progressively expanded as the generations of our search increase.
Expanding further, we extend our search by manipulating the scene with alternative furniture arrangements. As the objective is to achieve an increased mutual spatial boundary area with minimum effort, we calculate the total rearrangement effort based on the transformation parameters assigned to each object present in the room. However, in our current implementation, the genetic algorithm integrated into our system is not capable of handling dynamic genotype values, and therefore cannot update the topological distance values of each object's faces during the search process. Hence, to avoid transformations that result in physical conflicts between manipulated furniture, we penalize phenotypes that contain intersecting furniture within the scene. This penalty is added to the effort value, lowering the probability of such phenotypes being selected or surviving throughout the genetic generations.
The optimization can either be (i) triggered in separate attempts for each effort step, where the mutual area value is constrained by the resulting step budget, or (ii) executed in a single attempt where minimizing total effort and maximizing mutual area are both set as objective functions. In the latter, the solution at each step is defined as the one which holds the largest mutual area while remaining within that step's effort budget. Executing the optimization as a one-time event is also likely to require additional computational cost due to the added complexity of the solution space.
Figure 4 illustrates our results for a furniture manipulation optimization task applied to three sample rooms. A total of 34 objects are located in the rooms. To shorten our gene length, we do not apply rotation transformations to objects. We use a population size of 250, mutation probability of 10%, mutation rate of 50%, and crossover rate of 80% for the scene manipulation search. We visualize the standable, sittable, and mutual boundaries for each spatial expansion step. Moreover, we report the corresponding rearrangement effort for each room in the alternative furniture layout. Our results in this sample indicate the system can identify solutions that increase the maximum mutual boundary area by up to 65% over its initial state before furniture movement.
4.2 Augmented Reality Visualization
To explore the usability of our system in real-world scenarios, we deploy the resulting spatial segmentation in augmented reality using the Microsoft HoloLens, a mixed reality HMD. In this experiment, three types of rooms were defined as potential telecommunication spaces: (i) a conventional meeting room, where a large conference table is placed in the middle of the room and unused spaces are located around the table; (ii) a robotics laboratory, where working desks and equipment are mainly located around the perimeter of the room, while some larger equipment and a few tables are disorderly positioned around the central section of the lab; and (iii) a kitchen space, where surrounding appliances and cabinets are present in the scene.
After the initial scan of the surrounding environment by the user of each room, the geometrical mesh data is sent to a central server for processing. This process happens offline, as the current HoloLens hardware is incapable of performing the computations our system requires. In addition, we scan the space using a MatterPort camera and perform the semantic segmentation step using MatterPort classifications to locate the bounding boxes of all the furniture in the room. We then feed the bounding box data to our system for the mutual boundary search. The system outputs spatial coordinates for standable and sittable areas, which are automatically updated in the Unity game engine to be rendered on the HoloLens devices.
Figure 5 shows how the spatial boundary properties are visualized within the HoloLens AR experience. The red spaces indicate non-standable objects, the green spaces indicate standable boundaries, and the blue spaces indicate mutual boundaries accessible to all users. The visualized boundaries are positioned slightly above floor level, allowing users to identify the mutually accessible ground between their local surroundings and the remote participants' spatial constraints.
5 Discussions
The optimization process was able to generate a well-defined Pareto front, as seen at the bottom of Figure 4, locating both the two extreme points and numerous intermediate trade-off points representing non-dominated solutions. The bottom region of the curve is flat, indicating that for a similar amount of effort, a significant increase in mutual standable area can be achieved. The trade-off frontier thus starts with a very densely populated initial soft slope: for each modest increase in physical effort (that is, in moving furniture) there can be extensive gains in mutual shareable area, which is an interesting result. Beyond this region, the Pareto front becomes increasingly steep, signaling that the user would now have to significantly increase physical effort for modest gains in shareable area. The knee of the curve thus indicates a breaking point of diminishing returns.
Similar to the mutual space search, in smaller furniture optimization steps the algorithm seeks solutions that are highly dependent on the transformation parameters of the room itself, whereas in larger steps, we observe the algorithm correctly moving the objects to the more populated side of the room in order to increase the empty space available. In rooms where objects face the center and empty areas are initially located in the middle portion of the space, we see the objects being pushed towards the corners or outer perimeter of the room in order to increase the initial unoccupied areas.
Due to the smaller gene size, calculating the optimal mutual space without furniture manipulation executes much faster than the furniture manipulation optimization, where the complexity of the search radically increases due to the additional object transformation parameters. The speed of the optimization is also highly dependent on the transformation range of each object, meaning that objects in larger rooms have more movement options to choose from than those in small, constrained rooms. We observe an example of this effect in the latter experiment, where the smaller space (the kitchen) dominates the search process, causing the final mutual outcome between the rooms to maintain a shape very similar to the open boundaries of the smaller space. While such an effect would still provide a well-constrained problem for medium-sized rooms with multiple objects (such as the conference room), there are many possible ways of fitting the smaller space into larger rooms with open spaces (such as the robotics laboratory), resulting in an under-constrained optimization problem.
Visualizing the mutual ground within the space itself using the HoloLens shows how complex the problem can be when tackled manually. Corner spaces that are not typically used as the default social areas of a room may become the only available common ground for interaction with other rooms. The algorithm overcomes this spatial bias easily; individuals left to resolve it on their own may not do so as easily or as quickly.
However, due to the limited field of view of the HoloLens, non-physical boundaries rendered at a low visual height become difficult to follow. This proved more challenging when walking close to the non-orthogonal edges of the mutual bounding area, where an individual could easily step outside the designated region. The shareable area also included a number of voids, which resulted in an inconsistent walking path inside the standable spaces. Moreover, the accuracy of the real-time mesh reconstruction on the HoloLens played a critical role in calculating the occlusions required to render the visualized boundaries. Because the visualization was placed close to the floor with many objects above it, the system often failed to detect occluding objects, which frequently misled the user in judging whether a space was mutually accessible or not.
6 Conclusions
We introduced a novel optimization and manipulation framework to generate an optimal common virtual space for interactions that mostly involve standing and sitting. Our framework further recommends movements of surrounding furniture objects that can expand the size of the mutual space with minimal physical effort. We integrated our system with the Strength Pareto Evolutionary Algorithm for an efficient search and optimization process. The multi-criteria optimization process was able to generate a well-defined Pareto front of trade-offs between maximizing mutual space and minimizing physical effort. The Pareto front is more densely populated in some sections of the frontier than in others, clearly identifying the region of best trade-offs and the onset of diminishing returns.
Furthermore, we demonstrated how the output solutions can be visualized using a HoloLens application. Results show that the proposed framework can effectively discover optimal shareable space for multi-user virtual interaction and thus provides a better user experience than manually labeling shareable space, which would be a labor-intensive and imprecise workflow. In this context, if all participants stand within the calculated mutual spatial boundaries, the line of sight between all participants is deterministic. In addition, no remote participant will be positioned in a conflicting location for any local user, and each will comply with the spatial constraints of all other participants.
There are, of course, limitations to this work. First, furniture with a fixed position is not automatically detected by our system. We believe such a feature can be integrated as semantic segmentation methodologies improve, or the user can optionally specify whether an object is fixed. In addition, furniture weight is calculated based on standard assumptions. We envision that, with the growth of spatial computing, such metadata about the surrounding environment will be customizable by the users themselves and can be loaded for each mutual spatial search. Future work could integrate robust floorplanning representations with the current search mechanism to minimize computational cost and complexity. Lastly, usability studies could examine how to improve the visualization strategies so that participants can use the required telecommunication functionalities while staying within the mutual spatial ground.