Optimal Assistance for Object-Rearrangement Tasks in Augmented Reality
Augmented-reality (AR) glasses, which will have access to real-time, high-fidelity data regarding a user’s environment via onboard sensors as well as an ability to seamlessly display real-time information to the user, present a unique opportunity to provide users with assistance in completing quotidian tasks. Many such tasks (such as house cleaning, packing for a trip, or organizing a living space) can be characterized as object-rearrangement tasks defined by users navigating through an environment, picking up objects, and placing them in different locations. We introduce a novel framework for computing and displaying AR assistance that consists of (1) associating an optimal action sequence with the policy of an embodied agent and (2) presenting this optimal action sequence to the user as suggestion notifications in the AR system’s heads-up display. The embodied agent comprises a ‘hybrid’ between the AR system and the user, in that it has the observation space (i.e., sensor measurements) of the AR system and the action space (i.e., task-execution actions) of the user; its policy is learned by minimizing the time to complete the task. In this initial study, we assume that the AR system’s observations include a map of the environment and real-time localization of the objects and user within that map. These modeling choices allow us to formalize the problem of computing AR assistance for any object-rearrangement task as a planning problem that reduces to solving a capacitated vehicle-routing problem (CVRP), which is a variant of the classical traveling salesman problem (TSP) in combinatorial optimization. In addition, we introduce a novel AR simulator that can enable web-based evaluation of AR-like assistance and associated large-scale data collection; our approach is based on the Habitat (Savva et al., 2019) simulator for embodied artificial intelligence (AI).
Finally, we perform a study that evaluates how users respond to the AR assistance generated by the proposed framework on a specific quotidian object-rearrangement task, house cleaning, using our proposed AR simulator. We perform the study at scale using Amazon Mechanical Turk (AMT) (Amazon, ) and collect data using psiTurk (Gureckis et al., 2016). In particular, we study the effect of our proposed AR assistance on users’ task performance and sense of agency over a range of task difficulties. Our results indicate that providing users with the proposed form of AR assistance improves their overall performance. Additionally, we show that while users report a negative impact on their agency, they may prefer this trade-off when presented with a system that aids them in task completion.
1. Introduction
Compared with current personal computing devices, always-on AR devices (1) have access to a much larger volume and more diverse set of sensor data and (2) are able to display real-time information to the user in a much lower friction manner (Jonker et al., 2020). This exposes the exciting potential for such AR devices to provide users with continual, contextually relevant assistance toward achieving their personalized goals. Consequently, assistive AR systems have been used quite extensively in specialized applications such as maintenance and manufacturing (Palmarini et al., 2018; Egger and Masood, 2020), education (Ibáñez and Delgado-Kloos, 2018), tourism (Yung and Khoo-Lattimore, 2019), and surgery (Vávra et al., 2017) to name a few. However, AR devices hold the promise of providing assistance to users on a much broader and less specialized class of commonly occurring quotidian activities. For example, if advances in AI can enable AR devices to reason about the structure and state of quotidian tasks such as cooking, cleaning, or organizing, then this ability could be leveraged to lend assistance to users in performing these tasks, ideally leading to improved task performance, reduced physical and cognitive effort, and preserved sense of agency. One may thus envision such AR assistance as providing “superpowers for everyday tasks”. In this work, our goal is to develop a framework for such AR assistance applicable to an important class of everyday tasks—those involving object rearrangement—and to evaluate its value to users at scale.
Making progress towards this objective of pervasive AR assistance is challenging for myriad reasons. First, there has been limited work towards formalizing the problem of computing and displaying AR assistance that can lead to improved task performance, reduced effort, and preserved sense of agency for users; while this has been investigated for specialized tasks such as AR-assisted assembly (Yang et al., 2019; Tang et al., 2003), it has not yet been pursued for a broad class of quotidian tasks. Second, no widely available consumer AR device currently exists; as such, there is no large AR user base or associated infrastructure that can support the evaluation of AR assistance on users at scale. The field of human–robot interaction (HRI) faces a similar challenge; to overcome it, HRI researchers have leveraged web-based studies that ask users to react to videos of humans and robots interacting (Hoffman, 2019; Malle et al., 2016). This reliance on a third-person perspective lacks immersion and limits the types of interactions that can be studied; further, no analogous approach would be viable for AR assistance, as the assistance is not directly observable by third parties. Finally, how real users will respond to AR assistance in everyday tasks (in particular, how it affects their task performance, effort, and sense of agency) remains an open question (Berry et al., 2009; Albert and Tullis, 2013).
To address the first challenge in the context of object-rearrangement tasks, we formalize the problem of computing and displaying AR assistance by (1) associating an optimal action sequence with the policy of an embodied agent and (2) presenting this optimal action sequence to the user as suggestion notifications in the AR system’s heads-up display. In our formulation, the embodied agent comprises a user–AR-system ‘hybrid’ in that it has the observation space (i.e., sensor measurements) of the AR system and the action space (i.e., task-execution actions) of the user, and its policy is learned by minimizing the time to complete the task. In this initial study, we assume that the AR-system has full observability of the environment, which includes a map and real-time localization of the objects and user within that map. These modeling choices allow us to formalize the problem of computing AR assistance for any object-rearrangement task as a planning task that reduces to solving a capacitated vehicle-routing problem (CVRP) (Dantzig and Ramser, 1959; Golden et al., 2008) from combinatorial optimization. Because the optimal action sequence comprises a sequence of location visits along shortest paths, we present this action sequence by displaying the next shortest path to the user in the form of world-locked digital breadcrumbs in the heads-up display. If the user ignores the AR-system’s suggestion notifications and deviates from the optimal action sequence by visiting an alternative pickup or delivery location, we replan on the fly.
To address the second challenge, we propose a novel AR simulator that can enable large-scale web-based evaluation of AR assistance and associated data collection. The simulator is based on Habitat (Savva et al., 2019) for embodied AI and satisfies the key criteria we have in an AR simulator: (1) it supports the observations and actions of the proposed embodied-agent policy, (2) it emulates a first-person view through an AR device, including an ability to display suggestion-notifications in a heads-up display (HUD) in the form of digital objects and information, (3) it enables a user in the loop to autonomously perform the task-execution actions, and (4) it is deployable on the web at scale via integration with Amazon Mechanical Turk (AMT) and supports data collection related to task performance and sense of agency via psiTurk (Gureckis et al., 2016) integration.
To address the third challenge, we define house cleaning as a specific object-rearrangement task, implement the task and the proposed CVRP-based assistance using OR-Tools (Google, ) in the proposed AR simulator, and evaluate it at scale using AMT. We collect user data across a range of task difficulties and types of AR assistance in order to evaluate how the proposed form of AR assistance affects users’ task performance, effort, and sense of agency. We find that by following the optimal assistance, users are able to decrease their total distance traveled though this comes at a cost of feeling less in control over their own actions. This cost may be one users are willing to pay, however, as we also find that users report preferring the optimal assistance to a system that does not provide them with the optimal solution. Additionally, we find that users are not consistent in their willingness to follow assistance.
In summary, our contributions are: (1) a novel framework for computing and displaying AR assistance for object-rearrangement tasks that employs a ‘hybrid’ single agent (i.e., the user–AR-system) and CVRP formulation, (2) a novel AR simulator that can enable large-scale web-based evaluation of AR-like assistance, and (3) a large-scale web-based study that assesses how users respond to the proposed form of AR assistance in a house-cleaning task over a range of task difficulties. To the best of our knowledge, this is the first at-scale study of AR assistance for quotidian tasks. We envision our second contribution as being useful beyond the domain of AR assistance, as it can provide a framework for at-scale user-in-the-loop evaluation of different kinds of digital assistance. Future work will entail developing and assessing perception-based learned policies that assume lower fidelity of the embodied agent’s observations and considering multi-agent formulations that incorporate models of the user’s behavior.
2. Related Work
We now review work related to our three key contributions. Namely, we review existing work on AR assistance in Sec. 2.1, simulation frameworks for training and evaluating embodied-agent policies in Sec. 2.2, and the assessment of users’ response to digital assistance in Sec. 2.3.
2.1. AR assistance
To date, most work on employing AR devices to assist users has focused on displaying predefined information overlays that can be useful in completing a prescribed task; the information overlays are often spatially registered to objects or locations relevant to the task. For example, researchers have investigated approaches wherein the AR device overlays information or instructions on parts to be assembled or maintained (Thomas and David, 1992; Tang et al., 2003), retrieves and displays maintenance documentation using object recognition (Didier et al., 2005), overlays medical imaging data on patients in real time (Bajura et al., 1992), or provides location- and activity-based, world-locked information to users during indoor navigation (Mulloni et al., 2011).
Critically, none of the above approaches are truly robust for complex multi-step tasks, as they do not leverage the device’s on-board sensors to infer the current task state; as such, they are unable to provide the user with up-to-date assistance toward optimal task completion or properly adapt when users deviate from the system’s suggested steps. To fill this gap, planning-based
Instead, our proposed approach for AR assistance applicable to complex object-rearrangement tasks described in Sec. 3 leverages only the sensor measurements of the AR device, and executes a policy whose state is informed only by these sensor measurements, thus enabling up-to-date assistance toward optimal task completion. In this initial study, we assume the device has ‘full environment knowledge’ via its sensors and can support mapping and localization of the user and objects (Newcombe et al., 2020). This allows us to compute the policy using a planner based on a capacitated vehicle routing problem. Perhaps the most closely related work to ours is the planning-based cognitive assistant developed by Hu et al. (Hu et al., 2020), which employs a perception module driven by computer vision to identify task state in a sandwich-assembly task and re-plan using a finite state machine. In contrast to this work, we consider a much broader category of object-rearrangement tasks and consider planners based on a general capacitated vehicle routing problem as described in Sec. 3.2. Further, our embodied AI formulation enables straightforward extensions that go beyond the strict environment-knowledge requirements of planning to learning policies that can employ perception-based partial environment observations.
2.2. Embodied-agent simulation frameworks
The robotics and embodied AI communities have leveraged simulation frameworks to train and evaluate embodied-agent policies in lieu of expensive real-world experimentation infrastructure (Savva et al., 2019; Kolve et al., 2017; Koenig and Howard, 2004; Erickson et al., 2019). Most efforts consider fully autonomous agents (e.g. robots) whose actions directly change the physical state of the environment.
Our AR-assistance application is fundamentally different from these, in that actions that can modify the physical state of the environment belong to the human agent. The AR device (our parallel to the autonomous system) can only influence the physical state of the environment indirectly via suggestion-notification actions displayed to the human. Thus, for AR assistance, any realistic evaluation of how the embodied-agent policy affects task performance must involve a human in the loop or a high-fidelity model of the human’s behavior. Evaluating how assistance affects important qualitative aspects of the user experience (e.g., effort, sense of agency) necessitates collecting data from real users at scale. On the other hand, training of the embodied-agent policy can be done in a variety of ways that need not rely on a human in the loop. Sec. 3 describes in detail our proposed formulation for computing the policy, which requires only a planner coupled to a simulator that can compute the geodesic distances between locations and track task state; future work will consider human-in-the-loop training analogous to Refs. (Li et al., 2016; Zhang et al., 2019). Thus, for learning assistive AR policies, rather different demands are placed on the environments used for evaluation and training; the only strong compatibility requirement is that the evaluation environment must support the states and actions associated with the trained policy to deploy it in real time. Here, we focus on the evaluation environment.
Arguably the closest related work is in the field of human–robot interaction (HRI), where researchers are interested in evaluating users’ response to robotic assistance (Brooks, 2018). In this field, the two most common evaluation environments are (1) in-person studies where users are asked to respond to robotic assistance (Dragan et al., 2013), and (2) web-based studies where users are asked to view third-person videos or photos of humans and robots interacting and react accordingly (Hoffman, 2019; Malle et al., 2016). The former category suffers from lack of scalability, while the latter category lacks immersion due to its reliance on a third-person perspective.
Thus, we believe that our proposed evaluation framework, which is based on an AR simulator and is described in Sec. 4, combines the attractive attributes of each of these paradigms: it enables scalable, first-person interaction of a user with an embodied assistant. We hypothesize that it can therefore capture more realistic user experiences, and thus yield higher-quality user-experience data, at scale.
2.3. Evaluating user response to digital assistance
It is generally challenging to characterize the user experience for digital assistance owing to a plethora of factors involved and owing to the difficulty of measuring some of these factors with high fidelity (Albert and Tullis, 2013). Past studies on evaluating users’ response to digital assistance have focused on evaluating factors such as usability, utility, effectiveness, and agency (Berry et al., 2009). In our study, we focus primarily on evaluating effectiveness as measured by the user’s task performance and sense of agency. We focus on task performance as it is easy to measure and directly addresses the key value proposition of AR assistance for complex tasks. We consider agency due to its central role in human–computer interaction (HCI) research and its relatively uncharacterized role in immersive AR-like assistive scenarios.
The sense of agency refers to the feeling of being in the driver’s seat when it comes to selecting one’s actions (Moore, 2016). HCI research has long recognized the sense of agency as a key factor in characterizing how people experience interactions with technology (Limerick et al., 2014; Moore, 2016; Shneiderman and Plaisant, 2010). In particular, one of the eight classic rules of interface design places emphasis on designing interfaces that support the user’s sense of agency (Shneiderman and Plaisant, 2010). Further, the sense of agency may also influence a user’s acceptance of technology (Baronas and Louis, 1988; Le Goff et al., 2018). Despite the importance of agency, it has received limited attention in the domain of AR. It is important to address this topic early in developing novel AR technologies, as preserving the user’s sense of agency is challenging in assistive and immersive systems; for example, previous research has found a reduction in sense of agency with increasing automation (Berberian et al., 2012).
The study presented in Sec. 5 evaluates task performance and the user’s sense of agency on the proposed house-cleaning task. Akin to other studies on AR assistance that have shown improved task performance, for instance in assembly tasks (Yang et al., 2019; Tang et al., 2003), we show faster task completion times with the use of the proposed AR assistance. Further, we study the effectiveness over a range of task-difficulty and assistance-fidelity levels. Our study also suggests that while reports of user agency may be negatively affected by increased assistance in this scenario, this may be a cost participants are willing to pay in order to realize improved task performance; these are promising results, and can serve as a baseline for alternative interfaces for displaying the computed AR assistance to the user.
3. AR-Assistance Model
The goal of our AR assistance model is to formalize the problem of computing and displaying AR assistance for object-rearrangement tasks. We achieve this by adopting the perspective of embodied AI and by (1) associating an optimal action sequence with the policy of an embodied agent, and (2) presenting this optimal action sequence to the user as suggestion notifications in the AR system’s heads-up display. To particularize this setup, we must define the embodied agent and associated partially observable Markov decision process (POMDP) ingredients: the states, observations, actions, and rewards characterizing the environment, agent, and task.
3.1. Embodied AI formulation
We begin by defining the objective (and associated reward) of the task as moving each object in question from its initial position to its final desired position in as little time as possible. Minimal time to task completion is not only an intuitive choice for an objective and reward, it also draws inspiration from the long history of AI assistants for task and time management (Myers et al., 2007).
Regarding the choice of embodied agent, one could adopt a fully multiagent perspective and consider the user and AR system to be independent but cooperating agents with the shared aforementioned objective but with different observation and action spaces. In this case, one could learn the AR system’s policy and present its action sequence to the user as suggestion notifications. Unfortunately, this approach introduces significant challenges, as the user is included in the AR system’s environment and thus any simulation-based learning algorithm for the AR-system’s policy would require modeling the user’s policy acting on an observation space augmented by the AR-system’s displayed information (Buşoniu et al., 2010). Instead, we simplify this setup and consider a single-agent formulation with an agent comprising a ‘hybrid’ between the AR system and the user. Namely, it has the observation space (i.e., sensor measurements) of the AR system and the actions (i.e., physical task-execution actions) of the user; in the case of object-rearrangement tasks, the latter corresponds to navigation and object pick/place actions. In this case, learning- or planning-based approaches for computing the embodied agent’s policy require modeling only the physical dynamics of object rearrangement.
As mentioned above, the embodied-agent’s observations correspond to those of the AR system. While many current AR head-mounted devices are equipped with only RGB video, future devices are likely to be equipped with sensors that provide much higher-fidelity observations. Indeed, it is likely that, at least for familiar environments, the AR device will have access to a complete map and will be able to perform localization of the user and objects (Newcombe et al., 2020). As such, for this initial study, we assume that the AR device has complete observations that include a map of the environment, the current position of all objects in question, and the position of the user. Thus, the observations and state coincide in this case, and the POMDP reduces to an MDP with deterministic transition dynamics, enabling the use of a deterministic planner as described in Sec. 3.2.
Finally, we assume that the user can carry only two objects at once, the AR device has knowledge of the desired final location of all objects in question (i.e., the ‘goal state’ of the environment), and the AR device can calculate the shortest path between any two points on the map. These assumptions allow us to compute the policy of the embodied agent by a planner that solves a capacitated vehicle routing problem (CVRP), which we describe in Sec. 3.2 below. We remark that future work will relax the above assumptions and consider multiagent formulations, partial observations, and model-free learning of embodied-agent policies.
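The shortest-path assumption above can be illustrated with a small sketch. It assumes, purely for illustration, that the map is a 4-connected occupancy grid and uses breadth-first search; the grid encoding and the name `geodesic_distance` are hypothetical and are not how Habitat computes geodesic distances.

```python
from collections import deque


def geodesic_distance(grid, start, goal):
    """Length of the shortest 4-connected path between two free cells.

    grid  : list of strings, '#' is an obstacle and '.' is free space
            (a hypothetical map encoding chosen for this sketch)
    start : (row, col) of the first location
    goal  : (row, col) of the second location
    Returns the number of steps, or None if the goal is unreachable.
    """
    rows, cols = len(grid), len(grid[0])
    queue = deque([(start, 0)])
    visited = {start}
    while queue:
        (r, c), dist = queue.popleft()
        if (r, c) == goal:
            return dist
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] != "#" and (nr, nc) not in visited:
                visited.add((nr, nc))
                queue.append(((nr, nc), dist + 1))
    return None


# A toy floor plan with a two-cell wall in the middle row.
floor_plan = [
    "....",
    ".##.",
    "....",
]
```

Evaluating this function for every pair of locations of interest would give the pairwise transportation costs that the planner in Sec. 3.2 consumes.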
3.2. Object-rearrangement planner: capacitated vehicle routing problem
We now formulate the (single-vehicle) CVRP (Dantzig and Ramser, 1959; Golden et al., 2008) for object-rearrangement tasks. We assume that we are rearranging $n$ objects such that each object has an initial location and final (desired) location, thus yielding $2n+1$ total locations of interest (including the initial position of the user). Given any arbitrary enumeration of these locations with the zeroth location corresponding to the user’s initial position (i.e., the depot), we decompose these locations into pickup locations $P$ associated with the objects’ current locations and dropoff locations $D$ associated with the final locations such that $P \cup D = \{1, \ldots, 2n\}$, $P \cap D = \emptyset$, and $|P| = |D| = n$. We assume that transportation costs can be calculated from an operator $d : \{0, \ldots, 2n\} \times \{0, \ldots, 2n\} \to \mathbb{R}_{\geq 0}$ that calculates the geodesic distance between any two locations and satisfies $d(i, i) = 0$, $i = 0, \ldots, 2n$. If any two pickup or dropoff locations $i$ and $j$ coincide, we treat them as separate locations with zero separating distance such that $d(i, j) = 0$. This assumes the user’s time to complete the task is proportional to their total path length. We assume the user has a capacity constraint of $C$ (which we set to be $C = 2$ in the experiments, which assumes a user can carry one object per hand). We introduce a delivery operator $\delta : P \to D$ that maps each pickup location to its corresponding dropoff location. We associate any solution to the problem with an invertible operator $\sigma : \{0, \ldots, 2n\} \to \{0, \ldots, 2n\}$ that maps the step number to location index; note that invertibility of this operator assumes that each location is visited exactly once and a location visit is associated with the pickup or dropoff of the appropriate object. Assuming the user has already executed $k$ steps of the task by visiting locations $\bar\sigma(i)$ with $i = 0, \ldots, k$ and the task initialized at $\bar\sigma(0) = 0$, the planner defines the optimal sequence of location visits $\sigma^\star$ as the solution to the combinatorial optimization problem
$$
\begin{aligned}
\underset{\sigma}{\text{minimize}} \quad & \sum_{i=k}^{2n-1} d(\sigma(i), \sigma(i+1)) \\
\text{subject to} \quad & \sum_{j=0}^{i} \left[ \mathbb{1}_P(\sigma(j)) - \mathbb{1}_D(\sigma(j)) \right] \leq C, \quad i = 0, \ldots, 2n, \\
& \sigma^{-1}(p) < \sigma^{-1}(\delta(p)), \quad p \in P, \\
& \sigma(i) = \bar\sigma(i), \quad i = 0, \ldots, k,
\end{aligned}
\tag{1}
$$
where $\mathbb{1}_S$ is the indicator function that evaluates to 1 if its argument is in the set $S$ and evaluates to zero otherwise.
The objective function in problem (1) corresponds to the transportation costs (i.e., total distance traveled); the first set of constraints corresponds to the capacity constraints; the second set of constraints ensures that each object’s pickup location is visited before its dropoff location; the third set of constraints enforces that the solution is computed from the current state of the task at step $k$. Given the solution $\sigma^\star$ to problem (1), at step $k$ of the task, the AR device provides assistance by displaying to the user the shortest path between locations $\sigma^\star(k)$ and $\sigma^\star(k+1)$ in the form of world-locked digital breadcrumbs in the heads-up display. If the user ‘violates’ assistance and instead visits a feasible alternative location $j \neq \sigma^\star(k+1)$, then the system replans by solving problem (1) with $\bar\sigma(k+1) = j$ and $k \leftarrow k + 1$ and resumes. We emphasize that replanning is essential to ensure robustness to realistic user behavior in complex multi-step object-rearrangement tasks. In practice, we solve planning problem (1) using OR-Tools (Google, ).
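To make the structure of problem (1) and the replanning step concrete, the following is a minimal sketch that solves a tiny instance by exhaustive search. It is a stand-in for illustration only and is not the OR-Tools solver used in the study; the function name `plan`, its arguments, and the toy instance on a line are all assumptions of this sketch.

```python
from itertools import permutations


def plan(dist, delivery, visited=(0,), capacity=2):
    """Cheapest feasible completion of a partially executed visit sequence.

    dist[i][j]  : travel cost between locations i and j (0 is the user's start)
    delivery[p] : dropoff location paired with pickup location p
    visited     : locations already visited, in order (fixed prefix)
    Returns (remaining_cost, full_order).
    """
    remaining = [loc for loc in range(len(dist)) if loc not in visited]
    pickup_of = {d: p for p, d in delivery.items()}
    best = (float("inf"), None)
    for tail in permutations(remaining):
        order = tuple(visited) + tail
        load, carried, feasible = 0, set(), True
        for loc in order[1:]:
            if loc in delivery:                  # pickup: respect the capacity
                load += 1
                carried.add(loc)
                feasible = load <= capacity
            else:                                # dropoff: object must be carried
                feasible = pickup_of[loc] in carried
                carried.discard(pickup_of[loc])
                load -= 1
            if not feasible:
                break
        if not feasible:
            continue
        # Only travel after the last visited location counts toward the cost.
        first = len(visited) - 1
        cost = sum(dist[order[i]][order[i + 1]] for i in range(first, len(order) - 1))
        if cost < best[0]:
            best = (cost, order)
    return best


# Toy instance: locations on a line, cost = absolute difference of coordinates.
coords = [0, 1, 2, 5, 3, 4, 6]          # index 0 is the user's start
dist = [[abs(a - b) for b in coords] for a in coords]
delivery = {1: 4, 2: 5, 3: 6}           # pickup -> dropoff

cost, order = plan(dist, delivery)
# The user ignores the suggestion and visits location 2 first: replan from there.
replanned_cost, replanned_order = plan(dist, delivery, visited=(0, 2))
```

Exhaustive search scales factorially with the number of locations, so it is only usable on toy instances; that is one reason a dedicated routing solver is the practical choice for the task sizes in the study.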
4. AR simulator and deployment at-scale
To evaluate how the proposed AR-assistance model described in Section 3 affects the user experience, an AR simulator must (1) support the observations and actions of the embodied-agent policy, (2) emulate the first-person view through an AR device, including support of the display of suggestion-notifications in the form of digital objects and information in a heads-up display (HUD), (3) enable a user in the loop to autonomously perform the task-execution actions, and (4) be deployable on the web at scale and support data collection related to task performance and sense of agency.
We make several modifications to Habitat to generate our AR simulator. First, we implement a virtual HUD to mimic the first-person view through an AR device (see Fig. 2(a)); the HUD supports the display of information relevant to the proposed AR assistance, as well as the ability to display world-locked digital objects in the environment. A message bar on the top of the HUD alerts users of important interactions, including when the user places an item successfully, or when a user attempts an infeasible action (e.g., attempting to pick/place an item when they are at an excessive distance from a pickup or dropoff location). Second, we introduce a ‘virtual knapsack’ of capacity two that represents the user’s current inventory; we display the current contents of this knapsack as the user’s inventory in the virtual HUD. Third, we introduce a rudimentary object pick/place action that either (1) picks up an object and places it in the virtual knapsack if the user is within a specified distance of a pickup location and the knapsack is not full, or (2) places an object in inventory in its dropoff location if the user is within a specified distance of the dropoff location and the associated object is in inventory. Fourth, we integrate Habitat with the OR-Tools planner to support real-time replanning as described in Section 3.2.
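The knapsack and pick/place rules described above can be sketched as a small state machine. This is an illustrative sketch only, not the simulator's implementation: the class name, the interaction radius `reach`, the message strings, and the choice to try placing before picking are all assumptions.

```python
import math
from dataclasses import dataclass, field


@dataclass
class RearrangementState:
    pickups: dict                    # object name -> (x, y) current location
    dropoffs: dict                   # object name -> (x, y) location of its bin
    capacity: int = 2                # knapsack capacity from the text
    reach: float = 1.0               # hypothetical interaction radius
    knapsack: list = field(default_factory=list)
    placed: list = field(default_factory=list)

    def pick_or_place(self, user_xy):
        """Apply the single pick/place action; return a message for the HUD bar."""
        # Assumption: placing a carried object takes priority over picking.
        for name in self.knapsack:
            if math.dist(user_xy, self.dropoffs[name]) <= self.reach:
                self.knapsack.remove(name)
                self.placed.append(name)
                return f"placed {name}"
        for name, xy in self.pickups.items():
            if math.dist(user_xy, xy) <= self.reach:
                if len(self.knapsack) >= self.capacity:
                    return "knapsack full"
                del self.pickups[name]
                self.knapsack.append(name)
                return f"picked up {name}"
        return "nothing within reach"


state = RearrangementState(
    pickups={"sock": (0.0, 0.0), "book": (0.0, 0.5), "cup": (0.2, 0.0)},
    dropoffs={"sock": (5.0, 0.0), "book": (5.0, 0.0), "cup": (9.0, 9.0)},
)
messages = [state.pick_or_place((0.0, 0.0)) for _ in range(3)]
messages += [state.pick_or_place((5.0, 0.0)) for _ in range(3)]
```

In this toy run the third action is rejected because the knapsack already holds two objects, which corresponds to the infeasible-action alert in the message bar.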
5. Study setup
The overarching goal of our study is to evaluate how users respond to the AR assistance generated by the proposed framework on a specific quotidian object-rearrangement task—house cleaning—using our proposed AR simulator. Sec. 5.1 describes the house-cleaning task. Sec. 5.2 describes the two key variables that we varied during the study. Sec. 5.3 describes the metrics that we employed to evaluate the user experience. Sec. 5.4 provides an overview of the Human Intelligence Task (HIT) characterizing our study.
5.1. House-cleaning task
The main house-cleaning task is explained to participants through the following prompt:
In this HIT, you are a guest at a short-term rental. You have been staying at the house for the past several days and checkout time is fast approaching. You must clean up this house according to the host’s instructions in as little time as possible. In order to avoid a late fee, you must navigate through the house to pick up the misplaced items and place them in the appropriate bins before checkout. For instance, you will be asked to place socks in a laundry hamper, books in a bookshelf, dishes in a dish bin, etc… You will be performing the task in a 3D virtual environment using keyboard controls, described later.
The participant is then placed in a virtual environment within the AR simulator described in Sec. 4 and must complete the task using the keyboard navigation and pick/place actions; they can leverage any AR assistance we provide in the HUD. We consider six semantic object categories and employ a specific bin for each category: dishes (dish bin), toys (toy box), books (bookshelf), laundry (laundry hamper), office supplies (office-supply box), recycling (recycling bin). Fig. 1(a) illustrates some of the objects and bins used in the study. We note that this formulation leads to a capacitated vehicle routing problem as described in Sec. 3.2, where $n$ denotes the number of objects to be cleaned up, and delivery locations are repeated when more than one object is associated with the same bin. Appendix C describes how the bins, objects, and starting location are determined for a given experiment.
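As an illustration of how repeated delivery locations arise, the sketch below builds the location list and pickup-to-dropoff map for a toy house-cleaning instance. The function name `build_instance`, the tuple layout, and the coordinates are hypothetical; the actual placement procedure is the one described in Appendix C.

```python
def build_instance(user_start, objects, bins):
    """Enumerate locations for a house-cleaning instance.

    user_start : (x, y) initial position of the user (location 0)
    objects    : list of (name, category, (x, y)) for each misplaced object
    bins       : category -> (x, y) position of the bin for that category
    Returns (positions, delivery): positions[i] is the position of location i,
    and delivery maps each pickup index to its dropoff index.
    """
    n = len(objects)
    positions = [user_start] + [None] * (2 * n)
    delivery = {}
    for k, (_name, category, position) in enumerate(objects, start=1):
        positions[k] = position             # pickup location of object k
        positions[n + k] = bins[category]   # dropoff; shared bins repeat a position
        delivery[k] = n + k
    return positions, delivery


bins = {"books": (4.0, 1.0), "laundry": (0.0, 3.0)}
objects = [
    ("novel", "books", (1.0, 1.0)),
    ("atlas", "books", (2.0, 5.0)),
    ("sock", "laundry", (3.0, 2.0)),
]
positions, delivery = build_instance((0.0, 0.0), objects, bins)
```

Here the two books produce two distinct dropoff indices that share the bookshelf's position, so the distance between them is zero while each is still visited exactly once.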
5.2. Study conditions: assistance fidelity and task difficulty
We vary two key variables across experiments: assistance fidelity, which represents how much assistance the AR simulator provides to the user toward efficient task completion; and task difficulty as measured by the number of objects that must be cleaned up. We hypothesize that these two variables will be the key drivers of the user experience and task performance. We now describe in more detail how we control these variables.
We consider three different levels of assistance fidelity: no assistance, object-highlighting assistance, and optimal assistance. Each assistance level is characterized by both a world-locked digital-object component and a text-information component; see Fig. 2. In all conditions, text-information assistance includes a list of bins including their picture and semantic location.
No assistance (None). Participants receive no assistance from the system in this condition, which serves as our control. The egocentric frame contains only the scene, rendered objects and rendered bins; there are no additional visual cues. The text-information assistance provides a (randomized) list of objects the participant must reorganize in order to complete the task. Each item in this list contains the name of the object, a picture of the object, and the bin in which it should be placed. Once a participant picks up an item, the text corresponding to the selected item is crossed out and moved to the bottom of the list.
Object-highlighting assistance. This form of assistance is designed to provide assistance to the participant under the assumption that the AR device knows the location of the objects and bins salient to the house-cleaning task and can highlight them. Such assistance—which does not rely on knowledge of traversable paths in the environment nor a real-time planner—would be especially helpful in situations where certain items are obstructed from view or are difficult to spot. This form of assistance enables participants to understand the rough locations of all objects at once and form a plan themselves. We implement this form of assistance by placing a digital flagpole over each object as depicted in Fig. 2; the corresponding text-information assistance includes all of the information from the No-assistance condition but with the addition of the name of the room in which the object can be found. Again, this list is randomized per participant and list items are crossed off as they are completed.
Optimal assistance. Optimal assistance corresponds to the form of AR assistance proposed in Sec. 3. A solution to this problem for the lowest difficulty setting is shown in Fig. 3. To display this information to the participant, we display the next segment of the optimal path in the egocentric frame as a trail of digital breadcrumbs, which we render as red spheres. After the user executes a feasible pick/place action, we display the next segment of the optimal path (which may involve replanning as described in Sec. 3). To prevent the participant from losing their orientation with respect to the start and end positions of the optimal path segment, the vertical coordinate of the path’s breadcrumbs starts at the participant’s chest level and ends at the floor level (see Fig. 2). In contrast to the other assistance conditions, the text-information assistance is now ordered according to the optimal path: each step lists the action the participant should take, the object they should perform it on, a picture of this object, and the object’s location. Additionally, each step is numbered in order to emphasize the importance of the list’s order. As before, items are crossed off the list as they are completed.
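The descending breadcrumb trail can be illustrated with a minimal sketch. It assumes that the height falls linearly with waypoint index (the interpolation scheme is not stated in the text), and the function name, the default heights, and the `(x, y, height)` tuple layout are all hypothetical choices made for illustration.

```python
from typing import List, Tuple

Point2D = Tuple[float, float]          # ground-plane position of a waypoint
Point3D = Tuple[float, float, float]   # ground-plane position plus height


def breadcrumb_positions(
    segment: List[Point2D],
    chest_height: float = 1.2,  # assumed value, in scene units
    floor_height: float = 0.0,  # assumed value, in scene units
) -> List[Point3D]:
    """Attach a height to each waypoint, descending linearly along the segment."""
    n = len(segment)
    if n == 0:
        return []
    if n == 1:
        return [(segment[0][0], segment[0][1], chest_height)]
    crumbs = []
    for i, (x, y) in enumerate(segment):
        t = i / (n - 1)  # 0 at the participant, 1 at the segment's end
        height = chest_height + t * (floor_height - chest_height)
        crumbs.append((x, y, height))
    return crumbs


crumbs = breadcrumb_positions([(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)])
```

Interpolating by arc length instead of by waypoint index would give a steadier slope when waypoints are unevenly spaced; either choice keeps the start of the trail near eye level and its end on the floor.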
We consider four levels of task difficulty, where we define difficulty in terms of the number of misplaced objects that must be cleaned up. To control for certain objects being easier to reorganize (due to visual salience or bin location), the number of each object type remained fixed for each difficulty setting. A single difficulty level can thus be defined by the ratio of the number of objects to the number of bins. We use four task-difficulty levels, where this ratio corresponds to 1:1 (6 total objects), 2:1 (12 total objects), 3:1 (18 total objects), and 4:1 (24 total objects), respectively. Appendix C describes our approach for generating the bin, object, and initial user locations for any of these conditions. Fig. 3 shows the bin locations as squares in each difficulty setting, with each color representing a specific semantic object category.
To determine the effect of assistance fidelity and task difficulty on participants’ task performance and experience, we collect an array of objective and subjective participant-response data.
We employ four metrics to evaluate task performance: (1) Normalized Deviations: the number of deviations from the optimal location ordering (accounting for replanning) normalized by the total number of possible deviations; (2) Inverse Path Length (IPL): the ratio of the minimal possible path length for the sequence of location visits taken by the participant to the length of the path the participant actually traveled; (3) the total distance traveled; and (4) the total task-completion time.
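As a rough illustration, the first two metrics could be computed along the following lines. This is a sketch under stated assumptions, not the study's implementation: it assumes that a deviation is a step at which the participant's chosen location differs from the planner's recommendation at that step (after any replanning), that the number of possible deviations equals the number of steps, and that IPL divides the minimal path length by the length actually traveled. All function names are hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def path_length(points: Sequence[Point]) -> float:
    """Sum of straight-line distances between consecutive points."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))


def normalized_deviations(recommended: Sequence[str], chosen: Sequence[str]) -> float:
    """Fraction of steps at which the participant ignored the recommendation.

    recommended[i] is the location suggested at step i (after replanning);
    chosen[i] is the location the participant actually visited at step i.
    """
    if len(recommended) != len(chosen):
        raise ValueError("both sequences must cover the same steps")
    if not chosen:
        return 0.0
    deviations = sum(r != c for r, c in zip(recommended, chosen))
    return deviations / len(chosen)


def inverse_path_length(minimal_length: float, traveled_length: float) -> float:
    """Ratio of the shortest possible length to the traveled length (1.0 is best)."""
    if traveled_length <= 0:
        raise ValueError("traveled length must be positive")
    return minimal_length / traveled_length


# Made-up example values, not study data.
nd = normalized_deviations(["mug", "book", "sock", "pen"], ["mug", "sock", "sock", "pen"])
ipl = inverse_path_length(8.0, path_length([(0.0, 0.0), (3.0, 4.0), (3.0, 14.0)]))
```

In this toy example the participant deviates at one of four steps and walks 15 units where 8 would have sufficed.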
We employ subjective metrics that are measured using a five-point Likert scale and that focus on the following two categories:
Agency, which is defined as the feeling of being in control (Moore, 2016).
“I am in charge of deciding what step I complete next during the house cleaning task” (Control what to do)
“I am responsible for the speed at which I completed the task” (Control of speed)
“I feel that I need to follow the suggestions given to me by the system” (Need to follow)
“I prefer that the system show me what to do next rather than figure it out myself during the house cleaning task” (Prefer to show)
Utility and Usability, which is aimed at measuring usefulness, user-friendliness, and acceptability; it is inspired by the System Usability Scale (SUS) (Bangor et al., 2008).
“The assistance provided to me by the system during the house cleaning task helped me complete the task faster than if I had used the help provided to me during the training task” (Usable)
“I found the help given to me by the system to be useful” (Useful)
5.4. HIT overview
The HIT for the study consists of four phases: (1) task setup, (2) house familiarization and navigation-controls training, (3) cleaning-task execution, and (4) survey. In the first phase, the participant is presented with the house-cleaning-scenario prompt described in Sec. 5.1. In the second phase, the participant is familiarized with the 3D layout of the short-term rental house using a pre-recorded fly-through video that displays information regarding room names and bin locations. To acquaint the participant with the keyboard controls, they are given a simple training task of finding and picking up a single object and placing it in its appropriate bin without any imposed time limit or incentive. The third phase corresponds to executing the main cleaning task described in Sec. 5.1 with specified task-difficulty and assistance-fidelity levels. In the final phase, the participant is given a survey that collects demographic information (e.g., age, gender), responses to the subjective questions described in Sec. 5.3.2, and a free-form response to capture anything else relevant to their experience. Only the task in the third phase of the HIT varies across participants, depending on the study conditions described in Sec. 5.2; all other phases are identical across participants.
6. User study
We conducted a between-subjects user study ( participants, male, female, other, no answer; three assistance-fidelity levels; four task-difficulty levels) on AMT using our AR simulator setup described in Sec. 4. Participants were compensated for completing the 25-minute-long study. Appendix A reports details on the number of participants in each condition. We first present and evaluate the hypotheses related to assistance fidelity in the easiest task-difficulty condition using the objective and subjective metrics defined in Sec. 5. We then show the interaction effects between assistance and task difficulty.
6.1. Effects of varying assistance fidelity
We now conduct a study that varies the assistance fidelity and fixes task difficulty to the easiest level (i.e., 6 total objects).
With regard to varying the assistance fidelity, we had the following hypotheses:
H1. Participants will follow optimal assistance when presented with it.
H2. Participants will have higher task performance when presented with optimal assistance.
H3. Participant agency will be unaffected by assistance fidelity.
H4. Participants will perceive the AR system equipped with optimal assistance as more usable and useful than the AR system equipped with no assistance or object-highlighting assistance.
Objective metrics. In the case of optimal assistance, we observe that participants are more likely to pick and place items in the optimal order than they are in either the no-assistance or object-highlighting assistance conditions, as measured by normalized deviations (Fig. 6). This suggests that people may not be able to independently compute the optimal ordering of location visits for object-rearrangement tasks based on only a first-person perspective, even with object highlighting. Further, in the case of optimal assistance, participants follow the shortest-path trajectory between points more closely than they do in either the no-assistance or object-highlighting assistance conditions (see IPL in Fig. 6). This suggests that when participants freely navigate within an environment, they do not naturally take the shortest paths to get from point to point. Taken together, these two findings support our hypothesis H1 that participants will follow optimal assistance when presented with it.
Even though participants tend to follow optimal assistance, this does not necessarily translate to improvements in all performance metrics. We measure task performance in two additional ways: total distance traveled and total task completion time. We observe that participants presented with object-highlighting and optimal assistance generated significantly shorter total paths than those generated in the no-assistance case, but the average path length traveled with these two forms of assistance was not substantially different (Fig. 6). This indicates that it is possible to get users to follow shorter paths than those they might find on their own, but that simply highlighting objects may be sufficient for decreasing user path distance. Interestingly, even though participants found shorter paths in the optimal and object-highlighting conditions, there was no significant difference in the total task completion time between the three conditions. This further indicates that even though we are able to shorten path length, this may come at some cost to the speed at which a user completes the task (potentially in the interpretation of the interface). Ultimately, navigating this trade-off is likely user or task specific, and can likely be made more favorable with alternative and personalized interface design. Taken together, these two findings partially confirm our hypothesis H2 that participants will have higher task performance when presented with optimal assistance.
Subjective metrics. Agency. We found that participants generally felt in control of what they should do next and how quickly they completed the task. Even so, participants provided with optimal assistance, while still generally agreeing with feeling in control, reported that they felt less in control than participants in the other assistance conditions. Participants provided with no assistance and object-highlighting assistance were neutral about their feelings of needing to follow the assistance; participants provided with optimal assistance reported that they did feel the need to follow the assistance. Finally, participants provided with no assistance and object-highlighting assistance disagreed that they would prefer to have the system show them what to do next and where to go next. Since they were not exposed to a condition where they were provided this information, they are likely rating this against their idea of what such a system might look like. Participants who actually were exposed to optimal assistance agreed that they preferred this to not having this assistive information. So, even though participants in the optimal-assistance condition felt less in control overall, they seemed to prefer this to the alternative. Overall, this does not support our hypothesis H3 that participants’ sense of agency would remain unaffected by assistance fidelity; however, it seems that despite feeling a slight loss in their sense of agency, they may actually prefer this sacrifice in order to obtain useful information for optimal task completion.
Usability and Utility. We found that participants generally felt that they were faster with assistance than without assistance. Participants provided with both optimal assistance and object-highlighting assistance believed that they completed the task faster with the additional information than they would have without it (i.e., with no assistance). While participants in all assistance-fidelity conditions perceived the assistance to be useful, those provided with optimal assistance perceived it to be more useful than in either the object-highlighting or no assistance conditions. These results are especially interesting given that no significant difference was found in total task completion time between assistance-fidelity levels. Overall, these findings partially support our hypothesis H4 that participants will be more likely to perceive the optimal assistance to be usable and useful than other assistance types.
6.2. Interactions between assistance and task difficulty
We had the following hypothesis:
H5. As the task difficulty increases, participants will be more willing to accept the assistance.
Across the board, we observed that participants deviated more from the optimal ordering as task difficulty increased (Fig. 8). In the cases of no assistance and object-highlighting assistance, this trend conforms to our expectation that people have difficulty finding the optimal solution in object-rearrangement tasks. We still observe this trend, however, when people are explicitly given the optimal ordering as in the optimal-assistance condition. Interestingly, though, the variance in the optimal-assistance condition is consistently much greater than that in either the no-assistance or object-highlighting assistance cases. This suggests that the underlying distribution of assistance acceptance is likely multi-modal, and future work can investigate mechanisms to increase acceptance among various sub-populations of users.
We also observed that participants were more likely to take shorter paths in the optimal-assistance condition than in the other two conditions, and that participants provided with object-highlighting assistance were more likely to take shorter paths than those provided with no assistance (Fig. 8). However, there was no change in IPL as the task-difficulty level increased. Taken together, this refutes our hypothesis H5 that users would be more likely to accept assistance as the task became harder. In fact, the data suggest that they are either less likely or just as likely to accept the assistance as the task difficulty increases.
This work has presented (1) a novel framework for computing and displaying AR assistance for object-rearrangement tasks that characterize a broad category of quotidian tasks, (2) a novel AR simulator that can enable web-based evaluation of AR-like assistance and large-scale data collection, and (3) a study that assesses how users respond to the proposed AR assistance in the AR simulator on a specific object-rearrangement task: house cleaning.
The study illustrated several salient trends. First, by following the optimal assistance, participants were able to reduce the overall distance they travelled, suggesting that people do not immediately solve this problem optimally and could benefit from a system like the one we propose. Second, though participants reported feeling less in control over their own actions when following the optimal assistance, they were also more likely than users in either of the other groups to agree that they preferred a system that told them what to do. This indicates that users may be willing to sacrifice a small amount of agency in favor of a system that provides useful assistance. Finally, users were less likely or equally likely to accept the assistance as the difficulty of the task increased, though the population of users in the optimal-assistance condition exhibited much wider variability in assistance acceptance. This indicates that there are potential subgroups in the user population, and that future work should be conducted to discover these groups and develop personalized assistance systems.
Future work will explore extensions of the current framework and study, including developing and assessing perception-based learned policies that assume lower fidelity of the embodied agent’s observations, considering multi-agent formulations that incorporate models of the user’s behavior, and extending the current assistance framework and AR simulator to other quotidian object-rearrangement tasks and assessing the user experience in those settings.
Acknowledgements. The authors would like to thank Gideon Stocek and Blaise Ritchie for setting up the back-end servers, front-end development and design of the study logic and flow, integration with OR-Tools for online optimal path calculations, and integration with psiTurk for deployment on AMT. The authors would also like to thank Yan Xu and Mei Gao for help with the UX design decisions, survey design, and interpretation of early results; Joshua Walton for feedback on the design of the assistance and HUD; Hrvoje Benko and Tanya Jonker for early feedback on study design; Michael Shvartsman for the pointer to psiTurk; and James Hillis for forming the cross-functional connection with the Habitat team. Lastly, the authors are grateful to Amanpreet Singh, Mandeep Baines, Oleksandr Maksymets, Alexander Clegg, and Dhruv Batra from the Habitat team for their support in understanding and debugging various features related to Habitat for our setup.
Appendix A Participants statistics in our study
Table 1 shows the number of participants in each of our twelve conditions. Participants were assigned uniformly at random to one of these conditions when they agreed to participate in the study.
Table 1: Number of participants in each condition. Columns: 6 objects, 12 objects, 18 objects, 24 objects.
Appendix B Web application implementation details
The setup works as follows: we serve a Human Intelligence Task (HIT) to AMT using a psiTurk server, which also allows us to advertise our HIT through the AMT web portal. Through this portal, participants can view a study description, compensation details, and the estimated completion time. After accepting a HIT, participants consent to study participation, which initiates serving the approximately 8 GB Habitat WebGL application to the user’s web browser using a combination of a psiTurk server and an NGINX server (Reese, 2008). The majority of the application is loaded directly onto the participant’s computer, with the exception of the OR-Tools replanning module. When the participant’s actual path disagrees with the computed optimal path, the participant’s web browser asks the psiTurk server to recalculate the optimal path and send it back to the client. After the participant completes the survey, the data collected during the experiment (e.g., keyboard actions, time spent completing each phase, survey responses) are transmitted back to the server, where they are stored in a MySQL database for later use. Finally, we employ psiTurk to approve and disburse payment through AMT.
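The client-side decision of when to contact the server can be sketched as follows. This is only an illustration of the control flow described above: the `Action` representation, the function name, and the `request_replan` callable (a stand-in for the round trip to the server-side OR-Tools module) are hypothetical and are not taken from the study's code.

```python
from typing import Callable, List, Tuple

Action = Tuple[str, str]  # (verb, target), e.g. ("pick", "mug")


def remaining_plan(
    plan: List[Action],
    executed: Action,
    request_replan: Callable[[Action], List[Action]],
) -> List[Action]:
    """Return the plan to display after the participant executes one action.

    If the action matches the head of the current plan, the client simply
    advances locally; otherwise it asks the server for a new plan.
    """
    if plan and plan[0] == executed:
        return plan[1:]
    return request_replan(executed)


# Toy plan and a fake server response, for illustration only.
plan = [("pick", "mug"), ("place", "dish bin"), ("pick", "book"), ("place", "book bin")]


def fake_server(executed: Action) -> List[Action]:
    return [("place", "book bin"), ("pick", "mug"), ("place", "dish bin")]


followed = remaining_plan(plan, ("pick", "mug"), fake_server)
deviated = remaining_plan(plan, ("pick", "book"), fake_server)
```

Keeping the agreement check on the client means that participants who follow the suggestions never wait on a network round trip.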
Appendix C Disorganized house generation
We now describe how we generated the disorganized house for the study described in Section 5.1.
C.1. Bin placement
Bins were placed manually within the environment in semantically reasonable locations. For example, the dish bin was placed on top of the counter next to the sink and the office supply box was placed on top of the desk in the office. Receptacle locations were kept fixed in each difficulty setting.
C.2. Object placement
To sample object locations within the scene, we randomly sampled a total of 40 navigable scene points using the Habitat simulator. We used rejection sampling to ensure that each sampled point was at least one unit of distance away from every other sampled point and from every receptacle location within the scene. We then used the first N points from this list to define the object locations in the scene. For example, the lowest difficulty setting (6 objects) contained the first 6 points from this list, and the highest difficulty setting (24 objects) contained the first 24 points. This way, each scene built on top of the previous scene in order to control for any single scene having an outlying dispersion between points. Each point was also assigned a semantic object category, which was kept consistent across difficulty settings. The actual object model used for any individual point was held constant within a difficulty setting, but was randomly sampled across difficulty settings. Thus, if a sampled point was assigned to the books and magazines category in the lowest difficulty setting, it would be assigned to the books and magazines category in the highest difficulty setting as well, but it might not be exactly the same book. Object locations for each difficulty setting are shown as circles in Fig. 3.
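The sampling procedure can be sketched in a few lines. The sketch below is an assumption-laden illustration: `sample_point` is a hypothetical stand-in for the simulator's navigable-point sampler (here replaced by uniform sampling in a 20 x 20 square), the bin coordinates are made up, and points are 2D for simplicity.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float]


def sample_spread_points(
    sample_point: Callable[[], Point],
    fixed_points: Sequence[Point],
    count: int = 40,
    min_dist: float = 1.0,
    max_tries: int = 100_000,
) -> List[Point]:
    """Rejection-sample `count` points that keep `min_dist` from each other
    and from every point in `fixed_points` (e.g. the bin locations)."""
    accepted: List[Point] = []
    for _ in range(max_tries):
        if len(accepted) == count:
            break
        candidate = sample_point()
        others = list(fixed_points) + accepted
        if all(math.dist(candidate, p) >= min_dist for p in others):
            accepted.append(candidate)
    if len(accepted) < count:
        raise RuntimeError("could not place all points; relax min_dist or enlarge the area")
    return accepted


rng = random.Random(0)  # fixed seed so the layout is reproducible
bins = [(2.0, 2.0), (8.0, 8.0)]  # made-up bin locations
points = sample_spread_points(lambda: (rng.uniform(0, 20), rng.uniform(0, 20)), bins)

# Each difficulty level reuses a prefix of the same list, so harder scenes
# are supersets of easier ones.
scenes = {n: points[:n] for n in (6, 12, 18, 24)}
```

Because every scene is a prefix of one shared list, the 12-object scene contains exactly the 6-object scene plus six more locations, which is what keeps dispersion comparable across difficulty levels.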
C.3. Starting location
The starting position of each participant was held constant for any individual scene. This position was calculated by first finding the centroid of all of the objects placed in the scene. We then used the Habitat simulator to check whether this point was navigable. If the point was not navigable, rejection sampling was used to find a navigable point within a small radius surrounding the centroid. This method ensured that the participant’s starting location did not bias their solution. Additionally, this method allowed us to generate our optimal assistance, discussed in Sec. 3.2.
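A minimal sketch of this start-position rule is given below. The `is_navigable` callable is a hypothetical stand-in for the simulator's navigability query, and the radius and retry limit are assumed values rather than the ones used in the study.

```python
import math
import random
from typing import Callable, Sequence, Tuple

Point = Tuple[float, float]


def choose_start(
    objects: Sequence[Point],
    is_navigable: Callable[[Point], bool],
    rng: random.Random,
    radius: float = 0.5,    # assumed "small radius", in scene units
    max_tries: int = 1000,  # assumed retry limit
) -> Point:
    """Use the object centroid if navigable; otherwise rejection-sample nearby."""
    cx = sum(x for x, _ in objects) / len(objects)
    cy = sum(y for _, y in objects) / len(objects)
    if is_navigable((cx, cy)):
        return (cx, cy)
    for _ in range(max_tries):
        # Uniform sample inside the disk of the given radius around the centroid.
        r = radius * math.sqrt(rng.random())
        theta = rng.uniform(0.0, 2.0 * math.pi)
        candidate = (cx + r * math.cos(theta), cy + r * math.sin(theta))
        if is_navigable(candidate):
            return candidate
    raise RuntimeError("no navigable point found near the centroid")


objects = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]
open_floor = choose_start(objects, lambda p: True, random.Random(0))
# Pretend that only the half-plane x > 1 is walkable, so the centroid itself is rejected.
blocked = choose_start(objects, lambda p: p[0] > 1.0, random.Random(0))
```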
Appendix D Subjective metrics
“I am in charge of deciding where I go next during the house cleaning task” (Control where to go)
“I am responsible for the speed at which I completed the task” (Control of speed)
“I feel that I need to follow the suggestions given to me by the system” (Need to follow)
“I prefer that the system show me what to do next rather than figure it out myself during the house cleaning task” (Prefer to show what)
“I prefer that the system show me where to go next rather than figure it out myself during the house cleaning task” (Prefer to show where)
Utility and Self-efficacy:
“I completed the task as quickly as I could”
“I followed the help given to me by the system”
“I took longer than I needed to to complete the task”
“I found the help given to me by the system to be useful”
“I found a better way to complete the task than offered to me by the system”
“I found the house cleaning task more difficult to complete than the training task”
“I thought the system was easy to use”
“I would imagine that most people would learn to use the help provided by the system very quickly”
“I would like to use the help provided to me by system during the house cleaning task frequently in real life house cleaning scenarios”
“The assistance provided to me by the system during the house cleaning task helped me complete the task faster than if I had used the help provided to me during the training task”
“The help provided to me by the system during the house cleaning task was sufficient in order to accomplish the task”.
Appendix E Statistical analysis
E.1. Effects of varying assistance fidelity
Objective metrics. We performed four one-way ANOVA tests to measure the effect of assistance on normalized deviations, IPL, task completion distance, and task completion time. The effect of assistance was statistically significant on normalized deviations (, ), IPL (, ) and task-completion distance (, ) (see Figs. 6–6). No statistically significant effect of assistance was found on task-completion time (, ). We performed post hoc Tukey honest significant difference tests to measure differences between each group.
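For readers unfamiliar with the test, the F statistic of a one-way ANOVA can be computed as in the sketch below. The numbers are made up for illustration and are not study data; the function name is hypothetical, and the p-value (which requires the F distribution) is omitted. In practice a statistics package such as SciPy's `scipy.stats.f_oneway` would normally be used instead.

```python
from typing import List, Sequence, Tuple


def one_way_anova_f(groups: Sequence[Sequence[float]]) -> Tuple[float, int, int]:
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means: List[float] = [sum(g) / len(g) for g in groups]
    # Between-group variation: how far each group mean sits from the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-group variation: spread of observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Three hypothetical assistance conditions with three observations each.
f_stat, df_b, df_w = one_way_anova_f([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]])
```

A large F means the group means differ by more than the within-group spread would explain; post hoc tests such as Tukey's then identify which pairs of groups differ.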
Subjective metrics. We performed six one-way ANOVA tests to test for a main effect of assistance type within each question (Sec. 5.3). We found a statistically significant effect of assistance on Control what to do (, ), Need to follow (, ), Prefer to show (, ), Useful (, ) and Usable (, ). Following each ANOVA, we performed a post hoc Tukey test when a main effect was found; Fig. 7 summarizes these results.
E.2. Interactions between assistance and task difficulty
We conducted two separate two-way factorial ANOVA tests to measure the main effects of assistance type and task difficulty on normalized deviations and IPL, as well as any interaction between assistance type and difficulty (see Fig. 8). Assistance type (, ) and difficulty (, ) had statistically significant effects on normalized deviations. The interaction between the two independent variables was also statistically significant (, ). The effect of assistance type on IPL was statistically significant (, ). There was no significant effect of task difficulty on IPL, and the interaction effect was not statistically significant. We performed post hoc Tukey honest significant difference tests where effects were found.
- conference: College Station ’21: ACM Conference on Intelligent User Interfaces; April 13–17, 2021; College Station, Texas
- ccs: Computing methodologies Planning and scheduling
- ccs: Human-centered computing Mixed / augmented reality
- Planning refers to the process of computing an agent’s policy using a model for the transition dynamics of the environment, i.e., a model that predicts the effect of the agent’s actions on the environment states (Sutton and Barto, 2018).
- That is, in the notation of Sec. 3.2.
- These questions comprise only a subset of the questions that we asked in the original study. We limit our analysis here to this subset due to space constraints; see Appendix D for the full list.
- Context-aware maintenance support for augmented reality assistance and synchronous multi-user collaboration. Procedia CIRP 59 (1), pp. 18–22. Cited by: §2.1.
- Measuring the user experience: collecting, analyzing, and presenting usability metrics. Newnes. Cited by: §1, §2.3.
- Mechanical turk. Cited by: Optimal Assistance for Object-Rearrangement Tasks in Augmented Reality, Figure 1.
- Merging virtual objects with the real world: seeing ultrasound imagery within the patient. ACM SIGGRAPH Computer Graphics 26 (2), pp. 203–210. Cited by: §2.1.
- An empirical evaluation of the system usability scale. Intl. Journal of Human–Computer Interaction 24 (6), pp. 574–594. Cited by: item 2.
- Restoring a sense of control during implementation: how user involvement leads to system acceptance. Mis Quarterly, pp. 111–124. Cited by: §2.3.
- Automation technology and sense of control: a window on human agency. PLoS One 7 (3), pp. e34075. Cited by: §2.3.
- Evaluating user-adaptive systems: lessons from experiences with a personalized meeting scheduling assistant. In IAAI, Vol. 9, pp. 40–46. Cited by: §1, §2.3.
- HepaVision2 - a software assistant for preoperative planning in living-related liver transplantation and oncologic liver surgery. In CARS 2002 Computer Assisted Radiology and Surgery, pp. 341–346. Cited by: §2.1.
- A brave, creative, and happy hri. ACM New York, NY, USA. Cited by: §2.2.
- Multi-agent reinforcement learning: an overview. In Innovations in multi-agent systems and applications-1, pp. 183–221. Cited by: §3.1.
- The truck dispatching problem. Management science 6 (1), pp. 80–91. Cited by: §1, §3.2.
- AMRA: augmented reality assistance for train maintenance tasks. Cited by: §2.1.
- Legibility and predictability of robot motion. In 2013 8th ACM/IEEE International Conference on Human-Robot Interaction (HRI), pp. 301–308. Cited by: §2.2.
- Augmented reality in support of intelligent manufacturing–a systematic literature review. Computers & Industrial Engineering 140, pp. 106195. Cited by: §1.
- Assistive gym: a physics simulation framework for assistive robotics. arXiv preprint arXiv:1910.04700. Cited by: §2.2.
- A novel augmented reality-based interface for robot path planning. International Journal on Interactive Design and Manufacturing (IJIDeM) 8 (1), pp. 33–42. Cited by: §2.1.
- The vehicle routing problem: latest advances and new challenges. Vol. 43, Springer Science & Business Media. Cited by: §1, §3.2.
- OR-tools. Cited by: §1, §3.2.
- PsiTurk: an open-source framework for conducting replicable behavioral experiments online. Behavior research methods 48 (3), pp. 829–842. Cited by: Optimal Assistance for Object-Rearrangement Tasks in Augmented Reality, §1, §4.
- Evaluating fluency in human–robot collaboration. IEEE Transactions on Human-Machine Systems 49 (3), pp. 209–218. Cited by: §1, §2.2.
- Interactive planning-based cognitive assistance on the edge. In 3rd USENIX Workshop on Hot Topics in Edge Computing (HotEdge 20), Cited by: §2.1, §2.1.
- Augmented reality for stem learning: a systematic review. Computers & Education 123, pp. 109–123. Cited by: §1.
- The role of ai in mixed and augmented reality interactions. In CHI ’20 workshop “Artificial Intelligence for HCI: A Modern Approach”, Cited by: §1.
- Design and use paradigms for gazebo, an open-source multi-robot simulator. In 2004 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)(IEEE Cat. No. 04CH37566), Vol. 3, pp. 2149–2154. Cited by: §2.2.
- AI2-thor: an interactive 3d environment for visual ai. arXiv. Cited by: §2.2.
- Agency modulates interactions with automation technologies. Ergonomics 61 (9), pp. 1282–1297. Cited by: §2.3.
- Dialogue learning with human-in-the-loop. arXiv preprint arXiv:1611.09823. Cited by: §2.2.
- The experience of agency in human-computer interactions: a review. Frontiers in human neuroscience 8, pp. 643. Cited by: §2.3.
- Which robot am i thinking about? the impact of action and appearance on people’s evaluations of a moral robot. In 2016 11th ACM/IEEE International Conference on Human-Robot Interaction (HRI), pp. 125–132. Cited by: §1, §2.2.
- What is the sense of agency and why does it matter? Frontiers in psychology 7, pp. 1272. Cited by: §2.3, item 1.
- Handheld augmented reality indoor navigation with activity-based instructions. In Proceedings of the 13th international conference on human computer interaction with mobile devices and services, pp. 211–220. Cited by: §2.1.
- An intelligent personal assistant for task and time management. AI Magazine 28 (2), pp. 47–47. Cited by: §3.1.
- Augmented reality mapping systems and related methods. Google Patents. Note: US Patent App. 16/822,828 Cited by: §2.1, §3.1.
- A systematic review of augmented reality applications in maintenance. Robotics and Computer-Integrated Manufacturing 49, pp. 215–228. Cited by: §1.
- Nginx: the high-performance web server and reverse proxy. Linux Journal 2008 (173), pp. 2. Cited by: Appendix B.
- Habitat: a platform for embodied ai research. In Proceedings of the IEEE International Conference on Computer Vision, pp. 9339–9347. Cited by: Optimal Assistance for Object-Rearrangement Tasks in Augmented Reality, Figure 1, §1, §2.2, §4.
- Designing the user interface: strategies for effective human-computer interaction. Pearson Education India. Cited by: §2.3.
- The replica dataset: a digital replica of indoor spaces. arXiv preprint arXiv:1906.05797. Cited by: Figure 1, §4.
- Reinforcement learning: an introduction. MIT press. Cited by: footnote 1.
- Comparative effectiveness of augmented reality in object assembly. In Proceedings of the SIGCHI conference on Human factors in computing systems, pp. 73–80. Cited by: §1, §2.1, §2.3.
- Augmented reality: an application of heads-up display technology to manual manufacturing processes. In Hawaii international conference on system sciences, pp. 659–669. Cited by: §2.1.
- Teleplanning by human demonstration for vr-based teleoperation of a mobile robotic assistant. In Proceedings 10th IEEE International Workshop on Robot and Human Interactive Communication. ROMAN 2001 (Cat. No. 01TH8591), pp. 462–467. Cited by: §2.1.
- Recent development of augmented reality in surgery: a review. Journal of healthcare engineering 2017. Cited by: §1.
- DD-PPO: learning near-perfect pointgoal navigators from 2.5 billion frames. arXiv, pp. arXiv–1911. Cited by: §4.
- Influences of augmented reality assistance on performance and cognitive loads in different stages of assembly task. Frontiers in Psychology 10, pp. 1703. Cited by: §1, §2.3.
- New realities: a systematic literature review on virtual reality and augmented reality in tourism research. Current Issues in Tourism 22 (17), pp. 2056–2081. Cited by: §1.
- Leveraging human guidance for deep reinforcement learning tasks. arXiv preprint arXiv:1909.09906. Cited by: §2.2.