Forecasting Hands and Objects in Future Frames
This paper presents an approach to forecast the future presence and location of human hands and objects. Given an image frame, the goal is to predict what objects will appear in a future frame (e.g., 5 seconds later) and where they will be located, even when they are not visible in the current frame. The key idea is that (1) an intermediate representation of a convolutional object recognition model abstracts the scene information in its frame and that (2) we can predict (i.e., regress) such representations corresponding to future frames based on that of the current frame. We design a new two-stream convolutional neural network (CNN) architecture for videos by extending the state-of-the-art convolutional object detection network, and present a new fully convolutional regression network for predicting future scene representations. Our experiments confirm that combining the regressed future representation with our detection network allows reliable estimation of future hands and objects in videos. We obtain much higher accuracy than the state-of-the-art future object presence forecast method on a public dataset.
1 Introduction
The ability to forecast future scenes is very important for many real-time computer vision systems. Just as humans predict how the objects in front of them will move and what objects will newly appear, computer vision systems need to infer future objects for their tasks. This is particularly necessary for interactive/collaborative systems, including surveillance systems, robots, and wearable devices. For instance, a robot working on a collaborative task with a human needs to predict what objects the human is expected to move and how they will move; a surgical robot needs to forecast what surgical instruments will appear a few seconds later to better work with a human surgeon. This is also necessary for more natural human-robot interaction as well as better real-time surveillance, since the forecast allows such systems to react faster to humans and objects.
In the past 2-3 years, there has been an increasing number of works on ‘forecasting’ in computer vision. Researchers have studied forecasting trajectories [5, 17], convolutional neural network (CNN) representations [15], optical flows and human body parts [8], and video frames [7, 3]. However, none of these approaches was optimized for forecasting explicit future locations of hands/objects appearing in videos. Vondrick et al. [15] forecast only the presence of objects without their locations, because the method is designed to regress lower dimensional (e.g., 4K) representations. Luo et al. [8] forecast human pose only when the person is already in the scene (e.g., the method cannot predict whether a new person will appear) and do not handle objects. Its learning is also not end-to-end, since optical flows are used as an intermediate representation. The methods of [7, 3] were designed to forecast image frames directly, rather than to make object-level estimates. An end-to-end approach that learns a forecast model optimized for future object location estimation has been lacking.
This paper introduces a new approach to forecast the presence and location of hands and objects in future frames (e.g., 5 seconds later). Given an image frame, the objective is to predict future bounding boxes of appearing objects even when they are not visible in the current frame (e.g., Figure 3). Our key idea is that (1) an intermediate CNN representation of an object recognition model abstracts the scene information in its frame and that (2) we can model how such a representation changes in future frames based on the training data. For (1), we design a new two-stream CNN architecture with an auto-encoder by extending the state-of-the-art convolutional object detection network (SSD [6]). For (2), we present a new fully convolutional regression network that allows us to infer future CNN representations. These two networks are combined to directly predict future locations of human hands and objects, forming a deeper network that can be trained in an end-to-end fashion (Figure 1).
We evaluated our proposed approach on first-person video datasets containing human hands and objects. Human hands and objects dynamically appear and disappear in first-person videos taken with wearable cameras, making them suitable data to evaluate our approach. Notably, in our experiments with the public ADL dataset [11], our accuracy was higher than the previous state-of-the-art method [15] by more than 0.25 mean average precision (AP).
2 Related work
Computer vision researchers are increasingly focusing on ‘forecasting’ of future scenes. Earlier works include early recognition of ongoing human activities [12, 4], and more recent works include explicit forecasting of human trajectories and future locations [5, 17, 9, 10]. There are also works forecasting future features or video frames themselves [15, 3, 8].
Kitani et al. [5] proposed an approach to predict human trajectories in surveillance videos. There are more works in a similar direction [17, 9]. Park et al. [10] also tried to predict the future location of a person, but from a different viewpoint: egocentric videos. However, most of these trajectory-based analyses are limited in that they assume the person to forecast is already present in the scene. This is insufficient particularly when dealing with hands and objects recorded by wearable/robot cameras, since a human hand often goes out of the scene (together with objects) and returns.
More recently, Vondrick et al. [15] showed that it is possible to forecast the fully connected layer outputs of convolutional neural networks (e.g., VGG [14]). The paper further demonstrated that such representation forecasts can be used for forecasting the presence of objects in the scene (i.e., whether a particular object will appear in front of the camera 5 seconds later or not). However, due to the limited dimensionality of the representation (i.e., 4K-D), the approach was not directly applicable to forecasting ‘locations’ of objects in the scene. Similarly, Luo et al. [8] used a CNN regression to forecast optical flows, and used such optical flows to predict future human body pose. However, it requires the human to be already present in the scene and his/her body-part locations to be correctly estimated initially. Finn et al. [3] predicted future video frames by learning dynamics from training videos, but they also assumed the objects to be already present in the scene.
We believe this is the first paper to present a method to explicitly forecast future object locations using a fully convolutional network. The contribution of this paper is in (1) introducing the concept of future object forecasting using fully convolutional regression of intermediate CNN representations, and (2) the design of the two-stream SSD model, which considers both appearance and motion and is optimized for video-based future forecasting. There were previous works on pixel-level forecasting of future frames including [8, 3, 18, 16], but they were limited to learning pixel-level motion instead of entity-level predictions. Our approach does not assume that a hand/object is in the scene in order to predict its future location, unlike prior works based on tracking (e.g., trajectory-based estimation) or pixel motion (e.g., optical flow estimation). For example, Figure 3 shows our model forecasting that an oven, which is not in the current frame, will appear 5 seconds later.
3 Approach
The objective of our approach is to predict the future presence/location of human hands and objects in the scene given the current image frame. We propose a new two-stream convolutional neural network architecture with fully convolutional future representation regression (Figure 1). The proposed model consists of two fully convolutional neural networks: (1) an extended two-stream video version of the Single Shot MultiBox Detector (SSD) [6] with a convolutional auto-encoder for representing and estimating objects and (2) a future regression network to predict the intermediate scene representation corresponding to the future frame.
The key idea of our approach is that we can forecast scene configurations of the near future (e.g., 5 seconds later) by predicting (i.e., regressing) its intermediate CNN representation. Inside our fully convolutional hand/object detection network, we abstract the scene information of the input frame as its intermediate representation (see Figure 1) using a convolutional auto-encoder. Such an intermediate representation is further processed by the later layers of the network to finalize the positions of hand/object bounding boxes. Our approach estimates the intermediate representation of the future frame and combines it with the later layers of the network to forecast future bounding boxes of hands/objects.
3.1 Two-stream network for scene representation
In this subsection, we introduce our two-stream network, which extends the previous fully convolutional object detection network. The objective of this component is to abstract the scene at time $t$ into a lower dimensional representation, so that estimation of hand and object locations becomes possible.
Our two-stream network is designed to combine evidence from both spatial- and motion-domain features to represent the scene, as shown in the top row of Figure 1. The spatial stream receives one image frame, while the temporal stream receives the corresponding X and Y gradients of optical flows. This design was inspired by the two-stream network of Simonyan and Zisserman [13], which was originally proposed for activity recognition. The intuition behind the use of the two-stream network is that it captures temporal motion patterns in activity videos as well as spatial information. We used the OpenCV TV-L1 optical flow algorithm to extract flow images.
Here, for the object-based scene representation, we extend the SSD object detection network. We first insert a fully convolutional auto-encoder into our model, which has five convolutional layers followed by five transposed convolutional layers (also referred to as deconvolutions). This makes our scene representation more compact by reducing its dimensionality. We use 5×5 filters for each layer of the auto-encoder. The numbers of filters in the convolutional layers are 512, 256, 128, 64, and 256. The transposed convolutional layers have the symmetric numbers of filters: 256, 64, 128, 256, and 512. We do not apply any pooling layer, but instead use stride 2 for the last convolutional layer. This design allows the abstraction of scene information in an image frame as a lower dimensional (256×25×25) intermediate representation.
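As a quick check of the spatial arithmetic, the following minimal sketch computes the output width of a stack of five 5×5 convolutions in which only the last one has stride 2. The padding of 2 and the input width of 50 are assumptions chosen for illustration (they are not stated above); with them, the stack yields the 25-wide map mentioned in the text.

```python
def conv_output_size(size, kernel, stride=1, padding=0, dilation=1):
    """Spatial output size of a convolution along one dimension."""
    effective_kernel = dilation * (kernel - 1) + 1
    return (size + 2 * padding - effective_kernel) // stride + 1


def encoder_output_size(size, kernel=5, padding=2, num_layers=5):
    """Apply num_layers convolutions; only the last one uses stride 2.

    kernel=5 and num_layers=5 follow the text; padding=2 is an assumption.
    """
    for layer in range(num_layers):
        stride = 2 if layer == num_layers - 1 else 1
        size = conv_output_size(size, kernel, stride=stride, padding=padding)
    return size


if __name__ == "__main__":
    # Hypothetical 50-wide input feature map.
    print(encoder_output_size(50))
```

With "same" padding, the four stride-1 layers keep the width unchanged, so the stride-2 layer alone is responsible for the halving.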
We design our object-based scene representation network to have both a spatial-stream and a temporal-stream part. Instead of using late fusion to combine the spatial and temporal streams at the end of the network as was done in [13], we use early fusion in our two-stream network by combining the two streams’ feature maps before the encoder-decoder component. Specifically, at the conv5 layers, the feature blobs from the two streams are combined to form a single blob by a feature selection layer with one-by-one kernels. This selection layer is also learned during our training process, making it optimized for estimating hand and object bounding boxes given the frame.
In addition, since our regression component can combine multi-frame information, we are able to reduce the amount of computation in our temporal stream by making it receive a single optical flow image instead of stacked optical flows from multiple frames. This process is described in more detail in Subsection 3.2.
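The early-fusion step can be illustrated with a small pure-Python sketch: the two streams' feature maps are concatenated along the channel axis, and a one-by-one convolution then mixes the channels at each pixel. The function names, the nested-list layout `[channel][row][column]`, and the toy weights are all illustrative assumptions, not the trained selection layer.

```python
def concat_channels(spatial, temporal):
    """Concatenate two [C][H][W] feature maps along the channel axis."""
    return spatial + temporal


def conv1x1(features, weights, bias):
    """One-by-one convolution: a per-pixel linear map over channels.

    features: [C_in][H][W], weights: [C_out][C_in], bias: [C_out].
    """
    height, width = len(features[0]), len(features[0][0])
    output = []
    for weight_row, b in zip(weights, bias):
        channel = [
            [b + sum(w * features[c][y][x] for c, w in enumerate(weight_row))
             for x in range(width)]
            for y in range(height)
        ]
        output.append(channel)
    return output


if __name__ == "__main__":
    spatial = [[[1.0, 2.0]]]    # 1 channel, 1x2 map (toy values)
    temporal = [[[3.0, 4.0]]]   # 1 channel, 1x2 map (toy values)
    fused = conv1x1(concat_channels(spatial, temporal), [[0.5, 0.5]], [0.0])
    print(fused)
```

Because the kernel is one-by-one, the layer cannot move information spatially; it only learns how much each stream's channels contribute at a given location.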
Let $f$ denote the proposed two-stream network that estimates object locations given a video frame at time $t$. This function has two input variables, $X_t$ and $O_t$, which represent the current input frame and the corresponding optical flow image at time $t$, respectively. Note that $O_t$ is calculated from images $X_{t-1}$ and $X_t$, so no future information after time $t$ is used. This function can then be decomposed into two sub-functions, $f = h \circ g$:
$$\hat{Y}_t = f(X_t, O_t) = h(g(X_t, O_t)) = h(F_t),$$
where $g$ denotes a convolutional encoder that extracts the compressed visual representation (feature map) $F_t$ from $X_t$ and $O_t$, and $h$ indicates the remaining part of the proposed network, which takes the compressed feature map as input and predicts the hand and object locations $\hat{Y}_t$ at time $t$. The upper part of Figure 1 shows the architecture.
The loss function is identical to that of the original SSD [6], which is a combination of localization and confidence losses.
3.2 Future regression network
The objective of this research is not to estimate hand/object locations in the ‘current’ frame $X_t$, but to forecast their locations in the ‘future’ frame $X_{t+\Delta}$. To do so, we design a method for predicting future representations using convolutional layers.
We formulate the problem as a regression problem of forecasting the future intermediate representation $F_{t+\Delta}$ of the proposed two-stream network based on its current intermediate representation $F_t$. The main idea is that the intermediate representation of our proposed network abstracts spatial and motion information of hands and objects and that we can learn a function (i.e., a network) modeling how it changes over time. Once the future intermediate representation is regressed, we can hand the predicted representation $\hat{F}_{t+\Delta}$ to the remaining part of the proposed network (i.e., $h$) to forecast future hand/object bounding boxes.
Let $r$ denote our future regression network, with weights $w$, which predicts the future intermediate scene representation $\hat{F}_{t+\Delta}$ given the current scene representation $F_t$:
$$\hat{F}_{t+\Delta} = r_w(F_t).$$
The regression network consists of nine convolutional layers, each having 256 channels of 5×5 filters except the last two layers. The 8th layer uses a dilated convolution with 1024 channels to cover a large receptive field of 13×13, and the last layer uses 256 channels with 1×1 kernels.
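The 13×13 extent of a dilated convolution follows from the standard relation: effective size = dilation × (kernel − 1) + 1. The text does not state which kernel size and dilation rate the 8th layer uses, so the sketch below simply enumerates the combinations that produce a given extent; the function names and the search bound are illustrative assumptions.

```python
def effective_kernel_size(kernel, dilation):
    """Spatial extent covered by a dilated convolution kernel."""
    return dilation * (kernel - 1) + 1


def configs_with_extent(extent, max_kernel=7):
    """List (kernel, dilation) pairs whose effective size equals extent."""
    return [
        (kernel, dilation)
        for kernel in range(2, max_kernel + 1)
        for dilation in range(1, extent)
        if effective_kernel_size(kernel, dilation) == extent
    ]


if __name__ == "__main__":
    # Which settings cover a 13-wide window?
    print(configs_with_extent(13))
```

For example, both a 3×3 kernel with dilation 6 and a 5×5 kernel with dilation 3 cover a 13×13 window, while using far fewer weights than a dense 13×13 kernel.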
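A desirable property of this formulation is that it allows training of the weights ($w$) of the regression network with unlabeled videos using the following loss function:
$$w^* = \arg\min_w \sum_{i,t} \ell\big(r_w(F^i_t),\, F^i_{t+\Delta}\big), \qquad F^i_t = g(X^i_t, O^i_t),$$
where $X^i_t$ indicates the frame at time $t$ from video $i$, $O^i_t$ is the corresponding optical flow image, and $\ell$ is a regression loss (the Huber loss in our experiments). Here, we use our compressed scene representation having relatively low dimensionality, but we can use any other intermediate scene representation (from any layer of our two-stream network) in principle. Once we get the future scene representation $\hat{F}_{t+\Delta}$, it is fed to the remaining part of the two-stream network to forecast hand/object locations corresponding to the future frame:
$$\hat{Y}_{t+\Delta} = h(\hat{F}_{t+\Delta}) = h(r_w(g(X_t, O_t))).$$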
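Because the regression target is simply the representation of a later frame of the same video, training pairs can be built without any annotation. The sketch below shows one way to enumerate such (current, future) frame-index pairs; the function name, the frame-rate parameter, and the rounding of the offset are assumptions made for illustration.

```python
def future_pairs(num_frames, fps, delta_seconds):
    """Return (t, t + offset) index pairs for one unlabeled video.

    offset is the number of frames corresponding to delta_seconds.
    """
    offset = round(fps * delta_seconds)
    return [(t, t + offset) for t in range(num_frames - offset)]


if __name__ == "__main__":
    # Hypothetical 10-frame clip sampled at 2 frames per second,
    # forecasting 1 second ahead.
    for current, future in future_pairs(10, fps=2, delta_seconds=1):
        print(current, "->", future)
```

Each pair supplies the regressor's input (the representation of the first frame) and its target (the representation of the second), so a video that is shorter than the forecast horizon contributes no pairs.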
Figure 1 illustrates the data flow of our proposed approach during the inference (i.e., testing) phase. Given a video frame $X_t$ and its corresponding optical flow image $O_t$ at time $t$, (1) we first extract the intermediate representation ($F_t = g(X_t, O_t)$), and (2) give it to the future regression network ($r$) to obtain the future scene representation $\hat{F}_{t+\Delta}$. Finally, (3) we predict the future locations of hands/objects by providing the predicted future scene representation to the remaining part of the proposed two-stream network ($h$) at time $t$.
In addition to the above basic formulation, our proposed approach is extended to use the previous $K$ frames to obtain $\hat{F}_{t+\Delta}$ as illustrated in Figure 1, instead of using just a single frame (i.e., the current frame) for the future regression:
$$\hat{F}_{t+\Delta} = r_w([F_t, F_{t-1}, \ldots, F_{t-(K-1)}]).$$
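The multi-frame data flow can be sketched as a small wrapper that keeps the most recent intermediate representations in a fixed-length buffer. The class name, the callables passed in, and the choice to return `None` until the buffer is full are illustrative assumptions; the encoder, regressor, and decoder stand in for the three network components and are not implemented here.

```python
from collections import deque


class FutureForecaster:
    """Chains encoder -> regressor -> decoder over the last num_frames frames."""

    def __init__(self, encoder, regressor, decoder, num_frames):
        self.encoder = encoder      # maps (frame, flow) to a representation
        self.regressor = regressor  # maps a list of representations to a future one
        self.decoder = decoder      # maps a representation to detections
        self.history = deque(maxlen=num_frames)

    def step(self, frame, flow):
        """Consume one frame; return a forecast once enough frames are buffered."""
        # Newest representation first; the oldest is dropped automatically.
        self.history.appendleft(self.encoder(frame, flow))
        if len(self.history) < self.history.maxlen:
            return None
        predicted = self.regressor(list(self.history))
        return self.decoder(predicted)


if __name__ == "__main__":
    # Toy stand-ins operating on numbers instead of feature maps.
    forecaster = FutureForecaster(
        encoder=lambda frame, flow: frame + flow,
        regressor=lambda reps: sum(reps) / len(reps),
        decoder=lambda rep: rep * 2,
        num_frames=2,
    )
    print(forecaster.step(1.0, 0.0))
    print(forecaster.step(3.0, 0.0))
```

Keeping representations rather than raw frames in the buffer means each frame is encoded only once, even though it contributes to several consecutive forecasts.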
Our future representation regression network allows us to predict future objects while considering the implicit object and activity context in the scene. The intermediate representation abstracts spatial and motion information in the current scene, and our fully convolutional future regressor can take advantage of it for the forecast.
4 Experiments
We conducted two sets of experiments to confirm the forecast ability of our approach using the fully convolutional two-stream regression architecture. In the first experiment, we use a first-person video dataset to predict future human hand locations. In the second set of experiments, we use the public dataset with object annotations to train/test object bounding box forecasting.
In our experiments, we used three datasets for the training and testing of our approaches.
EgoHands [2]: This dataset is a collection of 48 egocentric videos that contain four types of human interactions (i.e., playing cards, playing chess, solving a puzzle, and playing Jenga). The original dataset has 15,053 ground-truth labels for hands in 4,800 frames, and we also newly annotated 466 frames with 1,267 hand bounding boxes. We used this dataset for training our two-stream network to represent hands in a video frame.
Unlabeled Human Interaction Videos: This is our newly collected dataset that contains a total of 47 first-person videos of human-human collaboration scenarios, taken with a wearable camera. The dataset contains videos of two types of collaborative scenarios: (1) the camera wearer clears objects from a table as another person approaches the table while holding a large box, making room for him/her to put the box down, and (2) the camera wearer pushes a trivet on a table toward another person as he/she approaches the table while holding a cooking pan. The duration of each video clip is between 4 and 10 seconds, and the videos do not have ground truth annotations of human hand bounding boxes.
Activities of Daily Living (ADL) [11]: This first-person video dataset contains 20 videos of 18 daily activities, such as making tea and doing laundry. This is a challenging dataset since frames display a significant amount of motion blur caused by the camera wearer’s movement. This dataset also suffers from noisy annotations. Object bounding boxes were provided as ground truth annotations. Although there are 43 types of objects in the dataset, we trained our model (and the baselines) for the 15 most common categories, following the setting used in [15]. We split the ADL dataset into four sets, using three sets for training and the remaining set for testing, as was done previously.
In order to confirm the benefits of our proposed approach quantitatively, we created multiple baselines.
(i) SSD with future annotations is the original SSD model [6] taking the current image frame as an input, extended to forecast future hands/objects. Instead of providing current-frame object bounding boxes as ground truths in the training step, we provided ‘future’ ground truth locations of hands and objects. This enables the model to directly regress future object boxes.
(ii) Hands only is the baseline that uses only the estimated hand locations in the current frame to predict their future locations. The idea is to confirm whether the detection of the current hand locations is sufficient to infer their future locations. A set of fully connected layers was used for the future location estimation.
(iii) Hand-crafted features uses a hand-crafted state representation based on explicit hand and object detection results from the current frame. It encodes relative distances between all detected objects, and uses them to predict their future locations, also using fully connected layer regression. More specifically, it detects objects using KAZE features [1] and hands using [2], then computes relative distances between all objects and hands to build a state representation having 20 values. It then performs a regression using a network of five fully connected layers.
In addition, we also implemented a simpler version of our approach, (iv) one-stream network, which uses the same CNN architecture as our proposed approach except that it only has the spatial stream (taking RGB input) without the temporal stream (taking optical flow input). We constructed this baseline to confirm how much the temporal stream of our network helps predict future hand/object locations.
In this section, we evaluate the performance of our approach to forecast future locations of hands and objects using two different datasets (our Unlabeled Human Interaction Videos and the ADL dataset). The training of our models was done in two stages: we first train the SSD part of the network based on object ground truths. Next, we train the future regressor based on current/future representations from the training videos, with the Huber loss widely used in regression. We also tried end-to-end fine-tuning of the entire model, but it did not provide much benefit.
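For reference, the Huber loss is quadratic for small errors and linear for large ones, which makes it less sensitive to outliers than a squared error. The following is a minimal sketch over flat lists of values; the threshold `delta=1.0` and the averaging over elements are assumptions, since the text does not state these details.

```python
def huber_loss(predicted, target, delta=1.0):
    """Mean Huber loss between two equally long sequences of numbers."""
    total = 0.0
    for p, t in zip(predicted, target):
        error = abs(p - t)
        if error <= delta:
            total += 0.5 * error ** 2               # quadratic region
        else:
            total += delta * (error - 0.5 * delta)  # linear region
    return total / len(predicted)


if __name__ == "__main__":
    # One small error (0.5) and one large error (3.0).
    print(huber_loss([0.0, 3.0], [0.5, 0.0]))
```

In the regression setting described above, `predicted` would hold the flattened regressed future representation and `target` the representation actually extracted from the future frame.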
Hand location forecast:
We first evaluate the performance of our approach in predicting future hand locations using the interaction dataset. This dataset is less noisy than the ADL dataset. Here, we use hand detection results from the original SSD model trained on the EgoHands dataset [2] as the ground truth hand labels, since the interaction videos do not have any human annotations. We randomly split the dataset into a training set and a test set; we used 32 videos for training and the remaining 15 videos for testing. We used precision and recall as our evaluation measures. Whether a forecasted bounding box is a true positive was decided based on the “intersection over union” (IoU) ratio between the areas of each predicted bounding box and the (future) ground truth box. We only considered a prediction to be a true positive when the IoU ratio was greater than 0.5.
Table 1 shows quantitative results of 1-second future hand predictions on our Human Interaction dataset. Since our network may use previous frames as an input for the future regression, we report the performance of our approach for different numbers of input frames K. We observe that our proposed approaches significantly outperform the original SSD trained with future hand locations. The one-stream model performed better than the SSD baseline, suggesting the effectiveness of our concept of future regression. Note that our one-stream K=1 takes exactly the same amount of input as the SSD baseline. Our two-stream models performed better than the one-stream models, indicating that the temporal stream is helpful for predicting future locations. Our best proposed configuration leads on all three metrics, reaching an F-measure of about 34.18. Figure 2 shows example hand forecast results.
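The evaluation criterion can be made concrete with a short sketch. Boxes are assumed to be `(x_min, y_min, x_max, y_max)` tuples, and predictions are matched to ground truth greedily and one-to-one; the matching strategy and all function names are assumptions, as the text only specifies the IoU threshold of 0.5.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall(predictions, ground_truths, threshold=0.5):
    """Greedy one-to-one matching; true positive when IoU > threshold."""
    unmatched = list(ground_truths)
    true_positives = 0
    for pred in predictions:
        best = max(unmatched, key=lambda gt: iou(pred, gt), default=None)
        if best is not None and iou(pred, best) > threshold:
            true_positives += 1
            unmatched.remove(best)
    precision = true_positives / len(predictions) if predictions else 0.0
    recall = true_positives / len(ground_truths) if ground_truths else 0.0
    return precision, recall


def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total > 0 else 0.0


if __name__ == "__main__":
    preds = [(0, 0, 2, 2), (10, 10, 12, 12)]
    truth = [(0, 0, 2, 2)]
    p, r = precision_recall(preds, truth)
    print(p, r, f_measure(p, r))
```

In the forecasting setting, `predictions` are the boxes forecast from the current frame and `ground_truths` are the boxes observed in the future frame.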
Object location forecast:
We use the ADL dataset [11] to measure the ability of our proposed method to forecast object locations. Both 1-second and 5-second future bounding box locations are predicted, and the performance is measured in terms of mean average precision (mAP). An IoU ratio of 0.5 was used to determine whether a predicted bounding box is correct compared to the ground truth. Note that the ADL dataset is challenging for future prediction, since the videos were taken from the first-person view and display strong egocentric motion. Further, objects appearing in the scene are not evenly distributed across different videos. Many objects appear and disappear from the scene even within the 5-second window due to the camera ego-motion.
Table 2 shows the average precision (AP) of each object category. Our approach significantly outperforms the SSD baseline. While taking advantage of only the same amount of information (i.e., a single frame), our approach (one-stream K=1) achieved superior performance to the SSD baseline. By using additional temporal information, our approach (two-stream K=1,10) outperforms its one-stream version in mAP. This indicates that motion information is helpful in predicting the right location of objects in future frames, especially in first-person videos with strong ego-motion. Figures 5 and 5 show PR-curves for predicting 1-second and 5-second future objects of different categories. Figure 3 shows example object predictions in 1-second and 5-second future. Based on RGB and optical flow information in the frames, our approach is able to predict future objects even when they are not visible in the current scene.
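As an illustration of the metric, the sketch below computes a non-interpolated average precision from ranked detections and averages it over categories. The exact AP variant used in the experiments (for example, an interpolated one) is not stated, so this particular definition, the input format, and the function names are assumptions.

```python
def average_precision(scored_hits, num_ground_truth):
    """Non-interpolated AP.

    scored_hits: list of (confidence, is_true_positive) for one category.
    num_ground_truth: number of ground truth boxes for that category.
    """
    ranked = sorted(scored_hits, key=lambda item: item[0], reverse=True)
    true_positives = 0
    precision_sum = 0.0
    for rank, (_, is_tp) in enumerate(ranked, start=1):
        if is_tp:
            true_positives += 1
            precision_sum += true_positives / rank  # precision at this recall step
    return precision_sum / num_ground_truth if num_ground_truth else 0.0


def mean_average_precision(per_category):
    """per_category: dict mapping category -> (scored_hits, num_ground_truth)."""
    aps = [average_precision(hits, n) for hits, n in per_category.values()]
    return sum(aps) / len(aps) if aps else 0.0


if __name__ == "__main__":
    hits = [(0.9, True), (0.8, False), (0.7, True)]
    print(average_precision(hits, num_ground_truth=2))
```

Whether a detection counts as a true positive would be decided beforehand with the IoU criterion of 0.5 described above.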
Object presence forecast:
In this task, we use the ADL dataset to evaluate our approach in forecasting the ‘presence’ of objects in future frames. Specifically, we ignore the location of the bounding boxes and decide that the object exists if its confidence score (of any box) is above the threshold. Similar to our object location forecast experiment, we obtained PR-curves and calculated the AP of each object category. We trained our model to predict the presence of objects in 5-second-future frames. This experiment makes it possible to directly compare our approach with the results of the AlexNet-based architecture of [15], following the same standard setting used in their experiments.
Table 3 compares different versions of our proposed approach with the baselines. We observe that our approaches significantly outperform the results reported in [15] while following the same setting. Our two-stream K=10 version obtained a mean AP of 40.8%, which exceeds [15] by a margin of about 30 percentage points. In addition, our one-stream K=1 version, which uses only a single RGB frame as an input, obtained higher accuracy than the SSD baseline and [15] while using the same input. The respective mean APs were 35.9 vs. 10.9 vs. 10.7. Furthermore, we were able to confirm that our two-stream K=1 version performs better than the two-stream version of SSD.
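The presence decision described here reduces each category's forecast boxes to a single score. A minimal sketch, assuming detections are given as `(category, confidence, box)` tuples (a hypothetical format) and that a category counts as present when its highest box confidence exceeds the threshold:

```python
def predict_presence(detections, threshold):
    """Return {category: best_confidence} for categories judged present.

    detections: iterable of (category, confidence, box); boxes are ignored.
    """
    best_scores = {}
    for category, confidence, _box in detections:
        best_scores[category] = max(confidence, best_scores.get(category, 0.0))
    return {c: s for c, s in best_scores.items() if s > threshold}


if __name__ == "__main__":
    forecast = [
        ("oven", 0.3, (0, 0, 1, 1)),
        ("oven", 0.8, (1, 1, 2, 2)),
        ("mug", 0.2, (0, 0, 1, 1)),
    ]
    print(predict_presence(forecast, threshold=0.5))
```

Sweeping the threshold over the per-category maximum scores is what produces the PR-curves from which the per-category AP is computed.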
5 Conclusion
We presented a new approach to explicitly forecast human hand and object locations using a fully convolutional future representation regression network. The key idea was to forecast scene configurations of the near future by predicting (i.e., regressing) intermediate CNN representations of the future scene. We presented a new two-stream model to represent scene information of the given frame, and experimentally confirmed that we can learn a function (i.e., a network) to model how such intermediate scene representation changes over time. The experimental results confirmed that our object forecast approach significantly outperforms the previous work on the public dataset.
References
- [1] P. F. Alcantarilla, A. Bartoli, and A. J. Davison. KAZE features. In European Conference on Computer Vision (ECCV), 2012.
- [2] S. Bambach, S. Lee, D. J. Crandall, and C. Yu. Lending a hand: Detecting hands and recognizing activities in complex egocentric interactions. In IEEE International Conference on Computer Vision (ICCV), 2015.
- [3] C. Finn, I. Goodfellow, and S. Levine. Unsupervised learning for physical interaction through video prediction. In Advances in Neural Information Processing Systems (NIPS), pages 64–72, 2016.
- [4] M. Hoai and F. De la Torre. Max-margin early event detectors. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
- [5] K. M. Kitani, B. D. Ziebart, J. A. Bagnell, and M. Hebert. Activity forecasting. In European Conference on Computer Vision (ECCV), 2012.
- [6] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. Berg. SSD: Single shot multibox detector. In European Conference on Computer Vision (ECCV), 2016.
- [7] W. Lotter, G. Kreiman, and D. Cox. Deep predictive coding networks for video prediction and unsupervised learning. arXiv preprint arXiv:1605.08104, 2016.
- [8] Z. Luo, B. Peng, D.-A. Huang, A. Alahi, and L. Fei-Fei. Unsupervised learning of long-term motion dynamics for videos. arXiv preprint arXiv:1701.01821, 2017.
- [9] W. Ma, D. Huang, N. Lee, and K. M. Kitani. A game-theoretic approach to multi-pedestrian activity forecasting. arXiv preprint arXiv:1604.01431, 2016.
- [10] H. S. Park, J.-J. Hwang, Y. Niu, and J. Shi. Egocentric future localization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
- [11] H. Pirsiavash and D. Ramanan. Detecting activities of daily living in first-person camera views. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2847–2854, 2012.
- [12] M. S. Ryoo. Human activity prediction: Early recognition of ongoing activities from streaming videos. In IEEE International Conference on Computer Vision (ICCV), 2011.
- [13] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems (NIPS), pages 568–576, 2014.
- [14] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR), 2015.
- [15] C. Vondrick, H. Pirsiavash, and A. Torralba. Anticipating visual representations with unlabeled video. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
- [16] J. Walker, C. Doersch, A. Gupta, and M. Hebert. An uncertain future: Forecasting from static images using variational autoencoders. In European Conference on Computer Vision (ECCV), 2016.
- [17] J. Walker, A. Gupta, and M. Hebert. Patch to the future: Unsupervised visual prediction. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3302–3309, 2014.
- [18] T. Xue, J. Wu, K. L. Bouman, and W. T. Freeman. Visual dynamics: Probabilistic future frame synthesis via cross convolutional networks. In Advances in Neural Information Processing Systems (NIPS), 2016.