Detecting Hands in Egocentric Videos: Towards Action Recognition

Alejandro Cartas University of Barcelona,
Gran Via de les Corts Catalanes, 585, 08007 Barcelona, Spain
Mariella Dimiccoli University of Barcelona,
Gran Via de les Corts Catalanes, 585, 08007 Barcelona, Spain
Computer Vision Centre,
Campus UAB, 08193 Cerdanyola del Vallès, Barcelona, Spain
Petia Radeva University of Barcelona,
Gran Via de les Corts Catalanes, 585, 08007 Barcelona, Spain
Computer Vision Centre,
Campus UAB, 08193 Cerdanyola del Vallès, Barcelona, Spain

Recently, there has been a growing interest in analyzing human daily activities from data collected by wearable cameras. Since the hands are involved in a vast set of daily tasks, detecting hands in egocentric images is an important step towards the recognition of a variety of egocentric actions. However, hand detection is not a trivial task: besides the extreme illumination changes in egocentric images, the appearance of hands is intrinsically highly variable. We propose a hand detector that exploits skin modeling for fast hand proposal generation and Convolutional Neural Networks for hand recognition. We tested our method on the UNIGE-HANDS dataset and show that the proposed approach achieves competitive hand detection results.

Ego-centric vision; First Person Vision; Hand-detection

1 Introduction

With the advances in wearable technologies in recent years, there has been a growing interest in analyzing data captured by wearable cameras [1]. In particular, due to the large number of potential applications, the analysis of human daily activities [2, 3, 4, 5] has gained special attention. Daily activities are crucial to characterize human behavior, and enabling their automatic recognition would pave the way to novel applications in the field of Preventive Medicine, such as health monitoring [2, 3], among others [6].

The hands are involved in a wide variety of daily tasks, such as typing on a cell-phone keyboard, drinking coffee or riding a bike (see Fig. 1). Along with the objects being manipulated in a scene, the hands are often the main focus in the egocentric field of view. Consequently, their detection is a fundamental step towards action recognition. However, detecting hands in egocentric images is not a trivial task for three main reasons. First, the hands are intrinsically non-rigid, and their shape and appearance change continuously while manipulating objects. Second, the illumination conditions change rapidly in egocentric images as a consequence of the camera user's movements across different locations. These changes also affect the appearance of the hands and their recognition, as stated by Li and Kitani [7]. Third, the difficulty of detection also depends on the camera used and its position on the body (head, shoulders, or chest). For instance, if the camera is worn on the chest, the focus of attention is lost and the location of the hands in the field of view becomes more unpredictable. Available methods for detecting hands in egocentric images [7, 8, 9] are mostly based on hand-crafted features such as color histograms, texture, and HOG in different color spaces.

The remainder of this paper is organized as follows. In section 2 we review the state of the art on egocentric activity recognition and other works closely related to ours. In section 3 we introduce the proposed approach, and in section 4 we detail the experiments performed. Finally, in section 5 we draw some conclusions.

2 Related work

In recent years, one of the first attempts to segment hands from egocentric images was proposed by Fathi et al. [8]. In order to determine regions containing hands and active objects, they modeled the background pixels using texture and boundary features. From the extracted foreground pixels, they distinguished between hands and objects using color histograms. Additionally, they introduced the Georgia Tech Ego-centric Activity (GTEA) dataset to test their model.

Figure 1: Examples of images showing actions involving hands. These pictures were captured by a chest-mounted wearable camera.

Li and Kitani [7] trained a pixel-level hand detector on images with more realistic egocentric characteristics such as motion, and extreme lighting and illumination changes. Their method combines superpixels with invariance descriptors, and color and texture features. They tested different combinations on the GTEA dataset and on their own proposed dataset, commonly referred to as the zombie dataset. Although their results were better than those of other approaches, their method still failed when the hands were in dark or saturated regions. They extended their work by posing the detection problem as a recommendation task using virtual probes [10]. Additionally, their method segments not only the hands but also the forearms.

Serra et al. [9] also proposed a hand segmentation method that relies on the same combination of HSV+LAB features as [7], but employed the Simple Linear Iterative Clustering (SLIC) algorithm for extracting superpixels. Moreover, they corrected segmentation problems by temporally smoothing the pixels and by joining segmented regions using a graph-based approach.

Betancourt et al. [11] proposed a two-stage hand detector using different color (RGB, HSV, LAB) and edge (HOG, GIST) features together with a classifier (SVM, random forests, decision trees). During the first stage, an image is divided using a grid in order to reduce the color features. In the second stage, the features are extracted and classified for each region found. The results on their own dataset indicate that the best performance is achieved by combining HOG features and SVMs. In further work [12], they introduced the UNIGE-HANDS dataset and improved their detector to work on egocentric video sequences in the presence of image texture, color and luminosity variations. Specifically, they proposed a Kalman filter that smooths the frame-by-frame classification results of the SVMs.

More recently, a new egocentric dataset named EgoHands was introduced by Bambach et al. [13]. This dataset consists of videos in which two people wearing camera glasses face each other while playing a board game. Specifically, its purpose is to detect left and right hands and their respective owner at the pixel level. The pipeline of their approach is similar to R-CNN, but they provide a probabilistic region proposal and perform a pixel-level segmentation at the end of it. In addition, they performed an activity classification of the four board games played in the dataset using images containing only the detected hands, thus preserving the original locations and sizes.

In this work, we propose a hand detector that exploits skin modeling for fast hand proposal generation and Convolutional Neural Networks for hand recognition. We tested our method on the UNIGE-HANDS dataset [12] and show that the proposed approach achieves competitive hand detection results.

3 Hand detection

Our hand detector consists of a three-task architecture outlined in Fig. 2. We first detect regions containing skin pixels. Next, we generate a set of hand proposals using these regions. Finally, we classify the hand proposals using a Convolutional Neural Network (CNN).

Figure 2: Outline of the proposed method for hand detection.

Figure 3: Example of a hand proposal generation over a skin region containing two pixel-connected arms. Panels: (a) overlay of detected skin pixels; (b) skin binary mask; (c) watershed operation; (d) k-means lines; (e) hand contour cut. See text for detailed description.

Skin detection. For this task, we use the pixel-level skin detection (PERPIX) method introduced in [7]. The PERPIX method models skin pixels by combining color (RGB, HSV, and LAB), texture (SIFT, ORB), and histogram features (Gabor filters).

Hand proposal generation. In order to generate hand proposals, we determine whether each estimated skin-region in an image contains two pixel-connected arms. For instance, Fig. 3(a) shows a case where the arms are joined to each other and considered as one skin-region. First, we fit a straight line to the points on the boundary of the skin-region, as depicted in Fig. 3(b). Next, if the mean squared error of the fit is greater than a fixed threshold, the skin-region is considered a two-arms region.
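The two-arms test described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses an orthogonal least-squares fit (the text does not say which kind of fit is used for this particular step), and the function names and the threshold value are hypothetical rather than taken from the original implementation.

```python
import math


def fit_line_orthogonal(points):
    """Fit a line to 2-D points by orthogonal least squares.

    Returns (centroid, unit direction, mean squared orthogonal distance).
    """
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    sxx = sum((x - cx) ** 2 for x, _ in points) / n
    syy = sum((y - cy) ** 2 for _, y in points) / n
    sxy = sum((x - cx) * (y - cy) for x, y in points) / n
    # Angle of the principal axis of the 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    direction = (math.cos(theta), math.sin(theta))
    # The smaller eigenvalue equals the mean squared distance to the line.
    mse = 0.5 * (sxx + syy) - math.hypot(0.5 * (sxx - syy), sxy)
    return (cx, cy), direction, max(mse, 0.0)


def is_two_arms_region(boundary_points, mse_threshold):
    """Flag a skin region whose boundary is poorly explained by one line."""
    _, _, mse = fit_line_orthogonal(boundary_points)
    return mse > mse_threshold


# A straight strip fits one line well; a V-shaped boundary does not.
strip = [(float(t), 0.0) for t in range(21)]
vee = [(float(t), float(s * t)) for t in range(21) for s in (-1, 1)]
print(is_two_arms_region(strip, mse_threshold=5.0))
print(is_two_arms_region(vee, mse_threshold=5.0))
```

In practice the threshold would have to be tuned to the image resolution, since the error is measured in squared pixels.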

A two-arms region is split in two by applying a soft segmentation. The first step is to apply the k-means lines algorithm over the contour points of the skin blob. Since each line represents an arm, k is set to 2. Moreover, the line fitted at each iteration is the medial-axis line, obtained using orthogonal least squares. The second step is to perform a watershed transformation over the skin blob. The result of this operation is a set of smaller sub-blobs that have soft boundaries, as seen in Fig. 3(c). The last step is to assign each sub-blob to the closest line. This is achieved by computing the smallest distance between each sub-blob centroid and the lines, as shown in Fig. 3(d).
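A minimal sketch of the line-clustering and assignment steps is given below. It alternates between fitting one orthogonal least-squares line per cluster and reassigning each contour point to its nearest line, then assigns sub-blob centroids in the same way. The watershed step is not reproduced; sub-blob centroids are assumed to be given. The initial labelling, the handling of degenerate clusters, and all function names are assumptions made for illustration.

```python
import math


def fit_line(points):
    """Orthogonal least-squares line: returns (centroid, unit direction)."""
    n = len(points)
    cx = sum(p[0] for p in points) / n
    cy = sum(p[1] for p in points) / n
    sxx = sum((p[0] - cx) ** 2 for p in points)
    syy = sum((p[1] - cy) ** 2 for p in points)
    sxy = sum((p[0] - cx) * (p[1] - cy) for p in points)
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return (cx, cy), (math.cos(theta), math.sin(theta))


def distance_to_line(point, line):
    """Perpendicular distance from a point to a (centroid, direction) line."""
    (cx, cy), (dx, dy) = line
    return abs((point[0] - cx) * dy - (point[1] - cy) * dx)


def nearest_line(point, lines):
    """Index of the line closest to the point."""
    return min(range(len(lines)), key=lambda i: distance_to_line(point, lines[i]))


def k_means_lines(points, initial_labels, k=2, max_iter=50):
    """Alternate line fitting and point reassignment until labels are stable."""
    labels = list(initial_labels)
    lines = []
    for _ in range(max_iter):
        lines = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # Assumption: a cluster with fewer than two points is refitted
            # on all points so that it can recover in later iterations.
            lines.append(fit_line(members if len(members) >= 2 else points))
        new_labels = [nearest_line(p, lines) for p in points]
        if new_labels == labels:
            break
        labels = new_labels
    return lines, labels


# Two synthetic "arms" and a slightly wrong initial labelling.
arm_a = [(float(t), 2.0 * t) for t in range(11)]
arm_b = [(30.0 - t, 2.0 * t) for t in range(11)]
lines, labels = k_means_lines(arm_a + arm_b, [0] * 10 + [1] * 12)
# Assign hypothetical sub-blob centroids to the closest line.
print([nearest_line(c, lines) for c in [(4.0, 9.0), (26.0, 9.0)]])
```

Like ordinary k-means, this procedure only reaches a local optimum, so the result depends on the initial labelling.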

Once all resulting blobs are considered one-arm regions, the hand proposals are extracted as follows. First, a rectangular convex-hull is calculated for each one-arm region. For example, extracted one-arm regions and their corresponding convex-hulls are shown in green and blue, respectively, in Fig. 3(d). Furthermore, in order to extract a hand from the convex-hull, we calculate a line representing its wrist. We assume that the hand is located on the side of the box closer to the center of the frame. As a result, we estimate the location of the wrist with respect to that side of the box. Second, a medial-axis line crossing the largest side of the convex-hull is computed. The wrist line intersects the medial-axis line perpendicularly and is set at a fixed distance from the side closest to the center of the frame. Fig. 3(d) shows the medial-axis and the wrist lines in yellow and cyan, respectively. Finally, the hand proposals are obtained by cutting the one-arm regions using the wrist line. For instance, hand proposal boxes appear in red in Fig. 3(d). More hand boxes can be proposed using different distances to the wrist.
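The wrist cut can be illustrated with the simplified sketch below. It is an approximation under stated assumptions, not the authors' implementation: the medial axis is estimated by an orthogonal least-squares fit to the region pixels, the extent of the region is measured by projecting pixels onto that axis instead of building the rectangular hull explicitly, and the returned proposal is an axis-aligned box. The function name and the parameters `frame_center` and `wrist_distance` are hypothetical.

```python
import math


def hand_box(region_points, frame_center, wrist_distance):
    """Cut a one-arm region at a wrist line and return (x1, y1, x2, y2)."""
    n = len(region_points)
    cx = sum(x for x, _ in region_points) / n
    cy = sum(y for _, y in region_points) / n
    sxx = sum((x - cx) ** 2 for x, _ in region_points)
    syy = sum((y - cy) ** 2 for _, y in region_points)
    sxy = sum((x - cx) * (y - cy) for x, y in region_points)
    # Direction of the medial axis (orthogonal least squares).
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    dx, dy = math.cos(theta), math.sin(theta)
    # Signed position of every pixel along the medial axis.
    proj = [(x - cx) * dx + (y - cy) * dy for x, y in region_points]
    lo, hi = min(proj), max(proj)
    # The two ends of the region along the axis.
    end_lo = (cx + lo * dx, cy + lo * dy)
    end_hi = (cx + hi * dx, cy + hi * dy)
    d_lo = math.hypot(end_lo[0] - frame_center[0], end_lo[1] - frame_center[1])
    d_hi = math.hypot(end_hi[0] - frame_center[0], end_hi[1] - frame_center[1])
    # The hand is assumed to lie at the end closer to the frame center; the
    # wrist line is perpendicular to the axis, wrist_distance from that end.
    if d_lo <= d_hi:
        keep = [p for p, s in zip(region_points, proj) if s <= lo + wrist_distance]
    else:
        keep = [p for p, s in zip(region_points, proj) if s >= hi - wrist_distance]
    xs = [x for x, _ in keep]
    ys = [y for _, y in keep]
    return min(xs), min(ys), max(xs), max(ys)


# A vertical arm entering a 640x480 frame from below.
arm = [(x, y) for x in range(100, 110) for y in range(200, 400)]
print(hand_box(arm, frame_center=(320, 240), wrist_distance=50.5))
```

Calling the function with several values of `wrist_distance` yields several proposals per region, in line with the last sentence of the paragraph above.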

Figure 4: Summary of the datasets used for training and validation. Histogram by dataset of the number of images and bounding boxes. Note the scale change on the vertical axis.

Hand recognition. To classify a hand proposal we created a binary classifier by fine-tuning the CaffeNet network [14] pre-trained on ImageNet [15].

4 Experimental results

We describe the training and testing datasets in section 4.1, and detail the skin and hand detection training in section 4.2. We then present the experimental results on the skin and hand detection tasks in section 4.3.

4.1 Datasets

Our experiments were done using the UNIGE-HANDS dataset [12]. This dataset consists of 25 videos (292,461 images) captured by a single person using a head-mounted camera. The labels provided indicate whether or not arms appear in each frame. The videos were filmed in 5 different settings: office, street, bench, kitchen, and coffee bar. Each setting has 4 training videos and 1 testing video. Half of the training videos show the user's arms, while the other half show only the setting. In the testing videos, the user's arms appear half of the time.

The reported results on skin detection were obtained on the same fixed split used by Betancourt et al. [12]. Additionally, the evaluation on the hand detection task was done on a subset of 2,000 manually annotated images. Of these, 1,000 images contain hands, with 1,739 hands in total.

In order to train our binary hand classifier, we combined several datasets containing bounding boxes of hands [13, 16, 17, 18] as positive examples, and faces [19] and different categories [15] as negative examples. We also considered including other hand datasets, but some of them treat the forearm as part of the hand [8, 7], or lack hand annotations [4]. The number of images and bounding boxes per dataset is shown as a histogram in Fig. 4. The total number of images and bounding boxes is 761,946 and 872,414, respectively.

Figure 5: Detection results on the UNIGE-HANDS test set for different values of the intersection over union (IoU) ratios.

4.2 Training

We trained the PERPIX model using one training video for each setting category, i.e., we used only 5 videos for training. For each selected video, we uniformly sampled 30 and 150 frames as the input for two training models. All the user skin regions in these frames were annotated and segmented (see footnote 1).

The binary hand classifier network was created by fine-tuning the CaffeNet network [14]. It was fine-tuned for 20,000 iterations using Stochastic Gradient Descent, with a learning rate , a momentum , and weight decay equal to .

Footnote 1: The annotations for skin detection training and hand detection evaluation are publicly available at
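The frame sampling used to build the PERPIX training sets can be sketched as follows. The sketch assumes that "uniformly sampled" means evenly spaced frames (it could also mean uniform random sampling); the function name and the video length in the example are hypothetical.

```python
def uniform_frame_indices(num_frames, num_samples):
    """Return num_samples evenly spaced frame indices in [0, num_frames)."""
    if not 0 < num_samples <= num_frames:
        raise ValueError("num_samples must be between 1 and num_frames")
    step = num_frames / num_samples
    # Take the frame at the middle of each of the num_samples equal segments.
    return [int(i * step + step / 2) for i in range(num_samples)]


# Hypothetical 9,000-frame video, sampled for the two training models.
small = uniform_frame_indices(9000, 30)
large = uniform_frame_indices(9000, 150)
# With one video per setting (5 videos), this gives 5 * 30 = 150 and
# 5 * 150 = 750 annotated frames in total.
print(len(small) * 5, len(large) * 5)
```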

4.3 Skin and hand detection

We compared the skin detection performance on the UNIGE-HANDS dataset with the HOG-SVM and DBN methods originally designed for it [11], as shown in Table 1. The PERPIX method offers competitive results using less training data: we used only 150 and 750 frames showing hands, whereas the results presented in [11] used a total of 4,439 frames. The results on the hand detection task were evaluated using precision/recall curves for 4 distinct values of intersection over union (IoU), as illustrated in Fig. 5. The average precision using the PASCAL VOC criteria was 20.01%.
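For readers unfamiliar with the evaluation, the sketch below shows how IoU and average precision are typically computed for box detections. It is an illustration only: the text does not state which variant of the PASCAL VOC average precision was used, so the sketch assumes the area under the monotonically decreasing precision envelope and a default IoU threshold of 0.5. The function names and data layout are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union


def average_precision(detections, ground_truth, iou_threshold=0.5):
    """detections: list of (image_id, score, box).
    ground_truth: dict mapping image_id to a list of boxes."""
    total_gt = sum(len(boxes) for boxes in ground_truth.values())
    if total_gt == 0:
        return 0.0
    matched = {img: [False] * len(b) for img, b in ground_truth.items()}
    hits = []
    # Visit detections from the most to the least confident.
    for img, _, box in sorted(detections, key=lambda d: -d[1]):
        best, best_j = 0.0, -1
        for j, gt_box in enumerate(ground_truth.get(img, [])):
            overlap = iou(box, gt_box)
            if overlap > best:
                best, best_j = overlap, j
        # A ground-truth box can be matched by at most one detection.
        if best_j >= 0 and best >= iou_threshold and not matched[img][best_j]:
            matched[img][best_j] = True
            hits.append(1)
        else:
            hits.append(0)
    precisions, recalls, tp = [], [], 0
    for rank, hit in enumerate(hits, start=1):
        tp += hit
        precisions.append(tp / rank)
        recalls.append(tp / total_gt)
    # Make precision monotonically non-increasing, then integrate over recall.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, previous_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - previous_recall) * p
        previous_recall = r
    return ap


gt = {"a": [(0, 0, 10, 10)], "b": [(0, 0, 10, 10)]}
dets = [("a", 0.9, (0, 0, 10, 10)), ("a", 0.8, (50, 50, 60, 60)), ("b", 0.7, (1, 0, 10, 10))]
print(average_precision(dets, gt))
```

Evaluating at several IoU ratios amounts to calling `average_precision` with different values of `iou_threshold`.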

            True Positive                              True Negative
            HOG-SVM  DBN    PERPIX@30   PERPIX@150     HOG-SVM  DBN    PERPIX@30   PERPIX@150
Office      0.893    0.965  0.973       0.953          0.929    0.952  0.986       0.981
Street      0.756    0.834  0.872       0.900          0.867    0.898  0.586       0.574
Bench       0.765    0.882  0.773       0.892          0.965    0.979  0.954       0.948
Kitchen     0.627    0.606  0.713       0.628          0.777    0.848  0.789       0.830
Coffee bar  0.817    0.874  0.996       0.991          0.653    0.660  0.632       0.688
Total       0.764    0.820  0.862       0.863          0.837    0.864  0.799       0.815
Table 1: Skin-segmentation performance comparison. The HOG-SVM and DBN results correspond to [11], and the PERPIX results were obtained using the per-pixel regression method [7] with 30 and 150 frames per setting video (PERPIX@30 and PERPIX@150).
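As a reminder of what the two groups of columns in Table 1 measure, the sketch below computes true positive and true negative rates from binary decisions. It assumes frame-level decisions compared against the dataset's per-frame arm labels; the function name and the toy data are hypothetical.

```python
def true_positive_and_negative_rates(predicted, actual):
    """Return (TPR, TNR) for two equally long sequences of 0/1 labels."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    tn = sum(1 for p, a in zip(predicted, actual) if not p and not a)
    positives = sum(1 for a in actual if a)
    negatives = len(actual) - positives
    tpr = tp / positives if positives else 0.0
    tnr = tn / negatives if negatives else 0.0
    return tpr, tnr


# Toy example: 5 frames, one false alarm on a frame without arms.
print(true_positive_and_negative_rates([1, 1, 0, 0, 1], [1, 0, 0, 0, 1]))
```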

5 Conclusions

We presented an egocentric hand detection method, which relies on skin modeling for fast hand proposal generation and a convolutional neural network for hand classification. We tested our method on the UNIGE-HANDS dataset and obtained an average precision of 0.216 when using the PASCAL VOC criteria. We showed that the proposed approach achieves competitive hand detection results. Future work will investigate how to incorporate hand detection into egocentric action recognition.


A.C. was supported by a doctoral fellowship from the Mexican Council of Science and Technology (CONACYT) (grant-no. 366596). This work was partially funded by TIN2015-66951-C2, SGR 1219, CERCA, ICREA Academia’2014 and 20141510 (Marató TV3). M.D. is grateful to the NVIDIA donation program for its support with a GPU card.


  • [1] Bolaños, M., Dimiccoli, M., Radeva, P.: Toward storytelling from visual lifelogging: An overview. IEEE Transactions on Human-Machine Systems 47 (2017) 77–90
  • [2] Karaman, S., Benois-Pineau, J., Mégret, R., Dovgalecs, V., Dartigues, J.F., Gaëstel, Y.: Human daily activities indexing in videos from wearable cameras for monitoring of patients with dementia diseases. In: Pattern Recognition (ICPR), 2010 20th International Conference on, IEEE (2010) 4113–4116
  • [3] Zariffa, J., Popovic, M.R.: Hand contour detection in wearable camera video using an adaptive histogram region of interest. Journal of NeuroEngineering and Rehabilitation 10 (2013) 1–10
  • [4] Rogez, G., Supancic, J.S., Ramanan, D.: Understanding Everyday Hands in Action from RGB-D Images. In: ICCV 2015 - IEEE International Conference on Computer Vision, Santiago, Chile (2015)
  • [5] Cartas, A., Marín, J., Radeva, P., Dimiccoli, M.: Recognizing activities of daily living from egocentric images. In: To appear in the proceedings of the Iberian Conference on Pattern Recognition and Image Analysis (IbPRIA). (2017)
  • [6] Nguyen, T.H.C., Nebel, J.C., Florez-Revuelta, F., et al.: Recognition of activities of daily living with egocentric vision: A review. Sensors 16 (2016) 72
  • [7] Li, C., Kitani, K.M.: Pixel-level hand detection in egocentric videos. In: Conference on Computer Vision and Pattern Recognition (CVPR), IEEE (2013) 3570–3577
  • [8] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference On, IEEE (2011) 3281–3288
  • [9] Serra, G., Camurri, M., Baraldi, L., Benedetti, M., Cucchiara, R.: Hand segmentation for gesture recognition in ego-vision. In: Proceedings of the 3rd ACM International Workshop on Interactive Multimedia on Mobile & Portable Devices. IMMPD ’13, New York, NY, USA, ACM (2013) 31–36
  • [10] Li, C., Kitani, K.: Model recommendation with virtual probes for egocentric hand detection. In: Proceedings of the IEEE International Conference on Computer Vision. (2013) 2624–2631
  • [11] Betancourt, A., Lopez, M., Regazzoni, C.S., Rauterberg, M.: A Sequential Classifier for Hand Detection in the Framework of Egocentric Vision. In: Conference on Computer Vision and Pattern Recognition. Volume 1., Columbus, Ohio, IEEE Computer Society (2014)
  • [12] Betancourt, A., Morerio, P., Barakova, E.I., Marcenaro, L., Rauterberg, M., Regazzoni, C.S.: A Dynamic Approach and a New Dataset for Hand-Detection in First Person Vision. In: International Conference on Computer Analysis of Images and Patterns, Malta (2015)
  • [13] Bambach, S., Lee, S., Crandall, D., Yu, C.: Lending a hand: Detecting hands and recognizing activities in complex egocentric interactions. In: Computer Vision (ICCV), 2015 IEEE International Conference on, IEEE (2015)
  • [14] Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T.: Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093 (2014)
  • [15] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115 (2015) 211–252
  • [16] Everingham, M., Eslami, S.M.A., Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The pascal visual object classes challenge: A retrospective. International Journal of Computer Vision 111 (2014) 98–136
  • [17] Eshed Ohn-Bar, S.M., Mogelmose, A., Trivedi, M.M.: Vision for intelligent vehicles and applications (viva) workshop and challenge. Workshop and Challenge 13 (2015) 30–17
  • [18] Mittal, A., Zisserman, A., Torr, P.H.S.: Hand detection using multiple proposals. In: British Machine Vision Conference. (2011)
  • [19] Ng, H.W., Winkler, S.: A data-driven approach to cleaning large face datasets. In: 2014 IEEE International Conference on Image Processing (ICIP). (2014) 343–347