Unsupervised Understanding of Location and Illumination Changes in Egocentric Videos
Wearable cameras stand out as one of the most promising devices for the coming years and, as a consequence, the demand for computer algorithms that automatically understand the videos recorded with them is increasing quickly. Automatic understanding of these videos is not an easy task, and their mobile nature implies important challenges, such as changing light conditions and unrestricted recording locations. This paper proposes an unsupervised strategy based on global features and manifold learning to endow wearable cameras with contextual information regarding the light conditions and the location captured. Results show that non-linear manifold methods can capture contextual patterns from global features without requiring large computational resources. As an application case, the proposed strategy is used as a switching mechanism to improve hand detection in egocentric videos.
Keywords: Machine Learning, Unsupervised Learning, Egocentric Videos, First Person Vision, Wearable Camera
The emergence of wearable video devices such as action cameras, smart glasses and low-temporal life-logging cameras has sparked a recent trend in computer science known as First Person Vision (FPV) or Egovision. The 90’s idea of a wearable device with autonomous processing capabilities is nowadays possible and is considered one of the most relevant technological trends of recent years Betancourt2014 (). The ubiquitous and personal nature of these devices opens the door to critical applications such as Activity Recognition Nguyen2016 (); Zhan2015 (), User-Machine Interaction Baraldi2015 (), Ambient Assisted Living Fathi2011 (); Pirsiavash2012 (); NataliaDiaz2014 (), Augmented Memory Farringdon2000 (); Harvey2016 () and Blind Navigation Balakrishnan2007 (), among others.
One of the key features of wearable cameras is their capability to move across different locations and record exactly what the user is looking at. This unrestricted video perspective requires existing methods to perform well across an unknown number of locations and under changing light conditions. A common way to deal with this problem is to predefine a particular application or location and bound the algorithms accordingly. This is the case of the gesture recognition for virtual museums proposed in Baraldi2015 () or the activity recognition methods based on the kitchen dataset Fathi2011 (); Fathi2011a (). Another way to alleviate the large number of recorded locations is to exhaustively label the recorded locations and objects, as is done in Pirsiavash2012 () to detect daily activities. The authors in Li2013b () use global histograms of color to reduce the effect of light changes in a color-based hand-segmenter.
The approach of Li2013b () shows that contextual information, such as light conditions, is a valuable source of information that can be used to improve the performance and applicability of current FPV methods. This idea is also applicable to other FPV related functionalities such as activity recognition, in which a device that can understand the user’s location can easily reduce the number of possible activities and take more accurate decisions. Pervasive computing refers to devices that can modify their behavior based on contextual variables as context-aware devices Lara2013 (), and their benefits are widely explored, for example in assisted living Riboni2011 () and anomaly detection Zhu2013 ().
This paper is motivated by the potential impact of contextual information, such as light conditions and location, on different FPV methods. The strategy presented is a first step towards our vision of a device that can understand the environment of the user and modify its behavior accordingly. The proposed approach understands the contextual information in which the user is involved as a set of different characteristics that can point to previously recorded conditions, and not as a scene classification problem based on manual labels assigned to particular locations (e.g., kitchen, office, street). In this way, this study devises an unsupervised procedure for wearable cameras to switch between different models or search spaces according to the light conditions or location in which the user is involved. Figure 1 summarizes our approach.
From Figure 1 it is clear that the transition from the global features to the unsupervised layer can be seen as a dimensional reduction from the global feature space (high dimensional space) to a simplified low dimensional space (intrinsic dimension). The latter provides an unsupervised location map to be used later to switch between different behaviours at different hierarchical levels. These dimensional reductions are known as manifold methods, and their capabilities to capture complex patterns are defined by their algorithmic and/or theoretic formulation Friedman1997 ().
Regarding the global features to be used, relevant information can be obtained from recent advances in FPV Betancourt2014 () and scene recognition Zhou2014 (); Oliva2001 (). Given the restricted computational resources of wearable devices, we use computationally efficient features such as color histograms and GIST descriptors. However, the proposed approach can be extended with more complex data such as deep features Zhu2013 (). In that case three important issues must be considered: i) the computational cost will restrict the applicability in wearable devices; ii) it will require large amounts of training videos and manual labels; iii) the use of existing “pre-trained” neural architectures compromises the unsupervised nature of our approach.
The novelties of this paper are threefold: i) It evaluates the capability of different linear and non-linear manifold methods, namely Principal Component Analysis (PCA), Isometric Mapping (Isomaps), Self Organizing Maps (SOM) and Growing Neural Gas (GNG), to capture light/location patterns from different global features without using manual labels. ii) It analyzes, following a feature selection procedure, the most discriminative components of the selected global features. iii) As an application case, the proposed unsupervised strategy is used to improve the hand-detection problem in FPV. The hand-detection problem is used as an example because of its impact on context-aware devices in hand-based methods, and because it allows us to illustrate the role of the unsupervised layer and its contribution to the final hand-detection performance. The use of the same strategy at higher inference levels such as hand-segmentation or hand-tracking is left as future research.
The remainder of this paper is organized as follows: Section 2 summarizes some recent strategies to automatically understand contextual information. Section 3 introduces our methodological approach, summing up the selected features, the different manifold methods and some common unsupervised evaluation procedures. In Section 4 the manifold methods are trained, and their capability to capture light/location patterns is evaluated in a post-learning strategy using the manual labels of two public FPV datasets. Section 5 illustrates the use of the best performing manifold method to improve the hand-detection rate in FPV. Finally, Section 6 concludes and provides some future research lines.
2 State of the Art
In recent years, FPV video analysis has attracted the interest of researchers due to the increasing availability of wearable devices that can record what the user is looking at, and promising applications are emerging. Existing literature and commercial approaches highlight a broad range of possibilities, but also point to several challenges to be faced, such as uncontrolled locations, illumination changes, camera motion, object occlusions and processing capabilities, among others Betancourt2014 (). This paper addresses the issue of illumination changes as well as unrestricted locations recorded by the camera. The general idea is to develop an unsupervised layer that, based on global features and using low computational resources, understands contextual information regarding the light conditions and the locations recorded by the camera.
The advantages of a device that can understand the environment are evident Starner1998 (); Zhu2011 (). Recent advances in pervasive computing and wearable devices frequently point at the location of the user as a valuable information source to design context-aware systems Riboni2011 (); Lara2013 (); Wang2011 (). An intuitive way to find the location is to use Global Positioning Systems (GPS). However, this approach is commonly restricted by the battery life as well as by poor indoor signal Hori2003 ().
To alleviate these restrictions, wearable cameras emerge as a possible solution: infer the context using the recorded frames. As an example, in Templeman2014 () local and global features are combined to identify private locations and avoid recording them. In fact, the idea pursued by the authors is in line with the seminal works on scene recognition proposed by Oliva and Torralba, on which scenes captured by static cameras are represented as low dimensional vectors known as GIST descriptors Oliva2005 (); Oliva2001 () and classified in a supervised way. Recent advances in scene recognition made by the same authors by exploiting the hidden layers of deep networks (deep features) are promising Zhou2014 (). However, their applicability on wearable devices is still restricted by the required computational resources and by the unavailability of large datasets recorded with wearable cameras.
Similar applications following an unsupervised strategy are common in robotics, in which manifold algorithms like SOM or Neural Gas are frequently used in autonomous navigation systems Puliti2003 (); Barakova2005 (); Barakova2005a (). Regarding FPV, the authors in Li2013b () propose a multi-model recommendation system for hand-segmentation in egocentric videos that modifies its internal behaviour based on the recorded light conditions. In their paper, the authors design a performance matrix containing one row per training frame and one column per model. The matrix values are the segmentation scores and are used to decide the most suitable model for each frame in the testing dataset.
The proposed method is motivated by the switching mechanism developed by Li2013b (); however, it is independent of the segmentation dataset and can extract information about the light conditions as well as the recorded location. With respect to the scene-recognition literature, our approach is fully unsupervised and is based on computationally efficient global features, which makes it feasible to use on wearable cameras.
3 Unsupervised method
As explained in previous sections one of our goals is to quantify the capability of different unsupervised manifold methods to capture the illumination and location changes in egocentric videos. Our approach follows the experimental findings of previous works, on which global features such as color histograms and GIST are used to describe the general characteristics of the scene Li2013b (); Oliva2001 (). Figure 2 summarizes our approach. Feature extraction and unsupervised training modules can be found in the left part of the picture, while the right part shows the post-learning evaluation. Manual labels are used in the shaded blocks of the diagram only. The remainder of this section introduces the datasets, motivates the global features and manifold methods, and concludes explaining the hyperparameter selection and the post-learning analysis.
The comparison of the manifold methods uses two popular FPV datasets, namely EDSH and UNIGE-HANDS. The main criteria for the dataset selection are the number of locations, the existence of labels, and the illumination changes contained. To the best of our knowledge, these datasets are commonly used to compare hand-segmentation algorithms in FPV due to the challenging light conditions intentionally included in the dataset design phase.
EDSH: Dataset proposed by Li2013b () to train a pixel-by-pixel Hand-Segmenter in FPV. The dataset contains different locations with changing light conditions recorded from a head-mounted camera with a resolution of at a speed of fps. The labels about location and light conditions are manually created. For the experimental results, EDSH1 video is used for training and EDSH2 video for testing. In total frames are used for training and for testing. Figure 3 shows the EDSH training and testing dataset composition according to the labels to be used in the Section 4.2.
UNIGE-HANDS: Dataset proposed by Betancourt2015a () as a baseline for the hand-detection problem in FPV. The dataset is recorded in different locations (1. Office, 2. Coffee Bar, 3. Kitchen, 4. Bench, 5. Street), and is recorded with a resolution of pixels and fps. The dataset provides the locations of the videos. Labels about indoor/outdoor information were manually created. In Section 4 the original training/testing split is used. In total frames are used for training and for testing. Figure 4 shows the UNIGE-HANDS training and testing dataset composition according to the labels to be used in Section 4.2.
3.2 Feature selection
To represent the scene context we use color histograms and GIST descriptors. These features are widely accepted and used in the FPV literature, and their computational cost makes them suitable for wearable devices with highly restricted processing capabilities and battery life. As explained before, more complex features such as deep features can be used under the same framework, but different issues must be addressed before they become applicable in practice. We point to deep features as promising future work.
Due to their straightforward computation and intuitive interpretation, color histograms are probably the most used features in image classification Kakumanu2007 (). The variety of color spaces such as RGB, HSV, YCbCr or LAB makes it possible to exploit color patterns while alleviating potential illumination issues. In particular, HSV is based on the way humans perceive colors, while LAB and YCbCr use one of the components for lightness and the remaining ones for the color intensity. In egocentric vision, Morerio2013 (); Betancourt2016 () use a mixture of color histograms and visual flow for hand-segmentation, while Baraldi2015 () combined HSV features, a Random Forest classifier and super-pixels for gesture recognition. Recently, Li and Kitani Li2013b () analyzed the discriminative power of different color histograms with a Random Forest regressor. Existing FPV literature commonly points to HSV as the best color space to face the changing light conditions in egocentric videos Morerio2013 (); Li2013b (). For the experimental results, we use color histograms of RGB, HSV, YCbCr and LAB.
Additionally, we use GIST Murphy2006 () as a global scale descriptor. It captures texture information, orientation and the coarse spatial layout of the image. GIST can be combined with other local descriptors to accurately detect objects in the scene, and was initially combined with a simple one-level classification tree, as well as with a naïve Bayesian classifier. The GIST descriptor has been successfully applied to large scale image retrieval and object recognition Oliva2001 ().
Finally, the experimental results analyze the discriminative power, regarding light and location, of the proposed global features under a feature selection procedure. The idea behind this experiment is to fuse the most discriminative components of each global feature to increase the contextual information available in the high-dimensional space and, as a consequence, improve the patterns captured by the manifold method. For this purpose, all the proposed global features are merged and used with a Random Forest to solve the classification problems explained in Section 4. The feature importance of the Random Forest is used to build a combined feature with the most discriminative components.
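As an illustration of the kind of color descriptor discussed above, the following minimal sketch builds a concatenated per-channel HSV histogram using only the Python standard library. The function name `hsv_histogram`, the number of bins and the toy frame are assumptions made for this example; the binning actually used in the experiments is not specified in the text.

```python
import colorsys


def hsv_histogram(pixels, bins=8):
    """Concatenated, normalized per-channel HSV histogram.

    pixels: iterable of (r, g, b) tuples with components in 0..255.
    bins:   bins per channel (illustrative value, not taken from the paper).
    """
    hist = [0.0] * (3 * bins)
    n = 0
    for r, g, b in pixels:
        hsv = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        for channel, value in enumerate(hsv):
            # Each HSV component lies in [0, 1]; a value of 1.0 goes to the last bin.
            idx = min(int(value * bins), bins - 1)
            hist[channel * bins + idx] += 1.0
        n += 1
    if n == 0:
        return hist
    # Normalize so that each channel's histogram sums to 1.
    return [count / n for count in hist]


# Toy "frame" with four pixels: red, green, blue and white.
frame = [(255, 0, 0), (0, 255, 0), (0, 0, 255), (255, 255, 255)]
descriptor = hsv_histogram(frame, bins=4)
```

The resulting vector plays the role of the global feature: one such descriptor per frame is what a manifold method would receive as input.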
3.3 Manifold learning
Manifold methods are mathematical or algorithmic procedures designed to move from a high dimensional space to a low dimensional one while preserving the most valuable information Friedman1997 (). Manifold methods are widely used and their applicability is fully validated in several fields such as robotics Puliti2003 (); Barakova2005 (); Barakova2005a (), crowd analysis Chiappino2014a (); morerio2012people () and speech recognition Arous2010a (), among others.
In general, the capability of manifold methods to deal with complex data is defined by their mathematical formulations and assumptions. Manifold methods are usually grouped according to two factors: i) If the dimensional mapping uses manual labels, then the method is supervised; otherwise, it is unsupervised. As an example, Linear Discriminant Analysis (LDA) and Principal Component Analysis (PCA) are supervised and unsupervised, respectively. ii) If the intrinsic dimensions are linear combinations of the original space then it is linear; otherwise, it is non-linear. As an example, PCA is linear, and SOM is non-linear. Given the final objective of this paper, the remaining part does not consider supervised approaches such as LDA.
To find a well-performing dimensional mapping, we use as a baseline the Principal Component Analysis (PCA) algorithm, which is the most common linear manifold algorithm but usually fails to capture patterns in complex datasets. To capture complex patterns we use three non-linear manifold methods, namely Isomaps, SOM and GNG. These non-linear algorithms were chosen based on the advantages reported in previous studies Tenenbaum00 (); Kohonen1990 (); Florez2002 (), and on their capability to be applied to new observations not included in the training data. In our exploratory analysis t-SNE was also used; however, its original formulation cannot be applied to data outside of the training dataset. Regarding SOM and GNG, this study is based on the original formulations to keep the interpretation and analysis of the results simple.
Principal Components Analysis: a linear technique to reduce data dimensionality by transforming the original data into a new set of variables that summarize the original data Tenenbaum00 (). The new variables are the principal components (PCs), which are uncorrelated and ordered such that the $k$-th PC has the $k$-th largest variance among all PCs and is orthogonal to the first $k-1$ PCs. The first few PCs capture the main variations in the dataset, while the last PCs capture the residual “noise” in the data.
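To make the ordering of the components concrete, the sketch below estimates only the first principal component by power iteration on the sample covariance matrix. It is a didactic, standard-library approximation under stated assumptions (the function name, the fixed iteration count and the all-ones starting vector are choices made here), not the implementation used in the experiments.

```python
import math


def first_principal_component(data, iterations=100):
    """Return (direction, variance) of the first PC via power iteration."""
    n, d = len(data), len(data[0])
    mean = [sum(row[j] for row in data) / n for j in range(d)]
    centered = [[row[j] - mean[j] for j in range(d)] for row in data]
    # Sample covariance matrix (d x d).
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Assumption: the all-ones start is not orthogonal to the leading eigenvector.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    # Variance captured along v (Rayleigh quotient, since v has unit length).
    variance = sum(v[a] * sum(cov[a][b] * v[b] for b in range(d)) for a in range(d))
    return v, variance


# Toy data lying exactly on the line y = 2x.
points = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
direction, variance = first_principal_component(points)
```

For this toy data all the variance lies along the direction (1, 2), so a single component summarizes the dataset without loss.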
Isomaps: a non-linear dimensionality reduction algorithm proposed in Tenenbaum00 () that learns the underlying global geometry of a dataset using local distances between the observations. In comparison with classical linear techniques, Isomaps can handle complex non-linear patterns such as those in human handwriting or face recognition in images. Isomaps combine the major algorithmic features of PCA and multidimensional scaling (MDS): computational efficiency, global optimality, and asymptotic convergence, which makes their use in wearable cameras feasible. The hyperparameter of Isomaps is the number of neighbors Jing2011 ().
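The idea of recovering global geometry from local distances can be sketched as follows. The example covers only the first two steps of Isomap (a symmetric nearest-neighbor graph followed by all-pairs shortest paths); the final embedding step, classical MDS on the geodesic distances, is omitted. The function name and the half-circle toy data are assumptions made for illustration.

```python
import math


def geodesic_distances(points, n_neighbors):
    """Approximate geodesic distances through a k-nearest-neighbor graph."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    inf = float("inf")
    graph = [[inf] * n for _ in range(n)]
    for i in range(n):
        graph[i][i] = 0.0
        others = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        for j in others[:n_neighbors]:
            # Keep the graph symmetric: an edge exists if either point selects the other.
            graph[i][j] = dist[i][j]
            graph[j][i] = dist[i][j]
    # Floyd-Warshall all-pairs shortest paths over the neighbor graph.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                via = graph[i][k] + graph[k][j]
                if via < graph[i][j]:
                    graph[i][j] = via
    return graph


# Eleven points on a unit half circle: the straight-line distance between the
# two endpoints is 2, while the distance along the curve is close to pi.
curve = [(math.cos(math.pi * t / 10), math.sin(math.pi * t / 10)) for t in range(11)]
geo = geodesic_distances(curve, n_neighbors=2)
endpoint_geodesic = geo[0][-1]
endpoint_euclidean = math.dist(curve[0], curve[-1])
```

The number of neighbors controls how local the graph is, which is why it is the hyperparameter to be tuned.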
Self Organizing Maps (SOM): one of the most popular unsupervised neural networks. It was originally proposed to visualize high dimensional datasets SOM () and easily find relevant information SOM01 () in them. In summary, the SOM is a two layer neural network that learns a non-linear projection of a high dimensional space (input layer) to a regular discrete low-dimensional grid of neural units (output layer). The discrete nature of the output layer facilitates the visualization of learned patterns and makes it easy to find topological relations in the data.
The training phase of SOM relies on a competitive iterative process with a neighborhood function that acts as a smoothing kernel over the output layer SOM (). Typically, for each training sample, the best matching unit (BMU) is selected by using the Euclidean distance, and then its local neighborhood is updated to make it slightly more similar to the training sample. The neighborhood definition depends on the output layer. In our case, we use a regular quadrangular grid, but future improvements can be achieved by using more complex topologies such as toroidal or spherical grids Mount2011 (). The hyperparameter of SOM is the number of output neurons. In the experimental section, the neuron weights are initialized by using PCA.
Growing Neural Gas (GNG): a common way to avoid the hyperparameter selection of SOM is to use growing structures that incrementally increase the number of neural units depending on the topology of the input data. GNG is an iterative algorithm that approximates the topology of a multidimensional dataset by using a changing number of neural units represented as a graph. In its most general form, the algorithm sequentially grows the nodes and adjusts the graph to the input data. In this way, each node of the graph is assigned a neural weight in the input space, and the algorithm sequentially adds and/or removes nodes based on cumulative error measurements between the nodes and the data Fritzke1995 (); Martinetz1991 (). An important aspect of the GNG is the position of the first two nodes. In the experimental section, the first nodes are randomly located in the input space. Additionally, the GNG maximum number of neurons is defined as and for the sake of a fair comparison with and , respectively.
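A single step of the competitive process described above could look like the following sketch. It assumes a Gaussian neighborhood kernel over grid coordinates and represents the map as a dictionary from `(row, col)` to a weight vector; both choices, as well as the function names and the hand-placed initial weights, are illustrative and not taken from the original formulation used in the experiments (which initializes the weights with PCA).

```python
import math


def best_matching_unit(weights, sample):
    """Grid position of the neuron whose weight is closest to the sample."""
    return min(weights, key=lambda rc: math.dist(weights[rc], sample))


def som_update(weights, sample, learning_rate, radius):
    """Move the BMU and its grid neighbors towards the sample; return the BMU."""
    bmu = best_matching_unit(weights, sample)
    for (row, col), w in list(weights.items()):
        grid_dist2 = (row - bmu[0]) ** 2 + (col - bmu[1]) ** 2
        # Gaussian smoothing kernel over the output grid (assumed form).
        h = math.exp(-grid_dist2 / (2.0 * radius ** 2))
        weights[(row, col)] = [wi + learning_rate * h * (xi - wi)
                               for wi, xi in zip(w, sample)]
    return bmu


# 2x2 quadrangular grid with hand-placed weights, for illustration only.
som = {(0, 0): [0.0, 0.0], (0, 1): [0.0, 1.0],
       (1, 0): [1.0, 0.0], (1, 1): [1.0, 1.0]}
winner = som_update(som, [0.9, 0.8], learning_rate=0.5, radius=1.0)
```

In a full training loop the learning rate and the radius would typically decrease over time, so that the map first organizes globally and then refines locally.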
For our particular interests, GNG and SOM play a similar role, and their usage in the global framework is the same; however, the predefined topology of SOM simplifies the understanding and visualization of the patterns captured by the algorithm in the application case.
3.4 Hyperparameters, classification rules, and post-learning evaluation
When evaluating manifold methods the most challenging part is to quantify if the patterns learned are modified by the phenomena under study. Previous studies usually follow two different strategies: the first one quantifies the information lost when moving the training dataset from the original space to the intrinsic dimension Saxena2004 (). The second strategy uses the manual labels or human knowledge to analyze the intrinsic dimension (output space) in a post-learning analysis Jing2011 ().
In our case, the information strategy is used to define the hyperparameters of the Isomap and the SOM. In particular, we use the reconstruction error to select the number of neighbors of the Isomaps as proposed in Saxena2004 (), and the Topological Conservation Quality (TCQ) to define the number of output neurons of SOM Arous2010a (). In the particular case of SOM, the TCQ is selected to include in the analysis the concept of temporal continuity preservation; however, a similar analysis can be obtained by using alternative evaluation criteria such as the topographic product Bauer1992 (), or the topographic function Villmann1997 (). In general, the TCQ measures the number of times that the SOM transformation breaks a contiguity in the input data. In the input space, we define as contiguous two consecutive frames. In the output space two neurons are contiguous if they share one border. Formally, the TCQ is defined as (1), where is the number of training samples and if the two closest neurons of an input vector are contiguous in the output space, and otherwise.
Once the hyperparameters are defined, a post-learning analysis is done by using the manual labels to quantify the performance of the proposed manifold methods. For this purpose, each manifold method is trained on each global feature and dataset. Then a classification analysis is performed using the manual labels and defining as reference scores two popular supervised classifiers, namely Support Vector Machine (SVM) and Random Forest (RF). It is noteworthy that the supervised classifiers are in a favored position because they are theoretically developed to exploit the differences among manual labels; however, the closer the score of the manifold methods to the classifiers' score, the more related the learned patterns are to the phenomena measured by the manual labels.
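The verbal definition of the TCQ can be sketched in code under explicit assumptions. The example below takes the score to be the fraction of training samples whose two closest neurons share a border on the grid (4-neighborhood); the normalization by the number of samples and the function name `tcq` are assumptions made here, since the exact expression is not reproduced in the text.

```python
import math


def tcq(weights, samples):
    """Fraction of samples whose two closest neurons are contiguous on the grid.

    weights: dict mapping (row, col) grid positions to weight vectors.
    Assumption: the score is normalized by the number of samples.
    """
    contiguous = 0
    for x in samples:
        ranked = sorted(weights, key=lambda rc: math.dist(weights[rc], x))
        (r1, c1), (r2, c2) = ranked[0], ranked[1]
        # Two neurons share a border when they differ by one step in one axis.
        if abs(r1 - r2) + abs(c1 - c2) == 1:
            contiguous += 1
    return contiguous / len(samples)


samples = [[0.4], [1.6]]
# Ordered 1x3 map: neighboring neurons have neighboring weights.
ordered_map = {(0, 0): [0.0], (0, 1): [1.0], (0, 2): [2.0]}
# Folded map: the weights of the last two neurons are swapped.
folded_map = {(0, 0): [0.0], (0, 1): [2.0], (0, 2): [1.0]}
ordered_score = tcq(ordered_map, samples)
folded_score = tcq(folded_map, samples)
```

The ordered map preserves every contiguity while the folded map breaks one of the two, which is the kind of difference this measure is meant to expose when comparing SOM sizes.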
To use the manifold methods as classifiers, we use a majority voting rule in the output space (intrinsic dimension) using the training samples and their manual labels. For Isomaps and PCA, the majority voting rule is evaluated using the closest training frames in the output space. For SOM, the majority voting rule is evaluated on the training frames that activated the same output neuron of each testing sample.
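The two voting rules can be sketched as follows. The first function votes among the nearest training frames in the output space (the PCA and Isomap case), and the second votes among the training frames that activated the same neuron (the SOM case). Function names, the `default` argument for neurons without training hits, and the toy data are assumptions introduced for illustration.

```python
import math
from collections import Counter


def knn_vote(train_points, train_labels, query, k):
    """Majority label among the k training points closest to the query."""
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(train_points[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def neuron_vote(train_bmus, train_labels, query_bmu, default=None):
    """Majority label among training frames that activated the same neuron."""
    votes = Counter(label for bmu, label in zip(train_bmus, train_labels)
                    if bmu == query_bmu)
    # A neuron never activated during training has no votes (assumed fallback).
    return votes.most_common(1)[0][0] if votes else default


# Toy 2D output space for the PCA/Isomap rule.
points = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0], [1.0, 0.9]]
labels = ["indoor", "indoor", "outdoor", "outdoor", "outdoor"]
pred_a = knn_vote(points, labels, [0.95, 0.95], k=3)
pred_b = knn_vote(points, labels, [0.0, 0.1], k=1)

# Toy neuron activations for the SOM rule.
bmus = [(0, 0), (0, 0), (1, 1), (1, 1), (0, 0)]
places = ["kitchen", "kitchen", "street", "street", "office"]
pred_c = neuron_vote(bmus, places, (0, 0))
pred_d = neuron_vote(bmus, places, (2, 2))
```

Note that the manual labels enter only at this voting stage; the mapping itself is still learned without them.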
4 Experimental results
This section evaluates the capabilities of the proposed manifold methods to capture light changes and separate different locations using global features. In the first part of this section, we calibrate the hyperparameters of the Isomap and SOM while preserving the unsupervised nature of the training phase. Later, we use the manual labels to analyze the patterns learned under a classification approach Jing2011 (). Finally, the discriminative ranking learned by a Random Forest is used to analyze the most relevant dimensions of the proposed global features.
4.1 Defining the hyperparameters
To define the number of neighbors considered in Isomaps we use the reconstruction error, which is the amount of information lost when transforming a point from the original space (global feature) to the intrinsic dimension. Figure 5 shows the reconstruction error of the Isomap as the number of closest neighbors increases. Note that, for all the features, the reconstruction error starts stabilizing when the closest neighbors are used. Therefore, we use 2 as the parameter in the remaining part of the paper.
Regarding the number of output neurons of the SOM we use the TCQ, as defined in Section 3.4. Figure 6 shows the TCQ for different SOM sizes. Two findings are highlighted from the figure: i) A small number of neurons offers a topological advantage in the TCQ, because the fewer the output neurons to activate, the easier to preserve contiguities in the output space. ii) The TCQ starts stabilizing for large SOMs, around for EDSH and for UNIGE dataset. In the experimental results we use three SOM sizes: , and , denoted as , respectively.
4.2 Post-learning analysis
To evaluate the patterns found by the manifold methods we perform an exhaustive post-learning analysis under a classification framework using the manual labels and defining as reference scores the performance of an SVM (linear kernel) and an RF (10 decision trees with maximum depth 10). For this purpose we define two different classification problems: i) Discriminate between indoor and outdoor frames. ii) Classify the labeled locations given by the datasets (e.g., Kitchen, Office, Street, etc.).
Table 1 shows the percentage of testing data successfully classified by each method (columns) when using different features (rows). The table contains two horizontal groups, one for each classification problem. The first group shows the performance for the binary problem (indoor/outdoor), and the second group shows the strict multiclass match for the detailed locations. The first group of columns shows the unsupervised methods while the second group shows the results of the supervised classifiers. Note that, despite not using manual labels in the training phase, the performance of the unsupervised methods is close to that of their supervised counterparts, which validates the patterns learned and confirms the relationship between the proposed global features and the light/location conditions.
| Indoor and Outdoor | EDSH | RGB | 0.679 | 0.799 | 0.781 | 0.765 | 0.757 | 0.745 | 0.742 | 0.790 | 0.849 |
In particular, Table 1 shows that within the unsupervised techniques the large SOM and GNG perform the best. The small differences between the SOM and GNG performance can be explained by the initialization of the neurons and the algorithmic differences. The first neurons of the GNG are located randomly in the input space while the SOM initial weights are defined by using PCA. The table also shows valuable insights about the most discriminative features. The performance of the methods when HSV is used is noteworthy, particularly in the unsupervised approach. This fact confirms the intuition of previous works, in which the use of HSV leads to algorithmic improvements when used as a proxy for the light conditions. Regarding the datasets, it is possible to conclude that the EDSH dataset is the most challenging, especially for the location classification problem. Interestingly, in the indoor/outdoor problem of the EDSH dataset, GIST achieves a good performance, but it is outperformed by HSV in the remaining problems.
In more detail, Tables 1(a) and 1(b) show the confusion matrix of the and the Random Forest for the EDSH and the UNIGE dataset when the HSV color space is used. As expected from Table 1, the locations of the EDSH dataset are more challenging, which creates larger confusion levels. This is the case, for example, of “Stairs 1” frames, which are frequently confused with kitchen frames by both algorithms due to the presence of a similar floor and wall color in both locations. Regarding the UNIGE dataset, a good performance is obtained in all the locations, achieving values larger than for the unsupervised approach. The difference in the performances of both datasets shows the importance of having locations with enough data for a classification approach; however, it also allows us to conclude the existence of structural similarities in the colour configuration and light conditions of the frames labelled as “Kitchen” and “Stairs 1”.
Figure 7 shows the time required by different sizes of SOM, GNG and RF to transform a descriptor to the output space. The horizontal lines, from top to bottom, show the frequency required to achieve real-time performance on videos with 30, 50, and 60 frames per second, respectively. There is a computational advantage in the speed of GNG and RF; however, all of them are fast enough to process . The differences in performance can be a consequence of the particular implementations.
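The real-time criterion behind the horizontal lines can be made explicit with a small sketch: a method keeps up with a video at a given frame rate if transforming one descriptor takes no longer than the per-frame budget of 1/fps seconds (about 33.3 ms, 20 ms and 16.7 ms for 30, 50 and 60 fps). The helper names and the timing procedure are assumptions for illustration and do not reproduce the measurements of the figure.

```python
import time


def seconds_per_call(transform, descriptor, repeats=100):
    """Average wall-clock time of one call to transform(descriptor)."""
    start = time.perf_counter()
    for _ in range(repeats):
        transform(descriptor)
    return (time.perf_counter() - start) / repeats


def supported_frame_rates(seconds_per_frame, rates=(30, 50, 60)):
    """Frame rates whose per-frame budget (1 / fps seconds) is not exceeded."""
    return [fps for fps in rates if seconds_per_frame <= 1.0 / fps]


# A transform taking 18 ms per descriptor keeps up with 30 and 50 fps only.
rates_18ms = supported_frame_rates(0.018)
# A transform taking 50 ms per descriptor is too slow for all three rates.
rates_50ms = supported_frame_rates(0.050)
```

In practice `seconds_per_call` would be applied to the trained mapping (for instance, the BMU search of a SOM) to obtain the value passed to `supported_frame_rates`.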
Another intuitive way to analyze the results is by visualizing the learned patterns. In summary, a well-performing dimensional mapping must locate frames close to each other in the output space if they are under similar light conditions and scene configuration. In other words, if the proposed features are related to the light/location conditions, the unsupervised method will try to separate them in the output space. The quality of that separation is ruled by the complexity of the data and the manifold method used.
Figure 8 shows the 2D output for the , , , and , for both datasets using HSV. Different colors represent the manual labels. In the case of SOM and GNG, each neuron is labeled with the majority vote of its neural activation hits. The figure clearly shows that SOM successfully groups similar inputs in the same regions of the output layer. The GNG also creates some groups of neurons for each location, but its visualization makes it difficult to draw conclusions. In the case of PCA and Isomaps, the patterns in the output space are not so evident, but the non-linearity of Isomaps allows them to capture more information than PCA, which is clearly affected by the orthogonality of the intrinsic dimensions.
The output space of the in the UNIGE dataset is remarkable: both classification problems are located in different parts of the output layer. For the EDSH dataset, it is also possible to delineate some clusters, such as the kitchen (green), the street (black), the floor (red) and the stairs (yellow and orange). However, the remaining locations are not easily visible, e.g., both lobbies (in blue and pink). This is explained by the small number of frames available for these locations in the dataset.
Figure 9 shows the signature when transforming a uniform sampling of seconds from the street video of the UNIGE dataset using HSV. The first row shows the activated neurons (unsupervised locations) ordered by time from left to right. The second row shows the compressed snapshots of the input frames. As can be seen from the first row, the activations start on the left side and move to the middle of the grid while the user walks in the street through different light conditions. The point color represents the temporal dimension, with yellow for the first frame and red for the last one.
4.3 Feature Analysis
This subsection exhaustively analyzes the discriminative capabilities of the proposed global features and combines the most relevant dimensions to improve the dimensional mapping. For this purpose we follow two steps: i) The global features (RGB, HSV, LAB, YCrCb, GIST) are combined and used to train a RF on each dataset and classification problem described in Section 4. ii) The discriminative importance learned by the RF is exploited by adding, in order of importance, each of the original dimensions while evaluating the performance of RF and .
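Step ii) can be sketched as a loop that adds dimensions in decreasing order of importance and re-evaluates the model on each growing subset. The sketch is deliberately library-free: the importance vector is assumed to come from an already trained random forest, and `evaluate` is a hypothetical callback that would retrain and score a model on the selected dimensions. The toy scoring function below is only a stand-in.

```python
def incremental_feature_addition(importances, evaluate):
    """Add dimensions by decreasing importance and score each growing subset.

    importances : one importance value per original dimension (assumed given)
    evaluate    : hypothetical callable, maps a list of dimension indices to a score
    Returns a list of (selected_dimensions, score), one entry per step.
    """
    order = sorted(range(len(importances)), key=lambda i: importances[i], reverse=True)
    history, selected = [], []
    for dim in order:
        selected.append(dim)
        history.append((list(selected), evaluate(selected)))
    return history

# Hypothetical importances for five dimensions
importances = [0.05, 0.40, 0.10, 0.30, 0.15]

def toy_score(dims):
    # Stand-in for retraining: the score is the total importance selected so far
    return round(sum(importances[d] for d in dims), 2)

history = incremental_feature_addition(importances, toy_score)
```

Each entry of `history` corresponds to one step on the x-axis of a performance curve of the kind summarized in the following figure.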
Figure 10 summarizes the changes in performance (line plot) and the number of components (heat-map) belonging to each global feature on each step (x-axis). The upper and lower parts of the figure show the results for the EDSH and the UNIGE dataset, respectively. The first column corresponds to the indoor/outdoor problem and the second column to the location problem. The constant values in the line plots are the performance of and reported in Table 1.
From Figure 10 it is possible to conclude that combined features can improve the performance in the proposed classification problems. For instance, for the EDSH dataset, the combined features improve the SOM accuracy from to and to in the indoor/outdoor and location problem, respectively. For the UNIGE dataset, due to the original performance, the improvement is not as significant. However, for some steps in the location problem, the combined features reach an accuracy of , which is slightly better than the of the HSV version. Also noteworthy is the result on the location problem for the EDSH dataset, in which the combined feature is close to the SOM-HSV combination but is not able to improve its performance considerably. The latter fact confirms that the location problem in the EDSH dataset is the most challenging, not only for the manifold methods but also for the supervised classifiers.
Regarding the composition of the combined features, it is notable that by using fewer than components, it is possible to achieve similar performance to the SOM-HSV, which originally uses components. Additionally, in all cases, the method starts using HSV, YCbCr and LAB components as the most discriminative, but around the to the step, it aggressively uses GIST components to disambiguate the most difficult cases. It is important to note that HSV, YCbCr, and LAB are color spaces designed to use one of the components for luma and the other two components for chromatic information. A quick analysis of the GIST components suggests that the RF searches for orientations and scale in the scene. Finally, the RGB color space is barely used.
5 Application case: Multi-model hand-detection
Having confirmed the capability of SOM to capture light conditions and the global characteristics of the scene, its output can be used as a map of unsupervised locations to build a multi-model approach to different problems such as object recognition, hand-detection, video summarization, and activity recognition, among others. This section illustrates the use of the unsupervised layer with the hand-detection problem as defined in Betancourt2015 (), in which a Support Vector Machine (SVM) is trained with Histogram of Oriented Gradients (HOG) features to detect whether the hands are being recorded by the camera or not Betancourt2015 (); Betancourt2014a (). The remainder of this section uses the UNIGE dataset due to its intentional composition of frames with and without hands.
The hand-detection problem is used as an example for two reasons: i) It solves a simple question, which makes it possible to illustrate the role of the unsupervised layer in the reported improvements; ii) The manual labeling is simple and easy to replicate. The proposed application can be extended to other hierarchical levels such as hand-segmentation; however, it would require extra labeling to cope with the quadratic growth of the number of neurons.
Our approach extends the method proposed in Betancourt2015 () by training one hand-detector for each unsupervised neuron of the HSV-SOM described in Section 3. Let us denote each neuron and its local hand-detector as , and the global hand-detector as . Given an arbitrary frame , the local and global confidence about the hand presence is given by the SVM probabilistic notation as stated in equations (2) and (3), respectively. The model with the higher confidence takes the final decision. Here refers to the hyperplane learned by the HOG-SVM when trained on the whole training dataset, and to the hyperplane obtained with a HOG-SVM trained on the local training set assigned to neuron , which contains the training frames for which neuron was the best matching unit. Additionally, for each neuron a local testing set (LTS) is defined by combining the activations of the neighbouring neurons. The LTS of each neuron is used to evaluate its local F1-score. Due to the finite number of training frames, some neurons do not receive enough training frames, or receive only positive or only negative frames, which makes it impossible to train their local hand-detectors. These neurons, and the ones with a local F1-score lower than , are defined as degraded, and their hand-detector is replaced by the global version.
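The switching logic described above can be sketched as follows. Several elements are assumptions made for illustration: each detector is modeled as a callable returning the probability of hand presence, "confidence" is taken as the distance of that probability from the 0.5 decision boundary, and the degradation threshold is left as a parameter. The function names are hypothetical and do not come from the original implementation.

```python
def find_degraded(local_scores, threshold):
    """Neurons without a trainable local model (score None) or with a low local score."""
    return {n for n, s in local_scores.items() if s is None or s < threshold}

def multi_model_predict(frame, bmu, local_models, global_model, degraded):
    """Return (hands_present, model_used) for one frame.

    local_models : dict mapping a neuron to a callable giving P(hands | frame)
    global_model : callable giving P(hands | frame), trained on all frames
    degraded     : set of neurons whose local model is replaced by the global one
    """
    p_global = global_model(frame)
    local = local_models.get(bmu)
    if local is None or bmu in degraded:
        return p_global >= 0.5, "global"
    p_local = local(frame)
    # Assumed confidence measure: distance from the 0.5 decision boundary
    if abs(p_local - 0.5) > abs(p_global - 0.5):
        return p_local >= 0.5, "local"
    return p_global >= 0.5, "global"

# Hypothetical detectors and local scores for three neurons
local_models = {(0, 0): lambda f: 0.9, (0, 1): lambda f: 0.2}
global_model = lambda f: 0.6
scores = {(0, 0): 0.95, (0, 1): 0.40, (1, 1): None}
degraded = find_degraded(scores, threshold=0.5)

decision_a = multi_model_predict(None, (0, 0), local_models, global_model, degraded)
decision_b = multi_model_predict(None, (0, 1), local_models, global_model, degraded)
```

In this toy case the first neuron answers with its own detector, while the second is degraded and falls back to the global detector.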
Figure 11 summarizes the performance of the multi-model approach for different SOM sizes (x-axis). The upper half of the figure shows the training and testing scores. This figure shows a quick increase in the score, which stabilizes for SOMs with more than neurons. The lower half of the figure shows the average number of training frames per neuron (blue) and the number of degraded neurons (red). Two important conclusions can be drawn from these figures: i) The multi-model approach overfits the training dataset on large SOMs; ii) The number of degraded neurons increases quickly and, as a consequence, no extra benefit is obtained from larger SOMs.
[Table 3: True-positive rate, true-negative rate and F1-score of the HOG-SVM and the multi-model strategy for each location.]
Table 3 compares the performance of the HOG-SVM and the multi-model strategy on a . The table shows the true-positive rate, the true-negative rate and the F1-score for each location in the dataset. In general, our approach considerably improves the performance for all locations, totalling an improvement of score points in the whole dataset. The location with the largest improvement is the Coffee-bar, with an increase of points in the score. This improvement is explained by an increase of and percentage points in the true-positive and true-negative rates, respectively.
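For reference, the three measures reported in the table can be computed from frame-level predictions as in the sketch below. The function name and the example labels are illustrative assumptions; the definitions are the standard ones for binary detection.

```python
def detection_metrics(y_true, y_pred):
    """Return (true-positive rate, true-negative rate, F1-score) for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    tnr = tn / (tn + fp) if tn + fp else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    return tpr, tnr, f1

# Hypothetical frames: 1 = hands present, 0 = no hands
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1]
tpr, tnr, f1 = detection_metrics(y_true, y_pred)
```

Computing the metrics separately over the frames of each location gives per-location rows of the kind compared in the table.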
Finally, Figure 12 summarizes some neural characteristics of the : Figure 12(a) shows the number of training frames used per neuron and the proportions between frames with (green) and without (red) hands. Some neurons have slightly unbalanced training sets. This fact is taken into account in the hand-detector training phase by using these proportions as the class weights of the SVM. Figure 12(b) summarizes the use of and . The size of each circle represents the number of testing frames activating a particular neuron. In turn, each circle is divided into green and red in proportion to the number of times that the local or global model is used, respectively. The gray cells are the degraded neurons, for which the global model is always used. For this particular SOM size the degraded neurons are a consequence of poor local scores. Figure 12(c) shows the composition of testing frames on each neuron in terms of location. Note that the resulting regions are in line with the regions presented in Section 4.2, Figure 7(b). Finally, Figure 12(d) shows the testing score of each neuron. It is noteworthy that the smallest scores are located in a contiguous region of the . This fact can be exploited by using a windowing strategy to fuse the local models. For the sake of a simple explanation of the application case, this improvement is not included in the current implementation.
6 Conclusions and future research
This paper proposes an unsupervised strategy to endow wearable cameras with contextual information about the light conditions and location recorded by using global features. The main finding of our approach is that, using SOM and HSV, it is possible to develop an unsupervised layer that understands the illumination and location characteristics of the scene in which the user is involved. Our experiments validate the intuitive findings of previous works using HSV global histograms as a proxy for the light conditions recorded by a wearable camera. As an application case, the unsupervised layer is used to address the hand-detection problem under a multi-model approach. In the experiments presented for this application, the multi-model approach considerably outperforms the method proposed in Betancourt2015 ().
The experimental results analyze the capabilities of different unsupervised methods to capture light and location changes in egocentric videos, and show that SOM can extract valuable contextual information about the illumination and location without using manually labeled data.
Regarding the relationship between the global features and the recorded characteristics, our experiments point to HSV as the color space with the most discriminative power. Additionally, it is shown that by following a simple feature selection, it is possible to obtain a combined feature, mainly formed by HSV and GIST, which makes it easier for SOM to capture these patterns. Two issues with the combined feature must be taken into account: i) it is computationally expensive compared with using just HSV; ii) it indirectly introduces a dependence between the manual labels and the training phase.
Concerning future work, several challenges in the proposed method remain to be addressed. One of the most promising directions is the use of deep features to extract more complex contextual patterns. This type of approach could considerably improve the scalability of the system, in particular when the user visits multiple and unknown locations. This strategy could be considered an example of knowledge transfer, in which the information about scene recognition is obtained from neural coefficients learned with non-wearable cameras. The important considerations mentioned before must be taken into account if deep features are included. In the application case, important improvements can be achieved if the proposed framework is applied to other hierarchical levels; for example, the unsupervised layer can be used to switch between different color spaces at a hand-segmentation level, or to select different dynamic models at a hand-tracking level Betancourt2016b (). Another interesting improvement to the current approach is to include dynamic information in the activated neurons by exploiting the temporal correlation, which would avoid executing the unsupervised method for each frame in the video stream Betancourt2015 ().
Finally, an interesting application of the proposed approach can be found in video summarization, visualization and captioning. In this line, the output space can be used to easily find and retrieve video segments recorded in similar locations or light conditions.
This work was partially supported by the Erasmus Mundus joint Doctorate in Interactive and Cognitive Environments, which is funded by the EACEA, Agency of the European Commission under EMJD ICE. Likewise, we thank the AAPELE (Architectures, Algorithms and Platforms for Enhanced Living Environments) EU COST action IC1303 for the STSM Grant, the International Neuroinformatics Coordinating Facility (INCF) and the Finnish Foundation for Technology Promotion (TES).
The authors thank the Cyberinfrastructure Service for High Performance Computing, “Apolo”, at EAFIT University, for allowing us to run our computational experiments in their computing centre.
- journal: Pervasive and Mobile Computing
- A. Betancourt, P. Morerio, C. Regazzoni, and M. Rauterberg, “The Evolution of First Person Vision Methods: A Survey,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 25, no. 5, pp. 744–760, 2015.
- T.-H.-C. Nguyen, J.-C. Nebel, and F. Florez-Revuelta, “Recognition of Activities of Daily Living with Egocentric Vision: A Review,” Sensors, vol. 16, no. 1, pp. 72, 2016.
- K. Zhan, S. Faux, and F. Ramos, “Multi-scale Conditional Random Fields for first-person activity recognition,” in International Conference on Pervasive Computing and Communications. mar 2014, pp. 51–59, IEEE.
- L. Baraldi, F. Paci, G. Serra, L. Benini, and R. Cucchiara, “Gesture Recognition using Wearable Vision Sensors to Enhance Visitors’ Museum Experiences,” IEEE Sensors Journal, vol. 15, no. 5, pp. 1–1, 2015.
- A. Fathi, A. Farhadi, and J. Rehg, “Understanding egocentric activities,” in Proceedings of the IEEE International Conference on Computer Vision. nov 2011, pp. 407–414, IEEE.
- H. Pirsiavash and D. Ramanan, “Detecting activities of daily living in first-person camera views,” in 2012 IEEE Conference on Computer Vision and Pattern Recognition. jun 2012, pp. 2847–2854, IEEE.
- N. Díaz, M. Pegalajar, J. Lilius, and M. Delgado, “A survey on ontologies for human behavior recognition,” ACM Computing Surveys, vol. 46, no. 4, pp. 1–32, 2014.
- J. Farringdon and V. Oni, “Visual Augmented Memory,” in International Symposium on wearable computers, Atlanta GA, 2000, pp. 167–168.
- M. Harvey, M. Langheinrich, and G. Ward, “Remembering through lifelogging: A survey of human memory augmentation,” Pervasive and Mobile Computing, vol. 27, pp. 14–26, 2016.
- G. Balakrishnan, G. Sainarayanan, R. Nagarajan, and S. Yaacob, “Wearable Real-Time Stereo Vision for the Visually Impaired,” Engineering Letters, vol. 14, no. 2, pp. 6–14, 2007.
- A. Fathi, X. Ren, and J. Rehg, “Learning to recognize objects in egocentric activities,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Providence, RI, jun 2011, pp. 3281–3288, IEEE.
- C. Li and K. Kitani, “Model Recommendation with Virtual Probes for Egocentric Hand Detection,” in 2013 IEEE International Conference on Computer Vision, Sydney, 2013, pp. 2624–2631, IEEE Computer Society.
- O. D. Lara and M. A. Labrador, “A Survey on Human Activity Recognition using Wearable Sensors,” IEEE Communications Surveys & Tutorials, vol. 15, no. 3, pp. 1192–1209, 2013.
- D. Riboni and C. Bettini, “COSAR: hybrid reasoning for context-aware activity recognition,” Personal and Ubiquitous Computing, vol. 15, no. 3, pp. 271–289, 2011.
- Y. Zhu, N. M. Nayak, and A. K. Roy-Chowdhury, “Context-Aware Activity Recognition and Anomaly Detection in Video,” Selected Topics in Signal Processing, vol. 7, no. 1, pp. 91–101, 2013.
- J. H. Friedman, “On bias, variance, 0/1-loss, and the curse-of-dimensionality,” Data mining and knowledge discovery, vol. 77, no. 1, pp. 55–77, 1997.
- B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva, “Learning Deep Features for Scene Recognition using Places Database,” Advances in Neural Information Processing Systems 27, pp. 487–495, 2014.
- A. Oliva and A. Torralba, “Modeling the shape of the scene: A holistic representation of the spatial envelope,” International Journal of Computer Vision, vol. 42, no. 3, pp. 145–175, 2001.
- T. Starner, B. Schiele, and A. Pentland, “Visual contextual awareness in wearable computing,” in Digest of Papers Second International Symposium on Wearable Computers Cat No98EX215. 1998, pp. 50–57, IEEE Computer Society.
- C. Zhu and W. Sheng, “Motion- and location-based online human daily activity recognition,” Pervasive and Mobile Computing, vol. 7, no. 2, pp. 256–269, 2011.
- L. Wang, T. Gu, X. Tao, H. Chen, and J. Lu, “Recognizing multi-user activities using wearable sensors in a smart home,” Pervasive and Mobile Computing, vol. 7, no. 3, pp. 287–298, 2011.
- T. Hori and K. Aizawa, “Context-based video retrieval system for the life-log applications,” in Proceedings of the 5th ACM SIGMM international workshop on Multimedia information retrieval - MIR ’03, New York, New York, USA, 2003, p. 31, ACM Press.
- R. Templeman, M. Korayem, D. Crandall, and K. Apu, “PlaceAvoider: Steering first-person cameras away from sensitive spaces,” in Network and Distributed System Security Symposium, 2014, pp. 23–26.
- A. Oliva, “Gist of the scene,” in Neurobiology of Attention, pp. 251–256. Elsevier Inc., 2005.
- P. Baldassarri and P. Puliti, “Self-organizing maps versus growing neural gas in a robotic application,” Artificial Neural Nets Problem Solving Methods, pp. 201–208, 2003.
- E. Barakova and T. Lourens, “Event Based Self-Supervised Temporal Integration for Multimodal Sensor Data,” Journal of Integrative Neuroscience, vol. 04, no. 02, pp. 265–282, 2005.
- E. Barakova and T. Lourens, “Efficient episode encoding for spatial navigation,” International Journal of Systems Science, vol. 36, no. October 2014, pp. 887–895, 2005.
- A. Betancourt, P. Morerio, E. Barakova, L. Marcenaro, M. Rauterberg, and C. Regazzoni, “A Dynamic Approach and a New Dataset for Hand-Detection in First Person Vision,” in Lecture Notes in Computer Science, Malta, 2015, vol. 9256.
- P. Kakumanu, S. Makrogiannis, and N. Bourbakis, “A survey of skin-color modeling and detection methods,” Pattern Recognition, vol. 40, no. 3, pp. 1106–1122, mar 2007.
- P. Morerio, L. Marcenaro, and C. Regazzoni, “Hand Detection in First Person Vision,” in Fusion, Istanbul, 2013, University of Genoa, pp. 0–6.
- A. Betancourt, P. Morerio, L. Marcenaro, E. Barakova, M. Rauterberg, and C. Regazzoni, “Left/Right Hand Segmentation in Egocentric Videos,” Computer Vision and Image Understanding, 2016.
- K. Murphy, A. Torralba, D. Eaton, and W. Freeman, “Object detection and localization using local and global features,” Toward Category-Level Object Recognition - Lecture Notes in Computer Science, vol. 4170, pp. 382–400, 2006.
- S. Chiappino, P. Morerio, L. Marcenaro, and C. Regazzoni, “Bio-inspired relevant interaction modelling in cognitive crowd management,” Journal of Ambient Intelligence and Humanized Computing, vol. Feb, no. 1, pp. 1–22, feb 2014.
- P. Morerio, L. Marcenaro, and C. S. Regazzoni, “People count estimation in small crowds,” in Advanced video and signal-based surveillance (AVSS), 2012 IEEE Ninth International Conference on. IEEE, 2012, pp. 476–480.
- N. Arous and N. Ellouze, “On the Search of Organization Measures for a Kohonen Map Case Study: Speech Signal Recognition,” International Journal of Digital Content Technology and its Applications, vol. 4, no. 3, pp. 75–84, 2010.
- J. B. Tenenbaum, V. De Silva, and J. C. Langford, “A global geometric framework for nonlinear dimensionality reduction,” Science, vol. 290, no. 5500, pp. 2319–2323, 2000.
- T. Kohonen, “The self-organizing map,” Proceedings of the IEEE, vol. 78, no. 9, pp. 1464–1480, 1990.
- F. Florez, J. M. Garcia, J. Garcia, and A. Hernandez, “Representing 2D objects. Comparison of several self-organizing networks,” in Conference on Neural Networks and Applications, 2002, pp. 2–5.
- L. Jing and C. Shao, “Selection of the suitable parameter value for ISOMAP,” Journal of Software, vol. 6, no. 6, pp. 1034–1041, 2011.
- T. Kohonen, “The self-organizing map,” Proceedings of the IEEE, vol. 78, no. 9, pp. 1464–1480, Sep 1998.
- T. Kohonen, “Self-organizing maps,” Springer Series in Information Sciences. Berlin, Heidelberg, vol. 30, no. 3rd edition, 2001.
- N. J. Mount and D. Weaver, “Self-organizing maps and boundary effects: Quantifying the benefits of torus wrapping for mapping SOM trajectories,” Pattern Analysis and Applications, vol. 14, no. 2, pp. 139–148, 2011.
- B. Fritzke, “A Growing Neural Gas Learns Topologies,” Advances in Neural Information Processing Systems, vol. 7, pp. 625–632, 1995.
- T. Martinetz and K. Schulten, “A ‘Neural-Gas’ Network Learns Topologies,” 1991.
- A. Saxena and A. Gupta, “Non-linear dimensionality reduction by locally linear isomaps,” Neural Information Processing, pp. 1038–1043, 2004.
- H.-U. Bauer and K. R. Pawelzik, “Quantifying the neighborhood preservation of self-organizing feature maps,” Transactions on Neural Networks, vol. 3, no. 4, pp. 570–579, 1992.
- T. Villmann, R. Der, M. Herrmann, and T. M. Martinetz, “Topology preservation in self-organizing feature maps: Exact definition and measurement,” IEEE Transactions on Neural Networks, vol. 8, no. 2, pp. 256–266, 1997.
- A. Betancourt, P. Morerio, L. Marcenaro, M. Rauterberg, and C. Regazzoni, “Filtering SVM frame-by-frame binary classification in a detection framework,” in International Conference on Image Processing, Quebec, Canada, 2015, vol. 2015-Decem, IEEE.
- A. Betancourt, M. Lopez, C. Regazzoni, and M. Rauterberg, “A Sequential Classifier for Hand Detection in the Framework of Egocentric Vision,” in Conference on Computer Vision and Pattern Recognition, Columbus, Ohio, jun 2014, vol. 1, pp. 600–605, IEEE.
- A. Betancourt, L. Marcenaro, E. Barakova, M. Rauterberg, and C. Regazzoni, “GPU Accelerated Left/Right Hand-segmentation in First Person Vision,” in European Conference on Computer Vision, 2016.