Predicting Visual Context for Unsupervised Event Segmentation in Continuous Photo-streams
Segmenting video content into events provides semantic structures for indexing, retrieval, and summarization. Since motion cues are not available in continuous photo-streams, and annotations in lifelogging are scarce and costly, the frames are usually clustered into events by comparing the visual features between them in an unsupervised way. However, such methodologies are ineffective at handling heterogeneous events, e.g. taking a walk, and temporary changes in the sight direction, e.g. at a meeting. To address these limitations, we propose Contextual Event Segmentation (CES), a novel segmentation paradigm that uses an LSTM-based generative network to model the photo-stream sequences, predict their visual context, and track their evolution. CES decides whether a frame is an event boundary by comparing the visual context generated from the frames in the past to the visual context predicted from the future. We implemented CES on a new and massive lifelogging dataset consisting of more than million images spanning over days. Experiments on the popular EDUB-Seg dataset show that our model outperforms the state-of-the-art by over in f-measure. Furthermore, CES’ performance is only points below that of human annotators.
Continuously recording our lives in the form of images can be very useful for memory enhancement, tracking of the activities of daily living, and other related healthcare applications. However, lifelogging has an overload problem both in time and space: Lifelogging cameras take a minimum of pictures per minute, which can add up to more than pictures a day, i.e. per year. Such a vast load of data requires hours of manual analysis to, for example, select your day’s highlights, check what you ate and drank the past month, or monitor your grandparent’s routines. Hence, automatic tools to extract highlights and life patterns are needed (EEG2014; harvey2016remembering; XiongSnap; GarciadelMolino2018phd). However, analyzing lifelogs entails two great challenges related to its Low Time Resolution (LTR) and wearable nature: First, dramatic visual changes between consecutive frames, even if these correspond to the same event. Second, a substantial presence of visual occlusions, walls and ceilings in the field of view, and frequent changes of visual orientation.
Extensive research has been conducted to retrieve specific events or obtain summaries from First Person View (FPV) videos (MySurvey; BolanosSurv). While event segmentation is needed for a complete, informative and diverse summary that includes most life events in the recording, little work has been done to that effect (BolanosSurv; NTCIR_db). Many approaches to segment High Time Resolution (HTR) video use motion between frames to infer the wearer’s activity, but motion cues are not available in photo-streams. Furthermore, obtaining annotations for such large datasets is very costly. As such, one can only resort to visual features and sensor metadata, and unsupervised techniques such as K-Means (LifeLogTask17_CLEF) and probabilistic models (dimiccoli2017sr).
Due to these limitations, current automated methods usually fail at modeling the frame sequences. As a consequence, they cannot perceive the overall context in heterogeneous events, and usually misinterpret occlusions and occasional diversions within events as different episodes. Our ambition is to build a segmentation model that mimics human reasoning, as people can easily detect and discard such noise by comparing the new visual input with their understanding of both the previous and the following scene (see Fig. 1).
In this work, we introduce Contextual Event Segmentation (CES), a novel event segmentation technique that, given a sequence of frames, predicts its visual context and then compares it to the context corresponding to the ensuing sequence. An LSTM-based generative model, that we call VCP, is used to predict the visual context. It is able to model our daily activities and learn the associations between different scenes, e.g. a train commute will include corridors, stairs, a platform, the interior of a wagon, etc. To train VCP we introduce R3, a novel and vast dataset for unsupervised lifelog analysis. It consists of over million images that depict the daily activities of different users over a total of days.
The main contributions of this paper are:
- a segmentation approach that mirrors the human perceptual reasoning when segmenting photo-streams into events. In a series of experiments, CES proves to be superior to the state of the art by over in f-measure, and even competitive against manual annotations.
- an LSTM-based generative model to predict visual context from a sequence of frames. We observe that the model learns event traits in common daily activities.
- a large-scale lifelogging dataset containing images from users. To our knowledge, R3 is the largest FPV dataset currently available.
2. Related Work
FPV content entails three main challenges: its unconstrained nature, its continuous stream of consecutive events, and its poor visual quality. In particular, the purpose of lifelogging is to have a diary of our lives. However, such a huge amount of visual content must be summarized to be of practical use. The summary of these photo-streams should be complete, informative and diverse. When no query is given to constrain the content of the summary, the maximum variety of events should be included. To do so, the content must first be segmented into subshots in the case of High Temporal Resolution (HTR) videos, or events in the case of Low Temporal Resolution (LTR) videos (or photo-streams).
Temporal segmentation in High Temporal Resolution First Person View
Third Person View (TPV) event segmentation approaches typically identify shot boundaries by detecting abrupt changes between consecutive frames (MoneySurv; HuSurv). However, FPV content is not comprised of separate shots, but rather a succession of events with smooth transitions, where event boundaries are not well defined.
Most FPV approaches for event segmentation use motion cues, both visual (e.g. optical flow, blurriness) (AizSum01; NgSum; LuSD; GygliSum14; VariniPref; PolegSeg; PolegCNN; AVS) and from sensors (AizSum01; spriggs2009temporal). Such features are used to predict the wearer’s activity or attitude patterns using probabilistic models (VariniPref) and deep learning (PolegCNN), to segment the videos accordingly. Other methods resort to visual similarity between groups of frames (e.g. color, GIST, CNN hash) (LeeDisco12; XiongSnap; LeePred15; AizSum01; NgSum; LuSD; bettadapura2016picturesque; XuGaze; PotapovSum). Temporally constrained clustering (LeePred15) and statistical frameworks (PotapovSum) have been used to determine whether the visual differences correspond to event boundaries or just abrupt head movements.
Temporal segmentation in Low Temporal Resolution First Person View
In the case of lifelog photo-streams, frames can be up to seconds apart. In such low temporal resolution, content may change a lot between consecutive frames even if they are part of the same event, and as Bolanos et al. remark in (BolanosSurv), visual motion information is unavailable (sensor information may sometimes be available (DohertyLL08)). Given the limited amount of annotated data, event segmentation is very often unsupervised, performed via K-Means and other hierarchical clustering algorithms on visual cues (e.g. color, CNN hashes) (lin2006structuring; DohertyLL08; DohertyRet08; bolanos2015visual; I2R2017CLEF; I2R2017NTCIR; yamamoto2017pbg). An exception to these unsupervised methods is (furnari2018personal), in which a personal location classifier is trained for each user, and events are segmented according to changes in the wearer’s location. Since these methods often ignore the semantic nature of the frames, Dimiccoli et al. (dimiccoli2017sr) propose defining the frames with semantic and contextual cues defined by CNN features and linguistic information. The relation between frames is assessed using a WordNet (WordNet) based knowledge graph, and the event boundaries are found using a graph-cut algorithm integrating an agglomerative clustering. Such segmentation methodology relies on the cross-analysis of consecutive frames, and cannot detect change points between two events with heterogeneous visual content, nor ignore small and isolated visual changes within an event.
To address this limitation, we present a novel event segmentation paradigm in which each frame is understood as part of a global sequence. As such, the visual context of the upcoming frame can be predicted from the preceding sequence of frames. This prevents the model from detecting false positives due to abrupt changes between consecutive frames, and allows it to understand the nature of heterogeneous events.
Sequence embedding for photo-album summarization and activity classification
Addressing the problem of story-telling from albums of to photos, Yu et al. (yu2017hierarchically) use a Recurrent Neural Net (RNN) to encode the local album context for each photo, so that the best key-frames can be selected. Liu et al. (liu2017let) use Gated Recurrent Units (GRUs) to align the local storylines into the global sequential timeline. To obtain better event descriptions, they further leverage the semantic coherence in a photo stream by jointly embedding the images and sentences into a common semantic space.
Using video content, Bhatnagar et al. (bhatnagarunsupervised) obtain good results at describing egocentric motor actions (e.g. stir, fold, open) in HTR videos using a hybrid CNN-LSTM auto-encoder. Similarly, Srivastava et al. (srivastava2015unsupervised) learn spatio-temporal features using a sequence-to-sequence future prediction model, proving that such an architecture is more efficient than an auto-encoder.
Whereas both (bhatnagarunsupervised; srivastava2015unsupervised) learn the spatio-temporal features from the raw frames in HTR video, we propose learning a global semantic visual context from the visual features of LTR frame sequences.
3. Contextual Event Segmentation
Given a continuous stream of photos, we, as humans, would identify the start of an event if the new frame differs from our expectation of what should follow the preceding sequence. We would also check whether that frame is consistent with the subsequent image sequence (or scene). If the new scene spans a very short time and then returns to the previous one, we would not count it as an extra event, but rather wrap it within the current event (e.g. going for a bottle of water while watching TV). Therefore, we would frequently look forward and backward to verify whether it was a new event, or just a brief diversion or local outlier.
The proposed model is analogous to this intuitive framework of perceptual reasoning. It uses an encoder-decoder architecture to predict the visual context at time $t$ given the images seen before, i.e. the past. A second visual context is predicted from the ensuing frames, i.e. the future. If the two predicted visual contexts differ greatly, CES will infer that the two sequences (past and future) correspond to different events, and will consider $t$ as a candidate event boundary.
Therefore, CES consists of two modules (c.f. Algorithm 1): First, the Visual Context Predictor (VCP), that predicts the visual context of the upcoming frame, either in the past or in the future depending on the sequence ordering. Second, the event boundary detector, that compares the visual context at each time-step given the frame sequence from the past, with the visual context given the sequence in the future.
3.2. Visual Context Predictor
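Inspired by (bhatnagarunsupervised; srivastava2015unsupervised; yu2017hierarchically), we propose predicting the visual context from a sequence of frames with a Long Short-Term Memory (LSTM) network. LSTM networks are a type of Recurrent Neural Network that learn long-time dependencies through four hidden layers, i.e. the gates. Thus, LSTMs can aggregate the information they receive by learning to forget. Their mathematical formulation can be expressed as
$$
\begin{aligned}
i_t &= \sigma\left(W_i\,[h_{t-1}, x_t]\right), \\
f_t &= \sigma\left(W_f\,[h_{t-1}, x_t]\right), \\
o_t &= \sigma\left(W_o\,[h_{t-1}, x_t]\right), \\
g_t &= \tanh\left(W_g\,[h_{t-1}, x_t]\right), \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t, \\
h_t &= o_t \odot \tanh(c_t),
\end{aligned}
$$
where $i_t$, $f_t$, $o_t$, and $g_t$ correspond to the four gates of the unit (input, forget, output, and input modulation), $\odot$ is the element-wise product, $W$ are the network weights (gal2016theoretically), and $c_t$ and $h_t$ are the cell state and the hidden state, respectively, at time-step $t$. Here $x_t$ is the input at time-step $t$, $\sigma$ is the logistic sigmoid, and $[\cdot,\cdot]$ denotes concatenation.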
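The gate mechanics described above can be illustrated with a minimal, dependency-free sketch of a single LSTM time-step. It assumes the standard bias-free formulation in which each gate applies its own weight matrix to the concatenation of the previous hidden state and the current input. The function names and the dictionary layout of the weights are illustrative choices, not the implementation used for VCP.

```python
import math


def sigmoid(v):
    """Logistic sigmoid for a scalar."""
    return 1.0 / (1.0 + math.exp(-v))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lstm_step(x, h_prev, c_prev, weights):
    """One bias-free LSTM step (assumed standard formulation).

    `weights` maps each gate name ("i", "f", "o", "g") to a matrix that is
    applied to the concatenation [h_prev, x].
    """
    z = h_prev + x  # list concatenation of hidden state and input
    i = [sigmoid(a) for a in matvec(weights["i"], z)]    # input gate
    f = [sigmoid(a) for a in matvec(weights["f"], z)]    # forget gate
    o = [sigmoid(a) for a in matvec(weights["o"], z)]    # output gate
    g = [math.tanh(a) for a in matvec(weights["g"], z)]  # input modulation
    # New cell state: forget part of the old state, add gated new content.
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    # New hidden state: gated view of the squashed cell state.
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c
```

With all weights set to zero, every sigmoid gate outputs 0.5 and the modulation gate outputs 0, so the cell state is simply halved at each step. This makes the "learning to forget" behaviour easy to check by hand.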
The sequential and relational nature of lifelogging photo-streams allows us to train the weights of an LSTM-based aggregation network without ground truth annotations. To obtain the weights of our Visual Context Predictor, we train an encoder-decoder architecture that, given a sequence of visual feature vectors, learns to predict the subsequent sequence, as shown in Fig. 2. Since LTR video frames are visually highly different from adjacent ones, the model will learn the general context of the event at the same time as the estimation of the visual feature of the upcoming frame.
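The auto-encoder is defined as
$$
c_t = E(x_1, \dots, x_{t-1}), \qquad \hat{x}_t = D(c_t),
$$
where $x_t$ is the deep learned visual feature (c.f. Section 5.1) of frame $t$, $c_t$ is the predicted visual context at time $t$, $\hat{x}_t$ is the decoded prediction of the feature, and $E$ and $D$ correspond to the models trained to encode and decode the visual feature, respectively. The objective function of the learning process is to minimize the mean squared error of the prediction, i.e. $\min \lVert \hat{x}_t - x_t \rVert_2^2$.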
VCP shares architecture and weights with the encoding model presented above, and is able to encode the visual context of lifelog image sequences both forward and backward, i.e. in reverse time order. The chosen architecture for VCP (i.e. the encoder) is a single LSTM layer of neurons. The hidden state is then passed to the decoder, which has a corresponding LSTM layer. The pre-trained model will be made available upon publication.
3.3. Boundary detector
Given a frame $x_t$, two different context predictions can be obtained from VCP. The first is the future context, predicted from the sequence of frames in the past ($x_1, \dots, x_{t-1}$). The second is the past context, predicted from the frames in the future in reverse time order ($x_T, \dots, x_{t+1}$), where $T$ is the total length of the lifelog. Thus, at each time-step $t$, the future context given the past will be $c_t^{\mathrm{past}} = \mathrm{VCP}(x_1, \dots, x_{t-1})$, and the past context given the future $c_t^{\mathrm{fut}} = \mathrm{VCP}(x_T, \dots, x_{t+1})$. Note that the frame $x_t$ is not seen when predicting the future and past context at time $t$, to avoid overlapping inputs in the prediction.
An event boundary will delimit sequences with very different visual context. Hence, the boundary prediction function is defined as
$$
b(t) = d\left(c_t^{\mathrm{past}}, c_t^{\mathrm{fut}}\right),
$$
where $d$ is the cosine distance.
The larger the distance between the two predicted visual contexts, the more likely the upcoming frame will correspond to an event boundary. Since the visual context will change gradually within the vicinity of a boundary, boundary candidates are assigned to the local maxima. Local maxima will also be found for very slight changes in the visual context. Therefore, only the candidates whose prediction value is above the average candidate value are kept as final event boundaries.
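The boundary detection steps described above (cosine distance between the two contexts, local maxima as candidates, and a threshold at the mean candidate value) can be sketched as follows. This is an illustrative sketch only: the function names, the `half_window` parameter, and the handling of zero-norm vectors are assumptions, and the context vectors are taken as given rather than produced by a trained VCP.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; returns 1.0 if either vector has zero norm
    (an assumption made here to keep the sketch total)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def boundary_signal(ctx_given_past, ctx_given_future):
    """Per-time-step distance between the two predicted contexts."""
    return [cosine_distance(p, f)
            for p, f in zip(ctx_given_past, ctx_given_future)]


def local_maxima(signal, half_window=1):
    """Indices strictly greater than all neighbours within the window."""
    peaks = []
    for t, value in enumerate(signal):
        lo = max(0, t - half_window)
        hi = min(len(signal), t + half_window + 1)
        neighbours = signal[lo:t] + signal[t + 1:hi]
        if neighbours and all(value > n for n in neighbours):
            peaks.append(t)
    return peaks


def detect_boundaries(signal, half_window=1):
    """Keep only the candidates whose value exceeds the mean candidate value."""
    candidates = local_maxima(signal, half_window)
    if not candidates:
        return []
    mean_peak = sum(signal[t] for t in candidates) / len(candidates)
    return [t for t in candidates if signal[t] > mean_peak]
```

For example, on the toy signal `[0.0, 0.2, 0.0, 0.9, 0.1, 0.3, 0.0]` the candidates are the peaks at indices 1, 3 and 5, and only index 3 lies above their mean, so the two slight changes are discarded.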
4. R3 dataset
A large-scale FPV dataset is needed to train the Visual Context Predictor. Such a dataset must consist of continuous LTR streams of images spanning at least a few hours, without the need for any annotation. However, the size of the publicly available LTR datasets is very limited: days in CLEF (LifeLogTask17_CLEF) and NTCIR (NTCIR_db), and in EDUB-Seg (dimiccoli2017sr) and EDUB-SegDesc (bolanos2018egocentric), spanning a total of hours and images. We can also resort to other popular HTR FPV video datasets such as the First Person Social Interaction Dataset (FathiDisney), Huji EgoSet (PolegSeg), and UTEgocentric (LeeDisco12), that cover , and hours, respectively. Down-sampled at , the accumulated length of these datasets is under images. This amount of information is insufficient to train efficient deep learning models.
In this work we introduce R3, a large scale lifelogging image dataset captured by users during days for a total of almost hours, resulting in over million images. A comparison of the size of R3 with respect to the other mentioned datasets is presented in Fig. 3. The users volunteered to capture their daily lives as part of a memory-enhancement user study. They were asked to put on the wearable camera for most of their day during a whole month, and were free to withdraw from the study if they felt that wearing it was disrupting their routines. The volunteers are mostly seniors older than years old, and span a wide range of occupations and lifestyles. To protect their privacy, only the extracted visual features will be released.
5.1. Data setup
The output of the pre-pooling layer of InceptionV3 (szegedy2016rethinking) is used to describe the frames in the lifelog. We use the available lifelogging video data from R3, CLEF, NTCIR, and EDUB-Seg to train the VCP model and test our CES framework. EDUB-SegDesc (bolanos2018egocentric) is reserved as validation for further supervised pruning of the prediction obtained from CES.
The datasets are used as follows:
Training of the VCP model: of R3 is used as training set for the Visual Context Predictor model. To ensure that the model is not biased toward this dataset, a of both CLEF (LifeLogTask17_CLEF) and NTCIR (NTCIR_db) is also included in the training set. This joined set adds up to images. A separate of R3 is used to validate the different configurations and select the best hyperparameters.
Testing set for the VCP model: the remaining of R3, and of CLEF and NTCIR is kept as test to confirm that VCP is not overfitted toward R3 (c.f. Table 2).
Testing of the CES framework: the semantic features for of the lifelogs in EDUB-Seg (dimiccoli2017sr) have been made available to us. We compare our method to the baselines in two overlapping sets: these lifelogs and the full lifelogs in the dataset.
5.2. Training methodology
We explore several architectures and training parameters for the Visual Context Predictor model. Regarding the architecture, we can modify the number of neurons in the encoding LSTM layer, the number of frames seen before starting the future prediction ($N$), the number of frames the decoder needs to predict ($M$), and whether the prediction will be conditional or not, i.e. whether the model gets further inputs past frame $N$. We investigate architectures between and neurons, values of $N$ between and , and the same range for $M$.
Concerning the training parameters, the loss is defined as the mean squared error of the prediction , and RMSProp without decay is used as optimizer. The learning rate is randomly set in the range , and is reduced by half after every epochs without significant improvement in the validation loss. Different batch sizes are used, between and sequences at a time.
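The learning-rate schedule described above (halving the rate after a number of epochs without significant improvement in the validation loss) can be sketched as below. The `patience` and `min_delta` parameters are hypothetical names for the number of stagnant epochs and the significance threshold, which are not specified here, and the function is an illustration rather than the training code used for VCP.

```python
def halve_on_plateau(val_losses, initial_lr, patience, min_delta=0.0):
    """Return the learning rate in effect after each epoch.

    The rate is halved whenever `patience` consecutive epochs fail to
    improve the best validation loss by more than `min_delta`.
    """
    lr = initial_lr
    best = float("inf")
    stale = 0
    history = []
    for loss in val_losses:
        if loss < best - min_delta:
            best = loss   # significant improvement: reset the counter
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                lr *= 0.5  # plateau detected: halve the learning rate
                stale = 0
        history.append(lr)
    return history
```

Calling `halve_on_plateau([1.0, 0.9, 0.9, 0.9, 0.8, 0.8, 0.8], 1.0, patience=2)` halves the rate twice, once at each two-epoch plateau.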
The best configuration is found through a grid search over all the different parameters. We find that the best prediction performance (smaller validation loss) is achieved with neurons on a conditional architecture. The number of frames seen before starting the future prediction is set to , equal to the number of frames to predict (). We observe that training with longer sequences does not significantly improve the model performance (c.f. Table 2), while making the training slower. At test time, one single frame () is given to start the prediction of the whole day ().
Other implementation details.
We also analyze the possibility of fine-tuning the boundary prediction with supervised learning. For that purpose, we train an SVM with samples from a held-out validation set (EDUB-SegDesc (bolanos2018egocentric)). The SVM evaluates the boundary likeliness from cluster consistency indicators. In particular, two clusters are defined at opposite sides of the candidate boundary, containing the frames that precede or follow it. The indicators used are the correlation between the two clusters, the compactness of each of them and their union, and the BetaCV and Normalized Cut scores (zaki2014data).
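One of the cluster consistency indicators mentioned above, BetaCV, is the ratio between the mean intra-cluster distance and the mean inter-cluster distance, so lower values indicate better-separated clusters. The sketch below computes it for the two clusters on either side of a candidate boundary. The use of Euclidean distance and the function names are assumptions made for illustration; the actual feature extraction for the SVM may differ.

```python
import math
from itertools import combinations, product


def euclidean(u, v):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def beta_cv(cluster_a, cluster_b):
    """Mean intra-cluster distance divided by mean inter-cluster distance.

    `cluster_a` holds the frames preceding the candidate boundary and
    `cluster_b` the frames following it (each a list of feature vectors).
    """
    intra = [euclidean(u, v)
             for cluster in (cluster_a, cluster_b)
             for u, v in combinations(cluster, 2)]
    inter = [euclidean(u, v) for u, v in product(cluster_a, cluster_b)]
    if not intra or not inter:
        raise ValueError("need at least one intra- and one inter-cluster pair")
    mean_inter = sum(inter) / len(inter)
    if mean_inter == 0.0:
        raise ValueError("clusters are identical; BetaCV is undefined")
    return (sum(intra) / len(intra)) / mean_inter
```

A true boundary is expected to separate two compact, distant groups of frames and therefore to yield a small BetaCV, whereas a false candidate inside a homogeneous event yields a value close to 1.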
5.3. Evaluation methodology
Following the literature, we report the averaged f-measure, precision and recall for the tested models (Table 1). For our evaluation, a detected boundary is considered a true positive if there is an element in the ground truth within a distance of tolerance, and the ground truth element is not already matched to any other detected boundary. Analogously, all elements in the ground truth for which no detected boundary is found within the tolerance are considered false negatives. This tolerance is set to frames.
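The matching rule described above can be sketched as follows. The sketch assumes a greedy one-to-one matching in which each detected boundary, in temporal order, is paired with the closest still-unmatched ground-truth boundary within the tolerance; the function name and this tie-breaking strategy are illustrative assumptions.

```python
def evaluate_boundaries(detected, ground_truth, tolerance):
    """Return (precision, recall, f_measure) under a frame tolerance.

    A detection is a true positive if an unmatched ground-truth boundary
    lies within `tolerance` frames; each ground-truth boundary can be
    matched at most once.
    """
    matched = set()
    true_pos = 0
    for d in sorted(detected):
        best = None
        for idx, g in enumerate(ground_truth):
            if idx in matched or abs(g - d) > tolerance:
                continue
            if best is None or abs(g - d) < abs(ground_truth[best] - d):
                best = idx
        if best is not None:
            matched.add(best)
            true_pos += 1
    false_pos = len(detected) - true_pos       # unmatched detections
    false_neg = len(ground_truth) - true_pos   # unmatched ground truth
    precision = true_pos / (true_pos + false_pos) if detected else 0.0
    recall = true_pos / (true_pos + false_neg) if ground_truth else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure
```

For instance, with detections at frames 10, 12 and 50, ground truth at 11, 30 and 52, and a tolerance of 2 frames, the detection at 12 is a false positive because the boundary at 11 is already matched to the detection at 10.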
We compare the performance of the following baselines on the publicly available EDUB-Seg dataset (dimiccoli2017sr):
Smoothed K-Means: the lifelog is clustered into events using k-means with a fixed . The clustering is then smoothed by assigning each frame to the most common cluster within a window. This operation is done iteratively until no more changes occur. As a result, some clusters may disappear.
AC-Color: Agglomerative Clustering on the color feature of the frames, as done in (LeePred15).
SR-Clustering: Semantic Regularized Clustering as described in (dimiccoli2017sr), using only visual features (CNN), and also semantic cues (Imagga).
KTS: Kernel Temporal Segmentation as described in (PotapovSum).
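The smoothing step of the Smoothed K-Means baseline listed above can be sketched as an iterative mode filter over the per-frame cluster labels. The sketch assumes the labels have already been produced by k-means; the `half_window` and `max_iters` parameters, the tie-breaking rule, and the function names are illustrative assumptions.

```python
from collections import Counter


def smooth_labels(labels, half_window, max_iters=100):
    """Repeatedly assign each frame the most common label in its window.

    On ties the frame keeps its current label if it is among the most
    common ones, otherwise the smallest tied label is used. Iteration
    stops when nothing changes or after `max_iters` passes.
    """
    current = list(labels)
    for _ in range(max_iters):
        smoothed = []
        for t in range(len(current)):
            lo = max(0, t - half_window)
            hi = min(len(current), t + half_window + 1)
            counts = Counter(current[lo:hi])
            top = max(counts.values())
            if counts[current[t]] == top:
                smoothed.append(current[t])
            else:
                smoothed.append(min(l for l, c in counts.items() if c == top))
        if smoothed == current:
            break
        current = smoothed
    return current


def labels_to_boundaries(labels):
    """Indices where the cluster label changes, i.e. the event boundaries."""
    return [t for t in range(1, len(labels)) if labels[t] != labels[t - 1]]
```

On the toy sequence `[0, 0, 1, 0, 0, 2, 2, 2, 0, 2, 2]` with `half_window=1`, the isolated labels are absorbed by their neighbours, cluster 1 disappears, and a single boundary remains at index 5.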
Bias in the Ground Truth
Since segmenting lifelogs into events can be a very subjective task, the curators of EDUB-Seg provide in (dimiccoli2017sr) an extensive analysis of the uniformity among the ground truth annotated by different subjects. They conclude that visual lifelog event segmentation can be objectively evaluated, since different people (who are not the camera wearer) tend to segment the lifelogs consistently. For the purpose of our evaluation, we select the ground truth from the first annotator. We use the other annotations as a baseline. For the lifelogs that only included one annotation, we asked independent subjects to annotate the events, so that we would have at least two sets of annotations for each lifelog. We therefore report the performance of the manual annotations as an upper reference in Table 3.
Other implementation details.
To find the local maxima in the prediction signal of CES, and to smooth the K-Means clustering, a window of size is chosen, so that it is consistent with the ground truth tolerance.
|CES (with VCP)||0.70||0.66||0.80||0.69||0.66||0.77|
Table 1 presents the results of CES and the baselines in EDUB-Seg, and a smaller subset (which includes the semantic features needed for SR-Clustering with Imagga). The position of each method in the Precision-Recall curve is shown in Fig. 4. While most methods fall within the mid-range performance in terms of f-measure, CES stands out from the baselines, improving their performance by over , and positioning itself on the upper range of the absolute spectrum. The performance of CES is even competitive with that of the manual annotations.
We show in Fig. 5 the performance of CES applied to one of the tested lifelogs. We can observe that most elements in the ground truth fall on the spikes of the prediction signal, or very close to them. This confirms the suitability of using the predicted contexts as a boundary cue.
While the baselines fail at detecting boundaries between heterogeneous events, CES is capable of extracting the underlying context of each event and discerning their disparity (e.g. shopping at the supermarket after riding a bike on the street). Moreover, in cases in which the camera wearer's orientation changes within a static event (e.g. looking back from your food to your colleagues), traditional segmentation methods detect such a view change as an event boundary, whereas CES is able to detect the presence of a common visual path. However, if the view change spans longer than CES's memory, CES will not be able to contextualize it within the event. An example of such a situation can be seen in Figs. 5(b) and 5(c). We also note that the ability of CES to detect the general context of the visual sequence and track common cues sometimes misleads the prediction. When the ground truth of a boundary falls within the same physical space, or similar contexts, CES does not perceive their differences, and thus does not detect the boundary. Arguably, such boundaries are also difficult for external viewers to detect. This may also occur when short transitions between events are considered events on their own.
Predicting the context vs predicting the actual frame
One could think that predicting a future frame and comparing it to the actual future frame should be better than comparing the visual context. We tested this hypothesis, in which
where is predicted from and from . The intuition behind this formulation is that a local outlier will be badly predicted both from the future and the past, whereas an event change will provide a good prediction only in one direction. This hypothesis does not hold in practice. The generative model embeds noise into the frame descriptor, and, as expected, generates samples closer to the previous (seen) frame than the (unseen) target. Using such a noisy signal is therefore detrimental to the final objective. The performance of this method is reported as CES-error in Table 3.
Informativeness of the Visual Context
To validate the encoding efficiency of VCP and hence the informativeness of the visual context, we have tested CES using two alternative sequence encodings: first, an average of the previous frames (or subsequent in the case of the past prediction); second, a PCA time-dimensionality reduction on the aforesaid set. These two variants are reported in Table 3 as CES-mean and CES-PCA, respectively.
We observe that the visual context predicted by VCP is much more informative than any of the other contextual encodings. While the averaged encoding obtains a predictive performance similar to the output of our decoder (c.f. Table 2), the encoding transformation of VCP is superior as a contextual visual feature. Moreover, unlike PCA, which takes the inputs as a set, VCP takes the inputs as a sequence, and is able to learn a more informative context descriptor.
|trained with N / M :||10 / 10||1 / 40||1/100||1/1||10/1|
|# neurons :||256||512||1024||512||1024||1024||mean*|
|mse future pred.:||1.058||1.030||1.024||1.03||1.029||1.028||1.58||1.054|
|mse past pred.:||1.059||1.029||1.024||1.03||1.029||1.028|
*As a reference, we include the result of using the average of the previous $N$ frames as the predicted feature, i.e. $\hat{x}_t = \frac{1}{N}\sum_{n=1}^{N} x_{t-n}$.
Pruning of the candidate boundaries using supervised learning
For high recall results, false candidate boundaries can be discarded using cluster analysis between the frames that the candidate separates. Having annotated data to train a pruning model can improve the performance of the segmentation algorithm in terms of precision, with minimal impact on the recall. We tested this hypothesis by training an SVM model to detect false positives. As can be observed in Table 3, such a model improves the average precision of CES by (absolute gain of ), while recall only decreases by (absolute loss of ). The benefit of using a supervised SVM pruning is much more significant for segmentation algorithms of lower precision, such as k-means, even if it comes at a higher recall cost.
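The distinction between relative and absolute change used above can be made concrete with a short calculation on the rounded Table 3 entries for CES (0.66 precision, 0.77 recall) and CES w/ SVM (0.75 precision, 0.71 recall). Because the inputs are rounded to two decimals, the results are approximate and may differ slightly from figures computed on unrounded scores. The helper name is illustrative.

```python
def gains(before, after):
    """Return (absolute change, change relative to the starting value)."""
    absolute = after - before
    relative = absolute / before
    return absolute, relative


# Rounded averaged scores taken from Table 3.
precision_abs, precision_rel = gains(0.66, 0.75)
recall_abs, recall_rel = gains(0.77, 0.71)

print(f"precision: {precision_abs:+.2f} absolute, {precision_rel:+.1%} relative")
print(f"recall:    {recall_abs:+.2f} absolute, {recall_rel:+.1%} relative")
```

On these rounded values, the precision gain is about 0.09 absolute (roughly 14% relative), while the recall loss is about 0.06 absolute (roughly 8% relative).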
Performance of CES relative to manual annotations
Since there is not just one correct way of segmenting video content into events, we have to compare the performance of CES relative to that of the average person. For each lifelog, we average the performance of all available annotations, as evaluated on the selected ground truth. The averaged scores are reported in Table 3. We observe that subjects are, in f-measure, only points better than CES. Even though the precision of the manual annotations is very high, the annotators also obtain worse recall than CES. This is due to some of the subjects selecting very general events, e.g. wrapping a whole working afternoon within the same event, disregarding the different meetings. Such annotation criteria yield many false negatives, and therefore lower the recall score. Analogously, in some other cases, subjects selected more details than the ground truth. As a result, their rate of false positives is greater than zero.
CES segments, on average, into more events than the annotators. As a result, it is able to detect more true boundaries than the test subjects, but will also find a relative more incorrect events. Such a large increase is to be expected, as the selected ground truth is very exhaustive, and the annotators rarely identify boundaries not present in the ground truth. Overall, we can conclude that CES is a highly precise event segmentation algorithm. Given our ground truth, CES’ f-measure is of relative to the manual performance.
|Method||averaged F1||averaged Prec.||averaged Rec.|
|CES (with VCP)||0.69||0.66||0.77|
|k-means w/ SVM||0.67||0.70||0.67|
|CES w/ SVM||0.71||0.75||0.71|
In this paper, we have introduced Contextual Event Segmentation, a novel unsupervised event segmentation method that uses the sequential nature of a photo-stream to infer the presence of event boundaries. At the core of CES is the Visual Context Predictor (VCP), a future sequence generator model that predicts the visual context from a given sequence of frames. The visual context at $t$ given the past is compared to that at $t$ given the future, to determine whether there is a boundary at frame $t$.
We have also introduced R3, a large scale visual lifelogging dataset depicting a wide variety of events. It is recorded in an unconstrained manner by independent users, who captured their daily activities morning to evening during over a month. The existence of R3 has allowed us to train the Visual Context Predictor, which is able to model human activities given sequences of visual features. In a series of experiments, we have proved that the visual context is a strong indicator of event changes. We conjecture that it can also be useful for storytelling and tracking of daily activities.
Leveraging the visual context of the sequences allows CES to detect boundaries between heterogeneous events and ignore local occlusions and brief diversions. CES improves the performance of the baselines by over in f-measure. The performance of CES is competitive with manual annotations, for which the f-measure is only better than CES’. We propose a fully unsupervised pipeline, which results in greater recall than precision. To improve the precision, supervised pruning can be applied to the final detection step by using cluster consistency analysis. Even though further supervised analysis can be performed to improve that performance, it will always be contingent on the ground truth used, which will be inherently subjective.
- The data is accessible from http://dx.doi.org/10.17632/ktps5my69g.1