An Adaptive Online HDP-HMM for Segmentation and Classification of Sequential Data
Abstract
In recent years, the desire and need to understand sequential data have been increasing, with particular interest in sequential contexts such as patient monitoring, understanding daily activities, video surveillance, the stock market and the like. Along with the constant flow of data, it is critical to classify and segment the observations on-the-fly, without being limited to a rigid number of classes. In addition, the model needs to be capable of updating its parameters to comply with possible evolutions. This interesting problem, however, is not adequately addressed in the literature, since many studies focus on offline classification over a predefined class set. In this paper, we propose a principled solution to this gap by introducing an adaptive online system based on Markov switching models with hierarchical Dirichlet process priors. This infinite adaptive online approach is capable of segmenting and classifying the sequential data over an unlimited number of classes, while meeting the memory and delay constraints of streaming contexts. The model is further enhanced by introducing a ‘learning rate’, responsible for balancing the extent to which the model sustains its previous learning (parameters) or adapts to the new streaming observations. Experimental results on several variants of stationary and evolving synthetic data and on two video datasets, TUM Assistive Kitchen and collated Weizmann, show remarkable performance in segmentation and classification, particularly for evolutionary sequences with changing distributions and/or containing new, unseen classes.
1 Introduction and related work
The joint problem of time segmentation and recognition of sequential data into meaningful subsequences has attracted significant research in a variety of domains. The ability to automatically segment and classify data is a core technology for applications like speaker diarisation, finance, activity understanding, multimedia annotation and human-computer interaction. To date, the main proposed solutions have included sliding windows [1], the hidden Markov model (HMM) [2], conditional random fields [3] [4], and structural SVMs [5], covering the spectrum of generative, discriminative and maximum-margin dynamic classifiers. Along with advancements in learning and inference, research has witnessed increasingly realistic datasets which are bridging the gap between the lab and real applications [6] [7].
Nevertheless, important challenges such as model adaptation and dynamic class sets remain unresolved. We address both these limitations with an adaptive online model that can accommodate an unlimited (theoretically infinite) number of classes. In a nutshell, this is achieved by applying a Bayesian nonparametric model, the hierarchical Dirichlet process (HDP), as the prior for a hidden Markov model (a model known as the HDP-HMM [8] [9]), and by exploiting an adaptive learning rate for model adaptation. The proposed model provides an adaptive online learning approach for the joint segmentation and recognition of sequential data with incremental class sets, and we refer to it as AdOn HDP-HMM in the following. The model is: i) online: it can receive sequential data in batches and segment and recognise them on-the-fly; ii) adaptive: using a limited memory buffer, the model can tune its parameters in response to diverse observations from the existing classes, as well as instantiate new, unseen classes. It continues learning throughout the entire life of its application; and iii) only initially supervised: the model uses a relatively short initial bootstrap of supervised training, but it adapts in a fully unsupervised manner during its operation. Learning is also a one-pass process over the streaming data, without revision. These constraints obviously make adaptation much more challenging, yet they suit the model to a large span of real-life problems. To improve adaptation in such an unsupervised learning scenario, we introduce the notion of a ‘learning rate’ that tunes how biased the model is towards its previous learning (memory), versus adapting to the patterns conveyed by the new observations (adaptability). Experiments support the efficiency of utilising a learning rate, particularly in evolving scenarios.
The rest of this paper is organised as follows: in the remainder of this Section, we review the related literature and clarify the scope of this study. In Section 2, we describe the hierarchical Dirichlet process and its temporal extension, the HDP-HMM. Section 3 presents the proposed online approach, expanding on the adaptive learning rate. Through the experiments and discussions in Section 4, we evaluate and compare the proposed variants with existing benchmarks, and we conclude in Section 5.
1.1 Related work
Amongst the many paradigms available for class modelling, hierarchical Bayesian modelling and, in particular, the hierarchical Dirichlet process (HDP) [8] offer a principled way to infer an arbitrary number of classes from a set of samples via a hierarchy of prior distributions. The HDP is a Bayesian nonparametric technique estimating the joint posterior distribution of a set of latent classes and a set of parameters, typically by Gibbs sampling [10] or variational inference [11]. It has been used for a variety of applications, including the modelling of sequential data, by integrating HDP priors into state-space models such as the HMM. In the resulting HDP-HMM [8] [9], the classes correspond to the discrete states of a Markov chain and the data are explained by a state-conditional observation model. Given a set of samples, classification is performed by state decoding, while allowing the number of states to dynamically grow or shrink. The HDP is finding increasing application in domains as varied as bioinformatics, speaker diarisation, vision and others for problems of joint segmentation and classification (see [12] [13] [14] for some recent references).
Most of the segmentation and recognition studies in the literature follow an offline approach, where the entire dataset is presented at once during the learning stage [6] [7]. Such systems obviously do not suit the needs of streaming data, which are ubiquitous in today’s applications. In response to this increasing demand for online systems, many studies are dedicated to this topic. However, the term online has been given a variety of meanings in different contexts. Our interpretation is the sequential processing of temporal data in mini-batches, inspired by recursive Bayesian estimation [15] and further elaborated throughout this paper. This interpretation is distinct from that of other studies in the literature where online refers to a closed dataset that is processed incrementally and possibly repeatedly, such as Bayesian online nonparametrics [16] [17], stochastic optimisation methods [18] [19] and formal bounds for online learning [20], all based upon the foundations laid by seminal works such as [21] [22].
Although almost all the proposed approaches consider closed, predefined sets of classes, in scenarios like long-term learning or monitoring the number of classes is not precisely predictable. Additionally, as more data stream in, the known classes may change in their parameters, due to the observation of a more comprehensive sample or a natural evolution over time. In either case, models are expected to update the parameters of the known classes and add new classes to their vocabulary once they appear. Unsupervised adaptation can be very challenging in non-stationary domains, where adaptation must be balanced against the risk of drifting away from previously learned classes.
In the absence of expert feedback, we elaborate on the learning rate as a dynamic lever for balancing adaptability (Section 3.1). Most previous studies approach this problem by assigning constant weights to the prior learning and the likelihood of the current data. However, in more complex problems the choice of the learning rate is highly dependent on the data dynamics and the application domain. Some online studies propose adaptive learning rates via exponential decay [24] and, more recently, regret-based adaptations of the learning rate (i.e., the step size of gradient descent) [18] [17] [19]. However, such adaptation strategies are only suitable for finite training sets. In our solution, we introduce a novel learning rate that constantly adapts to the statistics of the streaming data, without revision or supervision. For stationary problems where the parameters only slightly change, the learning rate tunes itself to rely more on the prior memory. Conversely, under evolving distributions, the dynamics of the data and their modes can significantly vary, calling for a more adaptive model with less inertia towards the past. Adding to the complexity, many real-life problems require a mixture of both, i.e. a continuous spectrum for the learning rate to follow the dynamics of the observations more or less tightly at each point in time. In this work, we tackle this problem by a posterior estimation of the learning rate, separately for each parameter in the model, thereby allowing each parameter to dynamically determine its adaptability in each batch.
2 The hierarchical Dirichlet process
A Dirichlet process, DP(γ, H), is a generative model that can be thought of as a distribution over discrete distributions with countably infinite categories. It is controlled by a scalar parameter, γ, known as the concentration parameter, and a base measure, H, over a measurable space Θ. A sample from a Dirichlet process, G_0 ~ DP(γ, H), is a distribution over Θ differing from zero at only a countably infinite number of locations, or atoms, θ_k:

    G_0 = Σ_{k=1}^∞ β_k δ(θ − θ_k)
The discrete set of locations, {θ_k}, is obtained by repeatedly sampling the base measure, θ_k ~ H, while the weight for each location, β_k, is established by a stick-breaking process, noted as GEM(γ) (named after Griffiths, Engen and McCloskey) [25]. We refer to the weight vector simply as β. A hierarchical Dirichlet process (HDP) consists of (at least) two layers of Dirichlet processes, obtained with a similar construction:

    G_0 ~ DP(γ, H),    G_j ~ DP(α, G_0),  j = 1, …, J
where γ and α are the concentration parameters of the top-level and lower-level Dirichlet processes, respectively. Since G_0 is discrete, the various G_j, j = 1, …, J, are also discrete and sampled from the elements of G_0 (Figure 1).
In practical applications, the continuous space Θ of distribution H is taken to be the parameter space for a data likelihood, as in p(x | θ). The likelihood could be, for instance, a Gaussian distribution with mean parameters sampled from a Normal-Inverse-Wishart (NIW) distribution. Given the generative model of the HDP, the joint distribution of data and parameters factorises as p(x, θ) = p(x | θ) p(θ). Typically, multiple G_j are sampled to model data belonging to different groups. Yet, the hierarchical structure of the HDP makes all the G_j usefully share distributional properties. Examples can be as diverse as words in a collection of books or genetic markers across different populations.
2.1 The HDP-HMM
The HDP has also been used as the prior distribution for the parameters of switching models such as the hidden Markov model [8] [13]. When applied to a Markov chain, z_t, t = 1, …, T, the HDP changes its interpretation significantly (Figure 2). In this case, each G_j, j = 1, …, J, is used as one row of the Markov chain’s transition matrix, representing the probability of transitioning from state j at the previous time step to any other state at the current time step, p(z_t | z_{t-1} = j). Thanks to the properties of the HDP, new states will be created when the data are not adequately explained by the current set of states. In contrast to the conventional HDP, the index of the group, j, of each observation is usually not known explicitly anymore, but it is instead inferred in sequential order from the chain. Therefore, in the case of the HDP-HMM the group index of each observation is given by the previous state, z_{t-1}. As a consequence, in the HDP-HMM the number of groups (J) and the number of indices in each G_j coincide. Adding the HDP as a prior caters for an arbitrary number of states, or activity classes [13].
It is worth adding that a reported limitation of the HDP-HMM is a tendency to over-segment due to its unbounded number of classes [26]. Fox et al. have proposed adding a ‘sticky’ prior (κ) to the transition matrix to emulate an inertia against changing states, illustrated in Figure 2 [27]. We utilise the sticky prior in this study, yet still denote the model as HDP-HMM for brevity.
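The effect of the sticky prior can be sketched in a finite truncation of the model: the sticky mass κ is added only to the self-transition entry of each row's Dirichlet concentration. The hyperparameter values below are illustrative, not the paper's settings.

```python
import numpy as np

def sample_sticky_rows(beta, alpha, kappa, rng):
    """Sample transition-matrix rows pi_j ~ Dir(alpha * beta + kappa * e_j):
    the sticky mass kappa is added to the self-transition entry only,
    creating inertia against state changes (in the spirit of Fox et al.)."""
    K = len(beta)
    rows = np.empty((K, K))
    for j in range(K):
        conc = alpha * np.asarray(beta, dtype=float)  # shared HDP weights
        conc[j] += kappa                              # extra self-transition mass
        rows[j] = rng.dirichlet(conc)
    return rows

rng = np.random.default_rng(0)
beta = np.full(4, 0.25)                               # uniform top-level weights
sticky = sample_sticky_rows(beta, alpha=5.0, kappa=50.0, rng=rng)
# Expected self-transition mass per row: (alpha*beta_j + kappa) / (alpha + kappa),
# here (1.25 + 50) / 55, so the diagonal dominates each row.
```

With κ = 0 the construction falls back to the standard (non-sticky) HDP-HMM rows.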
2.2 Inference and Learning
Inference and learning are typically performed simultaneously in the HDP and its extensions by estimating the joint posterior distribution of the indicator variables, parameters, hidden variables and hyperpriors conditioned on the observations. Deriving such an extensive joint posterior is analytically intractable, and it is hence mainly inferred using Gibbs sampling or variational inference. Gibbs sampling is a simple yet effective method capable of estimating complex posteriors with significant accuracy, yet it can converge slowly or permanently remain in a local minimum (poor mixing). Variational inference is usually faster to compute; however, it requires the prior derivation of analytical approximations and can suffer from low accuracy due to the approximation. Despite the negative presumptions about Gibbs sampling’s efficiency, we will show how a brief initial supervised learning can result in rapid convergence to accurate distributions.
Having inferred the class indicators, z_t, we proceed with translating the indices into meaningful classes. In unsupervised learning, the correspondence between the ground-truth classes of the data and the labels assigned by the classification algorithm may not be obvious. In the case of the HDP, this problem is exacerbated by the fact that the number of classes is undetermined. Therefore, to re-establish the best possible one-to-one correspondence, the Hamming distance between ground-truth and assigned labels is minimised by a greedy algorithm, matching labels in decreasing frequency order.
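The greedy matching just described can be sketched as follows. This is an illustrative implementation, not the authors' code; the function name and tie-breaking details are ours.

```python
import numpy as np

def greedy_label_match(gt, pred):
    """Greedily map predicted labels onto ground-truth labels.

    Predicted labels are processed in decreasing order of frequency; each
    is mapped to the unused ground-truth label with which it co-occurs
    most often, which greedily minimises the Hamming distance under a
    one-to-one constraint. Surplus predicted labels keep their own index."""
    gt, pred = np.asarray(gt), np.asarray(pred)
    mapping, used = {}, set()
    labels, counts = np.unique(pred, return_counts=True)
    for p in labels[np.argsort(-counts)]:          # most frequent first
        mask = pred == p
        cands, cocounts = np.unique(gt[mask], return_counts=True)
        for g in cands[np.argsort(-cocounts)]:     # best co-occurrence first
            if g not in used:
                mapping[p] = g
                used.add(g)
                break
        else:
            mapping[p] = p                         # surplus (extra) class
    return np.array([mapping[p] for p in pred])
```

For example, predictions `[5,5,7,7,9,9]` against ground truth `[0,0,1,1,2,2]` are remapped to `[0,0,1,1,2,2]`, after which frame-level accuracy can be read off directly.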
3 The Adaptive Online HDP-HMM
The proposed AdOn HDP-HMM uses a supervised initialisation (bootstrap) over an initial span of frames, followed by the main unsupervised adaptive online inference (Figure 3). The extent of the supervised phase varies with the application: where annotation is easy, the bootstrap can be longer to provide a more comprehensive training, while in domains with costly annotation the bootstrap will be brief. In either case, during supervised learning the indicator variables are fixed to their ground-truth values, and the model’s parameters are sampled for a given number of iterations to reach convergence. After the conclusion of the bootstrap phase, the data are processed in successive batches, and the posterior probabilities of both the indicator variables and the parameters are estimated iteratively on each batch.
Considering a generic stream of data, x_{1:t}, the posterior probability of the parameters can be written as p(θ | x_{1:t}), where θ indicates the parameter vector of Figure 2. In the case of the HDP-HMM, the parameter vector is θ = {θ_{1:K}, π_{1:K}, β}, where θ_{1:K} are the parameters of the emission densities, π_{1:K} are the transition probabilities (and weights of the lower-level DPs), and β are the weights of the higher-level DP. Further, since we assume normal densities, we have θ_k = {μ_k, Σ_k}, with μ_k and Σ_k the usual mean and covariance parameters. The online version leverages posterior adaptation, using the posterior computed up to time t as the prior for the next batch of data:
where b is the batch number (Figure 4). Given that the updated posterior embeds the distributional properties of the observations up to the current time, the observations in Equation 3 can be discarded after adaptation. This implies that the accumulated sufficient statistics of previous data are propagated parametrically, and the nonparametric nature of the model relates to the inference over the current data batch. With that, the model carries all the prior learning and infers new labels using a limited memory buffer. While this may come at the price of reduced accuracy, to our knowledge it is the only viable approach for unbounded streaming data. In contrast, [16] presents online inference for latent Dirichlet allocation, yet over an unbounded buffer. Our work extends that model to infinite class sets while meeting the finite memory requirements of sequential data processing.
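The posterior-as-prior recursion of Equation 3 can be illustrated on a deliberately simple conjugate model, a Gaussian with unknown mean and known variance: each batch's posterior becomes the next batch's prior, and the raw observations are discarded after the update. The toy model and all numbers are ours, for illustration only.

```python
import numpy as np

def update_gaussian_posterior(mu0, tau0, batch, sigma2=1.0):
    """One step of the recursion: Normal prior N(mu0, tau0) on an unknown
    mean, known observation variance sigma2. The returned posterior
    (mu_n, tau_n) serves as the prior for the next batch."""
    n = len(batch)
    tau_n = 1.0 / (1.0 / tau0 + n / sigma2)              # posterior variance
    mu_n = tau_n * (mu0 / tau0 + np.sum(batch) / sigma2)  # posterior mean
    return mu_n, tau_n

rng = np.random.default_rng(0)
mu, tau = 0.0, 100.0            # vague initial prior
for _ in range(10):             # a stream of 10 batches
    batch = rng.normal(3.0, 1.0, size=50)
    mu, tau = update_gaussian_posterior(mu, tau, batch)
    # batch is discarded here: only (mu, tau) carry the past forward
# mu ends up close to the true mean (3.0), with a sharply reduced tau
```

The HDP-HMM version follows the same principle, but propagates the full set of emission and transition hyperparameters rather than a single mean.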
3.1 Learning rate adaptation
In the proposed adaptive system, a learning rate is applied over the prior and noted as λ in the following. In each batch, λ is responsible for setting the weight of the prior learning on the model’s parameters, θ. In other words, our target is to balance the impact of the current observations with the previous learning accumulated along the previous batches. This can augment or weaken the posterior learning’s ‘inertia’ in ‘adapting’ to the current data (likelihood), as opposed to retaining ‘memory’ (prior).
It is worth noting that the length of the current batch, compared to the number of past samples, plays a role in their relative influence on the posterior parameters (see Appendix A for more details). Accordingly, λ can be articulated as a scaling factor on the number of ‘pseudo-observations’ in the prior, to balance it against the respective number for the current batch.
For prior distributions belonging to the exponential family, this proposition does not violate Bayes’ theorem, thanks to the properties of canonical parameters. Accordingly, we use exponential family likelihoods and priors for easier integration of the learning rate into the model. Hereby, we focus on the prior in Equation 4 (in bold font) and its hyperparameters, translating them into exponential family notation. The standard parameters are converted into the corresponding canonical parameters, η, and we make their dependence on the hyperparameters, ω, explicit:
Adding the learning rate, λ, as an exponent to this prior does not alter the type of distribution. Rather, it updates the canonical parameters of the prior, ultimately affecting its weight in the resulting posterior. Please note that we only need to derive a proportional posterior for sampling purposes. Hence, the exponent on any term independent of the parameters can be ignored thanks to the proportionality. The normalisation coefficient can be merged into the sufficient statistics, assuring that its exponent is absorbed into the scaled canonical parameters (λη).
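The effect of exponentiating an exponential-family prior can be checked on the simplest case, a Beta prior for a coin bias: raising the density to the power λ rescales the canonical parameters (the 'pseudo-observations') while keeping the distribution in the Beta family. The model and numbers below are our own illustration.

```python
def tempered_beta_posterior(a, b, lam, heads, tails):
    """Posterior for a coin bias under a Beta(a, b) prior raised to the
    power lam. Since Beta(a, b) has canonical parameters (a - 1, b - 1),
    raising the density to lam gives Beta(lam*(a-1) + 1, lam*(b-1) + 1):
    the pseudo-observations shrink or grow, Bayes' rule is otherwise
    unchanged, and the usual conjugate count update then applies."""
    a_t = lam * (a - 1) + 1
    b_t = lam * (b - 1) + 1
    return a_t + heads, b_t + tails

# A strong prior believing the bias is ~0.9, contradicted by 50/50 data:
for lam in (1.0, 0.1):
    a_n, b_n = tempered_beta_posterior(90, 10, lam, heads=50, tails=50)
    print(lam, a_n / (a_n + b_n))   # posterior mean: 0.70 for lam=1, ~0.54 for lam=0.1
```

With λ = 1 the prior dominates (posterior mean 0.70); with λ = 0.1 the tempered prior is nearly overruled by the data (posterior mean ≈ 0.54), which is exactly the memory-versus-adaptability trade-off described above.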
In general terms, the posterior distribution of the parameters, θ, in the presence of the data samples in the current batch can be inferred as follows:
In our case, the parameters are those of the HDP-HMM and their priors are a Normal-Inverse-Wishart distribution for μ and Σ, and the HDP for β and π. Given that both the NIW distribution and the Dirichlet process are members of the exponential family, Equation 7 shows a unified way of inferring posterior parameters in canonical form [29]:
In the following subsections, we present the prior distribution of each parameter under the learning rate, and the posterior distribution of the corresponding learning rate.
Inference of the covariance matrix
We infer Σ and μ in the Normal-Inverse-Wishart prior by first sampling Σ using an Inverse-Wishart (IW) distribution, and then using Σ to sample μ from a Normal distribution [30]. The learning rate for Σ is noted as λ_Σ in the text. Yet, to avoid cluttering the notation in the equations, we simply note it as λ in Equation 8.
As mentioned earlier, the addition of a positive learning rate as an exponent on the IW prior does not alter the type of distribution, and it can be merged into the hyperparameters. Below, we convert the hyperparameters into natural form (η) to show the impact of λ more clearly. Ultimately, they are converted back to standard form to show the linear transformation caused by the learning rate.
Inference of λ_Σ
To sample λ_Σ from the posterior, ideally we would like to consider a conjugate prior from which the posterior hyperparameters can be derived analytically, given those of the prior and the sufficient statistics of the current data. A candidate conjugate prior for the IW distribution is the Gamma. However, the Inverse-Wishart is only conjugate to the Gamma as the prior for its scale parameter (or for a scaling coefficient of the scale matrix in the multivariate case). Hence, a Gamma cannot be used as a conjugate prior for deriving the posterior of λ_Σ in a maximum-a-posteriori solution (Appendix B presents the proof).
Therefore, we utilise a maximum-likelihood solution to derive the posterior hyperparameters for λ_Σ. The posterior for λ_Σ is modelled using an Inverse-Gamma (IG) distribution, the univariate correspondent of the Inverse-Wishart. The samples of the IG are positive real values, suitable for the scalar learning rate λ_Σ. The distributions are displayed below.
Comparing the univariate IW and IG in Equation 9, we can derive the posterior parameters as:
As can be seen, the hyperparameters of the Inverse-Wishart map to those of the Inverse-Gamma. The only issue is how to best map the scale matrix into the scalar scale parameter of the Inverse-Gamma. We propose to use the largest eigenvalue of the scale matrix. Approximating a covariance matrix via its first principal components is a meaningful and common approach [31] [32]. Another choice could be the determinant, which can act as the scale associated with a square matrix since it is equal to the product of all its eigenvalues. While it gives a more thorough account of all the eigenvalues, it becomes unsuitable when the dimensionality is high and many of the eigenvalues are close or equal to zero. Moreover, calculating the determinant of a high-dimensional matrix is very costly in an online context. Therefore, the first approach is generally preferable.
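A minimal sketch of the two candidate reductions (our own illustration, not the paper's code) makes the failure mode of the determinant concrete:

```python
import numpy as np

def scalar_scale(S, method="eig"):
    """Reduce a (symmetric, positive semi-definite) scale matrix S to a
    scalar. 'eig' uses the largest eigenvalue (first principal component),
    which stays well behaved in high dimensions; 'det' uses the
    determinant, the product of all eigenvalues, which collapses towards
    zero whenever some eigenvalues are near zero."""
    if method == "eig":
        return np.linalg.eigvalsh(S)[-1]   # eigvalsh returns ascending order
    return np.linalg.det(S)

# A 3-D scale matrix with one near-degenerate direction:
S = np.diag([4.0, 1.0, 1e-8])
print(scalar_scale(S, "eig"))   # 4.0: robust to the degenerate direction
print(scalar_scale(S, "det"))   # 4e-8: vanishes with the smallest eigenvalue
```

In higher dimensions the contrast sharpens: a single tiny eigenvalue drives the determinant to numerical zero, while the largest eigenvalue is unaffected.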
So far, we have established a way to infer λ_Σ from a single IW distribution belonging to a single state. Considering the proposed model with infinite states, we need to merge the IW parameters of all classes to infer λ_Σ. This is done through a weighted average, where the weights are the frequencies of observations for each state, i.e. the degrees of freedom in the IW parameters.
Inference of the mean
Having inferred Σ, the next step is to derive the multivariate mean, μ, in the NIW prior. Let us consider a generic multivariate Normal distribution with known covariance. To observe the impact of the learning rate, we convert its parameters into natural form and multiply them by the learning rate λ_μ, ultimately reverting them back to the standard format:
Inference of λ_μ
Posterior sampling of λ_μ is conducted with a similar approach to λ_Σ, but using a Gamma conjugate prior. This time the Gamma prior is conjugate by definition, since its sample is utilised merely as a scaling coefficient for the covariance. The detailed proof of the conjugacy is provided in Appendix C. Similarly to λ_Σ, the weighted average of the sufficient statistics across all classes is used to infer λ_μ.
Inference of the HDP transition parameters
Thus far, we have discussed the adaptation of the learning rate for the emission parameters. The other main set of parameters in our AdOn HDP-HMM are the HDP’s β and π parameters, which jointly and hierarchically cater for the transition probabilities. The distributions of these parameters are shown in Equation 13, where the sufficient statistics represent the frequency of occurrence of each class:
Similarly to the previous parameters, we illustrate the impact of the learning rates, λ_β and λ_π, on the hyperparameters of the above Dirichlet distributions in standard form, and infer the posterior samples for these learning rates:
Inference of λ_β and λ_π
To the best of our knowledge, there are no conjugate priors over a scaling factor for the parameters of a Dirichlet distribution in the presence of an intercept. Hence, we estimate the next batch’s learning rate using a Metropolis-Hastings (MH) jump. This approach is used in several other studies (such as [33] [34]) and is a valid MCMC move. For the MH step, one can choose a suitable candidate function, and the samples are accepted with an acceptance probability, a.
To sample λ_β, we select the candidate function as the prior over the learning rate. The new sample (λ*) is accepted with the probability in Equation 14, updating λ_β for the current batch with the accepted sample. An identical approach can be taken for λ_π by replacing β with π in Equation 14. The subscripts on λ are removed in Equation 14 to avoid notational clutter.
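This Metropolis-Hastings step can be sketched generically: with the prior as an independence proposal, the prior and proposal terms cancel in the acceptance ratio, which reduces to the likelihood ratio. The toy likelihood and prior below are our own assumptions for illustration.

```python
import numpy as np

def mh_learning_rate(lam, log_like, prior_sample, rng, n_steps=100):
    """Metropolis-Hastings update for a scalar learning rate.
    With the prior as the (independence) proposal, the acceptance
    probability simplifies to a = min(1, L(lam_star) / L(lam))."""
    for _ in range(n_steps):
        lam_star = prior_sample(rng)                  # candidate drawn from the prior
        log_a = log_like(lam_star) - log_like(lam)    # log likelihood ratio
        if np.log(rng.uniform()) < log_a:
            lam = lam_star                            # accept the jump
    return lam

# Toy check: a likelihood sharply peaked at lam = 1 (hypothetical numbers).
rng = np.random.default_rng(1)
lam = mh_learning_rate(
    lam=0.2,
    log_like=lambda l: -50.0 * (l - 1.0) ** 2,
    prior_sample=lambda r: r.gamma(2.0, 0.5),          # Gamma prior with mean 1
    rng=rng,
)
```

After a modest number of steps the chain settles near the likelihood peak, so the accepted λ reflects both the prior and the current batch.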
3.2 Discussion on the learning rates
As per the above sections, the learning rates for the various parameters are inferred separately, to allow more degrees of freedom for the independent adaptation of each parameter. The empirical results support this, as each of the learning rates (λ_Σ, λ_μ, λ_β, λ_π) can adapt differently over the same sequence of data, depending on the complexity of the data and the degree of evolution in the emissions and state transitions. Nevertheless, their impact on the mean and covariance of the respective posterior distributions tends to follow a similar pattern. As clearly shown for the mean (Equation 12), the learning rate does not change the mean, but inversely scales the covariance (see Appendix D for more details). Accordingly, for λ < 1 the posterior distribution is more driven by the current observations, while for λ > 1 the inferred parameters follow the prior distribution more closely. In the following experiments, the dynamics of λ with respect to the data are explored more extensively.
4 Experiments
The experiments aim to explore the effectiveness of the proposed AdOn HDP-HMM for segmentation and classification in a variety of scenarios. To closely examine the adaptability of the model, we have designed several synthetic datasets with stationary and evolutionary distributions. This also allows us to investigate the effect of the learning rates in enhancing adaptability; an adaptive learning rate is noted concisely as ‘ada λ’, while the basic alternative with a fixed learning rate is noted as ‘λ = 1’. Continuing with two video datasets, we demonstrate the performance of the proposed model on various challenging sequences with noisy data, abrupt changes and new classes in the test data. It is important to mention that the degree of challenge in the synthetic experiments is not easily comparable to that of the video data, due to differences in the nature of the signals, the noise and, most importantly, the degree of evolution, which is stronger by design in the synthetic data. Hence, analysing both categories of experiments can shed more light on the adaptability of AdOn HDP-HMM in various contexts.
To evaluate the results more comprehensively, metrics for both classification and time segmentation performance are introduced. For classification accuracy, we have used a frame-level comparison of the decoded classes with the ground truth (based on the Hamming distance). To evaluate time segmentation, the standard metrics of precision and recall are utilised to indicate the accuracy of detecting the boundaries between segments. A true boundary is regarded as correctly detected if a change of state is decoded within an interval of ±δ frames from the ground-truth location, where δ is set to 10 percent of the average segment length. Any additional detected boundaries are counted as false positives. We also report the difference between the overall number of actions detected in the test sequence and the number of actions in the ground truth (noted as cardinality, with an ideal value of zero).
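These boundary metrics can be sketched as follows. This is an illustrative implementation of the stated rule; the exact matching and tie-breaking details are our assumption.

```python
import numpy as np

def boundary_prf(gt_labels, pred_labels, tol_frac=0.10):
    """Precision/recall for segment boundaries. A true boundary counts as
    detected if a predicted change of state falls within +/- delta frames
    of it, where delta is tol_frac of the average segment length. Each
    predicted boundary may match at most one true boundary; unmatched
    predicted boundaries are false positives."""
    def boundaries(x):
        x = np.asarray(x)
        return np.flatnonzero(x[1:] != x[:-1]) + 1   # indices where the label changes
    gt_b, pred_b = boundaries(gt_labels), boundaries(pred_labels)
    avg_seg = len(gt_labels) / (len(gt_b) + 1)       # segments = boundaries + 1
    delta = tol_frac * avg_seg
    matched, tp = set(), 0
    for b in gt_b:
        hits = [p for p in pred_b if abs(p - b) <= delta and p not in matched]
        if hits:
            matched.add(hits[0])
            tp += 1
    recall = tp / len(gt_b) if len(gt_b) else 1.0
    precision = tp / len(pred_b) if len(pred_b) else 1.0
    return precision, recall

gt = [0] * 10 + [1] * 10 + [2] * 10          # true boundaries at frames 10 and 20
pred = [0] * 10 + [1] * 9 + [2] * 11         # boundaries at 10 and 19, both within delta = 1
```

With the example above both boundaries are matched (precision = recall = 1); inserting a spurious segment in `pred` lowers the precision while leaving the recall intact.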
The empirical results are quantitatively reported in tables, and also visualised in colour plots of ground-truth vs. estimated labels. In each illustration (for instance, Figure ?), the horizontal axis is time and the estimated labels are plotted on top of the true labels, providing a qualitative measure of the segmentation and classification performance. These plots are best viewed in colour.
4.1 Synthetic data
The basic framework of the synthetic dataset is generated from a univariate HMM with 5 states distributed around dispersed means with unit variance, and a Dirichlet-distributed transition matrix. This generative model is similar to the AdOn HDP-HMM, but not an exact replicate, due to the absence of the HDP prior and of learning-rate adaptation in the generative process (please refer to Figures 2 and 4 for comparison).
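A generator for this synthetic setting might look as follows. This is a sketch under the stated configuration; the exact means and Dirichlet hyperparameters used in the paper are assumptions.

```python
import numpy as np

def sample_hmm(T, means, sigma, trans_alpha=1.0, seed=0):
    """Generate a univariate HMM sequence: K states around dispersed means
    with a common standard deviation, and a transition matrix whose rows
    are drawn from a symmetric Dirichlet distribution."""
    rng = np.random.default_rng(seed)
    K = len(means)
    A = rng.dirichlet(np.full(K, trans_alpha), size=K)   # K x K transition matrix
    states = np.empty(T, dtype=int)
    states[0] = rng.integers(K)
    for t in range(1, T):
        states[t] = rng.choice(K, p=A[states[t - 1]])    # Markov transition
    obs = rng.normal(np.asarray(means, dtype=float)[states], sigma)
    return states, obs

# Five well-separated classes with unit variance (illustrative means):
states, obs = sample_hmm(100, means=[0, 20, 40, 60, 80], sigma=1.0)
```

The noisy and evolutionary variants below follow from small changes to this generator: raising `sigma`, adding a per-step drift to `means`, or introducing a sixth mean partway through the sequence.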
Stationary distributions
Given the above basic configuration, the stationary experiments are run over 3 sequences of length 100, trained using leave-one-out cross-validation. Hence, the distributions of the training and test samples are the same. The test sequence is split into batches with an approximate size of 16 time units. To provide adaptation, the inferred parameters of each batch are propagated into the next batch as priors.
The proposed Adaptive Online HDP-HMM is able to recognise and segment this basic version with 100 percent accuracy, whether or not the learning rates are used. To probe the model further, we add significant noise to the above model by increasing the standard deviation to 50, thereby causing a considerable overlap between the distributions of the states (Figure ?). Despite this substantial noise, the model remains significantly accurate, with an average of 76.3 percent frame-level accuracy. Repeating this experiment on the same data, yet with fixed learning rates (λ = 1), shows a noticeable decline in accuracy of 3 percentage points and undesirable extra states. Table 1, first two rows, reports the detailed figures in terms of accuracy, recall, precision and the number of inferred states.
Table 1: Segmentation and classification results on the synthetic datasets.

Experiment                               Accuracy   Recall   Precision   Cardinality
Stationary, Noisy (ada λ)                  0.76      0.92      0.92         0.33
Stationary, Noisy (λ = 1)                  0.73      0.89      0.93         1.7
Evolutionary, shifting mean (ada λ)        0.97      0.97      0.99         0
Evolutionary, shifting mean (λ = 1)        0.71      0.99      1            1
Evolutionary, new class (ada λ)            1.00      1.00      1.00         0
Evolutionary, new class (λ = 1)            0.86      0.86      0.98         0
Evolutionary, combined (ada λ)             0.93      1.00      1.00         1
Evolutionary, combined (λ = 1)             0.81      0.95      0.97         2
Evolutionary distributions
A more advanced experiment is designed by training the model on synthetic data with evolving distributions, either involving gradual shifts of the means of each class or including new, unseen classes. The standard deviation for this experiment is set to a common, fixed value for all classes.
Shifting class means: To examine the adaptability of the model, we drift the class means by a constant amount at each time step; as a result, an instance appearing late in the test sequence is generated from a distribution with its mean shifted by 5 units. For a non-adaptive model, and given the synthetic generation scheme, such data can cause significant classification errors after a few tens of time units. However, the results of the Adaptive Online HDP-HMM demonstrate smooth adaptation and excellent accuracy over the evolving sequence (Figure ?). There are a few misclassifications towards the end of the sequence, due to the heavy distributional drift. Comparison between these results and those with fixed learning rates shows a significant drop of 26 percentage points in accuracy and one undesirable new class (see Table 1, section ii).
New classes: In this experiment, the distributions do not shift, yet one new class appears partway through the sequence, with the same standard deviation as the other classes. The model is able to create a new state (shown with a random new colour in Figure ?), learn it and consistently recognise it in the later batches, without distorting the parameters of the existing classes. The overall accuracy of 100 percent for this experiment is mostly thanks to the contribution of the learning rate in adjusting the variances of each class with respect to the degree of adaptation. Not using the learning rates reduces accuracy considerably (by 14 percentage points) due to drift in the existing classes (Table 1), also exhibiting one extra class and reduced recall and precision.
Combination of the two: Combining the above two evolutionary scenarios, we test the proposed model on a sequence with a new class that needs to be distinguished among the existing, shifting classes. The challenge is twofold: i) the shifting modes are prone to being misclassified as new classes, and ii) the new class might be merged into one of the existing, shifted modes. This experiment is the closest to challenging real-world scenarios, where new states are likely to appear while the distributions change over time. Given the combined challenge, the AdOn HDP-HMM proves highly accurate (93 percent), exhibiting a considerable improvement in accuracy (12 percentage points) and in the cardinality of the states thanks to the learning rate mechanism.
The performance of the Adaptive Online HDP-HMM is not perturbed by these challenges because the learning rate tunes the adaptability of the parameters with respect to the observed data. In an evolutionary scenario, the likelihood of the observations given the current parameters is low. This causes the posterior covariance learning rate to increase, keeping the variance close to its prior. This, in turn, prevents a drift of the variance towards large values and allows the mean to evolve. The empirical distribution shown in Figure ?b, concentrated around zero, supports this claim.
In the absence of the learning rate, the model still learns and recognises the new state thanks to the properties of the HDP; however, the overall performance deteriorates. On the one hand, new undesirable classes appear in response to drift. On the other, some of the existing classes collapse into a single one, owing to the considerable increase in variance caused by the class shifts. This unchecked increase in variance does not allow the means to evolve, ultimately forcing the model to merge some of the neighbouring states into a single class with a large variance (Figure ?e,f).
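The effect described above can be sketched numerically: when a batch has drifted, squared deviations from the stale mean would inflate a naively updated variance, while down-weighting the batch's contribution to the variance (as a large posterior covariance learning rate does in the sampler) keeps it near the prior. The toy calculation below is our own, not the paper's actual Gibbs update:

```python
import numpy as np

rng = np.random.default_rng(1)
prior_var, mu = 1.0, 0.0                  # current class model: N(0, 1)
batch = rng.normal(5.0, 1.0, size=50)     # incoming batch, drifted by +5 units

# Naive update: deviations from the stale mean inflate the variance estimate.
var_naive = ((batch - mu) ** 2).mean()

# Guarded update: give the batch a small weight `w` in the variance update,
# mimicking the effect of a large posterior covariance learning rate.
w = 0.02
var_guarded = (prior_var + w * var_naive) / (1.0 + w)
```

With the variance held near its prior, the mean update stays well-conditioned and can track the +5 shift; with the inflated variance, neighbouring classes begin to overlap and merge.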
4.2 Activity recognition datasets
In this section, we use two video datasets to assess the performance of the proposed model in activity recognition scenarios.
Collated Weizmann dataset
The Weizmann dataset contains 93 single-action videos from a set of 10 classes performed by 9 different actors. While recognition accuracy on the original dataset is saturated [35] [36], some studies have collated its individual actions into (unsegmented) sequences to experiment with time segmentation [5]. In a similar way, we have created 4 sequences, each consisting of 12 random actions selected from the provided action classes and spanning approximately 900 frames. As the feature set, we have used the position of the actor's centroid in the image plane and the distances between the centroid and the actor's contour along five given directions [37].
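The feature computation can be sketched as follows for a binary silhouette mask. The five directions and the exact contour-distance definition are assumptions for illustration, not reproduced from [37]:

```python
import numpy as np

def silhouette_features(mask, angles_deg=(0, 45, 90, 135, 180)):
    """Centroid of a binary silhouette plus, for each direction, the largest
    projection of a silhouette pixel onto that direction (a simple proxy
    for the centroid-to-contour distance)."""
    ys, xs = np.nonzero(mask)
    cy, cx = ys.mean(), xs.mean()
    feats = [cx, cy]
    for a in np.deg2rad(angles_deg):
        proj = np.cos(a) * (xs - cx) + np.sin(a) * (ys - cy)
        feats.append(proj.max())
    return np.array(feats)
```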


Table 2: Results over the four collated Weizmann sequences.

| Method | S1 | S2 | S3 | S4 | S1 | S2 | S3 | S4 | S1 | S2 | S3 | S4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Online HDP-HMM (adaptive) | 0.82 | 0.76 | 0.89 | 0.81 | 0.92 | 0.66 | 0.95 | 0.80 | 0 | 0 | 0 | 0 |
| Online HDP-HMM (fixed) | 0.81 | 0.70 | 0.95 | 0.80 | 0.92 | 0.66 | 0.89 | 0.80 | 0 | 1 | 0 | 1 |
| Offline HDP-HMM | 0.78 | 0.76 | 0.95 | 0.81 | 0.91 | 0.66 | 0.95 | 0.81 | 0 | 0 | 0 | 0 |
| Offline Max-margin | | | | | | | | | | | | |
The estimated states of the AdOn HDP-HMM variants over the above sequences are visualised in Figure ?, showing remarkable qualitative accuracy in segmentation and classification. The quantitative results are reported in Table 2, including an Offline variant in which the whole test sequence is processed as a single batch. This variant is run for the sake of comparison with a similar offline max-margin study [5]. However, the results are not directly comparable, for two reasons: a) the datasets are similar in conception yet different in sequence collation, and b) the classifier in [5] operates over a closed set of classes, as opposed to ours, which allows an unlimited number of classes. The results with the fixed learning rate show a trend similar to the adaptive one, with only slightly lower average accuracy. This can be attributed to the stationary nature of the dataset: training and test sequences are drawn from similar distributions, so adaptation plays a minor role. In addition, accuracy with online processing shows no noticeable deterioration with respect to the full, offline processing.
TUM kitchen dataset
The TUM kitchen dataset is a human assistive dataset consisting of natural, unsegmented sequences of everyday activities performed in a typical kitchen environment [7]. The dataset contains multimodal data, annotated separately for the actors' left and right hands (9 classes) and torso (2 classes). The features are 28-D vectors of joint coordinates for the torso and the relevant hands. The main actions include ‘Reaching’, ‘Releasing Grasp Of Something’, ‘Taking An Object’, ‘Reaching Upward’, ‘Lowering An Object’, opening and closing doors and drawers, and ‘Carrying While Locomoting’, whose distinction is at times quite subtle even for human annotators. The main advantage of this dataset over the collated Weizmann is that transitions between actions occur naturally and the boundaries are vague even to human annotators, making time segmentation more challenging.
In our experiments, we have performed segmentation and classification on the actions of the left and right hands separately. All the sequences provided by the 3D motion capture sensors are used in leave-one-out cross-validation tests. Experiments are run both for the typical sequences (denoted as ‘robotic’, taking objects one by one) and for the more challenging ones (‘complex’, with multiple objects moved together, in arbitrary order and repeatedly).
For a general study of performance, we run an experiment on all the above sequences, involving both the robotic and the complex sets. They differ in state transition probabilities, height and size of actors, and frequencies of action occurrence. The experiment is repeated with fixed and adaptive learning rates and the results are compared in Table 3, generally showing closely matching frame-level accuracy. The comparison between similar sequences with fixed vs. adaptive learning rate shows a minor improvement in frame-level accuracy and a significant decrease in state cardinality error. Note that the figures under cardinality show the difference between the inferred and the actual number of states. To facilitate visual evaluation, 4 of the sequences are colour-plotted in Figure ?. It is worth noting that classes in this dataset may prove hard to segment. For instance, the distinction between putting an object on the table and releasing grasp of it can be very subtle (the back-to-back lavender-blue and light-blue colours in Figure ?). This becomes more challenging when a model has extra degrees of freedom to derive a dynamic number of classes, and explains the negative cardinality in the results.
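For clarity, the two figures reported throughout the tables can be computed as below (assuming inferred states have already been matched to ground-truth classes, a step we do not reproduce here):

```python
import numpy as np

def frame_accuracy(pred, gt):
    """Fraction of frames whose predicted label matches the ground truth."""
    pred, gt = np.asarray(pred), np.asarray(gt)
    return float((pred == gt).mean())

def cardinality_error(pred, gt):
    """Inferred minus actual number of distinct states; a negative value
    indicates under-segmentation (classes merged)."""
    return len(set(pred)) - len(set(gt))
```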
To specifically observe the adaptive behaviour, we have trained the model on the robotic sequences and tested it on the complex ones. Although the emission parameters might not change radically in this scenario, the transition probabilities need to adapt due to the changed order of actions in the complex set. Table 4 shows the remarkable contribution of the learning rate, mainly in cardinality and overall accuracy. As with the synthetic results, in the presence of learning rates the model is able to prevent an excessive increase of the variance and to keep neighbouring classes from collapsing into one (a phenomenon observable in Figure ?d,e).
To evaluate the ability to recognise new classes, we have taken the first 4 sequences and removed the observations related to ‘Lowering an object’ (shown in lavender-blue in Figure ?f) from all but the first sequence. We have then trained the model on sequences 2-4 and tested on the sequence containing the new action. AdOn HDP-HMM is able to recognise the new action (brown in Figure ?f) and learn its parameters, with consistent future recognition. This significant property of the model is inherent to the HDP approach, and the behaviour is similar irrespective of whether the learning rate is utilised.
The closest study on the TUM kitchen dataset leverages a CRF [7]. This method is not directly comparable to ours, since AdOn HDP-HMM is online, adaptive and operates over a dynamic class set. To create a closer match, we have run the Offline variant of AdOn HDP-HMM, whose results are similar to the CRF's and outperform it on the complex sequences. This finding aligns with our principal claim that adaptability leads to remarkable improvements when the test distributions differ from the training ones. The distributions of the transition-related learning rates in these experiments are mainly peaked around 0.1, indicating that the learning rates encourage the model to rely on the observed data to infer the HDP transition probabilities, which translates into more adaptability.


Table 3: Results over the TUM kitchen sequences under adaptive (ada) vs. fixed learning rate; accuracy in the first four value columns, state cardinality error in the last four.

| Sequence | ada | fixed | ada | fixed | ada | fixed | ada | fixed |
|---|---|---|---|---|---|---|---|---|
| Online Seq 0-0 | 0.79 | 0.81 | 0.73 | 0.71 | 0 | 1 | 1 | 1 |
| Online Seq 0-1 | 0.79 | 0.82 | 0.75 | 0.75 | 2 | 1 | 0 | 1 |
| Online Seq 0-2 | 0.76 | 0.70 | 0.78 | 0.75 | 2 | 1 | 1 | 1 |
| Online Seq 0-3 | 0.84 | 0.84 | 0.67 | 0.69 | 0 | 2 | 0 | 1 |
| Online Seq 0-4 | 0.70 | 0.69 | 0.71 | 0.72 | 1 | 1 | 2 | 3 |
| Online Seq 0-6 | 0.51 | 0.48 | 0.56 | 0.55 | 3 | 6 | 1 | 3 |
| Online Seq 0-7 | 0.45 | 0.48 | 0.57 | 0.55 | 3 | 4 | 1 | 3 |
| Online Seq 0-8 | 0.64 | 0.68 | 0.62 | 0.63 | 1 | 3 | 2 | 2 |
| Online Seq 0-9 | 0.73 | 0.71 | 0.70 | 0.69 | 0 | 2 | 1 | 2 |
| Online Seq 0-10 | 0.79 | 0.79 | 0.68 | 0.70 | 0 | 2 | 0 | 1 |
| Online Seq 0-11 | 0.70 | 0.76 | 0.63 | 0.63 | 5 | 4 | 2 | 3 |
| Online Seq 0-12 | 0.64 | 0.64 | 0.58 | 0.55 | 1 | 2 | 3 | 5 |
| Online Seq 1-0 | 0.65 | 0.69 | 0.68 | 0.69 | 1 | 2 | 3 | 4 |
| Online Seq 1-1 | 0.71 | 0.69 | 0.65 | 0.62 | 1 | 1 | 2 | 3 |
| Online Seq 1-2 | 0.63 | 0.63 | 0.76 | 0.74 | 1 | 1 | 0 | 1 |
| Online Seq 1-3 | 0.14 | 0.14 | 0.66 | 0.65 | 6 | 6 | 0 | 0 |
| Online Seq 1-4 | 0.64 | 0.67 | 0.74 | 0.71 | 4 | 5 | 0 | 1 |
| Online Seq 1-5 | 0.67 | 0.67 | 0.61 | 0.61 | 0 | 0 | 2 | 2 |
| Online Seq 1-7 | 0.69 | 0.68 | 0.60 | 0.58 | 1 | 1 | 1 | 2 |
| Avg Online | 0.80 | 0.79 | 0.73 | 0.73 | 1.00 | 1.25 | 0.5 | 1.00 |
| Avg Offline | 0.80 | 0.81 | 0.74 | 0.73 | 1.00 | 1.50 | 1.00 | 1.10 |
| Avg Offline CRF | | | | | | | | |
| Avg Online | 0.66 | 0.66 | 0.67 | 0.66 | 1.68 | 2.37 | 1.16 | 2.26 |
| Avg Offline | 0.66 | 0.66 | 0.67 | 0.66 | 1.48 | 2.28 | 1.23 | 2.26 |
| Avg Offline CRF | | | | | | | | |




Table 4: Results when training on the robotic sequences and testing on the complex ones, under adaptive (ada) vs. fixed learning rate; accuracy in the first four value columns, state cardinality error in the last four.

| Sequence | ada | fixed | ada | fixed | ada | fixed | ada | fixed |
|---|---|---|---|---|---|---|---|---|
| Online Actor1, complex | 0.73 | 0.72 | 0.65 | 0.68 | 2 | 3 | 2 | 2 |
| Online Actor3, complex | 0.55 | 0.54 | 0.52 | 0.49 | 1 | 3 | 4 | 6 |
| Online Actor1, repetitive | 0.45 | 0.48 | 0.57 | 0.55 | 3 | 4 | 1 | 3 |

4.3 Sampling efficiency and computational time
We next examine the Gibbs sampler's mixing rate and execution time for the above experiments. To gain an overall understanding of parameter mixing (emission and transition), the log-likelihood trace is shown in Figure ?e. Since most of the sampled variables contribute to the likelihood calculation, the well-mixed trace indicates general mixing efficiency in the model. Additionally, the mixing trends of the learning rates for a generic evolutionary run are shown in Figure ?a-d, both to monitor mixing and to support the discussion of the experiments. Large values of the covariance learning rate prevent the model from immediately increasing the variance to fit the changing distributions; rather, the means are allowed to evolve, with the corresponding rate converging to small values. The similarly small values of the transition-related learning rates ensure the adaptability of the model towards changing state transitions in the HDP-HMM. Through the orchestration of these parameters, the proposed model can adapt to changes in the streaming batches, with a more exact account of the true cardinality of the classes, and avoid collapsing neighbouring classes into a single one.
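A simple programmatic stand-in for the visual inspection of the log-likelihood trace is to compare its mean over the last two windows of samples. This is a crude heuristic of our own, not a formal convergence diagnostic:

```python
import numpy as np

def roughly_mixed(loglik_trace, window=100, rel_tol=0.01):
    """True when the mean log-likelihood of the final window is within
    `rel_tol` (relative) of the preceding window's mean."""
    trace = np.asarray(loglik_trace, dtype=float)
    a = trace[-2 * window:-window].mean()
    b = trace[-window:].mean()
    return bool(abs(b - a) <= rel_tol * max(abs(a), 1.0))
```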
Finally, the computational time per frame for runs on an Intel Xeon E5 2.90 GHz processor over the Weizmann and TUM kitchen datasets is shown in Figure ?f. The boxplot includes online and offline variants, with and without learning rates, to help explore how the learning rates and the online scheme affect computational time. Based on the elapsed time (in seconds), the offline run is the fastest, since all the data are processed in a single batch. The adaptive online runs take place over 3-4 batches of 1000 iterations each, and therefore show an increase of about 5-10 ms in completion time. Adapting the learning rate can add between 3 and 10 ms of delay, yet given the discussed benefits, particularly for evolving sequences, this latency is quite reasonable. It is important to mention that, given the initial bootstrap training, the Gibbs algorithm converges rapidly, allowing the model to run in acceptable time. Overall, using the learning rate ensures multiple improvements without imposing an excessive computational load on the system.
5 Conclusion
In this paper, we have proposed a novel adaptive online model suited for on-the-fly time segmentation and recognition of sequential data. The proposed AdOn HDP-HMM is capable of online segmentation and classification of streaming batches of data over incremental class sets. The main contribution of this model is the unsupervised posterior adaptation of the parameters over successive data batches. This is accomplished by means of a learning rate that dynamically tunes the model, balancing the impact of the current batch against the memory accumulated so far. This proves an effective solution for online sequential estimation problems requiring adaptation over evolving distributions.
The performance of AdOn HDP-HMM has been evaluated via a number of experiments including stationary and evolutionary scenarios. We have tested general segmentation and classification accuracy, in addition to the ability to detect the correct number of classes. Results are reported on variations of synthetic data and two activity recognition video datasets (Collated Weizmann and TUM Assistive Kitchen). The proposed model has achieved remarkable accuracy in all cases, and considerable improvements in the evolutionary scenarios.
Thanks to the unsupervised adaptive online estimation and the capacity to learn over infinite class sets, the proposed AdOn HDP-HMM can serve as a solution for sequential estimation in a number of scenarios which have received relatively little attention in the literature. Since it does not rely on human intervention to revise or correct the estimated labels, the model is a suitable candidate for streaming applications. In addition, although designed for evolutionary distributions, its accuracy over stationary data has proved higher than or equal to that of the most comparable results, without significantly affecting the computational load.
6 Appendix A: The balancing effect of the learning rate
In this section we address the posterior inference of the parameters and explore how the prior and likelihood distributions convey, respectively, the accumulated summary of previous data and the knowledge of the current observations. Considering the online HDP-HMM model with its parameters, observations and learning rate, the posterior for the parameters in the current batch is:
As more batches stream in, the weight of the prior accumulates and adaptivity to new data declines. The learning rate, however, can be used as an equaliser that controls the balance of prior versus likelihood and tunes the model's adaptivity. Depending on its value, the learning rate either discounts the impact of the accumulated previous data, allowing for more adaptivity, or inclines the posterior to follow the prior more strictly. In other words, the learning rate can be seen as a scaling coefficient for the number of ‘pseudo-observations’ in the prior.
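The ‘pseudo-observation’ reading admits a simple conjugate sketch: for a Normal mean with known variance, dividing the prior's pseudo-count by the learning rate trades memory against adaptivity. The parameterisation below is our own, chosen only to illustrate the balancing effect:

```python
import numpy as np

def posterior_mean(prior_mu, prior_count, batch, lam):
    """Conjugate Normal-mean update (known unit variance) in which the
    prior's pseudo-observation count is divided by the learning rate:
    lam > 1 discounts accumulated history, lam < 1 entrenches it."""
    n_eff = prior_count / lam
    batch = np.asarray(batch, dtype=float)
    return (n_eff * prior_mu + batch.sum()) / (n_eff + batch.size)
```

With a history worth 100 pseudo-observations and a 10-point batch centred at 5, lam = 10 pulls the posterior mean to 2.5, while lam = 0.1 leaves it near 0.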
7 Appendix B: Conjugacy for the Inverse-Wishart learning rate
To sample the learning rate from its posterior, we would ideally like a conjugate prior that analytically yields the posterior hyperparameters, given those of the prior and the sufficient statistics of the current data. A candidate prior for the IW likelihood is the Gamma distribution. In this section, we investigate whether the Gamma can be proven a conjugate prior for the IW likelihood, considering that the learning rate affects both of the IW hyperparameters.
Given the proposed learning rate model, the probability density function of the Inverse-Wishart distribution can be redefined as below. We have derived the new hyperparameters through conversion to the canonical parameters, multiplication by the learning rate, and reversion to the standard form, in order to simplify sampling.
We assume the Gamma to be a conjugate prior for sampling the learning rate, and attempt to prove it below.
Thanks to proportionality, we can remove the terms that are constant with respect to the learning rate:
Ideally, we should obtain terms proportional to a Gamma kernel; however, because the learning rate affects both hyperparameters, the term related to the degrees of freedom also depends on it:
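As a sketch in our own notation (with dimension $d$, degrees of freedom $\nu$, scale $\Psi$ and learning rate $\lambda$ scaling both hyperparameters), the scaled Inverse-Wishart density, viewed as a function of $\lambda$, contains the multivariate gamma function of $\lambda\nu$:

```latex
p(\Sigma \mid \lambda\nu, \lambda\Psi)
  = \frac{|\lambda\Psi|^{\lambda\nu/2}}
         {2^{\lambda\nu d/2}\,\Gamma_d\!\left(\lambda\nu/2\right)}
    \,|\Sigma|^{-(\lambda\nu+d+1)/2}
    \exp\!\left(-\tfrac{1}{2}\operatorname{tr}\!\left(\lambda\Psi\,\Sigma^{-1}\right)\right).
```

The power and exponential factors are compatible with a Gamma kernel in $\lambda$, but $\Gamma_d(\lambda\nu/2)$ is not of the form $\lambda^{a-1}e^{-b\lambda}$, which is the obstruction to conjugacy.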
To conclude, the Gamma is conjugate to the Inverse-Wishart only through the scale parameter (or a scaling coefficient over it), and cannot be used as a conjugate prior for deriving the posterior of a learning rate that also scales the degrees of freedom.
8 Appendix C: Conjugacy for the mean learning rate
Let us consider a multivariate Normal distribution in the fully general case (Equation 15), with a conjugate Gamma prior over the random scaling variable. We show below that conjugacy holds for this setting, by expanding the right-hand side of the proportionality and deriving the posterior hyperparameters in the presence of a single data sample; the resulting parameters extend straightforwardly to the case of N observations.
The remaining terms are proportional to a Gamma distribution with the following parameters:
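In our notation, assuming the likelihood $x \sim \mathcal{N}(\mu, \lambda^{-1}\Sigma)$ in $d$ dimensions and the rate-parameterised prior $\lambda \sim \mathrm{Gamma}(a, b)$ (hyperparameter names may differ from the paper's), collecting the powers of $\lambda$ and the exponential terms gives

```latex
\lambda \mid x \;\sim\; \mathrm{Gamma}\!\left(a + \frac{d}{2},\;
   b + \frac{1}{2}\,(x-\mu)^{\top}\Sigma^{-1}(x-\mu)\right),
```

and for $N$ observations the shape becomes $a + Nd/2$ while the rate accumulates $\tfrac{1}{2}\sum_{i}(x_i-\mu)^{\top}\Sigma^{-1}(x_i-\mu)$.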
9 Appendix D: Impacts of the learning rate on parameter distributions
In this appendix, we explore the impact of the learning rate on the mean and covariance of the Inverse-Wishart. As mentioned in the paper, in approximately all cases the mean stays unchanged while the variance is scaled inversely with the learning rate.
Accepting the approximation above, the resulting samples are drawn approximately around the same mean, but with a scaled variance. For small values of the learning rate the variance increases, whereas for large values the distribution is more peaked. In other words, in the former case the posterior samples of the covariance are allowed to move away from the IW mean, yielding greater adaptability towards the currently observed data, while in the latter case they concentrate around the prior mean, discouraging covariance adaptation.
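The claim can be checked in closed form in the univariate special case, where IW(ν, ψ) reduces to an Inverse-Gamma with shape ν/2 and scale ψ/2: scaling both hyperparameters by λ leaves the mean approximately unchanged (for ν well above the dimension) while dividing the variance by roughly λ. The concrete numbers below are only for illustration:

```python
def inv_gamma_stats(alpha, beta):
    """Mean and variance of an Inverse-Gamma(alpha, beta), alpha > 2."""
    mean = beta / (alpha - 1.0)
    var = beta ** 2 / ((alpha - 1.0) ** 2 * (alpha - 2.0))
    return mean, var

# Univariate IW(nu, psi) corresponds to Inverse-Gamma(nu / 2, psi / 2).
nu, psi, lam = 100.0, 200.0, 4.0
m0, v0 = inv_gamma_stats(nu / 2.0, psi / 2.0)            # original
m1, v1 = inv_gamma_stats(lam * nu / 2.0, lam * psi / 2.0)  # scaled by lam
```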
Footnotes
 Defined as an undesirable deviation from the ideal model.
 For convenience, in this paper we have constrained all batches to be of the same length; the variable-length alternative is explored in [28].
References
[1] Horst Bunke, Peter J. Dickinson, Miro Kraetzl, and Walter D. Wallis. A graph-theoretic approach to enterprise network dynamics, volume 24. Birkhäuser, 2007.
[2] J. Yamato, J. Ohya, and K. Ishii. Recognizing human action in time-sequential images using hidden Markov model. In Proc. CVPR, pages 379–385, 1992.
[3] Cristian Sminchisescu, Atul Kanaujia, and Dimitris Metaxas. Conditional models for contextual human motion recognition. Computer Vision and Image Understanding, 104(2-3):210–220, 2006.
[4] D. L. Vail, M. M. Veloso, and J. D. Lafferty. Conditional random fields for activity recognition. In Proc. Int. Conf. on Autonomous Agents and Multi-Agent Systems, 2007.
[5] Minh Hoai, Zhenzhong Lan, and Fernando De la Torre. Joint segmentation and classification of human actions in video. In Proc. CVPR, 2011.
[6] Paul Over, George Awad, Martial Michel, Jonathan Fiscus, Wessel Kraaij, and Alan F. Smeaton. TRECVID 2011: an overview of the goals, tasks, data, evaluation mechanisms and metrics. In Proceedings of TRECVID 2011. NIST, USA, 2011.
[7] Moritz Tenorth, Jan Bandouch, and Michael Beetz. The TUM Kitchen Data Set of everyday manipulation activities for motion tracking and action recognition. In IEEE International Workshop on Tracking Humans for the Evaluation of their Motion in Image Sequences (THEMIS), in conjunction with ICCV 2009, 2009.
[8] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581, 2006.
[9] Matthew J. Beal, Zoubin Ghahramani, and Carl E. Rasmussen. The infinite hidden Markov model. In Advances in Neural Information Processing Systems, pages 577–584, 2001.
[10] Stuart Geman and Donald Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-6(6):721–741, 1984.
[11] Y. W. Teh, K. Kurihara, and M. Welling. Collapsed variational inference for HDP. In Advances in Neural Information Processing Systems (NIPS), volume 20, 2008.
[12] M. Zanotto, D. Sona, V. Murino, F. Papaleo, and H. Kjellström. Dirichlet process mixtures of multinomials for data mining in mice behaviour analysis. In IEEE International Conference on Computer Vision Workshops (ICCVW), pages 197–202, 2013.
[13] E. B. Fox, E. B. Sudderth, M. I. Jordan, and A. S. Willsky. Bayesian nonparametric inference of switching dynamic linear models. IEEE Transactions on Signal Processing, 59(4):1569–1585, 2011.
[14] C. Zhang, E. Henrik, X. Gratal, F. Pokorny, and H. Kjellström. Supervised hierarchical Dirichlet processes with variational inference. In IEEE International Conference on Computer Vision Workshops (ICCVW), pages 254–261, 2013.
[15] H. W. Sorenson and D. L. Alspach. Recursive Bayesian estimation using Gaussian sums. Automatica, 7(4):465–479, 1971.
[16] Kevin R. Canini, Lei Shi, and Thomas L. Griffiths. Online inference of topics with latent Dirichlet allocation. In Proc. AISTATS, volume 5, pages 65–72, 2009.
[17] Ludmila I. Kuncheva and Catrin O. Plumpton. Adaptive learning rate for online linear discriminant classifiers. In Proceedings of the 2008 Joint IAPR International Workshop on Structural, Syntactic, and Statistical Pattern Recognition (SSPR & SPR '08), pages 510–519. Springer-Verlag, Berlin, Heidelberg, 2008.
[18] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.
[19] Tom Schaul, Sixin Zhang, and Yann LeCun. No more pesky learning rates. CoRR, abs/1206.1106v2, 2013.
[20] Anastasia Pentina and Christoph H. Lampert. A PAC-Bayesian bound for lifelong learning. In Proc. ICML, 2014.
[21] Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, New York, NY, USA, 2006.
[22] Léon Bottou. Stochastic learning. In Advanced Lectures on Machine Learning, number LNAI 3176 in Lecture Notes in Artificial Intelligence, pages 146–168, 2004.
[23] C. C. Loy, T. M. Hospedales, Tao Xiang, and Shaogang Gong. Stream-based joint exploration-exploitation active learning. In Proc. CVPR, pages 1560–1567, 2012.
[24] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, 1951.
[25] J. Sethuraman. A constructive definition of Dirichlet priors. Statistica Sinica, 4:639–650, 1994.
[26] E. Fox, E. Sudderth, M. Jordan, and A. Willsky. Developing a tempered HDP-HMM for systems with state persistence. Technical report, MIT Laboratory for Information and Decision Systems, 2007.
[27] E. B. Fox, E. B. Sudderth, M. I. Jordan, and A. S. Willsky. An HDP-HMM for systems with state persistence. In Proc. ICML, 2008.
[28] A. Bargi, R. Y. Da Xu, and M. Piccardi. An infinite adaptive online learning model for segmentation and classification of streaming data. In Proc. ICPR, pages 3440–3445, 2014.
[29] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, New York, 2006.
[30] E. B. Fox. Bayesian Nonparametric Learning of Complex Dynamical Phenomena. Ph.D. thesis, MIT, Cambridge, MA, 2009.
[31] Michael E. Tipping and Chris M. Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society, Series B, 61:611–622, 1999.
[32] Marco Chiani. Distribution of the largest eigenvalue for real Wishart and Gaussian random matrices and a simple approximation for the Tracy–Widom distribution. Journal of Multivariate Analysis, 129:69–81, 2014.
[33] David Knowles and Zoubin Ghahramani. Nonparametric Bayesian sparse factor models with application to gene expression modeling. The Annals of Applied Statistics, 5(2B):1534–1552, 2011.
[34] Ava Bargi, Richard Yi Da Xu, Zoubin Ghahramani, and Massimo Piccardi. A non-parametric conditional factor regression model for multi-dimensional input and response. JMLR W&CP, 33:77–85, 2014.
[35] Liang Wang and David Suter. Recognizing human activities from silhouettes: Motion subspace and factorial discriminative graphical model. In Proc. CVPR, 2007.
[36] Loris Nanni, Sheryl Brahnam, and Alessandra Lumini. Combining different local binary pattern variants to boost performance. Expert Systems with Applications, 38(5):6209–6216, 2011.
[37] Zia Moghaddam and Massimo Piccardi. In DICTA, pages 188–195, 2009.