# An Adaptive Online HDP-HMM for Segmentation and Classification of Sequential Data

###### Abstract

In recent years, the desire and need to understand sequential data has been increasing, with particular interest in sequential contexts such as patient monitoring, understanding daily activities, video surveillance, stock markets and the like. Given the constant flow of data, it is critical to classify and segment the observations on-the-fly, without being limited to a rigid number of classes. In addition, the model needs to be capable of updating its parameters to comply with possible evolutions. This interesting problem, however, is not adequately addressed in the literature since many studies focus on offline classification over a pre-defined class set. In this paper, we propose a principled solution to this gap by introducing an adaptive online system based on Markov switching models with hierarchical Dirichlet process priors. This infinite adaptive online approach is capable of segmenting and classifying the sequential data over an unlimited number of classes, while meeting the memory and delay constraints of streaming contexts. The model is further enhanced by introducing a ‘learning rate’, responsible for balancing the extent to which the model sustains its previous learning (parameters) or adapts to the new streaming observations. Experimental results on several variants of stationary and evolving synthetic data and two video datasets, TUM Assistive Kitchen and collated Weizmann, show remarkable performance in segmentation and classification, particularly for evolutionary sequences with changing distributions and/or containing new, unseen classes.

## 1 Introduction and related work

The joint problem of time segmentation and recognition of sequential data into meaningful sub-sequences has attracted significant research in a variety of domains. The ability to automatically segment and classify data is a core technology for applications like speaker diarisation, finance, activity understanding, multimedia annotation and human-computer interaction. To date, the main proposed solutions have included sliding windows [1], the hidden Markov model (HMM) [2], conditional random fields [3] [4], and structural SVM [5], covering the spectrum of generative, discriminative and maximum-margin dynamic classifiers. Along with advancements in learning and inference, research has witnessed increasingly realistic datasets which are bridging the gap between lab and real applications [6] [7].

Nevertheless, important challenges such as model adaptation and dynamic class sets remain unresolved. We address both limitations with an adaptive online model that can accommodate an unlimited (theoretically infinite) number of classes. In a nutshell, this is achieved by applying a Bayesian non-parametric model, the hierarchical Dirichlet process (HDP), as the prior for a hidden Markov model (a model known as HDP-HMM [8] [9]), and exploiting an adaptive learning rate for model adaptation. The proposed model provides an adaptive online learning approach for joint segmentation and recognition of sequential data with incremental class sets, and we refer to it as AdOn HDP-HMM in the following. The model is i) online: it can receive sequential data in batches and segment and recognise them on-the-fly; ii) adaptive: using a limited memory buffer, it can tune its parameters in response to diverse observations from the existing classes, as well as instantiate new, unseen classes, and it continues learning throughout the entire life of its application; and iii) only-initially supervised: it uses a relatively short initial bootstrap of supervised training, but adapts in a fully unsupervised manner during its operation. It also processes the streaming data in a single pass, without revision. These constraints make adaptation much more challenging, yet suit the model to a large span of real-life problems. To improve adaptation in such an unsupervised learning scenario, we introduce the notion of a ‘learning rate’ that tunes how biased the model is towards its previous learning (memory), versus adapting to the patterns conveyed by the new observations (adaptability). Experiments support the efficiency of utilising a learning rate, particularly in evolving scenarios.

The rest of this paper is organised as follows: in the remainder of this section we present the related literature and clarify the scope of this study. In Section 2 we describe the hierarchical Dirichlet process and its temporal extension, the HDP-HMM. Section 3 presents the proposed online approach, expanding on the adaptive learning rate. Through the experiments and discussions in Section 4, we evaluate and compare the proposed variants with existing benchmarks, and we conclude in Section 5.

### 1.1 Related work

Amongst the many paradigms available for class modelling, hierarchical Bayesian modelling and, in particular, the hierarchical Dirichlet process (HDP) [8] offer a principled way to infer an arbitrary number of classes from a set of samples via a hierarchy of prior distributions. The hierarchical Dirichlet process (HDP) is a Bayesian nonparametric technique estimating the joint posterior distribution of a set of latent classes and a set of parameters, typically by Gibbs sampling [10] or variational inference [11]. It has been used for a variety of applications, including the modelling of sequential data, by integrating HDP priors into state-space models such as HMM. In the resulting HDP-HMM [8] [9], the classes correspond with the discrete states of a Markov chain and the data are explained by a state-conditional observation model. Given a set of samples, classification is performed by state decoding, while allowing the number of states to dynamically grow or shrink. The hierarchical Dirichlet process is finding increasing application in domains as varied as bio-informatics, speaker diarization, vision and others for problems of joint segmentation and classification (see [12] [13] [14] for some recent references).

Most of the segmentation and recognition studies in the literature follow an offline approach, where the entire data set is presented at once during the learning stage [6] [7]. Such systems obviously do not suit the needs of streaming data which are ubiquitous in today’s applications. In response to this increasing demand for online systems, many studies are dedicated to this topic. However, the term online has been given a variety of meanings in different contexts. Our interpretation is sequential processing of temporal data in mini-batches, inspired by recursive Bayesian estimation [15] and further elaborated throughout this paper. This interpretation is distinct from that of other studies in the literature where online refers to a closed dataset that is processed incrementally and possibly repeatedly, such as Bayesian online nonparametrics [16] [17], stochastic optimisation methods [18] [19], formal bounds for online learning [20], all based upon the foundations laid by seminal works such as [21] [22].

Although almost all the proposed approaches consider closed, pre-defined sets of classes, in scenarios like long-term learning or monitoring the number of classes is not precisely predictable. Additionally, as more data stream in, the known classes may change in parameters due to observing a more comprehensive sample or a natural evolution over time. In either case, models are expected to update the parameters of the known classes and add new classes to their vocabulary once they appear. Unsupervised adaptation can be very challenging in non-stationary domains, where adaptation and drift (defined as an undesirable deviation from the ideal model) are hardly distinguishable. To our knowledge, a frequent assumption in online studies is the availability of periodic or ad-hoc feedback from the user (active learning [23] [21] [19]). This feedback allows the model to evaluate the regret and redress possible drifts and misclassifications. However, such information is hard or costly to obtain in many real application domains.

In the absence of expert feedback, we elaborate more on the learning rate as a dynamic lever for balancing adaptability (section 3.1). Most previous studies approach this problem by assigning constant weights to prior learning and the likelihood of the current data. However, in more complex problems the choice of the learning rate is highly dependent on the data dynamics and the application domain. Some online studies propose adaptive learning rates via exponential decay [24], and, more recently, regret-based adaptations of the learning rate (i.e., the step size of gradient descent) [18] [17] [19]. However, such adaptation strategies are only suitable for finite training sets. In our solution, we introduce a novel learning rate that constantly adapts to the statistics of the streaming data, without revision or supervision. For stationary problems where the parameters only slightly change, the learning rate tunes itself to rely more on prior memory. Conversely, under evolving distributions the dynamics of data and their modes can significantly vary, calling for a more adaptive model with less inertia to the past. Adding to the complexity, many real-life problems require a mixture of both, i.e. a continuous spectrum for the learning rate to follow more or less tightly the dynamics of observations at each point in time. In this work, we tackle this problem by a posterior estimation of the learning rate separately for each parameter in the model - thereby, allowing each parameter to dynamically determine its adaptability in each batch.

## 2 The hierarchical Dirichlet process

A Dirichlet process, DP(\gamma,H), is a generative model that can be thought of as a distribution over discrete distributions with countably infinite categories. It is controlled by a scalar parameter, \gamma, known as the concentration parameter, and a base measure, H, over a measurable space \theta. A sample G_{0} from a Dirichlet process is a distribution over \theta differing from zero at only a countably infinite number of locations or atoms, \theta_{k},k=1\ldots K:

$$
\begin{split}
& G_{0}\sim DP(\gamma,H):\\
& G_{0}=\sum_{k=1}^{K}\beta_{k}\delta(\theta-\theta_{k}),\qquad K\to\infty\\
& \theta_{k}\sim H,\qquad \boldsymbol{\beta}\sim GEM(\gamma)
\end{split}
\tag{1}
$$

The discrete set of locations is obtained by repeatedly sampling the base measure, while the weight for each location, \beta_{k},k=1\ldots K, is established by a stick-breaking process, noted as GEM(\gamma) (named after Griffiths, Engen and McCloskey) [25]. We refer to the weight vector simply as \beta. A hierarchical Dirichlet process (HDP) consists of (at least) two layers of Dirichlet processes, obtained with a similar construction:
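The stick-breaking construction can be made concrete with a short sketch. The Python code below is only an illustration under stated assumptions: the infinite sum of Equation 1 is truncated at a finite number of atoms, the base measure H is taken to be a standard normal, and the function names (`stick_breaking`, `sample_truncated_dp`) are hypothetical.

```python
import random

def stick_breaking(gamma, truncation, rng):
    """Draw the first `truncation` weights of beta ~ GEM(gamma).

    Returns the weights and the length of the stick that is left over,
    i.e. the total mass of all atoms beyond the truncation level.
    """
    weights = []
    remaining = 1.0  # length of the stick not yet broken off
    for _ in range(truncation):
        v = rng.betavariate(1.0, gamma)  # fraction taken from the remaining stick
        weights.append(v * remaining)
        remaining *= 1.0 - v
    return weights, remaining

def sample_truncated_dp(gamma, base_sampler, truncation, rng):
    """Truncated draw G_0 ~ DP(gamma, H): atoms theta_k ~ H with GEM weights."""
    weights, leftover = stick_breaking(gamma, truncation, rng)
    atoms = [base_sampler(rng) for _ in range(truncation)]
    return atoms, weights, leftover

rng = random.Random(0)
# Assumption for illustration: H is a standard normal distribution.
atoms, weights, leftover = sample_truncated_dp(
    gamma=2.0, base_sampler=lambda r: r.gauss(0.0, 1.0), truncation=50, rng=rng)
print(f"mass captured by 50 atoms: {sum(weights):.6f}, left over: {leftover:.2e}")
```

A smaller \gamma concentrates the mass on the first few atoms, whereas a larger \gamma spreads it over many atoms, which is why \gamma is called the concentration parameter.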

$$
\begin{split}
& G_{j}\sim HDP(\gamma,\alpha,H):\\
& G_{0}\sim DP(\gamma,H)\\
& G_{j}=\sum_{k=1}^{K}\pi_{jk}\delta(\theta-\theta_{k}),\qquad K\to\infty\\
& \theta_{k}\sim H,\qquad \pi_{j}\sim DP(\alpha,\beta),\qquad \boldsymbol{\beta}\sim GEM(\gamma)
\end{split}
\tag{2}
$$

where \gamma and \alpha are the concentration parameters of the top-level and lower-level Dirichlet processes, respectively. Since G_{0} is discrete, the various G_{j} (j=1\ldots J) are also discrete and sampled from the elements of G_{0} (Figure 1).

In practical applications, the continuous space of distribution H is taken to be the parameter space for a data likelihood, as in y\sim f(y|\theta),\quad\theta\sim H. The likelihood f(y|\theta) could be, for instance, a Gaussian distribution whose parameters \theta are sampled from a Normal-Inverse-Wishart (NIW) distribution. Given the generative model of the HDP, the joint distribution of data and parameters factorises as f(y|\theta)G_{j}(\theta). Typically, multiple G_{j} are sampled to model data belonging to different groups. Yet, the hierarchical structure of the HDP makes all the G_{j} usefully share distributional properties. Examples can be as diverse as words in a collection of books or genetic markers across different populations.

### 2.1 The HDP-HMM

The HDP has also been used as prior distribution for the parameters of switching models such as the hidden Markov model [8] [13]. When applied to a Markov chain, z_{1:T}, p(z_{1:T})=p(z_{1})\prod_{t=2}^{T}p(z_{t}|z_{t-1}), the HDP changes its interpretation significantly (Figure 2). In this case, each \pi_{j}=\left\{\pi_{jk}\right\}, k=1\ldots K, is used as one row of the Markov chain’s transition matrix, representing the probability of transitioning from state j in the previous time-step to any state in the current time-step, p(z_{t}|z_{t-1}=j). Thanks to the properties of the HDP, new states are created when the data are not adequately explained by the current set of states. In contrast to the conventional HDP, the group index, j, of each observation is usually no longer known explicitly, but is instead inferred in sequential order from the chain. Therefore, in the case of the HDP-HMM, z_{t}\sim p(z_{t}|z_{t-1}=j)=\pi_{j},\quad y_{t}\sim f(y_{t}|\theta_{z_{t}}). As a consequence, in the HDP-HMM the number of groups (J) and the number of indices in each \pi_{j} (K) coincide. Adding the HDP as prior caters for an arbitrary number of states, or activity classes [13].

It is worth adding that a reported limitation of the HDP-HMM is its tendency to over-segment due to its unbounded number of classes [26]. Fox et al. have proposed adding a ‘sticky’ prior (\kappa) to the transition matrix to emulate an inertia towards changing states, illustrated in Figure 2 [27]. We utilise the sticky prior in this study, but still denote the model as HDP-HMM for brevity.

### 2.2 Inference and Learning

Inference and learning are typically performed simultaneously in the HDP and its extensions by estimating the joint posterior distribution of the indicator variables, parameters, hidden variables and hyper-priors conditioned on the observations. Deriving such an extensive joint posterior is analytically intractable, hence it is mainly inferred using Gibbs sampling or variational inference. Gibbs sampling is a simple yet effective method capable of estimating complex posteriors with significant accuracy, yet it can converge slowly or remain trapped in a local optimum (poor mixing). Variational inference is usually faster to compute; however, it requires prior derivation of analytical approximations and can suffer from low accuracy due to the approximation. Contrary to the common negative presumption about the efficiency of Gibbs sampling, we will show how a brief initial supervised learning can result in rapid convergence to accurate distributions.

Having inferred the class indicators, z_{1:T}, we proceed with translating the indices into meaningful classes. In unsupervised learning, the correspondence between the ground-truth classes of data and the labels assigned by the classification algorithm may not be obvious. In the case of the HDP, this problem is exacerbated by the fact that the number of classes is undetermined. Therefore, to re-establish the best possible one-to-one correspondence, the Hamming distance between ground-truth and assigned labels is minimised by a greedy algorithm, matching labels in decreasing frequency order.
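A greedy matching of this kind could look as follows. This is a minimal sketch rather than the exact procedure used in the experiments: we assume that predicted labels are visited in decreasing frequency order and that each is paired with the still-unused ground-truth label it co-occurs with most often. The function names are hypothetical.

```python
from collections import Counter

def greedy_label_map(true_labels, pred_labels):
    """Greedily map predicted labels to ground-truth labels, most frequent first."""
    overlap = Counter(zip(pred_labels, true_labels))
    mapping, used = {}, set()
    for pred, _ in Counter(pred_labels).most_common():
        # Ground-truth labels that are still free, with their co-occurrence counts.
        candidates = [(count, true) for (p, true), count in overlap.items()
                      if p == pred and true not in used]
        if candidates:
            best_true = max(candidates, key=lambda c: c[0])[1]
            mapping[pred] = best_true
            used.add(best_true)
    return mapping

def hamming_distance(true_labels, pred_labels, mapping):
    """Number of frames whose mapped prediction differs from the ground truth."""
    return sum(mapping.get(p) != t for t, p in zip(true_labels, pred_labels))

truth = list("aaaabbc")
predicted = [2, 2, 2, 2, 0, 0, 0]   # arbitrary state indices from the sampler
mapping = greedy_label_map(truth, predicted)
print(mapping, hamming_distance(truth, predicted, mapping))
```

Predicted labels that cannot be matched (for instance, a newly instantiated state once every ground-truth label is taken) are left unmapped and count as errors in this sketch.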

## 3 The Adaptive Online HDP-HMM

The proposed AdOn HDP-HMM uses a supervised initialisation (bootstrap) of T_{b} frames, followed by the main unsupervised adaptive online inference (Figure 3). The extent of the supervised phase varies with the application: in applications where annotation is easy, the bootstrap can be longer to provide a more comprehensive training, while in domains with costly annotation the bootstrap will be brief. In either case, during supervised learning, indicator variables z_{1:T_{b}} are fixed to their ground-truth values, and the model’s parameters are sampled for a given number of iterations to reach convergence. After conclusion of the bootstrap phase, the data are processed in successive batches, and the posterior probabilities of both indicator variables and parameters are estimated iteratively on each batch.

Considering a generic stream of data, y_{1:t}, the posterior probability of the parameters can be written as p(\phi|y_{1:t})\propto f(y_{1:t}|\phi)\,p(\phi), where \phi indicates the parameter vector of Figure 2. In the case of the HDP-HMM, the parameter vector is \phi=\left\{\theta,\pi,\beta\right\}, where \theta are the parameters of the emission densities, \pi are the transition probabilities (and weights of the lower-level DPs), and \beta are the weights of the higher-level DP. Further, since we assume normal densities, we have \theta=\left\{\mu,\Sigma\right\}, with \mu and \Sigma the usual mean and covariance parameters. The online version leverages posterior adaptation, using the posterior computed up to time t as the prior for the next batch of data, y_{t+1:t+\Delta t}:

$$
\begin{split}
p(\phi_{n+1}|y_{1:t+\Delta t}) &\propto f(y_{t+1:t+\Delta t}|\phi_{n},y_{1:t})\,p(\phi_{n}|y_{1:t})\\
&\approx f(y_{t+1:t+\Delta t}|\phi_{n})\,p(\phi_{n})
\end{split}
\tag{3}
$$

where n is the batch number (Figure 4). Given that the updated posterior embeds the distributional properties of the observations up to the current time, observations y_{1:t} in Equation 3 can be discarded after adaptation. This implies that the accumulated sufficient statistics of previous data are propagated parametrically, and the non-parametric nature of the model is related to the inference method of the current data batch. With that, the model carries all the prior learning and infers new labels using a limited memory buffer. While this may come at a price of reduced accuracy, to our knowledge it is the only viable approach for unbounded streaming data. In contrast, [16] presents online inference for latent Dirichlet allocation, yet over an unbounded buffer. Our work extends that model to infinite class sets while meeting the finite memory requirements of sequential data processing.
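The principle of carrying the posterior forward as the next prior can be illustrated on a toy conjugate model. The sketch below is not the HDP-HMM itself: as an assumption for illustration it uses a Gaussian mean with known noise variance, for which the batch-wise recursion is exact, whereas Equation 3 is an approximation in the full model. The function name is hypothetical.

```python
def update_gaussian_mean(prior_mean, prior_precision, batch, noise_var=1.0):
    """Conjugate update for the mean of a Gaussian with known noise variance."""
    n = len(batch)
    post_precision = prior_precision + n / noise_var
    post_mean = (prior_precision * prior_mean + sum(batch) / noise_var) / post_precision
    return post_mean, post_precision

stream = [0.9, 1.1, 1.3, 0.7, 1.0, 1.2, 0.8, 1.05]
batch_size = 2

mean, precision = 0.0, 1.0  # initial prior
for start in range(0, len(stream), batch_size):
    batch = stream[start:start + batch_size]
    # The posterior of the previous batch acts as the prior of this one.
    mean, precision = update_gaussian_mean(mean, precision, batch)
    # `batch` is no longer needed: (mean, precision) summarises all data so far.

# Reference: a single update that sees the whole stream at once.
full_mean, full_precision = update_gaussian_mean(0.0, 1.0, stream)
print(mean, full_mean)
```

Only two numbers are stored between batches, regardless of how long the stream is, which is the memory property that the online scheme relies on.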

### 3.1 Learning rate adaptation

In the proposed adaptive system, a learning rate is applied over the prior and noted as \tau in the following. In each batch, \tau is responsible for setting the weight of prior learning on the model’s parameters (\theta,\pi,\beta). In other words, our target is to balance the impact of the current observations with the previous learning accumulated along the previous batches. This can augment or weaken the posterior learning ‘inertia’ in ‘adapting’ to the current data (likelihood), as opposed to retaining ‘memory’ (prior).

$$
p(\phi|y,\tau)\propto p(y|\phi)\,\boldsymbol{p(\phi)^{\tau}}
\tag{4}
$$

It is worth noting that the length of the current batch, compared to the number of past samples, plays a role in their relative influence on the posterior parameters (see Appendix A for more details). Accordingly, \tau can be articulated as a scaling factor on the number of ‘pseudo-observations’ in the prior, balancing it against the respective number for the current batch. (For convenience, in this paper we have constrained all batches to be of the same length; the variable-length alternative is explored in [28].)

For prior distributions belonging to the exponential family, this proposition does not violate Bayes’ Theorem, thanks to the properties of canonical parameters. Accordingly, we use exponential family likelihoods and priors for easier integration of the learning rate into the model. Hereby, we focus on the prior in Equation 4 (in bold font) and its hyperparameters, translating them into exponential family notations. The standard parameters, \phi, are converted into the corresponding canonical parameters, \Theta, and we make explicit their dependence on hyper-parameters, \eta:

$$
\begin{split}
p(\Theta|\eta)^{\tau}=p(\Theta|\eta,\tau) &= f(\eta)^{\tau}g(\Theta)^{\tau}\exp\left(\Theta^{T}\eta\right)^{\tau}\\
&= f^{\prime}(\eta)\exp\left(\tau\Theta^{\prime T}\eta\right),\qquad \Theta^{\prime}=[\ln(g(\Theta));\Theta]
\end{split}
\tag{5}
$$

Adding the learning rate (\tau) as an exponent to this prior does not alter the type of distribution. Rather, it updates the canonical parameters of the prior, ultimately affecting its weight in the resulting posterior. Please note that we only need to derive a proportional posterior for sampling purposes. Hence, the \tau exponent on any term independent from \Theta (such as f(\eta)) can be ignored thanks to the proportionality. The normalisation coefficient g(\Theta)^{\tau} can be merged into the sufficient statistics, assuring that its \tau exponent is absorbed into the scaled canonical parameter (\tau\eta): g(\Theta)^{\tau}\exp(\tau\Theta^{T}\eta)=\exp\left(\tau\ln(g(\Theta))+\tau\Theta^{T}\eta\right)=\exp\left(\tau[\ln(g(\Theta));\Theta]^{T}[1;\eta]\right).

In general terms, the posterior distribution of \tau given \Theta in the presence of N data samples in {\bf Y} can be inferred as follows:

$$
p(\tau|\Theta,\mathbf{Y},\eta)\propto p(\Theta|\tau,\mathbf{Y},\eta)\,p(\tau)
\tag{6}
$$

In our case, \Theta are the parameters of the HDP-HMM and their priors are a Normal-Inverse-Wishart distribution for \mu and \Sigma and the HDP for \pi and \beta. Given that both the NIW distribution and the Dirichlet process are members of the exponential family, Equation 7 shows a unified way of inferring posterior parameters in canonical form [29]:

$$
\begin{split}
& p(\Theta|\mathbf{Y},\tau^{*},\eta^{*})\propto p(\mathbf{Y}|\Theta,\tau,\eta)\,p(\Theta|\eta,\tau)\\
& p(\Theta|\mathbf{Y},\tau^{*},\eta^{*})\propto\left[\left(\prod_{n=1}^{N}h(y_{n})g(\Theta)\right)\exp\left(\Theta^{T}\sum_{n=1}^{N}u(y_{n})\right)\right]\left[f(\eta,\tau)g(\Theta)^{\tau}\exp\left(\tau\Theta^{T}\eta\right)\right]\\
& \text{removing the constants with respect to }\Theta\text{:}\\
& p(\Theta|\mathbf{Y},\tau^{*},\eta^{*})\propto g(\Theta)^{\tau+N}\exp\left(\Theta^{T}\left(\sum_{n=1}^{N}u(y_{n})+\tau\eta\right)\right)\\
& \tau^{*}=\tau+N,\qquad \eta^{*}=\sum_{n=1}^{N}u(y_{n})+\tau\eta
\end{split}
\tag{7}
$$
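The last line of Equation 7 is simple enough to be sketched directly. The Python snippet below is an illustration only: the function name is hypothetical and, as an assumed toy example, the data are Bernoulli outcomes with sufficient statistic u(y)=y, so that \eta^{*}/\tau^{*} can be read as a weighted average of the prior value and the data.

```python
def tempered_conjugate_update(eta, tau, suff_stats):
    """Equation 7: tau* = tau + N and eta* = sum_n u(y_n) + tau * eta."""
    n = len(suff_stats)
    dim = len(eta)
    summed = [sum(u[d] for u in suff_stats) for d in range(dim)]
    eta_star = [summed[d] + tau * eta[d] for d in range(dim)]
    return eta_star, tau + n

# Toy Bernoulli batch with u(y) = [y]; the prior encodes a success rate of 0.5,
# while the batch has an empirical success rate of 0.75.
data = [[1.0], [1.0], [1.0], [0.0]]
for tau in (0.5, 1.0, 8.0):
    eta_star, tau_star = tempered_conjugate_update([0.5], tau, data)
    print(f"tau={tau}: eta*/tau* = {eta_star[0] / tau_star:.4f}")
```

With a small \tau the ratio stays close to the batch average, while a large \tau pulls it towards the prior value, which is the balance between adaptability and memory that the learning rate is meant to control.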

In the following sub-sections, we present the prior distribution of each parameter under the learning rate, and the posterior distribution of the corresponding learning rate.

#### 3.1.1 Inference of covariance matrix \Sigma

We infer \mu and \Sigma in the Normal-Inverse-Wishart prior by first sampling \Sigma from an Inverse-Wishart (IW) distribution, and then using \Sigma to sample \mu from a Normal distribution [30]. The learning rate for \Sigma is noted as \tau_{\Sigma} in the text. Yet, to avoid cluttering the notation in the equations, we simply note it as \tau in Equation 8.

$$
\begin{split}
& p(\mu,\Sigma)^{\tau}=p(\mu|\Sigma)^{\tau}p(\Sigma)^{\tau}:\\
& p(\Sigma)=IW(\Sigma|\Psi,\nu),\qquad p(\mu|\Sigma)=\mathcal{N}\left(\mu|\mu_{0},\frac{1}{\sigma}\Sigma\right)
\end{split}
\tag{8}
$$

As mentioned earlier, the addition of a positive learning rate as exponent on the IW prior does not alter the type of distribution and can be merged into the hyper-parameters. Below, we convert the hyper-parameters \phi_{IW}=\{\Psi,\nu\} into the natural form (\eta) to show the impact of \tau more clearly. Ultimately, they are converted back to standard form (\phi) to show the linear transformation caused by the learning rate.

$$
\begin{split}
& \phi_{IW}=(\Psi,\nu)\rightarrow\eta_{IW}=\left(-\frac{1}{2}\Psi,-\frac{\nu+p+1}{2}\right),\qquad p=\text{number of dimensions}\\
& \eta^{\prime}_{IW}=\tau\eta_{IW}=\left(-\frac{\tau}{2}\Psi,-\frac{\tau(\nu+p+1)}{2}\right)\rightarrow\phi^{\prime}_{IW}=(\tau\Psi,\tau(\nu+p+1)-p-1)
\end{split}
\tag{9}
$$
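The linear transformation of Equation 9 can be written as a small helper. This is a hedged sketch with a hypothetical function name; the check on the degrees of freedom is our own addition, based on the standard requirement that an Inverse-Wishart distribution needs more than p-1 degrees of freedom to be proper.

```python
def temper_inverse_wishart(psi, nu, tau):
    """Equation 9: hyper-parameters of IW(.|psi, nu) raised to the power tau.

    Returns (tau * psi, tau * (nu + p + 1) - p - 1).
    """
    p = len(psi)
    psi_scaled = [[tau * entry for entry in row] for row in psi]
    nu_scaled = tau * (nu + p + 1) - p - 1
    if nu_scaled <= p - 1:
        # Added safeguard (not part of Equation 9): the tempered prior is improper.
        raise ValueError("tau is too small: tempered degrees of freedom <= p - 1")
    return psi_scaled, nu_scaled

psi = [[2.0, 0.5], [0.5, 1.0]]
print(temper_inverse_wishart(psi, nu=5.0, tau=1.0))   # unchanged prior
print(temper_inverse_wishart(psi, nu=5.0, tau=0.75))  # weaker prior
```

A value of \tau below one shrinks both the scale matrix and the degrees of freedom, so the prior contributes fewer pseudo-observations to the posterior of \Sigma.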

Inference of \tau_{\Sigma}

To sample \tau_{\Sigma} from the posterior, ideally we would like to consider a conjugate prior that analytically derives the posterior hyper-parameters, given that of the prior and the sufficient statistics of the current data. A candidate conjugate prior for IW distribution is Gamma. However, the Inverse-Wishart is only conjugate to the Gamma as the prior for the scale parameter (or a scaling coefficient for the scale parameter, \Psi, in the multivariate cases). Hence, a Gamma cannot be used as a conjugate prior for deriving the posterior of \tau_{\Sigma} in a maximum-a-posteriori solution (Appendix B presents the proof).

Therefore, we utilise a maximum-likelihood solution to derive the posterior hyper-parameters for \tau_{\Sigma}. The posterior for \tau_{\Sigma} is modeled using an Inverse-Gamma (IG) distribution, the univariate correspondent of the Inverse-Wishart. The samples of IG are positive real values, suitable for the scalar learning rate \tau_{\Sigma}. The distributions are displayed below.

$$
\begin{split}
& IW(\Sigma|\Psi,\nu)=\frac{|\Psi|^{\frac{\nu}{2}}}{2^{\frac{\nu}{2}}\Gamma(\frac{\nu}{2})}|\Sigma|^{-\frac{\nu+p+1}{2}}\exp\left(-\frac{1}{2}\mathrm{tr}(\Psi\Sigma^{-1})\right)\\
& IW(\sigma|\psi,\nu)=\frac{\psi^{\frac{\nu}{2}}}{2^{\frac{\nu}{2}}\Gamma(\frac{\nu}{2})}\sigma^{-\frac{\nu+2}{2}}\exp\left(-\frac{\psi}{2\sigma}\right)\qquad\text{univariate IW: }p=1\\
& IG(\tau|\beta,\alpha)=\frac{\beta^{\alpha}}{\Gamma(\alpha)}\tau^{-\alpha-1}\exp\left(-\frac{\beta}{\tau}\right)
\end{split}
\tag{10}
$$

Comparing the univariate IW and IG in Equation 10, we can derive the posterior parameters as:

$$
IG(\tau|\beta^{*},\alpha^{*})\approx IW(\Sigma|\Psi,\nu)\qquad\text{where }\beta^{*}=\frac{f(\Psi)}{2},\quad\alpha^{*}=\frac{\nu}{2}
\tag{11}
$$

As can be seen, the hyper-parameters in the Inverse-Wishart map to those of the Inverse-Gamma. The only issue is how to best map the p\times p scale matrix (\Psi) into scalar value \beta^{*} through f(\Psi). We propose to use the largest eigenvalue in \Psi as the scale parameter in the Inverse-Gamma posterior. Approximating a covariance matrix via its first principal components is a meaningful and common approach [31] [32]. Another choice for f(\Psi) could be the determinant. The determinant is used as the scale associated with a square matrix since it is equal to the product of all its eigenvalues. While it gives a more thorough account of all the eigenvalues, it becomes unsuitable when the dimensionality is high and many of the eigenvalues are close or equal to zero. Moreover, calculating the determinant of a high-dimensional matrix is very costly in an online context. Therefore, the first approach is generally preferable.

So far, we have established a way to infer \tau_{\Sigma} from a single IW distribution belonging to a single state. Considering the proposed model with infinite states, we need to merge the IW parameters of all classes to infer \tau_{\Sigma}. This is done through a weighted average of \Psi_{k},k=1\ldots K, where the weights are the frequencies of observations for each state, i.e. the degrees of freedom in the IW parameters (\nu).

$$
IG(\tau|\beta^{*},\alpha^{*})\approx IW(\Sigma|\Psi,\nu)\qquad\beta^{*}=\frac{\sum_{k=1}^{K}\max(\mathrm{eig}(\Psi_{k}))\,\nu_{k}}{2\sum_{k=1}^{K}\nu_{k}},\quad\alpha^{*}=\frac{\sum_{k=1}^{K}\nu_{k}}{2K}
\tag{12}
$$
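As an illustration of Equation 12, the sketch below pools the per-state IW parameters into the two Inverse-Gamma parameters and draws one sample of \tau_{\Sigma}. Several choices here are assumptions made for a self-contained example: the largest eigenvalue is approximated by power iteration (any eigen-solver would do), the scale matrices are assumed symmetric positive semi-definite, and all function names are hypothetical.

```python
import math
import random

def largest_eigenvalue(matrix, rng, iterations=200):
    """Power iteration for a symmetric positive semi-definite matrix."""
    p = len(matrix)
    vec = [rng.random() + 0.1 for _ in range(p)]  # random positive start vector
    value = 0.0
    for _ in range(iterations):
        norm = math.sqrt(sum(x * x for x in vec))
        vec = [x / norm for x in vec]
        vec = [sum(matrix[i][j] * vec[j] for j in range(p)) for i in range(p)]
        value = math.sqrt(sum(x * x for x in vec))  # converges to the top eigenvalue
        if value == 0.0:
            break
    return value

def tau_sigma_posterior(psis, nus, rng):
    """Equation 12: Inverse-Gamma parameters (beta*, alpha*) pooled over K states."""
    total_nu = sum(nus)
    weighted = sum(largest_eigenvalue(psi, rng) * nu for psi, nu in zip(psis, nus))
    beta_star = weighted / (2.0 * total_nu)
    alpha_star = total_nu / (2.0 * len(nus))
    return beta_star, alpha_star

def sample_inverse_gamma(alpha, beta, rng):
    """If X ~ Gamma(shape=alpha, rate=beta), then 1/X ~ IG(alpha, beta)."""
    return 1.0 / rng.gammavariate(alpha, 1.0 / beta)  # gammavariate takes a scale

rng = random.Random(0)
psis = [[[4.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 2.0]]]  # two states
nus = [10.0, 30.0]
beta_star, alpha_star = tau_sigma_posterior(psis, nus, rng)
tau_sigma = sample_inverse_gamma(alpha_star, beta_star, rng)
print(beta_star, alpha_star, tau_sigma)
```

In this toy case the largest eigenvalues are 4 and 2, so \beta^{*}=(4\cdot 10+2\cdot 30)/(2\cdot 40)=1.25 and \alpha^{*}=40/(2\cdot 2)=10.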

#### 3.1.2 Inference of mean \mu

Having inferred \Sigma, the next step is to derive the multivariate mean, \mu, in the NIW prior. Let us consider a generic multivariate Normal distribution \mathcal{N}(\mu|\mu_{0},\frac{1}{\sigma}\Sigma) with known covariance. To observe the impact of the learning rate, we convert its parameters \phi_{\mu}=(\mu_{0},\frac{1}{\sigma}\Sigma) into the natural form, multiply them by the learning rate \tau_{\mu}, and ultimately revert them back to the standard form:

$$
\begin{split}
& \phi_{\mu}=\left(\mu_{0},\frac{1}{\sigma}\Sigma\right)\rightarrow\eta_{\mu}=\left(\sigma\Sigma^{-1}\mu_{0},-\frac{\sigma}{2}\Sigma^{-1}\right)^{T}\\
& \eta^{\prime}_{\mu}=\tau\eta_{\mu}=\left(\tau\sigma\Sigma^{-1}\mu_{0},-\frac{\tau\sigma}{2}\Sigma^{-1}\right)^{T}\rightarrow\phi^{\prime}_{\mu}=\left(\mu_{0},\frac{1}{\tau\sigma}\Sigma\right)\\
& \mathcal{N}(\mu|\eta^{\prime}_{\mu})=\mathcal{N}\left(\mu|\mu_{0},\frac{1}{\tau\sigma}\Sigma\right)
\end{split}
\tag{13}
$$
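The end result of Equation 13, an unchanged mean and a covariance divided by \tau, can be checked numerically. The sketch below is restricted to the univariate case for brevity (an assumption for illustration), where the prior variance plays the role of \frac{1}{\sigma}\Sigma; the function names are hypothetical.

```python
def to_natural(mean, variance):
    """Natural parameters of a univariate Normal: (mean/var, -1/(2 var))."""
    return mean / variance, -0.5 / variance

def to_standard(eta1, eta2):
    """Inverse map from natural parameters back to (mean, variance)."""
    variance = -0.5 / eta2
    return eta1 * variance, variance

def temper_gaussian(mean, variance, tau):
    """Raise N(.|mean, variance) to the power tau via its natural parameters."""
    eta1, eta2 = to_natural(mean, variance)
    return to_standard(tau * eta1, tau * eta2)

print(temper_gaussian(2.0, 4.0, 0.5))  # mean stays 2.0, variance doubles to 8.0
print(temper_gaussian(2.0, 4.0, 4.0))  # mean stays 2.0, variance shrinks to 1.0
```

A \tau below one therefore flattens the prior on \mu and lets the current batch dominate, while a \tau above one sharpens it.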

Inference of \tau_{\mu}

Posterior sampling of \tau_{\mu} is conducted with a similar approach to \tau_{\Sigma}, but using a Gamma conjugate prior. This time the Gamma prior is conjugate by definition, since its sample \tau_{\mu} is utilised merely as a scaling coefficient for the covariance. The detailed proof of the conjugacy is provided in Appendix C. Similarly to \tau_{\Sigma}, the weighted average of sufficient statistics across all classes is used to infer \tau_{\mu}.

$$
\begin{split}
& G(\tau|\alpha^{*},\beta^{*})\propto\mathcal{N}\left(\mu|\mu_{0},\frac{1}{\tau\sigma}\Sigma\right)G(\tau|\alpha,\beta)\\
& \alpha^{*}=\alpha+\frac{1}{2},\qquad\beta^{*}=\beta+\frac{\sigma\sum_{k=1}^{K}(\mu_{k}-\mu_{0k})^{T}\Sigma_{k}^{-1}(\mu_{k}-\mu_{0k})\,\nu_{k}}{2\sum_{k=1}^{K}\nu_{k}}
\end{split}
\tag{14}
$$
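A minimal sketch of the update in Equation 14 follows. To keep the example free of matrix inversions we assume diagonal covariances, so that the quadratic form reduces to a sum of squared, variance-normalised differences. The Gamma distribution is read in its shape and rate form, as suggested by the additive update of \beta^{*}. All function names are hypothetical.

```python
import random

def mahalanobis_sq_diag(mu, mu0, variances):
    """(mu - mu0)^T Sigma^{-1} (mu - mu0) for a diagonal Sigma."""
    return sum((m - m0) ** 2 / v for m, m0, v in zip(mu, mu0, variances))

def tau_mu_posterior(alpha, beta, sigma, mus, mu0s, variances, nus):
    """Equation 14 (diagonal covariances): returns (alpha*, beta*)."""
    weighted = sum(mahalanobis_sq_diag(m, m0, v) * nu
                   for m, m0, v, nu in zip(mus, mu0s, variances, nus))
    alpha_star = alpha + 0.5
    beta_star = beta + sigma * weighted / (2.0 * sum(nus))
    return alpha_star, beta_star

def sample_tau_mu(alpha_star, beta_star, rng):
    """Draw from Gamma(shape=alpha*, rate=beta*); gammavariate expects a scale."""
    return rng.gammavariate(alpha_star, 1.0 / beta_star)

mus = [[1.0, 0.0], [0.0, 2.0]]        # current means of two states
mu0s = [[0.0, 0.0], [0.0, 0.0]]       # their prior means
variances = [[1.0, 1.0], [1.0, 4.0]]  # diagonals of Sigma_k
nus = [10.0, 30.0]
alpha_star, beta_star = tau_mu_posterior(1.0, 1.0, 2.0, mus, mu0s, variances, nus)
tau_mu = sample_tau_mu(alpha_star, beta_star, random.Random(0))
print(alpha_star, beta_star, tau_mu)
```

The further the current means move away from their prior values, the larger \beta^{*} becomes and the smaller the expected \tau_{\mu}=\alpha^{*}/\beta^{*}, so the model leans more on the new observations.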

#### 3.1.3 Inference of the HDP transition parameters

Thus far we have discussed the adaptation of the learning rate for emission parameters. The other main set of parameters in our AdOn HDP-HMM are the HDP’s \beta and \pi parameters that jointly and hierarchically cater for the transition probabilities. The distributions of these parameters are shown in Equation 15, where m and n are HDP sufficient statistics representing the frequency of occurrence in each class:

$$
\begin{split}
& \boldsymbol{\beta}\sim Dir(\gamma/L+m_{.1},\ldots,\gamma/L+m_{.L})\\
& \pi_{j}\sim Dir(\alpha_{1}\beta_{1}+n_{j1},\ldots,\alpha_{j}\beta_{j}+\kappa+n_{jj},\ldots,\alpha_{L}\beta_{L}+n_{jL})
\end{split}
\tag{15}
$$

Similarly to the previous parameters, we illustrate the impact of the learning rates, \tau_{\beta} and \tau_{\pi}, on the hyper-parameters of the above Dirichlet distributions in standard form and infer the posterior samples for these learning rates:

$$
\begin{split}
& \textbf{Inference of the learning rate for }\boldsymbol{\beta}\\
& \phi_{\beta}=(\gamma/L+m_{.1},\ldots,\gamma/L+m_{.L})\rightarrow\eta_{\beta}=(\gamma/L+m_{.1}-1,\ldots,\gamma/L+m_{.L}-1)\\
& \eta^{\prime}_{\beta}=\tau_{\beta}\eta_{\beta}=(\tau_{\beta}(\gamma/L+m_{.1})-\tau_{\beta},\ldots,\tau_{\beta}(\gamma/L+m_{.L})-\tau_{\beta})\\
& \beta\sim Dir(\tau_{\beta}(\gamma/L+m_{.1}-1)+1,\ldots,\tau_{\beta}(\gamma/L+m_{.L}-1)+1)\\
& \textbf{Inference of the learning rate for }\boldsymbol{\pi}\\
& \phi_{\pi}=(\alpha_{1}\beta_{1}+n_{j1},\ldots,\alpha_{j}\beta_{j}+\kappa+n_{jj},\ldots,\alpha_{L}\beta_{L}+n_{jL})\\
& \rightarrow\eta^{\prime}_{\pi}=(\tau_{\pi}(\alpha_{1}\beta_{1}+n_{j1})-\tau_{\pi},\ldots,\tau_{\pi}(\alpha_{j}\beta_{j}+\kappa+n_{jj})-\tau_{\pi},\ldots,\tau_{\pi}(\alpha_{L}\beta_{L}+n_{jL})-\tau_{\pi})\\
& \pi_{j}\sim Dir(\tau_{\pi}(\alpha_{1}\beta_{1}+n_{j1}-1)+1,\ldots,\tau_{\pi}(\alpha_{j}\beta_{j}+\kappa+n_{jj}-1)+1,\ldots,\tau_{\pi}(\alpha_{L}\beta_{L}+n_{jL}-1)+1)
\end{split}
\tag{16}
$$
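Equation 16 maps every Dirichlet parameter a to \tau(a-1)+1. The sketch below applies this map and draws a sample through normalised Gamma variates. The function names are hypothetical, and the positivity check is our own addition: parameters below one combined with a large \tau can produce non-positive values, which do not define a valid Dirichlet distribution.

```python
import random

def temper_dirichlet(concentration, tau):
    """Equation 16: map each Dirichlet parameter a to tau * (a - 1) + 1."""
    tempered = [tau * (a - 1.0) + 1.0 for a in concentration]
    if any(a <= 0.0 for a in tempered):
        # Added safeguard (not part of Equation 16).
        raise ValueError("tempered Dirichlet parameters must be positive")
    return tempered

def sample_dirichlet(concentration, rng):
    """Standard construction: normalise independent Gamma(a_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in concentration]
    total = sum(draws)
    return [d / total for d in draws]

rng = random.Random(0)
# Illustrative values of gamma/L + m_{.k} for L = 3 states.
phi_beta = [1.5, 4.0, 10.5]
for tau in (0.5, 1.0, 2.0):
    tempered = temper_dirichlet(phi_beta, tau)
    print(tau, tempered, sample_dirichlet(tempered, rng))
```

The same helper applies to a row \pi_{j} once its parameters \alpha_{k}\beta_{k}+n_{jk} (plus \kappa on the diagonal) have been assembled.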

Inference of \tau_{\beta} and \tau_{\pi}

To the best of our knowledge, there are no conjugate priors over a scaling factor for the parameters of a Dirichlet distribution in the presence of an intercept. Hence, we estimate the next batch’s learning rate using a Metropolis-Hastings (MH) jump. This approach is used in several other studies (such as [33] [34]) and is a valid MCMC move. For the MH step, one can choose a suitable candidate function, and the samples are accepted with probability of acceptance p(\xi\rightarrow\xi^{*})\propto\min\left(1,\frac{p(\xi^{*})Q(\xi\rightarrow\xi^{*})}{p(\xi)Q(\xi^{*}\rightarrow\xi)}\right).

To sample \tau_{\beta}, we have selected the candidate function as G(\tau_{\beta}|\alpha,\beta), the prior over the learning rate. The new sample (\tau^{*}_{\beta}) is accepted with the probability in Equation 17, updating \tau_{\beta} for the current batch with the accepted sample. An identical approach can be taken for \tau_{\pi} by replacing \pi_{j} for \beta in Equation 17. The \beta subscripts in \tau_{\beta} are removed to avoid notational clutter.

\begin{split}
&p(\tau\rightarrow\tau^{*})\propto\min\left(1,\frac{p(\tau^{*}|\alpha,\beta)Q(\tau\rightarrow\tau^{*})}{p(\tau|\alpha,\beta)Q(\tau^{*}\rightarrow\tau)}\right)\\
&\frac{p(\tau^{*}|\alpha,\beta)Q(\tau\rightarrow\tau^{*})}{p(\tau|\alpha,\beta)Q(\tau^{*}\rightarrow\tau)}\propto\frac{Dir(\beta|\alpha,\tau^{*})G(\tau^{*})G(\tau)}{Dir(\beta|\alpha,\tau)G(\tau)G(\tau^{*})}=\frac{Dir(\beta|\alpha,\tau^{*})}{Dir(\beta|\alpha,\tau)}\\
\end{split} \qquad (17)
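The MH step of Equation 17 can be sketched as follows. Because the candidate is drawn from the Gamma prior, the prior terms cancel and the acceptance ratio reduces to a ratio of Dirichlet densities evaluated at the tempered parameters. This is an illustrative sketch only: the helper names, the Gamma hyper-parameters (`prior_shape`, `prior_rate`) and the example values of \phi and \beta are assumptions, not values from the paper.

```python
import math
import random

def log_dirichlet_pdf(x, params):
    """Log density of a Dirichlet distribution at a point x on the simplex."""
    log_norm = math.lgamma(sum(params)) - sum(math.lgamma(a) for a in params)
    return log_norm + sum((a - 1.0) * math.log(x_k) for a, x_k in zip(params, x))

def tempered_params(phi, tau):
    """Tempered Dirichlet parameters tau * (phi - 1) + 1."""
    return [tau * (p - 1.0) + 1.0 for p in phi]

def mh_step_tau(tau, beta, phi, prior_shape, prior_rate, rng):
    """One independence MH step for tau, proposing from its Gamma prior."""
    tau_star = rng.gammavariate(prior_shape, 1.0 / prior_rate)  # scale = 1 / rate
    star_params = tempered_params(phi, tau_star)
    if min(star_params) <= 0.0:
        return tau  # candidate gives an invalid Dirichlet: reject
    # The prior terms cancel, leaving the ratio of Dirichlet densities.
    log_ratio = (log_dirichlet_pdf(beta, star_params)
                 - log_dirichlet_pdf(beta, tempered_params(phi, tau)))
    if rng.random() < math.exp(min(0.0, log_ratio)):
        return tau_star
    return tau

# Hypothetical inputs: standard-form parameters phi and a current sample of beta.
rng = random.Random(1)
phi = [5.0, 2.0, 1.5]
beta = [0.6, 0.25, 0.15]
tau = 1.0
samples = []
for _ in range(2000):
    tau = mh_step_tau(tau, beta, phi, prior_shape=2.0, prior_rate=2.0, rng=rng)
    samples.append(tau)
```

Working in log space avoids overflow in the Gamma functions, and rejecting candidates that produce non-positive parameters keeps the chain inside the region where the tempered Dirichlet is defined.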

### 3.2 Discussion on the learning rates

As described in the above sections, the learning rates for each parameter are inferred separately to allow more degrees of freedom for independent adaptation of each parameter. The empirical results support this choice, as each of the learning rates (\tau_{A},\tau_{\Sigma},\tau_{\beta},\tau_{\pi}) can adapt differently for the same sequence of data, depending on the complexity of the data and the degree of evolution in the emissions and state transitions. Nevertheless, their impact on the mean and covariance of the respective posterior distributions tends to follow a similar pattern. As shown for \tau_{\mu} in its posterior equation, the learning rate does not change the mean but inversely scales the covariance (see Appendix D for more details). Accordingly, whenever 0\leq\tau<1 the posterior distribution is driven more by the current observations, whereas for \tau>1 the inferred parameters follow the prior distribution more closely. In the following experiments, the dynamics of \tau with respect to the data are explored more extensively.

## 4 Experiments

The experiments aim to explore the effectiveness of the proposed AdOn HDP-HMM for segmentation and classification in a variety of scenarios. To closely examine the adaptability of the model, we have designed several synthetic datasets with stationary and evolutionary distributions. These datasets also allow us to investigate the effect of the learning rates on adaptability, where an adaptive learning rate is denoted concisely as ‘ada \tau’ and the basic alternative with a fixed learning rate is shown as \tau=1. We then use two video datasets to demonstrate the performance of the proposed model on various challenging sequences with noisy data, abrupt changes and new classes in the test data. It is important to mention that the degree of challenge in the synthetic experiments is not easily comparable to that of the video data, due to differences in the nature of the signals, the noise and, most importantly, the degree of evolution, which is stronger by design in the synthetic data. Hence, analysing both categories of experiments can shed more light on the adaptability of AdOn HDP-HMM in various contexts.

To evaluate the results more comprehensively, metrics for both classification and time segmentation performance are introduced. For classification accuracy, we have used frame-level comparison of the decoded classes with the ground truth (based on Hamming distance). To evaluate time segmentation, the standard metrics of precision and recall are utilised to indicate the accuracy of detecting boundaries between segments. A true boundary is regarded as correctly detected if a change of state is decoded within an interval of \pm\Delta t frames from the ground truth location, where \Delta t is set to 10 percent of the average segment length. Any additional detected boundaries are counted as false positives. We also report the difference between the overall number of actions detected in the test sequence and the number of actions in the ground truth (noted as cardinality, with an ideal value of zero).
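The metrics described above can be sketched as follows. This is one possible reading of the text rather than the authors' evaluation code: it assumes that decoded labels have already been aligned with the ground-truth label set, that each decoded boundary can match at most one true boundary, and that cardinality is computed as the number of distinct decoded labels minus the number of distinct true labels. All function names are hypothetical.

```python
def frame_accuracy(pred, truth):
    """Fraction of frames whose decoded label equals the ground truth."""
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)

def boundaries(labels):
    """Frame indices at which the label changes."""
    return [i for i in range(1, len(labels)) if labels[i] != labels[i - 1]]

def boundary_precision_recall(pred, truth, tolerance_ratio=0.10):
    """Precision and recall of boundaries within +/- delta_t frames."""
    true_b, pred_b = boundaries(truth), boundaries(pred)
    avg_segment_length = len(truth) / (len(true_b) + 1)
    delta_t = tolerance_ratio * avg_segment_length
    unmatched = list(pred_b)
    hits = 0
    for b in true_b:
        close = [p for p in unmatched if abs(p - b) <= delta_t]
        if close:
            # Assumption: one-to-one matching, nearest decoded boundary first.
            unmatched.remove(min(close, key=lambda p: abs(p - b)))
            hits += 1
    precision = hits / len(pred_b) if pred_b else 1.0
    recall = hits / len(true_b) if true_b else 1.0
    return precision, recall

def cardinality_difference(pred, truth):
    """Assumed sign convention: decoded classes minus ground-truth classes."""
    return len(set(pred)) - len(set(truth))

# Toy sequence of 60 frames: three true segments of 20 frames, so delta_t = 2.
truth = [0] * 20 + [1] * 20 + [2] * 20
pred = [0] * 21 + [1] * 14 + [3] * 2 + [1] * 5 + [2] * 18
print(frame_accuracy(pred, truth))
print(boundary_precision_recall(pred, truth))
print(cardinality_difference(pred, truth))
```

In this toy case both true boundaries are recovered within the tolerance, while the short spurious segment adds two extra decoded boundaries that count as false positives and one extra class.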

The empirical results are quantitatively reported in tables, also visualised in colour plots of ground-truth vs. estimated labels. In each illustration (for instance, Figure 6), the horizontal axis is the time and the estimated labels are plotted on top of the true labels, providing a qualitative measure for the segmentation and classification performance. These plots are best viewed in colour.

### 4.1 Synthetic data

The basic framework of the synthetic dataset is generated from a univariate HMM with 5 states distributed around dispersed means (\mu=[100,200,300,400,500]) with unit variance and a Dirichlet-distributed transition matrix (\alpha=[3,3,3,3,3]). This generative model is similar to the AdOn HDP-HMM, but not an exact replica, due to the absence of the HDP prior and of the adaptation of \tau in the generative process (please refer to Figures 2 and 4 for comparison).
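A generator of this kind can be sketched with the Python standard library. This is an assumed reconstruction of the setup described above, not the authors' script: the function names, the uniform initial state and the random seed are hypothetical choices.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalising independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_hmm_sequence(length, means, std, alpha, rng):
    """Sample states and Gaussian observations from a univariate HMM."""
    n_states = len(means)
    # Each row of the transition matrix is an independent Dirichlet draw.
    transitions = [sample_dirichlet(alpha, rng) for _ in range(n_states)]
    state = rng.randrange(n_states)  # assumption: uniform initial state
    states, observations = [], []
    for _ in range(length):
        states.append(state)
        observations.append(rng.gauss(means[state], std))
        state = rng.choices(range(n_states), weights=transitions[state])[0]
    return states, observations, transitions

means = [100.0, 200.0, 300.0, 400.0, 500.0]
rng = random.Random(42)
states, observations, transitions = generate_hmm_sequence(
    length=100, means=means, std=1.0, alpha=[3.0] * 5, rng=rng)
```

Raising `std` from 1 to 50 gives the heavily overlapping variant used in the noisy stationary experiment.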

#### 4.1.1 Stationary distributions

Given the above basic configuration, the stationary experiments are run over 3 sequences of length 100, trained using leave-one-out cross validation. Hence, the distributions of training and test samples are the same. The test sequence is split into batches with approximate size of 16 time units. To provide adaptation, the inferred parameters of each batch are propagated into the next batch as priors.

The proposed Adaptive Online HDP-HMM recognises and segments this basic version with 100 percent accuracy, whether or not the learning rates are used. To probe the model further, we add significant noise by increasing the standard deviation to 50, thereby causing considerable overlap between the distributions of the states (Figure 5). Despite this substantial noise, the model remains accurate, with an average frame-level accuracy of 76.3 percent. Repeating this experiment on the same data with fixed learning rates (\tau=1) shows a decline in accuracy of 3 percentage points and undesirable extra states. The first two rows of Table 1 show the detailed figures in terms of precision, recall and number of inferred states.

#### 4.1.2 Evolutionary distributions

A more advanced experiment is designed by training the model on synthetic data with evolving distributions, either involving gradual shifts to the means of each class or including new unseen classes. The standard deviation for this experiment is set to \sigma=10.

Shifting class means: To examine the adaptability of the model, we drift the class means by \delta=0.5 at each time step. Therefore, an instance appearing at t=10 in the test sequence is generated from a distribution with its mean shifted by 5 units. For a non-adaptive model and given the synthetic generation scheme, such data can cause significant classification errors after a few tens of time units. However, the results of the Adaptive Online HDP-HMM demonstrate smooth adaptation and excellent accuracy over the evolving sequence (Figure 6). There are a few misclassifications towards the end of the sequence which are due to the heavy distributional drift. Comparison between these results and that of fixed learning rates shows a significant drop of 26 percentage points in accuracy and one undesirable new class (see Table 1, section ii).
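To see why a non-adaptive model degrades after a few tens of time units, consider a worked calculation under simplifying assumptions that are not part of the paper: a fixed nearest-mean rule stands in for the non-adaptive model, the class is an interior one with neighbours 100 units away on either side, and the drift is upward. The error rate at time t then follows directly from the Gaussian CDF with \delta=0.5 and \sigma=10.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def static_error_probability(t, drift=0.5, sigma=10.0, spacing=100.0):
    """Error rate at time t of a fixed nearest-mean rule for an interior class."""
    shift = drift * t
    half_gap = spacing / 2.0
    # Observation falls beyond the midpoint towards the next class.
    upper = 1.0 - normal_cdf((half_gap - shift) / sigma)
    # Observation falls beyond the midpoint towards the previous class.
    lower = normal_cdf((-half_gap - shift) / sigma)
    return upper + lower

for t in (10, 40, 60, 80, 100):
    print(t, round(static_error_probability(t), 4))
```

Under these assumptions the error is negligible at t=10, grows steadily as the shifted mean approaches the midpoint between classes, and reaches one half at t=100, when the mean sits exactly on the fixed decision boundary.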

New classes: In this experiment, the distributions do not shift, yet one new class appears around \mu=600 with the same \sigma as the other classes. The model is able to create a new state (shown with a random new colour in Figure 6), learn it and consistently recognise it in the later batches without distorting the parameters of the existing classes. The overall accuracy of 100 percent for this experiment is mostly due to the learning rate adjusting the variance of each class with respect to the degree of adaptation. Without learning rates, accuracy drops by 14 percentage points because the parameters of the existing classes drift (Table 1), and the model also exhibits one extra class and reduced recall and precision.

Combination of the two: Combining the above two evolutionary scenarios, we test the proposed model on a sequence with a new class that needs to be distinguished among the existing shifting classes. The challenge is two-fold: i) the shifting modes are prone to being misclassified as new classes, and ii) the new class might be merged into one of the existing shifted modes. This experiment is the closest to challenging real-world scenarios, where new states are likely to appear while the distributions change over time. Given the combined challenge, the AdOn HDP-HMM proves highly accurate (93 percent), exhibiting a considerable improvement in accuracy (12 percentage points) and in the cardinality of states thanks to the learning rate mechanism.

The performance of the Adaptive Online HDP-HMM is not perturbed by these challenges because the learning rate tunes the adaptability of the parameters with respect to the observed data. In an evolutionary scenario, the likelihood of the observations given the current parameters is low. This causes the posterior covariance learning rate (\tau_{\Sigma}) to increase, keeping the variance close to its prior. This, in turn, prevents a drift of the variance towards large values and allows for the mean to evolve. The concentration of \tau_{\mu} around zero is an empirical support for this claim (see Figure 9b).

In the absence of the learning rate, the model still learns and recognises the new state thanks to properties of HDP. However, the overall performance deteriorates. On the one hand, new undesirable classes appear in response to drift. On the other, some of the existing classes collapse into a single one, due to considerable increase of variance caused by the class shifts. This rigid increase in variance does not allow the means to evolve, ultimately forcing the model to merge some of the neighboring states into a single class with a large variance (Figure 6e,f).

### 4.2 Activity recognition datasets

In this section, we use two video datasets to assess the performance of the proposed model in activity recognition scenarios.

#### 4.2.1 Collated Weizmann dataset

The Weizmann dataset contains 93 single-action videos from a set of 10 classes performed by 9 different actors. While the recognition accuracy on the original dataset is saturated [35] [36], some studies have collated its individual actions into (unsegmented) sequences to experiment with time segmentation [5]. In a similar way, we have created 4 sequences, each consisting of 12 random actions selected from the provided action classes. Each sequence consists of approximately 900 frames. As feature set, we have used the position of the actor’s centroid in the image plane and the distances between the centroid and the actors’ contour along five given directions [37].

The estimated states of the AdOn HDP-HMM variants over the above sequences are visualised in Figure 7, showing good qualitative agreement with the ground truth in both segmentation and classification. The quantitative results are reported in Table 2, including an Offline variant in which the whole test sequence is processed as a single batch. This variant is run for the sake of comparison with a similar offline max-margin study [5]. However, the results are not directly comparable for two reasons: a) the datasets are similar in conception, yet different in sequence collation, and b) the classifier in [5] operates over a closed set of classes, as opposed to ours, which allows an unlimited number of classes. The results with the fixed learning rate (\tau=1) show a similar trend to the adaptive ones, with only a slightly lower average accuracy. This can be due to the stationary nature of the dataset, as training and test sequences are drawn from similar distributions and little adaptation is required. In addition, online processing does not show any noticeable deterioration in accuracy compared with full, offline processing.

#### 4.2.2 TUM kitchen dataset

The TUM kitchen dataset is a human assistive dataset, consisting of natural unsegmented sequences of everyday activities performed in a typical kitchen environment [7]. The dataset contains multi-modal data, annotated separately for the actors’ left and right hands (9 classes) and torso (2 classes). The features are 28D vectors of joint coordinates for the torso and the relevant hands. The main actions include ‘Reaching’, ‘Releasing Grasp Of Something’, ‘Taking An Object’, ‘Reaching Upward’, ‘Lowering An Object’, opening and closing doors and drawers and ‘Carrying While Locomoting’, the distinctions between which are at times quite subtle even for human annotators. The main advantage of this dataset over the collated Weizmann is that the transitions between actions occur naturally and the boundaries are vague even to human annotators, hence time segmentation is more challenging.

In our experiments, we have performed segmentation and classification on the actions of the left and right hands, separately. All the sequences provided by the 3D motion capture sensors are used in leave-one-out cross validation tests. Experiments are run for both the typical sequences (denoted as ‘robotic’, taking objects one by one), and the more challenging ones (‘complex’ including sequences with multiple objects moved together, in arbitrary order and repeatedly).

For a general study of performance, we run an experiment on all the above sequences, both robotic and complex. The two groups differ in state transition probabilities, height and size of the actors and frequencies of action occurrence. The experiment is repeated with fixed and adaptive learning rates and the results are compared in Table 3, showing closely matching frame-level accuracy overall. The comparison between the same sequences with fixed vs. adaptive learning rate shows a minor improvement in frame-level accuracy and a significant decrease in state cardinality error. Note that the figures under cardinality show the difference between the inferred and the actual number of states. To facilitate visual evaluation, 4 of the sequences are colour-plotted in Figure 8. It is worth noting that classes in this dataset may prove hard to segment. For instance, the distinction between putting an object on the table and releasing the grasp of it can be very subtle (the back-to-back lavender-blue and light-blue colours in Figure 8). This becomes more challenging when a model has extra degrees of freedom for deriving a dynamic number of classes, and explains the negative cardinality in the results.

To specifically observe the adaptive behaviour, we have trained the model on the robotic sequences and tested it on the complex ones. Although the emission parameters might not change radically in this scenario, the transition probabilities need to adapt due to changes in the order of actions in the complex set. Table 4 shows the contribution of the learning rate, mainly in cardinality and overall accuracy. As in the synthetic results, in the presence of learning rates the model is able to prevent an excessive increase of the variance and to prevent neighboring classes from collapsing into one (a phenomenon that can be observed when \tau=1 in Figures 8d,e).

To evaluate the ability to recognise new classes, we have taken the first 4 sequences and removed the observations related to ‘Lowering an object’ (shown in lavender-blue in Figure 8f) in all but the first sequence. We have then trained the model on sequences 2-4 and tested on the sequence containing the new action. AdOn HDP-HMM is able to recognise a new action (brown in Figure 8f) and learn its parameters with consistent future recognition. This significant property of the model is inherent to the HDP approach and the behaviour is similar, irrespective of whether or not the learning rate is utilised.

The closest study on the TUM kitchen dataset leverages a CRF [7]. This method is not directly comparable to ours since AdOn HDP-HMM is online, adaptive and uses a dynamic class set. To create a closer match, we have run the Offline variant of AdOn HDP-HMM, whose results are similar to those of the CRF on the robotic sequences and outperform it on the complex sequences. This finding aligns with our principal claim that adaptability leads to clear improvements when the test distributions differ from the training ones. The distributions of \tau_{\pi} and \tau_{\beta} (the transition-related learning rates) for these experiments are mainly peaked around 0.1, indicating that the learning rates encourage the model to rely on the observed data to infer the HDP transition probabilities, which translates into more adaptability.

| Sequences | Accuracy RH, ada \tau | Accuracy RH, \tau=1 | Accuracy LH, ada \tau | Accuracy LH, \tau=1 | Cardinality RH, ada \tau | Cardinality RH, \tau=1 | Cardinality LH, ada \tau | Cardinality LH, \tau=1 |
|---|---|---|---|---|---|---|---|---|
| Online Seq 0-0 | 0.79 | 0.81 | 0.73 | 0.71 | 0 | -1 | -1 | -1 |
| Online Seq 0-1 | 0.79 | 0.82 | 0.75 | 0.75 | -2 | -1 | 0 | -1 |
| Online Seq 0-2 | 0.76 | 0.70 | 0.78 | 0.75 | -2 | -1 | -1 | -1 |
| Online Seq 0-3 | 0.84 | 0.84 | 0.67 | 0.69 | 0 | -2 | 0 | -1 |
| Online Seq 0-4 | 0.70 | 0.69 | 0.71 | 0.72 | 1 | -1 | -2 | -3 |
| Online Seq 0-6 | 0.51 | 0.48 | 0.56 | 0.55 | -3 | -6 | -1 | -3 |
| Online Seq 0-7 | 0.45 | 0.48 | 0.57 | 0.55 | -3 | -4 | -1 | -3 |
| Online Seq 0-8 | 0.64 | 0.68 | 0.62 | 0.63 | -1 | -3 | -2 | -2 |
| Online Seq 0-9 | 0.73 | 0.71 | 0.70 | 0.69 | 0 | -2 | -1 | -2 |
| Online Seq 0-10 | 0.79 | 0.79 | 0.68 | 0.70 | 0 | -2 | 0 | -1 |
| Online Seq 0-11 | 0.70 | 0.76 | 0.63 | 0.63 | -5 | -4 | -2 | -3 |
| Online Seq 0-12 | 0.64 | 0.64 | 0.58 | 0.55 | -1 | -2 | -3 | -5 |
| Online Seq 1-0 | 0.65 | 0.69 | 0.68 | 0.69 | -1 | -2 | -3 | -4 |
| Online Seq 1-1 | 0.71 | 0.69 | 0.65 | 0.62 | -1 | -1 | -2 | -3 |
| Online Seq 1-2 | 0.63 | 0.63 | 0.76 | 0.74 | -1 | -1 | 0 | -1 |
| Online Seq 1-3 | 0.14 | 0.14 | 0.66 | 0.65 | -6 | -6 | 0 | 0 |
| Online Seq 1-4 | 0.64 | 0.67 | 0.74 | 0.71 | -4 | -5 | 0 | -1 |
| Online Seq 1-5 | 0.67 | 0.67 | 0.61 | 0.61 | 0 | 0 | -2 | -2 |
| Online Seq 1-7 | 0.69 | 0.68 | 0.60 | 0.58 | -1 | -1 | -1 | -2 |
| robotic sequences | | | | | | | | |
| Avg Online | 0.80 | 0.79 | 0.73 | 0.73 | 1.00 | 1.25 | 0.5 | 1.00 |
| Avg Offline | 0.80 | 0.81 | 0.74 | 0.73 | 1.00 | 1.50 | 1.00 | 1.10 |
| Avg Offline CRF [7] | 0.83 (avg) | | | | - | | | |
| complex sequences | | | | | | | | |
| Avg Online | 0.66 | 0.66 | 0.67 | 0.66 | 1.68 | 2.37 | 1.16 | 2.26 |
| Avg Offline | 0.66 | 0.66 | 0.67 | 0.66 | 1.48 | 2.28 | 1.23 | 2.26 |
| Avg Offline CRF [7] | 0.63 (avg) | | | | - | | | |

| Sequences | Accuracy RH, ada \tau | Accuracy RH, \tau=1 | Accuracy LH, ada \tau | Accuracy LH, \tau=1 | Cardinality RH, ada \tau | Cardinality RH, \tau=1 | Cardinality LH, ada \tau | Cardinality LH, \tau=1 |
|---|---|---|---|---|---|---|---|---|
| Online Actor1, complex | 0.73 | 0.72 | 0.65 | 0.68 | -2 | -3 | 2 | -2 |
| Online Actor3, complex | 0.55 | 0.54 | 0.52 | 0.49 | -1 | -3 | -4 | -6 |
| Online Actor1, repetitive | 0.45 | 0.48 | 0.57 | 0.55 | -3 | -4 | -1 | -3 |

### 4.3 Sampling efficiency and computational time

We next examine the Gibbs sampler’s mixing rate and execution time for the above experiments. To gain an overall understanding of parameter mixing (emission and transition), the log-likelihood is shown in Figure 9e. Since most of the sampled variables contribute to the likelihood calculation, the well-mixed results indicate general mixing efficiency in the model. Additionally, mixing trends of the learning rates (\tau_{\mu},\tau_{\Sigma},\tau_{\pi},\tau_{\beta}) for a generic evolutionary run are shown in Figure 9a-d, both to monitor mixing and to support the discussion of the experiments. The large values of \tau_{\Sigma} prevent the model from immediately increasing the variance to fit the changing distributions. Rather, the model allows the means to evolve, by converging to small values of \tau_{\mu}. The similarly small values of \tau_{\pi} and \tau_{\beta} ensure adaptability of the model towards changing state transitions for the HDP-HMM. Through the orchestration of these parameters, the proposed model can adapt to changes in the streaming batches with a more exact account of the true cardinality of the classes, and avoids collapsing neighboring classes into a single one.

Finally, the computational time per frame for runs on an Intel Xeon E5 2.90 GHz processor over the Weizmann and TUM kitchen datasets is shown in Figure 9f. The boxplot includes online and offline variants, with and without learning rates, to help explore how the learning rates and the online scheme affect the computational time. Based on the elapsed time (in seconds), the offline run is the fastest since all the data are processed in a single batch. The adaptive online runs occur in 3-4 batches of 1000 iterations each, and therefore show an increase of about 5-10 ms in completion time. Adapting the learning rate can cause a delay of 3-10 ms, yet given the discussed benefits, particularly for evolving sequences, this latency is quite reasonable. It is important to mention that, given the initial bootstrap training, the Gibbs algorithm converges rapidly, allowing the model to run in acceptable time. Overall, using the learning rate brings multiple improvements without imposing excessive computational load on the system.

## 5 Conclusion

In this paper, we have proposed a novel, adaptive online model suited for on-the-fly time segmentation and recognition of sequential data. The proposed AdOn HDP-HMM is capable of online segmentation and classification of streaming batches of data over incremental class sets. The main contribution of this model is the unsupervised posterior adaptation of the parameters over the successive data batches. This is accomplished by using a learning rate that dynamically tunes the model, balancing the impact of the current batch against the memory accumulated so far. This proves an effective solution for online sequential estimation problems requiring adaptation over evolving distributions.

The performance of AdOn HDP-HMM is evaluated via a number of experiments including stationary and evolutionary scenarios. Thereby, we have tested the general segmentation and classification accuracies in addition to the ability to detect the correct number of classes. The results are reported on variations of synthetic data and two activity recognition video datasets (Collated Weizmann and TUM Assistive Kitchen). The proposed model has achieved a remarkable accuracy in all cases, and considerable improvements in evolutionary scenarios.

Thanks to the unsupervised adaptive online estimation and the capacity to learn over infinite class sets, the proposed AdOn HDP-HMM can be a solution for sequential estimation in a number of scenarios which have received relatively little attention in the literature. Not relying on human intervention, revision or correction of estimated labels, this model can be a suitable candidate for streaming applications. In addition, although designed for evolutionary distributions, its accuracy over stationary data has proved higher than or equal to that of the most comparable results, and the computational load is not affected significantly.

## 6 Appendix A: The balancing effect of \tau

In this section we address posterior inference of parameters and explore how the prior and likelihood distributions convey the knowledge of the current observations and accumulated summary of previous data. Considering the online HDP-HMM model with parameters \phi, observations Y and learning rate \tau, the posterior for parameters in the n^{th} batch is:

\begin{split}
&p(\phi_{n}|Y_{n},\phi_{n-1},\tau)\propto p(Y_{n}|\phi_{n-1})p(\phi_{n-1})^{\tau}\\
&\propto\underbrace{\prod^{N}_{i=1}{p(y_{n,i}|\phi_{n-1})}}_{N}\ \underbrace{\left(p(\phi_{0})\prod^{n-1}_{j=1}{\prod^{N}_{i=1}{p(y_{j,i}|\phi_{j-1})}}\right)^{\tau}}_{N(n-1)\tau}
\end{split} \qquad (18)

As more batches stream in (i.e. n increases), the weight of prior is accumulated and adaptivity to new data declines. The learning rate, however, can be used as an equaliser that controls the balance of prior versus likelihood and tunes model adaptivity. For positive values of \tau<1, the model discounts the impact of accumulated previous data and allows for more adaptivity. However, when \tau>1, posterior \phi_{n} is inclined to follow the prior more strictly. In other words, \tau can be seen as the scaling coefficient for the number of ‘pseudo-observations’ in the prior.
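The pseudo-observation reading of \tau can be illustrated with a deliberately simple conjugate case that is not the HDP-HMM itself: a Gaussian mean with known variance, whose prior summarises `prior_count` earlier observations. Raising that prior to the power \tau scales its pseudo-count to \tau times `prior_count`, so the posterior mean becomes a weighted average of the prior mean and the current batch mean. All names and numbers below are hypothetical.

```python
def tempered_posterior_mean(prior_mean, prior_count, batch_mean, batch_size, tau):
    """Posterior mean of a Gaussian mean when the prior is raised to the power tau."""
    prior_weight = tau * prior_count  # tau rescales the prior pseudo-observations
    return ((prior_weight * prior_mean + batch_size * batch_mean)
            / (prior_weight + batch_size))

# Hypothetical setting: five earlier batches of 16 frames centred at 100,
# followed by a current batch of 16 frames centred at 110.
for tau in (0.1, 1.0, 2.0):
    print(tau, tempered_posterior_mean(100.0, 80, 110.0, 16, tau))
```

With \tau=0.1 the estimate moves most of the way towards the current batch, with \tau=1 the 80 accumulated observations dominate, and with \tau=2 the estimate stays even closer to the prior mean.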

## 7 Appendix B: Conjugacy for \tau_{\Sigma}

To sample \tau_{\Sigma} from the posterior, ideally we would like to consider a conjugate prior that analytically derives the posterior hyper-parameters, given those of the prior and the sufficient statistics of the current data. A candidate prior for the IW distribution is the Gamma. In this section, we investigate if the Gamma can be proven a conjugate prior for the IW likelihood, considering the impact of the learning rates on \Psi^{\prime} and \nu^{\prime}.

Given the proposed learning rate model, the probability density function of the Inverse-Wishart distribution can be redefined as below. We have derived the new hyper-parameters through conversion to canonical parameters, multiplication with the learning rate and reversion to the standard form to simplify sampling.

\begin{split}
&\Psi^{\prime}=\tau\Psi\qquad\nu^{\prime}=\tau(\nu+p+1)-p-1=c\tau+c^{\prime}\\
&IW_{p}(\Psi,\nu,\tau)=\frac{|\Psi|^{\frac{c\tau+c^{\prime}}{2}}}{2^{\frac{c\tau p+pc^{\prime}}{2}}\Gamma_{p}(\frac{c\tau+c^{\prime}}{2})}|Y|^{-\frac{c\tau+c^{\prime}+p+1}{2}}\exp\left({-\frac{1}{2}}tr(\tau\Psi Y^{-1})\right)\\
\end{split} \qquad (19)

We assume Gamma is the conjugate prior distribution for sampling \tau, and try to prove it below.

\begin{split}
&G(\tau|\Sigma,\alpha^{*},\beta^{*})\propto IW(\Sigma|\Psi,\nu,\tau)G(\tau|\alpha,\beta)\\
&G(\tau|\Sigma,\alpha^{*},\beta^{*})\propto\frac{|\Psi|^{\frac{c\tau+c^{\prime}}{2}}}{2^{\frac{c\tau p+pc^{\prime}}{2}}\Gamma_{p}(\frac{c\tau+c^{\prime}}{2})}|\Sigma|^{-\frac{c\tau+c^{\prime}+p+1}{2}}\exp\left({-\frac{1}{2}}tr(\tau\Psi\Sigma^{-1})\right)\frac{\beta^{\alpha}}{\Gamma(\alpha)}\tau^{\alpha-1}\exp(-\beta\tau)\\
\end{split} \qquad (20)

Thanks to proportionality, we can remove the constant terms with respect to \tau:

\begin{split}
&G(\tau|\Sigma,\alpha^{*},\beta^{*})\propto\frac{|\Psi|^{\frac{c\tau}{2}}}{2^{\frac{c\tau p}{2}}\Gamma_{p}(\frac{c\tau}{2})}|\Sigma|^{-\frac{c\tau}{2}}\exp\left({-\frac{1}{2}}tr(\tau\Psi\Sigma^{-1})\right)\tau^{\alpha-1}\exp(-\beta\tau)\\
\end{split} \qquad (21)

Ideally, we should create terms proportional to \tau^{\alpha-1}\exp\left(-\tau\left(\frac{1}{2}\text{tr}(\Psi\Sigma^{-1})+\beta\right)\right), but because \tau affects both hyper-parameters, the leading term related to the degrees of freedom (\nu) also depends on \tau:

\begin{split}
&G(\tau|\Sigma,\alpha^{*},\beta^{*})\propto\frac{|\Psi|^{\frac{c\tau}{2}}|\Sigma|^{-\frac{c\tau}{2}}}{2^{\frac{c\tau p}{2}}\Gamma_{p}(\frac{c\tau}{2})}\tau^{\alpha-1}\exp\left(-\tau\ \left(\frac{1}{2}\text{tr}(\Psi\Sigma^{-1})+\beta\right)\right)\\
\end{split} \qquad (22)

To conclude, the Gamma is a conjugate prior for the Inverse-Wishart only with respect to a scaling coefficient over the scale parameter \Psi. Since \tau_{\Sigma} also scales the degrees of freedom, the Gamma cannot be used as a conjugate prior for deriving the posterior distribution of \tau_{\Sigma}.

## 8 Appendix C: Conjugacy for \tau_{\mu}

Let us consider a multivariate Normal distribution in a fully general case (Eq. 23), with a conjugate Gamma prior over random variable \tau. We will show that the conjugacy holds for this setting, through expanding the right hand side of the proportionality and deriving the posterior hyper-parameters in the presence of a single sample of data (A), i.e. N=1. The resulting parameters can be easily extended to generalise to the case of N observations.

\begin{split}
&G(\tau|A,\alpha^{*},\beta^{*})\propto N(A|\mu,\frac{\Sigma}{\tau\sigma})G(\tau|\alpha,\beta),\qquad\sigma,\tau>0\\
&\propto\frac{1}{\sqrt{(2\pi)^{2}\frac{|\Sigma|}{\tau\sigma}}}\exp\left(-\frac{\tau\sigma}{2}(A-\mu)^{T}\Sigma^{-1}(A-\mu)\right)\times\frac{\beta^{\alpha}}{\Gamma(\alpha)}\tau^{\alpha-1}\exp(-\beta\tau)\\
&\text{discarding the terms that are independent of the random variable $\tau$, we will have:}\\
&\propto\tau^{1/2}\,\tau^{\alpha-1}\exp\left(-\beta\tau-\frac{\tau\sigma}{2}(A-\mu)^{T}\Sigma^{-1}(A-\mu)\right)\\
&\propto\tau^{\alpha-1/2}\exp\left(-\tau\left(\beta+\frac{\sigma}{2}(A-\mu)^{T}\Sigma^{-1}(A-\mu)\right)\right)\\
\end{split} \qquad (23)

The remaining terms are proportional to a Gamma distribution with the following parameters:

\begin{split}
\alpha^{*}=\alpha+1/2,\qquad\beta^{*}=\beta+\frac{\sigma}{2}(A-\mu)^{T}\Sigma^{-1}(A-\mu)\\
\end{split} \qquad (24)
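The update in Equation 24 translates directly into a short sampling routine. The sketch below assumes a diagonal covariance \Sigma so that the quadratic form can be computed without a linear algebra library; the function name, the argument `sigma_scale` (standing for \sigma) and the numerical values are hypothetical.

```python
import random

def tau_mu_posterior(alpha, beta, a, mu, sigma_diag, sigma_scale):
    """Gamma posterior hyper-parameters of Equation 24 for a single observation.

    Assumes a diagonal covariance whose diagonal entries are sigma_diag.
    """
    quad = sum((x - m) ** 2 / v for x, m, v in zip(a, mu, sigma_diag))
    return alpha + 0.5, beta + 0.5 * sigma_scale * quad

# Hypothetical prior Gamma(2, 1) and one two-dimensional observation.
alpha_star, beta_star = tau_mu_posterior(
    alpha=2.0, beta=1.0, a=[1.0, 2.0], mu=[0.0, 0.0],
    sigma_diag=[1.0, 4.0], sigma_scale=2.0)

rng = random.Random(0)
# random.gammavariate takes (shape, scale), so the rate beta* is inverted.
tau_mu_sample = rng.gammavariate(alpha_star, 1.0 / beta_star)
```

An observation far from \mu increases \beta^{*}, which lowers the posterior expectation \alpha^{*}/\beta^{*} of \tau_{\mu} and so favours adaptation of the mean.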

## 9 Appendix D: Impacts of \tau on parameter distributions

In this appendix, we explore the impact of \tau on the mean and covariance of the Inverse-Wishart. As mentioned in the paper, in practically all cases the mean stays approximately unchanged and the variance is scaled inversely to the learning rate.

\begin{split}
&IW(\Sigma|\Psi,\nu)^{\tau}\propto\left(|\Sigma|^{-\frac{\nu+p+1}{2}}\exp\left({-\frac{1}{2}}tr(\Psi\Sigma^{-1})\right)\right)^{\tau}\\
\\
&\left(|\Sigma|^{-\frac{\nu+p+1}{2}}\exp\left({-\frac{1}{2}}tr(\Psi\Sigma^{-1})\right)\right)^{\tau}=|\Sigma|^{-\frac{\tau(\nu+p+1)}{2}}\exp\left({-\frac{1}{2}}tr(\tau\Psi\Sigma^{-1})\right)\\
&\qquad\approx|\Sigma|^{-\frac{\tau\nu+p+1}{2}}\exp\left({-\frac{1}{2}}tr(\tau\Psi\Sigma^{-1})\right)\\
\\
&\Rightarrow\quad IW(\Sigma|\Psi,\nu)^{\tau}\propto IW(\Sigma|\tau\Psi,\tau\nu)\\
\end{split} \qquad (25)

Accepting the approximation above, the resulting \Sigma samples are drawn around approximately the same mean, but with a scaled variance. When 0\leq\tau_{\Sigma}<1 the variance increases, whereas for \tau_{\Sigma}>1 the distribution is more peaked. In other words, the posterior samples of \Sigma in the former case are allowed to move away from the IW mean, tending to have greater adaptability towards the currently observed data, whereas in the latter case the posterior samples concentrate around the prior mean, discouraging covariance adaptation.

\begin{split}
&\text{mean of }\Sigma\sim IW(\Psi,\nu):\qquad M_{\Sigma}=\frac{\Psi}{\nu-p-1},\\
&\text{mean of }\Sigma^{(\tau)}\sim IW(\tau\Psi,\tau\nu):\qquad M_{\Sigma}^{(\tau)}=\frac{\tau\Psi}{\tau\nu-p-1}\approx M_{\Sigma}\\
&\text{variance of }\Sigma\sim IW(\Psi,\nu):\qquad V_{\Sigma}\approx\frac{\Psi_{ij}^{2}}{\nu^{3}}\\
&\text{variance of }\Sigma^{(\tau)}\sim IW(\tau\Psi,\tau\nu):\qquad V_{\Sigma}^{(\tau)}\approx\frac{\tau^{2}\Psi_{ij}^{2}}{\tau^{3}\nu^{3}}\approx V_{\Sigma}/\tau
\end{split} \qquad (26)
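The scaling behaviour can be checked numerically in the univariate case (p=1), where the Inverse-Wishart reduces to an inverse-gamma distribution with shape \nu/2 and scale \Psi/2. The sketch below uses the standard closed-form inverse-gamma moments rather than the large-\nu approximations, and the values of \Psi and \nu are hypothetical.

```python
def iw1_mean_var(psi, nu):
    """Exact mean and variance of a univariate Inverse-Wishart IW_1(psi, nu)."""
    if nu <= 4.0:
        raise ValueError("the variance exists only for nu > 4 when p = 1")
    mean = psi / (nu - 2.0)
    var = 2.0 * psi ** 2 / ((nu - 2.0) ** 2 * (nu - 4.0))
    return mean, var

# Hypothetical hyper-parameters with a large nu, as after many batches.
psi, nu = 1000.0, 1000.0
base_mean, base_var = iw1_mean_var(psi, nu)
for tau in (0.5, 2.0):
    mean, var = iw1_mean_var(tau * psi, tau * nu)
    # Both printed ratios stay close to 1: same mean, variance scaled by 1 / tau.
    print(tau, mean / base_mean, tau * var / base_var)
```

For these values the mean changes by well under one percent while the variance is scaled by a factor close to 1/\tau, in line with the approximations above.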

## References

- [1] Horst Bunke, Peter J. Dickinson, Miro Kraetzl, and Walter D. Wallis. A graph-theoretic approach to enterprise network dynamics, volume 24. Birkhäuser, 2007.
- [2] J. Yamato, J. Ohya, and K. Ishii. Recognizing human action in time-sequential images using hidden Markov model. In Proc. CVPR, pages 379–385, 1992.
- [3] Cristian Sminchisescu, Atul Kanaujia, and Dimitris Metaxas. Conditional models for contextual human motion recognition. Computer Vision and Image Understanding, 104(2-3):210–220, 2006.
- [4] D. L. Vail, M. M. Veloso, and J. D. Lafferty. Conditional Random Fields for Activity Recognition. In Proc. Int. Conf. on Autonomous Agents and Multi-Agent Systems, 2007.
- [5] Minh Hoai, Zhen-zhong Lan, and Fernando De la Torre. Joint segmentation and classification of human actions in video. In Proc. CVPR, 2011.
- [6] Paul Over, George Awad, Martial Michel, Jonathan Fiscus, Wessel Kraaij, and Alan F. Smeaton. Trecvid 2011 – an overview of the goals, tasks, data, evaluation mechanisms and metrics. In Proceedings of TRECVID 2011. NIST, USA, 2011.
- [7] Moritz Tenorth, Jan Bandouch, and Michael Beetz. The TUM Kitchen Data Set of Everyday Manipulation Activities for Motion Tracking and Action Recognition. In IEEE International Workshop on Tracking Humans for the Evaluation of their Motion in Image Sequences (THEMIS), in conjunction with ICCV2009, 2009.
- [8] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581, 2006.
- [9] Matthew J. Beal, Zoubin Ghahramani, and Carl E. Rasmussen. The infinite hidden Markov model. In Advances in Neural Information Processing Systems, pages 577–584, 2001.
- [10] Stuart Geman and Donald Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. Pattern Analysis and Machine Intelligence, IEEE Transactions on, PAMI-6(6):721–741, Nov. 1984.
- [11] Y. W. Teh, K. Kurihara, and M. Welling. Collapsed variational inference for HDP. In Advances in Neural Information Processing Systems (NIPS), volume 20, 2008.
- [12] M. Zanotto, D. Sona, V. Murino, F. Papaleo, and H. Kjellstrom. Dirichlet process mixtures of multinomials for data mining in mice behaviour analysis. In International Conference on Computer Vision Workshops (ICCVW), 2013 IEEE Computer Society Conference on, pages 197–202, 2013.
- [13] E.B. Fox, E.B. Sudderth, M.I. Jordan, and A.S. Willsky. Bayesian Nonparametric Inference of Switching Dynamic Linear Models. IEEE Transactions on Signal Processing, 59(4):1569–1585, 2011.
- [14] C. Zhang, E. Henrik, X. Gratal, F. Pokorny, and H. Kjellstrom. Supervised hierarchical Dirichlet processes with variational inference. In International Conference on Computer Vision Workshops (ICCVW), 2013 IEEE Computer Society Conference on, pages 254–261, 2013.
- [15] H.W. Sorenson and D.L. Alspach. Recursive Bayesian estimation using Gaussian sums. Automatica, 7(4):465–479, 1971.
- [16] Kevin R. Canini, Lei Shi, and Thomas L. Griffiths. Online Inference of Topics with Latent Dirichlet Allocation. In Proc. AISTATS 2009, volume 5, pages 65–72, 2009.
- [17] Ludmila I. Kuncheva and Catrin O. Plumpton. Adaptive learning rate for online linear discriminant classifiers. In Proceedings of the 2008 Joint IAPR International Workshop on Structural, Syntactic, and Statistical Pattern Recognition, SSPR & SPR ’08, pages 510–519, Berlin, Heidelberg, 2008. Springer-Verlag.
- [18] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res., 12:2121–2159, July 2011.
- [19] Tom Schaul, Sixin Zhang, and Yann LeCun. No more pesky learning rates. CoRR abs/1206.1106v2, 2013.
- [20] Anastasia Pentina and Christoph H. Lampert. A PAC-Bayesian bound for lifelong learning. In The 31st International Conference on Machine Learning, 2014.
- [21] Nicolo Cesa-Bianchi and Gabor Lugosi. Prediction, Learning, and Games. Cambridge University Press, New York, NY, USA, 2006.
- [22] Leon Bottou. Stochastic learning. In Advanced Lectures on Machine Learning, number LNAI 3176 in Lecture Notes in Artificial Intelligence, pages 146–168, 2004.
- [23] C.C. Loy, T.M. Hospedales, Tao Xiang, and Shaogang Gong. Stream-based joint exploration-exploitation active learning. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 1560–1567, 2012.
- [24] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3), 1951.
- [25] J. Sethuraman. A constructive definition of Dirichlet priors. Statistica Sinica, 4:639–650, 1994.
- [26] E. Fox, E. Sudderth, M. Jordan, and A. Willsky. Developing a tempered HDP-HMM for systems with state persistence. Technical report, MIT Laboratory for Information and Decision Systems, 2007.
- [27] E. B. Fox, E. B. Sudderth, M. I. Jordan, and A. S. Willsky. An HDP-HMM for systems with state persistence. In Proc. ICML, July 2008.
- [28] A. Bargi, R.Y. Da Xu, and M. Piccardi. An infinite adaptive online learning model for segmentation and classification of streaming data. In Pattern Recognition (ICPR), 2014 22nd International Conference on, pages 3440–3445, Aug 2014.
- [29] C.M. Bishop. Pattern recognition and machine learning. Springer New York, 2006.
- [30] E. B. Fox. Bayesian Nonparametric Learning of Complex Dynamical Phenomena. Ph.D. thesis, MIT, Cambridge, MA, 2009.
- [31] Michael E. Tipping and Chris M. Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society, Series B, 61:611–622, 1999.
- [32] Marco Chiani. Distribution of the largest eigenvalue for real Wishart and Gaussian random matrices and a simple approximation for the Tracy–Widom distribution. Journal of Multivariate Analysis, 129:69–81, 2014.
- [33] David Knowles and Zoubin Ghahramani. Nonparametric Bayesian sparse factor models with application to gene expression modeling. The Annals of Applied Statistics, 5(2B):1534–1552, 2011.
- [34] Ava Bargi, Richard Yi Da Xu, Zoubin Ghahramani, and Massimo Piccardi. A non-parametric conditional factor regression model for multi-dimensional input and response. JMLR W and CP, 33:77–85, 2014.
- [35] Liang Wang and David Suter. Recognizing human activities from silhouettes: Motion subspace and factorial discriminative graphical model. In Proc. CVPR, Los Alamitos, CA, USA, 2007. IEEE Computer Society.
- [36] Loris Nanni, Sheryl Brahnam, and Alessandra Lumini. Combining different local binary pattern variants to boost performance. Expert Systems with Applications, 38(5):6209–6216, 2011.
- [37] Zia Moghaddam and Massimo Piccardi. In DICTA, pages 188–195, 2009.