Dynamic Hierarchical Dirichlet Process for Abnormal Behaviour Detection in Video
This paper proposes a novel dynamic Hierarchical Dirichlet Process topic model that considers the dependence between successive observations. Conventional posterior inference algorithms for this kind of model require processing the whole dataset in several passes, which is computationally intractable for massive or sequential data. We design batch and online inference algorithms, based on Gibbs sampling, for the proposed model. The online algorithm makes it possible to process sequential data, incrementally updating the model with each new observation. The model is applied to abnormal behaviour detection in video sequences. A new abnormality measure is proposed for decision making. The proposed method is compared with a method based on the non-dynamic Hierarchical Dirichlet Process, for which we also derive the online Gibbs sampler and the abnormality measure. The results on synthetic and real data show that taking the dynamics into account in a topic model improves the classification performance for abnormal behaviour detection.
I Introduction
Unsupervised and semi-supervised learning for various video processing applications is an active research area. In many situations supervised learning is inappropriate or impossible. For example, in abnormal behaviour detection it is difficult to predict in advance what kind of abnormality may happen, and hence to collect and label a training dataset for a supervised learning algorithm.
Among the unsupervised methods, topic modeling is a promising approach for abnormal behaviour detection [1, 2, 3]. It not only gives warnings about abnormalities but also provides information about typical patterns of behaviour or motion.
Topic modeling [4, 5] is a statistical tool for discovering a latent structure in data. In text mining it is assumed that unlabelled documents can be represented as mixtures of topics, where the topics are distributions over words. The topics are latent and the inference in topic models is aimed to discover them.
In the conventional topic models, documents are independent. They share the same set of topics, but weights in a topic mixture for a particular document are independent of weights for all other documents in a dataset. However, in some cases it is reasonable to assume dependence in topic mixtures in different documents.
Consider the analysis of scientific papers of a given conference in text mining. It is expected that if a topic is “hot” in a given year, it would be popular in the next year too. The popularity of the topics changes through the years but in each two successive years the set of popular topics would be similar. It means that in a topic model the topic mixtures in the documents in successive years are similar to each other.
The same ideas are valid for abnormal behaviour detection. Documents are usually defined as short video clips extracted from a whole video sequence. Topics represent some local motion patterns. If the clips are sufficiently short, motions started in a given clip would continue in the next clip. Therefore it may be expected that the topic mixtures in the successive clips would be similar.
In this paper a dynamic topic model is proposed to improve the performance of abnormal behaviour detection. Two types of dynamics are considered in the topic modeling literature. In the first type the dynamics is assumed on the topic mixtures in documents [6, 7, 8]. This is the type of dynamics described above. In the second type the dynamics is assumed on the topics themselves [9, 10, 11], i.e. the distributions over words, which correspond to topics, change through time. There are also works where both types of dynamics are considered [12, 13].
In the proposed model the first type of dynamics is considered. The model is constructed to encourage neighbouring documents to have similar topic mixtures. The second type of dynamics is not assumed: in video processing the set of words and their popularity do not change, so the distributions over words are not expected to change.
Imagine there is an infinitely long video sequence. Motion patterns that are typical for a scene may appear and disappear, and the total number of these patterns may be infinite. The motion patterns are modelled as topics in the topic model, hence the number of topics may potentially be infinite. This intuition can be captured by a nonparametric model [14]. Therefore the proposed model is nonparametric.
The model most closely related to the proposed one is presented in , which is also a dynamic topic model. The main difference between that model and the proposed one is that in the latter a document, although encouraged to have a topic mixture similar to the one in the previous document, may have any of the topics used in the dataset so far.
In abnormal behaviour detection it is essential to make a decision as soon as possible to warn a human operator to react. We propose batch and online inference for the model based on the Gibbs sampler. During the batch offline set up the Gibbs sampler processes a training set of documents, estimating distributions of words in topics. During the online set up testing documents are processed one by one. The main goal of the online inference is to estimate a topic mixture for the current document, without reconsidering all the previous documents. We also propose an abnormality measure, which is used in the final decision making.
The rest of the paper is organised as follows. In section II visual words and documents are defined. The proposed model is described in section III. Section IV presents the inference for the model, while section V introduces the abnormality detection procedure. The experimental results are given in section VI. Section VII concludes the paper.
II Video representation
In order to apply the topic modeling approach to video processing, visual words and visual documents must be defined. In this paper a visual word is defined as a quantised local motion measured by optical flow [15]. The optical flow vector is discretised spatially by averaging over the pixels of a grid cell. The direction of the average optical flow vector is further quantised into four main categories: up, right, down and left (Figure 1). The location of the averaged optical flow vector and its categorised direction together form a visual word.
The whole video sequence is divided into non-overlapping clips. Each clip is a visual document. The document consists of all the visual words extracted from the frames that form the corresponding clip.
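The construction of visual words described above can be illustrated with a minimal sketch. It assumes a dense flow field stored as a NumPy array; the cell size, the `min_magnitude` threshold for discarding static cells, and the word indexing scheme are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np

DIRECTIONS = ("up", "right", "down", "left")

def direction_category(dx, dy):
    """Quantise a flow vector into up/right/down/left.

    The image y axis is assumed to point down, so dy < 0 means upward motion.
    """
    angle = np.degrees(np.arctan2(-dy, dx)) % 360.0  # 0 = right, 90 = up
    if 45.0 <= angle < 135.0:
        return 0  # up
    if 135.0 <= angle < 225.0:
        return 3  # left
    if 225.0 <= angle < 315.0:
        return 2  # down
    return 1  # right

def visual_words(flow, cell=8, min_magnitude=1.0):
    """Turn an (H, W, 2) flow field of (dx, dy) vectors into visual word ids.

    Each grid cell contributes at most one word that encodes the cell
    location and the quantised direction of its average flow vector.
    """
    h, w, _ = flow.shape
    rows, cols = h // cell, w // cell
    words = []
    for r in range(rows):
        for c in range(cols):
            block = flow[r * cell:(r + 1) * cell, c * cell:(c + 1) * cell]
            dx, dy = block.reshape(-1, 2).mean(axis=0)
            if np.hypot(dx, dy) < min_magnitude:
                continue  # hypothetical rule: skip cells without motion
            d = direction_category(dx, dy)
            words.append((r * cols + c) * len(DIRECTIONS) + d)
    return words
```

A visual document is then the concatenation of the word lists of all frames in a clip.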
Topics in topic modeling are defined as distributions over words. They indicate which words appear together. In the video processing applications topics are distributions over visual words. As visual words represent local motions, topics indicate the set of local motions that frequently appear together. They are usually called activities or actions (e.g. [16, 6, 2, 17]).
Once visual documents, words and topics are defined, the topic model for video processing can be formulated.
III Proposed model
There is a sequence of documents $x_1, \dots, x_J$, where each document $x_j$ consists of $N_j$ words $x_{ji}$: $x_j = \{x_{j1}, \dots, x_{jN_j}\}$. It is assumed that the words are generated from a set of hidden distributions $\{\phi_k\}_{k=1}^{\infty}$, which are called topics, and that the documents are mixtures of this shared set of topics. The number of topics is not fixed. Moreover, it is assumed that, given an infinite amount of data, an infinite number of topics can be expected.
III-A Hierarchical Dirichlet Process Topic Model
Mixture models of this kind, with a potentially infinite number of mixture components, can be modelled with the Hierarchical Dirichlet Process (HDP) [18]. The HDP is a hierarchical extension of the Dirichlet process (DP), which is a distribution over random distributions [19]. Each document $x_j$ is associated with a sample $G_j$ from a DP:
$$G_j \sim \mathrm{DP}(\alpha, G_0), \quad j = 1, \dots, J, \qquad (1)$$
where $\alpha$ is a concentration parameter and $G_0$ is a base measure. $G_j$ can be seen as a vector of mixture component weights, where the number of components is infinite.
The base measure $G_0$ itself is a sample from another DP:
$$G_0 \sim \mathrm{DP}(\gamma, H), \qquad (2)$$
with the concentration parameter $\gamma$ and the base measure $H$. Sharing the measure $G_0$ ensures that the documents have the same set of topics but with different weights. Indeed, $G_0$ is almost surely discrete [19], concentrating its mass on the atoms $\phi_k$ drawn from $H$. Therefore, $G_j$ picks the mixture components from this set of atoms.
A topic, that is an atom $\phi_k$, is often modelled as a multinomial distribution with a probability $\phi_{kw}$ of choosing a word $w$ [4, 5]. The base measure $H$ is therefore chosen as the conjugate Dirichlet distribution, usually a symmetric one. Let $\beta$ denote the parameter of this Dirichlet distribution.
The document $x_j$ is formed by repeating the procedure of drawing a topic from the mixture:
$$\theta_{ji} \sim G_j, \qquad (3)$$
and drawing a word from the chosen topic:
$$x_{ji} \sim \mathrm{Mult}(\theta_{ji}), \qquad (4)$$
for every token $i = 1, \dots, N_j$, where $\mathrm{Mult}(\cdot)$ is the multinomial distribution.
III-A1 Chinese restaurant franchise
There are several representations of the HDP (as well as of the DP). In this paper the representation called the Chinese restaurant franchise (CRF) is considered, as it is used for the derivation of the Gibbs sampling inference scheme. In this metaphor, each document corresponds to a “restaurant” and words correspond to “customers” of the restaurant. The words in the documents are grouped around “tables”. Each table serves a “dish”, which corresponds to a topic. The “menu” of dishes, i.e. the set of the topics, is shared among all the restaurants.
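Let $t_{ji}$ denote the table assignment for the token $i$ in the document $j$, and $k_{jt}$ denote the topic assignment for the table $t$ in the document $j$. Let $n_{jt}$ denote the number of words assigned to the table $t$ in the document $j$ and $m_{jk}$ denote the number of tables in the document $j$ serving the topic $k$. The dots in subscripts mean marginalisation over the corresponding dimension, e.g. $m_{\cdot k}$ denotes the number of tables among all the documents serving the topic $k$, while $m_{j\cdot}$ denotes the total number of tables in the document $j$. Marginalisation over both dimensions, $m_{\cdot\cdot}$, gives the total number of tables in the dataset.
The generative process of a dataset is as follows. A new token comes to the document $j$ and chooses one of the occupied tables with a probability proportional to the number of words assigned to this table, or the new token starts a new table with a probability proportional to $\alpha$:
$$p(t_{ji} = t \mid t_{j1}, \dots, t_{j\,i-1}, \alpha) = \begin{cases} \dfrac{n_{jt}}{i - 1 + \alpha}, & \text{if } t \text{ is an occupied table}, \\[2mm] \dfrac{\alpha}{i - 1 + \alpha}, & \text{if } t \text{ is a new table}. \end{cases} \qquad (5)$$
If the token starts a new table, it chooses one of the used topics with a probability proportional to the number of tables serving this topic among all the documents, or it chooses a new topic, sampling it from the base measure $H$, with a probability proportional to $\gamma$:
$$p(k_{jt} = k \mid k_{11}, \dots, k_{j\,t-1}, \gamma) = \begin{cases} \dfrac{m_{\cdot k}}{m_{\cdot\cdot} + \gamma}, & k = 1, \dots, K, \\[2mm] \dfrac{\gamma}{m_{\cdot\cdot} + \gamma}, & k = K + 1 \text{ (a new topic)}, \end{cases} \qquad (6)$$
where $K$ is the number of topics used so far.
Once the token $i$ is assigned to the table $t_{ji}$ with the topic $k_{jt_{ji}}$, the word $x_{ji}$ for this token is sampled from this topic:
$$x_{ji} \mid t_{ji}, k_{jt_{ji}} \sim \mathrm{Mult}(\phi_{k_{jt_{ji}}}). \qquad (7)$$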
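The Chinese restaurant franchise process described above can be simulated with a short sketch. The names `alpha` (weight of a new table), `gamma` (weight of a new topic) and `beta` (symmetric Dirichlet parameter of the base measure) are assumed here in line with the usual HDP notation; the function and its signature are hypothetical and written for illustration only.

```python
import numpy as np

def crf_generate_document(n_tokens, vocab_size, topics, m_dot_k,
                          alpha, gamma, beta, rng):
    """Generate one document with the Chinese restaurant franchise.

    topics  : list of word distributions shared by all documents (grows in place)
    m_dot_k : list with the number of tables serving each topic over all
              documents generated so far (grows in place)
    """
    n_jt = []   # number of words at each table of this document
    k_jt = []   # topic served at each table of this document
    words = []
    for _ in range(n_tokens):
        # Occupied table t has weight n_jt[t]; a new table has weight alpha.
        w_tables = np.array(n_jt + [alpha], dtype=float)
        t = rng.choice(len(w_tables), p=w_tables / w_tables.sum())
        if t == len(n_jt):
            # Used topic k has weight m_dot_k[k]; a new topic has weight gamma.
            w_topics = np.array(m_dot_k + [gamma], dtype=float)
            k = rng.choice(len(w_topics), p=w_topics / w_topics.sum())
            if k == len(topics):
                # A new topic is drawn from the symmetric Dirichlet base measure.
                topics.append(rng.dirichlet(np.full(vocab_size, beta)))
                m_dot_k.append(0)
            m_dot_k[k] += 1
            n_jt.append(0)
            k_jt.append(int(k))
        n_jt[t] += 1
        words.append(int(rng.choice(vocab_size, p=topics[k_jt[t]])))
    return words, n_jt, k_jt
```

Calling the function repeatedly with the same `topics` and `m_dot_k` lists generates a dataset in which all documents share one menu of topics.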
III-B Dynamic Hierarchical Dirichlet Process Topic Model
In the HDP exchangeability of documents and words is assumed which means that the joint probability of the data is independent of the order of the documents and the words in the documents. However, in the video processing applications this assumption may be invalid. While the words inside the documents are still exchangeable, the documents themselves are not. All actions and motions in the real life last for some time, and it is expected that the topic mixture in the current document is similar to the topic mixture in the previous document. Some topics may appear and disappear but the core structure of the mixture components weights only slightly changes from document to document.
We propose a dynamic extension of the HDP topic model to take this intuition into account. In this model the probability of the topic $k$ explicitly depends on the usage of this topic in the current and previous documents, $m_{jk} + m_{j-1\,k}$, therefore the topic distribution in the current document is expected to be similar to the topic distribution in the previous document. The topic probability still depends on the number of tables serving this topic in the whole dataset, $m_{\cdot k}$, but this number is weighted by a non-negative value $\delta$, which is a parameter of the model. As in the previous case, it is possible to sample a new topic from the base measure $H$.
The generative process can then be formulated as follows. A new token comes to a document and, as before, chooses one of the occupied tables with a probability proportional to the number of words already assigned to it, or it starts a new table with a probability proportional to the parameter $\alpha$:
$$p(t_{ji} = t \mid t_{j1}, \dots, t_{j\,i-1}, \alpha) \propto \begin{cases} n_{jt}, & \text{if } t \text{ is an occupied table}, \\ \alpha, & \text{if } t \text{ is a new table}. \end{cases} \qquad (8)$$
If the token starts a new table, it chooses a topic for it. One of the used topics is chosen with a probability proportional to the sum of the number of tables having this topic in the current and previous documents, $m_{jk} + m_{j-1\,k}$, and the weighted number of tables among all the documents that serve this topic, $\delta m_{\cdot k}$. A new topic can be chosen for the table with a probability proportional to the parameter $\gamma$:
$$p(k_{jt} = k \mid k_{11}, \dots, k_{j\,t-1}, \gamma, \delta) \propto \begin{cases} m_{jk} + m_{j-1\,k} + \delta\, m_{\cdot k}, & k = 1, \dots, K, \\ \gamma, & k = K + 1 \text{ (a new topic)}. \end{cases} \qquad (9)$$
Finally, the word $x_{ji}$ is sampled for the token $i$ in the document $j$, assigned to the table $t_{ji}$, which serves the topic $k_{jt_{ji}}$. The word is sampled from the corresponding topic:
$$x_{ji} \mid t_{ji}, k_{jt_{ji}} \sim \mathrm{Mult}(\phi_{k_{jt_{ji}}}). \qquad (10)$$
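The topic choice for a new table in the dynamic model can be written as a small function. This is a sketch based only on the verbal description above; the argument names `delta` (weight of the dataset-wide table counts) and `gamma` (weight of a new topic) are assumed labels, and the counts in the usage example are made up.

```python
import numpy as np

def dynamic_topic_prior(m_cur, m_prev, m_all, delta, gamma):
    """Prior probabilities of the topic served at a new table.

    m_cur[k]  : tables serving topic k in the current document
    m_prev[k] : tables serving topic k in the previous document
    m_all[k]  : tables serving topic k in the whole dataset
    Returns K + 1 probabilities; the last entry corresponds to a new topic.
    """
    weights = (np.asarray(m_cur, dtype=float)
               + np.asarray(m_prev, dtype=float)
               + delta * np.asarray(m_all, dtype=float))
    weights = np.append(weights, gamma)
    return weights / weights.sum()

# Topic 0 is active in the current and previous documents, so it dominates
# even though all three topics are equally popular in the whole dataset.
probs = dynamic_topic_prior([1, 0, 0], [2, 1, 0], [10, 10, 10],
                            delta=0.1, gamma=1.0)
```

With `m_cur` and `m_prev` set to zero the expression reduces to the HDP prior with a rescaled weight for a new topic, which shows how `delta` controls the balance between local dynamics and global topic popularity.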
IV Inference
Standard inference algorithms process an entire dataset. For large or streaming datasets this batch setup is computationally intractable. Online algorithms process data sequentially, one data point at a time, incrementally updating the variables that correspond to the whole dataset. This saves memory and reduces the computational time. In this paper a combination of offline batch and online inference is proposed, and this section describes it in detail.
The Gibbs sampling scheme is used [20]. The inference procedure consists of two parts. First, the traditional batch setup of Gibbs sampling is applied to the training set of documents. Then an online setup of the inference is applied to the testing documents. This means that the information about a testing document is incrementally added to the model without the need to process the training documents again.
In the Gibbs sampling inference scheme the hidden variables, i.e. the table assignments $t_{ji}$ and the topic assignments $k_{jt}$, are sampled from their conditional distributions. In the Gibbs sampler for the HDP model exchangeability of documents and words is used by treating the current variable $t_{ji}$ as the table assignment for the last token in the last document and $k_{jt}$ as the topic assignment for the last table in the last document. There is no exchangeability of documents in the proposed model, but words inside a document are still exchangeable. Therefore, the variable $t_{ji}$ can be treated as the table assignment for the last token in the current document $j$, and the variable $k_{jt}$ can be treated as the topic assignment for the last table in the current document $j$. The documents are processed in the order they appear in the dataset.
The following notation is used below. Let denote the size of the words vocabulary, is the set of the table assignments for all the tokens in the documents from to . Let and denote the corresponding sets for the topic assignments and the observed data. Let denote the number of tables having the topic in the documents from to . Let also denote the words assigned to the table in the document .
Let denote the number of times the word is associated with the topic , denote the number of tokens associated with the topic : , regardless the word assignments. The notation is used for the number of times the word associated with the topic in the documents from to .
The superscript indicates the corresponding variable without considering the token in the document , e.g. the set variable or the count is the number of words, assigned the table in the document , excluding the word for the token . Similarly, the superscript means the corresponding variable without considering the table in the document .
IV-A Batch Gibbs sampling
IV-A1 Sampling topic assignment
The topic assignment for the table in the document is sampled from the conditional distribution given the observed data and all the other hidden variables, i.e. the table assignments for all the tokens and the topic assignments for all the other tables :
The likelihood term can be computed by integrating out the distribution :
where is the gamma-function. In the case when is a new topic () the integration is done over the prior distribution for . The obtained likelihood term (12) is then:
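For reference, integrating a multinomial topic out against a symmetric Dirichlet prior gives the standard Dirichlet-multinomial form, which can be evaluated with log-gamma functions. The sketch below shows that standard form; it is assumed, not confirmed, to correspond to the likelihood term discussed here, and the argument names are hypothetical.

```python
import math

def log_block_likelihood(block_counts, topic_counts, beta):
    """Log-likelihood of the words of one table under a topic, topic integrated out.

    block_counts[w] : occurrences of word w among the words of the table
    topic_counts[w] : occurrences of word w already associated with the topic,
                      excluding the words of the table
    beta            : parameter of the symmetric Dirichlet prior
    For a new topic, topic_counts is all zeros and the prior alone is used.
    """
    vocab_size = len(topic_counts)
    n_block = sum(block_counts)
    n_topic = sum(topic_counts)
    log_p = (math.lgamma(n_topic + vocab_size * beta)
             - math.lgamma(n_topic + n_block + vocab_size * beta))
    for c_block, c_topic in zip(block_counts, topic_counts):
        log_p += math.lgamma(c_topic + c_block + beta) - math.lgamma(c_topic + beta)
    return log_p
```

For a single word the expression reduces to the familiar ratio of the smoothed word count to the smoothed topic size, which is a convenient sanity check.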
The second multiplier in (11) can be further factorised as:
The first term in (14) is the probability of the topic assignments for all the tables in the next documents depending on the change of the topic assignment for the table in the document . Consider the topic assignments in the document firstly. From (9) it is:
where the sign of proportionality is used w.r.t. , is the set of the topics that firstly appear in the document , the superscript means that is set to for the corresponding counts, is the cardinality of the set. The similar probabilities of the topic assignments for all the next documents depend on only in the term . It is assumed that the influence of on these probabilities is not significant and the first term in (14) is approximated by the probability of the topic assignments in the document (15) only:
The second term in (14) is the prior for :
As a result, (14) is computed as follows:
IV-A2 Sampling table assignment
The table assignment for the token in the document is sampled from the conditional distribution given the observed data and all the other hidden variables, i.e. the topic assignments for all the tables and the table assignments for all the other tokens :
The first term in (IV-A2) is the likelihood of the word . It changes depending on whether is one of the previously used table or it is a new table. For the case when is the table which is already used the likelihood is:
Consider now the case when , i.e. the likelihood of the word being assigned to a new table. This likelihood can be found by integrating out the possible topic assignments for this table:
where is as (18).
The second term in (IV-A2) is the prior for :
Then the conditional distribution for sampling a table assignment is:
If a new table is sampled, then a topic for it is sampled from (19).
IV-B Online inference
In online or distributed implementations of inference algorithms in topic modeling the idea is to separate global variables, i.e. those that depend on the whole set of data, and local variables, i.e. those that depend only on the current document [21, 22, 23].
For the proposed dynamic HDP model the global variables are the distributions , which are approximated by the counts , and the global topic popularity, which is estimated by the counts . Note, that the relative relationship between counts is important, rather than the absolute values of the counts. The local variables are the topic mixture weights for each document, governed by the counts . The training dataset is assumed to be large enough such that the global variables are well estimated by the counts available during the training stage and a new document can only slightly change the obtained ratios of the counts.
Following this assumption, the learning procedure is organised as follows. The batch Gibbs sampler is run for the training set of the documents. After this training stage the global counts and for all and are stored and used for the online inference of the testing documents. For each testing document the online Gibbs sampler is run to sample table assignments and topic assignments for this document only. The online Gibbs sampler updates the local counts . After the Gibbs sampler converges, the global counts and are updated with the information obtained by the new document.
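The separation between stored global counts and per-document local counts can be sketched as a small container. The class and attribute names below are hypothetical, and the online Gibbs sampler itself is not shown; the sketch only illustrates the final step, in which the counts obtained for a processed testing document are added to the global ones.

```python
import numpy as np

class GlobalCounts:
    """Global statistics stored after the training stage (illustrative names)."""

    def __init__(self, word_topic, tables_per_topic):
        # (V, K) counts of words associated with topics
        self.word_topic = np.array(word_topic, dtype=float)
        # (K,) numbers of tables serving each topic in the whole dataset
        self.tables_per_topic = np.array(tables_per_topic, dtype=float)

    def absorb(self, doc_word_topic, doc_tables_per_topic):
        """Add the counts of one processed testing document.

        Topics created while processing the document extend the arrays.
        """
        doc_word_topic = np.asarray(doc_word_topic, dtype=float)
        doc_tables = np.asarray(doc_tables_per_topic, dtype=float)
        extra = doc_word_topic.shape[1] - self.word_topic.shape[1]
        if extra > 0:
            self.word_topic = np.pad(self.word_topic, ((0, 0), (0, extra)))
            self.tables_per_topic = np.pad(self.tables_per_topic, (0, extra))
        k = doc_word_topic.shape[1]
        self.word_topic[:, :k] += doc_word_topic
        self.tables_per_topic[:k] += doc_tables
```

Because the training documents are summarised entirely by these arrays, processing a new document does not require revisiting them.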
The equations for the online version of the Gibbs sampler slightly differ from the batch ones (19) and (24). Namely, the conditional probability in the topic assignment sampling distribution (19) differs from (14). As next documents are not observed during processing the current document, this probability consists only of the prior term :
Substituting this expression into (19) the obtained sampling distribution for the topic assignment in the online Gibbs sampler is:
The sampling distribution for the table assignment in the online Gibbs sampler remains the same as in the batch version (24).
V Abnormality detection
Topic models provide a probabilistic framework for abnormality detection. Under this framework the abnormality measure is the likelihood of the data. A low value of the likelihood means that the built model cannot explain the current observation, i.e. there is something atypical in the observation that does not fit the typical motion patterns learnt by the model.
From the Gibbs sampler we have estimates of the distributions and posterior samples of the table and topic assignments. This information can be used to estimate the predictive likelihood of a new clip. The predictive likelihood, normalised by the length of the clip in terms of visual words, is used as an abnormality measure in this paper.
The predictive likelihood is estimated via a harmonic mean [24], as this makes it possible to use the information from the posterior samples:
where is the number of the posterior samples, and are from the -th posterior sample obtained by the Gibbs sampler, and
The superscript on the counts means these counts are from the -th posterior sample.
The abnormality detection procedure is then as follows. The batch Gibbs sampler is run on the training dataset. Then for each clip from the testing dataset first the online Gibbs sampler is run to obtain the posterior samples of the hidden variables corresponding to the current clip. Afterwards the abnormality measure:
is computed for the current clip. If the abnormality measure is below some threshold, the clip is labelled as abnormal, otherwise as normal. Then the next clip from the testing dataset is processed.
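A harmonic mean of per-sample likelihoods is best computed in the log domain to avoid underflow. The sketch below takes the log-likelihoods of a clip under each posterior sample as given. Dividing the resulting log-likelihood by the number of words is one possible length normalisation and is an assumption of this sketch, as are the function names.

```python
import numpy as np

def log_harmonic_mean(log_likelihoods):
    """Log of the harmonic mean of likelihoods given their logarithms.

    log HM = log S - logsumexp(-log_likelihoods), where S is the sample count.
    """
    ll = np.asarray(log_likelihoods, dtype=float)
    neg = -ll
    shift = neg.max()
    log_sum = shift + np.log(np.exp(neg - shift).sum())
    return np.log(len(ll)) - log_sum

def abnormality_measure(log_likelihoods, n_words):
    """Harmonic-mean log-likelihood normalised by the clip length (assumed form)."""
    return log_harmonic_mean(log_likelihoods) / n_words

def is_abnormal(measure, threshold):
    """A clip is flagged when its measure falls below the threshold."""
    return measure < threshold
```

The threshold is application dependent; sweeping it over all observed values of the measure produces the ROC curve used in the experiments.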
VI Experiments
In this section the proposed method is applied to abnormality detection (the code is available at https://github.com/OlgaIsupova/dynamic-hdp). The method is compared with the one based on the HDP topic model, for which the online version of the Gibbs sampler and the abnormality measure are derived similarly to the dynamic HDP. For the batch Gibbs sampler of the HDP topic model the implementation by Chong Wang is used (available at https://github.com/Blei-Lab/hdp). Each of the algorithms has 5 runs with different initialisations to obtain 5 independent posterior samples. Both batch and online samplers are run for 1000 “burn-in” iterations.
The methods are compared on both synthetic and real data. The abnormality classification accuracy is used for the quantitative comparison of the methods. Computing the classification accuracy requires the ground truth about abnormality. For the synthetic data the ground truth is known from the generation; for the real test data the clips are labelled manually as normal or abnormal. Note that the methods use only unlabelled data; the labels are used solely for performance evaluation.
In statistics the following quantities are used for binary classification: true positive (TP) is the number of observations which are correctly detected by an algorithm as positive, false negative (FN) is the number of observations which are incorrectly detected as negative, true negative (TN) is the number of observations which are correctly detected as negative, and false positive (FP) is the number of observations which are incorrectly detected as positive [25].
For the quantitative comparison the area under the curve (AUC) of the receiver operating characteristic (ROC) is used in this paper. The curve is built by plotting the true positive rate versus the false positive rate while the threshold varies. The true positive rate (TPR), also known as recall, is defined as:
$$\mathrm{TPR} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}.$$
The false positive rate (FPR), also known as fall-out, is defined as:
$$\mathrm{FPR} = \frac{\mathrm{FP}}{\mathrm{FP} + \mathrm{TN}}.$$
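The ROC construction can be made concrete with a short sketch. It treats "abnormal" as the positive class and flags a clip when its abnormality measure is below the threshold, as in the decision rule of Section V. The function names are illustrative, and both classes are assumed to be present in the labels.

```python
import numpy as np

def roc_curve_points(measure, abnormal_labels):
    """Return (FPR, TPR) arrays obtained by sweeping the decision threshold."""
    measure = np.asarray(measure, dtype=float)
    labels = np.asarray(abnormal_labels, dtype=bool)
    thresholds = np.concatenate(([-np.inf], np.unique(measure), [np.inf]))
    fpr, tpr = [], []
    for th in thresholds:
        predicted = measure < th           # low measure means abnormal
        tp = np.sum(predicted & labels)
        fp = np.sum(predicted & ~labels)
        fn = np.sum(~predicted & labels)
        tn = np.sum(~predicted & ~labels)
        tpr.append(tp / (tp + fn))
        fpr.append(fp / (fp + tn))
    return np.array(fpr), np.array(tpr)

def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))
```

Because the thresholds are visited in increasing order, both rates are non-decreasing and the points can be integrated directly without sorting.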
VI-A Synthetic data
The popular “bar” data (introduced in [24]) is used as synthetic data. In this data the vocabulary consists of words, organised into a matrix. There are topics in total; the word distributions of these topics form vertical and horizontal bars in the matrix (Figure 2).
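The bar topics can be generated with a few lines of code. In this sketch the grid side is a free parameter and the value 5 in the usage line is purely illustrative; each topic is assumed to be uniform over one row or one column of the grid, which gives twice as many topics as the grid side.

```python
import numpy as np

def bar_topics(side):
    """Return a (2 * side, side * side) array of bar-shaped word distributions.

    The first `side` topics are horizontal bars (rows of the grid) and the
    remaining ones are vertical bars (columns of the grid).
    """
    topics = np.zeros((2 * side, side, side))
    for i in range(side):
        topics[i, i, :] = 1.0 / side          # horizontal bar
        topics[side + i, :, i] = 1.0 / side   # vertical bar
    return topics.reshape(2 * side, side * side)

# Hypothetical example: a 5 x 5 vocabulary grid gives 10 topics over 25 words.
example_topics = bar_topics(5)
```

Reshaping a row of the result back to a square grid reproduces the bar pattern, which makes it easy to inspect visually whether a sampler has recovered the topics.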
The training dataset consisting of documents is generated from the proposed model (8) – (10), where noise is added to the distributions . Each of the documents has words. The hyperparameters are set to the following values for the generation: , , .
Similarly, the testing dataset consisting of documents is generated, but random documents are generated as “abnormal”. In the proposed model it is assumed that topic mixtures in neighbouring documents are similar. Contrary to this assumption, topics for an abnormal document are chosen uniformly from the set of all the topics except those used in the previous document.
Both algorithms are run on these datasets, computing the abnormality measure for all the testing documents. The hyperparameters , , are set to the same values as for the generation ( is not used in the generation as the word distributions in topics are set manually).
Figure 3 presents the ROC-curves for the obtained abnormality measures, together with the ROC-curve for the “true” abnormality measure. The “true” abnormality measure is computed using the likelihood given the true distributions $\phi_k$ and the true table and topic assignments $t_{ji}$ and $k_{jt}$, i.e. it corresponds to a model that can perfectly recover the latent variables. Table I contains the obtained AUC values.
Table I: AUC results
| Dataset | Dynamic HDP | HDP | “True” model |
The results show that the proposed dynamic HDP can detect the simulated abnormalities and that its performance is competitive with the “true” model. The original HDP method is not expected to detect this kind of abnormality, as it does not contradict the HDP generative model; this is confirmed by the experimental results.
VI-B Real data
The algorithms are applied to the QMUL-junction real data . This is a 45-minute video capturing a road junction (Figure 4a). The frame size is . The -pixel grid cells are used for spatial averaging of the optical flow. For the optical flow estimation the sparse pyramidal version of the Lucas-Kanade optical flow algorithm is used [26] (an implementation is available in the OpenCV library). The resulting vocabulary size is . Non-overlapping clips, -second length, are treated as visual documents. A 5-minute video sequence is used as a training dataset.
The algorithms are run with the following hyperparameters: , , . The weight parameter for the dynamic HDP is set to .
The data is manually labelled as normal/abnormal to measure classification accuracy, where examples of abnormal events are jay-walking (Figure 4b), driving in the wrong direction (Figure 4c) and disruption of the traffic flow (Figure 4d).
The ROC-curves for the methods are presented in Figure 5. The corresponding AUC values can be found in Table I. The proposed dynamic HDP method outperforms the other one. The provided results show that consideration of dynamics in a topic model may improve the classification results in abnormality detection.
VII Conclusions
In this paper a novel Bayesian nonparametric dynamic topic model, denoted as the dynamic HDP, is proposed. The Gibbs sampling scheme is applied for inference. An online setup for the inference is designed, which allows the model to be trained incrementally when the data is processed sequentially. The model is applied to abnormal behaviour detection in video. The abnormality decision rule is based on the predictive likelihood of the data, an estimate of which is developed in this paper. We compare the proposed dynamic HDP method with the method based on the HDP, introduced in [18], and show that the method based on the dynamic topic model improves the classification performance in comparison with the method based on the model without dynamics. The experiments on both synthetic and real data confirm the superiority of the proposed method.
Acknowledgments
Olga Isupova and Lyudmila Mihaylova acknowledge the support of the EC Seventh Framework Programme [FP7 2013-2017] TRAcking in compleX sensor systems (TRAX), Grant agreement no. 607400. Lyudmila Mihaylova also acknowledges the support of the UK Engineering and Physical Sciences Research Council (EPSRC) via the Bayesian Tracking and Reasoning over Time (BTaRoT) grant EP/K021516/1.
References
[1] H. Jeong, Y. Yoo, K. M. Yi, and J. Y. Choi, “Two-stage online inference model for traffic pattern analysis and anomaly detection,” Machine Vision and Applications, vol. 25, no. 6, pp. 1501–1517, 2014.
[2] J. Varadarajan and J. Odobez, “Topic models for scene analysis and abnormality detection,” in Proceedings of the 2009 IEEE 12th International Conference on Computer Vision Workshops (ICCV Workshops), Sept 2009, pp. 1338–1345.
[3] R. Mehran, A. Oyama, and M. Shah, “Abnormal crowd behavior detection using social force model,” in Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009, pp. 935–942.
[4] T. Hofmann, “Probabilistic latent semantic indexing,” in Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ser. SIGIR ’99. New York, NY, USA: ACM, 1999, pp. 50–57.
[5] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent Dirichlet allocation,” J. Mach. Learn. Res., vol. 3, pp. 993–1022, Mar. 2003.
[6] T. Hospedales, S. Gong, and T. Xiang, “Video behaviour mining using a dynamic topic model,” International Journal of Computer Vision, vol. 98, no. 3, pp. 303–323, 2012.
[7] D. Kuettel, M. Breitenstein, L. Van Gool, and V. Ferrari, “What’s going on? Discovering spatio-temporal dependencies in dynamic scenes,” in Proceedings of the 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2010, pp. 1951–1958.
[8] I. Pruteanu-Malinici, L. Ren, J. Paisley, E. Wang, and L. Carin, “Hierarchical Bayesian modeling of topics in time-stamped documents,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 6, pp. 996–1011, June 2010.
[9] C. Wang, D. Blei, and D. Heckerman, “Continuous time dynamic topic models,” in Proceedings of the Twenty-Fourth Conference Annual Conference on Uncertainty in Artificial Intelligence (UAI-08). Corvallis, Oregon: AUAI Press, 2008, pp. 579–586.
[10] X. Fu, J. Li, K. Yang, L. Cui, and L. Yang, “Dynamic online HDP model for discovering evolutionary topics from Chinese social texts,” Neurocomputing, vol. 171, pp. 412–424, 2016.
[11] C. Chen, N. Ding, and W. Buntine, “Dependent hierarchical normalized random measures for dynamic topic modeling,” in Proceedings of the 29th International Conference on Machine Learning (ICML-12), ser. ICML ’12, J. Langford and J. Pineau, Eds. New York, NY, USA: Omnipress, July 2012, pp. 895–902.
[12] D. M. Blei and J. D. Lafferty, “Dynamic topic models,” in Proceedings of the 23rd International Conference on Machine Learning, ser. ICML ’06. New York, NY, USA: ACM, 2006, pp. 113–120.
[13] A. Ahmed and E. Xing, “Timeline: A dynamic hierarchical Dirichlet process model for recovering birth/death and evolution of topics in text stream,” in Proceedings of the Twenty-Sixth Conference Annual Conference on Uncertainty in Artificial Intelligence (UAI-10). Corvallis, Oregon: AUAI Press, 2010, pp. 20–29.
[14] P. Orbanz and Y. W. Teh, “Bayesian nonparametric models,” in Encyclopedia of Machine Learning. Springer, 2011, pp. 81–89.
[15] B. K. Horn and B. G. Schunck, “Determining optical flow,” Artificial Intelligence, vol. 17, pp. 185–203, 1981.
[16] X. Wang and X. Ma, “Unsupervised activity perception in crowded and complicated scenes using hierarchical Bayesian models,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 3, pp. 539–555, 2009.
[17] O. Isupova, L. Mihaylova, D. Kuzin, G. Markarian, and F. Septier, “An expectation maximisation algorithm for behaviour analysis in video,” in Proceedings of the 18th International Conference on Information Fusion (Fusion) 2015, July 2015, pp. 126–133.
[18] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei, “Hierarchical Dirichlet processes,” Journal of the American Statistical Association, vol. 101, no. 476, pp. 1566–1581, 2006.
[19] T. S. Ferguson, “A Bayesian analysis of some nonparametric problems,” The Annals of Statistics, vol. 1, no. 2, pp. 209–230, 1973.
[20] S. Geman and D. Geman, “Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images,” IEEE Transactions on Pattern Analysis and Machine Intelligence, no. 6, pp. 721–741, 1984.
[21] K. Vorontsov, O. Frei, M. Apishev, P. Romov, and M. Dudarenko, “BigARTM: Open source library for regularized multimodal topic modeling of large collections,” in Analysis of Images, Social Networks and Texts. Springer, 2015, pp. 370–381.
[22] P. Smyth, M. Welling, and A. U. Asuncion, “Asynchronous distributed learning of topic models,” in Advances in Neural Information Processing Systems, 2009, pp. 81–88.
[23] C. Wang, J. W. Paisley, and D. M. Blei, “Online variational inference for the hierarchical Dirichlet process,” in AISTATS, vol. 2, no. 3, 2011, p. 4.
[24] T. L. Griffiths and M. Steyvers, “Finding scientific topics,” Proceedings of the National Academy of Sciences, vol. 101, no. 1, pp. 5228–5235, 2004.
[25] K. P. Murphy, Machine learning: a probabilistic perspective. MIT Press, 2012.
[26] J.-Y. Bouguet, “Pyramidal implementation of the affine Lucas Kanade feature tracker description of the algorithm,” Intel Corporation, vol. 5, no. 1-10, p. 4, 2001.