State Aggregation Learning from Markov Transition Data
Abstract
State aggregation is a model reduction method rooted in control theory and reinforcement learning. It reduces the complexity of engineering systems by mapping the system’s states into a small number of metastates. In this paper, we study the unsupervised estimation of unknown state aggregation structures based on Markov trajectories. We formulate the state aggregation of Markov processes into a nonnegative factorization model, where left and right factor matrices correspond to aggregation and disaggregation distributions respectively. By leveraging techniques developed in the context of topic modeling, we propose an efficient polynomialtime algorithm for computing the estimated state aggregation model. Under some “anchor state” assumption, we show that one can reliably recover the state aggregation structure from sample transitions with high probability. Sharp divergence error bounds are proved for the estimated aggregation and disaggregation distributions, and experiments with Manhattan traffic data are provided.
1 Introduction
State aggregation is a longexisting approach for model reduction of complicated systems. It is widely used as a heuristic to reduce the complexity of control systems and reinforcement learning. The earliest idea of state aggregation is to aggregate “similar” states into a small number of subsets through a partition map. However, the metastates and the partition map are often handpicked by practitioners based on domainspecific knowledge rogers1991aggregation (); bertsekas1995neuro (). Alternatively, the partition can be chosen via discretization in accordance with some priorly known similarity metric or feature functions tsitsiklis1996feature (). More generally speaking, state aggregation also allows one to map states into metastates through some probabilistic relation. Such models have been used for modeling large Markov decision processes, where each observed state is represented as a mixture over latent states singh1995reinforcement (); bertsekas1995neuro (). This model is known as soft state aggregation, where states and metastates are mapped to each other via aggregation distributions and disaggregation distributions. Experiences and theoretical evidences suggest that the knowledge of a state aggregation model significantly reduces the complexity of solving the Markov decision problem singh1995reinforcement ().
Unfortunately, existing approaches that use the idea of state aggregation for optimal control and reinforcement learning typically require prior knowledges about the exact system dynamics and/or the state aggregation structures. Neither is available in emerging reinforcement learning applications that involve complicated blackbox systems, like autonomous driving, computer games and cell evolution. In these cases, little prior structure is known about the underlying process, while a substantial amount of transition data are available. There lacks a principled approach that infers the state aggregation structure of a blackbox system from unsupervised data.
In this paper, we study the estimation of a state aggregation model from sample transition data. We hope to develop an unsupervised state aggregation approach that requires minimal prior knowledge. Suppose that we observe a Markov trajectory generated by a statetransition system. In the case of hard state aggregation, the goal would be to find a partition mapping such that . More generally, we consider the soft aggregation structure such that every state can be represented using a membership model over a number of metastates. To this end, we hope to develop an efficient statistical procedure to estimate and compute the state aggregation model based on trajectories of the Markov chain. The soft state aggregation structure of a Markov chain can be formulated into a nonnegative factorization of the transition probability operator. Experience suggests that spectral decomposition is effective for analysis and state compression of Markov trajectories zhang2018spectral (). However, plain spectral decomposition does not recover the latent states, aggregation or disaggregation distributions. It remains open how to identify the state aggregation model effectively from data.
Inspired by a technique used in topic modeling TopicSCORE (), we develop an efficient method to solve this open problem. We introduce a notion of “anchor state”, which is an analog of anchor words, to uniquely identify metastates of the Markov chain. Based on existence of anchor states, our method takes as input sample state transitions and performs a multistep SVD approach, where the key is to properly normalize the data and rotate the singular vectors into the nonnegative cone by a simplexhunting algorithm. The proposed method outputs the estimated state aggregation model, including the metastates, aggregation/disaggregation distributions, and sets of anchor states that represent each metastate. Numerical experiments are performed on simulated random walks and a Manhattan taxitrip data set, in which we identified leading modes of the taxi traffic as well as anchor regions in the city. On the theoretical side, we prove highprobability error bounds for the total variation divergence between the estimated distributions and the ground truth. To the authors’ best knowledge, this is the first statisticallyproven approach that uncovers unknown state aggregation structure from time series data.
2 Related Literatures
State aggregation is one of the earliest model reduction methods that were used to approximate known control systems with smaller ones. It also applies to reinforcement learning where the underlying transition probabilities are not known a priori. State aggregation reduces the complexity of the state space and therefore reduces computational costs for approximating the optimal value function or policy; see e.g., moore1991variable (); bertsekas1995neuro (); singh1995reinforcement (); tsitsiklis1996feature (); ren2002state (). Similar to but more general than the state aggregation approach, a line of works, known as representation learning, construct basis functions for representing highdimensional value functions, for which many methods have been developed; e.g. johns2007constructing (); mahadevan2005proto (); parr2007analyzing (); petrik2007analysis (). mahadevan2009learning (). The aforementioned methods typically require prior knowledge about structures of the problem, lacking statistical guarantees.
Two recent papers studied the statistical model reduction of Markov time series. The paper zhang2018spectral () proposed a class of spectral state compression methods of Markov process based on singular value thresholding of empirical transition matrices. It gives statistical error bounds for estimating the transition matrix and recovering the principal subspace associated with the transition matrix. Another related paper li2018estimation () studied the estimation of Markov transition matrix using maximum likelihood estimation with a lowrank constraint. Both zhang2018spectral () and li2018estimation () focus on estimation of a lowrank Markov model. However, their results do not apply to recovery of state aggregation model which corresponds to a particular nonnegative factorization model. In particular, the spectral methods cannot recover the aggregation and disaggregation distributions.
Estimating the state aggregation structure is essentially a problem of nonnegative matrix factorization (NMF) lee1999learning () under noise corruption. NMF is widely applied and many methods exist (see gillis2014and () for a comprehensive overview). However, many classical methods for NMF only perform well with independent and homoscedastic noise. The empirical transition matrix generated by a Markov chain does not satisfy these conditions. Fortunately, we discover that state aggregation is closely related to topic modeling. They share similar nonnegative factorization structure and similar noise structures. In particular, empirical transition distribution conditioned at a state is analogous to the bagofwords representation of a “document”, each outgoing state can be viewed as a “dictionary word”, and a metastate is analogous to a “topic”. This inspires us to borrow ideas from topic modeling to develop tractable methods for estimating state aggregation.
In particular, we find that the state aggregation model is strongly related to topic models with an “anchor word” assumption Ge (); TopicSCORE (). The anchor word assumption has a natural analog in state aggregation models, which we call anchor state assumption. An anchor state is visited by only one metastate with a nonzero probability. The “anchor state” assumption is meaningful in applications: In Manhattan taxitrip data, we have successfully identified a number of “anchor states”, each representing a location in the city that reflects the leading traffic modes hidden in the taxitrip data.
The proposed method is adapted from the topic modeling method in TopicSCORE (). It is able to identify the aggregation/disaggregation distributions by leveraging the technique in TopicSCORE (), where, in the context of a topic model with anchor words, they developed a principled way to find vertices corresponding to the singular vectors. Our method is customized to analyzing Markov transition data for the state aggregation problem. It uses a normalization technique, which is called Jin’s SCORE SCORE () and proved very useful in network data and text data analysis mixedSCORE (); TopicSCORE (). The normalization scheme is particularly crafted to address the dependence issue in Markov data. Our method can compute estimates of both aggregation and disaggregation distributions, while the topic modeling method in TopicSCORE () can recover only the right factors. To the best knowledge of the authors, this is the first work that develops a tractable method for estimating the metastates, aggregation and disaggregation distributions with statistical guarantees.
3 The State Aggregation Model
In this section, we introduce the model of soft state aggregation and the notion of anchor states. We also discuss its relations with topic model and other latentvariable models for Markov chains.
Assumption 1 (Soft State Aggregation).
Let be a positive integer. The Markov chain admits a soft state aggregation, i.e., there exist random variables such that
for all with probability , where and are independent of and referred to as the aggregation distributions and disaggregation distributions, respectively.
The soft state aggregation model has been discussed in various literatures like singh1995reinforcement (), zhang2018spectral (). See bertsekas1995dynamic () Section 6.3.7 for a textbook review. Admitting a soft state aggregation is equivalent to a nonnegative factorization of the transition matrix. A Markov chain satisfy the soft state aggregation with metastates if and only if its transition matrix can be written in the form of
(1) 
where satisfy that and . This decomposition means that one can map the states into metastates while preserving the system’s dynamics. It also suggests that the transition dynamics is the sum of leading modes, each mode corresponding to a rank1 matrix. The smallest integer such that the nonnegative factorization (1) holds is called the nonnegative rank of . Throughout the paper, we assume that the nonnegative rank of , which equals to the smallest number such that Assumption 1 holds, is a known constant and does not scale with the number of states . Note that the nonnegative factorization (1) is a stronger model than spectral decomposition. As a result, the spectral decomposition method proposed in zhang2018spectral () does not apply to estimation of the state aggregation model.
We hope to estimate the state aggregation structure suggested in Assumption 1 and identify the metastates . However, the latent process is not necessarily identifiable. In fact, there exist infinitely many ways to specify metastates and aggregation distributions to represent the same Markov process. In order to find a meaningful state aggregation structure, we require that there exist at least one “anchor state” that specifies each metastate.
Assumption 2 (Anchor State Condition).
For every metastate , there exists an anchor state and a nonnegative decomposition such that , , and for all , where denote columns of .
The anchor state condition is a natural analog of the anchor word condition for topic modeling Ge () and the separability condition for nonnegative matrix factorization donoho2004does (). Under additional technical conditions, the anchor state condition would ensure unique identifiability of the metastates and the nonnegative matrix factorization donoho2004does (). Notably, the state aggregation model with anchor state condition is related to several models, which we detail below:

Relation to Topic Model. The topic model is a statistical model for the generating process of text documents. Given a dictionary of distinct words, there exist unknown topics, each corresponding to a probability mass function over the dictionary, denoted by where . Each document admits a mixed membership over the topics, and words in document are drawn from the dictionary using the probability mass function . Let be the matrix, each row of which contains the expected word distribution in a document. Then, admits a nonnegative factorization:
where is called the topic matrix and is called the weight matrix. The state aggregation model has a natural analog to topic models: the empirical transitions conditioned on each state can be thought as a document, and and are analogous to the weight matrix and topic matrix, respectively. The state aggregation model differs from topic models in the sense that both documents and words correspond to the same set of states and the factor matrices may admit special structures particular to the Markov system.

Relation to Hard Aggregation. A Markov chain admits a hard aggregation if there exists a partition of the state space such that for all . This suggests that the transition matrix has a nonnegative decomposition where is a blockwiseconstant matrix and each column of it is an indicator function of one of the subsets (Prop. 3 of zhang2018spectral ()). Admitting a hard state aggregation model does not necessarily imply the existence of anchor state for every metastate.

Relation to Lumpable Markov Models. A Markov process is called lumpable if the state space can be partitioned into blocks while still preserving the strong Markov property. However, a lumpable Markov chain is not necessarily lowrank (Prop. 4 of zhang2018spectral ()), as there may be local dynamics within the blocks leading to a fullrank transition matrix. A particularly interesting case is when a Markov chain is both aggregatable and lumpable, which happens if and only if
where columns of are indicator functions of the blocks, is a stochastic matrix governing the blocktoblock transitions, and is a diagonal matrix with block sizes along its diagonal. Lumpability implies some form of symmetry between left and right singular spaces of . When an aggregatable Markov chain is also lumpable, the transition matrix can be decomposed into the form by letting and . In this case, we can verify that the anchor state condition holds, therefore it is a special case that falls under our assumptions.
4 State Aggregation Learning
We develop a spectral state aggregation method for analyzing Markov transition data, by adapting a technique in TopicSCORE () which was developed for topic modeling. Suppose that we are given sample pairs of state transitions , where follows the transition kernel . We formulate a matrix of empirical statetransition counts , where . We will use the count matrix as the input of our estimation procedure, which is detailedly described in Algorithm 1.
In what follows we explain the intuition of Algorithm 1 and assume for simplicity that are independently generated from some distribution, so that rows of become independent. Given the count matrix , the first step is to conduct a columnwise normalization of :
(2) 
This normalization helps reduce noise heteroscedasticity and dependence. Roughly speaking, each row of has a multinomial distribution, as a result,
Hence, the normalization step (2) makes the “covariance” matrix of approximately an identity matrix.
The state aggregation model implies
where , . It inspires the following steps of estimating :

Estimate the column space of by the leading right singular vectors of .

Recover from its column space using the nonnegative constraint.

Estimate from the estimate of .
The first and third steps are trivial, so what remains is how to recover from its column space.
Borrowing the idea of TopicSCORE (), we apply a normalization scheme, known as SCORE SCORE (), on singular vectors of . Let be the right singular vectors of . The SCORE normalization divides each of by the leading singular vector in an entrywise fashion and obtains a matrix
(3) 
Under mild conditions, is a positive vector so is welldefined. It turns out that rows of exhibits a simplex geometry:

There exits a lowdimensional simplex with vertices such that all rows of are approximately contained in this simplex.

When each metastate has an anchor state, each vertex of the above simplex has a row of located around it. Therefore, we can use the point cloud of rows of to estimate these vertices.

With knowledge of simplex vertices, one can directly transform the singular vectors into the nonnegative cone and get an estimate of .
Algorithm 1 implemented these ideas and customize them to work for the state aggregation learning problem. We refer the readers to the appendix for a detailed explanation of the proposed method.
Input: empirical statetransition counts , number of metastates , a tuning integer
Output: the estimate and

Vertex hunting. Apply a means clustering to rows of with the number of clusters equal to . Let be the resulting means cluster centers. Search among all simplexes whose vertices are selected from and find the simplex such that is minimized, where denotes the projection to . Let be the vertices of this simplex.

For , compute
where is the standard simplex. Let be the matrix whose th row is .

Renormalize each column of the matrix
by its norm. The resulting matrix is .

Estimate the aggregation matrix by
where is the empirical transition probability matrix.
In Step 5, the estimated is not guaranteed to be a nonnegative matrix. There is an alternative way to Step 5: For each , we project the jth row of to the convex hull of columns of , where the projected vector is uniquely expressed as a convex combination of columns of with a combination coefficient vector denoted as . We use to estimate the th row of . Numerical experiments suggest that both subroutines for estimating work equally well. When take known constant values, the overall method has a runtime polynomial in .
5 Main Statistical Results
In this section we establish the main results on statistical upper bounds for estimating the aggregation and disaggregation matrices .
First, we present a statistical error bound for recovering the disaggregation distributions (columns of ):
Theorem 1 (Statistical error bound for ).
Consider a state Markov chain with transition matrix that satisfies Assumptions 1 and 2 with a fixed . Let sample transitions be generated independently from the stationary distribution . Fix positive constants . Suppose and is an irreducible matrix with . Suppose the first right singular vector of satisfies that the ratio between its maximum and minimum entries is bounded by . Suppose satisfies . When and , with probability , the estimate given by Algorithm 1 satisfies
Next, we present a statistical error bound for recovering the disaggregation distributions (rows of ):
Theorem 2 (Statistical error bounds for ).
Theorems 1 and 2 provide a totalvariance estimation error bound that applies uniformly to every individual aggregation distribution and every individual disaggregation distribution.
In what follows, we explain the technical conditions required by Theorems 1,2 and their theoretical implications.
Remark (About regularity conditions). Our theoretical bound requires a few regularity conditions, but they are rather mild:

The requirement that is irreducible means it cannot be converted to a blockwise diagonal matrix via simultaneous row/column permutation. The case that has a blockwise diagonal structure rarely happens. Note that hard state aggregation only implies has a block structure but not . For lumpable Markov models, as long as the latent Markov process is irreducible, this condition is also satisfied.

The condition of is mild because is a lowrank matrix and its nonzero eigenvalues are typically at the constant order.

The condition on the first right singular vector of is a form of incoherence and is commonly used in the literature of matrix completion and spectral analysis. We note that all coordinates of the first right singular are strictly positive, as a result of the irreducibility of .

The condition will be satisfied if every state can be reached sufficiently often (at least probability) in one step conditioned on some metastate.
Informally speaking, they require some form of irreducibility and reachability as well as that aggregation/disaggregation distributions be sufficiently independent so one can separate them.
Remark (Comparison with existing bounds). Existing works on estimating lowrank Markov models relied on spectral decomposition. They only gave error bounds for either estimating or estimating the principal subspace of zhang2018spectral (); li2018estimation (). However, zhang2018spectral (); li2018estimation () did not account for the nonnegative aspect of the factorization, thus they cannot recover any state aggregation structure. To our best knowledge, Theorems 1 and 2 give the first direct error bounds for estimating individual aggregation and disaggregation distributions.
Remark (Optimality of the error bounds). Consider the special case where , so the Markov chain becomes repeatedly and independently sampling from a fixed state distribution, which equals to the single column of . In this case, the leading term of the error bound given by Theorem 1 is . It matches the minimax total variance divergence (up to a logarithmic factor) for estimating a state discrete distribution han2015minimax (). Using a similar argument, we can see that the leading error bound given by Theorem 2 is also nonimprovable for estimating aggregation distributions, because there are aggregation distributions (with a support size which is treated as a constant) and each distribution is sampled for roughly times. Therefore our error bounds are optimal in their dependence on and , as long as is sufficiently large.
Remark. The current results left open a few issues. First, we assumed for simplicity that the number of metastates is a constant value, therefore our error bounds depend only on and . It remains to analyze how the recovery error scales as varies. Second, we have assumed for simplicity that are generated independently from the stationary distribution. It remains to analyze how would the error change if the sample transitions are drawn from a long state trajectory. In this case, one needs to analyze the mixing properties of the Markov chain. These issues are left for future research.
6 Numerical Experiments
Finally we experiment the proposed state aggregation procedure on both simulated and real data.
6.1 Simulation Results
First we test our new approach on simulated sample transitions. For a state Markov chain with meta states, we first randomly create two matrices such that each meta state has the same number of anchor states. After assembling a transition matrix , we generate random walk data as input of Algorithm 1.
We carry out experiments with , and the number of anchor states equal to for each meta state. The results with varying sample size are plotted in Figure 2. We observe that, when there are more anchor states, Algorithm 1 performs slightly better at recovering columns of . In the loglog plot, the error in scales linearly with . By fitting a curve, we see that the error curve (when there are anchor states) approximately satisfies . This validates our error upper bound in Theorem 1.
We next compare Algorithm 1 with the spectral state compression method proposed in zhang2018spectral (), in terms of recovering the transition matrix . Let be our estimator, and let be obtained using the competing method (computed by singular value thresholding). We measure the recovery error by using an averagedperrow total variation divergence. Numerical results are plotted in Figure 2. We see that our proposed method consistently outperforms the competing estimator.
6.2 Experiments with NYC Taxi Data
We analyze a data set of NYC yellow cab trips that were collected in January 2016 NYCyellowcabJan2016 (). We treat each taxi trip as a sample transition generated by a citewide Markov process over the New York city, where the transition is from a pickup location to some dropoff location. We discretize the map into cells so that the Markov process becomes a finitestate one.
We apply the proposed state aggregation method to the taxitrip data, and we test various sizes of the metastates, i.e., . For each vertex produced in the algorithm, there exists a cluster of anchor states around it that represent a same metastate. In order to connect these anchor states with a region in the city, we select points that are closest to the vertex, and locate the corresponding cells on the map. The convex hull of the cells forms an “anchor region”, which is plotted in Figure 5.
Let , be the estimated aggregation and disaggregation matrices. We use heat maps to visualize the columns of . Take for example. We pick two metastates, with anchor states in the downtown and midtown areas, respectively. We plot in Figure 5 the corresponding columns of and . Recall that the columns of are the disaggregation distributions, and columns of can be thought of as some likelihood function for transitioning to corresponding metastates. The heat maps can be viewed as some leading “modes” in the trafficdynamics, corresponding to anchor rigions in the downtown and midtown areas, respectively.
Furthermore, we analyze the estimated aggregation and disaggregation distributions to find a partition of the Manhattan city. Recall that in Algorithm 1, each state is projected onto a simplex, which can be represented as a convex combination of simplex’s vertices. For each state, we assign the state to a cluster that corresponds to largest value in the weights of convex combination. In this way, we can cluster the 1922 locations into a small number of regions. The partition results are shown in Figure 5, where anchor regions are marked in each cluster.
7 Summary
This paper proposes a spectralbased method for uncovering the state aggregation structure from Markov transition data. We establish sharp statistical error bound for each aggregation/disaggregation distribution. While our current results apply only to finitestate Markov chains, we hope they provide an initial attempt towards developing more powerful tools for statistical system identification. We also hope this paper would point out an intriguing connection between state aggregation and topic modeling, which might inspire more work in this area and motivate the developments of novel statistical methods for modeling complex systems.
References
 [1] NYC Taxi and Limousine Commission (TLC) trip record data. http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml. Accessed June 11, 2018.
 [2] Sanjeev Arora, Rong Ge, and Ankur Moitra. Learning topic models–going beyond SVD. In Foundations of Computer Science (FOCS), pages 1–10, 2012.
 [3] Dimitri P Bertsekas. Dynamic programming and optimal control. Athena scientific Belmont, MA, 2007.
 [4] Dimitri P Bertsekas and John N Tsitsiklis. Neurodynamic programming. Athena Scientific, Belmont, MA, 1996.
 [5] David Donoho and Victoria Stodden. When does nonnegative matrix factorization give a correct decomposition into parts? In Advances in neural information processing systems, pages 1141–1148, 2004.
 [6] Nicolas Gillis. The why and how of nonnegative matrix factorization. Regularization, Optimization, Kernels, and Support Vector Machines, 12(257), 2014.
 [7] Yanjun Han, Jiantao Jiao, and Tsachy Weissman. Minimax estimation of discrete distributions under loss. IEEE Transactions on Information Theory, 61(11):6343–6354, 2015.
 [8] Jiashun Jin. Fast community detection by SCORE. Annals of Statistics, 43(1):57–89, 2015.
 [9] Jiashun Jin, Zheng Tracy Ke, and Shengming Luo. Estimating network memberships by simplex vertex hunting. arXiv:1708.07852, 2017.
 [10] Jeff Johns and Sridhar Mahadevan. Constructing basis functions from directed graphs for value function approximation. In Proceedings of the 24th international conference on Machine learning, pages 385–392. ACM, 2007.
 [11] Zheng Tracy Ke and Minzhe Wang. A new SVD approach to optimal topic estimation. arXiv:1704.07016, 2017.
 [12] Daniel Lee and Sebastian Seung. Learning the parts of objects by nonnegative matrix factorization. Nature, 401(6755):788–791, 1999.
 [13] Xudong Li, Mengdi Wang, and Anru Zhang. Estimation of Markov chain via rankconstrained likelihood. Proceedings of the 35th international conference on Machine learning, 2018.
 [14] Sridhar Mahadevan. Protovalue functions: Developmental reinforcement learning. In Proceedings of the 22nd international conference on Machine learning, pages 553–560. ACM, 2005.
 [15] Sridhar Mahadevan. Learning representation and control in Markov decision processes: New frontiers. Foundations and Trends® in Machine Learning, 1(4):403–565, 2009.
 [16] Andrew W Moore. Variable resolution dynamic programming: Efficiently learning action maps in multivariate realvalued statespaces. In Machine Learning Proceedings 1991, pages 333–337. Elsevier, 1991.
 [17] Ronald Parr, Christopher PainterWakefield, Lihong Li, and Michael Littman. Analyzing feature generation for valuefunction approximation. In Proceedings of the 24th international conference on Machine learning, pages 737–744. ACM, 2007.
 [18] Marek Petrik. An analysis of laplacian methods for value function approximation in MDPs. In IJCAI, pages 2574–2579, 2007.
 [19] Zhiyuan Ren and Bruce H Krogh. State aggregation in Markov decision processes. In Proceedings of the 41st IEEE Conference on Decision and Control, volume 4, pages 3819–3824. IEEE, 2002.
 [20] David F Rogers, Robert D Plante, Richard T Wong, and James R Evans. Aggregation and disaggregation techniques and methodology in optimization. Operations Research, 39(4):553–582, 1991.
 [21] Satinder P Singh, Tommi Jaakkola, and Michael I Jordan. Reinforcement learning with soft state aggregation. In Advances in neural information processing systems, pages 361–368, 1995.
 [22] John N Tsitsiklis and Benjamin Van Roy. Featurebased methods for large scale dynamic programming. Machine Learning, 22(13):59–94, 1996.
 [23] Anru Zhang and Mengdi Wang. Spectral state compression of Markov processes. arXiv:1802.02920, 2018.
Appendix A Rationale of Algorithm 1
In this section, we explain the rationale of Algorithm 1, especially for:

Why the SCORE normalization [8] produces a simplex geometry.

How the simplex geometry is used for estimating .
Write for short and . The normalized data matrix
(4) 
The matrix can be viewed as the “signal” part of . Let be the right singular vectors of . They can be viewed as the population counterpart of . We define a population counterpart of the matrix produced by SCORE:
(5) 
From now on, we pretend that the matrix is directly given and study the geometric structures associated with the singular vectors and the SCORE matrix .
a.1 The simplex geometry and explanation of steps of Algorithm 1
When , the matrix , defined in (4), also admits a lowrank decomposition:
where
By linear algebra, the span of the right singular vectors is the same as the column space of . It implies there exists a linear transformation such that
(6) 
Since is a nonnegative matrix, each row of is an affine combination of rows of . Furthermore, if is an anchor state, then the th row of has exactly one nonzero entry, and so the th row of is proportional to one row of . This gives rise to the following simplicialcone geometry:
Proposition 1 (Simplicial cone geometry).
Suppose , each meta state has an anchor state, and . Let contain the right singular vectors of . There exists a simplicial cone in , which has extreme rays, such that all rows of are contained in this simplicial cone. Furthermore, for all anchor states of a meta state, the th row of lies exactly on one extreme ray of this simplicial cone.
Remark. Similar simplicial cone geometry has been discovered in the literature of nonnegative matrix factorization [5]. The simplicial cone there is associated with rows of the matrix that admits a nonnegative factorization, but the simplicial cone here is associated with singular vectors of the matrix. Since SVD is a linear projection, it is not surprising that the simplicial cone structure is retained in singular vectors.
However, in the real case, we have to apply SVD to the noisy matrix, then the simplicial cone is corrupted by noise and hardly visible. We hope to find a proper normalization of , so that the normalized rows are all contained in a simplex, where all points on the extreme ray of the previous simplicial cone (these points do not overlap) fall onto one vertex of the current simplex (these points now overlap). Such a simplex geometry is much more robust to noise corruption and is easier to estimate.
How to normalize to obtain a simplex geometry is tricky. If all entries of are nonnegative, we can normalize each row of by the norm of that row, and rows of the resulting matrix are contained in a simplex. However, consists of singular vectors and often have negative entries, so such a normalization won’t work.
By Perron’s theorem in linear algebra, the leading right singular vector have all positive coordinates. It turns out that normalizing each row of by the corresponding coordinate of is a proper normalizaiton that will produce a simplex geometry. This is the idea of SCORE [8, 9, 11].
Proposition 2 (PostSCORE simplex geometry).
In the setting of Proposition 1, additionally, we assume is an irreducible matrix. Consider the matrix . Then, there exists a simplex , which has vertices , such that all rows of are contained in this simplex. Furthermore, for all anchor states of a same meta state, the th row of falls exactly onto one vertex of this simplex.
Proposition 2 explains the rationale of the vertex hunting step. The vertex hunting step we used was borrowed from [9, 11]; see explanations therein.
Let be the vertices of the simplex . By vertex hunting, we obtain estimates of these vertices. The next question is: How can we recover from the simplex vertices ?
Let be the th row of , for . By the nature of a simplex, each point in it can be uniquely expressed as a convex combination of the vertices. This means, for each , there exists a weight vector from the standard simplex such that
The next proposition shows that we can recover from .
Proposition 3 (Relation of simplex and matrix ).
By Proposition 3, each column of the matrix
is proportional to the corresponding column of . Since each column of has a unit norm, if we normalize each column of the above matrix by its norm, we can exactly recover .
Once we have obtained , we can immediately recover from by the relation:
The above gives the following theorem:
Proposition 4 (Exact recovery of and ).
In the setting of Proposition 2, if we apply Algorithm 1 to the matrix , it exactly outputs and .
a.2 Proof of propositions
Proposition 1 follows from (6) and definition of simplicial cone. Proposition 4 is proved in Section 1.1. We now prove Propositions 23. Recall that by (6), for ,
where is the th column of . When is an irreducible matrix, by Perron’s theorem, all the coordinates of are strictly positive. This guarantees that is welldefined. Additionally, for an anchor state of the th meta state, , where . Therefore, for . We define a matrix
By definition,
(7) 
and
(8) 
Combining them with (6) gives
(9) 
Let
The (9) implies
Since is a nonnegative matrix, the first equation implies that each row of is from the standard simplex, and the second equation implies that each row of is a linear combination of the rows of , where the combination coefficients come from the corresponding row of . This has proved the simplex geometry stated in Proposition 2.
Note that the th row of is located on one vertex of the simplex if and only if the th row of is located on one vertex of the standard simplex. From the way we define , its th row equal to
Since , and are all positive vectors, is located on one vertex of the standard simplex if and only if exactly one of is nonzero, where the latter is true if and only if is an anchor state. This has proved Proposition 2.
Furthermore, from the way is defined above, using the fact that , we immediately find that
which is equivalent to
This has proved Proposition 3.
Appendix B Proof of Theorems 12
Theorem 1 is adapted from Theorem 2.2 of [11], by connecting our setting with a topic modeling setting. Let . Since the sample transitions are generated independently from the stationary distribution . We have

. Since , . Combining it with largedeviation inequalities for Bernoulli variables (e.g., the Bernstein inequality), with probability , simultaneously for all , for some constant .

Conditioning on , the rows of are independent of each other, where the th row has a distribution of , where is the th row of .
Therefore, conditioning on , each row of follows the same statistical model as the bagofwords representation of a document of length of , generated from the topic matrix with a weight over the topics. Therefore, Theorem 1 is an application of Theorem 2.2 of [11] to the case where the dictionary size is , the number of documents is , and the minimum document length is .
We notice that Theorem 2.2 of [11] assumes all documents have equal length. Here, ’s are different; but since they are at the same order, the proof of Theorem 2.2 of [11] also applies. Moreover, our regularity conditions look different from those of Theorem 2.2 of [11], as we have adapted them to the state aggregation setting.
We now prove Theorem 2. Fix . Note that
As a result,
(10)  
(11)  
(12)  
(13) 
We shall bound the three terms separately. Consider . Let be the th column of . We first give a bound for . Let be drawn from a multinomial distribution with the number of trials equal to and the probability vector equal to . Then,
Each is bounded by . Moreover, the variance of is bounded by , where we have used the fact that . Using the Bernstein inequality, we know that with probability ,
Since this is true for every , we immediately find out that
(14) 
where we note that by our assumption. Write . In Theorem 2.2 of [11], it provides a rowwise error bound on , that is, with probability , for all
(15) 
As a result, we can get an upper bound on :
(16)  
(17)  
(18)  
(19) 
In particular, it implies . We notice that
where in the last inequality we have used (16) and that . As a result,
(20)  
(21)  
(22) 
In particular, . Combining it with (14), we find that
(23)  
(24)  
(25) 
Consider . It is seen that
Note that we have . Combining it with the bound in (20), we immediately find that
(26) 
Consider . It is seen that
The third two term is dominated by the first two terms. We already have a bound for . Additionally, by (15),
(27)  
(28)  
(29)  
(30) 
(31) 