# Efficient Learning and Planning with Compressed Predictive States

William Hamilton (william.hamilton2@mail.mcgill.ca)
Mahdi Milani Fard (mmilani1@cs.mcgill.ca)
Joelle Pineau (jpineau@cs.mcgill.ca)
School of Computer Science, McGill University

###### Abstract

Predictive state representations (PSRs) offer an expressive framework for modelling partially observable systems. By compactly representing systems as functions of observable quantities, the PSR learning approach avoids using local-minima prone expectation-maximization and instead employs a globally optimal moment-based algorithm. Moreover, since PSRs do not require a predetermined latent state structure as an input, they offer an attractive framework for model-based reinforcement learning when agents must plan without a priori access to a system model. Unfortunately, the expressiveness of PSRs comes with significant computational cost, and this cost is a major factor inhibiting the use of PSRs in applications. To alleviate this shortcoming, we introduce the notion of compressed PSRs (CPSRs). The CPSR learning approach combines recent advancements in dimensionality reduction, incremental matrix decomposition, and compressed sensing. We show how this approach provides a principled avenue for learning accurate approximations of PSRs, drastically reducing the computational costs associated with learning while also providing effective regularization. Going further, we propose a planning framework which exploits these learned models, and we show that this approach facilitates model-learning and planning in large complex partially observable domains, a task that is infeasible without the principled use of compression.¹

¹ An earlier version of this work appeared as: W.L. Hamilton, M. M. Fard, and J. Pineau. Modelling sparse dynamical systems with compressed predictive state representations. In Proceedings of the Thirtieth International Conference on Machine Learning, 2013.

Keywords: Predictive State Representation, Reinforcement Learning, Dimensionality Reduction, Random Projections

## 1 Introduction

In the reinforcement learning (RL) paradigm, an agent in a system acts, observes, and receives feedback in the form of numerical signals (Sutton and Barto, 1998). Given this experience, the agent determines an optimal policy (i.e., a guide for its future actions) via value-function based dynamic programming or parametrized policy search. This is conceptually analogous to the ‘operant conditioning’ postulated to underlie certain forms of animal (and human) learning. Organisms learn to repeat actions that give positive feedback and avoid those with negative results.

### 1.1 Fully to Partially Observable Domains

In the standard formulation, an RL agent is given prior knowledge of a domain in the form of a state-space, transition probabilities, and an observation (i.e., sensor) model. Formally, the system is described by a Markov decision process (MDP), and given the MDP description, a variety of optimization algorithms may then be used to solve the problem of determining an optimal action policy (Sutton and Barto, 1998). In general, approximate solutions are determined for domains exhibiting large, or even moderate, dimensionality (Gordon, 1999).

The situation is further complicated in domains exhibiting partial observability, where observations are aliased and do not fully determine an agent’s state in a system. For example, an agent’s sensors may indicate the presence of nearby objects but not the agent’s global position within an environment. To accommodate this uncertainty, the MDP framework is extended as partially observable Markov decision processes (POMDPs) (Kaelbling et al., 1998). Here, the true state is not known with certainty, and optimization algorithms must act upon belief states (i.e., probability distributions over the state-space).

### 1.2 Model-Learning Before Planning

The POMDP extension introduces a measure of uncertainty in the reinforcement learning paradigm. Nevertheless, an agent learning a policy via the POMDP framework has access to considerable a priori knowledge: Most centrally, the agent (which necessarily and implicitly contains the POMDP solver) has access to a description of the system in the form of an explicit state-space representation. Moreover, in a majority of instances, the agent knows the probabilities governing the transitions between states, the observation functions governing the emission of observable quantities from these states, and the reward function specifying some empirical measure of “goodness” for each state (Kaelbling et al., 1998).

Access to such knowledge allows for the construction of optimal (or near-optimal) plans and is useful for real-world applications where considerable domain-specific knowledge is available. However, the converse situation, where a (near)-complete system model is not known a priori, is both important and lags behind in terms of research results. In such a setting, an agent must learn a system model prior to (or while simultaneously) learning an action policy.

At an application level, there are many situations in which expert knowledge is sparse, and it is possible that even application domains with domain-knowledge could benefit from the use of algorithms that learn system models prior to planning and that are thus free from unintended biases introduced via expert-specified system models. At a more theoretical level, the development of general agents that both learn system models and plan using such models is fundamental in the pursuit of creating truly intelligent artificial agents that can learn and succeed independent of prior domain knowledge.

### 1.3 Learning a Model-based Predictive Agent

In this work we outline an algorithm for constructing a learning and planning agent for sequential decision-making under partial state observability. At a high-level, the algorithm is model-based, specifying an agent that builds a model of its environment through experience and then plans using this learned model. Such a model-based approach is necessary in complicated partially observable domains, where single observations are far from sufficient statistics for the state of the system (Kaelbling et al., 1998). At its core, the algorithm relies on the powerful and expressive model class of predictive state representations (PSRs) (Littman et al., 2002). PSRs (described in detail in Section 2) are an ideal candidate for the construction of an agent that both learns a system model and plans using this model, as they do not require a predetermined state-space as an input.

PSRs have been used as the basis of model-based reinforcement learning agents in a number of recent works (Boots et al., 2010; Rosencrantz et al., 2004; Ong et al., 2012; Izadi and Precup, 2008; James and Singh, 2004). However, for these previous approaches, the time and space complexities of learning scale super-linearly in the maximum length of the trajectories used (see Section 3). In this work we use an approach that simultaneously ameliorates the efficiency concerns related to constructing PSRs and alleviates the need for domain-specific feature construction. The model-learning algorithm, termed compressed predictive state representation (CPSR), uses random projections in order to efficiently learn accurate approximations of PSRs in sparse systems. In addition, the approach utilizes recent advancements in incrementally learning transformed PSRs (TPSRs), providing further optimization (Boots and Gordon, 2011). The details of the model-learning algorithm are provided in Section 3.2. Section 4 presents theoretical results pertaining to the accuracy of the approximate learned model and elucidates how our approach regularizes the learned model, trading off reduced variance for controlled bias.

The planning algorithm used is an extension of the fitted-Q function approximation-based planning algorithm for fully observable systems (Ernst et al., 2005). This approach has been applied to PSRs previously with some success (Ong et al., 2012) and provides a strong alternative to point-based value iteration methods (Izadi and Precup, 2008). The algorithm simply substitutes a predictive state for the observable MDP state in a fitted-Q learning algorithm, and a function approximator is used to learn an approximation of the Q-function for the system (i.e., the function mapping predictive states and actions to expected rewards). The details of the planning approach are outlined in Section 5. The main empirical contribution of this work is the application of this approach to domains and sample-sizes of complexity not previously feasible for PSRs. Section 6 highlights empirical results demonstrating the performance of the algorithm on some synthetic robot navigation domains and a difficult real-world application task based upon the ecological management of migratory bird species.

This work builds upon the algorithm presented in Hamilton et al. (2013), extending it in a number of ways. Specifically, this work (1) permits a broader class of projection matrices, (2) includes optional compression of both histories and tests, (3) combines compressed sensing with incremental matrix decomposition to facilitate incremental/online learning, (4) provides a more detailed theoretical analysis of the model-learning algorithm, (5) explicitly includes a planning framework, which exploits the learned CPSR models in a principled manner, and (6) provides extensive empirical results pertaining to both model-learning and planning, including results on a difficult real-world problem.

## 2 Predictive State Representations

Predictive state representations (PSRs) offer an expressive and powerful framework for modelling dynamical systems and thus provide a suitable foundation for a model-based reinforcement learning agent. In the PSR framework, a predictive model is constructed directly from execution traces, utilizing minimal prior information about the domain (Littman et al., 2002; Singh et al., 2004). Unlike latent state based approaches, such as hidden Markov models or POMDPs, PSR states are defined only via observable quantities. This not only makes PSRs more general, as they do not require a predetermined state-space, but it also increases their expressive power relative to latent state based approaches (Littman et al., 2002). In fact, the PSR paradigm subsumes POMDPs as a special case (Littman et al., 2002). In addition, PSRs facilitate model-learning without the use of local-minima prone expectation-maximization (EM) and allow for the efficient construction of globally optimal models via a method-of-moments based algorithm (James and Singh, 2004). The following section outlines the foundations of the PSR approach and sets the stage for the presentation of compressed predictive state representations in Section 3 and our efficient learning algorithm in Section 3.2. Much of the PSR background material (e.g., the derivation of the PSR model in Sections 2.2 and 2.3) expands upon the presentation in Boots et al. (2010) and utilizes important results from that work.

### 2.1 Notation

#### 2.1.1 Matrix Algebra Notation

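Bold letters denote vectors $\mathbf{v}$ and matrices $\mathbf{M}$. Given a matrix $\mathbf{M}$, $\|\mathbf{M}\|_F$ denotes its Frobenius norm. $\mathbf{M}^\dagger$ is used to denote the Moore–Penrose pseudoinverse of $\mathbf{M}$. Sometimes names are given to the columns and rows of a matrix using ordered index sets $\mathcal{I}$ and $\mathcal{J}$. In this case, $\mathbf{M}_{\mathcal{I},\mathcal{J}}$ denotes a matrix of size $|\mathcal{I}| \times |\mathcal{J}|$ with rows indexed by $\mathcal{I}$ and columns indexed by $\mathcal{J}$. We then specify entries in a matrix (or tensor) using these indices and the bracket notation; e.g., $[\mathbf{M}]_{i,j}$ corresponds to the entry in the row indexed by $i$ and the column indexed by $j$. Rows or columns of a matrix are specified using this index notation and the $*$ symbol; e.g., $[\mathbf{M}]_{i,*}$ denotes the $i$th row of $\mathbf{M}$. Finally, given $\mathcal{I}' \subseteq \mathcal{I}$ and $\mathcal{J}' \subseteq \mathcal{J}$ we define $\mathbf{M}_{\mathcal{I}',\mathcal{J}'}$ as the submatrix of $\mathbf{M}$ with rows and columns specified by the indices in $\mathcal{I}'$ and $\mathcal{J}'$, respectively.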

#### 2.1.2 Probability Notation

We denote the probability of an event by $P(\cdot)$ and use $|$ to denote the usual probabilistic conditioning. To avoid excessive notation, when the $P(\cdot)$ operator is applied to a vector of events, it is understood as returning a vector of probabilities unless otherwise indicated (i.e., a single operator is used for single events and vectors of events).

For clarity, we use $||$ to denote conditioning upon an agent's policy (i.e., plan). That is, $P(\cdot \,||\, a_1, a_2, \ldots, a_n)$ denotes that we are conditioning upon the knowledge that the agent will “intervene” in a system by executing the specified actions.

### 2.2 Technical Foundations

A PSR model represents a partially observable system’s state as a probability distribution over future events. More formally, we maintain a probability distribution over different sequences of possible future action-observation pairs. Such sequences of possible future action-observations are termed tests and denoted $\tau$. For example, we could construct a test $\tau_i = [o^{k_1}_{t+1}, o^{k_2}_{t+2}, \ldots, o^{k_n}_{t+n} \,||\, a^{l_1}_{t+1}, a^{l_2}_{t+2}, \ldots, a^{l_n}_{t+n}]$, where notationally subscripts refer to time, superscripts identify particular actions or observations, and actions following the $||$ symbol denote that we are conditioning upon the agent “intervening” by performing those specified actions at the specified times. We can then say that such a test is executed if the agent intervenes and takes the specified actions, and we say the test succeeded if the observations received by the agent match those specified by the test. Going further, we can define the probability of success for test $\tau_i$ as

$$P(\tau_i) = P\big(o^{k_1}_{t+1}, o^{k_2}_{t+2}, \ldots, o^{k_n}_{t+n} \,||\, a^{l_1}_{t+1}, a^{l_2}_{t+2}, \ldots, a^{l_n}_{t+n}\big). \tag{1}$$

Of course, we want to know more than just the unconditioned probabilities of success for each test. A complete model of a dynamical system also requires knowing the success probabilities for each test conditioned on the agent’s previous experience, or history. We denote such a history $h_j = [a^{l_1}_{1}, o^{k_1}_{1}, a^{l_2}_{2}, o^{k_2}_{2}, \ldots, a^{l_t}_{t}, o^{k_t}_{t}]$, where again subscripts denote time and superscripts identify particular actions or observations. Importantly, the $||$ symbol for intervention is absent from the definition of history, as the sequence of actions specified in a history is assumed to have already been executed.

Finally, given that an agent has performed some actions and received some observations, defining some history $h_j$, we compute

$$P(\tau^O_i \mid h_j \,||\, \tau^A_i), \tag{2}$$

the probability of $\tau_i$ succeeding conditioned upon the agent’s current history in the system, where $\tau^A_i$ and $\tau^O_i$ denote the ordered lists of actions and observations, respectively, defined in $\tau_i$.

It is not difficult to see that a partially observable system is completely described by the conditional success probabilities of all tests given all histories. That is, if we have $P(\tau^O_i \mid h_j \,||\, \tau^A_i)$ for all tests $\tau_i$ and histories $h_j$, then we trivially have all necessary information to characterize the dynamics of a system. Of course, maintaining all such probabilities directly is infeasible, as there is a potentially infinite number of tests and histories (and at the very least an exorbitant number for any system of even moderate complexity) (Littman et al., 2002).

Fortunately, it has been shown that it suffices to remember only the conditional probabilities for a (potentially) small core set of tests, and the conditional probabilities for all other tests may be defined as linear functions of the conditional probabilities for the tests in this core set² (Littman et al., 2002). More formally, we define the system dynamics matrix, $\mathcal{D}$, as the (potentially infinite size) matrix, where each row corresponds to a particular test $\tau_i$ (under some lexicographic ordering), each column to a particular history $h_j$ (under some lexicographic ordering), and a particular entry $[\mathcal{D}]_{i,j}$ to $P(\tau^O_i \mid h_j \,||\, \tau^A_i)$. $\mathcal{D}$ simply organizes these conditional probabilities in a matrix structure. In Littman et al. (2002) and Singh et al. (2004) it is shown that if $\mathcal{D}$ has finite rank then (1) $\operatorname{rank}(\mathcal{D})$ corresponds to the rank of the partially observable system, as defined by Jaeger (2000), and (2) there exists a minimal core set of size $\operatorname{rank}(\mathcal{D})$ (i.e., the smallest core set of tests is of size $\operatorname{rank}(\mathcal{D})$, though there may be larger core sets). Thus, it suffices to remember conditional probabilities for only $\operatorname{rank}(\mathcal{D})$ tests (those that are a part of the minimal core set), and the conditional probabilities for all other tests may be defined as linear functions of the conditional probabilities for these tests.

² In this work, the shortened phrase core set is always to be interpreted as core set of tests; that is, such sets always correspond to a set of tests.

The rank of $\mathcal{D}$ thus describes the complexity of a system. For example, a system cannot be modelled by a POMDP with fewer than $\operatorname{rank}(\mathcal{D})$ states, though it may require more than $\operatorname{rank}(\mathcal{D})$ POMDP states (Singh et al., 2004). In contrast, a PSR can always (exactly) model such a system using a minimal core set of size exactly $\operatorname{rank}(\mathcal{D})$ (Singh et al., 2004). This demonstrates how PSRs can be more compact than POMDPs.
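As a concrete illustration of the rank bound, the following sketch builds a finite block of the system dynamics matrix for a hypothetical two-state POMDP and checks that its numerical rank does not exceed the number of latent states. All transition and observation numbers are made up, and the helper names (`operator`, `apply_sequence`) are illustrative rather than taken from the paper.

```python
import itertools
import numpy as np

# Hypothetical 2-state POMDP with 2 actions and 2 observations (made-up numbers).
T = {0: np.array([[0.9, 0.1], [0.2, 0.8]]),   # T[a][s, s2] = P(s2 | s, a)
     1: np.array([[0.5, 0.5], [0.5, 0.5]])}
O = np.array([[0.8, 0.2], [0.3, 0.7]])        # O[s2, o] = P(o | s2)
b0 = np.array([0.6, 0.4])                     # initial belief over latent states

def operator(a, o):
    """Belief-space linear operator for the action-observation pair (a, o)."""
    return np.diag(O[:, o]) @ T[a].T

def apply_sequence(pairs, v):
    """Apply the operators for a sequence of (action, observation) pairs, in order."""
    for a, o in pairs:
        v = operator(a, o) @ v
    return v

pairs = list(itertools.product([0, 1], [0, 1]))   # all (action, observation) pairs
tests = [seq for n in (1, 2) for seq in itertools.product(pairs, repeat=n)]
histories = [seq for n in (0, 1, 2) for seq in itertools.product(pairs, repeat=n)]

# D[i, j] = P(test_i succeeds | history_j), conditioned on the test's actions.
D = np.zeros((len(tests), len(histories)))
for j, h in enumerate(histories):
    v = apply_sequence(h, b0)
    belief = v / v.sum()                          # normalized state after history h
    for i, t in enumerate(tests):
        D[i, j] = apply_sequence(t, belief).sum()

rank = np.linalg.matrix_rank(D)
print(D.shape, rank)
```

Although this block has 20 rows and 21 columns, every column is a linear image of a two-dimensional belief vector, which is why the rank stays at two here.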

Thus, for a PSR, given a minimal core set $\mathcal{Q}$ (i.e., $|\mathcal{Q}| = \operatorname{rank}(\mathcal{D})$), we can compute the conditional probability of some test $\tau_i$ as

$$P(\tau^O_i \mid h_j \,||\, \tau^A_i) = \mathbf{r}_{\tau_i}^\top P(\mathcal{Q}^O \mid h_j \,||\, \mathcal{Q}^A), \tag{3}$$

where $\mathbf{r}_{\tau_i}$ is a vector of weights and $P(\mathcal{Q}^O \mid h_j \,||\, \mathcal{Q}^A)$ an ordered vector of conditional probabilities for each test in the minimal core set $\mathcal{Q}$. Integral to this approach is the fact that restricting the model to linear functions of tests in the minimal core set does not preclude the modelling of non-linear systems, as the dynamics implicit in the probabilities may specify non-linear behaviours (Littman et al., 2002).

Thus, given the functions mapping tests in the core set to all other tests, it suffices to maintain, at time $t$, only the vector $P(\mathcal{Q}^O \mid h_t \,||\, \mathcal{Q}^A)$, where $h_t$ is the history of the system at time $t$. That is, it suffices to maintain only the vector of conditional probabilities for the tests in a core set (which is usually assumed to be minimal).

### 2.3 The PSR Model

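Formally, a PSR model of a system is defined by $\langle \mathcal{O}, \mathcal{A}, \mathcal{Q}, \mathcal{F}, \mathbf{m}_0 \rangle$, where $\mathcal{O}$ and $\mathcal{A}$ define the possible observations and actions respectively, $\mathcal{Q}$ is a minimal core set of tests, $\mathcal{F}$ defines a set of linear functions mapping success probabilities for tests in the minimal core set to the probabilities for all tests, and $\mathbf{m}_0$ defines the initial state of the system (i.e., $\mathbf{m}_0 = P(\mathcal{Q}^O \,||\, \mathcal{Q}^A)$, the prediction vector before any history has been observed). Since $\mathcal{F}$ contains only linear functions, its elements can be specified as vectors of weights. These vectors, in turn, are specified using a finite set of linear operators (i.e., matrices). Specifically, we define a linear operator $\mathbf{M}_{a^l o^k}$ for each action-observation pair such that

\begin{align}
P(o^{k}_{t+1} \mid h_t \,||\, a^{l}_{t+1}) &= \mathbf{m}_\infty^\top \mathbf{M}_{a^l o^k} P(\mathcal{Q}^O \mid h_t \,||\, \mathcal{Q}^A) \tag{4}\\
&= \mathbf{m}_\infty^\top \mathbf{M}_{a^l o^k} \mathbf{m}_t, \tag{5}
\end{align}

where $\mathbf{m}_\infty$ is a constant normalizer such that $\mathbf{m}_\infty^\top \mathbf{m}_t = 1$.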

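These operators map probabilities of tests in the specified minimal core set to the probabilities for single action-observation pairs and may be recursively combined to generate the full set of linear functions in $\mathcal{F}$. For instance, for the test $\tau_i = [o^{k_1}_{t+1}, o^{k_2}_{t+2}, \ldots, o^{k_n}_{t+n} \,||\, a^{l_1}_{t+1}, a^{l_2}_{t+2}, \ldots, a^{l_n}_{t+n}]$, we compute

\begin{align}
P(\tau^O_i \mid h_t \,||\, \tau^A_i) &= \mathbf{r}_{\tau_i}^\top P(\mathcal{Q}^O \mid h_t \,||\, \mathcal{Q}^A) \tag{6}\\
&= \mathbf{m}_\infty^\top \mathbf{M}_{a^{l_n} o^{k_n}} \cdots \mathbf{M}_{a^{l_2} o^{k_2}} \mathbf{M}_{a^{l_1} o^{k_1}} \mathbf{m}_t. \tag{7}
\end{align}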
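The operator product in (7) can be sketched numerically. The example below is a minimal sketch that assumes a hypothetical two-state POMDP (made-up numbers) whose belief-space operators stand in for the PSR operators; the names `operator` and `prob_of_test` are illustrative and not part of the paper.

```python
import numpy as np

# Hypothetical 2-state, 2-action, 2-observation POMDP (all numbers are made up).
T = {0: np.array([[0.9, 0.1], [0.2, 0.8]]),   # T[a][s, s2] = P(s2 | s, a)
     1: np.array([[0.5, 0.5], [0.5, 0.5]])}
O = np.array([[0.8, 0.2], [0.3, 0.7]])        # O[s2, o] = P(o | s2)

def operator(a, o):
    """Linear operator for the action-observation pair (a, o)."""
    return np.diag(O[:, o]) @ T[a].T

m_inf = np.ones(2)             # normalizer: m_inf @ m_t == 1 for any belief m_t
m_t = np.array([0.6, 0.4])     # current state vector (here, a belief)

def prob_of_test(test, state):
    """Success probability of a test, as in (7): apply operators in order, then normalize."""
    v = state
    for a, o in test:
        v = operator(a, o) @ v
    return float(m_inf @ v)

p_one_step = prob_of_test([(0, 0)], m_t)
p_two_step = prob_of_test([(0, 0), (1, 1)], m_t)
print(p_one_step, p_two_step)
```

Note that the first action-observation pair of the test is applied first, so the corresponding operator sits rightmost in the product, matching the ordering in (7).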

These operators can also be used to produce $n$-step predictions (i.e., the probability of seeing an observation, $o^k$, after taking action, $a^l$, $n$ steps in the future) by:

$$P(o^{k}_{t+n} \mid h_t \,||\, a^{l}_{t+n}) = \mathbf{m}_\infty^\top \mathbf{M}_{a^l o^k} (\mathbf{M}_\star)^{n-1} \mathbf{m}_t, \tag{8}$$

where $\mathbf{M}_\star$ is a matrix that can be computed once and stored as a parameter for quick computation (Wiewiora, 2007).

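Lastly, the operators provide a convenient method for updating the predictive state, defined by the prediction vector $\mathbf{m}_t$, as an agent tracks through a system and receives observations. The prediction vector $\mathbf{m}_t$ is updated to $\mathbf{m}_{t+1}$ after an agent takes an action $a^l$ and receives observation $o^k$ using:

\begin{align}
\mathbf{m}_{t+1} &= P(\mathcal{Q}^O \mid h_{t+1} \,||\, \mathcal{Q}^A) \tag{9}\\
&= P(\mathcal{Q}^O \mid h_t a^l o^k \,||\, \mathcal{Q}^A) \tag{10}\\
&= \frac{\mathbf{M}_{a^l o^k}\mathbf{m}_t}{\mathbf{m}_\infty^\top \mathbf{M}_{a^l o^k}\mathbf{m}_t}. \tag{11}
\end{align}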
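The update in (11) amounts to applying one operator and renormalizing. The following minimal sketch again assumes a hypothetical two-state POMDP with made-up numbers, using belief-space operators as stand-ins for the PSR operators; `operator` and `update` are illustrative names.

```python
import numpy as np

# Hypothetical 2-state, 2-action, 2-observation POMDP (all numbers are made up).
T = {0: np.array([[0.9, 0.1], [0.2, 0.8]]),   # T[a][s, s2] = P(s2 | s, a)
     1: np.array([[0.5, 0.5], [0.5, 0.5]])}
O = np.array([[0.8, 0.2], [0.3, 0.7]])        # O[s2, o] = P(o | s2)

def operator(a, o):
    """Linear operator for the action-observation pair (a, o)."""
    return np.diag(O[:, o]) @ T[a].T

m_inf = np.ones(2)             # normalizer

def update(m_t, a, o):
    """State update (11): apply the operator, then divide by the one-step probability."""
    v = operator(a, o) @ m_t
    return v / (m_inf @ v)

m_t = np.array([0.6, 0.4])
m_next = update(m_t, a=0, o=0)

# Chain rule check: P(o1, o2 || a1, a2) = P(o1 || a1) * P(o2 | a1 o1 || a2).
p_first = m_inf @ operator(0, 0) @ m_t
p_second_given_first = m_inf @ operator(1, 1) @ m_next
p_joint = m_inf @ operator(1, 1) @ operator(0, 0) @ m_t
print(m_next, p_first * p_second_given_first, p_joint)
```

The denominator in (11) is exactly the one-step prediction from (5), so tracking the state and scoring the observation share the same matrix-vector product.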

Together, the elements of $\langle \mathcal{O}, \mathcal{A}, \mathcal{Q}, \mathcal{F}, \mathbf{m}_0 \rangle$ (where $\mathcal{F}$ is understood to contain the linear operators described above and the normalizer) thus provide a succinct model of a system, which allows for the efficient computation of event probabilities and also facilitates conditioning upon observed histories.

### 2.4 Learning PSRs

There is a considerable amount of literature describing different approaches to learning PSRs. We provide only an overview of the standard approaches here; Section 3.2 describes, in detail, the efficient compressed learning approach we propose.³

³ For a slightly more detailed discussion of existing PSR learning approaches see Wiewiora (2007).

In general, PSR learning approaches may be divided into two distinct classes: discovery-based and subspace-based. In the discovery-based approach, a form of combinatorial search is used to discover the (minimal) core set of tests $\mathcal{Q}$, and the PSR model is then computed in a straightforward manner given the explicit knowledge of $\mathcal{Q}$ (James and Singh, 2004; James et al., 2005). This method generates an exact PSR model. However, the combinatorial search required to find $\mathcal{Q}$ precludes the use of this approach in domains of even moderate cardinality.

Unlike the discovery-based approaches, subspace-based approaches obviate the need for determining $\mathcal{Q}$ exactly (Hsu et al., 2008; Boots et al., 2010; Rosencrantz et al., 2004). Instead, subspace-identification techniques (e.g., spectral methods) are used in order to find a subspace that is a linear transformation of the subspace defined by $\mathcal{Q}$ (Rosencrantz et al., 2004). The linear nature of the PSR model allows the use of this transformed PSR model in place of the exact PSR model without detriment. Specifically, it can be shown that the probabilities obtained via such a transformed model are consistent with those obtained via the true model (Boots et al., 2010).

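Formally, one first specifies a large (non-minimal) core set of tests $\mathcal{T}$ and a set of histories $\mathcal{H}$. Next, one defines two observable matrices $\mathbf{P}_{\mathcal{T},\mathcal{H}}$ and $\mathbf{P}_{\mathcal{H}}$, and a set of observable matrices $\mathbf{P}_{\mathcal{T},ao,\mathcal{H}}$ (one for each action-observation pair). $\mathbf{P}_{\mathcal{T},\mathcal{H}}$ is a $|\mathcal{T}| \times |\mathcal{H}|$ matrix which contains the joint probabilities of all specified tests and histories. $\mathbf{P}_{\mathcal{H}}$ is a $|\mathcal{H}| \times 1$ vector containing the marginal probabilities of each possible history. And each $\mathbf{P}_{\mathcal{T},ao,\mathcal{H}}$ is a $|\mathcal{T}| \times |\mathcal{H}|$ matrix containing the joint probabilities of all specified tests and histories where a particular action-observation pair (indicated by the subscript) is appended to the history (Boots et al., 2010). These observable matrices can be viewed as submatrices of $\mathcal{D}$, the system dynamics matrix. We also define matrices $\mathbf{P}_{\mathcal{Q},\mathcal{H}}$ and $\mathbf{P}_{\mathcal{Q},ao,\mathcal{H}}$ analogously but with $\mathcal{Q}$ replacing $\mathcal{T}$.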

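Under the assumption that the empty history occurs first in the lexicographic ordering of $\mathcal{H}$, the discovery-based approach builds a PSR model by

\begin{align}
\mathbf{m}_0 &= [\mathbf{P}_{\mathcal{Q},\mathcal{H}}]_{*,1} \tag{12}\\
\mathbf{m}_\infty^\top &= \mathbf{P}_{\mathcal{H}}^\top (\mathbf{P}_{\mathcal{Q},\mathcal{H}})^\dagger, \tag{13}\\
\mathbf{M}_{ao} &= \mathbf{P}_{\mathcal{Q},ao,\mathcal{H}} (\mathbf{P}_{\mathcal{Q},\mathcal{H}})^\dagger, \tag{14}
\end{align}

while the subspace-based approach builds a model by

\begin{align}
\boldsymbol{\beta}_0 &= [\mathbf{Z}\mathbf{P}_{\mathcal{T},\mathcal{H}}]_{*,1} \tag{15}\\
\boldsymbol{\beta}_\infty^\top &= \mathbf{P}_{\mathcal{H}}^\top (\mathbf{Z}\mathbf{P}_{\mathcal{T},\mathcal{H}})^\dagger, \tag{16}\\
\mathbf{B}_{ao} &= \mathbf{Z}\mathbf{P}_{\mathcal{T},ao,\mathcal{H}} (\mathbf{Z}\mathbf{P}_{\mathcal{T},\mathcal{H}})^\dagger, \tag{17}
\end{align}

where $\mathbf{Z}$ is the projection matrix defining the subspace used for learning, which satisfies certain conditions. The conditions upon $\mathbf{Z}$ and the standard selection criterion for choosing it are elucidated in Section 2.5 below.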

From these equations we see that PSR learning, in both the subspace and discovery paradigms, corresponds to a set of regression problems: the pseudoinverses in (12)-(17) correspond to solutions to a set of regression problems. For example, in the learning of $\mathbf{m}_\infty$, the columns of $\mathbf{P}_{\mathcal{Q},\mathcal{H}}$ correspond to samples in the regression (i.e., each history is a sample), the rows to features (i.e., each test is a feature), and the regression targets are the entries of $\mathbf{P}_{\mathcal{H}}$ (i.e., the marginal history vector).
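The regression reading of the pseudoinverse can be checked numerically. In the sketch below the matrices are filled with synthetic random numbers (they are stand-ins, not probabilities estimated from any real system), and the variable names are illustrative; the point is only that the pseudoinverse form of (13) coincides with an ordinary least-squares fit in which histories are samples and tests are features.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tests, n_histories = 3, 8

# Synthetic stand-ins for the tests-by-histories matrix and the history vector.
P_QH = rng.random((n_tests, n_histories))
P_H = rng.random(n_histories)

# Pseudoinverse form, as in (13): m_inf^T = P_H^T (P_QH)^+
m_inf_pinv = P_H @ np.linalg.pinv(P_QH)

# Least-squares form: rows of P_QH.T are samples (histories), columns are features (tests).
m_inf_lstsq, *_ = np.linalg.lstsq(P_QH.T, P_H, rcond=None)

print(m_inf_pinv, m_inf_lstsq)
```

Viewing the estimators this way is what makes regularization arguments applicable: anything that reduces the number of features changes the bias and variance of these fits.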

In general, the complexity of the discovery-based learning approach is dominated by the combinatorial search for the set of core tests. In the worst case this search has time-complexity , where is the max-length of a trajectory (i.e., execution trace) used to learn the model. If the minimal core set of tests is provided as input, the discovery-based method has complexity ; however, the assumption that the minimal core set of tests is known is not realistic in practice. In contrast, the subspace-based approach has time-complexity , where is the column-dimension of . If the size of the minimal core set of tests is known (an unrealistic assumption) then .

### 2.5 Transformed Representations

PSR models learned via the subspace method are often referred to as transformed PSRs (TPSRs), since they learn a model that is an invertible transform of a standard PSR model. More formally, given the set of linear parameters defining a PSR model $\{\mathbf{m}_0, \mathbf{m}_\infty, \mathbf{M}_{ao}\}$ and an invertible matrix $\mathbf{J}$, we can construct a TPSR by applying $\mathbf{J}$ as a linear operator to each parameter. That is, we set $\boldsymbol{\beta}_0 = \mathbf{J}\mathbf{m}_0$, $\boldsymbol{\beta}_\infty^\top = \mathbf{m}_\infty^\top \mathbf{J}^{-1}$, and $\mathbf{B}_{ao} = \mathbf{J}\mathbf{M}_{ao}\mathbf{J}^{-1}$, and these new transformed matrices constitute the TPSR model (Boots and Gordon, 2011). It is easy to see that the $\mathbf{J}$’s cancel out in the prediction equation (6) and update equation (9). Intuitively, TPSRs can be thought of as maintaining a predictive state upon an invertible linear transform of the state defined by the tests in the minimal core set.
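The cancellation of the transform can be verified numerically. This is a minimal sketch under stated assumptions: the operators come from a hypothetical two-state POMDP with made-up numbers, the invertible matrix `J` is arbitrary, and the helper names `predict` and `step` are illustrative.

```python
import numpy as np

# Hypothetical 2-state POMDP (made-up numbers) supplying untransformed operators.
T = {0: np.array([[0.9, 0.1], [0.2, 0.8]]),
     1: np.array([[0.5, 0.5], [0.5, 0.5]])}
O = np.array([[0.8, 0.2], [0.3, 0.7]])
M = {(a, o): np.diag(O[:, o]) @ T[a].T for a in (0, 1) for o in (0, 1)}
m_0 = np.array([0.6, 0.4])
m_inf = np.ones(2)

# Any invertible matrix defines a transformed model.
J = np.array([[2.0, 1.0], [1.0, 1.0]])
J_inv = np.linalg.inv(J)
b_0 = J @ m_0
b_inf = m_inf @ J_inv                       # row vector: b_inf^T = m_inf^T J^{-1}
B = {ao: J @ M_ao @ J_inv for ao, M_ao in M.items()}

def predict(norm, ops, state, seq):
    """Probability of a sequence of action-observation pairs from `state`."""
    for ao in seq:
        state = ops[ao] @ state
    return float(norm @ state)

def step(norm, ops, state, ao):
    """One state update: apply the operator and renormalize."""
    v = ops[ao] @ state
    return v / (norm @ v)

seq = [(0, 1), (1, 0), (0, 0)]
p_original = predict(m_inf, M, m_0, seq)
p_transformed = predict(b_inf, B, b_0, seq)
print(p_original, p_transformed)
```

The transformed state itself is generally not a vector of probabilities (it equals `J` times the original state), yet every prediction it produces is unchanged.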

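In practice, the matrix $\mathbf{J}$ is determined by the projection matrix $\mathbf{Z}$, which is used during learning in the subspace-based paradigm. To make the relationship between $\mathbf{J}$ and $\mathbf{Z}$ explicit, we define the following matrices: $\mathbf{R} \in \mathbb{R}^{|\mathcal{T}| \times |\mathcal{Q}|}$, with each row $[\mathbf{R}]_{i,*}$ corresponding to the linear function mapping the probabilities of tests in the minimal core set to the probability of test $\tau_i$ (i.e., the $\mathbf{r}_{\tau_i}^\top$ as defined in (6)); $\mathbf{N} \in \mathbb{R}^{|\mathcal{H}| \times |\mathcal{H}|}$, with the marginal history probabilities along the diagonal; and $\mathbf{Q} \in \mathbb{R}^{|\mathcal{Q}| \times |\mathcal{H}|}$, with each column $[\mathbf{Q}]_{*,j}$ equal to the expected probability vector for the tests in the minimal core set given that history $h_j$ has been observed (i.e., $[\mathbf{Q}]_{*,j} = P(\mathcal{Q}^O \mid h_j \,||\, \mathcal{Q}^A)$). These matrices can then be used to define a factorization of the observable matrices. In particular, Boots et al. (2010) show that

$$\mathbf{P}_{\mathcal{T},\mathcal{H}} = \mathbf{R}\mathbf{Q}\mathbf{N} \tag{18}$$

and that

$$\mathbf{P}_{\mathcal{T},ao,\mathcal{H}} = \mathbf{R}\mathbf{M}_{ao}\mathbf{Q}\mathbf{N} \tag{19}$$

holds for all $a \in \mathcal{A}$ and $o \in \mathcal{O}$.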

Examining the equations for the different learning methods (i.e., (12) and (15)) and using the factorizations given in (18) and (19), we see first that for the discovery-based method, which learns a true untransformed PSR, we have that

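$$\mathbf{P}_{\mathcal{Q},\mathcal{H}} = \mathbf{I}\mathbf{Q}\mathbf{N}, \tag{20}$$

where $\mathbf{I}$ is the identity. In this case the set of tests in $\mathcal{T}$ is the minimal core set $\mathcal{Q}$, and thus the core set mapping operator $\mathbf{R}$ is replaced by the identity. Similarly, we have

$$\mathbf{P}_{\mathcal{Q},ao,\mathcal{H}} = \mathbf{I}\mathbf{M}_{ao}\mathbf{Q}\mathbf{N}. \tag{21}$$

Thus for the discovery method

\begin{align}
\mathbf{P}_{\mathcal{Q},ao,\mathcal{H}} (\mathbf{P}_{\mathcal{Q},\mathcal{H}})^\dagger &= \mathbf{M}_{ao}\mathbf{Q}\mathbf{N}(\mathbf{Q}\mathbf{N})^\dagger \tag{22}\\
&= \mathbf{M}_{ao}, \tag{23}
\end{align}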

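where we used the fact that $\mathbf{Q}\mathbf{N}$ has full row rank by definition (its rows correspond to the tests in the minimal core set), so that $\mathbf{Q}\mathbf{N}(\mathbf{Q}\mathbf{N})^\dagger = \mathbf{I}$. By contrast, for the subspace learning algorithm we have, assuming that $\mathbf{Z}\mathbf{R}$ has full row-rank,

\begin{align}
\mathbf{B}_{ao} &= \mathbf{Z}\mathbf{P}_{\mathcal{T},ao,\mathcal{H}} (\mathbf{Z}\mathbf{P}_{\mathcal{T},\mathcal{H}})^\dagger \tag{24}\\
&= \mathbf{Z}\mathbf{R}\mathbf{M}_{ao}\mathbf{Q}\mathbf{N} (\mathbf{Z}\mathbf{R}\mathbf{Q}\mathbf{N})^\dagger \tag{25}\\
&= \mathbf{Z}\mathbf{R}\mathbf{M}_{ao}\mathbf{Q}\mathbf{N} (\mathbf{Q}\mathbf{N})^\dagger (\mathbf{Z}\mathbf{R})^\dagger \tag{26}\\
&= \mathbf{Z}\mathbf{R}\mathbf{M}_{ao} (\mathbf{Z}\mathbf{R})^\dagger, \tag{27}
\end{align}

where we again used the fact that $\mathbf{Q}\mathbf{N}$ has full row rank. If we further assume that $\mathbf{Z}\mathbf{R}$ is invertible (i.e., $\mathbf{Z}\mathbf{R}$ is square in addition to being full row rank) then (27) simplifies to

$$\mathbf{Z}\mathbf{R}\mathbf{M}_{ao} (\mathbf{Z}\mathbf{R})^{-1}. \tag{28}$$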

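Similar results hold for $\boldsymbol{\beta}_0$ and $\boldsymbol{\beta}_\infty$, showing that the subspace learning method does, in fact, return TPSRs in the case where $\mathbf{Z}\mathbf{R}$ is invertible, and in this case we have a transformed representation with $\mathbf{J} = \mathbf{Z}\mathbf{R}$.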
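For completeness, the analogous steps for the two remaining parameters can be sketched as follows. This is a reconstruction rather than a derivation quoted from the source: it assumes the factorization $\mathbf{P}_{\mathcal{T},\mathcal{H}} = \mathbf{R}\mathbf{Q}\mathbf{N}$, that $\mathbf{Z}\mathbf{R}$ is invertible, that $\mathbf{Q}\mathbf{N}$ has full row rank, and that the empty history (listed first) has marginal probability one, so that $[\mathbf{Q}\mathbf{N}]_{*,1} = \mathbf{m}_0$. The last equality uses $\mathbf{m}_\infty^\top = \mathbf{P}_{\mathcal{H}}^\top(\mathbf{Q}\mathbf{N})^\dagger$, which is (13) with $\mathbf{P}_{\mathcal{Q},\mathcal{H}} = \mathbf{Q}\mathbf{N}$.

```latex
\begin{align*}
\boldsymbol{\beta}_0
  &= [\mathbf{Z}\mathbf{P}_{\mathcal{T},\mathcal{H}}]_{*,1}
   = \mathbf{Z}\mathbf{R}\,[\mathbf{Q}\mathbf{N}]_{*,1}
   = (\mathbf{Z}\mathbf{R})\,\mathbf{m}_0, \\
\boldsymbol{\beta}_\infty^\top
  &= \mathbf{P}_{\mathcal{H}}^\top (\mathbf{Z}\mathbf{R}\mathbf{Q}\mathbf{N})^\dagger
   = \mathbf{P}_{\mathcal{H}}^\top (\mathbf{Q}\mathbf{N})^\dagger (\mathbf{Z}\mathbf{R})^{-1}
   = \mathbf{m}_\infty^\top (\mathbf{Z}\mathbf{R})^{-1}.
\end{align*}
```

Both expressions match the transformed parameters obtained by applying the invertible map $\mathbf{Z}\mathbf{R}$ to the untransformed initial state and normalizer.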

The final piece of a TPSR is the specification of $\mathbf{Z}$, the projection matrix defining the subspace used during learning (and implicitly defining the transformation matrix $\mathbf{J}$). We know from the above derivations that $\mathbf{Z}$ must be chosen such that $\mathbf{Z}\mathbf{R}$ is invertible. The standard method for guaranteeing this is by choosing $\mathbf{Z}$ via spectral techniques; that is, $\mathbf{Z}$ is set to be $\mathbf{U}^\top$, the transpose of the matrix of left singular vectors (from the thin-SVD of $\mathbf{P}_{\mathcal{T},\mathcal{H}}$) (Boots et al., 2010).
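The spectral recipe can be sketched end to end on exact observable matrices. The example assumes a hypothetical two-state POMDP with made-up numbers (so the observable matrices are computed exactly rather than estimated from data), takes the projection as the transpose of the two leading columns of `U` from a thin SVD, and assumes the rank (two) is known; the helper names `run` and `joint` are illustrative.

```python
import itertools
import numpy as np

# Hypothetical 2-state POMDP (made-up numbers) used to generate exact matrices.
T = {0: np.array([[0.9, 0.1], [0.2, 0.8]]),
     1: np.array([[0.5, 0.5], [0.5, 0.5]])}
O = np.array([[0.8, 0.2], [0.3, 0.7]])
b0 = np.array([0.6, 0.4])
ones = np.ones(2)
M = {(a, o): np.diag(O[:, o]) @ T[a].T
     for a, o in itertools.product((0, 1), repeat=2)}

pairs = list(M)                               # the four action-observation pairs
tests = [[ao] for ao in pairs]                # one-step tests
histories = [[]] + [[ao] for ao in pairs]     # empty history listed first

def run(seq, v):
    """Apply the operators for `seq` (in order) to the vector v."""
    for ao in seq:
        v = M[ao] @ v
    return v

def joint(test, middle, history):
    """Joint probability of history, then `middle`, then test (given all actions)."""
    return ones @ run(test, run(middle, run(history, b0)))

P_H = np.array([joint([], [], h) for h in histories])
P_TH = np.array([[joint(t, [], h) for h in histories] for t in tests])
P_TaoH = {ao: np.array([[joint(t, [ao], h) for h in histories] for t in tests])
          for ao in pairs}

# Spectral choice of the projection: leading columns of U from a thin SVD.
U, S, Vt = np.linalg.svd(P_TH, full_matrices=False)
Z = U[:, :2].T                                # 2 = assumed rank of this toy system

ZP_pinv = np.linalg.pinv(Z @ P_TH)
beta_0 = (Z @ P_TH)[:, 0]                     # as in (15)
beta_inf = P_H @ ZP_pinv                      # as in (16)
B = {ao: Z @ P_TaoH[ao] @ ZP_pinv for ao in pairs}   # as in (17)

seq = [(0, 1), (1, 0)]
p_learned = beta_inf @ B[seq[1]] @ B[seq[0]] @ beta_0
p_true = ones @ run(seq, b0)
print(p_learned, p_true)
```

With exact matrices the learned model reproduces the true sequence probability up to floating-point error; with matrices estimated from samples the agreement would only be approximate.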

The TPSR approach can also be extended to work with features of tests and histories (Boots et al., 2010; Boots and Gordon, 2011) and/or kernelized to work in continuous domains (Boots and Gordon, 2013). This is useful in cases where the observation space is too complex for standard tests to be used (i.e., when the observation space is structured or continuous). When features of tests and histories are used, however, they are usually specified in a domain-specific manner (Boots et al., 2010). Some authors have also used randomized Fourier methods to efficiently approximate kernel-based feature selection (Boots and Gordon, 2011). These methods are quite successful in continuous domains (Boots et al., 2010; Boots and Gordon, 2011, 2013).

In contrast, the benefit of the algorithm presented in Section 3.2 is that it implicitly performs general purpose feature selection (for discrete domains) using random compression. This is especially useful in cases where it is difficult to know a sufficient set of features prior to training (e.g., in the case where the model is being learned incrementally). Moreover, the motivations behind the compression performed in this work and the above-mentioned feature-based techniques differ: the goal of this work is to provide compression for efficient learning, whereas the above-mentioned feature-based learning strategies are motivated by the need to cope with continuous or structured observation spaces. See Section 7.2 for further discussion on the relationship between this work and these alternative feature-based approaches.

## 3 Compressed Predictive State Representations

In this section, we describe our extension of PSRs, compressed predictive state representations (CPSRs). The CPSR approach, at its core, combines the state-of-the-art in subspace PSR learning with recent advancements in compressed sensing. This marriage provides an extremely efficient and principled approach for learning accurate transformed approximations of PSRs in complex systems, where learning a full PSR is simply intractable. Section 3.1 motivates the use of compressed sensing techniques in a PSR learning algorithm, and Section 3.2 describes the efficient CPSR learning approach we propose.

### 3.1 Foundations: Compressed Estimation

Despite the fact that non-compressed subspace-based algorithms, such as TPSR, can specify a small dimension for a transformed space (e.g., by removing the least important singular vectors of $\mathbf{P}_{\mathcal{T},\mathcal{H}}$, as is done in Rosencrantz et al. (2004) and analyzed in Kulesza et al. (2014)), there are still a number of computational limitations. To begin, TPSRs require that the matrix, $\mathbf{P}_{\mathcal{T},\mathcal{H}}$, be estimated in its entirety, and that the $\mathbf{P}_{\mathcal{T},ao,\mathcal{H}}$ matrices be partially estimated as well. Moreover, since the naive TPSR approach must compute a spectral decomposition of $\mathbf{P}_{\mathcal{T},\mathcal{H}}$, its computational cost is super-linear in the dimensions of that matrix in the batch (and incremental mini-batch) setting, assuming the observable matrices are given as input. Thus in domains that require many (possibly long) trajectories for learning or that have large observation spaces, such as those described in Section 6, the naive TPSR approach becomes intractable, since $|\mathcal{T}|$ and $|\mathcal{H}|$ both grow with the maximum length of a trajectory and with the size of the training set.⁴ ⁵ In order to circumvent these computational constraints (and provide a form of regularization), the CPSR learning algorithm we propose (in the next section) performs compressed estimation.

⁴ Note that $|\mathcal{T}|$ and $|\mathcal{H}|$ scale linearly with the number of observed tests/histories. The bound is thus pessimistic in that it assumes each training instance is unique.

⁵ It is worth noting that no explicit bounds on the sample complexity of PSR learning have been elucidated. However, the sample complexity bounds of Hsu et al. (2008) provide results for a special case of TPSR learning (i.e., no actions and only single length tests and histories). In general, PSR approaches are consistent estimators but cannot be assumed to be data efficient (thus emphasizing the need to accommodate large sample sizes).

This method is borrowed from the field of compressed sensing and works by projecting matrices down to low-dimensional spaces determined via randomly generated bases. More formally, a matrix $\mathbf{Y}$ is compressed to a matrix $\mathbf{X}$ (where $\mathbf{X}$ has far fewer rows than $\mathbf{Y}$) by:

$$\mathbf{X} = \boldsymbol{\Phi}\mathbf{Y}, \tag{29}$$

where $\boldsymbol{\Phi}$ is a Johnson-Lindenstrauss matrix (i.e., a matrix satisfying the Johnson-Lindenstrauss lemma) (Baraniuk and Wakin, 2009). Intuitively, a Johnson-Lindenstrauss matrix is a random matrix defining a low-dimensional embedding which approximately preserves Euclidean distances between projected points (i.e., the projection preserves the dot-product between vectors). Different choices for $\boldsymbol{\Phi}$ are discussed in Section 6. It is worth noting that in our case, the matrix multiplication in (29) is in fact performed “online”, and the matrices corresponding to $\boldsymbol{\Phi}$ and $\mathbf{Y}$ are never explicitly held in memory (details in Section 3.2).
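The distance-preserving behaviour of such a projection can be illustrated with a small experiment. The sketch below assumes one common construction, a dense Gaussian matrix with variance $1/d$ entries; this choice, the dimensions, and the synthetic data are assumptions for illustration and are not necessarily the matrices used in this work.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 2000, 400, 5             # original dimension, compressed dimension, columns

# Gaussian random projection with N(0, 1/d) entries (an assumed JL construction).
Phi = rng.normal(0.0, 1.0 / np.sqrt(d), size=(d, n))

Y = rng.random((n, m))             # synthetic data: columns are the points to compress
X = Phi @ Y                        # compression as in (29)

def pairwise_distances(A):
    """Euclidean distances between all pairs of columns of A."""
    diff = A[:, :, None] - A[:, None, :]
    return np.sqrt((diff ** 2).sum(axis=0))

iu = np.triu_indices(m, k=1)       # indices of distinct column pairs
ratios = pairwise_distances(X)[iu] / pairwise_distances(Y)[iu]
print(ratios)
```

The ratios cluster around one, and the spread shrinks as the compressed dimension `d` grows, which is the trade-off between compression and fidelity.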

The fidelity of this technique depends on what is called the sparsity of the matrix $\mathbf{Y}$. Sparsity in this context refers to the maximum number of non-zero entries which occur in any column of $\mathbf{Y}$. Formally, if we denote a column vector of $\mathbf{Y}$ by $\mathbf{y}_i$, we say that a matrix $\mathbf{Y}$ is $k$-sparse if:

$$k \geq \|\mathbf{y}_i\|_0 \quad \forall\, \mathbf{y}_i \in \mathbf{Y}, \tag{30}$$

where $\|\cdot\|_0$ denotes Donoho’s zero “norm” (which simply counts the number of non-zero entries in a vector).

The technique is very well suited for application to PSRs. Informally, the sparsity condition is the requirement that for every history $h$, only a subset of all tests have non-zero probabilities (a more formal definition appears in the theory section below). This seems realistic in many domains. For example, in the PocMan domain described below, we empirically found the average column sparsity of the matrices to be roughly 0.018% (i.e., approximately 0.018% of entries in a column were non-zero). Moreover, as we will demonstrate empirically in Section 6, certain noisy observation models induce sparsity that can be exploited by this approach.
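The compression step in (29) and the sparsity condition in (30) can be illustrated numerically. The following is a minimal sketch, not the implementation used in this work: it assumes an i.i.d. Gaussian choice for the projection matrix with variance $1/d$, and the helper names (`gaussian_jl_matrix`, `column_sparsity`) and all sizes are hypothetical choices made for the illustration.

```python
import numpy as np


def gaussian_jl_matrix(d, n_rows, rng):
    """Return a d x n_rows matrix with i.i.d. N(0, 1/d) entries (assumed choice)."""
    return rng.normal(0.0, 1.0 / np.sqrt(d), size=(d, n_rows))


def column_sparsity(Y):
    """Maximum number of non-zero entries in any column of Y (the k in Eq. 30)."""
    return int(np.count_nonzero(Y, axis=0).max())


rng = np.random.default_rng(0)
n_rows, n_cols, d, k = 5000, 20, 400, 10  # illustrative sizes only

# Build a k-sparse matrix Y: each column has exactly k non-zero entries.
Y = np.zeros((n_rows, n_cols))
for j in range(n_cols):
    support = rng.choice(n_rows, size=k, replace=False)
    Y[support, j] = rng.uniform(0.5, 1.5, size=k)

# Compress as in Eq. (29): X = Phi Y.
Phi = gaussian_jl_matrix(d, n_rows, rng)
X = Phi @ Y

# Compare pairwise Euclidean distances between columns before and after.
i_idx, j_idx = np.triu_indices(n_cols, k=1)
original = np.linalg.norm(Y[:, i_idx] - Y[:, j_idx], axis=0)
compressed = np.linalg.norm(X[:, i_idx] - X[:, j_idx], axis=0)
ratios = compressed / original
print("sparsity k:", column_sparsity(Y))
print("distance ratio range:", ratios.min(), ratios.max())
```

In this sketch the distance ratios concentrate around 1, which is the approximate distance preservation that the Johnson-Lindenstrauss property describes; a smaller `d` widens the spread.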

### 3.2 Efficiently Learning CPSRs

In this section, we present our novel compressed predictive state representation (CPSR) learning algorithm. The algorithm builds upon the work of Hamilton et al. (2013), extending their algorithm in a number of important ways. Specifically, the algorithm presented here (1) permits a broad class of compression matrices (any full-rank projection matrix satisfying the JL lemma), (2) includes optional compression of both histories and tests, and (3) combines compressed sensing with spectral methods in order to provide numerical stability and facilitate incremental (and even online) model-learning. Section 3.2.1 describes the foundational batch-learning algorithm. Section 3.2.2 describes how to incrementally update a learned model with new data efficiently for deployment in online settings.

#### 3.2.1 Batch Learning of CPSRs

To begin, we define two injective functions: $\phi_{\mathcal{T}}: \mathcal{T} \to \mathbb{R}^{d_{\mathcal{T}}}$ and $\phi_{\mathcal{H}}: \mathcal{H} \to \mathbb{R}^{d_{\mathcal{H}}}$. These functions are independent mappings from tests and histories, respectively, to columns of independent random full-rank Johnson-Lindenstrauss (JL) projection matrices $\Phi_{\mathcal{T}}$ and $\Phi_{\mathcal{H}}$. The matrices are defined via these functions since the full sets $\mathcal{T}$ and $\mathcal{H}$ may not be known a priori, and this “lazy” specification is valid since the columns of JL projection matrices are determined by independent random variables.
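The “lazy” specification can be sketched as a cache that draws a fresh random column the first time a test or history is seen. This is an illustrative sketch only: the class name `LazyJLColumns`, the Gaussian column distribution, and the representation of tests and histories as tuples of (action, observation) pairs are assumptions made here, not details taken from the algorithm description.

```python
import numpy as np


class LazyJLColumns:
    """Hypothetical helper: maps each sequence to a column of a random matrix.

    Columns are drawn on first use, so the full set of tests (or histories)
    never has to be known in advance.
    """

    def __init__(self, dim, seed=0):
        self.dim = dim
        self._rng = np.random.default_rng(seed)
        self._columns = {}

    def __call__(self, sequence):
        key = tuple(sequence)
        if key not in self._columns:
            # Assumed column distribution: i.i.d. N(0, 1/dim) entries.
            self._columns[key] = self._rng.normal(
                0.0, 1.0 / np.sqrt(self.dim), size=self.dim
            )
        return self._columns[key]

    def __len__(self):
        return len(self._columns)


phi_T = LazyJLColumns(dim=8, seed=1)
first = phi_T([("a1", "o1")])
again = phi_T([("a1", "o1")])   # same test: the cached column is returned
other = phi_T([("a1", "o2")])   # new test: a fresh column is drawn
print(len(phi_T), first.shape)
```

Distinct sequences receive distinct columns with probability 1 under a continuous column distribution, which is the sense in which such a map is injective.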

Next, given a training trajectory $z$ of action-observation pairs of any length, let $\mathbb{I}_{h_j}(z)$ be an indicator function taking a value of 1 if the action-observation pairs in $z$ correspond to $h_j$. Similarly, define $|\cdot|$ as the length of a sequence (e.g., of action-observation pairs) and let $\mathbb{I}_{h_j,t_i}(z)$ be an indicator function taking a value of 1 if $z$ can be partitioned such that, starting from some index within the sequence, there are $|h_j|$ action-observation pairs corresponding to those in $h_j$ and the next $|t_i|$ pairs correspond to those in $t_i$. (Footnote 6: In this work we fix this starting index to the beginning of the sequence. That is, we do not use the suffix history estimation algorithm (Wolfe et al., 2005), in which the starting index is varied over the sequence. Fixing the index minimizes dependencies between estimation errors, as the same samples are not used to get estimates for multiple histories.)

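Given a batch $\mathcal{Z}$ of training trajectories, we compute compressed estimates of the observable matrices $\mathcal{P}_{\mathcal{H}}$ and $\mathcal{P}_{\mathcal{T},\mathcal{H}}$ (Footnote 7: We do not normalize our probability estimates in the estimation equations since the normalization constants cancel out during learning.):

\begin{align}
\hat{\Sigma}_{\mathcal{H}} &= \Phi_{\mathcal{H}} \hat{\mathcal{P}}_{\mathcal{H}} \tag{31} \\
&= \sum_{z \in \mathcal{Z}} \sum_{h_j \in \mathcal{H}} \mathbb{I}_{h_j}(z)\, \phi_{\mathcal{H}}(h_j), \tag{32} \\
\hat{\Sigma}_{\mathcal{T},\mathcal{H}} &= \Phi_{\mathcal{T}} \hat{\mathcal{P}}_{\mathcal{T},\mathcal{H}} \Phi_{\mathcal{H}}^{\top} \tag{33} \\
&= \sum_{z \in \mathcal{Z}} \sum_{t_i, h_j \in \mathcal{T} \times \mathcal{H}} \mathbb{I}_{h_j,t_i}(z) \left[ \phi_{\mathcal{T}}(t_i) \oplus \phi_{\mathcal{H}}(h_j) \right], \tag{34}
\end{align}

where $\oplus$ denotes the tensor (outer) product of two vectors.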

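Next, we compute the rank-$d'$ thin SVD of $\hat{\Sigma}_{\mathcal{T},\mathcal{H}}$:

$$(\hat{U}, \hat{S}, \hat{V}) = \mathrm{SVD}(\hat{\Sigma}_{\mathcal{T},\mathcal{H}}). \tag{35}$$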

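Given these matrices we can construct $c_1$ and $c_\infty$, the compressed and transformed estimates of the PSR initial-state vector and of $m_\infty$, respectively:

\begin{align}
c_1 &= \hat{S} \hat{V}^{\top} e, \tag{36} \\
c_\infty^{\top} &= \hat{\Sigma}_{\mathcal{H}}^{\top} \hat{V} \hat{S}^{-1}, \tag{37}
\end{align}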

where $e$ is a vector such that . In practice this can be guaranteed by defining a modified history map such that for the null history, , and that for all . This specification of $e$ assumes that all trajectories start from a unique start state. If this is not the case, then we set $e$ such that , which again can be guaranteed without cost, in this case by simply adding a constant “dummy” column to the front of . In this latter scenario, we would, in fact, not be learning exactly and instead would learn , an arbitrary feasible state as our start state. The uncertainty in our state estimate should decrease, however, as we update and track through our system and the process mixes (Boots et al., 2010). Indeed, the majority of domains without well-defined start states are those for which there is significant mixing over time, so this technique should introduce only a small amount of error in practice.

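Given the SVD of $\hat{\Sigma}_{\mathcal{T},\mathcal{H}}$, we can also estimate the $C_{ao}$ matrices, the compressed and transformed versions of the $M_{ao}$ matrices, directly via a second pass over the data. First, however, we must define a third class of indicator functions on trajectories $z \in \mathcal{Z}$: $\mathbb{I}_{h_j,ao,t_i}(z)$ takes value 1 if and only if the training sequence $z$ can be partitioned such that, starting from some index within the sequence, there are $|h_j|+1$ action-observation pairs corresponding to $h_j$ appended with a particular $ao$ and the next $|t_i|$ correspond to those in $t_i$. In other words, $\mathbb{I}_{h_j,ao,t_i}(z)$ is equivalent to $\mathbb{I}_{h_j ao,t_i}(z)$, where a particular $ao$ is appended to the history $h_j$. Using these indicators and the SVD matrices of $\hat{\Sigma}_{\mathcal{T},\mathcal{H}}$, we compute, for each $ao$:

$$C_{ao} = \sum_{z \in \mathcal{Z}} \sum_{t_i, h_j \in \mathcal{T} \times \mathcal{H}} \mathbb{I}_{h_j,ao,t_i}(z) \left[ \left( \hat{U}^{\top} \phi_{\mathcal{T}}(t_i) \right) \oplus \left( \hat{S}^{-1} \hat{V}^{\top} \phi_{\mathcal{H}}(h_j) \right) \right]. \tag{38}$$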

Thus, in two passes over the data, we are able to efficiently construct our CPSR model parameters. The primary computational savings engendered by this approach is in the computation of the pseudoinverse of $\hat{\Sigma}_{\mathcal{T},\mathcal{H}}$, which we implicitly compute via an SVD. Since we are performing pseudoinversion (i.e., SVD) on a compressed matrix, the computational complexity is uncoupled from the number of tests and histories in the set of observed trajectories $\mathcal{Z}$. Recalling that $L$ denotes the max length of a trajectory in $\mathcal{Z}$ and letting $|\mathcal{Z}|$ denote the number of trajectories in the set $\mathcal{Z}$, this approach has a computational complexity of

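$$O(L|\mathcal{Z}| d_{\mathcal{H}} d_{\mathcal{T}} + d_{\mathcal{T}}^2 d_{\mathcal{H}}) = O(L|\mathcal{Z}|) \tag{39}$$

since $d_{\mathcal{T}}$ and $d_{\mathcal{H}}$ are user-specified constants (assuming the standard cubic computational cost for the SVD). (Footnote 8: Section 4 describes how the choice of these constants affects the accuracy of the learned model.) Without compression (i.e., with naive TPSR), a computational cost of

$$O(L|\mathcal{Z}| + |\mathcal{H}||\mathcal{T}|^2) = O(L^3|\mathcal{Z}|^3) \tag{40}$$

is incurred.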
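The two-pass batch procedure of equations (31)-(38) can be sketched in NumPy as follows. This is a simplified illustration under several stated assumptions, not the authors' implementation: histories are taken to be prefixes of each trajectory (up to `max_hist` pairs) and tests the sequences that follow them (up to `max_test` pairs); feature columns are i.i.d. Gaussian; and the vector $e$ is handled by one hypothetical construction in which the null history maps to the first standard basis vector while every other history has a zero first coordinate. All function and parameter names are hypothetical.

```python
import numpy as np


def make_feature_map(dim, rng, reserve_first=False):
    """Lazily assign a random N(0, 1/dim) column to each new sequence.

    With reserve_first=True (an assumption of this sketch), the empty
    sequence maps to the first standard basis vector and every other
    sequence gets a zero first coordinate.
    """
    cache = {}

    def phi(seq):
        seq = tuple(seq)
        if seq not in cache:
            col = rng.normal(0.0, 1.0 / np.sqrt(dim), size=dim)
            if reserve_first:
                col[0] = 0.0
                if len(seq) == 0:
                    col = np.zeros(dim)
                    col[0] = 1.0
            cache[seq] = col
        return cache[seq]

    return phi


def splits(z, max_hist, max_test, skip):
    """Yield (history, ao, test) triples with the history anchored at the start of z."""
    for i in range(min(max_hist, len(z)) + 1):
        for length in range(1, max_test + 1):
            if i + skip + length <= len(z):
                ao = z[i] if skip else None
                yield z[:i], ao, z[i + skip:i + skip + length]


def learn_cpsr(trajectories, d_T, d_H, rank, max_hist=2, max_test=2, seed=0):
    rng = np.random.default_rng(seed)
    phi_T = make_feature_map(d_T, rng)
    phi_H = make_feature_map(d_H, rng, reserve_first=True)

    # First pass: compressed estimates, Eqs. (32) and (34).
    sigma_H = np.zeros(d_H)
    sigma_TH = np.zeros((d_T, d_H))
    for z in trajectories:
        for i in range(min(max_hist, len(z)) + 1):
            sigma_H += phi_H(z[:i])
        for h, _, t in splits(z, max_hist, max_test, skip=0):
            sigma_TH += np.outer(phi_T(t), phi_H(h))

    # Thin SVD truncated to the requested rank, Eq. (35).
    U, s, Vt = np.linalg.svd(sigma_TH, full_matrices=False)
    U, s, Vt = U[:, :rank], s[:rank], Vt[:rank]

    e = np.zeros(d_H)
    e[0] = 1.0                        # selects the null history (sketch assumption)
    c_1 = s * (Vt @ e)                # Eq. (36): S V^T e
    c_inf = (sigma_H @ Vt.T) / s      # Eq. (37): Sigma_H^T V S^{-1}

    # Second pass: one compressed operator per action-observation pair, Eq. (38).
    C = {}
    for z in trajectories:
        for h, ao, t in splits(z, max_hist, max_test, skip=1):
            left = U.T @ phi_T(t)
            right = (Vt @ phi_H(h)) / s
            C[ao] = C.get(ao, np.zeros((rank, rank))) + np.outer(left, right)
    return c_1, c_inf, C


# Toy data: 50 random trajectories over 4 action-observation symbols.
data_rng = np.random.default_rng(1)
symbols = [("a0", "o0"), ("a0", "o1"), ("a1", "o0"), ("a1", "o1")]
trajectories = [
    tuple(symbols[i] for i in data_rng.integers(0, 4, size=4)) for _ in range(50)
]
c_1, c_inf, C = learn_cpsr(trajectories, d_T=6, d_H=4, rank=4)
print(c_1.shape, c_inf.shape, len(C))
```

Note that the cost of the SVD in this sketch depends only on `d_T` and `d_H`, not on how many distinct tests and histories appear in the data, which is the source of the savings described above.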

In addition to these computational savings, the above approach has the added benefit of not requiring that $\mathcal{T}$ and $\mathcal{H}$ be known in their entirety prior to learning. This is especially important in the case where we want to alternate model-learning and planning/exploration phases using incremental updates (described below), as it is very unlikely that all possible tests and histories are observed in the first round of exploration. Performing SVD on the compressed matrices also induces a form of regularization (similar to regularization) on the learned model, where variance is reduced at the cost of a controlled bias (details in Section 4).

#### 3.2.2 Incremental Updates to the Model

In addition to straightforward batch learning, it is also possible to incrementally update a learned model, given new training data $\mathcal{Z}'$ (Boots and Gordon, 2011). This is especially useful in that it facilitates alternating exploration and exploitation phases. Of course, if such a non-blind alternating approach is used, then the distribution of the training data changes (i.e., it becomes non-stationary), and the sampled trajectories can no longer be assumed to be i.i.d. Despite this theoretical drawback, Ong et al. (2012) show that non-blind sampling approaches can lead to better planning results in a small sample setting. (Footnote 9: In this work, where larger sample sizes were used, we did not find a significant benefit to goal-directed sampling and in fact saw detrimental effects in terms of planning ability and numerical stability during learning. See Section 7 for details.)

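Briefly, we obtain a new estimate of $\hat{\Sigma}_{\mathcal{T},\mathcal{H}}$ and update our estimate of $\hat{\Sigma}_{\mathcal{H}}$ using (34) and (31) with the new data $\mathcal{Z}'$. Next, we update our SVD matrices, given our additive update to $\hat{\Sigma}_{\mathcal{T},\mathcal{H}}$, using the methods of Brand (2002). The $c_1$ and $c_\infty$ vectors are then re-computed exactly as in equations (36) and (37).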

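To obtain our $C_{ao}^{new}$ matrices, we compute (with the subscript “old” denoting quantities from before the update):

\begin{align}
C_{ao}^{new} ={}& \sum_{z \in \mathcal{Z}'} \sum_{t_i, h_j \in \mathcal{T} \times \mathcal{H}} \mathbb{I}_{h_j,ao,t_i}(z) \left[ \left( \hat{U}_{new}^{\top} \phi_{\mathcal{T}}(t_i) \right) \oplus \left( \hat{S}_{new}^{-1} \hat{V}_{new}^{\top} \phi_{\mathcal{H}}(h_j) \right) \right] \notag \\
&+ \hat{U}_{new}^{\top} \hat{U}_{old}\, C_{ao}^{old}\, \hat{S}_{old} \hat{V}_{old}^{\top} \hat{V}_{new} \hat{S}_{new}^{-1}. \tag{41}
\end{align}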

The first term in (41) corresponds to estimating the contribution to the new $C_{ao}$ matrix from the new data, and the second term is the projection of the old matrix onto the new basis. Using the results of Brand (2002), the complexity of this update is

$$O\left( L'|\mathcal{Z}'| \left( d_{\mathcal{T}} d_{\mathcal{H}} + (d')^3 + d' d_{\mathcal{T}} \right) + d_{\mathcal{T}} d' d_{\mathcal{H}} \right), \tag{42}$$

where $L'$ denotes the maximum length of a trajectory in $\mathcal{Z}'$.

## 4 Theoretical Analysis of the Learning Algorithm

In the following section, we describe theoretical properties of the CPSR learning approach. Our analysis proceeds in two stages. First, we show that the learned model is consistent in the case where $d_{\mathcal{T}} \geq |\mathcal{Q}|$ and $d_{\mathcal{H}} \geq |\mathcal{Q}|$ (i.e., when no real compression occurs). Following this, we outline results bounding the induced approximation error (bias) and decrease in estimation error (variance) due to learning a compressed model.

The analysis included in this section is intended as a means to justify the compression technique and study the overall consistency of our algorithm. It also provides guidance for choosing a theoretically sound range of values for the projection size used in the algorithm.

### 4.1 Consistency of the Learning Approach

The following adapts the results of Boots et al. (2010) and shows the consistency of our learning approach when the random projection dimension is greater than or equal to the true underlying dimension of the system (i.e., the size of the minimal core set of tests, $|\mathcal{Q}|$). We then describe the implications of this result for the case where we are in fact projecting down to a dimension smaller than $|\mathcal{Q}|$.

#### 4.1.1 Consistency in the Non-Compressed Setting

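We begin by noting a fundamental result from the TPSR literature. Recall the matrix $R$ where each row, $r_{\tau_i}^{\top}$, specifies the linear map:

$$r_{\tau_i}^{\top} P(\mathcal{Q}^{O} \mid h_t \,\|\, \mathcal{Q}^{A}) = P(\tau_i^{O} \mid h_t \,\|\, \tau_i^{A}). \tag{43}$$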

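Supposing that $d_{\mathcal{T}} \geq |\mathcal{Q}|$ and $d_{\mathcal{H}} \geq |\mathcal{Q}|$, and with $U$ coming from the SVD of $\Phi_{\mathcal{T}} \mathcal{P}_{\mathcal{T},\mathcal{H}} \Phi_{\mathcal{H}}^{\top}$, we have

\begin{align}
c_0 &= (U^{\top} \Phi_{\mathcal{T}} R)\, m_0, \tag{44} \\
c_\infty^{\top} &= m_\infty^{\top} (U^{\top} \Phi_{\mathcal{T}} R)^{-1}, \tag{45} \\
C_{ao} &= (U^{\top} \Phi_{\mathcal{T}} R)\, M_{ao}\, (U^{\top} \Phi_{\mathcal{T}} R)^{-1}. \tag{46}
\end{align}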

That is, we simply recover a TPSR whose linear transform is $U^{\top} \Phi_{\mathcal{T}} R$, and it has been shown that the above implies a consistent learning algorithm (Boots et al., 2010; Boots and Gordon, 2011). We note that $\Phi_{\mathcal{T}}$ appears in these consistency equations, while $\Phi_{\mathcal{H}}$ does not, emphasizing the different roles these two matrices occupy. This difference will play an important role in the theoretical analysis below.

#### 4.1.2 Extension to the Compressed Case

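In the case where $d_{\mathcal{T}} < |\mathcal{Q}|$ and/or $d_{\mathcal{H}} < |\mathcal{Q}|$, things are not as straightforward. Specifically, equations (44)-(46) no longer hold, as $U^{\top} \Phi_{\mathcal{T}} R$ is no longer invertible (it is, in fact, no longer square): the SVD is taken on a matrix which has rank less than $|\mathcal{Q}|$ when $d_{\mathcal{T}} < |\mathcal{Q}|$ and/or $d_{\mathcal{H}} < |\mathcal{Q}|$, while the column dimension of $R$ is $|\mathcal{Q}|$. The primary focus of our theoretical analysis is the effect of this fact, i.e., of $U^{\top} \Phi_{\mathcal{T}} R$ not being invertible. We show how we can view $\Phi_{\mathcal{T}}$ as inducing a form of compressed linear regression, and we provide bounds on the excess risk of learning within a compressed space.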

There is, however, the additional complication introduced by $\Phi_{\mathcal{H}}$ when it is compressive, as in that setting it is no longer possible to remove $\Phi_{\mathcal{H}}$ from the consistency equations (44)-(46). From the perspective of regression, $\Phi_{\mathcal{H}}$ can be viewed as compressing the number of samples, while $\Phi_{\mathcal{T}}$ can be viewed as compressing the features. In this work, we focus on the effect of compressing tests and provide a detailed analysis of how compressing tests (i.e., features) affects the implicit linear regression performed. Zhou et al. (2007) discuss the effect of compressing samples during regression, a result that follows naturally from the Johnson-Lindenstrauss lemma, and in Section 7 we discuss these results and their relationship to this work. For completeness, Section 6 also provides an empirical analysis of the effects of compressing histories and tests versus compressing tests alone.

### 4.2 Effects of Compression

In what follows, we analyse the effects of compression by viewing $\Phi_{\mathcal{T}}$ as inducing a form of compressed linear regression, where both the input data and targets are compressed.

#### 4.2.1 Preliminaries

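This approach is justified by noting that, as discussed in Section 2.4, in equations (37) and (38) of our learning algorithm we are in fact performing implicit linear regression. That is, for the SVD in (35):

\begin{align}
\hat{V} \hat{S}^{-1} &= (\hat{U}^{\top} \hat{\Sigma}_{\mathcal{T},\mathcal{H}})^{\dagger} \tag{47} \\
&= (\hat{S} \hat{V}^{\top})^{\dagger}. \tag{48}
\end{align}

In other words, $\hat{V} \hat{S}^{-1}$ is the Moore-Penrose pseudoinverse of $\hat{U}^{\top} \hat{\Sigma}_{\mathcal{T},\mathcal{H}}$, and multiplication by $\hat{V} \hat{S}^{-1}$ is thus equivalent to performing least-squares linear regression.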
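The pseudoinverse identity in (47) and its link to least-squares regression can be checked numerically. The sketch below uses a random matrix as a stand-in for the compressed estimate; the sizes and variable names are arbitrary choices made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_T, d_H, rank = 6, 9, 6  # illustrative sizes only

# Random stand-in for the compressed d_T x d_H estimate.
sigma = rng.normal(size=(d_T, d_H))
U, s, Vt = np.linalg.svd(sigma, full_matrices=False)
U, s, Vt = U[:, :rank], s[:rank], Vt[:rank]

# V S^{-1}: dividing by s scales column j of V by 1 / s[j].
v_s_inv = Vt.T / s

# Moore-Penrose pseudoinverse of U^T Sigma, computed directly.
A = U.T @ sigma
pinv_direct = np.linalg.pinv(A)

# Right-multiplying targets B by V S^{-1} solves min_C ||C A - B||_F.
B = rng.normal(size=(rank, d_H))
C_via_svd = B @ v_s_inv
C_via_lstsq = np.linalg.lstsq(A.T, B.T, rcond=None)[0].T

print(np.allclose(v_s_inv, pinv_direct), np.allclose(C_via_svd, C_via_lstsq))
```

Both comparisons agree up to floating-point error, which is why the SVD factors can be reused in place of an explicit pseudoinverse.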

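Following the discussion in the previous section and to avoid unnecessary complication, we assume $\Phi_{\mathcal{H}}$ has orthonormal columns (i.e., is not compressive) while analysing the effects of compressing the tests. In the case where $\Phi_{\mathcal{H}}$ has orthonormal columns, we define $\hat{\Sigma}_{\mathcal{T},ao,\mathcal{H}}$ as the compressed analogue of $\hat{\mathcal{P}}_{\mathcal{T},ao,\mathcal{H}}$, and see that (38) can be rewritten as

\begin{align}
C_{ao} &= (\hat{U}^{\top} \hat{\Sigma}_{\mathcal{T},ao,\mathcal{H}}) (\hat{U}^{\top} \hat{\Sigma}_{\mathcal{T},\mathcal{H}})^{\dagger} \tag{49} \\
&= (\hat{U}^{\top} \Phi_{\mathcal{T}} \hat{\mathcal{P}}_{\mathcal{T},ao,\mathcal{H}} \Phi_{\mathcal{H}}^{\top}) (\hat{U}^{\top} \Phi_{\mathcal{T}} \hat{\mathcal{P}}_{\mathcal{T},\mathcal{H}} \Phi_{\mathcal{H}}^{\top})^{\dagger} \tag{50} \\
&= (\hat{U}^{\top} \Phi_{\mathcal{T}} \hat{\mathcal{P}}_{\mathcal{T},ao,\mathcal{H}})\, \Phi_{\mathcal{H}}^{\top} (\Phi_{\mathcal{H}}^{\top})^{\dagger}\, (\hat{U}^{\top} \Phi_{\mathcal{T}} \hat{\mathcal{P}}_{\mathcal{T},\mathcal{H}})^{\dagger} \tag{51} \\
&= (\hat{U}^{\top} \Phi_{\mathcal{T}} \hat{\mathcal{P}}_{\mathcal{T},ao,\mathcal{H}}) (\hat{U}^{\top} \Phi_{\mathcal{T}} \hat{\mathcal{P}}_{\mathcal{T},\mathcal{H}})^{\dagger}, \tag{52}
\end{align}

where (51)-(52) holds since $\Phi_{\mathcal{H}}$ is assumed to have orthonormal columns. An analogous result holds for $c_\infty$, and thus $\Phi_{\mathcal{H}}$ can indeed be omitted in our analysis (under the assumption that $\Phi_{\mathcal{H}}$ has orthonormal columns).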

Moreover, we ignore the $\hat{U}^{\top}$ term in what follows, which is justified when the truncated SVD dimension is equal to the test compression dimension. This condition is very mild in the sense that the use of SVD during learning is primarily motivated by the need to efficiently compute pseudoinverses, which facilitates the efficient batch and incremental model-learning algorithms. That is, the SVD is not used as a dimensionality reduction technique, as random projections are used in that role. (Footnote 10: As noted in Section 7.1.3, it is sometimes beneficial to use a smaller SVD dimension and/or discard very small singular values in order to improve the numerical stability of computing inverses during learning. However, this issue of numerical stability is orthogonal to the analysis presented in this section.) Thus, under this assumption, we have that

 Ax=b⇒^U⊤Ax=^U⊤b (53)

holds, since $\hat{U}$ is orthonormal in this case. Thus, the appearance of $\hat{U}^{\top}$ in the pseudoinverse is inconsequential in an analysis of the effect of compressing prior to regression.

To simplify the analysis one step further, we assume that our test set is a minimal core set $\mathcal{Q}$. Therefore, random projections are applied on the $\mathcal{P}_{\mathcal{Q},\mathcal{H}}$ and $\mathcal{P}_{\mathcal{Q},ao,\mathcal{H}}$ matrices. The projections from over-complete test sets with rank bigger than $|\mathcal{Q}|$ down to the compressed dimension can be achieved by first projecting to size $|\mathcal{Q}|$ and then projecting from $|\mathcal{Q}|$ to the compressed dimension. By the results of Section 4.1.1, this first projection leads to a consistent model, i.e., a model that is a linear transform of the model learned directly from the $\mathcal{P}_{\mathcal{Q},\mathcal{H}}$ and $\mathcal{P}_{\mathcal{Q},ao,\mathcal{H}}$ matrices, since $U^{\top} \Phi_{\mathcal{T}} R$ is invertible with probability 1 when the projected dimension is equal to $|\mathcal{Q}|$ (Boots et al., 2010). The assumption that we work with the $\mathcal{P}_{\mathcal{Q},\mathcal{H}}$ and $\mathcal{P}_{\mathcal{Q},ao,\mathcal{H}}$ matrices directly (as opposed to invertible transforms of them) simplifies the analysis below in that we can state our sparsity assumptions directly in terms of the minimal core set of tests instead of random linear functions of tests in the minimal core set. This assumption is mild: we could instead work with these random invertible linear transforms and discuss the discrepancy between a “random” TPSR (i.e., a TPSR defined via a random linear transform) and a compressed version of this “random” TPSR. That discussion would be analogous to the one provided below, albeit with more cumbersome derivations. Working with the minimal core set of tests simply allows for a more interpretable and less cluttered analysis.

Now, we define:

 Bao=PQ,ao,H(PQ,H)†,β∞=(PQ,H)†^PH. (54)

Since $\mathcal{Q}$ is a minimal core set of tests, the above is a TPSR representation (Boots et al., 2010; Rosencrantz et al., 2004). Assume we have enough histories in $\mathcal{H}$ such that these matrices are full rank. Defining $\mathcal{P}_{\mathcal{Q},h}$ and $\mathcal{P}_{\mathcal{Q},ao,h}$ to be the vectors containing the joint probabilities of all tests in the minimal core set and a fixed history $h$, we have (by the linearity of PSRs):

 ∀h:PQ,ao,h=BaoPQ,h,Ph=β⊤∞PQ,h. (55)

One can thus think of finding the $B_{ao}$ and $\beta_\infty$ parameters as regression problems, having the estimates of the $\mathcal{P}_{\mathcal{Q},h}$ vectors as noisy input features. We also have noisy observations of the outputs $\mathcal{P}_{\mathcal{Q},ao,h}$ and $\mathcal{P}_h$. Since the sample set suffers from the errors-in-variables problem (i.e., it is noisy in both the input and output values), direct regression in the original space might result in large estimation error. Therefore, we apply random projections, reducing the estimation error (variance) at the cost of a controlled approximation error (bias). We get the added benefit that working in the compressed space also reduces the computational complexity of the algorithm.

Note that there is an inherent difference between our work and the TPSR framework. In TPSR, one seeks to find concise linear transformations of the observation matrices, whereas CPSR seeks to find good approximations in a compressed space (which cannot be linearly transformed to the original model). That said, approximate variants of the TPSR learning algorithm have been analyzed from the perspective of compressed regression (albeit without appealing to the compressed sensing framework we employ) (Kulesza et al., 2014; Boots and Gordon, 2010). For example, Kulesza et al. (2014) analyze low-rank TPSR models where the rank of the learned model is made less than $|\mathcal{Q}|$ by removing the least significant singular vectors of $\mathcal{P}_{\mathcal{T},\mathcal{H}}$. We reiterate, however, that these analyses are distinct from the analysis presented in this work, as we analyze low-rank models where the rank is reduced via random projection-based compression (not by removing least-significant singular vectors). The following sections provide an analysis of the error induced by this compression and how the error propagates through the application of several compressed operators.

#### 4.2.2 Error of One Step Regression

When the size of the projections is smaller than the size of the minimal core set, we have the implicit regression performed on a compressed representation. The update operators are thus the result of compressed ordinary least-squares regression (COLS). There are several bounds on the excess risk of regression in compressed spaces (Maillard and Munos, 2009, 2012; Fard et al., 2012, 2013). In this section, we assume the existence of a generic upper bound for the error of COLS.

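Assume we have a target function $f(x) = x^{\top} w + b(x)$, where $x$ is in a $k$-sparse $D$-dimensional space, and $b(x)$ is the bias of the linear fit. We observe an i.i.d. sample set $\{(x_i, f(x_i) + \eta_i)\}_{i=1}^{n}$, where the $\eta_i$'s are independent zero-mean noise terms for which the maximum variance is bounded by $\sigma_\eta^2$, and the $x_i$'s are sampled from a distribution $\rho$. Let $\hat{f}_d(x)$ be the compressed least-squares solution on this sample with a random projection of size $d$. That is, $\hat{f}_d(x) = x^{\top} \Phi^{\top} \hat{w}_d$ with

$$\hat{w}_d = (\Phi X^{\top} X \Phi^{\top})^{-1} (\Phi X^{\top})\, y \in \mathbb{R}^{d}, \tag{56}$$

where $X$ is a design matrix, $y$ is a vector of training targets, and $\Phi \in \mathbb{R}^{d \times D}$ is a random projection matrix. Define $\|\cdot\|_{\rho(x)}$ to be the weighted norm under the sampling distribution. We assume the existence of a generic upper bound function $\epsilon$, such that with probability no less than $1 - \delta$:

$$\left\| f(x) - \hat{f}_d(x) \right\|_{\rho(x)} \leq \epsilon\left( n, D, d, \|w\|^2, \|x\|^2_{\rho(x)}, \|b(x)\|^2_{\rho(x)}, \sigma_\eta^2, \delta \right). \tag{57}$$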
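The compressed least-squares estimator in (56) can be written directly in NumPy. The following sketch is only illustrative: the synthetic data, the Gaussian projection, and the function names `compressed_ols` and `predict` are assumptions made for the example, and no claim is made about the constants in the bound (57).

```python
import numpy as np


def compressed_ols(X, y, Phi):
    """Solve Eq. (56): w_d = (Phi X^T X Phi^T)^{-1} (Phi X^T) y."""
    Z = X @ Phi.T                      # n x d matrix of compressed features
    return np.linalg.solve(Z.T @ Z, Z.T @ y)


def predict(x, Phi, w_d):
    """Compressed prediction: (Phi x)^T w_d."""
    return (Phi @ x) @ w_d


rng = np.random.default_rng(0)
n, D, d, k = 200, 1000, 20, 10  # illustrative sizes only

# k-sparse inputs: each sample has k non-zero features.
X = np.zeros((n, D))
for i in range(n):
    support = rng.choice(D, size=k, replace=False)
    X[i, support] = rng.uniform(0.5, 1.5, size=k)

w_true = rng.normal(size=D) / np.sqrt(D)
y = X @ w_true + 0.01 * rng.normal(size=n)   # noisy targets

Phi = rng.normal(0.0, 1.0 / np.sqrt(d), size=(d, D))
w_d = compressed_ols(X, y, Phi)
print(w_d.shape, predict(X[0], Phi, w_d))
```

The regression is solved in $d$ dimensions rather than $D$, so the linear system is $d \times d$ regardless of the ambient feature dimension.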

The effectiveness of the compressed regression is largely dependent on how the term behaves compared to the norm of the target values. We refer the reader to the discussions in Maillard and Munos (2009) and Maillard and Munos (2012) on the term. When working with PSRs, the probabilities of the tests are often highly correlated. Using this property, we will show that can be bounded well below its size.

In order to use these bounds, we need to consider the sparsity assumptions in our compressed PSR framework. We formalize the inherent sparsity, discussed in previous sections, as follows: for all $h$, $\mathcal{P}_{\mathcal{Q},h}$ and $\mathcal{P}_{\mathcal{Q},ao,h}$ are $k$-sparse. Given that the empirical estimates of zero elements in these vectors are not noisy, for every sampled history $h$ we have that $\hat{\mathcal{P}}_{\mathcal{Q},h}$ is $k$-sparse (with a similar argument for $\hat{\mathcal{P}}_{\mathcal{Q},ao,h}$).

To simplify the analysis, in this section we define our $C_{ao}$ matrices to be slightly different from the ones used in the described algorithm. By forcing the diagonal entries to be 0, we avoid using the $i$th feature for the $i$th regression. This removes any dependence between the projection and the target weights and simplifies the discussion. Since we are working with random compressed features as input, all of the features have similar correlation with the output, and thus removing one of them changes the error of the regression by a factor of . We can nevertheless change the algorithm to use this modified version of the regression so that the analysis stays sound.

The following theorem bounds the error of a one step update using the compressed operators. We use i.i.d. normal random projections for simplicity. The error bounds for other types of random projections should be similar. (Footnote 11: The core modifications necessary are analogous to those made in Achlioptas (2001) to adapt the Johnson-Lindenstrauss lemma to more general random matrices.) Let $\Phi_{-i}$ be the $\Phi$ matrix with the $i$th row removed. We have the following:

###### Theorem 1

Let be a large collection of sampled histories according to , and let be an i.i.d. normal random projection: . We observe noisy estimate of input and of the output, where elements of and are independent zero-mean random variables with maximum variance and respectively. Let be the decreasing eigenvalues of . Choose such that and define . For , define:

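$$u_i = \Phi_i \hat{\mathcal{P}}_{\mathcal{Q},ao,\mathcal{H}} \left( \Phi_{-i} \hat{\mathcal{P}}_{\mathcal{Q},\mathcal{H}} \right)^{\dagger}.$$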

Define $C_{ao}$ to be a $d \times d$ matrix such that:

$$(C_{ao})_i = \left[ u_{i,1}, u_{i,2}, \ldots, u_{i,i-1}, 0, u_{i,i}, u_{i,i+1}, \ldots, u_{i,d-1} \right].$$

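Then with probability no less than $1 - \delta$ we have:

$$\left\| C_{ao} (\Phi \mathcal{P}_{\mathcal{Q},h}) - \Phi \mathcal{P}_{\mathcal{Q},ao,h} \right\|_{\rho(h)} \leq \sqrt{d}\, \epsilon\left( |\mathcal{H}|, |\mathcal{Q}|, d, w_2, x_2, b_2, \sigma_\eta^2, \delta/4d \right), \tag{58}$$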

where:

 w2 = ∥Bao∥2(m+4√mln(4d/δ)), (59) x2 = ∥PQ,h∥2ρ(h), (60) b2 = ν+4√νln(4d/δ), (61) σ2η = 4kln(4|Q|