An approximation method for discrete Markov decision models with a large state space
Abstract
We propose a new approximation approach for solving a discrete Markov decision model (DMD) with a large state space. The DMD is a structural model for analyzing data obtained from agents who make dynamic decisions. However, solving DMDs with a large number of discrete states is always difficult, and at times impossible, because of the huge computational cost. The number of states in a DMD increases exponentially as state variables are added, a phenomenon known as “the curse of dimensionality.” To overcome this problem, we propose a statistical least square temporal difference method (SLSTD), which solves DMDs with a large state space at a low computational cost. The SLSTD solves the Bellman equation of a DMD with a high-dimensional variable by employing two approximation techniques. In our experiments, the SLSTD is faster and more accurate than existing methods and, in some cases, reduces the computation time by over 99 percent. We also show that the estimator of the parameter of interest obtained by the SLSTD is consistent and asymptotically normal.
JEL Classification : C63, D01.
1 Introduction
A discrete Markov decision model (DMD), also known as a dynamic discrete choice model, is extensively used for analyzing the behavior of agents. The main advantage of the DMD approach is that it allows counterfactual analysis, since the DMD captures the dynamic decision making of the agents. The agents in the DMD observe their own state in each period and then choose an action, taking into account future rewards and the transitions between states. The action is often formalized as a discrete choice, and we can analyze the characteristics of the agents by investigating the realized choices. Rust (1987) introduced the DMD, which has since found many applications in present-day work in econometrics, marketing science, transportation science, and dynamic game theory.
Since the estimator of the parameter of the DMD rarely has an analytical form, implementing the DMD approach requires numerical calculation. However, solving a DMD involves highly complex nonlinear computation, and the computational cost restricts the flexibility of the model. As a result, many empirical studies suffer from this computational hurdle and are forced to reduce the size of their DMD. Accordingly, many methodological studies have suggested computational methods to remove the hurdle. Hotz and Miller (1993), Aguirregabiria and Mira (2002), Su and Judd (2012), and Dube et al. (2012) proposed methods for solving the DMD with a single agent. Aguirregabiria and Mira (2007), Bajari et al. (2007), Pesendorfer and Schmidt-Dengler (2008), and Egesdal et al. (2015) proposed efficient methods for solving DMDs with multiple agents and game structures.
“The curse of dimensionality,” which refers to the exponential rise in the number of grid points in the state space of the DMD, is one such computational difficulty. When we increase the number of discretized state variables in the DMD, the number of grid points in the state space increases exponentially, which leads to an extreme rise in the computational cost. Furthermore, such large state spaces require a large amount of memory to store the numerical values, and in some instances ordinary computers are not up to the task.
Several methods have been suggested to handle the curse of dimensionality in the DMD. Some are general methods that can be applied to various DMDs. Keane and Wolpin (1997) and Imai and Keane (2004) proposed approximating the value function by basis function approximation. Series estimation based on Judd (1996) also handles the curse and is applicable to a wide range of DMDs. Other methods are specialized to DMDs for specific topics. Hendel and Nevo (2006) and Gowrisankaran and Rysman (2012) solve specific DMDs for analyzing consumer choices. Another method is the Monte Carlo method of Rust (1997), which can solve any model with up to a certain number of dimensions.
Despite the rich literature on the curse of dimensionality, few methods achieve both generality and sufficient cost reduction. Some methods, such as those of Hendel and Nevo (2006) and Gowrisankaran and Rysman (2012), are valid only for a specific problem and are not applicable to other topics. On the other hand, general methods such as that of Keane and Wolpin (1997) lack a theoretical analysis that guarantees their results, and their performance is not sufficient in some cases. (We discuss the details in Section 4.)
The purpose of this paper is to suggest a new computational method that is applicable to a wide range of DMDs and has sufficient computational and statistical performance. We suggest a statistical least square temporal difference method (SLSTD) that avoids the curse of dimensionality in the DMD. The SLSTD focuses on the Bellman equation of the DMD. Solving the Bellman equation is the source of the computational burden when the state space is large, so reducing the cost of handling the Bellman equation is critical. The SLSTD simplifies the Bellman equation by applying basis function approximation to its high-dimensional variable. Furthermore, the SLSTD employs a stochastic root-finding technique to solve the simplified Bellman equation. With these two techniques, we can substantially reduce the computational burden caused by the curse of dimensionality. The combination of the two techniques is based on the temporal difference (TD) method of Sutton (1988) and the least square temporal difference (LSTD) method of Bradtke and Barto (1996); we extend both methods so that they apply to DMDs.
Our numerical experiments reveal advantages of the SLSTD over the approximation method of Keane and Wolpin (1997) and the series estimation. First, the SLSTD provides more accurate parameter estimates. Second, the computation time of the SLSTD is nearly independent of the size of the state space, so it remains small even when the state space is large. Third, the SLSTD requires less computational memory. We also provide the asymptotic properties of the parameter estimator obtained by the SLSTD. Under some conditions on the smoothness of the model and the number of basis functions, we obtain consistency and asymptotic normality.
The rest of the paper is organized as follows. Section 2 describes the fundamental structure of the DMD and the existing approximation methods for handling the curse of dimensionality. Section 3 introduces the SLSTD. Section 4 examines the performance of the SLSTD through numerical experiments. Section 5 presents the theoretical properties of the SLSTD. Section 6 concludes. The proofs are collected in the Appendix.
2 Model and existing methods
2.1 Model
The DMD is a statistical model for analyzing a sequence of discrete choices, and the purpose of the analysis is to estimate the parameters of the agents from their choices. In the DMD, the observed choices depend on the state in which the agent stays, and the actions of the agents determine the transitions between states. We derive the likelihood of the actions and estimate the parameters of interest by maximizing the likelihood.
We consider the DMD with discrete time and the discrete state variable. DMDs are formulated as . is a state space with dimensions, and each dimension has states. Then, the state space is represented as and . is an action space, and is a parameter space. is a reward function, and is the space of stochastic factors. is the transition probability between states. represents the state in the next period.
There exist agents, and an agent observes its own state in each period . Then the agent decides its own action from the action space. The agent also privately observes , an independent stochastic factor that is unobservable to the researchers. The state of the agent evolves after the action has been taken. The transition between states is assumed to have the first-order Markov property.
Denote by the parameter vector which explains the characteristic of whole agents, and is the parameter of interest for the researchers. We set that the agent obtains reward with the stochastic factor and the parameter vector . At time and with given the state , the agent maximizes the discounted sum of the reward, named the value function as follows,
(1) 
where is the discount factor and is the initial state. Here, is a function, and also define and
To analyze the decision making of the agent , we consider the choice probability of the action in each period. Let be the probability of choosing in state with given parameter :
(2) 
where is an indicator function and the expectation is taken over the state transitions .
Since has finite elements, we can combine (1) and (2), then we obtain the following equation, named the Bellman equation, as follows,
(3) 
where is an expectation of with given . By solving the Bellman equation, we can obtain the value of .
Suppose that we observe a sequence of state transitions and actions . The likelihood of the observed sequence of is derived from the conditional choice probability (2), and we obtain the following log-likelihood function with observation :
(4) 
By maximum likelihood estimation, we obtain the estimator while satisfies the Bellman equation (3).
Remark
Practically, calculating the value of and requires a tedious numerical integration. To avoid the computation, several assumptions on the functional and distributional form of are often introduced.
When we are allowed to assume that where is a stochastic term that is i.i.d. with respect to action and time, and follows the type-I extreme value distribution, we obtain the following simple forms:
In this case, the Bellman equation (3) can be rewritten as
where represents a conditional expectation of , i.e.,
where is Euler’s constant. This form enables us to calculate and analytically.
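For reference, the textbook closed forms under this logit assumption can be sketched as follows. The notation is an assumption made here, since the symbols are not fixed by the text above: $s$ is a state, $a$ an action, $r(s,a;\theta)$ the deterministic part of the reward, $\beta$ the discount factor, $p(s' \mid s,a)$ the transition probability, $\bar V$ the expected value function, and $\gamma_E$ Euler's constant. These are the standard multinomial-logit expressions and may be arranged differently from the displays of this paper.

```latex
\begin{align*}
v(s,a;\theta) &= r(s,a;\theta) + \beta \sum_{s'} p(s' \mid s,a)\, \bar V(s';\theta), \\
P(a \mid s;\theta) &= \frac{\exp\{v(s,a;\theta)\}}{\sum_{a' \in A} \exp\{v(s,a';\theta)\}}, \\
\bar V(s;\theta) &= \gamma_E + \log \sum_{a' \in A} \exp\{v(s,a';\theta)\}.
\end{align*}
```

The third line follows from the fact that the expected maximum of independent type-I extreme value shocks added to the choice-specific values has the log-sum-exp form, which is what removes the numerical integration.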
2.2 The Curse of Dimensionality and Existing Methods
Solving the Bellman equation (3) is necessary to evaluate the log-likelihood function in (4); however, it is quite difficult when the state space is large. Since the state variables are discretized, the Bellman equation (3) is regarded as an equation in a dimensional vector. Namely, let , and rewrite the Bellman equation as , where is the right-hand side of (3). However, as the number of state variables increases, the number of states increases exponentially in , and solving the equation requires huge computational time and cost. For instance, in the DMD for the career decision, if we allow each agent to possess types of human capital for a maximum of years for each type, then we obtain in the DMD and we have to solve an equation in a dimensional vector. This requires a huge computational time, and ordinary laptops cannot hold such a high-dimensional vector in memory. Similar examples appear in the literature, for example, Hendel and Nevo (2006) for consumer choice and Egesdal et al. (2015) for discrete choice games.
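The exponential growth described above can be illustrated with a short calculation. The following is a minimal sketch; the function names and the assumption of 8 bytes per stored value (a double-precision float) are illustrative and not taken from the paper.

```python
def grid_size(levels_per_dimension, dimensions):
    """Number of grid points when every state variable has the same number of levels."""
    return levels_per_dimension ** dimensions


def memory_gigabytes(n_states, bytes_per_value=8):
    """Memory needed to store one value per state (assumption: 8-byte floats)."""
    return n_states * bytes_per_value / 1e9


if __name__ == "__main__":
    # Hypothetical example: 10 levels per state variable.
    for d in (3, 6, 9):
        n = grid_size(10, d)
        print(d, n, memory_gigabytes(n))
```

Each additional state variable multiplies both the number of states and the storage requirement by the number of levels, which is why the full value vector quickly stops fitting in memory.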
Several methods exist for solving the DMD under the curse of dimensionality; they are of two types: general methods and problem-specific methods.
Keane and Wolpin (1994) suggested a method that can be applied to general types of DMDs. This method picks some states randomly in each period and estimates the coefficients of an interpolation function on the picked states as
where is the interpolation function. Though it is handy, this method has some faults. First, the computation time increases rapidly. The method is mainly suited for simplifying Bellman equation evaluation, and thus is not good at reducing the statespace computation cost. Second, the theoretical framework of the method has not been sufficiently elaborated on. Since performance is guaranteed only by numerical experiments, its theoretical properties, such as consistency and size of biases, are unknown.
We also consider another general method using the sequential series estimation method, which can solve many DMDs. This method, too, picks states from the state space in each time, and approximates as
where is a weight and is a basis function. Since the method approximates the value function in each period, it requires multiple approximations. Judd (1996) provides the idea of series estimation, and this method applies the idea to DMDs. The method is useful and its convergence is theoretically guaranteed, but it has one limitation, which we discuss later. Rust (1997) also suggested a general Monte Carlo method that can calculate a value function unaffected by an increase in the number of dimensions. His method, though independent of the object of analysis, requires strong restrictions on the transition and state space of the model.
Some problem-specific methods, such as those by Hendel and Nevo (2006) and Gowrisankaran and Rysman (2012), are for consumer choices. These methods display high performance in market analysis, but depend on the specific characteristics of the market and are not applicable to other DMDs.
Thus, while problem-specific methods are fast, they cannot solve other general problems, and while some general methods enjoy wide applicability, they are not computationally feasible. Accordingly, there is a need for a general method that achieves computational feasibility.
Other popular methods for estimating the DMD include the conditional choice probability (CCP) method of Hotz and Miller (1993), the nested pseudo likelihood (NPL) method of Aguirregabiria and Mira (2002), and the mathematical programming with equilibrium constraints (MPEC) method of Su and Judd (2012). The SLSTD works under the curse of dimensionality, whereas these methods are designed to solve DMDs with relatively small state spaces. As an example, consider the career decision model of Keane and Wolpin (1997), which has over 1 million states (). It is impossible to apply the MPEC and CCP methods to such a model, because these methods need to construct a numerical matrix, which is computationally infeasible.
3 Proposed Method
We introduce the SLSTD, which solves the Bellman equation (3) approximately at a low computational cost. The SLSTD employs two main techniques: (i) the functional approximation method and (ii) the stochastic approximation method. After solving the Bellman equation by the SLSTD, we provide a formulation of (iii) the parameter estimation. The main idea of the SLSTD is based on the TD method by Sutton (1988) and the LSTD method by Bradtke and Barto (1996) and Nedic and Bertsekas (2003).
We first introduce some notation. . Let be an orthonormal system in with , where is the convex hull of . Given , consider a vector-valued function .
For brevity, we define a Bellman operator such as
then the Bellman equation (3) can be rewritten as
(5) 
Also we let and be a solution of the Bellman equation. Note that the part is proved to be unique by Rust et al. (2002) for each .
3.1 Method Outline
In this section, we provide an outline of the SLSTD. The purpose of the SLSTD is to solve the Bellman equation (5) (the simplified version of (3)). The SLSTD employs two approximation techniques: (i) the basis functional approximation and (ii) the stochastic approximation method.
(i) Basis Functional Approximation of the Value Function :
With given , we approximate the value function as
(6) 
where is a vector of approximation weights . Since the approximation is regarded as a projection of onto the linear space spanned by , there exists a unique optimal weight by the projection theorem. Note that converges to zero as when is sufficiently smooth (See Tsybakov (2009)).
By the functional approximation (6), we can represent (dimensional vector) by (dimensional vector) by using the given orthonormal system. Since we set to be much smaller than , we can avoid the high dimensionality of the Bellman equation.
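The dimension reduction described above can be sketched on a one-dimensional grid. The example below is an illustration under stated assumptions: it uses discrete cosine vectors as the orthonormal system (a choice made here for simplicity, not the basis used in the paper), and the names `cosine_basis`, `project`, and `reconstruct` are hypothetical. Because the basis is orthonormal, the optimal weights are simply inner products with the value vector.

```python
import math


def cosine_basis(n_points, n_basis):
    """Discrete cosine (DCT-II) vectors, which are orthonormal on a grid of n_points."""
    basis = []
    for k in range(n_basis):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n_points)
        basis.append([scale * math.cos(math.pi * k * (i + 0.5) / n_points)
                      for i in range(n_points)])
    return basis


def project(values, basis):
    """Projection weights: one inner product per basis vector."""
    return [sum(v * b for v, b in zip(values, phi)) for phi in basis]


def reconstruct(weights, basis):
    """Recover approximate values on the whole grid from the weights alone."""
    n_points = len(basis[0])
    return [sum(w * phi[i] for w, phi in zip(weights, basis)) for i in range(n_points)]


if __name__ == "__main__":
    values = [math.log(1.0 + i) for i in range(200)]   # a smooth stand-in for a value function
    basis = cosine_basis(200, 10)
    weights = project(values, basis)                   # 10 numbers instead of 200
    approx = reconstruct(weights, basis)
    print(max(abs(v - a) for v, a in zip(values, approx)))
```

Only the weights need to be stored; the value at any state can be recomputed from the weights and the basis functions when it is needed.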
However, this alone is not enough to overcome the curse of dimensionality. Some problems remain: (a) we still have to obtain which approximately solves the Bellman equation, (b) the computational cost reduction is not sufficient, and (c) approximation errors accumulate. Problem (a) is especially critical. The solution of the Bellman equation should satisfy equation (6) for all . Thus, solving the Bellman equation by an ordinary method, such as Newton’s method, requires evaluating equation (6) for all . However, as already discussed, is too large in some cases, so this requires a high computational cost. In the rest of the section, we introduce an additional method to solve problem (a). Problems (b) and (c) are discussed in Section 4.
(ii) Stochastic Approximation for Obtaining :
To estimate the optimal weight , we evaluate the goodness of the approximation with given . In the view of , satisfies for all with given . Then, we define the similar moment condition for as follows. We consider the minimization problem
(7) 
where is a set of generated from the empirical distribution. For the minimization problem, applying Lemma 6 in Tsitsiklis and Van Roy (1997), we obtain the weak form of the first-order condition of (7):
(8) 
To solve (8), we use the stochastic approximation method in Benveniste et al. (2012). The stochastic approximation method is an algorithm for finding a root of an equation given in the form of an expectation. As a sequence of random variables generated from a probability distribution is observed one by one, the stochastic approximation updates the solution of the equation, and the sequence of solutions converges to the root. Here, we cite the theorem on the stochastic approximation method from Benveniste et al. (2012).
Theorem: Stochastic approximation (Benveniste et al. (2012), Theorem 17).
Let be a random variable with transition probability and denote where is a expectation under stationary distribution. Suppose that there exists a unique satisfying . Further, consider a decreasing sequence , where and . Suppose

has an envelope function with a polynomial of and a linear function of

holds for all .
hold. If a sequence of is generated in the following iteration equation
then the sequence converges to with probability as .
To apply the theorem, we set , and . We check the assumptions in the Appendix. The stochastic approximation algorithm enables us to solve problem (8) without summing or integrating over all states in each step.
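The mechanics of stochastic approximation can be seen in a toy root-finding problem. The sketch below is not the iteration used by the SLSTD. It is a Robbins-Monro-type recursion for the simplest case, finding the root of the expectation of `x - theta`, which is the mean of the samples. The function names and the step size `1 / (t + 1)` are assumptions made for illustration.

```python
def stochastic_approximation(samples, increment, theta0, step_size):
    """Update theta once per observed sample; no sum over the full sample space is formed."""
    theta = theta0
    for t, x in enumerate(samples):
        theta = theta + step_size(t) * increment(theta, x)
    return theta


def mean_increment(theta, x):
    """Noisy evaluation of h(theta) = E[X] - theta, whose root is the mean of X."""
    return x - theta


def harmonic_step(t):
    """Decreasing steps: the sum diverges while the sum of squares converges."""
    return 1.0 / (t + 1)


if __name__ == "__main__":
    samples = [1.0, 2.0, 3.0, 4.0]
    print(stochastic_approximation(samples, mean_increment, 0.0, harmonic_step))
```

With this particular step size the recursion reproduces the running sample mean, so the limit is the root of the expectation even though each update uses a single observation.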
By the stochastic approximation method, we define the sequence from the following equation:
(9) 
with step size , which satisfies and . The initial point of is arbitrary. The basic approach underlying the SLSTD is modifying the approximation parameter iteratively as per the temporal difference; this is why we refer to the method as the TD method.
Based on the stochastic approximation method, we define the estimator of as a limit of the sequence (9) as follows:
Now, by the SLSTD approach, we obtain the estimator of the value function as
with given . Note that the estimation of is implemented for each fixed , and the estimator can differ for each .
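To make the temporal difference idea concrete, the following sketch implements the classical TD(0) update of Sutton (1988) with a linear value approximation. It evaluates a fixed stream of observed rewards, so it should be read as an illustration of the update pattern rather than as iteration (9) itself. The names `td0_weights`, `features`, and `step_size`, and the step size `10 / (10 + t)`, are assumptions introduced here.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def td0_weights(transitions, features, n_features, beta, step_size, n_sweeps=1):
    """Classical TD(0) with linear approximation V(s) ~ features(s) . weights.

    transitions: list of (state, reward, next_state) in observed order.
    step_size:   function t -> positive step, decreasing in t.
    """
    weights = [0.0] * n_features
    t = 0
    for _ in range(n_sweeps):
        for state, reward, next_state in transitions:
            phi = features(state)
            # Temporal difference: one-step target minus current prediction.
            td_error = reward + beta * dot(weights, features(next_state)) - dot(weights, phi)
            gamma_t = step_size(t)
            weights = [w + gamma_t * td_error * p for w, p in zip(weights, phi)]
            t += 1
    return weights


if __name__ == "__main__":
    # Hypothetical two-state cycle: state 0 pays 1, state 1 pays 0, discount 0.5.
    cycle = [(0, 1.0, 1), (1, 0.0, 0)]
    one_hot = lambda s: [1.0 if i == s else 0.0 for i in range(2)]
    print(td0_weights(cycle, one_hot, 2, 0.5, lambda t: 10.0 / (10.0 + t), n_sweeps=10000))
```

In the two-state example the exact values are 4/3 and 2/3, and the weights approach them while each update touches only the currently observed state.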
(iii) Parameter Estimation :
Finally, we define the estimator of the parameter of interest by the SLSTD as
where is defined in (4). Note that when we evaluate the value of with different during the optimization with respect to , we have to rerun the SLSTD for each . This looks costly at first glance; however, it is not a problem in practice. Details are shown in Section 4.
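The nested structure of the estimation, in which the Bellman equation is solved again for every candidate parameter, can be sketched on a very small model. Everything specific in the example is an assumption made for illustration: the five-state replacement model, its reward function, the grid search, and all names. For simplicity the inner step uses exact value iteration under the logit assumption of the Remark in Section 2.1, in place of the SLSTD.

```python
import math

EULER_GAMMA = 0.5772156649015329   # Euler's constant
N_STATES = 5                       # hypothetical: machine age 0..4
ACTIONS = (0, 1)                   # hypothetical: 0 = keep, 1 = replace
BETA = 0.9                         # discount factor
REPLACE_COST = 2.0


def reward(s, a, theta):
    return -theta * s if a == 0 else -REPLACE_COST


def next_state(s, a):
    return min(s + 1, N_STATES - 1) if a == 0 else 0


def choice_values(s, value, theta):
    return [reward(s, a, theta) + BETA * value[next_state(s, a)] for a in ACTIONS]


def log_sum_exp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def solve_bellman(theta, tol=1e-10):
    """Inner step: fixed point of the logit Bellman equation by value iteration."""
    value = [0.0] * N_STATES
    while True:
        new = [EULER_GAMMA + log_sum_exp(choice_values(s, value, theta))
               for s in range(N_STATES)]
        if max(abs(x - y) for x, y in zip(new, value)) < tol:
            return new
        value = new


def choice_probabilities(theta):
    value = solve_bellman(theta)
    probs = {}
    for s in range(N_STATES):
        v = choice_values(s, value, theta)
        lse = log_sum_exp(v)
        for a in ACTIONS:
            probs[(s, a)] = math.exp(v[a] - lse)
    return probs


def log_likelihood(theta, counts):
    """counts maps (state, action) to the number of times the pair was observed."""
    probs = choice_probabilities(theta)
    return sum(n * math.log(probs[sa]) for sa, n in counts.items())


def estimate(counts, grid):
    """Outer step: the Bellman equation is re-solved for every candidate theta."""
    return max(grid, key=lambda theta: log_likelihood(theta, counts))
```

A call such as `estimate(counts, [k / 10 for k in range(1, 11)])` re-solves the model ten times, once per grid point, which is the repeated inner computation that the text refers to.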
3.2 Implementation and Discussion
To proceed with iteration (9), we need to prepare the sequence of the states . Since we use the set of states from the empirical distribution in (7), we use the state transitions from the observed data for the iteration. In other words, the SLSTD approach intensively minimizes the error of the Bellman equation (5) on the states that agents visit frequently.
Consider agents, with the th agent having a state transition of length . Now we have a set of observed state transitions . First, we run iteration (9) on the state transition of one agent. When the first agent reaches a terminal state, we carry over the and continue iteration (9) with the state transition of the next agent. We repeat the operation for all agents. Thus, the total number of observations is . This procedure fits the structure of econometric data, which contain many agents, in contrast to the ordinary TD method, which runs the iteration on one long state transition.
This operation has another interpretation. We combine the data of all agents into one agent’s repeated actions to generate the decision sequence. If the model has a terminal state and the repetitive agent reaches it, the agent obtains no reward and goes back to the initial state with probability one. Then, the state transition becomes irreducible over all states, and we can regard it as having a stationary distribution.
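The concatenation of the agents' data into one long sequence can be sketched as follows. The data layout (each agent's path as a list of `(state, reward)` pairs ending at a terminal state) and the name `concatenate_trajectories` are assumptions made for illustration.

```python
def concatenate_trajectories(agent_paths, initial_state):
    """Chain per-agent paths into one list of (state, reward, next_state) transitions.

    agent_paths: list of non-empty lists of (state, reward), in the order visited.
    At the end of each path a zero-reward transition back to initial_state is added,
    so the combined sequence looks like one agent repeating the decision problem.
    """
    transitions = []
    for path in agent_paths:
        for (state, reward), (next_state, _) in zip(path, path[1:]):
            transitions.append((state, reward, next_state))
        terminal_state, _ = path[-1]
        transitions.append((terminal_state, 0.0, initial_state))
    return transitions


if __name__ == "__main__":
    paths = [
        [(0, 1.0), (1, 2.0), (2, 0.0)],   # hypothetical agent 1
        [(0, 0.5), (2, 0.0)],             # hypothetical agent 2
    ]
    for row in concatenate_trajectories(paths, initial_state=0):
        print(row)
```

The combined list has one transition per observed period, so its length equals the total number of observations across agents, and it can be passed directly to an iterative update that processes one transition at a time.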
Practically, we do not have to generate the sequence until . We can stop the iteration when we judge that the sequence has converged, namely, when is no greater than a sufficiently small predetermined tolerance level .
We now provide pseudocode for the SLSTD. Algorithm 1 shows how to solve the Bellman equation with given parameter . The sequence of is predetermined, along with the initial approximation weight .
We now discuss the initial point of and the tuning parameter . When is far from the limit, the convergence of becomes unstable. Thus, the initial point needs tuning when the solution is not stable. The selection of the step size is a more important problem. We use that satisfies and with positive and . When is too small or is too large, the step size becomes small and the solution is strongly affected by the initial point. Thus, selecting a proper step size is also necessary to obtain a stable solution.
We now discuss the choice of basis functions. Power series and B-splines are common basis functions; however, B-splines provide better estimates. Since the SLSTD evaluates the value function approximation locally on the observed state transitions, B-spline functions provide a more precise approximation.
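The local nature of B-splines can be seen from a direct implementation of the Cox-de Boor recursion. This is a minimal sketch for a one-dimensional state; the knot vector, the cubic degree, and the function names are illustrative assumptions, not the settings used in the experiments.

```python
def bspline_basis(i, degree, knots, x):
    """Value at x of the i-th B-spline of the given degree (Cox-de Boor recursion)."""
    if degree == 0:
        return 1.0 if knots[i] <= x < knots[i + 1] else 0.0
    left_den = knots[i + degree] - knots[i]
    right_den = knots[i + degree + 1] - knots[i + 1]
    left = 0.0
    right = 0.0
    if left_den != 0:      # repeated knots give 0/0 terms, which are treated as zero
        left = (x - knots[i]) / left_den * bspline_basis(i, degree - 1, knots, x)
    if right_den != 0:
        right = (knots[i + degree + 1] - x) / right_den * bspline_basis(i + 1, degree - 1, knots, x)
    return left + right


def bspline_features(x, knots, degree=3):
    """Feature vector of all B-spline basis functions evaluated at x."""
    n_basis = len(knots) - degree - 1
    return [bspline_basis(i, degree, knots, x) for i in range(n_basis)]


if __name__ == "__main__":
    # Hypothetical clamped knot vector on [0, 4) for cubic splines.
    knots = [0, 0, 0, 0, 1, 2, 3, 4, 4, 4, 4]
    print(bspline_features(0.5, knots))
```

At any point only `degree + 1` basis functions are nonzero, so an update at an observed state changes the approximation only near that state, whereas a power series term affects the whole state space.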
4 Numerical Experiment
4.1 Parameter Estimation
To compare the estimation accuracy of the methods, we conducted a numerical experiment using some empirical DMDs. We consider the method of Keane and Wolpin (1994) (henceforth, KW), the sequential series estimation, and the SLSTD. We used a DMD for analyzing career decisions, which is a simplified version of Keane and Wolpin (1997). The DMD is a finite-horizon model with an adjustable number of state variables, number of actions, and time horizon.
We now present the DMD used in our experiment; models of this type are often used in labour economics, as in Keane and Wolpin (1997). The model has state variables and its terminal time is . An element of the state space has the form : is age, is years of education, is the length of the work career, and contains the choice in the previous period. The state space is constructed as . We set the action space as . We set the reward function as
The choice is a decision of schooling and the choice increases and by one. The choice is a decision of working, and it increases and by one. The choice is staying home, and it increases only .
We generate data with agents, and estimate the parameters from the generated data. represents the number of elements in the state space. When we , then we have elements. We set the stochastic term to follow a type-I extreme value distribution. Since the state space of this model is not so large, we can also estimate the model without any approximation method (the SLSTD, the KW method, or the series representation) for comparison. For the SLSTD, we use B-spline functions as the basis functions. We use the same basis functions for the sequential series method, and we set grid points in each dimension for approximation. For the KW method, we provide grid points in each period.
First, we compare the parameter estimation. For estimation, we use the simulated numerical data generated from the model with the true parameter. We generate data with agents, and replicate the experiment times. Table 1 and Table 2 show the result with different parameter sets. The tables also contain the computational time for solving the Bellman equation with one parameter set, and the squared error of the Bellman equation which is a difference between the left hand side and the right hand side of (5). Here, denotes the error of the Bellman equation. The values are the means of the estimators of the replications; the figures in parentheses are standard deviations.
The SLSTD yields a smaller estimation bias than the other methods in every case. In some cases, the sequential series estimator performs better; this is because the model is linear and very simple and is thus well suited to series estimation. However, when the model becomes larger or more complex, the sequential series method does not work well. On the other hand, the standard deviations of the SLSTD are larger than those of the other methods. The KW method is unstable in most cases. As the state space becomes bigger, the bias becomes larger.
In terms of the Bellman error, the SLSTD performs best and the series method performs worst. In terms of computational cost, the KW method requires more computation time, while the SLSTD and the sequential series method require less in Table 1 and Table 2. We investigate this point further in the following section.
4.2 Computational Cost
Next, we present experiments on the computational burden. We use the same model as in the estimation part and change the values of and . The following results show the time to solve the Bellman equation for given true parameters. For estimation, we repeat the same process 200 times.
Table 3 shows the results. The unit of time is seconds. We can see that the sequential and the KW methods cannot avoid the exponentially increasing computational burden. The burden is particularly severe for the KW method. In contrast, the computation cost by the SLSTD does not increase exponentially.
We provide some explanations for these results. First, the SLSTD evaluates the Bellman equation on fewer states. The SLSTD refers only to the states that are observed in the data and, owing to the stochastic approximation method, avoids checking other states. In contrast, the other methods need to refer to a larger number of states when evaluating the Bellman equation. Second, the SLSTD uses less computational memory, which lowers its computational cost. The sequential series and KW methods require a large memory to store the numerical values of . When the memory in use is very large, accessing it becomes costly. On the other hand, the SLSTD stores only the approximation weight and the basis functions. When the values of some states are required, the weight and basis functions are sufficient to recover . This contributes to the computational advantage of the SLSTD.
Note also that the sequential series method fails to obtain the value when and . This is because of the increased error from the accumulation of sequential approximations. Because the method relies on backward induction, we have to approximate in each period. As is approximated several times, the approximation error accumulates, and sometimes the accumulated error diverges. Figure 1 shows the error accumulation. We approximate the labor economics DMD of Section 4 using the sequential series method. The horizontal axis is the time period, and the vertical axis shows the value of the approximated . We can observe that the error accumulates and that its size rises exponentially with repeated approximation.
When and , the KW method and the sequential series method cannot yield any results. This is because the computation time is so long that we cannot obtain the estimation results. Since the state space is quite large in this case, the memory of a laptop cannot hold the numerical values of in the usual way; in other words, these two methods are not appropriate. In contrast, the SLSTD stores only the values of the weight and does not need a large memory to store the values of .
4.3 Discussion
In this section, we discuss the advantages of the SLSTD. First, the SLSTD does not suffer from computer memory limitations. As mentioned earlier, the sequential series method and the KW method require a large memory to store numerical values, and it is at times impossible to store all values in the state space. For instance, when , the requirement is 80 GB of memory. In contrast, the SLSTD stores only the value of , which can recover the value of for all .
The second advantage concerns the nonsmoothness of DMDs. Figure 1 presents the accumulation of approximation error by the series method when the DMD has a nonsmooth term. As the series method requires multiple approximations, the horizontal axis of Figure 1 is the number of approximations and the vertical axis is the size of the error of the Bellman equation. It is easy to see that repeated approximation causes the error to accumulate. The sequential series method fails when does not have enough smoothness. In contrast, the SLSTD avoids the problem because it approximates at once. In addition, the SLSTD admits a theoretical analysis of smoothness and estimation, which we provide in the following section.
The third advantage arises when the method is applied to a non-logit type model. In many empirical studies, it is required that the stochastic term has a type-I extreme value distribution and the choice probability is represented in the multinomial logit form. If we use a non-logit type model, the derivation of the choice probability requires costly numerical integration. However, as shown earlier, the SLSTD reduces the number of times the choice probability needs to be derived. Thus, the SLSTD has a relative advantage when applied to a non-logit type model.
5 Theory
In this section, we show the consistency of provided by the SLSTD, and the asymptotic properties of the estimator . Throughout, we define the norm as the Euclidean norm, and consider . represents partial differentiation with respect to .
5.1 Property of
As for evaluating , our theoretical results on the stochastic approximation part mainly depend on Tsitsiklis and Van Roy (1997) and Tagorti and Scherrer (2015). The error in the approximation by the basis functions comes from nonparametric series estimation, in the line of Newey (1997) and Andrews (1991).
We also consider the following asymptotic setting: increases as increases. We write the setting as for some . In empirical research, there is a correlation between the size of the state space and the number of observations. For example, in the DMD we used in Section 4, and are correlated through the terminal time . This setting reasonably describes the properties of the state spaces in actual empirical research.
To apply the theories, we assume the following conditions.
denotes the expectation with a stationary distribution.
To evaluate the shape of , we define a new function on continuous space which satisfies for all .
Assumption 1.
Assume that

is bounded.

is times differentiable with respect to .
The following lemma provides a consistency of the stochastic approximation method.
Lemma 1.
If Assumption 1 holds, then for all , we obtain with high probability,
The proof is in Appendix B. Now, we obtain the convergence rate of , as it becomes an important factor of the following asymptotic estimation analysis. When , and , we obtain .
5.2 Asymptotic properties of the estimation
In this section, we establish the asymptotic normality of the estimator for . Throughout this analysis, we regard as a nuisance parameter, and treat as the parameter of interest. The asymptotic result follows from Kosorok (2000), which analyzes semiparametric M-estimators.
First, we formalize the estimation problem. Since we observe state transitions, we use as an index of the observation, and and is a corresponding index. According to the empirical estimation equation (4), we rewrite the log-likelihood function as
where . Then, and . We define the true parameter as .
To show the asymptotic result, we consider the following assumptions, in line with IL.
Assumption 2.
Assume that

is an interior point in the compact , and is the unique maximizer of .

exists and it is nonsingular.

For all , is Lipschitz continuous with respect to .

With some radius , the class is P-Donsker.

hold as , for all with some radius .
Assumption 2-1 is an identification condition for the true parameter . Assumption 2-2 is a regularity condition for the estimation problem of DMDs. Both are standard in asymptotic statistics (see, for example, Van der Vaart (2000)). One may object that some empirical studies lack identifiability of the true parameter, so such cases require care. Assumption 2-3 requires Lipschitz continuity. Assumption 2-4 is somewhat abstract; however, it covers a wide range of function classes, such as the smooth, monotone, and Lipschitz continuous functions discussed in Van der Vaart (2000). When the reward function and satisfy such properties, Assumption 2-4 holds. Assumption 2-5 requires a kind of smoothness. Although it is a somewhat strong assumption, it is a standard requirement in the semiparametric statistics literature.
The following theorem provides asymptotic normality.
Here, denotes with a vector . The proof is provided in Appendix C.
6 Conclusion
We proposed a new approximation technique, the SLSTD, to solve discrete Markov decision models with a large state space. The curse of dimensionality makes the computational cost enormous and thereby hinders research using DMD models. We showed numerically that the SLSTD can approximate and solve the Bellman equation at a low computational cost. Further, the asymptotic theory guarantees that the estimator obtained by the SLSTD is consistent and asymptotically normal.
Appendix A Figures and Tables
Method  Time (sec)  

True param  1.0  2.0  1.0  9.0  
SLSTD  4  10  3000  0.97  2.49  1.16  5.89  0.44  2.2E+01 
(0.17)  (0.23)  (0.17)  (0.31)  (0.003)  (2.9E+01)  
Sequential  4  10  3000  1.06  2.03  0.99  8.91  0.09  3.0E+04 
(0.04)  (0.11)  (0.02)  (0.18)  (0.001)  (4.1E+04)  
KW  4  10  3000  3.49  2.25  0.21  4.63  0.64  3.5E+02 
(9.66)  (1.51)  (0.29)  (1.74)  (0.007)  (7.8E+02)  
SLSTD  4  15  10125  0.97  2.49  1.16  5.89  0.69  1.6E+02 
(0.17)  (0.23)  (0.17)  (0.31)  (0.004)  (2.5E+02)  
Sequential  4  15  10125  0.19  0.03  0.05  5.71  0.16  4.8E+05 
(0.12)  (0.11)  (0.13)  (7.19)  (0.017)  (7.1E+05)  
KW  4  15  10125  4.46  3.86  0.04  3.46  2.85  1.4E+03 
(2.30)  (2.09)  (0.30)  (2.65)  (0.016)  (2.8E+03)  
SLSTD  4  20  24000  0.58  0.88  0.64  8.13  0.95  1.3E+03 
(0.70)  (0.81)  (0.62)  (6.44)  (0.007)  (2.3E+03)  
Sequential  4  20  24000  0.47  0.36  0.46  3.19  0.40  1.7E+06 
(0.24)  (0.14)  (0.07)  (5.31)  (0.001)  (3.0E+06)  
KW  4  20  24000  4.76  3.59  1.46  5.19  9.62  2.5E+03 
(0.45)  (0.34)  (0.14)  (0.66)  (0.137)  (5.3E+03) 
Method  Time (sec)  

True param  2.0  3.0  2.0  12.0  
SLSTD  4  10  3000  2.24  1.20  2.25  11.18  0.44  2.1E+02 
(0.26)  (1.02)  (0.34)  (5.76)  (0.004)  (2.9E+02)  
Sequential  4  10  3000  1.96  2.99  2.10  11.73  0.09  2.1E+05 
(0.03)  (0.02)  (0.01)  (0.19)  (0.002)  (2.9E+05)  
KW  4  10  3000  2.44  1.52  1.46  0.06  0.64  1.6E+03 
(0.47)  (0.29)  (0.23)  (1.01)  (0.018)  (3.6E+03)  
SLSTD  4  15  10125  1.36  1.97  3.33  12.44  0.70  3.5E+03 
(1.52)  (3.00)  (3.91)  (6.03)  (0.022)  (6.3E+03)  
Sequential  4  15  10125  0.19  0.45  1.32  10.18  0.16  2.7E+06 
(1.10)  (3.10)  (3.40)  (15.33)  (0.001)  (4.7E+06)  
KW  4  15  10125  9.46  6.53  0.60  6.75  2.84  6.3E+03 
(3.07)  (2.01)  (1.07)  (5.13)  (0.027)  (1.2E+04)  
SLSTD  4  20  24000  1.81  2.78  1.85  11.06  0.96  1.3E+04 
(0.65)  (0.74)  (0.50)  (3.19)  (0.042)  (2.8E+04)  
Sequential  4  20  24000  0.55  0.93  0.98  12.32  0.40  9.0E+06 
(0.51)  (1.02)  (1.01)  (0.08)  (0.032)  (1.9E+07)  
KW  4  20  24000  10.39  3.71  4.57  6.17  9.61  1.1E+04 
(0.28)  (0.52)  (1.04)  (6.10)  (0.069)  (2.5E+04) 
Method  Time (mean)  Time (s.d.)  

SLSTD  3  20  1200  0.92  0.07 
Sequential  3  20  1200  0.04  0.01 
KW  3  20  1200  0.40  0.04 
SLSTD  3  30  2700  1.43  0.39 
Sequential  3  30  2700  0.06  0.10 
KW  3  30  2700  0.74  0.31 
SLSTD  3  40  4800  2.88  0.83 
Sequential  3  40  4800  0.16  0.02 
KW  3  40  4800  1.77  0.16 
SLSTD  4  20  24000  1.21  0.26 
Sequential  4  20  24000  0.52  0.13 
KW  4  20  24000  11.90  2.23 
SLSTD  4  30  81000  1.67  0.23 
Sequential  4  30  81000  1.15  0.18 
KW  4  30  81000  66.90  7.61 
SLSTD  4  40  192000  2.99  0.86 
Sequential  4  40  192000  3.76  1.16 
KW  4  40  192000  349.99  85.86 
SLSTD  5  20  480000  1.06  0.03 
Sequential  5  20  480000  7.93  0.02 
KW  5  20  480000  3081.82  178.30 
SLSTD  5  30  2430000  3.31  0.47 
Sequential  5  30  2430000  69.55  6.82 
KW  5  30  2430000  82351.47  12344.05 
SLSTD  5  40  7680000  5.78  0.56 
Sequential  5  40  7680000  null  null 
KW  5  40  7680000  null  null 
Appendix B Proof of Lemma 1
In this proof, we keep fixed and omit the notation. As mentioned before, is the solution of the Bellman equation, and is the approximate value obtained by the SLSTD method.
To begin, the approximation error of can be decomposed as
First, we consider the term . To evaluate this error, we have to show the existence of the optimal approximation weight . Theorem 1 of Rust et al. (2002) shows that the Bellman equation of DMD models has a unique fixed point solution. Then, Lemma 6 in Tsitsiklis and Van Roy (1997) guarantees the existence of an optimal that uniquely satisfies .
Next, we show that the sequence of generated by the stochastic approximation method converges to . To show this, we verify the conditions of Theorem 2 in Tsitsiklis and Van Roy (1997) and Theorem 17 in Benveniste et al. (2012). Because the Bellman operator of DMD models is a contraction mapping, we can apply Lemma 9 of Tsitsiklis and Van Roy (1997) and show that . The existence of a stationary distribution is guaranteed by combining the data. The compactness of ensures that the condition on the initial state is satisfied. Then, we can apply the theory of Tsitsiklis and Van Roy (1997), and show that .
Given this discussion, we can evaluate the approximation error of the SLSTD. Tagorti and Scherrer (2015) show that, with high probability,
Here, is the maximum number of iterations.
As for the second term , it is the error of projecting onto the linear space spanned by the basis functions. When the domain of the function is continuous, this error is equivalent to the error of a least squares series estimation, for which Andrews (1991) and Newey (1997) provide theoretical results. Under Assumption 1, most conditions of Newey (1997) are satisfied. The rank condition of Newey (1997) is the critical one. To discuss it, we denote as the set of states observed in the set of transitions. Since increases at least at order , an i.i.d. data generating process yields . Hence we can treat of Newey (1997) and the number of transitions as the same. Then, we obtain
Thus, we obtain Lemma 1.
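The role of the projection term in this proof can be illustrated numerically. The sketch below projects a smooth function onto nested Legendre polynomial spaces by least squares and records the root-mean-square error on a grid. The target function, the grid, and the choice of a Legendre basis are assumptions made for illustration only, and are not the value function or the basis used in our experiments.

```python
import numpy as np
from numpy.polynomial import legendre

def projection_rmse(f_vals, x, degree):
    """Least squares projection of f onto Legendre polynomials up to `degree`;
    returns the root-mean-square error on the grid x."""
    Phi = legendre.legvander(x, degree)  # basis matrix of shape (len(x), degree + 1)
    coef, *_ = np.linalg.lstsq(Phi, f_vals, rcond=None)
    return np.sqrt(np.mean((Phi @ coef - f_vals) ** 2))

# Hypothetical smooth target on a grid of 201 "states" in [-1, 1].
x = np.linspace(-1.0, 1.0, 201)
f_vals = np.exp(x)

# The projection error shrinks as the number of basis functions grows.
degrees = [1, 2, 4, 8]
errors = [projection_rmse(f_vals, x, d) for d in degrees]
```

Because the polynomial spaces are nested, the least squares error cannot increase with the degree, and for a smooth target it decays quickly. This is the behavior that the smoothness condition in Assumption 1 is meant to secure.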
Appendix C Proof of Theorem 1
Let