Optimal Hierarchical Learning Path Design with Reinforcement Learning
Abstract
Elearning systems are capable of providing more adaptive and efficient learning experiences for students than the traditional classroom setting. A key component of such systems is the learning strategy, the algorithm that designs the learning paths for students based on information such as the students’ current progresses, their skills, learning materials, and etc. In this paper, we address the problem of finding the optimal learning strategy for an Elearning system. To this end, we first develop a model for students’ hierarchical skills in the Elearning system. Based on the hierarchical skill model and the classical cognitive diagnosis model, we further develop a framework to model various proficiency levels of hierarchical skills. The optimal learning strategy on top of the hierarchical structure is found by applying a modelfree reinforcement learning method, which does not require information on students’ learning transition process. The effectiveness of the proposed framework is demonstrated via numerical experiments.
Optimal Hierarchical Learning Path Design with Reinforcement Learning
Xiao Li
Department of Educational Psychology
University of Illinois at UrbanaChampaign
Hanchen Xu
Department of Electrical and Computer Engineering
University of Illinois at UrbanaChampaign
Jinming Zhang
Department of Educational Psychology
University of Illinois at UrbanaChampaign
Huahua Chang
Department of Educational Psychology
University of Illinois at UrbanaChampaign
Optimal Hierarchical Learning Path Design with Reinforcement Learning
Introduction
Designing optimal learning strategies for students has emerged as an interesting and important topic in recent years, along with the trending transformation from traditional classroom teaching to Elearning systems (?, ?). Thanks to online learning technologies, information such as students’ test results and response time can be monitored, which enables Elearning systems to select the most appropriate learning materials to each individual student. For example, students are routed with finest learning materials based on their skills, and the materials’ contents and difficulty levels, instead of following a routine learning path that does not differentiate individual students. This notion is referred to as personalized learning (?, ?), and also known as adaptive learning or smart learning (?, ?).
Several studies have provided innovative approaches to personalized Elearning systems. For example, cognitive diagnosis models (CDMs), known as the foundation of assessing students’ mastery of skills, are extended to model their learning processes (?, ?, ?). The knowledge tracing method (?, ?) functions similarly in modeling learning but focusing on one attribute each time (?, ?). In the aforementioned models, skills are assumed to be unstructured without considering skill hierarchical structure and proficiency levels. However, ignoring skill hierarchy and proficiency levels may contaminate classification results (?, ?). Another direction towards personalized learning is finding optimal learning strategies that recommend learning materials (?, ?). Existing researches typically characterize the learning process as a Markov decision problem, the transition kernel of which is known. However, the transition kernel is hardly known in practice. As a matter of fact, the transition processes of learners’ states are unobservable and may vary across different learning materials.
In this paper, we address those challenges by proposing an integrated Elearning system, which is equipped with the optimal learning strategy obtained via a modelfree method that takes the skill hierarchy into account. The contributions of this paper are the following. First, a hierarchical learning model is developed to explicitly characterize skill hierarchy and proficiency levels, which, albeit important, have not been addressed yet in existing models. We model the proficiency levels of hierarchical skills following the same form of CDMs; therefore, the latent skills and their proficiency levels can be estimated using CDMs, and the state transitions can be characterized by a hidden Markov model (HMM). The proposed hierarchical learning model is easy to implement and can accommodate various types of skill hierarchies (?, ?). In addition, the number of model parameters and states to be estimated is largely reduced with regards to the restricted state space defined in the model. Second, a modelfree reinforcement learning (RL) method is applied to finding the optimal learning strategy. Using RL techniques, the proposed Elearning system is fully datadriven and does not required prior information on the HMM. At each stage of learning, a set of items will be distributed to the learners, whose responses to these items are next collected by the Elearning system. Learners’ hidden states are estimated using psychometric models and updated based on the responses. We compared the modelfree RL method with a heuristic method, and demonstrated via numerical experiments that the modelfree RL method can find a better learning strategy that outperforms the heuristic one quickly.
The rest of the paper is organized as follows. The CDMs and conventional HMMs used for modeling learning paths are introduced in the “Preliminaries" section. The hierarchical learning model and the modelfree RL algorithm for finding the optimal learning strategy is presented in the “Models and Algorithms" section. Results from numerical experiments are presented in the “Experiments" section. Some concluding remarks and potential future directions are discussed in the “Concluding Remarks and Future Directions" section.
Preliminaries
Cognitive Diagnosis Models
CDMs are psychometric models that examine students’ mastery of specific skills at a finegrained level. These models provide a summary information in the form of score profiles, the element of which represents the proficiency level of a skill by examinees. The element takes binary values if only the presence and absence of a skill is modeled. They are ideal frameworks that aid in identifying optimal learning materials to be distributed next since they keep track of learners’ different skills considering their multidimensional features. Skills in CDMs are discrete and assumed to be latent. They are reflected by responses given by examinees to items measuring one or more skills. The skill sets are described as attribute profiles and each skill is referred to as an attribute in the CDMs. Binary values are used to model the mastery or nonmastery of a attribute. The proficiency levels of each attribute can be transformed to an attribute profile taking binary values. Details of the model will be discussed in the Hierarchical Learning Model section. As an example, in the deterministic inputs, noisy “and" gate (DINA) model (?, ?)—a commonlyused CDM which is both tractable and interpretable, attribute profiles as well as model parameters can be easily estimated by expectationmaximization and Markov Chain Monte Carlo (MCMC) algorithms (?, ?).
Most CDMs require the construction of a Qmatrix (?, ?) for implementation. To be specific, suppose the Elearning system considers attributes and contains items. The Qmatrix is a matrix whose element , , , on the row and column taking binary values, indicates whether the item is associated with the attribute. The DINA model translates the association by a strict rule—the associated attributes are required for learners to answer the item correctly. The Qmatrix specifies the cognitive specification for each test item explicitly (?, ?).
An example is provided to illustrate the construction of Qmatrix. Consider the mixed attributes in the system including addition and multiplication. The item “" requires addition attribute to be answered correctly the item, while “" measures both addition and multiplication attributes. Thus the corresponding row of the Qmatrix for the first item is and that for the second is . The Qmatrix provides a method to formulate the conditional independence between item responses and attribute profiles. That is, conditioning on measured attributes, item responses are independent of irrelevant attributes. The Qmatrix is generally specified before a test and further improved based on students’ responses during the test (?, ?, ?).
In the DINA model, the probability of correctly answering an item is defined based on the Qmatrix. Following the same notation as above, assume attributes and items in the Elearning system. Let be the attribute profile for the learner, where and each element of belongs to . A value of 1 indicates a mastered attribute and 0 an unmastered attribute. Let be the response of learner to item , , where indicates a correct answer while indicates an incorrect one. Therefore, the probability of a correct answer conditional on the attribute profile is defined as
(1) 
where denotes probability, indicates whether or not the learner has mastered all attributes required for the item . The value of is if the learner possesses all attributes and is if the learner lacks at least one of the required attributes. Mathematically, it is defined as
denotes the slipping parameter—the probability of a learner possessing all attributes required in item , i.e.,
and denotes the guessing parameter—the probability of correctly answering the item without required attributes, i.e.,
CDMs are classified into noncompensatory and compensatory model (?, ?). The DINA model is a noncompensatory model for the reason that it assumes the learner who lacks any of the required attributes will fail to answer the item. Unlike noncompensatory models, compensatory models allow a high ability attribute to compensate for a low ability attribute on another dimension. Other noncompensatory models include noisy input, deterministic, “and" gate (NIDA) model (?, ?), and the reduced reparameterized unified model (?, ?). Compensatory models include deterministic input noisy “or" gate (DINO) model (?, ?). More general CDMs have been developed to include many noncompensatory and compensatory models (?, ?, ?). Both noncompensatory and noncompensatory models are wellexamined in modeling diagnostic skills.
Learning Paths with the Hidden Markov Model
Learning paths can be modeled by the HMM as the attribute profile is latent (?, ?, ?). The Markov model specifies that a learner’s next state, after provided with a certain learning material, will only depend on his or her current state and the material. Figure 1 illustrates how to model the learning path with a HMM. Define the attribute profile as the state in the Markov model, denoted as for the learner at time step . The state transition is as follows:
(2) 
where denotes the learning material distributed at time , and , which is the set of all learning materials. The transition process from current state to the next is thus formulated as a Markov decision process (MDP).
The learning paths with latent attribute profiles can either be considered as a partially observable MDP (?, ?), or two separate components, one with a psychometric model and one MDP. In both cases, we assume no retrogress exists—once learners master the attribute, they will not lose it, that is,
(3) 
and
(4) 
In this study, the psychometric model and a HMM are used to estimate the attribute profiles. Specifically, given timeinvariant item parameters and a proper psychometric model such as CDMs, the attribute profile of learner at time step can be estimated from item responses. Take the DINA model as an example. Given item responses from learners at time step , denoted as , the attribute profile can be estimated through (1).
Models and Algorithms
Hierarchical Learning Model
Attribute hierarchy method (AHM) were first proposed to deal with situations where cognitive attributes are hierarchically related and thus dependent (?, ?). In particular, the AHM investigates precedence ordering of cognitive competencies required to solve test problems. It has four different structures including linear, convergent, divergent and unstructured. An intuitive example of the hierarchical structure is how students learn addition “" and multiplication “". Addition is considered as a prerequisite for multiplication. Students are able to learn multiplication only after they fully understand addition or at least are equipped with basic knowledge of it.
All structures investigated by AHM can be split into dependent relationships between two attributes. For example, Fig. 2 exhibits the divergent structure among 5 attributes, denoted as , . The hierarchical structure among the five can be split to the four dependent links shown as dotted arrow line in Figure 2. That is, is a prerequisite of and , while is a prerequisite of and . Therefore, in order to model the hierarchical structure, we make three assumptions on the link between two dependent attributes.
Assume attribute is prerequisite to attribute . There are different proficiency levels for each attribute. Denote the lack of attribute as , , and different proficiency levels as . Whether or not possessing a certain proficiency level of each attribute is binary. We make the following assumptions on the attribute hierarchy:

Learners can only possess a high proficiency level after they have mastered lower proficiency level of the same attribute. That is,
(5) 
Certain proficiency level of can only be learned after the same proficiency level of is achieved. That is,
(6) 
The probability of a learner to master the attribute conditional on mastering a high proficiency level of is no smaller than mastering a lower proficiency level of, . That is, for and ,
(7) and for and ,
(8)
Therefore, by expressing the relationship between dependent attributes, hierarchical attribute structure is modeled.
We next model different proficiency levels of attributes to be elements of attribute profiles as in CDMs. The proficiency levels of learners on different attributes can be estimated by psychometric models as a result. An example of a Qmatrix for two hierarchical attributes with two proficiency levels is provided in Table 1. In this example, attribute addition () is presumed to be a prerequisite of attribute multiplication (). Onedigit calculation is assumed to be the low proficiency level while twodigit calculation is assumed to be the high proficiency level for both operations.
Item  

1  0  0  0  
1  1  1  0  
1  1  1  1 
To incorporate the attribute hierarchy, the state space is constructed following the hierarchical learning model assumptions. Originally in CDMs, states shall be included in the HMM with respect to attributes. With hierarchical learning model, the state space is reduced to states shown as rows in Table 2. As a result, the attribute profile of learner at time step , i.e., , could be any row in Table 2.
State  

1  0  0  0  0 
2  1  0  0  0 
3  1  1  0  0 
4  1  0  1  0 
5  1  1  1  0 
6  1  1  1  1 
All attribute hierarchy can be generalized by the hierarchical learning model other than the linear structure given above. More strict assumptions can be added if necessary in practice. For example, a attribute cannot be learned before its prerequisite is fully mastered. The state space of the example in the experiment will be further restricted to the space shown as Table 3 with states only.
State  

1  0  0  0  0 
2  1  0  0  0 
3  1  1  0  0 
4  1  1  1  0 
5  1  1  1  1 
The design of hierarchical learning model makes it possible to incorporate not only attribute hierarchy, but also different proficiency levels of attributes in CDMs. The model follows the common form of CDMs so that the restricted Qmatrix is easy to construct, and parameters in CDMs as well as attributes can be estimated easily (?, ?). In addition, the hierarchical design largely reduces the number of parameters and attributes to be recovered in CDMs.
Reinforcement learning
RL is widely used in solving problems by interacting with the environment, without requiring an explicitly expressed MDP model (?, ?). The RL method can be applied in finding the optimal learning strategy for several reasons. First, in Elearning systems, how learners’ attribute profiles transit after feeding a learning material is unknown. RL methods can be an ideal fit in finding the best solution since it does not require an explicit model to estimate the utility of taking actions in the environment (?, ?). Second, the learning path with attribute hierarchy modeled by a HMM can be wellsolved by the RL method. Third, the RL method searches for the longterm optimal solution which takes future rewards into consideration instead of simply choosing the best option at immediate step (?, ?). These advantages make it an ideal solution for finding the optimal learning strategy in the Elearning system.
The overall framework is illustrated in Fig. 3, where the agent is the Elearning system that determines action (i.e., learning material), sent to the environment (i.e., learners), which will then send state (i.e., attribute profiles), and a reward signal back to the agent.
We next model the learning paths as a MDP. The state space is the set of all attribute profiles . The action space is defined to be the set of all learning materials . The reward is shown in Algorithm 1, designed to be decreased if the episode length is too long. As discussed earlier, the transition kernel satisfies the Markov property.
Since both the state space and the action space are discrete, a classical modelfree RL algorithm—the Qlearning algorithm—can be applied to learn the optimal policy (?, ?). The Qlearning algorithm estimates an action value function—the so called Qfunction—that gives the longterm value of a stateaction pair, denoted by . By taking a discount factor into consideration, the algorithm discounts the future rewards into current time step. The Qlearning algorithm proves to converge with probability 1 if the learning rate is properly chosen and the stateaction space is sufficiently explored (?, ?). In practice, greedy exploration policy is commonly used with a probability of to explore at the beginning and decayed later for exploitation. The detailed algorithm for optimal learning strategy is presented in Algorithm 1.
Experiments
Overview
The experiment considers two attributes with linear hierarchical structure and three proficiency levels for each attribute. Denote the two attributes as and . The three proficiency levels are represented as , , , , , and respectively. or is used when the corresponding attribute is not mastered.
Assume is a prerequisite attribute of , satisfying all assumptions in the section “Hierarchical Learning Model". An intuitive way to understand the hierarchy structure here is to assume to be the addition and to be the multiplication. The three proficiency levels can be translated to beginner, intermediate and advanced level, while indicates the learner has no knowledge of and so does .
Assume six learning materials are available, three of which are beginner, intermediate and advanced level materials for attribute and the other three are for attribute . We thus construct the Markov process shown as a directed graph in Fig. 4. Each circle represents a state. A full arrow shows a transition of attribute while a dotted arrow shows a transition of attribute . Only one attribute can be improved in each learning step. The process satisfies the three assumptions in the hierarchy learning model. Note that the transition from a state to itself is neglected in the directed graph and can be easily calculated by Markov properties. The transition matrix, which is unknown to the environment and only applied to predict learners’ next state, is constructed accordingly.
The state space is shown in Table 4. If the learner acquires the attribute profile of , no more learning material will be provided and the learning process ends. After a learning material is selected by the Elearning system and fed to the learner, a set of test items will be given, to test the learner’s current attribute profile. Therefore, the new state can be estimated and updated.
State  

1  0  0  0  0  0  0 
2  1  0  0  0  0  0 
3  1  1  0  0  0  0 
4  1  1  1  0  0  0 
5  1  0  0  1  0  0 
6  1  1  0  1  0  0 
7  1  1  1  1  0  0 
8  1  1  0  1  1  0 
9  1  1  1  1  1  0 
10  1  1  1  1  1  1 
Figure 4 reveals the difference between the strategy that only considers immediate reward and the strategy given by RL method that takes future rewards into consideration. For instance, suppose a learner reaches the beginner level of the first attribute and has no knowledge of the second attribute , i.e., in state . The beginner level material for attribute gives the shortest expected learning time at this step defined as which is . However, although the intermediate level material for attribute brings relatively longer learning time, leading to less rewards at current step, the overall expected learning time of path through to , which is , is less than that through to , which is . As a result, although to learn beginner level attribute first is quicker at current step, it is not the most optimal learning strategy overall.
In order to simulate the psychometric model estimation step, an estimation error of was added to the state, indicating there is a probability that the estimated state is incorrect. In CDM researches, the average pattern correct classification rate (PCCR) is usually larger than . Therefore, an estimation error of is large enough to show the reliability of the optimal learning strategy. In addition, simulation results for cases with an estimation error ranging from to are included to show that the Qlearning algorithm is reliable and stable to find the optimal learning strategy even with the presence of estimation error. In practice, the states are estimated and updated from responses of test items and item parameters by psychometric models.
The rest of parameters are as follows: initial learning rate , discount factor , and initial exploration probability . A decay rate of is applied for and a decay rate of is used for . Therefore, after episodes, the learning rate decays to a value of and the exploration probability decays to .
The Qlearning algorithm is trained in episodes. After that, the trained model is applied in another episodes and compared with a heuristic strategy, which selects the next learning material that can improve the learner’s proficiency level in accordance with hierarchical learning model assumptions. For instance, if the learner’s attribute profile is estimated to be , the learning material will be selected from beginner level material for attribute and intermediate level material for attribute . The two methods are compared under both with and without estimation error.
Two numerical experiments are conducted in the study. In the first experiment, the initial states for all learners are , which means none of the learners have any knowledge of the two attributes. In the second experiment, learners start with different proficiency levels except for . The second experiment shows that as long as the learner has not fully mastered attributes in the Elearning system, no matter where they begin with, the system can find the optimal learning strategy for each of them.
Results
Learning Strategy Comparison
Figures 5 and 6 present the rewards under the RL method across episodes, including both the immediate reward and the smoothed reward with a smoothing window of . Figure 5 shows that the reward becomes stable after episodes under RL method without estimation error, which means the method finds the optimal strategy after training on students. The result indicates that the RL method finds the optimal learning strategy quickly. After a estimation error is added to the system, the Figure 6 presents that the RL method still finds the optimal learning strategy after around episodes.
Figures 7 and 8 give a comparison between the RL method and heuristic method across episodes where the RL method has been trained in episodes and applied to new students. No estimation error is added in Fig. 7 while a estimation error is added to both methods in Fig. 8. Both figures show that the reward under the RL method is higher than the heuristic method. The smoothed reward of the RL method is significantly higher than that of the heuristic method in both with or without estimation error.
Table 5 shows the overall mean and standard deviation of rewards and episode lengths in two methods. The RL method has much higher mean and lower standard deviation of rewards than the heuristic method, together with shorter episode lengths and smaller episode length standard deviation as well. It is worth noting that although the average episode length with estimation error is slightly higher than that without estimation error, the difference is minimal.
Methods  RL  Heuristic  

No Estimation Error 
Reward mean  6.43  3.99 
Reward SD  3.61  5.34  
EL mean  7.34  8.57  
EL SD  1.90  2.62  
Estimation Error 
Reward mean  6.41  3.98 
Reward SD  3.60  5.37  
EL mean  7.73  9.01  
EL SD  2.07  2.74 
Figure 9 gives a comparison between the RL method and heuristic method across episode under different estimation errors and no estimation error using the box plot where the RL method has been trained in episodes. The figure shows that the average award under the RL method is much higher than that under the heuristic method across estimation errors. In addition, the RL method also produces smaller standard deviation of rewards than the heuristic method. Although the standard deviation of the RL method tends to increase when the estimation error increases, it is still smaller than that of the heuristic method.
The simulation results shown above indicate that the RL method finds the better learning strategy than heuristic method. More importantly, the estimation error has negligible impact on the performance of RL method in searching for optimal strategy.
Impacts of Various Initial States
Figure 10 presents the smoothed rewards of nine different initial states other than , with a smoothing window of . A estimation error is added to the system to simulate realistic cases. The result demonstrates that the RL method can quickly find the optimal learning strategy for all learners with different initial attributes. The algorithm converges after episodes indicating that the optimal strategy can be found after it is trained on only learners. Therefore, once a learner’s initial attribute is estimated by a set of items, the learner can follow the optimal learning strategy to acquire new attributes with the fastest route provided by the system.
Concluding Remarks and Future Directions
In this paper, we proposed a hierarchical learning model that incorporates attribute hierarchy and proficiency levels of attributes together in the Elearning system. The model follows the same form of discrete attributes and Qmatrix required by CDMs so that parameters and hidden states can be easily recovered and estimated. In addition, the transition process for student learning is formulated as a MDP. Then, a modelfree RL method is applied to finding the optimal learning strategy on top of the hierarchical framework.
Experiment results suggest that the optimal design with the RL method outperforms the heuristic strategy substantially with and without the estimation error. The mean and the standard deviation of the learning episode length achieved by the RL method is significantly smaller compared to those obtained in the heuristic method. In addition, the RL method can find the optimal learning strategy quickly for all learners with different initial attribute proficiency levels. As a result, learners with various proficiency levels will be fed with the most appropriate material at each step. To implement the system in the real world, a set of items will be given to learners after they finish each learning stage. Their attributes will be estimated and the state can be updated based on their responses to the given items.
Several directions are possible for future researches. First, other dimensionality methods can be applied to classify learners at the first stage (?, ?, ?), in addition to using estimation method to get learners’ initial states. It is important to have an accurate estimation learners’ initial states so that the most appropriate optimal learning strategy can be distributed to each individual. Second, different algorithms can be proposed to selects the personalized learning materials that can maximize learners’ immediate or future rewards (?, ?). Lastly, learners’ attributes are restricted to a state space satisfying hierarchical learning model assumptions. CDMs with restricted state space as well as Qmatrix can be further explored (?, ?). The identifiability conditions for the restricted latent structure model shall also be rigorously studied (?, ?).
References
 Chen, Culpepper, Chen, DouglasChen, Culpepper, Chen, Douglas Chen, Y., Culpepper, S. A., Chen, Y., Douglas, J. (2018). Bayesian estimation of the dina q matrix. Psychometrika, 83(1), 89–108.
 Chen, Culpepper, Wang, DouglasChen, Culpepper, Wang, Douglas Chen, Y., Culpepper, S. A., Wang, S., Douglas, J. (2018). A hidden markov model for learning trajectories in cognitive diagnosis with application to spatial rotation skills. Applied psychological measurement, 42(1), 5–23.
 Chen, Li, Liu, YingChen, Li, et al. Chen, Y., Li, X., Liu, J., Ying, Z. (2018). Recommendation system for adaptive learning. Applied psychological measurement, 42(1), 24–41.
 Corbett AndersonCorbett Anderson Corbett, A. T., Anderson, J. R. (1994). Knowledge tracing: Modeling the acquisition of procedural knowledge. User modeling and useradapted interaction, 4(4), 253–278.
 De La TorreDe La Torre De La Torre, J. (2009). Dina model and parameter estimation: A didactic. Journal of educational and behavioral statistics, 34(1), 115–130.
 De La TorreDe La Torre De La Torre, J. (2011). The generalized dina model framework. Psychometrika, 76(2), 179–199.
 DiBello, Roussos, StoutDiBello et al. DiBello, L., Roussos, L., Stout, W. (2007). Review of cognitively diagnostic assessment and a summary of psychometric models. cr rao, & s. sinharay (eds.), handbook of statistics, vol. 26: Psychometrics (pp. 970–1030). Amsterdam: NorthHolland Publications.
 EmbretsonEmbretson Embretson, S. (1984). A general latent trait model for response processes. Psychometrika, 49(2), 175–186.
 Henson, Templin, WillseHenson et al. Henson, R. A., Templin, J. L., Willse, J. T. (2009). Defining a family of cognitive diagnosis models using loglinear models with latent variables. Psychometrika, 74(2), 191.
 Junker SijtsmaJunker Sijtsma Junker, B. W., Sijtsma, K. (2001). Cognitive assessment models with few assumptions, and connections with nonparametric item response theory. Applied Psychological Measurement, 25(3), 258–272.
 Kaelbling, Littman, CassandraKaelbling et al. Kaelbling, L. P., Littman, M. L., Cassandra, A. R. (1998). Planning and acting in partially observable stochastic domains. Artificial intelligence, 101(12), 99–134.
 Kaelbling, Littman, MooreKaelbling et al. Kaelbling, L. P., Littman, M. L., Moore, A. W. (1996). Reinforcement learning: A survey. Journal of artificial intelligence research, 4, 237–285.
 Leighton, Gierl, HunkaLeighton et al. Leighton, J. P., Gierl, M. J., Hunka, S. M. (2004). The attribute hierarchy method for cognitive assessment: A variation on tatsuoka’s rulespace approach. Journal of educational measurement, 41(3), 205–237.
 LittmanLittman Littman, M. L. (1994). Markov games as a framework for multiagent reinforcement learning. In Machine learning proceedings 1994 (pp. 157–163). Elsevier.
 Liu, Xu, YingLiu et al. Liu, J., Xu, G., Ying, Z. (2012). Datadriven learning of qmatrix. Applied psychological measurement, 36(7), 548–564.
 Manickam, Lan, BaraniukManickam et al. Manickam, I., Lan, A. S., Baraniuk, R. G. (2017). Contextual multiarmed bandit algorithms for personalized learning action selection. In Acoustics, speech and signal processing (icassp), 2017 ieee international conference on (pp. 6344–6348).
 MarisMaris Maris, E. (1999). Estimating multiple classification latent class models. Psychometrika, 64(2), 187–212.
 Means, Toyama, Murphy, Bakia, JonesMeans et al. Means, B., Toyama, Y., Murphy, R., Bakia, M., Jones, K. (2009). Evaluation of evidencebased practices in online learning: A metaanalysis and review of online learning studies.
 NorrisNorris Norris, J. R. (1998). Markov chains (No. 2). Cambridge university press.
 Roussos, Templin, HensonRoussos et al. Roussos, L. A., Templin, J. L., Henson, R. A. (2007). Skills diagnosis using irtbased latent class models. Journal of Educational Measurement, 44(4), 293–311.
 StuderStuder Studer, C. (2012). Incorporating learning over time into the cognitive assessment framework. Unpublished PhD, Carnegie Mellon University, Pittsburgh, PA.
 Sutton BartoSutton Barto Sutton, R. S., Barto, A. G. (2011). Reinforcement learning: An introduction.
 Templin HensonTemplin Henson Templin, J. L., Henson, R. A. (2006). Measurement of psychological disorders using cognitive diagnosis models. Psychological methods, 11(3), 287.
 Tu, Wang, Cai, Douglas, ChangTu et al. Tu, D., Wang, S., Cai, Y., Douglas, J., Chang, H.H. (2018). Cognitive diagnostic models with attribute hierarchies: Model estimation with a restricted qmatrix design. Applied Psychological Measurement, 0146621618765721.
 TwymanTwyman Twyman, J. S. (2014). Competencybased education: Supporting personalized learning. connect: Making learning personal. Center on Innovations in Learning, Temple University.
 Wang, Yang, Culpepper, DouglasWang et al. Wang, S., Yang, Y., Culpepper, S. A., Douglas, J. A. (2018). Tracking skill acquisition with cognitive diagnosis models: A higherorder, hidden markov model with covariates. Journal of Educational and Behavioral Statistics, 43(1), 57–87.
 Watkins DayanWatkins Dayan Watkins, C. J., Dayan, P. (1992). Qlearning. Machine learning, 8(34), 279–292.
 Xu et al.Xu et al. Xu, G., et al. (2017). Identifiability of restricted latent class models with binary responses. The Annals of Statistics, 45(2), 675–707.
 J. ZhangJ. Zhang Zhang, J. (2013). A procedure for dimensionality analyses of response data from various test designs. Psychometrika, 78(1), 37–58.
 J. Zhang StoutJ. Zhang Stout Zhang, J., Stout, W. (1999). The theoretical detect index of dimensionality and its application to approximate simple structure. Psychometrika, 64(2), 213–249.
 S. Zhang ChangS. Zhang Chang Zhang, S., Chang, H.H. (2016). From smart testing to smart learning: how testing technology can assist the new generation of education. International Journal of Smart Technology and Learning, 1(1), 67–92.