Deep Knowledge Tracing and Dynamic Student Classification for Knowledge Tracing

Sein Minn, Yi Yu, Michel C. Desmarais, Feida Zhu, Jill-Jênn Vie
Department of Computer Engineering, Polytechnique Montreal, Canada
Digital Content and Media Sciences Research Division, National Institute of Informatics, Japan
School of Information Systems, Singapore Management University, Singapore
RIKEN Center of Advanced Intelligence Project, Japan

In Intelligent Tutoring Systems (ITS), tracing the student’s knowledge state during learning has been studied for several decades in order to provide more supportive learning instructions. In this paper, we propose a novel model for knowledge tracing that i) captures students’ learning ability and dynamically assigns students into distinct groups with similar ability at regular time intervals, and ii) combines this information with a Recurrent Neural Network architecture known as Deep Knowledge Tracing. Experimental results confirm that the proposed model is significantly better at predicting student performance than well-known state-of-the-art techniques for student modelling.

Student model, Deep knowledge tracing, K-means clustering, RNNs, LSTMs

I Introduction

ITS is an active field of research that aims to provide personalized instruction to students. Early work dates back to the late 1970s. A wide array of Artificial Intelligence and Knowledge Representation techniques have been explored, among them rule-based and Bayesian representations of student knowledge and misconceptions, skills modeling with logistic regression in Item Response Theory, case-based reasoning, and, more recently, reinforcement learning and deep learning [brown1978diagnostic, polson2013foundations]. One can even argue that most of the main techniques in Artificial Intelligence and Data Mining have found their way into the field of ITS, in particular for the problem of knowledge tracing, which aims to model the student’s state of mastery of conceptual or procedural knowledge from observed performance on tasks [corbett1994knowledge].

In this paper we propose a novel model for knowledge tracing, Deep Knowledge Tracing with Dynamic Student Classification (DKT-DSC). At each time interval, the model first assigns a student to a distinct group of students that share similar learning ability. This information is then fed to a Recurrent Neural Network (RNN), known as the DKT architecture [piech2015deep], for predicting the student’s performance from data. The student classification can be regarded as a long-term memory of the student’s ability. Providing it as input to the RNN improves knowledge tracing with DKT, which is among the state-of-the-art approaches to knowledge tracing.

The rest of this paper is organized as follows. Section II reviews related work on student modelling techniques. Section III presents the proposed DKT-DSC model. Section IV describes the datasets used in our experiments. Experimental results are shown in Section V and finally Section VI concludes this paper and discusses future avenues of research.

II Related Work

We review here four of the best known state-of-the-art student modelling methods for estimating student’s performance, either for their predominance in psychometrics (IRT) or Educational Data Mining (BKT), or because they are best performers (PFA, DKT). See [desmarais2012review] for a general review.

II-A Item Response Theory (IRT)

IRT assumes the student knowledge state is static and represented by her proficiency when completing an assessment during an exam [wilson2016back, van2013handbook, gonzalez2014general, ekanadham2017t]. IRT models a single skill and assumes the test items are unidimensional. It assigns student $i$ a static proficiency $\theta_i$. Each item $j$ has its own difficulty $\beta_j$. The main idea of IRT is to estimate the probability that student $i$ answers item $j$ correctly from the student’s ability and the item’s difficulty. The widely used one-parameter version of IRT, known as the Rasch model, is

$$p_j(\theta_i) = \frac{1}{1 + e^{-(\theta_i - \beta_j)}} \tag{1}$$

Recently, Wilson [wilson2016back] proposed an IRT model that outperforms state-of-the-art knowledge tracing models, in which maximum a posteriori (MAP) estimates of $\theta_i$ and $\beta_j$ are computed using the Newton-Raphson method.
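As a minimal sketch of the Rasch model, the function below evaluates the logistic of the difference between a student's proficiency and an item's difficulty. The function name and the numeric values are illustrative assumptions, and the sketch does not cover parameter estimation (MAP with Newton-Raphson).

```python
import math


def rasch_probability(theta: float, beta: float) -> float:
    """Probability of a correct answer under the one-parameter (Rasch) model.

    theta: student proficiency, beta: item difficulty (hypothetical values).
    """
    return 1.0 / (1.0 + math.exp(-(theta - beta)))


# Illustrative values only: ability equal to difficulty, then an easier item.
p_equal = rasch_probability(0.0, 0.0)
p_easy = rasch_probability(1.0, -1.0)
```

When proficiency equals difficulty the probability is exactly one half, and it grows as the item becomes easier relative to the student.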

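II-B Bayesian Knowledge Tracing (BKT)

BKT was introduced for knowledge tracing within a learning environment for which the assumption of static knowledge states is dropped [corbett1994knowledge, d2008more]. It also assumes a single skill is tested per item, but this assumption is relaxed in later work on BKT. Standard BKT continually updates its estimate of a student’s knowledge of a skill using four probabilities: $P(L_0)$, the initial probability of mastery; $P(T)$, the probability of transitioning from non-mastery to mastery; $P(G)$, the probability of guessing; and $P(S)$, the probability of slipping. The estimate is updated each time the student gives a response:

$$P(L_t \mid \text{correct}) = \frac{P(L_t)\,(1 - P(S))}{P(L_t)\,(1 - P(S)) + (1 - P(L_t))\,P(G)} \tag{2}$$

$$P(L_t \mid \text{incorrect}) = \frac{P(L_t)\,P(S)}{P(L_t)\,P(S) + (1 - P(L_t))\,(1 - P(G))} \tag{3}$$

$$P(L_{t+1}) = P(L_t \mid \text{obs}) + (1 - P(L_t \mid \text{obs}))\,P(T) \tag{4}$$

$$P(C_{t+1}) = P(L_{t+1})\,(1 - P(S)) + (1 - P(L_{t+1}))\,P(G) \tag{5}$$

There have been various extensions of BKT in the last decades [Baker2008More, pardos2011kt].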
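The standard BKT update can be sketched in a few lines: condition the mastery estimate on the observed response, then apply the learning transition. The function names and parameter values below are assumptions chosen for illustration, not values fitted in this paper.

```python
def bkt_update(p_mastery, correct, p_transit, p_guess, p_slip):
    """One standard BKT step: Bayesian posterior on the response, then learning."""
    if correct:
        evidence = p_mastery * (1.0 - p_slip)
        posterior = evidence / (evidence + (1.0 - p_mastery) * p_guess)
    else:
        evidence = p_mastery * p_slip
        posterior = evidence / (evidence + (1.0 - p_mastery) * (1.0 - p_guess))
    # Chance of moving from non-mastery to mastery after this practice opportunity.
    return posterior + (1.0 - posterior) * p_transit


def bkt_predict(p_mastery, p_guess, p_slip):
    """Probability that the next response on this skill is correct."""
    return p_mastery * (1.0 - p_slip) + (1.0 - p_mastery) * p_guess


# Hypothetical parameters for a single skill.
after_correct = bkt_update(0.5, True, p_transit=0.1, p_guess=0.2, p_slip=0.1)
after_incorrect = bkt_update(0.5, False, p_transit=0.1, p_guess=0.2, p_slip=0.1)
```

A correct response raises the mastery estimate and an incorrect one lowers it, while the transition term keeps the estimate from collapsing to zero.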

II-C Performance Factor Analysis (PFA)

PFA, which was proposed as an alternative to BKT, also relaxes the static knowledge assumption and, in its basic structure, models multiple skills simultaneously [pavlik2009performance]. It defines the probability of success on item $j$ by student $i$ as:

$$p_{ij} = \sigma\Big(\sum_{k \in KC(j)} \big(\beta_k + \gamma_k s_{ik} + \rho_k f_{ik}\big)\Big) \tag{6}$$

where $\sigma$ is the logistic function, $KC(j)$ is the set of skills involved in item $j$, $\beta_k$ is the bias for skill $k$, and $\gamma_k$ and $\rho_k$ represent the learning gain per successful and failed attempt on skill $k$, respectively. $s_{ik}$ is the number of successful attempts and $f_{ik}$ is the number of failed attempts made by student $i$ on skill $k$ [pavlik2009performance].
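A minimal sketch of the PFA prediction follows: the logit is a sum, over the skills of an item, of a skill bias plus weighted counts of prior successes and failures. All names and parameter values are hypothetical and serve only to illustrate the formula.

```python
import math


def pfa_probability(item_skills, bias, gain_success, gain_failure, successes, failures):
    """Probability of success on an item under PFA.

    item_skills: skills involved in the item; the other arguments are dicts
    keyed by skill. Skills with no recorded attempts count as zero.
    """
    logit = sum(
        bias[k]
        + gain_success[k] * successes.get(k, 0)
        + gain_failure[k] * failures.get(k, 0)
        for k in item_skills
    )
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical parameters for two skills.
bias = {"fractions": 0.0, "decimals": 0.0}
gain_success = {"fractions": 0.3, "decimals": 0.2}
gain_failure = {"fractions": -0.1, "decimals": -0.1}

p_new = pfa_probability(["fractions"], bias, gain_success, gain_failure, {}, {})
p_practiced = pfa_probability(
    ["fractions"], bias, gain_success, gain_failure, {"fractions": 4}, {"fractions": 1}
)
```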

II-D Deep Knowledge Tracing (DKT)

DKT was introduced in [piech2015deep]. It uses a Long Short-Term Memory (LSTM) [hochreiter1997long] to represent the latent knowledge space of students dynamically. The increase in a student’s knowledge through an assignment can be inferred from the history of the student’s previous performance. DKT uses a large number of artificial neurons to represent the latent knowledge state along with a temporal dynamic structure, and allows the model to learn the latent knowledge state from data. It is defined by the following equations:

$$h_t = \tanh(W_{hx} x_t + W_{hh} h_{t-1} + b_h) \tag{7}$$

$$y_t = \sigma(W_{yh} h_t + b_y) \tag{8}$$

In DKT, both $\tanh$ and the sigmoid function $\sigma$ are applied element-wise, and the model is parameterized by an input weight matrix $W_{hx}$, a recurrent weight matrix $W_{hh}$, an initial state $h_0$, and a readout weight matrix $W_{yh}$. Biases for the latent and readout units are represented by $b_h$ and $b_y$.
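The recurrence can be illustrated with a single forward step of a plain tanh RNN written with the standard library only. This is a sketch of the two equations above, not of an LSTM or of the trained model; the helper names and the tiny dimensions are assumptions.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def rnn_step(x_t, h_prev, W_hx, W_hh, b_h, W_yh, b_y):
    """One step: hidden state via tanh, output via element-wise sigmoid."""
    pre = [a + b + c for a, b, c in zip(matvec(W_hx, x_t), matvec(W_hh, h_prev), b_h)]
    h_t = [math.tanh(v) for v in pre]
    y_t = [1.0 / (1.0 + math.exp(-(a + b))) for a, b in zip(matvec(W_yh, h_t), b_y)]
    return h_t, y_t


# Hypothetical sizes: 4 input units, 2 hidden units, 3 outputs.
W_hx = [[0.1, -0.2, 0.0, 0.3], [0.0, 0.1, 0.2, -0.1]]
W_hh = [[0.5, 0.0], [0.0, 0.5]]
W_yh = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
h_t, y_t = rnn_step([0, 1, 0, 0], [0.0, 0.0], W_hx, W_hh, [0.0, 0.0], W_yh, [0.0, 0.0, 0.0])
```

Each entry of the output lies strictly between 0 and 1 and can be read as a predicted probability of a correct answer.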

III Deep Knowledge Tracing with Dynamic Student Classification

Human learning is a process that involves practice: we become proficient through practice. However, learning is also affected by the individual’s ability to learn, that is, to become proficient with more or less practice. We refer to the ability to become proficient with little practice as the learning ability. Based on that notion, we propose Deep Knowledge Tracing with Dynamic Student Classification (DKT-DSC), a model that assesses a student’s learning ability and assigns her to a distinct group of students with similar ability, and then invokes an RNN to trace her knowledge within each group at different time intervals. It can thus trace the performance of students based on their learning ability, which is reassessed regularly over time.

III-A Dynamic assessment of student’s learning ability and grouping

Dividing students into distinct groups with similar learning ability, according to their previous performance on various contents in a learning system, has been explored in several works in the field of education [merceron2005clustering, trivedi2011clustering] in order to provide more adaptive instruction to each group. A student’s learning ability is assessed dynamically at each time interval by clustering on her performance history up to the start of the next time interval.

III-A1 Time interval

A time interval is a segment containing a number of a student’s attempts to answer questions in the system. In this perspective, a tick of time is a single first attempt at a question or exercise.

III-A2 Segmenting students’ attempt sequence

Segmenting each student’s response sequence into multiple time intervals serves two purposes: 1) to reduce the computational burden and memory allocation needed for learning over a long sequence, and 2) to re-assess a student’s learning ability after each time interval and dynamically assign her to the group she belongs to for the next time interval.

Fig. 1: Segmentation of a student’s attempt sequence.

Fig. 1 illustrates an example of dividing a 24-attempt response sequence of a student into 5 segments (time intervals), where a segment represents a time interval in which that student answered 6 problems in the system. Once the student has stopped interacting with the system, the remaining positions in the last time interval are filled with -1. The number of attempts made by each student varies with the number of questions they answered during their interaction with the system.
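The segmentation step can be sketched as follows: the response sequence is cut into fixed-length intervals and the last interval is padded with -1 where the student made no further attempts. The function name and the example sequence are assumptions for illustration and do not reproduce the data behind Fig. 1.

```python
def segment_attempts(responses, interval_length, pad_value=-1):
    """Split a response sequence into fixed-length intervals, padding the last one."""
    segments = []
    for start in range(0, len(responses), interval_length):
        chunk = list(responses[start:start + interval_length])
        # Positions after the student's last attempt are marked with pad_value.
        chunk += [pad_value] * (interval_length - len(chunk))
        segments.append(chunk)
    return segments


# Hypothetical sequence of 8 first attempts (1 = correct, 0 = incorrect).
segments = segment_attempts([1, 0, 1, 1, 0, 1, 1, 0], interval_length=3)
```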

III-A3 Long-term skills encoding for clustering

Students are grouped according to their learning ability profile: the skills or knowledge they have acquired. Data for assessing a student’s learning ability is available from previous attempts on test items or exercises corresponding to a specific skill.

The learning ability profile is encoded as a vector whose length is the number of skills, and it is updated after each time interval using all previous attempts on each skill. The differences between the success and failure ratios on each skill in a student’s previous attempts are transformed into a data vector for clustering student $i$ at time interval $z$ as follows:

$$R(x_j)_{1:z} = \sum_{t=1}^{z} \frac{|\{x_{jt} = 1\}|}{|N_j|} \tag{9}$$

$$W(x_j)_{1:z} = \sum_{t=1}^{z} \frac{|\{x_{jt} = 0\}|}{|N_j|} \tag{10}$$

$$d(x_j)_{1:z} = R(x_j)_{1:z} - W(x_j)_{1:z} \tag{11}$$

$$d^i_{1:z} = \big(d(x_1)_{1:z}, d(x_2)_{1:z}, \ldots, d(x_n)_{1:z}\big) \tag{12}$$

in which $R(x_j)_{1:z}$ and $W(x_j)_{1:z}$ represent the ratios of skill $x_j$ being answered correctly or incorrectly by student $i$, over $n$ skills, from time interval 1 to the current time interval $z$. $|\{x_{jt} = 1\}|$ and $|\{x_{jt} = 0\}|$ count the correct and incorrect attempts on skill $x_j$ in time interval $t$, and $|N_j|$ is the total number of practices of skill $x_j$ up to time interval $z$. $d(x_j)_{1:z}$ represents the difference between how often student $i$ answers skill $x_j$ correctly and incorrectly from time interval 1 to $z$, and $d^i_{1:z}$ is a vector containing the learning ability profile of student $i$ on each skill from time interval 1 until $z$. Each student may have a different number of total time intervals over the lifetime of their interactions with the system (see Fig. 3).
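A minimal sketch of the ability profile computation is given below: for each skill, the ratio of correct attempts minus the ratio of incorrect attempts over all attempts seen so far. Setting unpracticed skills to 0 is an assumption of this sketch, as are the function name and the example data.

```python
def ability_profile(attempts, num_skills):
    """Per-skill difference between success and failure ratios.

    attempts: iterable of (skill_index, is_correct) pairs from interval 1 up to
    the current interval. Unpracticed skills get 0.0 (assumption of this sketch).
    """
    correct = [0] * num_skills
    total = [0] * num_skills
    for skill, is_correct in attempts:
        total[skill] += 1
        correct[skill] += int(is_correct)
    profile = []
    for c, n in zip(correct, total):
        # Success ratio c/n minus failure ratio (n - c)/n.
        profile.append((c - (n - c)) / n if n else 0.0)
    return profile


# Hypothetical history: skill 0 practiced three times, skill 1 once, skill 2 never.
profile = ability_profile([(0, 1), (0, 1), (0, 0), (1, 0)], num_skills=3)
```

Each entry lies in the range from -1 (all attempts wrong) to 1 (all attempts correct), so the vector can be passed directly to a distance-based clustering method.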

III-A4 K-means Clustering

Students are assigned to a group with similar ability at each time interval by k-means clustering on the profile data $d^i_{1:z}$ [macqueen1967some, ball1965novel]. During the clustering training phase, we find the centroid of each student group without considering the time interval index. Once computed, the centroids do not change for the rest of the clustering process. We then assign students (in both training and testing data) to distinct groups at each time interval (see Fig. 2).

Fig. 2: Clustering students at each time interval.

To find the group that student $i$ belongs to at time interval $z$, we use the learning ability profile $d^i_{1:z-1}$, because we are not supposed to know the attempts of student $i$ in the current time interval $z$. After the centroids of all clusters have been learned, each student at each time interval is assigned to the nearest cluster $C_c$ by the following equation:

$$\text{Cluster}(Stu_i, Seg_z) = \arg\min_{C} \sum_{c=1}^{K} \sum_{d^i_{1:z-1} \in C_c} \lVert d^i_{1:z-1} - \mu_c \rVert^2 \tag{13}$$

where $\mu_c$ is the mean of the points in a cluster set $C_c$ (a group of students), and the ability profile $d^i_{1:z-1}$ represents the previous performance of student $i$ from time interval 1 to $z-1$.
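Once the centroids are fixed, the assignment reduces to a nearest-centroid lookup under squared Euclidean distance. The sketch below assumes the centroids have already been learned on the training students; the function name, centroid values, and profile are hypothetical.

```python
def assign_cluster(profile, centroids):
    """Index of the centroid closest to the profile (squared Euclidean distance)."""

    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(centroids)), key=lambda c: squared_distance(profile, centroids[c]))


# Hypothetical centroids for two groups over two skills.
centroids = [[-0.5, -0.5], [0.5, 0.5]]
# Profile built from intervals 1..z-1, used to pick the group for interval z.
group = assign_cluster([0.4, 0.2], centroids)
```

Because the centroids stay fixed after training, the same lookup applies to students in the training and the test data.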

Fig. 3: Evolution of students’ learning ability over each time interval (each time interval contains 20 attempts) throughout their interactions.

Figure 3 illustrates the learning abilities of 33 students, based on their previous performance, and their evolution over time intervals. Dark blue (-1) marks time intervals in which a student has no attempts because she has already left the system. Group 1 is used for the first time interval of every student, and the remaining groups are assigned by k-means clustering at each time interval using the previous performance data $d^i_{1:z-1}$.

III-B Deep knowledge tracing

DKT-DSC incorporates the student’s learning ability into DKT for better individualization, by dynamically assigning a student to a group of students with similar ability. It relaxes the assumptions that all students have the same ability and that a student’s ability is constant over time. In fact, a student’s ability evolves continuously, and some students may learn faster than others.

In the standard DKT, $x_t$ is a one-hot encoding of the student interaction tuple $\{s_t, a_t\}$, which combines the skill $s_t$ practiced with $a_t$, which indicates whether the answer is correct. DKT-DSC additionally requires $c_z$, the group (cluster) that indicates student $i$’s ability at the current time interval $z$. In the hidden layers, when the response sequence is segmented into multiple time intervals, the last hidden node of each time interval serves as the first node $h_0$ of the next time interval. The output $y_t$ is a vector whose length equals the number of problems, so the predicted probability that the next problem is answered correctly can be read from the corresponding entry of $y_t$. In that respect, Eq. 7 and 8 are still valid for DKT-DSC. The output $y_t$ of DKT and DKT-DSC is the same: it provides the predicted probability for a particular problem.
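One way to picture the DKT-DSC input is a one-hot encoding of the (skill, correctness) pair extended with a one-hot encoding of the student's current group. Concatenation is an assumption of this sketch, since the text does not spell out how the two are combined, and all names and sizes are hypothetical.

```python
def encode_input(skill, correct, cluster, num_skills, num_clusters):
    """One-hot (skill, correctness) pair followed by a one-hot cluster indicator."""
    interaction = [0] * (2 * num_skills)
    # First num_skills slots: incorrect answers; last num_skills slots: correct ones.
    interaction[skill + num_skills * int(correct)] = 1
    group = [0] * num_clusters
    group[cluster] = 1
    return interaction + group


# Hypothetical sizes: 3 skills and 4 clusters.
x_t = encode_input(skill=2, correct=True, cluster=1, num_skills=3, num_clusters=4)
```

Dropping the cluster part leaves the standard DKT input, which matches the remark that DKT-DSC without ability information reduces to DKT.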

Fig. 4: DKT-DSC prediction in each time interval (each segment) is associated with a distinct group (cluster) throughout interactions of a student with the system.

Figure 4 illustrates how DKT-DSC model has been adapted by incorporating student’s learning ability as distinct group information at each time interval (each segment) to improve individualization in knowledge tracing. The colour at each time interval at the input layer represents which group a student belongs to at that time interval according to her learning ability. Note that without incorporating student’s ability, DKT-DSC model is the same as the standard DKT model.

By adding the cluster information that records which group the student belongs to, we ensure that these high-level statistics remain available to the model for making its predictions throughout the whole academic year. The standard DKT model lacks this information: it treats all students in the same way, without considering their learning abilities. In contrast, DKT-DSC uses clustering to find groups of students with similar ability from their ability profiles at different time intervals. Tracing a student’s knowledge within each group can make the prediction of student performance more effective.

Finally, we summarize the characteristics of each model in this paper in Table I.

                             IRT    PFA    BKT    DKT    DKT-DSC
Use of student’s ability     Yes    No     No     No     Yes
Use of item difficulty       Yes    No     No     No     No
Use of single skill          Yes    No     Yes    Yes    Yes
Use of multiple skill        No     Yes    No     No     No
Learn on ordered sequence    No     No     Yes    Yes    Yes
TABLE I: Comparison of different models

IV Datasets

In order to validate the proposed model, we tested it on four public datasets from two distinct tutoring scenarios in which students interact with a computer-based learning system in educational settings.

  • The ASSISTments system is an online tutoring system, first created in 2004, that engages middle and high-school students with scaffolded hints in their math problems. If students working on ASSISTments answer a problem correctly, they are given a new problem. If they answer it incorrectly, they are provided with a small tutoring session where they must answer a few questions that break the problem down into steps. The datasets are: ASSISTments 2009-2010 skill builder data set, ASSISTments 2012-2013, and ASSISTments 2014-2015.

    In all datasets, problems are usually tagged with just one skill, but a rare few may be associated with two or three skills. This typically depends on the structure given by the content creator. Some researchers separate a record with multiple skills into multiple single-skill records by duplication. Wilson [wilson2016back] claimed that this type of data processing can artificially boost prediction results significantly for DKT models, because these duplicate rows account for approximately 25% of the records in the ASSISTments09 dataset. We therefore removed duplicate and multiple-skill repeated records in all datasets for fairness of comparison.

  • KDD Cup: The PSLC DataShop released several data sets derived from Carnegie Learning’s Cognitive Tutor. Algebra 2005-2006 [corbett2001cognitive] is a development dataset released during the KDD Cup 2010 competition. In this dataset, the problems are associated with multiple skills, so we regard a subset of multiple skills as a new skill [xiong2016going].

We evaluate models described above with four datasets from two separate real world tutors. The experimental results show how the models perform across different datasets. Only the first correct attempts to original problems are considered in our experiment.

To the best of our knowledge, these are the largest publicly available knowledge tracing datasets.

Dataset            Skills    Students    Records      Description
ASSISTments        123       4,163       278,607      2009-2010 [razzaq2005assistment]
ASSISTments        198       28,834      2,506,769    2012-2013 [feng2009addressing]
ASSISTments        100       19,840      683,801      2014-2015 [xiong2016going]
Cognitive Tutor    437       574         808,775      KDD Cup 2010 [corbett2001cognitive]
TABLE II: Overview of datasets

V Experimental Study

DKT-DSC is extended from the original DKT algorithm, and is combined with the k-means clustering method with a Euclidean distance. Ten iterations were made in the training stage. DKT-DSC and DKT share the same loss function, with 200 fully-connected hidden nodes for each hidden layer. For speeding up the training process, mini-batch stochastic gradient descent is used to minimize the loss function. The batch size for our implementation is 32, corresponding to 32 split sequences from each student. We train the model with a learning rate of 0.01 and dropout is also applied for avoiding overfitting [srivastava2014dropout].

In our experiments, 5-fold cross-validation is used to make predictions on all datasets. Each fold involves randomly splitting each dataset into 80% training data and 20% test data at the student level, so the training and test sets contain response records from different students. Training for clustering is performed using only data from students in the training set. We use EM to train BKT, and the limit of iterations is set to 200. We learn a model for each skill and make predictions separately, then average the results over skills. For DKT and DKT-DSC, we set the number of epochs to 100. All these models are trained and tested on the same sets of data. The next response of a student is predicted from the current and previous responses in chronological order.
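A student-level split can be sketched as below: students, not individual records, are shuffled and divided, so that no student contributes to both sets. The function name, the seed, and the identifiers are assumptions for illustration and do not reproduce the folds used in the experiments.

```python
import random


def split_students(student_ids, test_fraction=0.2, seed=0):
    """Randomly split unique student ids into disjoint training and test sets."""
    ids = sorted(set(student_ids))
    rng = random.Random(seed)
    rng.shuffle(ids)
    n_test = int(round(len(ids) * test_fraction))
    return set(ids[n_test:]), set(ids[:n_test])


# Hypothetical ids; repeated ids stand for several records of the same student.
train_ids, test_ids = split_students([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 3, 3, 7])
```

Records are then routed to the training or test set according to the set their student id falls in, and the clustering centroids are fitted on the training students only.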

We compare our model with state-of-the-art models: IRT [wilson2016back], BKT [Baker2008More], PFA [pavlik2009performance], DKT [piech2015deep]. But we do not compare with other variant models, because those are more or less similar and do not show significant difference in performance. For IRT, we apply the code from Knewton [wilson2016back] and the code for DKT is from WPI [xiong2016going]. For DKT, we use the same setting of parameters as DKT-DSC and also apply segmentation for a fair comparison. Predicted sequences of student performance by each model are tabulated and evaluated in terms of Area Under the Curve (AUC) and Root mean squared error (RMSE). AUC provides a robust metric where the value to predict is binary, as it is the case of our datasets. An AUC of 0.50 represents the score achieved by random guess. We set AUC 0.61 of BKT as a baseline in our experiment.
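For reference, both metrics can be computed from binary labels and predicted probabilities with the standard library alone. The sketch below uses the pairwise definition of AUC (the fraction of positive-negative pairs ranked correctly, with ties counted as one half), which is quadratic in the number of responses and meant for illustration rather than for large datasets. The example labels and predictions are made up.

```python
import math


def rmse(labels, predictions):
    """Root mean squared error between binary labels and predicted probabilities."""
    return math.sqrt(sum((p - y) ** 2 for y, p in zip(labels, predictions)) / len(labels))


def auc(labels, predictions):
    """Pairwise AUC: share of (positive, negative) pairs ranked correctly."""
    positives = [p for y, p in zip(labels, predictions) if y == 1]
    negatives = [p for y, p in zip(labels, predictions) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Made-up responses and predictions.
labels = [1, 0, 1, 0]
predictions = [0.9, 0.1, 0.8, 0.3]
example_auc = auc(labels, predictions)
example_rmse = rmse(labels, predictions)
```

A constant prediction for every response gives an AUC of 0.5 under this definition, which corresponds to the random-guess score mentioned above.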

Dataset            BKT     IRT     PFA     DKT     DKT-DSC
ASSISTments09      0.67    0.75    0.70    0.73    0.91
ASSISTments12      0.61    0.74    0.67    0.72    0.87
ASSISTments14      0.64    0.67    0.69    0.72    0.87
Cognitive Tutor    0.61    0.81    0.76    0.79    0.81

TABLE III: AUC result for all datasets

In Table III, DKT-DSC achieves the highest AUC on every dataset, tying with IRT on Cognitive Tutor. On the ASSISTments09 dataset, the standard DKT has an AUC of 0.73 while DKT-DSC achieves 0.91, a relative gain of about 25%. On the ASSISTments12 dataset, with 2.5 million records, DKT-DSC reaches an AUC of 0.87 against 0.72 for the original DKT, a gain of about 21%. On the most recent dataset, ASSISTments14, DKT-DSC again reaches 0.87 against 0.72 for DKT, a gain of about 21%. On the Cognitive Tutor dataset the gain is smaller, about 2.5%, with AUC = 0.81 for DKT-DSC and AUC = 0.79 for the original DKT. As for the other algorithms, IRT provides a slight improvement over the original DKT on every dataset except ASSISTments14, but DKT-DSC performs markedly better than both DKT and IRT on the three ASSISTments datasets. Note that Problem ID is not provided in the original ASSISTments14 dataset, so we use Skill ID as Problem ID for the IRT model, which is why IRT only reaches an AUC of 0.67 there. Among all the models described above, only IRT learns problem difficulty; all other models rely only on skills.


Dataset            BKT     IRT     PFA     DKT     DKT-DSC
ASSISTments09      0.46    0.44    0.45    0.45    0.33
ASSISTments12      0.51    0.44    0.44    0.43    0.35
ASSISTments14      0.51    0.44    0.42    0.42    0.35
Cognitive Tutor    0.47    0.37    0.39    0.36    0.36

TABLE IV: RMSE result for all datasets

In Table IV, we compare the models in terms of RMSE. BKT has an RMSE of 0.46 on ASSISTments09, 0.51 on ASSISTments12 and ASSISTments14, and 0.47 on Cognitive Tutor. The RMSE of DKT-DSC is under 0.40 on all datasets, while that of all other models is no less than 0.42 (except IRT, PFA and DKT on the Cognitive Tutor dataset). According to these results, DKT-DSC outperforms the other models on all ASSISTments datasets and matches the best of them (DKT) on the Cognitive Tutor dataset. All of the above experiments use time intervals of 20 attempts and 8 clusters (groups of students).

Time interval    Ass09    Ass12    Ass15    KDD
20               0.91     0.81     0.86     0.81
30               0.88     0.80     0.82     0.81
50               0.87     0.80     0.78     0.81
100              0.82     0.77     0.73     0.82

TABLE V: AUC of DKT-DSC with different lengths of time interval

Segmentation of student responses into fixed-length time intervals is applied for DKT-DSC (with 4 groups of students) and tested with time intervals of 20, 30, 50, and 100 attempts. The resulting performance is reported in Table V. On the three ASSISTments datasets, AUC decreases as the time interval grows. On the KDD dataset, which has 574 students and 437 (joint) skills, DKT-DSC instead performs slightly better when each time interval contains 100 attempts. This dataset can be considered small relative to the number of skills it contains, so with short intervals the student’s group may be identified poorly because of data sparsity. When each time interval contains a sufficient number of attempts, performance improves.

# clusters    Ass09    Ass12    Ass15    KDD
2             0.88     0.78     0.83     0.79
4             0.91     0.81     0.86     0.80
6             0.91     0.84     0.86     0.80
8             0.91     0.87     0.87     0.81

TABLE VI: AUC of DKT-DSC with different number of groups of students

As with the length of the time interval, the number of clusters affects performance, as shown in Table VI. According to the experimental results, 8 clusters with 20 attempts per time interval is the best setting for DKT-DSC.

VI Conclusion and Future Work

In this paper, we proposed a new model, DKT-DSC, that assesses a student’s learning ability at each time interval and dynamically assigns the student to a distinct group of students with similar ability. A student’s knowledge is traced based on the group she belongs to at each time interval. Experiments with four datasets show that the proposed model performs significantly better than state-of-the-art models. DKT assumes all students have the same learning ability and only tracks the improvement of knowledge in a skill sequence, without considering differences in ability and learning rate between students. In comparison, our model improves over DKT by capturing the student’s ability over time. Assessing the student’s ability in this way gives the model critical information for predicting student performance in the next time interval and for tracing knowledge when students’ abilities evolve dynamically. We individualize the input vector by taking both the student’s ability and the practiced skill into account. Compared with using the skill alone, incorporating the student’s ability as group information in DKT-DSC improves the prediction of performance. Dynamically assessing the student’s ability at each time interval plays a critical role and helps the DKT-DSC model capture more variance in the data, leading to more accurate predictions.

In future work, we will adapt this model to problems with multiple associated subskills and apply it to the recommendation of problems with multiple associated skills. Problems for practice should be recommended according to the knowledge level and the ability a student possesses. The gain obtained by DKT-DSC can make a difference in current knowledge tracing practice. We will also investigate the potential application of DKT-DSC to the recommendation of other content (movies and other commercial products).

VII Acknowledgements

This work was partially done while the first author was interning at the National Institute of Informatics, Japan, and was also funded by the NSERC Discovery funding awarded to the third author.

