Deep Knowledge Tracing and Dynamic Student Classification for Knowledge Tracing
Abstract
In Intelligent Tutoring Systems (ITS), tracing a student’s knowledge state during learning has been studied for several decades in order to provide more supportive learning instructions. In this paper, we propose a novel model for knowledge tracing that i) captures students’ learning ability and dynamically assigns students to distinct groups with similar ability at regular time intervals, and ii) combines this information with a Recurrent Neural Network architecture known as Deep Knowledge Tracing. Experimental results confirm that the proposed model is significantly better at predicting student performance than well-known state-of-the-art techniques for student modelling.
I Introduction
ITS is an active field of research that aims to provide personalized instruction to students. Early work dates back to the late 1970s. A wide array of Artificial Intelligence and Knowledge Representation techniques have been explored, among which we can mention rule-based and Bayesian representations of student knowledge and misconceptions, skills modeling with logistic regression in Item Response Theory, case-based reasoning, and, more recently, reinforcement learning and deep learning [brown1978diagnostic, polson2013foundations]. One can even argue that most of the main techniques found in Artificial Intelligence and Data Mining have found their way into the field of ITS, and in particular into the problem of knowledge tracing, which aims to model the student’s state of mastery of conceptual or procedural knowledge from observed performance on tasks [corbett1994knowledge].
In this paper we propose a novel model for knowledge tracing, Deep Knowledge Tracing with Dynamic Student Classification (DKT-DSC). At each time interval, the model first assigns a student to a distinct group of students that share similar learning ability. This information is then fed to a Recurrent Neural Network (RNN), known as the DKT architecture [piech2015deep], for predicting the student’s performance from data. The student classification can be viewed as a long-term memory of the student’s ability; supplying it as input to the RNN improves knowledge tracing with DKT, which is among the state-of-the-art approaches to knowledge tracing.
The rest of this paper is organized as follows. Section II reviews related work on student modelling techniques. Section III presents the proposed DKT-DSC model. Section IV describes the datasets used in our experiments. Experimental results are shown in Section V, and finally Section VI concludes this paper and discusses future avenues of research.
II Related Work
We review here four of the best known state-of-the-art student modelling methods for estimating student’s performance, either for their predominance in psychometrics (IRT) or Educational Data Mining (BKT), or because they are best performers (PFA, DKT). See [desmarais2012review] for a general review.
II-A Item Response Theory (IRT)
IRT assumes the student knowledge state is static and represented by her proficiency when completing an assessment during an exam [wilson2016back, van2013handbook, gonzalez2014general, ekanadham2017t]. IRT models a single skill and assumes the test items are unidimensional. It assigns student $i$ a static proficiency $\theta_i$. Each item $j$ has its own difficulty $\beta_j$. The main idea of IRT is to estimate the probability that student $i$ answers item $j$ correctly from the student’s ability and the item’s difficulty. The widely used one-parameter version of IRT, known as the Rasch model, is
$$p_j(\theta_i) = \frac{1}{1 + e^{-(\theta_i - \beta_j)}} \tag{1}$$
Recently, Wilson [wilson2016back] proposed an IRT model that outperforms state-of-the-art knowledge tracing models, in which maximum a posteriori (MAP) estimates of $\theta_i$ and $\beta_j$ are computed using the Newton-Raphson method.
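As a minimal sketch of the Rasch model described above, the following Python function maps a student proficiency and an item difficulty to a probability of a correct answer. The function name, the argument names `theta` and `beta`, and the example values are illustrative assumptions; parameter estimation (for example the MAP fitting mentioned above) is not shown.

```python
import math


def rasch_probability(theta: float, beta: float) -> float:
    """Probability of a correct answer under the one-parameter (Rasch) model.

    theta: student proficiency (assumed name), beta: item difficulty (assumed name).
    """
    return 1.0 / (1.0 + math.exp(-(theta - beta)))


# Hypothetical values: a student slightly more proficient than the item is hard.
print(rasch_probability(theta=0.5, beta=0.0))

# When proficiency equals difficulty, the exponent is zero and the model
# predicts a probability of exactly one half.
print(rasch_probability(theta=1.0, beta=1.0))
```

Because the probability depends only on the difference between proficiency and difficulty, shifting both by the same constant leaves the prediction unchanged.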
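II-B Bayesian Knowledge Tracing (BKT)
BKT was introduced for knowledge tracing within a learning environment for which the assumption of static knowledge states is dropped [corbett1994knowledge, d2008more]. It also assumes a single skill is tested per item, but this assumption is relaxed in later work on BKT. The standard BKT estimate of a student’s knowledge of a skill is continually updated, each time the student gives a response, using four probabilities: $P(L_0)$, the initial probability of mastery; $P(T)$, the probability of transitioning from non-mastery to mastery; $P(G)$, the probability of guessing; and $P(S)$, the probability of slipping. With $\text{Obs}_n$ denoting the observed (correct or incorrect) response at step $n$:
$$P(L_{n-1} \mid \text{Correct}_n) = \frac{P(L_{n-1})\,(1 - P(S))}{P(L_{n-1})\,(1 - P(S)) + (1 - P(L_{n-1}))\,P(G)} \tag{2}$$
$$P(L_{n-1} \mid \text{Incorrect}_n) = \frac{P(L_{n-1})\,P(S)}{P(L_{n-1})\,P(S) + (1 - P(L_{n-1}))\,(1 - P(G))} \tag{3}$$
$$P(L_n) = P(L_{n-1} \mid \text{Obs}_n) + (1 - P(L_{n-1} \mid \text{Obs}_n))\,P(T) \tag{4}$$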
There have been various extensions of BKT in the last decades [Baker2008More, pardos2011kt].
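The standard BKT update can be sketched in a few lines of Python. This is an illustrative sketch, not the implementation used in the experiments: the function names and the parameter values (initial mastery 0.3, transition 0.1, guess 0.2, slip 0.1) are hypothetical, and the update follows the usual two-step form (condition the mastery estimate on the observed response, then apply the learning transition).

```python
def bkt_update(p_mastery, correct, p_transit, p_guess, p_slip):
    """One BKT step: condition on the observed response, then apply learning."""
    if correct:
        evidence = p_mastery * (1.0 - p_slip)
        posterior = evidence / (evidence + (1.0 - p_mastery) * p_guess)
    else:
        evidence = p_mastery * p_slip
        posterior = evidence / (evidence + (1.0 - p_mastery) * (1.0 - p_guess))
    # Chance to move from non-mastery to mastery after the attempt.
    return posterior + (1.0 - posterior) * p_transit


def bkt_predict_correct(p_mastery, p_guess, p_slip):
    """Probability that the next response on this skill is correct."""
    return p_mastery * (1.0 - p_slip) + (1.0 - p_mastery) * p_guess


# Hypothetical parameters and response sequence (1 = correct, 0 = incorrect).
p = 0.3
for response in [1, 1, 0, 1]:
    p = bkt_update(p, response == 1, p_transit=0.1, p_guess=0.2, p_slip=0.1)
    print(round(p, 3), round(bkt_predict_correct(p, 0.2, 0.1), 3))
```

A correct response raises the mastery estimate and an incorrect one lowers it, while the transition probability keeps nudging the estimate upward with every practice opportunity.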
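II-C Performance Factor Analysis (PFA)
PFA, which was proposed as an alternative to BKT, also relaxes the static knowledge assumption and, in its basic structure, models multiple skills simultaneously [pavlik2009performance]. It defines the probability of success on an item by student $i$ as:
$$p(m) = \frac{1}{1 + e^{-m}} \tag{5}$$
$$m(i, j \in \text{KCs}, s, f) = \sum_{j \in \text{KCs}} \left(\beta_j + \gamma_j s_{ij} + \rho_j f_{ij}\right) \tag{6}$$
where KCs is the set of skills (knowledge components) associated with the item, $\beta_j$ is the bias for skill $j$, and $\gamma_j$ and $\rho_j$ represent the learning gain per successful and failed attempt on skill $j$, respectively. $s_{ij}$ is the number of successful attempts and $f_{ij}$ is the number of failed attempts made by student $i$ on skill $j$ [pavlik2009performance].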
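To make the PFA prediction concrete, the sketch below sums a per-skill bias plus weighted success and failure counts over the skills attached to an item, then applies the logistic function. The function name, the dictionary-based layout, the skill name and all parameter values are hypothetical; in practice the parameters would be fitted by logistic regression.

```python
import math


def pfa_probability(skills, successes, failures, beta, gamma, rho):
    """Logistic of the summed per-skill bias and weighted practice counts."""
    m = sum(
        beta[k] + gamma[k] * successes.get(k, 0) + rho[k] * failures.get(k, 0)
        for k in skills
    )
    return 1.0 / (1.0 + math.exp(-m))


# Hypothetical fitted parameters for a single skill.
beta = {"fractions": -0.5}   # skill bias
gamma = {"fractions": 0.3}   # gain per successful attempt
rho = {"fractions": 0.1}     # gain per failed attempt

p = pfa_probability(
    skills=["fractions"],
    successes={"fractions": 3},
    failures={"fractions": 1},
    beta=beta, gamma=gamma, rho=rho,
)
print(round(p, 3))
```

Unlike BKT, the prediction depends only on counts of prior successes and failures, not on their order.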
II-D Deep Knowledge Tracing (DKT)
DKT was introduced in [piech2015deep]. It uses a Long Short-Term Memory (LSTM) [hochreiter1997long] to represent the latent knowledge space of students dynamically. The increase in a student’s knowledge through an assignment can be inferred from the history of the student’s previous performance. DKT uses a large number of artificial neurons to represent the latent knowledge state along with a temporal dynamic structure, which allows the model to learn the latent knowledge state from data. It is defined by the following equations:
$$h_t = \tanh\left(W_{hx} x_t + W_{hh} h_{t-1} + b_h\right) \tag{7}$$
$$y_t = \sigma\left(W_{yh} h_t + b_y\right) \tag{8}$$
In DKT, both tanh and the sigmoid function $\sigma$ are applied element-wise, and the model is parameterized by an input weight matrix $W_{hx}$, recurrent weight matrix $W_{hh}$, initial state $h_0$, and readout weight matrix $W_{yh}$. Biases for latent and readout units are represented by $b_h$ and $b_y$.
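The recurrence described above (a tanh hidden update followed by a sigmoid readout) can be sketched in plain Python. This is a toy forward pass of the simple recurrent form only, with untrained random weights; it is not an LSTM and not the authors' implementation. The layer sizes, the helper names, and the one-hot layout (index `skill + correct * n_skills`) are assumptions made for illustration.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def rnn_step(x_t, h_prev, W_hx, W_hh, b_h, W_yh, b_y):
    """h_t = tanh(W_hx x_t + W_hh h_prev + b_h); y_t = sigmoid(W_yh h_t + b_y)."""
    pre_activation = [
        a + b + c
        for a, b, c in zip(matvec(W_hx, x_t), matvec(W_hh, h_prev), b_h)
    ]
    h_t = [math.tanh(v) for v in pre_activation]
    y_t = [sigmoid(v + b) for v, b in zip(matvec(W_yh, h_t), b_y)]
    return h_t, y_t


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


# Hypothetical sizes: 3 skills, so the one-hot input has 2 * 3 entries.
n_skills, n_hidden = 3, 4
rng = random.Random(0)
W_hx = random_matrix(n_hidden, 2 * n_skills, rng)
W_hh = random_matrix(n_hidden, n_hidden, rng)
W_yh = random_matrix(n_skills, n_hidden, rng)
b_h, b_y = [0.0] * n_hidden, [0.0] * n_skills

h = [0.0] * n_hidden  # initial state h_0
for skill, correct in [(0, 1), (2, 0), (1, 1)]:
    x = [0.0] * (2 * n_skills)
    x[skill + correct * n_skills] = 1.0  # assumed one-hot layout
    h, y = rnn_step(x, h, W_hx, W_hh, b_h, W_yh, b_y)

# One predicted probability of a correct answer per skill (weights are untrained).
print([round(p, 3) for p in y])
```

Training would adjust the weight matrices to minimize the prediction loss on the observed next responses; only the forward computation is shown here.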
III Deep Knowledge Tracing with Dynamic Student Classification
Human learning is a process that involves practice: we become proficient through practice. However, learning is also affected by the individual’s ability to learn, that is, to become proficient with more or less practice. We refer to the ability to become proficient with little practice as the learning ability. Based on that notion, we propose Deep Knowledge Tracing with Dynamic Student Classification (DKT-DSC), a model that assesses a student’s learning ability and assigns her to a distinct group of students with similar ability; the model then invokes an RNN to trace her knowledge within each group at different time intervals. It can thus trace the performance of students based on their learning ability, which is reassessed regularly over time.
III-A Dynamic assessment of student’s learning ability and grouping
Dividing students into distinct groups with similar learning ability, according to their previous performance on various contents in a learning system, has been explored in several research works in the field of education [merceron2005clustering, trivedi2011clustering] for providing more adaptive instructions to each group of students with similar ability. Dynamic assessment of student learning ability at each time interval is performed by clustering based on the assessment of their previous performance history before the start of next time interval.
III-A1 Time interval
A time interval is a segment containing a number of a student’s attempts to answer questions in the system. In this perspective, a tick of time is a single first attempt at a question or exercise.
III-A2 Segmenting students’ attempt sequence
Segmenting each student’s response sequence into multiple time intervals serves two purposes: 1) to reduce the computational burden and memory allocation needed for learning over a long sequence, and 2) to reassess a student’s learning ability after each time interval and dynamically assign her to the group to which she belongs for the next time interval.
Fig. 1 illustrates an example of dividing a 24-attempt response sequence of a student into 5 segments (time intervals), where a segment represents a time interval in which that student answered 6 problems in the system. When the student stops interacting with the system, this is represented with -1 in the last time interval. The number of attempts made by each student varies with the number of questions they answered during their interaction with the system.
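The segmentation step can be sketched as simple fixed-length chunking of a chronological attempt list. The function name and the `(skill_id, correct)` tuple layout are assumptions for illustration; marking the interval after a student's last attempt, as in Fig. 1, is left out of this sketch.

```python
def segment_attempts(attempts, interval_length):
    """Split a chronological attempt list into consecutive fixed-length intervals."""
    if interval_length <= 0:
        raise ValueError("interval_length must be positive")
    return [
        attempts[start:start + interval_length]
        for start in range(0, len(attempts), interval_length)
    ]


# Hypothetical 24-attempt sequence of (skill_id, correct) pairs.
attempts = [(k % 3, k % 2) for k in range(24)]
segments = segment_attempts(attempts, interval_length=6)
print(len(segments), [len(s) for s in segments])
```

If the sequence length is not a multiple of the interval length, the final segment is simply shorter than the others.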
III-A3 Long-term skills encoding for clustering
Students are grouped according to their learning ability profile: the skills or knowledge they have acquired. Data for assessing a student’s learning ability is available from previous attempts on test items or exercises corresponding to a specific skill.
The learning ability profile is encoded as a vector whose length is the number of skills, and it is updated after each time interval by using all previous attempts on each skill. The differences between the success and failure ratios on each skill over the student’s previous attempts are transformed into a data vector for clustering student $i$ at time interval $z$ as follows:
$$\text{Correct}(x_j)_{1:z} = \sum_{t=1}^{z} \frac{(x_{jt} = 1)}{|N_j|} \tag{9}$$
$$\text{Incorrect}(x_j)_{1:z} = \sum_{t=1}^{z} \frac{(x_{jt} = 0)}{|N_j|} \tag{10}$$
$$R(x_j)_{1:z} = \text{Correct}(x_j)_{1:z} - \text{Incorrect}(x_j)_{1:z} \tag{11}$$
$$d^i_{1:z} = \left(R(x_1)_{1:z}, R(x_2)_{1:z}, \ldots, R(x_n)_{1:z}\right) \tag{12}$$
in which $\text{Correct}(x_j)_{1:z}$ and $\text{Incorrect}(x_j)_{1:z}$ represent the ratios of skill $x_j$ being answered correctly or incorrectly by student $i$, over $n$ skills, from time interval 1 to the current time interval $z$. $|N_j|$ is the total number of practices of skill $x_j$ up to time interval $z$. $R(x_j)_{1:z}$ represents the difference between how often student $i$ answers skill $x_j$ correctly and incorrectly over time intervals 1 to $z$, and $d^i_{1:z}$ is a vector containing the learning ability profile of student $i$ on each skill from time interval 1 until $z$. Each student may have a different number of total time intervals in the lifetime of their interactions with the system (see Fig. 3).
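A rough sketch of the ability-profile encoding follows: for each skill, the profile entry is the success ratio minus the failure ratio over all attempts seen so far. The function name and the `(skill_index, correct)` input layout are assumptions, as is the choice to assign 0.0 to skills the student has not yet practiced, which the text does not specify.

```python
def ability_profile(attempts, n_skills):
    """Per-skill (correct ratio - incorrect ratio) over all attempts so far.

    attempts: iterable of (skill_index, correct) pairs, correct in {0, 1}.
    """
    correct = [0] * n_skills
    total = [0] * n_skills
    for skill, is_correct in attempts:
        total[skill] += 1
        correct[skill] += 1 if is_correct else 0

    profile = []
    for j in range(n_skills):
        if total[j] == 0:
            profile.append(0.0)  # assumption: unpracticed skills are neutral
        else:
            incorrect = total[j] - correct[j]
            profile.append((correct[j] - incorrect) / total[j])
    return profile


# Hypothetical history: skill 0 answered correctly twice, skill 1 once wrong
# and once right, skill 2 never practiced.
print(ability_profile([(0, 1), (0, 1), (1, 0), (1, 1)], n_skills=3))
```

Each entry lies between -1 (always incorrect) and 1 (always correct), so skills with very different numbers of attempts remain on a comparable scale for clustering.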
III-A4 K-means Clustering
Assigning students to a group with similar ability at each time interval is performed by k-means clustering on the data $d^i_{1:z}$ [macqueen1967some, ball1965novel]. During the clustering training phase, we find the centroids for each student group without considering the time interval index. Once computed, the centroid of each group does not change during the rest of the clustering process. After that, we assign students (in both training and testing data) to distinct groups at each time interval (see Fig. 2).
To find the group to which student $i$ belongs at time interval $z$, we use the learning ability profile data points $d^i_{1:z-1}$, because we are not supposed to know the attempts of student $i$ at time interval $z$. After learning the centroids of all $K$ clusters, each student $i$ at each time interval $z$ is assigned to the nearest cluster $C_c$ by the following equation:
$$\text{Cluster}(\text{Stu}_i, \text{Seg}_z) = \arg\min_{c \in \{1, \ldots, K\}} \left\lVert d^i_{1:z-1} - \mu_c \right\rVert^2 \tag{13}$$
where $\mu_c$ is the mean of the points in cluster set $C_c$ (a group of students), and the ability profile data $d^i_{1:z-1}$ represents the previous performance data of student $i$ from time interval 1 to $z-1$.
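Once the centroids are fixed, assigning a student to a group reduces to a nearest-centroid lookup under squared Euclidean distance. The sketch below assumes the centroids have already been learned by k-means on training students; the function name and the centroid values are hypothetical.

```python
def nearest_cluster(profile, centroids):
    """Index of the centroid closest to the profile in squared Euclidean distance."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(
        range(len(centroids)),
        key=lambda c: squared_distance(profile, centroids[c]),
    )


# Hypothetical frozen centroids for three ability groups over two skills.
centroids = [[-1.0, -1.0], [0.0, 0.0], [1.0, 1.0]]

# The profile is built from attempts before the current interval only.
previous_profile = [0.9, 0.8]
print(nearest_cluster(previous_profile, centroids))
```

Keeping the centroids frozen means test students are grouped by the same rule as training students, and no information from the test set enters the clustering.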
Figure 3 illustrates the learning abilities of 33 students based on their previous performance, and their evolution over time intervals. Dark blue (-1) means that students have no attempts once they have quit the system. Group 1 is used for the first time interval of every student, and the remaining groups are assigned by k-means clustering at each time interval using the previous performance data $d^i_{1:z-1}$.
III-B Deep knowledge tracing
DKT-DSC incorporates the student’s learning ability into DKT for better individualization of the system, by dynamically assigning a student to a group of students with similar ability. It relaxes the assumptions that all students have the same ability and that a student’s ability is constant over time. In fact, a student’s ability evolves continuously, and some students may learn faster than others.
In the standard DKT, $x_t$ is a one-hot encoding vector of the student interaction tuple $\{s_t, a_t\}$, in which $s_t$ represents the skill practiced and $a_t$ indicates whether the answer is correct. DKT-DSC additionally requires $c_z$ along with $x_t$, where $c_z$ is the group or cluster indicating student $i$’s ability at the current time interval $z$. In the hidden layers, the last node of each time interval serves as the first node of the next time interval when we segment the response sequence into multiple time intervals. The output $y_t$ is a vector of the same length as the number of problems. Thus, the probability that the next problem $x_{t+1}$ is answered correctly at $t+1$ can be obtained from $y_t$. In that respect, Eq. 7 and 8 are still valid for DKT-DSC. The output of DKT and DKT-DSC is the same: the predicted probability for a particular problem.
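The text above does not spell out how the group indicator is combined with the one-hot interaction vector. One possibility, assumed here purely for illustration, is to append a one-hot cluster segment to the usual skill-and-correctness encoding. The function name, the vector layout and the sizes are hypothetical.

```python
def encode_interaction(skill, correct, cluster, n_skills, n_clusters):
    """Concatenate a one-hot (skill, correctness) block with a one-hot cluster block.

    Layout (assumed): first n_skills slots are incorrect answers per skill,
    next n_skills slots are correct answers per skill, last n_clusters slots
    mark the student's ability group for the current time interval.
    """
    x = [0.0] * (2 * n_skills + n_clusters)
    x[skill + (n_skills if correct else 0)] = 1.0
    x[2 * n_skills + cluster] = 1.0
    return x


# Hypothetical example: skill 1 answered correctly by a student in group 2,
# with 3 skills and 4 ability groups.
print(encode_interaction(skill=1, correct=True, cluster=2, n_skills=3, n_clusters=4))
```

Under this layout, dropping the last `n_clusters` entries recovers the standard DKT input, which matches the remark that the two models coincide when ability information is not used.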
Figure 4 illustrates how the DKT-DSC model incorporates the student’s learning ability as distinct group information at each time interval (each segment) to improve individualization in knowledge tracing. The colour of each time interval at the input layer represents the group to which a student belongs at that time interval according to her learning ability. Note that without incorporating the student’s ability, the DKT-DSC model is the same as the standard DKT model.
By adding the cluster information indicating which group the student belongs to, we ensure that these high-level statistics remain available to the model for making its predictions throughout the whole academic year. The standard DKT model, by contrast, treats all students in the same way without considering their learning abilities. DKT-DSC uses clustering to find a group of students with similar ability by using their ability profile data at different time intervals. Tracing a student’s knowledge within each group can make the prediction of the student’s performance more effective.
Finally, we summarize the characteristics of each model in this paper in Table I.
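Table I: Characteristics of each model
| Characteristic | IRT | PFA | BKT | DKT | DKT-DSC |
|---|---|---|---|---|---|
| Use of student’s ability | Yes | No | No | No | Yes |
| Use of item difficulty | Yes | No | No | No | No |
| Use of single skill | Yes | No | Yes | Yes | Yes |
| Use of multiple skill | No | Yes | No | No | No |
| Learn on ordered sequence | No | No | Yes | Yes | Yes |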
IV Datasets
In order to validate the proposed model, we tested it on four public datasets from two distinct tutoring scenarios in which students interact with a computer-based learning system in educational settings.

The ASSISTments system (https://sites.google.com/site/assistmentsdata) is an online tutoring system, first created in 2004, which engages middle and high-school students with scaffolded hints in their math problems. If students working on ASSISTments answer a problem correctly, they are given a new problem. If they answer it incorrectly, they are provided with a small tutoring session where they must answer a few questions that break the problem down into steps. The datasets are the ASSISTments 2009-2010 skill builder dataset, ASSISTments 2012-2013, and ASSISTments 2014-2015.
In all datasets, problems are usually tagged with just one skill, but a rare few may be associated with two or three skills. This typically depends on the structure given by the content creator. Some researchers separate a record with multiple skills into multiple single-skill records by duplication. Wilson [wilson2016back] claimed that this type of data processing can artificially boost prediction results significantly, because these duplicate rows can account for approximately 25% of the records in the ASSISTments09 dataset for DKT models. We therefore removed duplicate and multiple-skill repeated records from all datasets for a fair comparison.

KDD Cup: The PSLC DataShop released several datasets derived from Carnegie Learning’s Cognitive Tutor. Algebra 2005-2006 [corbett2001cognitive] is a development dataset released during the KDD Cup 2010 competition (https://pslcdatashop.web.cmu.edu/KDDCup/downloads.jsp). In this dataset, the problems are associated with multiple skills, so we regard each subset of multiple skills as a new skill [xiong2016going].
We evaluate the models described above on four datasets from two separate real-world tutors. The experimental results show how the models perform across different datasets. Only the first correct attempts to original problems are considered in our experiment.
To the best of our knowledge, these are the largest publicly available knowledge tracing datasets.
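Table II: Overview of the datasets
| Dataset | Number of skills | Number of students | Number of records | Description |
|---|---|---|---|---|
| ASSISTments | 123 | 4,163 | 278,607 | 2009-2010 [razzaq2005assistment] |
| ASSISTments | 198 | 28,834 | 2,506,769 | 2012-2013 [feng2009addressing] |
| ASSISTments | 100 | 19,840 | 683,801 | 2014-2015 [xiong2016going] |
| Cognitive Tutor | 437 | 574 | 808,775 | KDD Cup 2010 [corbett2001cognitive] |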
V Experimental Study
DKT-DSC extends the original DKT algorithm and combines it with k-means clustering using Euclidean distance. Ten iterations were made in the training stage. DKT-DSC and DKT share the same loss function, with 200 fully-connected hidden nodes for each hidden layer. To speed up training, mini-batch stochastic gradient descent is used to minimize the loss function. The batch size for our implementation is 32, corresponding to 32 split sequences from each student. We train the model with a learning rate of 0.01, and dropout is also applied to avoid overfitting [srivastava2014dropout].
In our experiment, 5-fold cross-validation is used to make predictions on all datasets. Each fold involves randomly splitting each dataset into 80% training data and 20% test data at the student level, so the training and test datasets contain response records from different students. Training for clustering is performed using only data from students in the training dataset. We use EM to train BKT, with the limit on iterations set to 200. We learn models for each skill and make predictions separately, then average the results over skills. For DKT and DKT-DSC, we set the number of epochs to 100. All of these models are trained and tested on the same sets of data. The next response of a student is predicted from the current and previous responses in chronological order.
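A student-level split of the kind described above can be sketched as follows: student identifiers, rather than individual records, are shuffled and partitioned, so no student contributes records to both sets. The function name, the fixed seed and the rounding rule are illustrative assumptions, not the exact procedure used in the experiments.

```python
import random


def split_by_student(student_ids, test_fraction=0.2, seed=0):
    """Return (train_students, test_students) as disjoint sets of student ids."""
    ids = sorted(set(student_ids))
    rng = random.Random(seed)
    rng.shuffle(ids)
    n_test = int(round(len(ids) * test_fraction))
    return set(ids[n_test:]), set(ids[:n_test])


# Hypothetical population of 100 students identified by integers.
train_students, test_students = split_by_student(range(100))
print(len(train_students), len(test_students))
```

Records would then be routed to the training or test set according to the student they belong to, and clustering centroids would be fitted on the training students only.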
We compare our model with state-of-the-art models: IRT [wilson2016back], BKT [Baker2008More], PFA [pavlik2009performance], and DKT [piech2015deep]. We do not compare with other variant models, because those are more or less similar and do not show a significant difference in performance. For IRT, we apply the code from Knewton [wilson2016back], and the code for DKT is from WPI [xiong2016going]. For DKT, we use the same parameter settings as for DKT-DSC and also apply segmentation for a fair comparison. The sequences of student performance predicted by each model are tabulated and evaluated in terms of Area Under the Curve (AUC) and root mean squared error (RMSE). AUC provides a robust metric when the value to predict is binary, as is the case in our datasets. An AUC of 0.50 represents the score achieved by random guessing. We set the AUC of 0.61 achieved by BKT as a baseline in our experiment.
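For reference, both evaluation metrics can be computed in a few lines of plain Python. This sketch uses the pairwise-ranking definition of AUC (the fraction of positive and negative pairs ranked correctly, with ties counted as one half), which is quadratic in the number of predictions and meant only to illustrate the definition. The function names and example values are hypothetical.

```python
import math


def rmse(y_true, y_prob):
    """Root mean squared error between binary labels and predicted probabilities."""
    squared = [(p - y) ** 2 for y, p in zip(y_true, y_prob)]
    return math.sqrt(sum(squared) / len(squared))


def auc(y_true, y_prob):
    """Probability that a random positive is scored above a random negative."""
    positives = [p for y, p in zip(y_true, y_prob) if y == 1]
    negatives = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Hypothetical labels (1 = correct response) and predicted probabilities.
labels = [0, 0, 1, 1]
predictions = [0.1, 0.4, 0.35, 0.8]
print(auc(labels, predictions))   # 0.75: three of four pairs ranked correctly
print(rmse(labels, predictions))
```

A model that outputs the same probability for every response gets an AUC of 0.5 under this definition, which corresponds to the random-guess level mentioned above.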
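Table III: AUC of each model on each dataset
| Dataset | BKT | IRT | PFA | DKT | DKT-DSC |
|---|---|---|---|---|---|
| ASSISTments09 | 0.67 | 0.75 | 0.70 | 0.73 | 0.91 |
| ASSISTments12 | 0.61 | 0.74 | 0.67 | 0.72 | 0.87 |
| ASSISTments14 | 0.64 | 0.67 | 0.69 | 0.72 | 0.87 |
| Cognitive Tutor | 0.61 | 0.81 | 0.76 | 0.79 | 0.81 |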

In Table III, DKT-DSC performs significantly better than state-of-the-art models in all datasets. On the ASSISTments09 dataset, compared with the standard DKT, which has an AUC of 0.73, our DKT-DSC model achieves an AUC of 0.92, which represents a significant gain of 26%. On the ASSISTments12 dataset with 2.5 million records, the result shows an 11% increase, with an AUC of 0.80 for DKT-DSC compared with an AUC of 0.72 for the original DKT. On the latest ASSISTments14 dataset, DKT-DSC achieves an improvement of 19% over the original DKT. On the Cognitive Tutor dataset, DKT-DSC also achieves about a 1% gain with AUC=0.81, while the original DKT has AUC=0.79. As for the other algorithms, IRT also provides a slight improvement over the original DKT in all datasets except ASSISTments14, but DKT-DSC performs significantly better than both DKT and IRT. Note that Problem ID is not provided in the original ASSISTments14 dataset, so we use Skill ID as Problem ID for the IRT model, which is why IRT only reaches an AUC of 0.67 there. Among all the models described above, only the IRT model learns the problem difficulty, while all other models rely only on skills.
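Table IV: RMSE of each model on each dataset
| Dataset | BKT | IRT | PFA | DKT | DKT-DSC |
|---|---|---|---|---|---|
| ASSISTments09 | 0.46 | 0.44 | 0.45 | 0.45 | 0.33 |
| ASSISTments12 | 0.51 | 0.44 | 0.44 | 0.43 | 0.35 |
| ASSISTments14 | 0.51 | 0.44 | 0.42 | 0.42 | 0.35 |
| Cognitive Tutor | 0.47 | 0.37 | 0.39 | 0.36 | 0.36 |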

In Table IV, when we compare the models in terms of RMSE, BKT obtains 0.46 on ASSISTments09, 0.51 on ASSISTments12 and 0.47 on Cognitive Tutor. The RMSE results of DKT-DSC on all datasets are under 0.40, while those of all other models are no less than 0.42 (except IRT, PFA and DKT on the Cognitive Tutor dataset). According to these results, DKT-DSC outperforms the other models on all ASSISTments datasets and shows a slightly better performance on the Cognitive Tutor dataset. All of the above experiments are conducted with time intervals containing 20 attempts and with 8 clusters (groups of students).
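Table V: Performance of DKT-DSC by time-interval length (attempts per interval)
| Time interval | Ass09 | Ass12 | Ass15 | KDD |
|---|---|---|---|---|
| 20 | 0.91 | 0.81 | 0.86 | 0.81 |
| 30 | 0.88 | 0.80 | 0.82 | 0.81 |
| 50 | 0.87 | 0.80 | 0.78 | 0.81 |
| 100 | 0.82 | 0.77 | 0.73 | 0.82 |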

Segmentation of student responses into fixed time intervals is applied for DKT-DSC (with 4 groups of students) and tested with each time interval containing 20, 30, 50, and 100 attempts. The performance of DKT-DSC is described in Table V. DKT-DSC performs better when each time interval contains 100 attempts on the KDD dataset with 574 (joint) skills. It can be considered a small dataset according to the number of skills contained in it, so the student’s group may be identified unreliably because of the sparsity of the data. When each time interval contains a sufficient number of attempts, the model shows better performance.
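Table VI: Performance of DKT-DSC by number of clusters
| # clusters | Ass09 | Ass12 | Ass15 | KDD |
|---|---|---|---|---|
| 2 | 0.88 | 0.78 | 0.83 | 0.79 |
| 4 | 0.91 | 0.81 | 0.86 | 0.80 |
| 6 | 0.91 | 0.84 | 0.86 | 0.80 |
| 8 | 0.91 | 0.87 | 0.87 | 0.81 |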

As with different time-interval lengths, different numbers of clusters yield different performance, as described in Table VI. According to the experimental results, 8 clusters with 20 attempts in each time interval is the best parameter setting for DKT-DSC.
VI Conclusion and Future Work
In this paper, we proposed a new model, DKT-DSC, that assesses a student’s learning ability at each time interval and dynamically assigns the student to a distinct group of students with similar ability. A student’s knowledge is traced based on the group to which she belongs at each time interval. Experiments with four datasets show that the proposed model performs significantly better than state-of-the-art models. DKT assumes that all students have the same learning ability and only tracks the improvement of knowledge in a skill sequence, without considering differences in ability and learning rate between students. In comparison, our model improves over DKT by capturing the student’s ability over time. Assessing a student’s ability in this way gives the model critical information for predicting student performance in the next time interval and for tracing knowledge when the abilities of students evolve dynamically. We individualize the input vector by taking both the student’s ability and the practiced skill into account. Compared with using the skill level alone, incorporating the student’s ability in terms of group information in DKT-DSC yields an improvement in the prediction of performance. Dynamically assessing a student’s ability at each time interval plays a critical role and helps the DKT-DSC model capture more variance in the data, leading to more accurate predictions.
In future work, we will adapt this model to problems with multiple associated subskills and apply it to the recommendation of problems with multiple associated skills. Problems for practice should be recommended according to the knowledge level and the ability a student possesses. The significant gain obtained by DKT-DSC can make a difference for current knowledge tracing. Further investigation of the potential application of DKT-DSC to other content recommendation tasks (movies and other commercial products) will also be considered.
VII Acknowledgements
This work was partially done while the first author was interning at the National Institute of Informatics, Japan, and was also funded by the NSERC Discovery funding awarded to the third author.