Addressing Two Problems in Deep Knowledge Tracing via Prediction-Consistent Regularization
Abstract
Knowledge tracing is one of the key research areas for empowering personalized education. It is the task of modeling a student's mastery level of knowledge components (KCs) based on his or her historical learning trajectories. In recent years, a recurrent neural network model called deep knowledge tracing (DKT) has been proposed to handle the knowledge tracing task, and the literature has shown that DKT generally outperforms traditional methods. However, through our extensive experimentation, we have noticed two major problems in the DKT model. The first problem is that the model fails to reconstruct the observed input: even when a student performs well on a KC, the predicted mastery level of that KC may decrease instead, and vice versa. Second, the predicted performance for KCs across timesteps is not consistent. This is undesirable and unreasonable because a student's performance is expected to transition gradually over time. To address these problems, we introduce regularization terms that correspond to reconstruction and waviness into the loss function of the original DKT model to enhance the consistency in prediction. Experiments show that the regularized loss function effectively alleviates the two problems without degrading the original task of DKT. (The implementation of this work is available at https://github.com/ckyeungac/deepknowledgetracingplus.)
Chun-Kit Yeung 
Hong Kong University of Science and Technology 
ckyeungac@cse.ust.hk 
Dit-Yan Yeung 
Hong Kong University of Science and Technology 
dyyeung@cse.ust.hk 
I.2.6 [Artificial Intelligence]: Learning; K.3.m [Computers and Education]: Miscellaneous
Knowledge tracing; Deep learning; Regularization; Educational data mining; Personalized learning; Sequence modeling
With the advancement of digital technologies, online platforms for intelligent tutoring systems (ITSs) and massive open online courses (MOOCs) are becoming prevalent. These platforms produce massive datasets of student learning trajectories over knowledge components (KCs), where KC is a generic term for a skill, concept, exercise, etc. The availability of online activity logs of students has accelerated the development of learning analytics and educational data mining tools for predicting students' performance and advising their learning. Among many topics, knowledge tracing (KT) is considered important for enhancing personalized learning. KT is the task of modeling a student's knowledge state, which represents the mastery level of KCs, based on historical data. With the estimated knowledge states, teachers or tutors can gain a better understanding of the attainment levels of their students and can tailor the learning materials accordingly. Moreover, students may also take advantage of such learning analytics tools to come up with better learning plans that address their weaknesses and maximize their learning efficacy.
Generally, the KT task can be formalized as follows: given a student's historical interactions X_t = (x_1, x_2, ..., x_t) up to time t on a particular learning task, predict some aspect of his or her next interaction x_{t+1}. Question-and-answer interactions are the most common type in KT, so x_t is usually represented as an ordered pair (q_t, a_t) consisting of a tag q_t for the question being answered at time t and an answer label a_t indicating whether the question was answered correctly. In many cases, KT seeks to predict the probability that the student will answer the question correctly in the next timestep, i.e., P(a_{t+1} = 1 | q_{t+1}, X_t). Many approaches have been developed to solve the KT problem, such as the hidden Markov model (HMM) used in Bayesian knowledge tracing (BKT) [?] and the logistic regression model used in performance factors analysis (PFA) [?]. More recently, a recurrent neural network (RNN) model has been applied in a method called deep knowledge tracing (DKT) [?]. Experiments show that DKT outperforms traditional methods without requiring substantial feature engineering by humans.
Although DKT achieves impressive performance for the KT task, we have noticed two major problems in the prediction results of DKT when trying to replicate the experiments in [?] (where the authors adopted the skill-level tag as the question tag). These two problems are illustrated using a heatmap in Figure 1, which visualizes the predicted knowledge state at each timestep for a student (namely id1) from the ASSISTment 2009 skill builder dataset.
The first problem is that the DKT model fails to reconstruct the input information in its prediction. When a student performs well on a learning task related to a skill s_i, the model's predicted performance for s_i may drop instead, and vice versa. For example, at one of the timesteps in Figure 1, the probability of correctly answering the exercise related to s32 increases compared with the previous timestep even though the student has just answered s32 incorrectly.
Second, it is observed that the transition in the prediction outputs, i.e., the student's knowledge states, across timesteps is not consistent. As depicted in Figure 1, there are sudden surges and plunges in the predicted performance of some skills across timesteps. For example, the probabilities of correctly answering several skills fluctuate while the student is answering s32 and s33 in the middle of the learning sequence. This is intuitively undesirable and unreasonable, as a student's knowledge state is expected to transition gradually over time, not to alternate between mastered and not-yet-mastered. Such wavy transitions are therefore unfavorable, as they would mislead the interpretation of the student's knowledge state.
To address the problems described above, we propose to augment the original loss function of the DKT model with additional quality measures beyond the original one, which solely considers the prediction accuracy of the next interaction. Specifically, we define a reconstruction error and two waviness measures and use them to augment the original loss function into a regularized loss function. Experiments show that the regularized DKT is more accurate in reconstructing the answer label of the observed input and more consistent in its predictions across timesteps, without sacrificing the prediction accuracy for the next interaction.
Our main contributions are summarized as follows:

Two problems in DKT that have not been revealed in the literature are identified: failure to reconstruct the current observation, and wavy prediction transitions;

Three regularization terms for enhancing the consistency of prediction in DKT are proposed: r to address the reconstruction problem, and w1 and w2 to address the wavy prediction transition problem;

Five other performance measures are proposed to evaluate three aspects of goodness in KT: AUC(C) for the prediction performance of the current interaction, w1 and w2 for the overall waviness of KT's predictions, and m1 and m2 for the consistency between the current observation and the corresponding change in prediction.
Researchers have been investigating mathematical and computational models to tackle the KT task since the 1990s. Various approaches, ranging from probabilistic models to deep neural networks, have been developed over the past two decades.
The Bayesian knowledge tracing (BKT) model was proposed in [?] during the 1990s. It models the knowledge states of KCs using one HMM for each KC. Specifically, the hidden state in the HMM represents the student's knowledge state, which indicates whether or not the KC is mastered. However, many simplifying assumptions adopted by BKT are unrealistic. For example, BKT assumes that forgetting does not occur and that the KCs are mutually independent. To address these shortcomings, variants of BKT with forgetting power [?] and skill dependency [?] have been proposed. Extensive work has also been conducted to empower the individualization of BKT with both skill-specific parameters [?, ?] and student-specific parameters [?]. Other attempts have extended the capabilities of BKT to partial scores [?], sub-skills and temporal features [?], and more features from cognitive science such as the recency effect and contextualized trial sequences [?]. However, it should be noted that such extensions often require considerable feature engineering effort and may incur a significant increase in computational requirements.
In the 2000s, learning factors analysis (LFA) [?] was proposed to model student knowledge states using a logistic regression model in order to handle learning tasks associated with multiple KCs and to incorporate student ability into the model. A reconfiguration of LFA, called performance factors analysis (PFA) [?], offers higher sensitivity to student performance rather than student ability. Both LFA and PFA exploit the number of successes or failures of applying a KC to predict whether a student has acquired knowledge of that KC. Although both can handle a learning task that is associated with multiple KCs, they cannot deal with the inherent dependency among KCs, e.g., that "addition" is a prerequisite of "multiplication". Moreover, the features used in LFA and PFA are relatively simple and cannot provide deep insight into students' latent knowledge states.
Recently, with a surge of interest in deep learning models, DKT [?], which models a student's knowledge state with an RNN, has been shown to outperform traditional models such as BKT and PFA without the need for human-engineered features such as the recency effect, contextualized trial sequences, inter-skill relationships, and students' ability variation [?]. Since DKT was proposed, a few comprehensive studies have been reported that compare DKT with other KT models [?, ?] or apply the ideas of DKT to other applications [?, ?, ?]. Nevertheless, to the best of our knowledge, all such attempts in the literature evaluate the DKT model mainly with respect to the prediction of the next interaction based on the area under the ROC curve (AUC) measure, without considering other quality aspects of the prediction result.
DKT employs an RNN as its backbone model (see Figure 2). A (vanilla) RNN [?] aims to map an input sequence (x_1, x_2, ..., x_T) to an output sequence (y_1, y_2, ..., y_T). To map the input to the output, the input vector undergoes a series of transformations via a hidden layer, which captures useful information that is hard to human-engineer, forming a sequence of hidden states (h_1, h_2, ..., h_T). More concretely, at timestep t, the hidden state h_t is an encoding of the past information obtained up to timestep t-1, i.e., h_{t-1}, and the current input x_t. The input-to-hidden and hidden-to-output transformations can be stated mathematically as follows:
(1) h_t = tanh(W_hx · x_t + W_hh · h_{t-1} + b_h)
(2) y_t = σ(W_yh · h_t + b_y)
where both the hyperbolic tangent tanh(·) and the sigmoid function σ(·) are applied in an element-wise manner. The model is parameterized by the weight matrices W_hx, W_hh and W_yh and the bias vectors b_h and b_y of appropriate dimensions.
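As a concrete illustration, the recurrence in eqs. (1) and (2) can be sketched in a few lines of NumPy. The dimensions, random weights, and toy input sequence below are illustrative assumptions, not values from the paper:

```python
import numpy as np

# A minimal sketch of one vanilla RNN step, eqs. (1)-(2).
# Illustrative sizes: input 4, hidden 3, output 2.
rng = np.random.default_rng(0)
W_hx = rng.normal(scale=0.1, size=(3, 4))
W_hh = rng.normal(scale=0.1, size=(3, 3))
b_h = np.zeros(3)
W_yh = rng.normal(scale=0.1, size=(2, 3))
b_y = np.zeros(2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_step(x_t, h_prev):
    # eq. (1): h_t encodes the past (h_{t-1}) and the current input x_t
    h_t = np.tanh(W_hx @ x_t + W_hh @ h_prev + b_h)
    # eq. (2): the output is a sigmoid over a linear map of h_t
    y_t = sigmoid(W_yh @ h_t + b_y)
    return h_t, y_t

h = np.zeros(3)
for x in np.eye(4)[:3]:  # a toy sequence of three one-hot inputs
    h, y = rnn_step(x, h)
```

Because the hidden state is fed back at every step, the final h depends on the whole toy sequence, which is exactly the "summary of the past" role described above.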
Piech et al. [?] adopt an RNN variant with long short-term memory (LSTM) cells. An LSTM cell incorporates three gates to imitate the human memory system [?] when calculating the hidden state h_t. The three gates are the forget gate f_t, the input gate i_t and the output gate o_t, which together control a memory cell state c_t. Mathematically, they are simply three vectors calculated from the current input x_t and the previous hidden state h_{t-1}:

f_t = σ(W_f · [h_{t-1}, x_t] + b_f)
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
where [·, ·] denotes concatenation. Different gates play different roles in controlling what information is stored in c_t. The forget gate f_t decides what information to forget from the previous memory cell state c_{t-1}, while the input gate i_t decides what new information should be added to the current cell state c_t. Thus, the current cell state depends on the previous cell state after forgetting, together with the new information admitted by the input gate. Eventually, the output gate o_t determines what information should be extracted from c_t to form the hidden state h_t. These can be expressed mathematically as follows:
(3) c_t = f_t ⊗ c_{t-1} + i_t ⊗ tanh(W_c · [h_{t-1}, x_t] + b_c)
    h_t = o_t ⊗ tanh(c_t)
where ⊗ denotes element-wise multiplication. This formulation enables the RNN to retain information observed in the distant past, and thus gives it a more powerful capability than the vanilla RNN. The unfolded RNN architecture is visualized in Figure 2, together with a high-level interpretation of DKT.
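One LSTM step combining the gate equations with eq. (3) can likewise be sketched. Again, the dimensions and random weights are illustrative assumptions:

```python
import numpy as np

# A sketch of one LSTM step: gates f_t, i_t, o_t from [h_{t-1}, x_t],
# then cell state c_t and hidden state h_t as in eq. (3).
# Illustrative sizes: input 4, hidden 3.
rng = np.random.default_rng(1)
H, X = 3, 4
W_f, W_i, W_o, W_c = (rng.normal(scale=0.1, size=(H, H + X)) for _ in range(4))
b_f = b_i = b_o = b_c = np.zeros(H)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([h_prev, x_t])                   # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)                        # forget gate
    i_t = sigmoid(W_i @ z + b_i)                        # input gate
    o_t = sigmoid(W_o @ z + b_o)                        # output gate
    c_t = f_t * c_prev + i_t * np.tanh(W_c @ z + b_c)   # new cell state
    h_t = o_t * np.tanh(c_t)                            # hidden state
    return h_t, c_t

h, c = np.zeros(H), np.zeros(H)
h, c = lstm_step(np.eye(X)[0], h, c)
```

The cell state c is the additive memory path that lets information survive many timesteps, which is why the LSTM variant handles long histories better than the vanilla recurrence.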
To train a DKT model, an interaction (q_t, a_t) needs to be transformed into a fixed-length input vector x_t. As a question can be identified by a unique ID, it can be represented using the one-hot vector δ(q_t). The corresponding answer label can also be represented as the one-hot vector of the corresponding question if the student answers it correctly, or as a zero vector otherwise. Therefore, if there are M unique questions, then x_t ∈ {0, 1}^{2M}.
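The encoding just described can be sketched as follows. The function name and toy sizes are ours, and the concatenation order (question part first, answer part second) is one common convention rather than something fixed by the paper:

```python
import numpy as np

# Sketch of DKT's input encoding: x_t is the one-hot question tag
# concatenated with the same one-hot vector if the answer is correct,
# or with a zero vector otherwise, giving x_t in {0,1}^{2M}.
def encode_interaction(q, a, M):
    question = np.zeros(M)
    question[q] = 1.0          # one-hot question tag, delta(q_t)
    answer = question * a      # one-hot if correct (a=1), zeros if not
    return np.concatenate([question, answer])

x = encode_interaction(q=2, a=0, M=5)  # question 2 answered incorrectly
```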
After the transformation, DKT passes x_t to the hidden layer and computes the hidden state h_t using the vanilla RNN or the LSTM-RNN. As the hidden state summarizes the information from the past, the hidden state in DKT can be conceived as the student's latent knowledge state resulting from his or her past learning trajectory. This latent knowledge state is then fed to the output layer to compute the output vector y_t, which represents the probabilities of answering each question correctly. For a student i with a sequence of question-and-answer interactions of length T_i, the DKT model maps the inputs (x_1^i, x_2^i, ..., x_{T_i}^i) to the outputs (y_1^i, y_2^i, ..., y_{T_i}^i) accordingly.
The objective of DKT is to predict the performance of the next interaction, so the target prediction is extracted by taking the dot product of the output vector y_t and the one-hot encoded vector of the next question, δ(q_{t+1}). Based on the predicted output y_t · δ(q_{t+1}) and the target output a_{t+1}, the loss function L can be expressed as follows:
(4) L = (1 / Σ_{i=1}^{n} (T_i - 1)) · Σ_{i=1}^{n} Σ_{t=1}^{T_i - 1} l(y_t^i · δ(q_{t+1}^i), a_{t+1}^i)
where n is the number of students and l(·, ·) is the cross-entropy loss, i.e., l(p, a) = -(a log p + (1 - a) log(1 - p)).
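A minimal NumPy sketch of the objective in eq. (4), assuming per-student arrays of predictions, question indices, and answer labels (the data layout and names are illustrative):

```python
import numpy as np

# Sketch of the DKT loss, eq. (4): each output vector y_t is dotted with the
# one-hot tag of the NEXT question, and the cross-entropy against the next
# answer label is averaged over all next-step predictions.
def cross_entropy(p, a):
    return -(a * np.log(p) + (1 - a) * np.log(1 - p))

def dkt_loss(outputs, questions, answers):
    """outputs: list (per student) of (T, M) prediction arrays;
    questions/answers: lists (per student) of length-T index/label arrays."""
    total, count = 0.0, 0
    for y, q, a in zip(outputs, questions, answers):
        for t in range(len(q) - 1):
            p_next = y[t, q[t + 1]]          # y_t . delta(q_{t+1})
            total += cross_entropy(p_next, a[t + 1])
            count += 1
    return total / count

# toy check: one student, 3 interactions, M = 2 questions, all predictions 0.5
y = np.full((3, 2), 0.5)
loss = dkt_loss([y], [np.array([0, 1, 0])], [np.array([1, 0, 1])])
```

With every prediction at 0.5, each of the two next-step terms contributes log 2, so the averaged loss is exactly log 2.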
While we were replicating the experiments on the original DKT proposed in [?] using the skill builder dataset provided by ASSISTments in 2009 (denoted ASSIST2009; more information about the dataset can be found at https://sites.google.com/site/assistmentsdata/home/assistment20092010data), we noticed two major problems in the prediction results of DKT. First, it sometimes fails to reconstruct the input observation, in the sense that the prediction is counterintuitive: when a student answers a question of skill s_i correctly/incorrectly, the predicted probability of that student answering s_i correctly sometimes decreases/increases instead. Second, the predicted knowledge state is wavy and inconsistent over time. This is not desirable because a student's knowledge state is expected to transition only gradually and steadily over time. Therefore, we propose three regularization terms to rectify the consistency problems in the prediction of DKT: a reconstruction error r to resolve the reconstruction problem, and waviness measures w1 and w2 to smoothen the predicted knowledge state transitions.
As we saw from Figure 1, when the student answers s32 incorrectly, the probability of correctly answering s32 grows significantly compared with the previous timestep. This problem can be attributed to the loss function defined in the DKT model (eq. (4)). Specifically, the loss function takes only the predicted performance of the next interaction into account, but not the predicted performance of the current one. Accordingly, when the input order ((s32, 0), (s33, 0)) occurs frequently enough, the DKT model will tend to learn that if a student answers s32 incorrectly, he/she will likely answer s33 incorrectly, but not s32. Consequently, the prediction result is counterintuitive with respect to the currently observed input.
However, one might argue that such a transition in prediction reveals that s32 is a prerequisite of s33. (We note that s32 is "Ordering Positive Decimals" and s33 is "Ordering Fractions".) This is because the predicted performance for s33 is lower only when the DKT model receives (s32, 0), but it is higher when the DKT model receives (s32, 1). To counter this argument, we proceed by contradiction. We hypothesize that if s32 is indeed a prerequisite of s33, then when a student answers s32 incorrectly in the current timestep, it is more probable that he/she will answer s33 incorrectly in the next timestep, but not vice versa. To verify this hypothesis, Tables 1 and 2 tabulate the frequency counts of cases where s32 and s33 appear consecutively in different orders. According to the hypothesis, the lower-right cell should have a larger value than the lower-left cell in Table 1, but not in Table 2.
Table 1: Frequency counts when s32 (current) is followed by s33 (next).

                              Next = s33
                         Correct   Incorrect   Total
Current = s32  Correct     1543       159       1702
               Incorrect     81       367        448
               Total       1624       526       2150
Table 2: Frequency counts when s33 (current) is followed by s32 (next).

                              Next = s32
                         Correct   Incorrect   Total
Current = s33  Correct     1362        72       1434
               Incorrect     90       361        451
               Total       1452       433       1885
From Table 1, we can see that if a student answers s32 incorrectly in the current timestep, it is indeed more probable that he/she will answer s33 incorrectly in the next timestep. However, Table 2 shows that if a student answers s33 incorrectly, it is likewise more probable that he/she will answer s32 incorrectly in the next timestep. This means that an inverse dependency also exists, contradicting the above hypothesis, so the claim that s32 is a prerequisite of s33 becomes questionable. Moreover, the distributions of these two matrices suggest that s32 and s33 are likely to be interdependent and acquired simultaneously.
If s32 is not a prerequisite of s33 and they should be acquired at the same time, then there should be room for improvement in how DKT handles inputs like (s32, 0) and (s33, 0). As mentioned above, the loss function considers only the predicted performance of the next interaction and ignores the current one. An immediate remedy to alleviate the problem is to regularize the DKT model by also taking into account the loss between the prediction and the current interaction. By doing so, the model will adjust its prediction with respect to the current input. Thus, a regularization term r for the reconstruction problem is defined as follows:
(5) r = (1 / Σ_{i=1}^{n} T_i) · Σ_{i=1}^{n} Σ_{t=1}^{T_i} l(y_t^i · δ(q_t^i), a_t^i)
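The reconstruction regularizer of eq. (5) differs from the main loss only in scoring the prediction against the current interaction rather than the next one. A sketch under the same assumed per-student data layout (names are ours):

```python
import numpy as np

# Sketch of the reconstruction regularizer, eq. (5): cross-entropy between
# y_t . delta(q_t) and the CURRENT answer label a_t, averaged over all steps.
def cross_entropy(p, a):
    return -(a * np.log(p) + (1 - a) * np.log(1 - p))

def reconstruction_term(outputs, questions, answers):
    total, count = 0.0, 0
    for y, q, a in zip(outputs, questions, answers):
        for t in range(len(q)):
            total += cross_entropy(y[t, q[t]], a[t])  # y_t . delta(q_t) vs a_t
            count += 1
    return total / count

# toy check: one student answers question 0 twice (correct then incorrect)
y = np.array([[0.9, 0.5],
              [0.2, 0.5]])
r = reconstruction_term([y], [np.array([0, 0])], [np.array([1, 0])])
```

Here the two terms are -log 0.9 (correct answer predicted at 0.9) and -log 0.8 (incorrect answer predicted at 0.2), averaged.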
The second problem is the wavy transition in the student's predicted knowledge state. This problem may be attributed to the hidden state representation in the RNN. The hidden state h_t is determined by the previous hidden state h_{t-1} and the current input x_t, and it summarizes the student's latent knowledge state over all the exercises in one single hidden layer. Although it is difficult to explicate how the elements in the hidden layer influence the predicted performance of the KCs, it is plausible to confine the hidden state representation to be more invariant via regularization over the output layer.
We define two waviness measures, w1 and w2, as regularization terms to smoothen the transition in prediction:
(6) w1 = ( Σ_{i=1}^{n} Σ_{t=1}^{T_i - 1} ||y_{t+1}^i - y_t^i||_1 ) / ( M · Σ_{i=1}^{n} (T_i - 1) )
(7) w2² = ( Σ_{i=1}^{n} Σ_{t=1}^{T_i - 1} ||y_{t+1}^i - y_t^i||_2² ) / ( M · Σ_{i=1}^{n} (T_i - 1) )
To quantify how disparate two consecutive prediction vectors are, both the L1 norm and the square of the L2 norm are used to measure the difference between the prediction results y_t and y_{t+1}. This is similar to elastic net regularization [?]. The two measures are averaged over the total number of prediction transitions and the number of KCs M. Thus, the magnitude of w1 can be seen as the average change of each component of the output vector between y_t and y_{t+1}. The larger the values of w1 and w2, the more inconsistent the transitions in the model.
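The two waviness measures (denoted w1 and w2 here) can be computed directly from consecutive output vectors. A sketch with an illustrative (T, M) prediction array:

```python
import numpy as np

# Sketch of the waviness measures, eqs. (6)-(7): L1 and squared-L2 norms of
# consecutive prediction differences, averaged over transitions and over the
# M KCs, so each value reads as an average per-component change.
def waviness(outputs, M):
    l1, l2, steps = 0.0, 0.0, 0
    for y in outputs:                 # y has shape (T, M)
        diff = y[1:] - y[:-1]         # y_{t+1} - y_t for each transition
        l1 += np.abs(diff).sum()
        l2 += (diff ** 2).sum()
        steps += len(y) - 1
    return l1 / (M * steps), l2 / (M * steps)

# toy check: T = 3 timesteps, M = 2 KCs
y = np.array([[0.5, 0.5],
              [0.7, 0.5],
              [0.7, 0.1]])
w1, w2 = waviness([y], M=2)
```

The per-step differences are (0.2, 0) and (0, -0.4), so w1 = 0.6 / 4 = 0.15 and w2 = 0.2 / 4 = 0.05.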
In summary, the original loss function L is augmented with the three regularization terms to give the following regularized loss function:
(8) L' = L + λ_r · r + λ_w1 · w1 + λ_w2 · w2²
where λ_r, λ_w1 and λ_w2 are the regularization parameters. By training with this new loss function, the DKT model is expected to address both the reconstruction and the wavy transition problems.
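Putting the pieces together, the regularized objective of eq. (8) is a simple weighted sum. The lambda values in this sketch are illustrative placeholders, not the tuned settings from the experiments:

```python
# Sketch of the regularized objective, eq. (8): the original loss plus the
# three weighted penalty terms. All numeric values below are illustrative.
def regularized_loss(L, r, w1, w2_sq, lam_r=0.10, lam_w1=0.03, lam_w2=3.0):
    return L + lam_r * r + lam_w1 * w1 + lam_w2 * w2_sq

total = regularized_loss(L=0.60, r=0.50, w1=0.08, w2_sq=0.01)
```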
In the following experiments, 20% of the data is used as a test set and the other 80% as a training set. Furthermore, 5-fold cross-validation is applied on the training set to select the hyperparameter setting. The test set is used to evaluate the model and also to perform early stopping [?]. The weights of the RNN are initialized randomly from a Gaussian distribution with zero mean and small variance. For a fair comparison, we follow the hyperparameter setting in [?] even though it may not be optimal. A single-layer LSTM-RNN with a state size of 200 is used as the basis of the DKT model. The learning rate and the dropout rate are set to 0.01 and 0.5, respectively, and the norm clipping threshold is consistently set to 3.0. Moreover, our preliminary experiment on ASSIST2009 shows that using the exercise tag as the question tag induces data sparsity and results in poor performance (an AUC of 0.73 is obtained with the 26,668 exercise IDs, versus an AUC of 0.82 with the 124 unique skill IDs), so we adopt the skill tag as the question tag in the following experiments.
We perform a hyperparameter search over the regularization parameters λ_r, λ_w1 and λ_w2. At first, each parameter is examined separately to identify a range of values giving good results according to the evaluation measures explained later. The initial search ranges for λ_r, λ_w1 and λ_w2 are {0, 0.25, 0.5, 1.0}, {0, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0, 3.0, 10.0}, and {0, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0, 3.0, 10.0, 30.0, 100.0}, respectively. After narrowing down the range of each parameter, a grid search over combinations of λ_r, λ_w1 and λ_w2 is conducted. The final search ranges for λ_r, λ_w1 and λ_w2 are {0, 0.05, 0.10, 0.15, 0.20, 0.25}, {0, 0.01, 0.03, 0.1, 0.3, 1.0}, and {0, 0.3, 1.0, 3.0, 10.0, 30.0, 100.0}, respectively.
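The final grid search described above can be enumerated with itertools.product; the value ranges are copied from the text:

```python
from itertools import product

# Sketch of the final-stage grid search: every combination of the narrowed
# regularization-parameter ranges is a candidate setting to cross-validate.
lambda_r = [0, 0.05, 0.10, 0.15, 0.20, 0.25]
lambda_w1 = [0, 0.01, 0.03, 0.1, 0.3, 1.0]
lambda_w2 = [0, 0.3, 1.0, 3.0, 10.0, 30.0, 100.0]

grid = list(product(lambda_r, lambda_w1, lambda_w2))
# each entry is one (lambda_r, lambda_w1, lambda_w2) setting
```

With 6 x 6 x 7 values, the grid contains 252 settings; the all-zero entry recovers the unregularized DKT baseline.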
The performance of DKT is customarily evaluated by the AUC, which provides a robust metric for evaluating binary predictions. An AUC score of 0.5 indicates that the model performs only as well as random guessing. In this paper, we report not only the AUC for the next-performance prediction (named AUC(N) in this paper for clarity), which is equivalent to the evaluation in [?], but also five other quantities concerning the reconstruction accuracy and consistency of the input observation as well as the waviness of the prediction result.
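For reference, AUC can be computed from the pooled predictions and labels via its rank-statistic form, i.e., the probability that a randomly chosen positive is scored above a randomly chosen negative (ties counted as half). This is a generic sketch, not the paper's evaluation code:

```python
import numpy as np

# Rank-statistic form of AUC: P(score of random positive > score of random
# negative), with ties counted as 0.5. Equivalent to the area under the ROC.
def auc(scores, labels):
    scores, labels = np.asarray(scores, dtype=float), np.asarray(labels)
    pos, neg = scores[labels == 1], scores[labels == 0]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

a = auc([0.9, 0.8, 0.4, 0.3], [1, 1, 0, 0])  # perfectly separated scores
```

Perfectly separated scores give an AUC of 1.0, while indistinguishable scores give 0.5, matching the random-guess baseline mentioned above.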
For the reconstruction accuracy, the AUC for the current-performance prediction (called AUC(C)) is used. With regard to the consistency in prediction of the input observation, two more quantities, m1 and m2, are defined to measure the consistency between the observed input and the change in the corresponding prediction. For a single student i at time t, we define
(9)  m1_{i,t} = sign((y_t^i - y_{t-1}^i) · δ(q_t^i))   if a_t^i = 1,
     m1_{i,t} = -sign((y_t^i - y_{t-1}^i) · δ(q_t^i))  if a_t^i = 0
(10) m1 = ( Σ_{i=1}^{n} Σ_{t=2}^{T_i} m1_{i,t} ) / ( Σ_{i=1}^{n} (T_i - 1) )

and

(11) m2_{i,t} = (y_t^i - y_{t-1}^i) · δ(q_t^i)   if a_t^i = 1,
     m2_{i,t} = -(y_t^i - y_{t-1}^i) · δ(q_t^i)  if a_t^i = 0
(12) m2 = ( Σ_{i=1}^{n} Σ_{t=2}^{T_i} m2_{i,t} ) / ( Σ_{i=1}^{n} (T_i - 1) )
Accordingly, when the model gives a correct change in prediction with respect to the input, we obtain positive values of m1_{i,t} and m2_{i,t}; otherwise, negative values are obtained. A positive value of m1 indicates that more than half of the predictions change in agreement with the input observations; a zero value implies that half of the predictions change in the right direction while the other half change in the wrong direction; a negative value means that more than half of the predictions change in the wrong direction. A similar interpretation holds for m2, though it takes both the direction and the magnitude of the change into account. Accordingly, the higher the values of m1 and m2, the better the model is from the perspective of prediction consistency for the current observation.
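A sketch of the consistency measures defined above (denoted m1 and m2 here), assuming the same per-student data layout as in the earlier sketches (names are ours):

```python
import numpy as np

# Sketch of the consistency measures: for each observed interaction, the
# change in the predicted probability of the answered KC is compared against
# the answer label, by sign for m1 and by signed magnitude for m2, then
# averaged over all transitions.
def consistency(outputs, questions, answers):
    m1_sum, m2_sum, count = 0.0, 0.0, 0
    for y, q, a in zip(outputs, questions, answers):
        for t in range(1, len(q)):
            delta = y[t, q[t]] - y[t - 1, q[t]]   # change in prediction of q_t
            direction = 1.0 if a[t] == 1 else -1.0
            m1_sum += direction * np.sign(delta)
            m2_sum += direction * delta
            count += 1
    return m1_sum / count, m2_sum / count

# toy check: the student answers question 0 correctly at t = 1 and the
# model's prediction for question 0 rises from 0.5 to 0.8 -- consistent.
y = np.array([[0.5, 0.5],
              [0.8, 0.4]])
m1, m2 = consistency([y], [np.array([0, 0])], [np.array([1, 1])])
```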
In addition, the waviness measures w1 and w2 are used as performance measures to quantify the consistency in prediction of the other KCs in the model. We deem that a good DKT model should achieve a high AUC score while maintaining low waviness values.
ASSIST2009. This dataset is provided by the ASSISTments online tutoring platform and has been used in several papers for the evaluation of DKT models. Owing to the existence of duplicated records in the original dataset [?], we removed them before conducting our experiments. The resulting dataset contains 4,417 students with 328,291 question-answering interactions on 124 skills. Some of the students in this dataset are used for visualizing the prediction results.
ASSIST2015. This dataset contains 19,917 student responses on 100 skills with a total of 708,631 question-answering interactions. Although it contains more interactions than ASSIST2009, the average number of records per skill and per student is actually smaller owing to the larger number of students.
ASSISTChall. This dataset was made available for the 2017 ASSISTments data mining competition. It is richer in terms of the average number of records per student, as there are 686 students with 942,816 interactions and 102 skills.
statics2011. This dataset is obtained from an engineering statics course, with 189,927 interactions from 333 students and 1,223 exercise tags. We adopted the processed data provided by [?] with a train/test split ratio of 70:30, and exercise tags are used.
Simulated-5. Piech et al. [?] also simulated the answering trajectories of 2,000 virtual students, in which the exercises are drawn from five virtual concepts. Each student answers the same sequence of 50 exercises, each of which is associated with a single concept k and a difficulty level β, while each student is assigned an ability α_k of solving tasks related to that concept. The probability of a student answering an exercise correctly is defined based on conventional item response theory as p(correct | α_k, β) = c + (1 - c) / (1 + exp(β - α_k)), where c denotes the probability of guessing it correctly.
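The generation process described here can be sketched under the stated IRT form. The distributions used to draw abilities and difficulties, and the guessing floor of 0.25, are illustrative assumptions since the text does not fix them:

```python
import numpy as np

# Sketch of a Simulated-5-style generator: abilities and difficulties are
# drawn from assumed standard normals, and responses are sampled from the
# IRT probability with a guessing floor c (assumed 0.25 here).
rng = np.random.default_rng(42)
n_students, n_exercises, c = 4, 50, 0.25
ability = rng.normal(size=n_students)        # alpha per student (assumed N(0,1))
difficulty = rng.normal(size=n_exercises)    # beta per exercise (assumed N(0,1))

def p_correct(alpha, beta, guess=c):
    # item response theory with a guessing floor
    return guess + (1 - guess) / (1 + np.exp(beta - alpha))

probs = p_correct(ability[:, None], difficulty[None, :])    # (students, exercises)
responses = (rng.random(probs.shape) < probs).astype(int)   # sampled 0/1 labels
```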
The experiment results are shown in Table 3, which compares the DKT models with and without regularization with respect to all of the evaluation measures. For clarity, the DKT model without regularization is simply denoted DKT, while the DKT model with regularization is denoted DKT+.
For the ASSIST2009 dataset, DKT achieves an average test AUC(N) of 0.8212, while DKT+ performs slightly better with an AUC(N) of 0.8227. However, DKT+ shows a considerable improvement in AUC(C), with an increase from 0.9044 to 0.9625. The waviness quantities also decrease significantly, from 0.0830 to 0.0229 for w1 and from 0.1279 to 0.0491 for w2. Moreover, although DKT already makes more than half of its predictions change in the right direction, DKT+ further uplifts m1 from 0.3002 to 0.4486 and m2 from 0.0156 to 0.0573.
Similar improvements in the evaluation measures are observed on ASSIST2015 as well. DKT+ retains an AUC(N) of 0.7371, similar to that of DKT. The values of AUC(C), m1 and m2 are boosted to 0.9233, 0.8122 and 0.0591, respectively. Moreover, the values of w1 and w2 for DKT+ are only half of those for DKT.
As for ASSISTChall, although the AUC(N) slightly decreases from 0.7343 to 0.7285 for DKT+, the improvement with respect to the other evaluation criteria is very significant. DKT+ pushes the AUC(C) from 0.7109 to 0.8570 and reduces w1 from 0.0690 to 0.0147 and w2 from 0.1045 to 0.0301. Moreover, DKT+ also improves m1 from 0.1151 to 0.3052 and lifts m2 to 0.0441.
For statics2011, a noticeable increase is observed in both AUC(N) and AUC(C), from 0.8159 to 0.8349 and from 0.7404 to 0.9038, respectively. Moreover, w1 and w2 shrink from 0.1358 to 0.0074 and from 0.1849 to 0.0130, respectively. This substantial decrease in w1 and w2 can be ascribed to the large number of exercises in the dataset, since the waviness regularizers act to confine the prediction changes on those exercises that are unrelated to the input; with a potentially large number of unrelated exercises, w1 and w2 shrink significantly as a result. DKT+ also remedies the situation in which DKT makes more than half of its predictions change in the wrong direction: the values of m1 and m2, which are negative for DKT, surge to 0.47597 and 0.05712, respectively.
With regard to Simulated-5, DKT and DKT+ result in a similar AUC(N) of 0.8252 and 0.8264, respectively. However, DKT+ gives a huge improvement in AUC(C), m1 and m2: it boosts the AUC(C) from 0.8642 to 0.9987 and raises m1 to 0.9064, with a corresponding gain in m2. This means the DKT+ model makes almost all of the predictions and the prediction changes for the input exercise correct. Moreover, the waviness in the prediction transition is also reduced.
In summary, the experiment results reveal that regularization based on r, w1 and w2 effectively alleviates the reconstruction problem and the wavy transition in prediction without sacrificing the prediction accuracy for the next interaction. Furthermore, for some combinations of λ_r, λ_w1 and λ_w2, DKT+ even slightly outperforms DKT in AUC(N).
Apart from the experiment results, we plot Figures 3 and 4 to better understand how the regularizers for reconstruction and waviness affect the performance with respect to the different evaluation measures.
In Figure 3, we plot the average test AUC(N) and AUC(C) for different values of λ_r over all combinations of λ_w1 and λ_w2. It is observed that the higher λ_r is, the higher the AUC(C) achieved on all of the datasets. On the other hand, AUC(N) generally decreases as λ_r increases, but the degradation in AUC(N) is insignificant compared with the improvement in AUC(C). This reveals that the reconstruction regularizer robustly resolves the reconstruction problem without sacrificing much of the performance in AUC(N). Moreover, from the results in Table 3, we are usually able to find a combination of λ_r, λ_w1 and λ_w2 that gives a comparable or even better AUC(N). This implies that the waviness regularizers can help to mitigate the slight degradation in AUC(N) incurred by the reconstruction regularizer.
To see how the regularization parameters λ_w1 and λ_w2 affect the evaluation measures, their 3D mesh plots, with a fixed λ_r, are shown in Figure 4 for the ASSIST2009 dataset. The AUC(N) has a relatively flat and smooth surface when λ_w1 lies between 0.0 and 1.0 and λ_w2 lies between 0.0 and 10.0. Within this region, the DKT+ model also attains a higher AUC(C) value, between 0.94 and 0.96. The performance in AUC(N) and AUC(C) starts to decline when λ_w1 and λ_w2 exceed 1.0 and 10.0, respectively. This suggests that the model's performance in AUC(N) and AUC(C) has low sensitivity to the hyperparameters λ_w1 and λ_w2. As for the waviness measures w1 and w2, they decrease in a bell-like shape as λ_w1 and λ_w2 increase. With regard to the consistency measures, even though the mesh surface is a bit bumpy, m1 increases along with larger values of λ_w1 and λ_w2 within the same region mentioned above. This observation implies that both the reconstruction regularizer and the waviness regularizers help to improve the prediction consistency for the current input. On the other hand, m2 shows a decreasing trend for larger values of λ_w1 and λ_w2. This is reasonable because the waviness regularizers reduce the prediction change between outputs, and thus the value of m2, which takes the change in magnitude into account, is reduced. All in all, the robustness of the waviness regularizers is ascertained thanks to the low sensitivity of the prediction accuracy (AUC(N) and AUC(C)), the observable reduction in the waviness measures (w1 and w2), and the improvement in the consistency measures (m1 and m2).
In addition to the overall improvement in the evaluation measures, the prediction results of DKT and DKT+ for the student (id 1) are visualized to give a better sense of the regularization's effect on the prediction. Specifically, the prediction results are shown in two line plots in addition to the heatmap plot. The first line plot illustrates the change in prediction for all of the answered skills under the DKT/DKT+ model, while the second line plot highlights the change in prediction of each skill between DKT and DKT+, showing their differences in prediction. Concretely, the DKT shows a relatively wavy transition of the knowledge state across timesteps. Moreover, in DKT the predicted knowledge states of most of the skills share the same directional change in prediction. This means that when a student answers a question wrongly, the predicted mastery levels of most skills decrease simultaneously, and vice versa. This is intuitively untrue, as answering one skill wrongly does not necessarily lead to a knowledge fade in other skills. The DKT+, on the other hand, demonstrates a notably smoother prediction transition. When DKT+ receives new inputs, the changes in prediction for the other skills across timesteps are not as wavy as those in DKT, revealing that DKT+ retains the latent knowledge state in the RNN for those skills from the previous timesteps. This consistent prediction alleviates the misinterpretation of the student's knowledge state caused by the wavy transition problem and enhances the interpretability of the knowledge state in DKT.
This paper points out two problems which arise when interpreting DKT's prediction results: (1) the reconstruction problem, and (2) the wavy transition in prediction. Both problems are undesirable because they may mislead the interpretation of students' knowledge states. We thereby proposed three regularization terms for enhancing the consistency of prediction in DKT. One of them is the reconstruction error r, whose effect is evaluated by AUC(C) and the consistency measures. The other two are the waviness measures w_1 and w_2, which are norms measuring the change between two consecutive prediction output vectors and are used directly as evaluation measures. Experiments show that these regularizers effectively alleviate the two problems without sacrificing the prediction accuracy (AUC(N)) on the original task of predicting the next interaction performance.
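A minimal sketch of the waviness measures and the regularized loss summarized above, assuming w_1 and w_2 average the L1 norm and the squared L2 norm of consecutive prediction differences over the M KCs and T-1 transitions (the exact normalization and the λ values here are our assumptions, not the paper's tuned settings):

```python
import numpy as np

def waviness_measures(Y):
    """Waviness of a prediction sequence Y with shape (T, M):
    T timesteps, M knowledge components. w1 averages the L1 norm of
    consecutive prediction differences; w2 averages the squared L2 norm."""
    T, M = Y.shape
    diffs = np.diff(Y, axis=0)              # (T-1, M) consecutive changes
    w1 = np.abs(diffs).sum() / (M * (T - 1))
    w2 = (diffs ** 2).sum() / (M * (T - 1))
    return w1, w2

def regularized_loss(base_loss, rec_loss, w1, w2,
                     lam_r=0.1, lam_w1=0.3, lam_w2=3.0):
    """L' = L + lambda_r * r + lambda_w1 * w1 + lambda_w2 * w2
    (the lambda defaults are placeholders, not the paper's values)."""
    return base_loss + lam_r * rec_loss + lam_w1 * w1 + lam_w2 * w2
```

A perfectly flat prediction sequence yields w_1 = w_2 = 0, so the regularizers penalize only the timestep-to-timestep change in the output vectors.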
Although the reconstruction regularizer improves the performance with respect to AUC(C) and the waviness regularizers reduce the waviness in the prediction transition, it is hard to say how low the values of w_1 and w_2 should be for a model to qualify as a good KT model. Ideally, a KT model should only change those prediction components which are related to the current input, while keeping the other components unchanged or changing them only slightly for some other subtle reasons, e.g., forgetting. Nevertheless, the KC-dependency graphs differ from one dataset to another, so different values of w_1 and w_2 are expected in their ideal KT models.
Moreover, more work should be done on improving the stability and accuracy of the prediction for unseen data, more specifically the unobserved KCs. The objective function and the evaluation measures for the DKT+ take only the current and next interactions into account; there is no consideration of interactions further in the future, let alone evaluation measures for the prediction precision on unobserved KCs. Yet, unobserved KCs are of vital importance because an ITS should make personalized recommendations of learning materials to students over not only the observed KCs, but also the unobserved ones. An accurate estimation of the unobserved KCs will help an ITS provide more intelligent pedagogical guidance to students. One possible extension of this work is therefore to take further future interactions into account when training the DKT model:
\mathcal{L} = \frac{1}{Z} \sum_{t=1}^{n-1} \sum_{j=1}^{n-t} \gamma^{j-1}\, l\big(\mathbf{y}_t \cdot \boldsymbol{\delta}(q_{t+j}),\, a_{t+j}\big) \qquad (13)
where Z is the normalizing term and γ is the decay factor similar to that in reinforcement learning. This potentially leads the DKT model to learn a more robust representation of the latent knowledge state.
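A minimal NumPy sketch of this future-aware objective, under our reading of Eq. (13): at each timestep t, the prediction y_t is penalized not only on the next interaction but on all future ones, discounted by γ^(j-1), with Z normalizing by the total discount weight. The loop bounds and the form of Z are our assumptions, not the authors' implementation:

```python
import numpy as np

def binary_ce(p, a):
    """Cross-entropy l(p, a) for a binary response a in {0, 1}."""
    eps = 1e-12
    return -(a * np.log(p + eps) + (1 - a) * np.log(1 - p + eps))

def future_aware_loss(Y, q_ids, answers, gamma=0.9):
    """Decayed future-interaction loss.
    Y: (T, M) predicted mastery per KC at each timestep.
    q_ids: KC id queried at each timestep; answers: 0/1 correctness."""
    T = Y.shape[0]
    total, Z = 0.0, 0.0
    for t in range(T - 1):
        for j in range(1, T - t):        # all future interactions of t
            w = gamma ** (j - 1)         # discount, as in reinforcement learning
            p = Y[t, q_ids[t + j]]       # predicted mastery of the queried KC
            total += w * binary_ce(p, answers[t + j])
            Z += w
    return total / Z
```

With γ = 1 every future interaction is weighted equally, while γ = 0 recovers the original next-step-only objective.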
This research has been supported by the project ITS/205/15FP from the Innovation and Technology Fund of Hong Kong.
 1
 2 Hao Cen, Kenneth Koedinger, and Brian Junker. 2006. Learning factors analysis – a general method for cognitive model evaluation and improvement. In Proceedings of the 8th International Conference in Intelligent Tutoring Systems. Springer, Berlin, Heidelberg, 164–175.
 3 Albert T. Corbett and John R. Anderson. 1995. Knowledge tracing: modeling the acquisition of procedural knowledge. User Modeling and User-Adapted Interaction 4, 4 (March 1995), 253–278.
 4 José González-Brenes, Yun Huang, and Peter Brusilovsky. 2014. General features in knowledge tracing to model multiple subskills, temporal item response theory, and expert knowledge. In Proceedings of the 7th International Conference on Educational Data Mining. 84–91.
 5 William J. Hawkins, Neil T. Heffernan, and Ryan S.J.D. Baker. 2014. Learning Bayesian knowledge tracing parameters with a knowledge heuristic and empirical probabilities. In Proceedings of the 12th International Conference on Intelligent Tutoring Systems. Springer, Cham, 150–155.
 6 Tanja Käser, Severin Klingler, Alexander G. Schwing, and Markus Gross. 2017. Dynamic Bayesian networks for student modeling. IEEE Transactions on Learning Technologies 10 (March 2017), 450–462.
 7 Mohammad Khajah, Robert V. Lindsey, and Michael C. Mozer. 2016. How deep is knowledge tracing. In Proceedings of the 9th International Conference on Educational Data Mining. 94–101.
 8 Zachary C. Lipton, John Berkowitz, and Charles Elkan. 2015. A critical review of recurrent neural networks for sequence learning. arXiv e-prints 1506.00019 (May 2015).
 9 Christopher Olah. 2015. Understanding LSTM networks. colah.github.io (August 2015). Retrieved December 10, 2017 from http://colah.github.io/posts/2015-08-Understanding-LSTMs/.
 10 Zachary A. Pardos and Neil T. Heffernan. 2010. Modeling individualization in a Bayesian networks implementation of knowledge tracing. In Proceedings of the 18th International Conference on User Modeling, Adaptation, and Personalization, Vol. 6075. Springer, Berlin, Heidelberg, 255–266.
 11 Zachary A. Pardos and Neil T. Heffernan. 2011. KT-IDEM: introducing item difficulty to the knowledge tracing model. In Proceedings of the 19th International Conference on User Modeling, Adaption and Personalization, Vol. 6787. Springer, 243–254.
 12 Philip I. Pavlik, Hao Cen, and Kenneth R. Koedinger. 2009. Performance factors analysis – a new alternative to knowledge tracing. In Proceedings of the 14th International Conference on Artificial Intelligence in Education. IOS Press, Amsterdam, Netherlands, 531–538.
 13 Chris Piech, Jonathan Bassen, Jonathan Huang, Surya Ganguli, Mehran Sahami, Leonidas J. Guibas, and Jascha Sohl-Dickstein. 2015. Deep knowledge tracing. In Advances in Neural Information Processing Systems. 505–513.
 14 Lutz Prechelt. 1998. Automatic early stopping using cross validation: quantifying the criteria. Neural Networks 11, 4 (June 1998), 761–767.
 15 Siddharth Reddy, Igor Labutov, and Thorsten Joachims. 2016. Learning student and content embeddings for personalized lesson sequence recommendation. In Proceedings of the 3rd ACM Conference on Learning @ Scale. ACM, New York, NY, USA, 93–96.
 16 Steven Tang, Joshua C. Peterson, and Zachary A. Pardos. 2016. Modelling student behavior using granular large scale action data from a MOOC. arXiv e-prints 1608.04789 (August 2016).
 17 Lisa Wang, Angela Sy, Larry Liu, and Chris Piech. 2017. Learning to represent student knowledge on programming exercises using deep learning. In Proceedings of the 10th International Conference on Educational Data Mining. 324–329.
 18 Yutao Wang and Neil T. Heffernan. 2013. Extending knowledge tracing to allow partial credit: using continuous versus binary nodes. In Proceedings of the 13th International Conference on Artificial Intelligence in Education. Springer, Berlin, Heidelberg, 181–188.
 19 Kevin H. Wilson, Yan Karklin, Bojian Han, and Chaitanya Ekanadham. 2016. Back to the basics: Bayesian extensions of IRT outperform neural networks for proficiency estimation. In Proceedings of the 9th International Conference on Educational Data Mining. 539–544.
 20 Xiaolu Xiong, Siyuan Zhao, Eric Van Inwegen, and Joseph Beck. 2016. Going deeper with deep knowledge tracing. In Proceedings of the 9th International Conference on Educational Data Mining. 545–550.
 21 Michael V. Yudelson, Kenneth R. Koedinger, and Geoffrey J. Gordon. 2013. Individualized Bayesian knowledge tracing models. In Proceedings of the 16th International Conference on Artificial Intelligence in Education, Vol. 7926. Springer, Berlin, Heidelberg, 171–180.
 22 Jiani Zhang, Xingjian Shi, Irwin King, and Dit-Yan Yeung. 2017. Dynamic key-value memory networks for knowledge tracing. In Proceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland, 765–774.
 23 Hui Zou and Trevor Hastie. 2005. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67, 2 (2005), 301–320.