Intelligent Knowledge Tracing:
More Like a Real Learning Process of a Student
Knowledge tracing (KT) refers to a machine learning technique that assesses a student’s level of understanding (the so-called knowledge state) of a certain concept based on the student’s performance on problem solving. KT accepts a series of question-answer pairs as input and iteratively updates the knowledge state of the student, eventually returning the probability of the student solving an unseen question. From the viewpoint of neuroeducation (the field of applying neuroscience, cognitive science, and psychology to education), however, KT leaves much room for improvement in terms of explaining the complex process of human learning. In this paper, we identify three problems of KT (namely, non-adaptive knowledge growth, neglected latent information, and unintended negative influence) and propose a memory-network-based technique named intelligent knowledge tracing (IKT) to address them, thus moving one step closer to understanding the complex mechanism underlying human learning. In addition, we propose a new performance metric called correct update count (CUC) that measures the degree of unintended negative influence, thus quantifying how closely a student model resembles the human learning process. The proposed CUC metric complements the area under the curve (AUC) metric, allowing us to evaluate competing models more effectively. According to our experiments on a widely used public benchmark, IKT significantly (over two times) outperformed the existing KT approaches in terms of CUC, while preserving the correctness behavior measured in AUC.
1 Introduction
An intelligent tutoring system (ITS) Goodkovsky (2004); Brusilovsky et al. (1996) that provides educational services (i.e., lectures and exercises) online is widely used by many students and contributes to reducing inequality in education. For an ITS to provide high-quality education, tracing each individual’s level of understanding is necessary. Knowledge tracing (KT) is a machine learning task that identifies the current knowledge states of students based on their past performance Piech et al. (2015). Using KT, we can identify the weaknesses of individual students and build a policy that suggests educational content to help students learn better.
A student interacts with an ITS, and the ITS observes an interaction (a so-called knowledge growth signal) $(q_t, r_t)$ at time step $t$, where $q_t \in \{1, \dots, Q\}$ is an exercise tag (id), $Q$ is the number of all questions, and $r_t \in \{0, 1\}$ is the student's response. KT is a supervised learning problem in that, given past interactions $\{(q_1, r_1), \dots, (q_{t-1}, r_{t-1})\}$ and a new exercise $q_t$, it predicts the probability of answering correctly (i.e., $p(r_t = 1 \mid q_t)$) Corbett and Anderson (1994); Piech et al. (2015); Zhang et al. (2017); Lee et al. (2017). In every time step, KT updates the student knowledge state given the knowledge growth signal $(q_t, r_t)$.
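The KT loop above can be sketched as a toy scalar model (not any particular model from the literature; the function name `kt_predict`, the learning rate, and the per-tag running estimate are all illustrative assumptions):

```python
import numpy as np

# Toy sketch of the KT loop: the knowledge state is updated from each
# (exercise, response) pair, then queried for an unseen exercise.
def kt_predict(interactions, q_new, n_questions, lr=0.2):
    state = np.full(n_questions, 0.5)      # prior P(correct) for every exercise tag
    for q, r in interactions:              # knowledge growth signal (q_t, r_t)
        state[q] += lr * (r - state[q])    # move the estimate toward the response
    return state[q_new]                    # estimated P(answering q_new correctly)
```

Real KT models replace the per-tag scalar with a learned, high-dimensional knowledge state, but the iterative update-then-predict structure is the same.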
Understanding the student’s knowledge acquisition process and level of mastery is a complex and difficult research topic. Educational neuroscience, also known as neuroeducation Mehta (2009); Wright (2017), has been studied to apply psychological and brain-scientific findings to education, so as to model a student’s learning process more concretely Council et al. (2002). Some major theories in neuroscience have been applied to KT; however, there is still room for improvement. Recently, deep learning Min et al. (2017); Kim et al. (2018) based KT studies such as deep knowledge tracing (DKT) Piech et al. (2015) and dynamic key-value memory networks (DKVMN) Zhang et al. (2017) have been proposed, showing large performance improvements over previous hand-crafted models d Baker et al. (2008); Embretson and Reise (2013); Pavlik Jr et al. (2009); Cen et al. (2006). However, these deep learning-based methods do not reflect some actual learning processes of students discovered in neuroeducation.
The amount of knowledge growth through an exercise should be determined adaptively by prior knowledge (i.e., the current knowledge state) Wandersee et al. (1994); Brod et al. (2013). Modeling a student’s knowledge shift only with $(q_t, r_t)$, which previous studies mostly assume Corbett and Anderson (1994); Piech et al. (2015); Zhang et al. (2017), cannot reflect prior knowledge because $(q_t, r_t)$ is independent of the student’s current knowledge state. We define this as the non-adaptive knowledge growth problem.
In addition, in previous approaches the student’s knowledge change is incurred only through $(q_t, r_t)$, without using other side information such as a hint or advice. Such restricted input data cannot capture the student knowledge state concretely without latent information, a kind of hint or intuition that students acquire by becoming familiar with the concepts underlying questions Tolman (1948). We define this as the neglected latent information problem.
Intuitively, when a student correctly solves an exercise, the understanding of the related concept increases, so the probabilities of the student solving other questions should also increase. Importantly, a correct answer should not have a negative impact on the probabilities of the student solving other exercises. In other words, if we update the knowledge state through a positive knowledge growth signal $(q_t, r_t = 1)$, the probability of the student answering any exercise correctly should not decrease: $p_t(q) \ge p_{t-1}(q)$ for every exercise tag $q$. This problem occurs in previous deep learning-based studies and can be regarded as a negative influence problem, in which a positive knowledge growth signal exerts an unintended negative influence on other exercises. This is because the loss function of existing models focuses only on the predictive performance for the given exercise and does not monitor the probability change of other questions. A KT model with the negative influence problem cannot be used in real applications (e.g., testing and recommendation), because the internal operation of the model cannot be trusted. We therefore need a metric that measures the reliability of model behavior to complement the standard performance metric, the area under the curve (AUC).
In this paper, we propose intelligent knowledge tracing (IKT), which reflects the actual learning process of a student from the viewpoint of neuroeducation, missed in existing KT approaches, and resolves their limitations. Our contributions can be summarized as follows:
We analyze the limitations of existing KT models from the viewpoint of neuroeducation (i.e., the non-adaptive knowledge growth, neglected latent information, and unintended negative influence problems).
We propose three methods (i.e., adaptive knowledge growth, a counter memory, and a negative influence loss) to resolve the limitations of existing work in terms of neuroeducational findings.
To measure the unintended negative influence, we define a new metric, the correct update count (CUC), which can be regarded as an evaluation metric for the reliability of KT models.
2 Related Work
2.1 Traditional Knowledge Tracing
Traditional KT models that do not use deep learning basically require experts to label exercise tags directly. Such models have low complexity and cannot express the level of understanding continuously Corbett and Anderson (1994).
Bayesian knowledge tracing (BKT) Corbett and Anderson (1994), a prominent traditional KT model, assumes the student knowledge state to be binary and models the level of understanding using a hidden Markov model (HMM) Sonnhammer et al. (1998) for each concept. Since the BKT tracks the level of understanding separately for each concept, it does not take the entanglement between concepts into account, and hence dealing with a mixture of exercises involving various concepts is difficult.
2.2 Deep Learning-based Knowledge Tracing
The DKT Piech et al. (2015) treats the hidden state of a recurrent neural network LeCun et al. (2015); Park et al. (2017b); Lee et al. (2016); Bae et al. (2017) with long short-term memory (LSTM) Hochreiter and Schmidhuber (1997); Yi et al. (2018) as the student knowledge state, and hence assumes that a single hidden state represents the level of understanding of all concepts. The DKT differs from the BKT in that the DKT can deal with several concepts simultaneously across various questions and express the student knowledge state in a continuous manner instead of a binary form. Even though the DKT represents the knowledge state more expressively than the BKT, thanks to the high-dimensional and continuous representation power of the LSTM, a single representative hidden state cannot be disentangled into individual concepts. Therefore, in the DKT, tracking the level of understanding of each concept is challenging.
The DKVMN Zhang et al. (2017), a memory-augmented neural network Santoro et al. (2016a); Park et al. (2017a) based model, can analyze the level of understanding of each concept like the BKT and utilize the correlation between concepts like the DKT. The DKVMN adopts a key memory and a value memory: the key memory stores the representations of the concepts involved in exercises, and the value memory stores the student’s mastery level for each concept. With read and write operations on these two memories, the DKVMN updates the student knowledge state. Nevertheless, the DKVMN does not consider neuroeducational findings, leaving the non-adaptive knowledge growth, neglected latent information, and unintended negative influence problems unaddressed.
2.3 Educational Neuroscience
Educational neuroscience research specifies human brain activities while students learn. Humans understand by organizing the concepts of a body of knowledge Chi et al. (1981) and by using prior knowledge. These neuroeducational findings can be connected to tracing the student’s knowledge.
Latent learning is a term from cognitive psychology for learning that occurs without apparent reward Tolman (1948). In the original experiment, the average error of the group that received a delayed reward quickly caught up with that of the group rewarded from the beginning. This suggests that latent information can help to perform the task. Thus, the knowledge state of a student can be predicted more accurately by properly reflecting latent information in the KT model.
We compare human learning, the DKT, the DKVMN, and IKT in Table 1. The DKT stores a student’s knowledge state in the LSTM’s hidden neurons and uses the previous hidden state to update the knowledge state. The DKVMN and IKT store the student knowledge state in a value memory and use a key memory to organize the concepts of an exercise. IKT applies adaptive knowledge growth to reflect the student’s prior knowledge and is the only model that implements the latent learning of humans by adopting a counter memory. IKT also adds a loss term to regularize the negative influence problem.
3 Intelligent Knowledge Tracing
Figure 1 shows our proposed IKT model, a memory-augmented neural network Khan et al. (2018); Santoro et al. (2016a, b) with three types of memory: a key memory, a value memory, and a counter memory. The key memory stores a high-dimensional embedding of each of the $N$ latent concepts in its slots. Each slot of the value memory represents the student’s mastery level of a concept.
The trained DKVMN model performs three major processes: attention, read, and write. Given an exercise tag $q_t$ and the student knowledge state at time step $t$, the attention process produces an attention vector $\mathbf{w}_t$ dividing the question into its corresponding concepts. The read process receives the weight vector $\mathbf{w}_t$ from the attention process and outputs the probability $p_t$ of answering correctly. The write process receives the exercise $q_t$ and $\mathbf{w}_t$, as well as the information on whether the student’s answer is correct, and updates the value memory ($\mathbf{M}^v_{t-1} \to \mathbf{M}^v_t$) by adding and erasing values.
IKT adds adaptive knowledge growth, a counter memory, and an additional loss term. In addition, we define the CUC as a metric for measuring unintended negative influence, which can be used complementarily to the AUC for measuring performance.
3.1 Attention Process
In the attention process Graves et al. (2014), an attention vector $\mathbf{w}_t$ between $q_t$ and each latent concept is acquired by using the key matrix $\mathbf{M}^k$. The input $q_t$ is embedded into a key vector $\mathbf{k}_t$ by multiplying it with an embedding matrix $\mathbf{A}$ Zhang et al. (2017). The concept attention vector is computed by taking the softmax of the inner product of $\mathbf{k}_t$ and each key memory slot $\mathbf{M}^k(i)$ as follows:

$w_t(i) = \mathrm{softmax}\big(\mathbf{k}_t^\top \mathbf{M}^k(i)\big),$

where $\mathbf{M}^v_t$ and $w_t(i)$ are the student’s knowledge state and the attention weight of the $i$-th concept, respectively.
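A minimal sketch of this attention step, assuming illustrative dimensions and randomly initialized matrices (the names `A`, `M_k`, and `attention` are hypothetical, not from the paper):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Attention process: embed the exercise tag and take the softmax of its inner
# products with the key memory rows.  Dimensions are illustrative.
rng = np.random.default_rng(0)
Q, d_k, N = 110, 50, 20                 # exercises, key dimension, latent concepts
A = rng.normal(size=(Q, d_k))           # exercise embedding matrix
M_k = rng.normal(size=(N, d_k))         # key memory: one row per concept

def attention(q_t):
    k_t = A[q_t]                        # key vector for the exercise
    return softmax(M_k @ k_t)           # concept attention weights w_t
```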
3.2 Read Process
The read process retrieves the attended knowledge state of the student from the value matrix $\mathbf{M}^v_t$ using $\mathbf{w}_t$, and predicts the probability of answering correctly. The read content vector $\mathbf{r}_t$ provides the understanding level of the student for each concept via the concept-wise weighted sum of $\mathbf{w}_t$ and $\mathbf{M}^v_t$ Zhang et al. (2017) as follows:

$\mathbf{r}_t = \sum_{i=1}^{N} w_t(i)\, \mathbf{M}^v_t(i).$
However, $\mathbf{r}_t$ alone contains neither the characteristics of the given exercise nor latent information such as a summary, a hint, or the student’s proficiency. To consider this neglected latent information, we add a counter memory $\mathbf{M}^c_t$ to track how much of each concept has been seen up to time step $t$. $\mathbf{M}^c_t$ is initialized to zero and is updated whenever it sees $q_t$, regardless of the response $r_t$, as follows:

$\mathbf{M}^c_t = \mathbf{M}^c_{t-1} + \mathbf{w}_t,$

where the accumulated counts are embedded into a counter vector $\mathbf{c}_t$ by a counter embedding matrix. The counter memory can also be implemented at the exercise level, counting the number of exercises; a comparison between the two types of counter memory is performed in the experiments. While the value memory tracks a student’s knowledge changes according to the response, the counter memory tracks the changes in the knowledge gained through the learning process without considering the response.
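The read and counter updates can be sketched as follows; the concept-level counter update (accumulating the attention weights) is our reading of the description, and all names and dimensions are illustrative:

```python
import numpy as np

# Read process and counter memory (illustrative dimensions; the concept-level
# counter update shown here follows the textual description, not an exact spec).
N, d_v = 20, 50                          # concepts, value dimension
rng = np.random.default_rng(1)
M_v = rng.normal(size=(N, d_v))          # value memory: mastery level per concept
M_c = np.zeros(N)                        # counter memory: concepts seen so far

def read(w_t):
    return w_t @ M_v                     # read content: weighted sum over concepts

def update_counter(w_t):
    # Updated whenever an exercise is seen, regardless of the response r_t.
    global M_c
    M_c = M_c + w_t

w = np.full(N, 1.0 / N)                  # e.g. a uniform attention vector
r_vec = read(w)
update_counter(w)
```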
To express the student’s understanding level given the exercise information and latent information, we concatenate $\mathbf{r}_t$, $\mathbf{k}_t$, and $\mathbf{c}_t$ to form the summary vector $\mathbf{f}_t$, which involves the comprehensive representation of the student at time step $t$, as follows:

$\mathbf{f}_t = \phi\big(\mathbf{W}_1 [\mathbf{r}_t; \mathbf{k}_t; \mathbf{c}_t] + \mathbf{b}_1\big),$

where $\mathbf{W}_1$ and $\mathbf{b}_1$ denote the weight and the bias of the fully connected layer, respectively, and $\phi$ is the activation function (tanh or sigmoid; see Section 4). The probability $p_t$ for $q_t$ is computed from $\mathbf{f}_t$ as

$p_t = \sigma\big(\mathbf{W}_2 \mathbf{f}_t + \mathbf{b}_2\big),$

where $\mathbf{W}_2$ and $\mathbf{b}_2$ denote the trainable parameters of the last fully connected layer.
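A sketch of the summary-and-predict step under the sigmoid variant; the placeholder weights `W1`, `b1`, `W2`, `b2` stand in for the trained fully connected layers, and all shapes are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Summary vector and prediction (illustrative shapes; small random weights
# stand in for the trainable fully connected layers).
d_k, d_v, N, d_f = 50, 50, 20, 64
rng = np.random.default_rng(2)
W1 = rng.normal(size=(d_f, d_v + d_k + N)) * 0.01
b1 = np.zeros(d_f)
W2 = rng.normal(size=(d_f,)) * 0.01
b2 = 0.0

def predict(r_t, k_t, c_t):
    z = np.concatenate([r_t, k_t, c_t])      # read content, key vector, counter
    f_t = sigmoid(W1 @ z + b1)               # summary vector (sigmoid variant)
    return sigmoid(W2 @ f_t + b2)            # P(answering the exercise correctly)

p = predict(rng.normal(size=d_v), rng.normal(size=d_k), np.zeros(N))
```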
3.3 Write Process
The write process updates $\mathbf{M}^v_{t-1}$ and $\mathbf{M}^c_{t-1}$ to $\mathbf{M}^v_t$ and $\mathbf{M}^c_t$, respectively. To update $\mathbf{M}^v$, the given knowledge growth signal $(q_t, r_t)$ is embedded into a knowledge growth vector $\mathbf{v}_t$ by multiplying it with an embedding matrix $\mathbf{B}$ Zhang et al. (2017).
However, $\mathbf{v}_t$ is independent of the current knowledge state of the student since it depends only on $(q_t, r_t)$, which we defined as the non-adaptive knowledge growth problem in Section 1. To resolve this problem, we expand $\mathbf{v}_t$ to an adaptive knowledge growth vector $\tilde{\mathbf{v}}_t$ that contains the student’s prior knowledge (the current knowledge state). There are several candidates for representing the current knowledge, such as the read content $\mathbf{r}_t$ and the summary vector $\mathbf{f}_t$. We choose $\mathbf{f}_t$ as the student’s current knowledge since $\mathbf{f}_t$ combines the concept mastery level of the student with $\mathbf{k}_t$ and $\mathbf{c}_t$, and define $\tilde{\mathbf{v}}_t$ as follows:

$\tilde{\mathbf{v}}_t = [\mathbf{v}_t; \mathbf{f}_t].$
Motivated by the operations of the LSTM Hochreiter and Schmidhuber (1997), an erase vector $\mathbf{e}_t$ and an add vector $\mathbf{a}_t$ are exploited to erase needless information and add new information Zhang et al. (2017) as follows:

$\mathbf{e}_t = \sigma\big(\mathbf{E} \tilde{\mathbf{v}}_t + \mathbf{b}_e\big), \qquad \mathbf{a}_t = \tanh\big(\mathbf{D} \tilde{\mathbf{v}}_t + \mathbf{b}_a\big),$

where $\mathbf{E}$, $\mathbf{D}$, $\mathbf{b}_e$, and $\mathbf{b}_a$ are the trainable parameters of the fully connected layers. The erase and add vectors are thus calculated adaptively based on the prior knowledge of the student, so that we can personalize the student’s knowledge state.
To reflect the student’s knowledge change through $(q_t, r_t)$, $\mathbf{e}_t$ and $\mathbf{a}_t$ need to be applied concept-wise, because the state of each concept is weighted differently by the attention vector $\mathbf{w}_t$. Therefore, the value memory is updated for each concept Zhang et al. (2017) as follows:

$\mathbf{M}^v_t(i) = \mathbf{M}^v_{t-1}(i)\big[\mathbf{1} - w_t(i)\,\mathbf{e}_t\big] + w_t(i)\,\mathbf{a}_t,$

where $i$ is the index of a concept.
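The concept-wise erase/add update can be sketched as below; `E` and `D` stand in for the trainable projections, the bias terms are omitted for brevity, and the dimensions are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Write process: erase/add vectors from the (adaptive) knowledge growth vector,
# applied concept-wise through the attention weights.
N, d_v = 20, 50
rng = np.random.default_rng(3)
E = rng.normal(size=(d_v, d_v)) * 0.1    # erase projection (bias omitted)
D = rng.normal(size=(d_v, d_v)) * 0.1    # add projection (bias omitted)

def write(M_v, w_t, v_t):
    e_t = sigmoid(E @ v_t)               # erase vector in (0, 1)
    a_t = np.tanh(D @ v_t)               # add vector
    # M_v[i] <- M_v[i] * (1 - w_t[i] * e_t) + w_t[i] * a_t, for each concept i
    return M_v * (1 - np.outer(w_t, e_t)) + np.outer(w_t, a_t)

M_v = rng.normal(size=(N, d_v))
M_v_new = write(M_v, np.full(N, 1.0 / N), rng.normal(size=d_v))
```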
3.4 Optimization Process
Table 2: Required memory for each objective function (asymptotic analysis and real case).
To improve the predictive performance for the given exercise, the DKVMN is trained with the cross-entropy loss Zhang et al. (2017):

$\mathcal{L}_{CE} = -\sum_{t} \big[ r_t \log p_t + (1 - r_t) \log (1 - p_t) \big].$
The negative influence problem arises because $\mathcal{L}_{CE}$ does not consider whether the probabilities of other exercises change. We add a negative influence loss $\mathcal{L}_{NI}$ that penalizes the cases where a positive knowledge growth signal $(q_t, r_t = 1)$ has an unintended negative influence on other exercises, by analyzing the total prediction probability. Given the updated value memory $\mathbf{M}^v_t$, the total prediction probability vector is defined as follows:

$\mathbf{p}_t = \big(p_t(1), \dots, p_t(Q)\big),$

where $p_t(q)$ is obtained by repeating the attention, read, and prediction steps for exercise $q$ on $\mathbf{M}^v_t$.
We say that $(q_t, r_t = 1)$ has a non-negative influence on other exercises when $p_t(q) \ge p_{t-1}(q)$ for all $q$. $\mathcal{L}_{NI}$ is the squared error of the probability difference, counted only when the positive knowledge growth signal has a negative influence, as follows:

$\mathcal{L}_{NI} = \sum_{t : r_t = 1} \sum_{q=1}^{Q} \max\big(0,\; p_{t-1}(q) - p_t(q)\big)^2.$
The objective function is then $\mathcal{L} = \mathcal{L}_{CE} + \lambda \mathcal{L}_{NI}$, where $\lambda$ is a hyper-parameter. Since the value memory is a major bottleneck for the calculation, the asymptotic space requirement in big-$O$ notation is $O(B N d_v)$, where $B$ is the batch size, $N$ the number of concepts, and $d_v$ the value dimension. To calculate $\mathcal{L}_{NI}$, we must infer $p_t(q)$ for all $Q$ exercises, which requires $Q$ times more space: $O(Q B N d_v)$. To reduce the required resources, we propose $\hat{\mathcal{L}}_{NI}$, an approximated version of $\mathcal{L}_{NI}$. The key idea behind $\hat{\mathcal{L}}_{NI}$ is to share the value memory for all exercises and calculate a uniform averaged read content only once. To get $\mathcal{L}_{NI}$, the attention-weighted read content has to be calculated $Q$ times. Instead, we replace $\mathbf{r}_t$ with $\bar{\mathbf{r}}_t$, which considers all the concepts equally regardless of the exercise, as follows:

$\bar{\mathbf{r}}_t = \frac{1}{N} \sum_{i=1}^{N} \mathbf{M}^v_t(i),$

whose required space is $O(B N d_v)$. To calculate the approximated probability of answering each exercise correctly from $\bar{\mathbf{r}}_t$, Equation 4 should be repeated $Q$ times, with required space $O(Q B d_f)$, where $d_f$ is the summary dimension. Table 2 compares the asymptotic analysis of the required memory resources.
Since the negative influence is defined via the attention mechanism, whereas $\hat{\mathcal{L}}_{NI}$ bypasses attention, the question of the usefulness of $\hat{\mathcal{L}}_{NI}$ may arise. Even though the probability of answering correctly obtained through $\bar{\mathbf{r}}_t$ is not accurate, the prediction performance is the responsibility of $\mathcal{L}_{CE}$; it is more important for $\hat{\mathcal{L}}_{NI}$ to regularize the negative influence problem than to predict the probability accurately. The negative influence problem should not occur regardless of whether the attention mechanism is used. It is therefore reasonable to calculate $\hat{\mathcal{L}}_{NI}$ with $\bar{\mathbf{r}}_t$, a vector that reflects the understanding of all concepts rather than of a specific exercise.
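The per-step negative influence penalty described above can be sketched as a squared hinge on probability drops after a positive signal; the exact reduction over exercises is our assumption, and `p_prev`/`p_curr` are probability vectors over all $Q$ exercises:

```python
import numpy as np

# Negative influence penalty for a single time step: after a positive
# knowledge growth signal, penalize any exercise whose predicted probability
# decreased (squared hinge on the drop).
def negative_influence_loss(p_prev, p_curr, r_t):
    if r_t != 1:                          # only positive knowledge growth signals
        return 0.0
    drop = np.maximum(0.0, p_prev - p_curr)
    return float(np.sum(drop ** 2))

p_prev = np.array([0.4, 0.6, 0.5])
p_curr = np.array([0.5, 0.55, 0.7])       # second exercise decreased by 0.05
loss = negative_influence_loss(p_prev, p_curr, r_t=1)
```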
3.5 Proposed Metric: Correct Update Count (CUC)
We cannot trust the behavior of a model with negative influence even if the model has a high AUC, and it is difficult to apply such a model to real applications such as testing and recommendation. Therefore, we propose a new metric called the correct update count (CUC) that counts the number of exercises receiving a non-negative influence, as follows:

$\mathrm{CUC}_t = \sum_{q=1}^{Q} \mathbb{1}\big[p_t(q) \ge p_{t-1}(q)\big] \quad \text{for each } t \text{ with } r_t = 1,$

where $\mathbb{1}[\cdot]$ denotes the indicator function. If the CUC is large, the negative influence is not prevalent. The CUC can be viewed as a metric for the reliability of a KT model, measuring unintended negative influence, and can be used complementarily to the AUC, which measures predictive performance.
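The CUC for one positive update can then be computed directly from the probability vectors before and after the update (vector names are illustrative):

```python
import numpy as np

# CUC for a single positive update: count the exercises whose predicted
# probability did not decrease after the knowledge state was updated.
def cuc(p_prev, p_curr):
    return int(np.sum(p_curr >= p_prev))

p_prev = np.array([0.4, 0.6, 0.5, 0.9])
p_curr = np.array([0.5, 0.55, 0.7, 0.9])
c = cuc(p_prev, p_curr)                   # 3 of the 4 exercises did not decrease
```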
4 Experimental Results
We tested the performance of the KT models on a widely used public benchmark, Assistments2009 (available from https://sites.google.com/site/assistmentsdata/home/assistment-2009-2010-data) Feng et al. (2009), gathered from ASSISTments, an online tutoring platform. Assistments2009 contains 325,637 question-answer pairs from 4,151 students over 110 distinct questions. This tutoring system provides hints to students when a response is incorrect, which can be regarded as latent information. We implemented IKT in TensorFlow Abadi et al. (2016) and trained it using SGD with momentum Qian (1999).
4.1 Performance Evaluation
We used the two evaluation metrics discussed above: AUC and CUC. Higher AUC values indicate more accurate prediction of answering questions correctly, whereas higher CUC values indicate a lower degree of unintended negative influence. Table 3 lists the structure and the measured AUC and CUC values for each model under comparison.
In this table, the Activation column shows the type of activation function of the summary vector (Equation 4). DKVMN-ta (activation: tanh) is the original DKVMN, and DKVMN-si (activation: sigmoid) is a variant of the DKVMN. We initially conjectured that the negative values in the image of tanh would cause unintended effects, which was confirmed by the results shown here. While the average CUC of DKVMN-ta was 14.6, the average CUC of DKVMN-si was 52.7. We chose the sigmoid as the default activation function for the other models, since using the sigmoid alone brought a significant improvement in CUC values.
The Current state column indicates which representation of the current knowledge state is used to obtain the adaptive knowledge growth $\tilde{\mathbf{v}}_t$. In addition to the summary vector $\mathbf{f}_t$ (IKT-su), we tested the read content $\mathbf{r}_t$ (IKT-re) for reference. IKT-su and IKT-re outperformed the DKVMN models in terms of CUC while preserving the performance measured in AUC. Notably, IKT-su showed a significant increase in CUC compared to IKT-re, which means that the summary vector represents the student’s current knowledge state more concretely than the read vector alone.
The Objective column in Table 3 indicates the objective function of each model; IKT-ne was optimized to minimize $\mathcal{L}_{CE} + \lambda \hat{\mathcal{L}}_{NI}$. IKT-ne outperformed DKVMN-ta (by over seven times) and DKVMN-si (by over two times) in CUC while preserving the performance measured in AUC. Moreover, IKT-ne outperformed IKT-su and IKT-re, which consider adaptive knowledge growth only. We interpret this improvement as follows: adding the negative influence loss directly handles the unintended probability decrease of other exercises, while adaptive knowledge growth affects the tracing of the student’s knowledge only indirectly. In addition, IKT-all (which considers both adaptive knowledge growth and unintended negative influence) produced the highest CUC among all models, while preserving the performance measured in AUC. This result suggests that solving the non-adaptive knowledge growth and negative influence problems simultaneously leads IKT to imitate human learning more effectively.
4.2 Effectiveness of Latent Learning
To reflect latent information in IKT, we add the counter memory. We compare two types of counter memory: IKT-ex (which counts the number of exercises) and IKT-co (which accumulates the concepts encountered). To reproduce the latent learning test Tolman (1948) in a KT scenario, we provide the IKT model with a series of the same exercise without responses during a latent period, and then the same positive knowledge growth signal during a learning period, as shown in Figure 4. During the latent period, the value memory is not updated since there is no response, while the counter memory is updated based on the exercise as latent information. We define the learning speed as the amount by which the probability of answering correctly increases at each time step from the moment a response is first given. The learning speed depends on how much latent information the KT model acquired during the latent period. The learning speed without any latent period serves as a reference for comparison.
We report the ratio of the learning speed to this reference in Table 4. For the IKT-ex model, the ratio of learning speed stays almost constant near one, which means that the exercise counter does not capture the latent information. For the IKT-co model, the ratio of learning speed reaches 2.87 and 4.56 for the two latent-period lengths at the first learning step. Since IKT accumulates knowledge concept-wise, IKT-co learns faster using latent information than IKT-ex, which represents latent information at the exercise level. In addition, since the effect of latent information diminishes as more positive knowledge growth signals are given, the ratio of learning speed decreases over the learning period. In our experiments, latent information at the exercise level was not useful, but latent information at the concept level was.
5 Discussion
The negative influence problem prevents an ITS from providing a good educational service to students. We cannot rely on a KT model with the negative influence problem to properly model a student’s knowledge learning process. In particular, it would be difficult to apply reinforcement learning (RL) Sutton and Barto (1998); Zhao et al. (2017); Choi et al. (2018); Yoo et al. (2017) to recommend content appropriate to the current knowledge state when the negative influence problem prevails. In RL, it is crucial to define suitable rewards; however, any reward defined on top of a KT model with negative influence may not function properly. Since the benefits from proper exercise recommendation can be enormous, solving the negative influence problem is important.
We believe that IKT, which involves the latent learning mechanism, can be applied to resolving the cold start problem Lika et al. (2014) of a recommender system. The cold start problem occurs when recommending a suitable item for a user with no history, so the user cannot receive a proper recommendation. Similarly, the cold start problem can occur when an ITS attempts to recommend an exercise to a student who has not solved any exercise. However, even if the student has not actually solved any exercise but has only listened to the lectures provided by the ITS, we can estimate the knowledge state of the student by applying the latent learning process as was done in this paper.
6 Conclusion
Our proposed IKT model introduces effective solutions to the limitations of previous studies from the neuroeducation viewpoint. We have also proposed a new metric, CUC, that can be used to evaluate the reliability of KT model behavior. We believe that the IKT model can closely resemble a real student’s learning process and is more reliable than existing models.
References
- Abadi et al.  Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. Tensorflow: A system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.
- Bae et al.  Ho Bae, Byunghan Lee, Sunyoung Kwon, and Sungroh Yoon. Dna steganalysis using deep recurrent neural networks. arXiv preprint arXiv:1704.08443, 2017.
- Brod et al.  Garvin Brod, Markus Werkle-Bergner, and Yee Lee Shing. The influence of prior knowledge on memory: a developmental cognitive neuroscience perspective. Frontiers in behavioral neuroscience, 7:139, 2013.
- Brusilovsky et al.  Peter Brusilovsky, Elmar Schwarz, and Gerhard Weber. Elm-art: An intelligent tutoring system on world wide web. In International conference on intelligent tutoring systems, pages 261–269. Springer, 1996.
- Cen et al.  Hao Cen, Kenneth Koedinger, and Brian Junker. Learning factors analysis–a general method for cognitive model evaluation and improvement. In International Conference on Intelligent Tutoring Systems, pages 164–175. Springer, 2006.
- Chi et al.  Michelene TH Chi, Paul J Feltovich, and Robert Glaser. Categorization and representation of physics problems by experts and novices. Cognitive science, 5(2):121–152, 1981.
- Choi et al.  Sungwoon Choi, Heonseok Ha, Uiwon Hwang, Chanju Kim, Jung-Woo Ha, and Sungroh Yoon. Reinforcement learning based recommender system using biclustering technique. arXiv preprint arXiv:1801.05532, 2018.
- Corbett and Anderson  Albert T Corbett and John R Anderson. Knowledge tracing: Modeling the acquisition of procedural knowledge. User modeling and user-adapted interaction, 4(4):253–278, 1994.
- Council et al.  National Research Council et al. Learning and understanding: Improving advanced study of mathematics and science in US high schools. National Academies Press, 2002.
- d Baker et al.  Ryan SJ d Baker, Albert T Corbett, and Vincent Aleven. More accurate student modeling through contextual estimation of slip and guess probabilities in bayesian knowledge tracing. In International Conference on Intelligent Tutoring Systems, pages 406–415. Springer, 2008.
- Embretson and Reise  Susan E Embretson and Steven P Reise. Item response theory. Psychology Press, 2013.
- Feng et al.  Mingyu Feng, Neil Heffernan, and Kenneth Koedinger. Addressing the assessment challenge with an online system that tutors as it assesses. User Modeling and User-Adapted Interaction, 19(3):243–266, 2009.
- Goodkovsky  Vladimir A Goodkovsky. Intelligent tutoring system, October 19 2004. US Patent 6,807,535.
- Graves et al.  Alex Graves, Greg Wayne, and Ivo Danihelka. Neural turing machines. arXiv preprint arXiv:1410.5401, 2014.
- Hochreiter and Schmidhuber  Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
- Khan et al.  Asjad Khan, Hung Le, Kien Do, Truyen Tran, Aditya Ghose, Hoa Dam, and Renuka Sindhgatta. Memory-augmented neural networks for predictive process analytics. arXiv preprint arXiv:1802.00938, 2018.
- Kim et al.  Hui Kwon Kim, Seonwoo Min, Myungjae Song, Soobin Jung, Jae Woo Choi, Younggwang Kim, Sangeun Lee, Sungroh Yoon, and Hyongbum Henry Kim. Deep learning improves prediction of crispr–cpf1 guide rna activity. Nature biotechnology, 36(3):239, 2018.
- Kitamura et al.  Takashi Kitamura, Sachie K Ogawa, Dheeraj S Roy, Teruhiro Okuyama, Mark D Morrissey, Lillian M Smith, Roger L Redondo, and Susumu Tonegawa. Engrams and circuits crucial for systems consolidation of a memory. Science, 356(6333):73–78, 2017.
- LeCun et al.  Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. nature, 521(7553):436, 2015.
- Lee et al.  Byunghan Lee, Junghwan Baek, Seunghyun Park, and Sungroh Yoon. deeptarget: end-to-end learning framework for microrna target prediction using deep recurrent neural networks. In Proceedings of the 7th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, pages 434–442. ACM, 2016.
- Lee et al.  Byunghan Lee, Taesup Moon, Sungroh Yoon, and Tsachy Weissman. Dude-seq: Fast, flexible, and robust denoising for targeted amplicon sequencing. PloS one, 12(7):e0181463, 2017.
- Lika et al.  Blerina Lika, Kostas Kolomvatsos, and Stathes Hadjiefthymiades. Facing the cold start problem in recommender systems. Expert Systems with Applications, 41(4):2065–2073, 2014.
- Mehta  A Mehta. ‘Neuroeducation’ emerges as insights into brain development, learning abilities grow. The DANA Foundation, 2009.
- Min et al.  Seonwoo Min, Byunghan Lee, and Sungroh Yoon. Deep learning in bioinformatics. Briefings in bioinformatics, 18(5):851–869, 2017.
- Park et al. [2017a] Seongsik Park, Seijoon Kim, Seil Lee, Ho Bae, and Sungroh Yoon. Quantized memory-augmented neural networks. arXiv preprint arXiv:1711.03712, 2017a.
- Park et al. [2017b] Seunghyun Park, Seonwoo Min, Hyun-Soo Choi, and Sungroh Yoon. Deep recurrent neural network-based identification of precursor micrornas. In Advances in Neural Information Processing Systems, pages 2895–2904, 2017b.
- Pavlik Jr et al.  Philip I Pavlik Jr, Hao Cen, and Kenneth R Koedinger. Performance factors analysis–a new alternative to knowledge tracing. Online Submission, 2009.
- Piech et al.  Chris Piech, Jonathan Bassen, Jonathan Huang, Surya Ganguli, Mehran Sahami, Leonidas J Guibas, and Jascha Sohl-Dickstein. Deep knowledge tracing. In Advances in Neural Information Processing Systems, pages 505–513, 2015.
- Qian  Ning Qian. On the momentum term in gradient descent learning algorithms. Neural networks, 12(1):145–151, 1999.
- Santoro et al. [2016a] Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. One-shot learning with memory-augmented neural networks. arXiv preprint arXiv:1605.06065, 2016a.
- Santoro et al. [2016b] Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. One-shot learning with memory-augmented neural networks. arXiv preprint arXiv:1605.06065, 2016b.
- Sonnhammer et al.  Erik LL Sonnhammer, Gunnar Von Heijne, Anders Krogh, et al. A hidden markov model for predicting transmembrane helices in protein sequences. In Ismb, volume 6, pages 175–182, 1998.
- Sutton and Barto  Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998.
- Tolman  Edward C Tolman. Cognitive maps in rats and men. Psychological review, 55(4):189, 1948.
- Wandersee et al.  James H Wandersee, Joel J Mintzes, and Joseph D Novak. Research on alternative conceptions in science. Handbook of research on science teaching and learning, 177:210, 1994.
- Wright  Cara Megan Wright. When trauma disrupts learning: A neuroeducation-informed professional learning experience. 2017.
- Yi et al.  Hayoon Yi, Gyuwan Kim, Jangho Lee, Sunwoo Ahn, Younghan Lee, Sungroh Yoon, and Yunheung Paek. Mimicry resilient program behavior modeling with lstm based branch models. arXiv preprint arXiv:1803.09171, 2018.
- Yoo et al.  Jaeyoon Yoo, Heonseok Ha, Jihun Yi, Jongha Ryu, Chanju Kim, Jung-Woo Ha, Young-Han Kim, and Sungroh Yoon. Energy-based sequence gans for recommendation and their connection to imitation learning. arXiv preprint arXiv:1706.09200, 2017.
- Zhang et al.  Jiani Zhang, Xingjian Shi, Irwin King, and Dit-Yan Yeung. Dynamic key-value memory networks for knowledge tracing. In Proceedings of the 26th International Conference on World Wide Web, pages 765–774. International World Wide Web Conferences Steering Committee, 2017.
- Zhao et al.  Xiangyu Zhao, Liang Zhang, Zhuoye Ding, Dawei Yin, Yihong Zhao, and Jiliang Tang. Deep reinforcement learning for list-wise recommendations. arXiv preprint arXiv:1801.00209, 2017.