Teaching Multiple Concepts to Forgetful Learners
How can we help a forgetful learner learn multiple concepts within a limited time frame? For long-term learning, it is crucial to devise teaching strategies that leverage the underlying forgetting mechanisms of the learners. In this paper, we cast the problem of adaptively teaching a forgetful learner as a novel discrete optimization problem, where we seek to optimize a natural objective function that characterizes the learner’s expected performance throughout the teaching session. We then propose a simple greedy teaching strategy and derive strong performance guarantees based on two intuitive data-dependent parameters, which characterize the degree of diminishing returns of teaching each concept. We show that, given some assumptions of the learner’s memory model, one can efficiently compute the performance bounds. Furthermore, we identify parameter settings of our memory models where greedy is guaranteed to achieve high performance. We have deployed our approach in two concrete applications, namely (1) an educational app for online vocabulary teaching and (2) an app for teaching novices how to recognize bird species. We demonstrate the effectiveness of our algorithm using simulations along with user studies.
In many real-world educational applications, human learners often intend to learn more than one concept. For example, in a language learning scenario, a learner aims to memorize a number of words from a foreign language. In citizen science projects such as eBird [19], the goal of a learner is to recognize multiple bird species from a given geographic region. As the number of concepts increases, the learning problem may become overwhelmingly challenging due to the learner’s limited memory and propensity to forget. It has been well established in the psychology literature that in the context of human learning, the knowledge of a learner decays rapidly without reconsolidation. Somewhat analogously, in the sequential machine learning setting, modern machine learning methods, such as artificial neural networks, can be drastically disrupted when presented with new information from different domains, which leads to catastrophic interference and forgetting [10]. Therefore, to retain long-term memory (for both human and machine learners), it is crucial to devise teaching strategies that leverage the underlying forgetting mechanisms of the learners.
A prominent approach towards teaching forgetful learners is through repetition. Properly-scheduled repetitions and reconsolidations of previous knowledge have proven effective for a wide variety of real-world learning tasks, including piano practice [13, 16], surgery skills [22, 17, 4], video games [15, 18], and vocabulary learning [6], among others. For many of the above application domains, it has been shown that by carefully designing the scheduling policy, one can achieve substantial gains over simple heuristics (such as spaced repetition at fixed time intervals, or a simple round robin schedule). Unfortunately, despite the extensive empirical results in these fields, most of these scheduling techniques are based on heuristics, and little is known about their theoretical performance.
In this paper, we explore the following research question: Given limited time, can we help a forgetful learner efficiently learn multiple concepts in a principled manner? More concretely, we consider an adaptive setting where at each time step, the teacher needs to pick a concept from a finite set based on the learner’s previous responses, and the process iterates until the learner’s time budget is exhausted. Given the memory model of the learner, what is an optimal teaching curriculum? How should this sequence be adapted based on the learner’s performance history?
For a high-level overview of our approach, let us consider the example in Fig. 1, which illustrates one of our applications (cf. [1, 2]) on German vocabulary learning. Here, our goal is to teach the learner three German words within six iterations. One trivial approach could be to show the flashcards in a round robin fashion. However, the round robin sequence is deterministic and thus not capable of adapting to the learner’s input. In contrast, our algorithm outputs a personalized teaching sequence based on the learner’s performance history. Our algorithm is based on a novel formulation of the adaptive teaching problem. In §3, we propose a novel discrete optimization problem, where we seek to maximize a natural surrogate objective function that characterizes the learner’s expected performance throughout the teaching session. Note that constructing the optimal teaching policy could be prohibitively expensive for long teaching sessions, as it boils down to solving a stochastic sequence optimization problem, which is NP-hard in general. In §4, we introduce our greedy algorithm, and derive strong performance guarantees based on two intuitive data-dependent parameters. We then show that for certain memory models of the learner, one can efficiently compute the performance bounds. Furthermore, we identify parameter settings of the memory models where the greedy algorithm is guaranteed to achieve high performance. We describe results for simulated learners in §5, and show significant improvements over baselines for the challenging task of teaching real humans in §6.
2 Related Work
Optimal scheduling with spaced repetition models
Numerous studies in neurobiology and psychology have emphasized the importance of the spacing and lag effects in human learning. The spacing effect is the observation that spaced repetition produces greater improvements in learning compared to massed repetition (i.e., “cramming”). The lag effect refers to the benefit of introducing appropriate time lags between study sessions. These findings lay the foundations of modern spaced repetition research, including widely-used heuristic-based approaches, such as Leitner [9], Pimsleur [11], and SuperMemo [3]. Settles and Meeder (2016) [14] introduced Half-life Regression (HLR) as a generalization of these heuristics, and showed that HLR in general outperforms the existing approaches. In this paper, we adopt a variant of the HLR to model the learner.
Recently, Reddy et al. (2016) [12] presented a queueing network for flashcard learning and provided a tractable algorithm to approximate a solution. However, their approach is specifically designed for Leitner systems, where the meters of learners’ skills often do not adequately reflect what they have learned. Tabibian et al. (2017) [20] considered optimizing learning schedules in continuous time for independent items, and used optimal control theory to derive optimal scheduling when optimizing for a penalized recall probability area-under-the-curve loss function. In contrast to [20], we consider the discrete time setting. We are interested in the scenario where a learner studies their flashcards at constant time intervals (e.g., on the way to work or before going to bed), rather than at arbitrary times.
Sequence optimization / sequential decision making
Our theoretical framework is inspired by recent results on string submodular function maximization [23] and adaptive submodular optimization [8]. In particular, Zhang et al. (2016) [23] introduced the notion of string submodular functions, which, analogous to the classical notion of submodular set functions, enjoy similar performance guarantees for maximizing deterministic sequence functions. However, we note that our setting is drastically different from [23]. The authors focus on deterministic string submodular functions, whereas our teaching algorithm operates in the stochastic setting, and our objective function is highly non-submodular. As a second note, our framework (in particular Corollary 2) can be viewed as a strict generalization of string submodular function maximization to the adaptive setting.
3 The Teaching Model
In this section, we first introduce the notation for our teaching model. Then, we describe the interactive teaching protocol and formally state the problem studied in this paper.
3.1 Target concepts and memory of the learner
Suppose that the teacher aims to teach the learner independent concepts in a finite time horizon . W.l.o.g., we assume that each concept consists of one instance and a corresponding label. (Footnote 1: In the case where a concept consists of multiple instances, we consider the teacher, at time , showing the full batch of instances to teach concept .) For instance, in language learning, a concept corresponds to a word in the vocabulary of a second language. Let us use to denote the event that the learner recalls a concept at time step , where means that the learner successfully recalls the label (i.e., the learner correctly translates the word), and otherwise. We assume that the learner’s memory of concept at time is captured by a memory model . Here, denotes the historical events in which concept was revealed and denotes the set of feasible histories. As an example, the probability of recalling concept at time for the exponential forgetting curve model is given by , and the recall probability for the power-law forgetting curve model is given by . Here, the variable depends on the historical frequency of showing concept , and are scaling parameters that characterize the forgetting rate.
3.2 Model of interaction
We consider the following interactive teaching protocol. At iteration , the teacher picks a concept from the set and presents an instance of it to the learner without revealing its label. The learner then tries to recall the concept. After the learner makes an attempt, the teacher collects the outcome and reveals the true label. We use to denote the sequence of concepts picked by the teacher, and use to denote the element of the sequence. At the end of iteration , the teacher adds to the observation history , and updates the memory model .
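The protocol above can be sketched as a short simulation loop. This is only an illustrative sketch: the names `run_session`, `policy`, and `recall_prob` are hypothetical and do not come from the paper, and the learner is simulated by drawing a Bernoulli outcome from a supplied recall probability.

```python
import random
from typing import Callable, List, Tuple

# One (concept index, recall outcome) pair per teaching step.
History = List[Tuple[int, int]]


def run_session(
    policy: Callable[[History], int],
    recall_prob: Callable[[int, int, History], float],
    horizon: int,
    rng: random.Random,
) -> History:
    """Simulate the interaction: pick a concept, observe recall, record it."""
    history: History = []
    for t in range(1, horizon + 1):
        concept = policy(history)               # teacher picks a concept
        p = recall_prob(concept, t, history)    # assumed memory model of the learner
        outcome = 1 if rng.random() < p else 0  # learner attempts to recall
        history.append((concept, outcome))      # teacher logs the outcome
    return history


if __name__ == "__main__":
    # Non-adaptive round robin over three concepts, with a learner who
    # recalls every concept with probability 0.5 (purely illustrative).
    round_robin = lambda h: len(h) % 3
    coin_flip = lambda i, t, h: 0.5
    print(run_session(round_robin, coin_flip, 6, random.Random(0)))
```

An adaptive policy differs from the round robin example only in that `policy` actually inspects the outcomes stored in `history`.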
3.3 Objective function, policy and the optimal teaching problem
The goal of the teacher is to maximize the learner’s performance in recalling all concepts after iterations. A natural choice of the objective function is the average recall probability of all concepts at the end of the teaching session. This objective, however, does not explicitly capture the performance of the learner during the training phase, which may stretch over years for language learning. Therefore, to provide the learner with high proficiency as soon as possible, we optimize for concept retrievability during learning. We consider the following objective, which measures the learner’s average cumulative recall probability for all the concepts across the teaching horizon
Here, denotes the probability of the learner recalling concept correctly at time step , given the sequence of examples selected up to time step . Intuitively, our objective function can be interpreted as the (discrete) area under the learner’s forgetting curve over the entire teaching session (i.e., we are summing over the recall probabilities across all time steps up to (and hence to ), even when we have only observed the learner’s history up to ).
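As a rough illustration of the "discrete area under the forgetting curve" interpretation, the sketch below averages recall probabilities over all concepts and all time steps. The exact indexing (evaluating recall at step `tau + 1` using only the history revealed up to step `tau`) is an assumption made for this sketch, and `objective` and `recall_prob` are hypothetical names.

```python
from typing import Callable, List, Tuple

History = List[Tuple[int, int]]  # (concept index, recall outcome) per step


def objective(
    recall_prob: Callable[[int, int, History], float],
    history: History,
    n_concepts: int,
    horizon: int,
) -> float:
    """Average recall probability over all concepts and steps 1..horizon."""
    total = 0.0
    for i in range(n_concepts):
        for tau in range(1, horizon + 1):
            # Assumption: recall at step tau + 1 depends only on what was
            # shown during the first tau steps (fewer if history is shorter).
            total += recall_prob(i, tau + 1, history[:tau])
    return total / (n_concepts * horizon)
```

Because the sum always runs to the full horizon, a partial history is scored by how well the learner would retain the concepts over the whole session if nothing further were taught.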
The teacher’s teaching strategy can be represented as a policy , which maps any observation history to the next concept to be revealed. For a given policy , we use to denote a random trajectory from the policy until time . The average utility of a policy is defined as
Given the learner’s memory model for each concept and the time horizon , we seek the optimal teaching policy that achieves the maximal average utility
Finding the optimal solution for Problem (3) is a formidable task. It requires searching through the space of all possible feasible policies. In fact, even for the simple setting where the objective function does not depend on the learner’s responses, i.e., when , Problem (3) reduces to a combinatorial optimization problem over sequences, which is NP-hard. In the following, we present a simple greedy algorithm, and provide a data-dependent lower bound on its average utility against the optimal policy. Moreover, we prove that under some additional conditions on the learner’s memory model, one can efficiently compute such an empirical bound.
4 Algorithms and Theoretical Analysis
We consider a simple, greedy approach towards constructing teaching policies. Formally, given an observation history , we define the conditional marginal gain of teaching a concept at time as
where denotes the concatenation operation, and the expectation is taken over the randomness of learner’s recall , conditioned on having observed . The greedy algorithm, as described in Algorithm 1, iteratively picks the item that maximizes this conditional marginal gain.
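A minimal sketch of one greedy step is given below, under stated assumptions rather than as the paper's implementation. It assumes an HLR-style memory model with hypothetical per-concept parameters `(a, b, c)`, assumes that a concept which has never been shown has recall probability 0, and assumes the area-under-the-curve indexing used in `objective`. The expectation over the learner's next response is taken with respect to the model's own recall probability.

```python
import math
from typing import List, Tuple

History = List[Tuple[int, int]]      # (concept index, recall outcome) per step
Params = Tuple[float, float, float]  # hypothetical (a, b, c) for one concept


def recall_prob(i: int, tau: int, history: History, params: List[Params]) -> float:
    """HLR-style recall of concept i at time tau, using steps before tau."""
    a, b, c = params[i]
    last, n_pos, n_neg = None, 0, 0
    for step, (concept, outcome) in enumerate(history, start=1):
        if concept == i and step < tau:
            last = step
            n_pos += outcome
            n_neg += 1 - outcome
    if last is None:
        return 0.0  # assumption: a concept never shown cannot be recalled
    half_life = 2.0 ** (a * n_pos + b * n_neg + c)
    return 2.0 ** (-(tau - last) / half_life)


def objective(history: History, params: List[Params], horizon: int) -> float:
    """Average recall probability over all concepts and steps 1..horizon."""
    n = len(params)
    total = sum(
        recall_prob(i, tau + 1, history, params)
        for i in range(n)
        for tau in range(1, horizon + 1)
    )
    return total / (n * horizon)


def greedy_pick(history: History, params: List[Params], horizon: int) -> int:
    """Return the concept with the largest expected one-step gain."""
    t = len(history) + 1
    base = objective(history, params, horizon)
    best, best_gain = 0, -math.inf
    for i in range(len(params)):
        p = recall_prob(i, t, history, params)  # chance of a correct recall now
        gain = sum(
            prob * (objective(history + [(i, y)], params, horizon) - base)
            for y, prob in ((1, p), (0, 1.0 - p))
        )
        if gain > best_gain:
            best, best_gain = i, gain
    return best
```

With an empty history, this sketch prefers the concept with the longer initial half-life, which is in line with the observation in §4.2 that the greedy policy tends to start with easy concepts.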
4.1 Theoretical Guarantees
We now present a general theoretical framework for analyzing the performance of the adaptive greedy policy (Algorithm 1). Importantly, our bound depends on two natural properties of the objective function , both related to the notion of diminishing returns of a sequence function. Intuitively, the following two properties reflect how much a bad choice by the greedy algorithm can affect the optimality of the solution.
Definition 1 (Online stepwise submodular coefficient).
Fix policy of length . The online submodular coefficient of function with respect to policy at step is defined as
where denotes the minimal ratio between the gain of any item given current observation history and the gain of in any future steps.
Definition 2 (Online stepwise backward curvature).
Fix policy of length . The online backward curvature of function with respect to policy at step is defined as
where denotes the normalized maximal expected second-order difference when considering the current observation history .
Here, and generalize the notions of string submodularity and total backward curvature for sequence functions [23] to the stochastic setting. Intuitively, measures the degree of diminishing returns of a sequence function in terms of the ratio between the conditional marginal gains. If , then the conditional marginal gain of adding any item to any subsequent observation history is non-decreasing. In contrast, measures the degree of diminishing returns in terms of the difference between the marginal gains. As our first main theoretical result, we provide a data-dependent bound on the average utility of the greedy policy against the optimal policy.
The summand on the R.H.S. of Eq. (7) is in fact a lower bound on the expected one-step gain of the greedy policy. Therefore, if we run the greedy algorithm for only iterations, we can bound its expected utility by , where is the optimal policy (of length ). We can further relax the bound by considering the worst-case online stepwise submodularity ratio and curvature across all time steps.
Corollary 2.
Let and . For all ,
The proofs are deferred to the Appendix. Note that Corollary 2 generalizes the string submodular optimization framework of [23], which only holds under the deterministic setting, to the stochastic sequence optimization problem. In particular, for the special case where and is independent of , Corollary 2 reduces to where denote the sequences selected by the greedy and the optimal algorithm. However, constructing the bounds in Theorem 1 and Corollary 2 requires us to compute for , which is as expensive as computing . In the following subsection, we investigate a specific learner model, and provide polynomial-time approximation algorithms for computing the theoretical lower bound in Theorem 1.
4.2 Performance Analysis: Half-life Regression (HLR) Learners
We consider the case of HLR learners with the following exponential forgetting curve model
where is the last time concept was taught, and denotes the half life of the learner’s recall probability of concept . Here, parametrizes the retention rate of the learner’s memory, and , where and denote the number of correct recalls and incorrect recalls of concept in .
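For concreteness, the sketch below shows the standard half-life regression form, in which recall decays as a power of two in the elapsed time divided by the half-life, and the half-life is itself a power of two in a linear function of the recall counts. The parameter names `a`, `b`, and `c` (weights on correct recalls, incorrect recalls, and a bias) are assumptions for illustration and may not match the paper's notation.

```python
def half_life(a: float, b: float, c: float, n_correct: int, n_incorrect: int) -> float:
    """Half-life as 2 raised to a linear function of the recall counts."""
    return 2.0 ** (a * n_correct + b * n_incorrect + c)


def hlr_recall(tau: int, last_shown: int, h: float) -> float:
    """Recall probability at time tau, halving every h steps since last_shown."""
    return 2.0 ** (-(tau - last_shown) / h)


if __name__ == "__main__":
    h = half_life(a=1.0, b=1.0, c=0.0, n_correct=2, n_incorrect=1)
    # Recall probability at each of the first few steps after a presentation.
    print([round(hlr_recall(t, 0, h), 3) for t in range(1, 5)])
```

Under this form, a larger weight on correct recalls makes each successful repetition stretch the half-life further, which is what produces the non-diminishing marginal gains discussed below for the greedy analysis.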
We would like to bound the performance of Algorithm 1. While computing is NP-hard in general, we show that one can efficiently approximate in the deterministic setting.
Theorem 3.
Assume that the learner is characterized by the HLR model (Eq. (8)) where . We can compute empirical bounds on in polynomial time.
We defer the proof of Theorem 3, as well as the approximation algorithms for to the Appendix. In Fig. 2, we demonstrate the behavior of three teaching algorithms on a toy problem with . Fig. 1(a)-1(c) shows the learner’s forgetting curve (i.e., recall probabilities) and the sequences selected by three algorithms: Greedy (Algorithm 1), Optimal (the optimal solution for Problem (3)), and Round Robin (a fixed round robin teaching schedule for all concepts). Observe that Greedy starts with easy concepts (i.e., concepts with higher memory retention rates), moves on to teaching new concepts when the learner has “enough” retention for the current concept, and repeats previous examples towards the end of the teaching session. This behavior is similar to the optimal teaching sequence, and achieves higher utility in comparison to the fixed round robin scheduling (Fig. 1(d)).
In Fig. 1(e)-1(g), we demonstrate the behavior of the conditional marginal gain, the empirical bounds on , as well as the exact values of when running the greedy algorithm. In particular, in Fig. 1(e), we see that the marginal gain of the orange item is increasing in the early stages (as opposed to many classical discrete optimization problems that exhibit the diminishing returns property), which makes the analysis of the greedy algorithm non-trivial. Note that our algorithm for computing actually outputs the exact value of (a naïve approach to computing is via exhaustive enumeration of all possible teaching sequences). In Fig. 1(h), we plug the empirical bounds on and into Theorem 1 and Corollary 2, and plot the empirical approximation bounds on as a function of the teaching horizon . For problem instances with a large teaching horizon , it is infeasible to compute the true approximation ratio. However, one can still efficiently compute the empirical approximation bound as a useful indicator of the greedy performance.
Theorem 3 shows that it is feasible to compute explicit lower bounds on the utility of Algorithm 1 against the maximal achievable utility. The following proposition, proven in the Appendix, shows that for certain types of learners, the greedy algorithm is guaranteed to achieve a high utility.
Consider the task of teaching a HLR learner independent concepts in time horizon , where all concepts share the same parameter configurations, i.e., . A sufficient condition for the greedy algorithm to achieve utility is .
In this section, we experimentally evaluate our algorithm by simulating learners’ responses based on a known memory model. This allows us to inspect the behavior of our algorithm and several baseline algorithms in a controlled setting with a known memory model, to which we do not have explicit access in a real-world user study.
We simulated concepts of three different types: “easy”, “medium”, and “hard”. The learner’s memory for each concept is captured by an independent HLR model. Concepts of the same type share the same parameter configurations. Specifically, for “easy” concepts, the parameters are , for “medium”, , and for “hard”, . Our parameters are chosen by first fixing , and then calculating the corresponding values of and by which the learner’s recall probability of item drops to a preset recall probability in the immediate next step after showing concept . For an “easy” concept, one can compute the corresponding recall probability in the next step according to Eq. (8): and . Similarly, these recall probabilities for “medium” concepts are , and for “hard” concepts they are .
We consider two different criteria when assessing the performance of the candidate algorithms. Our first evaluation metric is the objective value as defined in Eq. (1), which measures the learner’s average cumulative recall probability across the entire teaching session. The second evaluation metric is the learner’s average recall probability of all concepts at the end of the teaching session. We call this objective “Recall at ”, where is an integer measuring how far in the future we choose to evaluate the learner’s recall.
To demonstrate the performance of our adaptive greedy policy (referred to as GR), we consider three baseline algorithms. The first baseline, denoted by RD, is the random teacher that presents a random concept at each time step. The second baseline is round robin, denoted by RR, which picks concepts according to a fixed round robin schedule. Our third baseline is a variant of the greedy approach employed in the original HLR paper [14] (where we consider a slightly different formulation of the half life), which can be considered as a generalization of the popular Leitner / Pimsleur approaches. At each iteration, the teacher chooses to display the concept with the lowest recall probability according to the HLR memory model of the learner. We refer to this algorithm as LR.
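The three baselines are simple enough to sketch in a few lines. The function names below are hypothetical, steps are assumed to be 1-indexed, and the lowest-recall baseline is assumed to receive the current recall probabilities of all concepts as computed by whatever memory model is in use.

```python
import random
from typing import Sequence


def pick_random(n_concepts: int, rng: random.Random) -> int:
    """RD: present a uniformly random concept."""
    return rng.randrange(n_concepts)


def pick_round_robin(t: int, n_concepts: int) -> int:
    """RR: cycle through the concepts in a fixed order (t is 1-indexed)."""
    return (t - 1) % n_concepts


def pick_lowest_recall(recall_probs: Sequence[float]) -> int:
    """LR: present the concept the learner is currently least likely to recall."""
    return min(range(len(recall_probs)), key=lambda i: recall_probs[i])
```

Only LR among these reacts to the learner's state; RD and RR ignore the response history entirely.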
We first evaluate the performance of our algorithm against the baselines as a function of the teaching horizon . In Fig. 2(a) and Fig. 2(b), we plot the objective value and average recall at for all algorithms over 10 random trials, where we set , with half medium and half difficult concepts, and vary . As we can see from both plots, GR consistently outperforms the random baseline in all scenarios. For reasonably small , when we are teaching multiple concepts with very limited resources (i.e., small budget on ), our greedy approach (GR) outperforms the other baselines. The performance of the lowest recall (LR) and round robin (RR) improves and eventually beats GR as we increase the budget — this behavior is expected, as it corresponds to the scenario where all items get a fair chance of repetition with abundant time budget. Furthermore, our analysis from §4.2 suggests that hard concepts (i.e., items with low retention rate) suffer more from the non-diminishing returns effect (see Fig. 1(e)), and thus can keep the myopic policy from approaching the optimal utility. In Fig. 2(c) and Fig. 2(d), we show the performance plot for a fixed teaching horizon of when we vary the number of concepts . Here we observe a similar behavior as before. Our results suggest that GR is optimized for the more challenging problem of teaching multiple concepts given a tight time budget.
6 User Study
We have developed online apps for two concrete real-world applications: (1) German vocabulary teaching [2], and (2) teaching novices to recognize bird species as part of a citizen science project [1]. We now briefly introduce the two systems, and present the results of deploying our vocabulary learning app to real human learners.
As part of our beta testing for the German vocabulary teaching app, we collected 100 English-German word pairs in the form of flashcards, each associated with a descriptive image. To extract a fine-grained signal for our user study, we further categorize the words into three difficulty levels based on a thorough evaluation of each word from a domain expert. For the bird teaching app, we collected an initial set of 18 of the most common bird species in North America. Examples from both datasets can be seen in Fig. 3(a)-3(b).
Online teaching interface
We set up a simple and intuitive adaptive teaching interface to keep the learners engaged in our user study (see Fig. 3(d)). In the following discussion, we use German vocabulary learning as an example. Importantly, to establish an experimental setup that accurately reflects our modeling assumptions, we integrate the following design ideas.
An important component of the user evaluation is to understand the learner’s bias (or prior knowledge), which we cannot easily assess purely based on the learner’s feedback while learning. To resolve this issue, we introduce a prequiz phase for the user study, where we test the learner’s knowledge of all the words in the task by asking them to type in translations before the learning phase starts. After the learning phase, the learner will enter a postquiz testing phase. By recording this change in the learner’s performance, we can estimate the gain of the teaching session.
To leverage the lag effect of human learning, we impose a minimum time window for each flashcard presentation. In the learning phase, after a user enters her input for a question, she will have 10 seconds to review the correct answers provided by the system, before proceeding to the next question. Furthermore, we also set a maximal answering time for a question to prevent unnecessary delays of the teaching process. Therefore, the user will learn in constant time intervals, which is well-aligned with our discrete-time problem formulation.
Another important aspect is the short-term memory effect. In general, it is highly non-trivial to carry out large-scale user studies that span weeks or months (even though this would better fit our HLR model of the learner). Given the physical constraints of real-world experiments, we consider shorter teaching sessions of around 25-30 minutes, involving teaching 15 words across a total of 40 iterations. To mitigate the short-term memory effect raised by our experimental setting, we impose an additional constraint on our algorithm (henceforth GR) for the user study, such that it does not pick the same concept twice in a row (otherwise, a learner will simply “copy” the answer she sees on the previous screen). Furthermore, when computing the postquiz score, we exclude the first five entries at the postquiz phase (from a randomly shuffled test sequence) to further reduce the short-term memory bias.
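The "no immediate repetition" constraint can be layered on top of any gain-based selection rule. The sketch below is one simple way to do so, assuming the expected gains of all concepts have already been computed; `pick_without_repeat` is a hypothetical name, and falling back to the previous concept when it is the only one available is a choice made for this sketch.

```python
from typing import Optional, Sequence


def pick_without_repeat(gains: Sequence[float], last_concept: Optional[int]) -> int:
    """Pick the highest-gain concept that differs from the one just shown."""
    candidates = [i for i in range(len(gains)) if i != last_concept]
    if not candidates:
        # Only one concept exists, so repeating it is unavoidable.
        return 0
    return max(candidates, key=lambda i: gains[i])
```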
For the user study, we focus on the German vocabulary learning problem, and run each candidate algorithm with on 30 workers each on Amazon Mechanical Turk. Note that for these real-world experiments, we do not have explicit access to the learner’s memory model. While it is possible to fit a HLR model through an extensive pre-study survey as in , we observe from our simulated experiments that our adaptive algorithm is robust to a wide range of parameter configurations. After a thorough validation on the simulated learners, we choose for both the GR and LR as the “robust” version of the two teaching algorithms. Results for real human workers are shown in Fig. 3(c). Overall, GR achieved higher gain than the baselines. Although a fair number of learners fail to achieve good performance, GR managed to teach a larger fraction of the “fast” learners achieving better performance compared to the baselines, which suggests that our framework is a promising strategy for vocabulary teaching.
We presented an algorithmic framework for teaching multiple concepts to forgetful learners. We proposed a novel discrete formulation of teaching based on stochastic sequence function optimization, and provided a general theoretical tool for deriving performance bounds. We showed that although the theoretical performance bound is NP-hard to compute in general, we can efficiently compute such bounds for certain memory models of the learner. We have implemented a publicly available learning platform for two concrete applications. We believe our results have made an important step towards bringing the theoretical understanding of machine teaching closer to real-world applications where the forgetting phenomenon is an intrinsic factor.
This work was supported in part by NSF Award #1645832, Northrop Grumman, Bloomberg, AWS Research Credits, Google as part of the Visipedia project, and a Swiss NSF Early Mobility Postdoctoral Fellowship.
[1] Website for teaching birds species. https://www.teaching-birds.cc.
[2] Website for teaching German vocabulary. https://www.teaching-german.cc.
[3] SuperMemo: your breakthrough speed-learning software: version 6: Benutzerhandbuch und Referenz. 1993.
[4] Steven Arild Wuyts Andersen, Peter Trier Mikkelsen, Lars Konge, Per Cayé-Thomasen, and Mads Sølvsten Sørensen. Cognitive load in distributed and massed practice in virtual reality mastoidectomy simulation. The Laryngoscope, 126(2), 2016.
[5] David A Balota, Janet M Duchek, and Jessica M Logan. Is expanded retrieval practice a superior form of spaced retrieval? A critical review of the extant literature. Psychology Press New York, NY, 2007.
[6] Kristine C Bloom and Thomas J Shuell. Effects of massed and distributed practice on the learning and retention of second-language vocabulary. The Journal of Educational Research, 74(4):245–248, 1981.
[7] Hermann Ebbinghaus. Über das Gedächtnis: Untersuchungen zur experimentellen Psychologie. Duncker & Humblot, 1885.
[8] Daniel Golovin and Andreas Krause. Adaptive submodularity: Theory and applications in active learning and stochastic optimization. Journal of Artificial Intelligence Research, 42:427–486, 2011.
[9] S. Leitner and R. Totter. So lernt man lernen. Angewandte Lernpsychologie ein Weg zum Erfolg. Herder, 1972.
[10] Michael McCloskey and Neal J Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of learning and motivation, volume 24, pages 109–165. Elsevier, 1989.
[11] Paul Pimsleur. A memory schedule. The Modern Language Journal, 51(2):73–75, 1967.
[12] Siddharth Reddy, Igor Labutov, Siddhartha Banerjee, and Thorsten Joachims. Unbounded human learning: Optimal scheduling for spaced repetition. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1815–1824. ACM, 2016.
[13] Grace Rubin-Rabson. Studies in the psychology of memorizing piano music: II. A comparison of massed and distributed practice. Journal of Educational Psychology, 31(4):270, 1940.
[14] Burr Settles and Brendan Meeder. A trainable spaced repetition model for language learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, volume 1, pages 1848–1858, 2016.
[15] Wayne L Shebilske, Barry P Goettl, Kip Corrington, and Eric Anthony Day. Interlesson spacing and task-related processing during complex skill acquisition. Journal of Experimental Psychology: Applied, 5(4):413, 1999.
[16] Amy L Simmons. Distributed practice and procedural memory consolidation in musicians’ skill learning. Journal of Research in Music Education, 59(4):357–368, 2012.
[17] Edward N Spruit, Guido PH Band, and Jaap F Hamming. Increasing efficiency of surgical training: effects of spacing practice on skill acquisition and retention in laparoscopy training. Surgical Endoscopy, 29(8):2235–2243, 2015.
[18] Tom Stafford and Michael Dewar. Tracing the trajectory of skill learning with a very large sample of online game players. Psychological Science, 25(2):511–518, 2014.
[19] Brian L Sullivan, Christopher L Wood, Marshall J Iliff, Rick E Bonney, Daniel Fink, and Steve Kelling. eBird: A citizen-based bird observation network in the biological sciences. Biological Conservation, 142(10):2282–2292, 2009.
[20] Behzad Tabibian, Utkarsh Upadhyay, Abir De, Ali Zarezade, Bernhard Schoelkopf, and Manuel Gomez-Rodriguez. Optimizing human learning. arXiv preprint arXiv:1712.01856, 2017.
[21] Ovid J Tzeng. Stimulus meaningfulness, encoding variability, and the spacing effect. Journal of Experimental Psychology, 99(2):162–166, 1973.
[22] EGG Verdaasdonk, LPS Stassen, RPJ Van Wijk, and J Dankelman. The influence of different training schedules on the learning of psychomotor skills for endoscopic surgery. Surgical Endoscopy, 21(2):214–219, 2007.
[23] Zhenliang Zhang, Edwin KP Chong, Ali Pezeshki, and William Moran. String submodular functions with curvature constraints. IEEE Transactions on Automatic Control, 61(3):601–616, 2016.
Appendix A Proofs
A.1.1 Notations and Definitions
For simplicity, we first introduce the notation which will be used in the proof.
Let us use function to represent a learner’s recall of item at , where indicates that the learner recalls item correctly at time , and otherwise. We call the function a realization, and use to denote a random realization. A realization is consistent with the observation history , if for all . We denote such case by .
We further use to denote the sequence of items and observations obtained by running policy under realization . Here, denotes the sequence of items selected by if the learner is responding according to .
Similarly to the conditional marginal gain of an item (Eq. (4)), we define the conditional marginal gain of a policy as follows.
Definition 3 (Conditional marginal gain of a policy).
Given observation history and an item , the conditional marginal gain of a policy is defined as
A.1.2 Proof of Theorem 1
To prove Theorem 1, we first establish a lower bound on the one-step gain of the greedy algorithm. The following lemma provides a lower bound of the one-step conditional marginal gain of the greedy policy against the conditional marginal gain of any policy (of length ).
Suppose we have selected sequence and observed . Then, for any policy of length ,
By Definition 3 we know that for all it holds that
Here, step (a) is a telescoping sum, and step (b) is by the law of total expectation.
In the following we provide the proof of Theorem 1.
Proof of Theorem 1.
Therefore, we get
Therefore, we get
where step (a) and step (b) are by the law of total expectation. Recursively applying Eq. (15) gives us
which completes the proof. ∎
A.1.3 Proof of Corollary 2
A.2 Proof of Theorem 3
In this section, we provide the proof for Theorem 3. In particular, we divide the proof into two parts. In §A.2.1, we propose a polynomial time algorithm which outputs a lower bound on ; in §A.2.2, we provide an upper bound on which can be computed in linear time.
A.2.1 Empirical lower bound on for the case
Let us use to denote the function that returns the number of times item appears in sequence . We first show the following lemma.
Fix . For any , we have
where denotes the sequence of items of length , where the first items are item and the remaining items are empty.
By definition of the marginal gain (Eq. (4))
For the case , the objective function is independent of the observed outcomes of the learner’s recall. That is,
Denote . For any , we know that
Here, step (a) is due to the fact that the learner’s recall of an item is monotonically decreasing (therefore showing item earlier leads to lower recall in the future). This completes the proof. ∎