Generative Grading: Neural Approximate Parsing for Verifiable Automated Student Feedback


Ali Malik, Mike Wu, Vrinda Vasavada, Jinpeng Song
John Mitchell, Noah Goodman, Chris Piech
Department of Computer Science, Stanford University
Department of Psychology, Stanford University
{malikali, wumike, vrindav, jsong5, jcm, ngoodman, piech}
Equal contribution.

Open access to high-quality education is limited by the difficulty of providing student feedback at scale. In this paper, we present Generative Grading with Neural Approximate Parsing (GG-NAP): a novel computational approach for providing feedback at scale that both grades student work accurately and provides verifiability, a property where the model is able to substantiate its claims with a provable certificate. Our approach uses generative descriptions of student cognition, written as probabilistic programs, to synthesise millions of labelled example solutions to a problem; it then trains inference networks to approximately parse real student solutions according to these generative models. With this approach, we achieve feedback prediction accuracy comparable to human experts in many settings: short-answer questions, programs with graphical output, block-based programming, and short Java programs. In a real classroom, we ran an experiment where humans used GG-NAP to grade, yielding doubled grading accuracy while halving grading time.


Enabling global access to high-quality education at scale is one of the core grand challenges in education. With recent advancements in machine learning, computer-assisted approaches show promise in providing open access to world-class instruction and a reduction in the growing cost of learning (Bowen, 2012). However, a major barrier to this endeavour has been the need to automatically provide meaningful and timely feedback on student work.

Learning to provide feedback has proven to be a hard machine learning problem. Despite extensive research that combines massive education data with cutting-edge deep learning (Piech et al., 2015; Basu et al., 2013; Yan et al., 2019; Wang et al., 2017; Liu et al., 2019; Hu and Rangwala, 2019), most approaches fall short. Five issues have emerged: (1) student work is highly varied, exhibiting a heavy tailed (Zipf) distribution so that most solutions will not be observed even in large datasets, (2) student work is hard and expensive to label, (3) we want to provide feedback (without historical data) for even the very first student, (4) grading is a precision-critical domain since there is a high cost to misgrading a student, and (5) predictions must be explainable and justifiable to instructors and students. These challenges are typical of many human-centred AI problems, such as diagnosing rare diseases or predicting recidivism rates.

When real instructors provide feedback, they perform the difficult task of classifying a student’s misconceptions given their solution. In practice, instructors are much more adept at thinking “generatively”: they can imagine the misconceptions a student might have, and construct the space of solutions a student with these misconceptions would likely produce. Recently, Wu et al. (2018b) used this intuition to show that if student misconceptions and their corresponding solution set can be decomposed in the form of a probabilistic context-free grammar (PCFG), then a neural network trained on samples from this PCFG vastly outperforms data-hungry supervised approaches in classifying student misconceptions. While this work provides a novel paradigm, it is limited by the difficulty of writing cognitive models in the form of just PCFGs. Further, the inference techniques of Wu et al. (2018b) do not scale well to more complex problems and provide no notion of verifiability.

In this paper, we address these limitations by introducing a more flexible class of probabilistic program based grammars (PPGs) for describing student cognitive models. These grammars support arbitrary functional transformations and complex decision dependencies, allowing an instructor to model student solutions to more difficult problems like CS1 programming or short-answer questions. These more expressive grammars present a challenging inference problem that cannot be tackled by prior methods from Wu et al. (2018b). In response, we develop Neural Approximate Parsing (GG-NAP): a novel algorithm that parses a given student solution to find an execution trace of the grammar that produces this solution. Not only does this kind of inference allow for classifying misconceptions (the execution trace can be inspected for which confusions are present), but the provided execution trace of the grammar can serve as a verifiable justification for the model’s predictions.

When we apply GG-NAP to open-access datasets, we are able to grade student work with close to expert human-level fidelity, substantially improving upon prior work across a spectrum of public education datasets: introduction to computer programming, short answers to a citizenship test, and graphics-based programming. We show a 50%, 160% and 350% improvement above the state-of-the-art, respectively. When used with human verification in a real classroom, we are able to double grading accuracy while halving grading time. Moreover, the grading decisions made by our algorithm are auditable and interpretable by an expert teacher due to the provided execution trace. Our algorithm is “zero-shot” and thus works for the very first student. Further, writing a generative grammar requires no special expertise and is orders of magnitude cheaper than manually labelling data.

Since predicted labels correspond to meaningful cognitive states, not merely grades, they can be used in many ways: to give hints to students without teachers, to help teachers understand the learning ability of students and classrooms, or to help teachers customise curriculums. We see this work as an important stepping stone to scaling automated feedback to student work at the level of introductory classes, where instructor resources are stretched especially thin.


The Automated Grading Challenge

In computational education, there are two important machine learning tasks related to “grading” student work. First, we consider feedback prediction, or labelling a given student solution with misconceptions. These misconceptions usually represent semantic concepts e.g. a student who manually iterates over a sequence may not understand loop structures.

Unlike in most machine learning problems, however, we cannot judge a computational model solely by its accuracy on this predictive task. In a safety-critical domain like education, teachers must be able to verify and justify the claims of a computational agent before providing them to the student. Otherwise, we run the costly risk of providing incorrect feedback, a mistake with potentially devastating impact on student learning. Therefore, the second task we tackle is verifiable prediction, in which the algorithm must either return a prediction along with a certificate for correctness, or declare uncertainty (and perhaps still provide a best guess). While many methods have been presented for feedback prediction (Wu et al., 2018b; Piech et al., 2015; Wang et al., 2017), to the best of our knowledge, this work is the first to tackle verifiable prediction for grading student work in education.

Difficulty of Automated Feedback

Disregarding the requirement of verifiability, feedback prediction alone has been an extremely difficult challenge in education research. Even limited to simple problems in computer science like beginner drag-and-drop programming, automated solutions to providing feedback have been restricted by limited data and lack of robustness. In 2014, one of the largest and most widely used online programming resources for beginners in computer science ran an initiative to crowdsource thousands of instructors to label 55,000 student solutions to simple geometric drawing problems in its block programming language. With over 40,228,194 students enrolled on the platform, automating feedback on problems like these is one of the hardest and most impactful challenges it faces. Yet, despite having access to an unprecedented amount of labelled data, traditional supervised methods failed to perform well on even these “simple” questions. (Labelling educational data requires expert knowledge, unlike labelling images: for example, 800 student solutions to a block programming problem took 26 hours to label (Wu et al., 2018b).) In the broader landscape of education, the situation is worse: there is hardly ever any labelled data and student solutions are Zipfian, i.e. the space of correct solutions is simple but the space of incorrect solutions is enormous (see Fig. 1).

Figure 1: Student solutions (across many domains) exhibit heavy-tailed Zipf distributions, meaning a few solutions are extremely common but all other solutions are highly varied and show up rarely. This suggests that the probability of a student submission not being present in a dataset is high, making supervised learning on a small dataset ineffective. Panels: (b) Liftoff, (c) Pyramid, (d) Power.

Generative Grading

Faced with the limitations of traditional supervised approaches, we tackle these grading problems using a “generative” approach. Instead of labelling data, an expert is asked to model the student cognitive process by describing the misconceptions a student might have along with the corresponding space of solutions a student with these misconceptions would likely produce. If we can instantiate these expert beliefs as a real generative model (e.g. probabilistic grammar), then we possess a simulator from which we can sample infinite amounts of “labelled” data, allowing for zero-shot learning. While modelling solutions to large problems is difficult, representing the problem-solving process as a hierarchical set of decisions allows decomposition of this hard task into simpler ones, making it surprisingly easy for experts to express their knowledge of student cognition. We refer to this approach as “generative grading”.

In previous work, Wu et al. (2018b) represent these student cognition models as instructor-written probabilistic context-free grammars (PCFGs) and use them to generatively grade student submissions to problems. Although they boast promising results, we find the limitation to context-free grammars excessively restrictive, especially when tackling more complex domains like CS1 programming. Our challenge, then, is to define an expressive enough class of probabilistic models that can capture the complexities of expert priors (and student behaviour), while still being able to do inference and parsing of student solutions.

Neural Parsing for Inference in Grammars

In this section, we define the class of grammars called Probabilistic Program Grammars and describe several motivating properties that make them useful for generative grading.

Probabilistic Program Grammar

We aim to describe a class of grammars powerful enough to easily encode any instructor’s knowledge of the student decision-making process. While it is easy to reason about context-free grammars, context independence is a strong restriction that generally limits what instructors can express. As an example, imagine capturing the intuition that students can write a for loop two ways:

for (int i = 0; i < 10; i++) { println(10 - i); }  // version 1
for (int n = 10; n > 0; n-=1) { println(n); }      // version 2

Clearly, the decisions for the “for loop” header (i < 10; i++) and the “print” statement depend on the start index (i = 0) and the choice of variable name (i), as do future decisions like off-by-one errors. Coordinating these decisions in a context-free grammar requires a great profusion of nonterminals and production rules, which are burdensome for a human to create. Perhaps not surprisingly, even in simple programming exercises in Java or Python, this and more complex types of conditional execution are abundant.

We thus introduce a broader class of grammars called Probabilistic Program Grammars (PPGs) that enable us to condition choices on previous decisions and a globally accessible state. A Probabilistic Program Grammar $G$ is more rigorously defined as a subclass of general probabilistic programs, equipped with a tuple $(N, \Sigma, S, Q, \Pi)$ denoting a set of nonterminals, a set of terminals, a start node, a global state, and a set of probabilistic programs, respectively. A production from the grammar is a recursive generation from the start node $S$ to a sequence of terminals based on production rules. Unlike in PCFGs, a production rule is described by a probabilistic program from $\Pi$, so that a given nonterminal can be expanded in different ways based on samples from its random variables, the shared state $Q$, and contextual information about other nonterminals rendered in the production. Further, the production rule can also modify the global state $Q$, thus affecting the behaviour of future nonterminals. Lastly, the PPG can transform the final sequence of terminals into an arbitrary space (e.g. from strings to images) to yield the production $y$. Each derivation is associated with a trajectory $\vec{z} = \{(i_t, z_t)\}_{t=1}^{T}$ of nonterminals encountered during execution. Here, $i_t$ denotes a unique lexical identifier for each random variable encountered in order and $z_t$ stores the sampled value. Define the joint distribution (induced by $G$) over trajectories and productions as $p_G(\vec{z}, y)$.
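To make the definition concrete, here is a minimal Python sketch of a PPG-style generator for the Liftoff for-loop example. It is an illustration under stated assumptions, not the grammar used in the paper: the function name `liftoff_ppg`, the decision names, and all probabilities are hypothetical. It shows the two ingredients that a PCFG lacks, a global state that later rules read, and a recorded trajectory of (identifier, value) pairs.

```python
import random


def liftoff_ppg(rng, forced=None):
    """Sample one production from a toy Liftoff-style grammar.

    Returns (production, trajectory). `forced` optionally maps a decision
    name to a value, overriding the random choice (useful for replaying).
    All names and probabilities here are illustrative assumptions.
    """
    forced = forced or {}
    state = {}        # global state shared by all production rules
    trajectory = []   # ordered (identifier, sampled value) pairs

    def choose(name, options, weights):
        value = forced.get(name)
        if value is None:
            value = rng.choices(options, weights=weights)[0]
        trajectory.append((name, value))
        return value

    # Early decisions are written to the global state ...
    state["direction"] = choose("direction", ["up", "down"], [0.4, 0.6])
    state["var"] = choose("var_name", ["i", "n"], [0.7, 0.3])
    off_by_one = choose("off_by_one", [False, True], [0.8, 0.2])

    # ... and later rules depend on them, which a PCFG cannot express directly.
    v = state["var"]
    if state["direction"] == "up":
        end = 9 if off_by_one else 10
        header = f"for (int {v} = 0; {v} < {end}; {v}++)"
        body = f"println(10 - {v});"
    else:
        end = 1 if off_by_one else 0
        header = f"for (int {v} = 10; {v} > {end}; {v}--)"
        body = f"println({v});"
    return f"{header} {{ {body} }}", trajectory
```

Calling `liftoff_ppg(random.Random(0))` draws one labelled example; the trajectory doubles as the set of labels (for instance, whether the off-by-one misconception is present).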

Given such a grammar, we are interested in parsing: the task of mapping a production $y$ to the most likely trajectory in the PPG that could have produced $y$. This is a difficult search problem: the number of trajectories grows exponentially even for simple grammars, and common methods for parsing by dynamic programming (Viterbi, CYK) are not applicable in the presence of context-sensitivity and functional transformations. To make this problem tractable, we use deep neural networks to approximate the posterior distribution over trajectories. We call this approach neural approximate parsing with generative grading, or GG-NAP.

Neural Inference Engine

MAP inference over trajectories is difficult: trajectories can vary in length and contain nonterminals with different support. To approach this, we decompose the inference task into a set of easier sub-tasks. The posterior distribution over a trajectory $\vec{z} = (z_1, \dots, z_T)$ given a yield $y$ can be written as the product of individual posteriors over each nonterminal using the chain rule:

$$p_G(\vec{z} \mid y) = \prod_{t=1}^{T} p_G(z_t \mid y, z_{<t}) \tag{1}$$

where $z_{<t}$ denotes previous (possibly non-contiguous) nonterminals $z_1, \dots, z_{t-1}$. Eqn. 1 shows that we can learn each posterior $p_G(z_t \mid y, z_{<t})$ separately. With an autoregressive model, we can efficiently represent the influence of previous nonterminals using a shared hidden representation over timesteps. Since the input to this model needs to be of fixed dimension, we have to represent all relevant inputs in a consistent manner (see appendix for details).

Firstly, to encode the production $y$, we use standard machinery (e.g. CNNs for images, RNNs for text) with a fixed output dimension. To represent the nonterminals with different support, we define three layers for each random variable $z_t$ with lexical identifier $i_t$: (1) a one-hot embedding layer that uses the index $i_t$ to lexically identify the random variable, (2) a value embedding layer that maps the value of $z_t$ to a fixed-dimension vector, and (3) a value decoding layer that transforms the hidden output state of the autoregressive model into parameters of the posterior for the next nonterminal $z_{t+1}$. Thus, the input to the autoregressive model has a fixed size, being the concatenation of the value embedding, index embedding, and production encoding.

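To train GG-NAP, we optimise the objective

$$\mathcal{L}(\theta) = \mathbb{E}_{p_G(\vec{z}, y)}\left[\log q_\theta(\vec{z} \mid y)\right] = \mathbb{E}_{p_G(\vec{z}, y)}\left[\sum_{t=1}^{T} \log q_\theta(z_t \mid y, z_{<t})\right] \tag{2}$$

where $\theta$ are all trainable parameters, the expectation is over trajectories $\vec{z}$ and productions $y$ sampled from the grammar, and $q_\theta(\vec{z} \mid y)$ represents the posterior distribution defined by the inference engine. (Since we are given the identifier $i_t$ of each random variable, we can parameterise $q_\theta(z_t \mid y, z_{<t})$ to be from the correct distributional family.) At test time, given only a production $y$, GG-NAP recursively samples $z_t \sim q_\theta(z_t \mid y, z_{<t})$ for $t = 1, \dots, T$ and uses each sample as the input to the next step of the autoregressive model, as in usual sequence generation models (Graves, 2013).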
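The test-time procedure can be sketched as a step-by-step decoding loop. The sketch below is an assumption-laden illustration: the trained inference network is replaced by a hand-written stand-in (`toy_posterior`, hypothetical), and it decodes greedily (taking the most probable value at each step) instead of sampling, so that the result is deterministic. It accumulates the log-probability of the decoded trajectory as a sum of per-step log posteriors, following the chain-rule factorisation.

```python
import math


def greedy_parse(y, step_posterior, max_steps=50):
    """Decode a trajectory one nonterminal at a time.

    step_posterior(y, prefix) must return None when the trajectory is
    complete, otherwise (identifier, {value: probability}).
    """
    trajectory, log_prob = [], 0.0
    for _ in range(max_steps):
        step = step_posterior(y, trajectory)
        if step is None:
            break
        identifier, dist = step
        value = max(dist, key=dist.get)      # greedy choice, not a sample
        log_prob += math.log(dist[value])    # chain rule: sum of step log-probs
        trajectory.append((identifier, value))  # fed back in as the next prefix
    return trajectory, log_prob


def toy_posterior(y, prefix):
    """Hand-written stand-in for a trained network (illustrative only)."""
    if len(prefix) == 0:
        p_down = 0.95 if "--" in y else 0.05
        return "direction", {"down": p_down, "up": 1 - p_down}
    if len(prefix) == 1:
        p_n = 0.9 if "int n" in y else 0.1
        return "var_name", {"n": p_n, "i": 1 - p_n}
    return None
```

Replacing the `max` with a draw from `dist` gives the sampling variant described in the text, and drawing several trajectories allows a best-of-k selection.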

Note that inference over trajectories is much more difficult than just classification. Previous work in generative grading (Wu et al., 2018b) only learned to classify an output program to a fixed set of labels. To draw the distinction, GG-NAP produces a distribution over possible parses where each nonterminal is associated with one or more labels.

Relationship to Viterbi Parsing

To check that neural approximate parsing is a sensible approach, we evaluate it on a simple class of grammars where exact parsing (via dynamic programming) is possible. Wu et al. (2018b) released PCFGs for two exercises (P1 and P8) that produce block code. These grammars are large: P1 has 3k production rules whereas P8 has 263k. Given a PCFG, we compare GG-NAP to Viterbi (CYK) in terms of retrieving the correct trajectory for productions from the grammar. We measure trajectory accuracy: the fraction of nodes that are in both parses.

Table 1: Agreement between Viterbi and Neural Parsing

PCFG               Trajectory Acc.
P1 (MAP)           0.943
P1 (best-of-10)    0.987
P8 (MAP)           0.917
P8 (best-of-10)    0.921

Using 5,000 generated samples from each PCFG, we found trajectory accuracies of 94% and 92% for P1 and P8 respectively, meaning that Viterbi and GG-NAP agree in almost all cases. Further, if we draw multiple samples from the GG-NAP posterior and take the best one, we find improvements of up to 4%. In exchange for being approximate, GG-NAP is not restricted to PCFGs and can even parse outputs not in the grammar to a plausible nearest in-grammar neighbour. Finally, it is orders of magnitude faster than Viterbi: 0.3 vs 183 seconds for P8 (see appendix).
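As a rough illustration of the metric, the sketch below computes trajectory accuracy and a best-of-k variant. The text only says "the fraction of nodes that are in both parses", so the choice of denominator (the number of nodes in the reference parse) is an assumption made here, as are the function names.

```python
def trajectory_accuracy(reference, predicted):
    """Fraction of reference nodes (identifier, value) also found in `predicted`.

    Normalising by the reference parse is an assumption of this sketch.
    """
    ref = set(reference)
    if not ref:
        return 1.0
    return len(ref & set(predicted)) / len(ref)


def best_of_k(reference, candidates):
    """Best trajectory accuracy among several sampled parses."""
    return max(trajectory_accuracy(reference, c) for c in candidates)
```

For example, a predicted parse that agrees with the exact parse on three of four nodes scores 0.75, and adding a second candidate that matches fully raises the best-of-k score to 1.0.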

Verifiable Nearest Neighbour Retrieval

If we can parse a student solution to a trajectory of nonterminals, then we can sample the grammar production from this trajectory—if this sample is equal to the original solution, then that is a proof that the parse was correct. In the case that the sample is not an exact match, we can treat the parsed production as a “nearest in-grammar neighbour” of the original solution, which is still useful in downstream tasks.

More formally, assume we are given a production $y$ from a grammar $G$. Let the sequence $\hat{z}$ refer to the inferred trajectory for $y$ and $z^*$ refer to the true (unknown) trajectory. If we repeatedly generate from the grammar while fixing the values for each encountered random variable to $\hat{z}_t$, then we should be able to generate the exact production $y$, showing with certainty that $\hat{z} = z^*$. In practice, very few samples are needed to recover $y$. On the other hand, if an observation $y$ is not in the grammar (like some real student programs), $z^*$ is not well-defined and the inferred trajectory $\hat{z}$ will be incorrect. However, $\hat{z}$ will still specify a production $\hat{y}$ that we can interpret as an approximate nearest neighbour to $y$ in $G$. Intuitively, we expect $y$ and $\hat{y}$ to be “similar” semantically, as specified by the nonterminals in $\hat{z}$. In practice, we can measure a domain-specific distance between $y$ and $\hat{y}$, e.g. token edit distance for text.
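The verification step can be sketched as "replay the inferred trajectory and compare". The toy grammar below is hypothetical and fully determined by its two decisions, so a single replay suffices; in a grammar with additional unfixed randomness, one would regenerate several times as described above.

```python
def render(trajectory):
    """Deterministically replay a toy Liftoff grammar from fixed choices."""
    z = dict(trajectory)
    var = z["var_name"]
    if z["direction"] == "up":
        return f"for (int {var} = 0; {var} < 10; {var}++) {{ println(10 - {var}); }}"
    return f"for (int {var} = 10; {var} > 0; {var}--) {{ println({var}); }}"


def verify(solution, inferred_trajectory):
    """Return (is_verified, nearest_in_grammar_neighbour).

    An exact match is a certificate that the parse produces the solution;
    otherwise the replayed production serves as an in-grammar neighbour.
    """
    neighbour = render(inferred_trajectory)
    return neighbour == solution, neighbour
```

A student who writes `n-=1` instead of `n--` falls outside this toy grammar: `verify` then returns `False` together with the `n--` variant as the neighbour, which is exactly the pair a grader would diff.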

In education, verifiable prediction adds an important ingredient of interpretability, whereby teachers can be confident in the feedback that models provide. Furthermore, with intelligent grading systems, the nearest neighbour, along with its known labels, can greatly assist human grading. A grader can “grade the diff” by comparing the real solution with this nearest neighbour and adjusting the labels accordingly. In our experiments, we show this to achieve super-human grading precision while reducing grading time.

k-Nearest Neighbour Baseline

As a strong baseline for verifiable prediction, we simply use a k-nearest neighbour classifier: we generate and store a dataset with hundreds of thousands of unique productions as well as their associated trajectories. At test time, given an input to parse, we can find its nearest neighbour using a linear search of the stored samples and return its associated trajectory. If the neighbour is an exact match, the prediction is verifiable. We refer to this baseline as GG-kNN. Depending on the grammar, productions will be in a different output space (image, text) and thus the distance metric used for GG-kNN will be domain dependent. Note that GG-kNN is much more costly than GG-NAP in memory and runtime, as it needs to store and iterate through all samples.
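A minimal sketch of this baseline for text productions follows. It assumes whitespace tokenisation and Levenshtein distance over tokens as the domain-specific metric; both are illustrative choices, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two token sequences (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute or match
        prev = curr
    return prev[-1]


def nearest_neighbour(query, dataset):
    """Linear scan over stored (production, trajectory) pairs.

    Returns ((production, trajectory), is_exact_match).
    """
    tokens = query.split()
    best = min(dataset, key=lambda item: edit_distance(tokens, item[0].split()))
    return best, best[0] == query
```

The linear scan makes the cost visible: every query touches every stored sample, which is why this baseline is far more expensive than a single forward pass of an inference network.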

Adaptive Sampling

Figure 2: Efficiency of different sampling strategies for Liftoff grammar. (left) Number of unique samples vs total samples so far. (right) Good-Turing estimates: probability of sampling a unique next program given samples so far.

As both GG-kNN and GG-NAP require a dataset of samples for training, we must be able to generate unique productions from a grammar efficiently. For GG-kNN specifically, the number of unique productions directly determines the quality of the model. However, because productions follow a Zipf distribution, generating unique data points can be expensive: naive sampling repeatedly draws the most common productions.
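One way to quantify how expensive the next unique sample will be is the Good-Turing estimate mentioned in the caption of Figure 2: the probability that the next draw is a previously unseen item is approximately the number of items seen exactly once divided by the total number of draws. A small sketch (the function name is hypothetical):

```python
from collections import Counter


def prob_next_is_new(samples):
    """Good-Turing estimate of the chance that the next sample is unseen."""
    if not samples:
        return 1.0
    counts = Counter(samples)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(samples)
```

When this estimate drops towards zero under naive sampling, most further draws are duplicates, which is the situation the next section tries to avoid.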

To make sampling more efficient, we present a novel method called Adaptive Grammar Sampling that downweights the probabilities of decisions in proportion to how many times they lead to duplicate productions. This algorithm has many useful properties and is based on Monte-Carlo Tree Search and the Wang-Landau algorithm from statistical physics. We consider this an interesting corollary and refer the reader to the supplement. Fig. 2 shows an example of how much more efficient this algorithm is compared to simply sampling naively from the Liftoff grammar. In practice, adaptive sampling has a parameter that can be tuned to control how fast we explore the Zipf, allowing us to preserve likely productions from the head and body.
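The full algorithm is described in the supplement; the sketch below is only a simplified illustration of the downweighting idea, not the authors' method. It assumes a flat list of independent decisions, identifies a production with its trajectory, and multiplies the weight of every choice on a duplicate trajectory by a `decay` factor (a hypothetical stand-in for the tunable parameter mentioned above).

```python
import random


class AdaptiveSampler:
    """Toy sampler that penalises choices leading to duplicate samples."""

    def __init__(self, decisions, decay=0.5, seed=0):
        # decisions: list of (name, {value: prior_weight})
        self.decisions = decisions
        self.decay = decay
        self.rng = random.Random(seed)
        self.weights = {name: dict(prior) for name, prior in decisions}
        self.seen = set()

    def sample(self):
        trajectory = []
        for name, _ in self.decisions:
            values = list(self.weights[name])
            w = [self.weights[name][v] for v in values]
            trajectory.append((name, self.rng.choices(values, weights=w)[0]))
        trajectory = tuple(trajectory)
        is_new = trajectory not in self.seen
        if not is_new:
            # Downweight every decision on the path to a duplicate.
            for name, value in trajectory:
                self.weights[name][value] *= self.decay
        self.seen.add(trajectory)
        return trajectory, is_new
```

A `decay` close to 1 explores the tail slowly and stays near the prior; a small `decay` pushes probability mass away from the head quickly.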


Figure 3: Summary of results for three datasets. GG-NAP outperforms the old state of the art (SOTA).

We test GG-NAP on a suite of public education datasets focusing on introductory courses either from online platforms or large universities. In each, we compare against the existing state-of-the-art (SOTA) model. First, we briefly introduce the datasets, then present results, focusing on a real classroom experiment we conducted. In summary, we find that GG-NAP beats the previous SOTA by a significant margin in all four educational domains. Further, it approaches (or surpasses in one case) human performance (see Fig. 3).


We consider four educational contexts. Refer to the supplement for example student solutions for each problem.

Block Coding

Wu et al. (2018b) released a dataset of student responses to 8 block-coding exercises involving drawing shapes with nested loops. We take the most difficult problem, drawing polygons with an increasing number of sides, which has 302 human-graded responses with 26 labels regarding looping and geometry (e.g. “missing for loop” or “incorrect angle”).

Powergrading (Text)

Powergrading (Basu et al., 2013) contains 700 responses to a US citizenship exam, each graded for correctness by 3 humans. Responses are in natural language, but are typically short (average of 4.2 words). We focus on the most difficult question, as measured by Riordan et al. (2017): “name one reason the original colonists came to America”. Responses span economic, political, and religious reasons.

PyramidSnapshot (Graphics)

PyramidSnapshot is a university CS1 course assignment intended to be a student’s first exposure to variables, objects, and loops. The task is to build a pyramid using Java’s ACM graphics library. The dataset is composed of images of rendered pyramids from intermediary “snapshots” of student work. Yan et al. (2019) annotated 12k unique snapshots with 5 categories representing “knowledge stages” of understanding.

Liftoff (Java)

Liftoff is a second assignment from a university CS1 course that tests looping. Students are tasked with writing a program that prints a countdown from 10 to 1 followed by the phrase “Liftoff”. We measure the performance of verifiable prediction by using GG-NAP with a human in the loop to grade 176 solutions from a semester of students, recording accuracy and grading time.

Results for Feedback Prediction

In each domain except Liftoff, we are given a small test dataset of student programs and labelled feedback. By design, we include each of the labels as a nonterminal in the grammar, thereby reducing prediction to parsing. (In general, we only require that the labels can be derived deterministically from the nonterminals.) To evaluate our models, we separately calculate performance for different regions of the Zipf: we define the head as the most popular solutions, the tail as solutions that appear only once or twice, and the body as the rest. As solutions in the head can be trivially memorised, we focus on the body and tail.
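The head/body/tail split can be sketched as follows. The text does not say how many solutions count as "the most popular", so the `head_size` parameter is a hypothetical choice made for illustration.

```python
from collections import Counter


def zipf_regions(solutions, head_size=2):
    """Partition distinct solutions into head, body and tail by frequency."""
    counts = Counter(solutions)
    ranked = [s for s, _ in counts.most_common()]
    head = set(ranked[:head_size])                 # most popular solutions
    tail = {s for s, c in counts.items()
            if c <= 2 and s not in head}           # seen only once or twice
    body = set(counts) - head - tail               # everything in between
    return head, body, tail
```

Reporting metrics separately for `body` and `tail` avoids crediting a model for solutions in `head` that could simply be memorised.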

Figure 4: CDF of edit distance between student programs and nearest-neighbours using various strategies.

GG-NAP sets the new SOTA, beating Wu et al. (2018b) in both the body and tail, and surpassing human performance (historically measured as F1). This is a big improvement over previous work involving supervised classifiers (Wu et al., 2018b; Wang et al., 2017) as well as zero-shot approaches like Wu et al. (2018b), which perform significantly below human quality. By removing the restriction of context independence, we are able to easily write richer grammars; combining this with the better predictive power of neural parsing leads to the improved performance. The potential impact of a human-level autonomous grader is large: the platform is used by 610 million students worldwide, and using GG-NAP could save thousands of human hours for teachers by providing the same quality of feedback at scale.


For this open dataset of short-answer responses, GG-NAP outperforms the previous SOTA with an F1 score of 0.93, an increase of 0.35 points. We nearly close the gap to human performance, measured to be F1 = 0.97, surpassing earlier work that used hand-crafted features (Daxenberger et al., 2014) and supervised neural networks (Riordan et al., 2017). Since the Powergrading responses contain (simple) natural language, we find these results to be a promising signal that GG-NAP could generalise to domains beyond just computer science classes.


As in the last two cases, GG-NAP is the new SOTA, outperforming baselines (kNN and VGG classifier) from Yan et al. (2019) by about a 50% gain in accuracy. (These baselines were trained on 200 labelled images.) Unlike other datasets, PyramidSnapshot includes students’ intermediary work, showing stages of progression through multiple attempts at solving the problem. With our near-human level performance, instructors could use GG-NAP to measure student cognitive understanding over time as students work. This builds in a real-time feedback loop between the student and teacher that enables a quick and accurate way of assessing teaching quality and characterising both individual and classroom learning progress. From a technical perspective, since PyramidSnapshot only includes rendered images (and not student code), GG-NAP was responsible for parsing student solutions from images alone, a feat not possible without the functional transformations allowed in PPGs.

Figure 5: (a) Classroom experiment results: plot of average time taken to grade 30 student solutions to Liftoff. GG-NAP convincingly reduces grading time for 26/30 solutions. The amount of time saved correlates with the token edit distance (yellow) to the GG-NAP nearest neighbour. (b) Automated dense feedback: GG-NAP allows for automatically associating student work with fine-grained automated feedback. (c) Auto-improving grammars: given a Liftoff grammar that can only increment up, we can track nonterminals where inference often fails and use that to estimate where the grammar needs improvement. The height of each bar represents the likelihood that improvements are needed for that nonterminal.

Human Guided Grading in a Classroom Setting

While good performance on benchmark datasets is promising, a true test of an algorithm is its effectiveness in the real world. For GG-NAP, we investigated its impact on grading accuracy and speed in a real classroom setting. To do this, we created a human-in-the-loop grading system using GG-NAP: for each student solution, a grader is presented with the student solution to grade, as well as a diff to the nearest in-grammar neighbour found using GG-NAP (see Fig. 9 in appendix). This nearest neighbour already has associated labels, and the grader adjusts these labels based on the diff to determine grades for the real solution.

As an experiment, we hired a cohort of expert graders (teaching assistants with similar experience from a large private university) who graded 30 real student solutions to Liftoff. For control, half the graders proceeded traditionally, assigning a set of feedback labels by just inspecting the student solutions. The other half of the graders additionally had access to (1) the feedback assigned to the nearest neighbour by GG-NAP and (2) a GitHub-style code differential between the student program and the nearest neighbour (see appendix). Some example feedback labels included “off by one increment”, “uses while loop”, or “confused with ”. All grading was done on a web application that kept track of the time taken to grade a problem.

We found that the average time for graders with GG-NAP was 507 seconds while the average time using traditional grading was 1130 seconds, making assisted grading roughly 2.2 times faster (1130/507 ≈ 2.2). Moreover, with GG-NAP, only 3 grading errors (out of 30) were made with respect to gold-standard feedback given by the course professor, compared to the 8 errors made with traditional grading. The improved performance stems from the semantically meaningful nearest neighbours provided by GG-NAP; compared to the GG-kNN baseline, the nearest neighbours found by GG-NAP are noticeably better (see Fig. 4). Having access to graded nearest neighbours that are semantically similar to the student solution helps increase grader efficiency and reliability by allowing them to focus on only “grading the diff” between the real solution and the nearest neighbour. By more than halving both the number of errors and the amount of time, GG-NAP can have a large impact in classrooms today, saving instructors and teaching assistants unnecessary hours and worry over grading assignments.

Related Work

“Rubric sampling” (Wu et al., 2018b) first introduced the concept of encoding expert priors in grammars of student decisions, and was the inspiration for our work. The authors design PCFGs to curate synthetically labelled datasets to train supervised classifiers. Our approach builds on this, but GG-NAP operates on a more expressive family of grammars that are context sensitive and comes with new innovations that enable effective inference. From our results, we see that expressivity is responsible for pushing GG-NAP past human-level performance. Furthermore, our paradigm adds an important notion of verifiability lacking in previous work, in contrast to the typical black-box nature of neural networks.

Inference over grammar trajectories is similar to “compiled inference” for execution traces in probabilistic programs. As such, our inference engine shares similarities with the PPL literature (Le et al., 2016). With PPGs, we get a nice interpretation of compiled inference as a parsing algorithm. We also show the promise of compiled inference in much larger probabilistic programs (with skewed prior distributions). Previous work usually involved fewer than ten random variables whereas our grammars grow to hundreds (Le et al., 2016; Wu et al., 2016; Lake et al., 2015).

The design of PPGs also draws on many influences from natural language processing. For starters, our neural inference engine can be viewed as an encoder in an RNN-based variational autoencoder (Bowman et al., 2015) that specifies a posterior distribution over many categorical variables. Further, the index embedding layer serves as a unique lexical identifier, similar to the positional encoding in transformers (Vaswani et al., 2017). Finally, the verifiable properties of GG-NAP have strong ties to explainable AI (Selvaraju et al., 2017; Hancock et al., 2018; Koh and Liang, 2017; Wu et al., 2018a; Ross and Doshi-Velez, 2018).


Highlighting feedback in student solutions

Rather than predicting feedback labels, it would be more useful to give “dense” feedback that highlights the section of the code or text responsible for the student misunderstanding. This would be much more effective for student learning than the vague error messages currently found on most online education platforms. To achieve this, we use GG-NAP to infer a trajectory for a given production. For every nonterminal in that trajectory, we want to measure its “impact” on the production. If each nonterminal has an associated production rule with an intermediate output, then highlighting amounts to finding the part of the production for which that intermediate output was responsible (via string intersection). Fig. 4(a) shows a random program with automated, segment-specific feedback given by GG-NAP. This level of explainability is sorely needed in both education and AI and could revolutionise how students are given feedback at scale.
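As a rough illustration of highlighting via string intersection, the sketch below locates each nonterminal's intermediate output inside the final production. The `Node` structure, the nonterminal names, and the example program are hypothetical and are not taken from the GG-NAP implementation; only the first occurrence of each substring is reported.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Node:
    """One decision in an inferred trajectory (hypothetical structure)."""
    name: str    # nonterminal name, e.g. "LoopHeader"
    output: str  # intermediate text this nonterminal produced


def highlight(production: str, trajectory: List[Node]) -> List[Tuple[str, int, int]]:
    """Return (nonterminal, start, end) spans of `production` for each node."""
    spans = []
    for node in trajectory:
        if not node.output:
            continue  # nodes that emit no text have nothing to highlight
        start = production.find(node.output)
        if start != -1:  # this sketch reports only the first occurrence
            spans.append((node.name, start, start + len(node.output)))
    return spans


program = "for (int i = 10; i > 0; i--) { println(i); }"
trajectory = [
    Node("LoopHeader", "for (int i = 10; i > 0; i--)"),
    Node("Increment", "i--"),
    Node("PrintStatement", "println(i);"),
]
spans = highlight(program, trajectory)
```

Each span can then be paired with the feedback label attached to its nonterminal, so that a label points at a specific segment of the student's text.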

Cost of writing good grammars.

Writing a good grammar does not require special expertise and can be undertaken by a novice in a short time. For instance, the PyramidSnapshot grammar that sets the new SOTA was written by a first-year undergraduate within a day. Furthermore, many aspects of grammars are re-usable: similar problems will share nonterminals and some invariances (e.g. the nonterminals that capture different ways of writing i++ are the same everywhere). This means every additional grammar is easier to write since it likely shares a lot of structure with existing grammars. Moreover, compared to weeks spent hand-labelling data, writing a grammar is orders of magnitude cheaper and leads to much better performance.

Automatically improving grammars

Building PPGs is an iterative process; a user wishing to improve their grammar would want a sense of where it is lacking. Fortunately, given a set of difficult examples where GG-NAP does poorly, we can deduce the nodes in the PPG that consistently lead to mistakes and use these to suggest components to improve. To illustrate this, we took the Liftoff PPG, which contains a crucial node that decides between incrementing up or down in a “for” loop, and removed the option of incrementing down. Training GG-NAP on the smaller PPG, we fail to parse student solutions that “increment down”. Given such a solution, to compute the probability that a nonterminal is “responsible” for the failure, we find its GG-NAP nearest neighbour and associated trajectory. Then, for each nonterminal in this trajectory, we can associate it with its substring in the solution (via highlighting). By finding the nonterminals where the substring often differs between the neighbour and the solution, we can identify nonterminals that often cause mismatches. Fig. 4(c) shows the distribution over which nodes GG-NAP believes to be responsible for the failed parses. The top 6 nonterminals that GG-NAP picked out all rightfully relate to looping and incrementation.
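The mismatch counting described above might be sketched as follows. This is a simplified illustration, assuming that highlighting has already produced, for each failed parse, a mapping from nonterminal name to substring for both the student solution and its nearest neighbour; the function name and the example data are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

# (substrings in the student solution, substrings in its nearest neighbour),
# both keyed by nonterminal name. All names here are hypothetical.
Pair = Tuple[Dict[str, str], Dict[str, str]]


def blame_distribution(pairs: List[Pair]) -> Dict[str, float]:
    """Normalised count of how often each nonterminal's substring differs."""
    mismatches = Counter()
    for solution_spans, neighbour_spans in pairs:
        for nonterminal, neighbour_text in neighbour_spans.items():
            if solution_spans.get(nonterminal) != neighbour_text:
                mismatches[nonterminal] += 1
    total = sum(mismatches.values())
    if total == 0:
        return {}
    return {name: count / total for name, count in mismatches.items()}


pairs = [
    ({"LoopInit": "i = 10", "Increment": "i--", "Print": "println(i)"},
     {"LoopInit": "i = 1", "Increment": "i++", "Print": "println(i)"}),
    ({"LoopInit": "i = 1", "Increment": "i -= 1", "Print": "println(i)"},
     {"LoopInit": "i = 1", "Increment": "i++", "Print": "println(i)"}),
]
blame = blame_distribution(pairs)
most_blamed = max(blame, key=blame.get)
```

In this toy example the increment nonterminal receives the largest share of the blame, which mirrors the kind of signal a grammar author would use to decide which node to extend.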


In this paper we make novel contributions to the task of providing automated student feedback that beat numerous state-of-the-art approaches and show significant impact when used in practice. The ability to finely predict student decisions opens up many doors in education. This work could be used to automate feedback, visualise student approaches for instructors, and make grading easier, faster, and more consistent. Although more work needs to be done on making powerful grammars easier to write, we believe this is an exciting direction for the future of education and a huge step in the quest for combining machine learning and human-centred artificial intelligence.


  • S. Basu, C. Jacobs, and L. Vanderwende (2013) Powergrading: a clustering approach to amplify human effort for short answer grading. Transactions of the Association for Computational Linguistics 1, pp. 391–402.
  • W. G. Bowen (2012) The ‘cost disease’ in higher education: is technology the answer? The Tanner Lectures, Stanford University.
  • S. R. Bowman, L. Vilnis, O. Vinyals, A. M. Dai, R. Jozefowicz, and S. Bengio (2015) Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349.
  • H. S. Chang, M. C. Fu, J. Hu, and S. I. Marcus (2005) An adaptive sampling algorithm for solving Markov decision processes. Operations Research 53 (1), pp. 126–139.
  • J. Daxenberger, O. Ferschke, I. Gurevych, and T. Zesch (2014) DKPro tc: a java-based framework for supervised learning experiments on textual data. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 61–66.
  • X. Glorot and Y. Bengio (2010) Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp. 249–256.
  • A. Graves (2013) Generating sequences with recurrent neural networks. CoRR abs/1308.0850.
  • B. Hancock, P. Varma, S. Wang, M. Bringmann, P. Liang, and C. Ré (2018) Training classifiers with natural language explanations. arXiv preprint arXiv:1805.03818.
  • Q. Hu and H. Rangwala (2019) Reliable deep grade prediction with uncertainty estimation. arXiv preprint arXiv:1902.10213.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • P. W. Koh and P. Liang (2017) Understanding black-box predictions via influence functions. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1885–1894.
  • B. M. Lake, R. Salakhutdinov, and J. B. Tenenbaum (2015) Human-level concept learning through probabilistic program induction. Science 350 (6266), pp. 1332–1338.
  • T. A. Le, A. G. Baydin, and F. Wood (2016) Inference compilation and universal probabilistic programming. arXiv preprint arXiv:1610.09900.
  • J. Liu, Y. Xu, and L. Zhao (2019) Automated essay scoring based on two-stage learning. arXiv preprint arXiv:1901.07744.
  • C. Piech, J. Bassen, J. Huang, S. Ganguli, M. Sahami, L. J. Guibas, and J. Sohl-Dickstein (2015) Deep knowledge tracing. In Advances in neural information processing systems, pp. 505–513.
  • B. Riordan, A. Horbach, A. Cahill, T. Zesch, and C. M. Lee (2017) Investigating neural architectures for short answer scoring. In Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications, pp. 159–168.
  • A. S. Ross and F. Doshi-Velez (2018) Improving the adversarial robustness and interpretability of deep neural networks by regularizing their input gradients. In Thirty-second AAAI conference on artificial intelligence.
  • R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra (2017) Grad-cam: visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, pp. 618–626.
  • K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008.
  • F. Wang and D. Landau (2001) Efficient, multiple-range random walk algorithm to calculate the density of states. Physical review letters 86, pp. 2050–3.
  • L. Wang, A. Sy, L. Liu, and C. Piech (2017) Learning to represent student knowledge on programming exercises using deep learning. In EDM.
  • M. Wu, M. C. Hughes, S. Parbhoo, M. Zazzi, V. Roth, and F. Doshi-Velez (2018a) Beyond sparsity: tree regularization of deep models for interpretability. In Thirty-Second AAAI Conference on Artificial Intelligence.
  • M. Wu, M. Mosse, N. Goodman, and C. Piech (2018b) Zero shot learning for code education: rubric sampling with deep learning inference. arXiv preprint arXiv:1809.01357.
  • Y. Wu, L. Li, S. Russell, and R. Bodik (2016) Swift: compiled inference for probabilistic programming languages. arXiv preprint arXiv:1606.09242.
  • L. Yan, N. McKeown, and C. Piech (2019) The pyramidsnapshot challenge: understanding student process from visual output of programs. In Proceedings of the 50th ACM Technical Symposium on Computer Science Education, SIGCSE ’19, New York, NY, USA, pp. 119–125. ISBN 978-1-4503-5890-3.
Figure 6: We show the prompt and example solutions for 4 problems, ranging from programming assignments to history tests.

Appendix A Model Hyperparameters

For reproducibility, we include all hyperparameters used in training GG-NAP. Unless otherwise stated, we use a batch size of 64, train for 10 or 20 epochs on 100k samples from a PPG. The default learning rate is 5e-4 with a weight decay of 1e-7. We use Adam (Kingma and Ba, 2014) for optimization. If the encoder network is an RNN, we use the Elman network with 4 layers, a hidden size of 256, and a probability of dropping out hidden units of 1%. If the encoder network is a CNN, we train VGG-11 (Simonyan and Zisserman, 2014) with Xavier initialization (Glorot and Bengio, 2010) from scratch. For training VGG, we found it important to lower the learning rate to 1e-5. The neural inference engine itself is an unrolled RNN: we use a gated recurrent unit with a hidden dimension of 256 and no dropout. The value and index embedding layers output a vector of dimension 32. These hyperparameters were chosen using grid search.
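For convenience, the values stated above can be collected into a single configuration object. The sketch below is only one possible arrangement; the class name and field names are hypothetical, while the numeric values are the ones listed in this appendix.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NAPConfig:
    """Hypothetical container for the hyperparameters listed in Appendix A."""
    batch_size: int = 64
    epochs: int = 10                   # the text reports 10 or 20 epochs
    num_samples: int = 100_000         # samples drawn from a PPG
    learning_rate: float = 5e-4
    weight_decay: float = 1e-7
    rnn_encoder_layers: int = 4        # Elman RNN encoder
    rnn_encoder_hidden: int = 256
    rnn_encoder_dropout: float = 0.01
    engine_hidden: int = 256           # GRU inference engine, no dropout
    engine_dropout: float = 0.0
    embedding_dim: int = 32            # value and index embedding layers


text_config = NAPConfig()
# With the VGG-11 image encoder, the learning rate is lowered to 1e-5.
image_config = NAPConfig(learning_rate=1e-5)
```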

Appendix B Adaptive Grammar Sampling

In the text, we introduced a nearest neighbour baseline (KNN) for verifiable parsing. The success of KNN is highly dependent on storing a set of unique samples. With Zipfs, i.i.d. sampling often over-samples from the head of the distribution, resulting in a low count of unique samples and poor performance. To build a strong baseline, we must sample uniques more efficiently.

Input: A probabilistic program grammar, a decay factor, a reward, and the desired size of the dataset.

Output: A dataset of unique samples from the grammar.

procedure AdaptiveSample(grammar, decay factor, reward, dataset size)
     while … do
          if … then
          for … to … do
                get the node at the current index of the trajectory
Algorithm 1 Adaptive Sampling

Further, training the neural inference engine requires sampling a dataset from a PPG. These samples need to cover enough of the grammar to allow the model to learn meaningful representations and, moreover, they again need to be unique. The uniqueness requirement is paramount for Zipfian distributions since otherwise models would be overwhelmed by the most probable samples.

Naively, we can i.i.d. sample a set of unique observations and use it to train NAP. However, again, due to the Zipfian nature, generating unique data points can be expensive as the desired dataset size gets large, because duplicates have to be discarded. To sample efficiently, a simple idea is to pick each decision uniformly (we call this uniform sampling). Although this will generate uniques more often, it has two major issues: (1) it disregards the priors, resulting in very unlikely productions, and (2) it might not be effective as multiple paths can lead to the same production.

Ideally, we would sample in a manner such that we cover all the most likely programs and then smoothly transition into sampling increasingly unlikely programs. This would generate uniques efficiently while also retaining samples that are relatively likely. To address these desiderata, we propose a method called Adaptive Grammar Sampling (Alg. 1) that downweights the probabilities of decisions in proportion to how many times they lead to duplicate productions. We avoid overly punishing nodes early in the decision trace by discounting the downweighting by a decay factor. This method is inspired by Monte-Carlo Tree Search (Chang et al., 2005) and shares similarities with Wang-Landau from statistical physics (Wang and Landau, 2001).
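To make the idea concrete, here is a minimal sketch of adaptive sampling on a toy grammar. It is not a reproduction of Alg. 1: the toy grammar, the function names, and in particular the multiplicative update `1 - reward * decay ** (last - i)` are assumptions chosen to match the prose description (decisions that lead to duplicates are downweighted, and decisions early in the trace are punished less).

```python
import random
from typing import Dict, List, Tuple

# Toy grammar (hypothetical): each decision node maps choices to prior weights.
GRAMMAR: Dict[str, Dict[str, float]] = {
    "loop": {"for": 0.8, "while": 0.15, "none": 0.05},
    "direction": {"down": 0.7, "up": 0.3},
    "print": {"println": 0.9, "print": 0.1},
}


def sample_trajectory(weights, rng) -> List[Tuple[str, str]]:
    """Sample one choice per decision node using the current weights."""
    trajectory = []
    for node, choices in weights.items():
        names = list(choices)
        choice = rng.choices(names, weights=[choices[n] for n in names])[0]
        trajectory.append((node, choice))
    return trajectory


def adaptive_sample(grammar, decay, reward, size, seed=0, max_draws=100_000):
    """Collect `size` unique productions, downweighting duplicate paths."""
    rng = random.Random(seed)
    weights = {node: dict(choices) for node, choices in grammar.items()}
    dataset, seen = [], set()
    for _ in range(max_draws):
        if len(dataset) >= size:
            break
        trajectory = sample_trajectory(weights, rng)
        production = tuple(choice for _, choice in trajectory)
        if production not in seen:
            seen.add(production)
            dataset.append(production)
            continue
        # Duplicate: shrink the weight of every decision on this path.
        # Assumed update rule: later decisions are punished the most, and
        # the penalty decays geometrically towards the start of the trace.
        last = len(trajectory) - 1
        for i, (node, choice) in enumerate(trajectory):
            weights[node][choice] *= 1.0 - reward * decay ** (last - i)
    return dataset


samples = adaptive_sample(GRAMMAR, decay=0.5, reward=0.5, size=10)
```

Because the weights start at the priors, the first samples tend to come from the head of the distribution; as duplicates accumulate, probability mass shifts towards rarer choices.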

(a) Uniqueness and Good-Turing Estimates
(b) Likelihood of Samples over Time
Figure 7: Effectiveness of sampling strategies for Liftoff. Left/Middle: Number of unique programs generated (left) and Good-Turing estimate (middle) as a function of total samples. Right: Likelihood of generated samples over time for various sampling strategies. In particular, we note the effect of reward and decay on the exploration rate. The ideal sampling strategy for Zipfs first samples from the head, then body, and finally the tail.

Properties of Adaptive Sampling

In the main text, we expressed the belief that adaptive grammar sampling increases the likelihood of generating unique samples. To test this hypothesis, we sampled 10k (non-unique) Java programs using the Liftoff PPG and tracked the number of uniques over time. Fig. 7a shows that adaptive sampling has linear growth in the number of unique programs compared to sublinear growth with i.i.d. or uniform sampling. Fig. 7b shows the Good-Turing estimate, a measure of the probability of the next sample being unique: adaptive sampling “converges” to a constant while other sampling methods approach zero. Interestingly, adaptive sampling is customisable. Fig. 7c shows the log probability of the sampled trajectories over time. With a higher reward or a smaller decay rate, adaptive sampling will sample less from the head/body of the Zipf. In contexts where we care about the rate of sample exploration, adaptive sampling provides a tunable algorithm to search a distribution.
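For reference, the Good-Turing estimate of the probability that the next sample is previously unseen is the fraction of samples drawn so far whose value occurred exactly once. A minimal sketch (the function name is ours):

```python
from collections import Counter
from typing import Hashable, Sequence


def good_turing_unseen(samples: Sequence[Hashable]) -> float:
    """Estimate P(next sample is new) as (#values seen exactly once) / N."""
    if not samples:
        raise ValueError("need at least one sample")
    counts = Counter(samples)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(samples)


# "a" occurs three times; "b", "c" and "d" occur once each.
estimate = good_turing_unseen(["a", "a", "b", "c", "a", "d"])
```

A sampler that keeps producing duplicates drives this estimate towards zero, while one that keeps finding new productions holds it at a higher value.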

Appendix C Grammar Descriptions

We provide an overview of the grammars for each domain, covering the important choices.

P8

This PPG contains 52 decisions. The primary innovation in this grammar is the use of a global random variable that represents the ability of the student. This in turn affects the distributions over values for nonterminals later in the trajectory, such as those deciding the loop structure and body. The intuition this captures is that high ability students make very few to no mistakes whereas low ability students tend to make many correlated misunderstandings (e.g. looping and recursion).

CS1: Liftoff

This PPG contains 26 decisions. It first determines whether to use a loop, and, if so, chooses between “for” and “while” loop structures. It then formulates the loop syntax, choosing a condition statement and whether to count up or count down. Finally, it chooses the syntax of the print statements. Notably, each choice is dependent on previous ones. For example, choosing an end value in a for loop is sensibly conditioned on a chosen start value.
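The flavour of such a grammar can be conveyed with a toy probabilistic program. The sketch below is far smaller than the actual 26-decision PPG, and its decision names, probabilities, and emitted code are invented for illustration; it only shows how each sample yields both a production and the labelled decisions that generated it, with later choices conditioned on earlier ones.

```python
import random
from typing import Dict, Tuple


def sample_liftoff(rng: random.Random) -> Tuple[Dict[str, object], str]:
    """Sample one (decisions, program) pair from a toy Liftoff-style grammar."""
    decisions: Dict[str, object] = {}

    # Decision 1: does the student use a loop at all? (probability is made up)
    decisions["use_loop"] = rng.random() < 0.9
    if not decisions["use_loop"]:
        body = " ".join(f"println({i});" for i in range(10, 0, -1))
        return decisions, body + ' println("Liftoff");'

    # Decision 2: loop structure.
    decisions["loop_type"] = rng.choice(["for", "while"])
    # Decision 3: count down or count up.
    decisions["count_down"] = rng.random() < 0.7

    # Later choices depend on earlier ones: the end condition and the step
    # are conditioned on the chosen start value.
    if decisions["count_down"]:
        init, cond, step, body = "int i = 10", "i >= 1", "i--", "println(i);"
    else:
        init, cond, step, body = "int i = 1", "i <= 10", "i++", "println(11 - i);"

    if decisions["loop_type"] == "for":
        code = f"for ({init}; {cond}; {step}) {{ {body} }}"
    else:
        code = f"{init}; while ({cond}) {{ {body} {step}; }}"
    return decisions, code + ' println("Liftoff");'


rng = random.Random(0)
samples = [sample_liftoff(rng) for _ in range(5)]
```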

Powergrading: Short Answer

This PPG contains 53 nodes. Unlike code, grammars over natural language need to explain variance in both semantic meaning and prose. This is not as difficult for short sentences. In designing the grammar, we inspect the first 100 responses to gauge student thinking. Procedurally, the grammar’s first decision is choosing whether the production will be correct or incorrect. It then chooses a subject, verb, and noun. These three choices are dependent on the correctness. Correct answers lead to topics like religion, politics, and economics while incorrect answers are about taxation, exploration, or physical goods. Finally, the grammar chooses a writing style to craft a sentence. To capture variations in tense, we use a conjugator (Python’s mlconjug library) as a functional transformation on the output.


The PyramidSnapshot grammar contains 121 nodes, the first of which decides between 13 “strategies” (e.g. making a parallelogram, right triangle, a brick wall, etc.). Each of the 13 options leads to its own set of nodes that are responsible for deciding shape, location, and colour. Finally, the trajectory of decisions is used to render an image. The first version of the grammar was created by peeking at 200 images. A second version was updated by viewing 50 more.

Appendix D NAP Architecture

Figure 8: Architecture of the neural inference engine. We show a single RNN update that parameterizes the posterior over the next random variable. This procedure is repeated at every step along the length of the trajectory.

Fig. 8 visualises the architecture for the neural inference engine in NAP. The ProductionEncoder network is responsible for transforming unstructured images and text to a fixed vector space representation, using a domain-specific architecture like a CNN for images or an RNN for text. The lexical index of the current random variable is encoded using the OneHotEncoding transformation, and its current value is encoded to a fixed dimension using a layer that is specific to this random variable. To get a posterior distribution over the next random variable, the transformer specific to the next random variable maps from the hidden state of the RNN to a distribution over values of the next random variable.

At train time, the inputs to the autoregressive model at each timestep are the true values of the random variables from the data. We train the model and all encoding/decoding layers end-to-end by backpropagating per-timestep gradients using the cross-entropy loss between the posterior distribution output by the model and the true value taken on by the random variable.

At inference time, we do not have a true value for the current random variable to use in the next timestep, so we sample this value from the posterior produced by the model. This sample is then fed to the next timestep of the model and the process is repeated until the trajectory is completely determined.
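The difference between the two regimes can be sketched without any neural network by using a stand-in posterior function. In the sketch below, `toy_posterior`, the node names, and the probabilities are hypothetical; the real engine also conditions on the encoded student solution, and the set of nodes visited depends on earlier choices, both of which this illustration omits.

```python
import math
import random
from typing import Dict, List, Tuple

Step = Tuple[str, str]  # (random variable name, value)


def toy_posterior(node: str, previous: List[Step]) -> Dict[str, float]:
    """Stand-in for the model: posterior over `node` given earlier steps."""
    if node == "loop_type":
        return {"for": 0.75, "while": 0.25}
    chose_for = ("loop_type", "for") in previous
    return {"down": 0.5, "up": 0.5} if chose_for else {"down": 0.9, "up": 0.1}


def teacher_forced_loss(posterior_fn, true_trajectory: List[Step]) -> float:
    """Train time: sum per-step cross-entropy, feeding TRUE values forward."""
    loss, previous = 0.0, []
    for node, true_value in true_trajectory:
        posterior = posterior_fn(node, previous)
        loss += -math.log(posterior[true_value])
        previous.append((node, true_value))
    return loss


def sample_trajectory(posterior_fn, nodes: List[str], rng) -> List[Step]:
    """Inference time: feed SAMPLED values forward until the end."""
    previous: List[Step] = []
    for node in nodes:
        posterior = posterior_fn(node, previous)
        values = list(posterior)
        value = rng.choices(values, weights=[posterior[v] for v in values])[0]
        previous.append((node, value))
    return previous


loss = teacher_forced_loss(toy_posterior, [("loop_type", "for"), ("direction", "down")])
inferred = sample_trajectory(toy_posterior, ["loop_type", "direction"], random.Random(0))
```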

Appendix E Grading UI

Figure 9: Grading UI based on GG-NAP

We show an image of the user interface used in the field experiment. This is the view a grader (with access to NAP) would see. The real student response is given on the left and the nearest neighbour given by GG-NAP on the right. A differential between the two is provided, inspired by GitHub’s design. On the very right is a set of labels that the grader is responsible for assigning values to.

Appendix F GG-NAP and Viterbi Cost Comparison

Table 2 compares the wall clock cost of Viterbi and GG-NAP on very large PCFGs. We can see significant time savings, up to roughly 700x on the larger P8 grammar.

PCFG | Parser  | # Production Rules | Cost (Sec.)
P1   | Viterbi | 3k                 | 0.79 ± 1.2
P1   | NAP     | 3k                 | 0.17 ± 0.1
P8   | Viterbi | 263k               | 182.8 ± 40.2
P8   | NAP     | 263k               | 0.25 ± 0.2
Table 2: Inference Cost of Viterbi and Neural Parsing

Appendix G Grammar Sample Zoo

In the following, we show many generated samples from the PPGs for Powergrading,, Liftoff, and PyramidSnapshot (in that order).
