Learning to Mine Aligned Code and
Natural Language Pairs from Stack Overflow
For tasks like code synthesis from natural language, code retrieval, and code summarization, data-driven models have shown great promise. However, creating these models requires parallel data between natural language (NL) and code with fine-grained alignments. Stack Overflow (SO) is a promising source for creating such a data set: the questions are diverse, and most of them have corresponding answers with high-quality code snippets. However, existing heuristic methods (e.g., pairing the title of a post with the code in the accepted answer) are limited both in their coverage and in the correctness of the NL-code pairs obtained. In this paper, we propose a novel method to mine high-quality aligned data from SO using two sets of features: hand-crafted features considering the structure of the extracted snippets, and correspondence features obtained by training a probabilistic model to capture the correlation between NL and code using neural networks. These features are fed into a classifier that determines the quality of mined NL-code pairs. Experiments using Python and Java as test beds show that the proposed method greatly expands coverage and accuracy over existing mining methods, even when using only a small number of labeled examples. Further, we find that reasonable results are achieved even when training the classifier on one language and testing on another, showing promise for scaling NL-code mining to a wide variety of programming languages beyond those for which we are able to annotate data.
†† PY and BD contributed equally to this work.
Recent years have witnessed the burgeoning of a new suite of developer assistance tools based on natural language processing (NLP) techniques, for code completion (franks2015cacheca), source code summarization (allamanis2016convolutional), automatic documentation of source code (wong2013autocomment), deobfuscation (raychev2015predicting; vasilescu2017jsnaughty; decompiled-names), cross-language porting (nguyen2013lexical; nguyen2014statistical), code retrieval (wei2015building; allamanis2015bimodal), and even code synthesis from natural language (quirk2015language; desai2016program; locascio2016regex; yin2017acl).
Besides the creativity and diligence of the researchers involved, these recent success stories can be attributed to two properties of software source code. First, it is highly repetitive (gabel2010study; devanbu2015new), therefore predictable in a statistical sense. This statistical predictability enabled researchers to expand from models of source code and natural language (NL) created using hand-crafted rules, which have a long history (miller1981natural), to data-driven models that have proven flexible, relatively easy-to-create, and often more effective than corresponding hand-crafted precursors (hindle2016naturalness; nguyen2014statistical). Second, source code is available in large amounts, thanks to the proliferation of open source software in general, and the popularity of open access, “Big Code” repositories like GitHub and Stack Overflow (SO); these platforms host tens of millions of code repositories and programming-related questions and answers, respectively, and are ripe with data that can, and is, being used to train such models (raychev2015predicting).
However, the statistical models that power many such applications are only as useful as the data they are trained on, i.e., garbage in, garbage out (sheng2008get). For a particular class of applications, such as source code retrieval given a NL query (wei2015building), source code summarization in NL (iyer2016summarizing), and source code synthesis from NL (yin2017acl; rabinovich17syntaxnet), all of which involve correspondence between NL utterances and code, it is essential to have access to high volume, high quality, parallel data, in which NL and source code align closely to each other.
While one can hope to mine such data from Big Code repositories like SO, straightforward mining approaches may also extract quite a bit of noise. We illustrate the challenges associated with mining aligned (parallel) pairs of NL and code from SO with the example of a Python question in Figure 1. Given an NL query (or intent), e.g., “removing duplicates in lists”, and the goal of finding its matching source code snippets among the different answers, prior work used either a straightforward mining approach that simply picks all code blocks that appear in the answers (allamanis2015bimodal), or one that picks all code blocks from answers that are highly ranked or accepted (iyer2016summarizing; wong2013autocomment). (There is at most one accepted answer per question; see the green check symbol in Figure 1.) However, it is not necessarily the case that every code block accurately reflects the intent. Nor is it the case that the entire code block answers the question; some parts may simply describe the context, such as variable definitions (Context 1) or import statements (Context 2), while other parts might be entirely irrelevant (e.g., the latter part of the first code block).
There is an inherent trade-off here between scale and data quality. On the one hand, when mining pairs of NL and code from SO, one could devise filters using features of the SO questions, answers, and the specific programming language (e.g., only consider accepted answers with a single code block or with high vote counts, or filtering out print statements in Python, much like one thrust of prior work (wong2013autocomment; iyer2016summarizing)); fine-tuning heuristics may achieve high pair quality, but this inherently reduces the size of the mined data set and it may also be very language-specific. On the other hand, extracting all available code blocks, much like the other thrust of prior work (allamanis2015bimodal), scales better but adds noise (and still cannot handle cases where the “best” code snippets are smaller than a full code block). Ideally, a mining approach to extract parallel pairs would handle these tricky cases and would operate at scale, extracting many high-quality pairs. To date, none of the prior work approaches satisfies both requirements of high quality and large quantity.
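As a point of reference, the "extract all code blocks" baseline described above is simple to implement. The sketch below is ours, not the prior work's code, and assumes answer bodies in the HTML format used by the SO data dump, where code blocks appear as `<pre><code>...</code></pre>`:

```python
import html
import re

def extract_code_blocks(answer_html):
    """Return the text of every <pre><code>...</code></pre> block in an
    SO answer body -- the 'take every code block' baseline.  HTML
    entities (&lt;, &gt;, &amp;, ...) are unescaped; no filtering or
    quality judgment is applied, which is exactly the source of noise
    discussed in the text."""
    blocks = re.findall(r"<pre[^>]*><code>(.*?)</code></pre>",
                        answer_html, flags=re.DOTALL)
    return [html.unescape(b) for b in blocks]
```

Every block this returns is treated as aligned with the question title, regardless of whether it is context, interpreter output, or an actual implementation.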
In this paper, we propose a novel technique that fills this gap (see Figure LABEL:fig:overview for an overview). Our key idea is to treat the problem as a classification problem: given an NL intent (e.g., the SO question title) and all contiguous code fragments extracted from all answers of that question as candidate matches (for each answer code block, we consider all line-contiguous fragments as candidates, e.g., for a 3-line code block 1-2-3, we consider fragments consisting of lines 1, 2, 3, 1-2, 2-3, and 1-2-3), we use a data-driven classifier to decide if a candidate aligns well with the NL intent. Our model uses two kinds of information to evaluate candidates: (1) structural features, which are hand-crafted but largely language-independent, and try to estimate whether a candidate code fragment is valid syntactically, and (2) correspondence features, automatically learned, which try to estimate whether the NL and code correspond to each other semantically. Specifically, for the latter we use a model inspired by recent developments in neural network models for machine translation (bahdanau2015alignandtranslate), which can calculate bidirectional conditional probabilities of the code given the NL and vice-versa. We evaluate our method on two small labeled data sets of Python and Java code that we created from SO. We show that our approach can extract significantly more, and significantly more accurate code snippets in both languages than previous baseline approaches. We also demonstrate that the classifier is still effective even when trained on Python then used to extract snippets for Java, and vice-versa, which demonstrates potential for generalizability to other programming languages without laborious annotation of correct NL-code pairs.
Our approach strikes a good balance between training effort, scale, and accuracy: the correspondence features can be trained without human intervention on readily available data from SO; the structural features are simple and easy to apply to new programming languages; and the classifier requires minimal amounts of manually labeled data (we only used 152 Python and 102 Java manually-annotated SO question threads in total). Even so, compared to the heuristic techniques from prior work (allamanis2015bimodal; wong2013autocomment; iyer2016summarizing), our approach is able to extract up to an order of magnitude more aligned pairs with no loss in accuracy, or reduce errors by more than half while holding the number of extracted pairs constant.
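To make the feature combination concrete, below is a minimal sketch of how a structural feature (syntactic validity of the candidate) and learned correspondence scores (the bidirectional conditional log-probabilities) might feed a linear scorer. The function names, weights, and logistic form are our illustrative assumptions, not the paper's trained classifier:

```python
import ast
import math

def parses_as_python(fragment):
    """One structural feature: does the candidate parse as Python?
    (Interpreter prompts like '>>>' make a fragment unparseable.)"""
    try:
        ast.parse(fragment)
        return 1.0
    except SyntaxError:
        return 0.0

def pair_score(structural, correspondence, weights, bias=0.0):
    """Illustrative scorer: combine hand-crafted structural features
    with correspondence features (e.g., log P(code|NL), log P(NL|code))
    via a linear model squashed through a sigmoid."""
    z = bias + sum(w * x for w, x in zip(weights, structural + correspondence))
    return 1.0 / (1.0 + math.exp(-z))
```

In the paper the weights are learned from the small labeled data set; here any concrete weight values would be hypothetical.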
Specifically, we make the following contributions:
We propose a novel technique for extracting aligned NL-code pairs from SO posts, based on a classifier that combines snippet structural features, readily extractable, with bidirectional conditional probabilities, estimated using a state-of-the-art neural network model for machine translation.
We propose a protocol and tooling infrastructure for generating labeled training data.
We evaluate our technique on two data sets for Python and Java and discuss performance, potential for generalizability to other languages, and lessons learned.
All annotated data, the code for the annotation interface and the mining algorithm are available at http://conala-corpus.github.io.
2. Problem Setting
Stack Overflow (SO) is the most popular Q&A site for programming related questions, home to millions of users. An example of the SO interface is shown in Figure 1, with a question (in the upper half) and a number of answers by different SO users. Questions can be about anything programming-related, including features of the programming language or best practices. Notably, many questions are of the “how to” variety, i.e., questions that ask how to achieve a particular goal such as “sorting a list”, “merging two dictionaries”, or “removing duplicates in lists” (as shown in the example); for example, around 36% of the Python-tagged questions are in this category, as discussed later in Section LABEL:sec:annotation:dataset. These how-to questions are the type that we focus on in this work, since they are likely to have corresponding snippets and they mimic NL-to-code (or vice versa) queries that users might naturally make in the applications we seek to enable, e.g., code retrieval and synthesis.
Specifically, we focus on extracting triples of three specific elements of the content included in SO posts:
Intent: A description in English of what the questioner wants to do; usually corresponds to some portion of the post title.
Context: A piece of code that does not implement the intent, but is necessary setup, e.g., import statements, variable definitions.
Snippet: A piece of code that actually implements the intent.
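For concreteness, an extracted triple might be represented as follows, using the “removing duplicates in lists” example; the schema and field names are ours, not an artifact of the released corpus:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class AlignedTriple:
    """One mined example: the NL intent, the snippet that implements
    it, and any setup code (imports, variable definitions) needed to
    make the snippet run.  Hypothetical schema for illustration."""
    intent: str
    snippet: str
    context: List[str] = field(default_factory=list)

example = AlignedTriple(
    intent="remove duplicates in lists",
    snippet="t = list(set(t))",
    context=["t = [1, 2, 3, 1, 2, 5, 6, 7, 8]"],
)
```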
An example of these three elements is shown in Figure 1. Several interesting points can be gleaned from this example. First, and most important, we can see that not all snippets in the post implement the original poster’s intent: only two of the four highlighted are actual examples of how to remove duplicates in lists, the other two are context, and others still are examples of interpreter output. If one is to train, e.g., a data-driven system for code synthesis from NL, or code retrieval using NL, only the snippets, or portions of snippets, that actually implement the user intent should be used. Thus, we need a mining approach that can distinguish which segments of code are legitimate implementations, and which can be ignored. Second, we can see that there are often several alternative implementations with different trade-offs (e.g., the first example is simpler in that it does not require additional modules to be imported first). One would like to be able to extract all of these alternatives, e.g., to present them to users in the case of code retrieval (ideally one would also like to present a description of the trade-offs, but mining this information is a challenge beyond the scope of this work) or, in the case of code summarization, to see if any occur in the code one is attempting to summarize.
These aspects are challenging even for human annotators, as we illustrate next.
3. Manual Annotation
To better understand the challenges with automatically mining aligned NL-code snippet pairs from SO posts, we manually annotated a set of labeled NL-code pairs. These also serve as the gold-standard data set for training and evaluation. Here we describe our annotation method and criteria, salient statistics about the data collected, and challenges faced during annotation.
For each target programming language, we first obtained all questions from the official SO data dump (available online at https://archive.org/details/stackexchange) dated March 2017 by filtering questions tagged with that language. We then generated the set of questions to annotate by: (1) including all top-100 questions ranked by view count; and (2) sampling 1,000 questions from the probability distribution generated by their view counts on SO; we chose this method on the assumption that more highly-viewed questions are more important to consider, as we are more likely to come across them in actual applications. While each question may have any number of answers, we chose to annotate only the top-3 highest-scoring answers, to prevent annotators from potentially spending a long time on a single question.
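The view-count-weighted sampling step could be sketched as follows, given question IDs and their view counts. Sampling with replacement and de-duplicating afterwards is our simplifying assumption; the paper does not specify these details:

```python
import random

def sample_questions(question_ids, view_counts, k, seed=0):
    """Draw k questions with probability proportional to view count
    (with replacement), then drop duplicates while preserving draw
    order.  A sketch of the view-weighted sampling described in the
    text, not the authors' exact procedure."""
    rng = random.Random(seed)
    drawn = rng.choices(question_ids, weights=view_counts, k=k)
    seen, unique = set(), []
    for q in drawn:
        if q not in seen:
            seen.add(q)
            unique.append(q)
    return unique
```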
3.1. Annotation Protocol and Interface
Consistently annotating the intent, context, and snippet for a variety of posts is not an easy task, and in order to do so we developed and iteratively refined a web annotation interface and a protocol with detailed annotation criteria and instructions.
The annotation interface allows users to select and label parts of SO posts as (I)ntent, (C)ontext, and (S)nippet using shortcut keys, as well as rewrite the intent to better match the code (e.g., adding variable names from the snippet into the original intent), in consideration of potential future applications that may require more precisely aligned NL-code data; in the following experiments we solely consider the intent and snippet, and reserve examination of the context and re-written intent for future work. Multiple NL-code pairs that are part of the same post can be annotated this way. There is also a “not applicable” button that allows users to skip posts that are not of the “how to” variety, and a “not sure” button, which can be used when the annotator is uncertain.
The annotation criteria were developed by having all authors attempt to annotate sample data, gradually adding notes on the difficult-to-annotate cases to a shared document. We completed several pilot annotations for a sample of Python questions, iteratively discussing the annotation criteria and the difficult-to-annotate cases among the research team after each round, before finalizing the annotation protocol. We repeated the process for Java posts. Once we converged on the final annotation standards in both languages, we discarded all pilot annotations, and one of the authors (a graduate-level NLP researcher and experienced programmer) re-annotated the entire data set according to this protocol.
While we cannot list all difficult cases here for lack of space, below is a representative sample from the Python instructions:
Intents: Annotate the command form when possible (e.g., “how do I merge dictionaries” will be annotated as “merge dictionaries”). Extraneous words such as “in Python” can be ignored. Intents will almost always be in the title of the post, but intents expressed elsewhere that are different from the title can also be annotated.
Context: Contexts are a set of statements that do not directly reflect the annotated intent, but may be necessary in order to get the code to run, and include import statements, variable definitions, and anything else that is necessary to make sure that the code executes. When no context exists in the post this field can be left blank.
Snippet: Try to annotate full lines when possible. Some special tokens such as “>>>”, “print”, and “In[...]” that appear at the beginning of lines due to copy-pasting can be included. When the required code is encapsulated in a function, the function definition can be skipped.
Re-written intent: Be accurate, but make the minimal number of changes to the original intent. Try to reflect all of the free variables in the snippet, to be conducive to future automatic matching of these free variables to their corresponding positions in the code. When referencing string literals or numbers, write them exactly as they appear in the code, and surround variables with grave accents (“`”).
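A simple consistency check over re-written intents, of the kind such annotation tooling might apply, could look like the following. The helper is hypothetical (it is not part of the released interface), and we use the ASCII backtick for the grave accent:

```python
import re

def unmatched_variables(rewritten_intent, snippet):
    """Return variables that the annotator surrounded with grave
    accents in the re-written intent but that do not appear verbatim
    in the snippet -- a sanity check on intent/snippet alignment."""
    variables = re.findall(r"`([^`]+)`", rewritten_intent)
    return [v for v in variables if v not in snippet]
```

An empty result means every backticked variable in the intent was found in the code, as the protocol intends.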
| Lang. | #Annot. | #Ques. | #Answer Posts | #Code Blocks | Avg. Code Length | %Full Blocks | %Annot. with Context |