Deep Learning to Detect Redundant Method Comments
Comments in software are critical for maintenance and reuse. But apart from prescriptive advice, there is little practical support or quantitative understanding of what makes a comment useful. In this paper, we introduce the task of identifying comments which are uninformative about the code they are meant to document. To address this problem, we introduce the notion of comment entailment from code, high entailment indicating that a comment’s natural language semantics can be inferred directly from the code. Although not all entailed comments are low quality, comments that are too easily inferred, for example, comments that restate the code, are widely discouraged by authorities on software style. Based on this, we develop a tool called Craic which scores method-level comments for redundancy. Highly redundant comments can then be expanded or alternately removed by the developer. Craic uses deep language models to exploit large software corpora without requiring expensive manual annotations of entailment. We show that Craic can perform the comment entailment task with good agreement with human judgements. Our findings also have implications for documentation tools. For example, we find that common tags in Javadoc are at least two times more predictable from code than non-Javadoc sentences, suggesting that Javadoc tags are less informative than more free-form comments.
Reading code is central to software maintenance. Studies have suggested that programmers spend as much or more time reading and browsing code as actually writing it (swebok:2004; latoza06maintaining; ko06exploratory). Naturally, developers are advised to write code so that it is easier to read later, perhaps most famously by Knuth: “Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do.” (knuth1984literate). But it is not easy to write code that is easily read: Code cannot always be made self-explanatory using descriptive names, and in any case, code cannot explain why the current approach was taken and others were not (raskin). In such cases, comments are an important addition to enable understanding code. While comments play many roles (Pascarella:2017; Jml; Cheon02aruntime), an important role is to explain and clarify code. Such comments have been called purpose, or explanatory comments (Pascarella:2017), and are the focus of our work.
But not all explanatory comments are equally useful, and
writing useful comments requires experience and judgement.
For example, developers are advised by no less an authority
than SWEBOK that “some comments
are good, some are not” (swebok:2004).
Advice from Google, as part of a carefully curated series of articles
that is distributed to all Google engineers,
explicitly advises developers that
“comments are not always good” and specifically to
“avoid comments that just repeat what the code does”.
To an academic software engineering researcher, the expert advice that we have cited may seem counterintuitive, perhaps even self-contradictory: If explanatory comments exist in order to explain the code, isn’t it necessary that, at least to some extent, they repeat what the code does? It seems difficult to reconcile this natural line of reasoning against that seemingly contradictory advice above from authorities on software development.
To resolve this dilemma, consider two examples of real-world Java methods. In 1, the comment contains nothing more than two identical restatements of the method signature. The code would contain exactly the same information if the comment were deleted entirely. The same is true of the Javadoc documentation — there is no situation in Java documentation or development in which a method comment is visible but not the method signature. Contrast this comment with the one in 2. While the first and final sentences are simple restatements of the signature, the other two sentences explain the effect of the parameter setting in more detail. We argue that 1 is the type of explanatory comment that is “just restating” that the authorities intend to discourage. But in our corpus of popular open-source Java projects on Github, we find that such redundant comments are prevalent (Section 6). Based on our own experience, we suggest a few hypotheses as to why such comments are discouraged. First, they clutter a codebase, making it difficult for a reader to find important logic. Second, they can pose a maintenance burden as the code changes (spinellis10). Finally, and insidiously, they can trick a programmer into thinking that the code is well explained, when in fact design rationales and deeper facts about the code, such as the ones in 2, are missing.
In this paper, we introduce a machine learning (ML) framework to help developers write better, more informative comments. Our framework aims to identify which explanatory comments provide the most information about the code, by exploiting the fundamental insight from information theory that sentences that are highly predictable provide little new information. In the extreme case, comments that are too easily inferred from the code might trivially restate what the code does, or even trivially restate the method signature, as in 1. By highlighting explanatory comments that are easily inferred from the code, we hope to encourage developers to write better comments. A developer who sees all her comment sentences marked as easily inferred, e.g. during a code review, might be motivated to write in more detail — for example, to write good explanatory comments that cover design decisions or that “explain why the program is being written, and the rationale for choosing this or that method” (raskin).
More generally, we introduce a new research problem which we call comment entailment, named after the well-studied problem of textual entailment (2013Dagan) in the Natural Language Processing (NLP) literature. The comment entailment problem is to determine which sentences in a comment logically follow from the information in the code. This is a general problem which we hope will have many uses within software engineering; we suggest that the categorization of comments as entailed versus non-entailed, essentially “does the comment describe the content of the code,” is a fundamental axis along which to characterize comments. Many entailed comments are high quality; for example, summary comments, which briefly explain the purpose of a method or class, are often seen as desirable (mcconnell2004code). But comments that are too easily inferred, such as those that are “word-for-word” restatements of the code, as in 1, do not provide any additional explanatory power and are discouraged by the authorities cited above.
The entailment problem is technically challenging because it requires a bimodal software analysis that considers both the source code and the natural language comment simultaneously. It is perhaps for this reason that there is almost no support in popular development environments to help developers write better explanatory comments. We exploit recent advances in deep learning in NLP to develop a model that scores the level of redundancy in the comment. We present a first approach to the comment entailment problem based on applying deep sequence-to-sequence (seq2seq) learning (sutskever14; bahdanau2014neural). Comment sentences that have highest average probability conditioned on the code are those that we deem as potentially low quality, under the rationale that they are easily inferred.
We incorporate these models into Craic
Our main contributions are: (a) We introduce the comment entailment problem (Section 3); (b) We introduce a first approach for this problem based on sequence-to-sequence learning (Section 4); (c) We show that sequence-to-sequence methods are effective at predicting comments, as measured by perplexity, i.e. they are capable of using code to improve comment prediction, compared to a unimodal language model trained only on comments (Section 6.3); (d) We present evidence that Craic effectively identifies redundant comment sentences, correlating strongly with human judgements of entailment (Section 6.5), based on a new data set of code-comment pairs that we have collected; (e) Finally, we explore how our framework can be used to make recommendations about the design of documentation tools. We examine the hypothesis that in Java, some kinds of Javadoc comments, such as in 1, are often uninformative. We indeed find that the predictability of sentences in common Javadoc fields is more than two times higher than non-Javadoc sentences (Section 7). Our data, trained models and software are available at [link removed for blind submission].
2. Related work
Software is bimodal: it combines an algorithmic channel, that targets devices, and a natural language channel, comprising comments and identifiers, aimed at developers. With the early and notable exception of literate programming (knuth1992literate), most research has focused on one of the two channels in isolation. As Section 3 makes clear, Craic targets the bimodal problem of deciding whether a method entails a comment.
Our survey of relevant work in the software engineering community begins by discussing related work focused solely on software’s natural language channel, then moves to more closely related bimodal work. Bimodal analysis is also motivated by a growing line of work that applies data mining and ML techniques to software repositories, especially work on language models for code, which we review in more detail.
NL channel (unimodal): Researchers have developed unimodal methods to study both comments and names in software. For example, Binkley and colleagues have measured the comprehensibility of identifier names (binkley2011improving). Researchers have customised part of speech tagging for tokenised identifier names and have mined semantically related word pairs, by mapping the main action verb of a function’s header comment to the main action verb in its signature (gupta2013part). Also relevant to our work is a classification created by Pascarella:2017 for Java comments. This work annotates a corpus of Java comments, placing comments into categories such as those explaining the purpose, those informing how to use the code, providing metadata, license information, or todo signals. The aim is to create a comprehensive classification of the role of comments in software. This work also develops a Naive Bayes classifier which predicts the category for a comment. In Craic, we seek to identify redundant comments for removal to improve the readability of a codebase. In Section 6.4, we examine how our predictions relate to the categories identified by Pascarella:2017.
NL+code channels (bimodal): There has been some work on developing bimodal methods for the NL and code channels of software. khamis10automatic studied both the quality of the NL comments, and the consistency between code and the comments to assess quality of comments. steidl13quality classified comments using machine learning using four features: consistency of the comment, coherence of code and comments, completeness of comments at focal points in the code, and usefulness of the comment in describing the code. ibrahim2012relationship studied how to identify code changes that trigger comment changes (ibrahim2012relationship). Fluri et al. used lexical similarity and heuristics to connect comments to code (fluri2007code). CloCom extracts commented code from a codebase, then finds uncommented clones using detection as a black box (wong2015clocom).
Tan and coauthors have presented methods for automatically producing code annotations from comments (icomment; tan2011acomment; tcomment). First, iComment (icomment) looks at comments to extract rules that should govern the code and then verifies whether the code accompanying the comment obeys the rules. aComment extracts assertion macros from code, and assertional phrases from comments, and combines them (tan2011acomment). tcomment analyses Javadoc comments to infer properties of the method accompanying the comment. It then generates random test cases for the code to identify inconsistencies between the comment and the code. Our work on Craic is complementary to aComment and iComment. Indeed, the sentences that are entailed from the code are, in many cases, likely to be explanatory sentences rather than sentences that make assertions as considered in the previous work, in other words, precisely those sentences that were labelled as “unexploitable” by padioleau09listening; part of our goal in Craic is to guide developers to make those explanatory, unexploitable sentences better.
Software traceability (see (cleland2012software) for some recent work) is also an inherently multimodal problem, but in a different way than the bimodal problems we consider here, as names, comments, and code are embedded within the same files, rather than requiring inference of cross-document links as in traceability. miltos-bimodal also consider a different type of bimodal problem in software, presenting a model for code search from natural language queries.
Closely related to our work is movshovitz13natural who develop topic models that jointly model comments and code; however, this work focused only on autosuggestion rather than comment entailment. Also, sequence-to-sequence models have been applied within software engineering to the API mining problem (Gu2016deepapi). But this work also did not consider the comment entailment problem. Instead it predicts code based on natural language, rather than comments based on code. More recently, LinWPVZE2017:TR apply sequence-to-sequence learning to program synthesis from natural language, again an inverse of the comment entailment problem that we propose. Finally, interesting recent work (oda15; fudaba15) generates pseudo-code, which can be viewed as a type of comment, from code using machine translation. Unfortunately, the pseudo-code that can currently be generated by these methods seems to be a relatively literal transcription of the code, line-by-line.
Language Models for code (unimodal): hindle12naturalness were the first to apply $n$-gram language models (LMs) to source code. allamanis13mining continued in this line by presenting the first source code LM trained on over a billion tokens and demonstrating that predicting identifier names causes the most difficulty to current LMs for code. LMs for code have been applied widely, to discover syntax errors (campbell14), to learn coding conventions (naturalize), and within cross-language porting tools (nguyen13lexical). Many language models that are specifically adapted to code have also been proposed (nguyen13statistical; maddison2014; hellendoorn17deep; raychev16; bielik16; DBLP:journals/corr/AmodioCR17). Recent work has also applied deep language models for code in a unimodal fashion. Feedforward neural network language models, simpler than the recurrent models applied here, have been applied to code by neural-naturalize. Deep language models, such as RNNs and LSTMs, have been presented for the unimodal setting of code by several authors (white15deep; dam16lstm).
3. Problem Definition
As a way of attacking the problem of identifying redundant comments, in this paper we define and address a task which we call comment entailment. Intuitively, the comment entailment problem is to identify whether a snippet of code logically implies the statements made by a natural language comment. Identifying entailment is easier than directly identifying redundant uninformative comments because entailment specifies how to exploit code to decide whether a comment is uninformative. Our definition will not be entirely formal, because, although the semantics of code can be described formally, the full semantics of natural language is beyond the reach of current attempts at logical formalization.
The comment entailment problem considers as input a snippet of source code $C$, such as a block, method, or class, and a natural language sentence $S$ from a comment that is associated in the code with $C$. For brevity we will call $(C, S)$ a code-comment pair. For example, $C$ might be a single method in Java, and $S$ a sentence from the method-initial Javadoc comment. For the purpose of the comment entailment problem, we assume that we know in advance that $S$ is intended to comment on $C$; in many cases, such as class- and method-level comments, detecting which comments are intended to describe which regions of code can be performed accurately using simple heuristics. The entailment problem is defined at the level of a sentence in a comment, rather than the comment as a whole, because comments vary considerably in length. There are many examples, such as 3, where a longer comment will contain both some sentences that are entailed and some that are not. Therefore, a sentence-level notion seems more useful.
The comment sentence $S$ is entailed by the code snippet $C$, which we denote $C \models S$, if the content of the text $S$ can be semantically inferred by a reader solely from information internal to $C$. Here “semantically inferred” means that a developer can verify that the sentence $S$ is true, based solely on the method $C$. This definition depends on what information is known by the reader: for example, an expert programmer may understand many details about the project, the language, the standard library, and so on, that allow her to verify comment sentences that a novice programmer cannot. This dependence cannot be fully removed, because whether a piece of writing is clear always depends on its audience. However, we claim that there is a core knowledge shared by professional programmers of a language that allows them to judge consistently whether a comment correctly describes a method. We provide evidence for this claim by measuring interannotator agreement in Section 6.5.
3 shows an example taken from the Android project that has a comment containing three sentences: on line 2 (Sentence $S_1$), line 3 ($S_2$), and line 4 ($S_3$). By definition, this example contains three code-comment pairs. Each pair represents a different entailment relation. Sentence $S_1$ is completely entailed by the method, as it is simply a restatement of the method name. $S_2$, on the other hand, depends on the semantics of the prefs.getString method and hence is not directly entailed by the method $C$. Finally, $S_3$ is partially entailed: the empty result assertion is not immediate.
Instead of a logical yes/no definition of entailment, we suggest a more flexible notion of an entailment score $e(C, S)$, which is a real number that measures the degree to which the code $C$ entails the comment sentence $S$. We will take the convention that lower scores indicate a higher degree of entailment. This allows us to produce a ranked list of comment sentences as more or less entailed.
It is important to clarify the implications of comment entailment. We are not claiming that entailed comments are bad, nor are we claiming that non-entailed comments are good. To the contrary, both of these incorrect statements have clear counterexamples. A summary comment that briefly explains the algorithm in a large method is an entailed comment that is often considered good (mcconnell2004code). Conversely, a completely unrelated comment, such as a comment from the Linux kernel pasted above the function in 3, is a non-entailed comment which is clearly bad. Instead, we make two claims. The first is that entailed versus non-entailed is a conceptually useful distinction. For example, good entailed comments (among other roles) describe the code at a higher level of abstraction, as recommended by mcconnell2004code, and good non-entailed comments (among other roles) can explain rationale, as recommended by raskin. Secondly, entailed comments that are too easily inferred are not useful, and should be discouraged. At the very least, if all comments in a file are easily inferred, then the comments are likely to be missing important information such as design rationales.
4. Deep Learning Comment Entailment
One approach to comment entailment would be to use supervised learning, such as text classification, in which we train a machine learning model
to predict a binary variable indicating the presence of entailment
directly from a code-comment pair.
But such an approach requires large amounts of labelled training data, in which
programmers have annotated code-comment pairs as to whether an entailment exists,
which is time-consuming and expensive to produce.
Instead, we avoid this problem by applying machine learning
in an indirect way, which does
not require explicitly labeled examples of whether
an entailment relationship exists for training.
Our approach is based on language modelling. An overview of our approach can be seen in Figure 1. It consists of two stages. First, we train recurrent neural network language models to generate comments based on code (Section 4.1), by which we mean that they define a probability distribution over comment sentences given code. We use deep learning methods because they are currently the most effective language models for natural language text like comments. Second, once we have such a model, we can use the probability values to measure the predictability of the comment sentence. Comments that are too easy to predict by the model (conditioned on the code) are likely to be easily inferred by developers as well, and hence less informative. We use the probability to define a numerical score called perplexity (Section 4.2). Comment sentences with low perplexity are most easily predictable. Note that for this approach, we only need a collection of code snippets paired with comments written for them, which is readily available from open source code bases, without requiring us to collect large amounts of explicit annotations of entailment decisions. We will show that despite this lack of explicit supervision, these scores correlate with human judgements (Section 6.5).
4.1. Deep Sequence-to-Sequence Learning
Now we describe the sequence-to-sequence learning framework that underlies our method. First, a language model is a probability distribution over strings. Using the chain rule of probability, we can write the probability of a sequence of words $w_1, \ldots, w_N$ as

$$P(w_1, \ldots, w_N) = \prod_{i=1}^{N} P(w_i \mid w_1, \ldots, w_{i-1}).$$

Different language models approximate in different ways the individual terms in this product. The earliest types of language models that were applied to source code were $n$-gram models, which make the Markov assumption that the previous $n-1$ tokens of context are sufficient for predicting each word. Standard $n$-gram models do not perform as well as the deep language models that we describe next, although for code, extensions based on cache language models (localness-software; hellendoorn2017fse) are considerably better, and competitive with deep models. Nevertheless, even these cache models are not easily applied to the sequence-to-sequence learning setting that we require here, so we do not consider them further.
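To make the chain-rule factorization and the Markov assumption concrete, the following is a minimal sketch of a bigram ($n = 2$) model. The toy corpus, the `<s>` start marker, the add-one smoothing and all function names are illustrative assumptions, not part of our system.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Count contexts and bigrams over tokenized sentences (with a start marker)."""
    contexts, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens
        for prev, cur in zip(padded, padded[1:]):
            contexts[prev] += 1
            bigrams[(prev, cur)] += 1
    return contexts, bigrams

def log2_prob(tokens, contexts, bigrams, vocab_size):
    """Chain-rule log-probability where each term conditions on one previous token.

    Add-one smoothing (an assumption of this sketch) avoids zero probabilities.
    """
    padded = ["<s>"] + tokens
    total = 0.0
    for prev, cur in zip(padded, padded[1:]):
        p = (bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size)
        total += math.log2(p)
    return total

corpus = [
    ["returns", "the", "name"],
    ["returns", "the", "value"],
    ["sets", "the", "name"],
]
contexts, bigrams = train_bigram(corpus)
vocab = {tok for sent in corpus for tok in sent}

# A word order seen in training scores higher than a scrambled one.
seen = log2_prob(["returns", "the", "name"], contexts, bigrams, len(vocab))
scrambled = log2_prob(["name", "the", "returns"], contexts, bigrams, len(vocab))
```

Summing log-probabilities instead of multiplying probabilities is the usual way to evaluate the product without numerical underflow.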
Deep language models remove the Markov assumption, achieving better performance than traditional $n$-gram language models. The current state of the art in natural language (mellis17lstm) are language models based on a type of recurrent neural network (RNN) called the long short-term memory network (LSTM) (lstm). More details on the LSTM can be found in Goodfellow:book, but at a high level, an LSTM computes a hidden state vector $h_i \in \mathbb{R}^d$ that corresponds to every word $w_i$ and that summarizes the information from words $w_1, \ldots, w_i$ in the sequence. The size $d$ of the hidden layer is a parameter that we set during development. An LSTM language model computes the probability of a sequence based on two neural networks $f$ and $g$, where $f$ maps the previous hidden state and the current word to the next hidden state in $\mathbb{R}^d$, $g$ maps a hidden state to a probability distribution over $V$ words, and $V$ is the number of words in the vocabulary. The probability of a sequence is computed as

$$h_i = f(h_{i-1}, w_i) \qquad (1)$$

$$P(w_i \mid w_1, \ldots, w_{i-1}) = g(h_{i-1})_{w_i} \qquad (2)$$

where $h_0$ is the initial state and $g(h)_{w}$ denotes the probability that $g(h)$ assigns to the word $w$.
Here the function $f$ is an “LSTM cell”, which is essentially a neural network that computes the next hidden state from the previous ones, and includes several so-called “gates”, such as the forget gate and the output gate. The function $g$ is a feedforward neural network that computes a distribution over output words given the current value of the hidden state.
The language models we just described are unimodal and trained on strings from one language. But often, we wish to predict one sequence from another one; for example, given a sentence written in French, we might wish to translate it into English. Sequence-to-sequence models are built to learn such mappings between sequences in two languages (sutskever14; bahdanau2014neural), and are currently the state of the art for machine translation. They work as follows. Suppose we want to predict an output sequence $y_1, \ldots, y_M$ given an input sequence $x_1, \ldots, x_L$. First we run an LSTM on $x_1, \ldots, x_L$ to compute a final hidden state $h_L$ by iterating (1). Then, to define a distribution $P(y_1, \ldots, y_M \mid x_1, \ldots, x_L)$, we use a second LSTM, again following equations (1) and (2), where the initial state of the second LSTM is $h_L$. By reusing $h_L$ as the initial state in this way, we train the first LSTM so as to summarize the information from the first sequence that is relevant to the second. The parameters of both of the LSTMs are jointly trained by gradient descent to maximize $P(y_1, \ldots, y_M \mid x_1, \ldots, x_L)$.
It is this sequence-to-sequence learning model which is useful for generating comments $S$ conditioned on code $C$, as we explain in the next section.
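The encoder-decoder computation described above can be sketched with NumPy as follows. This is only an illustration under stated assumptions: the parameters are random and untrained, the gate layout, embedding lookup and all names (`init_lstm`, `lstm_step`, `conditional_log_prob`) are hypothetical, and the decoder predicts its first word directly from the encoder state without a start symbol. It is not the TensorFlow implementation used in our experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_lstm(vocab_size, hidden):
    """Random (untrained) parameters for one LSTM plus an output layer."""
    return {
        "embed": rng.normal(0, 0.1, (vocab_size, hidden)),
        # One matrix yields the input, forget and output gates and the candidate.
        "W": rng.normal(0, 0.1, (2 * hidden, 4 * hidden)),
        "b": np.zeros(4 * hidden),
        "out": rng.normal(0, 0.1, (hidden, vocab_size)),
    }

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(params, state, token):
    """The cell: previous (hidden, cell) state and current token -> next state."""
    h, c = state
    z = np.concatenate([h, params["embed"][token]]) @ params["W"] + params["b"]
    i, f, o, g = np.split(z, 4)
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)
    return h, c

def next_word_dist(params, state):
    """The output network: hidden state -> softmax distribution over the vocabulary."""
    logits = state[0] @ params["out"]
    e = np.exp(logits - logits.max())
    return e / e.sum()

def conditional_log_prob(encoder, decoder, source, target, hidden):
    """log P(target | source): encode the source, then decode from its final state."""
    state = (np.zeros(hidden), np.zeros(hidden))
    for tok in source:          # the first LSTM reads the input sequence
        state = lstm_step(encoder, state, tok)
    total = 0.0
    for tok in target:          # the second LSTM starts from the encoder's state
        total += np.log(next_word_dist(decoder, state)[tok])
        state = lstm_step(decoder, state, tok)
    return total

HIDDEN, CODE_VOCAB, COMMENT_VOCAB = 8, 20, 15
encoder = init_lstm(CODE_VOCAB, HIDDEN)
decoder = init_lstm(COMMENT_VOCAB, HIDDEN)
# Token ids are arbitrary placeholders for a code sequence and a comment sentence.
score = conditional_log_prob(encoder, decoder, [3, 7, 1], [2, 5], HIDDEN)
```

In a trained model, both parameter sets would be fitted jointly so that this conditional log-probability is high for the observed pairs.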
4.2. Craic: Deep Sequence Models for Comment Entailment
Now we can describe how LSTM language models and sequence-to-sequence learning can be applied to develop a method for the comment entailment problem. In both cases, the entailment score is a measure of predictability or probability of a comment sentence under the respective language model.
In a unimodal language model, we can obtain the predictability $P(S)$ of a comment sentence $S$ irrespective of the code it is attached to. This probability measures how easy it is for the language model to predict the comment. Note that this model ignores the code snippet $C$. A comment will have high probability under this model if it matches frequent word sequences seen in the training corpus. In contrast, a sequence-to-sequence model learns a distribution $P(S \mid C)$ for a code-comment pair $(C, S)$. Therefore, the comments that are assigned high probability from the sequence-to-sequence model are those that are easy to predict based on the text of $C$. We show in the later sections that utilizing the code $C$ results in a better prediction of the comment $S$.
In fact, our tool Craic uses perplexity instead of probability (low perplexity corresponds to high probability) as the entailment score. Given a test corpus of $N$ tokens, $w_1, \ldots, w_N$, the perplexity is

$$\mathrm{PP}(w_1, \ldots, w_N) = P(w_1, \ldots, w_N)^{-1/N} = 2^{-\frac{1}{N} \sum_{i=1}^{N} \log_2 P(w_i \mid w_1, \ldots, w_{i-1})},$$

where, for the sequence-to-sequence model, each probability is additionally conditioned on the code.
Perplexity is the inverse of the probability of the text under the model, normalized for the number of words in the text. Sentences whose perplexity is sufficiently low are those easily predicted by the model, and hence are likely to be easily inferred from the code. So the output of the current version of the Craic tool is a ranked list of sentences from the comments, lowest perplexity first. The developer can then review those sentences, e.g. during a code review, and consider revising them. For example, prompted by Craic, the developer could decide to add more design rationale, or to rewrite the sentence to make it a more useful summary. In some cases, the appropriate choice may be to remove the redundant comment altogether; authorities on coding style are consistent that more comments are not always better (mcconnell2004code; swebok:2004).
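As a small worked illustration of the scoring and ranking step, the sketch below computes perplexity from per-token probabilities and sorts sentences with the lowest perplexity first. The function names and the per-token probabilities are made up for illustration; in practice the probabilities would come from the trained model.

```python
import math

def perplexity(token_probs):
    """Perplexity from per-token conditional probabilities P(w_i | context)."""
    n = len(token_probs)
    avg_log2 = sum(math.log2(p) for p in token_probs) / n
    return 2 ** (-avg_log2)

def rank_by_entailment(scored_sentences):
    """Sort (sentence, token_probs) pairs, lowest perplexity (most predictable) first."""
    return sorted(
        ((sentence, perplexity(probs)) for sentence, probs in scored_sentences),
        key=lambda pair: pair[1],
    )

# Made-up per-token probabilities standing in for model output.
example = [
    ("Falls back to the cached value when the server is unreachable.", [0.1] * 12),
    ("Gets the name.", [0.5, 0.5, 0.5, 0.5]),
]
ranking = rank_by_entailment(example)
```

Because the score is normalized by the number of tokens, a long sentence is not penalized merely for being long: only its average per-token predictability matters.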
Our entailment corpus is a large collection of Java methods paired with comments. We focus on method-level comments only so that we can draw on easily identifiable data for training our models, but our framework can be extended to other types of comments. We start with a large collection of Java projects, the GitHub Java Corpus (githubCorpus2013), containing 14,785 projects. This corpus also contains project popularity ratings compiled from the number of forks and watchers. We use these ratings for creating test data representative of a variety of projects. For easy availability of both code and ratings, we use this snapshot of Github for all our experiments.
From these projects, we identify those comments describing a method as a whole and immediately preceding the method. We remove all other multi-line and inline comments within a method. The resulting corpus is a collection of pairs of method and full comment texts. At this point, the comment is a span of text which may contain more than one sentence. We call the span a full comment to distinguish it from the single sentences (comment sentence) that our models use. We ignore methods that do not have comments.
This code-full comment corpus contains over 3M pairs. We preprocess this corpus in a few ways. For code, we use a lexer to tokenize the method. For the comments, we use the Stanford CoreNLP toolkit (corenlp) to tokenize the text. In both methods and comments, we subtokenize any camelCase names into separate words.
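The camelCase subtokenization step might look like the following sketch. The regular expression and the function names are our illustrative assumptions; in particular, how acronyms and digits are split is a choice made for this example and is not specified above.

```python
import re

# Acronym runs (HTTP), capitalized or lowercase words (Name, get), and digit runs.
_SUBTOKEN = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+")

def subtokenize(token):
    """Split one camelCase token into its words; other tokens pass through."""
    parts = _SUBTOKEN.findall(token)
    return parts if parts else [token]

def subtokenize_all(tokens):
    """Apply subtokenization to a whole token sequence, preserving order."""
    return [sub for tok in tokens for sub in subtokenize(tok)]

method_tokens = subtokenize_all(["prefs", ".", "getString", "("])
```

Splitting names this way lets words such as `get` and `String` be shared between the code vocabulary and the comment vocabulary.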
The resulting methods and comments can vary widely in length. Table 1 shows how the token counts of methods and comments are distributed in our corpus. The average length of a method is 75 tokens, three times longer than the average length of a full comment. It is also apparent that the distribution over method lengths in the corpus is highly skewed, with the mean length being not only larger than the median but larger even than the 3rd quartile.
We also segment the comment text into sentences, using the CoreNLP toolkit, together with additional heuristics. For example, fields in Javadocs such as @param and the accompanying parameter description are treated as a single sentence. Once we have comment sentences, each comment sentence is paired with the method individually, resulting in a collection of method-comment sentence pairs which are used in all our models.
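The sketch below illustrates the kind of heuristic described here: each Javadoc tag together with its description is kept as one unit, and the remaining free text is split into sentences. A naive punctuation-based splitter stands in for CoreNLP, and the function names and example comment are hypothetical.

```python
import re

_TAG_START = re.compile(r"^@\w+")
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")  # naive stand-in for CoreNLP

def clean_lines(comment):
    """Strip the /** ... */ delimiters and leading asterisks from a Javadoc comment."""
    body = comment.strip()
    body = re.sub(r"^/\*+", "", body)
    body = re.sub(r"\*+/$", "", body)
    lines = [re.sub(r"^\s*\*\s?", "", line).strip() for line in body.splitlines()]
    return [line for line in lines if line]

def split_comment(comment):
    """Return free-text sentences followed by one unit per Javadoc tag."""
    free_text, tags = [], []
    for line in clean_lines(comment):
        if _TAG_START.match(line):
            tags.append(line)           # a new tag starts a new unit
        elif tags:
            tags[-1] += " " + line      # continuation line of the current tag
        else:
            free_text.append(line)
    sentences = [s for s in _SENTENCE_END.split(" ".join(free_text)) if s]
    return sentences + tags

example = """/**
 * Returns the stored name. Falls back to the
 * default when no name is set.
 * @param key the preference key
 * @return the name, or an empty
 *         string
 */"""
units = split_comment(example)
```

Each element of `units` would then be paired with the method to form one method-comment sentence pair.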
We randomly draw 3M method-comment sentence pairs for training, 5000 for a validation set, and 5000 for testing.
Here we describe how we represented the methods and comments to input into our models, model implementation, and the resulting performance. We also examine how our best model’s judgements correlate with human assessments of comment entailment, and with categories of comments proposed by prior work.
6.1. Input Representation
As described in the previous section, our models are trained with pairs, each containing a method and a comment sentence. In contrast to comment sentences in the pair, method text can be arbitrarily long. This variation over method length makes training sequence-to-sequence models rather difficult. Hence we developed three ways to compress a method to a fixed maximum number of tokens.
a) signature: In some cases, the method signature alone is sufficient to determine entailment. So this representation retains only the method signature and ignores the rest of the method body. When the signature is longer than the token limit, it is truncated.
b) begin-end: Here the method is represented by a number of tokens equal to the limit, half taken from the start of the method and the other half from the end. By sampling tokens from both ends, this representation makes more use of the method body, such as the return statement (if any), than the signature-based compression.
c) identifier-based: This representation first preserves the method signature, and then retains a subsequence of the method body comprising only salient identifier names. We limit the overall sequence to the same maximum number of tokens. While the sequence is shorter than the limit, we incrementally add braces and names to it based on precedence. This precedence is over braces and names as follows: braces, locals, globals, user-defined types, externally defined methods, locally defined methods, and formals. This order heuristically captures salience. We define locals > globals, because they name internal computations of the method; we define externally defined methods > locally defined methods, since external method names are often semantically significant. Subject to the token limit, we exhaust the names in a higher salience category before moving to a lower category. Within a category, we add names in their order of appearance in the code. Because of the token limit, braces may be empty or unbalanced, but this happens rarely in practice. Braces surface identifier nesting to our model.
For all three compression methods, we use the same token limit. Comments are likewise truncated, to 50 tokens. We choose these limits as a compromise between information and computational efficiency.
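The two simpler compressions, signature and begin-end, amount to list slicing. The sketch below is illustrative only: the function names are hypothetical, the limit passed in the usage lines is an arbitrary example value, and giving the extra token to the tail when the limit is odd is our assumption.

```python
def compress_signature(signature_tokens, limit):
    """Keep only the signature, truncated to `limit` tokens."""
    return signature_tokens[:limit]


def compress_begin_end(method_tokens, limit):
    """Keep `limit` tokens: half from the start, half from the end."""
    if len(method_tokens) <= limit:
        return list(method_tokens)
    head = limit // 2
    tail = limit - head  # assumption: the tail gets the odd token
    return method_tokens[:head] + method_tokens[len(method_tokens) - tail:]


tokens = list(range(10))              # stand-in for a tokenized method
print(compress_begin_end(tokens, 4))  # [0, 1, 8, 9]
print(compress_signature(["void", "run", "(", ")"], 2))
```

Because the tail is always retained, a trailing return statement survives compression whenever the method is longer than the limit.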
6.2. Model Details
During development, we examined the performance of our models with different numbers of hidden units and different depths. Our best configuration, a single hidden layer with 2048 units for the language model and 512 units for the sequence model, was used for the final training. We used a vocabulary size of 25000 for both our models, on both the method side and the comment side. The initial learning rate was set to 0.5, and we used a decay factor of 0.96, which was applied when the validation perplexity did not improve over an epoch. We used a batch size of 64 and a dropout probability of 0.65. For the language model, we truncated backpropagation at 30 steps and used the final states of the previous batch to initialize the start state of the next batch. We used gradient descent to optimize the models and clipped the gradients at 5.0. We implemented the models in TensorFlow.
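For reference, the reported hyperparameters and the decay rule can be collected in a small framework-independent sketch. The class and field names are hypothetical, and comparing each epoch with the immediately preceding one (rather than with the best perplexity so far) is an assumption; the text only says the decay applies when validation perplexity does not improve over an epoch.

```python
from dataclasses import dataclass


@dataclass
class TrainConfig:
    hidden_layers: int = 1
    lm_units: int = 2048      # language model
    s2s_units: int = 512      # sequence model
    vocab_size: int = 25000
    initial_lr: float = 0.5
    lr_decay: float = 0.96
    batch_size: int = 64
    dropout: float = 0.65
    bptt_steps: int = 30      # truncated backpropagation (language model)
    grad_clip: float = 5.0


def learning_rates(valid_perplexities, config):
    """Return the learning rate in effect after each epoch.

    The rate is multiplied by `lr_decay` whenever the validation
    perplexity fails to improve on the previous epoch (assumed rule).
    """
    lr = config.initial_lr
    previous = float("inf")
    rates = []
    for ppx in valid_perplexities:
        if ppx >= previous:
            lr *= config.lr_decay
        previous = ppx
        rates.append(lr)
    return rates


# Illustrative validation perplexities, not results from the paper.
print(learning_rates([20.0, 15.0, 16.0, 14.0, 14.0], TrainConfig()))
```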
6.3. Validation via Predictive Performance
In this section, we evaluate whether our language models over comments are effective at predicting text. In later sections, we evaluate whether the resulting perplexity scores are effective for predicting entailment. We evaluate two kinds of model. The first is an LSTM language model trained only on comment text. Next, we examine whether sequence-to-sequence (s2s) models are effective at leveraging the method code to improve their ability to predict comments. In these s2s models, we experiment with the three method compression techniques from Section 6.1. Clearly, if the sequence-to-sequence model does not display better predictive performance, then it is not using information from the code effectively. To compare our comment corpus to a general English language corpus, we also consider a state-of-the-art LSTM language model built for English newswire text (mellis17lstm). It was trained and tested on partitions of the Penn Treebank (marcus93building), a corpus containing Wall Street Journal news articles.
Language models are typically evaluated using their perplexity on a test set, which is a collection of texts unseen by the models during training. This is standard methodology in natural language processing for measuring the quality of a language model. Lower perplexities are better, and the language model with the lowest perplexity on a test set is best at predicting strings from the language. Previous studies of language models for code have also reported cross-entropy, $H$, which is the base-2 logarithm of the perplexity $PP$. The relationship is therefore simply $PP = 2^{H}$. Perplexity has an intuitive interpretation. The perplexity of a uniform distribution over $V$ words is exactly $V$, so perplexity can be viewed as an “effective vocabulary size” of the model, or how many guesses the model would need on average to predict every word in the text. The units of measure for perplexity can be intuitively understood as “number of vocabulary entries”. Both a general language model and sequence learning models can be evaluated using perplexity.
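As a worked illustration of these quantities, the sketch below computes cross-entropy and perplexity from the probabilities a model assigned to each token of a text. It assumes cross-entropy is measured in bits (base-2 logarithm); the function names are ours.

```python
import math


def cross_entropy_bits(token_probs):
    """Average negative log2 probability the model gave each token."""
    return -sum(math.log2(p) for p in token_probs) / len(token_probs)


def perplexity(token_probs):
    """Perplexity is 2 raised to the cross-entropy in bits."""
    return 2 ** cross_entropy_bits(token_probs)


# A uniform model over 8 words gives every token probability 1/8,
# so its perplexity equals the vocabulary size.
uniform = [1 / 8] * 5
print(cross_entropy_bits(uniform))  # 3.0
print(perplexity(uniform))          # 8.0
```

A model that predicts every token with certainty has perplexity 1, the lower bound.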
Table 2 shows the perplexities of our models on the training, validation and test sets. We see that our language models are indeed dramatically better at predicting comment text than state-of-the-art models are at predicting newswire text: the perplexity is 58 for newswire text, compared to values on the order of 10 and 5 for comments. We hypothesize that comments are easier to predict than natural language news because comments belong to a narrower domain in terms of both vocabulary and the productive nature of sentences.
For the sequence-to-sequence models, we compare all three code representation methods from Section 6.1, namely (a) the signature-based representation (s2s-signature in Table 2), (b) the begin-end representation (s2s-begin-end), and (c) the identifier-based representation (s2s-identifier). The perplexity of the best sequence-to-sequence model is about half that of the language model. Hence simply capturing the most frequent comment tokens, while informative, does not perform as well as the entailment models, which use the method to make their predictions. In terms of which method representation is most useful, we find that using the method body in some compressed form (either as sampled tokens in the begin-end case or as identifier sequences) improves on using the signature only. Overall, this evaluation indicates that the sequence-to-sequence models are effective at predicting comments conditioned on the code, thereby providing a proxy for entailment.
Since s2s-begin-end and the identifier compression perform similarly, we use the simpler begin-end model as our best model for the rest of the analysis in this paper.
In Table 3, we also show qualitative examples for the highly entailed (low quality) and low entailment comments according to our best model. We do not show the methods due to space constraints, but the comments themselves are often enough to understand the distinction we are trying to convey between redundant examples and those which would be difficult to predict from the code.
|LM English newswire||58|
|High perplexity comments|
|149.84||Such error prevents checking out and creating new branch.|
|93.50||Keep id to make sure temp file will be removed after use uploaded file.|
|74.36||Here the stacktrace serves as the main information since it has the method which was invoked causing this exception.|
|73.83||Assumes that statistics rows collect over time , and that none of them have disappeared.|
|73.20||This helps to prevent (bad) application code from accidentally holding onto extraneous garbage.|
|68.11||The only place this flag is used right now is in multiple page dialog icon_style and tab_style.|
|50.83||The client property dictionary is not intended to support large scale extensions to jcomponent nor should be it considered an alternative to subclassing when designing a new component.|
|Low perplexity comments|
|1.03||setter method for “name” tag attribute.|
|1.07||Finds the user id mapper with the primary key or returns null if it could not be found.|
|1.10||Calls case xxx for each class of the model until one returns a non null result; it yields that result.|
|1.56||compares this uuid with the specified uuid.|
|2.84||@throws SettingNotFoundException thrown if a setting by the given name can’t be found or the setting value is not an integer.|
|3.05||removes a global ban for a player.|
|3.06||private default constructor.|
6.4. Comparison to Comment Categories
Prior work on comments (Pascarella:2017) has classified comments without regard to their usefulness. We examined how our model scores the comment sentences that were part of their manual comment classification work. This analysis identifies which categories from the manual classification are deemed redundant or non-redundant by our model. In this way, we gain intuition into the predictions of our model. We hypothesize that some categories of comments are more likely to be entailed than others. For example, comments that are categorized as explaining functionality are likely to be more easily inferred than comments that explain the deeper rationale of the code.
Pascarella:2017’s corpus contains 11,226 annotated comments. The comments were annotated into 6 major categories: Purpose (explain the functionality of the code), Notice (warnings, alerts, and information about usage), Under development (todo and incomplete comments), Style and IDE (IDE directives and formatting text), Metadata (license, ownership etc), and Discarded (noisy comments). Each category is further divided into finer sub-categories.
To compare with our work, we identified method-level comments and their code spans in their corpus. We were able to obtain the code spans for 837 method comments. For these comment-code pairs, we grouped the comments using the taxonomy adopted in Pascarella:2017. Then, we studied how the categorisation aligns with the predictions of our model.
Table 4 shows the categorisation of the 837 comments using the Pascarella:2017 taxonomy. Most of the method comments belong to either the Purpose or the Notice type. This is expected, as License, Todo or Metadata comments are unlikely to be method-level comments. Within the Purpose and Notice categories, most comments fall under Purpose-summary (comments on what the method does), Notice-usage (how a method must be used, or parameter definitions), Purpose-rationale (why code was written in a certain way), and Purpose-expand (how the code was implemented, and the purpose of different parts of the code in detail).
Our analysis shows that method level comments, unlike license or todo comments, are likely to fall under categories where a notion of entailment is well-defined. This lends confidence that our model is trained and tested on data that is likely to have categories like those in Table 4 and entailment could be used to differentiate the comments which are redundant.
We used our best model (s2s-begin-end) to examine model predictions on the above categories of comments. We split each comment into sentences as our model is designed to make predictions at the level of sentences. Each sentence is paired with the source code from the comment-code pair. This process gave a total of 1352 method-comment sentence pairs for which we obtain model predictions of perplexity per comment. We average the perplexities for each category and report them in Table 5.
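The per-category averaging described above can be sketched as follows, assuming each sentence has already been scored by the model and paired with its category label. The function name and the perplexity values in the usage lines are hypothetical placeholders, not figures from Table 5.

```python
from collections import defaultdict
from statistics import mean


def average_perplexity_by_category(scored_sentences):
    """Map each category to the mean perplexity of its sentences.

    `scored_sentences` is an iterable of (category, perplexity) pairs,
    one per method-comment sentence pair.
    """
    groups = defaultdict(list)
    for category, ppx in scored_sentences:
        groups[category].append(ppx)
    return {category: mean(values) for category, values in groups.items()}


# Placeholder scores for illustration only.
scored = [("Notice-usage", 2.0), ("Notice-usage", 4.0),
          ("Purpose-rationale", 30.0)]
print(average_perplexity_by_category(scored))
```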
The Notice-usage category represents explicit instructions on how to use a piece of code. This category has the lowest perplexity; part of the reason is that a lot of Javadoc falls under this category and, as we show later, comments in Javadoc tags are highly predictable. Similarly, Purpose-summary is meant to say what the code does, and its perplexity is lower than that of Purpose-rationale or Purpose-expand. Purpose-expand and Purpose-rationale indicate how certain things in the code were done and the choices made, respectively. These categories have the highest perplexity, which matches our intuition that comments which add explanations going beyond the code would be predicted as non-entailed.
6.5. Comparison with Human Judgements
In this section, we validate our core claim that predictive performance from a sequence-to-sequence model can be used to develop a comment entailment method, by comparing the entailment decisions from Craic to the judgement of human developers. To avoid confusion, recall that low perplexity on a comment indicates that it is highly entailed by the code, while high perplexity indicates a high degree of non-entailment. We now verify whether our scores can predict cases where humans would also judge the comments as entailed or not.
To perform this evaluation, we select a sample of 45 projects from the Github corpus (githubCorpus2013). This set is chosen such that projects of varying quality are included. As a proxy for quality, we use a popularity score for each project, which is the sum of the number of forks and the number of watchers (each of the two values is first converted into a z-score before the two are added). We then sample 15 projects from a high popularity range, 15 from a medium range and 15 from a low range according to these scores. In these projects, we collected methods which had at most 100 tokens, so that they would be easier for the annotators to read without the context of the full code base. We randomly sampled 500 method-comment sentence pairs from this set, and performed an annotation experiment.
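The popularity score can be written down directly. The sketch below is a hedged illustration: the function names are ours, the fork and watcher counts are made up, and the use of the population standard deviation is an assumption, since the text does not say which variant was used.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize values to zero mean and unit (population) deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def popularity_scores(forks, watchers):
    """Sum of z-scored forks and z-scored watchers, per project."""
    return [f + w for f, w in zip(z_scores(forks), z_scores(watchers))]


# Made-up counts for three projects.
forks = [1, 2, 3]
watchers = [10, 20, 30]
print(popularity_scores(forks, watchers))
```

Standardizing first keeps the larger-scaled count (here watchers) from dominating the sum. The sketch does not guard against a zero standard deviation, which would occur if every project had the same count.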
We hired five M.Sc. students in Informatics as annotators to provide human judgements of entailment. All of them have at least 2 years of experience in professional software development. Our interface presented a method and an associated comment sentence. The annotators read them both and decided on one among five entailment options:
entails: the comment sentence is logically entailed by the method
does not entail: the comment sentence is not entailed by the method
partly entails: option to be used for long comment sentences where some portion of the comment sentence is entailed though not the full sentence.
cannot decide: option allows annotators to refrain from making a decision when they do not understand the method or comment either due to high context dependence or low quality comments.
un-related: when the annotators understood the comment and the method, but were unable to see how they go together.
We also tracked two properties of the comments which we thought could help analyze annotator decisions in terms of intrinsic comment properties. Annotators could mark individual comments as incoherent when the comment was of low quality or too short to understand. Another option allowed annotators to mark comments which are part of javadoc. These markings were in addition to the entailment choice.
We did not ask the annotators to judge whether the comment sentence was useful, but only whether the comment was entailed by the code, which we argue is a more objective notion.
Every annotator marked each of the 500 examples, but each annotator saw the examples in a different random order. On average, an annotator took 8 hours to complete the task.
We removed one annotator who had substantial disagreement with the rest of the annotators (measured by pairwise Cohen’s Kappa score for rater agreement). The remaining annotators had a fair level of agreement. The majority of the confusion was between entail and partly entail categories. The pairwise Cohen’s Kappa for the remaining four annotators ranges from 0.1 to 0.29 indicating fair agreement. We noticed that there were two subgroups within our annotators, where two of them overwhelmingly picked ‘entails’ for ambiguous examples and two picked ‘partly entailed’ or ‘not entailed’. The first pair of annotators have an agreement of 0.19 and the second 0.29.
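Pairwise Cohen's Kappa, the agreement measure used here, can be computed as in the sketch below. The annotator names and labels in the usage lines are invented for illustration and are not taken from the study.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each rater's own label distribution.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)


def pairwise_kappa(ratings):
    """Kappa for every pair of annotators; `ratings` maps name -> labels."""
    return {
        (a, b): cohen_kappa(ratings[a], ratings[b])
        for a, b in combinations(sorted(ratings), 2)
    }


ratings = {
    "A1": ["entails", "entails", "not", "not"],
    "A2": ["entails", "not", "not", "not"],
    "A3": ["entails", "entails", "not", "not"],
}
print(pairwise_kappa(ratings))
```

Kappa corrects raw agreement for the agreement expected by chance. It is undefined when both raters use a single identical label throughout, a case this sketch does not handle.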
On one-fifth of our examples (95 out of 500), all four annotators picked the same entailment decision (out of five possible) indicating that this task is meaningful to the annotators. A majority decision was possible on 292 examples i.e. three or all of the four annotators picked the same choice on these examples. These numbers indicate that close to 60% of the examples could be annotated reliably.
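The majority rule used here (three or all four annotators choosing the same option) can be sketched as follows. The function name and label strings are illustrative.

```python
from collections import Counter


def majority_label(labels, threshold=3):
    """Return the label chosen by at least `threshold` annotators, else None."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= threshold else None


print(majority_label(["entails", "entails", "entails", "partly entails"]))
print(majority_label(["entails", "entails", "does not entail", "does not entail"]))

# The fraction of examples with a majority decision reported in the text.
print(292 / 500)  # 0.584
```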
For our class of interest, the redundant comments, 85 samples were marked by all four annotators as “entailed”. We will examine our model’s predictions on this subset in the next section.
Below we provide some examples of annotator decisions.
M1,C1 is a pair where all four annotators agreed that the comment is not entailed by the method.
In M2,C2 all four annotators agree that the comment is entailed or partly entailed by the method.
M3,C3 is a pair where annotators disagreed.
Here, two annotators picked entailed/partly entailed and two chose the non-entailed. It is likely that the information about the columns being “nullable” is not fully inferrable from the code (as opposed to columns containing null values already). Similarly, the return value being a keyword is not directly inferrable from the code. These points may trigger a “not entailed” decision. At the same time, the return type being a string may have been considered by the two other annotators as sufficient to entail/partly entail.
Other comments where annotators disagreed included comments which were not fluent or were context-dependent. In fact, out of the 84 examples where no majority decision was reached, 30 were marked by annotators as incoherent. Note that incoherent comment sentences also result from errors in the sentence segmentation performed on the comment text.
Comparison between model and human judgements
Since our annotators could reliably annotate and agree on the examples, we now examine whether the entailment predictions from Craic match human judgements.
For this analysis, we use the 292 examples where a majority decision was reached by the annotators. Table 6 shows the perplexities of our best model (s2s-begin-end) on these examples, split by the majority category from the annotation. We see that the entailed examples have the lowest average and median perplexities, followed by the partly entailed examples, which in turn have lower perplexities than the non-entailed examples. This finding shows that our model predictions correspond well with manual annotations by software developers.
Beyond agreement, it is also of interest to explore illustrative examples of when the model predictions are incorrect. For the example below, all four annotators marked the comment as entailing or partly entailing but the model assigned a high perplexity (74.3). For humans it is clear that is an index but a model with only surface code tokens will fail to predict those comment tokens. In addition, the fact that range is constant may be treated as subsidiary information by humans but a model score will be affected by the low predictability of these tokens.
Another noteworthy example is the following where the model predicts the comment to have low perplexity (1.12, hence entailed). However, three out of four annotators marked it as not entailed.
Here a crucial component of the semantics of the comment depends on the negation that is being conveyed: variables start and end are indexes in a result set and not primary keys. A surface level sequence to sequence model is not sensitive to such nuances.
7. Javadoc Comments
Finally, our approach also allows us to make more systematic recommendations about broad classes of comment sentences. As an exemplar of this type of study, we evaluate the usefulness of fields in javadoc comments, to test the hypothesis that some fields such as @param and @return encourage developers to restate the method signature rather than providing useful information. While these tags are useful for the document generator to pepper the documentation with code snippets, such comments are rarely elaborate or insightful.
To explore this hypothesis, we computed the average perplexity assigned by our best sequence to sequence model (begin-end) on the comment sentences in our 45 project dataset, a subset of which we used for the annotation experiment. There are 73,430 comment sentences in this set. We only consider those javadoc elements that occur at least 25 times in the corpus.
|javadoc type||no. sentences||avgppx|
Table 7 shows the number of sentences belonging to javadoc elements and their average perplexity. For comparison, the Non-javadoc row of the table gives the average perplexity of the sentences not belonging to a javadoc element. The non-javadoc sentence perplexity is around 15. In comparison, all but one of the javadoc element sentences have lower perplexities. Many common tags, such as @since, @throws, @inherit, @param and @deprecated, have much lower perplexity than non-Javadoc comments, showing that these elements can be easily predicted by a simple surface analysis of the method body. In fact, two of the commonly used fields, @param and @throws, are at least two times more predictable than a non-javadoc comment sentence.
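The per-tag comparison can be sketched as below. This is an assumption-laden illustration rather than the procedure used for Table 7: it treats a sentence as belonging to a javadoc element only when it starts with the tag, and the sentences and perplexities in the usage lines are invented. The `min_count` default mirrors the stated cutoff of 25 occurrences.

```python
import re
from collections import defaultdict
from statistics import mean

TAG_PATTERN = re.compile(r"^\s*(@\w+)")


def average_perplexity_by_tag(scored_sentences, min_count=25):
    """Average perplexity per javadoc tag, plus a non-javadoc bucket.

    `scored_sentences` is an iterable of (sentence, perplexity) pairs.
    Tags seen fewer than `min_count` times are dropped.
    """
    groups = defaultdict(list)
    for sentence, ppx in scored_sentences:
        match = TAG_PATTERN.match(sentence)
        groups[match.group(1) if match else "non-javadoc"].append(ppx)
    return {
        key: mean(values)
        for key, values in groups.items()
        if key == "non-javadoc" or len(values) >= min_count
    }


# Invented sentences and scores; min_count lowered for the toy data.
scored = [("@param x the value", 2.0), ("@param y the other value", 2.0),
          ("@return the sum", 4.0),
          ("Keeps the id around.", 12.0), ("Another note.", 18.0)]
averages = average_perplexity_by_tag(scored, min_count=2)
print(averages)
print(averages["non-javadoc"] / averages["@param"])
```

The final ratio expresses how many times more predictable a tag is than free-form comment sentences, which is the comparison made in the text.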
To validate the results from our deep models, we also examine how javadoc elements were treated by our annotators. Of the 500 code-comment pairs that were annotated, 190 were annotated by at least one of our annotators as involving javadoc. The entailment decisions on these samples are heavily towards entailment: 180 marked as entailing or partly entailing, and 10 as not entailing.
This result has implications for the design of documentation generators like javadoc. It is consistent with the claim that comments in javadoc fields are less informative than other types of comments. Uninformative comments increase visual clutter and decrease readability, as evidenced by the advice to developers cited earlier to avoid such comments. Therefore, if confirmed by more extensive studies, these empirical findings could motivate an effort by designers of documentation systems to consider ways to modify the available fields to encourage more informative comments, such as by refining the set of available javadoc elements, or revising the set of best practices for filling in the existing fields. The result also raises the possibility that deep learning language models will be able to automatically generate such comment categories, allowing a developer to focus on comments that require greater knowledge and depth of understanding.
In this paper, we have introduced the problem of comment entailment and used it to develop a tool to detect redundant comments. While all entailed comments need not be of low quality, highly entailed ones, that can be readily detected automatically, are likely to be uninformative. Going forward, we aim to develop models which can identify other types of uninformative comments with wide coverage and minimal annotation. Beyond comment quality, our entailment model could be useful in a variety of settings, for example, as a scoring tool within code search, within program synthesis from natural language, and within code summarization.
- Cleaner of Repetitive Areas In Comments, pronounced “crack”.
- We will still require a small amount of labelled data to evaluate the model, but this is much less of a concern, as long experience in the machine learning community has shown that the amount of data required to evaluate the model can be several orders of magnitude smaller than the amount of data required for training.