Automatic Source Code Summarization with Extended Tree-LSTM
Abstract
Neural machine translation models can be used to automatically generate documentation from source code, since this can be regarded as a machine translation task. Source code summarization, which generates a natural-language summary from given source code, is one component of automatic document generation. This suggests that techniques used in neural machine translation, such as Long Short-Term Memory (LSTM), can be applied to source code summarization. However, there is a considerable difference between source code and natural language: source code is essentially structured, containing loops, conditional branching, and so on. This structure is an obstacle to applying known machine translation models to source code directly.
Abstract syntax trees (ASTs) capture these structural properties and play an important role in recent machine learning studies on source code. Tree-LSTM has been proposed as a generalization of LSTMs for tree-structured data. However, there is a critical issue when applying it to ASTs: it cannot simultaneously handle a node that has an arbitrary number of children and the order of those children, and ASTs generally contain such nodes. To address this issue, we propose an extension of Tree-LSTM, which we call Multi-way Tree-LSTM, and apply it to source code summarization. In computational experiments, our proposal achieved better results than several state-of-the-art techniques.
1 Introduction
In developing and maintaining software, it is desirable that details about a program, such as its package dependencies and behavior, are appropriately commented in its source code files so that readers can understand the program's usage and purpose. Given this, software developers are strongly encouraged to document source code. However, documentation is often inaccurate, misleading, or even omitted because writing accurate and effective documentation is costly, and as a result developers spend a lot of time reading the source code itself [30]. To address this issue, automatic document generation has been studied in many software engineering studies. Source code summarization, which generates a short natural-language summary from source code, is an important component of automatic document generation.
Recent studies on source code summarization showed that high-quality comments can be automatically generated with deep neural networks trained on a large-scale corpus [10, 8]. To generate a good summary, a machine learning model needs to learn the functionality of the source code and translate it into natural language sentences. Since the structural properties of source code differ in nature from those of natural language, that is, source code has loops, conditional branching, etc., we should leverage such properties rather than sequential representations of source code. In many programming languages, source code can be parsed into a tree-structured representation called an abstract syntax tree (AST), which enables us to use the structural information of the source code. Several studies have reported that the results of various tasks related to source code were improved by utilizing ASTs. Such tasks include classifying source code [16], code clone detection [29], method name prediction [1], and source code summarization [8, 27], which is the focus of this paper.
Long Short-Term Memory (LSTM) networks [7] play an important role in neural machine translation. These networks are suitable for sequential data such as natural language sentences. However, due to the structured nature of source code, they may not be well suited to sequential representations of source code.
Tree-LSTM [24], originally proposed for predicting the semantic relatedness of two sentences and for sentiment classification, is a neural network architecture that handles tree-structured data, such as ASTs. It can be applied to other natural language processing (NLP) tasks (e.g. machine translation [5]). Tai et al. proposed two types of Tree-LSTM in their paper: the first type can handle trees in which each node has an arbitrary number of children, and the second type can handle the order of a fixed number of children at each node. However, it is difficult to apply them to ASTs since ASTs contain nodes that have an arbitrary number of ordered children, as in Figure 1. In this research, we propose an extension of Tree-LSTM to solve this issue and use it as an encoder in our source code summarization model.
The contributions of this paper are shown below.

We propose an extension of Tree-LSTM: the Multi-way Tree-LSTM unit can handle trees, such as ASTs, that contain nodes having an arbitrary number of ordered children.

We show that a tree-structured model with Multi-way Tree-LSTM, which can learn tree structures in ASTs directly, is more effective than a sequential model used for machine translation in NLP when applied to source code summarization.
To evaluate our model, we conducted computational experiments using a dataset consisting of pairs of a method and its documentation comment. Our experimental results show that our model is significantly better than a state-of-the-art summarization model due to [8], and some source code summaries generated by our model are more expressive than those in the original dataset.
2 Background
Source code summarization is related to machine translation. Recently, Recurrent Neural Networks (RNNs) and LSTMs have been of great importance in the NLP field. In this section, we review some concepts and previous work related to our study.
2.1 Recurrent Neural Networks
RNNs have been frequently used in the NLP field. Unlike feedforward neural networks, RNNs take sequences of arbitrary lengths as input and generate sequences of the same length while updating their internal states, as shown in Figure 2(a).
Since sentences in natural languages can be seen as sequences of words, RNNs are well-suited to NLP.
The standard RNN receives a sequence of input vectors $(x_1, \dots, x_T)$ and outputs a sequence of hidden state vectors $(h_1, \dots, h_T)$ while updating the hidden state at each time step as
$$h_t = \tanh(W x_t + U h_{t-1} + b),$$
where $x_t$ and $h_t$ are the input and hidden state vectors at time step $t$, respectively, and $W$, $U$, and $b$ are model parameters. Here, $\tanh$ denotes the hyperbolic tangent and is used as an activation function.
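To make the recurrence concrete, here is a minimal NumPy sketch of one RNN step; the dimensions and random parameters below are illustrative assumptions, not values from the paper.

```python
import numpy as np

def rnn_step(x_t, h_prev, W, U, b):
    """One step of a vanilla RNN: h_t = tanh(W x_t + U h_{t-1} + b)."""
    return np.tanh(W @ x_t + U @ h_prev + b)

# Toy dimensions (illustrative only): 4-dimensional inputs, 3-dimensional hidden state.
rng = np.random.default_rng(0)
W, U, b = rng.normal(size=(3, 4)), rng.normal(size=(3, 3)), np.zeros(3)

h = np.zeros(3)
for x in rng.normal(size=(5, 4)):  # a sequence of five input vectors
    h = rnn_step(x, h, W, U, b)    # the hidden state carries information along the sequence
```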
2.1.1 Long Short-Term Memory (LSTM)
Standard RNNs are not capable of learning "long-term dependencies"; that is, they may fail to propagate information that appeared early in the input sequence to later time steps because of the vanishing and exploding gradient problems. LSTM [7] has additional internal states, called memory cells, that do not suffer from vanishing gradients, and it controls what information is propagated using gates, as shown in Figure 2(b). An LSTM unit contains three independent gates: a forget gate discards irrelevant information from the memory cell, an input gate adds new information to the memory cell, and an output gate computes the new hidden state. With these structures, we can avoid vanishing gradients and train RNNs on long sequences, which is useful in various applications in the NLP field [22]. For each time step $t$, each unit in the LSTM is computed by the following equations:
$$f_t = \sigma(W^{(f)} x_t + U^{(f)} h_{t-1} + b^{(f)}),$$
$$i_t = \sigma(W^{(i)} x_t + U^{(i)} h_{t-1} + b^{(i)}),$$
$$o_t = \sigma(W^{(o)} x_t + U^{(o)} h_{t-1} + b^{(o)}),$$
$$u_t = \tanh(W^{(u)} x_t + U^{(u)} h_{t-1} + b^{(u)}),$$
$$c_t = f_t \odot c_{t-1} + i_t \odot u_t,$$
$$h_t = o_t \odot \tanh(c_t),$$
where $f_t$, $i_t$, and $o_t$ denote the forget gate, the input gate, and the output gate for time step $t$, respectively, $\sigma$ denotes the sigmoid function, and $\odot$ denotes the element-wise product. The model parameters $W^{(\cdot)}$, $U^{(\cdot)}$, and $b^{(\cdot)}$ are matrices and vectors for $f$, $i$, $o$, and $u$.
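The gate equations translate directly into code. The following NumPy sketch of a single LSTM step is illustrative only; `p` is a hypothetical dictionary holding the weight matrices and bias vectors named in the equations.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step; p maps parameter names (Wf, Uf, bf, ...) to arrays."""
    f = sigmoid(p["Wf"] @ x_t + p["Uf"] @ h_prev + p["bf"])  # forget gate: what to discard from the cell
    i = sigmoid(p["Wi"] @ x_t + p["Ui"] @ h_prev + p["bi"])  # input gate: what new information to add
    o = sigmoid(p["Wo"] @ x_t + p["Uo"] @ h_prev + p["bo"])  # output gate: what to expose as the hidden state
    u = np.tanh(p["Wu"] @ x_t + p["Uu"] @ h_prev + p["bu"])  # candidate update for the memory cell
    c = f * c_prev + i * u                                   # new memory cell
    h = o * np.tanh(c)                                       # new hidden state
    return h, c
```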
2.1.2 Tree-LSTMs
We have seen that LSTM networks generate a sequence from an input sequence. Tai et al. [24] extended this type of network to handle an input tree, which they call Tree-LSTM. At each time step, a standard LSTM takes an input vector and a single hidden state vector from the previous time step and propagates information forward along the sequence. Tree-LSTMs can take multiple hidden states and propagate information from the leaves to the root, as shown in Figure 2(c). Tai et al. [24] proposed two kinds of Tree-LSTM: the Child-sum Tree-LSTM and the N-ary Tree-LSTM.
Child-sum Tree-LSTM: For a node $j$ with input vector $x_j$, we denote by $C(j)$ the set of children of $j$ and by $|C(j)|$ the number of children of $j$. In the Child-sum Tree-LSTM, the memory cell $c_j$ and the hidden state $h_j$ are computed as follows:
$$\tilde{h}_j = \sum_{k \in C(j)} h_k, \qquad (1)$$
$$f_{jk} = \sigma(W^{(f)} x_j + U^{(f)} h_k + b^{(f)}) \quad (k \in C(j)), \qquad (2)$$
$$u_j = \tanh(W^{(u)} x_j + U^{(u)} \tilde{h}_j + b^{(u)}), \qquad (3)$$
$$i_j = \sigma(W^{(i)} x_j + U^{(i)} \tilde{h}_j + b^{(i)}), \qquad (4)$$
$$o_j = \sigma(W^{(o)} x_j + U^{(o)} \tilde{h}_j + b^{(o)}), \qquad (5)$$
$$c_j = i_j \odot u_j + \sum_{k \in C(j)} f_{jk} \odot c_k, \qquad (6)$$
$$h_j = o_j \odot \tanh(c_j), \qquad (7)$$
where $f_{jk}$, $i_j$, and $o_j$ denote the forget gates, the input gate, and the output gate for node $j$, respectively. Note that the summation (1) of the hidden states of the children is given as input to the memory cell update (3), the input gate (4), and the output gate (5), and that the same parameters are used for all the hidden states of the children of $j$ in the forget gates (2). Since the parameters in equation (2) are shared over all children $k \in C(j)$, the Child-sum Tree-LSTM can handle an arbitrary number of children. As shown in Figure 3(a), however, since the forget gate is computed independently for each child $k$, interactions among children are not taken into consideration when discarding information in the forget gate. Furthermore, with the exception of the forget gate, the order of the children cannot be considered, because the information propagated from the children cannot be distinguished after the summation (1).
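A compact sketch of the Child-sum node update may help relate equations (1) to (7) to code. This follows the formulation reconstructed above; the parameter dictionary `p` and the `(h, c)` pairs for the children are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def child_sum_node(x_j, child_states, p):
    """Child-sum Tree-LSTM update at one node; child_states is a list of (h_k, c_k)."""
    dim = p["bi"].shape[0]
    h_sum = sum((h for h, _ in child_states), start=np.zeros(dim))  # eq. (1): order-insensitive sum
    u = np.tanh(p["Wu"] @ x_j + p["Uu"] @ h_sum + p["bu"])          # eq. (3)
    i = sigmoid(p["Wi"] @ x_j + p["Ui"] @ h_sum + p["bi"])          # eq. (4)
    o = sigmoid(p["Wo"] @ x_j + p["Uo"] @ h_sum + p["bo"])          # eq. (5)
    c = i * u
    for h_k, c_k in child_states:
        f_k = sigmoid(p["Wf"] @ x_j + p["Uf"] @ h_k + p["bf"])      # eq. (2): shared parameters per child
        c = c + f_k * c_k                                           # eq. (6)
    return o * np.tanh(c), c                                        # eq. (7)
```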
N-ary Tree-LSTM: In the N-ary Tree-LSTM, each node $j$ has exactly $N$ ordered children with hidden states $h_{j1}, \dots, h_{jN}$ and memory cells $c_{j1}, \dots, c_{jN}$, and the memory cell $c_j$ and the hidden state $h_j$ are computed as follows:
$$\bar{h}_j = [h_{j1}; h_{j2}; \dots; h_{jN}], \qquad (8)$$
$$f_{jk} = \sigma(W^{(f)} x_j + U^{(f)}_k \bar{h}_j + b^{(f)}) \quad (k = 1, \dots, N), \qquad (9)$$
$$u_j = \tanh(W^{(u)} x_j + U^{(u)} \bar{h}_j + b^{(u)}), \qquad (10)$$
$$i_j = \sigma(W^{(i)} x_j + U^{(i)} \bar{h}_j + b^{(i)}), \qquad (11)$$
$$o_j = \sigma(W^{(o)} x_j + U^{(o)} \bar{h}_j + b^{(o)}), \qquad (12)$$
$$c_j = i_j \odot u_j + \sum_{k=1}^{N} f_{jk} \odot c_{jk}, \qquad (13)$$
$$h_j = o_j \odot \tanh(c_j), \qquad (14)$$
where $\bar{h}_j$ is the vector obtained by concatenating the vectors $h_{j1}, \dots, h_{jN}$. Unlike the Child-sum Tree-LSTM, parameters are not shared among the children, and the concatenation (8) is used in the memory cell update (10), the input gate (11), and the output gate (12) instead of the summation (1). As shown in Figure 3(b), interactions among children can be taken into consideration when discarding information in the forget gate, since the forget gate is computed from the concatenation (8); moreover, the children can be distinguished in (10), (11), and (12) thanks to the concatenation. However, it is impossible to input trees containing nodes that have an arbitrary number of children, because the sizes of the parameter matrices must be fixed to compute equations (9) to (12).
These Tree-LSTMs are therefore not appropriate for ASTs of source code, since ASTs contain nodes with an arbitrary number of children and the order of those children is significant. In previous studies of source code summarization (e.g. [27]), ASTs are converted into binary trees so that the N-ary Tree-LSTM can be applied.
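Such a conversion can be done in several ways; the sketch below uses the left-child right-sibling encoding as one plausible choice (the literature only requires "a standard binarization technique", so this particular scheme is an assumption for illustration).

```python
class Node:
    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []

def to_binary(node):
    """Left-child right-sibling encoding: the first child of a node becomes its left
    child, and each remaining child becomes the right child of its previous sibling."""
    new = Node(node.label, [None, None])   # fixed arity of 2, as the N-ary Tree-LSTM requires
    prev = None
    for child in node.children:
        b = to_binary(child)
        if prev is None:
            new.children[0] = b            # first child goes into the left slot
        else:
            prev.children[1] = b           # later siblings chain off to the right
        prev = b
    return new

# Example: a node with three ordered children becomes a binary tree.
ast = Node("Block", [Node("Stmt1"), Node("Stmt2"), Node("Stmt3")])
binary_ast = to_binary(ast)
```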
2.2 Related Work
Various methods for automatic source code summarization have been proposed. There are several non-neural approaches: methods based on call relationships [15] and topic modeling [17]. Oda et al. [18] proposed a pseudo-code generation method, which generates line-by-line comments from given source code.
Our focus is on neural network-based source code summarization. In our approach, we train neural networks on a large-scale parallel corpus consisting of pairs of a method and its documentation comment. This approach is frequently used in recent source code summarization studies. Iyer et al. [10] proposed a neural source code summarization method based on an LSTM network with attention, called CODE-NN, and showed that this approach is as promising for source code summarization as it is for machine translation. DeepCom [8] exploits the structural properties of source code by means of ASTs: the AST is traversed to obtain a sequence, which is then encoded with an LSTM encoder. Note that the given AST is uniquely reconstructible from the encoded sequence they used. ASTs are extensively used not only in code summarization studies but also in various other software engineering studies [29, 1, 28, 2].
3 Proposed Approach
In this section, we propose an extension of Tree-LSTM and describe our code summarization framework.
3.1 Multi-way Tree-LSTM
As mentioned in Section 2, the standard Tree-LSTMs proposed by Tai et al. [24] cannot simultaneously handle a node that has an arbitrary number of children and the order of those children, as required for ASTs. To overcome this difficulty, we develop an extension of Tree-LSTM, which we call the Multi-way Tree-LSTM. The key to our extension is that we use LSTMs to encode the information of the ordered children. This idea enables us not only to handle an arbitrary number of ordered children but also to consider interactions among children, thereby combining the advantages of the Child-sum and N-ary Tree-LSTMs.
In the Multi-way Tree-LSTM, we add an ordinary chain-like LSTM to each gate immediately before the linear transformation so that the unit can flexibly adapt to a node that has an arbitrary number of ordered children, as shown in Figure 4.
The memory cell $c_j$ and the hidden state $h_j$ at each node $j$ are updated as follows:
$$\tilde{h}^{(f)}_j = (\tilde{h}^{(f)}_{j1}, \dots, \tilde{h}^{(f)}_{jn}) = \mathrm{LSTM}^{(f)}(h_{j1}, \dots, h_{jn}), \qquad (15)$$
$$\tilde{h}^{(i)}_j = (\tilde{h}^{(i)}_{j1}, \dots, \tilde{h}^{(i)}_{jn}) = \mathrm{LSTM}^{(i)}(h_{j1}, \dots, h_{jn}), \qquad (16)$$
$$\tilde{h}^{(o)}_j = (\tilde{h}^{(o)}_{j1}, \dots, \tilde{h}^{(o)}_{jn}) = \mathrm{LSTM}^{(o)}(h_{j1}, \dots, h_{jn}), \qquad (17)$$
$$\tilde{h}^{(u)}_j = (\tilde{h}^{(u)}_{j1}, \dots, \tilde{h}^{(u)}_{jn}) = \mathrm{LSTM}^{(u)}(h_{j1}, \dots, h_{jn}), \qquad (18)$$
$$f_{jk} = \sigma(W^{(f)} x_j + U^{(f)} \tilde{h}^{(f)}_{jk} + b^{(f)}) \quad (k = 1, \dots, n), \qquad (19)$$
$$i_j = \sigma(W^{(i)} x_j + U^{(i)} \tilde{h}^{(i)}_{jn} + b^{(i)}), \qquad (20)$$
$$o_j = \sigma(W^{(o)} x_j + U^{(o)} \tilde{h}^{(o)}_{jn} + b^{(o)}), \qquad (21)$$
$$u_j = \tanh(W^{(u)} x_j + U^{(u)} \tilde{h}^{(u)}_{jn} + b^{(u)}), \qquad (22)$$
$$c_j = i_j \odot u_j + \sum_{k=1}^{n} f_{jk} \odot c_{jk}, \qquad (23)$$
$$h_j = o_j \odot \tanh(c_j). \qquad (24)$$
Here, $\mathrm{LSTM}^{(g)}$ for $g \in \{f, i, o, u\}$ in (15) to (18) denotes a standard chain-like LSTM, and $\tilde{h}^{(g)}_j$ is the sequence of hidden states obtained by feeding the children's hidden states $h_{j1}, \dots, h_{jn}$ to $\mathrm{LSTM}^{(g)}$, where $n$ is the number of children of node $j$. Let us note that the forget gate (19) uses the whole sequence $\tilde{h}^{(f)}_j$, one vector per child, whereas the input gate (20), the output gate (21), and the memory cell update (22) use only the last vectors of $\tilde{h}^{(i)}_j$, $\tilde{h}^{(o)}_j$, and $\tilde{h}^{(u)}_j$, respectively. Moreover, we adopt bidirectional LSTMs [20] for $\mathrm{LSTM}^{(g)}$ at each gate to carry the information on earlier children to later children and vice versa. A bidirectional LSTM internally has two LSTMs for the forward and backward directions. Given an input sequence $(v_1, \dots, v_n)$, a bidirectional LSTM feeds $(v_1, \dots, v_n)$ and $(v_n, \dots, v_1)$ to its two LSTMs and obtains the sequences $(\overrightarrow{h}_1, \dots, \overrightarrow{h}_n)$ and $(\overleftarrow{h}_n, \dots, \overleftarrow{h}_1)$, respectively. The two sequences are then combined positionwise as $(\overrightarrow{h}_1 \Vert \overleftarrow{h}_1, \dots, \overrightarrow{h}_n \Vert \overleftarrow{h}_n)$, where $\Vert$ denotes vector concatenation. Thanks to the bidirectional LSTMs, our Multi-way Tree-LSTM can utilize interactions among children at each gate.
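The following NumPy sketch shows how one Multi-way node update could be organized. For brevity it runs unidirectional chain LSTMs over the children (the model described above uses bidirectional LSTMs at each gate), and the parameter dictionaries are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def chain_lstm(seq, p):
    """Plain chain LSTM over a list of vectors; returns the list of hidden states."""
    d = p["bf"].shape[0]
    h, c, outs = np.zeros(d), np.zeros(d), []
    for v in seq:
        f = sigmoid(p["Wf"] @ v + p["Uf"] @ h + p["bf"])
        i = sigmoid(p["Wi"] @ v + p["Ui"] @ h + p["bi"])
        o = sigmoid(p["Wo"] @ v + p["Uo"] @ h + p["bo"])
        u = np.tanh(p["Wu"] @ v + p["Uu"] @ h + p["bu"])
        c = f * c + i * u
        h = o * np.tanh(c)
        outs.append(h)
    return outs

def multiway_node(x_j, child_states, gate_lstms, p):
    """Multi-way Tree-LSTM update at one node.

    child_states: ordered list of (h_k, c_k) pairs from the children.
    gate_lstms:   per-gate chain-LSTM parameters, keyed by 'f', 'i', 'o', 'u'.
    """
    hs = [h for h, _ in child_states]
    if not hs:                                   # a leaf: feed a single zero vector
        hs = [np.zeros(p["bi"].shape[0])]
    t_f = chain_lstm(hs, gate_lstms["f"])        # eq. (15): keep the whole sequence
    t_i = chain_lstm(hs, gate_lstms["i"])[-1]    # eqs. (16)-(18): only the last vector is used
    t_o = chain_lstm(hs, gate_lstms["o"])[-1]
    t_u = chain_lstm(hs, gate_lstms["u"])[-1]
    i = sigmoid(p["Wi"] @ x_j + p["Ui"] @ t_i + p["bi"])        # eq. (20)
    o = sigmoid(p["Wo"] @ x_j + p["Uo"] @ t_o + p["bo"])        # eq. (21)
    u = np.tanh(p["Wu"] @ x_j + p["Uu"] @ t_u + p["bu"])        # eq. (22)
    c = i * u
    for (h_k, c_k), tf_k in zip(child_states, t_f):
        f_k = sigmoid(p["Wf"] @ x_j + p["Uf"] @ tf_k + p["bf"]) # eq. (19)
        c = c + f_k * c_k                                       # eq. (23)
    return o * np.tanh(c), c                                    # eq. (24)
```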
3.2 Code Summarization Framework
An overview of our approach is illustrated in Figure 5.
The proposed framework is based on sequence-to-sequence (seq2seq) models [4, 23] and can be roughly divided into three parts: parsing to ASTs, encoding ASTs, and decoding to sequences with attention. First, we convert the source code into an AST with a standard AST parser. In our model, each node in the parsed AST is embedded into a vector of fixed dimension. The AST with vector-labeled nodes is then encoded by our Multi-way Tree-LSTM. Finally, the encoded vectors are decoded into a natural language sentence using an LSTM decoder with attention.
3.2.1 Encoder
Given an AST, the encoder learns distributed representations of the nodes. At each node $j$ in the AST, the Multi-way Tree-LSTM encoder computes the hidden state $h_j$ from the embedded AST node $x_j$ and the hidden states of its children $h_{j1}, \dots, h_{jn}$ as
$$h_j = \mathrm{MultiwayTreeLSTM}(x_j, h_{j1}, \dots, h_{jn}),$$
applying equations (15) to (24) bottom-up from the leaves to the root.
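A small recursive driver makes the bottom-up computation explicit; `embed` and `cell` below are placeholders for the node-embedding lookup and the Multi-way Tree-LSTM update (for example, `multiway_node` from the earlier sketch with the gate parameters bound), and the `Node` type is the toy class used above.

```python
def encode_ast(root, embed, cell):
    """Encode an AST bottom-up; returns the root state and the hidden states of
    all nodes, which are later used by the attention mechanism."""
    all_hidden = []

    def recurse(node):
        child_states = [recurse(child) for child in node.children]  # encode children first
        h, c = cell(embed(node.label), child_states)                # combine them at this node
        all_hidden.append(h)
        return h, c

    root_state = recurse(root)
    return root_state, all_hidden
```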
3.2.2 Attention Mechanism
The attention mechanism [3] allows neural networks to focus on the relevant parts of the input rather than the unrelated parts. This mechanism has particularly advanced neural machine translation models [3, 14, 25].
In our model, for the hidden state $h_j$ in the encoder at node $j$ and the decoder hidden state $s_t$ at time step $t$, the context vector $d_t$ is computed as
$$d_t = \sum_{j} \alpha_{tj} h_j,$$
where $\alpha_{tj}$ is the weight between $h_j$ and $s_t$ defined as
$$\alpha_{tj} = \frac{\exp(\mathrm{score}(h_j, s_t))}{\sum_{j'} \exp(\mathrm{score}(h_{j'}, s_t))}.$$
Here, $\mathrm{score}$ is a function that measures the relevance between $h_j$ and $s_t$. We adopt the simple additive attention [3]:
$$\mathrm{score}(h_j, s_t) = v_a^{\top} \tanh(W_a [h_j; s_t]),$$
where $W_a$ and $v_a$ are model parameters of the attention mechanism.
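As a minimal illustration, the additive score and the resulting context vector can be computed as follows; the encoder states, decoder state, and parameters are toy placeholders consistent with the notation above.

```python
import numpy as np

def additive_attention(enc_states, s_t, W_a, v_a):
    """Score each encoder hidden state against the decoder state s_t with additive
    attention, softmax-normalize the scores, and build the context vector."""
    scores = np.array([v_a @ np.tanh(W_a @ np.concatenate([h_j, s_t])) for h_j in enc_states])
    weights = np.exp(scores - scores.max())    # subtract the max for numerical stability
    weights = weights / weights.sum()
    context = sum(w * h_j for w, h_j in zip(weights, enc_states))
    return context, weights
```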
The attention mechanism works between source code and natural language as well. For example, the token “=” can be translated directly into “equal”. Moreover, in our model, the attention mechanism can focus on subtrees of an AST as shown in Figure 6.
In ASTs, subtrees are meaningful components in source code such as single expressions, “if” statements, and loop statements. It is possible to focus on such components of various sizes by using the attention mechanism at each node in the treestructured encoder.
3.2.3 Decoder
The decoder decodes the hidden states of the encoder into a sentence in the target language. Following [3], at time step $t$, the LSTM decoder computes its hidden state $s_t$ from the previous hidden state $s_{t-1}$, the previously generated word $y_{t-1}$, and the context vector $d_t$ as
$$s_t = \mathrm{LSTM}(s_{t-1}, [y_{t-1}; d_t]).$$
Finally, the hidden state in the decoder is projected to a word as
$$p_t = \mathrm{softmax}(W_s s_t + b_s),$$
where $p_t$ is the predicted probability distribution of the $t$-th word $y_t$, and $W_s$ and $b_s$ are model parameters of the projection layer. The model parameters are trained by minimizing the cross entropy
$$\mathcal{L} = -\sum_{t} \sum_{w} q_t(w) \log p_t(w),$$
where $q_t$ is the true (one-hot) distribution of the $t$-th word.
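A toy NumPy sketch of the projection and the per-word cross-entropy term; the weight names mirror the equations above and are illustrative.

```python
import numpy as np

def word_loss(s_t, target_index, W_s, b_s):
    """Project the decoder state to vocabulary logits, apply softmax, and return the
    predicted distribution together with the cross-entropy against the reference word."""
    logits = W_s @ s_t + b_s
    logits = logits - logits.max()                 # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()  # predicted distribution p_t
    return probs, -np.log(probs[target_index])     # cross entropy with a one-hot q_t
```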
4 Experiments
We conducted comparative experiments with the above framework. In order to fairly compare the ability of the encoders, we kept all parts of the models other than the encoder the same as much as possible.
4.1 Dataset
We performed computational experiments with a dataset consisting of pairs of a method written in Java and its Javadoc documentation comment, collected by [8]. Since comments are not always given in an appropriate manner, we filtered out pairs whose comments are one-word descriptions, as well as constructors, setters, getters, and tester methods, as in Hu et al. [8]. Moreover, when a comment has two or more sentences, we only used the first sentence, since it typically expresses the functionality of the method. Hu et al. truncated the encoded sequences obtained from the dataset to a fixed length; however, similar truncation cannot be applied directly to ASTs, so we only used ASTs with at most 100 nodes. The remaining samples were split into training, validation, and test sets. As in many NLP studies, we limited the vocabulary of identifiers, and identifiers exceeding the limit were replaced with a special token, UNKID. We also limited the vocabulary of literals, with the remaining string and number literals replaced with UNKSTR and UNKNUM, respectively.
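The filtering described above can be sketched as a simple predicate; the exact heuristics (for example, how setters, getters, and testers are detected) are not spelled out here, so the rules below are assumptions for illustration.

```python
def keep_pair(method_name: str, ast_node_count: int, comment: str) -> bool:
    """Return True if a (method, comment) pair should be kept for training."""
    first_sentence = comment.strip().split(". ")[0]          # use only the first sentence
    if len(first_sentence.split()) <= 1:                     # drop one-word descriptions
        return False
    if method_name.startswith(("set", "get", "test")):       # drop setters, getters, testers
        return False
    if method_name == "<init>":                              # drop constructors
        return False
    if ast_node_count > 100:                                 # keep ASTs with at most 100 nodes
        return False
    return True
```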
4.2 Baselines
In addition to CODE-NN and DeepCom mentioned in Section 2.2, we compared our model with the Transformer [25], a state-of-the-art neural machine translation model whose encoder and decoder consist only of attention mechanisms, and with attention-based seq2seq models using the Child-sum Tree-LSTM and the N-ary Tree-LSTM [24] as the encoder. Although the Transformer was not designed for source code summarization, we include it in the experiment. (In the Transformer model, we use the same decoder as [25]; unlike [25], we set the number of layers to 3 (originally 6) and the dimensions of the embeddings and model parameters to 256 (originally 512).) In the N-ary Tree-LSTM model, it is difficult to use ASTs as input since their nodes have an arbitrary number of children; therefore, we converted the ASTs into binary trees with a standard binarization technique. The features of each model used in our experiments are shown in Table 1.
Table 1: Features of each model.

Approaches       | Code representation | Order considered | Interactions among children | Attention on subtrees
CODE-NN          | Set                 | No               | -                           | -
DeepCom          | Sequence            | Yes              | Yes                         | No
Multi-way (Ours) | Tree                | Yes              | Yes                         | Yes
Child-sum        | Tree                | No               | No                          | Yes
N-ary            | Tree*               | Yes              | Yes                         | Yes
Transformer      | Sequence            | Yes              | -                           | -

*Note: N-ary Tree-LSTMs can only handle trees in which every node has a fixed number of children.
The attention mechanism in DeepCom cannot focus on subtrees in the AST: with their scheme for encoding the AST as a sequence, the attention mechanism can focus only on prefixes of the encoded sequence, which do not correspond to subtrees of the AST. In contrast, our proposed model can focus on subtrees in the AST. Subtrees form "chunks of meaning" in a method, which may be useful when translating a method into a natural language sentence.
4.3 Implementation
Using the dataset described in Section 4.1, we trained the models, validated them after every epoch, and tested them. The models (code is available at https://github.com/sh1doy/summarization_tf) were written in TensorFlow and trained on a single GPU (NVIDIA Tesla P100) with the following settings:

We used a minibatch size of 80 in training.

The adaptive moment estimation (Adam) algorithm [11] was used with the learning rate set to 0.001 for optimization.

Both encoders and decoders were two-layered with shortcut connections [6].

We also implemented a one-layered encoder for the Multi-way Tree-LSTM.

Word embeddings and hidden states of the encoder and decoder were all 256-dimensional.

To avoid overfitting, we adopted dropout [21] with a drop probability of 0.5.
4.4 Evaluation Metrics
We evaluated the models with several metrics covering different contexts. BLEU (BLEU-N) [19] evaluates N-gram overlaps between two sentences. CIDEr [26] is a consensus-based metric originally proposed for evaluating image captioning. METEOR [12] is based on the weighted mean of unigram precision and recall. RIBES [9] is based on rank correlation coefficients combined with word precision. ROUGE-L [13] is a metric for summaries based on the longest common subsequence between two summaries.
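For instance, BLEU between a reference comment and a generated comment can be computed with NLTK; the tokenized sentences below are made-up examples, and short sentences usually need a smoothing function to avoid zero n-gram counts.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "converts a null charset to the default charset".split()
candidate = "convert a charset to the default charset".split()

# BLEU-4: geometric mean of 1- to 4-gram precisions with a brevity penalty.
smooth = SmoothingFunction().method1
score = sentence_bleu([reference], candidate, weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=smooth)
print(f"BLEU-4 = {score:.3f}")
```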
5 Results
In this section, we pose and answer the following two research questions:

How effective is the Multi-way Tree-LSTM in source code summarization compared with the baseline approaches and the two conventional Tree-LSTMs?

How effective are our method and the other methods when varying comment lengths, AST sizes, and the maximum number of children?
5.1 Experimental Analysis
The details of the experimental results are shown in Table 2 and Figure 7. Table 2 compares our method with the other methods on several evaluation criteria.


For RQ1, our methods (one-layered and two-layered) are better than the previous methods CODE-NN and DeepCom on all evaluation criteria. Moreover, even the conventional Tree-LSTMs outperform the previous methods. From these facts, we can see that ASTs should be treated as trees rather than encoded into sequences, and that Tree-LSTMs, including our proposal, can leverage the tree-structured nature of ASTs. The experiment also shows that the Transformer, one of the state-of-the-art methods in neural machine translation, does not work well for source code summarization, which suggests that source code is quite different from natural language sentences. It is interesting that our one-layered Multi-way Tree-LSTM encoder model outperforms its two-layered counterpart on multiple evaluation criteria, whereas multi-layered seq2seq models are generally better than single-layered ones; for example, the BLEU-4 score of a one-layered seq2seq encoder model (a one-layered version of DeepCom) in our implementation was lower than that of the two-layered one (the original DeepCom).
Figure 7 shows BLEU-4 scores of the AST-based methods in detail when varying comment lengths (Figure 7(a)), AST sizes (Figure 7(b)), and the maximum numbers of children (Figure 7(c)). In the following, we call the maximum number of children the maximum degree of the AST.
For RQ2, we can conclude that our summarization models based on the Multi-way Tree-LSTM are better than the other models. Although we do not see any considerable difference between our models and the other models across comment lengths (Figure 7(a)), our one-layered model is still better than the other models when generating summaries of moderate length. On the other hand, AST sizes and their maximum degrees have an impact on the quality of summaries: our one-layered Multi-way Tree-LSTM model significantly outperforms the other models when ASTs have many nodes or a large degree, as shown in Figures 7(b) and 7(c). It is worth noting that methods whose ASTs contain many nodes are exactly those that need to be appropriately commented, and hence our model is particularly suitable for practical purposes.
5.2 Output Examples
Table 3 shows some examples of summaries generated by our method. We only picked some interesting examples and hence do not claim that our method always generates such summaries. The generated summaries are quite natural compared with the original documentation comments in the dataset. In some cases, our model generated exactly the same sentences, as in (1) and (2). In other cases, our model expressed almost the same meaning in different words, as in (3) and (4). It is worth noting that some summaries are more expressive than the original sentences, as in (5) and (6).
Table 3: Examples of source code with gold and generated comments.

ID 1 - Source code:
public static Charset toCharset(Charset charset){
    return charset == null ? Charset.defaultCharset() : charset;
}
Gold:
Generated:

ID 2 - Source code:
public boolean more() throws JSONException {
    next();
    if (end()) {
        return false;
    }
    back();
    return true;
}
Gold:
Generated:

ID 3 - Source code:
Gold:
Generated:

ID 4 - Source code:
Gold:
Generated:

ID 5 - Source code:
public static boolean removeFile(File file){
    if (fileExists(file)) {
        return file.delete();
    }
    else {
        return true;
    }
}
Gold:
Generated:

ID 6 - Source code:
public void dismissProgressDialog(){
    if (isProgressDialogShowing()) {
        mProgressDialog.dismiss();
        mProgressDialog = null;
    }
}
Gold:
Generated:
6 Conclusion
Neural network approaches are certainly successful in machine translation. These approaches are expected to succeed in source code summarization as well, since it can be seen as translation from source code to natural language sentences. However, there is an essential difference between source code and natural language: source code is inherently structured. This fact raises a natural question: how do we use structural information in neural networks? Fortunately, the essential structure of source code forms a tree, namely an AST. This suggests that neural networks for trees would be useful in source code summarization.
In this paper, we proposed an extension of Tree-LSTM, a generalization of LSTM for trees, on the basis of the work of Tai et al. [24]. Our extension obtains distributed representations of ordered trees such as ASTs, which cannot be directly handled by the known Tree-LSTMs since such trees contain nodes with an arbitrary number of ordered children. We applied our extension as the encoder in a source code summarization model and compared it with other baseline methods. Our experimental results show that our extension is suitable for dealing with ASTs, and that a code summarization framework with our extension can generate high-quality summaries. We would like to mention that some summaries generated by our method are more expressive than the original hand-written summaries. This indicates the effectiveness of automatic document generation with neural networks for ASTs.
References
 [1] Uri Alon, Meital Zilberstein, Omer Levy, and Eran Yahav. code2vec: Learning distributed representations of code. arXiv preprint arXiv:1803.09473, 2018.
 [2] Bander Alsulami, Edwin Dauber, Richard Harang, Spiros Mancoridis, and Rachel Greenstadt. Source code authorship attribution using long shortterm memory based networks. LNCS, 10492:65–82, 2017.
 [3] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In ICLR, 2015.
 [4] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder–decoder for statistical machine translation. In Proc. of EMNLP, pages 1724–1734, 2014.
 [5] Akiko Eriguchi, Kazuma Hashimoto, and Yoshimasa Tsuruoka. Treetosequence attentional neural machine translation. In Proc. of ACL, volume 1, pages 823–833, 2016.
 [6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proc. of the IEEE CVPR, pages 770–778, 2016.
 [7] Sepp Hochreiter and Jürgen Schmidhuber. Long shortterm memory. Neural Comput., 9(8):1735–1780, November 1997.
 [8] Xing Hu, Ge Li, Xin Xia, David Lo, and Zhi Jin. Deep code comment generation. In Proc. of IEEE/ACM ICPC, pages 200–210, 2018.
 [9] Hideki Isozaki, Tsutomu Hirao, Kevin Duh, Katsuhito Sudoh, and Hajime Tsukada. Automatic evaluation of translation quality for distant language pairs. In Proc. of EMNLP, pages 944–952, 2010.
 [10] Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. Summarizing source code using a neural attention model. In Proc. of ACL, volume 1, pages 2073–2083, 2016.
 [11] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
 [12] Alon Lavie and Abhaya Agarwal. Meteor: An automatic metric for mt evaluation with high levels of correlation with human judgments. In Proc. of ACL Workshop on Statistical Machine Translation, pages 228–231, 2007.
 [13] ChinYew Lin. Rouge: A package for automatic evaluation of summaries. Text Summarization Branches Out, 2004.
 [14] Thang Luong, Hieu Pham, and Christopher D. Manning. Effective approaches to attentionbased neural machine translation. In Proc. of EMNLP, pages 1412–1421, 2015.
 [15] Paul W McBurney and Collin McMillan. Automatic documentation generation via source code summarization of method context. In Proc. of ICPC, pages 279–290, 2014.
 [16] Lili Mou, Ge Li, Lu Zhang, Tao Wang, and Zhi Jin. Convolutional neural networks over tree structures for programming language processing. In Proc of AAAI, pages 1287–1293, 2016.
 [17] Dana MovshovitzAttias and William Cohen. Natural Language Models for Predicting Programming Comments. In Proc. of ACL, volume 2, pages 35–40, 2013.
 [18] Yusuke Oda, Hiroyuki Fudaba, Graham Neubig, Hideaki Hata, Sakriani Sakti, Tomoki Toda, and Satoshi Nakamura. Learning to generate pseudocode from source code using statistical machine translation. In Proc. of IEEE/ACM ASE, pages 574–584, 2015.
 [19] Kishore Papineni, Salim Roukos, Todd Ward, and WeiJing Zhu. Bleu: A method for automatic evaluation of machine translation. In Proc. of ACL, pages 311–318, 2002.
 [20] Mike Schuster and Kuldip K Paliwal. Bidirectional recurrent neural networks. IEEE Trans. on Signal Processing, 45(11):2673–2681, 1997.
 [21] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. JMLR, 15(1):1929–1958, 2014.
 [22] Martin Sundermeyer, Ralf Schlüter, and Hermann Ney. Lstm neural networks for language modeling. In Proc. of ISCA, 2012.
 [23] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Proc. of NIPS, volume 2, pages 3104–3112, 2014.
 [24] Kai Sheng Tai, Richard Socher, and Christopher D. Manning. Improved semantic representations from treestructured long shortterm memory networks. In Proc. of ACL, volume 1, pages 1556–1566, 2015.
 [25] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Proc. of NIPS, pages 5998–6008, 2017.
 [26] Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. Cider: Consensusbased image description evaluation. In Proc. of IEEE CVPR, pages 4566–4575, 2015.
 [27] Yao Wan, Zhou Zhao, Min Yang, Guandong Xu, Haochao Ying, Jian Wu, and Philip S. Yu. Improving automatic source code summarization via deep reinforcement learning. In Proc. of ACM/IEEE ASE, pages 397–407, 2018.
 [28] Hui Hui Wei and Ming Li. Supervised deep features for Software functional clone detection by exploiting lexical and syntactical information in source code. In Proc. of IJCAI, pages 3034–3040, 2017.
 [29] Martin White, Michele Tufano, Christopher Vendome, and Denys Poshyvanyk. Deep learning code fragments for code clone detection. In Proc. of IEEE/ACM ASE, pages 87–98, 2016.
 [30] Xin Xia, Lingfeng Bao, David Lo, Zhenchang Xing, Ahmed E. Hassan, and Shanping Li. Measuring program comprehension: A largescale field study with professionals. In Proc. of ICSE, pages 584–584, 2018.