Leveraging Code Generation to Improve Code Retrieval and Summarization via Dual Learning

Abstract.

Code summarization generates a brief natural language description for a given source code snippet, while code retrieval fetches relevant source code given a natural language query. Since both tasks aim to model the association between natural language and programming language, recent studies have combined them to improve their performance. However, researchers have not yet effectively leveraged the intrinsic connection between the two tasks, as they train them in a separate or pipeline manner, which means their performance cannot be well balanced. In this paper, we propose a novel end-to-end model for the two tasks by introducing an additional code generation task. More specifically, we explicitly exploit the probabilistic correlation between code summarization and code generation with dual learning, and utilize the two encoders for code summarization and code generation to train the code retrieval task via multi-task learning. We have carried out extensive experiments on an existing dataset of SQL and Python, and the results show that our model can significantly improve code retrieval over state-of-the-art models, as well as achieve competitive performance in terms of BLEU score for code summarization.

code retrieval, code summarization, code generation, dual learning
Both authors contributed equally to this research.
Corresponding author.

1. Introduction

Modern software engineering relies heavily on a large number of third-party libraries and public code from various websites. During software development and maintenance, developers often spend much time understanding code and searching for the code snippets they need (Allamanis et al., 2018). Therefore, code retrieval and code summarization play an important role in many software engineering activities. Code summarization generates a brief natural language description for a given source code snippet, while code retrieval fetches relevant source code given a natural language query. Both are important and challenging tasks for the academic and industrial software engineering communities.

A large amount of code and related text has been accumulated on the Web. For example, Stack Overflow (Overflow, 2019) hosts a huge number of code snippets, usually paired with natural-language questions and comments. Researchers have extracted ⟨natural language description, code⟩ pairs from those resources to help develop data-hungry models for associating natural languages with programming languages (Iyer et al., 2016; Yao et al., 2018). For example, StaQC (Yao et al., 2018) is a large-scale dataset automatically mined from Stack Overflow, which contains more than 100K ⟨question, code⟩ pairs for SQL and Python respectively. Iyer et al. (Iyer et al., 2016) built a dataset for SQL and C# in a similar way. These datasets have greatly contributed to the progress of research on code retrieval and code summarization.

With these datasets mined from the Web, deep learning is now widely used in code retrieval and code summarization as a mainstream approach. Researchers in the fields of Web technologies, natural language processing, deep learning and software engineering have proposed a variety of neural models for these two tasks, aiming to design better features or more sophisticated network structures to capture more accurate semantics of code and natural language text. For example, the code retrieval model DCS (Gu et al., 2018) used two neural encoders to model the semantics of natural language queries and code snippets respectively, and measured their correlation by cosine similarity of the encoders’ output. There are more deep learning works on code summarization due to its resemblance to machine translation. In particular, SBT (Hu et al., 2018) transformed an abstract syntax tree (AST) into a token sequence to generate a code summary, in which brackets were used to represent the tree structure. Code2Seq (Alon et al., 2019) represented a code snippet as a set of compositional paths in its AST. Wan et al. (Wan et al., 2018) adopted Tree-RNN and reinforcement learning to improve code summarization.

Since both code retrieval and code summarization aim to model the association between natural language and programming language, recent studies have combined these two tasks to improve their performance. Chen et al. (Chen and Zhou, 2018) proposed BVAE, a neural framework that contains two Variational Auto-Encoders (VAEs) to model source code and natural language respectively. Both VAEs are trained jointly to reconstruct their inputs, with regularization that captures the closeness between the latent variables of code and description. CoaCor (Yao et al., 2019) trained a code summarization model with reinforcement learning to generate code summaries that can be used for the retrieval task. These approaches, however, have not yet effectively leveraged the intrinsic connection between the two tasks. For example, the BVAE models for the retrieval and summarization tasks are designed and trained independently, and CoaCor adopts a pipeline approach that feeds the generated summary to a separate ensemble module for code retrieval. These simplified solutions lead to two problems:

  • Performance cannot be well balanced between code retrieval and code summarization. For example, the BLEU score (Papineni et al., 2002) of the code summarization model in CoaCor is not satisfactory, even though it improves the performance of existing code retrieval models significantly. Different from what is claimed in (Yao et al., 2019), we argue that generating summaries close to human-provided queries naturally fits code retrieval. The compromise in BLEU score, which measures the similarity between generated summaries and human-written ones, can be avoided if we model the inner connection between the two tasks better.

  • The complexity of the overall procedure makes model training more challenging. For example, BVAE (Chen and Zhou, 2018) only provides a proof-of-concept implementation based on the simplest network architectures, because more powerful models are harder to train when VAEs are applied to text. Similarly, the convergence of reinforcement learning in CoaCor (Yao et al., 2019) is also a problem.

To overcome these issues, we propose an easy-to-train, end-to-end method to improve both code retrieval and code summarization. Our work is inspired by CoaCor (Yao et al., 2019). If the similarity between generated summaries and code-search queries can help improve code retrieval, it is natural to assume that the similarity between generated code and code snippets in code bases can also be beneficial. Therefore, we introduce an additional code generation task. Since code summarization and code generation are two tasks that have dual forms, we explicitly exploit the probabilistic correlation between code summarization and code generation with dual learning (Xia et al., 2017). The two encoders for code summarization and code generation are also shared with a code retrieval scorer via multi-task learning. In this way, we get an extremely simple yet effective model, named CO3, in which the 3 COde-related tasks can be trained simultaneously.

In this paper, we make the following contributions:

  • We design a simple yet effective end-to-end model for both code retrieval and code summarization by introducing the code generation task and exploiting the intrinsic connection between these tasks via dual learning and multi-task learning.

  • We carried out extensive experiments on an existing dataset of SQL and Python. Experiment results show that our model can improve code retrieval significantly compared to state-of-the-art models, without affecting the performance in code summarization.

  • With ablation study and case study, we provide strong evidence that the introduction of code generation and dual learning leads to better representations of source code and text in terms of semantics.

The rest of this paper is organized as follows. In Section 2, we introduce background knowledge on code retrieval, code summarization, code generation and dual learning. In Section 3, we explain our approach in detail. In Section 4, we describe our experiment setup, including the datasets and evaluation metrics used. In Section 5, we present and analyze our experiment results. Before concluding in Section 7, we discuss related research on code retrieval, code summarization, dual learning and multi-task learning in Section 6.

2. Background

2.1. Code Retrieval

Code retrieval aims to match code with a natural language query. Formally, given a set of code snippets $\mathcal{C}$ and a natural language query $q$, code retrieval aims to retrieve the code snippet $c \in \mathcal{C}$ that matches the semantics of the query $q$. First, we have an encoder that represents the code snippet $c$ as a code vector $\mathbf{v}_c$, which is calculated as follows:

$\mathbf{v}_c = f_{code}(c)$ (1)

where $f_{code}$ is the neural network that encodes source code. Then, we map the natural language query $q$ into the same semantic space as the code vector, represented as a query vector $\mathbf{v}_q$, which is calculated as follows:

$\mathbf{v}_q = f_{query}(q)$ (2)

where $f_{query}$ is another neural network that encodes natural language queries. Finally, the similarity value in the matching step is calculated as follows:

$s(c, q) = sim(\mathbf{v}_c, \mathbf{v}_q)$ (3)

where $sim(\cdot, \cdot)$ is a similarity function between the code vector $\mathbf{v}_c$ and the query vector $\mathbf{v}_q$. By maximizing this similarity over $\mathcal{C}$, we obtain the code snippet most relevant to a given description.
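
For concreteness, the following PyTorch sketch illustrates this dual-encoder retrieval formulation (Eqs. 1-3). The encoder callables and variable names are illustrative placeholders, not the exact architecture introduced later in this paper.

```python
import torch
import torch.nn.functional as F

# Minimal dual-encoder retrieval sketch (Eqs. 1-3). The encoders are assumed to be
# callables that map a token sequence to a fixed-size vector; names are illustrative.
def retrieve(code_encoder, query_encoder, query_tokens, candidate_code_tokens):
    # Eq. (2): map the query into the shared semantic space.
    v_q = query_encoder(query_tokens)                                    # [hidden]
    # Eq. (1): encode every candidate snippet.
    v_c = torch.stack([code_encoder(c) for c in candidate_code_tokens])  # [N, hidden]
    # Eq. (3): cosine similarity between the query and each candidate.
    scores = F.cosine_similarity(v_c, v_q.unsqueeze(0), dim=-1)          # [N]
    return scores.argsort(descending=True)                               # ranked candidate indices
```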

2.2. Code Summarization and Code Generation

As mentioned above, code summarization can be treated as a text generation task. Given an input code snippet $c = (c_1, \dots, c_{|c|})$ with $|c|$ code tokens, code summarization aims to generate a readable natural language summary $q = (q_1, \dots, q_{|q|})$ with $|q|$ words which describes the input code snippet $c$. Let $\mathcal{Q}$ be the set of all possible summary sequences. The system tries to find the optimal sequence $q^*$:

$q^* = \arg\max_{q \in \mathcal{Q}} P(q \mid c)$ (4)

In contrast, let $\mathcal{C}$ be the set of all possible code snippets; given a natural language query $q$, code generation is to generate the optimal code snippet $c^*$:

$c^* = \arg\max_{c \in \mathcal{C}} P(c \mid q)$ (5)

Note that for these two Seq2Seq models, the input of code summarization is the expected output of code generation, and vice versa. Thus, as mentioned in Section 1, code summarization and code generation are dual tasks of each other.
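
As a purely illustrative approximation of Eq. (4), the greedy decoding sketch below picks the most probable next word at each step; the `summarizer` interface is an assumption made for illustration, and beam search could be substituted for better sequences.

```python
import torch

# Greedy approximation of Eq. (4): q* = argmax_q P(q | c).
# `summarizer` is assumed to return a next-token distribution given the code
# and the partial summary; the interface is illustrative, not a fixed API.
def greedy_summarize(summarizer, code_tokens, bos_id, eos_id, max_len=200):
    summary = [bos_id]
    for _ in range(max_len):
        with torch.no_grad():
            next_dist = summarizer(code_tokens, summary)  # shape: [vocab]
        next_id = int(next_dist.argmax())
        if next_id == eos_id:
            break
        summary.append(next_id)
    return summary[1:]  # drop the BOS token
```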

2.3. Dual Learning

Dual learning, introduced in (Xia et al., 2017), aims to utilize the duality between two tasks so that they can keep learning from each other until convergence. Consider two tasks: a primal task that takes samples from space $\mathcal{X}$ as input and maps them into output space $\mathcal{Y}$, and a dual task that takes samples from space $\mathcal{Y}$ as input and maps them into output space $\mathcal{X}$. Then, for training pairs $(x, y)$, the primal task is to find a function $f: \mathcal{X} \rightarrow \mathcal{Y}$, and the dual task is to find a function $g: \mathcal{Y} \rightarrow \mathcal{X}$.

With the principle of duality, if the learned primal and dual models are perfect, we should have:

$P(x)\,P(y \mid x; \theta_{xy}) = P(y)\,P(x \mid y; \theta_{yx}) = P(x, y)$ (6)

where $\theta_{xy}$ is the parameter learned for $f$ and $\theta_{yx}$ is the parameter learned for $g$; we call this property probabilistic duality.

By incorporating dual learning into the specific training objectives of deep learning models, we have:

$\min_{\theta_{xy}} \; \frac{1}{n}\sum_{i=1}^{n} \ell_1\big(f(x_i; \theta_{xy}), y_i\big)$ (7)
$\min_{\theta_{yx}} \; \frac{1}{n}\sum_{i=1}^{n} \ell_2\big(g(y_i; \theta_{yx}), x_i\big)$ (8)
$s.t. \;\; P(x)\,P(y \mid x; \theta_{xy}) = P(y)\,P(x \mid y; \theta_{yx}), \;\; \forall (x, y)$ (9)

where $\ell_1$ and $\ell_2$ are the loss functions decided by the functions $f$ and $g$, and $\{(x_i, y_i)\}_{i=1}^{n}$ is the set of all training pairs.

To optimize these training objectives, the common practice in dual learning is to introduce Lagrange multipliers and add the equality constraint of probabilistic duality into the objective functions, converting the probabilistic duality constraint into the following regularization term:

$\ell_{dual}(x, y) = \big(\log \hat{P}(x) + \log P(y \mid x; \theta_{xy}) - \log \hat{P}(y) - \log P(x \mid y; \theta_{yx})\big)^2$ (10)

where $\hat{P}(x)$ and $\hat{P}(y)$ are estimates of the marginal distributions. The training objectives then become:

$\min_{\theta_{xy}} \; \frac{1}{n}\sum_{i=1}^{n} \big[\ell_1\big(f(x_i; \theta_{xy}), y_i\big) + \lambda_{xy}\,\ell_{dual}(x_i, y_i)\big]$ (11)
$\min_{\theta_{yx}} \; \frac{1}{n}\sum_{i=1}^{n} \big[\ell_2\big(g(y_i; \theta_{yx}), x_i\big) + \lambda_{yx}\,\ell_{dual}(x_i, y_i)\big]$ (12)

which can be trained using common optimization methods. In this paper, we utilize this regularization term to constrain code summarization and code generation so as to achieve better performance.
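
The regularization term in Eq. (10) can be computed directly from per-sequence log-probabilities. The following sketch shows one possible PyTorch implementation; the function interface is illustrative.

```python
import torch

def duality_regularizer(log_p_x, log_p_y_given_x, log_p_y, log_p_x_given_y):
    """Eq. (10): squared gap between the two factorizations of log P(x, y).

    The marginals (log_p_x, log_p_y) come from pre-trained language models,
    the conditionals from the primal and dual models. All arguments are
    scalar tensors of per-sequence log-probabilities.
    """
    gap = (log_p_x + log_p_y_given_x) - (log_p_y + log_p_x_given_y)
    return gap.pow(2)
```

The returned value is added to each task's loss with its Lagrange weight, as in Eqs. (11) and (12).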

3. Approach

3.1. Overview

In this section, we formulate the problem and describe our model of CO3.

We adopt the dual learning mechanism in (Xia et al., 2017) and propose two dual tasks: a primal code summarization task that takes a source code sequence $c$ as input and summarizes it into a text sequence $q$, and a dual code generation task that takes a text sequence $q$ as input and uses it to generate a code sequence $c$. We reuse the ground-truth $q$ and $c$ to supervise the two tasks respectively, with the dual learning mechanism improving the performance of both. Afterwards, we use the hidden states of these two tasks to facilitate and improve the performance of the code retrieval task.

Figure 1. Overall framework of CO3. Blocks with the same color indicate modules that use the same LSTM-cell parameters. More specifically, the Code Encoder and Code Decoder share the same LSTM-cell instance, and the Query Encoder and Query Decoder share the same LSTM-cell instance.

The model is split into three parts, as shown in Figure 1:

  1. The code summarization module encodes the code sequence $c$ and summarizes it into a text sequence $q$.

  2. The code generation module encodes the text sequence $q$ and uses it to generate a code sequence $c$.

  3. The code retrieval module calculates similarity scores between the hidden states of the code summarization module and the code generation module, and then retrieves matching source code based on the scores.

Our model has two major features:

  1. We add a restriction between the code summarization module and the code generation module with dual learning method to connect them and help capture more intrinsic and precise representations of text and code.

  2. We share the parameters of LSTM cells between the encoder of the code summarization module and the decoder of the code generation module, since they both deal with source code; so do the decoder of the code summarization module and the encoder of the code generation module. This reduces the number of model parameters. In this way, we use only two LSTM instances for the three tasks, constructing an extremely simple model.

3.2. Embedding Layer

To deal with both natural language texts and source code snippets, we use separate embedding layers to convert input sequences into high-dimensional vectors.

For the source code input, we directly use the code snippets processed by StaQC (Yao et al., 2018), which are token sequences $c = (c_1, \dots, c_{|c|})$. We then use an embedding matrix $W^{code}_{emb}$ to map the tokens in the code sequence into a high-dimensional vector space, yielding the code input embeddings $x^c_1, \dots, x^c_{|c|}$.

For the natural language input, we simply split the sequence by whitespace and convert the words into one-hot representations. We then use an embedding matrix $W^{text}_{emb}$ to map the one-hot representation of each word into a high-dimensional vector space, yielding the text input embeddings $x^q_1, \dots, x^q_{|q|}$.
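
A minimal PyTorch sketch of these two embedding layers is shown below; the vocabulary sizes are illustrative placeholders, while the 200-dimensional embedding size follows the settings reported in Section 4.2.

```python
import torch
import torch.nn as nn

# Illustrative vocabulary sizes; the 200-dimensional embeddings follow Section 4.2.
CODE_VOCAB, QUERY_VOCAB, EMB_DIM = 50_000, 30_000, 200

code_embedding = nn.Embedding(CODE_VOCAB, EMB_DIM, padding_idx=0)
query_embedding = nn.Embedding(QUERY_VOCAB, EMB_DIM, padding_idx=0)

# Token-id tensors [batch, seq_len] are mapped to embeddings [batch, seq_len, 200].
code_ids = torch.randint(1, CODE_VOCAB, (8, 120))    # max code length of 120 (Section 4.2)
query_ids = torch.randint(1, QUERY_VOCAB, (8, 30))
code_vecs = code_embedding(code_ids)                 # [8, 120, 200]
query_vecs = query_embedding(query_ids)              # [8, 30, 200]
```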

3.3. Code Summarization Module

To begin with, we employ a Bi-Directional Long Short-Term Memory (Bi-LSTM) based Recurrent Neural Network (RNN) (Hochreiter and Schmidhuber, 1997) encoder on the source code embedding sequence to model the temporal interactions between tokens.

The LSTM cell is composed of three multiplicative gates. At each time step $t$, it takes the embedding $x_t$ of the $t$-th token in the source code sequence, and merges it with the last hidden state to generate a new hidden state. The transmission rate of each input is controlled by the gates in the LSTM cell. At each time step $t$, the hidden state $h_t$ is updated as:

$i_t = \sigma(W_i h_{t-1} + U_i x_t + b_i)$ (13)
$f_t = \sigma(W_f h_{t-1} + U_f x_t + b_f)$ (14)
$o_t = \sigma(W_o h_{t-1} + U_o x_t + b_o)$ (15)
$\tilde{m}_t = \tanh(W_m h_{t-1} + U_m x_t + b_m)$ (16)
$m_t = f_t \odot m_{t-1} + i_t \odot \tilde{m}_t$ (17)
$h_t = o_t \odot \tanh(m_t)$ (18)

where $\sigma$ is the element-wise sigmoid function and $\odot$ is the element-wise product; $W_i, W_f, W_o, W_m$ are weight matrices for the last hidden state; $U_i, U_f, U_o, U_m$ are weight matrices for the source code embeddings; and $b_i, b_f, b_o, b_m$ are biases. For simplicity, we denote the above calculation as below (the memory cell vector $m_t$ is omitted):

$h^c_t = \mathrm{LSTM}(h^c_{t-1}, x_t)$ (19)

where $h^c_t$ denotes the hidden state of the $t$-th step in the Bi-LSTM encoder for $c$. Since $h^c_t$ only receives information from code tokens before position $t$, we use a Bi-Directional LSTM to further incorporate information after position $t$ into the hidden states:

$\overrightarrow{h}^c_t = \overrightarrow{\mathrm{LSTM}}(\overrightarrow{h}^c_{t-1}, x_t)$ (20)
$\overleftarrow{h}^c_t = \overleftarrow{\mathrm{LSTM}}(\overleftarrow{h}^c_{t+1}, x_t)$ (21)
$h^c_t = [\overrightarrow{h}^c_t; \overleftarrow{h}^c_t]$ (22)

Then we employ another LSTM-based decoder to generate the text summary:

$h^q_t = \mathrm{LSTM}(h^q_{t-1}, [e(q_{t-1}); a_{t-1}])$ (23)

where $h^q_t$ denotes the hidden state of the $t$-th step in the LSTM decoder for $q$, $e(q_{t-1})$ denotes the embedding of the word generated at step $t-1$, ";" denotes the concatenation operation, and $a_t$ is the context vector produced by the standard attention mechanism:

$\alpha_{ti} = \frac{\exp\big(score(h^q_t, h^c_i)\big)}{\sum_{j} \exp\big(score(h^q_t, h^c_j)\big)}$ (24)
$a_t = \sum_{i} \alpha_{ti}\, h^c_i$ (25)

where $score(\cdot, \cdot)$ is a trainable function to calculate the similarity between hidden states. We use a simple bi-linear function $score(h^q_t, h^c_i) = {h^q_t}^{\top} W_a h^c_i$, where $W_a$ is a trainable matrix.

Finally, the context vector $a_t$ is concatenated with the decoder output $h^q_t$ and fed into a linear layer to obtain the generated word distribution $P(q_t \mid q_{<t}, c)$:

$z_t = W_v [h^q_t; a_t] + b_v$ (26)
$P(q_t \mid q_{<t}, c) = \mathrm{softmax}(z_t)$ (27)

The objective is to maximize the probability of generating $q$ given the input $c$, so the loss function here is designed to be the negative log likelihood of the target words:

$\ell_{cs}(\theta_{cs}) = -\sum_{t=1}^{|q|} \log P(q_t \mid q_{<t}, c; \theta_{cs})$ (28)

where $\theta_{cs}$ denotes all the trainable parameters in the code summarization module.
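
The following condensed PyTorch sketch mirrors Eqs. (19)-(28): a Bi-LSTM encoder over code tokens, an attentional LSTM decoder with the bilinear score function, and the negative log-likelihood loss. It is an illustrative re-implementation under our own naming, not the released code; for brevity the bidirectional encoder states are projected back to the decoder size.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Seq2SeqSummarizer(nn.Module):
    """Illustrative Bi-LSTM encoder / attentional LSTM decoder (Eqs. 19-28).
    Dimensions follow Section 4.2; class and variable names are ours."""
    def __init__(self, code_vocab, text_vocab, emb_dim=200, hidden=400):
        super().__init__()
        self.code_emb = nn.Embedding(code_vocab, emb_dim, padding_idx=0)
        self.text_emb = nn.Embedding(text_vocab, emb_dim, padding_idx=0)
        self.encoder = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.enc_proj = nn.Linear(2 * hidden, hidden)       # fold forward/backward states
        self.decoder = nn.LSTMCell(emb_dim + hidden, hidden) # input: [word emb ; context]
        self.attn_bilinear = nn.Linear(hidden, hidden, bias=False)  # W_a in Eq. (24)
        self.out = nn.Linear(2 * hidden, text_vocab)                # Eqs. (26)-(27)

    def forward(self, code_ids, summary_ids):
        enc_out, _ = self.encoder(self.code_emb(code_ids))   # [B, Tc, 2H]
        enc_out = self.enc_proj(enc_out)                     # [B, Tc, H]
        B, H = code_ids.size(0), enc_out.size(-1)
        h = enc_out.new_zeros(B, H)
        cell = enc_out.new_zeros(B, H)
        context = enc_out.new_zeros(B, H)
        logits = []
        for t in range(summary_ids.size(1) - 1):             # teacher forcing, Eq. (23)
            x = torch.cat([self.text_emb(summary_ids[:, t]), context], dim=-1)
            h, cell = self.decoder(x, (h, cell))
            # Bilinear attention: score(h, enc_i) = h^T W_a enc_i  (Eq. 24)
            scores = torch.bmm(enc_out, self.attn_bilinear(h).unsqueeze(-1)).squeeze(-1)
            alpha = F.softmax(scores, dim=-1)
            context = torch.bmm(alpha.unsqueeze(1), enc_out).squeeze(1)   # Eq. (25)
            logits.append(self.out(torch.cat([h, context], dim=-1)))      # Eqs. (26)-(27)
        logits = torch.stack(logits, dim=1)                  # [B, Tq-1, V]
        # Eq. (28): negative log-likelihood of the reference summary.
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               summary_ids[:, 1:].reshape(-1), ignore_index=0)
```

At inference time a greedy or beam-search decoder over $P(q_t \mid q_{<t}, c)$ can reuse the same attention step.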

3.4. Code Generation module

To help our primal task—code summarization—achieve better performance, we construct a dual task with an opposite network, so it would output the source code sequence from its summary.

The whole structure is basically the same as the code summarization module, so we simply denote it as:

$\tilde{h}^q_t = \mathrm{BiLSTM}(\tilde{h}^q_{t-1}, e(q_t))$ (29)
$\tilde{h}^c_t = \mathrm{LSTM}(\tilde{h}^c_{t-1}, [e(c_{t-1}); \tilde{a}_{t-1}])$ (30)

where $\tilde{h}^q_t$ and $\tilde{h}^c_t$ denote the hidden states of the $t$-th step in the Bi-LSTM encoder and the LSTM decoder, respectively; $c_{t-1}$ denotes the code token generated at step $t-1$; and $\tilde{a}_t$ denotes the context vector.

Since both the code generation encoder and the code summarization decoder try to work with source code input, we believe sharing the LSTM-cell parameters between them would reduce the model’s complexity. The same is true for the code generation decoder and the code summarization encoder. We find that performance does not change much when using or not using individual encoders/decoders.

The loss function is also designed to be the negative log likelihood of the target code tokens:

$\ell_{cg}(\theta_{cg}) = -\sum_{t=1}^{|c|} \log P(c_t \mid c_{<t}, q; \theta_{cg})$ (31)

where $P(c_t \mid c_{<t}, q)$ is the generated word distribution calculated from $\tilde{h}^c_t$ and $\tilde{a}_t$, and $\theta_{cg}$ denotes all the trainable parameters in the code generation module.
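
One possible realization of the parameter sharing described above is to instantiate a single recurrent cell per "language" and let both dual modules hold references to it, so that gradients from both tasks accumulate into the same weights. The sketch below only illustrates that idea under our own assumptions; the exact wiring in CO3 may differ.

```python
import torch
import torch.nn as nn

# Hedged sketch of the cell sharing in Section 3.4: the code-side cell serves both
# the summarization encoder and the generation decoder, and symmetrically for the
# text-side cell. Names and the unrolling helper are illustrative.
EMB, HIDDEN = 200, 400
code_cell = nn.LSTMCell(EMB, HIDDEN)   # shared: code encoder step / code decoder step
text_cell = nn.LSTMCell(EMB, HIDDEN)   # shared: text decoder step / text encoder step

def unroll(cell, embedded_tokens):
    """Unrolls a shared LSTMCell over a [batch, seq, emb] tensor and returns all states."""
    batch = embedded_tokens.size(0)
    h = embedded_tokens.new_zeros(batch, HIDDEN)
    c = embedded_tokens.new_zeros(batch, HIDDEN)
    states = []
    for t in range(embedded_tokens.size(1)):
        h, c = code_cell(embedded_tokens[:, t], (h, c)) if cell is code_cell \
            else text_cell(embedded_tokens[:, t], (h, c))
        states.append(h)
    return states
```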

3.5. Dual Learning for Code Summarization and Generation

Following the idea of dual learning in (Xia et al., 2017), we expect the code summarization module and the code generation module to reflect the same joint data distribution. It is therefore reasonable to constrain the two modules by the following equation:

$P(c)\,P(q \mid c; \theta_{cs}) = P(q)\,P(c \mid q; \theta_{cg}) = P(c, q)$ (32)

In this equation, $P(q \mid c; \theta_{cs})$ and $P(c \mid q; \theta_{cg})$ can be calculated by the code summarization module and the code generation module. However, $P(c)$ and $P(q)$ are marginal distributions which cannot be calculated directly, so we use language models to approximate their real values. We pre-train these language models on the natural language corpus and the source code corpus separately. The marginal distribution of a sequence $y = (y_1, \dots, y_T)$ is then defined as:

$\hat{P}(y) = \prod_{t=1}^{T} P(y_t \mid y_{<t})$ (33)

where $P(y_t \mid y_{<t})$ is calculated by the pre-trained language model.
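
A sketch of Eq. (33) with a pre-trained language model is given below; the `lm` interface (prefix of token ids in, next-token logits out) is an assumption made for illustration.

```python
import torch
import torch.nn.functional as F

def sequence_log_marginal(lm, token_ids):
    """Eq. (33): log P_hat(y) = sum_t log P(y_t | y_<t) under a pre-trained LM.

    `lm` is assumed to map a prefix of token ids with shape [1, t] to
    next-token logits of shape [1, vocab]; this interface is illustrative.
    """
    log_p = 0.0
    for t in range(1, token_ids.size(1)):
        logits = lm(token_ids[:, :t])                 # [1, vocab]
        log_probs = F.log_softmax(logits, dim=-1)
        log_p = log_p + log_probs[0, token_ids[0, t]]
    return log_p
```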

3.6. Code Retrieval Module

For the code retrieval task with a natural language input $q$, we first calculate the similarities between $q$ and all candidate source code snippets $c$, and then choose the code sequences with the highest scores as output.

The similarity score is defined as:

$score(c, q) = f_{sim}(\mathbf{v}_c, \mathbf{v}_q)$ (34)

where $\mathbf{v}_c$ and $\mathbf{v}_q$ are calculated from the encoders' hidden states in the code summarization module and the code generation module when they take $c$ and $q$ as input:

$\mathbf{v}_c = \mathrm{maxpool}\big(h^c_1, \dots, h^c_{|c|}\big)$ (35)
$\mathbf{v}_q = \mathrm{maxpool}\big(\tilde{h}^q_1, \dots, \tilde{h}^q_{|q|}\big)$ (36)

and $f_{sim}$ is the similarity function of the two vectors:

$f_{sim}(\mathbf{v}_c, \mathbf{v}_q) = \cos\big(W_c \mathbf{v}_c,\; W_q \mathbf{v}_q\big)$ (37)

where $\cos(\cdot, \cdot)$ denotes cosine similarity, and $W_c$ and $W_q$ are trainable parameters.

Then we use a ranking loss to define the training objective: for a paired sample $(c, q)$, we randomly choose another source code input $c^-$ and another text summary $q^-$, and we require the score of the matched pair $(c, q)$ to be higher than the scores of the mismatched pairs $(c^-, q)$ and $(c, q^-)$ by a margin of at least $\epsilon$. So the loss function is:

$\ell_{cr}(\theta_{cr}) = \max\big(0, \epsilon - score(c, q) + score(c^-, q)\big) + \max\big(0, \epsilon - score(c, q) + score(c, q^-)\big)$ (38)

where $\theta_{cr}$ denotes all the trainable parameters in the code retrieval module.
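
The sketch below puts Eqs. (34)-(38) together as reconstructed above: max-pooled encoder states, a projection-then-cosine scorer, and a two-sided hinge loss. The tensor shapes, the projection form, and the pairing of negatives follow our reading of the equations and are illustrative rather than definitive.

```python
import torch
import torch.nn.functional as F

def retrieval_score(code_states, query_states, W_c, W_q):
    """Eqs. (34)-(37): max-pool encoder hidden states, apply trainable projections,
    then take cosine similarity. The projection-then-cosine form is our reconstruction
    and may differ in detail from the released implementation."""
    v_c = code_states.max(dim=1).values     # [B, H] from [B, T_c, H]
    v_q = query_states.max(dim=1).values    # [B, H] from [B, T_q, H]
    return F.cosine_similarity(v_c @ W_c, v_q @ W_q, dim=-1)   # [B]

def ranking_loss(pos, neg_code, neg_query, margin):
    """Eq. (38): the matched pair must beat both corrupted pairs by `margin`."""
    return (F.relu(margin - pos + neg_code) + F.relu(margin - pos + neg_query)).mean()
```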

3.7. Training

The final training objectives are:

$\min_{\theta_{cs}} \; \mathbb{E}_{(c,q)}\big[\ell_{cs}(c, q; \theta_{cs})\big]$ (39)
$\min_{\theta_{cg}} \; \mathbb{E}_{(c,q)}\big[\ell_{cg}(q, c; \theta_{cg})\big]$ (40)
$\min_{\theta_{cr}} \; \mathbb{E}_{(c,q)}\big[\ell_{cr}(c, q, c^-, q^-; \theta_{cr})\big]$ (41)
$s.t. \;\; P(c)\,P(q \mid c; \theta_{cs}) = P(q)\,P(c \mid q; \theta_{cg}), \;\; \forall (c, q)$ (42)

where $c^-$ and $q^-$ in $\ell_{cr}$ are randomly sampled from the code corpus and the query corpus.

To optimize the training objectives, following the common practice in dual learning, we use Lagrange multipliers and add the equality constraint of probabilistic duality into the objective functions, by converting the probabilistic duality constraint into the following regularization term:

$\ell_{dual}(c, q) = \big(\log \hat{P}(c) + \log P(q \mid c; \theta_{cs}) - \log \hat{P}(q) - \log P(c \mid q; \theta_{cg})\big)^2$ (43)

where $\hat{P}(c)$ and $\hat{P}(q)$ are calculated by the pre-trained language models, and $P(q \mid c; \theta_{cs})$ and $P(c \mid q; \theta_{cg})$ are calculated by the code summarization module and the code generation module.

Then we train the models by minimizing the weighted combination of the original loss functions and the added regularization term. The algorithm is shown in Algorithm 1.

Data: Marginal distributions $\hat{P}(c)$ and $\hat{P}(q)$ for any $c$ and $q$; Lagrange parameters $\lambda_{cs}$ and $\lambda_{cg}$; optimizers $Opt_{cs}$, $Opt_{cg}$, $Opt_{cr}$;
Result: Trained parameters $\theta_{cs}$, $\theta_{cg}$ and $\theta_{cr}$
1  Initialization;
2  while model not converged do
3      Get a mini-batch of $m$ pairs $\{(c_i, q_i)\}_{i=1}^{m}$;
4      Calculate the gradients of the code summarization module:
       $G_{cs} = \nabla_{\theta_{cs}} \frac{1}{m} \sum_{i=1}^{m} \big[\ell_{cs}(c_i, q_i; \theta_{cs}) + \lambda_{cs}\,\ell_{dual}(c_i, q_i)\big]$;
5      Update the parameters of the code summarization module:
       $\theta_{cs} \leftarrow Opt_{cs}(\theta_{cs}, G_{cs})$;
6      Calculate the gradients of the code generation module:
       $G_{cg} = \nabla_{\theta_{cg}} \frac{1}{m} \sum_{i=1}^{m} \big[\ell_{cg}(q_i, c_i; \theta_{cg}) + \lambda_{cg}\,\ell_{dual}(c_i, q_i)\big]$;
7      Update the parameters of the code generation module:
       $\theta_{cg} \leftarrow Opt_{cg}(\theta_{cg}, G_{cg})$;
8      Randomly sample $m$ pairs $(c_i^-, q_i^-)$ from the code corpus and the query corpus;
9      Calculate the gradients of the code retrieval module:
       $G_{cr} = \nabla_{\theta_{cr}} \frac{1}{m} \sum_{i=1}^{m} \ell_{cr}(c_i, q_i, c_i^-, q_i^-; \theta_{cr})$;
10     Update the parameters of the code retrieval module:
       $\theta_{cr} \leftarrow Opt_{cr}(\theta_{cr}, G_{cr})$;
11 end while
Algorithm 1 Supervised Learning Algorithm of CO3
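
For readers who prefer code to pseudocode, one iteration of Algorithm 1 could look like the following PyTorch-style sketch. The module interfaces (`nll`, `log_prob`, `ranking_loss`) and the Lagrange weights are assumptions made for illustration, not taken from the released implementation.

```python
import torch

def co3_training_step(batch, summarizer, generator, retriever, lm_code, lm_text,
                      opt_cs, opt_cg, opt_cr, lambda_cs=0.01, lambda_cg=0.01):
    """Illustrative version of one Algorithm 1 iteration. All module interfaces
    and the lambda values are assumptions for the sake of the sketch."""
    code, query, neg_code, neg_query = batch

    # Duality regularizer (Eq. 43), shared by both dual tasks; recomputed per step.
    def dual_reg():
        gap = (lm_code(code) + summarizer.log_prob(query, code)
               - lm_text(query) - generator.log_prob(code, query))
        return gap.pow(2).mean()

    # Code summarization step (lines 4-5).
    loss_cs = summarizer.nll(query, code) + lambda_cs * dual_reg()
    opt_cs.zero_grad(); loss_cs.backward(); opt_cs.step()

    # Code generation step (lines 6-7).
    loss_cg = generator.nll(code, query) + lambda_cg * dual_reg()
    opt_cg.zero_grad(); loss_cg.backward(); opt_cg.step()

    # Code retrieval step with randomly sampled negatives (lines 8-10).
    loss_cr = retriever.ranking_loss(code, query, neg_code, neg_query)
    opt_cr.zero_grad(); loss_cr.backward(); opt_cr.step()
```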

4. Experiments

4.1. Details of Datasets

To ensure the generality of our experiment results, we used the existing dataset StaQC (Yao et al., 2018), which contains data of two different programming languages, SQL and Python, to evaluate our approach.

StaQC, provided by Yao et al. (Yao et al., 2018), is the largest dataset in the SQL and Python domains. It contains 119,519 and 147,546 ⟨question title, code⟩ pairs in SQL and Python respectively, mined from Stack Overflow. We consider the question title as the query text of a code snippet in the code retrieval task, and in the code summarization task it is treated as the corresponding text summary of the code. We followed the dataset division in (Yao et al., 2018), using 75% of the pairs for training, 10% for validation and 15% for testing.

4.2. Implementation Details

Following CoaCor (Yao et al., 2019), we set the dimensionality of the LSTM hidden states, code embeddings, and query embeddings to 400, 200, and 200, respectively. Based on our observation of the query and code text, we set the maximum lengths of the query sequence and the code sequence to 200 and 120, respectively. A small, fixed value of 0.05 was used in all experiments. To evaluate code retrieval performance, we randomly selected another 49 code snippets for each ⟨query, code snippet⟩ pair and ranked all 50 snippets according to their similarity scores with the query. We then chose the model with the highest validation performance and computed the evaluation metrics on the test set. Adam was used for parameter optimization and the learning rate was set to 0.001. We implemented our model in PyTorch and trained our models on a Tesla T4. We will open-source our code in the near future.

In our experiments, training with dual learning (65 min/epoch) took more time than training without dual learning (20 min/epoch), due to the calculation of the duality regularization term. The speed of convergence was similar (about 4 epochs on average in both settings). Note that the training overhead due to dual learning is a one-time cost, and the inference speed is not affected.

4.3. Evaluation Metrics

Code Retrieval

Following previous works, we evaluated the retrieval performance of CO3 and the baselines with the MRR (Voorhees, 1999) and NDCG (Wang, 2013) metrics, which are widely used to evaluate code retrieval. MRR is a popular metric for evaluating ranking results: it computes the Mean Reciprocal Rank over the entire set, rewarding each item with the reciprocal of its rank. A higher MRR value indicates better code retrieval performance. NDCG is also widely used in evaluating rankings. It discounts the gain of each relevant item by its position in the ranking, sums the discounted gains, and normalizes the result to the range [0, 1]. It is similar to MRR but distributes weight differently over positions. Here we set the relevance weight to 1 if the corresponding code snippet is the positive one and 0 otherwise.
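
Concretely, under the 49-distractor protocol of Section 4.2 with a single relevant snippet per query, both metrics reduce to simple functions of the rank of the true snippet, as the sketch below illustrates (the scorer interface is an assumption).

```python
import math
import random

def mrr_and_ndcg(score_fn, query, positive_code, code_pool, k_distractors=49):
    """Per-query evaluation sketch matching Sections 4.2-4.3: rank the true snippet
    against 49 random distractors. With a single relevant item and binary relevance,
    NDCG reduces to 1 / log2(rank + 1). `score_fn(query, code)` is any similarity scorer."""
    distractors = random.sample([c for c in code_pool if c != positive_code], k_distractors)
    candidates = distractors + [positive_code]
    ranked = sorted(candidates, key=lambda c: score_fn(query, c), reverse=True)
    rank = ranked.index(positive_code) + 1            # 1-based rank of the true snippet
    return 1.0 / rank, 1.0 / math.log2(rank + 1)      # (reciprocal rank, NDCG)
```

Averaging these per-query values over the test set yields the MRR and NDCG numbers reported below.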

Code Summarization

Following previous works, we evaluated the summarization performance of CO3 and the baselines with the BLEU (Papineni et al., 2002) and METEOR (Banerjee and Lavie, 2005) metrics, which are widely used to evaluate text generation. BLEU is a popular accuracy-based metric for neural machine translation and is also used for code summary generation. It calculates the similarity between the generated sequence and the reference sequence by counting the n-grams that appear in both. METEOR is the harmonic mean of precision and recall, and a prior study (Banerjee and Lavie, 2005) shows that recall-based metrics can correlate better with human judgment than accuracy-based metrics like BLEU.
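
For reference, sentence-level BLEU-4 can be computed with NLTK as below; the tokenized example sentences are made up for illustration, and smoothing is commonly applied because code summaries are short.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Illustrative reference/candidate pair (not from the dataset).
reference = ["delete", "duplicate", "rows", "in", "sql"]
candidate = ["delete", "duplicate", "records", "in", "sql"]

bleu4 = sentence_bleu([reference], candidate,
                      weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=SmoothingFunction().method1)
print(round(bleu4, 3))
```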

4.4. Baselines

Code Retrieval

In code retrieval task, we compare our approach with the following state-of-the-art models as baselines in our evaluation.

  • DCS  (Gu et al., 2018) is a deep code search model which uses two deep neural networks to encode source code and natural language description into vector representation, and then uses a cosine similarity function to calculate their similarity.

  • CoaCor  (Yao et al., 2019) is a code annotation model trained to generate a natural language annotation which represents the semantics of a given code snippet. Then it projects the annotation and candidate code snippets into a high-dimensional vector space to calculate cosine similarity, thus fulfilling the code retrieval task.

Code Summarization

In the code summarization task, we compare our approach with the following baselines.

  • CODE-NN  (Iyer et al., 2016) is an end-to-end code summarization approach. This approach uses LSTM as the decoder and applies an attention mechanism in each decoding step.

  • Seq2Seq is a basic encoder-decoder model. We choose Bi-LSTM as the RNN layer for the encoder and LSTM for the decoder, and the attention mechanism  (Iyer et al., 2016) is applied in each decoding step.

  • CoaCor  (Yao et al., 2019) is a code annotation model, but the annotation can also be considered as the summary of code snippets.

4.5. CO3 Variants

In our approach, we introduce an additional code generation task, adopt dual learning to improve the dual tasks of code generation and code summarization, and finally train them together with the code retrieval task via multi-task learning. To evaluate the effect of the code generation task, dual learning, and multi-task learning, we perform an ablation study on the following variants of our model CO3. Note that the architecture of CO3 is extremely simple: though we model three tasks simultaneously, we actually have only two LSTM instances. The variants mainly involve different combinations of loss functions.

  • CO3: When we mention CO3 in our experiment results, it refers to the full model described in Section 3. The loss functions of CO3 consist of $\ell_{cs}$, $\ell_{cg}$, $\ell_{cr}$ and the dual regularization term $\ell_{dual}$.

  • CO3(Dual Learning)-1: This is a CO3 variant that removes the regularization term of dual learning. The loss functions consist of $\ell_{cs}$, $\ell_{cg}$ and $\ell_{cr}$.

  • CO3(Dual Learning)-2: This is a CO3 variant that removes both the regularization term of dual learning and the parameter sharing between code summarization and code generation, modeling them as two independent tasks. The loss functions consist of $\ell_{cs}$, $\ell_{cg}$ and $\ell_{cr}$.

  • CO3(Code Generation): This is a CO3 variant that removes the code generation module, which means it has only the code summarization module and the code retrieval module, sharing only the code encoder. The loss functions consist of $\ell_{cs}$ and $\ell_{cr}$.

  • DCS: This is a CO3 variant that has only the code retrieval module, which is a simplified DCS (Gu et al., 2018). The loss function is $\ell_{cr}$.

5. Experiments Results

5.1. Performance of Code Retrieval

To evaluate the performance of the code retrieval task, we use DCS and CoaCor as baselines, where DCS (Gu et al., 2018) is a very competitive code retrieval model from the software engineering community and CoaCor (Yao et al., 2019) is a state-of-the-art model proposed recently. As shown in Table 1, among the three models, CO3 achieves the highest score across all metrics for both SQL and Python. Since CoaCor uses ranking metrics for retrieval as the reward to produce retrieval-friendly code summaries and further ensembles two diverse retrieval models, it outperforms DCS by a large margin. Despite the simplicity of CO3, it outperforms CoaCor by nearly 0.01 for SQL in terms of both MRR and NDCG. The performance margin between CoaCor and CO3 is even larger for Python, which speaks to the superiority of CO3.

In addition to achieving a new state-of-the-art model for code retrieval, more importantly, we do not sacrifice the BLEU score of code summarization task, as evidenced by the following experiment results.

Model SQL Python
MRR NDCG MRR NDCG
DCS 0.522 0.629 0.617 0.705
CoaCor 0.576 0.670 0.636 0.721
Our Model 0.585 0.679 0.682 0.756
Table 1. Code Retrieval Results of CO3 and Baselines. We applied a t-test on the results of our model and the baselines, and all p-values are smaller than 0.05, which means that the improvement is stable and statistically significant.

5.2. Performance of Code Summarization

To evaluate the performance of the code summarization task, we use CODE-NN and a strong Seq2Seq model equipped with an attention mechanism as baselines. The results in Table 2 show that our model outperforms both baselines. CO3 is built upon the Seq2Seq model, and we find that the introduction of the code generation task and the adoption of dual learning and multi-task learning improve its performance in terms of both BLEU4 and METEOR, which verifies the effectiveness of our approach. How each of these design choices contributes to the improvement is analysed in Sections 5.4 through 5.6.

We also present CoaCor's scores for code summarization when it achieves its best performance for code retrieval. As we can see, for both languages, CO3 outperforms CoaCor in both BLEU and METEOR by a very large margin. The reason is that CoaCor prefers to generate longer code summaries, which contain more conceptual words useful for code retrieval but weaken human readability. In contrast, CO3 generates code summaries whose style is more in line with the human-written queries in the training data.

5.3. Balance between Code Retrieval and Code Summarization

We have shown that our model can significantly improve the results of the code retrieval task over state-of-the-art models, as well as achieve competitive performance in terms of BLEU score for the code summarization task. To explain why CO3 can balance code retrieval and code summarization better, we divided the test sets of the SQL dataset and the Python dataset into 10 groups according to the BLEU score (within [0, 1]) with an interval of 0.1. Then we calculated the average MRR score for all samples in each group. The results are presented in Figure 2. We can see that for both the SQL dataset and the Python dataset, as the BLEU score increases, the MRR score also increases. Note that the data with BLEU score over 0.9 are very sparse, and one outlier in the SQL dataset (MRR 0.3 at BLEU 1.0) affects the average score and causes the MRR plot to go down towards the right end.
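
The grouping analysis itself is straightforward to reproduce; a sketch is shown below, where `samples` is assumed to be a list of per-example (BLEU, MRR) pairs.

```python
from collections import defaultdict

def mrr_by_bleu_bucket(samples):
    """Bucket test samples by BLEU score (interval 0.1) and average MRR per bucket,
    mirroring the Figure 2 analysis. `samples` is a list of (bleu, mrr) pairs."""
    buckets = defaultdict(list)
    for bleu, mrr in samples:
        buckets[min(int(bleu * 10), 9)].append(mrr)   # BLEU of 1.0 falls into the last bucket
    return {b / 10: sum(v) / len(v) for b, v in sorted(buckets.items())}
```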

The correlation between MRR and BLEU scores corroborates our argument in Section 1 that generating summaries close to human-provided queries naturally fits code retrieval, and explains why CO3 can excel at code retrieval and code summarization at the same time.

Figure 2. MRR scores for data in the test set of SQL and Python grouped by BLEU score.
Model SQL Python
BLEU4 METEOR BLEU4 METEOR
CODE-NN 0.083 0.041 0.092 0.053
Seq2Seq 0.104 0.067 0.106 0.062
CoaCor 0.067 0.029 0.078 0.034
Our Model 0.119 0.087 0.124 0.085
Table 2. Code Summarization Results of CO3 and Baselines

5.4. Effect of Code Generation Task

To investigate the effect of the code generation task, we first conducted experiments on CO3(Code Generation), which consists of only the code summarization module and the code retrieval module. For code retrieval, we find that the performance degrades significantly when the code generation task is removed. For example, as shown in Table 3, compared with the full CO3 model, the MRR scores of CO3(Code Generation) decreased by 0.037 and 0.028 for SQL and Python, respectively. For code summarization, the result is similar. As shown in Table 4, the BLEU scores of CO3(Code Generation) decreased by 0.030 and 0.015 for SQL and Python, respectively.

Meanwhile, from Table 3 and Table 4, we find that CO3(Dual Learning)-1 (with LSTM parameters shared) and CO3(Dual Learning)-2 (with independent LSTM parameters) always outperform CO3(Code Generation). So we conclude that the code generation task can be helpful even without dual learning.

These two observations indicate that the code generation task is beneficial to capturing more accurate semantics of code and the paired natural language text, serving as a fundamental component in our architecture.

We also observe that the performance of CO3(Dual Learning)-1 and CO3(Dual Learning)-2 is basically the same, which indicates that whether or not LSTM parameters are shared does not have much effect on performance. However, with parameter sharing, CO3(Dual Learning)-1 uses nearly half the parameters of CO3(Dual Learning)-2, which makes training more efficient.

5.5. Effect of Dual Learning

To investigate the effect of dual learning, we compare the performance of the CO3(Dual Learning) variants with the full CO3 model. From Table 3 and Table 4, we can observe that for both code retrieval and code summarization, the full CO3 outperforms the CO3(Dual Learning) variants on both SQL and Python. Thus, we can conclude that dual learning helps generate more accurate semantic embeddings, which are used by the decoders to generate code or text, and which also benefit the code retrieval module.

Compared with code retrieval, the improvement brought by dual learning for code summarization is larger. The reason is that the regularization term of dual learning imposes a direct constraint on the joint probability of the two dual tasks, code summarization and code generation. It is therefore expected that dual learning helps generate better code summaries, confirming that the regularization term is effective.

Model SQL Python
MRR NDCG MRR NDCG
CO3 0.585 0.679 0.682 0.756
CO3(Dual Learning)-1 0.583 0.678 0.660 0.740
CO3(Dual Learning)-2 0.581 0.675 0.664 0.742
CO3(Code Generation) 0.548 0.650 0.654 0.734
DCS 0.522 0.629 0.617 0.705
Table 3. Code Retrieval Results of CO3 Variants
Model SQL Python
BLEU4 METEOR BLEU4 METEOR
CO3 0.119 0.087 0.124 0.085
CO3(Dual Learning)-1 0.100 0.061 0.117 0.081
CO3(Dual Learning)-2 0.102 0.061 0.111 0.082
CO3(Code Generation) 0.089 0.051 0.109 0.080
Table 4. Code Summarization Results of CO3 Variants

5.6. Effect of Multi-task Learning

To study whether code retrieval enhances code summarization, we removed the code retrieval module from CO3 and found that this did not affect the summarization performance of CO3 much.

For code retrieval, we compare CO3 with an individual code retrieval module, i.e., a simplified DCS. As shown in Table 3, since CO3 trains the code retrieval model jointly with the dual Seq2Seq model of code summarization and code generation, it achieves a clear improvement in terms of MRR and NDCG compared to DCS. In this multi-task learning architecture, the dual Seq2Seq model can be deemed an auxiliary task that provides inductive bias, which makes the model focus on hypotheses that can explain both the dual Seq2Seq model and code retrieval at the same time, and consequently improves the performance and generalization of code retrieval.

5.7. Case Study

Figure 3. Examples of Code and Generated Code Summaries.
Model Example(a) Example(b)
DCS 3 23
CoaCor 13 9
CO3 1 1
Table 5. Rank of Code Retrieval for Examples.
Figure 4. Heatmap of Query and Code Semantic Representation in Max-Pooling Process.

To perform qualitative analysis, we list two examples in Figure 3, with summaries provided by human beings (the titles) and generated by different code summarization models. The ranking results of each code retrieval model are summarized in Table 5. The examples show that our model not only achieves the best rank in the code retrieval task, but also generates a human-readable summary. This indicates that CO3 can capture the inner association between human-provided text and code more accurately, which is consistent with the previous quantitative analysis. More details are explained below.

In the first example, all models except CoaCor generated a clear, coherent and informative summary. Although CoaCor did find the keyword "delete", it failed to extract other keywords like "duplicate". Though the reinforcement learning mechanism can make the generated summaries more friendly to code retrieval, these summaries lose some human readability. In contrast, our model finds more key information while preserving the fluency and naturalness of the generated text. CO3 can easily find the corresponding code snippets with the help of more accurate semantic embeddings, which are also used to generate keywords like "delete" and "duplicate", as shown in the code summary generated by CO3. This again shows that a better code summary has the potential to lead to better performance in code retrieval.

In the second example, due to the more complex logic of the query, none of the code summarization models was able to achieve a good result. Compared with the human-written summary, all the generated summaries lost the key information about "compare" and "discarding". However, even in this complex scenario, CO3 successfully found the corresponding code snippet, whereas CoaCor could not locate the correct code snippet, resulting in a target rank of 23, as shown in Table 5. The main reason is that CO3 captures the intrinsic connection among code summarization, code generation and code retrieval better through an end-to-end model, thanks to dual learning and multi-task learning. More specifically, the three tasks collaborate to generate more accurate and richer semantic representations of both natural language and source code. These representations are vectors in a continuous semantic space and contain most of the key information. The discrete words generated by sampling from this continuous space may lose some key information, but the continuous representations retain it and thus better assist downstream tasks such as code retrieval.

We further transform the representations into a heat map of the query, and find that the words "compare" and "discarding" are assigned more weight, as shown in Figure 4. The weight of a word is calculated as the number of times it is selected as the maximum value when performing max-pooling in the scorer of the code retrieval module. This explains why CO3 can retrieve the target code snippet that contains the tokens "=" and "convert": in SQL, we often use "=" to "compare" two values and "convert" to "discard the time part". However, CoaCor feeds the generated summary, consisting of discrete words, into the code retrieval module in a pipeline manner, which can lose key information (e.g., "compare" and "discarding") when sampling from the continuous space. Error propagation is a common issue in pipeline-based approaches. This example demonstrates that our model can establish a better association between natural language and programming language, showing the effectiveness of our end-to-end model.

6. Related Work

6.1. Code Retrieval

As introduced in previous sections, code retrieval has been studied widely with information retrieval methods (Haiduc et al., 2013; Hill et al., 2014; Keivanloo et al., 2014; Lu et al., 2015; Vinayakarao et al., 2017) and recent deep learning methods (Allamanis et al., 2015; Gu et al., 2018; Iyer et al., 2016). Chen et al.  (Chen and Zhou, 2018) used VAEs to model both source code and natural language. Two VAEs are trained jointly to reconstruct their inputs as much as possible with regularization that captures the closeness between the latent variables of code and description, which will be used for measuring similarity. Similarly, Yao et al. (Yao et al., 2019) constructed a neural network-based code annotation model to describe the functionality of an entire code snippet. It produces meaningful words that can be used for code retrieval where these words and a natural language query are projected into a vector space to measure the cosine similarity between them.

Like some of these efforts, our code retrieval model directly encodes the query and code, and projects them into a high-dimensional vector space. But we also use dual learning (Xia et al., 2017) and multi-task learning to capture more intrinsic and precise representations of query and code, which improves performance significantly.

6.2. Code Summarization

Existing works on code summarization can be mainly categorized into traditional approaches based on information retrieval (Eddy et al., 2013), code keywords (Moreno et al., 2013; Sridhara et al., 2010), and statistical language models (Movshovitz-Attias and Cohen, 2013), and deep learning based approaches (Allamanis et al., 2016; Iyer et al., 2016; Hu et al., 2018). In (Eddy et al., 2013), the authors generated code summaries by searching for similar code. Sridhara et al. (Sridhara et al., 2010) generated comments according to keywords extracted from code. Movshovitz-Attias et al. (Movshovitz-Attias and Cohen, 2013) predicted comments from Java source files using topic models and n-grams. Allamanis et al. (Allamanis et al., 2016) proposed an attention-based neural network to summarize source code into method name-like summaries; they employed convolution on the source code tokens to detect local time-invariant and long-range topical attention features. Iyer et al. (Iyer et al., 2016) proposed an attention-based Recurrent Neural Network (RNN) model, which aligns the words in comments with individual code tokens directly through an attention component. The code summarization task can also be modeled as a machine translation problem, so several models based on the Seq2Seq paradigm (Sutskever et al., 2014) have been proposed. Hu et al. (Hu et al., 2018) proposed a structure-based traversal (SBT) algorithm in the encoder to flatten an AST and link the tokens in the source code with their AST node types.

Different from previous deep learning based works that design better features or more sophisticated network structures, this paper introduces a dual task of code generation to improve the performance of code summarization.

6.3. Dual Learning

For multi-task learning scenarios involving two tasks that are dual to each other, He et al. (He et al., 2016) proposed a framework named dual learning to exploit the duality. According to their motivation, dual learning can be applied in two ways: 1) as a semi-supervised or unsupervised method to deal with the lack of labeled data; 2) as a supervised method to regularize the models. For example, (He et al., 2016) used dual learning for machine translation with unsupervised data, leveraging the duality via mutual translation between two languages, such as translating English to Chinese and Chinese to English. Later, (Wang et al., 2018) combined this mechanism with transfer learning to transfer the knowledge of duality in dual machine translation among three languages. Also for machine translation, (Xia et al., 2017) proposed a supervised variant, called dual supervised learning, resulting in a remarkable improvement on dual tasks involving two languages. Recently, more and more researchers have started to apply this mechanism to other tasks. (Su et al., 2019) applied dual learning directly to natural language understanding (NLU) and natural language generation (NLG), where the input of NLU is a natural language sentence and the input of NLG is a semantic frame; in essence, that work utilizes dual learning to improve the performance of two tasks with different input representations. (Ye et al., 2019) achieved better performance via joint learning and dual learning on semantic parsing and natural language generation.

Therefore, to deal with the problem of different representations of input, and to model the inner connection between dual tasks, we employ dual learning on two related tasks: code summarization and code generation.

6.4. Multi-Task Learning

Multi-task learning (MTL) is widely used in machine learning and natural language processing. MTL aims to improve the learning of a model by leveraging the domain-specific knowledge contained in the training signals of related tasks (Caruana, 1997). In deep neural networks, relevance among tasks is usually exploited in two ways: hard parameter sharing and soft parameter sharing of hidden layers (Ruder, 2017).

Hard parameter sharing MTL was first proposed in (Caruana, 1993); it shares the hidden layers between all tasks while keeping task-specific output layers. Collobert et al. (Collobert and Weston, 2008) described a single convolutional neural network architecture trained jointly on NLP tasks such as part-of-speech tagging, chunking, named entity recognition, and semantic role labeling. Zheng et al. (Zheng et al., 2018) proposed a module in which all tasks share the same sentence representation and each task selects task-specific information from the shared representation with an attention mechanism. In soft parameter sharing MTL, on the other hand, each task has its own model and parameters, and the parameters are encouraged to be similar through regularization. For example, Misra et al. (Misra et al., 2016) connected two separate networks in a soft parameter sharing way; the model uses a cross-stitch unit to determine how the task-specific networks combine the knowledge learned from related tasks.

Here, we use hard parameter sharing as our multi-task learning method. Our approach not only uses hard parameter sharing, but also adds a regularization term of duality, which resembles soft parameter sharing.

7. Conclusion

We have presented an end-to-end model named CO3 for code retrieval and code summarization. CO3 leverages code generation to bridge programming language and natural language better via dual learning and multi-task learning. Though involving three tasks, CO3 has a simple yet effective architecture, which consists of only two LSTM instances. Compared with previous models which process code retrieval and code summarization in an independent or pipeline manner, CO3 can better capture the intrinsic connection between these tasks, so that it not only improves the results of code retrieval over the state of the art, but also balances the performance of the two tasks much better.

In the future, we plan to further explore the interaction among the ranking loss for code retrieval, the maximal likelihood estimation objective of text generation, and the dual regularization.

References

  1. A survey of machine learning for big code and naturalness. ACM Comput. Surv. 51 (4), pp. 81:1–81:37.
  2. A convolutional attention network for extreme summarization of source code. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, pp. 2091–2100.
  3. Bimodal modelling of source code and natural language. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, pp. 2123–2132.
  4. Code2seq: generating sequences from structured representations of code. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019.
  5. METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization@ACL 2005, Ann Arbor, Michigan, USA, June 29, 2005, pp. 65–72.
  6. Multitask learning: a knowledge-based source of inductive bias. Machine Learning.
  7. Multitask learning: a knowledge-based source of inductive bias. Machine Learning Proceedings 10 (1), pp. 41–48.
  8. A neural framework for retrieval and summarization of source code. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, ASE 2018, Montpellier, France, September 3-7, 2018, pp. 826–831.
  9. A unified architecture for natural language processing: deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, pp. 160–167.
  10. Evaluating source code summarization techniques: replication and expansion. In IEEE 21st International Conference on Program Comprehension, ICPC 2013, San Francisco, CA, USA, 20-21 May 2013, pp. 13–22.
  11. Deep code search. In Proceedings of the 40th International Conference on Software Engineering, ICSE 2018, Gothenburg, Sweden, May 27 - June 03, 2018, pp. 933–944.
  12. Automatic query reformulations for text retrieval in software engineering. In 35th International Conference on Software Engineering, ICSE '13, San Francisco, CA, USA, May 18-26, 2013, pp. 842–851.
  13. Dual learning for machine translation. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pp. 820–828.
  14. NL-based query refinement and contextualized code search results: a user study. In 2014 Software Evolution Week - IEEE Conference on Software Maintenance, Reengineering, and Reverse Engineering, CSMR-WCRE 2014, Antwerp, Belgium, February 3-6, 2014, pp. 34–43.
  15. Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
  16. Deep code comment generation. In Proceedings of the 26th Conference on Program Comprehension, ICPC 2018, Gothenburg, Sweden, May 27-28, 2018, pp. 200–210.
  17. Summarizing source code using a neural attention model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016.
  18. Spotting working code examples. In 36th International Conference on Software Engineering, ICSE '14, Hyderabad, India, May 31 - June 07, 2014, pp. 664–675.
  19. Query expansion via WordNet for effective code search. In 22nd IEEE International Conference on Software Analysis, Evolution, and Reengineering, SANER 2015, Montreal, QC, Canada, March 2-6, 2015, pp. 545–549.
  20. Cross-stitch networks for multi-task learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3994–4003.
  21. Automatic generation of natural language summaries for Java classes. In IEEE International Conference on Program Comprehension.
  22. Natural language models for predicting programming comments. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, ACL 2013, 4-9 August 2013, Sofia, Bulgaria, Volume 2: Short Papers, pp. 35–40.
  23. Stack Overflow. https://stackoverflow.com/.
  24. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, July 6-12, 2002, Philadelphia, PA, USA, pp. 311–318.
  25. An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098.
  26. Towards automatically generating summary comments for Java methods. In Proceedings of the IEEE/ACM International Conference on Automated Software Engineering, pp. 43–52.
  27. Dual supervised learning for natural language understanding and generation. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 1: Long Papers, pp. 5472–5477.
  28. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems.
  29. ANNE: improving source code search using entity retrieval approach. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, WSDM 2017, Cambridge, United Kingdom, February 6-10, 2017, pp. 211–220.
  30. The TREC-8 question answering track report. In Proceedings of The Eighth Text REtrieval Conference, TREC 1999, Gaithersburg, Maryland, USA, November 17-19, 1999.
  31. Improving automatic source code summarization via deep reinforcement learning. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering.
  32. Dual transfer learning for neural machine translation with marginal distribution regularization. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pp. 5553–5560.
  33. A theoretical analysis of normalized discounted cumulative gain (NDCG) ranking measures. In Proceedings of the 26th Annual Conference on Learning Theory (COLT 2013).
  34. Dual supervised learning. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pp. 3789–3798.
  35. CoaCor: code annotation for code retrieval with reinforcement learning. In The World Wide Web Conference, WWW 2019, San Francisco, CA, USA, May 13-17, 2019, pp. 2203–2214.
  36. StaQC: a systematically mined question-code dataset from Stack Overflow. In Proceedings of the 2018 World Wide Web Conference on World Wide Web, WWW 2018, Lyon, France, April 23-27, 2018, pp. 1693–1703.
  37. Jointly learning semantic parser and natural language generator via dual information maximization. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 1: Long Papers, pp. 2090–2101.
  38. Same representation, different attentions: shareable sentence representation learning from multiple tasks. arXiv preprint arXiv:1804.08139.