Testing Neural Program Analyzers
Deep neural networks are increasingly used in software engineering and program analysis tasks. They usually take a program and make some prediction about it, e.g., whether it contains a bug. We call these models neural program analyzers. The reliability of neural program analyzers can impact the reliability of the encompassing analyses that build on them.
In this paper, we describe our ongoing efforts to develop effective techniques for testing neural program analyzers. We discuss the challenges involved in developing such tools and our future plans. In a preliminary experiment on a neural model recently proposed in the literature, we found that the model is very brittle: simple perturbations of the input can cause it to make mistakes in its predictions.
I Introduction
The advances of deep neural models in software engineering and program analysis research have received significant attention in recent years. Researchers have already proposed various neural models (e.g., Tree-LSTM, Gemini, GGNN, Code Vectors, code2vec, code2seq, DYPRO [17, 15], LIGER, Import2Vec) to solve different program analysis and software engineering tasks. Although each neural model has been evaluated by its authors, in practice these models may fail on inputs unlike those they were evaluated on. Several testing approaches have therefore been proposed to uncover unexpected corner cases. Recent neural model testing techniques include [13, 19] for models in autonomous systems, [10, 9, 8] for models in QA systems, and [14, 16] for embedding models. However, testing neural models that operate on source code has received little attention from researchers, apart from the exploration initiated by Wang et al.
Evaluating the robustness of neural models that process source code is particularly important because their robustness impacts the correctness of the encompassing analyses that use them. In this paper, we propose a transformation-based testing framework for state-of-the-art neural models of programs. Here, a transformation is a semantic-preserving change that turns a program into a similar one: the transformed program is semantically equivalent to the original but has a different syntactic representation. For example, one can replace a switch statement with an equivalent chain of conditional if-else statements; the resulting program behaves identically to the original. A set of such transformations can be applied to a program to generate many semantically equivalent variants, and these variants can be evaluated on a neural model to test the correctness of that model.
The main motivation for applying transformations is that they may cause the neural model to behave differently and mispredict the input. We are conducting a small study to assess the applicability of transformations to testing neural models. The preliminary results show that the transformations are very effective at exposing incorrect outputs: we observe that semantic-preserving transformations can change the predicted label or the prediction accuracy of neural models compared to the original test programs.
II Motivating Example
We use Figure 1 as a motivating example to highlight the usefulness of our approach. Figure 1 shows two simple, semantically equivalent Java methods that check whether an integer is a prime number. The only difference between them is that the implementation on the left uses a for loop, while the implementation on the right uses a while loop.
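Figure 1 itself is not reproduced here; the following sketch illustrates the same kind of pair in Python rather than the paper's Java: two primality checks that differ only in using a for loop versus a while loop, yet compute identical results on every input.

```python
def is_prime_for(n):
    """Primality check using a for loop (mirrors the left-hand method)."""
    if n < 2:
        return False
    for i in range(2, int(n ** 0.5) + 1):
        if n % i == 0:
            return False
    return True

def is_prime_while(n):
    """Semantically equivalent check using a while loop (right-hand method)."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True
```

A robust neural model should assign the same label to both variants, since they are indistinguishable behaviorally.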
We query the code2vec model with these two equivalent functions. code2vec takes a method body and predicts the method's name. The result of the online demo reveals that the code2vec model successfully predicts the program on the left as an “isPrime” method, but fails on the program on the right: it mistakenly predicts the program on the right as a “skip” method, and the “isPrime” label does not even appear among the top-5 predictions made by the model.
III Proposed Methodology
In this section, we describe our efforts toward testing neural program analyzers. Currently, we are investigating semantic-preserving transformations that can potentially mislead a neural model of programs.
Figure 2 depicts an overview of our approach for testing the neural models. It can broadly be divided into two main steps: (1) Generating synthetic test programs using the semantic transformation of the programs in the original dataset, and (2) Comparing the predictions for the transformed programs with those for the original programs.
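The two steps above can be sketched as a small metamorphic-testing harness. Here `model`, `transform`, and `agree` are placeholder names (not from the paper) for a concrete neural model, a semantic-preserving transformation, and a prediction-similarity check, respectively.

```python
def find_failures(model, programs, transform, agree):
    """Metamorphic testing loop: generate a transformed variant of each
    program (step 1) and flag cases where the model's predictions on the
    original and the variant diverge (step 2)."""
    failures = []
    for program in programs:
        variant = transform(program)  # step 1: synthetic test program
        if not agree(model(program), model(variant)):  # step 2: compare
            failures.append((program, variant))
    return failures
```

Each flagged pair is a candidate robustness bug: the inputs are semantically equivalent, so any divergence in predictions is, by construction, a model error.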
Semantic-Equivalent Program Transformations We have implemented multiple semantic-preserving program transformations to generate synthetic programs: renaming variables, exchanging loops, swapping boolean values, converting switches, and permuting the order of statements. The variable-renaming transformation renames all occurrences of a specific variable in a program to an arbitrary fresh name. The boolean-swapping transformation swaps true with false and vice versa, and also negates the enclosing condition so that the semantics are maintained. Similarly, the loop-exchanging transformation replaces a while loop with a for loop, and vice versa. The switch-converting transformation replaces switch statements with equivalent conditional if-else statements. Finally, the statement-permuting transformation reorders statements whenever doing so does not violate the program's semantics. All of these transformations maintain semantic equivalence while producing syntactically different programs. Thus far, we have not found any single transformation that works substantially better than the others.
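As an illustration, variable renaming can be implemented as an AST rewrite. The sketch below does this for Python source using the standard ast module; the authors' tooling targets Java, so this is an analogy rather than their implementation. ast.unparse requires Python 3.9+.

```python
import ast

class RenameVariable(ast.NodeTransformer):
    """Rename every occurrence of one variable, including parameter names."""
    def __init__(self, old, new):
        self.old, self.new = old, new

    def visit_Name(self, node):
        # Covers variable reads and writes in expressions and assignments.
        if node.id == self.old:
            node.id = self.new
        return node

    def visit_arg(self, node):
        # Covers function parameters, so call semantics are preserved.
        if node.arg == self.old:
            node.arg = self.new
        return node

def rename_variable(source, old, new):
    """Return source with `old` renamed to `new`; behavior is unchanged
    as long as `new` does not collide with an existing name."""
    tree = ast.parse(source)
    tree = RenameVariable(old, new).visit(tree)
    return ast.unparse(tree)
```

The original and renamed programs compute identical results, so any change in a model's prediction between them signals brittleness.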
Test Oracle We evaluate both the original program and the transformed program on the neural model. We mainly look at the predicted label and the prediction accuracy of the model for both programs. The neural model should behave similarly on the original and the transformed program; we define this expectation as a transformation-based metamorphic relation. The main challenge in this phase is to define a measure of similarity between the predictions. We are experimenting with a few ideas here, for example, setting a threshold on the similarity of the predictions.
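One simple instantiation of such a metamorphic relation, assuming each prediction is a (label, confidence) pair (a hypothetical output format, not one fixed by the paper), is:

```python
def predictions_agree(original, transformed, threshold=0.1):
    """Metamorphic oracle: the top label must match and the confidences
    must lie within `threshold` of each other. Both the output format and
    the threshold value are illustrative assumptions."""
    label_a, conf_a = original
    label_b, conf_b = transformed
    return label_a == label_b and abs(conf_a - conf_b) <= threshold
```

Tightening or loosening the threshold trades off sensitivity (catching subtle prediction drift) against false alarms from benign confidence fluctuations.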
Challenges Ahead There are five main challenges that we are aiming to address in this project: (1) what types of transformation should be performed, (2) how to preserve the semantic equivalence during transformations, (3) where to apply those transformations, (4) how to control the transformation strategies, and (5) how to evaluate the transformed programs.
IV Our Plan
Thus far, we have applied five types of transformation. Those transformations are only capable of making basic changes in the syntactic representations of programs. However, our target is to devise more systematic transformations. We are investigating the techniques and heuristics to suggest places in programs to transform, and the types of transformation that are most likely to cause the neural model to mispredict.
Moreover, we have so far evaluated our transformations only on the code2vec model, where the target task is to predict a method's name given its body. We also plan to evaluate the transformations on the GGNN model, where the target task is to predict the correct variable name based on its usage.
V Related Work
Tian et al. proposed DeepTest, a tool for automatically generating real-world test images and testing DNN-driven autonomous cars. They introduced potential image transformations (e.g., blurring, scaling, fog and rain effects) that mimic real-world conditions, and applied transformation-based testing to identify numerous corner cases that may lead to serious consequences, such as a collision of an autonomous car. Another study in this area was conducted by the authors of DeepRoad, who applied realistic but extreme image-to-image transformations (e.g., heavy snow or hard rain) using the DNN-based UNIT method.
Wang et al.  proposed COSET, a framework for standardizing the evaluation of neural program embeddings. They applied transformation-based testing to measure the stability of neural models and identify the root cause of misclassifications. They also implemented and evaluated a new neural model called LIGER  with COSET’s transformations, where they embedded programs with runtime information rather than learning from the source code.
- (2017) Learning to represent programs with graphs. CoRR abs/1711.00740.
- (2018) code2seq: generating sequences from structured representations of code. CoRR abs/1808.01400.
- (2018) code2vec: learning distributed representations of code. CoRR abs/1803.09473.
- code2vec dataset. https://github.com/tech-srl/code2vec#additional-datasets/
- code2vec online demo. https://code2vec.org/
- GGNN dataset. https://aka.ms/iclr18-prog-graphs-dataset/
- (2018) Code vectors: understanding programs through embedded abstracted symbolic traces. CoRR abs/1803.06686.
- (2018) Discrete attacks and submodular optimization with applications to text classification. CoRR abs/1812.00151.
- (2018) Semantically equivalent adversarial rules for debugging NLP models. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 856–865.
- (2018) Does it care what you asked? Understanding importance of verbs in deep learning QA system. CoRR abs/1809.03740.
- (2015) Improved semantic representations from tree-structured long short-term memory networks. CoRR abs/1503.00075.
- (2019) Import2Vec: learning embeddings for software libraries. In Proceedings of the 16th International Conference on Mining Software Repositories, MSR ’19, Piscataway, NJ, USA, pp. 18–28.
- (2018) DeepTest: automated testing of deep-neural-network-driven autonomous cars. In Proceedings of the 40th International Conference on Software Engineering, ICSE ’18, New York, NY, USA, pp. 303–314.
- (2019) COSET: a benchmark for evaluating neural program embeddings. CoRR abs/1905.11445.
- (2017) Dynamic neural program embedding for program repair. CoRR abs/1711.07163.
- (2019) Learning blended, precise semantic program embeddings. arXiv abs/1907.02136.
- (2019) Learning scalable and precise representation of program semantics. CoRR abs/1905.05251.
- (2017) Neural network-based graph embedding for cross-platform binary code similarity detection. CoRR abs/1708.06525.
- (2018) DeepRoad: GAN-based metamorphic testing and input validation framework for autonomous driving systems. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, ASE 2018, New York, NY, USA, pp. 132–142.