Learning Fitness Functions for Genetic Algorithms



The problem of automatic software generation is known as Machine Programming. In this work, we propose a framework based on genetic algorithms to solve this problem. Although genetic algorithms have been used successfully for many problems, one criticism is that hand-crafting a fitness function, the test that aims to effectively guide the evolution, can be notably challenging. Our framework presents a novel approach to learn the fitness function, using neural networks to predict the values of an ideal fitness function. We also augment the evolutionary process with a minimally intrusive search heuristic. This heuristic improves the framework’s ability to discover correct programs from ones that are approximately correct, and does so with negligible computational overhead. We compare our approach with two state-of-the-art program synthesis methods and demonstrate that it finds more correct programs with fewer candidate program generations.


1 Introduction

In recent years, there has been notable progress in the space of automatic software generation, also known as Machine Programming (MP) Gottschlich et al. (2018); Ratner et al. (2019). MP can be achieved in many ways. One way is by using formal program synthesis, a technique that uses formal methods and rules to generate programs Manna and Waldinger (1975). Formal program synthesis usually guarantees some program properties by evaluating a generated program’s semantics against a corresponding specification Gulwani et al. (2012); Alur et al. (2015). Although useful, such formal synthesis techniques can often be limited by exponentially increasing computational overhead that grows with the program’s instruction size Heule et al. (2016); Bodík and Jobstmann (2013); Solar-Lezama et al. (2006); Loncaric et al. (2018); Cheung et al. (2012).

An alternative to formal methods for MP is to use machine learning (ML). Machine learning differs from traditional formal program synthesis in that it generally does not provide correctness guarantees. Instead, ML-driven MP approaches are usually only probabilistically correct, i.e., their results are derived from sample data relying on statistical significance Murphy (2012). Such ML approaches tend to explore software program generation using an objective function. Objective functions are used to guide an ML system’s exploration of a problem space to find a solution.

More recently, there has been a surge of research exploring ML-based MP using neural networks (NNs). For example, in Balog et al. (2017), the authors train a neural network with input-output examples to predict the probabilities of the functions that are most likely to be used in a program. Raychev et al. (2014) take a different approach and use an n-gram model to predict the functions that are most likely to complete a partially constructed program. Bunel et al. (2018) explore a unique approach that combines reinforcement learning (RL) with a supervised model to find semantically correct programs. These are only a few of the works in the MP space using neural networks Reed and de Freitas (2016); Cai et al. (2017).

As demonstrated by Becker and Gottschlich (2017), genetic algorithms also show promise for MP. Real et al. (2018) subsequently demonstrated that genetic algorithms can generate accurate image classifiers. Their approach produced state-of-the-art classifiers for the CIFAR-10 Krizhevsky (2009) and ImageNet Deng et al. (2009) datasets. Moreover, genetic algorithms have been exploited to successfully automate the neural architecture optimization process Salimans et al. (2017); Such et al. (2017); Liu et al. (2017). Even with this notable progress, genetic algorithms can be challenging to use due to the complexity of hand-crafting the fitness functions that guide the search.

While genetic algorithms have had demonstrable success in their practical application Korns (2011); Such et al. (2017); Real et al. (2018), an open challenge is in creating simple, yet effective fitness functions. Contrary to this goal, fitness function complexity tends to increase proportionally with the problem being solved. In this paper, we explore an approach to automatically generate these fitness functions by representing their structure with a neural network, making the following technical contributions:

  • Fitness Function: Our fundamental contribution is in the automation of fitness functions for genetic algorithms. To the best of our knowledge, our work is the first of its kind to use a neural network as a genetic algorithm’s fitness function for software generation.

  • Convergence: A secondary contribution is in our utilization of local neighborhood search to improve the convergence of approximately correct candidate solutions. We demonstrate its efficacy empirically.

  • Generality: We demonstrate that our approach can support different neural network fitness functions, uniformly. While our experiments are strictly in the space of MP, we are not aware of any restrictions from utilizing our fitness function automation technique in other domains.

  • Metric: We contribute a new metric suitable for the MP domain. While prior work Balog et al. (2017); Zohar and Wolf (2018a) aims at optimizing program generation time, we argue that program generation time does not fully capture the efficiency of the generation algorithm; instead, it captures the implementation efficiency of the algorithm. Therefore, we propose to use “search space” size, i.e., how many candidate programs have been searched, as an alternative metric and demonstrate its value for comparing MP approaches.

2 Background

Let E = {(x_i, y_i)}, i = 1, ..., N, be a set of input-output pairs, such that each output y_i is obtained by executing a program P on the input x_i. Inherently, the set of input-output examples describes the behavior of the program P. One would like to synthesize a program P' that recovers the same functionality of P. However, P is usually unknown, and we are left with the set E, which was obtained by running P. Based on this assumption, we define equivalency between two programs as follows:

Definition 2.1 (Program Equivalency).

Programs P and P' are equivalent under the set of input-output examples E = {(x_i, y_i)} if and only if P(x_i) = P'(x_i) for i = 1, ..., N. We denote the equivalency by P ≡_E P'.

Definition 2.1 suggests that to obtain a program equivalent to P, we need to synthesize a program P' that is consistent with the set E. Therefore, our goal is to find a program P' that is equivalent to the target program P (which was used to generate E), i.e., P ≡_E P'. This task is known as the Inductive Program Synthesis (IPS) problem.
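To make the definition concrete, here is a minimal Python sketch of the equivalence check, with a toy interpreter standing in for the DSL's (the `run` helper and the lambda-based programs are illustrative, not NetSyn's actual implementation):

```python
def run(program, x):
    """Toy interpreter: a program is a list of functions, each consuming
    the previous function's output (stand-in for the DSL runtime)."""
    out = x
    for fn in program:
        out = fn(out)
    return out

def equivalent(p1, p2, examples):
    """Definition 2.1: P =_E P' iff the programs agree on every input in E."""
    return all(run(p1, x) == run(p2, x) for (x, y) in examples)

# Two syntactically different but equivalent programs under a tiny E.
p = [lambda l: [v for v in l if v > 0], lambda l: sorted(l)]
q = [lambda l: sorted(l), lambda l: [v for v in l if v > 0]]
E = [([3, -1, 2], [2, 3])]
assert equivalent(p, q, E)
```

Note that equivalence is only established with respect to E; two programs agreeing on all examples may still differ on unseen inputs.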

As suggested by Balog et al. (2017), a machine learning based solution to the IPS problem requires the definition of several components. First, we need a programming language that defines the domain of valid programs. Consequently, this language enforces a definition of the domain for the input and output samples. Second, we need a method to search over the program domain. The search method sweeps over the program domain to find a P' that satisfies the equivalency property. Optionally, we may want to define a ranking function to rank all the solutions found and choose the best ones. Last, as we plan to base our solution on machine learning techniques, we will need data, e.g., programs with input-output examples, to train models.

3 NetSyn

Here, we describe our solution to the IPS problem in more detail, including the choices and novelties for each of the proposed components. We name our solution NetSyn as it is based on neural networks for program synthesis.

Figure 1: Overview of NetSyn

3.1 Domain Specific Language

As NetSyn’s programming language, we choose a domain specific language (DSL) constructed specifically for it. This choice allows us to constrain the program space by restricting the operations used by our solution.

NetSyn’s DSL follows DeepCoder’s DSL Balog et al. (2017), which was inspired by SQL and LINQ Dinesh et al. (2007). The only data types in the language are (i) integers and (ii) lists of integers. The DSL contains 41 functions, each taking one or two arguments and returning one output. Many of these functions include operations for list manipulation. Likewise, some operations also require lambda functions. There is no explicit control flow (conditionals or looping) in the DSL. However, several of the operations are high-level functions and are implemented using such control flow structures. A full description of the DSL can be found in the supplementary material. With these data types and operations, we define a program as a sequence of functions. Table 1 presents an example of a program of 4 instructions with an input and the respective output.

Arguments to functions are not specified via named variables. Instead, each function uses the output of the previously executed function that produces the type of output required as its input. The first function of each program uses the provided input. If there is a type mismatch, default values are used (i.e., 0 for integers and an empty list for a list of integers). The final output of a program is the output of its last function.

[int]          Input:  [-2, 10, 3, -4, 5, 2]
Filter (>0)
Map (*2)
Sort
Reverse        Output: [20, 10, 6, 4]

Table 1: A Program of length 4 with input and output examples

As a whole, NetSyn’s DSL is novel and amenable to genetic algorithms. The language is defined such that all possible programs are valid by construction. This makes the whole program space valid and is important to facilitate the search of programs by any learning method. In particular, this is very useful in evolutionary processes in genetic algorithms. When genetic crossover occurs between two programs or mutation occurs within a single program, the resulting program will always be valid. This eliminates the need for pruning to identify valid programs.

3.2 Search Process

NetSyn synthesizes a program by searching the program space with a genetic algorithm-based method Thomas (2009). It does this by creating a population of random genes (i.e., candidate programs) of a given length and using a learned neural network-based fitness function to estimate the fitness of each gene. Higher graded genes are preferentially selected for crossover and mutation to produce the next generation of genes. In general, NetSyn uses this process to evolve the genes from one generation to the next until it discovers a correct candidate program, as verified by the input-output examples. From time to time, NetSyn takes the top scoring genes from the population, determines their neighborhoods, and looks for the target program using a local proximity search. If a correct program is not found within the neighborhoods, the evolutionary process resumes. Figure 1 summarizes NetSyn’s search process.

We use a value encoding approach for each gene. A gene is represented as a sequence of values from F, the set of DSL functions. Formally, a gene g = (g_1, g_2, ..., g_T), where g_i ∈ F. Practically, each g_i contains an identifier (or index) corresponding to one of the DSL functions. The encoding scheme is a one-to-one match between programs and genes.

The search process begins with a set of randomly generated programs. If a program equivalent to the target program is found, the search process stops. Otherwise, the genes are ranked using the learned fitness function. A small percentage (e.g., 20%) of the top graded genes are passed unmodified to the next generation for the next evolutionary phase. This guarantees that some of the top graded genes are preserved identically, aiding forward progress. The remaining genes of the new generation are created through crossover or mutation with some probability. For crossover, two genes are selected using the Roulette Wheel algorithm, with the crossover point selected randomly Goldberg (1989). For mutation, one gene is Roulette Wheel selected, and the mutation point i in that gene is selected randomly. The selected value g_i is mutated to some other random value g'_i such that g'_i ∈ F and g'_i ≠ g_i. Sometimes, crossovers and mutations can lead to a new gene with dead code (see Section B.4). If dead code is present, we repeat crossover and mutation until a gene without dead code is produced.
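The evolutionary step described above can be sketched as follows. The helper names, the 20% elite fraction, and the crossover probability are illustrative assumptions, and the learned NN-FF is abstracted as a plain scoring callable (the sketch also omits the dead-code check):

```python
import random

def roulette_select(pop, scores):
    """Fitness-proportionate (Roulette Wheel) selection."""
    total = sum(scores)
    r = random.uniform(0, total)
    acc = 0.0
    for gene, s in zip(pop, scores):
        acc += s
        if acc >= r:
            return gene
    return pop[-1]

def evolve(pop, fitness, n_funcs, elite_frac=0.2, p_crossover=0.5):
    """One generation of NetSyn-style evolution (sketch). Genes are lists
    of function indices into the DSL's function set F; `fitness` stands in
    for the learned NN-FF."""
    scores = [fitness(g) for g in pop]
    ranked = [g for _, g in sorted(zip(scores, pop), key=lambda t: -t[0])]
    n_elite = int(len(pop) * elite_frac)
    next_gen = ranked[:n_elite]                 # elitism: top genes pass unchanged
    while len(next_gen) < len(pop):
        if random.random() < p_crossover:       # crossover of two parents
            a, b = roulette_select(pop, scores), roulette_select(pop, scores)
            cut = random.randrange(1, len(a))
            child = a[:cut] + b[cut:]
        else:                                   # point mutation: g_i -> g'_i != g_i
            child = list(roulette_select(pop, scores))
            i = random.randrange(len(child))
            child[i] = random.choice([f for f in range(n_funcs) if f != child[i]])
        next_gen.append(child)
    return next_gen
```

One such call corresponds to one generation; NetSyn iterates this until an equivalent program is found or the search budget is exhausted.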

Learning the Fitness Function

Evolving the population of genes in a genetic algorithm requires a fitness function to rank the fitness (quality) of genes with respect to the problem being solved. Ideally, a fitness function should measure how close a gene is to the solution. Namely, it should measure how close a candidate program is to an equivalent of the target program P under E. Finding a good fitness function is of great importance for reducing the size of the search domain and for making the genetic algorithm more likely to find P' in less time.

Oftentimes, a fitness function is handcrafted to approximate some ideal function that is impossible (due to incomplete knowledge about the solution) or too computationally intensive to implement in practice. For example, if we knew P beforehand, we could have designed an ideal fitness function that compares a candidate program P' with P and calculates some metric of closeness (e.g., edit distance, the number of common functions, etc.) as the fitness score. Since we do not know P, we cannot implement the ideal fitness function. Instead, in this work, we propose to approximate the ideal fitness function by learning it from training data (generated from a number of known programs). For this purpose, we use a neural network model, trained with the goal of predicting the values of an ideal fitness function. We call such an ideal fitness function (one that would always give the correct answer with respect to the actual solution) the oracle fitness function, as it is impossible to achieve in practice merely by examining input-output examples. Our models will not approach the 100% accuracy of the oracle, but they will still have sufficiently high accuracy to allow the genetic algorithm to make forward progress. Also, we note that the trained model needs to generalize to predict for any unseen solution, not a single specific target case.

We follow ideas from works that have explored the automation of fitness functions using neural networks for approximating a known mathematical model. For example, Matos Dias et al. \yrciteDias:2014:cejor automated them for IMRT beam angle optimization, while Khuntia et al. \yrciteBonomali:2005:motl used them for rectangular microstrip antenna design automation. In contrast, our work is fundamentally different in that we use a large corpus of program metadata to train our models to predict how close a given, incorrect solution could be from an unknown correct solution (that will generate the correct output).

Given the input-output samples E of the target program P and an ideal fitness function f, we would like a model that predicts the fitness value f(g) for a gene g. In practice, our model needs to predict the values of f from the input-output samples in E and/or samples made with the output of g: E_g = {(x_i, ŷ_i)}, with ŷ_i = g(x_i).

In NetSyn, we use a neural network to model the fitness function, referred to as NN-FF. This requires generating a training dataset of programs with respective input-output samples. To train the NN-FF, we randomly generate a set of example programs, along with a set of random inputs per program. We then execute each program with its corresponding input set to calculate the output set. Additionally, for each program P in the set, we randomly generate another program P'. We apply the previously generated inputs to P' to produce the corresponding outputs. We then compare the programs P and P' to calculate the fitness value and use it as an example to train the neural network.
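A minimal sketch of this training-data generation loop, with the DSL interpreter, input generator, and ideal fitness metric abstracted as callables (all names here are hypothetical):

```python
import random

def make_training_set(n_programs, n_inputs, prog_len, n_funcs,
                      gen_input, run, fitness_metric):
    """Generate (features, label) pairs for training the NN-FF (sketch).

    For each random program P we draw a second random program P', run both
    on the same inputs, and label the pair with the ideal fitness metric
    computed from P and P' directly -- possible at training time because
    both programs are known."""
    data = []
    for _ in range(n_programs):
        p  = [random.randrange(n_funcs) for _ in range(prog_len)]   # P
        pp = [random.randrange(n_funcs) for _ in range(prog_len)]   # P'
        label = fitness_metric(p, pp)          # e.g., common functions
        for _ in range(n_inputs):
            x = gen_input()
            data.append(((x, run(p, x), run(pp, x)), label))
    return data
```

At inference time the label is unknown, so the trained network predicts it from the input-output behavior alone.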

In NetSyn, the inputs for the neural network can be chosen from one of three options: (i) only the candidate’s examples E_g, (ii) E and E_g together, and (iii) the elementwise differences between the inputs and outputs within E and within E_g. During training, we set E to be that of an example program P and E_g to be that of the corresponding program P'. In this way, the network is trained on a variety of program correctness levels. At inference time, each generated candidate program plays the role of P' and the target program takes the place of P; therefore, the first model takes its inputs from E_g during search. The cases for the second and third models are similar. Details of how we trained the neural network can be found in the supplementary material. For inference with many input-output examples for a gene g, all of them are fed into the neural network, and the NN-FF predictions are averaged to obtain the final prediction for g’s fitness.

To illustrate, suppose the program in Table 1 is the target P. Let us assume that P' is another program {[int], Filter (>0), Map (*2), Reverse, Drop (2)}. If we use the input in Table 1 (i.e., [-2, 10, 3, -4, 5, 2]) with P', the output is [6, 20]. For the first model, the input for the NN-FF is {[-2, 10, 3, -4, 5, 2], [6, 20]}. For the second and third models, the inputs are {[-2, 10, 3, -4, 5, 2], [20, 10, 6, 4], [-2, 10, 3, -4, 5, 2], [6, 20]} and {[-22, 0, -3, -8, 5, 2], [-8, -10, 3, -4, 5, 2]}, respectively.
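The three input encodings can be sketched as below. The model names are placeholders of ours, but the difference computation reproduces the paper's example, with the shorter list padded with zeros:

```python
def elem_diff(xs, ys):
    """Elementwise xs - ys, padding the shorter list with zeros."""
    n = max(len(xs), len(ys))
    xs = xs + [0] * (n - len(xs))
    ys = ys + [0] * (n - len(ys))
    return [a - b for a, b in zip(xs, ys)]

def nn_ff_inputs(model, x, y, y_hat):
    """Build NN-FF inputs for one example under the three options (sketch):
    the candidate's example only, target and candidate together, or the
    elementwise input/output differences of each."""
    if model == "candidate_only":   # option (i)
        return [x, y_hat]
    if model == "both":             # option (ii)
        return [x, y, x, y_hat]
    if model == "diff":             # option (iii)
        return [elem_diff(x, y), elem_diff(x, y_hat)]
    raise ValueError(model)

# The paper's example: target output y, candidate output y_hat.
x, y, y_hat = [-2, 10, 3, -4, 5, 2], [20, 10, 6, 4], [6, 20]
assert nn_ff_inputs("diff", x, y, y_hat) == \
    [[-22, 0, -3, -8, 5, 2], [-8, -10, 3, -4, 5, 2]]
```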

There are different ways to quantify how close two programs are to one another. Each of these different methods then has an associated metric and ideal fitness value. We investigated three such metrics – common functions, longest common subsequence, and function probability – which we use as the expected predicted output for the NN-FF.

Common Functions NetSyn can use the number of common functions (CF) between g and the target program P as a fitness value for g. In other words, the fitness value of g is

f_CF(g) = |functions(g) ∩ functions(P)|,

the size of the intersection of the functions appearing in the two programs. Because the output of the neural network will be an integer from 0 to the gene length T, the neural network can be designed as a multiclass classifier with a softmax layer as the final layer.

Longest Common Subsequence As an alternative to CF, we can use the longest common subsequence (LCS) between g and P. The fitness score of g is

f_LCS(g) = |LCS(g, P)|,

the length of the longest (not necessarily contiguous) sequence of functions appearing in order in both programs. Similar to CF, training data can be constructed from randomly generated program pairs and fed into a neural network-based multiclass classifier.
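Both metrics are straightforward to compute when the two programs are known, as during training-data generation. A sketch follows; counting common functions with multiplicity is our assumption:

```python
from collections import Counter

def common_functions(g, p):
    """CF metric: number of common functions between candidate g and
    target P, counted with multiplicity (an assumption of this sketch)."""
    return sum((Counter(g) & Counter(p)).values())

def lcs_length(g, p):
    """LCS metric: length of the longest common subsequence of the two
    function sequences, via standard dynamic programming."""
    dp = [[0] * (len(p) + 1) for _ in range(len(g) + 1)]
    for i, a in enumerate(g):
        for j, b in enumerate(p):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if a == b
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[len(g)][len(p)]

assert common_functions([1, 2, 2, 3], [2, 3, 4, 2]) == 3   # {2, 2, 3}
assert lcs_length([1, 2, 3, 4], [2, 4, 3, 4]) == 3          # 2, 3, 4
```

LCS is the stricter of the two: it rewards functions that appear in the correct relative order, not merely anywhere in the gene.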

Function Probability The work of Balog et al. (2017) proposed a probability map for the functions in the DSL. The probability map is defined as the probability of each DSL operation appearing in P given the input-output samples. Namely, p = (p_1, ..., p_|F|) such that p_j = Prob(c_j ∈ P | E), where c_j is the j-th operation in the DSL. Then, a multiclass, multilabel neural network classifier with sigmoid activation functions at the output of the last layer can be used to predict the probability map. Training data for the neural network can be constructed from the generated example programs. We can use the probability map to calculate the fitness score of g as

f_FP(g) = Σ_{i=1}^{T} p_{g_i}.
NetSyn can also use the probability map to guide the mutation process. For example, instead of selecting the replacement value g'_i randomly, NetSyn can select g'_i using the Roulette Wheel algorithm over the probability map.
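A sketch of probability-guided mutation, assuming that Roulette Wheel selection over the predicted map simply means sampling the replacement function in proportion to its predicted probability:

```python
import random

def fp_guided_mutation(gene, prob_map):
    """Mutate one position, choosing the replacement function by Roulette
    Wheel over the predicted probability map instead of uniformly (sketch).
    prob_map[j] is the predicted probability that DSL function j is in P."""
    i = random.randrange(len(gene))
    # Exclude the current function, then sample proportionally to prob_map.
    candidates = [j for j in range(len(prob_map)) if j != gene[i]]
    weights = [prob_map[j] for j in candidates]
    gene = list(gene)
    gene[i] = random.choices(candidates, weights=weights, k=1)[0]
    return gene
```

Compared with uniform mutation, this biases the search toward functions the network believes are present in the target program.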

Local Neighborhood Search

Figure 2: Example of the neighborhood of a gene using (a) the BFS-based and (b) the DFS-based approach.

Neighborhood search (NS) checks some candidate genes in the neighborhood of the top scoring genes from the genetic algorithm. The intuition behind NS is that if the target program is in that neighborhood, NetSyn may be able to find it without relying on the genetic algorithm, which would likely result in a faster time to synthesize the correct program.

Neighborhood Search Invocation Let us assume that NetSyn has completed G generations. Let f̄_1 denote the average fitness score of the genes over the last W generations (i.e., from G−W+1 to G) and f̄_0 the average fitness score over the W generations before that (i.e., from G−2W+1 to G−W). Here, W is a sliding window. NetSyn invokes NS if f̄_1 ≤ f̄_0. The rationale is that under these conditions, the search procedure has not produced improved genes over the last W generations (i.e., it is saturating). Therefore, it should check whether the neighborhood contains any program equivalent to P.
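The saturation test can be sketched as follows, with the per-generation average fitness kept in a history list (variable names are ours):

```python
def should_invoke_ns(history, window):
    """Invoke neighborhood search when the average fitness of the last
    `window` generations fails to improve on the `window` generations
    before it (sketch of the saturation test)."""
    if len(history) < 2 * window:
        return False
    recent = sum(history[-window:]) / window
    prior = sum(history[-2 * window:-window]) / window
    return recent <= prior

# Fitness still improving: no NS. Fitness flat: invoke NS.
assert should_invoke_ns([1, 2, 3, 4], 2) is False
assert should_invoke_ns([4, 4, 4, 4], 2) is True
```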

Neighborhood Definition Algorithm 1 shows how to define and search a neighborhood. The algorithm is inspired by the breadth-first search (BFS) method. For each top scoring gene g, NetSyn considers one operation at a time, starting from the first operation of the gene and moving to the last. Each selected operation is replaced with every other operation from F, and the resulting genes are inserted into the neighborhood set N. If a program equivalent to P is found in N, NetSyn stops there and returns the solution. Otherwise, it continues the search and eventually returns to the genetic algorithm. The complexity of the search is O(T·|F|) candidates per top scoring gene, which is significantly smaller than the exponential search space used by a traditional BFS algorithm.

Input: A set B of top scoring genes

Output: P', if found, or Not found otherwise

1  for each gene g ∈ B do
2      for i = 1 to T do
3          for each f ∈ F \ {g_i} do
4              insert g with g_i replaced by f into N
5      if there is g' ∈ N such that g' ≡_E P then
6          return g'
7  return Not found
Algorithm 1: Defines and searches a neighborhood based on the BFS principle

Similar to BFS, NetSyn can define and search the neighborhood using an approach akin to depth-first search (DFS). It is similar to Algorithm 1 except that depth is tracked. After the inner replacement loop finishes, NetSyn picks the best scoring gene from N to replace g before going to the next level of depth. The algorithmic complexity remains the same. Figure 2 shows examples of neighborhoods using the BFS- and DFS-based approaches.
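A sketch of the BFS-style neighborhood search of Algorithm 1, with the input-output equivalence check abstracted as a predicate (names are ours):

```python
def bfs_neighborhood_search(top_genes, n_funcs, is_solution):
    """Algorithm 1 sketch: for each top gene, replace every position with
    every other function; return the first replacement that passes the
    input-output check. Generates O(T * |F|) candidates per gene."""
    for gene in top_genes:
        for i in range(len(gene)):
            for f in range(n_funcs):
                if f == gene[i]:
                    continue
                candidate = gene[:i] + [f] + gene[i + 1:]
                if is_solution(candidate):
                    return candidate
    return None
```

For example, a gene that differs from the target in exactly one position is always found by this search; a gene differing in two or more positions is not, and the genetic algorithm resumes.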

4 Experimental Results

We implemented NetSyn in C++ with a TensorFlow backend Abadi et al. (2015). We also developed an interpreter for NetSyn’s DSL to evaluate the generated programs. We used 50,000 randomly generated unique example programs of length 4 to train the neural networks. We used 100 input-output examples for each program to generate the training data. For every approach, the same programs are used for training. Code and data can be found at http://www.anonymous.com/anonymous.

NetSyn’s training time ranges between 3 and 8 hours on an NVidia Tesla P100. For our NN-FF, we chose an NN topology with three hidden layers and a softmax activation function in the output layer. Its input layer performs normalization. The hidden layers have 48, 24, and 12 neurons, respectively. These decisions were based on experimental evaluation of the NN-FF’s accuracy. The model uses a softmax activation function in the output layer because the CF and LCS fitness scores are modeled as a classification problem. The normalization layer was chosen for its ability to reduce the impact of the absolute values of the inputs. The function probability map was chosen because of its proposal in DeepCoder; as such, we used the same NN configuration to predict it. We used fixed-length inputs (padded, if necessary) for the neural networks.

To test NetSyn, we randomly generated a total of 100 programs for each program length from 5 to 10. For each program length, 50 of the generated programs produce a singleton integer as the output; the rest produce a list of integers. We therefore refer to the first 50 programs as singleton programs and the rest as list programs. We collected input-output examples for each testing program to populate E. We did not consider testing programs of length 1 to 3 because the search space is too small to warrant a sophisticated synthesis technique. When synthesizing a program using NetSyn, we execute it 50 times and average the results to reduce noise.

Figure 3: NetSyn’s synthesis ability with respect to different fitness functions and other state-of-the-art schemes.

Our experiments aim to (i) demonstrate NetSyn’s synthesis ability and compare its performance against two state-of-the-art approaches, DeepCoder and PCCoder, and to (ii) characterize the effectiveness of different design choices used.

4.1 Demonstration of Synthesis Ability

We ran three variants of NetSyn, one for each learned fitness function: CF, LCS, and FP. Additionally, we ran the publicly available implementations of DeepCoder and PCCoder from their respective GitHub repositories. For DeepCoder, we used the best performing implementation, based on the “Sort and Add” enumerative search algorithm Balog et al. (2017). For comparison, we also tested three other fitness functions: 1) a constant function (Const), presumably the worst case, in which the GA finds programs randomly, 2) edit distance between outputs (ED), and 3) the oracle (Oracle). Unless otherwise mentioned, we use NetSyn to denote the CF- and LCS-based variants. For all approaches, we set the maximum search space size to 3,000,000 candidate programs. If an approach does not find the solution before reaching that threshold, we conclude the experiment and record it as “solution not found.” The summarized results are shown in Figure 3. The detailed results are in Tables 3 and 4 in the appendix.

Figures 3(a)-(c) show comparative results using synthesis time as the metric. For each approach, we sort the times taken to synthesize the programs. A position N on the X-axis corresponds to the program synthesized in the N-th percentile of synthesis time. Lines terminate at the point at which the approach fails to synthesize the corresponding program. In general, DeepCoder, PCCoder, and NetSyn can synthesize up to 30% of programs within a few seconds for all program lengths we tested. As expected, synthesis time increases as an approach attempts to synthesize more difficult programs. DeepCoder and PCCoder usually find solutions faster than NetSyn, and synthesis time tends to increase for longer programs. However, when the search space is constrained to some maximum, NetSyn tends to synthesize more programs. Among the fitness functions, CF and LCS have comparable synthesis percentages and times, whereas FP usually performs worse. NetSyn synthesizes programs at percentages ranging from 50% (its worst, for length-10 programs) to as high as 98% (for length-5 programs). Moreover, the Const- and ED-based approaches synthesize a lower percentage of programs than CF and LCS. On the other hand, the oracle (which is impossible to implement in practice) always synthesizes all programs within a second. In summary, for any program length, NetSyn synthesizes more programs than either DeepCoder or PCCoder, although it takes more time to do so.

Figures 3(d)-(f) show comparative results using our proposed metric: search space used. For each test program, we count the number of candidate programs searched before the experiment has concluded, by either finding a correct program or exceeding the threshold. The number of candidate programs searched is expressed as a percentage of the maximum search space threshold, i.e., 3,000,000. For all approaches, up to 30% of the programs can be synthesized by searching less than 2% of the maximum search space. Search space use increases when an approach tries to synthesize more programs. In general, DeepCoder and PCCoder search more candidate programs than CF- or LCS-based NetSyn. For example, when synthesizing programs of length 5, DeepCoder and PCCoder use 37% and 33% of the search space to synthesize 40% and 50% of programs, respectively. In comparison, NetSyn can synthesize upwards of 90% of programs using only 30% of the search space. In other words, NetSyn is more efficient at generating and searching likely target programs. Even for length-10 programs, NetSyn can generate 70% of the programs using only 24% of the maximum search space. In contrast, DeepCoder and PCCoder cannot synthesize more than 50% and 60% of the programs even using the maximum search space. The Const- and ED-based approaches always use more search space than CF or LCS. In summary, NetSyn’s synthesis technique is more efficient than both DeepCoder and PCCoder in how it generates and searches candidate programs to find a solution.

4.2 Characterization of NetSyn

Next, we characterize the effect of different fitness functions, neighborhood search algorithms, and DSL function types on the synthesis process. To explain the details of different choices, we show the results in this section based on programs of length 4. However, our general observations hold for longer length programs.

To synthesize a particular program, we ran NetSyn 50 times. Thus, for the 100 testing programs of a particular length, we ran a total of 5,000 experiments. Figure 4(a) shows the percentage of those experiments in which the target program was synthesized. The results are partitioned by singleton and list types. NetSyn synthesized at the highest percentage with the CF and LCS fitness functions (85%), whether NS is used or not. Moreover, BFS-based NS tends to produce more equivalent programs. In comparison, FP caused NetSyn to synthesize programs at a lower percentage (46%), performing roughly an order of magnitude worse for singleton programs. This is caused by the difficulty of predicting the probabilities of all functions when the output is a single integer, which contains less information than a list-type output. For list programs, the synthesis percentage of each approach is comparable. For longer programs, the synthesis percentage decreases, but the synthesis ratio between singleton and list programs is similar.

Figure 4(b) shows the effect of the three input options for the neural network. The FP fitness function works only with the target’s input-output examples and hence is not shown here. In general, the model that uses only the candidate’s examples (E_g) is the most effective. This is counter-intuitive because the other models contain more information. However, our analysis found that although the other models had lower synthesis percentages, they expedited the synthesis time. In other words, the other models tend to specialize to particular types of programs.

Fitness Function | List | Singleton
CF               | 50   | 49
LCS              | 50   | 49
FP               | 50   | 13
Table 2: Number of unique programs synthesized for each learned fitness function, by output type.

Figure 4: (a) & (b) show the effect of NS as well as different input models. (c) - (d) show the synthesis percentage details in 3D scatter plot.

Figure 5: Synthesis percentage across different functions

Table 2 shows how many unique programs the different approaches were able to synthesize. Both the CF and LCS fitness functions enabled NetSyn to synthesize 99 of the 100 unique programs. The one program that NetSyn was not able to synthesize (#37) contains the DELETE and ACCESS functions. DELETE deletes all occurrences of a number in a list, whereas ACCESS returns the k-th element of a list. Both functions are difficult to predict because their input-output behavior depends on both arguments. Program #11 also contains these functions; NetSyn was able to synthesize it only 15 out of 50 times (the second lowest synthesis percentage after #37). The FP-based fitness function was able to synthesize 63 of the 100 unique programs. All three approaches correctly synthesized every list program, which implies that singleton programs are harder to synthesize.

Figures 4(c)-(e) show the synthesis percentage for different programs and fitness functions. Programs 1 to 50 are singleton programs and have lower synthesis percentages under all three fitness function choices. In particular, the FP-based approach has a low synthesis percentage for singleton programs. Functions 1 to 12 produce a singleton integer and tend to cause a lower synthesis percentage for any program that contains them. To shed more light on this issue, Figure 5 shows the synthesis percentage across different functions. The synthesis percentage for every function is at least 40% for the CF- and LCS-based approaches, whereas for the FP-based approach, four functions cannot be synthesized at all. The functions corresponding to each number are listed in the supplementary material.

Like DeepCoder and PCCoder Zohar and Wolf (2018b), NetSyn assumes a priori knowledge of the target program length and maintains all genes at that target length. However, we experimented with generating the initial genes with gene lengths following a normal distribution and also allowing crossover and mutation to change gene length. We found that this increased the time to solution and reduced the synthesis percentage. This effect was particularly pronounced if the target program length was two or more functions larger or smaller than the mean initial gene length. In future work we will explore how to predict the program length and remove the need for this a priori knowledge.

5 Conclusion

In this paper, we presented NetSyn, a genetic algorithm-based framework for program synthesis. To the best of our knowledge, it is the first work to use a neural network to automatically generate an evolutionary algorithm’s fitness function in the context of program synthesis. We proposed two neural network fitness functions and contrasted them with a fitness function based on Balog et al. (2017). NetSyn is also novel in its use of neighborhood search to expedite the convergence of the evolutionary algorithm. We compared our approach against two state-of-the-art program synthesis systems, DeepCoder and PCCoder, and showed that NetSyn synthesizes equivalent programs at a higher rate, especially for singleton programs.

Appendix A: NetSyn’s DSL

In this appendix, we provide more details about the list DSL that NetSyn uses to generate programs. Our list DSL has only two implicit data types, integer and list of integer. A program in this DSL is a sequence of statements, each of which is a call to one of the 41 functions defined in the DSL. There are no explicit variables, conditionals, or explicit control flow operations in the DSL, although many of the functions are high-level and contain implicit conditionals and control flow within them. Each of the 41 functions takes one or two arguments, each of integer or list of integer type, and returns exactly one output, also of integer or list of integer type. Given these rules, there are 10 possible function signatures; however, only 5 of these signatures occur among the functions we chose for the DSL. The following sections are organized by function signature; each describes all the DSL functions having that signature.

Instead of named variables, each time a function call requires an argument of a particular type, our DSL’s runtime searches backwards and finds the most recently executed function that returns an output of the required type and then uses that output as the current function’s input. Thus, for the first statement in the program, there will be no previous function’s output from which to draw the arguments for the first function. When there is no previous output of the correct type, then our DSL’s runtime looks at the arguments to the program itself to provide those values. Moreover, it is possible for the program’s inputs to not provide a value of the requested type. In such cases, the runtime provides a default value for missing inputs, 0 in the case of integer and an empty list in the case of list of integer. For example, let us say that a program is given a list of integer as input and that the first three functions called in the program each consume and produce a list of integer. Now, let us assume that the fourth function called takes an integer and a list of integer as input. The list of integer input will use the list of integer output from the previous function call. The DSL runtime will search backwards and find that none of the previous function calls produced integer output and that no integer input is present in the program’s inputs either. Thus, the runtime would provide the value 0 as the integer input to this fourth function call. The final output of a program is the output of the last function called.
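The argument-resolution scheme above can be sketched in Python. This is a hedged, minimal sketch of the described behavior, not NetSyn's actual implementation; the function and helper names are illustrative:

```python
# Sketch of the DSL runtime's backward search for function arguments:
# most recent prior output of the wanted type, then the program's own
# inputs, then a type-specific default (0 or the empty list).

def resolve_argument(wanted_type, prior_outputs, program_inputs):
    for value in reversed(prior_outputs):       # most recently executed first
        if _type_of(value) == wanted_type:
            return value
    for value in program_inputs:                # fall back to program arguments
        if _type_of(value) == wanted_type:
            return value
    return 0 if wanted_type == "int" else []    # defaults for missing inputs

def _type_of(value):
    return "list" if isinstance(value, list) else "int"
```

For instance, in the example from the text, a fourth statement requesting an integer when all prior outputs and program inputs are lists would receive the default value 0.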

Thus, our language is defined such that, so long as a program consists only of calls to the 41 functions provided by the DSL, the program is valid by construction. Each of the 41 functions is guaranteed to finish in finite time and there are no looping constructs in the DSL, so programs in our DSL are guaranteed to terminate. This property means our system need not monitor the programs it executes to detect potentially infinite loops. Moreover, so long as the implementations of those 41 functions are secure and have no potential for memory corruption, programs in our DSL are likewise guaranteed to be secure and not to crash, so we do not require any sandboxing techniques. When our system performs crossover between two candidate programs, arbitrary cut points in both parent programs result in a child program that is also valid by construction. Thus, our system need not test that programs created via crossover or mutation are valid.
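The validity-by-construction property makes crossover trivially safe: a program is just a sequence of function IDs, so any splice of two parents is itself a valid program. A hedged sketch (assuming, as the paper states elsewhere, that gene length is held fixed, so the same cut index is used in both parents):

```python
import random

# Single-point crossover on straight-line DSL programs (lists of function
# IDs). Using the same cut index in both parents keeps the child at the
# fixed target gene length; no validity check is needed afterwards.

def crossover(parent_a, parent_b, rng=random):
    assert len(parent_a) == len(parent_b)
    cut = rng.randrange(1, len(parent_a))   # at least one gene from each parent
    return parent_a[:cut] + parent_b[cut:]
```

Because every element of the child is a valid function ID drawn from a parent, the child needs no further validation before being executed.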

In the following sections, [] indicates the type list of integer, int indicates the integer type, and the type after the arrow indicates the output type of the function.

A.1 Functions with the Signature [] → int

There are 9 functions in our DSL that take a list of integer as input and return an integer as output.

HEAD (Function 6)

This function returns the first item in the input list. If the list is empty, a 0 is returned.

LAST (Function 7)

This function returns the last item in the input list. If the list is empty, a 0 is returned.

MINIMUM (Function 8)

This function returns the smallest integer in the input list. If the list is empty, a 0 is returned.

MAXIMUM (Function 9)

This function returns the largest integer in the input list. If the list is empty, a 0 is returned.

SUM (Function 11)

This function returns the sum of all the integers in the input list. If the list is empty, a 0 is returned.

COUNT (Function 2-5)

This function returns the number of items in the list that satisfy the criterion specified by the additional lambda. Each possible lambda is counted as a different function. Thus, there are 4 COUNT functions, having the lambdas: >0, <0, odd, even.
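The COUNT family can be sketched as follows. This is a hedged illustration, not NetSyn's implementation; the dictionary keys are illustrative names for the four lambdas:

```python
# The four COUNT variants: each lambda is a distinct DSL function.
COUNT_LAMBDAS = {
    ">0":   lambda v: v > 0,
    "<0":   lambda v: v < 0,
    "odd":  lambda v: v % 2 != 0,
    "even": lambda v: v % 2 == 0,
}

def count(pred_name, xs):
    pred = COUNT_LAMBDAS[pred_name]
    return sum(1 for v in xs if pred(v))
```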

A.2 Functions with the Signature [] → []

There are 21 functions in our DSL that take a list of integer as input and produce a list of integer as output.

REVERSE (Function 29)

This function returns a list containing all the elements of the input list but in reverse order.

SORT (Function 35)

This function returns a list containing all the elements of the input list in sorted order.

MAP (Function 19-28)

This function applies a lambda to each element of the input list and creates the output list from the outputs of those lambdas. Let x_n be the nth element of the input list to MAP and let y_n be the nth element of the output list from MAP. MAP produces an output list such that y_n = lambda(x_n) for all n. There are 10 MAP functions corresponding to the following lambdas: +1, -1, *2, *3, *4, /2, /3, /4, *(-1), ^2.
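A hedged sketch of the ten MAP variants follows. The rounding behavior of the division lambdas is an assumption (the paper does not specify it); floor division is used here for illustration:

```python
# The ten MAP variants; each lambda is a distinct DSL function.
# Division uses floor division as an assumed rounding convention.
MAP_LAMBDAS = {
    "+1": lambda v: v + 1,   "-1": lambda v: v - 1,
    "*2": lambda v: v * 2,   "*3": lambda v: v * 3,  "*4": lambda v: v * 4,
    "/2": lambda v: v // 2,  "/3": lambda v: v // 3, "/4": lambda v: v // 4,
    "*(-1)": lambda v: -v,   "^2": lambda v: v * v,
}

def map_fn(name, xs):
    f = MAP_LAMBDAS[name]
    return [f(v) for v in xs]
```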

FILTER (Function 14-17)

This function returns a list containing only those elements of the input list satisfying the criterion specified by the additional lambda. Ordering is maintained in the output list relative to the input list for those elements satisfying the criterion. There are 4 FILTER functions, having the lambdas: >0, <0, odd, even.

SCANL1 (Function 30-34)

Let x_n be the nth element of the input list to SCANL1 and let y_n be the nth element of the output list from SCANL1. This function produces an output list as follows: y_1 = x_1, and y_n = lambda(y_{n-1}, x_n) for n > 1.

There are 5 SCANL1 functions corresponding to the following lambdas: +, -, *, min, max.
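A hedged sketch of SCANL1 under the usual scanl1 recurrence (the first output element is the first input element; each later element folds the lambda over the running value):

```python
# The five SCANL1 variants: y_1 = x_1, y_n = lambda(y_{n-1}, x_n) for n > 1.
SCANL1_LAMBDAS = {
    "+": lambda a, b: a + b,  "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,  "min": min,  "max": max,
}

def scanl1(name, xs):
    f = SCANL1_LAMBDAS[name]
    out = []
    for v in xs:
        out.append(v if not out else f(out[-1], v))
    return out
```

For example, SCANL1 with the + lambda computes running sums of the input list.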

A.3 Functions with the Signature int,[] → []

There are 4 functions in our DSL that take an integer and a list of integer as input and produce a list of integer as output.

TAKE (Function 36)

This function returns a list consisting of the first N items of the input list where N is the smaller of the integer argument to this function and the size of the input list.

DROP (Function 13)

This function returns a list in which the first N items of the input list are omitted, where N is the integer argument to this function.

DELETE (Function 12)

This function returns a list in which all the elements of the input list having value X are omitted where X is the integer argument to this function.

INSERT (Function 18)

This function returns a list where the value X is appended to the end of the input list, where X is the integer argument to this function.
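The four int,[] → [] functions can be sketched directly from the descriptions above. This is a hedged illustration; the handling of negative N in TAKE and DROP is an assumption, since the paper does not specify it:

```python
# Sketches of TAKE, DROP, DELETE, and INSERT as described in the text.
def take(n, xs):
    return xs[:max(0, n)]            # first N items, or the whole list if shorter

def drop(n, xs):
    return xs[max(0, n):]            # omit the first N items

def delete(x, xs):
    return [v for v in xs if v != x] # remove all occurrences of X

def insert(x, xs):
    return xs + [x]                  # INSERT appends X to the end
```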

A.4 Functions with the Signature [],[] → []

There is only one function in our DSL that takes two lists of integers and returns another list of integers.

ZIPWITH (Function 37-41)

This function returns a list whose length is equal to the length of the shorter input list. Let z_n be the nth element of the output list from ZIPWITH. Moreover, let x_n and y_n be the nth elements of the first and second input lists, respectively. This function creates the output list such that z_n = lambda(x_n, y_n). There are 5 ZIPWITH functions corresponding to the following lambdas: +, -, *, min, max.
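A hedged sketch of the five ZIPWITH variants; truncation to the shorter input follows directly from the description above:

```python
# The five ZIPWITH variants; the output length is the length of the
# shorter input list (zip truncates automatically).
ZIPWITH_LAMBDAS = {
    "+": lambda a, b: a + b,  "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,  "min": min,  "max": max,
}

def zipwith(name, xs, ys):
    f = ZIPWITH_LAMBDAS[name]
    return [f(a, b) for a, b in zip(xs, ys)]
```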

A.5 Functions with the Signature int,[] → int

There are two functions in our DSL that take an integer and list of integer and return an integer.

ACCESS (Function 1)

This function returns the Nth element of the input list, where N is the integer argument to this function. If N is less than 0 or greater than the length of the input list then 0 is returned.

SEARCH (Function 10)

This function returns the position in the input list where the value X is first found, where X is the integer argument to this function. If the value is not present in the list, then -1 is returned.
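ACCESS and SEARCH can be sketched as below. This is a hedged illustration assuming 0-based indexing, which the paper does not state explicitly:

```python
# Sketches of ACCESS and SEARCH, assuming 0-based indices.
def access(n, xs):
    # Out-of-range N (including negative N) yields the default 0.
    return xs[n] if 0 <= n < len(xs) else 0

def search(x, xs):
    # Position of the first occurrence of X, or -1 if absent.
    return xs.index(x) if x in xs else -1
```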

Appendix B: System Details

B.1 Hyper-parameters for the Models and Genetic Algorithm

  • Evolutionary Algorithm:

    • Gene pool size: 100

    • Number of reserved genes in each generation: 20

    • Maximum number of generations: 30,000

    • Gene length: 4

    • Crossover rate: 40%

    • Mutation rate: 30%

  • Neural Network Training:

    • Loss: Categorical Cross-Entropy

    • Optimizer: Adam

    • 3 hidden layers with 48, 24, and 12 neurons

    • Activation function: Sigmoid in hidden layers and Softmax in output layer.

B.2 Generation of the Training Dataset

For our two neural network-based approaches, we created 3 types of data sets for the 3 corresponding models. We used 50,000 programs as base programs and, for comparison, chose 150 different other programs. The two sets of programs are compared with each other to compute the number of common functions or the longest common subsequence between them. For each comparison, we created 100 input-output examples, leading to a total of 750 million data points. For the first model, we generated the dataset from the base programs alone; for the other two models, we also needed a second output, which we created by passing the inputs through the comparable program. Each input or output was padded to a fixed dimension of 12, and the vectors were concatenated. For the third model, we took the absolute difference between the input and each of the two outputs, and also added the dimension difference of the two outputs. Thus, the input dimensions for the three models were 24, 36, and 25, respectively.
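One plausible reading of the feature construction above (yielding the stated dimensions 24, 36, and 25 = 12 + 12 + 1) can be sketched as follows. This is a hedged illustration; the padding value, truncation, and exact composition of the third model's features are assumptions, and all function names are hypothetical:

```python
# Illustrative feature construction for the three models; NetSyn's exact
# encoding may differ. PAD_LEN and the padding value 0 are assumptions.
PAD_LEN = 12

def pad(values, pad_value=0):
    vals = list(values)[:PAD_LEN]
    return vals + [pad_value] * (PAD_LEN - len(vals))

def features_io(ex_input, ex_output):
    # Base model: input and output concatenated -> 12 + 12 = 24 dims.
    return pad(ex_input) + pad(ex_output)

def features_two_outputs(ex_input, out_a, out_b):
    # Second model: input plus two outputs -> 3 * 12 = 36 dims.
    return pad(ex_input) + pad(out_a) + pad(out_b)

def features_diff(ex_input, out_a, out_b):
    # Third model: elementwise |input - output| for each output, plus the
    # length difference of the two outputs -> 12 + 12 + 1 = 25 dims.
    da = [abs(i - o) for i, o in zip(pad(ex_input), pad(out_a))]
    db = [abs(i - o) for i, o in zip(pad(ex_input), pad(out_b))]
    return da + db + [len(out_a) - len(out_b)]
```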

With our training programs and the given input-output examples, we created our dataset. We randomized the dataset and then split it into training and testing sets in a 3:1 ratio. Data were normalized before being fed into the neural network.

B.3 Training of Neural Network

We used 3 hidden layers in our models. The models predict the number of common functions or the longest common subsequence between the target program and the programs generated by the EA, using input-output examples, and we treat that value as a classification output.

For the DeepCoder model, we used 3 hidden layers with 256 neurons each. The input passes through an embedding layer connected to the input neurons. We averaged over the input-output examples and predicted function probabilities.

B.4 Dead Code Elimination

Dead code elimination (DCE) is a classic compiler technique to remove code from a program that has no effect on the program’s output Debray et al. (2000). Dead code is possible in our list DSL if the output of a statement is never used. We implemented DCE in NetSyn by tracking the input/output dependencies between statements and eliminating those statements whose outputs are never used. NetSyn uses DCE during candidate program generation and during crossover/mutation to ensure that the effective length of the program is not less than the target program length due to the presence of dead code.
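The dependency-tracking DCE described above can be sketched as a backward reachability pass. This is a hedged illustration, not NetSyn's implementation; it assumes the input/output dependencies between statements have already been resolved into index sets:

```python
# Dead code elimination by backward reachability over statement
# dependencies. uses[i] is the set of earlier statement indices whose
# outputs statement i consumes. The program's result is the output of
# the last statement, so any statement not transitively reachable from
# it is dead and can be removed.

def live_statements(uses):
    n = len(uses)
    live = set()
    stack = [n - 1]                 # the final statement is always live
    while stack:
        i = stack.pop()
        if i not in live:
            live.add(i)
            stack.extend(uses[i])   # whatever it reads must also be kept
    return sorted(live)
```

For example, if statement 1 is never read by any later statement and does not feed the final output, it is dropped, and the effective program length is the number of live statements, which NetSyn keeps equal to the target length.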

Appendix C: Additional Results

Program Method Synthesis Time Required to Synthesize (in seconds)
Length Percentage 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%

48% 1s 60s 388s 422s - - - - - -
77% 1s 7s 112s 272s 345s 393s 483s - - -
40% 1s 1s 2s 126s - - - - - -
51% 1s 1s 6s 66s 357s - - - - -
64% 1s 1s 4s 174s 525s 803s - - - -
98% 1s 1s 4s 43s 112s 349s 600s 690s 838s -
96% 1s 1s 3s 46s 131s 392s 671s 768s 874s -
98% 1s 1s 4s 43s 115s 344s 612s 704s 847s -
96% 1s 1s 3s 47s 138s 400s 660s 753s 866s -
100% 1s 1s 1s 1s 1s 1s 1s 1s 1s 1s

43% 1s 1s 314s 397s - - - - - -
73% 1s 1s 82s 229s 300s 368s 462s - - -
45% 1s 1s 1s 14s - - - - - -
75% 1s 1s 1s 4s 190s 233s 619s - - -
63% 1s 1s 3s 306s 464s 919s - - - -
87% 1s 1s 4s 130s 574s 716s 870s 956s - -
88% 1s 1s 3s 183s 533s 732s 872s 918s - -
87% 1s 1s 4s 127s 563s 695s 875s 937s - -
88% 1s 1s 3s 185s 528s 754s 855s 899s - -
100% 1s 1s 1s 1s 1s 1s 1s 1s 1s 1s

44% 1s 1s 1s 685s - - - - - -
68% 1s 1s 1s 249s 354s 433s - - - -
45% 1s 1s 1s 13s - - - - - -
52% 1s 1s 2s 11s 635s - - - - -
58% 1s 1s 1s 393s 566s - - - - -
81% 1s 1s 1s 176s 676s 1062s 1134s 1180s - -
78% 1s 1s 1s 127s 609s 889s 956s - - -
81% 1s 1s 1s 178s 670s 1094s 1112s 1156s - -
79% 1s 1s 1s 126s 624s 876s 976s - - -
100% 1s 1s 1s 1s 1s 1s 1s 1s 1s 1s

43% 1s 1s 1s 1371s - - - - - -
65% 1s 1s 1s 297s 401s 534s - - - -
56% 1s 1s 1s 1s 29s - - - - -
57% 1s 1s 1s 1s 15s - - - - -
49% 1s 1s 1s 587s - - - - - -
68% 1s 1s 1s 748s 1545s 1702s - - - -
69% 1s 1s 1s 404s 988s 1044s - - - -
68% 1s 1s 1s 763s 1578s 1668s - - - -
69% 1s 1s 1s 392s 969s 1013s - - - -
100% 1s 1s 1s 1s 1s 1s 1s 1s 1s 1s

41% 1s 1s 1s 1387s - - - - - -
67% 1s 1s 1s 288s 429s 574s - - - -
53% 1s 1s 1s 1s 56s - - - - -
55% 1s 1s 1s 1s 107s - - - - -
52% 1s 1s 1s 1s 614s - - - - -
64% 1s 1s 1s 1584s 2544s 2846s - - - -
67% 1s 1s 1s 837s 1055s 1195s - - - -
64% 1s 1s 1s 1603s 2473s 2790s - - - -
67% 1s 1s 1s 846s 1029s 1207s - - - -
100% 1s 1s 1s 1s 1s 1s 1s 1s 1s 1s

41% 1s 1s 1s 1403s - - - - - -
68% 1s 1s 1s 208s 459s 591s - - - -
42% 1s 1s 1s 67s - - - - - -
48% 1s 1s 1s 1011s - - - - - -
49% 1s 1s 1s 517s - - - - - -
55% 1s 1s 1s 625s 1640s - - - - -
66% 1s 1s 1s 164s 957s 1121s - - - -
55% 1s 1s 1s 638s 1649s - - - - -
66% 1s 1s 1s 168s 978s 1099s - - - -
100% 1s 1s 1s 1s 1s 1s 1s 1s 1s 1s

Table 3: Comparison with DeepCoder and PCCoder in synthesizing different length programs. All experiments are done with the maximum search space set to 3,000,000 candidate programs.

Table 3 shows detailed numerical results using synthesis time as the metric. Columns 10% to 100% show the time (in seconds) required to synthesize the corresponding percentage of programs.

Program Method Search Space Used to Synthesize
Length 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%

1% 9% 59% 64% - - - - - -
1% 1% 17% 41% 52% 60% 73% - - -
1% 1% 1% 37% - - - - - -
1% 1% 1% 7% 33% - - - - -
1% 1% 1% 2% 5% 23% - - - -
1% 1% 1% 3% 7% 10% 15% 21% 30% -
1% 1% 1% 2% 5% 9% 14% 21% 29% -
1% 1% 1% 2% 7% 9% 15% 21% 30% -
1% 1% 1% 2% 5% 9% 13% 20% 28% -
1% 1% 1% 1% 1% 1% 1% 1% 1% 1%

1% 1% 46% 58% - - - - - -
1% 1% 12% 33% 43% 53% 67% - - -
1% 1% 1% 4% - - - - - -
1% 1% 1% 1% 33% 55% 73% - - -
1% 1% 1% 1% 5% 28% - - - -
1% 1% 1% 2% 5% 12% 18% 25% - -
1% 1% 1% 2% 5% 10% 17% 24% - -
1% 1% 1% 1% 4% 11% 18% 24% - -
1% 1% 1% 2% 4% 10% 16% 23% - -
1% 1% 1% 1% 1% 1% 1% 1% 1% 1%

1% 1% 1% 82% - - - - - -
1% 1% 1% 34% 48% 59% - - - -
1% 1% 1% 3% - - - - - -
1% 1% 1% 1% 38% - - - - -
1% 1% 1% 1% 14% - - - - -
1% 1% 1% 2% 3% 6% 18% 30% - -
1% 1% 1% 2% 5% 11% 20% - - -
1% 1% 1% 2% 2% 6% 17% 29% - -
1% 1% 1% 1% 5% 10% 20% - - -
1% 1% 1% 1% 1% 1% 1% 1% 1% 1%

1% 1% 1% 93% - - - - - -
1% 1% 1% 37% 50% 67% - - - -
1% 1% 1% 1% 6% - - - - -
1% 1% 1% 1% 1% - - - - -
1% 1% 1% 1% - - - - - -
1% 1% 1% 3% 7% 17% - - - -
1% 1% 1% 2% 6% 13% - - - -
1% 1% 1% 3% 7% 16% - - - -
1% 1% 1% 2% 5% 12% - - - -
1% 1% 1% 1% 1% 1% 1% 1% 1% 1%

1% 1% 1% 1% 1% 9% - - - -
1% 1% 1% 1% 1% 7% - - - -
1% 1% 1% 1% 1% 9% - - - -
1% 1% 1% 1% 1% 7% - - - -
1% 1% 1% 1% 4% - - - - -
1% 1% 1% 1% 3% 11% 17% - - -
1% 1% 1% 1% 1% 9% - - - -
1% 1% 1% 3% 7% 16% - - - -
1% 1% 1% 2% 5% 12% - - - -
1% 1% 1% 1% 1% 1% 1% 1% 1% 1%

1% 1% 1% 90% - - - - - -
1% 1% 1% 20% 43% 56% - - - -
1% 1% 1% 9% - - - - - -
1% 1% 1% 61% - - - - - -
1% 1% 1% 4% - - - - - -
1% 1% 1% 6% 16% - - - - -
1% 1% 1% 4% 12% 24% - - - -
1% 1% 1% 6% 16% - - - - -
1% 1% 1% 4% 11% 24% - - - -
1% 1% 1% 1% 1% 1% 1% 1% 1% 1%

Table 4: Comparison with DeepCoder and PCCoder in terms of search space use. All experiments are done with the maximum search space set to 3,000,000 candidate programs.

In the experimental results section, we presented synthesis times for NetSyn that included both GA time and NN-FF inference time. Since inference time is significant, we speculate that further optimization of our approach may be possible using future deep learning accelerators Shawahna et al. (2019) that could eliminate some portion of the neural network inference time. We therefore show two versions of NetSyn: the original, non-optimized time, which is the total wall clock time on our hardware testbed, and an optimized version obtained by subtracting the inference time. The results are presented in Table 5.

Program Length MP System Synthesis Time Required to Synthesize (in seconds)
Percentage 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
5 40% 1s 1s 2s 126s - - - - - -
51% 1s 1s 6s 66s 357s - - - - -
64% 1s 1s 4s 174s 525s 803s - - - -
64% 1s 1s 1s 164s 515s 793s - - - -
98% 1s 1s 4s 43s 112s 349s 600s 690s 838s -
98% 1s 1s 1s 11s 33s 141s 321s 421s 537s -
96% 1s 1s 3s 46s 131s 392s 671s 768s 874s -
96% 1s 1s 1s 13s 39s 181s 380s 475s 569s -
6 45% 1s 1s 1s 14s - - - - - -
75% 1s 1s 1s 4s 190s 233s 619s - - -
63% 1s 1s 3s 306s 464s 919s - - - -
63% 1s 1s 1s 296s 457s 909s - - - -
87% 1s 1s 4s 130s 574s 716s 870s 956s - -
87% 1s 1s 1s 56s 307s 429s 579s 658s - -
88% 1s 1s 3s 183s 533s 732s 872s 918s - -
88% 1s 1s 1s 66s 281s 435s 569s 612s - -
7 45% 1s 1s 1s 13s - - - - - -
52% 1s 1s 2s 11s 635s - - - - -
58% 1s 1s 1s 393s 566s - - - - -
58% 1s 1s 1s 383s 556s - - - - -
81% 1s 1s 1s 176s 676s 1062s 1134s 1180s - -
81% 1s 1s 1s 92s 445s 775s 834s 886s - -
78% 1s 1s 1s 127s 609s 889s 956s - - -
78% 1s 1s 1s 49s 349s 593s 655s - - -
8 56% 1s 1s 1s 1s 29s - - - - -
57% 1s 1s 1s 1s 15s - - - - -
49% 1s 1s 1s 587s - - - - - -
49% 1s 1s 1s 577s - - - - - -
68% 1s 1s 1s 748s 1545s 1702s - - - -
68% 1s 1s 1s 571s 1258s 1401s - - - -
69% 1s 1s 1s 404s 988s 1044s - - - -