Learning Fitness Functions for Genetic Algorithms
Abstract
The problem of automatic software generation is known as Machine Programming. In this work, we propose a framework based on genetic algorithms to solve this problem. Although genetic algorithms have been used successfully for many problems, one criticism is that handcrafting a fitness function, the test that guides the evolution, can be notably challenging. Our framework presents a novel approach to learning the fitness function, using neural networks to predict the values of an ideal fitness function. We also augment the evolutionary process with a minimally intrusive search heuristic. This heuristic improves the framework's ability to discover correct programs from ones that are approximately correct, and it does so with negligible computational overhead. We compare our approach with two state-of-the-art program synthesis methods and demonstrate that it finds more correct programs with fewer candidate program generations.
1 Introduction
In recent years, there has been notable progress in the space of automatic software generation, also known as Machine Programming (MP) Gottschlich et al. (2018); Ratner et al. (2019). MP can be achieved in many ways. One way is by using formal program synthesis, a technique that uses formal methods and rules to generate programs Manna and Waldinger (1975). Formal program synthesis usually guarantees some program properties by evaluating a generated program's semantics against a corresponding specification Gulwani et al. (2012); Alur et al. (2015). Although useful, such formal synthesis techniques can often be limited by exponentially increasing computational overhead that grows with the program's instruction size Heule et al. (2016); Bodík and Jobstmann (2013); Solar-Lezama et al. (2006); Loncaric et al. (2018); Cheung et al. (2012).
An alternative to formal methods for MP is to use machine learning (ML). Machine learning differs from traditional formal program synthesis in that it generally does not provide correctness guarantees. Instead, ML-driven MP approaches are usually only probabilistically correct, i.e., their results are derived from sample data and rely on statistical significance Murphy (2012). Such ML approaches tend to explore software program generation using an objective function, which guides an ML system's exploration of a problem space to find a solution.
More recently, there has been a surge of research exploring ML-based MP using neural networks (NNs). For example, in Balog et al. (2017), the authors train a neural network with input-output examples to predict the probabilities of the functions that are most likely to be used in a program. Raychev et al. (2014) take a different approach and use an n-gram model to predict the functions that are most likely to complete a partially constructed program. Bunel et al. (2018) explore a unique approach that combines reinforcement learning (RL) with a supervised model to find semantically correct programs. These are only a few of the works in the MP space using neural networks Reed and de Freitas (2016); Cai et al. (2017).
As demonstrated by Becker and Gottschlich (2017), genetic algorithms also show promise for MP. Real et al. (2018) subsequently demonstrated that genetic algorithms can generate accurate image classifiers, producing a state-of-the-art classifier for the CIFAR-10 Krizhevsky (2009) and ImageNet Deng et al. (2009) datasets. Moreover, genetic algorithms have been exploited to successfully automate the neural architecture optimization process Salimans et al. (2017); Such et al. (2017); Liu et al. (2017). Even with this notable progress, genetic algorithms can be challenging to use due to the complexity of handcrafting the fitness functions that guide the search.
While genetic algorithms have had demonstrable success in their practical application Korns (2011); Such et al. (2017); Real et al. (2018), an open challenge is creating simple, yet effective fitness functions. Contrary to this goal, fitness function complexity tends to increase proportionally with the complexity of the problem being solved. In this paper, we explore an approach to automatically generate these fitness functions by representing their structure with a neural network, making the following technical contributions:

Fitness Function: Our fundamental contribution is in the automation of fitness functions for genetic algorithms. To the best of our knowledge, our work is the first of its kind to use a neural network as a genetic algorithm’s fitness function for software generation.

Convergence: A secondary contribution is in our utilization of local neighborhood search to improve the convergence of approximately correct candidate solutions. We demonstrate its efficacy empirically.

Generality: We demonstrate that our approach can support different neural network fitness functions, uniformly. While our experiments are strictly in the space of MP, we are not aware of any restrictions from utilizing our fitness function automation technique in other domains.

Metric: We contribute a new metric suitable for the MP domain. While prior work Balog et al. (2017); Zohar and Wolf (2018a) aims at optimizing program generation time, we argue that program generation time does not fully capture the efficiency of the generation algorithm; rather, it captures the implementation efficiency of the algorithm. Therefore, we propose to use "search space" size, i.e., how many candidate programs have been searched, as an alternative metric and demonstrate its value for comparing MP approaches.
2 Background
Let E = {(x_i, y_i)}, i = 1, ..., N, be a set of input-output pairs, such that the output y_i is obtained by executing the program P on the input x_i. Inherently, the set E of input-output examples describes the behavior of the program P. One would like to synthesize a program P' that recovers the same functionality of P. However, P is usually unknown, and we are left only with the set E, which was obtained by running P. Based on this assumption, we define equivalency between two programs as follows:
Definition 2.1 (Program Equivalency).
Programs P and P' are equivalent under the set E of input-output examples if and only if P(x_i) = P'(x_i) for i = 1, ..., N. We denote the equivalency by P =_E P'.
Definition 2.1 suggests that to obtain a program equivalent to P, we need to synthesize a program that is consistent with the set E. Therefore, our goal is to find a program P' that is equivalent to the target program P (which was used to generate E), i.e., P =_E P'. This task is known as the Inductive Program Synthesis (IPS) problem.
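The equivalence check of Definition 2.1 reduces to comparing outputs over the finite example set. A minimal sketch, with programs represented as plain Python callables (an illustrative simplification of DSL programs):

```python
# A minimal sketch of Definition 2.1, with programs represented as plain
# Python callables over the example inputs (an illustrative simplification
# of NetSyn's DSL programs).

def consistent_with(examples, p):
    """A candidate solves the IPS instance iff it reproduces every output."""
    return all(p(x) == y for x, y in examples)

def equivalent_under(examples, p1, p2):
    """P1 =_E P2 iff the two programs agree on every input in E."""
    return all(p1(x) == p2(x) for x, _ in examples)
```

Note that equivalence is only guaranteed with respect to E; two programs may agree on every recorded example and still differ on unseen inputs.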
As suggested by Balog et al. (2017), a machine learning based solution to the IPS problem requires the definition of several components. First, we need a programming language that defines the domain of valid programs; consequently, this language enforces a definition of the domain for the input and output samples. Second, we need a method to search over the program domain; the search method sweeps over this domain to find a program P' that satisfies the equivalency property. Optionally, we may want to define a ranking function to rank all the solutions found and choose the best ones. Last, as we plan to base our solution on machine learning techniques, we need data, e.g., programs with input-output examples, to train models.
3 NetSyn
Here, we describe our solution to the IPS problem in more detail, including the choices and novelties for each of the proposed components. We name our solution NetSyn as it is based on neural networks for program synthesis.
3.1 Domain Specific Language
As NetSyn’s programming language, we choose a domain specific language (DSL) constructed specifically for it. This choice allows us to constrain the program space by restricting the operations used by our solution.
NetSyn's DSL follows DeepCoder's DSL Balog et al. (2017), which was inspired by SQL and LINQ Dinesh et al. (2007). The only data types in the language are (i) integers and (ii) lists of integers. The DSL contains 41 functions, each taking one or two arguments and returning one output. Many of these functions are operations for list manipulation; some also require lambda functions. There is no explicit control flow (conditionals or looping) in the DSL, although several of the operations are high-level functions implemented using such control flow structures. A full description of the DSL can be found in the supplementary material. With these data types and operations, we define a program as a sequence of functions. Table 1 presents an example of a program of 4 instructions with an input and the corresponding output.
Arguments to functions are not specified via named variables. Instead, each function uses the output of the most recently executed function that produces the type of input it requires. The first function of each program uses the provided input. If the input has a type mismatch, default values are used (i.e., 0 for integers and an empty list for a list of integers). The final output of a program is the output of its last function.
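This implicit argument-passing convention can be sketched as a tiny interpreter. Representing a program as (callable, argument-types) pairs is an illustrative assumption, not NetSyn's actual encoding:

```python
# Sketch of NetSyn's implicit argument passing: each function draws each
# argument from the most recent value of the required type, falling back
# to the program input and then to defaults (0 for int, [] for list).
# Representing a program as (callable, argument-types) pairs is an
# illustrative assumption, not NetSyn's actual encoding.

DEFAULTS = {int: 0, list: []}

def run(program, program_input):
    history = [program_input]              # the program input seeds history
    for fn, arg_types in program:
        args = []
        for t in arg_types:
            # search backwards for the most recent value of this type
            val = next((v for v in reversed(history) if type(v) is t),
                       DEFAULTS[t])
            args.append(val)
        history.append(fn(*args))
    return history[-1]                     # output of the last function
```

A pipeline like Filter (>0), Map (*2), Sort, Reverse can then be expressed as four (callable, [list]) pairs and executed with `run`.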
Table 1. An example program of length 4 with an input and its corresponding output.

[int]        | Input:
Filter (>0)  | [2, 10, 3, -4, 5, -2]
Map (*2)     |
Sort         | Output:
Reverse      | [20, 10, 6, 4]

As a whole, NetSyn's DSL is novel and amenable to genetic algorithms. The language is defined such that all possible programs are valid by construction. This makes the whole program space valid, which is important for facilitating the search for programs by any learning method. It is particularly useful for the evolutionary processes in genetic algorithms: when genetic crossover occurs between two programs or mutation occurs within a single program, the resulting program is always valid. This eliminates the need for pruning to identify valid programs.
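A sketch of why validity is preserved: since a gene is just a sequence of function indices, single-point crossover and point mutation can only produce another sequence of valid indices. The 41-function count comes from the DSL description; the operators below are otherwise illustrative:

```python
import random

# Sketch: a gene is a sequence of indices into the DSL's function set, and
# every such sequence is a valid program by construction, so crossover and
# mutation cannot produce an ill-formed candidate. NUM_FUNCS matches the
# 41-function DSL; the operators themselves are illustrative.

NUM_FUNCS = 41

def crossover(g1, g2, rng=random):
    point = rng.randrange(1, len(g1))   # single-point crossover
    return g1[:point] + g2[point:]      # still a list of valid indices

def mutate(gene, rng=random):
    i = rng.randrange(len(gene))        # mutation point
    new_f = rng.choice([f for f in range(NUM_FUNCS) if f != gene[i]])
    return gene[:i] + [new_f] + gene[i + 1:]
```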
3.2 Search Process
NetSyn synthesizes a program by searching the program space with a genetic algorithm-based method Thomas (2009). It creates a population of random genes (i.e., candidate programs) of a given length and uses a learned neural network-based fitness function to estimate the fitness of each gene. Higher-graded genes are preferentially selected for crossover and mutation to produce the next generation of genes. NetSyn uses this process to evolve the genes from one generation to the next until it discovers a correct candidate program, as verified by the input-output examples. From time to time, NetSyn takes the top-scoring genes from the population, determines their neighborhoods, and looks for the target program using a local proximity search. If a correct program is not found within the neighborhoods, the evolutionary process resumes. Figure 1 summarizes NetSyn's search process.
We use a value encoding approach for each gene. A gene is represented as a sequence of values from F, the set of DSL functions. Formally, a gene g = (g_1, ..., g_k), where g_i ∈ F. Practically, each g_i contains an identifier (or index) corresponding to one of the DSL functions. The encoding scheme is a one-to-one match between programs and genes.
The search process begins with a set of randomly generated programs. If a program equivalent to the target program is found, the search stops. Otherwise, the genes are ranked using the learned fitness function. A small percentage (e.g., 20%) of the top-graded genes are passed unmodified to the next generation. This preserves some of the top-graded genes identically, aiding forward progress. The remaining genes of the new generation are created through crossover or mutation with some probability. For crossover, two genes from the population are selected using the Roulette Wheel algorithm, with the crossover point selected randomly Goldberg (1989). For mutation, one gene is Roulette Wheel selected, and the mutation point g_i in that gene is selected randomly. The selected value is mutated to some other random value g_i' such that g_i' ≠ g_i and g_i' ∈ F. Sometimes, crossovers and mutations can lead to a new gene with dead code (see Section B.4). If dead code is present, we repeat crossover and mutation until a gene without dead code is produced.
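The generation step described above might be sketched as follows, with elitism, Roulette Wheel selection, and single-point crossover. The learned fitness function is abstracted as any gene-to-score callable, the dead-code re-draw is omitted, and all parameter values are illustrative assumptions:

```python
import random

# Sketch of one NetSyn generation step: elitism plus roulette-wheel-selected
# crossover/mutation. The learned fitness function is abstracted as any
# gene -> score callable; the dead-code re-draw described in the text is
# omitted. All parameter values are illustrative assumptions.

NUM_FUNCS = 41  # size of the DSL function set

def roulette_select(population, scores, rng):
    # fitness-proportionate selection (assumes non-negative scores)
    pick = rng.uniform(0, sum(scores))
    acc = 0.0
    for gene, score in zip(population, scores):
        acc += score
        if acc >= pick:
            return gene
    return population[-1]

def next_generation(population, fitness, elite_frac=0.2,
                    crossover_prob=0.5, rng=random):
    scores = [fitness(g) for g in population]
    order = sorted(range(len(population)), key=lambda i: scores[i],
                   reverse=True)
    n_elite = max(1, int(elite_frac * len(population)))
    new_pop = [population[i] for i in order[:n_elite]]  # elites pass unchanged
    while len(new_pop) < len(population):
        if rng.random() < crossover_prob:
            p1 = roulette_select(population, scores, rng)
            p2 = roulette_select(population, scores, rng)
            point = rng.randrange(1, len(p1))           # single-point crossover
            child = p1[:point] + p2[point:]
        else:
            g = roulette_select(population, scores, rng)
            i = rng.randrange(len(g))                   # random mutation point
            new_f = rng.choice([f for f in range(NUM_FUNCS) if f != g[i]])
            child = g[:i] + [new_f] + g[i + 1:]
        new_pop.append(child)
    return new_pop
```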
Learning the Fitness Function
Evolving the population of genes in a genetic algorithm requires a fitness function to rank the fitness (quality) of genes based on the problem being solved. Ideally, a fitness function should measure how close a gene is to the solution; namely, it should measure how close a candidate program is to an equivalent of the target program P under E. Finding a good fitness function is of great importance to reduce the size of the search domain and to make the genetic algorithm more likely to find a solution in less time.
Oftentimes, a fitness function is handcrafted to approximate some ideal function that is impossible (due to incomplete knowledge about the solution) or too computationally intensive to implement in practice. For example, if we knew P beforehand, we could design an ideal fitness function that compares a candidate program with P and calculates some metric of closeness (e.g., edit distance, the number of common functions, etc.) as the fitness score. Since we do not know P, we cannot implement the ideal fitness function. Instead, in this work, we propose to approximate the ideal fitness function by learning it from training data (generated from a number of known programs). For this purpose, we use a neural network model, trained with the goal of predicting the values of an ideal fitness function. We call such an ideal fitness function (one that would always give the correct answer with respect to the actual solution) the oracle fitness function, as it is impossible to achieve in practice merely by examining input-output examples. Our models will not reach the 100% accuracy of the oracle but will still be accurate enough to allow the genetic algorithm to make forward progress. Also, we note that the trained model needs to generalize to predict for any unseen solution, not a single specific target case.
We follow ideas from works that have explored the automation of fitness functions using neural networks for approximating a known mathematical model. For example, Matos Dias et al. (2014) automated them for IMRT beam angle optimization, while Khuntia et al. (2005) used them for rectangular microstrip antenna design automation. In contrast, our work is fundamentally different in that we use a large corpus of program metadata to train our models to predict how close a given, incorrect solution could be to an unknown correct solution (one that will generate the correct output).
Given the input-output samples E of the target program P and an ideal fitness function f, we would like a model that predicts the fitness value f(g) for a gene g. In practice, our model needs to predict the values of f from the input-output samples in E and/or the samples obtained with the output of g, i.e., E_g = {(x_i, g(x_i)) : (x_i, y_i) ∈ E}.
In NetSyn, we use a neural network to model the fitness function, referred to as the NNFF. This task requires us to generate a training dataset of programs with respective input-output samples. To train the NNFF, we randomly generate a set of example programs along with a set of random inputs per program. We then execute each program P in the set with its corresponding inputs to calculate the outputs. Additionally, for each such P, we randomly generate another program P'. We apply the previously generated inputs to P' to produce its outputs. NetSyn then compares the programs P and P' to calculate the fitness value and uses it as an example to train the neural network.
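The training-data pipeline can be sketched as below, using the number of common functions as the closeness label. Programs are index sequences, `execute` stands in for the DSL interpreter, and all names are illustrative:

```python
# Sketch of NNFF training-data generation: pair a known program P with a
# randomly generated program P', run P' on shared random inputs, and label
# the pair with a closeness metric between P and P' (here, the number of
# common functions). Programs are index sequences; 'execute' stands in for
# the DSL interpreter and is an illustrative assumption.

def common_functions(p, p_prime):
    return len(set(p) & set(p_prime))

def make_training_example(p, p_prime, execute, inputs):
    label = common_functions(p, p_prime)                   # ideal fitness value
    features = [(x, execute(p_prime, x)) for x in inputs]  # candidate's I/O pairs
    return features, label
```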
In NetSyn, the inputs for the neural network can be chosen from one of three options: (i) only the candidate's examples E_g (the inputs of E paired with the candidate's outputs), (ii) both E and E_g, or (iii) the difference between the inputs and outputs in E and E_g. During training, we set E to be the examples of a program P in the training set and E_g to be those of the corresponding randomly generated program P'. In this way, the network is trained on a variety of program correctness levels. At inference time, each generated candidate program plays the role of P' and the target program replaces P, so the first option takes its inputs from E_g during the search; the other two options are handled similarly. Details of how we trained the neural network can be found in the supplementary material. For inference with many input-output examples for a gene g, all of them are fed into the neural network, and the NNFF predictions are averaged to obtain the final prediction for g's fitness.
To illustrate, suppose the program in Table 1 is P, and let P' be another program {[int], Filter (>0), Map (*2), Reverse, Drop (2)}. Applying the Table 1 input to P' yields the output [6, 20]. With only the candidate's examples, the NNFF input is the pair of the Table 1 input and [6, 20]. With both the target's and candidate's examples, the NNFF input additionally contains the pair of the Table 1 input and [20, 10, 6, 4]; with the difference representation, the NNFF input is the element-wise difference between these pairs.
There are different ways to quantify how close two programs are to one another. Each of these different methods then has an associated metric and ideal fitness value. We investigated three such metrics – common functions, longest common subsequence, and function probability – which we use as the expected predicted output for the NNFF.
Common Functions NetSyn can use the number of common functions (CF) between a gene g and the target program P as the fitness value of g. In other words, the fitness value of g is
f_CF(g) = |{g_1, ..., g_k} ∩ {p_1, ..., p_k}|,    (1)
where p_1, ..., p_k are the functions of the target program P.
Because the output of the neural network is an integer from 0 to the gene length k, the neural network can be designed as a multi-class classifier with a softmax layer as the final layer.
Longest Common Subsequence As an alternative to CF, we can use the length of the longest common subsequence (LCS) between g and P. The fitness score of g is
f_LCS(g) = |LCS(g, P)|.    (2)
Similar to CF, training data can be constructed in the same way and fed into a neural network-based multi-class classifier.
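The LCS label itself is the classic dynamic program over the two function-index sequences; a minimal sketch:

```python
# Sketch of the LCS fitness label: the length of the longest common
# subsequence between a candidate gene and the target program, both
# treated as sequences of function indices.

def lcs_length(a, b):
    # classic O(len(a) * len(b)) dynamic program
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            if x == y:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]
```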
Function Probability Balog et al. (2017) proposed a probability map for the functions in the DSL. The probability map p is defined as the probability of each DSL operation appearing in P given the input-output samples; namely, p = (p_1, ..., p_{|F|}) such that p_j = Prob(o_j ∈ P | E), where o_j is the j-th operation in the DSL. A multi-class, multi-label neural network classifier with sigmoid activation functions in the output of the last layer can then be used to predict the probability map. Training data for this neural network can be constructed from E. We can use the probability map to calculate the fitness score of g as
f_FP(g) = Σ_{i=1}^{k} p_{g_i}.    (3)
NetSyn can also use the probability map to guide the mutation process. For example, instead of mutating a function g_i to a randomly selected g_i' ∈ F, NetSyn can select g_i' using the Roulette Wheel algorithm over the probability map.
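Both uses of the probability map (scoring a gene and biasing mutation) can be sketched as follows. Summing the per-function probabilities is one plausible reading of the fitness score; the map itself would come from the trained classifier, but here it is a plain list of floats (an assumption):

```python
import random

# Sketch of the probability-map fitness and guided mutation. Scoring a gene
# by the total probability mass of its functions is one plausible reading of
# the FP fitness score; the probability map would come from the trained
# classifier, but here it is a plain list of floats (an assumption).

def fp_fitness(gene, prob_map):
    return sum(prob_map[f] for f in gene)

def guided_mutation(gene, prob_map, rng=random):
    i = rng.randrange(len(gene))
    # roulette-wheel pick over all functions except the current one
    candidates = [f for f in range(len(prob_map)) if f != gene[i]]
    weights = [prob_map[f] for f in candidates]
    new_f = rng.choices(candidates, weights=weights, k=1)[0]
    return gene[:i] + [new_f] + gene[i + 1:]
```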
Local Neighborhood Search
Neighborhood search (NS) checks some candidate genes in the neighborhood of the top-scoring genes from the genetic algorithm. The intuition behind NS is that if the target program is in that neighborhood, NetSyn may be able to find it without relying on the genetic algorithm, which would likely result in a faster time to synthesize the correct program.
Neighborhood Search Invocation Let us assume that NetSyn has completed n generations, and let w be a sliding window. Let f_last denote the average fitness score of genes over the last w generations (i.e., generations n - w to n) and f_prev denote the average fitness score over the w generations before those (i.e., generations n - 2w to n - w). NetSyn invokes NS if f_last <= f_prev. The rationale is that under these conditions, the search procedure has not produced improved genes for the last w generations (i.e., it is saturating). Therefore, NetSyn should check whether the neighborhood contains any program equivalent to the target program.
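The invocation rule can be sketched as a window comparison over the per-generation average fitness history; the window size and history encoding are illustrative assumptions:

```python
# Sketch of the NS trigger: with a sliding window of w generations, invoke
# neighborhood search when the average fitness over the last w generations
# has not improved on the average over the w generations before them. The
# history encoding (one average score per generation) is an assumption.

def should_invoke_ns(avg_fitness_history, w):
    if len(avg_fitness_history) < 2 * w:
        return False                      # not enough history yet
    recent = avg_fitness_history[-w:]
    prior = avg_fitness_history[-2 * w:-w]
    return sum(recent) / w <= sum(prior) / w
```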
Neighborhood Definition Algorithm 1 shows how to define and search a neighborhood. The algorithm is inspired by the breadth-first search (BFS) method. For each top-scoring gene g, NetSyn considers one function at a time, starting from the first operation of the gene and proceeding to the last. Each selected operation is replaced with every other operation from F, and the resultant genes are inserted into the neighborhood set. If a program equivalent to the target program is found in that set, NetSyn stops and returns the solution. Otherwise, it continues the search and then returns to the genetic algorithm. The complexity of the search is O(k|F|) per gene, which is significantly smaller than the exponential search space used by a traditional BFS algorithm.
Similar to BFS, NetSyn can define and search the neighborhood using an approach similar to depth-first search (DFS). It follows Algorithm 1, except that NetSyn keeps track of the depth. After the loop in line 4 finishes, NetSyn picks the best-scoring gene from the neighborhood set to replace g before going to the next level of depth. The algorithmic complexity remains the same. Figure 2 shows examples of neighborhoods using the BFS- and DFS-based approaches.
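The BFS-style neighborhood generation can be sketched directly from the description above; for a gene of length k over |F| functions it yields k*(|F|-1) candidates:

```python
# Sketch of the BFS-style neighborhood: replace each position of a top gene
# with every other DSL function, giving k*(|F|-1) candidates per gene rather
# than an exponential frontier. Genes are sequences of function indices.

def bfs_neighborhood(gene, num_funcs):
    neighborhood = []
    for i in range(len(gene)):
        for f in range(num_funcs):
            if f != gene[i]:
                neighborhood.append(gene[:i] + [f] + gene[i + 1:])
    return neighborhood
```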
4 Experimental Results
We implemented NetSyn in C++ with a TensorFlow backend Abadi et al. (2015). We also developed an interpreter for NetSyn’s DSL to evaluate the generated programs. We used 50,000 randomly generated unique example programs of length 4 to train the neural networks. We used 100 inputoutput examples for each program to generate the training data. For every approach, the same programs are used for training. Code and data can be found at http://www.anonymous.com/anonymous.
NetSyn's training time ranges between 3 and 8 hours using an NVIDIA Tesla P100 GPU. For our NNFF, we chose an NN topology with three hidden layers and a softmax activation function in the output layer. Its input layer performs normalization to properly predict the CF and LCS fitness scores. The hidden layers have 48, 24, and 12 neurons, respectively. These decisions were based on experimental evaluation of the NNFF's accuracy. The model uses a softmax activation function in the output layer because the CF and LCS fitness scores are modeled as a classification problem. The normalization layer was chosen for its ability to reduce the impact of the absolute values of inputs. The function probability map follows its proposal in DeepCoder; as such, we used the same NN configuration to predict it. We used fixed-length inputs (padded, if necessary) for the neural networks.
To test NetSyn, we randomly generated a total of 100 programs for each program length from 5 to 10. For each program length, 50 of the generated programs produce a singleton integer as the output; the rest produce a list of integers. We therefore refer to the first 50 programs as singleton programs and the rest as list programs. We collected input-output examples for each testing program to populate E. We did not consider testing programs of length 1 to 3 because the search space is too small to warrant a sophisticated synthesis technique. When synthesizing a program using NetSyn, we execute it 50 times and average the results to eliminate noise.
Our experiments aim to (i) demonstrate NetSyn’s synthesis ability and compare its performance against two stateoftheart approaches, DeepCoder and PCCoder, and to (ii) characterize the effectiveness of different design choices used.
4.1 Demonstration of Synthesis Ability
We ran three variants of NetSyn, one for each of the CF, LCS, and FP fitness functions. Additionally, we ran the publicly available implementations of DeepCoder and PCCoder from their respective GitHub repositories. For DeepCoder, we used the best-performing implementation, based on the "Sort and Add" enumerative search algorithm Balog et al. (2017). For comparison, we also tested three other fitness functions: 1) a constant function, presumably the worst case, in which the GA finds programs randomly, 2) the edit distance between outputs, and 3) the oracle. Unless otherwise mentioned, we use NetSyn to denote the CF- and LCS-based variants. For all approaches, we set the maximum search space size to 3,000,000 candidate programs. If an approach does not find the solution before reaching that threshold, we conclude the experiment and record it as "solution not found." The summarized results are shown in Figure 3; detailed results are in Tables 3 and 4 in the appendix.
Figure 3(a)-(c) show comparative results using synthesis time as the metric. For each approach, we sort the time taken to synthesize the programs. A position N on the X-axis corresponds to the program synthesized in the N-th longest percentile time of all the programs. Lines terminate at the point at which the approach fails to synthesize the corresponding program. In general, DeepCoder, PCCoder, and NetSyn can synthesize up to 30% of programs within a few seconds for all program lengths we tested. As expected, synthesis time increases as an approach attempts to synthesize more difficult programs. DeepCoder and PCCoder usually find solutions faster than NetSyn, and synthesis time tends to increase for longer programs. However, when the search space is constrained to some maximum, NetSyn tends to synthesize more programs. Among the fitness functions, CF and LCS have comparable synthesis percentages and times, whereas FP usually performs worse. NetSyn synthesizes programs at percentages ranging from 50% (for length-10 programs) to as high as 98% (for length-5 programs). Moreover, the constant- and edit-distance-based approaches synthesize a smaller percentage of programs than CF and LCS. On the other hand, the oracle (which is impossible to implement in practice) always synthesizes all programs within a second. In summary, for any program length, NetSyn synthesizes more programs than either DeepCoder or PCCoder, although it takes more time to do so.
Figure 3(d)-(f) show comparative results using our proposed metric: search space used. For each test program, we count the number of candidate programs searched before the experiment concludes, either by finding a correct program or by exceeding the threshold. The number of candidate programs searched is expressed as a percentage of the maximum search space threshold, i.e., 3,000,000. For all approaches, up to 30% of the programs can be synthesized by searching less than 2% of the maximum search space. Search space use increases as an approach tries to synthesize more programs. In general, DeepCoder and PCCoder search more candidate programs than CF- or LCS-based NetSyn. For example, for synthesizing programs of length 5, DeepCoder and PCCoder use 37% and 33% of the search space to synthesize 40% and 50% of the programs, respectively. In comparison, NetSyn can synthesize upwards of 90% of the programs using only 30% of the search space. In other words, NetSyn is more efficient at generating and searching likely target programs. Even for length-10 programs, NetSyn can generate 70% of the programs using only 24% of the maximum search space. In contrast, DeepCoder and PCCoder cannot synthesize more than 50% and 60% of the programs, respectively, even using the maximum search space. The constant- and edit-distance-based approaches always use more search space than CF or LCS. In summary, NetSyn's synthesis technique is more efficient than both DeepCoder and PCCoder in how it generates and searches candidate programs to find a solution.
4.2 Characterization of NetSyn
Next, we characterize the effect of different fitness functions, neighborhood search algorithms, and DSL function types on the synthesis process. To explain the details of different choices, we show the results in this section based on programs of length 4. However, our general observations hold for longer length programs.
To synthesize a particular program, we ran NetSyn 50 times. Thus, for 100 testing programs of a particular length, we ran a total of 5,000 experiments. Figure 4(a) shows the percentage of those experiments in which the target program was synthesized, partitioned by singleton and list types. NetSyn synthesized at the highest percentage (85%) when the CF or LCS fitness function is used, whether NS is used or not. Moreover, BFS-based NS tends to produce more equivalent programs. In comparison, FP caused NetSyn to synthesize programs at a lower percentage (46%), performing roughly an order of magnitude worse for singleton programs. This is caused by the difficulty of predicting the probabilities of all functions when the output is a single integer, which carries less information than a list-type output. For list programs, the synthesis percentage of each approach is comparable. For longer programs, the synthesis percentage decreases, but the synthesis ratio between singleton and list programs is similar.
Figure 4(b) shows the effect of the three neural network input options. The FP fitness function works with only one of them and hence is not shown here. In general, using only the candidate program's examples is the most effective option. This is counterintuitive because the other options contain more information. However, our analysis found that although the other options had lower synthesis percentages, they expedited the synthesis time. In other words, the other options tend to specialize for particular types of programs.
Table 2. Number of unique test programs synthesized (out of 50 of each type), by fitness function.

Fitness Function  List  Singleton
CF                50    49
LCS               50    49
FP                50    13
Table 2 shows how many unique programs the different approaches were able to synthesize. Both the CF and LCS fitness functions enabled NetSyn to synthesize 99 of 100 unique programs. The one program that NetSyn was not able to synthesize (#37) contains the DELETE and ACCESS functions. DELETE deletes all occurrences of a number in a list, whereas ACCESS returns the n-th element of a list. Both functions are difficult to predict because their input-output behavior depends on both arguments. Program #11 also contains these functions; NetSyn was able to synthesize it 15 out of 50 times (the second-lowest synthesis percentage after #37). The FP-based fitness function was able to synthesize 63 of 100 unique programs. All three approaches correctly synthesized every list program. This implies that singleton programs are harder to synthesize.
Figure 4(c)-(e) show the synthesis percentage for different programs and fitness functions. Programs 1 to 50 are singleton programs and have lower synthesis percentages for all three fitness function choices. In particular, the FP-based approach has a low synthesis percentage for singleton programs. Functions 1 to 12 produce a singleton integer and tend to lower the synthesis percentage of any program that contains them. To shed more light on this issue, Figure 5 shows the synthesis percentage across different functions. The synthesis percentage for a function is at least 40% for the CF- and LCS-based approaches, whereas for the FP-based approach, four functions cannot be synthesized at all. The functions corresponding to each number are listed in the supplementary material.
Like DeepCoder and PCCoder Zohar and Wolf (2018b), NetSyn assumes a priori knowledge of the target program length and maintains all genes at that target length. However, we experimented with generating the initial genes with gene lengths following a normal distribution and also allowing crossover and mutation to change gene length. We found that this increased the time to solution and reduced the synthesis percentage. This effect was particularly pronounced if the target program length was two or more functions larger or smaller than the mean initial gene length. In future work we will explore how to predict the program length and remove the need for this a priori knowledge.
5 Conclusion
In this paper, we presented a genetic algorithm-based framework for program synthesis called NetSyn. To the best of our knowledge, it is the first work that uses a neural network to automatically generate an evolutionary algorithm's fitness function in the context of program synthesis. We proposed two neural network fitness functions and contrasted them against a fitness function based on Balog et al. (2017). NetSyn is also novel in that it uses neighborhood search to expedite the convergence of the evolutionary process. We compared our approach against two state-of-the-art program synthesis systems, DeepCoder and PCCoder, and showed that NetSyn synthesizes equivalent programs at a higher rate, especially for singleton programs.
Appendix A Appendix A: NetSyn’s DSL
In this appendix, we provide more details about the list DSL that NetSyn uses to generate programs. Our list DSL has only two implicit data types: integer and list of integer. A program in this DSL is a sequence of statements, each of which is a call to one of the 41 functions defined in the DSL. There are no explicit variables, conditionals, or explicit control flow operations in the DSL, although many of the DSL’s functions are high-level and contain implicit conditionals and control flow within them. Each of the 41 functions takes one or two arguments, each of integer or list of integer type, and returns exactly one output, also of integer or list of integer type. Given these rules, there are 10 possible function signatures; however, only 5 of these signatures occur among the functions we chose for the DSL. The following sections are organized by function signature; each section describes all the DSL functions having that signature.
Instead of named variables, each time a function call requires an argument of a particular type, our DSL’s runtime searches backwards and finds the most recently executed function that returns an output of the required type and then uses that output as the current function’s input. Thus, for the first statement in the program, there will be no previous function’s output from which to draw the arguments for the first function. When there is no previous output of the correct type, then our DSL’s runtime looks at the arguments to the program itself to provide those values. Moreover, it is possible for the program’s inputs to not provide a value of the requested type. In such cases, the runtime provides a default value for missing inputs, 0 in the case of integer and an empty list in the case of list of integer. For example, let us say that a program is given a list of integer as input and that the first three functions called in the program each consume and produce a list of integer. Now, let us assume that the fourth function called takes an integer and a list of integer as input. The list of integer input will use the list of integer output from the previous function call. The DSL runtime will search backwards and find that none of the previous function calls produced integer output and that no integer input is present in the program’s inputs either. Thus, the runtime would provide the value 0 as the integer input to this fourth function call. The final output of a program is the output of the last function called.
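The backward-search binding rule above can be sketched as follows. The helper names (`resolve_arg`, `type_of`) are hypothetical and types are modeled as plain strings, so this illustrates the rule rather than NetSyn’s actual implementation:

```python
# Minimal sketch of the DSL runtime's argument-binding rule (hypothetical
# helper names; types modeled as the strings "int" and "list").
def type_of(value):
    return "list" if isinstance(value, list) else "int"

def resolve_arg(wanted_type, prior_outputs, program_inputs):
    """Bind one argument of `wanted_type` for the next function call."""
    # 1. Most recently executed statement whose output has the right type.
    for value in reversed(prior_outputs):
        if type_of(value) == wanted_type:
            return value
    # 2. Otherwise, fall back to the program's own inputs.
    for value in program_inputs:
        if type_of(value) == wanted_type:
            return value
    # 3. Otherwise, the DSL default: 0 for int, [] for list of int.
    return 0 if wanted_type == "int" else []
```

For the fourth-call example above, `resolve_arg("int", [[1, 2], [3]], [[5, 6]])` finds no integer among prior outputs or program inputs and returns the default `0`.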
Thus, our language is defined such that any program consisting only of calls to the 41 functions provided by the DSL is valid by construction. Each of the 41 functions is guaranteed to finish in finite time and there are no looping constructs in the DSL, so programs in our DSL are guaranteed to terminate. This property frees our system from having to monitor executing programs to detect potentially infinite loops. Moreover, as long as the implementations of those 41 functions are secure and free of memory corruption, programs in our DSL are likewise guaranteed to be secure and not crash, so no sandboxing techniques are required. When our system performs crossover between two candidate programs, any arbitrary cut points in the two parent programs result in a child program that is also valid by construction. Thus, our system need not test that programs created via crossover or mutation are valid.
In the following sections, [] indicates the list of integer type and int indicates the integer type. The type after the arrow (→) indicates the output type of the function.
A.1 Functions with the Signature [] → int
There are 9 functions in our DSL that take a list of integer as input and return an integer as output.
HEAD (Function 6)
This function returns the first item in the input list. If the list is empty, a 0 is returned.
LAST (Function 7)
This function returns the last item in the input list. If the list is empty, a 0 is returned.
MINIMUM (Function 8)
This function returns the smallest integer in the input list. If the list is empty, a 0 is returned.
MAXIMUM (Function 9)
This function returns the largest integer in the input list. If the list is empty, a 0 is returned.
SUM (Function 11)
This function returns the sum of all the integers in the input list. If the list is empty, a 0 is returned.
COUNT (Functions 2–5)
This function returns the number of items in the list that satisfy the criterion specified by the additional lambda. Each possible lambda is counted as a different function; thus, there are 4 COUNT functions, with the lambdas: >0, <0, odd, even.
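As an illustration, the functions in this section can be sketched in a few lines of Python, including the empty-list defaults the DSL specifies (the function names are ours, not NetSyn’s):

```python
# Hedged sketches of the [] -> int functions above, with the DSL's
# empty-list defaults (0 for every function on an empty input).
def head(xs):    return xs[0] if xs else 0
def last(xs):    return xs[-1] if xs else 0
def minimum(xs): return min(xs) if xs else 0
def maximum(xs): return max(xs) if xs else 0
def sum_(xs):    return sum(xs)  # sum of [] is already 0

# The four COUNT variants, one per lambda (>0, <0, odd, even):
count_pos  = lambda xs: sum(1 for x in xs if x > 0)
count_neg  = lambda xs: sum(1 for x in xs if x < 0)
count_odd  = lambda xs: sum(1 for x in xs if x % 2 != 0)
count_even = lambda xs: sum(1 for x in xs if x % 2 == 0)
```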
A.2 Functions with the Signature [] → []
There are 21 functions in our DSL that take a list of integer as input and produce a list of integer as output.
REVERSE (Function 29)
This function returns a list containing all the elements of the input list but in reverse order.
SORT (Function 35)
This function returns a list containing all the elements of the input list in sorted order.
MAP (Functions 19–28)
This function applies a lambda to each element of the input list and creates the output list from the outputs of those lambdas. Let x_n be the nth element of the input list to MAP and let y_n be the nth element of the output list from MAP. MAP produces an output list such that y_n = lambda(x_n) for all n. There are 10 MAP functions, corresponding to the following lambdas: +1, −1, *2, *3, *4, /2, /3, /4, *(−1), ^2.
FILTER (Functions 14–17)
This function returns a list containing only those elements of the input list that satisfy the criterion specified by the additional lambda. The output list preserves the input list’s ordering of the elements satisfying the criterion. There are 4 FILTER functions, with the lambdas: >0, <0, odd, even.
SCANL1 (Functions 30–34)
Let x_n be the nth element of the input list to SCANL1 and let y_n be the nth element of the output list from SCANL1. This function produces the output list as follows: y_1 = x_1, and y_n = lambda(y_{n−1}, x_n) for n > 1.
There are 5 SCANL1 functions, corresponding to the following lambdas: +, −, *, min, max.
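The SCANL1 recurrence above can be sketched as a running fold:

```python
# Sketch of SCANL1: the output starts with the input's first element and
# then folds the lambda forward over the rest of the list.
def scanl1(op, xs):
    """y[0] = x[0]; y[n] = op(y[n-1], x[n]) for n > 0."""
    out = []
    for x in xs:
        out.append(x if not out else op(out[-1], x))
    return out
```

For example, `scanl1` with + over [1, 2, 3] yields the running sums [1, 3, 6], and with min over [3, 1, 2] yields [3, 1, 1].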
A.3 Functions with the Signature int,[] → []
There are 4 functions in our DSL that take an integer and a list of integer as input and produce a list of integer as output.
TAKE (Function 36)
This function returns a list consisting of the first N items of the input list where N is the smaller of the integer argument to this function and the size of the input list.
DROP (Function 13)
This function returns a list in which the first N items of the input list are omitted, where N is the integer argument to this function.
DELETE (Function 12)
This function returns a list in which all the elements of the input list having value X are omitted where X is the integer argument to this function.
INSERT (Function 18)
This function returns a list where the value X is appended to the end of the input list, where X is the integer argument to this function.
A.4 Functions with the Signature [],[] → []
ZIPWITH is the only function in our DSL that takes two lists of integer and returns another list of integer; counting each lambda separately, it accounts for 5 of the 41 functions.
ZIPWITH (Functions 37–41)
This function returns a list whose length equals the length of the shorter input list. Let x_n and y_n be the nth elements of the first and second input lists, respectively, and let z_n be the nth element of the output list from ZIPWITH. This function creates the output list such that z_n = lambda(x_n, y_n). There are 5 ZIPWITH functions, corresponding to the following lambdas: +, −, *, min, max.
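As a sketch, ZIPWITH’s truncate-to-the-shorter-list behavior matches Python’s built-in zip:

```python
# Sketch of ZIPWITH; Python's zip already stops at the shorter input,
# matching the output-length rule described above.
def zipwith(op, xs, ys):
    """z[n] = op(x[n], y[n]), for n up to the shorter input's length."""
    return [op(x, y) for x, y in zip(xs, ys)]
```

For example, `zipwith(min, [1, 5, 3], [2, 2])` returns `[1, 2]`.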
A.5 Functions with the Signature int,[] → int
There are two functions in our DSL that take an integer and a list of integer and return an integer.
ACCESS (Function 1)
This function returns the Nth element of the input list, where N is the integer argument to this function. If N is less than 0 or greater than or equal to the length of the input list, 0 is returned.
SEARCH (Function 10)
This function returns the position in the input list at which the value X is first found, where X is the integer argument to this function. If the value is not present in the list, −1 is returned.
Appendix B: System Details
B.1 Hyperparameters for the Models and Genetic Algorithm

Evolutionary Algorithm:
- Gene pool size: 100
- Number of reserved genes in each generation: 20
- Maximum number of generations: 30,000
- Gene length: 4
- Crossover rate: 40%
- Mutation rate: 30%

Neural Network Training:
- Loss: Categorical Cross-Entropy
- Optimizer: Adam
- 3 hidden layers with 48, 24, and 12 neurons
- Activation function: Sigmoid in the hidden layers and Softmax in the output layer
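As an illustration of the configuration above, here is a minimal NumPy forward pass with the stated layer sizes and activations. The input width (36, one of the widths listed in B.2) and the number of output classes (5) are illustrative assumptions; actual training would use the categorical cross-entropy loss and Adam optimizer listed above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))  # stabilized
    return e / e.sum(axis=-1, keepdims=True)

def forward(x, layers):
    """Sigmoid through the hidden layers, Softmax at the output layer."""
    *hidden, final = layers
    h = x
    for W, b in hidden:
        h = sigmoid(h @ W + b)
    W, b = final
    return softmax(h @ W + b)

# Illustrative shapes: 36-dim input, hidden sizes 48/24/12 as listed
# above, and an assumed 5 output classes.
rng = np.random.default_rng(0)
sizes = [36, 48, 24, 12, 5]
layers = [(0.1 * rng.standard_normal((a, b)), np.zeros(b))
          for a, b in zip(sizes, sizes[1:])]
probs = forward(rng.standard_normal((8, 36)), layers)  # shape (8, 5)
```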

B.2 Generation of the Training Dataset
For our two approaches ( and ), we created 3 types of data sets for 3 different models (, , ). We used 50,000 programs as base programs and, for comparison, chose 150 other programs. These two sets of programs are compared with each other to obtain the number of common functions or the longest common subsequence between them. For each comparison, we created 100 input-output examples, leading to a total of 750 million data points. For the model, we generated the dataset from the base programs alone, but for the and models we needed an additional output, which we created by passing the inputs to the comparison program. Each input or output was padded to a fixed 12 dimensions and the results were concatenated. For the model, we took the absolute difference between the input and the two corresponding outputs and also added the dimension difference of the two outputs. Thus, the input dimensions for the three models were 24 (), 36 (), and 25 ().
With our training programs and the given input-output examples, we created our dataset. We randomized the dataset and then split it into training and test sets at a ratio of 3:1. Data were normalized before being fed into the neural network.
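The two comparison labels used above, number of common functions and longest common subsequence, can be sketched as follows over two programs’ function IDs. Treating "common functions" as a set intersection is our assumption, since the text does not define it precisely:

```python
# Longest common subsequence of two function-ID sequences, via standard
# dynamic programming.
def lcs_length(a, b):
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

# "Number of common functions" approximated as a set intersection
# (an assumption on our part).
def common_functions(a, b):
    return len(set(a) & set(b))
```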
B.3 Training of the Neural Network
We used 3 hidden layers in our models. The models predict the number of common functions or the longest common subsequence between the target program and the programs generated by the EA, using input-output examples. We treat this value as a classification output.
For the DeepCoder model, we used 3 hidden layers with 256 neurons each. We passed the input through an embedding layer connected to the input neurons, averaged over the input-output examples, and predicted function probabilities.
B.4 Dead Code Elimination
Dead code elimination (DCE) is a classic compiler technique to remove code from a program that has no effect on the program’s output Debray et al. (2000). Dead code is possible in our list DSL if the output of a statement is never used. We implemented DCE in NetSyn by tracking the input/output dependencies between statements and eliminating those statements whose outputs are never used. NetSyn uses DCE during candidate program generation and during crossover/mutation to ensure that the effective length of the program is not less than the target program length due to the presence of dead code.
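A minimal sketch of such a DCE pass, under the backward-search binding rule of Appendix A (the statement representation and helper names are ours, not NetSyn’s):

```python
# Hedged sketch of DCE for a straight-line program in a NetSyn-like DSL.
# Each statement is (output_type, [input_types]); each input draws from
# the most recent earlier statement producing that type (program inputs
# are ignored here for brevity).
def dead_code_eliminate(stmts):
    n = len(stmts)
    # For each statement, record which earlier statement feeds each input.
    deps = [[] for _ in range(n)]
    for i, (_, in_types) in enumerate(stmts):
        for t in in_types:
            for j in range(i - 1, -1, -1):
                if stmts[j][0] == t:
                    deps[i].append(j)
                    break
    # The last statement produces the program output; propagate liveness
    # backwards and drop every statement whose output is never used.
    live, stack = set(), [n - 1]
    while stack:
        i = stack.pop()
        if i not in live:
            live.add(i)
            stack.extend(deps[i])
    return [s for i, s in enumerate(stmts) if i in live]
```

For example, in a three-statement program where the middle statement produces an int that no later statement consumes, that statement is removed and the other two are kept.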
Appendix C: Additional Results
Program Length  Method  Synthesis Percentage  Time Required to Synthesize (in seconds):  10%  20%  30%  40%  50%  60%  70%  80%  90%  100%
5 
48%  1s  60s  388s  422s              
77%  1s  7s  112s  272s  345s  393s  483s        
40%  1s  1s  2s  126s              
51%  1s  1s  6s  66s  357s            
64%  1s  1s  4s  174s  525s  803s          
98%  1s  1s  4s  43s  112s  349s  600s  690s  838s    
96%  1s  1s  3s  46s  131s  392s  671s  768s  874s    
98%  1s  1s  4s  43s  115s  344s  612s  704s  847s    
96%  1s  1s  3s  47s  138s  400s  660s  753s  866s    
100%  1s  1s  1s  1s  1s  1s  1s  1s  1s  1s  
6 
43%  1s  1s  314s  397s              
73%  1s  1s  82s  229s  300s  368s  462s        
45%  1s  1s  1s  14s              
75%  1s  1s  1s  4s  190s  233s  619s        
63%  1s  1s  3s  306s  464s  919s          
87%  1s  1s  4s  130s  574s  716s  870s  956s      
88%  1s  1s  3s  183s  533s  732s  872s  918s      
87%  1s  1s  4s  127s  563s  695s  875s  937s      
88%  1s  1s  3s  185s  528s  754s  855s  899s      
100%  1s  1s  1s  1s  1s  1s  1s  1s  1s  1s  
7 
44%  1s  1s  1s  685s              
68%  1s  1s  1s  249s  354s  433s          
45%  1s  1s  1s  13s              
52%  1s  1s  2s  11s  635s            
58%  1s  1s  1s  393s  566s            
81%  1s  1s  1s  176s  676s  1062s  1134s  1180s      
78%  1s  1s  1s  127s  609s  889s  956s        
81%  1s  1s  1s  178s  670s  1094s  1112s  1156s      
79%  1s  1s  1s  126s  624s  876s  976s        
100%  1s  1s  1s  1s  1s  1s  1s  1s  1s  1s  
8 
43%  1s  1s  1s  1371s              
65%  1s  1s  1s  297s  401s  534s          
56%  1s  1s  1s  1s  29s            
57%  1s  1s  1s  1s  15s            
49%  1s  1s  1s  587s              
68%  1s  1s  1s  748s  1545s  1702s          
69%  1s  1s  1s  404s  988s  1044s          
68%  1s  1s  1s  763s  1578s  1668s          
69%  1s  1s  1s  392s  969s  1013s          
100%  1s  1s  1s  1s  1s  1s  1s  1s  1s  1s  
9 
41%  1s  1s  1s  1387s              
67%  1s  1s  1s  288s  429s  574s          
53%  1s  1s  1s  1s  56s            
55%  1s  1s  1s  1s  107s            
52%  1s  1s  1s  1s  614s            
64%  1s  1s  1s  1584s  2544s  2846s          
67%  1s  1s  1s  837s  1055s  1195s          
64%  1s  1s  1s  1603s  2473s  2790s          
67%  1s  1s  1s  846s  1029s  1207s          
100%  1s  1s  1s  1s  1s  1s  1s  1s  1s  1s  
10 
41%  1s  1s  1s  1403s              
68%  1s  1s  1s  208s  459s  591s          
42%  1s  1s  1s  67s              
48%  1s  1s  1s  1011s              
49%  1s  1s  1s  517s              
55%  1s  1s  1s  625s  1640s            
66%  1s  1s  1s  164s  957s  1121s          
55%  1s  1s  1s  638s  1649s            
66%  1s  1s  1s  168s  978s  1099s          
100%  1s  1s  1s  1s  1s  1s  1s  1s  1s  1s  

Figure 3 shows detailed numerical results using synthesis time as the metric. Columns 10% to 100% show the duration of time (in seconds) it takes to synthesize the corresponding percentage of programs.
Program Length  Method  Search Space Used to Synthesize:  10%  20%  30%  40%  50%  60%  70%  80%  90%  100%
5 
1%  9%  59%  64%              
1%  1%  17%  41%  52%  60%  73%        
1%  1%  1%  37%              
1%  1%  1%  7%  33%            
1%  1%  1%  2%  5%  23%          
1%  1%  1%  3%  7%  10%  15%  21%  30%    
1%  1%  1%  2%  5%  9%  14%  21%  29%    
1%  1%  1%  2%  7%  9%  15%  21%  30%    
1%  1%  1%  2%  5%  9%  13%  20%  28%    
1%  1%  1%  1%  1%  1%  1%  1%  1%  1%  
6 
1%  1%  46%  58%              
1%  1%  12%  33%  43%  53%  67%        
1%  1%  1%  4%              
1%  1%  1%  1%  33%  55%  73%        
1%  1%  1%  1%  5%  28%          
1%  1%  1%  2%  5%  12%  18%  25%      
1%  1%  1%  2%  5%  10%  17%  24%      
1%  1%  1%  1%  4%  11%  18%  24%      
1%  1%  1%  2%  4%  10%  16%  23%      
1%  1%  1%  1%  1%  1%  1%  1%  1%  1%  
7 
1%  1%  1%  82%              
1%  1%  1%  34%  48%  59%          
1%  1%  1%  3%              
1%  1%  1%  1%  38%            
1%  1%  1%  1%  14%            
1%  1%  1%  2%  3%  6%  18%  30%      
1%  1%  1%  2%  5%  11%  20%        
1%  1%  1%  2%  2%  6%  17%  29%      
1%  1%  1%  1%  5%  10%  20%        
1%  1%  1%  1%  1%  1%  1%  1%  1%  1%  
8 
1%  1%  1%  93%              
1%  1%  1%  37%  50%  67%          
1%  1%  1%  1%  6%            
1%  1%  1%  1%  1%            
1%  1%  1%  1%              
1%  1%  1%  3%  7%  17%          
1%  1%  1%  2%  6%  13%          
1%  1%  1%  3%  7%  16%          
1%  1%  1%  2%  5%  12%          
1%  1%  1%  1%  1%  1%  1%  1%  1%  1%  
9 
1%  1%  1%  1%  1%  9%          
1%  1%  1%  1%  1%  7%          
1%  1%  1%  1%  1%  9%          
1%  1%  1%  1%  1%  7%          
1%  1%  1%  1%  4%            
1%  1%  1%  1%  3%  11%  17%        
1%  1%  1%  1%  1%  9%          
1%  1%  1%  3%  7%  16%          
1%  1%  1%  2%  5%  12%          
1%  1%  1%  1%  1%  1%  1%  1%  1%  1%  
10 
1%  1%  1%  90%              
1%  1%  1%  20%  43%  56%          
1%  1%  1%  9%              
1%  1%  1%  61%              
1%  1%  1%  4%              
1%  1%  1%  6%  16%            
1%  1%  1%  4%  12%  24%          
1%  1%  1%  6%  16%            
1%  1%  1%  4%  11%  24%          
1%  1%  1%  1%  1%  1%  1%  1%  1%  1%  

In the experimental results section, we presented synthesis times for NetSyn that include both GA time and NNFF inferencing time. Since inferencing time is significant, we speculate that further optimization may be possible using future deep learning accelerators Shawahna et al. (2019), which could eliminate some portion of the overall neural network inferencing time. We therefore show two versions of NetSyn: the original non-optimized time, i.e., NetSyn’s total wall-clock time on our hardware testbed, and an optimized version obtained by subtracting the inferencing time. The results are presented in Table 5.
Program Length  MP System  Synthesis Percentage  Time Required to Synthesize (in seconds):  10%  20%  30%  40%  50%  60%  70%  80%  90%  100%
5  40%  1s  1s  2s  126s              
51%  1s  1s  6s  66s  357s            
64%  1s  1s  4s  174s  525s  803s          
64%  1s  1s  1s  164s  515s  793s          
98%  1s  1s  4s  43s  112s  349s  600s  690s  838s    
98%  1s  1s  1s  11s  33s  141s  321s  421s  537s    
96%  1s  1s  3s  46s  131s  392s  671s  768s  874s    
96%  1s  1s  1s  13s  39s  181s  380s  475s  569s    
6  45%  1s  1s  1s  14s              
75%  1s  1s  1s  4s  190s  233s  619s        
63%  1s  1s  3s  306s  464s  919s          
63%  1s  1s  1s  296s  457s  909s          
87%  1s  1s  4s  130s  574s  716s  870s  956s      
87%  1s  1s  1s  56s  307s  429s  579s  658s      
88%  1s  1s  3s  183s  533s  732s  872s  918s      
88%  1s  1s  1s  66s  281s  435s  569s  612s      
7  45%  1s  1s  1s  13s              
52%  1s  1s  2s  11s  635s            
58%  1s  1s  1s  393s  566s            
58%  1s  1s  1s  383s  556s            
81%  1s  1s  1s  176s  676s  1062s  1134s  1180s      
81%  1s  1s  1s  92s  445s  775s  834s  886s      
78%  1s  1s  1s  127s  609s  889s  956s        
78%  1s  1s  1s  49s  349s  593s  655s        
8  56%  1s  1s  1s  1s  29s            
57%  1s  1s  1s  1s  15s            
49%  1s  1s  1s  587s              
49%  1s  1s  1s  577s              
68%  1s  1s  1s  748s  1545s  1702s          
68%  1s  1s  1s  571s  1258s  1401s          
69%  1s  1s  1s  404s  988s  1044s          