TF-Coder: Program Synthesis for Tensor Manipulations

Kensen Shi (Google Brain, United States), David Bieber (Google Brain, United States), and Rishabh Singh (Google Brain, United States)

The success and popularity of deep learning is on the rise, partially due to powerful deep learning frameworks such as TensorFlow and PyTorch that make it easier to develop deep learning models. However, these libraries also come with steep learning curves, since programming in these frameworks is quite different from traditional imperative programming with explicit loops and conditionals. In this work, we present a tool called TF-Coder for programming by example in TensorFlow. TF-Coder uses a bottom-up weighted enumerative search, with value-based pruning of equivalent expressions and flexible type- and value-based filtering to ensure that expressions adhere to various requirements imposed by the TensorFlow library. We also train models that predict TensorFlow operations from features of the input and output tensors and natural language descriptions of tasks, and use the models to prioritize relevant operations during the search. TF-Coder solves 63 of 70 real-world tasks within 5 minutes, often finding solutions that are simpler than those written by TensorFlow experts.

1. Introduction

Deep learning techniques have resulted in recent breakthroughs in many domains including computer vision, audio processing, natural language processing, and robotics (LeCun et al., 2015). These breakthroughs arise through a combination of advancements including new algorithmic ideas, the availability of large labeled datasets, and specialized hardware for efficient training. Playing an equally important role are deep learning frameworks such as TensorFlow (Abadi et al., 2016), PyTorch (Paszke et al., 2017), MXNet (Chen et al., 2015), and CNTK (Seide and Agarwal, 2016) that enable machine learning researchers and engineers to develop and iterate on such models more effectively.

While these deep learning frameworks have greatly eased the development and training of complex neural network models, they also have a steep learning curve, since the programming paradigm of computing over tensors using a fixed set of library functions is quite different from the traditional imperative programming paradigm. For instance, vectorization techniques are used to turn explicit loops into more efficient tensor operations, and special operations like tf.where are used in place of traditional if/else conditionals. Most deep learning models require various tensor manipulations for data processing or cleaning, custom loss functions, and accuracy metrics, that all must be implemented within the constraints of the chosen deep learning framework. Furthermore, these frameworks offer a huge amount of functionality, which makes them powerful but potentially difficult to navigate. For instance, there are nearly 2000 distinct symbols in TensorFlow (including aliases), and about 500 of them are tensor-manipulating operations, so finding the right ones to use for a given task can be a challenge itself.

Given the increasing popularity of deep learning, combined with the relative difficulty of writing neural models, many beginners and even experienced software engineers seek assistance from others by asking questions on forums like StackOverflow. Tensor manipulations are a common difficulty, and such questions typically include a natural language description of what the asker is trying to accomplish, along with an input/output example illustrating the desired computation or transformation. This is usually enough information for a generous expert to answer the question by providing code that implements the desired functionality, but not all questions are lucky enough to receive a correct answer or even an answer at all.

Inspired by this need, we present TF-Coder, a programming by example system to automatically synthesize tensor manipulation programs from input/output examples and natural language descriptions. Our approach builds upon the bottom-up enumerative algorithm used in the previous work Transit (Udupa et al., 2013). We introduce per-operation weights to the prior algorithm, allowing TF-Coder to enumerate over TensorFlow expressions in order of increasing complexity. TF-Coder also incorporates pruning of expressions that behave equivalently for the given inputs (as in the prior work), and a new, flexible, type- and value-based filtering system that handles arbitrary constraints imposed by the TensorFlow library, such as “the two tensor arguments must have broadcastable shapes.” Finally, we introduce two machine learning models that choose operations to prioritize during the search, conditioned on features of the input and output tensors and a natural language description of the task. These models help tailor the search process to fit the particular synthesis task at hand.

The domain of tensor manipulations has not been considered in the program synthesis literature to our knowledge. It is particularly challenging as it encompasses a huge variety of tasks, including reshapes, filters, aggregations, maps, indexing, slicing, grouping, sorting, mathematical operations, and combinations of them. When mathematical operations (e.g., tensor products) are involved, the output tensor typically has no overlapping entries with the input tensors, ruling out synthesis approaches that are informed by partial matches between the inputs and outputs, as is common in manipulation of tables (Bavishi et al., 2019), data structures (Feser et al., 2015b), and strings (Gulwani, 2011). A key takeaway from our work is that the techniques we do use are particularly effective for this domain, enabling an enumerative search to scale to solve practical problems within seconds.

We evaluate TF-Coder on 70 real-world tensor transformation tasks from StackOverflow and from an industrial setting. TF-Coder can successfully synthesize solutions to 63 tasks in 12 seconds on average, while Transit only solves 44 tasks. Moreover, the trained models lead to significantly faster synthesis times (32.4% faster on average), compared to not using the models. We also observed that TF-Coder often produces solutions that are simpler and more elegant than those written by TensorFlow experts (including the authors of this paper).

This paper makes the following key contributions: 1) We introduce TF-Coder, the first programming by example system for synthesizing tensor manipulations in TensorFlow from input/output examples. 2) We present a new weighted enumerative search algorithm that uses a new two-stage filtering approach to enforce arbitrary preconditions required by the operations. 3) We develop two machine learning models that predict useful TensorFlow operations given the example tensors and a natural language description of the task, to guide the weighted enumerative search. 4) We evaluate TF-Coder on 70 real-world tasks taken from StackOverflow and an industrial setting, where it outperforms prior synthesis techniques.

(a) Labeling distinct or duplicate values in a tensor.
(b) Indexing columns of a 3D tensor using indices from a 1D tensor.
Figure 1. Two example tensor transformation tasks in StackOverflow posts.

2. Motivating Examples

We now present some tensor manipulation questions posted to StackOverflow, an online programming help forum.

2.1. Example 1

Consider the StackOverflow question shown in Figure 1(a). The user has a 1-dimensional tensor of length $N$ containing $M$ distinct values, and they want to create another tensor of the same shape containing values between $0$ and $M - 1$, such that both tensors have duplicate values at the same locations. The user provides a clarifying example: the tensor [45, 58, 72, 33, 45, 58, 58, 33] should be converted to [0, 1, 2, 3, 0, 1, 1, 3]. From this example, TF-Coder automatically synthesizes a solution program in 0.8 seconds:

output = tf.unique_with_counts(in1)[1]

Even though the solution is relatively simple, it would be quite difficult for the question asker to find that solution without assistance, considering that there are about 500 tensor-manipulating operations in TensorFlow. Even searching for the function by name would be difficult, as the name “unique_with_counts” bears little resemblance to the user’s description of the task. In such scenarios, TF-Coder can help users find relevant TensorFlow operations automatically, reducing the time spent digging through documentation.
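To make the behavior of the synthesized expression concrete, the following pure-Python sketch mimics what the index component of tf.unique_with_counts computes: each element is mapped to the position of its value among the distinct values, numbered in order of first appearance. The helper name first_occurrence_ids is ours, chosen for illustration, and is not part of TensorFlow.

```python
def first_occurrence_ids(values):
    """Map each element to the index of its value among the distinct
    values, numbered in order of first appearance."""
    ids = {}   # distinct value -> index assigned at first appearance
    out = []
    for v in values:
        if v not in ids:
            ids[v] = len(ids)
        out.append(ids[v])
    return out


example_output = first_occurrence_ids([45, 58, 72, 33, 45, 58, 58, 33])
```

On the example from the question, this reproduces the requested output tensor.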

When we first came across this question on StackOverflow, it was four days old with zero answers. We posted TF-Coder’s solution as an answer, which was accepted by the poster.

2.2. Example 2

The StackOverflow question in Figure 1(b) involves a more difficult problem. Given two input tensors in1 and in2, the question asker wants an output tensor where the $i$-th element is equal to the in2[i]-th column of in1[i]. To specify their intent more clearly, the asker also provides an input/output example as shown in the figure.

On this complex problem involving multiple input tensors and TensorFlow operations, TF-Coder finds a solution in 28 seconds:

output = tf.squeeze(tf.gather(
    in1, tf.expand_dims(in2, 1), axis=-1, batch_dims=1))

TF-Coder’s solution is actually simpler than the accepted StackOverflow answer. Thus, TF-Coder can help users find elegant solutions for difficult tensor transformations.
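The intended transformation can be stated as a loop-based reference in plain Python, which may help when checking a vectorized solution by hand. This is only a sketch: the function name select_columns and the small tensors below are invented for illustration and are not the values from the StackOverflow post.

```python
def select_columns(in1, in2):
    """Reference semantics: out[i][j] = in1[i][j][in2[i]], i.e. the
    in2[i]-th column of the matrix in1[i]."""
    return [[row[col] for row in matrix] for matrix, col in zip(in1, in2)]


# Hypothetical 3D input (2 matrices of shape 2x3) and 1D column indices.
in1 = [[[1, 2, 3], [4, 5, 6]],
       [[7, 8, 9], [10, 11, 12]]]
in2 = [2, 0]
columns = select_columns(in1, in2)
```

Here column 2 is taken from the first matrix and column 0 from the second, which is the same per-batch indexing that the tf.gather call with batch_dims=1 expresses without explicit loops.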

2.3. Observations

These StackOverflow questions follow a larger pattern: many tensor transformations are ambiguous if described using natural language alone, so it is natural to provide both a textual description of the desired transformation and concrete input/output example tensors to clarify the problem. Another interesting property is that most of the time, only one input/output example is necessary, since tensors can be expanded with more entries to resolve ambiguities.

There are over 50,000 questions on StackOverflow containing the text “TensorFlow.” While the majority of these ask about installation issues or deep learning in general, there are still many questions asking how to perform tensor manipulations or how to fix errors raised by the user’s code. Indeed, writing TensorFlow code can be challenging at times (even more so for beginners) due to the amount of information that the programmer must keep in mind. The shapes of tensors must be checked for compatibility under broadcasting rules, the conceptual meanings of the dimensions are crucial to ensure mathematical correctness, and data types of tensors must be tracked carefully (e.g., a tf.int32 tensor cannot be added to a tf.int64 tensor). Furthermore, these properties change as tensors are manipulated, leaving many opportunities for subtle bugs.

Inspired by these questions, we developed TF-Coder to automatically synthesize tensor manipulations in TensorFlow from input/output examples and natural language descriptions. Such a tool could help accelerate users’ TensorFlow development in several ways. In Section 2.1, we observed that TF-Coder can automatically find relevant TensorFlow operations, thus reducing the need to search through TensorFlow’s extensive documentation. Since TF-Coder’s solutions are guaranteed to be consistent with the provided input/output example, it can reduce the number of debugging cycles and lead to increased confidence in the code (much like a unit test). Finally, by finding simple and elegant solutions that the user may have overlooked, TF-Coder can even improve code quality and model efficiency. We strive to find solutions quickly, within seconds or at most a few minutes, so that the tool may be used interactively.

3. Synthesis with Enumerative Search

Motivated by the examples and discussion in Section 2, we now formalize the problem as illustrated in Figure 2.

3.1. Problem Formalization

Figure 2. Given an input/output example of a tensor manipulation, an optional natural language description, and optional scalar constants, TF-Coder synthesizes a composition of TensorFlow operations consistent with the example.

We assume a given task specification $\phi = \{(\mathcal{I}, \mathcal{O}), D, C\}$, where $(\mathcal{I}, \mathcal{O})$ is an input/output example, i.e., a list of input tensors $\mathcal{I}$ and the corresponding output tensor $\mathcal{O}$, $D$ is an optional natural language description of the task, and $C$ is an optional set of constants that may be useful for the task.

Our goal is to synthesize a program $P$ where $P(\mathcal{I}) = \mathcal{O}$. We note that TF-Coder can often synthesize programs directly from the input/output example without needing the additional $D$ and $C$ information, but $D$ and $C$ allow users to express their intent and obtain better synthesizer performance. The domain of programs considered by TF-Coder consists of single-line TensorFlow expressions, which may contain any of the following base values:

  • Python int, float, Boolean, and string literals

  • TensorFlow data types, e.g., tf.float32, tf.int64 etc.

  • Variables in1, in2, etc., to reference the input tensors

Furthermore, expressions may use the following operations, applied to the base values or composed with each other:

  • Supported TensorFlow function calls, e.g., tf.add(x, y) and tf.math.segment_max(data, segment_ids)

  • Creating a tuple from supported Python literals, e.g., (0, 1), or from other such tuples

  • Various forms of indexing and slicing of sequences and tensors, e.g., tensor[-1], tensor[1:], and tensor[:, 0:5]

Note that the TensorFlow operations specify their arguments because the search algorithm requires a fixed arity for each operation. Hence, some TensorFlow functions have multiple supported variations, e.g., 2-argument tf.gather(params, indices) and 4-argument tf.gather(params, indices, axis, batch_dims). In total, TF-Coder currently supports 123 TensorFlow operations for 99 distinct functions, plus 11 more operations for different forms of indexing, slicing, and tuple creation. These are listed in Appendix A.

In the following sections, we describe the weighted bottom-up enumerative search that powers TF-Coder. Starting with a set of initial values including input tensors and constants (which may be provided by the user or chosen heuristically), the search enumerates ways of applying operations to previously-explored values, to expand the set of known values. Values internally store enough information to recursively reconstruct the code expression that would produce the value. Thus, if the search encounters a value that matches the output tensor, the matching value’s code expression is a valid solution to the synthesis problem.

3.2. Weighted Value Search

Figure 3. Overview of the enumerative search algorithm. TF-Coder stores already-explored values organized by weight, initially just the input tensors and constants. It enumerates expressions in order of increasing weight. For a target expression weight (e.g., 76), it enumerates over operations and weights for the operation’s arguments, e.g., the operation tf.argmax(input, axis) has weight 36 and two arguments, so the remaining weight ($76 - 36 = 40$) is partitioned into two parts (e.g., $32 + 8$) representing the arguments’ weights. Options for the arguments are drawn from previously-explored values, and a Cartesian product with customizable filtering produces lists of arguments. Finally, invoking the operation produces new values.

TF-Coder’s search enumerates expressions in order of increasing weight, which represents the expression’s complexity. Operations and initial values (input tensors and constants) have associated weights, and an expression’s weight is defined to be the sum of the weights of the operations and initial values used in that expression. For example, the initial values in1 and 0 both have weight 8, and the operation tf.expand_dims(input, axis) has weight 18, so the expression tf.expand_dims(in1, axis=0) has weight $18 + 8 + 8 = 34$.
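The weight definition above amounts to a sum over the expression tree. The sketch below computes it recursively, using only the three weights quoted in the text; the tuple-based tree encoding and the function name expression_weight are assumptions made for this illustration.

```python
# Weights quoted in the text: in1 and 0 weigh 8, tf.expand_dims weighs 18.
WEIGHTS = {"in1": 8, "0": 8, "tf.expand_dims": 18}


def expression_weight(expr):
    """expr is a leaf name (str) or a tuple (operation_name, [argument exprs])."""
    if isinstance(expr, str):
        return WEIGHTS[expr]
    op, args = expr
    # An expression weighs as much as its operation plus all of its arguments.
    return WEIGHTS[op] + sum(expression_weight(a) for a in args)


total = expression_weight(("tf.expand_dims", ["in1", "0"]))
```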

These weights give TF-Coder a fine-grained notion of the “complexity” of different TensorFlow operations, e.g., tf.reverse(tensor, axis) is more complex and less useful than tf.expand_dims(input, axis), so the former is assigned a greater weight than the latter. We manually assigned weights for each of TF-Coder’s supported operations, taking into consideration how common or useful the operation is, how complex its semantics are, and how many arguments it takes. These weights allow TF-Coder to prioritize simple and useful operations in its search. All weights must be positive integers to enable efficient enumeration.

Figure 3 is a diagram summarizing TF-Coder’s weighted enumerative search, and the algorithm is shown in Algorithm 1. Note that the algorithm mentions using learned models to prioritize operations, discussed in Section 4. Argument filters and combination filters are discussed in Section 3.3.

The algorithm starts by collecting initial values. These include user-provided input tensors, user-provided constants (optional), and heuristically-chosen constants. The constants 0, 1, -1, True, False, tf.int32, tf.int64, tf.float32, and tf.bool are always chosen. We also include natural numbers up to the maximum rank of an input tensor (exclusive) to serve as axis values, all dimension lengths of input and output tensors, and the output tensor’s shape as a tuple. These initial values are assigned hardcoded weights depending on their origin (e.g., a user-provided constant will have smaller weight than a constant extracted from a dimension length).
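The collection of heuristic constants can be sketched from tensor shapes alone. In the following illustration, data types are represented by their names as strings, axis values are assumed to start at 0, and the function name heuristic_constants is hypothetical; the hardcoded per-origin weights are omitted because their values are not given here.

```python
def heuristic_constants(input_shapes, output_shape):
    """Collect the heuristically chosen constants, given tensor shapes."""
    constants = [0, 1, -1, True, False,
                 "tf.int32", "tf.int64", "tf.float32", "tf.bool"]
    max_rank = max(len(shape) for shape in input_shapes)
    constants.extend(range(max_rank))        # axis values 0 .. max_rank - 1
    for shape in list(input_shapes) + [output_shape]:
        constants.extend(shape)              # every dimension length
    constants.append(tuple(output_shape))    # the output shape as a tuple

    # Deduplicate while keeping order; key on the type name so that
    # True and 1 (which compare equal in Python) stay distinct.
    seen, unique = set(), []
    for c in constants:
        key = (type(c).__name__, c)
        if key not in seen:
            seen.add(key)
            unique.append(c)
    return unique


consts = heuristic_constants(input_shapes=[(3, 4), (4,)], output_shape=(3,))
```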

The search then generates expressions in order of increasing weight. For a given target weight, we enumerate over all supported operations and all allowable weights for the operation’s arguments. For example, if we are currently generating expressions of weight 76 using a 2-argument operation with weight 36, then there is $76 - 36 = 40$ remaining weight to partition among the two arguments. If argument 1 is chosen to have weight 32 and argument 2 is chosen to have weight 8, we would use all previously-explored values of weight 32 as choices to fill argument 1, and similarly all existing values of weight 8 are choices for argument 2. The Cartesian product of these argument choices gives many argument lists, each list containing one concrete value for each argument. The chosen operation is applied to each of these argument lists to produce new values, which by construction all have the desired weight. Each newly generated value that is not equal to a previously-seen value is added back to the set of known explored values. In this way, we prune away expressions with equivalent behavior when run on the input tensors, significantly reducing the size of the search space.
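The partitioning step described above is an enumeration of ordered lists of positive integers with a fixed sum. A minimal sketch follows; the generator name weight_partitions is ours, and a real implementation could enumerate the same lists in a different order.

```python
def weight_partitions(total, num_args):
    """Yield every ordered list of num_args positive integers summing to total."""
    if num_args < 1:
        raise ValueError("an operation needs at least one argument")
    if num_args == 1:
        if total >= 1:
            yield [total]
        return
    # Leave at least weight 1 for each of the remaining arguments.
    for first in range(1, total - num_args + 2):
        for rest in weight_partitions(total - first, num_args - 1):
            yield [first] + rest


# Target weight 76 with a 2-argument operation of weight 36.
partitions = list(weight_partitions(76 - 36, 2))
```

The split into weights 32 and 8 used in the text is one of the enumerated partitions; weights 8 and 32 form a different one, since the order of arguments matters.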

Every value produced by applying an operation to arguments stores references to the operation and the arguments, so that any value can recursively reconstruct its code representation. As soon as TF-Coder encounters a value that is equal to the desired output tensor, it outputs the value’s code representation as a solution.
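The overall loop can be illustrated on a toy domain. The sketch below replaces tensors with Python integers and TensorFlow operations with three arithmetic functions whose weights are made up; it keeps the essential structure (weight-ordered enumeration, Cartesian products of argument choices, and pruning of values already seen) but omits filtering and learned reweighting. For brevity it stores each value's code string directly instead of references to the operation and arguments.

```python
import itertools

OPS = [  # (name, weight, arity, function): toy stand-ins with invented weights
    ("add", 3, 2, lambda a, b: a + b),
    ("mul", 4, 2, lambda a, b: a * b),
    ("neg", 2, 1, lambda a: -a),
]


def arg_weight_lists(remaining, arity):
    """Argument weights for unary or binary operations only."""
    if arity == 1:
        return [[remaining]] if remaining >= 1 else []
    return [[w, remaining - w] for w in range(1, remaining)]


def search(initial, target, max_weight=30):
    """initial maps a code string to (value, weight); returns a code string."""
    by_weight = {}   # weight -> list of (value, code)
    seen = set()     # values explored so far, for value-based pruning
    for code, (value, weight) in initial.items():
        if value == target:
            return code
        if value not in seen:
            seen.add(value)
            by_weight.setdefault(weight, []).append((value, code))
    for total in range(1, max_weight + 1):
        for name, op_weight, arity, fn in OPS:
            for arg_weights in arg_weight_lists(total - op_weight, arity):
                choices = [by_weight.get(w, []) for w in arg_weights]
                for args in itertools.product(*choices):
                    value = fn(*(v for v, _ in args))
                    if value in seen:
                        continue  # behaves like an expression found earlier
                    seen.add(value)
                    code = "%s(%s)" % (name, ", ".join(c for _, c in args))
                    by_weight.setdefault(total, []).append((value, code))
                    if value == target:
                        return code
    return None


solution = search({"in1": (3, 1), "in2": (5, 1)}, target=24)
```

Because argument weights are always strictly smaller than the target weight, every value used as an argument was produced in an earlier iteration, which is what makes the bottom-up order well defined.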

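Input: Input/output example $(\mathcal{I}, \mathcal{O})$, natural language description $D$, user-provided constants $C$
Output: A program $P$ such that $P(\mathcal{I}) = \mathcal{O}$
Auxiliary data: Supported operations Ops (each $op$ has argument filters $op.f_i$ and combination filter $op.F$), a model $M_{io}$ conditioned on input/output examples, and a model $M_{nl}$ conditioned on natural language
▷ Use learned models to prioritize operations
$m_{io} \leftarrow M_{io}(\mathcal{I}, \mathcal{O})$   ▷ Model predictions
$m_{nl} \leftarrow M_{nl}(D)$
for all $op \in Ops$ do
    $op.weight \leftarrow$ ReweightOp($op, m_{io}, m_{nl}$)
▷ Gather initial values with weights
$E \leftarrow \emptyset$   ▷ Set of explored values
$C' \leftarrow$ heuristically chosen constants for $(\mathcal{I}, \mathcal{O})$
for all $v \in \mathcal{I} \cup C \cup C'$ do
    $v.weight \leftarrow$ AssignWeightByOrigin($v$)
    $E \leftarrow E \cup \{v\}$
▷ Bottom-up enumerative search
for $W = 1, 2, \ldots$ do   ▷ Weight of expressions
   for all $op \in Ops$ do
      $n \leftarrow op.arity$
      ▷ Argument weights
      for all $[w_1, \ldots, w_n]$ s.t. $\sum_i w_i = W - op.weight$, $w_i \in \mathbb{Z}^+$ do
         for $i = 1, \ldots, n$ do   ▷ Collect argument choices
            $A_i \leftarrow \{e \in E \mid e.weight = w_i \wedge op.f_i(e)\}$
         for all $[a_1, \ldots, a_n] \in \prod_i A_i$ do
            if $\neg\, op.F([a_1, \ldots, a_n])$ then
               continue
            $V \leftarrow$ Execute($op, [a_1, \ldots, a_n]$)
            if $V \notin E$ then
               $V.weight \leftarrow W$
               $V.history \leftarrow (op, [a_1, \ldots, a_n])$
               $E \leftarrow E \cup \{V\}$
            if $V = \mathcal{O}$ then
               return CodeExpression($V$)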
Algorithm 1  TF-Coder’s Synthesis Algorithm

3.3. Operation Filtering

Figure 4. Two-stage operation filtering.  Here we demonstrate TF-Coder’s flexible two-stage operation filtering for the tf.argmax(input, axis) operation. The first argument, input, must be a numeric tensor (e.g., not a boolean tensor), so tf.cast(in1, tf.bool) is removed by the “arg 1 filter.” The second argument, axis, must be an integer that is a potential axis value, so -5, tf.int32, and tf.bool are removed by the “arg 2 filter.” Finally, the axis must be in range for the particular input, so [tf.squeeze(in1), 2] is removed by the “combination filter” if tf.squeeze(in1) actually has rank 2. After these filtering steps, the tf.argmax operation is applied to the surviving combinations of arguments.

When the search enumerates argument lists for a particular operation, a full Cartesian product of argument choices may be very large, even though very few argument lists actually meet preconditions required by the operation. To avoid enormous Cartesian products, and to reduce the number of errors thrown by operations (which are relatively expensive to catch), we introduce a flexible two-stage operation filtering approach, illustrated in Figure 4.

The first stage of operation filtering occurs independently for each argument of the operation. An “argument filter” (the per-argument filter $op.f_i$ in Algorithm 1) is simply a function that takes a value and returns a boolean denoting whether the value is an acceptable choice for a particular argument of an operation. For example, the tf.argmax(input, axis) operation requires that the input argument be a numeric tensor (e.g., a tensor with a float or int data type), and the axis argument must be an integer representing an axis. Hence, an argument filter for input would reject tensors with tf.bool data types, and an argument filter for axis would only accept integers with small absolute value. By using argument filters, the size of the Cartesian product of argument choices is greatly reduced.

The second stage of operation filtering checks constraints that involve multiple arguments. A “combination filter” (the filter $op.F$ in Algorithm 1) for an operation with $n$ arguments is a function that takes a list of $n$ values and returns a boolean denoting whether the list contains acceptable arguments for one call to the operation. For example, the tf.argmax(input, axis) operation requires that the axis actually be in range for the input tensor. Hence, the operation’s combination filter would remove an argument list if it has an out-of-bounds axis for the corresponding input tensor. The purpose of combination filters is to avoid executing expensive TensorFlow operations that can be eliminated by quick checks. Furthermore, catching exceptions raised by TensorFlow operations is relatively slow compared to running the combination filter.

The two-stage filtering approach allows for arbitrary value-based checking of operation preconditions. TF-Coder is also engineered such that it is easy to add and reuse filters with minimal code duplication—many operations have an axis argument that requires the same argument filter, and similar operations like tf.reduce_sum(input_tensor, axis) can use the same argument and combination filters.
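The two filtering stages for tf.argmax(input, axis) can be sketched without TensorFlow by modeling a tensor as just a data type and a rank. Everything below is illustrative: the Tensor stand-in class, the filter names, and the bound of 4 on the absolute value of an axis are assumptions, not details of TF-Coder's implementation.

```python
import itertools


class Tensor:
    """Minimal stand-in for a tensor: a code string, a dtype name, and a rank."""
    def __init__(self, code, dtype, rank):
        self.code, self.dtype, self.rank = code, dtype, rank


def is_numeric_tensor(value):      # argument filter for `input`
    return isinstance(value, Tensor) and value.dtype in ("int32", "int64", "float32")


def is_axis_like(value):           # argument filter for `axis` (bound is assumed)
    return isinstance(value, int) and not isinstance(value, bool) and abs(value) <= 4


def axis_in_range(args):           # combination filter over (input, axis)
    tensor, axis = args
    return -tensor.rank <= axis < tensor.rank


def candidate_argument_lists(choices_per_arg, arg_filters, combination_filter):
    # Stage 1: filter each argument's choices independently.
    filtered = [[v for v in choices if keep(v)]
                for choices, keep in zip(choices_per_arg, arg_filters)]
    # Stage 2: filter the (now much smaller) Cartesian product.
    return [args for args in itertools.product(*filtered)
            if combination_filter(args)]


tensors = [Tensor("in1", "int32", 3),
           Tensor("tf.cast(in1, tf.bool)", "bool", 3),
           Tensor("tf.squeeze(in1)", "int32", 2)]
axes = [0, 2, -5, "tf.int32"]
survivors = candidate_argument_lists(
    [tensors, axes], [is_numeric_tensor, is_axis_like], axis_in_range)
survivor_codes = [(t.code, axis) for t, axis in survivors]
```

Of the 12 raw combinations, stage 1 leaves 4 and stage 2 leaves 3, so the operation would be executed only three times.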

Finally, we note that argument filters (but not combination filters) will be run repeatedly on the same values for two reasons. First, argument filters like the axis argument filter are reused among several operations. Second, the same argument will be assigned values of the same weight at different points in the enumerative search. Our solution is to cache the result of applying an argument filter to all explored values of a given weight, i.e., we cache the filtered sets of argument choices collected in Algorithm 1, where the cache is keyed by the filter function and the weight of the values being filtered. (For simplicity, this caching behavior is not present in Algorithm 1.)
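A cache keyed by the filter function and the weight might look like the following sketch. The names filter_cache, filtered_values, and is_small_int are hypothetical, and the call counter exists only to make the effect of caching observable.

```python
filter_cache = {}                 # (filter function, weight) -> accepted values
stats = {"filter_calls": 0}       # instrumentation for this example only


def filtered_values(arg_filter, weight, values_by_weight):
    """Return the values of the given weight accepted by arg_filter,
    running the filter at most once per (filter, weight) pair."""
    key = (arg_filter, weight)
    if key not in filter_cache:
        accepted = []
        for value in values_by_weight.get(weight, []):
            stats["filter_calls"] += 1
            if arg_filter(value):
                accepted.append(value)
        filter_cache[key] = accepted
    return filter_cache[key]


def is_small_int(value):
    return isinstance(value, int) and abs(value) <= 4


values_by_weight = {8: [0, 1, -1, 7, "tf.int32"]}
first = filtered_values(is_small_int, 8, values_by_weight)
second = filtered_values(is_small_int, 8, values_by_weight)  # served from cache
```

Such a cache is safe in a bottom-up search because arguments always have strictly smaller weight than the expression being built, so the set of values of a given weight is complete before it is first used as a source of arguments.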

TF-Coder’s operation filtering significantly improves the quality of candidate programs considered. In particular, for the difficult task described in Figure 5(e), overall the argument filters eliminated 73% of choices for individual arguments, and then the combination filters further eliminated 60% from the Cartesian product of remaining argument choices. Together, the two-stage filtering strategy eliminated 98.6% of all potential candidate programs.

Domain-Specific Details

TF-Coder uses a few additional techniques related to the TensorFlow domain, described in Appendix B. Such techniques are excluded from Algorithm 1 for simplicity.

Handling Multiple I/O Examples

In the tensor manipulation domain, we observe that most tasks only require a single input/output example. For instance, when performing a reduction across rows of an $M \times N$ matrix to produce a length-$M$ vector, there are essentially $M$ independent examples of a row being reduced to a scalar. One can easily construct a single example with $M$ large enough to unambiguously specify the task. This idea generalizes to nearly all tensor manipulation tasks: adding more numbers to the example makes it clearer. Even so, TF-Coder’s enumerative search algorithm can be extended to handle multiple examples, as described in Appendix C.

4. Learning to Guide the Search

In Section 3.2, we noted that operation weights allow TF-Coder to prioritize simple and useful operations. Another benefit is that weights can be modified to fit the specific synthesis problem at hand, instead of having static weights that are independent of the problem. This enables strategies that tweak the ordering of the search space to better fit the problem.

TF-Coder uses two machine learning models that predict which operations will be used: a neural model conditioned on features of the input and output tensors, and a naïve Bayes bag-of-words model conditioned on the natural language description of the problem. The models’ predictions are used to prioritize operations by multiplying their weights by a constant factor (less than 1, so that prioritized operations are reached earlier in the search). Both models independently choose which weights to modify, so if an operation is prioritized by both, its weight will be multiplied by the square of that factor. Modified weights are rounded to the nearest integer (or rounded up to 1 since weights must be positive). Then, the search described in Section 3 is run as normal.
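The reweighting rule can be written in a few lines. In this sketch the multiplicative factor is a parameter; the value 0.75 used in the example is purely illustrative, ties are rounded upward as one possible reading of "nearest integer", and the function name reweight is ours.

```python
import math


def reweight(weight, num_models_prioritizing, factor):
    """Scale a weight once per model that prioritizes the operation, round
    to the nearest integer (halves round up here), and keep it >= 1."""
    scaled = weight * factor ** num_models_prioritizing
    return max(1, math.floor(scaled + 0.5))


FACTOR = 0.75  # illustrative value only; an assumption for this sketch
examples = [reweight(36, k, FACTOR) for k in (0, 1, 2)]
```

With this illustrative factor, an operation of weight 36 keeps its weight if no model prioritizes it, drops to 27 if one does, and to 20 if both do.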

4.1. Tensor Features Model

We now describe a neural model that learns a Bernoulli distribution over each operation, conditioned on features of input and output tensors. Human experts can often recognize useful operations for tensor transformation tasks by looking at patterns in the user-provided examples. For instance, if one tensor contains small nonnegative integers, they may represent indices into another tensor, especially if the output tensor also contains entries that are found in the input tensors. With the tensor features model, our goal is to learn a similar pattern-recognition capability.


One challenge for training such a model is the lack of a large supervised dataset containing real TensorFlow programs together with corresponding input/output examples, so we train our model on a synthetically generated dataset. However, unlike previous approaches (Devlin et al., 2017; Balog et al., 2017; Shin et al., 2019) that uniformly sample from a space of programs and inputs, we observe that this approach in the TensorFlow domain will result in a huge number of errors due to the many constraints imposed by TensorFlow operations. Furthermore, without symbolic formulas for these constraints, we cannot use solver-based approaches to find satisfactory programs and inputs (King, 1976; Cadar et al., 2008).

We present the novel idea of generating the synthetic training dataset using our enumerative search algorithm, running the weighted value search on randomly-generated inputs for 10 minutes to gather a large number of explored values. For each such value, we consider all ways of collapsing subtrees of its code expression into new inputs, to add more variety in the input tensors. For example, given the code expression tf.greater(tf.add(in1, tf.squeeze(in2)), in3), we would additionally consider the expressions tf.greater(new_input, in3) and tf.greater(tf.add(in1, new_input), in3), where new_input is a new input tensor with a value equal to the value of the code subtree that it replaced. We randomly choose one such way of collapsing subtrees (including the original expression unchanged) for each explored value, resulting in an I/O example with a corresponding TensorFlow program.
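The subtree-collapsing step can be sketched as a recursive enumeration over an expression tree. The tuple encoding, the function names, and the decision to let several disjoint subtrees collapse at once are assumptions of this illustration; for simplicity every collapsed subtree is shown with the same placeholder name, whereas a real generator would need distinct input names and the replaced subtrees' values.

```python
import itertools


def collapse_variants(expr, is_root=True):
    """expr is an input name (str) or a tuple (op_name, (arg exprs...)).
    Returns every expression obtained by replacing zero or more operation
    subtrees (never the root) with the placeholder 'new_input'."""
    if isinstance(expr, str):
        return [expr]
    op, args = expr
    per_arg = [collapse_variants(a, is_root=False) for a in args]
    variants = [(op, combo) for combo in itertools.product(*per_arg)]
    if not is_root:
        variants.append("new_input")  # collapse this whole subtree
    return variants


def render(expr):
    """Turn an expression tree back into a code string."""
    if isinstance(expr, str):
        return expr
    op, args = expr
    return "%s(%s)" % (op, ", ".join(render(a) for a in args))


expr = ("tf.greater", (("tf.add", ("in1", ("tf.squeeze", ("in2",)))), "in3"))
rendered = [render(v) for v in collapse_variants(expr)]
```

For the expression discussed above, this yields the original expression plus the two collapsed forms mentioned in the text.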

We then filter the dataset to only contain programs that use at least two operations, since programs using a single operation are already easily synthesized by the value search in a fraction of a second. We also exclude examples where an input or output tensor has more than 50 elements, to more closely resemble example tensors that would be manually provided by TF-Coder’s users. Our training dataset comes from 20,000 runs of value search on random inputs, where we draw one training example each from at most 2,000 explored values from each run, for a total of 39,930,863 training examples. The evaluation dataset uses 1,000 runs of value search and at most 100 examples from each run, for a total of 99,852 evaluation examples.

Example Features

We compute a set of features for the input/output tensors to feed into the model, which include:

  • If the value is a primitive, sequence, tensor, or SparseTensor

  • The value’s data type, rank, and dimension lengths

  • Statistics (e.g., max, min, mean) of its elements

  • The number and fraction of elements with various properties, e.g., exactly zero, within a particular range, unique elements, etc.

  • Various boolean properties of the value, e.g., entirely positive, all elements unique, sorted, etc.

In addition to featurizing the individual input and output tensors, we also compute features representing the comparison of each input value to the output value:

  • Comparing the number of elements, ranks, and each dimension length

  • The number and fraction of input elements that also appear in the output, and vice versa

  • If all input elements appear in output, and vice versa

  • If each dimension length of the input also appears as some dimension length of the output, and vice versa

For all features that result in an unbounded integer or float (e.g., the maximum element or number of unique elements), we bucket the feature to turn it into a categorical feature.
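A few of the features listed above, together with the bucketing step, can be sketched as follows. The bucket boundaries, the feature names, and the use of flattened element lists are all assumptions made for this illustration; the actual feature set is larger.

```python
import bisect

BUCKET_EDGES = [0, 1, 2, 5, 10, 100]  # hypothetical boundaries for this sketch


def bucket(x):
    """Turn an unbounded number into a categorical bucket index."""
    return bisect.bisect_right(BUCKET_EDGES, x)


def comparison_features(input_elems, output_elems):
    """Features comparing one input's (flattened) elements to the output's."""
    out_set = set(output_elems)
    shared = sum(1 for x in input_elems if x in out_set)
    return {
        "num_input_in_output_bucket": bucket(shared),
        "frac_input_in_output": shared / len(input_elems) if input_elems else 0.0,
        "all_input_in_output": shared == len(input_elems),
        "max_input_bucket": bucket(max(input_elems)) if input_elems else 0,
        "num_unique_input_bucket": bucket(len(set(input_elems))),
    }


features = comparison_features([45, 58, 72, 33, 45, 58, 58, 33],
                               [0, 1, 2, 3, 0, 1, 1, 3])
```

On the tensors from Section 2.1, no input element appears in the output, which is the kind of signal that distinguishes relabeling tasks from indexing or slicing tasks.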

To featurize an input/output example, we first pad the list of inputs with dummy input values until there are exactly 3 inputs, so that the same number of features are extracted for each example. (This scheme supports a maximum of 3 inputs, but this could be relaxed. We have not yet encountered a reasonably complex task requiring 4 inputs.) We then extract features for each input and the output individually, and extract features from a comparison of each input to the output. We also add a single feature representing the number of inputs.


Our neural model first embeds categorical features (e.g., boolean properties, bucketed numbers, data types, etc.) using an embedding size equal to the cardinality of the feature. The embeddings are concatenated along with unembedded features (e.g., fraction features), resulting in a vector of length 2049. This vector is passed through 1 or 2 dense layers; a final dense output layer then produces a logit for each operation, and an elementwise sigmoid is applied to get a probability for each operation.

We experiment with different loss functions. One is a standard sigmoid cross entropy loss averaged over the operations. However, as each example only uses a few operations, the dataset is overwhelmingly negative, which could lead the model to be overly conservative with its predictions. Thus, we also implement a differentiable $F_\beta$ metric (van Rijsbergen, 1979) as a loss function to achieve different balances in precision and recall. $F_1$ prioritizes precision and recall equally, while $F_2$ cares twice as much about recall as about precision (in general, we found that the benefit of correctly prioritizing an operation outweighs the cost of prioritizing an operation that is not actually used).
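One common way to make an F-measure differentiable is to replace the hard counts of true positives, false positives, and false negatives with sums of predicted probabilities. The sketch below follows that formulation as an assumption; the exact form used for the tensor features model is not specified here, and the function name soft_f_beta_loss is ours.

```python
def soft_f_beta_loss(probs, labels, beta):
    """1 - F_beta computed from soft counts, so it varies smoothly with probs."""
    tp = sum(p * y for p, y in zip(probs, labels))          # soft true positives
    fp = sum(p * (1 - y) for p, y in zip(probs, labels))    # soft false positives
    fn = sum((1 - p) * y for p, y in zip(probs, labels))    # soft false negatives
    b2 = beta ** 2
    denom = (1 + b2) * tp + b2 * fn + fp
    f_beta = (1 + b2) * tp / denom if denom > 0 else 0.0
    return 1.0 - f_beta


# Same number of true positives, but one missed operation vs. one extra operation.
missed = soft_f_beta_loss([1.0, 0.0, 0.0], [1, 1, 0], beta=2)  # one false negative
extra = soft_f_beta_loss([1.0, 1.0, 0.0], [1, 0, 0], beta=2)   # one false positive
```

With beta set to 2, missing a used operation is penalized more heavily than predicting an unused one, whereas with beta set to 1 the two mistakes cost the same.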

The distribution of operations in the synthetic dataset is different from the distribution of operations that are actually used in problem solutions for two reasons. First, the dataset is created from running weighted value search, which inherently prioritizes simple operations over more complex ones. Second, there are fewer valid programs containing operations with many constraints compared to operations with few constraints. We experimented with balancing the dataset by giving a weight to each positive example (where an operation is actually used), and leaving negative examples (operation unused) unchanged. The weight $w_{op}$ for operation $op$, when it is actually used in the training example, is either

$$w_{op}^{\max} = \frac{\max_{op'} n_{op'}}{n_{op}} \qquad \text{or} \qquad w_{op}^{\text{mean}} = \frac{\operatorname{mean}_{op'} n_{op'}}{n_{op}},$$

where $n_{op}$ is the number of examples in the training set where $op$ is actually used. The max weighting scheme has the property that no operation is downweighted, but it leads to the model believing that there are many more positive examples than there actually are. In contrast, with the mean weighting scheme, the model believes that the proportion of positive examples is unchanged. Finally, we clip weights to a maximum of 10,000 to avoid training instability from extremely large weights.
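The two weighting schemes can be sketched as follows, reading them as "largest usage count divided by this operation's count" and "mean usage count divided by this operation's count". This reading is an assumption chosen because it reproduces both stated properties (no operation is downweighted in the first scheme; the total positive weight is unchanged in the second, before clipping). The function name, scheme labels, and usage counts are hypothetical.

```python
def positive_example_weights(counts, scheme, max_weight=10000.0):
    """counts: dict mapping operation -> number of training examples that use it (> 0)."""
    values = list(counts.values())
    if scheme == "max":
        numerator = max(values)
    elif scheme == "mean":
        numerator = sum(values) / len(values)
    else:
        raise ValueError("unknown weighting scheme: %r" % scheme)
    # Weights are clipped to avoid instability from extremely rare operations.
    return {op: min(numerator / n, max_weight) for op, n in counts.items()}


counts = {"add": 1000, "gather": 100, "argsort": 10}   # hypothetical usage counts
max_weights = positive_example_weights(counts, "max")
mean_weights = positive_example_weights(counts, "mean")
```

Under the first scheme every weight is at least 1, while under the second the weighted number of positive examples equals the unweighted number.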

Considering sigmoid cross entropy, $F_1$, and $F_2$ loss functions, along with max weights, mean weights, or no weighting at all, we have 9 different variations. For each variation, we ran a hyperparameter sweep and selected the run with the lowest evaluation loss after 3 epochs. We observed no overfitting. We varied the number of hidden feedforward layers (1 or 2), the size of the hidden layers (512, 1024, or 2048), and the learning rate (7 choices between 1e-5 and 1e-3). We used the Adam optimizer (Kingma and Ba, 2014) with global norm gradient clipping (Pascanu et al., 2012). Results are discussed in Section 5.2.

For all variations of the tensor features model, we prioritize all operations where the predicted probability is greater than 0.5.

4.2. Natural Language Model

In this section we describe our approach to reweighting operations based on the natural language text accompanying the input/output examples. These descriptions can provide information about what operations are likely to be used in the solution. As with the tensor features model, we formulate the task as a supervised multilabel classification problem. For an input natural language description, the task is to predict a binary label for each operation, indicating whether the operation is likely to be used in the solution.


Since we do not have a large dataset of TF-Coder queries paired with target TensorFlow operations, we construct a proxy dataset from the TensorFlow documentation and from TensorFlow code on GitHub. The proxy dataset does not represent the same distribution as TF-Coder queries, and we will note the implications of this when we describe our models.

We construct the first part of the proxy dataset from the TensorFlow documentation. For each operation supported by TF-Coder, we construct a single instance for our dataset using the operation’s docstring. The docstring serves as the task description, and we consider the operation to be the sole target operation for the instance. This yields 134 descriptions paired with target operations.

To complete the dataset, we additionally construct examples from TensorFlow code from GitHub. We collect 65,617 functions that use at least one TF-Coder-supported TensorFlow operation from GitHub projects with a permissive license. Following the method of Allamanis (2018), we remove duplicate and near-duplicate instances from this dataset, leaving 13,960 functions. For each function, we extract a natural language context from the function, as well as the set of supported TensorFlow operations used by the function. The natural language context consists of the function’s docstring and all comments, string literals, and variable names appearing in the function. We use this natural language context as a proxy for the task description, and we use the TensorFlow operations found in the function as the target TensorFlow operations. In total, our full constructed dataset has 14,094 instances.
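The extraction of one proxy instance from a function can be sketched with Python's standard `ast` and `tokenize` modules. The actual extraction pipeline is not described in detail here, so the sketch is an assumption about one reasonable way to do it; the function names, the example function, and the small set of supported operations are hypothetical.

```python
import ast
import io
import tokenize


def dotted_name(node):
    """Returns a dotted name such as 'tf.math.bincount' for an attribute chain, else None."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None


def extract_instance(function_source, supported_ops):
    """Returns (natural language context, set of supported operations used)."""
    context, operations = [], set()
    for node in ast.walk(ast.parse(function_source)):
        if isinstance(node, ast.Constant) and isinstance(node.value, str):
            context.append(node.value)       # string literals, including the docstring
        elif isinstance(node, ast.Name):
            context.append(node.id)          # variable names
        elif isinstance(node, ast.arg):
            context.append(node.arg)         # parameter names
        elif isinstance(node, ast.Attribute):
            name = dotted_name(node)
            if name in supported_ops:
                operations.add(name)
    # Comments are not part of the syntax tree, so they are read from the token stream.
    for token in tokenize.generate_tokens(io.StringIO(function_source).readline):
        if token.type == tokenize.COMMENT:
            context.append(token.string.lstrip("# "))
    return context, operations


EXAMPLE = '''
def reorder(values, keys):
    """Reorder values by their sorted keys."""
    # a stable sort keeps ties in order
    order = tf.argsort(keys, stable=True)
    return tf.gather(values, order)
'''
```

For `EXAMPLE`, the context contains the docstring, the comment, and names such as `keys` and `order`, while the target operations are `tf.argsort` and `tf.gather`.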


We train two models: a TF-IDF model and a naïve Bayes model. Each model accepts natural language text $D$ and operations $op_1, \ldots, op_n$ as input, and decides which operations to prioritize in the search. We restrict our models to prioritizing at most $k$ operations with the best scores. These models are implemented using scikit-learn (Pedregosa et al., 2011).

In selecting these models, we take into consideration the differences between the proxy dataset and the expected distribution of TF-Coder queries. For example, the natural language context in the proxy dataset is often different in structure from the real task descriptions. Nevertheless, we hypothesize that we can still learn from the vocabulary used in the proxy dataset to perform well on the benchmark tasks. So, we focus our efforts on two bag-of-words models. In investigations with more complex models, we found that higher capacity models can better fit the proxy data but do not generalize well to the target domain of TF-Coder task descriptions.

We first consider the TF-IDF model, which we train using only the TensorFlow documentation, not the instances gathered from GitHub. We construct a vocabulary consisting of those terms appearing at least once in the docstrings of the supported TensorFlow operations, with English stop words removed. For each operation $op$, we construct a vector $V_{op}$ from the operation’s docstring consisting of the tf-idf score of each term in the vocabulary (Jones, 1972). The tf-idf score for a term in a docstring is computed as the number of occurrences of the term in the docstring, divided by the smoothed log total number of occurrences of the term across all docstrings. The smoothing is equivalent to there being a single extra docstring containing every term exactly once.

We construct an analogous vector $V_D$ from the input text $D$. For natural language $D$ and operation $op$, the TF-IDF model produces a score given by the cosine similarity between $V_D$ and $V_{op}$. The model prioritizes the operations with the highest scores, considering only those operations with score exceeding a threshold $\mathit{minScore}$, and up to $k$ operations prioritized.
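The TF-IDF scoring can be sketched in plain Python. This follows one literal reading of the score described above (term count divided by the log of the smoothed total count), which is an assumption; the models in this work are built with scikit-learn, whose weighting may differ in detail. The stop-word list, toy docstrings, and function names are hypothetical.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"a", "an", "the", "of", "to", "in", "and", "is"}   # tiny stand-in list


def terms(text):
    """Lowercases the text and returns its non-stop-word terms."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def vectorize(text, denominators):
    """Returns a sparse vector: term -> count / smoothed log total count."""
    counts = Counter(t for t in terms(text) if t in denominators)
    return {term: count / denominators[term] for term, count in counts.items()}


def build_model(docstrings):
    """docstrings: dict mapping operation -> docstring."""
    totals = Counter()
    for text in docstrings.values():
        totals.update(terms(text))
    # Smoothing: as if one extra docstring contained every term exactly once.
    denominators = {term: math.log(total + 1) for term, total in totals.items()}
    vectors = {op: vectorize(text, denominators) for op, text in docstrings.items()}
    return denominators, vectors


def cosine(u, v):
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


def prioritize(description, model, min_score, k):
    """Returns up to k operations whose cosine similarity exceeds min_score."""
    denominators, vectors = model
    query = vectorize(description, denominators)
    scored = sorted(((cosine(query, vec), op) for op, vec in vectors.items()), reverse=True)
    return [op for score, op in scored[:k] if score > min_score]


TOY_DOCSTRINGS = {   # shortened stand-ins, not the real TensorFlow docstrings
    "tf.gather": "Gather slices from params according to indices.",
    "tf.argsort": "Returns the indices that sort a tensor.",
    "tf.reduce_sum": "Computes the sum of elements across dimensions.",
}
```

Terms of the description that are outside the docstring vocabulary are ignored, which is what keeps the model from absorbing vocabulary specific to other text sources.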

The second model is a naïve Bayes model, which we train on the full constructed dataset. This model uses the same vocabulary and document frequencies as the TF-IDF model and the same definition of the input text vector $V_D$. Though the dataset is now larger, we do not expand the vocabulary to include novel terms. We find that restricting the capacity of the model in this way limits its tendency to overfit to the domain of the constructed dataset.

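For each operation $op$, let $Y_{op}$ be a binary random variable indicating whether $op$ is used in the target program. The naïve Bayes model estimates the probability of $op$ being used given natural language $D$ as

$$P(Y_{op} \mid D) \propto P(Y_{op}) \prod_{i \in D} P(i \mid Y_{op}),$$

where the product ranges over the vocabulary terms $i$ appearing in $D$. We calculate this using the estimate $P(i \mid Y_{op}) = \frac{N_{Y_{op}, i} + \alpha}{N_{Y_{op}} + \alpha n}$, where $\alpha$ is the Lidstone smoothing parameter. $N_{Y_{op}}$ is the sum of the tf-idf scores of all terms appearing with $Y_{op}$, $N_{Y_{op}, i}$ is the sum of the tf-idf scores of all instances of term $i$ appearing with $Y_{op}$, and $n$ is the number of terms in the vocabulary.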

The distribution of operations in the proxy dataset differs from the distribution of operations that appear in TF-Coder queries. On GitHub, TensorFlow usage skews toward implementing models and training pipelines, whereas TF-Coder queries are tensor manipulations. So, rather than estimating the prior $P(Y_{op})$ from the proxy dataset, we instead use a uniform prior, with the same estimate of $P(Y_{op})$ for all operations, which we found to perform better. The naïve Bayes model prioritizes operations with $P(Y_{op} \mid D) > p_{\min}$, up to $k$ operations, where $p_{\min}$ and $k$ are hyperparameters.
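A per-operation naïve Bayes classifier with Lidstone smoothing and a uniform prior can be sketched as below. This is a plain-Python illustration under stated assumptions, not the scikit-learn model used in this work: it treats each operation as its own binary problem, uses tf-idf scores as fractional term counts, and uses a smoothing value, instances, and function names that are all hypothetical.

```python
import math
from collections import defaultdict


def train(instances, vocabulary, operations, alpha):
    """instances: list of (dict term -> tf-idf score, set of operations used)."""
    n = len(vocabulary)
    model = {}
    for op in operations:
        term_sums = {True: defaultdict(float), False: defaultdict(float)}
        class_sums = {True: 0.0, False: 0.0}
        for features, used in instances:
            label = op in used
            for term, score in features.items():
                if term in vocabulary:
                    term_sums[label][term] += score
                    class_sums[label] += score
        # Lidstone-smoothed log P(term | label) for every vocabulary term.
        model[op] = {
            label: {term: math.log((term_sums[label][term] + alpha)
                                   / (class_sums[label] + alpha * n))
                    for term in vocabulary}
            for label in (True, False)
        }
    return model


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def probability_used(model, op, features):
    """P(op is used | features) under a uniform prior, so the prior terms cancel."""
    log_odds = sum(score * (model[op][True][t] - model[op][False][t])
                   for t, score in features.items() if t in model[op][True])
    return sigmoid(log_odds)


def prioritize(model, features, p_min, k):
    """Returns up to k operations whose probability exceeds p_min."""
    scored = sorted(((probability_used(model, op, features), op) for op in model),
                    reverse=True)
    return [op for p, op in scored[:k] if p > p_min]


INSTANCES = [   # hypothetical proxy instances
    ({"sort": 1.0, "indices": 1.0}, {"tf.argsort"}),
    ({"sum": 1.0, "elements": 1.0}, {"tf.reduce_sum"}),
    ({"sort": 2.0}, {"tf.argsort", "tf.gather"}),
]
VOCABULARY = {"sort", "indices", "sum", "elements"}
OPERATIONS = ["tf.argsort", "tf.reduce_sum", "tf.gather"]
MODEL = train(INSTANCES, VOCABULARY, OPERATIONS, alpha=1.0)
```

With these toy instances, a description containing only the term "sort" yields a probability of 0.75 for `tf.argsort`, since the smoothed likelihood of "sort" is 0.5 when the operation is used and 1/6 when it is not.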

We experiment with different variations of these models, varying the score threshold for TF-IDF, the probability threshold for naïve Bayes, and the maximum number of operations prioritized for both models. Results for the best settings are shown in Section 5.2.

5. Experiments

We now present an evaluation of TF-Coder on a set of real-world benchmarks. We use ablation experiments to analyze the overall efficiency gains of TF-Coder’s synthesis algorithm compared to baseline approaches. Finally, we perform a study of the synthesis results of TF-Coder in comparison to the answers provided by human experts on StackOverflow.

Benchmark Tasks

We collected a benchmark set of 70 tensor manipulation tasks, including 50 drawn from StackOverflow questions and 20 real tasks encountered by TensorFlow users in an industrial setting. While collecting the benchmark tasks, we noticed that some were not actually amenable to solutions in TensorFlow, so we excluded tasks that we could not solve by hand after much effort. Of the 50 StackOverflow tasks, 34 contained an input/output example in the question. We expanded these examples (adding more entries to the tensors) where necessary to make the patterns clear, or used the examples as-is if they were already comprehensive. For questions posed without input/output examples, we created examples manually. We also manually wrote single-sentence descriptions for the tasks, borrowing as much wording from the question’s title and body as possible while remaining concise, grammatical, and accurate. Examples of this process are discussed in Appendix D.

5.1. Comparison to Prior Work

Figure 5. Ablation study investigating the effects of weighted search and operation filtering. The plot shows the number of benchmarks that can be solved within a particular amount of time. Without these improvements, the search algorithm reduces to that of prior work (Udupa et al., 2013). These runs do not use learned models to prioritize operations.

TF-Coder extends the search in Transit (Udupa et al., 2013) in several ways:

  1. TF-Coder incorporates weights for operations and base values, while Transit does not use weights.

  2. TF-Coder uses a flexible operation filtering system that generalizes Transit’s type checking, which is insufficient for many TensorFlow operations.

  3. TF-Coder uses two models to modify operation weights.

In this section, we evaluate the effectiveness of the first two improvements (the models are evaluated in Section 5.2). We run 4 variations of TF-Coder where we independently turn on or off weighting and operation filtering, without using models. (We turn off operation filtering as much as possible, but 36 of 134 operations require filtering to avoid uncatchable segfaults or excessive memory usage.)

The results of these 4 variations on our benchmarks are plotted in Figure 5. Both techniques in isolation lead to significant improvement over the Transit algorithm, and their combination produces another large improvement. Overall, TF-Coder without any models can solve 62 of the 70 benchmark tasks within 5 minutes, while Transit only solves 44 tasks.

5.2. Effect of the Learned Models

Model                               Tasks   Num faster      Num slower      Time for      Total    Avg.
                                    solved  (avg. speedup)  (avg. speedup)  62 tasks (s)  speedup  speedup
TF-Coder without any models         62                                      1147.6
(A) CE,                             62      30 (43.3%)      17              887.3         22.7%    16.8%
(B),                                63      38 (42.9%)      13              756.2         34.1%    23.9%
(C),                                63      43 (45.3%)      9               907.7         20.9%    24.4%
(X) Naïve Bayes, ,                  63      26 (39.0%)      9               1085.2        5.4%     12.5%
(Y) Naïve Bayes, ,                  62      24 (41.1%)      4               1013.0        11.7%    14.2%
(Z) TF-IDF, ,                       62      21 (42.5%)      7               1138.6        0.8%     14.8%
(B) with (X) (chosen combination)   63      44 (50.1%)      9               682.4         40.5%    32.4%
(B) with (Y)                        63      43 (50.8%)      11              675.1         41.2%    31.8%
(B) with (Z)                        63      40 (52.9%)      11              723.5         37.0%    34.7%
(C) with (Y)                        63      47 (53.1%)      6               809.9         29.4%    32.8%
Table 1. The best variations of the tensor features model, natural language model, and combinations, comparing against TF-Coder without using any models. All methods in this table solved at least the 62 tasks that were solved by the baseline.

We now evaluate different models to prioritize operations during the enumerative search. We find the best tensor features model (Section 4.1) and the best natural language model (Section 4.2) in isolation, and then find the best combination of the two models.

Table 1 lists the performance of the best model variations on our benchmark tasks. For the tensor features models, we experimented with 3 different loss functions and 3 different weighting schemes as described in Section 4.1. For the natural language models, model-specific hyperparameters are listed as described in Section 4.2.

Table 1 compares the performance of TF-Coder when using each model against TF-Coder without any models at all. All model variations and combinations listed in Table 1 solved (at least) all of the 62 benchmark tasks that were solved by the no-model variant. For each task, if using a model results in a solve time that is less than 5% or less than 0.1 seconds different compared to the no-model run, then we consider the solve times to be “roughly equal” and possibly attributed to noise in the timings. The table lists the number of tasks where the timing difference is larger in either direction: “faster” means the model does better than not using a model, and “slower” means the model does worse. We also report the average speedups among the faster and slower tasks. “Time for 62 tasks” is the sum of the solve times for the tasks solved by the no-model variant, and “total speedup” compares that total time against that of the no-model variant. “Average speedup” computes the average of the per-task speedups. Note that “total speedup” is heavily biased toward performance on the few difficult long-running tasks, while “average speedup” is representative of all tasks (even easy tasks that are solved incredibly quickly, where an end-user might not even notice any time difference).
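The bookkeeping behind these columns can be sketched as follows. The sketch assumes that a speedup is the fraction of the no-model time that is saved, which is consistent with Table 1 (for row (A), 1 − 887.3/1147.6 ≈ 22.7%), and that the 5% threshold is measured relative to the no-model time. The function name and the timings in the example are hypothetical.

```python
def compare_to_baseline(base_times, model_times):
    """Both arguments map task -> solve time in seconds, for tasks solved by the baseline."""
    faster, slower, all_speedups = [], [], []
    for task, base in base_times.items():
        new = model_times[task]
        speedup = (base - new) / base           # fraction of the baseline time saved
        all_speedups.append(speedup)
        difference = abs(base - new)
        if difference < 0.05 * base or difference < 0.1:
            continue                            # "roughly equal": treated as timing noise
        (faster if new < base else slower).append(speedup)

    def mean(values):
        return sum(values) / len(values) if values else None

    total_base = sum(base_times.values())
    total_new = sum(model_times[task] for task in base_times)
    return {
        "num_faster": len(faster),
        "avg_speedup_faster": mean(faster),
        "num_slower": len(slower),
        "avg_speedup_slower": mean(slower),
        "total_speedup": (total_base - total_new) / total_base,
        "average_speedup": mean(all_speedups),
    }


# Hypothetical timings for four tasks, in seconds.
BASE = {"a": 100.0, "b": 1.0, "c": 10.0, "d": 0.05}
WITH_MODEL = {"a": 50.0, "b": 1.02, "c": 12.0, "d": 0.01}
REPORT = compare_to_baseline(BASE, WITH_MODEL)
```

In this example the long-running task "a" dominates the total speedup, while the quickly solved task "d" contributes a large per-task speedup to the average despite counting as roughly equal.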

Tensor Features Model

For the tensor features model, we found that the same weighting scheme was consistently the best across all three loss functions. The loss function of model (B) resulted in the highest total speedup of 34.1%, while the loss function of model (C) had the highest average speedup of 24.4%. Both of these models solved one extra task compared to the no-model run.

Natural Language Model

The best Naïve Bayes models obtain higher total speedup than the best TF-IDF model, although the TF-IDF model has slightly better average speedup. Overall, the natural language models were less effective than the tensor features models, but the natural language models lead to slowdowns for fewer tasks.

Model Combinations

We tried all 9 combinations of the 3 best tensor features models and the 3 best natural language models (as listed in Table 1), with results for the four best combinations listed at the bottom of the table. The different combinations excel in different ways. Considering the many metrics in Table 1, as well as performance on the 63rd “extra” solved task, we consider the best combination to be tensor features model (B) together with the Naïve Bayes natural language model (X), marked as the chosen combination in Table 1. This combination led to speedups for 44 of 62 tasks (71%), on average cutting the synthesis time in half among such tasks, which helps TF-Coder feel more interactive.

It is also promising that the model combinations perform significantly better than the individual models themselves. This suggests that our framework enables complementary models to jointly influence the enumerative search with compounding benefits.

5.3. Comparison to StackOverflow

Since TF-Coder was inspired by questions on forums like StackOverflow, it is natural to compare TF-Coder’s performance with that of the StackOverflow community. We found that, among the 50 StackOverflow questions, 47 had answers but only 32 had correct answers. Incorrect answers included cases where the expert misinterpreted the question, or the solution did not fully generalize, used operations that no longer exist in the current version of TensorFlow (2.0), or otherwise had bugs that prevent the suggested code from executing successfully. Among correct answers, the median answer-posting time was 31 minutes. In comparison, TF-Coder is able to solve 44 of the StackOverflow tasks within 5 minutes, with a median solve time of 1.4 seconds. Furthermore, TF-Coder’s solutions are guaranteed to run successfully on the given example. We also manually inspected TF-Coder’s solutions and found that they all correctly implement the desired behavior, except for one solution which was mostly correct but had a subtle bug that prevents it from generalizing perfectly. We discuss this in Appendix E.

5.4. A Sample of Synthesized Programs

# Convert tensor into pairs for SparseTensor indexing.
in1 = [0, 0, 0, 1, 3, 3]
output = [[0, 0], [0, 1], [0, 2], [1, 0], [3, 0], [3, 1]]
# Solution found in 2.5 seconds
tf.cast(tf.where(tf.sequence_mask(
    tf.math.bincount(in1))), tf.int32)
(a) A real task from an industrial setting that is incredibly tricky to solve, using an unintuitive composition of uncommon operations.
# Reorder segments.
in1 = [10, 20, 30, 40, 50, 13, 17, 19, 21, 22, 23]
in2 = [1, 1, 1, 1, 1, 0, 0, 0, 2, 2, 2]
output = [13, 17, 19, 10, 20, 30, 40, 50, 21, 22, 23]
# Solution found in 2.0 seconds
tf.gather(in1, tf.argsort(in2, axis=0, stable=True))
(b) Another task from an industrial setting. The use of tf.argsort is critical—the problem would be very difficult without it. TF-Coder can help users learn about operations that they are unfamiliar with.
# Compute convex combination of two tensors.
in1 = [[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
       [[10., 20.], [30., 40.], [50., 60.]]]
in2 = [[[9.0, 8.0], [7.0, 6.0], [5.0, 4.0]],
       [[90., 80.], [70., 60.], [50., 40.]]]
in3 = [0.1, 0.4, 0.8]
output = [[[8.2, 7.4], [5.4, 5.2], [5.0, 5.6]],
          [[82., 74.], [54., 52.], [50., 56.]]]
# Solution found in 95.6 seconds
tf.add(in2, tf.multiply(tf.expand_dims(in3, 1),
                        tf.subtract(in1, in2)))
(c) A StackOverflow task with three inputs. Our best handwritten solution used 6 operations, while TF-Coder’s solution only uses 4.
# Convert segment lengths to segment ids.
in1 = [2, 3, 4]
output = [0, 0, 1, 1, 1, 2, 2, 2, 2]
# Solution found in 3.2 seconds
tf.cast(tf.where(tf.sequence_mask(in1))[:, 0], tf.int32)
(d) A StackOverflow task that is surprisingly difficult in TensorFlow. StackOverflow’s answer uses 9 operations; TF-Coder only needs 4.
# Get the indices of several elements.
in1 = [101, 103, 105, 109, 107]
in2 = [105, 107, 103]
output = [2, 4, 1]
# Solution found in 43.8 seconds
tf.cast(tf.argmax(tf.cast(tf.equal(in1, tf.expand_dims(
    in2, 1)), tf.int32), axis=1), tf.int32)
(e) This StackOverflow task requires a particularly long solution.
Figure 6. Results on selected tasks (descriptions in comments). None of these tasks used user-provided constants.

Figure 6 shows examples of interesting problems that TF-Coder is able to solve. We observe that on these problems and many others, TF-Coder finds solutions that are simpler or more elegant than human-written solutions. One major strength of TF-Coder is that it can identify solutions using uncommon operations that a human programmer might not know about, or unconventional combinations of operations that the programmer might not have considered. Such behavior would not be expected from other synthesis approaches that attempt to imitate existing code corpora.
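To see why the composition `tf.where(tf.sequence_mask(...))` in solution (d) works, it helps to trace it with plain-Python stand-ins. The helpers below are simplified re-implementations written for this illustration (they mimic the default behavior of `tf.math.bincount`, `tf.sequence_mask`, and single-argument `tf.where` on small integer lists); they are not TensorFlow functions.

```python
def bincount(values):
    """Counts occurrences of each non-negative integer, like tf.math.bincount."""
    counts = [0] * (max(values) + 1)
    for v in values:
        counts[v] += 1
    return counts


def sequence_mask(lengths):
    """Row i has lengths[i] leading True entries, like tf.sequence_mask."""
    width = max(lengths)
    return [[col < length for col in range(width)] for length in lengths]


def where(mask):
    """Coordinates of True entries in row-major order, like single-argument tf.where."""
    return [[r, c] for r, row in enumerate(mask) for c, flag in enumerate(row) if flag]


# Task (d): segment lengths -> segment ids. Each length becomes a row of the mask,
# and the row index of every True entry is the segment id.
segment_ids = [row for row, _ in where(sequence_mask([2, 3, 4]))]

# Task (a): applying the same composition to bincount output yields
# (value, position-within-value) pairs.
pairs = where(sequence_mask(bincount([0, 0, 0, 1, 3, 3])))
```

Both results match the outputs given in Figure 6 for tasks (d) and (a), which shows how one unfamiliar pattern covers two tasks that look unrelated at first.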

6. Related Work

In this section, we discuss related works from several domains.

Programming By Example (PBE)

The problem of synthesizing programs from input/output examples has been studied for a long time, starting with work on synthesizing LISP programs (Shaw et al., 1975; Hardy, 1974). More recently, PBE techniques have been developed for domains including string transformations (Gulwani, 2011; Gulwani et al., 2012; Singh, 2016), data extraction from semi-structured formats (Le and Gulwani, 2014), data structure manipulations (Feser et al., 2015a; Singh and Solar-Lezama, 2011), distributed cache coherence protocols (Udupa et al., 2013), data imputation programs (Wang et al., 2017, 2018), map-reduce programs (Smith and Albarghouthi, 2016), and Java functions (Shi et al., 2018).

Unlike these approaches, which synthesize programs from only input/output examples, TF-Coder uses both input/output examples and natural language descriptions to guide a weighted enumerative search. Ye et al. (2019) present a technique to generate regular expressions from natural language and examples, where the natural language is first used to generate a program sketch and the sketch is then completed using an enumerative approach using examples. On the other hand, TF-Coder uses both examples and natural language simultaneously to guide a weighted bottom-up search over compositions of supported operations. Synquid (Polikarpova and Solar-Lezama, 2015) also uses type-based reasoning and filtering for synthesis, whereas TF-Coder uses dynamic value-based checks for argument and combination filters for different TensorFlow operations.

Machine Learning for Program Synthesis

With the recent advances in machine learning, there has been much interest in using such techniques for program synthesis. RobustFill (Devlin et al., 2017; Parisotto et al., 2016) uses an encoder-decoder model to generate string transformation programs from examples. The encoder embeds the example strings using recurrent LSTM networks, and the resulting embeddings are then used to condition the output program sequence decoder. DeepCoder (Balog et al., 2017) trains a model to learn a distribution over possible list functions given the input/output list examples. It then uses the distribution to guide an enumerative search. Euphony (Lee et al., 2018) performs a weighted enumerative search using the A* search algorithm, where the weights come from a probabilistic higher-order grammar (PHOG). Similar to these approaches, TF-Coder also learns a distribution over possible programs conditioned on the corresponding specification. However, it uses both input/output examples and natural language as the specification, and uses the trained models to modify operation weights to perform a task-specific weighted search.

AutoPandas (Bavishi et al., 2019) uses graph neural networks to synthesize Pandas programs that manipulate DataFrames, which are similar to TensorFlow tensors. A key innovation in AutoPandas is a graph representation of the input and output DataFrames with edges connecting equal cells. Although tensors and DataFrames are similar, AutoPandas’ graph approach is not as applicable to the TensorFlow domain, since many common mathematical operations would break the cell-equivalence edges. In other words, DataFrames retain much of their data while being manipulated through pivots, selections, and joins, making it easy for cell-equivalence edges to track the movement of data, while this is only true for a fraction of manipulations in TensorFlow.

There are also some approaches that use machine learning for ranking programs. FlashFill uses version-space algebra to identify all programs in a DSL that are consistent with a set of input/output examples, and then uses a ranking function learned through supervised learning (Singh and Gulwani, 2015) to rank the programs, so that the user does not need to provide too many examples before obtaining the desired program. Unlike this ranking approach that first finds all consistent programs, TF-Coder uses learning to guide the search in the first place.

Menon et al. (2013) describe an approach for synthesizing string manipulation programs that learns a probabilistic context free grammar (PCFG) of rules given a set of examples. It uses a set of hand-designed clues to learn a distribution over likely rules and then enumerates over a subset of rules in order of decreasing probabilities to search for a consistent program. Since it learns from a small number of training examples (280), the clues need to be very domain-specific. In comparison, TF-Coder’s TensorFlow domain is quite different from the string-processing domain. TF-Coder trains a model to learn a distribution over operations from millions of synthetically generated programs, and the model is used to guide an efficient weighted enumerative search with value- and type-based filtering and pruning strategies.

Program Synthesis

There has been a renewed interest in program synthesis research in the last decade because of the advances in both constraint solving and algorithmic synthesis techniques (Alur et al., 2013; Gulwani et al., 2017). The synthesis approaches can be broadly classified based on the underlying search mechanism: (i) enumerative (Udupa et al., 2013), (ii) constraint-based (Solar-Lezama et al., 2006; Solar-Lezama, 2013), and (iii) stochastic (Schkufza et al., 2013; Shi et al., 2018). Applying constraint-based synthesis techniques to the TensorFlow domain would require a huge effort of modeling semantics of TensorFlow operations, and for many operations these would not be scalable due to complex non-linear computations. TF-Coder builds on top of the bottom-up enumerative search from Transit (Udupa et al., 2013), adding expression weights and flexible value-based filtering for a more efficient search. Moreover, it dynamically adjusts weights using learned models based on the input/output examples and natural language description.
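As a toy illustration of bottom-up weighted enumeration with value-based pruning, the sketch below searches over compositions of three list operations in order of increasing total weight. It is a simplified stand-in and not TF-Coder's search: the operations, their weights, the use of exceptions as a crude substitute for operation filtering, and all names are assumptions.

```python
import itertools


def splits(total, parts):
    """Yields all tuples of `parts` positive integers that sum to `total`."""
    if parts == 1:
        if total >= 1:
            yield (total,)
        return
    for first in range(1, total - parts + 2):
        for rest in splits(total - first, parts - 1):
            yield (first,) + rest


def synthesize(inputs, output, operations, max_weight=8):
    """inputs: dict name -> value; operations: list of (name, weight >= 1, arity, function)."""
    seen = set()                    # values already reached by a cheaper expression
    by_weight = {1: []}             # weight -> list of (expression string, value)
    for name, value in inputs.items():
        if value == output:
            return name
        if repr(value) not in seen:
            seen.add(repr(value))
            by_weight[1].append((name, value))
    for weight in range(2, max_weight + 1):
        by_weight[weight] = []
        for op_name, op_weight, arity, function in operations:
            # Expression weight = operation weight + sum of argument weights.
            for arg_weights in splits(weight - op_weight, arity):
                pools = [by_weight[w] for w in arg_weights]
                for args in itertools.product(*pools):
                    try:
                        value = function(*(v for _, v in args))
                    except Exception:   # crude stand-in for operation filtering
                        continue
                    if repr(value) in seen:
                        continue        # value-based pruning of equivalent expressions
                    seen.add(repr(value))
                    expression = "%s(%s)" % (op_name, ", ".join(e for e, _ in args))
                    if value == output:
                        return expression
                    by_weight[weight].append((expression, value))
    return None


TOY_OPERATIONS = [
    ("sorted", 1, 1, sorted),
    ("reversed", 1, 1, lambda xs: list(reversed(xs))),
    ("concat", 2, 2, lambda xs, ys: xs + ys),
]
```

For the input `[3, 1, 2]` and the output `[3, 2, 1]`, the search returns `reversed(sorted(in1))` at weight 3. Lowering the weight of an operation makes expressions that use it appear earlier, which is the mechanism through which learned models can steer such a search.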

7. Conclusion

In this paper, we presented TF-Coder, a synthesis tool for automatically generating tensor manipulation programs in TensorFlow from examples and natural language. TF-Coder employs a bottom-up weighted enumerative search with type- and value-based filtering to conform to the constraints imposed by TensorFlow operations. It uses two machine learning models to predict useful operations from features of the input/output tensors and a natural language description of the task, and these predictions are used to modify the weights to customize the search process for the given task. We evaluated TF-Coder on real-world tensor transformation tasks faced by TensorFlow users on StackOverflow and in an industrial setting, and various ablation experiments show the usefulness of the two models and the filtering techniques. We believe that TF-Coder can help both machine learning beginners and experienced practitioners in writing tricky tensor transformation programs that are common in deep learning pipelines.

The authors thank Charles Sutton and the other members of the program synthesis team at Google Brain for helpful discussions.


  • M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. A. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu and X. Zheng (2016) TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2016, Savannah, GA, USA, November 2-4, 2016, pp. 265–283. Cited by: §1.
  • M. Allamanis (2018) The Adverse Effects of Code Duplication in Machine Learning Models of Code. arXiv e-prints, arXiv:1812.06469. Cited by: §4.2.
  • R. Alur, R. Bodík, G. Juniwal, M. M. K. Martin, M. Raghothaman, S. A. Seshia, R. Singh, A. Solar-Lezama, E. Torlak and A. Udupa (2013) Syntax-guided synthesis. See DBLP:conf/fmcad/2013, pp. 1–8. Cited by: §6.
  • M. Balog, A. L. Gaunt, M. Brockschmidt, S. Nowozin and D. Tarlow (2017) DeepCoder: learning to write programs. See DBLP:conf/iclr/2017. Cited by: §4.1, §6.
  • R. Bavishi, C. Lemieux, R. Fox, K. Sen and I. Stoica (2019) AutoPandas: neural-backed generators for program synthesis. PACMPL 3 (OOPSLA), pp. 168:1–168:27. Cited by: §1, §6.
  • C. Cadar, D. Dunbar and D. Engler (2008) KLEE: unassisted and automatic generation of high-coverage tests for complex systems programs. In Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation, OSDI’08, Berkeley, CA, USA, pp. 209–224. Cited by: §4.1.
  • T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang and Z. Zhang (2015) MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. CoRR abs/1512.01274. Cited by: §1.
  • J. Devlin, J. Uesato, S. Bhupatiraju, R. Singh, A. Mohamed and P. Kohli (2017) RobustFill: neural program learning under noisy I/O. See DBLP:conf/icml/2017, pp. 990–998. Cited by: §4.1, §6.
  • J. K. Feser, S. Chaudhuri and I. Dillig (2015a) Synthesizing data structure transformations from input-output examples. See DBLP:conf/pldi/2015, pp. 229–239. Cited by: §6.
  • J. K. Feser, S. Chaudhuri and I. Dillig (2015b) Synthesizing data structure transformations from input-output examples. In Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’15, New York, NY, USA, pp. 229–239. ISBN 9781450334686. Cited by: §1.
  • S. Gulwani, W. R. Harris and R. Singh (2012) Spreadsheet data manipulation using examples. Commun. ACM 55 (8), pp. 97–105. Cited by: §6.
  • S. Gulwani, O. Polozov and R. Singh (2017) Program synthesis. Foundations and Trends in Programming Languages 4 (1-2), pp. 1–119. Cited by: §6.
  • S. Gulwani (2011) Automating string processing in spreadsheets using input-output examples. See DBLP:conf/popl/2011, pp. 317–330. Cited by: §1, §6.
  • S. Hardy (1974) Automatic induction of lisp functions. In Proceedings of the 1st Summer Conference on Artificial Intelligence and Simulation of Behaviour, AISB’74, Amsterdam, The Netherlands, pp. 50–62. Cited by: §6.
  • K. S. Jones (1972) A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation 28, pp. 11–21. Cited by: §4.2.
  • J. C. King (1976) Symbolic execution and program testing. Commun. ACM 19 (7), pp. 385–394. ISSN 0001-0782. Cited by: §4.1.
  • D. P. Kingma and J. Ba (2014) Adam: A Method for Stochastic Optimization. arXiv e-prints, arXiv:1412.6980. Cited by: §4.1.
  • V. Le and S. Gulwani (2014) FlashExtract: a framework for data extraction by examples. See DBLP:conf/pldi/2014, pp. 542–553. Cited by: §6.
  • Y. LeCun, Y. Bengio and G. E. Hinton (2015) Deep learning. Nature 521 (7553), pp. 436–444. Cited by: §1.
  • W. Lee, K. Heo, R. Alur and M. Naik (2018) Accelerating search-based program synthesis using learned probabilistic models. SIGPLAN Not. 53 (4), pp. 436–449. ISSN 0362-1340. Cited by: §6.
  • A. Menon, O. Tamuz, S. Gulwani, B. Lampson and A. Kalai (2013) A machine learning framework for programming by example. In Proceedings of the 30th International Conference on Machine Learning, S. Dasgupta and D. McAllester (Eds.), Proceedings of Machine Learning Research, Vol. 28, Atlanta, Georgia, USA, pp. 187–195. Cited by: §6.
  • E. Parisotto, A. Mohamed, R. Singh, L. Li, D. Zhou and P. Kohli (2016) Neuro-symbolic program synthesis. CoRR abs/1611.01855. Cited by: §6.
  • R. Pascanu, T. Mikolov and Y. Bengio (2012) Understanding the exploding gradient problem. CoRR abs/1211.5063. Cited by: §4.1.
  • A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga and A. Lerer (2017) Automatic differentiation in PyTorch. In NIPS Autodiff Workshop. Cited by: §1.
  • F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot and E. Duchesnay (2011) Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830. Cited by: §4.2.
  • N. Polikarpova and A. Solar-Lezama (2015) Program synthesis from polymorphic refinement types. CoRR abs/1510.08419. Cited by: §6.
  • E. Schkufza, R. Sharma and A. Aiken (2013) Stochastic superoptimization. See DBLP:conf/asplos/2013, pp. 305–316. Cited by: §6.
  • F. Seide and A. Agarwal (2016) CNTK: Microsoft’s open-source deep-learning toolkit. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, pp. 2135–2135. ISBN 978-1-4503-4232-2. Cited by: §1.
  • D. E. Shaw, W. R. Swartout and C. C. Green (1975) Inferring LISP programs from examples. See DBLP:conf/ijcai/1975, pp. 260–267. Cited by: §6.
  • K. Shi, J. Steinhardt and P. Liang (2018) FrAngel: component-based synthesis with control structures. CoRR abs/1811.05175. Cited by: §6.
  • R. Shin, N. Kant, K. Gupta, C. Bender, B. Trabucco, R. Singh and D. Song (2019) Synthetic datasets for neural program synthesis. See DBLP:conf/iclr/2019. Cited by: §4.1.
  • R. Singh and S. Gulwani (2015) Predicting a correct program in programming by example. See DBLP:conf/cav/2015-1, pp. 398–414. Cited by: §6.
  • R. Singh and A. Solar-Lezama (2011) Synthesizing data structure manipulations from storyboards. See DBLP:conf/sigsoft/2011, pp. 289–299. Cited by: §6.
  • R. Singh (2016) BlinkFill: semi-supervised programming by example for syntactic string transformations. PVLDB 9 (10), pp. 816–827. Cited by: §6.
  • C. Smith and A. Albarghouthi (2016) MapReduce program synthesis. See DBLP:conf/pldi/2016, pp. 326–340. Cited by: §6.
  • A. Solar-Lezama, L. Tancau, R. Bodík, S. A. Seshia and V. A. Saraswat (2006) Combinatorial sketching for finite programs. See DBLP:conf/asplos/2006, pp. 404–415. Cited by: §6.
  • A. Solar-Lezama (2013) Program sketching. STTT 15 (5-6), pp. 475–495. Cited by: §6.
  • A. Udupa, A. Raghavan, J. V. Deshmukh, S. Mador-Haim, M. M. K. Martin and R. Alur (2013) TRANSIT: specifying protocols with concolic snippets. See DBLP:conf/pldi/2013, pp. 287–296. Cited by: §1, Figure 5, §5.1, §6.
  • C.J. van Rijsbergen (1979) Information retrieval. Cited by: §4.1.
  • X. Wang, I. Dillig and R. Singh (2017) Synthesis of data completion scripts using finite tree automata. PACMPL 1 (OOPSLA), pp. 62:1–62:26. External Links: Link, Document Cited by: §6.
  • X. Wang, I. Dillig and R. Singh (2018) Program synthesis using abstraction refinement. PACMPL 2 (POPL), pp. 63:1–63:30. External Links: Link, Document Cited by: §6.
  • X. Ye, Q. Chen, X. Wang, I. Dillig and G. Durrett (2019) Sketch-driven regular expression generation from natural language and examples. CoRR abs/1908.05848. External Links: Link, 1908.05848 Cited by: §6.

Appendix A Supported Operations in TF-Coder

Below is the list of 134 operations currently supported by TF-Coder. We did not cherry-pick the operations to support; in fact, out of the 134 supported operations, only 59 are used in TF-Coder’s solutions to our benchmark tasks.

General TensorFlow functions:
tf.add(x, y)
tf.argmax(input, axis)
tf.argmin(input, axis)
tf.argsort(values, axis, stable=True)
tf.argsort(values, axis, direction='DESCENDING', stable=True)
tf.boolean_mask(tensor, mask)
tf.broadcast_to(input, shape)
tf.cast(x, dtype)
tf.clip_by_value(t, clip_value_min, clip_value_max)
tf.concat(values, axis)
tf.constant(value, dtype)
tf.divide(x, y)
tf.equal(x, y)
tf.expand_dims(input, axis)
tf.eye(num_rows, num_columns)
tf.eye(num_rows, dtype)
tf.fill(dims, value)
tf.gather(params, indices)
tf.gather(params, indices, axis, batch_dims)
tf.gather_nd(params, indices)
tf.gather_nd(params, indices, batch_dims)
tf.greater(x, y)
tf.greater_equal(x, y)
tf.math.count_nonzero(input, axis)
tf.math.cumsum(x, axis)
tf.math.cumsum(x, axis, exclusive=True)
tf.math.divide_no_nan(x, y)
tf.math.segment_max(data, segment_ids)
tf.math.segment_mean(data, segment_ids)
tf.math.segment_min(data, segment_ids)
tf.math.segment_prod(data, segment_ids)
tf.math.segment_sum(data, segment_ids)
tf.math.squared_difference(x, y)
tf.math.top_k(input, k)
tf.math.unsorted_segment_max(data, segment_ids, num_segments)
tf.math.unsorted_segment_mean(data, segment_ids, num_segments)
tf.math.unsorted_segment_min(data, segment_ids, num_segments)
tf.math.unsorted_segment_prod(data, segment_ids, num_segments)
tf.math.unsorted_segment_sum(data, segment_ids, num_segments)
tf.matmul(a, b)
tf.maximum(x, y)
tf.minimum(x, y)
tf.multiply(x, y)
tf.not_equal(x, y)
tf.one_hot(indices, depth)
tf.pad(tensor, paddings, mode='CONSTANT')
tf.pad(tensor, paddings, mode='CONSTANT', constant_values)
tf.pad(tensor, paddings, mode='REFLECT')
tf.pad(tensor, paddings, mode='SYMMETRIC')
tf.range(start, limit, delta)
tf.reduce_any(input_tensor, axis)
tf.reduce_max(input_tensor, axis)
tf.reduce_mean(input_tensor, axis)
tf.reduce_min(input_tensor, axis)
tf.reduce_prod(input_tensor, axis)
tf.reduce_sum(input_tensor, axis)
tf.reshape(tensor, shape)
tf.reverse(tensor, axis)
tf.roll(input, shift, axis)
tf.searchsorted(sorted_sequence, values, side='left')
tf.searchsorted(sorted_sequence, values, side='right')
tf.sequence_mask(lengths, maxlen)
tf.sort(values, axis)
tf.sort(values, axis, direction='DESCENDING')
tf.squeeze(input, axis)
tf.stack(values, axis)
tf.subtract(x, y)
tf.tensordot(a, b, axes)
tf.tile(input, multiples)
tf.transpose(a, perm)
tf.unstack(value, axis)
tf.where(condition, x, y)
SparseTensor functions:
tf.SparseTensor(indices, values, dense_shape)
tf.sparse.add(a, b)
tf.sparse.concat(axis, sp_inputs)
tf.sparse.expand_dims(sp_input, axis)
tf.sparse.maximum(sp_a, sp_b)
tf.sparse.minimum(sp_a, sp_b)
tf.sparse.reduce_max(sp_input, axis, output_is_sparse)
tf.sparse.reduce_sum(sp_input, axis, output_is_sparse)
tf.sparse.reshape(sp_input, shape)
tf.sparse.retain(sp_input, to_retain)
tf.sparse.slice(sp_input, start, size)
tf.sparse.split(sp_input, num_split, axis)
tf.sparse.to_dense(sp_input, default_value)
tf.sparse.to_indicator(sp_input, vocab_size)
tf.sparse.transpose(sp_input, perm)
Python-syntax operations:
IndexingAxis1Operation:             arg1[:, arg2]
IndexingOperation:                  arg1[arg2]
PairCreationOperation:              (arg1, arg2)
SingletonTupleCreationOperation:    (arg1,)
SlicingAxis0BothOperation:          arg1[arg2:arg3]
SlicingAxis0LeftOperation:          arg1[arg2:]
SlicingAxis0RightOperation:         arg1[:arg2]
SlicingAxis1BothOperation:          arg1[:, arg2:arg3]
SlicingAxis1LeftOperation:          arg1[:, arg2:]
SlicingAxis1RightOperation:         arg1[:, :arg2]
TripleCreationOperation:            (arg1, arg2, arg3)

Appendix B Domain-Specific Details

Here we describe a few techniques in TF-Coder that take advantage of the TensorFlow domain and that may or may not be useful in other, similar domains. These techniques are excluded from Algorithm 1 for simplicity.

We impose limits on the sizes of values (e.g., number of elements in a tensor) encountered during search. This is done to avoid excessive memory usage through the creation of huge tensors. These limits are enforced during operation filtering; e.g., TF-Coder does not call tf.ones(shape) on the argument tf.range(1, 20), as that would cause an out-of-memory error. The limits are also checked after new values are created as a blanket safeguard against memory issues, and values that are too large are immediately discarded. In our experiments, we allow tensors to have a maximum of 1000 elements, 4 dimensions, and 100 elements along a single dimension. These limits are chosen to admit the largest tensors that we expect average users to require.
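The size check described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `within_size_limits` and the constant names are hypothetical, and only the three numeric limits come from the text.

```python
from math import prod

# The three limits quoted in the text; the constant names are hypothetical.
MAX_ELEMENTS = 1000
MAX_DIMS = 4
MAX_DIM_SIZE = 100


def within_size_limits(shape):
    """Return True if a tensor with this shape respects all three limits."""
    if len(shape) > MAX_DIMS:
        return False
    if any(dim > MAX_DIM_SIZE for dim in shape):
        return False
    # prod of an empty shape is 1, which matches a scalar's single element.
    return prod(shape) <= MAX_ELEMENTS


# tf.ones(tf.range(1, 20)) would request the shape [1, 2, ..., 19],
# which has 19 dimensions and far more than 1000 elements.
huge_shape = list(range(1, 20))
small_shape = [3, 1, 2, 3]

print(within_size_limits(huge_shape))
print(within_size_limits(small_shape))
```

Checking the shape before an operation runs avoids allocating the tensor at all, while applying the same check to newly created values acts as the blanket safeguard.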

Many tasks require a tf.cast operation as the final step. Instead of waiting for the tf.cast operation to be applied through the search, TF-Coder opportunistically casts newly generated values to the target output’s data type if the new value matches the output’s shape but not its data type. If the cast value does not match the output, it is discarded. This step takes negligible time since it is applied to few values, but it drastically reduces the synthesis time for tasks that require a tf.cast as the final operation. Note that the tf.cast operation is still treated normally within the value search, which is necessary to produce and store cast values to be used as arguments to other operations later in the search.
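A minimal sketch of the opportunistic cast, using nested Python lists and Python types as stand-ins for tensors and TensorFlow dtypes. All helper names are hypothetical, and the sketch assumes non-empty, rectangular lists.

```python
def shape_of(value):
    """Shape of a rectangular nested list; a bare number has shape ()."""
    shape = []
    while isinstance(value, list):
        shape.append(len(value))
        value = value[0] if value else None
    return tuple(shape)


def dtype_of(value):
    """Python type of the first element (assumes non-empty nested lists)."""
    while isinstance(value, list):
        value = value[0]
    return type(value)


def cast(value, py_type):
    """Convert every element of a nested list to py_type."""
    if isinstance(value, list):
        return [cast(item, py_type) for item in value]
    return py_type(value)


def try_opportunistic_cast(new_value, target):
    """Return the cast value if casting alone solves the task, else None."""
    if shape_of(new_value) != shape_of(target):
        return None  # Shapes differ, so a cast cannot help.
    if dtype_of(new_value) is dtype_of(target):
        return None  # Same dtype: the ordinary output check covers this case.
    cast_value = cast(new_value, dtype_of(target))
    return cast_value if cast_value == target else None


# A boolean mask that equals the integer target once cast.
print(try_opportunistic_cast([True, False, True], [1, 0, 1]))
```

The cast result is either reported as a solution or thrown away; it is never stored, which is why the regular tf.cast operation is still needed inside the search.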

A SparseTensor is a special kind of tensor object that represents large sparse tensors in a memory-efficient way. TensorFlow’s tf.sparse submodule is dedicated to manipulating SparseTensors, e.g., the tf.add function does not support adding SparseTensors, and the tf.sparse.add function must be used instead. Because sparse operations may be confusing to users who are not familiar with SparseTensors, we prevent all tf.sparse.* operations from being used unless a SparseTensor is given as an input or output tensor, or the description includes the term “sparse”. This also reduces the search space for tasks that do not use SparseTensors.
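The gating rule for sparse operations could be sketched as follows. The function names are hypothetical, operations are represented only by their names, and matching the word “sparse” case-insensitively is an assumption of this sketch rather than something the text specifies.

```python
def sparse_ops_allowed(inputs_are_sparse, output_is_sparse, description):
    """Decide whether tf.sparse.* operations may take part in the search.

    inputs_are_sparse: one boolean per input tensor.
    output_is_sparse: whether the expected output is a SparseTensor.
    description: the natural language description of the task.
    """
    return (
        any(inputs_are_sparse)
        or output_is_sparse
        or "sparse" in description.lower()  # Case-insensitivity is assumed.
    )


def filter_operations(operation_names, allow_sparse):
    """Drop tf.sparse.* operations unless they are allowed."""
    return [
        name for name in operation_names
        if allow_sparse or not name.startswith("tf.sparse.")
    ]


operations = ["tf.add", "tf.sparse.add", "tf.reduce_sum", "tf.sparse.to_dense"]
allowed = sparse_ops_allowed([False, False], False, "add two tensors")
print(filter_operations(operations, allowed))
```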

Appendix C Handling Multiple I/O Examples

To handle multiple input/output examples, we simply need to extend the notion of a “value” in our value search.

In the single-example case, a “value” represents one code expression and contains the result of running that code expression using the example’s inputs. In the multi-example case, a “super-value” still represents one code expression, but it contains the results of running that code expression on inputs from each example.

For equivalence-based pruning (line 20 of Algorithm 1), two super-values are considered equal if all pairs of contained results are equal. For operation filtering (lines 15 and 17), a super-value is permitted by a filter if all of its contained results pass the filter. A solution is found (line 24) when the super-value’s contained results all match the examples’ outputs.
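The three rules above can be sketched with a small class. The class and method names are hypothetical, and plain Python numbers stand in for tensors; the point is only that each check quantifies over all examples.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SuperValue:
    """One code expression plus its result on every I/O example."""
    expression: str
    results: tuple  # results[i] is the result on example i.

    def equivalent_to(self, other):
        # Pruning: equal only if the results agree on every example.
        return self.results == other.results

    def passes(self, value_filter):
        # Filtering: every contained result must pass the filter.
        return all(value_filter(result) for result in self.results)

    def is_solution(self, expected_outputs):
        # Solution check: every result must match its example's output.
        return self.results == tuple(expected_outputs)


example_inputs = [(1, 2, 3), (4, 5)]
example_outputs = [6, 9]

sum_value = SuperValue("sum(in1)", tuple(sum(x) for x in example_inputs))
double_max = SuperValue("2 * max(in1)",
                        tuple(2 * max(x) for x in example_inputs))

# The two expressions agree on the first example only, so they are kept
# as distinct super-values, and only the first one is a solution.
print(sum_value.equivalent_to(double_max))
print(sum_value.is_solution(example_outputs))
```

With a single example, the two expressions here would be pruned as equivalent; the second example is what separates them.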

Appendix D Benchmark Creation

Here we walk through representative instances of our benchmark-creation process.

D.1. User Provides Good Example

This benchmark comes from the StackOverflow question in Figure 0(a). The user provides an input/output example: the input tensor [45, 58, 72, 33, 45, 58, 58, 33] should be transformed into the output tensor [0, 1, 2, 3, 0, 1, 1, 3]. The example has several desirable qualities:

  • There are no obvious patterns in the choice of numbers in the input tensor. In contrast, if the input tensor were instead [10, 20, 30, 40, 10, 20, 20, 40], one could incorrectly construct the output as (in1 / 10) - 1. In general, we observed that using “random-looking” numbers in the input tensor will significantly improve the quality of the example by eliminating coincidental patterns that are not actually relevant to the problem.

  • There are no obvious patterns in the arrangement of numbers in the input tensor, e.g., the duplicate elements are not all consecutive. This makes it clear that the intended solution must be general enough to handle non-consecutive duplicate elements.

  • The example tensors have sufficient length. Given only the example, the intended task would be much more ambiguous if the input tensor had, say, 4 elements instead of 8.

  • The example covers a variety of cases: there are elements appearing exactly 1, 2, and 3 times.

Hence, we consider this input/output example to be of high quality, and use it as-is in our benchmark without modification.
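To make the point about coincidental patterns concrete, the following plain-Python sketch compares a reference implementation of the intended behavior with the shortcut (in1 / 10) - 1 on both inputs. The function names are hypothetical, and the reference implementation is our own reading of the task, not a solution produced by TF-Coder.

```python
def first_occurrence_ids(values):
    """Map each element to an ID assigned in order of first appearance."""
    ids = {}
    # setdefault stores len(ids) only when the value is seen the first time.
    return [ids.setdefault(value, len(ids)) for value in values]


def shortcut(values):
    """The coincidental formula (in1 / 10) - 1."""
    return [value / 10 - 1 for value in values]


good_input = [45, 58, 72, 33, 45, 58, 58, 33]
patterned_input = [10, 20, 30, 40, 10, 20, 20, 40]
expected = [0, 1, 2, 3, 0, 1, 1, 3]

# Both functions explain the patterned example...
print(first_occurrence_ids(patterned_input) == expected)
print(shortcut(patterned_input) == expected)
# ...but only the intended behavior explains the user's example.
print(first_occurrence_ids(good_input) == expected)
print(shortcut(good_input) == expected)
```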

For the natural language description of this task, we use the sentence “Assign values between 0 and N - 1 for a vector with N different elements,” which is a slight simplification of the question’s title, “Assign values between 0 and N - 1 for a vector of length L with N different elements in Tensorflow.”

D.2. User Provides Ambiguous Example

This benchmark comes from another StackOverflow question, where the user wants to gather elements of in2 along axis 1, using indices from in1. The user provides the following example:

in1 = [[1], [1]]
in2 = [[0.2, 0.8], [0.4, 0.6]]
output = [[0.8], [0.6]]

Unfortunately, considering the points from the previous example benchmark, this input/output example is not as good. The example only includes two “parts” (where each part is an element of in2 being indexed), and the same index is used in both parts. Furthermore, the example includes a coincidental pattern: the extracted elements of in2 are the maximum of each row. Thus, we modify the example and increase the sizes of the tensors to make the intended pattern clearer, while breaking other patterns:

in1 = [[1], [2], [0]]
in2 = [[0.2, 0.8, 0.0], [0.3, 0.1, 0.6], [0.1, 0.6, 0.3]]
output = [[0.8], [0.6], [0.1]]
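The ambiguity can be checked directly. In the plain-Python sketch below (function names are hypothetical), both the intended gather and the coincidental row maximum reproduce the user's original example, while only the gather reproduces the modified one.

```python
def gather_along_axis_1(indices, params):
    """For each row, pick the elements of params at the given indices."""
    return [
        [row[i] for i in index_row]
        for index_row, row in zip(indices, params)
    ]


def row_max(params):
    """The coincidental pattern: the maximum of each row."""
    return [[max(row)] for row in params]


# The user's original example.
orig_in1 = [[1], [1]]
orig_in2 = [[0.2, 0.8], [0.4, 0.6]]
orig_output = [[0.8], [0.6]]

# The modified example used in the benchmark.
new_in1 = [[1], [2], [0]]
new_in2 = [[0.2, 0.8, 0.0], [0.3, 0.1, 0.6], [0.1, 0.6, 0.3]]
new_output = [[0.8], [0.6], [0.1]]

print(row_max(orig_in2) == orig_output)  # The wrong program fits.
print(row_max(new_in2) == new_output)    # The wrong program is ruled out.
print(gather_along_axis_1(new_in1, new_in2) == new_output)
```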

We found that examples given in StackOverflow questions were often too small because they were intended to be interpreted by humans who also understand the question text. In contrast, examples created by actual TF-Coder users are much more extensive.

We also wrote a single-sentence description of the task that one would plausibly provide to the tool, “how to gather element with index along axis 1,” where “how to gather element with index” is drawn verbatim from the question title, and “along axis 1” comes from the question body.

D.3. User Provides No Example

In this StackOverflow question, the user clearly describes the desired behavior, but does not provide an input/output example: “Assume we have two TensorFlow tensors: input and weights. input is a tensor of images, say. So its shape is . weights is a simple list of scalar weights: . The aim is to scalar-multiply each image by its corresponding weight. How would one do that?”

For such questions without user-provided input/output examples, we create our own examples. We make sure that the examples are extensive enough to unambiguously specify the task and simple enough that a TF-Coder user could plausibly have written the example. For this task, we use the following:

# Shape = [n, H, W, C] = [3, 1, 2, 3].
in1 = [[[[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]],
       [[[0.8, 1.0, 0.0], [0.6, 0.4, 0.2]]],
       [[[0.9, 0.8, 0.7], [0.1, 0.2, 0.3]]]]
in2 = [2.0, 0.5, 1.0]
output = [[[[0.2, 0.4, 0.6], [0.8, 1.0, 1.2]]],
          [[[0.4, 0.5, 0.0], [0.3, 0.2, 0.1]]],
          [[[0.9, 0.8, 0.7], [0.1, 0.2, 0.3]]]]

For this task we use the natural language description “scalar multiply images in a batch,” which is a short rephrasing of the question title, “Given a batch of images, how to scalar multiply each image by a different scalar in tensorflow.”
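As a sanity check on the hand-written example, the sketch below recomputes the output in plain Python by scaling each image by its weight. The helper names are hypothetical, and nested lists stand in for tensors; it verifies the example data only and is not TF-Coder's solution.

```python
import math


def scale(tensor, factor):
    """Multiply every number in a nested list by factor."""
    if isinstance(tensor, list):
        return [scale(item, factor) for item in tensor]
    return tensor * factor


def flatten(tensor):
    """Flatten a nested list into a flat list of numbers."""
    if isinstance(tensor, list):
        return [x for item in tensor for x in flatten(item)]
    return [tensor]


def scale_each_image(images, weights):
    """Scale image i (the i-th slice along axis 0) by weights[i]."""
    return [scale(image, weight) for image, weight in zip(images, weights)]


in1 = [[[[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]],
       [[[0.8, 1.0, 0.0], [0.6, 0.4, 0.2]]],
       [[[0.9, 0.8, 0.7], [0.1, 0.2, 0.3]]]]
in2 = [2.0, 0.5, 1.0]
output = [[[[0.2, 0.4, 0.6], [0.8, 1.0, 1.2]]],
          [[[0.4, 0.5, 0.0], [0.3, 0.2, 0.1]]],
          [[[0.9, 0.8, 0.7], [0.1, 0.2, 0.3]]]]

result = scale_each_image(in1, in2)
# Compare with a tolerance, since the values are floating-point numbers.
matches = all(
    math.isclose(a, b, abs_tol=1e-9)
    for a, b in zip(flatten(result), flatten(output))
)
print(matches)
```

Using three distinct weights (2.0, 0.5, and 1.0) means that no single global scale factor explains the output.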

Appendix E TF-Coder’s Buggy Solution

In this task, the user wants to sum elements of in1, but partitioned into groups specified by in2 first. The user provides the following example, which we use as-is in our benchmark task:

in1 = [5, 7, -12, 10, 20]
in2 = [1, 2, 3, 1, 2]
output = [15, 27, -12, 15, 27]

In this example, the elements 5 and 10 of in1 are both in group 1 (specified by in2), so their sum, 15, is present in the corresponding positions in the output. Considering the format of in2 as provided by the user, we assume that it will only contain integers from 1 to G inclusive, if there are G distinct groups.

TF-Coder’s solution to this problem is:

    tf.math.unsorted_segment_sum(in1, in2, tf.reduce_sum(in1))

This is very close to being a correct solution, but it has a bug. The operation tf.math.unsorted_segment_sum(data, segment_ids, num_segments) is very useful here, taking care of the grouping and summing, but it requires num_segments to be sufficiently large (although a value that is too large hinders efficiency). For this particular I/O example, num_segments=tf.reduce_sum(in1) happens to be large enough, so the solution works in this case, but this is not true in general (e.g., if in1 were entirely negative). A bug-free solution would use tf.reduce_max(in2) + 1 instead:

    tf.math.unsorted_segment_sum(in1, in2,
                                 tf.reduce_max(in2) + 1)

Although TF-Coder’s solution was not perfect, it was nearly so, such that a human user reviewing the solution (while looking at TensorFlow documentation if needed) could identify the bug and write a fix.
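The bug can be reproduced with a pure-Python stand-in for tf.math.unsorted_segment_sum on 1-D inputs. Two parts of this sketch are assumptions: the stand-in raises an error for any segment ID outside [0, num_segments), which simplifies TensorFlow's actual behavior, and the final step that gathers the per-group sums back by in2 is added here so that the result has the same shape as the example output.

```python
def unsorted_segment_sum(data, segment_ids, num_segments):
    """Simplified 1-D stand-in for tf.math.unsorted_segment_sum."""
    if num_segments < 0:
        raise ValueError("num_segments must be non-negative")
    sums = [0] * num_segments
    for value, segment in zip(data, segment_ids):
        if not 0 <= segment < num_segments:
            raise ValueError(f"segment id {segment} is out of range")
        sums[segment] += value
    return sums


def group_sums(in1, in2, num_segments):
    """Sum in1 per group, then place each group's sum at its positions."""
    sums = unsorted_segment_sum(in1, in2, num_segments)
    return [sums[group] for group in in2]  # Assumed gather step.


def fails(in1, in2, num_segments):
    """Return True if the computation raises ValueError."""
    try:
        group_sums(in1, in2, num_segments)
    except ValueError:
        return True
    return False


in1 = [5, 7, -12, 10, 20]
in2 = [1, 2, 3, 1, 2]
negative_in1 = [-5, -7, -12, -10, -20]

# On the user's example, both choices of num_segments work.
print(group_sums(in1, in2, sum(in1)))
print(group_sums(in1, in2, max(in2) + 1))
# With entirely negative data, sum(in1) is negative and the call fails,
# while max(in2) + 1 still works.
print(fails(negative_in1, in2, sum(negative_in1)))
print(group_sums(negative_in1, in2, max(in2) + 1))
```

The sum of in1 has no relationship to the number of groups, which is why it only works by coincidence; max(in2) + 1 depends only on the group IDs.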
