TFCoder: Program Synthesis for Tensor Manipulations
Abstract.
The success and popularity of deep learning is on the rise, partially due to powerful deep learning frameworks such as TensorFlow and PyTorch that make it easier to develop deep learning models. However, these libraries also come with steep learning curves, since programming in these frameworks is quite different from traditional imperative programming with explicit loops and conditionals. In this work, we present a tool called TFCoder for programming by example in TensorFlow. TFCoder uses a bottom-up weighted enumerative search, with value-based pruning of equivalent expressions and flexible type- and value-based filtering to ensure that expressions adhere to various requirements imposed by the TensorFlow library. We also train models that predict TensorFlow operations from features of the input and output tensors and natural language descriptions of tasks, and use the models to prioritize relevant operations during the search. TFCoder solves 63 of 70 real-world tasks within 5 minutes, often finding solutions that are simpler than those written by TensorFlow experts.
1. Introduction
Deep learning techniques have resulted in recent breakthroughs in many domains including computer vision, audio processing, natural language processing, and robotics (LeCun et al., 2015). These breakthroughs arise through a combination of advancements including new algorithmic ideas, the availability of large labeled datasets, and specialized hardware for efficient training. Playing an equally important role are deep learning frameworks such as TensorFlow (Abadi et al., 2016), PyTorch (Paszke et al., 2017), MXNet (Chen et al., 2015), and CNTK (Seide and Agarwal, 2016) that enable machine learning researchers and engineers to develop and iterate on such models more effectively.
While these deep learning frameworks have greatly eased the development and training of complex neural network models, they also have a steep learning curve, since the programming paradigm of computing over tensors using a fixed set of library functions is quite different from the traditional imperative programming paradigm. For instance, vectorization techniques are used to turn explicit loops into more efficient tensor operations, and special operations like tf.where are used in place of traditional if/else conditionals. Most deep learning models require various tensor manipulations for data processing or cleaning, custom loss functions, and accuracy metrics, all of which must be implemented within the constraints of the chosen deep learning framework. Furthermore, these frameworks offer a huge amount of functionality, which makes them powerful but potentially difficult to navigate. For instance, there are nearly 2000 distinct symbols in TensorFlow (including aliases), and about 500 of them are tensor-manipulating operations, so finding the right ones to use for a given task can be a challenge in itself.
Given the increasing popularity of deep learning, combined with the relative difficulty of writing neural models, many beginners and even experienced software engineers seek assistance from others by asking questions on forums like StackOverflow. Tensor manipulations are a common difficulty, and such questions typically include a natural language description of what the asker is trying to accomplish, along with an input/output example illustrating the desired computation or transformation. This is usually enough information for a generous expert to answer the question by providing code that implements the desired functionality, but not all questions are lucky enough to receive a correct answer or even an answer at all.
Inspired by this need, we present TFCoder, a programming by example system to automatically synthesize tensor manipulation programs from input/output examples and natural language descriptions. Our approach builds upon the bottom-up enumerative algorithm used in the previous work Transit (Udupa et al., 2013). We introduce per-operation weights to the prior algorithm, allowing TFCoder to enumerate over TensorFlow expressions in order of increasing complexity. TFCoder also incorporates pruning of expressions that behave equivalently for the given inputs (as in the prior work), and a new, flexible type- and value-based filtering system that handles arbitrary constraints imposed by the TensorFlow library, such as “the two tensor arguments must have broadcastable shapes.” Finally, we introduce two machine learning models that choose operations to prioritize during the search, conditioned on features of the input and output tensors and a natural language description of the task. These models help tailor the search process to fit the particular synthesis task at hand.
The domain of tensor manipulations has not been considered in the program synthesis literature to our knowledge. It is particularly challenging as it encompasses a huge variety of tasks, including reshapes, filters, aggregations, maps, indexing, slicing, grouping, sorting, mathematical operations, and combinations of them. When mathematical operations (e.g., tensor products) are involved, the output tensor typically has no overlapping entries with the input tensors, ruling out synthesis approaches that are informed by partial matches between the inputs and outputs, as is common in manipulation of tables (Bavishi et al., 2019), data structures (Feser et al., 2015b), and strings (Gulwani, 2011). A key takeaway from our work is that the techniques we do use are particularly effective for this domain, enabling an enumerative search to scale to solve practical problems within seconds.
We evaluate TFCoder on 70 real-world tensor transformation tasks from StackOverflow and from an industrial setting. TFCoder can successfully synthesize solutions to 63 tasks in 12 seconds on average, while Transit only solves 44 tasks. Moreover, the trained models lead to significantly faster synthesis times (32.4% faster on average), compared to not using the models. We also observed that TFCoder often produces solutions that are simpler and more elegant than those written by TensorFlow experts (including the authors of this paper).
This paper makes the following key contributions: 1) We introduce TFCoder, the first programming by example system for synthesizing tensor manipulations in TensorFlow from input/output examples. 2) We present a new weighted enumerative search algorithm that uses a new two-stage filtering approach to enforce arbitrary preconditions required by the operations. 3) We develop two machine learning models that predict useful TensorFlow operations given the example tensors and a natural language description of the task, to guide the weighted enumerative search. 4) We evaluate TFCoder on 70 real-world tasks taken from StackOverflow and an industrial setting, outperforming prior synthesis techniques.
2. Motivating Examples
We now present some tensor manipulation questions posted to StackOverflow, an online programming help forum.
2.1. Example 1
Consider the StackOverflow question shown in Figure 0(a). The user has a 1-dimensional tensor of length $N$ containing $M$ distinct values, and they want to create another tensor of the same shape containing values between $0$ and $M - 1$, such that both tensors have duplicate values at the same locations. The user provides a clarifying example: the tensor [45, 58, 72, 33, 45, 58, 58, 33] should be converted to [0, 1, 2, 3, 0, 1, 1, 3]. From this example, TFCoder automatically synthesizes a solution program in 0.8 seconds:
Even though the solution is relatively simple, it would be quite difficult for the question asker to find that solution without assistance, considering that there are about 500 tensor-manipulating operations in TensorFlow. Even searching for the function by name would be difficult, as the name “unique_with_counts” bears little resemblance to the user’s description of the task. In such scenarios, TFCoder can help users find relevant TensorFlow operations automatically, reducing the time spent digging through documentation.
When we first came across this question on StackOverflow, it was four days old with zero answers. We posted TFCoder’s solution as an answer, which was accepted by the poster.
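For reference, the transformation requested in this question can be spelled out as an explicit loop in plain Python. This is only a sketch of the task's semantics, not TFCoder's output; the function name `duplicate_index_labels` is hypothetical. It illustrates the kind of imperative logic that a TensorFlow user must instead express through a library operation.

```python
def duplicate_index_labels(values):
    """Label each element with the order in which its value first appeared."""
    first_seen = {}  # value -> label assigned at its first occurrence
    labels = []
    for v in values:
        if v not in first_seen:
            first_seen[v] = len(first_seen)
        labels.append(first_seen[v])
    return labels


result = duplicate_index_labels([45, 58, 72, 33, 45, 58, 58, 33])
print(result)
```

Equal input values always receive equal labels, which is the "duplicates at the same locations" property described in the question.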
2.2. Example 2
The StackOverflow question in Figure 0(b) involves a more difficult problem. Given two input tensors in1 and in2, the question asker wants an output tensor where the $i$-th element is equal to the in2[i]-th column of in1[i]. To specify their intent more clearly, the asker also provides an input/output example as shown in the figure.
On this complex problem involving multiple input tensors and TensorFlow operations, TFCoder finds a solution in 28 seconds:
TFCoder’s solution is actually simpler than the accepted StackOverflow answer. Thus, TFCoder can help users find elegant solutions for difficult tensor transformations.
2.3. Observations
These StackOverflow questions follow a larger pattern: many tensor transformations are ambiguous if described using natural language alone, so it is natural to provide both a textual description of the desired transformation and concrete input/output example tensors to clarify the problem. Another interesting property is that most of the time, only one input/output example is necessary, since tensors can be expanded with more entries to resolve ambiguities.
There are over 50,000 questions on StackOverflow containing the text “TensorFlow.” While the majority of these ask about installation issues or deep learning in general, there are still many questions asking how to perform tensor manipulations or how to fix errors raised by the user’s code. Indeed, writing TensorFlow code can be challenging at times (even more so for beginners) due to the amount of information that the programmer must keep in mind. The shapes of tensors must be checked for compatibility under broadcasting rules, the conceptual meanings of the dimensions are crucial to ensure mathematical correctness, and data types of tensors must be tracked carefully (e.g., a tf.int32 tensor cannot be added to a tf.int64 tensor). Furthermore, these properties change as tensors are manipulated, leaving many opportunities for subtle bugs.
Inspired by these questions, we developed TFCoder to automatically synthesize tensor manipulations in TensorFlow from input/output examples and natural language descriptions. Such a tool could help accelerate users’ TensorFlow development in several ways. In Section 2.1, we observed that TFCoder can automatically find relevant TensorFlow operations, thus reducing the need to search through TensorFlow’s extensive documentation. Since TFCoder’s solutions are guaranteed to be consistent with the provided input/output example, it can reduce the number of debugging cycles and lead to increased confidence in the code (much like a unit test). Finally, by finding simple and elegant solutions that the user may have overlooked, TFCoder can even improve code quality and model efficiency. We strive to find solutions quickly, within seconds or at most a few minutes, so that the tool may be used interactively.
3. Synthesis with Enumerative Search
Motivated by the examples and discussion in Section 2, we now formalize the problem as illustrated in Figure 2.
3.1. Problem Formalization
We assume a given task specification $\phi = \{(\mathcal{I}, \mathcal{O}), D, C\}$, where $(\mathcal{I}, \mathcal{O})$ is an input/output example, i.e., a list of input tensors $\mathcal{I}$ and the corresponding output tensor $\mathcal{O}$, $D$ is an optional natural language description of the task, and $C$ is an optional set of constants that may be useful for the task.
Our goal is to synthesize a program $P$ where $P(\mathcal{I}) = \mathcal{O}$. We note that TFCoder can often synthesize programs directly from the input/output example $(\mathcal{I}, \mathcal{O})$ without needing the additional $D$ and $C$ information, but $D$ and $C$ allow users to express their intent and obtain better synthesizer performance. The domain of programs considered by TFCoder consists of single-line TensorFlow expressions, which may contain any of the following base values:

Python int, float, Boolean, and string literals

TensorFlow data types, e.g., tf.float32, tf.int64, etc.

Variables in1, in2, etc., to reference the input tensors
Furthermore, expressions may use the following operations, applied to the base values or composed with each other:

Supported TensorFlow function calls, e.g., tf.add(x, y) and tf.math.segment_max(data, segment_ids)

Creating a tuple from supported Python literals, e.g., (0, 1), or from other such tuples

Various forms of indexing and slicing of sequences and tensors, e.g., tensor[1], tensor[1:], and tensor[:, 0:5]
Note that the TensorFlow operations specify their arguments because the search algorithm requires a fixed arity for each operation. Hence, some TensorFlow functions have multiple supported variations, e.g., 2-argument tf.gather(params, indices) and 4-argument tf.gather(params, indices, axis, batch_dims). In total, TFCoder currently supports 123 TensorFlow operations for 99 distinct functions, plus 11 more operations for different forms of indexing, slicing, and tuple creation. These are listed in Appendix A.
In the following sections, we describe the weighted bottom-up enumerative search that powers TFCoder. Starting with a set of initial values including input tensors and constants (which may be provided by the user or chosen heuristically), the search enumerates ways of applying operations to previously explored values, to expand the set of known values. Values internally store enough information to recursively reconstruct the code expression that would produce the value. Thus, if the search encounters a value that matches the output tensor, the matching value’s code expression is a valid solution to the synthesis problem.
3.2. Weighted Value Search
TFCoder’s search enumerates expressions in order of increasing weight, which represents the expression’s complexity. Operations and initial values (input tensors and constants) have associated weights, and an expression’s weight is defined to be the sum of the weights of the operations and initial values used in that expression. For example, the initial values in1 and 0 both have weight 8, and the operation tf.expand_dims(input, axis) has weight 18, so the expression tf.expand_dims(in1, axis=0) has weight $8 + 8 + 18 = 34$.
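The weight definition above can be sketched as a short recursive function. The three weights are the ones quoted in the text; the tuple-based expression encoding and the name `expression_weight` are assumptions made only for this illustration.

```python
# Weights quoted in the text for two initial values and one operation.
WEIGHTS = {"in1": 8, "0": 8, "tf.expand_dims": 18}


def expression_weight(expr):
    """expr is a leaf name (str) or a tuple (operation, arg1, arg2, ...)."""
    if isinstance(expr, str):
        return WEIGHTS[expr]
    op, *args = expr
    # An expression's weight is its operation's weight plus its arguments' weights.
    return WEIGHTS[op] + sum(expression_weight(a) for a in args)


# tf.expand_dims(in1, axis=0)
total = expression_weight(("tf.expand_dims", "in1", "0"))
print(total)
```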
These weights give TFCoder a fine-grained notion of the “complexity” of different TensorFlow operations, e.g., tf.reverse(tensor, axis) is more complex and less useful than tf.expand_dims(input, axis), so the former is assigned a greater weight than the latter. We manually assigned weights for each of TFCoder’s supported operations, taking into consideration how common or useful the operation is, how complex its semantics are, and how many arguments it takes. These weights allow TFCoder to prioritize simple and useful operations in its search. All weights must be positive integers to enable efficient enumeration.
Figure 3 is a diagram summarizing TFCoder’s weighted enumerative search, and the algorithm is shown in Algorithm 1. Note that the algorithm mentions using learned models to prioritize operations, discussed in Section 4. Argument filters and combination filters are discussed in Section 3.3.
The algorithm starts by collecting initial values. These include user-provided input tensors, user-provided constants (optional), and heuristically chosen constants. The constants 0, 1, -1, True, False, tf.int32, tf.int64, tf.float32, and tf.bool are always chosen. We also include natural numbers up to the maximum rank of an input tensor (exclusive) to serve as axis values, all dimension lengths of input and output tensors, and the output tensor’s shape as a tuple. These initial values are assigned hardcoded weights depending on their origin (e.g., a user-provided constant will have smaller weight than a constant extracted from a dimension length).
The search then generates expressions in order of increasing weight. For a given target weight, we enumerate over all supported operations and all allowable weights for the operation’s arguments. For example, if we are currently generating expressions of weight 76 using a 2-argument operation with weight 36, then there is $76 - 36 = 40$ remaining weight to partition among the two arguments. If argument 1 is chosen to have weight 32 and argument 2 is chosen to have weight 8, we would use all previously explored values of weight 32 as choices to fill argument 1, and similarly all existing values of weight 8 are choices for argument 2. The Cartesian product of these argument choices gives many argument lists, each list containing one concrete value for each argument. The chosen operation is applied to each of these argument lists to produce new values, which by construction all have the desired weight. Each newly generated value that is not equal to a previously seen value is added back to the set of known explored values. In this way, we prune away expressions with equivalent behavior when run on the input tensors, significantly reducing the size of the search space.
Every value produced by applying an operation to arguments stores references to the operation and the arguments, so that any value can recursively reconstruct its code representation. As soon as TFCoder encounters a value that is equal to the desired output tensor, it outputs the value’s code representation as a solution.
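The search loop described in this section can be sketched on a toy domain of integers with two arithmetic operations. This is an illustrative simplification under stated assumptions, not TFCoder's implementation: the names `partitions` and `weighted_search`, the toy operations, and all weights below are hypothetical, and real tensors would need a hashable key for the equality-based pruning.

```python
import itertools


def partitions(total, parts):
    """Yield all tuples of `parts` positive integers that sum to `total`."""
    if parts == 1:
        if total >= 1:
            yield (total,)
        return
    for first in range(1, total - parts + 2):
        for rest in partitions(total - first, parts - 1):
            yield (first,) + rest


def weighted_search(initial, operations, target, max_weight):
    """initial: {expression: (value, weight)}; operations: [(name, weight, arity, fn)]."""
    values_by_weight = {}  # weight -> list of (value, expression)
    seen = set()           # values explored so far, used for value-based pruning
    for expr, (value, weight) in initial.items():
        if value == target:
            return expr
        if value not in seen:
            seen.add(value)
            values_by_weight.setdefault(weight, []).append((value, expr))

    for target_weight in range(1, max_weight + 1):
        for name, op_weight, arity, fn in operations:
            remaining = target_weight - op_weight
            # Split the remaining weight among the operation's arguments.
            for arg_weights in partitions(remaining, arity):
                choices = [values_by_weight.get(w, []) for w in arg_weights]
                for args in itertools.product(*choices):
                    value = fn(*(v for v, _ in args))
                    expr = f"{name}({', '.join(e for _, e in args)})"
                    if value == target:
                        return expr
                    if value not in seen:  # keep only behaviorally new values
                        seen.add(value)
                        values_by_weight.setdefault(target_weight, []).append((value, expr))
    return None


initial = {"in1": (3, 2), "2": (2, 2)}
operations = [("add", 3, 2, lambda a, b: a + b), ("mul", 4, 2, lambda a, b: a * b)]
solution = weighted_search(initial, operations, target=15, max_weight=20)
print(solution)
```

Because every operation has positive weight, each argument is drawn from values of strictly smaller weight than the current target weight, so the lists being read are never modified while they are iterated.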
3.3. Operation Filtering
When the search enumerates argument lists for a particular operation, a full Cartesian product of argument choices may be very large, even though very few argument lists actually meet preconditions required by the operation. To avoid enormous Cartesian products, and to reduce the number of errors thrown by operations (which are relatively expensive to catch), we introduce a flexible two-stage operation filtering approach, illustrated in Figure 4.
The first stage of operation filtering occurs independently for each argument of the operation. An “argument filter” (used in Algorithm 1) is simply a function that takes a value and returns a boolean denoting whether the value is an acceptable choice for a particular argument of an operation. For example, the tf.argmax(input, axis) operation requires that the input argument be a numeric tensor (e.g., a tensor with a float or int data type), and the axis argument must be an integer representing an axis. Hence, an argument filter for input would reject tensors with tf.bool data types, and an argument filter for axis would only accept integers with small absolute value. By using argument filters, the size of the Cartesian product of argument choices is greatly reduced.
The second stage of operation filtering checks constraints that involve multiple arguments. A “combination filter” (also used in Algorithm 1) for an operation with $n$ arguments is a function that takes a list of $n$ values and returns a boolean denoting whether the list contains acceptable arguments for one call to the operation. For example, the tf.argmax(input, axis) operation requires that the axis actually be in range for the input tensor. Hence, the operation’s combination filter would remove an argument list if it has an out-of-bounds axis for the corresponding input tensor. The purpose of combination filters is to avoid executing expensive TensorFlow operations that can be eliminated by quick checks. Furthermore, catching exceptions raised by TensorFlow operations is relatively slow compared to running the combination filter.
The two-stage filtering approach allows for arbitrary value-based checking of operation preconditions. TFCoder is also engineered such that it is easy to add and reuse filters with minimal code duplication: many operations have an axis argument that requires the same argument filter, and similar operations like tf.reduce_sum(input_tensor, axis) can use the same argument and combination filters.
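A minimal sketch of the two filtering stages for an argmax-like operation follows. Nested Python lists stand in for tensors, and every function name as well as the axis bound of 4 is an assumption made for illustration rather than TFCoder's actual filter code.

```python
import itertools


def flatten(value):
    """Yield the scalar elements of a nested list."""
    if isinstance(value, list):
        for item in value:
            yield from flatten(item)
    else:
        yield value


def rank(value):
    """Number of nested list levels (0 for a scalar)."""
    r = 0
    while isinstance(value, list):
        r += 1
        if not value:
            break
        value = value[0]
    return r


# Stage 1: argument filters inspect one candidate value at a time.
def is_numeric_tensor(value):
    return isinstance(value, list) and all(
        isinstance(x, (int, float)) and not isinstance(x, bool) for x in flatten(value))


def is_axis_like(value):
    return isinstance(value, int) and not isinstance(value, bool) and -4 <= value <= 4


# Stage 2: a combination filter inspects one complete argument list.
def axis_in_range(args):
    tensor, axis = args
    return -rank(tensor) <= axis < rank(tensor)


def filtered_argument_lists(candidates, argument_filters, combination_filter):
    per_argument = [[v for v in candidates if f(v)] for f in argument_filters]
    return [args for args in itertools.product(*per_argument) if combination_filter(args)]


candidates = [[1, 5, 2], [[1, 2], [3, 4]], [True, False], 0, 1, 2, 7, True]
argument_lists = filtered_argument_lists(
    candidates, [is_numeric_tensor, is_axis_like], axis_in_range)
print(len(candidates) ** 2, "unfiltered lists ->", len(argument_lists), "after filtering")
```

Only the surviving argument lists would be passed to the real operation, so the cheap checks replace most failing calls.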
Finally, we note that argument filters (but not combination filters) will be run repeatedly on the same values for two reasons. First, argument filters like the axis argument filter are reused among several operations. Second, the same argument will be assigned values of the same weight at different points in the enumerative search. Our solution is to cache the result of applying an argument filter on all explored values of a given weight, i.e., we cache the filtered argument choices in Algorithm 3, where the cache is keyed by the filter function and the weight of the values being filtered. (For simplicity, this caching behavior is not present in Algorithm 1.)
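The caching idea can be sketched as a dictionary keyed by the pair (filter function, weight). The class name `ArgumentFilterCache` and its call counter are hypothetical and exist only to make the saving visible.

```python
class ArgumentFilterCache:
    """Caches filter results for all explored values of a given weight."""

    def __init__(self):
        self._cache = {}        # (filter function, weight) -> accepted values
        self.filter_calls = 0   # how many times a filter was actually evaluated

    def filtered(self, filter_fn, weight, values_by_weight):
        key = (filter_fn, weight)
        if key not in self._cache:
            values = values_by_weight.get(weight, [])
            self.filter_calls += len(values)
            self._cache[key] = [v for v in values if filter_fn(v)]
        return self._cache[key]


def is_small_int(value):
    return isinstance(value, int) and abs(value) <= 4


values_by_weight = {8: [0, 1, -1, 100], 16: [2, 50]}
cache = ArgumentFilterCache()
first = cache.filtered(is_small_int, 8, values_by_weight)
second = cache.filtered(is_small_int, 8, values_by_weight)  # served from the cache
print(first, cache.filter_calls)
```

A cached entry is only valid once every value of that weight has been explored. In a search by increasing weight this holds whenever a filter is applied to arguments, because arguments always weigh strictly less than the expression being built.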
TFCoder’s operation filtering significantly improves the quality of candidate programs considered. In particular, for the difficult task described in Figure 5(e), overall the argument filters eliminated 73% of choices for individual arguments, and then the combination filters further eliminated 60% from the Cartesian product of remaining argument choices. Together, the two-stage filtering strategy eliminated 98.6% of all potential candidate programs.
Domain-Specific Details
Handling Multiple I/O Examples
In the tensor manipulation domain, we observe that most tasks only require a single input/output example. For instance, when performing a reduction across rows of an $M \times N$ matrix to produce a length-$M$ vector, there are essentially $M$ independent examples of a row being reduced to a scalar. One can easily construct a single example with $M$ large enough to unambiguously specify the task. This idea generalizes to nearly all tensor manipulation tasks: adding more numbers to the example makes it clearer. Even so, TFCoder’s enumerative search algorithm can be extended to handle multiple examples, as described in Appendix C.
4. Learning to Guide the Search
In Section 3.2, we noted that operation weights allow TFCoder to prioritize simple and useful operations. Another benefit is that weights can be modified to fit the specific synthesis problem at hand, instead of having static weights that are independent of the problem. This enables strategies that tweak the ordering of the search space to better fit the problem.
TFCoder uses two machine learning models that predict which operations will be used: a neural model conditioned on features of the input and output tensors, and a naïve Bayes bag-of-words model conditioned on the natural language description of the problem. The models’ predictions are used to prioritize operations by multiplying their weights by a constant $c < 1$. Both models independently choose which weights to modify, so if an operation is prioritized by both, its weight will be multiplied by $c^2$. Modified weights are rounded to the nearest integer (or rounded up to 1 since weights must be positive). Then, the search described in Section 3 is run as normal.
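The reweighting step can be sketched as follows. The multiplier 0.75, the operation names, and the initial weights are placeholders chosen for this example, and rounding halves upward is an assumption, since the text only says "nearest integer".

```python
import math


def reweight(weights, prioritized_sets, factor):
    """Multiply an operation's weight by `factor` once per model that prioritizes it."""
    new_weights = {}
    for op, weight in weights.items():
        times = sum(op in chosen for chosen in prioritized_sets)
        scaled = weight * factor ** times
        # Round to the nearest integer, but never below 1 (weights stay positive).
        new_weights[op] = max(1, math.floor(scaled + 0.5))
    return new_weights


weights = {"op_a": 36, "op_b": 18, "op_c": 1, "op_d": 40}
tensor_model_choice = {"op_a", "op_b"}
language_model_choice = {"op_a", "op_c"}
updated = reweight(weights, [tensor_model_choice, language_model_choice], factor=0.75)
print(updated)
```

Here `op_a` is chosen by both models and is scaled twice, while `op_d` is chosen by neither and keeps its weight.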
4.1. Tensor Features Model
We now describe a neural model that learns a Bernoulli distribution over each operation, conditioned on features of input and output tensors. Human experts can often recognize useful operations for tensor transformation tasks by looking at patterns in the user-provided examples. For instance, if one tensor contains small nonnegative integers, they may represent indices into another tensor, especially if the output tensor also contains entries that are found in the input tensors. With the tensor features model, our goal is to learn a similar pattern-recognition capability.
Dataset
One challenge for training such a model is the lack of a large supervised dataset containing real TensorFlow programs together with corresponding input/output examples, so we train our model on a synthetically generated dataset. However, unlike previous approaches (Devlin et al., 2017; Balog et al., 2017; Shin et al., 2019) that uniformly sample from a space of programs and inputs, we observe that this approach in the TensorFlow domain will result in a huge number of errors due to the many constraints imposed by TensorFlow operations. Furthermore, without symbolic formulas for these constraints, we cannot use solver-based approaches to find satisfactory programs and inputs (King, 1976; Cadar et al., 2008).
We present the novel idea of generating the synthetic training dataset using our enumerative search algorithm, running the weighted value search on randomly generated inputs for 10 minutes to gather a large number of explored values. For each such value, we consider all ways of collapsing subtrees of its code expression into new inputs, to add more variety in the input tensors. For example, given the code expression tf.greater(tf.add(in1, tf.squeeze(in2)), in3), we would additionally consider the expressions tf.greater(new_input, in3) and tf.greater(tf.add(in1, new_input), in3), where new_input is a new input tensor with a value equal to the value of the code subtree that it replaced. We randomly choose one such way of collapsing subtrees (including the original expression unchanged) for each explored value, resulting in an I/O example with a corresponding TensorFlow program.
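Enumerating the ways of collapsing subtrees can be sketched with a small recursive function over expression trees. The tuple encoding and the names `collapse_variants` and `to_code` are assumptions for illustration. The sketch uses a single placeholder name, whereas a fuller version would give each collapsed subtree its own distinct input name and record the subtree's value.

```python
import itertools

NEW_INPUT = "new_input"  # placeholder leaf standing for a collapsed subtree


def collapse_variants(expr, is_root=True):
    """Return all expressions obtained by replacing proper, non-leaf subtrees."""
    if isinstance(expr, str):  # leaves (inputs, constants) are kept as they are
        return [expr]
    op, *children = expr
    child_options = [collapse_variants(c, is_root=False) for c in children]
    variants = [(op, *combo) for combo in itertools.product(*child_options)]
    if not is_root:
        variants.append(NEW_INPUT)  # collapse this whole subtree into one new input
    return variants


def to_code(expr):
    if isinstance(expr, str):
        return expr
    op, *args = expr
    return f"{op}({', '.join(to_code(a) for a in args)})"


expression = ("tf.greater", ("tf.add", "in1", ("tf.squeeze", "in2")), "in3")
rendered = [to_code(v) for v in collapse_variants(expression)]
for line in rendered:
    print(line)
```

For the expression from the text, this yields the original expression plus the two collapsed forms mentioned above.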
We then filter the dataset to only contain programs that use at least two operations, since programs using one single operation are already easily synthesized by the value search in a fraction of a second. Additionally, we also exclude examples where an input or output tensor has more than 50 elements, to more closely resemble example tensors that would be manually provided by TFCoder’s users. Our training dataset comes from 20,000 runs of value search on random inputs, where we draw one training example each from at most 2,000 explored values from each run, for a total of 39,930,863 training examples. The evaluation dataset uses 1,000 runs of value search and at most 100 examples from each run, for a total of 99,852 evaluation examples.
Example Features
We compute a set of features for the input/output tensors to feed into the model, which include:

If the value is a primitive, sequence, tensor, or SparseTensor

The value’s data type, rank, and dimension lengths

Statistics (e.g., max, min, mean) of its elements

The number and fraction of elements with various properties, e.g., exactly zero, within a given range, unique elements, etc.

Various boolean properties of the value, e.g., entirely positive, all elements unique, sorted, etc.
In addition to featurizing the individual input and output tensors, we also compute features representing the comparison of each input value to the output value:

Comparing the number of elements, ranks, and each dimension length

The number and fraction of input elements that also appear in the output, and vice versa

If all input elements appear in output, and vice versa

If each dimension length of the input also appears as some dimension length of the output, and vice versa
For all features that result in an unbounded integer or float (e.g., the maximum element or number of unique elements), we bucket the feature to turn it into a categorical feature.
To featurize an input/output example, we first pad the list of inputs with dummy input values until there are exactly 3 inputs, so that the same number of features are extracted for each example. (This scheme supports a maximum of 3 inputs, but this could be relaxed. We have not yet encountered a reasonably complex task requiring 4 inputs.) We then extract features for each input and the output individually, and extract features from a comparison of each input to the output. We also add a single feature representing the number of inputs.
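A few of the steps above (padding to three inputs, bucketing unbounded numbers, and input/output comparison features) can be sketched on flat Python lists. The bucket boundaries, feature names, and function names are all hypothetical; the paper's actual feature set is larger and is not reproduced here.

```python
import bisect

MAX_INPUTS = 3
BUCKET_BOUNDARIES = [0, 1, 2, 5, 10, 50, 100]  # hypothetical bucket edges


def bucket(number):
    """Turn an unbounded number into a small categorical bucket index."""
    return bisect.bisect_right(BUCKET_BOUNDARIES, number)


def pad_inputs(inputs):
    """Pad with dummy (None) inputs so every example has exactly MAX_INPUTS."""
    if len(inputs) > MAX_INPUTS:
        raise ValueError("this sketch supports at most 3 inputs")
    return list(inputs) + [None] * (MAX_INPUTS - len(inputs))


def comparison_features(inp, out):
    """Features comparing one flat input list to the flat output list."""
    if inp is None:
        return {"is_dummy": True}
    in_set, out_set = set(inp), set(out)
    return {
        "is_dummy": False,
        "fraction_input_in_output": sum(x in out_set for x in inp) / max(len(inp), 1),
        "all_output_in_input": out_set <= in_set,
        "num_unique_bucket": bucket(len(in_set)),
    }


inputs = [[45, 58, 72, 33, 45, 58, 58, 33]]
output = [0, 1, 2, 3, 0, 1, 1, 3]
features = [comparison_features(i, output) for i in pad_inputs(inputs)]
features.append({"num_inputs": len(inputs)})
print(features)
```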
Models
Our neural model first embeds categorical features (e.g., boolean properties, bucketed numbers, data types, etc.) using an embedding size equal to the cardinality of the feature. The embeddings are concatenated along with unembedded features (e.g., fraction features), resulting in a vector of length 2049. This is passed through 1 or 2 dense layers. A final dense output layer then produces a logit for each operation, and an elementwise sigmoid is applied to get a probability for each operation.
We experiment with different loss functions. One is a standard sigmoid cross entropy loss averaged over the operations. However, as each example only uses a few operations, the dataset is overwhelmingly negative, which could lead the model to be overly conservative with its predictions. Thus, we also implement a differentiable $F_\beta$ metric (van Rijsbergen, 1979) as a loss function to achieve different balances in precision and recall. $F_1$ prioritizes precision and recall equally, while $F_2$ cares twice as much about recall as about precision (in general, we found that the benefit of correctly prioritizing a used operation outweighs the cost of prioritizing an operation that is not actually used).
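One common way to make an F-measure differentiable is to replace the hard true/false positive counts with sums of predicted probabilities. The sketch below follows that formulation in plain Python; it is an assumption about the general technique, not the exact loss used in the paper, and the name `soft_f_beta_loss` is hypothetical.

```python
def soft_f_beta_loss(probabilities, labels, beta, epsilon=1e-8):
    """1 - F_beta computed from soft (probability-weighted) confusion counts."""
    tp = sum(p * y for p, y in zip(probabilities, labels))
    fp = sum(p * (1 - y) for p, y in zip(probabilities, labels))
    fn = sum((1 - p) * y for p, y in zip(probabilities, labels))
    b2 = beta * beta
    f_beta = (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp + epsilon)
    return 1.0 - f_beta


labels = [1, 1, 0, 0]
missed_operation = [1.0, 0.0, 0.0, 0.0]   # one used operation is not predicted
extra_operation = [1.0, 1.0, 1.0, 0.0]    # one unused operation is predicted

for beta in (1, 2):
    print(beta,
          round(soft_f_beta_loss(missed_operation, labels, beta), 4),
          round(soft_f_beta_loss(extra_operation, labels, beta), 4))
```

With beta = 2 the missed operation is penalized more heavily than the extra one, which matches the preference for recall described above.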
The distribution of operations in the synthetic dataset is different from the distribution of operations that are actually used in problem solutions for two reasons. First, the dataset is created from running weighted value search, which inherently prioritizes simple operations over more complex ones. Second, there are fewer valid programs containing operations with many constraints compared to operations with few constraints. We experimented with balancing the dataset by giving a weight to each positive example (where an operation is actually used), and leaving negative examples (operation unused) unchanged. The weight $w_{op}$ for operation $op$, when it is actually used in the training example, is either

$$w_{op}^{\max} = \frac{\max_{op'} \mathrm{count}_{op'}}{\mathrm{count}_{op}} \quad \text{or} \quad w_{op}^{\mathrm{mean}} = \frac{\mathrm{mean}_{op'}\, \mathrm{count}_{op'}}{\mathrm{count}_{op}},$$

where $\mathrm{count}_{op}$ is the number of examples in the training set where $op$ is actually used. The $w^{\max}$ weighting scheme has the property that no operation is downweighted, but it leads to the model believing that there are many more positive examples than there actually are. In contrast, with the $w^{\mathrm{mean}}$ weighting scheme, the model believes that the proportion of positive examples is unchanged. Finally, we clip weights to a maximum of 10,000 to avoid training instability from extremely large weights.
Considering sigmoid cross entropy, $F_1$, and $F_2$ loss functions, along with $w^{\max}$ weights, $w^{\mathrm{mean}}$ weights, or no weighting at all, we have 9 different variations. For each variation, we ran a hyperparameter sweep and selected the run with the lowest evaluation loss after 3 epochs. We observed no overfitting. We varied the number of hidden feedforward layers (1 or 2), the size of the hidden layers (512, 1024, or 2048), and the learning rate (7 choices between 1e-5 and 1e-3). We used the Adam optimizer (Kingma and Ba, 2014) with global norm gradient clipping (Pascanu et al., 2012). Results are discussed in Section 5.2.
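The two balancing schemes can be sketched under the assumption that a positive example's weight is the maximum (or mean) usage count across operations divided by that operation's own usage count. This assumption is consistent with the properties stated above, but the function name and the counts below are made up for illustration.

```python
def positive_example_weights(counts, scheme, max_weight=10_000):
    """counts: {operation: number of training examples using it (must be > 0)}."""
    if scheme == "max":
        numerator = max(counts.values())
    elif scheme == "mean":
        numerator = sum(counts.values()) / len(counts)
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    # Rare operations get large weights; clip to avoid unstable training.
    return {op: min(numerator / count, max_weight) for op, count in counts.items()}


counts = {"op_a": 1000, "op_b": 100, "op_c": 10}
max_weights = positive_example_weights(counts, "max")
mean_weights = positive_example_weights(counts, "mean")

# Under the mean scheme the total weighted number of positives is unchanged.
weighted_positives = sum(mean_weights[op] * counts[op] for op in counts)
print(max_weights, mean_weights, weighted_positives)
```

Under the max scheme every weight is at least 1, so the weighted number of positives can only grow, which is why that scheme inflates the apparent share of positive examples.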
For all variations of the tensor features model, we prioritize all operations where the predicted probability is greater than 0.5.
4.2. Natural Language Model
In this section we describe our approach to reweighting operations based on the natural language text accompanying the input/output examples. These descriptions can provide information about what operations are likely to be used in the solution. As with the tensor features model, we formulate the task as a supervised multilabel classification problem. For an input natural language description, the task is to predict a binary label for each operation, indicating whether the operation is likely to be used in the solution.
Dataset
Since we do not have a large dataset of TFCoder queries paired with target TensorFlow operations, we construct a proxy dataset from the TensorFlow documentation and from TensorFlow code on GitHub. The proxy dataset does not represent the same distribution as TFCoder queries, and we will note the implications of this when we describe our models.
We construct the first part of the proxy dataset from the TensorFlow documentation. For each operation supported by TFCoder, we construct a single instance for our dataset using the operation’s docstring. The docstring serves as the task description, and we consider the operation to be the sole target operation for the instance. This yields 134 descriptions paired with target operations.
To complete the dataset, we additionally construct examples from TensorFlow code from GitHub. We collect 65,617 functions that use at least one TFCoder-supported TensorFlow operation from GitHub projects with a permissive license. Following the method of Allamanis (2018), we remove duplicate and near-duplicate instances from this dataset, leaving 13,960 functions. For each function, we extract a natural language context from the function, as well as the set of supported TensorFlow operations used by the function. The natural language context consists of the function’s docstring and all comments, string literals, and variable names appearing in the function. We use this natural language context as a proxy for the task description, and we use the TensorFlow operations found in the function as the target TensorFlow operations. In total, our full constructed dataset has 14,094 instances.
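The extraction step described above can be sketched with Python's standard `ast` and `tokenize` modules. This is a minimal illustration and not the authors' pipeline: the helper names `natural_language_context` and `used_operations`, the sample function, and the small set of supported operation names are all assumptions made for the example.

```python
import ast
import io
import tokenize


def natural_language_context(source: str) -> str:
    """Collect docstrings, string literals, identifiers, and comments."""
    pieces = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Constant) and isinstance(node.value, str):
            pieces.append(node.value)  # string literals, including docstrings
        elif isinstance(node, ast.Name):
            pieces.append(node.id)     # variable names
        elif isinstance(node, ast.arg):
            pieces.append(node.arg)    # parameter names

    # Comments are not part of the AST, so read them from the token stream.
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT:
            pieces.append(tok.string.lstrip("#").strip())
    return " ".join(pieces)


def used_operations(source: str, supported: set) -> set:
    """Return the supported dotted names (e.g. tf.reduce_sum) found in the code."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Attribute):
            name = ast.unparse(node)  # requires Python 3.9+
            if name in supported:
                found.add(name)
    return found


# Hypothetical function, standing in for one collected from GitHub.
EXAMPLE = '''
def clip_and_sum(values, limit):
    """Clip values to a limit, then sum each row."""
    # keep everything non-negative
    clipped = tf.clip_by_value(values, 0, limit)
    return tf.reduce_sum(clipped, axis=1)
'''

# Hypothetical subset of supported operations.
SUPPORTED = {"tf.reduce_sum", "tf.clip_by_value", "tf.cast"}

context = natural_language_context(EXAMPLE)
ops = used_operations(EXAMPLE, SUPPORTED)
```

Here `context` plays the role of the proxy task description and `ops` the role of the target operations for one dataset instance.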
Models
We train two models: a TF-IDF model and a naïve Bayes model. Each model accepts natural language text and operations as input, and decides which operations to prioritize in the search. We restrict our models to prioritizing at most operations with the best scores. These models are implemented using scikit-learn (Pedregosa et al., 2011).
In selecting these models, we take into consideration the differences between the proxy dataset and the expected distribution of TFCoder queries. For example, the natural language context in the proxy dataset is often different in structure from the real task descriptions. Nevertheless, we hypothesize that we can still learn from the vocabulary used in the proxy dataset to perform well on the benchmark tasks. So, we focus our efforts on two bag-of-words models. In investigations with more complex models, we found that higher-capacity models can better fit the proxy data but do not generalize well to the target domain of TFCoder task descriptions.
We first consider the TF-IDF model, which we train using only the TensorFlow documentation, not the instances gathered from GitHub. We construct a vocabulary consisting of those terms appearing at least once in the docstrings of the supported TensorFlow operations, with English stop words removed. For each operation , we construct a vector from the operation’s docstring consisting of the tf-idf score of each term in the vocabulary (Jones, 1972). The tf-idf score for a term in a docstring is computed as the number of occurrences of the term in the docstring, divided by the smoothed log total number of occurrences of the term across all docstrings. The smoothing is equivalent to there being a single extra docstring containing every term exactly once.
We construct an analogous vector from the input text . For natural language and operation , the TF-IDF model produces a score given by the cosine similarity between and . The model prioritizes the operations with the highest scores, considering only those operations with score exceeding a threshold , and up to operations prioritized.
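A small pure-Python sketch of this scoring scheme follows. It is illustrative only: it assumes the scikit-learn-style smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, which may differ from the exact weighting used in the experiments, and the stop-word list, the abbreviated docstrings, and the parameter names `min_score` and `max_ops` are placeholders chosen for the example.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the list used in the paper).
STOP_WORDS = {"a", "an", "the", "of", "to", "and", "in", "is", "for"}


def terms(text):
    """Lowercase word tokens with stop words removed."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def build_idf(docstrings):
    """Smoothed idf, as if one extra document contained every term once."""
    n_docs = len(docstrings)
    doc_freq = Counter()
    for doc in docstrings.values():
        doc_freq.update(set(terms(doc)))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}


def tfidf_vector(text, idf):
    """Sparse tf-idf vector; out-of-vocabulary terms are ignored."""
    counts = Counter(t for t in terms(text) if t in idf)
    return {t: count * idf[t] for t, count in counts.items()}


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


def prioritize(description, docstrings, min_score, max_ops):
    """Return up to max_ops operations whose score exceeds min_score."""
    idf = build_idf(docstrings)
    query = tfidf_vector(description, idf)
    scores = {op: cosine(query, tfidf_vector(doc, idf)) for op, doc in docstrings.items()}
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [op for op, score in ranked if score > min_score][:max_ops]


# Abbreviated stand-ins for operation docstrings.
DOCS = {
    "tf.reduce_max": "Computes the maximum of elements across dimensions of a tensor.",
    "tf.argsort": "Returns the indices of a tensor that give its sorted order along an axis.",
    "tf.tile": "Constructs a tensor by tiling a given tensor.",
}

top = prioritize("find the maximum element in each row", DOCS, min_score=0.1, max_ops=2)
```

In this toy setting only `tf.reduce_max` shares a vocabulary term ("maximum") with the description, so it is the only operation that clears the threshold.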
The second model is a naïve Bayes model, which we train on the full constructed dataset. This model uses the same vocabulary and document frequencies as the TF-IDF model and the same definition of . Though the dataset is now larger, we do not expand the vocabulary to include novel terms. We find that restricting the capacity of the model in this way limits its tendency to overfit to the domain of the constructed dataset.
For each operation , let be a binary random variable indicating whether is used in the target program. The naïve Bayes model estimates the probability of being used given natural language as
We calculate this using the estimate , where is the Lidstone smoothing parameter ( in our experiments). is the sum of the tf-idf scores of all terms appearing with , is the sum of the tf-idf scores of all instances of term appearing with , and is the number of terms in the vocabulary.
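For orientation, a Lidstone-smoothed multinomial naïve Bayes model matching the verbal description above can be written in the following standard form. The symbol names ($Y$ for the binary label of one operation, $D$ for the description, $w_i$ for the tf-idf weight of vocabulary term $i$ in $D$, $\alpha$, $N_y$, $N_{y,i}$, and $n$) are assumptions chosen for illustration and are not necessarily the notation of the original equations.

```latex
\begin{align}
P(Y = 1 \mid D) &= \frac{P(Y = 1) \prod_{i} P(i \mid Y = 1)^{w_i}}
                        {\sum_{y \in \{0, 1\}} P(Y = y) \prod_{i} P(i \mid Y = y)^{w_i}}, \\
\hat{P}(i \mid Y = y) &= \frac{N_{y,i} + \alpha}{N_{y} + \alpha n}.
\end{align}
```

Here $N_{y,i}$ is the summed tf-idf score of term $i$ over training instances with label $y$, $N_y = \sum_i N_{y,i}$, $n$ is the vocabulary size, and $\alpha$ is the Lidstone smoothing parameter.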
The distribution of operations in the proxy dataset differs from the distribution of operations that appear in TFCoder queries. On GitHub, TensorFlow usage skews toward implementing models and training pipelines, whereas TFCoder queries are tensor manipulations. So, rather than estimating from the proxy dataset, we instead use the uniform prior and estimate for all operations, which we found to perform better. The naïve Bayes model prioritizes operations with , up to operations, where and are hyperparameters.
We experiment with different variations of these models: TF-IDF using , naïve Bayes using , and the maximum number of operations prioritized for both models. Results for the best settings are shown in Section 5.2.
5. Experiments
We now present an evaluation of TFCoder on a set of real-world benchmarks. We use ablation experiments to analyze the overall efficiency gains of TFCoder’s synthesis algorithm compared to baseline approaches. Finally, we perform a study of the synthesis results of TFCoder in comparison to the answers provided by human experts on StackOverflow.
Benchmark Tasks
We collected a benchmark set of 70 tensor manipulation tasks, including 50 drawn from StackOverflow questions and 20 real tasks encountered by TensorFlow users in an industrial setting. While collecting the benchmark tasks, we noticed that some were not actually amenable to solutions in TensorFlow, so we excluded tasks that we could not solve by hand after much effort. Of the 50 StackOverflow tasks, 34 contained an input/output example in the question. We expanded these examples (adding more entries to the tensors) where necessary to make the patterns clear, or used the examples as-is if they were already comprehensive. For questions posed without input/output examples, we created examples manually. We also manually wrote single-sentence descriptions for the tasks, borrowing as much wording from the question’s title and body as possible while remaining concise, grammatical, and accurate. Examples of this process are discussed in Appendix D.
5.1. Comparison to Prior Work
TFCoder extends the search in Transit (Udupa et al., 2013) in several ways:

TFCoder incorporates weights for operations and base values, while Transit does not use weights.

TFCoder uses a flexible operation filtering system that generalizes Transit’s type checking, which is insufficient for many TensorFlow operations.

TFCoder uses two models to modify operation weights.
In this section, we evaluate the effectiveness of the first two improvements (the models are evaluated in Section 5.2). We run 4 variations of TFCoder where we independently turn on or off weighting and operation filtering, without using models. (We turn off operation filtering as much as possible, but 36 of 134 operations require filtering to avoid uncatchable segfaults or excessive memory usage.)
The results of these 4 variations on our benchmarks are plotted in Figure 5. Both techniques in isolation lead to significant improvement over the Transit algorithm, and their combination produces another large improvement. Overall, TFCoder without any models can solve 62 of the 70 benchmark tasks within 5 minutes, while Transit only solves 44 tasks.
5.2. Effect of the Learned Models
Table 1. Performance of the best model variations on the benchmark tasks.
Model | Tasks solved | Num faster (avg. speedup) | Num slower (avg. speedup) | Time for 62 tasks (s) | Total speedup | Avg. speedup
TFCoder without any models | 62 | n/a | n/a | 1147.6 | n/a | n/a
(A) CE, | 62 | 30 (43.3%) | 17 | 887.3 | 22.7% | 16.8%
(B) , | 63 | 38 (42.9%) | 13 | 756.2 | 34.1% | 23.9%
(C) , | 63 | 43 (45.3%) | 9 | 907.7 | 20.9% | 24.4%
(X) Naïve Bayes, , | 63 | 26 (39.0%) | 9 | 1085.2 | 5.4% | 12.5%
(Y) Naïve Bayes, , | 62 | 24 (41.1%) | 4 | 1013.0 | 11.7% | 14.2%
(Z) TF-IDF, , | 62 | 21 (42.5%) | 7 | 1138.6 | 0.8% | 14.8%
(B) with (X) (chosen combination) | 63 | 44 (50.1%) | 9 | 682.4 | 40.5% | 32.4%
(B) with (Y) | 63 | 43 (50.8%) | 11 | 675.1 | 41.2% | 31.8%
(B) with (Z) | 63 | 40 (52.9%) | 11 | 723.5 | 37.0% | 34.7%
(C) with (Y) | 63 | 47 (53.1%) | 6 | 809.9 | 29.4% | 32.8%
We now evaluate different models to prioritize operations during the enumerative search. We find the best tensor features model (Section 4.1) and the best natural language model (Section 4.2) in isolation, and then find the best combination of the two models.
Table 1 lists the performance of the best model variations on our benchmark tasks. For the tensor features models, we experimented with 3 different loss functions and 3 different weighting schemes as described in Section 4.1. For the natural language models, model-specific hyperparameters are listed as described in Section 4.2.
Table 1 compares the performance of TFCoder when using each model against TFCoder without any models at all. All model variations and combinations listed in Table 1 solved (at least) all of the 62 benchmark tasks that were solved by the no-model variant. For each task, if using a model results in a solve time that is less than 5% or less than 0.1 seconds different compared to the no-model run, then we consider the solve times to be “roughly equal” and possibly attributed to noise in the timings. The table lists the number of tasks where the timing difference is larger in either direction: “faster” means the model does better than not using a model, and “slower” means the model does worse. We also report the average speedups among the faster and slower tasks. “Time for 62 tasks” is the sum of the solve times for the tasks solved by the no-model variant, and “total speedup” compares that total time against that of the no-model variant. “Average speedup” computes the average of the per-task speedups. Note that “total speedup” is heavily biased toward performance on the few difficult long-running tasks, while “average speedup” is representative of all tasks (even easy tasks that are solved incredibly quickly, where an end-user might not even notice any time difference).
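The metrics described above can be sketched as follows. The sketch assumes that a speedup is defined as one minus the ratio of model time to no-model time, which reproduces the reported total speedups (for example, 1 - 887.3 / 1147.6 ≈ 22.7% for variation (A)). It further assumes that the 5% tolerance is measured relative to the no-model time and that the average speedup is taken over all tasks. The function name and the toy timings are hypothetical.

```python
def compare_timings(baseline, with_model, rel_tol=0.05, abs_tol=0.1):
    """Summarize solve-time changes relative to a no-model baseline.

    Both arguments map task name -> solve time in seconds.
    """
    faster, slower = [], []
    for task, base_time in baseline.items():
        model_time = with_model[task]
        difference = abs(model_time - base_time)
        # "Roughly equal": less than 5% or less than 0.1 s apart.
        if difference < rel_tol * base_time or difference < abs_tol:
            continue
        speedup = 1.0 - model_time / base_time
        (faster if speedup > 0 else slower).append(speedup)

    per_task = [1.0 - with_model[t] / baseline[t] for t in baseline]
    return {
        "num_faster": len(faster),
        "num_slower": len(slower),
        "total_speedup": 1.0 - sum(with_model[t] for t in baseline) / sum(baseline.values()),
        "average_speedup": sum(per_task) / len(per_task),
    }


# Toy timings (hypothetical): one long task gets faster, one short task gets slower.
baseline = {"long_task": 10.0, "tiny_task": 0.5, "short_task": 2.0}
with_model = {"long_task": 5.0, "tiny_task": 0.52, "short_task": 3.0}
stats = compare_timings(baseline, with_model)
```

In the toy data the total speedup is clearly positive because the long task dominates the summed time, while the average speedup is slightly negative, which illustrates why the two metrics are reported separately.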
Tensor Features Model
For the tensor features model, we found that the weighting scheme was consistently the best weighting scheme across all three loss functions. The loss function resulted in the highest total speedup of 34.1%, while the loss function had the highest average speedup of 24.4%. Both of these loss functions solved one extra task compared to the no-model run.
Natural Language Model
The best naïve Bayes models obtain higher total speedup than the best TF-IDF model, although the TF-IDF model has slightly better average speedup. Overall, the natural language models were less effective than the tensor features models, but the natural language models lead to slowdowns for fewer tasks.
Model Combinations
We tried all 9 combinations of the 3 best tensor features models and the 3 best natural language models (as listed in Table 1), with results for the four best combinations listed at the bottom of the table. The different combinations excel in different ways. Considering the many metrics in Table 1, as well as performance on the 63rd “extra” solved task, we consider the best combination of models to use with weighting as the tensor features model, and Naïve Bayes with and as the natural language model. This combination led to speedups for 44 of 62 tasks (71%), on average cutting the synthesis time in half among such tasks, which helps TFCoder feel more interactive.
It is also promising that the model combinations perform significantly better than the individual models themselves. This suggests that our framework enables complementary models to jointly influence the enumerative search with compounding benefits.
5.3. Comparison to StackOverflow
Since TFCoder was inspired by questions on forums like StackOverflow, it is natural to compare TFCoder’s performance with that of the StackOverflow community. We found that, among the 50 StackOverflow questions, 47 had answers but only 32 had correct answers. Incorrect answers included cases where the expert misinterpreted the question, or where the solution did not fully generalize, used operations that no longer exist in the current version of TensorFlow (2.0), or otherwise had bugs that prevent the suggested code from executing successfully. Among correct answers, the median answer-posting time was 31 minutes. In comparison, TFCoder is able to solve 44 of the StackOverflow tasks within 5 minutes, with a median solve time of 1.4 seconds. Furthermore, TFCoder’s solutions are guaranteed to run successfully on the given example. We also manually inspected TFCoder’s solutions and found that they all correctly implement the desired behavior, except for one solution which was mostly correct but had a subtle bug that prevents it from generalizing perfectly. We discuss this in Appendix E.
5.4. A Sample of Synthesized Programs
Figure 6 shows examples of interesting problems that TFCoder is able to solve. We observe that on these problems and many others, TFCoder finds solutions that are simpler or more elegant than human-written solutions. One major strength of TFCoder is that it can identify solutions using uncommon operations that a human programmer might not know about, or unconventional combinations of operations that the programmer might not have considered. Such behavior would not be expected from other synthesis approaches that attempt to imitate existing code corpora.
6. Related Work
In this section, we discuss related works from several domains.
Programming By Example (PBE)
The problem of synthesizing programs from input/output examples has been studied for a long time, starting with work on synthesizing LISP programs (Shaw et al., 1975; Hardy, 1974). More recently, PBE techniques have been developed for domains including string transformations (Gulwani, 2011; Gulwani et al., 2012; Singh, 2016), data extraction from semi-structured formats (Le and Gulwani, 2014), data structure manipulations (Feser et al., 2015a; Singh and Solar-Lezama, 2011), distributed cache coherence protocols (Udupa et al., 2013), data imputation programs (Wang et al., 2017, 2018), MapReduce programs (Smith and Albarghouthi, 2016), and Java functions (Shi et al., 2018).
Unlike these approaches, which synthesize programs from only input/output examples, TFCoder uses both input/output examples and natural language descriptions to guide a weighted enumerative search. Ye et al. (2019) present a technique to generate regular expressions from natural language and examples, where the natural language is first used to generate a program sketch and the sketch is then completed using an enumerative approach using examples. On the other hand, TFCoder uses both examples and natural language simultaneously to guide a weighted bottom-up search over compositions of supported operations. Synquid (Polikarpova and Solar-Lezama, 2015) also uses type-based reasoning and filtering for synthesis, whereas TFCoder uses dynamic value-based checks for argument and combination filters for different TensorFlow operations.
Machine Learning for Program Synthesis
With the recent advances in machine learning, there has been much interest in using such techniques for program synthesis. RobustFill (Devlin et al., 2017; Parisotto et al., 2016) uses an encoder-decoder model to generate string transformation programs from examples. The encoder embeds the example strings using recurrent LSTM networks, and the resulting embedding is then used to condition the output program sequence decoder. DeepCoder (Balog et al., 2017) trains a model to learn a distribution over possible list functions given the input/output list examples. It then uses the distribution to guide an enumerative search. Euphony (Lee et al., 2018) performs a weighted enumerative search using the A* search algorithm, where the weights come from a probabilistic higher-order grammar (PHOG). Similar to these approaches, TFCoder also learns a distribution over possible programs conditioned on the corresponding specification. However, it uses both input/output examples and natural language as the specification, and uses the trained models to modify operation weights to perform a task-specific weighted search.
AutoPandas (Bavishi et al., 2019) uses graph neural networks to synthesize Pandas programs that manipulate DataFrames, which are similar to TensorFlow tensors. A key innovation in AutoPandas is a graph representation of the input and output DataFrames with edges connecting equal cells. Although tensors and DataFrames are similar, AutoPandas’ graph approach is not as applicable to the TensorFlow domain, since many common mathematical operations would break the cell-equivalence edges. In other words, DataFrames retain much of their data while being manipulated through pivots, selections, and joins, making it easy for cell-equivalence edges to track the movement of data, while this is only true for a fraction of manipulations in TensorFlow.
There are also some approaches that use machine learning for ranking programs. FlashFill uses version-space algebra to identify all programs in a DSL that are consistent with a set of input/output examples, and then uses a ranking function learned through supervised learning (Singh and Gulwani, 2015) to rank the programs, so that the user does not need to provide too many examples before obtaining the desired program. Unlike this ranking approach that first finds all consistent programs, TFCoder uses learning to guide the search in the first place.
Menon et al. (2013) describe an approach for synthesizing string manipulation programs that learns a probabilistic context-free grammar (PCFG) of rules given a set of examples. It uses a set of hand-designed clues to learn a distribution over likely rules and then enumerates over a subset of rules in order of decreasing probabilities to search for a consistent program. Since it learns from a small number of training examples (280), the clues need to be very domain-specific. In comparison, TFCoder’s TensorFlow domain is quite different from the string-processing domain. TFCoder trains a model to learn a distribution over operations from millions of synthetically generated programs, and the model is used to guide an efficient weighted enumerative search with value- and type-based filtering and pruning strategies.
Program Synthesis
There has been a renewed interest in program synthesis research in the last decade because of the advances in both constraint solving and algorithmic synthesis techniques (Alur et al., 2013; Gulwani et al., 2017). The synthesis approaches can be broadly classified based on the underlying search mechanism: (i) enumerative (Udupa et al., 2013), (ii) constraint-based (Solar-Lezama et al., 2006; Solar-Lezama, 2013), and (iii) stochastic (Schkufza et al., 2013; Shi et al., 2018). Applying constraint-based synthesis techniques to the TensorFlow domain would require a huge effort of modeling semantics of TensorFlow operations, and for many operations these would not be scalable due to complex non-linear computations. TFCoder builds on top of the bottom-up enumerative search from Transit (Udupa et al., 2013), adding expression weights and flexible value-based filtering for a more efficient search. Moreover, it dynamically adjusts weights using learned models based on the input/output examples and natural language description.
7. Conclusion
In this paper, we presented TFCoder, a synthesis tool for automatically generating tensor manipulation programs in TensorFlow from examples and natural language. TFCoder employs a bottom-up weighted enumerative search with type- and value-based filtering to conform to the constraints imposed by TensorFlow operations. It uses two machine learning models to predict useful operations from features of the input/output tensors and a natural language description of the task, and these predictions are used to modify the weights to customize the search process for the given task. We evaluated TFCoder on 70 real-world tensor transformation tasks faced by TensorFlow users on StackOverflow and in an industrial setting, and various ablation experiments show the usefulness of the two models and filtering techniques. We believe that TFCoder can help both machine learning beginners and experienced practitioners in writing tricky tensor transformation programs that are common in deep learning pipelines.
Acknowledgements.
The authors thank Charles Sutton and the other members of the program synthesis team at Google Brain for helpful discussions.
References
Abadi et al. (2016). TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2016, Savannah, GA, USA, November 2-4, 2016, pp. 265–283.
Allamanis (2018). The Adverse Effects of Code Duplication in Machine Learning Models of Code. arXiv e-prints, arXiv:1812.06469.
Alur et al. (2013). Syntax-guided synthesis. In FMCAD 2013, pp. 1–8.
Balog et al. (2017). DeepCoder: learning to write programs. In ICLR 2017.
Bavishi et al. (2019). AutoPandas: neural-backed generators for program synthesis. PACMPL 3 (OOPSLA), pp. 168:1–168:27.
Cadar et al. (2008). KLEE: unassisted and automatic generation of high-coverage tests for complex systems programs. In Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation, OSDI’08, Berkeley, CA, USA, pp. 209–224.
Chen et al. (2015). MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. CoRR abs/1512.01274.
Devlin et al. (2017). RobustFill: neural program learning under noisy I/O. In ICML 2017, pp. 990–998.
Feser et al. (2015a). Synthesizing data structure transformations from input-output examples. In PLDI 2015, pp. 229–239.
Feser et al. (2015b). Synthesizing data structure transformations from input-output examples. In Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’15, New York, NY, USA, pp. 229–239. ISBN 9781450334686.
Gulwani et al. (2012). Spreadsheet data manipulation using examples. Commun. ACM 55 (8), pp. 97–105.
Gulwani et al. (2017). Program synthesis. Foundations and Trends in Programming Languages 4 (1-2), pp. 1–119.
Gulwani (2011). Automating string processing in spreadsheets using input-output examples. In POPL 2011, pp. 317–330.
Hardy (1974). Automatic induction of LISP functions. In Proceedings of the 1st Summer Conference on Artificial Intelligence and Simulation of Behaviour, AISB’74, Amsterdam, The Netherlands, pp. 50–62.
Jones (1972). A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation 28, pp. 11–21.
King (1976). Symbolic execution and program testing. Commun. ACM 19 (7), pp. 385–394. ISSN 0001-0782.
Kingma and Ba (2014). Adam: A Method for Stochastic Optimization. arXiv e-prints, arXiv:1412.6980.
Le and Gulwani (2014). FlashExtract: a framework for data extraction by examples. In PLDI 2014, pp. 542–553.
LeCun et al. (2015). Deep learning. Nature 521 (7553), pp. 436–444.
Lee et al. (2018). Accelerating search-based program synthesis using learned probabilistic models. SIGPLAN Not. 53 (4), pp. 436–449. ISSN 0362-1340.
Menon et al. (2013). A machine learning framework for programming by example. In Proceedings of the 30th International Conference on Machine Learning, S. Dasgupta and D. McAllester (Eds.), Proceedings of Machine Learning Research, Vol. 28, Atlanta, Georgia, USA, pp. 187–195.
Parisotto et al. (2016). Neuro-symbolic program synthesis. CoRR abs/1611.01855.
Pascanu et al. (2012). Understanding the exploding gradient problem. CoRR abs/1211.5063.
Paszke et al. (2017). Automatic differentiation in PyTorch. In NIPS Autodiff Workshop.
Pedregosa et al. (2011). Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830.
Polikarpova and Solar-Lezama (2015). Program synthesis from polymorphic refinement types. CoRR abs/1510.08419.
Schkufza et al. (2013). Stochastic superoptimization. In ASPLOS 2013, pp. 305–316.
Seide and Agarwal (2016). CNTK: Microsoft’s open-source deep-learning toolkit. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, pp. 2135–2135. ISBN 9781450342322.
Shaw et al. (1975). Inferring LISP programs from examples. In IJCAI 1975, pp. 260–267.
Shi et al. (2018). FrAngel: component-based synthesis with control structures. CoRR abs/1811.05175.
Shin et al. (2019). Synthetic datasets for neural program synthesis. In ICLR 2019.
Singh and Gulwani (2015). Predicting a correct program in programming by example. In CAV 2015, pp. 398–414.
Singh and Solar-Lezama (2011). Synthesizing data structure manipulations from storyboards. In SIGSOFT FSE 2011, pp. 289–299.
Singh (2016). BlinkFill: semi-supervised programming by example for syntactic string transformations. PVLDB 9 (10), pp. 816–827.
Smith and Albarghouthi (2016). MapReduce program synthesis. In PLDI 2016, pp. 326–340.
Solar-Lezama et al. (2006). Combinatorial sketching for finite programs. In ASPLOS 2006, pp. 404–415.
Solar-Lezama (2013). Program sketching. STTT 15 (5-6), pp. 475–495.
Udupa et al. (2013). TRANSIT: specifying protocols with concolic snippets. In PLDI 2013, pp. 287–296.
van Rijsbergen (1979). Information retrieval.
Wang et al. (2017). Synthesis of data completion scripts using finite tree automata. PACMPL 1 (OOPSLA), pp. 62:1–62:26.
Wang et al. (2018). Program synthesis using abstraction refinement. PACMPL 2 (POPL), pp. 63:1–63:30.
Ye et al. (2019). Sketch-driven regular expression generation from natural language and examples. CoRR abs/1908.05848.
Appendix A Supported Operations in TFCoder
Below is the list of 134 operations currently supported by TFCoder. We did not cherry-pick the operations to support; in fact, out of the 134 supported operations, only 59 are used in TFCoder’s solutions to our benchmark tasks.
Appendix B Domain-Specific Details
Here we describe a few techniques in TFCoder taking advantage of the TensorFlow domain, that may or may not be useful in other similar domains. These techniques are excluded from Algorithm 1 for simplicity.
We impose limits on the sizes of values (e.g., number of elements in a tensor) encountered during search. This is done to avoid excessive memory usage through the creation of huge tensors. These limits are enforced during operation filtering, e.g., do not call tf.ones(shape) on the argument tf.range(1, 20), as that would cause an out-of-memory error. The limits are also checked after new values are created as a blanket safeguard against memory issues, and values that are too large are immediately discarded. In our experiments, we allow tensors to have a maximum of 1000 elements, 4 dimensions, and 100 elements along a single dimension. These limits are chosen to admit the largest tensors that we expect average users to require.
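A minimal sketch of such a size check is shown below, operating on plain shape tuples rather than real tensors. The constant and function names are hypothetical; only the three numeric limits come from the text.

```python
import math

MAX_ELEMENTS = 1000    # total number of elements
MAX_DIMS = 4           # number of dimensions
MAX_DIM_LENGTH = 100   # elements along any single dimension


def within_limits(shape):
    """Return True if a tensor of this shape respects all three limits."""
    return (
        len(shape) <= MAX_DIMS
        and all(dim <= MAX_DIM_LENGTH for dim in shape)
        and math.prod(shape) <= MAX_ELEMENTS
    )


# The shape requested by tf.ones(tf.range(1, 20)) would be (1, 2, ..., 19),
# so a filter can reject it before any tensor is allocated.
requested_shape = tuple(range(1, 20))
allowed = within_limits(requested_shape)
```

Checking the requested shape before calling the operation corresponds to the filtering step, while applying the same predicate to the shape of every newly created value corresponds to the blanket safeguard.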
Many tasks require a tf.cast operation as the final step. Instead of waiting for the tf.cast operation to be applied through the search, TFCoder opportunistically casts newly generated values to the target output’s data type if the new value matches the output’s shape but not its data type. If the cast value does not match the output, it is discarded. This step takes negligible time since it is applied to few values, but it drastically reduces the synthesis time for tasks that require a tf.cast as the final operation. Note that the tf.cast operation is still treated normally within the value search, which is necessary to produce and store cast values to be used as arguments to other operations later in the search.
A SparseTensor is a special kind of tensor object in TensorFlow that represents large sparse tensors in a memory-efficient way. TensorFlow’s tf.sparse submodule is dedicated to manipulating SparseTensors; for example, the tf.add function does not support adding SparseTensors, and the tf.sparse.add function must be used instead. Because sparse operations may be confusing to users who are not familiar with SparseTensors, we prevent all tf.sparse.* operations from being used unless a SparseTensor is given as an input or output tensor, or the description includes the term “sparse”. This also reduces the search space for tasks that do not use SparseTensors.
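The gating rule for sparse operations can be sketched as a simple filter over operation names. The function names, the boolean flag standing in for "a SparseTensor appears among the inputs or outputs", and the case-insensitive substring match on the description are assumptions made for illustration.

```python
def sparse_ops_allowed(description, has_sparse_tensor):
    """Sparse operations are enabled by a SparseTensor or by the word "sparse"."""
    return has_sparse_tensor or "sparse" in description.lower()


def filter_operations(op_names, description, has_sparse_tensor):
    """Drop tf.sparse.* operations unless the task appears to involve sparsity."""
    if sparse_ops_allowed(description, has_sparse_tensor):
        return list(op_names)
    return [name for name in op_names if not name.startswith("tf.sparse.")]


candidate_ops = ["tf.add", "tf.sparse.add", "tf.sparse.to_dense", "tf.reduce_sum"]
dense_task_ops = filter_operations(candidate_ops, "add two tensors", has_sparse_tensor=False)
sparse_task_ops = filter_operations(candidate_ops, "add two Sparse tensors", has_sparse_tensor=False)
```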
Appendix C Handling Multiple I/O Examples
To handle multiple input/output examples, we simply need to extend the notion of a “value” in our value search.
In the single-example case, a “value” represents one code expression and contains the result of running that code expression using the example’s inputs. In the multi-example case, a “super-value” still represents one code expression, but it contains the results of running that code expression on inputs from each example.
For equivalence-based pruning (line 20 of Algorithm 1), two super-values are considered equal if all pairs of contained results are equal. For operation filtering (lines 15 and 17), a super-value is permitted by a filter if all of its contained results pass the filter. A solution is found (line 24) when the super-value’s contained results all match the examples’ outputs.
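The three rules above can be sketched with a small data class. This is an illustration using plain Python integers in place of tensors; the class name, method names, and the toy expressions are hypothetical and do not reproduce TFCoder's internal representation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SuperValue:
    expression: str   # the code expression this value represents
    results: tuple    # one result per input/output example

    def equivalent_to(self, other):
        """Equal only if the results agree on every example."""
        return self.results == other.results

    def passes(self, predicate):
        """Permitted by a filter only if every contained result passes."""
        return all(predicate(result) for result in self.results)

    def solves(self, outputs):
        """A solution must match the expected output of every example."""
        return self.results == tuple(outputs)


def prune(candidates):
    """Keep the first expression seen for each distinct tuple of results."""
    seen = {}
    for candidate in candidates:
        seen.setdefault(candidate.results, candidate)
    return list(seen.values())


example_inputs = [(1, 2, 3), (4, 5, 6)]
example_outputs = [6, 15]

candidates = [
    SuperValue("sum(x)", tuple(sum(x) for x in example_inputs)),
    SuperValue("x[0] + x[1] + x[2]", tuple(x[0] + x[1] + x[2] for x in example_inputs)),
    SuperValue("max(x) * 2", tuple(max(x) * 2 for x in example_inputs)),
]

unique = prune(candidates)
solutions = [c.expression for c in unique if c.solves(example_outputs)]
```

The expression `max(x) * 2` matches the first example but not the second, so it is rejected; with only one example it would have been indistinguishable from the intended sum.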
Appendix D Benchmark Creation
Here we walk through representative instances of our benchmark-creation process.
D.1. User Provides Good Example
This benchmark comes from the StackOverflow question in Figure 0(a). The user provides an input/output example: the input tensor [45, 58, 72, 33, 45, 58, 58, 33] should be transformed into the output tensor [0, 1, 2, 3, 0, 1, 1, 3]. The example has several desirable qualities:

There are no obvious patterns in the choice of numbers in the input tensor. In contrast, if the input tensor were instead [10, 20, 30, 40, 10, 20, 20, 40], one could incorrectly construct the output as (in1 / 10) - 1. In general, we observed that using “random-looking” numbers in the input tensor will significantly improve the quality of the example by eliminating coincidental patterns that are not actually relevant to the problem.

There are no obvious patterns in the arrangement of numbers in the input tensor, e.g., the duplicate elements are not all consecutive. This makes it clear that the intended solution must be general enough to handle non-consecutive duplicate elements.

The example tensors have sufficient length. Given only the example, the intended task would be much more ambiguous if the input tensor had, say, 4 elements instead of 8.

The example covers a variety of cases: there are elements appearing exactly 1, 2, and 3 times.
Hence, we consider this input/output example to be of high quality, and use it as-is in our benchmark without modification.
For the natural language description of this task, we use the sentence “Assign values between 0 and N - 1 for a vector with N different elements,” which is a slight simplification of the question’s title, “Assign values between 0 and N - 1 for a vector of length L with N different elements in Tensorflow.”
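The point about coincidental patterns can be checked with a few lines of plain Python. The sketch below is not the program synthesized by TFCoder; the function names are hypothetical, and the formula from the text is written with integer division so that it yields integers. The second output of TensorFlow's `tf.unique` has the same first-occurrence semantics as `first_occurrence_ids`.

```python
def first_occurrence_ids(values):
    """Label each element by the order in which its value first appears."""
    ids = {}
    return [ids.setdefault(value, len(ids)) for value in values]


def coincidental_rule(values):
    """A formula that fits only inputs that happen to be 10, 20, 30, ..."""
    return [value // 10 - 1 for value in values]


good_input = [45, 58, 72, 33, 45, 58, 58, 33]
patterned_input = [10, 20, 30, 40, 10, 20, 20, 40]
expected = [0, 1, 2, 3, 0, 1, 1, 3]

# Both rules explain the patterned example, so it cannot tell them apart.
ambiguous = coincidental_rule(patterned_input) == first_occurrence_ids(patterned_input)
# Only the intended rule explains the random-looking example.
disambiguated = coincidental_rule(good_input) != expected
```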
D.2. User Provides Ambiguous Example
This benchmark comes from another StackOverflow question, where the user wants to gather elements of in2 along axis 1, using indices from in1. The user provides the following example:
Unfortunately, considering the points from the previous example benchmark, this input/output example is not as good. The example only includes two “parts” (where each part is an element of in2 being indexed), and the same index is used in both parts. Furthermore, the example includes a coincidental pattern – the extracted elements of in2 are the maximum of each row. Thus, we modify the example and increase the sizes of the tensors to make the intended pattern more clear, while breaking other patterns:
We found that examples given in StackOverflow questions were often too small because they were intended to be interpreted by humans who also understand the question text. In contrast, examples created by actual TFCoder users are much more extensive.
We also wrote a single-sentence description of the task that one would plausibly provide to the tool: “how to gather element with index along axis 1,” where “how to gather element with index” is drawn verbatim from the question title, and “along axis 1” comes from the question body.
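To make the ambiguity concrete, the sketch below models a per-row gather in plain Python and compares it with the coincidental "row maximum" rule. The tensor shapes and all values here are assumptions made for illustration; they are not the user's example or the benchmark's modified example. In TensorFlow, one plausible expression for this kind of per-row gather is `tf.gather(in2, in1, axis=1, batch_dims=1)`, though that is not necessarily the solution TFCoder finds.

```python
def gather_axis1(params, indices):
    """Per-row gather: out[i][j] = params[i][indices[i][j]]."""
    return [[row[k] for k in idx] for row, idx in zip(params, indices)]


def row_max(params):
    """The coincidental rule: take the maximum of each row."""
    return [[max(row)] for row in params]


# Hypothetical small example: the same index is used in both rows, and it
# happens to select each row's maximum, so two explanations fit.
small_in2 = [[1, 9, 4], [2, 8, 3]]
small_in1 = [[1], [1]]
print(gather_axis1(small_in2, small_in1) == row_max(small_in2))

# Hypothetical larger example: varied indices, more rows, several picks per
# row. Only the gather explanation remains consistent.
large_in2 = [[11, 52, 37, 24], [63, 18, 45, 70], [29, 86, 14, 51]]
large_in1 = [[2, 0], [1, 3], [3, 3]]
print(gather_axis1(large_in2, large_in1))
print(gather_axis1(large_in2, large_in1) == row_max(large_in2))
```

Enlarging the example and varying the indices removes the "maximum of each row" reading without changing the intended task.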
D.3. User Provides No Example
In this StackOverflow question, the user clearly describes the desired behavior, but does not provide an input/output example: “Assume we have two TensorFlow tensors: input and weights. input is a tensor of images, say. So its shape is . weights is a simple list of scalar weights: . The aim is to scalar-multiply each image by its corresponding weight. How would one do that?”
For such questions without user-provided input/output examples, we create our own examples. We make sure that the examples are extensive enough to unambiguously specify the task and simple enough that a TFCoder user could plausibly have written the example. For this task, we use the following:
For this task we use the natural language description “scalar multiply images in a batch,” which is a short rephrasing of the question title, “Given a batch of images, how to scalar multiply each image by a different scalar in tensorflow.”
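The intended behavior can be sketched with nested Python lists. This is a minimal illustration under stated assumptions: each image is taken to be a small 2D grid (the actual image shape is not reproduced here), the helper name `scale_images` is hypothetical, and the values are made up. In TensorFlow, the same effect would typically rely on broadcasting after reshaping the weights to match the rank of the image tensor.

```python
def scale_images(images, weights):
    """Multiply every pixel of images[i] by the scalar weights[i]."""
    if len(images) != len(weights):
        raise ValueError("need exactly one weight per image")
    return [
        [[w * pixel for pixel in row] for row in image]
        for image, w in zip(images, weights)
    ]


# Hypothetical batch of three 2x2 images, with one weight per image.
images = [
    [[1, 2], [3, 4]],
    [[5, 6], [7, 8]],
    [[9, 10], [11, 12]],
]
weights = [2, 0.5, -1]

for scaled in scale_images(images, weights):
    print(scaled)
```

Using distinct weights and distinct pixel values, as here, helps an example show that the pairing is per image rather than per pixel or per row.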
Appendix E TFCoder’s Buggy Solution
In this task, the user wants to sum elements of in1, but partitioned into groups specified by in2 first. The user provides the following example, which we use as-is in our benchmark task:
In this example, the elements 5 and 10 of in1 are both in group 1 (specified by in2), so their sum, 15, is present in the corresponding positions in the output. Considering the format of in2 as provided by the user, we assume that it will only contain integers from 1 up to the number of distinct groups, inclusive.
TFCoder’s solution to this problem is:
This is very close to being a correct solution, but it does have a bug. The operation tf.math.unsorted_segment_sum(data, segment_ids, num_segments) is very useful here, taking care of grouping and summing, but it requires that num_segments be sufficiently large (but being too large will hinder efficiency). For this particular I/O example, setting num_segments=tf.reduce_sum(in1) happens to be large enough so the solution works in this case, but this is not true in general (e.g., if in1 were entirely negative). A bug-free solution would use tf.reduce_max(in2) + 1 instead:
Although TFCoder’s solution was not perfect, it was nearly so, such that a human user reviewing the solution (while looking at TensorFlow documentation if needed) could identify the bug and write a fix.
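The role of num_segments can be illustrated with a toy pure-Python model of the segment sum. This is a simplified sketch, not TensorFlow's implementation: the model rejects any segment id outside [0, num_segments), whereas the real operation's handling of invalid ids may differ. The helper `grouped_sum` and the data are hypothetical, chosen only to be consistent with the description above (5 and 10 both in group 1).

```python
def unsorted_segment_sum(data, segment_ids, num_segments):
    """Toy model: sums[s] is the sum of data[i] over all i with segment_ids[i] == s."""
    if num_segments < 0:
        raise ValueError("num_segments must be non-negative")
    sums = [0] * num_segments
    for value, seg in zip(data, segment_ids):
        # Simplification: treat every out-of-range id as an error.
        if not 0 <= seg < num_segments:
            raise IndexError(f"segment id {seg} needs more than {num_segments} segments")
        sums[seg] += value
    return sums


def grouped_sum(in1, in2, num_segments):
    """Replace each element of in1 by the total of its group (groups given by in2)."""
    sums = unsorted_segment_sum(in1, in2, num_segments)
    return [sums[group] for group in in2]


in1 = [5, 7, -12, 10, 20]  # hypothetical data
in2 = [1, 2, 3, 1, 2]

# Buggy choice: sum(in1) is 30 here, which happens to be large enough.
print(grouped_sum(in1, in2, sum(in1)))
# Robust choice: one more than the largest group id.
print(grouped_sum(in1, in2, max(in2) + 1))

# With entirely negative data, the buggy choice yields a negative num_segments.
negative_in1 = [-5, -7, -12, -10, -20]
try:
    grouped_sum(negative_in1, in2, sum(negative_in1))
except ValueError as err:
    print("buggy choice fails:", err)
print(grouped_sum(negative_in1, in2, max(in2) + 1))
```

The buggy and robust choices agree on the first input, which is why the example alone cannot expose the bug; only an input whose sum falls below the largest group id reveals it.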