
# Syntax-Directed Variational Autoencoder for Structured Data

## Abstract

Deep generative models have been enjoying success in modeling continuous data. However, it remains challenging to capture representations for discrete structures with formal grammars and semantics, \eg, computer programs and molecular structures. How to generate both syntactically and semantically correct data still remains largely an open problem. Inspired by compiler theory, where syntax and semantics checks are done via syntax-directed translation (SDT), we propose a novel syntax-directed variational autoencoder (SD-VAE) by introducing stochastic lazy attributes. This approach converts the offline SDT check into on-the-fly generated guidance for constraining the decoder. Compared to state-of-the-art methods, our approach enforces constraints on the output space so that the output will be not only syntactically valid, but also semantically reasonable. We evaluate the proposed model with applications to programming languages and molecules, including reconstruction and program/molecule optimization. The results demonstrate the effectiveness of incorporating syntactic and semantic constraints in discrete generative models, which significantly outperform current state-of-the-art approaches.


## 1 Introduction

Recent advances in deep representation learning have resulted in powerful probabilistic generative models which have demonstrated their ability to model continuous data, \eg, time series signals \citepOorDieZenSimetal16,DaiDaiZhaetal17 and images \citepRadMetChi15,KarAilLaietal17. Despite the success in these domains, it is still challenging to correctly generate discrete structured data, such as graphs, molecules and computer programs. Since many of these structures have syntactic and semantic formalisms, generative models without explicit constraints often produce invalid outputs.

Conceptually, a generative-model approach for structured data can be divided into two parts: one is the formalization of the structure generation, and the other is a (usually deep) generative model producing parameters for the stochastic process in that formalization. Often the hope is that, with the help of training samples and the capacity of deep models, the loss function will prefer the valid patterns and automatically push the mass of the generative model's distribution towards the desired region.

Arguably the simplest structured data are sequences, whose generation with deep models has been well studied under the seq2seq \citepSutVinLe14 framework, which models the generation of a sequence as a series of token choices parameterized by recurrent neural networks (RNNs). Its widespread success has encouraged several pioneering works that convert more complex structured data into sequences and apply sequence models to the resulting representations. \citetGomDuvHer16 (CVAE) is a representative work of this paradigm for chemical molecule generation, using the SMILES line notation \citepWeininger88 for representing molecules. However, because no formalization of syntax and semantics restricts the particular structured data, underfitted general-purpose string generative models will often produce invalid outputs. Therefore, to obtain a reasonable model via such a training procedure, we need to prepare a large number of valid combinations of the structures, which is time-consuming or even impractical in domains like drug discovery.

To tackle such a challenge, one approach is to incorporate the structure restrictions explicitly into the generative model. For reasons of computational cost and model generality, context-free grammars (CFG) have been taken into account in the decoder parametrization. For instance, in molecule generation tasks, \citetKusPaiHer17 proposes a grammar variational autoencoder (GVAE) in which the CFG of the SMILES notation is incorporated into the decoder. The model generates parse trees directly in a top-down direction, by repeatedly expanding any nonterminal with its production rules. Although the CFG provides a mechanism for generating syntactically valid objects, it is still incapable of regularizing the model to generate semantically valid objects \citepKusPaiHer17. For example, in molecule generation, the semantics of the SMILES language requires that generated rings must be closed; in program generation, a referenced variable should be defined in advance, and each variable can be defined exactly once in each local context (illustrated in Fig 3). All these examples require cross-serial-like dependencies which are not enforceable by a CFG, implying that constraints beyond CFG are needed to achieve semantically valid production in a VAE.

In compiler theory, attribute grammars, or syntax-directed definitions, have been proposed for attaching semantics to a parse tree generated by a context-free grammar. Thus one straightforward but impractical application of attribute grammars is, after generating a syntactically valid molecule candidate, to conduct offline semantic checking. This process needs to be repeated until a semantically valid candidate is discovered, which is at best computationally inefficient and at worst infeasible, due to an extremely low rate of passing the check. As a remedy, we propose the syntax-directed variational autoencoder (SD-VAE), in which the semantic restriction component is moved forward to the syntax-tree generation stage. This equips the generator with both syntactic and semantic validation. The proposed syntax-directed generative mechanism in the decoder further constrains the output space to ensure semantic correctness during tree generation. The relationships between our proposed model and previous models are characterized in Figure 3.

Our method brings the theory of formal languages into stochastic generative modeling. The contributions of our paper can be summarized as follows:

• Syntax and semantics enforcement: We propose a new formalization of semantics that systematically converts the offline semantic check into online guidance for stochastic generation using the proposed stochastic lazy attribute. This allows us to effectively address both syntax and semantic constraints.

• Efficient learning and inference: Our approach has $O(n)$ computational cost, where $n$ is the length of the structured data. This is the same as existing methods like CVAE and GVAE, which do not enforce semantics in generation. During inference, SD-VAE runs with on-the-fly semantic guidance, while the existing alternatives generate many candidates for semantic checking.

• Strong empirical performance: We demonstrate the effectiveness of the SD-VAE through applications in two domains, namely (1) the subset of Python programs and (2) molecules. Our approach consistently and significantly improves the results in evaluations including generation, reconstruction and optimization.

## 2 Background

Before introducing our model and the learning algorithm, we first provide some background knowledge which is important for understanding the proposed method.

### 2.1 Variational Autoencoder

The variational autoencoder \citepKinWel13,RezMohWie14 provides a framework for learning a probabilistic generative model together with its posterior, respectively known as decoder and encoder. We denote the observation as $x$, which is the structured data in our case, and the latent variable as $z$. The decoder models the probabilistic generative process of $x$ given the continuous representation $z$ through the likelihood $p_\theta(x|z)$ and the prior $p(z)$ over the latent variables, where $\theta$ denotes the parameters. The encoder approximates the posterior $p_\theta(z|x)$ with a model $q_\psi(z|x)$ parametrized by $\psi$. The decoder and encoder are learned simultaneously by maximizing the evidence lower bound (ELBO) of the marginal likelihood, \ie,

$$\mathcal{L}(X;\theta,\psi) := \sum_{x\in X}\mathbb{E}_{q_\psi(z|x)}\big[\log p_\theta(x|z)p(z)-\log q_\psi(z|x)\big] \;\le\; \sum_{x\in X}\log\int p_\theta(x|z)p(z)\,dz, \qquad (1)$$

where $X$ denotes the training dataset containing the observations.
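As a concrete numerical illustration of the objective in Eq (1), the sketch below computes a single-sample ELBO estimate for a diagonal Gaussian posterior against a standard normal prior. All values here are hypothetical placeholders, not the model's actual parametrization; the decoder term is a stand-in for scoring a reconstructed structure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical encoder output for one observation x: a diagonal Gaussian q(z|x).
mu = np.array([0.5, -0.3])
log_var = np.array([-1.0, -0.5])

# Reparameterized sample z = mu + sigma * eps (single Monte Carlo sample).
eps = rng.standard_normal(mu.shape)
z = mu + np.exp(0.5 * log_var) * eps

# Stand-in for the decoder term log p_theta(x|z); a real model would score
# the reconstructed structure here.
log_px_given_z = -0.5 * float(np.sum(z ** 2))

# Closed-form KL(q(z|x) || N(0, I)) for diagonal Gaussians.
kl = 0.5 * float(np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var))

# Single-sample estimate of the ELBO in Eq (1).
elbo = log_px_given_z - kl
assert kl >= 0.0
```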

### 2.2 Context Free Grammar and Attribute Grammar

Context free grammar   A context free grammar (CFG) is defined as $G = (V, \Sigma, R, s)$, where the symbols are divided into $V$, the set of non-terminal symbols, $\Sigma$, the set of terminal symbols, and $s$, the start symbol. Here $R$ is the set of production rules. Each production rule is denoted as $\alpha \to \beta$, where $\alpha \in V$ is a nonterminal symbol and $\beta$ is a sequence of terminal and/or nonterminal symbols.

Attribute grammar   To enrich the CFG with "semantic meaning", \citetKnuth68 formalized attribute grammars, which introduce attributes and rules into the CFG. An attribute is an attachment to the corresponding nonterminal symbol in the CFG, written in the format ⟨v⟩.a for $v \in V$. There can be two types of attributes assigned to non-terminals in $V$: inherited attributes and synthesized attributes. An inherited attribute depends on the attributes of its parent and siblings, while a synthesized attribute is computed based on the attributes of its children. Thus, for each production $\alpha \to \beta$, every symbol involved carries a set of inherited attributes and a set of synthesized attributes.

#### A motivational example

Here we exemplify how the attribute grammar defined above enriches a CFG with non-context-free semantics. We use the following toy grammar, a subset of SMILES that generates either a chain or a cycle with three carbons:

| Production | Semantic Rule |
| --- | --- |
| ⟨s⟩ ::= ⟨atom⟩_1 'C' ⟨atom⟩_2 | ⟨s⟩.matched ← ⟨atom⟩_1.set ∩ ⟨atom⟩_2.set; ⟨s⟩.ok ← (⟨atom⟩_1.set = ⟨atom⟩_2.set) |
| ⟨atom⟩ ::= 'C' \| 'C' ⟨bond⟩ ⟨digit⟩ | ⟨atom⟩.set ← ∅ \| {concat(⟨bond⟩.val, ⟨digit⟩.val)} |
| ⟨bond⟩ ::= '-' \| '=' \| '#' | ⟨bond⟩.val ← '-' \| '=' \| '#' |
| ⟨digit⟩ ::= '1' \| '2' \| … \| '9' | ⟨digit⟩.val ← '1' \| '2' \| … \| '9' |

where we show the production rules of the CFG on the left and the calculation of attributes in the attribute grammar on the right. Here we leverage the attribute grammar to check (via the attribute matched) whether the ringbonds come in pairs: a ringbond generated at ⟨atom⟩_1 should match the bond type and bond index generated at ⟨atom⟩_2, and the semantic constraint expressed by ⟨s⟩.ok requires that there be no difference between the set attributes of ⟨atom⟩_1 and ⟨atom⟩_2. Such a constraint in SMILES is known as cross-serial dependencies (CSD) \citepBreKapPetZae82, which are non-context-free \citepShieber85. See Appendix A.3 for more explanations. Figure 4 illustrates the process of performing syntax and semantics checks in compilers. Here all the attributes are synthesized, \ie, calculated in a bottom-up direction.
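The matched/ok check above can be made concrete in a few lines of code. This is a toy sketch of the synthesized set attribute and the ⟨s⟩.ok rule, not the paper's actual implementation; bond types are omitted for brevity.

```python
def atom_ringbond_set(atom_token):
    """Synthesized 'set' attribute of an <atom> node: the ringbond indices
    it opens or closes, e.g. 'C1' -> {'1'}, 'C' -> set()."""
    return {ch for ch in atom_token if ch.isdigit()}

def s_ok(atom1_token, atom2_token):
    """Toy version of the <s>.ok rule: the two synthesized sets must agree."""
    return atom_ringbond_set(atom1_token) == atom_ringbond_set(atom2_token)

# 'C1CC1': both <atom> nodes carry ringbond index 1, a valid three-carbon cycle.
assert s_ok("C1", "C1")
# 'C1CC2': indices disagree; accepted by the CFG but semantically invalid.
assert not s_ok("C1", "C2")
```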

So generally, in the semantic correctness checking procedure, one needs to perform a bottom-up pass to calculate the attributes after the parse tree is generated. However, in a top-down structure generating process, the parse tree is not ready for semantic checking, since the synthesized attributes of each node require information from its children nodes, which have not been generated yet. Due to this dilemma, it is nontrivial to use the attribute grammar to guide the top-down generation of tree-structured data. One straightforward way is an acceptance-rejection sampling scheme, \ie, using the decoder of CVAE or GVAE as the proposal and the semantic check as the acceptance criterion. Obviously, since the decoder includes no semantic guidance, the proposal distribution may frequently propose semantically invalid candidates, wasting computation in vain.
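The acceptance-rejection scheme just described might be sketched as follows, with a toy proposal list and a toy ringbond-pairing check standing in for a trained CVAE/GVAE decoder and a full semantic analyzer (all names are illustrative):

```python
import random

random.seed(0)

def propose():
    """Stand-in for an unconstrained CVAE/GVAE decoder: syntactically
    plausible candidates, no semantic guidance."""
    return random.choice(["C1CC1", "C1CC2", "C1CC", "CCC"])

def semantically_valid(smiles):
    # Toy check from Section 2.2.1: ringbond digits must come in pairs.
    digits = [ch for ch in smiles if ch.isdigit()]
    return all(digits.count(d) % 2 == 0 for d in set(digits))

# Offline acceptance-rejection: resample until the semantic check passes.
attempts = 0
candidate = None
while candidate is None or not semantically_valid(candidate):
    candidate = propose()
    attempts += 1
# A weak proposal wastes most attempts, which is exactly the inefficiency
# the syntax-directed decoder is designed to avoid.
```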

## 3 Syntax-Directed Variational Autoencoder

As described in Section 2.2.1, directly using the attribute grammar in an offline fashion (\ie, after the generation process finishes) is inefficient for addressing both syntax and semantics constraints. In this section we describe how to bring the attribute grammar online and incorporate it into the VAE, so that our VAE addresses both syntactic and semantic constraints. We name the proposed method the Syntax-Directed Variational Autoencoder (SD-VAE).

### 3.1 Stochastic Syntax-Directed Decoder

By scrutinizing the tree generation, we find that the major difficulty in incorporating the attribute grammar into the process is the presence of synthesized attributes. For instance, when expanding the start symbol ⟨s⟩, none of its children has been generated yet. Thus their attributes are also absent at this time, making the synthesized attribute ⟨s⟩.matched impossible to compute. To enable on-the-fly computation of the synthesized attributes for semantic validation during tree generation, besides the two existing types of attributes, we introduce stochastic lazy attributes to enlarge the attribute grammar. Such stochasticity transforms the corresponding synthesized attribute into inherited constraints in the generative procedure, and a lazy linking mechanism sets the actual value of the attribute once all the other dependent attributes are ready. We demonstrate how the decoder with stochastic lazy attributes generates semantically valid output through the same pedagogical example as in Section 2.2.1. Figure 5 visually demonstrates this process.

The tree generation procedure is in fact sampling from the decoder $p_\theta(x|z)$, which can be decomposed into several steps elaborated below:

i) stochastic predetermination: in Figure 5(a), we start from the node ⟨s⟩, whose synthesized attributes determine the index and bond type of the ringbond that will be matched at the ⟨atom⟩ nodes. Since we know nothing about the children nodes at this point, the only thing we can do is to 'guess' a value. That is to say, we associate a stochastic attribute ⟨s⟩.sa sampled from a Bernoulli distribution as a predetermination standing in for the absent synthesized attribute, where two is the maximum possible cardinality of the corresponding attribute. In the above example, sa = 0 indicates no ringbond and sa = 1 indicates one ringbond at both ⟨atom⟩_1 and ⟨atom⟩_2, respectively.

ii) constraints as inherited attributes: we pass ⟨s⟩.sa down as an inherited constraint to the children of node ⟨s⟩, \ie, ⟨atom⟩_1 and ⟨atom⟩_2, to ensure semantic validity during tree generation. For example, in Figure 5(b) the constraint sa = 1 is passed down to ⟨atom⟩_1.

iii) sampling under constraints: without loss of generality, we assume ⟨atom⟩_1 is generated before ⟨atom⟩_2. We then sample the rules for expanding ⟨atom⟩_1, and so on and so forth to generate the subtree recursively. Since the sampling distribution is carefully designed to condition on the stochastic attribute, the inherited constraints will eventually be satisfied. In the example, due to sa = 1, when expanding ⟨atom⟩_1 the sampling distribution only has positive mass on the rule that opens a ringbond, \ie, ⟨atom⟩ ::= 'C' ⟨bond⟩ ⟨digit⟩.

iv) lazy linking: once we complete the generation of the subtree rooted at ⟨atom⟩_1, its synthesized set attribute becomes available. According to the semantic rule for ⟨s⟩, we can now instantiate ⟨s⟩.matched. This linking is shown in Figure 5(d)(e). When expanding ⟨atom⟩_2, ⟨s⟩.matched is passed down as an inherited attribute to regulate the generation of ⟨atom⟩_2, as demonstrated in Figure 5(f)(g).

In summary, the general syntax tree can be constructed step by step, within the language covered by the grammar. In the beginning, the partial tree contains only the start symbol. At step $t$, we choose a nonterminal node in the frontier (the set of nonterminal leaves) of the partially generated tree to expand. The generative process in each step can be described as:

1. Pick a node $v$ in the frontier whose required attributes are either already satisfied, or are stochastic attributes that should first be sampled from the Bernoulli distribution $\mathcal{B}_\theta$;

2. Sample a production rule $r_t \in R$ for $v$ according to the learned distribution, restricted to the rules consistent with the constraints, \ie, expand the nonterminal with production rules defined in the CFG;

3. Grow the tree by attaching the right-hand side of $r_t$ to $v$; the node $v$ now has children represented by the symbols of that right-hand side.

The above process continues until all the nodes in the frontier of the tree are terminals, say after $T$ steps. We then obtain Algorithm 1 for sampling structures that are both syntactically and semantically valid.
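The four mechanisms above (stochastic predetermination, inherited constraints, constrained sampling, lazy linking) can be sketched for the toy ringbond grammar. This is a hand-written illustration, not the learned decoder; the uniform random choices stand in for the learned rule distribution.

```python
import random

random.seed(0)

def expand_atom(constraint):
    """Expand an <atom> node under an inherited constraint: None (no
    ringbond), 'open' (must open one), or a (bond, digit) pair fixed by
    lazy linking. Uniform choices stand in for the learned distribution."""
    if constraint is None:
        return "C", None
    if constraint == "open":
        bond, digit = random.choice("-=#"), random.choice("123456789")
    else:
        bond, digit = constraint
    return "C" + bond + digit, (bond, digit)

def decode():
    # i) stochastic predetermination: sample the lazy attribute sa.
    sa = random.random() < 0.5
    if not sa:  # no ringbond: both <atom> nodes expand to a plain 'C'
        return expand_atom(None)[0] + "C" + expand_atom(None)[0]
    # ii)+iii) sa is passed down as an inherited constraint to <atom>_1 ...
    left, opened = expand_atom("open")
    # iv) ... and lazy linking fixes <atom>_2 to the synthesized (bond, digit).
    right, _ = expand_atom(opened)
    return left + "C" + right

samples = [decode() for _ in range(20)]
# Every sample is semantically valid: ringbond digits always come in pairs.
```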

In fact, in the model training phase, we need to compute the likelihood $p_\theta(x|z)$ given $x$ and $z$. The probability computation procedure is similar to the sampling procedure in the sense that both require tree generation. The only difference is that in the likelihood computation the tree structure, \ie, the computing path, is fixed since $x$ is given, while in the sampling procedure it is sampled following the learned model. Specifically, the generative likelihood can be written as:

$$p_\theta(x|z)=\prod_{t=0}^{T}p_\theta\big(r_t\,\big|\,\mathrm{ctx}^{(t)},\mathrm{node}^{(t)},\mathcal{T}^{(t)}\big)\,\mathcal{B}_\theta\big(sa_t\,\big|\,\mathrm{node}^{(t)},\mathcal{T}^{(t)}\big) \qquad (2)$$

where $\mathrm{node}^{(t)}$ denotes the node expanded at step $t$ and $\mathrm{ctx}^{(t)}$ encodes the generation history up to step $t$ via an RNN. Here the RNN can be a commonly used LSTM, \etc.

### 3.2 Structure-Based Encoder

As introduced in Section 2, the encoder approximates the posterior of the latent variable through the model $q_\psi(z|x)$, a parametrized function with parameters $\psi$. Since the structure of the observation plays an important role, the encoder parametrization should take such information into account. Recently developed deep learning models \citepDuvMacIpaBometal15,DaiDaiSon16,LeiJinRegJaa17 provide powerful candidate encoders. However, to demonstrate the benefits of the proposed syntax-directed decoder in incorporating the attribute grammar for semantic restrictions, we use the same encoder as \citetKusPaiHer17 for a fair comparison later.

We provide a brief introduction to the particular encoder model used in \citetKusPaiHer17 to keep the paper self-contained. Given a program or a SMILES sequence, we obtain the corresponding parse tree using the CFG and decompose it into a sequence of productions through a pre-order traversal of the tree. Then, we convert these productions into one-hot indicator vectors, in which each dimension corresponds to one production in the grammar. The encoder is a deep convolutional neural network that maps this sequence of one-hot vectors to a continuous vector.
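The production-sequence one-hot encoding might look as follows. The rule inventory, sequence length, and dimensions are illustrative stand-ins, not the actual SMILES grammar used in the experiments.

```python
import numpy as np

# Toy rule inventory (illustrative, not the full SMILES grammar).
RULES = [
    "s -> atom 'C' atom",
    "atom -> 'C'",
    "atom -> 'C' bond digit",
    "bond -> '-'",
    "digit -> '1'",
]
RULE_INDEX = {r: i for i, r in enumerate(RULES)}

def productions_to_one_hot(productions, max_len=8):
    """Map a pre-order production sequence to a (max_len, n_rules) one-hot
    matrix, zero-padded at the end; a CNN encoder consumes this matrix."""
    mat = np.zeros((max_len, len(RULES)), dtype=np.float32)
    for t, rule in enumerate(productions):
        mat[t, RULE_INDEX[rule]] = 1.0
    return mat

# Pre-order traversal of the parse tree of the chain 'CCC' in the toy grammar:
seq = ["s -> atom 'C' atom", "atom -> 'C'", "atom -> 'C'"]
x = productions_to_one_hot(seq)
assert x.shape == (8, 5) and x.sum() == 3.0
```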

### 3.3 Model Learning

Our learning goal is to maximize the evidence lower bound in Eq 1. Given the encoder, we can map the structured input $x$ into the latent space. The variational posterior $q(z|x)$ is parameterized as a Gaussian distribution, whose mean and variance are the outputs of the corresponding neural networks. The prior of the latent variable is $p(z)=\mathcal{N}(0,I)$. Since both the prior and posterior are Gaussian, we use the closed form of the KL-divergence proposed in \citetKinWel13. In the decoding stage, our goal is to maximize the conditional likelihood, which we compute using Equation (2). During training, the syntax and semantics constraints required in Algorithm 1 can be precomputed. In practice, we observe no significant time penalty measured in wall clock time compared to previous works.

## 4 Related work

Generative models for discrete structured data have raised increasing interest among researchers in different domains. The classical sequence-to-sequence model \citepSutVinLe14 and its variations have also been applied to molecules \citepGomDuvHer16. Since the model is quite flexible, it is hard to generate valid structures with limited data, though \citetJanWesPaietal18 shows that an extra validator model can help to some degree. Techniques including data augmentation \citepBjerrum17, active learning \citepJanWesJos17 and reinforcement learning \citepGuiSanFar17 have also been proposed to tackle this issue. However, according to the empirical evaluations of \citetBenhenda17, the validity is still not satisfactory. Even when validity is enforced, the models tend to overfit to simple structures while neglecting diversity.

Since structured data often come with formal grammars, it is very helpful to generate the parse tree derived from the CFG, instead of generating the sequence of tokens directly. The Grammar VAE \citepKusPaiHer17 introduced a CFG-constrained decoder for simple math expression and SMILES string generation. The rules are used to mask out invalid syntax, so that the generated sequence always belongs to the language defined by the CFG. \citetParMohSinLietal16 uses a Recursive-Reverse-Recursive Neural Network (R3NN) to capture global context information while expanding with CFG production rules. Although these works follow the syntax via a CFG, the context-sensitive information can only be captured using variants of sequence/tree RNNs \citepAlvJaa16,DonLap16,ZhaLuLap15, which may not be time- and sample-efficient.

In our work, we capture the semantics with the proposed stochastic lazy attributes when generating structured outputs. By addressing the most common semantics, we can greatly reshape the output domain of the decoder \citepHuMaLiuHovetal16. As a result, we also obtain a better generative model for discrete structures.

## 5 Experiments

Code is available at https://github.com/Hanjun-Dai/sdvae.

We show the effectiveness of our proposed SD-VAE with applications in two domains, namely programs and molecules. We compare our method with CVAE \citepGomDuvHer16 and GVAE \citepKusPaiHer17. CVAE only takes character sequence information, while GVAE utilizes the context-free grammar. To make a fair comparison, we closely follow the experimental protocols that were set up in \citetKusPaiHer17. The training details are included in Appendix B.

Our method achieves significantly better results than previous works. It yields better reconstruction accuracy and prior validity by large margins, while also having comparable diversity of generated structures. More importantly, SD-VAE finds better solutions in program and molecule regression and optimization tasks. This demonstrates that the continuous latent space obtained by SD-VAE is also smoother and more discriminative.

### 5.1 Settings

Here we first describe our datasets in detail. The programs are represented as lists of statements. Each statement is an atomic arithmetic operation on variables (labeled as v0, v1, …, v9) and/or immediate numbers. Some examples are listed below:

v3=sin(v0);v8=exp(2);v9=v3-v8;v5=v0*v9;return:v5

v2=exp(v0);v7=v2*v0;v9=cos(v7);v8=cos(v9);return:v8

Here v0 is always the input, and the variable specified by return (respectively v5 and v8 in the examples) is the output, so each program actually represents a univariate function. Note that a correct program should, besides the context-free grammar specified in Appendix A.1, also respect the semantic constraints. For example, a variable should be defined before being referenced. We randomly generate programs, each consisting of a bounded number of valid statements, with the maximum number of decoding steps set accordingly. We hold 2000 programs out for testing and use the rest for training and validation.

For molecule experiments, we use the same dataset as \citetKusPaiHer17. It contains SMILES strings extracted from the ZINC database \citepGomDuvHer16. We use the same split as \citetKusPaiHer17, with a held-out set of SMILES strings for testing. Regarding the syntax constraints, we use the grammar specified in Appendix A.2, which is also the same as \citetKusPaiHer17, with the maximum number of decoding steps set accordingly.

For our SD-VAE, we address some of the most common semantics:

Program semantics   We address the following: a) variables should be defined before use, b) program must return a variable, c) number of statements should be less than 10.

Molecule semantics   The SMILES semantics we address include: a) ringbonds should satisfy cross-serial dependencies, b) the explicit valence of atoms should not go beyond what is permitted. For more details about the semantics of the SMILES language, please refer to Appendix A.3.
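The program-side checks a)-c) can be sketched as a simple static pass. This is a toy stand-in for illustration, not the attribute-grammar machinery that guides the decoder; the single-definition constraint from the introduction is included as well.

```python
import re

def check_program_semantics(src):
    """Toy version of checks (a)-(c): variables defined before use, a final
    return statement, and at most 10 statements; also each variable may be
    defined only once."""
    stmts = src.split(";")
    if len(stmts) > 10 or not stmts[-1].startswith("return:"):
        return False
    defined = {"v0"}                      # v0 is always the input
    for stmt in stmts[:-1]:
        if "=" not in stmt:
            return False
        lhs, rhs = stmt.split("=", 1)
        if lhs in defined:                # each variable defined exactly once
            return False
        if not set(re.findall(r"v\d", rhs)) <= defined:
            return False                  # referenced before definition
        defined.add(lhs)
    return stmts[-1].split(":", 1)[1] in defined

assert check_program_semantics("v3=sin(v0);v8=exp(2);v9=v3-v8;return:v9")
assert not check_program_semantics("v1=v2+v0;return:v1")  # v2 undefined
```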

### 5.2 Reconstruction Accuracy and Prior Validity

We use the held-out dataset to measure the reconstruction accuracy of the VAEs. For prior validity, we first sample latent representations from the prior distribution, and then evaluate how often the model decodes into a valid structure. Since both encoding and decoding are stochastic in VAEs, we follow the Monte Carlo method used in \citetKusPaiHer17 for estimation:

a) reconstruction: for each structured datum in the held-out dataset, we encode it 10 times, decode each encoded latent representation 25 times, and report the portion of decoded structures that are the same as the input ones;

b) validity of prior: we sample 1000 latent representations from the prior. For each of them we decode 100 times, and calculate the portion of the 100,000 decoded results that correspond to valid Program or SMILES sequences.

Program    We show in the left part of Table 1 that our model has a near perfect reconstruction rate and, most importantly, perfectly valid decoded programs from the prior. This huge improvement comes from our model utilizing the full semantics that previous work ignores, which in theory guarantees a perfectly valid prior and in practice enables a high reconstruction success rate. For a fair comparison, we run and tune the baselines on the training data and report the best result. In the same table we also report the reconstruction success rate grouped by the number of statements. Our model keeps a high rate even as the size of the program grows.

SMILES    Since the settings are exactly the same, we include the CVAE and GVAE results directly from \citetKusPaiHer17. We show in the right part of Table 1 that our model produces a much higher rate of successful reconstruction and a much higher ratio of valid prior. Figure 10 in Appendix C.2 also demonstrates some decoded molecules from our method. Note that the reported results do not take the semantics specific to aromaticity into account. If we use an alternative kekulized form of SMILES to train the model, the valid portion of the prior increases further.

### 5.3 Bayesian Optimization

One important application of VAEs is to enable the optimization (\eg, find new structures with better properties) of discrete structures in continuous latent space, and then use decoder to obtain the actual structures. Following the protocol used in \citetKusPaiHer17, we use Bayesian Optimization (BO) to search the programs and molecules with desired properties in latent space. Details about BO settings and parameters can be found in Appendix C.1.

Finding programs   In this application the models are asked to find the program which is most similar to the ground truth program. Here the distance is measured through the MSE (Mean Square Error) between program outputs, given 1000 different inputs v0 sampled evenly over the input range. In Figure 6 we show that our method finds programs closest to the ground truth, compared to CVAE and GVAE.

Molecules   Here we optimize the drug properties of molecules. In this problem, we ask the model to optimize the octanol-water partition coefficient (a.k.a. log P), an important measure of the drug-likeness of a given molecule. As \citetGomDuvHer16 suggests, for drug-likeness assessment log P is penalized by other properties, including the synthetic accessibility score \citepErtSch09. In Figure 7 we show the top-3 best molecules found by each method; our method found molecules with better scores than previous works. One can also see that the molecule structures found by SD-VAE are richer than those of the baselines, which mostly consist of chain structures.

### 5.4 Predictive performance of latent representation

VAEs also provide a way to do unsupervised feature representation learning \citepGomDuvHer16. In this section, we seek to know how well our latent space predicts the properties of programs and molecules. After training the VAEs, we dump the latent vector of each structured datum, and train a sparse Gaussian Process on the target value (namely the error for programs and the drug-likeness for molecules) for regression. We test the performance on the held-out test dataset. In Table 2, we report the results in Log Likelihood (LL) and Regression Mean Square Error (RMSE), which show that our SD-VAE always produces a latent space that is more discriminative than both the CVAE and GVAE baselines. This also shows that, with a properly designed decoder, the quality of the encoder is also improved via end-to-end training.

### 5.5 Diversity of generated molecules

Inspired by \citetBenhenda17, here we measure the diversity of generated molecules as an assessment of the methods. The intuition is that a good generative model should be able to generate diverse data and avoid mode collapse in the learned space. We conduct this experiment on the SMILES dataset. We first sample 100 points from the prior distribution. For each point, we associate it with a molecule: the most frequently occurring valid SMILES among its decodings (we use 50 decoding attempts, since decoding is stochastic). We then compute the pairwise similarity under one of several molecular similarity metrics, and report the mean and standard deviation in Table 3. We see that neither method suffers from mode collapse, and both produce similar diversity scores. This indicates that although our method has a more restricted decoding space than the baselines, diversity is not sacrificed, because we never rule out valid molecules. Moreover, a more compact decoding space leads to a much higher probability of obtaining valid molecules.
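A sketch of the diversity protocol follows, with character-bigram Jaccard similarity as a stand-in for a real molecular fingerprint metric, and a tiny hypothetical list of decoded molecules in place of the 100 sampled prior points:

```python
from itertools import combinations

def bigrams(s):
    return {s[i:i + 2] for i in range(len(s) - 1)}

def similarity(a, b):
    """Stand-in for a molecular similarity metric (e.g. fingerprint
    Tanimoto): Jaccard overlap of character bigrams of the SMILES strings."""
    x, y = bigrams(a), bigrams(b)
    return len(x & y) / len(x | y)

# Hypothetical decoded molecules, one per sampled prior point.
mols = ["C1CC1", "C1CCC1", "CCO", "CCN"]
scores = [similarity(a, b) for a, b in combinations(mols, 2)]
mean = sum(scores) / len(scores)
std = (sum((s - mean) ** 2 for s in scores) / len(scores)) ** 0.5
assert 0.0 <= mean <= 1.0
```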

### 5.6 Visualizing the Latent Space

We seek to visualize the latent space as an assessment of how well our generative model produces a coherent and smooth space of programs and molecules.

Program   Following \citetBowVilVinetal16, we visualize the latent space of programs by interpolating between two programs. More specifically, given two programs encoded to two points in the latent space, we pick 9 evenly interpolated points between them. For each point, we pick the most frequently decoded structure. In Table 4 we compare our results with previous works. Our SD-VAE can pass through points in the latent space that decode into valid programs without error, with visually smoother interpolation than previous works. Meanwhile, CVAE makes both syntactic and semantic errors, and GVAE produces only semantic errors (references to undefined variables), but still a considerable amount of them.

SMILES   For molecules, we visualize the latent space in 2 dimensions. We first embed a random molecule from the dataset into the latent space. Then we randomly generate 2 orthogonal unit vectors. To get the latent representations of the neighborhood, we interpolate over the 2-D grid and project back to the latent space with the pseudo-inverse of the projection. Finally, we show the decoded molecules. In Figure 8, we present two such grid visualizations. Subjectively, compared with the figures in \citetKusPaiHer17, our visualization is characterized by smooth differences between neighboring molecules and more complicated decoded structures.
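The grid construction might be sketched as follows; the latent dimension, seed point, and grid range are arbitrary placeholders rather than the experiment's actual settings:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                    # latent dimension (illustrative)
z0 = rng.standard_normal(d)              # embedding of the seed molecule

# Two random orthogonal unit vectors spanning the visualized 2-D slice.
W, _ = np.linalg.qr(rng.standard_normal((d, 2)))  # d x 2, orthonormal columns

# Map each 2-D grid offset back to latent space via the pseudo-inverse of
# the projection; each resulting point would then be decoded into a molecule.
grid = [np.array([i, j], dtype=float) for i in range(-2, 3) for j in range(-2, 3)]
latents = [z0 + np.linalg.pinv(W.T) @ g for g in grid]
assert len(latents) == 25 and latents[0].shape == (d,)
```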

## 6 Conclusion

In this paper we propose a new method to tackle the challenge of addressing both syntax and semantics constraints in generative models for structured data. The newly proposed stochastic lazy attribute presents a systematic conversion from offline syntax and semantics checking to online guidance for stochastic generation, and empirically shows consistent and significant improvements over previous models, while requiring similar computational cost. In future work, we would like to explore refinements of the formalization on more theoretical grounds, and investigate its application to a more diverse set of data modalities.

#### Acknowledgments

This project was supported in part by NSF IIS-1218749, NIH BIGDATA 1R01GM108341, NSF CAREER IIS-1350983, NSF IIS-1639792 EAGER, NSF CNS-1704701, ONR N00014-15-1-2340, NSF IIS-1546113, DBI-1355990, Intel ISTC, NVIDIA and Amazon AWS.


## Appendix A Grammar

### a.1 Grammar for Program Syntax

The syntax grammar for programs is a generative context-free grammar starting with ⟨program⟩:

⟨program⟩ ::= ⟨stat list⟩

⟨stat list⟩ ::= ⟨stat⟩ ';' ⟨stat list⟩ | ⟨stat⟩

⟨stat⟩ ::= ⟨assign⟩ | ⟨return⟩

⟨assign⟩ ::= ⟨lhs⟩ '=' ⟨rhs⟩

⟨return⟩ ::= 'return:' ⟨lhs⟩

⟨lhs⟩ ::= ⟨var⟩

⟨var⟩ ::= 'v' ⟨var id⟩

⟨var id⟩ ::= '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'

¡rhs¿ ¡expr¿

¡expr¿ ¡unary expr¿ — ¡binary expr¿

¡unary expr¿ ¡unary op¿ ¡operand¿ — ¡unary func¿ ‘(’ ¡operand¿ ‘)’

¡binary expr¿ ¡operand¿ ¡binary op¿ ¡operand¿

¡unary op¿ ‘+’ — ‘-’

¡unary func¿ ‘sin’ — ‘cos’ — ‘exp’

¡binary op¿ ‘+’ — ‘-’ — ‘*’ — ‘/’

¡operand¿ ¡var¿ — ¡immediate number¿

¡immediate number¿ ¡digit¿ ‘.’ ¡digit¿

¡digit¿ ‘0’ — ‘1’ — ‘2’ — ‘3’ — ‘4’ — ‘5’ — ‘6’ — ‘7’ — ‘8’ — ‘9’
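For illustration, the production rules above can be encoded as a table and sampled from directly (a hypothetical sketch; the rule table, helper names, and depth cap are ours and not part of the model):

```python
import random

# Production rules of the program grammar above; terminals are plain
# strings, nonterminals are keys of the table (names are illustrative).
GRAMMAR = {
    "program":   [["stat_list"]],
    "stat_list": [["stat", ";", "stat_list"], ["stat"]],
    "stat":      [["assign"], ["return"]],
    "assign":    [["lhs", "=", "rhs"]],
    "return":    [["return:", "lhs"]],
    "lhs":       [["var"]],
    "var":       [["v", "var_id"]],
    "var_id":    [[d] for d in "123456789"],
    "rhs":       [["expr"]],
    "expr":      [["unary_expr"], ["binary_expr"]],
    "unary_expr":  [["unary_op", "operand"],
                    ["unary_func", "(", "operand", ")"]],
    "binary_expr": [["operand", "binary_op", "operand"]],
    "unary_op":  [["+"], ["-"]],
    "unary_func": [["sin"], ["cos"], ["exp"]],
    "binary_op": [["+"], ["-"], ["*"], ["/"]],
    "operand":   [["var"], ["immediate_number"]],
    "immediate_number": [["digit", ".", "digit"]],
    "digit":     [[d] for d in "0123456789"],
}

def sample(symbol, rng, depth=0):
    """Expand a nonterminal; past a depth cap, prefer the shortest rule
    to curb the unbounded recursion in <stat list> and <expr>."""
    if symbol not in GRAMMAR:
        return symbol                       # terminal: emit as-is
    rules = GRAMMAR[symbol]
    if depth > 8:
        rules = [min(rules, key=len)]
    rule = rng.choice(rules)
    return "".join(sample(s, rng, depth + 1) for s in rule)

program = sample("program", random.Random(1))
```

Every string produced this way is syntactically valid by construction, which is exactly the property the grammar-based decoder exploits.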

### a.2 Grammar for Molecule Syntax

Our syntax grammar for molecules is based on the OpenSMILES standard, a context-free grammar starting with ⟨smiles⟩:

⟨smiles⟩ ::= ⟨chain⟩

⟨atom⟩ ::= ⟨bracket atom⟩ | ⟨aliphatic organic⟩ | ⟨aromatic organic⟩

⟨aliphatic organic⟩ ::= 'B' | 'C' | 'N' | 'O' | 'S' | 'P' | 'F' | 'I' | 'Cl' | 'Br'

⟨aromatic organic⟩ ::= 'c' | 'n' | 'o' | 's'

⟨bracket atom⟩ ::= '[' ⟨bracket atom (isotope)⟩ ']'

⟨bracket atom (isotope)⟩ ::= ⟨isotope⟩ ⟨symbol⟩ ⟨bracket atom (chiral)⟩ | ⟨symbol⟩ ⟨bracket atom (chiral)⟩ | ⟨isotope⟩ ⟨symbol⟩ | ⟨symbol⟩

⟨bracket atom (chiral)⟩ ::= ⟨chiral⟩ ⟨bracket atom (h count)⟩ | ⟨bracket atom (h count)⟩ | ⟨chiral⟩

⟨bracket atom (h count)⟩ ::= ⟨h count⟩ ⟨bracket atom (charge)⟩ | ⟨bracket atom (charge)⟩ | ⟨h count⟩

⟨bracket atom (charge)⟩ ::= ⟨charge⟩

⟨symbol⟩ ::= ⟨aliphatic organic⟩ | ⟨aromatic organic⟩

⟨isotope⟩ ::= ⟨digit⟩ | ⟨digit⟩ ⟨digit⟩ | ⟨digit⟩ ⟨digit⟩ ⟨digit⟩

⟨digit⟩ ::= '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8'

⟨chiral⟩ ::= '@' | '@@'

⟨h count⟩ ::= 'H' | 'H' ⟨digit⟩

⟨charge⟩ ::= '-' | '-' ⟨digit⟩ | '+' | '+' ⟨digit⟩

⟨bond⟩ ::= '-' | '=' | '#' | '/' | '\'

⟨ringbond⟩ ::= ⟨digit⟩

⟨branched atom⟩ ::= ⟨atom⟩ | ⟨atom⟩ ⟨branches⟩ | ⟨atom⟩ ⟨ringbonds⟩ | ⟨atom⟩ ⟨ringbonds⟩ ⟨branches⟩

⟨ringbonds⟩ ::= ⟨ringbonds⟩ ⟨ringbond⟩ | ⟨ringbond⟩

⟨branches⟩ ::= ⟨branches⟩ ⟨branch⟩ | ⟨branch⟩

⟨branch⟩ ::= '(' ⟨chain⟩ ')' | '(' ⟨bond⟩ ⟨chain⟩ ')'

⟨chain⟩ ::= ⟨branched atom⟩ | ⟨chain⟩ ⟨branched atom⟩ | ⟨chain⟩ ⟨bond⟩ ⟨branched atom⟩

### a.3 Examples of SMILES semantics

Here we provide more explanation of the semantic constraints contained in the SMILES language for molecules.

Specifically, the semantic constraints we address here are:

1. Ringbond matching: Ringbonds should come in pairs, and each pair of ringbonds has an associated index and bond type. What the SMILES semantics requires is exactly the well-known cross-serial dependencies (CSD) from formal language theory. CSD also appears in some natural languages, such as Dutch and Swiss-German. Another example of CSD is a sequence of multiple different types of parentheses, each of which is separately balanced while disregarding the others. See Figure 9 for an illustration.

2. Explicit valence control: Intuitively, this semantics requires that each atom cannot have too many bonds associated with it. For example, a normal carbon atom has a maximum valence of 4, which means associating a carbon atom with two triple bonds violates the semantics.
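The ringbond-matching constraint from item 1 can be sketched as a simple check (a simplified, hypothetical helper: each digit toggles its ring-bond index between open and closed, independently of the other indices; bracket atoms, two-digit '%' indices, and bond-type agreement are ignored):

```python
def ringbonds_matched(smiles):
    """Check that every ring-bond index is opened and closed in pairs.

    Each digit toggles its ring-bond index between open and closed,
    independently of the other indices -- the cross-serial-dependency
    pattern described above."""
    open_ids = set()
    for ch in smiles:
        if ch.isdigit():
            if ch in open_ids:
                open_ids.remove(ch)   # closing an open ring bond
            else:
                open_ids.add(ch)      # opening a new ring bond
    return not open_ids               # valid iff all indices are closed

# 'C1CCCCC1' closes ring 1; 'C1CC2' leaves both 1 and 2 dangling.
```

Note the interleaved case 'C1C2CC1C2' is valid: index 1 and index 2 are each balanced separately, even though their spans cross, which is exactly why this constraint is beyond context-free syntax.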

### a.4 Dependency graph introduced by attribute grammar

For each production and each attribute, consider the dependency set, i.e., the set of attributes on which the computation of this attribute depends. Given a (partial or full) instantiation of a syntax tree generated by the grammar, the union of all dependency sets induces a dependency graph, where the nodes are the attributes and the directed edges represent the dependency relationships among attribute computations. Merging the nodes that correspond to the same symbol but carry different attributes yields a condensed graph; we call the attribute grammar noncircular if this condensed graph contains no cycles.

In our paper, we assume the noncircularity of the dependency graph. This property is exploited for top-down generation in our decoder.
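Noncircularity is just acyclicity of the merged dependency graph, so it can be checked with a standard depth-first search for back edges. A stdlib sketch, with an illustrative adjacency list mapping each attribute to the attributes it depends on (the attribute names are hypothetical):

```python
def is_noncircular(dep):
    """DFS cycle check: `dep` maps each attribute to the attributes its
    computation depends on; returns True iff the graph is acyclic."""
    WHITE, GRAY, BLACK = 0, 1, 2          # unvisited / in progress / done
    color = {v: WHITE for v in dep}

    def visit(v):
        color[v] = GRAY
        for w in dep.get(v, ()):
            c = color.get(w, WHITE)
            if c == GRAY or (c == WHITE and not visit(w)):
                return False              # GRAY neighbor = back edge = cycle
        color[v] = BLACK
        return True

    return all(color[v] != WHITE or visit(v) for v in dep)

# A synthesized attribute depending on an inherited one: acyclic.
acyclic = {"matched": ["open_bonds"], "open_bonds": []}
# Two attributes depending on each other: circular.
cyclic = {"a": ["b"], "b": ["a"]}
```

An acyclic dependency graph admits a topological order of attribute evaluations, which is what makes the on-the-fly, top-down computation of attributes during decoding well defined.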

## Appendix B Training Details

Since our proposed SD-VAE differs from previous works (CVAE, GVAE) only in the formalization of syntax and semantics, we use the same deep neural network architecture for a fair comparison. In the encoder, we use a 3-layer one-dimensional convolutional neural network (CNN) followed by a fully connected layer, whose output is fed into two separate affine layers that produce the mean and standard deviation used in the reparameterization trick; in the decoder, we use a 3-layer RNN followed by an affine layer with softmax activation that gives the probability of each production rule. We use the same latent-space dimensionality and layer sizes as in [Kusner et al.(2017)Kusner, Paige, and Hernández-Lobato]. As for the implementation, we use [Kusner et al.(2017)Kusner, Paige, and Hernández-Lobato]’s open-sourced code for the baselines, and implement our model in the PyTorch framework 4.

We tune the following hyperparameters on a validation set and report test results from the setting with the best validation loss. For a fair comparison, the same tuning is also conducted for the baselines.

For training, we use a weighted combination of the reconstruction loss and the KL divergence as the loss function. A natural setting is to weight the two terms equally, but [Kusner et al.(2017)Kusner, Paige, and Hernández-Lobato] suggested in their open-sourced implementation5 that a different weighting leads to better results. We explore both settings.
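For a diagonal Gaussian posterior and a standard normal prior, the KL term in this loss has the closed form KL = ½ Σᵢ (μᵢ² + σᵢ² − 1 − log σᵢ²). A minimal sketch of the weighted objective (the function names are illustrative, and the particular KL weight is a tuning choice, not prescribed here):

```python
import math

def gaussian_kl(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), in closed form."""
    return 0.5 * sum(m * m + s * s - 1.0 - math.log(s * s)
                     for m, s in zip(mu, sigma))

def vae_loss(recon_nll, mu, sigma, kl_weight=1.0):
    """Reconstruction term plus a (possibly re-weighted) KL term.

    kl_weight = 1.0 recovers the standard ELBO; other values are the
    alternative weighting discussed above."""
    return recon_nll + kl_weight * gaussian_kl(mu, sigma)

# The KL term vanishes exactly when the posterior equals the prior.
```

When the posterior matches the prior (zero mean, unit variance) the KL term is zero and the loss reduces to the reconstruction term alone.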

## Appendix C More experiment details

### c.1 Bayesian optimization

Bayesian optimization is used to search for latent vectors with the desired target property. For example, in symbolic program regression we are interested in finding programs that fit the given input-output pairs; in drug discovery we aim to find molecules with maximal drug-likeness. For a fair comparison with the baseline algorithms, we follow the settings used in \citetKusPaiHer17.

Specifically, we first train the variational autoencoder in an unsupervised way. After obtaining the generative model, we encode all structures into the latent space. These vectors and the corresponding property values (\ie, estimated errors for programs, or drug-likeness for molecules) are then used to train a sparse Gaussian process with 500 inducing points, which is later used to predict properties in the latent space. Next, 5 iterations of batch Bayesian optimization with the expected improvement (EI) heuristic are used to propose new latent vectors; in each iteration, 50 latent vectors are proposed. After each proposal, the newly found programs/molecules are added to the batch for the next iteration.
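For a candidate point whose GP posterior is N(μ, σ²) and incumbent best (lowest) observed value f*, the EI heuristic has the closed form EI = (f* − μ)Φ(z) + σφ(z) with z = (f* − μ)/σ. A stdlib sketch (representing candidates as (μ, σ) pairs is a simplification for illustration; in practice the GP is evaluated at latent vectors):

```python
import math

def expected_improvement(mu, sigma, best):
    """EI for minimization at a point with GP posterior N(mu, sigma^2);
    `best` is the incumbent (lowest observed) value."""
    if sigma <= 0.0:
        return max(best - mu, 0.0)          # deterministic prediction
    z = (best - mu) / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))   # standard normal CDF
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return (best - mu) * cdf + sigma * pdf

def propose(candidates, best, batch=50):
    """Rank (mu, sigma) candidates by EI and keep the top batch."""
    return sorted(candidates,
                  key=lambda c: -expected_improvement(c[0], c[1], best))[:batch]
```

EI trades off exploitation (low predicted mean) against exploration (high posterior variance), which is why candidates far from the training data can still be proposed.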

When proposing latent vectors in each iteration, we perform 100 rounds of decoding and pick the most frequently decoded structure. This helps regulate the randomness of decoding, and also increases the chance for the baseline algorithms to propose valid structures.
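This most-frequent-decode rule is a majority vote over repeated stochastic decodes, which can be sketched as follows (hypothetical helper; `decode_once` stands in for the stochastic decoder and may return None for an invalid decode):

```python
from collections import Counter

def majority_decode(decode_once, z, rounds=100):
    """Decode the latent vector `z` several times and keep the most
    frequent valid result; returns None if every decode was invalid."""
    outcomes = Counter(
        s for s in (decode_once(z) for _ in range(rounds)) if s is not None)
    return outcomes.most_common(1)[0][0] if outcomes else None
```

With a decoder that mostly returns one structure and occasionally another, the vote settles on the dominant one, smoothing out decoding noise.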

### c.2 Reconstruction

We visualize some SMILES reconstruction results in Figure 10. In most cases the decoder successfully recovers the exact original input; due to the stochasticity of the decoder, small variations may occur.

### Footnotes

1. footnotetext: *Both authors contributed equally to the paper.
2. Note that setting threshold for assumes a mildly context sensitive grammar (\eg, limited CSD).
3. Here the frontier is the set of all nonterminal leaves in the current tree.
4. {http://pytorch.org/}
5. {https://github.com/mkusner/grammarVAE/issues/2}

### References

1. David Alvarez-Melis and Tommi S Jaakkola. Tree-structured decoding with doubly-recurrent neural networks. 2016.
2. Mostapha Benhenda. Chemgan challenge for drug discovery: can ai reproduce natural chemical diversity? arXiv preprint arXiv:1708.08227, 2017.
3. Esben Jannik Bjerrum. Smiles enumeration as data augmentation for neural network modeling of molecules. arXiv preprint arXiv:1703.07076, 2017.
4. Samuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew M Dai, Rafal Jozefowicz, and Samy Bengio. Generating sentences from a continuous space. CoNLL 2016, pp.  10, 2016.
5. Joan Bresnan, Ronald M Kaplan, Stanley Peters, and Annie Zaenen. Cross-serial dependencies in dutch. 1982.
6. Hanjun Dai, Bo Dai, and Le Song. Discriminative embeddings of latent variable models for structured data. In ICML, 2016.
7. Hanjun Dai, Bo Dai, Yan-Ming Zhang, Shuang Li, and Le Song. Recurrent hidden semi-markov model. 2017.
8. David Janz, Jos van der Westhuizen, Brooks Paige, Matt Kusner, and José Miguel Hernández-Lobato. Learning a generative model for validity in complex discrete structures. International Conference on Learning Representations, 2018. Accepted as poster.
9. Li Dong and Mirella Lapata. Language to logical form with neural attention. arXiv preprint arXiv:1601.01280, 2016.
10. David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems, pp. 2215–2223, 2015.
11. Peter Ertl and Ansgar Schuffenhauer. Estimation of synthetic accessibility score of drug-like molecules based on molecular complexity and fragment contributions. Journal of cheminformatics, 1(1):8, 2009.
12. Rafael Gómez-Bombarelli, David Duvenaud, José Miguel Hernández-Lobato, Jorge Aguilera-Iparraguirre, Timothy D Hirzel, Ryan P Adams, and Alán Aspuru-Guzik. Automatic chemical design using a data-driven continuous representation of molecules. arXiv preprint arXiv:1610.02415, 2016.
13. Gabriel Lima Guimaraes, Benjamin Sanchez-Lengeling, Pedro Luis Cunha Farias, and Alán Aspuru-Guzik. Objective-reinforced generative adversarial networks (organ) for sequence generation models. arXiv preprint arXiv:1705.10843, 2017.
14. Zhiting Hu, Xuezhe Ma, Zhengzhong Liu, Eduard Hovy, and Eric Xing. Harnessing deep neural networks with logic rules. arXiv preprint arXiv:1603.06318, 2016.
15. David Janz, Jos van der Westhuizen, and José Miguel Hernández-Lobato. Actively learning what makes a discrete sequence valid. arXiv preprint arXiv:1708.04465, 2017.
16. Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
17. Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
18. Donald E Knuth. Semantics of context-free languages. Theory of Computing Systems, 2(2):127–145, 1968.
19. Matt J Kusner, Brooks Paige, and José Miguel Hernández-Lobato. Grammar variational autoencoder. arXiv preprint arXiv:1703.01925, 2017.
20. Tao Lei, Wengong Jin, Regina Barzilay, and Tommi Jaakkola. Deriving neural architectures from sequence and graph kernels. arXiv preprint arXiv:1705.09037, 2017.
21. Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
22. Emilio Parisotto, Abdel-rahman Mohamed, Rishabh Singh, Lihong Li, Dengyong Zhou, and Pushmeet Kohli. Neuro-symbolic program synthesis. arXiv preprint arXiv:1611.01855, 2016.
23. Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
24. Danilo J Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pp. 1278–1286, 2014.
25. Stuart M Shieber. Evidence against the context-freeness of natural language. 1985.
26. Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pp. 3104–3112, 2014.
27. David Weininger. Smiles, a chemical language and information system. 1. introduction to methodology and encoding rules. Journal of chemical information and computer sciences, 28(1):31–36, 1988.
28. Xingxing Zhang, Liang Lu, and Mirella Lapata. Top-down tree long short-term memory networks. arXiv preprint arXiv:1511.00060, 2015.