Localised Generative Flows
Abstract
We argue that flow-based density models based on continuous bijections are limited in their ability to learn target distributions with complicated topologies, and propose localised generative flows (LGFs) to address this problem. LGFs are composed of stacked continuous mixtures of bijections, which enables each bijection to learn a local region of the target rather than its entirety. Our method is a generalisation of existing flow-based methods, which can be used without modification as the basis for an LGF model. Unlike normalising flows, LGFs do not permit exact computation of log likelihoods, but we propose a simple variational scheme that performs well in practice. We show empirically that LGFs yield improved performance across a variety of density estimation tasks.
1 Introduction
Flow-based generative models, often referred to as normalising flows, have become popular methods for density estimation because of their flexibility, expressiveness, and tractable likelihoods. Given the problem of learning an unknown target density $p^\star_X$ on a data space $\mathcal{X}$, normalising flows model $p^\star_X$ as the marginal of $X$ obtained by the generative process
(1) $Z \sim p_Z, \qquad X = f(Z),$
where $p_Z$ is a prior density on a space $\mathcal{Z}$, and $f : \mathcal{Z} \to \mathcal{X}$ is a bijection.^{1}^{1}1We assume throughout that $\mathcal{X} = \mathcal{Z} = \mathbb{R}^d$, and that all densities are with respect to the Lebesgue measure. Assuming sufficient regularity, it follows that $X$ has density $p_X(x) = p_Z(f^{-1}(x))\, \left|\det D f^{-1}(x)\right|$ (see e.g. billingsley2008probability). The parameters of $f$ can be learned via maximum likelihood given i.i.d. samples from $p^\star_X$.
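The change-of-variables computation above can be sketched in a few lines. The following is a minimal illustration rather than any implementation from the paper: we assume a one-dimensional standard normal prior and the bijection $f(z) = \exp(z)$, so that the model density is log-normal.

```python
import math

def prior_logpdf(z):
    # log density of a standard normal prior p_Z
    return -0.5 * z * z - 0.5 * math.log(2 * math.pi)

def model_logpdf(x):
    # change of variables: log p_X(x) = log p_Z(f^{-1}(x)) + log |det Df^{-1}(x)|
    # for the illustrative bijection f(z) = exp(z), so f^{-1}(x) = log(x)
    z = math.log(x)
    log_jac = -math.log(x)  # (d/dx) log(x) = 1/x, so log |Df^{-1}(x)| = -log(x)
    return prior_logpdf(z) + log_jac
```

At $x = 1$ this recovers the log-normal log-density $-\tfrac{1}{2}\log 2\pi$, as expected.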
To be effective, a normalising flow model must specify an expressive family of bijections with tractable Jacobians. Affine coupling layers (dinh2014nice; dinh2016density), autoregressive transformations (germain2015made; papamakarios2017masked), ODE-based transformations (grathwohl2018ffjord), and invertible ResNet blocks (behrmann2019invertible) are all examples of such bijections that can be composed to produce complicated flows. These models have demonstrated significant promise in their ability to model complex datasets (papamakarios2017masked) and to synthesise novel data points (kingma2018glow).
However, in all these cases, $f$ is continuous in $z$. We believe this is a significant limitation of these models, since it imposes a global constraint on $f$, which must learn to match the topology of $\operatorname{supp} p_Z$, which is usually quite simple, to the topology of $\operatorname{supp} p^\star_X$, which we expect to be very complicated. We argue that this constraint makes maximum likelihood estimation extremely difficult in general, leading to training instabilities and erroneous regions of high likelihood in the learned density landscape.
To address this problem we introduce localised generative flows (LGFs), which generalise equation 1 by replacing the single bijection $f$ with stacked continuous mixtures of bijections $\{f_u\}_{u \in \mathcal{U}}$ for an index set $\mathcal{U}$. Intuitively, LGFs allow each $f_u$ to focus on modelling only a local component of the target that may have a much simpler topology than the full density. LGFs do not stipulate the form of $f_u$, and indeed any standard choice of $f$ can be used as the basis of its definition. We pay a price for these benefits in that we can no longer compute the likelihood of our model exactly and must instead resort to a variational approximation, with our training objective replaced by the evidence lower bound (ELBO). However, in practice we find this is not a significant limitation, as the bijective structure of LGFs permits learning a high-quality variational distribution suitable for large-scale training. We show empirically that LGFs outperform their counterpart normalising flows across a variety of density estimation tasks.
2 Limitations of Normalising Flows
Consider a normalising flow model defined by a family of bijections $\{f_\theta\}$ parameterised by $\theta$. Suppose we are in the typical case that each $f_\theta$ is continuous in $z$. Intuitively, this seems to pose a problem when $p_Z$ and $p^\star_X$ are supported^{2}^{2}2As usually defined, the support of a density $p$ technically refers to the set on which $p$ is strictly positive. However, our statements are also approximately true when the support is interpreted as the region on which $p$ is not smaller than some threshold. This is relevant in practice since, even if both densities are highly concentrated on some small region, it is common to assume that $p_Z$ and $p^\star_X$ have full support, in which case their supports would be trivially homeomorphic. on domains with different topologies, since continuous bijections with continuous inverses necessarily preserve topology. In practice, suggestive pathologies along these lines are readily apparent in simple 2D experiments, as shown in Figure 1. In this case, the density model (a masked autoregressive flow (MAF) (papamakarios2017masked)) is unable to continuously transform the support of the prior (a standard 2D Gaussian) into the support of $p^\star_X$, which is the union of two disjoint rectangles and hence clearly has a different topology.
The standard maximum likelihood objective is asymptotically equivalent to
(2) $\operatorname*{argmin}_\theta \, \mathrm{KL}\!\left( p^\star_X \,\middle\|\, p^\theta_X \right),$
where $p^\theta_X$ denotes the density of $f_\theta(Z)$ for $Z \sim p_Z$, and $\mathrm{KL}$ denotes the Kullback-Leibler divergence. We conjecture that the pathologies in Figure 1 are the result of sensitivities of this KL that are greatly exacerbated when modelling $p_X$ as the pushforward of $p_Z$ by a single continuous $f_\theta$. In particular, observe that $\mathrm{KL}(p^\star_X \,\|\, p^\theta_X) = \infty$ unless the support of $p^\theta_X$ contains $\operatorname{supp} p^\star_X$. However, at the same time, Pinsker's inequality (massart2007concentration) means that the KL is bounded below like
$\mathrm{KL}(p^\star_X \,\|\, p^\theta_X) \geq 2\, \delta(p^\star_X, p^\theta_X)^2,$
where $\delta$ denotes the total variation distance, and is hence strictly positive unless $p^\theta_X = p^\star_X$. Thus the objective of equation 2 encourages the support of $p^\theta_X$ to approximate that of $p^\star_X$ from above as closely as possible, but any overshoot – however small – carries an immediate infinite penalty.
We believe this behaviour significantly limits the ability of normalising flows to learn any target distribution whose support has a complicated topology. For example, note that in practice our parameterisation usually satisfies the condition that $f_\theta$ is a continuous bijection with continuous inverse for all $\theta$. It then follows that no $\theta$ will yield $p^\theta_X = p^\star_X$ when $\operatorname{supp} p_Z$ and $\operatorname{supp} p^\star_X$ are not homeomorphic, even if our parameterisation contains all such continuous bijections. When this holds, we conjecture that the optimum of equation 2 will typically be pathological, since, in order to drive the KL towards zero, $f_\theta$ must necessarily approximate some mapping that is not a continuous bijection.
For gradient-based methods of solving equation 2, we also conjecture that this problem is compounded by the discontinuous behaviour of the KL described above. It seems plausible that small perturbations of $\theta$ might entail small perturbations of parts of $\operatorname{supp} p^\theta_X$. When $\operatorname{supp} p^\theta_X$ is much larger than $\operatorname{supp} p^\star_X$, this may not be a problem. However, if $\theta$ is close to optimal, so that $\operatorname{supp} p^\theta_X$ only just contains $\operatorname{supp} p^\star_X$, then these perturbations may mean the inclusion no longer holds, hence driving the KL to infinity. We conjecture this makes the landscape around the optimal $\theta$ difficult to navigate for gradient-based methods, leading to increasing fluctuations and instabilities of the KL in equation 2 that degrade the quality of the final $p^\theta_X$ produced. Figure 2 illustrates this behaviour: observe that, even after 300 training epochs, the support of the standard MAF is unstable.
A simple way to fix these problems would be to use a more complicated prior $p_Z$ that is better matched to the structure of $p^\star_X$. For instance, taking $p_Z$ to be a mixture model has previously been found to improve the performance of normalising flows in some cases (papamakarios2017masked). However, this approach requires prior knowledge of the topology of $p^\star_X$ that might be difficult to obtain: e.g. solving the problem with a Gaussian mixture requires us to know the number of connected components of $\operatorname{supp} p^\star_X$ beforehand, and even then would require that these components are each homeomorphic to a hypersphere. Ideally, we would like our model to be flexible enough to learn the structure of the target on its own, with minimal explicit design choices required on our part.
An alternative approach would be to try a more expressive family $\{f_\theta\}$ in the hope that this better conditions the optimisation problem in equation 2. Several works have considered families of flows that are (in principle) universal approximators of any continuous probability density (huang2018neural; jaini2019sum). While we have not performed a thorough empirical evaluation of these methods, we suspect that these models can at best mitigate the problems described above, since the assumption of continuity of $f_\theta$ in practice holds for universal approximators also. Moreover, the method we propose below can be used in conjunction with any standard flow, so we expect even better performance when an expressive $f$ is combined with our approach.
Finally, we note that the shortcomings of the KL divergence for generative modelling are described at length by arjovsky2017wasserstein. There the authors suggest instead using the Wasserstein distance to measure the discrepancy between $p^\theta_X$ and $p^\star_X$, since under typical assumptions this will yield a continuous loss function suitable for gradient-based training. However, the Wasserstein distance is difficult to estimate in high dimensions, and its performance can be sensitive to the choice of ground metric used (peyre2019computational). Our proposal here is to keep the KL objective in equation 2 and instead modify the model so that we are not required to map $\operatorname{supp} p_Z$ onto $\operatorname{supp} p^\star_X$ using a single continuous bijection. We now describe our method in full.
3 Localised generative flows
3.1 Model
The essence of our idea is to replace the single $f$ used in equation 2 with an indexed family $\{f_u\}_{u \in \mathcal{U}}$ such that each $f_u$ is a bijection from $\mathcal{Z}$ to $\mathcal{X}$. Intuitively, our aim is for each $f_u$ only to push the prior onto a local region of the target, thereby relaxing the constraints posed by standard normalising flows as described above. To do so, we now define $p_X$ as the marginal density of $X$ obtained via the following generative process:
(3) $Z \sim p_Z, \qquad U \sim p_{U \mid Z}(\cdot \mid Z), \qquad X = f_U(Z).$
Here $p_{U \mid Z}$ is an additional term that we must specify. In all our experiments we take this to be a mean-field Gaussian, so that $p_{U \mid Z}(\cdot \mid z) = \mathcal{N}\!\left(\mu(z), \operatorname{diag}(\sigma^2(z))\right)$ for some functions $\mu, \sigma : \mathcal{Z} \to \mathbb{R}^{d_{\mathcal{U}}}$, where $d_{\mathcal{U}}$ is the dimension of the index set $\mathcal{U}$. Other possibilities, such as the Concrete distribution (maddison2016concrete), might also be useful.
Informally,^{3}^{3}3This argument can be made precise using disintegrations (chang1997conditioning), but since the proof is mainly a matter of measure-theoretic formalities we omit it. this yields the joint model $p_{Z,U,X}(z, u, x) = p_Z(z)\, p_{U \mid Z}(u \mid z)\, \delta(x - f_u(z)),$ where $\delta$ is the Dirac delta. We can marginalise out the dependence on $z$ by making the change of variable $z = f_u^{-1}(x)$, which means $\mathrm{d}z = \left|\det D f_u^{-1}(x)\right| \mathrm{d}x$.^{4}^{4}4Note that $D f_u^{-1}$ refers to the Jacobian with respect to $x$ only. This yields a proper density for $(U, X)$ via
$p_{U,X}(u, x) = p_{U \mid Z}\!\left(u \mid f_u^{-1}(x)\right) p_Z\!\left(f_u^{-1}(x)\right) \left|\det D f_u^{-1}(x)\right|.$
We then obtain our density model by integrating over $u$:
(4) $p_X(x) = \int_{\mathcal{U}} p_{U \mid Z}\!\left(u \mid f_u^{-1}(x)\right) p_Z\!\left(f_u^{-1}(x)\right) \left|\det D f_u^{-1}(x)\right| \mathrm{d}u.$
In other words, $p_X$ is a mixture (in general, infinite) of individual normalising flows $p_Z(f_u^{-1}(x))\, |\det D f_u^{-1}(x)|$, each weighted by $p_{U \mid Z}(u \mid f_u^{-1}(x))$.
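This mixture can be checked numerically in a simple special case. The sketch below is our own illustration, not a model from the paper: it uses the shift family $f_u(z) = z + u$ with a standard normal prior and an unconditional standard normal mixing density, so the Jacobian term is 1 and the true marginal is $\mathcal{N}(0, 2)$; the integral over $u$ is approximated by Monte Carlo.

```python
import math
import random

def gaussian_logpdf(x, mean=0.0, std=1.0):
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std) - 0.5 * math.log(2 * math.pi)

def mixture_density(x, n_samples=100_000, seed=0):
    # Monte Carlo estimate of p_X(x) = E_{u ~ p_U}[ p_Z(f_u^{-1}(x)) |det Df_u^{-1}(x)| ]
    # for the illustrative shift family f_u(z) = z + u, where f_u^{-1}(x) = x - u
    # and the Jacobian term is identically 1
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        u = rng.gauss(0.0, 1.0)                    # u ~ N(0, 1)
        total += math.exp(gaussian_logpdf(x - u))  # p_Z(f_u^{-1}(x))
    return total / n_samples
```

Since $X = Z + U$ with independent standard normals, the estimate should match the $\mathcal{N}(0, 2)$ density, e.g. $1/\sqrt{4\pi} \approx 0.282$ at $x = 0$.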
We can also stack this architecture by taking $p_Z$ itself to be a density of the form of equation 4. Doing so with $L$ layers of stacking corresponds to the marginal of $X = Z_L$ obtained via:
$Z_0 \sim p_{Z_0}, \qquad U_\ell \sim p_{U_\ell \mid Z_{\ell-1}}(\cdot \mid Z_{\ell-1}), \qquad Z_\ell = f_{U_\ell}(Z_{\ell-1}), \qquad \ell = 1, \ldots, L,$
where now each $f_{u_\ell}$ is a bijection for all $u_\ell \in \mathcal{U}_\ell$. The stochastic computation graph corresponding to this model is shown in (a). In this case, the same argument yields
(5) $p_{Z_\ell}(z_\ell) = \int_{\mathcal{U}_\ell} p_{U_\ell \mid Z_{\ell-1}}\!\left(u \mid f_u^{-1}(z_\ell)\right) p_{Z_{\ell-1}}\!\left(f_u^{-1}(z_\ell)\right) \left|\det D f_u^{-1}(z_\ell)\right| \mathrm{d}u,$
where $p_X = p_{Z_L}$. This approach is in keeping with the standard practice of constructing normalising flows as the composition of simpler bijections, which can indeed be recovered here by taking each $p_{U_\ell \mid Z_{\ell-1}}$ to be Dirac. We have found stacking to improve significantly the overall expressiveness of our models, and we make extensive use of it in our experiments below.
3.2 Benefits
Heuristically, we believe our model allows each $f_u$ to learn a local region of the target, thereby greatly relaxing the global constraints on existing flow-based models described above. To ensure a finite KL, we no longer require the pushforward of $p_Z$ under $f_u$ to have support covering the entirety of $\operatorname{supp} p^\star_X$ for any given $u$; all that matters is that every region of $\operatorname{supp} p^\star_X$ is covered for some $u$. Our model can thus achieve good performance with each bijection faithfully mapping onto a potentially very small component of the target. This seems intuitively more achievable than the previous case, wherein a single bijection is required to capture the entire target.
This argument is perhaps clearest if $U$ is discrete. For example, even if $U$ can take only two possible values $u_0$ and $u_1$, it immediately becomes simpler to represent the target shown in Figure 1 using a Gaussian prior: we simply require $f_{u_0}$ to map the prior onto one component, and $f_{u_1}$ to map it onto the other. In practice, we could easily implement such a family using two separate normalising flows that are trained jointly. The discrete case is also appealing since in principle it allows exact evaluation of the integral in equation 4, which becomes a summation. Unfortunately this approach also has significant drawbacks that we discuss at greater length in Appendix B.
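For concreteness, here is how such a two-value discrete mixture looks when both flows are simple affine bijections. This is a toy illustration under our own choice of scales and shifts, not a model from the paper: each $f_u$ maps a standard normal prior onto one mode of a bimodal density, and the sum over $u$ in equation 4 is evaluated exactly.

```python
import math

def gaussian_logpdf(z):
    return -0.5 * z * z - 0.5 * math.log(2 * math.pi)

# Two affine bijections indexed by u in {0, 1}: f_u(z) = SCALE * z + SHIFT[u].
# Each maps the standard normal prior onto one mode of a bimodal target.
SCALE = 0.5
SHIFT = {0: -2.0, 1: 2.0}

def flow_logpdf(x, u):
    # change of variables for f_u: log p(x | u) = log p_Z((x - shift) / scale) - log scale
    z = (x - SHIFT[u]) / SCALE
    return gaussian_logpdf(z) - math.log(SCALE)

def mixture_pdf(x):
    # the summation version of equation 4, with equal weights p_U(u) = 1/2
    return 0.5 * math.exp(flow_logpdf(x, 0)) + 0.5 * math.exp(flow_logpdf(x, 1))
```

At either mode the far component contributes negligibly, so the density is approximately $\tfrac{1}{2}\,\mathcal{N}(x; \mp 2, 0.25)$ there, and the construction is symmetric about zero.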
We therefore instead focus here on a continuous $U$. In this case, for example, we can recover the target from Figure 1 by partitioning $\mathcal{U}$ into disjoint regions $\mathcal{U}_0$ and $\mathcal{U}_1$, and having $f_u$ map the prior onto the left component of the target for $u \in \mathcal{U}_0$, and the right component for $u \in \mathcal{U}_1$. Observe that in this scenario we do not require any given $f_u$ to map onto both components of the target, which is in keeping with our goal of localising the model of the target that is learned by our method.
In practice $f_u(z)$ will invariably be continuous in both its arguments, in which case it will not be possible to partition $\mathcal{U}$ disjointly in this way. Instead we will necessarily obtain some additional intermediate region of $\mathcal{U}$ on which $f_u$ maps part of the prior outside the support of $p^\star_X$, so that $p_X$ will be strictly positive there. This might appear to return us to the situation shown for the standard MAF in Figure 1. However, if only a relatively small region of $\mathcal{U}$ maps to any such point, then the overall value of the integral in equation 4 will be small there. If $p_{U \mid Z}$ is sufficiently flexible, it might additionally learn to downweight such $(u, z)$ pairs, further reducing the value of $p_X$. Empirically we find these conditions are indeed sufficient to learn an accurate density estimator without the pathologies of standard flows. Observe in Figure 1 that our model is able to cleanly separate the two distinct components of the target density. Moreover, in Figure 2, we see that the learned density avoids the instabilities we observed previously.
We believe a similar story holds when our model is applied to more complicated targets. At a heuristic level, we expect our model will learn to (softly) partition the target into separate components, with each $f_u$ mapping onto only one of these components. We conjecture that learning many such localised bijections is easier than learning a single global bijection. We demonstrate this empirically for several density estimation problems of interest later in the paper.
3.3 Inference
Even in the single-layer case ($L = 1$), if $U$ is continuous, then the integral in equation 4 is intractable. In order to train our model, we therefore resort to a variational approximation: we introduce an approximate posterior $q_{U \mid X}$, and consider the evidence lower bound (ELBO) of $\log p_X(x)$:
$\mathcal{L}(x) := \mathbb{E}_{q_{U \mid X}(u \mid x)}\!\left[ \log \frac{p_{U,X}(u, x)}{q_{U \mid X}(u \mid x)} \right].$
It is straightforward to show that $\mathcal{L}(x) \leq \log p_X(x)$, and that this bound is tight when $q_{U \mid X}(\cdot \mid x)$ is the exact posterior $p_{U \mid X}(\cdot \mid x)$. This allows learning an approximation to $p^\star_X$ by maximising $\sum_{n=1}^{N} \mathcal{L}(x_n)$ jointly in the parameters of $p$ and $q$, where we assume a dataset of i.i.d. samples $x_1, \ldots, x_N \sim p^\star_X$.
It can be shown (see Appendix A) that the exact posterior factors as $p_{U_{1:L} \mid X}(u_{1:L} \mid x) = \prod_{\ell=1}^{L} p_{U_\ell \mid Z_\ell}(u_\ell \mid z_\ell),$ where $z_L = x$, and $z_{\ell-1} = f_{u_\ell}^{-1}(z_\ell)$ for $\ell = L, \ldots, 2$. We thus endow $q$ with the same form:
$q_{U_{1:L} \mid X}(u_{1:L} \mid x) = \prod_{\ell=1}^{L} q_{U_\ell \mid Z_\ell}(u_\ell \mid z_\ell).$
The stochastic computation graph for this inference model is shown in (b). In conjunction with equation 5, this allows writing the ELBO recursively as
$\mathcal{L}_\ell(z_\ell) = \mathbb{E}_{q_{U_\ell \mid Z_\ell}(u_\ell \mid z_\ell)}\!\left[ \log \frac{p_{U_\ell \mid Z_{\ell-1}}(u_\ell \mid z_{\ell-1})\, \left|\det D f_{u_\ell}^{-1}(z_\ell)\right|}{q_{U_\ell \mid Z_\ell}(u_\ell \mid z_\ell)} + \mathcal{L}_{\ell-1}(z_{\ell-1}) \right], \qquad z_{\ell-1} = f_{u_\ell}^{-1}(z_\ell),$
for $\ell \geq 1$, with the base case $\mathcal{L}_0(z_0) = \log p_{Z_0}(z_0)$. Here we recover $\mathcal{L}(x) = \mathcal{L}_L(x)$.
Now let $\phi$ denote all the parameters of both $p$ and $q$. Suppose each $q_{U_\ell \mid Z_\ell}$ can be suitably reparametrised (kingma2013auto; rezende2014stochastic) so that $u_\ell = g_\phi(\epsilon, z_\ell)$ with $\epsilon \sim p_\epsilon$ when $u_\ell \sim q_{U_\ell \mid Z_\ell}(\cdot \mid z_\ell)$, for some function $g_\phi$ and density $p_\epsilon$, where $p_\epsilon$ does not depend on $\phi$. In all our experiments we give $q$ the same mean-field form as in equation 3, in which case this holds immediately as described e.g. by kingma2013auto. We can then obtain unbiased estimates of the gradient of the ELBO straightforwardly using Algorithm 1, which in turn allows optimising our objective via stochastic gradient descent. Note that although this algorithm is specified in terms of a single value of $x$, it is trivial to obtain an unbiased estimate for a minibatch of points by averaging over the batch index at each layer of recursion.
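To make the estimator concrete, the following sketch (a toy illustration rather than Algorithm 1 itself) computes a single-sample reparametrised ELBO estimate for a one-layer model with the illustrative shift family $f_u(z) = z + u$, standard normal $p_Z$ and mixing density, and a Gaussian posterior. When $q$ equals the exact posterior $\mathcal{N}(x/2, 1/2)$, the estimate equals $\log p_X(x)$ exactly for every sample, since the bound is then tight pointwise.

```python
import math
import random

def gaussian_logpdf(v, mean=0.0, std=1.0):
    return -0.5 * ((v - mean) / std) ** 2 - math.log(std) - 0.5 * math.log(2 * math.pi)

def elbo_estimate(x, q_mean, q_std, rng):
    # Single-sample unbiased ELBO estimate for the shift family f_u(z) = z + u
    # (so f_u^{-1}(x) = x - u, with unit Jacobian) and a reparametrised
    # Gaussian posterior q(u | x) = N(q_mean, q_std^2)
    eps = rng.gauss(0.0, 1.0)
    u = q_mean + q_std * eps                   # reparametrisation trick
    log_p_u = gaussian_logpdf(u)               # log p_U(u)
    log_p_z = gaussian_logpdf(x - u)           # log p_Z(f_u^{-1}(x))
    log_q = gaussian_logpdf(u, q_mean, q_std)  # log q(u | x)
    return log_p_u + log_p_z - log_q
```

Here the marginal is $\mathcal{N}(0, 2)$, so with the exact posterior every draw returns $\log \mathcal{N}(x; 0, 2)$.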
3.3.1 Performance
A major reason for the popularity of normalising flows is the tractability of their exact log-likelihoods. In contrast, the variational scheme described here produces at best an approximation to this value, which we might expect to reduce the performance of the final density estimator learned. Moreover, particularly when the number of layers $L$ is large, it might seem that the variance of the gradient estimators obtained from Algorithm 1 would be impractically high.
However, in practice we have not found either of these problems to be a significant limitation, as our experimental results in Section 5 show. Empirically we find improved performance over standard flows even when using the ELBO as our training objective. We also find that importance sampling is sufficient for obtaining good, low-variance (if slightly negatively biased) estimates of $\log p_X(x)$ (rezende2014stochastic, Appendix E) at test time, although we do note that the stochasticity here can lead to small artefacts like the occasional white spots visible above in our 2D experiments.
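The importance sampling estimator referred to here has the usual form $\log p_X(x) \approx \log \tfrac{1}{K} \sum_{k=1}^{K} p_{U,X}(u_k, x) / q(u_k \mid x)$ with $u_k \sim q(\cdot \mid x)$. The sketch below is our own toy example, not the paper's code: it uses the illustrative shift family $f_u(z) = z + u$ with standard normal prior and mixing density (true marginal $\mathcal{N}(0, 2)$) and a deliberately imperfect proposal, and includes the log-sum-exp stabilisation.

```python
import math
import random

def gaussian_logpdf(v, mean=0.0, std=1.0):
    return -0.5 * ((v - mean) / std) ** 2 - math.log(std) - 0.5 * math.log(2 * math.pi)

def importance_log_likelihood(x, n_samples=1000, seed=0):
    # log p(x) ~= log (1/K) sum_k p(x, u_k) / q(u_k | x), with u_k ~ q(. | x)
    # Proposal q(u | x) = N(x / 2, 1) is deliberately wider than the exact
    # posterior N(x / 2, 1/2), so the weights are non-degenerate
    rng = random.Random(seed)
    log_weights = []
    for _ in range(n_samples):
        u = rng.gauss(x / 2, 1.0)
        log_joint = gaussian_logpdf(u) + gaussian_logpdf(x - u)  # log p(u) + log p(x | u)
        log_q = gaussian_logpdf(u, x / 2, 1.0)
        log_weights.append(log_joint - log_q)
    m = max(log_weights)  # log-sum-exp for numerical stability
    return m + math.log(sum(math.exp(w - m) for w in log_weights) / n_samples)
```

With 1000 samples the estimate is already within a few hundredths of a nat of the exact value $\log \mathcal{N}(x; 0, 2)$ in this toy setting.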
We similarly do not find that the variance of Algorithm 1 grows intractably when we use a large number of layers $L$, and in practice we are able to train models having the same depth as popular normalising flows. We conjecture that this occurs because, as the number of stacked bijections in our generative process grows, the complexity required of each individual bijection to map $z_{\ell-1}$ to $z_\ell$ naturally decreases. We therefore have reason to think that, as $L$ becomes large, learning each layer will become easier, so that the variance at each layer will decrease and the overall variance will remain fairly stable.
3.4 Choice of indexed bijection family
We now consider the choice of the indexed family $\{f_u\}$, for which there are many possibilities. In our experiments, which all take $\mathcal{U} = \mathbb{R}^{d_{\mathcal{U}}}$, we focus on the simple case of
(6) $f_u(z) = f\!\left( \sigma(u) \odot z + \mu(u) \right),$
where $\mu, \sigma : \mathcal{U} \to \mathbb{R}^d$ are unrestricted mappings (with $\sigma$ everywhere positive), $f$ is a bijection, and $\odot$ denotes elementwise multiplication. In this case $\left|\det D f_u(z)\right| = \left|\det D f(\sigma(u) \odot z + \mu(u))\right| \prod_{i=1}^{d} \sigma_i(u)$, where $\sigma_i(u)$ is the $i$th component of $\sigma(u)$. This has the advantage of working out-of-the-box with all pre-existing normalising flow methods for which a tractable Jacobian of $f$ is available.
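A minimal sketch of this family, under our reading of the construction (the networks `mu_net` and `sigma_net` are stand-ins for the learned mappings, and the elementwise base bijection $f(h)_i = h_i + \tanh(h_i)$ is our own illustrative choice): the log-determinant splits into a $\sum_i \log \sigma_i(u)$ term from the affine step plus the log-determinant of the base bijection.

```python
import math

def mu_net(u):
    # hypothetical unrestricted mapping mu(u)
    return [0.5 * ui for ui in u]

def sigma_net(u):
    # hypothetical positive mapping sigma(u)
    return [math.exp(0.1 * ui) for ui in u]

def indexed_bijection(z, u):
    # f_u(z) = f(sigma(u) * z + mu(u)), with f applied elementwise as
    # f(h) = h + tanh(h), a strictly increasing bijection on R.
    # Returns (f_u(z), log |det Df_u(z)|).
    mu, sigma = mu_net(u), sigma_net(u)
    h = [s * zi + m for zi, s, m in zip(z, sigma, mu)]
    out = [hi + math.tanh(hi) for hi in h]
    # log-det factorises: sum_i log sigma_i(u) plus the log-det of the base
    # bijection, whose derivative is 1 + (1 - tanh(h)^2) = 2 - tanh(h)^2
    log_det = sum(math.log(s) for s in sigma)
    log_det += sum(math.log(2.0 - math.tanh(hi) ** 2) for hi in h)
    return out, log_det
```

A quick finite-difference check against the returned log-determinant confirms the factorisation in one dimension.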
Equation 6 also has an appealing similarity with the common practice of applying affine transformations between flow steps for normalisation purposes, which has been found empirically to improve stability, convergence time, and overall performance (dinh2016density; papamakarios2017masked; kingma2018glow). In prior work, the shift and scale have been simple parameters that are learned either directly as part of the model, or updated according to running batch statistics. Our approach may be understood as a generalisation of this family of techniques.
Other choices of $f_u$ are certainly possible here. For instance, we have had preliminary experimental success on small problems by simply taking $f$ to be the identity, in which case the model is greatly simplified by not requiring any Jacobians at all. Alternatively, it is also frequently possible to modify the architecture of standard choices of $f$ to obtain an appropriate $f_u$. For instance, affine coupling layers, a key component of models such as RealNVP (dinh2016density), make use of neural networks that take as input a subset of the dimensions of $z$. By concatenating $u$ to this input, we straightforwardly obtain a family of bijections for each value of $u$. This requires more work to implement than our suggested method, but has the advantage of no longer requiring a choice of $\mu$ and $\sigma$. We have again had preliminary empirical success with this approach. We leave a more thorough exploration of these alternative possibilities for future work.
4 Related Work
4.1 Discrete Mixture Methods
Several density models exist that involve discrete mixtures of normalising flows. A closely related approach to ours is RAD (dinh2019rad), which is a special case of our model in which $L = 1$, $U$ is discrete, and $\mathcal{U}$ is partitioned disjointly. In the context of Monte Carlo estimation, duan2019transport proposes a similar model to RAD that does not use partitioning. ziegler2019latent introduce a normalising flow model for sequences with an additional latent variable indicating sequence length. More generally, our method may be considered an addition to the class of deep mixture models (tang2012deep; van2014factoring), with our use of continuous mixing variables designed to reduce the computational difficulties that arise when stacking discrete mixtures hierarchically – we refer the reader to Appendix B for more details on this.
4.2 Methods Combining Variational Inference and Normalising Flows
There is also a large class of methods which use normalising flows to improve the inference procedure in variational inference (berg2018sylvester; kingma2016improved; rezende2015variational), although flows are not typically present in the generative process here. This approach can be described as using normalising flows to improve variational inference, in contrast with our goal of using variational inference to improve normalising flows.
However, some methods do augment normalising flows with variational inference, although in all cases below the variational structure is not stacked to obtain extra expressiveness. ho2019flow++ use a variational scheme to improve upon the standard dequantisation method for deep generative modelling of images (theis2015note); this approach is orthogonal to ours and could indeed be used alongside LGFs. gritsenko2019on also generalise normalising flows using variational methods, but they incorporate the extra latent noise into the model in a much more restrictive way. das2019dimensionality learn only a low-dimensional prior over the noise space variationally.
4.3 Stacked Variational Methods
Finally, we can contrast our approach with purely variational methods which are not flow-based, but still involve some type of stacking architecture. The main difference between these methods and LGFs is that the bijections we use provide us with a generative model with far more structure, which allows us to build appropriate inference models more easily. Contrast this with, for example, rezende2014stochastic, in which the layers are independently inferred, or sonderby2016ladder, which requires a complicated parameter-sharing scheme to reliably perform inference. A single-layer instance of our model also shares some similarity with the fully-unsupervised case in maaloe2016auxiliary, but the generative process there conditions the auxiliary variable on the data.
5 Experiments
We evaluated the performance of LGFs on several problems of varying difficulty, including synthetic 2D data, several UCI datasets, and two image datasets. We describe the results of the UCI and image datasets here; Appendix C.2 contains the details of our 2D experiments.
In each case, we compared a baseline flow to its extension as an LGF model with roughly the same number of parameters. We obtained each LGF by inserting a mixing variable $u_\ell$ between every component layer of the baseline flow. For example, we inserted a $u_\ell$ after each autoregressive layer in MAF (papamakarios2017masked), and after each affine coupling layer in RealNVP (dinh2016density). In all cases we obtained the remaining networks of the model as described in Appendix C.1. We inserted batch normalisation (ioffe2015batch) between flow steps for the baselines, as suggested by dinh2016density, but omitted it for the LGFs, since our choice of bijection family is a generalisation of batch normalisation as described in Subsection 3.4.
We trained our models to maximise either the log-likelihood (for the baselines) or the ELBO (for the LGFs) using the ADAM optimiser (kingma2014adam) with default hyperparameters and no weight decay. We stopped training after 30 epochs of no validation improvement for the UCI experiments, and after 50 such epochs for the image experiments. Both validation and test performance were evaluated using the exact log-likelihood for the baselines, and the standard importance sampling estimator of the average log-likelihood (rezende2014stochastic, Appendix E) for LGFs. For validation we used 5 importance samples for the UCI datasets and 10 for the image datasets, while for testing we used 1000 importance samples in all cases.
Our code is available at https://github.com/jrmcornish/lgf.
5.1 UCI Datasets
We tested the performance of LGFs on the POWER, GAS, HEPMASS, and MINIBOONE datasets from the UCI repository (Bache+Lichman:2013). We preprocessed these datasets identically to papamakarios2017masked, and used the same train/validation/test splits. For all models we used a batch size of 1000 and a learning rate of . These constitute a factor of 10 increase over the values used by papamakarios2017masked, and were chosen to decrease training time. Our baseline results differ slightly from papamakarios2017masked, which may be a result of this change.
We focused on MAF as our baseline normalising flow because of its improved performance over alternatives such as RealNVP for general-purpose density estimation (papamakarios2017masked). A given MAF model is defined primarily by how many autoregressive layers it uses, as well as the sizes of the autoregressive networks at each layer. For an LGF-MAF, we must additionally define the neural networks used in the generative and inference processes, for which we use multilayer perceptrons (MLPs). We considered a variety of hyperparameter choices for MAF and LGF-MAF. Each instance of LGF-MAF had a corresponding MAF configuration, but in order to compensate for the parameters introduced by our additional neural networks, we also considered deeper and wider MAF models. For each LGF-MAF, the dimensionality of $u_\ell$ was roughly a quarter of that of the data. Full hyperparameter details are given in Appendix C.3. For three random seeds, we trained models using every hyperparameter configuration, and then chose the best-performing model across all configurations using validation performance. Table 1 shows the resulting test-set log-likelihoods averaged across the different seeds. It is clear that LGF-MAFs yield improved results in this case.
POWER  GAS  HEPMASS  MINIBOONE  

MAF  
LGF-MAF 
5.2 Image datasets
We also considered LGFs applied to the Fashion-MNIST (xiao2017fashion) and CIFAR-10 (krizhevsky2009learning) datasets. In both cases we applied the dequantisation scheme of theis2015note beforehand, and trained all models with a learning rate of and a batch size of 100.
We took our baseline to be a RealNVP with the same architecture used by dinh2016density for their CIFAR-10 experiments. In particular, we used 10 affine coupling layers with the corresponding alternating channelwise and checkerboard masks. Each coupling layer used a ResNet (he2016deep; he2016identity) consisting of residual blocks of channels (denoted for brevity). We also replicated their multi-scale architecture, squeezing the channel dimension after the first 3 coupling layers, and splitting off half the dimensions after the first 6. This model had 5.94M parameters for Fashion-MNIST and 6.01M parameters for CIFAR-10. For completeness, we also considered a RealNVP model with coupler networks of size to match those used below in LGF-RealNVP. This model had M and M parameters for Fashion-MNIST and CIFAR-10 respectively.
For the LGF-RealNVP, we sought to maintain roughly the same depth over which we propagate gradients as in the baseline. To this end, our coupling networks were ResNets of size , and each and used ResNets of size . Our network was a ResNet of size . We gave $u_\ell$ the same shape as a single channel of , and upsampled to the dimension of by adding channels at the output of the ResNet. Our model had 5.99M parameters for Fashion-MNIST and 6.07M parameters for CIFAR-10.
Fashion-MNIST  CIFAR-10  

RealNVP (4)  
RealNVP (8)  
LGF-RealNVP (4) 
Table 2 shows that LGFs consistently outperformed the baseline models. We moreover found that LGFs tend to train faster: the average epoch with best validation performance on CIFAR-10 was 458 for LGF-RealNVP, and 723 for RealNVP (8).^{5}^{5}5No RealNVP (4) run had converged after 1000 epochs, at which point we stopped training. However, by this point the rate of improvement had slowed significantly. Samples synthesised from all models can be found in Appendix C.4.
We also found that using the ELBO instead of the log-likelihood does not penalise our method, even in these high-dimensional settings. The gap between the estimated test-set log-likelihood and the average test-set ELBO was not large for the LGF models, with a relative error of for Fashion-MNIST and for CIFAR-10. Moreover, the importance-sampling-based log-likelihood estimator itself had very low variance when using 1000 importance samples. For each trained LGF model, we estimated the relative standard deviation using three separate samples of this estimator. We obtained an average over all trained models of for Fashion-MNIST, and for CIFAR-10.
6 Conclusion
In this paper, we have proposed localised generative flows for density estimation, which generalise existing normalising flow methods and address the limitations of these models in expressing complicated target distributions. Our method obtains successful empirical results on a variety of tasks. Many extensions appear possible, and we believe localised generative flows show promise as a means for improving the performance of density estimators in practice.
Acknowledgments
Rob Cornish is supported by the EPSRC Centre for Doctoral Training in Autonomous Intelligent Machines & Systems (EP/L015897/1) and NVIDIA. Anthony Caterini is a Commonwealth Scholar, funded by the U.K. Government. Arnaud Doucet is partially supported by the U.S. Army Research Laboratory, the U.S. Army Research Office, and by the U.K. Ministry of Defence (MoD) and the U.K. EPSRC under grant numbers EP/R013616/1 and EP/R034710/1.
References
Appendix A Correctness of Posterior Factorisation
We want to prove that the exact posterior factorises according to
(7) $p_{U_{1:L} \mid X}(u_{1:L} \mid x) = \prod_{\ell=1}^{L} p_{U_\ell \mid Z_\ell}(u_\ell \mid z_\ell),$
where $z_L = x$, and $z_{\ell-1} = f_{u_\ell}^{-1}(z_\ell)$ for $\ell = L, \ldots, 2$. It is convenient to introduce the index notation $u_{i:j} := (u_i, \ldots, u_j)$. First note that we can always write this posterior autoregressively as
(8) $p_{U_{1:L} \mid X}(u_{1:L} \mid x) = \prod_{\ell=1}^{L} p_{U_\ell \mid U_{\ell+1:L}, X}(u_\ell \mid u_{\ell+1:L}, x).$
Now, consider the graph of the full generative model shown in (a). It is clear that all paths from to any node in the set are blocked by , and hence that d-separates from (bishop2006pattern). Consequently,
Informally,^{6}^{6}6As with our arguments in Subsection 3.1, this could again be made precise using disintegrations (chang1997conditioning). we then have the following for any :
by the definition of , where we write to remove any ambiguities when composing functions. We can substitute this result into equation 8 to obtain equation 7.
Appendix B Discrete
While a discrete $U$ is appealing for its ability to produce exact likelihoods, it also suffers several disadvantages that we describe now. First, observe that the choice of the number of discrete values taken by $U$ has immediate implications for the number of disconnected components that the model can separate, which therefore seems to require making fairly concrete assumptions (perhaps implicitly) about the topology of the target of interest. To mitigate this, we might try taking the number of values to be very large, but then the time required to evaluate equation 4 (now a sum rather than an integral) necessarily increases in turn. This is particularly true when using a stacked architecture, since evaluating the density with $L$ layers each having $K$ possible values has complexity $O(K^L)$. dinh2019rad propose a model that partitions the space so that only one component in each summation is nonzero for any given input, which reduces this cost to $O(L)$. However, this partitioning means that their density is not continuous as a function of its input, which is reported to make the optimisation problem in equation 2 difficult.
Unlike for continuous $U$, the ELBO objective is also of limited effectiveness in the discrete case. Since the mixing distribution is discrete at each layer, we would need a discrete variational approximation to ensure a well-defined ELBO. However, the parameters of a discrete distribution are not amenable to the reparametrisation trick, and hence we would expect our gradient estimates of the ELBO to have high variance. As mentioned above, a compromise here might be to use the Concrete distribution (maddison2016concrete) to approximate a discrete $U$ while still applying variational methods. We leave exploring this for future work.
Appendix C Further Experimental Details
c.1 LGF architecture
In addition to the bijections from the baseline flow, an LGF model of the form we consider requires specifying $p_{U \mid Z}$, $q_{U \mid X}$, and the maps $\mu$ and $\sigma$ in equation 6. In all our experiments:
- the mean and standard deviation of $p_{U_\ell \mid Z_{\ell-1}}$ were two separate outputs of the same neural network;
- the mean and standard deviation of $q_{U_\ell \mid Z_\ell}$ were two separate outputs of the same neural network;
- $\mu$ and $\sigma$ were two separate outputs of the same neural network.
c.2 2D experiments
To gain intuition about our model, we ran several experiments on the simple 2D datasets shown in Figure 1 and Figure 4. Specifically, we compared the performance of a baseline MAF against an LGFMAF. We used MLPs for all networks involved in our model.
For the dataset shown in Figure 1, the baseline MAF had 10 autoregressive layers consisting of 4 hidden layers with 50 hidden units. The LGF-MAF had 5 autoregressive layers consisting of 2 hidden layers with 10 hidden units. Each $u_\ell$ was a single scalar. The mixing-variable network consisted of 2 hidden layers of 10 hidden units, and both the mean and standard deviation networks consisted of 4 hidden layers of 50 hidden units. In total the baseline MAF had 80080 parameters, while LGF-MAF had 80810 parameters. We trained both models for 300 epochs.
We used more parameters for the datasets shown in Figure 4, since these targets have more complicated topologies. In particular, the baseline MAF had 20 autoregressive layers, each with the same structure as before. The LGF-MAF had 5 autoregressive layers, now with 4 hidden layers of 50 hidden units. The other networks were the same as before, and each $u_\ell$ was still a scalar. In total the baseline MAF had 160160 parameters, while our model had 119910 parameters. We trained all models now for 500 epochs.
The results of these experiments are shown in Figure 1 and Figure 4. Observe that LGF-MAF consistently produces a more faithful representation of the target distribution than the baseline. A failure mode of our approach is exhibited on the spiral dataset, where our model still lacks the power to fully capture the topology of the target. However, we did not find it difficult to improve on this: by increasing the size of the mean/standard deviation networks to 8 hidden layers of 50 hidden units (and keeping all other parameters fixed), we were able to obtain the result shown in Figure 5. This model had a total of 221910 parameters. For the sake of a fair comparison, we also tried increasing the complexity of the MAF model by increasing the size of its autoregressive networks to 8 hidden layers of 50 hidden units (obtaining 364160 parameters in total). This model diverged after approximately 160 epochs. The result after 150 epochs is shown in Figure 5.
c.3 UCI Experiments
In Tables 3, 4, and 5, we list the choices of hyperparameters for MAF and LGF-MAF. In all cases, we allowed the base MAF to have more layers and deeper coupler networks to compensate for the additional parameters added by the extra components of our model. Note that neural networks are listed as size $h \times w$, where $h$ denotes the number of hidden layers and $w$ denotes the number of hidden units per layer. All combinations of parameters were considered; in each case, there were configurations for MAF and configurations for LGF-MAF.
Layers  Coupler size  dim  , size  size  

MAF  , ,  , ,  N/A  N/A  N/A 
LGF-MAF  , 
Layers  Coupler size  dim  , size  size  

MAF  , ,  , ,  N/A  N/A  N/A 
LGF-MAF  , 
Layers  Coupler size  dim  , size  size  

MAF  , ,  , ,  N/A  N/A  N/A 
LGF-MAF  , 