HINT: Hierarchical Invertible Neural Transport for Density Estimation and Bayesian Inference
A large proportion of recent invertible neural architectures is based on a coupling block design. It operates by dividing incoming variables into two sub-spaces, one of which parameterizes an easily invertible (usually affine) transformation that is applied to the other. While the Jacobian of such a transformation is triangular, it is very sparse and thus may lack expressiveness. This work presents a simple remedy by noting that (affine) coupling can be repeated recursively within the resulting sub-spaces, leading to an efficiently invertible block with dense triangular Jacobian. By formulating our recursive coupling scheme via a hierarchical architecture, HINT allows sampling from a joint distribution and the corresponding posterior using a single invertible network. We demonstrate the power of our method for density estimation and Bayesian inference on a novel data set of 2D shapes in Fourier parameterization, which enables consistent visualization of samples for different dimensionalities.
Invertible neural networks based on the normalizing flow principle have recently gained increasing attention for generative modeling tasks. Arguably the most popular group of such networks are those built on the coupling block design. Their success is due to a number of useful properties setting them apart from other generative approaches:
(a) they offer tractable likelihood computation for any sample or data point, (b) training via the maximum likelihood criterion is generally very stable, (c) their interpretable and easily accessible latent space opens up possibilities for e.g. style transfer and (d) the same trained model can be used for efficient data generation as well as efficient density estimation.
While autoregressive models can also be trained as normalizing flows and share the former two properties, they sacrifice efficient invertibility for expressive power and thus lose the latter two properties. In contrast, the expressive power of a single invertible block is a core limitation of invertible networks, leading to impractically deep models with dozen or hundreds of blocks, as is exemplified by the GLOW architecture of Kingma and Dhariwal (2018). While invertibility actually allows to backpropagate through very deep networks with minimal memory footprint Gomez et al. (2017), the search for more expressive invertible building blocks remains an important question.
The superior performance of autoregressive approaches such as Van den Oord et al. (2016a) stems from the fact that they model a larger range of interactions between variables, as reflected in the dense triangular Jacobian matrix of the mapping they represent. From the theory of transport maps we know that certain guarantees of universality exist for triangular maps. These do not hold for the standard coupling block design, which has a comparatively sparse Jacobian as shown in figure 1 (left).
In this work we propose an extension to the coupling block design that recursively fills out the previously unused portions of the Jacobian until we obtain a dense triangular map (figure 1, right), or any intermediate design if we choose to stop early. The advantages of the original coupling block architecture are retained in the process.
We furthermore show that the recursive structure of this mapping can be used to split the variables of interest into two subsets and , such that the transformation of is modeled conditionally on . This property can be exploited for a new sampling scheme, where a single normalizing flow model efficiently generates samples from both the joint distribution and the conditional of the factorized variables.
Finally, we introduce a new family of data sets based on 2d Fourier curve parameterizations. In the normalizing flow literature, there is an abundance of two-dimensional toy densities that provide an easy visual check for correctness of the model’s output. Beyond two dimensions however, it is challenging to visualize the distribution or even individual samples produced by a model. Image data sets solve this problem, but are quickly of high enough dimensionality that it becomes infeasible to evaluate statistical baselines like Approximate Bayesian Computation (ABC) to compare against.
A step towards visualizable data sets of intermediate size has been made in Kruse et al. (2019), but their four-dimensional problems are still two simple to demonstrate the advantages of the recursive coupling approach described above. To fill the gap, we describe a way to generate data sets of arbitrary dimensionality such that each data point parameterizes a closed curve in 2d space which is easy to visualize. With more data dimensions, i.e. more degrees of freedom, distributions of more detailed curves can be represented.
To summarize, the contributions of this paper to the field of normalizing flow research are as follows:
A simple, efficiently invertible flow module with dense lower triangular Jacobian
A hierarchical coupling architecture that models joint and conditional distributions simultaneously
A novel family of data sets allowing 2d visualization for a flexible number of dimensions
2 Related Work
Normalizing flows were popularized in the context of deep learning chiefly by the work of Rezende and Mohamed (2015) and Dinh et al. (2015). At present, a large variety of different architectures exist to realize normalizing flows in practice. A comprehensive overview of many of these, as well as general background on invertible neural networks and normalizing flows, can be found in Kobyzev et al. (2019) and in the simultaneous work of Papamakarios et al. (2019).
Many existing networks fall into one of two groups, namely coupling block architectures and autoregressive models. First additive and then affine coupling blocks were introduced by Dinh et al. (2015, 2017). Kingma and Dhariwal (2018), besides demonstrating the power of flow networks as generators, went on to generalize the permutation of variables between blocks by learning the corresponding matrices. There have been a number of works that focus on improving on the limiting nature of the affine transformation at the core of most coupling block networks. E.g. Durkan et al. (2019) replace affine couplings with more expressive monotonous splines, albeit at the cost of evaluation speed.
On the other hand, there is a rich body of research into autoregressive (flow) networks for various purposes, ranging from Van den Oord et al. (2016b, a) over Kingma et al. (2016) and Papamakarios et al. (2017) to Huang et al. (2018). More recently, Jaini et al. (2019) applied second-order polynomials to improve expressive power compared to typical autoregressive models, and proved that their model is a universal density approximator. While such models are known for excellent density estimation results compared to coupling architectures Ma et al. (2019); Liao et al. (2019), generating samples is often not a priority and can be prohibitively slow.
Outside of these two subfields, there are other approaches towards a favorable trade-off between expressive power and efficient invertibility.
Residual Flows (Behrmann et al., 2019; Chen et al., 2019) impose Lipschitz constraints on a standard residual block, which guarantees invertibility with a full Jacobian and enables approximate maximum likelihood training but requires an iterative procedure for sampling. Similarly, Song et al. (2019) use lower triangular weight matrices which can be inverted via fixed-point iteration. The normalizing flow principle is formulated continuously as a differential equation by Grathwohl et al. (2019), which allows free-form Jacobians but requires integrating an ODE for each network pass. Karami et al. (2019) introduce another method with dense Jacobian based on invertible convolutions performed in the Fourier domain.
In terms of modeling conditional densities with invertible neural networks, Ardizzone et al. (2019a) proposed an approach that divides the network output into conditioning variables and latent vector, training the flow part with an MMD Gretton et al. (2012) objective instead of maximum likelihood. Later Ardizzone et al. (2019b) introduced a simple conditional coupling block from which a conditional normalizing flow can be constructed.
For an input vector , the function defined by a standard invertible coupling block is
with and being the first and second half of the input vector and an easily invertible transform of conditioned on . Its inverse is then simply given as
In the case of an affine coupling block Dinh et al. (2017), takes the form with and unconstrained feed-forward networks. The logarithm of this block’s Jacobian determinant can be computed very efficiently as
To ensure that all entries of are transformed, and interact with each other in different combinations, they construct a pipeline that alternates between coupling blocks and random orthogonal matrices , where can trivially be inverted as and has Jacobian log-determinant .
3.1 Normalizing flows and transport maps
To create a normalizing flow, this pipeline is trained via maximum likelihood loss
to transport the data distribution to a standard normal latent distribution .
We can draw samples from via by drawing latent samples from and passing them through the inverse model: . Using the change-of-variables formula, we can also evaluate the density of a given data point as .
This procedure is founded on the theory of transport maps Villani (2008), which are employed in exactly the same way to push a reference density (e.g. Gaussian) to a target density (e.g. data distribution, Marzouk et al. (2016)).
The objective in equation 4 is in fact the KL divergence between the data distribution and the pull-back of the latent density , with the unknown fixed entropy of the data distribution:
Note also that each pair in is a composition of an orthogonal transformation and a triangular map, where the latter is better known in the field of transport maps as a Knothe-Rosenblatt rearrangement Marzouk et al. (2016). This factorization can be seen as a non-linear generalization of the classic QR decomposition Stoer and Bulirsch (2013). Whereas the triangular part encodes the possibility to represent non-linear transformations, the orthogonal part reshuffles variables to foster dependence of each part of the input to the final output, thereby drastically increasing the representational power of the map .
3.2 Bayesian inference with conditional flows
When a data set does not just represent a distribution of data points , but consists of tuples , the labels/features/observations don’t have a clear place in the normalizing flow framework described above. It is of course possible to just concatenate and into a single vector and let the transport model their joint distribution. But in many cases the paired data stems from a known (probabilistic) forward process and what we are interested in is the inverse process, in particular evaluating and sampling from the conditional density . This task is called Bayesian inference.
Ardizzone et al. (2019b) and Winkler et al. (2019) independently introduced conditional coupling blocks that allow an entire normalizing flow to be conditioned on external variables. By conditioning the transport between and on the corresponding values of as , its inverse can be used to turn the latent distribution into an approximation of the posterior . A similar idea can be found in Marzouk et al. (2016).
The maximum likelihood objective from equation 4 is readily adjusted to the conditional setting as
where the Jacobian is w.r.t. , i.e. . This is equivalent to minimizing the KL divergence between and the pull-back of a standard normal distribution through the inverse of :
We will now extend the basic coupling block architecture described above in two ways.
4.1 The recursive coupling block
As visualized in figure 1 (left), the Jacobian of a simple coupling block is very sparse, i.e. many possible interactions between variables are not modelled. However the efficient determinant computation in equation 3 works for any lower triangular , and indeed Theorem 1 of Hyvärinen and Pajunen (1999) states that a single triangular transformation can, in theory, already represent arbitrary distributions.
To make use of this potential and fill the empty areas below the diagonal in , we propose a recursive coupling scheme
where each sub-space is again split and transformed until we end up with single dimensions (or stop early). Note that each sub-coupling has its own – e.g. affine – coupling function with independent parameters. The inverse transform is
This procedure leads to the dense lower triangular Jacobian visualized in figure 1 (right), the log-determinant of which is simply the sum of the log-determinants of all sub-couplings . If is a power of two and hence the dimensions of all splits work out, becomes a full Knothe-Rosenblatt map.
Having defined the recursive coupling block, we find that the way we integrate it with the permutation matrix has a profound effect on the type of transport we get when compositing many blocks.
If we omit the permutations entirely, the composition of lower triangular maps gives us another dense lower triangular map (figure 2, left) and thus, in fact, an autoregressive model.
Placing a permutation matrix at the beginning of the lower branch in each recursive split (figure 2, middle) interestingly produces a conditional structure: The sub-transport performed by the lower branch at each level is informed by the upper branch, but has no influence on the latter itself. Because variables never get permuted between the outermost branches, we end up with a lower “lane” that performs a transport conditioned on intermediate representations from the upper lane. The Jacobian of in this case exhibits a block triangular structure.
Finally, if we simply apply a permutation over all variables outside of each recursive block (figure 2, right), we end up with a transport with full Jacobian and, presumably, the greatest expressive power among the options considered here.
4.2 Hierarchical Invertible Neural Transport
While the recursive coupling block defined above is motivated by the search for a more expressive architecture, we have seen that its structure also lends itself well to a hierarchical treatment where splits of the variables at different levels carry different meaning.
Specifically, in a setting with paired data , we can provide both variables as input to the flow and use the top level split to transform them separately at the next recursion level. To this end we take inspiration from figure 2 (middle) to design a hierarchical affine coupling block. Instead of relying on the recursive permutation structure implied there, however, we only apply a single permutation and to each respective lane at the beginning of the block. This preserves the hierarchical setup while letting variables within each lane interact more freely and saving on expensive tensor operations.
Figure 3 shows a hierarchical affine coupling block of this type, with one level of recursion for simplicity. A normalizing flow model constructed in this way performs hierarchical invertible neural transport, or HINT for short.
The output of a HINT model is a latent code with two components, , but the training objective stays the same as in equation 4:
As with a standard normalizing flow, the joint density of input variables is the pull-back of the latent density via :
But because the -lane in HINT can be evaluated independently of the -lane, we can determine the partial latent code for a given and hold it fixed, while drawing from the -part of the latent distribution. Doing so yields samples from the conditional density
This means HINT gives access to both the joint and the conditional density of the two variables and .
All experiments were carried out on an artificial data set of 2d shapes that allows visualizing samples regardless of the chosen data dimensionality. All flow models were trained on a single NVIDIA GeForce RTX 2080 Ti.
5.1 Fourier shapes data set
A curve parameterized by complex 2d Fourier coefficients can be traced as
with parameter running from to . This parameterization will always yield a closed, possibly self-intersecting curve McGarva and Mullineux (1993).
Vice-versa, given a sequence of 2d points we can calculate the Fourier coefficients of a curve approximating this sequence as
When we increase the limit , higher order terms are added to the parameterization in equation 11 and the shape can be approximated in greater detail. An example of this effect for a natural shape is shown in figure 4 (right). Note that the actual dimensionality of the parameterization in our data set is , as each complex 2d coefficient needs four real numbers to represent it.
In this paper we perform experiments on two specific Fourier shapes data sets. The first one uses , i.e. , and we choose the prior to be an arbitrary Gaussian mixture model with five components. Figure 5 (left) shows a large sample from this shape prior and four single representatives. Note that these shapes carry little geometric meaning, often self-intersect and are best understood as a way to visualize samples in two dimensions.
The second one uses , i.e. , to represent a distribution of simple shapes that are generated geometrically by crossing two bars of randomized length and width at a right angle. This results in a variation of X, L and T shapes, some of which are shown in figure 5 (right).
In both instances, our training set has samples and the test set has samples.
5.2 Forward process
Our forward operator in the Bayesian inference task returns three simple features of the 2d shape parameterized by . We determine the largest diameter as and the total width of the shape orthogonal to this direction as . The angle of the former is one feature, the aspect ratio another. The third feature is circularity, defined as .
For ambiguity we add some noise to these three values to obtain our vector .
5.3 Density estimation
On the Gaussian mixture data set, we trained a single-block and a two-block network for density estimation, once with standard coupling blocks and once with our recursive design. Each internal sub-network and consist of three fully connected layers. Sub-networks further down in the recursion use progressively fewer parameters, and all four normalizing flows are scaled to have a total of trainable parameters.
For these and all following experiments, we trained each network for epochs with Adam and a learning rate decaying exponentially from to . With a small amount of weight regularization and clamping gradients to , we observed very stable training for all networks.
Qualitative samples from these models are shown in figure 6 (left), with the left-most panel being an empirical estimate of the prior . Quantitative results can be found in table 1. We compare across several training runs per model in terms of maximum mean discrepancy Gretton et al. (2012) and bits per dimension.
The former measures the dissimilarity of two distributions using only samples from both. Following Ardizzone et al. (2019a), we use MMD with an inverse multi-quadratic kernel and average the results from batches from the data prior and from each trained model.
Bits per dimension are a representation of the log-likelihood of the test data under a model, calculated as
For both metrics, we see that a single standard coupling block performs very badly, which is unsurprising since it leaves half the variables untouched. A single recursive coupling block on the other hand is reasonably successful and achieves bits/dimension almost on par with a two standard coupling blocks. Out of the tested arrangements, a recursive two-block network performs the best.
We also trained networks with standard and recursive coupling blocks on the larger geometrical shape data set, where we afforded each network a total of parameters spread over four coupling blocks. Qualitative samples from these models are shown in figure 6 (right), with those from the normal coupling network in the top row and those from the recursive network in the bottom row.
While it is difficult to judge how well the entire distribution of shapes matches the prior (see figure 5, (right)), it is very clear that the network based on recursive coupling blocks produces individual samples that are more faithful to the data set.
5.4 Bayesian inference
For the Bayesian inference task on the Gaussian mixture data set, we trained a range of conditional flow models and HINT models with the following properties:
with 1 block and trainable parameters
with 2 blocks and trainable parameters
with 4 blocks and trainable parameters
with 8 blocks and trainable parameters
We train multiple instances of each model as described above and again evaluate in terms of MMD and bits per dimension.
In this case however the MMD comparison is not with samples from the prior , but with samples from an estimate of the posterior for values of obtained via the prior and forward process. We create these estimated ground truth posterior samples via quantitative approximate Bayesian computation as explained in Ardizzone et al. (2019a).
The bits per dimension measure also requires extra attention in this setting, since we have to compute the log-likelihood of only the -lane of our HINT models. Fortunately, this is easily done by excluding the contributions from the -lane when adding up the log-determinants of the Jacobians in each sub-coupling.
The quantitative results of these experiments are summarized in table 2, where we can see that the standard coupling model with eight blocks achieves the best score. As in the unconditional case, however, the recursive design vastly outperforms the standard one when only a single block is compared.
Qualitative results are presented in figure 7. Each individual sample has a red bounding box that displays its true aspect ratio and angle of greatest diameter, and a green box displaying the conditioning target . The jagged rings around the samples are chosen to have the exact circularity measured from the sample (red) or required by (green), respectively. Visually, we find both approaches to be on equal footing.
In figure 8, we also show qualitative results from models with four and eight blocks and parameters each trained on the geometrical shape data set. Adherence to the condition is visualized just like in figure 7.
As in the unconditional case, it is clear that the Fourier coefficients produced by the HINT models parameterize shapes that are much closer to those in the data set. Both conditional coupling block models have trouble recovering the rectangular nature of the shapes. While all four models stay true to the circularity they are conditioned on, they all struggle to place the greatest diameter of the shapes at the required angle. The HINT models appear slightly more faithful to the aspect ratio even when the rotation is off.
5.5 Proof of Concept: Automatic Posterior Transformation using HINT
In the following, we use HINT to implement the Automatic Posterior Transformation (APT) algorithm proposed by Greenberg et al. (2019), and compare the implementation to a standard coupling block INN. In short, given some observations, APT estimates the posterior of unobserved variables. The posterior is then used as a prior for the next step, and subsequently updated by newly arriving observations. This can be repeated many times. To prevent errors from accumulating over many timesteps, the density models have to be very accurate, making this an interesting application for HINT.
As a task, we use the general predator-prey model, also termed competitive Lotka-Volterra equations, which typically describes the interaction of species in a biological system over time:
The undisturbed growth rate of species is given by (can be or , growing or shrinking naturally), and is further affected by the other species in a positive way (, predator), or in a negative way (, prey). The solutions to this system of equations can not be expressed analytically, which makes their prediction a challenging task. Additionally, we make the process stochastic, by adding random noise at each time step. Therefore, at each time step, noisy measurements of are observed, with the task to predict the remaining population , given the current and past observations. The exact parameters of the data model are found in appendix section C.
We compare this to an implementation using a standard coupling block INN with the training procedure proposed by Ardizzone et al. (2019a). The results for a single example time-series are shown in figure 9. We find that the standard INN does not make meaningful predictions in the short- or mid-term, only correctly predicting the final dying off of population . HINT on the other hand is able to correctly model the entire sequence. Importantly, the modeled posterior distribution at each step correctly contracts over time, as more measurements are accumulated. Experimental details are found in appendix section C.
In this work, we presented HINT, a new architecture with the primary use as a normalizing flow. It improves on the traditional coupling block design in terms of expressive power by making the Jacobian densly triangular. Hereby, HINT keeps the advantages of interpretable latent space and equally efficient sampling and density estimation, which are typically not present in other models with densely triangular Jacobians, specifically autoregressive flows. To evaluate the model, we present a new data set based on Fourier decompositions of 2D shapes, which can be easily visualized while still posing a reasonable level of challenge to compare different methods. In terms of future improvements, we expect that our formulation can be made even more computationally efficient through the use of masking operations enabling more advanced parallelization.
J. Kruse was supported by Informatics for Life funded by the Klaus Tschira Foundation. G. Detommaso was supported by the EPSRC Centre for Doctoral Training in Statistical Applied Mathematics at Bath (EP/L015684/1).
A Details on Gaussian mixture data set
The prior for our smaller data set instance is a Gaussian mixture model with five components in dimensions. The parameters and for this distribution can be generated with the following Python snippet:
B Details on geometrical shape data set
The shapes for the larger data set instance are generated by taking the union of two oblong rectangles crossing each other at a right angle. For both rectangles, the longer side length is drawn uniformly from and the other from . We shift both rectangles along their longer side by a uniformly random amount from .
Then we form the union and insert equally spaced points along the resulting polygon’s sides such that no line segment is longer than . This is necessary to obtain point sequences which are approximated more faithfully by Fourier curves.
Finally, we center the shape at the origin, rotate it by a random angle and shift it in either direction by a distance drawn from .
Python code for creating both data set variants can be found under https://github.com/VLL-HD/HINT.
C Details on extra experiment in 5. 5.
Parameters and for the 4d competitive Lotka-Volterra model in equation 13 were picked randomly as
|and the true initial population values set to|
Figure 10 shows the resulting development of the four populations over ten unit time steps, including noise on the observed values for , and .
Both the INN and the HINT network we trained consist of ten (non-recursive) coupling blocks with a total of trainable weights. We used Adam to train both for epochs per time step with training samples and a batch size of . The learning rate started at and exponentially decayed to (HINT) and (INN), respectively. Inference on begins with an initial guess at drawn from .
- Analyzing inverse problems with invertible neural networks. In Intl. Conf. on Learning Representations, Cited by: §2, §5.3, §5.4, §5.5.
- Guided image generation with conditional invertible neural networks. arXiv:1907.02392. External Links: Cited by: §2, §3.2.
- Invertible residual networks. In International Conference on Machine Learning, External Links: Cited by: §2.
- Residual flows for invertible generative modeling. In Advances in Neural Information Processing Systems 32, pp. 9913–9923. Cited by: §2.
- NICE: non-linear independent components estimation. In International Conference on Learning Representations, External Links: Cited by: §2, §2.
- Density estimation using real NVP. In International Conference on Learning Representations, External Links: Cited by: §2, §3.
- Neural spline flows. In Advances in Neural Information Processing Systems, pp. 7509–7520. Cited by: §2.
- The reversible residual network: backpropagation without storing activations. In Advances in neural information processing systems, pp. 2214–2224. Cited by: §1.
- Scalable reversible generative models with free-form continuous dynamics. In International Conference on Learning Representations, External Links: Cited by: §2.
- Automatic posterior transformation for likelihood-free inference. arXiv:1905.07488. External Links: Cited by: §5.5.
- A kernel two-sample test. Journal of Machine Learning Research 13 (Mar), pp. 723–773. Cited by: §2, §5.3, Table 1.
- Neural autoregressive flows. In International Conference on Machine Learning, pp. 2078–2087. Cited by: §2.
- Nonlinear independent component analysis: existence and uniqueness results. Neural Networks 12 (3), pp. 429–439. Cited by: §4.1.
- Sum-of-squares polynomial flow. In International Conference on Machine Learning, pp. 3009–3018. Cited by: §2.
- Invertible convolutional flow. In Advances in Neural Information Processing Systems, pp. 5636–5646. Cited by: §2.
- Glow: generative flow with invertible 1x1 convolutions. arXiv:1807.03039. Cited by: §1, §2.
- Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pp. 4743–4751. Cited by: §2.
- Normalizing flows: an introduction and review of current methods. arXiv:1908.09257. Cited by: §2.
- Benchmarking invertible architectures on inverse problems. Workshop on Invertible Neural Networks and Normalizing Flows, International Conference on Machine Learning. Cited by: §1.
- Generative model with dynamic linear flow. IEEE Access 7, pp. 150175–150183. External Links: Cited by: §2.
- MaCow: masked convolutional generative flow. Advances in Neural Information Processing Systems 32, pp. 5891–5900. External Links: Cited by: §2.
- Sampling via measure transport: an introduction. Handbook of Uncertainty Quantification, pp. 1–41. Cited by: §3.1, §3.1, §3.2.
- Harmonic representation of closed curves. Applied Mathematical Modelling 17 (4), pp. 213 – 218. External Links: Cited by: §5.1.
- Normalizing flows for probabilistic modeling and inference. arXiv:1912.02762. External Links: Cited by: §2.
- Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems, pp. 2338–2347. Cited by: §2.
- Variational inference with normalizing flows. In International Conference on Machine Learning, pp. 1530–1538. Cited by: §2.
- MintNet: building invertible neural networks with masked convolutions. In Advances in Neural Information Processing Systems, pp. 11002–11012. Cited by: §2.
- Introduction to numerical analysis. Vol. 12, Springer Science & Business Media. Cited by: §3.1.
- Conditional image generation with PixelCNN decoders. In Advances in neural information processing systems, pp. 4790–4798. Cited by: §1, §2.
- Pixel recurrent neural networks. In International Conference on Machine Learning, pp. 1747–1756. Cited by: §2.
- Optimal transport: old and new. Vol. 338, Springer Science & Business Media. Cited by: §3.1.
- Learning likelihoods with conditional normalizing flows. arXiv preprint arXiv:1912.00042. Cited by: §3.2.