Implicit Maximum Likelihood Estimation
Abstract
Implicit probabilistic models are models defined naturally in terms of a sampling procedure; they often induce a likelihood function that cannot be expressed explicitly. We develop a simple method for estimating parameters in implicit models that does not require knowledge of the form of the likelihood function or any derived quantities, but can be shown to be equivalent to maximizing likelihood under some conditions. Our result holds in the non-asymptotic parametric setting, where both the capacity of the model and the number of data examples are finite. We also demonstrate encouraging experimental results.
Ke Li Jitendra Malik Department of Electrical Engineering and Computer Sciences University of California, Berkeley Berkeley, CA 94720 United States {ke.li,malik}@eecs.berkeley.edu
This paper was submitted to and rejected from NIPS 2018. We make available the reviews, the rebuttal and the metareview at https://people.eecs.berkeley.edu/~ke.li/papers/imle_reviews.pdf, in the interest of promoting discussion. Comments are most welcome; please contact the authors at ke.li@eecs.berkeley.edu and malik@eecs.berkeley.edu.
1 Introduction
Generative modelling is a cornerstone of machine learning and has received increasing attention. Recent models like variational autoencoders (VAEs) (Kingma & Welling, 2013; Rezende et al., 2014) and generative adversarial nets (GANs) (Goodfellow et al., 2014; Gutmann et al., 2014) have delivered impressive advances in performance and generated a lot of excitement.
Generative models can be classified into two categories: prescribed models and implicit models (Diggle & Gratton, 1984; Mohamed & Lakshminarayanan, 2016). Prescribed models are defined by an explicit specification of the density, and so their unnormalized complete likelihood can usually be expressed in closed form. Examples include models whose complete likelihoods lie in the exponential family, such as mixtures of Gaussians (Everitt, 1985), hidden Markov models (Baum & Petrie, 1966) and Boltzmann machines (Hinton & Sejnowski, 1986). Because computing the normalization constant, also known as the partition function, is generally intractable, sampling from these models is challenging.
On the other hand, implicit models are defined most naturally in terms of a (simple) sampling procedure. Most models take the form of a deterministic parameterized transformation T_θ(·) of an analytic distribution, like an isotropic Gaussian N(0, I). This can be naturally viewed as the distribution induced by the following sampling procedure:

1. Sample z ~ N(0, I)

2. Return x := T_θ(z)
The transformation T_θ often takes the form of a highly expressive function approximator, like a neural net. Examples include generative adversarial nets (GANs) (Goodfellow et al., 2014; Gutmann et al., 2014) and generative moment matching nets (GMMNs) (Li et al., 2015; Dziugaite et al., 2015). The marginal likelihood of such models can be characterized as follows:

p_θ(x) = (∂^d / ∂x_1 ⋯ ∂x_d) ∫_{z : T_θ(z) ≤ x} p_z(z) dz,

where p_z denotes the probability density function (PDF) of z, d is the dimensionality of x, and ≤ is applied elementwise.
In general, attempting to reduce this to a closed-form expression is hopeless. Evaluating it numerically is also challenging, since the domain of integration could consist of an exponential number of disjoint regions, and numerical differentiation is ill-conditioned.
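In code, an implicit model of this kind is nothing more than a sampler: noise drawn from the analytic distribution is pushed through a deterministic transformation. The following is a minimal sketch; the two-layer transformation and its random weights are hypothetical stand-ins for a trained neural net T_θ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical fixed weights standing in for a trained transformation T_theta.
W1 = rng.standard_normal((8, 2))   # hidden layer: 2-d latent -> 8 units
W2 = rng.standard_normal((3, 8))   # output layer: 8 units -> 3-d observation

def sample(n):
    """Sample from the implicit model: z ~ N(0, I), return x := T_theta(z)."""
    z = rng.standard_normal((n, 2))   # step 1: sample latent noise
    h = np.tanh(z @ W1.T)             # deterministic transformation ...
    return h @ W2.T                   # step 2: return x = T_theta(z)

x = sample(1000)   # 1000 i.i.d. draws; the induced density of x has no closed form
```

Drawing samples is trivial; it is only the density of `x` that is intractable, which is the crux of the estimation problem discussed next.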
These two categories of generative models are not mutually exclusive. Some models admit both an explicit specification of the density and a simple sampling procedure and so can be considered as both prescribed and implicit. Examples include variational autoencoders (Kingma & Welling, 2013; Rezende et al., 2014), their predecessors (MacKay, 1995; Bishop et al., 1998) and extensions (Burda et al., 2015), and directed/autoregressive models, e.g., (Neal, 1992; Bengio & Bengio, 2000; Larochelle & Murray, 2011; van den Oord et al., 2016).
1.1 Challenges in Parameter Estimation
Maximum likelihood (Fisher, 1912; Edgeworth, 1908) is perhaps the standard method for estimating the parameters of a probabilistic model from observations. The maximum likelihood estimator (MLE) has a number of appealing properties: under mild regularity conditions, it is asymptotically consistent, efficient and normal. A long-standing challenge in training probabilistic models is the computational roadblock of maximizing the log-likelihood function directly.
For prescribed models, maximizing likelihood directly requires computing the partition function, which is intractable for all but the simplest models. Many powerful techniques have been developed to attack this problem, including variational methods (Jordan et al., 1999), contrastive divergence (Hinton, 2002; Welling & Hinton, 2002), score matching (Hyvärinen, 2005) and pseudolikelihood maximization (Besag, 1975), among others.
For implicit models, the situation is even worse, as no term in the log-likelihood function is available in closed form; evaluating any term requires computing an intractable integral. As a result, maximizing likelihood in this setting seems hopelessly difficult. A variety of likelihood-free solutions have been proposed that in effect minimize a divergence measure between the data distribution and the model distribution. They come in two forms: those that minimize an f-divergence, and those that minimize an integral probability metric (Müller, 1997). In the former category are GANs, which are based on the idea of minimizing the distinguishability between data and samples (Tu, 2007; Gutmann & Hyvärinen, 2010). It has been shown that, when given access to an infinitely powerful discriminator, the original GAN objective minimizes the Jensen-Shannon divergence, the − log D variant of the objective minimizes the reverse KL-divergence minus a bounded quantity (Arjovsky & Bottou, 2017), and later extensions (Nowozin et al., 2016) minimize arbitrary f-divergences. In the latter category are GMMNs, which use maximum mean discrepancy (MMD) (Gretton et al., 2007) as the metric.
In the case of GANs, despite the theoretical results, a number of challenges arise in practice, such as mode dropping/collapse (Goodfellow et al., 2014; Arora & Zhang, 2017), vanishing gradients (Arjovsky & Bottou, 2017; Sinn & Rawat, 2017) and training instability (Goodfellow et al., 2014; Arora et al., 2017). A number of explanations for these phenomena have been proposed, which point out that many theoretical results rely on three assumptions: the discriminator must have infinite modelling capacity (Goodfellow et al., 2014; Arora et al., 2017), the number of samples from the true data distribution must be infinite (Arora et al., 2017; Sinn & Rawat, 2017), and the gradient ascent-descent procedure (Arrow et al., 1958; Schmidhuber, 1992) must converge to a global pure-strategy Nash equilibrium (Goodfellow et al., 2014; Arora et al., 2017).
When some of these assumptions do not hold, the theoretical guarantees do not necessarily apply, and the pathologies commonly observed in practice emerge as a natural consequence. For example, Arora et al. (2017) observed that when the number of parameters in the discriminator is polynomial in the number of dimensions, it cannot in general detect mode dropping/collapse, and so the generator at equilibrium could have finite support and drop an exponential number of modes, which is consistent with the empirical findings of Arora & Zhang (2017) and Wu et al. (2016). When only a finite number of data examples is available, Sinn & Rawat (2017) showed that even under the assumptions of an infinite-capacity discriminator and a perfect optimization algorithm, there is no guarantee that the generator will recover the true data distribution at optimality, and that the gradient w.r.t. the generator parameters is zero almost everywhere at the optimal choice of the discriminator. Arora et al. (2017) showed that minimizing the Jensen-Shannon divergence or the Wasserstein distance between the empirical data distribution and the model distribution does not necessarily minimize the same between the true data distribution and the model distribution. Moreover, when the discriminator and generator have finite capacity, Arora et al. (2017) pointed out that a global pure-strategy Nash equilibrium may not exist, in which case training may be unstable and fail to converge. Because a global pure-strategy Nash equilibrium is not guaranteed to exist in the parametric setting, finding it is NP-hard (Gilboa & Zemel, 1989), even in the case of two-player zero-sum games, e.g., GANs. Therefore, it is necessary to settle for finding a local pure-strategy Nash equilibrium. In general, however, gradient ascent-descent is only guaranteed to find a stationary point, which can only be a local Nash equilibrium if the Hessian of the objective (i.e., the payoff to the discriminator) is positive semidefinite w.r.t. the generator parameters and negative semidefinite w.r.t. the discriminator parameters at the stationary point (Mescheder et al., 2017). Unfortunately, a stationary point is not necessarily desirable and could correspond to a parameter setting where the discriminator is perfect and the gradient vanishes, which is not uncommon in practice because the supports of the data distribution and the model distribution are often nearly disjoint (Arjovsky & Bottou, 2017). Even when the objective is concave in the discriminator parameters and convex in the generator parameters, convergence could be slow if the Hessian has an eigenvalue whose imaginary part is larger in magnitude than its real part (Mescheder et al., 2017). A number of ways have been proposed to alleviate some of these issues, e.g., (Zhao et al., 2016; Salimans et al., 2016; Donahue et al., 2016; Dumoulin et al., 2016; Arjovsky et al., 2017; Hjelm et al., 2017; Li et al., 2017; Zhu et al., 2017), but a way of solving all three issues simultaneously remains elusive.
1.2 Our Contribution
In this paper, we present an alternative method for estimating parameters in implicit models. Like the methods above, our method is likelihood-free, but it can be shown to be equivalent to maximizing likelihood under some conditions. Our result holds when the capacity of the model is finite and the number of data examples is finite. The idea behind the method is simple: it finds the nearest sample to each data example and optimizes the model parameters to pull the sample towards it. The direction in which nearest neighbour search is performed is important: the proposed method ensures that each data example has a similar sample. This contrasts with the alternative of pushing each sample towards its nearest data example, which would only ensure that each sample has a similar data example, and would therefore permit all samples to be similar to a single data example. Such a scenario is heavily penalized by the proposed approach.
The proposed method could sidestep the three issues mentioned above: mode collapse, vanishing gradients and training instability. Modes are not dropped because the loss ensures that each data example has a sample nearby at optimality; gradients do not vanish because the gradient of the distance between a data example and its nearest sample does not become zero unless they coincide; training is stable because the estimator is the solution to a simple minimization problem. By leveraging recent advances in fast nearest neighbour search algorithms (Li & Malik, 2016, 2017), this approach is able to scale to large, high-dimensional datasets.
1.3 Organization
This paper is organized into the following sections:

Implicit Maximum Likelihood Estimator: We define the proposed estimator and describe the algorithm.

Why Maximum Likelihood: We justify why maximum likelihood is a sensible objective for the purposes of learning generative models.

Analysis: We show the equivalence of the proposed estimator to maximum likelihood, under some conditions.

Experiments: We describe the experimental settings and present results on standard datasets.

Discussion: We discuss and address possible concerns about the proposed method.

Conclusion: We summarize the method and the contributions.

Appendix: We present the proofs of the theoretical results presented in the main paper.
2 Implicit Maximum Likelihood Estimator
2.1 Definition
We are given a set of n data examples x_1, …, x_n and some unknown parameterized probability distribution P_θ with density p_θ. We also have access to an oracle that allows us to draw independent and identically distributed (i.i.d.) samples from P_θ.
Let x̃_1^θ, …, x̃_m^θ be i.i.d. samples from P_θ, where m ≥ n. For each data example x_i, we define a random variable R_i^θ to be the distance between x_i and the nearest sample. More precisely,

R_i^θ := min_{j ∈ [m]} ‖x̃_j^θ − x_i‖_2,

where [m] denotes {1, …, m}.

The implicit maximum likelihood estimator is defined as:

θ̂_IMLE := arg min_θ E_{x̃_1^θ, …, x̃_m^θ} [ Σ_{i=1}^n R_i^θ ].
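The quantity inside the expectation can be evaluated directly from a batch of data examples and model samples via brute-force nearest-neighbour search. The following is a minimal numpy sketch; the data and samples here are random placeholders standing in for real data and draws from a model.

```python
import numpy as np

def imle_objective(data, samples):
    """Sum over data examples of the Euclidean distance to the nearest sample.

    data:    (n, d) array of data examples x_1, ..., x_n
    samples: (m, d) array of model samples, with m >= n
    """
    # Pairwise distances of shape (n, m); brute force is fine for small m.
    diff = data[:, None, :] - samples[None, :, :]
    dists = np.linalg.norm(diff, axis=-1)
    r = dists.min(axis=1)   # R_i: distance from x_i to its nearest sample
    return float(r.sum())

rng = np.random.default_rng(0)
data = rng.standard_normal((5, 2))
samples = rng.standard_normal((20, 2))
loss = imle_objective(data, samples)
```

The estimator averages this quantity over fresh sample draws; each data example is matched independently, so one sample may serve as the nearest neighbour of several data examples.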
2.2 Algorithm
We outline the proposed parameter estimation procedure in Algorithm 1. In each outer iteration, we draw m i.i.d. samples from the current model. We then randomly select a batch of examples from the dataset and find the nearest sample to each data example. Finally, we run a standard iterative optimization algorithm, like stochastic gradient descent (SGD), to minimize a sample-based version of the Implicit Maximum Likelihood Estimator (IMLE) objective.
Because our algorithm needs to solve a nearest neighbour search problem in each outer iteration, the scalability of our method depends on our ability to find the nearest neighbours quickly. This was traditionally considered to be a hard problem, especially in high dimensions. However, this is no longer the case, due to recent advances in nearest neighbour search algorithms (Li & Malik, 2016, 2017), which avoid the curse of dimensionality in time complexity that often arises in nearest neighbour search.
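The outer-loop structure described above can be sketched end to end on a toy problem. The sketch below substitutes a linear generator for the neural net so that the gradient of the inner objective is available in closed form; the dataset, sizes, learning rate, number of iterations and the use of squared distance in the inner step are all illustrative choices, not the paper's exact settings.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.standard_normal((64, 2)) * 0.5 + 3.0   # toy dataset centred near (3, 3)

# A linear generator x = W z + b stands in for the neural net; its gradients
# are closed form, so a single gradient step replaces the SGD inner loop.
W = rng.standard_normal((2, 4)) * 0.1
b = np.zeros(2)
lr, m = 0.05, 256

losses = []
for it in range(200):                                     # outer iterations
    z = rng.standard_normal((m, 4))
    samples = z @ W.T + b                                 # m i.i.d. samples from current model
    d = np.linalg.norm(data[:, None] - samples[None], axis=-1)
    j = d.argmin(axis=1)                                  # nearest sample for each data example
    resid = samples[j] - data                             # matched sample minus data example
    losses.append(float((resid ** 2).sum()))
    # Gradient step on sum_i ||samples[j(i)] - x_i||^2, holding the matching fixed.
    W -= lr * (2.0 / len(data)) * resid.T @ z[j]
    b -= lr * 2.0 * resid.mean(axis=0)
```

Because each data example pulls its matched sample towards itself, the loss decreases steadily; this is the mechanism behind the stability claim in Section 1.2.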
Note that the use of Euclidean distance is not a major limitation of the proposed approach. A variety of distance metrics are either exactly or approximately equivalent to Euclidean distance in some nonlinear embedding space, in which case the theoretical guarantees are inherited from the Euclidean case. This encompasses popular distance metrics used in the literature, like the Euclidean distance between the activations of a neural net, which is often referred to as a perceptual similarity metric (Salimans et al., 2016; Dosovitskiy & Brox, 2016). The approach can be easily extended to use these metrics, though because this is the initial paper on this method, we focus on the vanilla setting of Euclidean distance in the natural representation of the data, e.g., pixels, both for simplicity/clarity and for comparability to vanilla versions of other methods that do not use auxiliary sources of labelled data or leverage domain-specific prior knowledge. For distance metrics that cannot be embedded in Euclidean space, the analysis can be easily adapted with minor modifications as long as the volume of a ball under the metric has a simple dependence on its radius.
3 Why Maximum Likelihood
There has been debate (Huszár, 2015) over whether maximizing the likelihood of the data is the appropriate objective for the purposes of learning generative models. Recall that maximizing likelihood is equivalent to minimizing D_KL(p_data ‖ p_θ), where p_data denotes the empirical data distribution and p_θ denotes the model distribution. One proposed alternative is to minimize the reverse KL-divergence, D_KL(p_θ ‖ p_data), which is suggested (Huszár, 2015) to be better because it severely penalizes the model for generating an implausible sample, whereas the standard KL-divergence, D_KL(p_data ‖ p_θ), severely penalizes the model for assigning low density to a data example. As a result, when the model is underspecified, i.e., has less capacity than what is necessary to fit all the modes of the data distribution, minimizing D_KL(p_θ ‖ p_data) leads to a narrow model distribution that concentrates around a few modes, whereas minimizing D_KL(p_data ‖ p_θ) leads to a broad model distribution that hedges between modes. The success of GANs in generating good samples is often attributed to the former phenomenon (Arjovsky & Bottou, 2017).
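The contrast between the two divergence directions can be made concrete with an underspecified model: fitting a single Gaussian to a bimodal target by grid search under each divergence. The following is a minimal sketch; the target mixture, the grids and the discretization are arbitrary illustrative choices.

```python
import numpy as np

xs = np.linspace(-6, 6, 601)
dx = xs[1] - xs[0]

def normal(mu, sigma):
    """Discretized Gaussian density on the grid xs."""
    p = np.exp(-0.5 * ((xs - mu) / sigma) ** 2)
    return p / (p.sum() * dx)

# Bimodal "data" distribution: two well-separated narrow modes at -2 and +2.
p = 0.5 * normal(-2.0, 0.3) + 0.5 * normal(2.0, 0.3)

def kl(a, b):
    """Discretized KL(a || b); b is clipped to avoid division by underflowed zeros."""
    b = np.maximum(b, 1e-300)
    mask = a > 0
    return float(np.sum(a[mask] * np.log(a[mask] / b[mask])) * dx)

# Fit a single (underspecified) Gaussian by grid search under each divergence.
mus = np.linspace(-3, 3, 61)
sigmas = np.linspace(0.2, 3.0, 57)
best_fwd = min((kl(p, normal(m, s)), m, s) for m in mus for s in sigmas)
best_rev = min((kl(normal(m, s), p), m, s) for m in mus for s in sigmas)
```

In this toy example, the forward-KL (maximum likelihood) fit hedges between the modes (mean near 0, large variance), while the reverse-KL fit concentrates on a single mode, which is exactly the narrow-versus-broad behaviour described above.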
This argument, however, relies on the assumption that we have access to an infinite number of samples from the true data distribution. In practice, this assumption rarely holds: if we had access to the true data distribution, there would usually be no need to fit a generative model, since we could simply draw samples from the true data distribution. What happens when we only have the empirical data distribution? Recall that D_KL(p_θ ‖ p_data) is defined and finite only if p_θ is absolutely continuous w.r.t. p_data, i.e., p_data(x) = 0 implies p_θ(x) = 0 for all x. In other words, D_KL(p_θ ‖ p_data) is defined and finite only if the support of p_θ is contained in the support of p_data. Now, consider the difference between minimizing D_KL(p_data ‖ p_θ) and minimizing D_KL(p_θ ‖ p_data): the former, which is equivalent to maximizing likelihood, ensures that the support of the model distribution contains all data examples, whereas the latter ensures that the support of the model distribution is contained in the support of the empirical data distribution, which is just the set of data examples. In other words, maximum likelihood disallows mode dropping, whereas minimizing reverse KL-divergence forces the model to assign zero density to unseen data examples and effectively prohibits generalization. Furthermore, maximum likelihood discourages the model from assigning low density to any data example, since doing so would make the likelihood, which is the product of the densities at each of the data examples, small.
From a modelling perspective, because maximum likelihood is guaranteed to preserve all modes, it can make use of all available training data and can therefore be used to train high-capacity models with a large number of parameters. In contrast, using an objective that permits mode dropping allows the model to pick and choose which data examples it wants to model. As a result, if the goal is to train a high-capacity model that can learn the underlying data distribution, we would not be able to do so using such an objective, because we have no control over which modes the model chooses to drop. Put another way, we can think about the model's performance along two axes: its ability to generate plausible samples (precision) and its ability to generate all modes of the data distribution (recall). A model that successfully learns the underlying distribution should score high along both axes. If mode dropping is allowed, then an improvement in precision may be achieved at the expense of lower recall and could represent a move to a different point on the same precision-recall curve. As a result, since sample quality is an indicator of precision, an improvement in sample quality in this setting may not mean an improvement in density estimation performance. On the other hand, if mode dropping is disallowed, since full recall is always guaranteed, an improvement in precision is achieved without sacrificing recall and so implies an upwards shift of the precision-recall curve. In this case, an improvement in sample quality does signify an improvement in density estimation performance, which may explain why sample quality was historically an important way to evaluate the performance of generative models, most of which maximized likelihood. With the advent of generative models that permit mode dropping, however, sample quality is no longer a reliable indicator of density estimation performance, since good sample quality can be trivially achieved by dropping all but a few modes.
In this setting, sample quality can be misleading, since a model with low recall on a lower precision-recall curve can achieve better precision than a model with high recall on a higher precision-recall curve. Since it is hard to distinguish whether an improvement in sample quality is due to a move along the same precision-recall curve or a real shift in the curve, an objective that disallows mode dropping is a critical tool for developing better models: researchers can then be sure that an apparent improvement in sample quality is due to a shift in the precision-recall curve.
4 Analysis
Before formally stating the theoretical results, we first illustrate the intuition behind why the proposed estimator is equivalent to the maximum likelihood estimator under some conditions. For simplicity, we consider the special case where we have a single data example x and a single sample x̃ drawn from the model. Consider the total density of the model inside a ball of radius r centred at x, as a function of r; denote this function Ψ(r). If the density in the neighbourhood of x is high, then Ψ(r) grows rapidly as r increases. If, on the other hand, the density in the neighbourhood of x is low, then Ψ(r) grows slowly. So, maximizing likelihood is equivalent to making Ψ(r) grow as fast as possible. To this end, we can maximize the area under the function Ψ(·), or equivalently, minimize the area under the function 1 − Ψ(·). Observe that Ψ(r) can be interpreted as the cumulative distribution function (CDF) of the Euclidean distance between x and x̃, which is a random variable because x̃ is random; denote this random variable D̃. Because D̃ is nonnegative, recall that E[D̃] = ∫_0^∞ (1 − Ψ(r)) dr, which is exactly the area under the function 1 − Ψ(·). Therefore, we can maximize the likelihood of a data example by minimizing E[D̃], or in other words, minimizing the expected distance between the data example and a random sample. To extend this analysis to the case with multiple data examples, we show in the appendix that if the objective function is a summation, applying a monotonic transformation to each term and then reweighting appropriately preserves the optimizer under some conditions.
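The identity invoked here for a nonnegative random variable, namely that its expectation equals the area under its survival function 1 − F(r), can be checked numerically. The following is a minimal Monte Carlo sketch; the fixed point and the sampling distribution are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# D: Euclidean distance from a fixed point x0 to a random 2-d Gaussian sample.
x0 = np.array([1.0, 2.0])
d = np.linalg.norm(rng.standard_normal((200_000, 2)) - x0, axis=1)

mean_direct = float(d.mean())

# Area under the empirical survival function 1 - F_D(r) on a fine grid.
rs = np.linspace(0.0, d.max(), 2001)
surv = 1.0 - np.searchsorted(np.sort(d), rs) / len(d)   # empirical 1 - F_D(r)
mean_area = float(np.sum(0.5 * (surv[1:] + surv[:-1]) * np.diff(rs)))
```

The two estimates agree up to discretization error, which is the fact that lets the argument replace "maximize the likelihood" with "minimize the expected nearest-sample distance".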
We now state the key theoretical result formally. Please refer to the appendix for the proof.
Theorem 1.
Consider a set of observations , a parameterized family of distributions with probability density function (PDF) and a unique maximum likelihood solution . For any , let be i.i.d. random variables and define , and . Let be the cumulative distribution function (CDF) of and .
If satisfies the following:

is differentiable w.r.t. and continuous w.r.t. everywhere.

, there exists such that .

For any , there exists such that and .

such that , , where denotes the ball centred at of radius .

is differentiable everywhere.

For all , if , there exists such that .
Then,
Furthermore, if , then,
Now, we examine the restrictiveness of each condition. The first condition is satisfied by nearly all analytic distributions. The second condition is satisfied by nearly all distributions that have an unrestricted location parameter, since one can simply shift the location parameter appropriately. The third condition is satisfied by most distributions that have location and scale parameters, like a Gaussian distribution, since the scale can be made arbitrarily low and the location can be shifted so that the constraint is satisfied. The fourth condition is satisfied by nearly all distributions whose density eventually tends to zero as the distance from the optimal parameter setting tends to infinity. The fifth condition requires the objective to change smoothly as the parameters change. The final condition requires two vectors, one of which can be chosen from a set of vectors, to be not exactly orthogonal. As a result, this condition is usually satisfied when the number of parameters is large, i.e., when the model is richly parameterized.
There is one remaining difficulty in applying this theorem: the quantity that appears as a coefficient on each term in the proposed objective is typically not known. If we consider a new objective that ignores the coefficients, then minimizing this objective is equivalent to minimizing an upper bound on the ideal objective. The tightness of this bound depends on the difference between the highest and lowest likelihoods assigned to individual data points at the optimum, i.e., under the maximum likelihood estimate of the parameters. Such a model should not assign high likelihoods to some points and low likelihoods to others as long as it has reasonable capacity, since doing so would make the overall likelihood, which is the product of the likelihoods of individual data points, low. Therefore, the upper bound is usually reasonably tight.
5 Experiments
We trained generative models using the proposed method on three standard benchmark datasets: MNIST, the Toronto Faces Dataset (TFD) and CIFAR-10. All models take the form of feedforward neural nets with isotropic Gaussian noise as input.
For MNIST, the architecture consists of two fully connected hidden layers with 1200 units each followed by a fully connected output layer with 784 units. ReLU activations were used for hidden layers and sigmoids were used for the output layer. For TFD, the architecture is wider and consists of two fully connected hidden layers with 8000 units each followed by a fully connected output layer with 2304 units. For both MNIST and TFD, the dimensionality of the noise vector is 100.
For CIFAR-10, we used a simple convolutional architecture with 1000-dimensional Gaussian noise as input. The architecture consists of five convolutional layers with 512 output channels and a kernel size of 5, all of which preserve the resolution of their input feature maps, followed by a bilinear upsampling layer that doubles the width and height of the feature maps. Each convolutional layer is followed by a batch normalization layer and leaky ReLU activations. This design is then repeated for each subsequent level of resolution, so that we have 20 convolutional layers in total, each with 512 output channels. We then add a final output layer with three output channels on top, followed by sigmoid activations. We note that this architecture has more capacity than typical architectures used in other methods, like (Radford et al., 2015). This is because our method aims to capture all modes of the data distribution and therefore needs more modelling capacity than methods that are permitted to drop modes.
Evaluation of implicit generative models in general remains an open problem. Various intrinsic and extrinsic evaluation metrics have been proposed, all of which have limitations. Extrinsic evaluation metrics measure performance indirectly via a downstream task (Salimans et al., 2016). Unfortunately, dependence on the downstream task could introduce bias and may not capture desirable properties of the generative model that do not affect performance on the task. Intrinsic evaluation metrics measure performance without relying on external models or data. Popular examples include estimated log-likelihood (Bengio et al., 2014; Wu et al., 2016) and visual assessment of sample quality. While recent literature has focused more on the latter and less on the former, it should be noted that they evaluate different properties: sample quality reflects precision, i.e., how accurate the model samples are compared to the ground truth, whereas estimated log-likelihood focuses on recall, i.e., how much of the diversity in the data distribution the model captures. Consequently, both are important metrics; one is not a replacement for the other. As pointed out by Theis et al. (2015), "qualitative as well as quantitative analyses based on model samples can be misleading about a model's density estimation performance, as well as the probabilistic model's performance in applications other than image synthesis." Two models that achieve different levels of precision may simply be at different points on the same precision-recall curve, and therefore may not be directly comparable. Models that achieve the same level of recall, on the other hand, may be directly compared. So, for methods that maximize likelihood, which are guaranteed to preserve all modes and achieve full recall, both sample quality and estimated log-likelihood capture precision.
Because most generative models traditionally maximized likelihood or a lower bound on the likelihood, the only property that differed across models was precision, which may explain why sample quality has historically been seen as an important indicator of performance. However, in heterogeneous experimental settings with different models optimized for various objectives, sample quality does not necessarily reflect how well a model learns the underlying data distribution. Therefore, under these settings, both precision and recall need to be measured. While there is not yet a reliable way to measure recall (given the known issues of estimated log-likelihoods in high dimensions), this does not mean that sample quality is a valid substitute for estimated log-likelihoods, as it cannot detect a lack of diversity in samples. A secondary issue that is more easily solvable is that samples presented in papers are sometimes cherry-picked; as a result, they capture the maximum sample quality, but not necessarily the mean sample quality.
Method  MNIST  TFD 

DBN (Bengio et al., 2013)  
SCAE (Bengio et al., 2013)  
DGSN (Bengio et al., 2014)  
GAN (Goodfellow et al., 2014)  
GMMN (Li et al., 2015)  
IMLE (Proposed) 
To mitigate these problems to some extent, we avoid cherry-picking and visualize randomly chosen samples, which are shown in Figure 1. We also report the estimated log-likelihood in Table 1. As mentioned above, both evaluation criteria have biases/deficiencies, so performing well on either of these metrics does not necessarily indicate good density estimation performance. However, not performing badly on either metric can provide some comfort that the model is simultaneously able to achieve reasonable precision and recall.
As shown in Figure 1, despite its simplicity, the proposed method is able to generate reasonably good samples for MNIST, TFD and CIFAR-10. While it is commonly believed that minimizing reverse KL-divergence is necessary to produce good samples and that maximizing likelihood necessarily leads to poor samples (Grover et al., 2017), the results suggest that this is not necessarily the case. Even though Euclidean distance was used in the objective, the samples do not appear to be desaturated or overly blurry. Samples also seem fairly diverse. This is supported by the estimated log-likelihood results in Table 1: the model's high scores on that metric on both MNIST and TFD suggest that it did not suffer from significant mode dropping.
In Figure 4, we show samples and their nearest neighbours in the training set. Each sample is quite different from its nearest neighbour in the training set, suggesting that the model has not overfitted to examples in the training set. If anything, it seems to be somewhat underfitted, and further increasing the capacity of the model may improve sample quality.
Next, we visualize the learned manifold by walking along a geodesic on the manifold between pairs of samples. More concretely, we generate five samples, arrange them in arbitrary order, perform linear interpolation in latent variable space between adjacent pairs of samples, and generate an image from each interpolated latent variable. As shown in Figure 3, the images along the path of interpolation appear visually plausible and do not have noisy artifacts. In addition, the transition from one image to the next appears smooth, including for CIFAR-10, which contrasts with findings in the literature that suggest the transition between two natural images tends to be abrupt. This indicates that the support of the model distribution has not collapsed to a set of isolated points and that the proposed method is able to learn the geometry of the data manifold, even though it does not learn a distance metric explicitly.
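The interpolation procedure itself is straightforward. In the sketch below, a single-layer map with random weights stands in for the trained generator; only the interpolation logic is the point, and the latent and output dimensionalities follow the MNIST setup described earlier.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((100, 784)) * 0.05   # placeholder for trained generator weights

def generator(z):
    """Stand-in for the trained transformation; sigmoid outputs, as in the MNIST net."""
    return 1.0 / (1.0 + np.exp(-(z @ W)))

# Two latent draws and seven evenly spaced points on the segment between them.
z0, z1 = rng.standard_normal(100), rng.standard_normal(100)
ts = np.linspace(0.0, 1.0, 7)
path = np.stack([generator((1 - t) * z0 + t * z1) for t in ts])
```

Each row of `path` is one image along the interpolation; the endpoints reproduce the two original samples exactly.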
Finally, we illustrate the evolution of samples as training progresses in Figure 2. As shown, the samples are initially blurry and become sharper over time. Importantly, sample quality consistently improves over time, which demonstrates the stability of training.
While our sample quality may not yet be state-of-the-art, it is important to remember that these results are obtained under the setting of full recall. So, this does not necessarily mean that our method models the underlying data distribution less accurately than other methods that achieve better sample quality, as some of them may drop modes and therefore achieve less than full recall. As previously mentioned, this does not suggest a fundamental trade-off between precision and recall that cannot be overcome; on the contrary, our method provides researchers with a way of designing models that can improve the precision-recall curve without needing to worry that the observed improvements are due to a movement along the curve. With refinements to the model, it is possible to move the curve upwards and obtain better sample quality at any level of recall as a consequence. This is left for future work; as this is the initial paper on this approach, its value stems from the foundation it lays for a new research direction upon which subsequent work can be built, as opposed to the current results themselves. For this paper, we made a deliberate decision to keep the model simple, since non-essential, practically motivated enhancements are less grounded in theory, may obfuscate the key underlying idea and could impart the impression that they are critical to making the approach work in practice. The fact that our method is able to generate more plausible samples on CIFAR-10 than other methods at similar stages of development, such as the initial versions of GAN (Goodfellow et al., 2014) and PixelRNN (van den Oord et al., 2016), despite the minimal sophistication of our method and architecture, shows the promise of the approach. Later iterations of other methods incorporate additional supervision in the form of pretrained weights and/or make task-specific modifications to the architecture and training procedure, which were critical to achieving state-of-the-art sample quality.
We do believe the question of how the architecture should be refined in the context of our method to take advantage of task-specific insights is an important one, and is an area ripe for future exploration.
6 Discussion
In this section, we consider and address some possible concerns about our method.
6.1 Does Maximizing Likelihood Necessarily Lead to Poor Sample Quality?
It has been suggested (Huszár, 2015) that maximizing likelihood leads to poor sample quality because, when the model is underspecified, it will try to cover all modes of the empirical data distribution and therefore assign high density to regions with few data examples. There is also empirical evidence (Grover et al., 2017) for a negative correlation between sample quality and log-likelihood, suggesting an inherent tradeoff between maximizing likelihood and achieving good sample quality. A popular solution is to minimize the reverse KL divergence instead, which trades off recall for precision. This is an imperfect solution, as the ultimate goal is to model all the modes and generate high-quality samples.
Note that this apparent tradeoff exists only because the model capacity is assumed to be fixed. We argue that a more promising approach is to increase the capacity of the model, so that it is less underspecified. As the model capacity increases, avoiding mode dropping becomes more important, because otherwise there will not be enough training data to fit the larger number of parameters. This is precisely the setting in which maximum likelihood is appropriate. As a result, it is possible that a combination of increased model capacity and maximum likelihood training can achieve good precision and recall simultaneously.
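The asymmetry between the two KL directions can be made concrete with a small numerical sketch (the densities and numbers below are hypothetical illustrations, not from our experiments): fitting a single Gaussian to a bimodal density, the forward KL favors a wide, mode-covering fit, while the reverse KL favors a narrow, mode-dropping fit.

```python
import numpy as np

x = np.linspace(-10, 10, 4001)
dx = x[1] - x[0]

def gauss(x, mu, sig):
    return np.exp(-0.5 * ((x - mu) / sig) ** 2) / (sig * np.sqrt(2 * np.pi))

# Hypothetical bimodal "data" density and two single-Gaussian fits.
p = 0.5 * gauss(x, -3.0, 0.5) + 0.5 * gauss(x, 3.0, 0.5)
q_cover = gauss(x, 0.0, 3.1)  # wide: covers both modes, mass in the valley
q_drop = gauss(x, 3.0, 0.5)   # narrow: drops one mode entirely

def kl(a, b):
    # KL(a || b) by numerical integration on the grid; clip to avoid log(0).
    a, b = np.clip(a, 1e-300, None), np.clip(b, 1e-300, None)
    return float(np.sum(a * np.log(a / b)) * dx)

# Forward KL (the maximum-likelihood direction) strongly penalizes the
# mode-dropping fit; reverse KL prefers it, trading recall for precision.
print(kl(p, q_cover), kl(p, q_drop))  # covering fit wins under forward KL
print(kl(q_cover, p), kl(q_drop, p))  # dropping fit wins under reverse KL
```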
6.2 Would Minimizing Distance to the Nearest Samples Cause Overfitting?
When the model has infinite capacity, minimizing distance from data examples to their nearest samples will lead to a model distribution that memorizes data examples. The same is true if we maximize likelihood. Likewise, minimizing any divergence measure will lead to memorization of data examples, since the minimum divergence is zero and by definition, this can only happen if the model distribution is the same as the empirical data distribution, whose support is confined to the set of data examples. This implies that whenever we have a finite number of data examples, any method that learns a model with infinite capacity will memorize the data examples and will hence overfit.
To get around this, most methods learn a parametric model with finite capacity. In the parametric setting, the minimum divergence is not necessarily zero; the same is true for the minimum distance from data examples to their nearest samples. Therefore, the optimum of these objective functions is not necessarily a model distribution that memorizes data examples, and so overfitting will not necessarily occur.
6.3 Does Disjoint Support Break Maximum Likelihood?
Arjovsky et al. (2017) observe that the data distribution and the model distribution are supported on low-dimensional manifolds and so are unlikely to have a non-negligible intersection. They point out that the KL divergence would be infinite in this case, or equivalently, that the likelihood would be zero. While this does not invalidate the theoretical soundness of maximum likelihood, since the maximum of a nonnegative function that is zero almost everywhere is still well-defined, it does cause serious practical issues for gradient-based learning, as the gradient is zero almost everywhere. This is believed to be one reason that models like variational autoencoders (Kingma & Welling, 2013; Rezende et al., 2014) use a Gaussian distribution with high variance for the conditional likelihood/observation model, rather than a distribution close to the Dirac delta, so that the support of the model distribution is broadened to cover all the data examples (Arjovsky et al., 2017).
This issue does not affect our method, as our loss function is different from the log-likelihood function, even though their optima coincide (under some conditions). As a result, the gradients of our loss function differ from those of the log-likelihood. When the supports of the data distribution and the model distribution do not overlap, each data example is likely far away from its nearest sample, and so the gradient is large. Moreover, the farther the data examples are from the samples, the larger the gradient becomes. Therefore, even when the gradient of the log-likelihood can be tractably computed, there may be situations where the proposed method works better than maximizing likelihood directly.
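The contrast can be illustrated with a one-dimensional toy example (a hypothetical setup, not from our experiments): when a uniform model's support misses the data example, finite differences on the log-likelihood yield no usable signal, while the distance from the data example to its nearest model sample still produces a large, informative gradient.

```python
import numpy as np

# Toy setting: the model is Uniform[theta, theta + 1] and the data
# example x0 lies outside the model's support, so its likelihood is zero.
x0 = 5.0

def log_likelihood(theta):
    # Log-density of Uniform[theta, theta + 1] at x0 (log 1 = 0 inside
    # the support, -inf outside).
    return 0.0 if theta <= x0 <= theta + 1 else float("-inf")

def numerical_grad(f, theta, eps=1e-4):
    return (f(theta + eps) - f(theta - eps)) / (2 * eps)

# Away from the support the log-likelihood is -inf everywhere, so finite
# differences yield no usable signal (nan).
g_ll = numerical_grad(log_likelihood, 0.0)

# A nearest-sample distance loss instead measures squared distance from the
# data example to its nearest model sample; its gradient grows with the gap.
def imle_loss(theta, n_samples=64, seed=0):
    rng = np.random.default_rng(seed)
    samples = theta + rng.random(n_samples)  # draws from Uniform[theta, theta + 1]
    nearest = samples[np.argmin(np.abs(samples - x0))]
    return (x0 - nearest) ** 2

g_imle = numerical_grad(imle_loss, 0.0)
print(g_ll, g_imle)  # nan vs. a large negative gradient (move theta toward x0)
```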
7 Conclusion
We presented a simple and versatile method for parameter estimation when the form of the likelihood is unknown. The method works by drawing samples from the model, finding the nearest sample to every data example and adjusting the parameters of the model so that each nearest sample is pulled closer to its corresponding data example. We showed that performing this procedure is equivalent to maximizing likelihood under some conditions. The proposed method can capture the full diversity of the data and avoids common issues like mode collapse, vanishing gradients and training instability. Combined with vanilla model architectures, the method achieves encouraging results on MNIST, TFD and CIFAR-10.
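The procedure just described can be sketched in a few lines; the following is a minimal illustrative sketch with a hypothetical linear generator and plain gradient descent, not the architecture or hyperparameters used in our experiments:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-D data and a toy linear "generator": sample = W z + b, z ~ N(0, I).
data = rng.normal(size=(64, 2)) @ np.array([[2.0, 0.3], [0.3, 0.5]]) + 3.0
W, b = rng.normal(size=(2, 2)), np.zeros(2)

lr, n_samples = 0.02, 128
for step in range(600):
    z = rng.normal(size=(n_samples, 2))
    samples = z @ W.T + b
    # Find the nearest model sample for every data example ...
    d2 = ((data[:, None, :] - samples[None, :, :]) ** 2).sum(-1)
    nn = d2.argmin(axis=1)
    # ... and descend the mean squared distance between each data
    # example and its nearest sample (the matching is held fixed).
    diff = samples[nn] - data                                    # (64, 2)
    grad_b = 2 * diff.mean(axis=0)
    grad_W = 2 * (diff[:, :, None] * z[nn][:, None, :]).mean(axis=0)
    W -= lr * grad_W
    b -= lr * grad_b

# After training, fresh samples should lie close to the data.
final = rng.normal(size=(512, 2)) @ W.T + b
d2_final = ((data[:, None, :] - final[None, :, :]) ** 2).sum(-1).min(axis=1).mean()
print(d2_final)  # mean squared nearest-sample distance is now small
```

Note that each gradient step treats the data-to-sample matching as fixed; the matching is recomputed from fresh samples at every iteration.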
Acknowledgements.
This work was supported by ONR MURI N000141410671. Ke Li thanks the Natural Sciences and Engineering Research Council of Canada (NSERC) for fellowship support. The authors also thank Shubham Tulsiani, Andrew Owens, Kevin Swersky, Rich Zemel, Alyosha Efros, Weicheng Kuo, Deepak Pathak, Christian Häne, Allan Jabri, Trevor Darrell, Peter Bartlett, Bin Yu and the anonymous reviewers for feedback.
References
 Arjovsky & Bottou (2017) Arjovsky, Martin and Bottou, Léon. Towards principled methods for training generative adversarial networks. arXiv preprint arXiv:1701.04862, 2017.
 Arjovsky et al. (2017) Arjovsky, Martin, Chintala, Soumith, and Bottou, Léon. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pp. 214–223, 2017.
 Arora & Zhang (2017) Arora, Sanjeev and Zhang, Yi. Do GANs actually learn the distribution? An empirical study. arXiv preprint arXiv:1706.08224, 2017.
 Arora et al. (2017) Arora, Sanjeev, Ge, Rong, Liang, Yingyu, Ma, Tengyu, and Zhang, Yi. Generalization and equilibrium in generative adversarial nets (GANs). arXiv preprint arXiv:1703.00573, 2017.
 Arrow et al. (1958) Arrow, Kenneth Joseph, Hurwicz, Leonid, Uzawa, Hirofumi, and Chenery, Hollis Burnley. Studies in linear and nonlinear programming. 1958.
 Baum & Petrie (1966) Baum, Leonard E and Petrie, Ted. Statistical inference for probabilistic functions of finite state Markov chains. The Annals of Mathematical Statistics, 37(6):1554–1563, 1966.
 Bengio & Bengio (2000) Bengio, Yoshua and Bengio, Samy. Modeling high-dimensional discrete data with multi-layer neural networks. In Advances in Neural Information Processing Systems, pp. 400–406, 2000.
 Bengio et al. (2013) Bengio, Yoshua, Mesnil, Grégoire, Dauphin, Yann, and Rifai, Salah. Better mixing via deep representations. In ICML (1), pp. 552–560, 2013.
 Bengio et al. (2014) Bengio, Yoshua, Laufer, Eric, Alain, Guillaume, and Yosinski, Jason. Deep generative stochastic networks trainable by backprop. In International Conference on Machine Learning, pp. 226–234, 2014.
 Besag (1975) Besag, Julian. Statistical analysis of non-lattice data. The Statistician, pp. 179–195, 1975.
 Bishop et al. (1998) Bishop, Christopher M, Svensén, Markus, and Williams, Christopher KI. GTM: The generative topographic mapping. Neural computation, 10(1):215–234, 1998.
 Burda et al. (2015) Burda, Yuri, Grosse, Roger, and Salakhutdinov, Ruslan. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015.
 Diggle & Gratton (1984) Diggle, Peter J and Gratton, Richard J. Monte Carlo methods of inference for implicit statistical models. Journal of the Royal Statistical Society. Series B (Methodological), pp. 193–227, 1984.
 Donahue et al. (2016) Donahue, Jeff, Krähenbühl, Philipp, and Darrell, Trevor. Adversarial feature learning. arXiv preprint arXiv:1605.09782, 2016.
 Dosovitskiy & Brox (2016) Dosovitskiy, Alexey and Brox, Thomas. Generating images with perceptual similarity metrics based on deep networks. In Advances in Neural Information Processing Systems, pp. 658–666, 2016.
 Dumoulin et al. (2016) Dumoulin, Vincent, Belghazi, Ishmael, Poole, Ben, Lamb, Alex, Arjovsky, Martin, Mastropietro, Olivier, and Courville, Aaron. Adversarially learned inference. arXiv preprint arXiv:1606.00704, 2016.
 Dziugaite et al. (2015) Dziugaite, Gintare Karolina, Roy, Daniel M, and Ghahramani, Zoubin. Training generative neural networks via maximum mean discrepancy optimization. arXiv preprint arXiv:1505.03906, 2015.
 Edgeworth (1908) Edgeworth, Francis Ysidro. On the probable errors of frequencyconstants. Journal of the Royal Statistical Society, 71(2):381–397, 1908.
 Everitt (1985) Everitt, Brian S. Mixture Distributions—I. Wiley Online Library, 1985.
 Fisher (1912) Fisher, Ronald A. On an absolute criterion for fitting frequency curves. Messenger of Mathematics, 41:155–160, 1912.
 Gilboa & Zemel (1989) Gilboa, Itzhak and Zemel, Eitan. Nash and correlated equilibria: Some complexity considerations. Games and Economic Behavior, 1(1):80–93, 1989.
 Goodfellow et al. (2014) Goodfellow, Ian, Pouget-Abadie, Jean, Mirza, Mehdi, Xu, Bing, Warde-Farley, David, Ozair, Sherjil, Courville, Aaron, and Bengio, Yoshua. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.
 Gretton et al. (2007) Gretton, Arthur, Borgwardt, Karsten M, Rasch, Malte, Schölkopf, Bernhard, and Smola, Alex J. A kernel method for the two-sample problem. In Advances in Neural Information Processing Systems, pp. 513–520, 2007.
 Grover et al. (2017) Grover, Aditya, Dhar, Manik, and Ermon, Stefano. Flow-GAN: Bridging implicit and prescribed learning in generative models. arXiv preprint arXiv:1705.08868, 2017.
 Gutmann & Hyvärinen (2010) Gutmann, Michael and Hyvärinen, Aapo. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In AISTATS, volume 1, pp. 6, 2010.
 Gutmann et al. (2014) Gutmann, Michael U, Dutta, Ritabrata, Kaski, Samuel, and Corander, Jukka. Likelihood-free inference via classification. arXiv preprint arXiv:1407.4981, 2014.
 Hinton (2002) Hinton, Geoffrey E. Training products of experts by minimizing contrastive divergence. Neural computation, 14(8):1771–1800, 2002.
 Hinton & Sejnowski (1986) Hinton, Geoffrey E and Sejnowski, Terrence J. Learning and relearning in Boltzmann machines. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, 1(282317):2, 1986.
 Hjelm et al. (2017) Hjelm, R Devon, Jacob, Athul Paul, Che, Tong, Cho, Kyunghyun, and Bengio, Yoshua. Boundary-seeking generative adversarial networks. arXiv preprint arXiv:1702.08431, 2017.
 Huszár (2015) Huszár, Ferenc. How (not) to train your generative model: Scheduled sampling, likelihood, adversary? arXiv preprint arXiv:1511.05101, 2015.
 Hyvärinen (2005) Hyvärinen, Aapo. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(Apr):695–709, 2005.
 Jordan et al. (1999) Jordan, Michael I, Ghahramani, Zoubin, Jaakkola, Tommi S, and Saul, Lawrence K. An introduction to variational methods for graphical models. Machine learning, 37(2):183–233, 1999.
 Kingma & Welling (2013) Kingma, Diederik P and Welling, Max. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
 Larochelle & Murray (2011) Larochelle, Hugo and Murray, Iain. The neural autoregressive distribution estimator. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 29–37, 2011.
 Li & Malik (2016) Li, Ke and Malik, Jitendra. Fast k-nearest neighbour search via Dynamic Continuous Indexing. In International Conference on Machine Learning, pp. 671–679, 2016.
 Li & Malik (2017) Li, Ke and Malik, Jitendra. Fast k-nearest neighbour search via Prioritized DCI. In International Conference on Machine Learning, pp. 2081–2090, 2017.
 Li et al. (2015) Li, Yujia, Swersky, Kevin, and Zemel, Rich. Generative moment matching networks. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pp. 1718–1727, 2015.
 Li et al. (2017) Li, Yujia, Schwing, Alexander, Wang, KuanChieh, and Zemel, Richard. Dualing GANs. In Advances in Neural Information Processing Systems, pp. 5611–5621, 2017.
 MacKay (1995) MacKay, David JC. Bayesian neural networks and density networks. Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment, 354(1):73–80, 1995.
 Mescheder et al. (2017) Mescheder, Lars, Nowozin, Sebastian, and Geiger, Andreas. The numerics of GANs. In Advances in Neural Information Processing Systems, pp. 1823–1833, 2017.
 Mohamed & Lakshminarayanan (2016) Mohamed, Shakir and Lakshminarayanan, Balaji. Learning in implicit generative models. arXiv preprint arXiv:1610.03483, 2016.
 Müller (1997) Müller, Alfred. Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 29(2):429–443, 1997.
 Neal (1992) Neal, Radford M. Connectionist learning of belief networks. Artificial intelligence, 56(1):71–113, 1992.
 Nowozin et al. (2016) Nowozin, Sebastian, Cseke, Botond, and Tomioka, Ryota. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, pp. 271–279, 2016.
 Radford et al. (2015) Radford, Alec, Metz, Luke, and Chintala, Soumith. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
 Rezende et al. (2014) Rezende, Danilo Jimenez, Mohamed, Shakir, and Wierstra, Daan. Stochastic backpropagation and variational inference in deep latent gaussian models. In International Conference on Machine Learning, 2014.
 Salimans et al. (2016) Salimans, Tim, Goodfellow, Ian, Zaremba, Wojciech, Cheung, Vicki, Radford, Alec, and Chen, Xi. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pp. 2234–2242, 2016.
 Schmidhuber (1992) Schmidhuber, Jürgen. Learning factorial codes by predictability minimization. Neural Computation, 4(6):863–879, 1992.
 Sinn & Rawat (2017) Sinn, Mathieu and Rawat, Ambrish. Non-parametric estimation of Jensen-Shannon divergence in generative adversarial network training. arXiv preprint arXiv:1705.09199, 2017.
 Theis et al. (2015) Theis, Lucas, Oord, Aäron van den, and Bethge, Matthias. A note on the evaluation of generative models. arXiv preprint arXiv:1511.01844, 2015.
 Tu (2007) Tu, Zhuowen. Learning generative models via discriminative approaches. In Computer Vision and Pattern Recognition, 2007. CVPR’07. IEEE Conference on, pp. 1–8. IEEE, 2007.
 van den Oord et al. (2016) van den Oord, Aaron, Kalchbrenner, Nal, and Kavukcuoglu, Koray. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016.
 Welling & Hinton (2002) Welling, Max and Hinton, Geoffrey. A new learning algorithm for mean field Boltzmann machines. Artificial Neural Networks—ICANN 2002, pp. 82–82, 2002.
 Wu et al. (2016) Wu, Yuhuai, Burda, Yuri, Salakhutdinov, Ruslan, and Grosse, Roger. On the quantitative analysis of decoderbased generative models. arXiv preprint arXiv:1611.04273, 2016.
 Zhao et al. (2016) Zhao, Junbo, Mathieu, Michael, and LeCun, Yann. Energy-based generative adversarial network. arXiv preprint arXiv:1609.03126, 2016.
 Zhu et al. (2017) Zhu, Jun-Yan, Park, Taesung, Isola, Phillip, and Efros, Alexei A. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint arXiv:1703.10593, 2017.
Appendix A
Before proving the main result, we first prove the following intermediate results:
Lemma 1.
Let and . For , let be differentiable on and be differentiable on and strictly increasing. Assume exists and is unique. Let and . If the following conditions hold:

There is a bounded set such that , and , where denotes the boundary of .

For all , if , there exists such that .
Then exists and is unique. Furthermore, .
Proof.
Let be the bounded set such that , and . Consider the closure of , denoted as . Because and , . Since is bounded, is bounded. Because and is closed and bounded, it is compact.
Consider the function . By the differentiability of ’s and , is differentiable on and hence continuous on . By the compactness of and the continuity of on , Extreme Value Theorem applies, which implies that exists. Let be such that .
By definition of , , implying that since is strictly increasing. Because , and so . At the same time, since , by definition of , . Combining these two facts yields . Since the inequality is strict, this implies that , and so .
In addition, because is the minimizer of on , . So, . Hence, is a minimizer of on , and so exists. Because is differentiable on , must be a critical point of on .
On the other hand, since is differentiable on and for all , exists for all . So,
At ,
Since each is differentiable on , is differentiable on . Combining this with the fact that is the minimizer of on , it follows that . Hence, and so is a critical point of .
Because , if , such that , for any . Therefore, is the only critical point of on . Since is a critical point on , we can conclude that , and so is a minimizer of on . Since any other minimizer must be a critical point and is the only critical point, is the unique minimizer. So, . ∎
Lemma 2.
Let be a distribution on whose density is continuous at a point and be a random variable. Let , , where denotes the gamma function (the constant is the ratio of the volume of a dimensional ball of radius to a dimensional cube of side length ), and . Let denote the cumulative distribution function (CDF) of and denote the one-sided derivative of from the right. Then, .
Proof.
By definition of ,
If we define , the above can be rewritten as:
We want to show that . In other words, we want to show such that , .
Let be arbitrary.
Since is continuous at , by definition, such that , . Let be such that , . We choose .
Let be arbitrary. Since ,
Observe that is the volume of a dimensional ball of radius , so . Thus, , implying that . By similar reasoning, we conclude that .
Hence,
Therefore,
∎
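As a quick numerical check of the constant in Lemma 2, assuming the standard closed form for the volume of a unit d-dimensional ball, π^(d/2)/Γ(d/2 + 1), which matches the description above of a ball-to-cube volume ratio:

```python
import numpy as np
from math import gamma, pi

def c_d(d):
    # Closed form: volume of the unit d-ball divided by the unit d-cube,
    # i.e. pi^(d/2) / Gamma(d/2 + 1).
    return pi ** (d / 2) / gamma(d / 2 + 1)

def c_d_monte_carlo(d, n=200_000, seed=0):
    # Estimate the same ratio by sampling the cube [-1, 1]^d: the fraction
    # of points falling in the unit ball, times the cube's volume 2^d.
    rng = np.random.default_rng(seed)
    pts = rng.uniform(-1.0, 1.0, size=(n, d))
    inside = (pts ** 2).sum(axis=1) <= 1.0
    return inside.mean() * 2 ** d

for d in (1, 2, 3):
    print(d, c_d(d), c_d_monte_carlo(d))  # the two estimates agree closely
```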
Lemma 3.
Let be a parameterized family of distributions on with parameter and probability density function (PDF) that is continuous at a point . Consider a random variable and define , whose cumulative distribution function (CDF) is denoted by . Assume has the following property: for any , there exists such that and . For any , let be i.i.d. random variables and define . Then the function is strictly decreasing.
Proof.
Let be a random variable and let be the CDF of . Since is nonnegative,
Also, by Lemma 2, . Using these facts, we can rewrite as . By definition of , exists for all . Let be a value of that attains the minimum. Define . By definition, , where denotes the one-sided partial derivative from the right w.r.t. . Also, since is the CDF of a distribution of a nonnegative random variable, .
By definition of , such that , .
Let . Let be such that , and be such that , .
Consider . Then,