A Polynomial Time MCMC Method for Sampling from Continuous DPPs
We study the Gibbs sampling algorithm for continuous determinantal point processes. We show that, given a warm start, the Gibbs sampler generates a random sample from a continuous -DPP defined on a -dimensional domain by only taking number of steps. As an application, we design an algorithm to generate random samples from -DPPs defined by a spherical Gaussian kernel on a unit sphere in -dimensions, in time polynomial in .
Let be a positive semi-definite (PSD) matrix. A discrete determinantal point process with kernel is a probability distribution defined by
The notion of DPP was first introduced by [Mac75] to model fermions. Since then, They have been extensively studied, and efficient algorithms have been discovered for tasks like sampling from DPPs [LJS15, DR10, AGR16], marginalization [BR05], and learning [GKFT14, UBMR17] them (in the discrete domain). In machine learning they are mainly used to solve problems where selecting a diverse set of objects is preferred since they offer negative correlation. To get intuition about why they are good models to capture diversity, suppose each row of the gram matrix associated with the kernel is a feature vector representing an item. It means the probability of a set of items is proportional to the square of the volume of the space spanned the vectors representing items. Therefore, larger volume shows those items are more spread which resembles diversity. Text summarization, pose estimation, and diverse image selection are examples of applications of DPP in this area [GKT12, KT12, KT10, KT11]. These distributions also naturally appear in many contexts including non-intersecting random walks [Joh02], random spanning trees [BP93].
Here, we focus on the sampling problem of DPPs. In the discrete setting, the first sampling algorithms was proposed by [HKP06]. They propose a two-step spectral algorithm which generates a random sample by running an eigen-space decomposition of the kernel and running conditional sampling on the space spanned by a randomly chosen set of the eigenvectors. Several algorithms for sampling different variation of DPPs including -DPPs have been built on this idea [KT12, DR10]. The disadvantage of spectral techniques for this problem is that they typically need the eigen-decompostion or Cholesky decomposition of the kernel which makes them inefficient for large instances. Recently, several group of researchers studied employing the Monte Carlo Markov Chain (MCMC) technique for this task [AGR16, LSJ16, RK15]. It is shown in [AGR16] that the natural Metropolis-Hastings algorithm for -DPPs gives an efficient sampling method running in time , where is the number points in the underlying kernel.
In this paper, our main goal is to study the sampling problem of continuous DPPs. On a continuous domain, a DPP is defined similarly by a continuous PSD operator. For , and a continuous PSD kernel , the DPP is a distribution over finite subsets of where for every subset , the probability density function at , satisfies
For an integer , a (continuous) -DPP is the restriction of a (continuous) DPP to subsets of size exactly . Continuous DPPs naturally arise in several areas of Phyiscs, Math and Computer Science; To name a few examples, eigenvalues of random matrices [MG60, Gin65], zero-set of Gaussian analytic functions [PV05] are families of DPPs; also, see [LMR15] for applications in statistics and [BL16] for connections to repulsive systems. Recently, sampling from continuous -DPP has also been used for tuning the hyper-parameters of a network [?]. Unlike the discrete setting, despite several attempts [HAFT13, SZT09, LMR12, BH16, HG16], to this date, we are not aware of any efficient sampling algorithms with provable guarantees. It remains an open problem to design an efficient sampling algorithm for continuous -DPPs.
Our main contribution is to develop an algorithm to draw approximate samples from a -DPP defined by a continuous kernel, and having access to a “conditional-sampling” oracle.
Unlike [AGR16], here, we analyze a different Markov chain, called the Gibbs sampler chain: Let be a -DPP defined by a kernel for . Given a state , the Gibbs sampler moves as follows: Remove a point is chosen uniformly at random, and move to the state with probability proportional to .
We study the problem of sampling from a continuous -DPP, and present the first MCMC based algorithm with provable guarantees for this problem. Our main contribution is to show that the Gibbs sampler for -DPPs (see 2.1) mixes rapidly, and can be simulated efficiently under some extra assumptions on the kernel. More precisely, we analyze the conductance of the Gibbs samplers -DPPs (see theorems 3.1 and 4.1), and using the well-known connection between the conductance and mixing time obtain the following.
Let be the Gibbs sampler for a -DPP . If we run the chain starting from an arbitrary distribution , for any we have
In the above theorem, and refer to the probability density functions for and , respectively. Moreover, denotes the mixing time for the chain started from , and is defined by
where is the distribution after steps, and denotes the total variation distance. Moreover, an -approximate sample for distribution refers to a random sample from a distribution where .
To find a “good” starting distribution for which the bound in Theorem 1.1 is polynomial, we require some additional constraints on the kernel. Namely, we need to have access to conditional sampling oracles, formally defined as follows.
For a kernel , a subset , and an integer , we define -conditional distribution of to be a simple point process defined on by a pdf function satisfying
and zero if . We denote this distribution by CD. We say an algorithm is an CD oracle for integers and , if given any , it returns a sample from the CD.
It is straight-forward to see that taking a step of the Gibbs sampler of the -DPP from the state defined by is equivalent to removing a point , for some , and generating a sample from CD. We prove the following.
Theorem 1.3 (informal).
Let be the Gibbs sampler for the -DPP defined by a kernel . Given CD oracles for all , we sample a starting state for from a probability distribution where
Therefore, to get a polynomial time algorithm for sampling from a -DPPs (a CD distribution), it is essentially enough to have efficient algorithms to sample from conditional -DPPs (CD distributions), which is a much simpler problem. As an application, we consider Gaussian kernels. For a covariance matrix , a Gaussian kernel is defined by . We show a simple rejection sampling can be used as conditional sampling oracles for , and obtain the following.
Theorem 1.4 (See Theorem 5.2 for details.).
Let , and for some be two integers. There is a randomized algorithm that for any and , generates an -approximate sample from the -DPP defined by restricted to which runs in time
In the above, we are assuming a sample from the normal distribution can be generated in constant time.
1.2 Previous Work
In the continuous regime, the efforts have been mostly concentrated on finding the eigen-decomposition of the kernel or a low-rank approximation of it, and extending the aforementioned spectral techniques to the continuous space [HAFT13, SZT09, LMR12, BH16].
However, in theory, these methods does not yield provable guarantees for sampling because generally speaking, to project the DPP
kernel onto a lower dimensional space, they minimize the error with respect to a matrix norm, rather than the DPP distribution.
Moreover, there are two main obstacles to implement this approach: First, there is no efficient algorithm for obtaining an eigen-decomposition of a kernel defined on an infinite space because it may have infinitely many eigenvalues. Secondly, for a general continuous eigen-decomposition, it is impossible to run a conditional sampling algorithm. For a concrete example, [LMR12] uses an orthonormal Fourier transform to find a low-rank approximation of the DPP kernel, and then proceeds with conditional sampling via rejection sampling.
But, they do not provide any rigorous guarantee on the distance between the resulting distribution, and the underlying -DPP distribution. A similar spectral approach is proposed in [HAFT13]. They consider the Nystörm method, and random Fourier feature as two techniques to find low-rank approximation of the kernel. The approximation
scheme that they use enables them to handle a wider range of
kernels. However, the first
issue still remains unresolved:
as the rank of the approximated
kernel increases, the resulting
distribution becomes closer to
the initial -DPP, but to the
best of our knowledge there is
no provable guarantee.
provides empirical evidence that Gibbs sampling is efficient to
generate sample from continuous
-DPP in many cases. However,
they do not provide any rigorous
It is also worth mentioning that
[HG16] claims to
devise an algorithm to generate
exact samples for specific
kernels (including Gaussian),
yet a careful look at their
method would reveal a major flaw
in their argument
Our first contribution is to analyze the Gibbs sampler chain in the discrete setting. We prove for a -DPP defined on points, the spectral gap of the Gibbs sampler chain is a polynomial in and independent of . So, up to logarithmic factors in , the chain mixes in time polynomial in . This result on its own could be of interest in designing distributed algorithms for sampling from discrete -DPPs. This is because given access to processors, one can generate the next step of the Gibbs sampler in time .
Secondly, we lift the above proof to the continuous setting using a natural discretization of the underlying space. To prove the mixing time, we need to make sure that the logarithm of the variance of the starting distribution with respect to the stationary distribution of the chain, i.e., the -DPP, is polynomially small in . We use a simple randomized greedy algorithm for this task: We start from the empty set; assuming we have chosen we sample from , where as usual is the underlying kernel. We show that the distribution governing the state output by this algorithm is our desired starting distribution.
Lastly, we use our main theorem to generate samples from a -DPP defined on a spherical Gaussian kernel on . To run the above algorithm we need to construct the for all where is the corresponding kernel. Given the point , we use the classical rejection sampling algorithm to choose ; namely, we generate a uniformly random point on the unit sphere and we accept it with probability . We use the distribution of the eigenvalues of the spherical Gaussian kernel [MNY06] to bound the expected number of proposals in the rejection sampler.
Let denote the -dimensional euclidean space. Whenever, we consider as measurable space, our measure is the standard Lebesgue measure. The denotes the -dimensional volume of with respect to the standard measure. A function belongs to , if . For two such functions the standard inner product is defined by . A function , is a Hilbert-Schmidt kernel . The associated Hilbert-Schmidt integral operator is then a linear operator which for any is defined by The operator is self-adjoint if for any , . It is Positive Semi-Definite (PSD), if for any , .
A kernel satisfies the Mercer conditions if: is symmetric, which means for any , . Moreover, for any finite sequence , the submatrix is a PSD matrix. It is known that the operators satisfying Mercer conditions are PSD. Moreover, if satisfies Mercer’s condition, for any , there exists a Hilbert space , and a function , where for any , . These functions are also known as feature maps. We also use the classical Mercer’s theorem which states that operators satisfying the Mercer’s condition are compact, and so have a countable system of eigen-spaces and eigenvalues. i.e. there are non-negative eigenvalues , and where
In section 5, we use this result for Gaussian kernels. Throughout, the rest of the paper, whenever we say a continuous kernel, we refer to continuous Hilbert-Schmidt kernel which satisfies the Mercer’s conditions.
If is a probability distribution, we use to refer to the corresponding probability density function (pdf). We use bold small letters to refer to a finite set of points in , and in particular a state of the Gibbs sampler for a -DPP, e.g. . For , we may use to indicate . For any , we use and interchangeably to refer to determinant of the submatrix where the entry is . Whenever, the kernel is clear from the context, we may drop the subscript. For two expression and , we write to denote .
2.1 Continuous Determinantal Point Process
A Determinantal Point Process (DPP) on a finite set, namely is a probability distribution on the subsets of which is defined by a PSD matrix (a.k.a missingkernel) where for every subset ,
where is the principal submatrix of indexed by elements of . For an integer , the restriction of to subsets of size is called a -DPP defined by the kernel . So support of a -DPP defined on is .
Similarly a continuous -DPP can be defined on a continuous domain with a continuous PSD kernel . The above succinct definition suffices to understand the rest of the paper. However, for completeness, we formally define them in the following. For more details about DPPs and generally point processes on continuous domains, we refer interested readers to [HKP06].
For a kernel , and for an integer , the -DPP defined by on domain is a point process with the support of subsets of of size , , defined as follows: For any the probability density function is proportional to . i.e. for any mutually disjoint family of subsets ,
where is the partition function . Now As alluded to before, we study the Gibbs sampling scheme for a -DPP which is formally defined as follows:
Definition 2.1 (Gibbs samplers for -DPPs).
Let be a -DPP defined by a kernel for . The Gibbs sampler for is a Markov chain with state space and stationary measure which moves as follows: Let be the current state. A point is chosen uniformly at random, and the chain moves to the state for chosen by the distribution defined by the pdf function
2.2 Markov Chains with Measurable State Space
In this section we give a high level overview on the theory of Markov chains defined over measurable sets. We refer interested readers to [LS93] for more details. Let be a measurable space. In the most general setting, a Markov chain is defined by the triple , where for every , is a probability measure on . Also, for every fixed , is a measurable function in terms of . In this setting starting from a distribution , after one step the distribution would be given by
From now on, assume and is the standard Borel -algebra. In our setting, we can assume the transition probabilities are given by a kernel transition kernel where for any measurable , we can write
In this notation, we use and interchangeably. would also denote the probability distribution of the states after steps of the chain started at . Similar to the discrete setting, we can define the stationary measure for the chain. A probability distribution on is stationary if and only if for every measurable set , we have
We call -irreducible for a probability measure if for any set with , and any state , there is such that . It is called strongly -irreducible if for any with non-zero measure and , there exists such that for any , . We say is reversible with respect to a measure if for any two sets and we have
In particular, reversibility with respect to a measure, implies it is a stationary measure. Is is immediate from this to verify that for a Gibbs sampler of a -DPPs , the itself is the stationary measure. Moreover, if the kernel of the -DPP is continuous, it is straight-forward to see that it is -strongly irreducible. The following lemma also shows is the unique stationary measure, and as the number of steps increases, the chain approaches to the unique stationary measure.
Lemma 2.2 ([Df97]).
If is a stationary measure of , and is strongly -irreducible. Then for any other distribution which is absolutely continuous with respect to , .
From now on, consider is chain with state space , probability transition function , and a unique stationary measure . Let us describe some results about mixing time in the Markov chains defined on continuous spaces. But before that we need to setup some notation. Consider a Hilbert space equipped with the following inner product.
defines an operator in this space where for any function and ,
In particular being reversible is equivalent to being self-adjoint. For a reversible chain and a function , the Dirichlet form is defined as
We also define the Variance of with respect to as
We may drop the subscript if the underlying stationary distribution is clear in the context. One way for upperbounding the mixing time of a chain is to use is to its spectral gap which is also known as Poincaré Constant.
Definition 2.3 (Poincaré Constant).
. The Poincaré constant of the chain is defined as follows,
where the infimum is only taken over all functions in with non-zero variance.
In this paper, we use the following theorem to upperbound the mixing time of the chain relevant to us.
Theorem 2.4 ([Km12]).
For any reversible, strongly -irreducible Markov chain , if , then the distribution of the chain started from which is absolute continuous with respect to is
For the sake of completeness, we include a proof of the above theorem which is an extension of the proof of the analogous discrete result in [Fil91]. We need the following simple lemma known as Mihail’s identity.
Lemma 2.5 (Mihail’s identity, [Fil91]).
For any reversible irreducible Markov chain , and any function in ,
Proof of Theorem 2.4. First of all, one can easily verify that if a chain is lazy and irreducible, then it is strongly-irreducible. Combining it with 2.2 would guarantee the uniqueness of the stationary measure. Let be the starting distribution and define be the distribution at time . Set , we have
So applying Mihail’s identity on and using (2.2) , we conclude
Now, note that has the same stationary distribution , so its Poincaré constant is at most
Combining this with (2.3), and using induction we can deduce
Note that, since is the kernel for a lazy chain, it has no negative values in its spectrum, implying . So in order to complete the proof it is enough show
This can be seen using an application of Cauchy-Schwarz’s inequality. We have
The last identity uses that . This completes the proof. ∎
In order to take advantage of Theorem 2.4, we need to lowerbound the Poicaré constant of our chain. This can be done by lowerbounding the Ergodic Flow of the chain.
Definition 2.6 (Ergodic Flow).
For a chain , the ergodic flow is defined by
The conductance of a set is defined by, , and the conductance of the chain is
The following theorem which is an extension of the Cheeger’s inequality for the Markov chains on a continuous space, relates the spectral gap to conductance.
Theorem 2.7 ([Ls88]).
For a chain defined on a general state space with spectral gap we have
3 Gibbs Sampling for Discrete -Dpp
In this section we prove the Gibbs sampler for a discrete -DPP is an -expander. Recall that the conductance of a time reversible chain is defined by
where for , . We prove the following.
Let be the Gibbs sampler chain for an arbitrary discrete -DPP, then for a constant we have
In the rest of this section, we fix to be the Gibbs-sampler chain on a -DPP defined on a set of elements.
Before discussing the details of the proof let us first fix a notation and recall fundamental properties of -DPPs. For any element , define be the set of all states in that contain, do not contain , respectively. Also define
It follows from [AGR16] that can be identified with a -DPP, -DPP supported on , respectively. We define to be the restricted Gibbs samplers. So, it is straightforward to see that for any we get . and consequently for defined as for , we get
Unlike , is not obtained from scaling a restriction of . In particular, Let so that (which implies ). Then, setting and with a bit abuse of notation , i.e. , we have
whereas . For any , define be the set of its neighbours in , i.e.
We use the following lemma to relate to .
Let be an arbitrary subset. For a state , consider the following partitioning of : and . Then we have
Note that is the set of all states containing elements in . So by definition of and , we have
where the inequality follows simply because . ∎
High level idea of the proof of Theorem 3.1.
We follow a proof strategy similar to [Mih92], which obtains analogue of our result in an unweighted setting and for the Metropolis-Hastings samplers. We use an inductive argument to prove the theorem. We need to prove for a subset with . Letting and , we have
We carry out the induction step by lowerbounding the RHS of the above term by term. In order to bound we use induction hypothesis on . To bound , we combine the induction hypothesis on with 3.2. It remains to bound the other two terms which correspond to the contribution of the edge across . To do that, we crucially use negative association of . In particular, we use the following lemma (appeared before in [Mih92] in the unweighted case). For any set , let denote the set of neighbors of in .
Lemma 3.3 ([Agr16]).
For any subset ,
The lemma lower bounds the vertex expansion of in and similarly
vertex expansion of in . Later we show how to use it to bound the edge expansion which is
our quantity of interest.
Proof of Theorem 3.1. We induct on . So, assume, the conductance of the Gibbs sampler for any -DPP over elements is at most and the conductance is at most for any -DPP over any elements.
Fix a set where . We need to show . First, consider a simple case where and . By induction hypothesis we have . Moreover, by adding up (3.1) for the edges across the cut , we get . So combining them we have
Now, we use induction on along with 3.2. The induction hypothesis implies
So to prove the theorem in this case, it is enough to show the following and add it up with (3.7).
To see that, it is enough to apply 3.2 and add up (3.3) for all , where subset in the lemma is determined as follows: if then set , otherwise set . Note that, doing that the RHS of the result will be exactly , because any edge of that will only show up in (3.3) by having .
So we focus on the case . Since , we have .so So, without loss of generality, perhaps by considering instead of , we may assume and . Our goal is to prove
For every , let , and be a partitioning of , so for every subset we have
Now, define to be
in other words, is the subset of states so that, if the chain takes one step from by removing and resampling element , then with probability at least it leaves and enters