# Learning Subspaces of Different Dimensions

## Abstract

We introduce a Bayesian model for inferring mixtures of subspaces of different dimensions. The key challenge in such a mixture model is specification of prior distributions over subspaces of different dimensions. We address this challenge by embedding subspaces or Grassmann manifolds into a sphere of relatively low dimension and specifying priors on the sphere. We provide an efficient sampling algorithm for the posterior distribution of the model parameters. We illustrate that a simple extension of our mixture of subspaces model can be applied to topic modeling. We also prove posterior consistency for the mixture of subspaces model. The utility of our approach is demonstrated with applications to real and simulated data.

## 1Introduction

The problem of modeling manifolds has been of great interest in a variety of statistical problems including dimension reduction [7], characterizing the distributions of statistical models as points on a Riemannian manifold [2], and the extensive literature in statistics and machine learning on manifold learning [12]. A generalization of the manifold setting is to model unions and intersections of manifolds (of possibly different dimensions), formally called stratified spaces [8]. Stratified spaces arise when data or parameter spaces are characterized by combinations of manifolds such as the case of mixture models. One of the most important special cases arises when the manifolds involved are all affine subspaces or linear subspaces. Mixtures of linear subspaces have been suggested in applications such as tracking images [27], quantitative analysis of evolution or artificial selection [26], applications in communication and coding theory [3], and is relevant for text modeling [41].In this paper we provide a model for the simplest instance of inferring stratified spaces, estimating mixtures of linear subspaces of different dimensions.

The idea of dimension reduction via projections onto low-dimensional subspaces goes back at least to Adcock [1] and Edgworth [16], with methodological and foundational contributions by R. A. Fisher [18]; see [12] for an excellent review. It is very interesting that in 1922 Fisher suggested that the statistical setting where the number of variables is greater than the number of observations, could be addressed by reducing the dimension of to very few summaries of the data where The summaries in this setting were linear combinations of the variables. This idea of dimension reduction has been extensively used statistics ranging from classical methods such as principal components analysis (PCA) [30] to a variety of recent methods, some algorithmic and some likelihood based, that fall under the category of nonlinear dimension reduction and manifold learning [7]. A challenging setting for both algorithmic and probabilistic models in this setting is where the data are being generated from multiple populations inducing a mixture distribution. It is particularly challenging when the mixtures are of different dimensions.

In many applications a useful model for the observed high-dimensional data assumes the data is concentrated around a lower-dimensional structure in the high-dimensional ambient space. In addition, it is often the case that the data is generated from multiple processes or populations each of which has low-dimensional structure. In general, the degrees of freedom or number of parameters of the processes capturing the different populations need not be equal. In this paper, we address this problem of modeling data arising from a mixture of manifolds of different dimensions for the restricted case where the manifolds are linear subspaces.

The most recent work that offers both estimators and provides guarantees on estimates for inferring mixtures of subspaces has been limited to *equidimensional* subspaces [33]. A Bayesian procedure for mixtures of subspaces of equal dimensions was developed in Page et al. [38]. A penalized loss based procedure was introduced in Lerman and Zhang [33] to learn mixtures of -flats. There are significant difficulties in extending either approach to subspaces of different dimensions. The key difficulty in extending either approach is addressing the singularity introduced in moving between subspaces of different dimensions when one parameterizes a subspace as a point on the Grassmann manifold and uses the natural geodesic on this manifold. This difficulty appears in the Bayesian approach as requiring the posterior samples to come from models of different dimensions which will require methods such as reversible jump MCMC which may cause mixing problems. The difficulty is immediate in the penalized loss model as the loss is based on a distance to subspaces and if the dimensions of the subspaces vary the loss based procedure becomes very difficult. A related line of work appears in Hoff [28] where efficient methodology is developed to sample from a space of orthonormal matrices with fixed intrinsic dimension using the matrix Bingham–von Mises–Fisher distribution. One of the results in this paper is a procedure to sample from orthonormal matrices of varying intrinsic dimensions. From a geometric perspective a method was developed in [28] to simulate from a Stiefel manifold with fixed intrinsic dimension, in this paper we provide a methodology to simulate over Stiefel manifolds of varying intrinsic dimensions.

The key idea we develop in this paper is that subspaces of different dimensions can be embedded into a sphere of relatively low dimension where chordal distances on the sphere can be used to compute distances between subspaces of differing dimensions [11]. This embedding removes the discontinuity that occurs in moving between subspaces of different dimensions when one uses the natural metric for a Grassmann manifold. The other tool we make use of is a Gibbs posterior [31] which allows us to efficiently obtain posterior samples of the model parameters.

The structure of the paper is as follows. In Section 2 we state a likelihood model for a mixture of subspaces each of dimension . In Section 2.3 we define the embedding procedure we use to model subspaces of different dimensions and specify the model with respect to the likelihood and prior. In Section 3 we provide an algorithm for sampling from the posterior distribution. For some of the parameters standard methods will not be sufficient for efficient sampling and we use a Gibbs posterior for efficient sampling. In Section Section 4 we extend the mixture of subspaces model to topic modeling. In Section 5 a frequentist analysis of the Bayesian procedure is given that proves posterior consistency of the procedure. In Section 6 we use simulated data to provide an empirical analysis of the model and then we use real data to show the utility of the model.We close with a discussion.

## 2Model specification

### 2.1Notation

We first specify notation for the geometric objects used throughout this paper. The Grassmann manifold or Grassmannian of -dimensional subspaces in will be denoted . The Stiefel manifold of matrices with orthonormal columns will be denoted and when we write for the orthogonal group. We use boldfaced uppercase letters, e.g., , to denote subspaces and the corresponding letter in normal typeface, e.g., , to denote the matrix whose columns form an orthonormal basis for the respective subspace. Note that and . A subspace has infinitely many different orthonormal bases, related to one another by the equivalence relation where . We identify a subspace with the equivalence class of all its orthonormal bases thereby allowing the identification .

In this article, the dimension of the ambient space will always be fixed but our discussions will often involve multiple copies of Grassmannians with different values of . We will use the term ’Grassmannian of dimension ’ when referring to even though as a manifold, .

### 2.2Likelihood specification

We consider data drawn independent and identically from a mixture of subspaces where each observation is measured in the ambient space . We assume that each population is concentrated near a linear subspace which we represent with an orthonormal basis , , .

We first state the likelihood of a sample conditional on the mixture component. Each mixture component is modeled using a -dimensional normal distribution to capture the subspace and a -dimensional normal distribution to model the residual error or null space:

where is the orthonormal basis for the th component and is modeled by a multivariate normal with mean and covariance and is the basis for the null space which models the residual error as multivariate normal with variance . We are estimating affine subspaces so the parameter serves as a location parameter for the component and by construction . Also note that without loss of generality we can assume that is diagonal since we may diagonalize the covariance matrix and rotate by resulting in a parameterization that depends on and a diagonal matrix.The distributions for the null space and and subspace can be combined and specified by either of the following parameterizations

It will be convenient for us to use the second parameterization for our likelihood model.

Given the above likelihood model for a component we can specify the following mixture model

where is a probability vector and we assume components. We will use a latent or auxiliary variable approach to sample from the above mixture model and specify a -dimensional vector with a single entry of and all other entries of zero, . The conditional probability of given the latent variable is

### 2.3Prior specification and the spherical embedding

The parameters in the likelihood for each component are and the mixture parameters are weights . Again we fix the number of mixtures at . Prior specification for some of these parameters are straightforward. For example the location parameter is normal, the variance terms and are Gamma, and the mixture weights are Dirichlet. A prior distribution for each triple is less obvious.

The inherent difficulty in sampling this triple is that we do not want to fix the dimension of the subspace , we want to consider as random. We can state the following joint prior on the triple

Given we can specify as a multivariate normal of dimension . Given we can also specify a conjugate distribution for as the von Mises–Fisher (MF) distribution

where is the exponential trace operator. The matrix von Mises–Fisher distribution is a spherical distribution over the set of all matrices, also known as the Stiefel manifold which we denote as . A prior on would take values over and for each value the conditional distributions and need to be specified. For a prior distribution of seems reasonable since we can assume the mean is zero and the entries independent for any . Specifying the conditional distribution for is not as clear. As changes the dimension of the matrix needs to change and one cannot simply add columns of zeroes since columns need to be orthonormal. An additional constraint on the prior is that a small change in dimension should only change the prior on slightly. This constraint is to avoid model fitting inconsistencies. This constraint highlights the key difficulty in prior specification over subspaces of different dimensions: how to measure the distance between subspaces of different dimensions. Note that we can not simply integrate out or as nuisance parameters since we have no prior specification.

We will use the geometry of the subspace to specify an appropriate joint prior on . Recall that the set of all -dimensional linear subspaces in is the Grassmann manifold and that we represent a subspace with an orthonormal matrix from an equivalence class . We need to place priors on Grassmanians of different dimension . The key tool we use to specify such a prior is the embedding of into , an appropriately chosen sphere^{1}

The following theorem states that embedding the Grassmanian into a sphere allows us to measure distances between subspaces.

The embedding procedure proceeds in the following steps: (1) given a basis compute the projection matrix , (2) take all the entries of in the upper triangle (or lower triangle) as well as all the elements in the diagonal except for one as a vector in . The sum of all the entries on the vector will be a constant, this is a result of the orthogonality of , which means that all the subspaces of dimension lie on the same sphere. The key observation by Conway et al. [11] was that if the extra coordinate is included, thus embedding into , the subspaces are still embedded into spheres and each of these spheres are cross sections of a higher-dimensional sphere which we denote as . The sphere is centered at where denotes the embedding of the projection matrix and is the half-vectorization operation

The -dimensional subspace is embedded at the origin . The radius of is . In summary,

Grassmann manifolds are embedded into cross-sections of where the projection matrix corresponding to the pre-image has an integer valued trace. The geodesic distance along the surface of the sphere, , corresponds to the projective distance between two subspaces ,

where are the principal angles between the subspaces. We illustrate the embedding for two projection matrices in Figure 1.

The representation of Grassmannians as points on has several useful properties.

- Sphere interpretation
The sphere provides an intuitive way to sample subspaces of different dimensions by sampling from . Under the projective distance, the sphere also has an intuitive structure. For example, distances between subspaces of different dimensions can also be computed as the distance between points on the sphere, these points will be on different cross-sections. Under the projective distance, the orthogonal complement of a subspace is the point on that maximizes the projective distance. Further, the projection matrix is always invariant to the representation .

- Differentiable
The projective distance however is square differentiable everywhere, making it more suitable in general for optimization problems. This is not the case for distances like geodesic distance or the Asimov–Golub–Van Loan distance where maximizing the distance between a set of subspaces will result in distances that lie near non-differentiable points [11]. This numerical instability can lead to sub-optimal solutions.

- Ease of computation
The projective distance is easy to compute via principal angles, which are in turn readily computable with singular value decomposition [23]. Working with the embedding requires only a relatively small number of coordinates — in fact only quadratic in or . Furthermore one can exploit many properties of a sphere in Euclidean space in our computations. For example sampling from a sphere is simple. The number of required coordinates is small compared to alternative embeddings of the Grassmannian, see [25]. In contrast the usual Plücker embedding requires a number of coordinates that is , i.e., exponential in . Moreover the Plücker embedding does not reveal a clear relationship between Grassmannians of different dimensions, as there is using the spherical embedding.

We will place a prior on projection matrices by placing a distribution over the lower half of , points on corresponding to cross-sections where the subspace corresponding to the pre-image has dimension . We only consider the lower half since we assume the model to be low-dimensional. The prior over projection matrices imples a prior over and . A point drawn from may not correspond to a subspace, recall only points with integer trace have subspaces as a pre-image. We address this problem by the following procedure: given a sampled point we return the closest point that is the pre-image of a subspace. The following theorem states the procedure.

In the case where the point is already on a cross-section of the sphere corresponding to , the eigendecomposition will return exactly non-zero eigenvalues. The eigenvectors give a basis for the subspace that is embedded into the point . Similarly when the point is between cross sections corresponding to Grassmannians, the above algorithm minimizes the Euclidean distance between the point and , and therefore minimizes the distance on .

The full model is specified as follows for each , ,

where equation corresponds to sampling from a distribution supported on the lower half of the sphere a projection matrix that corresponds to a subspace and computing the dimension as the trace of the subspace and computing the subspace from the projection and equation corresponds to sampling from a normal distribution subject to the projection constraint .

## 3Posterior sampling

In this section we provide an efficient algorithm for sampling the model parameters from the posterior distribution. Sampling directly from a joint posterior distribution of all the parameters is intractable and we will use Markov chain Monte Carlo methods for sampling. For most of the parameters we can sample from the posterior using a Gibbs sampler. This is not the case for sampling from the posterior distribution over projection matrices with prior on the sphere . The prior should place more mass on cross-sections of the sphere corresponding to lower dimensions . Sampling efficiently from a joint distribution on is difficult. We will address this problem by using a Gibbs posterior [31] to sample the projection matrices. We first state the Gibbs posterior we use to sample and efficiently and the rationale for this form of the posterior. We then close with the sampling algorithm for all the model parameters.

It is not obvious how to place a prior on the sphere that will allow for efficiently sampling. We can however follow the idea of a Gibbs posterior to design an efficient sampler. The idea behind a Gibbs posterior is to replace the standard posterior which takes the form of

with a distribution based on a loss or risk function that depends on both the data as well the parameter of interest in our case the loss function is given by

where is the residual error for the th sample given by the -th subspace with the error defined by our likelihood model. The above loss function corresponds to computing for each sample the residual error to the closest subspace weighted by the dimension of the subspace. The penalty weighting the dimension of the subspace enforces a prior that puts more mass on subspaces of lower dimension. Given the likelihood or loss function we state the following Gibbs posterior

where is a chosen temperature parameter. A Gibbs posterior is simply a loss oriented alternative to the likelihood based posterior distribution. Traditionally it is used to account for model misspecification. Here the Gibbs posterior is used to avoid overfitting by arbitrarily increasing the dimension of the subspace and for computational efficiency in sampling.

### 3.1Sampling and from the Gibbs posterior

In this subsection we outline our procedure for sampling from the model parameters and using a Metropolis–Hastings algorithm which is effectively a random walk on the sphere. We first state a few facts that we will use. First recall that there is a deterministic relation between and , so given a we can compute . Also recall that a point sampled from is not the pre-image of a subspace. Given a point we denote the subspace corresponding to this point as , this is the closest projection matrix to corresponding to a subspace. The procedure to compute from is given in Theorem ?. We obtain correspond to the top eigenvectors of where is the trace of .

We now state two procedures. The first procedure initializes the parameters and . The second procedure computes the -th sample of the parameters.

The first procedure which we denote as proceeds as follows:

Draw , the symmetric group of permutations on elements.

For ,

draw ;

compute ;

compute ;

compute ;

compute as the top eigenvectors of ;

draw ;

compute .

The first step permutes the order we initialize the components. Step (a) samples a point from a multivariate normal with the dimension of the sphere. In Step (b) we normalize the sampled point, recenter it, and embed it onto the sphere . In Step (c) we compute the projection matrix by computing the closest subspace to the embedded point computed in Step (b). Given the projection matrix we compute the dimension in Step (d) and the basis of the subspace in Step (e). Steps (e) and (f) we compute the parameters.

The second procedure which we denote as computes the -th sample as follows:

Draw , the symmetric group of permutations on elements.

For ,

draw ;

compute ;

compute ;

compute ;

compute as the top eigenvectors of ;

draw ;

set

set

compute the acceptance probability

set

draw ;

compute ;

draw ;

set

compute the acceptance probability

set

Many steps of this procedure are the same as the first procedure with the following exceptions. In Steps (a) and (k) we are centering the random walk to the previous values of and respectively. Step (g) updates the set of projection matrices by replacing the -th projection matrix in the set with the proposed new matrix. Step (h) is analogous to Step (g) but for the set of vectors. In Step (j) we update the subspace and in Step (p) we update the vector.

### 3.2Sampling algorithm

We now state the algorithm we use to sample from the posterior. To simplify notation we work with precision matrices instead of the inverse of covariance matrices for each mixture component. Similarly, we work with the precision of the -th component instead of the inverse of the variance, .

The follow procedure provides posterior samples:

Draw .

Draw for and .

For ,

for and , compute

for , set

for , draw ;

update for each where

and ;

update for , and each ,

update for , and ,

where denotes the th element of the vector ;

draw

The update steps for are (d), (e), (f) respectively and are given by the conditional probabilities given all other variables. Steps (a), (b), and (c) assign the latent membership variable to each observation based on the distance to the subspaces. We set the parameter very large which effective assigns membership of each to the subspace with least residual error.

When drawing from the Gibb’s posterior distribution via a Metropolis–Hastings algorithm, the proposal distribution and temperature are adjusted through a burn-in period. In the first stage of burn-in, the proposal variance parameter is fixed, while temperature is selected by a decreasing line search on a log-scale grid, from to until the acceptance ratio reaches the 10%–90% range. With temperature fixed, the proposal variance is adjusted until the acceptance ratio falls in the 25%–45% range during the burn-in period. Thinning was applied in that every third draw of the sampler was kept, this was determined from autocorrelation analysis.

## 4Specification of a Topic Model

### 4.1Generative Model on the Stiefel manifold

The idea behind topic modeling is to specify a generative model for documents where the model parameters provide some intuition about a collection of documents. A common representation for documents is what is called a “bag of words” model where the grammar and structure of a document is ignored and a document is just a vector of counts of words [9]. A natural generative model for collections of documents is an admixture of topics where each topic is a multinomial distribution over words, this model is called a latent Dirichlet allocation (LDA) model [9]. We will propose a slight variation of the LDA model later in this section which is a direct extension/application of a mixture of subspaces.

We first state the standard LDA model. Given documents, topics, and a vocabulary of size the counts of the -th word in the -th document is specified by the following hierarchical model

In a spherical admixture model (SAM) [41] the vector of word counts in each document transformed by centering at zero and normalizing to unit length. The idea behind a SAM is to represent data as direction distributions on a hypersphere. The advantage of a SAM is that one can simultaneously model both frequency as well as presence/absence of words, an LDA model can only model frequency. There is empirical evidence of greater accuracy in using a SAM for sparse data such as text [5]. We extend the SAM model in two important ways, first by ensuring all the topics are orthogonal. The logic behind orthogonality constraints in the topics is to avoid the empirically observed problem of redundant topics. Strategies to eliminate this problem include removing what are called stop words from the corpus, for example words including “such”, “as”, and “and.” However, it is not always the case that stop words for a particular corpus are known a priori. For example the word “topic” should be a stop word in a corpus of papers on topic modeling. Introducing an orthogonality constraint on the topics can enforce the prior knowledge that they should be interpretable as distinct. In [28] a Bayesian model for an orthogonal SAM is specified and a posterior sampling procedure is developed. A key insight this paper was how to efficiently simulate from the set of orthonormal matrices using the matrix Bingham–von Mises–Fisher distribution. For an orthogonal SAM model the topics on words were modeled using the matrix Bingham–von Mises–Fisher distribution which is on the Stiefel manifold, .

Our second extension is to infer the number of topics . While the orthogonality constraints help with interpretation of topics and removes redundant topics there are still topics with low posterior mixture probabilities and low coherence can still occur. This is mainly driven by misspecification of the number of topics. We now state our novel SAM model that enforces orthonormal columns as well as allows for the inference of the number of topics. The novel contributions of our model are inference of the number of topics by using the geometry of the Conway embedding to place a joint prior over the number of topics and topic. We are able to sample a from Stiefel manifolds of variable intrinsic dimension by coupling draws from the von Mises–Fisher distribution with inversion of the Conway embedding. This allows us to avoid using the matrix Bingham–von Mises–Fisher distribution.

We provide some intuition for our SAM with orthogonality constraints as well as useful notation before we specify the model. We will simulate a topic matrix where the columns of the matrix are topics and the number of topics is random, this orthogonal matrix is sampled from a distribution over Stiefel manifolds, , with fixed ambient dimension (the number of words) and variable embedding dimension . For each document probability vector of topic proportions over the topics is generated. The normalized unit vector representing normalized word frequencies for each document is generated from the topic proportions and the topic matrix . The following notation and concepts will be used in the generative model. We denote as the Conway sphere this is the collection of orthogonal subspaces of variable dimension embedded into a sphere. We denote and as the embedding function and its inverse respectively. We denote as the von Mises–Fisher distribution over the Conway sphere and as the von Mises–Fisher distribution over the unit sphere . Given the topics matrix and the topic proportions for a document a spherical average of the topics with respect to the proportions is the admixed parameter that models the combination of topics in document and is computed by

We did not use the Buss-Fillmore spherical average due to computational considerations, we wanted to avoid iterative procedures. Given a vocabulary of size and documents, the normalized unit vector for each document is specified by the following hierarchical model

The main difference in the above model and prior models on spheres [28] is that instead of sampling topics from the embedding on the Conway sphere a fixed topic vectors were each sampled from a von Mises–Fisher distribution over . In [41] the vectors were not constrained to be orthogonal, in [28] an efficient procedure is given to simulate these orthogonal vectors. The Conway sphere in this case is extremely high dimensional and there is computational utility in reducing the vocabulary size.

As in the mixture of subspaces model we require a prior that will place greater weight on models with fewer topics, as increasing the number of topics will result in a better fit with respect to the likelihood. In the same spirit as Section 3 we specify a Gibbs posterior to place a prior on the Conway sphere that can be efficiently be sampled from and favors models with fewer topics. We specify the following loss function for each document vector

and corresponding Gibbs posterior

The maximum penalty above is and would counterbalance a perfect fit to each document with a loss value of zero. Using the Gibbs posterior allows us to skip the step of estimating the corpus average parameter since the remaining parameters and are conditionally independent given the topics. We set the temperature parameter using out-of-sample fits on a random search over .

Inference of the remaining parameters of Model are estimated using the same sampling steps as in a standard SAM, once the topics and number of topics are sampled. The high-dimension of the Conway sphere can result in slower mixing of the topics and it is of interest to explore EM or Hamiltonian Monte Carlo approaches for computational gain.

## 5Posterior consistency

In this section we show that the mixture of subspaces model specified in Section 2 has good frequentist properties. We provide an asymptotic analysis of our model and state some theoretical guarantees. We will prove posterior consistency—the posterior distribution specified by our model contracts to any arbitrary neighborhood of the true data generating density. Adapting proof techniques from the extensive literature on posterior consistency of Bayesian models [21] to our model is non-trivial.

Before we state our results we provide an explanation of the relation between our consistency result, the estimation procedure, and ideally what proof statements would be of interest. The MCMC algorithm stated in Section 3 uses a Gibbs posterior while our analysis in this section is for a standard posterior. Providing an argument for posterior consistency and proper calibration of the Gibbs posterior is of great interest but beyond the scope of this paper. A natural question is the asymptotic analysis of the convergence of mixture components and weights, basically an analysis of the clustering performance. The inference of mixture components is significantly more complicated that convergence in density and involves subtle identifiability issues.

Let be the space of all the densities in and be the true data generating density. We first define some notion of distances and neighborhoods in . A weak neighborhood of with radius is defined as

where is the space of all continuous and bounded functions on . The Hellinger distance is defined as

Denote an -Hellinger neighborhood around with respect to . The Kullback–Leibler (KL) divergence between and is defined to be

with denoting an -KL neighborhood of .

One of the key geometric insights of the Conway embedding given in Section 2 is that the geometric embedding allows for the specification of prior distributions of the parameters on a smooth space. Let be a prior on the sphere which can be taken to be the uniform distribution or the von Mises–Fisher distribution. By projecting the samples from onto the cross-sections of the sphere, induces a prior distribution on the subspaces basis which we denote by .

Note that our model induces a prior on . Assume the true density follows the following the regularity conditions

is bounded away from zero and bounded above by some constant for all ;

;

for some , , where ;

there exists such that .

We will show that the posterior distribution concentrates around any true density as . The following theorem is on weak consistency.

A result due to [43] states that if assumes positive mass to any Kullback–Leibler neighborhood of , then the resulting posterior is weakly consistent. Therefore, one needs to show for all ,

Note that for any . Then one has

Therefore it suffices to show that where are the bases of the respective -dimensional subspaces .

We will show that there exists large enough such that given -dimensional subspaces , the following mixture model assigns positive mass to any KL neighborhood of ,

with and

If have the same dimension and are a choice of orthonormal bases on the respective subspaces, then an infinite-dimensional version of our model can be given by

with parameters . The prior for can be given by a Dirichlet process prior whose base measure has full support in while the priors of the rest parameters can be given the same as those of our model. By Theorem 3.1 of [38] or Theorem 2 of [47], there exists an open subset of the space of all the probability measures on such that for for all , any such that

We will first show that for any there exists large enough and , , , , , , and such that

for any .

Let be some large enough number. We first partition into sets. Let

and

Pick where and let . Approximating the integral by the finite sum over the cubes in , for all , there exists large enough such that