# Fast nonlinear embeddings via structured matrices

Krzysztof Choromanski, Google Research, New York, NY 10011, USA, kchoro@google.com

Francois Fagan, Department of IEOR, Columbia University, New York, NY 10027, USA, ff2316@columbia.edu
###### Abstract

We present a new paradigm for speeding up randomized computations of several frequently used functions in machine learning. In particular, our paradigm can be applied to improve computations of kernels based on random embeddings. Beyond that, the presented framework also covers multivariate randomized functions. As a byproduct, we propose an algorithmic approach that also leads to a significant reduction of space complexity. Our method is based on careful recycling of Gaussian vectors into structured matrices that share properties of fully random matrices. The quality of the proposed structured approach follows from combinatorial properties of the graphs encoding correlations between rows of these structured matrices. Our framework covers as special cases already-known structured approaches such as the Fast Johnson-Lindenstrauss Transform, but is much more general since it can be applied also to highly nonlinear embeddings. We provide strong concentration results showing the quality of the presented paradigm.

dimensionality reduction, structured matrices, nonlinear embeddings

Subject classification: G.3 Probability and Statistics - Probabilistic algorithms

## 1 Introduction

Dimensionality reduction techniques and nonlinear embeddings based on random projections form a well-established field of machine learning. It is built on the surprising observation that the relationships between points in a high-dimensional space might be approximately reconstructed from a relatively small number of their independent random projections. This relationship might be encoded by the standard Euclidean distance, as is the case for the Johnson-Lindenstrauss Transform [26], or a nonlinear function such as a kernel similarity measure [32]. These techniques are applied in compression and information retrieval [2, 5, 14, 17], compressed sensing due to the related restricted isometry properties [4, 8, 15], quantization [7, 20] and many more. One particularly compelling application involves random feature selection techniques that were successfully used for large-scale kernel computation [19, 32, 37, 39, 41]. The randomized procedure for computing many of the kernels' similarity measures/distances considered in that setting (including Euclidean distance, angular similarity kernels, arc-cosine kernels and Gaussian kernels) is based on first applying a Gaussian random mapping and then applying pointwise nonlinear mappings (thus the computations mimic those in the neural network setting, but the linear projection is not learned). This is the area of our interest in this paper.

Recently it was observed that for some of these procedures unstructured random matrices can be replaced by their structured counterparts and the quality of the embedding does not change much. A structured matrix uses random Gaussian variables and distributes them in a way that approximately preserves several properties of the completely random version. The importance of the structured approach lies in the fact that it usually provides speedups of matrix-vector multiplication (a key computational block of the unstructured variant) by exploiting the matrix structure; for instance, for Gaussian circulant matrices one may use the Fast Fourier Transform to reduce the computational time from O(n2) to O(n log n). Furthermore, it gives a space complexity reduction since structured matrices can be stored in subquadratic or even linear space. A structured approach to linear embeddings, called the Fast Johnson-Lindenstrauss Transform, is itself the subject of a vast volume of research results that use different approaches involving: Hadamard matrices with Fast Fourier Transforms and sparse matrices [2, 3, 4, 6, 26], binary and sparse matrices [1, 13, 23, 24, 29, 31], the Lean Walsh Transform [28], circulant matrices [18, 25, 33, 34, 35, 40, 45] and others. Here no nonlinear mappings are used. In this paper we are interested mainly in structured nonlinear embeddings. Not much is known in that area. All known results target very specific kernels, such as the angular similarity kernel [11, 17, 43, 44] and the Gaussian kernel [27], and/or use a fixed budget of randomness to construct a structured matrix (all of the above but [11]) since the construction of the structured matrix is very rigid.

This is all relevant to neural networks, which have matrix-vector multiplication and nonlinear transformations at their core. Structured matrices have been used in neural networks to speed up matrix-vector computation, decrease storage and sharply reduce the number of training parameters without much affecting performance [9, 30, 38, 42]. Random weight matrices eliminate the training of weight matrices altogether [11, 36] and provide a pathway for analyzing neural networks [16]. Explicitly integrating out random weights produces new kernel-based algorithms, such as the arc-cosine kernel [10].

Even though there is some agreement on which structured approaches may work in practice for specific applications, general characteristics of structured matrices producing high quality embeddings for general nonlinear mappings, as well as the underlying theoretical explanation, were not known. In this paper we propose such a general framework that covers as special cases most of the existing structured mechanisms and can be automatically adjusted to different "budgets of randomness" used for structured matrices. The latter property enables us to smoothly transition from the completely unstructured setting, where the quality guarantees are stronger but the computational cost is larger, to the structured setting, where we can still prove quality results but the computations are sped up and space complexity is drastically reduced. At the same time we show an intriguing connection between guarantees regarding the quality of the produced structured nonlinear embeddings and combinatorial properties of some graphs associated with the structured models, encoding in a compact form the correlations between different rows of the structured matrix.

The randomized function Λf for which we propose a structured computational model takes as input vectors v1, ..., vk. Thus the model is general enough to handle relations involving more than two vectors. Each vector is preprocessed by multiplying it with a Gaussian matrix G, where gi stands for the i-th row, with a pointwise nonlinear mapping f following it. We then apply another mapping β separately on each dimension (in the context of kernel computations the mapping β is simply a product of its arguments) and finally agglomerate the results for all dimensions by applying another mapping Ψ. Functions defined in such a way, although they may look complicated at first glance, encode all the distance/kernels' similarity measures that we have mentioned so far. We show that the proposed general structured approach enables us to get strong concentration results regarding the computed structured approximation of Λf.

The presented structured approach was considered in [12], but only for the computation of specific kernels (and the results heavily relied on the properties of these kernels). Our concentration results are also much sharper, since we do not rely on the method of moments and thus cover datasets of sizes superpolynomial in n. In [12] the authors empirically verify the use of multi-block "Toeplitz-like" matrices for feature set expansion, which is not our focus here since we reduce dimensionality.

This paper is organized as follows:

• in Section 2 we introduce our structured mechanism and propose an algorithm using it for fast computations of nonlinear embeddings,

• in Section 3 we present all theoretical results,

• in the Appendix we prove all theoretical results that were not proved in the main body of the paper.

## 2 Structured mechanism for fast nonlinear embeddings

### 2.1 Problem formulation

We consider in this paper functions of the form:

 Λf(v1,...,vk)=E[Ψ(β(f(y1,1),...,f(y1,k)),...,β(f(ym,1),...,f(ym,k)))], (1)

where: yi,j = ⟨gi, vj⟩ for i = 1, ..., m and j = 1, ..., k, the expectation is taken over independent random choices of g1, ..., gm from n-dimensional Gaussian distributions where each entry is independently taken from N(0, 1), ⟨·,·⟩ denotes the dot product, and f, β, Ψ are (possibly nonlinear) mappings acting on scalars, k-tuples and m-tuples respectively, for some integers m, k. We will assume that v1, ..., vk are linearly independent. Λf is always spherically invariant.

Below we present several examples of machine learning distances/similarity measures that can be expressed in the form Λf(v1, v2). Although our theoretical results will cover more general cases, the examples will focus on the setting where k = 2 and β is the product of its arguments. Equation (1) then simplifies to:

 Λf(v1,v2)=E[f(⟨r,v1⟩)⋅f(⟨r,v2⟩)]. (2)

This defines a wide class of spherically invariant kernels characterized by the choice of f. Our results will cover general functions f that do not have to be linear, or even continuous.

1. Euclidean inner product

This is probably the most basic example. Let f(x) = x. One can easily note that Λf(v1, v2) = ⟨v1, v2⟩. Furthermore, if we take m large enough then the computed value, the empirical average of ⟨gi, v1⟩⟨gi, v2⟩ over the m Gaussian vectors, is well concentrated around its mean, and that follows from standard concentration inequalities. Since m is usually much smaller than n, one can think about the mapping v ↦ (⟨g1, v⟩, ..., ⟨gm, v⟩) as a dimensionality reduction procedure that preserves Euclidean inner products. And indeed, the above transformation is well known in the literature as the aforementioned Johnson-Lindenstrauss transform.
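As a quick numerical sketch of this estimator (ours, with arbitrary sizes n and m and a fixed seed), the empirical average of ⟨gi, v1⟩⟨gi, v2⟩ over m independent Gaussian rows approximates ⟨v1, v2⟩:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 512, 4096                       # ambient dimension, number of projections
v1, v2 = rng.standard_normal(n), rng.standard_normal(n)

G = rng.standard_normal((m, n))        # fully random (unstructured) Gaussian matrix
est = (G @ v1) @ (G @ v2) / m          # empirical average of <g_i, v1><g_i, v2>
```

The estimation error decays like 1/sqrt(m), which is why m can be taken much smaller than n.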

2. Angular distance

Now we want to express the angular distance between two given vectors (that are not necessarily of the same magnitude) as a function Λf. We take f as the Heaviside step function, i.e. f(x) = 1 for x ≥ 0 and f(x) = 0 otherwise. From basic properties of the Gaussian distribution [11] one can deduce that Λf(v1, v2) = 1/2 − θ(v1, v2)/(2π), where θ(v1, v2) is the angle between v1 and v2. Note that, since in this setting f takes values from a discrete set, the mapping v ↦ (f(⟨g1, v⟩), ..., f(⟨gm, v⟩)) is not only a dimensionality reduction, but in fact a hashing procedure encoding the angular distance between vectors in terms of the dot product between corresponding hashes taken from {0, 1}m.
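The hashing view can be checked the same way (a sketch with sizes of our choosing); for the Heaviside nonlinearity the expectation equals the standard Gaussian quantity 1/2 − θ/(2π), the probability that both projections are positive:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 64, 20000
v1, v2 = rng.standard_normal(n), rng.standard_normal(n)

G = rng.standard_normal((m, n))
h1 = (G @ v1 >= 0).astype(float)       # f = Heaviside step; hashes in {0, 1}^m
h2 = (G @ v2 >= 0).astype(float)
est = h1 @ h2 / m                      # empirical E[f(<r, v1>) f(<r, v2>)]

theta = np.arccos(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))
exact = 0.5 - theta / (2 * np.pi)      # both projections positive
```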

3. Arc-cosine and Gaussian kernels

The arc-cosine kernel [10] is parametrized by an integer p ≥ 0, with f(x) = xp for x ≥ 0 and f(x) = 0 otherwise. For p = 0 its computation reduces to the computation of the angular distance. If p = 1 then f is the linear rectifier. Higher-order arc-cosine kernels can be obtained by recursively applying that transformation and thus can be approximated by recursively applying the presented mechanism. Gaussian kernels can be computed by a similar transformation, with f replaced by trigonometric functions: sin and cos.
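A minimal sketch of the trigonometric variant (random Fourier features for the Gaussian kernel; the sizes and the scaling of the inputs are our choices, and the identity used is E[cos(⟨r, v1 − v2⟩)] = exp(−‖v1 − v2‖²/2) for r ~ N(0, I)):

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 32, 8192
v1, v2 = rng.standard_normal(n) / 4, rng.standard_normal(n) / 4

G = rng.standard_normal((m, n))
# pair of trigonometric nonlinearities; summing the cos and sin products
# estimates E[cos(<r, v1 - v2>)]
phi = lambda v: np.concatenate([np.cos(G @ v), np.sin(G @ v)]) / np.sqrt(m)
est = phi(v1) @ phi(v2)

exact = np.exp(-np.linalg.norm(v1 - v2) ** 2 / 2)   # Gaussian kernel value
```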

Our goal is to compute Λf(v1, ..., vk) efficiently. Since the random variable under the expectation in equation (1) is usually well concentrated around its mean Λf(v1, ..., vk), its straightforward computation gives a good quality approximation of Λf. This, as already mentioned, unfortunately usually requires O(mn) time and space, which is quadratic in n for m = Θ(n).

We are interested in providing a good quality approximation of Λf in subquadratic time and subquadratic (or even linear) space. To achieve this goal, we will replace the sequence of independent Gaussian vectors g1, ..., gm by Gaussian vectors that are no longer independent, yet provide us speed-ups in computations and reduce storage complexity. Our mechanism will use t independent Gaussian variables to construct structured matrices A, where ai stands for the i-th row. The parameter t enables us to make a smooth transition from the unstructured setting (large values of t), where we obtain stronger concentration results regarding Λf but need more space and computational time, to the structured setting (small values of t), where concentration results are weaker (yet still strong enough so that the entire mechanism can be applied in practice) but computation can be substantially sped up and the storage complexity is much smaller.

The core of our structured mechanism is the construction of our structured matrices. In the next subsection we will present it and show why several structured matrices considered so far in that context are very special cases of our general structured approach.

### 2.2 Structured linear projections

Consider a vector g = (g1, ..., gt) of independent Gaussian variables taken from N(0, 1). Let P = (P1, ..., Pm) be a sequence of matrices, where each Pi has t rows and n columns. We construct the rows of our structured matrix A as follows:

 ai=g⋅Pi (3)

for i = 1, ..., m. Thus the entire structured mechanism is defined by the parameter t (the "budget of randomness") and a sequence P = (P1, ..., Pm). We call it a P-model.

In practice we will not store the entire sequence P but just the matrix A that is obtained by applying the matrices of P to the vector g. We denote the j-th column of matrix Pi as pij. We will assume that the sequence P is normalized. {definition} (Normalization property) A sequence of matrices P = (P1, ..., Pm) is normalized if for any fixed i and j the column pij has l2-norm equal to one.
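For concreteness, a minimal sketch of the P-model (ours; it uses the circulant special case analyzed in Section 2.2, where t = n and Pi cyclically shifts the columns of the identity, so each row of A is a permutation of g and hence marginally Gaussian):

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 8, 4
g = rng.standard_normal(n)                               # budget of randomness, t = n

# P_i cyclically shifts the columns of the identity by i positions
Ps = [np.roll(np.eye(n), i, axis=1) for i in range(m)]
A = np.stack([g @ P for P in Ps])                        # rows a_i = g P_i

for P in Ps:                                             # columns are orthonormal:
    assert np.allclose(P.T @ P, np.eye(n))               # P is normalized
assert np.allclose(np.sort(A, axis=1), np.sort(g))       # each row permutes g
```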

Note that from the normalization property it follows that every ai is a Gaussian vector with elements from N(0, 1). We will use another useful notation, namely: σi1,i2(n1, n2) = ⟨pi1n1, pi2n2⟩ for i1, i2 ∈ {1, ..., m} and n1, n2 ∈ {1, ..., n}. Note that when i1 = i2 then σi1,i2(n1, n2) reduces to the cross-correlation between the n1-th column and the n2-th column of Pi1. If in addition n1 = n2, then by the normalization property σi1,i2(n1, n2) = 1.

We now define graphs associated with a given P-model that we call coherence graphs.

{definition}

(Coherence graphs) Let i, j ∈ {1, ..., m}. We define by Gi,j an undirected graph with the set of vertices V consisting of the 2-element subsets {n1, n2} for which σi,j(n1, n2) ≠ 0, and the set of edges E = {{e1, e2} : e1, e2 ∈ V, e1 ∩ e2 ≠ ∅}. In other words, edges are between those vertices for which the corresponding 2-element subsets intersect.

We denote by χ(i, j) the chromatic number of the graph Gi,j, i.e. the minimum number of colors that need to be used to color its vertices in such a way that no two adjacent vertices get the same color.

The correlation between different rows of the structured matrix A obtained from the sequence of matrices P = (P1, ..., Pm) and the "budget of randomness" g can be measured very accurately by three quantities that we will introduce right now. These quantities give a quantitative measure of the "structuredness" of a given matrix A and play an important role in establishing theoretical results for general structured models.

{definition}

(Chromatic number of a P-model) The chromatic number of a P-model is defined as:

 χ[P]=max1≤i,j≤mχ(i,j). (4)

Thus the chromatic number of a P-model is the maximum chromatic number over its coherence graphs.
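As a toy illustration (ours, using the circulant correlation pattern σi,j(n1, n2) ≠ 0 iff n1 − n2 ≡ i − j (mod n) derived in Section 2.2), one can build a single coherence graph and color it greedily; greedy coloring only upper-bounds the chromatic number:

```python
import numpy as np

n, i, j = 8, 0, 3
d = (i - j) % n
# vertices: 2-element subsets {n1, n2} with sigma_{i,j}(n1, n2) != 0,
# which in the circulant case means n1 - n2 = i - j (mod n)
vertices = sorted({frozenset({n1, n2}) for n1 in range(n) for n2 in range(n)
                   if n1 != n2 and (n1 - n2) % n == d}, key=sorted)
adj = {v: [w for w in vertices if w != v and v & w] for v in vertices}

color = {}
for v in vertices:                      # greedy coloring: an upper bound on chi(i, j)
    used = {color[w] for w in adj[v] if w in color}
    color[v] = min(c for c in range(len(vertices)) if c not in used)
bound = max(color.values()) + 1         # here the graph is a union of cycles
```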

{definition}

(Coherence and unicoherence of a P-model) The coherence of a P-model is defined as:

 μ[P] = max1≤i,j≤m √(∑1≤n1<n2≤n σi,j(n1, n2)2). (5)

The unicoherence of a P-model is given by the following formula:

 ~μ[P] = max1≤i<j≤m |∑n1=1n σi,j(n1, n1)|. (6)

We will show in the theoretical section that as long as χ[P], μ[P] and ~μ[P] are small enough, strong concentration results regarding the quality of the structured embedding can be derived. Below we show many classes of matrices A that can be constructed according to the presented mechanism and for which all three quantities have the desired orders of magnitude.

1. Circulant matrices

This is a flagship example of the structured approach [18, 25, 33, 34, 35, 40, 45]. In that setting t = n and the structured Gaussian matrix A is obtained from a single Gaussian vector g = (g0, ..., gn−1) by its right shifts, i.e. A is of the form:

 A_{circ} = \begin{pmatrix} g_0 & g_1 & \cdots & g_{n-1} \\ g_{n-1} & g_0 & \cdots & g_{n-2} \\ \vdots & \vdots & \ddots & \vdots \\ g_{n-m+1} & g_{n-m+2} & \cdots & g_{2n-m} \end{pmatrix}, (7)

where the operations on indices are taken modulo n. Matrix Acirc can be obtained from the presented pipeline by using the budget of randomness t = n and a sequence of matrices P = (P1, ..., Pm) defined as follows:

Pi is obtained from the identity matrix In by cyclically shifting its columns to the right by i − 1 positions (so in particular P1 = In); each Pi is a permutation matrix with a single 1 in every row and column.

Each Pi is normalized. Furthermore:

 σi1,i2(n1, n2) = { 0 if n1 − n2 ≠ i1 − i2 (mod n), 1 otherwise. (8)

The above observation implies that each coherence graph Gi1,i2 is a collection of vertex-disjoint cycles (since each vertex has degree at most two), thus in particular χ(i1, i2) is at most 3 (see Figure 1 for an illustration). We conclude that χ[P] ≤ 3. One can also see that the coherence μ[P] and the unicoherence ~μ[P] are of the desired small order of magnitude.
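The promised speed-up for this case can be sketched concretely (ours, with toy sizes): multiplication by Acirc is a circular cross-correlation with g, so it costs O(n log n) via the FFT instead of O(mn):

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 16, 16
g = rng.standard_normal(n)
x = rng.standard_normal(n)

# dense circulant matrix: row i is g right-shifted i times, A[i, j] = g[(j - i) % n]
A = np.array([[g[(jj - ii) % n] for jj in range(n)] for ii in range(m)])

# FFT route: (A x)_i = sum_k g_k x_{(i+k) mod n}, a circular cross-correlation,
# computed via the correlation theorem in O(n log n) time
y_fft = np.fft.ifft(np.conj(np.fft.fft(g)) * np.fft.fft(x)).real

assert np.allclose(A @ x, y_fft)
```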

2. Toeplitz matrices

Toeplitz matrices, which are also used frequently in the structured setting, can be modeled by our mechanism by increasing the budget of randomness from t = n to t = n + m − 1. A Toeplitz Gaussian matrix is of the form:

 A_{Toeplitz} = \begin{pmatrix} g_0 & g_1 & g_2 & g_3 & \cdots & g_{n-1} \\ g_n & g_0 & g_1 & g_2 & \cdots & g_{n-2} \\ g_{n+1} & g_n & g_0 & g_1 & \cdots & g_{n-3} \\ g_{n+2} & g_{n+1} & g_n & g_0 & \cdots & g_{n-4} \\ \vdots & \vdots & & & \ddots & \vdots \\ g_{n+m-2} & g_{n+m-3} & \cdots & & & g_{n-m} \end{pmatrix} (9)

In other words, a Toeplitz matrix is constant along each diagonal. In that scenario each Pi is again a 0/1 matrix with exactly one nonzero entry per column: the j-th column of Pi selects the coordinate of g that appears at position (i, j) of AToeplitz. One can easily note that each Pi is normalized. For Toeplitz matrices we have:

 σi1,i2(n1, n2) = { 0 if n1 − n2 ≠ i1 − i2 (mod n), 1 − c(i1, i2, n1, n2) otherwise, (10)

for some c(i1, i2, n1, n2) > 0. By increasing the budget of randomness we managed to decrease the correlations σi1,i2(n1, n2), and that implies better concentration results. Bounds for χ[P], μ[P], ~μ[P] from the circulant setting are valid also here. Figures 1 and 2 demonstrate how increasing the "budget of randomness" decreases the chromatic numbers of the corresponding coherence graphs and thus also the parameter χ[P]. There we compare the circulant structured approach with the Toeplitz structured approach, where the "budget of randomness" is increased.
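The fast multiplication carries over to the Toeplitz case (a sketch of ours, with toy sizes): an m × n Toeplitz matrix embeds into an (n + m − 1)-dimensional circulant matrix, so its matrix-vector product also runs in O(n log n) time via the convolution theorem:

```python
import numpy as np

rng = np.random.default_rng(5)
n, m = 16, 8
r = rng.standard_normal(n)                 # first row of T
c = rng.standard_normal(m); c[0] = r[0]    # first column of T

T = np.array([[r[jj - ii] if jj >= ii else c[ii - jj] for jj in range(n)]
              for ii in range(m)])

# embed T into an (n + m - 1)-circulant whose first column stacks c with
# the reversed tail of r, then multiply by zero-padded x via the FFT
col = np.concatenate([c, r[:0:-1]])
x = rng.standard_normal(n)
z = np.concatenate([x, np.zeros(m - 1)])
y = np.fft.ifft(np.fft.fft(col) * np.fft.fft(z)).real[:m]

assert np.allclose(T @ x, y)
```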

3. Hankel matrices

These can be obtained in an analogous way to Toeplitz matrices, since a Hankel matrix is defined as one in which each ascending skew-diagonal from left to right is constant. Thus it is a reflected image of a Toeplitz matrix and in particular shares with it all the structural properties considered above.

4. Matrices with low displacement rank

Several classes of structured matrices can be described by a low value of a parameter called the displacement rank [21, 22, 38]. In particular, this holds for classes of matrices described by the formula:

 Aldr=r∑i=1Z1(gi)Z−1(hi), (11)

where: g1, h1, ..., gr, hr are given n-dimensional vectors and the matrices Z1, Z−1 are the circulant-shift and skew-circulant-shift matrix respectively (see [38] for definitions). Matrices Aldr have displacement rank at most r and cover such families as: circulant and skew-circulant matrices, Toeplitz matrices, inverses of Toeplitz matrices, products of the form T1T2 for Toeplitz matrices T1, T2, and all linear combinations of products of the form T1T2, where T1 and T2 are Toeplitz matrices or inverses of Toeplitz matrices [38].

Assume now that g1, h1, ..., gr, hr are independent Gaussian vectors. Note that then Aldr is a special instance of the P-model, where the budget of randomness is the concatenated vector g = (g1, h1, ..., gr, hr), the matrix P1 is obtained by vertically stacking the blocks corresponding to the terms of the sum, and each subsequent Pi is obtained from P1 by vertical circulant shifts applied block-wise.

There exist several purely deterministic and simple random constructions of the sequence P = (P1, ..., Pm) for which the considered parameters χ[P], μ[P], ~μ[P] of the related P-model are in the desired range of magnitude. For instance, fix some constant c and, for each column of each Pi, choose at random c nonzero dimensions, independently for each column. Choose the value of each nonzero dimension to be +1/√c with probability 1/2 and −1/√c otherwise, independently for each dimension.

In that setting each column of each Pi has norm equal to one. One can also note that the chromatic number χ[P] and the coherence μ[P] stay small. Furthermore, the unicoherence ~μ[P] is small with high probability if t is large enough (but still much smaller than in the fully unstructured setting). Thus these matrices can also be used in the algorithm we are about to present and are covered by our theoretical results. It was heuristically observed before that displacement rank is a useful parameter for tuning the level of "structuredness" of these matrices and that increasing r may potentially lead to better quality embeddings [38]. Our framework explains this: larger values of r trivially imply larger "budgets of randomness" and thus stronger concentration results for Λf. That decreases the value of the coherence and unicoherence of the related P-model and thus, as we will see in the theoretical section, improves concentration results.
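A sketch of the randomized construction just described (ours; the parameter values t, n, m, c are arbitrary choices, and the ±1/√c values make the column normalization explicit, so every entry of A is marginally N(0, 1)):

```python
import numpy as np

rng = np.random.default_rng(6)
t, n, m, c = 64, 32, 4, 8

def random_sparse_P():
    P = np.zeros((t, n))
    for col in range(n):
        rows = rng.choice(t, size=c, replace=False)       # c nonzero dimensions
        P[rows, col] = rng.choice([-1.0, 1.0], size=c) / np.sqrt(c)
    return P

Ps = [random_sparse_P() for _ in range(m)]
g = rng.standard_normal(t)
A = np.stack([g @ P for P in Ps])

# normalization property: every column of every P_i has l2-norm 1
assert all(np.allclose(np.linalg.norm(P, axis=0), 1.0) for P in Ps)
```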

### 2.3 The Algorithm

We are ready to describe a general algorithm for fast nonlinear embeddings via structured matrices. Consider a function

 Λf(v1,...,vk)=E[Ψ(β(f(y1,1),...,f(y1,k)),...,β(f(ym,1),...,f(ym,k)))], (12)

where: yi,j = ⟨gi, vj⟩ and the gi's are independent Gaussian vectors. We want to compute Λf efficiently for a dataset X of N n-dimensional points.

The structured matrix A that allows us to do this is constructed by choosing the budget of randomness g for a given parameter t and a sequence of matrices P = (P1, ..., Pm) such that each element of A has the same distribution as the corresponding element of the unstructured Gaussian matrix. By choosing different P's and budgets of randomness g one can smoothly balance between the speed of the transform/storage complexity and its quality.

Step 1: The dataset X is first preprocessed by multiplying each datapoint by a matrix built from an L2-normalized Hadamard matrix H and independent random diagonal matrices D1, D2, ..., where H is arbitrary and the nonzero entries of each Di are taken from the set {−1, +1}, each independently at random and with probability 1/2.

Step 2: The dataset is then transformed by multiplying it by the structured matrix A, and the function f is applied pointwise to each transformed datapoint. For any given v1, ..., vk the approximated value of Λf(v1, ..., vk) is calculated as:

 Ψ(β(vf,11,...,vf,k1),...,β(vf,1m,...,vf,km)), (13)

where vf,ji stands for the i-th dimension of vfj = f(Avj), with f applied pointwise (m is the dimension of vfj).

In practice, for Λf in the form of equation (2), equation (13) very often boils down to computing the standard dot product between vf1 and vf2.

By using the structured matrices listed in Section 2.2 one can significantly reduce the storage complexity of the entire computational mechanism. Indeed: {remark} Circulant, Toeplitz and Hankel matrices, as well as products/linear combinations of a constant number of Toeplitz matrices/inverses of Toeplitz matrices, can be stored in linear space. Hadamard matrices can be efficiently (i.e. in subquadratic time) computed on-the-fly and do not have to be stored. More importantly, the presented structured pipeline gives significant computational speed-ups over the standard approach requiring quadratic time. This is a direct implication of the fact that matrix-vector multiplication, which is the main computational bottleneck of nonlinear embedding pipelines, can be performed in subquadratic time for many classes of structured matrices covered by the presented scheme, in particular for all the special classes listed by us so far. Indeed:

{remark}

For classes of matrices with bounded displacement rank, matrix-vector multiplication can be performed in subquadratic time. These classes cover in particular: circulant and skew-circulant matrices, Toeplitz matrices, Hankel matrices, inverses of Toeplitz matrices, products of the form T1T2 for Toeplitz matrices T1, T2, and all linear combinations of products of the form T1T2, where T1 and T2 are Toeplitz matrices or inverses of Toeplitz matrices. For Toeplitz (and thus also circulant) matrices as well as for Hankel matrices the computation can be done in O(n log n) time.
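The on-the-fly claim for Hadamard matrices can also be made concrete with a fast Walsh-Hadamard transform; the sketch below (ours, for power-of-two sizes) applies the L2-normalized H in O(n log n) time without ever materializing H:

```python
import numpy as np

def fwht(v):
    """Apply the L2-normalized Walsh-Hadamard transform to v (len(v) a power of 2)."""
    v = np.asarray(v, dtype=float).copy()
    h = 1
    while h < len(v):
        for i in range(0, len(v), 2 * h):          # butterfly stage
            a = v[i:i + h].copy()
            b = v[i + h:i + 2 * h].copy()
            v[i:i + h], v[i + h:i + 2 * h] = a + b, a - b
        h *= 2
    return v / np.sqrt(len(v))                     # L2-normalization

x = np.arange(8.0)
assert np.allclose(fwht(fwht(x)), x)               # normalized H is an involution
```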

Some of the mentioned structured matrices were used before in the nonlinear embedding setting for certain functions f. However, to the best of our knowledge, we are the first to present a general structured framework that covers all these settings as very special subcases. Furthermore, we give rigorous theoretical results proving the quality of the structured approach for general nonlinear functions f. The nonlinear transformation f is what makes the entire theoretical analysis challenging and forces us to apply different techniques than those for the fast Johnson-Lindenstrauss transform.

## 3 Theoretical results

In this section we prove several concentration results regarding the presented structured mechanism. We start with the following observation.

{lemma}

Assume that Ψ is a linear function and that in a given P-model, for every i, any two columns of Pi are orthogonal. Then that P-model mechanism gives an unbiased estimation of Λf, i.e. for any given v1, ..., vk the following is true:

 E[Λstructf(v1,...,vk)]=Λf(v1,...,vk). (14)

We call the condition regarding the matrices Pi from the statement above the orthogonality condition. The orthogonality condition is trivially satisfied by Hankel, circulant or Toeplitz structured matrices produced by the P-model. It is also satisfied in expectation (which in practice suffices) for some structured models where the matrices Pi are constructed according to a random procedure. A linear Ψ is used in all the applications given by us. We want to note however that even if Ψ is not linear, strong concentration results (with an extra error term accounting for Ψ's nonlinearity) can be obtained, as we show in the section regarding concentration inequalities.

###### Proof.

Note that it suffices to show that every row ai of a structured matrix has the same distribution as the corresponding row of the unstructured matrix. If this is the case then for any given v1, ..., vk the distribution of (⟨ai, v1⟩, ..., ⟨ai, vk⟩) is the same as the distribution of its unstructured counterpart and the result follows from the linearity of expectation. The fact that the distribution of ai is the same as that of a fully random Gaussian row is implied by two observations. First, notice that by the way the Pi are constructed, the distribution of each dimension of ai is the same as the distribution of the corresponding dimension of the unstructured row. The independence of different dimensions of ai is an immediate consequence of the fact that projections of the "budget of randomness" Gaussian vector g onto orthogonal directions are independent, together with the assumed orthogonality condition regarding the P-model. ∎

From now on we will assume that a given P-model satisfies the orthogonality condition. We need to introduce a few useful definitions. {definition} We denote by ρβ(ϵ) the supremum of the expression |β(y) − β(y′)| over all pairs of vectors y, y′ from the domain that differ on at most one dimension and by at most ϵ.

In lots of applications (such as angular distance computation, or any Λf in the form of equation (2)) β is simply the product of its arguments. In that setting ρβ(ϵ) is bounded whenever f is bounded. In the angular distance setting, in particular, f takes values only in {0, 1}.

{definition}

For a function we denote

 ρΨ,βi=supy1,1,...,ym,k,y′1,1,...,y′m,k|h(y1,1,...,ym,k)−h(y′1,1,...,y′m,k)|, (15)

where h = Ψ(β(y1,1, ..., y1,k), ..., β(ym,1, ..., ym,k)) and the sequences (y1,1, ..., ym,k), (y′1,1, ..., y′m,k) differ only on the i-th coordinate.

For instance, for the angular distance setting (with Ψ the average of its arguments) we have: ρΨ,βi = 1/m.

Note that the value of the main computational block of Λf, namely:

 Bv1,...,vk=β(f(⟨r,v1⟩),...,f(⟨r,vk⟩)), (16)

depends only on the projection of r onto the linear space spanned by v1, ..., vk, not on the part orthogonal to it. Thus for fixed v1, ..., vk, the function Bv1,...,vk is in fact a function of rproj, where the i-th coordinate of rproj is the projection of r onto vi. We will measure how sensitive Bv1,...,vk is to perturbations of rproj using the following definition:

{definition}

Let be as in equation (1). Define:

 pλ,ϵ=P[supv1,...,vk,∥ζ∥∞≤ϵ|Bv1,...,vk(rproj+ζ)−Bv1,...,vk(rproj)|>λ], (17)

where the supremum is taken over all k-tuples v1, ..., vk of linearly independent vectors from the domain. We also denote

 ~βϵ=supv1,...,vk,∥ζ∥∞≤ϵ|E[Bv1,...,vk(rproj+ζ)]−E[Bv1,...,vk(rproj)]|. (18)

Example. If β is the product of its arguments (as is the case in most of the considered examples) and the data is taken from a bounded domain, then one can easily bound pλ,ϵ and ~βϵ.

Example - angular case. For the angular distance setting one can prove (see: Appendix) corresponding explicit bounds on pλ,ϵ and ~βϵ.

Example - general kernels. We say that a function f is c-Lipschitz if |f(x) − f(y)| ≤ c|x − y| for all x, y. If Λf is in the form of equation (2) and f is c-Lipschitz, then one can easily bound ~βϵ in terms of c, ϵ and the magnitude of the data; for instance, this is the case if all datapoints of X have bounded l2-norm.

{definition}

The Legendre Transform of a random variable X is defined as: LX(x) = supθ(θx − log E[eθX]), i.e. the Legendre transform of the logarithm of its moment generating function. For a k-tuple v1, ..., vk and a given ζ with ∥ζ∥∞ ≤ ϵ we denote

 Lζ,v1,...,vk(x)=LX(x), (19)

where: X = Bv1,...,vk(rproj + ζ).

If a nonlinear mapping f is unbounded we will assume that all datapoints are taken from a bounded set and that the tails of the random variables f(⟨r, v⟩) decay at least exponentially fast, with the decay governed by some constants c1 and c2. The latter conditions are trivially satisfied in most of the considered structured computations with unbounded f. In particular, if Λf is an arc-cosine kernel then suitable constants c1, c2 can be read off from the polynomial form of f. Our main result is stated below.

{theorem}

Let X be a dataset of n-dimensional points of size N. Let Λf be of the form:

 Λf(v1,...,vk)=E[Ψ(β(f(y1,1),...,f(y1,