The Unreasonable Effectiveness of Structured Random Orthogonal Embeddings


Krzysztof Choromanski*

Mark Rowland*
University of Cambridge
mr504@cam.ac.uk

Adrian Weller
University of Cambridge and Alan Turing Institute
aw665@cam.ac.uk

*Equal contribution.
Abstract

We examine a class of embeddings based on structured random matrices with orthogonal rows which can be applied in many machine learning applications including dimensionality reduction and kernel approximation. For both the Johnson-Lindenstrauss transform and the angular kernel, we show that we can select matrices yielding guaranteed improved performance in accuracy and/or speed compared to earlier methods. We introduce matrices with complex entries which give significant further accuracy improvement. We provide geometric and Markov chain-based perspectives to help understand the benefits, and empirical results which suggest that the approach is helpful in a wider range of applications.

1 Introduction

Embedding methods play a central role in many machine learning applications by projecting feature vectors into a new space (often nonlinearly), allowing the original task to be solved more efficiently. The new space might have more or fewer dimensions depending on the goal. Applications include the Johnson-Lindenstrauss Transform for dimensionality reduction (JLT, Johnson and Lindenstrauss, 1984) and kernel methods with random feature maps [Rahimi and Recht, 2007]. The embedding step itself can be costly to compute, hence many fast methods have been developed; see §1.1 for background and related work.

We present a general class of random embeddings based on particular structured random matrices with orthogonal rows, which we call random ortho-matrices (ROMs); see §2. We show that ROMs may be used for the applications above, in each case demonstrating improvements over previous methods in statistical accuracy (measured by mean squared error, MSE), in computational efficiency (while providing similar accuracy), or both. We highlight the following contributions:

• In §3: The Orthogonal Johnson-Lindenstrauss Transform (OJLT) for dimensionality reduction. We prove this has strictly smaller MSE than the previous unstructured JLT mechanisms. Further, OJLT is as fast as the fastest previous JLT variants (which are structured).

• In §4: Estimators for the angular kernel [Sidorov et al., 2014] which guarantee better MSE. The angular kernel is important for many applications, including natural language processing [Sidorov et al., 2014], image analysis [Jégou et al., 2011], speaker representations [Schmidt et al., 2014] and tf-idf data sets [Sundaram et al., 2013].

• In §5: Two perspectives on the effectiveness of ROMs to help build intuitive understanding.

In §6 we provide empirical results which support our analysis, and show that ROMs are effective for a still broader set of applications. Full details and proofs of all results are in the Appendix.

1.1 Background and related work

Our ROMs can have two forms (see §2 for details): (i) Gort is a random Gaussian matrix conditioned on rows being orthogonal; or (ii) an SD-product matrix is formed by multiplying some number k of SD blocks, each of which is highly structured, typically leading to fast computation of products. Here S is a particular structured matrix, and D is a random diagonal matrix; see §2 for full details. Our SD block generalizes an HD block, where H is a Hadamard matrix, which received previous attention. Earlier approaches to embeddings have explored using various structured matrices, including particular versions of one or other of our two forms, though in different contexts.

For dimensionality reduction, Ailon and Chazelle [2006] used a single HD block as a way to spread out the mass of a vector over all dimensions before applying a sparse Gaussian matrix. Choromanski and Sindhwani [2016] also used just one HD block as part of a larger structure. Bojarski et al. [2017] discussed using k SD blocks for locality-sensitive hashing methods but gave no concrete results for their application to dimensionality reduction or kernel approximation. All these works, and other earlier approaches [Hinrichs and Vybíral, 2011, Vybíral, 2011, Zhang and Cheng, 2013, Le et al., 2013, Choromanska et al., 2016], provided computational benefits by using structured matrices with less randomness than unstructured iid Gaussian matrices, but none demonstrated accuracy gains.

Yu et al. [2016] were the first to show that Gort-type matrices can yield improved accuracy, but their theoretical result applies only asymptotically for many dimensions, only for the Gaussian kernel, and only for one specific orthogonal transformation, which is one instance of the larger class we consider. Their theoretical result does not yield computational benefits. Yu et al. [2016] did explore using a number of HD blocks empirically, observing good computational and statistical performance for k = 3, but without any theoretical accuracy guarantees. It was left as an open question why matrices formed by a small number of HD blocks can outperform non-discrete transforms.

In contrast, we are able to prove that ROMs yield improved MSE in several settings and, for many of them, for any number of dimensions. In addition, SD-product matrices can deliver computational speed benefits. We provide initial analysis to understand why k = 3 can outperform the state-of-the-art, why odd k yields better results than even k, and why higher values of k deliver decreasing additional benefits (see §3 and §5).

2 The family of Random Ortho-Matrices (ROMs)

Random ortho-matrices (ROMs) are taken from two main classes of distributions defined below that require the rows of sampled matrices to be orthogonal. A central theme of the paper is that this orthogonal structure can yield improved statistical performance. We shall use bold uppercase (e.g. M) to denote matrices and bold lowercase (e.g. x) for vectors.

Gaussian orthogonal matrices. Let G be a random matrix taking values in R^{n×n} with iid N(0, 1) elements, which we refer to as an unstructured Gaussian matrix. The first distribution for generating ROMs we consider produces orthogonal variants Gort of G, and may be viewed as the regular conditional probability measure associated with G conditioned on the event that all rows of the matrix are orthogonal. Realizations of Gort can be generated by, for example, performing a Gram-Schmidt process on the rows of G to obtain a set of orthonormal rows, and then randomly independently scaling each row so that marginally it has the distribution of an n-dimensional Gaussian vector. The orthogonality of the rows of this matrix has been observed to yield improved statistical properties for randomized algorithms built from the matrix in a variety of applications.
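As an illustration (our own sketch, not taken from the paper), a realization of Gort can be sampled with a QR decomposition in place of explicit Gram-Schmidt, followed by independent rescaling of each row to the norm distribution of an n-dimensional Gaussian vector:

```python
import numpy as np

def sample_gort(n, rng):
    """Sketch of a Gort sampler: orthonormalize the rows of an iid Gaussian
    matrix, then rescale each row by an independent chi_n-distributed norm
    so that each row marginally looks like a Gaussian vector.
    (For an exactly Haar-distributed orthonormal factor one should also fix
    the signs of the diagonal of R; omitted here for brevity.)"""
    G = rng.standard_normal((n, n))
    Q, _ = np.linalg.qr(G)          # columns of Q are orthonormal
    norms = np.sqrt(rng.chisquare(df=n, size=n))
    return norms[:, None] * Q.T     # rows orthogonal, with random norms

rng = np.random.default_rng(0)
M = sample_gort(8, rng)
# Rows are exactly orthogonal: M M^T is diagonal (the squared row norms).
print(np.abs(M @ M.T - np.diag(np.sum(M**2, axis=1))).max() < 1e-8)
```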

SD-product matrices. Our second class of distributions is motivated by the desire to obtain similar statistical benefits of orthogonality to Gort, whilst gaining computational efficiency by employing more structured matrices. We call this second class SD-product matrices. These take the more structured form ∏_{i=1}^{k} S D_i, where S ∈ R^{n×n} has orthogonal rows and |S_{ij}| = 1/√n for all i, j; and the D_i are independent diagonal matrices described below. By ∏_{i=1}^{k} A_i, we mean the matrix product A_k ⋯ A_1. This class includes as particular cases several recently introduced random matrices (e.g. Andoni et al., 2015, Yu et al., 2016), where good empirical performance was observed. We go further to establish strong theoretical guarantees; see §3 and §4.

A prominent example of an S matrix is the normalized Hadamard matrix H, defined recursively by H_1 = (1), and then for i > 1,

 H_i = \frac{1}{\sqrt{2}} \begin{pmatrix} H_{i-1} & H_{i-1} \\ H_{i-1} & -H_{i-1} \end{pmatrix}.

Importantly, matrix-vector products with H are computable in O(n log n) time via the fast Walsh-Hadamard transform, yielding large computational savings. In addition, H matrices enable a significant space advantage: since the fast Walsh-Hadamard transform can be computed without explicitly storing H, only O(n) space is required to store the diagonal elements of each D_i. Note that H matrices are defined only for n a power of 2, but if needed, one can always adjust data by padding with 0s to enable the use of ‘the next larger’ H, doubling the number of dimensions in the worst case.
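The recursion and the fast transform can be sketched as follows (an illustrative implementation; the function names are ours):

```python
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard transform (Sylvester ordering) using
    O(n log n) additions; input length must be a power of 2.
    Unnormalized: multiply the result by n**-0.5 to apply the
    normalized Hadamard matrix used in the text."""
    x = np.array(x, dtype=float)
    n = len(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b
            x[i + h:i + 2 * h] = a - b
        h *= 2
    return x

def hadamard(n):
    """Explicit (unnormalized) Hadamard matrix via the recursive definition."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

x = np.arange(8.0)
print(np.allclose(fwht(x), hadamard(8) @ x))  # → True
```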

H matrices are representatives of a much larger family in S which also attains computational savings. These are L2-normalized versions of Kronecker-product matrices of the form A_1 ⊗ … ⊗ A_l for l ∈ N, where ⊗ stands for a Kronecker product and the blocks A_i have entries of the same magnitude and pairwise orthogonal rows. For these matrices, matrix-vector products are computable in O(n log n) time [Z. et al., 2015].

S includes also the Walsh matrices W, where W_{ij} = n^{-1/2} (−1)^{i_B · j_B}, and i_B, j_B are the binary representations of i and j respectively.

For the diagonal matrices D, we mainly consider Rademacher entries, leading to the following matrices.

Definition 2.1.

The S-Rademacher random matrix M^{(k)}_{SR} with k ∈ N blocks is given in Equation (1) below, where D^{(R)}_1, …, D^{(R)}_k are diagonal matrices with iid Rademacher random variables [i.e. Unif({−1, 1})] on the diagonals:

 M^{(k)}_{SR} = \prod_{i=1}^{k} S D^{(R)}_i. \qquad (1)
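A dense sketch of the construction in Definition 2.1 (our own illustration), with S taken to be the normalized Hadamard matrix as used later in the paper; in practice one would apply k fast Hadamard transforms to vectors rather than forming the matrix:

```python
import numpy as np

def sd_rademacher(n, k, rng):
    """Build M^(k)_SR = prod_i S D_i^(R) densely, with S the normalized
    Hadamard matrix and D_i iid Rademacher diagonal matrices."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    S = H / np.sqrt(n)                       # orthogonal, |S_ij| = 1/sqrt(n)
    M = np.eye(n)
    for _ in range(k):
        D = np.diag(rng.choice([-1.0, 1.0], size=n))
        M = S @ D @ M                        # left-multiply the next SD block
    return M

rng = np.random.default_rng(1)
M = sd_rademacher(16, 3, rng)
# Each factor S D_i is orthogonal, so the product has orthonormal rows:
print(np.allclose(M @ M.T, np.eye(16)))  # → True
```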

Having established the two classes of ROMs, we next apply them to dimensionality reduction.

3 The Orthogonal Johnson-Lindenstrauss Transform (OJLT)

Let X ⊂ R^n be a dataset of n-dimensional real vectors. The goal of dimensionality reduction via random projections is to transform linearly each x ∈ X by a random mapping x ↦ x′, where x′ = Mx for some random M ∈ R^{m×n} with m < n, such that for any x, y ∈ X the following holds: (x′)ᵀy′ ≈ xᵀy. If we furthermore have E[(x′)ᵀy′] = xᵀy, then the dot-product estimator is unbiased. In particular, this dimensionality reduction mechanism should in expectation preserve information about vectors’ norms, i.e. we should have E[‖x′‖²] = ‖x‖² for any x ∈ X.

The standard JLT mechanism uses the randomized linear map x ↦ (1/√m) Gx, where G ∈ R^{m×n} is an unstructured Gaussian matrix as in §2, requiring mn multiplications to evaluate. Several fast variants (FJLTs) have been proposed by replacing G with random structured matrices, such as sparse or circulant Gaussian matrices [Ailon and Chazelle, 2006, Hinrichs and Vybíral, 2011, Vybíral, 2011, Zhang and Cheng, 2013]. The fastest of these variants has O(n log n) time complexity, but at a cost of higher MSE for dot-products.

Our Orthogonal Johnson-Lindenstrauss Transform (OJLT) is obtained by replacing the unstructured random matrix G with a sub-sampled ROM from §2: either Gort, or the S-Rademacher ROM M^{(k)}_{SR}, in the latter case sub-sampling m rows of the left-most S matrix in the product. We sub-sample since m < n. We typically assume uniform sub-sampling without replacement. The resulting dot-product estimators for vectors x, y ∈ X are given by:

 \hat{K}^{base}_m(x,y) = \frac{1}{m} (Gx)^\top (Gy) \quad \text{[unstructured iid baseline, previous state-of-the-art accuracy]}, \qquad (2)
 \hat{K}^{ort}_m(x,y) = \frac{1}{m} (G^{sub}_{ort} x)^\top (G^{sub}_{ort} y),
 \hat{K}^{(k)}_m(x,y) = \frac{1}{m} (M^{(k),sub}_{SR} x)^\top (M^{(k),sub}_{SR} y).

We contribute the following closed-form expressions, which quantify precisely the MSEs for these three estimators. See the Appendix for detailed proofs of these results and all others in this paper.

Lemma 3.1.

The unstructured JLT dot-product estimator K̂^{base}_m of xᵀy using m-dimensional random feature maps is unbiased, with

 MSE(\hat{K}^{base}_m(x,y)) = \frac{1}{m} \left( (x^\top y)^2 + \|x\|_2^2 \|y\|_2^2 \right).
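A quick Monte Carlo sanity check of Lemma 3.1 (our own illustration; the constants n, m and the number of trials are arbitrary): the empirical MSE of the unstructured estimator should match (1/m)((xᵀy)² + ‖x‖²‖y‖²):

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, trials = 32, 8, 20_000
x = rng.standard_normal(n)
y = rng.standard_normal(n)

# Draw many independent unstructured Gaussian matrices at once.
G = rng.standard_normal((trials, m, n))
ests = np.sum((G @ x) * (G @ y), axis=1) / m    # K_base for each trial

closed_form = (np.dot(x, y) ** 2 + np.dot(x, x) * np.dot(y, y)) / m
mse = np.mean((ests - np.dot(x, y)) ** 2)
print(abs(mse / closed_form - 1.0) < 0.1)       # empirical MSE matches
```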

Theorem 3.2.

The estimator K̂^{ort}_m is unbiased and satisfies

 MSE(\hat{K}^{ort}_m(x,y)) = \frac{n(m-1)}{4m} \|x\|^2 \|y\|^2 \int_0^{\pi/2} f_n(\phi) \cos^2(\phi) \, Q(\phi) \, d\phi + MSE(\hat{K}^{base}_m(x,y)), \qquad (3)

where f_n stands for the density function of the random variable ϕ that measures the angle between an n-dimensional mean-zero Gaussian vector with identity covariance matrix and a fixed (arbitrary) 2-dimensional linear subspace, and Q is a function of ϕ and θ arising in the proof (see the Appendix), where θ stands for the angle between x and y.

Theorem 3.3 (Key result).

The OJLT estimator K̂^{(k)}_m(x, y) with k SD blocks, using m-dimensional random feature maps and a uniform sub-sampling policy without replacement, is unbiased, with

 MSE(\hat{K}^{(k)}_m(x,y)) = \frac{1}{m} \left( \frac{n-m}{n-1} \right) \Bigg( \left( (x^\top y)^2 + \|x\|^2 \|y\|^2 \right) + \sum_{r=1}^{k-1} \frac{(-1)^r 2^r}{n^r} \left( 2 (x^\top y)^2 + \|x\|^2 \|y\|^2 \right) + \frac{(-1)^k 2^k}{n^{k-1}} \sum_{i=1}^{n} x_i^2 y_i^2 \Bigg). \qquad (4)
Proof (Sketch).

For k = 1, the random projection matrix is given by sub-sampling rows from S D^{(R)}_1, and the computation can be carried out directly. For k > 1, the proof proceeds by induction. The random projection matrix in the general case is given by sub-sampling rows of the matrix S D^{(R)}_k ⋯ S D^{(R)}_1. By writing the MSE as an expectation and using the law of conditional expectations, conditioning on the values of the first k − 1 random matrices D^{(R)}_1, …, D^{(R)}_{k−1}, the statement of the theorem for one block and for k − 1 blocks can be neatly combined to yield the result. ∎

To our knowledge, it has not previously been possible to provide theoretical guarantees that SD-product matrices outperform iid matrices. Combining Lemma 3.1 with Theorem 3.3 yields the following important result.

Corollary 3.4 (Theoretical guarantee of improved performance).

Estimators K̂^{(k)}_m (with sub-sampling without replacement) yield guaranteed lower MSE than K̂^{base}_m.

It is not yet clear when K̂^{(k)}_m is better or worse than K̂^{ort}_m; we explore this empirically in §6. Theorem 3.3 shows that there are diminishing MSE benefits to using a large number k of SD blocks. Interestingly, odd k is better than even: it is easy to observe from Equation (4) that the final term reduces the MSE for odd k but increases it for even k. These observations, and those in §5, help to understand why empirically k = 3 was previously observed to work well [Yu et al., 2016].
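The diminishing returns from additional blocks can be seen by evaluating the closed form of Theorem 3.3 directly (our own illustration; the helper mse_ojlt and all constants are ours):

```python
import numpy as np

def mse_ojlt(x, y, m, k):
    """Closed-form MSE of the OJLT estimator from Equation (4)."""
    n = len(x)
    dot2 = np.dot(x, y) ** 2
    norms = np.dot(x, x) * np.dot(y, y)
    s = dot2 + norms
    for r in range(1, k):
        s += (-1) ** r * 2.0 ** r / n ** r * (2 * dot2 + norms)
    s += (-1) ** k * 2.0 ** k / n ** (k - 1) * np.sum(x ** 2 * y ** 2)
    return (n - m) / (n - 1) * s / m

rng = np.random.default_rng(3)
n, m = 16, 4
x = rng.standard_normal(n)
y = rng.standard_normal(n)
mse_base = (np.dot(x, y) ** 2 + np.dot(x, x) * np.dot(y, y)) / m  # Lemma 3.1
mses = [mse_ojlt(x, y, m, k) for k in range(1, 6)]
print(mses[0] < mse_base)              # Corollary 3.4 for k = 1
diffs = [abs(b - a) for a, b in zip(mses, mses[1:])]
print(diffs[0] > diffs[1] > diffs[2])  # each extra block changes the MSE less
```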

If we take S to be a normalized Hadamard matrix H, then even though we are using sub-sampling, and hence the full computational benefits of the fast Walsh-Hadamard transform are not available, M^{(k),sub}_{SR} still achieves improved MSE compared to the base method with less computational effort, as follows.

Lemma 3.5.

There exists an algorithm (see Appendix for details) which computes an embedding for a given datapoint x using M^{(k),sub}_{SR}, with S set to H and a uniform sub-sampling policy, in expected time O(kn log n).

Note that for m = ω(k log n), this time complexity is smaller than the brute-force Θ(mn). The algorithm uses a simple observation that one can reuse calculations conducted for the upper half of the Hadamard matrix while performing computations involving rows from its other half, instead of running these calculations from scratch (details in the Appendix).

An alternative to sampling without replacement is deterministically to choose the first m rows. In our experiments in §6, these two approaches yield the same empirical performance, though we expect that the deterministic method could perform poorly on adversarially chosen data. The first m rows approach can be realized in time O((k−1) n log n + n + m log m) per datapoint.
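One elementary way to see why the first m rows are cheap (our own illustration, not the paper's algorithm): for row indices i < m, the Sylvester-ordered Hadamard entry H_n[i, j] = (−1)^{popcount(i & j)} depends only on j mod m, so the first m outputs of the length-n transform coincide with a length-m transform of folded column sums, costing O(n + m log m):

```python
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard transform (Sylvester ordering), length a power of 2."""
    x = np.array(x, dtype=float)
    n = len(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            a, b = x[i:i + h].copy(), x[i + h:i + 2 * h].copy()
            x[i:i + h], x[i + h:i + 2 * h] = a + b, a - b
        h *= 2
    return x

rng = np.random.default_rng(4)
n, m = 64, 8
x = rng.standard_normal(n)
full = fwht(x)[:m]                                # O(n log n) reference
folded = fwht(x.reshape(n // m, m).sum(axis=0))   # O(n + m log m) shortcut
print(np.allclose(full, folded))  # → True
```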

Theorem 3.3 is a key result in this paper, demonstrating that SD-product matrices yield both statistical and computational improvements compared to the base iid procedure, which is widely used in practice. We next show how to obtain further gains in accuracy.

3.1 Complex variants of the OJLT

We show that the MSE benefits of Theorem 3.3 may be markedly improved by using SD-product matrices with complex entries. Specifically, we consider the variant S-Hybrid random matrix M^{(k)}_{SH} below, where D^{(U)}_k is a diagonal matrix with iid Unif(S¹) random variables on the diagonal, independent of the D^{(R)}_i, and S¹ is the unit circle of C. We use the real part of the Hermitian product between projections as a dot-product estimator; recalling the definitions of §2, we use:

 M^{(k)}_{SH} = S D^{(U)}_k \prod_{i=1}^{k-1} S D^{(R)}_i, \qquad \hat{K}^{H,(k)}_m(x,y) = \frac{1}{m} \mathrm{Re}\!\left[ \left( \overline{M^{(k),sub}_{SH} x} \right)^{\!\top} \left( M^{(k),sub}_{SH} y \right) \right]. \qquad (5)
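A small sanity check of the construction in Equation (5) (our own sketch): every factor S D is unitary, so without any sub-sampling the estimator recovers xᵀy exactly; randomness matters only through the sub-sampling of rows:

```python
import numpy as np

def hadamard_normalized(n):
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

rng = np.random.default_rng(5)
n, k = 16, 3
S = hadamard_normalized(n)
M = np.eye(n, dtype=complex)
for i in range(k):
    if i < k - 1:
        d = rng.choice([-1.0, 1.0], size=n)         # Rademacher blocks
    else:
        d = np.exp(2j * np.pi * rng.random(n))      # Unif(S^1) block, leftmost
    M = S @ np.diag(d) @ M

# Every factor is unitary, hence so is M:
print(np.allclose(M.conj().T @ M, np.eye(n)))
x, y = rng.standard_normal(n), rng.standard_normal(n)
est = np.real(np.conj(M @ x) @ (M @ y))
print(np.isclose(est, x @ y))   # exact when no rows are sub-sampled
```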

Remarkably, this complex variant yields exactly half the MSE of the OJLT estimator.

Theorem 3.6.

The estimator K̂^{H,(k)}_m, applying uniform sub-sampling without replacement, is unbiased and satisfies: MSE(K̂^{H,(k)}_m(x,y)) = ½ MSE(K̂^{(k)}_m(x,y)).

This large factor-of-2 improvement could instead be obtained by doubling m for K̂^{(k)}_m. However, this would require doubling the number of parameters for the transform, whereas the S-Hybrid estimator requires additional storage only for the complex parameters in the matrix D^{(U)}_k. Strikingly, it is straightforward to extend the proof of Theorem 3.6 (see Appendix) to show that rather than taking the complex random variables in M^{(k)}_{SH} to be Unif(S¹), it is possible to take them to be Unif({1, −1, i, −i}) and still obtain exactly the same benefit in MSE.

Theorem 3.7.

For the estimator K̂^{H,(k)}_m defined in Equation (5): replacing the random matrix D^{(U)}_k (which has iid Unif(S¹) elements on the diagonal) with instead a random diagonal matrix having iid Unif({1, −1, i, −i}) elements on the diagonal, does not affect the MSE of the estimator.

It is natural to wonder if using an SD-product matrix with more complex random variables (for all k blocks) would improve performance still further. However, interestingly, this appears not to be the case; details are provided in the Appendix §A.7.

3.2 Sub-sampling with replacement

Our results above focus on SD-product matrices whose rows have been sub-sampled without replacement. Sometimes (e.g. for parallelization) it can be convenient instead to sub-sample with replacement. As might be expected, this leads to worse MSE, which we can quantify precisely.

Theorem 3.8.

For each of the estimators K̂^{(k)}_m and K̂^{H,(k)}_m, if uniform sub-sampling with (rather than without) replacement is used, then the MSE is worsened by a multiplicative constant of (n−1)/(n−m).

4 Kernel methods with ROMs

ROMs can also be used to construct high-quality random feature maps for non-linear kernel approximation. We analyze here the angular kernel, an important example of a Pointwise Nonlinear Gaussian kernel (PNG). Random feature maps for PNGs can be computed by first applying random linear projections based on unstructured Gaussian matrices G, and then applying pointwise some non-linear function f which depends on the PNG.

Definition 4.1.

For a given function f, the Pointwise Nonlinear Gaussian kernel (PNG) is defined by K(x, y) = E[f(gᵀx) f(gᵀy)], where g is a Gaussian vector with iid N(0, 1) entries.

Many prominent examples of kernels [Williams, 1998, Cho and Saul, 2009] are PNGs. Wiener’s tauberian theorem shows that all stationary kernels may be approximated arbitrarily well by sums of PNGs [Samo and Roberts, 2015]. In future work we hope to explore whether ROMs can be used to achieve statistical benefit in estimation tasks associated with a wide range of PNGs; here we focus on one kernel in particular.

The nonlinear mapping f(x) = sign(x) corresponds to the PNG which is the angular kernel, defined by K^{ang}(x, y) = 1 − 2θ_{x,y}/π, where θ_{x,y} is the angle between x and y.

Estimators of the PNGs we consider are of the form:

 \hat{K}_m(x,y) = \frac{1}{m} f(Mx)^\top f(My), \qquad (6)

where M ∈ R^{m×n} is a random matrix, and, given v = (v_1, …, v_m)ᵀ, f(v) means (f(v_1), …, f(v_m))ᵀ. Such estimation procedures are heavily used in practice [Rahimi and Recht, 2007], as they allow fast approximate linear methods to be used [Joachims, 2006]. If M = G, the unstructured Gaussian matrix, then we obtain the standard random feature estimator. Instead, we shall use matrices from the ROMs family.

When constructing random feature maps for kernels, very often m > n. In this case, our structured mechanism can be applied by concatenating some number of independent structured blocks. Our theoretical guarantees will be given just for one block, but can easily be extended to a larger number of blocks since different blocks are independent.

The standard random feature approximation K̂^{ang}_m for approximating the angular kernel is defined by taking M to be G, the unstructured Gaussian matrix, in Equation (6), and satisfies the following.

Lemma 4.2.

The estimator K̂^{ang}_m is unbiased, and MSE(K̂^{ang}_m(x,y)) = 4 θ_{x,y} (π − θ_{x,y}) / (m π²).
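A Monte Carlo check of Lemma 4.2 (our own illustration, with arbitrary constants): the sign random feature estimator is unbiased for 1 − 2θ/π, with MSE matching the closed form:

```python
import numpy as np

rng = np.random.default_rng(6)
n, m, trials = 16, 32, 20_000
x = rng.standard_normal(n)
y = rng.standard_normal(n)
theta = np.arccos(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))
k_true = 1 - 2 * theta / np.pi           # exact angular kernel value

# Sign random features based on unstructured Gaussian matrices.
G = rng.standard_normal((trials, m, n))
ests = np.mean(np.sign(G @ x) * np.sign(G @ y), axis=1)

closed_form = 4 * theta * (np.pi - theta) / (m * np.pi ** 2)
print(abs(np.mean(ests) - k_true) < 0.01)                            # unbiased
print(abs(np.mean((ests - k_true) ** 2) / closed_form - 1.0) < 0.1)  # MSE matches
```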

Our main result regarding angular kernels states that if we instead take M = Gort in Equation (6), then we obtain an estimator K̂^{ang,ort}_m with strictly smaller MSE, as follows.

Theorem 4.3.

Estimator K̂^{ang,ort}_m is unbiased and satisfies:

 MSE(\hat{K}^{ang,ort}_m(x,y)) < MSE(\hat{K}^{ang}_m(x,y)).

We also derive a formula for the MSE of an estimator K̂^{ang,M}_m of the angular kernel which replaces G with an arbitrary random matrix M and uses m random feature maps. The formula is helpful to see how the quality of the estimator depends on the probabilities that the projections of the rows of M are contained in some particular convex regions of the 2-dimensional space spanned by datapoints x and y. For an illustration of the geometric definitions introduced in this Section, see Figure 1. The formula depends on probabilities involving events A_i = {sign(r_iᵀx) ≠ sign(r_iᵀy)}, where r_i stands for the i-th row of the structured matrix. Notice that A_i = {r_i^{proj} ∈ C_{x,y}}, where r_i^{proj} stands for the projection of r_i into span{x, y} and C_{x,y} is the union of two cones in span{x, y}, each of angle θ_{x,y}.

Theorem 4.4.

Estimator K̂^{ang,M}_m satisfies the following, where δ_{i,j} = P[A_i ∩ A_j] − P[A_i] P[A_j]:

 MSE(\hat{K}^{ang,M}_m(x,y)) = \frac{1}{m^2} \left[ m - \sum_{i=1}^{m} (1 - 2 P[A_i])^2 \right] + \frac{4}{m^2} \left[ \sum_{i=1}^{m} \left( P[A_i] - \frac{\theta_{x,y}}{\pi} \right)^2 + \sum_{i \neq j} \delta_{i,j} \right].

Note that the probabilities P[A_i] and the δ_{i,j} depend on the choice of M. It is easy to prove that for both the unstructured G and Gort we have: P[A_i] = θ_{x,y}/π. Further, from the independence of the rows of G, δ_{i,j} = 0 for i ≠ j. For unstructured G we thus obtain Lemma 4.2. Interestingly, we see that to prove Theorem 4.3, it suffices to show δ_{i,j} < 0 for Gort, which is the approach we take (see Appendix). If we replace G with M^{(k)}_{SR}, then by symmetry the term δ_{i,j} does not depend on (i, j). Hence, the angular kernel estimator based on Hadamard matrices gives smaller MSE if and only if δ_{1,2} < 0. It is not yet clear if this holds in general.

5 Understanding the effectiveness of orthogonality

Here we build intuitive understanding for the effectiveness of ROMs. We examine geometrically the angular kernel (see §4), then discuss a connection to random walks over orthogonal matrices.

Angular kernel.

As noted above for the Gort-mechanism, smaller MSE than that for the unstructured G is implied by the inequality P[A_i ∩ A_j] < P[A_i] P[A_j], which is equivalent to: P[r_j^{proj} ∈ C_{x,y} | r_i^{proj} ∈ C_{x,y}] < P[r_j^{proj} ∈ C_{x,y}]. Now it becomes clear why orthogonality is crucial. Without loss of generality take i = 1, j = 2, and let r_1 and r_2 be the first two rows of Gort.

Consider first the extreme case n = 2 (middle of left part of Figure 1), where all vectors are 2-dimensional. Recall the definitions from just after Theorem 4.3. If r_1 is in C_{x,y}, then it is much less probable for r_2 also to belong to C_{x,y}. In particular, if θ_{x,y} ≤ π/2 then the probability is zero. That implies the inequality. On the other hand, if r_1 is perpendicular to span{x, y}, then conditioning on it does not have any effect on the probability that r_2 belongs to C_{x,y} (left subfigure of Figure 1). In practice, with high probability the angle ϕ between r_1 and span{x, y} is close to π/2, but is not exactly π/2. That again implies that, conditioned on the projection of r_1 into span{x, y} being in C_{x,y}, the more probable directions of the projection of r_2 are perpendicular to that of r_1 (see the ellipsoid-like shape in the right subfigure of Figure 1, which is the projection of the sphere taken from the (n−1)-dimensional space orthogonal to r_1 into span{x, y}). This makes it less probable for the projection of r_2 to be also in C_{x,y}. The effect is subtle, since ϕ is typically very close to π/2, but this is what provides the superiority of the orthogonal transformations over state-of-the-art ones in the angular kernel approximation setting.
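The n = 2 extreme case can be checked directly (our own illustration): for an exactly orthogonal pair of directions in the plane, both can never fall in a union-of-cones region of angle θ ≤ π/2, whereas independent directions can:

```python
import numpy as np

rng = np.random.default_rng(7)
theta = np.pi / 3            # cone angle; any value <= pi/2 shows the effect
N = 100_000
a = rng.uniform(0, 2 * np.pi, N)   # direction of r1 in the plane
b = rng.uniform(0, 2 * np.pi, N)   # an independent direction

def in_cone(angle):
    # C_{x,y}: the union of two opposite cones, i.e. directions whose
    # angle modulo pi falls inside an arc of length theta.
    return (angle % np.pi) < theta

both_orth = np.sum(in_cone(a) & in_cone(a + np.pi / 2))  # exactly orthogonal pair
both_iid = np.sum(in_cone(a) & in_cone(b))               # independent pair
print(both_orth == 0)   # orthogonal pair: never both inside
print(both_iid > 0)     # independent pair: frequently both inside
```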

Markov chain perspective.

We focus on Hadamard-Rademacher random matrices , a special case of the -product matrices described in Section 2. Our aim is to provide intuition for how the choice of affects the quality of the random matrix, following our earlier observations just after Corollary 3.4, which indicated that for -product matrices, odd values of yield greater benefits than even values, and that there are diminishing benefits from higher values of . We proceed by casting the random matrices into the framework of Markov chains.

Definition 5.1.

The Hadamard-Rademacher process in n dimensions is the Markov chain (X_t)_{t≥0} taking values in the orthogonal group O(n), with X_0 = I almost surely, and X_t = H̃ D_t X_{t−1} almost surely for t ≥ 1, where H̃ is the normalized Hadamard matrix in n dimensions, and the (D_t)_{t≥1} are iid diagonal matrices with independent Rademacher random variables on their diagonals.

Constructing an estimator based on Hadamard-Rademacher matrices is equivalent to simulating several time steps from the Hadamard-Rademacher process. The quality of estimators based on Hadamard-Rademacher random matrices comes from a quick mixing property of the corresponding Markov chain. The following demonstrates attractive properties of the chain in low dimensions.

Proposition 5.2.

The Hadamard-Rademacher process in two dimensions: explores a state-space of 16 orthogonal matrices, is ergodic with respect to the uniform distribution on this set, has period 2, the diameter of the Cayley graph of its state space is 3, and the chain is fully mixed after 3 time steps.

This proposition, and the Cayley graph corresponding to the Markov chain’s state space (Figure 1 right), illustrate the fast mixing properties of the Hadamard-Rademacher process in low dimensions; this agrees with the observations in §3 that there are diminishing returns associated with using a large number of blocks in an estimator. The observation in Proposition 5.2 that the Markov chain has period 2 indicates that we should expect different behavior for estimators based on odd and even numbers of blocks of matrices, which is reflected in the analytic expressions for MSE derived in Theorems 3.3 and 3.6 for the dimensionality reduction setup.
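The state space in Proposition 5.2 can be enumerated by brute force (our own sketch), confirming the 16 states and the Cayley-graph diameter of 3:

```python
import numpy as np
from itertools import product

# Breadth-first search over products of the generators H~ D for n = 2.
H = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2)
gens = [H @ np.diag(d) for d in product([1.0, -1.0], repeat=2)]

def key(M):
    return tuple(np.round(M, 8).ravel())

frontier = {key(np.eye(2)): np.eye(2)}
seen = dict(frontier)
depth = 0
while frontier:
    depth += 1
    nxt = {}
    for M in frontier.values():
        for g in gens:
            N = g @ M
            if key(N) not in seen:
                nxt[key(N)] = N
    seen.update(nxt)
    frontier = nxt

print(len(seen))    # → 16 reachable orthogonal matrices
print(depth - 1)    # → 3  (every state reached within 3 steps)
```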

6 Experiments

We present comparisons of estimators introduced in §3 and §4, illustrating our theoretical results, and further demonstrating the empirical success of ROM-based estimators at the level of Gram matrix approximation. We compare estimators based on: unstructured Gaussian matrices G, matrices Gort, and S-Rademacher and S-Hybrid matrices with k = 3 and different sub-sampling strategies. Results for larger k do not show additional statistical gains empirically. Additional experimental results, including a comparison of estimators using different numbers of SD blocks, are in the Appendix §C. Throughout, we use the normalized Hadamard matrix H for the structured matrix S.

6.1 Pointwise kernel approximation

Complementing the theoretical results of §3 and §4, we provide several salient comparisons of the various methods introduced; see Figure 2 top. Plots presented here (and in the Appendix) compare MSE for the dot-product and angular kernels. They show that estimators based on Gort, S-Hybrid and S-Rademacher matrices, sub-sampling either without replacement or using the first m rows, beat the state-of-the-art unstructured G approach on accuracy for all our different datasets in the JLT setup. Interestingly, the latter two approaches also give smaller MSE than Gort-estimators. For angular kernel estimation, where sub-sampling is not relevant, we see that the Gort and S-Rademacher approaches again outperform the ones based on matrices G.

6.2 Gram matrix approximation

Moving beyond the theoretical guarantees established in §3 and §4, we show empirically that the superiority of estimators based on ROMs is maintained at the level of Gram matrix approximation. We compute Gram matrix approximations (with respect to both the standard dot-product and the angular kernel) for a variety of datasets. We use the normalized Frobenius norm error as our metric (as used by Choromanski and Sindhwani [2016]), and plot the mean error based on 1,000 repetitions of each random transform; see Figure 2 bottom. The Gram matrices are computed on a randomly selected subset of data points from each dataset. As can be seen, the S-Hybrid estimators using the “no-replacement” or “first m rows” sub-sampling strategies outperform even the Gort ones in the dot-product case. For the angular case, the Gort approach and the S-Rademacher approach are practically indistinguishable.
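A minimal sketch of this evaluation metric for the dot-product Gram matrix with unstructured Gaussian features (our own illustration; constants are arbitrary, and the paper's experiments use the estimators of §3 in place of G):

```python
import numpy as np

rng = np.random.default_rng(8)
n, m, N = 32, 512, 20
X = rng.standard_normal((N, n))        # N datapoints in R^n
K_exact = X @ X.T                      # exact dot-product Gram matrix

G = rng.standard_normal((m, n))        # unstructured random feature matrix
K_approx = (X @ G.T) @ (G @ X.T) / m   # random-feature Gram approximation

# Normalized Frobenius norm error, the metric used in Figure 2 bottom.
err = np.linalg.norm(K_approx - K_exact) / np.linalg.norm(K_exact)
print(err < 0.5)
```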

7 Conclusion

We defined the family of random ortho-matrices (ROMs). This contains the SD-product matrices, which include a number of recently proposed structured random matrices. We showed theoretically and empirically that ROMs have strong statistical and computational properties (in several cases outperforming previous state-of-the-art) for algorithms performing dimensionality reduction and random feature approximations of kernels. We highlight Corollary 3.4, which provides a theoretical guarantee that SD-product matrices yield better accuracy than iid matrices in an important dimensionality reduction application (we believe the first result of this kind). Intriguingly, for dimensionality reduction, using just one complex structured matrix yields random features of much better quality. We provided perspectives to help understand the benefits of ROMs, and to help explain the behavior of SD-product matrices for various numbers of blocks. Our empirical findings suggest that our theoretical results might be further strengthened, particularly in the kernel setting.

Acknowledgements

We thank Vikas Sindhwani at Google Brain Robotics and Tamas Sarlos at Google Research for inspiring conversations that led to this work. We thank Jiri Hron, Matej Balog, Maria Lomeli, and Dave Janz for helpful comments. MR acknowledges support by the UK Engineering and Physical Sciences Research Council (EPSRC) grant EP/L016516/1 for the University of Cambridge Centre for Doctoral Training, the Cambridge Centre for Analysis. AW acknowledges support by the Alan Turing Institute under the EPSRC grant EP/N510129/1, and by the Leverhulme Trust via the CFI.

References

• Ailon and Chazelle [2006] N. Ailon and B. Chazelle. Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform. In STOC, 2006.
• Andoni et al. [2015] A. Andoni, P. Indyk, T. Laarhoven, I. Razenshteyn, and L. Schmidt. Practical and optimal LSH for angular distance. In NIPS, 2015.
• Bojarski et al. [2017] M. Bojarski, A. Choromanska, K. Choromanski, F. Fagan, C. Gouy-Pailler, A. Morvan, N. Sakr, T. Sarlos, and J. Atif. Structured adaptive and random spinners for fast machine learning computations. In AISTATS, 2017.
• Cho and Saul [2009] Y. Cho and L. K. Saul. Kernel methods for deep learning. In NIPS, 2009.
• Choromanska et al. [2016] A. Choromanska, K. Choromanski, M. Bojarski, T. Jebara, S. Kumar, and Y. LeCun. Binary embeddings with structured hashed projections. In ICML, 2016.
• Choromanski and Sindhwani [2016] K. Choromanski and V. Sindhwani. Recycling randomness with structure for sublinear time kernel expansions. In ICML, 2016.
• Hinrichs and Vybíral [2011] A. Hinrichs and J. Vybíral. Johnson-Lindenstrauss lemma for circulant matrices. Random Structures & Algorithms, 39(3):391–398, 2011.
• Jégou et al. [2011] H. Jégou, M. Douze, and C. Schmid. Product quantization for nearest neighbor search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(1):117–128, 2011.
• Joachims [2006] T. Joachims. Training linear SVMs in linear time. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’06. ACM, 2006.
• Johnson and Lindenstrauss [1984] W. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26:189–206, 1984.
• Le et al. [2013] Q. Le, T. Sarlós, and A. Smola. Fastfood - approximating kernel expansions in loglinear time. In ICML, 2013.
• Rahimi and Recht [2007] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In NIPS, 2007.
• Samo and Roberts [2015] Y.-L. K. Samo and S. Roberts. Generalized spectral kernels. CoRR, abs/1506.02236, 2015.
• Schmidt et al. [2014] L. Schmidt, M. Sharifi, and I. Moreno. Large-scale speaker identification. In International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014.
• Sidorov et al. [2014] G. Sidorov, A. Gelbukh, H. Gómez-Adorno, and D. Pinto. Soft similarity and soft cosine measure: Similarity of features in vector space model. Computación y Sistemas, 18(3), 2014.
• Sundaram et al. [2013] N. Sundaram, A. Turmukhametova, N. Satish, T. Mostak, P. Indyk, S. Madden, and P. Dubey. Streaming similarity search over one billion tweets using parallel locality-sensitive hashing. Proceedings of the VLDB Endowment, 6(14):1930–1941, 2013.
• Vybíral [2011] J. Vybíral. A variant of the Johnson-Lindenstrauss lemma for circulant matrices. Journal of Functional Analysis, 260(4):1096–1105, 2011.
• Williams [1998] C. Williams. Computation with infinite neural networks. Neural Computation, 10(5):1203–1216, 1998.
• Yu et al. [2016] F. Yu, A. Suresh, K. Choromanski, D. Holtmann-Rice, and S. Kumar. Orthogonal random features. In NIPS, 2016.
• Z. et al. [2015] Xu Z., F. Yu, R. Guo, S. Kumar, S. Wang, and S. Chang. Fast orthogonal projection based on Kronecker product. In ICCV, 2015.
• Zhang and Cheng [2013] H. Zhang and L. Cheng. New bounds for circulant Johnson-Lindenstrauss embeddings. CoRR, abs/1308.6339, 2013.

Appendix: The Unreasonable Effectiveness of Structured Random Orthogonal Embeddings

We present here details and proofs of all the theoretical results presented in the main body of the paper. We also provide further experimental results in §C.

We highlight proofs of several key results that may be of particular interest to the reader:

• The proof of Theorem 3.3; see §A.3.

• The proof of Theorem 3.6; see §A.5.

• The proof of Theorem 4.3; see §B.2.

In the Appendix we will use interchangeably two notations for the dot product between vectors x and y, namely xᵀy and ⟨x, y⟩.

Appendix A Proofs of results in §3

A.1 Proof of Lemma 3.1

Proof.

Denote X_i = (g_iᵀx)(g_iᵀy), where g_i stands for the i-th row of the unstructured Gaussian matrix G. Note that we have:

 \hat{K}^{base}_m(x,y) = \frac{1}{m} \sum_{i=1}^{m} X_i. \qquad (7)

Notice that from the independence of the g_i's and the fact that E[g_i] = 0 and E[g_i g_iᵀ] = I_n, we get E[X_i] = xᵀy; thus the estimator is unbiased. Since the estimator is unbiased, we have MSE(K̂^{base}_m(x,y)) = Var(K̂^{base}_m(x,y)). Thus we get:

 MSE(\hat{K}^{base}_m(x,y)) = \frac{1}{m^2} \sum_{i,j} \left( E[X_i X_j] - E[X_i] E[X_j] \right). \qquad (8)

From the independence of the different X_i's, we get:

 MSE(\hat{K}^{base}_m(x,y)) = \frac{1}{m^2} \sum_{i} \left( E[X_i^2] - (E[X_i])^2 \right). \qquad (9)

Now notice that the different X_i's have the same distribution; thus we get:

 MSE(\hat{K}^{base}_m(x,y)) = \frac{1}{m} \left( E[X_1^2] - (E[X_1])^2 \right). \qquad (10)

From the unbiasedness of the estimator, we have E[X_1] = xᵀy. Therefore we obtain:

 MSE(\hat{K}^{base}_m(x,y)) = \frac{1}{m} \left( E[X_1^2] - (x^\top y)^2 \right). \qquad (11)

Now notice that

 E[X_1^2] = E\!\left[ \sum_{i_1,j_1,i_2,j_2} g_{i_1} g_{j_1} g_{i_2} g_{j_2} x_{i_1} y_{j_1} x_{i_2} y_{j_2} \right] = \sum_{i_1,j_1,i_2,j_2} x_{i_1} y_{j_1} x_{i_2} y_{j_2} E[g_{i_1} g_{j_1} g_{i_2} g_{j_2}], \qquad (12)

where g = (g_1, …, g_n) stands for the first row of G. In the expression above, the only nonzero terms correspond to quadruples (i_1, j_1, i_2, j_2) in which no index appears an odd number of times. Therefore, from the inclusion-exclusion principle and the fact that E[g_i^2] = 1 and E[g_i^4] = 3, we obtain

 E[X_1^2] = \sum_{i_1=j_1,\, i_2=j_2} x_{i_1} y_{j_1} x_{i_2} y_{j_2} E[g_{i_1} g_{j_1} g_{i_2} g_{j_2}] + \sum_{i_1=i_2,\, j_1=j_2} x_{i_1} y_{j_1} x_{i_2} y_{j_2} E[g_{i_1} g_{j_1} g_{i_2} g_{j_2}]
 \quad + \sum_{i_1=j_2,\, i_2=j_1} x_{i_1} y_{j_1} x_{i_2} y_{j_2} E[g_{i_1} g_{j_1} g_{i_2} g_{j_2}] - 2 \sum_{i_1=j_1=i_2=j_2} x_{i_1} y_{j_1} x_{i_2} y_{j_2} E[g_{i_1} g_{j_1} g_{i_2} g_{j_2}]
 = \left( (x^\top y)^2 - \sum_{i=1}^n x_i^2 y_i^2 + 3 \sum_{i=1}^n x_i^2 y_i^2 \right) + \left( \|x\|_2^2 \|y\|_2^2 - \sum_{i=1}^n x_i^2 y_i^2 + 3 \sum_{i=1}^n x_i^2 y_i^2 \right)
 \quad + \left( (x^\top y)^2 - \sum_{i=1}^n x_i^2 y_i^2 + 3 \sum_{i=1}^n x_i^2 y_i^2 \right) - 3 \cdot 2 \sum_{i=1}^n x_i^2 y_i^2
 = \|x\|_2^2 \|y\|_2^2 + 2 (x^\top y)^2. \qquad (13)–(17)

Therefore we obtain

 MSE(\hat{K}^{base}_m(x,y)) = \frac{1}{m} \left( \|x\|_2^2 \|y\|_2^2 + 2 (x^\top y)^2 - (x^\top y)^2 \right) = \frac{1}{m} \left( \|x\|_2^2 \|y\|_2^2 + (x^\top y)^2 \right), \qquad (18)

which completes the proof. ∎

A.2 Proof of Theorem 3.2

Proof.

Fix two different rows of Gort and denote them by r_i and r_j. We will use the following notation:

• ϕ — the angle between r_i and the span{x, y} plane,

• ψ — the angle between the projection of r_i into the span{x, y} plane and vector x (the x-axis of the plane),

• θ — the angle between x and y.

The unbiasedness of the Gaussian orthogonal estimator comes from the fact that every row of the Gaussian orthogonal matrix is marginally sampled from the multivariate Gaussian distribution with entries taken independently at random from N(0, 1).

Note that:

 Cov(X_i, X_j) = E[X_i X_j] - E[X_i] E[X_j], \qquad (19)

where X_i = (r_iᵀx)(r_iᵀy), X_j = (r_jᵀx)(r_jᵀy), and r_i and r_j stand for the i-th and j-th rows of the Gaussian orthogonal matrix respectively. From the fact that the Gaussian orthogonal estimator is unbiased, we get:

 E[X_i] = x^\top y. \qquad (20)

Let us now compute E[X_i X_j]. Conditioning on the angle ϕ, and applying the density function of the directions of the projection of r_j into the span{x, y} plane conditioned on r_i, we get:

 E[X_i X_j] = E[(r_i^\top x)(r_i^\top y)(r_j^\top x)(r_j^\top y)] = \|x\|^2 \|y\|^2 \, \frac{n \cdot n}{n-1} \int_0^{\pi/2} f(\phi) \cos^2(\phi) (\sin^2(\phi) + 1) \int_0^{2\pi} \frac{d\psi}{2\pi} \cos(\psi) \cos(\psi + \theta) \int_0^{2\pi} g(t) \cos(t - \psi) \cos(t - \theta - \psi) \, dt \, d\phi, \qquad (21)

where f stands for the density function of the angle ϕ between r_i and the span{x, y} plane, and g is the density function of the direction of the projection of vector r_j into the span{x, y} plane conditioned on r_i. Notice that, due to the isotropic property of Gaussian vectors, the density function of ψ is 1/(2π).

After simplification we get

 E[X_i X_j] = \|x\|_2^2 \|y\|_2^2 \, \frac{n^2}{n-1} \int_0^{\pi/2} f(\phi) \cos^2(\phi) (\sin^2(\phi) + 1) \int_0^{2\pi} \frac{d\psi}{2\pi} \cos(\psi) \cos(\psi + \theta) \cdot \frac{\sin(\phi)}{2\pi} \int_0^{2\pi} \frac{\cos(t - \psi) \cos(t - \theta - \psi)}{\cos^2(t) + \sin^2(\phi) \sin^2(t)} \, dt \, d\phi. \qquad (22)

Note that we have

 E[X_i] E[X_j] = 2n \|x\|_2^2 \|y\|_2^2 \int_0^{\pi/2} f(\phi) \cos^2(\phi) \, d\phi \left( \int_0^{2\pi} \frac{d\psi}{2\pi} \cos(\psi) \cos(\psi + \theta) \right)^2. \qquad (23)

Therefore we obtain

 E[X_i X_j] - E[X_i] E[X_j] = \|x\|_2^2 \|y\|_2^2 \, \frac{n^2}{n-1} \int_0^{\pi/2} f(\phi) \cos^2(\phi) (\sin^2(\phi) + 1) \int_0^{2\pi} \frac{d\psi}{2\pi} \cos(\psi) \cos(\psi + \theta) \cdot \frac{\sin(\phi)}{2\pi} \int_0^{2\pi} \frac{\cos(t - \psi) \cos(t - \theta - \psi)}{\cos^2(t) + \sin^2(\phi) \sin^2(t)} \, dt \, d\phi - 2n \|x\|_2^2 \|y\|_2^2 \int_0^{\pi/2} f(\phi) \cos^2(\phi) \, d\phi \left( \int_0^{2\pi} \frac{d\psi}{2\pi} \cos(\psi) \cos(\psi + \theta) \right)^2. \qquad (24)

Let us denote by F(ϕ, ψ, θ) the innermost integral above.

We have

 F(\phi, \psi, \theta) = \int_0^{2\pi} \frac{(\sin(\psi)\sin(t) + \cos(\psi)\cos(t)) \, (\sin(\psi + \theta)\sin(t) + \cos(\psi + \theta)\cos(t))}{1 - (1 - \sin^2(\phi)) \sin^2(t)} \, dt. \qquad (25)

Now we use the substitution: