McKernel: A Library for Approximate Kernel Expansions in Loglinear Time
Abstract
McKernel (https://www.github.com/curto2/mckernel/) is a C++ library for large-scale machine learning. It contains a CPU-optimized implementation of the Fastfood algorithm of Le et al. (2013), which allows the computation of approximate kernel expansions in log-linear time. The algorithm requires computing the product of Walsh-Hadamard Transform (WHT) matrices. A cache-friendly SIMD Fast Walsh-Hadamard Transform (FWHT) that achieves compelling speed and outperforms current state-of-the-art methods has been developed. McKernel combines Fastfood with a linear classifier to obtain nonlinear classification.
Computer Vision Laboratory, Eidgenössische Technische Hochschule Zürich
The Robotics Institute, Carnegie Mellon University
Department of Computer Science and Engineering, The Chinese University of Hong Kong
Department of Electronic Engineering, City University of Hong Kong
F. Yang (fengyang@google.com), Google Research
A. J. Smola (alex@smola.org), Machine Learning Department, Carnegie Mellon University
L. Van Gool (vangool@vision.ee.ethz.ch), Computer Vision Laboratory, Eidgenössische Technische Hochschule Zürich
Both authors contributed equally.
Keywords: Approximate Kernel Expansions, Fast Walsh-Hadamard Transform, SIMD, RBF Gaussian kernel, Matérn kernel
1 Introduction
Kernel methods offer state-of-the-art estimation performance. They provide function classes that are flexible and easy to control in terms of regularization properties. However, the use of kernels in large-scale machine learning problems has been beset with difficulty, because using kernel expansions on large datasets is too expensive in terms of computation and storage.
In order to solve this problem, Le et al. (2013) proposed an approximation algorithm, called Fastfood, based on Random Kitchen Sinks by Rahimi and Recht (2007), which speeds up the computation of a large range of kernel functions, allowing us to use them on big data. Recent works on the topic build on Fastfood to propose state-of-the-art deep learning embeddings, Yang et al. (2015) and Hong et al. (2017).
At its heart, McKernel requires scalar multiplications, a permutation, access to trigonometric functions, and two Walsh-Hadamard Transforms (WHT) for implementation. The key computational bottleneck here is the WHT. We provide a fast, cache-friendly, SIMD-oriented implementation that outperforms state-of-the-art codes such as Spiral (Johnson and Püschel, 2000). To allow for very compact distribution of models, we use hash functions and a pseudorandom permutation generator for portability. In this way, for each feature dimension, we only need one floating point number. In summary, our implementation serves as a drop-in feature generator for linear methods where attributes are generated on-the-fly, such as for regression, classification, or two-sample tests. This obviates the need for explicit kernel computations, particularly on large amounts of data.
2 McKernel
Kernel methods work by defining a kernel function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ on a domain $\mathcal{X}$. We can write $k$ as an inner product between feature maps, as follows

$k(x, x') = \langle \phi(x), \phi(x') \rangle$   (1)
for some suitably chosen $\phi$. Random Kitchen Sinks (Rahimi and Recht, 2007) approximate this feature mapping by a Fourier expansion in the case of radial basis function kernels, i.e. whenever $k(x, x') = k(x - x')$. This is possible since the Fourier transform diagonalizes the corresponding integral operator (Smola et al., 1998). This leads to

$k(x - x') = \int \exp\left( i \langle w, x - x' \rangle \right) d\rho(w)$   (2)
for some measurable function $\rho(w) \geq 0$ that is given by the Fourier transform of $k$. Random Kitchen Sinks exploit this by replacing the integral with sampling $w_j \sim \rho$. This allows for finite-dimensional expansions, but it is costly due to the large number of inner products $\langle w_j, x \rangle$ required. Fastfood resolves this for rotationally invariant $k$ by providing a fast approximation of the matrix $Z = [w_1, \ldots, w_n]^\top$.
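The sampling step can be made explicit. As a sketch (following Rahimi and Recht, 2007), the Monte Carlo approximation of (2) with $n$ samples reads:

```latex
% Monte Carlo approximation of the kernel integral, with w_j drawn from rho
k(x - x') \approx \frac{1}{n} \sum_{j=1}^{n}
    \exp\bigl(i \langle w_j, x \rangle\bigr)\,
    \overline{\exp\bigl(i \langle w_j, x' \rangle\bigr)}
  = \langle \phi(x), \phi(x') \rangle,
\qquad
\phi_j(x) = \frac{1}{\sqrt{n}} \exp\bigl(i \langle w_j, x \rangle\bigr).
```

The expansion converges to the exact kernel as $n$ grows, but evaluating all $n$ inner products naively is the cost Fastfood removes.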
This is best seen for the RBF Gaussian kernel, where $k(x - x') = \exp\left( -\|x - x'\|^2 / (2\sigma^2) \right)$. Since Fourier transforms of Gaussians are Gaussians, albeit with inverse covariance, it follows that $\rho(w) \propto \exp\left( -\sigma^2 \|w\|^2 / 2 \right)$ and that $Z$ contains i.i.d. Gaussian random variables. It is this matrix that Fastfood approximates via
$Z = \frac{1}{\sigma \sqrt{d}} S H G \Pi H B$   (3)
Here $S$, $G$ and $B$ are diagonal matrices, $\Pi$ is a random permutation matrix and $H$ is the Hadamard matrix. Whenever the number of rows in $Z$ exceeds the dimensionality $d$ of the data, we can simply generate multiple instances of (3), drawn i.i.d., until the required number of dimensions is obtained.
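As a concrete illustration of the factorization above, the sketch below composes the five matrix operations for a single $d$-dimensional block. It uses a naive Hadamard multiply for clarity (the library itself uses the optimized FWHT), and all names are illustrative rather than McKernel's actual API.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Walsh-Hadamard multiply y = H_d x, with H_1 = [1] and the
// recursion H_{2d} = [[H, H], [H, -H]]; shown here via iterative butterflies.
std::vector<float> hadamard(const std::vector<float>& x) {
    std::size_t d = x.size();
    std::vector<float> y(x);
    for (std::size_t len = 1; len < d; len *= 2)
        for (std::size_t i = 0; i < d; i += 2 * len)
            for (std::size_t j = i; j < i + len; ++j) {
                float a = y[j], b = y[j + len];
                y[j] = a + b;
                y[j + len] = a - b;
            }
    return y;
}

// Illustrative Fastfood block: z = (1 / (sigma * sqrt(d))) * S H G Pi H B x.
// B holds +/-1 signs, Pi a permutation, G and S diagonal scalings.
std::vector<float> fastfood_block(std::vector<float> x, const std::vector<int>& B,
                                  const std::vector<std::size_t>& Pi,
                                  const std::vector<float>& G,
                                  const std::vector<float>& S, float sigma) {
    std::size_t d = x.size();
    for (std::size_t i = 0; i < d; ++i) x[i] *= B[i];   // B: random sign flips
    x = hadamard(x);                                    // first H
    std::vector<float> p(d);
    for (std::size_t i = 0; i < d; ++i) p[i] = x[Pi[i]]; // Pi: permutation
    for (std::size_t i = 0; i < d; ++i) p[i] *= G[i];    // G: Gaussian scaling
    p = hadamard(p);                                     // second H
    float c = 1.0f / (sigma * std::sqrt((float)d));
    for (std::size_t i = 0; i < d; ++i) p[i] *= S[i] * c; // S and normalization
    return p;
}
```

Reading the product right-to-left gives the order of application: sign flip, transform, permute, scale, transform, scale.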
Diagonal Binary Matrix $B$
This is a matrix with entries $B_{ii} \in \{\pm 1\}$, drawn from the uniform distribution. To avoid the memory footprint, we simply use Murmurhash (https://code.google.com/p/smhasher/) as hash function and extract the required bits from the hash values.
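The idea can be sketched as follows: each diagonal sign is derived from a hash of its index, so $B$ never needs to be stored. A splitmix64-style finalizer stands in here for Murmurhash; the mixing constants and seed handling are illustrative only.

```cpp
#include <cstdint>

// Derive B_ii in {+1, -1} from a hash of the index: deterministic,
// recomputable on demand, zero storage. The 64-bit mix below is a
// splitmix64-style finalizer used as a stand-in for Murmurhash.
int binary_sign(std::uint64_t i, std::uint64_t seed) {
    std::uint64_t h = i + seed + 0x9e3779b97f4a7c15ULL;
    h = (h ^ (h >> 30)) * 0xbf58476d1ce4e5b9ULL;
    h = (h ^ (h >> 27)) * 0x94d049bb133111ebULL;
    h ^= h >> 31;
    return (h & 1) ? 1 : -1;   // extract one bit -> uniform sign
}
```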
Walsh-Hadamard Matrix $H$
This matrix is iteratively composed via the recursion $H_{2d} = \begin{pmatrix} H_d & H_d \\ H_d & -H_d \end{pmatrix}$. It is fixed, and matrix-vector products are carried out efficiently in $O(d \log d)$ time using the Fast Walsh-Hadamard Transform (FWHT). We discuss implementation details for a fast variant below.
Permutation Matrix $\Pi$
We generate a random permutation using the Fisher-Yates shuffle. That is, given a list $\{1, \ldots, d\}$, we generate permutations recursively as follows: pick a random element from the remaining list, use it as the image of the current position, and swap it into the position where it was drawn from. The algorithm runs in linear time and its coefficients can be stored in $O(d)$ space. Moreover, to obtain a deterministic mapping, we replace the random number generator with calls to the hash function.
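A minimal sketch of this construction, with a deterministic counter-based generator standing in for the hash-function calls (function names and constants are illustrative, not McKernel's):

```cpp
#include <cstdint>
#include <numeric>
#include <vector>

// Deterministic stand-in for the hash-function calls: one splitmix64 step.
std::uint64_t hash_step(std::uint64_t& state) {
    state += 0x9e3779b97f4a7c15ULL;
    std::uint64_t h = state;
    h = (h ^ (h >> 30)) * 0xbf58476d1ce4e5b9ULL;
    h = (h ^ (h >> 27)) * 0x94d049bb133111ebULL;
    return h ^ (h >> 31);
}

// Fisher-Yates shuffle: O(d) time, O(d) storage, reproducible from the seed.
std::vector<std::uint32_t> random_permutation(std::uint32_t d, std::uint64_t seed) {
    std::vector<std::uint32_t> perm(d);
    if (d == 0) return perm;
    std::iota(perm.begin(), perm.end(), 0);          // identity permutation
    for (std::uint32_t i = d - 1; i > 0; --i) {
        std::uint32_t j = hash_step(seed) % (i + 1); // deterministic "random" index
        std::swap(perm[i], perm[j]);
    }
    return perm;
}
```

Because the generator is seeded, the same permutation can be recomputed at deployment time without shipping it.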
Gaussian Scaling Matrix $G$
This is a diagonal matrix with i.i.d. Gaussian entries. We generate the random variates using the Box-Muller transform (Box and Muller, 1958), substituting the random number generator with calls to the hash function, which allows us to recompute the values at any time without the need to store random numbers.
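The transform itself is short; a sketch (with the two uniform inputs supplied by the caller, as they would be by hash-function calls in the scheme above):

```cpp
#include <cmath>
#include <utility>

// Box-Muller transform: two independent U(0,1) variates u1, u2 become
// two independent N(0,1) variates. Supplying u1, u2 from a hash lets
// the Gaussian entries of G be recomputed on demand without storage.
std::pair<float, float> box_muller(float u1, float u2) {
    const float two_pi = 6.283185307179586f;
    float r = std::sqrt(-2.0f * std::log(u1));  // requires u1 in (0, 1]
    return {r * std::cos(two_pi * u2), r * std::sin(two_pi * u2)};
}
```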
Diagonal Scaling Matrix $S$
This is a random scaling operator whose behavior depends on the kernel chosen, such as the Matérn kernel, the RBF Gaussian kernel, or any other radial spectral distribution (Yang et al., 2014).
3 FWHT Implementation
A key part of the library is an efficient FWHT implementation. In particular, McKernel offers considerable improvement over the Spiral library (http://www.spiral.net) due to automatic code generation, the use of SIMD intrinsics (SSE2 using 128-bit registers) and loop unrolling, which decreases the memory overhead.
McKernel proceeds with vectorized sums and subtractions iteratively over the input vector, computing the intermediate butterflies of the Cooley-Tukey recursion down to a small Hadamard routine that fits in cache. The algorithm then continues in the same way, starting from the smallest length and doubling the stride on each iteration until the whole FWHT is done in-place.
For instance, on an Intel i5-4200 CPU @ 1.60GHz laptop, the performance obtained can be observed in Figure 1.
Our code outperforms Spiral by a factor of 2 consistently throughout the range of arguments. Furthermore, Spiral needs to precompute trees and by default can only perform the computation up to a limited matrix size. Our implementation, on the other hand, works for any size, since it computes the high-level partitioning dynamically.
4 API Description
The API follows the factory design pattern. That is, while the McKernel object is fairly generic in terms of feature computation, we have a factory that acts as a means of instantiating the parameters according to pre-specified sets of parameters, e.g. for an RBF Gaussian or a Matérn kernel. The so-chosen parameters are deterministic, given by the values of a hash function. The advantage of this approach is that there is no need to save the coefficients generated for McKernel when deploying the functions. That said, we also provide a constructor that consumes the scaling matrices directly, to allow for arbitrary kernels.
void fwht(float* data, unsigned long lgn)
Computes the FWHT in-place on data. The length of the vector it operates on is $2^{lgn}$. Note that fwht does not check whether data points to a sufficiently large amount of memory.
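As a reference for the semantics of this routine (not McKernel's optimized SIMD code), an in-place FWHT with the same signature can be sketched as follows; it runs in $O(n \log n)$ for $n = 2^{lgn}$:

```cpp
// Reference in-place FWHT with the declared signature: data has
// length n = 2^lgn. Unnormalized sum/difference butterflies, following
// the Hadamard recursion H_{2d} = [[H, H], [H, -H]].
void fwht(float* data, unsigned long lgn) {
    unsigned long n = 1UL << lgn;
    for (unsigned long len = 1; len < n; len <<= 1)
        for (unsigned long i = 0; i < n; i += len << 1)
            for (unsigned long j = i; j < i + len; ++j) {
                float a = data[j], b = data[j + len];
                data[j] = a + b;          // sum butterfly
                data[j + len] = a - b;    // difference butterfly
            }
}
```

Since $H_n H_n = n I$, applying the unnormalized transform twice multiplies the input by $n$, which is a convenient sanity check.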
4.1 Customizing McKernel
For instance, to generate each entry for the Matérn kernel, we draw i.i.d. samples from the $d$-dimensional unit ball, add them, and compute the Euclidean norm of the result. To draw samples from the unit ball efficiently, we use the algorithm described below.
Let $X = (X_1, \ldots, X_d)$ be a vector of i.i.d. random variables drawn from $\mathcal{N}(0, 1)$, and let $\lambda = \|X\|$ be its Euclidean norm; then $Y = X / \lambda$ is uniformly distributed over the sphere $S^{d-1}$, where it is easy to see that $Y$ is the projection of $X$ onto the surface of the $d$-dimensional sphere. To draw uniform random variables in the ball, we multiply $Y$ by $U^{1/d}$, where $U \sim \mathrm{Uniform}(0, 1)$. This can be proved as follows: let $Z$ be a random vector uniformly distributed in the unit ball. Then the radius $R = \|Z\|$ satisfies $P(R \leq r) = r^d$, and by the inverse transform method we get $R = U^{1/d}$. Therefore, to sample uniformly from the ball, we use $Z = U^{1/d} \, X / \|X\|$.
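The recipe above can be sketched directly. The Gaussian source here uses the standard library generator rather than the hash-based scheme, so the function is illustrative only:

```cpp
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Draw one point uniformly from the d-dimensional unit ball:
// normalize a Gaussian vector to get a uniform direction on the sphere,
// then scale by U^(1/d) so that P(radius <= r) = r^d.
std::vector<double> sample_unit_ball(std::size_t d, std::mt19937& gen) {
    std::normal_distribution<double> normal(0.0, 1.0);
    std::uniform_real_distribution<double> uniform(0.0, 1.0);
    std::vector<double> x(d);
    double norm2 = 0.0;
    for (auto& xi : x) { xi = normal(gen); norm2 += xi * xi; }
    double scale = std::pow(uniform(gen), 1.0 / d) / std::sqrt(norm2);
    for (auto& xi : x) xi *= scale;   // radius = U^(1/d), direction uniform
    return x;
}
```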
4.2 Class McKernel
A class instance McKernel initializes the object by computing and storing the diagonal and permutation matrices. The factory design lets you specify the creation of the kernel-dependent matrices by calling the proper kernel constructor.
McKernel objects contain the following methods:

McFeatures(float* data, float* features) computes the features by using the real version of the complex feature map in Rahimi and Recht (2007), $\phi(x) = \frac{1}{\sqrt{n}} \left[ \cos(Zx), \sin(Zx) \right]$. SIMD vectorized instructions and cache locality are used to increase speed. To avoid a bottleneck in the basic trigonometric function computation, an SSE2 sincos function (http://gruntthepeon.free.fr/ssemath/) is used, which gives a speed improvement of 18x.
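The feature map itself can be sketched as follows, assuming the product $Zx$ (here `zx`) has already been computed by the Fastfood transform; a scalar loop stands in for the vectorized sincos code, and the function name is illustrative:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Real version of the complex Fourier feature map: given v = Zx of
// length n, emit [cos(v); sin(v)] / sqrt(n), i.e. 2n features total.
// Scalar stand-in for the SSE2 sincos implementation.
std::vector<float> real_features(const std::vector<float>& zx) {
    std::size_t n = zx.size();
    float scale = 1.0f / std::sqrt((float)n);
    std::vector<float> features(2 * n);
    for (std::size_t i = 0; i < n; ++i) {
        features[i] = scale * std::cos(zx[i]);      // cosine half
        features[n + i] = scale * std::sin(zx[i]);  // sine half
    }
    return features;
}
```

The $1/\sqrt{n}$ factor makes the inner product of two feature vectors an unbiased estimate of the kernel value.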

float McEvaluate(float* data, float* weights) computes the inner product between the input vector and the weights.
4.3 Practical Use
McKernel can be embedded in front of a linear classifier: for instance, first doing feature extraction, then McKernel, and finally SVM linear classification. Another example is to embed it in a deep neural network architecture, either as a nonlinear mapping before the activation function or as a separate embedding layer (Hong et al., 2017).
References
 Box and Muller (1958) G. E. P. Box and M. E. Muller. A note on the generation of random normal deviates. The Annals of Mathematical Statistics, 1958.
 Hong et al. (2017) W. Hong, J. Yuan, and S. D. Bhattacharjee. Fried binary embedding for high-dimensional visual features. In: CVPR, 2017.
 Johnson and Püschel (2000) J. Johnson and M. Püschel. In search of the optimal Walsh-Hadamard transform. In: ICASSP, 2000.
 Le et al. (2013) Q. Le, T. Sarlós, and A. Smola. Fastfood: approximating kernel expansions in loglinear time. In: ICML, 2013.
 Rahimi and Recht (2007) A. Rahimi and B. Recht. Random features for large-scale kernel machines. In: NIPS, 2007.
 Smola et al. (1998) A. J. Smola, B. Schölkopf, and K. R. Müller. General cost functions for support vector regression. In: Ninth Australian Conference on Neural Networks, 1998.
 Yang et al. (2014) Z. Yang, A. J. Smola, L. Song, and A. G. Wilson. À la carte: learning fast kernels. In: AISTATS, 2014.
 Yang et al. (2015) Z. Yang, M. Moczulski, M. Denil, N. de Freitas, A. Smola, L. Song, and Z. Wang. Deep fried convnets. In: ICCV, 2015.