McKernel: A Library for Approximate Kernel Expansions in Log-linear Time
McKernel (https://www.github.com/curto2/mckernel/) is a C++ library for large-scale machine learning. It contains a CPU-optimized implementation of the Fastfood algorithm of Le et al. (2013), which allows the computation of approximate kernel expansions in log-linear time. The algorithm requires computing products with Walsh Hadamard Transform (WHT) matrices. A cache-friendly SIMD Fast Walsh Hadamard Transform (FWHT) that achieves compelling speed and outperforms current state-of-the-art methods has been developed. McKernel obtains non-linear classification by combining Fastfood with a linear classifier.
Keywords: Approximate Kernel Expansions, Fast Walsh Hadamard Transform, SIMD, RBF Gaussian kernel, Matérn kernel
Kernel methods offer state-of-the-art estimation performance. They
provide function classes that are flexible and easy to control in
terms of regularization properties. However, the use of kernels in large-scale
machine learning problems has been beset with difficulty. This is
because using kernel expansions in large datasets is too expensive in
terms of computation and storage.
To solve this problem, Le et al. (2013) proposed an approximation algorithm called Fastfood, based on Random Kitchen Sinks (Rahimi and Recht, 2007), that speeds up the computation of a large range of kernel functions, allowing their use on big data. Recent works on the topic build on Fastfood to propose state-of-the-art deep learning embeddings (Yang et al., 2015; Hong et al., 2017).
At its heart, McKernel requires only scalar multiplications, a permutation, access to trigonometric functions, and two Walsh Hadamard Transforms (WHT). The key computational bottleneck is the WHT. We provide a fast, cache-friendly, SIMD-oriented implementation that outperforms state-of-the-art codes such as Spiral (Johnson and Püschel, 2000). To allow for very compact distribution of models, we use hash functions and a pseudorandom permutation generator for portability. In this way, for each feature dimension, we only need one floating point number. In summary, our implementation serves as a drop-in feature generator for linear methods where attributes are generated on-the-fly, such as regression, classification, or two-sample tests. This obviates the need for explicit kernel computations, particularly on large amounts of data.
Kernel methods work by defining a kernel function $k(x, x')$ on a domain $\mathcal{X}$. We can write $k$ as an inner product between feature maps, as follows

$$k(x, x') = \langle \phi(x), \phi(x') \rangle$$

for some suitably chosen $\phi$. Random Kitchen Sinks (Rahimi and Recht, 2007) approximate this feature mapping by a Fourier expansion in the case of radial basis function kernels, i.e. whenever $k(x, x') = k(x - x')$. This is possible since the Fourier transform diagonalizes the corresponding integral operator (Smola et al., 1998). This leads to

$$k(x - x') = \int \exp\left(i \langle w, x - x' \rangle\right) d\rho(w)$$

for some measurable function $\rho(w) \geq 0$ that is given by the Fourier transform of $k$. Random Kitchen Sinks exploit this by replacing the integral by sampling $w_j \sim \rho(w)$. This allows for finite-dimensional expansions but it is costly due to the large number of inner products required. Fastfood resolves this for rotationally invariant $k$ by providing a fast approximation of the matrix $Z = [w_1, \ldots, w_m]^\top$.

This is best seen for the RBF Gaussian kernel where $k(x - x') = \exp\left(-\|x - x'\|^2 / (2\sigma^2)\right)$. Since Fourier transforms of Gaussians are Gaussians, albeit with inverse covariance, it follows that $\rho(w) \propto \exp\left(-\sigma^2 \|w\|^2 / 2\right)$ and that $Z$ contains i.i.d. Gaussian random variables. It is this matrix that Fastfood approximates via

$$V = \frac{1}{\sigma \sqrt{d}}\, S H G \Pi H B.$$

Here $S$, $G$ and $B$ are diagonal matrices, $\Pi$ is a random permutation matrix and $H$ is the Hadamard matrix. Whenever the number of rows in $Z$ exceeds the dimensionality $d$ of the data, we can simply generate multiple instances of $V$, drawn i.i.d., until the required number of dimensions is obtained.
- Diagonal Binary Matrix $B$
This is a diagonal matrix with entries $\pm 1$, drawn from the uniform distribution. To avoid any memory footprint, we simply use Murmurhash (https://code.google.com/p/smhasher/) as hash function and extract the entries as single bits of the hash values.
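A minimal sketch of this idea follows. Murmurhash is the hash actually used by the library; here splitmix64 stands in as a hypothetical substitute so the example stays self-contained, and the function names are illustrative, not McKernel's API:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// splitmix64: a hypothetical stand-in for the Murmurhash calls used by
// the library, kept here so the sketch stays self-contained.
static std::uint64_t splitmix64(std::uint64_t x) {
    x += 0x9e3779b97f4a7c15ULL;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
}

// Entry B[i][i] in {-1, +1}: one bit of the hash of the index.
inline int binary_entry(std::uint64_t seed, std::uint64_t i) {
    return (splitmix64(seed ^ i) & 1ULL) ? 1 : -1;
}

// Apply B to a vector in place: x[i] *= B[i][i]. Nothing is stored,
// so the matrix costs no memory at all.
void apply_binary_diagonal(float* x, std::size_t d, std::uint64_t seed) {
    for (std::size_t i = 0; i < d; ++i)
        x[i] *= static_cast<float>(binary_entry(seed, i));
}
```

Because each entry is recomputed from the hash, the same seed always reproduces the same matrix, which is what makes compact model distribution possible.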
- Walsh Hadamard Matrix $H$
This matrix is composed iteratively via $H_{2d} = \begin{pmatrix} H_d & H_d \\ H_d & -H_d \end{pmatrix}$. It is fixed, and matrix-vector products are carried out efficiently in $O(d \log d)$ time using the Fast Walsh Hadamard Transform (FWHT). We discuss implementation details for a fast variant below.
- Permutation Matrix $\Pi$
We generate a random permutation using the Fisher-Yates shuffle. That is, given the list $\{1, \ldots, n\}$, we build the permutation step by step: pick a random element from the remaining list, use it as the image of the current position, and remove it from the list. The algorithm runs in linear time and its coefficients can be stored in $O(n)$ space. Moreover, to obtain a deterministic mapping, we replace the random number generator with calls to the hash function.
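The deterministic variant can be sketched as below, again with splitmix64 as a hypothetical stand-in for the hash function; `hashed_permutation` is an illustrative name, not the library's API:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical stand-in for the hash function used by the library.
static std::uint64_t splitmix64(std::uint64_t x) {
    x += 0x9e3779b97f4a7c15ULL;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
}

// Fisher-Yates shuffle driven by a hash of (seed, step): the same seed
// always yields the same permutation, so only the seed needs storing.
std::vector<std::uint32_t> hashed_permutation(std::uint32_t n,
                                              std::uint64_t seed) {
    std::vector<std::uint32_t> p(n);
    std::iota(p.begin(), p.end(), 0u);
    for (std::uint32_t i = 0; i + 1 < n; ++i) {
        // pick a "random" element among the not-yet-placed positions [i, n)
        std::uint32_t j =
            i + static_cast<std::uint32_t>(splitmix64(seed ^ i) % (n - i));
        std::swap(p[i], p[j]);
    }
    return p;
}
```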
- Gaussian Scaling $G$
This is a diagonal matrix with i.i.d. Gaussian entries. We generate the random variates using the Box-Muller transform (Box and Muller, 1958), substituting the random number generator with calls to the hash function, which allows us to recompute the values at any time without storing random numbers.
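A sketch of this scheme, assuming a splitmix64 stand-in for the hash function (the helper names are illustrative):

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical stand-in for the hash function used by the library.
static std::uint64_t splitmix64(std::uint64_t x) {
    x += 0x9e3779b97f4a7c15ULL;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
}

// Hash -> uniform in (0, 1): 53 bits, shifted into the open unit
// interval so the log() below never sees zero.
static double hashed_uniform(std::uint64_t seed, std::uint64_t i) {
    return ((splitmix64(seed ^ i) >> 11) + 1) * (1.0 / 9007199254740994.0);
}

// Box-Muller: maps two independent uniforms to two independent N(0, 1)
// variates.
static void box_muller(double u1, double u2, double* z0, double* z1) {
    const double two_pi = 6.283185307179586;
    double r = std::sqrt(-2.0 * std::log(u1));
    *z0 = r * std::cos(two_pi * u2);
    *z1 = r * std::sin(two_pi * u2);
}

// Entry G[i][i]: a standard Gaussian recomputable at any time from
// (seed, index) alone, with no random numbers stored.
double gaussian_entry(std::uint64_t seed, std::uint64_t i) {
    double z0, z1;
    box_muller(hashed_uniform(seed, 2 * i), hashed_uniform(seed, 2 * i + 1),
               &z0, &z1);
    return z0;
}
```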
- Diagonal Scaling $S$
This is a random scaling operator whose behavior depends on the type of kernel chosen, such as the Matérn kernel, the RBF Gaussian kernel or any other radial spectral distribution (Yang et al., 2014).
3 FWHT Implementation
A key part of the library is an efficient FWHT implementation. In particular, McKernel offers considerable improvement over the Spiral (http://www.spiral.net) library, due to automatic code generation, the use of SIMD intrinsics (SSE2 with 128-bit registers) and loop unrolling, which decreases the memory overhead.
McKernel proceeds with vectorized sums and subtractions iteratively over the input vector positions, computing the intermediate operations of the Cooley-Tukey recursion until it reaches a small Hadamard routine that fits in cache. The algorithm then continues in the same way, starting from the smallest block length and doubling the input dimension on each iteration, until the whole FWHT is done in-place.
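A scalar sketch of the underlying butterfly pattern follows. It shows the access pattern only; the library version applies the same sums and differences with SSE2 intrinsics and a cache-blocked schedule, and `fwht_scalar` is an illustrative name, not the library routine:

```cpp
#include <cassert>
#include <cstddef>

// Iterative in-place Fast Walsh Hadamard Transform, O(n log n).
void fwht_scalar(float* data, std::size_t lgn) {
    const std::size_t n = std::size_t(1) << lgn; // input length is 2^lgn
    for (std::size_t h = 1; h < n; h <<= 1) {    // block half-size doubles
        for (std::size_t i = 0; i < n; i += h << 1) {
            for (std::size_t j = i; j < i + h; ++j) {
                float a = data[j];
                float b = data[j + h];
                data[j] = a + b;     // vectorizable sums...
                data[j + h] = a - b; // ...and differences
            }
        }
    }
}
```

Since $H_n H_n = n I$, applying the transform twice recovers the input scaled by $n$, which gives a convenient correctness check.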
For instance, on an Intel i5-4200 CPU @ 1.60GHz laptop, the performance obtained can be observed in Figure 1.
Our code outperforms Spiral by a factor of 2 consistently throughout the range of arguments. Furthermore, Spiral needs to precompute trees and by default can only perform the computation up to a fixed matrix size. In contrast, our implementation works for any size since it computes the high-level partitioning dynamically.
4 API Description
The API follows the factory design pattern. That is, while the McKernel object is fairly generic in terms of feature computation, a factory instantiates the parameters according to pre-specified sets for, e.g., an RBF Gaussian or a Matérn kernel. The chosen parameters are deterministic, given by the values of a hash function. The advantage of this approach is that there is no need to save the coefficients generated for McKernel when deploying the functions. That said, we also provide a constructor that consumes $G$ and $S$ to allow for arbitrary kernels.
void fwht(float* data, unsigned long lgn)
Computes the FWHT in-place on data. The length of the vector it operates on is $2^{lgn}$. Note that fwht does not check whether data points to a sufficiently large amount of memory.
4.1 Customizing McKernel
For instance, to generate each entry of $S$ for the Matérn kernel we draw i.i.d. samples from the $d$-dimensional unit ball $B^d$, add them and compute the Euclidean norm of the sum. To draw samples from $B^d$ efficiently we use the algorithm provided below.

Let $X = (X_1, \ldots, X_d)$ be a vector of i.i.d. random variables drawn from $N(0, 1)$, and let $\|X\|$ be the Euclidean norm of $X$. Then $Y = X / \|X\|$ is uniformly distributed over the surface of the $d$-sphere; indeed, $Y$ is the projection of $X$ onto the surface of the $d$-dimensional sphere. To draw uniform random variables in the $d$-ball, we multiply $Y$ by $U^{1/d}$ where $U \sim \mathrm{Uniform}(0, 1)$. This can be proved as follows: let $Z$ be a random vector uniformly distributed in the unit ball. Then the radius $R = \|Z\|$ satisfies $P(R \leq r) = r^d$. Now, by the inverse transform method we get $R = U^{1/d}$. Therefore, to sample uniformly from the $d$-ball the following algorithm is used: $Z = U^{1/d}\, X / \|X\|$.
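The steps above can be sketched as follows. For brevity this sketch uses `std::mt19937` where the library instead derives its variates from a hash function, and `sample_unit_ball` is an illustrative name:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Draw a point uniformly from the d-dimensional unit ball:
// 1) x ~ N(0, I_d); x / ||x|| is uniform on the sphere surface,
// 2) scale by r = U^(1/d), U ~ Uniform(0, 1), since P(R <= r) = r^d.
std::vector<double> sample_unit_ball(std::size_t d, std::mt19937& gen) {
    std::normal_distribution<double> normal(0.0, 1.0);
    std::uniform_real_distribution<double> unif(0.0, 1.0);
    std::vector<double> x(d);
    double norm2 = 0.0;
    for (std::size_t i = 0; i < d; ++i) {
        x[i] = normal(gen);
        norm2 += x[i] * x[i];
    }
    // r = U^(1/d) via the inverse transform method
    double scale =
        std::pow(unif(gen), 1.0 / static_cast<double>(d)) / std::sqrt(norm2);
    for (std::size_t i = 0; i < d; ++i) x[i] *= scale;
    return x;
}
```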
4.2 Class McKernel
A McKernel class instance initializes the object by computing and storing the diagonal matrices. The factory design lets you specify the creation of the matrices $G$ and $S$ by calling the proper kernel constructor.
McKernel objects contain the following methods:
McFeatures(float* data, float* features) computes the features using the real version of the complex feature map in Rahimi and Recht (2007). SIMD vectorized instructions and cache locality are used to increase speed. To avoid a bottleneck in the basic trigonometric function computation, an SSE2 sincos function (http://gruntthepeon.free.fr/ssemath/) is used, which yields a speed improvement of 18×.
float McEvaluate(float* data, float* weights) computes the inner product between the input vector and the weights.
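The real-valued feature map behind McFeatures can be sketched as follows, assuming the product $t = Vx$ has already been computed. The layout (cosines first, then sines) and the $m^{-1/2}$ scaling are assumptions of this sketch, not necessarily the library's exact memory layout:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// phi(x) = m^{-1/2} [cos(t_1),...,cos(t_m), sin(t_1),...,sin(t_m)],
// where t = Vx. Plain cos/sin calls here; the library fuses both into
// one vectorized SSE2 sincos evaluation per entry.
std::vector<float> feature_map(const float* t, std::size_t m) {
    std::vector<float> phi(2 * m);
    const float scale = 1.0f / std::sqrt(static_cast<float>(m));
    for (std::size_t i = 0; i < m; ++i) {
        phi[i] = scale * std::cos(t[i]);
        phi[m + i] = scale * std::sin(t[i]);
    }
    return phi;
}
```

Note that $\|\phi(x)\|^2 = 1$ for any input, since $\cos^2 + \sin^2 = 1$ entrywise.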
4.3 Practical Use
McKernel can be embedded in front of a linear classifier: for instance, feature extraction first, then McKernel, and finally linear SVM classification. Another option is to embed it in a deep neural network architecture, either as a non-linear mapping before the activation function or as a separate embedding layer (Hong et al., 2017).
- Box and Muller (1958) G. E. P. Box and M. E. Muller. A note on the generation of random normal deviates. In: The Annals of Mathematical Statistics, 1958.
- Hong et al. (2017) W. Hong, J. Yuan, and S. D. Bhattacharjee. Fried binary embedding for high-dimensional visual features. In: CVPR, 2017.
- Johnson and Püschel (2000) J. Johnson and M. Püschel. In search of the optimal Walsh-Hadamard transform. In: ICASSP, 2000.
- Le et al. (2013) Q. Le, T. Sarlós, and A. Smola. Fastfood - approximating kernel expansions in loglinear time. In: ICML, 2013.
- Rahimi and Recht (2007) A. Rahimi and B. Recht. Random features for large-scale kernel machines. In: NIPS, 2007.
- Smola et al. (1998) A. J. Smola, B. Schölkopf, and K. R. Müller. General cost functions for support vector regression. In: Ninth Australian Conference on Neural Networks, 1998.
- Yang et al. (2014) Z. Yang, A. J. Smola, L. Song, and A. G. Wilson. À la carte - learning fast kernels. In: AISTATS, 2014.
- Yang et al. (2015) Z. Yang, M. Moczulski, M. Denil, N. de Freitas, A. Smola, L. Song, and Z. Wang. Deep fried convnets. In: ICCV, 2015.