Generalized Intersection Kernel
Following the very recent line of work on the “generalized min-max” (GMM) kernel [7], this study proposes the “generalized intersection” (GInt) kernel and the related “normalized generalized min-max” (NGMM) kernel. In computer vision, the (histogram) intersection kernel has been popular, and the GInt kernel generalizes it to data which can have both negative and positive entries. Through an extensive empirical classification study on 40 datasets from the UCI repository, we show that this (tuning-free) GInt kernel performs fairly well.
The empirical results also demonstrate that the NGMM kernel typically outperforms the GInt kernel. Interestingly, the NGMM kernel has another interpretation — it is the “asymmetrically transformed” version of the GInt kernel, based on the idea of “asymmetric hashing” [14]. Just like the GMM kernel, the NGMM kernel can be efficiently linearized through (e.g.,) generalized consistent weighted sampling (GCWS), as empirically validated in our study. Owing to the discrete nature of the hashed values, GCWS also provides a scheme for approximate near neighbor search.
The proposed GInt kernel and NGMM kernel provide additional options for practitioners to deal with data in their specific domains. We also note that there have been a series of (unpublished) works on related topics. For example, [7] proposed the GMM kernel and the “normalized random Fourier features (NRFF)” method; [8] compared the Nystrom method for approximating the GMM kernel with NRFF; [10] developed some basic mathematical theory (e.g., the convergence and asymptotic normality) for the GMM kernel; [6] compared the hashed (linearized) GMM kernel with a series of kernels which can be approximated by sign stable random projections.
1 Introduction

Following the idea from the recently proposed “generalized min-max (GMM)” kernel [7], we propose two types of nonlinear kernels which are basically tuning-free and can handle data vectors with both negative and positive entries. The first step is a simple transformation of the original data. Consider the original data vector $u_i$, $i = 1$ to $D$. We define the following transformation, depending on whether an entry $u_i$ is positive or negative:

$$\tilde{u}_{2i-1} = \begin{cases} u_i & \text{if } u_i > 0 \\ 0 & \text{otherwise} \end{cases}, \qquad \tilde{u}_{2i} = \begin{cases} -u_i & \text{if } u_i < 0 \\ 0 & \text{otherwise} \end{cases} \tag{1}$$
For example, when $D = 2$ and $u = [-5, 3]$, the transformed data vector becomes $\tilde{u} = [0, 5, 3, 0]$.
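For concreteness, a minimal NumPy sketch of transformation (1) follows (the function name `transform` is ours, not from [7]):

```python
import numpy as np

def transform(u):
    """Transformation (1): map u in R^D to a nonnegative vector in R^{2D}."""
    u = np.asarray(u, dtype=float)
    t = np.zeros(2 * len(u))
    t[0::2] = np.maximum(u, 0.0)   # tilde_u_{2i-1} = u_i  if u_i > 0, else 0
    t[1::2] = np.maximum(-u, 0.0)  # tilde_u_{2i}   = -u_i if u_i < 0, else 0
    return t

print(transform([-5, 3]))  # [0. 5. 3. 0.]
```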
Once we have only nonnegative data, we can define the GMM kernel as proposed in [7]:

$$\mathrm{GMM}(u, v) = \frac{\sum_{i=1}^{2D} \min(\tilde{u}_i, \tilde{v}_i)}{\sum_{i=1}^{2D} \max(\tilde{u}_i, \tilde{v}_i)}$$
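Reusing the `transform` sketch above, the GMM kernel is a one-liner:

```python
def gmm_kernel(u, v):
    """GMM(u, v) = sum_i min(tu_i, tv_i) / sum_i max(tu_i, tv_i)."""
    tu, tv = transform(u), transform(v)
    return np.minimum(tu, tv).sum() / np.maximum(tu, tv).sum()
```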
Inspired by the above idea, a variety of nonlinear kernels can be analogously defined. For example, the “generalized intersection (GInt)” kernel, where (as is customary for histogram data) the transformed vectors are first normalized to unit $l_1$ norm so that $\sum_{i=1}^{2D} \tilde{u}_i = \sum_{i=1}^{2D} \tilde{v}_i = 1$:

$$\mathrm{GInt}(u, v) = \sum_{i=1}^{2D} \min(\tilde{u}_i, \tilde{v}_i)$$
and the “normalized GMM (NGMM)” kernel, i.e., the GMM kernel computed on the same $l_1$-normalized transformed data:

$$\mathrm{NGMM}(u, v) = \frac{\sum_{i=1}^{2D} \min(\tilde{u}_i, \tilde{v}_i)}{\sum_{i=1}^{2D} \max(\tilde{u}_i, \tilde{v}_i)}$$
Note that the original (histogram) intersection kernel has been a popular tool in computer vision [11]. In this study, we provide an extensive empirical evaluation of the GInt kernel and the NGMM kernel on 40 classification datasets from the UCI repository. The empirical results indicate that the NGMM kernel typically outperforms the GInt kernel. In addition, there is an interesting connection between these two kernels: because $\min(a, b) + \max(a, b) = a + b$ and the normalized vectors each sum to one, we can re-write the NGMM kernel as

$$\mathrm{NGMM}(u, v) = \frac{\sum_{i=1}^{2D} \min(\tilde{u}_i, \tilde{v}_i)}{2 - \sum_{i=1}^{2D} \min(\tilde{u}_i, \tilde{v}_i)} = \frac{\mathrm{GInt}(u, v)}{2 - \mathrm{GInt}(u, v)}$$

which means that the NGMM kernel is a monotonic (increasing) transformation of the GInt kernel. This connection can also be interpreted as saying that the NGMM kernel is an “asymmetrically transformed” version of the GInt kernel, based on the recent idea of asymmetric hashing [14].
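Under the $l_1$-normalization stated above, the GInt and NGMM kernels, together with a numerical check of the monotonic relation, can be sketched as follows (helper names are ours):

```python
def _normalize(t):
    """l1-normalize the transformed (nonnegative) data."""
    return t / t.sum()

def gint_kernel(u, v):
    tu, tv = _normalize(transform(u)), _normalize(transform(v))
    return np.minimum(tu, tv).sum()

def ngmm_kernel(u, v):
    tu, tv = _normalize(transform(u)), _normalize(transform(v))
    return np.minimum(tu, tv).sum() / np.maximum(tu, tv).sum()

# Numerical check of NGMM = GInt / (2 - GInt):
rng = np.random.default_rng(0)
u, v = rng.standard_normal(10), rng.standard_normal(10)
g = gint_kernel(u, v)
assert np.isclose(ngmm_kernel(u, v), g / (2.0 - g))
```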
Next, we provide an empirical study of kernel SVMs based on the aforementioned kernels, followed by an empirical study of hashing (linearizing) the NGMM kernel.
2 An Experimental Study on Kernel SVMs
Table 1 lists the 40 publicly available datasets, all from the UCI repository, used in our experimental study, along with the kernel SVM classification results for the RBF kernel, the GMM kernel, and the proposed NGMM and GInt kernels, at the (individually) best $l_2$-regularization ($C$) values. More detailed results (for all regularization values) are available in Figure 1. To ensure repeatability, we use the LIBSVM pre-computed kernel functionality. Note that for the RBF kernel with a scale parameter $\gamma$, we report the best result among a wide range of $\gamma$ and $C$ values.
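To illustrate the pre-computed-kernel workflow, here is a hypothetical sketch using scikit-learn’s `SVC(kernel='precomputed')` interface (the toy data and the `kernel_matrix` helper are illustrative assumptions; the actual experiments use the LIBSVM tools directly):

```python
from sklearn.svm import SVC

def kernel_matrix(X, Z, kernel):
    """Gram matrix K[i, j] = kernel(X[i], Z[j]) (naive O(n^2) loop)."""
    return np.array([[kernel(x, z) for z in Z] for x in X])

# Toy data standing in for a UCI dataset.
rng = np.random.default_rng(1)
X_tr, y_tr = rng.standard_normal((100, 8)), rng.integers(0, 2, 100)
X_te, y_te = rng.standard_normal((50, 8)), rng.integers(0, 2, 50)

clf = SVC(C=1.0, kernel='precomputed')
clf.fit(kernel_matrix(X_tr, X_tr, ngmm_kernel), y_tr)
acc = clf.score(kernel_matrix(X_te, X_tr, ngmm_kernel), y_te)
```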
The classification results in Table 1 and Figure 1 indicate that, on these datasets, the GInt kernel significantly outperforms the linear kernel, confirming the advantage of exploiting data nonlinearity. The NGMM kernel typically outperforms the GInt kernel. It appears that the GInt kernel can fairly safely be replaced by the NGMM kernel, especially since the NGMM kernel can be efficiently computed.
Table 1: The 40 datasets (with # train, # test, and # dim) and the test classification accuracies of the linear, RBF, GMM, NGMM, and GInt kernels.
3 Kernel Linearization
The kernel classification experiments in Table 1 and Figure 1 have demonstrated the effectiveness of the nonlinear kernels (GInt, NGMM, GMM, and RBF) in terms of prediction accuracy, compared to the linear kernel. However, it is well understood [2] that computing nonlinear kernels is expensive and that the kernel matrix, if fully materialized, does not fit in memory even for relatively small applications.
For example, for a small dataset with merely $n = 10^5$ data points, the kernel matrix already has $n^2 = 10^{10}$ entries. In practice, being able to linearize nonlinear kernels therefore becomes highly beneficial. Randomization (hashing) is a popular tool for kernel linearization. After linearization, we can apply our favorite linear learning packages such as LIBLINEAR [3] or SGD (stochastic gradient descent) [1].
There are multiple strategies for kernel linearization; see [8] for work on utilizing the Nystrom method [13, 15] to approximate the GMM (and RBF) kernels. In this study, we focus on the strategy based on GCWS hashing for linearizing the GMM and NGMM kernels.
3.1 GCWS: Generalized Consistent Weighted Sampling
After we have transformed the data according to (1), since the data are now nonnegative, we can apply the original “consistent weighted sampling” (CWS) [12, 4, 5] to generate hashed data. This procedure was named “generalized consistent weighted sampling” (GCWS) in [7], as summarized in Algorithm 1.
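For reference, here is a compact sketch of the CWS construction of [12, 4] applied to the transformed data (the per-hash reseeding is our simplification; it keeps the random variables aligned across vectors, which the collision probability requires):

```python
def gcws_hash(u, k, seed=0):
    """Return k GCWS samples (i*_j, t*_j) for the vector u."""
    t = transform(u)
    nz = np.nonzero(t)[0]                 # CWS only looks at nonzero entries
    out = []
    for j in range(k):
        # Same seed (hence same r, c, beta) for all vectors at hash j.
        rng = np.random.default_rng(seed + j)
        r = rng.gamma(2.0, 1.0, size=len(t))
        c = rng.gamma(2.0, 1.0, size=len(t))
        beta = rng.random(len(t))
        tt = np.floor(np.log(t[nz]) / r[nz] + beta[nz])
        y = np.exp(r[nz] * (tt - beta[nz]))
        a = c[nz] / (y * np.exp(r[nz]))
        m = int(np.argmin(a))
        out.append((int(nz[m]), int(tt[m])))  # (i*_j, t*_j)
    return out
```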
With $k$ samples $(i^*_{u,j}, t^*_{u,j})$, $j = 1$ to $k$, we can estimate $\mathrm{GMM}(u, v)$ according to the following collision probability:

$$\Pr\left\{ (i^*_{u,j}, t^*_{u,j}) = (i^*_{v,j}, t^*_{v,j}) \right\} = \mathrm{GMM}(u, v)$$
Note that $i^* \in \{1, 2, \dots, 2D\}$ while $t^*$ is unbounded. Recently, [5] made an interesting observation that, for practical data, it is ok to completely discard $t^*$. The following approximation

$$\Pr\left\{ i^*_{u,j} = i^*_{v,j} \right\} \approx \mathrm{GMM}(u, v)$$

is accurate in practical settings and makes the implementation convenient via the idea of $b$-bit minwise hashing [9].
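The two estimators (matching the full pair $(i^*, t^*)$, or only $i^*$ in the 0-bit variant) can then be sketched as:

```python
def estimate_gmm(u, v, k=512, zero_bit=True):
    """Estimate GMM(u, v) as the collision fraction over k hashes."""
    hu, hv = gcws_hash(u, k), gcws_hash(v, k)
    if zero_bit:   # discard t*, compare i* only
        return np.mean([a[0] == b[0] for a, b in zip(hu, hv)])
    return np.mean([a == b for a, b in zip(hu, hv)])
```

To hash the NGMM kernel instead of the GMM kernel, one simply applies the same procedure to the $l_1$-normalized transformed data.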
For each vector $u$, we obtain $k$ random samples $i^*_{u,j}$, $j = 1$ to $k$. We store only the lowest $b$ bits of each $i^*$. We need to view those integers as locations (of the nonzeros) instead of numerical values. For example, when $b = 2$, we should view $i^*$ as a vector of length $2^b = 4$. If $i^* = 3$, then we code it as $[0, 0, 0, 1]$; if $i^* = 0$, we code it as $[1, 0, 0, 0]$, etc. We concatenate all $k$ such vectors into a binary vector of length $2^b \times k$, which contains exactly $k$ 1’s. After we have generated such new data vectors for all the data points, we feed them to a linear classifier. We can, of course, also use the new data for many other tasks including clustering, regression, and near neighbor search.
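A sketch of this encoding step (the one-hot orientation is an illustrative choice):

```python
def encode(samples, b):
    """Expand the lowest b bits of each i* into a one-hot block of length 2^b."""
    k, width = len(samples), 1 << b
    vec = np.zeros(k * width, dtype=np.int8)
    for j, (i_star, _) in enumerate(samples):
        vec[j * width + (i_star & (width - 1))] = 1  # lowest b bits as location
    return vec   # length 2^b * k, exactly k ones

x = encode(gcws_hash([-5, 3, 0.7], k=4), b=2)  # binary vector of length 16
```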
Note that for linear learning methods (especially online algorithms), the storage and computational cost is largely determined by the number of nonzeros in each data vector, i.e., the $k$ in our case. It is thus crucial not to use a too large $k$.
3.2 An Experimental Study on “0-bit” GCWS
Figure 2 reports the test classification accuracies of 0-bit GCWS for hashing the NGMM kernel (solid curves) and the GMM kernel (dashed curves). In each panel, we also report the original linear kernel and NGMM kernel results using two solid and marked curves, with the upper curve for the NGMM kernel and the bottom curve for the linear kernel. We report results for a wide range of sample sizes $k$. We also need to choose $b$, the number of bits for encoding each hashed value $i^*$; we report experimental results for several values of $b$.
The classification results confirm, just like the prior work [7], that the “0-bit” GCWS scheme performs well for hashing the NGMM kernel. Clearly, the accuracies are affected by the choice of $b$, but not by too much, especially when $b$ is not too small. In general, we recommend using a larger $b$ if the model size is affordable. The training cost is largely determined by $k$, not much by $b$. See [7] for training time comparisons at different $k$ values.
Oftentimes, practitioners are particularly interested in choosing $k$ just large enough to exceed the accuracy of the linear kernel. This is because, in practice, linear models are often used, and a simple recipe which is more accurate than linear models and does not increase the computational cost by much would be highly desirable. We can see from Figure 2 that $k$ typically does not have to be very large in order to outperform the original linear model.
4 Conclusion

We propose the “generalized intersection (GInt)” kernel and the related “normalized generalized min-max (NGMM)” kernel. The original (histogram) intersection kernel has been popular in (e.g.,) computer vision [11]. Interestingly, the NGMM kernel can be viewed as an “asymmetrically transformed” version of the GInt kernel from the perspective of the recently proposed “asymmetric hashing” [14]. Our kernel SVM experiments on 40 UCI datasets illustrate that the NGMM kernel typically outperforms the GInt kernel. The recently proposed 0-bit GCWS scheme performs well for approximating the NGMM kernel, as expected. Readers who are interested in the Nystrom method (another scheme for kernel linearization) may refer to an earlier technical report [8].
In this study, we focus on reporting results for classification. Obviously, the techniques can be used for many other tasks, including regression, clustering, and near neighbor search. One notable advantage of (0-bit) GCWS is that the hashed values are of a discrete nature and can be directly used for building hash tables in the context of sublinear-time near neighbor search. The Nystrom method does not offer this benefit. Note that in the context of near neighbor search, GCWS hashing provides a scheme for searching near neighbors in terms of not only the NGMM kernel distance but also the GInt kernel distance, because NGMM is a monotonic transformation of GInt.
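For instance, a hypothetical sketch of such a hash table, reusing the `gcws_hash` sketch above (the choice of 4 hashed values per bucket key is ours):

```python
from collections import defaultdict

rng = np.random.default_rng(2)
database = rng.standard_normal((1000, 8))   # toy database of vectors

table = defaultdict(list)
for idx, x in enumerate(database):
    # 0-bit bucket key: the first 4 i* values, discarding t*.
    key = tuple(i for i, _ in gcws_hash(x, k=4))
    table[key].append(idx)

query = rng.standard_normal(8)
candidates = table.get(tuple(i for i, _ in gcws_hash(query, k=4)), [])
```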
While the results of this (short) report appear to support the GMM kernel more than the GInt kernel, we hope these two simple (tuning-free) kernels (NGMM and GInt) provide practitioners with more options for choosing the appropriate method for their specific domain applications.
References

- [1] L. Bottou. SGD. http://leon.bottou.org/projects/sgd.
- [2] L. Bottou, O. Chapelle, D. DeCoste, and J. Weston, editors. Large-Scale Kernel Machines. The MIT Press, Cambridge, MA, 2007.
- [3] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871–1874, 2008.
- [4] S. Ioffe. Improved consistent sampling, weighted minhash and L1 sketching. In ICDM, pages 246–255, Sydney, Australia, 2010.
- [5] P. Li. 0-bit consistent weighted sampling. In KDD, Sydney, Australia, 2015.
- [6] P. Li. Sign stable random projections for large-scale learning. Technical report, arXiv:1504.07235, 2015.
- [7] P. Li. Generalized min-max kernel and generalized consistent weighted sampling. Technical report, arXiv:1605.05721, 2016.
- [8] P. Li. Nystrom method for approximating the GMM kernel. Technical report, arXiv:1607.03475, 2016.
- [9] P. Li and A. C. König. b-bit minwise hashing. In Proceedings of the 19th International Conference on World Wide Web (WWW), pages 671–680, Raleigh, NC, 2010.
- [10] P. Li and C.-H. Zhang. Theory of the GMM kernel. Technical report, arXiv:1608.00550, 2016.
- [11] S. Maji, A. Berg, and J. Malik. Classification using intersection kernel support vector machines is efficient. In CVPR, pages 1–8, 2008.
- [12] M. Manasse, F. McSherry, and K. Talwar. Consistent weighted sampling. Technical Report MSR-TR-2010-73, Microsoft Research, 2010.
- [13] E. J. Nyström. Über die praktische Auflösung von Integralgleichungen mit Anwendungen auf Randwertaufgaben. Acta Mathematica, 54(1):185–204, 1930.
- [14] A. Shrivastava and P. Li. Asymmetric minwise hashing for indexing binary inner products and set containment. In WWW, pages 981–991, 2015.
- [15] C. K. I. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In NIPS, pages 682–688, 2001.