Isolation Kernel: The X Factor
in Efficient and Effective Large Scale Online Kernel Learning
Large scale online kernel learning aims to build an efficient and scalable kernel-based predictive model incrementally from a sequence of potentially infinite data points. To achieve this aim, the method must be able to deal with a potentially infinite number of support vectors. The current state-of-the-art is unable to deal with even a moderate number of support vectors.
This paper identifies the root cause of this limitation in current methods: the type of kernel used, which has a feature map of infinite dimensionality. With this revelation, together with our discovery that the recently introduced Isolation Kernel has an exact finite feature map, achieving the above aim of large scale online kernel learning becomes extremely simple: use Isolation Kernel instead of a kernel having an infinite feature map. We show for the first time that online kernel learning is able to deal with a potentially infinite number of support vectors.
In the age of big data, the ability to deal with large datasets or online data with potentially infinite data points is a key requirement of machine learning methods. Kernel methods are an elegant way to learn a nonlinear boundary from data. However, their application in the age of big data is limited by their perennial problem of high computational cost on high-dimensional and large datasets.
The current state-of-the-art in large scale online kernel learning focuses on improving efficiency at the cost of predictive accuracy. We find that its predictive accuracy degrades to an unacceptably low level when it is applied to datasets with more than 1000 dimensions.
In addition, an online kernel learning method must be able to deal with a potentially infinite number of support vectors. Current methods can handle only a limited number of support vectors, far short of this requirement.
The contributions of this paper are:
Identifying the root cause of the high computational cost of large scale online kernel learning: the type of kernel used, which has a feature map of infinite dimensionality.
Revealing that the recent Isolation Kernel has an exact, finite feature map.
Showing that Isolation Kernel with its exact finite feature map is the crucial factor in addressing the above-mentioned root cause and enabling efficient large scale online kernel learning without compromising accuracy. In contrast, an influential approach that employs kernel functional approximation in online kernel learning must compromise accuracy in order to achieve efficiency gain.
Together with three other key elements, namely learning with the feature map, an efficient dot product and GPU acceleration, we show for the first time that online kernel learning is able to deal with a potentially infinite number of support vectors.
Demonstrating the impact of Isolation Kernel on an existing online kernel learning method called Online Gradient Descent (OGD) and also on support vector machines (SVM). Using Isolation Kernel instead of a kernel with an infinite feature map, the same algorithms (OGD and SVM) often achieve better predictive accuracy and always run significantly faster. On high-dimensional datasets, the difference in accuracy is large, and the runtime is up to three orders of magnitude faster. In addition, OGD with Isolation Kernel has better accuracy than the state-of-the-art online method NOGD, and is up to one order of magnitude faster.
Unveiling for the first time that (a) the Voronoi-based implementation of Isolation Kernel produces better predictive accuracy than the tree-based implementation in kernel methods using OGD; and (b) the GPU version of the implementation is up to four orders of magnitude faster than the CPU version.
Furthermore, our work has two wider implications:
The current key approach of converting an infinite feature map of a kernel into an approximate finite feature map, i.e., kernel functional approximation, becomes obsolete because an exact finite feature map can be derived directly and efficiently from Isolation Kernel. To produce efficient kernel-based methods, those using a kernel with an infinite feature map are forced to employ kernel approximation. We show here that this approach is at a disadvantage in both predictive accuracy and runtime in comparison with using Isolation Kernel.
Isolation Kernel is the only kernel, as far as we know, that enables online kernel learning to live up to its full potential of dealing with a potentially infinite number of support vectors. None of the existing methods that employ existing kernels has achieved this potential. This opens up opportunities for other kernel-based methods to employ Isolation Kernel, enabling them to deal with large scale datasets that would otherwise be impossible.
The rest of the paper is organised as follows. Section 2 describes the current challenges and key approach in large scale online kernel learning. Section 3 presents the two previously unknown advantages of Isolation Kernel and its current known advantage. Section 4 describes the current understanding of Isolation Kernel: its definition, implementations and characteristics. Section 5 presents our four conceptual contributions in relation to learning with exact feature map of Isolation Kernel. Its applications to online gradient descent and support vector machines are presented in Section 6. The experimental settings and results are provided in the next two sections. Section 9 describes the relationship with existing approaches to efficient kernel methods, followed by discussion and concluding remarks in the last two sections.
2 Current challenges and key approach in large scale online kernel learning
We describe the current challenges in online kernel learning and an influential approach to meeting one of these challenges in the next two subsections.
2.1 Challenges in online kernel learning
Kernel methods are an elegant way to learn a nonlinear boundary, but they are hampered by high computational cost. Typically, one employs the kernel trick to avoid feature mapping by solving the dual optimisation problem. One of its main computational costs is due to the prediction function used. The evaluation of the prediction function has high cost if the number of support vectors is high: f(x) = Σ_{i=1}^{s} α_i y_i κ(x_i, x), where κ is the chosen kernel function; α_i is the learned weight and y_i is the class label of support vector x_i; and s is the number of support vectors. The sign of f(x), i.e., +1 or −1, yields the final class prediction. Thus, limiting the number of support vectors is the key method for reducing the high computational cost.
Alternatively, abandoning the kernel trick by using an approximate feature map φ of a chosen nonlinear kernel, one usually solves the primal optimisation problem because its prediction function has lower cost. The evaluation of the prediction function has cost proportional to the number of features in the feature map, i.e., f(x) = ⟨w, φ(x)⟩, where w = Σ_{i=1}^{s} α_i y_i φ(x_i) can be pre-computed once the support vectors are determined. The success of this method relies on producing a good approximate feature map. The method often needs to employ a small data subset in order to reduce its high computational cost. This is in addition to limiting the number of support vectors mentioned above.
Kernel methods that are aimed at large scale datasets solve the primal optimisation problem because evaluating f(x) = ⟨w, φ(x)⟩ has constant time cost, independent of the number of support vectors. A recent example is large scale online kernel learning .
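To make the cost difference concrete, the following is a minimal sketch (our own illustrative code, not the paper's; all names are hypothetical) contrasting the two prediction functions: the dual form iterates over all support vectors, while the primal form touches only the entries of the pre-computed weight vector w.

```python
import math

# Dual form: f(x) = sum_i alpha_i * y_i * kappa(x_i, x).
# Cost grows with the number of support vectors s (one kernel call each).
def predict_dual(x, support_vectors, alphas, labels, kappa):
    return sum(a * y * kappa(sv, x)
               for sv, a, y in zip(support_vectors, alphas, labels))

# Primal form: f(x) = <w, phi(x)>.
# Cost depends only on the length of the feature map, independent of s,
# once w has been pre-computed from the support vectors.
def predict_primal(x, w, phi):
    return sum(wj * fj for wj, fj in zip(w, phi(x)))

# Toy example with a Gaussian kernel (infinite feature map, so only the
# dual form is available for it) on 2-d points.
def gaussian(u, v, gamma=1.0):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

svs = [(0.0, 0.0), (1.0, 1.0)]
alphas = [0.5, 0.7]
labels = [+1, -1]
fx = predict_dual((0.2, 0.1), svs, alphas, labels, gaussian)
print(+1 if fx >= 0 else -1)   # the sign of f(x) gives the class prediction
```

The dual evaluation above costs O(s·d) per prediction, which is why limiting s (a budget) is the standard mitigation.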
In a nutshell, two key challenges in large scale online kernel learning are to:
Obtain a good approximate feature map of a chosen nonlinear kernel function, and
Limit the number of support vectors with a budget,
such that the inevitable negative impact they have on accuracy is reduced as much as possible.
2.2 An existing influential approach
The need to approximate a feature map of a chosen nonlinear kernel arises because existing nonlinear kernels such as Gaussian and polynomial kernels have either an infinite or a large number of features. Table 1 provides the sizes of their feature maps.
|Kernel|Feature map size|
|Gaussian|infinite|
|Laplacian|infinite|
|Polynomial (degree p over d attributes)|large: (d+p choose p)|
|Isolation Kernel|finite: t × ψ (user-controllable)|
One influential approach to meeting the first key challenge is kernel functional approximation. Its two popular methods are: (a) the Nyström embedding method , which uses sample points from the given dataset to construct a low-rank matrix and derive a vector representation of the data with proxy features; and (b) random features derived from the Fourier transform [4, 5] or Laplacian transform , independent of the given dataset. Both produce an approximate feature map of a chosen nonlinear kernel using proxy features, which are intended to be used as input to a linear learning algorithm.
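The Nyström embedding can be sketched as follows (our own simplified rendition using a plain eigendecomposition, not any particular library's implementation; all names are assumptions): sample landmark points, eigendecompose their kernel matrix, and map each point through the top eigenvectors scaled by the inverse square roots of the eigenvalues.

```python
import numpy as np

def gaussian(u, v, gamma=1.0):
    return np.exp(-gamma * np.sum((u - v) ** 2))

def nystrom_feature_map(landmarks, kappa, rank):
    """Return a function mapping x to an approximate `rank`-dim feature vector."""
    n = len(landmarks)
    K = np.array([[kappa(landmarks[i], landmarks[j]) for j in range(n)]
                  for i in range(n)])
    vals, vecs = np.linalg.eigh(K)           # eigh returns ascending eigenvalues
    idx = np.argsort(vals)[::-1][:rank]      # keep the top-`rank` components
    vals, vecs = vals[idx], vecs[:, idx]
    def phi(x):
        kx = np.array([kappa(z, x) for z in landmarks])
        return (vecs / np.sqrt(vals)).T @ kx  # Lambda^{-1/2} V^T k(x)
    return phi

rng = np.random.default_rng(0)
landmarks = rng.normal(size=(20, 2))
phi = nystrom_feature_map(landmarks, gaussian, rank=5)
x, y = rng.normal(size=2), rng.normal(size=2)
# <phi(x), phi(y)> is intended to approximate kappa(x, y)
print(phi(x).shape)
```

The approximation quality depends on the number of landmarks and the retained rank; this trade-off is exactly what the approach discussed below must manage.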
A recent proposal for budgeted online kernel learning  has employed the Nyström embedding method and a budget of support vectors to meet the two challenges: a subset of sampled points is used as (i) seeds to generate the approximate feature map in the Nyström process; and (ii) the support vectors, which remain the same during the course of an online setting, although their weights are updated. The algorithm, called NOGD (OGD which employs the Nyström embedding method), has shown encouraging results, dealing with large scale datasets and achieving good predictive accuracy in the online setting for datasets with fewer than 800 dimensions .
However, because the feature map is an approximation and the number of support vectors is limited, the approach reduces the time and space complexities at the expense of accuracy. In addition, we demonstrate that NOGD performs poorly on datasets with more than 1000 dimensions (see results in Section 8).
We show here that the two challenges in online kernel learning exist only because of the kind of kernels employed. For existing commonly used nonlinear kernels, the dimensionality of their feature maps is not controllable by a user, and is infinite or very large. The kernel functional approximation approach is a workaround for the first challenge that does not address its root cause. Setting a budget for support vectors (the second challenge) is a mitigation that almost always reduces the accuracy of the final model, irrespective of the budgeting scheme.
3 Advantages of Isolation Kernel
The unique characteristic of Isolation Kernel is that it has an exact feature map which is sparse and has a finite number of features that can be controlled by a user.
The sparse and finite representation, which represents each point using t out of the t × ψ representative points (exactly one per partitioning), enables an efficient dot product implementation.
The first advantage eliminates the need to get an approximate feature map (through kernel functional approximation or other means)—when an exact feature map is available, there is no reason to use an approximate feature map. It destroys the premise of the first key challenge in online kernel learning.
This enables kernel learning with Isolation Kernel to solve the primal optimisation problem efficiently, because evaluating the prediction function can be conducted using f(x) = ⟨w, Φ(x)⟩, where w can be pre-computed once the support vectors are determined. This is applicable in the testing stage as well as in the training stage.
The second advantage enables the dot product in ⟨w, Φ(x)⟩ to be computed efficiently, i.e., orders of magnitude faster than without the efficient implementation under some conditions.
We show that, with the above advantages of Isolation Kernel, online kernel learning can be achieved without the need for a budget to limit the number of support vectors—the second key challenge in online kernel learning. This allows an efficient kernel-based prediction model to deal with an unlimited number of support vectors in a sequence of infinite data points.
In a nutshell, the type of kernel used, which has an infinite or large number of features, has necessitated an intervention step to approximate its feature map. A considerable amount of research effort [2, 3, 4, 6] has been invested in producing a feature map that has a more manageable dimensionality. Using a kernel such as Isolation Kernel, which has an exact, user-controllable finite feature map, eliminates the need for such an intervention step of feature map approximation.
3.1 One known advantage
In addition to the above two (previously unknown) advantages, Isolation Kernel has one known advantage: it is data dependent [7, 8], as opposed to data independent kernels such as Gaussian and Laplacian kernels. It depends solely on the data distribution, requiring neither class information nor explicit learning. Isolation Kernel has been shown to be a better kernel than existing kernels in SVM classification , and has better accuracy than existing methods such as multiple kernel learning  and distance metric learning . Isolation Kernel is also a successful way to kernelise density-based clustering .
These previous works have focused on improvements in task-specific performance, but the use of Isolation Kernel has slowed the algorithms' runtimes [7, 8]. They have also focused on the use of the kernel trick, and the feature map of Isolation Kernel was either implicitly stated  or not mentioned at all .
Here we present the feature map of Isolation Kernel and its characteristics, and the benefits it brings to online kernel learning that would otherwise be impossible: a kernel learning method which can deal with an infinite number of support vectors, and which runs efficiently on large scale datasets without compromising accuracy.
In summary, the known advantage of data dependency contributes to a trained model’s high accuracy; whereas the two previously unknown advantages contribute to efficiency gain. These will be demonstrated in the empirical evaluations reported in Section 8.
4 Isolation Kernel
Let D be a dataset sampled from an unknown probability density function. Moreover, let H_ψ(D) denote the set of all partitionings H that are admissible under the dataset D, where each H covers the entire space of the data; and each of the ψ isolating partitions θ ∈ H isolates one data point from the rest of the points in a random subset 𝒟 ⊂ D, and |𝒟| = ψ.
For any two points x and y, the Isolation Kernel of x and y wrt D is defined to be the expectation, taken over the probability distribution on all partitionings H ∈ H_ψ(D), that x and y both fall into the same isolating partition θ ∈ H:

K_ψ(x, y | D) = E_{H_ψ(D)}[1(x, y ∈ θ | θ ∈ H)]

where 1(·) is an indicator function.
In practice, Isolation Kernel is constructed using a finite number of partitionings H_i, i = 1, …, t, where each H_i is created using 𝒟_i ⊂ D:

K_ψ(x, y | D) ≈ (1/t) Σ_{i=1}^{t} 1(x, y ∈ θ | θ ∈ H_i)

K_ψ(x, y) is a shorthand for K_ψ(x, y | D) hereafter.
4.1 iForest implementation
Here the aim is to isolate every point in 𝒟_i. This is done recursively by randomly selecting an axis-parallel split to subdivide the data into two non-empty subsets until every point is isolated. Each partitioning H_i produces ψ isolating partitions θ; and each partition contains a single point in 𝒟_i.
The algorithm  produces t isolation trees, each built independently using a subset 𝒟_i, sampled without replacement from D, where |𝒟_i| = ψ.
4.2 aNNE Implementation
Like the tree method, the nearest neighbour method also produces t models, where each model H_i consists of ψ isolating partitions θ, given a subsample 𝒟_i of ψ points. Rather than representing each isolating partition as a hyper-rectangle, it is represented as a cell in a Voronoi diagram, where the boundary between two points is equidistant from these two points.
Each H_i, being a Voronoi diagram, is built by employing the ψ points in 𝒟_i, where each isolating partition or Voronoi cell isolates one data point from the rest of the points in 𝒟_i. The point which determines a cell is regarded as the cell centre.
Given a Voronoi diagram constructed from a sample 𝒟 of ψ points, the Voronoi cell centred at z ∈ 𝒟 is:

θ[z] = { x : z = argmin_{z′ ∈ 𝒟} ℓ(x, z′) }

where ℓ is a distance function; we use the Euclidean distance as ℓ in this paper.
Note that the boundaries of a Voronoi diagram are derived implicitly as being equidistant between two points in 𝒟; they need not be derived explicitly for our purpose of realising Isolation Kernel.
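The aNNE estimate of Isolation Kernel can be sketched in a few lines (a hypothetical minimal implementation, not the authors' code): each partitioning is represented by its ψ sample points only, a Voronoi cell membership test is just a nearest-centre lookup, and the similarity of a pair is the fraction of the t partitionings in which both points share the same nearest sample point.

```python
import random

def build_partitionings(data, psi, t, seed=0):
    """Each partitioning is just its psi sample points; Voronoi cell
    boundaries stay implicit (nearest-centre assignment)."""
    rng = random.Random(seed)
    return [rng.sample(data, psi) for _ in range(t)]

def nearest_centre(x, centres):
    # index of the cell centre closest to x (squared Euclidean distance)
    return min(range(len(centres)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(centres[j], x)))

def ik_similarity(x, y, partitionings):
    """K_psi(x, y): fraction of partitionings whose Voronoi cell contains both."""
    shared = sum(1 for centres in partitionings
                 if nearest_centre(x, centres) == nearest_centre(y, centres))
    return shared / len(partitionings)

rng = random.Random(42)
data = [(rng.uniform(0, 1), rng.uniform(0, 1)) for _ in range(200)]
parts = build_partitionings(data, psi=8, t=100)
print(ik_similarity((0.1, 0.1), (0.12, 0.11), parts))  # close pair: near 1
print(ik_similarity((0.1, 0.1), (0.9, 0.9), parts))    # distant pair: near 0
```

Because the data subsample is denser in dense regions, cells there are smaller, which produces the data-dependent contour behaviour described next.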
4.3 Kernel distributions and contour plots
Figure 1, extracted from , shows that the kernel distribution of Isolation Kernel approximates that of Laplacian kernel under a uniform density distribution. A brief description of the proof is provided in the same paper.
Figure 2 shows the contour plots of the aNNE and iForest implementations of Isolation Kernel. Notice that each contour line, which denotes the same similarity to the centre (red point), is elongated along the sparse region and compressed along the dense region. In contrast, Laplacian kernel (or any data independent kernel) has the same symmetrical contour lines around the centre point, independent of the data distribution (as shown in Figure 1(a)).
The reasons why the Voronoi-based implementation is better than the tree-based implementation have been provided earlier ; this has led to better density-based clustering results than using the Euclidean distance measure.
5 Learning with exact feature map
This section presents our four conceptual contributions. Section 5.1 presents the feature map of Isolation Kernel. Section 5.2 describes the theoretical underpinning of efficient learning with Isolation Kernel. How Isolation Kernel enables the use of f(x) = ⟨w, Φ(x)⟩ in solving the primal optimisation problem, and its efficient dot product implementation, are provided in the following two subsections.
5.1 Exact feature map of Isolation Kernel
Viewing each isolating partition θ as a feature, the ψ-dimensional component of the feature space due to H can be derived using the mapping Φ_H: x → {0, 1}^ψ (where {0, 1} is a binary domain); and Φ_H can be constructed using a partitioning H as follows:

Let Φ_H(x) be a vector of ψ binary features of x indicating the only isolating partition θ_j ∈ H in which x falls, out of the ψ isolating partitions θ_j, j ∈ {1, …, ψ}.
The inner summation of the Isolation Kernel estimator in Section 4 can then be re-expressed in terms of Φ_H as follows:

1(x, y ∈ θ | θ ∈ H_i) = ⟨Φ_{H_i}(x), Φ_{H_i}(y)⟩

Because ⟨Φ_H(x), Φ_H(y)⟩ is in a quadratic form, it is PSD (positive semi-definite). The sum of PSD kernels, Σ_{i=1}^{t} ⟨Φ_{H_i}(x), Φ_{H_i}(y)⟩, is also PSD. Therefore, K_ψ is a valid kernel.
An exact and simple representation of Isolation Kernel can be derived by concatenating the t samples of Φ_{H_i}. Let Φ(x) = [Φ_{H_1}(x), …, Φ_{H_t}(x)] be a vector of t × ψ binary features. Then, Isolation Kernel represented using these features can be expressed as:

K_ψ(x, y) = (1/t) ⟨Φ(x), Φ(y)⟩

Feature map of Isolation Kernel. For a point x, the feature mapping Φ: x → {0, 1}^{t×ψ} is a vector that represents the partitions in all the partitionings H_i, i = 1, …, t; x falls into only one of the ψ partitions in each partitioning H_i.
Parameters ψ and t can be controlled by a user. Each setting of ψ and t yields a feature map.
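A minimal sketch of the feature map construction (our own illustrative code, not the paper's implementation, using the aNNE-style Voronoi assignment): Φ(x) is t concatenated one-hot blocks of length ψ, one block per partitioning.

```python
import random

def nearest_centre(x, centres):
    # index of the cell centre closest to x (squared Euclidean distance)
    return min(range(len(centres)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(centres[j], x)))

def feature_map(x, partitionings, psi):
    """Phi(x): t blocks of psi binary features; exactly one 1 per block."""
    phi = []
    for centres in partitionings:
        block = [0] * psi
        block[nearest_centre(x, centres)] = 1
        phi.extend(block)
    return phi

rng = random.Random(1)
data = [(rng.uniform(0, 1), rng.uniform(0, 1)) for _ in range(100)]
psi, t = 4, 3
partitionings = [rng.sample(data, psi) for _ in range(t)]
phi = feature_map((0.3, 0.7), partitionings, psi)
print(len(phi), sum(phi))  # t*psi features in total, exactly t of them set to 1
```

With this representation, K_ψ(x, y) is simply (1/t) times the dot product of the two binary vectors, so the kernel is exact rather than approximated.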
|Nyström (approximate feature map of a kernel κ)|Isolation Kernel (exact feature map Φ)|
|1. Sample n points from D to construct a kernel matrix.|1. Sample ψ points from D, t times, to construct partitionings H_1, …, H_t; each H_i has ψ partitions.|
|2. Derive the approximate feature vector of each x from the eigenvectors and eigenvalues of the kernel matrix.|2. Map each x to x′ having t integer attributes, where each attribute has ψ possible values 1, …, ψ; each integer is an index to a partition θ ∈ H_i. The t attributes represent the t partitionings.|
|3. Convert D to the set of approximate feature vectors.|3. Convert x′ to Φ(x): x′ is parsed over the t partitionings only once.|
|Perform learning with the feature map on the converted dataset|
5.2 Efficient Learning with Isolation Kernel
This subsection describes the theoretical underpinning of efficient learning with Isolation Kernel.
In a binary class learning problem with a given training set D = {(x_i, y_i) : i = 1, …, n}, where the points x_i have class labels y_i ∈ {−1, +1}, the goal of SVM is to learn a kernel prediction function f by solving the following optimisation problem :

min_{f ∈ ℋ}  (λ/2) ‖f‖²_ℋ + (1/n) Σ_{i=1}^{n} L(f(x_i), y_i)

where f = Σ_{i=1}^{n} α_i κ(x_i, ·) spans over all points in the training set D; L is a convex loss function wrt the prediction of x_i; and ℋ is the Reproducing Kernel Hilbert Space endowed with the kernel κ.
The computational cost of this kernel learning is high because the search space over D is large for large n.
In contrast, with Isolation Kernel, D is replaced with the smaller set of t × ψ representative points used to construct the partitionings, because t × ψ < n.
In simple terms, the span is over the t × ψ representative points, rather than the n points.
When ψ and t are chosen such that t × ψ ≪ n, learning with Isolation Kernel is expected to be faster than learning with commonly used data independent kernels such as Gaussian and Laplacian kernels.
The following subsections provide the implementations—due to the use of Isolation Kernel—which enable the significant efficiency gain without compromising predictive accuracy for online kernel learning.
5.3 Using f(x) = ⟨w, Φ(x)⟩ instead of f(x) = Σ_i α_i y_i κ(x_i, x)
The prediction function employed follows the respective functional form of either the dual or the primal optimisation problem in which one is solving.
When existing kernels such as Gaussian and Laplacian kernels are used, because they have an infinite number of features, the dual optimisation problem and f(x) = Σ_{i=1}^{s} α_i y_i κ(x_i, x) must be used (unless an approximate feature map is derived).
As Isolation Kernel has a finite feature map, it facilitates the use of the prediction function f(x) = ⟨w, Φ(x)⟩; thus solving the primal optimisation problem is a natural choice.
The evaluation of f(x) = ⟨w, Φ(x)⟩ is faster than that of the dual prediction function when the number of support vectors (s) times the number of attributes (d) is more than the effective number of features of Φ(x), i.e., s × d > t (see the reason why t is the effective number of features of Φ(x) in the next subsection). Its use yields a significant speedup when the domain is high-dimensional and/or in an online setting where the number of points can potentially be infinite. The online setting necessitates a kernel learning system which can deal with a potentially infinite number of support vectors. The procedure of such a kernel learning system using Isolation Kernel is described in Section 6.
5.4 Efficient dot product in ⟨w, Φ(x)⟩
The use of Isolation Kernel facilitates an efficient dot product in f(x) = ⟨w, Φ(x)⟩. Recall that, for each partitioning, Φ(x) has exactly one feature with value 1 in a block of ψ binary features (stated in Section 5.1). Thus, ⟨w, Φ(x)⟩ can be computed with a summation of t numbers (rather than the naive dot product, which computes t × ψ products):

⟨w, Φ(x)⟩ = Σ_{i=1}^{t} w[η_i(x)]

where η_i(x) serves as an index to the element of w corresponding to the single feature of Φ(x) with value 1 in partitioning H_i.
In summary, ⟨w, Φ(x)⟩ can be computed more efficiently using x′ as an indexing scheme.
Note that this efficient dot product is independent of ψ. For large ψ, this dot product could be orders of magnitude faster than the naive dot product (see Figure 4 in Section 8.1.2 later).
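The indexing trick can be sketched as follows (illustrative code with hypothetical names): storing x′ as one partition index per partitioning turns the t × ψ-term dot product into t array lookups.

```python
# x' stores, for each of the t partitionings, the index of the partition
# into which x falls; <w, Phi(x)> then needs t lookups instead of t*psi products.
def dot_naive(w, phi_x):
    return sum(wj * fj for wj, fj in zip(w, phi_x))

def dot_indexed(w, x_indices, psi):
    # w is laid out as t consecutive blocks of length psi
    return sum(w[i * psi + j] for i, j in enumerate(x_indices))

psi, t = 4, 3
w = [0.1 * k for k in range(t * psi)]
x_indices = [2, 0, 3]                 # x': one partition index per partitioning
phi_x = [0] * (t * psi)               # the equivalent explicit binary Phi(x)
for i, j in enumerate(x_indices):
    phi_x[i * psi + j] = 1
print(dot_naive(w, phi_x), dot_indexed(w, x_indices, psi))  # identical results
```

The indexed form never materialises the t × ψ binary vector, which is why its cost is independent of ψ.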
The indexing scheme of the feature map of Isolation Kernel is constructed in the two steps shown in Table 2, which convert x → x′ → Φ(x). The steps taken by the Nyström method [3, 1] to construct an approximate feature map are also shown for comparison in the same table.
The computational cost of the mapping from x to either x′ or Φ(x) is linear in t × ψ. But this mapping needs to be done only once for each point. That is, every point needs to examine each partitioning only once to determine the partition into which the point falls.
6 Applications to kernel learning that uses Online Gradient Descent and support vector machines
Online kernel learning aims to build an efficient and scalable kernel-based predictive model incrementally from a sequence of potentially infinite data points. One of the early methods is . One key challenge of online kernel learning is managing a growing number of support vectors, as every misclassified point is typically added to the set of support vectors. A number of ‘budget’ online kernel learning methods have been proposed (see  for a review of existing methods) to limit the number of support vectors.
One recent implementation of online kernel learning is called OGD , which employs f(x) = Σ_{i=1}^{s} α_i y_i κ(x_i, x):
If the loss on a new point x is non-zero (an incorrect prediction), then add x to the set of support vectors with weight α = η y, where η is the learning rate.
Without setting a budget, the number of support vectors (s) usually increases linearly with the number of points observed. Therefore, testing becomes increasingly slower as the number of points observed increases.
Here we show the benefits that Isolation Kernel brings to online kernel learning: its use improves both the time and space complexities of OGD significantly, from being proportional to the growing number of support vectors s to being proportional to the constant t for every prediction, while allowing s to be infinite, eliminating the need for a support vector budget. This is because t is constant while s grows as more points are observed.
This is done on exactly the same OGD implementation. The only change required in the procedure is that the prediction function is evaluated via the feature map of Isolation Kernel as follows:

f(x) = ⟨w, Φ(x)⟩, where w = Σ_{i=1}^{s} α_i y_i Φ(x_i)
During training, s is the number of support vectors at the time an evaluation of the prediction function is required. For every addition of a new support vector during the training process, the weight vector w is updated incrementally as s increments. At the end of the training process, the final w is ready to be used with f(x) = ⟨w, Φ(x)⟩ to evaluate every test point x.
Although the above expressions are in terms of Φ(x), the computation is conducted more efficiently using x′, effectively as an indexing scheme for Φ(x), as described in Section 5.4, for the update of w as well as the evaluation of f.
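The procedure above can be sketched as follows (our own simplified rendition with hinge-style mistake-driven updates; the actual OGD update rule may differ in details, and all names are ours): the weight vector has fixed length t × ψ, so absorbing a new support vector costs t additions and each prediction costs t lookups, regardless of how many support vectors have been absorbed.

```python
import random

def nearest_centre(x, centres):
    return min(range(len(centres)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(centres[j], x)))

class IKOGD:
    """OGD in the primal with the Isolation Kernel feature map: w has fixed
    length t*psi, so each prediction is O(t) no matter how many support
    vectors have been absorbed into w."""
    def __init__(self, partitionings, psi, eta=0.1):
        self.parts, self.psi, self.eta = partitionings, psi, eta
        self.w = [0.0] * (len(partitionings) * psi)

    def indices(self, x):
        # x' : one weight index per partitioning
        return [i * self.psi + nearest_centre(x, c)
                for i, c in enumerate(self.parts)]

    def predict(self, x):
        return sum(self.w[j] for j in self.indices(x))

    def update(self, x, y):                 # y in {-1, +1}
        if y * self.predict(x) <= 0:        # mistake: subgradient-style step
            for j in self.indices(x):       # absorb the new support vector
                self.w[j] += self.eta * y   # into w; w's length never grows

rng = random.Random(0)
data = [(rng.uniform(0, 1), rng.uniform(0, 1)) for _ in range(300)]
labels = [+1 if x0 + x1 > 1 else -1 for x0, x1 in data]
psi, t = 8, 50
parts = [rng.sample(data, psi) for _ in range(t)]
model = IKOGD(parts, psi)
for x, y in zip(data, labels):              # a single online pass
    model.update(x, y)
acc = sum((model.predict(x) > 0) == (y > 0)
          for x, y in zip(data, labels)) / len(data)
print(round(acc, 2))
```

Note that the memory for w is allocated once up front; adding a support vector never enlarges the model, which is what removes the need for a budget.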
To apply Isolation Kernel to support vector machines, we only need to use an algorithm which solves the primal optimisation problem, such as LIBLINEAR , after converting the data using the feature map of Isolation Kernel.
Note that the above efficiency gain is possible for Isolation Kernel, and not for other existing kernels, because it has an exact feature map with a finite number of features that can be controlled by a user; the others do not.
7 Experimental settings
We design experiments to evaluate the impact of Isolation Kernel on online kernel learning. We use the implementations of kernelised online gradient descent (OGD) and Nyström online gradient descent (NOGD) (code available at http://lsokl.stevenhoi.org/). The kernelised online gradient descent  or OGD solves the dual optimisation problem, whereas IK-OGD solves the primal optimisation problem, as does NOGD . We also compare with a recent online method that employs multi-kernel learning and random Fourier features, called AdaRaker .
Laplacian kernel is used as the baseline kernel because Isolation Kernel approximates Laplacian kernel under a uniform density distribution; Laplacian kernel has also been shown to be competitive with Gaussian kernel in SVM in a recent study . As a result, Isolation Kernel and Laplacian kernel can be expressed using the same 'sharpness' parameter.
Two existing implementations of Isolation Kernel are used: (i) Isolation Forest , as described in ; and (ii) aNNE, a nearest neighbour ensemble that partitions the data space into a Voronoi diagram, as described in . We refer to the iForest implementation as IK-OGD by default; when a distinction is required, the iForest and aNNE implementations of IK-OGD are named explicitly.
All OGD-related algorithms used the hinge loss function and the same learning rate, as used in . The only parameter search required for these algorithms is over the kernel parameter. The search range used in the experiments is listed in Table 3. The parameter is selected via 5-fold cross-validation on the training set.
The default settings for NOGD  are: the Nyström method uses the Eigenvalue-Decomposition; and budget ; and the matrix rank is set to . The default parameter used to create Isolation Kernel is set to .
AdaRaker (https://github.com/yanningshen/AdaRaker) employs sixteen Gaussian kernels and the specified bandwidths for these kernels are listed in Table 3. (The default three kernels in the code gave worse accuracy than that reported in the next section). In addition, AdaRaker uses 50 orthogonal random features (equivalent to for the Nyström method) and as default. The search range of through 5-fold cross-validation is given in Table 3.
Eleven datasets from www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/ are used in the experiments. The properties of these datasets are shown in Table 4. The datasets are selected in order to have diverse data properties: data sizes (20,000 to 2,400,000) and dimensions (22 to more than 3.2 million). Because the OGD and NOGD implementations we used work on two-class problems only, three multi-class datasets (mnist, smallNORB and cifar-10) have been converted to two-class datasets of approximately equal class distribution.
Four experiments are conducted: (a) in online setting, (b) in batch setting, (c) examine the runtime in GPU, and (d) an investigation using SVM. The CPU experiments ran on a Linux CPU machine: AMD 16-core CPU with each core running at 1.8 GHz and 64 GB RAM. The GPU experiments ran on a machine having GPU: 2 x GTX 1080 Ti with 3584 (1.6 GHz) CUDA cores & 12GB graphic memory; and CPU: i9-7900X 3.30GHz processor (20 cores), 64GB RAM.
The results are presented in four subsections of Section 8.
For the online setting, we simulate an online stream using each of the four largest datasets with over half a million points (after combining their given training and testing sets) as follows. Given a dataset, it is first shuffled. The initial training set has the training set size shown in Table 4; it is used to determine the best parameter based on 5-fold cross-validation before training the first model. The online stream is assumed to arrive sequentially in blocks of 1000 points. Each block is assumed to have no class labels initially: in testing mode, the latest trained model is used to make a prediction for every point in the block. After testing, class labels are made available: the block is in training mode and the model is updated. (This simulation is more realistic than previous online experiments, which assume that the class label of each point is available immediately after a prediction is made to enable model update . In practice, the algorithm can be put in training mode whenever class labels are available, for either part of or the entire block.) The testing and training modes are repeated for each current block in the online stream until the data run out. The test accuracy up to the current block is reported along the data stream.
In the batch setting, we report the result of a single trial of train-and-test for each dataset, which consists of separate training and testing sets. The assessments are in terms of predictive accuracy and the total runtime of training and testing. Since AdaRaker has problems dealing with large datasets, it is used in the batch setting only.
In the online setting, Isolation Kernel and its feature map for IK-OGD are established using the initial training set only. Once established, the kernel and its feature map are fixed for the rest of the data stream. This applies to the points sampled for NOGD as well. In the batch setting, the given training set is used for these purposes.
8 Empirical Results
8.1 Results in online setting
Figure 3 shows that, in terms of accuracy, IK-OGD has higher accuracy than OGD and NOGD on the four datasets, except that OGD has better accuracy on epsilon and url. (Note that the first points in the accuracy plots can swing wildly because they show the accuracy of the initial trained model on the first data block.) Notice that, as more points are observed, OGD and IK-OGD have more room for accuracy improvement than NOGD, because the former two have no budget and the latter has a limited budget. We examine the extent to which increasing the budget (for NOGD) and ψ (for IK-OGD) improves their accuracies in Section 8.2.
In terms of runtime, IK-OGD runs faster than both OGD and NOGD on the high-dimensional datasets (url, rcv1.binary and epsilon); it is only slower than NOGD on the low-dimensional covertype dataset. Notice that the gap in runtime between IK-OGD and NOGD stays the same over the period because the time each spends on its prediction function is constant. In contrast, the gap between OGD and IK-OGD increases over time because the time spent on the dual prediction function used by OGD increases as the number of support vectors increases. The runtimes of IK-OGD and NOGD are of the same order; but IK-OGD is 2 to 4 orders of magnitude faster than OGD.
NOGD maintains fast execution by limiting the number of support vectors while using an approximate feature map. The use of Laplacian kernel (or any other kernel) which has an infinite or large number of features necessitates the use of a feature map approximation method. Despite all these measures for efficiency gain in NOGD, IK-OGD without a budget still ran faster than NOGD with a budget on the two high-dimensional datasets! The efficiency gain in NOGD is a trade-off with accuracy: both the feature map approximation and the limit on the number of support vectors reduce accuracy.
The use of Isolation Kernel provides a cleaner and simpler utilisation of a feature map in the online setting than the kernel functional approximation approach (of which NOGD is a good representative method). As a result, IK-OGD achieves the efficiency gain without compromising accuracy because an exact rather than an approximate feature map is used.
8.1.1 The effect of the primal versus dual prediction function on IK-OGD
To demonstrate the impact of the type of prediction function used in IK-OGD (stated in Section 5.3), we create a version which employs the dual prediction function, named IK-OGD(dual), to compare with IK-OGD which employs the primal prediction function.
The proportions of total runtime spent on the two prediction functions are as follows: IK-OGD took 2.3% and 0.77% on rcv1.binary and epsilon, respectively. In contrast, IK-OGD(dual) took 99.9% and 99.8%, respectively. This shows that the primal prediction function has reduced the time spent on prediction from almost the total runtime to a tiny fraction of it!
The total runtimes of IK-OGD versus IK-OGD(dual) are 37 seconds versus 280,656 seconds on rcv1.binary; and 103 seconds versus 235,966 seconds on epsilon. In other words, it also reduced the total runtime significantly by 3 to 4 orders of magnitude. The difference in runtimes enlarges as more points are observed because the number of support vectors increases which affects IK-OGD(dual) only. The number of support vectors used at the end of the data stream is: 349,009 for rcv1.binary; and 349,481 for epsilon.
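The contrast between the two prediction functions can be sketched as follows; a minimal illustration (function names are hypothetical), where the dual form sums over all support vectors while the primal form is a single dot product in the finite feature space:

```python
def predict_dual(x, support_vectors, alphas, kernel):
    # Dual form: cost grows with the number of support vectors.
    s = sum(a * kernel(sv, x) for sv, a in zip(support_vectors, alphas))
    return 1 if s >= 0 else -1

def predict_primal(x_mapped, w):
    # Primal form: cost is independent of the number of support vectors;
    # a single dot product in the (finite) feature space.
    s = sum(wi * xi for wi, xi in zip(w, x_mapped))
    return 1 if s >= 0 else -1
```

With hundreds of thousands of support vectors, as reported above, the per-prediction cost of the dual form dwarfs that of the primal form.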
8.1.2 The effect of efficient dot product on IK-OGD
Here we show the effect of the efficient dot product, described in Section 5.4. The implementation which computes the summation of products naively is named IK-OGD(naive). It is compared with IK-OGD with the efficient implementation. As the impact on runtimes varies with ψ, the experiment is conducted with increasing ψ.
Figure 4 shows that the runtime difference between IK-OGD and IK-OGD(naive) enlarges as ψ increases; IK-OGD(naive) was close to two orders of magnitude slower than IK-OGD at the largest ψ on both datasets. Note that the efficient dot product in IK-OGD is independent of ψ. IK-OGD’s runtime depends on ψ only in the process of mapping a point to its feature vector (recall the mapping stated in Table 2).
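The efficient dot product rests on the sparsity of the IK map: each of the t blocks of length ψ contains exactly one non-zero entry (of value 1), so the dot product reduces to t array lookups. A minimal sketch assuming the per-block index representation (the function name is hypothetical):

```python
def ik_dot(w, indices, psi):
    """Dot product of weight vector w (length t * psi) with the sparse
    IK feature map given by the per-subset nearest-neighbour indices.
    Only t additions are needed, independent of psi."""
    return sum(w[j * psi + i] for j, i in enumerate(indices))
```

A naive implementation would multiply and sum over all t * ψ coordinates, which explains the widening runtime gap as ψ grows.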
8.2 Results in batch setting
Observations from the results shown in Table 5 are:
In terms of predictive accuracy:
IK-OGD performs better than OGD on six datasets; it has equal or approximately equal accuracy on url, mnist and a9a. This outcome is purely due to the kernel employed: Isolation Kernel approximates the Laplacian kernel under a uniform density distribution, and it adapts to the density structure of the given dataset. This relative result between Isolation Kernel and Laplacian kernel on OGD is consistent with the previous relative result on SVM. The only two datasets on which IK-OGD performs significantly worse than OGD are smallNORB and epsilon. We will see in Section 8.2.2 that the gap can be significantly reduced by increasing ψ, without a significant increase in runtime.
NOGD has lower accuracy than OGD on ten out of eleven datasets because it employs an approximate feature map of the Laplacian kernel. As a consequence, NOGD can be significantly worse than OGD; examples are url, smallNORB, cifar-10, epsilon and mnist. While increasing its budget may improve NOGD’s accuracy to approach the level of OGD, it will still perform worse than IK-OGD. Indeed, NOGD performed worse than IK-OGD on ten out of eleven datasets in Table 5.
The Voronoi-based implementation of IK-OGD has equal or better accuracy than the iForest-based implementation. This result is consistent with the assessment comparing the two implementations of Isolation Kernel in density-based clustering. This is because a Voronoi diagram produces partitions with non-axis-parallel boundaries, whereas iForest yields axis-parallel partitions only. Notice that the accuracy difference between the Voronoi-based IK-OGD and OGD is huge on news20, rcv1, real-sim and covertype.
In terms of runtime:
While OGD and IK-OGD use exactly the same training procedure (with the exception of the prediction function used), IK-OGD has the advantage in two aspects:
The differences in runtimes are huge: IK-OGD is three orders of magnitude faster than OGD on six of the eleven datasets, and at least one order of magnitude faster on the others. This is due to the efficient implementations made possible by Isolation Kernel, described in Section 5.
Both OGD and IK-OGD can potentially incorporate an infinite number of support vectors. But the dual prediction function denies OGD the opportunity to live up to its full potential because its testing time complexity is proportional to the number of support vectors. In contrast, IK-OGD has constant testing time complexity, independent of the number of support vectors.
Compared with NOGD, IK-OGD is up to one order of magnitude faster on the high dimensional datasets. On the low dimensional datasets (100 dimensions or fewer), IK-OGD ran only slightly slower. This is remarkable given that IK-OGD has no budget while NOGD has a budget of 100 support vectors only. As a result, NOGD has lower accuracy than IK-OGD on all datasets, except a9a.
In a nutshell, IK-OGD inherits the advantages of OGD (no budget) and NOGD (fast prediction via a finite feature map); yet it does not have their disadvantages: OGD’s costly dual prediction function, and NOGD’s budget which lowers its predictive accuracy.
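The same sparsity also keeps training cheap: an online gradient step in the primal touches only the t active coordinates of the weight vector. A minimal sketch assuming hinge loss and the index representation of the IK map (names are hypothetical, not the authors' implementation):

```python
def ik_ogd_update(w, indices, psi, y, lr=0.1):
    """One online gradient step for hinge loss on the sparse IK map
    (sketch). Only the t active coordinates of w are touched."""
    margin = y * sum(w[j * psi + i] for j, i in enumerate(indices))
    if margin < 1:                      # hinge loss is active
        for j, i in enumerate(indices):
            w[j * psi + i] += lr * y    # gradient step on t coordinates
    return w
```

Both the prediction and the update are O(t) per point, which is what allows an unbounded number of support vectors to be absorbed into w at no extra cost.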
8.2.1 Comparison with AdaRaker
Table 5 also shows that the multi-kernel learning method AdaRaker has lower accuracy than OGD (and even NOGD) using a single kernel. This result is consistent with the comparison between SimpleMKL and SVM using Isolation Kernel conducted previously. Out of the five datasets on which it could run within reasonable time and without memory errors, AdaRaker ran slower than OGD on three datasets but faster on two. Compared with IK-OGD and NOGD, AdaRaker is at least two orders of magnitude slower on the five datasets.
AdaRaker has memory error issues with high dimensional datasets.
8.2.2 The effects of ψ on IK-OGD and the budget B on NOGD
Two datasets, epsilon and smallNORB, are used in this experiment because the accuracy differences between OGD and NOGD on these datasets are the largest; and they are the only two datasets on which IK-OGD performed significantly worse than OGD. We examine the effects of the parameters ψ and B on IK-OGD and NOGD, respectively.
Figure 5 shows that IK-OGD’s accuracy improves significantly as ψ increases. Note that, with a sufficiently large ψ on epsilon, the accuracy of IK-OGD reached the same level as that of OGD shown in Table 5; yet IK-OGD still ran two orders of magnitude faster than OGD. In contrast, although NOGD’s accuracy improved when B was increased from 100 to 10000, it still performed worse than OGD and IK-OGD by a large margin of 10%. In addition, NOGD at B = 10000 ran two orders of magnitude slower than NOGD at B = 100. On smallNORB, IK-OGD also improves its accuracy as ψ increases up to a point; but NOGD showed little improvement over the entire range of B.
NOGD’s runtime increases linearly with B, whereas the runtime of IK-OGD increases sublinearly with ψ.
8.3 CPU and GPU versions of IK-OGD
The use of a Voronoi diagram to partition the data space for Isolation Kernel slows down the runtime significantly compared with the iForest implementation, mainly due to the need to search for nearest neighbours. However, because nearest neighbour search is amenable to GPU acceleration, we compare the runtimes of the CPU and GPU versions of the Voronoi-based IK-OGD.
The result is shown in Table 6. The GPU version of the Voronoi-based IK-OGD is up to four orders of magnitude faster than its CPU version. Despite this GPU speedup, it is still up to one order of magnitude slower on some datasets than the iForest-based IK-OGD run on CPU.
In summary, GPU is a good means to speed up the Voronoi-based IK-OGD. When accuracy is paramount, the Voronoi-based IK-OGD is always a better choice than the iForest-based one (as shown in Table 5), even though the former, even with GPU, runs slower than the latter on CPU.
Note that while it is possible to speed up the original OGD which employs the dual prediction function using GPU, it is not a good solution for two reasons. First, it does not improve OGD’s accuracy if the same data independent kernel is used. Second, the GPU-accelerated OGD is expected to still run slower than the CPU version of OGD which employs the primal prediction function using the same kernel.
The runtime reported in Table 6 consists of two components: feature mapping time and OGD runtime. For example, the longest GPU runtime is on epsilon, which consists of 457 GPU seconds of feature mapping time and 0.9 CPU seconds of OGD runtime. In other words, the bulk of the runtime is spent on feature mapping; OGD took only a tiny fraction of a second to complete the job on CPU.
8.4 SVM versus IK-SVM
Isolation Kernel is the only nonlinear kernel, as far as we know, that allows the trick of training on feature-mapped data (solving the primal optimisation problem) to be applied to kernel-based methods, including SVM, to speed up training as well as testing. We apply Isolation Kernel to SVM to produce IK-SVM. It is realised using LIBLINEAR, since IK-SVM is equivalent to applying a linear SVM to the IK feature-mapped data. IK-SVM is compared with LIBSVM using the Laplacian kernel (denoted SVM).
Table 7 shows the comparison of SVM and IK-SVM. The relative result is reminiscent of that comparing OGD with IK-OGD in Table 5, i.e., IK-SVM has better accuracy than SVM on high dimensional datasets (news20, rcv1, real-sim and cifar-10); and they have comparable accuracy on datasets with fewer than 2000 dimensions (mnist, a9a and ijcnn1). In terms of runtime, IK-SVM is up to four orders of magnitude faster.
The memory errors of SVM on datasets with large training sets are a known limitation of SVM using existing nonlinear kernels. Our result in Table 7 shows that Isolation Kernel enables SVM to deal with large datasets that would otherwise be impossible.
Note that the runtime reported in Table 7 does not include the feature mapping time. With GPU, adding the GPU runtime reported in Table 6 (the bulk is the feature mapping time) to that of IK-SVM does not change the conclusion: IK-SVM runs order(s) of magnitude faster than SVM and has better accuracy in high dimensional and large scale datasets.
9 Relation to existing approaches for efficient kernel methods
9.1 Kernel functional approximation
Kernel functional approximation is a popular and effective approach to produce a user-controllable, finite, approximate feature map of a kernel having an infinite number of features.
One representative is the Nyström method [16, 3, 17]. It first samples s points from the given dataset, constructs a rank-r approximation of the s × s kernel matrix over the sampled points, and derives an r-dimensional vector representation of the data. This gives x ↦ [φ_1(x), …, φ_r(x)], where φ_i(x) = (1/√λ_i) Σ_{j=1}^{s} v_{ij} κ(x_j, x) is a normalised eigenfunction defined by the eigenvalues λ_i and eigenvectors v_i of the sampled kernel matrix. See the cited works for details. For r much smaller than the data size, it reduces the search space significantly.
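To make the Nyström construction concrete, here is a minimal sketch assuming NumPy and a user-supplied kernel function (the name `nystroem_features` is hypothetical): it eigendecomposes the kernel matrix over s sampled landmark points and keeps the top r components.

```python
import numpy as np

def nystroem_features(X, landmarks, kernel, r):
    """Rank-r Nystroem feature map (sketch): eigendecompose the s x s
    kernel matrix over the sampled landmark points, then project each
    point's kernel evaluations onto the top-r scaled eigenvectors."""
    K = np.array([[kernel(u, v) for v in landmarks] for u in landmarks])
    vals, vecs = np.linalg.eigh(K)              # eigenvalues ascending
    vals, vecs = vals[::-1][:r], vecs[:, ::-1][:, :r]
    C = np.array([[kernel(x, v) for v in landmarks] for x in X])
    return C @ vecs / np.sqrt(np.maximum(vals, 1e-12))
```

With r = s, the feature map reproduces the kernel exactly on the landmark points; the approximation error on other points is what the budget trades against.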
The key overhead is the eigenvalue decomposition of the low rank matrix. This overhead is small only if both s and r are small relative to the data size and dimensionality. It becomes impracticably large for problems which require large s and r.
Also, although the Nyström method depends on data when deriving an approximate feature map of a chosen nonlinear kernel, the kernel it approximates is still data independent (e.g., the Gaussian and Laplacian kernels).
In any case, the efficiency gain from the kernel functional approximation approach comes with the cost of reduced accuracy as it is an approximation of the chosen nonlinear kernel function.
In contrast, Isolation Kernel has an exact feature map. As a result, the efficiency gain from the use of Isolation Kernel does not degrade accuracy. It is a direct method which does not need an intervention step to approximate a feature map from a kernel having an infinite or large number of features.
9.2 Sparse kernel approximation
To represent non-linearity, the feature map of a kernel has dimensionality which is usually significantly larger than the dimension of the given dataset. The Nyström method reduces the dimensionality to produce a dense representation.
In contrast, sparse kernel approximation aims to produce high-dimensional sparse features. (A sparse representation yields vectors with many zero values, where a zero value means the feature is irrelevant.) One proposal approximates each feature vector φ(x) using a small subset of representative points, e.g., x’s neighbours (rather than all representative points). It then uses product quantization (PQ), an improvement over vector quantization that reduces storage and retrieval time in approximate nearest neighbour search, to encode the sparse features, and employs bundle methods to learn directly from the PQ codes.
Interestingly, each feature vector φ(x) of Isolation Kernel is both a sparse representation and a coding which employs exactly t representative points drawn from t random subsets of ψ points each, i.e., exactly one out of the ψ points in each subset is used for the sparse representation, concatenated t times.
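Under the index representation assumed earlier, expanding the per-subset indices into this explicit sparse feature vector is straightforward; a minimal sketch (hypothetical helper name):

```python
def ik_sparse_vector(indices, psi):
    """Expand the per-subset nearest-neighbour indices into the explicit
    t * psi dimensional sparse feature vector: exactly one 1 per
    psi-length block, one block per subset (sketch)."""
    v = [0] * (len(indices) * psi)
    for j, i in enumerate(indices):
        v[j * psi + i] = 1
    return v
```

In practice only the t indices need to be stored; the explicit vector is shown here to make the one-out-of-ψ-per-block coding visible.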
There are other sparse representations, e.g., Local Deep Kernel Learning  learns a tree-based feature embedding which is high dimensional and sparse through a generalised version of Localized Multiple Kernel Learning of multiple data independent kernels.
The key difference between Isolation Kernel and current sparse kernel approximation is that the former is a data dependent kernel [7, 8] which has an exact feature map. Sparse kernel approximation may be viewed as another intervention step (an alternative to kernel functional approximation) to produce a finite sparse approximate feature map from one or more data independent kernels having an infinite number of features. In addition, neither computationally expensive learning nor PQ is required by Isolation Kernel.
It is important to note that Isolation Kernel is not one kernel function such as the Gaussian kernel, but a class of kernels whose distributions depend on the space partitioning mechanism employed. We use two implementations of Isolation Kernel: (a) iForest, which has a kernel distribution similar to that of the Laplacian kernel under a uniform density distribution; (b) a Voronoi diagram used to partition the space, which gives Isolation Kernel a distribution more akin to an exponential kernel under a uniform density distribution. Both realisations of Isolation Kernel adapt to the local density of a given dataset, unlike existing data independent kernels. The criterion required of a partitioning mechanism to produce an effective Isolation Kernel is described in [7, 8]. This paper has focused on efficient implementations of Isolation Kernel in online kernel learning, without compromising accuracy.
When using a linear kernel, the trick of training on feature-mapped data instead of using the kernel function, to speed up both the training and testing stages, has been applied previously, e.g., in LIBLINEAR, even when solving the dual optimisation problem. This is possible in LIBLINEAR because the linear kernel has an exact and finite feature map. But such a trick cannot be applied to SVM with an existing nonlinear kernel such as the Gaussian or Laplacian kernel, because its feature map is not finite.
Note that the works reported previously, including OGD and NOGD, and the IK-OGD used here do not address the concept change issue in the online setting. Nevertheless, all these works address the efficiency issue in the online setting, which serves as the foundation for tackling the efficacy issue of large scale online kernel learning under concept change.
Geurts et al. describe a kernel view of Extra-Trees (a variant of Random Forest) whose feature map is also sparse and similar to the one presented here. However, like the Random Forest (RF) kernel, this kernel was offered as a viewpoint to explain the behaviour of Random Forest; no evaluation has been conducted to assess its efficacy in a kernel-based method. Ting et al. have described the conceptual differences between RF-like kernels and Isolation Kernel; their empirical evaluation revealed that RF-like kernels are inferior to Isolation Kernel when used in SVM.
11 Concluding remarks
The identification of the root cause is of prime importance in solving any problem. Without knowing the root cause, attempts to solve the problem at best mitigate the real issue, merely masking the symptoms. This is the case with the two challenges of current online kernel learning: (i) using kernel functional approximation to convert an infinite feature map of a chosen kernel into a finite approximate feature map; and (ii) using various methods to limit the number of support vectors. These methods have achieved what they set out to do, trading accuracy for efficiency. Yet they still do not enable online kernel learning to live up to its full potential.
Once the root cause of the perennial problem of online kernel learning has been identified, namely that the high computational cost on high dimensional and large datasets is due to the type of kernel used, the solution is extremely simple: a kernel which addresses the root cause shall have a feature map that is sparse and finite, with the number of features controllable by the user.
When Isolation Kernel is used, the two challenges of current online kernel learning become non-issues. They are challenges only if the chosen kernel has an infinite feature map. In other words, the challenges are merely symptoms of the real issue; current approaches treat the symptoms without addressing the real issue.
The outcome of this work is unprecedented: it enables online kernel learning to achieve what current approaches cannot, i.e., to live up to its full potential in dealing with a potentially infinite number of support vectors in an online setting with an infinite number of data points. This outcome is derived from identifying the root cause of the problem and addressing it directly.
This outcome is a result of bringing four key elements together: (i) Isolation Kernel’s exact and finite feature map; (ii) solving the primal optimisation problem in kernel learning with feature mapped data; (iii) efficient dot product; and (iv) GPU acceleration. Whilst the individual elements are uncomplicated and not even new (except the first one), together they have achieved the outcome that has evaded many methods thus far, particularly in terms of predictive accuracy and runtime. The first element is the crucial core that jells with other elements to achieve this outcome.
-  J. Lu, S. C. H. Hoi, J. Wang, P. Zhao, and Z.-Y. Liu, “Large scale online kernel learning,” Journal of Machine Learning Research, vol. 17, no. 1, pp. 1613–1655, 2016.
-  Y.-W. Chang, C.-J. Hsieh, K.-W. Chang, M. Ringgaard, and C.-J. Lin, “Training and testing low-degree polynomial data mappings via linear SVM,” Journal of Machine Learning Research, vol. 11, pp. 1471–1490, 2010.
-  C. K. I. Williams and M. Seeger, “Using the Nyström method to speed up kernel machines,” in Advances in Neural Information Processing Systems 13, T. K. Leen, T. G. Dietterich, and V. Tresp, Eds. MIT Press, 2001, pp. 682–688.
-  A. Rahimi and B. Recht, “Random features for large-scale kernel machines,” in Advances in Neural Information Processing Systems, ser. NIPS’07, 2007, pp. 1177–1184.
-  X. Y. Felix, A. T. Suresh, K. M. Choromanski, D. N. Holtmann-Rice, and S. Kumar, “Orthogonal random features,” in Advances in Neural Information Processing Systems, ser. NIPS’16, 2016, pp. 1975–1983.
-  J. Yang, V. Sindhwani, Q. Fan, H. Avron, and M. Mahoney, “Random Laplace feature maps for semigroup kernels on histograms,” in 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 971–978.
-  K. M. Ting, Y. Zhu, and Z.-H. Zhou, “Isolation kernel and its effect on SVM,” in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2018, pp. 2329–2337.
-  X. Qin, K. M. Ting, Y. Zhu, and V. C. S. Lee, “Nearest-neighbour-induced isolation similarity and its impact on density-based clustering,” in Proceedings of The Thirty-Third AAAI Conference on Artificial Intelligence, 2019.
-  A. Rakotomamonjy, F. R. Bach, S. Canu, and Y. Grandvalet, “SimpleMKL,” Journal of Machine Learning Research, vol. 9, no. Nov, pp. 2491–2521, 2008.
-  P. Zadeh, R. Hosseini, and S. Sra, “Geometric mean metric learning,” in International Conference on Machine Learning, 2016, pp. 2464–2471.
-  F. T. Liu, K. M. Ting, and Z.-H. Zhou, “Isolation forest,” in Proceedings of the IEEE International Conference on Data Mining, 2008, pp. 413–422.
-  B. Scholkopf and A. J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2001.
-  J. Kivinen, A. J. Smola, and R. C. Williamson, “Online learning with kernels,” in Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic, 2001, pp. 785–792.
-  R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, “LIBLINEAR: A library for large linear classification,” Journal of Machine Learning Research, vol. 9, pp. 1871–1874, 2008.
-  Y. Shen, T. Chen, and G. Giannakis, “Online ensemble multi-kernel learning adaptive to non-stationary and adversarial environments,” in Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, 2018, pp. 2037–2046.
-  T. Yang, Y.-F. Li, M. Mahdavi, R. Jin, and Z.-H. Zhou, “Nyström method vs random fourier features: A theoretical and empirical comparison,” in Advances in Neural Information Processing Systems, ser. NIPS’12, 2012, pp. 476–484.
-  J. Wu, L. Ding, and S. Liao, “Predictive Nyström method for kernel methods,” Neurocomputing, vol. 234, pp. 116–125, 2017.
-  A. Vedaldi and A. Zisserman, “Sparse kernel approximations for efficient classification and detection,” in 2012 IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 2320–2327.
-  C. Jose, P. Goyal, P. Aggrwal, and M. Varma, “Local deep kernel learning for efficient non-linear SVM prediction,” in Proceedings of the 30th International Conference on Machine Learning, 2013, pp. III–486–III–494.
-  P. Geurts, D. Ernst, and L. Wehenkel, “Extremely randomized trees,” Machine Learning, vol. 63, no. 1, pp. 3–42, 2006.
-  L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
-  ——, “Some infinity theory for predictor ensembles,” Technical Report 577. Statistics Dept. UCB., 2000.