Isolation Kernel: The X Factor
in Efficient and Effective Large Scale Online Kernel Learning
Abstract
Large scale online kernel learning aims to build an efficient and scalable kernel-based predictive model incrementally from a sequence of potentially infinite data points. To achieve this aim, the method must be able to deal with a potentially infinite number of support vectors. The current state-of-the-art is unable to deal with even a moderate number of support vectors.
This paper identifies the root cause of the limitation of current methods, i.e., the type of kernel used, which has a feature map of infinite dimensionality. With this revelation, together with our discovery that the recently introduced Isolation Kernel has a finite feature map, achieving the above aim of large scale online kernel learning becomes extremely simple: simply use Isolation Kernel instead of a kernel having an infinite feature map. We show for the first time that online kernel learning is able to deal with a potentially infinite number of support vectors.
1 Introduction
In the age of big data, the ability to deal with large datasets or online data with potentially infinite data points is a key requirement of machine learning methods. Kernel methods are an elegant way to learn a nonlinear boundary from data. However, their application in the age of big data is limited because of their perennial problem of high computational cost on high dimensional and large datasets.
The current state-of-the-art in large scale online kernel learning focuses on improving efficiency at the cost of predictive accuracy. We find that its predictive accuracy degrades to an unacceptably low level when it is applied to datasets having more than 1000 dimensions.
In addition, an online kernel learning method must be able to deal with a potentially infinite number of support vectors. Current methods can handle a limited number of support vectors only, far short of the requirement.
The contributions of this paper are:

Identifying the root cause of the high computational cost of large scale online kernel learning, i.e., the type of kernel used, which has a feature map of infinite dimensionality.

Revealing that the recent Isolation Kernel has an exact, finite feature map.

Showing that Isolation Kernel with its exact finite feature map is the crucial factor in addressing the above-mentioned root cause, enabling efficient large scale online kernel learning without compromising accuracy. In contrast, an influential approach that employs kernel functional approximation in online kernel learning must compromise accuracy in order to achieve an efficiency gain.

Together with three other key elements: learning with feature map, efficient dot product and GPU acceleration, we show for the first time that online kernel learning is able to deal with an infinite number of support vectors.

Demonstrating the impact of Isolation Kernel on an existing online kernel learning method called Online Gradient Descent (OGD) and also on support vector machines (SVM). Using Isolation Kernel, instead of a kernel with an infinite feature map, the same algorithms (OGD and SVM) often achieve better predictive accuracy and always run significantly faster. On high dimensional datasets, the difference in accuracy is large; and the runtime is up to three orders of magnitude faster. In addition, OGD with Isolation Kernel has better accuracy than the state-of-the-art online variant of OGD called NOGD and is up to one order of magnitude faster.

Unveiling for the first time that (a) the Voronoi-based implementation of Isolation Kernel produces better predictive accuracy than the tree-based implementation in kernel methods using OGD; and (b) the GPU version of the implementation is up to four orders of magnitude faster than the CPU version.
Furthermore, our work has two wider implications:

The current key approach of converting an infinite feature map of a kernel into an approximate finite one, i.e., kernel functional approximation, becomes obsolete because an exact finite feature map can be derived directly and efficiently from Isolation Kernel. To produce efficient kernel-based methods, those using a kernel with an infinite feature map are forced to employ a kernel approximation. We show here that this approach has every disadvantage in terms of predictive accuracy and runtime in comparison with using Isolation Kernel.

Isolation Kernel is the only kernel, as far as we know, that enables online kernel learning to live up to its full potential of dealing with a potentially infinite number of support vectors. None of the existing methods that employ existing kernels has achieved this potential. This opens up the opportunity for other kernel-based methods to employ Isolation Kernel, enabling them to deal with large scale datasets that would otherwise be impossible.
The rest of the paper is organised as follows. Section 2 describes the current challenges and key approach in large scale online kernel learning. Section 3 presents the two previously unknown advantages of Isolation Kernel and its current known advantage. Section 4 describes the current understanding of Isolation Kernel: its definition, implementations and characteristics. Section 5 presents our four conceptual contributions in relation to learning with exact feature map of Isolation Kernel. Its applications to online gradient descent and support vector machines are presented in Section 6. The experimental settings and results are provided in the next two sections. Section 9 describes the relationship with existing approaches to efficient kernel methods, followed by discussion and concluding remarks in the last two sections.
2 Current challenges and key approach in Large Scale online kernel learning
We will describe the current challenges in online kernel learning and an influential approach to meeting one of these challenges in the next two subsections.
2.1 Challenges in online kernel learning
Kernel methods are an elegant way to learn a nonlinear boundary. But they are hampered by high computational cost, especially when one employs the kernel trick to avoid feature mapping by solving the dual optimisation problem. One of its main computational costs is due to the prediction function used. The evaluation of the prediction function has high cost if the number of support vectors is high: $f(\mathbf{x}) = \sum_{i=1}^{s} \alpha_i y_i \kappa(\mathbf{x}_i, \mathbf{x})$, where $\kappa$ is the chosen kernel function; $\alpha_i$ is the learned weight and $y_i$ is the class label of support vector $\mathbf{x}_i$; and $s$ is the number of support vectors. The sign of $f(\mathbf{x})$, i.e., $+1$ or $-1$, yields the final class prediction. Thus, limiting the number of support vectors is the key method for reducing the high computational cost.
Alternatively, abandoning the kernel trick by using an approximate feature map $\tilde{\Phi}$ of a chosen nonlinear kernel, one usually solves the primal optimisation problem because its prediction function has lower cost. The evaluation of the prediction function has cost proportional to the number of features in the feature map $\tilde{\Phi}$, i.e., $f(\mathbf{x}) = \langle \mathbf{w}, \tilde{\Phi}(\mathbf{x})\rangle$, where $\mathbf{w} = \sum_{i=1}^{s} \alpha_i y_i \tilde{\Phi}(\mathbf{x}_i)$ can be precomputed once the support vectors are determined. The success of this method relies on producing a good approximate feature map. The method often needs to employ a small data subset in order to reduce its high computational cost. This is in addition to limiting the number of support vectors mentioned above.
Kernel methods that are aimed at large scale datasets solve the primal optimisation problem because $f(\mathbf{x}) = \langle \mathbf{w}, \tilde{\Phi}(\mathbf{x})\rangle$ has constant time cost, independent of the number of support vectors. A recent example is large scale online kernel learning [1].
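To make the cost contrast above concrete, here is a minimal sketch of the two prediction functions; the function and variable names are our own, not from the paper. With the linear kernel, whose feature map is the identity, the two forms agree exactly, which makes the equivalence easy to check:

```python
import numpy as np

# Dual form: f(x) = sum_i alpha_i * y_i * k(x_i, x); evaluation cost grows
# with the number of support vectors s.
def predict_dual(x, svs, alphas, ys, kernel):
    return sum(a * y * kernel(sv, x) for sv, a, y in zip(svs, alphas, ys))

# Primal form: f(x) = <w, Phi(x)>; once w is precomputed from the support
# vectors, the evaluation cost is independent of s.
def predict_primal(x, w, feature_map):
    return float(np.dot(w, feature_map(x)))

# Sanity check with the linear kernel (feature map = identity): both forms agree.
svs = [np.array([0.0, 1.0]), np.array([1.0, 0.0])]
alphas, ys = [0.5, 0.8], [1, -1]
w = sum(a * y * sv for sv, a, y in zip(svs, alphas, ys))  # precomputed once
x = np.array([0.2, 0.3])
f_dual = predict_dual(x, svs, alphas, ys, np.dot)
f_primal = predict_primal(x, w, lambda v: v)
label = 1 if f_primal >= 0 else -1  # sign of f gives the class prediction
```

Under a nonlinear kernel with an infinite feature map, only the dual form is available unless an approximate (or, as shown later, exact) finite feature map is obtained.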
In a nutshell, two key challenges in large scale online kernel learning are to:

Obtain a good approximate feature map of a chosen nonlinear kernel function, and

Limit the number of support vectors with a budget,
such that the inevitable negative impact they have on accuracy is reduced as much as possible.
2.2 An existing influential approach
The need to approximate a feature map of a chosen nonlinear kernel arises because existing nonlinear kernels such as Gaussian and polynomial kernels have either an infinite or a large number of features. Table 1 provides the sizes of their feature maps.
Kernel | Feature map size

Gaussian | infinite
Polynomial (degree $p$, input dimension $d$) | $\binom{d+p}{p}$
Isolation | $t \times \psi$ (space-partitioning)
One influential approach to meeting the first key challenge is kernel functional approximation; its two popular methods are: (a) the Nyström embedding method [3], which uses $r$ sample points from the given dataset to construct a matrix of low rank and derive a vector representation of the data with $r$ proxy features; and (b) deriving random features based on the Fourier transform [4, 5] or the Laplacian transform [6], independent of the given dataset. Both produce an approximate feature map of a chosen nonlinear kernel using proxy features which are intended to be used as input to a linear learning algorithm.
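For illustration, the Nyström construction can be sketched as follows. This is a minimal sketch under our own naming: `landmarks` play the role of the sampled points, and the eigendecomposition of the small landmark kernel matrix yields the proxy features:

```python
import numpy as np

def nystrom_feature_map(landmarks, kernel):
    """Approximate feature map: <phi(x), phi(y)> ~ kernel(x, y),
    built from the eigendecomposition of the r x r landmark kernel matrix."""
    K = np.array([[kernel(zi, zj) for zj in landmarks] for zi in landmarks])
    vals, vecs = np.linalg.eigh(K)              # K = V diag(vals) V^T
    keep = vals > 1e-10                         # drop near-zero eigenvalues
    proj = vecs[:, keep] / np.sqrt(vals[keep])  # V_r D_r^{-1/2}
    def phi(x):
        kx = np.array([kernel(x, z) for z in landmarks])
        return proj.T @ kx                      # r proxy features
    return phi

rng = np.random.default_rng(0)
gauss = lambda u, v: np.exp(-np.sum((u - v) ** 2))
data = rng.normal(size=(50, 3))
phi = nystrom_feature_map(data[:10], gauss)     # r = 10 landmark points
approx = float(phi(data[20]) @ phi(data[30]))   # approximates gauss(x, y)
```

On the landmarks themselves the approximation is exact; elsewhere its quality depends on the number of sampled points $r$, which is precisely the accuracy/efficiency trade-off discussed in this section.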
A recent proposal of budget online kernel learning [1] employs the Nyström embedding method and a budget of support vectors to meet the two challenges: a subset of sampled points is used as (i) seeds to generate the approximate feature map in the Nyström process; and (ii) the support vectors, which remain the same during the course of the online setting, although their weights are updated. The algorithm, called NOGD (OGD which employs the Nyström embedding method), has shown encouraging results, dealing with large scale datasets with good predictive accuracy in the online setting for datasets with fewer than 800 dimensions [1].
However, because the feature map is an approximation and the number of support vectors is limited, the approach reduces the time and space complexities at the expense of accuracy. In addition, we demonstrate that NOGD performs poorly on datasets with more than 1000 dimensions (see results in Section 8).
We show here that the two challenges in online kernel learning only exist because of the kind of kernels employed. For existing commonly used nonlinear kernels, the dimensionality of their feature maps is not controllable by a user, and is infinite or very large. The kernel functional approximation approach is a workaround for the first challenge that does not address its root cause. Setting a budget for support vectors (for the second challenge) is a mitigation that almost always reduces the accuracy of the final model, irrespective of the budgeting scheme.
3 Advantages of Isolation Kernel
We show here that a recent kernel called Isolation Kernel [7, 8] has two previously unknown advantages, compared with existing data independent kernels:

Isolation Kernel has an exact feature map which is sparse and has a finite number of features, and this number can be controlled by a user.

The sparse and finite representation, which represents each feature vector using only $t$ of the $t \times \psi$ binary features, enables an efficient dot product implementation.
The first advantage eliminates the need to get an approximate feature map (through kernel functional approximation or other means)—when an exact feature map is available, there is no reason to use an approximate feature map. It destroys the premise of the first key challenge in online kernel learning.
This enables kernel learning to solve the primal optimisation problem efficiently with Isolation Kernel. This is because evaluating the prediction function can be conducted more efficiently using $f(\mathbf{x}) = \langle \mathbf{w}, \Phi(\mathbf{x})\rangle$, where $\mathbf{w}$ can be precomputed once the support vectors are determined. This is applicable in the testing stage as well as in the training stage.
The second advantage enables the dot product in $\langle \mathbf{w}, \Phi(\mathbf{x})\rangle$ to be computed efficiently, i.e., orders of magnitude faster than without the efficient implementation, under some conditions.
We show that, with the above advantages of Isolation Kernel, online kernel learning can be achieved without the need for a budget to limit the number of support vectors—the second key challenge in online kernel learning. This allows an efficient kernelbased prediction model to deal with an unlimited number of support vectors in a sequence of infinite data points.
In a nutshell, the type of kernel used, which has an infinite or large number of features, has necessitated an intervention step to approximate its feature map. A considerable amount of research effort [2, 3, 4, 6] has been invested in producing feature maps of more manageable dimensionality. Using a kernel such as Isolation Kernel, which has an exact, user-controllable finite feature map, eliminates the need for such an intervention step of feature map approximation.
3.1 One known advantage
In addition to the above two (previously unknown) advantages, Isolation Kernel has one known advantage, i.e., it is data dependent [7, 8], as opposed to data independent kernels such as Gaussian and Laplacian kernels. It is solely dependent on data distribution, requiring neither class information nor explicit learning. Isolation Kernel has been shown to be a better kernel than existing kernels in SVM classification [7], and has better accuracy than existing methods such as multiple kernel learning [9] and distance metric learning [10]. Isolation Kernel is also a successful way to kernelise densitybased clustering [8].
These previous works have focused on the improvements on taskspecific performances; but the use of Isolation Kernel has slowed the algorithms’ runtimes [7, 8]. They also have focused on the use of kernel trick, and the feature map of Isolation Kernel was either implicitly stated [7] or not mentioned at all [8].
Here we present the feature map of Isolation Kernel and its characteristics, and the benefits it brings to online kernel learning that would otherwise be impossible: a kernel learning which can deal with an infinite number of support vectors, and run efficiently on large scale datasets, without compromising accuracy.
In summary, the known advantage of data dependency contributes to a trained model’s high accuracy; whereas the two previously unknown advantages contribute to efficiency gain. These will be demonstrated in the empirical evaluations reported in Section 8.
4 Isolation Kernel
We provide the pertinent details of Isolation Kernel in this section. Other details can be found in [7, 8].
Let $D = \{\mathbf{x}_1, \dots, \mathbf{x}_n\}$, $\mathbf{x}_i \in \mathbb{R}^d$, be a dataset sampled from an unknown probability density function $F$. Moreover, let $\mathbb{H}_\psi(D)$ denote the set of all partitionings $H$ that are admissible under the dataset $D$, where each $H$ covers the entire space of $\mathbb{R}^d$; and each of the $\psi$ isolating partitions $\theta \in H$ isolates one data point from the rest of the points in a random subset $\mathcal{D} \subset D$, where $|\mathcal{D}| = \psi$.
Definition 1.
For any two points $\mathbf{x}, \mathbf{y} \in \mathbb{R}^d$, Isolation Kernel of $\mathbf{x}$ and $\mathbf{y}$ wrt $D$ is defined to be the expectation, taken over the probability distribution on all partitionings $H \in \mathbb{H}_\psi(D)$, that both $\mathbf{x}$ and $\mathbf{y}$ fall into the same isolating partition $\theta \in H$:
$$K_\psi(\mathbf{x}, \mathbf{y} \mid D) = \mathbb{E}_{\mathbb{H}_\psi(D)}\big[\mathbb{1}(\mathbf{x}, \mathbf{y} \in \theta \mid \theta \in H)\big],$$
where $\mathbb{1}(\cdot)$ is an indicator function.
In practice, Isolation Kernel is constructed using a finite number of partitionings $H_i$, $i = 1, \dots, t$, where each $H_i$ is created using $\mathcal{D}_i \subset D$:
$$K_\psi(\mathbf{x}, \mathbf{y} \mid D) = \frac{1}{t}\sum_{i=1}^{t} \mathbb{1}(\mathbf{x}, \mathbf{y} \in \theta \mid \theta \in H_i) = \frac{1}{t}\sum_{i=1}^{t}\sum_{\theta \in H_i} \mathbb{1}(\mathbf{x} \in \theta)\,\mathbb{1}(\mathbf{y} \in \theta).$$
$K_\psi(\mathbf{x}, \mathbf{y})$ is a shorthand for $K_\psi(\mathbf{x}, \mathbf{y} \mid D)$ hereafter.
4.1 iForest implementation
Here the aim is to isolate every point in $\mathcal{D}_i$. This is done recursively by randomly selecting an axis-parallel split to subdivide the data into two non-empty subsets until every point is isolated. Each partitioning $H_i$ produces $\psi$ isolating partitions $\theta$; and each partition contains a single point in $\mathcal{D}_i$.
The algorithm [11] produces $t$ such trees, each built independently using a subset $\mathcal{D}_i$, sampled without replacement from $D$, where $|\mathcal{D}_i| = \psi$.
4.2 aNNE Implementation
As an alternative to the trees used in the first implementation of Isolation Kernel [7], a nearest neighbour ensemble (aNNE) has been used instead [8].
Like the tree method, the nearest neighbour method also produces $t$ models $H_i$, each consisting of $\psi$ isolating partitions $\theta$, given a subsample $\mathcal{D}_i$ of $\psi$ points. Rather than representing each isolating partition as a hyper-rectangle, it is represented as a cell in a Voronoi diagram, where the boundary between two points is equidistant from these two points.
$H_i$, being a Voronoi diagram, is built by employing the points in $\mathcal{D}_i$, where each isolating partition or Voronoi cell $\theta$ isolates one data point from the rest of the points in $\mathcal{D}_i$. The point which determines a cell is regarded as the cell centre.
Given a Voronoi diagram constructed from a sample $\mathcal{D}$ of $\psi$ points, the Voronoi cell centred at $\mathbf{z} \in \mathcal{D}$ is:
$$\theta[\mathbf{z}] = \{\mathbf{x} \in \mathbb{R}^d \mid \mathbf{z} = \arg\min_{\mathbf{z}' \in \mathcal{D}} \ell(\mathbf{x}, \mathbf{z}')\},$$
where $\ell$ is a distance function, and we use the Euclidean distance in this paper.
Note that the boundaries of a Voronoi diagram are derived implicitly as equidistant between pairs of points in $\mathcal{D}$; they need not be derived explicitly for our purpose of realising Isolation Kernel.
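The Voronoi-based construction above is simple to sketch. The following is a minimal illustration with our own function names (not the authors' code): each partitioning is just $\psi$ sampled cell centres, membership is decided by nearest-centre lookup, and no cell boundary is ever computed:

```python
import numpy as np

def build_partitionings(D, t, psi, rng):
    """Sample psi points from D, t times; each sample's points act as
    the Voronoi cell centres of one partitioning H_i."""
    n = len(D)
    return [D[rng.choice(n, size=psi, replace=False)] for _ in range(t)]

def cell_index(x, centres):
    """The Voronoi cell containing x is the one with the nearest centre;
    the cell boundaries never need to be derived explicitly."""
    return int(np.argmin(np.linalg.norm(centres - x, axis=1)))

def ik_similarity(x, y, partitionings):
    """K_psi(x, y): fraction of partitionings in which x and y share a cell."""
    same = sum(cell_index(x, C) == cell_index(y, C) for C in partitionings)
    return same / len(partitionings)

rng = np.random.default_rng(42)
D = rng.normal(size=(200, 2))
Hs = build_partitionings(D, t=100, psi=8, rng=rng)
k_near = ik_similarity(np.array([0.0, 0.0]), np.array([0.05, 0.0]), Hs)
k_far = ik_similarity(np.array([0.0, 0.0]), np.array([3.0, 3.0]), Hs)
```

As expected of a similarity, nearby points share a cell in most partitionings while distant points rarely do, so `k_near` exceeds `k_far`.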
4.3 Kernel distributions and contour plots
Figure 1 is extracted from [7]; it shows that the kernel distribution of Isolation Kernel approximates that of Laplacian kernel under a uniform density distribution. A brief description of the proof is provided in the same paper.
Figure 2 shows the contour plots of the aNNE and iForest implementations of Isolation Kernel. Notice that each contour line, which denotes the same similarity to the centre (red point), is elongated along the sparse region and compressed along the dense region. In contrast, Laplacian kernel (or any data independent kernel) has the same symmetric contour lines around the centre point, independent of the data distribution (as shown in Figure 1(a)).
The reasons why the Voronoi-based implementation is better than the tree-based implementation have been provided earlier [8]; this has led to better density-based clustering results than using the Euclidean distance measure.
5 Learning with exact feature map
This section presents our four conceptual contributions. Section 5.1 presents the feature map of Isolation Kernel. Section 5.2 describes the theoretical underpinning of efficient learning with Isolation Kernel. How Isolation Kernel enables the use of $f(\mathbf{x}) = \langle \mathbf{w}, \Phi(\mathbf{x})\rangle$ in solving the primal optimisation problem, and its efficient dot product implementations, are provided in the following two subsections.
5.1 Exact feature map of Isolation Kernel
Viewing each isolating partition $\theta$ as a feature, the component of the feature space due to $H$ can be derived using the mapping $\Phi_H: \mathbf{x} \to \{0,1\}^\psi$; and $\Phi_H$ can be constructed using a partitioning $H$ as follows:
Let $\Phi_H(\mathbf{x})$ be a vector of $\psi$ binary features of $\mathbf{x}$ indicating the only isolating partition in which $\mathbf{x}$ falls, out of the $\psi$ isolating partitions $\theta_j \in H$, where $\Phi_{Hj}(\mathbf{x}) = \mathbb{1}(\mathbf{x} \in \theta_j)$, $j = 1, \dots, \psi$.
The inner summation of the empirical form of Isolation Kernel (Section 4) can then be re-expressed in terms of $\Phi_H$ as follows:
$$\sum_{\theta \in H_i} \mathbb{1}(\mathbf{x} \in \theta)\,\mathbb{1}(\mathbf{y} \in \theta) = \langle \Phi_{H_i}(\mathbf{x}), \Phi_{H_i}(\mathbf{y})\rangle.$$
Because $\langle \Phi_H(\mathbf{x}), \Phi_H(\mathbf{y})\rangle$ is in a quadratic form, it is PSD (positive semi-definite). The sum of PSD functions, $\sum_{i=1}^{t} \langle \Phi_{H_i}(\mathbf{x}), \Phi_{H_i}(\mathbf{y})\rangle$, is also PSD. Therefore, $K_\psi$ is a valid kernel.
An exact, simple representation of Isolation Kernel can be derived by concatenating the $t$ samples of $\Phi_{H_i}$. Let $\Phi(\mathbf{x}) = [\Phi_{H_1}(\mathbf{x}), \dots, \Phi_{H_t}(\mathbf{x})]$ be a vector of $t \times \psi$ binary features. Then, Isolation Kernel represented using these features can be expressed as:
$$K_\psi(\mathbf{x}, \mathbf{y}) = \frac{1}{t}\,\langle \Phi(\mathbf{x}), \Phi(\mathbf{y})\rangle.$$
Definition 2.
Feature map of Isolation Kernel. For a point $\mathbf{x} \in \mathbb{R}^d$, the feature mapping $\Phi: \mathbf{x} \to \{0,1\}^{t \times \psi}$ of $\mathbf{x}$ is a vector that represents the partitions in all the partitionings $H_i \in \mathbb{H}_\psi(D)$, $i = 1, \dots, t$; where $\mathbf{x}$ falls into only one of the $\psi$ partitions in each partitioning $H_i$.
Parameters $\psi$ and $t$ can be controlled by a user. Each setting of $\psi$ and $t$ yields a feature map.
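Definition 2 translates directly into code. The sketch below is our own minimal implementation (using the Voronoi/aNNE partitioning of Section 4.2): it builds $\Phi(\mathbf{x})$ as $t$ concatenated one-hot vectors and recovers the kernel value as a normalised dot product:

```python
import numpy as np

def feature_map(x, partitionings, psi):
    """Phi(x): concatenation of t one-hot vectors of length psi; exactly one
    1 per partitioning, marking the cell into which x falls."""
    t = len(partitionings)
    phi = np.zeros(t * psi)
    for i, centres in enumerate(partitionings):
        j = int(np.argmin(np.linalg.norm(centres - x, axis=1)))  # nearest centre
        phi[i * psi + j] = 1.0
    return phi

rng = np.random.default_rng(0)
D = rng.normal(size=(100, 2))
t, psi = 50, 4
Hs = [D[rng.choice(len(D), size=psi, replace=False)] for _ in range(t)]

x, y = D[0], D[1]
px, py = feature_map(x, Hs, psi), feature_map(y, Hs, psi)
k_xy = float(px @ py) / t   # K_psi(x, y) = (1/t) <Phi(x), Phi(y)>
```

Note that each $\Phi(\mathbf{x})$ has exactly $t$ non-zero entries out of $t \times \psi$, and every point's self-similarity is exactly 1; no approximation is involved at any step.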
Nyström (approximate feature map $\tilde{\Phi}$ of a kernel $\kappa$):
1. Sample $r$ points from $D$ to construct the kernel matrix $K_r$.
2. $\tilde{\Phi}(\mathbf{x}) = D_r^{-1/2} V_r^\top [\kappa(\mathbf{x}, \mathbf{x}_1), \dots, \kappa(\mathbf{x}, \mathbf{x}_r)]^\top$, where $V_r$ and $D_r$ are the eigenvectors and eigenvalues of $K_r$.
3. Perform learning with the feature map on the converted dataset.

Isolation Kernel (exact feature map $\Phi$):
1. Sample $\psi$ points from $D$, $t$ times, to construct partitionings $H_1, \dots, H_t$; each $H_i$ has $\psi$ partitions.
2. Convert $\mathbf{x}$ to $\tilde{\mathbf{x}}$, a vector of $t$ integer attributes, where each attribute has values in $\{1, \dots, \psi\}$ and each integer $\tilde{x}_i$ is an index to the partition $\theta \in H_i$ into which $\mathbf{x}$ falls; the $t$ attributes represent the $t$ partitionings.
3. Convert $\tilde{\mathbf{x}}$ to $\Phi(\mathbf{x})$: $\tilde{\mathbf{x}}$ is parsed over the $t$ partitionings.
4. Perform learning with the feature map on the converted dataset.
5.2 Efficient Learning with Isolation Kernel
This subsection describes the theoretical underpinning of efficient learning with Isolation Kernel.
In a binary class learning problem with a given training set $\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$, where points $\mathbf{x}_i \in \mathbb{R}^d$ and class labels $y_i \in \{-1, +1\}$, the goal of SVM is to learn a kernel prediction function $f$ by solving the following optimisation problem [12]:
$$\min_{f \in \mathcal{H}}\ \frac{\lambda}{2}\|f\|_{\mathcal{H}}^2 + \frac{1}{n}\sum_{i=1}^{n} L\big(f(\mathbf{x}_i), y_i\big),$$
where $f(\mathbf{x}) = \sum_{i=1}^{n} \alpha_i \kappa(\mathbf{x}_i, \mathbf{x})$ is spanned over all points in the training set; $L$ is a convex loss function wrt the prediction of $f$; and $\mathcal{H}$ is the Reproducing Kernel Hilbert Space endowed with the kernel $\kappa$.
The computational cost of this kernel learning is high because the search space over $f$ is large for large $n$.
In contrast, with Isolation Kernel, the span over the $n$ training points is replaced with a span over a smaller set of representative points, because $t \times \psi \ll n$.
In simple terms, the span is over the $t \times \psi$ representative points, rather than over the $n$ training points.
When $\psi \ll n$, which leads to $t \times \psi \ll n$, learning with Isolation Kernel is expected to be faster than learning with commonly used data independent kernels such as Gaussian and Laplacian kernels.
The following subsections provide the implementations—due to the use of Isolation Kernel—which enable the significant efficiency gain without compromising predictive accuracy for online kernel learning.
5.3 Using $\langle \mathbf{w}, \Phi(\mathbf{x})\rangle$ instead of $\sum_{i=1}^{s} \alpha_i y_i \kappa(\mathbf{x}_i, \mathbf{x})$
The prediction function employed follows the respective functional form of the dual or the primal optimisation problem which one is solving.
When existing kernels such as Gaussian and Laplacian kernels are used, because they have an infinite number of features, the dual optimisation problem must be solved and $f(\mathbf{x}) = \sum_{i=1}^{s} \alpha_i y_i \kappa(\mathbf{x}_i, \mathbf{x})$ must be used (unless an approximate feature map is derived).
As Isolation Kernel has a finite feature map, it facilitates the use of the prediction function $f(\mathbf{x}) = \langle \mathbf{w}, \Phi(\mathbf{x})\rangle$; thus solving the primal optimisation problem is a natural choice.
The evaluation of $\langle \mathbf{w}, \Phi(\mathbf{x})\rangle$ is faster than that of $\sum_{i=1}^{s} \alpha_i y_i \kappa(\mathbf{x}_i, \mathbf{x})$ when the number of support vectors $s$ times the number of attributes $d$ of $\mathbf{x}$ is more than the effective number of features of $\Phi(\mathbf{x})$, i.e., $sd > t$ (see the next subsection for the reason why $t$ is the effective number of features of $\Phi(\mathbf{x})$). Its use yields a significant speedup when the domain is high dimensional and/or in an online setting where the number of points can potentially be infinite. The online setting necessitates a kernel learning system which can deal with a potentially infinite number of support vectors. The procedure of such a kernel learning system using Isolation Kernel is described in Section 6.
5.4 Efficient dot product in $\langle \mathbf{w}, \Phi(\mathbf{x})\rangle$
The use of Isolation Kernel facilitates an efficient dot product in $\langle \mathbf{w}, \Phi(\mathbf{x})\rangle$. Recall that, for each partitioning $H_i$, $\Phi_{H_i}(\mathbf{x})$ has exactly one feature having value 1 in a vector of $\psi$ binary features (stated in Section 5.1). Thus, $\langle \mathbf{w}, \Phi(\mathbf{x})\rangle$ can be computed with a summation of $t$ values of $\mathbf{w}$ (rather than the naive dot product computing $t \times \psi$ products):
$$\langle \mathbf{w}, \Phi(\mathbf{x})\rangle = \sum_{i=1}^{t} w_{(i-1)\psi + \tilde{x}_i},$$
where $\tilde{x}_i$ denotes the index of the binary feature of $\Phi_{H_i}(\mathbf{x})$ having value 1; it serves as an index to the corresponding element of $\mathbf{w}$.
In summary, $\langle \mathbf{w}, \Phi(\mathbf{x})\rangle$ can be computed more efficiently using $\tilde{\mathbf{x}}$ as an indexing scheme.
Note that the cost of this efficient dot product is independent of $\psi$. For large $\psi$, this dot product can be orders of magnitude faster than the naive dot product (see Figure 4 in Section 8.1.2 later).
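The indexing scheme can be demonstrated in a few lines (a sketch with our own variable names; cell indices are 0-based here, versus 1-based in the text): the compact form $\tilde{\mathbf{x}}$ selects $t$ entries of $\mathbf{w}$ directly, and must match the naive dot product against the materialised binary vector:

```python
import numpy as np

t, psi = 200, 16
rng = np.random.default_rng(1)
w = rng.normal(size=t * psi)

# x_tilde: t integers; x_tilde[i] is the (0-based) index of the cell of
# partitioning i into which x falls.
x_tilde = rng.integers(0, psi, size=t)

# Naive route: materialise the t*psi binary vector, take a full dot product.
phi = np.zeros(t * psi)
phi[np.arange(t) * psi + x_tilde] = 1.0
naive = float(w @ phi)

# Efficient route: sum t entries of w selected by the indexing scheme,
# skipping the t*psi multiplications entirely.
fast = float(w[np.arange(t) * psi + x_tilde].sum())
```

The naive route touches all $t \times \psi = 3200$ features; the indexed route touches only $t = 200$ entries of $\mathbf{w}$, which is the source of the speedup and is independent of $\psi$.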
The indexing scheme of the feature map of Isolation Kernel is constructed in the two steps shown in Table 2, which convert $\mathbf{x} \to \tilde{\mathbf{x}} \to \Phi(\mathbf{x})$. The steps taken by the Nyström method [3, 1] to construct an approximate feature map are also shown in the same table for comparison.
The computational cost of the mapping from $\mathbf{x}$ to either $\tilde{\mathbf{x}}$ or $\Phi(\mathbf{x})$ is linear in $t$. But this mapping needs to be done only once for each point. That is, every point needs to examine each of the $t$ partitionings only once to determine the partition into which the point falls.
6 Applications to kernel learning using Online Gradient Descent and Support Vector Machines
Online kernel learning aims to build an efficient and scalable kernelbased predictive model incrementally from a sequence of potentially infinite data points. One of the early methods is [13]. One key challenge of online kernel learning is managing a growing number of support vectors, as every misclassified point is typically added to the set of support vectors. A number of ‘budget’ online kernel learning methods have been proposed (see [1] for a review of existing methods) to limit the number of support vectors.
One recent implementation of online kernel learning is called OGD [1], which employs the following update: if $y_t f(\mathbf{x}_t) \le 0$ (incorrect prediction), then add $\mathbf{x}_t$ to the set of support vectors with weight $\alpha_t = \eta y_t$, where $\eta$ is the learning rate.
Without setting a budget, the number of support vectors ($s$) usually increases linearly with the number of points observed. Therefore, testing becomes increasingly slower as more points are observed.
Here we show the benefits Isolation Kernel brings to online kernel learning: its use improves both the time and space complexities of OGD significantly, from $O(sd)$ to $O(t)$ for every prediction, while allowing $s$ to grow without bound, eliminating the need for a budget for support vectors. This is because $t$ is constant while $s$ grows as more points are observed.
This is done on exactly the same OGD implementation. The only change required in the procedure is that the prediction function is evaluated based on the feature map of Isolation Kernel as follows: $f(\mathbf{x}) = \langle \mathbf{w}, \Phi(\mathbf{x})\rangle$,
where $\mathbf{w} = \sum_{i=1}^{s} \alpha_i y_i \Phi(\mathbf{x}_i)$.
During training, $s$ is the number of support vectors at the time an evaluation of the prediction function is required. For every addition of a new support vector during the training process, the weight vector $\mathbf{w}$ is updated incrementally while $s$ increments. At the end of the training process, the final $\mathbf{w}$ is ready to be used with $\Phi(\mathbf{x})$ to evaluate every test point $\mathbf{x}$.
Although the above expressions are in terms of $\Phi(\mathbf{x})$, the computation is conducted more efficiently using $\tilde{\mathbf{x}}$, effectively as an indexing scheme for $\Phi(\mathbf{x})$, as described in Section 5.4, for evaluating $\langle \mathbf{w}, \Phi(\mathbf{x})\rangle$ as well as for updating $\mathbf{w}$.
We name the OGD implementation which employs Isolation Kernel and $\langle \mathbf{w}, \Phi(\mathbf{x})\rangle$ IKOGD. The algorithms of OGD (as implemented by [1]) and IKOGD are shown as Algorithms 1 and 2, respectively.
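The flavour of such an update loop can be sketched as follows. This is a minimal hinge-loss version with our own names, not the authors' Algorithm 2: only the fixed-size weight vector $\mathbf{w}$ is stored, so adding a "support vector" never increases the prediction cost:

```python
import numpy as np

def ik_ogd_train(stream, feature_map, dim, eta=0.1):
    """Online gradient descent in the primal: on a margin error,
    w <- w + eta * y * Phi(x). No budget is needed because only w
    (of fixed size t * psi) is stored, however many errors occur."""
    w = np.zeros(dim)
    for x, y in stream:                  # labels y in {-1, +1}
        phi = feature_map(x)
        if y * float(w @ phi) <= 0:      # misclassified (or on the boundary)
            w += eta * y * phi           # s grows, but w does not
    return w

# Toy demo: a 2-cell 'partitioning' (sign of the coordinate) as feature map.
def fm(x):
    phi = np.zeros(2)
    phi[0 if x[0] < 0 else 1] = 1.0
    return phi

rng = np.random.default_rng(3)
stream = [(x, 1 if x[0] >= 0 else -1) for x in rng.normal(size=(200, 1))]
w = ik_ogd_train(stream, fm, dim=2)
pred = 1 if float(w @ fm(np.array([0.7]))) >= 0 else -1
```

Each update costs $O(t)$ via the indexing scheme of Section 5.4, regardless of how many support vectors have been accumulated.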
To apply Isolation Kernel to support vector machines, we only need to use the algorithm which solves the primal optimisation problem such as LIBLINEAR [14] after converting the data using the feature map of Isolation Kernel.
Note that the above efficiency gain is possible for Isolation Kernel, and not possible for other existing kernels, because it has an exact feature map which has a finite number of features that can be controlled by a user; and others do not.
7 Experimental settings
We design experiments to evaluate the impact of Isolation Kernel on online kernel learning. We use the implementations of the kernelised online gradient descent (OGD) and Nyström online gradient descent (NOGD) (code available at http://lsokl.stevenhoi.org/). The kernelised online gradient descent [13], or OGD, solves the dual optimisation problem; whereas IKOGD solves the primal optimisation problem, as does NOGD [1]. We also compare with a recent online method that employs multi-kernel learning and random Fourier features, called AdaRaker [15].
Laplacian kernel is used as a baseline kernel because Isolation Kernel approximates Laplacian kernel under a uniform density distribution. (As pointed out in [7], Laplacian kernel can be expressed as $e^{-\|\mathbf{x}-\mathbf{y}\|_1/\sigma}$; it has been shown to be competitive with Gaussian kernel in SVM in a recent study [7].) As a result, Isolation Kernel and Laplacian kernel can be expressed using the same 'sharpness' parameter $\sigma$.
Two existing implementations of Isolation Kernel are used: (i) Isolation Forest [11], as described in [7]; and (ii) aNNE, a nearest neighbour ensemble that partitions the data space into a Voronoi diagram, as described in [8]. We refer to the iForest implementation as IKOGD by default; when a distinction is required, we denote the iForest implementation IKOGD$_i$ and the aNNE implementation IKOGD$_a$.
All OGD-related algorithms used the hinge loss function and the same learning rate $\eta$, as used in [1]. The only parameter search required for these algorithms is the kernel parameter. The search range used in the experiments is listed in Table 3. The parameter is selected via 5-fold cross-validation on the training set.
The default settings for NOGD [1] are: the Nyström method uses the eigenvalue decomposition; a budget $B$ is set; and the matrix rank is set relative to $B$. The default parameter $t$ used to create Isolation Kernel is fixed in all experiments.
AdaRaker (https://github.com/yanningshen/AdaRaker) employs sixteen Gaussian kernels; the specified bandwidths for these kernels are listed in Table 3. (The default three kernels in the code gave worse accuracy than that reported in the next section.) In addition, AdaRaker uses 50 orthogonal random features (equivalent to the rank used in the Nyström method) by default. The search range of its remaining parameter through 5-fold cross-validation is given in Table 3.
Kernel/Algorithm  Search range 

Laplacian  
Isolation  
AdaRaker  
Eleven datasets from www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/ are used in the experiments. The properties of these datasets are shown in Table 4. The datasets are selected in order to have diverse data properties: data sizes (20,000 to 2,400,000) and dimensions (22 to more than 3.2 million). Because the OGD and NOGD versions of the implementation we used work on two-class problems only, three multi-class datasets (mnist, smallNORB and cifar10) have been converted to two-class datasets of approximately equal class distribution.
Four experiments are conducted: (a) in the online setting, (b) in the batch setting, (c) examining the runtime on GPU, and (d) an investigation using SVM. The CPU experiments ran on a Linux machine: AMD 16-core CPU with each core running at 1.8 GHz, and 64 GB RAM. The GPU experiments ran on a machine with GPU: 2 x GTX 1080 Ti with 3584 (1.6 GHz) CUDA cores and 12 GB graphics memory; and CPU: i9-7900X 3.30 GHz processor (20 cores), 64 GB RAM.
The results are presented in four subsections of Section 8.
#train  #test  #dimensions  nnz%  

url  30,000  2,366,130  3,231,961  0.0036 
news20.binary  15,997  3,999  1,355,191  0.03 
rcv1.binary  20,242  677,399  47,236  0.16 
realsim  57,848  14,461  20,958  0.24 
smallNORB  24,300  24,300  18,432  100.0 
cifar10  50,000  10,000  3,072  99.8 
epsilon  400,000  100,000  2,000  100.0 
mnist  60,000  10,000  780  19.3 
a9a  32,561  16,281  123  11.3 
covertype  464,810  116,202  54  22.1 
ijcnn1  49,990  91,701  22  59.1 
In the online setting, we simulate an online stream using each of the four largest datasets of over half a million points (after combining their given training and testing sets) as follows. Given a dataset, it is first shuffled. The initial training set has the training set size shown in Table 4; it is used to determine the best parameter based on 5-fold cross-validation before training the first model. The online stream is assumed to arrive sequentially in blocks of 1000 points. Each block is assumed to have no class labels initially: in testing mode, the latest trained model is used to make a prediction for every point in the block. After testing, class labels are made available: the block is then in training mode and the model is updated. (This simulation is more realistic than previous online experiments, which assume that the class label of each point is available immediately after a prediction is made to enable model update [1]. In practice, the algorithm can be made to be in training mode whenever class labels are available, for either part of or the entire block.) The above testing and training modes are repeated for each block in the online stream until the data run out. The test accuracy up to the current block is reported along the data stream.
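The block-wise test-then-train protocol described above can be sketched generically (our own minimal harness, independent of any particular learner):

```python
def prequential_blocks(stream, block_size, predict_fn, train_fn, model):
    """Test-then-train in blocks: each block is first predicted with the
    latest model (labels hidden), then its labels are revealed and used
    to update the model before the next block arrives."""
    correct, seen, block = 0, 0, []
    for x, y in stream:
        block.append((x, y))
        if len(block) == block_size:
            for bx, by in block:                 # testing mode
                correct += int(predict_fn(model, bx) == by)
                seen += 1
            for bx, by in block:                 # training mode
                model = train_fn(model, bx, by)
            block = []
    return model, correct / max(seen, 1)

# Toy learner: predict the majority label seen so far.
predict = lambda m, x: 1 if m >= 0 else -1
train = lambda m, x, y: m + y
stream = [(None, 1)] * 50                        # all-positive stream
model, acc = prequential_blocks(stream, 10, predict, train, 0)
```

Reporting `acc` after each block yields the accuracy-along-the-stream curves used in Section 8.1.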
In the batch setting, we report the result of a single trial of train-and-test for each dataset, which consists of separate training and testing sets. The assessments are in terms of predictive accuracy and the total runtime of training and testing. Since AdaRaker has problems dealing with large datasets, it is used in the batch setting only.
In the online setting, Isolation Kernel and $\mathbf{w}$ for IKOGD are established using the initial training set only. Once established, the kernel and $\Phi$ are fixed for the rest of the data stream. This applies to the points selected for NOGD as well. In the batch setting, the given training set is used for these purposes.
8 Empirical Results
8.1 Results in online setting
Figure 3 shows that, in terms of accuracy, IKOGD has higher accuracy than OGD and NOGD on four datasets; the exceptions are epsilon and url, on which OGD has better accuracy. (Note that the first points in the accuracy plots can swing wildly because they reflect the accuracy of the initial trained model on the first data block.) Notice that, as more points are observed, OGD and IKOGD have more room for accuracy improvement than NOGD, because the former two have no budget while the latter has a limited budget. We will examine the extent to which increasing the budget $B$ and $t$ improves the accuracies of NOGD and IKOGD, respectively, in Section 8.2.
In terms of runtime, IK-OGD runs faster than both OGD and NOGD on the high dimensional datasets (url, rcv1.binary and epsilon); it is only slower than NOGD on the low dimensional covertype dataset. Notice that the gap in runtime between IK-OGD and NOGD stays the same over the period because the time both spend on feature mapping is constant. In contrast, the gap between OGD and IK-OGD increases over time because the time OGD spends on the dual prediction function grows with the number of support vectors. The runtimes of IK-OGD and NOGD are of the same order; but IK-OGD is 2 to 4 orders of magnitude faster than OGD.
NOGD maintains fast execution by limiting the number of support vectors while using an approximate feature map. The use of the Laplacian kernel (or any other kernel) which has an infinite or large number of features necessitates a feature-map approximation method. Despite all these measures for efficiency gain in NOGD, IK-OGD without a budget still ran faster than NOGD with a budget of 100 support vectors on the two high-dimensional datasets! The efficiency gain in NOGD is a trade-off with accuracy: both the feature-map approximation and the limit on the number of support vectors reduce accuracy.
The use of Isolation Kernel provides a cleaner and simpler use of an exact feature map in the online setting than the kernel functional approximation approach (of which NOGD is a good representative). As a result, IK-OGD achieves the efficiency gain without compromising accuracy, because an exact rather than an approximate feature map is used.
The next two subsections provide empirical evidence of the efficiency gains in IK-OGD described in Sections 5.3 and 5.4.
8.1.1 The effect of the primal versus dual prediction function on IK-OGD
To demonstrate the impact of the type of prediction function used in IK-OGD (stated in Section 5.3), we create a version which employs the dual prediction function, named IK-OGD(dual), to compare with IK-OGD, which employs the primal prediction function.
The proportions of time spent on the two prediction functions out of the total runtimes are as follows: IK-OGD took 2.3% and 0.77% on rcv1.binary and epsilon, respectively. In contrast, IK-OGD(dual) took 99.9% and 99.8%, respectively. This shows that the primal prediction function has reduced the time spent on prediction from almost the whole runtime to a tiny fraction of it!
The total runtimes of IK-OGD versus IK-OGD(dual) are 37 seconds versus 280,656 seconds on rcv1.binary; and 103 seconds versus 235,966 seconds on epsilon. In other words, the primal prediction function also reduced the total runtime by 3 to 4 orders of magnitude. The difference in runtimes widens as more points are observed, because the growing number of support vectors affects IK-OGD(dual) only. The number of support vectors at the end of the data stream is 349,009 for rcv1.binary and 349,481 for epsilon.
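The contrast between the two prediction functions can be sketched as follows; this is an illustrative sketch rather than the paper's code, and the kernel and feature map are passed in as placeholders.

```python
import numpy as np

# Illustrative contrast between the two prediction functions. The dual
# form sums over all support vectors, so its cost grows with the model;
# the primal form needs one feature mapping and one dot product only.

def predict_dual(x, support_vectors, alphas, kernel):
    """Dual prediction: cost is linear in the number of support vectors."""
    return sum(a * kernel(sv, x) for sv, a in zip(support_vectors, alphas))

def predict_primal(x, w, feature_map):
    """Primal prediction: cost is independent of the number of support vectors."""
    return float(np.dot(w, feature_map(x)))
```

With a finite, exact feature map (as Isolation Kernel provides), the two forms return the same score, which is why the primal form can be used without loss of accuracy.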
Table 5: Accuracy and runtime in the batch setting. The two IK-OGD columns are the iForest-based and Voronoi-based implementations.

Accuracy
Dataset          OGD    IK-OGD(iForest)  IK-OGD(Voronoi)  NOGD   AdaRaker
url              .97    .96              .96              .67    —
news20.binary    .50    .57              .89              .50    —
rcv1.binary      .48    .73              .96              .48    —
realsim          .73    .83              .96              .69    —
smallNORB        .93    .78              .88              .51    —
cifar10          .69    .72              .73              .50    .54
epsilon          .88    .65              .71              .57    —
mnist            .97    .95              .98              .85    .80
a9a              .84    .84              .84              .84    .79
covertype        .76    .86              .92              .70    .70
ijcnn1           .94    .95              .97              .93    .90

Runtime (CPU seconds)
Dataset          OGD      IK-OGD(iForest)  NOGD   AdaRaker
url              8,393    62               303    ME
news20.binary    915      1                11     ME
rcv1.binary      10,499   22               114    ME
realsim          1,468    2                6      ME
smallNORB        64,183   73               353    one week
cifar10          20,260   15               69     5,661
epsilon          496,065  106              430    one week
mnist            659      4                12     1,453
a9a              95       3                2      308
covertype        20,863   25               10     3,740
ijcnn1           76       8                2      576

The runtimes of IK-OGD (Voronoi) are reported in Section 8.3 and Table 6. ME denotes memory errors.
8.1.2 The effect of the efficient dot product on IK-OGD
Here we show the effect of the efficient dot product described in Section 5.4. The implementation which computes the summation of products over the full feature vector is named IK-OGD(naive). It is compared with IK-OGD with the efficient implementation. As the impact on runtime varies with ψ, the experiment is conducted with increasing ψ.
Figure 4 shows that the runtime difference between IK-OGD and IK-OGD(naive) widens as ψ increases; IK-OGD(naive) was close to two orders of magnitude slower than IK-OGD at the largest ψ on both datasets. Note that the efficient dot product in IK-OGD is independent of ψ. IK-OGD's runtime depends on ψ only in the process of mapping a point to its feature vector (recall the mapping stated in Table 2).
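The efficient dot product can be sketched as follows, under the assumption (from the feature map's construction) that each feature vector is a concatenation of t one-hot blocks of length ψ, so only the t non-zero positions need be stored; function names are illustrative.

```python
import numpy as np

# Each Isolation Kernel feature vector is a concatenation of t one-hot
# vectors of length psi, so it can be stored as the t indices of its
# non-zero entries. A dot product with a dense weight vector w then
# touches only t entries, independent of psi.

def sparse_dot(w, indices, psi):
    """Dot product of dense w with the one-hot-concatenated feature vector: O(t)."""
    offsets = np.arange(len(indices)) * psi  # start of each one-hot block
    return float(w[offsets + np.asarray(indices)].sum())

def naive_dot(w, indices, psi):
    """Same result via the explicit (mostly zero) feature vector: O(t * psi)."""
    phi = np.zeros(len(indices) * psi)
    for j, idx in enumerate(indices):
        phi[j * psi + idx] = 1.0
    return float(w @ phi)
```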
8.2 Results in batch setting
Observations from the results shown in Table 5 are:
In terms of predictive accuracy:

IK-OGD (iForest) performs better than OGD on six datasets; it has equal or approximately equal accuracy on url, mnist and a9a. This outcome is purely due to the kernel employed: Isolation Kernel approximates the Laplacian kernel under a uniform density distribution, and it adapts to the density structure of the given dataset [7]. This relative result between Isolation Kernel and the Laplacian kernel in OGD is consistent with the previous relative result with SVM [7]. The only two datasets on which IK-OGD performs significantly worse than OGD are smallNORB and epsilon. We will see in Section 8.2.2 that the gap can be significantly reduced by increasing ψ, without a significant runtime increase.

NOGD has lower accuracy than OGD on ten out of eleven datasets because it employs an approximate feature map of the Laplacian kernel. As a consequence, NOGD can be significantly worse than OGD; examples are url, smallNORB, cifar10, epsilon and mnist. While increasing its budget may improve NOGD's accuracy towards the level of OGD, it will still perform worse than IK-OGD. Indeed, NOGD performed worse than IK-OGD on ten out of eleven datasets in Table 5.

IK-OGD (Voronoi) has equal or better accuracy than IK-OGD (iForest). This result is consistent with the assessment comparing the two implementations of Isolation Kernel in density-based clustering [8]. This is because a Voronoi diagram produces partitions of non-axis-parallel regions, whereas iForest yields axis-parallel partitions only. Notice that the accuracy difference between IK-OGD (Voronoi) and OGD is huge on news20, rcv1, realsim and covertype.
In terms of runtime:

While OGD and IK-OGD use exactly the same training procedure (with the exception of the prediction function used), IK-OGD has the advantage in two aspects:

The differences in runtime are huge: IK-OGD is three orders of magnitude faster than OGD on six of the eleven datasets, and at least one order of magnitude faster on the others. This is due to the efficient implementations made possible by Isolation Kernel, described in Section 5.

Both OGD and IK-OGD can potentially incorporate an infinite number of support vectors. But the dual prediction function has denied OGD the opportunity to live up to its full potential, because its testing time complexity is proportional to the number of support vectors. In contrast, IK-OGD has constant testing time complexity, independent of the number of support vectors.


Compared with NOGD, IK-OGD is up to one order of magnitude faster on high dimensional datasets. On low dimensional datasets (100 dimensions or fewer), IK-OGD ran only slightly slower. This is remarkable given that IK-OGD has no budget while NOGD has a budget of only 100 support vectors. As a result, NOGD has lower accuracy than IK-OGD on all datasets except a9a.
In a nutshell, IK-OGD inherits the advantages of OGD (no budget) and NOGD (a finite feature map with the primal prediction function); yet it has neither of their disadvantages: OGD's costly dual prediction function, and NOGD's need for a budget, which lowers its predictive accuracy.
8.2.1 Comparison with AdaRaker
Table 5 also shows that the multi-kernel learning method AdaRaker [15] has lower accuracy than OGD (and even NOGD) using a single kernel. This result is consistent with the comparison between SimpleMKL [9] and SVM using Isolation Kernel conducted previously [7]. Out of the five datasets on which it could run within reasonable time and without memory errors, AdaRaker ran slower than OGD on three datasets but faster on two. Compared with IK-OGD and NOGD, AdaRaker is at least two orders of magnitude slower on the five datasets.
AdaRaker has memory error issues with high dimensional datasets.
8.2.2 The effects of ψ on IK-OGD and the budget on NOGD
Two datasets, epsilon and smallNORB, are used in this experiment because the accuracy differences between OGD and NOGD on these datasets are the largest, and they are the only two datasets on which IK-OGD performed significantly worse than OGD. We examine the effects of ψ on IK-OGD and of the budget on NOGD.
Figure 5 shows that IK-OGD's accuracy improves significantly as ψ increases. Note that, with the largest ψ examined on epsilon, the accuracy of IK-OGD reached the level of OGD shown in Table 5; yet IK-OGD still ran two orders of magnitude faster than OGD. In contrast, although NOGD's accuracy improved when its budget was increased from 100 to 10000, it still performed worse than OGD and IK-OGD by a large margin of 10%. In addition, NOGD with the larger budget ran two orders of magnitude slower than NOGD with the smaller one. On smallNORB, IK-OGD also improves its accuracy as ψ increases, up to a point; but NOGD showed little improvement over the entire range of budgets examined.
NOGD's runtime increases linearly w.r.t. its budget, whereas the runtime of IK-OGD increases sublinearly w.r.t. ψ.
8.3 CPU and GPU versions of IK-OGD (Voronoi)
The use of a Voronoi diagram to partition the data space for Isolation Kernel slows the runtime significantly, compared with the iForest implementation, mainly because of the need to search for nearest neighbours. However, because the nearest-neighbour search is amenable to GPU acceleration, we compare the runtimes of the CPU and GPU versions of IK-OGD (Voronoi).
The result is shown in Table 6. The GPU version of IK-OGD (Voronoi) is up to four orders of magnitude faster than the CPU version. Despite this GPU speedup, IK-OGD (Voronoi) is still up to one order of magnitude slower than IK-OGD (iForest) run on CPU on some datasets.
Table 6: Runtimes (seconds) of the CPU and GPU versions of IK-OGD (Voronoi).

Dataset          CPU        GPU
url              1,527      65
news20.binary    1,079      10
rcv1.binary      100,247    67
realsim          31,946     10
smallNORB        406,256    178
cifar10          340,047    147
epsilon          1,029,092  458
mnist            56,774     45
a9a              5,589      3
covertype        100,081    42
ijcnn1           11,999     3
In summary, the GPU is a good means to speed up IK-OGD (Voronoi). When accuracy is paramount, IK-OGD (Voronoi) is always a better choice than IK-OGD (iForest) (as shown in Table 5), though the former, even with a GPU, runs slower than the latter on a CPU.
Note that while it is possible to use a GPU to speed up the original OGD, which employs the dual prediction function, this is not a good solution for two reasons. First, it does not improve OGD's accuracy if the same data independent kernel is used. Second, the GPU-accelerated OGD is expected to still run slower than a CPU version of OGD which employs the primal prediction function with the same kernel.
The runtime reported in Table 6 consists of two components: feature mapping time and OGD runtime. For example, the longest GPU runtime is on epsilon, which consists of 457 GPU seconds of feature mapping and 0.9 CPU seconds of OGD runtime. In other words, the bulk of the runtime is spent on feature mapping; OGD itself took only a fraction of a second on the CPU.
8.4 SVM versus IK-SVM
Isolation Kernel is, as far as we know, the only nonlinear kernel that allows the trick of training directly on the feature-mapped data to be applied to kernel-based methods, including SVM, to speed up the runtime of both training and testing. We apply Isolation Kernel to SVM to produce IK-SVM. It is realised using LIBLINEAR, since IK-SVM is equivalent to applying a linear SVM to the IK feature-mapped data. IK-SVM is compared with LIBSVM using the Laplacian kernel (denoted SVM).
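Since IK-SVM is just a linear SVM over the mapped features, the reduction can be sketched as below. The subgradient trainer is a bare-bones stand-in for LIBLINEAR (not the solver used here), and `X_mapped` is assumed to be the IK feature-mapped data.

```python
import numpy as np

# IK-SVM reduces to a linear SVM on feature-mapped data. The trainer
# below is a minimal hinge-loss subgradient sketch standing in for
# LIBLINEAR; X_mapped is assumed to hold the IK feature vectors.

def train_linear_svm(X_mapped, y, lam=0.01, epochs=20, lr=0.1):
    """Hinge-loss subgradient descent on already feature-mapped data."""
    w = np.zeros(X_mapped.shape[1])
    for _ in range(epochs):
        for x, t in zip(X_mapped, y):          # t in {-1, +1}
            if t * (w @ x) < 1:                # margin violated: step towards t*x
                w = (1 - lr * lam) * w + lr * t * x
            else:                              # only the regulariser shrinks w
                w = (1 - lr * lam) * w
    return w

def predict(w, x_mapped):
    return 1 if w @ x_mapped >= 0 else -1
```

Because the feature map is exact, this linear training is using the Isolation Kernel exactly, not an approximation of it.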
Table 7: Accuracy and runtime of SVM versus IK-SVM.

Dataset          Accuracy           Runtime
                 SVM     IK-SVM     SVM     IK-SVM
url              .96     .96        131     7.5
news20.binary    .50     .92        684     1.5
rcv1.binary      .54     .96        7,472   .6
realsim          .75     .96        1,116   1.2
smallNORB        —       .88        ME      .9
cifar10          .51     .71        3,703   1.2
epsilon          —       .70        ME      90.0
mnist            .98     .99        919     1.0
a9a              .85     .84        69      .5
covertype        —       .93        ME      63.1
ijcnn1           .99     .98        59      2.4
ME denotes memory errors.
Table 7 shows the comparison between SVM and IK-SVM. The relative result is reminiscent of that comparing OGD with IK-OGD in Table 5, i.e., IK-SVM has better accuracy than SVM on high dimensional datasets (news20, rcv1, realsim and cifar10); and they have comparable accuracy on datasets with fewer than 2000 dimensions (mnist, a9a and ijcnn1). In terms of runtime, IK-SVM is up to four orders of magnitude faster.
The memory errors of SVM on datasets with large training sets are a known limitation of SVM using existing nonlinear kernels. Our result in Table 7 shows that Isolation Kernel enables SVM to deal with large datasets that would otherwise be impossible.
Note that the runtime reported in Table 7 does not include the feature mapping time. With a GPU, adding the GPU runtime reported in Table 6 (the bulk of which is feature mapping time) to that of IK-SVM does not change the conclusion: IK-SVM runs order(s) of magnitude faster than SVM and has better accuracy on high dimensional and large scale datasets.
9 Relation to existing approaches for efficient kernel methods
9.1 Kernel functional approximation
Kernel functional approximation is a popular and effective approach to produce a user-controllable, finite, approximate feature map of a kernel having an infinite number of features.
One representative is the Nyström method [16, 3, 17]. It first samples a subset of points from the given dataset, constructs a low-rank approximation of the kernel matrix over the sample, and derives a finite vector representation of the data in which each feature is a normalised eigenfunction of the sampled kernel matrix; see [16] for details. When the rank is much smaller than the data size, it reduces the search space significantly.
The key overhead is the eigenvalue decomposition of the low-rank matrix. This overhead is small only when both the sample size and the rank are small relative to the data size and dimensionality; it becomes impracticably large for problems which require a large sample size and rank.
Also, though the Nyström method depends on the data when deriving an approximate feature map of a chosen nonlinear kernel, the kernel it approximates is still data independent (e.g., the Gaussian and Laplacian kernels).
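A bare-bones sketch of the Nyström construction described above (the kernel choice and helper names are illustrative, not from the cited implementations):

```python
import numpy as np

# Bare-bones Nystrom sketch: sample m landmark points, eigendecompose
# the m x m kernel matrix over them, and map any point to an
# r-dimensional feature vector whose dot products approximate the kernel.

def rbf(a, b, gamma=1.0):
    return np.exp(-gamma * ((a - b) ** 2).sum())

def nystrom_map(landmarks, r, gamma=1.0):
    K = np.array([[rbf(a, b, gamma) for b in landmarks] for a in landmarks])
    vals, vecs = np.linalg.eigh(K)             # eigenvalues in ascending order
    vals, vecs = vals[-r:], vecs[:, -r:]       # keep the top-r components
    def phi(x):
        # Feature i: normalised eigenfunction evaluated at x.
        kx = np.array([rbf(x, l, gamma) for l in landmarks])
        return (vecs.T @ kx) / np.sqrt(np.maximum(vals, 1e-12))
    return phi
```

For points inside the landmark set the approximation is exact; elsewhere the quality depends on how well the landmarks cover the data, which is the accuracy trade-off discussed above.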
In any case, the efficiency gain of the kernel functional approximation approach comes at the cost of reduced accuracy, as it approximates the chosen nonlinear kernel function.
In contrast, Isolation Kernel has an exact feature map. As a result, the efficiency gain from the use of Isolation Kernel does not degrade accuracy. It is a direct method which does not need an intervention step to approximate a feature map from a kernel having infinite or large number of features.
9.2 Sparse kernel approximation
To represent nonlinearity, the feature map of a kernel has dimensionality which is usually significantly larger than the dimension of the given dataset. The Nyström method reduces the dimensionality to produce a dense representation.
In contrast, sparse kernel approximation aims to produce high-dimensional sparse features. (A sparse representation yields vectors having many zero values, where a zero-valued feature means that the feature is irrelevant.) One proposal [18] approximates the feature vector of each point using a small subset of representative points, e.g., the point's neighbours (rather than all representative points). It then uses product quantization (PQ), an improvement over vector quantization that reduces the storage and retrieval time of approximate nearest-neighbour search, to encode the sparse features, and employs bundle methods to learn directly from the PQ codes.
Interestingly, each feature vector of Isolation Kernel is both a sparse representation and a coding which employs exactly one representative point per partitioning: exactly one of the ψ points in each random subset is used for the sparse representation, concatenated over all subsets.
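This structure can be sketched as follows, using an illustrative Voronoi-based construction (the helper names are not from the paper):

```python
import numpy as np

# Minimal sketch of Isolation Kernel's sparse feature map: for each of t
# random subsets of psi points, record which of the psi points is nearest
# to x. The feature vector is t concatenated one-hot blocks, so it can
# equally be stored as a t-length code of indices, a PQ-like code
# obtained without any learning step.

rng = np.random.default_rng(0)

def build_partitionings(data, t, psi):
    """Draw t random subsets of psi points each."""
    return [data[rng.choice(len(data), size=psi, replace=False)]
            for _ in range(t)]

def ik_code(x, partitionings):
    """The t nearest-sample indices: the sparse code of x."""
    return [int(np.argmin(((S - x) ** 2).sum(axis=1))) for S in partitionings]

def ik_feature_vector(x, partitionings):
    """Explicit one-hot-concatenated feature vector of dimension t * psi."""
    psi = len(partitionings[0])
    phi = np.zeros(len(partitionings) * psi)
    for j, idx in enumerate(ik_code(x, partitionings)):
        phi[j * psi + idx] = 1.0
    return phi
```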
There are other sparse representations; e.g., Local Deep Kernel Learning [19] learns a tree-based feature embedding which is high dimensional and sparse, through a generalised version of Localized Multiple Kernel Learning over multiple data independent kernels.
The key difference between Isolation Kernel and current sparse kernel approximations is that the former is a data dependent kernel [7, 8] which has an exact feature map. Sparse kernel approximation may be viewed as another intervention step (an alternative to kernel functional approximation) to produce a finite sparse approximate feature map from one or more data independent kernels having an infinite number of features. In addition, the computationally expensive learning [19] or PQ [18] is not required by Isolation Kernel.
10 Discussion
It is important to note that Isolation Kernel is not one kernel function such as the Gaussian kernel, but a class of kernels whose distributions differ depending on the space partitioning mechanism employed. We use two implementations of Isolation Kernel: (a) iForest [11], whose kernel distribution is similar to that of the Laplacian kernel under a uniform density distribution [7]; and (b) a Voronoi diagram used to partition the space [8], which gives Isolation Kernel a distribution more akin to an exponential kernel under a uniform density distribution. Both realisations of Isolation Kernel adapt to the local density of a given dataset, unlike existing data independent kernels. The criterion a partitioning mechanism must satisfy to produce an effective Isolation Kernel is described in [7, 8]. This paper has focused on efficient implementations of Isolation Kernel in online kernel learning, without compromising accuracy.
When using a linear kernel, the trick of computing with the feature map directly instead of the kernel function, to speed up both the training and testing stages, has been applied previously, e.g., in LIBLINEAR [14], even when solving the dual optimisation problem. This is possible in LIBLINEAR because the linear kernel has an exact and finite feature map. But if one uses an existing nonlinear kernel such as the Gaussian or Laplacian kernel, the trick cannot be applied to SVM because the feature map is not finite.
Note that the works reported in [1], including OGD and NOGD, and the IK-OGD used here do not address the concept change issue in the online setting. Nevertheless, all of these works address the efficiency issue in the online setting, which serves as the foundation for tackling the efficacy issue of large scale online kernel learning under concept change.
Geurts et al. [20] describe a kernel view of ExtraTrees (a variant of Random Forest [21]) whose feature map is also sparse and similar to the one presented here. However, like the Random Forest (RF) kernel [22], this kernel was offered as a viewpoint to explain the behaviour of Random Forest; no evaluation has been conducted to assess its efficacy in a kernel-based method. Ting et al. [7] have described the conceptual differences between RF-like kernels and Isolation Kernel; their empirical evaluation revealed that RF-like kernels are inferior to Isolation Kernel when used in SVM.
11 Concluding remarks
Identifying the root cause is of prime importance in solving any problem. Without knowing the root cause, attempts to solve the problem at best mitigate the real issue, merely masking the symptoms. This is the case with the two challenges of current online kernel learning: (i) using kernel functional approximation to convert the infinite feature map of a chosen kernel into a finite approximate feature map; and (ii) using various methods to limit the number of support vectors. These methods have achieved what they set out to do, trading accuracy for efficiency gains. Yet they still do not enable online kernel learning to live up to its full potential.
Once the root cause of the perennial problem of online kernel learning has been identified, namely that the high computational cost on high dimensional and large datasets is due to the type of kernel used, the solution is extremely simple: a kernel which addresses the root cause shall have a feature map that is sparse and finite, with a number of features controllable by the user.
When Isolation Kernel is used, the two challenges of current online kernel learning become a non-issue. They are challenges only if the chosen kernel has an infinite feature map. In other words, the challenges are only symptoms of the real issue; current approaches treat the symptoms without addressing that issue.
The outcome of this work is unprecedented: it enables online kernel learning to achieve what current approaches cannot, i.e., to live up to its full potential and deal with a potentially infinite number of support vectors in an online setting with an infinite number of data points. This outcome derives from identifying the root cause of the problem and addressing it directly.
This outcome is a result of bringing four key elements together: (i) Isolation Kernel’s exact and finite feature map; (ii) solving the primal optimisation problem in kernel learning with feature mapped data; (iii) efficient dot product; and (iv) GPU acceleration. Whilst the individual elements are uncomplicated and not even new (except the first one), together they have achieved the outcome that has evaded many methods thus far, particularly in terms of predictive accuracy and runtime. The first element is the crucial core that jells with other elements to achieve this outcome.
References
 [1] J. Lu, S. C. H. Hoi, J. Wang, P. Zhao, and Z.Y. Liu, “Large scale online kernel learning,” Journal of Machine Learning Research, vol. 17, no. 1, pp. 1613–1655, 2016.
 [2] Y.W. Chang, C.J. Hsieh, K.W. Chang, M. Ringgaard, and C.J. Lin, “Training and testing lowdegree polynomial data mappings via linear svm,” Journal of Machine Learning Research, vol. 11, pp. 1471–1490, 2010.
 [3] C. K. I. Williams and M. Seeger, “Using the Nyström method to speed up kernel machines,” in Advances in Neural Information Processing Systems 13, T. K. Leen, T. G. Dietterich, and V. Tresp, Eds. MIT Press, 2001, pp. 682–688.
 [4] A. Rahimi and B. Recht, “Random features for largescale kernel machines,” in Advances in Neural Information Processing Systems, ser. NIPS’07, 2007, pp. 1177–1184.
 [5] X. Y. Felix, A. T. Suresh, K. M. Choromanski, D. N. HoltmannRice, and S. Kumar, “Orthogonal random features,” in Advances in Neural Information Processing Systems, ser. NIPS’16, 2016, pp. 1975–1983.
 [6] J. Yang, V. Sindhwani, Q. Fan, H. Avron, and M. Mahoney, “Random Laplace feature maps for semigroup kernels on histograms,” in 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 971–978.
 [7] K. M. Ting, Y. Zhu, and Z.H. Zhou, “Isolation kernel and its effect on SVM,” in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2018, pp. 2329–2337.
 [8] X. Qin, K. M. Ting, Y. Zhu, and V. C. S. Lee, “Nearestneighbourinduced isolation similarity and its impact on densitybased clustering,” in Proceedings of The ThirtyThird AAAI Conference on Artificial Intelligence, 2019.
 [9] A. Rakotomamonjy, F. R. Bach, S. Canu, and Y. Grandvalet, “SimpleMKL,” Journal of Machine Learning Research, vol. 9, no. Nov, pp. 2491–2521, 2008.
 [10] P. Zadeh, R. Hosseini, and S. Sra, “Geometric mean metric learning,” in International Conference on Machine Learning, 2016, pp. 2464–2471.
 [11] F. T. Liu, K. M. Ting, and Z.H. Zhou, “Isolation forest,” in Proceedings of the IEEE International Conference on Data Mining, 2008, pp. 413–422.
 [12] B. Scholkopf and A. J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2001.
 [13] J. Kivinen, A. J. Smola, and R. C. Williamson, “Online learning with kernels,” in Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic, 2001, pp. 785–792.
 [14] R.E. Fan, K.W. Chang, C.J. Hsieh, X.R. Wang, and C.J. Lin, “LIBLINEAR: A library for large linear classification,” Journal of Machine Learning Research, vol. 9, pp. 1871–1874, 2008.
 [15] Y. Shen, T. Chen, and G. Giannakis, “Online ensemble multikernel learning adaptive to nonstationary and adversarial environments,” in Proceedings of the TwentyFirst International Conference on Artificial Intelligence and Statistics, 2018, pp. 2037–2046.
 [16] T. Yang, Y.F. Li, M. Mahdavi, R. Jin, and Z.H. Zhou, “Nyström method vs random fourier features: A theoretical and empirical comparison,” in Advances in Neural Information Processing Systems, ser. NIPS’12, 2012, pp. 476–484.
 [17] J. Wu, L. Ding, and S. Liao, “Predictive nyström method for kernel methods,” Neurocomputing, vol. 234, pp. 116–125, 2017.
 [18] A. Vedaldi and A. Zisserman, “Sparse kernel approximations for efficient classification and detection,” in 2012 IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 2320–2327.
 [19] C. Jose, P. Goyal, P. Aggrwal, and M. Varma, “Local deep kernel learning for efficient nonlinear svm prediction,” in Proceedings of the 30th International Conference on Machine Learning, 2013, pp. III–486–III–494.
 [20] P. Geurts, D. Ernst, and L. Wehenkel, “Extremely randomized trees,” Machine learning, vol. 63, no. 1, pp. 3–42, 2006.
 [21] L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
 [22] ——, “Some infinity theory for predictor ensembles,” Technical Report 577. Statistics Dept. UCB., 2000.