Scalable Alignment Kernels via Space-Efficient Feature Maps
Abstract
String kernels are attractive tools for analyzing string data. Among them, alignment kernels are known for their high prediction accuracies in string classification when used in combination with SVMs in various applications. However, alignment kernels have a crucial drawback in that they scale poorly due to their quadratic computational complexity in the number of input strings, which limits large-scale applications in practice. We present the first approximation for alignment kernels, named ESP+SFM, which leverages a metric embedding named edit-sensitive parsing (ESP) and space-efficient feature maps (SFM) for random Fourier features (RFF) for large-scale string analyses. Input strings are projected into vectors of RFF leveraging ESP and SFM. Then, SVMs are trained on the projected vectors, which significantly improves the scalability of alignment kernels while preserving their prediction accuracies. We experimentally test ESP+SFM on its ability to learn SVMs for large-scale string classification with various massive string data, and we demonstrate the superior performance of ESP+SFM with respect to prediction accuracy, scalability and computational efficiency.
1 Introduction
Massive string data are now ubiquitous throughout research and industry, in areas such as biology, chemistry, natural language processing and data science. For example, e-commerce companies face a serious problem in analyzing huge datasets of user reviews, question answers and purchasing histories [He16, McAuley15]. In biology, homology detection from huge collections of protein and DNA sequences is an important part of their functional analyses [Saigo04]. There is therefore a strong need to develop powerful methods to make best use of massive string data on a large scale.
Kernel methods [Hofmann08] are attractive data analysis tools because they can approximate any (possibly non-linear) function or decision boundary well with enough training data. In kernel methods, a kernel matrix, a.k.a. Gram matrix, is computed from training data and non-linear support vector machines (SVMs) are trained on the matrix. Although it is known that kernel methods achieve high prediction accuracy for various tasks such as classification and regression, they scale poorly due to their quadratic complexity in the number of training data [Joachims06, Ferris03]. In addition, computing a single classification requires, in the worst case, time linear in the number of training data, which limits large-scale applications of kernel methods in practice.
String kernels [Gartner03] are kernel methods for strings, and a variety of string kernels have been proposed using different string similarity measures [Leslie02, Saigo04, Cuturi11, Lodhi02]. Alignment kernels are the state-of-the-art string kernels, and they are known for high prediction accuracies in string classification tasks such as remote homology detection for protein sequences [Saigo04] and time series classification [Zhou10, Cuturi11], when used in combination with SVMs. However, alignment kernels have a crucial drawback in that they scale poorly due to their quadratic computational complexity in the number of training data, as in other kernel methods.
To solve the scalability issues in kernel methods, kernel approximations using feature maps (FM) have been proposed. FM projects training data into low-dimensional vectors such that the kernel value (similarity) between each pair of training data is approximately equal to the inner product of the corresponding pair of low-dimensional vectors. Then, linear SVMs are trained on the projected vectors, which significantly improves the scalability of non-linear SVMs while preserving their prediction accuracies. A variety of kernel approximations using FM have been proposed for enhancing the scalability of kernel methods (e.g., Jaccard kernels [Li11], polynomial kernels [Pham13] and Min-Max kernels [Li16]), and Random Fourier Features (RFF) [Rahimi07] is an approximation of shift-invariant kernels (e.g., Laplacian and radial basis function (RBF) kernels). However, there are no previous works on the approximation of alignment kernels. Thus, an important open challenge, which is required for large-scale analyses of string data, is to develop a kernel approximation for alignment kernels.
Method  Approach  Training time  Training space  Prediction time  Online learning  
LAK [Saigo04]  Local alignment  Unsupported  
GAK [Cuturi07, Cuturi11]  Global alignment  Unsupported  
ESP+SFM (this study)  ESP  Supported  
CGK+SFM (this study)  CGK  Supported  

Several metric embeddings for string distance measures have been proposed for large-scale string processing [Cormode07, Chakraborty16]. Edit-sensitive parsing (ESP) [Cormode07] is a metric embedding of a string distance measure called edit distance with moves (EDM), which consists of the typical edit operations of insertion, deletion and replacement in addition to a substring move operation. ESP maps all the strings from the EDM space into integer vectors named characteristic vectors in the $L_1$ distance space. Thus, EDM between each pair of input strings is approximately preserved by the $L_1$ distance between the corresponding pair of characteristic vectors. To date, ESP has been applied only to string processing such as string compression [Maruyama13], indexing [Takabatake14] and edit-distance computation [Cormode07]; however, as we will see, there remains high potential for application to an approximation of alignment kernels. ESP is expected to be effective for approximating alignment kernels, because it approximates EDM between strings as the $L_1$ distance between integer vectors.
Contribution. In this paper, we present the first approximation for alignment kernels to solve large-scale learning problems on string data. Key ideas behind our method are (i) to project input strings into characteristic vectors leveraging ESP, (ii) to map characteristic vectors into low-dimensional vectors by FM of RFF, and (iii) to train linear SVMs on the mapped vectors. However, applying FM for RFF to high-dimensional vectors in a direct way requires memory linearly proportional to $dD$, where $d$ is the dimensionality of input vectors and $D$ is the target dimension. Both $d$ and $D$ need to be large for our applications (e.g., tens of millions for $d$ and thousands for $D$), which limits the applicability of FM on a large scale. To solve the problem, we present space-efficient FM (SFM) that requires only $O(d)$ memory. Our method called ESP+SFM has the following desirable properties:

Scalability: ESP+SFM is applicable to massive string data.

Fast training: ESP+SFM trains SVMs fast.

Space efficiency: ESP+SFM trains SVMs space-efficiently.

Prediction accuracy: ESP+SFM can achieve high prediction accuracy.

Online learning: ESP+SFM can train SVMs with alignment kernels in an online manner.
We experimentally test the ability of ESP+SFM to train SVMs with various massive string data, and demonstrate that ESP+SFM has superior performance in terms of prediction accuracy, scalability and computational efficiency.
2 Literature Review
Several alignment kernels have been proposed for analyzing string data. We briefly review the state of the art, which is also summarized in Table 1. Early methods were proposed in [Bahlmann02, Shimodaira02, Zhou10] and are known not to satisfy positive definiteness for their kernel matrices. Thus, they were proposed together with numerical corrections for any deficiency of the kernel matrices.
Global alignment kernel (GAK) [Cuturi07, Cuturi11] is an alignment kernel based on global alignments using Dynamic Time Warping (DTW) distance as a distance measure. GAK defines a kernel as the summation score of all possible global alignments between two time series. The computation time of GAK is $O(N^2L^2)$ for the number of strings $N$ and the length of strings $L$. The space usage is $O(N^2)$.
Saigo et al. [Saigo04] proposed the local alignment kernel (LAK) based on the notion of the Smith-Waterman algorithm [Smith81] for protein remote homology detection. LAK measures the similarity between each pair of strings by summing up scores obtained from local alignments with gaps of strings. The computation time of LAK is $O(N^2L^2)$ for the number of strings $N$ and the length of strings $L$. The space usage is $O(N^2)$. Although LAK achieves high classification accuracies for protein sequences in combination with SVMs, LAK is applicable only to protein strings because its scoring function is optimized for proteins.
Despite the importance of scalable learning with alignment kernels, no previous work has been able to achieve high scalability and enable online learning with alignment kernels while preserving high prediction accuracies. We present the first scalable learning method with alignment kernels that meets these demands, made possible by leveraging the ideas behind ESP and FM.
CGK [Chakraborty16] is another metric embedding for edit distance. It maps input strings over an alphabet $\Sigma$ and of maximum length $L$ into strings of fixed length such that the edit distance between each pair of input strings is approximately preserved by the Hamming distance between the corresponding pair of mapped strings. Recently, CGK has been applied to the problem of edit similarity joins [Zhang17]. We also present a kernel approximation of alignment kernels called CGK+SFM by leveraging the ideas behind CGK and SFM.
Details of the proposed method are presented in the next section.
3 Edit-sensitive parsing
Edit-sensitive parsing (ESP) [Cormode07] is an approximation method for efficiently computing edit distance with moves (EDM). EDM is a string-to-string distance measure that counts the string operations needed to turn one string into the other, where substring move is included as a string operation in addition to typical string operations such as insertion, deletion and replacement. Let $S$ be a string of length $L$ and $S[i]$ be the $i$-th character in $S$. Formally, EDM between two strings $S$ and $S'$, denoted $EDM(S, S')$, is defined as the minimum number of edit operations defined below to transform $S$ into $S'$:

Insertion: character $a$ is inserted at position $i$ in $S$, resulting in $S[1] \ldots S[i-1]\, a\, S[i] \ldots S[L]$,

Deletion: the character at position $i$ in $S$ is deleted, resulting in $S[1] \ldots S[i-1]\, S[i+1] \ldots S[L]$,

Replacement: the character at position $i$ in $S$ is replaced by $a$, resulting in $S[1] \ldots S[i-1]\, a\, S[i+1] \ldots S[L]$,

Substring move: a substring $S[i] \ldots S[j]$ in $S$ is moved and inserted at position $p$, resulting (for $p > j$) in $S[1] \ldots S[i-1]\, S[j+1] \ldots S[p-1]\, S[i] \ldots S[j]\, S[p] \ldots S[L]$.
Finding EDM between two strings is known to be an NP-complete problem [Shapira02]. ESP can approximately compute EDM by embedding strings into a vector space by a parsing technique.
Given string $S$, ESP builds a parse tree named ESP tree, denoted $T(S)$, which is illustrated in Figure 1 as an example. An ESP tree is a balanced tree and each node in an ESP tree belongs to one of three types: (i) node with three children, (ii) node with two children and (iii) node without children (i.e., leaf). In addition, internal nodes in an ESP tree have the same node label if and only if their children satisfy both of two conditions: (i) the numbers of those children are the same; (ii) the node labels of those children are the same in left-to-right order. Since the ESP tree is balanced, the nodes in each level can be considered as a sequence of node labels listed in left-to-right order. We denote by $S^\ell$ the sequence of node labels at level $\ell$ in an ESP tree built from input string $S$. The sequence of node labels for leaves is denoted as $S^1$ and is the same as input string $S$, i.e., $S^1 = S$. We denote node labels for internal nodes as $X_i$. The height of the ESP tree is $O(\log L)$ for the length $L$ of input string $S$.
Let $V(S)$ be a $d$-dimensional integer vector built from ESP tree $T(S)$ such that each dimension of $V(S)$ is the number of occurrences of a node label in $T(S)$. Such vectors are called characteristic vectors. ESP builds ESP trees such that as many subtrees with the same node labels as possible are built for common substrings of strings $S$ and $S'$, resulting in an approximation of EDM between $S$ and $S'$ by the $L_1$ distance between their characteristic vectors $V(S)$ and $V(S')$, i.e., $EDM(S, S') \approx \|V(S) - V(S')\|_1$, where $\|\cdot\|_1$ is the $L_1$ norm. More precisely, the upper and lower bounds of the approximation are as follows,
$$EDM(S, S') = O(\|V(S) - V(S')\|_1), \qquad \|V(S) - V(S')\|_1 = O(\log L \, \log^* L)\, EDM(S, S'),$$
where $\log^* L$ is the iterated logarithm of $L$, which is recursively defined as $\log^{(1)} L = \log_2 L$, $\log^{(i+1)} L = \log \log^{(i)} L$ and $\log^* L = \min\{i : \log^{(i)} L \le 1\}$ for a positive integer $L$.
In the next subsection, we introduce left preferential parsing (LPP) as a basic algorithm of ESP. In the remainder of this section, we present the ESP algorithm.
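As a small illustration of how characteristic vectors are compared, the following Python sketch counts node labels and takes the $L_1$ distance between the two count vectors. The label lists are made-up toy inputs rather than the output of an actual ESP parse, and the helper names are ours.

```python
from collections import Counter

def characteristic_vector(labels):
    """Count how often each node label occurs (a sparse integer vector)."""
    return Counter(labels)

def l1_distance(u, v):
    """L1 distance between two sparse count vectors."""
    return sum(abs(u[k] - v[k]) for k in set(u) | set(v))

# Hypothetical node labels collected from all levels of two ESP trees.
labels_s = ["A", "B", "A", "B", "X1", "X1", "X2"]
labels_t = ["A", "B", "A", "A", "X1", "X3", "X4"]

v_s = characteristic_vector(labels_s)
v_t = characteristic_vector(labels_t)
print(l1_distance(v_s, v_t))
```

Storing the vectors sparsely matters in practice, since the number of distinct node labels can reach tens of millions while each string touches only a few of them.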
3.1 Left preferential parsing (LPP)
A basic idea of LPP is to make pairs of nodes from the left to the right positions preferentially in a sequence of nodes at ESP tree and make triples of remaining three nodes. Then, ESP builds type2 nodes for these pairs of nodes and a type1 node for the triple of nodes. LPP builds an ESP tree in a bottomup manner.
More precisely, if the length of sequence $S^\ell$ at the $\ell$-th level of the ESP tree is even, LPP makes pairs of $S^\ell[2i-1]$ and $S^\ell[2i]$ for all $i \in [1, |S^\ell|/2]$ and builds type2 nodes for all the pairs. Thus, $S^{\ell+1}$ at the $(\ell+1)$-th level of the ESP tree is a sequence of type2 nodes. If the length of sequence $S^\ell$ at the $\ell$-th level of the ESP tree is odd, LPP makes pairs of $S^\ell[2i-1]$ and $S^\ell[2i]$ for each $i \in [1, (|S^\ell|-3)/2]$ and makes a triple of $S^\ell[|S^\ell|-2]$, $S^\ell[|S^\ell|-1]$ and $S^\ell[|S^\ell|]$. LPP builds type2 nodes for the pairs of nodes and a type1 node for the triple of nodes. Thus, $S^{\ell+1}$ at the $(\ell+1)$-th level of the ESP tree is a sequence of type2 nodes followed by the type1 node as the last node. LPP builds an ESP tree in a bottom-up manner, i.e., it builds an ESP tree from the leaves (i.e., $S^1$) to the root. See Example 2 for an example.
A crucial drawback of LPP is that it can build completely different ESP trees for similar strings. For example, let $S'$ be a string where a character is inserted at the first position of $S$ in Figure 1. Although $S$ and $S'$ are similar strings, LPP builds completely different ESP trees $T(S)$ and $T(S')$ for $S$ and $S'$, respectively, resulting in a large difference between $EDM(S, S')$ and the $L_1$ distance between characteristic vectors $V(S)$ and $V(S')$. Thus, LPP alone lacks the ability to approximate EDM.
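The pairing rule described above can be sketched in a few lines of Python. This is an illustrative sketch only: the label names `X1`, `X2`, ... and the dictionary standing in for the label table are our own conventions, not the paper's implementation.

```python
def lpp_level(seq):
    """Group one level left to right: pairs, plus one final triple if the length is odd.

    Assumes len(seq) >= 2.
    """
    n = len(seq)
    end_pairs = n if n % 2 == 0 else n - 3
    groups = [(seq[i], seq[i + 1]) for i in range(0, end_pairs, 2)]
    if n % 2 == 1:
        groups.append(tuple(seq[-3:]))
    return groups

def lpp_tree_levels(s):
    """Build the levels bottom-up; identical child tuples share one node label."""
    table = {}                      # child tuple -> node label
    levels = [list(s)]
    while len(levels[-1]) > 1:
        nxt = []
        for g in lpp_level(levels[-1]):
            if g not in table:
                table[g] = f"X{len(table) + 1}"
            nxt.append(table[g])
        levels.append(nxt)
    return levels

for level in lpp_tree_levels("ABABABBAB"):
    print(level)
```

Running the same function on a string with one extra leading character shifts every pair boundary, which is the sensitivity to small edits that the ESP algorithm is designed to avoid.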
3.2 The ESP algorithm
ESP uses an engineered strategy while using LPP in its algorithm. ESP classifies a string into substrings of three categories and applies different parsing strategies according to their categories. An ESP tree for an input string is built by gradually applying this parsing strategy in ESP to strings from the lowest to the highest level of the tree.
Given sequence , ESP divides into subsequences in the following three categories: (i) Substring such that all pairs of adjacent node labels are different and substring length is at least . Formally, substring starting from position and ending at position in satisfies for any and ; (ii) Substring of same node label and of length at least . Formally, substring starting from position and ending at position satisfies for any and ; (iii) None of the above categories (i) and (ii).
After classifying a sequence into subsequences of the above three categories, ESP applies different parsing methods to each subsequence according to its category. ESP applies LPP to each subsequence of sequence $S^\ell$ in categories (ii) and (iii), and it builds nodes at the $(\ell+1)$-th level. For subsequences in category (i), ESP applies a special parsing technique named alphabet reduction.
Alphabet reduction. Alphabet reduction is a procedure for converting a sequence to a new sequence of alphabet size at most 3. For each symbol $S[i]$, a conversion is performed as follows. $S[i-1]$ is the left adjacent symbol of $S[i]$. Suppose $S[i-1]$ and $S[i]$ are represented as binary integers. Let $p$ be the index of the least significant bit in which $S[i-1]$ differs from $S[i]$, and let $bit(p, S[i])$ be the binary value of $S[i]$ at the $p$-th bit index. The label $L[i]$ is defined as $2p + bit(p, S[i])$ and is computed for each position $i$ in $S$. When this conversion is applied to a sequence over alphabet $\Sigma$, the alphabet size of the resulting label sequence is $2\log|\Sigma|$. In addition, an important property of labels is that all adjacent labels in a label sequence are different, i.e., $L[i] \neq L[i-1]$ for all $i$. Thus, this conversion can be iteratively applied to the new label sequence until its alphabet size is at most 6.
Reduction of the alphabet size from 6 to 3 is performed as follows. First, each 3 in the sequence is replaced with the least element from $\{0, 1, 2\}$ that does not neighbor the 3; the same is then done for each 4 and 5, which generates a new sequence of node labels drawn from $\{0, 1, 2\}$, where no adjacent characters are identical.
We then select as a landmark any position $i$ which is a local maximum, i.e., $L[i-1] < L[i] > L[i+1]$. In addition, we pick out as a landmark any position $i$ which is a local minimum, i.e., $L[i-1] > L[i] < L[i+1]$, and which is not adjacent to an already chosen landmark. An important property of those landmarks is that for any two successive landmark positions $i$ and $j$, either $|i-j| = 2$ or $|i-j| = 3$ holds, because $L$ is a sequence with no identical adjacent characters over alphabet $\{0, 1, 2\}$. Figure 3 illustrates alphabet reduction for an example sequence.
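The three steps of alphabet reduction (bit-based relabeling, reduction from six symbols to three, and landmark selection) can be sketched as follows. This is a minimal sketch under our own assumptions: symbols are given as non-negative integers, the first position (which has no left neighbor) is simply skipped, and only interior positions are considered as landmarks.

```python
def alphabet_reduction_step(seq):
    """One round: label = 2*p + (bit p of seq[i]), where p is the lowest
    bit index at which seq[i] differs from its left neighbor seq[i-1]."""
    labels = []
    for i in range(1, len(seq)):
        diff = seq[i] ^ seq[i - 1]
        if diff == 0:
            raise ValueError("adjacent symbols must differ")
        p = (diff & -diff).bit_length() - 1   # index of lowest set bit
        labels.append(2 * p + ((seq[i] >> p) & 1))
    return labels

def reduce_to_three(labels):
    """Replace each 3, then each 4, then each 5 by the least element of
    {0, 1, 2} that is not one of its current neighbors."""
    out = list(labels)
    for big in (3, 4, 5):
        for i, c in enumerate(out):
            if c == big:
                nbrs = {out[j] for j in (i - 1, i + 1) if 0 <= j < len(out)}
                out[i] = min(x for x in (0, 1, 2) if x not in nbrs)
    return out

def landmarks(labels):
    """Local maxima first, then local minima not adjacent to a chosen landmark."""
    n = len(labels)
    taken = {i for i in range(1, n - 1)
             if labels[i - 1] < labels[i] > labels[i + 1]}
    for i in range(1, n - 1):
        is_min = labels[i - 1] > labels[i] < labels[i + 1]
        if is_min and (i - 1) not in taken and (i + 1) not in taken:
            taken.add(i)
    return sorted(taken)

step = alphabet_reduction_step([5, 3, 6, 1, 4])
print(step)
reduced = reduce_to_three([3, 0, 1, 0, 4, 2, 5, 1])
print(reduced, landmarks(reduced))
```

Because each label depends only on a symbol and its left neighbor, an edit in the input changes labels only near the edit, which is what keeps the resulting parse locally stable.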
3.3 Computation complexity
Given a collection of strings $S_1$, $S_2$, …, $S_N$ where the maximum length among those strings is $L$, ESP builds ESP trees $T(S_1)$, $T(S_2)$, …, $T(S_N)$ where $T(S_i)$ is built from $S_i$ for each $i = 1, 2, \ldots, N$. To build ESP trees satisfying the condition that nodes having the same label sequences in their children have the same node labels, ESP uses a hash table that allows the unique symbol to be found from a pair or a triple of symbols. The hash table, denoted by $H$, maps each pair/triple of symbols to the corresponding symbol, resulting in $O(NL)$ memory used in ESP. In practice, the total number of pairs/triples of symbols is much less than $NL$. Thus, the memory used for the hash table remains small in practice, which is shown in Sec. 6. The computation time of ESP is $O(NL \log^* L)$.
ESP builds characteristic vectors of high dimension (e.g., tens of millions of dimensions). Applying FM for RFF to such high-dimensional vectors consumes a large amount of memory. In the next section, we present SFM for building RFF space-efficiently.
4 Space-efficient Feature Maps
In this section we present FM for RFF using space proportional to $d$ and independent of the dimension $D$ of the RFF vectors. Our space-efficient FM is called SFM and improves on the space usage of standard FM for generating RFF, which is proportional to $dD$. From an abstract point of view, RFF is based on a way of constructing a random mapping
$$z_r : \mathbb{R}^d \to \mathbb{R}^2$$
such that for every choice of vectors $x, y \in \mathbb{R}^d$ we have
$$E[z_r(x)' z_r(y)] = k(x, y),$$
where $k$ is the kernel function. The randomness of $z_r$ comes from a vector $r \in \mathbb{R}^d$ sampled from an appropriate distribution $\mathcal{D}_k$ that depends on $k$ (see Section 5 for more details), and the expectation is over the choice of $r$. For the purposes of this section all we need to know about $\mathcal{D}_k$ is that the $d$ vector coordinates are independently sampled according to the marginal distribution $\Delta_k$.
Since $z_r(x)' z_r(y)$ is bounded, it has bounded variance, but this in itself does not imply the desired approximation $k(x, y) \approx z_r(x)' z_r(y)$. Indeed, $z_r(x)' z_r(y)$ is a poor estimator of $k(x, y)$. The accuracy can be improved by increasing the output dimension to $D$. Specifically, RFF uses $D/2$ independent vectors $r_1, \ldots, r_{D/2}$ sampled from $\mathcal{D}_k$, and considers the FM
$$z(x) = \sqrt{2/D}\,\bigl(z_{r_1}(x), z_{r_2}(x), \ldots, z_{r_{D/2}}(x)\bigr)$$
that concatenates the values of the $D/2$ functions into one $D$-dimensional vector. Then one can show that $|z(x)' z(y) - k(x, y)| \le \varepsilon$ with high probability for $D$ sufficiently large.
In order to represent the function $z$ one needs to store the matrix containing vectors $r_1, \ldots, r_{D/2}$, which uses space $O(dD)$. Our insight is that the vectors do not need to be independent to ensure a good approximation. Instead, for a small integer parameter $t$ we compute each vector $r_m$ using a hash function $h$ chosen from a $t$-wise independent family such that for every $m$, $h(m)$ comes from distribution $\mathcal{D}_k$. Then, instead of storing the vectors $r_1, \ldots, r_{D/2}$ we only need to store the description of the hash function $h$ in memory. A priori there seem to be two issues with this approach:
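For concreteness, the following Python sketch builds standard (non-space-efficient) RFF for the Laplacian kernel $\exp(-\|x-y\|_1/\beta)$, which is the kernel used later in Section 5. It assumes the usual cosine/sine form of each two-dimensional block and Cauchy-distributed coordinates with scale $1/\beta$; the function names and the toy vectors are ours.

```python
import math
import random

def sample_cauchy_directions(d, D, beta, rng):
    """D/2 random vectors in R^d with i.i.d. Cauchy(0, 1/beta) coordinates.

    This is the O(dD) table that a space-efficient map avoids storing.
    """
    return [[math.tan(math.pi * (rng.random() - 0.5)) / beta for _ in range(d)]
            for _ in range(D // 2)]

def rff(x, R):
    """Map x to sqrt(2/D) * (cos(r.x), sin(r.x)) for every stored r."""
    D = 2 * len(R)
    scale = math.sqrt(2.0 / D)
    z = []
    for r in R:
        proj = sum(ri * xi for ri, xi in zip(r, x))
        z.extend((scale * math.cos(proj), scale * math.sin(proj)))
    return z

def laplacian_kernel(x, y, beta):
    return math.exp(-sum(abs(a - b) for a, b in zip(x, y)) / beta)

rng = random.Random(0)
x, y, beta = [1, 0, 2, 1], [0, 0, 2, 3], 4.0
R = sample_cauchy_directions(d=4, D=4096, beta=beta, rng=rng)
approx = sum(a * b for a, b in zip(rff(x, R), rff(y, R)))
print(approx, laplacian_kernel(x, y, beta))
```

The inner product of two mapped vectors is an average of $D/2$ cosine terms, so its error shrinks as $D$ grows, while the stored table `R` grows as $dD$.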

It is not clear how to construct $t$-wise independent hash functions whose output has distribution $\mathcal{D}_k$.

Is $t$-wise independence sufficient to ensure results similar to the fully independent setting?
We address these issues in the next two subsections.
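4.1 Hash functions with distribution $\mathcal{D}_k$
For concreteness we base our construction on the following class of $t$-wise independent hash functions, where $t$ is a parameter: For $a_0, \ldots, a_{t-1} \in [0, 1]$ chosen uniformly at random, let
$$f(x) = \Bigl(\sum_{j=0}^{t-1} a_j x^j\Bigr) \bmod 1,$$
where $u \bmod 1$ computes the fractional part of $u$. It can be shown that for any $t$ distinct integer inputs $x_1, \ldots, x_t$, the vector $(f(x_1), \ldots, f(x_t))$ is uniformly distributed in $[0, 1]^t$ [Rahimi07].
Let $\Delta_k^{-1}$ denote the inverse of the cumulative distribution function of the marginal distribution $\Delta_k$. Then if $u$ is uniformly distributed in $[0, 1]$, $\Delta_k^{-1}(u)$ has distribution $\Delta_k$. We can now construct the hash function $h$ where the $i$-th coordinate on input $m$ is:
$$h(m)_i = \Delta_k^{-1}(f_i(m)),$$
where $f_1, \ldots, f_d$ are chosen independently from the class above. We see that for every $m$, $h(m)$ has distribution $\mathcal{D}_k$. Furthermore, for every set of $t$ distinct integer inputs $m_1, \ldots, m_t$, the hash values $h(m_1), \ldots, h(m_t)$ are independent.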
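The construction above can be sketched directly in Python: a random polynomial evaluated modulo 1 yields a uniform value, and an inverse cumulative distribution function turns it into a sample from the target marginal. The sketch assumes a standard Cauchy marginal as the example distribution, uses floating-point arithmetic (so independence holds only approximately), and all names are ours.

```python
import math
import random

def make_poly_hash(t, rng):
    """f(x) = (a_0 + a_1 x + ... + a_{t-1} x^{t-1}) mod 1 with random a_j in [0, 1)."""
    a = [rng.random() for _ in range(t)]
    def f(x):
        return sum(a_j * x ** j for j, a_j in enumerate(a)) % 1.0
    return f

def cauchy_inverse_cdf(u):
    """Inverse CDF of the standard Cauchy distribution (example marginal)."""
    return math.tan(math.pi * (u - 0.5))

def make_vector_hash(d, t, rng):
    """h(m): a d-dimensional vector, one independent polynomial hash per coordinate."""
    fs = [make_poly_hash(t, rng) for _ in range(d)]
    def h(m):
        return [cauchy_inverse_cdf(f(m)) for f in fs]
    return h

h = make_vector_hash(d=5, t=2, rng=random.Random(1))
print(h(1))
print(h(2))
```

Only the $t$ coefficients per coordinate are stored, so every vector $h(m)$ can be regenerated on demand instead of being kept in a table.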
4.2 Concentration bounds
We next show that, like for RFF, $D = O(1/\varepsilon^2)$ random features suffice to approximate the kernel function within error $\varepsilon$ with probability arbitrarily close to 1.
Theorem 1.
For every pair of vectors $x, y \in \mathbb{R}^d$, if the mapping $z$ is constructed as described above using $t \ge 2$ we have for every $\varepsilon > 0$:
Proof.
Our proof follows the same outline as the standard proof of Chebyshev’s inequality. Consider the second central moment:
The second equality above uses $2$-wise independence, and the fact that $E[z_{r_m}(x)' z_{r_m}(y) - k(x, y)] = 0$, to conclude that only $D/2$ terms in the expansion have nonzero expectation. Finally, we have:
where the second inequality follows from Markov’s inequality. This concludes the proof. ∎
In the original analysis of RFF a stronger kind of approximation guarantee was considered, namely approximation of the kernel function for all pairs of points in a bounded region of $\mathbb{R}^d$. This kind of result can be achieved by choosing $t$ sufficiently large to obtain stronger tail bounds. However, our experiments suggest that the type of pointwise guarantee (i.e., $t = 2$) provided by Theorem 1 is sufficient for an application in kernel approximations.
5 Scalable Alignment Kernels
In this section we present the ESP+SFM algorithm for scalable learning with alignment kernels.
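Let us assume a collection of $N$ strings $S_1, S_2, \ldots, S_N$ and their labels $y_1, y_2, \ldots, y_N$, where each $y_i$ is the class label of $S_i$. We define alignment kernels using $EDM(S_i, S_j)$ for each pair of strings $S_i$ and $S_j$ as follows,
$$k(S_i, S_j) = \exp(-EDM(S_i, S_j)/\beta), \qquad (1)$$
where $\beta > 0$ is a parameter. We apply ESP to each $S_i$ for $i = 1, 2, \ldots, N$ and build ESP trees $T(S_1), T(S_2), \ldots, T(S_N)$. Since ESP approximates $EDM(S_i, S_j)$ as the $L_1$ distance between characteristic vectors $V(S_i)$ and $V(S_j)$ built from ESP trees $T(S_i)$ and $T(S_j)$ for $S_i$ and $S_j$, i.e., $EDM(S_i, S_j) \approx \|V(S_i) - V(S_j)\|_1$, we can approximate $k(S_i, S_j)$ as follows,
$$k(S_i, S_j) \approx \exp(-\|V(S_i) - V(S_j)\|_1/\beta). \qquad (2)$$
Since Eq. 2 is a Laplacian kernel, which is a shift-invariant kernel [Rahimi07], we can approximate $k(S_i, S_j)$ using FM for RFF as follows,
$$k(S_i, S_j) \approx z(V(S_i))' z(V(S_j)), \qquad (3)$$
where $z(x) = \sqrt{2/D}\,\bigl(z_{r_1}(x), z_{r_2}(x), \ldots, z_{r_{D/2}}(x)\bigr)$.
For Laplacian kernels, $z_{r_m}(x)$ for each $m = 1, 2, \ldots, D/2$ is defined as
$$z_{r_m}(x) = \bigl(\cos(r_m' x), \sin(r_m' x)\bigr), \qquad (4)$$
where random vectors $r_m \in \mathbb{R}^d$ for $m = 1, 2, \ldots, D/2$ are sampled from the Cauchy distribution.
Applying FM to high-dimensional characteristic vectors consumes a huge amount of memory. Thus, we present SFM for RFF using $O(d)$ memory by applying the $t$-wise independent hash functions introduced in Sec. 4. We fix $t = 2$ in this study.
Algorithm 1 generates random numbers from the Cauchy distribution using $O(d)$ memory. We use two arrays initialized with 64-bit random numbers. A function implemented using these two arrays returns a random number in $[0, 1]$ for a given pair of indices as input. Then, the random number $u$ returned from this function is converted to a random number from the Cauchy distribution as $\tan(\pi(u - 0.5))$ at line 8. Algorithm 2 implements SFM generating RFF in Eq. 4. The computation time and memory for SFM are $O(dDN)$ and $O(d)$, respectively.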
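The following Python sketch shows one way the pieces described above could fit together: two arrays of 64-bit random integers, a hash from a (coordinate, feature index) pair to a uniform number, the tangent transform to a Cauchy sample, and the cosine/sine features. It is an illustrative sketch, not the paper's Algorithm 1 or 2: the class name `SFM`, the linear hash modulo $2^{64}$, the scaling of the Cauchy sample by $1/\beta$ and the sparse-dictionary input format are all our assumptions.

```python
import math
import random

MASK64 = (1 << 64) - 1

class SFM:
    """Space-efficient RFF sketch: state is two length-d integer arrays."""

    def __init__(self, d, D, beta, seed=0):
        rng = random.Random(seed)
        self.D, self.beta = D, beta
        # The only stored randomness: O(d) memory, independent of D.
        self.a0 = [rng.getrandbits(64) for _ in range(d)]
        self.a1 = [rng.getrandbits(64) for _ in range(d)]

    def _uniform(self, i, m):
        """Hash (coordinate i, feature index m) to a number in (0, 1].

        A linear hash modulo 2^64 (assumed stand-in for the t = 2 family).
        """
        v = (self.a0[i] + self.a1[i] * m) & MASK64
        return (v + 0.5) / 2.0 ** 64

    def _cauchy(self, i, m):
        """Regenerate the (m, i) entry of the random matrix on demand."""
        return math.tan(math.pi * (self._uniform(i, m) - 0.5)) / self.beta

    def transform(self, x):
        """x is a sparse vector {index: value}; returns D features."""
        scale = math.sqrt(2.0 / self.D)
        z = []
        for m in range(1, self.D // 2 + 1):
            proj = sum(self._cauchy(i, m) * v for i, v in x.items())
            z += [scale * math.cos(proj), scale * math.sin(proj)]
        return z

sfm = SFM(d=4, D=2048, beta=4.0, seed=0)
x = {0: 1, 2: 2, 3: 1}          # toy characteristic vectors
y = {2: 2, 3: 3}
approx = sum(a * b for a, b in zip(sfm.transform(x), sfm.transform(y)))
exact = math.exp(-3 / 4.0)      # ||x - y||_1 = 3, beta = 4
print(approx, exact)
```

Each matrix entry is recomputed whenever it is needed, trading extra arithmetic for memory; only the nonzero coordinates of a sparse characteristic vector are visited.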
Dataset  Number  #positives  Alphabet size  Average length 

Protein  3,238  96  20  607 
DNA  3,238  96  4  1,827 
Music  10,261  9,022  61  329 
Sports  296,337  253,017  63  307 
Compound  1,367,074  57,536  44  53 
5.1 Online learning
The RFF vector $z(V(S_i))$ for the characteristic vector $V(S_i)$ built by ESP for each string $S_i$ can be computed in an online manner, because ESP and SFM are executable in an online manner. By applying an online learning algorithm for linear SVMs, such as the passive aggressive algorithm [Wang12], to $z(V(S_i))$, linear SVMs can be trained in an online manner.
5.2 CGK+SFM
CGK [Chakraborty16, Zhang17] is another string embedding using a randomized algorithm. Let $S_i$ for $i = 1, 2, \ldots, N$ be input strings over alphabet $\Sigma$ and let $L$ be the maximum length of input strings. CGK maps input strings $S_i$ in the edit distance space into strings $S_i'$ of fixed length in the Hamming space, i.e., the edit distance between each pair $S_i$ and $S_j$ of input strings is approximately preserved by the Hamming distance of the corresponding pair $S_i'$ and $S_j'$ of the mapped strings. See [Zhang17] for the details of CGK.
To apply SFM, we convert the strings mapped into the Hamming space by CGK to characteristic vectors in the $L_1$ distance space as follows. We view the elements $S_i'[j]$ of each mapped string as locations (of the nonzero elements) instead of characters. For example, when $\Sigma = \{1, 2, 3\}$, we view each $S_i'[j]$ as a vector of length $|\Sigma| = 3$. If $S_i'[j] = 1$, then we code it as $(1, 0, 0)$; if $S_i'[j] = 3$, then we code it as $(0, 0, 1)$. We then concatenate those vectors into one vector whose dimension is $|\Sigma|$ times the length of the mapped string and which has one nonzero element per position, so that the $L_1$ distance between two such vectors is twice the Hamming distance between the corresponding mapped strings.
Vectors of RFF can then be built using SFM. We call the approximation of alignment kernels using CGK and SFM CGK+SFM. CGK+SFM cannot achieve high prediction accuracies in practice, which is shown in Sec. 6.
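As an illustration of the online setting, the sketch below applies a passive aggressive update to one feature vector at a time. It uses the PA-I rule with hinge loss; whether this matches the exact variant used in the experiments is an assumption, and the function name and toy vectors are ours.

```python
def pa_update(w, z, y, C=1.0):
    """One passive aggressive (PA-I) step for label y in {+1, -1}.

    w is updated in place only when the hinge loss on (z, y) is positive.
    """
    margin = y * sum(wi * zi for wi, zi in zip(w, z))
    loss = max(0.0, 1.0 - margin)
    sq_norm = sum(zi * zi for zi in z)
    if loss > 0.0 and sq_norm > 0.0:
        tau = min(C, loss / sq_norm)          # step size capped by C
        for i, zi in enumerate(z):
            w[i] += tau * y * zi
    return w

# Toy stream of (feature vector, label) pairs, processed one at a time.
w = [0.0, 0.0]
stream = [([1.0, 0.0], +1), ([1.0, 0.0], +1), ([0.0, 2.0], -1)]
for z, y in stream:
    pa_update(w, z, y)
print(w)
```

Each example is seen once and then discarded, so memory stays proportional to the feature dimension rather than to the number of training strings.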
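The coding step above amounts to a position-wise one-hot encoding. The short Python sketch below, with function names of our own choosing and toy strings in place of actual CGK output, shows the encoding and checks that the $L_1$ distance between two encoded vectors is twice the Hamming distance between the strings.

```python
def one_hot_code(s, alphabet):
    """Concatenate one length-|alphabet| indicator vector per position of s."""
    index = {c: k for k, c in enumerate(alphabet)}
    vec = [0] * (len(s) * len(alphabet))
    for pos, c in enumerate(s):
        vec[pos * len(alphabet) + index[c]] = 1
    return vec

def hamming(s, t):
    return sum(a != b for a, b in zip(s, t))

def l1(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))

print(one_hot_code([1, 3], alphabet=[1, 2, 3]))

# Toy equal-length strings standing in for CGK-mapped strings.
s, t = "ACGT", "AGGA"
u, v = one_hot_code(s, "ACGT"), one_hot_code(t, "ACGT")
print(hamming(s, t), l1(u, v))
```

Every mismatching position contributes one unit from each of the two indicator vectors, which is where the factor of two comes from.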
Data  Protein  DNA  Music  Sports  Compound  

Method  ESP  CGK  ESP  CGK  ESP  CGK  ESP  CGK  ESP  CGK 
Time (sec)  
Memory (MB)  
Dimension 
6 Experiments
In this section, we evaluate the performance of ESP+SFM with five massive string datasets, as shown in Table 2.
The ”Protein” and ”DNA” datasets each consist of 3,238 human enzymes obtained from the KEGG GENES database [Kanehisa17].
Each enzyme in ”DNA” was coded by a string consisting of four types of nucleotides or bases (i.e., A, T, G, and C).
Similarly, each enzyme in ”Protein” was coded by a string consisting of 20 types of amino acids.
Enzymes belonging to class 5 in the enzyme commission (EC) numbers in ”DNA” and ”Protein” have positive labels and the other enzymes have negative labels.
There are 96 enzymes with positive labels and 3,142 enzymes with negative labels in ”DNA” and ”Protein”.
The ”Music” and ”Sports” datasets consist of 10,261 and 296,337 English reviews of musical instrument products and sports products from Amazon [He16, McAuley15], respectively.
Each review has a rating of five levels.
We assigned positive labels to reviews with four or five levels for rating and negative labels to the other reviews.
The numbers of positive and negative reviews were 9,022 and 1,239, respectively in ”Music”.
The numbers of positive and negative reviews were 253,017 and 43,320, respectively in ”Sports”.
The ”Compound” dataset consists of 1,367,074 bioactive compounds obtained from the PubChem database in the National Center for Biotechnology Information (NCBI) [Kim16].
Each compound was coded by a string representation of chemical structures called simplified molecular input line entry system (SMILES).
The biological activities of the compounds for human proteins were obtained from the ChEMBL database.
In this study we focused on the biological activity for the human protein microtubule associated protein tau (MAPT).
The label of each compound corresponds to the presence or absence of biological activity for MAPT.
The numbers of positive and negative compounds were 57,537 and 1,309,537, respectively.
We implemented all the methods in C++ and performed all the experiments on one core of a quad-core Intel Xeon CPU E5-2680 (2.8GHz).
We stopped the execution of each method if it had not finished within 48 hours.
Software and datasets used in these experiments are downloadable from
[Figure: one panel per dataset (Protein, DNA, Music, Sports, Compound)]

[Figure: one panel per dataset (Protein, DNA, Music, Sports, Compound)]

6.1 Scalability of ESP
First, we evaluated the scalability of ESP in comparison with CGK. Table 3 shows the execution time, memory and dimension for characteristic vectors generated by ESP and CGK. ESP and CGK were fast enough in practice to build characteristic vectors for large datasets. The executions of ESP and CGK finished within 60 seconds for ”Compound”, which was the largest dataset, consisting of more than 1 million compounds. The memory used in CGK was smaller than that of ESP for each dataset, and at most 1.5GB of memory was consumed in the execution of ESP. These results demonstrate the high scalability of ESP for massive datasets.
Characteristic vectors of very high dimensions were built by ESP and CGK for each dataset. For example, 18 million dimension vectors were built by ESP for the ”Sports” dataset. Applying FM to such high dimension characteristic vectors consumed huge amount of memory, deteriorating the scalability of FM. Our proposed SFM can solve the scalability problem, which will be shown in the next subsection.
6.2 Efficiency of SFM
We evaluated the efficiency of SFM applied to characteristic vectors built from ESP and CGK, and compared it with FM. We examined combinations of characteristic vectors and projected vectors of ESP+FM, ESP+SFM, CGK+FM and CGK+SFM. The dimension $D$ of projected vectors of RFF was examined for $D \in \{128, 512, 2048, 8192, 16384\}$.
Figure 5 shows the memory consumed in SFM and FM for characteristic vectors built by ESP and CGK for each dataset. A huge amount of memory was consumed in FM for high-dimensional characteristic vectors and projected vectors. Around 1.1TB and 323GB of memory were consumed by ESP+FM for $D = 16384$ in the results of ”Sports” and ”Compound”, respectively, which prevented us from building high-dimensional vectors of RFF. The memory required by SFM was linear in the dimension $d$ of characteristic vectors for each dataset. Only 280MB and 80MB were consumed by ESP+SFM for $D = 16384$ in the results of ”Sports” and ”Compound”, respectively. These results suggest that ESP+SFM enables a dramatic reduction of required memory, compared with ESP+FM.
Figure 6 shows the execution time for building projected vectors for each dataset. The execution time increased linearly with dimension $D$ for each method, and SFM built 16,384-dimensional vectors of RFF for ”Compound” in around 9 hours.
We evaluated the accuracies of our approximations of alignment kernels by the average error of RFF defined as
where $k(S_i, S_j)$ was defined as in Eq. 2 and $\beta$ was fixed. The average error of SFM was compared with that of FM for each dataset. Table 4 shows the average errors of SFM and FM using characteristic vectors built from ESP for each dataset. The average errors were almost the same between ESP+SFM and ESP+FM for all datasets and dimensions $D$. The accuracies of FM were preserved in SFM while achieving a dramatic reduction of the memory required for FM. The same tendencies were observed for the average errors of SFM in combination with CGK, as shown in Table 5.
Method  Protein  DNA  Music  Sports  Compound 

ESP+SFM(D=128)  
ESP+FM(D=128)  
ESP+SFM(D=512)  
ESP+FM(D=512)  
ESP+SFM(D=2048)  
ESP+FM(D=2048)  
ESP+SFM(D=8192)  
ESP+FM(D=8192)  
ESP+SFM(D=16384)  
ESP+FM(D=16384) 
Method  Protein  DNA  Music  Sports  Compound 

CGK+SFM(D=128)  
CGK+FM(D=128)  
CGK+SFM(D=512)  
CGK+FM(D=512)  
CGK+SFM(D=2048)  
CGK+FM(D=2048)  
CGK+SFM(D=8192)  
CGK+FM(D=8192)  
CGK+SFM(D=16384)  
CGK+FM(D=16384) 
6.3 Classification performance of ESP+SFM
We evaluated the classification abilities of ESP+SFM, CGK+SFM, LAK and GAK.
We used an implementation of LAK downloadable from
Method  Protein  DNA  Music  Sports  Compound 

ESP+SFM(D=128)  5  8  11  204  261 
ESP+SFM(D=512)  22  34  47  799  1,022 
ESP+SFM(D=2048)  93  138  193  3,149  4,101 
ESP+SFM(D=8192)  367  544  729  12,179  16,425 
ESP+SFM(D=16384)  725  1,081  1,430  24,282  32,651 
CGK+SFM(D=128)  14  52  26  452  397 
CGK+SFM(D=512)  60  222  104  1,747  1,570 
CGK+SFM(D=2048)  237  981  415  7,156  6,252 
CGK+SFM(D=8192)  969  3,693  1,688  27,790  25,054 
CGK+SFM(D=16384)  1,937  7,596  3,366  53,482  49,060 
LAK  31,718         
GAK  25,252  >48h  101,079  >48h  >48h 
ESP+Kernel  20  28  162  >48h  >48h 
CGK+Kernel  31  95  196  >48h  >48h 

Table 6 shows the execution time in seconds for building RFF and computing kernel matrices in addition to training linear/non-linear SVMs for each method. LAK was applied only to ”Protein” because its scoring function was optimized for protein sequences. It took 9 hours for LAK to finish the execution, which was the most time-consuming among all methods on ”Protein”. The execution of GAK finished within 48 hours for only ”Protein” and ”Music”, and it took around 7 hours and 28 hours for ”Protein” and ”Music”, respectively. The execution of ESP+Kernel and CGK+Kernel did not finish within 48 hours for ”Sports” and ”Compound”. These results suggest that existing alignment kernels are not suitable for applications to massive string datasets. The executions of ESP+SFM and CGK+SFM finished within 48 hours for all datasets. ESP+SFM and CGK+SFM took around 9 hours and 13 hours, respectively, for ”Compound” consisting of 1.3 million strings in the setting of large $D$.
Figure 7 shows the memory consumed for training linear/non-linear SVMs for each method, where GAK, LAK, ESP+Kernel and CGK+Kernel are represented as ”Kernel”. ”Kernel” required little memory for relatively small datasets such as ”Protein”, ”DNA” and ”Music”, but it required huge memory for relatively large datasets such as ”Sports” and ”Compound”. For example, it consumed 654GB and 1.3TB of memory for ”Sports” and ”Compound”, respectively. The memory for ESP+SFM and CGK+SFM was at least one order of magnitude smaller than that of ”Kernel”. ESP+SFM and CGK+SFM required 36GB and 166GB of memory for ”Sports” and ”Compound” in the setting of large $D$, respectively. These results demonstrate the high memory efficiency of ESP+SFM and CGK+SFM.
Figure 8 shows the classification accuracy of each method. The prediction accuracies of ESP+SFM and CGK+SFM improved for larger $D$. The prediction accuracy of ESP+SFM was higher than that of CGK+SFM for any $D$ on all datasets and competitive with those of LAK, GAK, ESP+Kernel and CGK+Kernel. These results suggest that ESP+SFM can achieve high classification accuracy and that it is much more efficient than the other methods in terms of memory and time for building RFF and training linear SVMs.
7 Conclusions and Future Work
We presented the first approximation of alignment kernels for solving large-scale classifications of string datasets. Our method called ESP+SFM has the following appealing properties:

Scalability: ESP+SFM is applicable to massive string data (see Section 6).

Fast training: ESP+SFM trains SVMs fast (see Section 6.3).

Space efficiency: ESP+SFM trains SVMs space-efficiently (see Section 6.3).

Prediction accuracy: ESP+SFM can achieve high prediction accuracy (see Section 6.3).

Online learning: ESP+SFM can train SVMs with alignment kernels in an online manner (see Section 5.1).
We developed novel space-efficient feature maps named SFM for RFF and applied them to approximations of alignment kernels, but in principle SFM is useful for scaling up other kernel approximations using FM. Thus, important future work is to apply SFM to other kernel approximations. Such extensions will be important for scaling up kernel methods to be applied in various machine learning applications.
8 Acknowledgments
We thank Takaaki Nishimoto and Ninh Pham for useful discussions of kernel approximation methods. The research of Rasmus Pagh has received funding from the European Research Council under the European Union’s 7th Framework Programme (FP7/2007-2013) / ERC grant agreement no. 614331.