Boosting the kernelized shapelets: Theory and algorithms for local features

# Boosting the kernelized shapelets: Theory and algorithms for local features

\nameDaiki Suehiro \emailsuehiro@ait.kyushu-u.ac.jp
\addrFaculty of Information Science and Electrical Engineering, Kyushu University
and AIP, RIKEN
Fukuoka, Japan, 8190395 \AND\nameKohei Hatano \emailhatano@inf.kyushu-u.ac.jp
\addrResearch and Development Division, Library, Kyushu University
and AIP, RIKEN
Fukuoka, Japan, 8190395 \AND\nameEiji Takimoto \emaileiji@inf.kyushu-u.ac.jp
\addrDepartment of Informatics, Kyushu University
Fukuoka, Japan, 8190395 \AND\nameShuji Yamamoto \emailyamashu@math.keio.ac.jp \AND\nameKenichi Bannai \emailbannai@math.keio.ac.jp
\addrDepartment of Mathematics, Keio University
and AIP, RIKEN
Kanagawa, Japan, 2238522 \AND\nameAkiko Takeda \emailatakeda@ism.ac.jp
\addrDepartment of Mathematical Analysis and Statistical Inference,
The Institute of Statistical Mathematics
and AIP, RIKEN
Tokyo, Japan, 1908562
###### Abstract

We consider binary classification problems using local features of objects. One of motivating applications is time-series classification, where features reflecting some local closeness measure between a time series and a pattern sequence called shapelet are useful. Despite the empirical success of such approaches using local features, the generalization ability of resulting hypotheses is not fully understood and previous work relies on a bunch of heuristics. In this paper, we formulate a class of hypotheses using local features, where the richness of features is controlled by kernels. We derive generalization bounds of sparse ensembles over the class which is exponentially better than a standard analysis in terms of the number of possible local features. The resulting optimization problem is well suited to the boosting approach and the weak learning problem is formulated as a DC program, for which practical algorithms exist. In preliminary experiments on time-series data sets, our method achieves competitive accuracy with the state-of-the-art algorithms with small parameter-tuning cost.

Boosting the kernelized shapelets: Theory and algorithms for local features Daiki Suehiro suehiro@ait.kyushu-u.ac.jp
Faculty of Information Science and Electrical Engineering, Kyushu University
and AIP, RIKEN
Fukuoka, Japan, 8190395
Kohei Hatano hatano@inf.kyushu-u.ac.jp
Research and Development Division, Library, Kyushu University
and AIP, RIKEN
Fukuoka, Japan, 8190395
Eiji Takimoto eiji@inf.kyushu-u.ac.jp
Department of Informatics, Kyushu University
Fukuoka, Japan, 8190395
Shuji Yamamoto yamashu@math.keio.ac.jp Kenichi Bannai bannai@math.keio.ac.jp
Department of Mathematics, Keio University
and AIP, RIKEN
Kanagawa, Japan, 2238522
Akiko Takeda atakeda@ism.ac.jp
Department of Mathematical Analysis and Statistical Inference,
The Institute of Statistical Mathematics
and AIP, RIKEN
Tokyo, Japan, 1908562

Editor:

## 1 Introduction

Classifying objects using their “local” patterns is often effective in various applications. For example, in time-series classification problems, a local feature called shapelet is shown to be quite powerful in the data mining literature (Ye and Keogh, 2009; Keogh and Rakthanmanon, 2013; Hills et al., 2014; Grabocka et al., 2014). More precisely, a shapelet is a real-valued “short” sequence in for some . Given a time-series , a typical measure of closeness between the time-series and the shapelet is , where and . Here, the measure focuses on “local” closeness between the time-series and the shapelet. In many time-series classification problems, sparse combinations of features based on the closeness to some shapelets are useful (Grabocka et al., 2015; Renard et al., 2015; Hou et al., 2016). Similar situations could happen in other applications. Say, for image classification problems, template matching is a well-known technique to measure similarity between an image and “(small) template image” in a local sense.

Despite the empirical success of applying local features, theoretical guarantees of such approaches are not fully investigated. In particular, trade-offs of the richness of such local features and the generalization ability are not characterized yet.

In this paper, we formalize a class of hypotheses based on some local closeness. Here, the richness of the class is controlled by associated kernels. We show generalization bounds of ensembles of such local classifiers. Our bounds are exponentially tighter in terms of some parameter than typical bounds obtained by a standard analysis.

Further, for learning ensembles of the kernelized hypotheses, our theoretical analysis suggests a -norm constrained optimization problem with infinitely many parameters, for which the dual problem is categorized as a semi-infinite program (see Shapiro, 2009). To obtain approximate solutions of the problem efficiently, we take the approach of boosting (Schapire et al., 1998). In particular, we employ LPBoost (Demiriz et al., 2002), which solve -norm constrained soft margin optimization via a column generation approach. As a result, our approach has two stages, where the master problem is a linear program and the sub-problems are difference of convex programs (DC programs), which are non-convex. While it is difficult to solve the sub-problems exactly due to non-convexity, various techniques are investigated for DC programs and we can find good approximate solutions efficiently for many cases in practice.

In preliminary experiments on time-series data sets, our method achieves competitive accuracy with the state-of-the-art algorithms using shapelets. While the previous algorithms need careful parameter tuning and heuristics, our method uses less parameter tuning and parameters can be determined in an organized way. In addition, our solutions tend to be sparse and could be useful for domain experts to select good local features.

### 1.1 Related work

The concept of time-series shapelets was first introduced by Ye and Keogh (2009). The algorithm finds shapelets by using the information gains of potential candidates associated with all the subsequences of the given time series and constructs a decision tree. Shapelet transform (Hills et al., 2014) is a technique combining with shapelets and machine learning. The authors consider the time-series examples as feature vectors defined by the set of local closeness to some shapelets and in order to obtain classification rule, they employed some effective learning algorithms such as linear SVM or random forests. Note that shapelet transform completely separate the phase searching for shapelets from the phases of creating classification rules. Afterward, many algorithms have been proposed to search the good shapelets efficiently keeping high prediction accuracy in practice (Keogh and Rakthanmanon, 2013; Grabocka et al., 2015; Renard et al., 2015; Karlsson et al., 2016). The algorithms are based on the idea that discriminative shapelets are contained in the training data. This approach, however, might overfit without a regularization. Learning Time-Series Shapelets (LTS) algorithm (Grabocka et al., 2014) is a different approach from such subsequence-based algorithms. LTS approximately solves an optimization problem of learning the best shapelets directly without searching subsequences in a brute-force way. In contrast to the above subsequence-based methods, LTS finds nearly optimal shapelets and achieves higher prediction accuracy than the other existing methods in practice. However, there is no theoretical guarantee of its generalization error.

There are previous results using local features based on kernels  (e.g., Odone et al., 2005; Harris, ; Shimodaira et al., 2001; Zhang et al., 2010). However, their approaches focus on the closeness (similarity) between examples (e.g., subsequence of example and subsequence of example ), not known to capture local similarity.

## 2 Preliminaries

Let be a set, in which an element is called a pattern. Our instance space is a set of sequences of patterns in . For simplicity, we assume that every sequence in is of the same length. That is, for some integer . We denote by an instance sequence in , where every is a pattern in . The learner receives a labeled sample of size , where each labeled instance is independently drawn according to some unknown distribution over .

Let be a kernel over , which is used to measure the similarity between patterns, and let denote a feature map associated with the kernel for a Hilbert space . That is, for patterns , where denotes the inner product over . The norm induced by the inner product is denoted by and satisfies for .

For each , we define the base classifier (or the feature), denoted by , as the function that maps a given sequence to the maximum of the similarity scores between and over all patterns in . More specifically,

 hu(x)=maxj∈[Q]⟨u,Φ(x(j))⟩,

where denotes the set . For a set , we define the class of base classifiers as

 HU={hu∣u∈U}

and we denote by the set of convex combinations of base classifiers in . More precisely,

 conv(HU)={∑u∈U′wuhu(x)∣∀u∈U′,wu≥0,∑u∈U′wu=1,U′⊆U is a finite support}.

The goal of the learner is to find a final hypothesis , so that its generalization error is small.

#### Example: Learning with time-series shapelets

For a typical setting of learning with time-series shapelets, an instance is a sequence of real numbers and a base classifier , which is associated with a shapelet (i.e., a “short” sequence of real numbers) , is defined as

 hs(x)=max1≤j≤L−ℓ+1K(s,x(j)),

where is the subsequence of of length that begins with th index111 In previous work, is not necessarily a kernel. For instance, the negative of Euclidean distance is often used as a similarity measure .. In our framework, this corresponds to the case where , , and .

## 3 Risk bounds of the hypothesis classes

In this section, we give generalization bounds of the hypothesis classes for various and . To derive the bounds, we use the Rademacher and the Gaussian complexity (Bartlett and Mendelson, 2003).

###### Definition 1

(The Rademacher and the Gaussian complexity, Bartlett and Mendelson, 2003) Given a sample , the empirical Rademacher complexity of a class w.r.t.  is defined as , where and each is an independent uniform random variable in . The empirical Gaussian complexity of w.r.t.  is defined similarly but each is drawn independently from the standard normal distribution.

The following bounds are well-known.

###### Lemma 1

(Bartlett and Mendelson, 2003, Lemma 4) .

###### Lemma 2

(Mohri et al., 2012, Corollary 6.1) For fixed , , the following bound holds with probability at least : for all ,

 ED(f)≤ES,ρ(f)+2ρRS(H)+3√log1δ2m,

where is the empirical margin loss of over , i.e., the ratio of examples for which has margin .

To derive generalization bounds based on the Rademacher or the Gaussian complexity is quite standard in the statistical learning theory literature and applicable to our classes of interest as well. However, a standard analysis provides us sub-optimal bounds.

For example, let us consider the simple case where the class of base classifiers is defined by the linear kernel with to be the set of vectors in of bounded norm. In this case, can be viewed as , where for every . Then, by a standard analysis  (see, e.g., Mohri et al., 2012, Lemma 8.1), we have . However, this bound is weak since it is linear in . In the following, we will give an improved bound of . The key observation is that if base classes are “correlated” somehow, one could obtain better Rademacher bounds. In fact, we will exploit some geometric properties among these base classes.

### 3.1 Main theorem

First, we show our main theoretical result on the generalization bound of .

Given a sample , let be the set of patterns appearing in and let . Let . By viewing each instance as a hyperplane , we can naturally define a partition of the Hilbert space by the set of all hyperplanes . Let be the set of all cells of the partition. Each cell is a polyhedron which is defined by a minimal set that satisfies . Let

 μ∗=minI∈Imaxu∈I∩Uminv∈VI|⟨u,v⟩|.

Let be the VC dimension of the set of linear classifiers over the finite set , given by . Then, our main result is stated as follows:

###### Theorem 1

Suppose that for any , . Then, for any and , with probability at least , the following holds for any with :

 ED(g)≤ Eρ(g)+O⎛⎜ ⎜⎝RΛ√d∗Φ,Slog(mQ)ρ√m+√log1δm⎞⎟ ⎟⎠. (1)

In particular, (i) if is the identity mapping (i.e., the associated kernel is the linear kernel), or (ii) if satisfies that is monotone decreasing with respect to (e.g., the mapping defined by the Gaussian kernel) and , then can be replaced by . (iii) Otherwise, .

In order to prove the theorem 1, we show several definitions, lemmas and the proofs as following subsection.

### 3.2 Proof sketch

###### Definition 2

The set of mappings from an instance to a pattern For any , let be a mapping defined by

 θu(i):=argmaxj∈[Q]⟨u,Φ(x(j)i)⟩,

and we denote the set of all as .

###### Lemma 3

Suppose that for any , . Then, the empirical Gaussian complexity of with respect to for is bounded as follows:

 GS(H)≤RΛ√(√2−1)+2(ln|Θ|)√m.

The proof is given in the supplemental material.

Thus, it suffices to bound the size . Naively the size is at most since there are possible mappings from to . However, this naive bound is too pessimistic. The basic idea to get better bounds is the following. Fix any and consider points . Then, we define equivalence classes of such that is the same, which define a Voronoi diagram for the points . Note here that the closeness is measured by the inner product, not a distance. More precisely, let be the Voronoi diagram defined as . Let us consider the set of intersections for all combinations of . The key observation is that each non-empty intersection corresponds to a mapping . Thus, we obtain . In other words, the size of is exactly the number of rooms defined by the intersections of Voronoi diagrams . From now on, we will derive upper bounds based on this observation.

###### Lemma 4
 |Θ|=O((mQ)2d∗Φ,S).

The proof is shown in the supplemental material.

###### Theorem 2

1. If is the identity mapping over , then .

2. if satisfies that is monotone decreasing with respect to (e.g., the mapping defined by the Gaussian kernel) and , then .

3. Otherwise, .

The proof is shown in the supplemental material.

Now we are ready to prove Theorem 1.

Proof [Proof of Theorem 1] By using Lemma 1, and 2, we obtain the generalization bound in terms of the Gaussian complexity of . Then, by applying Lemma 3 and Theorem 2, we complete the proof.

## 4 Optimization problem formulation

In this section, we formulate an optimization problem to learn ensembles in for . The problem is a -norm constrained soft margin optimization problem using hypotheses in , which is a linear program.

 maxρ,w,ξ ρ−1νmm∑i=1ξi (2) sub.to ∫u∈Uyiwuhu(xi)du≥ρ−ξi, i∈[m],∫u∈Uwudu=1,w≥0, ρ∈R,

where is a given base classifiers, is a constant parameter, is a target margin and is a slack variable. The dual problem (2) is given as follows.

 minγ,d γ (3) sub.to m∑i=1yidihu(xi)≤γ,u∈U,  0≤di≤1/νm (i∈[m]),  m∑i=1di=1, γ∈R.

The dual problem is categorized as a semi-infinite program (SIP), since it contains possibly infinitely many constraints. Note that the duality gap is zero since the problem (3) is convex and the optimum is finite (Shapiro, 2009, Theorem 2.2). Our approach is to approximately solve the primal and the dual problems for by solving the sub-problems over a finite subset . Such approach is called column generation in linear programming, which add a new constraint (column) to the dual problems and solve them iteratively. LPBoost (Demiriz et al., 2002) is a well known example of the approach in the boosting setting. At each iteration, LPBoost chooses a hypothesis so that maximally violates the constraints in the current dual problem. This sub-problem is called weak learning in the boosting setting and formulated as follows:

 maxu∈Hm∑i=1yidimaxj∈[Q]⟨u,Φ(x(j)i)⟩sub.to∥u∥2H≤Λ2. (4)

However, the problem (4) cannot be solved directly, since we have only access to though the associated kernel. Fortunately, the optimal solution of the problem can be written as a linear combination of the functions because of the following representer theorem.

###### Theorem 3 (Representer Theorem)

The optimal solution of optimization problem (4) has the form of .

The proof is shown in the supplemental material.

By Theorem 3, we can design a weak learner by solving the following equivalent problem:

 minα−∑p:yp=+1^dpmaxj∈[Q]m∑i=1Q∑k=1αikK(x(k)i,x(j)p)+∑q:yq=−1^dqmaxj∈[Q]m∑i=1Q∑k=1αikK(x(k)i,x(j)q) (5) sub.tom∑i,kQ∑j,lαijαklK(x(j)i,x(l)k)≤Λ2.

## 5 Algorithm

The optimization problem (5) is difference of convex functions (DC) programming problem and we can obtain local optimum -approximately by using DC Algorithm (Tao and Souad, 1988). For the above optimization problem, we replace with , then we get the equivalent optimization problem as below.

 minα,λ −∑p:yp=+1^dpmaxj∈[Q]m∑i=1Q∑k=1αikK(x(k)i,x(j)p)+∑q:yq=−1^dqλq (6) sub.to m∑i=1Q∑k=1αikK(x(k)i,x(j)q)≤λq (j∈[Q],∀q:yq=−1), m∑i,k=1Q∑j,l=1αijαklK(x(j)i,x(l)k)≤Λ2.

We show the pseudo code of LPBoost using column generation algorithm, and our weak learning algorithm using DC programming in Algorithm 1 and 2, respectively. In Algorithm 2, the optimization problem (7) can be solved by standard QP solver.

It is interesting to note that, if we restrict to unit vector, we can get the optimal analytically in each weak learning step, and the final hypothesis obtained by LPBoost totally behaves like time-series shapelets.

## 6 Experiments

In the following experiments, we show that our methods are practically effective for time-series classification problem as one of the applications.

### 6.1 Classification accuracy for UCR datasets

We use UCR datasets (Chen et al., 2015), that are often used as benchmark datasets for time-series classification methods. For simplicity, we use several binary classification datasets (of course, our method is applicable to multi-class). The detailed information of the datasets is described in the left side of Table 1. In our experiments, we assume that patterns are subsequence of a time series of length .

We set the hyper-parameters as following: The length of pattern was searched in , where is the length of each time-series dataset, and . We found good and through a grid search via -fold cross validation. We use the Gaussian kernel . For each , we choose from , which maximizes the variance of the values of the kernel matrix. We set , and the number of maximum iterations of LPBoost is , and the number of maximum iterations of DC Programming is .

Unfortunately, the quadratical normalization constraints of the optimization problem (6) have highly computational costs. Thus, in practice, we replace the constraint with -norm regularization: and solve the linear program. We expect to obtain sparse solutions while that makes the decision space of small. As a LP solver for the optimization problem of the weak learner and LPBoost, we used the CPLEX software.

The results of classification accuracy are shown in the right side of Table 1. We referred to the experimental result reported by Hou et al. (2016) with regard to the accuracies of the following three state-of-the-arts algorithms, LTS (Grabocka et al., 2014), IG-SVM (Hills et al., 2014), and FLAG (Hou et al., 2016). It is reported that the above algorithms are highly accurate in practice. Particularly, LTS is known as one of the most accurate algorithm in practice, however, it highly and delicately depends on the input hyper-parameters (see, e.g., Wistuba et al., 2015). Many of other existing shapelet-based methods including them have difficulty of adjusting hyper-parameters. However, as shown in Table 1, our algorithm had competitive performance for used datasets with a little effort to parameter search.

### 6.2 Sparsity and visualization analysis

It is said that shapelets-based hypotheses have great visibility (see, e.g., Ye and Keogh, 2009). Now, we show that the solution obtained by our method has high sparsity and it induces visibility, despite our final hypothesis contains (non-linear) kernels. Let us explain the sparsity using Gun-point dataset as an example. Now, we focus on the final hypothesis: , where . The number of final hypotheses s such that is , and the number of non-zero elements of such s is out of ().

Actually, for the other datasets, percentages of non-zero elements are from to . It is clear that the solution obtained by our method has high sparsity.

Figure 1 is an example of visualization of a final hypothesis obtained by our method for Gun-point time series data. The colored lines shows all of the s in where both and are non-zero. Each value of the legends show the multiplication of and corresponding to . Since it is hard to visualize the local features over the Hilbert space, we plotted each of them to match the original time series based on Euclidean distance. In contrast to visualization analyses by time-series shapelets (see, e.g., Ye and Keogh, 2009), our visualization, colored lines and plotted position, do not strictly represent the meaning of the final hypothesis because of the non-linear feature mapping. However, we can say that colored lines represent “discriminative pattern” and certainly make important contributions to classification. Thus, we believe that our solutions are useful for domain experts to interpret important patterns.

## 7 Conclusion

In this paper, we show generalization error bounds of the hypothesis classes based on local features. Further, we formulate an optimization problem, which fits in the boosting framework. We design an efficient algorithm for the weak learner using DC programming techniques. Experimental results show that our method achieved competitive accuracy with the existing methods with small parameter-tuning costs.

## References

• Bartlett and Mendelson (2003) Peter L. Bartlett and Shahar Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, March 2003. ISSN 1532-4435.
• Chen et al. (2015) Yanping Chen, Eamonn Keogh, Bing Hu, Nurjahan Begum, Anthony Bagnall, Abdullah Mueen, and Gustavo Batista. The ucr time series classification archive, July 2015.
• Demiriz et al. (2002) A Demiriz, K P Bennett, and J Shawe-Taylor. Linear Programming Boosting via Column Generation. Machine Learning, 46(1-3):225–254, 2002.
• Grabocka et al. (2014) Josif Grabocka, Nicolas Schilling, Martin Wistuba, and Lars Schmidt-Thieme. Learning time-series shapelets. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’14, pages 392–401, New York, NY, USA, 2014. ACM. ISBN 978-1-4503-2956-9.
• Grabocka et al. (2015) Josif Grabocka, Martin Wistuba, and Lars Schmidt-Thieme. Scalable discovery of time-series shapelets. CoRR, abs/1503.03238, 2015.
• (6) Zellig Harris. Distributional structure. Word, 10(23):146–162, 1954.
• Hills et al. (2014) Jon Hills, Jason Lines, Edgaras Baranauskas, James Mapp, and Anthony Bagnall. Classification of time series by shapelet transformation. Data Mining and Knowledge Discovery, 28(4):851–881, July 2014. ISSN 1384-5810.
• Hou et al. (2016) Lu Hou, James T. Kwok, and Jacek M. Zurada. Efficient learning of timeseries shapelets. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA., pages 1209–1215, 2016.
• Karlsson et al. (2016) Isak Karlsson, Panagiotis Papapetrou, and Henrik Boström. Generalized random shapelet forests. Data Min. Knowl. Discov., 30(5):1053–1085, September 2016. ISSN 1384-5810.
• Keogh and Rakthanmanon (2013) Eamonn J. Keogh and Thanawin Rakthanmanon. Fast shapelets: A scalable algorithm for discovering time series shapelets. In Proceedings of the 13th SIAM International Conference on Data Mining, May 2-4, 2013. Austin, Texas, USA., pages 668–676, 2013.
• Mohri et al. (2012) Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. The MIT Press, 2012.
• Odone et al. (2005) Francesca Odone, Annalisa Barla, and Alessandro Verri. Building kernels from binary strings for image matching. IEEE Trans. Image Processing, 14(2):169–180, 2005.
• Renard et al. (2015) Xavier Renard, Maria Rifqi, Walid Erray, and Marcin Detyniecki. Random-shapelet: an algorithm for fast shapelet discovery. In 2015 IEEE International Conference on Data Science and Advanced Analytics (IEEE DSAA’2015), pages 1–10, Paris, France, October 2015. IEEE.
• Schapire et al. (1998) Robert E. Schapire, Yoav Freund, Peter Bartlett, and Wen Sun Lee. Boosting the margin: a new explanation for the effectiveness of voting methods. The Annals of Statistics, 26(5):1651–1686, 1998.
• Schölkopf and Smola (2002) B. Schölkopf and AJ. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. Adaptive Computation and Machine Learning. MIT Press, Cambridge, MA, USA, December 2002.
• Shapiro (2009) Alexander Shapiro. Semi-infinite programming, duality, discretization and optimality conditions. Optimization, 58(2):133–161, 2009.
• Shimodaira et al. (2001) Hiroshi Shimodaira, Ken-ichi Noma, Mitsuru Nakai, and Shigeki Sagayama. Dynamic time-alignment kernel in support vector machine. In Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic, NIPS’01, pages 921–928, Cambridge, MA, USA, 2001. MIT Press.
• Tao and Souad (1988) Pham Dinh Tao and El Bernoussi Souad. Duality in D.C. (Difference of Convex functions) Optimization. Subgradient Methods, pages 277–293. Birkhäuser Basel, Basel, 1988. ISBN 978-3-0348-9297-1.
• Wistuba et al. (2015) Martin Wistuba, Josif Grabocka, and Lars Schmidt-Thieme. Ultra-fast shapelets for time series classification. CoRR, abs/1503.05018, 2015.
• Ye and Keogh (2009) Lexiang Ye and Eamonn Keogh. Time series shapelets: A new primitive for data mining. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’09, pages 947–956, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-495-9.
• Zhang et al. (2010) D. Zhang, W. Zuo, D. Zhang, and H. Zhang. Time series classification using support vector machine with gaussian elastic metric kernel. In 2010 20th International Conference on Pattern Recognition, pages 29–32, Aug 2010. doi: 10.1109/ICPR.2010.16.

## Appendix A Supplemental material

### a.1 Proof of Lemma 3

Proof  Since can be partitioned into ,

 GS(HU) =1mEσ[supθ∈Θsupu∈U:θu=θ⟨u,(m∑i=1σiΦ(x(θ(i))i))⟩] ≤1mEσ[supθ∈Θsupu∈U⟨u,(m∑i=1σiΦ(x(θ(i))i))⟩]≤ΛmEσ[supθ∈Θ∥∥ ∥∥m∑i=1σix(θ(i))i∥∥ ∥∥H] =ΛmEσ⎡⎢⎣supθ∈Θ ⎷∥∥ ∥∥m∑i=1σiΦ(x(θ(i))i)∥∥ ∥∥2H⎤⎥⎦=ΛmEσ⎡⎢⎣ ⎷supθ∈Θ∥∥ ∥∥m∑i=1σiΦ(x(θ(i))i)∥∥ ∥∥2H⎤⎥⎦ ≤Λm  ⎷Eσ⎡⎣supθ∈Θ∥∥ ∥∥m∑i=1σiΦ(x(θ(i))i)∥∥ ∥∥2H⎤⎦. (8)

The first inequality is derived from the relaxation of , the second inequality is due to Cauchy-Schwarz inequality and the fact , and the last inequality is due to Jensen’s inequality. We denote by the kernel matrix such that . Then, we have

 Eσ⎡⎣supθ∈Θ∥∥ ∥∥m∑i=1σiΦ(x(θ(i))i)∥∥ ∥∥2H⎤⎦=Eσ[supθ∈Θm∑i,j=1σiσjK(θ)ij]. (9)

We now derive an upper bound of the r.h.s. as follows.

For any ,

 exp(cEσ[supθ∈Θm∑i,j=1σiσjK(θ)ij])≤Eσ[exp(csupθ∈Θm∑i,j=1σiσjK(θ)ij)] = Eσ[supθ∈Θexp(cm∑i,j=1σiσjK(θ)ij)]≤∑θ∈ΘEσ[exp(cm∑i,j=1σiσjK(θ)ij)]

The first inequality is due to Jensen’s inequality, and the second inequality is due to the fact that the supremum is bounded by the sum. By using the symmetry property of , we have , which is rewritten as

where are the eigenvalues of and is the orthonormal matrix such that is the eigenvector that corresponds to the eigenvalue . By the reproductive property of Gaussian distribution, obeys the same Gaussian distribution as well. So,

 ∑θ∈ΘEσ[exp(cm∑i,j=1σiσjK(θ)ij)]=∑θ∈ΘEσ[exp(cσ⊤K(θ)σ)] = ∑θ∈ΘEσ[exp(cm∑k=1λk(v⊤kσ)2)]=∑θ∈ΘΠmk=1Eσk[exp(cλkσ2k)]  (replace σ=v⊤kσ) = ∑θ∈ΘΠmk=1(∫∞−∞exp(cλkσ2)exp(−σ)2√2πdσ)=∑θ∈ΘΠmk=1(∫∞−∞exp(−(1−cλk)σ2)√2πdσ).

Now we replace by . Since , we have:

 ∫∞−∞exp(−(1−cλk)σ2)√2πdσ=1√2π∫∞−∞exp(−σ′2)√1−cλkdσ′=1√1−cλk.

Now, by letting and applying the inequality that for , the bound becomes

 exp(cEσ[supθ∈Θm∑i,j=1σiσjK(θ)ij])≤∑θ∈ΘΠmk=1(1+2(√2−1)cλk) (10)

because it holds that in . Further, taking logarithm, dividing the both sides by and applying , we get:

 Eσ[supθ∈Θm∑i,j=1σiσjK(θ)ij] ≤(√2−1)m∑k=1λk+2λ1ln|Θ| (11) =(√2−1)tr(K)+2λ1ln|Θ| (12) ≤(√2−1)mR2+2mR2ln|Θ|, (13)

where the last inequality holds since . By equation (A.1) and (11), we have:

 GS(H)≤Λm ⎷Eσ[supθ∈Θm∑i,j=1σiσjK(θ)ij]≤ΛR√(√2−1)+2ln|Θ|√m. (14)

### a.2 Proof of Lemma 4

Proof  We will reduce the problem of counting intersections of the Voronoi diagrams to that of counting possible labelings by hyperplanes for some set. Note that for each neighboring Voronoi regions, the border is a part of hyperplane since the closeness is defined in terms of the inner product. So, by simply extending each border to a hyperplane, we obtain intersections of halfspaces defined by the extended hyperplanes. Note that, the size of these intersections gives an upper bound of intersections of the Voronoi diagrams. More precisely, we draw hyperplanes for each pair of points in so that each point on the hyperplane has the same inner product between two points. Note that for each pair , the normal vector of the hyperplane is given as (by fixing the sign arbitrary). So, the set of hyperplanes obtained by this procedure is exactly . The size of is , which is at most . Now, we consider a “dual” space by viewing each hyperplane as a point and each point in as a hyperplane. Note that points (hyperplanes in the dual) in an intersection give the same labeling on the points in the dual domain. Therefore, the number of intersections in the original domain is the same as the number of the possible labelings on by hyperplanes in . By the classical Sauer’s Lemma and the VC dimension of hyperplanes (see, e.g., Theorem 5.5 in Schölkopf and Smola (2002)), the size is at most .

### a.3 Proof of Theorem 2

Proof  (i) In this case, the Hilbert space is contained in . Then, by the fact that VC dimension is at most and Lemma 4, the statement holds.

(ii) f is monotone decreasing for , then the following holds:

Therefore, , where . It indicates that the number of Voronoi cells made by corresponds to the . Then, by following the same argument for the linear kernel case, we get the same statement.

(iii) We follow the argument in Lemma 4. For the set of classifiers , its VC dimension is known to be at most for (see, e.g., Schölkopf and Smola (2002)). By the definition of , for each intersections given by hyperplanes, there always exists a point whose inner product between each hyperplane is at least . Therefore, the size of the intersections is bounded by the number of possible labelings in the dual space by . Thus we obtain that