
Master’s Degree in Computer Science

[CM9] Computer Science - D.M. 270/2004

Final Thesis

Efficient Tensor Kernel methods for sparse regression

Supervisors
Ch. Prof. Marcello Pelillo
Ch. Prof. Massimiliano Pontil
Assistant Supervisor
Dott. Saverio Salzo
Matriculation Number 854342
Academic Year 2019/2020

Abstract

Recently, classical kernel methods have been extended by the introduction of suitable tensor kernels so as to promote sparsity in the solution of the underlying regression problem. Indeed, they solve an $\ell_p$-norm regularization problem, with $p$ close to $1$ and its conjugate exponent $q=p/(p-1)$ an even integer, which happens to be close to a lasso problem. However, a major drawback of the method is that storing tensors generally requires a considerable amount of memory, ultimately limiting its applicability. In this work we address this problem by proposing two advances. First, we directly reduce the memory requirement by introducing a new, more efficient layout for storing the data. Second, we use a Nyström-type subsampling approach, which allows for a training phase with a smaller number of data points, so as to reduce the computational cost. Experiments on both synthetic and real datasets show the effectiveness of the proposed improvements. Finally, we take care of implementing the code in C++ so as to further speed up the computation.

Keywords:

Machine Learning, Tensor Kernels, Regularization, Optimization

## 1 Introduction

Kernel methods are widely used in different machine learning applications. They are usually formulated as an empirical risk minimization problem, enriched with a regularization term lying in an $\ell^2$ space. This is because the inner product structure is necessary for the theory behind common kernel functions (in order to build the corresponding Reproducing Kernel Hilbert Space). But this restriction does not allow other regularization schemes to be considered, such as the sparsity-promoting one achievable by making use of the $\ell^1$ norm. Specific kernel functions built on such schemes (with an associated Reproducing Kernel Banach Space) are particularly restrictive and computationally unfeasible. However, it was proved that choosing $p$ arbitrarily close to $1$ can be seen as a proxy for the $\ell^1$ case, providing relatively similar sparsity properties. Recent findings have proposed tensorial kernel functions to tackle the problem of regularization with $p$ arbitrarily close to 1. This method is taken into consideration in this thesis.

Tensor kernel functions are employed by utilising a tensorial structure storing the values of the kernel. Problems arise in storing such structures, since they are particularly memory demanding. The purpose of this work is to propose improvements, not only reducing the memory usage by introducing a novel memory layout, but also improving the overall execution time for solving the optimization problem by making use of a Nyström-type strategy.

This thesis is organized as follows:
In Sec. 2 we introduce the preliminaries of this field: we begin by providing the basics of functional analysis, from the definition of a vector space to that of Banach and Hilbert spaces. We then introduce kernel functions, together with some of their properties and the related relevant theory, such as Reproducing Kernel Hilbert Spaces, the Riesz representation theorem and more.
In Sec. 3 we provide fundamental concepts about classical kernel methods, including an introduction to statistical learning theory (discussing loss functions and empirical risk minimization), followed by a discussion of regularization, with examples such as the Ridge and Lasso regression methods. We conclude the chapter by introducing notable examples of kernel methods, such as kernel ridge regression and the support vector machine.
Tensor kernel functions are introduced in Sec. 4 by analogy with the kernel methods of the previous chapter, in order to better clarify the theory behind them. Some examples of tensor kernel functions are provided at the end of the chapter.
The last two chapters are dedicated to experiments. In particular, in Sec. 5 we discuss the proposed data layout and the improvements it makes possible; we carry out experiments on memory gain and execution times, on both real-world and synthetic datasets. In Sec. 6 the second improvement is considered, namely the Nyström-like strategy, by means of experiments carried out at scales otherwise unfeasible. Also in this case, both real-world and synthetic datasets are utilised, with an emphasis on analysing the feature selection capability of the algorithm.

## 2 Basic Concepts - Preliminaries

In this chapter we first review (in Subsec. 2.1) some basic concepts regarding vector spaces and how one moves to Banach and Hilbert spaces, along with the connection of the latter to $\ell^p$ spaces. Next comes the presentation of kernel functions, some of their properties, and Reproducing Kernel Hilbert Spaces (in Subsec. 2.2).

### 2.1 Functional Analysis

{mydef}

[Vector Space] A vector space over $\mathbb{R}$ is a set $V$ endowed with two binary operations $+\colon V\times V\to V$ and $\cdot\colon\mathbb{R}\times V\to V$ such that, for all $u,v,w\in V$ and all $\alpha,\beta\in\mathbb{R}$:

• Associative law: $(u+v)+w=u+(v+w)$

• Commutative law: $u+v=v+u$

• Identity vector: there exists $0\in V$ such that $v+0=v$

• Existence of the inverse: for every $v\in V$ there exists $-v\in V$ such that $v+(-v)=0$

• Identity laws: $1\cdot v=v$

• Distributive laws: $\alpha\cdot(u+v)=\alpha\cdot u+\alpha\cdot v$ and $(\alpha+\beta)\cdot v=\alpha\cdot v+\beta\cdot v$

Examples of vector spaces are:

• $\mathbb{R}^n$ with componentwise operations

• the set of continuous real-valued functions on a metric space

{mydef}

[Norm] A norm on a vector space $V$ over $\mathbb{R}$ is a function $\|\cdot\|\colon V\to\mathbb{R}$ with the following properties, for all $u,v\in V$ and $\alpha\in\mathbb{R}$:

• Non-negative: $\|v\|\ge 0$

• Strictly positive: $\|v\|=0\implies v=0$

• Homogeneous: $\|\alpha v\|=|\alpha|\,\|v\|$

• Triangle inequality: $\|u+v\|\le\|u\|+\|v\|$

A vector space endowed with a norm is called a normed vector space.

{mydef}

Let $(V,\|\cdot\|)$ be a normed vector space. A Cauchy sequence is a sequence $(u_n)_{n\in\mathbb{N}}$ in $V$ such that: $\forall\,\varepsilon>0\ \exists\,N\in\mathbb{N}$ such that $\|u_n-u_m\|<\varepsilon$ for all $n,m\ge N$.
A sequence $(u_n)_{n\in\mathbb{N}}$ is said to be convergent in $V$ if there is a point $u\in V$ such that $\|u_n-u\|\to 0$ as $n\to+\infty$. In that case one writes $u_n\to u$.

{mydef}

A complete vector space $(V,\|\cdot\|)$ is a vector space equipped with a norm and complete with respect to it, i.e., for every Cauchy sequence $(u_n)_{n\in\mathbb{N}}$ in $V$ there exists an element $u\in V$ such that $u_n\to u$.

{mydef}

A Banach space is a complete normed vector space.

{mydef}

A function $T\colon V\to W$, with $V$ and $W$ normed vector spaces over $\mathbb{R}$, is a bounded linear operator if:

\[ T(\alpha_1 u_1+\alpha_2 u_2)=\alpha_1 T(u_1)+\alpha_2 T(u_2), \quad \forall\,\alpha_1,\alpha_2\in\mathbb{R},\ \forall\,u_1,u_2\in V \]

and there exists $c>0$ such that

\[ \|Tu\|_W\le c\,\|u\|_V, \quad \forall\,u\in V. \]

In such case the norm of $T$ is

\[ \|T\|=\sup_{\|u\|_V\le 1}\|Tu\|_W. \tag{2.1} \]

Bounded linear operators which are defined from $V$ to $\mathbb{R}$ are called bounded functionals on $V$. The space of the bounded functionals on $V$ is called the dual space of $V$ and denoted by $V^*$, that is,

\[ V^*=\{\varphi\colon V\to\mathbb{R} \,:\, \varphi \text{ is a bounded linear functional}\}, \tag{2.2} \]

endowed with the norm $\|\varphi\|=\sup_{\|u\|_V\le 1}|\varphi(u)|$. Finally, the canonical pairing between $V$ and $V^*$ is the mapping

\[ \langle\cdot,\cdot\rangle\colon V\times V^*\to\mathbb{R}, \qquad \langle u,\varphi\rangle=\varphi(u). \tag{2.3} \]
{mydef}

[$\ell^p$ space] For $p\in[1,+\infty)$, the space $\ell^p$ of sequences is defined as

\[ \ell^p=\Big\{\{x_i\}_{i=0}^{\infty} \,:\, \sum_{i=0}^{\infty}|x_i|^p<\infty\Big\} \tag{2.4} \]
{mydef}

[Norm on $\ell^p$] Given an $\ell^p$ space, we define the norm on $\ell^p$ by

\[ \big\|\{x_i\}_{i=0}^{\infty}\big\|_p=\Big(\sum_{i=0}^{\infty}|x_i|^p\Big)^{1/p} \tag{2.5} \]
{mydef}

[Inner product] An inner/dot/scalar product on a vector space $V$ over $\mathbb{R}$ is a map $\langle\cdot,\cdot\rangle\colon V\times V\to\mathbb{R}$ satisfying the following, for all $u,v,w\in V$ and $\alpha\in\mathbb{R}$:

• Symmetry: $\langle u,v\rangle=\langle v,u\rangle$

• Linearity w.r.t. first term: $\langle u+w,v\rangle=\langle u,v\rangle+\langle w,v\rangle$

• Linearity w.r.t. second term: $\langle u,v+w\rangle=\langle u,v\rangle+\langle u,w\rangle$

• Associative: $\langle\alpha u,v\rangle=\alpha\langle u,v\rangle$

• Positive definite: $\langle v,v\rangle\ge 0$, with $\langle v,v\rangle=0$ if and only if $v=0$

The norm associated to the scalar product is defined as follows: $\|v\|=\sqrt{\langle v,v\rangle}$.

{mydef}

A pre-Hilbert space is a vector space $H$ endowed with a scalar product. If the norm associated to the scalar product defines a complete normed space, then $H$ is called a Hilbert space.

{mydef}

[$\ell^2$ space] The space $\ell^2$ is defined as

\[ \ell^2=\Big\{\{x_n\}_{n=0}^{\infty} \,:\, \sum_{n=0}^{\infty}|x_n|^2<\infty\Big\} \tag{2.6} \]

endowed with the scalar product

\[ \langle x,y\rangle=\sum_{n=0}^{+\infty}x_n y_n. \tag{2.7} \]

This space is a Hilbert space.

{myprop}

Let $H$ be a (separable) Hilbert space. Then an orthonormal basis of $H$ is a sequence $(a_n)_{n\in\mathbb{N}}$ in $H$ such that $\|a_n\|=1$ and $\langle a_n,a_m\rangle=0$ for $n\neq m$, for every $n,m\in\mathbb{N}$. In such case, for every $u\in H$, we have

\[ \sum_{n=0}^{+\infty}|\langle u,a_n\rangle|^2<+\infty \quad\text{and}\quad u=\sum_{n=0}^{+\infty}\langle u,a_n\rangle a_n. \tag{2.8} \]

Moreover, for every $u,v\in H$, $\langle u,v\rangle=\sum_{n=0}^{+\infty}\langle u,a_n\rangle\langle v,a_n\rangle$ and $\|u\|^2=\sum_{n=0}^{+\infty}|\langle u,a_n\rangle|^2$. This establishes an isomorphism between $H$ and $\ell^2$.

### 2.2 Reproducing Kernel Hilbert Spaces

In this section we introduce a fundamental aspect of this work, that is, kernel functions and Reproducing Kernel Hilbert Spaces. These concepts are the starting point upon which the definition of tensor kernels arises.
Definitions are kept in a general form in order to give a wider view of the field, and the main properties are reported. More in-depth discussions can be found in (steinwart2008).

{mydef}

[Kernel] Let $X$ be a non-empty set. A kernel is a function $k\colon X\times X\to\mathbb{R}$ such that there exists a Hilbert space $H$ and a mapping $\Phi\colon X\to H$ s.t. for all $x,x'\in X$

\[ k(x,x')=\langle\Phi(x),\Phi(x')\rangle \tag{2.9} \]

The mapping $\Phi$ is called a feature map and $H$ a feature space of the function $k$. There are no conditions on $X$ other than being a non-empty set; that is, it can be a set of discrete objects, such as documents, strings, nodes of a graph or entire graphs. This means that we do not require an inner product to be defined for elements of $X$.

Suppose that $H$ is a separable Hilbert space. Then there exists an isomorphism $T\colon H\to\ell^2$, meaning that $T$ is a bounded linear and bijective operator and $\|Tu\|_{\ell^2}=\|u\|_H$ for all $u\in H$. Therefore,

\[ k(x,x')=\langle\Phi(x),\Phi(x')\rangle_H=\langle T\Phi(x),T\Phi(x')\rangle_{\ell^2}. \tag{2.10} \]

This shows that one can always choose a feature map with values in $\ell^2$.

To build a kernel function from scratch, as the subsequent theorem shows, we first need the definitions of positive definiteness and symmetry.

{mydef}

[Positive definite] A function $k\colon X\times X\to\mathbb{R}$ is said to be positive definite if, for all $n\in\mathbb{N}$, all $\alpha_1,\dots,\alpha_n\in\mathbb{R}$ and all $x_1,\dots,x_n\in X$,

\[ \sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j k(x_i,x_j)\ge 0 \tag{2.11} \]

Let $K\in\mathbb{R}^{n\times n}$ be the matrix with entries $K_{ij}=k(x_i,x_j)$, for $i,j=1,\dots,n$. Such a matrix is referred to as the Gram matrix of $k$ with respect to $x_1,\dots,x_n$. Similarly to the above definition, we say that a symmetric matrix $K$ is positive semi-definite if

\[ \alpha^\top K\alpha=\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j K_{ij}\ge 0 \]

holds for all $\alpha\in\mathbb{R}^n$. This definition is also referred to as the energy-based definition, and it is not always straightforward to verify directly. A more immediate characterization of a positive definite matrix is that all its eigenvalues are positive. There are other tests one can carry out to prove positive definiteness for a generic symmetric matrix $K$:

• Check that all the eigenvalues of $K$ are positive.

• Check that all the pivots of $K$ (in Gaussian elimination) are positive.

• Check that all the upper-left determinants (leading principal minors) of $K$ are positive.

• Exhibit a decomposition $K=A^\top A$, with a rectangular matrix $A$ having all independent columns.

A function $k\colon X\times X\to\mathbb{R}$ is called symmetric if $k(x,x')=k(x',x)$ for all $x,x'\in X$.

Kernel functions are symmetric and positive definite.
Let $k$ be a kernel function with $\Phi$ the associated feature map. Since the inner product in $H$ is symmetric, $k$ is symmetric.
Furthermore, for $\alpha_1,\dots,\alpha_n\in\mathbb{R}$ and $x_1,\dots,x_n\in X$:

\[ \sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j k(x_i,x_j)=\Big\langle\sum_{i=1}^{n}\alpha_i\Phi(x_i),\sum_{j=1}^{n}\alpha_j\Phi(x_j)\Big\rangle=\Big\|\sum_{i=1}^{n}\alpha_i\Phi(x_i)\Big\|^2\ge 0, \]

which shows that $k$ is also positive definite.

The following theorem proves that being symmetric and positive definite are necessary and sufficient conditions for a function to be a kernel.

{mytheorem}

[Symmetric, positive definite functions are kernels]
A function $k\colon X\times X\to\mathbb{R}$ is a kernel function if and only if it satisfies the properties of being symmetric and positive definite.

###### Proof.

To prove the theorem we first consider a pre-Hilbert space of functions, on which we define a proper inner product satisfying the aforementioned properties. Then, by defining a proper feature map into a proper feature space (a Hilbert space), we arrive at the definition of a kernel function as presented in Definition 2.2.
Consider

\[ H_{pre}:=\Big\{\sum_{i=1}^{n}\alpha_i k(\cdot,x_i) \;\Big|\; n\in\mathbb{N},\ \alpha_i\in\mathbb{R},\ x_i\in X,\ i=1,\dots,n\Big\} \]

and taking two elements

\[ f:=\sum_{i=1}^{n}\alpha_i k(\cdot,x_i)\in H_{pre}, \qquad g:=\sum_{j=1}^{m}\beta_j k(\cdot,x'_j)\in H_{pre}, \]

we define the following:

\[ \langle f,g\rangle_{H_{pre}}:=\sum_{i=1}^{n}\sum_{j=1}^{m}\alpha_i\beta_j k(x_i,x'_j) \]

We note that it is bilinear and symmetric, and that we can write it independently from the representation of $f$ or $g$:

\[ \langle f,g\rangle_{H_{pre}}=\sum_{j=1}^{m}\beta_j f(x'_j), \qquad \langle f,g\rangle_{H_{pre}}=\sum_{i=1}^{n}\alpha_i g(x_i). \]

Since $k$ is positive definite, $\langle f,f\rangle_{H_{pre}}$ is also positive, that is $\langle f,f\rangle_{H_{pre}}\ge 0$ for all $f\in H_{pre}$. Moreover, the form satisfies the Cauchy–Schwarz inequality:

\[ |\langle f,g\rangle_{H_{pre}}|^2\le\langle f,f\rangle_{H_{pre}}\cdot\langle g,g\rangle_{H_{pre}}, \qquad f,g\in H_{pre}. \]

This is important in order to prove that $\langle\cdot,\cdot\rangle_{H_{pre}}$ is an inner product for $H_{pre}$. In particular, to prove the last property of Definition 2.1 (inner product), supposing $\langle f,f\rangle_{H_{pre}}=0$, we write:

\[ |f(x)|^2=\Big|\sum_{i=1}^{n}\alpha_i k(x,x_i)\Big|^2=|\langle f,k(\cdot,x)\rangle_{H_{pre}}|^2\le\langle k(\cdot,x),k(\cdot,x)\rangle_{H_{pre}}\cdot\langle f,f\rangle_{H_{pre}}=0, \]

hence we find $f=0$. So $\langle\cdot,\cdot\rangle_{H_{pre}}$ is an inner product for $H_{pre}$.

Let $H$ be a completion of $H_{pre}$ and $I\colon H_{pre}\to H$ be the isometric embedding. Then $H$ is a Hilbert space and we have

\[ \Phi(x)=I k(\cdot,x) \]

and

\[ \langle I k(\cdot,x),I k(\cdot,x')\rangle_{H}=\langle k(\cdot,x),k(\cdot,x')\rangle_{H_{pre}}=k(x,x') \qquad \forall\,x,x'\in X, \]

which is the definition of a kernel, and $\Phi$ is the feature map of $k$. ∎

Now we are going to introduce the concept of a Reproducing Kernel Hilbert Space (RKHS), followed by some interesting results.
To get to the definition of RKHS, we first have a look at the definition of an evaluation functional.

{mydef}

Let $H$ be a Hilbert space of functions from $X$ to $\mathbb{R}$. A Dirac evaluation functional at $x\in X$ is a functional

\[ \delta_x\colon H\to\mathbb{R} \quad\text{s.t.}\quad \delta_x(f)=f(x) \quad \forall\,f\in H \]

This functional simply evaluates the function $f$ at the point $x$.

{mydef}

A Reproducing Kernel Hilbert Space (RKHS) is a Hilbert space $H$ of functions from $X$ to $\mathbb{R}$ in which all the Dirac evaluation functionals $\delta_x$, $x\in X$, are bounded and continuous.

Being continuous means that:

\[ \forall\,f\in H\colon\quad |\delta_x(f)|\le c_x\,\|f\|_H, \quad\text{for some } c_x>0 \tag{2.12} \]

In other words, this statement points out that norm convergence implies pointwise convergence.

Definition 2.2 is compact and makes use of evaluation functionals, but there are some fundamental properties residing behind it. In particular, we are going to see what a reproducing kernel is, which, as the name suggests, is the building block of RKHSs. Indeed, an alternative definition based on reproducing kernels can be derived.

{mydef}

[Reproducing Kernel] Let $H$ be a Hilbert space of functions from $X$ to $\mathbb{R}$, with $X\neq\emptyset$. A function $k\colon X\times X\to\mathbb{R}$ is called a reproducing kernel of $H$ if the following hold:

• $k(\cdot,x)\in H$ for all $x\in X$

• (Reproducing property) $f(x)=\langle f,k(\cdot,x)\rangle_H$ for all $f\in H$ and $x\in X$

The following lemma says that a Hilbert space that has a reproducing kernel is an RKHS.

{mylemma}

[Reproducing kernels are kernels] Let $H$ be a Hilbert space of functions over $X$ that has a reproducing kernel $k$. Then $H$ is an RKHS and $H$ is also a feature space of $k$, where the feature map $\Phi\colon X\to H$ is given by

\[ \Phi(x)=k(\cdot,x), \qquad x\in X. \]

We call $\Phi$ the canonical feature map.
