# Gaussian Processes indexed on the symmetric group: prediction and learning

## Abstract

In the framework of supervised learning of a real function defined on an input space, the so-called Kriging method relies on a real Gaussian field defined on that space. The Euclidean case is well known and has been widely studied. In this paper, we explore the less classical case where the input space is the non-commutative finite group of permutations. In this setting, we propose and study a harmonic analysis of the covariance operators that makes it possible to consider Gaussian process models and forecasting issues. Our theory is motivated by statistical ranking problems.

Keywords: Gaussian processes, ranks, kernel methods.

## 1 Introduction

The problem of ranking a set of items is a fundamental task in today’s data-driven world. The analysis of observations which are not quantitative variables but rankings has often been studied in social sciences. Nowadays, it has become very popular in statistical learning, mainly because of the widespread use of automatic recommendation systems. Rankings are labels that model an order over a finite set of items. Hence, an observation is a set of preferences between these items, that is, a one-to-one map from the set of items onto itself. In other words, an observation lies in the finite symmetric group of all permutations of this set.

In this paper, our aim is to predict a function defined on the permutation group, and for this we use the framework of Gaussian processes indexed on this set. Gaussian process models rely on the definition of a covariance function that characterizes the correlations between values of the process at different observation points. As the notion of similarity between data points is crucial, i.e. close inputs are likely to have similar target values, covariance functions are the key ingredient in using Gaussian processes for prediction. Indeed, the covariance operator contains nearness or similarity information. In order to obtain a satisfying model, one needs to choose a covariance function (i.e. a positive definite kernel) that respects the structure of the index space of the dataset.

A large number of applications have given rise to recent research on rankings, including “ranking aggregation” ([KCS17]), clustering of rankings (see [CGJ11]) and kernels on rankings for supervised learning. Constructing kernels over the set of permutations has been tackled in several ways. In [Kon08], Kondor provides results about kernels on non-commutative finite groups and constructs “diffusion kernels” (which are positive definite) on the permutation group. These diffusion kernels are based on a discrete notion of neighborliness. We remark that the kernels considered therein are very different from those considered here. Furthermore, the diffusion kernels are not in general covariance functions because of their tricky dependency on permutations. The paper [KB10] deals with reducing the complexity of the kernel computation for partial rankings. Recently, [JV17] proved that the Kendall and Mallows kernels are positive definite. [MRW16] extended this study by characterizing both the feature spaces and the spectral properties associated with these two kernels.

The goal of this paper is twofold. First, we define Gaussian processes indexed by permutations by providing a class of covariance kernels. They generalize previous results on the Mallows kernel (see [JV17]). Second, we study the asymptotic properties of the maximum likelihood estimator of the parameters of the covariance function and the properties of the prediction of the associated Gaussian process. We prove the asymptotic accuracy of the Kriging prediction under the estimated covariance parameters. We also provide simulations that illustrate the performance of the studied kernels.

The paper is organized as follows. In Section 2 we recall generalities on the set of permutations and provide some covariance kernels. Asymptotic results on the estimation of the covariance function are presented in Section 3. Section 4 is devoted to numerical illustrations. Finally, Section 5 deals with the special case of partial rankings. The proofs are postponed to the appendix.

## 2 Covariance model for rankings

We will use the following notations. Let $S_n$ be the set of permutations on $\{1,\dots,n\}$. In order to define a Gaussian process and to provide asymptotic results, we require the process to be defined over an infinite set. For this, we will consider the space $S_\infty$ of permutations that act as an element of some $S_n$ on the first $n$ integers and as the identity on the others. This corresponds to a set of observations where rankings are given on the first $n$ elements while leaving the others invariant.

This framework can be seen as a model to simulate long processes where it is possible to change the order of the tasks, leading to several outcomes. For example in process mining, consider that we have to collect signatures from a large number of people (we assume that there is a countably infinite number of them) to process an administrative document. There is a predefined sequential order for the document signatures, resulting in an overall processing time. We index each person by his or her position in this predefined order. Let us now consider the processing time required when the order of signatures is given by a permutation $\sigma$, and assume that, as a function of $\sigma$, it is a realization of a Gaussian process with zero mean and a given covariance function. Our aim is to predict this time for new permutations, for instance with the aim of finding the order resulting in the shortest processing time for the document. Another example is given by a collection of machines in a supply line that need to be tuned in order to optimize the production of a good. The machines can be tuned in different orders, each corresponding to a permutation. The objective of the model is thus to forecast the outcome of a specific order for the machines.

Recall the space $S_\infty$ introduced above. Furthermore, let $S_{\mathbb{N}}$ be the set of permutations on the integers. As we will consider increasing domains, if $n\le m$ and if $\sigma\in S_n$, we can consider $\sigma$ to be in $S_m$ with $\sigma(i)=i$ for all $n<i\le m$. With this simplification, we can write $S_\infty=\bigcup_{n} S_n$. Several distances can be considered on $S_n$. We will focus here on the three following distances (see [Dia88]). For any permutations $\pi$ and $\sigma$ of $S_n$, let

• The Kendall’s tau distance be defined by

$$d_\tau(\pi,\sigma) := \sum_{i<j}\left(\mathbb{1}_{\sigma(i)>\sigma(j),\,\pi(i)<\pi(j)} + \mathbb{1}_{\sigma(i)<\sigma(j),\,\pi(i)>\pi(j)}\right), \tag{2.1}$$

that is, it counts the number of pairs on which the permutations disagree in ranking.

• The Hamming distance be defined by

$$d_H(\pi,\sigma) := \sum_{i}\mathbb{1}_{\pi(i)\neq\sigma(i)}. \tag{2.2}$$

• The Spearman’s footrule distance be defined by

$$d_S(\pi,\sigma) := \sum_{i}|\pi(i)-\sigma(i)|. \tag{2.3}$$

These three distances are right-invariant. That is, for all $\pi,\sigma,\tau\in S_n$, $d(\pi,\sigma)=d(\pi\tau,\sigma\tau)$. Other right-invariant distances are discussed in [Dia88]. We extend these distances naturally to $S_\infty$ and obtain a countably infinite discrete space. We then extend these distances to $S_{\mathbb{N}}$, taking infinite sums of non-negative numbers and allowing the distances to be equal to $+\infty$. For example, the Kendall’s tau distance is extended to

$$d_\tau(\pi,\sigma) = \sum_{i,j\in\mathbb{N},\ i<j}\left(\mathbb{1}_{\sigma(i)>\sigma(j),\,\pi(i)<\pi(j)} + \mathbb{1}_{\sigma(i)<\sigma(j),\,\pi(i)>\pi(j)}\right). \tag{2.4}$$

We observe that is still a group with the composition.
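The three distances (2.1), (2.2) and (2.3) can be illustrated on a small symmetric group. The following is a minimal sketch, assuming that a permutation is stored as a 0-based tuple `pi` with `pi[i]` the image of `i`; the function names are hypothetical and only serve the illustration. It also checks right-invariance by brute force on $S_4$.

```python
from itertools import combinations, permutations


def kendall_tau(pi, sigma):
    """Kendall's tau distance (2.1): number of pairs i < j ranked in opposite order."""
    return sum(
        1
        for i, j in combinations(range(len(pi)), 2)
        if (pi[i] - pi[j]) * (sigma[i] - sigma[j]) < 0
    )


def hamming(pi, sigma):
    """Hamming distance (2.2): number of indices where the two permutations differ."""
    return sum(1 for p, s in zip(pi, sigma) if p != s)


def spearman_footrule(pi, sigma):
    """Spearman's footrule distance (2.3): sum of |pi(i) - sigma(i)|."""
    return sum(abs(p - s) for p, s in zip(pi, sigma))


def compose(pi, tau):
    """Composition pi o tau, i.e. the permutation i -> pi(tau(i))."""
    return tuple(pi[t] for t in tau)


def is_right_invariant(dist, n):
    """Check d(pi, sigma) == d(pi o tau, sigma o tau) for all pi, sigma, tau in S_n."""
    perms = list(permutations(range(n)))
    return all(
        dist(pi, sigma) == dist(compose(pi, tau), compose(sigma, tau))
        for pi in perms
        for sigma in perms
        for tau in perms
    )


# Brute-force check on S_4 (24^3 triples per distance).
for dist in (kendall_tau, hamming, spearman_footrule):
    print(dist.__name__, is_right_invariant(dist, 4))
```

The check is exhaustive only for the chosen $n$; it illustrates the property and does not replace the argument in [Dia88].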

We aim to define a Gaussian process indexed by permutations. Let us recall that the law of a Gaussian random process $(X_x)_{x\in E}$ indexed by a set $E$ is entirely characterized by its mean and covariance functions

$$M: x\mapsto \mathbb{E}(X_x)$$

and

$$K: (x,y)\mapsto \mathrm{Cov}(X_x,X_y).$$

Hence we only have to build a covariance function on the set of permutations.
We recall the definition of a positive definite kernel on a space $E$. A symmetric map $K: E\times E\to\mathbb{R}$ is called a “positive definite kernel” if for all $n\in\mathbb{N}$ and for all $x_1,\dots,x_n\in E$, the matrix $(K(x_i,x_j))_{1\le i,j\le n}$ is positive semi-definite.
In this paper, we call $K$ a “strictly positive definite kernel” if $K$ is symmetric and for all $n\in\mathbb{N}$ and for all $x_1,\dots,x_n\in E$ such that $x_i\neq x_j$ if $i\neq j$, the matrix $(K(x_i,x_j))_{1\le i,j\le n}$ is positive definite.

This notion is particularly interesting for $S_n$ (and any finite set). Indeed, if $k$ is a strictly positive definite kernel on $S_n$, then for any function $f: S_n\to\mathbb{R}$, there exists $(a_\sigma)_{\sigma\in S_n}$ such that:

$$f = \sum_{\sigma\in S_n} a_\sigma k(\cdot,\sigma), \tag{2.5}$$

and $k$ is of course a “universal kernel” (see [SFL11]). This decomposition no longer holds in $S_\infty$ or in $S_{\mathbb{N}}$, but we have a result slightly weaker than the universality of the kernel in $S_\infty$.

###### Proposition 1.

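If $k$ is a strictly positive definite kernel on $S_\infty$, then

$$\mathrm{Vect}\left\{\sum_{i=1}^{n} a_i k(\cdot,\sigma_i),\ n\in\mathbb{N},\ a_i\in\mathbb{R},\ \sigma_i\in S_\infty\right\} \tag{2.6}$$

is dense for the pointwise convergence topology in the space of all the functions on $S_\infty$.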

###### Proof.

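Let $f: S_\infty\to\mathbb{R}$ and let $f_n$ be the restriction of $f$ to $S_n$. The kernel $k$ is strictly positive definite on $S_n$, so there exist $N_n\in\mathbb{N}$, $a_i^n\in\mathbb{R}$ and $\sigma_i^n\in S_n$ such that

$$f_n = \sum_{i=1}^{N_n} a_i^n k(\cdot,\sigma_i^n). \tag{2.7}$$

Hence $f$ is the pointwise limit of the functions $\sum_{i=1}^{N_n} a_i^n k(\cdot,\sigma_i^n)$ as $n\to+\infty$. ∎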

###### Corollary 1.

Let $k$ be a strictly positive definite kernel on $S_\infty$. Then its RKHS is dense, in the pointwise convergence topology, in the space of all the functions on $S_\infty$.

We now provide three different covariance kernels. They share the following form

$$K_{\theta_1^*,\theta_2^*}(\sigma,\sigma') := \theta_2^*\exp\left(-\theta_1^*\, d(\sigma,\sigma')\right), \tag{2.8}$$

where is one of the three distances discussed previously. More precisely, for the Kendall’s tau distance, let be the corresponding covariance function; for the Hamming distance, let be the corresponding covariance function; and for the Spearman’s footrule distance, let be the corresponding covariance function. We will write (resp. ) for all three kernels (resp. distances). Note that when , we have . Note further that the right-invariance of the distances is inherited by the kernel .

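Finally, let $\theta=(\theta_1,\theta_2,\theta_3)$ and let us write $K'_\theta$ for the following covariance kernel

$$K'_\theta(\sigma,\sigma') := K_{\theta_1,\theta_2}(\sigma,\sigma') + \theta_3\mathbb{1}_{\sigma=\sigma'}. \tag{2.9}$$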

In our case, we have assumed that is a covariance function, so that is a strictly positive definite kernel. The following theorem proves this assumption.

###### Theorem 1.

For all and , the maps , and are strictly positive definite kernel on , on and on .

###### Corollary 2.

The kernel is strictly positive definite on , on and on .
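Strict positive definiteness can be illustrated numerically on a small group. The sketch below is a sanity check under stated assumptions, not a proof: it builds the Gram matrix of the kernel (2.8) over all of $S_4$ for each of the three distances and attempts a Cholesky factorization, which succeeds only for a numerically positive definite matrix. Permutations are assumed to be 0-based tuples, and the values `theta1=1.0`, `theta2=2.0` are arbitrary choices for the illustration.

```python
import math
from itertools import combinations, permutations


def kendall_tau(pi, sigma):
    return sum(
        1
        for i, j in combinations(range(len(pi)), 2)
        if (pi[i] - pi[j]) * (sigma[i] - sigma[j]) < 0
    )


def hamming(pi, sigma):
    return sum(1 for p, s in zip(pi, sigma) if p != s)


def spearman_footrule(pi, sigma):
    return sum(abs(p - s) for p, s in zip(pi, sigma))


def gram_matrix(points, dist, theta1, theta2):
    """Matrix [K(sigma_i, sigma_j)] for the kernel (2.8)."""
    return [
        [theta2 * math.exp(-theta1 * dist(p, q)) for q in points] for p in points
    ]


def is_positive_definite(a):
    """Try a Cholesky factorization of the symmetric matrix a (list of lists)."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:  # non-positive pivot: not positive definite
                    return False
                low[i][i] = math.sqrt(s)
            else:
                low[i][j] = s / low[j][j]
    return True


s4 = list(permutations(range(4)))  # the 24 elements of S_4, all distinct
for dist in (kendall_tau, hamming, spearman_footrule):
    gram = gram_matrix(s4, dist, theta1=1.0, theta2=2.0)
    print(dist.__name__, is_positive_definite(gram))
```

Since the Gram matrix over all of $S_4$ contains every Gram matrix built from distinct points of $S_4$ as a principal submatrix, a successful factorization covers all such subsets at once for the chosen parameters.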

## 3 Gaussian fields on the Symmetric group

Let us consider a Gaussian process $Y$ indexed by permutations, with zero mean and unknown covariance function. A classical assumption is that the covariance function belongs to a parametric set of the form

$$\{K_\theta;\ \theta\in\Theta\}, \tag{3.1}$$

where, for all $\theta\in\Theta$, $K_\theta$ is a covariance function. The quantity $\theta$ is generally called the covariance parameter. In this framework, the unknown covariance function is equal to $K_{\theta^*}$ for some parameter $\theta^*\in\Theta$.

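The parameter $\theta^*$ is estimated from noisy observations of the values of the Gaussian process at several inputs. More precisely, let us consider an independent sample of random permutations $\Sigma=(\sigma_1,\dots,\sigma_n)$. Assume that we observe $\Sigma$ and a realization $y$ of the random vector $(Y(\sigma_1),\dots,Y(\sigma_n))^t$ defined by

$$Y(\sigma_k) = Z(\sigma_k) + \varepsilon_k. \tag{3.2}$$

Here, the noise variables $\varepsilon_k$ are independent of $\Sigma$, and $Z$ is a Gaussian process indexed by permutations, independent of $\Sigma$ and of the noise. We assume that $Z$ is centered with covariance function $K_{\theta_1^*,\theta_2^*}$ (see (2.8) in Section 2). Thus, $Y$ is a Gaussian process with zero mean and covariance function $K'_{\theta^*}$ defined by (2.9). The Gaussian process $Z$ (resp. $Y$) is stationary in the sense that for all $\sigma_1,\dots,\sigma_n$ and for all $\tau$, the finite-dimensional distribution of $Z$ (resp. $Y$) at $\sigma_1,\dots,\sigma_n$ is the same as the finite-dimensional distribution at $\sigma_1\tau,\dots,\sigma_n\tau$.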

Several techniques have been proposed for constructing an estimator of . Here, we shall focus on the maximum likelihood one. It is widely used in practice and has received a lot of theoretical attention. The maximum likelihood estimate is defined as

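$$\hat\theta_{ML} = \hat\theta_n \in \operatorname*{argmin}_{\theta\in\Theta} L_\theta \tag{3.3}$$

with

$$L_\theta := \frac{1}{n}\ln(\det R_\theta) + \frac{1}{n}\, y^t R_\theta^{-1} y, \tag{3.4}$$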

where . We consider that for some given ().
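The criterion (3.4) can be evaluated with a Cholesky factorization, which gives both the log-determinant and the quadratic form. The following is a minimal sketch under explicit assumptions: $R_\theta$ is taken to be the matrix $[K'_\theta(\sigma_i,\sigma_j)]_{i,j}$ built from (2.9), the Hamming distance is used, the observation points are distinct 0-based tuples, and the minimization (3.3) is replaced by a search over a small hypothetical grid. All names and numerical values are illustrative only.

```python
import math


def hamming(pi, sigma):
    return sum(1 for p, s in zip(pi, sigma) if p != s)


def k_prime(pi, sigma, theta, dist=hamming):
    """Kernel (2.9): theta2 * exp(-theta1 * d) plus the nugget theta3 when pi == sigma."""
    theta1, theta2, theta3 = theta
    nugget = theta3 if tuple(pi) == tuple(sigma) else 0.0
    return theta2 * math.exp(-theta1 * dist(pi, sigma)) + nugget


def cholesky(a):
    """Lower-triangular factor of a symmetric positive definite matrix (list of lists)."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low


def criterion(theta, sigmas, y, dist=hamming):
    """L_theta of (3.4), assuming R_theta = [K'_theta(sigma_i, sigma_j)]."""
    n = len(sigmas)
    r = [[k_prime(s, t, theta, dist) for t in sigmas] for s in sigmas]
    low = cholesky(r)
    log_det = 2.0 * sum(math.log(low[i][i]) for i in range(n))
    # Forward substitution: solve low z = y, so that y^t R^{-1} y = ||z||^2.
    z = []
    for i in range(n):
        z.append((y[i] - sum(low[i][k] * z[k] for k in range(i))) / low[i][i])
    return (log_det + sum(v * v for v in z)) / n


sigmas = [(0, 1, 2), (1, 0, 2), (2, 1, 0), (1, 2, 0)]
y = [0.3, -0.1, 1.2, 0.8]  # hypothetical observations
grid = [(t1, t2, 0.1) for t1 in (0.1, 0.5, 1.0) for t2 in (0.5, 1.0, 2.0)]
theta_hat = min(grid, key=lambda th: criterion(th, sigmas, y))
print(theta_hat, criterion(theta_hat, sigmas, y))
```

A grid search is used here only to keep the sketch dependency-free; in practice a continuous optimizer over $\Theta$ would be the natural choice.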

When considering the asymptotic behaviour of the maximum likelihood estimator, two different frameworks can be studied: fixed-domain and increasing-domain asymptotics ([Ste99]). Under increasing-domain asymptotics, as $n\to+\infty$, the distance between distinct observation points is lower bounded, so that the observation domain becomes large with $n$. Under fixed-domain asymptotics, the sequence (or triangular array) of observation points is dense in a fixed bounded subset. For a Gaussian field on $\mathbb{R}^d$, under increasing-domain asymptotics, the true covariance parameter $\theta^*$ can be estimated consistently by maximum likelihood. Furthermore, the maximum likelihood estimator is asymptotically normal ([MM84, CL93, CL96, Bac14]). Moreover, prediction performed using the estimated covariance parameter is asymptotically as good as the one computed with $\theta^*$, as pointed out in [Bac14]. Finally, note that in the symmetric group, the fixed-domain framework cannot be considered (contrary to the input space $\mathbb{R}^d$) since $S_n$ is a finite space and $S_\infty$ is a discrete space.

We will consider hereafter the increasing-domain framework. Hence, we observe values of the Gaussian process on the permutations that are assumed to fulfill the following assumptions

1. Condition 1: There exists such that , .

2. Condition 2: There exists such that .

Such conditions are ensured for particular choices of observations for the three different distances previously considered. For example consider the following setting.

###### Lemma 1.

We fix and we choose with a random permutation such that are independent (we do not make further assumptions on the law of ). Let the cycle defined by , if and if . Finally, is a permutation such that , is a random variable in if or if , if and if . The conditions are satisfied with and for the Kendall’s tau distance, for the Hamming distance and for the Spearman’s footrule distance.

###### Remark 1.

If there is a for , Condition 2 ensures that all the observations belong to . More generally, using the stationarity of the Gaussian process and writing instead of we can assume that all the observations belong to .

The following theorem ensures the consistency of the estimator when the number of observations increases.

###### Theorem 2.

Let be defined as in (3.3), then under Conditions 1 and 2, we get

$$\hat\theta_{ML} \xrightarrow[n\to+\infty]{\mathbb{P}} \theta^*. \tag{3.5}$$

The following Lemmas are useful for the proof of Theorem 2 (and of Theorems 3 and 4 below). Their proofs are postponed to the appendix.

###### Lemma 2.

The eigenvalues of are lower-bounded by uniformly in , and .

###### Lemma 3.

For all , with and with , the eigenvalues of are upper-bounded uniformly in , and .

###### Lemma 4.

Uniformly in ,

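$$\forall\alpha>0,\quad \liminf_{n\to+\infty}\ \inf_{\|\theta-\theta^*\|\ge\alpha}\ \frac{1}{n}\sum_{i,j=1}^{n}\left(K'_\theta(\sigma_i,\sigma_j)-K'_{\theta^*}(\sigma_i,\sigma_j)\right)^2 > 0. \tag{3.6}$$

###### Lemma 5.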

, uniformly in ,

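$$\liminf_{n\to+\infty}\ \frac{1}{n}\sum_{i,j=1}^{n}\left(\sum_{k=1}^{3}\lambda_k\frac{\partial}{\partial\theta_k}K'_{\theta^*}(\sigma_i,\sigma_j)\right)^2 > 0. \tag{3.7}$$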
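For reference, the partial derivatives of the kernel entering (3.7) follow directly from the definitions (2.8) and (2.9). Writing $d=d(\sigma,\sigma')$, a short computation gives, for any $\theta=(\theta_1,\theta_2,\theta_3)$:

$$\frac{\partial}{\partial\theta_1}K'_\theta(\sigma,\sigma') = -\theta_2\, d\, e^{-\theta_1 d},\qquad \frac{\partial}{\partial\theta_2}K'_\theta(\sigma,\sigma') = e^{-\theta_1 d},\qquad \frac{\partial}{\partial\theta_3}K'_\theta(\sigma,\sigma') = \mathbb{1}_{\sigma=\sigma'}.$$

In (3.7) these expressions are evaluated at $\theta=\theta^*$.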

With these lemmata we are ready to prove the main asymptotic results.

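###### Proof of Theorem 2.

We follow the proof of Theorem V.9 of [BGLV17]. We first show that for all $\epsilon>0$, almost surely,

$$\mathbb{P}\left(\sup_{\theta}\left|(L_\theta-L_{\theta^*})-\left(\mathbb{E}(L_\theta|\Sigma)-\mathbb{E}(L_{\theta^*}|\Sigma)\right)\right|\ge\epsilon\ \Big|\ \Sigma\right)\xrightarrow[n\to\infty]{} 0. \tag{3.8}$$

We then prove that, for a fixed $a>0$,

$$\mathbb{E}(L_\theta|\Sigma)-\mathbb{E}(L_{\theta^*}|\Sigma)\ \ge\ a\,\frac{1}{n}\sum_{i,j=1}^{n}\left(K'_\theta(\sigma_i,\sigma_j)-K'_{\theta^*}(\sigma_i,\sigma_j)\right)^2. \tag{3.9}$$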

We conclude since (3.8), (3.9) and Lemma 4 imply consistency. ∎

The following theorem provides the asymptotic normality of the estimator.

###### Theorem 3.

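Let $M_{ML}$ be the $3\times 3$ matrix defined by

$$(M_{ML})_{i,j} = \frac{1}{2n}\mathrm{Tr}\left(R_{\theta^*}^{-1}\frac{\partial R_{\theta^*}}{\partial\theta_i}R_{\theta^*}^{-1}\frac{\partial R_{\theta^*}}{\partial\theta_j}\right). \tag{3.10}$$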

Then

$$\sqrt{n}\,M_{ML}^{1/2}\left(\hat\theta_{ML}-\theta^*\right)\xrightarrow[n\to+\infty]{\mathcal{L}}\mathcal{N}(0,I_3). \tag{3.11}$$

Furthermore,

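$$0 < \liminf_{n\to\infty}\lambda_{\min}(M_{ML}) \le \limsup_{n\to\infty}\lambda_{\max}(M_{ML}) < +\infty. \tag{3.12}$$

###### Proof.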

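We proceed as in the proof of Theorem V.10 in [BGLV17]. First, we prove (3.12). We then use a proof by contradiction: we assume that (3.11) is not true. So, there exist a bounded measurable function $g$ and $\xi>0$ so that, up to extracting a subsequence,

$$\left|\mathbb{E}\left[g\left(\sqrt{n}\,M_{ML}^{1/2}(\hat\theta_{ML}-\theta^*)\right)\right]-\mathbb{E}(g(U))\right|\ge\xi, \tag{3.13}$$

with $U\sim\mathcal{N}(0,I_3)$. As in [BGLV17], we prove that, extracting another subsequence, we have:

$$\sqrt{n}\,M_{ML}^{1/2}\left(\hat\theta_{ML}-\theta^*\right)\xrightarrow[n\to+\infty]{\mathcal{L}}\mathcal{N}(0,I_3), \tag{3.14}$$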

which is in contradiction with (3.13). ∎

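Given the maximum likelihood estimator $\hat\theta=\hat\theta_{ML}$, the value $Y(\sigma)$, for any input $\sigma$, can be predicted by plugging the estimated parameter into the conditional expectation (or posterior mean) expression for Gaussian processes. Hence $Y(\sigma)$ is predicted by

$$\hat Y_{\hat\theta}(\sigma) = r_{\hat\theta}^t(\sigma)\,R_{\hat\theta}^{-1}\,y \tag{3.15}$$

with

$$r_{\hat\theta}(\sigma) = \begin{bmatrix} K'_{\hat\theta}(\sigma,\sigma_1)\\ \vdots\\ K'_{\hat\theta}(\sigma,\sigma_n)\end{bmatrix}.$$

We point out that $\hat Y_\theta(\sigma)$ is the conditional expectation of $Y(\sigma)$ given $y$, when assuming that $Y$ is a centered Gaussian process with covariance function $K'_\theta$.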
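The predictor (3.15) amounts to solving one linear system in $R_\theta$ and taking an inner product with $r_\theta(\sigma)$. The sketch below illustrates this under stated assumptions: $R_\theta$ is taken to be $[K'_\theta(\sigma_i,\sigma_j)]_{i,j}$, the Hamming distance is used, observation points are distinct 0-based tuples, and the data and parameter values are hypothetical.

```python
import math


def hamming(pi, sigma):
    return sum(1 for p, s in zip(pi, sigma) if p != s)


def k_prime(pi, sigma, theta, dist=hamming):
    """Kernel (2.9) with theta = (theta1, theta2, theta3)."""
    theta1, theta2, theta3 = theta
    nugget = theta3 if tuple(pi) == tuple(sigma) else 0.0
    return theta2 * math.exp(-theta1 * dist(pi, sigma)) + nugget


def solve_spd(a, b):
    """Solve a x = b for a symmetric positive definite matrix a via Cholesky."""
    n = len(b)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    z = [0.0] * n  # forward substitution: low z = b
    for i in range(n):
        z[i] = (b[i] - sum(low[i][k] * z[k] for k in range(i))) / low[i][i]
    x = [0.0] * n  # backward substitution: low^t x = z
    for i in reversed(range(n)):
        x[i] = (z[i] - sum(low[k][i] * x[k] for k in range(i + 1, n))) / low[i][i]
    return x


def predict(sigma, sigmas, y, theta, dist=hamming):
    """Plug-in predictor (3.15): r_theta(sigma)^t R_theta^{-1} y."""
    r_mat = [[k_prime(s, t, theta, dist) for t in sigmas] for s in sigmas]
    weights = solve_spd(r_mat, y)  # R_theta^{-1} y
    r_vec = [k_prime(sigma, s, theta, dist) for s in sigmas]
    return sum(rv * w for rv, w in zip(r_vec, weights))


sigmas = [(0, 1, 2), (1, 0, 2), (2, 1, 0)]
y = [1.0, -0.5, 2.0]  # hypothetical observations
theta = (0.7, 1.0, 0.1)  # hypothetical (theta1, theta2, theta3)
print(predict((1, 2, 0), sigmas, y, theta))
```

Note that, with $r_\theta$ built from $K'_\theta$ as written in (3.15), $r_\theta(\sigma_i)$ is the $i$-th column of $R_\theta$, so the predictor returns $y_i$ at an observed point $\sigma_i$.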

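###### Theorem 4.

$$\forall\sigma\in S_{\mathbb{N}},\quad \left|\hat Y_{\hat\theta_{ML}}(\sigma)-\hat Y_{\theta^*}(\sigma)\right| = o_{\mathbb{P}}(1). \tag{3.16}$$

###### Proof.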

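We follow the same guidelines as in Theorem V.11 in [BGLV17], showing that, for $k\in\{1,2,3\}$,

$$\sup_{\theta\in\Theta}\left|\frac{\partial}{\partial\theta_k}\hat Y_\theta(\sigma)\right| = O_{\mathbb{P}}(1). \tag{3.17}$$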

## 4 Numerical illustrations

To illustrate Theorem 2, we present a numerical experiment showing that the maximum likelihood estimator is consistent. We generated the observations suggested in Section 3 with . We recall that with a random permutation.

###### Remark 2.

This choice of observations can model real cases. Recall the example given in Section 2, where is the time for a document to be signed in the order To estimate , we have to observe a realization of the time at with and is a random permutation. Assume that the first persons are in the same office We begin to give the document to the person . signs the document, then, observing that he/she is the first one to sign, gives the document to one of the first persons, who then sign in a random order.

To highlight the dependency with , we write the maximum likelihood estimator for observations. For each value of , we estimate the probability using a Monte-Carlo method and a sample of 1000 values of . Figure 1 depicts these estimates for , and .

In Figure 2, we display the density of the coordinates of the maximum likelihood estimator for different values of $n$ (20, 60 and 150). These densities have been estimated from a sample of 1000 values of the maximum likelihood estimator. We observe that the densities can be far from the true parameter for $n=20$ or $n=60$ but are quite close to it for $n=150$. We can see that for , the Kendall’s tau distance seems to give better estimates of . However, the computation time of the distance matrix is much longer with the Kendall’s tau distance than with the other distances.

In Figure 3, we display estimates of the probability that the absolute difference between the prediction of $Y(\sigma)$ given in (3.15) with the estimated parameter $\hat\theta_{ML}$ and the prediction of $Y(\sigma)$ with the true parameter $\theta^*$ exceeds a fixed threshold. Theorem 4 ensures that this probability converges to $0$ when $n\to+\infty$.

## 5 Partial Rankings

### 5.1 Introduction

In many situations, when $n$ is large, preferences are not given for all items but only for a small number of them. This situation often occurs in social science. Statistical models that analyze human decisions in consumer behaviour often involve partial rankings. Indeed, given objects governed by a large number of variables, how can we model the decision to choose one object rather than another? Objects are described by a set of quantitative variables, each representative of a specific property of the object. For instance, when buying a bike one may be interested in the weight, the price, the number of gears, the height, or any other quantitative or qualitative descriptors. Each consumer, when confronted with the choice of a product, chooses to give more importance to certain variables while discarding others. The consumer selects a small number of variables (features) that are essential in his or her choice and ranks these variables according to his or her preferences, while the others play little role.

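In the general framework, we have a finite set of $n$ items $x_1,\dots,x_n$. A partial ranking aims at giving an order of preference between different elements of this set. A partial ranking $R$ is a statement of the form

$$X_1\succ X_2\succ\dots\succ X_m, \tag{5.1}$$

where $X_1,\dots,X_m$ are disjoint subsets of the set of items. This partial ranking means that any element of $X_i$ is preferred to any element of $X_{i+1}$. We can associate to the partial ranking $R$ the subset $E_R$ of $S_n$ defined by

$$E_R := \left\{\sigma\in S_n,\ \forall(x_{i_1},\dots,x_{i_m})\in X_1\times\dots\times X_m,\ \sigma(i_1)<\sigma(i_2)<\dots<\sigma(i_m)\right\}. \tag{5.2}$$

###### Remark 3.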

In [KB10] and [JV17], the set is defined by the set of the permutations such that . They chose this definition to simplify their computations but in this way the ranking mapped to is

$$x_{\sigma^{-1}(i_n)}\succ\dots\succ x_{\sigma^{-1}(i_1)}.$$

The definition (5.2) seems to be more natural because we map to the ranking

$$x_{\sigma^{-1}(i_1)}\succ\dots\succ x_{\sigma^{-1}(i_n)}.$$

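The first natural way to extend a positive definite kernel $K$ on permutations to partial rankings (see [KB10], [JV17],…) is to let

$$K(R,R') := \frac{1}{|E_R||E_{R'}|}\sum_{\sigma\in E_R}\sum_{\sigma'\in E_{R'}}K(\sigma,\sigma'). \tag{5.3}$$

If $K$ is a positive definite kernel on permutations, then $K$ defined by (5.3) is a positive definite kernel on partial rankings ([Hau99]). We can also see this by noting that if $R_1,\dots,R_n$ are partial rankings and if $(a_1,\dots,a_n)\in\mathbb{R}^n$, then

$$\sum_{i,j=1}^{n}a_ia_jK(R_i,R_j) = \sum_{\sigma,\sigma'\in S_n}b_\sigma b_{\sigma'}K(\sigma,\sigma'), \tag{5.4}$$

letting

$$b_\sigma := \sum_{i,\ \sigma\in E_{R_i}}\frac{a_i}{|E_{R_i}|}. \tag{5.5}$$

###### Remark 4.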

The value of $K(R,R)$ depends on $R$. It can be very close to $0$, which means, for a Gaussian process indexed by the partial rankings, that the value at $R$ is almost constant. To circumvent this problem, we can define a new kernel

$$K_{new}(R,R') := \frac{K(R,R')}{\sqrt{K(R,R)K(R',R')}}. \tag{5.6}$$

The computation of this kernel can be very long because we have to sum over all pairs of permutations in $E_R\times E_{R'}$. In the following, we aim to reduce this computation. We focus especially on the following kernel on $S_n$:

$$K(\sigma,\sigma') := e^{-\nu d(\sigma,\sigma')}, \tag{5.7}$$

where $d$ is the Kendall’s tau distance, the Hamming distance or the Spearman’s footrule distance. These kernels are interesting for two reasons: they are strictly positive definite and they are easy to interpret (more so than a kernel defined by a matrix exponential).
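The kernels (5.3) and (5.6) can be computed by brute force for small $n$, which makes the cost of the double sum explicit. The following is a minimal sketch under stated assumptions: a partial ranking is given as a list of disjoint non-empty sets of 0-based item indices `[X_1, ..., X_m]`, a permutation is a 0-based tuple with `sigma[i]` the rank of item `i` (smaller means preferred), and the base kernel is (5.7) with the Kendall's tau distance. All function names are hypothetical.

```python
import math
from itertools import combinations, permutations


def kendall_tau(pi, sigma):
    return sum(
        1
        for i, j in combinations(range(len(pi)), 2)
        if (pi[i] - pi[j]) * (sigma[i] - sigma[j]) < 0
    )


def consistent_permutations(blocks, n):
    """Brute-force enumeration of E_R in (5.2) for R = X_1 > ... > X_m."""

    def respects(sigma):
        # Every item of a block must be ranked before every item of the next block.
        return all(
            max(sigma[i] for i in blocks[a]) < min(sigma[i] for i in blocks[a + 1])
            for a in range(len(blocks) - 1)
        )

    return [sigma for sigma in permutations(range(n)) if respects(sigma)]


def partial_kernel(blocks1, blocks2, n, nu, dist=kendall_tau):
    """Averaged kernel (5.3) with base kernel exp(-nu * d) from (5.7)."""
    e1 = consistent_permutations(blocks1, n)
    e2 = consistent_permutations(blocks2, n)
    total = sum(math.exp(-nu * dist(s, t)) for s in e1 for t in e2)
    return total / (len(e1) * len(e2))


def normalized_kernel(blocks1, blocks2, n, nu, dist=kendall_tau):
    """Normalized kernel (5.6)."""
    k12 = partial_kernel(blocks1, blocks2, n, nu, dist)
    k11 = partial_kernel(blocks1, blocks1, n, nu, dist)
    k22 = partial_kernel(blocks2, blocks2, n, nu, dist)
    return k12 / math.sqrt(k11 * k22)


top_one = [{0}, {1, 2, 3}]  # item 0 preferred to all the others
pairwise = [{1}, {0}]       # item 1 preferred to item 0, others unconstrained
print(partial_kernel(top_one, pairwise, n=4, nu=0.5))
print(normalized_kernel(top_one, pairwise, n=4, nu=0.5))
```

The enumeration visits all $n!$ permutations, so this sketch is only usable for very small $n$, which is precisely the motivation for the reductions studied in the rest of this section.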

### 5.2 Direct computations

The first idea is to simplify the expression of (5.3). However, this does not seem to be a simple task, which is why we work in a particular framework. In this section, we assume that all the items are ranked, i.e. $X_1,\dots,X_m$ is a partition of the set of items. Let and is a partition of . This computation has already been done in [LM08] for the Kendall’s tau distance. Let us sum up the result of interest in the following proposition.

###### Proposition 2.

[LM08]
Let be a partition of . For all , let and let

 aγj := ∣∣{(s,t),s

Then, if for ,

 K(R1,R2)=1|Sγ1||Sγ2|⎛⎜ ⎜⎝∑σ∈π1π−12Sγ2e−ν∑m1=j

Now, we do the same work with the Hamming distance. First, we need to introduce a new notation.

###### Definition 1.

We define

$$c_{n,d} := \left|\{\sigma\in S_n,\ d_H(\sigma,\mathrm{id})=d\}\right| = \frac{n!}{(n-d)!}\sum_{k=0}^{d}\frac{(-1)^k}{k!}, \tag{5.9}$$

the number of permutations of $n$ elements which move exactly $d$ elements.
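The closed form (5.9) can be checked against a direct enumeration for small $n$. The sketch below is an illustration with hypothetical function names; it uses exact rational arithmetic to evaluate the alternating sum and counts, by brute force, the permutations (stored as 0-based tuples) at Hamming distance $d$ from the identity.

```python
from fractions import Fraction
from itertools import permutations
from math import factorial


def c_formula(n, d):
    """Closed form (5.9): n! / (n - d)! * sum_{k=0}^{d} (-1)^k / k!."""
    alternating = sum(Fraction((-1) ** k, factorial(k)) for k in range(d + 1))
    value = Fraction(factorial(n), factorial(n - d)) * alternating
    return int(value)  # the value is an integer for 0 <= d <= n


def c_brute_force(n, d):
    """Count the permutations of S_n that move exactly d elements."""
    return sum(
        1
        for sigma in permutations(range(n))
        if sum(1 for i, s in enumerate(sigma) if s != i) == d
    )


for n in range(1, 6):
    formula = [c_formula(n, d) for d in range(n + 1)]
    brute = [c_brute_force(n, d) for d in range(n + 1)]
    print(n, formula, formula == brute)
```

For $n=4$ both computations give the counts $1, 0, 6, 8, 9$ for $d=0,\dots,4$, which sum to $4!=24$.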

Now we give a proposition similar to Proposition 2 for the Hamming distance.

###### Proposition 3.

Let us define

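$$\begin{aligned}
a^{\gamma}_{j} &:= \left|\left\{i\in[1:n],\ i\neq\tau(i),\ (i,\tau(i))\in[g_j+1:g_{j+1}]^2\right\}\right|,\\
b^{\gamma}_{jl}(\tau) &:= \left|\left\{i\in[1:n],\ l\neq j,\ i\in[g_j+1:g_{j+1}],\ \tau(i)\in[g_l+1:g_{l+1}]\right\}\right|.
\end{aligned}$$

Then,

$$K(R_1,R_2) = \frac{1}{|S_{\gamma_1}||S_{\gamma_2}|}\left(\sum_{\sigma\in\pi_1\pi_2^{-1}S_{\gamma_2}}e^{-\nu\sum_{j,l=1}^{m}b^{\gamma_1}_{jl}(\sigma)}\right)\left(\prod_{s=1}^{m}\sum_{h=0}^{\gamma_{1,s}}c_{\gamma_{1,s},h}\,e^{-\nu h}\right). \tag{5.10}$$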

### 5.3 Fourier Transform of the kernel on partial ranking

#### Notations

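In this section, we use the usual kernel on partial rankings defined by

$$K(R,R') := \frac{1}{|E_R||E_{R'}|}\sum_{\sigma\in E_R}\sum_{\sigma'\in E_{R'}}K(\sigma,\sigma'). \tag{5.11}$$

We assume that the kernel $K$ on the set of permutations is right-invariant and we write $k(\sigma) := K(\sigma,\mathrm{id})$, so that $K(\sigma,\sigma') = k(\sigma\sigma'^{-1})$. We extend the work of [KB10]. We compute the Fourier transform for general partial rankings, i.e. statements of the form

$$X_1\succ X_2\succ\dots\succ X_m. \tag{5.12}$$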

Let be the size of the , let be the sum of the and let be the partition of defined by . Let be the set of interleaving of with

 σ(i)≤σ(j) si i

Then, writing , we have (as in [KB10] but generalized for all partial ranking)

 ER:=ΠnkS~γπR, (5.13)

where is such that . Finally, let us write .

#### Reduction of number of terms

We generalize the work of [KB10] to general partial rankings. Let $R_i$ ($i=1,2$) be the partial rankings defined by

$$E_{R_i} = E_{\gamma_i}\pi_{R_i}. \tag{5.14}$$

As in [KB10], we identify a set $A$ of permutations with its indicator function on $S_n$, which associates to $\sigma$ the number $1$ if $\sigma\in A$ and $0$ otherwise. Proposition 6 of [KB10] gives

$$K(R_1,R_2) = \frac{1}{n!\,|E_{R_1}||E_{R_2}|}\sum_{\lambda\vdash n}d_\lambda\,\mathrm{Tr}\left(\hat E_{R_1}(\rho_\lambda)^{*}\,\hat k(\rho_\lambda)\,\hat E_{R_2}(\rho_\lambda)\right). \tag{5.15}$$

The next proposition (which generalizes Proposition 8 of [KB10]) shows how the sum in the previous equation can be reduced to a smaller number of terms.

###### Proposition 4.

Let and let be the set of Young’s diagrams of boxes with at least boxes in their first row. Then

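$$K(R_1,R_2) = \frac{1}{n!\,|E_{R_1}||E_{R_2}|}\sum_{\lambda\in\Lambda^{n}_{c}}d_\lambda\,\mathrm{Tr}\left(\hat E_{R_1}(\rho_\lambda)^{*}\,\hat k(\rho_\lambda)\,\hat E_{R_2}(\rho_\lambda)\right). \tag{5.16}$$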

#### Reduction of each remaining term

Here, we assume that the partial rankings for which we want to compute the kernel have always the same forms . For example, assume that all these partial rankings are top- lists and partial rankings of the form . In this case, we just have and . The next proposition show that the computation of (5.16) can still be reduced.

###### Proposition 5.

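Let $R_i$ ($i=1,2$) be the partial rankings defined by

$$E_{R_i} = E_{\gamma_i}\pi_i. \tag{5.17}$$

Then

$$K(R_1,R_2) = \frac{1}{n!}\sum_{\lambda\in\Lambda^{n}_{c}}d_\lambda\,\mathrm{Tr}\left(\rho_\lambda(\pi_2\pi_1^{-1})\,q_\lambda(E_{\gamma_1})\,\hat k(\rho_\lambda)\,p_\lambda(E_{\gamma_2})\right), \tag{5.18}$$

where

$$q_\lambda(E_\gamma) := \frac{1}{|E_\gamma|}\sum_{\sigma\in E_\gamma}\rho_\lambda(\sigma^{-1}),$$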