# Minimax bounds for structured prediction

Kevin Bello
Department of Computer Science
Purdue University
West Lafayette, IN 47906, USA
kbellome@purdue.edu
Asish Ghoshal
Department of Computer Science
Purdue University
West Lafayette, IN 47906, USA
aghoshal@purdue.edu
Jean Honorio
Department of Computer Science
Purdue University
West Lafayette, IN 47906, USA
jhonorio@purdue.edu
###### Abstract

Structured prediction can be considered a generalization of many standard supervised learning tasks, and is usually thought of as the simultaneous prediction of multiple labels. One standard approach is to maximize a score function on the space of labels, which decomposes as a sum of unary and pairwise potentials, each depending on one or two specific labels, respectively. For this approach, several learning and inference algorithms have been proposed over the years, ranging from exact to approximate methods that trade accuracy against computational complexity. However, in contrast to binary and multiclass classification, results on the number of samples necessary for learning are still limited, even for a specific family of predictors such as factor graphs. In this work, we provide minimax bounds for a class of factor-graph inference models for structured prediction. That is, we characterize the necessary sample complexity for any conceivable algorithm to achieve learning of factor-graph predictors.

## 1 Introduction

Structured prediction has been used continuously over the years in multiple domains such as computer vision, natural language processing, and computational biology. Key examples of structured prediction problems include image segmentation, dependency parsing, part-of-speech tagging, named entity recognition, machine translation, and protein folding. In this setting, the input is some observation, e.g., a social network, an image, or a sentence. The output is a labeling, e.g., an assignment of each individual of a social network to a cluster, an assignment of each pixel in the image to foreground or background, or an acyclic graph as in dependency parsing. A property common to these tasks is that, in each case, the natural loss function admits a decomposition along the output substructures. Thus, a common approach to structured prediction is to exploit local features to infer the global structure. For instance, one could include a feature that encourages two individuals of a social network to be assigned to different clusters whenever there is a strong disagreement in opinions about a particular subject. Then, one can define a posterior distribution over the set of possible labelings conditioned on the input.

The output structure and corresponding loss function make these problems significantly different from the (unstructured) binary or multiclass classification problems extensively studied in learning theory. Some classical algorithms for learning the parameters of the model include conditional random fields [14], structured support vector machines [22, 24, 2], kernel-regression algorithms [8], and search-based structured prediction [10]. More recently, deep learning algorithms have been developed for specific tasks such as image annotation [27], part-of-speech tagging [13, 26], and machine translation [30].

However, in contrast to the many algorithms developed, relatively few studies have been devoted to the theoretical understanding of structured prediction. Within this limited theoretical literature, the most studied aspect has been generalization error bounds. [6, 5, 23] provided learning guarantees that hold primarily for losses such as the Hamming loss and apply to specific factor graph models. [16, 12, 3, 11] provide PAC-Bayesian guarantees for arbitrary losses through the analysis of randomized algorithms using count-based hypotheses. Literature on lower bounding the sample complexity for structured prediction is scarcer, even for specific hypothesis classes of losses. Information-theoretic bounds have been studied in the context of binary graphical models [18, 20] and Gaussian Markov random fields [29]. Nevertheless, there is still a lack of understanding in the context of more general structured prediction problems.

Our main contribution consists of characterizing the necessary sample complexity for learning factor graph models in the context of structured prediction. Specifically, in Theorem 1, we show that the finiteness of the $\{0,1\}^2$-dimension (see Definition 3) is necessary for learning. We further show in Theorem 2 the connection of the $\{0,1\}^2$-dimension to the VC-dimension [25], which will allow us to compute the $\{0,1\}^2$-dimension from the several known results on VC-dimension.

## 2 Preliminaries

Let $\mathcal{X}$ denote the input space and $\mathcal{Y}$ the output space. In structured prediction, the output space usually consists of a large (e.g., exponential) set of discrete objects admitting some possibly overlapping structure. Among common structures in the literature, one finds sets of sequences, graphs, images, parse trees, etc. Thus, we consider the output space to be decomposable into $l$ substructures: $\mathcal{Y} = \mathcal{Y}_1 \times \cdots \times \mathcal{Y}_l$. Here, $\mathcal{Y}_i$ is the set of possible labels that can be assigned to substructure $i$. For example, in a webpage collective classification task [21], each $\mathcal{Y}_i$ is a webpage label, whereas $\mathcal{Y}$ is a joint label for an entire website. In this work we assume that $\mathcal{Y} = \{0,1\}^l$, that is, $\mathcal{Y}_i = \{0,1\}$ for all $i$. In this case, the number of possible assignments to $\mathcal{Y}$ is exponential in the number of substructures $l$, i.e., $|\mathcal{Y}| = 2^l$.

#### The Hamming loss.

In order to measure the success of a prediction, we use the Hamming loss throughout this work. Specifically, for two outputs $y, y' \in \mathcal{Y}$, with $y = (y_1, \dots, y_l)$ and $y' = (y'_1, \dots, y'_l)$, the Hamming loss, $L_H(y, y')$, is defined as $L_H(y, y') = \sum_{i=1}^{l} \mathbb{1}[y_i \neq y'_i]$. The Hamming loss has been widely used in structured prediction. For instance, in image segmentation one may count the number of pixels that are incorrectly assigned as foreground/background; in graphs, one may count the number of edges that differ between the prediction and the true label.
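As a concrete illustration, the Hamming loss can be sketched in a few lines of Python. The function name `hamming_loss` and the five-pixel example are illustrative assumptions, and the sketch uses the unnormalized count of mismatches, which is the form that the excess-risk derivation in Section 3 relies on.

```python
def hamming_loss(y, y_prime):
    """Count the substructures on which two labelings of equal length differ."""
    if len(y) != len(y_prime):
        raise ValueError("labelings must have the same number of substructures")
    return sum(int(a != b) for a, b in zip(y, y_prime))


# Hypothetical foreground/background labels for five pixels.
truth = (1, 0, 0, 1, 1)
guess = (1, 1, 0, 0, 1)
mismatches = hamming_loss(truth, guess)  # pixels 1 and 3 disagree
```

Here `mismatches` equals 2, since the two labelings disagree on exactly two pixels.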

#### Factor graphs and scoring functions.

We adopt a common approach in structured prediction where predictions are based on a scoring function mapping to . Let be a family of scoring functions. For any , we denote by the predictor defined by : for any ,

Furthermore, we assume that each function can be decomposed as a sum, as is standard in structured prediction. We consider the most general case for such decompositions through the notion of factor graphs, described also in [7]. A factor graph is a bipartite graph, and is represented as a tuple , where is a set of variable nodes, a set of factor nodes, and a set of undirected edges between a variable node and a factor node. In our context, can be identified with the set of substructure indices, that is We further assume that is connected. Note that, in contrast to graphical models, we do not assume to be a probabilistic model but it would also be captured by this framework.

For any factor node , denote by the set of variable nodes connected to via an edge and define as the substructure set cross-product . Then, decomposes as a sum of functions , each taking as argument an element of the input space and an element of , :

$$f(x, y) = \sum_{\phi \in \Phi} f_{\phi}(x, y_{\phi}).$$

Specifically, we focus on factor graphs with unary and pairwise factors, that is, each factor node is connected to one or two nodes in . We let denote a pairwise factor node connected to , i.e., . Then, the score induced from is given by . Note that in this case and represent the same factor node and induce the same score. Similarly, for unary factor nodes, we let denote a factor node connected to with score given by . We further use to denote functions that are decomposable with respect to the graph . Note also that while all decompose with respect to same graph , the score functions and are allowed to be different for any , . Figure 1 shows different examples of factor graphs with unary and pairwise factors.
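The decomposition into unary and pairwise factors, together with prediction by maximizing the score over labelings, can be sketched as follows. This is a minimal sketch under stated assumptions: the names `score` and `predict`, the three-node chain, and the particular factor functions are all hypothetical, and the exhaustive search over $\{0,1\}^l$ is only meant to make the definition concrete, not to suggest a practical inference method.

```python
from itertools import product


def score(x, y, unary, pairwise):
    """Total score: one term per unary factor plus one term per pairwise factor."""
    total = sum(f_u(x, y[u]) for u, f_u in unary.items())
    total += sum(f_uv(x, y[u], y[v]) for (u, v), f_uv in pairwise.items())
    return total


def predict(x, num_nodes, unary, pairwise):
    """Brute-force argmax over {0,1}^l (exponential in l; illustration only)."""
    return max(product((0, 1), repeat=num_nodes),
               key=lambda y: score(x, y, unary, pairwise))


# Hypothetical chain 0 - 1 - 2: unary scores follow the input,
# pairwise scores reward neighbouring labels that agree.
unary = {u: (lambda x, a, u=u: x[u] * a) for u in range(3)}
pairwise = {(0, 1): lambda x, a, b: 0.5 * (a == b),
            (1, 2): lambda x, a, b: 0.5 * (a == b)}

y_hat = predict((2.0, -0.3, -1.0), 3, unary, pairwise)
```

For this toy input the maximizer is `(1, 0, 0)` with score 2.5: the strong unary evidence on node 0 outweighs the agreement bonus that the all-zero labeling would collect.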

#### Learning.

We receive a training set of i.i.d. samples drawn according to some distribution over . We denote by the expected Hamming loss and by the empirical Hamming loss of :

$$R_P(f) = \mathbb{E}_{(x,y)\sim P}\left[L_H(f(x), y)\right] \quad\text{and}\quad R_S(f) = \frac{1}{m}\sum_{(x,y)\in S} L_H(f(x), y). \tag{1}$$

Our learning scenario consists of using the sample to select a hypothesis with small expected Hamming loss .

Next, we introduce the definition of Bayes-Hamming loss, which in words is the minimum attainable expected Hamming loss by any predictor.

###### Definition 1 (Bayes-Hamming loss).

For any given distribution over , the Bayes-Hamming loss is defined as the minimum achievable expected Hamming loss among all possible predictors . That is,

Then the Bayes-Hamming predictor, , is defined as the function that achieves the Bayes-Hamming loss, that is, .

The following proposition shows how the Bayes-Hamming predictor makes its decision with respect to the Hamming loss.

###### Proposition 1.

For any given distribution over , the Bayes-Hamming predictor is: where is the marginal probability , for each substructure .

(See Appendix A for detailed proofs.)
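The role of the marginals in Proposition 1 can be checked numerically on a small example. The sketch below assumes the usual reading of the proposition, namely that the Bayes-Hamming predictor sets each substructure to 1 when its marginal probability of being 1 is at least 1/2 (the tie-breaking direction is an arbitrary choice made here). The function names and the toy conditional distribution are hypothetical.

```python
from itertools import product


def marginals(cond):
    """eta_j = P[y_j = 1 | x] for a conditional distribution {labeling: prob}."""
    l = len(next(iter(cond)))
    return [sum(p for y, p in cond.items() if y[j] == 1) for j in range(l)]


def bayes_hamming(cond):
    """Threshold each marginal at 1/2 (ties broken toward 1, an assumption)."""
    return tuple(int(eta >= 0.5) for eta in marginals(cond))


def expected_hamming(pred, cond):
    """Expected number of mismatched substructures when predicting `pred`."""
    return sum(p * sum(a != b for a, b in zip(pred, y)) for y, p in cond.items())


# Toy conditional distribution over {0,1}^2 for one fixed input x.
cond = {(0, 0): 0.4, (0, 1): 0.0, (1, 0): 0.3, (1, 1): 0.3}

thresholded = bayes_hamming(cond)
brute_force = min(product((0, 1), repeat=2),
                  key=lambda y: expected_hamming(y, cond))
```

Both `thresholded` and `brute_force` equal `(1, 0)`, even though the single most likely joint labeling is `(0, 0)`. This is the sense in which the optimal decision depends on the loss: under the Hamming loss only the per-substructure marginals matter.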

We emphasize that the above definition considers the Hamming loss, , as defined at the beginning of Section 2. For other types of loss functions, the Bayes predictor can have different optimal decisions.

### 2.1 Minimax risk framework

The standard minimax risk consists of a family of distributions over a sample space , and a function defined on , that is, a mapping . We aim to estimate the parameter based on a sequence of i.i.d. observations drawn from the (unknown) distribution . To evaluate the quality of an estimator , we let denote a semi-metric on the space , which we use to measure the error of an estimator with respect to the parameter . For a distribution and for a given estimator , we assess the quality of the estimate in terms of the (expected) risk:

$$\mathbb{E}_P\left[\rho\big(\hat{\theta}(z_1, \dots, z_m), \theta(P)\big)\right],$$

where denotes the expectation with respect to . A common approach, first suggested by [28], for choosing an estimator is to select the one that minimizes the maximum risk, that is,

$$\sup_{P \in \mathcal{P}} \mathbb{E}_P\left[\rho\big(\hat{\theta}(z_1, \dots, z_m), \theta(P)\big)\right].$$

An optimal estimator for this metric then gives the minimax risk, which is defined as:

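$$\mathfrak{M}_m(\theta(\mathcal{P}), \rho) := \inf_{\hat{\theta}} \sup_{P \in \mathcal{P}} \mathbb{E}_P\left[\rho\big(\hat{\theta}(z_1, \dots, z_m), \theta(P)\big)\right],$$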

where we take the supremum (worst-case) over distributions , and the infimum is taken over all estimators . Here the notation indicates that we consider distributions in and parameters for .

### 2.2 Minimax risk in structured prediction

We now apply the framework above to our context and study a specialized notion of risk appropriate for prediction problems. In this setting, we aim to estimate a function by using samples from a distribution . For any sample , we will measure the quality of our estimation, , by comparing its output to the structure drawn from through the Hamming loss. By taking expectation, we obtain the expected risk or expected Hamming loss, defined in eq.(1). We then compare this risk to the best possible Hamming loss, i.e., the Bayes-Hamming loss. That is, we assume that at least one function achieves the Bayes-Hamming loss. Thus, we arrive at the following minimax excess risk:

$$\mathfrak{M}_m(\mathcal{P}) = \inf_{A} \sup_{P \in \mathcal{P}} \mathbb{E}_{S \sim P^m}\left[R_P(A(S)) - R_P(f^*)\right], \tag{2}$$

where , and is any algorithm that returns a predictor given training samples from . Moreover, defines a family of distributions over . Intuitively speaking, for a fixed distribution , the quantity represents the minimum expected excess loss achievable by any algorithm with respect to the factor graph . Then looks into the distribution that attains the worst expected excess loss.

## 3 Information-theoretic lower bound for structured prediction

We are interested in finding a lower bound on the minimax risk (2) presented in Section 2.2. By doing this, we characterize the number of samples necessary to have any hope of achieving learning.

Before presenting our main result, we introduce a new type of dimension that will show up in our lower bound and will help to characterize learnability. It is known that different notions of dimension of function classes help to characterize learnability in certain prediction problems. For example, for binary classification, the finiteness of the VC dimension [25] is necessary for learning [15]. For multiclass classification, it was shown that the finiteness of the Natarajan dimension is necessary for learning [9]. General notions of dimension for multiclass classification have also been studied in [4].

For a given function class , and dataset of samples, we use the following shorthand notation: That is, contains all the matrices in that can be produced by applying all functions in to the dataset . Next we define the standard notion of shattering.

###### Definition 2 ($\{0,1\}^2$-shattering).

A function class, , -shatters a finite set of samples if produces all possible binary matrices in . That is, .

###### Definition 3 ($\{0,1\}^2$-dimension).

The -dimension of a function class , denoted , is the maximal size of a set that can be shattered by . If can shatter sets of arbitrarily large size we say that has infinite -dimension.
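To make the shattering condition concrete, the following sketch checks it by enumeration for a finite function class. It assumes the reading of Definition 2 in which a class of functions with outputs in $\{0,1\}^2$ shatters $d$ points when all $4^d$ binary $d \times 2$ matrices are realized. The helper names and the toy class of threshold pairs are hypothetical and are not taken from the paper.

```python
def realized_matrices(functions, points):
    """All d x 2 binary matrices obtained by applying each function to the points."""
    return {tuple(tuple(g(x)) for x in points) for g in functions}


def shatters(functions, points):
    """True when every one of the 4**d binary d x 2 matrices is realized."""
    return len(realized_matrices(functions, points)) == 4 ** len(points)


def make_threshold_pair(a, b):
    """Hypothetical function x -> (1[x >= a], 1[x >= b]) with outputs in {0,1}^2."""
    return lambda x: (int(x >= a), int(x >= b))


thresholds = (0, 2, 4)
G = [make_threshold_pair(a, b) for a in thresholds for b in thresholds]

one_point = shatters(G, [1])      # all four outputs in {0,1}^2 are realized
two_points = shatters(G, [1, 3])  # a threshold cannot output 1 at x=1 and 0 at x=3
```

In this toy class a single point is shattered, while the pair of points is not: only 9 of the 16 possible matrices are realized, because each coordinate is monotone in the input.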

The above dimension applies to functions with output in . We will create functions with output in as follows. Let denote the function with for all . Then, let , that is, the output of is in . The following dimension applies to function classes based on factor graphs.

###### Definition 4 (max-$\{0,1\}^2$-dimension).

For a given factor graph , the -dimension of a function class , denoted as , is defined as:

$$\text{max-}\{0,1\}^2\text{-Dim}(\mathcal{F}(G)) = \max_{(u,v) \in T} \{0,1\}^2\text{-Dim}\big(\mathcal{F}^{(0)}_{u,v}\big),$$

where , and .

###### Theorem 1.

Let be a factor graph with pairwise and unary factors, let denote a class of functions , where each decomposes according to , and let . Then, we have that for any and any :

$$\mathfrak{M}_m(\mathcal{P}) \geq \frac{1}{81} \min\left(\frac{d-1}{\gamma m}, \sqrt{\frac{d-1}{m}}\right).$$

###### Proof.

The proof is motivated by the work of [15] for binary classifiers. As a first step, it is clear that one can lower bound eq.(2) by taking the maximum over a subset of . That is, we create a collection of families of distributions , where . Each family of distributions is further indexed by . Then we have,

$$\mathfrak{M}_m(\mathcal{P}) \geq \max_{(u,v) \in T} \mathfrak{M}_m(\mathcal{D}_{\gamma,u,v}).$$

Our approach consists of first defining the families of distributions such that their elements can be naturally indexed by the vertices of a binary hypercube. We will then relate the expected excess risk problem to the estimation of binary strings in order to apply Assouad’s lemma.

#### Construction of $\mathcal{D}_{\gamma,u,v}$.

Consider a fixed . We first focus on constructing a family of distributions, , parameterized by . Each distribution is further indexed by a binary matrix , where is the -dimension of . To construct these distributions, we will first pick the marginal distribution of the feature , and then specify the conditional distributions of given , for each .

We construct as follows. Since is a class with -dimension , there exists a set of points that are shattered by , that is, for any binary matrix there exists at least one function such that , for all . We now define the marginal distribution such that its support is the shattered set , i.e., . For a given parameter , whose value is set later, we have:

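$$P^{(x)}_{\gamma,u,v,B}[x_i] = \begin{cases} p, & \text{if } i \in \{1, \dots, d^{(0)}_{u,v} - 1\}, \\ 1 - (d^{(0)}_{u,v} - 1)\,p, & \text{otherwise.} \end{cases}$$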

Next, for a fixed , the conditional distribution of given , , is defined as:

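$$P^{(y|x)}_{\gamma,u,v,B}[y \mid x] = \begin{cases} \dfrac{1-3\gamma}{4}, & \text{if } x = x_i, \text{ and } y_u = 1 - B_{i1}, \text{ and } y_v = 1 - B_{i2}, \\ & \text{and } y_k = 0 \text{ for } k \in V \setminus \{u,v\}, \text{ and } i \in \{1, \dots, d^{(0)}_{u,v} - 1\}, \\[4pt] \dfrac{1+\gamma}{4}, & \text{if } x = x_i, \text{ and } y_k = 0 \text{ for } k \in V \setminus \{u,v\}, \text{ and } i \in \{1, \dots, d^{(0)}_{u,v} - 1\}, \\[4pt] 0, & \text{otherwise,} \end{cases}$$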

here we implicitly assume that in order to obtain a valid distribution. The above definition produces the following marginal probabilities:

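$$\eta^{(\gamma,u,v,B)}_j(x) \equiv P^{(y_j|x)}_{\gamma,u,v,B}[y_j = 1 \mid x] = \begin{cases} \dfrac{1-\gamma}{2}, & \text{if } x = x_i \text{ for some } i \in \{1, \dots, d^{(0)}_{u,v} - 1\}, \\ & \text{and } \big((j = u \text{ and } B_{i1} = 0) \text{ or } (j = v \text{ and } B_{i2} = 0)\big), \\[4pt] \dfrac{1+\gamma}{2}, & \text{if } x = x_i \text{ for some } i \in \{1, \dots, d^{(0)}_{u,v} - 1\}, \\ & \text{and } \big((j = u \text{ and } B_{i1} = 1) \text{ or } (j = v \text{ and } B_{i2} = 1)\big), \\[4pt] 0, & \text{otherwise,} \end{cases} \tag{3}$$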

where we note that for each and any we have that . Given the above marginals, the corresponding Bayes-Hamming predictor for substructure for a given input (see Proposition 1), which we denote by , is given by:

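$$(f^*_{B,u,v}(x))_j = \begin{cases} 0, & \text{if } x = x_i \text{ for some } i \in \{1, \dots, d^{(0)}_{u,v} - 1\}, \\ & \text{and } \big((j = u \text{ and } B_{i1} = 0) \text{ or } (j = v \text{ and } B_{i2} = 0)\big), \\[4pt] 1, & \text{if } x = x_i \text{ for some } i \in \{1, \dots, d^{(0)}_{u,v} - 1\}, \\ & \text{and } \big((j = u \text{ and } B_{i1} = 1) \text{ or } (j = v \text{ and } B_{i2} = 1)\big), \\[4pt] 0, & \text{otherwise.} \end{cases} \tag{4}$$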

That is, we have that the output of the Bayes-Hamming predictor on each for , for each substructure for , is equal to the bit value or , and zero otherwise.
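The conditional distribution in this construction can be sanity-checked numerically. The sketch below covers only the shattered points $x_i$ with $i \leq d^{(0)}_{u,v} - 1$ and only the two coordinates $(y_u, y_v)$, since all other coordinates are fixed to 0. It assumes $0 \leq \gamma \leq 1/3$, which is what nonnegativity of $(1-3\gamma)/4$ requires; the function names are hypothetical.

```python
from itertools import product


def conditional(b1, b2, gamma):
    """P[(y_u, y_v) | x_i] for i <= d - 1, given the row (b1, b2) of B."""
    if not 0 <= gamma <= 1 / 3:
        raise ValueError("gamma must lie in [0, 1/3] for a valid distribution")
    flipped = (1 - b1, 1 - b2)  # the single labeling that receives less mass
    return {yy: (1 - 3 * gamma) / 4 if yy == flipped else (1 + gamma) / 4
            for yy in product((0, 1), repeat=2)}


def marginal(cond, coord):
    """P[y_coord = 1 | x_i], with coord = 0 for u and coord = 1 for v."""
    return sum(p for yy, p in cond.items() if yy[coord] == 1)


gamma = 0.2
cond = conditional(0, 1, gamma)
eta_u = marginal(cond, 0)  # bit B_i1 = 0, so this should equal (1 - gamma) / 2
eta_v = marginal(cond, 1)  # bit B_i2 = 1, so this should equal (1 + gamma) / 2
```

With these values the four probabilities sum to 1, `eta_u` is 0.4 and `eta_v` is 0.6 up to floating-point error, matching eq.(3). Each marginal sits at distance $\gamma/2$ from 1/2, on the side determined by the corresponding bit of $B$, which is why thresholding at 1/2 recovers the bit as in eq.(4).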

#### Reduction to estimation of binary strings.

For any distribution , we can further express the expected excess risk in eq.(2) as follows:

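$$\begin{aligned} R_{B,u,v}(A(S)) - R_{B,u,v}(f^*_{B,u,v}) &= \mathbb{E}_{(x,y) \sim \mathcal{D}_{\gamma,u,v,B}}\left[\sum_{j=1}^{l} (1 - 2y_j)\Big((\hat{f}_m(x))_j - (f^*_{B,u,v}(x))_j\Big)\right] \\ &= \sum_{j=1}^{l} \mathbb{E}_{x \sim \mathcal{D}^{(x)}_{\gamma,u,v,B}}\left[\mathbb{E}_{y_j \sim \mathcal{D}^{(y_j|x)}_{\gamma,u,v,B}}\left[(1 - 2y_j)\Big((\hat{f}_m(x))_j - (f^*_{B,u,v}(x))_j\Big)\right]\right] \\ &= \sum_{j=1}^{l} \mathbb{E}_{x \sim \mathcal{D}^{(x)}_{\gamma,u,v,B}}\left[\left|2\eta^{(\gamma,u,v,B)}_j(x) - 1\right| \cdot \left|(\hat{f}_m(x))_j - (f^*_{B,u,v}(x))_j\right|\right] \\ &\geq \gamma \cdot \mathbb{E}_{x \sim \mathcal{D}^{(x)}_{\gamma,u,v,B}}\left[\sum_{j=1}^{l} \left|(\hat{f}_m(x))_j - (f^*_{B,u,v}(x))_j\right|\right] && (5) \\ &= \gamma \cdot \sum_{i=1}^{d^{(0)}_{u,v}} \sum_{j=1}^{l} \left|(\hat{f}_m(x_i))_j - (f^*_{B,u,v}(x_i))_j\right| \cdot P^{(x)}_{\gamma,u,v,B}[x_i] \overset{\text{def}}{=} \gamma \cdot \big\|\hat{f}_m - f^*_{B,u,v}\big\|_{1,1}, && (6) \end{aligned}$$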

where denotes the expected risk and the Bayes-Hamming predictor, both with respect to . Here is the output of , with denoting the -th substructure of the output , and denotes the marginal probability . Equation (5) follows from our definition of (see eq.(3)), and the matrix norm in eq.(6) is computed with respect to . Thus, we have that:

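$$\begin{aligned} \mathfrak{M}_m(\mathcal{D}_{\gamma,u,v}) &= \inf_{\hat{f}_m} \max_{B \in \{0,1\}^{(d^{(0)}_{u,v}-1) \times 2}} \mathbb{E}_{B,u,v}\left[R_{B,u,v}(\hat{f}_m) - R_{B,u,v}(f^*_{B,u,v})\right] \\ &\geq \gamma \cdot \inf_{\hat{f}_m} \max_{B \in \{0,1\}^{(d^{(0)}_{u,v}-1) \times 2}} \mathbb{E}_{B,u,v}\left[\big\|\hat{f}_m - f^*_{B,u,v}\big\|_{1,1}\right], && (7) \end{aligned}$$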

where denotes the expectation with respect to . Equation (7) follows from eq.(6). Given any candidate estimation , let be defined as follows:

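$$\hat{B}_m \overset{\text{def}}{=} \operatorname*{arg\,min}_{B \in \{0,1\}^{(d^{(0)}_{u,v}-1) \times 2}} \big\|\hat{f}_m - f^*_{B,u,v}\big\|_{1,1}. \tag{8}$$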

Intuitively, is the binary matrix that indexes the element of which is the closest to in norm. Then, for any , we have

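$$\big\|f^*_{\hat{B}_m,u,v} - f^*_{B,u,v}\big\|_{1,1} \leq \big\|f^*_{\hat{B}_m,u,v} - \hat{f}_m\big\|_{1,1} + \big\|\hat{f}_m - f^*_{B,u,v}\big\|_{1,1} \leq 2\,\big\|\hat{f}_m - f^*_{B,u,v}\big\|_{1,1},$$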

where we first applied the triangle inequality, and then used eq.(8). Applying this to eq.(7), we obtain:

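$$\mathfrak{M}_m(\mathcal{D}_{\gamma,u,v}) \geq \frac{\gamma}{2} \inf_{\hat{B}_m} \max_{B \in \{0,1\}^{(d^{(0)}_{u,v}-1) \times 2}} \mathbb{E}_{B,u,v}\left[\big\|f^*_{\hat{B}_m,u,v} - f^*_{B,u,v}\big\|_{1,1}\right], \tag{9}$$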

here the infimum is over all estimators that take values in based on samples, i.e., over . We now compute for any two . Using eq.(4) we have:

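$$\begin{aligned} \big\|f^*_{B,u,v} - f^*_{B',u,v}\big\|_{1,1} &= \sum_{i=1}^{d^{(0)}_{u,v}} \sum_{j=1}^{l} \left|(f^*_{B,u,v}(x_i))_j - (f^*_{B',u,v}(x_i))_j\right| \cdot P^{(x)}_{\gamma,u,v,B}[x_i] \\ &= p \cdot \sum_{i=1}^{d^{(0)}_{u,v}-1} \sum_{j=1}^{2} \left|B_{ij} - B'_{ij}\right| = p \cdot L_H(B, B'). \end{aligned}$$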

In the last equality we abuse notation and consider the matrix as a vector of dimension . Replacing this result into eq.(9), we get:

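$$\mathfrak{M}_m(\mathcal{D}_{\gamma,u,v}) \geq \frac{p\gamma}{2} \inf_{\hat{B}_m} \max_{B \in \{0,1\}^{(d^{(0)}_{u,v}-1) \times 2}} \mathbb{E}_{B,u,v}\left[L_H(\hat{B}_m, B)\right],$$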

which is related to an estimation problem in the hypercube.
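The identity that turns the weighted matrix norm into a Hamming distance between index matrices can be verified on a small instance. The sketch below restricts the Bayes-Hamming outputs of eq.(4) to the coordinates $(u, v)$, since all other coordinates are zero for every index matrix, and weights the points by the marginal of the construction ($p$ on the first $d^{(0)}_{u,v} - 1$ points and the remaining mass on the last). The function names and the example matrices are hypothetical.

```python
def bayes_outputs(B):
    """Outputs of eq. (4) on coordinates (u, v): rows of B, then zeros on the last point."""
    return [tuple(row) for row in B] + [(0, 0)]


def weighted_l11(B, B_prime, p):
    """Sum over points of P[x_i] times the L1 distance between the two outputs."""
    d = len(B) + 1
    weights = [p] * (d - 1) + [1 - (d - 1) * p]
    pairs = zip(weights, bayes_outputs(B), bayes_outputs(B_prime))
    return sum(w * sum(abs(a - b) for a, b in zip(row_a, row_b))
               for w, row_a, row_b in pairs)


def hamming(B, B_prime):
    """Number of entries on which the two binary matrices differ."""
    return sum(a != b for ra, rb in zip(B, B_prime) for a, b in zip(ra, rb))


B = [(0, 1), (1, 1), (0, 0)]
B_prime = [(1, 1), (1, 0), (0, 0)]
p = 0.1

lhs = weighted_l11(B, B_prime, p)
rhs = p * hamming(B, B_prime)
```

The two matrices differ in two entries, so both sides equal 0.2 up to floating-point error. The last shattered point carries the leftover probability mass but contributes nothing to the distance, because every $f^*_{B,u,v}$ outputs zero there.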

In order to apply Assouad’s lemma, we need an upper bound on the squared Hellinger distance for all with . For any two we have: In the above summation, the inner sum is zero if . Since we are interested in and such that , this implies that for only one row from we have with exactly one bit different. Then, the Hellinger distance results in: Applying Assouad’s lemma we obtain: (10) Let , and noting that if then the condition holds. Replacing in eq.(10) we have: (11) If , and using the same construction as above with , we see that: (12) Therefore, combining equations (11) and (12), and since the choice of was arbitrary, we have that:

## 4 Relation of $\{0,1\}^2$-dimension to VC-dimension

In this section, we show a connection between the $\{0,1\}^2$-dimension defined above and the classical VC-dimension [25].

The following theorem shows that for a function class $\mathcal{G}$, the $\{0,1\}^2$-dimension of $\mathcal{G}$ is related to the minimum VC-dimension of a subclass of functions derived from $\mathcal{G}$.

###### Theorem 2.

Let be a function class. Let be four function classes defined as

 H11={h(⋅)=g(⋅)1g(⋅)2∣g∈G}, H