
# Maximum Entropy Discrimination Markov Networks

Jun Zhu (jun-zhu@mails.tsinghua.edu.cn)
State Key Lab of Intelligent Technology and Systems
Tsinghua National Lab for Information Science and Technology
Department of Computer Science and Technology
Tsinghua University

Eric P. Xing (epxing@cs.cmu.edu)
Machine Learning Department
Carnegie Mellon University
###### Abstract

Standard maximum margin structured prediction methods lack a straightforward probabilistic interpretation of the learning scheme and the prediction rule. Therefore their unique advantages such as dual sparseness and kernel tricks cannot be easily conjoined with the merits of a probabilistic model such as Bayesian regularization, model averaging, and allowing hidden variables. In this paper, we present a novel and general framework called Maximum Entropy Discrimination Markov Networks (MaxEnDNet), which integrates these two approaches and combines and extends their merits. Major innovations of this model include: 1) It generalizes the extant Markov network prediction rule based on a point estimator of weights to a Bayesian-style estimator that integrates over a learned distribution of the weights. 2) It extends the conventional max-entropy discrimination learning of classification rules to a new structural max-entropy discrimination paradigm of learning the distribution of Markov networks. 3) It subsumes the well-known and powerful Maximum Margin Markov network (M$^3$N) as a special case, and leads to a model similar to an $\ell_1$-regularized M$^3$N that is simultaneously primal and dual sparse, or to other types of Markov networks by plugging in different prior distributions of the weights. 4) It offers a simple inference algorithm that combines existing variational inference and convex-optimization based M$^3$N solvers as subroutines. 5) It offers a PAC-Bayesian style generalization bound. This work represents the first successful attempt to combine Bayesian-style learning (based on generative models) with structured maximum margin learning (based on a discriminative model), and outperforms a wide array of competing methods for structured input/output learning on both synthetic and real OCR and web data extraction data sets.


Editor: ?

Keywords: Maximum entropy discrimination Markov networks, Bayesian max-margin Markov networks, Laplace max-margin Markov networks, Structured prediction.

## 1 Introduction

Inferring structured predictions based on high-dimensional, often multi-modal and hybrid covariates remains a central problem in data mining (e.g., web-info extraction), machine intelligence (e.g., machine translation), and scientific discovery (e.g., genome annotation). Several recent approaches to this problem are based on learning discriminative graphical models defined on composite features that explicitly exploit the structured dependencies among input elements and structured interpretational outputs. Major instances of such models include the conditional random fields (CRFs) (Lafferty et al., 2001), Markov networks (MNs) (Taskar et al., 2003), and other specialized graphical models (Altun et al., 2003). Various paradigms for training such models based on different loss functions have been explored, including the maximum conditional likelihood learning (Lafferty et al., 2001) and the max-margin learning (Altun et al., 2003; Taskar et al., 2003; Tsochantaridis et al., 2004), with remarkable success.

The likelihood-based models for structured predictions are usually based on a joint distribution of both input and output variables (Rabiner, 1989) or a conditional distribution of the output given the input (Lafferty et al., 2001). Therefore this paradigm offers a flexible probabilistic framework that can naturally facilitate: hidden variables that capture latent semantics such as a generative hierarchy (Quattoni et al., 2004; Zhu et al., 2008a); Bayesian regularization that imposes desirable biases such as sparseness (Lee et al., 2006; Wainwright et al., 2006; Andrew and Gao, 2007); and Bayesian prediction based on combining predictions across all values of model parameters (i.e., model averaging), which can reduce the risk of overfitting. On the other hand, the margin-based structured prediction models leverage the maximum margin principle and convex optimization formulation underlying the support vector machines, and concentrate directly on the input-output mapping (Taskar et al., 2003; Altun et al., 2003; Tsochantaridis et al., 2004). In principle, this approach can lead to a robust decision boundary due to the dual sparseness (i.e., depending on only a few support vectors) and global optimality of the learned model. However, although arguably a more desirable paradigm for training highly discriminative structured prediction models in a number of application contexts, the lack of a straightforward probabilistic interpretation of the maximum-margin models makes them unable to offer the same flexibilities of likelihood-based models discussed above.

For example, for domains with a complex feature space, it is often desirable to pursue a “sparse” representation of the model that leaves out irrelevant features. In likelihood-based estimation, sparse model fitting has been extensively studied. A commonly used strategy is to add an $\ell_1$-penalty to the likelihood function, which can also be viewed as a MAP estimation under a Laplace prior. However, little progress has been made so far on learning sparse M$^3$Ns or log-linear models in general based on the maximum margin principle. While sparsity has been pursued in maximum margin learning of certain discriminative models such as the SVM that are “unstructured” (i.e., with a univariate output), by using $\ell_1$-regularization (Bennett and Mangasarian, 1992) or by adding a cardinality constraint (Chan et al., 2007), generalization of these techniques to structured output spaces turns out to be extremely non-trivial, as we discuss later in this paper. There is also very little theoretical analysis of the performance guarantees of margin-based models under direct $\ell_1$-regularization. Our empirical results as shown in this paper suggest that an $\ell_1$-regularized estimation, especially the likelihood based estimation, can be non-robust: discarding features that are not completely irrelevant can potentially hurt generalization ability.

In this paper, we propose a general theory of maximum entropy discrimination Markov networks (MaxEnDNet, or simply MEDN) for structured input/output learning and prediction. This formalism offers a formal paradigm for integrating both generative and discriminative principles and the Bayesian regularization techniques for learning structured prediction models. It integrates the spirit of maximum margin learning from the SVM, the design of discriminative structured prediction models in maximum margin Markov networks (M$^3$N), and the ideas of entropy regularization and model averaging in maximum entropy discrimination methods (Jaakkola et al., 1999). It allows one to learn a distribution of maximum margin structured prediction models that offers a wide range of important advantages over conventional models such as the M$^3$N, including more robust prediction due to an averaging prediction-function based on the learned distribution of models, Bayesian-style regularization that can lead to a model that is simultaneously primal and dual sparse, and allowance of hidden variables and semi-supervised learning based on partially labeled data.

While the formalism of MaxEnDNet is extremely general, our main focus and contributions in this paper will be concentrated on the following results. We will formally define the MaxEnDNet as solving a generalized entropy optimization problem subject to expected margin constraints due to the training data, and under an arbitrary prior of feature coefficients; and we offer a general closed-form solution to this problem. An interesting insight that immediately follows from this general solution is that a trivial assumption on the prior distribution of the coefficients, i.e., a standard normal, reduces the linear MaxEnDNet to the standard M$^3$N, as shown in Theorem 3. This understanding opens the way to use different priors for MaxEnDNet to achieve more interesting regularization effects. We show that, by using a Laplace prior for the feature coefficients, the resulting LapMEDN is effectively an M$^3$N that is not only dual sparse (i.e., defined by a few support vectors), but also primal sparse (i.e., with shrinkage on coefficients corresponding to irrelevant features). We develop a novel variational approximate learning method for the LapMEDN, which leverages the hierarchical representation of the Laplace prior (Figueiredo, 2003) and the reducibility of MaxEnDNet to M$^3$N, and combines the variational Bayesian technique with existing convex optimization algorithms developed for the M$^3$N (Taskar et al., 2003; Bartlett et al., 2004; Ratliff et al., 2007). We also provide a formal analysis of the generalization error of the MaxEnDNet, and prove a novel PAC-Bayes bound on the structured prediction error of MaxEnDNet.
We performed a thorough comparison of the Laplace MaxEnDNet with several competing methods, including the M$^3$N (i.e., the Gaussian MaxEnDNet), the $\ell_1$-regularized M$^3$N (a model that has not yet been reported in the literature and represents another new extension of the M$^3$N, which we will present in detail in a separate paper), CRFs, $\ell_1$-regularized CRFs, and $\ell_2$-regularized CRFs, on both synthetic and real structured input/output data. The Laplace MaxEnDNet exhibits mostly superior, and sometimes comparable, performance in all scenarios tested.

The rest of the paper is structured as follows. In the next section, we review the basic structured prediction formalism and set the stage for our model. Section 3 presents the general theory of maximum entropy discrimination Markov networks and some basic theoretical results, followed by two instantiations of the general MaxEnDNet, the Gaussian MaxEnDNet and the Laplace MaxEnDNet. Section 4 offers a detailed discussion of the primal and dual sparsity property of Laplace MaxEnDNet. Section 5 presents a novel iterative learning algorithm based on variational approximation and convex optimization. In Section 6, we briefly discuss the generalization bound of MaxEnDNet. Then, we show empirical results on both synthetic and real OCR and web data extraction data sets in Section 7. Section 8 discusses some related work and Section 9 concludes this paper.

## 2 Preliminaries

In structured prediction problems such as natural language parsing, image annotation, or DNA decoding, one aims to learn a function $h:\mathcal{X}\to\mathcal{Y}$ that maps a structured input $x\in\mathcal{X}$, e.g., a sentence or an image, to a structured output $y\in\mathcal{Y}$, e.g., a sentence parsing or a scene annotation, where, unlike a standard classification problem, $y$ is a multivariate prediction consisting of multiple labeling elements. Let $L$ denote the cardinality of the output, and $m_l$, where $l=1,\dots,L$, denote the arity of each element; then $\mathcal{Y}=\mathcal{Y}_1\times\cdots\times\mathcal{Y}_L$ with $\mathcal{Y}_l=\{a_1,\dots,a_{m_l}\}$ represents a combinatorial space of structured interpretations of the multi-facet objects in the inputs. For example, $\mathcal{Y}$ could correspond to the space of all possible instantiations of the parse trees of a sentence, or the space of all possible ways of labeling entities over some segmentation of an image. The prediction $y\equiv(y_1,\dots,y_L)$ is structured because each individual label $y_l\in\mathcal{Y}_l$ within $y$ must be determined in the context of the other labels, rather than independently as in classification, in order to arrive at a globally satisfactory and consistent prediction.

Let $F:\mathcal{X}\times\mathcal{Y}\to\mathbb{R}$ represent a discriminant function over the input-output pairs from which one can define the predictive function, and let $\mathcal{H}$ denote the space of all possible $F$. A common choice of $F$ is a linear model, $F(x,y;\mathbf{w})=g(\mathbf{w}^\top\mathbf{f}(x,y))$, where $\mathbf{f}=[f_1\ \dots\ f_K]^\top$ is a $K$-dimensional column vector of the feature functions $f_k:\mathcal{X}\times\mathcal{Y}\to\mathbb{R}$, and $\mathbf{w}$ is the corresponding vector of the weights of the feature functions. Typically, a structured prediction model chooses an optimal estimate $\mathbf{w}^\star$ by minimizing some loss function $J(\mathbf{w})$, and defines a predictive function in terms of an optimization problem that maximizes $F(\,\cdot\,;\mathbf{w}^\star)$ over the response variable $y$ given an input $x$:

$$h_0(x;\mathbf{w}^\star)=\operatorname*{argmax}_{y\in\mathcal{Y}(x)} F(x,y;\mathbf{w}^\star), \qquad (1)$$

where $\mathcal{Y}(x)\subseteq\mathcal{Y}$ is the feasible subset of structured labels for the input $x$. Here, we assume that $\mathcal{Y}(x)$ is finite for any $x$.
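For concreteness, the prediction rule of Eq. (1) can be sketched in a few lines of Python for a toy problem with an enumerable $\mathcal{Y}(x)$. The feature function and weights below are hypothetical, chosen only to illustrate the argmax over a finite label space.

```python
import itertools
import numpy as np

def predict(x, w, feature_fn, label_space):
    """Prediction rule of Eq. (1): pick the labeling y in Y(x) that
    maximizes the linear discriminant F(x, y; w) = w^T f(x, y)."""
    scores = [(y, float(w @ feature_fn(x, y))) for y in label_space]
    return max(scores, key=lambda pair: pair[1])[0]

# Hypothetical features for a length-3 binary labeling task: one feature
# rewards agreement between input signs and labels, one rewards agreement
# between adjacent labels.
def feature_fn(x, y):
    node = sum(a * b for a, b in zip(x, y))
    edge = sum(a * b for a, b in zip(y, y[1:]))
    return np.array([node, edge])

label_space = list(itertools.product([-1, 1], repeat=3))  # Y(x), finite
x = (2.0, -0.1, 1.5)
w = np.array([1.0, 0.5])   # favors sign agreement and label smoothness
print(predict(x, w, feature_fn, label_space))   # -> (1, 1, 1)
```

Note how the smoothness feature overrides the weakly negative middle input: the globally best labeling is (1, 1, 1), not the element-wise sign (1, -1, 1), which is exactly the context dependence described above.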

Depending on the specific choice of $F(\,\cdot\,;\mathbf{w})$ (e.g., linear, or log linear), and of the loss function $J(\mathbf{w})$ for estimating the parameter $\mathbf{w}$ (e.g., likelihood, or margin), incarnations of the general structured prediction formalism described above can be seen in classical generative models such as the HMM (Rabiner, 1989), where $F$ can be an exponential family distribution function and $J$ is the joint likelihood of the input and its labeling; in recent discriminative models such as the CRFs (Lafferty et al., 2001), where $F$ is a Boltzmann machine and $J$ is the conditional likelihood of the structured labeling given the input; and in the M$^3$N (Taskar et al., 2003), where $F$ is an identity function and $J$ is the margin between the true labeling and any other feasible labeling in $\mathcal{Y}(x)$. Our approach toward a more general discriminative training is based on a maximum entropy principle that allows an elegant combination of discriminative maximum margin learning with generative Bayesian regularization and hierarchical modeling, and we consider the more general problem of finding a distribution over $\mathcal{H}$ that enables a convex combination of discriminant functions for robust structured prediction.

Before delving into the exposition of the proposed approach, we end this section with a brief recapitulation of the basic M$^3$N, upon which the proposed approach is built. Under a max-margin framework, given a set of fully observed training data $\mathcal{D}=\{\langle x^i,y^i\rangle\}_{i=1}^N$, we obtain a point estimate of the weight vector $\mathbf{w}$ by solving the following max-margin problem P0 (Taskar et al., 2003):

$$\textrm{P0 (M}^3\textrm{N)}:\quad \min_{\mathbf{w},\xi}\ \frac{1}{2}\|\mathbf{w}\|^2+C\sum_{i=1}^N\xi_i \quad \textrm{s.t.}\ \forall i,\ \forall y\neq y^i:\ \mathbf{w}^\top\Delta\mathbf{f}_i(y)\geq\Delta\ell_i(y)-\xi_i,\ \xi_i\geq 0,$$

where $\Delta\mathbf{f}_i(y)=\mathbf{f}(x^i,y^i)-\mathbf{f}(x^i,y)$ and $\Delta F_i(y;\mathbf{w})=\mathbf{w}^\top\Delta\mathbf{f}_i(y)$ is the “margin” between the true label $y^i$ and a prediction $y$, $\Delta\ell_i(y)$ is a loss function with respect to $y^i$, and $\xi_i$ represents a slack variable that absorbs errors in the training data. Various loss functions have been proposed in the literature (Tsochantaridis et al., 2004). In this paper, we adopt the Hamming loss used in (Taskar et al., 2003): $\Delta\ell_i(y)=\sum_{j=1}^L\mathbb{I}(y_j\neq y^i_j)$, where $\mathbb{I}(\cdot)$ is an indicator function that equals one if the argument is true and zero otherwise. The optimization problem P0 is intractable because the feasible space for $\mathbf{w}$,

$$\mathcal{F}_0=\{\mathbf{w}:\ \mathbf{w}^\top\Delta\mathbf{f}_i(y)\geq\Delta\ell_i(y)-\xi_i;\ \forall i,\ \forall y\neq y^i\},$$

is defined by $O(N|\mathcal{Y}|)$ constraints, and $|\mathcal{Y}|$ itself is exponential in the size of the input $x$. Exploiting sparse dependencies among individual labels $y_l$ in $y$, as reflected in the specific design of the feature functions (e.g., based on pair-wise labeling potentials in a pair-wise Markov network), and the convex duality of the objective, efficient optimization algorithms based on cutting-plane (Tsochantaridis et al., 2004) or message-passing (Taskar et al., 2003) methods have been proposed to obtain an approximate optimum solution to P0. As described shortly, these algorithms can be directly employed as subroutines in solving our proposed model.
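The role such solvers play can be made concrete with a minimal subgradient sketch in the spirit of Ratliff et al. (2007), assuming a label space small enough for brute-force loss-augmented inference. The toy data, feature function, and step-size schedule below are all illustrative assumptions, not the authors' implementation.

```python
import itertools
import numpy as np

def hamming(y_true, y):
    """Hamming loss: sum_j I(y_j != y^i_j)."""
    return sum(a != b for a, b in zip(y_true, y))

def subgradient_m3n(examples, feature_fn, label_space, C=1.0, T=200, eta=0.5):
    """Subgradient descent on the unconstrained form of P0,
    1/2 ||w||^2 + C sum_i max_y [Dl_i(y) - w^T Dfi(y)]_+ ,
    using brute-force loss-augmented inference over a small label space."""
    K = len(feature_fn(*examples[0]))
    w = np.zeros(K)
    for t in range(T):
        g = w.copy()                       # gradient of the 1/2||w||^2 term
        for x, y_true in examples:
            # loss-augmented inference: argmax_y  w^T f(x,y) + Dl(y)
            y_star = max(label_space,
                         key=lambda y: float(w @ feature_fn(x, y))
                                       + hamming(y_true, y))
            if y_star != y_true:           # hinge term is active (violation)
                g += C * (feature_fn(x, y_star) - feature_fn(x, y_true))
        w -= eta / (1.0 + t) * g           # decaying step size
    return w

# Toy problem: one training pair, a single "sign agreement" feature.
feature_fn = lambda x, y: np.array([sum(a * b for a, b in zip(x, y))])
label_space = list(itertools.product([-1, 1], repeat=2))
w = subgradient_m3n([((1.0, -1.0), (1, -1))], feature_fn, label_space)
print(w[0] > 0)   # the learned weight separates the training pair
```

On this toy instance the weight settles near the smallest value satisfying the margin constraints; in practice the brute-force argmax is replaced by the message-passing or cutting-plane inference cited above.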

## 3 Maximum Entropy Discrimination Markov Networks

Instead of learning a point estimator of $\mathbf{w}$ as in the M$^3$N, in this paper we take a Bayesian-style approach and learn a distribution $p(\mathbf{w})$ in a max-margin manner. For prediction, we employ a convex combination of all possible models $F(\,\cdot\,;\mathbf{w})\in\mathcal{H}$ based on $p(\mathbf{w})$, that is:

$$h_1(x)=\operatorname*{argmax}_{y\in\mathcal{Y}(x)}\int p(\mathbf{w})F(x,y;\mathbf{w})\,d\mathbf{w}. \qquad (2)$$

Now, the open question underlying this averaging prediction rule is how we can devise an appropriate loss function and constraints over $p(\mathbf{w})$, in a similar spirit to the margin-based scheme over $\mathbf{w}$ in P0, that lead to an optimum estimate of $p(\mathbf{w})$. In the sequel, we present Maximum Entropy Discrimination Markov Networks (MaxEnDNet, or MEDN), a novel framework that facilitates the estimation of a Bayesian-style regularized distribution of M$^3$Ns defined by $p(\mathbf{w})$. As we show below, this new Bayesian-style max-margin learning formalism offers several advantages such as simultaneous primal and dual sparsity, a PAC-Bayesian generalization guarantee, and estimation robustness. Note that the MaxEnDNet is different from traditional Bayesian methods for discriminative structured prediction such as the Bayesian CRFs (Qi et al., 2005), where the likelihood function is well defined. Our approach is "Bayesian-style" because it learns and uses a "posterior" distribution of all predictive models instead of choosing one model according to some criterion, but the learning algorithm is not based on the Bayes theorem; it is based on a maximum entropy principle that biases towards a posterior that makes fewer additional assumptions over a given prior over the predictive models.
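To make the averaging rule of Eq. (2) concrete: with a linear $F$, the integral can be approximated by Monte Carlo samples of $\mathbf{w}$, and by linearity it collapses to scoring with the posterior mean. The $\mathcal{N}(\mu,I)$ "posterior" and toy features below are hypothetical, for illustration only.

```python
import itertools
import numpy as np

def h1(x, w_samples, feature_fn, label_space):
    """Averaging prediction rule of Eq. (2), Monte Carlo approximation:
    argmax_y (1/M) sum_m F(x, y; w_m), with F(x, y; w) = w^T f(x, y).
    By linearity, averaging scores equals scoring with the sample mean."""
    w_bar = w_samples.mean(axis=0)
    return max(label_space, key=lambda y: float(w_bar @ feature_fn(x, y)))

def feature_fn(x, y):
    node = sum(a * b for a, b in zip(x, y))
    edge = sum(a * b for a, b in zip(y, y[1:]))
    return np.array([node, edge])

rng = np.random.default_rng(0)
mu = np.array([1.0, 0.5])                         # hypothetical posterior mean
w_samples = rng.normal(mu, 1.0, size=(5000, 2))   # draws from N(mu, I)
label_space = list(itertools.product([-1, 1], repeat=3))
x = (2.0, -0.1, 1.5)
print(h1(x, w_samples, feature_fn, label_space))  # matches the point rule at mu
```

For non-linear $F$, or for non-Gaussian posteriors where Section 3.3's shrinkage matters, the sample average no longer reduces to a point rule, which is where the averaging prediction genuinely differs from Eq. (1).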

### 3.1 Structured Maximum Entropy Discrimination

Given a training set $\mathcal{D}$ of structured input-output pairs, analogous to the feasible space $\mathcal{F}_0$ for the weight vector $\mathbf{w}$ in a standard M$^3$N (c.f., problem P0), we define the feasible subspace $\mathcal{F}_1$ for the weight distribution $p(\mathbf{w})$ by a set of expected margin constraints:

$$\mathcal{F}_1=\Big\{p(\mathbf{w}):\int p(\mathbf{w})[\Delta F_i(y;\mathbf{w})-\Delta\ell_i(y)]\,d\mathbf{w}\geq-\xi_i,\ \forall i,\ \forall y\neq y^i\Big\}.$$

We learn the optimum $p(\mathbf{w})$ from $\mathcal{F}_1$ based on a structured maximum entropy discrimination principle generalized from (Jaakkola et al., 1999). Under this principle, the optimum $p(\mathbf{w})$ corresponds to the distribution that minimizes its relative entropy with respect to some chosen prior $p_0(\mathbf{w})$, as measured by the Kullback-Leibler divergence between $p$ and $p_0$: $KL(p\|p_0)=\langle\log(p/p_0)\rangle_p$, where $\langle\cdot\rangle_p$ denotes the expectation with respect to $p$. If $p_0$ is uniform, then minimizing this KL-divergence is equivalent to maximizing the entropy $H(p)$. A natural information-theoretic interpretation of this formulation is that we favor a distribution over the hypothesis class $\mathcal{H}$ that bears minimum assumptions among all feasible distributions in $\mathcal{F}_1$. The prior $p_0$ is a regularizer that introduces an appropriate bias, if necessary.

To accommodate non-separable cases in the discriminative prediction problem, instead of minimizing the usual KL-divergence, we optimize the generalized entropy (Dudík et al., 2007; Lebanon and Lafferty, 2001), or a regularized KL-divergence, $KL(p(\mathbf{w})\|p_0(\mathbf{w}))+U(\xi)$, where $U(\xi)$ is a closed proper convex function over the slack variables $\xi$. This term can be understood as an additional “potential” in the maximum entropy principle. Putting everything together, we can now state a general formalism based on the following Maximum Entropy Discrimination Markov Network framework:

###### Definition 1

(Maximum Entropy Discrimination Markov Networks) Given training data $\mathcal{D}=\{\langle x^i,y^i\rangle\}_{i=1}^N$, a chosen form of discriminant function $F(x,y;\mathbf{w})$, a loss function $\Delta\ell(y)$, and an ensuing feasible subspace $\mathcal{F}_1$ (defined above) for the parameter distribution $p(\mathbf{w})$, the MaxEnDNet model that leads to a prediction function of the form of Eq. (2) is defined by the following generalized relative entropy minimization with respect to a parameter prior $p_0(\mathbf{w})$:

$$\textrm{P1 (MaxEnDNet)}:\quad \min_{p(\mathbf{w}),\xi}\ KL(p(\mathbf{w})\|p_0(\mathbf{w}))+U(\xi) \quad \textrm{s.t.}\ p(\mathbf{w})\in\mathcal{F}_1,\ \xi_i\geq 0,\ \forall i.$$

The P1 defined above is a variational optimization problem over $p(\mathbf{w})$ in a subspace of valid parameter distributions. Since both the KL-divergence and the function $U$ in P1 are convex, and the constraints in $\mathcal{F}_1$ are linear in $p(\mathbf{w})$, P1 is a convex program. In addition, the expectations $\langle\Delta F_i(y;\mathbf{w})\rangle_p$ are required to be bounded in order for $F$ to be a meaningful model. Thus, the problem P1 satisfies Slater's condition (since the $\langle\Delta F_i(y;\mathbf{w})\rangle_p$ are bounded and $\xi_i\geq 0$, there always exists a $\xi$ large enough for the pair $(p(\mathbf{w}),\xi)$ to satisfy it) (Boyd and Vandenberghe, 2004, chap. 5), which together with the convexity makes P1 enjoy nice properties, such as strong duality and the existence of solutions. The problem P1 can be solved by applying the calculus of variations to the Lagrangian to obtain a variational extremum, followed by a dual transformation of P1. We state the main results below as a theorem, followed by a brief proof that lends many insights into the solution to P1, which we will explore in subsequent analysis.

###### Theorem 2 (Solution to MaxEnDNet)

The variational optimization problem P1 underlying the MaxEnDNet gives rise to the following optimum distribution of Markov network parameters $\mathbf{w}$:

$$p(\mathbf{w})=\frac{1}{Z(\alpha)}p_0(\mathbf{w})\exp\Big\{\sum_{i,y\neq y^i}\alpha_i(y)[\Delta F_i(y;\mathbf{w})-\Delta\ell_i(y)]\Big\}, \qquad (3)$$

where $Z(\alpha)$ is a normalization factor and the Lagrange multipliers $\alpha_i(y)$ (corresponding to the constraints in $\mathcal{F}_1$) can be obtained by solving the dual problem of P1:

$$\textrm{D1}:\quad \max_{\alpha}\ -\log Z(\alpha)-U^\star(\alpha) \quad \textrm{s.t.}\ \alpha_i(y)\geq 0,\ \forall i,\ \forall y\neq y^i,$$

where $U^\star(\cdot)$ is the conjugate of the slack function $U(\cdot)$, i.e., $U^\star(\alpha)=\sup_{\xi}\big(\sum_{i,y\neq y^i}\alpha_i(y)\xi_i-U(\xi)\big)$.

Proof  (sketch) Since the problem P1 is a convex program and satisfies Slater's condition, we can form a Lagrange function, whose saddle point gives the optimal solution of P1 and D1, by introducing a non-negative dual variable $\alpha_i(y)$ for each constraint in $\mathcal{F}_1$ and another dual variable for the normalization constraint $\int p(\mathbf{w})\,d\mathbf{w}=1$. Details are deferred to Appendix B.1.

Since the problem P1 is a convex program and satisfies Slater's condition, the saddle point of the Lagrange function is the KKT point of P1. From the KKT conditions (Boyd and Vandenberghe, 2004, chap. 5), it can be shown that the above solution enjoys dual sparsity; that is, only a few Lagrange multipliers $\alpha_i(y)$ will be non-zero, corresponding to the active constraints whose equality holds, analogous to the support vectors in the SVM. Thus MaxEnDNet enjoys a similar generalization property as the M$^3$N and SVM due to the small “effective size” of the margin constraints. But it is important to realize that this does not mean that the learned model is “primal-sparse”, i.e., that only a few elements in the weight vector $\mathbf{w}$ are non-zero. We will return to this point in Section 4.

For a closed proper convex function $f(x)$, its conjugate is defined as $f^\star(y)=\sup_x(y^\top x-f(x))$. In the problem D1, by convex duality (Boyd and Vandenberghe, 2004), the log normalizer $\log Z(\alpha)$ can be shown to be the conjugate of the KL-divergence. If the slack function is $U(\xi)=C\sum_i\xi_i$, it is easy to show that $U^\star(\alpha)=\mathbb{I}_\infty\big(\sum_{y\neq y^i}\alpha_i(y)\leq C,\ \forall i\big)$, where $\mathbb{I}_\infty(\cdot)$ is a function that equals zero when its argument holds true and infinity otherwise. Here, the strict inequality corresponds to the trivial solution $\xi=0$, that is, the training data are perfectly separable. Ignoring this inequality does not affect the solution since the special case $\xi=0$ is still included. Thus, the Lagrange multipliers $\alpha_i(y)$ in the dual problem D1 comply with the set of constraints $\sum_{y\neq y^i}\alpha_i(y)=C,\ \forall i$. Another example is $U(\xi)=KL(p(\xi)\|p_0(\xi))$, obtained by introducing uncertainty on the slack variables (Jaakkola et al., 1999). In this case, expectations with respect to $p(\xi)$ are taken on both sides of all the constraints in $\mathcal{F}_1$. Taking the dual, the dual function of $U$ is another log normalizer. More details can be found in (Jaakkola et al., 1999). Some other functions and their dual functions are studied in (Lebanon and Lafferty, 2001; Dudík et al., 2007).
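The conjugate of the linear slack function can be checked numerically: for $U(\xi)=C\sum_i\xi_i$ with $\xi\geq 0$, the supremum $\sup_{\xi\geq 0}\big(\sum_i\alpha_i\xi_i-C\sum_i\xi_i\big)$ is zero whenever every $\alpha_i\leq C$ and grows without bound otherwise. A grid-based sketch (the grid bound is an arbitrary assumption standing in for $+\infty$):

```python
import numpy as np

def U_conjugate_numeric(alpha, C, grid_max=100.0, n=2001):
    """Approximate U*(alpha) = sup_{xi >= 0} (alpha^T xi - C sum_i xi)
    for the linear slack function U(xi) = C sum_i xi.  The sup separates
    over coordinates, so we maximize (alpha_i - C) * xi per slack."""
    xi = np.linspace(0.0, grid_max, n)
    return sum(np.max((a - C) * xi) for a in alpha)

C = 2.0
print(U_conjugate_numeric([1.0, 1.5], C))  # alpha <= C: sup attained at xi = 0, value 0.0
print(U_conjugate_numeric([2.5], C))       # alpha > C: value grows linearly with grid_max
```

This is exactly the indicator-function behavior stated above: the dual objective is finite only on the region $\sum_{y\neq y^i}\alpha_i(y)\leq C$.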

Unlike most extant structured discriminative models, including the highly successful M$^3$N, which rely on a point estimator of the parameters, the MaxEnDNet model derived above gives an optimum parameter distribution, which is used to make prediction via the rule (2). Indeed, as we will show shortly, the MaxEnDNet is strictly more general than the M$^3$N and subsumes the latter as a special case. But more importantly, the MaxEnDNet in its full generality offers a number of important advantages while retaining all the merits of the M$^3$N. First, MaxEnDNet admits a prior that can be designed to introduce useful regularization effects, such as a primal sparsity bias. Second, the MaxEnDNet prediction is based on model averaging and therefore enjoys a desirable smoothing effect, with a uniform convergence bound on generalization error. Third, MaxEnDNet offers a principled way to incorporate hidden generative models underlying the structured predictions, while allowing the predictive model to be discriminatively trained based on partially labeled data. In the sequel, we analyze the first two points in detail; exploration of the third point is beyond the scope of this paper, and can be found in (Zhu et al., 2008c), where a partially observed MaxEnDNet (PoMEN) is developed, which combines a (possibly latent) generative model and discriminative training for structured prediction.

### 3.2 Gaussian MaxEnDNet

As Eq. (3) suggests, different choices of the parameter prior $p_0(\mathbf{w})$ can lead to different MaxEnDNet models for the predictive parameter distribution. In this subsection and the following one, we explore a few common choices, e.g., the Gaussian and Laplace priors.

We first show that, when the parameter prior is set to be a standard normal, MaxEnDNet leads to a predictor that is identical to that of the M$^3$N. This somewhat surprising reduction offers an important insight for understanding the properties of MaxEnDNet. Indeed, this result should not be totally unexpected given the striking isomorphism of the optimization problem P1, the feasible space $\mathcal{F}_1$, and the predictive function $h_1$ underlying a MaxEnDNet to their counterparts P0, $\mathcal{F}_0$, and $h_0$, respectively, underlying an M$^3$N. The following theorem makes our claim explicit.

###### Theorem 3 (Gaussian MaxEnDNet: Reduction of MEDN to M3N)

Assuming $F(x,y;\mathbf{w})=\mathbf{w}^\top\mathbf{f}(x,y)$, $U(\xi)=C\sum_i\xi_i$, and $p_0(\mathbf{w})=\mathcal{N}(\mathbf{w}|0,I)$, where $I$ denotes the identity matrix, the posterior distribution is $p(\mathbf{w})=\mathcal{N}(\mathbf{w}|\mu,I)$, where $\mu=\sum_{i,y\neq y^i}\alpha_i(y)\Delta\mathbf{f}_i(y)$, and the Lagrange multipliers $\alpha_i(y)$ in $p(\mathbf{w})$ are obtained by solving the following dual problem, which is isomorphic to the dual form of the M$^3$N:

$$\max_{\alpha}\ \sum_{i,y\neq y^i}\alpha_i(y)\Delta\ell_i(y)-\frac{1}{2}\Big\|\sum_{i,y\neq y^i}\alpha_i(y)\Delta\mathbf{f}_i(y)\Big\|^2 \quad \textrm{s.t.}\ \sum_{y\neq y^i}\alpha_i(y)=C;\ \alpha_i(y)\geq 0,\ \forall i,\ \forall y\neq y^i,$$

where $\Delta\mathbf{f}_i(y)=\mathbf{f}(x^i,y^i)-\mathbf{f}(x^i,y)$ as in P0. When applied to $h_1$, $p(\mathbf{w})$ leads to a predictive function that is identical to $h_0(x;\mathbf{w}^\star)$ given by Eq. (1).

Proof  See Appendix B.2 for details.

The above theorem is stated in the duality form. We can also show the following equivalence in the primal form.

###### Corollary 4

Under the same assumptions as in Theorem 3, the mean $\mu$ of the posterior distribution $p(\mathbf{w})$ under a Gaussian MaxEnDNet is obtained by solving the following primal problem:

$$\min_{\mu,\xi}\ \frac{1}{2}\mu^\top\mu+C\sum_{i=1}^N\xi_i \quad \textrm{s.t.}\ \mu^\top\Delta\mathbf{f}_i(y)\geq\Delta\ell_i(y)-\xi_i;\ \xi_i\geq 0,\ \forall i,\ \forall y\neq y^i.$$

Proof  See Appendix B.3 for details.

Theorem 3 and Corollary 4 both show that in the supervised learning setting, the M$^3$N is a special case of MaxEnDNet when the slack function is linear and the parameter prior is a standard normal. As we shall see later, this connection renders many existing techniques for solving the M$^3$N directly applicable to solving the MaxEnDNet.

### 3.3 Laplace MaxEnDNet

Recent trends in pursuing “sparse” graphical models have led to the emergence of $\ell_1$-regularized versions of CRFs (Andrew and Gao, 2007) and Markov networks (Lee et al., 2006; Wainwright et al., 2006). Interestingly, while such extensions have been successfully implemented by several authors in maximum likelihood learning of various sparse graphical models, they have not yet been explored in the context of maximum margin learning. Such a gap is not merely due to negligence: learning a sparse M$^3$N can be significantly harder, as we discuss below.

One possible way to learn a sparse M$^3$N is to adopt the strategy of the $\ell_1$-SVM (Bennett and Mangasarian, 1992; Zhu et al., 2004) and directly use an $\ell_1$-norm instead of the $\ell_2$-norm of $\mathbf{w}$ in the loss function (see Appendix A for a detailed description of this formulation and the duality derivation). However, the primal problem of an $\ell_1$-regularized M$^3$N is not directly solvable by re-formulating it as an LP problem due to the exponential number of constraints; solving the dual problem, which now has only a polynomial number of constraints as in the dual of the M$^3$N, is still non-trivial due to the complicated form of the constraints, as shown in Appendix A. Constraint generation methods are possible. However, although such methods (Tsochantaridis et al., 2004) have been shown to be efficient for solving the QP problem in the standard M$^3$N, our preliminary empirical results show that such a scheme with an LP solver for the $\ell_1$-regularized M$^3$N can be extremely expensive for a non-trivial real data set. Another possible solution is the gradient descent method (Ratliff et al., 2007) with a projection onto an $\ell_1$-ball (Duchi et al., 2008).

The MaxEnDNet interpretation of the M$^3$N offers an alternative strategy that resembles Bayesian regularization (Tipping, 2001; Kaban, 2007) in maximum likelihood estimation, where shrinkage effects can be introduced by appropriate priors over the model parameters. As Theorem 3 reveals, an M$^3$N corresponds to a Gaussian MaxEnDNet that admits a standard normal prior for the weight vector $\mathbf{w}$. According to the standard Bayesian regularization theory, to achieve a sparse estimate of a model, the posterior distribution of the weights of irrelevant features should peak around zero with very small variances. However, the isotropy of the variances in all dimensions of the feature space under a standard normal prior makes it infeasible for the resulting M$^3$N to adjust the variances in different dimensions to fit a sparse model. Alternatively, we now employ a Laplace prior for $\mathbf{w}$ to learn a Laplace MaxEnDNet. We show in the sequel that the parameter posterior under a Laplace MaxEnDNet has a shrinkage effect on small weights, which is similar to directly applying an $\ell_1$-regularizer on an M$^3$N. Although exact learning of a Laplace MaxEnDNet is also intractable, we show that this model can be efficiently approximated by a variational inference procedure based on existing methods.

The Laplace prior of $\mathbf{w}$ is expressed as $p_0(\mathbf{w})=\prod_{k=1}^K\frac{\sqrt{\lambda}}{2}e^{-\sqrt{\lambda}|w_k|}=\big(\frac{\sqrt{\lambda}}{2}\big)^K e^{-\sqrt{\lambda}\|\mathbf{w}\|_1}$. This density function is heavy-tailed and peaked at zero; thus, it encodes a prior belief that the distribution of $\mathbf{w}$ is strongly peaked around zero. Another nice property of the Laplace density is that it is log-concave, i.e., its negative logarithm is convex, which can be exploited to obtain a convex estimation problem analogous to the LASSO (Tibshirani, 1996).

###### Theorem 5 (Laplace MaxEnDNet: a sparse M3N)

Assuming $F(x,y;\mathbf{w})=\mathbf{w}^\top\mathbf{f}(x,y)$, $U(\xi)=C\sum_i\xi_i$, and $p_0(\mathbf{w})=\prod_{k=1}^K\frac{\sqrt{\lambda}}{2}e^{-\sqrt{\lambda}|w_k|}$, the Lagrange multipliers $\alpha_i(y)$ in $p(\mathbf{w})$ (as defined in Theorem 2) are obtained by solving the following dual problem:

$$\max_{\alpha}\ \sum_{i,y\neq y^i}\alpha_i(y)\Delta\ell_i(y)-\sum_{k=1}^K\log\frac{\lambda}{\lambda-\eta_k^2} \quad \textrm{s.t.}\ \sum_{y\neq y^i}\alpha_i(y)=C;\ \alpha_i(y)\geq 0,\ \forall i,\ \forall y\neq y^i,$$

where $\eta=\sum_{i,y\neq y^i}\alpha_i(y)\Delta\mathbf{f}_i(y)$, and $\eta_k$ represents the $k$th component of $\eta$. Furthermore, the constraints $\eta_k^2<\lambda,\ \forall k$, must be satisfied.

Since several intermediate results from the proof of this theorem will be used in subsequent presentations, we provide the complete proof below. Our proof is based on a hierarchical representation of the Laplace prior. As noted in (Figueiredo, 2003), the Laplace distribution is equivalent to a two-layer hierarchical Gaussian-exponential model, where $w_k$ follows a zero-mean Gaussian distribution $p(w_k|\tau_k)=\mathcal{N}(w_k|0,\tau_k)$ and the variance $\tau_k$ admits an exponential hyper-prior density,

$$p(\tau_k|\lambda)=\frac{\lambda}{2}\exp\Big\{-\frac{\lambda}{2}\tau_k\Big\},\quad\textrm{for}\ \tau_k\geq 0.$$

This alternative form straightforwardly leads to the following new representation of our multivariate Laplace prior for the parameter vector $\mathbf{w}$ in MaxEnDNet:

$$p_0(\mathbf{w})=\prod_{k=1}^K p_0(w_k)=\prod_{k=1}^K\int p(w_k|\tau_k)p(\tau_k|\lambda)\,d\tau_k=\int p(\mathbf{w}|\tau)p(\tau|\lambda)\,d\tau, \qquad (4)$$

where $p(\mathbf{w}|\tau)=\prod_{k=1}^K\mathcal{N}(w_k|0,\tau_k)$ and $p(\tau|\lambda)=\prod_{k=1}^K p(\tau_k|\lambda)$ represent multivariate Gaussian and exponential distributions, respectively, and $\tau=[\tau_1\cdots\tau_K]^\top$.
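The Gaussian-exponential construction of Eq. (4) can be verified numerically: sampling $\tau_k$ from the exponential hyper-prior with rate $\lambda/2$ and then $w_k\sim\mathcal{N}(0,\tau_k)$ (with $\tau_k$ playing the role of the variance) should reproduce the Laplace marginal $\frac{\sqrt{\lambda}}{2}e^{-\sqrt{\lambda}|w_k|}$, whose variance is $2/\lambda$ and whose tail is $P(|w_k|>t)=e^{-\sqrt{\lambda}t}$. A quick Monte Carlo sanity check (the value $\lambda=4$ is an arbitrary assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 4.0
M = 200_000

# Two-layer hierarchical sampler of Eq. (4):
# tau ~ Exponential(rate lam/2), then w | tau ~ N(0, tau), tau = variance.
tau = rng.exponential(scale=2.0 / lam, size=M)   # mean 2/lam <=> rate lam/2
w = rng.normal(0.0, np.sqrt(tau))

# Marginal of w should be Laplace with density (sqrt(lam)/2) e^{-sqrt(lam)|w|}.
print(w.var(), 2.0 / lam)                        # both close to 0.5
t = 1.0
print((np.abs(w) > t).mean(), np.exp(-np.sqrt(lam) * t))  # both close to e^{-2}
```

The matching variance and exponential tail confirm that integrating out the exponential scale turns the light-tailed Gaussian layer into the heavy-tailed Laplace prior used throughout this section.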

Proof  (of Theorem 5) Substituting the hierarchical representation of the Laplace prior (Eq. 4) into $Z(\alpha)$ in Theorem 2, we get the normalization factor as follows,

$$\begin{aligned}
Z(\alpha) &= \int\!\!\int p(\mathbf{w}|\tau)p(\tau|\lambda)\,d\tau\cdot\exp\Big\{\mathbf{w}^\top\eta-\sum_{i,y\neq y^i}\alpha_i(y)\Delta\ell_i(y)\Big\}\,d\mathbf{w} \qquad (5)\\
&= \int p(\tau|\lambda)\int p(\mathbf{w}|\tau)\cdot\exp\Big\{\mathbf{w}^\top\eta-\sum_{i,y\neq y^i}\alpha_i(y)\Delta\ell_i(y)\Big\}\,d\mathbf{w}\,d\tau\\
&= \int p(\tau|\lambda)\exp\Big\{\frac{1}{2}\eta^\top A\eta-\sum_{i,y\neq y^i}\alpha_i(y)\Delta\ell_i(y)\Big\}\,d\tau\\
&= \exp\Big\{-\sum_{i,y\neq y^i}\alpha_i(y)\Delta\ell_i(y)\Big\}\prod_{k=1}^K\int\frac{\lambda}{2}\exp\Big(-\frac{\lambda}{2}\tau_k\Big)\exp\Big(\frac{1}{2}\eta_k^2\tau_k\Big)\,d\tau_k\\
&= \exp\Big\{-\sum_{i,y\neq y^i}\alpha_i(y)\Delta\ell_i(y)\Big\}\prod_{k=1}^K\frac{\lambda}{\lambda-\eta_k^2},
\end{aligned}$$

where $A=\operatorname{diag}(\tau_k)$ is a diagonal matrix and $\eta$ is a column vector with components $\eta_k$ defined as in Theorem 5. The last equality is due to the moment generating function of an exponential distribution. The constraint $\eta_k^2<\lambda,\ \forall k$, is needed in this derivation to keep the integral from diverging. Substituting the normalization factor derived above into the general dual problem D1 in Theorem 2, and using the same argument about the convex conjugate of $U(\xi)$ as in Theorem 3, we arrive at the dual problem in Theorem 5.

It can be shown that the dual objective function of the Laplace MaxEnDNet in Theorem 5 is concave (each $\log\frac{\lambda}{\lambda-\eta_k^2}$ is convex over $\alpha$ because it is the composition of a convex function with the affine mapping $\alpha\mapsto\eta_k$; so $-\sum_{k=1}^K\log\frac{\lambda}{\lambda-\eta_k^2}$ is concave, and the dual objective is concave due to the composition rule (Boyd and Vandenberghe, 2004)). But since each $\eta_k$ depends on all the dual variables $\alpha$ and appears within a logarithm, the optimization problem underlying the Laplace MaxEnDNet would be very difficult to solve. The SMO (Taskar et al., 2003) and exponentiated gradient (Bartlett et al., 2004) methods developed for the QP dual problem of the M$^3$N cannot be easily applied here. Thus, we will turn to a variational approximation method, as shown in Section 5. For completeness, we end this section with a corollary similar to Corollary 4, which states the primal optimization problem underlying the MaxEnDNet with a Laplace prior. As we shall see, the primal optimization problem in this case is complicated and provides another perspective on the hardness of solving the Laplace MaxEnDNet.

###### Corollary 6

Under the same assumptions as in Theorem 5, the mean $\mu$ of the posterior distribution $p(\mathbf{w})$ under a Laplace MaxEnDNet is obtained by solving the following primal problem:

$$\begin{aligned} \min_{\mu,\xi}\ & \sqrt{\lambda}\sum_{k=1}^K\Big(\sqrt{\mu_k^2+\frac{1}{\lambda}} - \frac{1}{\sqrt{\lambda}}\log\frac{\sqrt{\lambda\mu_k^2+1}+1}{2}\Big) + C\sum_{i=1}^N\xi_i \\ \mathrm{s.t.}\ & \mu^\top\Delta\mathbf{f}_i(y) \ge \Delta\ell_i(y) - \xi_i;\ \ \xi_i \ge 0,\ \ \forall i,\ \forall y \neq y^i. \end{aligned}$$

Proof  The proof requires the result of Corollary 7. We defer it to Appendix B.4.

Since the "norm" (this is not exactly a norm because positive scalability does not hold; however, the KL-norm is non-negative due to the non-negativity of KL-divergence, and each component is monotonically increasing with respect to $\mu_k^2$, so it penalizes large weights; for convenient comparison with the popular $L_2$ and $L_1$ norms, we call it a KL-norm)

$$\sum_{k=1}^K\Big(\sqrt{\mu_k^2+\frac{1}{\lambda}} - \frac{1}{\sqrt{\lambda}}\log\frac{\sqrt{\lambda\mu_k^2+1}+1}{2}\Big) \triangleq \|\mu\|_{KL}$$

corresponds to the KL-divergence between $p(\mathbf{w})$ and $p_0(\mathbf{w})$ under a Laplace MaxEnDNet, we will refer to it as a KL-norm and denote it by $\|\mu\|_{KL}$ in the sequel. This KL-norm is different from the $L_2$-norm as used in M$^3$N, but is closely related to the $L_1$-norm, which encourages a sparse estimator. In the following section, we provide a detailed analysis of the sparsity of Laplace MaxEnDNet resulting from the regularization effect of this norm.

## 4 Entropic Regularization and Sparse M$^3$N

Compared to the structured prediction rule due to an M$^3$N, which enjoys dual sparsity (i.e., few support vectors), the prediction rule defined by a Laplace MaxEnDNet is not only dual-sparse, but also primal sparse; that is, features that are insignificant experience strong shrinkage on their corresponding weights $\mu_k$.

The primal sparsity of $\mu$ achieved by the Laplace MaxEnDNet is due to a shrinkage effect resulting from the Laplacian entropic regularization. In this section, we take a close look at this regularization effect, in comparison with other common regularizers, such as the $L_2$-norm in M$^3$N (which is equivalent to the Gaussian MaxEnDNet), and the $L_1$-norm that at least in principle could be directly applied to M$^3$N. Since our main interest here is the sparsity of the structured prediction rule, we examine the posterior mean $\mu \triangleq \langle\mathbf{w}\rangle_p$ via exact integration. It can be shown that under a Laplace MaxEnDNet, $\mu$ exhibits the following posterior shrinkage effect.

###### Corollary 7 (Entropic Shrinkage)

The posterior mean of the Laplace MaxEnDNet has the following form:

$$\langle w_k\rangle_p = \frac{2\eta_k}{\lambda - \eta_k^2},\ \ \forall\, 1\le k\le K, \qquad (6)$$

where $\eta_k = \sum_{i,y\neq y^i}\alpha_i(y)\Delta f_i^k(y)$, with $\Delta f_i^k(y)$ the $k$-th component of $\Delta\mathbf{f}_i(y)$, and $\eta_k^2 < \lambda$.

Proof  Using the integration result in Eq. (5), we can get:

$$\frac{\partial \log Z}{\partial \alpha_i(y)} = v^\top\Delta\mathbf{f}_i(y) - \Delta\ell_i(y), \qquad (7)$$

where $v$ is a column vector with components $v_k = \frac{2\eta_k}{\lambda-\eta_k^2}$. An alternative way to compute the derivatives is to use the definition of $Z(\alpha)$. We can get:

$$\frac{\partial \log Z}{\partial \alpha_i(y)} = \langle\mathbf{w}\rangle_p^\top\Delta\mathbf{f}_i(y) - \Delta\ell_i(y). \qquad (8)$$

Comparing Eqs. (7) and (8), we get $\langle\mathbf{w}\rangle_p = v$, that is, $\langle w_k\rangle_p = \frac{2\eta_k}{\lambda-\eta_k^2}$. The constraints $\eta_k^2 < \lambda,\ \forall k$, are required to obtain a finite normalization factor, as shown in Eq. (5).

A plot of the relationship between $\langle w_k\rangle_p$ under a Laplace MaxEnDNet and the corresponding $\eta_k$, as revealed by Corollary 7, is shown in Figure 1 (for example, the red curve), from which we can see that the smaller $\eta_k$ is, the more shrinkage toward zero is imposed on $\langle w_k\rangle_p$.
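The shrinkage curve of Eq. (6) is easy to tabulate; the following small Python sketch compares the Laplace posterior mean against the unshrunk estimate $\langle w_k\rangle = \eta_k$ of the Gaussian case ($\lambda = 4$ is an arbitrary illustrative choice):

```python
lam = 4.0  # Laplace hyper-parameter (illustrative value)

def laplace_posterior_mean(eta, lam):
    """Entropic shrinkage of Corollary 7: <w_k>_p = 2*eta_k / (lam - eta_k^2).
    Only defined on the feasible region eta_k^2 < lam."""
    assert eta ** 2 < lam
    return 2.0 * eta / (lam - eta ** 2)

for eta in [0.0, 0.25, 0.5, 1.0, 1.5]:
    # The Gaussian MaxEnDNet / M3N estimate would simply be eta itself;
    # the Laplace posterior mean is pulled toward zero for small eta.
    print(eta, laplace_posterior_mean(eta, lam))
```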

This entropic shrinkage effect on $\mu$ is not present in the standard M$^3$N or the Gaussian MaxEnDNet. Recall that by definition, the vector $\eta$ is determined by the dual parameters $\alpha_i(y)$ obtained by solving a model-specific dual problem. When the $\alpha_i(y)$'s are obtained by solving the dual of the standard M$^3$N, it can be shown that the optimum point estimate of the parameters is $\mathbf{w}^\ast = \eta$. When the $\alpha_i(y)$'s are obtained from the dual of the Gaussian MaxEnDNet, Theorem 3 shows that the posterior mean of the parameters is $\mu = \eta$. (As we have already pointed out, since these two dual problems are isomorphic, the $\alpha_i(y)$'s for M$^3$N and Gaussian MaxEnDNet are identical, hence the resulting estimates are the same.) In both cases, there is no shrinkage along any particular dimension of the parameter vector $\mathbf{w}^\ast$ or of the mean vector $\mu$ of $p(\mathbf{w})$. Therefore, although both M$^3$N and Gaussian MaxEnDNet enjoy dual sparsity, because the KKT conditions imply that most of the dual parameters $\alpha_i(y)$ are zero, $\mathbf{w}^\ast$ and $\mu$ are not primal sparse. From Eq. (6), we can conclude that the Laplace MaxEnDNet is also dual sparse, because its mean $\mu$ can be uniquely determined by $\eta$. But the shrinkage effect on different components of the vector $\eta$ causes $\mu$ to be also primal sparse.

A comparison of the posterior mean estimates $\langle w_k\rangle_p$ under MaxEnDNet with three different priors versus their associated $\eta_k$ is shown in Figure 1. The three priors in question are a standard normal and two Laplace priors with different values of $\lambda$. It can be seen that, under the entropic regularization with a Laplace prior, $\langle w_k\rangle_p$ gets shrunk toward zero when $\eta_k$ is small. The larger the $\lambda$ value is, the greater the shrinkage effect. For a fixed $\lambda$, the shape of the shrinkage curve (i.e., the $\langle w_k\rangle_p$-versus-$\eta_k$ curve) is smoothly nonlinear, but no component is explicitly discarded; that is, no weight is set explicitly to zero. In contrast, for the Gaussian MaxEnDNet, which is equivalent to the standard M$^3$N, there is no such shrinkage effect.

Corollary 6 offers another perspective on how the Laplace MaxEnDNet relates to the $L_1$-norm M$^3$N, which yields a sparse estimator. Note that as $\lambda$ goes to infinity, the KL-norm $\|\mu\|_{KL}$ approaches $\|\mu\|_1$, i.e., the $L_1$-norm (as $\lambda\to\infty$, the square-root terms tend to $|\mu_k|$ and the logarithm terms in $\|\mu\|_{KL}$ vanish, since the logarithm grows more slowly than $\sqrt{\lambda}$). This means that the MaxEnDNet with a Laplace prior will be (nearly) the same as the $L_1$-M$^3$N if the regularization constant $\lambda$ is large enough.

A more explicit illustration of the entropic regularization under a Laplace MaxEnDNet, compared to the conventional $L_1$ and $L_2$ regularization over an M$^3$N, can be seen in Figure 2, where the feasible regions due to the three different norms used in the regularizer are plotted in a two-dimensional space. Specifically, it shows (1) $L_2$-norm: $\mu_1^2+\mu_2^2 \le 1$; (2) $L_1$-norm: $|\mu_1|+|\mu_2| \le 1$; and (3) KL-norm: $\|\mu\|_{KL} \le b$, where $b$ is a parameter chosen to make the boundary pass through the point $(0,1)$ for easy comparison with the $L_2$ and $L_1$ curves. (The KL-norm curves are drawn with a symbolic computational package by solving the equation $\|\mu\|_{KL}=b$ for one coordinate given the other.) It can be seen that the $L_1$-norm boundary has sharp turning points where it passes the axes, whereas the $L_2$- and KL-norm boundaries turn smoothly at those points. This is the intuitive explanation of why the $L_1$-norm directly gives sparse estimators, whereas the $L_2$-norm and the KL-norm due to a Laplace prior do not. But as shown in Figure 2, as $\lambda$ gets larger and larger, the KL-norm boundary moves closer and closer to the $L_1$-norm boundary. When $\lambda\to\infty$, $\|\mu\|_{KL}\to|\mu_1|+|\mu_2|$ and $b\to 1$, which yields exactly the $L_1$-norm in the two-dimensional space. Thus, under the linear model assumption of the discriminant functions $F(x,y;\mathbf{w})$, our framework can be seen as a smooth relaxation of the $L_1$-M$^3$N.

## 5 Variational Learning of Laplace MaxEnDNet

Although Theorem 2 seems to offer a general closed-form solution for $p(\mathbf{w})$ under an arbitrary prior $p_0(\mathbf{w})$, in practice the Lagrangian parameters $\alpha$ in $Z(\alpha)$ can be very hard to estimate from the dual problem D1 except for a few special choices of $p_0(\mathbf{w})$, such as the normal prior in Theorem 3, which can be easily generalized to any normal prior. When $p_0(\mathbf{w})$ is a Laplace prior, as we have shown in Theorem 5 and Corollary 6, the corresponding dual and primal problems involve a complex objective function that is difficult to optimize. Here, we present a variational method for approximate learning of the Laplace MaxEnDNet.

Our approach is built on the hierarchical interpretation of the Laplace prior shown in Eq. (4). Replacing the $p_0(\mathbf{w})$ in Problem P1 with Eq. (4) and applying Jensen's inequality, we get an upper bound of the KL-divergence:

$$KL(p\|p_0) \le -H(p) - \Big\langle \int q(\tau)\log\frac{p(\mathbf{w}|\tau)\,p(\tau|\lambda)}{q(\tau)}\,\mathrm{d}\tau \Big\rangle_p \triangleq \mathcal{L}(p(\mathbf{w}), q(\tau)),$$

where $q(\tau)$ is a variational distribution used to approximate $p(\tau|\lambda)$. The upper bound is in fact a KL-divergence between $p(\mathbf{w})q(\tau)$ and $p(\mathbf{w}|\tau)p(\tau|\lambda)$. Thus, $\mathcal{L}$ is convex over $p(\mathbf{w})$ and over $q(\tau)$, respectively, but not necessarily jointly convex over $(p(\mathbf{w}), q(\tau))$.

Substituting this upper bound for the KL-divergence in P1, we now solve the following Variational MaxEnDNet problem,

$$\mathrm{P1}^\prime\ (\mathrm{vMEDN}):\quad \min_{p(\mathbf{w})\in\mathcal{F}_1;\, q(\tau);\, \xi}\ \mathcal{L}(p(\mathbf{w}), q(\tau)) + U(\xi). \qquad (9)$$

P1$^\prime$ can be solved with an iterative minimization algorithm that alternates between optimizing over $(p(\mathbf{w}), \xi)$ and $q(\tau)$, as outlined in Algorithm 1 and detailed below.

Step 1: Keep $q(\tau)$ fixed, and optimize $\mathcal{L}$ with respect to $p(\mathbf{w})$. Using the same procedure as in solving P1, we get the posterior distribution $p(\mathbf{w})$ as follows,

$$\begin{aligned} p(\mathbf{w}) &\propto \exp\Big\{\int q(\tau)\log p(\mathbf{w}|\tau)\,\mathrm{d}\tau - b\Big\}\cdot\exp\Big\{\mathbf{w}^\top\eta - \sum_{i,y\neq y^i}\alpha_i(y)\Delta\ell_i(y)\Big\} \\ &\propto \exp\Big\{-\frac{1}{2}\mathbf{w}^\top\langle A^{-1}\rangle_q\mathbf{w} - b + \mathbf{w}^\top\eta - \sum_{i,y\neq y^i}\alpha_i(y)\Delta\ell_i(y)\Big\} \\ &= \mathcal{N}(\mathbf{w}|\mu, \Sigma), \end{aligned}$$

where $\eta = \sum_{i,y\neq y^i}\alpha_i(y)\Delta\mathbf{f}_i(y)$, $A = \operatorname{diag}(\tau_k)$, and $b$ is a constant. The posterior mean and covariance are $\mu = \Sigma\eta$ and $\Sigma = (\langle A^{-1}\rangle_q)^{-1}$, respectively. Note that this posterior distribution is also a normal distribution. Analogous to the proof of Theorem 3, we can derive that the dual parameters $\alpha$ are estimated by solving the following dual problem:

$$\begin{aligned} \max_\alpha\ & \sum_{i,y\neq y^i}\alpha_i(y)\Delta\ell_i(y) - \frac{1}{2}\eta^\top\Sigma\eta \qquad (10)\\ \mathrm{s.t.}\ & \sum_{y\neq y^i}\alpha_i(y) = C;\ \ \alpha_i(y) \ge 0,\ \ \forall i,\ \forall y\neq y^i. \end{aligned}$$

This dual problem is now a standard quadratic program, symbolically identical to the dual of an M$^3$N, and can be directly solved using existing algorithms developed for M$^3$N, such as those of (Taskar et al., 2003; Bartlett et al., 2004). Alternatively, we can solve the following primal problem:

$$\begin{aligned} \min_{\mathbf{w},\xi}\ & \frac{1}{2}\mathbf{w}^\top\Sigma^{-1}\mathbf{w} + C\sum_{i=1}^N\xi_i \qquad (11)\\ \mathrm{s.t.}\ & \mathbf{w}^\top\Delta\mathbf{f}_i(y) \ge \Delta\ell_i(y) - \xi_i;\ \ \xi_i \ge 0,\ \ \forall i,\ \forall y\neq y^i. \end{aligned}$$

Based on the proof of Corollary 4, it is easy to show that the solution of problem (11) is the posterior mean of $p(\mathbf{w})$, which will be used for prediction by the averaging rule in Eq. (2). The primal problem can be solved with the subgradient (Ratliff et al., 2007), cutting-plane (Tsochantaridis et al., 2004), or extragradient (Taskar et al., 2006) method.

Step 2: Keep $p(\mathbf{w})$ fixed, and optimize $\mathcal{L}$ with respect to $q(\tau)$. Taking the derivative of $\mathcal{L}$ with respect to $q(\tau)$ and setting it to zero, we get:

$$q(\tau) \propto p(\tau|\lambda)\exp\big\{\langle\log p(\mathbf{w}|\tau)\rangle_p\big\}.$$

Since both $p(\tau|\lambda)$ and $p(\mathbf{w}|\tau)$ can be written as products of univariate exponential and univariate Gaussian distributions, respectively, over each dimension, $q(\tau)$ also factorizes over each dimension: $q(\tau) = \prod_{k=1}^K q(\tau_k)$, where each $q(\tau_k)$ can be expressed as:

$$\forall k:\quad q(\tau_k) \propto p(\tau_k|\lambda)\exp\big\{\langle\log p(w_k|\tau_k)\rangle_p\big\} \propto \mathcal{N}\Big(\sqrt{\langle w_k^2\rangle_p}\,\Big|\,0, \tau_k\Big)\exp\Big(-\frac{1}{2}\lambda\tau_k\Big).$$

The same distribution has been derived in (Kaban, 2007), and, similarly to the hierarchical representation of a Laplace distribution, its normalization factor can be computed in closed form. Also, as in (Kaban, 2007), we can calculate the expectations $\langle \tau_k^{-1}\rangle_q$, which are required in computing $\langle A^{-1}\rangle_q$, as follows,

$$\Big\langle \frac{1}{\tau_k}\Big\rangle_q = \int \frac{1}{\tau_k}\, q(\tau_k)\,\mathrm{d}\tau_k = \sqrt{\frac{\lambda}{\langle w_k^2\rangle_p}}. \qquad (12)$$
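Eq. (12) can be verified by direct numerical integration of the unnormalized $q(\tau_k)\propto\tau_k^{-1/2}\exp\{-\langle w_k^2\rangle_p/(2\tau_k)-\lambda\tau_k/2\}$. A Python sketch with illustrative values of $\lambda$ and $\langle w_k^2\rangle_p$:

```python
import math

lam, w2 = 4.0, 0.5   # illustrative values for lambda and <w_k^2>_p

def unnorm_q(tau):
    # q(tau_k) ∝ N(sqrt(w2)|0,tau) * exp(-lam*tau/2) ∝ tau^{-1/2} e^{-w2/(2 tau) - lam*tau/2}
    return tau ** -0.5 * math.exp(-w2 / (2.0 * tau) - lam * tau / 2.0)

h = 1e-4
grid = [i * h for i in range(1, 200_000)]          # covers (0, 20); both tails are negligible
z = sum(unnorm_q(t) for t in grid) * h             # normalization constant
mean_inv_tau = sum(unnorm_q(t) / t for t in grid) * h / z
print(mean_inv_tau, math.sqrt(lam / w2))           # should agree with Eq. (12)
```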

We iterate between the above two steps until convergence. Due to the convexity (but not joint convexity) of the upper bound, the algorithm is guaranteed to converge to a local optimum. Then, we apply the posterior distribution $p(\mathbf{w})$, which is in the form of a normal distribution, to make predictions using the averaging prediction rule in Eq. (2). Due to the shrinkage effect of the Laplacian entropic regularization discussed in Section 4, for irrelevant features the variances should converge to zero and thus lead to a sparse estimate of $\mathbf{w}$. To summarize, the intuition behind this iterative minimization algorithm is as follows. First, we use a Gaussian distribution to approximate the Laplace distribution and thus get a QP problem analogous to that of the standard M$^3$N; then, in the second step, we update the covariance matrix in the QP problem using an exponential hyper-prior on the variances.
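To make the alternation concrete, the following deliberately simplified Python sketch replaces the structured QP of Eq. (11) with a decoupled squared-loss problem (identity design) and approximates $\langle w_k^2\rangle_p$ by $\mu_k^2$ alone; all names and values are illustrative, and this is only meant to show the alternation pattern and the sparsity it induces, not the paper's algorithm verbatim:

```python
import math

lam, C = 100.0, 10.0                 # illustrative hyper-parameters
y = [2.0, 1.5, 0.05, -0.02]          # two "relevant" and two "irrelevant" coordinates
sigma = [1.0] * len(y)               # diagonal of Sigma, initialized broadly
mu = [0.0] * len(y)

for _ in range(50):
    # Step 1 (surrogate): minimize (1/2) w' Sigma^{-1} w + (C/2) ||w - y||^2,
    # which decouples into mu_k = C*sigma_k / (1 + C*sigma_k) * y_k.
    mu = [C * s / (1.0 + C * s) * yk for s, yk in zip(sigma, y)]
    # Step 2: update Sigma_kk = sqrt(<w_k^2>/lam) ~ |mu_k| / sqrt(lam), cf. Eq. (12).
    sigma = [max(abs(m), 1e-12) / math.sqrt(lam) for m in mu]

print([round(m, 4) for m in mu])     # small coordinates are driven to (near) zero
```

The floor of 1e-12 on $\Sigma_{kk}$ is only a numerical safeguard; without it, a coordinate that reaches exactly zero would make the surrogate problem degenerate.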

## 6 Generalization Bound

The PAC-Bayes theory for averaging classifiers (Langford et al., 2001) provides a theoretical motivation to learn an averaging model for classification. In this section, we extend the classic PAC-Bayes theory on binary classifiers to MaxEnDNet, and analyze the generalization performance of the structured prediction rule in Eq. (2). In order to prove an error bound for this rule, the following mild assumption on the boundedness of the discriminant function is necessary: there exists a positive constant $c$ such that,