# Complex Structure Leads to Overfitting: A Structure Regularization Decoding Method for Natural Language Processing

## Abstract

Recent systems on structured prediction focus on increasing the level of structural dependencies within the model. However, our study suggests that complex structures entail high overfitting risks. To control the structure-based overfitting, we propose to conduct *structure regularization decoding* (SR decoding). The decoding of the complex structure model is regularized by the additionally trained simple structure model. We theoretically analyze the quantitative relations between the structural complexity and the overfitting risk. The analysis shows that complex structure models are prone to the *structure-based overfitting*. Empirical evaluations show that the proposed method improves the performance of the complex structure models by reducing the structure-based overfitting. On the sequence labeling tasks, the proposed method substantially improves the performance of the complex neural network models. The maximum F1 error rate reduction is 36.4% for the third-order model. The proposed method also works for the parsing task. The maximum UAS improvement is 5.5% for the tri-sibling model. The results are competitive with or better than the state-of-the-art results.

## 1 Introduction

Structured prediction models are often used to solve structure-dependent problems in a wide range of application domains, including natural language processing, bioinformatics, speech recognition, and computer vision. Many structured prediction methods have been developed for these problems; representative models include conditional random fields (CRFs), deep neural networks, and structured perceptron models. To capture structural information more accurately, some recent studies emphasize intensifying structural dependencies in structured prediction by applying long-range dependencies among tags, developing long-distance or global features, and so on.

From the probabilistic perspective, complex structural dependencies may lead to better modeling power. However, this is not the case for most of the structured prediction problems. It has been noticed that some recent work that tries to intensify the structural dependencies does not really benefit as expected, especially for neural network models. For example, in sequence labeling tasks, a natural way to increase the complexity of the structural dependencies is to make the model predict two or more consecutive tags for a position. The new label for a word now becomes a concatenation of several consecutive tags. To correctly predict the new label, the model can be forced to learn the complex structural dependencies involved in the transition of the new label. Nonetheless, the experiments contradict the hypothesis. With the increasing number of the tags to be predicted for a position, the performance of the model deteriorates. In the majority of the tasks we tested, the performance decreases substantially. We show the results in Section ?.

We argue that over-emphasis on intensive structural dependencies could be misleading. Our study suggests that complex structures are actually harmful to model accuracy. Indeed, while it is obvious that intensive structural dependencies can effectively incorporate structural information, it is less obvious that they have the drawback of increasing the generalization risk. Increasing the generalization risk means that the trained models tend to overfit the training data. The more complex the structures are, the more unstable the training is. Thus, the training is more likely to be affected by noise in the data, which leads to overfitting. Formally, our theoretical analysis reveals why and to what degree structural complexity lowers the generalization ability of the trained models. Since this type of overfitting is caused by structural complexity, it can hardly be solved by ordinary regularization methods, e.g., weight regularization methods such as $L_1$ and $L_2$ regularization schemes, which are used only for controlling the weight complexity.

To deal with this problem, we propose a simple structural complexity regularization solution based on *structure regularization decoding*. The proposed method trains both the complex structure model and the simple structure model. In decoding, the simpler structure model is used to regularize the complex structure model, deriving a model with better generalization power.

We show both theoretically and empirically that the proposed method can reduce the overfitting risk. In theory, the structural complexity has the effect of reducing the empirical risk, but increasing the overfitting risk. By regularizing the complex structure with the simple structure, a balance between the empirical risk and the overfitting risk can be achieved. We apply the proposed method to multiple sequence labeling tasks and a parsing task. The former involve linear-chain models, i.e., LSTM [?] models, and the latter involves hierarchical models, i.e., structured perceptron [?] models. Experiments demonstrate that the proposed method can easily surpass the performance of both the simple structure model and the complex structure model. Moreover, the results are competitive with or better than the state-of-the-art results.

To the best of our knowledge, this is the first theoretical effort on quantifying the relation between the structural complexity and the generalization risk in structured prediction. This is also the first proposal on structural complexity regularization via regularizing the decoding of the complex structure model by the simple structure model. The contributions of this work are two-fold:

On the methodology side, we propose a general-purpose structural complexity regularization framework for structured prediction. We show both theoretically and empirically that the proposed method can effectively reduce the overfitting risk in structured prediction. The theory reveals the quantitative relation between the structural complexity and the generalization risk. The theory shows that the structure-based overfitting risk increases with the structural complexity. By regularizing the structural complexity, the balance between the empirical risk and the overfitting risk can be maintained. The proposed method regularizes the decoding of the complex structure model by the simple structure model. Hence, the structure-based overfitting can be alleviated.

On the application side, we derive structure regularization decoding algorithms for several important natural language processing tasks, including sequence labeling tasks, such as chunking and named entity recognition, and a parsing task, i.e., joint empty category detection and dependency parsing. Experiments demonstrate that our structure regularization decoding method can effectively reduce the overfitting risk of the complex structure models. The performance of the proposed method easily surpasses the performance of both the simple structure model and the complex structure model. The results are competitive with or even better than the state of the art.

The paper is organized as follows. We first introduce the proposed method in Section ? (including its implementation on linear-chain models and hierarchical models). Then, we give a theoretical analysis of the problem in Section ?. The experimental results are presented in Section ?. Finally, we summarize the related work in Section ?, and draw our conclusions in Section ?.

## 2 Structure Regularization Decoding

Some recent work focuses on intensifying the structural dependencies. However, the improvements fail to meet expectations, and the results are sometimes even worse. Our theoretical study shows that the reason is that, although a complex structure results in a low empirical risk, it causes a high structure-based overfitting risk. The theoretical analysis is presented in Section ?.

According to the theoretical analysis, the key to reducing the overall overfitting risk is to use a complexity-balanced structure. However, such a structure is hard to define in practice. Instead, we propose to conduct joint decoding of the complex structure model and the simple structure model, which we call *Structure Regularization Decoding (SR Decoding)*. In SR decoding, the simple structure model acts as a regularizer that balances the structural complexity.

As the structures vary with the tasks, the implementation of structure regularization decoding also varies. In the following, we first introduce the general framework, and then show two specific algorithms for different structures. One is for the linear-chain structure models on the sequence labeling tasks, and the other is for the hierarchical structure models on the joint empty category detection and dependency parsing task.

### 2.1 General Structure Regularization Decoding

Before describing the method formally, we first define the input and the output of the model. Suppose we are given a training example , where is the input and is the output. Here, stands for the feature space, and stands for the *tag* space, which includes the possible substructures with respect to each input position. In preprocessing, the raw text input is transformed into a sequence of corresponding features, and the output structure is transformed into a sequence of input-position-related tags. The reason for such a definition is that the output structure varies with tasks. For simplicity and flexibility, we do not directly model the output structure. Instead, we regard the output structure as a structural combination of the output tags of each position, each of which is the substructure related to that input position.

Different modeling of the structured output, i.e., a different output tag space, will lead to a different complexity of the model. For instance, in the sequence labeling tasks, could be the space of unigram tags, and then is a sequence of . could also be the space of bigram tags. The bigram tag means the output regarding the input feature involves two consecutive tags, e.g. the tag at position , and the tag at position . Then, is the combination of the bigram tags . This also requires proper handling of the tags at overlapping positions. It is obvious that the structural complexity of the bigram tag space is higher than that of the unigram tag space.
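As an illustration of the two tag spaces, the following minimal Python sketch builds bigram tags from unigram tags and projects them back. It assumes that a bigram tag pairs the tag of the preceding position with the tag of the current position; the padding tag `START`, the separator `SEP`, and all function names are hypothetical choices made for this sketch and are not taken from our implementation.

```python
from typing import List

START = "<s>"  # hypothetical padding tag for the position before the first word
SEP = "|"      # hypothetical separator; assumed not to occur inside a tag


def to_bigram_tags(tags: List[str]) -> List[str]:
    """Pair every tag with the tag of the preceding position."""
    padded = [START] + tags
    return [padded[i] + SEP + padded[i + 1] for i in range(len(tags))]


def to_unigram_tags(bigram_tags: List[str]) -> List[str]:
    """Project bigram tags back onto the unigram tag space."""
    return [tag.split(SEP)[1] for tag in bigram_tags]


def is_consistent(bigram_tags: List[str]) -> bool:
    """Check that neighbouring bigram tags agree on the tag they share."""
    previous = START
    for tag in bigram_tags:
        left, right = tag.split(SEP)
        if left != previous:
            return False
        previous = right
    return True
```

The consistency check makes the handling of overlapping positions explicit: a sequence of bigram tags corresponds to a valid unigram sequence only if adjacent bigram tags agree on their shared tag.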

Point-wise classification is denoted by , where is the model that assigns the scores to each possible output tag at the position . For simplicity, we denote , so that . Given a point-wise cost function , which scores the predicted output based on the gold standard output, the model can be learned as:

For the same structured prediction task, suppose we could learn two models and of different structural complexity, where is the simple structure model, and is the complex structure model. Given the corresponding point-wise cost function and , the proposed method first learns the models separately:

Suppose there is a mapping from to , that is, the complex structure can be decomposed into simple structures. In testing, prediction is done by structure regularization decoding of the two models, based on the complex structure model:

In the decoding of the complex structure model, the simple structure model acts as a regularizer that balances the structural complexity.

We try to keep the description of the method as straightforward as possible without loss of generality. However, we do make some assumptions. For example, the general algorithm combines the scores of the complex model and the simple model at each position by addition, but the combination method should be adjusted to the task and the simple structure model used. If the model is a probabilistic model, multiplication should be used instead of addition. Besides, the number of models is not limited in theory, as long as (Equation 1) is changed accordingly. Moreover, if joint training of the models is affordable, the models need not be trained independently. These examples are intended to demonstrate the flexibility of the structure regularization decoding algorithms. The detailed implementation should be considered with respect to the task and the model.
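The additive combination can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions rather than the exact decoding rule: each position is decoded independently, `project` stands for the mapping from a complex tag to its simple tag, and `weight` is a hypothetical knob for the strength of the simple model. For probabilistic models, passing log-probabilities makes the addition equivalent to multiplying probabilities.

```python
from typing import Callable, Dict, List


def sr_decode_pointwise(
    complex_scores: List[Dict[str, float]],
    simple_scores: List[Dict[str, float]],
    project: Callable[[str], str],
    weight: float = 1.0,
) -> List[str]:
    """Pick, at each position, the complex tag with the best combined score.

    complex_scores[i][tag] is the complex model's score for `tag` at position i,
    simple_scores[i][tag] is the simple model's score for a simple tag, and
    project(tag) maps a complex tag to the simple tag it decomposes into.
    """
    decoded = []
    for c_scores, s_scores in zip(complex_scores, simple_scores):
        best_tag = max(
            c_scores,
            key=lambda tag: c_scores[tag] + weight * s_scores[project(tag)],
        )
        decoded.append(best_tag)
    return decoded
```

Setting `weight=0.0` recovers plain decoding with the complex model, which makes the regularizing effect of the simple model easy to inspect. The sketch ignores consistency between overlapping tags at neighbouring positions, which a full decoder has to enforce.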

In what follows, we show how structure regularization decoding is implemented on two typical structures in natural language processing, i.e., the sequence labeling tasks, which involve linear-chain structures, and the dependency parsing task, which involves hierarchical structures. We focus on the differences and the considerations when deriving structure regularization decoding algorithms. Note that the structure regularization decoding method can be adapted to other kinds of structures; the implementation is not limited to the structures or the settings that we use.

### 2.2 SR Decoding for Sequence Labeling Tasks

We first describe the model, and then explain how the framework in Section ? can be implemented for the sequence labeling tasks.

Sequence labeling tasks involve linear-chain structures. For a sequence labeling task, a reasonable model is to find a label sequence with the maximum probability conditioned on the sequence of observations, i.e. words. Given a sequence of observations , and a sequence of labels, , where denotes the sentence length, we want to estimate the joint probability of the labels conditioned on the observations as follows:

If we model the preceding joint probability directly, the number of parameters that need to be estimated is extremely large, which makes the problem intractable. Most existing studies make a Markov assumption to reduce the number of parameters. We also make an order-$n$ Markov assumption. Unlike typical existing work, we decompose the original joint probability into a few localized order-$n$ joint probabilities. The product of these localized order-$n$ joint probabilities is used to approximate the original joint probability. Furthermore, we decompose each localized order-$n$ joint probability into the stacked probabilities from order-1 to order-$n$, so that we can efficiently combine the multi-order information.

By using different orders of the Markov assumptions, we can get different structural complexity of the model. For example, if we use the Markov assumption of order-1, we obtain the simplest model in terms of the structural complexity:

If we use the Markov assumption with order-2, we estimate the original joint probability as follows:

This formula models the bigram tags with respect to the input. The search space expands, and entails more complex structural dependencies. To learn such models, we need to estimate the conditional probabilities. In order to make the problem tractable, a feature mapping is often introduced to extract features from conditions, to avoid a large parameter space.

In this paper, we use BLSTM to estimate the conditional probabilities, which has the advantage that feature engineering is reduced to the minimum. Moreover, the conditional probabilities of higher order models can be converted to the joint probabilities of the output labels conditioned on the input labels. When using neural networks, learning of order-$n$ models can be conducted by just extending the tag set from unigram labels to $n$-gram labels. In training, this only affects the computational cost of the output layer, which is linear in the size of the tag set. The models can be trained very efficiently with an affordable training cost. The strategy is shown in Figure ?.

To decode with structure regularization, we need to connect the models of different complexity. Fortunately, the complex model can be decomposed into a simple model and another complex model. Notice that:

The original joint probability can be rewritten as:

In the preceding equation, the complex model is . It predicts the next label based on the current label and the input. is the simplest model. In practice, we estimate also by BLSTMs, and the computation is the same as . The derivation can also be generalized to an order-$n$ case, which consists of the models predicting the length-1 to length-$n$ label sequence for a position. Moreover, the equation explicitly shows a reasonable way of SR decoding, and how the simple structure model can be used to regularize the complex structure model. Figure ? illustrates the method.

However, considering the sequence of length and an order-$n$ model, decoding is not scalable, as the computational complexity of the algorithm is . To accelerate SR decoding, we prune the tags of a position in the complex structure model by the top-most possible tags of the simplest structure model. For example, if the output tags of the complex structure are bigram labels, i.e. , the available tags for a position in the complex structure model are the combinations of the most probable unigram tags of the position and of the position before. In addition, the tag set of the complex model is also pruned so that it only contains the tags appearing in the training set. The detailed algorithm with pruning, which we name *scalable multi-order decoding*, is given in ?.
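The pruning idea can be sketched for the order-2 case as follows. This is a minimal illustration and not the scalable multi-order decoding algorithm itself: it assumes log-scores, a bigram score table indexed by the pair (tag of the previous position, tag of the current position), and a Viterbi-style search; the names `top_k`, `sr_decode_order2`, `k`, and `seen_bigrams` are hypothetical.

```python
import math
from typing import Dict, List, Optional, Set, Tuple

Bigram = Tuple[str, str]


def top_k(scores: Dict[str, float], k: int) -> List[str]:
    """Return the k highest-scoring tags."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


def sr_decode_order2(
    uni: List[Dict[str, float]],
    bi: List[Dict[Bigram, float]],
    k: int = 2,
    seen_bigrams: Optional[Set[Bigram]] = None,
) -> List[str]:
    """Viterbi search over bigram tags, pruned by the unigram model.

    uni[t][y] is the unigram log-score of tag y at position t.
    bi[t][(a, b)] is the bigram log-score of tag a at t-1 and tag b at t
    (bi[0] is unused). Only the top-k unigram tags per position are kept,
    and bigrams outside seen_bigrams (if given) are discarded.
    """
    if not uni:
        return []
    candidates = [top_k(scores, k) for scores in uni]
    best = [{tag: uni[0][tag] for tag in candidates[0]}]
    back: List[Dict[str, str]] = [{}]
    for t in range(1, len(uni)):
        best.append({})
        back.append({})
        for tag in candidates[t]:
            for prev, prev_score in best[t - 1].items():
                if seen_bigrams is not None and (prev, tag) not in seen_bigrams:
                    continue
                # Simple-model score regularizes the complex-model score.
                score = prev_score + uni[t][tag] + bi[t].get((prev, tag), -math.inf)
                if score > best[t].get(tag, -math.inf):
                    best[t][tag] = score
                    back[t][tag] = prev
        if not best[t]:
            raise ValueError(f"pruning removed every path at position {t}")
    tag = max(best[-1], key=best[-1].get)
    path = [tag]
    for t in range(len(uni) - 1, 0, -1):
        tag = back[t][tag]
        path.append(tag)
    return path[::-1]
```

In this sketch each position considers at most `k * k` transitions, so the search cost no longer depends on the size of the full bigram tag set.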

### 2.3 SR Decoding for Dependency Parsing

We first give a brief introduction to the task. Then, we introduce the model, and finally we show the structure regularization decoding algorithm for this task.

The task in question is the parsing task, specifically, joint empty category detection and dependency parsing, which involves hierarchical structures. In many versions of Transformational Generative Grammars, e.g., the Government and Binding [?] theory, empty category is the key concept bridging S-Structure and D-Structure, due to its possible contribution to trace *movements*. Following the linguistic insights, a traditional dependency analysis can be augmented with empty elements, viz. covert elements [?]. Figure 1 shows an example of the dependency parsing analysis augmented with empty elements. The new representation leverages hierarchical tree structures to encode not only surface but also deep syntactic information. The goal of empty category detection is to find all empty elements, and the goal of dependency parsing thus includes predicting not only the dependencies among normal words but also the dependencies between a normal word and an empty element.

In this paper, we are concerned with how to employ the structural complexity regularization framework to improve the performance of empty category augmented dependency analysis, which is a complex structured prediction problem compared to the regular dependency analysis.

A traditional dependency graph $G = (V, A)$ is a directed graph, such that for sentence $x = w_1, \ldots, w_n$ the following holds:

$V = \{0, 1, 2, \ldots, n\}$,

$A \subseteq V \times V$.

The vertex set consists of nodes, each of which is represented by a single integer. In particular, represents a virtual root node , while all others correspond to words in . The arc set represents the unlabeled dependency relations of the particular analysis . Specifically, an arc represents a dependency from head to dependent . A dependency graph is thus a set of unlabeled dependency relations between the root and the words of . To represent an empty category augmented dependency tree, we extend the vertex set and define a directed graph as usual.

To define a parsing model, we denote the *index set* of all possible dependencies as . A dependency parse can then be represented as a vector

where if there is an arc in the graph, and otherwise. For a sentence , we define dependency parsing as a search for the highest-scoring analysis of :

Here, is the set of all trees compatible with and evaluates the event that tree is the analysis of sentence . In brief, given a sentence , we compute its parse by searching for the highest-scored dependency parse in the set of compatible trees . The scores are assigned by Score. In this paper, we evaluate structured perceptron and define as , where is a feature-vector mapping and is the corresponding parameter vector.

In general, performing a direct maximization over the set is infeasible. The common solution used in many parsing approaches is to introduce a part-wise factorization:

Above, we have assumed that the dependency parse can be factored into a set of parts , each of which represents a small substructure of . For example, might be factored into the set of its component dependencies. A number of dynamic programming (DP) algorithms have been designed for first- [?], second- [?], third- [?] and fourth-order [?] factorization.
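To make the part-wise factorization concrete, the following minimal Python sketch scores a tree under a first-order (arc-factored) factorization with a linear model. The feature templates in `arc_features` and all function names are hypothetical and chosen only for illustration; they are not the features used in our parser.

```python
from typing import Dict, List


def arc_features(words: List[str], head: int, dep: int) -> List[str]:
    """Hypothetical feature templates for one arc; index 0 is the virtual root."""
    head_word = "<root>" if head == 0 else words[head - 1]
    dep_word = words[dep - 1]
    direction = "R" if head < dep else "L"
    return [
        f"hw={head_word}|dw={dep_word}",
        f"dir={direction}",
        f"dist={abs(head - dep)}",
    ]


def score_arc(weights: Dict[str, float], words: List[str], head: int, dep: int) -> float:
    """Linear score of a single part: sum of the weights of its active features."""
    return sum(weights.get(feat, 0.0) for feat in arc_features(words, head, dep))


def score_tree(weights: Dict[str, float], words: List[str], heads: List[int]) -> float:
    """Score of a tree as the sum of its arc scores.

    heads[i] is the head index of word i + 1 (0 means the virtual root).
    """
    return sum(
        score_arc(weights, words, head, dep)
        for dep, head in enumerate(heads, start=1)
    )
```

Because the tree score is a sum over parts, the maximization can be carried out by dynamic programming over parts instead of enumerating whole trees. Higher-order factorizations follow the same pattern with larger parts, such as sibling pairs.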

Parsing for joint empty category detection and dependency parsing can be defined in a similar way. We use another index set , where indicates an empty node. Then a dependency parse with empty nodes can be represented as a vector similar to :

Let denote the set of all possible for sentence . We then define joint empty category detection and dependency parsing as a search for the highest-scoring analysis of :

When the output of the factorization function, namely , is defined as the collection of all sibling or tri-sibling dependencies, decoding for the above two optimization problems, namely (Equation 2) and (Equation 3), can be resolved in low-degree polynomial time with respect to the number of words contained in [?]. In particular, the decoding algorithms proposed by [?] are extensions of the algorithms introduced respectively by [?] and [?].

To perform structure regularization decoding, we need to combine the two models. In this problem, as the models are linear and do not involve probability, they can be easily combined together. Assume that and assign scores to parse trees without and with empty elements, respectively. In particular, the training data for estimating are sub-structures of the training data for estimating . Therefore, the training data for can be viewed as the mini-samples of the training data for . A reasonable model to integrate and is to find the optimal parse by solving the following optimization problem:

where is a weight for score combination.

In this paper, we employ dual decomposition to resolve the optimization problem (Equation 4). We sketch the solution as follows.

The Lagrangian of (Equation 4) is

where is the Lagrangian multiplier. Then the dual is

We instead try to find the solution for

By using a subgradient method to calculate , we have another SR decoding algorithm. Notice that, there is no need to train the simple model and the complex model separately.
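The subgradient procedure can be sketched generically in Python. The sketch below is an illustration of standard dual decomposition under stated assumptions, not our parser: the two dynamic programming decoders are replaced by brute-force maximization over small candidate lists, `project` is a hypothetical function that removes the empty-element components of a parse vector, `lam` plays the role of the combination weight, and the step size schedule `rate / (it + 1)` is an arbitrary choice.

```python
from typing import Callable, List, Sequence, Tuple

Vector = Tuple[int, ...]


def dot(u: Sequence[float], v: Sequence[int]) -> float:
    return sum(a * b for a, b in zip(u, v))


def dual_decompose(
    cands_y: List[Vector],
    f: Callable[[Vector], float],
    cands_z: List[Vector],
    g: Callable[[Vector], float],
    project: Callable[[Vector], Vector],
    lam: float = 1.0,
    steps: int = 50,
    rate: float = 1.0,
) -> Tuple[Vector, Vector, bool]:
    """Maximize f(y) + lam * g(z) subject to y == project(z).

    Returns the last pair (y, z) and a flag telling whether they agree.
    """
    u = [0.0] * len(cands_y[0])  # Lagrangian multipliers, one per shared part
    y, z = cands_y[0], cands_z[0]
    for it in range(steps):
        # Each sub-problem is solved independently with adjusted scores.
        y = max(cands_y, key=lambda c: f(c) + dot(u, c))
        z = max(cands_z, key=lambda c: lam * g(c) - dot(u, project(c)))
        diff = [a - b for a, b in zip(y, project(z))]
        if not any(diff):
            return y, z, True  # agreement certifies optimality
        step = rate / (it + 1)
        # Subgradient step on the dual: move u against the disagreement.
        u = [ui - step * di for ui, di in zip(u, diff)]
    return y, z, False
```

When the two sub-problems agree, the returned pair is optimal for the constrained problem; when the iteration limit is reached without agreement, a fallback such as returning the solution of the complex model is needed.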

## 3 Theoretical Analysis: Structure Complexity vs. Overfitting Risk

We first describe the settings for the theoretical analysis, and give the necessary definitions (Section 3.1). We then introduce the proposed method with the proper annotations for clarity of the analysis (Section ?). Finally, we give the theoretical results on analyzing the generalization risk with respect to the structure complexity based on stability (Section ?). The general idea behind the theoretical analysis is that the overfitting risk increases with the complexity of the structure, because more complex structures are less stable in training. If some examples are taken out of the training set, the impact on the complex structure models is much more severe than on the simple structure models. The detailed relations among the factors are shown by the analysis.

### 3.1 Problem Settings of Theoretical Analysis

In this section, we give the preliminary definitions necessary for the analysis, including the learning algorithm, the data, and the cost functions, and especially the definition of structural complexity. We also describe the properties and the assumptions we make to facilitate the theoretical analysis.

A graph of observations (even with arbitrary structures) can be indexed and be denoted by an indexed sequence of observations . We use the term *sample* to denote . For example, in natural language processing, a sample may correspond to a sentence of words with dependencies of linear chain structures (e.g., in part-of-speech tagging) or tree structures (e.g., in syntactic parsing). In signal processing, a sample may correspond to a sequence of signals with dependencies of arbitrary structures. For simplicity in analysis, we assume all samples have observations (thus tags). In the analysis, we define structural complexity as the scope of the structural dependency. For example, a dependency scope of two tags is considered less complex than a dependency scope of three tags. In particular, the dependency scope of tags is considered the full dependency scope which is of the highest structural complexity.

A sample is converted to an indexed sequence of feature vectors , where is of the dimension and corresponds to the local features extracted from the position/index . We can use an matrix to represent . In other words, we use to denote the input space on a position, so that is sampled from . Let be structured output space, so that the structured output are sampled from . Let be a unified denotation of structured input and output space. Let , which is sampled from , be a unified denotation of a pair in the training data.

Suppose a training set is

with size , and the samples are drawn i.i.d. from a distribution which is unknown. A learning algorithm is a function with the function space , i.e., maps a training set to a function . We suppose is symmetric with respect to , so that is independent of the order of .

Structural dependencies among tags are the major difference between structured prediction and non-structured classification. For the latter case, a local classification of based on a position can be expressed as , where the term represents a local window. However, for structured prediction, a local classification on a position depends on the whole input rather than a local window, due to the nature of structural dependencies among tags (e.g., graphical models like CRFs). Thus, in structured prediction a local classification on should be denoted as . To simplify the notation, we define

Given a training set of size , we define as a modified training set, which removes the ’th training sample:

and we define as another modified training set, which replaces the ’th training sample with a new sample drawn from :

We define the *point-wise cost function* as , which measures the cost on a position by comparing and the gold-standard tag . We introduce the point-wise loss as

Then, we define the *sample-wise cost function* , which is the cost function with respect to a whole sample. We introduce the sample-wise loss as

Given and a training set , what we are most interested in is the *generalization risk* in structured prediction (i.e., the expected average loss) [?]:

Unless specifically indicated in the context, the probabilities and expectations over random variables, including , , , and , are based on the unknown distribution .

Since the distribution is unknown, we have to estimate from by using the *empirical risk*:

In what follows, sometimes we will use the simplified notations, and , to denote and .

To state our theoretical results, we must describe several quantities and assumptions which are important in structured prediction. We follow some notations and assumptions on non-structured classification [?]. We assume a simple real-valued structured prediction scheme such that the class predicted on position of is the sign of . In practice, many popular structured prediction models have a real-valued cost function. Also, we assume the point-wise cost function is convex and *-smooth* such that

While many structured learning models have a convex objective function (e.g., CRFs), some other models have a non-convex objective function (e.g., deep neural networks). It is well known that theoretical analysis of the non-convex cases is quite difficult. Our theoretical analysis is focused on the convex situations, and hopefully it can provide some insight for the more difficult non-convex cases. In fact, we will conduct experiments on neural network models with non-convex objective functions, such as LSTM. Experimental results demonstrate that the proposed structural complexity regularization method also works in the non-convex situations, in spite of the difficulty of the theoretical analysis.

Then, *-smooth* versions of the loss and the cost function can be derived according to their prior definitions:

Also, we use a value to quantify the bound of while changing a single sample (with size ) in the training set with respect to the structured input . This *-admissible* assumption can be formulated as ,

where is a value related to the design of algorithm .

### 3.2 Structural Complexity Regularization

Based on the problem settings, we give definitions for the common weight regularization and the proposed structural complexity regularization. In the definition, the proposed structural complexity regularization decomposes the dependency scope of the training samples into smaller localized dependency scopes. The smaller localized dependency scopes form mini-samples for the learning algorithms. It is assumed that the smaller localized dependency scopes do not overlap. Hence, the analysis is for a simplified version of structural complexity regularization. We are aware that in implementation, the constraint can be hard to guarantee. Empirically, structural complexity regularization works well without this constraint.

Most existing regularization techniques are proposed to regularize model weights/parameters, e.g., a representative regularizer is the Gaussian regularizer, also called the $L_2$ regularizer. We call such regularization techniques *weight regularization*.

While weight regularization normalizes model weights, the proposed structural complexity regularization method normalizes the structural complexity of the training samples. Our analysis is based on the different *dependency scope* (i.e., the scope of the structural dependency), such that, for example, a tag depending on two tags in context is considered to have less structural complexity than a tag depending on four tags in context. The structural complexity regularization is defined to make the *dependency scope* smaller. To simplify the analysis, we suppose a baseline case that a sample has full dependency scope , such that all tags in have dependencies. Then, we introduce a factor such that a sample has localized dependency scope . In this case, represents the reduction magnitude of the dependency scope. To simplify the analysis without loss of generality, we assume the localized dependency scopes do not overlap with each other. Since the dependency scope is localized and non-overlapping, we can split the original sample of the dependency scope into mini-samples of the dependency scope of . What we want to show is that learning with a small, non-overlapping dependency scope has less overfitting risk than learning with a large dependency scope. Real-world tasks may have an overlapping dependency scope. Hence, our theoretical analysis is for a simplified “essential” problem distilled from the real-world tasks.
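The decomposition into non-overlapping mini-samples can be sketched as follows. This is a minimal illustration of the simplified setting used in the analysis; the function name and the parameter name `alpha` for the reduction factor are hypothetical, and the sketch assumes that the sample length is divisible by the factor.

```python
from typing import List, Sequence, Tuple

MiniSample = Tuple[Sequence, Sequence]


def split_into_mini_samples(
    features: Sequence, tags: Sequence, alpha: int
) -> List[MiniSample]:
    """Split one sample into `alpha` non-overlapping mini-samples.

    Each mini-sample keeps a localized dependency scope of len(tags) // alpha
    consecutive positions; dependencies across mini-samples are dropped.
    """
    n = len(tags)
    if len(features) != n:
        raise ValueError("features and tags must have the same length")
    if alpha < 1 or n % alpha != 0:
        raise ValueError("alpha must be a positive divisor of the sample length")
    size = n // alpha
    return [
        (features[start:start + size], tags[start:start + size])
        for start in range(0, n, size)
    ]
```

With `alpha=1` the original sample is returned unchanged, which corresponds to learning with the full dependency scope. Larger values yield more, shorter training units.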

In what follows, we also refer to the dependency scope of a sample as the *structure complexity* of the sample. Then, a simplified version of structural complexity regularization, specifically for our theoretical analysis, can be formally defined as follows:

Note that, when the structural complexity regularization strength , we have and .

Now, we have given a formal definition of structural complexity regularization, by comparing it with the traditional weight regularization. Below, we show that the structural complexity regularization can improve the stability of learned models, and can finally reduce the overfitting risk of the learned models.

### 3.3 Stability of Structured Prediction

Because the generalization of a learning algorithm is positively correlated with the stability of the learning algorithm [?], to analyze the generalization of the proposed method, we instead examine the stability of the structured prediction. Here, stability describes the extent to which the resulting learning function changes when a sample in the training set is removed. We prove that by decomposing the dependency scopes, i.e., by regularizing the structural complexity, the stability of the learning algorithm can be improved.
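The notion of stability under sample removal can be illustrated numerically. The sketch below does not use a structured model: as a stand-in, it uses a one-dimensional ridge regression learner with a closed-form solution, and measures the largest change in prediction on a set of probe inputs when one training sample is removed. The learner, the probe inputs, and all names are assumptions made for this illustration only.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[float, float]


def fit_ridge(samples: Sequence[Sample], lam: float) -> float:
    """Closed-form 1-D ridge regression: w = sum(x*y) / (sum(x*x) + lam)."""
    numerator = sum(x * y for x, y in samples)
    denominator = sum(x * x for x, _ in samples) + lam
    return numerator / denominator


def removal_stability(
    samples: Sequence[Sample], lam: float, probes: Sequence[float]
) -> float:
    """Largest prediction change over all single-sample removals and probes."""
    w_full = fit_ridge(samples, lam)
    worst = 0.0
    for i in range(len(samples)):
        reduced: List[Sample] = list(samples[:i]) + list(samples[i + 1:])
        w_reduced = fit_ridge(reduced, lam)
        for x in probes:
            worst = max(worst, abs(w_full * x - w_reduced * x))
    return worst
```

A smaller returned value means a more stable learner. On toy data, doubling the number of training samples lowers the value, which mirrors the intuition that splitting samples into more mini-samples makes each single removal matter less.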

We first give the formal definitions of the stability with respect to the learning algorithm, i.e., function stability.

The stability with respect to the cost function can be similarly defined.

It is clear that the upper bounds of loss stability and function stability are linearly correlated under the problem settings.

The proof is provided in Appendix A.

Here, we show that lower structural complexity yields a smaller stability bound, and therefore makes the learning algorithm more stable. The proposed method improves stability by regularizing the structural complexity of training samples.

The proof is given in Appendix A.

We can see that increasing the size of training set results in linear improvement of , and increasing the strength of structural complexity regularization results in quadratic improvement of .

The function stability is based on comparing and , i.e., the stability is based on removing a mini-sample. Moreover, we can extend the analysis to the function stability based on comparing and , i.e., the stability is based on removing a full-size sample.

The proof is presented in Appendix A.

In the case that a full sample is removed, increasing the strength of structural complexity regularization results in linear improvement of .

### 3.4 Reduction of Generalization Risk

In this section, we formally describe the relation between the generalization and the stability, and summarize the relationship between the proposed method and the generalization. Finally, we draw our conclusions from the theoretical analysis.

Now, we analyze the relationship between the generalization and the stability.

The proof is in Appendix A.

The upper bound of the generalization risk contains the loss stability, which is rewritten as the function stability. We can see that better stability leads to a smaller bound on the generalization risk.

By substituting the function stability with the formula we get from the structural complexity regularization, we get the relation between the generalization and the structural complexity regularization.

The proof is in Appendix A.

We call the term in ( ?) as the “overfit-bound”. Reducing the overfit-bound is crucial for reducing the generalization risk bound. Most importantly, we can see from the overfit-bound that the structural complexity regularization factor always stays together with the weight regularization factor , working together to reduce the overfit-bound. This indicates that the structural complexity regularization is as important as the weight regularization for reducing the generalization risk for structured prediction.

Moreover, since , and are typically small compared with other variables, especially , ( ?) can be approximated as follows by ignoring the small terms:

First, (Equation 7) suggests that structure complexity can increase the overfit-bound on a magnitude of , and applying weight regularization can reduce the overfit-bound by . Importantly, applying structural complexity regularization further (over weight regularization) can additionally reduce the overfit-bound by a magnitude of . When , which means “no structural complexity regularization”, we have the worst overfit-bound. Also, (Equation 7) suggests that increasing the size of training set can reduce the overfit-bound on a square root level.

Theorem ? also indicates that too simple structures may overkill the overfit-bound but with a dominating empirical risk, while too complex structures may overkill the empirical risk but with a dominating overfit-bound. Thus, to achieve the best prediction accuracy, a balanced complexity of structures should be used for training the model.

By regularizing the complex structure with the simple structure, a balance between the empirical risk and the overfitting risk can be achieved. In the proposed method, the model of the complex structure and the simple structure are both used in decoding. In essence, the decoding is based on the complex model, for the purpose of keeping the empirical risk down. The simple model is used to regularize the structure of the output, which means the structural complexity of the complex model is compromised. Therefore, the overfitting risk is reduced.

To summarize, the proposed method decomposes the dependency scopes, that is, regularizes the structural complexity. This leads to better stability of the model, which means the generalization risk is lower. Under the problem settings, increasing the regularization strength can bring a linear reduction of the overfit-bound. However, an overly simple structure may cause a dominating empirical risk. To achieve a balanced structural complexity, we can regularize the complex structure model with the simple structure model. The complex structure model has low empirical risk, while the simple structure model has low structural risk. The proposed method takes advantage of both the simple structure model and the complex structure model. As a result, the overall overfitting risk can be reduced.

## 4 Experiments

We conduct experiments on natural language processing tasks. We are concerned with two types of structures: linear-chain structures, e.g. word sequences, and hierarchical structures, e.g. phrase-structure trees and dependency trees. The natural language processing tasks concerning linear-chain structures include (1) text chunking, (2) English named entity recognition, and (3) Dutch named entity recognition. We also conduct experiments on a natural language processing task that involves hierarchical structures, i.e. (4) dependency parsing with empty category detection.

### 4.1 Experiments on Sequence Labeling Tasks

Text Chunking (Chunking):

The chunking data is from the CoNLL-2000 shared task [?]. The training set consists of 8,936 sentences, and the test set consists of 2,012 sentences. Since there is no development data provided, we randomly sampled 5% of the training data as development set for tuning hyper-parameters. The evaluation metric is F1-score.

English Named Entity Recognition (English-NER):

The English NER data is from the CoNLL-2003 shared task [?]. There are four types of entities to be recognized: PERSON, LOCATION, ORGANIZATION, and MISC. This data has 14,987 sentences for training, 3,466 sentences for development, and 3,684 sentences for testing. The evaluation metric is F1-score.

Dutch Named Entity Recognition (Dutch-NER):

We use the D-NER dataset [?] from the shared task of CoNLL-2002. The dataset contains four types of named entities: PERSON, LOCATION, ORGANIZATION, and MISC. It has 15,806 sentences for training, 2,895 sentences for development, and 5,195 sentences for testing. The evaluation metric is F1-score.

Since LSTM [?] is a popular implementation of recurrent neural networks, we highlight experiment results on LSTM. In this work, we use the bi-directional LSTM (BLSTM) as the implementation of LSTM, considering it has better accuracy in practice. For BLSTM, we set the dimension of input layer to 200 and the dimension of hidden layer to 300 for all the tasks.

The experiments on BLSTM are based on the Adam learning method [?]. Since we find the default hyperparameters work satisfactorily on those tasks, following [?] we use the default hyperparameters as follows: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 1 \times 10^{-8}$.
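For readers unfamiliar with Adam, the following is a minimal scalar sketch of its update rule, using the defaults recommended in the Adam paper ($\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$). The learning rate of 0.001 is also the Adam default but is an assumption here, since this section does not state it. The toy quadratic objective and the function name `adam_minimize` are illustrative only and unrelated to the BLSTM training code.

```python
import math


def adam_minimize(grad_fn, theta, steps, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Run `steps` Adam updates on a single scalar parameter."""
    m = 0.0  # first-moment (mean) estimate of the gradient
    v = 0.0  # second-moment (uncentered variance) estimate of the gradient
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        m_hat = m / (1.0 - beta1 ** t)  # bias correction for the zero initialization
        v_hat = v / (1.0 - beta2 ** t)
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta


# Toy objective f(theta) = (theta - 3)^2 with gradient 2 * (theta - 3).
def toy_gradient(theta):
    return 2.0 * (theta - 3.0)


theta_one_step = adam_minimize(toy_gradient, theta=0.0, steps=1)
theta_later = adam_minimize(toy_gradient, theta=0.0, steps=1000)
print(theta_one_step, theta_later)
```

After the bias correction, the first update has magnitude close to the learning rate regardless of the gradient scale, which is one reason the default settings transfer well across tasks.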

For the tasks with BLSTM, we find there is almost no difference by adding regularization or not. Hence, we do not add regularization for BLSTM. All weight matrices, except for bias vectors and word embeddings, are diagonal matrices and randomly initialized by normal distribution.

We implement our code with the Python package *TensorFlow*.

#### Results

Test score | Chunking | English-NER | Dutch-NER |
---|---|---|---|
BLSTM order1 | 93.97 | 87.65 | 76.04 |
BLSTM order2 | 93.24 | 87.59 | 76.33 |
BLSTM order2 + SR | 94.81 (+1.57) | 89.72 (+2.13) | 80.51 (+4.18) |
BLSTM order3 | 92.50 | 87.16 | 76.57 |
BLSTM order3 + SR | 95.23 (+2.73) | 90.59 (+3.43) | 81.62 (+5.05) |

First, we apply the proposed scalable multi-order decoding method on BLSTM (BLSTM-SR). Table 1 compares the scores of BLSTM-SR and BLSTM on standard test data. As we can see, when the order of the model is increased, the baseline model worsens. The exception is the result of the Dutch-NER task. When the order of the model is increased, the model is slightly improved. It demonstrates that, in practice, although complex structure models have lower empirical risks, the structural risks are more dominant.

The proposed method easily surpasses the baseline. For Chunking, the F1 error rate reduction is 23.2% and 36.4% for the second-order model and the third-order model, respectively. For English-NER, the proposed method reduces the F1 error rate by 17.2% and 26.7% for the second-order model and the third-order model, respectively. For Dutch-NER, F1 error rate reductions of 17.7% and 21.6% are achieved for the second-order model and the third-order model, respectively. The improvement is clearly substantial. We suppose the reason is that the proposed method can combine both low-order and high-order information, which helps to reduce the overfitting risk. Thus, the F1 score is improved.
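The reported error rate reductions can be reproduced from the F1 scores in Table 1. This section does not spell out the formula, so the sketch below assumes the usual definition: the error is $100 - \mathrm{F1}$, and the reduction is the relative decrease of that error. Under this assumption the computed values match the reported ones.

```python
def error_rate_reduction(baseline_f1, improved_f1):
    """Relative reduction of the error (100 - F1), in percent."""
    baseline_error = 100.0 - baseline_f1
    improved_error = 100.0 - improved_f1
    return 100.0 * (baseline_error - improved_error) / baseline_error


# F1 scores taken from Table 1 (baseline model vs. the same model with SR decoding).
chunking_order2 = error_rate_reduction(93.24, 94.81)
chunking_order3 = error_rate_reduction(92.50, 95.23)
english_ner_order3 = error_rate_reduction(87.16, 90.59)
dutch_ner_order3 = error_rate_reduction(76.57, 81.62)

for name, value in [
    ("Chunking order2", chunking_order2),
    ("Chunking order3", chunking_order3),
    ("English-NER order3", english_ner_order3),
    ("Dutch-NER order3", dutch_ner_order3),
]:
    print(f"{name}: {value:.1f}%")
```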

Moreover, the reduction is larger when the order is higher, i.e., the improvement of order-3 models is better than that of order-2 models. This confirms the theoretical results that higher structural complexity leads to higher structural risks. This also suggests the proposed method can alleviate the structural risks and keep the empirical risks low. The phenomenon is better illustrated in Figure ?.

Table 2 shows the results on Chunking compared to previous work. We achieve the state-of-the-art in all-phrase chunking. [?] achieve the same score as ours. However, they conduct experiments on noun phrase chunking (NP-chunking). All-phrase chunking involves many more tags than NP-chunking and is therefore more difficult.

Model | F1 |
---|---|
[?] | 93.48 |
[?] | 93.91 |
[?] | 94.30 |
[?] | 94.29 |
[?] | 94.34 |
[?] | 94.32 |
[?] | 94.52 |
[?] | 94.46 |
This paper | 95.23 |

SR decoding also achieves competitive or better results on English NER and Dutch NER compared with existing methods. [?] employ a BLSTM-CRF model in the English NER task and achieve an F1 score of 90.10%, which is lower than our best F1 score. [?] present a hybrid BLSTM with an F1 score of 90.77%. The model slightly outperforms our method, which may be due to the external CNNs they use to extract word features. [?] hold the best result on Dutch NER. However, their model is trained on multilingual corpora. Their model trained on a single language reaches an F1 score of 78.08% and performs worse than ours. [?] reach 78.6% F1 with a semi-supervised approach on Dutch NER. Our model still outperforms that method.

### 4.2 Experiments on Parsing

Language | Split | #Sent | #Overt | #Covert |
---|---|---|---|---|
English | train | 38667 | 909114 | 57342 |
English | test | 2336 | 54242 | 3447 |
Chinese | train | 8605 | 193417 | 13518 |
Chinese | test | 941 | 21797 | 1520 |

Joint Empty Category Detection and Dependency Parsing

For joint empty category detection and dependency parsing, we conduct experiments on both English and Chinese treebanks. In particular, the English Penn TreeBank (PTB) [?] and the Chinese TreeBank (CTB) [?] are used. Because PTB and CTB are phrase-structure treebanks, we need to convert them into dependency annotations. To do so, we use the tool provided by Stanford CoreNLP to process PTB, and the tool introduced by [?] to process CTB 5.0. We use gold-standard POS tags to derive features for disambiguation. To simplify our experiments, we preprocess the obtained dependency trees in the following way.

1. We combine successive empty elements with an identical head into one new empty node that is still linked to the common head word.
2. Because the high-order algorithm is computationally very expensive, we only use relatively short sentences. Here we only keep the sentences that are shorter than 64 tokens.
3. We focus on unlabeled parsing.

The statistics of the data after cleaning are shown in Table 3. We use the standard training, validation, and test splits to facilitate comparisons. Accuracy is measured with the unlabeled attachment score for all overt words (UAS): the percentage of the overt words with the correct head. We are also concerned with the prediction accuracy for empty elements. To evaluate performance on empty nodes, we consider the correctness of empty edges. We report the percentage of the empty words in the right slot with the correct head. The $i$-th slot in the sentence is the position immediately after the $i$-th concrete word. So for a sentence of length $n$, we get $n+1$ slots.
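The UAS definition above translates directly into code. The sketch below assumes each sentence is represented by a list of head indices for its overt words, with `0` denoting the artificial root. This encoding and the function name are assumptions made for illustration, not the format used in the experiments.

```python
def unlabeled_attachment_score(gold_heads, predicted_heads):
    """Percentage of overt words whose predicted head equals the gold head."""
    if len(gold_heads) != len(predicted_heads):
        raise ValueError("gold and predicted heads must have the same length")
    if not gold_heads:
        raise ValueError("cannot score an empty sentence")
    correct = sum(1 for g, p in zip(gold_heads, predicted_heads) if g == p)
    return 100.0 * correct / len(gold_heads)


# "He ate the apple": heads are 1-based word positions, 0 is the root.
gold = [2, 0, 4, 2]
predicted = [2, 0, 2, 2]  # "the" is wrongly attached to "ate"
uas = unlabeled_attachment_score(gold, predicted)
print(uas)
```

For a corpus-level score, the counts of correct and total overt words would be accumulated over all sentences before dividing, rather than averaging per-sentence percentages.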

#### Results

Factorization | Empty Element | English | Chinese |
---|---|---|---|
Sibling | No | 91.73 | 89.16 |
Sibling (complex) | Yes | 91.70 | 89.20 |
Tri-sibling | No | 92.23 | 90.00 |
Tri-sibling (complex) | Yes | 92.41 | 89.82 |

Factorization | English | Chinese |
---|---|---|
Sibling (complex) | 91.70 | 89.20 |
Sibling (complex) + SR | 91.96 (+0.26) | 89.53 (+0.33) |
Tri-sibling (complex) | 92.41 | 89.82 |
Tri-sibling (complex) + SR | 92.71 (+0.30) | 90.38 (+0.56) |

Table 4 lists the accuracy of individual models coupled with different decoding algorithms on the test sets. We focus on the prediction for overt words only. When we take into account empty categories, more information is available. However, the increased structural complexity affects the algorithms. From the table, we can see that the complex sibling factorization works worse than the simple sibling factorization in English, but works better in Chinese. The results of the tri-sibling factorization are exactly the opposite. The complex tri-sibling factorization works better than the simple tri-sibling factorization in English, but works worse in Chinese. The results can be explained by our theoretical analysis. While the structural complexity is positively correlated with the overfitting risk, it is negatively correlated with the empirical risk. In this task, although the overfitting risk is increased when using the complex structure, the empirical risk is decreased more sometimes. Hence, the results vary both with the structural complexity and the data.

Table 5 lists the accuracy of different SR decoding models on the test sets. We can see that the SR decoding framework is effective in dealing with the structure-based overfitting. This time, the accuracy of analysis for overt words is consistently improved. For the second-order model, SR decoding reduces the error rate by 3.1% for English, and by 3.0% for Chinese. For the third-order model, error rate reductions of 4.0% for English and 5.5% for Chinese are achieved by the proposed method. Similar to the sequence labeling tasks, the third-order model is improved more. We suppose the consistent improvements come from the ability of the SR decoding algorithm to reduce the structural risk. Although the complex structure is sometimes helpful to parsing accuracy in this task, the structural risk still increases. By regularizing the structural complexity, further improvements can be achieved, on top of the decreased empirical risk brought by the complex structure.

We use the hypothesis testing method of [?] to evaluate the improvements. At a significance level of 0.05, all improvements in Table 5 are statistically significant.

## 5 Related Work

The term *structural regularization* has been used in prior work for regularizing *structures of features*. For (typically non-structured) classification problems, there are considerable studies on structure-related regularization. [?] apply spectral regularization for modeling feature structures in multi-task learning, where the shared structure of the multiple tasks is summarized by a spectral function over the tasks’ covariance matrix, and then is used to regularize the learning of each individual task. [?] regularize feature structures for structural large margin binary classifiers, where data points belonging to the same class are clustered into subclasses so that the features for the data points in the same subclass can be regularized. While those methods focus on the regularization approaches, many recent studies focus on exploiting structured sparsity. Structured sparsity is studied for a variety of non-structured classification models [?] and structured prediction scenarios [?], via adopting mixed norm regularization [?], *Group Lasso* [?], posterior regularization [?], and a string of variations [?].

Compared with those pieces of prior work, the proposed method works on a substantially different basis. This is because the term *structure* in all of the aforementioned work refers to *structures of feature space*, which is substantially different compared with our proposal on regularizing tag structures (interactions among tags).

There are other related studies, including the studies of [?] and [?] on piecewise/decomposed training methods, and the study of [?] on a “lookahead” learning method. They all try to reduce the computational cost of the model by reducing the structure involved. [?] try to simplify the structural dependencies of probabilistic graphical models, so that the model can be efficiently trained. [?] try to simplify the output space in structured SVM by decomposing the structure into sub-structures, so that the search space is reduced and the training is tractable. [?] try to train a localized structured perceptron, where the local output is searched by stepping into the future instead of directly using the result of the classifier.

Our work differs from [?], because our work is built on a regularization framework, with arguments and justifications on reducing generalization risk and for better accuracy, although it has the effect that the decoding space of the complex model is reduced by the simple model. Also, the theoretical results can fit general graphical models, and the detailed algorithm is quite different.

On generalization risk analysis, related studies include [?] on non-structured classification and [?] on structured classification. This work targets the theoretical analysis of the relations between the structural complexity of structured classification problems and the generalization risk, which is a new perspective compared with those studies.

## 6 Conclusions

We propose a structural complexity regularization framework, called structure regularization decoding. In the proposed method, we train the complex structure model. In addition, we also train the simple structure model. The simple structure model is used to regularize the decoding of the complex structure model. The resulting model embodies a balanced structural complexity, which reduces the structure-based overfitting risk. We derive the structure regularization decoding algorithms for linear-chain models on sequence labeling tasks, and for hierarchical models on parsing tasks.

Our theoretical analysis shows that the proposed method can effectively reduce the generalization risk, and the analysis is suitable for graphical models. In theory, higher structural complexity leads to higher structure-based overfitting risk, but lower empirical risk. To achieve better performance, a balanced structural complexity should be maintained. By regularizing the structural complexity, that is, decomposing the structural dependencies while also keeping the original structural dependencies, structure-based overfitting risk can be alleviated and empirical risk can be kept low as well.

Experimental results demonstrate that the proposed method easily surpasses the performance of the complex structure models. In particular, the proposed method is also suitable for deep learning models. On the sequence labeling tasks, the proposed method substantially improves the performance of the complex structure models, with a maximum F1 error rate reduction of 23.2% for the second-order models, and 36.4% for the third-order models. On the parsing task, a maximum UAS error rate reduction of 5.5% on the Chinese tri-sibling factorization is achieved by the proposed method. The results are competitive with or even better than the state-of-the-art results.

#### Acknowledgments

This work was supported in part by National Natural Science Foundation of China (No. 61300063), and Doctoral Fund of Ministry of Education of China (No. 20130001120004). This work is a substantial extension of a conference paper presented at NIPS 2014 [?].

## References

## A Proof

Our analysis sometimes needs to use McDiarmid’s inequality.

where the 3rd step is based on for and , given that is symmetric.

### A.1 Proofs

Proof of Lemma ?

According to (Equation 5), we have

This gives the bound of loss stability.

Also, we have

This derives the bound of sample loss stability.

Proof of Theorem ?

When a convex and differentiable function has a minimum in space , its Bregman divergence has the following property for :

With this property, we have

Then, based on the property of Bregman divergence that , we have

Moreover, is a convex function and its Bregman divergence satisfies:

Combining (Equation 9) and (Equation 10) gives

which further gives

Given -admissibility, we derive the bound of function stability based on sample with size . We have

With the feature dimension and for , we have

Similarly, we hav