Inference algorithms for pattern-based CRFs on sequence data

# Inference algorithms for pattern-based CRFs on sequence data

takhanov@mail.ru                                               vnk@ist.ac.at
Nazarbayev University, Kazakhstan              Institute of Science and Technology Austria
###### Abstract

We consider Conditional Random Fields (CRFs) with pattern-based potentials defined on a chain. In this model the energy of a string (labeling) is the sum of terms over intervals where each term is non-zero only if the substring equals a prespecified pattern . Such CRFs can be naturally applied to many sequence tagging problems.

We present efficient algorithms for the three standard inference tasks in a CRF, namely computing (i) the partition function, (ii) marginals, and (iii) computing the MAP. Their complexities are respectively , and where is the combined length of input patterns, is the maximum length of a pattern, and is the input alphabet. This improves on the previous algorithms of [Ye et al. NIPS 2009] whose complexities are respectively , and , where is the number of input patterns. In addition, we give an efficient algorithm for sampling, and revisit the case of MAP with non-positive weights.

A preliminary version of this paper appeared in Proceedings of the 30th International Conference on Machine Learning (ICML), 2013 [8]. This work was partially supported by the European Research Council under the European Unions Seventh Framework Programme (FP7/2007-2013)/ERC grant agreement no 616160.

This paper addresses the sequence labeling (or the sequence tagging) problem: given an observation (which is usually a sequence of values), infer labeling where each variable takes values in some finite domain . Such problem appears in many domains such as text and speech analysis, signal analysis, and bioinformatics.

One of the most successful approaches for tackling the problem is the Hidden Markov Model (HMM). The th order HMM is given by the probability distribution with the energy function

 E(x|z)=∑i∈[1,n]ψi(xi,zi)+∑(i,j)∈Ekψij(xi:j) (1)

where and is the substring of from to . A popular generalization is the Conditional Random Field model [3] that allows all terms to depend on the full observation :

 E(x|z)=∑i∈[1,n]ψi(xi,z)+∑(i,j)∈Ekψij(xi:j,z) (2)

We study a particular variant of this model called a pattern-based CRF. It is defined via

 E(x|z)=∑α∈Γ∑[i,j]⊆[1,n]j−i+1=|α|ψαij(z)⋅[xi:j=α] (3)

where is a fixed set of non-empty words, is the length of word and is the Iverson bracket. If we take then (3) becomes equivalent to (2); thus we do not loose generality (but gain more flexibility).

Intuitively, pattern-based CRFs allow to model long-range interactions for selected subsequences of labels. This could be useful for a variety of applications: in part-of-speech tagging patterns could correspond to certain syntactic constructions or stable idioms; in protein secondary structure prediction - to sequences of dihedral angles corresponding to stable configuration such as -helixes; in gene prediction - to sequences of nucleatydes with supposed functional roles such as “exon” or “intron”, specific codons, etc.

Inference  This paper focuses on inference algorithms for pattern-based CRFs. The three standard inference tasks are

• computing the partition function ;

• computing marginal probabilities for all triplets present in (3);

• computing MAP, i.e. minimizing energy (3).

The complexity of solving these tasks is discussed below. We denote to be total length of patterns and to be the maximum length of a pattern.

A naive approach is to use standard message passing techniques for an HMM of order . However, they would take time which would become impractical for large . More efficient algorithms with complexities , and respectively were given by Ye et al. [10].111Some of the bounds stated in [10] are actually weaker. However, it is not difficult to show that their algorithms can be implemented in times stated above, using our Lemma 1. Our first contribution is to improve this to , and respectively (more accurate estimates are given in the next section).

We also give an algorithm for sampling from the distribution . Its complexity is either (i) per sample, or (ii) per sample with an preprocessing (assuming that we have an oracle that produces independent samples from the uniform distribution on in time).

Finally, we consider the case when all costs are non-positive. Komodakis and Paragios [2] gave an technique for minimizing energy (3) in this case. We present a modification that has the same worst-case complexity but can beat the algorithm in [2] in the best case.

Related work The works of [10] and [2] are probably the most related to our paper. The former applied pattern-based CRFs to the handwritten character recognition problem and to the problem of identification of named entities from texts. The latter considered a pattern-based CRF on a grid for a computer vision application; the MAP inference problem in [2] was converted to sequence labeling problems by decomposing the grid into thin “stripes”.

Qian et al. [5] considered a more general formulation in which a single pattern is characterized by a set of strings rather than a single string . They proposed an exact inference algorithm and applied it to the OCR task and to the Chinese Organization Name Recognition task. However, their algorithm could take time exponential in the total lengths of input patterns; no subclasses of inputs were identified which could be solved in polynomial time.

A different generalization (for non-sequence data) was proposed by Rother et al. [6]. Their inference procedure reduces the problem to the MAP estimation in a pairwise CRF with cycles, which is then solved with approximate techniques such as BP, TRW or QPBO. This model was applied to the texture restoration problem.

Nguyen et al. [4] extended algorithms in [10] to the Semi-Markov model [7]. We conjecture that our algorithms can be extended to this case as well, and can yield a better complexity compared to [4].

In [8] we applied the pattern-based CRF model to the problem of the protein dihedral angles prediction.

## 1 Notation and preliminaries

First, we introduce a few definitions.

• A pattern is a pair where is an interval in and is a sequence over alphabet indexed by integers in (). The length of is denoted as .

• Symbols “” denotes an arbitrary word or pattern (possibly the empty word or the empty pattern at position ). The exact meaning will always be clear from the context. Similary, “+” denotes an arbitrary non-empty word or pattern.

• The concatenation of patterns and is the pattern . Whenever we write we assume that it is defined, i.e.  and for some .

• For a pattern and interval , the subpattern of at position is the pattern where .
If then is called a prefix of . If then is a suffix of .

• If is a subpattern of , i.e.  for some , then we say that is contained in . This is equivalent to the condition .

• is the set of patterns with interval . We typically use letter for patterns in and letters for other patterns. Patterns will be called partial labelings.

• For a set of patterns and index we denote to be the set of patterns in that end at position : .

• For a pattern let be the prefix of of length ; if is empty then is undefined.

We will consider the following general problem. Let be the set of patterns of words in placed at all possible positions: . Let be a commutative semiring with elements which are identities for and respectively. Define the cost of pattern via

 f(x)=⨂α∈Π∘,x=∗α∗cα (4)

where are fixed constants. (Throughout the paper we adopt the convention that operations and over the empty set of arguments give respectively and , and so e.g. .) Our goal is to compute

 Z=⨁x∈D1:nf(x) (5)

Example 1  If then problem (5) is equivalent to computing the partition function for the energy (3), if we set .

Example 2  If where then we get the problem of minimizing energy (3), if .

The complexity of our algorithms will be stated in terms of the following quantities:

• is the number of distinct non-empty prefixes of words in . Note that .

• is the number of distinct proper prefixes of words in . There holds .
If then . If is a sparse random subset of the set above then .

• is the set of non-empty words which are both prefixes and suffixes of some words in . Note that and .

We will present 6 algorithms:

Sec. 2: algorithm for the case when is a ring, i.e. it has operation that satisfies for all . This holds for the semiring in Example 1 (but not for Example 2).
Sec. 3: algorithm for sampling. Alternatively, it can be implemented to produce independent samples in time per sample with a preprocessing.
Sec. 4: algorithm for computing marginals for all patterns .
Sec. 5: algorithm for a general commutative semiring, which is equivalent to the algorithm in [10]. It will be used as a motivation for the next algorithm.
Sec. 6: algorithm for a general commutative semiring; for the semiring in Example 2 the complexity can be reduced to .
Sec. 7: algorithm for the case , for all .

All algorithms will have the following structure. Given the set of input patterns , we first construct another set of patterns ; it will typically be either the set of prefixes or the set of proper prefixes of patterns in . This can be done in a preprocessing step since sets will be isomorphmic (up to a shift) for indexes that are sufficiently far from the boundary. (Recall that is the set of patterns in that end at position .) Then we recursively compute messages for which have the following interpretation: is the sum (“”) of costs over a certain set of partial labelings of the form . In some of the algorithms we also compute messages which is the sum of over all partial labelings of the form .

Graph    The following construction will be used throughout the paper. Given a set of patterns and index , we define to be the Hasse diagram of the partial order on , where iff is a suffix of (). In other words, is a directed acyclic graph with the following set of edges: belongs to for if and there exists no “intermediate” pattern with . It can be checked that graph is a directed forest. If then is connected and therefore is a tree. In this case we treat as the root. An example is shown in Fig. 1.

Computing partial costs   Recall that for a pattern is the cost of all patterns inside (eq. (4)). We also define to be the cost of only those patterns that are suffixes of :

 ϕ(α)=⨂β∈Π∘,α=∗βcβ (6)

Quantities and will be heavily used by the algorithms below; let us show how to compute them efficiently.

###### Lemma 1.

Let be a set of patterns with for all . Values for all can be computed using multiplications (“”). The same holds for values assuming that is prefix-closed, i.e.  for all non-empty patterns .

###### Proof.

To compute for patterns , we use the following procedure: (i) set ; (ii) traverse edges of tree (from the root to the leaves) and set

 ϕ(β):={ϕ(α)⊗cβif β∈Π∘ϕ(α)otherwise

Now suppose that is prefix-closed. After computing , we go through indexes and set

 f(εs):=\mathds1,f(α):=f(α−)⊗ϕ(α)∀α∈Πs−{εs}

Sets of partial labelings   Let be a set of patterns that end at position . Assume that . For a pattern we define

 Xs(α) = {x∈D1:s|x=∗α} (7) Xs(α;Πs) = Xs(α)−⋃(α,β)∈E[Πs]Xs(β) (8)

It can be seen that sets are disjoint, and their union over is . Furthermore, there holds

 Xs(α;Πs)={x∈Xs(α)|x≠∗β∀β=+α∈Πs} (9)

We will use eq. (9) as the definition of in the case when .

## 2 Computing partition function

In this section we give an algorithm for computing quantity (5) assuming that is a ring. This can be used, in particular, for computing the partition function. We will assume that ; we can always add to if needed222Note that we still claim complexity where is the number of distinct non-empty prefixes of words in the original set . Indeed, we can assume w.l.o.g. that each letter in occurs in at least one word . (If not, then we can “merge” non-occuring letters to a single letter and add this letter to ; clearly, any instance over the original pair can be equivalenly formulated as an instance over the new pair. The transformation increases only by ). The assumption implies that . Adding to increases by at most , and thus does not affect bound ..

First, we select set as the set of prefixes of patterns in :

 Π={α|∃α∗∈Π∘} (10)

We will compute the following quantities for each , :

 Ms(α)=⨁x∈Xs(α;Πs)f(x),Ws(α)=⨁x∈Xs(α)f(x) (11)

It is easy to see that for the following equalities relate and :

 Ms(α) = Ws(α)⊖⨁(α,β)∈E[Πs]Ws(β) (12a) Ws(α) = Ms(α)⊕⨁(α,β)∈E[Πs]Ws(β) (12b)

These relations motivate the following algorithm. Since for indexes that are sufficiently far from the boundary, its complexity is assuming that values in eq. (13a) are computed using Lemma 1.

###### Theorem 2.

Algorithm 1 is correct, i.e. it returns the correct value of .

### 2.1 Proof of Theorem 2

Eq. (13b) coincides with (12b); let us show that eq. (13a) holds for any . (Note, for step 2 is correct: assumption implies that , and therefore , ).

For a partial labeling define the “reduced partial cost” as

 f−(x)=⨂α∈Π∘,x=∗α+cα (14)

It is easy to see from (11) that for any

 Ws−1(α−)=∑x∈Xs(α)f−(x) (15)

Consider . We will show that for any there holds

 \llbracketx∈Xs(α;Πs)\rrbracket⊗f(x)=ϕ(α)⊗[f−(x)⊖⨁(α,β)∈E[Πs]:x∈Xs(β)f−(x)] (16)

where if the argument is true, and otherwise. This will be sufficient for establishing the theorem: summing these equations over and using (11), (15) yields eq. (13a).

Two cases are possible:

Case 1: for some . (Such is unique since sets are disjoint.) Then both sides of (16) are .

Case 2: . Then eq. (16) is equivalent to . This holds since there is no pattern with (otherwise we would have and thus by definition (9) - a contradiction).

## 3 Sampling

In this section consider the semiring from Example 1. We assume that all costs are strictly positive. We present an algorithm for sampling labelings according to the probability distribution .

As in the previous section, we assume that , and define to be the set of prefixes of patterns in (eq. (10)). For a node let be the set of nodes in the subtree of rooted at , with . For a pattern we define set

 Δs(α)=Ts(α−)−⋃(α,β)∈G[Πs+1]Ts(β−) (17)

We can now present the algorithm (see Algorithm 2).

We say that step of the algorithm is valid if either (i) , or (ii) , step is valid, and for some . (This is a recursive definition.) Clearly, if step is valid then line 3 of the algorithm is well-defined.

###### Theorem 3.

(a) With probability 1 all steps of the algorithm are valid. (b) The returned labeling is distributed according to .

Complexity   Assume that we have an oracle that produces independent samples from the uniform distribution on in time.

The main subroutine performed by the algorithm is sampling from a given discrete distribution. Clearly, this can be done in time where is the number of allowed values of the random variable. With a preprocessing, a sample can also be produced in time by the so-called “alias method” [9].

This leads to two possible complexities: (i) (without preprocessing); (ii) per sample (with preprocessing). Let us discuss the complexity of this preprocessing. Running Algorithm 1 takes time. After that, for each we need to run the linear time procedure of [9] for distributions . The following theorem implies that this takes time.

There holds .

###### Proof.

Consider pattern . For a letter let be longest suffix of with (at least one such suffix exists, namely ). It can be seen that the set is exactly the set of patterns for which contains (checking this fact is just definition chasing). Therefore, the sum in the theorem equals . ∎

To summarize, we showed that with a preprocessing we can compute independent samples from in time per sample.

### 3.1 Proof of Theorem 3

Suppose that step of the algorithm is valid; this means that patterns for are well-defined. For we then define the set of patterns (if then we define instead). We also define sets of labelings

 Yt(α) = {yxt+1:n|y∈Xt(α;Πt)}∀α∈At (18) Yt = Yt(αt) (19)

where is a labeling with for . Let .

###### Lemma 5.

Suppose that step is valid.
(a) is a disjoint union of sets over .
(b) For each there holds , and consequently for any there holds

 ∑y∈Ys(α)f(y)=consts⋅∑y∈Xs(α;Πs)f(y)=consts⋅Ms(α)

Theorem 3 will follow from this lemma. Indeed, the lemma shows that the algorithm implicitly computes a sequence of nested sets . At step we divide set into disjoint subsets , and select one of them, , with the probability proportional to .

We still need to show that if step is valid then step is valid as well with probability 1. It follows from the precondition that sampled in line 3 satisfies with probability 1; this implies that . From the paragraph above we get that with probability 1. We also have implying that for some . This concludes the proof that step is valid with probability 1.

It remains to prove Lemma 5.

Part (a)   First, we need to check that is equal to the disjoint union of over where . Disjointness of for different is obvious. Since , then for any , is straightforward from the definition of . Thus, we only need to check the inclusion of in the union.

Elements of can be seen as nodes in tree . Then any pattern from defines the longest suffix such that . It is easy to see that , and moreover, the descending path in from to does not contain elements from , otherwise . It is easy to see that this is equivalent to . Since , is a subset of the union of over .

Now according to definition of we can write:

 Ys+1 = {yxs+2:n|y∈Xs+1(αs+1;Πs+1)} = {y(αs+1)s+1:s+1xs+2:n|y∈Xs(α−s+1;Π−s+1)} = ⋃α∈Δs(αs+1){y(αs+1)s+1:s+1xs+2:n|y∈Xs(α;Πs)}

It only remains to check that in the last union the set corresponding to is exactly equal to .

Part (b)   Let be the start position of , i.e. . Consider labeling , we then must have . Let be a pattern with , . We will prove that ; this will imply the claim.

Suppose on the contrary that . Denote , then and . Therefore, (since ). However, this contradicts the assumption that .

## 4 Computing marginals

In this section we again consider the semiring from Example 1 where all costs are strictly positive, and consider a probablity distribution over labelings .

For a pattern we define

 Ω(α) = {x∈D1:n|x=∗α∗} (20) Z(α) = ∑x∈Ω(α)f(x) (21)

We also define the set of patterns

 Π={α|∃α∗,∗α∈Π∘,α is % non-empty} (22)

Note that and for indexes that are sufficiently far from the boundary. We will present an algorithm for computing for all patterns in time . Marginal probabilities of a pattern-based CRF can then be computed as for a pattern .

In the previous section we used graph for a set of patterns ; here we will need an analogous but a slightly different construction for patterns in . For patterns we write if . If we have then we write .

Now consider . We define to be the set of patterns such that and there is no other pattern with .

Our algorithm is given below. In the first step it runs Algorithm 1 from left to right and from right to left; as a result, we get forward messages and backward messages for patterns such that

 −→Wj(α)=∑x=∗αx∈D1:jf(x)←−Wi(α)=∑y=α∗y∈Di:nf(y) (23)
###### Theorem 6.

Algorithm 3 is correct.

We prove the theorem in section 4.1, but first let us discuss algorithm’s complexity. We claim that all values used by the algorithm can be computed in time where and are respectively the number of distinct non-empty prefixes and suffixes of words in . Indeed, we first compute these values for patterns in the set ; by Lemma 1, this takes time. This covers values used in eq. (24a). As for the value in eq. (24b) for pattern , we can use the formula

 f(αi+1:j−1)=f(α)~cα→ϕ(α)←ϕ(α)

where if and otherwise, and

 →ϕ(α)=∏β∈Π∘,α=∗βcβ, ←ϕ(α)=∏β∈Π∘,α=β∗cβ

The latter values can be computed in time by applying Lemma 1 in the forward and backward directions. (In fact, they were already computed when running Algorithm 1.)

We showed that step 1 can be implemented in time; let us analyze step 2. The following lemma implies that it performs arithmetic operations; since , we then get that the overall complexity is .

###### Lemma 7.

For each there exist at most patterns such that .

###### Proof.

Let be the set of such patterns . Note, there holds . We need to show that . Let us order patterns lexicographically (first by , then by ): with , and denote where is the interval for . We will prove by induction that for ; this will imply that , as desired.

The base case is trivial. Suppose that it holds for ; let us prove it for . If then by the definition of the order on , so the claim holds. Suppose that . If then contradicting the condition . Thus, , and so the claim of the induction step holds. ∎

Remark 1   An alternative method for computing marginals with complexity was given in [10]. They compute value directly from messages and by summing over pairs of patterns (thus the square factor in the complexity). In contrast, we use a recursive rule that uses previously computed values of . We also use the existence of the “