Large Alphabet Compression and Predictive Distributions through Poissonization and Tilting


Xiao Yang and Andrew R. Barron. Part of this paper was presented at the Sixth Workshop on Information Theoretic Methods in Science and Engineering, 26-29 August 2013, Tokyo, Japan.
Abstract

This paper introduces a convenient strategy for coding and predicting sequences of independent, identically distributed random variables generated from a large alphabet of size $m$. In particular, the size of the sample is allowed to be variable. The employment of a Poisson model and a tilting method simplifies the implementation and analysis through independence. The resulting strategy is optimal within the class of distributions satisfying a moment condition, and is close to optimal for the class of all i.i.d. distributions on strings of a given length. Moreover, the method can be used to code and predict strings with a condition on the tail of the ordered counts. It can also be applied to distributions in an envelope class.

large alphabet, minimax regret, normalized maximum likelihood, Poisson distribution, power law, universal coding, Zipf’s law

I Introduction

Large alphabet compression and prediction problems concern understanding the probabilistic structure governing a huge number of possible outcomes. In many cases the ordered probabilities of individual outcomes fall off quickly, with a small number of outcomes occurring most often. An example is Chinese characters. A recently published dictionary contains 85568 Chinese characters in total [1], but the number of frequently used characters is considerably smaller. Here we consider an i.i.d. model for this problem. Despite the possible dependence among the symbols in language, it serves as a starting point and can be extended to models taking dependence into account.

Previous theoretical analyses usually assume that the length of a message is known in advance when it is coded. This is not always true in practice. A writer serializing a novel does not know exactly how many words it will contain before finishing the last sentence. Nevertheless, given a limited time or space, one could reasonably guess how many words on average can be accommodated.

Suppose a string of random variables $X^* = X_1, X_2, \ldots, X_n$ is generated independently from a discrete alphabet $\mathcal{X} = \{1, 2, \ldots, m\}$ of size $m$. We allow the string length $n$ to be variable. A special case is when $n$ is given as a fixed number; alternatively, it can be random. In either case, $X^*$ is a member of the set of all finite-length strings $\mathcal{X}^* = \bigcup_{n \ge 0} \mathcal{X}^n$.

Our goal is to code/predict the string $X^*$. Note that the length $n$ is determined by the string. There will be an agreed-upon distribution for $n$, perhaps Poisson or deterministic.

Now suppose that, given $n$, each random variable $X_i$ is generated independently according to a probability mass function $P_\theta$ in a parametric family $\{P_\theta : \theta \in \Theta\}$ on $\mathcal{X}$. Thus

$$P_\theta(X^n = x^n) = \prod_{i=1}^n P_\theta(x_i)$$

for $x^n \in \mathcal{X}^n$. Of particular interest is the class of all i.i.d. distributions, with $\theta = (\theta_1, \ldots, \theta_m)$ parameterized by the simplex $\Theta = \{\theta : \theta_j \ge 0, \ \sum_{j=1}^m \theta_j = 1\}$.

Let $N = (N_1, \ldots, N_m)$ denote the vector of counts, where $N_j$ is the number of occurrences of symbol $j$. The observed sample size is the sum of the counts $n = \sum_{j=1}^m N_j$. Both the fixed-length probability $P_\theta(X^n = x^n)$ and the variable-length probability $P_\theta(X^* = x^*)$ have factorizations based on the distribution of the counts

$$P_\theta(X^n = x^n) = P(x^n \mid N)\, P_\theta(N)$$

and

$$P_\theta(X^* = x^*) = P(x^* \mid N)\, P_\theta(N).$$

The first factor in the two equations is the uniform distribution on the set of strings with the given counts, which does not depend on $\theta$. The vector of counts forms a sufficient statistic for $\theta$. Modeling the distribution of the counts is essential for forming codes and predictions. In the particular case of all i.i.d. distributions parameterized by the simplex, the distribution of the counts given the total $n$ is the $\mathrm{Multinomial}(n, \theta)$ distribution.

In the above, there is a need for a distribution of the total count $n$. Of particular interest is the case that the total count is taken to be $\mathrm{Poisson}(\lambda)$, because then the resulting distribution of the individual counts makes them independent.

Accordingly, we give particular attention to the target family $\{P_\lambda : \lambda = (\lambda_1, \ldots, \lambda_m)\}$, in which $P_\lambda(N)$ is the product of $\mathrm{Poisson}(\lambda_j)$ distributions for $j = 1, \ldots, m$. It makes the total count $n$ a $\mathrm{Poisson}(\lambda)$ variable with $\lambda = \sum_{j=1}^m \lambda_j$ and yields the $\mathrm{Multinomial}(n, \theta)$ distribution by conditioning on $n$, where $\theta_j = \lambda_j/\lambda$. The induced distribution on $\mathcal{X}^*$ is

$$P_\lambda(X^* = x^*) = P(x^* \mid N)\,\prod_{j=1}^m \mathrm{Poi}(N_j \mid \lambda_j).$$

The task of coding a string is equivalent to providing a probabilistic scheme. A coder for the string is a (sub)probability distribution $Q$ on $\mathcal{X}^*$ which assigns a probability $Q(x^*)$ to each string and produces a binary string of length $\log 1/Q(x^*)$ (we do not worry about the integer codelength constraint). Ideally the true probability distribution $P_\theta$ could be used if $\theta$ were known, as it produces no extra bits for coding purposes. The regret induced by using $Q$ instead of $P_\theta$ is

$$\log\frac{P_\theta(x^*)}{Q(x^*)},$$

where $\log$ is the logarithm base $2$. Likewise, the expected regret is

$$\mathbb{E}_{P_\theta}\!\left[\log\frac{P_\theta(X^*)}{Q(X^*)}\right].$$
In universal coding the expected regret is also called the redundancy.

Here we can construct $Q$ by choosing a probability distribution $Q(N)$ for the counts and then using the uniform distribution for the distribution of strings given the counts, written $P(x^* \mid N)$. That is,

$$Q(x^*) = P(x^* \mid N)\, Q(N).$$

Then the regret becomes the log-ratio of the count probabilities

$$\log\frac{P_\theta(N)}{Q(N)},$$

and the redundancy becomes

$$\mathbb{E}_{P_\theta}\!\left[\log\frac{P_\theta(N)}{Q(N)}\right].$$

In the pointwise regret story, the set of codelengths $\{\log 1/P_\theta(x^*) : \theta \in \Theta\}$ provides a standard with which our coder is to be compared. Given the family, consider the best candidate with hindsight, which achieves the maximum value $\max_\theta P_\theta(x^*) = P_{\hat\theta}(x^*)$ (corresponding to the shortest ideal codelength), where $\hat\theta$ is the maximum likelihood estimator of $\theta$, and compare it to our strategy $Q$. The maximization is equivalent to maximizing the count probability, as the uniform distribution does not depend on $\theta$, i.e.

$$\max_\theta P_\theta(x^*) = P(x^* \mid N)\,\max_\theta P_\theta(N).$$

Then the problem becomes: given the family, how to choose $Q$ to minimize the maximized regret

$$\max_{x^*}\,\log\frac{\max_\theta P_\theta(x^*)}{Q(x^*)},$$

or the redundancy,

$$\max_\theta\,\mathbb{E}_{P_\theta}\!\left[\log\frac{P_\theta(X^*)}{Q(X^*)}\right].$$

For the regret, the maximum can be restricted to a set of counts instead of the whole space, a traditional choice being the set $S_n$ of count vectors associated with a given sample size $n$, in which case the minimax regret is

$$\min_Q \max_{N \in S_n}\,\log\frac{\max_\theta P_\theta(N)}{Q(N)}.$$

As is familiar in universal coding [2][3], the normalized maximum likelihood (NML) distribution

$$Q_{\mathrm{NML}}(N) = \frac{\max_\theta P_\theta(N)}{\sum_{N' \in S_n}\max_\theta P_\theta(N')}$$

is the unique pointwise minimax strategy when the normalizing sum is finite, and the logarithm of that sum is the minimax value. When $m$ is large, the NML distribution can be unwieldy to compute for compression or prediction. Instead we will introduce a slightly suboptimal coding distribution that makes the counts independent, and show that it is nearly optimal for every total count $n$ not too different from a target value. Indeed, we advocate that our simple coding distribution is preferable computationally when $m$ is large, even if the sample size were known in advance.

To produce our desired coding distribution we make use of two basic principles. One is that the multinomial family of distributions on counts matches the conditional distribution of given the sum when unconditionally the counts are independent Poisson. Another is the information theory principle [4][5][6] that the conditional distribution given a sum (or average) of a large number of independent random variables is approximately a product of distributions, each of which is the one closest in relative entropy to the unconditional distribution subject to an expectation constraint. This minimum relative entropy distribution is an exponential tilting of the unconditional distribution.
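To make the first principle concrete in our notation (a standard identity, stated here for reference): if $N_j \sim \mathrm{Poisson}(\lambda_j)$ independently and $\lambda = \sum_{j=1}^m \lambda_j$, then

$$P\Big(N_1 = k_1, \ldots, N_m = k_m \,\Big|\, \sum_{j=1}^m N_j = n\Big) = \frac{n!}{k_1!\cdots k_m!}\prod_{j=1}^m\Big(\frac{\lambda_j}{\lambda}\Big)^{k_j}, \qquad \sum_{j=1}^m k_j = n,$$

which is exactly the $\mathrm{Multinomial}(n, \theta)$ distribution with $\theta_j = \lambda_j/\lambda$.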

In the Poisson family with distribution $\mathrm{Poisson}(\lambda_j)$, exponential tilting (multiplying by the factor $e^{-ak}$ and normalizing) preserves the Poisson family (with the parameter scaled to $\lambda_j e^{-a}$). Those distributions continue to correspond to the multinomial distribution (with parameters $\theta_j = \lambda_j/\lambda$) when conditioning on the sum of the counts. A particular choice of the tilting parameter $a$ provides the product of Poisson distributions closest to the multinomial in regret. Here, for universal coding, we find the tilting of the individual maximized likelihoods that makes the product of such distributions closest to Shtarkov's NML distribution. This greatly simplifies the task of approximately optimal universal compression and the analysis of its regret.

Indeed, applying the maximum likelihood step to a Poisson count $k$ produces a maximized likelihood value of

$$M(k) = \max_{\lambda \ge 0}\frac{e^{-\lambda}\lambda^k}{k!} = \frac{k^k e^{-k}}{k!}.$$

We call this maximized likelihood the Stirling ratio, as it is the quantity that Stirling's approximation shows is near $1/\sqrt{2\pi k}$ for $k$ not small. We find that this quantity plays a distinguished role in universal large alphabet compression, even for sequences with small counts $k$. This measure has a product extension to counts $N = (N_1, \ldots, N_m)$,

$$M(N) = \prod_{j=1}^m M(N_j).$$

Although $M$ has an infinite sum by itself, it is normalizable when tilted by $e^{-ak}$ for every positive $a$. The tilted Stirling ratio distribution is

$$q_a(k) = \frac{1}{C_a}\,\frac{k^k e^{-k}}{k!}\,e^{-ak}, \qquad k = 0, 1, 2, \ldots \tag{1}$$

with the normalizer $C_a = \sum_{k \ge 0}\frac{k^k e^{-k}}{k!}\,e^{-ak}$.
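As a concrete illustration of equation (1), the following sketch (ours, in Python; the function names and the truncation point kmax are illustrative choices, not from the paper) evaluates the tilted Stirling ratio probabilities and the normalizer $C_a$ by truncating the infinite sum:

import math

def log_stirling_ratio(k):
    # log of M(k) = k^k e^{-k} / k!, the Poisson likelihood maximized over its mean;
    # by convention M(0) = 1.
    if k == 0:
        return 0.0
    return k * math.log(k) - k - math.lgamma(k + 1)

def tilted_stirling_ratio(a, kmax=200000):
    # Tilted pmf q_a(k) proportional to M(k) * exp(-a*k), truncated at kmax.
    # Returns (pmf, C_a).  Since M(k) is about 1/sqrt(2*pi*k), the truncated
    # tail is negligible for any a > 0 once exp(-a*kmax) is tiny.
    log_w = [log_stirling_ratio(k) - a * k for k in range(kmax + 1)]
    c_a = sum(math.exp(lw) for lw in log_w)
    return [math.exp(lw) / c_a for lw in log_w], c_a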

The coding distribution we propose and analyze is simply the product of these tilted one-dimensional maximized Poisson likelihood distributions, for a value of $a$ we will specify later:

$$Q_a(N) = \prod_{j=1}^m q_a(N_j).$$

By allowing description of all possible counts $N_j \ge 0$, $j = 1, \ldots, m$, our codelength will be greater for some strings than codelengths designed for the case of a given sum $n$. Nevertheless, with the total count distributed approximately $\mathrm{Poisson}(n)$ under our coding distribution, the probability of the outcome that the total equals $n$ is approximately $1/\sqrt{2\pi n}$. So the allowance of description of all counts (not just counts with a given total) adds about $\frac{1}{2}\log(2\pi n)$ bits to the description length beyond that which would have been ideal if $n$ were known. This ideal codelength, constructed from the tilted maximized Poisson conditioned on the total $n$, matches Shtarkov's normalized maximum likelihood based on the multinomial.

For a small alphabet with $m = o(n)$, the minimax regret is about $\frac{1}{2}\log\frac{n}{m}$ bits per free parameter (a total of $\frac{m-1}{2}\log\frac{n}{m}$ plus a constant); and for a large alphabet, when $m \propto n$ and when $n = o(m)$, the minimax regret is about a constant times $n$ and about $n\log\frac{m}{n}$, respectively [2][3][7][8]. The additional $\frac{1}{2}\log(2\pi n)$ bits is a small price to pay for the sake of gaining the coding simplification and additional flexibility.

If it is known that the total count is $n$, then the regret is a simple function of $a$ and the normalizer $C_a$. The choice of the tilting parameter $a^*$ given by the moment condition $\sum_{k \ge 0} k\, q_{a^*}(k) = n/m$ minimizes the regret over all positive $a$. This arises by differentiation, because $-\frac{\partial}{\partial a}\log C_a$ is equal to the given moment. Moreover, $a^*$ depends only on the ratio $m/n$ between the size of the alphabet and the total count. Fig. 1 displays $a^*$ as a function of this ratio, solved numerically. Given an alphabet with $m$ symbols and a string generated from it of length $n$, one can look at the plot and find the desired $a^*$ according to the given ratio, and then use $q_{a^*}$ to code the data.
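A minimal numerical sketch for recovering the curve in Fig. 1 (ours; it reuses tilted_stirling_ratio from above, and the bisection bracket and tolerance are arbitrary choices): since the mean of $q_a$ decreases as $a$ grows, the moment condition can be solved by bisection.

def mean_count(a, kmax=200000):
    # E[K] under the tilted Stirling ratio distribution q_a.
    pmf, _ = tilted_stirling_ratio(a, kmax)
    return sum(k * p for k, p in enumerate(pmf))

def solve_tilting(n, m, lo=1e-6, hi=20.0, tol=1e-9):
    # Bisection for a* satisfying E_{q_{a*}}[K] = n/m; the mean is decreasing in a.
    target = n / m
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mean_count(mid) > target:
            lo = mid      # mean still too large: tilt harder
        else:
            hi = mid
    return 0.5 * (lo + hi)

For example, an alphabet of size m = 1000 and a string of length n = 5000 give the ratio m/n = 0.2, and solve_tilting(5000, 1000) returns the corresponding tilting parameter.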

If, however, the total count is not given, then the regret depends on $n$. We use a mixture of tilted distributions over the tilting parameter to account for the lack of knowledge of $n$ in advance; details are discussed in Section III-D.

Fig. 1: Relationship between the tilting parameter $a^*$ and the ratio $m/n$.

When $a$ is small, the tilting of the maximized Poisson likelihood distributions does not have much effect except in the tail of the distribution. Over most of the range of count values it follows the approximate power-law $q_a(k) \propto k^{-1/2}$, as we have indicated. Power-laws have been studied for count distributions and have been shown to be related to Zipf's law for the sorted counts [9]. Our use of a distribution close to a power-law is not because a power-law is assumed to govern the data, but rather because of its near-optimum regret properties within suitable sets of counts, demonstrated here for the class of all Poisson count distributions, from which we also obtain its near-optimality for the class of all multinomial distributions on counts.

Shtarkov studied the universal data compression problem and identified the exact pointwise minimax strategy [2]. He showed that the asymptotic minimax lower bound for the regret is of the form $\frac{m-1}{2}\log n$ plus a constant, in which the parameter set is the $(m-1)$-dimensional simplex of all probability vectors on an alphabet of size $m$. However, this strategy cannot be easily implemented for prediction or compression [2], because of the computational inconvenience of computing the normalizing constant, and because of the difficulty of computing the successive conditionals required for implementation (by arithmetic coding). Let $m'$ be the number of different symbols that appear in a sequence. Shtarkov [10] also pointed out that when $m$ is large, it is typical that $m'$ is much less than $m$, and the regret depends mainly on $m'$ rather than $m$. Xie and Barron [3][11] gave an asymptotically minimax strategy for coding under both the expected and pointwise regret for a fixed-size alphabet, formulated by a modification of the mixture density using Jeffreys' prior. The asymptotic value of both the redundancy and the regret is of the form $\frac{m-1}{2}\log n + C_m$, where $C_m$ is a constant depending on $m$. Orlitsky and Santhanam [12] considered the problem in a large-alphabet setting in which the number of symbols is much larger than the sequence length or even infinite. They found that the main terms in the minimax regret for the $m = o(n)$, $m \propto n$, and $n = o(m)$ cases take the forms $\frac{m-1}{2}\log\frac{n}{m}$, a constant times $n$, and $n\log\frac{m}{n}$, respectively. Szpankowski and Weinberger [8] provided more precise asymptotics in these settings. They also calculated the minimax regret of a source model in which some symbol probabilities are fixed. Boucheron, Garivier and Gassiat [13] focused on countably infinite alphabets with an envelope condition; they used an adapted strategy and gave upper and lower bounds for the pointwise minimax regret. Later on, Bontemps and Gassiat [14] worked on the exponentially decreasing envelope class and provided a minimax strategy and the corresponding regret.

In this paper, we introduce a straightforward and easy-to-implement method for large alphabet coding. The purpose is threefold. First, by allowing the sample size to be variable, we are considering a larger class of distributions. This is a more realistic and less restrictive assumption than presuming a particular length. The method can, however, also be used for fixed sample size coding and prediction.

Second, it unveils an information geometry of three key distributions/measures in the problem: the unnormalized maximized Poisson likelihood measure $M(N) = \prod_j M(N_j)$ of the counts; the conditional distribution of the counts given that the total equals $n$, which matches Shtarkov's normalized maximum multinomial likelihood distribution; and the tilted distribution $Q_{a^*}$, with the tilting parameter chosen to make the expected total count equal to $n$. This tilted distribution minimizes the relative entropy from the original measure $M$ within the class $\mathcal{C}$ of distributions satisfying the moment condition that the expected total count equals $n$. Hence, $Q_{a^*}$ is the information projection of $M$ onto $\mathcal{C}$. Moreover, since the conditional distribution is also in $\mathcal{C}$, the Pythagorean-like equality holds [15][4], i.e.

$$D(Q_{\mathrm{NML}}\,\|\,M) = D(Q_{\mathrm{NML}}\,\|\,Q_{a^*}) + D(Q_{a^*}\,\|\,M).$$
The case of a tilted distribution (the information projection) as an approximating conditional distribution is investigated in [6] and [5]. A difference here is that our unconditional measure $M$ is not normalizable.

Third, the strategy designed through an independent Poisson model and tilting is much easier to analyze and compute than strategies based on multinomials. The convenience is gained through independence. To actually apply this two-pass code, one could first describe the independent counts $N_1, \ldots, N_m$, for instance by arithmetic coding using the product of tilted distributions, and then describe the string given the count vector, by arithmetic coding using the sequence of conditional distributions for $X_i$ given both the counts and $X_1, \ldots, X_{i-1}$ (which is the sampling-without-replacement distribution, proportional to the counts of what remains after step $i-1$).
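A small sketch of the second pass (ours; it only produces the conditional probabilities that an arithmetic coder would consume, the coder itself being omitted):

def sequence_conditionals(symbols, counts):
    # Conditional probability of each symbol given the full count vector and the
    # symbols already coded: sampling without replacement, proportional to the
    # counts of what remains.
    remaining = dict(counts)           # symbol -> count not yet used
    total = sum(remaining.values())
    probs = []
    for s in symbols:
        probs.append(remaining[s] / total)
        remaining[s] -= 1
        total -= 1
    return probs

For the string 'abacab' with counts {'a': 3, 'b': 2, 'c': 1}, the successive probabilities are 3/6, 2/5, 2/4, 1/3, 1/2 and 1/1.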

This paper is organized in the following way. Section II introduces the model. Section III provides the general results and outlines the proofs, and Section IV gives simulated and real data examples. Details of the proofs are left to the appendix.

II The Poisson Model

A Poisson model fits well into this problem. We have, for each $j = 1, \ldots, m$,

$$N_j \sim \mathrm{Poisson}(\lambda_j)$$

independently, and the total count $n = \sum_{j=1}^m N_j$ also has a Poisson distribution

$$n \sim \mathrm{Poisson}(\lambda),$$

where $\lambda = \sum_{j=1}^m \lambda_j$. Writing $P_\lambda$ for the induced distribution on strings, we have

$$P_\lambda(X^* = x^*) = P(x^* \mid N)\,\prod_{j=1}^m \mathrm{Poi}(N_j \mid \lambda_j).$$

We know that the MLE for each $\lambda_j$ is $\hat\lambda_j = N_j$, and the first factor is a uniform distribution which does not depend on $\lambda$. So

$$\max_\lambda P_\lambda(X^* = x^*) = P(x^* \mid N)\,\prod_{j=1}^m M(N_j),$$

where $M(k) = k^k e^{-k}/k!$ (as given in the introduction) is the unnormalized maximized likelihood $\max_\lambda \mathrm{Poi}(k \mid \lambda)$.
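For completeness, the maximization behind this identity is the standard calculation

$$\frac{\partial}{\partial\lambda}\log\frac{e^{-\lambda}\lambda^k}{k!} = -1 + \frac{k}{\lambda} = 0 \;\Longrightarrow\; \hat\lambda = k, \qquad \max_{\lambda \ge 0}\frac{e^{-\lambda}\lambda^k}{k!} = \frac{k^k e^{-k}}{k!} = M(k),$$

with the convention $M(0) = 1$ (the maximum is attained at $\lambda = 0$ when $k = 0$).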

If we use a distribution $Q(N)$ to code the counts, then the regret is

$$\log\frac{\prod_{j=1}^m M(N_j)}{Q(N)},$$

and the redundancy is

$$\mathbb{E}_{P_\lambda}\!\left[\log\frac{\prod_{j=1}^m \mathrm{Poi}(N_j \mid \lambda_j)}{Q(N)}\right].$$

This method can also be applied to the fixed total count scenario, which corresponds to the multinomial coding and prediction problem. Suppose $n$ is given; the Poisson model, when conditioned on $\sum_j N_j = n$, indeed reduces to the i.i.d. sampling model

$$P_\lambda(X^n = x^n \mid n) = \prod_{i=1}^n \theta_{x_i}.$$

The right-hand side is a discrete memoryless source distribution (i.i.d. $\theta$) with probability specified by $\theta_j = \lambda_j/\lambda$, for $j = 1, \ldots, m$. Note that a sequence with counts $N$ of total $n$ satisfies

$$\prod_{i=1}^n \theta_{x_i} = \prod_{j=1}^m \theta_j^{N_j}.$$

The question left is still how to model the counts. The maximized likelihood (the same target as used by Shtarkov) is thus expressible as

$$\max_\theta \prod_{j=1}^m \theta_j^{N_j} = \prod_{j=1}^m \Big(\frac{N_j}{n}\Big)^{N_j}.$$

Now, again, if we use $Q(N)$ to code the counts, then the regret is

$$\log\frac{\max_\theta \binom{n}{N_1, \ldots, N_m}\prod_{j=1}^m \theta_j^{N_j}}{Q(N)} = \log\frac{\prod_{j=1}^m M(N_j)}{Q(N)} + \log\frac{1}{M(n)}. \tag{2}$$

Here $M(n) = n^n e^{-n}/n!$, hence the term $\log\frac{1}{M(n)}$ is, by Stirling's approximation, near $\frac{1}{2}\log(2\pi n)$. It arises because $Q$ here includes description of the total $n$, while the more restrictive multinomial target regards it as given.
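Explicitly, Stirling's formula $n! \approx \sqrt{2\pi n}\,(n/e)^n$ gives

$$\log\frac{1}{M(n)} = \log\frac{n!}{n^n e^{-n}} \approx \log\sqrt{2\pi n} = \frac{1}{2}\log(2\pi n).$$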

III Results

III-A Regret

We start by looking at the performance of using independent tilted Stirling ratio distributions as a coding strategy, by examining the resulting regret.

Let $S$ be any set of count vectors; then the maximized regret of using $Q$ as a coding strategy, given a class of distributions $\mathcal{P}$, when the vector of counts is restricted to $S$, is

$$r(Q, \mathcal{P}, S) = \max_{N \in S}\,\log\frac{\sup_{P \in \mathcal{P}} P(N)}{Q(N)}.$$

Theorem 1.

Let $q_a$ be the distribution specified in equation (1) (Poisson maximized likelihood, tilted and normalized), and let $\mathcal{P}$ be the class of independent Poisson distributions. The regret of using the product $Q_a(N) = \prod_{j=1}^m q_a(N_j)$ of tilted distributions for a given vector of counts $N$ with total $n = \sum_{j=1}^m N_j$ is

$$\log\frac{\prod_{j=1}^m M(N_j)}{Q_a(N)} = n\,a\log e + m\log C_a.$$

Let $S_n$ be the set of count vectors with total count $n$, defined as before; then

$$r(Q_a, \mathcal{P}, S_n) = n\,a\log e + m\log C_a. \tag{3}$$

Let $a^*$ be the choice of $a$ satisfying the following moment condition

$$\sum_{k \ge 0} k\, q_a(k) = \frac{n}{m}. \tag{4}$$

Then $a^*$ is the minimizer of the regret in expression (3). Write $r^* = r(Q_{a^*}, \mathcal{P}, S_n)$.

When $m = o(n)$, the regret $r^*$ is near $\frac{m}{2}\log\frac{n}{m}$ in the following sense:

$$r^* \le \frac{m}{2}\log\frac{n}{m} + \epsilon_{m,n}, \tag{5}$$

where $\epsilon_{m,n}$ is a remainder of smaller order than the main term.

When $n = o(m)$, the regret $r^*$ is near $n\log\frac{m}{n}$ in the following sense:

$$r^* \le n\log\frac{m}{n} + \tilde\epsilon_{m,n}, \tag{6}$$

where $\tilde\epsilon_{m,n}$ is likewise a remainder of smaller order than the main term.

When $m = b\,n$ for a fixed ratio $b > 0$, the regret is $r^* = n\,C_b$, where the constant $C_b = a\log e + b\log C_a$, and $a$ is such that $\sum_{k \ge 0} k\,q_a(k) = 1/b$.

Proof:

The expression for the regret follows from the definition. The fact that $a^*$ is the minimizer can be seen by taking the partial derivative of expression (3) with respect to $a$. The upper bounds are derived by applying Lemma 1 in the appendix: picking a suitable value of $a$ and using the first inequality gives the upper bound for the $m = o(n)$ case; picking another suitable value and using the second inequality gives the upper bound for the $n = o(m)$ case. Here $\log$ is the logarithm base $2$. The rest of the proof is left to Appendix B. ∎

Remark 1: The regret depends only on the number of parameters $m$, the total count $n$, and the tilting parameter $a$. The optimal tilting parameter $a^*$ is given by the simple moment condition in equation (4).
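As a numerical illustration of this remark (our sketch, reusing tilted_stirling_ratio and solve_tilting from above; it simply evaluates expression (3) in bits):

def regret_bits(a, n, m, kmax=200000):
    # Regret of the product code Q_a over count vectors with total n on an
    # alphabet of size m:  n*a*log2(e) + m*log2(C_a), as in expression (3).
    _, c_a = tilted_stirling_ratio(a, kmax)
    return (n * a + m * math.log(c_a)) / math.log(2)

Evaluating regret_bits(solve_tilting(n, m), n, m) gives the minimized value of expression (3), since the moment condition (4) characterizes the minimizer.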

Remark 2: The regret is close to the minimax level in all three cases listed in Theorem 1. The main terms in the $m = o(n)$ and $n = o(m)$ cases are the same as the minimax regret given in [8], except that the multiplier for $\log\frac{n}{m}$ here is $\frac{m}{2}$ instead of $\frac{m-1}{2}$ for the small-alphabet scenario. For the $m \propto n$ case, the regret is close to the minimax regret in [8] numerically.

Remark 3: In fact, the pointwise regret provides an upper bound for the expected regret (the redundancy). Recall that, for any $\lambda$ with $\sum_j \lambda_j = n$,

$$\mathbb{E}_{P_\lambda}\!\left[\log\frac{P_\lambda(N)}{Q_a(N)}\right] \;\le\; \mathbb{E}_{P_\lambda}\!\left[\log\frac{\prod_{j=1}^m M(N_j)}{Q_a(N)}\right] \;=\; n\,a\log e + m\log C_a. \tag{7}$$

Theorem 4 in Appendix C gives a more detailed expression for the redundancy of using $Q_{a^*}$. While there is some reduction in bits as compared to the pointwise case, the error terms depend on the $\lambda_j$'s. Nevertheless, expression (7) still provides a uniform upper bound on the redundancy for all possible Poisson means with a given sum.

Corollary 1.

Let $\mathcal{M}$ be the family of multinomial distributions with total count $n$. Then the maximized regret of $Q_{a^*}$ over $\mathcal{M}$ has an upper bound within $\log\frac{1}{M(n)} \approx \frac{1}{2}\log(2\pi n)$ above the upper bound in Theorem 1.

Proof:

This can be easily seen by equation (2). ∎

III-B Subset of sequences with partitioned counts

One advantage of using the tilted Stirling ratio distributions is the flexibility of choosing tilting parameters. As mentioned in the introduction, the ratio $m/n$ uniquely determines the optimal tilting parameter. In fact, different tilting parameters can be used for different symbols to adjust for their relative importance in the alphabet. Here we consider a situation in which most of the empirical probability is captured by a small portion of the symbols. This happens when the sorted probability list is quite skewed.

The following theorem holds for strings with a constraint on the sum of the tail counts $\sum_{j > m_0} N_j$. A small remainder occurs in the following regret bound when the tail fraction $\mu$ and the ratio $m_0/n$ are both small.

Theorem 2.

Let $S_{n,\mu}$ be a subset of count vectors with the tail sum controlled by a value $\mu$, that is, $\sum_{j > m_0} N_j \le \mu\,n$. Here $\mu$ is a number between $0$ and $1$. The regret of using the tilted Stirling ratio distributions for count vectors in $S_{n,\mu}$ with total count $n$ is mainly

$$\frac{m_0}{2}\log\frac{(1-\mu)n}{m_0} + \mu\,n\,\log\frac{m - m_0}{\mu\,n}. \tag{8}$$

The remainder is bounded below by $r_{\mathrm{lo}}$ and above by $r_{\mathrm{hi}}$, where $r_{\mathrm{lo}}$ and $r_{\mathrm{hi}}$ are explicit functions of $m_0$, $m$, $n$ and $\mu$ given in the proof, both of which are small when $\mu$ and $m_0/n$ are small.

Proof:

Consider the product distribution

$$Q(N) = \prod_{j=1}^{m_0} q_{a_1}(N_j)\,\prod_{j=m_0+1}^{m} q_{a_2}(N_j),$$

where the tilting parameter is $a_1$ if $j \le m_0$, and is $a_2$ if $j > m_0$. It is in fact using an $m_0$-dimensional product distribution on the first $m_0$ symbols, and an $(m-m_0)$-dimensional product distribution on the rest.

The regret is the same for any given head total $n_0 = \sum_{j \le m_0} N_j$ and tail total $n_1 = \sum_{j > m_0} N_j$. That is,

$$r(Q, \mathcal{P}, S) = r(Q_{a_1}, \mathcal{P}_0, S_{n_0}) + r(Q_{a_2}, \mathcal{P}_1, S_{n_1}).$$

Here $\mathcal{P}_0$ and $\mathcal{P}_1$ denote the classes of independent Poisson distributions for the two blocks of symbols, and $S_{n_0}$ and $S_{n_1}$ are the sets of counts for the respective blocks with sums equal to $n_0$ and $n_1$. In the above case, $n_0 \ge (1-\mu)n$ and $n_1 \le \mu n$.

The choices of $a_1$ and $a_2$ providing minimization of the regret are given by the following conditions

$$\sum_{k \ge 0} k\,q_{a_1}(k) = \frac{(1-\mu)n}{m_0}, \qquad \sum_{k \ge 0} k\,q_{a_2}(k) = \frac{\mu\,n}{m - m_0}.$$

This result can be derived by applying Theorem 1 to the two blocks respectively. ∎

Remark 4: The problem here is treated as two separate coding tasks, one for a small alphabet with $m_0$ symbols having a total count of about $(1-\mu)n$, and the other for a large alphabet with $m - m_0$ symbols with total count about $\mu n$. The two main terms in expression (8) represent the regret from coding the two subsets of symbols: one set contains symbols having relatively large counts, each of which induces about $\frac{1}{2}\log\frac{(1-\mu)n}{m_0}$ bits of regret, and the other contains the remaining symbols with small counts, which together cost about $\mu n \log\frac{m - m_0}{\mu n}$ extra bits.
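A sketch of the two-block choice described in this remark (ours; it reuses solve_tilting and regret_bits from above, and treats the split point m0 and tail fraction mu as given inputs):

def two_block_regret(n, m, m0, mu):
    # Frequent block: m0 symbols carrying about (1 - mu) * n of the counts.
    # Tail block: m - m0 symbols carrying about mu * n of the counts.
    # Each block gets its own tilting parameter via the moment condition.
    n_head, n_tail = (1.0 - mu) * n, mu * n
    a1 = solve_tilting(n_head, m0)
    a2 = solve_tilting(n_tail, m - m0)
    return regret_bits(a1, n_head, m0) + regret_bits(a2, n_tail, m - m0)

For instance, two_block_regret(10000, 100000, 200, 0.05) estimates the regret when 95 percent of a length-10000 string falls on its 200 most frequent symbols out of an alphabet of 100000.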

Remark 5: One can arrange more flexibility in what the code can achieve by adding small additional pieces to the code. One is to adapt the choice of $m_0$ between $1$ and $m$, including about $\log m$ more bits for the description of $m_0$. Next, one can either work with the counts in the given order, or use an additional $\log\binom{m}{m_0}$ bits to describe the subset that has the largest counts. Then one uses the two-block code above to describe the counts. Rather than fixing $\mu$, one works with the empirical tail fraction $\hat\mu = n_1/n$, where $n_1$ is the sum of the counts for the remaining symbols. Finally, one has to adapt the choices of $a_1$ and $a_2$. A suggested method of doing so is described in Section III-D, in which the fixed tilting is replaced by a mixture over a range of choices of $a_1$ and $a_2$.

III-C Envelope class

Besides a subset of strings, we can also consider a subclass of distributions. Here we follow the definition of the envelope class in [13]. Suppose $\mathcal{C}(f)$ is a class of distributions on $\mathcal{X}$ with the symbol probabilities bounded above by an envelope function $f$, i.e.

$$\mathcal{C}(f) = \{\theta : \theta_j \le f(j)\ \text{for all}\ j\}.$$

Given the string length $n$, we know the count of each symbol follows a Poisson distribution with mean $\lambda_j = n\,\theta_j$, $j = 1, \ldots, m$. This transfers the envelope condition from the multinomial distribution to a Poisson distribution, the mean of which is restricted to the following set

$$\Lambda(f) = \{\lambda : \lambda_j \le n f(j)\ \text{for all}\ j\}.$$

Theorem 3.

The minimax regret of the Poisson class with envelope function $f$ has the following upper bound

where , and

Proof:

A tilted distribution with suitably chosen tilting parameters will give the result. Details are left to Appendix D. ∎

Remark 6: Here, in order for the bound to be small, the tail sum of the envelope function needs to be small, although the upper bound holds for a general envelope function $f$ and cutoff $m_0$. This result is of the same order as the upper bound given in [13]. The first main term in the bound given in Theorem 3 also matches, by Stirling's approximation, the minimax regret given in [3] for an alphabet with $m_0$ symbols and $n$ data points. The extra amount is because the tilted distribution allows $m_0$ free parameters instead of $m_0 - 1$.

Remark 7: The best choice of tilting parameters for the envelope class depends only on the envelope function and the number of symbols constituting the 'frequent' subset. Unlike the subset-of-strings case discussed before, neither the order of the counts nor which symbols have the largest counts matters; all we need is an envelope function decaying fast enough, when the symbol probabilities are arranged in decreasing order, that the two remainder quantities in the bound are both small.

III-D Regret with unknown total count

We know that the optimal tilting parameter depends on the value of the total count. However, when the total count is not known, we can use a mixture of tilted distributions over a range of values of the tilting parameter.

Here the upper end of the range of integration is due to inequality (16). We have