
Fundamental Limits of Universal Variable-to-Fixed Length Coding of Parametric Sources

This research was funded in part by the NSF under grant No. CCF-1422358.

Nematollah Iri (niri1@asu.edu) and Oliver Kosut (okosut@asu.edu)
Abstract

Universal variable-to-fixed (V-F) length coding of a d-dimensional exponential family of distributions is considered. We propose an achievable scheme consisting of a dictionary, used to parse the source output stream, that makes use of the previously introduced notion of quantized types. The quantized type class of a sequence is based on partitioning the space of minimal sufficient statistics into cuboids. Our proposed dictionary consists of sequences at the boundary of the transition from low to high quantized type class size. We derive the asymptotics of the ε-coding rate of our coding scheme for large enough dictionaries. In particular, we characterize the third-order coding rate of our scheme in terms of the entropy of the source and the dictionary size M. We further provide a converse, showing that this rate is optimal up to the third-order term.

1 Introduction

A variable-to-fixed (V-F) length code consists of a dictionary of pre-specified size. Elements of the dictionary (segments) are used to parse the infinite sequence emitted from the source. Segments may have variable length; however, each segment is encoded to the fixed-length binary representation of its index within the dictionary. In order to be able to uniquely parse any infinite-length sequence into segments, we assume the dictionary is complete (i.e. every infinite-length sequence has a prefix within the dictionary) and proper (i.e. no segment is a prefix of another segment). The underlying source model induces a distribution on the segment lengths, and this segment length distribution reflects the quality of the dictionary for the compression task.

For a given memoryless source, Tunstall [1] provided an average-case optimal algorithm to maximize the average segment length. A central limit theorem for the Tunstall algorithm’s code length has been derived in [2]. In most applications, however, statistics of the source are unknown or arduous to estimate, especially at short blocklengths, where there are limited samples for the inference task. In universal source coding, the underlying distribution in force is unknown, yet belongs to a known collection of distributions. Universal V-F length codes are studied in e.g. [3, 4, 5, 6]. Upper and lower bounds on the redundancy of a universal code for the class of all memoryless sources are derived in [3]. Universal V-F length coding of the class of all binary memoryless sources is then considered in [4, 5], where [5] provides an asymptotically optimal algorithm in the average sense. (Throughout, “optimality” of an algorithm is considered only up to the model cost term, i.e. the term reflecting the price of universality, in the coding rate; the model cost term is the second-order term in the average-case analysis, while it is the third-order term in the probabilistic analysis.) Later, the optimal redundancy for V-F length compression of the class of Markov sources is derived in [6]. Performance of V-F length codes and fixed-to-variable (F-V) length codes for compression of the class of Markov sources is compared in [7], and a dictionary construction that asymptotically achieves the optimal error exponent is proposed.
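For concreteness, the following is a minimal sketch of the Tunstall construction for a known memoryless source, under our own assumptions about tie-breaking and data structures: starting from the single-letter dictionary, the most probable segment is repeatedly replaced by its single-letter extensions until the budget of M segments would be exceeded.

```python
import heapq

def tunstall_dictionary(probs, M):
    """Build a Tunstall dictionary of at most M segments for a memoryless
    source with letter probabilities `probs` (dict: letter -> probability).
    Repeatedly split the most probable leaf into its single-letter extensions."""
    alphabet = list(probs)
    # Max-heap over segment probabilities (heapq is a min-heap, so negate).
    heap = [(-probs[a], (a,)) for a in alphabet]
    heapq.heapify(heap)
    size = len(alphabet)
    while size + len(alphabet) - 1 <= M:
        neg_p, seg = heapq.heappop(heap)        # most probable current segment
        for a in alphabet:                      # replace it by its |X| extensions
            heapq.heappush(heap, (neg_p * probs[a], seg + (a,)))
        size += len(alphabet) - 1
    return sorted("".join(seg) for _, seg in heap)

# Example: binary source with P(0) = 0.7, P(1) = 0.3 and a budget of M = 8 segments.
print(tunstall_dictionary({"0": 0.7, "1": 0.3}, 8))
```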

All previous works consider model classes that include all distributions within a simplex. Universal V-F length coding for more structured model classes, however, has not been considered in the literature. Apart from handling this added structure, we also adopt more general performance metrics. Delay-sensitive modern applications impose new requirements on the performance of compression schemes, so it is vital to characterize the overhead associated with operation in the non-asymptotic regime. When probing the non-asymptotic regime, incurring “errors” is inevitable. Therefore, we depart from the classical average-case (redundancy) and worst-case (regret) analyses toward a probabilistic analysis, where the figure of merit in our setup is the ε-coding rate, the minimum rate such that the corresponding overflow probability is less than ε. Our goal is to analyze the asymptotics of the ε-coding rate as the size of the dictionary increases. We provide an achievable scheme for compressing a d-dimensional exponential family of distributions as the parametric model class. Moreover, we provide a converse result, showing that our proposed scheme is optimal up to the third-order term of the ε-coding rate.

In previous universal V-F length codes, one can define a notion of complexity for sequences. In [3, 4, 5, 6], a sequence with high complexity has low probability under a certain composite or mixture source, while in [7], high-complexity sequences have high empirical entropy scaled by the sequence length. The dictionary of such algorithms then consists of sequences at the boundary of the transition from low complexity to high complexity. We follow a similar complexity theme to design the dictionary. The sequence complexity in our proposed algorithm is characterized by the sequence’s type class size, hence we name our scheme the Type Complexity (TC) code. Scaled empirical entropy [7] is ignorant of the underlying structure of the parametric class. Therefore, in order to fully exploit the inherent structure of the model class, we characterize type classes based on quantized types, which we introduced in [8, 9] in studying F-V length compression. We partition the space of minimal sufficient statistics into cuboids, and define two sequences to be in the same quantized type class if and only if their minimal sufficient statistics fall within the same cuboid.

The type class approach has been taken before for the compression problem in [10]. The Type Size (TS) code is introduced in [10] for F-V length compression of the class of all stationary memoryless sources, in which sequences are encoded in increasing order of type class size. An appealing aspect of this approach is the freedom in defining types. In fact, for F-V length coding, any universal one-to-one compression algorithm can be viewed as a TS code with a proper characterization of types [11]. In [8], we considered universal F-V length source coding of parametric sources and showed that the TS code using quantized types achieves the optimal coding rate for F-V length compression of the exponential family of distributions.

In this work, we provide a performance guarantee for V-F length compression of the exponential family using our proposed Type Complexity code. We upper bound the ε-coding rate of the quantized type implementation of the Type Complexity code by

(1)

where H and V are the entropy and the varentropy of the underlying source, respectively, M is the pre-specified dictionary size, Q(·) is the tail of the standard normal distribution, and d is the dimension of the model class. We then provide a converse result showing that this rate is optimal up to the third-order term. Our converse proof relies on the construction of an F-V length code from a V-F length code presented in [7], along with a converse result for F-V length prefix codes [12].

Comparing the third-order term in (1) with Rissanen’s [13] redundancy for F-V length codes, in which the fixed number of codewords of the F-V length code plays the role of the fixed number of segments of the V-F length code, we observe that for binary memoryless sources the optimal V-F length code provides better convergence of the model cost term than the F-V length codes, while for other model classes the ordering can reverse and the optimal F-V length code outperforms the V-F length codes from the perspective of model cost. On the other hand, comparing the dispersion term in (1) with the dispersion of the optimal F-V length code [8], we observe that the optimal V-F length code provides a better dispersion term for binary memoryless sources, while for other sources the optimal F-V length code provides better dispersion.

The rest of the paper is organized as follows. In Sec. 2, we introduce the exponential family, V-F length coding, and related definitions. In Sec. 3, we reproduce the characterization of quantized types from [8]. The Type Complexity code is presented in Sec. 4. The main result of the paper is stated in Sec. 5. We present preliminary results in Sec. 6. The achievability and converse results are proved in Secs. 7 and 8, respectively. We conclude in Sec. 9.

2 Problem Statement

Let Θ be a compact subset of the d-dimensional Euclidean space. Probability distributions in an exponential family can be expressed in the form

p_θ(a) = exp( ⟨θ, τ(a)⟩ ) / Z(θ),   a ∈ 𝒳,   (2)

where θ is the d-dimensional parameter vector, τ(·) is the vector of sufficient statistics, and Z(θ) is the normalizing factor. Let the model class be the exponential family of distributions over the finite alphabet 𝒳, parameterized by θ ∈ Θ, where d is the number of degrees of freedom in the minimal description of the family, in the sense that no smaller-dimensional family can capture the same model class. The degrees of freedom turn out to characterize the richness of the model class in our context. Compactness of Θ implies the existence of uniform bounds on the probabilities, i.e.

0 < p_min ≤ p_θ(a) ≤ p_max < 1   for all a ∈ 𝒳 and θ ∈ Θ.   (3)

Let X₁, X₂, … be the infinite-length sequence drawn from the (unknown) true model. From (2), the probability of a length-n sequence x drawn from a model in the exponential family takes the form [14]

p_θ(x) = exp( n ⟨θ, τ(x)⟩ ) / Z(θ)^n,   (4)

where

τ(x) = (1/n) ∑_{i=1}^n τ(x_i)   (5)

is a minimal sufficient statistic [14]. Note that τ(·) applied to a single letter and τ(·) applied to a sequence are distinguished based upon their arguments. We denote the probability, expectation, and variance with respect to the true model by P, E, and Var, respectively. We denote the set of all finite-length sequences over 𝒳 as 𝒳*, and we write x for a generic source sequence of unspecified length. We write xy for the concatenation of x and y. All logarithms are in base 2. For a set 𝒜, |𝒜| denotes its size. Instead of introducing a different index for every new constant, the same letter may be used to denote different constants whose precise values are irrelevant.

A V-F length code consists of a parsing dictionary D of a pre-specified size M, which is used to parse the source sequence. Elements of the dictionary (segments) may have different lengths. Once a segment is identified as a parsed sequence, it is encoded to its lexicographical index within D using log M bits. As it does not hurt our analysis, we ignore rounding log M to its closest integer.

We assume D is complete, i.e. any infinite-length sequence over 𝒳 has a prefix in D. In addition, we assume D is proper, i.e. no segment is a prefix of another segment. Completeness along with properness of D implies that any long enough sequence has a unique prefix in the dictionary. Every complete and proper dictionary can be represented by a rooted complete |𝒳|-ary tree in which every internal node has |𝒳| child nodes. Label the edges branching out of an internal node with the distinct letters from 𝒳; each node then corresponds to the sequence of edge labels from the root to that node. Internal nodes of the tree thus correspond to prefixes of the segments, while leaf nodes correspond to the segments.
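As a small illustration of these two structural properties (not part of the paper's development), the following sketch checks properness and completeness of a dictionary given as a set of strings; the representation of segments as Python strings is an assumption of the example.

```python
def is_proper(dictionary):
    """Proper: no segment is a prefix of another segment. In lexicographic
    order, it suffices to compare adjacent entries."""
    segs = sorted(dictionary)
    return all(not b.startswith(a) for a, b in zip(segs, segs[1:]))

def is_complete(dictionary, alphabet):
    """Complete: every internal node of the parsing tree has |X| children,
    i.e. every proper prefix of a segment extends by every letter to a
    segment or to another proper prefix."""
    prefixes = {seg[:i] for seg in dictionary for i in range(len(seg))}
    return all(p + a in dictionary or p + a in prefixes
               for p in prefixes for a in alphabet)

# The binary dictionary {1, 01, 001, 000} is both complete and proper.
D = {"1", "01", "001", "000"}
print(is_proper(D), is_complete(D, "01"))
```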

Let D be the dictionary of a V-F length code. Let Y be the random first parsed segment of the source output X₁, X₂, …, using the dictionary D, and let L be the length of Y. We adopt a one-shot setting and denote

(6)

We gauge the performance of a V-F length code with a dictionary of size M through the ε-coding rate, given by

(7)

Our goal is to analyze the behavior of the ε-coding rate for large enough dictionary size M.
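Definition (7) did not survive extraction; under the common convention that the rate of a parse is log M divided by the first segment length L and that the overflow probability is P(log M / L > R), the following sketch estimates the ε-coding rate of a given dictionary by Monte Carlo simulation. Both the convention and the helper names are our own assumptions.

```python
import math
import random

def first_segment_length(dictionary, sample_letter):
    """Grow a prefix letter by letter until it hits the (complete, proper)
    dictionary; return the length of that first parsed segment."""
    prefix = ""
    while prefix not in dictionary:
        prefix += sample_letter()
    return len(prefix)

def epsilon_coding_rate(dictionary, probs, eps, trials=10000, seed=0):
    """Empirical epsilon-coding rate: smallest rate R such that the estimated
    overflow probability P(log2(M) / L > R) is at most eps."""
    rng = random.Random(seed)
    letters, weights = zip(*probs.items())
    sample = lambda: rng.choices(letters, weights)[0]
    log_M = math.log2(len(dictionary))
    rates = sorted(log_M / first_segment_length(dictionary, sample)
                   for _ in range(trials))
    # The (1 - eps)-quantile of the per-parse rate.
    idx = min(int(math.ceil((1 - eps) * trials)), trials - 1)
    return rates[idx]

# Example with a toy complete and proper binary dictionary of size M = 4.
D = {"1", "01", "001", "000"}
print(epsilon_coding_rate(D, {"0": 0.7, "1": 0.3}, eps=0.1))
```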

Remark 1.

Optimizing the ε-coding rate provides more refined results than optimizing the expected segment length; the latter is done in e.g. [5].

3 Quantized Types

We have previously introduced quantized types [8, 9], the optimal characterization of type classes for universal F-V length compression of the exponential family. (Optimality here is in the sense that the quantized type class implementation of the TS code achieves the minimum third-order coding rate.) In this section, we briefly review this characterization. In order to define the quantized type class of a sequence x, we cover the convex hull of the set of minimal sufficient statistics with d-dimensional cubic grids (cuboids) whose side length is specified by a fixed constant. The union of such disjoint cuboids should cover the convex hull. The position of these cuboids is arbitrary; however, once we cover the space, the covering is fixed throughout. We represent each d-dimensional cuboid by its geometric center, and denote by B(c) the cuboid with center c. More precisely,

(8)

where the subscript i denotes the i-th component of a d-dimensional vector. Let c(x) be the center of the cuboid that contains τ(x).

We then define the quantized type class of x as

(9)

the set of all sequences whose minimal sufficient statistic belongs to the very same cuboid containing the minimal sufficient statistic of x (see Figure 1). We write 𝒬_n for the set of all quantized type classes for sequences of length n.

Figure 1: Quantized Types
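To make the construction concrete, here is a minimal sketch, in our own notation, of computing the quantized type of a sequence: the empirical average of the sufficient-statistic vectors is mapped to the center of the cuboid containing it, and two sequences share a quantized type class iff they share this center and length. The sufficient-statistic map tau, the side length delta, and the anchoring of the grid at the origin are assumptions of the example.

```python
import numpy as np

def cuboid_center(x, tau, delta):
    """Quantized type of the sequence `x`: the center of the cuboid of side
    length `delta` containing the empirical minimal sufficient statistic
    (1/n) * sum_i tau(x_i). Cuboids are anchored at the origin here; the
    construction only requires the covering to be fixed once and for all."""
    stat = np.mean([tau(a) for a in x], axis=0)      # minimal sufficient statistic
    return tuple(np.floor(stat / delta) * delta + delta / 2)

def same_quantized_type(x1, x2, tau, delta):
    """Two equal-length sequences share a quantized type class iff their
    minimal sufficient statistics fall within the same cuboid."""
    return (len(x1) == len(x2)
            and cuboid_center(x1, tau, delta) == cuboid_center(x2, tau, delta))

# Example: a Bernoulli (d = 1) family with tau(a) = a.
tau = lambda a: np.array([float(a)])
print(cuboid_center([0, 1, 1, 0, 1], tau, delta=0.1))
print(same_quantized_type([0, 1, 1, 0, 1], [1, 1, 0, 1, 0], tau, delta=0.1))
```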

4 Type Complexity Code

In this section, we propose the Type Complexity (TC) code. Our designed dictionary consists of sequences at the boundary of the transition from low quantized type class size to high quantized type class size. More precisely, let the threshold be chosen as the largest positive constant such that the resulting dictionary has at most M segments; we characterize this choice precisely in Section 7.1. A sequence x is a segment in the dictionary of the TC code if and only if

(10)

where the quantized type class of x is as defined in (9) and the sequence appearing in the second condition is obtained from x by deleting its last letter.

By construction, it is clear that the dictionary is proper; furthermore, monotonicity of the quantized type class size in the sequence length implies completeness. Intuitively, sequences with large type class sizes contain more information, implying that the TC code compresses more information into a fixed budget of output bits, which is the promise of an optimal V-F length code.

We note that there is freedom in defining the type classes in (10). We show that the quantized type is the relevant characterization of type classes for optimal performance.
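Condition (10) did not survive extraction; the sketch below encodes only the verbal description above, under the assumption that a sequence becomes a segment exactly when the logarithm of its quantized type class size first reaches a threshold K while that of its one-letter-shorter prefix has not. Both the threshold K and the helper log_type_size (standing in for the log of the quantized type class size) are hypothetical.

```python
def is_segment(x, K, log_type_size):
    """TC dictionary membership test, per the verbal description of (10):
    `x` is a segment iff its quantized type class is already "large"
    (log size >= K) while that of its prefix x[:-1] is still "small"."""
    if not x:
        return False
    prev = log_type_size(x[:-1]) if len(x) > 1 else 0.0
    return log_type_size(x) >= K and prev < K

def parse_first_segment(stream, K, log_type_size):
    """Grow a prefix from the source `stream` until it satisfies the
    membership test; returns the first parsed segment."""
    prefix = []
    for letter in stream:
        prefix.append(letter)
        if is_segment(prefix, K, log_type_size):
            break
    return prefix
```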

5 Main Result

Let H and V be the entropy and the varentropy of the true model, respectively. The following theorem exactly characterizes the achievable ε-coding rates up to the third-order term, and asserts that this rate is achievable by the TC code using quantized types.

Theorem 1.

For any stationary memoryless exponential family of distributions parameterized by θ ∈ Θ,

(11)

where the infimum is achieved by the TC code using quantized types.

Example 1.

For the class of all binary memoryless sources, d = 1, and the third-order term in (11) matches the optimal redundancy in [5].

6 Preliminary Results

Define

(12)

Note that since the Hessian matrix of the log-normalizing factor is positive definite, the log-likelihood function is strictly concave and hence the maximum likelihood estimate is unique.

The following lemma, which is a direct consequence of [8, Lemmas 1 and 3], provides tight upper and lower bounds on the quantized type class size.

Lemma 1.

The size of the quantized type class of a sequence x is bounded as

(13)

where the constants are independent of x.

The type class size bounds in the previous lemma are springboards to the following upper bound on the lengths of the dictionary segments.

Corollary 1 (Segment Length).

There exists a positive constant such that, for any segment y in the TC dictionary, we have

(14)
Proof.

For any segment y in the dictionary, (10) and (13) yield a bound on the size of the quantized type class of y. Combined with the uniform probability bounds (3), which hold for all letters and all parameters, this yields a bound on the segment length. The corollary then follows. ∎

The following lemma shows that a single observation does not provide much information.

Lemma 2.

Fix a sequence x and a letter a ∈ 𝒳. There exists a constant such that

(15)
Proof.

We have

(16)
(17)
(18)

where (16) is from the definition (12), (17) exploits the fact that, for any two functions, the difference of their maxima is upper bounded by the maximum of their difference, and finally (18) follows from the uniform probability bounds (3) together with the fact that a continuous function over a compact domain is bounded. ∎

We appeal to the following normal approximation result from [15, 16], in order to bound the percentiles of the type class size in the achievability proof.

Lemma 3 (Asymptotic Normality of Information).

[15, 16] Fix a positive constant. For a stationary memoryless source, there exists a finite positive constant such that, for all sequence lengths n and all deviations within the allowed range,

(19)

where H and V are the entropy and the varentropy of the true model, respectively.
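As a purely illustrative check of the normal approximation in Lemma 3 (not of its constants), the following sketch compares a Monte Carlo estimate of the upper tail of the information of an i.i.d. source against the standard normal tail Q(z); the particular source used is an assumption of the example.

```python
import numpy as np
from math import erf, sqrt

def normal_tail(z):
    """Q(z): tail of the standard normal distribution."""
    return 0.5 * (1.0 - erf(z / sqrt(2.0)))

def empirical_info_tail(p, n, z, trials=20000, seed=0):
    """Monte Carlo estimate of P[ -log2 p(X^n) >= n*H + z*sqrt(n*V) ] for an
    i.i.d. source with letter probabilities `p` (a numpy array)."""
    rng = np.random.default_rng(seed)
    info_letter = -np.log2(p)                    # per-letter information
    H = float(p @ info_letter)                   # entropy
    V = float(p @ (info_letter - H) ** 2)        # varentropy
    samples = rng.choice(len(p), size=(trials, n), p=p)
    info = info_letter[samples].sum(axis=1)      # -log2 p(X^n), one value per trial
    return float(np.mean(info >= n * H + z * np.sqrt(n * V)))

# A ternary source as a toy example; the estimate should be close to Q(z).
p = np.array([0.7, 0.2, 0.1])
print(empirical_info_tail(p, n=200, z=1.0), "vs Q(1) =", normal_tail(1.0))
```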

7 Achievability

7.1 Threshold Design

Setting a high threshold value in (10) results in compressing more information into a fixed budget of output bits. On the other hand, in order to keep the dictionary size below the pre-specified size M, the threshold cannot be set too high. In this subsection, we characterize the largest threshold value for which the resulting dictionary size is below M.

Let N_ℓ be the number of dictionary segments of length ℓ. For any segment, the membership condition (10) must certainly hold, and by Corollary 1 its length is bounded. Let

(20)

Motivated by [7, Eq. 3.12], we upper bound N_ℓ as follows:

(21)

We show in Appendix A that the quantity appearing in (21) is suitably bounded. Hence

(22)

We then upper bound the dictionary size as follows:

(23)
(24)
(25)

where (23) is from (14), (24) follows from (22), and (25) is a consequence of upper bounding the summation by an integral, with a generic constant whose precise value is irrelevant. Finally, to ensure that the dictionary of the quantized Type Complexity code (10) does not contain more than M segments, it suffices to set the threshold such that

(26)

One can show that there exists a positive constant such that the following choice of the threshold satisfies (26) and, moreover, its two leading terms are the largest possible:

(27)
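Since the closed form (27) is not recoverable here, the following sketch only illustrates the operational content of the threshold design: find the largest K whose induced dictionary (as in the membership-test sketch of Section 4) has at most M segments. The helper log_type_size and the depth cap are hypothetical, and the binary search assumes the dictionary size is nondecreasing in K.

```python
def dictionary_size(alphabet, K, log_type_size, max_len=64):
    """Count the segments of the TC dictionary induced by threshold K by a
    depth-first walk of the |X|-ary parsing tree. `log_type_size` is the
    hypothetical helper returning log2 of the quantized type class size."""
    count, stack = 0, [[]]
    while stack:
        prefix = stack.pop()
        if prefix and log_type_size(prefix) >= K:      # crossed the boundary: a segment
            count += 1
        elif len(prefix) < max_len:                    # still internal: extend by every letter
            stack.extend(prefix + [a] for a in alphabet)
    return count

def largest_threshold(alphabet, M, log_type_size, lo=0.0, hi=64.0, iters=40):
    """Binary search for (approximately) the largest K whose dictionary has
    at most M segments, assuming monotonicity of the size in K."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if dictionary_size(alphabet, mid, log_type_size) <= M:
            lo = mid
        else:
            hi = mid
    return lo
```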

7.2 Coding Rate Analysis

In this subsection, we derive an upper bound on the ε-coding rate of the quantized type implementation of the TC code. To this end, we upper bound the overflow probability as follows:

(28)
(29)
(30)
(31)
(32)

where (28) is from the membership condition (10) for a segment to be in the dictionary, (29) holds for a prefix of the segment, where we furthermore assume integrality of the relevant length, (30) is from the quantized type class size bound in Lemma 1, (31) is from the threshold choice (27), and finally (32) is an application of Lemma 3. In Appendix B, we show that for the rate specified below, (32), and consequently the overflow probability, falls below ε:

(33)

By the definition of the ε-coding rate, it is therefore upper bounded by the rate in (33). This completes the achievability proof.

8 Converse

We first introduce notation relevant to F-V length codes. Recall that any F-V length prefix code is a mapping from a set of words, the set of all sequences of a fixed input length over the alphabet 𝒳, to variable-length binary sequences. For an infinite-length sequence emitted from the source, we adopt a one-shot setting and let

(34)

where the argument is the prefix of the source sequence within the set of words. For simplicity, we suppress this dependence in the notation below.

Let φ be an arbitrary V-F length code with M dictionary segments and length function defined as in (6). Let R be any achievable ε-coding rate for φ. We show that

(35)

Assume the relevant quantities are integers; this assumption does not hurt the generality of our result. It is shown in [7] that for any V-F length code with M dictionary segments and a given length function, one can construct an F-V length prefix code whose words all share a common fixed input length, with a length function such that the overflow event for the F-V code is equivalent to that for the V-F code. Their construction goes as follows (a code sketch is given after the list):

  • Step 1: Consider the complete |𝒳|-ary tree with M leaves corresponding to the complete and proper V-F length code. All dictionary segments longer than the chosen fixed input length are shortened to that length by pruning all subtrees rooted at that depth. Therefore, all leaves (i.e. segments) of the modified tree have length at most the fixed input length, and moreover the probability of the modified tree is equal to that of the original tree.

  • Step 2: Every segment of the modified tree shorter than the fixed input length is extended to that length by all possible suffixes, and accordingly, the log M-bit codeword for this segment is also extended by all possible binary suffixes of the matching length. This results in an F-V length code with the desired fixed input length and a length function satisfying the required properties.
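A minimal sketch of this prune-and-extend construction on an explicit word list is below; the fixed input length n, the string representation of segments, and the bookkeeping of codeword lengths only (rather than the codewords themselves) are assumptions of the example.

```python
import math
from itertools import product

def vf_to_fv(dictionary, alphabet, n):
    """Turn a complete, proper V-F dictionary (set of strings) into an F-V
    code over all words of length n, following the two steps above.
    Returns a dict mapping each length-n word to its codeword length in bits."""
    log_M = math.log2(len(dictionary))
    lengths = {}
    for seg in dictionary:
        if len(seg) >= n:
            lengths.setdefault(seg[:n], log_M)            # Step 1: prune to depth n
        else:
            pad = n - len(seg)                            # Step 2: extend by all suffixes
            for suffix in product(alphabet, repeat=pad):
                word = seg + "".join(suffix)
                lengths[word] = log_M + pad * math.log2(len(alphabet))
    return lengths

# Example: flatten the toy binary dictionary {1, 01, 001, 000} to input length n = 3.
print(vf_to_fv({"1", "01", "001", "000"}, "01", n=3))
```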

Therefore we have

(36)

Since R is ε-achievable for the V-F code φ, the overflow probability is at most ε, and hence (36) implies

(37)

Define the ε-coding rate of the F-V length code as in [12, Eq. 9].

Noting the fixed input length of the constructed F-V code, (37) implies

(38)

The converse for fixed-to-variable length prefix codes [12, Theorem 15] in turn implies the following. (The result in [12] is stated for the class of all memoryless sources; however, adapting its proof to the exponential family is straightforward.)

(39)

Combining (38,39) yields

(40)

for some constant. Through an iterative approach similar to that of Appendix B, one can show that (40) leads to (35).

9 Conclusion

We derived the fundamental limits of universal variable-to-fixed length coding of d-dimensional exponential families of distributions in the fine asymptotic regime, where the law of large numbers may not hold. We proposed the Type Complexity code and showed that its quantized type implementation achieves the optimal third-order coding rate. Studying the behavior of non-proper codes is an interesting future direction.

References

  • [1] B. P. Tunstall, Synthesis of Noiseless Compression Codes. Ph.D. dissertation, Georgia Inst. of Technol., Atlanta, GA, 1967.
  • [2] M. Drmota, Y. A. Reznik, and W. Szpankowski, “Tunstall code, Khodak variations, and random walks,” IEEE Transactions on Information Theory, vol. 56, no. 6, pp. 2928–2937, June 2010.
  • [3] R. Krichevsky and V. Trofimov, “The performance of universal encoding,” IEEE Transactions on Information Theory, vol. 27, pp. 199–207, 1981.
  • [4] J. Lawrence, “A new universal coding scheme for the binary memoryless source,” IEEE Transactions on Information Theory, vol. 23, pp. 466–472, 1977.
  • [5] T. Tjalkens and F. Willems, “A universal variable-to-fixed length source code based on Lawrence’s algorithm,” IEEE Transactions on Information Theory, vol. 38, pp. 247–253, 1992.
  • [6] K. Visweswariah, S. R. Kulkarni, and S. Verdú, “Universal variable-to-fixed length source codes,” IEEE Transactions on Information Theory, vol. 47, pp. 1461–1472, 2001.
  • [7] N. Merhav and D. L. Neuhoff, “Variable-to-fixed length codes provide better large deviations performance than fixed-to-variable length codes,” IEEE Transactions on Information Theory, vol. 38, no. 1, pp. 135–140, 1992.
  • [8] N. Iri and O. Kosut, “Fine asymptotics for universal one-to-one compression of parametric sources,” arXiv preprint arXiv:1612.06448, 2016.
  • [9] ——, “A new type size code for universal one-to-one compression of parametric sources,” in 2016 IEEE International Symposium on Information Theory (ISIT), 2016, pp. 1227–1231.
  • [10] O. Kosut and L. Sankar, “Universal fixed-to-variable source coding in the finite blocklength regime,” in 2013 IEEE International Symposium on Information Theory (ISIT), 2013, pp. 649–653.
  • [11] N. Iri and O. Kosut, “Universal coding with point type classes,” in 51st Annual Conference on Information Sciences and Systems (CISS), March 2017.
  • [12] O. Kosut and L. Sankar, “Asymptotics and non-asymptotics for universal fixed-to-variable source coding,” IEEE Transactions on Information Theory, vol. 63, no. 6, pp. 3757–3772, June 2017.
  • [13] J. Rissanen, “Universal coding, information, prediction, and estimation,” IEEE Transactions on Information Theory, vol. 30, no. 4, pp. 629–636, July 1984.
  • [14] N. Merhav and M. Weinberger, “On universal simulation of information sources using training data,” IEEE Transactions on Information Theory, vol. 50, no. 1, pp. 5–20, January 2004.
  • [15] I. Kontoyiannis and S. Verdú, “Optimal lossless data compression: Non-asymptotics and asymptotics,” IEEE Transactions on Information Theory, vol. 60, no. 2, pp. 777–795, February 2014.
  • [16] S. Saito, N. Miya, and T. Matsushima, “Evaluation of the minimum overflow threshold of Bayes codes for a Markov source,” in 2014 International Symposium on Information Theory and its Applications (ISITA), 2014, pp. 211–215.

Appendix A Proof of the Bound Used in Section 7.1

The type class size bounds in Lemma 1 imply the following subset relationships

(41)

and

(42)

Hence Lemma 2, along with (41), (42), and the definition in (20), implies

On the other hand it is shown in [8, Eq. 32] that

This completes the proof.

Appendix B Achievable -coding Rate

In order for (32) to be less than or equal to ε, it must hold that

Recalling the designed value of the threshold in (27), along with a Taylor expansion, yields

(43)

for some constant. Define the rate of interest as the largest value satisfying (43). We then solve (43) iteratively. For large enough dictionaries, a first-order approximation of the solution can be obtained. Substituting this approximation into (43) and cancelling the leading term from the left and right sides of (43), one obtains a second-order refinement, where a further Taylor expansion is employed. Finally, substituting the refined value into (43) and cancelling the first- and second-order terms from the left and right sides of (43), one obtains the third-order term.
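Since (43) itself is not recoverable here, the sketch below only illustrates the successive-substitution technique described above on a hypothetical equation of a similar flavor, r = a + b / log(L·r): each refinement substitutes the previous approximation into the right-hand side and, in an analytical treatment, the matching terms would then be cancelled.

```python
import math

def successive_substitution(f, r0, iters=3):
    """Solve r = f(r) by repeatedly substituting the latest approximation
    into the right-hand side, mirroring the iterative argument above."""
    approximations = [r0]
    for _ in range(iters):
        approximations.append(f(approximations[-1]))
    return approximations

# Hypothetical stand-in for (43): r = a + b / log(L * r), with L large.
a, b, L = 1.0, 0.5, 1e6
f = lambda r: a + b / math.log(L * r)
print(successive_substitution(f, r0=a))    # first-, second-, third-order refinements
```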
