Fundamental Limits of Universal Variable-to-Fixed Length Coding of Parametric Sources
This research was funded in part by the NSF under grant No. CCF-1422358.
Universal variable-to-fixed (V-F) length coding of the $d$-dimensional exponential family of distributions is considered. We propose an achievable scheme consisting of a dictionary, used to parse the source output stream, that makes use of the previously introduced notion of quantized types. The quantized type class of a sequence is based on partitioning the space of minimal sufficient statistics into cuboids. Our proposed dictionary consists of sequences in the boundaries of transition from low to high quantized type class size. We derive the asymptotics of the $\epsilon$-coding rate of our coding scheme for large enough dictionaries. In particular, we show that the third-order coding rate of our scheme is $H\frac{d}{2}\frac{\log\log M}{\log M}$, where $H$ is the entropy of the source and $M$ is the dictionary size. We further provide a converse, showing that this rate is optimal up to the third-order term.
A variable-to-fixed (V-F) length code consists of a dictionary of pre-specified size. Elements of the dictionary (segments) are used to parse the infinite sequence emitted from the source. Segments may have variable length; however, each is encoded to the fixed-length binary representation of its index within the dictionary. To be able to uniquely parse any infinite-length sequence into segments, we assume the dictionary to be complete (i.e., every infinite-length sequence has a prefix within the dictionary) and proper (i.e., no segment is a prefix of another segment). The underlying source model induces a distribution on the segment lengths, and this distribution reflects the quality of the dictionary for the compression task.
For a given memoryless source, Tunstall [1] provided an average-case optimal algorithm that maximizes the average segment length. A central limit theorem for the code length of the Tunstall algorithm has been derived in [2]. In most applications, however, the statistics of the source are unknown or arduous to estimate, especially at short blocklengths, where there are limited samples for the inference task. In universal source coding, the underlying distribution in force is unknown, yet belongs to a known collection of distributions. Universal V-F length codes are studied in, e.g., [3, 4, 5, 6]. Upper and lower bounds on the redundancy of a universal code for the class of all memoryless sources are derived in [3]. Universal V-F length coding of the class of all binary memoryless sources is then considered in [4, 5], where [5] provides an algorithm that is asymptotically optimal in the average sense. (Throughout, “optimality” of an algorithm is considered only up to the model cost term, i.e., the term reflecting the price of universality, in the coding rate. The model cost term is the second-order term in the average-case analysis, while it is the third-order term in the probabilistic analysis.) Later, the optimal redundancy for V-F length compression of the class of Markov sources is derived in [6]. The performance of V-F length codes and fixed-to-variable (F-V) length codes for compression of the class of Markov sources is compared in [7], where a dictionary construction that asymptotically achieves the optimal error exponent is proposed.
All previous works consider model classes that include all distributions within a simplex; universal V-F length coding for more structured model classes has not been considered in the literature. Apart from extending the topological complexity of the model class, we further adopt more general performance metrics. Delay-sensitive modern applications impose new requirements on the performance of compression schemes, so it is vital to characterize the overhead associated with operation in the non-asymptotic regime. In probing the non-asymptotics, incurring “errors” is inevitable. Therefore, we depart from the classical average-case (redundancy) and worst-case (regret) analyses in favor of the modern probabilistic analysis, where the figure of merit in our setup is the $\epsilon$-coding rate: the minimum rate such that the corresponding overflow probability is less than $\epsilon$. Our goal is to analyze the asymptotics of the $\epsilon$-coding rate as the size of the dictionary increases. We provide an achievable scheme for compressing the $d$-dimensional exponential family of distributions as the parametric model class. Moreover, we provide a converse result, showing that our proposed scheme is optimal up to the third-order $\epsilon$-coding rate.
In previous universal V-F length codes, one can define a notion of complexity for sequences. In [3, 4, 5, 6], a sequence with high complexity has low probability under a certain composite or mixture source, whereas in [7], high complexity sequences have high scaled (by sequence length) empirical entropy. The dictionary of such algorithms then consists of sequences in the boundaries of transition from low complexity to high complexity. We follow a similar complexity theme to design the dictionary. The sequence complexity in our proposed algorithm is characterized by the size of the sequence’s type class, hence we name our scheme the Type Complexity (TC) code. Scaled empirical entropy [7] is ignorant of the underlying structure of the parametric class. Therefore, in order to fully exploit the inherent structure of the model class, we characterize type classes based on quantized types, which we introduced in [8, 9] in studying F-V length compression. We partition the space of minimal sufficient statistics into cuboids, and define two sequences to be in the same quantized type class if and only if their minimal sufficient statistics fall within the same cuboid.
The type class approach has been taken before for the compression problem in [10]. The Type Size code (TS code) is introduced in [10] for F-V length compression of the class of all stationary memoryless sources, in which sequences are encoded in increasing order of type class sizes. The appealing aspect of this approach is the freedom in defining types. In fact, for F-V length coding, any universal one-to-one compression algorithm can be considered as a TS code with a proper characterization of types [11]. In [8], we considered universal F-V length source coding of parametric sources, and showed that the TS code using quantized types achieves the optimal coding rate for F-V length compression of the exponential family of distributions.
In this work, we provide a performance guarantee for V-F length compression of the exponential family using our proposed Type Complexity code. We upper bound the $\epsilon$-coding rate of the quantized type implementation of the Type Complexity code by

$$H + \sigma\sqrt{\frac{H}{\log M}}\,Q^{-1}(\epsilon) + H\,\frac{d}{2}\,\frac{\log\log M}{\log M} + O\!\left(\frac{1}{\log M}\right), \qquad (1)$$

where $H$ and $\sigma^2$ are the entropy and the varentropy of the underlying source, respectively, $M$ is the pre-specified dictionary size, $Q(\cdot)$ is the tail of the standard normal distribution, and $d$ is the dimension of the model class. We then provide a converse result showing that this rate is optimal up to the third-order term. Our converse proof relies on the construction of an F-V length code from a V-F length code presented in [7], along with a converse result for F-V length prefix codes [12].
Comparing the third-order term in (1) with Rissanen’s redundancy $\frac{d}{2}\frac{\log n}{n}$ [13] for F-V length codes, where the fixed number of codewords in the F-V length code (determined by the input length $n$) plays the role of $M$ (the fixed number of segments in the V-F length code), we observe that for binary memoryless sources, the optimal V-F length code provides better convergence of the model cost term than F-V length codes, while for sources with $H > 1$, the optimal F-V length code outperforms V-F length codes in terms of model cost. Similarly, comparing the dispersion term in (1) with the dispersion of the optimal F-V length code [12], which is $\frac{\sigma}{\sqrt{n}}Q^{-1}(\epsilon)$, we observe that the optimal V-F length code provides better dispersion for binary memoryless sources, while for sources with $H > 1$, the optimal F-V length code provides better dispersion.
The rest of the paper is organized as follows. In Sec. 2, we introduce the exponential family, V-F length coding, and related definitions. In Sec. 3, we review the characterization of quantized types introduced in [8, 9]. The Type Complexity code is presented in Sec. 4. The main result of the paper is stated in Sec. 5. We present preliminary results in Sec. 6. The achievability and converse results are proved in Secs. 7 and 8, respectively. We conclude in Sec. 9.
2 Problem Statement
Let be a compact subset of . Probability distributions in an exponential family can be expressed in the form
where is the -dimensional parameter vector, is the vector of sufficient statistics and is the normalizing factor. Let the model class , be the exponential family of distributions over the finite alphabet , parameterized by , where is the degrees of freedom in the minimal description of , in the sense that no smaller dimensional family can capture the same model class. The degrees of freedom turns out to characterize the richness of the model class in our context. Compactness of implies existence of uniform bounds on the probabilities, i.e.
is a minimal sufficient statistic . Note that and are distinguished based upon their arguments. We denote , and as the probability, expectation and variance with respect to , respectively. We denote the set of all finite length sequences over as . We denote the generic source sequence of unspecified length as . Let be the concatenation of and . All logarithms are in base 2. For a set , denotes its size. Instead of introducing different indices for every new constant , the same letter may be used to denote different constants whose precise values are irrelevant.
A V-F length code consists of a parsing dictionary of a pre-specified size $M$, which is used to parse the source sequence. Elements of the dictionary (segments) may have different lengths. Once a segment is identified as a parsed sequence, it is encoded to its lexicographical index within the dictionary using $\log M$ bits. As it does not affect our analysis, we ignore rounding $\log M$ to the closest integer.
We assume the dictionary is complete, i.e., any infinite-length sequence over the alphabet has a prefix in the dictionary. In addition, we assume the dictionary is proper, i.e., there are no two segments where one is a prefix of the other. Completeness together with properness implies that any long enough sequence has a unique prefix in the dictionary. Every complete and proper dictionary can be represented by a rooted complete tree in which every internal node has exactly one child per letter of the alphabet. Let us label the edges branching out of an internal node with the distinct letters of the alphabet. Each node then corresponds to the sequence of edge labels on the path from the root to that node: internal nodes correspond to proper prefixes of the segments, while leaf nodes correspond to the segments themselves.
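As an illustration, both dictionary properties can be checked mechanically on a toy example. The following minimal Python sketch is not part of the original development, and its function names are hypothetical. It relies on the standard fact that, for a finite proper (prefix-free) dictionary over an alphabet of size $k$, completeness is equivalent to equality in the Kraft sum $\sum_s k^{-|s|} = 1$.

```python
from fractions import Fraction

def is_proper(dictionary):
    """True if no segment is a prefix of another segment."""
    segs = sorted(dictionary)
    # In sorted order, a segment is immediately followed by one of its
    # extensions whenever such an extension exists.
    return all(not b.startswith(a) for a, b in zip(segs, segs[1:]))

def is_complete(dictionary, alphabet):
    """Kraft equality; equivalent to completeness for a proper dictionary."""
    k = len(alphabet)
    return sum(Fraction(1, k ** len(s)) for s in dictionary) == 1

def parse(stream, dictionary):
    """Parse `stream` into segment indices; returns (indices, unparsed tail)."""
    index = {s: i for i, s in enumerate(sorted(dictionary))}
    max_len = max(map(len, dictionary))
    out, pos = [], 0
    while pos < len(stream):
        for n in range(1, max_len + 1):
            if stream[pos:pos + n] in index:
                out.append(index[stream[pos:pos + n]])
                pos += n
                break
        else:
            break  # the remaining tail is not (yet) a full segment
    return out, stream[pos:]

toy = {"0", "10", "11"}
print(is_proper(toy), is_complete(toy, "01"))
print(parse("0100111", toy))
```

Each parsed index would then be written with $\log M$ bits, here with $M = 3$ segments.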
Let be the dictionary of a V-F length code . Let be the random first parsed segment of the source output , using the dictionary . Let be the length of . We adopt a one-shot setting and denote
We gauge the performance of a V-F length code with a dictionary of size $M$ through the $\epsilon$-coding rate, given by
Our goal is to analyze the behavior of the $\epsilon$-coding rate for large enough dictionary size $M$.
Optimizing the -coding rate provides more refined results than optimizing . The latter is done in e.g. .
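To make the figure of merit concrete, the sketch below evaluates an $\epsilon$-coding rate exactly for a small dictionary and a known memoryless source. It is only an illustration under stated assumptions: the overflow event is taken to be $\{\log M / \ell > R\}$, where $\ell$ is the length of the first parsed segment, the overflow probability is required to be at most $\epsilon$, and the function name `epsilon_coding_rate` is hypothetical.

```python
from math import log2, prod

def epsilon_coding_rate(dictionary, probs, eps):
    """Smallest R with P(log2(M) / length > R) <= eps (assumed overflow event)."""
    M = len(dictionary)
    # Distribution of the first parsed segment's length under the i.i.d. source.
    length_prob = {}
    for seg in dictionary:
        p = prod(probs[a] for a in seg)
        length_prob[len(seg)] = length_prob.get(len(seg), 0.0) + p
    # The rate log2(M)/L overflows exactly when the segment is shorter than L,
    # so we look for the largest length L with P(length < L) <= eps.
    best, below = None, 0.0
    for L in sorted(length_prob):
        if below <= eps:
            best = L
        below += length_prob[L]
    return log2(M) / best

dictionary = ["0", "10", "11"]
probs = {"0": 0.5, "1": 0.5}
print(epsilon_coding_rate(dictionary, probs, 0.1))
print(epsilon_coding_rate(dictionary, probs, 0.5))
```

A larger tolerated overflow probability allows the rate to be set by longer segments, so the returned rate is non-increasing in `eps`.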
3 Quantized Types
We have previously introduced quantized types [8, 9], the optimal characterization of type classes for the universal F-V length compression of the exponential family; optimality here means that the quantized type class implementation of the TS code achieves the minimum third-order coding rate. In this section, we briefly review this characterization. In order to define the quantized type class of a sequence , we cover the convex hull of the set of minimal sufficient statistics , with $d$-dimensional cubic grids (cuboids) of side length , where is a constant. The union of such disjoint cuboids should cover . The position of these cuboids is arbitrary; however, once we cover the space, the covering is fixed throughout. We represent each $d$-dimensional cuboid by its geometrical center. Denote as the cuboid with center . More precisely
where is the -th component of the -dimensional vector . Let be the center of the cuboid that contains .
We then define the quantized type class of as
the set of all sequences with minimal sufficient statistic belonging to the very same cuboid containing the minimal sufficient statistic of (See Figure 1). We denote as the set of all quantized type classes for sequences of length .
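The grouping of sequences into quantized type classes can be sketched on a toy model. The example below is an illustration only and makes several assumptions not taken from the text: it uses the class of all memoryless sources over a small alphabet (an exponential family whose minimal sufficient statistic is the vector of empirical frequencies of all letters but the last), half-open cuboids aligned to the origin, and a free parameter `side` for the cuboid side length. All function names are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product
from math import floor

def sufficient_statistic(seq, alphabet):
    """Empirical frequencies of all letters but the last (d = len(alphabet) - 1)."""
    n = len(seq)
    return tuple(Fraction(seq.count(a), n) for a in alphabet[:-1])

def cuboid_center(stat, side):
    """Center of the half-open cuboid of side length `side` containing `stat`."""
    return tuple((floor(t / side) + Fraction(1, 2)) * side for t in stat)

def quantized_type_classes(n, alphabet, side):
    """Group all length-n sequences by the cuboid containing their statistic."""
    classes = defaultdict(list)
    for letters in product(alphabet, repeat=n):
        seq = "".join(letters)
        center = cuboid_center(sufficient_statistic(seq, alphabet), side)
        classes[center].append(seq)
    return dict(classes)

classes = quantized_type_classes(4, "ab", Fraction(1, 2))
for center, members in sorted(classes.items()):
    print(center, len(members))
```

Exact rational arithmetic is used so that statistics lying on a cuboid boundary are assigned deterministically.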
4 Type Complexity Code
In this section, we propose the Type Complexity (TC) code. Our designed dictionary consists of sequences in the boundaries of transition from low quantized type class size to high quantized type class size. More precisely, let the threshold be chosen as the largest positive constant such that the resulting dictionary has at most $M$ segments; we characterize this precisely in Section 7.1. A sequence is a segment in the dictionary of the TC code if and only if
where is the quantized type class of as defined in (9) and is obtained from by deleting the last letter.
From construction, it is clear that is proper, and furthermore monotonicity of in implies completeness of . Intuitively, sequences with large type class sizes contain more information, implying that the TC code compresses more information into a fixed budget of output bits, which is the promise of the optimal V-F length code.
We note that there is a freedom in defining type classes in (10). We show that the quantized type is the relevant characterization of type classes for the optimal performance.
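The boundary-of-transition idea can be sketched with a simplified stand-in for the construction. The example below is not the quantized-type dictionary of this section: it assumes exact (unquantized) types over a finite alphabet, for which the type class size is a multinomial coefficient that never decreases when a letter is appended, and it declares a sequence a segment the first time its type class size reaches a threshold. Because constant sequences have type class size 1 under exact types, a depth cap `max_len` is added as a further assumption to guarantee termination. All names are hypothetical.

```python
from collections import Counter
from math import factorial

def type_class_size(seq):
    """Number of sequences with the same letter counts (multinomial coefficient)."""
    size = factorial(len(seq))
    for count in Counter(seq).values():
        size //= factorial(count)
    return size

def tc_dictionary(alphabet, threshold, max_len):
    """Leaves are the shortest sequences whose type class size reaches
    `threshold`; sequences of length `max_len` are forced to be leaves."""
    segments, frontier = [], [""]
    while frontier:
        node = frontier.pop()
        for a in alphabet:
            child = node + a
            if type_class_size(child) >= threshold or len(child) == max_len:
                segments.append(child)   # transition point: becomes a segment
            else:
                frontier.append(child)   # still low complexity: keep growing
    return sorted(segments)

d = tc_dictionary("ab", threshold=3, max_len=6)
print(len(d), d)
```

Since every internal node is expanded by all letters and every leaf stops the growth, the resulting dictionary is proper and complete by construction; raising the threshold can only enlarge it.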
5 Main Result
Let $H$ and $\sigma^2$ be the entropy and the varentropy of the source, respectively. The following theorem exactly characterizes the achievable $\epsilon$-coding rates up to the third-order term, and asserts that this rate is achieved by the TC code using quantized types.
For any stationary memoryless exponential family of distributions parameterized by ,
where the infimum is achieved by the TC code using quantized types.
6 Preliminary Results
Note that since the Hessian matrix of , is positive definite, the log-likelihood function is strictly concave and hence the maximum likelihood is unique.
The following lemma, which is a direct consequence of [8, Lemmas 1 and 3] provides tight upper and lower bounds on the quantized type class size.
Size of the quantized type class of is bounded as
where are constants independent of .
The type class size bounds in the previous lemma are springboards to the following upper bound on the lengths of the dictionary segments.
Corollary 1 (Segment Length).
There exists a positive constant , such that for any , we have
The following lemma shows that one single observation does not provide much information.
Let . There exists a constant such that
7 Achievability
7.1 Threshold Design
Setting a high threshold in (10) results in compressing more information into a fixed budget of output bits. On the other hand, in order to keep the dictionary size below the pre-specified size $M$, the threshold cannot be set too high. In this subsection, we characterize the largest threshold for which the resulting dictionary size is below $M$.
Let be the number of dictionary segments with length . For any , it must certainly hold that and . Let
Motivated by [7, Eq. 3.12], we upper bound as follows:
We show in Appendix A that . Hence
We then upper bound the dictionary size as follows:
where (23) is from (14), (24) follows from (22), and (25) is a consequence of upper bounding the summation with an integral, where is a generic constant whose precise value is irrelevant. Finally, to ensure that the dictionary of the quantized Type Complexity code (10) does not contain more than segments, it suffices to set such that
One can show that there exists a positive constant such that the following choice of the threshold satisfies (26), and moreover its leading two terms are the largest possible:
7.2 Coding Rate Analysis
In this subsection, we derive an upper bound on the $\epsilon$-coding rate of the quantized type implementation of the TC code. To this end, we upper bound the overflow probability as follows:
where (28) is from the condition for segment to be in the dictionary in (10), (29) holds since for a prefix of , and furthermore we assume that is an integer, (30) is from the quantized type class size bound in Lemma 1, (31) is from , and finally (32) is an application of Lemma 3. In Appendix B, we show that for the rate specified below, (32) and subsequently the overflow probability falls below :
Due to the definition of -coding rate, . This completes the achievability proof.
8 Converse
We first introduce notation relevant to F-V length codes. Recall that any F-V length prefix code is a mapping from a set of words , the set of all sequences of fixed input length over the alphabet , to variable length binary sequences. For an infinite length sequence emitted from the source, we adopt a one-shot setting and let
where is the prefix of within the set of words. For simplicity of notation, we denote .
Let be an arbitrary V-F length code with dictionary segments and length function defined as in (6). Let be any achievable -coding rate for . We show that
Assume and are integers. This assumption is without loss of generality. It is shown in [7] that for any V-F length code with $M$ dictionary segments and length function , one can construct an F-V length prefix code with codewords (i.e. fixed input length of ) and length function , such that the event for is equivalent to the event for . Their construction goes as follows:
Step 1: Consider the complete -ary tree with leaves corresponding to the complete and proper V-F length code. All the dictionary segments of length greater than , are shortened to letters, by pruning all subtrees with roots at depth . Therefore, all the leaves (i.e. segments) of the modified tree have length at most , and moreover the probability of the modified tree is equal to that of the original tree.
Step 2: Every segment of the modified tree with length is extended to by all possible suffixes, and accordingly, the -bit codeword for this segment is also extended by all possible -bit suffixes. This results in a F-V length code with fixed input-length and length function satisfying the required properties.
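The two steps above can be sketched in code. The following is an illustration under assumptions rather than the construction of the cited work verbatim: a pruned node reuses the index of the first segment in its subtree, a suffix of $m$ letters is encoded with $\lceil m \log k \rceil$ bits for alphabet size $k$, and segment indices use $\lceil \log M \rceil$ bits. The function names are hypothetical.

```python
from itertools import product
from math import ceil, log2

def encode_suffix(suffix, alphabet):
    """Fixed-length binary encoding of a suffix (empty for an empty suffix)."""
    k = len(alphabet)
    nbits = ceil(len(suffix) * log2(k))
    value = 0
    for a in suffix:
        value = value * k + alphabet.index(a)
    return format(value, f"0{nbits}b") if nbits else ""

def vf_to_fv(dictionary, alphabet, n):
    """Map every word of length n to a binary codeword built from a complete
    and proper V-F dictionary."""
    segs = sorted(dictionary)
    idx_bits = ceil(log2(len(segs)))
    code = {}
    for letters in product(alphabet, repeat=n):
        word = "".join(letters)
        match = next((s for s in segs if word.startswith(s)), None)
        if match is None:
            # Step 1: all segments below this word are pruned to the word itself;
            # it inherits the index of one segment in the pruned subtree.
            rep = next(s for s in segs if s.startswith(word))
            code[word] = format(segs.index(rep), f"0{idx_bits}b")
        else:
            # Step 2: a short segment is extended by a suffix, and its index
            # codeword is extended by the encoded suffix.
            suffix = word[len(match):]
            code[word] = (format(segs.index(match), f"0{idx_bits}b")
                          + encode_suffix(suffix, alphabet))
    return code

print(vf_to_fv({"0", "10", "110", "111"}, "01", 2))
```

Codewords sharing a segment index have equal length and distinct suffix bits, while different leaves of the pruned tree carry different fixed-length indices, so the resulting code is prefix-free.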
Therefore we have
Since is -achievable for , therefore and hence (36) implies
Define the -coding rate of the F-V length code as [12, Eq. 9]
Note that the fixed input length of is . Therefore, (37) implies
The converse for fixed-to-variable length prefix codes [12, Theorem 15], which is stated in [12] for the class of all memoryless sources but whose proof adapts straightforwardly to the exponential family, in turn implies
9 Conclusion
We derived the fundamental limits of universal variable-to-fixed length coding of $d$-dimensional exponential families of distributions in the fine asymptotic regime, where the law of large numbers may not hold. We proposed the Type Complexity code and showed that its quantized type implementation achieves the optimal third-order coding rate. Studying the behavior of non-proper codes is an interesting future direction.
References
[1] B. P. Tunstall, “Synthesis of noiseless compression codes,” Ph.D. dissertation, Georgia Inst. of Technol., Atlanta, GA, 1967.
[2] M. Drmota, Y. A. Reznik, and W. Szpankowski, “Tunstall code, Khodak variations, and random walks,” IEEE Transactions on Information Theory, vol. 56, no. 6, pp. 2928–2937, June 2010.
[3] R. Krichevsky and V. Trofimov, “The performance of universal encoding,” IEEE Transactions on Information Theory, vol. 27, pp. 199–207, 1981.
[4] J. Lawrence, “A new universal coding scheme for the binary memoryless source,” IEEE Transactions on Information Theory, vol. 23, pp. 466–472, 1977.
[5] T. Tjalkens and F. Willems, “A universal variable-to-fixed length source code based on Lawrence’s algorithm,” IEEE Transactions on Information Theory, vol. 38, pp. 247–253, 1992.
[6] K. Visweswariah, S. R. Kulkarni, and S. Verdú, “Universal variable-to-fixed length source codes,” IEEE Transactions on Information Theory, vol. 47, pp. 1461–1472, 2001.
[7] N. Merhav and D. L. Neuhoff, “Variable-to-fixed length codes provide better large deviations performance than fixed-to-variable length codes,” IEEE Transactions on Information Theory, vol. 38, no. 1, pp. 135–140, 1992.
[8] N. Iri and O. Kosut, “Fine asymptotics for universal one-to-one compression of parametric sources,” arXiv preprint arXiv:1612.06448, 2016.
[9] N. Iri and O. Kosut, “A new type size code for universal one-to-one compression of parametric sources,” in 2016 IEEE International Symposium on Information Theory (ISIT), 2016, pp. 1227–1231.
[10] O. Kosut and L. Sankar, “Universal fixed-to-variable source coding in the finite blocklength regime,” in 2013 IEEE International Symposium on Information Theory (ISIT), 2013, pp. 649–653.
[11] N. Iri and O. Kosut, “Universal coding with point type classes,” in 51st Annual Conference on Information Sciences and Systems (CISS), March 2017.
[12] O. Kosut and L. Sankar, “Asymptotics and non-asymptotics for universal fixed-to-variable source coding,” IEEE Transactions on Information Theory, vol. 63, no. 6, pp. 3757–3772, June 2017.
[13] J. Rissanen, “Universal coding, information, prediction, and estimation,” IEEE Transactions on Information Theory, vol. 30, no. 4, pp. 629–636, July 1984.
[14] N. Merhav and M. Weinberger, “On universal simulation of information sources using training data,” IEEE Transactions on Information Theory, vol. 50, no. 1, pp. 5–20, Jan. 2004.
[15] I. Kontoyiannis and S. Verdú, “Optimal lossless data compression: Non-asymptotics and asymptotics,” IEEE Transactions on Information Theory, vol. 60, no. 2, pp. 777–795, Feb. 2014.
[16] S. Saito, N. Miya, and T. Matsushima, “Evaluation of the minimum overflow threshold of Bayes codes for a Markov source,” in 2014 International Symposium on Information Theory and its Applications (ISITA), 2014, pp. 211–215.
Appendix A Proof of
The type class size bounds in Lemma 1 imply the following subset relationships
Appendix B Achievable $\epsilon$-coding Rate
In order for (32) to be less than or equal to $\epsilon$, it must hold that
Recalling the designed value for in (27) along with the Taylor expansion of around yield
for some constant . Define as the largest satisfying (43). We then solve iteratively for . For large enough , one can show that , where . Substituting in (43) and cancelling from the left and right side of (43), one can show that , where and the Taylor expansion of around is employed. Finally, substituting in (43) and cancelling the and terms from the left and right side of (43), one can show that , where .