Fundamental Limits of Universal Variable-to-Fixed Length Coding of Parametric Sources

This research was funded in part by the NSF under grant No. CCF-1422358.
Abstract
Universal variable-to-fixed (VF) length coding of exponential families of distributions is considered. We propose an achievable scheme consisting of a dictionary, used to parse the source output stream, that makes use of the previously introduced notion of quantized types. The quantized type class of a sequence is based on partitioning the space of minimal sufficient statistics into cuboids. Our proposed dictionary consists of sequences at the boundary of the transition from small to large quantized type class size. We derive the asymptotics of the coding rate of our coding scheme for large enough dictionaries; in particular, we characterize the third-order coding rate of our scheme in terms of the entropy of the source and the dictionary size. We further provide a converse, showing that this rate is optimal up to the third-order term.
1 Introduction
A variable-to-fixed (VF) length code consists of a dictionary of pre-specified size. Elements of the dictionary (segments) are used to parse the infinite sequence emitted by the source. Segments may have variable length; however, they are encoded to the fixed-length binary representations of their indices within the dictionary. In order to be able to uniquely parse any infinite-length sequence into segments, we assume the dictionary is complete (i.e., every infinite-length sequence has a prefix within the dictionary) and proper (i.e., no segment is a prefix of another segment). The underlying source model induces a distribution on the segment lengths. The segment length distribution reflects the quality of the dictionary for the compression task.
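To make the parsing operation concrete, the following sketch (a toy illustration of our own, not part of the paper's construction) greedily extends a buffer until it matches a dictionary segment. Properness makes the match unambiguous, and completeness guarantees a match is eventually found. The binary dictionary used in the usage note is an assumed example of a complete and proper dictionary.

```python
def parse(stream, dictionary):
    # Greedy parsing with a proper, complete dictionary: keep extending
    # the current buffer until it matches a segment. Properness makes the
    # match unique; completeness guarantees one is eventually found.
    segments = []
    buf = ""
    for symbol in stream:
        buf += symbol
        if buf in dictionary:
            segments.append(buf)
            buf = ""
    return segments
```

For example, with the complete and proper dictionary {"0", "10", "11"} over the binary alphabet, the stream "100110" parses as ["10", "0", "11", "0"].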
For a given memoryless source, Tunstall [1] provided an average-case optimal algorithm to maximize the average segment length. A central limit theorem for the Tunstall algorithm's code length was derived in [2]. In most applications, however, the statistics of the source are unknown or arduous to estimate, especially at short blocklengths, where there are limited samples for the inference task. In universal source coding, the underlying distribution in force is unknown, yet belongs to a known collection of distributions. Universal VF length codes are studied in, e.g., [3, 4, 5, 6]. Upper and lower bounds on the redundancy of a universal code for the class of all memoryless sources are derived in [3]. Universal VF length coding of the class of all binary memoryless sources is then considered in [4, 5], where [5] provides an algorithm that is asymptotically optimal in the average sense. (Throughout, "optimality" of an algorithm is considered only up to the model cost term, i.e., the term reflecting the price of universality, in the coding rate. The model cost term is the second-order term in the average-case analysis, while it is the third-order term in the probabilistic analysis.) Later, the optimal redundancy for VF length compression of the class of Markov sources was derived in [6]. The performance of VF length codes and fixed-to-variable (FV) length codes for compression of the class of Markov sources is compared in [7], and a dictionary construction that asymptotically achieves the optimal error exponent is proposed there.
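For reference, Tunstall's construction [1] can be sketched as follows: start from the single-letter segments and repeatedly split the most probable segment into all of its one-letter extensions until the dictionary budget is filled. This is a minimal illustration of our own, assuming a known memoryless source with letter probabilities `probs`; it is not the paper's universal scheme.

```python
import heapq

def tunstall(probs, M):
    # Tunstall's construction for a known memoryless source: begin with
    # the single-letter segments, then repeatedly replace the most
    # probable segment with all of its one-letter extensions, as long as
    # the resulting dictionary still has at most M segments.
    alphabet = list(probs)
    heap = [(-probs[a], a) for a in alphabet]  # negated prob => max-heap
    heapq.heapify(heap)
    while len(heap) + len(alphabet) - 1 <= M:
        neg_p, seg = heapq.heappop(heap)
        for a in alphabet:
            heapq.heappush(heap, (neg_p * probs[a], seg + a))
    return sorted(seg for _, seg in heap)
```

For instance, with letter probabilities {"a": 0.7, "b": 0.3} and a budget of 3 segments, the segment "a" is split once, yielding the dictionary ["aa", "ab", "b"].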
All previous works consider model classes that include all distributions within a simplex; universal VF length coding for more structured model classes has not been considered in the literature. Apart from handling these richer topologies, we further adopt more general metrics of performance. Delay-sensitive modern applications place new requirements on the performance of compression schemes, so it is vital to characterize the overhead associated with operation in the non-asymptotic regime. When probing the non-asymptotics, incurring "errors" is inevitable. We therefore depart from the classical average-case (redundancy) and worst-case (regret) analyses in favor of a modern probabilistic analysis, where the figure of merit in our setup is the coding rate: the minimum rate such that the corresponding overflow probability stays below a prescribed tolerance. Our goal is to analyze the asymptotics of the coding rate as the size of the dictionary increases. We provide an achievable scheme for compressing an exponential family of distributions as the parametric model class. Moreover, we provide a converse result, showing that our proposed scheme is optimal up to the third-order term in the coding rate.
In previous universal VF length codes, one can define a notion of complexity for sequences. In [3, 4, 5, 6], a sequence with high complexity has low probability under a certain composite or mixture source, while in [7], high-complexity sequences have high scaled (by sequence length) empirical entropy. The dictionary of such algorithms then consists of sequences at the boundary of the transition from low complexity to high complexity. We follow a similar complexity theme to design the dictionary. The sequence complexity in our proposed algorithm is characterized by the sequence's type class size; hence we name our scheme the Type Complexity (TC) code. Scaled empirical entropy [7] is ignorant of the underlying structure of the parametric class. Therefore, in order to fully exploit the inherent structure of the model class, we characterize type classes based on quantized types, which we introduced in [8, 9] in studying FV length compression. We partition the space of minimal sufficient statistics into cuboids, and define two sequences to be in the same quantized type class if and only if their minimal sufficient statistics fall within the same cuboid.
The type class approach has been taken before for the compression problem in [10]. The Type Size (TS) code is introduced in [10] for FV length compression of the class of all stationary memoryless sources, in which sequences are encoded in increasing order of type class size. An appealing aspect of this approach is the freedom in defining types; in fact, for FV length coding, any universal one-to-one compression algorithm can be considered a TS code with a proper characterization of types [11]. In [8], we considered universal FV length source coding of parametric sources, and showed that the TS code using quantized types achieves the optimal coding rate for FV length compression of the exponential family of distributions.
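As a toy illustration of the TS-code idea from [10], the following sketch of our own uses exact binary types (binomial type classes) in place of the paper's quantized types: a sequence's codeword length is roughly the log of the number of sequences lying in type classes no larger than its own. The function name and the restriction to binary exact types are simplifying assumptions.

```python
from math import comb, ceil, log2

def ts_codeword_length(x):
    # Toy Type Size code for a binary string x: sequences are ordered by
    # the size of their (exact) type class, so x's codeword needs about
    # log2 of the count of sequences in type classes no larger than its own.
    n, k = len(x), x.count("1")
    size = comb(n, k)  # size of x's type class
    rank = sum(comb(n, j) for j in range(n + 1) if comb(n, j) <= size)
    return ceil(log2(rank))
```

The all-zeros sequence, whose type class is the smallest, thus gets the shortest codeword, while balanced sequences, lying in the largest type classes, get codewords of nearly the full length.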
In this work, we provide a performance guarantee for VF length compression of the exponential family using our proposed Type Complexity code. We upper bound the coding rate of the quantized type implementation of the Type Complexity code by
(1) 
where the bound is expressed in terms of the entropy and the varentropy of the underlying source, the pre-specified dictionary size, the complementary CDF of the standard normal distribution, and the dimension of the model class. We then provide a converse result showing that this rate is optimal up to the third-order term. Our converse proof relies on the construction of an FV length code from a VF length code presented in [7], along with a converse result for FV length prefix codes [12].
Comparing the third-order term in (1) with Rissanen's [13] redundancy for FV length codes, where the fixed number of codewords in the FV length code plays the role of the fixed number of segments in the VF length code, we observe that for binary memoryless sources, the optimal VF length code provides better convergence of the model cost term than the FV length codes, while for sources of larger dimension, the optimal FV length code trumps the VF length codes from the perspective of model cost effects. On the other hand, comparing the dispersion term in (1) with the dispersion of the optimal FV length code [8], we observe that the optimal VF length code provides a better dispersion term for binary memoryless sources, while for sources of larger dimension, the optimal FV length code provides better dispersion effects.
The rest of the paper is organized as follows. In Sec. 2, we introduce the exponential family, VF length coding, and related definitions. In Sec. 3, we reproduce the characterization of quantized types from [8]. The Type Complexity code is presented in Sec. 4. The main result of the paper is stated in Sec. 5. We present preliminary results in Sec. 6. The achievability and converse results are proved in Secs. 7 and 8, respectively. We conclude in Sec. 9.
2 Problem Statement
Let the parameter space be a compact subset of a finite-dimensional Euclidean space. Probability distributions in an exponential family can be expressed in the form
(2) 
where the distribution is specified by the parameter vector, the vector of sufficient statistics, and the normalizing factor. Let the model class be the exponential family of distributions over the finite alphabet, parameterized by vectors whose dimension equals the degrees of freedom in the minimal description of the family, in the sense that no smaller-dimensional family can capture the same model class. The degrees of freedom turn out to characterize the richness of the model class in our context. Compactness of the parameter space implies the existence of uniform bounds on the probabilities, i.e.
(3) 
Let the source emit an infinite-length sequence drawn from the (unknown) true model. From (2), the probability of a sequence drawn from a model in the exponential family takes the form [14]
(4) 
where
(5) 
is a minimal sufficient statistic [14]. Note that the two maps are distinguished based upon their arguments. We denote probability, expectation, and variance with respect to the true model, respectively, as well as the set of all finite-length sequences over the source alphabet and the generic source sequence of unspecified length. Concatenation of two sequences is written by juxtaposition. All logarithms are in base 2. For a set, we write its size with the usual cardinality notation. Instead of introducing different indices for every new constant, the same letter may be used to denote different constants whose precise values are irrelevant.
A VF length code consists of a parsing dictionary of a pre-specified size, which is used to parse the source sequence. Elements of the dictionary (segments) may have different lengths. Once a segment is identified as a parsed sequence, it is encoded to its lexicographical index within the dictionary, using a number of bits logarithmic in the dictionary size. As it does not hurt our analysis, we ignore rounding this length to its closest integer.
We assume the dictionary is complete, i.e., any infinite-length sequence over the alphabet has a prefix in it. In addition, we assume it is proper, i.e., no segment is a prefix of another. Completeness together with properness implies that any long enough sequence has a unique prefix in the dictionary. Every complete and proper dictionary can be represented by a rooted complete tree in which every internal node has one child per alphabet letter. Label the edges branching out of an internal node with distinct letters of the alphabet; each node then corresponds to the sequence of edge labels from the root to that node. The internal nodes of the tree correspond to proper prefixes of the segments, while the leaf nodes correspond to the segments.
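Both properties are easy to verify on a finite dictionary. The sketch below (our own, with hypothetical helper names) checks properness via the prefix condition and, for a proper dictionary, checks completeness via the Kraft sum over segment lengths equaling exactly 1, which corresponds to the parsing tree having no missing leaves.

```python
from fractions import Fraction

def is_proper(dictionary):
    # Proper: no segment is a proper prefix of another segment.
    return not any(a != b and b.startswith(a)
                   for a in dictionary for b in dictionary)

def is_complete(dictionary, alphabet_size):
    # For a proper dictionary, completeness is equivalent to the Kraft
    # sum over segment lengths equaling exactly 1 (no leaf is missing
    # from the parsing tree). Exact rationals avoid float round-off.
    return sum(Fraction(1, alphabet_size ** len(s)) for s in dictionary) == 1
```

For example, {"0", "10", "11"} is proper and complete over the binary alphabet (Kraft sum 1/2 + 1/4 + 1/4 = 1), whereas {"0", "10"} is proper but not complete.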
Consider a VF length code with a given dictionary, and let the first parsed segment of the source output, obtained using this dictionary, be a random variable with a random length. We adopt a one-shot setting and denote
(6) 
We gauge the performance of a VF length code with a dictionary of a given size through the coding rate, given by
(7) 
Our goal is to analyze the behavior of the coding rate for large enough dictionary sizes.
Remark 1.
Optimizing the coding rate provides more refined results than optimizing the average segment length alone; the latter is done in, e.g., [5].
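The coding rate in (7) can also be probed numerically. The following Monte Carlo sketch (our own illustration, not the paper's analysis) draws streams from a binary memoryless source, parses the first segment, records log2 of the dictionary size divided by the segment length, and reports the smallest empirical rate whose overflow frequency is at most the tolerance.

```python
import math
import random

def empirical_coding_rate(dictionary, p, eps, trials=2000, seed=0):
    # Monte Carlo sketch of the coding rate: for each trial, grow the
    # source stream symbol by symbol until the buffer hits a dictionary
    # segment, then record log2(M) / segment length. Return the smallest
    # sampled rate whose empirical overflow probability is at most eps.
    rng = random.Random(seed)
    M = len(dictionary)
    rates = []
    for _ in range(trials):
        buf = ""
        while buf not in dictionary:
            buf += "1" if rng.random() < p else "0"
        rates.append(math.log2(M) / len(buf))
    rates.sort()
    # Order-statistic estimate of the (1 - eps)-quantile of the rate.
    k = math.ceil((1 - eps) * trials) - 1
    return rates[k]
```

With the complete and proper dictionary {"0", "10", "11"}, a fair source, and eps = 0.1, the one-symbol parse occurs with probability 1/2 > eps, so the estimate lands on the worst-case sampled rate log2(3).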
3 Quantized Types
We have previously introduced quantized types [8, 9], the optimal characterization of type classes for universal FV length compression of the exponential family (optimality is in the sense that the quantized type class implementation of the TS code achieves the minimum third-order coding rate). In this section, we briefly review this characterization. In order to define the quantized type class of a sequence, we cover the convex hull of the set of minimal sufficient statistics with cubic grid cells (cuboids) whose common side length scales with a fixed constant. The union of such disjoint cuboids should cover the whole convex hull. The position of these cuboids is arbitrary; however, once we cover the space, the covering is fixed throughout. We represent each cuboid by its geometrical center, and denote a cuboid by its center. More precisely,
(8) 
where the condition is applied componentwise to the vector. Let the quantization of a statistic be the center of the cuboid that contains it.
We then define the quantized type class of as
(9) 
the set of all sequences whose minimal sufficient statistic belongs to the very cuboid containing the minimal sufficient statistic of the given sequence (see Figure 1). We also denote the set of all quantized type classes for sequences of a given length.
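In code, the quantization step amounts to snapping a statistic vector to the center of its grid cell. The sketch below assumes a given side length `delta` and a grid aligned with the origin (the paper allows an arbitrary but fixed alignment); two sequences share a quantized type class if and only if their statistics map to the same center.

```python
import math

def cuboid_center(t, delta):
    # Snap each coordinate of the statistic vector t to the center of the
    # length-delta grid cell containing it. The grid is assumed aligned
    # with the origin; the paper fixes an arbitrary alignment once.
    return tuple((math.floor(v / delta) + 0.5) * delta for v in t)
```

For instance, with side length 0.5, the statistic (0.26, 0.74) maps to the cell center (0.25, 0.75); any other sequence whose statistic maps to the same center belongs to the same quantized type class.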
4 Type Complexity Code
In this section, we propose the Type Complexity (TC) code. Our designed dictionary consists of sequences at the boundary of the transition from small quantized type class size to large quantized type class size. More precisely, let a threshold be chosen as the largest positive constant such that the resulting dictionary has at most the pre-specified number of segments; we characterize this choice precisely in Section 7.1. A sequence is a segment in the dictionary of the TC code if and only if
(10) 
where the quantized type classes are as defined in (9), and the shorter sequence is obtained from the segment by deleting its last letter.
By construction, it is clear that the dictionary is proper; furthermore, monotonicity of the quantized type class size along a sequence implies completeness. Intuitively, sequences with large type class sizes contain more information, implying that the TC code compresses more information into a fixed budget of output bits, which is the promise of the optimal VF length code.
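The boundary rule in (10) suggests the following generic construction, sketched here with a stand-in complexity function: grow the parsing tree from the root, declaring a node a segment as soon as its complexity reaches the threshold. In the TC code the complexity would be the log quantized type class size; in this sketch of ours, the complexity argument and the `max_len` guard are toy assumptions that keep the example finite.

```python
def build_dictionary(alphabet, complexity, threshold, max_len=64):
    # Grow the parsing tree from the root: a node becomes a segment (leaf)
    # as soon as its complexity reaches the threshold; otherwise it is an
    # internal node and is extended by every letter. Monotonicity of the
    # complexity along a sequence yields a proper, complete dictionary.
    dictionary, frontier = [], [""]
    while frontier:
        node = frontier.pop()
        for a in alphabet:
            child = node + a
            if complexity(child) >= threshold or len(child) >= max_len:
                dictionary.append(child)
            else:
                frontier.append(child)
    return dictionary
```

With sequence length itself as a (trivially monotone) toy complexity and threshold 2, the construction returns all binary strings of length 2, i.e., the complete depth-2 tree.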
We note that there is freedom in defining the type classes in (10). We show that the quantized type is the relevant characterization of type classes for optimal performance.
5 Main Result
Let the entropy and the varentropy of the true model be defined as usual. The following theorem exactly characterizes the achievable rates up to the third-order term, and asserts that this rate is achieved by the TC code using quantized types.
Theorem 1.
For any stationary memoryless exponential family of distributions,
(11) 
where the infimum is achieved by the TC code using quantized types.
6 Preliminary Results
Define
(12) 
Note that since the Hessian matrix of the log-normalizer is positive definite, the log-likelihood function is strictly concave, and hence the maximum likelihood estimate is unique.
The following lemma, which is a direct consequence of [8, Lemmas 1 and 3], provides tight upper and lower bounds on the quantized type class size.
Lemma 1.
The size of the quantized type class of a sequence is bounded as
(13) 
where the constants are independent of the sequence.
The type class size bounds in the previous lemma are springboards to the following upper bound on the lengths of the dictionary segments.
Corollary 1 (Segment Length).
There exists a positive constant such that, for every dictionary segment, we have
(14) 
The following lemma shows that one single observation does not provide much information.
Lemma 2.
Consider appending a single letter to a sequence. There exists a constant such that
(15) 
Proof.
7 Achievability
7.1 Threshold Design
Setting a high threshold value in (10) results in compressing more information into a fixed budget of output bits. On the other hand, in order to keep the dictionary size below the pre-specified size, the threshold cannot be set too high. In this subsection, we characterize the largest threshold value for which the resulting dictionary size stays below the pre-specified size.
Let us denote the number of dictionary segments of each length. For any segment, the two conditions in (10) must certainly hold. Let
(20) 
Motivated by [7, Eq. 3.12], we upper bound this count as follows:
(21) 
We show in Appendix A that this quantity is suitably bounded. Hence
(22) 
We then upper bound the dictionary size as follows:
(23)  
(24)  
(25) 
where (23) is from (14), (24) follows from (22), and (25) is a consequence of upper bounding the summation by an integral, with a generic constant whose precise value is irrelevant. Finally, to ensure that the dictionary of the quantized Type Complexity code (10) does not contain more than the pre-specified number of segments, it suffices to set the threshold such that
(26) 
One can show that there exists a positive constant such that the following choice of the threshold satisfies (26) and, moreover, makes the leading two terms the largest possible:
(27) 
7.2 Coding Rate Analysis
In this subsection, we derive an upper bound for the coding rate of the quantized type implementation of the TC code. To this end, we upper bound the overflow probability as follows:
(28)  
(29)  
(30)  
(31)  
(32) 
where (28) is from the condition in (10) for a segment to be in the dictionary, (29) holds for a prefix of the segment, where we further assume the relevant length is an integer, (30) is from the quantized type class size bound in Lemma 1, (31) follows from the choice of the threshold, and finally (32) is an application of Lemma 3. In Appendix B, we show that for the rate specified below, (32), and subsequently the overflow probability, falls below the tolerated level:
(33) 
By the definition of the coding rate, the claimed bound follows. This completes the achievability proof.
8 Converse
We first introduce notation relevant to FV length codes. Recall that any FV length prefix code is a mapping from a set of words, namely the set of all sequences of a fixed input length over the alphabet, to variable-length binary sequences. For an infinite-length sequence emitted by the source, we adopt a one-shot setting and let
(34) 
where the argument is the prefix of the source sequence within the set of words. For simplicity of notation, we abbreviate accordingly.
Consider an arbitrary VF length code with its dictionary of segments and length function defined as in (6), and let any achievable coding rate for it be given. We show that
(35) 
Assume the relevant quantities are integers; this assumption does not hurt the generality of our result. It is shown in [7] that for any VF length code with a given number of dictionary segments and length function, one can construct an FV length prefix code, with a corresponding number of codewords and fixed input length, whose length function makes the overflow event of the FV code equivalent to that of the VF code. The construction goes as follows:

Step 1: Consider the complete tree whose leaves correspond to the segments of the complete and proper VF length code. All dictionary segments longer than the fixed input length are shortened to that length by pruning all subtrees rooted at the corresponding depth. Therefore, all leaves (i.e., segments) of the modified tree have length at most the fixed input length, and moreover the probability of the modified tree equals that of the original tree.

Step 2: Every segment of the modified tree shorter than the fixed input length is extended to that length by all possible suffixes, and, accordingly, the binary codeword for this segment is also extended by all possible binary suffixes. This results in an FV length code with the desired fixed input length and a length function satisfying the required properties.
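The two steps above can be sketched in code, with the caveat that this toy version of ours records only the mapping from fixed-length words to segment indices; collapsed long segments simply share a truncated word, and the codeword-suffix extension is left implicit.

```python
from itertools import product

def vf_to_fv(dictionary, alphabet, n):
    # Step 1: truncate every segment longer than n to its first n letters
    # (pruning subtrees below depth n); long segments sharing a prefix
    # collapse onto one word, and any one of their indices may be kept.
    # Step 2: extend every shorter segment by all possible suffixes of
    # the missing length, so each fixed-length word inherits the index
    # of the segment it came from.
    codebook = {}
    for idx, seg in enumerate(sorted(dictionary)):
        if len(seg) >= n:
            codebook[seg[:n]] = idx
        else:
            for suffix in product(alphabet, repeat=n - len(seg)):
                codebook[seg + "".join(suffix)] = idx
    return codebook
```

For instance, the VF dictionary {"0", "10", "11"} with fixed input length 2 yields the FV codebook {"00": 0, "01": 0, "10": 1, "11": 2}: the segment "0" is extended to both length-2 words beginning with 0, each keeping its original index.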
Therefore we have
(36) 
Since the rate is achievable for the VF code, (36) implies
(37) 
Define the coding rate of the FV length code as [12, Eq. 9]
Recalling the fixed input length of the constructed FV code, (37) implies
(38) 
The converse for fixed-to-variable length prefix codes [12, Theorem 15] in turn implies (the result in [12] is stated for the class of all memoryless sources, but adapting the proof to the exponential family is straightforward)
(39) 
(40) 
for some constant. Through an iterative approach similar to that in Appendix B, one can show that (40) leads to (35).
9 Conclusion
We derived the fundamental limits of universal variable-to-fixed length coding of exponential families of distributions in the fine asymptotic regime, where the law of large numbers may not hold. We proposed the Type Complexity code and showed that its quantized type implementation achieves the optimal third-order coding rate. Studying the behavior of non-proper codes is an interesting future direction.
References
 [1] B. P. Tunstall, "Synthesis of noiseless compression codes," Ph.D. dissertation, Georgia Inst. of Technol., Atlanta, GA, 1967.
 [2] M. Drmota, Y. A. Reznik, and W. Szpankowski, "Tunstall code, Khodak variations, and random walks," IEEE Transactions on Information Theory, vol. 56, no. 6, pp. 2928–2937, June 2010.
 [3] R. Krichevsky and V. Trofimov, "The performance of universal encoding," IEEE Transactions on Information Theory, vol. 27, pp. 199–207, 1981.
 [4] J. Lawrence, "A new universal coding scheme for the binary memoryless source," IEEE Transactions on Information Theory, vol. 23, pp. 466–472, 1977.
 [5] T. Tjalkens and F. Willems, "A universal variable-to-fixed length source code based on Lawrence's algorithm," IEEE Transactions on Information Theory, vol. 38, pp. 247–253, 1992.
 [6] K. Visweswariah, S. R. Kulkarni, and S. Verdu, "Universal variable-to-fixed length source codes," IEEE Transactions on Information Theory, vol. 47, pp. 1461–1472, 2001.
 [7] N. Merhav and D. L. Neuhoff, "Variable-to-fixed length codes provide better large deviations performance than fixed-to-variable length codes," IEEE Transactions on Information Theory, vol. 38, no. 1, pp. 135–140, 1992.
 [8] N. Iri and O. Kosut, "Fine asymptotics for universal one-to-one compression of parametric sources," arXiv preprint arXiv:1612.06448, 2016.
 [9] ——, "A new type size code for universal one-to-one compression of parametric sources," in 2016 IEEE International Symposium on Information Theory (ISIT), 2016, pp. 1227–1231.
 [10] O. Kosut and L. Sankar, "Universal fixed-to-variable source coding in the finite blocklength regime," in 2013 IEEE International Symposium on Information Theory (ISIT), 2013, pp. 649–653.
 [11] N. Iri and O. Kosut, "Universal coding with point type classes," in 51st Annual Conference on Information Sciences and Systems (CISS), March 2017.
 [12] O. Kosut and L. Sankar, "Asymptotics and non-asymptotics for universal fixed-to-variable source coding," IEEE Transactions on Information Theory, vol. 63, no. 6, pp. 3757–3772, June 2017.
 [13] J. Rissanen, "Universal coding, information, prediction, and estimation," IEEE Transactions on Information Theory, vol. 30, no. 4, pp. 629–636, July 1984.
 [14] N. Merhav and M. Weinberger, "On universal simulation of information sources using training data," IEEE Transactions on Information Theory, vol. 50, no. 1, pp. 5–20, Jan. 2004.
 [15] I. Kontoyiannis and S. Verdú, "Optimal lossless data compression: Non-asymptotics and asymptotics," IEEE Transactions on Information Theory, vol. 60, no. 2, pp. 777–795, Feb. 2014.
 [16] S. Saito, N. Miya, and T. Matsushima, "Evaluation of the minimum overflow threshold of Bayes codes for a Markov source," in 2014 International Symposium on Information Theory and its Applications (ISITA), 2014, pp. 211–215.
Appendix A Proof of
Appendix B Achievable coding Rate
In order for (32) to be less than or equal to the tolerated level, it must hold that
Recalling the designed threshold value in (27), along with a Taylor expansion, yields
(43) 
for some constant. Define the rate of interest as the largest value satisfying (43), and solve for it iteratively. For a large enough dictionary, one can establish a first-order approximation; substituting it into (43) and cancelling the matching terms on the left and right sides, one refines the approximation via a Taylor expansion. Finally, substituting once more into (43) and cancelling the matching terms on both sides yields the claimed third-order expression.