Universal Weak Variable-Length Source Coding
on Countably Infinite Alphabets
Abstract
Motivated by the fact that universal source coding on countably infinite alphabets is not feasible, this work introduces the notion of "almost lossless source coding". Analogous to the weak variable-length source coding problem studied by Han [3], almost lossless source coding aims at relaxing the lossless block-wise assumption to allow an average per-letter distortion that vanishes asymptotically as the blocklength goes to infinity. In this setup, we show on one hand that Shannon entropy characterizes the minimum achievable rate (similarly to the case of discrete sources), while on the other hand almost lossless universal source coding becomes feasible for the family of finite-entropy stationary memoryless sources with countably infinite alphabets. Furthermore, we study a stronger notion of almost lossless universality that demands uniform convergence of the average per-letter distortion to zero, where we establish a necessary and sufficient condition for the so-called family of "envelope distributions" to achieve it. Remarkably, this condition is the same necessary and sufficient condition needed for the existence of a strongly minimax (lossless) universal source code for the family of envelope distributions. Finally, we show that an almost lossless coding scheme offers a faster rate of convergence for the (minimax) redundancy compared to the well-known information radius developed for the lossless case, at the expense of tolerating a nonzero distortion that vanishes as the blocklength grows. This shows that even when lossless universality is feasible, an almost lossless scheme can offer different regimes on the rates of convergence of the (worst-case) redundancy versus the (worst-case) distortion.
Universal source coding, countably infinite alphabets, weak source coding, envelope distributions, information radius, metric entropy analysis.
I Introduction
The problem of Universal Source Coding (USC) has a long history in information theory [4, 5, 6, 7, 8]. This topic started with the seminal work of Davisson [7], which formalized variable-length lossless coding and introduced relevant information quantities (mutual information and channel capacity [5]). In lossless variable-length source coding, it is well-known that if we know the statistics of a source (memoryless, or stationary and ergodic), the Shannon entropy (or Shannon entropy rate) provides the minimum achievable rate [5]. However, when the statistics of the source are not known but its distribution belongs to a family of distributions , the problem reduces to characterizing the worst-case expected overhead (or minimax redundancy) that a pair of encoder and decoder experiences due to the lack of knowledge about the true distribution governing the source samples to be encoded [4].
A seminal information-theoretic result states that the least worst-case overhead (or minimax redundancy of ) is fully characterized by the information radius of [4]. The information radius has been richly studied in numerous contributions [9, 10, 11, 12, 13], including in universal prediction of individual sequences [14]. In particular, it is well-known that the information radius grows sublinearly for the family of finite-alphabet stationary and memoryless sources [4], which implies the existence of a universal source code that achieves Shannon entropy for every distribution in this family as the blocklength goes to infinity. What is intriguing in this positive result for finite-alphabet memoryless sources is that it no longer extends to the case of memoryless sources on countably infinite alphabets, as was clearly shown in [8, 6, 9]. From an information complexity perspective, this infeasibility result implies that the information radius of this family is unbounded for any finite blocklength and, consequently, lossless universal source coding for infinite-alphabet memoryless sources is an intractable problem. In this regard, the proof presented by Györfi et al. [6, Theorem 1] is constructed over a connection between variable-length prefix-free codes and distribution estimators, and the fact that the redundancy of a given code upper bounds the expected divergence between the true distribution and the (code-induced) estimate of the distribution. Then, the existence of a universal source code implies the existence of a universal estimator in the sense of expected direct information divergence [15]. The impossibility of achieving this learning objective for the family of finite-entropy memoryless sources [6, Theorem 2] motivates the main question addressed in this work: the study of a "weak notion" of universal variable-length source coding.
In this framework, we propose to address the problem of universal source coding for infinite-alphabet memoryless sources by studying a weaker (lossy) notion of coding instead of the classical lossless definition [4, 5]. This notion borrows ideas from the seminal work by Han [3], which allows reconstruction errors but assumes known statistics. In this paper, we investigate the idea of relaxing the lossless block-wise assumption so that the corresponding weak universal source coding formulation reduces to a learning criterion that becomes feasible for the whole family of finite-entropy memoryless sources on countably infinite alphabets. In particular, we move from lossless coding to an asymptotically vanishing distortion criterion based on the Hamming distance as the fidelity metric.
I-A Contributions
Assuming that the distribution of the source is known, we first introduce the problem of "almost lossless source coding" for memoryless sources defined on countably infinite alphabets. Theorem 1 shows that Shannon entropy characterizes the minimum achievable rate for this problem. The proof of this theorem adopts a result from Ho et al. [16] that provides a closed-form expression for the rate-distortion function on countable alphabets. From this characterization, we show that , which is essential to prove this result. (This result is well-known for finite alphabets; however, the extension to countably infinite alphabets is not straightforward due to the discontinuity of the entropy [17, 18].)
Then, we address the problem of almost lossless universal source coding. The main difficulty arises in finding a lossy coding scheme that achieves asymptotically zero distortion pointwise over the family, while guaranteeing that the worst-case average redundancy (w.r.t. the minimum achievable rate) vanishes with the blocklength [4]. The proof of the existence of a universal code with the desired property relies on a two-stage coding scheme that first quantizes (symbol by symbol) the countable alphabet and then applies a lossless variable-length code over the resulting quantized symbols. Our main result, stated in Theorem 2, shows that almost lossless universal source coding is feasible for the family of finite-entropy stationary and memoryless sources.
We further study the possibility of obtaining rates of convergence for the worst-case distortion and the worst-case redundancy. To this end, we restrict our analysis to the family of memoryless distributions with densities dominated by an envelope function , which was previously studied in [9, 10, 19]. Theorem 3 presents a necessary and sufficient condition on to achieve a uniform convergence (over the family) of the distortion to zero and, simultaneously, a vanishing worst-case average redundancy. Remarkably, this condition (summability of the envelope function) is the same necessary and sufficient condition needed for the existence of a strongly minimax (lossless) universal source code [9, Theorems 3 and 4].
Finally, we provide an analysis of the potential benefit of an almost lossless two-stage coding scheme by exploring the family of envelope distributions that admits strong minimax universality in lossless source coding [9]. In this context, Theorem 4 shows that an almost lossless approach offers a nontrivial reduction in the rate of convergence of the worst-case redundancy, with respect to the well-known information radius developed for the lossless case, at the expense of tolerating a nonzero distortion that vanishes with the blocklength. This result provides evidence that even in the case where lossless universality is feasible, an almost lossless scheme can reduce the rate of the worst-case redundancy and, consequently, offers ways of achieving different regimes for the rate of convergence of the redundancy versus the distortion. The proof of this result uses advanced tools by Haussler and Opper [11] that relate the minimax redundancy of a family of distributions to its metric entropy with respect to the Hellinger distance. Indeed, this metric entropy approach has proven instrumental for deriving tight bounds on the information radius of summable envelope distributions in [10]. We extend this metric entropy approach to our almost lossless coding setting with a two-stage coding scheme to characterize the precise regime in which we can achieve gains in the rate of convergence of the redundancy.
The rest of the paper is organized as follows. Section II introduces the basic elements of the weak source coding problem and shows that Shannon entropy is the minimum achievable rate provided that the statistics of the source are known. Section III presents the problem of almost lossless universal source coding and proves its feasibility for the family of finite-entropy memoryless distributions on infinite alphabets. Section III-B elaborates a result for a stronger notion of almost lossless universality. Finally, Section IV studies the gains in the rate of convergence of the redundancy that can be obtained with an almost lossless scheme for families of distributions that admit lossless USC. Section V presents the summary of the work, and auxiliary results are presented in the Appendix.
II Almost Lossless Source Coding
We begin by introducing some useful concepts and notation that will be needed across the paper. Let be a memoryless process (source) with values in a countably infinite alphabet equipped with a probability measure defined on the measurable space (here denotes the power set of ). Let denote a finite block of length of the process following the product measure on (the product measure satisfies the memoryless condition: for all , ). Let us denote by the family of probability measures on , where for every we use as a shorthand for its probability mass function (pmf). Let and let denote the collection of finite Shannon entropy probabilities [20], where
(1) 
with the function taken to base 2.
II-A Almost Lossless Source Coding with Known Statistics
We now introduce the notion of a variable-length lossy coding of source samples, which consists of a pair where is a prefix-free variable-length code (encoder) [5] and is the inverse mapping from bits to source symbols (decoder). Inspired by the weak coding setting introduced by Han [3], the possibility that is allowed, i.e., imperfect reconstruction is permitted. In order to quantify the loss induced by the encoding process, a per-letter distortion measure is considered, where for the distortion is given by
(2) 
Given an information source , the average distortion induced by the pair is
(3) 
We will focus on . Then, is the normalized Hamming distance between the sequences . On the other hand, the rate of the pair (in bits per sample) is
(4) 
where denotes the functional that returns the length of binary sequences in . At this stage, we can introduce the almost-lossless source coding problem and, with this, the standard notion of minimum achievable rate.
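Before formalizing achievability, the per-letter distortion in (2)-(3) can be sketched numerically. The snippet below is an illustrative sketch of the normalized Hamming distance mentioned above; the function name is ours, not the paper's.

```python
# Normalized (per-letter) Hamming distortion between a source block and its
# reconstruction: the fraction of positions where the two sequences differ.
def hamming_distortion(x_block, y_block):
    assert len(x_block) == len(y_block)
    return sum(1 for a, b in zip(x_block, y_block) if a != b) / len(x_block)

# A length-8 block and a reconstruction that errs in two positions:
x = [0, 3, 1, 7, 2, 2, 5, 0]
y = [0, 3, 1, 0, 2, 2, 0, 0]
print(hamming_distortion(x, y))  # -> 0.25
```

Averaging this quantity over the product measure gives the expected distortion in (3).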
Definition 1 (Achievability)
Given an information source , we say that a rate is achievable for almostlossless encoding, i.e., with zero asymptotic distortion, if there exists a sequence of encoder and decoder mappings satisfying:
(5)  
(6) 
The minimum achievable rate is then defined as:
(7) 
Let denote the minimum achievable rate of a memoryless source driven by . The next theorem characterizes it provided that the source statistics are known.
Theorem 1 (Known statistics)
Given a memoryless source on a countably infinite alphabet driven by the probability measure , it follows that .
As expected, Shannon entropy characterizes the minimum achievable rate for the almost lossless source coding problem formulated in Definition 1. In order to prove this result, we adopt a result from Ho et al. [16] that provides a closed-form expression for the rate-distortion function of on countable alphabets, through a tight upper bound on the conditional entropy for a given minimal error probability [16, Theorem 1]. From this characterization, we show that , which is essential to establish the desired result. (This last result is well-known for finite alphabets; however, its extension to countably infinite alphabets is not straightforward due to the discontinuity of the entropy. The interested reader may refer to [18, 17] for further details.)
We consider the nontrivial case where has infinite support over , i.e., ; otherwise we reduce to the finite alphabet case, where this result is known [5, 4]. Let us assume that we have a lossy scheme such that
(8) 
If we denote by the reconstruction, from lossless variable-length source coding it is well-known that [4]:
(9)  
(10)  
(11) 
For the inequalities in (10), we use the fact that is a memoryless source, the nonnegativity of the conditional mutual information [5], and the convexity of the rate-distortion function of [21, 5], given by
(12) 
For countably infinite alphabets, Ho and Verdú [16, Theorem 1] have shown the following remarkable result: there exists such that
(13) 
with for all , , and the solution of the condition . For the rest, we assume that is organized in decreasing order, in the sense that . We note that for the characterization of , this assumption implies no loss of generality. Let us define
(14) 
then
(15) 
From (13) and (8), we focus on exploring when . First, it is simple to verify that implies by definition. Then, for a fixed with infinite support, the problem reduces to characterizing . Note that converges pointwise to as vanishes (in the countable alphabet case, pointwise convergence of the measure is equivalent to weak convergence and to convergence in total variation [22]). However, by the entropy discontinuity [18, 23, 17], the convergence of the measure to is not sufficient to guarantee that .
First, it is simple to note that , considering that for all , , and the dominated convergence theorem [24]. Then, , which is the limit of the first term on the RHS of (15). For the rest, we define the self-information function , for all , and . By definition, pointwise in , noting that and . Furthermore, there is such that for all and all (this follows from the fact that the function is monotonically increasing in the range of for some ), where, from the assumption that and the fact that , we have . Again by the dominated convergence theorem [24], and, consequently, from (15).
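The dominated-convergence step above can be checked numerically on a concrete finite-entropy source. The sketch below uses a geometric pmf (our illustrative choice, not the paper's construction): the entropy of the distribution conditioned on ever-larger finite supports converges to the full entropy of 2 bits, illustrating that the vanishing tail carries vanishing self-information mass.

```python
import math

# Geometric pmf p(k) = 2^{-(k+1)} on k = 0, 1, ..., whose Shannon entropy is
# exactly 2 bits.
def p(k):
    return 2.0 ** -(k + 1)

def entropy_bits(pmf):
    return -sum(q * math.log2(q) for q in pmf if q > 0)

def truncated_entropy(m):
    # Entropy of the distribution conditioned on the first m symbols.
    mass = sum(p(k) for k in range(m))          # P(X < m)
    return entropy_bits([p(k) / mass for k in range(m)])

for m in (4, 8, 16, 32):
    print(m, round(truncated_entropy(m), 6))    # increases toward 2.0
```

The same behavior fails for distributions whose tails carry non-vanishing self-information, which is exactly why the discontinuity of the entropy must be handled with care in the proof.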
Then returning to (11), it follows that for all ,
(16)  
(17) 
Finally, as ,
(18)  
(19)  
(20) 
which implies that .
The achievability part follows from the proof of Proposition 1.
II-B A Two-Stage Source Coding Scheme
In this section, we consider a two-stage source coding scheme that first applies a lossy (symbol-wise) reduction of the alphabet and, second, a variable-length lossless source code over the restricted alphabet. Let us define the finite set . We say that a two-stage lossy code of blocklength and size is the composition of: a lossy mapping of the alphabet, represented by a pair of functions , where and ; and a fixed-to-variable-length prefix-free pair of lossless encoder-decoder , where and .
Given a source and a lossy source encoder (for brevity, the decoding function will be considered implicit in the rest of the exposition), the lossy encoding of induced by is a two-stage process: first a quantization of size over is applied (letter by letter) to generate a finite-alphabet random sequence, and then a variable-length coding is applied to produce . Associated to the pair , there is an induced partition of given by:
(21) 
and a collection of prototypes (without loss of generality, we assume that ). The resulting distortion incurred by this code is given by
(22) 
where is a shorthand to denote . On the other hand, the coding rate is:
(23) 
with denoting .
At this point, it is worth mentioning some basic properties of the partitions induced by on .
Definition 2
A sequence of partitions of is said to be asymptotically sufficient with respect to , if
(24) 
where denotes the cell in that contains (more precisely, (24) implies that for all , ).
Consider now almost lossless coding for which we can state the following.
Lemma 1
Let be a memoryless information source driven by . A necessary and sufficient condition for to satisfy is that the partition sequence in (21) be asymptotically sufficient for .
The proof is relegated to Appendix A-A.
Studying the minimum achievable rate for zero-distortion coding requires the following definition.
Definition 3
For and a partition of , the entropy of restricted to the sigma-field induced by , denoted by , is given by
(25) 
A basic inequality [4, 5] shows that if , then for every . In particular, , where it is simple to show that with representing the collection of finite partitions of . Furthermore, it is possible to state the following result.
Lemma 2
If a sequence of partitions of is asymptotically sufficient with respect to , then
(26) 
The proof is relegated to Appendix A-B. This implies that if a two-stage scheme achieves zero distortion, then
(27) 
is asymptotically sufficient for (cf. Lemma 1). From the wellknown result in lossless variablelength source coding [5], we can conclude that:
(28)  
(29) 
and consequently, Lemma 2 implies that . Hence, letting be the minimum achievable rate w.r.t. the family of two-stage lossy schemes in Definition 1, we obtain that .
The next result shows that there is no additional overhead (in terms of bits per sample), if we restrict the problem to the family of twostage lossy schemes.
Proposition 1
For a memoryless source driven by ,
The proof is relegated to Appendix A-C.
III Universal Almost Lossless Source Coding
Consider a memoryless source on a countable alphabet with unknown distribution belonging to a family . The main question to address here is whether there exists a lossy coding scheme whose rate achieves the minimum feasible rate in Theorem 1 for every possible distribution in the family , while the distortion goes to zero as the blocklength tends to infinity, as defined below.
Definition 4 (Admissible codes)
A family of distributions is said to admit an almost lossless USC scheme if there is a lossy source code simultaneously satisfying:
(30) 
and
(31) 
An almost lossless universal code provides pointwise convergence of the distortion to zero for every , while constraining the worst-case expected redundancy to vanish as the blocklength goes to infinity. It is obvious from Definition 4 that if admits a classical lossless universal source code [8, 7], i.e., the worst-case average redundancy vanishes with zero distortion for every finite , then it admits an almost lossless USC. We next study whether there is a richer family of distributions that admits an almost lossless USC scheme.
III-A Feasibility Result
We need to introduce some preliminaries. Regarding the distortion, we need the following definition.
Definition 5
A sequence of partitions of is asymptotically sufficient for , if it is asymptotically sufficient for every measure (cf. Definition 2).
Concerning the analysis of the worst-case average redundancy in a lossy context, it is instrumental to introduce the divergence restricted to a sub-sigma-field [25]. Let be a partition of and its induced sigma-field. Then, for every , the divergence of with respect to restricted to is [25]:
(32) 
With this notation, we can introduce the information radius of restricted to a sub-sigma-field as follows:
(33) 
for all , where is the product partition of , is the set of probability measures on , and denotes the collection of all i.i.d. (product) probability measures on induced by .
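The restricted divergence in (32) evaluates the divergence between the measures coarsened to the cells of the partition; by data processing it never exceeds the full divergence, which is what makes the restricted information radius in (33) potentially finite where the unrestricted one is not. A small numerical sketch (the distributions and partition are illustrative):

```python
import math

def kl_bits(p, q):
    # KL divergence D(p||q) in bits between two pmfs given as dicts.
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

def coarsen(p, partition):
    # Project a pmf onto the sigma-field induced by a partition: each cell
    # receives the total mass of its members.
    return {i: sum(p.get(x, 0.0) for x in cell) for i, cell in enumerate(partition)}

p = {0: 0.5, 1: 0.25, 2: 0.125, 3: 0.125}
q = {0: 0.25, 1: 0.25, 2: 0.25, 3: 0.25}
partition = [{0, 1}, {2, 3}]

restricted = kl_bits(coarsen(p, partition), coarsen(q, partition))
full = kl_bits(p, q)
print(restricted <= full)  # -> True: coarsening can only lose divergence
```

Taking a supremum of the coarsened divergence over the family, and an infimum over reference measures, gives the restricted information radius of (33).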
Lemma 3
Let us consider . If there is a sequence of partitions of asymptotically sufficient for , and is , then the family of memoryless sources with marginal distribution in admits an almost lossless source coding scheme.
The argument is presented in Appendix A-D.
Theorem 2 (Feasibility)
The family admits an almost lossless source code.
Let us consider a collection of finite-size partitions with for all . We note that if
(34) 
then this partition scheme is asymptotically sufficient for . Concerning the information radius, we have that , which reduces the analysis to the finite alphabet case. In this context, it is well-known that [4, Theorem 7.5]:
(35) 
for some universal constants and . Then, provided that is , it follows that is . There are numerous finite partition sequences that satisfy the conditions stated in (34) while being . For example, the tail partition family given by , where and , considering that is and is . Finally, for all we have by definition that , which proves the result by applying Lemma 3.
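The tail-partition construction in the proof can be sketched numerically: singleton cells for the first k_n symbols and one catch-all tail cell, with k_n growing so that the tail mass (and hence the distortion) vanishes while the quantized alphabet stays small enough for the finite-alphabet bound in (35) to remain sublinear. The growth rule k_n = n below is purely illustrative.

```python
# Tail partition {{0}, ..., {k_n - 1}, {k_n, k_n + 1, ...}}: under the induced
# two-stage scheme, the expected per-letter distortion is at most the tail
# mass P(X >= k_n), which vanishes for any pmf once k_n -> infinity. The rule
# k_n = n is illustrative; the proof only needs k_n -> infinity slowly enough
# that the finite-alphabet redundancy bound stays sublinear in n.

def tail_mass(pmf, k):
    return 1.0 - sum(pmf(j) for j in range(k))

geometric = lambda j: 2.0 ** -(j + 1)  # a concrete finite-entropy distribution

for n in (4, 16, 64):
    print(n, tail_mass(geometric, n))  # upper bound on the distortion
```

For each fixed distribution the tail mass is eventually summably small, which is exactly the pointwise (but not uniform) convergence that Definition 4 demands.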
Theorem 2 shows that a weak notion of universality makes it possible to code the complete collection of finite-entropy memoryless sources defined on countable alphabets. Since the same result for lossless source coding is not possible [6], an interpretation of Theorem 2 is that a nonzero distortion (for any finite blocklength) is the price to pay for making the average redundancy of a universal coding scheme vanish with the blocklength. To obtain this result, the two-stage approach presented in Section II-B was considered.
If we restrict the family of two-stage schemes to have an exhaustive first-stage mapping, i.e., for all and , then the approach reduces to lossless source coding (i.e., zero distortion for every finite blocklength). In this case, the condition in Lemma 3 reduces to verifying that
(36) 
is . This is equivalent to the information radius condition needed for a family of distributions to have a nontrivial minimax redundancy rate [7, 8, 6, 9, 4].
III-B Uniform Convergence of the Distortion
We now focus on a stronger notion of universal weak source coding, where we study whether it is possible to achieve a uniform convergence of the distortion to zero (over the entire family ), instead of the pointwise convergence defined in expression (30). To this end, we restrict the analysis to the rich family of envelope distributions studied in [9] for the lossless coding problem. Given a nonnegative function , the envelope family indexed by is given by (it is worth mentioning the work of Boucheron et al. [9], which studied the information radius of this family in the lossless case and established a necessary and sufficient condition for strong minimax universality [9, Theorems 3 and 4]):
(37) 
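The dichotomy stated below turns on whether the envelope is summable. For a power-law envelope f(k) = k^{-alpha} (a standard example in the envelope-class literature; the exponents here are our illustrative choices), the partial sums converge exactly when alpha > 1:

```python
# Partial sums of a power-law envelope f(k) = k^{-alpha}: bounded for
# alpha > 1 (summable envelope, the positive case of Theorem 3 below) and
# divergent for alpha <= 1 (the negative case).
def partial_sum(alpha, m):
    return sum(k ** -alpha for k in range(1, m + 1))

for alpha in (2.0, 1.0):
    print(alpha, [round(partial_sum(alpha, m), 3) for m in (10, 100, 1000)])
# alpha = 2.0 stays below pi^2/6 ~ 1.645; alpha = 1.0 grows without bound.
```

Summability of the envelope is thus a tail-decay condition on the whole family at once, which is what makes uniform control of the distortion possible.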
We can state the following dichotomy:
Theorem 3 (Uniform convergence)
Let us consider the family of envelope distributions .

If , then there is a two-stage coding scheme with (finite size) such that
(38) (39) 
Otherwise, i.e., if , then for any two-stage code of length with it follows that
(40) while if , then
(41) More generally, for a lossy code of length , provided that
(42) then
In summary, if the envelope function is summable, there is a two-stage coding scheme of finite size that offers a uniform convergence of the distortion to zero, while ensuring that the worst-case average redundancy vanishes with the blocklength. On the negative side, for every memoryless family of sources indexed by a non-summable envelope function, it is not possible to achieve a uniform convergence of the distortion to zero with the blocklength while relying on a finite-size two-stage coding rule. An infinite-size rule is indeed needed, i.e., , eventually with , which on the downside has an infinite information radius from Lemmas 4 and 5. Remarkably, this impossibility result remains when enriching the analysis with the adoption of general lossy coding rules (details in Section III-C2).
Finally, it is worth noting that the family with has finite regret and redundancy in the context of lossless source coding [9, Theorems 3 and 4]. Furthermore, summability is the necessary and sufficient condition on that makes this collection strongly minimax universal in lossless source coding [9]. Then, under this strong almost lossless source coding principle (with a uniform convergence to zero of the distortion and the redundancy), we match the necessary and sufficient condition for lossless universal source coding, and it is not possible to code (universally) a richer family of distributions when restricting the analysis to envelope families.
III-C Proof of Theorem 3
If , the fact that imposes a uniform bound on the tails of the distributions suggests that a family of tail-truncating partitions should be considered to achieve claim (i). Let us define
(43) 
which resolves the elements of and consequently, there is a pair associated with such that ,
(44) 
In fact, , and being is a sufficient condition to satisfy the uniform convergence of the distortion to zero. Furthermore, from the proof of Lemma 3 (Eq.(122)) and (35), there is a lossless coding scheme such that . Therefore, we can consider being with to conclude the achievability part.
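The achievability step can be illustrated concretely: since every distribution in the envelope family has tails dominated by the envelope's tail sum, truncating at a cutoff where that tail sum drops below a target eps bounds the distortion uniformly over the whole family at once. The geometric envelope below is an illustrative choice.

```python
# For a summable envelope f, every member of the family satisfies
# P(X >= k) <= sum_{j >= k} f(j), so a single cutoff controls the distortion
# of the tail-truncating partition uniformly over the family. The envelope
# f(j) = 2^{1-j} is illustrative; `horizon` numerically truncates the
# infinite tail sum.

def envelope_tail(f, k, horizon=10_000):
    return sum(f(j) for j in range(k, horizon))

def cutoff_for(f, eps):
    # Smallest cutoff whose envelope tail sum falls below eps.
    k = 1
    while envelope_tail(f, k) > eps:
        k += 1
    return k

f = lambda j: 2.0 ** (1 - j)
for eps in (0.1, 0.01, 0.001):
    print(eps, cutoff_for(f, eps))  # cutoffs 6, 9, 12: grows like log(1/eps)
```

For a non-summable envelope no such cutoff exists, which is the intuition behind the converse parts that follow.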
III-C1 Converse for Two-Stage Lossy Coding Schemes
(We first present this preliminary converse result, as it provides the ground to explore redundancy gains in Section IV.) For the converse part, let us first consider an arbitrary two-stage lossy rule with a finite partition . If we denote its prototypes by , it is clear that there exists such that , and consequently for all . Therefore, for any finite-size partition rule it follows that for all and, hence, when , no uniform convergence of the distortion can be achieved with a finite-size lossy rule.
On the other hand, for the family of infinite-size rules, we focus our analysis on in (33). Let us fix a blocklength and a rule of infinite size. For the sake of clarity, we consider that , where is a countably infinite alphabet. For any , denotes the measure induced on by the mapping through the standard construction: for all (there is no question about the measurability of , as we consider to be the power set). In addition, it is simple to verify by definition that for any pair
(45)  
(46) 
where and denotes the pmf of on . Then,
(47)  
(48) 
where denotes the product probability on with marginal , and is the collection of probability measures on . In other words, the information radius of restricted to the product sub-sigma-field is equivalent to the information radius of in the full measurable space. At this point, we can state the following:
Lemma 4
For any nonnegative envelope function , there is given by
(49) 
such that .
The proof is relegated to Appendix A-E.
In other words, envelope families on map to envelope families on through the mapping . Based on this observation, we can apply the following result by Boucheron et al. [9]:
Lemma 5
[9, Theorems 3 and 4] For a family of envelope distributions over a countable alphabet, if is summable, then its information radius for all , and it scales as with the blocklength. Otherwise, for all .
Then, from (48), the problem reduces to characterizing the information radius of on , which from Lemma 4 is an envelope family with the envelope function given by (49). Given that implies , from Lemma 5 we have that . Finally, since the information radius tightly bounds the least worst-case expected redundancy of the second (lossless) coding stage, e.g., see (121) and (122), this implies that:
(50) 
which concludes the argument.
III-C2 Converse for General Variable-Length Lossy Codes
Let us consider a lossy code of length . Without loss of generality, we can decouple it as the composition of a vector quantizer , where is an index set, and a prefix-free lossless mapping , where for all . From this, we characterize the vector quantization induced by as follows:
(51) 
Using this two-stage (vector quantization plus coding) view, we have the following result:
Proposition 2
Let us consider a lossy code and a family of distributions . If we denote by the probability in induced by (the fold distributions with marginal in ) and , by , then
(52)  
(53)  
(54)  
(55) 
The proof is relegated to Appendix A-F.
In our context, this result shows that the worst-case overhead is lower bounded by the information radius of the fold family projected onto the sub-sigma-field induced by , i.e., a quantization of .
Considering that , we follow the construction presented in [9], which shows that there is an infinite collection of distributions with , where if we denote by
then for each and for any , . In this context, for each is the support of .
At this point, let us use the assumption that:
This implies that . From the fact that is an infinite collection of probabilities with disjoint supports, and from the definition of the distortion, it is simple to verify that we need to allocate at least one prototype (the prototypes of being the set