Universal Weak Variable-Length Source Coding
on Countable Infinite Alphabets
Motivated by the fact that universal source coding on countably infinite alphabets is not feasible, this work introduces the notion of “almost lossless source coding”. Analogous to the weak variable-length source coding problem studied by Han, almost lossless source coding relaxes the lossless block-wise assumption to allow an average per-letter distortion that vanishes asymptotically as the block-length goes to infinity. In this setup, we show on the one hand that Shannon entropy characterizes the minimum achievable rate (similarly to the case of discrete sources), and on the other hand that almost lossless universal source coding becomes feasible for the family of finite-entropy stationary memoryless sources with countably infinite alphabets. Furthermore, we study a stronger notion of almost lossless universality that demands uniform convergence of the average per-letter distortion to zero, where we establish a necessary and sufficient condition for the so-called family of “envelope distributions” to achieve it. Remarkably, this condition is the same necessary and sufficient condition needed for the existence of a strongly minimax (lossless) universal source code for the family of envelope distributions. Finally, we show that an almost lossless coding scheme offers a faster rate of convergence for the (minimax) redundancy compared to the well-known information radius developed for the lossless case, at the expense of tolerating a non-zero distortion that vanishes as the block-length grows. This shows that even when lossless universality is feasible, an almost lossless scheme can offer different regimes on the rates of convergence of the (worst-case) redundancy versus the (worst-case) distortion.
Universal source coding, countably infinite alphabets, weak source coding, envelope distributions, information radius, metric entropy analysis.
The problem of Universal Source Coding (USC) has a long history in information theory [4, 5, 6, 7, 8]. This topic started with the seminal work of Davisson, which formalizes variable-length lossless coding and introduces relevant information quantities (mutual information and channel capacity). In lossless variable-length source coding, it is well known that if the statistics of a source (memoryless or stationary and ergodic) are known, the Shannon entropy (or Shannon entropy rate) provides the minimum achievable rate. However, when the statistics of the source are not known but the source belongs to a family of distributions, the problem reduces to characterizing the worst-case expected overhead (or minimax redundancy) that an encoder-decoder pair experiences due to the lack of knowledge about the true distribution governing the source samples to be encoded.
A seminal information-theoretic result states that the least worst-case overhead (or minimax redundancy of the family) is fully characterized by the information radius of the family. The information radius has been richly studied in numerous contributions [9, 10, 11, 12, 13], including in universal prediction of individual sequences. In particular, it is well known that the information radius grows sub-linearly for the family of finite alphabet stationary and memoryless sources, which implies the existence of a universal source code that achieves the Shannon entropy for every distribution in this family provided that the block length goes to infinity. What is intriguing about this positive result for finite alphabet memoryless sources is that it no longer extends to the case of memoryless sources on countably infinite alphabets, as was clearly shown in [8, 6, 9]. From an information complexity perspective, this infeasibility result implies that the information radius of this family is unbounded for any finite block-length and, consequently, lossless universal source coding for infinite alphabet memoryless sources is an intractable problem. In this regard, the proof presented by Györfi et al. [6, Theorem 1] is built on a connection between variable-length prefix-free codes and distribution estimators, and on the fact that the redundancy of a given code upper bounds the expected divergence between the true distribution and the estimate of the distribution induced by the code. Then, the existence of a universal source code implies the existence of a universal estimator in the sense of expected direct information divergence. The impossibility of achieving this learning objective for the family of finite entropy memoryless sources [6, Theorem 2] motivates the main question addressed in this work, that is, the study of a “weak notion” of universal variable-length source coding.
In this framework, we propose to address the problem of universal source coding for infinite alphabet memoryless sources by studying a weaker (lossy) notion of coding instead of the classical lossless definition [4, 5]. This notion borrows ideas from the seminal work by Han, which allows reconstruction errors but assumes known statistics. In this paper, we investigate the idea of relaxing the lossless block-wise assumption with the goal that the corresponding weak universal source coding formulation reduces to a learning criterion that becomes feasible for the whole family of finite entropy memoryless sources on countably infinite alphabets. In particular, we move from lossless coding to an asymptotically vanishing distortion criterion based on the Hamming distance as a fidelity metric.
Assuming that the distribution of the source is known, we first introduce the problem of “almost lossless source coding” for memoryless sources defined on countably infinite alphabets. Theorem 1 shows that Shannon entropy characterizes the minimum achievable rate for this problem. The proof of this theorem adopts a result from Ho et al. that provides a closed-form expression for the rate-distortion function on countable alphabets. From this characterization, we show that the rate-distortion function converges to the Shannon entropy as the distortion vanishes, which is essential to prove this result. (This result is well known for finite alphabets; however, the extension to countably infinite alphabets is not straightforward due to the discontinuity of the entropy [17, 18].)
Then, we address the problem of almost lossless universal source coding. The main difficulty arises in finding a lossy coding scheme that achieves asymptotically zero distortion, i.e., point-wise over the family, while guaranteeing that the worst-case average redundancy (with respect to the minimum achievable rate) vanishes with the block-length. The proof of existence of a universal code with the desired property relies on a two-stage coding scheme that first quantizes (symbol-by-symbol) the countable alphabet and then applies a lossless variable-length code over the resulting quantized symbols. Our main result, stated in Theorem 2, shows that almost lossless universal source coding is feasible for the family of finite entropy stationary and memoryless sources.
We further study the possibility of obtaining rates of convergence for the worst-case distortion and the worst-case redundancy. To this end, we restrict our analysis to the family of memoryless distributions with densities dominated by an envelope function, which was previously studied in [9, 10, 19]. Theorem 3 presents a necessary and sufficient condition on the envelope function to achieve a uniform convergence (over the family) of the distortion to zero and, simultaneously, a vanishing worst-case average redundancy. Remarkably, this condition (the envelope function being summable) is the same necessary and sufficient condition needed for the existence of a strongly minimax (lossless) universal source code [9, Theorems 3 and 4].
Finally, we provide an analysis of the potential benefit of an almost lossless two-stage coding scheme by exploring the family of envelope distributions that admits strong minimax universality in lossless source coding. In this context, Theorem 4 shows that an almost lossless approach can offer a non-trivial improvement in the rate of convergence of the worst-case redundancy, with respect to the well-known information radius developed for the lossless case, at the expense of tolerating a non-zero distortion that vanishes with the block-length. This result provides evidence that even in the case where lossless universality is feasible, an almost lossless scheme can reduce the rate of the worst-case redundancy and, consequently, offers ways of achieving different regimes for the rate of convergence of the redundancy versus the distortion. The proof of this result uses advanced tools by Haussler and Opper to relate the minimax redundancy of a family of distributions to its metric entropy with respect to the Hellinger distance. Indeed, this metric entropy approach has been shown to be instrumental in deriving tight bounds on the information radius for summable envelope distributions. We extend this metric entropy approach to our almost lossless coding setting with a two-stage coding scheme to characterize the precise regime in which we can achieve gains in the rate of convergence of the redundancy.
The rest of the paper is organized as follows. Section II introduces the basic elements of the weak source coding problem and shows that Shannon entropy is the minimum achievable rate provided that the statistics of the source are known. Section III presents the problem of almost lossless universal source coding and proves its feasibility for the family of finite entropy memoryless distributions on infinite alphabets. Section III-B develops a result for a stronger notion of almost lossless universality. Finally, Section IV studies the gains in the rate of convergence of the redundancy that can be obtained with an almost lossless scheme for families of distributions that admit lossless USC. Section V presents a summary of the work, and auxiliary results are presented in the Appendix.
II Almost Lossless Source Coding
We begin by introducing some useful concepts and notation that will be needed throughout the paper. Let be a memoryless process (source) with values in a countable infinite alphabet equipped with a probability measure defined on the measurable space (where the sigma-field is the power set of the alphabet). Let denote a finite block of length of the process following the product measure on (the product measure satisfies the memoryless condition, i.e., the probability of a block is the product of the marginal probabilities of its letters). Let us denote by the family of probability measures in , where for every , we understand to be a short-hand for its probability mass function (pmf). Let and let denote the collection of finite Shannon entropy probabilities where
with the logarithm taken in base 2.
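As a concrete illustration of a finite-entropy distribution on a countably infinite alphabet, the following minimal Python sketch evaluates the base-2 Shannon entropy of the dyadic pmf that assigns mass 2^{-x} to each positive integer x. The pmf, the function names, and the truncation at 60 terms are assumptions made for this example and are not taken from the paper; for this particular pmf the exact entropy is 2 bits.

```python
import math


def shannon_entropy_bits(probabilities):
    """Base-2 Shannon entropy of an iterable of probabilities (zero terms are skipped)."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0.0)


def dyadic_pmf(num_terms):
    """Illustrative pmf f(x) = 2**(-x) on x = 1, 2, ..., truncated to num_terms letters."""
    return [2.0 ** (-x) for x in range(1, num_terms + 1)]


# The truncated sum omits a tail whose contribution is below 1e-15 bits.
entropy_estimate = shannon_entropy_bits(dyadic_pmf(60))
print(f"entropy of the dyadic pmf (bits): {entropy_estimate:.12f}")
```

The truncation is harmless here because the tail of this pmf decays geometrically; for heavy-tailed pmfs the partial sums can converge slowly or diverge, which is the regime where the finite-entropy assumption matters.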
II-A Almost Lossless Source Coding with Known Statistics
We now introduce the notion of a variable-length lossy coding of source samples, which consists of a pair where is a prefix-free variable-length code (encoder) and is the inverse mapping from bits to source symbols (decoder). Inspired by the weak coding setting introduced by Han, the possibility that the decoded block differs from the source block is allowed. In order to quantify the loss induced by the encoding process, a per-letter distortion measure is considered, where for the distortion is given by
Given an information source , the average distortion induced by the pair is
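We will focus on the 0-1 single-letter distortion, i.e., the indicator that a reconstructed letter differs from the source letter. Then, the block distortion is the normalized Hamming distance between the source and reconstructed sequences. On the other hand, the rate of the pair (in bits per sample) is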
where indicates the functional that returns the length of binary sequences in . At this stage, we can introduce the almost-lossless source coding problem and with this, the standard notion of minimum achievable rate.
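The two figures of merit introduced above can be sketched for a single realization as follows. This is a minimal illustration, assuming blocks are given as Python lists and codewords as strings of '0' and '1'; the function names are hypothetical, and the quantities in the text are the expectations of these per-block values under the source distribution.

```python
def normalized_hamming(source_block, reconstructed_block):
    """Fraction of positions at which the two blocks differ."""
    if len(source_block) != len(reconstructed_block):
        raise ValueError("blocks must have the same length")
    if not source_block:
        raise ValueError("blocks must be non-empty")
    mismatches = sum(a != b for a, b in zip(source_block, reconstructed_block))
    return mismatches / len(source_block)


def bits_per_sample(codeword, block_length):
    """Length of a binary codeword divided by the number of source letters it encodes."""
    return len(codeword) / block_length


# One letter out of five is reconstructed incorrectly; the codeword uses 7 bits.
distortion = normalized_hamming([3, 1, 4, 1, 5], [3, 1, 4, 2, 5])
rate = bits_per_sample("0110100", 5)
print(distortion, rate)
```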
Definition 1 (Achievability)
Given an information source , we say that a rate is achievable for almost-lossless encoding, i.e., with zero asymptotic distortion, if there exists a sequence of encoder and decoder mappings satisfying:
The minimum achievable rate is then defined as:
Let denote the minimum achievable rate of a memoryless source driven by . The next theorem characterizes provided that the source statistics are known.
Theorem 1 (Known statistics)
Given a memoryless source on a countable infinite alphabet driven by the probability measure , it follows that the minimum achievable rate equals the Shannon entropy of this measure.
As expected, Shannon entropy characterizes the minimum achievable rate for the almost lossless source coding problem formulated in Definition 1. In order to prove this result, we adopt a result from Ho et al. that provides a closed-form expression for the rate-distortion function of the source on countable alphabets, through a tight upper bound on the conditional entropy for a given minimal error probability [16, Theorem 1]. From this characterization, we show that the rate-distortion function converges to the Shannon entropy as the distortion vanishes, which is essential to show the desired result. (This last result is well known for finite alphabets; however, its extension to countably infinite alphabets is not straightforward due to the discontinuity of the entropy. The interested reader may refer to [18, 17] for further details.)
We consider the non-trivial case where has infinite support over , i.e., , otherwise we reduce to the finite alphabet case where this result is known [5, 4]. Let us assume that we have a lossy scheme such that
If we denote by the reconstruction, from lossless variable length source coding it is well-known that :
For countable infinite alphabets, Ho and Verdú [16, Theorem 1] have shown the following remarkable result: there exists such that
with for all , , and solution of the condition . For the rest we assume that is organized in decreasing order in the sense that . We note that for the characterization of as this assumption implies no loss of generality. Let us define
From (13) and (8), we focus on exploring when . First, it is simple to verify that implies that by definition. Then, for a fixed with infinite support, the problem reduces to characterizing . Note that converges point-wise to as vanishes. (In the countable alphabet case, point-wise convergence of the measure is equivalent to weak convergence and to convergence in total variation.) However, by the entropy discontinuity [18, 23, 17], the convergence of the measure to is not sufficient to guarantee that .
First, it is simple to note that , as considering that for all , , and the dominated convergence theorem . Then, , which is the limit of the first term in the RHS of (15). For the rest, we define the self-information function , for all , and . By definition point-wise in , noting that and . Furthermore, there is such that for all , for all (this follows from the fact that the function is monotonically increasing in the range of for some ), where from the assumption that , and the fact that , then . Again by the dominated convergence theorem , and consequently, from (15).
Then returning to (11), it follows that for all ,
Finally, as ,
which implies that .
The achievability part follows from the proof of Proposition 1.
II-B A Two-Stage Source Coding Scheme
In this section, we consider a two-stage source coding scheme that first applies a lossy (symbol-wise) reduction of the alphabet, and second a variable-length lossless source code over the restricted alphabet. Let us define the finite set . We say that a two-stage lossy code of block-length and size is the composition of: a lossy mapping of the alphabet, represented by a pair of functions , where and , and a fixed to variable-length prefix-free pair of lossless encoder-decoder , where and .
Given a source and an -lossy source encoder (for brevity, the decoding function will be considered implicit in the rest of the exposition), the lossy encoding of induced by is a two-stage process where first a quantization of size over is made (letter-by-letter) to generate a finite alphabet random sequence and then a variable-length coding is applied to produce . Associated with the pair , there is an induced partition of given by:
and a collection of prototypes (without loss of generality, we assume that ). The resulting distortion incurred by this code is given by
where is a short-hand to denote . On the other hand, the coding rate is:
with denoting .
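The two-stage construction can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration: the alphabet is taken to be the positive integers, the source is a geometric pmf, the first stage is a tail quantizer that keeps the letters 1..k and merges every larger letter into one cell with prototype k + 1, and the second stage is summarized by the entropy of the quantized letter (the benchmark rate of an ideal lossless code) rather than by an explicit prefix-free code.

```python
import math


def geometric_pmf(x, q=0.5):
    """Illustrative source pmf on {1, 2, ...}: f(x) = (1 - q) * q**(x - 1)."""
    return (1.0 - q) * q ** (x - 1)


def tail_quantizer(x, k):
    """First stage (encoder side): keep 1..k, send every larger letter to index k + 1."""
    return x if x <= k else k + 1


def prototype(index):
    """First stage (decoder side): index i is reconstructed as the letter i."""
    return index


def expected_distortion(k, pmf, support_cutoff=200):
    """Per-letter Hamming distortion: probability that the prototype differs from the letter."""
    return sum(pmf(x) for x in range(1, support_cutoff + 1)
               if prototype(tail_quantizer(x, k)) != x)


def quantized_entropy_bits(k, pmf, support_cutoff=200):
    """Entropy of the quantized letter, a benchmark for the second-stage lossless rate."""
    cell_mass = [pmf(x) for x in range(1, k + 1)]
    cell_mass.append(sum(pmf(x) for x in range(k + 1, support_cutoff + 1)))
    return -sum(p * math.log2(p) for p in cell_mass if p > 0.0)


distortions = [expected_distortion(k, geometric_pmf) for k in (2, 4, 8)]
rates = [quantized_entropy_bits(k, geometric_pmf) for k in (2, 4, 8)]
for k, d, r in zip((2, 4, 8), distortions, rates):
    print(f"k={k}: distortion={d:.6f}, quantized entropy={r:.6f} bits")
```

In this example, increasing k lowers the distortion while the benchmark rate increases toward the entropy of the source, which is the trade-off the two-stage scheme is designed to control.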
At this point, it is worth mentioning some basic properties on the partitions induced by on .
A sequence of partitions of is said to be asymptotically sufficient with respect to , if
where denotes the cell in that contains . (More precisely, (24) implies that for all , .)
Consider now almost lossless coding for which we can state the following.
Let be a memoryless information source driven by . A necessary and sufficient condition for to have that is the partition sequence in (21) to be asymptotically sufficient for .
The proof is relegated to Appendix A-A.
Studying the minimum achievable rate for zero-distortion coding requires the following definition.
For and a partition of , the entropy of restricted to the sigma-field induced by , which is denoted by , is given by
A basic inequality [4, 5] shows that if , then for every . In particular, , where it is simple to show that with representing the collection of finite partitions of . Furthermore, it is possible to state the following result.
If a sequence of partitions of is asymptotically sufficient with respect to , then
The proof is relegated to Appendix A-B. This implies that if a two-stage scheme achieves zero distortion, then
The next result shows that there is no additional overhead (in terms of bits per sample), if we restrict the problem to the family of two-stage lossy schemes.
For a memoryless source driven by ,
The proof is relegated to Appendix A-C.
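The behaviour of the entropy restricted to a sequence of increasingly fine partitions can be checked numerically. The sketch below is an illustration under assumptions that are not part of the paper: the pmf assigns mass 2^{-x} to each positive integer x (entropy 2 bits), and the partitions are tail partitions made of the singletons {1}, ..., {k} plus one tail cell. Under these assumptions the restricted entropy equals 2 - 2^{1-k} bits, which increases with k and converges to the entropy of the source.

```python
import math


def dyadic_pmf(x):
    """Illustrative pmf f(x) = 2**(-x) on the positive integers."""
    return 2.0 ** (-x)


def tail_partition_entropy_bits(k, pmf, support_cutoff=200):
    """Entropy of the partition {{1}, ..., {k}, {k+1, k+2, ...}} under pmf."""
    cell_mass = [pmf(x) for x in range(1, k + 1)]
    cell_mass.append(sum(pmf(x) for x in range(k + 1, support_cutoff + 1)))
    return -sum(p * math.log2(p) for p in cell_mass if p > 0.0)


levels = (1, 2, 4, 8, 16)
restricted = [tail_partition_entropy_bits(k, dyadic_pmf) for k in levels]
for k, h in zip(levels, restricted):
    print(f"k={k}: restricted entropy={h:.8f} bits")
```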
III Universal Almost Lossless Source Coding
Consider a memoryless source on a countable alphabet with unknown distribution but belonging to a family . The main question to address here is whether there exists a lossy coding scheme whose rate achieves the minimum feasible rate in Theorem 1, for every possible distribution in the family , while the distortion goes to zero as the block-length tends to infinity, as defined below.
Definition 4 (Admissible codes)
A family of distributions is said to admit an almost lossless USC scheme if there is a lossy source code simultaneously satisfying:
An almost lossless universal code provides a point-wise convergence of the distortion to zero for every while constraining the worst-case expected redundancy to vanish as the block length goes to infinity. It is obvious from Definition 4 that if admits a classical lossless universal source code [8, 7], i.e., the worst-case average redundancy vanishes with zero distortion for every finite , then it admits an almost lossless USC. We next study if there is a richer family of distributions that admits an almost lossless USC scheme.
III-A Feasibility Result
We need to introduce some preliminaries. Regarding the distortion, we need the following definition.
A sequence of partitions of is asymptotically sufficient for , if it is asymptotically sufficient for every measure (cf. Definition 2).
Concerning the analysis of the worst-case average redundancy in a lossy context, it is instrumental to introduce the divergence restricted to a sub-sigma field . Let be a partition of and its induced sigma-field. Then, for every , the divergence of with respect to restricted to is :
With this notation, we can introduce the information radius of restricted to a sub-sigma field as follows:
for all , where is the product partition of , is the set of probability measures in , and denotes the collection of all i.i.d. (product) probability measures in induced by .
Let us consider . If there is a sequence of partitions of asymptotically sufficient for , and is , then the family of memoryless sources with marginal distribution in admits an almost lossless source coding scheme.
The argument is presented in Appendix A-D.
Theorem 2 (Feasibility)
The family admits an almost lossless source code.
Let us consider a collection of finite size partitions with for all . We note that if
then, this partition scheme is asymptotically sufficient for . Concerning the information radius, we have that , which reduces the analysis to the finite alphabet case. In this context, it is well-known that [4, Theorem 7.5]:
for some universal constants and . Then, provided that is it follows that is . There are numerous finite partition sequences that satisfy the conditions stated in (34) and being . For example, the tail partition family given by , where and , considering that is and is . Finally, for all we have by definition that , which proves the result by applying Lemma 3.
Theorem 2 shows that a weak notion of universality allows coding the complete collection of finite entropy memoryless sources defined on countable alphabets. Since the same result is not possible for lossless source coding, an interpretation of Theorem 2 is that a non-zero distortion (for any finite block-length) is the price to pay for making the average redundancy of a universal coding scheme vanish with the block-length. To obtain this result, the two-stage approach presented in Section II-B was considered.
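The interplay between the size of the quantized alphabet and the block-length in this feasibility argument can be sketched numerically. The constants of the bound used in the proof are not reproduced here; as a stand-in, the sketch uses the standard method-of-types bound, under which a two-part lossless code on an alphabet of k letters has redundancy at most (k - 1) log2(n + 1) + 2 bits on blocks of length n. The growth rule k_n = floor(sqrt(n)) is a hypothetical choice; any rule with k_n tending to infinity and k_n log n growing slower than n would serve the same purpose.

```python
import math


def alphabet_size(n):
    """Hypothetical growth rule for the number of quantization cells: k_n = floor(sqrt(n))."""
    return max(2, math.isqrt(n))


def per_letter_redundancy_bound(n):
    """Method-of-types bound ((k_n - 1) * log2(n + 1) + 2) / n, in bits per letter."""
    k = alphabet_size(n)
    return ((k - 1) * math.log2(n + 1) + 2) / n


block_lengths = (10**2, 10**4, 10**6)
bounds = [per_letter_redundancy_bound(n) for n in block_lengths]
for n, b in zip(block_lengths, bounds):
    print(f"n={n}: k_n={alphabet_size(n)}, per-letter redundancy bound={b:.5f}")
```

With this rule the number of cells grows without bound, so the tail partition eventually isolates every letter, while the per-letter redundancy bound still decreases to zero.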
If we restrict the family of two-stage schemes to have an exhaustive first-stage mapping, i.e., for all and , then we reduce the approach to the lossless source coding (i.e., zero distortion for every finite block-length). In this case, the condition in Lemma 3 reduces to verify that
III-B Uniform Convergence of the Distortion
We further focus on a stronger notion of universal weak source coding. Here we study whether it is possible to achieve a uniform convergence of the distortion to zero (over the entire family ), instead of the point-wise convergence defined in expression (30). To this end, we restrict the analysis to the rich family of envelope distributions previously studied for the lossless coding problem. Given a non-negative function , the envelope family indexed by is given by (it is worth mentioning the work of Boucheron et al., who studied the information radius of this family in the lossless case and established a necessary and sufficient condition for strong minimax universality [9, Theorems 3 and 4]):
We can state the following dichotomy:
Theorem 3 (Uniform convergence)
Let us consider the family of envelope distributions .
If , then there is a two-stage coding scheme with (finite size) such that
Otherwise, i.e., , for any two-stage code of length with it follows that
while if , then
More generally, for a lossy code of length , provided that
In summary, if the envelope function is summable, there is a two-stage coding scheme of finite size that offers a uniform convergence of the distortion to zero, while ensuring that the worst-case average redundancy vanishes with the block-length. On the negative side, for every family of memoryless sources indexed by a non-summable envelope function, it is not possible to achieve a uniform convergence of the distortion to zero with the block-length using a two-stage coding rule of finite size. An infinite size rule is indeed needed, i.e., , eventually with , which on the downside has an infinite information radius by Lemmas 4 and 5. Remarkably, this impossibility result remains valid when the analysis is enriched with general lossy coding rules (details in Sec. III-C2).
Finally, it is worth noting that the family with has a finite regret and redundancy in the context of lossless source coding [9, Ths. 3 and 4]. Furthermore, summability is the necessary and sufficient condition on that makes this collection strongly minimax universal in lossless source coding. Therefore, this strong almost lossless source coding principle (with a uniform convergence to zero of both the distortion and the redundancy) matches the necessary and sufficient condition for lossless universal source coding, and it is not possible to code (universally) a richer family of distributions when restricting the analysis to envelope families.
III-C Proof of Theorem 3
If , the fact that has a uniform bound on the tails of the distributions suggests that a family of tail truncating partitions should be considered to achieve the claim (i). Let us define
which resolves the elements of and consequently, there is a pair associated with such that ,
In fact, , and being is a sufficient condition to satisfy the uniform convergence of the distortion to zero. Furthermore, from the proof of Lemma 3 (Eq.(122)) and (35), there is a lossless coding scheme such that . Therefore, we can consider being with to conclude the achievability part.
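The uniform control of the distortion by the tail of the envelope can be made concrete with a short sketch. It assumes, for illustration only, a power-law envelope f(x) = x^{-a} on the positive integers and a tail truncating partition that reconstructs the letters 1..k exactly; for any distribution dominated by this envelope the distortion is then at most the tail sum of f beyond k, which the integral test bounds by k^{1-a}/(a-1) when a > 1. The exponent, the truncation rule, and the function names are hypothetical.

```python
import math


def envelope_tail_bound(k, exponent=2.0):
    """Bound on sum_{x > k} x**(-exponent) via the integral of t**(-exponent) over [k, inf)."""
    if exponent <= 1.0:
        # Non-summable envelope: the tail sum diverges, only the trivial bound 1 remains.
        return 1.0
    return min(1.0, k ** (1.0 - exponent) / (exponent - 1.0))


def truncation_level(n):
    """Hypothetical truncation rule k_n = floor(sqrt(n)), growing with the block-length n."""
    return max(1, math.isqrt(n))


block_lengths = (10**2, 10**4, 10**6)
worst_case_distortion = [envelope_tail_bound(truncation_level(n)) for n in block_lengths]
for n, d in zip(block_lengths, worst_case_distortion):
    print(f"n={n}: k_n={truncation_level(n)}, worst-case distortion bound={d:.4f}")
```

The bound does not depend on the particular distribution in the family, which is what makes the convergence uniform; for an exponent of at most 1 the sketch returns only the trivial bound, in line with the dichotomy of Theorem 3.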
III-C1 Converse for Two-Stage Lossy Coding Schemes
(We first present this preliminary converse result, as it provides the ground to explore redundancy gains in Section IV.)
For the converse part, let us first consider an arbitrary two-stage lossy rule with a finite partition . If we denote its prototypes by , it is clear that there exists such that , and consequently, for all . Therefore, for any finite size partition rule it follows that for all and hence, when no uniform convergence on the distortion can be achieved with a finite size lossy rule.
On the other hand, for the family of infinite size rules, we focus our analysis on in (33). Let us fix a block-length and a rule of infinite size. For the sake of clarity, we consider that , where is a countable infinite alphabet. For any , denotes the induced measure in by the mapping through the standard construction (there is no question about the measurability of , as we consider that is the power set): for all . In addition, it is simple to verify by definition that for any pair
where and denotes the pmf of on . Then,
where denotes the product probability on with marginal and is the collection of probability measures on . In other words, the information radius of restricted to the product sub-sigma field is equivalent to the information radius of in the full measurable space. At this point, we can state the following:
For any non-negative envelope function , there is given by
such that .
The proof is relegated to Appendix A-E.
In other words, envelope families on map to envelope families on through the mapping . Based on this observation, we can apply the following result by Boucheron et al. :
[9, Theorems 3 and 4] For a family of envelope distribution over a countable alphabet, if is summable, then its information radius for all , and it scales as with the block-length. Otherwise, for all .
Then from (48), the problem reduces to characterizing the information radius of on , which from Lemma 4 is an envelope family with envelope function given by (49). Given that implies that , from Lemma 5 we have that . Finally, since the information radius tightly bounds the least worst-case expected redundancy for the second lossless coding stage, e.g., see (121) and (122), this implies that:
which concludes the argument.
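The identification used in this converse, between the divergence restricted to the sub-sigma field of a quantizer and the divergence of the measures induced on the quantized alphabet, can be sketched on a toy example. The two pmfs, the quantizer, and the function names below are hypothetical; the sketch computes the divergence between the induced measures and checks that it does not exceed the divergence on the full alphabet, as the data-processing inequality requires.

```python
import math
from collections import defaultdict


def pushforward(pmf, quantizer):
    """Measure induced on the quantizer's range: v(y) = mu({x : quantizer(x) = y})."""
    induced = defaultdict(float)
    for symbol, mass in pmf.items():
        induced[quantizer(symbol)] += mass
    return dict(induced)


def kl_divergence_bits(p, q):
    """D(p || q) in bits for pmfs stored as dicts; assumes q(x) > 0 wherever p(x) > 0."""
    return sum(mass * math.log2(mass / q[x]) for x, mass in p.items() if mass > 0.0)


def merge_tail(x):
    """Toy quantizer: keep the letter 1 and merge every larger letter into index 2."""
    return min(x, 2)


mu = {1: 0.5, 2: 0.25, 3: 0.125, 4: 0.125}
nu = {1: 0.25, 2: 0.25, 3: 0.25, 4: 0.25}

full = kl_divergence_bits(mu, nu)
restricted = kl_divergence_bits(pushforward(mu, merge_tail), pushforward(nu, merge_tail))
print(f"full divergence={full:.5f} bits, restricted divergence={restricted:.5f} bits")
```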
III-C2 Converse for general variable-length lossy codes
Let us consider a lossy code of length . Without loss of generality we can decouple as the composition of a vector quantizer , where is an index set, and a prefix-free lossless mapping , where for all . From this, we characterize the vector quantization induced by as follows:
Using this two-stage (vector quantization-coding) view, we have the following result:
Let us consider a lossy code and a family of distributions . If we denote by the probability in induced by (the -fold distributions with marginal in ) and , by , then
The proof is relegated to Appendix A-F.
In our context, this result shows that the worst-case overhead is lower bounded by the information radius of the -fold family projected into the sub-sigma field induced by , i.e., a quantization of .
Considering that , we follow the construction presented in  that shows that there is an infinite collection of distributions with , where if we denote by
then for each and for any , . In this context, for each is the support of .
At this point, let us use the assumption that:
This implies that . From the fact that is an infinite collection of probabilities with disjoint supports and the definition of the distortion, it is simple to verify that we need to allocate at least one prototype (footnote 13: The prototypes of is the set