Rateless Lossy Compression via the Extremes

Albert No and Tsachy Weissman
Department of Electrical Engineering, Stanford University
Email: {albertno, tsachy}@stanford.edu

The material in this paper was presented in part at the 52nd Annual Allerton Conference on Communication, Control, and Computing (Allerton), 2014. This work was supported by the NSF Center for Science of Information under Grant Agreement CCF-0939370.

We begin by presenting a simple lossy compressor operating at near-zero rate: The encoder merely describes the indices of the few maximal source components, while the decoder’s reconstruction is a natural estimate of the source components based on this information. This scheme turns out to be near-optimal for the memoryless Gaussian source in the sense of achieving the zero-rate slope of its distortion-rate function. Motivated by this finding, we then propose a scheme comprised of iterating the above lossy compressor on an appropriately transformed version of the difference between the source and its reconstruction from the previous iteration. The proposed scheme achieves the rate distortion function of the Gaussian memoryless source (under squared error distortion) when employed on any finite-variance ergodic source. It further possesses desirable properties we respectively refer to as infinitesimal successive refinability, ratelessness, and complete separability. Its storage and computation requirements are of order no more than per source symbol for at both the encoder and decoder. Though the details of its derivation, construction, and analysis differ considerably, we discuss similarities between the proposed scheme and the recently introduced Sparse Regression Codes (SPARC) of Venkataramanan et al.

Complete separability, extreme value theory, infinitesimal successive refinability, order statistics, rate distortion code, rateless code, spherical distribution, uniform random orthogonal matrix.

I Introduction

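Consider an independent and identically distributed (i.i.d.) standard Gaussian source $X^n = (X_1, \ldots, X_n)$. It is well known [1] that the maximum value concentrates on $\sqrt{2\log n}$, i.e., $\max_{1 \le i \le n} X_i \approx \sqrt{2\log n}$. This fact suggests a simple lossy source coding scheme for the Gaussian source under quadratic distortion. The encoder sends the index of the maximum value and the decoder reconstructs according to

$$\hat{X}_i = \begin{cases} \sqrt{2\log n} & \text{if } i = \arg\max_{1 \le j \le n} X_j, \\ 0 & \text{otherwise.} \end{cases}$$
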
For the meager nats that it requires, this simple scheme achieves essentially optimum distortion (in a sense made concrete in Section II) and has obviously modest storage and computational requirements. We can generalize this scheme by describing the indices of the largest values, and the scheme still achieves optimum distortion for its operating rate. Note that this scheme can be considered a special case of a permutation code [2], where the encoder sends a rough ordering of the source. It can perform as well as the best entropy-constrained scalar quantizer (ECSQ) but cannot achieve the optimum distortion-rate function at general positive rates [3]. In [2], the authors mentioned the case explicitly as being asymptotically optimum under the expected distortion criterion. Our focus is more on the excess distortion probability than the expected distortion. Furthermore, we establish a more general result where grows sub-linearly in .

We generalize this idea to a scheme we refer to as Coding with Random Orthogonal Matrices (CROM), which achieves the distortion-rate function at all rates. Let be a random by matrix uniformly drawn from the set of all by orthogonal matrices, i.e., for any -dimensional vector , the random vector is uniformly distributed on the sphere with radius . Since a random vector uniformly distributed on a high-dimensional sphere is close in distribution to an i.i.d. Gaussian random vector, we can expect the behavior of to be similar to that of an i.i.d. Gaussian random vector. Therefore, we can apply the above scheme again to describe a lossy version of it, using another nats, and so on. In this paper, we show that this iterative scheme achieves the Gaussian rate distortion function for any finite-variance ergodic source under quadratic distortion, while enjoying additional properties such as a strong notion of successive refinability and polynomial complexity.

One nice property of CROM is ratelessness. Similar to rateless codes in the channel coding setting, CROM is able to reconstruct the source from partial messages while achieving the optimum distortion for the corresponding rate. More precisely, if the decoder receives the first fraction of the messages for some , then it can reconstruct the source with a distortion . Thanks to this ratelessness, the encoder does not have to determine the rate ahead of encoding. However, unlike in many rateless channel coding settings, CROM requires that the observed bits be the first fraction of the bits, rather than the same number of bits gleaned from any set of locations along the stream.

Much work has been dedicated to reducing the complexity of rate-distortion codes (cf. [4], [5], [6] and references therein). In particular, Venkataramanan et al. proposed the sparse regression code (SPARC) that achieves the Gaussian distortion-rate function with low complexity [7], [8]. SPARC and CROM have similarities, which we discuss in detail.

The paper is organized as follows. In Section II, we present the simple zero-rate scheme and the sense in which it is optimal for Gaussian sources. CROM is described, along with some of its properties and performance guarantees, in Section III. We compare our scheme with SPARC in Section IV. We test CROM via simulation in Section V. We also discuss dual channel coding results in Section VI. Section VII provides proofs of our main results and we conclude the paper in Section VIII.

Notation: Both and denote an -dimensional random vector . We let denote the -th largest element of . We denote an by random orthogonal matrix by , and a non-random orthogonal matrix by . We denote the distortion-rate function of the memoryless standard Gaussian source by . Finally, we use nats instead of bits, and denotes the logarithm to the natural base unless specified otherwise.

II Optimum Zero-Rate Gaussian Source Coding Scheme

In this section, we propose a simple zero-rate lossy compressor which is essentially optimal for the i.i.d. standard Gaussian source under quadratic distortion. Before that, let us be more rigorous regarding our notion of “zero-rate optimum source coding” for a Gaussian source under squared error distortion. Consider a scheme using a number of nats for the lossy description of the source which is sub-linear in the block length , i.e., the rate of the scheme converges to zero. Suppose the scheme achieves a distortion , where the target excess distortion probability is , i.e.,


We further define to be the minimum distortion achievable over all possible strictly zero-rate schemes when the target excess distortion probability is . The following lemma shows that the best reconstruction is the all-zero vector for the i.i.d. standard Gaussian source under squared error distortion.

Lemma. Let be the i.i.d. standard Gaussian source. Then, for any and , the following inequality holds.


Since has a spherically symmetric distribution, namely is also i.i.d. standard Gaussian for any orthogonal matrix , only depends on . Let , then




It is not hard to show that


Finally, we say that a sequence of zero-rate schemes achieves the zero-rate optimum if


for all , where is the slope of the Gaussian distortion-rate function at zero rate. Equivalently,


This definition is reminiscent of the finite block length result in lossy compression [9], [10], where the authors showed the minimum distortion among all possible schemes for given rate , target excess distortion probability , and block length is


Recall that denotes the Gaussian distortion-rate function of memoryless standard Gaussian source.

We are now ready to propose the simple zero-rate optimum source coding scheme. Let be an i.i.d. standard normal random process. The encoder simply sends the index of the maximum value, , and the decoder reconstructs as


where is naturally chosen as . Note that the encoder only describes the index of the maximum entry but not its value. This scheme works because the unsent value of the maximum entry concentrates on the specific value near , i.e., , which is a well-known fact from extreme value theory [1].

The rate of this scheme is $\frac{\log n}{n}$ nats per symbol, and it is not hard to show that the distortion is reduced by $\frac{2\log n}{n}$ (plus lower order terms), which is twice the rate we are using. Therefore, it is natural to suspect that such a scheme is zero-rate optimum.
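The single-index scheme can be illustrated with a minimal Python sketch. The function names, the seed, the block length, and the choice of $\sqrt{2\log n}$ as the reconstruction amplitude are assumptions made for illustration; the sketch only compares the empirical distortion drop with twice the rate.

```python
import math
import random


def encode_max_index(x):
    """Encoder: describe only the index of the largest source component."""
    return max(range(len(x)), key=lambda i: x[i])


def decode_max_index(index, n):
    """Decoder: put an assumed amplitude sqrt(2 log n) at the index, zeros elsewhere."""
    amplitude = math.sqrt(2.0 * math.log(n))  # assumption: value near the maximum
    x_hat = [0.0] * n
    x_hat[index] = amplitude
    return x_hat


def distortion(x, x_hat):
    """Per-symbol squared-error distortion."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


rng = random.Random(0)  # fixed seed, used only for repeatability
n = 4096
x = [rng.gauss(0.0, 1.0) for _ in range(n)]

index = encode_max_index(x)
x_hat = decode_max_index(index, n)

rate = math.log(n) / n                 # nats per symbol for one index out of n
baseline = distortion(x, [0.0] * n)    # all-zero reconstruction
achieved = distortion(x, x_hat)
print(f"rate            = {rate:.5f} nats/symbol")
print(f"distortion drop = {baseline - achieved:.5f}")
print(f"2 * rate        = {2 * rate:.5f}")
```

For a single realization the drop fluctuates around $2\log n / n$, since it depends on how close the realized maximum is to the assumed amplitude.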

We can generalize this scheme to send more than one index: The encoder sends the indices of the largest values of , and the decoder reconstructs as


Here we will choose for some and to be roughly the expected value of the -th largest value of , i.e., .

Clearly this scheme has rate where . The following theorem shows that this scheme is optimal at zero rate.

Theorem. For any and , there is an such that the above scheme achieves the zero-rate optimum. More precisely, for any , the scheme achieves




Since , we can say that the above scheme is zero-rate optimum. The proof is given in Section VII-B. We note that the encoding and decoding can be done in almost linear time. Moreover, we do not need to store an entire codebook, but only the single real number needs to be stored.
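The generalized scheme, in which several indices are described, might be sketched as follows. This is an illustration under stated assumptions: the number of indices is called `k`, the common reconstruction amplitude is taken to be the $(1 - k/n)$ quantile of the standard normal distribution as a proxy for the $k$-th largest value, and the rate is computed as the cost of describing a $k$-subset of $n$ positions. None of these specific choices is taken from the theorem above.

```python
import math
import random
from statistics import NormalDist


def encode_top_k(x, k):
    """Encoder: sorted indices of the k largest components (ties: smaller index first)."""
    order = sorted(range(len(x)), key=lambda i: (-x[i], i))
    return sorted(order[:k])


def decode_top_k(indices, n, k):
    """Decoder: one common amplitude on the described positions, zeros elsewhere.

    Assumption: the amplitude is the (1 - k/n) standard normal quantile."""
    amplitude = NormalDist().inv_cdf(1.0 - k / n)
    x_hat = [0.0] * n
    for i in indices:
        x_hat[i] = amplitude
    return x_hat


def rate_nats_per_symbol(n, k):
    """Nats per symbol needed to describe a k-subset of n positions."""
    return math.log(math.comb(n, k)) / n


rng = random.Random(0)  # fixed seed, used only for repeatability
n, k = 4096, 16
x = [rng.gauss(0.0, 1.0) for _ in range(n)]

indices = encode_top_k(x, k)
x_hat = decode_top_k(indices, n, k)

baseline = sum(v * v for v in x) / n
achieved = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / n
rate = rate_nats_per_symbol(n, k)
print(f"rate = {rate:.5f} nats/symbol, distortion drop = {baseline - achieved:.5f}")
```

Only the single amplitude has to be shared between encoder and decoder, which matches the remark that no codebook needs to be stored.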


Note that Verdú [11] also considered the slope of the rate-distortion function at as a counterpart to the capacity per unit cost. However, our requirement for a zero-rate optimum scheme is stronger since we incorporate the second-order (or dispersion) term .


The above scheme only describes the index of the largest element. However, the encoder can send the indices of both the maximum and the minimum, which is also zero-rate optimum. Note that the minimum value will be close to , and therefore we can expect similar behavior.

III Coding with Random Orthogonal Matrices

III-A Preliminaries

Before presenting the scheme, we briefly review some key ingredients: random orthogonal matrices and spherical distributions.

Let be the set of all by orthogonal matrices. We write to denote that is a random by orthogonal matrix uniformly drawn from . This uniform distribution is with respect to Haar measure, cf. [12]. More precisely, the random matrix is uniformly distributed on if and only if has the same distribution as for any orthogonal matrix . The QR decomposition of a random matrix with i.i.d. Gaussian entries provides a uniformly distributed random orthogonal matrix. There is also a more efficient method, called the subgroup algorithm, for generating such matrices [13], [14]. Now, let us recall the definition of a radially symmetric random vector and its relation with uniform random orthogonal matrices.

Definition. An -dimensional random vector has a spherical distribution if and only if and have the same distribution for all orthogonal matrices .

One nice property of a spherically distributed random vector is that its characteristic function is radially symmetric [15], i.e., for some . Therefore, it is enough to consider the norm for a spherically distributed random vector . It is clear that an i.i.d. Gaussian random vector has a spherical distribution. The following lemma shows how to symmetrize a vector with a uniform random orthogonal matrix.


Lemma. Suppose is a uniform random orthogonal matrix on . For any random vector , the random vector has a spherical distribution.

The lemma is a direct consequence of the respective definitions of a uniform random orthogonal matrix and a spherical distribution.
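As a rough illustration of the QR-based construction mentioned above, the following standard-library sketch orthonormalizes i.i.d. Gaussian vectors by Gram-Schmidt. Up to a transpose, this is the Q factor of a QR decomposition with positive diagonal in R, which is the normalization under which Q is uniformly (Haar) distributed. The function names and the small dimension are illustrative choices, and this is not the subgroup algorithm of [13], [14].

```python
import math
import random


def random_orthogonal(n, rng):
    """Return an n x n orthogonal matrix as a list of orthonormal rows.

    Rows are built by (modified) Gram-Schmidt on i.i.d. Gaussian vectors."""
    rows = []
    while len(rows) < n:
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        for q in rows:  # remove the components along the rows found so far
            dot = sum(a * b for a, b in zip(v, q))
            v = [a - dot * b for a, b in zip(v, q)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-12:  # resample in the (probability-zero) degenerate case
            rows.append([a / norm for a in v])
    return rows


def matvec(A, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


rng = random.Random(1)  # fixed seed, used only for repeatability
n = 16
Q = random_orthogonal(n, rng)
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
y = matvec(Q, x)

# Largest entrywise deviation of Q Q^T from the identity matrix.
gram_error = max(
    abs(sum(a * b for a, b in zip(Q[i], Q[j])) - (1.0 if i == j else 0.0))
    for i in range(n)
    for j in range(n)
)
norm_x = math.sqrt(sum(v * v for v in x))
norm_y = math.sqrt(sum(v * v for v in y))
print(f"max |QQ^T - I| = {gram_error:.2e}, |x| = {norm_x:.6f}, |Qx| = {norm_y:.6f}")
```

The construction costs on the order of $n^3$ operations per matrix, so it is only meant to make the definition concrete for small dimensions.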

III-B Coding with Random Orthogonal Matrices

For notational convenience, define to be the function that finds the largest values of the input. If there is an ambiguity, the function picks the smallest index first. Specifically, if , then if and only if is one of the largest entries of and otherwise. Let be orthogonal matrices, be scalars, and assume that is a positive integer smaller than . We are now ready to describe the iterative scheme.

  Set .
  for  to  do
     Let .
     Let where
     Let .
  end for
  Send .
Algorithm 1 CROM

The unit vector indicates the largest values of , and ’s are scaling factors which depend on the norm of and will be specified later. Since , the inverse of the recursion is for all . This implies


Therefore, when the decoder receives for some , it outputs the reconstruction


The decoder can sequentially generate reconstructions using the relation . Note that the decoder can compute efficiently according to


Since we need nats to store (send) , rate corresponds to number of iterations. We are ready to state our main theorem asserting that Algorithm 1 achieves the Gaussian distortion-rate function.

Theorem. Suppose is emitted by an ergodic source of marginal second moment . For any , let and suppose the rate is . If we take


and a small enough scalar , there exist orthogonal matrices such that Algorithm 1 satisfies


Recall that (22) holds for any small enough for any ergodic . If we make the stronger assumption that is i.i.d. with , then we can find a vanishing that satisfies (22). The proof of Theorem III-B is given in Section VII-C with full details regarding the choice of .

Remark. Theorem III-B implies that (22) holds for any fixed . In terms of complexity, a large is preferred since it implies a small number of iterations, which results in lower complexity. On the other hand, our result relies on the concentration of the largest values of an i.i.d. Gaussian random vector. If is too big, then the largest values may deviate too much. We will see the trade-off in the simulation results of Section V.
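The iteration of Algorithm 1 and the sequential decoder can be sketched in Python as below. This is a hedged illustration, not the construction used in the proof: the recursion is assumed to be "rotate, mark the `k` largest entries with a unit vector, subtract a scaled copy", and the scaling schedule in `schedule` is a heuristic of this sketch (a normal quantile times a predicted residual scale), chosen only so that encoder and decoder can compute it in advance. All function names are hypothetical.

```python
import math
import random
from statistics import NormalDist


def random_orthogonal(n, rng):
    """Orthonormal rows via Gram-Schmidt on i.i.d. Gaussian vectors."""
    rows = []
    while len(rows) < n:
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        for q in rows:
            dot = sum(a * b for a, b in zip(v, q))
            v = [a - dot * b for a, b in zip(v, q)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-12:
            rows.append([a / norm for a in v])
    return rows


def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def matvec_t(A, x):
    """Product with the transpose, i.e. with the inverse of an orthogonal A."""
    out = [0.0] * len(A[0])
    for row, xi in zip(A, x):
        for j, a in enumerate(row):
            out[j] += a * xi
    return out


def schedule(n, k, num_iter, sigma):
    """Assumed scaling factors, fixed in advance and shared by both ends."""
    t = NormalDist().inv_cdf(1.0 - k / n)   # proxy for the k-th largest value
    alphas, scale = [], sigma
    for _ in range(num_iter):
        alphas.append(math.sqrt(k) * scale * t)
        scale *= math.sqrt(1.0 - k * t * t / n)  # predicted residual shrinkage
    return alphas


def crom_encode(x, mats, alphas, k):
    """Return one index set per iteration and the final residual."""
    r, messages = list(x), []
    for A, alpha in zip(mats, alphas):
        r = matvec(A, r)
        idx = sorted(sorted(range(len(r)), key=lambda i: (-r[i], i))[:k])
        for i in idx:                       # subtract alpha times the unit vector
            r[i] -= alpha / math.sqrt(k)
        messages.append(idx)
    return messages, r


def crom_decode(messages, mats, alphas, n, k):
    """Reconstruct from any prefix of the message list by inverting the recursion."""
    x_hat = [0.0] * n
    for idx, A, alpha in reversed(list(zip(messages, mats, alphas))):
        for i in idx:
            x_hat[i] += alpha / math.sqrt(k)
        x_hat = matvec_t(A, x_hat)
    return x_hat


rng = random.Random(2)  # fixed seed, used only for repeatability
n, k, num_iter = 32, 1, 12
mats = [random_orthogonal(n, rng) for _ in range(num_iter)]
alphas = schedule(n, k, num_iter, sigma=1.0)
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
messages, residual = crom_encode(x, mats, alphas, k)

for used in (0, 4, 8, 12):  # decode from growing prefixes of the messages
    x_hat = crom_decode(messages[:used], mats, alphas, n, k)
    d = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / n
    print(f"messages used = {used:2d}, distortion = {d:.4f}")
```

Because every step is an orthogonal map followed by a known shift, the reconstruction error from all messages has the same norm as the encoder's final residual; the prefix decoding in the loop is the property the discussion of successive refinability relies on.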

III-C Discussion

III-C1 Role of Orthogonal Matrices

It is known that an i.i.d. Gaussian random vector has a spherical distribution and the variance of its norm is very small. Therefore, if a random vector has a spherical distribution and the variance of its norm is small enough, can be thought of as an approximately i.i.d. Gaussian random vector. In the proof of CROM, we employ a randomization argument. Specifically, we assume that are drawn i.i.d.  and show that equation (22) holds when the probability is averaged over this ensemble of random matrices. The source at the -th iteration has a spherical distribution by Lemma III-A, and we can therefore expect to be a near-Gaussian source, where we indirectly show that the norm of has small variance. This shows that multiplying by uniformly distributed random matrices can be thought of as a way to not only symmetrize but also Gaussianize the random vector so that we can apply the idea of Theorem II iteratively.

Note that the conditional distribution of is no longer similar to Gaussian when the matrix is known to both the encoder and the decoder. However, in the proof, we implicitly show that the maximum element of is very close to with high probability, as if it were an i.i.d. Gaussian random vector.

A similar idea can be found in the work of Asnani et al. [16]. The authors showed that any coding scheme for a Gaussian network source coding problem can be adapted to perform well for other network source coding problems that are not necessarily Gaussian but have the same covariances. The key idea of the paper is applying an orthogonal transformation to the sources which basically “Gaussianizes” them so that coding schemes for Gaussian sources are applicable in the transform domain.

III-C2 Storage and Computational Complexity

Unlike the zero-rate scheme of Section II, this scheme requires the storage of matrices (and scalars). Since , both the encoder and decoder must keep real values to store matrices . In terms of computation, the encoder finds the largest entries of an dimensional vector and performs a matrix-vector multiplication for each iteration. The dominant cost is , the cost of matrix-vector multiplication. Therefore, the overall computational complexity is of order .

Instead of storing , it is also possible to store random seeds at both the encoder and the decoder to generate them. In this case, CROM requires storage space. However, generating a uniform random orthogonal matrix takes [13], and therefore the overall computational complexity will be of order .

III-C3 Infinitesimal Successive Refinability

Suppose the decoder gets only the first messages . Note it needs to have seen only the first nats for that. With this partial message set, the decoder is able to reconstruct which achieves a distortion


where the theorem guarantees is arbitrarily negligible for large enough . In other words, the decoder essentially achieves a distortion , which is the Gaussian distortion-rate function at rate . Evidently, CROM can be viewed as a successive refinement coding scheme with stages. Since we have a growing number of stages (in ), the rate increment at each stage is negligible (i.e., sub-linear number of additional nats per stage) and this is a key difference from classical successive refinement problems where the number of stages is fixed. Note that Theorem III-B implies that the probability of excess distortion beyond the relevant point on the distortion-rate curve at any of the successive refinement stages is negligible. Therefore, if the source is i.i.d. Gaussian, our coding scheme simultaneously achieves every point on the optimum distortion-rate curve. This infinitesimal successive refinability can be considered a strengthened version of successive refinement. In other words, to implement and operate CROM, the value of the rate need not be known or set in advance, a point we will expound in Section III-C4.

In [17], a similar property called “incremental refinements” was discussed. The paper discovered a new limiting behavior of the additive rate-distortion function at zero rate, and proposed a refinement idea. However, the additive rate-distortion function is a mutual information between the input and the output of the Gaussian test channel, and it is not clear how to achieve it. On the other hand, we propose a concrete scheme that achieves the rate-distortion function.

III-C4 (Near) Ratelessness

In the channel coding setting, it is well known that rateless coding schemes, including Raptor codes, achieve the capacity of erasure channels. In this setting, the rate does not have to be specified in advance, and the receiver is able to decode a message upon observing sufficiently many packets (or bits), regardless of their order. As we have discussed above, CROM has a similar property in that a rate does not need to be specified in advance of the code design. This is because is a function of only, and therefore ’s are independent of . Furthermore, we will see in the proof that depends only on . If the source is i.i.d. , the decoder can achieve a distortion upon observing a fraction of the message bits. This is similar to a rateless code in channel coding because the decoder can achieve the optimum as soon as it collects sufficiently many of the message bits. However, the CROM decoder needs its observed bits to be a contiguous sequence at the beginning of the message bit stream, whereas it is enough to have any combination of channel output observations in the rateless channel coding setting.

Note that our scheme can be considered as a progressive coder where “progressive” refers to the refinability. However, the refinement layers of a progressive code are often useless without the base layer, whereas the refinement layers of CROM are useful by themselves. More precisely, the decoder can have the following reconstruction based only on ,


where with the reconstruction would be


III-C5 Complete Separability

In the classical separation scheme, the source encoder must know the channel capacity in order to design the source coding scheme with rate , but the source encoder often does not have this prior knowledge. However, if the source is Gaussian, the proposed scheme achieves the optimum distortion without channel information. Let be a sufficiently large constant and say the encoder uses the proposed scheme with rate . When the decoder receives the first fraction of message bits and performs the reconstruction, we achieve the distortion that satisfies due to the infinitesimal successive refinability. Since we can achieve the optimum performance using a simple scheme while the source encoder is blind to the capacity of the link, we can call this property complete separability.




Fig. 1: Relay Network

Another interesting example is a relay network without a direct link, as described in Figure 1, where the source is i.i.d. Gaussian. The links from the encoder to the relay node and from the relay node to the decoder are both noiseless, with capacities and , respectively, where we assume that . If the encoder knows the capacity of both links, then the problem is equivalent to the successive refinement problem. However, consider the case where the encoder only knows . If the encoder is optimized only for the first link, the relay node has to decode the whole message and compress it again with rate . However, if we use CROM, the relay node can simply send the first fraction of messages to the decoder, and the decoder will be able to form an optimal reconstruction with respect to its own link capacity.

III-C6 Convergence Rate

After the -th iteration, the decoder can achieve a distortion


Recall that the Gaussian distortion-rate function at rate is , and therefore the gap between the achieved distortion and is uniformly bounded by at all stages. Note that if the source is i.i.d. with bounded , we can choose vanishing such that the probability of error decays on the order of .

IV Comparison to SPARC

Recall that CROM can be viewed as a nonzero-rate generalization of the zero-rate scheme introduced in Section II. On the other hand, SPARC implements the idea of describing a codeword with a linear combination of sub-codewords. Though the derivations of these two schemes were based on different ideas, they share several similarities. In this section, we outline the similarities and differences.

IV-A Sparse Linear Regression Codes

Let us briefly review SPARC. Let be the first components of an ergodic source with mean 0 and variance 1. Define sub-codebooks , where each sub-codebook has sub-codewords. Sub-codewords are generated independently according to the standard normal distribution. Parameters and are chosen to be , where is the rate of the scheme, and define constants appropriately. Then, the following algorithm exhibits the main structure of the sparse linear regression code (SPARC), which was presented in [7] and shown to achieve the Gaussian distortion-rate function for any ergodic source (under appropriate choice of parameters).

  Set .
  for  to  do
     Let and be the index of .
     Let .
  end for
  Send .
Algorithm 2 SPARC

Note that there is another version of SPARC [8] where encoding is not done sequentially but is done by exhaustive search. Since we are focusing on efficient lossy compressors, we only consider the SPARC described in Algorithm 2 throughout the paper.
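For comparison with the CROM sketches, the sequential encoder of Algorithm 2 might look as follows in Python. This is only an illustration under assumptions: the coefficient schedule `coeffs` is chosen here so that each greedy step is predicted to shrink the residual energy by a fixed factor, and the variable names (`num_books`, `book_size`, `step`) are hypothetical. The constants actually used in [7] may differ.

```python
import math
import random


def sparc_encode(x, codebooks, coeffs):
    """Greedy sequential encoding: one sub-codeword index per sub-codebook."""
    r, indices = list(x), []
    for book, c in zip(codebooks, coeffs):
        # pick the sub-codeword with the largest inner product with the residual
        best = max(range(len(book)),
                   key=lambda j: sum(a * b for a, b in zip(book[j], r)))
        r = [a - c * b for a, b in zip(r, book[best])]
        indices.append(best)
    return indices, r


def sparc_decode(indices, codebooks, coeffs, n):
    """Reconstruction: weighted sum of the selected sub-codewords."""
    x_hat = [0.0] * n
    for j, book, c in zip(indices, codebooks, coeffs):
        x_hat = [a + c * b for a, b in zip(x_hat, book[j])]
    return x_hat


rng = random.Random(3)  # fixed seed, used only for repeatability
n, num_books, book_size, sigma = 64, 8, 64, 1.0

# Assumed schedule: c_i^2 = sigma^2 * step * (1 - step)^(i - 1).
step = 2.0 * math.log(book_size) / n
coeffs = [sigma * math.sqrt(step * (1.0 - step) ** i) for i in range(num_books)]

codebooks = [
    [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(book_size)]
    for _ in range(num_books)
]
x = [rng.gauss(0.0, sigma) for _ in range(n)]

indices, residual = sparc_encode(x, codebooks, coeffs)
x_hat = sparc_decode(indices, codebooks, coeffs, n)

rate = num_books * math.log(book_size) / n
before = sum(v * v for v in x) / n
after = sum(v * v for v in residual) / n
print(f"rate = {rate:.3f} nats/symbol, distortion {before:.4f} -> {after:.4f}")
```

Each iteration scans a full sub-codebook to select a single index, which is the per-iteration cost that the comparison with CROM in the next subsection refers to.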

IV-B Main Differences

In SPARC, the codebook consists of sub-codebooks where each sub-codebook has codewords. Our proposed iterative scheme is similar to SPARC with and ; finding the sub-codeword that achieves the maximum inner product can be viewed as finding the maximum entries after multiplying the matrix in our iterative scheme.

There are, however, two main differences. The first is that our scheme finds the largest values at each iteration. This implies that one iteration of our proposed encoding scheme is equivalent to iterations of SPARC’s encoding. In Section III-C2, we have seen that CROM requires operations per symbol, for an arbitrarily chosen . The gap between the distortion and is . In SPARC, the gap between the distortion and is . In order to calibrate with CROM, we can set . However, operations per symbol are required for SPARC encoding where , and therefore the number of operations for SPARC is . Thus, SPARC requires times more operations. The same relation holds when we consider the storage complexity. CROM needs to store real numbers, whereas the SPARC encoder and decoder have to store real numbers.

The second difference is the structure of the sub-codebook. The columns of an orthogonal matrix are orthogonal to each other, and this implies that CROM is similar to SPARC with structured sub-codewords. For example, if , all sub-codewords of CROM are orthogonal to each other, whereas SPARC draws sub-codewords from an i.i.d. Gaussian distribution.

IV-C Key Lemma

As we discussed in Section IV-B, sub-codewords in CROM are drawn from the surface of the sphere, while sub-codewords in SPARC are drawn according to the i.i.d. Gaussian distribution. Given this difference, we would like to introduce some dualities. For example, consider the following lemma used in the proof of SPARC.

Lemma ([7, Lemma 1]). Let be independent random vectors with i.i.d. standard Gaussian elements. Then for any random vector supported on the dimensional unit sphere and independent of the ’s, the inner products are i.i.d. standard Gaussian random variables that are independent of .

On the other hand, recall Lemma III-A, which asserts that any random vector multiplied by a uniform random orthogonal matrix has a spherical distribution.

IV-D Successive Refinability

That SPARC possesses the successive refinability property was briefly mentioned by the authors; however, the main theorem in [7] only guarantees that the probability of error at the end of the process will vanish. On the other hand, we have seen in Section III-C6 that CROM has convergence rates that hold uniformly and simultaneously at all points on the rate-distortion curve.

V Simulation Results

In this section, we test CROM via simulations on sources with . We choose


Note that parameters are not optimized for the expected distortion, so there might be a better choice of . All results are averaged over 100 random trials.

First, we compare the performance of CROM and SPARC in Figure 2(a). We choose an i.i.d. standard Gaussian source where . We simulated for for SPARC. Note that the complexity of SPARC is higher when is large. We let for CROM, which corresponds to the case of SPARC. Note that the performance of CROM is similar to the performance of SPARC with .

As we discussed in Remark III-B, the complexity of CROM decreases when is large; however, the performance will be worse when is large. Figure 2(b) shows the trade-off between the small and the large .

In order to simulate CROM with higher , we use structured orthogonal matrices to reduce the storage and computational complexity. Note that any orthogonal matrix is a product of Givens rotations which are matrices of the form


This suggests constructing sparse orthogonal matrices using Givens rotations as building blocks. Suppose is a power of , i.e., . We recursively define the sparse orthogonal matrices for .


where is a diagonal matrix with entries . The following matrices (31), (32), (33) show three types of sparse orthogonal matrices when .


Each matrix is a product of Givens rotations. Therefore, the product of consecutive sparse orthogonal matrices is equivalent to the product of Givens rotations. If we draw the angles uniformly at random, the product is expected to have a distribution similar to that of a uniform random orthogonal matrix. Since each row has exactly two non-zero elements, the matrix multiplication requires operations. Also, the storage complexity is .
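To make the "two non-zero elements per row" structure concrete, the sketch below applies one layer of disjoint Givens rotations to a vector in linear time. The butterfly-style pairing of coordinates `i` and `i + stride`, and the angle range $[0, 2\pi)$, are assumptions of this illustration; they are not necessarily the recursive definition used in the simulations.

```python
import math
import random


def givens_layer(x, angles, stride):
    """Apply n/2 disjoint Givens rotations pairing coordinates i and i + stride.

    Assumes len(x) is a power of two and stride is a power of two below len(x)."""
    n = len(x)
    y = list(x)
    pair = 0
    for start in range(0, n, 2 * stride):
        for i in range(start, start + stride):
            j = i + stride
            c, s = math.cos(angles[pair]), math.sin(angles[pair])
            y[i] = c * x[i] - s * x[j]   # each output uses exactly two inputs
            y[j] = s * x[i] + c * x[j]
            pair += 1
    return y


rng = random.Random(4)  # fixed seed, used only for repeatability
n = 8
x = [rng.gauss(0.0, 1.0) for _ in range(n)]

y = list(x)
stride = 1
while stride < n:  # log2(n) layers with strides 1, 2, 4, ...
    angles = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n // 2)]
    y = givens_layer(y, angles, stride)
    stride *= 2

norm_x = math.sqrt(sum(v * v for v in x))
norm_y = math.sqrt(sum(v * v for v in y))
print(f"|x| = {norm_x:.6f}, |y| = {norm_y:.6f}")
```

Each layer needs only its $n/2$ angles to be stored and a constant amount of work per coordinate, in line with the linear cost stated above.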

Another well-known orthogonal matrix is the type-II discrete cosine transform (DCT-II) matrix. We can use a fast Fourier transform (FFT) algorithm to multiply by the DCT matrix efficiently. Also, the DCT matrix requires of storage space.
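For reference, the orthonormal DCT-II matrix can be written down explicitly. The sketch below builds it densely only to check orthogonality; a practical implementation would use a fast transform instead of forming the matrix. The function name `dct2_matrix` is a hypothetical helper.

```python
import math


def dct2_matrix(n):
    """Orthonormal DCT-II matrix, returned as a list of rows."""
    rows = []
    for k in range(n):
        # the first row gets a smaller scale so that all rows have unit norm
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([
            scale * math.cos(math.pi * (2 * j + 1) * k / (2 * n))
            for j in range(n)
        ])
    return rows


n = 8
C = dct2_matrix(n)

# Largest entrywise deviation of C C^T from the identity matrix.
dct_error = max(
    abs(sum(a * b for a, b in zip(C[i], C[j])) - (1.0 if i == j else 0.0))
    for i in range(n)
    for j in range(n)
)
print(f"max |CC^T - I| = {dct_error:.2e}")
```

Since the entries are given by a closed-form expression, nothing beyond the dimension has to be shared between encoder and decoder.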

Instead of the original CROM with uniform random orthogonal matrices, we propose two modified versions of CROM using the above structured orthogonal matrices. First, at the -th iteration, we choose where (mod ), and are uniformly sampled from . The second approach uses where denotes the DCT-II matrix. Figure 2(c) shows the performance of the two modified algorithms when and . Note that the performance of sparse orthogonal matrices is worse than that of uniformly generated orthogonal matrices; on the other hand, the performance of sparse orthogonal matrices combined with the DCT-II matrix is comparable to that of uniform orthogonal matrices.

Since the modified CROM has lower complexity, we can test CROM with larger . Figure 2(d) shows the distortion-rate curve of the second approach with sparse orthogonal matrices and the DCT-II matrix where and . Compared to the simulation result of with uniform random orthogonal matrices, its distortion-rate curve shows better performance.

(a) Distortion-rate curves of CROM and SPARC where .
(b) Distortion-rate curves for where .
(c) Distortion-rate curves for different matrix constructions where .
(d) Distortion-rate curves for where .
Fig. 2: Distortion-rate curves of CROM and SPARC. -axis shows the rate in nats, and the -axis represents the average distortion.

VI Channel Coding Dual

In [18], we can find a dual result in the Gaussian channel coding problem. In this section, we briefly review the idea of [18] (with slightly changed notation). Consider the AWGN channel where is an i.i.d. standard normal random vector. Suppose the number of messages is , i.e., the rate of the scheme is nats per channel use. Based on message , the encoder simply sends where and if . Then, the decoder finds the index of the maximum value of and recovers the message, i.e., . The average power that the encoder uses is . We will specify such that .

Before considering the probability of error , let us introduce the following useful lemma.

Lemma. Let be an i.i.d. standard normal random vector, then


where is a standard normal cumulative distribution function and . We used the fact that where is a probability density function of standard normal random variable. ∎

Now we are ready to bound . Without loss of generality, we can assume that .


If we choose such that , then goes to infinity as grows. Therefore,


Since converges to zero as grows, we can approximate the capacity by . It is clear that converges to one as grows, i.e.,


This is reminiscent of the definition of a zero-rate optimal scheme in the source coding problem. We can say that this scheme is zero-rate optimal in the channel coding setting. We further note that the encoding and decoding can be done in almost linear time, and essentially no extra information needs to be stored.
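The zero-rate channel coding scheme reviewed in this section can be simulated with a few lines of Python. The amplitude used below (a fixed margin above $\sqrt{2\log n}$), the number of trials, and the function names are assumptions of this sketch rather than the parameter choice of [18].

```python
import math
import random


def channel_encode(message, n, amplitude):
    """Encoder: all energy in the coordinate indexed by the message."""
    x = [0.0] * n
    x[message] = amplitude
    return x


def awgn(x, rng):
    """Add i.i.d. standard normal noise to every channel use."""
    return [v + rng.gauss(0.0, 1.0) for v in x]


def channel_decode(y):
    """Decoder: index of the largest channel output."""
    return max(range(len(y)), key=lambda i: y[i])


rng = random.Random(5)  # fixed seed, used only for repeatability
n = 1024
amplitude = 1.5 * math.sqrt(2.0 * math.log(n))  # assumed margin above sqrt(2 log n)
trials = 200

errors = 0
for _ in range(trials):
    message = rng.randrange(n)
    y = awgn(channel_encode(message, n, amplitude), rng)
    errors += channel_decode(y) != message

rate = math.log(n) / n          # nats per channel use
power = amplitude ** 2 / n      # average power per channel use
print(f"rate = {rate:.5f}, power = {power:.5f}, error rate = {errors / trials:.3f}")
```

Encoding and decoding each take one pass over the block, and only the amplitude has to be agreed upon in advance.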

However, unlike CROM, we could not find an iterative scheme building on this zero-rate one that achieves reliable communication at a positive rate. The main challenge is that the tail behavior on the left side is very different from that on the right side. In the source coding problem, a small maximum value (which corresponds to the left tail) yields an error, while it is a large maximum value (which corresponds to the right tail) that yields an error in the channel coding problem. More precisely, the cumulative distribution function of the maximum of Gaussian random variables converges to $\exp(-e^{-x})$ under appropriate normalizing constants. This function decays double-exponentially as $x$ decreases, which allows a small cumulative error for our iterative scheme CROM. However, $\exp(-e^{-x})$ converges to one only exponentially as $x$ grows. Therefore, in the analogous channel coding scheme, the cumulative error does not remain negligible when we employ the scheme iteratively. We believe that for similar reasons a channel coding analog of SPARC with efficient encoding would not work.
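The asymmetry between the two tails can be made explicit for the standard Gumbel distribution, which is the limiting law of the normalized Gaussian maximum. The symbol $G$ and the elementary bounds below are supplied for illustration and follow from $t - t^2/2 \le 1 - e^{-t} \le t$ for $t \ge 0$ with $t = e^{-x}$.

```latex
G(x) = \exp\!\left(-e^{-x}\right), \qquad
G(x) = \exp\!\left(-e^{|x|}\right) \quad \text{for } x < 0, \qquad
e^{-x} - \tfrac{1}{2}\, e^{-2x} \;\le\; 1 - G(x) \;\le\; e^{-x} \quad \text{for all } x .
```

The left tail is therefore double-exponentially small, while the right tail $1 - G(x)$ behaves like $e^{-x}$ for large $x$.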

Note that Erez et al. discussed rateless coding for Gaussian channels [19]. The goal of [19] is to design a channel code for which the transmitter can be blind to the channel gain and the variance of the noise. Note that the proposed rateless code requires a base code that achieves the capacity. On the other hand, we would like to design a concrete coding scheme that achieves the channel capacity when the channel information is known.

VII Proofs

VII-A Extreme Value of Gaussian Random Variables

Before providing proofs, consider the following lemma, which gives a probabilistic bound on when is an i.i.d. standard normal random vector.

Lemma. Let . If positive integers and satisfy , then


where is a standard normal cumulative distribution function.


Since are i.i.d. uniform random variables, can be considered as the -th largest value of an -dimensional i.i.d. uniform random vector. The probability density function of is . Therefore,


This concludes the proof. ∎

VII-B Proof of Theorem II

In the proof, we use for simplicity. By the definition of , we have


Let and be positive real numbers where we specify their values later. Then,


Consider the first term of (56). Let , then we have


In (57), we used the Berry-Esseen theorem [20]: