# Rate-Dependent Analysis of the Asymptotic Behavior of Channel Polarization

S. Hamed Hassani, Ryuhei Mori, Toshiyuki Tanaka, and Rüdiger Urbanke. The material in this paper was presented in part in [6], [7], [8] and [11]. The work of S. H. Hassani was supported by grant no. 200021-121903 of the Swiss National Science Foundation, the work of R. Mori by the Grant-in-Aid for Scientific Research for JSPS Fellows (225936), MEXT, Japan, and the work of T. Tanaka by the Grant-in-Aid for Scientific Research (C) (22560375), JSPS, Japan. S. H. Hassani and R. Urbanke are with the School of Computer and Communication Science, EPFL, CH-1015 Lausanne, Switzerland (e-mail: {seyehamed.hassani, rudiger.urbanke}@epfl.ch). R. Mori and T. Tanaka are with the Department of Systems Science, Graduate School of Informatics, Kyoto University, Yoshida Hon-machi, Sakyo-ku, Kyoto-shi, Kyoto, 606-8501 Japan (e-mail: rmori@sys.i.kyoto-u.ac.jp, tt@i.kyoto-u.ac.jp).
###### Abstract

For a binary-input memoryless symmetric (BMS) channel $W$, we consider the asymptotic behavior of the polarization process in the large block-length regime when transmission takes place over $W$. In particular, we study the asymptotics of the cumulative distribution $\mathbb{P}(Z_n \le z)$, where $\{Z_n\}$ is the Bhattacharyya process defined from $W$, and its dependence on the rate of transmission. On the basis of this result, we characterize the asymptotic behavior, as well as its dependence on the rate, of the block error probability of polar codes using the successive cancellation decoder. This refines the original bounds by Arıkan and Telatar. Our results apply to general polar codes based on $\ell \times \ell$ kernel matrices.

We also provide lower bounds on the block error probability of polar codes using the MAP decoder. The MAP lower bound and the successive cancellation upper bound coincide when the exponents $E_w(G)$ and $E(G)$ (defined below) agree, as they do for Arıkan's original $2 \times 2$ kernel matrix, but there is a gap whenever $E(G) < E_w(G)$.

## I Introduction

### I-A Polar Codes

Polar codes, introduced by Arıkan [1], are a family of codes that provably achieve the capacity of binary-input memoryless symmetric (BMS) channels using low-complexity encoding and decoding algorithms. Since their invention, there has been a large body of work that has analyzed (see e.g., [2]–[11]) and extended (see e.g., [12]–[20]) these codes.

The construction of polar codes is based on an $\ell \times \ell$ matrix $G$, with entries in $\{0,1\}$, called the kernel matrix. Besides being invertible, the matrix $G$ should have the property that none of its column permutations is upper triangular [13]. We call a matrix with such properties a polarizing matrix and in the following, whenever we speak of a kernel matrix $G$, we assume that $G$ is polarizing.

The rows of the generator matrix of a polar code with block-length $N = \ell^n$ are chosen from the rows of the matrix

$$G^{\otimes n} \triangleq \underbrace{G \otimes G \otimes \cdots \otimes G}_{n\ \text{times}},$$

where $\otimes$ denotes the Kronecker product. For the case $\ell = 2$ and the choice $G = \left[\begin{smallmatrix}1 & 0 \\ 1 & 1\end{smallmatrix}\right]$, Reed-Muller (RM) codes also fall into this category. However, the crucial difference between polar codes and RM codes lies in the choice of the rows. For RM codes, the rows of the largest weights are chosen, whereas for polar codes the choice is dependent on the channel and is made using a method called channel polarization. We briefly review this method and explain how polar codes are constructed from it. We also refer the reader to [1], [5] and [13] for a detailed discussion.
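The Kronecker-power construction above can be sketched in a few lines; as an illustration (not part of the paper), the sketch below builds $G^{\otimes n}$ over $\mathbb{F}_2$ for Arıkan's $2 \times 2$ kernel, one concrete choice of $G$:

```python
import numpy as np

# Arikan's 2x2 kernel; any l x l polarizing matrix works the same way.
G = np.array([[1, 0],
              [1, 1]], dtype=int)

def kron_power(G, n):
    """Return the n-fold Kronecker product G (x) G (x) ... (x) G, mod 2."""
    M = np.array([[1]], dtype=int)
    for _ in range(n):
        M = np.kron(M, G) % 2
    return M

G3 = kron_power(G, 3)   # an 8 x 8 matrix over GF(2)
```

The rows of `G3` are the candidate generator-matrix rows for block-length $N = 2^3$; which of them are kept is exactly what channel polarization decides.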

### I-B Channel Polarization

Let $W : \mathcal{X} \to \mathcal{Y}$ be a BMS channel, with $\mathcal{X} = \{0,1\}$ denoting its input alphabet, $\mathcal{Y}$ the output alphabet, and $W(y|x)$ the transition probabilities. Let $I(W)$ denote the mutual information between the input and output of $W$ with uniform distribution on the input. The capacity of a BMS channel is equal to $I(W)$. Also, the Bhattacharyya parameter of $W$, denoted by $Z(W)$, is defined as

$$Z(W) = \sum_{y \in \mathcal{Y}} \sqrt{W(y|0)W(y|1)}.$$

It provides upper and lower bounds on the error probability $P_e(W)$ in estimating the channel input on the basis of the channel output via the maximum-likelihood (ML) decoding of a single use of $W$ as follows [22, Chapter 4], [5]:

$$\frac{1}{2}\Bigl(1 - \sqrt{1 - Z(W)^2}\Bigr) \le P_e(W) \le \frac{1}{2} Z(W). \tag{1}$$

It is also related to the capacity $I(W)$ via

$$Z(W) + I(W) \ge 1, \qquad [Z(W)]^2 + [I(W)]^2 \le 1,$$

both proved in [1].
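As a quick numerical illustration (not part of the paper), the definition of $Z(W)$ and the two capacity relations above can be checked for a binary symmetric channel BSC($p$), for which $Z(W) = 2\sqrt{p(1-p)}$ and $I(W) = 1 - h_2(p)$:

```python
import math

def bhattacharyya(W, Y):
    """Z(W) = sum over y of sqrt(W(y|0) W(y|1)), with W given as a dict
    of transition probabilities W[y][x]."""
    return sum(math.sqrt(W[y][0] * W[y][1]) for y in Y)

def bsc(p):
    """Transition probabilities of the binary symmetric channel BSC(p)."""
    return {0: {0: 1 - p, 1: p}, 1: {0: p, 1: 1 - p}}

def h2(p):
    """Binary entropy function (in bits)."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

p = 0.11
Z = bhattacharyya(bsc(p), (0, 1))    # = 2 sqrt(p (1 - p))
I = 1 - h2(p)                        # capacity of BSC(p)
ok = (Z + I >= 1) and (Z ** 2 + I ** 2 <= 1)
```

Both relations hold with room to spare away from the extreme channels; $Z + I = 1$ is attained with equality for the BEC.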

The method of channel polarization is defined as follows. Take $\ell^n$ copies of a BMS channel $W$. Combine them by using the kernel matrix $G$ to make a new set of channels $\{W^{(i)}_{\ell^n}\}_{i=0}^{\ell^n-1}$. The construction of these channels is done by recursively applying a transform called channel splitting. Channel splitting is a transform which takes a BMS channel $W$ as input and outputs $\ell$ BMS channels $W^j$, $j \in \{0, \dots, \ell-1\}$. The channels $W^j$ are constructed according to the following rule: Consider a random row vector $U_0^{\ell-1} = (U_0, \dots, U_{\ell-1})$ that is uniformly distributed over $\{0,1\}^\ell$. Let $X_0^{\ell-1} = U_0^{\ell-1} G$, where the arithmetic is in $\mathbb{F}_2$. Also, let $Y_0^{\ell-1}$ be the output of $\ell$ uses of $W$ over the input $X_0^{\ell-1}$. We define the channel $W_\ell$ between $U_0^{\ell-1}$ and $Y_0^{\ell-1}$ by the transition probabilities

$$W_\ell(y_0^{\ell-1} \mid u_0^{\ell-1}) \triangleq \prod_{i=0}^{\ell-1} W(y_i \mid x_i) = \prod_{i=0}^{\ell-1} W\bigl(y_i \bigm| (u_0^{\ell-1}G)_i\bigr). \tag{2}$$

The channel $W^j$ is defined as the BMS channel with input $u_j$, output $(y_0^{\ell-1}, u_0^{j-1})$ and transition probabilities

$$W^j(y_0^{\ell-1}, u_0^{j-1} \mid u_j) = \frac{1}{2^{\ell-1}} \sum_{u_{j+1}^{\ell-1}} W_\ell(y_0^{\ell-1} \mid u_0^{\ell-1}). \tag{3}$$

Here and hereafter, $u_i^j$ denotes the subvector $(u_i, \dots, u_j)$.

The construction of the channels $\{W^{(i)}_{\ell^n}\}$ can be visualized in the following way [1]. Consider an infinite $\ell$-ary tree with the root node placed at the top. To each vertex of the tree, we assign a channel in a way that the collection of all the channels that correspond to the vertices at depth $n$ equals $\{W^{(i)}_{\ell^n}\}_{i=0}^{\ell^n-1}$. We do this by a recursive procedure. Assign to the root node the channel $W$ itself. From left to right, assign $W^0, \dots, W^{\ell-1}$ to the children of the root node. In general, if $Q$ is the channel that is assigned to vertex $v$, we assign $Q^0, \dots, Q^{\ell-1}$, from left to right respectively, to the children of the node $v$. There are $\ell^n$ vertices at level $n$ in this $\ell$-ary tree. Assume that we label these vertices from left to right from $0$ to $\ell^n - 1$. Let the channel assigned to the $i$th vertex, $0 \le i \le \ell^n - 1$, be $W^{(i)}_{\ell^n}$. Also, let the $\ell$-ary representation of $i$ be $(b_1, b_2, \dots, b_n)$, where $b_1$ is the most significant digit. Then we have

$$W^{(i)}_{\ell^n} = \bigl(\cdots\bigl((W^{b_1})^{b_2}\bigr)\cdots\bigr)^{b_n}.$$

As an example, assuming $\ell = 2$, $n = 2$ and $i = 2$, whose binary representation is $(b_1, b_2) = (1, 0)$, we have $W^{(2)}_{4} = (W^1)^0$.

The channels $\{W^{(i)}_{\ell^n}\}$ have the property that, as $n$ grows large, a fraction close to $I(W)$ of the channels have capacity close to $1$ (or Bhattacharyya parameter close to $0$); and a fraction close to $1 - I(W)$ of the channels have capacity close to $0$ (or Bhattacharyya parameter close to $1$). The basic idea behind polar codes is to use those channels with capacity close to $1$ for information transmission. Accordingly, given the rate $R$ and block-length $N = \ell^n$, the rows of the generator matrix of a polar code of block-length $N$ correspond to a subset of the rows of the matrix $G^{\otimes n}$ whose indices are chosen with the following rule: Choose a subset of size $NR$ of the channels $\{W^{(i)}_{\ell^n}\}$ with the least values for the Bhattacharyya parameter and choose the rows with the indices corresponding to those of the channels. For example, if the channel $W^{(i)}_{\ell^n}$ is chosen, then the $j$th row of $G^{\otimes n}$ is selected, where the $\ell$-ary representation of $j$ is the digit-reversed version of that of $i$. We decode using a successive cancellation (SC) decoder. This algorithm decodes the bits one-by-one in a pre-chosen order that is closely related to how the row indices of $G^{\otimes n}$ are chosen.

### I-C Problem Formulation and Relevant Work

Let $\mathcal{I}$ be the set of indices of the $NR$ channels in the set $\{W^{(i)}_{\ell^n}\}_{i=0}^{\ell^n-1}$ with the least values for the Bhattacharyya parameter. Let $P_e^{\mathrm{SC}}(N,R)$ and $P_e^{\mathrm{MAP}}(N,R)$ denote the average block error probability of the SC and the maximum a-posteriori (MAP) decoders, respectively, with block-length $N$ and rate $R$. For the SC decoder we have [1, 5],

$$\max_{i \in \mathcal{I}} \frac{1}{2}\Bigl(1 - \sqrt{1 - Z(W^{(i)}_{\ell^n})^2}\Bigr) \le P_e^{\mathrm{SC}}(N,R) \le \sum_{i \in \mathcal{I}} Z(W^{(i)}_{\ell^n}). \tag{4}$$

This relation evidently shows that the distribution of the Bhattacharyya parameters of the channels plays a fundamental role in the analysis of polar codes. More precisely, for $n \in \mathbb{N}$ and $z \in [0,1]$, we are interested in analyzing the behavior of

$$F(n,z) = \frac{\#\bigl\{i : Z(W^{(i)}_{\ell^n}) \le z\bigr\}}{\ell^n}, \tag{5}$$

where $\#A$ denotes the number of elements of the set $A$. There is an entirely equivalent probabilistic description of (5): Define the “polarization” process [2] of the channel $W$ as a channel-valued stochastic process $\{W_n\}_{n \ge 0}$ with $W_0 = W$ and

$$W_{n+1} = W_n^{B_n}, \tag{6}$$

where $\{B_n\}_{n \ge 0}$ is a sequence of independent and identically-distributed (i.i.d.) random variables with distribution $\mathbb{P}(B_n = j) = 1/\ell$ for $j \in \{0, \dots, \ell-1\}$. In other words, the process begins at the root node of the infinite $\ell$-ary tree introduced above, and in each step it chooses one of the $\ell$ children of the current node with uniform probability. So at time $n$, the process outputs one of the channels at level $n$ of the tree uniformly at random. The Bhattacharyya process of the channel $W$ is defined from the polarization process as $Z_n = Z(W_n)$. In this setting we have

$$\mathbb{P}(Z_n \le z) = F(n,z). \tag{7}$$

It was shown in [2] and [5] that the Bhattacharyya process converges almost surely to a $\{0,1\}$-valued random variable $Z_\infty$ with $\mathbb{P}(Z_\infty = 0) = I(W)$. Our objective is to investigate the asymptotic behavior of $\mathbb{P}(Z_n \le z)$. The analysis of the process around the point $z = 0$ is of particular interest, as this indicates how the “good” channels (i.e., the channels that have mutual information close to $1$) behave. The asymptotic analysis of the process $\{Z_n\}$ is closely related to the “partial distances” of the kernel matrix $G$:
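For the binary erasure channel (BEC) the Bhattacharyya process has the closed form given later in (12), which makes the identity $\mathbb{P}(Z_n \le z) = F(n,z)$ easy to explore numerically. The sketch below (an illustration, not from the paper) samples the process for a BEC with erasure probability $0.5$ and estimates the fraction of nearly noiseless channels, which approaches $I(W) = 0.5$ as $n$ grows:

```python
import random

def bec_Z_sample(eps, n, rng):
    """One sample of the Bhattacharyya process Z_n for W = BEC(eps):
    Z -> Z^2 when B = 0, and Z -> 2Z - Z^2 when B = 1, as in (12)."""
    z = eps
    for _ in range(n):
        if rng.random() < 0.5:   # B_n = 0
            z = z * z
        else:                    # B_n = 1
            z = 2 * z - z * z
    return z

rng = random.Random(0)
eps, n, trials = 0.5, 20, 20000
# empirical F(n, z) at z = 1e-6: fraction of nearly noiseless channels
frac_good = sum(bec_Z_sample(eps, n, rng) <= 1e-6 for _ in range(trials)) / trials
```

At finite $n$ a small fraction of channels is still unpolarized, so `frac_good` sits slightly below $I(W)$; the theorems below quantify exactly how this finite-$n$ behavior depends on the threshold and the rate.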

###### Definition 1 (Partial Distances)

We define the partial distances $D_i$, $i = 0, \dots, \ell-1$, of an $\ell \times \ell$ matrix $G = [g_0^{\mathrm{T}}, \dots, g_{\ell-1}^{\mathrm{T}}]^{\mathrm{T}}$ (the $g_i$'s are row vectors) as

$$D_i(G) \triangleq d_H\bigl(\{g_i\},\ \langle g_{i+1}, \dots, g_{\ell-1}\rangle\bigr), \quad i = 0, \dots, \ell-2,$$
$$D_{\ell-1}(G) \triangleq d_H\bigl(\{g_{\ell-1}\},\ \{0\}\bigr),$$

where $d_H(A, B)$ denotes the (minimum) Hamming distance between two sets of binary sequences, and where $\langle g_{i+1}, \dots, g_{\ell-1}\rangle$ denotes the linear space spanned by $g_{i+1}, \dots, g_{\ell-1}$. The exponent of $G$ is then defined as

$$E(G) = \frac{1}{\ell} \sum_{i=0}^{\ell-1} \log_\ell D_i(G),$$

and the second exponent of $G$ is defined as

$$V(G) = \frac{1}{\ell} \sum_{i=0}^{\ell-1} \bigl(\log_\ell D_i(G) - E(G)\bigr)^2.$$

In other words, the exponent and the second exponent are the mean and the variance of the random variable $\log_\ell D_B(G)$, where $B$ is a random variable taking a value in $\{0, \dots, \ell-1\}$ with uniform probability. It should be noted that the invertibility of $G$ implies the partial distances to be strictly positive, making the exponent finite. Note also that the condition for a matrix to be polarizing, that none of the column permutations of $G$ is upper triangular, implies $\max_i D_i(G)$ to be strictly greater than 1, yielding $E(G)$ to be strictly positive.
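A brute-force computation of the partial distances, and of $E(G)$ and $V(G)$, can be sketched as follows (an illustration; for Arıkan's $2 \times 2$ kernel it gives $D = (1, 2)$, $E = 1/2$, $V = 1/4$):

```python
from itertools import product
import math

def partial_distances(G):
    """Partial distances D_i(G): D_i is the Hamming distance from row g_i
    to the span of rows g_{i+1}, ..., g_{l-1} (D_{l-1} = weight of g_{l-1})."""
    l = len(G)
    D = []
    for i in range(l):
        tail = G[i + 1:]
        best = None
        # enumerate all codewords of the span of the remaining rows
        for coeffs in product([0, 1], repeat=len(tail)):
            c = [sum(a * row[j] for a, row in zip(coeffs, tail)) % 2
                 for j in range(len(G[i]))]
            d = sum((gi + cj) % 2 for gi, cj in zip(G[i], c))
            best = d if best is None else min(best, d)
        D.append(best)
    return D

def exponents(G):
    """Exponent E(G) and second exponent V(G): mean and variance of log_l D_i(G)."""
    l = len(G)
    logs = [math.log(d, l) for d in partial_distances(G)]
    E = sum(logs) / l
    V = sum((x - E) ** 2 for x in logs) / l
    return E, V

G2 = [[1, 0], [1, 1]]   # Arikan kernel: D = [1, 2], E = 1/2, V = 1/4
```

The enumeration is exponential in $\ell$, which is fine for the small kernels considered here.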

The following theorem partially characterizes the behavior of the process $\{Z_n\}$ around $z = 0$.

###### Theorem 2 ([2] and [5])

Let $W$ be a BMS channel and assume that we are using as the kernel matrix an $\ell \times \ell$ matrix $G$ with exponent $E(G)$. For any fixed $\beta$ with $0 < \beta < E(G)$,

$$\lim_{n \to \infty} \mathbb{P}\bigl(Z_n \le 2^{-\ell^{n\beta}}\bigr) = I(W).$$

Conversely, for any fixed $\beta$ with $\beta > E(G)$,

$$\lim_{n \to \infty} \mathbb{P}\bigl(Z_n \ge 2^{-\ell^{n\beta}}\bigr) = 1.$$

An important consequence of Theorem 2 is that, as the behavior of $P_e^{\mathrm{SC}}(N,R)$ when using polar codes with the kernel matrix $G$, of block-length $N = \ell^n$ and rate $R < I(W)$ under SC decoding is asymptotically the same as that of the bounds in (4), the probability of error behaves as $2^{-\ell^{n(E(G)+o(1))}} = 2^{-N^{E(G)+o(1)}}$ as $N$ tends to infinity. A noteworthy point about this result is that the asymptotic analysis of the probability of error is rate-independent, provided that the rate is less than the capacity $I(W)$. In this paper, we provide a refined estimate for $\mathbb{P}(Z_n \le z)$. Specifically, we derive the asymptotic relation between $\mathbb{P}(Z_n \le z)$ and the rate of transmission $R$. From this we derive the asymptotic behavior of $P_e^{\mathrm{SC}}(N,R)$ and its dependence on the rate of transmission. We further derive lower bounds on the error probability when we perform MAP decoding instead of SC decoding.

An important point to mention here is that the results of this paper are obtained in the asymptotic limit of the block-length for any fixed rate value $R < I(W)$. Considering the regime where $R$ also varies with the block-length is a problem of different interest, for which we refer the reader to [21].

The outline of the paper is as follows. In Section II we state the main results of the paper. In Section III we first define several auxiliary processes and provide bounds on their asymptotic behavior. Using these bounds, we then prove the main results. We discuss the implications of the proofs in selecting the set of channel indices in Section IV. It should be noted that in the following the logarithms are in base 2 unless explicitly stated otherwise.

## II Main Results

###### Theorem 3

Consider an $\ell \times \ell$ polarizing kernel matrix $G$. For a BMS channel $W$, let $\{Z_n\}$ be the Bhattacharyya process of $W$. Let $Q(t) \triangleq \frac{1}{\sqrt{2\pi}}\int_t^{\infty} e^{-u^2/2}\,\mathrm{d}u$ be the complementary Gaussian distribution function and $Q^{-1}$ be its inverse function.

1. For $R < I(W)$,

$$\lim_{n\to\infty} \mathbb{P}\Bigl(Z_n \le 2^{-\ell^{\,nE(G) + \sqrt{nV(G)}\, Q^{-1}\left(\frac{R}{I(W)}\right) + f(n)}}\Bigr) = R.$$
2. Let $H$ denote the matrix obtained from the transpose $G^{\mathrm{T}}$ of $G$ by reversing the order of its rows (${}^{\mathrm{T}}$ denotes the transpose), and assume that $H$ is polarizing. Then, for $R' < 1 - I(W)$ we have,

$$\lim_{n\to\infty} \mathbb{P}\Bigl(Z_n \ge 1 - 2^{-\ell^{\,nE(H) + \sqrt{nV(H)}\, Q^{-1}\left(\frac{R'}{1-I(W)}\right) + f(n)}}\Bigr) = R'.$$

Here, $f(n)$ is any function satisfying $f(n) = o(\sqrt{n})$. ∎

Discussion: Theorem 3 characterizes the asymptotic behavior of $\mathbb{P}(Z_n \le z)$ and refines Theorem 2 in the following way. According to Theorem 2, if we transmit at a rate below the channel capacity, then the quantity $\log_\ell(-\log_2 Z_n)$ scales like $nE(G)$. The first part of Theorem 3 gives one further term by stating that $\log_\ell(-\log_2 Z_n)$ is in fact $nE(G) + \sqrt{nV(G)}\,Q^{-1}\bigl(\frac{R}{I(W)}\bigr) + o(\sqrt{n})$. The second part of Theorem 3, on the other hand, characterizes the asymptotic behavior of $\mathbb{P}(Z_n \le z)$ near $z = 1$, which is important in applications of polar codes for source coding [12]. Put together, Theorem 3 characterizes the scaling of the error probability of polar codes with the SC decoder. Similar results hold for the case of the MAP decoder.
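Numerically, the refined exponent in part 1 is easy to evaluate. The sketch below (an illustration under assumptions not taken from the paper: Arıkan's kernel with $E = 1/2$, $V = 1/4$, and a channel with $I(W) = 1/2$) computes the leading terms $nE(G) + \sqrt{nV(G)}\,Q^{-1}(R/I(W))$, inverting $Q$ by bisection:

```python
import math

def Q(t):
    """Gaussian tail probability Q(t) = P(N(0,1) > t)."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))

def Q_inv(p):
    """Invert the strictly decreasing Q by bisection on [-40, 40]."""
    lo, hi = -40.0, 40.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if Q(mid) > p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# assumed setting: Arikan kernel (E = 1/2, V = 1/4), channel with I(W) = 1/2
E, V, I = 0.5, 0.25, 0.5

def predicted_exponent(n, R):
    """Leading terms n E(G) + sqrt(n V(G)) Q^{-1}(R / I(W)) of Theorem 3, part 1."""
    return n * E + math.sqrt(n * V) * Q_inv(R / I)
```

For $R = I(W)/2$ the correction term vanishes (since $Q^{-1}(1/2) = 0$) and the exponent reduces to $nE(G)$, recovering the rate-independent scaling of Theorem 2; lower rates give a strictly larger exponent, i.e., a faster decay of $Z_n$.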

###### Theorem 4

Let $W$ be a BMS channel and let $R < I(W)$ be the rate of transmission. Consider an $\ell \times \ell$ kernel matrix $G$ with the Hamming weights of its rows $w_0(G), \dots, w_{\ell-1}(G)$ and define

$$E_w(G) = \frac{1}{\ell}\sum_{i=0}^{\ell-1} \log_\ell w_i(G), \qquad V_w(G) = \frac{1}{\ell}\sum_{i=0}^{\ell-1}\bigl(\log_\ell w_i(G) - E_w(G)\bigr)^2. \tag{8}$$

If we use polar codes of length $N = \ell^n$ and rate $R$ for transmission, then the probability of error under MAP decoding, $P_e^{\mathrm{MAP}}(N,R)$, satisfies

$$\log_\ell\Bigl(-\log\bigl(P_e^{\mathrm{MAP}}(N,R)\bigr)\Bigr) \le nE_w(G) + \sqrt{nV_w(G)}\,Q^{-1}\Bigl(\frac{R}{I(W)}\Bigr) + o(\sqrt{n}). \tag{9}$$

Discussion: Let $G$ be according to Arıkan's original construction [1], i.e., $G = \left[\begin{smallmatrix}1 & 0\\ 1 & 1\end{smallmatrix}\right]$, which is, up to column permutations, the only polarizing matrix for the case $\ell = 2$. For this $G$, we have $D_i(G) = w_i(G)$ for $i \in \{0,1\}$ and hence $E(G) = E_w(G)$ and $V(G) = V_w(G)$. Hence, the block error probability for the SC decoder and the MAP block error probability share the same asymptotic behavior according to Theorems 3 and 4. For a general matrix $G$, however, one may have strict inequality $E(G) < E_w(G)$, in which case one still has an asymptotic gap between the error probability with SC decoding and the lower bound of MAP error probability. Whether or not this gap can be filled or made narrower is an open problem.
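The weight-based exponents in (8) are straightforward to compute from the rows of $G$. As an illustration (the $3 \times 3$ lower-triangular kernel below is a hypothetical example, not taken from the paper; a short calculation gives its partial distances as $(1, 1, 3)$, so $E(G_3) = 1/3$ while $E_w(G_3) \approx 0.544$, exhibiting the gap $E < E_w$ discussed above):

```python
import math

def weight_exponents(G):
    """E_w(G) and V_w(G) from (8): mean and variance of log_l w_i(G),
    where w_i(G) is the Hamming weight of row i."""
    l = len(G)
    logs = [math.log(sum(row), l) for row in G]
    Ew = sum(logs) / l
    Vw = sum((x - Ew) ** 2 for x in logs) / l
    return Ew, Vw

G2 = [[1, 0],
      [1, 1]]                      # Arikan kernel: row weights (1, 2)
G3 = [[1, 0, 0],
      [1, 1, 0],
      [1, 1, 1]]                   # hypothetical 3x3 kernel: row weights (1, 2, 3)

Ew2, Vw2 = weight_exponents(G2)    # coincides with (E(G2), V(G2)) = (1/2, 1/4)
Ew3, Vw3 = weight_exponents(G3)    # (1 + log_3 2)/3, strictly above E(G3) = 1/3
```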

## III Proof of the Main Result

### III-A Preliminaries

Let $\{B_n\}_{n \ge 0}$ be a sequence of i.i.d. random variables that take their values in $\{0, \dots, \ell-1\}$ with uniform probability, i.e., $\mathbb{P}(B_n = j) = 1/\ell$ for $j \in \{0, \dots, \ell-1\}$. Let $(\Omega, \mathcal{F}, \mathbb{P})$ denote the probability space generated by the sequence $\{B_n\}_{n \ge 0}$ and let $(\Omega_n, \mathcal{F}_n, \mathbb{P}_n)$ be the probability space generated by $(B_0, \dots, B_{n-1})$. We now couple the polarization process $\{W_n\}$ with the sequence $\{B_n\}$ via (6). Consequently, the Bhattacharyya process $\{Z_n\}$ is coupled with the sequence $\{B_n\}$. By using the bounds given in [5, Chapter 5] we have the following relationship between the Bhattacharyya parameter of $W$ and that of $W^i$: Recall that $D_0(G), \dots, D_{\ell-1}(G)$ are the partial distances of the matrix $G$. We have [5]

$$Z(W)^{D_i(G)} \le Z(W^i) \le 2^{\ell - i}\, Z(W)^{D_i(G)}. \tag{10}$$

Also let $H$ be the matrix defined in Theorem 3, with partial distances $D_i(H)$. Assuming that $H$ is polarizing,

$$\bigl(1 - Z(W)\bigr)^{D_i(H)} \le 1 - Z(W^i) \le 2^{2i+1}\bigl(1 - Z(W)\bigr)^{D_i(H)}. \tag{11}$$

### III-B Proof of Theorem 3

We first provide an intuitive picture behind the result of Theorem 3. For simplicity, assume $\ell = 2$ with $G = \left[\begin{smallmatrix}1 & 0\\ 1 & 1\end{smallmatrix}\right]$, and let the channel $W$ be a binary erasure channel (BEC) with erasure probability $\epsilon$. The capacity of this channel is $I(W) = 1 - \epsilon$. For such a channel, the Bhattacharyya process has a simple closed form [1] as $Z_0 = \epsilon$ and

$$Z_{n+1} = \begin{cases} Z_n^2, & B_n = 0,\\ 2Z_n - Z_n^2, & B_n = 1.\end{cases} \tag{12}$$

We know from Section I-C that as $n$ grows large, $Z_n$ tends almost surely to a $\{0,1\}$-valued random variable $Z_\infty$ with $\mathbb{P}(Z_\infty = 0) = I(W)$. The asymptotic behavior of $\mathbb{P}(Z_n \le z)$ can be explained roughly by considering the behavior of the process $\log_2(1/Z_n)$. In particular, it is clear from (12) that at time $n$, $\log_2(1/Z_n)$ is either doubled (when $B_n = 0$), or decreased by at most $1$ (when $B_n = 1$). Also, observe that once $\log_2(1/Z_n)$ becomes sufficiently large, subtracting $1$ from it has negligible effect compared with the doubling operation. Now assume that $m$ is a sufficiently large number. Conditioned on the event that $\log_2(1/Z_m)$ is a very large value (or equivalently, the value of $Z_m$ is very close to $0$: this happens with probability very close to $I(W)$), for $n > m$ the process $\log_2(1/Z_n)$ evolves each time by being doubled if $B_n = 0$ or remaining roughly the same if $B_n = 1$. We can then use the central limit theorem to characterize the asymptotic behavior of $\log_2\log_2(1/Z_n)$ for $n > m$.
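To see the "negligible effect" claim concretely, here is a small numeric sketch (an illustration, not from the paper) of one step of (12) starting from a tiny $Z$:

```python
import math

def bec_step(z, b):
    """One step of the BEC recursion (12): Z -> Z^2 if b = 0, else 2Z - Z^2."""
    return z * z if b == 0 else 2 * z - z * z

z = 2.0 ** -100                         # log2(1/z) = 100
double = -math.log2(bec_step(z, 0))     # squaring doubles log2(1/Z): 100 -> 200
minus1 = -math.log2(bec_step(z, 1))     # 2z - z^2 loses at most one bit: 100 -> ~99
```

So, once $Z_m$ is tiny, $\log_2\log_2(1/Z_n)$ increases by $1$ essentially exactly when $B_n = 0$, i.e., it behaves like a sum of i.i.d. Bernoulli($1/2$) variables; this is where the central limit theorem, with the constants $E(G) = 1/2$ and $V(G) = 1/4$, enters.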

The proof of Theorem 3 is done by making the above intuitive steps rigorous for a general BMS channel $W$ and a polarizing kernel matrix $G$. In a slightly more general setting, we study the asymptotic properties of any generic process $\{X_n\}$ satisfying the conditions (c1)–(c4) defined as follows.

###### Definition 5

Let $S$ be a random variable taking values in $[1, \infty)$. Assume that the expectation and the variance of $\log S$ exist and are denoted by $E[\log S]$ and $V[\log S]$, respectively. Assume that $S_0, S_1, \dots$ are i.i.d. samples of $S$. Let $\{X_n\}_{n \ge 0}$ be a $[0,1]$-valued random process satisfying the following conditions:

• (c1) There exists a random variable $X_\infty$ such that $\lim_{n\to\infty} X_n = X_\infty$ holds almost surely.

• (c2) $X_n \ge X_{n-1}^{S_{n-1}}$.

• (c3) There exists a constant $c > 1$ such that $X_n \le c\, X_{n-1}^{S_{n-1}}$ holds.

• (c4) $S_n$ is independent of $(X_0, X_1, \dots, X_n)$ for every $n \ge 0$.

The random processes $\{Z_n\}$ and $\{1 - Z_n\}$ satisfy the above four conditions by letting $S = D_B(G)$ and $S = D_B(H)$, respectively, where $B$ is uniformly distributed on $\{0, \dots, \ell-1\}$. The fact that these processes satisfy the condition (c1) has been proved in [5, Lemma 5.4], and the result reads that if $G$ is polarizing, then $Z_\infty$ takes only the values 0 and 1, with probabilities $I(W)$ and $1 - I(W)$, respectively. Conditions (c2) and (c3) also hold because of (10) and (11).

Our objective now is to prove that for such a process $\{X_n\}$, we have

$$\lim_{n\to\infty}\mathbb{P}\Bigl(X_n \le 2^{-2^{\,nE[\log S] + t\sqrt{nV[\log S]} + f(n)}}\Bigr) = \mathbb{P}(X_\infty = 0)\,Q(t), \tag{13}$$

where $f(n)$ is any function such that $f(n) = o(\sqrt{n})$ holds. The results of Theorem 3 then follow by noting that $\mathbb{P}(Z_\infty = 0) = I(W)$ and $\mathbb{P}(Z_\infty = 1) = 1 - I(W)$ hold, and by substituting $t = Q^{-1}\bigl(\frac{R}{I(W)}\bigr)$ and $t = Q^{-1}\bigl(\frac{R'}{1 - I(W)}\bigr)$, respectively, into (13).

We prove (13) by showing the two inequalities obtained by replacing the equality in (13) by inequality in both directions. As the first step we have:

###### Lemma 6

Let $\{X_n\}$ be a random process satisfying (c1), (c3) and (c4). For any $t \in \mathbb{R}$,

$$\liminf_{n\to\infty}\mathbb{P}\Bigl(X_n \le 2^{-2^{\,nE[\log S] + t\sqrt{nV[\log S]} + f(n)}}\Bigr) \ge \mathbb{P}(X_\infty = 0)\,Q(t).$$
###### Proof:

Without loss of generality, we can assume that the constant $c$ in condition (c3) satisfies $c > 2$ (so that $\log\log c > 0$). Define the process $\{L_n\}$ as $L_n \triangleq \log X_n$. From (c3), we have

$$L_n \le \log c + S_{n-1} L_{n-1},$$

and by applying the above relation recursively, for $m < n$ we obtain

$$L_n \le \Bigl(\sum_{j=m}^{n-1}\ \prod_{i=j+1}^{n-1} S_i\Bigr)\log c + \Bigl(\prod_{i=m}^{n-1} S_i\Bigr) L_m \le \Bigl(\prod_{i=m}^{n-1} S_i\Bigr)\bigl((n-m)\log c + L_m\bigr). \tag{14}$$

Fix $\beta \in (0, E[\log S])$ and let

$$m \triangleq \bigl\lceil (\log n + \log\log c)/\beta \bigr\rceil. \tag{15}$$

Conditioned on the event $D_m(\beta) \triangleq \{L_m \le -2^{\beta m}\}$, by using (14) we obtain

$$L_n \le -\Bigl(\prod_{i=m}^{n-1} S_i\Bigr)\, m \log c.$$

Let the event $H_m^{n-1}(t)$ be defined as

$$H_m^{n-1}(t) \triangleq \Bigl\{\ \sum_{i=m}^{n-1}\log S_i \ge (n-m)E[\log S] + t\sqrt{(n-m)V[\log S]} + f(n-m)\Bigr\},$$

where $f(\cdot)$ is any function such that $f(n) = o(\sqrt{n})$ holds. Conditioned on $D_m(\beta)$ and $H_m^{n-1}(t)$, we have

$$\log(-L_n) \ge \log m + \log\log c + (n-m)E[\log S] + t\sqrt{(n-m)V[\log S]} + f(n-m).$$

Hence,

$$\mathbb{P}\Bigl(\log(-L_n) \ge \log m + \log\log c + (n-m)E[\log S] + t\sqrt{(n-m)V[\log S]} + f(n-m)\Bigr) \ge \mathbb{P}\bigl(D_m(\beta)\cap H_m^{n-1}(t)\bigr) = \mathbb{P}\bigl(D_m(\beta)\bigr)\,\mathbb{P}\bigl(H_m^{n-1}(t)\bigr).$$

The last equality follows from the independence condition (c4).

Note that taking the limit $n \to \infty$ also implies $m \to \infty$ and $m = o(\sqrt{n})$ via (15). From Theorem 10 (in Appendix), we have $\lim_{m\to\infty}\mathbb{P}(D_m(\beta)) = \mathbb{P}(X_\infty = 0)$. We also have $\lim_{n\to\infty}\mathbb{P}(H_m^{n-1}(t)) = Q(t)$ due to the central limit theorem for the i.i.d. sequence $\{\log S_i\}$. We consequently have

$$\liminf_{n\to\infty}\mathbb{P}\bigl(\log(-\log X_n) \ge nE[\log S] + t\sqrt{nV[\log S]} + f(n)\bigr) \ge \mathbb{P}(X_\infty = 0)\,Q(t)$$

for any $f(n) = o(\sqrt{n})$, which is equivalent to the claim. ∎

The second step of the proof of (13) is to prove the other direction of the inequality. We have:

###### Lemma 7

Let $\{X_n\}$ be a random process satisfying (c1), (c2) and (c4). For any $t \in \mathbb{R}$,

$$\limsup_{n\to\infty}\mathbb{P}\Bigl(X_n \le 2^{-2^{\,nE[\log S] + t\sqrt{nV[\log S]} + f(n)}}\Bigr) \le \mathbb{P}(X_\infty = 0)\,Q(t).$$
###### Proof:

Let $L_n \triangleq \log X_n$. From (c2), for $m < n$ we have

$$L_n \ge S_{n-1} L_{n-1} \ge \Bigl(\prod_{i=m}^{n-1} S_i\Bigr) L_m,$$

and thus

$$\log(-L_n) \le \sum_{i=m}^{n-1}\log S_i + \log(-L_m). \tag{16}$$

Hence, for any fixed $m$ and any $\delta \in (0,1)$,

$$\limsup_{n\to\infty}\mathbb{P}\bigl(\log(-L_n) > nE[\log S] + t\sqrt{nV[\log S]} + f(n)\bigr) \le \limsup_{n\to\infty}\mathbb{P}\bigl(\log(-L_n) > nE[\log S] + t\sqrt{nV[\log S]} + f(n),\ X_m \le \delta\bigr) + \limsup_{n\to\infty}\mathbb{P}\bigl(\log(-L_n) > nE[\log S] + t\sqrt{nV[\log S]} + f(n),\ X_m > \delta\bigr). \tag{17}$$

The first term on the right-hand side of (17) is upper bounded as

$$\limsup_{n\to\infty}\mathbb{P}\bigl(\log(-L_n) > nE[\log S] + t\sqrt{nV[\log S]} + f(n),\ X_m \le \delta\bigr) \overset{\text{(a)}}{\le} \limsup_{n\to\infty}\mathbb{P}\Bigl(\sum_{i=m}^{n-1}\log S_i + \log(-L_m) > nE[\log S] + t\sqrt{nV[\log S]} + f(n),\ X_m \le \delta\Bigr) \overset{\text{(b)}}{=} Q(t)\,\mathbb{P}(X_m \le \delta),$$

where (a) follows from (16), and where (b) follows from (c4) and the central limit theorem. The second term on the right-hand side of (17) is upper bounded as

$$\limsup_{n\to\infty}\mathbb{P}\bigl(\log(-L_n) > nE[\log S] + t\sqrt{nV[\log S]} + f(n),\ X_m > \delta\bigr) \le \limsup_{n\to\infty}\mathbb{P}\bigl(X_n \le \delta^2,\ X_m > \delta\bigr) \overset{\text{(a)}}{\le} \mathbb{P}\bigl(X_\infty \le \delta^2,\ X_m > \delta\bigr),$$

where (a) follows from (c1). Applying these bounds to (17), for any $\delta \in (0,1)$, we have

$$\limsup_{n\to\infty}\mathbb{P}\bigl(\log(-L_n) > nE[\log S] + t\sqrt{nV[\log S]} + f(n)\bigr) \le \limsup_{m\to\infty}\bigl\{Q(t)\,\mathbb{P}(X_m \le \delta) + \mathbb{P}(X_\infty \le \delta^2,\ X_m > \delta)\bigr\} \le Q(t)\,\mathbb{P}(X_\infty \le \delta) + \mathbb{P}(X_\infty \le \delta^2,\ X_\infty \ge \delta) = Q(t)\,\mathbb{P}(X_\infty \le \delta).$$

By letting $\delta \to 0$, we obtain the result. ∎

### III-C Proof of Theorem 4

###### Lemma 8

The MAP error probability of a linear code $\mathcal{C}$ over a BMS channel $W$ is lower bounded by $\frac{1}{4}Z(W)^{2d}$, where $d$ is the minimum distance of $\mathcal{C}$.

###### Proof:

Within this proof, the notation $\mathbb{P}(A)$ should be understood as generically denoting the probability of an event $A$. Since the MAP error probability of a linear code over a BMS channel does not depend on the transmitted codeword, we can assume without loss of generality that the transmitted codeword is the all-zero codeword, which is denoted by $\mathbf{0}$. Let $Y$ be the random variable corresponding to a received sequence when $\mathbf{0}$ is transmitted and let $P(y \mid c')$ be the likelihood of a codeword $c'$ given a received sequence $y$. Since MAP and ML are equivalent for equiprobable codewords, the MAP error probability is lower bounded as

$$\mathbb{P}\Bigl(\,\bigcup_{c' \in \mathcal{C}\setminus\{\mathbf{0}\}}\bigl\{P(Y \mid c') \ge P(Y \mid \mathbf{0})\bigr\}\Bigr) \ge \mathbb{P}\bigl(P(Y \mid c) \ge P(Y \mid \mathbf{0})\bigr) = P_e\bigl(W^{\otimes w(c)}\bigr) \overset{\text{(a)}}{\ge} \frac{1}{2}\Bigl(1 - \sqrt{1 - Z\bigl(W^{\otimes w(c)}\bigr)^2}\Bigr) = \frac{1}{2}\Bigl(1 - \sqrt{1 - Z(W)^{2w(c)}}\Bigr) \ge \frac{1}{4}Z(W)^{2w(c)}.$$

Here, $c$ is an arbitrary codeword in the set $\mathcal{C}\setminus\{\mathbf{0}\}$ and $w(c)$ denotes its Hamming weight. Also $W^{\otimes m}$ denotes the $m$-parallel channel of $W$, which has the following rule

$$W^{\otimes m}(y_1^m \mid x) \triangleq \prod_{i=1}^{m} W(y_i \mid x). \tag{18}$$

Step (a) follows from (1). Choosing $c$ to be a minimum-weight codeword, i.e., $w(c) = d$, yields the claimed bound.

It should be noted that the lower bound in the proof of Lemma 8 is not asymptotically tight in terms of the conventional error exponents. It is possible to obtain tighter lower bounds via more elaborate arguments as in [22, Chapter 4]. However, since we are only interested in the behavior of double exponents, the above bound turns out to be sufficient for the purpose of proving Theorem 4.
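The last inequality in the chain above, $\frac{1}{2}(1 - \sqrt{1 - z^2}) \ge \frac{1}{4}z^2$, and the resulting bound of Lemma 8 can be sanity-checked numerically (an illustration, not from the paper):

```python
import math

def map_lower_bound(Z, d):
    """Lemma 8: P_e^MAP >= (1/4) Z(W)^(2d) for a linear code with
    minimum distance d over a channel with Bhattacharyya parameter Z."""
    return 0.25 * Z ** (2 * d)

# check (1/2)(1 - sqrt(1 - z^2)) >= z^2 / 4 on a grid, with z = Z(W)^d
for k in range(101):
    z = k / 100
    assert 0.5 * (1 - math.sqrt(1 - z * z)) >= z * z / 4
```

The elementary inequality holds since $(1 - z^2/2)^2 \ge 1 - z^2$, which is exactly the step that converts the ML lower bound of (1) into the clean double-exponential form used below.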

In order to prove Theorem 4, from Lemma 8 it is sufficient to prove that given any $\epsilon > 0$ there exists an integer $n_0$ such that for $n \ge n_0$,

$$\log_\ell\bigl(d(n,R)\bigr) \le nE_w(G) + \sqrt{nV_w(G)}\Bigl(Q^{-1}\Bigl(\frac{R}{I(W)}\Bigr) + \epsilon\Bigr),$$

where $d(n,R)$ is the minimum distance of a polar code using the kernel matrix $G$, with block-length $\ell^n$ and rate $R$. Since a row weight of the generator matrix is an upper bound on the minimum distance of a linear code, and since the weight of the $i$th row of $G^{\otimes n}$ is equal to $\prod_{j=1}^{n} w_{i_j}(G)$, where $i_j$ is the $j$th digit of the $\ell$-ary representation of $i$, it is therefore sufficient to prove that given any $\epsilon > 0$, there exists an integer $n_0$ such that for every $n \ge n_0$, a polar code of block-length $\ell^n$ and rate $R$ with set of chosen indices $\mathcal{I}$ contains an index $i \in \mathcal{I}$ for which the inequality

$$\sum_{j=1}^{n}\log_\ell w_{i_j}(G) \le nE_w(G) + \sqrt{nV_w(G)}\Bigl(Q^{-1}\Bigl(\frac{R}{I(W)}\Bigr) + \epsilon\Bigr) \tag{19}$$

holds. In the proof of Theorem 3, one can observe that the key idea is to apply the central limit theorem to the sum $\sum_i \log S_i$. In the same sense, in order to prove Theorem 4 we consider the random process $\{\log_\ell w_{B_n}(G)\}$ in addition to $\{Z_n\}$, which is driven by $\{\log_\ell D_{B_n}(G)\}$. Note that the processes $\{\log_\ell D_{B_n}(G)\}$ and $\{\log_\ell w_{B_n}(G)\}$ are in general correlated since they are both coupled to the same process $\{B_n\}$. These processes are equal with probability one in the special case where $D_i(G) = w_i(G)$ holds for all $i$. In the same manner as the proof of Theorem 3, we move on to a more abstract setting, by introducing a random variable $U$ taking values in $[1, \infty)$, for which we assume that the expectation and the variance of $\log U$ exist and are denoted by $E[\log U]$ and $V[\log U]$, respectively, and by letting $U_0, U_1, \dots$ be i.i.d. drawings of $U$ (drawn jointly with $S_0, S_1, \dots$), where $S$ is defined as in Definition 5. Let $\{X_n\}$ be a random process such that $(\{X_n\}, \{S_n\})$ satisfies the conditions (c1) to (c4) together with the additional condition (c5) for $\{U_n\}$.

• (c5) $U_n$ is independent of $(X_0, X_1, \dots, X_n)$ for every $n \ge 0$.

It is easy to see that the stochastic process of the triplets $\{(Z_n, D_{B_n}(G), w_{B_n}(G))\}$ satisfies (c1) to (c5). We first note from the proof of Theorem 3 that for any generic process satisfying (c1) to (c5), the relation (13) holds for any function $f(n) = o(\sqrt{n})$. We also claim that for real numbers $t$ and $v$ such that $t < v$, and for any functions $f(n)$ and $g(n)$ of order $o(\sqrt{n})$, we have

$$\limsup_{n\to\infty}\mathbb{P}\Bigl(X_n \le 2^{-2^{\,nE[\log S] + t\sqrt{nV[\log S]} + f(n)}},\ \sum_{i=0}^{n-1}\log U_i > nE[\log U] + v\sqrt{nV[\log U]} + g(n)\Bigr) \le \mathbb{P}(X_\infty = 0)\,Q(v). \tag{20}$$

Using the relations (13) and (20) it is easy to see that for generator matrices of polar codes with rate $R < I(W)$, the number of rows satisfying (19) is asymptotically proportional to the block-length, and hence there exists at least one row satisfying (19). We now turn to the proof of (20).

###### Lemma 9

Let $(\{X_n\}, \{S_n\}, \{U_n\})$ be a random process satisfying (c1) to (c5). For any $t, v \in \mathbb{R}$,

$$\lim_{n\to\infty}\mathbb{P}\Bigl(X_n \le 2^{-2^{\,nE[\log S] + t\sqrt{nV[\log S]} + f(n)}},\ \sum_{i=0}^{n-1}\log U_i > nE[\log U] + v\sqrt{nV[\log U]} + g(n)\Bigr) = \mathbb{P}(X_\infty = 0)\,\mathbb{P}(A_S \ge t,\ A_U \ge v),$$

where $(A_S, A_U)$ are Gaussian random variables of mean zero whose covariance matrix is equal to that of

$$\left(\frac{\log S - E[\log S]}{\sqrt{V[\log S]}},\ \frac{\log U - E[\log U]}{\sqrt{V[\log U]}}\right).$$

The proof of this lemma is the same as the proofs of Lemma 6 and Lemma 7. The difference is that the central limit theorem is replaced by the two-dimensional central limit theorem. From Lemma 9 and $\mathbb{P}(A_S \ge t, A_U \ge v) \le \mathbb{P}(A_U \ge v) = Q(v)$, the relation (20) is obtained. This completes the proof of Theorem 4. ∎

Remark: Let . For this choice of , we have