
# Asymmetric Error Correction and Flash-Memory Rewriting using Polar Codes

Eyal En Gad, Yue Li, Joerg Kliewer, Michael Langberg,
Anxiao (Andrew) Jiang and Jehoshua Bruck
The material in this paper was presented in part at the IEEE Int. Symp. on Inform. Theory (ISIT), Honolulu, HI, USA, July 2014 [8]. This work was supported in part by Intellectual Ventures, NSF grants CIF-1218005, CCF-1439465, CCF-1440001 and CCF-1320785, NSF CAREER Award CCF-0747415 and the US-Israel Binational Science Foundation (BSF) under Grant No. 2010075. Eyal En Gad, Yue Li and Jehoshua Bruck are with the California Institute of Technology, Pasadena, CA 91125, {eengad, yli, bruck}@caltech.edu. Joerg Kliewer is with New Jersey Institute of Technology, Newark, NJ 07102, jkliewer@njit.edu. Michael Langberg is with SUNY at Buffalo, Buffalo, NY 14260 and the Open University of Israel, Raanana 43107, Israel, mikel@buffalo.edu. Anxiao (Andrew) Jiang is with Texas A&M University, College Station, TX 77840, ajiang@cse.tamu.edu.
###### Abstract

We propose efficient coding schemes for two communication settings: 1. asymmetric channels, and 2. channels with an informed encoder. These settings are important in non-volatile memories, as well as optical and broadcast communication. The schemes are based on non-linear polar codes, and they build on and improve recent work on these settings. In asymmetric channels, we tackle the exponential storage requirement of previously known schemes, which resulted from the use of large Boolean functions. We propose an improved scheme that achieves the capacity of asymmetric channels with polynomial computational complexity and storage requirement.

The proposed non-linear scheme is then generalized to the setting of channel coding with an informed encoder, using a multicoding technique. We consider specific instances of the scheme for flash memories, which incorporate error-correction capabilities together with rewriting. Since the considered codes are non-linear, they eliminate the requirement of previously known schemes (called polar write-once-memory codes) for shared randomness between the encoder and the decoder. Finally, we mention that the multicoding scheme is also useful for broadcast communication in Marton’s region, improving upon previous schemes for this setting.

## I Introduction

In this paper we make several contributions to the design and analysis of error-correcting codes in two important communication settings: 1. asymmetric channel coding, and 2. channel coding with an informed encoder. Asymmetric channel coding is important for applications such as non-volatile memories, in which the electrical mechanisms are dominantly asymmetric [5]. Another important application is in optical communication, where photons may fail to be detected (a transmitted 1 received as a 0), but a false detection when no photon was sent (a 0 received as a 1) is much less likely [15, Section IX]. Channel coding with an informed encoder is also important for non-volatile memories, since the memory state in these devices affects the fate of writing attempts. Channel coding with an informed encoder is also useful in broadcast communication, where it is used in Marton’s coding scheme to achieve high communication rates (see [7, p. 210]).

The focus of this paper is on polar coding techniques, as they are both highly efficient in terms of communication rate and computational complexity, and are relatively easy to analyze and understand. Polar codes were introduced by Arikan in [1], achieving the symmetric capacity of binary-input memoryless channels. The first task that we consider in this paper is that of point-to-point communication over asymmetric channels. Several polar coding schemes for asymmetric channels were proposed recently, including a pre-mapping using Gallager’s scheme [11, p. 208] and a concatenation of two polar codes [30]. A more direct approach was proposed in [17], which we consider in this paper. A similar approach is also considered in [22]. The scheme in [17] achieves the capacity of asymmetric channels using non-linear polar codes, but it uses large Boolean functions that require storage space that is exponential in the block length. We propose a modification of this scheme that removes the requirement for the Boolean functions, and thus reduces the storage requirement of the encoding and decoding tasks to a linear function of the block length.

The second contribution of this paper is a generalization of the non-linear polar-coding scheme to the availability of channel side information at the encoder. We call this scheme a polar multicoding scheme, and we prove that it achieves the capacity of channels with informed encoders. The capacity of such channels was characterized by Gelfand and Pinsker in [12]. This scheme is useful for non-volatile memories such as flash memories and phase change memories, and for broadcast channels. We focus mainly on the flash memory application.

A prominent characteristic of flash memories is that the response of the memory cells to a writing attempt is affected by the previous content of the memory. This complicates the design of error-correcting schemes, and thus motivates flash systems to “erase” the content of the cells before writing, thereby eliminating its effect. However, the erase operation in flash memories is expensive, and therefore a simple coding scheme that does not require erasures could improve the performance of solid-state drives significantly. We show two instances of the proposed polar multicoding scheme that aim to achieve this goal.

### I-a Relation to Previous Work

The study of channel coding with an informed encoder was initiated by Kuznetsov and Tsybakov [19], with the channel capacity derived by Gelfand and Pinsker [12]. The informed encoding technique of Gelfand and Pinsker was used earlier by Marton to establish an inner bound for the capacity region of broadcast channels [21]. Low-complexity capacity-achieving codes were first proposed for continuous channels, using lattice codes [32]. In discrete channels, the first low-complexity capacity-achieving scheme was proposed using polar codes, for the symmetric special case of information embedding [18, Section VIII.B]. A modification of this scheme for the application of flash memory rewriting was proposed in [4], considering a model called write-once memory. An additional scheme for the application of flash memory, based on randomness extractors, was also proposed recently [10].

Our work is concerned with a setup that is similar to those considered in [4, 10]. An important contribution of the current paper compared to [4, 10] is that our scheme achieves the capacity of a rewriting model that also includes noise, while the schemes in [4, 10] address only the noiseless case. Indeed, error correction is a crucial capability in flash memory systems. Our low-complexity achievability of the noisy capacity relies on a multicoding technique. Compared with [10], the current paper allows an input cost constraint, which is important in rewriting models for maximizing the sum of the code rates over multiple rewriting rounds. Compared with [4], the current paper also improves by removing the requirement for shared randomness between the encoder and the decoder, which limits practical coding performance. The removal of the shared randomness is achieved by the use of non-linear polar codes. An additional coding scheme was proposed during the writing of this paper, which also does not require shared randomness [20]. However, the scheme in [20] considers only the noiseless case, and it is in fact a special case of the scheme in the current paper.

Polar coding for channels with informed encoders was implicitly studied recently in the context of broadcast channels, as the Marton coding scheme for broadcast communication contains an informed encoding instance as an ingredient. In fact, a multicoding technique similar to the one presented in this paper was recently presented for broadcast channels, in [13]. While we were unaware of the result of [13] and developed the scheme independently, this paper has three contributions that do not appear in [13]. First, by using the modified scheme of non-linear polar codes, we reduce the storage requirement from an exponential function of the block length to a linear one. Second, we connect the scheme to the application of data storage and flash-memory rewriting, which was not considered in the previous work. And third, the analysis in [13] holds only for channels whose capacity-achieving distribution forms a certain degraded structure. In this paper we consider a specific noisy rewriting model, whose capacity-achieving distribution forms the required degraded structure, and thereby we show that the scheme achieves the capacity of the considered flash-memory model.

Another recent paper on polar coding for broadcast channels was published by Mondelli et al. [23]. That paper proposed a method, called “chaining”, that makes it possible to bypass the degraded structure requirement. In this paper we connect the chaining method to the flash-memory rewriting application and to our new non-linear polar coding scheme, and apply it to our proposed multicoding scheme. This allows for a linear storage requirement, together with the achievability of the informed encoder capacity and Marton’s inner bound, eliminating the degraded structure requirement. Finally, we show an important instance of the chaining scheme for a specific flash-memory model, and explain the applicability of this instance in flash-memory systems.

The rest of the paper is organized as follows. Section II proposes a new non-linear polar coding scheme for asymmetric channels, which does not require an exponential storage of Boolean functions. Section III proposes a new polar multicoding scheme for channels with informed encoders, including two special cases for the rewriting of flash memories. Finally, Section IV summarizes the paper.

## II Asymmetric Point-to-Point Channels

Notation: For positive integers i ≤ j, let [i : j] denote the set {i, i+1, …, j}, and let [i] denote the set [1 : i]. Given a subset A of [n], let A^c denote the complement of A with respect to [n], where n is clear from the context. Let x_{[n]} denote a vector (x_1, x_2, …, x_n) of length n, and let x_A denote a vector of length |A| obtained from x_{[n]} by deleting the elements with indices in A^c.

Throughout this section we consider only channels with binary input alphabets, since the literature on polar codes with non-binary codeword symbols is relatively immature. However, the results of this section can be extended to non-binary alphabets without much difficulty using the methods described in [24, 25, 26, 28, 29, 27]. The main idea of polar coding is to take advantage of the polarization effect of the Hadamard transform on the entropies of random vectors. Consider a binary-input memoryless channel model with an input random variable (RV) X, an output RV Y, and a pair of conditional probability mass functions (pmfs) p_{Y|X}(·|0), p_{Y|X}(·|1) on the output alphabet 𝒴. Let n be a power of 2 that denotes the number of channel uses, also referred to as the block length. The channel capacity is the tightest upper bound on the code rate at which the probability of decoding error can be made as small as desirable for large enough block length. The channel capacity is given by the maximum mutual information between X and Y.

###### Theorem 1

. (Channel Coding Theorem) [6, Chapter 7] The capacity of a discrete memoryless channel defined by p_{Y|X} is

 C = max_{p_X} I(X; Y).

The Hadamard transform is a multiplication of the random vector X_{[n]} over the field of cardinality 2 with the matrix G_n ≜ G_2^{⊗ log2 n}, where G_2 ≜ (1 0; 1 1) and ⊗ denotes the Kronecker power. In other words, G_n can be described recursively for n ≥ 4 by the block matrix

 G_n = ( G_{n/2}   0
         G_{n/2}   G_{n/2} ).

The matrix G_n transforms X_{[n]} into a random vector U_{[n]} = X_{[n]} G_n, such that the conditional entropy H(U_i | U_{[i−1]}, Y_{[n]}) is polarized. That means that for a fraction close to H(X|Y) of the indices i ∈ [n], the conditional entropy H(U_i | U_{[i−1]}, Y_{[n]}) is close to 1, and for almost all the rest of the indices, it is close to 0. This result was shown by Arikan in [1, 2].
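To make the recursive definition concrete, here is a small pure-Python sketch (the function names are ours, not from the paper) that builds G_n from the Kronecker power, exhibits the block form above, and verifies that G_n is its own inverse over GF(2):

```python
# Sketch (ours): build G_n = G_2^{(Kronecker) log2(n)} and check the block
# form (G_{n/2} 0; G_{n/2} G_{n/2}) plus the involution G_n * G_n = I mod 2.

def kron(a, b):
    """Kronecker product of two 0/1 matrices given as lists of lists."""
    return [[x * y for x in ra for y in rb] for ra in a for rb in b]

def polar_matrix(n):
    """G_n for n a power of 2, with G_2 = [[1, 0], [1, 1]]."""
    g = [[1]]
    while len(g) < n:
        g = kron([[1, 0], [1, 1]], g)   # yields the block form above
    return g

def mat_mul_gf2(a, b):
    """Matrix product over GF(2)."""
    m = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(m)) % 2 for j in range(m)]
            for i in range(m)]

g4 = polar_matrix(4)            # [[1,0,0,0],[1,1,0,0],[1,0,1,0],[1,1,1,1]]
identity = mat_mul_gf2(g4, g4)  # G_4 * G_4 = I over GF(2)
```

The involution property is what makes the statement X_{[n]} = U_{[n]} G_n below immediate.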

###### Theorem 2

. (Polarization Theorem) [2, Theorem 1] Let X_{[n]}, Y_{[n]} and U_{[n]} = X_{[n]} G_n be defined as above. For any δ ∈ (0, 1), let

 H_{X|Y} ≜ {i ∈ [n] : H(U_i | U_{[i−1]}, Y_{[n]}) ∈ (1−δ, 1)},

and

 L_{X|Y} ≜ {i ∈ [n] : H(U_i | U_{[i−1]}, Y_{[n]}) ∈ (0, δ)}.

Then

 lim_{n→∞} |H_{X|Y}|/n = H(X|Y)   and   lim_{n→∞} |L_{X|Y}|/n = 1 − H(X|Y).

Note that H(X|Y) denotes a conditional entropy, while H_{X|Y} denotes a subset of [n]. It is also shown in [1] that the transformation is invertible with G_n^{−1} = G_n, implying X_{[n]} = U_{[n]} G_n. This polarization effect can be used quite simply for the design of a coding scheme that achieves the capacity of symmetric channels with a running time that is polynomial in the block length. The capacity of symmetric channels is achieved by a uniform distribution on the input alphabet, i.e. p_X(0) = p_X(1) = 1/2 [6, Theorem 7.2.1]. Since the input alphabet in this paper is binary, the capacity-achieving distribution gives H(X) = 1, and therefore we have

 lim_{n→∞} (1/n)|L_{X|Y}| = 1 − H(X|Y) = H(X) − H(X|Y) = I(X;Y) = C. (1)

Furthermore, for each index i in L_{X|Y}, the conditional probability p_{U_i|U_{[i−1]}, Y_{[n]}}(1 | U_{[i−1]}, Y_{[n]}) must be close to either 0 or 1 (since the conditional entropy is small by the definition of the set L_{X|Y}). It follows that the RV U_i can be estimated reliably given U_{[i−1]} and Y_{[n]}. This fact motivates the capacity-achieving coding scheme that follows. The encoder creates a vector u_{[n]} by assigning the subvector u_{L_{X|Y}} with the source message, and the subvector u_{L_{X|Y}^c} with uniformly distributed random bits that are shared with the decoder. The randomness sharing is useful for the analysis, but is in fact unnecessary for using the scheme (the proof of this fact is described in [1, Section VI]). The set L_{X|Y}^c is called the frozen set. Equation (1) implies that this coding rate approaches the channel capacity. The decoding is performed iteratively, from index 1 up to n. In each iteration, the decoder estimates the bit u_i using the shared information or using a maximum-likelihood estimation, according to the set membership of the iteration. The estimates of u_{[n]} are denoted by û_{[n]}. The estimates û_i for which i is in L_{X|Y}^c are always successful, since these bits were known to the decoder in advance. The rest of the bits (those in L_{X|Y}) are estimated correctly with high probability (as explained in the beginning of the paragraph), leading to a successful decoding of the entire message with high probability.
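For the binary erasure channel, the polarization of these conditional entropies can be computed exactly, since each synthetic channel is again an erasure channel; the following illustration is ours and is not part of the construction:

```python
# Illustration (ours): for a BEC with erasure probability eps and uniform
# input, H(U_i | U_[i-1], Y_[n]) follows the exact two-branch recursion
# e -> 2e - e^2 (the "-" branch) and e -> e^2 (the "+" branch).

def polarize_bec(eps, levels):
    ents = [eps]
    for _ in range(levels):
        ents = [child for e in ents for child in (2*e - e*e, e*e)]
    return ents

n = 1 << 10                                   # block length n = 1024
ents = polarize_bec(0.5, 10)
low  = sum(1 for e in ents if e < 0.1) / n    # nearly noiseless indices
high = sum(1 for e in ents if e > 0.9) / n    # nearly useless indices
avg  = sum(ents) / n                          # conserved: equals eps
```

Since (2e − e²) + e² = 2e, the average entropy is preserved at every level, while most individual indices drift toward 0 or 1, matching the Polarization Theorem.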

However, this reasoning does not translate directly to asymmetric channels. Remember that the capacity-achieving input distribution of asymmetric channels is in general not uniform (see, for example, [14]), i.e. p_X(1) ≠ 1/2. Since the Hadamard transform is bijective, it follows that the capacity-achieving distribution of the polarized vector U_{[n]} is non-uniform as well. The problem with this fact is that assigning uniform bits of message or shared randomness changes the distribution of U_{[n]}, and consequently also changes the conditional entropies H(U_i | U_{[i−1]}, Y_{[n]}). To manage this situation, the approach proposed in [17], which we adopt in this work, is to make sure that the change in the distribution of U_{[n]} is kept minor, and thus its effect on the probability of decoding error is also minor. To do this, consider the conditional entropies H(U_i | U_{[i−1]}), for i ∈ [n]. Since the polarization happens regardless of the channel model, we can consider a channel for which the output is a deterministic variable, and conclude by Theorem 2 that the entropies H(U_i | U_{[i−1]}) also polarize. For this polarization, a fraction of H(X) of the indices admit a high H(U_i | U_{[i−1]}). To ensure a minor change in the distribution of U_{[n]}, we restrict the assignments of uniform bits of message and shared randomness to the indices with high H(U_i | U_{[i−1]}).
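As a concrete illustration of this non-uniformity (our example, not from the paper), consider a Z-channel in which an input 1 is flipped to 0 with probability 1/2 while an input 0 is always received correctly; a grid search over input distributions shows that the maximizing p_X(1) is 0.4, not 1/2:

```python
from math import log2

def h(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p*log2(p) - (1-p)*log2(1-p)

def z_channel_rate(p1, eps=0.5):
    """I(X;Y) for a Z-channel: input 1 flips to 0 w.p. eps, input 0 is clean.
    Y=1 w.p. p1*(1-eps), so I(X;Y) = h(p1*(1-eps)) - p1*h(eps)."""
    return h(p1 * (1 - eps)) - p1 * h(eps)

# Grid search for the capacity-achieving input distribution.
grid = [i / 10000 for i in range(10001)]
p_star = max(grid, key=z_channel_rate)        # 0.4 for eps = 0.5
capacity = z_channel_rate(p_star)             # h(0.2) - 0.4 ≈ 0.3219 bits
```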

This insight motivates a modified coding scheme. The locations with high entropy H(U_i | U_{[i−1]}) are assigned with uniformly distributed bits, while the rest of the locations are assigned according to the pmf p_{U_i|U_{[i−1]}}. Note that p_X and p_{U_i|U_{[i−1]}} refer to the capacity-achieving distribution of the channel, which does not equal the distribution that the encoding process induces. Similar to the notation of Theorem 2, we denote the set of indices with high entropy H(U_i | U_{[i−1]}) by H_X. To achieve a reliable decoding, we place the message bits in the indices of H_X that can be decoded reliably, meaning that their entropies H(U_i | U_{[i−1]}, Y_{[n]}) are low. So we say that we place the message bits in the intersection H_X ∩ L_{X|Y}. The values of the bits outside this intersection must be known by the decoder in advance for a reliable decoding. Previous work suggested sharing random Boolean functions between the encoder and the decoder, drawn according to the pmf p_{U_i|U_{[i−1]}}, and assigning the values u_i for i ∈ H_X^c according to these functions [13, 17]. However, we note that the storage required for those Boolean functions is exponential in the block length n, and therefore we propose an alternative method.

To avoid the Boolean functions, we divide the complement of H_X ∩ L_{X|Y} into three disjoint sets. First, the indices in the intersection H_X ∩ L_{X|Y}^c are assigned with uniformly distributed random bits that are shared between the encoder and the decoder. As in the symmetric case, this randomness sharing will in fact not be necessary, and a deterministic frozen vector could be shared instead. The rest of the bits of u_{[n]} (those in the set H_X^c) are assigned randomly at the encoder to a value u with probability p_{U_i|U_{[i−1]}}(u | u_{[i−1]}) (where the probability is calculated according to the capacity-achieving distribution of the channel). The indices in H_X^c ∩ L_{X|Y} can be decoded reliably, but not those in H_X^c ∩ L_{X|Y}^c. Fortunately, the set H_X^c ∩ L_{X|Y}^c can be shown to be small (as we will show later), and thus we can transmit those locations separately with a vanishing effect on the code rate. The encoding of the vector u_{[n]} is illustrated in Figure 1.

We note that an alternative method to avoid the Boolean functions was implied in [17]. According to this method, a seed of uniformly random bits is shared in advance between the encoder and the decoder. During encoding and decoding, the bits whose indices are in the set H_X^c are generated as pseudorandom bits from the shared seed, such that each bit is distributed according to p_{U_i|U_{[i−1]}}. Such a scheme could be used in many practical scenarios. However, the use of pseudorandomness might lead to error propagation in some applications. Therefore, we describe the constructions in the rest of the paper according to the approach of the previous paragraph.

To see why the code rate approaches the channel capacity, notice that the source message is placed in the indices in the intersection H_X ∩ L_{X|Y}. The asymptotic fraction of this intersection can be derived as follows.

 |H_X ∩ L_{X|Y}|/n = 1 − |H_X^c ∪ L_{X|Y}^c|/n = 1 − |H_X^c|/n − |L_{X|Y}^c|/n + |H_X^c ∩ L_{X|Y}^c|/n. (2)

The Polarization Theorem (Theorem 2) implies that |H_X^c|/n → 1 − H(X) and |L_{X|Y}^c|/n → H(X|Y). Since the fraction |H_X^c ∩ L_{X|Y}^c|/n vanishes for large n, we get that the asymptotic rate is H(X) − H(X|Y) = I(X;Y), achieving the channel capacity.

For a more precise definition of the scheme, we use the so-called Bhattacharyya parameter in the selection of subsets of [n], instead of the conditional entropy. The Bhattacharyya parameters are polarized in a similar manner as the entropies, and are more useful for bounding the probability of decoding error. For a discrete RV Y and a Bernoulli RV X, the Bhattacharyya parameter is defined by

 Z(X|Y) ≜ 2 ∑_y √( p_{X,Y}(0, y) p_{X,Y}(1, y) ). (3)

Note that most of the polar coding literature uses a slightly different definition of the Bhattacharyya parameter, which coincides with Equation (3) when the RV X is distributed uniformly. We use the following relations between the Bhattacharyya parameter and the conditional entropy.

###### Proposition 3

. ([2, Proposition 2])

 (Z(X|Y))^2 ≤ H(X|Y), (4)
 H(X|Y) ≤ log2(1 + Z(X|Y)) ≤ Z(X|Y). (5)
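Equation (3) and the bound (4), together with the first inequality of (5), are easy to check numerically; the joint pmf below is our own toy example:

```python
from math import sqrt, log2

def bhattacharyya(p):
    """Z(X|Y) = 2 * sum_y sqrt(p(0,y) * p(1,y)) for a joint pmf p[x][y]."""
    return 2 * sum(sqrt(p[0][y] * p[1][y]) for y in range(len(p[0])))

def cond_entropy(p):
    """H(X|Y) in bits for a joint pmf p[x][y] with binary X."""
    hxy = 0.0
    for y in range(len(p[0])):
        py = p[0][y] + p[1][y]
        for x in (0, 1):
            if p[x][y] > 0:
                hxy -= p[x][y] * log2(p[x][y] / py)
    return hxy

# A joint pmf p(x, y) with binary X and |Y| = 2 (values are ours).
p = [[0.4, 0.1],
     [0.1, 0.4]]
z = bhattacharyya(p)     # 0.8
hx_y = cond_entropy(p)   # h(0.2), about 0.7219 bits
```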

We now define the sets of high and low Bhattacharyya parameters, and work with them instead of the entropy-based sets of Theorem 2, reusing the same notation. For δ ∈ (0, 1/2), define

 H_{X|Y} ≜ {i ∈ [n] : Z(U_i | U_{[i−1]}, Y_{[n]}) ≥ 1 − 2^{−n^{1/2−δ}}},
 L_{X|Y} ≜ {i ∈ [n] : Z(U_i | U_{[i−1]}, Y_{[n]}) ≤ 2^{−n^{1/2−δ}}}.

As before, we define the sets H_X and L_X for the parameter Z(U_i | U_{[i−1]}) by letting Y_{[n]} be a deterministic vector. Using Proposition 3, it is shown in [17, combining Proposition 2 with Theorem 2] that Theorem 2 holds also if we replace the entropy-based sets with the sets H_{X|Y} and L_{X|Y} defined above. That is, we have

 lim_{n→∞} |H_{X|Y}|/n = H(X|Y)   and   lim_{n→∞} |L_{X|Y}|/n = 1 − H(X|Y). (6)

We now define our coding scheme formally. Let m be the realization of a uniformly distributed source message of |H_X ∩ L_{X|Y}| bits, and let f be a deterministic frozen vector of |H_X ∩ L_{X|Y}^c| bits known to both the encoder and the decoder. We discuss how to find a good frozen vector in Appendix C-C. For a subset A ⊆ [n] and an index i ∈ A, we use r(i, A) to denote the rank of i in an ordered list of the elements of A. The probabilities p_{U_i|U_{[i−1]}}(u | u_{[i−1]}) and p_{U_i|U_{[i−1]}, Y_{[n]}}(u | u_{[i−1]}, y_{[n]}) can be calculated efficiently by a recursive method described in [17, Section III.B].

###### Construction 4

Encoding

Input: a message m of |H_X ∩ L_{X|Y}| bits.

Output: a codeword x_{[n]}.

1. For i from 1 to n, successively, set

 u_i = { u ∈ {0,1} with probability p_{U_i|U_{[i−1]}}(u | u_{[i−1]})   if i ∈ H_X^c,
         m_{r(i, H_X ∩ L_{X|Y})}     if i ∈ H_X ∩ L_{X|Y},
         f_{r(i, H_X ∩ L_{X|Y}^c)}   if i ∈ H_X ∩ L_{X|Y}^c.
2. Transmit the codeword x_{[n]} = u_{[n]} G_n.

3. Transmit the vector u_{H_X^c ∩ L_{X|Y}^c} separately using a linear, non-capacity-achieving polar code with a uniform input distribution (as in [1]). In practice, other error-correcting codes could be used for this vector as well.

Decoding

Input: a noisy vector y_{[n]}.

Output: a message estimate m̂.

1. Estimate the vector u_{H_X^c ∩ L_{X|Y}^c} by û_{H_X^c ∩ L_{X|Y}^c}, the decoding result of its separate transmission.

2. For i from 1 to n, set

 û_i = { argmax_{u ∈ {0,1}} p_{U_i|U_{[i−1]}, Y_{[n]}}(u | û_{[i−1]}, y_{[n]})   if i ∈ L_{X|Y},
         û_{r(i, H_X^c ∩ L_{X|Y}^c)}   if i ∈ H_X^c ∩ L_{X|Y}^c,
         f_{r(i, H_X ∩ L_{X|Y}^c)}     if i ∈ H_X ∩ L_{X|Y}^c.
3. Return the estimated message m̂ = û_{H_X ∩ L_{X|Y}}.
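The index bookkeeping of the encoder (step 1) can be sketched as follows. The probability oracle p_cond and the index sets here are toy placeholders (the real ones come from the recursive computation cited above), and indices are 0-based in the code; the sketch only illustrates the three-way split and the rank function r(i, A):

```python
import random

def rank(i, A):
    """r(i, A): the rank of index i in the sorted list of A (0-based here)."""
    return sorted(A).index(i)

def encode(n, H_X, L_XY, msg, frozen, p_cond, rng):
    """Step 1 of Construction 4: assign u_i by the three-way index split.
    p_cond(i, prefix) stands in for p_{U_i|U_[i-1]}(1 | u_[i-1])."""
    HL  = H_X & L_XY          # message positions
    HLc = H_X - L_XY          # frozen positions
    u = []
    for i in range(n):
        if i not in H_X:                      # i in H_X^c: random per the pmf
            u.append(1 if rng.random() < p_cond(i, u) else 0)
        elif i in HL:
            u.append(msg[rank(i, HL)])
        else:
            u.append(frozen[rank(i, HLc)])
    return u

rng = random.Random(0)
n = 8
H_X  = {1, 3, 5, 6, 7}        # hypothetical high-entropy set
L_XY = {3, 5, 6, 7}           # hypothetical reliably decodable set
msg    = [1, 0, 1, 1]         # |H_X ∩ L_XY| = 4 message bits
frozen = [0]                  # |H_X ∩ L_XY^c| = 1 frozen bit
u = encode(n, H_X, L_XY, msg, frozen, lambda i, prefix: 0.5, rng)
# u_{H_X ∩ L_XY} carries the message; the codeword is x = u G_n over GF(2).
```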

We say that a sequence of coding schemes achieves the channel capacity if the probability of decoding error vanishes with the block length for any rate below the capacity.

###### Theorem 5

. Construction 4 achieves the channel capacity (Theorem 1) with an encoding and decoding complexity of O(n log n) and a probability of decoding error of at most 2^{−n^{1/2−δ}} for any δ ∈ (0, 1/2) and large enough n.

In the next section we show a generalized construction and prove its capacity-achieving property. Theorem 5 will thus follow as a corollary of the more general Theorem 15. We note here two differences between Construction 4 and the construction in [23, Section III.B]. First, in the encoding of Construction 4, the bits in the set H_X^c are set randomly, while in [23, Section III.B], those bits are set according to a maximum-likelihood rule. And second, the vector u_{H_X^c ∩ L_{X|Y}^c} is sent through a side channel in Construction 4, but not in [23, Section III.B]. These two features of Construction 4 allow an alternative analysis and proof that the scheme achieves the channel capacity.

## III Channels with Non-Causal Encoder State Information

In this section we generalize Construction 4 to the availability of channel state information at the encoder. We consider mainly the application of rewriting in flash memories, and present two special cases of the channel model for this application. In flash memory, information is stored in a set of memory cells. We mainly focus on a flash memory type that is called Single-Level Cell (SLC), in which each cell stores a single information bit, and its value is denoted by either 0 or 1. We first note that the assumption of a memoryless channel is not exactly accurate in flash memories, due to a mechanism of cell-to-cell interference. However, we keep using this assumption, as it is nonetheless useful for the design of coding schemes with valuable practical performance. The main limitation of flash memories that we consider in this work is the high cost of changing a cell level from 1 to 0 (in SLC memories). To perform such a change, an expensive operation, called “block erasure”, is required. To avoid this block erasure operation, information is rewritten over existing memory in the sense that no cell is changed from value 1 to 0. We thus consider the use of the information about the previous state of the cells in the encoding process. We model the memory cells as a channel with a discrete state, and we also assume that the state is memoryless, meaning that the states of different cells are distributed independently.

We assume that the states of all the cells are available to the writer prior to the beginning of the writing process. In communication terminology this kind of state availability is referred to as “non-causal”. We note that this setting is also useful in the so-called Marton-coding method for communication over broadcast channels. Therefore, the multicoding schemes that will follow serve as a contribution also in this important setting. One special case of the model which we consider is the noiseless write-once memory model. This model also serves as an ingredient for a type of codes called “rank-modulation rewriting codes” [9]. Therefore, the schemes proposed in this section can also be useful for the design of rank-modulation rewriting codes.

We represent the channel state as a Bernoulli random variable S with parameter β ≜ p_S(1). A cell of state 1 can only be written with the value 1. Note that, intuitively, when β is high, the capacity of the memory is small, since only a few cells are available for modification in the writing process, and thus only a small amount of information could be stored. This also means that the choice of codebook has a crucial effect on the capacity of the memory in future writes. A codebook that contains many codewords of high Hamming weight (number of 1’s in the codeword) would make the parameter β of future writes high, and thus the capacity of the future writes would be low. However, forcing the expected Hamming weight of the codebook to be low would reduce the capacity of the current write. To settle this trade-off, previous work suggested optimizing the sum of the code rates over multiple writes. It was shown that in many cases, constraints on the codebook Hamming weight (henceforth just weight) strictly increase the sum rate (see, for example, [16]). Therefore, we consider an input cost constraint in the model.

The most general model that we consider is a discrete memoryless channel (DMC) with a discrete memoryless (DM) state and an input cost constraint, where the state information is available non-causally at the encoder. The channel input, state and output are denoted by x, s and y, respectively, and their respective finite alphabets are denoted by 𝒳, 𝒮 and 𝒴. The random variables are denoted by X, S and Y, and the random vectors by X_{[n]}, S_{[n]} and Y_{[n]}, where n is the block length. The state is distributed according to the pmf p_S(s), and the conditional pmfs of the channel are denoted by p_{Y|X,S}(y|x,s). The input cost function is denoted by b(x), and the input cost constraint is

 ∑_{i=1}^{n} E[b(X_i)] ≤ nB,

where B is a real number representing the normalized constraint. The channel capacity with an informed encoder and an input cost constraint is given by an extension of the Gelfand-Pinsker Theorem. (The cost constraint is defined slightly differently in this reference, but the capacity is not affected by this change.)

###### Theorem 6

. (Gelfand-Pinsker Theorem with Cost Constraint) [7, Equation (7.7) on p. 186] Consider a DMC with a DM state, where p_{Y|X,S}(y|x,s) denotes the channel transition probability and p_S(s) denotes the state probability. Under an input cost constraint E[b(X)] ≤ B, where the state information is available non-causally only at the encoder, the capacity of the channel is

 C = max_{p(v|s), x(v,s) : E[b(X)] ≤ B} ( I(V;Y) − I(V;S) ), (7)

where V is an auxiliary random variable with a finite alphabet, and x(v, s) is a deterministic function of (v, s).

The main coding scheme that we present in this section achieves the capacity in Theorem 6. The proof of Theorem 6 considers a virtual channel model, in which the RV V is the channel input and Y is the channel output. Similar to the previous section, we limit the treatment to the case in which the RVs X and V are binary. In flash memory, this case would correspond to a single-level cell (SLC) type of memory. As mentioned in Section II, an extension of the scheme to a non-binary case is not difficult. The non-binary case is useful for flash memories in which each cell stores 2 or more bits of information. Such memories are called Multi-Level Cell (MLC). We also mention that the limitation to binary random variables does not apply to the channel output Y. Therefore, the cell voltage in flash memory could be read more accurately at the decoder to increase the coding performance, similarly to the soft-decoding method that is used in flash memories with LDPC codes. Another practical remark is that the binary-input model can be used in MLC memories by coding separately on the MSB and the LSB of the cells, as in fact is the coding method in current MLC flash systems.

The scheme that achieves the capacity of Theorem 6 is called Construction 16, and it will be described in Subsection III-C. The capacity-achieving result is summarized in the following theorem, which will be proven in Subsection III-C.

###### Theorem 7

. Construction 16 achieves the capacity of the Gelfand-Pinsker Theorem with Cost Constraint (Theorem 6) with an encoding and decoding complexity of O(n log n) and a probability of decoding error of at most 2^{−n^{1/2−δ}} for any δ ∈ (0, 1/2) and large enough n.

Note that the setting of Theorem 7 is a generalization of the asymmetric channel-coding setting of Theorem 5, and therefore Construction 16 and Theorem 7 are in fact a generalization of Construction 4 and Theorem 5. We note also that polar codes were constructed for a symmetric case of the Gelfand-Pinsker channel by Korada and Urbanke in [18]. As the key constraint of flash memories is notably asymmetric, the important novelty of this work is in providing the non-trivial generalization that covers the asymmetric case.

Before we describe the code construction, we first show in Subsection III-A two special cases of the Gelfand-Pinsker model that are useful for the rewriting of flash memories. Afterwards, in subsections III-B and III-C, we will show two versions of the construction that correspond to generalizations of the two special cases.

### III-A Special Cases

We start with a special case that is quite a natural model for flash memory rewriting.

###### Example 8

. Let the sets 𝒳, 𝒮 and 𝒴 all equal {0, 1}, and let the state pmf be p_S(1) = β. This model corresponds to a single-level cell flash memory. We describe the cell behaviour after a bit is attempted to be written. When s = 0, the cell behaves as a binary asymmetric channel with input x, since the cell state does not interfere with the writing attempt. When s = 1, the cell behaves as if a value of 1 was attempted to be written, regardless of the actual value attempted. However, an error might still occur, during the writing process or anytime afterwards (for example, due to charge leakage). Thus, we can say that when s = 1, the cell behaves as a binary asymmetric channel with input 1. Formally, the channel pmfs are given by

 p_{Y|X,S}(1|x,s) = { α_0      if (x,s) = (0,0),
                      1 − α_1  if (x,s) = (0,1),
                      1 − α_1  if (x,s) = (1,0),
                      1 − α_1  if (x,s) = (1,1). (8)

The error model is also presented in Figure 2. The cost constraint is given by b(x) = x, since it is desirable to limit the number of cells written to a value of 1.
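Equation (8) collapses to a two-case rule, which the following sketch (ours) makes explicit: only a 0 written to a 0-state cell behaves as input 0; every other (x, s) combination behaves as input 1:

```python
def p_y1_example8(x, s, alpha0, alpha1):
    """P(Y = 1 | X = x, S = s) for the model of Example 8 / Equation (8)."""
    if (x, s) == (0, 0):
        return alpha0          # clean cell written with 0: flips up w.p. alpha0
    return 1 - alpha1          # effective input 1: flips down w.p. alpha1

a0, a1 = 0.01, 0.02            # example error rates (ours)
# (x,s) = (0,0) gives alpha0; the other three cases give 1 - alpha1.
```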

Our coding-scheme construction for the setting of Theorem 6 is based on a more limited construction, which serves as a building block. We will start by describing the limited construction, and then show how to extend it for the model of Theorem 6. We will prove that the limited construction achieves the capacity of channels whose capacity-achieving distribution forms a certain stochastically degraded structure. We first recall the definition of stochastically degraded channels.

###### Definition 9

. [7, p. 112] A discrete memoryless channel (DMC) p_{Y₁|X} is stochastically degraded (or simply degraded) with respect to a DMC p_{Y₂|X}, denoted as p_{Y₁|X} ⪯ p_{Y₂|X}, if there exists a DMC p_{Y₁|Y₂} that satisfies the equation p_{Y₁|X}(y₁|x) = ∑_{y₂} p_{Y₂|X}(y₂|x) p_{Y₁|Y₂}(y₁|y₂).

Next, we state the required property of channels whose capacity is achieved by the limited construction to be proposed.

###### Property 10

. There exist functions p_{V|S}(v|s) and x(v, s) that maximize the Gelfand-Pinsker capacity in Theorem 6 and satisfy the condition p_{S|V} ⪯ p_{Y|V}.

It is an open problem whether the model of Example 8 satisfies the degradation condition of Property 10. However, we can modify the model such that it will satisfy Property 10. Specifically, we study the following model:

###### Example 11

. Let the sets 𝒳, 𝒮 and 𝒴 all equal {0, 1}. The channel and state pmfs are given by p_S(1) = β and

 p_{Y|X,S}(1|x,s) = { α      if (x,s) = (0,0),
                      1 − α  if (x,s) = (1,0),
                      1      if s = 1. (9)

In words, if s = 1 the channel output is always 1, and if s = 0, the channel behaves as a binary symmetric channel with crossover probability α. The cost function is given by b(x) = x. The error model is also presented in Figure 3. This model can represent a writing noise, as a cell of state 1 is not written on and it never suffers errors.
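A quick simulation (ours) of Equation (9) confirms this description: state-1 cells always read 1, and state-0 cells see a binary symmetric channel with crossover probability α:

```python
import random

def sample_y_example11(x, s, alpha, rng):
    """Sample Y per Equation (9): state-1 cells always read 1; state-0
    cells pass x through a binary symmetric channel with crossover alpha."""
    if s == 1:
        return 1
    return x if rng.random() >= alpha else 1 - x

rng = random.Random(1)
alpha = 0.1
flips = sum(sample_y_example11(0, 0, alpha, rng) for _ in range(100000))
# flips / 100000 should be close to alpha for state-0 cells written with 0.
```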

We claim that the model of Example 11 satisfies the degradation condition of Property 10. To show this, we first need to find the pmf $p_{V|S}$ and the function $x(v,s)$ that maximize the Gelfand-Pinsker capacity in Theorem 6. These functions are established in the following theorem of Heegard.

###### Theorem 12

. [16, Theorem 4] The capacity of the channel in Example 11 is

 $$C=(1-\beta)\left[h(\epsilon*\alpha)-h(\alpha)\right],$$

where $\epsilon*\alpha\triangleq\epsilon(1-\alpha)+(1-\epsilon)\alpha$ and $h(\cdot)$ is the binary entropy function. The selections $x(v,s)=v\wedge\neg s$ (where $\wedge$ is the logical AND operation, and $\neg$ is the logical negation), and

 $$p_{V|S}(1|0)=\epsilon,\qquad p_{V|S}(1|1)=\frac{\epsilon(1-\alpha)}{\epsilon*\alpha}\qquad(10)$$

achieve this capacity.

We provide an alternative proof of Theorem 12 in Appendix A. Intuitively, the upper bound is obtained by assuming that the state information is available also at the decoder, and the lower bound is obtained by setting $p_{V|S}$ and $x(v,s)$ according to the statement of the theorem. The proof that the model in Example 11 satisfies the degradation condition of Property 10 is completed by the following lemma.
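The capacity expression of Theorem 12 is easy to evaluate numerically. A minimal sketch, with helper names of our own choosing:

```python
from math import log2

def h(p):
    """Binary entropy function."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

def conv(a, b):
    """Binary convolution a * b = a(1-b) + (1-a)b."""
    return a * (1 - b) + (1 - a) * b

def rate(beta, alpha, eps):
    """The rate (1 - beta)[h(eps * alpha) - h(alpha)] of Theorem 12."""
    return (1 - beta) * (h(conv(eps, alpha)) - h(alpha))
```

Sanity checks agree with intuition: when the writing noise satisfies $\alpha=1/2$ the rate is zero, and when $\alpha=0$ and $\epsilon=1/2$ the rate equals $1-\beta$, the fraction of writable cells.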

###### Lemma 13

. The capacity-achieving functions of Theorem 12 for the model of Example 11 satisfy the degradation condition of Property 10. That is, the channel $p_{S|V}$ is degraded with respect to the channel $p_{Y|V}$.

Lemma 13 is proven in Appendix B; consequently, the capacity of the model in Example 11 can be achieved by our limited construction. In the next subsection we describe the construction for channel models that satisfy Property 10, including the model in Example 11.

### III-B Multicoding Construction for Degraded Channels

Notice first that the capacity-achieving distribution of the asymmetric channel in Section II actually satisfies Property 10. In the asymmetric channel-coding case, the state can be thought of as a degenerate random variable (a RV which takes only a single value), and therefore we can choose the intermediate channel in Definition 9 to be degenerate as well, thereby satisfying Property 10. We will see that the construction we present in this subsection is a generalization of Construction 4.

The construction has a similar structure to the achievability proof of the Gelfand-Pinsker Theorem (Theorem 6). The encoder first finds a vector $u^{[n]}$ in a similar manner to Construction 4, where the RV $X$ of Construction 4 is replaced with $V$, and the index sets $H_X$ and $L_{X|Y}$ are replaced with $H_{V|S}$ and $L_{V|Y}$. The vector $u^{[n]}$ is now the polarization of the vector $v^{[n]}$, meaning that $v^{[n]}=u^{[n]}G_n$. The RV $V$ is taken according to the pmf $p_{V|S}$ that maximizes the rate expression in Equation (7). The selection of the vector $u^{[n]}$ is illustrated in Figure 4. After the vector $u^{[n]}$ is chosen, each bit $x_i$ of the codeword is calculated by the function $x(v_i,s_i)$ that maximizes Equation (7). To use the model of Example 11, one should use $p_{V|S}$ and $x(v,s)$ according to Theorem 12. The key to showing that the scheme achieves the channel capacity is that the fraction $|H_{V|S}^{c}\cap L_{V|Y}^{c}|/n$ can be shown to vanish for large $n$ if the channel satisfies Property 10. Then, by the same intuition as in Equation (2) and using Equation (6), the replacements imply that the asymptotic rate of the codes is

 $$\begin{aligned}|H_{V|S}\cap L_{V|Y}|/n&=1-|H_{V|S}^{c}|/n-|L_{V|Y}^{c}|/n+|H_{V|S}^{c}\cap L_{V|Y}^{c}|/n\\&\to 1-(1-H(V|S))-H(V|Y)+0\\&=I(V;Y)-I(V;S),\end{aligned}$$

achieving the Gelfand-Pinsker capacity of Theorem 6. We now describe the coding scheme formally.

###### Construction 14

.
Encoding

Input: a message $m$ and a state $s^{[n]}$.

Output: a codeword $x^{[n]}$.

1. For each $i$ from $1$ to $n$, assign

 $$u_i=\begin{cases}u\in\{0,1\}\ \text{with probability } p_{U_i|U^{[i-1]},S^{[n]}}(u|u^{[i-1]},s^{[n]}) & \text{if } i\in H_{V|S}^{c}\\ m_{r(i,\,H_{V|S}\cap L_{V|Y})} & \text{if } i\in H_{V|S}\cap L_{V|Y}\\ f_{r(i,\,H_{V|S}\cap L_{V|Y}^{c})} & \text{if } i\in H_{V|S}\cap L_{V|Y}^{c}.\end{cases}\qquad(11)$$
2. Calculate $v^{[n]}=u^{[n]}G_n$ and, for each $i\in[n]$, store the value $x_i=x(v_i,s_i)$.

3. Store the vector $u_{H_{V|S}^{c}\cap L_{V|Y}^{c}}$ separately, using a point-to-point linear non-capacity-achieving polar code with a uniform input distribution. The encoder here does not use the state information in the encoding process, but rather treats it as an unknown part of the channel noise.

Decoding

Input: a noisy vector $y^{[n]}$.

Output: a message estimation $\hat{m}$.

1. Estimate the vector $u_{H_{V|S}^{c}\cap L_{V|Y}^{c}}$ by decoding the separately stored polar code.

2. Estimate $u^{[n]}$ by $\hat{u}^{[n]}$ as follows: For each $i$ from 1 to $n$, assign

 $$\hat{u}_i=\begin{cases}\arg\max_{u\in\{0,1\}}p_{U_i|U^{[i-1]},Y^{[n]}}(u|\hat{u}^{[i-1]},y^{[n]}) & \text{if } i\in L_{V|Y}\\ \hat{u}_{r(i,\,H_{V|S}^{c}\cap L_{V|Y}^{c})} & \text{if } i\in H_{V|S}^{c}\cap L_{V|Y}^{c}\\ f_{r(i,\,H_{V|S}\cap L_{V|Y}^{c})} & \text{if } i\in H_{V|S}\cap L_{V|Y}^{c}.\end{cases}\qquad(12)$$
3. Return the estimated message $\hat{m}=\hat{u}_{H_{V|S}\cap L_{V|Y}}$.
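Both the encoder (step 2) and the decoder rely on the polar transform $G_n$. The following is a minimal sketch of the transform over GF(2), in the $F^{\otimes m}$ ordering with $F=\left[\begin{smallmatrix}1&0\\1&1\end{smallmatrix}\right]$ (bit reversal omitted, which does not affect the self-inverse property illustrated here); the function name is ours:

```python
def polar_transform(u):
    """Compute v = u F^{⊗m} over GF(2), for len(u) = 2^m.

    Recursion: writing u = (a, b) as two halves, the block structure of
    F ⊗ A gives u (F ⊗ A) = (aA ⊕ bA, bA).
    """
    n = len(u)
    if n == 1:
        return list(u)
    a = polar_transform(u[:n // 2])
    b = polar_transform(u[n // 2:])
    return [x ^ y for x, y in zip(a, b)] + b
```

Since $F^2=I$ over GF(2), the transform is its own inverse, which is why the same map appears in both encoding and decoding.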

The asymptotic performance of Construction 14 is stated in the following theorem.

###### Theorem 15

. If Property 10 holds, then Construction 14 achieves the capacity of Theorem 6 with an encoding and decoding complexity of $O(n\log n)$ and a probability of decoding error of at most $2^{-n^{\beta'}}$ for any $\beta'<1/2$ and large enough $n$.

The proof of Theorem 15 is given in Appendix C. The next subsection describes a method to remove the degradation requirement of Property 10. This makes it possible to also achieve the capacity of the more realistic model of Example 8.

### III-C Multicoding Construction without Degradation

A technique called “chaining” was proposed in [23] that makes it possible to achieve the capacity of models that do not exhibit the degradation condition of Property 10. The chaining idea was presented in the context of broadcast communication and point-to-point universal coding. We connect it here to the application of flash-memory rewriting through Example 8. We note also that the chaining technique that follows comes at the price of a slower convergence to the channel capacity, and thus a lower non-asymptotic code rate.

The requirement of Construction 14 for degraded channels comes from the fact that the subvector $u_{H_{V|S}^{c}\cap L_{V|Y}^{c}}$ needs to be communicated to the decoder on a side channel. If the fraction $|H_{V|S}^{c}\cap L_{V|Y}^{c}|/n$ vanishes with $n$, Construction 14 achieves the channel capacity. In this subsection we deal with the case in which this fraction does not vanish. In this case we have

 $$\begin{aligned}|H_{V|S}\cap L_{V|Y}|/n&=1-|H_{V|S}^{c}\cup L_{V|Y}^{c}|/n\\&=1-|H_{V|S}^{c}|/n-|L_{V|Y}^{c}|/n+|H_{V|S}^{c}\cap L_{V|Y}^{c}|/n\\&\to I(V;Y)-I(V;S)+|H_{V|S}^{c}\cap L_{V|Y}^{c}|/n.\end{aligned}$$

The idea is then to store the subvector $u_{H_{V|S}^{c}\cap L_{V|Y}^{c}}$ in a subset of the indices of an additional code block of $n$ cells. The additional block uses the same coding technique as the original block. Therefore, it can use most of its cells to store additional message bits, and by that approach the channel capacity. We denote by $\mathcal{R}$ the subset of $H_{V|S}\cap L_{V|Y}$ in which we store the subvector of the previous block. Note that the additional block also faces the same difficulty as the original block with the set $H_{V|S}^{c}\cap L_{V|Y}^{c}$. To solve this, we use the same solution, recursively, sending a total of $k$ blocks, each of length $n$. Each block can store a message fraction that approaches the channel capacity. The “problematic” bits of block $k$ (the last block) will then be stored using yet another block, but this block will be coded without taking the state information into account, and thus will not face the same difficulty. The last block thus causes a rate loss, but this loss is a $1/(k+1)$ fraction of the total, which vanishes for large $k$. The decoding is performed “backwards”, starting from the last block and ending with the first block. The chaining construction is illustrated in Figure 5. In the following formal description of the construction, the vectors are denoted in two dimensions: $u_i^{j}$ is the $i$-th entry of the $j$-th block of $u$, and similarly for the other vectors.
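The rate bookkeeping of the chain can be sketched in a few lines. This is an illustrative accounting under the simplification that each of the $k$ chained blocks carries fresh message bits at the per-block rate while the extra final block carries none (names are ours):

```python
def chained_rate(k, block_rate):
    """Overall rate of k capacity-approaching blocks plus one extra,
    state-oblivious block that carries no fresh message bits.

    The rate loss relative to block_rate is the 1/(k+1) fraction of
    cells spent on the final block, which vanishes as k grows.
    """
    return k * block_rate / (k + 1)
```

For example, with a per-block rate of 0.8, a single chained block pays half the rate (0.4), while a thousand-block chain pays less than a thousandth.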

###### Construction 16

.
Let $\mathcal{R}$ be an arbitrary subset of $H_{V|S}\cap L_{V|Y}$ of size $|H_{V|S}^{c}\cap L_{V|Y}^{c}|$.

Encoding

Input: a message $m$ and a state $s^{[n],j}$ for each block $j\in[k]$.

Output: a codeword $x^{[n],j}$ for each block $j\in[k]$.

1. Let $u^{[n],0}$ be an arbitrary vector. For each $j$ from $1$ to $k$, and for each $i$ from $1$ to $n$, assign

 $$u_i^{j}=\begin{cases}u\in\{0,1\}\ \text{with probability } p_{U_i|U^{[i-1]},S^{[n]}}(u|u^{[i-1],j},s^{[n],j}) & \text{if } i\in H_{V|S}^{c}\\ m_{r(i,\,(H_{V|S}\cap L_{V|Y})\setminus\mathcal{R}),\,j} & \text{if } i\in(H_{V|S}\cap L_{V|Y})\setminus\mathcal{R}\\ \big(u^{j-1}_{H_{V|S}^{c}\cap L_{V|Y}^{c}}\big)_{r(i,\mathcal{R})} & \text{if } i\in\mathcal{R}\\ f_{r(i,\,H_{V|S}\cap L_{V|Y}^{c}),\,j} & \text{if } i\in H_{V|S}\cap L_{V|Y}^{c}.\end{cases}\qquad(13)$$
2. For each $j$ from $1$ to $k$ calculate $v^{[n],j}=u^{[n],j}G_n$, and for each $i\in[n]$, store the value $x_i^{j}=x(v_i^{j},s_i^{j})$.

3. Store the vector $u^{k}_{H_{V|S}^{c}\cap L_{V|Y}^{c}}$ separately, using a point-to-point linear non-capacity-achieving polar code with a uniform input distribution. The encoder here does not use the state information in the encoding process, but rather treats it as an unknown part of the channel noise.

Decoding

Input: a noisy vector $y^{[n],j}$ for each block $j\in[k]$.

Output: a message estimation $\hat{m}$.

1. Estimate the vector $u^{k}_{H_{V|S}^{c}\cap L_{V|Y}^{c}}$ by decoding the separately stored polar code, and let $\hat{u}^{[n],k+1}_{\mathcal{R}}$ equal the resulting estimate, so that the assignment below applies also for $j=k$.

2. Estimate $u^{[n],j}$ by $\hat{u}^{[n],j}$ as follows: For each $j$ from $k$ down to $1$, and for each $i$ from 1 to $n$, assign

 $$\hat{u}_i^{j}=\begin{cases}\arg\max_{u\in\{0,1\}}p_{U_i|U^{[i-1]},Y^{[n]}}(u|\hat{u}^{[i-1],j},y^{[n],j}) & \text{if } i\in L_{V|Y}\\ \hat{u}_{r(i,\mathcal{R})}^{\,j+1} & \text{if } i\in H_{V|S}^{c}\cap L_{V|Y}^{c}\\ f_{r(i,\,H_{V|S}\cap L_{V|Y}^{c}),\,j} & \text{if } i\in H_{V|S}\cap L_{V|Y}^{c}.\end{cases}\qquad(14)$$
3. Return the estimated message $\hat{m}$.

Constructions 14 and 16 can also be used for communication over broadcast channels in Marton’s region, as described in [13, 23]. Constructions 14 and 16 improve on these previous results, since they provably achieve the capacity with a linear storage requirement.

Construction 16 achieves the capacity of Theorem 6 with low complexity, without the degradation requirement of Property 10. This result was stated in Theorem 7. The proof of Theorem 7 follows from Theorem 15 and the fact that the rate loss vanishes as the number of blocks grows. Construction 16 is useful for the realistic flash-memory-rewriting model of Example 8, using the appropriate capacity-achieving functions $p_{V|S}$ and $x(v,s)$.

## IV Conclusion

In this paper we proposed three capacity-achieving polar coding schemes for the settings of asymmetric channel coding and flash-memory rewriting. The scheme for asymmetric channels improves on the scheme of [17] by reducing the exponential storage requirement to a linear one. The idea behind this reduction is to perform the encoding randomly instead of using Boolean functions, and to transmit a vanishing fraction of information on a side channel.

The second proposed scheme is used for the setting of flash-memory rewriting. We propose a model of flash-memory rewriting with writing noise, and show that the scheme achieves its capacity. We also describe a more general class of channels whose capacity can be achieved using the scheme. The second scheme is derived from the asymmetric-channel scheme by replacing the random variables of the Shannon setting with those of the Gelfand-Pinsker setting: the input $X$ is replaced with the auxiliary variable $V$, and the index sets are taken with respect to the state $S$ and the output $Y$.

The last proposed scheme achieves the capacity of any channel with non-causal state information at the encoder. We present a model of noisy flash-memory rewriting for which the scheme would be useful. The main idea in this scheme is the code chaining proposed in [23]. Another potential application could be an asymmetric version of information embedding, which can be modeled as another special case of the Gelfand-Pinsker problem (as in [3]).

## Appendix A

In this appendix we provide an alternative proof of Theorem 12, which was originally proven in [16, Theorem 4]. We find the proof in this appendix to be somewhat more intuitive. Theorem 12 states that the capacity of the channel in Example 11 is

 $$C=(1-\beta)\left[h(\epsilon*\alpha)-h(\alpha)\right],$$

where $\epsilon*\alpha\triangleq\epsilon(1-\alpha)+(1-\epsilon)\alpha$ and $h(\cdot)$ is the binary entropy function. An upper bound on the capacity can be obtained by assuming that the state information is available also to the decoder. In this case, the best coding scheme would ignore the cells with $s=1$ (about a fraction $\beta$ of the cells), and the rest of the cells would be coded according to a binary symmetric channel with an input cost constraint. It is optimal to assign the channel input $x=0$ to the cells with state $s=1$, such that those cells, which do not convey information, do not contribute to the cost. We now focus on the capacity of the binary symmetric channel with a cost constraint. To comply with the expected input cost constraint of the channel of Example 11, the expected cost of the input to the binary symmetric channel (BSC) must be at most $\epsilon$. To complete the proof of the upper bound, we show next that the capacity of the BSC with input cost constraint $\epsilon$ equals $h(\epsilon*\alpha)-h(\alpha)$. For this channel, we have

 $$H(Y|X)=h(\alpha)p_X(0)+h(\alpha)p_X(1)=h(\alpha).$$

We are now left with maximizing the output entropy $H(Y)$ over the input pmfs $p_X$ that satisfy the cost constraint. We have

 $$\begin{aligned}p_Y(1)&=p_{Y|X}(1|0)p_X(0)+p_{Y|X}(1|1)p_X(1)\\&=\alpha(1-p_X(1))+(1-\alpha)p_X(1)\\&=\alpha*p_X(1).\end{aligned}$$
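The claim that $h(\epsilon*\alpha)-h(\alpha)$ is the capacity of the cost-constrained BSC can also be checked numerically, by brute force over the input pmf. The sketch below is ours (helper names are not from the paper):

```python
from math import log2

def h(p):
    """Binary entropy function."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

def conv(a, b):
    """Binary convolution a * b = a(1-b) + (1-a)b."""
    return a * (1 - b) + (1 - a) * b

def bsc_mutual_info(p, alpha):
    """I(X;Y) = H(Y) - H(Y|X) = h(alpha * p) - h(alpha) for a BSC(alpha)
    with input distribution p_X(1) = p."""
    return h(conv(p, alpha)) - h(alpha)

def constrained_capacity(alpha, eps, grid=10**4):
    """Brute-force maximum of I(X;Y) over all p_X(1) <= eps."""
    return max(bsc_mutual_info(i * eps / grid, alpha) for i in range(grid + 1))
```

For $\alpha<1/2$ and $\epsilon\le 1/2$, the maximum is attained at $p_X(1)=\epsilon$, in agreement with the derivation above.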