A Two-staged Adaptive Successive Cancellation List Decoding for Polar Codes
Polar codes achieve outstanding error correction performance when using successive cancellation list (SCL) decoding with cyclic redundancy check. A larger list size brings better decoding performance and is essential for practical applications such as 5G communication networks. However, the decoding speed of SCL decreases with increased list size. Adaptive SCL (A-SCL) decoding can greatly enhance the decoding speed, but the decoding latency for each codeword is different so A-SCL is not a good choice for hardware-based applications. In this paper, a hardware-friendly two-staged adaptive SCL (TA-SCL) decoding algorithm is proposed such that a constant input data rate is supported even if the list size for each codeword is different. A mathematical model based on Markov chain is derived to explore the bounds of its decoding performance. Simulation results show that the throughput of TA-SCL is tripled for good channel conditions with negligible performance degradation and hardware overhead.
To improve the error correction performance of polar codes [1], successive cancellation list (SCL) decoding [2, 3] is the most popular decoding choice. In SCL decoding, $L$ (called the list size) successive cancellation (SC) decodings [4, 5] are executed concurrently to decode a polar codeword, and $L$ candidate decoded vectors are kept during decoding [2, 3]. Compared with SC decoding, SCL decoding improves the error correction performance because the probability that one of the candidates is the correct decoded vector is higher, and a larger list size brings better error correction performance. In [6], cyclic redundancy check (CRC) codes are concatenated as outer codes with polar codes, and the CRC is applied to all the candidates to check whether any candidate is a valid decoding output. From the experimental results presented in [7, 8], the CRC-aided polar codes decoded by SCL with a sufficiently large list size (16) outperform LDPC codes and turbo codes.
Due to the extraordinary error correction performance of CRC-aided SCL decoding, its hardware implementation has attracted much research interest recently. Several different VLSI architectures [9, 10, 11, 12, 13, 8, 14, 15, 16] have been proposed for SCL. The decoding throughputs achieved by the state-of-the-art architectures are shown in Fig. 1. It can be observed that the decoding throughputs of all the architectures are degraded with the list size. This is mainly because the critical path delay of some of the critical functional modules [17, 18, 19] in these architectures increases rapidly when the list size is increased. Although efforts have been made to optimize these modules as well as the overall architecture, the throughput is still reduced due to the high decoding complexity.
To increase the decoding speed so that it can match that of LDPC or turbo code architectures, adaptive SCL (A-SCL) decoding was proposed in [20] and a corresponding software decoder was implemented on CPU in [21]. This algorithm first uses a single SC to decode a codeword. If the decoded vector cannot pass CRC, the list size is doubled and the decoding repeats. This process is iterated until a valid vector is obtained or a pre-defined maximum list size $L_{max}$ is reached. Experimental results in [20] show that A-SCL significantly reduces the average list size required to achieve error correction performance equivalent to that of SCL decoding with $L_{max}$. The average throughput of executing A-SCL on hardware can benefit from this reduction in the average list size. However, if the algorithm is directly mapped to hardware, the decoding latency of each codeword is different, so it may not support applications that need a constant input data rate. Also, the hardware complexity is high as multiple SCL modules are needed.
The main contributions of this work are outlined as follows:
An analytical model of TA-SCL is developed based on Markov chain to analyse its error correction performance. Its accuracy is verified by simulation, and it can be used for the optimization of the VLSI architecture for TA-SCL.
II-A Introduction of Polar Codes and SCL
Polar codes are a family of block codes [1] characterised by an $N \times N$ binary generator matrix $G$, where $N$ is the code length. The source word $u$ and the codeword $x$ of an $N$-bit frame are both binary vectors, and the encoding can be expressed as $x = uG$. Among all the $N$ bits in a frame, only $K$ bits are used to send information and the rest are frozen bits, which are set to 0. The last information bits are used to transmit the checksum of the CRC code.
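As an illustration of the encoding step, the following minimal Python sketch multiplies a source word by the Kronecker power of the kernel F = [[1, 0], [1, 1]] over GF(2). It assumes that the generator matrix is taken without the bit-reversal permutation. The function names `polar_encode` and `build_source_word`, and the information positions used in the example, are hypothetical and are not taken from the paper.

```python
def build_source_word(info_bits, info_positions, n):
    """Place information bits at the given positions; all other bits are frozen to 0."""
    u = [0] * n
    for bit, pos in zip(info_bits, info_positions):
        u[pos] = bit
    return u


def polar_encode(u):
    """Return x = u * F^(kron n) over GF(2), with F = [[1, 0], [1, 1]].

    Assumption: no bit-reversal permutation is applied.
    """
    n = len(u)
    if n == 0 or n & (n - 1):
        raise ValueError("code length must be a power of two")
    x = list(u)
    step = 1
    while step < n:
        # One butterfly stage: XOR the upper half of each block into the lower half.
        for start in range(0, n, 2 * step):
            for i in range(start, start + step):
                x[i] ^= x[i + step]
        step *= 2
    return x


# Example with a hypothetical choice of information positions (1 and 3) for N = 4.
source = build_source_word([1, 0], [1, 3], 4)
codeword = polar_encode(source)
```

Because the kernel is its own inverse over GF(2), applying `polar_encode` twice returns the original source word, which is a convenient sanity check.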
SCL decoding of polar codes decodes a codeword bit by bit in serial order, and the decoding process is similar to a search problem on a binary decoding tree whose depth is the code length $N$. An example decoding tree is shown in Fig. 2. The $i$-th source bit is mapped to the nodes of the decoding tree at level $i$. A path from the root node to a leaf node represents a candidate decoded vector. For a parent node at level $i$, its left and right children at level $i+1$ correspond to the expansions of the decoding path with 0 and 1, respectively. In the example, the path marked with single crosslines represents one decoded vector. If a bit is a frozen bit (an example is marked in Fig. 2), the sub-tree rooted at the right child does not contain any valid candidate and hence is pruned. Therefore, the total number of possible candidates in a decoding tree is $2^K$ for $K$ information bits, which is too large for an exhaustive search of the decoding tree when a practical code length is used. To limit the computational complexity, each SCL decoding has a pre-defined list size $L$. If the number of paths at a certain level exceeds $L$, a list management operation is used to select and keep the $L$ best surviving paths and discard the rest. The example in Fig. 2 keeps more than one path in its list, so another path, marked by double crosslines and representing the decoded vector 0100, is also kept. At the end of the decoding, the path in the list that passes CRC is selected as the output vector.
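The list management operation described above can be sketched as follows. This is a simplified illustration, not the architecture of any cited decoder: the callback `llr_of`, which returns the LLR of the next bit given the bits decided so far, is hypothetical, and the path metric is assumed to be the common penalty that adds the LLR magnitude whenever the chosen bit disagrees with the sign of the LLR (lower is better).

```python
def scl_step(paths, llr_of, is_frozen, list_size):
    """Extend every surviving path by one bit and keep the best `list_size` paths.

    paths: list of (metric, bits) tuples, where bits is a tuple of 0/1.
    llr_of: hypothetical callback, bits -> LLR of the next bit (positive favours 0).
    is_frozen: True if the next bit is frozen, so only the 0-branch is expanded.
    """
    candidates = []
    for metric, bits in paths:
        llr = llr_of(bits)
        hard_decision = 0 if llr >= 0 else 1
        for bit in ((0,) if is_frozen else (0, 1)):
            # Penalise a branch that contradicts the hard decision of the LLR.
            penalty = abs(llr) if bit != hard_decision else 0.0
            candidates.append((metric + penalty, bits + (bit,)))
    # List management: sort by metric and discard everything beyond the list size.
    candidates.sort(key=lambda c: c[0])
    return candidates[:list_size]


# Toy run with path-independent LLRs (an assumption made only to keep the example short).
toy_llrs = [2.0, -1.0, 0.5, -3.0]
toy_frozen = [True, False, False, False]
survivors = [(0.0, ())]
for frozen in toy_frozen:
    survivors = scl_step(survivors, lambda bits: toy_llrs[len(bits)], frozen, list_size=2)
```

In a real decoder the LLR of each bit depends on the channel output and on the earlier decisions along the path, which is what the `llr_of` callback stands in for here.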
II-B Adaptive SCL with CRC
Adaptive SCL with CRC was proposed in [20] and its operation is summarised in Algorithm 1. Each time, a new codeword, given as the log-likelihood ratios (LLRs) of the input values, is sent for decoding. A-SCL starts from an SCL with list size 1, i.e. a single SC. If at least one decoded vector passes CRC at the end of decoding, the one with the highest reliability is chosen as the output. Otherwise, the list size is doubled and the codeword is decoded again by an SCL with the new list size. Usually, a pre-defined maximum list size $L_{max}$ is used to limit the computational complexity, that is, after the decoding using an SCL with $L_{max}$, the decoding terminates even when there is no valid candidate. According to [20], the error correction performance of A-SCL is the same as that of an SCL with $L_{max}$. At the same time, as most of the valid decoded vectors can be obtained using SCL with smaller list sizes, the average list size of A-SCL is much smaller than $L_{max}$ and its average decoding speed is much higher than that of SCL with $L_{max}$.
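The control flow of A-SCL described above can be sketched in a few lines of Python. The SCL decoder and the CRC check are passed in as callbacks; `scl_decode` and `crc_ok` are hypothetical placeholders, and `scl_decode` is assumed to return its candidates ordered from most to least reliable.

```python
def adaptive_scl(llrs, scl_decode, crc_ok, max_list_size):
    """Run A-SCL: start with list size 1 and double it until a candidate passes CRC.

    Returns (decoded_vector, list_size_used); decoded_vector is None when no
    candidate passes CRC even at the maximum list size.
    """
    list_size = 1  # list size 1 is a single SC decoding
    while True:
        candidates = scl_decode(llrs, list_size)  # assumed sorted, most reliable first
        for candidate in candidates:
            if crc_ok(candidate):
                return candidate, list_size
        if list_size >= max_list_size:
            return None, list_size  # give up: no valid candidate found
        list_size *= 2


# Stub decoder for illustration only: the correct word is the third-ranked candidate.
ranked = [(1, 1, 0), (0, 1, 0), (0, 0, 1), (1, 0, 0)]
truth = (0, 0, 1)
result = adaptive_scl(None, lambda llrs, size: ranked[:size], lambda v: v == truth, 8)
```

With this stub, the decodings with list sizes 1 and 2 fail the CRC and the decoding with list size 4 succeeds, which mirrors the variable latency discussed in the next subsection.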
II-C Problems of Implementing A-SCL on Hardware
If the A-SCL algorithm is implemented on hardware, the throughput will be much higher than that of a traditional SCL. However, direct mapping of the A-SCL algorithm onto a VLSI architecture requires the architecture to support SCL decodings with all the list sizes that the algorithm may use, from 1 up to the maximum list size. This increases the design effort and also the hardware complexity. Moreover, different codewords may need SCL with different list sizes, and SCL with a larger list size has a much higher latency. When a codeword needs a longer latency to decode, the input has to be interrupted until the decoding of the current frame is finished. Because of that, a directly-mapped architecture may not be able to support applications that need a constant input data rate, such as the channel coding blocks in communication networks.
To improve the decoding speed, a CPU-based software A-SCL decoder was proposed in [21], in which A-SCL was simplified by using only a single SC and an SCL with the maximum list size. However, the variable decoding latency issue has not yet been addressed. Moreover, the overall latency is very large, as the latency of the data movement between the memory and the computing resources is dominant. Hence, neither the original A-SCL nor the simplified A-SCL in [21] is a good choice for high-throughput VLSI implementations. To solve these issues and map A-SCL to a high-speed and efficient VLSI architecture, we propose a two-staged adaptive SCL, which is presented in the next section.
III Two-staged Adaptive SCL
III-A Algorithm of TA-SCL
As mentioned above, the average list size in A-SCL is much smaller than the maximum list size. In fact, in a high signal-to-noise ratio (SNR) operation region the average list size is close to 1, indicating that most codewords are decoded correctly by the single SC. At the same time, the error correction performance follows that of SCL with the maximum list size. Based on these observations, we propose a hardware-friendly two-staged adaptive SCL (TA-SCL). The block diagram of TA-SCL and its timing schedule are shown in Fig. 3 and Fig. 4, respectively. Basically, it includes two SCL decoders: one with a small list size (not necessarily 1), referred to below as the small-list SCL decoder, and one with a large list size, referred to as the large-list SCL decoder. Each codeword from the channel is first decoded by the small-list SCL decoder. Most of the time, the codeword is decoded correctly. If none of the candidates in the list passes CRC after this decoding, e.g. fr.1 in Fig. 4, the current codeword is decoded again by the large-list SCL decoder. This decoding usually takes longer than decoding with the small-list SCL decoder. However, unlike in A-SCL, the small-list SCL decoder brings in and decodes the next codeword from the channel input immediately instead of waiting for the large-list SCL decoder to finish the current codeword. The continuous running of the small-list SCL decoder permits data to be transmitted at a constant rate equal to its decoding speed, while the decoding performance is guaranteed by the large-list SCL decoder. Also, the hardware complexity of TA-SCL is reduced compared with a direct mapping of A-SCL, as only two SCL decoders are needed.
If the channel is subject to burst errors, it is possible that a new codeword cannot be correctly decoded by the small-list SCL decoder while the decoding in the large-list SCL decoder has not finished yet. To deal with this, an LLR buffer is needed to temporarily store the LLRs of the codewords passed on from the small-list SCL decoder, such as fr.3 and fr.4 shown in Fig. 4. An output buffer is also needed to re-order the decoded vectors, as the codewords may be decoded out of order. For example, fr.7 to fr.9 are stored in the output buffer until the decoding of fr.6 finishes.
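The interaction between the two decoders, the LLR buffer and the output buffer can be illustrated with a small schedule simulation. This is a sketch under stated assumptions rather than the timing of Fig. 4: time is counted in small-list decoding periods, the large-list SCL decoder is assumed to take `rho` periods per codeword and to hold the codeword it is working on in its own memory, the LLR buffer holds only waiting codewords, and on overflow the new codeword is dropped. The function name `simulate_ta_scl` and its parameters are hypothetical.

```python
from collections import deque


def simulate_ta_scl(stage1_ok, rho, buffer_size):
    """Return (done_at, release_at, dropped) for a sequence of frames.

    stage1_ok[k] is True when the small-list decoder decodes frame k correctly.
    done_at[k] is the time at which frame k is resolved, release_at[k] the time
    at which it can leave the output buffer in order, dropped the overflowed frames.
    """
    done_at = {}
    dropped = []
    waiting = deque()   # LLR buffer: frames waiting for the large-list decoder
    busy_until = 0      # time at which the large-list decoder becomes idle
    for k, ok in enumerate(stage1_ok):
        t = k + 1  # the small-list decoder finishes frame k at time t
        # The large-list decoder fetches waiting frames as soon as it is idle.
        while waiting and busy_until <= t:
            frame = waiting.popleft()
            busy_until += rho
            done_at[frame] = busy_until
        if ok:
            done_at[k] = t
        elif busy_until <= t:               # large-list decoder idle: start at once
            busy_until = t + rho
            done_at[k] = busy_until
        elif len(waiting) < buffer_size:    # store the LLRs until the decoder is free
            waiting.append(k)
        else:                               # buffer overflow: drop the new frame
            dropped.append(k)
            done_at[k] = t
    while waiting:                          # drain the buffer after the last input
        frame = waiting.popleft()
        busy_until += rho
        done_at[frame] = busy_until
    done = [done_at[k] for k in range(len(stage1_ok))]
    release, latest = [], 0
    for finish in done:                     # in-order release from the output buffer
        latest = max(latest, finish)
        release.append(latest)
    return done, release, dropped


done, release, dropped = simulate_ta_scl([True, False, True, False, False, True], rho=3, buffer_size=1)
```

In this run the last frame is resolved early but is released only after the frame before it finishes in the large-list decoder, which is the re-ordering role of the output buffer.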
III-B Error Correction Performance of TA-SCL
To analyse the error correction performance of the TA-SCL decoding, we define its parameters as follows.
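List sizes of the small-list / large-list SCL decoders.
BLERs of the small-list / large-list SCL decoders.
Decoding times of each codeword using the small-list / large-list SCL decoders.
Speed gain, which is defined as the ratio of the decoding time of the large-list SCL decoder to that of the small-list SCL decoder. Without loss of generality, the analysis below treats it as an integer.
Size of the LLR buffer, which equals the number of codewords that can be stored in the buffer.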
We also denote a TA-SCL decoding by its speed gain and buffer size. The example shown in Fig. 4 is one such TA-SCL decoding, in which the large-list SCL decoder takes several small-list decoding periods to decode a codeword. When a new codeword needs to be stored in the LLR buffer but the buffer is full and the decoding in the large-list SCL decoder has not finished yet, buffer overflow happens, which leads to performance degradation. (To deal with buffer overflow, either the codeword in the large-list SCL decoder or the new one has to be thrown away. In the following, we analyse only the former case; the latter case can be analysed in the same way.) An example of buffer overflow is marked in Fig. 4. Thus, the BLER of the TA-SCL decoding is bounded in terms of Pr(Overflow), the probability of buffer overflow, as expressed in (1).
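One plausible way to read such a bound is through a union-bound argument. The sketch below uses hypothetical symbols, with $\epsilon_{\mathrm{TA}}$ for the BLER of TA-SCL and $\epsilon_{l}$ for the BLER of the large-list SCL decoder, and it assumes that, when no overflow occurs, TA-SCL makes a block error no more often than the large-list SCL decoder alone. The exact form of (1) may differ.

```latex
\begin{align}
\epsilon_{\mathrm{TA}}
  &= \Pr(\text{error},\ \text{no overflow}) + \Pr(\text{error},\ \text{overflow}) \\
  &\le \epsilon_{l} + \Pr(\text{Overflow}).
\end{align}
```

Under this reading, keeping Pr(Overflow) well below $\epsilon_{l}$ is what makes the performance loss negligible.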
Obviously, it is important to prevent buffer overflow in order to reduce the BLER of the TA-SCL decoding. A large buffer size certainly helps as more codewords can be stored, and a smaller speed gain means that the large-list SCL decoder has relatively more time to decode the codewords accumulated in the buffer. To obtain the best tradeoff among performance, hardware usage and throughput, an analytical model of the TA-SCL decoding is introduced in the next sub-section to derive the relationship between Pr(Overflow) and the parameters of the decoding.
III-C Analytical Model of TA-SCL based on Markov Chain
To model the behavior of the TA-SCL decoding, we first introduce the states that the decoder can operate in. In particular, these states reflect whether buffer overflow will happen. We consider the number of codewords stored in the LLR buffer and the remaining time required by the large-list SCL decoder to finish its current decoding, measured in units of the small-list decoding time. Each codeword in the LLR buffer needs one full large-list decoding time to decode. The state of TA-SCL is then defined as the total time required to clear the buffer, i.e., the time needed to decode all the buffered codewords plus the remaining time of the current decoding. For a given speed gain and buffer size, the number of states is finite, and all the states can be divided into two groups.
Hazard states: the states in which the LLR buffer is full and the codeword currently being decoded by the large-list SCL decoder cannot be finished within one small-list decoding time. Buffer overflow will occur if the small-list SCL decoder cannot decode the next codeword correctly.
Safe states: in contrast with the hazard states, these states carry no overflow hazard, as the LLR buffer has enough space for a codeword that cannot be correctly decoded by the small-list SCL decoder.
An example is shown in Fig. 5(a), where the black and white arrows distinguish the two possible outcomes of the small-list SCL decoder, namely failing or succeeding in decoding the next codeword. The first three columns list the quantities that define each state. Typical transitions from hazard and safe states are marked with “H” and “S” in the figure, respectively. Note that the transition from state 0 is slightly different, as the large-list SCL decoder is idle.
Suppose that the block errors of the small-list SCL decoder are independent and identically distributed (IID). Then, the state transitions depend only on the current state of the TA-SCL decoding and on the BLER of the small-list SCL decoder. Hence, TA-SCL decoding is a Markov process and can be modeled with a Markov chain. The state diagram can be obtained by enumerating all the possible state transitions in Fig. 5(a), and Fig. 5(b) shows the resulting state diagram for the example. For further mathematical analysis, we map the state diagram to a square transition matrix with one row and one column per state. The element in row $i$ and column $j$ is the probability of moving from the current state $i$ to the next state $j$.
The transition matrix of the example is mapped directly from its state diagram.
With the transition matrix, we can perform a steady-state analysis. Suppose that the decoding begins in state 0, i.e., the initial state probability is concentrated on state 0. After each small-list decoding period, the state probability vector is multiplied once more by the transition matrix. The steady-state distribution is obtained from the limit of the powers of the transition matrix.
In fact, all the rows of this limiting matrix are the same, which means that the steady-state distribution is independent of the initial state. Buffer overflow happens when the decoding is in any hazard state and the small-list SCL decoder cannot decode the next codeword correctly, so the probability of buffer overflow is the total steady-state probability of the hazard states multiplied by the BLER of the small-list SCL decoder.
This probability of overflow bounds the BLER of TA-SCL in (1). It is a function of the BLER of the small-list SCL decoder, the speed gain and the buffer size. If the speed gain and the buffer size are fixed, Pr(Overflow) is monotonically increasing with respect to the BLER of the small-list SCL decoder. The proof is omitted due to the page limit and will be given in our future work. The monotonicity indicates that we can increase either the list size of the small-list SCL decoder or the SNR to get better error correction performance.
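A Markov-chain model of this kind can be evaluated numerically with a short script. The sketch below is one possible instance, not the exact chain of Fig. 5: the state is the remaining work of the large-list SCL decoder in small-list decoding periods, observed at the moment the small-list SCL decoder finishes a codeword; the speed gain `rho` is an integer; and on overflow the new codeword is dropped, whereas the analysis above discards the codeword in the large-list SCL decoder. The names `overflow_probability`, `eps_s`, `rho` and `buffer_size` are hypothetical.

```python
def overflow_probability(eps_s, rho, buffer_size, tol=1e-14, max_iter=100_000):
    """Steady-state probability that a codeword causes LLR buffer overflow.

    eps_s: BLER of the small-list decoder (IID failures assumed).
    rho: large-list decoding time in small-list decoding periods (integer).
    buffer_size: number of codewords the LLR buffer can hold.
    """
    n_states = (buffer_size + 1) * rho
    first_hazard = buffer_size * rho + 1  # buffer full and decoder still busy
    # Transition matrix: T[i][j] is the probability of moving from state i to state j.
    T = [[0.0] * n_states for _ in range(n_states)]
    for s in range(n_states):
        T[s][max(s - 1, 0)] += 1.0 - eps_s   # no new work, one period worked off
        if s >= first_hazard:
            T[s][s - 1] += eps_s             # overflow: the new codeword is dropped
        else:
            T[s][s + rho - 1] += eps_s       # rho periods of new work, one worked off
    # Power iteration from state 0 towards the steady-state distribution.
    pi = [1.0] + [0.0] * (n_states - 1)
    for _ in range(max_iter):
        nxt = [sum(pi[i] * T[i][j] for i in range(n_states)) for j in range(n_states)]
        converged = max(abs(a - b) for a, b in zip(nxt, pi)) < tol
        pi = nxt
        if converged:
            break
    # Overflow needs a hazard state and a failure of the small-list decoder.
    return eps_s * sum(pi[first_hazard:])


p_small_buffer = overflow_probability(0.1, rho=3, buffer_size=1)
p_large_buffer = overflow_probability(0.1, rho=3, buffer_size=3)
```

For the smallest non-trivial case in this model (speed gain 2 and no buffer) the chain has two states and the result can be checked by hand: the overflow probability is the square of the small-list BLER divided by one plus that BLER.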
We will show the accuracy of the proposed model by simulation results in the next section. We will also show that TA-SCL can improve the decoding throughput with a small hardware overhead.
IV Experimental Results
IV-A Accuracy of the Proposed Model
To verify the accuracy of the proposed model, we run simulations for a polar code under AWGN channel conditions, with fixed list sizes for the two component SCL decoders. The simulated BLER results of TA-SCL under different speed gains and buffer sizes are obtained at an SNR of 2dB and are compared with the upper bounds calculated using (1) and (8).
Fig. 6 summarizes the performance loss with respect to the speed gain when different buffer sizes are used. Here, the performance loss measures the increase in the BLER of the TA-SCL decoding over the baseline SCL. The solid lines and the dashed lines show the calculated and simulated results, respectively. It can be seen that the two sets of lines almost overlap, indicating that the BLER of TA-SCL is approximately equal to its upper bound derived in (8). The proposed model can thus be used to estimate the error correction performance of a TA-SCL decoding accurately. The results also show that a larger buffer size enables the decoding to run at a higher speed gain under the same constraint on performance loss.
IV-B Analysis of Hardware Gain
(Table I note: results are scaled to 90 nm technology.)
In this sub-section, we show the improvement in hardware performance achieved by the proposed TA-SCL decoder. We use a polar code decoded with a large list size of 32 as an example. The hardware performance of some VLSI architectures of SCL decoders in the literature [8, 22] is shown in Table I. They are used as the component SCL decoders in the TA-SCL decoder. Fig. 7 shows the error correction performance of TA-SCL with a buffer size of 6. When the target speed gain is 3, there is almost no performance degradation in the high SNR range (1.6dB and above) compared with the baseline SCL with a list size of 32. The degradation is obvious in the low SNR range. If the target speed gain is reduced to 2, the decoder can work in a wider range of SNR, down to 1.4dB. All these observations are consistent with the intuitions mentioned in Section IV. It is noted that the small-list SCL decoder is fast enough to sustain the target throughput in both cases, so a speed gain of up to 3x is achievable. The overall throughput of TA-SCL is also higher than that of the SCL decoders with smaller list sizes shown in Fig. 1 [12, 13].
The area of the proposed architecture is shown in Table I, which is estimated based on the results reported in the literature [8, 22]. It equals the sum of the areas of the two SCL modules, the LLR buffer and the output buffer. As the area of the large-list SCL module is dominant, the proposed TA-SCL has only an 18% area overhead. Moreover, due to the throughput improvement, the area efficiency of TA-SCL is also much higher than that of the baseline SCL.
In this work, a two-staged adaptive SCL is proposed. This algorithm can support data input at a fixed data rate and has a low hardware complexity. To analyse its error correction performance, an analytical model is also proposed and its accuracy is then verified by simulations. With a good selection of the parameters of TA-SCL using the proposed analytical model, an optimal tradeoff between speed gain, error correction performance loss and hardware overhead can be obtained for designing the VLSI architecture.
[1] E. Arıkan, “Channel polarization: A method for constructing capacity-achieving codes for symmetric binary-input memoryless channels,” IEEE Trans. Inf. Theory, vol. 55, no. 7, pp. 3051–3073, June 2009.
[2] I. Tal and A. Vardy, “List decoding of polar codes,” IEEE Trans. Inf. Theory, vol. 61, no. 5, pp. 2213–2226, May 2015.
[3] K. Chen, K. Niu, and J. R. Lin, “List successive cancellation decoding of polar codes,” IET Electron. Lett., vol. 48, no. 9, pp. 500–501, Apr 2012.
[4] C. Leroux, A. J. Raymond, G. Sarkis, and W. J. Gross, “A semi-parallel successive-cancellation decoder for polar codes,” IEEE Trans. Signal Process., vol. 61, no. 2, pp. 289–299, Jan 2013.
[5] Y. Fan and C.-Y. Tsui, “An efficient partial-sum network architecture for semi-parallel polar codes decoder implementation,” IEEE Trans. Signal Process., vol. 62, no. 12, pp. 3165–3179, Jun 2014.
[6] K. Niu and K. Chen, “CRC-aided decoding of polar codes,” IEEE Commun. Lett., vol. 16, no. 10, pp. 1668–1671, Oct 2012.
[7] K. Niu, K. Chen, and J. R. Lin, “Beyond turbo codes: Rate-compatible punctured polar codes,” in Proc. IEEE Int. Conf. Commun. (ICC), 2013, pp. 3423–3427.
[8] C. Xia, J. Chen, Y. Fan, C. Tsui, J. Jin, H. Shen, and B. Li, “A high-throughput architecture of list successive cancellation polar codes decoder with large list size,” IEEE Trans. Signal Process., vol. 66, no. 14, pp. 3859–3874, Jul 2018.
[9] A. Balatsoukas-Stimming, M. Bastani Parizi, and A. Burg, “LLR-based successive cancellation list decoding of polar codes,” IEEE Trans. Signal Process., vol. 63, no. 19, pp. 5165–5179, Oct 2015.
[10] B. Yuan and K. K. Parhi, “Low-latency successive-cancellation list decoders for polar codes with multibit decision,” IEEE Trans. VLSI Syst., vol. 23, no. 10, pp. 2268–2280, Oct 2015.
[11] C. Xiong, J. Lin, and Z. Yan, “Symbol-decision successive cancellation list decoder for polar codes,” IEEE Trans. Signal Process., vol. 64, no. 3, pp. 675–687, Feb 2016.
[12] J. Lin, C. Xiong, and Z. Yan, “A high throughput list decoder architecture for polar codes,” IEEE Trans. VLSI Syst., vol. 24, no. 6, pp. 2378–2391, June 2016.
[13] S. A. Hashemi, C. Condo, and W. J. Gross, “Fast and flexible successive-cancellation list decoders for polar codes,” IEEE Trans. Signal Process., vol. 65, no. 21, pp. 5756–5769, Nov 2017.
[14] P. Giard, G. Sarkis, C. Thibeault, and W. Gross, “A 638 Mbps low-complexity rate 1/2 polar decoder on FPGAs,” in IEEE Workshop Signal Process. Syst. (SiPS), 2015, pp. 1–6.
[15] C. Xiong, Y. Zhong, C. Zhang, and Z. Yan, “An FPGA emulation platform for polar codes,” in IEEE Workshop Signal Process. Syst. (SiPS), 2016, pp. 148–153.
[16] C. Xia, Y. Fan, J. Chen, C. Tsui, C. Zeng, J. Jin, and B. Li, “An implementation of list successive cancellation decoder with large list size for polar codes,” in Int. Conf. Field Programmable Logic and Appl. (FPL), 2017, pp. 1–4.
[17] A. Balatsoukas-Stimming, M. Bastani Parizi, and A. Burg, “On metric sorting for successive cancellation list decoding of polar codes,” in IEEE Int. Symp. Circ. and Syst. (ISCAS), 2015, pp. 1993–1996.
[18] C. Xia, Y. Fan, J. Chen, and C. Tsui, “On path memory in list successive cancellation decoder of polar codes,” in IEEE Int. Symp. Circ. and Syst. (ISCAS), 2018, pp. 1–5.
[19] M. Mousavi, Y. Fan, C. Tsui, J. Jin, H. Shen, and B. Li, “Efficient partial-sum network architectures for list successive-cancellation decoding of polar codes,” IEEE Trans. Signal Process., vol. 66, no. 14, pp. 3848–3858, Jul 2018.
[20] B. Li, H. Shen, and D. Tse, “An adaptive successive cancellation list decoder for polar codes with cyclic redundancy check,” IEEE Commun. Lett., vol. 16, no. 12, pp. 2044–2047, Dec 2012.
[21] G. Sarkis, P. Giard, A. Vardy, C. Thibeault, and W. J. Gross, “Fast list decoders for polar codes,” IEEE J. Sel. Areas Commun., vol. 34, no. 2, pp. 318–328, Feb. 2016.
[22] P. Giard, A. Balatsoukas-Stimming, G. Sarkis, C. Thibeault, and W. J. Gross, “Fast low-complexity decoders for low-rate polar codes,” Journal of Signal Processing Systems, vol. 90, no. 5, pp. 675–685, May 2018.