Implementation of Decoders for LDPC Block Codes and LDPC Convolutional Codes Based on GPUs
Abstract
With the use of the belief propagation (BP) decoding algorithm, low-density parity-check (LDPC) codes can achieve near-Shannon-limit performance. LDPC codes can accomplish very low bit error rates (BERs) even at a small bit-energy-to-noise-power-spectral-density ratio (E_b/N_0). In order to evaluate the error performance of LDPC codes, simulators running on central processing units (CPUs) are commonly used. However, the time taken to evaluate LDPC codes with very good error performance is excessive. For example, our simulation results have shown that it can take a modern CPU more than 7 days to reach a very low BER for a long code. In this paper, efficient LDPC block-code decoders/simulators which run on graphics processing units (GPUs) are proposed. We also implement the decoder for the LDPC convolutional code (LDPCCC). The LDPCCC is derived from a pre-designed quasi-cyclic LDPC block code with good error performance. Compared to a decoder for a randomly constructed LDPCCC, the complexity of the proposed LDPCCC decoder is reduced due to the periodicity of the derived LDPCCC and the properties of the quasi-cyclic structure. In our proposed decoder architecture, a group of codewords, whose number is a multiple of the warp size, are decoded together and hence their messages are also processed together. Since all the codewords share the same Tanner graph, the messages of the distinct codewords corresponding to the same edge can be grouped into one package and stored linearly. By optimizing the data structures of the messages used in the decoding process, both the read and write processes can be performed in a highly parallel manner by the GPUs. In addition, a thread hierarchy minimizing the divergence of the threads is deployed, and it can maximize the efficiency of the parallel execution.
With the use of a large number of cores in the GPU to perform the simple computations simultaneously, our GPU-based LDPC decoder can obtain hundreds of times speedup compared with a serial CPU-based simulator, and a large speedup compared with a multi-threaded CPU-based simulator.
LDPC, LDPC convolutional code, CUDA, graphics processing unit (GPU), OpenMP, parallel computing, LDPC decoder, LDPCCC decoder
I Introduction
Low-density parity-check (LDPC) codes were invented by Robert Gallager [1] but had been ignored for years until MacKay rediscovered them [2]. They have attracted much attention recently because they can achieve excellent error-correcting performance with the belief propagation (BP) decoding algorithm.
However, the BP decoding algorithm requires intensive computations. For applications like optical communication [3, 4], which require extremely low BERs, using CPU-based programs to simulate the LDPC decoder is impractical. Fortunately, the decoding algorithm possesses a high data-parallelism feature, i.e., the data used in the decoding process are manipulated in a very similar manner and can be processed separately from one another. Thus, practical decoders with low latency and high throughput can be implemented with dedicated hardware such as field-programmable gate arrays (FPGAs) or application-specific integrated circuits (ASICs) [5, 6, 7, 8, 9, 10, 11]. However, high-performance FPGAs and ASICs are very expensive and unaffordable for most researchers. Such hardware solutions also take a long time to develop. In addition, the hardware control and interconnection frame are always associated with a specific LDPC code. If one parameter of an LDPC code/decoder changes, the corresponding hardware design has to be changed accordingly, rendering the hardware-based solutions inflexible and non-scalable.
Recently, graphics processing units (GPUs), originally used only to process graphics, have been applied to support general-purpose computations [12]. In fact, GPUs are highly parallel structures with many processing units. They support floating-point arithmetic and can hence conduct computations with the same precision as CPUs. GPUs are particularly efficient in carrying out the same operations on a large amount of (different) data. Compared with modern CPUs, GPUs can also provide much higher data parallelism and bandwidth. Consequently, GPUs can provide a cheap, flexible and efficient solution for simulating an LDPC decoder. Potentially, the simulation time can be reduced from months to weeks or days when GPUs, instead of CPUs, are used. In addition, the GPU programming codes can be reused without much modification should more advanced GPUs be produced by manufacturers.
In [13, 14], a compressed parity-check matrix has been proposed to store the indices of the passing messages in a cyclic or quasi-cyclic LDPC code. Further, the matrix is stored in the constant cache memory of the GPU for fast access. The messages are stored in a compressed manner such that the global memory can frequently be accessed in a coalesced way. However, the coalesced memory access occurs only during the data-read process and is not always guaranteed due to a lack of data alignment. In [12, 15, 16], the sum-product LDPC decoder and the min-sum decoder have been implemented with GPUs. Moreover, by packing sixteen fixed-point data elements into one word, the LDPC decoder in [12] decodes sixteen codewords simultaneously and achieves a high throughput. Although the method in [12] allows coalesced memory access in either the read or write process, coalesced memory access in both the read and write processes is yet to be achieved.
Furthermore, LDPC convolutional codes (LDPCCCs), first proposed in [17], have been shown to achieve a better error performance than their LDPC block-code counterparts of similar decoding complexity. Many features of the LDPCCC make it suitable for real applications. First, the LDPCCC inherits the structure of the convolutional code, which allows continuous encoding and decoding of variable-length codes. Thus the transmission of codewords with varying code length is possible. Second, the LDPCCC adopts a pipelined decoding architecture: in the iterative decoding procedure, each iteration is processed by a separate processor and the procedure can be performed in parallel. So a high-throughput decoder architecture is possible. In [18, 19], the concepts and realization of highly parallelized decoder architectures have been presented and discussed. To the authors' best knowledge, there has not been any GPU-based implementation of the LDPCCC decoder yet. The reason may lie in the complex structure of the LDPCCC compared to the LDPC block code, particularly for the random time-varying LDPCCC.
As will be discussed in this paper, an LDPCCC derived from a well-designed QC-LDPC code possesses not only good BER performance, but also a regular structure that results in many advantages in practical implementations. Due to the structure inherited from the QC-LDPC code, the LDPCCC decoder enables an efficient and compact memory storage of the messages with a simple address controller.
In this paper, we develop flexible and highly parallel GPU-based decoders for LDPC codes. We improve the efficiency by making (i) the threads of a warp follow the same execution path (except when deciding whether a bit is a "0" or a "1") and (ii) the memory accessed by a warp be of a certain size and be aligned. The results show that the decoders based on the GPUs achieve a remarkable speedup: hundreds of times faster than the serial CPU-based decoder.
We also develop a GPU-based decoder for LDPC convolutional codes. We propose a decoder architecture for the LDPCCC derived from a QC-LDPC block code. By taking advantage of the homogeneous operations of the pipeline processors, we compress the index information of the different processors into one lookup table. Combined with an efficient thread layout, the decoder is optimized in terms of thread execution and memory access. Simulation results show that the GPU-based decoder achieves a substantial speedup compared with the serial CPU-based decoder; moreover, it outperforms a quad-core CPU-based decoder by a large factor in terms of simulation time.
The rest of the paper is organized as follows. Section II reviews the structure and decoding algorithm of the LDPC code. The same section also reviews the construction of the LDPCCC based on the QC-LDPC code as well as the decoding process for the LDPCCC. In Section III, the architecture of the CUDA GPU and the CUDA programming model are introduced. Section IV describes the implementation of the LDPC decoder and the LDPCCC decoder based on GPUs. Section V presents the simulation results of the LDPC decoder and the LDPCCC decoder. The decoding times are compared when (i) a GPU is used, (ii) a quad-core CPU is used with a single thread, and (iii) a quad-core CPU is used with up to 8 threads. Finally, Section VI concludes the paper.
II Review of LDPC Codes and LDPC Convolutional Codes
II-A Structure of LDPC Codes and QC-LDPC Codes
A binary (N, K) LDPC code is a linear block code specified by a sparse M × N parity-check matrix H. The code rate of such an LDPC code satisfies R ≥ (N − M)/N, and the equality holds when H is full rank.

The matrix H contains mostly 0s and a relatively small number of 1s. Such sparsity is the key characteristic that guarantees the good performance of LDPC codes. A regular LDPC code is a linear block code in which H contains a constant number of 1s in each column and a constant number of 1s in each row; moreover, the column weight, the row weight, M and N satisfy (column weight) × N = (row weight) × M. Otherwise the code is defined as an irregular LDPC code.
Example 1.
The parity-check matrix H in (1) shows an example of an irregular LDPC code.
(1) 
A bipartite graph called the Tanner graph [20] can be used to represent the codes and to visualize the message-passing algorithm. Figure 1 is the underlying Tanner graph of the H in (1). The upper nodes are called the message nodes or the variable nodes, and the nodes in the lower part of Fig. 1 are called the check nodes. An edge in the Tanner graph represents the adjacency of a variable node and a check node; an edge connecting variable node n and check node m corresponds to a nonzero (m, n)th entry in the H matrix.
QC-LDPC codes form a subclass of LDPC codes whose parity-check matrices consist of circulant permutation matrices [21, 22]. The parity-check matrix of a regular QC-LDPC code is represented by
(2)  H = [ I(a_{1,1})  I(a_{1,2})  ⋯  I(a_{1,L})
           I(a_{2,1})  I(a_{2,2})  ⋯  I(a_{2,L})
           ⋮           ⋮           ⋱  ⋮
           I(a_{J,1})  I(a_{J,2})  ⋯  I(a_{J,L}) ]

where J denotes the number of block rows, L is the number of block columns, I(0) is the identity matrix of size p × p, and I(a_{j,l}) (1 ≤ j ≤ J; 1 ≤ l ≤ L) is a circulant matrix formed by shifting the columns of I(0) cyclically to the right a_{j,l} times, the a_{j,l}'s being nonnegative integers less than p. The code rate of H is lower bounded by (L − J)/L. If one or more of the submatrices is/are substituted by the zero matrix, rendering non-uniform distributions of the check-node degrees or variable-node degrees, the QC-LDPC code becomes an irregular code.
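The expansion of an array of shift numbers into the full QC-LDPC parity-check matrix described above can be sketched as follows; the function names and the example shift values are illustrative, not taken from the paper.

```python
import numpy as np

def circulant(p, shift):
    """p x p identity matrix with its columns cyclically shifted right `shift` times."""
    return np.roll(np.eye(p, dtype=np.uint8), shift, axis=1)

def qc_ldpc_matrix(shifts, p):
    """Expand a J x L array of shift numbers into a (J*p) x (L*p) parity-check matrix."""
    J, L = shifts.shape
    H = np.zeros((J * p, L * p), dtype=np.uint8)
    for j in range(J):
        for l in range(L):
            H[j*p:(j+1)*p, l*p:(l+1)*p] = circulant(p, shifts[j, l])
    return H

shifts = np.array([[0, 1, 2],
                   [2, 0, 1]])   # illustrative shift numbers a_{j,l}
H = qc_ldpc_matrix(shifts, p=5)  # a (2,3)-regular QC-LDPC parity-check matrix
```

Because every circulant is a permutation matrix, the resulting code is regular: every row of H has weight L and every column has weight J.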
II-B Belief Propagation Decoding Algorithm for LDPC Codes
LDPC codes are most commonly decoded using the belief propagation (BP) algorithm [23, 24]. Referring to the Tanner graph shown in Fig. 1, the variable nodes and the check nodes exchange soft messages iteratively based on the connections and according to a two-phase schedule.
Given a binary (N, K) LDPC code with a parity-check matrix H, we define the code as the set of binary codewords c that satisfy the equation Hc = 0. At the transmitter side, a binary codeword c = (c_0, c_1, ..., c_{N−1}) is mapped into the bipolar sequence x = (x_0, x_1, ..., x_{N−1}) according to x_n = 1 − 2c_n. We assume that x is then transmitted over an additive white Gaussian noise (AWGN) channel and the received signal vector is then given by y = x + g, where g consists of independent Gaussian random variables with zero mean and variance σ².
Let R_n be the initial log-likelihood ratio (LLR) that the variable node n is a "0" to that it is a "1", i.e.,

(3)  R_n = ln( Pr(c_n = 0 | y_n) / Pr(c_n = 1 | y_n) )
Initially, R_n is calculated by R_n = 2y_n/σ² [25]. Define N(m) as the set of variable nodes that participate in check node m, and M(n) as the set of check nodes connected to variable node n. At iteration i, let Q_{nm} be the LLR message passed from variable node n to check node m; R_{mn} be the LLR message passed from check node m to variable node n; and Q_n be the a posteriori LLR of variable node n. Then the standard BP algorithm can be described in Algorithm 1 [2, 26].
Note that the decoding algorithm consists of four main procedures: initialization, the horizontal step, the vertical step and making hard decisions. For each of these procedures, multiple threads can be used to execute the computations in parallel and all the threads will follow the same instructions with no divergence occurring, except when making hard decisions.
(4)  Initialization: Q_{nm} = R_n

(5)  Horizontal step: R_{mn} = 2 tanh⁻¹( ∏_{n' ∈ N(m)\n} tanh(Q_{n'm}/2) )

(6)  Vertical step: Q_{nm} = R_n + ∑_{m' ∈ M(n)\m} R_{m'n}

(7)  A posteriori LLR and hard decision: Q_n = R_n + ∑_{m ∈ M(n)} R_{mn}; ĉ_n = 0 if Q_n ≥ 0 and ĉ_n = 1 otherwise
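Algorithm 1 itself is not reproduced here, but the standard sum-product decoding that the four procedures describe can be sketched as a minimal dense-matrix reference implementation. This is an illustrative sketch, not the paper's GPU kernel; the notation and the small (7,4) Hamming-code example are our own.

```python
import numpy as np

def bp_decode(H, y, sigma2, max_iter=50):
    """Sum-product BP decoding over an AWGN channel (bit 0 -> +1, bit 1 -> -1).
    H: binary parity-check matrix, y: received vector, sigma2: noise variance."""
    M, N = H.shape
    R = 2.0 * y / sigma2                    # initial channel LLRs
    Q = H * R                               # initialization: variable-to-check messages
    Rm = np.zeros_like(Q, dtype=float)      # check-to-variable messages
    for _ in range(max_iter):
        # horizontal step: check-node update via the tanh rule
        T = np.tanh(np.where(H == 1, Q / 2.0, 0.0))
        for m in range(M):
            idx = np.flatnonzero(H[m])
            prod = np.prod(T[m, idx])
            for n in idx:
                ext = np.clip(prod / T[m, n], -0.999999, 0.999999)
                Rm[m, n] = 2.0 * np.arctanh(ext)
        # vertical step and a posteriori LLRs
        Qn = R + Rm.sum(axis=0)
        Q = H * (Qn[None, :] - Rm)          # extrinsic variable-to-check messages
        c_hat = (Qn < 0).astype(np.uint8)   # hard decision
        if not ((H @ c_hat) % 2).any():     # stop early once all checks are satisfied
            break
    return c_hat

# (7,4) Hamming-code example: the all-zero codeword is sent and bit 4 is corrupted
H = np.array([[1, 1, 0, 1, 1, 0, 0],
              [1, 0, 1, 1, 0, 1, 0],
              [0, 1, 1, 1, 0, 0, 1]])
y = np.array([0.9, 1.1, 0.8, 1.2, -0.2, 1.0, 0.9])
c_hat = bp_decode(H, y, sigma2=0.5)
```

Note that, as the paper observes, every step is data-parallel except the hard decision: all rows (and all columns) can be updated independently with identical instructions.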
II-C Structure of LDPC Convolutional Codes
A (time-varying) semi-infinite LDPC convolutional code can be represented by its parity-check matrix in (3):

(3)  H = [ H_0(0)
           H_1(1)         H_0(1)
           ⋮              ⋮              ⋱
           H_{m_s}(m_s)   H_{m_s−1}(m_s) ⋯  H_0(m_s)
                          H_{m_s}(m_s+1) ⋯  H_1(m_s+1)  H_0(m_s+1)
                                         ⋱              ⋮           ⋱ ]

where m_s is referred to as the syndrome former memory of the parity-check matrix. The submatrices H_i(t), i = 0, 1, ..., m_s, are binary (c − b) × c matrices. If H_0(t) is full rank for all time instants t, the matrix in (3) defines a rate R = b/c convolutional code, ignoring the irregularity at the beginning.
Definition 1.
An LDPC convolutional code is called a regular LDPC convolutional code if its parity-check matrix has a constant number of ones in each row and a constant number of ones in each column, starting from a certain row and column.
Definition 2.
An LDPC convolutional code is periodic with period T if H_i(t) is periodic with period T, i.e., H_i(t) = H_i(t + T) for all i and t.
A code sequence v = [v_0, v_1, v_2, ...] is "valid" if it satisfies the equation

(8)  v Hᵀ = 0

where Hᵀ is the syndrome-former (transposed parity-check) matrix of H.
II-D Deriving LDPC Convolutional Codes from QC-LDPC Block Codes
There are several methods to construct LDPC convolutional codes from LDPC block codes. One method is to derive a time-varying LDPCCC by unwrapping randomly constructed LDPC block codes [17]; another is to unwrap QC-LDPC codes [27, 28]. We now consider a construction method that unwraps a class of QC-LDPC block codes.
Suppose we have created a QC-LDPC block code with J block rows and L block columns, where the size of its circulant matrices is p × p. We can derive the parity-check matrix of an LDPC convolutional code using the following steps.

1) Partition the parity-check matrix H to form a Λ × Λ matrix of subblocks, where Λ is the greatest common divisor of J and L, i.e.,

H = [ H_{1,1}  H_{1,2}  ⋯  H_{1,Λ}
      H_{2,1}  H_{2,2}  ⋯  H_{2,Λ}
      ⋮        ⋮        ⋱  ⋮
      H_{Λ,1}  H_{Λ,2}  ⋯  H_{Λ,Λ} ]

where H_{i,j} is a (Jp/Λ) × (Lp/Λ) matrix, for 1 ≤ i, j ≤ Λ.

2) Divide H along the diagonal into two portions: the strictly upper-triangular portion H_U and the lower-triangular portion H_L as follows:

H_L = [ H_{1,1}
        H_{2,1}  H_{2,2}
        ⋮        ⋮        ⋱
        H_{Λ,1}  H_{Λ,2}  ⋯  H_{Λ,Λ} ]

and

H_U = [ 0  H_{1,2}  ⋯  H_{1,Λ}
           0        ⋱  ⋮
                    ⋱  H_{Λ−1,Λ}
                       0 ]

3) Unwrap the parity-check matrix of the block code to obtain the parity-check matrix of the LDPCCC. First paste the strictly upper-triangular portion H_U below the lower-triangular portion H_L. Then repeat the resulting diagonally-shaped matrix infinitely, i.e.,

H_conv = [ H_L
           H_U  H_L
                H_U  H_L
                     ⋱    ⋱ ]
The resulting time-varying LDPCCC has a period of Λ and its syndrome former memory equals Λ − 1. The girth of the derived LDPCCC is at least as large as the girth of the original QC-LDPC code [29]. A convenient feature of this time-varying unwrapping is that a family of LDPC convolutional codes can be derived by choosing different circulant sizes p for the QC-LDPC block code.
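The three unwrapping steps above can be sketched as follows for a finite decoding window; the function and variable names are illustrative, and only a few periods of the semi-infinite matrix are materialized.

```python
import numpy as np

def unwrap(H, Lambda, periods=3):
    """Unwrap a block-code parity-check matrix H (partitioned into Lambda x Lambda
    subblocks) into a finite window of the LDPCCC parity-check matrix."""
    M, N = H.shape
    rb, cb = M // Lambda, N // Lambda                    # subblock dimensions
    lower = np.tril(np.ones((Lambda, Lambda), dtype=bool))
    # build one diagonal "tile": lower-triangular subblocks on top,
    # strictly upper-triangular subblocks pasted below
    tile = np.zeros((2 * M, N), dtype=H.dtype)
    for i in range(Lambda):
        for j in range(Lambda):
            blk = H[i*rb:(i+1)*rb, j*cb:(j+1)*cb]
            row = i if lower[i, j] else i + Lambda       # H_U goes below H_L
            tile[row*rb:(row+1)*rb, j*cb:(j+1)*cb] = blk
    # repeat the tile along the diagonal
    Hcc = np.zeros((M * (periods + 1), N * periods), dtype=H.dtype)
    for t in range(periods):
        Hcc[t*M:(t+2)*M, t*N:(t+1)*N] = tile
    return Hcc

# small illustrative base matrix with J = 2, L = 4, hence Lambda = gcd(2, 4) = 2
H = np.array([[1, 0, 0, 1, 1, 0, 1, 0],
              [0, 1, 1, 0, 0, 1, 0, 1],
              [1, 1, 0, 0, 1, 1, 0, 0],
              [0, 0, 1, 1, 0, 0, 1, 1]])
Hcc = unwrap(H, Lambda=2, periods=2)
```

Since every column of the tile contains all Λ subblocks of the corresponding column of H (split between the H_L and H_U parts), the variable-node degrees of the block code are preserved in the unwrapped code.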
Example 2.
Consider a QC-LDPC code with J block rows and L block columns. It is first divided into Λ × Λ equally sized subblocks¹, and then the parity-check matrix of the LDPCCC is derived. The construction process is shown in Fig. 2. (¹Here we use "subblock" to denote the matrix H_{i,j}, to distinguish it from the submatrices, i.e., the circulant matrices, within it.)
II-E Decoding Algorithm for LDPCCC
In H, two different variable nodes connected to the same check node cannot be more than m_s time units apart. This allows a decoding window that operates on a fixed number of nodes at any one time. Since any two variable nodes that are at least m_s + 1 units apart can be decoded independently, parallel implementation is feasible. The LDPCCC can therefore be decoded with the pipelined BP decoding algorithm [17]. Specifically, for a maximum iteration number of I, I independent processors will be employed, working on different variable nodes corresponding to different times. In each processor, the variable nodes and the check nodes exchange soft messages iteratively based on the connections and according to a two-phase schedule.
Fig. 3 shows a decoder on the Tanner graph. It is based on the LDPCCC structure derived in Example 2. We refer to the incoming variable nodes (bits) as a frame. Note that a fixed number of bits form a frame, and the frames within the syndrome former memory are involved in the same constraints. The processors can operate concurrently. At every iteration, every processor first updates the neighboring check nodes of the variable nodes that have just come into this processor. Then every processor updates the variable nodes that are about to leave it.
The computations of the check-node updating and the variable-node updating are based on the standard BP algorithm. Suppose v = [v_0, v_1, ...], where v_t is the t-th transmitted code frame. Each code frame is mapped into a bipolar sequence as before (a code bit c is mapped to 1 − 2c). Assuming an AWGN channel, the received signal is given by y = x + g, where g is an AWGN sequence with zero mean and variance σ².
Using the same notation as in Sect. II-B, the pipelined BP decoding algorithm applied to the LDPCCC is illustrated in Algorithm 2. As with the LDPC decoding algorithm, the LDPCCC decoding algorithm consists of four main procedures: initialization, the horizontal step, the vertical step and making hard decisions. Moreover, for each of these procedures, multiple threads can be used to execute the computations in parallel and all the threads will follow the same instructions with no divergence occurring, except when making hard decisions.
(10) 
(11) 
(12) 
III Graphics Processing Unit and CUDA Programming
A graphics processing unit (GPU) consists of multi-threaded, multi-core processors. GPUs follow the single-instruction multiple-data (SIMD) paradigm. That is to say, given a set of data (regarded as a stream), the same operation or function is applied to each element in the stream by different processing units in the GPU simultaneously. Figure 4 shows a simplified architecture of the latest GPU device. It contains a number of multiprocessors called streaming multiprocessors (SMs). Each SM contains a group of stream processors, or cores, and several types of memory: registers, on-chip memory, L2 cache and the most plentiful dynamic random-access memory (DRAM). The L1 cache is dedicated to each multiprocessor and the L2 cache is shared by all multiprocessors. Both caches are used to cache accesses to local or global memory. The on-chip memory has a small capacity (tens of KB) but a low latency [30].
In our work, the GPU used is a GTX 460, which has 7 SMs and 768 MB of global memory. Each SM contains 48 cores [31]. Moreover, the 64 KB of on-chip memory is configured as 48 KB of shared memory and 16 KB of L1 cache for each SM, because our decoders benefit from a larger shared memory.
CUDA (Compute Unified Device Architecture) is a parallel computing architecture developed by Nvidia. In a CUDA program, computations are performed as a sequence of functions called parallel kernels. Each kernel is typically invoked on a massive number of threads. Threads are first grouped into thread blocks and blocks are further grouped into a grid. A thread block contains a set of concurrently executing threads, and all blocks have the same size, up to a hardware limit; in current GPUs with compute capability 2.x, the limit is 1024 threads per block.
At an abstract level, the CUDA devices use several memory spaces, which have different characteristics. These memory spaces include global memory, local memory, shared memory, constant memory, texture memory, and registers. The global memory and texture memory are the most plentiful but have the largest access latency, followed by constant memory, registers, and shared memory.
CUDA's hierarchy of threads maps to the hierarchy of processors on the GPU: a GPU executes one or more kernel grids and an SM executes one or more thread blocks. In current GPUs with compute capability 2.x, the SM creates, manages, schedules, and executes threads in groups of 32 parallel threads called warps. A warp is the execution unit and executes one common instruction at a time, so full efficiency is realized when all threads of a warp take the same execution path.
In CUDA programming, the first important consideration is the coalescing of global memory accesses. Global memory resides in the device memory and is accessed via 32-, 64-, or 128-byte memory transactions. These memory transactions must be naturally aligned (i.e., the first address must be a multiple of the transaction size).
When a warp executes an instruction that accesses the global memory, it coalesces the memory accesses of the threads within the warp into one or more of these memory transactions depending on the size of the word accessed by each thread and the distribution of the memory addresses across the threads.
IV Implementation of Decoders for LDPC Codes and LDPCCCs
IV-A GPU-based LDPC Decoder
We implement our decoders using the standard BP decoding algorithm. According to the CUDA programming model, the granularity of thread execution and of a coalesced memory access is a warp. Full efficiency is realized when all threads in a warp take the same execution path and the coalesced memory access requirement is satisfied. Thus, we propose to decode a group of codewords simultaneously, where the group size is an integer multiple of a warp (i.e., a multiple of 32). In each decoding cycle, this group of codewords will be input, decoded, and output together in parallel.
Recall that an LDPC code can be represented by its paritycheck matrix or a Tanner graph. A nonzero element in the paritycheck matrix corresponds to an edge in the Tanner graph.
In the LDPC decoder, messages are bound to the edges in the Tanner graph (or, equivalently, the 1s in the parity-check matrix H). So we store the messages according to the positions of the 1s. Besides, the channel messages corresponding to the variable nodes are required. We use one data structure to store the messages passed between the variable nodes and the check nodes, and another to store the channel messages. The difficulty of the CUDA memory arrangement lies in the fact that for practical LDPC codes with good performance, the positions of the 1s are scattered in the parity-check matrix.
First, in the BP decoding procedure, although there are two kinds of messages, namely the variable-to-check messages and the check-to-variable messages, only one kind of message needs to be stored at every step of the iteration: after the check-node updating step, only the check-to-variable messages R_{mn} are stored, and after the variable-node updating step, only the variable-to-check messages Q_{nm} are stored. Second, in our new decoder architecture, a group of codewords, whose number is a multiple of a warp, are decoded together and hence their messages are also processed together. We number the distinct codewords and use the same notations for the messages as before, i.e., Q_{nm} is the message from variable node n to check node m and R_{mn} is the message from check node m to variable node n, now kept once per codeword. Since all the codewords share the same Tanner graph, the messages of the distinct codewords corresponding to the same edge can be grouped into one package and stored linearly: in the package corresponding to the edge connecting variable node n and check node m, the messages Q_{nm} (or R_{mn}) of all the codewords are stored contiguously. This is shown in Figure 5. The different packages are aligned linearly according to their corresponding positions in the parity-check matrix, row-by-row, and left-to-right within each row. That implies the messages associated with one check node are stored contiguously.
(13) 
Remark.
To be consistent with the use of memory locations in computer programming, all the indices of the data structures in this paper start from 0.
The advantage of this arrangement is obvious. Since the package size is a multiple of 32, the memory segment for every package is naturally aligned when the data type is one of the required data types (i.e., with a word size of 1, 2, 4, or 8 bytes). In addition, the structure of the parity-check matrix is shared by all the codewords. As these data elements are processed together, they can be accessed by contiguous threads and hence the global memory is always accessed in a coalesced way. We also ensure that the threads within a warp always follow the same execution path, with no divergence occurring (except when making hard decisions on the received bits). Then both the memory access and the thread execution are optimal and efficient.
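The package layout described above amounts to a simple linear addressing rule: the message of codeword w on the e-th edge (edges ordered row-by-row, left-to-right) lives at address e × (group size) + w, so the threads of one warp touch consecutive, aligned addresses. A minimal sketch, with illustrative names:

```python
import numpy as np

NCW = 32                                  # codewords decoded together (one warp)

def edge_list(H):
    """Edges ordered row-by-row, left-to-right, as in the decoder's message layout."""
    return [(m, n) for m in range(H.shape[0]) for n in range(H.shape[1]) if H[m, n]]

def msg_index(edge_id, w, ncw=NCW):
    """Linear address of the message of codeword w on edge edge_id."""
    return edge_id * ncw + w

H = np.array([[1, 1, 0, 1],
              [0, 1, 1, 1]])
edges = edge_list(H)
messages = np.zeros(len(edges) * NCW)     # one contiguous array for all packages
# threads 0..31 of a warp handle codewords 0..31 of edge 2: consecutive addresses
addrs = [msg_index(2, w) for w in range(NCW)]
```

Because the package size is a warp multiple, each package starts at a 32-element boundary, which is what makes the GPU's coalesced 32/64/128-byte transactions possible.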
We also need to store the details of the parity-check matrix. Two lookup tables will be kept: one used in the check-node updating process and one used in the variable-node updating process. The two tables store the indices of the data accessed in the two updating processes and both are two-dimensional. The first dimension distinguishes the different nodes, e.g., the m-th entry of the check-node table is associated with the m-th check node (the m-th row) and records the indices of the messages related to that check node. The two lookup tables are shared by all the codewords.
(14) 
(15) 
Example 3.
Consider the parity-check matrix in (13). The corresponding data structure begins with the package of the first nonzero entry in the first row, followed by the packages of the remaining nonzero entries taken row-by-row and left-to-right. The subscripts of the nonzero entries in (13) indicate the positions of the associated data in the entire data structure, starting from 0. The two lookup tables are shown in (14) and (15).
It is seen that the size of the check-node lookup table can be reduced by storing only the first address of the data in each row. In particular, for regular LDPC codes with a unique row weight d_c, the indices for the m-th check node simply run from m·d_c to (m+1)·d_c − 1. As for the variable-node lookup table, the indices are normally irregular and random; hence a fully indexed lookup table is required.
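The construction of the two lookup tables can be sketched as follows; the names lut_c and lut_v are our own labels for the check-node and variable-node tables, not the paper's notation.

```python
import numpy as np

def build_luts(H):
    """Build decoder lookup tables from parity-check matrix H.
    lut_c[m] lists the message indices of row m (compressible to a start
    address for regular codes); lut_v[n] lists the message indices of column n."""
    M, N = H.shape
    lut_c = [[] for _ in range(M)]
    lut_v = [[] for _ in range(N)]
    e = 0
    for m in range(M):                     # row-by-row, left-to-right ordering
        for n in range(N):
            if H[m, n]:
                lut_c[m].append(e)
                lut_v[n].append(e)
                e += 1
    return lut_c, lut_v

H = np.array([[1, 1, 0, 1],
              [0, 1, 1, 1]])
lut_c, lut_v = build_luts(H)
# For a regular code with row weight d_c, lut_c[m] is simply
# range(m*d_c, (m+1)*d_c), so only the start addresses need to be stored;
# lut_v has no such pattern and must be stored in full.
```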
The two lookup tables are stored in the constant or texture memory of the CUDA device so that they are cached, reducing the access time.
A separate thread is assigned to process each check node or each variable node in the updating kernel; across the codeword group, many threads thus process the data of the different codewords simultaneously. So, a two-dimensional thread hierarchy is launched. The first dimension identifies the different codewords while the second dimension processes the different check nodes or variable nodes. The thread layout is illustrated in Fig. 6. For each thread block, we allocate one thread per codeword in the threadIdx.x dimension², and a number of threads in the threadIdx.y dimension. (²In CUDA, threads are linear in the threadIdx.x dimension.) The total number of threads per block should be within the thread-block size limit (1024 for the current device). The total number of thread blocks is determined by the number of check nodes M or the number of variable nodes N: denoting the number of nodes handled per block in the check-node updating kernel by B_c and that in the variable-node updating kernel by B_v, the numbers of thread blocks are given by ⌈M/B_c⌉ and ⌈N/B_v⌉, respectively. In Fig. 6, the threads marked by the vertical rectangle are processing the same codeword.
In the check-node updating kernel and the variable-node updating kernel, the forward-and-backward calculation is adopted as in [32]. The shared memory is used to cache the data involved so as to avoid re-accessing the global memory. Due to the limited size of the shared memory, the size of the thread block should not be too large. Consider a QC-LDPC code with row weight d_c. For each check node, there are d_c data elements to be stored per codeword. Denote the shared memory size by S and the size of each data element by b. Consequently, in the check-node updating kernel, the number of check nodes handled per thread block, B_c, is limited by

(16)  B_c · d_c · b · (number of codewords per package) ≤ S

In addition, the resulting thread-block dimensions must respect the thread-block size limit.
With such a thread layout, the threads access the memory in a straightforward pattern. For example, in the check-node updating kernel, a two-dimensional thread hierarchy is launched, and during the memory access the threads along the first dimension are one-to-one mapped to the data in a message package. Hence, coalesced memory access is guaranteed.
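The thread-layout arithmetic described above can be sketched as follows; the parameter values and the helper name are illustrative, and the limits (48 KB of shared memory, 1024 threads per block) are those quoted in Section III.

```python
import math

def thread_layout(num_nodes, ncw, block_y, smem_bytes, row_weight, msg_bytes):
    """Pick a 2-D thread layout: x covers the ncw codewords of a package,
    y covers block_y nodes per block, subject to shared-memory and size limits."""
    # shared memory must hold row_weight messages per codeword for every node
    # handled by the block
    assert block_y * row_weight * msg_bytes * ncw <= smem_bytes, "exceeds shared memory"
    assert ncw * block_y <= 1024, "exceeds max threads per block"
    num_blocks = math.ceil(num_nodes / block_y)
    return (ncw, block_y), num_blocks

# illustrative numbers: 2048 check nodes, 32 codewords, 4-byte messages, d_c = 6
block_dim, grid_dim = thread_layout(num_nodes=2048, ncw=32, block_y=8,
                                    smem_bytes=48 * 1024, row_weight=6, msg_bytes=4)
```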
IV-B GPU-based LDPCCC Decoder
The decoding algorithm and the pipelined LDPCCC decoder architecture have been introduced in Section II-E. The LDPCCCs studied in our work are derived from QC-LDPC codes as described in Section II-D. Our LDPCCC decoder is therefore confined to LDPCCCs whose parity-check matrices possess this kind of structure.
IV-B1 Data Structure
The LDPC convolutional codes are decoded continuously. We will thus refer to an LDPCCC code sequence v = [v_0, v_1, ...] as a code stream, and to each v_t as a code frame or variable frame. A code stream is constrained by the parity-check matrix H through v Hᵀ = 0.
The parity-check matrix of the LDPCCC is shown in Figure 7. It is seen that the check nodes are grouped into layers. Each variable-node frame is connected to m_s + 1 check layers in the parity-check matrix. The code rate is determined by the size of a variable frame and the size of a check layer.
We will use the same notation as in Section II-D. The LDPCCC is derived from a QC-LDPC base code whose parity-check matrix consists of J × L submatrices, each of size p × p. The base matrix is first divided into Λ × Λ subblocks³ and each subblock contains several submatrices. (³Note that a "subblock" is different from a "submatrix".) Referring to Section II-D, we denote the unwrapped parity-check matrix of the QC-LDPC code as the one-period tile of the LDPCCC. The parity-check matrix of the derived LDPCCC is a repetition of this tile, so the number of edges per period equals the number of 1s in the base matrix.
In designing the LDPCCC decoder, the first thing to consider is the amount of memory required to store the messages. Like the LDPC decoder, we store the messages according to the edges in the parity-check matrix. Let I denote the number of iterations in the LDPCCC decoding. Then I processors are required in the pipelined decoder. Although the parity-check matrix of the LDPCCC is semi-infinite, the decoder only needs to allocate memory for the I processors: one unit per edge covered by each processor for the messages passed between the variable nodes and the check nodes, plus the memory for the corresponding channel messages.
Next, we describe the hierarchical data structure of the LDPCCC decoder memory space. One multi-dimensional array with two hierarchies stores the messages on the edges, and another stores the channel messages. First, we divide the entire edge-message memory space into I groups corresponding to the I processors; the first hierarchy indexes these groups. Second, recall that the parity-check matrix in Figure 7 is derived from a base matrix which is divided into nonzero subblocks of equal size. Thus, within each group, the memory is also divided into subblocks, indexed by the second hierarchy, and every subblock entry stores the messages associated with one subblock of the parity-check matrix. The memory for the channel messages is simpler: one array is allocated per processor. Finally, to optimize the thread execution and memory access, a group of LDPC convolutional code streams, whose number is a multiple of a warp, are decoded simultaneously. Thus the data of all the streams for one edge are combined into one package and occupy one memory unit.
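The two-hierarchy memory organization can be sketched as follows; the function name, dtype and parameter values are illustrative.

```python
import numpy as np

def alloc_ldpccc_memory(num_procs, subblocks, sub_edges, frame_bits, ncw=32):
    """Allocate message memory for a pipelined LDPCCC decoder.
    num_procs:  decoding iterations (= pipeline processors)
    subblocks:  number of nonzero subblocks per period
    sub_edges:  number of edges (1s) per subblock
    frame_bits: variable nodes per code frame
    ncw:        code streams decoded together (one package per edge)"""
    # first hierarchy: one group per processor; second: one array per subblock
    edge_mem = [[np.zeros(sub_edges * ncw, dtype=np.float32)
                 for _ in range(subblocks)] for _ in range(num_procs)]
    chan_mem = [np.zeros(frame_bits * ncw, dtype=np.float32)
                for _ in range(num_procs)]
    total_units = num_procs * subblocks * sub_edges   # packages, not bytes
    return edge_mem, chan_mem, total_units

edge_mem, chan_mem, total = alloc_ldpccc_memory(num_procs=10, subblocks=3,
                                                sub_edges=24, frame_bits=16)
```

Shifting a frame from one processor to the next then needs no copying: the decoder merely relabels which group a subblock's memory belongs to, as described in the decoding procedures below.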
The LDPCCC decoder uses the BP algorithm to update the check nodes and the variable nodes. The BP decoding procedures are based on the parity-check matrix H. Along with the data structures storing the messages, the decoder also needs the structure information of H to know the connections between the check nodes and the variable nodes. This information is used to calculate the indices of the data accessed during the updating. Due to the periodic property of the constructed LDPCCC, the structure of H is shared by all the processors. We label the subblocks of one period of H with consecutive numbers starting from 0.
In addition, in the decoder, the check-node layers or variable-node frames being updated simultaneously in the different processors are separated by a fixed interval. Since H is periodic, at any time slot the processors require the same structure information when updating the check nodes or the variable nodes, as seen in Figure 7. The lookup tables used in the check-node updating and the variable-node updating store the labels of the subblocks involved in the respective updating processes. Besides, another lookup table is used to store the "shift numbers"⁴ of the submatrices in each subblock. (⁴For a QC-LDPC base matrix, the structure information is the "shift number" of each submatrix: −1 represents the all-zero matrix, 0 represents the identity matrix, and a positive integer s represents cyclically right-shifting the identity matrix s times.)
IV-B2 Decoding Procedures
Based on the discussion in Section II-E, the detailed decoding procedures are as follows.

At the first time slot, the first code frame enters the first processor. This means that the corresponding memory space is filled with its channel messages. These channel messages are then propagated to the corresponding check nodes. Hence, referring to Fig. 7 and (18), the associated memory blocks are filled with the same channel messages.
Next, the first check layer is updated based on the messages from the first code frame; these are the only messages available to it at this point.

At the second time slot, the second code frame enters the first processor, and the corresponding memory space is filled with its messages. The second check layer is then updated in a similar way as the first. However, the messages from both the first and the second code frames are used in this update, based on the index information in the check-node lookup table. The procedure at this time slot is shown in Fig. 8(a).
The procedure goes on. When enough code frames have been input and the corresponding check layers have been updated, all the check-to-variable messages needed to update the first variable layer are available. This variable layer is then updated with its channel messages and the available check-to-variable messages. Now the first code frame is at the end of the first processor and is about to be shifted to the second processor. Instead of copying the memory from one location to another, all we need to do is to specify that this memory "belongs" to the second processor.

At the next time slot, a new code frame arrives and its messages are stored in the corresponding memory space. Now there are two check layers to update. One of them is updated based on all the available messages from the neighboring code frames, while the other is updated based on the newly arrived messages only. This insufficient updating of check nodes occurs only for the first few code frames. After the updating of the check nodes, the code frame at the end of the first processor is updated. No code frame has yet arrived at the end of the second processor.

At a later time slot, the entire memory spaces for the edge messages and the channel messages are filled. The earliest code frame and its associated messages are at the end of the last processor (as labeled), while the latest code frame has just been input into the first processor. Next, the check nodes in the check layers of all the processors are updated in parallel. After the updating of the check nodes, all the variable nodes that are leaving their processors are updated. Furthermore, the earliest code frame is about to leave the decoder, and hard decisions are made based on its a posteriori LLRs. Its memory space is then cleared for reuse. At the next time slot, a new code frame enters the decoder and this memory space is filled with its messages.
Remark.
In our GPU-based decoder, all the check nodes (variable nodes) that need to be updated in the processors are processed in parallel by multiple threads.

Note that the LDPCCC parity-check matrix is periodic. Hence, a code frame entering the decoder once the pipeline is full reuses the memory space of a frame that has already left. The check layers in the processors are then updated, followed by the updating of the corresponding code frames. Moreover, the "oldest" code frame residing in the decoder is about to leave, and hard decisions are made on it.
The entire LDPCCC decoder therefore possesses a circulant structure, as shown in Figure 9. The memory is not shifted, except for that associated with the code frame leaving the decoder. Instead, the processors "move" by changing the processor label of each code frame. Correspondingly, the "entrance" and the "exit" move along the data structure. This circulant structure reduces the time spent on memory manipulation and simplifies the decoder.
IV-B3 Parallel Thread Hierarchy
As described in Section IV-B1, the memory associated with each entry in the matrix is a message package containing the messages of all the code streams. There is therefore a straightforward mapping between the thread hierarchy and the data structure. In the check-node-updating kernel (or variable-node-updating kernel), a two-dimensional thread hierarchy is launched: one dimension is mapped to the total number of check nodes (or variable nodes) being updated in the processors, and the other dimension is mapped to the number of code streams. As in the LDPC decoder, the code-stream dimension is configured as the threadIdx.x dimension and the node dimension as the threadIdx.y dimension in the CUDA thread hierarchy. The threads in the threadIdx.x dimension are contiguous and access the data in each message package in a coalesced manner.
Fig. 8: Illustration of the procedures of an LDPCCC decoder. A horizontal line denotes the updating of a row; a vertical line denotes the updating of a column.
IV-C CPU-Based LDPC and LDPCCC Decoders
We implement both the serial CPU-based LDPC decoder and the serial CPU-based LDPCCC decoder in the C language. As CPUs with multiple cores are very common nowadays, we further implement a multi-threaded CPU-based LDPCCC decoder using OpenMP. OpenMP [33] is a portable, scalable programming interface for shared-memory parallel computers. It can be used to explicitly direct multi-threaded, shared-memory parallelism. A straightforward application of OpenMP is to parallelize intensive loop-based code with the #pragma omp parallel for directive. The executing threads are then automatically allocated to the different cores of a multi-core CPU.
The horizontal step and the vertical step in Algorithm 2 involve intensive computation. On a single-core CPU, the updating of the different nodes is processed with a serial for loop. Since the different nodes can be updated independently of one another, it is natural to parallelize the for loop with the #pragma omp parallel for directive when executing on a multi-core CPU. Hence, in our implementation of the multi-threaded CPU-based LDPCCC decoder, we issue multiple threads for both the updating of the check nodes (9) and the updating of the variable nodes (10).
V Results and Discussion
Table I: Details of the CPU and the GPU used in the simulations.

                            | CPU            | GPU
Platform                    | Intel Xeon     | Nvidia GTX460
Number of cores             | 4              | 7 x 48 = 336
Clock rate                  | 2.26 GHz       | 0.81 GHz
Memory                      | 8 GB DDR3 RAM  | 768 MB global memory and 48 KB shared memory
Maximum number of threads   | --             |
Maximum thread-block size   | --             | threads
Programming language        | C/OpenMP       | CUDA C
Table II: Characteristics of the QC-LDPC codes under test.

Code | Number of Edges
A    |
B    |
C    |
D    |
V-A The Experimental Environment
The CPU being used is an Intel Xeon containing 4 cores, and it can handle up to 8 threads at a time. The serial CPU-based decoders are developed in C, and the multi-threaded CPU-based LDPCCC decoder is developed using OpenMP. Note that for the serial CPU-based decoders, only one of the CPU cores is utilized. The GPU used in this paper is a GTX460 containing 336 cores, and the GPU-based decoders are developed using CUDA C. Furthermore, in our simulations, multiple codewords are decoded simultaneously in the GPU decoders. Details of the CPU and the GPU used in our simulations are presented in Table I.
Table II shows the characteristics of the QC-LDPC codes under test. Codes A to D have the same code rate. These codes are further used to derive regular LDPCCCs. In order to avoid confusion, we denote the derived LDPCCCs as Codes A' to D'. It can be readily shown that the LDPCCCs A' to D' also share a common code rate.
Remark.
Note that although QC-LDPC codes are adopted in the simulation, the new GPU-based LDPC decoder is able to decode other LDPC codes such as randomly constructed regular or irregular codes.
V-B Decoding Time Comparison
In order to optimize the speed and minimize the data transfer between the CPU (host) and the GPU (device), we generate and process the data, including the codewords and the AWGN noise, directly on the GPU. After hard decisions have been made on the received bits, the decoded bits are transferred to the CPU, which counts the number of error bits. Since the data transfer occurs only at the end of the iterative decoding process, the transfer time (overhead) is almost negligible compared with the time spent in the whole decoding process.
In the following, we fix the number of decoding iterations, and the simulation terminates after a preset number of block/frame errors have been collected. By recording the total number of blocks/frames decoded and the total time taken (in the case of the GPU-based decoders, the total time includes the GPU computation time, the time spent transferring data between the CPU and the GPU, etc.; as explained above, the GPU computation time dominates while the overhead is very small), we can compute the average time taken to decode one block/frame.
Table III: Simulation results of the serial CPU-based LDPC decoder and the GPU-based LDPC decoder.

Code | (s) | (ms) | (s) | (ms) | Speedup
A | 2832 | 6 | 2.12 | 4058 | 1270 | 313 | 148
B | 12768 | 37 | 2.9 | 11664 | 5350 | 458 | 158
C | 21664 | 74 | 3.4 | 20046 | 10950 | 546 | 161
D | 82624 | 371 | 4.5 | 70843 | 51580 | 728 | 162
V-B1 LDPC Decoders
The GPU-based decoder and the serial CPU-based decoder are tested with 30 iterations at an Eb/N0 of 3.2 dB. Table III shows the number of transmitted codewords and the simulation times for the different codes.
We first consider the average time for decoding one codeword with the serial CPU-based decoder. We observe that this average time increases from Code A to Code D due to the increasing number of edges in the code. Similarly, the average time for decoding one codeword with the GPU-based decoder increases from Code A to Code D.
Finally, we compare the simulation times of the serial CPU-based decoder and the GPU-based decoder by taking their ratio. The results in Table III indicate that the GPU-based decoder achieves speedups of 148 to 162 times compared with the serial CPU-based decoder.
V-B2 LDPCCC Decoders
We decode the LDPC convolutional codes A' to D' at a fixed Eb/N0. First, we show the average decoding times for Code A' and Code C' when different numbers of threads are used in the CPU-based decoders. The results are shown in Table IV. The serial CPU-based decoder corresponds to the case with a single thread. We observe that the decoding time is approximately inversely proportional to the number of threads used, up to 4 threads. However, the time does not improve much when the number of threads increases to 6 or 8. The reason is as follows. The CPU being used has 4 cores, which can execute up to 4 tasks fully in parallel. Hence, compared with using a single thread, there is an almost 4 times improvement when 4 threads are used. As the number of threads increases beyond 4, however, the threads can no longer all be executed at the same time by the 4 cores. Consequently, the further time improvement is small when more than 4 threads are used.
Table IV: Average decoding times of the CPU-based LDPCCC decoders with different numbers of threads.

Code | 1 thread | 2 threads | 4 threads | 6 threads | 8 threads
A'   | 39       | 20        | 11        | 10        | 9
C'   | 73       | 38        | 21        | 19        | 17
Next, we compare the decoding times of the LDPCCC decoders when GPU-based and CPU-based decoders are used to decode Codes A' to D'. For the CPU-based decoders, we consider the cases where a single thread and 8 threads are used, respectively. Table V shows the results. As explained above, limited by the number of cores (only 4) in the CPU, the CPU-based decoder improves the speed by only about 4 times even when the number of threads increases from 1 to 8. We also observe that, compared with the serial CPU-based decoder, the GPU-based LDPCCC decoder achieves 170 to 200 times speedup. Compared with the 8-thread CPU-based decoder, the GPU-based LDPCCC decoder still accomplishes 39 to 46 times speedup.
Table V: Decoding results of the GPU-based and CPU-based LDPCCC decoders.

Code | (s) | (ms) | (s) | (ms) | (s) | (ms) | Speedup
A' | 3136 | 0.73 | 0.23 | 2846 | 112 | 39 | 28 | 9 | 4.3 | 170 | 39
B' | 6272 | 1.95 | 0.31 | 5716 | 345 | 60 | 79 | 14 | 4.3 | 194 | 45
C' | 14400 | 5.4 | 0.38 | 13303 | 976 | 73 | 230 | 17 | 4.3 | 192 | 45
D' | 43680 | 21.0 | 0.48 | 37451 | 3590 | 96 | 834 | 22 | 4.4 | 200 | 46
VI Conclusion
In this paper, efficient decoders for LDPC codes and LDPC convolutional codes based on the GPU parallel architecture are implemented. By using efficient data structures and thread layouts, the thread divergence is minimized and the memory can be accessed in a coalesced way. All the decoders are flexible and scalable. First, they can decode different codes simply by changing the parameters; the programs need very little modification. Second, they should be able to run on the latest or even future generations of GPUs, which possess more hardware resources. For example, if there are more cores or more memory in the GPU, we can readily decode more codewords at the same time than in this paper. These are advantages of the GPU parallel architecture over other parallel solutions such as FPGAs or VLSI. We will report our results in the future when we have the opportunity to run our proposed mechanism on other GPU families.
Compared with the traditional serial CPU-based decoders, the results show that the proposed GPU-based decoders achieve speedups of 148 to 200 times; the actual figure depends on the particular code being simulated. When compared with the 8-thread CPU-based decoder, the GPU-based decoder still accomplishes 39 to 46 times speedup. Thus the simulation time can be reduced from months to weeks or days when a GPU-based decoder is used. In summary, our results show that the proposed GPU-based LDPC/LDPCCC decoders have obvious advantages in decoding time compared with CPU-based decoders.
References
 [1] R. G. Gallager, Low-Density Parity-Check Codes. The MIT Press, Sep. 1963.
 [2] D. MacKay, “Good error-correcting codes based on very sparse matrices,” Information Theory, IEEE Transactions on, vol. 45, no. 2, pp. 399–431, 1999.
 [3] I. Djordjevic, M. Cvijetic, L. Xu, and T. Wang, “Using LDPC-coded modulation and coherent detection for ultra-high-speed optical transmission,” Lightwave Technology, Journal of, vol. 25, no. 11, pp. 3619–3625, 2007.
 [4] Y. Miyata, K. Sugihara, W. Matsumoto, K. Onohara, T. Sugihara, K. Kubo, H. Yoshida, and T. Mizuochi, “A triple-concatenated FEC using soft-decision decoding for 100 Gb/s optical transmission,” in Optical Fiber Communication (OFC), collocated National Fiber Optic Engineers Conference, 2010 Conference on (OFC/NFOEC), 2010, pp. 1–3.
 [5] Y. Chen and D. Hocevar, “A FPGA and ASIC implementation of rate 1/2, 8088-b irregular low density parity check decoder,” in Global Telecommunications Conference, 2003. GLOBECOM ’03. IEEE, vol. 1, 2003, pp. 113–117.
 [6] I. B. Djordjevic, M. Arabaci, and L. L. Minkov, “Next generation FEC for high-capacity communication in optical transport networks,” Journal of Lightwave Technology, vol. 27, no. 16, pp. 3518–3530, 2009.
 [7] B. Levine, R. R. Taylor, and H. Schmit, “Implementation of near Shannon limit error-correcting codes using reconfigurable hardware,” 2000.
 [8] A. Pusane, A. Feltstrom, A. Sridharan, M. Lentmaier, K. Zigangirov, and D. Costello, “Implementation aspects of LDPC convolutional codes,” Communications, IEEE Transactions on, vol. 56, no. 7, pp. 1060–1069, 2008.
 [9] S. Bates, Z. Chen, L. Gunthorpe, A. Pusane, K. Zigangirov, and D. Costello, “A low-cost serial decoder architecture for low-density parity-check convolutional codes,” Circuits and Systems I: Regular Papers, IEEE Transactions on, vol. 55, no. 7, pp. 1967–1976, Aug. 2008.
 [10] Z. Chen, S. Bates, and W. Krzymien, “High throughput parallel decoder design for LDPC convolutional codes,” in Circuits and Systems for Communications, 2008. ICCSC 2008. 4th IEEE International Conference on, May 2008, pp. 35–39.
 [11] R. Swamy, S. Bates, and T. Brandon, “Architectures for ASIC implementations of low-density parity-check convolutional encoders and decoders,” in Proc. IEEE Int. Symp. Circuits and Systems ISCAS 2005, 2005, pp. 4513–4516.
 [12] G. Falcao, V. Silva, and L. Sousa, “How GPUs can outperform ASICs for fast LDPC decoding,” in Proceedings of the 23rd International Conference on Supercomputing. Yorktown Heights, NY, USA: ACM, 2009, pp. 390–399.
 [13] H. Ji, J. Cho, and W. Sung, “Massively parallel implementation of cyclic LDPC codes on a general purpose graphics processing unit,” in Signal Processing Systems, 2009. SiPS 2009. IEEE Workshop on. IEEE, 2009, pp. 285–290.
 [14] ——, “Memory access optimized implementation of cyclic and quasi-cyclic LDPC codes on a GPGPU,” Journal of Signal Processing Systems, pp. 1–11, 2010.
 [15] G. Falcao, L. Sousa, and V. Silva, “Massive parallel LDPC decoding on GPU,” in Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. Salt Lake City, UT, USA: ACM, 2008, pp. 83–90.
 [16] ——, “Massively LDPC decoding on multicore architectures,” IEEE Transactions on Parallel and Distributed Systems, vol. 22, no. 2, pp. 309–322, Feb. 2011.
 [17] A. J. Felstrom and K. Zigangirov, “Time-varying periodic convolutional codes with low-density parity-check matrix,” Information Theory, IEEE Transactions on, vol. 45, no. 6, pp. 2181–2191, 1999.
 [18] M. Tavares, E. Matus, S. Kunze, and G. Fettweis, “A dual-core programmable decoder for LDPC convolutional codes,” in Circuits and Systems, 2008. ISCAS 2008. IEEE International Symposium on, May 2008, pp. 532–535.
 [19] E. Matus, M. Tavares, M. Bimberg, and G. Fettweis, “Towards a GBit/s programmable decoder for LDPC convolutional codes,” in Circuits and Systems, 2007. ISCAS 2007. IEEE International Symposium on, May 2007, pp. 1657–1660.
 [20] R. Tanner, “A recursive approach to low complexity codes,” Information Theory, IEEE Transactions on, vol. 27, no. 5, pp. 533–547, 1981.
 [21] M. Fossorier, “Quasi-cyclic low-density parity-check codes from circulant permutation matrices,” Information Theory, IEEE Transactions on, vol. 50, no. 8, pp. 1788–1793, 2004.
 [22] W. Tam, F. Lau, and C. Tse, “A class of QC-LDPC codes with low encoding complexity and good error performance,” Communications Letters, IEEE, vol. 14, no. 2, pp. 169–171, 2010.
 [23] T. Richardson, M. Shokrollahi, and R. Urbanke, “Design of capacity-approaching irregular low-density parity-check codes,” Information Theory, IEEE Transactions on, vol. 47, no. 2, pp. 619–637, 2001.
 [24] T. Richardson and R. Urbanke, “Efficient encoding of low-density parity-check codes,” Information Theory, IEEE Transactions on, vol. 47, no. 2, pp. 638–656, 2001.
 [25] X. Hu, E. Eleftheriou, D. Arnold, and A. Dholakia, “Efficient implementations of the sum-product algorithm for decoding LDPC codes,” in Global Telecommunications Conference, 2001. GLOBECOM ’01. IEEE, vol. 2, 2001, pp. 1036–1036E.
 [26] J. Chen, A. Dholakia, E. Eleftheriou, M. Fossorier, and X. Hu, “Reduced-complexity decoding of LDPC codes,” Communications, IEEE Transactions on, vol. 53, no. 8, pp. 1288–1299, 2005.
 [27] R. Tanner, D. Sridhara, A. Sridharan, T. Fuja, and D. Costello, “LDPC block and convolutional codes based on circulant matrices,” Information Theory, IEEE Transactions on, vol. 50, no. 12, pp. 2966–2984, 2004.
 [28] A. E. Pusane, R. Smarandache, P. O. Vontobel, and D. J. Costello, “On deriving good LDPC convolutional codes from QC-LDPC block codes,” in Proc. IEEE Int. Symp. Information Theory ISIT 2007, 2007, pp. 1221–1225.
 [29] M. Lentmaier, D. G. M. Mitchell, G. P. Fettweis, and D. J. Costello, “Asymptotically regular LDPC codes with linear distance growth and thresholds close to capacity,” in Proc. Information Theory and Applications Workshop (ITA), 2010, pp. 1–8.
 [30] NVIDIA Corporation, “Compute Unified Device Architecture Programming Guide, Version 4.0,” Tech. Rep., 2011.
 [31] NVIDIA Corporation, “Whitepaper: NVIDIA’s next generation CUDA compute architecture: Fermi,” 2009.
 [32] F. Kschischang, B. Frey, and H. Loeliger, “Factor graphs and the sum-product algorithm,” Information Theory, IEEE Transactions on, vol. 47, no. 2, pp. 498–519, 2001.
 [33] R. Chandra, Parallel Programming in OpenMP. Morgan Kaufmann, 2001.