Modeling and Energy Optimization of LDPC Decoder Circuits with Timing Violations

Abstract

This paper proposes a “quasi-synchronous” design approach for signal processing circuits, in which timing violations are permitted, but without the need for a hardware compensation mechanism. The case of a low-density parity-check (LDPC) decoder is studied, and a method for accurately modeling the effect of timing violations at a high level of abstraction is presented. The error-correction performance of code ensembles is then evaluated using density evolution while taking into account the effect of timing faults. Following this, several quasi-synchronous LDPC decoder circuits based on the offset min-sum algorithm are optimized, providing a 23%–40% reduction in energy consumption or energy-delay product, while achieving the same performance and occupying the same area as conventional synchronous circuits.

1 Introduction

The time required for a signal to propagate through a CMOS circuit varies depending on several factors. Some of the variation results from physical limitations: the delay depends on the initial and final charge state of the circuit. Other variations are due to the difficulty (or impossibility) of controlling the fabrication process and the operating conditions of the circuit [2]. As process technologies approach atomic scales, the magnitude of these variations is increasing, and reducing the supply voltage to save energy increases the variations even further [3].

The variation in propagation delay is a source of energy inefficiency for synchronous circuits since the clock period is determined by the worst delay. One approach to alleviate this problem is to allow timing violations to occur. While this would normally be catastrophic, some applications (in signal processing or in error-correcting decoding, for example) can tolerate a degradation in the operation of the circuit, either because an approximation to the ideal output suffices, or because the algorithm intrinsically rejects noise. This paper proposes an approach to the design of systems that are tolerant to timing violations. In particular we apply this approach to the design of energy-optimized low-density parity-check (LDPC) decoder circuits based on a state-of-the-art soft-input algorithm and architecture.

Other approaches have been previously proposed to build synchronous systems that can tolerate some timing violations. In better than worst-case (BTWC) [4] or voltage over-scaled (VOS) circuits, a mechanism is added to the circuit to compensate or recover from timing faults. One such method introduces special latches that can detect timing violations, and can trigger a restart of the computation when needed [5, 6]. Since the circuit’s latency is increased significantly when a timing violation occurs, this approach is only suitable for tolerating small fault rates (e.g., ) and for applications where the circuit can be easily restarted, such as microprocessors that support speculative execution.

In most signal processing tasks, it is acceptable for the output to be non-deterministic, which creates more possibilities for dealing with timing violations. A seminal contribution in this area was the algorithmic noise tolerance (ANT) approach [7, 8], which is to allow timing violations to occur in the main processing block, while adding a separate reliable processing block with reduced precision that is used to bound the error of the main block, and provide algorithmic performance guarantees. The downside of the ANT approach is that it relies on the assumption that timing violations will first occur in the most significant bits. If that is not the case, the precision of the circuit can degrade to the precision of the auxiliary block, limiting the scheme’s usefulness. For many circuits, including some adder circuits [9], this assumption does not hold. Furthermore, the addition of the reduced precision block and of a comparison circuit increases the area requirement.

We propose a design methodology for digital circuits with a relaxed synchronicity requirement that does not rely on any hardware compensation mechanism. Instead, we provide performance guarantees by re-analyzing the algorithm while taking into account the effect of timing violations. We say that such systems are quasi-synchronous. LDPC decoding algorithms are good candidates for a quasi-synchronous implementation because their throughput and energy consumption are limiting factors in many applications, and like other signal processing algorithms, their performance is assessed in terms of expected values. Furthermore, since the algorithm is iterative, there is a possibility to optimize each iteration separately, and we show that this allows for additional energy savings.

The topic of unreliable LDPC decoders has been discussed in a number of contributions. Varshney studied the Gallager-A and the Sum-Product decoding algorithms when the computations and the message exchanges are “noisy”, and showed that the density evolution analysis still applies [10]. The Gallager-B algorithm was also analyzed under various scenarios [11, 12, 13]. A model for an unreliable quantized Min-Sum decoder was proposed in [14], which provided numerical evaluation of the density evolution equations as well as simulations of a finite-length decoder. Faulty finite-alphabet decoders were studied in [15], where it was proposed to model the decoder messages using conditional distributions that depend on the ideal messages. The quantized Min-Sum decoder was also analyzed in [16] for the case where faults are the result of storing decoder messages in an unreliable memory. The specific case of faults caused by delay variations in synchronous circuits is considered in [17], where a deviation model is proposed for binary-output circuits in which a deviation occurs probabilistically when the output of a circuit changes from one clock cycle to the next, but cannot occur if the output does not change. While none of these contributions explicitly consider the relationship between the reliability of the decoder’s implementation and the energy it consumes, there have been some recent developments in the analysis of the energy consumption of reliable decoders. Lower bounds for the scaling of the energy consumption of error-correction decoders in terms of the code length are derived in [18], and tighter lower bounds that apply to LDPC decoders are derived in [19]. The power required by regular LDPC decoders is also examined in [20], as part of the study of the total power required for transmitting and decoding the codewords.

In this paper, we present a modeling approach that provides an accurate representation of the deviations introduced in the output of an LDPC decoder processing circuit in the presence of occasional timing violations, while simultaneously measuring its energy consumption. We introduce a weak symmetry property for this model, and show that when it is satisfied, the model can be used as part of a density evolution analysis to evaluate the channel threshold and iterative performance of the decoder affected by timing faults. We also present experimental evidence showing that weak symmetry is satisfied for the decoder circuits that we consider. Finally, we show that under mild assumptions, the problem of minimizing the energy consumption of a quasi-synchronous decoder can be simplified to the energy minimization of a small test circuit, and present an approximate optimization method similar to Gear-Shift Decoding [21] that finds sequences of quasi-synchronous decoders that minimize decoding energy subject to performance constraints. We note that subsequent to [1], an energy optimization method for faulty LDPC decoders was presented in [22]. Both methods look for a sequence of operating conditions that minimize decoding energy, but the method in [22] requires that the operating conditions be ordered based on the message error probability that can be achieved, which is not possible in general without knowing the message distribution.

The remainder of the paper is organized as follows. Section 2 reviews LDPC codes and describes the circuit architecture of the decoder that is used to measure timing faults. Section 3 presents the deviation model that represents the effect of timing faults on the algorithm. Section 4 then discusses the use of density evolution and of the deviation model to predict the performance of a decoder affected by timing faults. Finally, Section 5 presents the energy optimization strategy and results, and Section 6 concludes the paper. Additional details on the CAD framework used for circuit measurements can be found in Appendix 7, and Appendix 8 provides some details concerning the simulation of the test circuits.

2 LDPC Decoding Algorithm and Architecture

2.1 Code and Channel

We consider a communication scenario where a sequence of information bits is encoded using a binary LDPC code of length . The LDPC code described by an binary parity-check matrix consists of all length- row vectors satisfying the equation . Equivalently, the code can be described by a bipartite Tanner graph with variable nodes (VN) and check nodes (CN) having an edge between the -th variable node and the -th check node if and only if . We assume that the LDPC code is regular, which means that in the code’s Tanner graph each variable node has a fixed degree and each check node has a fixed degree .

Let us assume that the transmission takes place over the binary-input additive white Gaussian noise (BIAWGN) channel. A codeword is transmitted through the channel, which outputs the received vector , where is a vector of independent and identically distributed (i.i.d.) zero-mean normal random variables with variance . We use and to refer to the input and output of the channel at time . The BIAWGN channel has the property of being output symmetric, meaning that , and memoryless, meaning that . Throughout the paper, denotes a probability density function. The BIAWGN channel can also be described multiplicatively as , where is a vector of i.i.d. normal random variables with mean and variance .

Let the belief output of the channel at time be given by

(1)

with . Note that if then is a log-likelihood ratio. Assuming that was transmitted, then has a normal distribution with mean and variance . Writing , we see that is Gaussian with mean and variance , that is, the distribution of is described by a single parameter . We call this distribution a one-dimensional (1-D) normal distribution. The distribution of can also be specified using other equivalent parameters, such as the probability of error , given by

(2)

where is the complementary error function.
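
As a numerical illustration of how the channel noise level maps to an error probability, the following minimal sketch assumes a unit-amplitude antipodal channel input and a hard decision on the sign of the channel output. The function names, the sign convention of the belief, and the default scaling factor are assumptions made for illustration and are not taken from the paper.

```python
import math


def channel_error_probability(sigma: float) -> float:
    """Hard-decision error probability on a BIAWGN channel.

    Assumes unit-amplitude antipodal inputs (+1/-1) and Gaussian noise
    with standard deviation `sigma`, so the result is Q(1/sigma),
    written here with the complementary error function.
    """
    return 0.5 * math.erfc(1.0 / (math.sqrt(2.0) * sigma))


def belief(y: float, sigma: float, alpha: float = 1.0) -> float:
    """Scaled channel belief 2*alpha*y/sigma^2.

    With alpha = 1 this is the log-likelihood ratio
    log p(y | x=+1) / p(y | x=-1); the sign convention is an assumption.
    """
    return 2.0 * alpha * y / sigma ** 2


if __name__ == "__main__":
    for sigma in (0.6, 0.8, 1.0):
        print(sigma, channel_error_probability(sigma))
```

Under these assumptions the error probability depends only on the noise standard deviation, which is why a single parameter is enough to describe the belief distribution.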

2.2 Decoding Algorithm

The well-known Offset Min-Sum (OMS) algorithm is a simplified version of the Sum-Product algorithm that can usually achieve similar error-correction performance. It has been widely used in implementations of LDPC decoders [23, 24, 25]. To make our decoder implementation more realistic and show the flexibility of our design framework, we present an algorithm and architecture that support a row-layered message-passing schedule. Architectures optimized for this schedule have proven effective for achieving efficient implementations of LDPC decoders [24, 25, 26]. Using a row-layered schedule also allows the decoder to be pipelined to increase the circuit’s utilization. In a row-layered LDPC decoder, the rows of the parity-check matrix are partitioned into sets called layers. To simplify the description of the decoding algorithm, we assume that all the columns in a given layer contain exactly one non-zero element. This implies that . Note that codes with at most one non-zero element per column and per layer can also be supported by the same architecture, simply requiring a modification of the way algorithm variables are indexed.

Let us define a set containing the indices of the rows of that are part of layer , . We denote by a message sent from VN to CN during iteration , and by a message sent from CN to VN . It is also useful to refer to the CN neighbor of a VN that is part of layer . Because of the restriction mentioned above, there is exactly one such CN, and we denote its index by . Finally, we denote the channel information corresponding to the -th codeword bit by , since it also corresponds to the first message sent by a variable node to all its neighboring check nodes.

The Offset Min-Sum algorithm used with a row-layered message-passing schedule is described in Algorithm 2.2. In the algorithm, denotes the set of indices corresponding to VNs that are neighbors of a check node , and represents the current sum of incoming messages at a VN . The function returns the smallest and second smallest values in the set , is the offset parameter, and

Algorithm 2.2: OMS with a row-layered schedule. Steps: initialization; decoding (for each iteration and each layer: VN-to-CN messages, CN-to-VN messages, VN update); VN decision.
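
The check node update and the per-layer processing described above can be sketched as follows. This is a minimal illustration of a generic offset min-sum rule with a row-layered schedule, not a transcription of the paper's listing: the names `oms_check_node` and `process_layer`, the dictionary-based storage, the treatment of a zero message as positive, and the omission of saturation are all assumptions.

```python
def sign(value):
    """Sign of a message; zero is treated as positive (assumption)."""
    return -1 if value < 0 else 1


def oms_check_node(inputs, beta):
    """Offset min-sum check node: one extrinsic output per input edge.

    Each output takes the product of the signs of the other inputs and
    the smallest magnitude among the other inputs, reduced by the
    offset `beta` and clipped at zero.
    """
    magnitudes = sorted(abs(v) for v in inputs)
    min1, min2 = magnitudes[0], magnitudes[1]
    total_sign = 1
    for v in inputs:
        total_sign *= sign(v)
    outputs = []
    for v in inputs:
        # Exclude the edge's own contribution: use the second minimum
        # when this edge holds the first minimum.
        mag = min2 if abs(v) == min1 else min1
        mag = max(mag - beta, 0)
        outputs.append(total_sign * sign(v) * mag)
    return outputs


def process_layer(totals, c2v, layer, neighbors, beta):
    """Process all check nodes of one layer, updating state in place.

    totals:    dict VN index -> current belief total
    c2v:       dict (CN index, VN index) -> last CN-to-VN message
    layer:     iterable of CN indices belonging to the layer
    neighbors: dict CN index -> list of neighboring VN indices
    """
    for cn in layer:
        vns = neighbors[cn]
        # VN-to-CN messages: remove the previous contribution of this CN.
        v2c = [totals[vn] - c2v[(cn, vn)] for vn in vns]
        new_c2v = oms_check_node(v2c, beta)
        # VN update: add the new CN-to-VN message to the belief total.
        for vn, m_in, m_out in zip(vns, v2c, new_c2v):
            c2v[(cn, vn)] = m_out
            totals[vn] = m_in + m_out
```

A hard decision on each codeword bit would then be taken from the sign of the corresponding belief total once the last layer of an iteration has been processed.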

2.3 Architecture

The Tanner graph of the code can also be used to represent the computations that must be performed by the decoder. At each decoding iteration, one message is sent from variable to check nodes on every edge of the graph, and again from check to variable nodes. We call a variable node processor (VNP) a circuit block that is responsible for generating messages sent by a variable node, and similarly a check node processor (CNP) a circuit block generating messages sent by a check node.

In a row-layered architecture in which the column weight of layer subsets is at most 1, there is at most one message to be sent and received for each variable node in a given layer. Therefore, VNPs are responsible for sending and receiving one message per clock cycle. CNPs, on the other hand, receive and send messages per clock cycle. At any given time, every VNP and CNP is mapped respectively to a VN and a CN in the Tanner graph. The routing of messages from VNPs to CNPs and back can be posed as two equivalent problems. One can fix the mapping of VNs to VNPs and of CNs to CNPs, and find a permutation of the message sequence that matches VNP outputs to CNP inputs, and another permutation that matches CNP outputs to VNP inputs. Alternatively, if VNPs process only one message at a time, one can fix the connections between VNPs and CNPs, and choose the assignment of VN to VNPs to achieve correct message routing. We choose the latter approach because it allows studying the computation circuit without being concerned with the routing of messages.

The number of CNPs instantiated in the decoder can be adjusted based on throughput requirements from to (the number of rows in a layer). As the number of CNPs is varied, the number of VNPs will vary from to . An architecture diagram showing one VNP and one CNP is shown in Fig. 1. In reality, a CNP is connected to additional VNPs, which are not shown. The memories storing the belief totals and the intrinsic beliefs are also not shown. The part of the VNP responsible for sending a message to the CNP is called the VNP front, and the part responsible for processing a message received from a CNP is called the VNP back. The VNP front and back do not have to be simultaneously mapped to the same VN. This makes it easy to vary the number of pipeline stages in the VNPs and CNPs. Fig. 1 shows the circuit with two pipeline stages.

Messages exchanged in the decoder are fixed-point numbers. The position of the binary point does not have an impact on the algorithm, and therefore the messages sent by VNs in the first iteration can be defined as rounding the result of (1) to the nearest integer, while choosing a suitable . The number of bits in the quantization, the scaling factor , and the OMS offset parameter are chosen based on a density evolution analysis of the algorithm (described in Section 4). We quantize decoder messages to 6 bits, which yields a decoder with approximately the same channel threshold as a floating-point decoder under a standard fault-free implementation.
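
The rounding of scaled beliefs to fixed-point messages can be sketched as below. The symmetric saturation range and the round-half-away-from-zero rule are assumptions chosen so that the quantizer treats positive and negative beliefs identically; the function name `quantize` is illustrative.

```python
import math


def quantize(value: float, bits: int = 6) -> int:
    """Round a scaled belief to the nearest integer and saturate.

    Assumes a symmetric range [-(2**(bits-1) - 1), 2**(bits-1) - 1],
    i.e. [-31, 31] for 6 bits, and rounds halves away from zero so that
    quantize(-v) == -quantize(v).
    """
    q_max = 2 ** (bits - 1) - 1
    rounded = int(math.copysign(math.floor(abs(value) + 0.5), value))
    return max(-q_max, min(q_max, rounded))


if __name__ == "__main__":
    for v in (0.4, 2.5, -2.5, 100.0):
        print(v, quantize(v))
```

Keeping the quantizer odd-symmetric is a design choice that matters later in the paper, since the symmetry arguments used for density evolution rely on sign flips of the inputs producing sign flips of the outputs.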

Figure 1: Block diagram of the layered Offset Min-Sum decoder architecture.

In order to analyze a circuit that is representative of state-of-the-art architectures, we use an optimized architecture for finding the first two minima in each CNP. Our architecture is inspired by the “tree structure” approach presented in [27], but requires fewer comparators. Each pair of CNP inputs is first sorted using the Sort block shown in Fig. 2a. These sorted pairs are then merged recursively using a tree of Merge blocks, shown in Fig. 2b. If the number of CNP inputs is odd, the input that cannot be paired is fed directly into a special merge block with 3 inputs, which can be obtained from the 4-input Merge block by removing the input and the bottom multiplexer.
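
The sort-then-merge procedure can be sketched functionally as follows. This is only a behavioral model under stated assumptions: it does not reproduce the comparator and multiplexer structure of the actual blocks, the function names are illustrative, and the unpaired input of an odd-sized set is merged at the root of the tree, which may differ from where the 3-input block sits in the circuit.

```python
def sort_pair(a, b):
    """Sort block: return the two inputs as (smaller, larger)."""
    return (a, b) if a <= b else (b, a)


def merge(p, q):
    """4-input merge: two smallest values of two sorted pairs."""
    if p[0] <= q[0]:
        return (p[0], min(p[1], q[0]))
    return (q[0], min(q[1], p[0]))


def merge3(p, c):
    """3-input merge: two smallest values of a sorted pair and a single value."""
    if c < p[0]:
        return (c, p[0])
    return (p[0], min(p[1], c))


def two_minima(values):
    """Return (smallest, second smallest) of a list with at least 2 entries."""
    if len(values) < 2:
        raise ValueError("at least two inputs are required")
    pairs = [sort_pair(values[i], values[i + 1])
             for i in range(0, len(values) - 1, 2)]
    leftover = values[-1] if len(values) % 2 else None
    # Merge sorted pairs level by level, as in a binary tree.
    while len(pairs) > 1:
        merged = [merge(pairs[i], pairs[i + 1])
                  for i in range(0, len(pairs) - 1, 2)]
        if len(pairs) % 2:
            merged.append(pairs[-1])
        pairs = merged
    result = pairs[0]
    if leftover is not None:
        result = merge3(result, leftover)
    return result
```

In a check node processor the inputs would be message magnitudes, and the two returned values are exactly what the offset min-sum rule needs to form every extrinsic output.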

Note that it is possible that changes to the architecture could increase or decrease the robustness of the decoder (see e.g. [28]), but this is outside the scope of this paper.

Figure 2: Logic blocks used in the unit: (a) Sort block, (b) Merge block.

3 Deviation Model

3.1 Quasi-Synchronous Systems

We consider a synchronous system that permits timing violations without hardware compensation, resulting in what we call a quasi-synchronous system. Optimizing the energy consumption of these systems requires an accurate model of the impact of timing violations, and of the energy consumption. We propose to achieve this by characterizing a test circuit that is representative of the complete circuit implementation.

The term deviation refers to the effect of circuit faults on the result of a computation, and the deviation model is the bridge between the circuit characterization and the analysis of the algorithm. We reserve the term error for describing the algorithm, in the present case to refer to the incorrect detection of a transmitted symbol. A timing violation occurs in the circuit when the propagation delay between the input and output extends beyond a clock period. Modeling the deviations introduced by timing violations is challenging because they not only depend on the current input to the circuit, but also on the state of the circuit before the new input was applied. In general, timing violations also depend on other dynamic factors and on process variations.

In this paper, we focus on the case where the output of the circuit is entirely determined by the current and previous inputs of the circuit, and by the nominal operating condition of the circuit. We denote by the set of possible operating conditions, represented by vectors of parameters, and by a particular operating condition. For example, an operating condition might specify the supply voltage and clock period used in the circuit. We assume that all the parameters specified by are deterministic.

3.2 Computation Model

As described in Section 2.3, we consider a decoder composed of a processing unit (shown in Fig. 1) that computes all messages to and from one check node in each clock cycle. Since the circuit is synchronous, we can represent the computations in terms of a discrete-time system. Let be the input at clock cycle . When timing violations are allowed to occur, the corresponding circuit output can be expressed as , where represents the state of the circuit at the beginning of cycle , and is some deterministic function. By definition, the state depends on the previous input , but not on .
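
To make the dependence on the circuit state concrete, the following toy model shows an output that is a deterministic function of the current input and of the previous input. Everything in it is hypothetical: the stand-in computation `ideal_output`, the delay rule based on the number of toggled input bits, and the assumption that a violated output simply keeps the previous result. Real circuits can produce partially updated outputs, which this sketch does not capture.

```python
def ideal_output(x: int) -> int:
    """Hypothetical stand-in for the fault-free computation."""
    return (3 * x + 1) & 0xFF


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def circuit_output(x_k, x_prev, clock_period,
                   base_delay=1.0, delay_per_toggle=0.5):
    """Toy quasi-synchronous circuit (all delay parameters are assumed).

    The propagation delay grows with the number of input bits that
    toggle between x_prev and x_k. If it exceeds the clock period, the
    output register still holds the result for x_prev.
    """
    delay = base_delay + delay_per_toggle * hamming_distance(x_k, x_prev)
    if delay > clock_period:
        return ideal_output(x_prev)  # timing violation: stale output
    return ideal_output(x_k)


if __name__ == "__main__":
    inputs = [0b0000, 0b0001, 0b1110, 0b1111]
    prev = 0
    for x in inputs:
        print(x, circuit_output(x, prev, clock_period=2.0), ideal_output(x))
        prev = x
```

Even in this toy setting, the same input can yield different outputs depending on what preceded it, which is why the deviation statistics must be measured on realistic input sequences.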

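Figure 3: Computation tree combined with the deviation model: (a) deviation model applied to the computation tree, (b) dependencies between the random variables involved.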

The computations required to perform one decoding iteration can be represented using a computation tree, which models the generation of a VN-to-CN message in terms of messages sent in the previous iteration. There are check nodes in the tree. Each of these check nodes receives messages from neighboring variable nodes, and generates a message sent to the one VN whose message was excluded from the computation. This VN then generates an extrinsic message based on the channel prior and on the messages received from neighboring check nodes. An example of a computation tree is shown within the dashed box in Fig. 3a. In this paper, we assume that messages are updated using a flooding schedule, or in other words, that all messages at the left of the tree are identically distributed.

Evaluating the computation tree requires uses of a processor, but the number of processors implemented in the decoder, and the way they are mapped to nodes of the Tanner graph can affect the modeling of deviations. Since a processor performs a parallel check node computation, and are associated with distinct VNs. Let denote the set of VNs associated with the processor during cycle . Let us first assume that a processor is always mapped to distinct VNs in consecutive clock cycles, i.e. for any processor,

(3)

Then, and are independent, and if they belong to the same decoding iteration, they are also identically distributed. As a result, and are also independent. At the output of the processor, and are not independent since they both depend on . However, if (3) holds, messages received at a particular VN are guaranteed to have been generated in non-consecutive clock cycles, and it is therefore reasonable to consider only the marginal distribution of .

We now briefly describe some decoder architectures in which (3) holds. One possible architecture consists in implementing a single processor. Neglecting the circuit’s latency, the processor would therefore be reused for cycles to compute each layer, and for cycles to perform one iteration. If we assume that the parity-check matrix contains at most one non-zero element per column and per layer, (3) clearly holds for cycles that belong to the same layer. Furthermore, since the order in which CNs are processed can be chosen arbitrarily, it can easily be chosen to ensure that (3) also holds when starting a new layer. Another architecture of particular interest is one that instantiates processors to achieve maximum parallelism. In this case, each processor is used once in each layer. This type of architecture is often used with quasi-cyclic parity-check matrices composed of cyclically shifted identity sub-matrices. In this case, (3) holds as long as the shift indices used in two consecutive layers are different.
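
Whether a given processor schedule meets condition (3) can be checked mechanically. The sketch below reads the condition as requiring that the sets of VNs handled by a processor in two consecutive clock cycles share no element; this reading, the function name, and the list-of-sets representation of the schedule are assumptions.

```python
def consecutive_sets_disjoint(schedule):
    """Check that VN sets used in consecutive cycles do not overlap.

    schedule: list of collections of VN indices, one per clock cycle,
    in the order in which a single processor handles them.
    """
    return all(set(current).isdisjoint(following)
               for current, following in zip(schedule, schedule[1:]))


if __name__ == "__main__":
    print(consecutive_sets_disjoint([[0, 1, 2], [3, 4, 5], [0, 1, 2]]))
    print(consecutive_sets_disjoint([[0, 1, 2], [2, 3, 4]]))
```

For a single-processor decoder, such a check could be run over candidate check node orderings to pick one that also satisfies the condition across layer boundaries.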

For convenience, we choose to represent the effect of deviations at the output of the computation tree. The properties that have been established for the processors continue to apply when these processors are used to evaluate a computation tree, since this corresponds to the normal operation of the decoder that was just discussed. Therefore, reusing the previous notation, we can write , where is a VN-to-CN message that will be sent in the next iteration, and now represents the combined state of the processors that are used to evaluate the tree. Defining as a vector containing all the VN-to-CN messages that form the input of the computation tree, becomes the concatenation of with .

3.3 Deviation Model

We have seen above that a decoder message can be expressed as a function of the messages received by the neighboring check nodes in the previous iteration and of the state of the processing circuit. To separate the deviations from the ideal operation of the decoder, it is helpful to decompose a decoding iteration into the ideal computation, followed by a transmission through a deviation channel. This model is shown in Fig. 3a, where is the message that would be sent from variable node to check node during iteration if no deviations had occurred during iteration . For the first messages sent in the decoder at , the computation circuits are not used and therefore no deviation can occur, and we simply have . Since we neglect correlations in successive circuit outputs, the deviation channel is memoryless.

Unlike typical channel models where the noise is independent from other variables in the system, the deviation is a function of the current circuit input and of the current state . Fig. 3b illustrates the dependencies between the random variables involved in the computation tree. Assuming a fixed distribution for and , can be determined by marginalizing the joint distribution of , , , , and , which in practice can be done using a Monte-Carlo simulation of a test circuit that implements the computation tree (such a simulation is discussed in more detail in Appendix 8). However, directly measuring makes it difficult to extend the model to handle different input distributions. Instead, we propose a deviation model that retains conditional dependencies on the ideal message and on the transmitted bit . We note that a similar model that did not include was also used in [15]. The deviation model is thus expressed as the conditional distribution . Using the ideal message as a model parameter rather than has the advantage of reducing the complexity of the model, but also of limiting the error when the model is used with other input distributions, as discussed in the following subsection.
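
A conditional distribution of this kind can be estimated from simulation samples by simple counting. The sketch below assumes that each Monte-Carlo trial yields a triple consisting of the transmitted bit, the ideal message, and the message actually produced by the faulty circuit; the function name and the data layout are illustrative assumptions, not the paper's tooling.

```python
from collections import Counter, defaultdict


def estimate_deviation_model(samples):
    """Estimate P(faulty message | transmitted bit, ideal message).

    samples: iterable of (x, ideal, faulty) triples from a simulation.
    Returns a dict mapping (x, ideal) to a dict {faulty: probability}.
    """
    counts = defaultdict(Counter)
    for x, ideal, faulty in samples:
        counts[(x, ideal)][faulty] += 1
    model = {}
    for key, counter in counts.items():
        total = sum(counter.values())
        model[key] = {msg: n / total for msg, n in counter.items()}
    return model


if __name__ == "__main__":
    trials = [(1, 3, 3), (1, 3, 3), (1, 3, 2), (1, 3, 3), (-1, -3, -3)]
    print(estimate_deviation_model(trials))
```

Conditioning on the ideal message keeps the table small: its size is bounded by the number of message values squared times two, regardless of how many inputs the computation tree has.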

3.4 Generalized Deviation Model

The deviation model introduced in Section 3.3 requires and to have fixed distributions, but these distributions change at every decoding iteration. Furthermore, because the message distribution depends on the transmitted codeword, the deviation model also depends on the transmitted codeword. Under the assumption that and are i.i.d., is a function of , and it is thus sufficient to consider the evolution of .

Let us first assume that the transmitted codeword is fixed. In this case, the message distribution depends on the channel noise, on the iteration index , and on the operating conditions of the circuit. Since the messages are affected by deviations for , only is known a priori. An obvious way to measure deviations for all decoding iterations is to determine using the known , and to repeat the process for each subsequent decoding iteration. However, the resulting deviation model is of limited interest, since it depends on the specific message distributions in each iteration.

To generate a model that is independent of the iterative progress of the decoder, we first approximate as a 1-D normal distribution with error rate parameter chosen such that

(4)

This also provides us with an implicit parametrization of in terms of . Note that while does correspond exactly to a 1-D normal distribution, this is not necessarily the case after the first iteration. However, the approximation is only used to characterize deviations, and exact distributions can still be used to characterize messages. Therefore, can be determined exactly, and the impact of the approximation remains small as long as is small. In fact, combining a density evolution based on exact distributions with a deviation model generated using 1-D normal distributions leads to very accurate bit error rate predictions in practice [29].
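
The matching step can be illustrated as follows: compute the error probability of an arbitrary message distribution, then find the channel noise level whose belief distribution has the same error probability. The sketch assumes that a positive message is correct when +1 was transmitted, that a zero message counts as half an error, and that the channel uses unit-amplitude antipodal inputs. These conventions and the function names are assumptions, since the exact form of (4) is not reproduced here.

```python
from statistics import NormalDist


def message_error_probability(pmf):
    """Error probability of a message PMF given that +1 was transmitted.

    pmf: dict mapping integer message values to probabilities.
    Zero-valued messages are counted as half an error (assumption).
    """
    negative = sum(p for m, p in pmf.items() if m < 0)
    return negative + 0.5 * pmf.get(0, 0.0)


def equivalent_sigma(p_e: float) -> float:
    """Noise standard deviation whose hard-decision error rate is p_e.

    Inverts p_e = Q(1/sigma) for a unit-amplitude antipodal input.
    """
    if not 0.0 < p_e < 0.5:
        raise ValueError("p_e must lie strictly between 0 and 0.5")
    return 1.0 / NormalDist().inv_cdf(1.0 - p_e)


if __name__ == "__main__":
    pmf = {-1: 0.1, 0: 0.2, 2: 0.7}
    p_e = message_error_probability(pmf)
    print(p_e, equivalent_sigma(p_e))
```

The returned noise level identifies the member of the single-parameter family that would be used to drive the test circuit when measuring deviations for that message distribution.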

To evaluate , we must also consider that deviations depend on the operating condition . Once is specified, is uniquely determined by the synthesized circuit, and in order to retain the ability to represent arbitrary circuits, we make no assumption on the distribution and simply characterize it as an arbitrary conditional probability mass function (PMF). We therefore obtain a model consisting of a family of non-parametric conditional PMFs denoted as

(5)

where are the family parameters. However, we generally omit the superscript to simplify the notation. In practice, (5) is constructed by performing several Monte-Carlo simulations of the circuit implementation of the computation tree in Fig. 3 for various values and for all operating conditions . Interpolation is then used to obtain a continuous model in . While measuring deviations, we also record the switching activity in the circuit, which is then used to construct an energy model that depends on and , denoted as (where stands for “cost”).

To use the model, we first use (4) to determine the error rate parameter corresponding to the arbitrary message distribution at the beginning of the iteration, and we then retrieve the appropriate conditional PMF based on and on the operating condition . This conditional PMF then informs us of the statistics of deviations that occur at the end of the iteration, that is on messages sent in iteration .
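
The lookup described above can be sketched for a single operating condition and a single pair of transmitted bit and ideal message. The text does not state which interpolation rule is used, so linear interpolation between the two nearest measured error rates is an assumption here, as are the function name and the data layout.

```python
import bisect


def interpolate_pmf(grid, pmfs, p_e):
    """Interpolate a conditional PMF at error rate p_e.

    grid: sorted list of error rates at which deviations were measured
    pmfs: list of dicts {message: probability}, one per grid point
    Values of p_e outside the grid are clamped to the nearest endpoint.
    """
    if p_e <= grid[0]:
        return dict(pmfs[0])
    if p_e >= grid[-1]:
        return dict(pmfs[-1])
    hi = bisect.bisect_right(grid, p_e)
    lo = hi - 1
    weight = (p_e - grid[lo]) / (grid[hi] - grid[lo])
    messages = set(pmfs[lo]) | set(pmfs[hi])
    # A convex combination of two PMFs is itself a valid PMF.
    return {m: (1.0 - weight) * pmfs[lo].get(m, 0.0)
               + weight * pmfs[hi].get(m, 0.0)
            for m in messages}


if __name__ == "__main__":
    grid = [0.01, 0.03]
    pmfs = [{3: 1.0}, {3: 0.5, 2: 0.5}]
    print(interpolate_pmf(grid, pmfs, 0.02))
```

A complete model would hold one such grid per operating condition and per conditioning pair, and the same lookup could be applied to the recorded energy values.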

As mentioned above, since depends on the transmitted codeword, this is also the case of and of the deviation distributions. We show in Section 4 that the codeword dependence is entirely contained within the deviation model and does not affect the analysis of the decoding performance, as long as the decoding algorithm and deviation model satisfy certain properties. Nonetheless, we would like to obtain a deviation model that does not depend on the transmitted codeword. This can be done when the objective is to predict the average performance of the decoder, rather than the performance for a particular codeword, since it is then sufficient to model the average behavior of the decoder. For the case where all codewords have an equal probability of being transmitted, we propose to perform the Monte-Carlo deviation measurements by randomly sampling transmitted codewords. This approach is supported by the experimental results presented in [29], which show that a deviation model constructed in this way can indeed accurately predict the average decoding performance.

4 Performance Analysis

4.1 Standard Analysis Methods for LDPC Decoders

Density evolution (DE) is the most common tool used for predicting the error-correction performance of an LDPC decoder. The analysis relies on the assumption that messages passed in the Tanner graph are mutually independent, which holds as the code length goes to infinity [30]. Given the channel output probability distribution and the probability distribution of variable node to check node messages at the start of an iteration, DE computes the updated distribution of variable node to check node messages at the end of the decoding iteration. This computation can be performed iteratively to determine the message distribution after any number of decoding iterations. The validity of the analysis rests on two properties of the LDPC decoder. The first property is the conditional independence of errors, which states that the error-correction performance of the decoder is independent from the particular codeword that was transmitted. The second property states that the error-correction performance of a particular LDPC code concentrates around the performance measured on a cycle-free graph, as the code length goes to infinity.
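
For integer-valued messages, one density evolution iteration can be sketched by propagating probability mass functions through the node operations. The code below is a minimal fault-free illustration for a regular code with a flooding schedule and an offset min-sum rule; saturating after every addition at the variable node, treating zero as positive, and all function names are assumptions. It does not include the deviation model.

```python
from collections import defaultdict


def combine(pmf_a, pmf_b, op):
    """PMF of op(a, b) for independent a ~ pmf_a and b ~ pmf_b."""
    out = defaultdict(float)
    for a, pa in pmf_a.items():
        for b, pb in pmf_b.items():
            out[op(a, b)] += pa * pb
    return dict(out)


def min_sum(a, b):
    """Pairwise min-sum: product of signs, minimum of magnitudes."""
    negative = (a < 0) != (b < 0)
    mag = min(abs(a), abs(b))
    return -mag if negative else mag


def apply_offset(v, beta):
    """Reduce the magnitude by beta, clipping at zero."""
    mag = max(abs(v) - beta, 0)
    return mag if v >= 0 else -mag


def de_iteration(pmf_v2c, pmf_ch, dv, dc, beta, q_max):
    """One DE iteration: returns the next VN-to-CN message PMF."""
    # Check node: combine the dc - 1 incoming messages, then offset.
    cn = pmf_v2c
    for _ in range(dc - 2):
        cn = combine(cn, pmf_v2c, min_sum)
    cn_out = defaultdict(float)
    for v, p in cn.items():
        cn_out[apply_offset(v, beta)] += p
    cn_out = dict(cn_out)
    # Variable node: channel prior plus dv - 1 CN messages, saturated.
    saturating_add = lambda a, b: max(-q_max, min(q_max, a + b))
    vn = pmf_ch
    for _ in range(dv - 1):
        vn = combine(vn, cn_out, saturating_add)
    return vn


def error_probability(pmf):
    """P(message < 0) + P(message = 0)/2, assuming +1 was transmitted."""
    return (sum(p for m, p in pmf.items() if m < 0)
            + 0.5 * pmf.get(0, 0.0))


if __name__ == "__main__":
    channel = {3: 0.98, -3: 0.02}
    messages = channel
    for it in range(5):
        messages = de_iteration(messages, channel, dv=3, dc=6,
                                beta=1, q_max=31)
        print(it + 1, error_probability(messages))
```

Iterating this map and checking whether the error probability converges toward zero is how a channel threshold would be located numerically for a given quantization and offset.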

Both properties were shown to hold in the context of reliable implementations [30]. It was also shown that the conditional independence of errors always holds when the channel is output symmetric and the decoder has a symmetry property. We can define a sufficient symmetry property of the decoder in terms of a message-update function that represents one complete iteration of the (ideal) decoding algorithm. Given a vector of all the messages sent from variable nodes to check nodes at the start of iteration and the channel information associated with variable node , returns the next ideal message to be sent from a variable node to a check node : .

Definition 1.

A message-update function is said to be symmetric with respect to a code if

for any , any , and any codeword .

In other words, a decoder’s message-update function is symmetric if multiplying all the VN-to-CN belief messages sent at iteration and the belief priors by a valid codeword is equivalent to multiplying the next messages sent at iteration by that same codeword. Note that the symmetry condition in Definition 1 is implied by the check node and variable node symmetry conditions in [30, Def. 1].
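
The symmetry condition can be checked numerically on a single check node. The sketch below uses a generic offset min-sum check node rule (an assumption, written out so that the example is self-contained) and verifies that multiplying the inputs by any sign pattern whose product is +1, that is, a pattern satisfying the parity check, multiplies each output by the corresponding sign.

```python
import itertools
import math


def oms_check_node(inputs, beta):
    """Extrinsic offset min-sum outputs, one per input edge."""
    outputs = []
    for i in range(len(inputs)):
        others = inputs[:i] + inputs[i + 1:]
        sign = 1
        for v in others:
            if v < 0:
                sign = -sign
        mag = max(min(abs(v) for v in others) - beta, 0)
        outputs.append(sign * mag)
    return outputs


def is_symmetric_on(messages, beta):
    """Test the symmetry condition for one input vector.

    Enumerates every +1/-1 pattern x with product +1 and checks that
    f(x * messages) equals x * f(messages) elementwise.
    """
    reference = oms_check_node(messages, beta)
    for x in itertools.product((1, -1), repeat=len(messages)):
        if math.prod(x) != 1:
            continue  # not a valid local codeword
        flipped = oms_check_node(
            [xi * m for xi, m in zip(x, messages)], beta)
        if flipped != [xi * r for xi, r in zip(x, reference)]:
            return False
    return True


if __name__ == "__main__":
    print(is_symmetric_on([3, -5, 2, 7], beta=1))
```

The variable node update is a sum of its inputs, so multiplying every input by the same sign multiplies the output by that sign, and the two checks together mirror the node-level conditions from which the iteration-level definition follows.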

4.2 Applicability of Density Evolution

In order to use density evolution to predict the performance of long finite-length codes, the decoder must satisfy the two properties stated in Section 4.1, namely the conditional independence of errors and the convergence to the cycle-free case. We first present some properties of the decoding algorithm and of the deviation model that are sufficient to ensure the conditional independence of errors.

Using the multiplicative description of the BIAWGN channel, the vector received by the decoder is given by when a codeword is transmitted, or by when the all-one codeword is transmitted. In a reliable decoder, messages are completely determined by the received vector, but in a faulty decoder, there is additional randomness that results from the deviations. Therefore, we represent messages in terms of conditional probability distributions given . Since we are concerned with a fixed-point circuit implementation of the decoder, we can assume that messages are integers from the set , where is the largest message magnitude that can be represented.

Definition 2.

We say that a message distribution is symmetric if

If a message has a symmetric distribution, its error probability as defined in (4) is the same whether or is received. Similarly to the results presented in [15], we can show that the symmetry of message distributions is preserved when the message-update function is symmetric.

Lemma 1.

If is a symmetric message-update function and if and have symmetric distributions for all , the next ideal messages also have symmetric distributions.

Proof.

We can express the distribution of the next ideal message from VN to CN as

(6)

where .

Assuming that the elements of the VN-to-CN message vector are independent and that each has a symmetric distribution,

and since the channel output also has a symmetric distribution,

Therefore, we can rewrite (6) as

(7)

Finally, letting and , (7) becomes

where . Since is symmetric, we can also express as

and therefore,

indicating that the next ideal messages have symmetric distributions. ∎

To establish the conditional independence of errors under the proposed deviation model, we first define some properties of the deviation.

Definition 3.

We say that the deviation model is symmetric if

Definition 4.

We say that the deviation model is weakly symmetric (WS) if

Note that if the model satisfies the symmetry condition, it also satisfies the weak symmetry condition, since . We then have the following lemma.

Lemma 2.

If a decoder having a symmetric message-update function and taking its inputs from an output-symmetric communication channel is affected by weakly symmetric deviations, its message error probability at any iteration is independent of the transmitted codeword.

Proof.

Similarly to the approach used in [31, Lemma 4.90] and [10], we want to show that the probability that messages are in error is the same whether or is received. This is the case if the faulty messages have a symmetric distribution for all and all .

Since the communication channel is output symmetric and since no deviations can occur before the first iteration, messages have a symmetric distribution. We proceed by induction to establish the symmetry of the messages for . We start by assuming that

(8)

also holds for .

Using Definition 4 and (8), we can write the faulty message distribution as

where the third equality is obtained using the substitution . We conclude that the faulty messages have a symmetric distribution. Finally, since the decoder’s message-update function is symmetric, Lemma 1 confirms the induction hypothesis in (8). ∎

The last remaining step in establishing whether density evolution can be used with a decoder affected by WS deviations is to determine whether the error-correction performance of a code concentrates around the cycle-free case. The property has been shown to hold in [10] (Theorems 2, 3 and 4) for an LDPC decoder affected by “wire noise” and “computation noise”. The wire noise model is similar to our deviation model, in the sense that the messages are passed through an additive noise channel, and that the noise applied to one message is independent of the noise applied to other messages. The proof presented in [10] only relies on the fact that the wire noise applied to a given message can only affect messages that are included in the directed neighborhood of the edge where it is applied, where the graph direction refers to the direction of message propagation. This clearly also holds in the case of our deviation model, and therefore the proof is the same.

Since the message error probability is independent of the transmitted codeword, and furthermore concentrates around the cycle-free case, density evolution can be used to determine the error-correction performance of a decoder perturbed by our deviation model, as long as the deviations are weakly symmetric. It is important to note that, as discussed in Section 3.4, the deviation model itself still depends on the transmitted codeword. However, given a weakly symmetric deviation model, density evolution can be used to determine the decoder’s performance. The hope is that in practice only one or a few deviation models are required to represent the deviations for all codewords, and indeed one model is sufficient to obtain accurate predictions in the experiment of [29].

4.3 Deviation Examples

As described in Section 3.4, we collect deviation measurements from the test circuits by inputting test vectors representing random codewords, and distributed according to several values. We then generate estimates of the conditional PMFs in (5). It is interesting to visualize the distributions using an aggregate measure such as the probability of observing a non-zero deviation

(9)

These conditional probabilities are shown for a circuit in Fig. 4. When , positive belief values indicate a correct decision, whereas when , negative belief values indicate a correct decision. We can see that in this example, deviations are more likely when the belief is incorrect than when it is correct, and therefore a symmetric deviation model is not consistent with these measurements. On the other hand, there is a sign symmetry between the “correct” parts of the curves, and between the “incorrect” parts, that is , and for this reason a weakly symmetric model is consistent with the measurements. Note that the slight jaggedness observed for incorrect belief values of large magnitude in the curves is due to the fact that these values occur only rarely. For the largest incorrect values, only about 100 deviation events are observed for each point, despite the large number of Monte-Carlo trials.

Figure 4: Non-zero deviation probability given and at two values, measured on a circuit operated at and . decoding iteration trials were performed for each value. The total number of non-zero deviation events observed is 4,115,229 at , and 10,071,810 at .

Figure 5 shows a similar plot for a circuit. In this case, , and a symmetric deviation model could be appropriate. Of course, since it is more general, a WS model is also appropriate.

Figure 5: Non-zero deviation probability given and at two values, measured on a circuit operated at and . decoding iteration trials were performed for each value. The total number of non-zero deviation events observed is 2,524,601 at , and 1,020,867 at .

Under the assumption that deviations are weakly symmetric, we have

Therefore, we can combine the and data generated by the Monte-Carlo simulation to improve the accuracy of the estimated PMFs. To determine the validity of the WS assumption in a systematic way, we can generate an error metric by applying the WS assumption to one half of the simulation data to predict the other half. For all the circuits and operating conditions considered, the mean squared error of the predicted PMFs remains below .
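The pooling step described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes that weak symmetry means the probability of deviation d given message value m and transmitted symbol +1 equals the probability of deviation -d given message value -m and transmitted symbol -1. The function names and the layout of the count table are hypothetical.

```python
from collections import defaultdict

def pool_weakly_symmetric(counts):
    """Fold the x = -1 observations onto the x = +1 ones.

    counts maps (x, m, d) to a number of observations, where x is the
    transmitted symbol (+1 or -1), m the message value and d the deviation.
    Assumption of this sketch: under weak symmetry, an observation
    (x=-1, m, d) is statistically equivalent to (x=+1, -m, -d).
    """
    pooled = defaultdict(int)
    for (x, m, d), n in counts.items():
        if x == -1:
            m, d = -m, -d
        pooled[(m, d)] += n
    return dict(pooled)

def conditional_pmf(pooled):
    """Normalize pooled counts into P(d | m), conditioned on x = +1."""
    totals = defaultdict(int)
    for (m, _), n in pooled.items():
        totals[m] += n
    return {(m, d): n / totals[m] for (m, d), n in pooled.items()}
```

Pooling doubles the number of samples behind each estimated probability, which matters most for the rarely observed large incorrect message values.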

Let and be respectively the smallest and largest values for which the deviations have been characterized. We can generate a conditional PMF for any by interpolating from the nearest PMFs that have been measured. We choose to make sure that the first iteration’s deviation is within the characterized range. Because messages in the decoder are saturated once they reach the largest magnitude that can be represented, and since messages are represented in the CNP in sign-and-magnitude format, the circuit’s switching activity decreases when the message error probability becomes very small. Since timing faults cannot occur when the circuit does not switch, we can expect deviations to be equally or less likely at values below . Therefore, to define the deviation model for , we make the pessimistic assumption that the deviation PMF remains the same as for .
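The interpolation and the pessimistic extension below the characterized range can be sketched as follows. The sketch assumes that the characterization variable is the message error probability and that interpolation is linear between the two nearest measured points; neither the interpolation rule nor the function name comes from the paper. For brevity the PMF is keyed only by the deviation value, whereas the measured PMFs are also conditioned on the message value and the transmitted symbol.

```python
from bisect import bisect_right

def interpolate_pmf(p_e, measured):
    """Return a deviation PMF for message error probability p_e.

    measured is a list of (error_probability, pmf) pairs sorted by
    increasing error probability; each pmf maps a deviation value to its
    probability. Linear interpolation is an assumption of this sketch.
    """
    points = [p for p, _ in measured]
    if p_e <= points[0]:
        # Below the characterized range: keep the PMF of the smallest point.
        return dict(measured[0][1])
    if p_e > points[-1]:
        raise ValueError("error probability above the characterized range")
    hi = bisect_right(points, p_e)
    if hi == len(points):
        return dict(measured[-1][1])
    lo = hi - 1
    w = (p_e - points[lo]) / (points[hi] - points[lo])
    pmf_lo, pmf_hi = measured[lo][1], measured[hi][1]
    support = set(pmf_lo) | set(pmf_hi)
    # A convex combination of two PMFs is itself a valid PMF.
    return {d: (1 - w) * pmf_lo.get(d, 0.0) + w * pmf_hi.get(d, 0.0)
            for d in support}
```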

4.4 DE and Energy Curves

We evaluate the progress of the decoder affected by timing violations using quantized density evolution [32]. For the Offset Min-Sum algorithm, a DE iteration can be split into the following steps: 1-a) evaluating the distribution of the CN minimum, 1-b) evaluating the distribution of the CN output, after subtracting the offset, 2) evaluating the distribution of the ideal VN-to-CN message, and 3) evaluating the distribution of the faulty VN-to-CN messages. Step 1-a is given in [16], while the others are straightforward. In the context of DE, we write the message distribution as , and the channel output distribution as . We write a DE iteration as .
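The four steps above can be sketched for a regular ensemble as follows. This is an illustrative sketch under several assumptions that are not taken from the paper: distributions are conditioned on the all-one codeword, the CN minimum is obtained by pairwise combination rather than by the method of [16], saturation is applied once at the VN output, deviations are additive, and a zero-valued message counts as half an error. All names are hypothetical.

```python
from collections import defaultdict

def combine(pa, pb, op):
    """Distribution of op(a, b) for independent a ~ pa and b ~ pb."""
    out = defaultdict(float)
    for a, qa in pa.items():
        for b, qb in pb.items():
            out[op(a, b)] += qa * qb
    return dict(out)

def saturate(v, q):
    return max(-q, min(q, v))

def min_sum(a, b):
    """Product of signs times minimum magnitude (associative)."""
    sign = -1 if (a < 0) != (b < 0) else 1
    return sign * min(abs(a), abs(b))

def de_iteration(msg, chan, dv, dc, offset, q, deviation):
    """One quantized DE iteration for a regular (dv, dc) ensemble.

    msg and chan map integer values in [-q, q] to probabilities;
    deviation(m) returns the PMF of the deviation given ideal message m.
    """
    # Step 1-a: distribution of the CN minimum over dc - 1 inputs.
    cn = dict(msg)
    for _ in range(dc - 2):
        cn = combine(cn, msg, min_sum)
    # Step 1-b: subtract the offset from the magnitude, flooring at zero.
    cn_off = defaultdict(float)
    for m, p in cn.items():
        mag = max(abs(m) - offset, 0)
        cn_off[mag if m >= 0 else -mag] += p
    # Step 2: ideal VN-to-CN message = channel value + dv - 1 CN outputs.
    total = dict(chan)
    for _ in range(dv - 1):
        total = combine(total, cn_off, lambda a, b: a + b)
    ideal = defaultdict(float)
    for m, p in total.items():
        ideal[saturate(m, q)] += p
    # Step 3: faulty message = ideal message + deviation, saturated.
    faulty = defaultdict(float)
    for m, p in ideal.items():
        for d, pd in deviation(m).items():
            faulty[saturate(m + d, q)] += p * pd
    return dict(faulty)

def error_probability(msg):
    """Projection onto the error-probability space (ties count half)."""
    return sum(p for m, p in msg.items() if m < 0) + 0.5 * msg.get(0, 0.0)
```

Passing `deviation=lambda m: {0: 1.0}` recovers a reliable decoder, which is a convenient sanity check before plugging in measured deviation PMFs.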

As mentioned in Section 3.4, the energy consumption is modeled in terms of the message error probability and of the operating condition, and denoted . As for the deviation model, we use interpolation to define for , and assume that for . To display and on the same plot, we project onto the message error probability space.

Figure 6: Examples of projected DE curves (solid lines) and energy curves (dashed lines) for rate ensembles with , and .
Figure 7: Examples of projected DE curves (solid lines) and energy curves (dashed lines) for the and ensembles (rate ), with .

Several regular code ensembles were evaluated, with rates and . Fig. 6 shows examples of projected DE curves and energy curves for rate- code ensembles with and various operating conditions. The energy is measured as described in Appendix 7 and corresponds to one use of the test circuit (shown in Fig. 8). The nominal operating condition is , and therefore these curves correspond to a reliable implementation. With a reliable implementation, these ensembles have a channel threshold of for the ensemble, for , and for . We use for all the curves shown in Fig. 6 to allow comparing the ensembles. As can be expected, a larger variable node degree results in faster convergence towards zero error rate, and it is natural to ask whether this property might provide greater fault tolerance and ultimately better energy efficiency. This is discussed in Section 5.4.

Fig. 7 is a similar plot for the and ensembles. The channel threshold of both ensembles is approximately . For these curves, the nominal operating condition is and . As we can see, the energy consumption per iteration of the decoder is roughly double that of the decoder. We note that in the case of the ensemble, the reliable decoder stops making progress at an error probability of approximately . This floor is the result of the message saturation limit chosen for the circuit.

5 Energy Optimization

5.1 Design Parameters

As in a standard LDPC code-decoder design, the first parameter to be optimized is the choice of code ensemble. In this paper we restrict the discussion to regular codes, and therefore we need only to choose a degree pair , where is the design rate of the code. For a fixed , we can observe that both the energy consumption and the circuit area of the decoding circuit grow rapidly with , and therefore it is only necessary to consider a few of the lowest values.

Besides the choice of ensemble, we are interested in finding the optimal choice of operating parameters for the quasi-synchronous circuit. We consider here the supply voltage () and the clock period (). Generally speaking, the supply voltage affects the energy consumption, while the clock period affects the decoding time, or latency. The energy and latency are also affected by the choice of code ensemble, since the number of operations to be performed depends on the node degrees. The operating parameters of a decoder are denoted as a vector .

The decoding of LDPC codes proceeds in an iterative fashion, and it is therefore possible to adjust the operating parameters on an iteration-by-iteration basis. In practice, this could be implemented in various ways, for example by using a pipelined sequence of decoder circuits, where each decoder is responsible for only a portion of the decoding iterations. It is also possible to rapidly vary the clock frequency of a given circuit by using a digital clock divider circuit [33]. We denote by the sequence of parameters used at each iteration throughout the decoding, and we use to denote a specific sequence in which the parameter vector is used for the first iterations, followed by for the next iterations, and so on.
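The run-length notation for parameter sequences can be illustrated with a short sketch. The representation of a parameter vector as a (supply voltage, clock period) tuple follows the description above, but the function name and data layout are hypothetical.

```python
from itertools import chain, repeat

def expand_schedule(runs):
    """Expand [(params, n_iterations), ...] into one entry per iteration."""
    return list(chain.from_iterable(repeat(params, n) for params, n in runs))

# Two iterations at (1.0 V, 3.0 ns) followed by three at (0.8 V, 3.0 ns).
schedule = expand_schedule([((1.0, 3.0), 2), ((0.8, 3.0), 3)])
```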

5.2 Objective

The performance of the LDPC code and of its decoder can be described by specifying a vector , where is the output error rate of the communication channel, the residual error rate of VN-to-CN messages when the decoder terminates, and the expected decoding latency.

The decoder’s performance and energy consumption are controlled by . The energy minimization problem can be stated as follows. Given a performance constraint , we wish to find the value of that minimizes , subject to , , . As in the standard DE method, we propose to use the code’s computation tree as a proxy for the entire decoder, and furthermore to use the energy consumption of the test circuit described in Appendix 8 as the optimization objective. To be able to replace the energy minimization of the complete decoder with the energy minimization of the test circuit, we make the following assumptions:

  1. The ordering of the energy consumption is the same for the test circuit and for the complete decoder, that is, for any and , implies , where and are respectively the energy consumption of the test circuit and of the complete decoder when using parameter .

  2. The average message error rate in the test circuit and in the complete decoder is the same for all decoding iterations.

  3. The latency of the complete decoder is proportional to the latency of the test circuit, that is, if is the latency measured using the test circuit with parameter , the latency of the complete decoder is given by , where does not depend on .

Assumption 1 is reasonable because the test circuit is very similar to a computation unit used in the complete decoder. The difference between the two is that the test circuit only instantiates one full VNP, the remaining VNPs being reduced to only their “front” part (as seen in Fig. 8), whereas the complete decoder has full VNPs for every CNP. Assumption 2 is the standard DE assumption, which is reasonable for sufficiently long codes. Finally, regarding Assumption 3, the clock period may need to be longer in the complete decoder, because the increased area could result in longer interconnections between circuit blocks. Even if this is the case, the interconnect length only depends on the area of the complete decoder, which is not affected by the parameters we are optimizing, and hence does not depend on .

Clearly, if Assumption 1 holds and the performance of the test circuit is the same as the performance of the complete decoder, then the solution of the energy minimization is also the same. The performance is composed of the three components . The channel error rate does not depend on the decoder and is clearly the same in both cases. Because of Assumption 2, the complete decoder can achieve the same residual error rate as the test circuit when is the same. The latencies measured on the test circuit and on the complete decoder are not necessarily the same, but if Assumption 3 holds, and if we assume that the constant is known, then we can find the solution to the energy minimization of the complete decoder subject to constraints by instead minimizing the energy of the test circuit with constraints .

We also consider another interesting optimization problem. It is well known that for a fixed degree of parallelism, energy consumption is proportional to processing speed (represented here by ), which is observed both in the physical energy limit stemming from Heisenberg’s uncertainty principle [34], as well as in practical CMOS circuits [35]. In situations where both throughput normalized to area and low energy consumption are desired, optimizing the product of energy and latency or energy-delay product (EDP) for a fixed circuit area can be a better objective. In that case the performance constraint is stated in terms of , and the optimization problem becomes the following: given a performance constraint , minimize subject to , , and a fixed circuit area.

Standard Quasi-synchronous
Code Nom. Norm. Latency Energy EDP Best energy Best EDP
family area [] [] [] [] []
(3,6) (-23%) (-23%)
(-34%) (-35%)
(4,8) (-24%) (-25%)
(3,30) (-31%) (-35%)
(-28%) (-27%)
(-36%) (-38%)
(-34%) (-34%)
(4,40) (-38%) (-40%)
Cell area divided by the minimal area of the smallest decoder having the same code rate. Approx. threshold.
Table 1: Energy and EDP optimization results.

5.3 Dynamic Programming

To solve the iteration-by-iteration energy and EDP minimization problems stated above, we adapt the “Gear-Shift” dynamic programming approach proposed in [21]. The original method relies on the fact that the message distribution has a 1-D characterization, which is chosen to be the error probability. By quantizing the error probability space, a trellis graph can be constructed in which each node is associated with a pair . Quantized quantities are marked with tildes. A particular choice of corresponds to a path through the graph, and the optimization is transformed into finding the least expensive path that starts from the initial state and reaches any state such that and the latency constraint is satisfied, if there is one. Note that to ensure that the solutions remain achievable in the original continuous space, the message error rates are quantized by rounding up. To maintain a good resolution at low error rates, we use a logarithmic quantization, with 1000 points per decade.
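The logarithmic quantization with rounding up can be sketched as follows. The grid of 1000 points per decade is taken from the text; the indexing convention (grid point k has value 10^(k/1000)) and the function names are assumptions of this sketch.

```python
import math

POINTS_PER_DECADE = 1000

def quantize_up(p_e, points_per_decade=POINTS_PER_DECADE):
    """Index k of the smallest grid point 10**(k / points_per_decade) >= p_e."""
    if not 0.0 < p_e <= 1.0:
        raise ValueError("error probability must be in (0, 1]")
    # The small tolerance keeps values that already lie on the grid from
    # being bumped to the next point by floating-point noise.
    return math.ceil(math.log10(p_e) * points_per_decade - 1e-9)

def grid_value(k, points_per_decade=POINTS_PER_DECADE):
    """Error probability represented by grid index k."""
    return 10.0 ** (k / points_per_decade)
```

Rounding up means a quantized state never claims a lower error rate than the decoder actually reaches, so a path that is feasible on the grid stays feasible in the continuous space.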

In the case of a faulty decoder, we want to evaluate the decoder’s progress by tracking a complete message distribution using DE, rather than simply tracking the message error probability. In this case, the Gear-Shift method can be used as an approximate solver by projecting the message distribution onto the error probability space. We refer to this method as DE-Gear-Shift. Any path through the graph is evaluated by performing DE on the entire path using exact distributions, but different paths are compared in the projection space. As a result, the solutions that are found are not guaranteed to be optimal, but they are guaranteed to accurately represent the progress of the decoder.

In the DE-Gear-Shift method, a path is a sequence of states . As in the original Gear-Shift method, any sequence of decoder parameters corresponds to a path. We denote the projection of a state onto the error probability space as . To each path , we associate an energy cost and a latency cost . A path ending at a state can be extended with one additional decoding iteration using parameter by evaluating one DE iteration to obtain . Performing this additional iteration adds an energy cost and a latency cost to the path’s cost. When optimizing EDP, we define the overall cost of a path as . When optimizing energy under a latency constraint, we define the path cost as a two-dimensional vector .

We use the following rules to discard paths that are suboptimal in the error probability space. Rule 1: Paths for which the message error rate is not monotonically decreasing are discarded. Rule 2: A path with cost is said to dominate another path with cost if all the following conditions hold: 1) an ordering exists between and , 2) , 3) , where denotes the last state reached by path . The search for the least expensive path is performed breadth-first. After each traversal of the graph, any path that is dominated by another is discarded.
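Rule 2 can be sketched for the two-dimensional (energy, latency) cost as follows. The exact inequalities are not reproduced here, so this sketch assumes that a path dominates another when its energy, its latency and its projected error probability are all less than or equal to those of the other path; the tuple layout and function names are hypothetical.

```python
def dominates(a, b):
    """True if path a dominates path b.

    Each path is summarized as (energy, latency, error_probability).
    Assumption of this sketch: component-wise <= on all three entries,
    with identical summaries not dominating each other.
    """
    return a != b and all(x <= y for x, y in zip(a, b))

def prune(paths):
    """Discard every path that is dominated by another path."""
    return [b for b in paths if not any(dominates(a, b) for a in paths)]
```

In a breadth-first search, `prune` would be applied to the set of surviving paths after each traversal of the graph, so that only mutually non-dominated paths are extended in the next iteration.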

When the path cost is one-dimensional, the optimization requires evaluating DE iterations, where is the number of operating points being considered and the number of quantization levels used for . This can be seen from the fact that with a 1-D cost, Rule 2 implies that at most one path can reach a given state