# Energy Consumption of VLSI Decoders

Submitted for publication on November 7th, 2013, revised November 28th, 2014. Presented in part at the 2013 Canadian Workshop on Information Theory, June 18–21, Toronto, Canada, 2013.

Christopher Blake and Frank R. Kschischang
Department of Electrical & Computer Engineering
University of Toronto
christopher.blake@mail.utoronto.ca frank@comm.utoronto.ca
###### Abstract

Thompson’s model of VLSI computation relates the energy of a computation to the product of the circuit area and the number of clock cycles needed to carry out the computation. It is shown that for any family of circuits implemented according to this model, using any algorithm that performs decoding of a codeword passed through a binary erasure channel, as the block length approaches infinity either (a) the probability of block error is asymptotically lower bounded by or (b) the energy of the computation scales at least as , and so the energy of successful decoding, per decoded bit, must scale at least as . This implies that the average energy per decoded bit must approach infinity for any sequence of codes that approaches capacity. The analysis techniques used are then extended to the case of serial computation, showing that if a circuit is restricted to serial computation, then as block length approaches infinity, either the block error probability is lower bounded by or the energy scales at least as fast as . In a very general case that allows for the number of output pins to vary with block length, it is shown that the average energy per decoded bit must scale as . A simple example is provided of a class of circuits performing low-density parity-check decoding whose energy complexity scales as .

## I Introduction

Since the work of Shannon [1], information theory has sought to determine how much information can be communicated over a noisy channel; modern coding theory has sought ways to achieve this capacity using error control codes. A standard channel model is the additive white Gaussian noise (AWGN) channel, for which the maximum rate of information that can be reliably communicated (known as the capacity) is known and depends on the transmission power. This model does not, however, consider the energy it takes to encode and decode; a full understanding of energy use in a communication system requires taking into account these encoding and decoding energies, along with the transmission energy. To date, very little work has sought a fundamental understanding of the energy required to execute an error control coding algorithm.

Early work in relating the area of circuits and number of clock cycles in circuits that perform decoding algorithms was presented by El Gamal et al. in [2]. More recent work in trying to find fundamental limits on the energy of decoding can be attributed to Grover et al. in [3]. In this work, the authors consider decoding schemes that are implemented using a VLSI model attributed to Thompson [4] (which we will describe later), and they are able to show that for any code, using any decoding algorithm, as the required block error probability approaches , the sum of the transmission, encoding, and decoding energy, per bit, must approach infinity at a rate of , where is the block error probability of the code. This result is useful to the extent that it suggests how to judge the energy complexity of low error probability codes; however, it does not suggest how the energy complexity of decoding scales as capacity is approached.

The result of this paper uses a similar approach to Grover et al., but we generalize the computation model to both parallel and serial computation, and show how the energy of low block error rate decoders must scale with block length $n$. We believe that this approach can guide the development of codes and decoding circuits that are optimal from an energy standpoint.

In this paper, in Section II we will describe the VLSI model that will be used to derive our bounds on decoding energy. Our results apply to a decoder for a standard binary erasure channel, which will be formally defined in Section III. In Section IV we will describe some terminology and some key lemmas used in our paper. The main contribution of this paper will be given in Section V where we describe a scaling rule for codes with long block length that have asymptotic error probability less than . The approach used in this section is extended in Section VI to find a scaling rule for serial computation. Then, in Section VII we extend the approaches of the previous sections to derive a non-trivial super-linear lower bound on circuit energy complexity for a series of decoders in which the number of output pins can vary with increasing block length. These results are applied to find a scaling rule for the energy of capacity approaching decoders as a function of fraction of capacity in Section VIII. We then give a simple example in Section IX showing how an LDPC decoder can be implemented with at most energy, providing an upper bound to complement our fundamental lower bound.

Notation: To aid our discussion of scaling laws, we use standard Big-Oh and Big-Omega asymptotic notation, which is well discussed in [5]. We say that a function $f(n) = O(g(n))$ if and only if there is some $c > 0$ and some $n_0$ such that for all $n \ge n_0$, $f(n) \le c\,g(n)$. Similarly, we say that $f(n) = \Omega(g(n))$ if and only if $g(n) = O(f(n))$. Intuitively, this means that the function $f(n)$ grows at least as fast (in an order sense) as $g(n)$ and hence we use it with lower bounds. In the following, a sequence of values $x_1, x_2, \ldots, x_k$ is denoted $x_1^k$. Random variables are denoted with upper case letters; values in their sample spaces are denoted with lower-case letters.

## II VLSI Model

### II-A Description of Model

The VLSI model that we will use is based on the work of Thompson [6], and was used by El Gamal et al. in [2] and Grover et al. in [3]. The model consists of a basic set of axioms that describe what is meant by a VLSI circuit and a computation. The model then relates two parameters of the circuit and computation, namely the number of clock cycles used for the computation and the area of the circuit, to the energy used in the computation. Thompson used this model to compute fundamental bounds on the energy required to compute the discrete Fourier transform, as well as other standard computational problems, including sorting. The results in this paper apply to any circuit implemented in a way that is consistent with these axioms, listed as follows:

• A circuit consists of two types of components: wires and nodes. In such a circuit, wires carry binary messages and nodes can perform simple computations (e.g., and, xor), all laid out on a grid of squares. Wires carry information from one node to another node. Wires are assumed to be bi-directional (at least for the purpose of lower bounds). In each clock cycle, each node sends one bit to each of the nodes it is connected to over the wire. We in general allow a node to perform a random function on its inputs. In a deterministic function, the output of the function is determined fully by its inputs. By a random function we mean that the outputs of a particular node, conditioned on the inputs being some element from the set of possible inputs, is a distribution where is a random variable representing the possible outputs of the node and a random variable representing the inputs of a node. In the particular case of a node with input wires (and thus output wires because of our bidirectional assumption on the wires) the random variables and can take on values from .

• A VLSI circuit is a set of computational nodes connected to each other using finite-width bi-directional wires. At each clock cycle, nodes communicate with the other nodes to which they are connected. Computation terminates at a pre-determined number, $\tau$, of clock cycles.

• Wires run along edges and cross at grid points. There is only one wire per edge. Each grid point contains a logic element, a wire crossing, a wire connection, or an input/output pin. Grid points can be empty. A computational node is a grid point that is either a logic element, a wire-connection, or an input/output pin.

• The circuit is planar, and each node has at most $4$ wires leading from it.

• Inputs of computation are stored in source nodes and outputs are stored in output nodes. The same node can be a source node and an output node. Each input can enter the circuit only at the corresponding source node.

• Each wire has width $\lambda_w$. Any two wires are separated by at least the wire-width. Any grid points are separated by distance at least $\lambda_w$. Each node is assumed to require at least $\lambda_w^2$ wire area (length and width at least $\lambda_w$).

• Processing is done in “batches,” in which a set of inputs is processed and outputs released before the next set of inputs arrive.

• Energy consumed in a computation is proportional to $A\tau$, where $A$ is the area occupied by the wires and nodes of the circuit and $\tau$ is the number of clock cycles required to execute the computation. The constant of proportionality is determined by the capacitance per unit wired area of the circuit and the voltage used to drive the circuit; it is denoted by $\xi_{\text{tech}}$, the “energy parameter” of the circuit. Processing energy for computation, $E_{\text{proc}}$, is thus given by $E_{\text{proc}} = \xi_{\text{tech}} A \tau$. Since the energy of a computation in our model and the Area-Time complexity are essentially the same, in this paper we use the terms “energy complexity” and “Area-Time complexity” interchangeably.
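
The last axiom can be illustrated with a minimal sketch. It assumes the energy is the energy parameter times area times clock cycles, and that each node occupies at least one wire-width squared of area; the function names and default values are hypothetical and chosen only for illustration.

```python
def min_area(num_nodes, wire_width=1.0):
    """Smallest area consistent with the model: each node needs wire_width**2."""
    if num_nodes < 0 or wire_width <= 0:
        raise ValueError("num_nodes must be >= 0 and wire_width must be > 0")
    return num_nodes * wire_width ** 2


def processing_energy(area, clock_cycles, xi_tech=1.0):
    """Energy of a computation: energy parameter * area * number of clock cycles."""
    if area < 0 or clock_cycles < 0 or xi_tech < 0:
        raise ValueError("area, clock_cycles and xi_tech must be non-negative")
    return xi_tech * area * clock_cycles


# Example: 1000 nodes, unit wire width, 20 clock cycles, unit energy parameter.
example_area = min_area(1000)
example_energy = processing_energy(example_area, 20)
```

Doubling either the area or the number of clock cycles doubles the energy in this sketch, which is why the text treats energy complexity and Area-Time complexity as the same quantity up to a constant.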

### II-B Discussion of Model

The circuit model described above allows us to consider a circuit as a graph in which the computational nodes correspond to the nodes of a graph and the wires correspond to edges.

The last assumption of our model, which relates the area and number of clock cycles to energy consumed in a circuit, assumes that a VLSI circuit is fully charged and then discharged to ground during each clock cycle. Since the wires in a circuit are made of conducting material and are laid out essentially flat, the circuit will have a capacitance proportional to the area of the wires. Assuming that all the wires will need to be charged at each clock cycle, there must be energy supplied to the circuit. For now, we do not consider what will happen if at each clock cycle the state of some of the wires does not change. In the literature (see [7] and [8]) this model is often used to understand power consumption in a digital circuit so we do not attempt to alter these assumptions for the purposes of this paper. Sometimes leakage current of the circuit is factored into such models; we also neglect this as we assume the frequency of computation is high enough so that the power used in computation dominates.

There has been some work to understand the tradeoff between computational complexity and code performance. One such example is [9], in which the complexity of a Gallager Decoding Algorithm B was optimized subject to some coding parameters. This however does not correspond to the energy of such algorithms.

In [6] it was proven that the Area-Time complexity of any circuit implemented according to this VLSI model that computes a Discrete Fourier Transform must scale as . However, there exist algorithms that compute in operations (for example, see [10]); Thompson’s results thus imply that, for at least some algorithms, energy consumption is not merely proportional to the computational complexity of an algorithm.

In the field of coding theory, Grover et al. in [3] provided an example of two algorithms with the same computational complexity but different computational energies. The authors looked at the girth of the Tanner graph of an LDPC code. The girth is defined as the minimum length cycle in the Tanner graph that represents the code. They showed, using a concrete example, that for (3, 4)-regular LDPC codes of girth 6 and 8 decoded using the Gallager-A decoding algorithm, the decoders for girth 8 codes can consume up to 36% more energy than those for girth 6 codes. The girth of a code does not necessarily make the decoding algorithm require more computations (i.e., it does not increase computational complexity), but, for this example, it does increase the energy complexity. This is because codes with greater girth require the interconnection between nodes to be more complex, even though the same number of computational nodes and clock cycles may be required. This drives up the area required to make these interconnections, and thus drives up the energy requirements. Also in the field of coding theory, the work of Thorpe [11] has shown that a measure of wiring complexity of an LDPC decoder can be traded off with decoding performance.

Thus, current research suggests that in decoding algorithms there appears to be a fundamental trade-off between code performance and interconnect complexity. This paper attempts to find an analytical characterization of this trade-off.

Our paper considers a generic model of computation, but of course it does not completely reflect all methods of implementing a circuit. We discuss some circuit design techniques that our model does not directly consider below.

#### II-B1 Multiple Layer VLSI Circuits

Modern VLSI circuits differ from the Thompson model in that the number of VLSI layers is not one (or two if one counts a wire crossing as another layer). Modern VLSI circuits allow multiple layers. Fortunately, it is known that if layers are allowed, then this can decrease the total area by at most a factor of (see, for example, [4] or [12]). For the purposes of our lower bounds, if the number of layers remains constant as increases, we can modify our energy lower bound results by dividing the lower bounds by . If, however, the number of layers can grow with our results may no longer hold. Note also that this only holds for the purpose of lower bound. It may not be possible to implement a circuit with an area that decreases by a factor of , and so the upper bounds of Section IX cannot be similarly modified.

The model used in this paper assumes that after every clock cycle the circuit is entirely discharged and the energy used to charge the circuit is lost. There exists extensive research into circuit designs in which this is not the case (for an overview of this type of computing, called adiabatic computing, see [13]). Our results do not apply to such circuit designs.

#### II-B3 Using Memory Elements in Circuit Computation

The Thompson model does not allow for the use of special memory nodes in computation that can hold information and compute the special function of loading and unloading from memory. Such a circuit can be created using the Thompson model, but it may be that a strategic use of a lower energy memory element can decrease the total energy of a computation. Intuitively, however, the energy used by a memory element to communicate information within a circuit should still be proportional to the distance that information is communicated. Grover in [14] proposed a “bit-meters” model of energy computation and derived scaling rules similar to our results, suggesting that, at least in an order sense, the circuit model we use is general enough to understand the scaling of high block length codes even if lower energy memory is used. Understanding precisely what kind of gain the use of a memory element can provide in energy complexity is beyond the scope of this paper.

## III Channel Model

We will consider a noisy channel model that is similar to the model used by Grover et al. in [3]. An information sequence of independent fair coin flips is encoded into binary-alphabet codewords ; and thus this code has rate bits/channel use. The sequence is passed through a binary erasure channel with erasure probability , resulting in a received vector , where the symbol corresponds to the erasure event.

The decoder estimates the input sequence by computing a function of the received vector. The outputs of the noisy channel thus become inputs into the decoder circuit. In our most general model of computation, it is required that these channel outputs are eventually input into an input node of the circuit. In a parallel implementation model used for Theorem 1, each of these decoder input symbols is input, at the beginning of the computation, into the input nodes of the decoder. In a more general computational model used in Theorems 2 and 3, it is assumed that each of these symbols is input into the input nodes of the decoder during some clock cycle in the computation. Note that we allow each of the symbols to be inserted into the circuit at any input node location during any clock cycle, but we also require, according to our model, that each input is injected only once into the circuit. Thus, our model does not subsume circuit implementations that, at no cost, allow the same input symbol to be inserted into the circuit in multiple places.

The probability of block error is defined as

$$P_e^{\text{blk}} = \Pr\left(\hat{b}_1^k \neq b_1^k\right).$$

The lower bounds used in our result are valid for a binary erasure channel, but also for any channel that can result from a degraded version of a binary erasure channel. Hence, if we let then our results apply to lower bounds on decoders for binary symmetric channels with crossover probability .
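
As an illustration of the channel model and of degradation, the sketch below simulates a binary erasure channel and then degrades it. It relies on one standard construction, which is an assumption here because the text does not spell out the degradation: replacing each erasure by a fair coin flip yields a binary symmetric channel whose crossover probability is half the erasure probability. The erasure marker and all function names are hypothetical.

```python
import random

ERASURE = "?"  # assumed marker for the erasure symbol (hypothetical choice)


def binary_erasure_channel(bits, erasure_prob, rng):
    """Erase each bit independently with probability erasure_prob."""
    return [ERASURE if rng.random() < erasure_prob else b for b in bits]


def degrade_to_bsc(received, rng):
    """Replace every erasure by a fair coin flip, leaving other symbols untouched."""
    return [rng.randrange(2) if s == ERASURE else s for s in received]


def empirical_crossover(num_bits, erasure_prob, seed=0):
    """Fraction of bits flipped end to end after erasure followed by degradation."""
    rng = random.Random(seed)
    sent = [rng.randrange(2) for _ in range(num_bits)]
    erased = binary_erasure_channel(sent, erasure_prob, rng)
    received = degrade_to_bsc(erased, rng)
    return sum(s != r for s, r in zip(sent, received)) / num_bits
```

For a long input, `empirical_crossover(num_bits, erasure_prob)` should be close to `erasure_prob / 2`, since an erased bit is restored incorrectly half of the time.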

## IV Definitions and Main Lemmas

The main theorems in this paper rely on the evaluation of a particular limit, which we present as a lemma below.

###### Lemma 1.

For any constant $c$, $0 < c < 1$, and any constant $c' > 0$:

$$\lim_{n\to\infty}\left(1-\exp(-c\log n)\right)^{c'\frac{n}{\log n}} = 0. \tag{1}$$

###### Proof:

This result follows simply from taking the logarithm of the expression in (1) and using L’Hôpital’s rule to show that the logarithm approaches $-\infty$. ∎
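
The limit can also be checked numerically. The sketch below assumes that the expression in (1) is $(1-\exp(-c\log n))^{c' n/\log n}$ with $0 < c < 1$ and $c' > 0$, and evaluates its logarithm to avoid floating-point underflow; the function names and the sample constants are hypothetical.

```python
import math


def lemma1_log_value(n, c, c_prime):
    """Natural log of (1 - exp(-c * log n)) ** (c_prime * n / log n)."""
    exponent = c_prime * n / math.log(n)
    return exponent * math.log1p(-math.exp(-c * math.log(n)))


def lemma1_value(n, c, c_prime):
    """The expression itself, recovered from its logarithm."""
    return math.exp(lemma1_log_value(n, c, c_prime))


# Evaluate at n = 10, 100, ..., 10**7 for c = 0.5 and c' = 1.
lemma1_samples = [lemma1_value(10 ** p, 0.5, 1.0) for p in range(1, 8)]
```

With these constants the samples shrink steadily as $n$ grows, in line with the lemma, because the logarithm behaves like $-c' n^{1-c}/\log n$.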

### IV-A Relation Between Energy and Bits Communicated

Grover et al. in [3] used a nested bisection technique, involving subdividing circuits, to derive the two key lemmas used in this paper. A circuit created according to the Thompson model can be considered a graph in which the computational nodes correspond to graph vertices and the wires are graph edges. To understand these lemmas we must first understand what is meant by (a) a minimum bisection of a circuit and (b) a nested bisection of a circuit. Informally, a bisection of a graph is a set of edges whose removal results in at least two graphs of essentially the same size that are unconnected to each other. A bisection can also be defined in terms of separating a particular subset of vertices. A formal definition is given below.

###### Definition 1.

Let be a graph. Let be a subset of the vertices, and be a subset of the edges. Then bisects in if deletion of the edges in from the graph cuts into disconnected sets and , with , and into sets and such that . A bisection of in is called a bisection of .

A minimal bisection is a bisection of a graph whose size is minimal over all bisections. The minimum bisection width is the size of a minimal bisection. For a general graph, finding a minimum bisection is a difficult problem (in fact, it is NP-Complete [15]), but all that is required for the results we will use is that a minimum bisection exists for every graph. Fig. 2 shows minimal bisections of a few simple graphs.
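
For very small graphs the minimum bisection width can be found by exhaustive search. The sketch below is only an illustration under stated assumptions: it bisects the full vertex set, it takes "essentially the same size" to mean that the two sides differ in size by at most one, and its running time is exponential, consistent with the hardness noted above. The function names are hypothetical.

```python
from itertools import combinations


def min_bisection_width(vertices, edges):
    """Brute-force minimum bisection width of a small undirected graph.

    Tries every split of the vertices into two sides whose sizes differ by at
    most one and returns the smallest number of edges crossing a split.
    """
    vertices = list(vertices)
    half = len(vertices) // 2
    best = None
    for left in combinations(vertices, half):
        left_set = set(left)
        cut = sum(1 for u, v in edges if (u in left_set) != (v in left_set))
        if best is None or cut < best:
            best = cut
    return best


def grid_edges(side):
    """Edges of a side-by-side mesh whose vertices are (row, column) pairs."""
    edges = []
    for r in range(side):
        for c in range(side):
            if c + 1 < side:
                edges.append(((r, c), (r, c + 1)))
            if r + 1 < side:
                edges.append(((r, c), (r + 1, c)))
    return edges
```

For example, a path on four vertices has minimum bisection width 1, and a 4-by-4 mesh has minimum bisection width 4.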

Note that the definition of a minimum bisection also applies to subsets of the vertices of a graph. A circuit has both input nodes and output nodes. The input (resp. output) nodes of the graph corresponding to a circuit are those nodes of the graph that correspond to the input (resp. output) nodes of the circuit. For the purposes of the results in this paper, we will consider bisections of the graph that bisect the output nodes.

We will also be using some other terms throughout the paper which we define below.

If one performs a minimum bisection on the output nodes of the graph corresponding to the interconnection graph of a circuit, this results in two disconnected subcircuits. The output nodes of these two (now disconnected) subcircuits can thus each be minimally bisected again, resulting in four subcircuits.

###### Definition 2.

This process, when repeated $r$ times, is said to perform an $r$-stage nested minimum bisection. Note that this divides the graph into $2^r$ disconnected components, which we will refer to as subcircuits.

When a circuit is viewed as a graph, a subcircuit can be viewed as a subgraph induced by a nested minimum bisection. When viewed as a circuit according to the Thompson model, it is a collection of computational nodes joined by wires laid out on a grid pattern. An example of a mesh-like circuit with 16 nodes undergoing two stages of nested minimum bisections is shown in Fig. 3.

After a circuit undergoes -stages of nested minimum bisections, label each of the subcircuits with a unique integer in which . Consider a particular subcircuit . During the -stages of nested minimum bisections, a number of edges are removed that are incident on the graph corresponding to subcircuit (we can think of this as the “fan-out” of this subcircuit).

###### Definition 3.

During the course of the computation, the number of bits communicated to subcircuit is , where is the number of clock cycles, and we refer to as the bits communicated to subcircuit during a computation.

The quantity is associated with a particular subcircuit induced by a particular -stage nested minimum bisection. When discussed, this quantity’s association with a particular -stage nested minimum bisection is implicit.

###### Definition 4.

The quantity $B_r$ denotes the number of bits communicated across all edges deleted in an $r$-stage nested minimum bisection. Note that this is a quantity associated with a particular $r$-stage nested minimum bisection.

The quantity $B_r$ will be important in the proofs of the theorems in the paper. Specifically, it can be shown that if a decoding circuit (which we will define below) has a large $B_r$ for a particular $r$-stage nested minimum bisection, then the energy expended during a computation by this circuit must be high. As well, it can be shown that if this quantity is low, then the probability that the circuit makes an error is high.

The above definitions are general and can apply to the computation of any function. However, for our bounds we will be finding lower bounds on the energy complexity of decoding circuits.

###### Definition 5.

An parallel decoding circuit is a circuit that has input nodes (accepting symbols in ) and output nodes (producing symbols in ). The input nodes are to receive the outputs of a noisy channel (which for the purposes of lower bound we assume to be a binary erasure channel with erasure probability ). At the end of the computation the decoder is to output the estimate of the original codeword. Note that this circuit decodes a rate code.

Note that in the Thompson model it is assumed that all inputs are binary. For the purposes of lower bound, we allow for the inputs into the computation to be either , or , where is the erasure symbol. At every clock cycle, we allow input nodes to perform a function on their input symbol, as well as the bits input into the node at the clock cycle. These nodes may then output any function of these inputs along the wires leading from the node.

This definition will be generalized to serial computation models in the discussions preceding Theorem 2. Note also that our model of a decoding circuit allows for an input to be an erasure symbol, which is a slight relaxation of the Thompson circuit model. However, in our theorems, the key point will be that, if in a particular subcircuit all the input nodes of that subcircuit are erased, then, conditioned on this event, the distribution of the possible original inputs to the channel of the bits that a subcircuit is to estimate remains uniform. This is a result of the symmetric nature of a binary erasure channel, and this will allow us to directly apply Lemma 2 to form lower bounds on probability of error.

After a decoding circuit undergoes $r$-stages of nested minimum bisections, each subcircuit will have roughly an equal number of output nodes, but the number of input nodes may vary (the actual number of input nodes in each subcircuit in general will be a function of the particular graph structure of the circuit, and the particular $r$-stage nested minimum bisection performed).

###### Definition 6.

We refer to this quantity as the number of input nodes in subcircuit $i$ and denote it $n_i$.

Note that this quantity is determined by the particular $r$-stage nested minimum bisection, but for notational convenience we will consider the relation of this quantity to the particular structure of the $r$-stage nested minimum bisection to be implicit.

###### Definition 7.

A particular th subcircuit formed by an -stage nested minimum bisection will have a certain number of output nodes, which we denote . This quantity is referred to as the number of output nodes in subcircuit .

In a fully parallel computation model (which we will employ in Theorem 1), at the end of the computation, these output nodes are to hold a vector , where the vector to be estimated is a vector which is produced by a series of fair coin flips as described in Section III. Since at the end of the computation these output nodes are to hold an estimate of a vector of length it is said that in this case is the number of bits responsible for decoding by subcircuit . The probability of error for a subcircuit is precisely the probability that, after the end of the computation, .

###### Lemma 2.

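Suppose that $X$, $Y$, and $\hat{X}$ are random variables that form a Markov chain $X \to Y \to \hat{X}$. Suppose furthermore that $X$ takes on values from a finite alphabet $\mathcal{X}$ with a uniform distribution (i.e., $P(X = x) = \frac{1}{|\mathcal{X}|}$ for any $x \in \mathcal{X}$) and $Y$ takes on values from an alphabet $\mathcal{Y}$. Suppose furthermore that $\hat{X}$ takes on values from a set $\hat{\mathcal{X}}$ such that $\mathcal{X} \subseteq \hat{\mathcal{X}}$. Then,

$$P(\hat{X} = X) \le \frac{|\mathcal{Y}|}{|\mathcal{X}|}.$$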
###### Remark 1.

This general lemma is meant to make rigorous a fundamental notion that will be used in this paper. As applied to our decoding problem, the random variable $X$ can be thought of as the input to a binary erasure channel, and $Y$ can be any inputs into a subcircuit of a computation, and $\hat{X}$ can be thought of as a subcircuit’s estimate of $X$. This lemma makes rigorous the notion that if a subcircuit has fewer bits input into it than it is responsible for decoding, then the decoder must guess at least $1$ bit, and makes an error with probability at least $\frac{1}{2}$. This scenario is actually a special case of this lemma in which $|\mathcal{X}| = 2^{k}$ and $|\mathcal{Y}| = 2^{m}$ for integers $k$ and $m$, where $m < k$.

###### Proof:

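Clearly, by the law of total probability,

$$P(X=\hat{X}) = \sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}} P_X(x)\,P_{Y|X}(y|x)\,P_{\hat{X}|Y}(x|y)$$

where we simply expand the term in the summation according to the definition of a Markov chain. Using $P_X(x) = \frac{1}{|\mathcal{X}|}$ we get:

$$P(X=\hat{X}) = \frac{1}{|\mathcal{X}|}\sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}} P_{Y|X}(y|x)\,P_{\hat{X}|Y}(x|y)$$

and using $P_{Y|X}(y|x) \le 1$ because it is a probability, and changing the order of summation gives us:

$$P(X=\hat{X}) \le \frac{1}{|\mathcal{X}|}\sum_{y\in\mathcal{Y}}\sum_{x\in\mathcal{X}} P_{\hat{X}|Y}(x|y).$$

Since $\sum_{x\in\mathcal{X}} P_{\hat{X}|Y}(x|y) \le 1$ (as we are summing over a subset of values that $\hat{X}$ can take on), we get:

$$P(X=\hat{X}) \le \frac{1}{|\mathcal{X}|}\sum_{y\in\mathcal{Y}} 1 = \frac{|\mathcal{Y}|}{|\mathcal{X}|}.$$

∎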
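
The bound of Lemma 2 can be sanity-checked numerically. The sketch below draws random channels and random estimators for a uniformly distributed source and compares the resulting probability of a correct estimate with the ratio of alphabet sizes. It is a spot check under stated assumptions (finite alphabets indexed by integers, estimates restricted to the source alphabet), not a proof, and all names are hypothetical.

```python
import random


def random_distribution(size, rng):
    """A random probability vector of the given length."""
    weights = [rng.random() + 1e-12 for _ in range(size)]
    total = sum(weights)
    return [w / total for w in weights]


def success_probability(channel, estimator):
    """P(X_hat = X) for uniform X.

    channel[x][y] is P(Y = y | X = x); estimator[y][x] is P(X_hat = x | Y = y).
    """
    num_x = len(channel)
    num_y = len(estimator)
    total = sum(channel[x][y] * estimator[y][x]
                for x in range(num_x) for y in range(num_y))
    return total / num_x


def worst_case_over_trials(num_x, num_y, trials, seed=0):
    """Largest success probability seen over random channels and estimators."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        channel = [random_distribution(num_y, rng) for _ in range(num_x)]
        estimator = [random_distribution(num_x, rng) for _ in range(num_y)]
        worst = max(worst, success_probability(channel, estimator))
    return worst
```

The bound is met with equality by a deterministic example: with four source values, a channel that outputs the source value modulo 2, and an estimator that returns the channel output, exactly two of the four source values are recovered, giving success probability 2/4.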

In the proofs of the theorems in this paper, we will be dividing a circuit up into pieces and then we will let $n$ grow larger. Technically, a circuit can only be divided into an integer fraction of pieces. However, in most cases this does not matter. To make this notion rigorous, we will need to use the following lemma:

###### Lemma 3.

Let be a function such that for sufficiently large and some positive constant . If there are functions , and is continuous for sufficiently large , and if for some constant , and if then .

###### Proof:

Suppose

$$\lim_{x\to\infty} f(g(x)) = c.$$

To show that we need to construct, given some , a particular such that for all . Since grows unbounded, and is continuous for sufficiently large , then there must be a particular value of (call it ) such that takes on all values greater than for some . As well, for any there exists some such that for all , . In particular this is true for some . Thus, choose to be the least number greater than in which (this must exist because takes on all values greater than ). Thus, for only takes on values greater than (because ). Since for all , thus for all , since can only take on values that takes on for . ∎

###### Corollary 1.

This result applies when is the floor function, denoted, since .

We will need to make one observation that will be used in the three main theorems of this paper, which we present in the lemma below.

###### Lemma 4.

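If $n_1, n_2, \ldots, n_m$ and $n$ are positive integers subject to the restriction that $\sum_{i=1}^{m} n_i \le n$ then:

$$\prod_{i=1}^{m}\left(1-\epsilon^{n_i}\right) \le \left(1-\epsilon^{\frac{n}{m}}\right)^{m}$$
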
###### Proof:

The proof follows from a simple convex optimization argument. ∎
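
One way to see the inequality, offered here as a sketch rather than as the authors' argument, is that $\log(1-\epsilon^{x})$ is increasing and concave in $x > 0$ for $0 < \epsilon < 1$, so Jensen's inequality favours the equal split. The code below checks the inequality exhaustively for small cases, assuming the restriction is that the $n_i$ sum to at most $n$; the function names are hypothetical.

```python
import math
from itertools import product


def lhs_product(eps, parts):
    """Left-hand side: product over i of (1 - eps ** n_i)."""
    return math.prod(1 - eps ** n_i for n_i in parts)


def rhs_bound(eps, n, m):
    """Right-hand side: (1 - eps ** (n / m)) ** m."""
    return (1 - eps ** (n / m)) ** m


def check_lemma4(eps, n, m):
    """Test every tuple of m positive integers whose sum is at most n."""
    bound = rhs_bound(eps, n, m)
    for parts in product(range(1, n + 1), repeat=m):
        if sum(parts) <= n and lhs_product(eps, parts) > bound + 1e-12:
            return False
    return True
```

When $m$ divides $n$, the equal split $n_i = n/m$ attains the bound, which is how the lemma is used later to replace an unknown allocation of input nodes by a uniform one.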

Grover et al. in [3] use a nested bisection technique to prove a relation between the energy consumed in a circuit computation and the bits communicated across the $r$-stages of nested bisections, which we present as a series of two lemmas, the second of which we will use directly in our results.

###### Lemma 5.

For a circuit undergoing $r$-stages of nested bisections, in which the total number of bits communicated across all $r$-stages of nested bisections is $B_r$, then

$$A\tau^2 \ge \frac{(\sqrt{2}-1)^2}{16}\,\frac{B_r^2}{2^{r+1}}\,\lambda_w^2$$

where $A$ is the area of the circuit and $\tau$ is the number of clock cycles during the computation.

###### Proof:

See [3] for a detailed proof. Here we provide a sketch. To accomplish this proof, $r$-stages of nested minimum bisections on a circuit are performed and then a principle due to Thompson [6] is applied that states that the area of a circuit is at least proportional to the square of the minimum bisection width of the circuit. Also, the number of bits communicated to a subcircuit cannot exceed the number of wires entering that subcircuit multiplied by the number of clock cycles. The area of the circuit (related to the size of the minimum bisections performed) and the number of clock cycles (more clock cycles allow more bits communicated across cuts) are then related to the number of bits communicated across all the edges deleted during the $r$-stages of nested bisections. ∎

###### Lemma 6.

If a circuit as described in Lemma 5 in addition has at least $\beta$ nodes, then the $A\tau$ complexity of such a computation is lower bounded by:

$$A\tau \ge \frac{\sqrt{2}-1}{4\sqrt{2}}\sqrt{\frac{\beta}{2^{r}}}\,B_r\,\lambda_w^2$$

###### Proof:

Following the same arguments of Grover et al. in [3] (which we reproduce to get the more general result we will need), note that if there are at least $\beta$ computational nodes, then

$$A \ge \beta\lambda_w^2$$

which, when combined with Lemma 5 results in:

$$A^2\tau^2 \ge \frac{(\sqrt{2}-1)^2}{16}\,\frac{B_r^2}{2^{r+1}}\,\beta\lambda_w^4$$

which yields the statement of the Lemma upon taking the square root. ∎

###### Remark 2.

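In terms of our energy notation, the result of Lemma 6 implies that for such a circuit with at least $\beta$ computational nodes, the energy complexity is lower bounded by:

$$E_{\text{proc}} \ge \xi_{\text{tech}}\lambda_w^2\,\frac{\sqrt{2}-1}{4\sqrt{2}}\sqrt{\frac{\beta}{2^{r}}}\,B_r = K_{\text{tech}}\sqrt{\frac{\beta}{2^{r}}}\,B_r$$

where $K_{\text{tech}} = \xi_{\text{tech}}\lambda_w^2\,\frac{\sqrt{2}-1}{4\sqrt{2}}$.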
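
The bounds of Lemma 6 and Remark 2 are simple to evaluate. The sketch below assumes the forms given above, namely an Area-Time bound of $\frac{\sqrt{2}-1}{4\sqrt{2}}\sqrt{\beta/2^{r}}\,B_r\lambda_w^2$ scaled by the energy parameter; the function names, argument names, and default values are hypothetical.

```python
import math

BISECTION_CONSTANT = (math.sqrt(2) - 1) / (4 * math.sqrt(2))


def area_time_lower_bound(num_nodes, bits_communicated, stages, wire_width=1.0):
    """Lemma 6: lower bound on A * tau from beta, B_r, r and the wire width."""
    return (BISECTION_CONSTANT * math.sqrt(num_nodes / 2 ** stages)
            * bits_communicated * wire_width ** 2)


def energy_lower_bound(num_nodes, bits_communicated, stages,
                       wire_width=1.0, xi_tech=1.0):
    """Remark 2: the energy bound is the energy parameter times the A * tau bound."""
    return xi_tech * area_time_lower_bound(
        num_nodes, bits_communicated, stages, wire_width)


def bound_via_lemma5(num_nodes, bits_communicated, stages, wire_width=1.0):
    """Same bound, derived as in the proof: combine Lemma 5 with A >= beta * width**2."""
    squared = ((math.sqrt(2) - 1) ** 2 / 16
               * bits_communicated ** 2 / 2 ** (stages + 1)
               * num_nodes * wire_width ** 4)
    return math.sqrt(squared)
```

The two routes agree, which is the square-root step in the proof of Lemma 6. The bound grows linearly in the number of bits communicated and as the square root of the number of nodes.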

### IV-B Bound on Block Error Probability

The key lemma that will be used in the first theorem of this paper is due to Grover et al. [3]. We modify the lemma slightly.

###### Lemma 7.

For any code implemented using the VLSI model for an erasure channel with erasure probability , for any ,

$$\text{either } P_e^{\text{blk}} \ge \frac{1}{2} - \left(1-\epsilon^{\frac{n}{2^{r-1}}}\right)^{2^{r-1}} \text{ or } B_r \ge \frac{k}{2}.$$
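
To get a feel for the first alternative, the sketch below evaluates the error-probability bound for given parameters. It assumes the bound has the form $\frac{1}{2} - (1-\epsilon^{n/2^{r-1}})^{2^{r-1}}$, which is the form implied by the proof that follows; the function name is hypothetical.

```python
import math


def lemma7_error_bound(n, r, eps):
    """Evaluate 1/2 - (1 - eps ** (n / 2 ** (r - 1))) ** (2 ** (r - 1)).

    A negative return value means the bound is vacuous for these parameters.
    """
    pieces = 2 ** (r - 1)
    all_erased = eps ** (n / pieces)
    if all_erased >= 1.0:
        return 0.5
    # (1 - x) ** m computed as exp(m * log1p(-x)) for numerical stability.
    return 0.5 - math.exp(pieces * math.log1p(-all_erased))


# Bound for block length 1024 and erasure probability 0.5 at r = 1, ..., 9.
lemma7_samples = [lemma7_error_bound(1024, r, 0.5) for r in range(1, 10)]
```

With few stages the bound is vacuous, but once each subcircuit holds only a handful of input nodes the bound approaches $\frac{1}{2}$, which is the regime exploited when the number of stages grows with the block length.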

The proof uses the same approach as Grover et al. in [3] but we modify it slightly to ease the use of our lemma for our theorem and to conveniently deal with the possibility that a decoder can guess an output of a computation.

Let be the number of input bits erased in the th subcircuit after -stages of nested bisections. Furthermore, recall from Definition 3 that is the number of bits injected into the th subcircuit during the computation. Also, recall from Definition 6 that is the number of input nodes located within the th subcircuit. We use the principle that if

$$(\text{number of unerased input bits}) + (\text{number of bits injected}) < \frac{k}{2^{r}}$$

for any subcircuit then the probability of block error is at least $\frac{1}{2}$. This is a very intuitive idea; if the number of bits that are not erased, plus the number of bits injected into a circuit, is less than the number of bits the circuit is responsible for decoding, the circuit must at least guess $1$ bit. This argument will be made formal in the proof that follows.

###### Proof:

(of Lemma 7) Suppose that all the input bits injected into the $i$th subcircuit are the erasure symbol. Then, conditioned on this event, the distribution of the bits that this subcircuit is to estimate is uniform (owing to the symmetric nature of the binary erasure channel). Suppose furthermore that the number of bits injected into the subcircuit is less than $\frac{k}{2^r}$, the number of bits the subcircuit is responsible for decoding. Combining these two facts allows us to apply Lemma 2 directly to conclude that, in the event that all the input bits of a subcircuit are erased and the number of bits injected into the subcircuit is less than $\frac{k}{2^r}$, the subcircuit makes an error with probability at least $\frac{1}{2}$. Denote the event that all input bits in subcircuit $i$ are erased as $W_i^r$. The probability of this event is given by

$$P(W_i^r) = \epsilon^{n_i}.$$

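Suppose that $B_r < \frac{k}{2}$ (where we recall from Definition 4 that $B_r$ is the total number of bits communicated across all edges cut in $r$-stages of nested minimum bisections). Let $S$ be the set of indices $i$ for which the number of bits communicated to the $i$th subcircuit is smaller than $\frac{k}{2^r}$. We first claim that $|S| > 2^{r-1}$. To prove this claim, note that every subcircuit not indexed in $S$ has at least $\frac{k}{2^r}$ bits communicated to it, so the number of such subcircuits times $\frac{k}{2^r}$ is at most $B_r < \frac{k}{2}$, from which it follows that fewer than $2^{r-1}$ subcircuits lie outside $S$. Since there are $2^r$ subcircuits in total, the claim follows.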

Hence, in the case that $B_r < \frac{k}{2}$, by the law of total probability:

$$P(\text{correct}) \le \prod_{i \in S} \left(1 - \epsilon^{n_i}\right) + \frac{1}{2} \tag{2}$$

where the first term is the probability of the event that each of the subcircuits indexed in $S$, after $r$-stages of nested bisections, does not have all its input bits erased; in this case, the probability of the circuit decoding correctly is at most $1$. For the second term, we note that conditioned on the event that at least one of the subcircuits indexed in $S$ has all its input bits erased, since the circuit must guess at least 1 bit, the probability of the circuit decoding successfully is at most $\frac{1}{2}$, by Lemma 2.

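Since $\sum_{i \in S} n_i \le n$, subject to this restriction, Lemma 4 shows that the expression in (2) is maximized when $n_i = \frac{n}{|S|}$ for each subcircuit in $S$. Hence,

$$P(\text{correct}) \le \left(1 - \epsilon^{\frac{n}{|S|}}\right)^{|S|} + \frac{1}{2}.$$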

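Thus, either $B_r \ge \frac{k}{2}$ or $|S| > 2^{r-1}$, which implies

$$P(\text{correct}) \le \left(1 - \epsilon^{\frac{n}{2^{r-1}}}\right)^{2^{r-1}} + \frac{1}{2}$$

and so

$$P_e^{\text{blk}} = 1 - P(\text{correct}) \ge \frac{1}{2} - \left(1 - \epsilon^{\frac{n}{2^{r-1}}}\right)^{2^{r-1}}.$$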
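To get a feel for the bound of Lemma 7, the following sketch evaluates $\frac{1}{2} - (1 - \epsilon^{n/2^{r-1}})^{2^{r-1}}$ for a fixed block length and an increasing number of bisection stages. The function name and the sample values of $n$, $r$ and $\epsilon$ are illustrative assumptions, not values taken from the paper.

```python
def block_error_lower_bound(n: int, r: int, eps: float) -> float:
    """Evaluate 1/2 - (1 - eps^(n / 2^(r-1)))^(2^(r-1)).

    This is the lower bound on block error probability that applies when
    fewer than k/2 bits cross the cuts of r nested bisections.
    A negative value means the bound is vacuous for these parameters.
    """
    s = 2 ** (r - 1)                      # lower bound on the number of starved subcircuits
    return 0.5 - (1.0 - eps ** (n / s)) ** s


if __name__ == "__main__":
    n, eps = 1024, 0.5                    # hypothetical block length and erasure probability
    for r in (1, 5, 7, 9):
        print(r, block_error_lower_bound(n, r, eps))
```

With few bisections each subcircuit sees many channel outputs, so a full erasure of any one of them is very unlikely and the bound is vacuous. With many bisections the subcircuits are small enough that at least one of them is almost surely fully erased, and the bound approaches $\frac{1}{2}$.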

## V Key Result: A Fundamental Scaling Rule for Energy of Low Block Error Probability Decoders

We define a coding scheme as a sequence of codes of fixed rate together with decoding circuits, in which the block length of each successive code increases. We define $P_e^{\text{blk},n}$ as the block error probability for the decoder of block length $n$ in this scheme. An example of a coding scheme would be a series of regular LDPC codes together with LDPC decoding circuits in which the block length doubles for each decoder in the sequence. Our results are general and apply to any coding scheme using any circuit implementation and any decoding algorithm. The key result of this paper is given in the following theorem:

###### Theorem 1.

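For every coding scheme in which $\lim_{n \to \infty} P_e^{\text{blk},n} < \frac{1}{2}$, there exists some $n_0$ such that for all $n > n_0$, for any circuit implemented according to the VLSI model with parameters $\xi_{\text{tech}}$ and $\lambda_w$,

$$E_{\text{dec}} > K_{\text{tech}} \sqrt{\frac{\log_2 n}{\log_2\left(\frac{1}{\epsilon}\right)}}\, \frac{Rn}{2} \tag{3}$$

where $E_{\text{dec}}$ is the energy used in the decoding and $K_{\text{tech}}$ is the constant, depending on circuit technology parameters, that we defined before.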

###### Remark 3.

The requirement that $\lim_{n \to \infty} P_e^{\text{blk},n} < \frac{1}{2}$ for our bound in (3), though reasonable, is not necessary for a good design. Typical block error probability requirements may be on the order of or , although if the block error probabilities are lower bounded by $\frac{1}{2}$ for a series of decoding schemes, this is not necessarily a bad design. The individual bit error probabilities (the probability that a randomly selected output bit of the decoder is decoded incorrectly) may indeed be acceptably low. However, our results do not consider such schemes. It is also not necessary for a decoding scheme to have a block length that gets arbitrarily large. However, a capacity-approaching code must have block length that approaches infinity, and our result can be used to understand how the energy complexity of such decoding algorithms approaches infinity.

###### Proof:

The theorem follows from an appropriate choice for , the number of nested bisections we perform. We can choose any nonnegative integer so that . Note that is the number of bits the decoder is responsible for decoding. As gets large, we can thus choose any so that . Thus, we choose an so that, approximately, , for a value of which we will choose later. In particular, we will choose .

This is valid so long as is sufficiently large, for some . Note that since . Since , this is a valid choice for so long as

 2log(1ϵ)nKlogn

which must occur as the left side of the inequality approaches as gets large. We can plug this value for into Lemma 7, but we will simplify the expression by neglecting the floor function, as application of Lemma 2 will show that this does not alter the evaluation of the limit that we will compute, as we can see our choice for grows unbounded with . Thus, either

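$$P_e^{\text{blk},n} \ge \frac{1}{2} - \left(1 - \exp(-K \log n)\right)^{\frac{1}{K} \frac{\log\left(\frac{1}{\epsilon}\right) n}{\log n}} \tag{4}$$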

or, applying Lemma 6 by recognizing that there are at least $n$ nodes,

$$E_{\text{dec}} > K_{\text{tech}} \sqrt{\frac{K \log n}{\log\left(\frac{1}{\epsilon}\right)}}\, \frac{Rn}{2}. \tag{5}$$

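By a direct application of Lemma 1, so long as $K < 1$, the bound in (4) approaches $\frac{1}{2}$, which we can see as follows:

$$\lim_{n \to \infty} P_e^{\text{blk},n} \ge \frac{1}{2} - \lim_{n \to \infty} \left(1 - \exp(-K \log n)\right)^{\frac{1}{K} \frac{\log\left(\frac{1}{\epsilon}\right) n}{\log n}} = \frac{1}{2}.$$

This implies that, in the limit of large block sizes, the probability of block error must be lower bounded by $\frac{1}{2}$, unless $K \ge 1$. But then by (5), it must be that

$$E_{\text{dec}} > K_{\text{tech}} \sqrt{\frac{\log n}{\log\left(\frac{1}{\epsilon}\right)}}\, \frac{Rn}{2} \tag{6}$$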

which is the result we are seeking to prove. ∎
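The limiting behaviour used in the proof can be checked numerically. The sketch below evaluates the right-hand side of (4), assuming natural logarithms and writing $\exp(-K \log n)$ as $n^{-K}$; the function name and the sample parameters are illustrative assumptions.

```python
import math


def theorem1_error_bound(n: int, K: float, eps: float) -> float:
    """Right-hand side of (4): 1/2 - (1 - n^(-K))^(log(1/eps) * n / (K * log n)).

    log1p keeps the computation stable when n^(-K) is tiny.
    """
    exponent = math.log(1.0 / eps) * n / (K * math.log(n))
    log_base = math.log1p(-n ** (-K))     # log(1 - exp(-K log n))
    return 0.5 - math.exp(exponent * log_base)


if __name__ == "__main__":
    eps = 0.5                              # hypothetical erasure probability
    for K in (0.5, 2.0):
        print(K, [theorem1_error_bound(n, K, eps) for n in (10**2, 10**4, 10**6)])
```

For $K < 1$ the values climb toward $\frac{1}{2}$ as $n$ grows, while for $K > 1$ the expression stays negative, so the bound says nothing. This is the dichotomy the proof exploits: a decoder that keeps its block error probability below $\frac{1}{2}$ must fall in the regime $K \ge 1$, where (5) applies.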

The following corollary is immediate.

###### Corollary 2.

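For a sequence of decoding schemes in which $\lim_{n \to \infty} P_e^{\text{blk},n} < \frac{1}{2}$, in the limit of large $n$ the average decoding energy per decoded bit (which we denote $E_{\text{dec,avg}}$) is bounded as:

$$E_{\text{dec,avg}} > K_{\text{tech}} \sqrt{\frac{\log_2 n}{\log_2\left(\frac{1}{\epsilon}\right)}}. \tag{7}$$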
###### Proof:

The proof follows simply by dividing (6) by $k$, the number of bits such a code is responsible for decoding. ∎

## VI Serial Computation

Our result in Section V applies to decoders implemented entirely in parallel; however, this does not necessarily reflect the state of modern decoder implementations. Below we provide a modified version of the Thompson VLSI model that allows for the source of the computation to be input serially and the outputs to be computed serially.

In this modified model, we assume that the circuit computes a function of inputs and outputs. However, instead of having input nodes and output nodes, the circuit has input nodes and output nodes. The computation terminates after a set number of clock cycles, and during the clock cycles, the inputs to the computation may be input into the input nodes (where bits at most can be input during a single clock cycle), and the outputs of the computation must appear in the output nodes during specified clock cycles of the computation.

###### Remark 4.

The number of clock cycles must at least be enough to output all the bits. If there are output nodes and outputs to the function being computed, then there must be at least clock cycles. If all the inputs into the computation are being used, then there must also be at least clock cycles, though it is technically possible for some functions to have inputs that “don’t matter” so this is not a strict bound for all functions.

Hence, a lower bound on the energy complexity for this computation is:

$$E_{\text{proc}} \ge \xi_{\text{tech}} A_c \frac{k}{j}$$

where $A_c$ is the area of the circuit.

###### Theorem 2.

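Suppose there is a sequence of codes together with decoding schemes with rate $R$ and block length $n$ approaching infinity. We label the block error probability of the length-$n$ decoder as $P_e^{\text{blk},n}$. Also suppose that the number of output pins remains a constant $j$. Then either (a) $\lim_{n \to \infty} P_e^{\text{blk},n} \ge \frac{1}{2}$ or (b) there exists some $n_0$ such that for all $n$ greater than $n_0$

$$E_{\text{dec}} \ge \xi_{\text{tech}} \lambda_w^2 \frac{R^2 n}{j \log\left(\frac{1}{\epsilon}\right)} \left(\log n - j\right) = \Omega(n \log n).$$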

To prove this theorem, instead of dividing the circuit into subcircuits, we will divide the computation conceptually in time, by dividing the computation into epochs. More precisely, consider dividing the computation outputs into chunks of size (with the exception of possibly one chunk if does not evenly divide ), meaning that there are such chunks. Hence, the outputs, which can be labeled can be divided into groups, or a collection of subvectors in which , and so on, until .

###### Definition 8.

The set of clock cycles in the computation in which the bits in are output is considered to be the th epoch.

In our analysis, we are interested in analyzing the decoding problem for chunks of the output as defined above for an $m$ that we will choose later for the convenience of our theorem. We are also interested in another set of quantities: the input bits injected into the circuit between the time when the last of the bits in are output and the first of the bits in are output. Label the collection of these bits as . Label the size of each of these subvectors as $n_i$, so that the number of bits injected before all of the bits in are computed is $n_1$, and the number of those injected after the first $n_1$ bits are injected and until the clock cycle when the last of the bits in are output is $n_2$, and so on. Let $s_i$ be the number of erasures that are injected into the circuit during the $i$th epoch. Note that by Lemma 2 an error occurs when

$$m > n_i + \bar{A} - s_i$$

where $\bar{A}$ is the maximum number of bits that can be stored in the circuit, remembering that according to the computation model the maximum number of bits that can be stored in a circuit must be proportional to the area of the circuit, as each wire in the circuit at any given time in the computation can hold only the value $0$ or $1$.

###### Proof:

(of Theorem 2) Suppose we divide the circuit into chunks each of size , more than the normalized circuit area. Then, if all the bits are erased, the probability that at least one of the bits of is not decoded must at least be , because there are simply not enough non-erased inputs for the circuit to infer the bits it is responsible for decoding in that window of time. Note that we choose so that an error event occurs with probability at least when all the bits are erased, because it is technically possible that in a clock cycle that outputs the last of the bits of , bits of are output. Then, the number of bits required to be computed for the next chunk of outputs is at least . Let the size of each (except possibly ) be . Similar to what we did for in Section IV-B, denote the event that all input bits in are erased as . Thus:

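$$\begin{aligned} P(\text{correct}) &= P\left(\cap_{i=1}^{\lceil \frac{k}{m} \rceil} \bar{W}_i\right) P\left(\text{correct} \,\middle|\, \cap_{i=1}^{\lceil \frac{k}{m} \rceil} \bar{W}_i\right) + P\left(\cup_{i=1}^{\lceil \frac{k}{m} \rceil} W_i\right) P\left(\text{correct} \,\middle|\, \cup_{i=1}^{\lceil \frac{k}{m} \rceil} W_i\right) \\ &\le \prod_{i=1}^{\lfloor \frac{k}{m} \rfloor} \left(1 - \epsilon^{n_i}\right) + \frac{1}{2}. \end{aligned}$$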

The first term is simplified by recognizing the independence of erasure events in the channel, and the second term is simplified by the fact that, conditioned on the event that at least one epoch has all of its injected input bits erased, Lemma 2 applies and the circuit must make an error with probability at least $\frac{1}{2}$. Thus:

$$P_e^{\text{blk},n} = 1 - P(\text{correct}) \ge \frac{1}{2} - \prod_{i=1}^{\lfloor \frac{k}{m} \rfloor} \left(1 - \epsilon^{n_i}\right). \tag{8}$$

It must be that , and thus , where again is the total number of inputs.

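We can apply Lemma 4 to show that the product term in (8) is maximized when each $n_i$ is equal to $\frac{n}{\lfloor k/m \rfloor}$. Thus, we show that:

$$P_e^{\text{blk},n} \ge \frac{1}{2} - \left(1 - \epsilon^{\frac{n}{\lfloor k/m \rfloor}}\right)^{\lfloor \frac{k}{m} \rfloor}$$

For the sake of the convenience of calculation, we replace $\lfloor \frac{k}{m} \rfloor$ with $\frac{k}{m}$, which will not alter the evaluation of the limit by Lemma 3, giving us:

$$P_e^{\text{blk},n} \ge \frac{1}{2} - \left(1 - \epsilon^{\frac{mn}{k}}\right)^{\frac{k}{m}} \tag{9}$$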

Since we have assumed , suppose that , and recognizing that , and that , substituting into (9) and simplifying gives us:

$$P_e^{\text{blk}} \ge \frac{1}{2} - \left(1 - \exp(-c \log n)\right)^{\frac{\log\left(\frac{1}{\epsilon}\right) R n}{\log n}}$$

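Thus, if $c < 1$, applying Lemma 1 gives:

$$\lim_{n \to \infty} P_e^{\text{blk}} \ge \frac{1}{2} - \lim_{n \to \infty} \left(1 - \exp(-c \log n)\right)^{\frac{\log\left(\frac{1}{\epsilon}\right) R n}{\log n}} = \frac{1}{2}$$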

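Hence, either in the limit the block error probability is at least $\frac{1}{2}$, or $c \ge 1$ and thus

$$E_{\text{dec}} \ge \xi_{\text{tech}} \lambda_w^2 \bar{A} \frac{k}{j} \ge \xi_{\text{tech}} \lambda_w^2 \frac{R^2 n}{j \log\left(\frac{1}{\epsilon}\right)} \left(\log n - j\right) = \Omega(n \log n),$$

where we have used the fact that the number of clock cycles is at least $\frac{k}{j}$ as well as the bound on $\bar{A}$ implied by $c \ge 1$. ∎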
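The serial bound of Theorem 2 is easy to tabulate. The sketch below assumes natural logarithms and placeholder technology constants equal to 1; the function name and the sample values of $R$, $j$ and $\epsilon$ are illustrative and not taken from the paper.

```python
import math


def serial_energy_lower_bound(n: int, rate: float, j: int, eps: float,
                              xi_tech: float = 1.0, lambda_w: float = 1.0) -> float:
    """Theorem 2 bound: xi_tech * lambda_w^2 * R^2 n / (j log(1/eps)) * (log n - j).

    A non-positive value means n is still too small, relative to the number
    of output pins j, for the bound to say anything.
    """
    return (xi_tech * lambda_w ** 2 * rate ** 2 * n
            / (j * math.log(1.0 / eps)) * (math.log(n) - j))


if __name__ == "__main__":
    rate, j, eps = 0.5, 1, 0.5            # hypothetical code rate, pin count, erasure probability
    for n in (2**10, 2**20, 2**40):
        bound = serial_energy_lower_bound(n, rate, j, eps)
        print(n, bound, bound / (n * math.log(n)))
```

The last printed column is the bound normalized by $n \log n$; it increases toward the constant $\frac{R^2}{j \log(1/\epsilon)}$, which is the sense in which the energy is $\Omega(n \log n)$ for a fixed number of output pins.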

## VII A General Case: Allowing the Number of Output Pins to Vary with Increasing Block Length

The results in Sections V and VI show that, for fully parallel implementations and for serial implementations with a constant number of output pins, the Area-Time complexity of decoders with asymptotically low block error probability has a super-linear lower bound. Technically, however, it may be possible to make a series of circuits with increasing block length, and have the number of output pins increase with increasing block length. We can show that, in this case, a super-linear lower bound exists as well, where we require only weak assumptions on the circuit layout. This proof applies the main principle of this paper: namely, that if a subcircuit has all its inputs erased, then that subcircuit must somehow have communicated to it other bits from outside this circuit, or it must, with high probability, make an error. In Theorem 1, we recognize that in a fully parallel computation these bits must be injected into it from another part of the circuit, resulting in some energy cost. In Theorem 2, we divide the computation into epochs, and recognize that if all the input bits injected into the circuit during an epoch are erased then the circuit must rely on bits carried over from before that epoch. But the number of bits that can be carried forward after each epoch is limited by the area of the circuit. In the general case in which the number of output pins can vary with block length, we divide the circuit into subcircuits and epochs (in essence, dividing the circuit in both time and space) and apply these two fundamental ideas.

To accomplish this lower bound, we will need this simplifying assumption: for any decoder with $j$ output pins, each output pin is responsible for, before the end of the computation, outputting between and bits. Furthermore, we assume that each output pin produces an output at the same time. We call this assumption an output regularity assumption. This assumption allows us to divide the circuit into subcircuits and then epochs, so that each subcircuit can be divided into subcircuit epochs. The main structure of the proof will be this: if the energy of a computation is not high, then there will be many subcircuit epochs that do not have enough bits injected into them to overcome the case of one of them having all of their input bits erased. The task is thus to choose a correct number of divisions of the circuit into subcircuits and epochs, so that the probability of this event (that a subcircuit epoch makes an error) is high unless the area-time complexity of the computation is high.

###### Theorem 3.

For a sequence of codes and circuit implementations of decoding algorithms in which the block length $n$ gets large, where the number of output pins $j$ can vary with the block length $n$, and where the computation performed by the decoders is consistent with the output regularity assumption, in the limit as $n$ approaches infinity, either (a) the block error probability is at least $\frac{1}{2}$ or (b) for a sufficiently large $n$, the Area-Time complexity of decoding scales as $\Omega\left(n (\log n)^{\frac{1}{5}}\right)$.

###### Proof:

The proof is given in Appendix A. ∎

## VIII Consequences

A direct consequence of our work is that as code rates approach capacity, the average energy of decoding, per bit, must approach infinity. It is well known from [16] and [17] and further studied in [18] that as a function of fraction of capacity $\eta$, the minimum block length $n$ scales approximately as

$$n \approx \frac{c}{(1-\eta)^2}$$

for a constant $c$ that depends on the channel and target probabilities of error. We are not concerned about the value of this constant, but rather the dependence of this approximation on $\eta$. Plugging this result into (7) implies:

$$E_{\text{dec,avg}} \gtrsim K_{\text{tech}} \sqrt{\frac{\log_2 \frac{c}{(1-\eta)^2}}{\log_2\left(\frac{1}{\epsilon}\right)}} = \Omega\left(\sqrt{\log\left(\frac{1}{1-\eta}\right)}\right).$$

This result implies that not only must the total energy of a decoding algorithm approach infinity as capacity is approached (this is a trivial consequence of the fact that block length must approach infinity as capacity is approached), but also the energy per bit must approach infinity as capacity is approached. Thus, if the total energy per bit is to be optimized, a rate strictly less than capacity must be used. We cannot get arbitrarily close to optimal energy per bit by getting arbitrarily close to capacity, which would be the case if there were linear energy complexity algorithms with block error probability that stays less than $\frac{1}{2}$.

The result of Theorem 2 can also be extended to find a fundamental lower bound on the average energy per bit of serially decoded, capacity-approaching codes. For the same reason as in the fully parallel case, we can see that as a function of gap to capacity, the average energy per bit for a decoder must scale as

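$$E_{\text{dec,avg}} \ge \frac{2 \lambda_w^2 R}{j \log\left(\frac{1}{\epsilon}\right)} \left(\log\left(\frac{1}{1-\eta}\right) - j\right) = \Omega\left(\log\left(\frac{1}{1-\eta}\right)\right).$$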

Finally, it can be shown from Theorem 3 that in circuits in which the number of output pins can grow arbitrarily, and the output regularity assumption is satisfied, the average energy per bit as a function of gap to capacity must scale as

$$E_{\text{dec,avg}} \ge \Omega\left(\left(\log\left(\frac{1}{1-\eta}\right)\right)^{\frac{1}{5}}\right).$$
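The three per-bit scaling laws of this section can be compared side by side. In the sketch below all constants (technology parameters, rate, erasure probability, number of pins) are dropped, so only the growth in the gap to capacity is meaningful; the function name and dictionary keys are illustrative assumptions.

```python
import math


def per_bit_energy_scalings(eta: float) -> dict:
    """Growth terms of the per-bit energy lower bounds as a function of the
    fraction of capacity eta, with all multiplicative constants dropped."""
    if not 0.0 < eta < 1.0:
        raise ValueError("eta must lie strictly between 0 and 1")
    gap_term = math.log(1.0 / (1.0 - eta))
    return {
        "parallel": math.sqrt(gap_term),   # from Corollary 2
        "serial": gap_term,                # from Theorem 2, constant pin count
        "general": gap_term ** 0.2,        # from Theorem 3, varying pin count
    }


if __name__ == "__main__":
    for eta in (0.9, 0.99, 0.999):
        print(eta, per_bit_energy_scalings(eta))
```

All three terms grow without bound as $\eta$ approaches 1, though slowly: moving from 90% to 99.9% of capacity only triples the serial term.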

## IX Upper Bound on Energy of Regular LDPC Code

We have shown that for any code and decoding circuit with block error probability that is below $\frac{1}{2}$, the Area-Time complexity must scale at least as fast as $\Omega(n\sqrt{\log n})$. We provide here an example of a particular circuit layout that achieves $O(n^2 \log\log n)$ complexity. Low density parity check (LDPC) codes are standard codes first described by Gallager in [19]. There have been a number of papers that have sought to find very energy-efficient implementations of LDPC decoders; for example [20]. The reference [21] gives an overview of various techniques used to create actual VLSI implementations of LDPC decoders. However, these papers have not examined how the energy per bit of these decoders scales with block length; they show a method to optimize an LDPC decoder of a particular block length and show that their implementation method improves over a previous implementation. Our goal is to provide an understanding of how a particular implementation of LDPC codes should scale with block length $n$.

We provide a simple circuit placement algorithm that results in a circuit whose area scales no faster than $O(|E|^2)$, where $|E|$ is the number of edges in the Tanner graph of the code. For a regular LDPC code with constant node degrees, this implies that the area scales as $O(n^2)$.

The placement algorithm proposed involves actually instantiating the Tanner graph of the LDPC code with wires, where each edge of the Tanner graph corresponds to a wire connecting the subcircuits that perform variable node computations to the subcircuits that perform check node computations. Our concern is not about the implementation of the variable and check nodes in this circuit. In the diagram, we treat these as merely a “black box” whose area is no greater than proportional to the square of the degree of the node. Of course, the actual area of these nodes is implementation specific, but the important point is that the area of each node should only depend on the particular node degree and not on the block length of the entire code. Our concern is actually how the area of the interconnecting wires scales. The wires leading out of each of these check and variable node subcircuits correspond to edges that leave the corresponding check or variable node of the Tanner graph. The challenge is then to connect the variable nodes with the check nodes with wires as they are connected in the Tanner graph in a way consistent with our circuit axioms. We lay out all the variable nodes on the left side of the circuit, and all the subcircuits corresponding to a check node on the right side of the circuit, and place the outputs of each of these subcircuits in a unique row of the circuit grid (see Fig. 4). Note that the number of outputs for each variable and check node subcircuit will be equal to the degree of that corresponding node in the Tanner graph of the code. The height of this alignment of nodes will be $2|E|$, twice the number of edges in the corresponding Tanner graph (as there must be a unique row for each of the edges of the variable nodes and also for the edges leading from the check nodes).

The distance between these columns of check and variable nodes is $|E|$. Each output of the variable nodes is assigned a unique grid column that will not be occupied by any other wire (except in the case of a crossing, which according to our model is allowed). A horizontal wire is drawn until this column is reached, and then the wire is drawn up or down along this column until it reaches the row corresponding to the check node output to which it is to be attached. A diagram of the procedure to draw such a circuit for a case of edges is shown in Fig. 4. Since each output of the variable and check node “black boxes” takes up a unique row, and each wire has a unique column, no two wires in drawing this circuit can ever run along the same edge; they can only cross, which is permitted in our model.

The total area of this circuit is thus bounded by: , where is the area of the nodes and is the area of the wires. Now it is sufficient that there is a grid row for each output of the variable nodes and the check nodes, and that there is a column for each edge. Hence

$$A_w \le 2|E| \cdot |E| = 2|E|^2.$$
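The placement just described can be sketched in a few lines. The code below is an illustration under simplifying assumptions rather than the authors' layout tool: the grid is modelled as cells indexed by (row, column), each wire records the cells it occupies together with an orientation, and node "black boxes" are ignored. The function name `place_tanner_graph` and the edge-list input format are hypothetical.

```python
def place_tanner_graph(edges):
    """Route one wire per Tanner-graph edge on a 2|E| x |E| grid.

    edges: list of (variable_node, check_node) pairs.
    Returns (wire_area_bound, occupancy), where occupancy maps
    (row, column, orientation) to the index of the wire using it.
    Raises ValueError if two wires would run along the same cell in the
    same direction (crossings are allowed, overlaps are not).
    """
    num_edges = len(edges)
    # Ports of the same node get consecutive rows; variable-node ports use
    # rows 0..|E|-1 and check-node ports use rows |E|..2|E|-1.
    var_order = sorted(range(num_edges), key=lambda e: edges[e][0])
    chk_order = sorted(range(num_edges), key=lambda e: edges[e][1])
    var_row = {e: i for i, e in enumerate(var_order)}
    chk_row = {e: num_edges + i for i, e in enumerate(chk_order)}

    occupancy = {}
    for col in range(num_edges):          # the edge index doubles as its private column
        rv, rc = var_row[col], chk_row[col]
        cells = [(rv, x, "h") for x in range(0, col + 1)]           # left side to the column
        cells += [(y, col, "v") for y in range(rv, rc + 1)]         # along the private column
        cells += [(rc, x, "h") for x in range(col, num_edges)]      # column to the right side
        for cell in cells:
            if cell in occupancy:
                raise ValueError(f"wires {occupancy[cell]} and {col} overlap at {cell}")
            occupancy[cell] = col

    height, width = 2 * num_edges, num_edges
    return height * width, occupancy


if __name__ == "__main__":
    # Tiny example graph: 3 variable nodes of degree 2, 2 check nodes of degree 3.
    example_edges = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 0), (2, 1)]
    area, used = place_tanner_graph(example_edges)
    print(area, len(used))
```

Because every horizontal run sits in a row owned by exactly one port and every vertical run sits in a column owned by exactly one edge, the overlap check never fires, and the bounding box has area $2|E| \cdot |E|$ regardless of how the Tanner graph is connected.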

We assume that the subcircuits that perform the computational node operations can complete their operation in one clock cycle and take up area proportional to the square of their degree. Hence we suppose that the total area of the nodes is at most $d_v^2 n + d_c^2 (n-k)$, where $d_v$ is the degree of the variable nodes and $d_c$ the degree of the check nodes. We then conclude that:

$$A_c \le 2|E|^2 + d_v^2 n + d_c^2 (n-k).$$

The total energy for the computation will depend on the number of iterations performed. Since each iteration requires sending information from the variable nodes to the check nodes and back again, each iteration can be performed in $2$ clock cycles. Hence, $\tau = 2N$, where $N$ is the number of iterations performed, and of course $\tau$ is the number of clock cycles in the computation.

Thus, the total energy of this implementation of an LDPC code is upper bounded by

$$E_{\text{dec}} \le 2N \left(2|E|^2 + d_v^2 n + d_c^2 (n-k)\right).$$

The work of Lentmaier et al. [22] has shown that for an LDPC decoder, for asymptotically low block error probability, $O(\log\log n)$ iterations are sufficient if the node degrees are high enough. This then results in an upper bound on the energy of

$$E_{\text{dec}} \le 2N \left(2(n d_v)^2 + d_v^2 n + d_c^2 (n-k)\right) = O(n^2 \log\log n).$$

## X Conclusion

This work expands on previous work in [3] by providing a standard to which decoding algorithms together with circuit implementations can be compared. Earlier work on the energy used in decoding (for example, [20]) has involved trying to optimize a circuit that implements a particular code; it has not sought to understand how the energy scales with the length of the code.

Some work has provided an analysis of the energy requirements for specific types of codes. The work in [23] has provided a way to analyze the energy requirements for an LDPC decoder. The result of our paper is more general: it applies to any decoding algorithm. Further investigations should compare the results in this paper to existing results relating energy per bit with parameters like block error probability.

Finally, this paper should also be used to guide the development of new codes that attempt to approach this fundamental lower bound. There may be some modifications of some types of codes, for example LDPC codes, whose Area-Time complexity is chosen to be (for example, by choosing the neighbors of the nodes in the Tanner graph representation of such a code to limit the area of the code, or by limiting the number of iterations). Most analyses of LDPC codes assume a random Tanner graph. If the interconnections of the LDPC code are restricted to limit the area of the implementation (and thus violating the assumptions of most LDPC code analyses) will the decoder still have a good performance? The work of [11] suggests there is some kind of trade-off.

### A Proof of Theorem 3

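To accomplish this super-linear lower bound, we divide how the number of output pins $j$ scales with $n$, the block length, into cases. We suppose that $j \ge \sqrt{\log k}$. If not, using our result from Theorem 2, for codes with asymptotic block error probability less than $\frac{1}{2}$,

$$E_{\text{dec}} \ge \lambda_w^2 \frac{R^2 n}{\log\left(\frac{1}{\epsilon}\right)} \left(\frac{\log n}{j} - 1\right),$$

and if $j < \sqrt{\log k}$ then we can show that

$$A\tau \ge \lambda_w^2 \frac{R^2 n}{\log\left(\frac{1}{\epsilon}\right)} \left(\sqrt{\log k} - 1\right) = \Omega\left(n\sqrt{\log n}\right) \ge \Omega\left(n (\log n)^{\frac{1}{5}}\right)$$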

and we are done.

###### Remark 5.

Technically, the statement that either or does not fully specify all possible sequences of output pins. However, for any sequence, we can divide the sequences into separate subsequences, specifically the sequences of codes in which and in which . For each of those subsequences we can prove our lower bound.

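Let $\bar{A}$ be the normalized circuit area. Suppose also that $\frac{\bar{A}}{j} \le \log^{0.9} n$. Otherwise, if $\frac{\bar{A}}{j} > \log^{0.9} n$, from our simple bound in Remark 4 we can see that

$$\bar{A}\tau \ge \bar{A} \frac{k}{j} > k \log^{0.9} n \ge \Omega\left(n (\log n)^{\frac{1}{5}}\right)$$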

and we are done.

Note again that, just as in Remark 5, if the area alternates between with increasing block length, we can simply divide the sequence of decoders into two subsequences and prove that the necessary scaling law holds for each subsequence.

Hence, we consider the case that we have a sequence of serial decoding algorithms in which the area of the circuit grows with the block length and the number of output nodes on the circuit grows with . We consider the case in which

$$\frac{\bar{A}}{j} \le \log^{0.9} n \tag{10}$$

and

$$j \ge \sqrt{\log k} \tag{11}$$

We will now choose a way to divide the computation into epochs and subcircuits.

For each of the epochs we want the number of bits that each subcircuit is responsible for decoding, on average, to be four times its area. This will mean that, even if we optimistically assume that before the beginning of each epoch a circuit had already computed the future outputs, a typical subcircuit can only store a fraction of the bits it is responsible for decoding in the next epoch. Note that the number of bits that a subcircuit is responsible for in total over the entire computation must be, on average, $\frac{k}{N}$ and hence, if the computation is to be divided into $M$ epochs, during each epoch, an average subcircuit must be responsible for decoding $\frac{k}{MN}$ bits. We seek to choose an $M$ such that

$$\frac{k}{MN} \ge 4 \bar{A}_{\text{subckt,avg}}$$

where $\bar{A}_{\text{subckt,avg}} = \frac{\bar{A}}{N}$ is the average normalized area of a subcircuit. This will be true if $\frac{k}{MN} \ge \frac{4\bar{A}}{N}$, or equivalently if $M \le \frac{k}{4\bar{A}}$, so we choose

$$M = \frac{k}{4\bar{A}}$$

We also want for a constant which we will choose later, so we choose

$$N = \frac{4c\bar{A}n}{k \log n} = \frac{4c\bar{A}}{R \log n}.$$

We need to show that this is a valid choice for $N$. The restriction on the choice of $N$ is that $N \le j$ (we can’t subdivide the circuit into more subcircuits than there are output pins). By applying the assumption on the scaling of the area of the circuit in (10) we can see that

$$\frac{4c\bar{A}}{R \log n} \le \frac{4cj \log^{0.9} n}{R \log n} = j \left(\frac{4c}{R}\right) \frac{\log^{0.9} n}{\log n}$$

is asymptotically less than $j$, and hence this choice of $N$ is valid. Our choice of $M$ is $\frac{k}{4\bar{A}}$. The restriction on the choice of $M$ is that $M \le \frac{k}{j}$ (there must be at least one output per pin per epoch). Thus we need $\frac{k}{4\bar{A}} \le \frac{k}{j}$, which will be true when $j \le 4\bar{A}$. But since the output pins form part of the area of the circuit, this must always be satisfied.

On a minor technical note, we can only choose integer values of . Hence, we can decide to choose the floor of . But, as argued in Lemma 3 if the function for choosing grows with then the evaluation of a limit where we neglect this floor function is the same. So, our other requirement, that means that our choice for grows as increases. We consider the case when area of the computation remains proportional to at the end of this section.

Let the number of bits injected into the th subcircuit during the th epoch from other parts of the circuit be . Now, the average subcircuit has area , so we consider the set of subcircuits that have area less than . This number must be at least the subcircuits, otherwise the total area of all these subcircuits would exceed the total circuit area. Denote the set of indices of subcircuits with area less than as . Note that .

Consider a specific th epoch. Suppose that the total number of bits injected into the subcircuits indexed by after -stages of nested bisections during this epoch is less than . If this is true, then there must be at least half of the subcircuits in that have fewer than bits injected into them. Otherwise, the total number of bits injected into these subcircuits is at least , which we assumed is not the case.

Thus, with our assumptions, at the th epoch, either the total number of bits injected across -stages of nested minimum bisections is at least , or there are at least subcircuits with area less than that have less than bits injected into them. Denote the set of indices of these low area subcircuits with a low number of bits injected into them during the th epoch as . The size of