“Information-Friction” and its implications on minimum energy required for communication

Footnote 1: This paper was presented in part at the IEEE International Symposium on Information Theory (ISIT) 2013, Istanbul, Turkey.

Pulkit Grover, ECE, Carnegie Mellon University
Email: pulkit@cmu.edu
Abstract

Just as there are frictional losses associated with moving masses on a surface, what if there were frictional losses associated with moving information on a substrate? Indeed, many modes of communication suffer from such frictional losses. We propose to model these losses as proportional to “bit-meters,” i.e., the product of the mass of information (i.e., the number of bits) and the distance of information transport. We use this “information-friction” model to understand fundamental energy requirements of encoding and decoding in communication circuitry. First, for communication across a binary-input AWGN channel, we arrive at fundamental limits on bit-meters (and thus on energy consumption) for decoding implementations that have a predetermined, input-independent length of messages. For encoding, we relax the fixed-length assumption and derive bounds for flexible-message-length implementations. Using these lower bounds, we show that the total (transmit + encoding + decoding) energy per bit must diverge to infinity as the target error probability is lowered to zero. Further, the closer the communication rate is maintained to the channel capacity (as the target error probability is lowered to zero), the faster the required decoding energy diverges to infinity.

I Introduction

Fig. 1: A Newtonian inspiration for the information-friction model. Energy is measured in “bit-meters,” the product of the number of bits of information and the Euclidean distance over which that information travels in the computation.

Just as there are frictional losses associated with moving masses on a surface, there can be frictional losses associated with moving information between gates (see Fig. 1) on a computational substrate. Within the context of communication, these frictional losses can be a significant part of the energy consumed in computations at the transmitter and the receiver (e.g., encoding and decoding an error-correcting code), which in turn can be a significant fraction of total energy for short-distance communication [1].

What computational models allow us to account for these frictional losses? Communication complexity, introduced by Andrew Yao in [2], accounts for information-movement on a computational substrate by counting the number of bits that need to be moved. However, for many implementations [3] (as discussed in Section IV), the energy of computation depends not only on the number of bits, but also on the distance (Euclidean, i.e., ℓ2, or “Manhattan” [4], i.e., ℓ1) over which those bits are moved. Are there models that account for these distances as well?

The VLSI model, introduced by Thompson and others in [5, 6, 7, 8, 9, 10] (and explored further in [11, 12, 13, 14, 15]), accounts for these distances by measuring the total wiring infrastructure required to compute a function. The product of the total wiring length and the number of clock-cycles needed, suitably scaled, is used as an approximation for energy consumed in computing. The required wiring infrastructure, as well as energy, are explored through upper and lower bounds (e.g. [6, Ch. 3 and Ch. 4]).

The focus on wires also limits the VLSI model in many ways. First, modern technology is exploring and using alternative interconnects (e.g., optical, carbon nanotubes, or even wireless [16]), and our nervous system uses axons and dendrites; none of these are made of metal wires, and some can even evolve (if slowly) as the computation proceeds (e.g., synapses in the brain and wireless interconnects) [17]. Second, modeling computational nodes as having a small degree of connectivity, as is the case in the VLSI model [6], can be too limiting. Third, even for metal interconnects, the VLSI model focuses more on the wiring infrastructure needed to move information than on the amount and the distance of information actually moved in the computation. This can overestimate the energy requirements: for instance, not all wires need to be charged and discharged in each clock-cycle, but the model estimates energy consumption based on this assumption (footnote 2: Thompson does acknowledge this shortcoming in his thesis [6].). Finally, the lengths of messages passed on wires can differ depending on the input of the computation, and thus energy costs can be input-dependent. This energy difference is not accounted for in Thompson’s model.

In Section II-B, we introduce the “information-friction” model of computation and energy consumption (see Fig. 1) that partially addresses these limitations of the VLSI-inspired models. Besides overcoming the limitations listed above, the model is also appealing because of its conceptual simplicity and fewer assumptions in comparison with the VLSI model. The information-friction model accounts for the cost of computing by counting the “bit-meters”: the product of the number of bits and the distance over which these bits are moved (summed over all computation links). A similar “bit-meters” metric was used as a measure of the “transport capacity” supported by a communication network in the work of Gupta and Kumar [18]. Here, we are interested in the opposite question: how many bit-meters are needed to support a computation?

When are “bit-meters” an appropriate metric for circuit communication energy? The issue is discussed in depth in Section IV, where we argue that for many realistic models of computation (including computation on VLSI circuits), the energy consumed in the links of the computational network is well approximated by (or is lower bounded by) the product of a constant, called the coefficient of information-friction, and the bit-meters required for the computation. Despite its intuitive appeal and applications, the metric has its shortcomings and limitations, which are also discussed in Section IV.
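As a concrete illustration of the metric, the following minimal Python sketch computes the bit-meters of a toy computation and the corresponding frictional energy; the node coordinates, link structure, bit counts, and friction coefficient are all hypothetical, chosen only for illustration.

import math

# Hypothetical circuit: node coordinates on the substrate, in meters.
nodes = {
    "in0": (0.0, 0.0),
    "in1": (0.0, 1e-3),
    "helper": (1e-3, 0.5e-3),
    "out0": (2e-3, 0.5e-3),
}

# Links used by the computation: (transmitting node, receiving node, total bits carried).
links = [
    ("in0", "helper", 1),
    ("in1", "helper", 1),
    ("helper", "out0", 2),
]

def bit_meters(nodes, links):
    """Sum over links of (bits carried) x (Euclidean length of the link)."""
    total = 0.0
    for tx, rx, bits in links:
        (x1, y1), (x2, y2) = nodes[tx], nodes[rx]
        total += bits * math.hypot(x2 - x1, y2 - y1)
    return total

mu = 1e-12  # hypothetical coefficient of information-friction, in joules per bit-meter
bm = bit_meters(nodes, links)
print(bm, mu * bm)  # bit-meters expended, and the corresponding frictional energy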

In Section III, we use the implementation model and an AWGN-based hard-decision channel model to derive the bit-meter cost of decoding an error-correcting code. Intellectually, our work builds on the work of El Gamal, Greene, and Pang [19], which uses the VLSI model to estimate the complexity (but not the energy) of encoding and decoding an error-correcting code. This work also builds on our own work [3], where we derive tradeoffs between wiring area and number of clock-cycles within Thompson’s VLSI model. In this paper, we derive a lower bound on the bit-meters required for decoding that grows as the block-error probability P_e is lowered and depends on the transmit power P_T (for a binary-input AWGN channel where the receiver makes a hard decision on the channel output before decoding; see Section II-A). We show a similar result for encoding under a stronger model of circuit implementation, in which the scheduling of messages along the communication links is not predetermined but can adapt itself to the input of the computation. Taking a step further, we also establish that if the communication rate is maintained close to the channel capacity even as the (block) error probability is driven to zero, the required per-bit energy grows with the blocklength n of the code and with the cross-over probability p of the Binary Symmetric Channel (BSC) over which the signal is being communicated. As is well known, the blocklength must diverge (at a speed that depends on p) as the rate is brought closer to the channel capacity, and thus the required per-bit energy diverges to infinity faster as the rate and the channel capacity are brought close to each other.

What are the implications of these results on total (transmit + computation) energy consumption in communication? Under the information-friction model, optimizing over the transmit power, we obtain a lower bound on the total (transmit + decoding) energy per bit that diverges as the error probability is lowered. This means that for any implementation that experiences information-frictional losses, the total energy per bit must diverge to infinity as the error probability is driven to zero. Further, operating with bounded transmit power (e.g., by operating close to the Shannon limit) appears (footnote 3: In the absence of good upper bounds (which are a work in progress), we are left with comparing the lower bounds on energy consumed by the two strategies, which can only offer suggestions on which strategy is more energy-efficient.) to incur larger costs: the lower bound on the total energy per bit diverges even faster in this regime.

Our results on information-frictional energy for encoding and decoding, and total energy for communication, attempt to begin to fill a void in our understanding of the energy required for communication. In a paper that is little known within the information-theory community [20], Landauer argues that one can communicate with arbitrarily small energy, paralleling his results on zero-energy reversible computation [21]. In order to do so, however, Landauer observes that one needs to lower friction and noise in the communication medium to effectively zero (footnote 4: Of course, from an engineering viewpoint, it makes little sense to think about energy of computing assuming friction and noise are (or can be made) negligible. However, Landauer’s main goal was not to provide practically relevant limits to energy of computing (as he himself acknowledges in [20]), but instead to understand and resolve the paradox of Maxwell’s demon [22]. This fictional demon is able to lower the thermodynamic entropy of a system seemingly without expending any energy, a violation of the Second Law of Thermodynamics, which would mean (among other “calamitous” conclusions) that perpetual motion machines can exist. A fundamental limit on energy required for communication with arbitrarily small friction and noise would resolve the paradox (because measurement can be viewed as communication of information from the source to the measuring device). Landauer’s contention in [20] is that no such limit can exist and thus the paradox cannot be resolved by alluding to energy costs of communication. Instead, it is losses in erasing information that (according to Landauer) resolve the paradox. We refer the interested reader to [23, 24, 25, 26, 27, 28, 29] for contemporary work on energy of communication and computing within the context of theoretical physics, and discussions on whether Landauer’s principle indeed resolves the paradox.), which however requires lowering the speed of computing (asymptotically) to zero to keep the system in thermodynamic equilibrium. From this perspective, the information-theoretic works of Golay [30] and Verdú [31] derive capacity per unit energy for various communication media (i.e., channels) that do have friction and noise, but implicitly assume that computation at the transmitter and receiver is frictionless and noiseless (and hence free). In this paper, we take a step forward by allowing frictional losses in both communication and computation media and derive lower bounds on energy, whilst still ignoring noise in computation for simplicity.

II System model and notation

II-A Channel model

We consider a point-to-point communication link. An information sequence of k fair coin flips is encoded into binary-alphabet codewords of length n. The rate of the code is therefore k/n bits/channel use, which is assumed to be fixed. The codeword is modulated using BPSK modulation and sent through an Additive White Gaussian Noise (AWGN) channel of bandwidth W, which supports 2W channel uses per second. The decoder estimates the input sequence by first performing a hard decision on the received channel symbols and then using these (binary) hard decisions to decode the input sequence. The overall channel is therefore a Binary Symmetric Channel (BSC) with raw bit-error probability p = Q(√(P_rx)/σ), where Q(·) is the Gaussian tail function, P_rx is the received signal power (the transmit power P_T of the BPSK-modulated signal attenuated by the path loss ζ of the channel), and σ² is the variance of the Gaussian noise against which the hard decision is made. The encoder-channel-decoder system operates at an average block-error probability denoted by P_e.
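A minimal numerical sketch of the hard-decision step follows; it assumes a received BPSK amplitude of sqrt(P_rx) and noise standard deviation sigma, so that the crossover probability is Q(sqrt(P_rx)/sigma). This normalization, the function names, and the numbers are assumptions of the sketch, not taken from the paper.

import math

def q_function(x):
    """Gaussian tail probability Q(x) = P(Z > x) for a standard normal Z."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def hard_decision_crossover(p_received, sigma2):
    """Raw bit-error probability of the induced BSC, assuming BPSK with received
    amplitude sqrt(p_received) and Gaussian noise of variance sigma2."""
    return q_function(math.sqrt(p_received / sigma2))

# Illustrative numbers: received power 0.1 (after path loss), noise variance 0.05.
print(hard_decision_crossover(p_received=0.1, sigma2=0.05))  # about Q(1.41), roughly 0.079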

Definition 1 (Channel Model)

Channel Model denotes (as described above) the BSC(p) channel that results from hard decision at the receiver across an AWGN channel with average transmit power P_T, path loss ζ, and noise variance σ².

II-B Implementation, computation, and energy models

The computation is performed using a “circuit” on a “substrate.” This section formally defines these terms allowing for decoding analysis in Section III.

Definition 2 (Substrate)

A Substrate is a square of a given side length in the plane, with one vertex at the origin and sides along the coordinate axes.

Definition 3 (Square lattice)

A square lattice of a given spacing is the collection of points whose coordinates are both integer multiples of that spacing.

Definition 4 (Lattice points of the substrate)

The lattice points of the substrate are the intersection of the square lattice with the substrate, that is, the set of points of the square lattice that lie in the substrate.

The lattice spacing determines how close computational nodes in the circuit can be brought to each other, and depends on the technology of implementation. For large circuits, the spacing is much smaller than the side length of the substrate.

Definition 5 (Circuit, computational nodes)

The substrate, together with a collection of points from the lattice points inside it (called computational nodes, or simply nodes), is called a Circuit.

For instance, the substrate along with any finite set of its lattice points constitutes a Circuit.

Nodes can be input nodes, output nodes, or helper nodes. Physically, the nodes help perform the computation by computing functions of received messages. Each node is accompanied by a finite storage memory. Input nodes store the input of computation (one bit each; at the beginning of computation), output nodes store the output (one bit each; at the end of computation), and helper nodes help perform the computation.

Definition 6 (Subcircuit)

A subcircuit of a circuit is constituted by an open and convex subset of the substrate and by the subset of computational nodes that lie within that subset.

That is, all the computational nodes within the sub-substrate must belong to the subcircuit.

Definition 7 (Link)

A (unidirectional) link connects two nodes in that it allows for noiseless communication between the nodes in one direction. The messages are binary strings. Each message is a function of all the messages (and the input bits, if any) received at the transmitting node until the start of the message transmission.

In a circuit, there is a unidirectional link for every ordered pair of nodes, and each link can be used more than once during a computation.

Definition 8 (Communication on a circuit)

Computational nodes use messages received thus far in computation, and stored memory values, to generate messages that can be communicated to other nodes over links.

We now introduce two models of computation: those with fixed and flexible-length messages. For both, the order of messages passed between computational nodes is pre-determined, but for a flexible-message-length computation, the length of a message can depend on the computation input.

Definition 9 (Fixed-message-length computation (on a circuit))

The computation starts with the arrival of the input of computation at the input nodes. Each input node stores one bit of the input. The computation then proceeds with communication of messages of predetermined size, i.e., the messages’ size does not depend on the input of computation. Each message is a function of the messages that the transmitting computational node has received thus far in the computation (including one bit of the input if the transmitting node is an input node). At the end of the computation, the output is available in the memories of the output nodes.

Definition 10 (Flexible-message-length computation (on a circuit))

The computation is said to be flexible-message-length computation if the number of bits in a message on a link in the computation can depend on the input of computation. Nevertheless, the minimum message-length is assumed to be at least one bit.

A computation may use some or all of the communication links in the circuit. Each link can be used as many times as needed, and at each use, the message can be of any chosen size with the associated costs as described in the following definitions.

Definition 11 (Bit-meter cost of a link and of a circuit)

The bit-meter cost of a link in a computation on a circuit is the product of the total number of bits carried by the messages on the link and the Euclidean distance between the nodes at the ends of the link. The bit-meters for the entire circuit are the sum of the bit-meters for all the links in the circuit.

Fixing the order of messages (but not necessarily their lengths), along with making the minimum message size one bit, makes sure that there is no free-of-cost “silence” [32] that can be used for communicating messages between nodes. Since each message on a link contains at least one bit, and each link is at least the minimum node spacing in length, each message costs at least the minimum node spacing in bit-meters.

When a flexible-message-length computation is executed, the bit-meters expended can depend on the input of the computation. In such cases, we will often be interested in the average bit-meters for a link or a computation, where the average is taken over the possible input realizations (with a specified distribution).
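For concreteness, here is a minimal sketch of averaging bit-meters over input realizations for a single link whose message length depends on the input; the link length, the input distribution, and the length rule are hypothetical.

import itertools

d = 1e-3  # hypothetical link length in meters

def message_length(x):
    """Hypothetical rule: one mandatory bit plus one extra bit per '1' in the 3-bit input."""
    return 1 + sum(x)

# Average bit-meters on this link, averaged over a uniform distribution on 3-bit inputs.
avg_bits = sum(message_length(x) for x in itertools.product((0, 1), repeat=3)) / 8.0
print("average bit-meters on this link:", avg_bits * d)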

Definition 12 (Bit-meters for a link within a subcircuit)

For a link that connects two nodes within a subcircuit in a computation, the bit-meters for that link within the subcircuit are the same as the bit-meters for the link in the original circuit. However, if only one of the nodes lies within the subcircuit, then the bit-meters for this link within the subcircuit are the product of the number of bits of the message passed along this link and the length of the link from the node inside the subcircuit to the boundary of the subcircuit.

Definition 13 (Bit-meters for a subcircuit)

The bit-meters for a subcircuit in a computation are the sum of the bit-meters for all the links that lie (wholly or partially, as defined in Definition 12) within the subcircuit.

The definition also applies to the bit-meters for the entire circuit.

Definition 14 (Coefficient of information-friction)

The coefficient of information-friction characterizes the energy required for computation in our model. This energy is given by the product of the coefficient and the number of bit-meters expended in executing the given computation on a circuit.

Definition 15 (Implementation Model)

Implementation Model denotes the implementation model described in this section, parameterized by the minimum distance between computational nodes and the coefficient of information-friction.

The same implementation model can be used to execute a fixed or flexible-message-length computation.

III Lower bounds on bit-meters and information-friction energy of encoding and decoding

To obtain lower bounds on bit-meters for encoding and decoding, similarly to the analysis in [19, 3, 33], we need to cut the circuit under consideration into many disjoint subcircuits. The following definitions and lemmas set up the technical background needed for circuit-cutting and the ensuing analysis.

Definition 16 (Disjoint subcircuits)

Two subcircuits of a circuit are said to be disjoint subcircuits if their sub-substrates have an empty intersection. Similarly, a collection of subcircuits is said to be mutually disjoint if every pair of subcircuits in the collection is disjoint.

It follows that any two disjoint subcircuits cannot share computational nodes or communication links that connect two nodes within one of the subcircuits. In fact, two disjoint subcircuits do not share bit-meters of computation:

Lemma 1

Let a collection of mutually disjoint subcircuits of the circuit be given. Then, for any computation, the sum of the bit-meters for these subcircuits is at most the bit-meters for the entire circuit:

(1)
Proof:

The lemma follows from the observation that, by Definition 11, no bit-meters are double-counted across disjoint subcircuits. We note that there are situations for which (1) is not satisfied with equality: this happens when a long link in the circuit has a part that does not lie within either of the subcircuits containing the two nodes at its ends. \qed

The decoder circuit is partitioned into multiple subcircuits via a “Stencil” (footnote 5: We use the term “Stencil” in analogy with the classic stencil instrument used to produce letters or designs on an underlying surface. A stencil can be slid on the surface to produce the design at any location on the surface, effectively shifting the origin-point of the design. In this case, a pattern of inner and outer squares is produced on the computational substrate.) that can be “moved” over the circuit by changing its origin.

Definition 17 (Stencil)

A Stencil is a pattern of equally spaced “inner” squares that are concentric with “outer” squares, which form a grid (as shown in Fig. 2). Each outer square has a given side length, the origin lies at the center of an “inner” square, and each inner square has a smaller, given side length.

A node in a circuit is said to be covered by a Stencil that is overlaid on the circuit substrate if it lies inside an inner square of the Stencil. For the decoder, the input nodes store the channel observations, and the output nodes, also called “bit-nodes,” store the decoded message bits. At the encoder, the information bits that are the input of the computation are assumed to be stored in bit-nodes. Inside the i-th subcircuit, let k_i denote the number of bit-nodes that lie inside the inner square, and let n_i denote the number of input nodes that lie inside the outer square (i.e., anywhere inside the i-th subcircuit).

Definition 18 (Stencil-partition)

The outer squares of a Stencil induce a partition (see Fig. 2) of a circuit into subcircuits, each occupying a substrate area at most that of one outer square. If any computational node lies on the boundary of an outer square, then it is arbitrarily included in one of the subcircuits.

Fig. 2: A Stencil overlaid on the Substrate. Also shown are the computational nodes of the Circuit on the Substrate. A zoomed-in version shows the dimensions of the Stencil. As an example, the zoomed-in square illustrates particular values of k_i and n_i.
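As a small illustration of coverage by a Stencil, the following sketch tests whether a planar point lies inside an inner square, assuming that inner squares repeat with period outer_side along both axes and that the origin sits at the center of one inner square; the function name and the dimensions are hypothetical.

def covered_by_stencil(x, y, origin, outer_side, inner_side):
    """True if the point (x, y) lies inside an inner square of the Stencil (assumed layout:
    inner squares repeat with period outer_side, origin at the center of an inner square)."""
    ox, oy = origin
    # Position relative to the nearest inner-square center.
    dx = (x - ox + outer_side / 2) % outer_side - outer_side / 2
    dy = (y - oy + outer_side / 2) % outer_side - outer_side / 2
    return abs(dx) <= inner_side / 2 and abs(dy) <= inner_side / 2

print(covered_by_stencil(0.15, 0.10, origin=(0, 0), outer_side=1.0, inner_side=0.4))  # True
print(covered_by_stencil(0.45, 0.10, origin=(0, 0), outer_side=1.0, inner_side=0.4))  # False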

The next lemma shows that by moving the Stencil over the substrate, we can find at least one position of the Stencil such that at least the average number of bit-nodes (averaged over random locations of the Stencil) is covered.

Lemma 2

For any circuit implemented in the Implementation Model, and for any Stencil, there exists an origin of the Stencil such that the number of bit-nodes covered by the Stencil is lower bounded by the total number of bit-nodes multiplied by the fraction of the substrate area covered by the inner squares:

(2)
Proof:

The proof uses the probabilistic method [34]. Let the origin of the Stencil be chosen uniformly at random over one period of the Stencil pattern (i.e., over one outer square). Now, the average number of bit-nodes covered by the Stencil (averaged over this random origin) is:

(3)
(4)

where the key step follows from the observation that, for any fixed point, as the origin is moved around uniformly, the probability measure of the set of origins for which the point is covered by the Stencil equals the fraction of the area covered by the inner squares of the Stencil, namely the squared ratio of the inner and outer side lengths. Thus, there exists at least one value of the origin such that the number of bit-nodes covered is no smaller than the average. \qed
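A quick Monte Carlo sanity check of this averaging argument is sketched below with hypothetical dimensions: for an inner-to-outer side ratio of 0.4, the covered-area fraction is 0.16, so the average number of covered nodes should be close to 0.16 times the number of nodes, and some origin must do at least as well as the average.

import random

random.seed(0)
outer, inner = 1.0, 0.4  # hypothetical Stencil dimensions (outer and inner square sides)
bit_nodes = [(random.uniform(0, 10), random.uniform(0, 10)) for _ in range(200)]

def covered(x, y, ox, oy):
    dx = (x - ox + outer / 2) % outer - outer / 2
    dy = (y - oy + outer / 2) % outer - outer / 2
    return abs(dx) <= inner / 2 and abs(dy) <= inner / 2

counts = []
for _ in range(2000):
    ox, oy = random.uniform(0, outer), random.uniform(0, outer)  # random Stencil origin
    counts.append(sum(covered(x, y, ox, oy) for x, y in bit_nodes))

print(sum(counts) / len(counts), max(counts), len(bit_nodes) * (inner / outer) ** 2)
# The average is close to 200 * 0.16 = 32, and the best origin covers at least that many.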

Consider the Stencil shown in Fig. 2. The distance between an inner square and the corresponding outer square is half the difference of their side lengths. A set of bits is said to be communicated from the “transmitting” part of the circuit to the “receiving” part if the values stored in the receiving part are independent of these bits prior to communication, and the bits can be recovered (in an error-free manner) from the messages received at the receiving part during the process of communication. Notice that this definition is looser than the traditional understanding of communication: we do not stipulate that the stored values at the receiving part post-communication be able to recover the bits.

If bits are communicated from outside an outer square to inside an inner square in a subcircuit, then, intuitively, the bit-meters associated with the subcircuit should be at least the number of bits communicated times the distance between the two squares. The following lemma shows this rigorously:

Lemma 3 (Bit-meters and average bit-meters in computations)

Consider a circuit implemented in the Implementation Model, and any subcircuit obtained using the Stencil-partition defined in Definition 18. For communicating a given number of bits of information from outside an outer square to inside the corresponding inner square, the bit-meters for the subcircuit are at least the number of bits times the distance between the two squares for fixed-length messages. Further, even allowing for flexible message lengths, the average bit-meters satisfy the same lower bound. Similarly, for communicating bits from inside an inner square to outside the corresponding outer square, the average bit-meters are lower bounded in the same way.

Proof:
Fig. 3: Square cuts are made in order to use the cut-set bounding technique. The directed edges show the links along which information flows in the computation. However, the links do not indicate the relative order of information flow during the computation, or the amount of information they carry.

Fixed-length messages: Consider concentric square-shaped cuts on the subcircuit network, starting with the outer square as a cut, with consecutive cuts separated by the minimum node spacing, as shown in Fig. 3. The cuts end when the remaining distance to the inner square is smaller than this spacing. The inner square is then included as the final cut. Except for the inner square, any link that carries a bit across a cut must traverse at least the minimum node spacing.

Further, if the number of bits across any cut, i.e., the sum of the bits passed over all links crossing the cut, is smaller than the number of bits to be communicated, then those bits cannot be delivered to the inner square. Thus, across each cut, the total number of bits must be at least the number of bits to be communicated. Summing over the cuts, the total distance over which at least this many bits need to travel is at least the distance between the inner square and the outer square. Thus, for fixed-message-length computation, the bit-meters are at least the number of bits times this distance.

Flexible-message-length: A flexible message length allows for the use of variable-length messages on circuit links that can depend on the input of the computation. Nevertheless, representing a given number of fair information bits using variable-length coding still requires (footnote 6: We remind the reader that “silence” cannot be used for communication because each message has at least one bit (see Definition 10).) at least that many bits on average [35, Pg. 110]. The cut-set argument above therefore applies to the average bit-meters as well. \qed

III-A Decoding lower bounds: fixed-length messages

Lemma 4

If at most R bits of information are available to obtain an estimate of a variable that is distributed uniformly on a set of size 2^k, with k a positive integer, then the error probability of the estimate is lower bounded by a quantity, derived below via Fano's inequality, that depends only on R and k.

Proof:

Applying Fano’s inequality [35, Pg. 39] to the reconstruction of the message, given available information of at most R bits, the error probability is lower bounded by

(5)

where the binary entropy function appears on the LHS. We now consider two cases:

Case 1: In this regime, (5) simplifies to

(6)

Using a standard bound on the binary entropy function (see, e.g., [36]), it follows from (6) that, in this regime,

(7)

Case 2: In the complementary regime, a looser form of (5) gives the claimed bound.

\qed
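As a numerical illustration of the Fano-style argument, the following sketch finds the smallest error probability consistent with Fano's inequality when a message uniform on a set of size 2^k must be estimated from at most R bits; the values of k and R are illustrative, and the bound evaluated here is the generic Fano bound rather than the refined constants of Lemma 4.

import math

def h2(p):
    """Binary entropy in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def fano_error_lower_bound(k, R, grid=100000):
    """Smallest error probability consistent with Fano's inequality,
    h2(Pe) + Pe * log2(2**k - 1) >= k - R."""
    need = k - R
    for i in range(grid + 1):
        p = i / grid
        if h2(p) + p * math.log2(2 ** k - 1) >= need:
            return p
    return 1.0

print(fano_error_lower_bound(k=20, R=10))  # roughly 0.45 for these illustrative values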

We can now connect information flow in decoding subcircuits to error probability. The following lemma provides a lower bound on the error probability when the number of bit-meters in a subcircuit of the decoder implementation is sufficiently small.

Lemma 5

For any decoder subcircuit obtained via Stencil-partitioning of the Implementation Model, if the bit-meters for the subcircuit are sufficiently small, then the block-error probability is lower bounded by a strictly positive quantity, quantified in the proof below.

Proof:

From Lemma 3, since the number of bit-meters for the subcircuit is smaller than the assumed threshold, and given the distance (in meters) between the outer square and the inner square, only a limited number of bits of information can be communicated from outside the outer square to inside the inner square.

We first observe that a BSC(p) is a stochastically degraded version of a BEC(2p). That is, a decoder that receives channel outputs that pass through a BEC(2p) can simulate a BSC(p) channel by randomly assigning the value 0 or 1 (with equal probability) to an erased bit, i.e., without any increase in the block-error probability. Supplying the decoder with outputs of the erasure channel, we examine the event, denoted by E, that all the channel outputs inside the outer square are erased. This event has probability (2p)^{n_i}.
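A small simulation of the degradation argument (with an illustrative crossover probability): passing a bit through a BEC with erasure probability 2p and resolving each erasure with a fair coin flip reproduces a BSC with crossover probability p.

import random

random.seed(1)
p = 0.1            # BSC crossover probability (illustrative)
trials = 200000
errors = 0
for _ in range(trials):
    bit = random.randint(0, 1)
    if random.random() < 2 * p:      # erased by the BEC(2p) ...
        out = random.randint(0, 1)   # ... and the erasure is resolved by a fair coin flip
    else:
        out = bit
    errors += (out != bit)
print(errors / trials)               # close to p = 0.1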

Conditioning on the erasure event E, consider the (block) probability of not recovering all of the k_i bits inside the i-th inner square. From Fano’s inequality [35, Pg. 39], applied to reconstructing these message bits given communicated information of bounded entropy,

(8)

Thus, the (unconditional) error probability for recovering these bits correctly is lower bounded by the product of the probability of E and the conditional error probability above. Since the block-error probability for the entire code is at least the block-error probability in recovering the bits in the i-th subcircuit, we obtain the lemma. \qed

Lemma 6

For the Implementation Model, for a Stencil-partition with outer squares of a given side length, the maximum number of computational nodes (input, output, or helper) in a subcircuit is upper bounded by

(9)

Further, if the outer-square side is sufficiently large relative to the node spacing,

(10)
Proof:

The number of nodes in a Stencil cell is approximately the square of the ratio of the outer-square side length to the node spacing. The actual number could, however, be larger because of boundary effects: on each axis, allowing for one extra node to be included from either side of the square, the per-axis count is (loosely) upper bounded by this ratio plus two, which yields (9) upon squaring. Also note that the bound in (10) exceeds the bound in (9) when the outer-square side is sufficiently large compared to the node spacing, which establishes the second claim. \qed
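A small numerical check of the boundary-effect argument, with hypothetical spacing and side length: along one axis, a window of a given length can contain at most (length/spacing) + 1 lattice points wherever it is placed, so a Stencil cell contains at most roughly the square of that; the precise constant in (9) may differ.

import random

random.seed(2)
spacing, side = 0.3, 2.0   # hypothetical node spacing and outer-square side length
worst = 0
for _ in range(10000):
    offset = random.uniform(0, 10)  # arbitrary placement of the window along one axis
    count = sum(1 for i in range(200) if offset <= i * spacing <= offset + side)
    worst = max(worst, count)
print(worst, side / spacing + 1)    # the per-axis count never exceeds side/spacing + 1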

Theorem 1

For an error-correcting code transmitted over a channel with the Channel Model and decoded in a decoder circuit implemented in the Implementation Model with a fixed-message-length implementation that achieves a block-error probability P_e, the decoder bit-meters are lower bounded as:

(11)
as long as the following condition holds: (12)

Remark: When condition (12) is violated in the asymptopia of vanishing error probability, i.e., when

(13)

the transmit power needs to scale at least as fast as the expression derived in (15)-(16) below. To see this, we use a known lower bound [37] on the Q-function:

(14)

Thus,

(15)
(16)

where the inequality follows from an elementary bound that can be verified by simply plotting the two sides. Further, if the condition in (13) is not satisfied, then the relevant quantities remain bounded, which means that (12) is not violated in the limit of vanishing error probability. From the above and (13), the claimed scaling of the transmit power follows. This lower bound, which is derived for the case when condition (12) is not satisfied, is larger than our lower bounds on total power when condition (12) is satisfied (Section III-C).
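As an illustration of the kind of inequality being used, the following sketch numerically checks a standard lower bound on the Q-function, Q(x) >= (x/(1+x^2)) * exp(-x^2/2)/sqrt(2*pi); this particular bound is an assumption of the sketch and not necessarily the exact bound from [37].

import math

def q_function(x):
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def q_lower_bound(x):
    """A standard lower bound on Q(x) for x > 0."""
    return (x / (1 + x * x)) * math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

for x in (0.5, 1.0, 2.0, 4.0):
    print(x, q_function(x), q_lower_bound(x), q_function(x) >= q_lower_bound(x))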

Proof:

The outer squares of the Stencil partition the circuit into subcircuits. Let the i-th subcircuit have n_i channel-output nodes available within the outer square and k_i bit-nodes inside the inner square. Using Lemma 2, we choose the origin of the Stencil so that at least the guaranteed fraction of the bit-nodes is covered by the inner squares, i.e.,

(17)

From Lemma 6, choosing the Stencil parameter appropriately, under condition (12) the number of computational nodes in each subcircuit is bounded; in particular, the number of channel-output nodes n_i in each subcircuit is bounded accordingly.

From Lemma 5, if the bit-meters for any subcircuit are smaller than the threshold of Lemma 5, then the error probability is lower bounded as

(18)

which is a contradiction. Thus, the bit-meters for each decoding subcircuit obtained via the Stencil-partition must exceed the threshold of Lemma 5. From Lemma 2, the number of bit-nodes covered by the inner squares is lower bounded; therefore, using Lemma 1, the total decoder bit-meters are lower bounded by the sum of the subcircuit bit-meters. Choosing the Stencil parameters appropriately yields the theorem. \qed

III-B Encoding lower bounds: fixed and flexible-message-length

Theorem 2

For an error-correcting code encoded in a circuit that is implemented in the Implementation Model and transmitted over a channel with the Channel Model with block-error probability P_e, the encoder average bit-meters are lower bounded as:

(19)
as long as the following condition holds: (20)

for both fixed and flexible-message-length encoding.

Proof:

We directly show the result for flexible-message-length implementations, which subsume fixed-message-length implementations. At the encoder, k input information bits are mapped to n codeword output bits.

Fig. 4: The figure illustrates the definitions of random variables corresponding to bit-nodes and output (codeword) nodes at the encoder. The channel-output values are the counterparts of the codeword symbols viewed through the channel. It is important to note that they are not based on circuit partitioning at the decoder. Indeed, for deriving bounds on the encoder bit-meters, we assume no implementation constraint on the decoder, so it is not even necessary that the decoder is implemented within the Implementation Model of Section II-B.

We again choose the Stencil parameters as in the decoding analysis. Focusing on the i-th encoder subcircuit, let n_i denote the number of codeword symbols inside the i-th encoder subcircuit, and let k_i denote the number of input nodes (that store uncoded information) inside the inner square of the subcircuit. Further, for the i-th subcircuit (dropping the subscript for simplicity), consider the information stored in the input nodes inside the inner square, the information stored in the input nodes outside the outer square, and the information stored in the remaining input nodes in the “annulus” between the inner square and the outer square (see Fig. 4). Similarly, partition the codeword symbols and the corresponding channel outputs (see Section II-A).

Now, at the decoder, declare the values of the information bits in the annulus and outside the outer square for free. Further, assume that the decoder is not required to recover these freely declared values. Thus the job of the decoder is only to recover the information bits inside the inner square (this relaxation of the requirements on the decoder will only further reduce the error probability). For recovering these bits, it has the channel outputs and the freely declared bits. Using the erasure-channel argument from the decoding lower bounds (Theorem 1), we assume that the channel outputs available at the decoder are the outputs of an erasure channel (the decoder, as far as this theorem is concerned, is free to run the optimal Maximum-Likelihood decoding without the implementation constraints imposed on the encoder). This will only reduce the error probability for the same number of encoding bit-meters. Further, observing what is available to the decoder, we are interested in minimizing the uncertainty at the decoder in the still-undeclared information bits (namely, the information bits inside the inner square of the i-th encoder subcircuit) given the information available at the decoder to decode these bits. Examining this uncertainty,

(21)
(22)

where the two steps follow from the Markov chains induced by the encoding map and by the memoryless channel, respectively.

Similarly,

(23)

That is, the equality (21) also holds when conditioning on specific values of the freely declared random variables.

Our next step, which is key to this proof, is a simple equality. Consider the event, denoted by E, that all of the channel outputs corresponding to the codeword symbols in the subcircuit are erased. Then,

(24)

This is because the event E does not alter the joint distribution of the variables at the encoder, even when encoding is a flexible-message-length computation. The encoder has no knowledge of this erasure event (footnote 7: In the absence of feedback from the receiver, the encoder only knows the channel statistics, not the realization. While feedback from the receiver to the transmitter is absent here, in the presence of noiseless feedback, our bound on encoding bit-meters could be beaten. But the question is more interesting and relevant with realistic models of noisy feedback, where the benefits are severely curtailed (see, e.g., [38]). Further, it is also important to note that for flexible-message-length implementations, the key equality (24) holds only when we are investigating circuits at the encoder. At the decoder, the knowledge that all inputs in the subcircuit are erased could be used by a subcircuit to ask for more information from the rest of the decoding circuit. At this point, it is unclear to us whether this means that flexible-message-length decoding can beat our bound in Theorem 1.), and thus cannot alter the joint distribution in response to the event. Further, under this erasure event, because the channel outputs in the subcircuit are completely erased, they provide no help in decoding the information bits inside the inner square.

Thus, if the uncertainty in these bits remains large (as in (5)), then the conditional probability of error in recovering them is lower bounded via Lemma 4, and thus the (unconditional) block-error probability is lower bounded by

(25)

which leads to a contradiction (following the exact sequence of steps in (18) from the proof of Theorem 1).

Thus, for every subcircuit, enough information must leave the inner square. This means that

(26)

Thus, a proportional number of bits of information is communicated from inside the inner square to outside the outer square for each subcircuit at the encoder. From Lemma 3, the required bit-meters (average or deterministic) for the computation are at least these bits times the inner-to-outer-square distance for each subcircuit during encoding, and thus the total average bit-meters for the encoding circuitry are lower bounded by the sum over the subcircuits, yielding the theorem. \qed

We emphasize that while our lower bounds for fixed and flexible-message-length encoding are the same, this does not imply that flexible-message-length computation cannot reduce the required energy consumption, because our bounds could be loose. As we discuss in Section V, this necessitates a comparison with upper bounds, which is a work in progress.

III-C Lower bounds on total energy consumption

This section uses the bounds on bit-meters derived above to yield bounds on the total (transmit and information-friction) energy consumed in communication. Strictly speaking, our bounds are for total energy per bit. However, these bounds can be translated to total power consumption simply by dividing both transmission and circuit energy by the available time (under the assumption that encoding/decoding can take only as much time as transmission in order to avoid buffer overflows). The results in this section can be viewed as accounting for frictional losses in both the communication channel and the transmitter and receiver circuitry. However, our emphasis is on observing qualitative differences between bounds on total energy and the traditional understanding of transmit energy. Thus we fix the distance (and hence also the path loss) between the transmitter and the receiver, focusing on the contribution of circuit energy bounds to the total energy.

Corollary 1 (Unavoidable limits on total energy-per-bit)

For communication over a channel with the Channel Model, with the encoder and the decoder implemented in the Implementation Model with fixed-message-length computing, the total energy per bit for communication at block-error probability P_e is lower bounded as:

(27)
Proof:

The lower bound considers only the energy at the transmitting end: the transmit and the encoding energy, ignoring the decoding energy. This makes no difference to the order-sense result since the bounds in Theorem 1 and Theorem 2 are the same.

Because the channel is used a fixed number of times per second, the per-bit transmit energy is proportional to the transmit power P_T. The total (transmit + encoding) energy per bit under condition (12) can therefore be lower bounded as (using Theorem 2, and writing the total as the sum of transmit energy and encoding energy):

In our hard-decision channel model, as the transmit power increases, the term log(1/p) scales proportionally to the received power (see, e.g., [3]). Thus,

for some positive constant. By simple differentiation, the choice of transmit power that minimizes the RHS is