# Universal Decoding for Gaussian Intersymbol Interference Channels

###### Abstract

A universal decoding procedure is proposed for intersymbol interference (ISI) Gaussian channels. The universality of the proposed decoder is in the sense of being independent of the various channel parameters while, at the same time, attaining the same random coding error exponent as the optimal maximum-likelihood (ML) decoder, which utilizes full knowledge of these unknown parameters. The proposed decoding rule can be regarded as a frequency-domain version of the universal maximum mutual information (MMI) decoder. In contrast to previously suggested universal decoders for ISI channels, our proposed decoding metric can easily be evaluated.

## I Introduction

In many practical situations encountered in coded communication systems, the specific channel over which transmission is to be carried out is unknown to the receiver. The receiver only knows that the channel belongs to a given family of channels. In such a case, the implementation of the optimum maximum-likelihood (ML) decoder is precluded, and thus universal decoders, independent of the unknown channel, are sought. In designing such a decoder, two desirable properties should be taken into account: first, that the universal decoder perform asymptotically as well as the ML decoder would have, had the channel law been known; and second, that the resulting decoding metric be reasonably easy to compute. This paper addresses the problem of universal decoding for intersymbol interference (ISI) Gaussian channels.

The topic of universal coding and decoding under channel uncertainty has received much attention over the last four decades; see, for example, [NeriUni, Goppa, CsisKro, Csis2, ZivUni, LapZiv, FerderLapidoth, merFeder, Lomnitz, Lomnitz2, Misra, Shayevitz, Shayevitz2, Shayevitz3, UniNeri2]. In the realm of memoryless channels, Goppa [Goppa] explored the maximum mutual information (MMI) decoder, which chooses the codeword having the maximum empirical mutual information with the channel output sequence. This decoder was shown to achieve capacity for discrete memoryless channels (DMCs). In [CsisKro], the problem of universal decoding for DMCs with finite input and output alphabets was studied, and it was shown that the MMI decoder universally achieves the optimal random coding error exponent under the uniform random coding distribution over a certain type class. In [NeriUni], an analogous result was derived for a certain parametric class of memoryless Gaussian channels with an unknown deterministic interference signal. In the same paper, a conjecture was proposed concerning a universal decoder for ISI channels.

For channels with memory, there are several quite general results, each proposing a different universal decoder. In [ZivUni], the case of unknown finite-state channels with finite input and output alphabets was considered, in which the next channel state is a deterministic unknown function of the channel's current state and the current inputs and outputs. For uniform random codes over a given set, a universal decoder (which achieves the optimal random coding error exponent) based on the Lempel-Ziv algorithm was proposed. Later, in [LapZiv], it was shown that this decoder continues to be universally asymptotically optimum also for the class of finite-state channels with stochastic, rather than deterministic, next-state functions. In [FerderLapidoth], sufficient conditions for universal decodability, along with a universal decoder (called the merging decoder), were proposed for families of channels with memory. The idea was to employ many decoding lists in parallel, each corresponding to one point in a dense grid (whose size grows with the input block length) in the index set. In particular, with regard to our work, it was shown that the proposed decoder universally achieves the optimal error exponent for the ISI channel. Unfortunately, as was mentioned before, this decoder is very hard to implement in practice due to its implicit structure and the fact that it requires forming a dense grid in the parameter space. In [merFeder], a competitive minimax criterion was proposed. According to this approach, an optimum decoder is sought that minimizes (over all decision rules) the maximum (over all channels in the family) ratio between the error probability associated with a given channel and a given decision rule, and the error probability of the ML decoder for that channel, possibly raised to some power less than unity. This decoder is, again, very hard to implement for the ISI channel due to its complicated decoding metric.

In this paper, we propose a universal decoder that asymptotically achieves the optimal error exponent and whose decoding metric, in contrast to previously proposed decoders, can easily be calculated. The technique used in this paper is in line with the techniques established in [NeriUni, Wasim]. Specifically, similarly to [NeriUni], the main idea is to define an auxiliary “backward channel”, which is a mathematical tool for assessing log-volumes of conditional typical sets of sequences with continuous-valued components. These log-volume terms play a pivotal role in the universal decoding metric. The backward channel is defined in a way that guarantees two properties: first, a measure concentration property, that is, assignment of high probability to a given conditional type by an appropriate choice of certain parameters; and second, the conditional density of the input given the output, associated with this backward channel, should depend on the input and the output only via the sufficient statistics that define the conditional type class. In contrast to the problem considered in [NeriUni], the difficulty in the ISI channel stems from the fact that the choice of the backward channel is a non-trivial issue. It turns out that in this case, the passage to the frequency domain resolves this difficulty. The proposed decoding rule can be regarded as a frequency-domain version of the universal MMI decoder.

The remaining part of this paper is organized as follows. In Section II, we first present the model and formulate the problem. Then, the main results are provided and discussed. In Section III, we provide a proof outline where we discuss the techniques and methodologies that are utilized in order to prove the main result. Finally, in Section IV, the main results are proved.

## II Model Formulation and Main Result

Consider a discrete time, Gaussian channel characterized by

$$y_t \;=\; \sum_{k=0}^{L} h_k\, x_{t-k} \;+\; w_t, \qquad t = 1, \ldots, n \tag{1}$$

where $\{x_t\}$ are the channel inputs, $\{h_k\}$ is the unknown channel impulse response, $\{w_t\}$ is zero-mean white Gaussian noise with an unknown variance $\sigma^2$, and $\{y_t\}$ are the channel outputs. It will be assumed that the noise is statistically independent of the input. We allow the memory length $L$ to grow with the block length $n$. In such a case, we further assume that the impulse response sequence is absolutely summable.^{1}

^{1} This assumption can be relaxed to square summability.
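To make the model concrete, the following sketch simulates one block of the channel under the standard reading of (1) as a truncated linear convolution plus white Gaussian noise; the function name and the specific numbers are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def isi_channel(x, h, sigma, rng):
    """Pass an input block x through the ISI channel y_t = sum_k h_k x_{t-k} + w_t.

    x     : length-n input block (the channel is fed zeros before time 1)
    h     : impulse response h_0, ..., h_L
    sigma : noise standard deviation
    """
    n = len(x)
    # Linear convolution with h, truncated to the first n output samples.
    y_clean = np.convolve(x, h)[:n]
    w = sigma * rng.standard_normal(n)   # white Gaussian noise
    return y_clean + w

rng = np.random.default_rng(0)
n, h, sigma = 1024, np.array([1.0, 0.5, 0.25]), 0.1
x = rng.standard_normal(n)
y = isi_channel(x, h, sigma, rng)
```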

The input is a codeword drawn uniformly at random from a codebook of $M = 2^{nR}$ messages, where $R$ is the coding rate in bits per channel use. In the following, the probability of error associated with the ML decoder, which knows the unknown parameters, will be denoted by . We adopt the random coding approach, where each codeword is randomly chosen with respect to a probability measure denoted by . For a given power constraint $P$, a reasonable choice is the truncated Gaussian density restricted to the shell of an $n$-dimensional hypersphere whose radius is about $\sqrt{nP}$. To wit,

(2) |

where is the indicator function of the set

(3) |

where , and normalizes the above measure so that it integrates to unity. Note that is invariant to unitary transformations of . It is well known [Gallager, Chap. 7] that attains a higher error exponent than that of the respective Gaussian density with the same variance, at least for small rates, where the non-typical (large deviations) events are dominant.^{2} The analysis in this paper can also be carried out for the case where the codewords are drawn independently and uniformly over a set that is endowed with a -algebra (e.g., an -dimensional hypercube), and satisfy an average power constraint, as was considered in [FerderLapidoth, Theorem 4]. Let , where the expectation is taken over the ensemble of randomly selected codebooks under . Finally, we define the random coding error exponent as .

^{2} Intuitively speaking, this is true because the restriction to the shell does not allow low-energy codewords.
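As a concrete, hypothetical way to sample from such a shell-restricted density (the paper's exact normalization is not reproduced here), one can draw i.i.d. Gaussian vectors and keep only those whose energy falls in a thin shell around $nP$; all names and tolerances below are illustrative.

```python
import numpy as np

def draw_shell_codebook(M, n, P, delta, rng):
    """Draw M length-n codewords from a Gaussian density truncated to the
    thin spherical shell (1-delta) nP <= ||x||^2 <= (1+delta) nP,
    via simple rejection sampling."""
    codebook = []
    while len(codebook) < M:
        x = np.sqrt(P) * rng.standard_normal(n)
        energy = np.sum(x ** 2)
        if (1 - delta) * n * P <= energy <= (1 + delta) * n * P:
            codebook.append(x)
    return np.array(codebook)

rng = np.random.default_rng(1)
C = draw_shell_codebook(M=8, n=512, P=1.0, delta=0.1, rng=rng)
```

For large $n$ the Gaussian energy concentrates around $nP$ anyway, so the rejection step discards only a small fraction of draws while excluding exactly the low-energy codewords mentioned in the footnote.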

As was mentioned previously, we wish to find a decoding procedure which is universal in the sense of being independent of the unknown parameters, and at the same time attaining . Specifically, let designate the error probability associated with the universal rule for a given codebook , and let . Then, we would like to decay exponentially with rate .

We now turn to present the proposed decoding rule. To this end, let and denote the discrete Fourier transforms (DFT) of the sequences and , respectively, i.e., the -th component of is given by

$$\hat{x}_k \;=\; \frac{1}{\sqrt{n}} \sum_{t=1}^{n} x_t\, e^{-j 2\pi (k-1)(t-1)/n}, \qquad k = 1, \ldots, n \tag{4}$$

where and similarly for . Then, define an auxiliary “backward channel” by the conditional measure

(5) |

where is the parameters vector of the backward channel, in which are complex-valued. It should be emphasized that the above definition of the auxiliary backward channel is completely unrelated to the underlying probabilistic model. In particular, it is not argued that is obtained from and the forward channel (1) by the Bayes rule, or any other relationship. For example, our backward channel allows vectors that are outside the region . Our decoding rule will select a message that maximizes the metric

(6) |

among all codewords. The backward channel is a mathematical tool for assessing log-volumes of typical sets [NeriUni, Wasim, Wasim2], and it should be defined in a way that guarantees two general properties: first, a measure concentration property, that is, assignment of high probability to a given conditional type by an appropriate choice of the parameters of this backward channel; and second, the conditional density of given , associated with the backward channel, should depend on and only via the sufficient statistics that define the conditional type class. In contrast to the problem considered in [NeriUni], the difficulty in the ISI channel stems from the fact that the choice of the backward channel is a non-trivial issue. Specifically, as will be seen in the sequel, an “appropriate” candidate backward channel must depend on a sufficient-statistics vector (associated with ) whose dimension equals the number of degrees of freedom, and whose conditional expectations are in turn adjusted by the backward-channel parameters. It turns out that in this case, the passage to the frequency domain is more “natural” and mathematically convenient, due to the well-known asymptotic spectral properties of Toeplitz matrices (see, for example, [Gray]). To wit, the model in (1) can be written in vector form, where is a Toeplitz matrix. Now, by the spectral decomposition theorem [Widom], we know that there exists an orthonormal basis that diagonalizes the matrix . Projecting the observations onto this basis decomposes the original channel into a set of independent channels, which are simpler to analyze. While this is true for any matrix , for Toeplitz matrices we can asymptotically characterize the eigenvalues and eigenvectors in terms of the generating sequence , which is fundamental to our analysis. We next give the main result of this paper.
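The asymptotic diagonalization alluded to above can be checked numerically: for a short generating sequence, conjugating the lower-triangular Toeplitz convolution matrix by the unitary DFT matrix yields a nearly diagonal matrix whose diagonal is close to the transfer function sampled at the DFT frequencies. The impulse response below is an arbitrary illustrative example.

```python
import numpy as np

n = 256
h = np.array([1.0, 0.5, 0.25])          # arbitrary short impulse response

# Lower-triangular Toeplitz convolution matrix: (H x)_t = sum_k h_k x_{t-k}.
H = sum(hk * np.eye(n, k=-k) for k, hk in enumerate(h))

# Unitarily normalized DFT matrix.
F = np.fft.fft(np.eye(n)) / np.sqrt(n)

# Conjugation by F is approximately a diagonalization for large n.
A = F @ H @ F.conj().T
off = A - np.diag(np.diag(A))
rel_off = np.linalg.norm(off) / np.linalg.norm(A)

# The diagonal approaches the transfer function of h at the DFT frequencies.
diag_err = np.max(np.abs(np.diag(A) - np.fft.fft(h, n)))
```

The residual off-diagonal energy comes only from the "corner" by which the Toeplitz matrix differs from its circulant extension, which is why it vanishes relative to the total as $n$ grows.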

###### Theorem 1

Let the codewords of be chosen randomly and independently according to the density given in (2). Assume that the channel impulse response coefficients are absolutely summable, and that . Then,

(7) |

where as , and is the average probability of error associated with the universal decoder given in (6).

The intuitive interpretation of (6) is that is an empirical version of the per-letter mutual information between and in the frequency domain. Thus, we select the input that seems empirically “most dependent” upon the given output vector in the frequency domain, which corresponds to the MMI principle. The passage to the frequency domain asymptotically eliminates the strong interactions between the various components of the input vector, and transforms the original model into a set of separable channels which are controlled by degrees of freedom. Note that on the support of , the term is nearly a constant independent of . Thus, the proposed decoding rule is essentially equivalent to one that maximizes , namely, maximum a posteriori (MAP) decoding.
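Since the exact metric in (6) is not reproduced here, the following toy decoder merely caricatures the frequency-domain MMI principle described above: it scores each candidate codeword by a per-band empirical mutual information of the form $-\log(1-\hat{\rho}^2)$, with the coherence $\hat{\rho}$ estimated from banded periodogram averages, and selects the best-scoring codeword. All names, the band width, and the scoring rule are illustrative assumptions, not the paper's construction.

```python
import numpy as np

def freq_mmi_score(x, y, band=16):
    """Toy frequency-domain MMI score: sum over frequency bands of
    -log(1 - coherence^2), with the coherence estimated from banded
    averages of the periodograms and cross-periodogram of x and y."""
    X, Y = np.fft.fft(x), np.fft.fft(y)
    n = len(x)
    score = 0.0
    for b in range(0, n, band):
        Xb, Yb = X[b:b + band], Y[b:b + band]
        cross = np.abs(np.vdot(Xb, Yb)) ** 2
        coh2 = cross / (np.sum(np.abs(Xb) ** 2) * np.sum(np.abs(Yb) ** 2))
        score += -np.log(1.0 - min(float(coh2), 1.0 - 1e-12))
    return score

# Tiny end-to-end check: the transmitted codeword should score highest.
rng = np.random.default_rng(2)
n, h, sigma = 256, np.array([1.0, 0.5, 0.25]), 0.3
codebook = rng.standard_normal((4, n))
y = np.convolve(codebook[0], h)[:n] + sigma * rng.standard_normal(n)
scores = [freq_mmi_score(x, y) for x in codebook]
decoded = int(np.argmax(scores))
```

The point of the caricature is the one made in the text: the score depends on the unknown channel only through empirical frequency-domain statistics, yet it favors the codeword that is "most dependent" on the output.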

###### Remark 1

In [NeriUni], a universal decoding procedure for memoryless Gaussian channels with a deterministic interference was proposed. Accordingly, we remark that Theorem 1 can be fairly easily extended to the channel model

(8) |

where is an unknown deterministic interference that can be decomposed as a series expansion of orthonormal bounded functions with an absolutely summable coefficient sequence, namely,

(9) |

where and for all and . The coefficients are assumed deterministic and unknown. In this case, an appropriate definition of the auxiliary backward channel is

(10) |

where now is the parameter vector of the backward channel, is the frequency transformed representation of , and is assumed to be a monotonically non-decreasing integer-valued sequence such that and . Accordingly, the decoding rule will select a message that maximizes the metric (6) (where in (6) is replaced with ), among all codewords. For simplicity of the exposition and to facilitate the reading of the proof of Theorem 1, we will assume the original model (1).

## III Proof Outline

In this section, before delving into the proof of Theorem 1, we discuss the techniques and the main steps that will be used in Section IV. To facilitate the explanations, we will need the following definitions: Let and be arbitrary vectors in and define

(11) | |||

(12) |

and

(13) |

where is the conditional pdf associated with the channel. In words, and are simply the sets of prospective incorrect codewords corresponding to the ML decoder and the proposed universal decoder, respectively, assuming that is the transmitted codeword and that is the received vector. The set is just a -perturbed version of , which will be used for technical reasons. Finally, we let , , and be the average error probabilities associated with the ML decoder, the proposed decoder, and the -perturbed decoder, respectively (see (18)-(21)).

Generally speaking, the root of our analysis is Lemma 1, which was asserted and proved in [NeriUni, Lemma 1], and can be thought of as a continuous extension of [ZivUni, Corollary 1]. This result relates and as follows:

(14) |

where is a sequence of sets of pairs such that

(15) |

Hence, we see that in order to show that and are exponentially the same, we just need to define a sequence such that the ratio in (14)

(16) |

is uniformly overbounded by a subexponential function of , i.e., where as uniformly for all . Once this is accomplished, the proof of the theorem will be complete. The main question now is how to define the sequence properly. To answer this question, let us interpret its role. The set simply divides the space of pairs into two parts: in the first part, the supremum in (14) is uniformly bounded by a subexponential function of , while the second part has probability smaller than the desired exponential function and is hence negligible (see (15)). Obviously, given these requirements, one can propose several candidates for ; that is, the choice is not unique. However, another important property that should be accounted for is that the function be uniformly continuous w.r.t. small perturbations of the sufficient statistics (this idea will be emphasized in the analysis). To summarize, the first part of the forthcoming analysis is to define the sequence such that (15) holds true, and such that, hopefully, (14) will hold too. The proposed is given in Lemma 2, and the main tool used in the proof is large deviations theory.
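Schematically, and in generic notation (the precise inequality is the one in (14)), the two requirements combine as follows.

```latex
% Sketch of the bookkeeping (generic notation): suppose
%   (i)  the ratio in (16) is bounded by e^{n \epsilon_n} for all pairs in G_n,
%   (ii) Pr{G_n^c} decays faster than the target exponential,
% then the universal decoder inherits the ML random coding exponent:
\bar{P}_e^{\,u} \;\le\; e^{n\epsilon_n}\,\bar{P}_e \;+\; \Pr\{\mathcal{G}_n^{\,c}\}
\quad\Longrightarrow\quad
\liminf_{n\to\infty}\, -\frac{1}{n}\log \bar{P}_e^{\,u}
\;\ge\; \liminf_{n\to\infty}\, -\frac{1}{n}\log \bar{P}_e ,
\qquad \epsilon_n \to 0 .
```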

Following the first part, in the second part we will eventually show that the chosen fulfills the desired subexponential behavior of (16). Accordingly, we will overbound (16) within as follows: we will derive an upper bound on the numerator of (16) and a lower bound on its denominator, and show that these are exponentially equivalent. To this end, we will need to define a conditional typical set of our continuous-valued input-output sequences, establish some of its properties, and, in particular, calculate its volume (Lebesgue measure). This typical set of some sequence given will contain all the vectors which, within , have the same sufficient statistics as induced by our backward channel (see (61) for a precise definition of this set). Then, we will provide upper and lower bounds (which are exponentially of the same order) on the volume of this typical set. To accomplish this, we will use methods that were previously used in [NeriUni, Wasim, Wasim2], which are based on large deviations theory and on methods customary in statistical physics. After that, we will show that for any two vectors and that belong to this typical set, the conditional pdf’s and are exponentially equivalent; that is, for sufficiently large ,

(17) |

for any . Thus, given this property, we can easily provide a lower bound on the denominator of (16). Indeed, since , then, in view of the last result, there exists a sufficiently small such that the predefined typical set is essentially a subset of . Therefore, the integral over , in the denominator, can be underestimated by an integral over the typical set, and since we know its volume (or, more precisely, a lower bound on it which is exponentially tight), it is not difficult to provide a lower bound on this integral (see (105) for more details). Providing an upper bound on the numerator is a little more involved. The underlying idea is to partition the set into a subexponential number of conditional types, where for each conditional type, the integral over it is overestimated using the upper bound on the volume. Finally, it will be shown that these two bounds are exponentially equivalent, which implies that (16) is a subexponential function of , as required.

## IV Proof of Theorem 1

For completeness, in this section we restate some definitions that were briefly presented in the previous section. Let and be arbitrary vectors in and define and as in (11) and (12), respectively. The average error probabilities associated with the ML decoder and the proposed decoder are given by (see, for example, [NeriUni])

(18) |

and

(19) |

respectively, where the expectations are taken with respect to (w.r.t.) the joint distribution , and we use the usual conventions whereby random vectors are denoted by capital letters in boldface font and their sample values are denoted by the respective lowercase letters. A similar convention applies to scalar random variables (RVs), which are denoted by the same symbols without the boldface font. Finally, for we define the set

(20) |

and accordingly

(21) |

Finally, with a slight abuse of notation, we also use the notation which is defined as follows: Let and be the Fourier transforms of and , respectively. Then, where is the DFT matrix, namely, .
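Since the invariance of the input measure to unitary transformations is what makes this change of coordinates harmless, it is worth noting that, under the unitary normalization (an assumption consistent with that invariance), the DFT matrix is indeed unitary and norm-preserving; a quick, purely illustrative numerical sanity check:

```python
import numpy as np

n = 64
# Unitarily normalized DFT matrix: F @ x equals fft(x) / sqrt(n).
F = np.fft.fft(np.eye(n)) / np.sqrt(n)

# Unitarity: F F^H should be the identity.
unitarity_err = np.linalg.norm(F @ F.conj().T - np.eye(n))

# Parseval: the transform preserves Euclidean norms.
x = np.random.default_rng(3).standard_normal(n)
norm_gap = abs(np.linalg.norm(F @ x) - np.linalg.norm(x))
```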

As was discussed earlier, our goal is to compare the exponential behavior of to that of . To this end, we will instead compare the exponential behavior of to that of for small . In the final step of the proof, this will be justified by showing that

(22) |

where as and . In the analysis, we will use the following lemma [NeriUni, Lemma 1, p. 1263].

###### Lemma 1

Let be a sequence of sets of pairs of -dimensional vectors such that

(23) |

Then, for all large ,

(24) |

Thus, by using Lemma 1, we see that in order to show that and are exponentially the same, we just need to find a sequence such that the ratio

(25) |

is uniformly overbounded by a subexponential function of , i.e., where as uniformly for all . For a given pair , let us define to be

(26) |

The set will be parametrized by a parameter and defined as follows

(27) |

We have the following result.

###### Lemma 2

There exists a sufficiently large such that satisfies (23).

###### Proof 1 (Proof of Lemma 2)

By the union bound we have that

(28) |

Thus, it should be shown that if is sufficiently large, both probabilities on the right-hand side of (28) decay faster than . Regarding the first term, note that

(29) | ||||

(30) | ||||

(31) |

where denotes the spectral norm, and in the second inequality we have used the fact that for any and nonnegative definite matrix . Owing to the absolute summability of the impulse response (square summability essentially suffices here), it can be shown [Gray] that the spectral norm is uniformly bounded; that is, for all matrix dimensions we have that where . Therefore, we obtain that

(32) |

which can be made less than by selecting a sufficiently large , as can be shown by a simple application of the Chernoff bound. As for the remaining terms: taking the gradient of w.r.t. , we find that the components of are given by the solutions of the following set of equations:

(33) |

and

(34) |

Note that

(35) |

where . As before, the exponential decay rate of the last two terms on the right-hand side of (35) can be made arbitrarily large by selecting a sufficiently large and sufficiently small . As for the first term, we first note that by using (33), we have

(36) | ||||

(37) |

Thus, using the last result we obtain

(38) | ||||

(39) | ||||

(40) |

which in turn must be nonnegative, and hence

(41) |

Thus, given that , by using (40) we obtain that

(42) | ||||

(43) | ||||

(44) |

Therefore, invoking (41), we finally obtain that

(45) |

Now, recall that minimizes the quadratic norm

over all vectors in . Also, due to (45), the minimizing vector must lie in the -dimensional hypersphere . Now, fix and define the grid , and let designate the th Cartesian power of . From the uniform continuity of the above quadratic form within the set of all energy-limited vectors , one can find a sufficiently small value of (depending on ) such that there exists a vector where , i.e., the nearest neighbor of the minimizer, satisfying (given, of course, the event that )

(46) |

where is a sufficiently small value (depending on ). For brevity, in the following we will omit this negligible additive term. Hence,

(47) | |||

(48) | |||

(49) | |||