Quantization of Binary-Input Discrete Memoryless Channels
Abstract
The quantization of the output of a binary-input discrete memoryless channel to a smaller number of levels is considered. An algorithm is given which finds an optimal quantizer, in the sense of maximizing mutual information between the channel input and the quantizer output. This result holds for arbitrary channels, in contrast to previous results for restricted channels or a restricted number of quantizer outputs. In the worst case, the algorithm complexity is cubic in the number of channel outputs $M$. Optimality is proved using the theorem of Burshtein, Della Pietra, Kanevsky, and Nádas for mappings which minimize average impurity for classification and regression trees.
discrete memoryless channel, channel quantization, mutual information maximization, classification and regression
1 Introduction
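Consider a discrete memoryless channel (DMC) with a quantized output, as shown in Fig. 1. Let the channel input $X$ take values from $\mathcal{X}$, with input distribution $p_j = \Pr(X = j)$.
Let the channel output $Y$ take values from $\mathcal{Y}$, with channel transition probabilities $P_{i|j} = \Pr(Y = i \mid X = j)$.
The channel output is quantized to $Z$, which takes values from $\mathcal{Z}$, by a possibly stochastic quantizer $Q$,
(1) $Q_{k|i} = \Pr(Z = k \mid Y = i)$
so that the conditional probability distribution on the quantizer output is $T_{k|j}$,
(2) $T_{k|j} = \Pr(Z = k \mid X = j) = \sum_{i \in \mathcal{Y}} Q_{k|i} P_{i|j}$
Here, $\mathcal{X} = \{1, \ldots, J\}$, $\mathcal{Y} = \{1, \ldots, M\}$ and $\mathcal{Z} = \{1, \ldots, K\}$ are finite sets.
The mutual information between $X$ and $Z$ is:
(3) $I(X;Z) = \sum_{k \in \mathcal{Z}} \sum_{j \in \mathcal{X}} p_j T_{k|j} \log \frac{T_{k|j}}{\sum_{j'} p_{j'} T_{k|j'}}$
Mutual information is convex (lower convex) in $T_{k|j}$, for fixed $p_j$. Similarly, it is concave (upper convex) in $p_j$ for fixed $T_{k|j}$ [1, Theorem 2.7.4]. Logarithms are base 2.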
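As a minimal sketch of how (2) and (3) can be evaluated numerically, the following Python function computes the mutual information between the channel input and the quantizer output. The function name and the list-of-lists layout (`P[j][i]`, `Q[i][k]`, zero-based indices) are assumptions made for illustration and are not notation from the paper.

```python
import math

def mutual_information(p, P, Q):
    """Mutual information I(X;Z) in bits.

    p[j]    : Pr(X = j), the input distribution
    P[j][i] : Pr(Y = i | X = j), the channel transition probabilities
    Q[i][k] : Pr(Z = k | Y = i), a possibly stochastic quantizer
    """
    J, M, K = len(p), len(P[0]), len(Q[0])
    # Conditional distribution on the quantizer output, as in (2).
    T = [[sum(P[j][i] * Q[i][k] for i in range(M)) for k in range(K)]
         for j in range(J)]
    total = 0.0
    for k in range(K):
        pz = sum(p[j] * T[j][k] for j in range(J))   # Pr(Z = k)
        for j in range(J):
            if p[j] * T[j][k] > 0:                   # 0 log 0 is taken as 0
                total += p[j] * T[j][k] * math.log2(T[j][k] / pz)
    return total

# Binary symmetric channel with crossover 0.1 (illustrative values only).
bsc = [[0.9, 0.1], [0.1, 0.9]]
identity = [[1, 0], [0, 1]]      # K = M: nothing is lost by "quantizing"
collapse = [[1], [1]]            # K = 1: all information is lost
mi_identity = mutual_information([0.5, 0.5], bsc, identity)
mi_collapse = mutual_information([0.5, 0.5], bsc, collapse)
```

With the identity quantizer the result equals the mutual information of the channel itself, and with a single quantizer output it is zero.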
The main contribution of this paper is a Quantizer Design Algorithm which finds a mutual-information-maximizing quantizer for binary-input DMCs. The Quantizer Design Algorithm, an instance of dynamic programming, searches over all quantizers that satisfy a condition on quantizer optimality; this condition will be given as Lemma 3. Note that $K < M$ is of interest, since $K \geq M$ implies no loss in mutual information due to quantization. The main result is stated concisely as follows.
Theorem. The set of all quantizers, including stochastic quantizers, is denoted by $\mathcal{Q}$. For an arbitrary binary-input DMC, the Quantizer Design Algorithm finds a quantizer $Q^*$ which maximizes the mutual information between $X$ and $Z$:
(4) $Q^* = \arg\max_{Q \in \mathcal{Q}} I(X;Z)$
and this algorithm has complexity proportional to $M^3$, in the worst case.
This maximization (4) is a concave optimization problem. General concave optimization is an NP-hard problem [3]. For the specific problem of channel quantization, a naive approach of searching over all deterministic quantizers has exponential complexity (it will be shown that there exists an optimal quantizer which is deterministic). The significance of the Theorem is that an optimal quantizer may be found with complexity cubic in $M$.
Discrete channel quantization is shown to be related to the design of classification and regression trees. Burshtein, Della Pietra, Kanevsky, and Nádas gave conditions under which an optimal classification forms a convex set [2]. In this paper, this result is applied to find conditions on optimal channel quantization, which are in turn used to prove the optimality of the Quantizer Design Algorithm.
The remainder of this paper is organized as follows. Sec. 2 gives the motivation, outlines the connections with statistical learning theory, and describes prior work on channel quantization with information-theoretic measures. Sec. 3 sketches the main argument, giving a condition on an optimal quantizer, while proofs are in the Appendix. Sec. 4 gives the Quantizer Design Algorithm which exploits this condition; complexity is also discussed, and a numerical example of quantizing an AWGN channel is given. Finally, Sec. 5 contains some discussion.
2 Relationship with Prior Work
2.1 Motivation
The problem of finding good channel quantizers is of importance since most communications receivers convert physical-world analog values to discrete values. It is these discrete values that are used by subsequent filtering, detection and decoding algorithms. Since the complexity of circuits which implement such algorithms increases with the number of quantization levels, it is desirable to use as few levels as possible, for some specified error performance.
Channel capacity is the maximization of mutual information, so a reasonable metric for designing channel quantizers is to similarly maximize mutual information between the channel input and the quantizer output. For a memoryless channel with a fixed input distribution, the quantizer which maximizes mutual information will give the highest achievable communications rate.
In addition to the problem of quantizing a physical channel, this paper’s result has other applications. The original motivation was the implementation of finite-message-alphabet LDPC decoders [28]. Also, we have recently discovered that discrete channel quantization may have application in the construction of polar codes [4].
2.2 Learning Theory Connections
A connection between discrete channel quantization and the area of statistical learning theory can be established. Classification, which is a type of supervised learning problem, deals with a variable of interest $X$ which is stochastically linked to an observation $Y$.
A common tree construction procedure is to successively split nodes, beginning with the root node. A node should be split in such a way as to give “purer” descendant nodes, in terms of the classifier’s ability to form good estimates [7]. Impurity is a way of measuring the goodness of a split. In this paper, a tree is not being constructed, but the quantizer outputs $Z$ are analogous to the decisions made by the tree, and we are interested in the impurity associated with each quantizer output $Z = k$. Various impurity functions have been suggested [6, Appendix A]. The impurity function of interest here is:
(5) $-\sum_{j} \Pr(X = j \mid Z = k) \log \Pr(X = j \mid Z = k)$
where $\Pr(X = j \mid Z = k) = p_j T_{k|j} / \sum_{j'} p_{j'} T_{k|j'}$. This is conditional entropy given $Z = k$, that is $H(X \mid Z = k)$. Classification conditioned on $Z = k$ is good if it represents $X$ with as few bits as possible; if $H(X \mid Z = k)$ is zero, then no bits are required, that is, $X$ is known exactly.
The average impurity of a mapping $Q$ is:
(6) $\sum_{k} \Pr(Z = k) \, H(X \mid Z = k)$
(7) $= H(X \mid Z)$
Minimization of average impurity is often of interest in learning theory. On the other hand, mutual information, the objective function in (4), can be written as:
(8) $I(X;Z) = H(X) - H(X \mid Z)$
Thus, minimization of average impurity over $Q$ is equivalent to maximization of mutual information over $Q$, since $H(X)$ is fixed. This is the connection between learning theory and discrete channel quantization.
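The equivalence between average impurity and mutual information can be checked numerically. The sketch below, with illustrative function names and an arbitrary joint distribution that is not taken from the paper, computes the average impurity as a weighted sum of conditional entropies and compares the entropy of the input minus that impurity against mutual information computed directly.

```python
import math

def entropy(dist):
    """Entropy in bits of a probability vector."""
    return -sum(q * math.log2(q) for q in dist if q > 0)

def average_impurity(joint):
    """Sum over k of Pr(Z=k) H(X | Z=k); joint[j][k] = Pr(X=j, Z=k)."""
    J, K = len(joint), len(joint[0])
    total = 0.0
    for k in range(K):
        pz = sum(joint[j][k] for j in range(J))
        if pz > 0:
            total += pz * entropy([joint[j][k] / pz for j in range(J)])
    return total

def mutual_information(joint):
    """I(X;Z) in bits, computed directly from the joint distribution."""
    J, K = len(joint), len(joint[0])
    px = [sum(row) for row in joint]
    pz = [sum(joint[j][k] for j in range(J)) for k in range(K)]
    return sum(joint[j][k] * math.log2(joint[j][k] / (px[j] * pz[k]))
               for j in range(J) for k in range(K) if joint[j][k] > 0)

# Arbitrary joint distribution of (X, Z) with J = 2 and K = 3.
joint = [[0.30, 0.15, 0.05],
         [0.05, 0.10, 0.35]]
px = [sum(row) for row in joint]
lhs = mutual_information(joint)
rhs = entropy(px) - average_impurity(joint)   # H(X) - H(X|Z)
```

The two values `lhs` and `rhs` agree up to floating-point error, so a mapping that lowers the average impurity raises mutual information by the same amount.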
Burshtein et al. gave an elegant and general result for a broad class of impurity functions, showing that the preimage which minimizes average impurity forms a convex set in a suitable Euclidean space of observed values [2]. Since conditional entropy is a member of this class of impurity functions, we can use this result to prove the Theorem. Note that Coppersmith, Hong and Hosking [8], considering a restricted setting, arrived at a similar conclusion, stating that the preimages of two distinct quantizer outputs are separated by a hyperplane in a similar Euclidean space. In the present paper we follow Burshtein et al.
Algorithms for classification can be used for channel quantization. If optimality is not required, then Chou’s iterative algorithm based on the K-means algorithm can find good quantizers [6]. An optimal classifier can be found with the complexity given in [2], which is lower than that of the naive approach for many cases of interest.
2.3 Superficially Related Problems
Superficially, the channel quantization problem (4) resembles various information-theoretic optimization problems, particularly the computation of the rate-distortion function,
(9) $R(D) = \min_{Q} I(Y;Z)$
but it is distinct. Here the distortion constraint is omitted for clarity. Mutual information is convex in $Q_{k|i}$, and for the computation of $R(D)$, mutual information is minimized, so this is a convex optimization problem [1, Sec. 13.7]. On the other hand, the channel quantization problem is to maximize mutual information in $Q_{k|i}$, leading to a considerably different concave optimization problem. Similarly, computation of the DMC capacity is a convex optimization problem, since mutual information, concave in $p_j$, is maximized. The relationship among the rate-distortion problem, the DMC capacity, and the problem treated in this paper is illustrated in Fig. 2. Arimoto [9] and Blahut [10] gave the well-known algorithm to compute the channel capacity and the rate-distortion function.
Another information extremum problem is the information bottleneck method, from the field of machine learning [11]. The problem setup is identical, using the same Markov chain $X \to Y \to Z$. However, this is an information minimization problem, using a Lagrange multiplier to sweep a kind of rate-distortion curve. Moreover, it is a convex optimization method, using alternating minimization.
2.4 Information-Theoretic Quantizer Design Criteria
Information-theoretic measures for channel quantization have been of interest since the 1960s, when Wozencraft and Kennedy suggested using the channel cutoff rate as an optimization criterion [12] [13, Sec. 6.2]. Massey gave a quantizer design algorithm for the binary-input AWGN channel [14] and Lee extended these results to continuous channels with non-binary inputs [15].
However, since the channel capacity, which is above the channel cutoff rate, can now be practically approached with turbo codes and LDPC codes, mutual information is a more appropriate measure. The earliest work we are aware of is the 2002 conference paper of Ma, Zhang, Yu and Kavcic, which considered quantization of the binary-input AWGN channel [16]. For the special case of three quantizer outputs, it is straightforward to select a single parameter which maximizes mutual information. For a larger number of outputs, local optimization is feasible, and this gives higher mutual information than uniform quantization [17].
Singh, Dabeer and Madhow considered the problem of jointly finding capacity-achieving input distributions and AWGN channel quantizers [18]. Again, for an AWGN channel quantized to three levels, optimization over a single parameter was done, but for a larger number of outputs, a local optimization algorithm was used. In fact, this is a concave-convex problem, and global optimization appears difficult.
Maximization of mutual information has been considered in various application-centric quantization problems. Channels with memory can be quantized using the information bottleneck method, and quantizers with memory have a higher information rate than memoryless quantizers [19]. For flash memories, maximizing mutual information for pulse-amplitude modulation can improve the performance of LDPC codes [20].
2.5 Contributions of This Paper
Previous work, discretizing continuous-output channels, either showed optimality in special cases, or considered locally optimal algorithms. The special cases are restricted to two- and three-level quantization of symmetrical AWGN channels. The locally optimal quantization algorithms appear to be effective, but they have not been proven optimal.
By considering discrete-output channels rather than continuous-output channels, further progress can be made on this problem. In contrast to previous work, the results in this paper hold for arbitrary channels, and for an arbitrary number of quantization levels. Thus, we believe that this is the first result on globally optimal quantization of general binary-input channels. Of course, a continuous-output channel can be approximated with arbitrarily small discrepancy by a finely quantized channel, which may be used as the original channel in this paper.
While the algorithm assuming uniformly-distributed inputs appeared in conference proceedings previously [29], contributions of this paper include the proof of its optimality and an extension to non-uniform input distributions, as well as showing the usefulness of statistical learning theory for channel quantization.
3 The Structure of Optimal Quantizers
This section develops the condition on an optimal quantizer for binary-input channels. First, channel quantization is stated as a concave optimization problem, and it is shown that there exists an optimal quantizer that is deterministic. Then, a separating hyperplane condition for a general number of inputs is given. Finally, this condition is specialized to the binary-input discrete memoryless channel case, stated as Lemma 3. Proofs are in the Appendix.
3.1 Restriction to Deterministic Quantizers
Concave optimization, also known as concave programming or concave minimization and related to global optimization, is a class of mathematical programming problems which has the general form:
(10) $\min_{x \in S} f(x)$
where $S$ is a feasible region and $f$ is a concave function [21] [22].
Mutual information (3) is convex in $Q_{k|i}$ as well as $T_{k|j}$, as shown in the Appendix, part .1. Thus, the optimization problem given by (4) is maximization of a convex function, expressed as a concave programming problem over the variables $Q_{k|i}$:
(11) $\max_{Q} I(X;Z)$ subject to $Q_{k|i} \geq 0$ and $\sum_{k=1}^{K} Q_{k|i} = 1$ for all $i$
The constraint enforces that $Q_{k|i}$ is a conditional probability distribution. Using a well-known result from concave optimization, the maximum of mutual information is achieved by a deterministic quantizer:
Lemma 1.
For any DMC and any $K$, a deterministic quantizer maximizes mutual information. That is, there is an optimal quantizer $Q^*$ with $Q^*_{k|i} \in \{0, 1\}$, for all $i$ and $k$.
The proof is in the Appendix, part .1.
As an example to illustrate Lemma 1, consider the binary, symmetric errors and erasure channel, with the transition matrix:
(12) 
for and . Suppose the three outputs, called 0, erasure and 1, are to be quantized to two levels. One might expect that symmetry should be maintained by mapping the erasure symbol to the two output symbols with probability 0.5 each. However, as Lemma 1 shows, there exists a deterministic quantizer (which maps the erasure to either 0 or 1) which maximizes mutual information, and this quantizer lacks symmetry between the channel input and quantizer output. For this particular channel, a stochastic quantizer has strictly lower mutual information. Note that for the low-SNR AWGN channel quantized to two levels, asymmetric quantizers are optimal as well [23].
Previous work on the continuous output channel has shown that dithered quantization has no advantage over fixed, deterministic quantizers [24]. Lemma 1 extends this idea to DMCs, showing that purely stochastic quantizers, that is, nondeterministic quantizers, never have better performance than deterministic quantizers.
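The erasure example can be explored numerically. The following sketch uses assumed transition probabilities (0.7 correct, 0.2 erasure, 0.1 error), which are illustrative values and not the parameters of (12), and sweeps the probability `q` with which the erasure is mapped to quantizer output 0.

```python
import math

def mi_bits(p, T):
    """I(X;Z) in bits from p[j] = Pr(X=j) and T[j][k] = Pr(Z=k | X=j)."""
    total = 0.0
    for k in range(len(T[0])):
        pz = sum(p[j] * T[j][k] for j in range(len(p)))
        for j in range(len(p)):
            if p[j] * T[j][k] > 0:
                total += p[j] * T[j][k] * math.log2(T[j][k] / pz)
    return total

# Assumed errors-and-erasures channel; columns are outputs 0, erasure, 1.
P = [[0.7, 0.2, 0.1],
     [0.1, 0.2, 0.7]]
p = [0.5, 0.5]

def quantized_mi(q):
    """Outputs 0 and 1 keep their labels; the erasure goes to 0 with probability q."""
    Q = [[1.0, 0.0], [q, 1.0 - q], [0.0, 1.0]]
    T = [[sum(P[j][i] * Q[i][k] for i in range(3)) for k in range(2)]
         for j in range(2)]
    return mi_bits(p, T)

# Sweep q from 0 (erasure always mapped to 1) to 1 (erasure always mapped to 0).
sweep = [(q / 10, quantized_mi(q / 10)) for q in range(11)]
```

For these assumed values the two deterministic choices give about 0.296 bits, while the symmetric stochastic choice `q = 0.5` gives about 0.278 bits. Since mutual information is convex in the quantizer, the sweep is largest at its endpoints.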
3.2 Separating Hyperplane Condition for Optimality
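Due to Lemma 1, attention may now be restricted to deterministic quantizers $Q$:
(13) $Q : \{1, \ldots, M\} \to \{1, \ldots, K\}$
Let $\mathcal{A}_k$ denote the preimage of $k$ (in some contexts, the preimage may also be written as $Q^{-1}(k)$). The sets $\mathcal{A}_k$ and $\mathcal{A}_{k'}$ are disjoint for $k \neq k'$, and the union of all the sets is $\{1, \ldots, M\}$.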
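Let $\Pr(X = j \mid Y = i)$ be the conditional probability distribution on the channel input:
(14) $\Pr(X = j \mid Y = i) = \frac{p_j P_{i|j}}{\sum_{j'} p_{j'} P_{i|j'}}$
which depends on the input distribution $p_j$ as well as the DMC $P_{i|j}$. Each channel output $i$ corresponds to a vector $v_i$:
(15) $v_i = \left( \Pr(X = 1 \mid Y = i), \ldots, \Pr(X = J-1 \mid Y = i) \right)$
with $v_i \in [0,1]^{J-1}$, and for a given channel, the set of all vectors is $\{v_1, \ldots, v_M\}$.
Define an equivalent quantizer $\tilde{Q}$ on the quantization domain $\{v_1, \ldots, v_M\}$:
(16) $\tilde{Q} : \{v_1, \ldots, v_M\} \to \{1, \ldots, K\}$
The two quantizers are equivalent in the sense that $\tilde{Q}(v_i) = Q(i)$. The advantage of the new quantizer is that the points $v_i$ exist in the Euclidean space $\mathbb{R}^{J-1}$, and in that Euclidean space, the following lemma holds.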
Lemma 2.
There exists an optimal quantizer for which any two distinct preimages are separated by a hyperplane in this Euclidean space.
Burshtein et al. [2, Theorem 1] showed that there must be at least one optimal quantizer for which all preimages are convex in that space. In the context of discrete channel quantization, convexity corresponds to the condition that the convex hulls of distinct preimages do not intersect, or equivalently that distinct preimages are separated by a hyperplane. The application of [2] is not immediately obvious, and further details and the proof are given in the Appendix, part .2 and part .3, respectively.
3.3 Binary Input Case
Specializing Lemma 2 to the case of $J = 2$, the channel outputs are in a one-dimensional space. The separating hyperplane is a point, and an optimal preimage consists of those points on a line segment. In order to simplify finding such points, it is convenient to assume that the outputs are labeled in such a way that $i < i'$ implies that $\Pr(X = 1 \mid Y = i) < \Pr(X = 1 \mid Y = i')$, for any $i, i' \in \{1, \ldots, M\}$, that is:
(17) $\Pr(X = 1 \mid Y = 1) < \Pr(X = 1 \mid Y = 2) < \cdots < \Pr(X = 1 \mid Y = M)$
There is no loss of generality because the outputs can be relabeled such that this condition holds. The benefit should be clear: if (17) holds, then for an optimal quantizer $Q^*$, the set $\mathcal{A}^*_k$ will be a contiguous set of integers. That is, attention may be further restricted to:
(18) $\mathcal{A}_k = \{a_{k-1} + 1, \ldots, a_k\}$
for $k = 1, \ldots, K$, with $a_0 = 0$ and $a_{k-1} < a_k$ and $a_K = M$. The $a_k$ are quantizer boundaries.
The above reasoning proves the following Lemma 3, which uses a slightly more intuitive sorting using log-likelihood ratios to describe a condition satisfied by an optimal quantizer. Inequalities (19) are easily obtained from (17), and the monotonicity of the logarithm function is used.
Lemma 3.
Consider a binary-input DMC that satisfies,
(19) $\log \frac{P_{1|1}}{P_{1|2}} < \log \frac{P_{2|1}}{P_{2|2}} < \cdots < \log \frac{P_{M|1}}{P_{M|2}}$
Then, there exists an optimal quantizer $Q^*$, such that each $\mathcal{A}^*_k$ consists of a contiguous set of integers, and so the optimal quantizer boundaries satisfy:
(20) $0 = a^*_0 < a^*_1 < \cdots < a^*_{K-1} < a^*_K = M$
Note that the condition (19) does not depend on the input distribution $p_j$.
The necessity of the contiguousness condition (20) in Lemma 3 for an optimal quantizer to be deterministic cannot be shown using Lemma 1 alone. However, the more restricted classification setting of Coppersmith et al. [8] can be applied here to demonstrate the necessity of the condition. The relevance of necessity to the Quantizer Design Algorithm is discussed in the next section.
Strict inequalities are used in (17) and (19) because if
(21) $\frac{P_{i|1}}{P_{i|2}} = \frac{P_{i'|1}}{P_{i'|2}}$
then outputs $i$ and $i'$ can be combined to a single output with the likelihood $P_{i|j} + P_{i'|j}$ for input $j$ to form a new channel with $M - 1$ outputs. The likelihood ratio for the combined output,
(22) $\frac{P_{i|1} + P_{i'|1}}{P_{i|2} + P_{i'|2}}$
is equal to (21). The new channel is operationally equivalent to the original channel, and moreover the two channels have the same mutual information.
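The relabeling and merging steps can be sketched as a small preprocessing routine. This is an illustrative helper, not code from the paper: it assumes a binary-input channel with strictly positive transition probabilities stored as `P[j][i]`, sorts outputs by increasing likelihood ratio, and merges outputs whose likelihood ratios coincide.

```python
import math

def sort_and_merge(P, rel_tol=1e-12):
    """Sort outputs by likelihood ratio P[0][i] / P[1][i] and merge ties.

    P[j][i] = Pr(Y = i | X = j), binary input, all entries assumed positive.
    """
    cols = sorted(zip(P[0], P[1]), key=lambda c: c[0] / c[1])
    merged = [list(cols[0])]
    for a, b in cols[1:]:
        la, lb = merged[-1]
        # Equal likelihood ratios, tested by cross-multiplication.
        if math.isclose(a * lb, la * b, rel_tol=rel_tol):
            merged[-1] = [la + a, lb + b]      # combined output
        else:
            merged.append([a, b])
    return [[c[0] for c in merged], [c[1] for c in merged]]

def channel_mi(p, P):
    """I(X;Y) in bits for a binary-input channel."""
    total = 0.0
    for i in range(len(P[0])):
        py = sum(p[j] * P[j][i] for j in range(2))
        for j in range(2):
            if P[j][i] > 0:
                total += p[j] * P[j][i] * math.log2(P[j][i] / py)
    return total

# Illustrative channel: the second and third outputs share likelihood ratio 2.
P = [[0.1, 0.4, 0.2, 0.3],
     [0.3, 0.2, 0.1, 0.4]]
Pm = sort_and_merge(P)
p = [0.3, 0.7]
before, after = channel_mi(p, P), channel_mi(p, Pm)
```

After merging, the channel has three outputs with strictly increasing likelihood ratios, and `before` and `after` agree up to floating-point error.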
4 Quantizer Design Algorithm
This section describes the Quantizer Design Algorithm which finds a quantizer $Q^*$ that maximizes mutual information over quantizer boundaries satisfying (20). By Lemma 3, $Q^*$ provides the maximum mutual information over all quantizers. The Quantizer Design Algorithm, as well as Lemma 3, is restricted to binary-input DMCs.
4.1 Partial Mutual Information
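A partial sum of mutual information is called “partial mutual information.” Partial mutual information is the contribution that one or more quantizer outputs makes to the total mutual information. Consider that the conditional probability distribution on the quantizer output can be written as:
(23) $T_{k|j} = \sum_{i \in \mathcal{A}_k} P_{i|j}$
The contribution that quantizer output $k$ makes to mutual information is called the partial mutual information $\iota_k$:
$\iota_k = \sum_{j} p_j T_{k|j} \log \frac{T_{k|j}}{\sum_{j'} p_{j'} T_{k|j'}}$
so total mutual information is the sum of all the partial mutual information terms:
(24) $I(X;Z) = \sum_{k=1}^{K} \iota_k$
For one quantizer output with boundaries $a_l$ and $a_r$ with $a_l \leq a_r$, denote the partial mutual information by $g(a_l \to a_r)$, which is:
(25) $g(a_l \to a_r) = \sum_{j} p_j \left( \sum_{i=a_l}^{a_r} P_{i|j} \right) \log \frac{\sum_{i=a_l}^{a_r} P_{i|j}}{\sum_{j'} p_{j'} \sum_{i=a_l}^{a_r} P_{i|j'}}$
So if $\mathcal{A}_k = \{a_l, \ldots, a_r\}$, then $\iota_k = g(a_l \to a_r)$.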
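A minimal sketch of partial mutual information follows, assuming the preimage of one quantizer output is a contiguous block of channel outputs with one-based endpoints `l` and `r`. The helper name `make_g` and the use of prefix sums are illustrative choices, and the channel values are arbitrary.

```python
import math
from itertools import accumulate

def make_g(p, P):
    """Return g(l, r), the partial mutual information (in bits) of a single
    quantizer output whose preimage is channel outputs l..r (1-based, inclusive).

    p[j] = Pr(X = j), P[j][i] = Pr(Y = i+1 | X = j).
    """
    J = len(p)
    # prefix[j][a] = sum of P[j][0..a-1], so any block sum costs O(1).
    prefix = [[0.0] + list(accumulate(P[j])) for j in range(J)]

    def g(l, r):
        t = [prefix[j][r] - prefix[j][l - 1] for j in range(J)]  # Pr(Z=k | X=j)
        pz = sum(p[j] * t[j] for j in range(J))                  # Pr(Z=k)
        return sum(p[j] * t[j] * math.log2(t[j] / pz)
                   for j in range(J) if p[j] * t[j] > 0)
    return g

# Arbitrary binary-input channel with M = 4 outputs.
p = [0.5, 0.5]
P = [[0.1, 0.3, 0.4, 0.2],
     [0.3, 0.4, 0.2, 0.1]]
g = make_g(p, P)
total_two_levels = g(1, 2) + g(3, 4)   # quantizer with preimages {1,2} and {3,4}
total_one_level = g(1, 4)              # a single output carries no information
```

Summing `g` over the blocks of a quantizer gives its total mutual information, which is the quantity the dynamic program accumulates.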
4.2 Quantizer Design Algorithm
The Quantizer Design Algorithm is an instance of dynamic programming. The algorithm has a state value $S_z(a)$, which is the maximum partial mutual information when channel outputs 1 to $a$ are quantized to quantizer outputs 1 to $z$. This can be computed recursively by conditioning on the state value at time index $z - 1$:
(26) $S_z(a) = \max_{a'} \left( S_{z-1}(a') + g(a'+1 \to a) \right)$
where the maximization is taken over $a' \in \{z-1, \ldots, a-1\}$. Clearly, $S_K(M)$ is the maximum total mutual information. The sequence of maximizing values of $a'$ which gives the maximum of total mutual information corresponds to the optimum quantizer whose boundaries are $a^*_1, \ldots, a^*_{K-1}$. The relationship between the state values is illustrated in a trellis-type diagram in Fig. 3, for and .
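Quantizer Design Algorithm

1. Inputs:
   - Binary-input discrete memoryless channel $P_{i|j}$. If necessary, modify labels to satisfy (19).
   - The number of quantizer outputs $K$.

2. Initialize $S_0(0) = 0$.

3. Precompute partial mutual information. For each $a_l \in \{1, \ldots, M\}$ and for each $a_r \in \{a_l, \ldots, a_l + M - K\}$ (where $a_r \leq M$):
   - compute $g(a_l \to a_r)$ according to (25).

4. Recursion. For each $z \in \{1, \ldots, K\}$, and for each $a \in \{z, \ldots, z + M - K\}$,
   - compute $S_z(a)$ according to (26),
   - store the local decision $h_z(a)$:
     $h_z(a) = \arg\max_{a'} \left( S_{z-1}(a') + g(a'+1 \to a) \right)$
     where the maximization is taken over $a' \in \{z-1, \ldots, a-1\}$.

5. Find an optimal quantizer by traceback. Let $a^*_K = M$. For each $z \in \{K-1, K-2, \ldots, 1\}$:
   $a^*_z = h_{z+1}(a^*_{z+1})$

6. Outputs:
   - The optimal boundaries $a^*_1, \ldots, a^*_{K-1}$. Equivalently, output the matrix $Q^*$, where row $z$ of $Q^*$ has ones in columns $a^*_{z-1} + 1$ to $a^*_z$ and zeros in all other columns.
   - The maximum mutual information, $S_K(M)$.
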

There may be multiple optimal quantizers. To find these, a tie-preserving implementation should collect all locally optimal decisions and tracebacks to produce multiple optimal quantizers. This was not explicitly indicated, to keep the notation simple.
With arbitrary tie-breaking, Lemma 3 guarantees that the Quantizer Design Algorithm will find one of the optimal quantizers. However, (20) in Lemma 3 can be shown to be a necessary condition for optimality, using [8]. Thus, a tie-preserving implementation will find all optimal deterministic quantizers. This can be further extended to show that probabilistic quantizers are suboptimal; this is done by showing the strict inequalities in (17) lead to strict convexity of mutual information. In summary, this paper has shown in detail that the Quantizer Design Algorithm will find at least one optimal quantizer, and has given a sketch of an argument that the tie-preserving implementation finds all optimal quantizers.
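The dynamic program described above can be sketched in Python as follows. This is an illustrative implementation under stated assumptions, not the authors' code: the channel is given as `P[j][i]` with outputs already sorted by likelihood ratio, ties are broken arbitrarily (the first maximizer is kept), and partial mutual information is computed on demand rather than precomputed. The function names and the example channel are hypothetical. A brute-force search over all deterministic quantizers is included only as a sanity check on a small channel.

```python
import math
from itertools import accumulate, product

def quantizer_design(p, P, K):
    """Return (boundaries, max mutual information in bits).

    boundaries = [a_0 = 0, a_1, ..., a_K = M]; quantizer output z has
    preimage a_{z-1}+1 .. a_z (1-based channel output labels).
    """
    J, M = len(p), len(P[0])
    prefix = [[0.0] + list(accumulate(row)) for row in P]

    def g(l, r):  # partial mutual information of the block l..r
        t = [prefix[j][r] - prefix[j][l - 1] for j in range(J)]
        pz = sum(pj * tj for pj, tj in zip(p, t))
        return sum(pj * tj * math.log2(tj / pz)
                   for pj, tj in zip(p, t) if pj * tj > 0)

    S = [[-math.inf] * (M + 1) for _ in range(K + 1)]  # state values
    h = [[0] * (M + 1) for _ in range(K + 1)]          # local decisions
    S[0][0] = 0.0
    for z in range(1, K + 1):
        for a in range(z, z + M - K + 1):
            for a_prev in range(z - 1, a):
                cand = S[z - 1][a_prev] + g(a_prev + 1, a)
                if cand > S[z][a]:
                    S[z][a], h[z][a] = cand, a_prev
    bounds = [0] * (K + 1)
    bounds[K] = M
    for z in range(K, 0, -1):                          # traceback
        bounds[z - 1] = h[z][bounds[z]]
    return bounds, S[K][M]

def brute_force(p, P, K):
    """Maximum mutual information over all K**M deterministic quantizers."""
    J, M = len(p), len(P[0])
    best = 0.0
    for mapping in product(range(K), repeat=M):
        mi = 0.0
        for k in range(K):
            t = [sum(P[j][i] for i in range(M) if mapping[i] == k) for j in range(J)]
            pz = sum(pj * tj for pj, tj in zip(p, t))
            mi += sum(pj * tj * math.log2(tj / pz)
                      for pj, tj in zip(p, t) if pj * tj > 0)
        best = max(best, mi)
    return best

# Hypothetical channel with M = 6 outputs, sorted by increasing likelihood ratio.
p = [0.4, 0.6]
P = [[0.02, 0.08, 0.15, 0.20, 0.25, 0.30],
     [0.30, 0.25, 0.20, 0.15, 0.08, 0.02]]
bounds, best = quantizer_design(p, P, 3)
```

On this small example the value returned by the dynamic program matches the exhaustive search, as Lemma 3 predicts for a channel whose outputs are sorted.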
4.3 Complexity
A naive optimization approach for (11) is to search over all candidate solutions, which is searching over all deterministic quantizers. This has complexity exponential in $M$.
Burshtein et al. suggested searching over the quantizers satisfying Lemma 2 [2]. There are $\binom{M-1}{K-1}$ such quantizers, formed by selecting the values $a_1$ to $a_{K-1}$ from the set of integers $\{1, \ldots, M-1\}$. In this case, the complexity grows as $M^{K-1}$. This is polynomial in $M$ when $K$ is fixed. However, for the important case of $M/K$ fixed, the complexity remains exponential.
On the other hand, the Quantizer Design Algorithm has polynomial complexity $M^3$ in the worst case, and more generally has complexity $K(M-K)^2$. The main computational burden is to precompute $g(a_l \to a_r)$ in step 3. Since $a_l$ is from a set of size $M$ and $a_r$ is from a set of size at most $M-K+1$, the number of computations is proportional to $M(M-K+1)$. Note that in (25), the sum on $i$ could be over as many as $M-K+1$ terms. However, since this sum can be computed recursively, the complexity remains proportional to $M(M-K+1)$. Also, for each $z$ in step 4, roughly $(M-K)^2$ add/compare operations are needed, and there are $K$ such steps. This results in a number of operations roughly $K(M-K)^2$. For fixed $M$, the expression $K(M-K)^2$ has its maximum value at $K = M/3$. The maximum value is $\frac{4}{27} M^3$, so the complexity is proportional to $M^3$.
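The complexity comparison can be illustrated with a few lines of arithmetic. The sketch assumes a rough operation count of K(M-K)^2 for the dynamic program, consistent with the cubic worst case stated in this section, and counts contiguous quantizers by choosing K-1 interior boundaries among M-1 positions. The function names and the value M = 30 are illustrative.

```python
from math import comb

def contiguous_quantizer_count(M, K):
    """Number of ways to split M sorted outputs into K contiguous, non-empty blocks."""
    return comb(M - 1, K - 1)

def dp_operations(M, K):
    """Rough add/compare count assumed for the dynamic program."""
    return K * (M - K) ** 2

M = 30
# K that maximizes the assumed operation count for this M.
worst_K = max(range(1, M), key=lambda K: dp_operations(M, K))
worst_ops = dp_operations(M, worst_K)
exhaustive = contiguous_quantizer_count(M, worst_K)
```

For `M = 30` the count peaks at `K = 10`, one third of `M`, with 4000 operations, which equals 4/27 of `M` cubed, while an exhaustive search over contiguous quantizers at the same `K` would visit millions of candidates.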
Dynamic programming principles are used to show the optimality of the algorithm, and details are in the Appendix, part .4. The complexity result, along with the proof of optimality in the Appendix, proves the Theorem.
4.4 Example: Finely Quantized Continuous-Output Channel
While the Quantizer Design Algorithm can only be applied to discrete channels, it can be used to obtain good coarse quantization of a continuous-output channel, by first uniformly and finely quantizing to $M$ levels. This can be illustrated for the binary-input AWGN channel with inputs and Gaussian noise variance .
Fig. 4 was created by first uniformly quantizing the AWGN channel between and with or steps; the two input symbols are equally likely. Then, the Quantizer Design Algorithm was applied with quantizer outputs. The figure shows the quantization boundaries of the underlying finely quantized DMC.
Fig. 5 shows similar quantization for an asymmetric channel, where the Gaussian variance is input-dependent, with . In this case, the two finely quantized DMCs have and . Since the channel is asymmetric, the optimum quantizers are also asymmetric. It is of interest to see that the center boundary crosses the value 0 as increases. Note also the non-monotonic behavior of the rightmost boundary.
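The fine quantization step can be sketched as follows. All specifics here are assumptions for illustration: the inputs are taken as +1 and -1, the function and parameter names (`awgn_to_dmc`, `sigma`, `y_max`) are hypothetical, and the two outermost cells are extended to infinity so that each row of the resulting DMC sums to one. The values used in the figures are not reproduced.

```python
import math

def awgn_to_dmc(sigma, M, y_max):
    """Uniformly quantize a binary-input AWGN channel output into M cells.

    Interior cell edges are evenly spaced on [-y_max, y_max]; the first and
    last cells extend to -infinity and +infinity. Returns P[j][i] with
    j = 0 for input +1 and j = 1 for input -1.
    """
    def cdf(y, mean):
        return 0.5 * (1.0 + math.erf((y - mean) / (sigma * math.sqrt(2.0))))

    edges = ([-math.inf]
             + [-y_max + 2.0 * y_max * i / M for i in range(1, M)]
             + [math.inf])
    return [[cdf(edges[i + 1], mean) - cdf(edges[i], mean) for i in range(M)]
            for mean in (+1.0, -1.0)]

# Assumed parameters, for illustration only.
P = awgn_to_dmc(sigma=1.0, M=20, y_max=4.0)
likelihood_ratios = [P[0][i] / P[1][i] for i in range(20)]
```

Because the cells are ordered by increasing output value, the likelihood ratios are already increasing, so the resulting DMC can be passed to a quantizer design routine without relabeling.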
5 Discussion
For a binary-input discrete memoryless channel, the main contribution of this paper is an algorithm that finds an optimal quantizer with cubic complexity in the number of channel outputs $M$. Previous results on optimal channel quantization were restricted to symmetrical channels or a small number of quantizer outputs. The Quantizer Design Algorithm may have various applications beyond the quantization of physical channels; already the design of polar codes and implementation of LDPC decoders have been identified as targets.
A result from statistical learning theory was applied to prove optimality. Since conditional entropy satisfies the conditions of an impurity function, Burshtein et al.’s theorem on the convexity of the optimal mapping was used [2].
In order to obtain an efficient algorithm, attention was restricted to binary inputs, so that convexity in one dimension corresponded to grouping contiguous integers. However, the theorem of Burshtein et al., and correspondingly Lemma 2, is not restricted to binary-input channels, and could be used to develop quantizer design algorithms for non-binary channels. Such an extension to a $J$-input channel requires efficient enumeration of sets formed from separating hyperplanes in $(J-1)$-dimensional space. As previously noted [2], generalized counting functions [25] give a bound on complexity. Suboptimal algorithms for tree classification can find good quantizers, if optimality is not required [6]. However, finding an optimal and efficient quantization algorithm remains a difficult problem.
A natural question is how to find the jointly optimal input distribution and channel quantizer for a given DMC. We have already considered a simple extension of the quantization algorithm which either finds the jointly optimal input distribution and channel quantizer or declares a failure [31]. However, this is a convex-concave optimization problem, a class of problems which is difficult, and finding the jointly optimal solution remains an open problem. The generalization from mutual information to other information-theoretic metrics is also of interest. While we have already considered the random error exponent [30] as a generalization of the cutoff rate, the handling of the new metrics is also an open problem.
6 Acknowledgments
The authors extend gratitude to David Burshtein for pointing out the connection between channel quantization and learning theory, and showing the use of [2] to obtain straightforward proofs.
.1 Proof of Lemma 1
Background on convex optimization and the proof of Lemma 1 are given.
First, it is shown that mutual information (3) is convex in $Q_{k|i}$ as well as $T_{k|j}$, for fixed $p_j$ and fixed $P_{i|j}$. The relationship $T_{k|j} = \sum_{i} Q_{k|i} P_{i|j}$ is an affine transform. If a function is convex, then it is also convex in an affine transform of the original arguments [26, Sec. 3.2], so mutual information is convex in $Q_{k|i}$.
Referring to (10), the following is a well-known result from concave optimization (or, concave minimization):
Lemma 4.
[22, Theorem 1.19] A concave (convex) function attains its global minimum (maximum) over $S$ at an extreme point of $S$.
If $S$ is a polytope, as in this paper, then the extreme points are its vertices. There may be multiple local maxima when the objective function is convex. Lemma 4 can be visualized in one dimension in Fig. 2(a), where it is clear that the maximum must be at either endpoint 0 or 1, that is, the vertices of the line segment that is the feasible region.
To prove Lemma 1, observe that the constraints on the variables $Q_{k|i}$ are given by (11). For any fixed $i$, the variables $Q_{1|i}, \ldots, Q_{K|i}$ are restricted to be in the polytope which is a $(K-1)$-dimensional simplex:
(27) $\left\{ (Q_{1|i}, \ldots, Q_{K|i}) : Q_{k|i} \geq 0, \ \sum_{k=1}^{K} Q_{k|i} = 1 \right\}$
For the vertices of this polytope, $Q_{k|i} = 0$ or 1, with $\sum_{k} Q_{k|i} = 1$. By Lemma 4, there is at least one vertex which attains an optimal solution, that is, there exists an optimal quantizer which is deterministic. ∎
Note that if mutual information is strictly convex in $Q_{k|i}$, then at least one extreme point will have mutual information strictly greater than any stochastic quantizer, that is, all optimal quantizers will be deterministic.
.2 Minimum Average Impurity
Burshtein et al. showed that a classifier which minimizes the average impurity induces a convexity condition [2]. The main result is given here verbatim but with renamed variables, in order to elucidate the notation. The proof of Lemma 2 is given in the next section.
Let be jointly-distributed random variables with values in , and let be the convex hull of the range of . Let be a finite set , and let be concave in its second argument.
Let so that is a random variable on , where denotes expectation. For any measurable partition , define the objective function (to be minimized) by:
(28)  
(29) 
Explicitly,
(30) 
Under these conditions, the following holds:
Lemma 5.
[2, Theorem 1] For any , there exists such that and such that is convex for all .
.3 Proof of Lemma 2
To prove Lemma 2, we show that is the form of (30), and thus minimizing is equivalent to maximizing as in (4). Here,
(31)  
(32) 
where .
Define as
(33) 
where
(34) 
Then, by denoting :
(35) 
Explicitly giving as:
(36) 
where . We can see:
(37) 
where dependence of on is not needed and dropped. Note that is identical to , and so is expressed in the form of (30) via (31).
Consider now which explicitly is:
(38) 
These are the values given earlier in (15). Then, is a random variable on , which takes the value (38) with probability . Here, and are random variables, both on the sample space . A quantizer defined for is equivalent to a quantizer for . The space mapped to is transformed from to . The points are interpreted as points in a dimensional Euclidean space, inside the convex hull . It is on this space that is convex for all .
.4 Proof of Theorem’s Optimality Part
In this section, the optimality part of the Theorem is proved. Recall that by Lemma 3 an optimal quantizer satisfies (20), where $a^*_0, \ldots, a^*_K$ are the quantizer boundaries. In the language of dynamic programming, a problem exhibits optimal substructure if the optimal solution contains optimal solutions to subproblems. If this condition holds, then dynamic programming provides the optimal solution, and moreover, the optimal substructure should be exploited in the optimization [27, Sec. 15.3].
For the Theorem, the subproblem consists of finding the quantizer which maximizes partial mutual information for some partial quantization of the outputs. In detail, recall $S_z(a)$ is the maximum of partial mutual information when channel outputs 1 to $a$ are quantized to quantizer outputs 1 to $z$,
(39) $S_z(a) = \max \sum_{k=1}^{z} g(a_{k-1} + 1 \to a_k)$
where the maximization is over $0 = a_0 < a_1 < \cdots < a_{z-1} < a_z = a$.
For fixed $z$ and $a$, assume that $S_z(a)$ is the maximum of partial mutual information, corresponding to an optimal quantization of channel outputs 1 to $a$ to the quantizer outputs 1 to $z$. Let $a'$ be the boundary of quantizer output $z - 1$, that is:
(40) $a' = \arg\max_{a''} \left( S_{z-1}(a'') + g(a''+1 \to a) \right)$
so that the preimage of quantizer output $z$ is $\{a'+1, \ldots, a\}$. Then, the quantizer for channel outputs 1 to $a'$ must also be optimal. This is true because if another quantizer of 1 to $a'$ produced higher mutual information, then the quantization of 1 to $a$ would also have higher partial mutual information, leading to a contradiction of the assumption that 1 to $a$ was optimally quantized.
Along with the earlier statement that the complexity is proportional to $M^3$, the proof of the Theorem is completed. ∎
Footnotes
 Statistical learning literature reverses these, using $Y$ as the variable of interest, and $X$ as the observation, since the input to the classifier is $X$, and the output of the classifier is an estimate of $Y$.
References
 T. M. Cover and J. A. Thomas, Elements of Information Theory. Wiley, 1991.
 D. Burshtein, V. Della Pietra, D. Kanevsky, and A. Nádas, “Minimum impurity partitions,” The Annals of Statistics, vol. 20, no. 3, pp. 1637–1646, 1992.
 P. M. Pardalos and J. B. Rosen, “Methods for global concave minimization: A bibliographic survey,” SIAM Review, vol. 28, no. 3, pp. 367–379, 1986.
 I. Tal and A. Vardy, “How to construct polar codes,” IEEE Transactions on Information Theory, vol. 59, no. 10, pp. 6562–6582, October 2013.
 C. M. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.
 P. A. Chou, “Optimal partitioning for classification and regression trees,” IEEE Transactions on pattern analysis and machine intelligence, vol. 13, pp. 340–354, April 1991.
 L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees. Belmont, CA, USA: Wadsworth, 1984.
 D. Coppersmith, S. J. Hong, and J. R. M. Hosking, “Partitioning nominal attributes in decision trees,” Data Mining and Knowledge Discovery, vol. 3, no. 3, pp. 197–217, 1999.
 S. Arimoto, “An algorithm for computing the capacity of arbitrary discrete memoryless channels,” IEEE Transactions on Information Theory, vol. 18, pp. 14–20, January 1972.
 R. E. Blahut, “Computation of channel capacity and rate-distortion functions,” IEEE Transactions on Information Theory, vol. 18, pp. 460–473, July 1972.
 N. Tishby, F. C. Pereira, and W. Bialek, “The information bottleneck method,” in Proceedings 37th Annual Allerton Conference on Communication, Control, and Computing, (Monticello, IL, USA), September 1999.
 J. M. Wozencraft and R. S. Kennedy, “Modulation and demodulation for probabilistic coding,” IEEE Transactions on Information Theory, vol. 12, pp. 291–297, July 1966.
 J. M. Wozencraft and I. M. Jacobs, Principles of Communication Engineering. Prospect Heights, IL, USA: Waveland Press Inc., 1990. Originally Published 1965 by Wiley.
 J. L. Massey, “Coding and modulation in digital communications,” in Proceedings of International Zurich Seminar on Digital Communications, (Zurich, Switzerland), pp. E2(1)–E2(4), 1974.
 L. Lee, “On optimal soft-decision demodulation,” IEEE Transactions on Information Theory, vol. 22, pp. 437–444, July 1976.
 X. Ma, X. Zhang, H. Yu, and A. Kavcic, “Optimal quantization for soft-decision decoding revisited,” in International Symposium on Information Theory and its Applications, (Xian, China), October 2002.
 A. D. Liveris and C. N. Georghiades, “On quantization of low-density parity-check coded channel measurements,” in Proceedings IEEE Global Telecommunications Conference, (San Francisco, CA, USA), pp. 1649–1653, December 2003.
 J. Singh, O. Dabeer, and U. Madhow, “On the limits of communication with low-precision analog-to-digital conversion at the receiver,” IEEE Transactions on Communications, vol. 57, pp. 3629–3639, December 2009.
 G. Zeitler, A. C. Singer, and G. Kramer, “Low-precision A/D conversion for maximum information rate in channels with memory,” IEEE Transactions on Communications, vol. 60, pp. 2511–2521, September 2012.
 J. Wang, T. Courtade, H. Shankar, and R. D. Wesel, “Soft information for LDPC decoding in flash: Mutual-information optimized quantization,” in Proceedings IEEE Global Telecommunications Conference, pp. 1–6, IEEE, 2011.
 R. Horst and P. M. Pardalos, eds., Handbook of Global Optimization. Kluwer Academic Publishers, 1995.
 R. Horst, P. M. Pardalos, and N. V. Thoai, Introduction to Global Optimization. Kluwer Academic Publishers, 1995.
 T. Koch and A. Lapidoth, “At low SNR, asymmetric quantizers are better,” IEEE Transactions on Information Theory, vol. 59, no. 9, pp. 5421–5445, 2013.
 J. Singh, O. Dabeer, and U. Madhow, “Communication limits with low-precision analog-to-digital conversion at the receiver,” in International Conference on Communications, Circuits and Systems, IEEE, 2007.
 T. M. Cover, “Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition,” IEEE Transactions on Electronic Computers, no. 3, pp. 326–334, 1965.
 S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.
 T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms, Second Edition. The MIT Press, 2001.
 B. Kurkoski, K. Yamaguchi, and K. Kobayashi, “Noise thresholds for discrete LDPC decoding mappings,” in Proceedings IEEE Global Telecommunications Conference, (New Orleans, USA), pp. 1–5, November–December 2008.
 B. Kurkoski and H. Yagi, “Concatenation of a discrete memoryless channel and a quantizer,” in Proceedings of the IEEE Information Theory Workshop, (Cairo, Egypt), pp. 160–164, January 2010.
 H. Yagi and B. M. Kurkoski, “Channel quantizers that maximize random coding exponent for binary-input memoryless channels,” in Proceedings IEEE International Conference on Communications, (Ottawa, Canada), pp. 2256–2260, June 2012.
 B. M. Kurkoski and H. Yagi, “Finding the capacity of a quantized binary-input DMC,” in Proceedings of IEEE International Symposium on Information Theory, (Cambridge, Massachusetts, USA), pp. 691–695, July 2012.