Autoencoder-Based Error Correction Coding for One-Bit Quantization
Abstract
This paper proposes a novel deep learning-based error correction coding scheme for AWGN channels under the constraint of one-bit quantization in the receivers. Specifically, it is first shown that the optimum error correction code that minimizes the probability of bit error can be obtained by perfectly training a special autoencoder, in which “perfectly” refers to converging to the global minimum. However, perfect training is not possible in most cases. To approach the performance of a perfectly trained autoencoder with suboptimum training, we propose utilizing turbo codes as an implicit regularization, i.e., using a concatenation of a turbo code and an autoencoder. It is empirically shown that this design gives nearly the same performance as the hypothetically perfectly trained autoencoder, and we also provide a theoretical proof of why that is so. The proposed coding method is as bandwidth efficient as the integrated (outer) turbo code, since the autoencoder exploits the excess bandwidth from pulse shaping and packs signals more intelligently thanks to sparsity in neural networks. Our results show that the proposed coding scheme at finite block lengths outperforms conventional turbo codes even for QPSK modulation. Furthermore, the proposed coding method can make one-bit quantization operational even for QAM.
I Introduction
Wireless communication systems are trending towards ever higher carrier frequencies, due to the large bandwidths available [1]. These high frequencies are made operational by the use of large co-phased antenna arrays to enable directional beamforming. Digital control of these arrays is highly desirable, but requires a very large number of analog-to-digital converters (ADCs) at the receiver or digital-to-analog converters (DACs) at the transmitter, each of which consumes nontrivial power and implementation area [2]. Low-resolution quantization is thus inevitable to enable digital beamforming in future systems. However, little is known about optimum communication techniques in a low-resolution environment.
In this paper, we focus on error correction codes for the one-bit quantized channel, where just the sign of the real and imaginary parts is recorded by the receiver ADC. Conventional coding techniques, which mainly target unquantized additive white Gaussian noise (AWGN) channels or other idealized models, are not well suited for this problem. Deep learning is an interesting paradigm for developing channel codes for low-resolution quantization, motivated by its previous success for some other difficult problems, e.g., see [3] for learning transmit constellations, [4] for joint channel estimation and data detection, [5] for one-bit OFDM communication, or [6] for several other problems. This paper develops a novel approach which concatenates a conventional turbo code with a deep neural network – specifically, an autoencoder – to approach theoretical benchmarks and achieve compelling error probability performance.
I-A Related Work and Motivations
Employing a neural network for decoding linear block codes was proposed in the late eighties [7]. Similarly, the Viterbi decoder was implemented with a neural network for convolutional codes in the late nineties [8], [9]. These studies learn a simple classifier rather than a decoding algorithm. The training dataset must therefore include all codewords, which makes the approach infeasible for most codes due to the exponential complexity. Recently, it was shown that a decoding algorithm could be learned for structured codes [10]; however, this design still requires a dataset with at least percent of the codebook, which limits its practicality to small block lengths. To learn decoding for large block lengths, [11] trained a recurrent neural network on small block lengths that generalizes well to large block lengths. Furthermore, [12] improves the belief propagation algorithm by assigning trainable weights to the Tanner graph for high-density parity check (HDPC) codes that can be learned from a single codeword, which avoids the curse of dimensionality.
Most of the prior studies are aimed at learning and/or improving the performance of decoding algorithms through the use of a neural network. There are only a few papers that aim to learn an encoder, which is more difficult than learning a decoder due to the difficulties of training the lower layers in deep networks [13], [14], [15]. We specifically design a channel code for the challenging one-bit quantized AWGN channels via an autoencoder to obtain reliable communication at the Shannon rate. The closest study to our paper that we know of is [3], which proposes an autoencoder to learn transmit constellations such as M-QAM. However, [3] does not aim to achieve a very small error probability (close to the Shannon bound) and quantization is not considered.
I-B Contributions
Our contributions are (i) to show that near-optimum hand-crafted channel codes can be equivalently obtained by perfectly training a special autoencoder, which however is not possible in practice, and (ii) to design a novel and practical autoencoder-based channel coding scheme that is well suited for receivers with one-bit quantization.
Designing an optimum channel code is equivalent to learning an autoencoder. We first show that the mathematical model of a communication system can be represented by a regularized autoencoder, where the regularization comes from the channel and RF modules. Then, it is formally proven that an optimum channel code can be obtained by perfectly training the parameters of the encoder and decoder – where “perfectly” means finding the global minimum of its loss function – of a specially designed autoencoder architecture. However, autoencoders cannot be perfectly trained, so suboptimum training policies are utilized. This is particularly true for one-bit quantization, which further impedes training due to its zero gradient. Hence, we propose a suboptimum training method and justify its efficiency by theoretically finding the minimum required SNR level that yields almost zero detection error, which could be obtained if the autoencoder parameters were trained perfectly, and prove the existence of a global minimum. This is needed because we cannot empirically obtain the performance of a perfectly trained autoencoder, since training gets stuck in a local minimum. In what follows, observing the SNRs obtained with suboptimum training and comparing them with the case of perfect training allows us to characterize the efficiency.
Designing a practical coding scheme for one-bit receivers. Although one-bit quantization has been extensively studied, e.g., [16], [17], [18], [19], there is no paper to our knowledge that designs a channel code specifically for one-bit quantization. We fill this gap by developing a novel deep learning-based coding scheme that combines turbo codes with an autoencoder. Specifically, we first suboptimally train an autoencoder, and then integrate a turbo code with this autoencoder, which acts as an implicit regularizer. The proposed coding method is as bandwidth efficient as just using the turbo code, because the autoencoder packs the symbols intelligently by exploiting its sparsity stemming from the use of a rectified linear unit (ReLU) activation function and exploits the pulse shaping filter’s excess bandwidth by using faster-than-Nyquist transmission. It is worth emphasizing that conventional channel codes are designed according to the traditional orthogonal pulses with symbol rate sampling and cannot take advantage of excess bandwidth. The numerical results show that our method can approach the performance of a perfectly trained autoencoder. For example, the proposed coding scheme can compensate for the performance loss of QPSK modulation at finite block lengths due to the one-bit ADCs, and significantly improve the error rate in case of QAM, in which case one-bit quantization does not usually work even with powerful turbo codes. This success is theoretically explained by showing that the autoencoder produces Gaussian distributed data for the turbo decoder even if there are some nonlinearities in the transmitters/receivers that result in non-Gaussian noise.
This paper is organized as follows. The mathematical model of a communication system is introduced as a channel autoencoder in Section II. Then, the training imperfections are quantified by finding the minimum required SNR level that achieves almost zero detection error for the one-bit quantized channel autoencoder in Section III. The channel code is designed in Section IV, and its performance is given in Section V. The paper concludes in Section VI.
II Channel Autoencoders
Autoencoders are a special type of feedforward neural network involving an “encoder” that transforms the input message to a codeword via hidden layers and a “decoder” that approximately reconstructs the input message at the output using the codeword. This does not mean that autoencoders strive to copy the input message to the output. On the contrary, the aim of an autoencoder is to extract lower dimensional features of the inputs by hindering the trivial copying of inputs to outputs. Different types of regularization methods have been proposed for this purpose based on denoising [20], sparsity [21], and contraction [22], which are termed regularized autoencoders. A special type of regularized autoencoder inherently emerges in communication systems, where the physical channel as well as the RF modules of transmitters and receivers behave like an explicit regularizer. We refer to this structure as a channel autoencoder, where channel refers to the type of regularization.
The mathematical model of a communication system is a natural partner to the structure of a regularized autoencoder, since a communication system has the following ingredients:

A message set , in which message is drawn from this set with probability

An encoder that yields length codewords

A channel that takes an input from alphabet and outputs a symbol from alphabet

A decoder that estimates the original message from the received length sequence
In regularized autoencoders, these 4 steps are performed as determining an input message, encoding this message, regularization, and decoding, respectively. To visualize this analogy, the conventional representation of a communication model is portrayed as an autoencoder that performs a classification task in Fig. 1.
The fundamental distinction between a general regularized autoencoder and a communication system is that the former aims to learn useful features to make better classification/regression by sending messages, whereas the latter aims to minimize communication errors by designing hand-crafted features (codewords). This analogy is leveraged in this paper to design efficient coding methods by treating a communication system as a channel autoencoder for a challenging communication environment, in which designing a hand-crafted code is quite difficult. In this manner, we show that finding the optimum encoder-decoder pair with coding theory in the sense of minimum probability of bit error can give the same encoder-decoder pair that is learned through a regularized autoencoder.
An autoencoder aims to jointly learn a parameterized encoder-decoder pair by minimizing the reconstruction error at the output. That is,
(1) 
where and are the encoder and decoder parameters of and respectively, and
(2) 
where is the input training vector and is the number of training samples. To find the best parameters that minimize the loss function, is defined as the negative log likelihood of . The parameters are then trained through backpropagation and gradient descent using this loss function. The same optimization appears in a slightly different form in conventional communication theory. In this case, encoders and decoders are determined so as to minimize the transmission error probability given by
(3) 
where
(4) 
for a given , and signal-to-noise ratio (SNR). Note that (3) can be solved either by human ingenuity or a brute-force search. For the latter, if all possible mappings of the information bits to the codewords are evaluated under maximum likelihood detection, the optimum linear block code can be found in terms of minimum probability of error. However, the number of such mappings grows exponentially, so this search is computationally prohibitive (NP-hard). Thus, we propose an alternative autoencoder-based method to solve (3).
Theorem 1.
Proof.
See Appendix A. ∎
Remark 1.
Theorem 1 states that a special autoencoder that is framed for the mathematical model of a communication system, which was defined in Shannon’s coding theorem, can be used to obtain the optimum channel codes for any block length. This is quite important, because there is not any known tool that gives the optimum code as a result of the mathematical modeling of a communication system. Shannon’s coding theorem only states that there is at least one good code without specifying what it is, and only for infinite block lengths. Hence, autoencoders can in principle be used for any kind of environment to find optimum error correction codes. However, the autoencoder must be perfectly trained, which is challenging or impossible.
III Quantifying Training Imperfections in Channel Autoencoders
The channel autoencoder specified in Theorem 1 would negate the need to design sophisticated hand-crafted channel codes for challenging communication environments, if it were trained perfectly. However, training an autoencoder is a difficult task, because of the high probability of getting stuck in a local minimum. This can stem from many factors such as random initialization of parameters, selection of inappropriate activation functions, and the use of heuristics to adapt the learning rate. Handling these issues is particularly difficult for deep neural networks, which leads to highly suboptimum training and generalization error. Put differently, these are key reasons why deep neural networks were not successfully trained until the seminal work of [23], which proposed a greedy layer-wise unsupervised pre-training for initialization. In addition to this, there were other improvements related to better understanding of activation functions, e.g., using a sigmoid activation function hinders the training of lower layers due to saturated units at the top hidden layers [24]. Despite these advances, there is still no universal training policy that can guarantee approaching the global minimum, and using suboptimum training, which usually converges to a local minimum of the loss function, is inevitable.
To quantify how well a suboptimum training approach can perform, we need to know the performance of the perfectly trained autoencoder. However, finding this empirically is not possible due to getting stuck in one of the local minima. Hence, we first find the minimum required SNR to have bit error probability approaching zero (in practice, less than ). Such a low classification error can usually be achieved only if the parameters attain the global minimum of the loss function, corresponding to perfect training. Then, we quantify the training imperfections in terms of SNR loss with respect to this minimum SNR, which serves as a benchmark. Since our main goal is to design channel codes for one-bit quantized AWGN channels, which is treated as a one-bit quantized AWGN channel autoencoder, this method is used to quantify the training performance of this autoencoder. Here, one-bit quantization enables us to save hardware complexity and power consumption for communication systems that utilize an ever-increasing number of antennas and bandwidth, particularly at high carrier frequencies [5]. In the rest of this section, we first determine the minimum required SNR level for the one-bit quantized AWGN channel autoencoder above which the autoencoder can achieve zero classification error (or bit error rate), and then formally show there exists a global minimum and at least one set of encoder-decoder pair parameters converges to this global minimum.
III-A Minimum SNR for Reliable Coding for One-Bit Quantized Channel Autoencoders
The encoder and decoder of the one-bit quantized AWGN channel autoencoder are parameterized via two separate hidden layers with a sufficient number of neurons (or width). (Footnote 1: We prefer to use a single layer with a large number of neurons instead of multiple hidden layers with fewer neurons to make the analysis simpler and clearer without any loss of generality.) To have a tractable analysis, a linear activation function is used at the encoder – whereas there can be any nonlinear activation function in the decoder – and there is a softmax activation function at the output. Since an autoencoder is trained with a global reconstruction error function, nonlinearities in the system can be captured thanks to the decoder even if the encoder portion is linear.
To satisfy Theorem 1, one-hot coding is employed for the one-bit quantized AWGN channel autoencoder, which yields a multi-class classification. Specifically, the message from the message set is first coded to the bit information sequence s. Then, s is converted into x using one-hot coding, and encoded with , which yields an bit codeword. Adding the noise to this encoded signal produces the unquantized received signal, which is given by
(5) 
where z is the additive Gaussian noise with zero mean and variance , and is the encoder parameters. Here, complex signals are expressed as a real signal by concatenating the real and imaginary part. Notice that there is a linear activation function in the encoder.
One-bit quantization, which is applied element-wise, constitutes the quantized received signal
(6) 
The one-bit quantized received signal is processed by the decoder via the parameters followed by the softmax activation function, which leads to , where the output vector is such that . The parameters of and are trained by minimizing the cross entropy function between the input and output layer. This can be equivalently considered as minimizing the distance between the empirical and predicted conditional distributions. It is then trivial to obtain the estimate of from .
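The signal flow described above (one-hot input, linear encoder, additive Gaussian noise, element-wise one-bit quantization, decoder with softmax output, cross-entropy loss) can be sketched as a single forward pass. This is a minimal illustration rather than the paper's implementation: the names `enc_w`, `dec_w1`, `dec_w2`, the layer sizes, the noise level, and the choice of ReLU as the decoder nonlinearity are all assumptions made here, and the weights are only randomly initialized, not trained.

```python
import numpy as np

def one_hot(index, size):
    """Return a one-hot vector of the given size."""
    v = np.zeros(size)
    v[index] = 1.0
    return v

def softmax(a):
    """Numerically stable softmax."""
    e = np.exp(a - np.max(a))
    return e / e.sum()

def forward(msg_index, enc_w, dec_w1, dec_w2, noise_std, rng):
    """One pass through a one-bit quantized channel autoencoder (sketch)."""
    x = one_hot(msg_index, enc_w.shape[1])            # one-hot coded message
    noise = noise_std * rng.standard_normal(enc_w.shape[0])
    y = enc_w @ x + noise                             # linear encoder + AWGN
    r = np.where(y >= 0.0, 1.0, -1.0)                 # element-wise one-bit quantization
    h = np.maximum(dec_w1 @ r, 0.0)                   # decoder hidden layer (ReLU assumed)
    p = softmax(dec_w2 @ h)                           # predicted message distribution
    return p, int(np.argmax(p))

def cross_entropy(p, msg_index):
    """Cross entropy between the one-hot input and the softmax output."""
    return -np.log(p[msg_index] + 1e-12)

# Hypothetical sizes: 4 messages, 16 real channel uses, 32 decoder neurons.
rng = np.random.default_rng(0)
num_messages, n, hidden = 4, 16, 32
enc_w = rng.standard_normal((n, num_messages))        # Gaussian initialization
dec_w1 = rng.standard_normal((hidden, n))
dec_w2 = rng.standard_normal((num_messages, hidden))

p, estimate = forward(2, enc_w, dec_w1, dec_w2, noise_std=0.5, rng=rng)
loss = cross_entropy(p, 2)
```

The message estimate is simply the index of the largest softmax output, which is the sense in which the estimate follows trivially from the output vector.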
The mutual information between the input and output vector is equal to the channel capacity (Footnote 2: The encoder and decoder of the autoencoder are considered part of the wireless channel, because there is some randomness in the encoder and decoder stemming from the random initialization of parameters, which affects the capacity.)
(7) 
Assuming that symbols are independent and identically distributed, can be simplified to
(8) 
where (a) is due to the chain rule, (b) is due to independence, and (c) comes from the identical distribution assumption. The capacity of the one-bit quantized AWGN channel autoencoders can then be readily found as
(9) 
It is not analytically tractable to express in closed form due to the decoder that yields non-Gaussian noise. However, (7) can be equivalently expressed by replacing with thanks to the data processing inequality, which qualitatively states that clever manipulations of data cannot enhance the inference, i.e., .
Lemma 1.
The mutual information between s and r in the case of a one-bit quantized channel autoencoder is
(10) 
where is the transmit SNR, provided the encoder parameters are initialized with Gaussian random variables.
Proof.
See Appendix B. ∎
It is worth emphasizing that the most common weight initialization in deep neural networks is to use Gaussian random variables [24], [25]. The minimum required SNR for the one-bit quantized AWGN channel autoencoder can be trivially found through Lemma 1 when the code rate is equal to the capacity. That is, . Specifically, the capacity is numerically evaluated in Fig. 2 using Lemma 1 so as to determine the minimum required SNR to suppress the regularization impact for the one-bit quantized AWGN channel autoencoder. To illustrate, for a code rate of , we find dB. This means that if the one-bit quantized AWGN channel autoencoder is perfectly trained, it gives almost zero classification error above an SNR of dB.
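The procedure of setting the code rate equal to the capacity and solving for the SNR can be sketched numerically. Because the expression in Lemma 1 is not reproduced here, the sketch below substitutes a different, textbook capacity formula as a stand-in: a real AWGN channel with equiprobable antipodal inputs and a one-bit (sign) quantizer, which reduces to a binary symmetric channel with crossover probability Q(sqrt(SNR)). The function names, the bisection bounds, and the example rate of 0.5 are assumptions for illustration only, so the resulting SNR is not the value reported in the paper.

```python
import math

def q_function(x):
    """Gaussian tail probability Q(x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def binary_entropy(p):
    """Binary entropy in bits, with H(0) = H(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def one_bit_bpsk_capacity(snr_db):
    """Stand-in capacity (bits per real channel use), NOT Lemma 1.

    Model assumed: y = sqrt(snr) * x + z, x in {-1, +1}, z ~ N(0, 1),
    followed by a sign quantizer, i.e. a BSC with crossover Q(sqrt(snr)).
    """
    snr = 10.0 ** (snr_db / 10.0)
    return 1.0 - binary_entropy(q_function(math.sqrt(snr)))

def min_snr_db(rate, lo=-30.0, hi=30.0, tol=1e-6):
    """Bisection for the SNR (dB) at which the capacity equals the rate."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if one_bit_bpsk_capacity(mid) < rate:
            lo = mid        # capacity too small: need more SNR
        else:
            hi = mid
    return 0.5 * (lo + hi)

example_rate = 0.5          # hypothetical rate, chosen only for the demo
threshold_db = min_snr_db(example_rate)
```

Bisection is sufficient here because this stand-in capacity increases monotonically with SNR; the same search would apply to any capacity curve with that property.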
III-B Existence of the Global Minimum
To achieve zero classification error above the minimum required SNR, the parameters of the encoder and decoder must be trained such that the loss function converges to the global minimum. Next, we prove that there exists a global minimum and at least one set of encoder-decoder parameters converges to this global minimum.
Theorem 2.
For channel autoencoders, there is a global minimum and at least one set of encoder-decoder pair parameters converges to this global minimum above the minimum required SNR.
Proof.
The depth and width of the neural layers in an autoencoder are determined beforehand, and these do not change dynamically. This means that and – and hence the code rate – is fixed. With sufficient SNR, one can ensure that this code rate is below the capacity, in which case Shannon’s coding theorem guarantees reliable (almost zero error) communication. To satisfy this for the autoencoder implementation of communication systems, the necessary and sufficient conditions in the proof of Shannon’s channel coding theorem must hold, which are (i) random code selection; (ii) jointly typical decoding; (iii) no constraint for unboundedly increasing the block length. It is straightforward to see that (i) is satisfied, because the encoder parameters are randomly initialized. Hence, the output of the encoder gives a random codeword. For (ii), Theorem 1 shows that the aforementioned autoencoder results in maximum likelihood detection. Since maximum likelihood detection is a stronger condition than jointly typical decoding to make optimum detection, it covers the condition of jointly typical decoding and so (ii) is satisfied as well. For the last step, there is no constraint that limits the width of the encoder layer. This means that (iii) is trivially met. Since channel autoencoders satisfy Shannon’s coding theorem, which states there is at least one good channel code to yield zero error communication, there exists a global minimum that corresponds to the zero error communication, which can be achieved with at least one set of encoder-decoder parameters. ∎
It is not easy to converge to encoder-decoder parameters that result in the global minimum due to the difficulties in training deep networks as mentioned previously. Additionally, the required one-hot coding in the architecture exponentially increases the input dimension, which renders it infeasible for practical communication systems, especially for high-dimensional communication signals. Thus, more practical autoencoder architectures are needed to design channel codes for one-bit quantization without sacrificing the performance.
IV Practical Code Design for One-Bit Quantization
To design a coding scheme under the constraint of one-bit ADCs for AWGN channels, our approach – motivated by Theorem 1 – is to make use of an autoencoder framework. Hence, we transform the code design problem for the one-bit quantized AWGN channel into the problem of learning an encoder-decoder pair for a special regularized autoencoder, in which the regularization comes from the one-bit ADCs and Gaussian noise. However, the one-hot encoding required by Theorem 1 is not an appropriate method for high-dimensional communication signals, because this exponentially increases the input dimension while training neural networks. Another challenge is that one-bit quantization stymies gradient-based learning for the layers before quantization, since it makes the derivative everywhere except at point , which is not even differentiable. To handle all these challenges, we propose to train a practical but suboptimum autoencoder architecture and stack it with a state-of-the-art channel code that is designed for AWGN channels, but not for one-bit ADCs. The details of this design are elaborated next. In what follows, we justify the novelty of the proposed model in terms of machine learning principles.
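The obstacle that one-bit quantization poses to gradient-based learning can be seen with a small numerical check. The sketch below, which uses a sign function as the quantizer and a central finite difference as a stand-in for the derivative (both choices, and all names, are assumptions for illustration), shows that the slope is zero away from the origin, so any gradient backpropagated through the quantizer to earlier layers is multiplied by zero.

```python
import numpy as np

def sign_quantizer(y):
    """One-bit quantizer: +1 for non-negative inputs, -1 otherwise."""
    return np.where(y >= 0.0, 1.0, -1.0)

def central_difference(f, y, eps=1e-4):
    """Finite-difference estimate of the element-wise derivative of f at y."""
    return (f(y + eps) - f(y - eps)) / (2.0 * eps)

# Points away from the origin: the quantizer is locally constant.
y = np.array([-1.5, -0.3, 0.2, 0.9])
local_slope = central_difference(sign_quantizer, y)

# Chain rule: whatever gradient arrives from the decoder side is zeroed out.
upstream_grad = np.array([0.7, -1.2, 0.4, 2.0])
grad_before_quantizer = upstream_grad * local_slope

# At the origin the finite difference blows up instead of converging,
# reflecting that the quantizer is not differentiable there.
slope_at_origin = central_difference(sign_quantizer, np.array([0.0]))
```

This is why the layers placed before the quantizer cannot be updated by plain backpropagation and need a separate training step.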
IV-A Autoencoder-Based Code Design
To design a practical coding scheme for one-bit quantized communication, we need a practical (suboptimum) one-bit quantized AWGN channel autoencoder architecture. For this purpose, the one-bit quantized OFDM architecture proposed in [5] is modified for AWGN channels and implemented with time domain oversampling considering the pulse shape. This architecture is depicted in Fig. 3, where the encoder includes the precoder, channel and equalizer. Note that there is noise between the and layers that represents the noisy output of the equalizer. The equalized signal is further one-bit quantized, which corresponds to hard decision decoding, i.e., the decoder processes the signals composed of . This facilitates training, as explained below.
In this model, the binary valued input vectors are directly fed into the encoder without doing one-hot coding. This means that the input dimension is for bits. The key aspect of this architecture is to increase the input dimension by before quantization. This dimension is further increased by , where while decoding the signal. Although it might seem that there is only one layer for the encoder in Fig. 3, this in fact corresponds to the two neural layers and the RF part as detailed in Fig. 3. The encoded signal is normalized to satisfy the transmission power constraint. There are layers in the decoder with the same dimension, in which the ReLU is used for activation. On the other hand, a linear activation function is used at the output, and the parameters are trained so as to minimize the mean square error between the input and output layer. Additionally, batch normalization is utilized after each layer to avoid vanishing gradients [26].
The two-step training policy proposed in [5] is used to train the aforementioned autoencoder. Accordingly, in the first step shown in Fig. 3, the decoder parameters are trained, whereas the encoder parameters are only randomly initialized, i.e., they are not trained due to the one-bit quantization. In the second step given in Fig. 3, the encoder parameters are trained according to the trained and frozen decoder parameters by using the stored values of and layers in the first step in a supervised learning setup. Here, the precoder in the transmitter is determined by the parameters . Then, the coded bits are transmitted using a pulse shaping filter over an AWGN channel. In particular, these are transmitted with period . In the receiver, the signal is processed with a matched filter , oversampled by , and quantized. This RF part corresponds to faster-than-Nyquist transmission, whose main benefit is to exploit the available excess bandwidth in the communication system. Notice that this transmission method is not employed in conventional codes, because it creates inter-symbol interference and leads to non-orthogonal transmission that degrades the tractability of the channel codes. The quantized signal is further processed by a neural layer or followed by another one-bit quantization so as to obtain the same layer in which the decoder parameters are optimized. The aim of the second one-bit quantization is to obtain exactly the same layer that the decoder expects, which would be impossible if the layer became a continuous valued vector. Since the decoder part of the autoencoder processes , the proposed model can be considered as having a hard decision decoder.
The one-bit quantized AWGN channel autoencoder architecture apparently violates Theorem 1 that assures the optimum coding, because neither one-hot coding nor a softmax activation function is used. Additionally, ideal training is not possible due to one-bit quantization. Thus, it does not seem possible to achieve almost zero error probability in detection with this suboptimum architecture and suboptimum training even if . To cope with this problem, we propose employing an implicit regularizer that can serve as a priori information. More specifically, turbo coding is combined with the proposed autoencoder without any loss of generality, i.e., other off-the-shelf coding methods can also be used.
The proposed coding scheme for AWGN channels under the constraint of one-bit ADCs is given in Fig. 4, where the outer code is the turbo code and the inner code is the one-bit quantized AWGN channel autoencoder. In this concatenated code, the outer code injects strong a priori information for the inner code. Specifically, the bits are first coded with a turbo encoder for a given coding rate and block length. Then, the turbo coded bits in one block are divided into smaller subblocks, each of which is sequentially processed (or coded) by the autoencoder. In this manner, the autoencoder behaves like a convolutional layer by multiplying the subblocks within the entire block with the same parameters. Additionally, dividing the code block into subblocks ensures reasonable dimensions for the neural layers. It is important to emphasize that the autoencoder does not consume further bandwidth. Rather, it exploits the excess bandwidth of the pulse shaping and packs the signal more intelligently by exploiting the sparsity in the autoencoder due to using ReLU, which means that nearly half of the input symbols are set to assuming that input is either or with equal probability. The double-coded bits (due to the turbo encoder and autoencoder) are first decoded by the autoencoder. Then, the outputs of the autoencoder for all subblocks are aggregated and given to the outer decoder.
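The subblock processing and the ReLU-induced sparsity described above can be sketched as follows. This is an illustration under stated assumptions, not the paper's configuration: the block length, subblock length, expansion factor, and the antipodal (+1/-1) representation of the coded bits are hypothetical values chosen here, a random bit vector stands in for the turbo encoder output, and the weight matrix is random rather than trained.

```python
import numpy as np

def split_into_subblocks(coded_bits, subblock_len):
    """Reshape one coded block into rows of equal-length subblocks."""
    if coded_bits.size % subblock_len != 0:
        raise ValueError("block length must be a multiple of the subblock length")
    return coded_bits.reshape(-1, subblock_len)

def apply_shared_layer(subblocks, weights):
    """Apply the same weight matrix to every subblock (shared parameters)."""
    return subblocks @ weights.T

rng = np.random.default_rng(1)
block_len, subblock_len, expansion = 1024, 64, 2   # hypothetical sizes

# Stand-in for the turbo coded bits, mapped to +1/-1 with equal probability.
coded = rng.choice([-1.0, 1.0], size=block_len)

subblocks = split_into_subblocks(coded, subblock_len)
weights = rng.standard_normal((expansion * subblock_len, subblock_len))
encoded = apply_shared_layer(subblocks, weights)

# ReLU applied to equiprobable antipodal symbols zeroes out about half of them.
relu_output = np.maximum(subblocks, 0.0)
zero_fraction = float(np.mean(relu_output == 0.0))

# Aggregation before the outer decoder: subblocks are put back in order.
aggregated = subblocks.reshape(-1)
```

Reusing one weight matrix across all subblocks keeps the layer dimensions tied to the subblock length rather than the full block length, which is the sense in which the autoencoder acts like a convolutional layer over the block.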
A concrete technical rationale for concatenating a turbo code and autoencoder is to provide Gaussian distributed data to the turbo decoder, which is optimized for AWGN and is known to perform very close to theoretical limits for Gaussian distributed data. Below we formally prove that an autoencoder centered on the channel produces conditional Gaussian distributed data for the turbo decoder, as in the case of an AWGN channel, even if there are some significant nonlinearities, such as one-bit quantization.
Theorem 3.
The conditional probability distribution of the output of the autoencoder’s decoder – which is the input to the turbo decoder – conditioned on the output of the turbo encoder is a Gaussian process, despite the one-bit quantization at the front end of the receiver.
Proof.
See Appendix C. ∎
Remark 2.
Theorem 3 has important consequences, namely that even if there is a nonlinear operation in the channel or RF portion of the system, building an autoencoder around the channel provides a Gaussian distributed input to the decoder, and so standard AWGN decoders can be used without degradation. This brings robustness to the turbo codes against any nonlinearity in the channel: not just quantization but also phase noise, power amplifier nonlinearities, or nonlinear interference.
IV-B The Proposed Architecture Relative to Deep Learning Principles
Choosing some initial weights and moving through the parameter space in a succession of steps does not help to find the optimum solution in high-dimensional machine learning problems [27]. Hence, it is very unlikely to achieve reliable communication by randomly initializing the encoder and decoder parameters and training these via gradient descent. This is particularly true if there is a non-differentiable layer in the middle of a deep neural network as in the case of one-bit quantization. Regularization is a remedy for such deep neural networks whose parameters cannot be initialized and trained properly. However, it is not clear what kind of regularizer should be utilized: it is problem-specific and there is no universal regularizer. Furthermore, it is not easy to separate the impact of regularization from that of optimization. To illustrate, in the seminal work of [23] that successfully trains a deep network for the first time by pre-training all the layers and then stacking them together, it is not well understood whether the improvement is due to better optimization or better regularization [28].
Utilizing a novel implicit regularization inspired by coding theory has couple of benefits. First, it is applicable to many neural networks in communication theory: it is not problemspecific. Second, the handcrafted encoder can be treated as features extracted from another (virtual) deep neural network and combined with the target neural network. This means that a machine learning pipeline can be formed by stacking these two trained deep neural networks instead of stacking multiple layers. Although it is not known how to optimally combine the pretrained layers [28], it is much easier to combine two separate deep neural networks. Additionally, our model isolates the impact of optimization due to the onebit quantization. This leads to a better understanding of the influence of regularization.
In deep neural networks, training the lower layers has the key role of determining the generalization capability [27]. In our model, the lower layers can be seen as layers of a virtual deep neural network that can learn the stateoftheart coding method. The middle layers are the encoder part of the autoencoder, which are the most problematic in terms of training ( due to onebit quantization) and the higher layers are the decoder of the autoencoder. We find that even if the middle layers are suboptimally trained, the overall architecture performs well. That is, we claim that as long as the middle layers contribute to hierarchical learning, it is not important to optimally train their parameters. This brings significant complexity savings in training neural networks, but more work is needed to verify this claim more broadly.
V Numerical Results
To determine the efficiency of the two-step training policy in the proposed autoencoder, we first show how well the encoder parameters can be trained given the decoder parameters. We then evaluate the bit error rate (BER) of the proposed coding scheme for QPSK and 16-QAM modulation under the constraint of one-bit quantization in the receivers for AWGN channels. In the simulations, a root raised cosine (RRC) filter with excess bandwidth is considered as a pulse shape and fold time domain oversampling is utilized. Furthermore, is taken as without an extensive hyperparameter search. We use the turbo code that is utilized in LTE [29], which has a code rate of and a block length of . The codewords formed with this turbo code are processed with a subblock of length with the autoencoder. The proposed coding scheme is directly compared with this conventional turbo code in the case of both unquantized (soft decision decoding) and one-bit quantized (hard decision decoding) samples. This comparison shows explicitly why an autoencoder is needed for one-bit quantization.
To observe how efficiently the encoder can be trained for the aforementioned training policy, its mean square error (MSE) loss function is plotted with respect to the number of epochs in Fig. 5. As can be seen, the error goes to almost zero after a few hundred epochs.
One of the important observations in training the encoder is the behavior of the neural layer in the transmitter, which is the first layer in Fig. 3. To be more precise, this layer demonstrates that nearly half of its hidden units (or neurons) become zero. This is due to the ReLU activation function and enables us to pack the symbols more intelligently. More precisely, the input of the autoencoder has units, and thus the dimension of the first hidden layer is , but only of them have nonzero terms. Interestingly, the hidden units of this layer, which also correspond to the transmitted symbols, have quite different power levels from each other. To visualize this, when the all ones codeword with length is given to the input of the autoencoder, the output of the first hidden layer for becomes as in Fig. 6. According to that, neurons out of neurons become zero. Our empirical results also show that this is independent of the value of .
In the proposed coding scheme, the symbols are transmitted faster with period ; however, this does not affect the transmission bandwidth, i.e., the bandwidth remains the same [30]. (Note that the complexity increase in the receiver due to faster-than-Nyquist transmission is not an issue for autoencoders, in which the equalizer is implemented as a neural network independent of the transmission rate.) Although the coding rate is in the proposed autoencoder, this does not mean that there is a trivial coding gain increase, because the bandwidth remains the same, and thus the minimum distance (or free distance) does not increase. (In convolutional codes, the coding gain is smaller than or equal to [31].) The minimum distance can even decrease despite the smaller coding rate, because dividing the same subspace into fold more partitions can decrease the distance between neighboring partitions.
To make a fair comparison between conventional turbo codes designed for orthogonal transmission with symbol period and our autoencoderaided coding scheme, our methodology in the simulations is to consider the transmission rate increase as a bandwidth increase. In this manner, we first determine the maximum possible value of that corresponds to the available baseband bandwidth . Then we add an SNR penalty by shifting the BER curve to the right, if the transmission rate increase exceeds the available bandwidth. To illustrate, if is , can be (because the nonorthogonal transmission symbol period becomes due to the precoder behavior) without needing to add an SNR penalty. However, if becomes , the BER curve has to be shifted 3 dB to the right even if there is full excess bandwidth. This in fact explains what exploiting the excess bandwidth and packing the symbols more intelligently correspond to.
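The penalty rule described above can be sketched as a small calculation. This is only an illustration under stated assumptions: the function name `snr_penalty_db`, its parameters, and the numeric values used below are hypothetical, and the sketch assumes that an RRC pulse with roll-off `rolloff` leaves room for a symbol-rate increase of up to `1 + rolloff` before any penalty applies. Under that assumption, a rate increase of twice the budget costs about 3 dB, which is consistent with the 3 dB shift mentioned for full excess bandwidth.

```python
import math


def snr_penalty_db(rate_increase: float, rolloff: float) -> float:
    """Return the SNR shift (in dB) applied to a BER curve.

    Assumption (not taken from the paper's equations): an RRC pulse with
    roll-off `rolloff` in [0, 1] tolerates a symbol-rate increase of up to
    (1 + rolloff) at no cost; any rate increase beyond that budget is
    charged as an equivalent bandwidth (and hence SNR) penalty.
    """
    if rate_increase <= 0.0:
        raise ValueError("rate_increase must be positive")
    if not 0.0 <= rolloff <= 1.0:
        raise ValueError("rolloff must lie in [0, 1]")
    budget = 1.0 + rolloff              # rate increase covered by excess bandwidth
    excess = rate_increase / budget     # > 1 means the budget is exceeded
    return max(0.0, 10.0 * math.log10(excess))


# Hypothetical values: full excess bandwidth (rolloff = 1).
print(snr_penalty_db(2.0, 1.0))  # within budget, no shift
print(snr_penalty_db(4.0, 1.0))  # twice the budget, roughly a 3 dB shift
```

The `max(0.0, ...)` clamp reflects that the curve is only ever shifted to the right: using less than the available budget earns no credit in this sketch.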
The performance of the autoencoder-based concatenated code is given in Fig. 7 for QPSK modulation. Specifically, when the turbo codes are decoded with iterations, the BER becomes as in Fig. 7(a). Here, the proposed coding method can give performance very close to that of the turbo code that works with unquantized samples, despite one-bit ADCs, for . Although there is some performance loss for , our method can still outperform the turbo code that is optimized for one-bit samples. When iterations are employed for the turbo decoding, the gap due to the excess bandwidth increases a little, as can be observed in Fig. 7(b). Notice that our empirical results match the expression derived in Lemma 1, which states the minimum required SNR for reliable communication. This also indicates that the proposed suboptimum training policy can approach the performance of an autoencoder that is trained perfectly.
One-bit ADCs can work reasonably well in practice for QPSK modulation. However, this is not the case for higher order modulation, in which it is much more challenging to achieve satisfactory performance with one-bit ADCs. To quantify the benefit of the proposed coding scheme for higher order modulation, the simulation is repeated for QAM as depicted in Fig. 8. As can be observed, the conventional turbo code is not sufficient for one-bit ADCs in the case of QAM. On the other hand, the proposed coding method gives a similar waterfall slope with a nearly fixed SNR loss with respect to the turbo code that processes ideal unquantized samples. This result can be explained with Theorem 3. More precisely, for higher order modulations the nonlinearity stemming from one-bit ADCs increases considerably and pushes the effective channel away from AWGN toward non-Gaussian distributions. However, the inner code (the autoencoder) produces a Gaussian process for the turbo decoder even under strong nonlinearity.
VI Conclusions
In this paper, the development of handcrafted channel codes for one-bit quantization is transformed into learning the parameters of a specially designed autoencoder. Despite its theoretical appeal, learning or training the parameters of an autoencoder is often very challenging. Hence, suboptimum training methods are needed, which can lead to some performance loss. To compensate for this loss, we propose to use a state-of-the-art coding technique, which was developed for the AWGN channel, as an implicit regularizer for autoencoders that are trained suboptimally. This idea is applied to design channel codes for AWGN channels under the constraint of one-bit quantization in receivers. Our results show that the proposed coding technique outperforms conventional turbo codes for one-bit quantization and can give performance close to unquantized turbo coding by packing the signal intelligently and exploiting the excess bandwidth. The superiority of the proposed coding scheme is more pronounced for higher order modulation, in which one-bit ADCs were not previously viable even with powerful turbo codes. As future work, the idea of this hybrid code design can be extended to other challenging environments such as one-bit quantization for fading channels and high-dimensional MIMO channels. Additionally, it would be interesting to compensate for the performance loss observed at short block lengths for turbo, LDPC and polar codes with deep learning aided methods.
Appendix A Proof of Theorem 1
In communication theory, solving (3) for a given , and SNR leads to the minimum probability of error, which can be achieved through maximum likelihood detection (we assume equal transmission probability of each message). Hence,
(11) 
It is straightforward to express
(12) 
when and . We need to prove that minimizing the loss function in (2) while solving (1) gives these same and , i.e., and .
Since the error probability is calculated message-wise instead of bit-wise in (4), the dimensional binary valued input training vector s is first encoded as a dimensional one-hot vector x to form the messages, which is to say that . Also, a softmax activation function is used to translate the entries of the output vector into probabilities. With these definitions, the cross-entropy function is employed to train the parameters (the subscript that represents the training sample is omitted for brevity)
(13) 
where is the empirical conditional probability distribution and is the predicted conditional probability distribution (or the output of the neural network).
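The message-wise setup described above (binary vector to one-hot message, softmax output, cross-entropy loss) can be sketched in a few lines. This is a minimal illustration rather than the authors' training code: the helper names `bits_to_one_hot`, `softmax`, and `cross_entropy`, the big-endian bit ordering, and the three-bit example are assumptions made for the sketch.

```python
import numpy as np


def bits_to_one_hot(s):
    """Map a length-k binary vector s to a one-hot vector x of length 2**k.

    Assumption: bits are read most-significant first to pick the message index.
    """
    s = np.asarray(s, dtype=int)
    weights = 1 << np.arange(s.size)[::-1]      # [..., 4, 2, 1]
    index = int(s @ weights)
    x = np.zeros(2 ** s.size)
    x[index] = 1.0
    return x


def softmax(z):
    """Turn raw network outputs into a probability vector (numerically stable)."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()


def cross_entropy(x, logits, eps=1e-12):
    """Cross entropy between the one-hot target x and softmax(logits), in nats."""
    q = softmax(logits)
    return float(-np.sum(x * np.log(q + eps)))


# Hypothetical example with k = 3 bits, i.e. 8 possible messages.
x = bits_to_one_hot([1, 0, 1])
uninformed = cross_entropy(x, np.zeros(8))        # uniform prediction
confident = cross_entropy(x, 10.0 * x)            # mass on the correct message
print(uninformed, confident)
```

Because the target is one-hot, only the predicted probability of the transmitted message contributes to the loss, so lowering the loss means raising that probability.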
Each output vector is assigned to only one of discrete classes, and hence the decision surfaces are dimensional hyperplanes for the dimensional input space. That is,
(14) 
Substituting (14) in (13) implies that
(15) 
It is straightforward to see that (15) is minimized when is maximized (or equivalently is minimized). Since and ,
(16) 
due to (4) and (11), and is the case when and because of (12). This implies that
(17) 
By definition,
(18) 
and hence,
(19) 
which completes the proof due to the onetoone mapping between s and x, and and .
Appendix B Proof of Theorem 2
The encoder parameters are initialized with zero-mean, unit-variance Gaussian random variables in the one-bit quantized AWGN channel autoencoder. Hence, the mutual information is found over these random weights as
(20) 
By the definition of mutual information,
(21) 
The entries of the random matrix are i.i.d., and the noise samples are independent. This implies that the ’s are independent, i.e.,
(22) 
Since can be either or due to the onebit quantization, , which means
(23) 
Due to the onetoone mapping between s and x,
(24) 
Notice that for all x, only one of its elements is , the rest are . This observation reduces (24) to
(25) 
where is one realization of x. Then, the total probability law gives
Since
(27) 
this completes the proof.
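The bounding step before (23) appears to rest on a basic fact: a one-bit quantized sample is a binary random variable and therefore carries at most one bit of entropy, whatever the noise level. The sketch below only illustrates that fact; the function name `binary_entropy` and the probability grid are assumptions introduced here, not part of the proof.

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy in bits of a binary variable that takes one value with probability p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if p in (0.0, 1.0):
        return 0.0                      # a deterministic output carries no information
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


# Sweep the probability that the quantizer outputs one of its two levels.
probabilities = [i / 100 for i in range(101)]
entropies = [binary_entropy(p) for p in probabilities]
print(max(entropies))                   # the maximum is reached at p = 0.5
```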
Appendix C Proof of Theorem 3
The autoencoder architecture, which is composed of 6 layers as illustrated in Fig. 3, can be expressed layer-by-layer as
(28) 
where are the weights and is the bias. All the weights and biases are initialized with Gaussian random variables with variances and , respectively, as is standard practice [24], [25]. Thus, is an independent and identically distributed Gaussian process for every (or unit) with zero mean and covariance
(29) 
where is an identity function except for , in which . As the width goes to infinity, (29) can be written in integral form as
(30) 
To be more compact, the double integral in (30) can be represented with a function such that
(31) 
Hence, is a Gaussian process with zero mean and covariance
(32) 
when , i.e., the output of the autoencoder yields Gaussian distributed data in the initialization phase.
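The initialization claim can be checked empirically on a toy network. The sketch below is not the 6-layer architecture of Fig. 3: it assumes a small three-layer network with a sign (one-bit) activation after the first layer and identity activations elsewhere, and the names `sample_output` and `excess_kurtosis`, the width, the input vector, and the variance values are all hypothetical. It redraws the random weights many times for one fixed input and checks that the output has excess kurtosis close to zero, as a Gaussian would.

```python
import numpy as np


def sample_output(x, width, rng, sigma_w=1.0, sigma_b=0.1):
    """Output of one randomly initialized toy network for a fixed input x."""
    d = x.size
    w1 = rng.normal(0.0, sigma_w / np.sqrt(d), size=(width, d))
    b1 = rng.normal(0.0, sigma_b, size=width)
    q = np.sign(w1 @ x + b1)                    # one-bit quantization layer
    w2 = rng.normal(0.0, sigma_w / np.sqrt(width), size=(width, width))
    b2 = rng.normal(0.0, sigma_b, size=width)
    h = w2 @ q + b2                             # identity activation
    w3 = rng.normal(0.0, sigma_w / np.sqrt(width), size=width)
    b3 = rng.normal(0.0, sigma_b)
    return float(w3 @ h + b3)


def excess_kurtosis(samples):
    """Sample excess kurtosis; approximately 0 for Gaussian data."""
    z = (samples - samples.mean()) / samples.std()
    return float(np.mean(z ** 4) - 3.0)


rng = np.random.default_rng(0)
x = np.array([1.0, -1.0, 1.0, 1.0])             # hypothetical fixed input
outputs = np.array([sample_output(x, width=128, rng=rng) for _ in range(2000)])
print(outputs.mean(), outputs.std(), excess_kurtosis(outputs))
```

For a finite width the output is a scale mixture of Gaussians rather than exactly Gaussian, so the excess kurtosis is small but not zero; it shrinks as `width` grows, in line with the infinite-width argument.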
During training, the parameters are iteratively updated as
(33) 
where , and is the loss function. In parallel, the output is updated as
(34) 
The gradient term in (34) is a nonlinear function of the parameters. Nevertheless, it was recently proven in [32] that as the width goes to infinity, this nonlinear term can be linearized via a first-order Taylor expansion. More precisely,
(35) 
where the output at the initialization or is Gaussian as discussed above. Since the gradient (and hence the Jacobian matrix) is a linear operator, and a linear operation on a Gaussian process results in a Gaussian process, the output of the autoencoder for a given input (or ) is a Gaussian process throughout training with gradient descent.
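The closing step relies on a standard property of Gaussian vectors, stated here in generic notation. The symbols $A$, $b$, $\mu$, and $\Sigma$ are placeholders introduced for illustration and are not the paper's notation; $A$ and $b$ are assumed deterministic, i.e., they do not depend on $z$.

```latex
% Affine maps preserve Gaussianity (A and b deterministic):
z \sim \mathcal{N}(\mu, \Sigma)
\quad\Longrightarrow\quad
A z + b \sim \mathcal{N}\!\left(A\mu + b,\; A \Sigma A^{\top}\right)
```

In the linearized regime, the output during training plays the role of $Az + b$, with the Gaussian output at initialization playing the role of $z$.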
References
 [1] T. S. Rappaport, Y. Xing, O. Kanhere, S. Ju, A. Madanayake, S. Mandal, A. Alkhateeb, and G. C. Trichopoulos, “Wireless communications and applications above 100 GHz: Opportunities and challenges for 6G and beyond,” IEEE Access, 2019.
 [2] R. Walden, “Analog-to-digital converter survey and analysis,” IEEE J. Sel. Areas Commun., vol. 17, no. 4, pp. 539-550, April 1999.
 [3] T. O’Shea and J. Hoydis, “An introduction to deep learning for the physical layer,” IEEE Trans. on Cogn. Commun. Netw., vol. 3, no. 4, pp. 563-575, December 2017.
 [4] H. Ye, G. Y. Li, and B.-H. Juang, “Power of deep learning for channel estimation and signal detection in OFDM systems,” IEEE Wireless Communications Letters, vol. 7, pp. 114-117, February 2018.
 [5] E. Balevi and J. G. Andrews, “One-bit OFDM receivers via deep learning,” IEEE Trans. on Communications, doi: 10.1109/TCOMM.2019.2903811, 2019.
 [6] Q. Mao, F. Hu, and Q. Hao, “Deep learning for intelligent wireless networks: A comprehensive survey,” IEEE Communications Surveys Tutorials, vol. 20, no. 4, pp. 2595-2621, November 2018.
 [7] J. Bruck and M. Blaum, “Neural networks, error-correcting codes, and polynomials over the binary n-cube,” IEEE Trans. Inform. Theory, vol. 35, no. 5, pp. 976-987, September 1989.
 [8] X.-A. Wang and S. B. Wicker, “An artificial neural net Viterbi decoder,” IEEE Transactions on Communications, vol. 44, no. 2, pp. 165-171, February 1996.
 [9] A. Hamalainen and J. Henriksson, “A recurrent neural decoder for convolutional codes,” in IEEE ICC, vol. 2, no. 99CH36311, pp. 1305-1309, June 1999.
 [10] T. Gruber, S. Cammerer, J. Hoydis, and S. ten Brink, “On deep learning based channel decoding,” in Proc. Conf. Inf. Sci. Syst., pp. 1-6, March 2017.
 [11] H. Kim, Y. Jiang, R. Rana, S. Kannan, S. Oh, and P. Viswanath, “Communication algorithms via deep learning,” in Proc. ICLR, April 2018.
 [12] E. Nachmani, E. Marciano, L. Lugosch, W. J. Gross, D. Burshtein, and Y. Be’ery, “Deep learning methods for improved decoding of linear codes,” IEEE Journal of Selected Topics in Signal Processing, vol. 12, no. 1, pp. 119-131, February 2018.
 [13] H. Kim, Y. Jiang, S. Kannan, S. Oh, and P. Viswanath, “Deepcode: Feedback codes via deep learning,” arXiv preprint arXiv:1807.00801, July 2018.
 [14] Y. Jiang, H. Kim, H. Asnani, S. Kannan, S. Oh, and P. Viswanath, “LEARN codes: Inventing low-latency codes via recurrent neural networks,” arXiv preprint arXiv:1811.12707, November 2018.
 [15] J. Kosaian, K. Rashmi, and S. Venkataraman, “Learning a code: Machine learning for approximate non-linear coded computation,” arXiv preprint arXiv:1806.01259, April 2018.
 [16] M. T. Ivrlac and J. A. Nossek, “On MIMO channel estimation with single-bit signal-quantization,” in Proc. ITG Workshop Smart Antennas, February 2007.
 [17] S. Jacobsson, G. Durisi, M. Coldrey, U. Gustavsson, and C. Studer, “Throughput analysis of massive MIMO uplink with low-resolution ADCs,” IEEE Trans. Wireless Comm., vol. 16, pp. 4038-4051, June 2017.
 [18] C. Studer and G. Durisi, “Quantized massive MU-MIMO-OFDM uplink,” IEEE Trans. Commun., vol. 64, no. 6, pp. 2387-2399, June 2016.
 [19] C. Risi, D. Persson, and E. G. Larsson, “Massive MIMO with 1-bit ADC,” [Online]. Available: https://arxiv.org/abs/1404.7736, April 2014.
 [20] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, “Extracting and composing robust features with denoising autoencoders,” in Proc. ICML, July 2008.
 [21] M. Ranzato, C. Poultney, S. Chopra, and Y. LeCun, “Efficient learning of sparse representations with an energy-based model,” in Proc. NIPS, December 2006.
 [22] S. Rifai, P. Vincent, X. Muller, X. Glorot, and Y. Bengio, “Contractive auto-encoders: Explicit invariance during feature extraction,” in Proc. ICML, July 2011.
 [23] G. E. Hinton, S. Osindero, and Y. Teh, “A fast learning algorithm for deep belief nets,” Neural Computation, vol. 18, no. 7, pp. 1527-1554, July 2006.
 [24] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in Proc. AISTATS, May 2010.
 [25] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification,” in ICCV, December 2015.
 [26] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in ICML, July 2015.
 [27] Y. Bengio, “Learning deep architectures for AI,” Foundations and Trends in Machine Learning, vol. 2, no. 1, pp. 1-127, August 2009.
 [28] Y. Bengio, A. Courville, and P. Vincent, “Representation learning: A review and new perspectives,” IEEE Trans. Pattern Anal. Machine Intelligence, vol. 35, no. 8, pp. 1798-1828, August 2013.
 [29] 3GPP TS 36.212, “Evolved universal terrestrial radio access (E-UTRA) - multiplexing and channel coding (rel 14),” in 3GPP FTP Server, 2017.
 [30] A. D. Liveris and C. N. Georghiades, “Exploiting faster-than-Nyquist signaling,” IEEE Trans. on Communications, vol. 51, no. 9, pp. 1502-1511, September 2003.
 [31] J. G. Proakis and M. Salehi, Digital Communications. McGraw-Hill Higher Education, 2005.
 [32] J. Lee, L. Xiao, S. Schoenholz, Y. Bahri, J. Sohl-Dickstein, and J. Pennington, “Wide neural networks of any depth evolve as linear models under gradient descent,” arXiv preprint arXiv:1902.06720, 2019.