VLSI Design of a Nonparametric Equalizer for Massive MU-MIMO
Linear minimum mean-square error (L-MMSE) equalization is among the most popular methods for data detection in massive multi-user multiple-input multiple-output (MU-MIMO) wireless systems. While L-MMSE equalization enables near-optimal spectral efficiency, accurate knowledge of the signal and noise powers is necessary. Furthermore, corresponding VLSI designs must solve linear systems of equations, which requires high arithmetic precision, exhibits stringent data dependencies, and results in high circuit complexity. This paper proposes the first VLSI design of the NOnParametric Equalizer (NOPE), which avoids knowledge of the transmit signal and noise powers, provably delivers the performance of L-MMSE equalization for massive MU-MIMO systems, and is resilient to numerous system and hardware impairments due to its parameter-free nature. Moreover, NOPE avoids computation of a matrix inverse and only requires hardware-friendly matrix-vector multiplications. To showcase the practical advantages of NOPE, we propose a parallel VLSI architecture and provide synthesis results in 28 nm CMOS. We demonstrate that NOPE performs on par with existing data detectors for massive MU-MIMO that require accurate knowledge of the signal and noise powers.
It is widely believed that massive multi-user multiple-input multiple-output (MU-MIMO) will be a core technology for fifth-generation (5G) wireless systems. Massive MU-MIMO relies on base-station (BS) architectures with hundreds of antenna elements and radio-frequency (RF) chains that serve tens of user equipments (UEs) in the same time-frequency resource. While this emerging technology enables unprecedented spectral efficiency by means of fine-grained beamforming [1, 2], it also poses significant practical implementation challenges.
I-A The Case for Nonparametric Equalization
Data detection in the uplink (UEs transmit data to the BS) is among the most critical tasks from a spectral efficiency and hardware complexity perspective. While optimal MIMO data detection is known to be NP-hard, it has been shown in [5, 3] that linear minimum mean-square error (L-MMSE) equalization enables near-optimal performance in massive MU-MIMO systems. However, L-MMSE equalization requires accurate knowledge of the signal and noise powers. Furthermore, corresponding hardware designs must solve linear systems of equations, which requires high arithmetic precision and suffers from stringent data dependencies; both of these aspects result in relatively high circuit complexity [7, 8, 9, 10, 11]. In addition, practical massive MU-MIMO BS designs will most likely rely on inexpensive RF circuitry, which suffers from numerous impairments, including amplifier nonlinearities, phase noise, and quantization artifacts [12, 13]. The presence of non-ideal hardware necessitates the design of new equalization algorithms that are resilient to real-world hardware imperfections.
Recently, a novel algorithm called the NOnParametric Equalizer (NOPE, for short) was proposed. NOPE does not require knowledge of the signal and noise powers while provably achieving the performance of L-MMSE equalization in massive MU-MIMO systems. NOPE combines approximate message passing (AMP) with Stein’s unbiased risk estimator (SURE) and mismatched data detection, which renders this algorithm resilient to numerous hardware impairments while being computationally efficient: NOPE only requires matrix-vector products and avoids computing costly matrix inverses or matrix decompositions, which are typically required by L-MMSE equalization algorithms. Despite all these advantages, NOPE has been designed only for idealistic channel models and has not yet been implemented in hardware.
In this paper, we generalize NOPE to practical channels and provide, to the best of our knowledge, its first VLSI design. Our contributions are summarized as follows:
- We propose a set of algorithm-level modifications that enable NOPE to operate on more realistic MU channels.
- We develop a VLSI architecture that relies on Cannon’s algorithm to achieve high throughput at low area.
- We show reference VLSI synthesis results in 28 nm CMOS for a 64 BS antenna, 16 UE massive MU-MIMO system.
- We compare NOPE to existing massive MU-MIMO equalizers requiring knowledge of the signal and noise powers.
Our results demonstrate that massive MU-MIMO offers a unique opportunity to design parameter-free algorithms, such as NOPE, that perform on par with solutions that require accurate knowledge of critical system and model parameters.
Lowercase and uppercase boldface letters designate column vectors and matrices, respectively. The transpose and conjugate transpose of a matrix are denoted by and , respectively; the entry on the th row and th column of is . The identity matrix is denoted by and the all-zeros matrix by . The th entry of a vector is . We define the Hadamard product as . For an -dimensional vector , we define . The probability density function (PDF) of a circularly-symmetric complex-valued Gaussian random vector with covariance matrix is denoted by . Expectation and variance with respect to the random vector is denoted by and , respectively.
II A Primer on L-MMSE Equalization
We now introduce the system model and review the basics of L-MMSE equalization. We then discuss NOPE.
II-A System Model
We consider the input-output relation to model a massive MU-MIMO uplink system operating in a frequency-flat channel . The vector contains the received signals at the BS; denotes the number of BS antennas; the matrix represents the uplink MIMO channel; denotes the number of UEs; the transmit signal vector is ; and the vector models receive noise, which has i.i.d. circularly-symmetric complex Gaussian entries with variance per entry. Throughout this paper, we assume that the transmit signal vector has i.i.d. entries so that , where models the signal prior (e.g., a 16-QAM constellation) with zero mean and signal variance , . We require the following definitions.
The antenna ratio is defined as .
The large-antenna limit is defined by fixing the antenna ratio and letting .
The channel matrix is said to have uniform channel gains if the entries are i.i.d. circularly-symmetric complex Gaussian with variance per complex entry.
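As a minimal sketch of the model above, the following snippet draws one channel use with uniform channel gains. The names `M_R` (BS antennas), `M_T` (UEs), `E_s`, and `N_0`, the QPSK constellation, the channel variance of `1 / M_R`, and the antenna ratio `M_T / M_R` are illustrative assumptions rather than notation taken from the text.

```python
import math
import random

def cgauss(rng, var):
    """One circularly-symmetric complex Gaussian sample with variance `var`."""
    s = math.sqrt(var / 2.0)
    return complex(rng.gauss(0.0, s), rng.gauss(0.0, s))

def draw_uplink(M_R, M_T, E_s, N_0, rng):
    """Return (y, H, x) for the frequency-flat model y = H x + n."""
    # Channel entries: i.i.d. complex Gaussian, variance 1/M_R (assumed scaling).
    H = [[cgauss(rng, 1.0 / M_R) for _ in range(M_T)] for _ in range(M_R)]
    # Transmit vector: i.i.d. QPSK symbols with zero mean and variance E_s.
    a = math.sqrt(E_s / 2.0)
    x = [complex(rng.choice((-a, a)), rng.choice((-a, a))) for _ in range(M_T)]
    # Receive noise: i.i.d. complex Gaussian with variance N_0 per entry.
    y = [sum(H[i][j] * x[j] for j in range(M_T)) + cgauss(rng, N_0)
         for i in range(M_R)]
    return y, H, x

rng = random.Random(0)
y, H, x = draw_uplink(M_R=64, M_T=16, E_s=1.0, N_0=0.01, rng=rng)
beta = 16 / 64  # antenna ratio, assumed here to be UEs per BS antenna
```

With this scaling, each channel column has a squared norm close to one, so the per-UE receive power is roughly `E_s` regardless of the number of BS antennas.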
II-B Basics of L-MMSE Equalization
L-MMSE equalization is among the most popular methods to compute an estimate for from and from knowledge of the channel matrix , and enjoys widespread use for data detection in MIMO systems [18, 19, 7, 20]. The relatively low computational complexity (except for the inversion of a potentially large matrix) and acceptable performance render this method a feasible alternative to more complicated data detection algorithms. Moreover, it has been shown in [5, 3] that L-MMSE equalization enables (often significantly) higher achievable rates than zero-forcing (ZF) or maximum ratio combining (MRC)-based equalizers in massive MU-MIMO systems. However, to enable near-optimal spectral efficiency via L-MMSE equalization, accurate knowledge of the signal and noise powers is required; see, e.g., [21, 6].
Mathematically, the goal of L-MMSE equalization is to compute a linear estimate from the receive vector that minimizes the using knowledge of the channel matrix as well as the signal and noise powers, and , respectively. For a circularly-symmetric complex-valued transmit signal , the equalization matrix is given by . If the signal is zero-mean and real-valued (e.g., for BPSK signals), then the optimal linear estimator for the real part is given by
where , , , and are the real and imaginary parts of and , respectively; the imaginary part of the estimate is . Clearly, L-MMSE equalization relies on knowledge of the quantities or , which requires (i) means to detect whether the transmit signals are real- or complex-valued and (ii) an accurate estimate of that is commonly acquired in a dedicated training phase .
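For reference, the standard closed-form L-MMSE expressions can be written as follows. The symbols are assumed here for illustration: $\mathbf{y}$ is the receive vector, $\mathbf{H}$ the channel matrix, $E_s$ and $N_0$ the signal and noise powers, $M_T$ the number of UEs, and the subscripts $R$ and $I$ denote real and imaginary parts.

```latex
% Circularly-symmetric complex-valued transmit signal:
\hat{\mathbf{x}} = \mathbf{W}\mathbf{y}, \qquad
\mathbf{W} = \left(\mathbf{H}^H\mathbf{H} + \frac{N_0}{E_s}\,\mathbf{I}_{M_T}\right)^{-1}\mathbf{H}^H

% Zero-mean real-valued transmit signal (e.g., BPSK); the noise variance
% per real dimension is N_0/2:
\hat{\mathbf{x}}_R = \left(\mathbf{H}_R^T\mathbf{H}_R + \mathbf{H}_I^T\mathbf{H}_I
  + \frac{N_0}{2E_s}\,\mathbf{I}_{M_T}\right)^{-1}
  \left(\mathbf{H}_R^T\mathbf{y}_R + \mathbf{H}_I^T\mathbf{y}_I\right),
\qquad \hat{\mathbf{x}}_I = \mathbf{0}
```

The two cases differ only in the regularization term, $N_0/E_s$ versus $N_0/(2E_s)$, which is why the equalizer needs to know both the power ratio and whether the constellation is real- or complex-valued.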
II-C L-MMSE Equalization via AMP
As shown in , L-MMSE equalization can be implemented using the mismatched complex Bayesian AMP (mcB-AMP) framework. By assuming a mismatched Gaussian signal prior distribution with instead of the true signal prior (e.g., two Dirac delta functions concentrated at and for BPSK), one can design the following parametric L-MMSE algorithm:
Algorithm 1 (L-MMSE-AMP).
Initialize , , , and . Then, for every iteration compute the output via the following steps:
Here, the posterior mean function and operate element-wise on vectors. We furthermore need the MSE function: .
Interestingly, the estimate computed by L-MMSE-AMP exhibits the same MSE as that of the L-MMSE equalizer in the large-antenna limit and for matrices with uniform channel gains . While this is an asymptotic equivalence, reference  has shown that the error-rate performance of L-MMSE-AMP is virtually indistinguishable from an L-MMSE equalizer in practical (finite-dimensional) massive MU-MIMO systems for a small number of iterations (ten or fewer). Clearly, Algorithm 1 mainly relies on matrix-vector multiplications, which enables parallel hardware designs. However, the exact knowledge of is still required.
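The structure of such an AMP iteration with a Gaussian (mismatched) prior can be sketched as below. This is a generic textbook-style AMP recursion with a linear posterior-mean function, written under the assumption of channel entries with variance `1 / M_R`; the variable names and the exact ordering of the steps are illustrative and not taken from Algorithm 1.

```python
import math
import random

def cgauss(rng, var):
    s = math.sqrt(var / 2.0)
    return complex(rng.gauss(0.0, s), rng.gauss(0.0, s))

def lmmse_amp(y, H, E_s, N_0, iters=10):
    """AMP with a linear (Gaussian-prior) denoiser; needs E_s and N_0."""
    M_R, M_T = len(H), len(H[0])
    beta = M_T / M_R
    s = [0j] * M_T        # current symbol estimate
    r = list(y)           # residual y - H s
    tau = beta * E_s      # beta times the predicted per-entry MSE
    for _ in range(iters):
        # Matched-filter step: z behaves like the transmit vector plus
        # Gaussian noise of variance sigma2.
        z = [s[j] + sum(H[i][j].conjugate() * r[i] for i in range(M_R))
             for j in range(M_T)]
        sigma2 = N_0 + tau
        gain = E_s / (E_s + sigma2)     # slope of the Gaussian posterior mean
        s = [gain * zj for zj in z]
        tau = beta * gain * sigma2      # beta times MSE of the linear denoiser
        # Residual update with the Onsager correction beta * <F'> * r.
        r = [y[i] - sum(H[i][j] * s[j] for j in range(M_T)) + beta * gain * r[i]
             for i in range(M_R)]
    return s

rng = random.Random(1)
M_R, M_T, E_s, N_0 = 64, 16, 1.0, 0.01
H = [[cgauss(rng, 1.0 / M_R) for _ in range(M_T)] for _ in range(M_R)]
a = math.sqrt(E_s / 2.0)
x = [complex(rng.choice((-a, a)), rng.choice((-a, a))) for _ in range(M_T)]
y = [sum(H[i][j] * x[j] for j in range(M_T)) + cgauss(rng, N_0)
     for i in range(M_R)]
s_hat = lmmse_amp(y, H, E_s, N_0)
mse = sum(abs(s_hat[j] - x[j]) ** 2 for j in range(M_T)) / M_T
```

Each iteration costs two matrix-vector products and a handful of scalar operations; no matrix inverse appears anywhere, but `E_s` and `N_0` still enter through `gain` and `sigma2`.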
III NOnParametric Equalizer: NOPE
We now summarize the necessary steps to free Algorithm 1 from knowledge of the signal power, leading to NOPE. We then propose a generalization of the algorithm that makes it suitable for more realistic MIMO system scenarios.
III-A The NOPE Algorithm
To develop NOPE, we wish to automatically tune the signal power and the parameter . To this end, we first introduce the parameter and reparametrize the functions and in Algorithm 1. Now, only a single parameter must be tuned per iteration, i.e., . Interestingly, [16, Thm. 3] shows that optimal parameter tuning is achieved by tuning each parameter by minimizing (1) separately at iteration starting from to . Hence, the remaining piece is to replace the MSE function with a function that does not depend on the true signal prior . As shown in , one can use Stein’s unbiased risk estimate (SURE)  to extract an estimate of the MSE function as
Since the minimum of is given by we can replace the tuning stage in (1) by , which leads to NOPE. As proven in [6, Cor. 6], NOPE achieves the performance of an L-MMSE equalizer in the large antenna limit given that has uniform channel gains and for .
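The tuning idea can be illustrated with a small sketch. It assumes a linear denoiser of the form `theta * z` acting on `z = x + w`, where `w` is complex Gaussian noise of known effective variance `sigma2`; the function names are hypothetical. For this denoiser, SURE depends only on the observed power of `z`, and its minimizer has a closed form.

```python
import math
import random

def sure_linear(z, sigma2, theta):
    """SURE estimate of the per-entry MSE of the linear denoiser theta * z."""
    power = sum(abs(v) ** 2 for v in z) / len(z)
    return (1.0 - theta) ** 2 * power + (2.0 * theta - 1.0) * sigma2

def tune_theta(z, sigma2):
    """Minimizer of sure_linear over theta, clipped at zero."""
    power = sum(abs(v) ** 2 for v in z) / len(z)
    return max(0.0, 1.0 - sigma2 / power)

# Demo: QPSK symbols with unit power observed in noise of variance 0.5.
rng = random.Random(3)
E_s, sigma2, n = 1.0, 0.5, 4000
a = math.sqrt(E_s / 2.0)
x = [complex(rng.choice((-a, a)), rng.choice((-a, a))) for _ in range(n)]
s = math.sqrt(sigma2 / 2.0)
z = [xk + complex(rng.gauss(0.0, s), rng.gauss(0.0, s)) for xk in x]

theta = tune_theta(z, sigma2)          # uses only z and sigma2
oracle = E_s / (E_s + sigma2)          # would require knowing E_s
sure = sure_linear(z, sigma2, theta)
true_mse = sum(abs(theta * zk - xk) ** 2 for zk, xk in zip(z, x)) / n
```

In this sketch `theta` lands close to the oracle shrinkage factor and `sure` tracks `true_mse`, even though neither computation touches `x` or `E_s`.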
III-B Robust Version of NOPE
NOPE and Algorithm 1 require the matrix to have uniform channel gains. However, in practice each UE typically has a different large-scale fading gain (e.g., affected by the distance to the BS), resulting in channel matrices whose columns have different scale. We now show how NOPE can be made robust to such channels. As in , one can rewrite the channel matrix as , where each element of is distributed as and is diagonal containing the th UE’s individual large-scale fading gain . For this model, one must estimate the gain of the th UE by , which converges to in the large-antenna limit. Thus, is estimated with a diagonal matrix , where the th diagonal element is given by . To enable NOPE to support nonuniform channel gains, we modify the posterior mean function in (2) into an element-wise operation 
to take into account the fact that different functions are used for each UE. This generalization also requires new estimates for the parameters and in (5). As shown in [6, Thm 7], both of these parameters can be estimated as follows
where we introduced shorthand notation for the weighted-norm of with respect to its large-scale fading gains, and for the residual norm.
The remaining piece of our robust NOPE is to enable L-MMSE data detection for BPSK constellations for which the imaginary part of is zero. In fact, assuming a circularly-symmetric complex Gaussian prior for BPSK signals is a poor match as the imaginary part is zero. We generalize NOPE by estimating the signal variance in (6) for the real and imaginary parts separately, which enables us to automatically adapt NOPE to the used constellation set. To do so, we decompose the weighted-norm of denoted by , into real and imaginary parts, i.e., . More specifically, we can estimate the necessary variances as
for which . With all these ingredients, we arrive at the generalized NOPE algorithm in Algorithm 2.
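Two of the ingredients above lend themselves to a short sketch: estimating per-UE gains from the channel columns, and estimating the power of the real and imaginary parts separately. The snippet assumes that each UE's large-scale gain scales the variance of its channel column and that the underlying small-scale entries have variance `1 / M_R`; all function names are hypothetical.

```python
import math
import random

def cgauss(rng, var):
    s = math.sqrt(var / 2.0)
    return complex(rng.gauss(0.0, s), rng.gauss(0.0, s))

def column_gains(H):
    """Estimate each UE's gain as the squared norm of its channel column."""
    M_R, M_T = len(H), len(H[0])
    return [sum(abs(H[i][j]) ** 2 for i in range(M_R)) for j in range(M_T)]

def split_power(z):
    """Average power of the real and imaginary parts, estimated separately."""
    n = len(z)
    return (sum(v.real ** 2 for v in z) / n, sum(v.imag ** 2 for v in z) / n)

rng = random.Random(2)
gains = [1.0, 0.25, 4.0]   # illustrative large-scale gains for three UEs
M_R = 2048
H = [[math.sqrt(g) * cgauss(rng, 1.0 / M_R) for g in gains] for _ in range(M_R)]
g_hat = column_gains(H)

# BPSK observed in complex noise of variance 0.2: the real part carries
# signal plus half the noise, the imaginary part only half the noise.
z = [complex(rng.choice((-1.0, 1.0)), 0.0) + cgauss(rng, 0.2) for _ in range(2000)]
p_re, p_im = split_power(z)
```

The imbalance between `p_re` and `p_im` is what allows a receiver to adapt to a real-valued constellation without being told which constellation is in use.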
III-C Numerical Results
Figure 1 shows uncoded bit error-rate (BER) simulation results in a BS antenna, UE massive MU-MIMO system with BPSK, 16-QAM, and 256-QAM. We show the performance of exact L-MMSE equalization, as well as the performance of NOPE in both floating-point and fixed-point arithmetic. Solid lines correspond to floating-point precision, and circle markers correspond to fixed-point simulations of NOPE. Evidently, the BER performance of NOPE with iterations ( for 256-QAM) is virtually indistinguishable from that of the exact L-MMSE equalizer, which requires accurate knowledge of both the signal and noise powers. Due to its parameter-free nature, NOPE is suitable for situations in which the signal and noise powers change rapidly (e.g., due to interference) or in which the transmit constellation is unknown and would otherwise have to be estimated prior to data detection.
IV VLSI Architecture and Synthesis Results
We now propose a very-large-scale integration (VLSI) architecture of the NOPE algorithm for a 64 BS antenna, 16 UE massive MU-MIMO system. We then discuss the most essential optimization steps and finally present implementation results in a 28 nm CMOS technology.
IV-A Architecture Overview
We partition the NOPE iterations into two phases, each executed by a separate unit; see Fig. 2 for an architecture overview. The matrix-vector unit (MVU) executes the necessary matrix-vector multiplications and the estimation unit (EU) implements automatic parameter tuning.
The MVU performs the matrix-vector multiplication required to compute the dimensional output vector (line 7 of Alg. 2) and the scalar (line 5 of Alg. 2). The EU implements the mean and variance estimation to compute the posterior mean (line 14 of Alg. 2) and Onsager constant (line 15 of Alg. 2). In addition, we compute the per-user post-equalization SNR (line 16 of Alg. 2), which is required for log-likelihood ratio (LLR) value calculations to perform soft-output data detection. Both units carry out their tasks in the same number of clock cycles, which enables us to process two independent equalization problems concurrently in the same architecture by means of coarse-grained pipeline interleaving.
IV-B Architecture Details
We now provide architecture details for the MVU and EU, and briefly discuss the key fixed-point implementation aspects.
IV-B1 MVU Details
The MVU computes both and in a single unified architecture, similarly to the architecture in . We divide the -dimensional channel matrix into four blocks, each of which is processed using a separate MVU, which we refer to as MVU-, . Each matrix-vector multiplication is carried out using complex-valued multiply-accumulate (MAC) units; the matrix-vector operation is carried out on a column-by-column basis so that each MAC unit is associated with a row of the matrix.
A straightforward approach to compute would be to broadcast the -dimensional vector to all MVUs. To compute within the same architecture, access contentions would arise as one would need to be able to read all entries from the row of and sum all partial products. To enable highly parallel matrix-vector computation without causing access contentions, we use an architecture as depicted in Fig. 3 that performs a variant of Cannon’s algorithm . Let be a block of where each row is cyclically shifted by its index. To compute , the input is first loaded into the input shift registers (the pre-shift block). We then circularly shift the entries of this shift register while sequentially calculating the MAC operations with entries of the matrix ; the outputs are accumulated in the registers at the output of each MAC unit. This effectively implements a column-by-column matrix-vector operation in clock cycles. To compute , we load into the input shift register but no cyclical shifts are carried out. Instead we cyclically exchange the outputs (the post-shift block) while accumulating the results. This effectively implements a row-by-row matrix-vector operation in clock cycles.
After the computation of , we have to accumulate the results of the four blocks. We do this over two additional clock cycles: in cycle 1, MVU- and MVU- pass their result to MVU- and MVU- for accumulation; in cycle 2, MVU- passes its result to MVU- to obtain the final result.
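The shift-register scheme described above can be modeled in software as follows. This is a behavioral sketch for a single square block, with one list entry per MAC unit and one outer loop iteration per clock cycle; the function names are hypothetical and the model ignores pipelining and word lengths.

```python
def skew(block):
    """Cyclically shift row i of a square block left by i positions."""
    B = len(block)
    return [[block[i][(c + i) % B] for c in range(B)] for i in range(B)]

def mvu_forward(skewed, x):
    """Compute block @ x in B cycles by rotating the input register."""
    B = len(skewed)
    reg = list(x)                 # input shift register, one entry per MAC
    acc = [0j] * B                # accumulator at the output of each MAC
    for c in range(B):
        for i in range(B):        # all MAC units run in parallel in hardware
            acc[i] += skewed[i][c] * reg[i]
        reg = reg[1:] + reg[:1]   # circular shift of the input entries
    return acc

def mvu_adjoint(skewed, r):
    """Compute block^H @ r in B cycles by rotating the accumulators."""
    B = len(skewed)
    acc = [0j] * B
    for c in range(B):
        for i in range(B):        # the input r[i] stays fixed at MAC i
            acc[i] += skewed[i][c].conjugate() * r[i]
        acc = acc[1:] + acc[:1]   # cyclic exchange of the partial sums
    return acc

# Small exact check with integer-valued entries.
B = 4
A = [[complex(i + 2 * j, i - j) for j in range(B)] for i in range(B)]
v = [complex(k, 1 - k) for k in range(B)]
fwd = mvu_forward(skew(A), v)
adj = mvu_adjoint(skew(A), v)
fwd_ref = [sum(A[i][j] * v[j] for j in range(B)) for i in range(B)]
adj_ref = [sum(A[i][j].conjugate() * v[i] for i in range(B)) for j in range(B)]
```

In both modes, MAC unit `i` reads only its own row of the skewed block at address `c`, so the same memory layout serves both products without any unit needing access to another unit's storage.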
IV-B2 EU Details
The EU computes the posterior mean and Onsager constant . To this end, the EU first computes the -dimensional norm of the real and imaginary part of , and . We employ two MAC units which compute the real and imaginary parts over 16 clock cycles. Once and are completed, we compute the so-called denoising parameter (line 12 of Alg. 2) and (line 13 of Alg. 2) for each -th UE sequentially over 16 clock cycles. We note that the function in lines 12 and 13 of Alg. 2 is numerically stable, so we employ a single iteration of the LUT-based Newton-Raphson procedure.
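A LUT-seeded Newton-Raphson step can be sketched as follows, assuming the function in question is a reciprocal. The table size `LUT_BITS`, the midpoint seeding, and the floating-point normalization via `math.frexp` are illustrative choices, not details from the design.

```python
import math

LUT_BITS = 4  # hypothetical table size: 16 entries
# Seed values: reciprocal of the midpoint of each bin covering [1, 2).
LUT = [1.0 / (1.0 + (k + 0.5) / 2 ** LUT_BITS) for k in range(2 ** LUT_BITS)]

def reciprocal(d):
    """Approximate 1/d for d > 0 with a LUT seed and one Newton-Raphson step."""
    mant, exp = math.frexp(d)            # d = mant * 2**exp, mant in [0.5, 1)
    m = 2.0 * mant                       # normalized operand in [1, 2)
    k = int((m - 1.0) * 2 ** LUT_BITS)   # address from the leading fraction bits
    x0 = LUT[k]                          # initial guess for 1/m
    x1 = x0 * (2.0 - m * x0)             # Newton-Raphson step for f(x) = 1/x - m
    return math.ldexp(x1, 1 - exp)       # undo the normalization

samples = [0.3, 1.0, 1.7, 5.0, 123.456]
errors = [abs(reciprocal(d) * d - 1.0) for d in samples]
```

Because one Newton-Raphson step squares the relative error of the seed, a 16-entry table with a seed error of about 3% already yields a relative error below `1e-3` after a single iteration.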
IV-B3 Fixed-Point Arithmetic
In order to achieve low hardware complexity and high throughput, our design uses fixed-point arithmetic. We first globally scale so that the real and imaginary entries are less than . We then quantize each element of to fraction bits, and to integer and fraction bits. The fixed-point performance of our NOPE design is shown in Fig. 1. The solid lines correspond to floating-point performance, the markers to the fixed-point performance of our golden model.
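Quantization to a signed fixed-point format can be modeled with a short helper like the one below. The word lengths in the example are placeholders chosen for illustration and do not reflect the bit widths of the design; `int_bits` excludes the sign bit.

```python
def quantize(value, int_bits, frac_bits):
    """Round to a signed fixed-point grid and saturate at the range limits."""
    scale = 1 << frac_bits
    lo = -(1 << (int_bits + frac_bits))        # most negative integer code
    hi = (1 << (int_bits + frac_bits)) - 1     # most positive integer code
    code = max(lo, min(hi, round(value * scale)))
    return code / scale

def quantize_complex(z, int_bits, frac_bits):
    """Quantize the real and imaginary parts independently."""
    return complex(quantize(z.real, int_bits, frac_bits),
                   quantize(z.imag, int_bits, frac_bits))

q1 = quantize(0.3, 0, 4)                     # nearest multiple of 1/16
q2 = quantize(5.0, 1, 4)                     # saturates at the upper limit
q3 = quantize_complex(0.3 - 0.3j, 0, 4)
```

Scaling the channel matrix so that all entries fall below the saturation limit, as described above, means the integer part can be dropped and all bits spent on the fraction.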
| | This work | Prabhu | Tang | Peng | Castañeda |
|---|---|---|---|---|---|
| Preproc. quantities | col. gains | – | Gram mat. | – | Gram mat. |

Table notes: this design also supports precoding; standard technology scaling rules apply; the ZF mode does not require any parameters.
IV-C Implementation Results and Conclusion
Table I shows synthesis results for NOPE in a 28 nm CMOS technology and compares our design to existing data detectors for massive MU-MIMO. We note that the numbers reported in Table I for NOPE are based on synthesis results; an ASIC design is part of ongoing work. While our design is comparable to other designs in terms of hardware efficiency, we emphasize that NOPE is completely parameter-free (other than knowledge of and ), which makes it more resilient to parameter mismatch and dynamic variations of the system than all the other methods. In addition, NOPE requires a minimal amount of preprocessing, i.e., and , in contrast to, e.g., the design of  that requires computation of the Gram matrix, which often dominates the complexity of massive MU-MIMO data detectors. In summary, NOPE is a robust “fire-and-forget” equalization algorithm for MU-MIMO systems that achieves L-MMSE performance at competitive implementation complexity.
- [1] E. Larsson, O. Edfors, F. Tufvesson, and T. Marzetta, “Massive MIMO for next generation wireless systems,” IEEE Commun. Mag., vol. 52, no. 2, pp. 186–195, Feb. 2014.
- [2] J. Andrews, S. Buzzi, W. Choi, S. Hanly, A. Lozano, A. Soong, and J. Zhang, “What will 5G be?” IEEE J. Sel. Areas Commun., vol. 32, no. 6, pp. 1065–1082, Jun. 2014.
- [3] M. Wu, B. Yin, G. Wang, C. Dick, J. Cavallaro, and C. Studer, “Large-scale MIMO detection for 3GPP LTE: Algorithm and FPGA implementation,” IEEE J. Sel. Topics Signal Process., vol. 8, no. 5, pp. 916–929, Oct. 2014.
- [4] S. Verdú, Multiuser Detection, 1st ed. Cambridge Univ. Press, 1998.
- [5] J. Hoydis, S. ten Brink, and M. Debbah, “Massive MIMO: How many antennas do we need?” in Proc. Allerton Conf. Commun., Contr., Comput., Sept. 2011, pp. 545–550.
- [6] R. Ghods, C. Jeon, G. Mirza, A. Maleki, and C. Studer, “Optimally-tuned nonparametric linear equalization for massive MU-MIMO systems,” in Proc. IEEE Int. Symp. Inf. Theory (ISIT), Jun. 2017, pp. 2118–2122.
- [7] C. Studer, S. Fateh, and D. Seethaler, “ASIC implementation of soft-input soft-output MIMO detection using MMSE parallel interference cancellation,” IEEE J. Solid-State Circuits, vol. 46, no. 7, pp. 1754–1765, Jul. 2011.
- [8] O. Castañeda, S. Jacobsson, G. Durisi, M. Coldrey, T. Goldstein, and C. Studer, “1-bit massive MU-MIMO precoding in VLSI,” IEEE J. Emerg. Sel. Topics Circuits Syst., 2017.
- [9] H. Prabhu, J. N. Rodrigues, L. Liu, and O. Edfors, “A 60pJ/b 300Mb/s 128×8 Massive MIMO precoder-detector in 28nm FD-SOI,” in IEEE Int. Solid-State Circuits Conf. (ISSCC), Feb. 2017, pp. 60–61.
- [10] W. Tang, C. H. Chen, and Z. Zhang, “A 0.58mm² 2.76Gb/s 79.8pJ/b 256-QAM massive MIMO message-passing detector,” in IEEE Symp. VLSI Circuits, Jun. 2016, pp. 1–2.
- [11] G. Peng, L. Liu, S. Zhou, S. Yin, and S. Wei, “A 1.58 Gbps/W 0.40 Gbps/mm² ASIC implementation of MMSE detection for 64-QAM Massive MIMO in 65 nm CMOS,” IEEE Trans. Circuits Syst. I, 2017.
- [12] C. Studer, M. Wenk, and A. Burg, “MIMO transmission with residual transmit-RF impairments,” in Int. ITG Workshop on Smart Antennas (WSA), Feb. 2010, pp. 189–196.
- [13] E. Björnson, J. Hoydis, M. Kountouris, and M. Debbah, “Massive MIMO systems with non-ideal hardware: Energy efficiency, estimation, and capacity limits,” IEEE Trans. Inf. Theory, vol. 60, no. 11, pp. 7112–7139, Apr. 2014.
- [14] D. Donoho, A. Maleki, and A. Montanari, “Message-passing algorithms for compressed sensing,” Proc. Natl. Academy of Sciences (PNAS), vol. 106, no. 45, pp. 18914–18919, Sept. 2009.
- [15] C. M. Stein, “Estimation of the mean of a multivariate normal distribution,” The Annals of Statistics, vol. 9, no. 6, pp. 1135–1151, Nov. 1981.
- [16] C. Jeon, A. Maleki, and C. Studer, “On the performance of mismatched data detection in large MIMO systems,” in Proc. IEEE Int. Symp. Inf. Theory (ISIT), Jul. 2016, pp. 180–184.
- [17] L. Cannon, “A cellular computer to implement the Kalman filter algorithm,” Ph.D. dissertation, Montana State University, USA, 1969.
- [18] U. Madhow and M. L. Honig, “MMSE interference suppression for direct-sequence spread-spectrum CDMA,” IEEE Trans. Commun., vol. 42, no. 12, pp. 3178–3188, Aug. 1994.
- [19] K. R. Kumar, G. Caire, and A. L. Moustakas, “Asymptotic performance of linear receivers in MIMO fading channels,” IEEE Trans. Inf. Theory, vol. 55, no. 10, pp. 4398–4418, Oct. 2009.
- [20] J. Hoydis, S. ten Brink, and M. Debbah, “Massive MIMO: How many antennas do we need?” in Proc. Allerton Conf. Commun., Contr., Comput., Sept. 2011, pp. 545–550.
- [21] D. Tse and S. Hanly, “Linear multiuser receivers: effective interference, effective bandwidth and user capacity,” IEEE Trans. Inf. Theory, vol. 45, no. 2, pp. 641–657, Mar. 1999.
- [22] C. D. Perels, “Frame-based MIMO-OFDM systems: impairment estimation and compensation,” Ph.D. dissertation, ETH Zurich, Switzerland, 2007.
- [23] O. Castañeda, T. Goldstein, and C. Studer, “Data detection in large multi-antenna wireless systems via approximate semidefinite relaxation,” IEEE Trans. Circuits Syst. I, vol. 63, no. 12, pp. 2334–2346, Dec. 2016.