VLSI Design of a Nonparametric Equalizer
for Massive MU-MIMO
Abstract
Linear minimum mean-square error (LMMSE) equalization is among the most popular methods for data detection in massive multi-user multiple-input multiple-output (MU-MIMO) wireless systems. While LMMSE equalization enables near-optimal spectral efficiency, accurate knowledge of the signal and noise powers is necessary. Furthermore, corresponding VLSI designs must solve linear systems of equations, which requires high arithmetic precision, exhibits stringent data dependencies, and results in high circuit complexity. This paper proposes the first VLSI design of the NOnParametric Equalizer (NOPE), which avoids knowledge of the transmit signal and noise powers, provably delivers the performance of LMMSE equalization for massive MU-MIMO systems, and is resilient to numerous system and hardware impairments due to its parameter-free nature. Moreover, NOPE avoids computation of a matrix inverse and only requires hardware-friendly matrix-vector multiplications. To showcase the practical advantages of NOPE, we propose a parallel VLSI architecture and provide synthesis results in 28 nm CMOS. We demonstrate that NOPE performs on par with existing data detectors for massive MU-MIMO that require accurate knowledge of the signal and noise powers.
I Introduction
It is widely believed that massive multi-user multiple-input multiple-output (MU-MIMO) will be a core technology for fifth-generation (5G) wireless systems. Massive MU-MIMO relies on base-station (BS) architectures with hundreds of antenna elements and radio-frequency (RF) chains that serve tens of user equipments (UEs) in the same time-frequency resource. While this emerging technology enables unprecedented spectral efficiency by means of fine-grained beamforming [1, 2], it also poses significant practical implementation challenges.
I-A The Case for Nonparametric Equalization
Data detection in the uplink (UEs transmit data to the BS) is among the most critical tasks from a spectral efficiency and hardware complexity perspective [3]. While optimal MIMO data detection is known to be NP-hard [4], it has been shown in [5, 3] that linear minimum mean-square error (LMMSE) equalization enables near-optimal performance in massive MU-MIMO systems. However, LMMSE equalization requires accurate knowledge of the signal and noise powers [6]. Furthermore, corresponding hardware designs must solve linear systems of equations, which requires high arithmetic precision and suffers from stringent data dependencies; both of these aspects result in relatively high circuit complexity [7, 8, 9, 10, 11]. In addition, practical massive MU-MIMO BS designs will most likely rely on inexpensive RF circuitry, which suffers from numerous impairments, including amplifier nonlinearities, phase noise, and quantization artifacts [12, 13]. The presence of non-ideal hardware necessitates the design of new equalization algorithms that are resilient to real-world hardware imperfections.
Recently, a novel algorithm called the NOnParametric Equalizer (NOPE, for short) was proposed in [6]. NOPE does not require knowledge of the signal and noise powers while provably achieving the performance of LMMSE equalization in massive MU-MIMO systems. NOPE combines approximate message passing (AMP) [14] with Stein’s unbiased risk estimator (SURE) [15] and mismatched data detection [16], which renders this algorithm resilient to numerous hardware impairments while being computationally efficient: NOPE only requires matrix-vector products and avoids the computation of costly matrix inverses or matrix decompositions, which are typically required by LMMSE equalization algorithms. Despite all these advantages, NOPE has been designed only for idealistic channel models and has not yet been integrated in hardware.
I-B Contributions
In this paper, we generalize NOPE to practical channels and provide, to the best of our knowledge, its first VLSI design. Our contributions are summarized as follows:

We propose a set of algorithm-level modifications that enable NOPE to operate on more realistic MU channels.

We develop a VLSI architecture that relies on Cannon’s algorithm [17] to achieve high throughput at low area.

We show reference VLSI synthesis results in 28 nm CMOS for a 64 BS antenna, 16 UE massive MU-MIMO system.

We compare NOPE to existing massive MU-MIMO equalizers requiring knowledge of the signal and noise powers.
Our results demonstrate that massive MU-MIMO enables the design of parameter-free algorithms, such as NOPE, that perform on par with solutions that require accurate knowledge of critical system and model parameters.
I-C Notation
Lowercase and uppercase boldface letters designate column vectors and matrices, respectively. The transpose and conjugate transpose of a matrix are denoted by and , respectively; the entry on the th row and th column of is . The identity matrix is denoted by and the allzeros matrix by . The th entry of a vector is . We define the Hadamard product as . For an dimensional vector , we define . The probability density function (PDF) of a circularlysymmetric complexvalued Gaussian random vector with covariance matrix is denoted by . Expectation and variance with respect to the random vector is denoted by and , respectively.
II A Primer on LMMSE Equalization
We now introduce the system model and review the basics of LMMSE equalization. We then discuss NOPE.
II-A System Model
We consider the input-output relation $\mathbf{y}=\mathbf{H}\mathbf{x}+\mathbf{n}$ to model a massive MU-MIMO uplink system operating in a frequency-flat channel [3]. The vector $\mathbf{y}\in\mathbb{C}^{B}$ contains the received signals at the BS; $B$ denotes the number of BS antennas; the matrix $\mathbf{H}\in\mathbb{C}^{B\times U}$ represents the uplink MIMO channel; $U$ denotes the number of UEs; the transmit signal vector is $\mathbf{x}\in\mathbb{C}^{U}$; and the vector $\mathbf{n}\in\mathbb{C}^{B}$ models receive noise, which has i.i.d. circularly-symmetric complex Gaussian entries with variance $N_0$ per entry. Throughout this paper, we assume that the transmit signal vector has i.i.d. entries so that $p(\mathbf{x})=\prod_{\ell=1}^{U}p(x_\ell)$, where $p(x_\ell)$ models the signal prior (e.g., a 16-QAM constellation) with zero mean and signal variance $E_s=\mathbb{E}[|x_\ell|^2]$, $\ell=1,\ldots,U$. We require the following definitions.
Definition 1.
The antenna ratio is defined as $\beta=U/B$.
Definition 2.
The large-antenna limit is defined by fixing the antenna ratio $\beta$ and letting $B\to\infty$.
Definition 3.
The channel matrix $\mathbf{H}$ is said to have uniform channel gains if the entries are i.i.d. circularly-symmetric complex Gaussian with variance $1/B$ per complex entry.
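As a minimal sketch of this system model, the following pure-Python snippet draws one channel use. The function names, the choice of QPSK, and the values B = 64, U = 16, and N0 = 0.1 are illustrative assumptions; the channel entries are given variance 1/B, which is the normalization assumed here for uniform channel gains.

```python
import random

def complex_gaussian(variance, rng):
    """Draw one circularly-symmetric complex Gaussian sample with the given variance."""
    s = (variance / 2) ** 0.5  # the variance is split evenly between real and imaginary parts
    return complex(rng.gauss(0.0, s), rng.gauss(0.0, s))

def simulate_uplink(B, U, N0, constellation, rng):
    """Return (H, x, y) for one use of the model y = H x + n (hypothetical helper)."""
    # Channel with uniform gains: i.i.d. entries of variance 1/B (assumed normalization)
    H = [[complex_gaussian(1.0 / B, rng) for _ in range(U)] for _ in range(B)]
    # i.i.d. transmit symbols drawn from the constellation
    x = [rng.choice(constellation) for _ in range(U)]
    # Receive noise with variance N0 per entry
    n = [complex_gaussian(N0, rng) for _ in range(B)]
    y = [sum(H[b][u] * x[u] for u in range(U)) + n[b] for b in range(B)]
    return H, x, y

rng = random.Random(0)
qpsk = [complex(a, b) / 2 ** 0.5 for a in (-1, 1) for b in (-1, 1)]  # unit-energy QPSK
H, x, y = simulate_uplink(B=64, U=16, N0=0.1, constellation=qpsk, rng=rng)

beta = 16 / 64  # antenna ratio: number of UEs divided by number of BS antennas
col_energy = [sum(abs(H[b][u]) ** 2 for b in range(64)) for u in range(16)]
avg_col_energy = sum(col_energy) / 16  # close to 1 under the 1/B normalization
```

With this normalization, every column of the channel matrix has an expected squared norm of one, which is what makes the per-UE gains "uniform".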
II-B Basics of LMMSE Equalization
LMMSE equalization is among the most popular methods to compute an estimate of the transmit signal vector from the receive vector and from knowledge of the channel matrix, and enjoys widespread use for data detection in MIMO systems [18, 19, 7, 20]. The relatively low computational complexity (except for the inversion of a potentially large matrix) and acceptable performance render this method a feasible alternative to more complicated data detection algorithms. Moreover, it has been shown in [5, 3] that LMMSE equalization enables (often significantly) higher achievable rates than zero-forcing (ZF) or maximum ratio combining (MRC)-based equalizers in massive MU-MIMO systems. However, to enable near-optimal spectral efficiency via LMMSE equalization, accurate knowledge of the signal and noise powers is required; see, e.g., [21, 6].
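Mathematically, the goal of LMMSE equalization is to compute a linear estimate $\hat{\mathbf{x}}=\mathbf{W}\mathbf{y}$ from the receive vector $\mathbf{y}$ that minimizes the MSE $\mathbb{E}[\|\hat{\mathbf{x}}-\mathbf{x}\|^2]$ using knowledge of the channel matrix $\mathbf{H}$ as well as the signal and noise powers, $E_s$ and $N_0$, respectively. For a circularly-symmetric complex-valued transmit signal $\mathbf{x}$, the equalization matrix is given by $\mathbf{W}=\big(\mathbf{H}^H\mathbf{H}+\tfrac{N_0}{E_s}\mathbf{I}\big)^{-1}\mathbf{H}^H$. If the signal is zero-mean and real-valued (e.g., for BPSK signals), then the optimal linear estimator for the real part is given by
$$\hat{\mathbf{x}}_R=\Big(\mathbf{H}_R^T\mathbf{H}_R+\mathbf{H}_I^T\mathbf{H}_I+\tfrac{N_0}{2E_s}\mathbf{I}\Big)^{-1}\big(\mathbf{H}_R^T\mathbf{y}_R+\mathbf{H}_I^T\mathbf{y}_I\big),$$
where $\mathbf{H}_R$, $\mathbf{H}_I$, $\mathbf{y}_R$, and $\mathbf{y}_I$ are the real and imaginary parts of $\mathbf{H}$ and $\mathbf{y}$, respectively; the imaginary part of the estimate is $\hat{\mathbf{x}}_I=\mathbf{0}$. Clearly, LMMSE equalization relies on knowledge of the quantities $N_0/E_s$ or $N_0/(2E_s)$, which requires (i) means to detect whether the transmit signals are real- or complex-valued and (ii) an accurate estimate of the noise-to-signal power ratio, which is commonly acquired in a dedicated training phase [22].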
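The complex-valued LMMSE estimate can be sketched in a few lines of pure Python. This is an illustrative reference model under the standard formulation (regularized Gram matrix followed by a linear solve), not the hardware method of any cited design; the helper names `solve` and `lmmse_equalize` and the toy channel are assumptions made for this example.

```python
def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting (complex entries allowed)."""
    n = len(A)
    M = [list(row) + [b[i]] for i, row in enumerate(A)]  # augmented matrix [A | b]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(M[i][k]))  # pivot row
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            f = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= f * M[k][j]
    z = [0j] * n
    for i in reversed(range(n)):  # back substitution
        s = M[i][n] - sum(M[i][j] * z[j] for j in range(i + 1, n))
        z[i] = s / M[i][i]
    return z

def lmmse_equalize(H, y, N0, Es):
    """Return (H^H H + (N0/Es) I)^{-1} H^H y for a B x U channel matrix H."""
    B, U = len(H), len(H[0])
    # Regularized Gram matrix; the regularizer needs the ratio N0/Es
    G = [[sum(H[b][k].conjugate() * H[b][l] for b in range(B))
          + (N0 / Es if k == l else 0.0)
          for l in range(U)] for k in range(U)]
    # Matched-filter output H^H y
    y_mf = [sum(H[b][k].conjugate() * y[b] for b in range(B)) for k in range(U)]
    return solve(G, y_mf)

# Toy noise-free example with a 3 x 2 channel (illustrative values)
H = [[1, 0], [0, 1], [1, 1]]
x = [1 + 1j, -1 - 1j]
y = [sum(h * s for h, s in zip(row, x)) for row in H]
x_hat = lmmse_equalize(H, y, N0=1e-9, Es=2.0)
```

The linear solve and the explicit dependence on `N0 / Es` are the two aspects that a parameter-free, inversion-free algorithm sets out to avoid.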
II-C LMMSE Equalization via AMP
As shown in [16], LMMSE equalization can be implemented using the mismatched complex Bayesian AMP (mcBAMP) framework. By assuming a mismatched Gaussian signal prior distribution with instead of the true signal prior (e.g., two Dirac delta functions concentrated at and for BPSK), one can design the following parametric LMMSE algorithm:
Algorithm 1 (LMMSE-AMP [16]).
Initialize , , , and . Then, for every iteration compute the output via the following steps:
(1)  
(2)  
(3) 
Here, the posterior mean function and operate elementwise on vectors. We furthermore need the MSE function: .
Interestingly, the estimate computed by LMMSE-AMP exhibits the same MSE as that of the LMMSE equalizer in the large-antenna limit and for matrices with uniform channel gains [16]. While this is an asymptotic equivalence, reference [6] has shown that the error-rate performance of LMMSE-AMP is virtually indistinguishable from that of an LMMSE equalizer in practical (finite-dimensional) massive MU-MIMO systems for a small number of iterations (ten or fewer). Clearly, Algorithm 1 mainly relies on matrix-vector multiplications, which enables parallel hardware designs. However, exact knowledge of the signal power is still required.
III NOnParametric Equalizer: NOPE
We now summarize the necessary steps to free Algorithm 1 from knowledge of the signal power, leading to NOPE. We then propose a generalization of the algorithm that makes it suitable for more realistic MIMO system scenarios.
III-A The NOPE Algorithm
To develop NOPE, we wish to automatically tune the signal power and the parameter . To this end, we first introduce the parameter and reparametrize the functions and in Algorithm 1. Now, only a single parameter must be tuned per iteration, i.e., . Interestingly, [16, Thm. 3] shows that optimal parameter tuning is achieved by tuning each parameter by minimizing (1) separately at iteration starting from to . Hence, the remaining piece is to replace the MSE function with a function that does not depend on the true signal prior . As shown in [6], one can use Stein’s unbiased risk estimate (SURE) [15] to extract an estimate of the MSE function as
(4) 
Since the minimum of is given by we can replace the tuning stage in (1) by , which leads to NOPE. As proven in [6, Cor. 6], NOPE achieves the performance of an LMMSE equalizer in the large antenna limit given that has uniform channel gains and for .
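To illustrate why SURE permits parameter tuning without access to the signal prior, consider the following generic real-valued example. It is a textbook instance of Stein's result and not a reconstruction of (4); the symbols $\mathbf{s}$, $\sigma^2$, $a$, and $N$ are introduced only for this illustration, and $\sigma^2$ is assumed to be known.

```latex
\begin{align}
\mathbf{z} &= \mathbf{s} + \sigma\mathbf{w}, \quad
  \mathbf{w}\sim\mathcal{N}(\mathbf{0},\mathbf{I}_N), \qquad
  F(\mathbf{z}) = a\,\mathbf{z} \\
\mathbb{E}\,\|F(\mathbf{z})-\mathbf{s}\|^2
  &= \mathbb{E}\big[\|F(\mathbf{z})-\mathbf{z}\|^2
     + 2\sigma^2\,\nabla\!\cdot\!F(\mathbf{z}) - N\sigma^2\big] \\
\widehat{\mathrm{MSE}}(a)
  &= (a-1)^2\,\|\mathbf{z}\|^2 + 2\sigma^2 a N - N\sigma^2 \\
\frac{\partial\,\widehat{\mathrm{MSE}}(a)}{\partial a} = 0
  \;&\Rightarrow\; a^\star = 1 - \frac{N\sigma^2}{\|\mathbf{z}\|^2}
\end{align}
```

The risk estimate in the third line depends only on the observation $\mathbf{z}$ and the noise level, not on the distribution of $\mathbf{s}$, so the tuning parameter $a$ can be chosen in closed form from observable quantities.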
III-B Robust Version of NOPE
NOPE and Algorithm 1 require the matrix to have uniform channel gains. However, in practice, each UE typically has a different large-scale fading gain (e.g., affected by the distance to the BS), resulting in channel matrices whose columns have different scales. We now show how NOPE can be made robust to such channels. As in [6], one can rewrite the channel matrix as , where each element of is distributed as and is diagonal containing the th UE’s individual large-scale fading gain . For this model, one must estimate the gain of the th UE by , which converges to in the large-antenna limit. Thus, is estimated with a diagonal matrix , where the th diagonal element is given by . To enable NOPE to support non-uniform channel gains, we modify the posterior mean function in (2) into an element-wise operation [6]
(5) 
Furthermore, step (3) in Algorithm 1 must be replaced by
to take into account the fact that different functions are used for each UE. This generalization also requires new estimates for the parameters and in (5). As shown in [6, Thm 7], both of these parameters can be estimated as follows
(6) 
where we introduced shorthand notation for the weighted norm of with respect to its large-scale fading gains, and for the residual norm.
The remaining piece of our robust NOPE is to enable LMMSE data detection for BPSK constellations, for which the imaginary part of is zero. In fact, a circularly-symmetric complex Gaussian prior is a poor match for BPSK signals because their imaginary part is zero. We generalize NOPE by estimating the signal variance in (6) for the real and imaginary parts separately, which enables us to automatically adapt NOPE to the constellation in use. To do so, we decompose the weighted norm of denoted by , into real and imaginary parts, i.e., . More specifically, we can estimate the necessary variances as
for which . With all these ingredients, we arrive at the generalized NOPE algorithm in Algorithm 2.
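The per-UE gain estimation described earlier in this subsection can be sketched as follows. The estimator used here, the squared Euclidean norm of each channel column, is an assumption made for this illustration (it matches the "column gains" preprocessing quantity but is not copied from the paper's formula); the function name and the deterministic test channel are likewise hypothetical.

```python
import cmath

def estimate_column_gains(H):
    """Return the squared Euclidean norm of every column of the B x U matrix H."""
    B, U = len(H), len(H[0])
    return [sum(abs(H[b][u]) ** 2 for b in range(B)) for u in range(U)]

# Deterministic example: unit-norm columns scaled by the square root of known gains
B, U = 8, 3
true_gains = [1.0, 0.25, 4.0]  # illustrative large-scale fading gains
H_unit = [[cmath.exp(2j * cmath.pi * b * u / B) / B ** 0.5 for u in range(U)]
          for b in range(B)]  # every column has unit norm
H = [[H_unit[b][u] * true_gains[u] ** 0.5 for u in range(U)] for b in range(B)]
g_hat = estimate_column_gains(H)
```

For a random channel whose normalized entries have variance 1/B, the same estimator only approaches the true gain as the number of BS antennas grows; the deterministic columns above are used so that the example is exactly reproducible.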
III-C Numerical Results
Figure 1 shows uncoded bit error-rate (BER) simulation results in a BS antenna, UE massive MU-MIMO system with BPSK, 16-QAM, and 256-QAM. We show the performance of exact LMMSE equalization, as well as the performance of NOPE for both floating-point and fixed-point arithmetic. Solid lines correspond to floating-point precision, and circle markers correspond to fixed-point precision simulations of NOPE. Evidently, the BER performance of NOPE with iterations ( for 256-QAM) is virtually indistinguishable from that of the exact LMMSE estimator, which requires accurate knowledge of both the signal and noise powers. Due to its parameter-free nature, NOPE is suitable for situations in which the signal and noise powers change rapidly (e.g., due to interference) or in which the transmit constellation is unknown and must be estimated prior to data detection.
IV VLSI Architecture and Synthesis Results
We now propose a very-large-scale integration (VLSI) architecture of the NOPE algorithm for a 64 BS antenna, 16 UE massive MU-MIMO system. We then discuss the most essential optimization steps and finally present implementation results in a 28 nm CMOS technology.
IV-A Architecture Overview
We partition the NOPE iterations into two phases, each executed by a separate unit; see Fig. 2 for an architecture overview. The matrix-vector unit (MVU) executes the necessary matrix-vector multiplications and the estimation unit (EU) implements automatic parameter tuning.
The MVU performs the matrix-vector multiplication required to compute the dimensional output vector (line 7 of Alg. 2) and the scalar (line 5 of Alg. 2). The EU implements the mean and variance estimation to compute the posterior mean (line 14 of Alg. 2) and Onsager constant (line 15 of Alg. 2). In addition, we compute the per-user post-equalization SNR (line 16 of Alg. 2), which is required for log-likelihood ratio (LLR) value calculations to perform soft-output data detection. Both units carry out their tasks in the same number of clock cycles, which enables us to process two independent equalization problems concurrently in the same architecture by means of coarse-grained pipeline interleaving.
IV-B Architecture Details
We now provide architecture details for the MVU and EU, and briefly discuss the key fixed-point implementation aspects.
IV-B1 MVU Details
The MVU computes both and in a single unified architecture, similarly to the architecture in [8]. We divide the dimensional channel matrix into four blocks, each of which is processed using a separate MVU, which we refer to as MVU, . Each matrix-vector multiplication is carried out using complex-valued multiply-accumulate (MAC) units; the matrix-vector operation is carried out on a column-by-column basis so that each MAC unit is associated with a row of the matrix.
A straightforward approach to compute would be to broadcast the dimensional vector to all MVUs. To compute within the same architecture, access contentions would arise as one would need to be able to read all entries from the row of and sum all partial products. To enable highly parallel matrix-vector computation without causing access contentions, we use an architecture as depicted in Fig. 3 that performs a variant of Cannon’s algorithm [17]. Let be a block of where each row is cyclically shifted by its index. To compute , the input is first loaded into the input shift registers (the pre-shift block). We then circularly shift the entries of this shift register while sequentially calculating the MAC operations with entries of the matrix ; the outputs are accumulated in the registers at the output of each MAC unit. This effectively implements a column-by-column matrix-vector operation in clock cycles. To compute , we load into the input shift register but no cyclical shifts are carried out. Instead, we cyclically exchange the outputs (the post-shift block) while accumulating the results. This effectively implements a row-by-row matrix-vector operation in clock cycles.
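The skewed-storage idea can be modeled in software as follows. This is one plausible reading of the scheme for a single square N x N block, written as a behavioral sketch rather than a description of the actual circuit; the function names, the use of `A` for the block, and the toy data are assumptions. Each index `i` of the inner loops stands for one MAC unit that only reads its own row memory and its own register, so no broadcast and no shared memory port is needed.

```python
def skew_rows(A):
    """Store row i cyclically shifted by i, so that MAC i reads A[i][(i + t) % n] at cycle t."""
    n = len(A)
    return [[A[i][(i + t) % n] for t in range(n)] for i in range(n)]

def matvec_preshift(A_skew, x):
    """Compute A x: the input register rotates, each MAC accumulates its own output."""
    n = len(x)
    reg = list(x)                 # input shift register
    acc = [0j] * n                # one accumulator per MAC unit
    for t in range(n):            # n clock cycles
        for i in range(n):        # all MAC units operate in parallel in hardware
            acc[i] += A_skew[i][t] * reg[i]
        reg = reg[1:] + reg[:1]   # circular shift of the input register
    return acc

def hermitian_matvec_postshift(A_skew, r):
    """Compute A^H r: the input stays in place, the partial sums rotate instead."""
    n = len(r)
    acc = [0j] * n
    for t in range(n):
        for i in range(n):
            acc[i] += A_skew[i][t].conjugate() * r[i]
        acc = acc[1:] + acc[:1]   # cyclic exchange of the accumulators
    return acc

# Toy 4 x 4 block and vector (illustrative values)
A = [[complex(i + 1, j - i) for j in range(4)] for i in range(4)]
x = [1 + 0j, 2j, -1 + 1j, 0.5 + 0j]
A_skew = skew_rows(A)
y_hw = matvec_preshift(A_skew, x)
z_hw = hermitian_matvec_postshift(A_skew, x)

# Direct reference computations for comparison
y_ref = [sum(A[i][j] * x[j] for j in range(4)) for i in range(4)]
z_ref = [sum(A[i][j].conjugate() * x[i] for i in range(4)) for j in range(4)]
```

Both products reuse the same skewed copy of the block and finish in N cycles; only the location of the rotation (before or after the MAC units) changes.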
After the computation of , we have to accumulate the results of the four blocks. We do this over two additional clock cycles: in cycle 1, MVU and MVU pass their result to MVU and MVU for accumulation; in cycle 2, MVU passes its result to MVU to obtain the final result.
IV-B2 EU Details
The EU computes the posterior mean and Onsager constant . To this end, the EU first computes the dimensional norm of the real and imaginary part of , and . We employ two MAC units, which compute the real and imaginary parts over 16 clock cycles. Once and are completed, we compute the so-called denoising parameter (line 12 of Alg. 2) and (line 13 of Alg. 2) for each th UE sequentially over 16 clock cycles. We note that the function in lines 12 and 13 of Alg. 2 is numerically stable, so we employ a single iteration of the LUT-based Newton-Raphson procedure.
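A single LUT-seeded Newton-Raphson iteration can be sketched as follows. The example assumes that the function being evaluated is a reciprocal, which is a common use of this technique but is not confirmed by the text here; the table size `LUT_BITS` and all names are hypothetical.

```python
import math

LUT_BITS = 4  # hypothetical table size: 2**4 = 16 entries
# Seed table: reciprocal of the midpoint of each sub-interval of [0.5, 1)
RECIP_LUT = [1.0 / (0.5 + (k + 0.5) / 2 ** (LUT_BITS + 1))
             for k in range(2 ** LUT_BITS)]

def reciprocal_nr(d):
    """Approximate 1/d for d > 0 with a LUT seed and one Newton-Raphson step."""
    m, e = math.frexp(d)                      # d = m * 2**e with m in [0.5, 1)
    k = int((m - 0.5) * 2 ** (LUT_BITS + 1))  # LUT index from the leading mantissa bits
    x0 = RECIP_LUT[k]                         # coarse seed, a few percent accurate
    x1 = x0 * (2.0 - m * x0)                  # Newton-Raphson update for f(x) = 1/x - m
    return math.ldexp(x1, -e)                 # undo the normalization

approx = reciprocal_nr(3.7)
```

Because each Newton-Raphson step squares the relative error of the seed, a 16-entry table with a seed error of about 3% already yields a relative error on the order of 0.1% after one iteration, which illustrates why a single iteration can suffice at moderate word lengths.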
IV-B3 Fixed-Point Arithmetic
In order to achieve low hardware complexity and high throughput, our design uses fixed-point arithmetic. We first globally scale so that the real and imaginary entries are less than . We then quantize each element of to fraction bits, and to integer and fraction bits. The fixed-point performance of our NOPE design is shown in Fig. 1. The solid lines correspond to floating-point performance, the markers to the fixed-point performance of our golden model.
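The quantization step can be modeled in a golden-model style as follows. The word lengths used below are placeholders, not the word lengths of the design, and the convention that `int_bits` excludes the sign bit as well as the use of saturation are assumptions of this sketch.

```python
def quantize(value, int_bits, frac_bits):
    """Round to a signed fixed-point grid with saturation.

    int_bits excludes the sign bit (assumed convention). Python's round()
    resolves exact ties to the nearest even integer.
    """
    scale = 1 << frac_bits
    lo = -(1 << (int_bits + frac_bits))      # most negative representable code
    hi = (1 << (int_bits + frac_bits)) - 1   # most positive representable code
    code = max(lo, min(hi, round(value * scale)))
    return code / scale

def quantize_complex(z, int_bits, frac_bits):
    """Quantize the real and imaginary parts independently."""
    return complex(quantize(z.real, int_bits, frac_bits),
                   quantize(z.imag, int_bits, frac_bits))

# Illustrative channel entry with magnitude below one and 3 fraction bits
h = complex(0.3, -0.71)
h_q = quantize_complex(h, int_bits=0, frac_bits=3)
```

Scaling the channel matrix so that all entries are below one in magnitude means that no integer bits are needed for it, which is why the example uses `int_bits=0`.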
                     | This work | Prabhu [9] | Tang [10] | Peng [11] | Castañeda [23]
System ()            |           |            |           |           |
Algorithm            | NOPE      | MMSE/ZF    | MPD       | MMSE      | SDR
Parameters           | none      | ,          | ,         | ,         | ,
Modulation           | 256-QAM   | 256-QAM    | 256-QAM   | 64-QAM    | QPSK
Preproc. included    | no        | yes        | no        | yes       | no
Preproc. quantities  | col. gains| –          | Gram mat. | –         | Gram mat.
Results              | synthesis | ASIC       | ASIC      | ASIC      | post-layout
Technology [nm]      | 28        | 28         | 40        | 65        | 45
Area [mm²]           | 0.28      | 1.10       | 0.58      | 2.57      | 0.48
Frequency [MHz]      | 800       | 300        | 425       | 680       | 560
Throughput [Gb/s]    | 0.92      | 0.30       | 2.76      | 1.02      | 0.13
Eff. [Gb/s/mm²]      | 3.29      | 0.27       | 13.87     | 4.96      | 1.08
This design also supports precoding; standard technology scaling rules apply; the ZF mode does not require any parameters.
IV-C Implementation Results and Conclusion
Table I shows synthesis results for NOPE in a 28 nm CMOS technology and compares our design to existing data detectors for massive MU-MIMO. We note that the numbers reported in Table I for NOPE are based on synthesis results; an ASIC design is part of ongoing work. While our design is comparable to other designs in terms of hardware efficiency, we emphasize that NOPE is completely parameter-free (other than knowledge of and ), which makes it more resilient to parameter mismatch and dynamic variations of the system compared to all the other methods. In addition, NOPE requires a minimal amount of preprocessing, i.e., and , in contrast to, e.g., the design of [10] that requires computation of the Gram matrix, which often dominates the complexity of massive MU-MIMO data detectors [3]. In summary, NOPE is a robust “fire-and-forget” equalization algorithm for MU-MIMO systems that achieves LMMSE performance at competitive implementation complexity.
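The hardware-efficiency row of Table I can be approximately reproduced from the area, throughput, and technology rows. The scaling rule used below (area shrinks with the square of the feature-size ratio and throughput grows linearly with it) is an assumption about what "standard technology scaling rules" means here; the function name is hypothetical.

```python
def normalized_efficiency(throughput_gbps, area_mm2, tech_nm, target_nm=28.0):
    """Throughput per area in Gb/s/mm^2 after scaling a design to the target node."""
    s = tech_nm / target_nm                    # feature-size ratio
    scaled_throughput = throughput_gbps * s    # assumed: delay scales with 1/s
    scaled_area = area_mm2 / s ** 2            # assumed: area scales with 1/s^2
    return scaled_throughput / scaled_area

# (throughput [Gb/s], area [mm^2], technology [nm]) taken from Table I
designs = {
    "This work": (0.92, 0.28, 28),
    "Prabhu [9]": (0.30, 1.10, 28),
    "Tang [10]": (2.76, 0.58, 40),
    "Peng [11]": (1.02, 2.57, 65),
    "Castaneda [23]": (0.13, 0.48, 45),
}
efficiency = {name: normalized_efficiency(*values) for name, values in designs.items()}
```

Under this assumed rule, the computed values agree with the tabulated efficiencies to within about 0.01 Gb/s/mm² for the first four designs; the value for [23] comes out slightly higher than the tabulated one, presumably because the throughput entry in the table is rounded.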
References
 [1] E. Larsson, O. Edfors, F. Tufvesson, and T. Marzetta, “Massive MIMO for next generation wireless systems,” IEEE Commun. Mag., vol. 52, no. 2, pp. 186–195, Feb. 2014.
 [2] J. Andrews, S. Buzzi, W. Choi, S. Hanly, A. Lozano, A. Soong, and J. Zhang, “What will 5G be?” IEEE J. Sel. Areas Commun., vol. 32, no. 6, pp. 1065–1082, Jun. 2014.
 [3] M. Wu, B. Yin, G. Wang, C. Dick, J. Cavallaro, and C. Studer, “Large-scale MIMO detection for 3GPP LTE: Algorithm and FPGA implementation,” IEEE J. Sel. Topics Signal Process., vol. 8, no. 5, pp. 916–929, Oct. 2014.
 [4] S. Verdú, Multiuser Detection, 1st ed. Cambridge Univ. Press, 1998.
 [5] J. Hoydis, S. ten Brink, and M. Debbah, “Massive MIMO: How many antennas do we need?” in Proc. Allerton Conf. Commun., Contr., Comput., Sept. 2011, pp. 545–550.
 [6] R. Ghods, C. Jeon, G. Mirza, A. Maleki, and C. Studer, “Optimally-tuned nonparametric linear equalization for massive MU-MIMO systems,” in Proc. IEEE Int. Symp. Inf. Theory (ISIT), Jun. 2017, pp. 2118–2122.
 [7] C. Studer, S. Fateh, and D. Seethaler, “ASIC implementation of soft-input soft-output MIMO detection using MMSE parallel interference cancellation,” IEEE J. Solid-State Circuits, vol. 46, no. 7, pp. 1754–1765, Jul. 2011.
 [8] O. Castañeda, S. Jacobsson, G. Durisi, M. Coldrey, T. Goldstein, and C. Studer, “1-bit massive MU-MIMO precoding in VLSI,” IEEE J. Emerg. Sel. Topics Circuits Syst., 2017.
 [9] H. Prabhu, J. N. Rodrigues, L. Liu, and O. Edfors, “3.6 A 60pJ/b 300Mb/s Massive MIMO precoder-detector in 28nm FD-SOI,” in IEEE Int. Solid-State Circuits Conf. (ISSCC), Feb. 2017, pp. 60–61.
 [10] W. Tang, C. H. Chen, and Z. Zhang, “A 0.58 mm² 2.76Gb/s 79.8pJ/b 256-QAM massive MIMO message-passing detector,” in IEEE Symp. VLSI Circuits, Jun. 2016, pp. 1–2.
 [11] G. Peng, L. Liu, S. Zhou, S. Yin, and S. Wei, “A 1.58 Gbps/W 0.40 Gbps/mm² ASIC implementation of MMSE detection for 64-QAM Massive MIMO in 65 nm CMOS,” IEEE Trans. Circuits Syst. I, 2017.
 [12] C. Studer, M. Wenk, and A. Burg, “MIMO transmission with residual transmit-RF impairments,” in Int. ITG Workshop on Smart Antennas (WSA), Feb. 2010, pp. 189–196.
 [13] E. Björnson, J. Hoydis, M. Kountouris, and M. Debbah, “Massive MIMO systems with non-ideal hardware: Energy efficiency, estimation, and capacity limits,” IEEE Trans. Inf. Theory, vol. 60, no. 11, pp. 7112–7139, Apr. 2014.
 [14] D. Donoho, A. Maleki, and A. Montanari, “Message-passing algorithms for compressed sensing,” Proc. Natl. Academy of Sciences (PNAS), vol. 106, no. 45, pp. 18914–18919, Sept. 2009.
 [15] C. M. Stein, “Estimation of the mean of a multivariate normal distribution,” The Annals of Statistics, vol. 9, no. 6, pp. 1135–1151, Nov. 1981.
 [16] C. Jeon, A. Maleki, and C. Studer, “On the performance of mismatched data detection in large MIMO systems,” in Proc. IEEE Int. Symp. Inf. Theory (ISIT), Jul. 2016, pp. 180–184.
 [17] L. Cannon, “A cellular computer to implement the Kalman filter algorithm,” Ph.D. dissertation, Montana State University, USA, 1969.
 [18] U. Madhow and M. L. Honig, “MMSE interference suppression for direct-sequence spread-spectrum CDMA,” IEEE Trans. Commun., vol. 42, no. 12, pp. 3178–3188, Aug. 1994.
 [19] K. R. Kumar, G. Caire, and A. L. Moustakas, “Asymptotic performance of linear receivers in MIMO fading channels,” IEEE Trans. Inf. Theory, vol. 55, no. 10, pp. 4398–4418, Oct. 2009.
 [20] J. Hoydis, S. Ten Brink, and M. Debbah, “Massive MIMO: How many antennas do we need?” in Allerton Conf. on Commun., Contr, and Comput. IEEE, 2011, pp. 545–550.
 [21] D. Tse and S. Hanly, “Linear multiuser receivers: effective interference, effective bandwidth and user capacity,” IEEE Trans. Inf. Theory, vol. 45, no. 2, pp. 641–657, Mar. 1999.
 [22] C. D. Perels, “Frame-based MIMO-OFDM systems: impairment estimation and compensation,” Ph.D. dissertation, ETH Zurich, Switzerland, 2007.
 [23] O. Castañeda, T. Goldstein, and C. Studer, “Data detection in large multiantenna wireless systems via approximate semidefinite relaxation,” IEEE Trans. Circuits Syst. I, vol. 63, no. 12, pp. 2334–2346, Dec. 2016.