Multi-resolution Bayesian CMB component separation through Wiener filtering with a pseudo-inverse preconditioner
Key Words.: methods: numerical – methods: statistical – cosmic microwave background

We present a Bayesian model for multi-resolution CMB component separation based on Wiener filtering and/or computation of constrained realizations, extending a previously developed framework. We also develop an efficient solver for the associated linear system for the signal amplitudes. The core of this new solver is an efficient preconditioner based on the pseudo-inverse of the coefficient matrix of the linear system. In the full sky coverage case, the method gives a speedup of 2–3x in compute time compared to a simple diagonal preconditioner, and it is easier to implement in terms of practical computer code. In the case where a mask is applied and a prior-driven constrained realization is sought within the mask, this is the first time full convergence has been achieved at the full resolution of the Planck dataset. Prototype benchmark code is available at https://github.com/dagss/cmbcr.
1 Introduction
The microwave sky has been observed at multiple frequencies by many experiments: ground-based, sub-orbital, and satellite. Among the latter are NASA's recent WMAP mission (Bennett et al., 2013) and ESA's Planck mission (Planck Collaboration I, 2016). The primary goal of all these experiments is to produce the cleanest possible image of the true cosmological CMB sky, and to use this image to constrain the physics, contents, and evolution of the universe (e.g., Planck Collaboration XIII, 2016). A critical step in this process, however, is to remove the impact of obscuring foreground emission from our own Galaxy, the Milky Way (Planck Collaboration X, 2015). This step is often called CMB component separation, and informally refers to the process of combining observations taken at different frequencies into a single estimate of the true sky. As the sensitivity of new generations of CMB experiments continues to improve, the importance of CMB component separation is steadily increasing (e.g., BICEP2 Collaboration, 2014), and it is today a central research field in CMB cosmology, as the community has turned its attention to detecting sub-microkelvin signatures of primordial gravitational waves.
One popular class of methods for full-sky CMB estimation and component separation is optimal Bayesian analysis, in particular as implemented through the Gibbs sampling and Wiener-filtering algorithms (Jewell et al., 2004; Wandelt et al., 2004; Eriksen et al., 2008). This approach, as implemented in the Commander computer code (Eriksen et al., 2004), has been established by the Planck collaboration as a standard method for astrophysical component separation and CMB extraction (Planck Collaboration X, 2015). However, the Commander implementation used in the Planck 2015 (and earlier) analyses suffers from one important limitation: it requires all frequency channels to have the same effective instrumental beam, or "point-spread function". In general, though, the angular resolution of a given sky map is typically inversely proportional to the wavelength, and the solution adopted by the Planck 2015 Commander analysis was simply to smooth all frequency maps to a common (lowest) angular resolution. This, however, implies an effective loss of both sensitivity and angular resolution, and is as such highly undesirable. Establishing an effective algorithm that avoids this problem is the main goal of the current paper.
First, we present a native multi-resolution version of the Bayesian CMB component separation model. This is a straightforward extension of the single-resolution model presented in Eriksen et al. (2008). Second, we construct a novel solver for the resulting linear system, based on the pseudo-inverse of the corresponding coefficient matrix. We find that this solver outperforms existing solvers in all situations we have applied it to. Additionally, the full-sky version of the preconditioner is easier to implement in terms of computer code than the simple diagonal preconditioner that is most commonly used in the literature (Eriksen et al., 2004). The big advantage, however, is most clearly seen in the presence of partial sky coverage, where the speedups reach factors of several hundred. For full sky coverage, where a diagonal preconditioner performs reasonably, the improvement is a more modest 2–3x.
The previous Commander implementation (Eriksen et al., 2004) employs the Conjugate Gradient (CG) method to solve the single-component CR system, using a preconditioner based on computing matrix elements in spherical harmonic domain. While this approach worked well enough for WMAP, the convergence rate scales with the signal-to-noise ratio of the experiment, and at the sensitivity of Planck it becomes impractical in the case of partial sky coverage, requiring several thousand iterations for convergence, if it converges at all. More sophisticated solvers are described by Smith et al. (2007) and Elsner & Wandelt (2013). We argue in Seljebotn et al. (2014) that these appear to suffer from the same fundamental problem: they quickly converge outside of the mask, but spend a long time finding the solution inside the mask.
To our knowledge, the multi-level, pixel-domain solver described by Seljebotn et al. (2014) represents the state of the art prior to this paper. However, that solver has some weaknesses which became apparent in our attempt to generalize it for the purpose of component separation. First, because it works only locally in pixel domain, it is ineffective at deconvolving the signal at high \ell, where the spherical harmonic transfer function of the instrumental beam falls below b_{\ell}=0.2 or so. In the case of Seljebotn et al. (2014) this was not a problem, because the solution is dominated by a \LambdaCDM-type prior on the CMB before reaching these scales. For the purposes of component separation, however, one wants to apply no prior, or rather weak priors, and so the solver of Seljebotn et al. (2014) fails to converge. The second problem is that it is challenging to work in pixel domain with multi-resolution data where the beam sizes vary by a factor of 10, as is the case for Planck. Thirdly, extensive tuning was required on each level to avoid ringing problems. Finally, the algorithm requires extensive memory-consuming precomputations.
In contrast, the solver developed in this paper 1) offers very cheap precomputations; 2) does not depend on applying a prior; 3) is robust with respect to the choice of statistical priors on the component amplitudes; and 4) has much less need for tuning. The solver combines a number of techniques, and fundamentally consists of two parts:
– A full-sky preconditioner based on pseudo-inverses, developed in Sect. 3.

– A mask-restricted solver, which should be applied in addition in cases of partial sky coverage. This solver is developed in Sect. 4. By solving the system under the mask as a separate substep, we converge in about 20 CG iterations, rather than thousands of iterations.
These two techniques are somewhat independent, and it would be possible to use the mask-restricted solver together with the older diagonal preconditioner. Still, we present them together in this paper, in order to establish what we believe is the new state of the art for solving the CMB component separation and/or constrained realization problem.
2 Bayesian multiresolution CMB component separation
2.1 Spherical harmonic transforms in linear algebra
We will assume that the reader is familiar with spherical harmonic transforms (SHTs), and simply note that they are the spherical analogue of Fourier transforms (for further details see, e.g., Reinecke & Seljebotn, 2013). The present work requires us to be very specific about the properties of these transforms as part of linear systems. We will write \mathbf{Y} for the transform from harmonic domain to pixel domain (spherical harmonic synthesis), and \mathbf{Y}^{T}\mathbf{W} for the opposite transform (spherical harmonic analysis). The \mathbf{W} matrix is diagonal, containing the per-pixel quadrature weights employed in the analysis integral for a particular grid.
Of particular importance for the present work is that there exists no perfect grid on the sphere; all spherical grids have slightly varying distances between grid points. This means that some parts of the sphere will see smaller scales than other parts, and that, ultimately, there is no discrete version of the spherical harmonic transform analogous to the discrete Fourier transform, which maps \mathbb{R}^{n} to \mathbb{R}^{n}. Specifically, \mathbf{Y} is rectangular and thus not invertible. Usually \mathbf{Y} is configured such that the number of rows (pixels) is greater than the number of columns (spherical harmonic coefficients), in which case a band-limited harmonic signal is unaltered by a round trip to pixel domain, so that \mathbf{Y}^{T}\mathbf{W}\mathbf{Y}=\mathbf{I}. In that case, the converse operator, \mathbf{Y}\mathbf{Y}^{T}\mathbf{W}, is singular, but in a very specific way: it takes a pixel map and removes any scales from it that are above the band-limit L.
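These identities can be verified numerically in a minimal toy setting. The sketch below is our own construction (not code from this paper): it restricts to the zonal (m=0) harmonics, where spherical harmonic synthesis reduces to evaluating normalized Legendre polynomials in \cos\theta on a Gauss-Legendre grid, so \mathbf{Y} and \mathbf{W} become small explicit matrices.

```python
import numpy as np

# Zonal (m = 0) toy analogue of Y and Y^T W: the spherical harmonics reduce
# to normalized Legendre polynomials, and Gauss-Legendre quadrature supplies
# exact weights W. L and the grid size are arbitrary toy values.
L = 8                               # band-limit
npix = 2 * L                        # more pixels (rows) than coefficients
x, w = np.polynomial.legendre.leggauss(npix)   # nodes in cos(theta), weights

# Synthesis matrix Y: rows = pixels, columns = multipoles l = 0..L.
Y = np.column_stack([
    np.sqrt((2 * l + 1) / 2.0) * np.polynomial.legendre.Legendre.basis(l)(x)
    for l in range(L + 1)
])
W = np.diag(w)

# Analysis after synthesis is exact for band-limited input: Y^T W Y = I.
assert np.allclose(Y.T @ W @ Y, np.eye(L + 1))

# The converse operator Y Y^T W is singular: an idempotent projection that
# removes all scales above the band-limit L.
P = Y @ Y.T @ W
assert np.allclose(P @ P, P)
rank = np.linalg.matrix_rank(P, tol=1e-8)
assert rank == L + 1 and rank < npix
```

On this grid the round trip is exact to machine precision; for grids with inexact quadrature (such as HEALPix) the same identities hold only approximately.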
2.2 The component separation model
Eriksen et al. (2008) describe a Bayesian model for CMB component separation under the assumption that all observed sky maps have the same instrumental beam and pixel resolution. For full resolution analysis of Planck data this is an unrealistic requirement, as the full width at half maximum (FWHM) of the beams (point spread functions) spans a large range, from 4.4 to 32 arcminutes, and so one loses much information by downgrading all data to a common resolution. In this paper we generalize the model to handle sky maps observed with different beams and at different resolutions.
We will restrict our attention to the CMB component and diffuse foregrounds. Eriksen et al. (2008) additionally incorporate template components in the model for linear component separation. These are particularly useful for dealing with point sources, where beam asymmetry is much more pronounced than for the diffuse foregrounds. Recent versions of Commander sample template amplitudes as an additional Gibbs step, rather than as part of the linear system for component separation, so as to more easily include a positivity constraint on such amplitudes. We therefore ignore templates in this paper.
The microwave sky is observed as N_{\text{obs}} different sky maps \mathbf{d}_{\nu} with different instrumental characteristics, and we wish to separate these into N_{\text{comp}} distinct diffuse foreground components \mathbf{s}_{k}. The key to achieving this is to specify the spectral energy distribution (SED) of each component; see figures 1 and 2. Eriksen et al. (2008) describe the Gibbs sampling steps employed to fit the SEDs to data. For the purposes of this paper we simply assume that the SED information is provided as input, in the form of mixing maps. Our basic assumption, ignoring any instrumental effects, is that the true microwave emission in a direction \hat{n} on the sky, integrated over a given frequency bandpass labelled \nu, is given by
f_{\nu}(\hat{n})=\sum_{k}q_{\nu,k}(\hat{n})s_{k}(\hat{n}),  (1) 
where s_{k}(\hat{n}) represents the underlying component amplitude, and q_{\nu,k}(\hat{n}) represents an assumed mixing map. We will model the CMB component simply as another diffuse component, but note that the mixing maps are in that case spatially constant.
We do not observe f_{\nu}(\hat{n}) directly, but rather take as input a pixelized sky map \mathbf{d}_{\nu}, in which f_{\nu} has been convolved with an instrumental beam and then further contaminated by instrumental noise. To simplify notation we employ stacked vectors, and write
\mathbf{d}\equiv\left[\begin{array}{c}\mathbf{d}_{\text{30GHz}}\\ \mathbf{d}_{\text{70GHz}}\\ \vdots\end{array}\right],\quad\mathbf{s}\equiv\left[\begin{array}{c}\mathbf{s}_{\text{cmb}}\\ \mathbf{s}_{\text{dust}}\\ \vdots\end{array}\right].
Our data observation model can then be written
\mathbf{d}=\mathbf{P}\mathbf{s}+\mathbf{n},  (2) 
where \mathbf{P} is an N_{\text{obs}}\times N_{\text{comp}} block matrix in which each block (\nu,k) projects component \mathbf{s}_{k} to the observed sky map \mathbf{d}_{\nu}, and \mathbf{n} represents instrumental noise, partitioned in the same way as \mathbf{d}. The noise \mathbf{n}_{\nu} is a pixelized map with the same resolution as \mathbf{d}_{\nu}, assumed to be Gaussian distributed with zero mean. For our experiments we also assume that the noise is independent between pixels, so that \text{Var}(\mathbf{n}_{\nu})\equiv\mathbf{N}_{\nu} is a diagonal matrix, although this is not a fundamental requirement of the method. We do, however, require that the matrix \mathbf{N}_{\nu}^{-1} can somehow be efficiently applied to a vector.
Each component \mathbf{s}_{k} represents the underlying, unconvolved field. In our implementation we work with \mathbf{s}_{k} defined by the spherical harmonic expansion of s_{k}(\hat{n}), truncated at some band-limit L_{k}, i.e., we assume s_{\ell,m}=0 for \ell>L_{k}. The choice of L_{k} is essentially part of the statistical model, and is typically chosen to match a resolution that the observed sky maps will support. Additionally, each component \mathbf{s}_{k} may have an associated Gaussian prior p(\mathbf{s}), specified through its covariance matrix \mathbf{S}. The role of the prior is to introduce an assumption on the smoothness of \mathbf{s}. The prior typically does not come into play where the signal of a component is strong, but in regions that lack the component it serves to stabilize the solution, such that less noise is interpreted as signal. Computationally it is easiest to assume the same smoothness prior everywhere on the sky, in which case the covariance matrix \text{Var}(\mathbf{s})=\mathbf{S} is diagonal in spherical harmonic domain, with elements given by the spherical harmonic power spectrum C_{k,\ell}. However, this is not a necessary assumption, and we comment on a different type of prior in Sect. 6.
The CMB power spectrum prior, given by C_{\text{cmb},\ell}, is particularly crucial. For the purposes of full sky component separation one would typically not specify any prior, so as not to bias the CMB. For the purposes of estimating foregrounds, however, or filling in a CMB realization within a mask, one may insert a fiducial power spectrum predicted by some cosmological model.
We now return to the projection operator \mathbf{P}, which projects each component to the sky. This may be written in the form of a block matrix with each column representing a component k and each row representing a sky observation \nu, e.g.,
\mathbf{P}=\left[\begin{array}{ccc}\mathbf{P}_{\text{30GHz,cmb}}&\mathbf{P}_{\text{30GHz,dust}}&\dots\\ \mathbf{P}_{\text{70GHz,cmb}}&\mathbf{P}_{\text{70GHz,dust}}&\dots\\ \vdots&\vdots&\ddots\end{array}\right],  (3)
with each block taking the form
\mathbf{P}_{\nu,k}=\mathbf{Y}_{\nu}\mathbf{B}_{\nu}\widetilde{\mathbf{Q}}_{\nu,k}.  (4)
The operator \mathbf{Y}_{\nu} denotes spherical harmonic synthesis to the pixel grid employed by \mathbf{d}_{\nu}. We assume in this paper an azimuthally symmetric instrumental beam for each sky map, in which case the beam convolution operator \mathbf{B}_{\nu} is diagonal in spherical harmonic domain with elements b_{\nu,\ell}. This transfer function decays to zero as \ell grows at a rate that fits the bandlimit of the grid. Finally \widetilde{\mathbf{Q}}_{\nu,k} is an operator that denotes pointwise multiplication of the input with the mixing map q_{\nu,k}(\hat{n}); computationally this should be done in pixel domain, so that
\widetilde{\mathbf{Q}}_{\nu,k}=\mathbf{Y}_{\nu,k}^{T}\mathbf{W}_{\nu,k}\mathbf{Q}_{\nu,k}\mathbf{Y}_{\nu,k},  (5)
where \mathbf{Q}_{\nu,k} contains q_{\nu,k}(\hat{n}) on its diagonal. The subscripts on the SHT operators indicate that these are defined on a grid specific to this mixing operation. Note that \widetilde{\mathbf{Q}}_{\nu,k} will create new small-scale modes, implying that, technically speaking, the band-limit of \widetilde{\mathbf{Q}}_{\nu,k}\mathbf{s}_{k} is 2L_{k} rather than L_{k} for a full-resolution mixing map q_{\nu,k}. For typical practical system solving, however, the mixing maps are usually smoother than the corresponding amplitude maps (due to lower effective signal-to-noise ratios), and the model may incorporate an approximation in which \widetilde{\mathbf{Q}}_{\nu,k} truncates the operator output at some lower band-limit. At any rate, the grid used for \mathbf{Q}_{\nu,k} should accurately represent q_{\nu,k} up to this band-limit. For this purpose it is numerically more accurate to use a Gauss-Legendre grid, rather than the HEALPix grid (http://healpix.jpl.nasa.gov; Górski et al., 2005) that is usually used for \mathbf{d}_{\nu}. The solver of the present paper simply treats this as a modelling detail, and any reasonable implementation works fine as long as \widetilde{\mathbf{Q}}_{\nu,k} is not singular. To achieve reasonable efficiency, however, we do require q_{\nu,k}(\hat{n}) to be relatively flat, such that approximating \widetilde{\mathbf{Q}}_{\nu,k} with a simple scalar q_{\nu,k} is a meaningful zero-order representation. In practice we typically find ratios between the maximum and minimum values of q_{\nu,k}(\hat{n}) of 1.5–3. In general, the higher the contrast, the slower the solver converges.
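The action of one projection block \mathbf{P}_{\nu,k}=\mathbf{Y}_{\nu}\mathbf{B}_{\nu}\widetilde{\mathbf{Q}}_{\nu,k} can be sketched in the same zonal (m=0) toy model as before; this is our own illustrative construction, and the band-limit, beam, and mixing map below are made-up values, not Planck characteristics.

```python
import numpy as np

# Zonal toy: SHTs become small normalized-Legendre matrices on
# Gauss-Legendre grids. All numbers below are illustrative only.
def legendre_synthesis(L, npix):
    """Return the synthesis matrix Y, weights W, and nodes of a GL grid."""
    x, w = np.polynomial.legendre.leggauss(npix)
    Y = np.column_stack([
        np.sqrt((2 * l + 1) / 2.0) * np.polynomial.legendre.Legendre.basis(l)(x)
        for l in range(L + 1)
    ])
    return Y, np.diag(w), x

L_k = 6                                                  # component band-limit
Y_mix, W_mix, x_mix = legendre_synthesis(L_k, 2 * L_k)   # grid for the mixing
Y_obs, _, _ = legendre_synthesis(L_k, 2 * L_k)           # observation grid
b_l = np.exp(-0.05 * np.arange(L_k + 1) ** 2)            # toy beam transfer

def apply_P_block(s_lm, q_pix):
    # Qtilde: synthesize, multiply by the mixing map in pixel domain,
    # analyze back (truncating the product at the band-limit L_k).
    mixed_lm = Y_mix.T @ W_mix @ (q_pix * (Y_mix @ s_lm))
    # Beam convolution is diagonal in harmonic domain; Y_obs projects to sky.
    return Y_obs @ (b_l * mixed_lm)

s = np.random.default_rng(0).standard_normal(L_k + 1)
q = 1.0 + 0.3 * x_mix            # mixing map with max/min contrast ~1.9

# Sanity check: a constant mixing map reduces Qtilde to q * I exactly.
assert np.allclose(apply_P_block(s, np.full(2 * L_k, 2.0)),
                   2.0 * (Y_obs @ (b_l * s)))
out = apply_P_block(s, q)
assert out.shape == (2 * L_k,)
```

The constant-map check mirrors the scalar approximation \widetilde{\mathbf{Q}}_{\nu,k}\approx q_{\nu,k}\mathbf{I} used later for the preconditioner.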
2.3 The constrained realization linear system
We have specified a model for the components \mathbf{s} in which the likelihood p(\mathbf{d}\,|\,\mathbf{s}) and component priors p(\mathbf{s}) are Gaussian, and so the Bayesian posterior is also Gaussian (Jewell et al., 2004; Wandelt et al., 2004; Eriksen et al., 2004, 2008), and given by
p(\mathbf{s}\,|\,\mathbf{d},\mathbf{S})\propto e^{-\frac{1}{2}\mathbf{s}^{T}(\mathbf{S}^{-1}+\mathbf{P}^{T}\mathbf{N}^{-1}\mathbf{P})\mathbf{s}+\mathbf{s}^{T}\mathbf{P}^{T}\mathbf{N}^{-1}\mathbf{d}}.  (6)
To explore this density we are typically interested in either i) the mean vector, or ii) drawing samples from the density. Both can be computed by solving a linear system \mathbf{A}\mathbf{x}=\mathbf{b} with
\mathbf{A}\equiv\mathbf{S}^{-1}+\mathbf{P}^{T}\mathbf{N}^{-1}\mathbf{P}  (7)
and
\mathbf{b}\equiv\mathbf{P}^{T}\mathbf{N}^{-1}\mathbf{d}+\mathbf{P}^{T}\mathbf{N}^{-1/2}\omega_{1}+\mathbf{S}^{-1/2}\omega_{2}.  (8)
The vectors \omega_{1} and \omega_{2} should either be

i) zero, in which case the solution \mathbf{x} will be the mean \text{E}(\mathbf{s}\,|\,\mathbf{d},\mathbf{S}), also known as the Wiener-filtered map; or

ii) vectors of variates drawn from the standard Gaussian distribution, in which case the solution \mathbf{x} will be a sample drawn from p(\mathbf{s}\,|\,\mathbf{d},\mathbf{S}). We refer to such samples as constrained realizations.
For convenience we refer to the system as the constrained realization (CR) system in both cases. The computation of the right-hand side \mathbf{b} is straightforward and not discussed further in this paper; our concern is the efficient solution of the linear system \mathbf{A}\mathbf{x}=\mathbf{b}.
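The roles of \omega_{1} and \omega_{2} in equations (7)-(8) are easiest to see in a toy model where every operator is diagonal per harmonic mode (single band, no mixing), so that the CR system solves exactly. The sketch below is our own illustration with made-up spectra, not code from this paper.

```python
import numpy as np

# Fully diagonal toy CR system: A and b reduce to scalars per mode.
rng = np.random.default_rng(4)
L = 16
C_l = 1.0 / (1.0 + np.arange(L + 1)) ** 2      # toy prior power spectrum
b_l = np.exp(-0.01 * np.arange(L + 1) ** 2)    # toy beam transfer function
sigma2 = 0.1                                   # white-noise variance per mode
d = rng.standard_normal(L + 1)                 # toy "data" modes

A = 1.0 / C_l + b_l ** 2 / sigma2              # equation (7), mode by mode

def rhs(omega1, omega2):
    # Equation (8): data term plus the two fluctuation terms.
    return (b_l * d / sigma2
            + b_l * omega1 / np.sqrt(sigma2)
            + omega2 / np.sqrt(C_l))

# omega = 0 gives the Wiener-filtered map, here in closed form ...
x_mean = rhs(0.0, 0.0) / A
assert np.allclose(x_mean, C_l * b_l * d / (sigma2 + C_l * b_l ** 2))

# ... while standard Gaussian omegas give a constrained realization.
x_sample = rhs(rng.standard_normal(L + 1), rng.standard_normal(L + 1)) / A
assert x_sample.shape == x_mean.shape
```

In the full multi-resolution model the per-mode division by A is no longer available, which is exactly why the iterative solver below is needed.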
For current and future CMB experiments, the matrix \mathbf{A} is far too large for the application of dense linear algebra. We are, however, able to efficiently apply the matrix to a vector, by applying the operators one after the other, and so we can use iterative solvers for linear systems. Since \mathbf{A} is symmetric and positive definite, the recommended iterative solver is the Conjugate Gradient (CG) method (see Shewchuk, 1994, for a tutorial). One starts out with some arbitrary guess, say, \mathbf{x}_{1}=\mathbf{0}. Then, for each iteration, a residual \mathbf{r}_{i}=\mathbf{b}-\mathbf{A}\mathbf{x}_{i} is computed, and this residual is used to produce an updated iterate \mathbf{x}_{i+1} that lies closer to the true solution \mathbf{x}_{\text{true}}.
Note that the residual \mathbf{r}_{i}, which is readily available, is used as a proxy for the error, \mathbf{e}_{i}=\mathbf{x}_{\text{true}}-\mathbf{x}_{i}, which is unavailable as we do not know \mathbf{x}_{\text{true}}. The key is that since \mathbf{A}\mathbf{x}_{\text{true}}=\mathbf{b},
\mathbf{r}_{i}=\mathbf{b}-\mathbf{A}\mathbf{x}_{i}=\mathbf{A}(\mathbf{x}_{\text{true}}-\mathbf{x}_{i})=\mathbf{A}\mathbf{e}_{i},
and since \mathbf{A} is linear, reducing the magnitude of \mathbf{r}_{i} will also lead to a reduction in the error \mathbf{e}_{i}. In a production run the error \mathbf{e}_{i} is naturally unavailable, but during development and debugging it is highly recommended to track it. This can be done by generating some \mathbf{x}_{\text{true}}, computing \mathbf{b}=\mathbf{A}\mathbf{x}_{\text{true}}, and tracking the error \mathbf{e}_{i} while running the solver on this input. The benchmark results presented in this paper are generated in this manner.
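This debugging procedure can be sketched as follows, with plain (unpreconditioned) CG and an arbitrary small SPD stand-in for \mathbf{A}; the matrix and tolerances are toy choices of ours.

```python
import numpy as np

# Generate a known x_true, form b = A x_true, run CG, and track both the
# residual r_i and the true error e_i (the latter only possible in tests).
rng = np.random.default_rng(1)
G = rng.standard_normal((40, 40))
A = G @ G.T + 40 * np.eye(40)          # symmetric positive definite stand-in
x_true = rng.standard_normal(40)
b = A @ x_true                         # right-hand side with known solution

x = np.zeros(40)
r = b - A @ x                          # residual: a proxy for the error
p = r.copy()
errs = []                              # true error ||x_true - x_i||
for i in range(100):
    Ap = A @ p
    step = (r @ r) / (p @ Ap)
    x = x + step * p
    r_new = r - step * Ap
    errs.append(np.linalg.norm(x_true - x))
    if np.linalg.norm(r_new) < 1e-10 * np.linalg.norm(b):
        break
    p = r_new + ((r_new @ r_new) / (r @ r)) * p
    r = r_new

assert errs[-1] < 1e-6 * errs[0]       # error tracks the shrinking residual
```

In production only the residual-based stopping criterion survives; the `errs` bookkeeping is what this section recommends adding during development.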
The number of iterations required by CG depends on how uniform (or clustered) the eigenspectrum of the matrix is. Therefore the main ingredient in a linear solver is a good preconditioner which improves the eigenspectrum. A good preconditioner is a symmetric, positive definite matrix \mathbf{M} such that \mathbf{M}\mathbf{x} can be quickly computed, and where the eigenspectrum of \mathbf{M}\mathbf{A} is as flat as possible, or at least clustered around a few values. Intuitively, \mathbf{M} should in some sense approximate \mathbf{A}^{1}.
3 A full sky preconditioner based on pseudoinverses
3.1 On pseudoinverses
The inspiration for the solver presented below derives from the literature on the numerical solution of saddle-point systems. Our system matrix \mathbf{A}, in the form given in equation (7), is an instance of a so-called Schur complement, and approximating the inverse of such Schur complements plays a part in most solvers for saddle-point systems. An excellent review of the solution of saddle-point systems is given in Benzi et al. (2005). The technique we will use was originally employed by Elman (1999) and Elman et al. (2006), who use appropriately scaled pseudo-inverses to develop solvers for Navier-Stokes partial differential equations.
The pseudo-inverse is a generalization of the matrix inverse to rectangular and/or singular matrices. In our case, we will deal with an m\times n matrix \mathbf{U} with linearly independent columns and m>n, in which case the pseudo-inverse is given by
\mathbf{U}^{+}\equiv(\mathbf{U}^{T}\mathbf{U})^{-1}\mathbf{U}^{T}.
Note that in this case, \mathbf{U}^{+} is simply the matrix that finds the least-squares solution to the linear system \mathbf{U}\mathbf{x}=\mathbf{b}. This matrix has the property that
\mathbf{U}^{+}\mathbf{U}=(\mathbf{U}^{T}\mathbf{U})^{-1}\mathbf{U}^{T}\mathbf{U}=\mathbf{I},
and so it is a left inverse. Unlike a true inverse, however, \mathbf{U}\mathbf{U}^{+}\neq\mathbf{I}. This follows from \mathbf{U}^{+} having the same shape as \mathbf{U}^{T}, so that \mathbf{U}\mathbf{U}^{+} is necessarily a singular matrix. For small matrices, \mathbf{U}^{+} can be computed using the QR decomposition. Note that in the case that \mathbf{U} does not have full rank, the pseudo-inverse has a different definition and should be computed using the Singular Value Decomposition (SVD); the definition above is, however, sufficient for our purposes.
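The QR route for a small full-column-rank matrix can be sketched as follows (an illustrative numpy snippet of ours; the matrix size is arbitrary). With \mathbf{U}=\mathbf{Q}\mathbf{R}, one has \mathbf{U}^{+}=\mathbf{R}^{-1}\mathbf{Q}^{T}, avoiding the explicit formation of \mathbf{U}^{T}\mathbf{U}.

```python
import numpy as np

# Pseudo-inverse of a tall matrix with linearly independent columns.
rng = np.random.default_rng(2)
U = rng.standard_normal((7, 3))        # m > n, full column rank (generic)

Q, R = np.linalg.qr(U)                 # reduced QR: Q is 7x3, R is 3x3
U_pinv = np.linalg.solve(R, Q.T)       # U^+ = R^{-1} Q^T

assert np.allclose(U_pinv, np.linalg.pinv(U))
assert np.allclose(U_pinv @ U, np.eye(3))        # left inverse
assert not np.allclose(U @ U_pinv, np.eye(7))    # but not a right inverse
```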
3.2 Fullsky, single component, without prior
For the first building block of our preconditioner we start out with a much simpler linear problem. Assuming a model with a single sky map \mathbf{d}, a single component, no mixing maps, and no prior, the linear system of equation (7) simply becomes
\mathbf{B}\mathbf{Y}^{T}\mathbf{N}^{-1}\mathbf{Y}\mathbf{B}\mathbf{x}=\mathbf{b}.  (9)
Note that \mathbf{Y}^{T} is not the same as spherical harmonic analysis, which in our notation reads \mathbf{Y}^{T}\mathbf{W}. Without the weight matrix \mathbf{W}, \mathbf{Y}^{T} represents adjoint spherical harmonic synthesis, and simply falls out algebraically from the transposed sky projection \mathbf{P}^{T}.
Since \mathbf{x} is represented in spherical harmonic domain, the beam convolution operator \mathbf{B} is diagonal and trivially inverted. The remaining factor \mathbf{N}^{-1} is diagonal in pixel domain, so if only \mathbf{Y} had been invertible we could have solved the system directly. Inspired by Elman (1999), we simply pretend that \mathbf{Y} is square, and let \mathbf{Y}^{T}\mathbf{W} play the role of the inverse, so that
(\mathbf{B}\mathbf{Y}^{T}\mathbf{N}^{-1}\mathbf{Y}\mathbf{B})^{-1}\approx\mathbf{B}^{-1}\mathbf{Y}^{T}\mathbf{W}\mathbf{N}\mathbf{W}\mathbf{Y}\mathbf{B}^{-1}.  (10)
We stress that because \mathbf{Y} is not exactly invertible and \mathbf{Y}^{T}\mathbf{W} is only a pseudo-inverse, this is an approximation which should be used as a preconditioner inside an iterative solver. For the HEALPix grid in particular, \mathbf{Y}^{T}\mathbf{W} is rather inaccurate and we only have \mathbf{Y}^{T}\mathbf{W}\mathbf{Y}\approx\mathbf{I}; equation (10) is nevertheless a very good approximation, as can be seen in Fig. 3.
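The quality of this approximation can be probed numerically in the zonal (m=0) toy model of our own construction: with made-up beam and noise values, the eigenspectrum of the preconditioned matrix should be far flatter than that of \mathbf{A} itself.

```python
import numpy as np

# Compare the conditioning of A (eq. 9) and of M A, with M from eq. (10),
# in a zonal toy model. Beam width and noise contrast are toy values.
L = 8
npix = 2 * L
x, w = np.polynomial.legendre.leggauss(npix)
Y = np.column_stack([
    np.sqrt((2 * l + 1) / 2.0) * np.polynomial.legendre.Legendre.basis(l)(x)
    for l in range(L + 1)
])
W = np.diag(w)
B = np.diag(np.exp(-0.02 * np.arange(L + 1) ** 2))   # toy beam transfer
Ninv = np.diag(1.0 / (1.0 + 0.5 * x))                # varying inverse noise
N = np.linalg.inv(Ninv)

A = B @ Y.T @ Ninv @ Y @ B                           # left-hand side of (9)
Binv = np.linalg.inv(B)
M = Binv @ Y.T @ W @ N @ W @ Y @ Binv                # right-hand side of (10)

eig_A = np.linalg.eigvalsh(A)
# M A is similar to an SPD matrix, so its eigenvalues are real and positive.
eig_MA = np.linalg.eigvals(M @ A).real
cond_A = eig_A.max() / eig_A.min()
cond_MA = eig_MA.max() / eig_MA.min()
assert np.all(eig_MA > 0)
assert cond_MA < cond_A            # the preconditioner flattens the spectrum
```

On this exact-quadrature grid the eigenvalues of \mathbf{M}\mathbf{A} cluster near unity; on HEALPix-like grids the clustering is looser but, per the text, still very good.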
3.3 Fullsky, multiple components, with priors
Next, we want to repeat the above trick for a case with multiple sky maps \mathbf{d}_{\nu}, multiple components \mathbf{x}_{k}, and a prior. We start by assuming that the mixing operators \widetilde{\mathbf{Q}}_{\nu,k} can be reasonably approximated by a constant,
\widetilde{\mathbf{Q}}_{\nu,k}\approx q_{\nu,k}\mathbf{I}, 
although, as noted above, this is only a matter of computational efficiency of the preconditioner. The optimal value for q_{\nu,k} is given in Appendix A.2. In the following derivation we will simply assume equality in this statement, keeping in mind that all the manipulations apply only to the preconditioner part of the CG search, and so do not affect the computed solution.
First note that the matrix of equation (7) can be written
\mathbf{A}=\left[\begin{array}{cc}\mathbf{P}^{T}&\mathbf{I}\end{array}\right]\left[\begin{array}{cc}\mathbf{N}^{-1}&\\ &\mathbf{S}^{-1}\end{array}\right]\left[\begin{array}{c}\mathbf{P}\\ \mathbf{I}\end{array}\right].
Writing the system in this way makes it evident that in the Bayesian framework, the priors are treated just like another “observation” of the components. These “prior observations” play a bigger role for parts of the solution where the instrumental noise is high, and a smaller role where the instrumental noise is low (see Fig. 4).
The idea is now to further rearrange and rescale the system in such a way that all information that is not spatially variant is expressed in the projection matrices on the sides, leaving a unit-free matrix containing only spatial variations in the middle. Since \mathbf{S}^{-1} is assumed to be diagonal, it is trivial to factor it out and leave only the identity in the center matrix. Turning to the inverse noise term \mathbf{N}^{-1}, we find experimentally that rescaling with a scalar performs well. Specifically, we define
\widetilde{\mathbf{N}}_{\nu}^{-1}\equiv\alpha_{\nu}^{-2}\mathbf{Y}_{\nu}^{T}\mathbf{N}^{-1}_{\nu}\mathbf{Y}_{\nu},
where \alpha_{\nu} takes the value that minimizes \|\widetilde{\mathbf{N}}_{\nu}^{-1}(\alpha_{\nu})-\mathbf{I}\|_{2}. Computing \alpha_{\nu} is cheap, and its optimal value is derived in Appendix A.1. A similar idea is used by Elsner & Wandelt (2013), who split \mathbf{N}^{-1} into a spherical harmonic term and a pixel term, but their split is additive rather than multiplicative. Our system can now be written
\mathbf{A}=\mathbf{U}^{T}\mathbf{T}\mathbf{U},  (11) 
where \mathbf{T} contains the unit-free spatial structure of the system, and \mathbf{U} the spatially invariant, but \ell-dependent, structure of the system:
\mathbf{T}\equiv\left[\begin{array}{cc}\widetilde{\mathbf{N}}^{-1}&\\ &\mathbf{I}\end{array}\right],\quad\mathbf{U}\equiv\left[\begin{array}{c}\alpha\mathbf{B}\widetilde{\mathbf{Q}}\\ \mathbf{S}^{-1/2}\end{array}\right].
In order to elucidate the block structure of these matrices we write out an example with 3 bands and 2 components:
\small\mathbf{T}=\left[\begin{array}{ccccc}\widetilde{\mathbf{N}}^{-1}_{1}&&&&\\ &\widetilde{\mathbf{N}}^{-1}_{2}&&&\\ &&\widetilde{\mathbf{N}}^{-1}_{3}&&\\ &&&\mathbf{I}&\\ &&&&\mathbf{I}\end{array}\right],\quad\mathbf{U}=\left[\begin{array}{cc}\alpha_{1}\mathbf{B}_{1}\widetilde{\mathbf{Q}}_{1,1}&\alpha_{1}\mathbf{B}_{1}\widetilde{\mathbf{Q}}_{1,2}\\ \alpha_{2}\mathbf{B}_{2}\widetilde{\mathbf{Q}}_{2,1}&\alpha_{2}\mathbf{B}_{2}\widetilde{\mathbf{Q}}_{2,2}\\ \alpha_{3}\mathbf{B}_{3}\widetilde{\mathbf{Q}}_{3,1}&\alpha_{3}\mathbf{B}_{3}\widetilde{\mathbf{Q}}_{3,2}\\ \mathbf{S}_{1}^{-1/2}&\\ &\mathbf{S}_{2}^{-1/2}\end{array}\right].
Under the assumption that \widetilde{\mathbf{Q}}_{\nu,k}=q_{\nu,k}\mathbf{I}, the matrix \mathbf{U} is block-diagonal with small blocks of variable size when seen in \ell- and m-major ordering. Specifically, the entries in the "data blocks" (\nu,k^{\prime}) in the top part of \mathbf{U} are given by
U_{(\ell,m,\nu),(\ell^{\prime},m^{\prime},k^{\prime})}=\alpha_{\nu}b_{\nu,\ell}q_{\nu,k^{\prime}}\delta_{\ell,\ell^{\prime}}\delta_{m,m^{\prime}},  (12)
and the entries in the "prior blocks" (k,k^{\prime}) in the bottom part of \mathbf{U} are given by
U_{(\ell,m,k),(\ell^{\prime},m^{\prime},k^{\prime})}=C_{k,\ell}^{-1/2}\delta_{k,k^{\prime}}\delta_{\ell,\ell^{\prime}}\delta_{m,m^{\prime}}.  (13)
In the event that no prior is present for a component, the corresponding block can simply be removed from the system, or one may equivalently set C_{k,\ell}^{-1/2}=0.
At this point we must consider the multi-resolution nature of our setup. Our model assumes a band-limit L_{k} for each component, so that each component has (L_{k}+1)^{2} corresponding columns in \mathbf{U}, and, if a prior is used, an additional (L_{k}+1)^{2} corresponding rows. Each sky observation has a natural band-limit L_{\nu}, where the beam transfer function b_{\nu,\ell} has decayed so much that including further modes is numerically irrelevant, and so each band has (L_{\nu}+1)^{2} corresponding rows in \mathbf{U}. When we view \mathbf{U} in the block-diagonal \ell- and m-major ordering, the block sizes are thus variable: each band row only participates for \ell\leq L_{\nu}, and each component column and component row only participates for \ell\leq L_{k}. A way to see this is to consider that the first blocks, for \ell=0, have size (N_{\text{obs}}+N_{\text{comp}})\times N_{\text{comp}} for all m; then, as \ell is increased past some L_{\nu} or L_{k}, the corresponding rows or columns disappear from the blocks.
In code it is easier to introduce appropriate zero-padding for \ell>L_{k} and \ell>L_{\nu}, so that one can work with a single, constant block size. Either way, the numerical results are equivalent. Since \mathbf{U} consists of many small blocks on the diagonal, computing the pseudo-inverse is quick, and the memory use and compute time for constructing \mathbf{U}^{+} scale as O(N_{\text{comp}}^{2}N_{\text{obs}}L^{2}).
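The zero-padded per-\ell precomputation can be sketched as follows; the band counts, band-limits, beams, and scalar mixing values are made-up toy numbers of ours, not Planck characteristics.

```python
import numpy as np

# Build the constant-shape (N_obs + N_comp) x N_comp blocks of U per ell,
# zero-padding rows/columns for ell > L_nu or ell > L_k, and precompute
# the pseudo-inverse of each block.
N_obs, N_comp, L = 3, 2, 16
L_nu = [16, 12, 8]                     # per-band band-limits (toy)
L_k = [16, 10]                         # per-component band-limits (toy)
rng = np.random.default_rng(3)
alpha = 1.0 + 0.1 * rng.random(N_obs)  # per-band scalings
q = 1.0 + rng.random((N_obs, N_comp))  # scalar mixing approximations q_{nu,k}
b = np.array([np.exp(-0.5 * (np.arange(L + 1) / (lv + 1)) ** 2)
              for lv in L_nu])         # toy beam transfer functions
Cl_inv_sqrt = 1.0 / (1.0 + np.arange(L + 1.0))   # toy prior C_{k,l}^{-1/2}

U_blocks, U_pinv = [], np.zeros((L + 1, N_comp, N_obs + N_comp))
for ell in range(L + 1):
    U_ell = np.zeros((N_obs + N_comp, N_comp))
    for nu in range(N_obs):
        for k in range(N_comp):
            if ell <= L_nu[nu] and ell <= L_k[k]:       # eq. (12) entries
                U_ell[nu, k] = alpha[nu] * b[nu, ell] * q[nu, k]
    for k in range(N_comp):
        if ell <= L_k[k]:                               # eq. (13) entries
            U_ell[N_obs + k, k] = Cl_inv_sqrt[ell]
    U_blocks.append(U_ell)
    # pinv uses the SVD, so zero-padded (rank-deficient) blocks are safe.
    U_pinv[ell] = np.linalg.pinv(U_ell)

# Each precomputed block acts as a left inverse where U_ell has full rank:
assert np.allclose(U_pinv[0] @ U_blocks[0], np.eye(N_comp))
```

Applying \mathbf{U}^{+} to a stacked vector is then one small matrix-vector product per (\ell,m), which is the cheap operational form exploited below.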
With this efficient representation of \mathbf{U}^{+} in our toolbox, we again use the trick of Elman (1999) to construct a preconditioner of the form
\mathbf{A}^{-1}=(\mathbf{U}^{T}\mathbf{T}\mathbf{U})^{-1}\approx\mathbf{U}^{+}\mathbf{T}^{-1}(\mathbf{U}^{+})^{T}.  (14)
The lower-right identity blocks of \mathbf{T} require no inversion. The \widetilde{\mathbf{N}}^{-1}_{\nu} blocks in the upper-left section of \mathbf{T} can be approximately inverted using the technique presented in Sect. 3.2, so that
(\widetilde{\mathbf{N}}^{-1}_{\nu})^{-1}\approx\alpha_{\nu}^{2}\mathbf{Y}_{\nu}^{T}\mathbf{W}_{\nu}\mathbf{N}_{\nu}\mathbf{W}_{\nu}\mathbf{Y}_{\nu}.
Letting \mathbf{T}^{+} denote \mathbf{T}^{-1} approximated in this manner, the resulting preconditioner is
\mathbf{M}_{\text{PI}}=\mathbf{U}^{+}\mathbf{T}^{+}(\mathbf{U}^{+})^{T}.  (15) 
Computationally speaking, this has the optimal operational form: the pseudo-inverse \mathbf{U}^{+} and its transpose can be applied simply as a series of small matrix-vector products in parallel using the precomputed blocks, while application of \mathbf{T}^{+} as an operator requires 2N_{\text{obs}} SHTs.
So far we have only considered the pseudoinverse as an algebraic trick, but some analytic insights are also available. First we look at two special cases. The preconditioned system can be written
\displaystyle\mathbf{M}_{\text{PI}}\mathbf{A}=(\mathbf{U}^{T}\mathbf{U})^{-1}\mathbf{U}^{T}\mathbf{T}^{+}\mathbf{U}(\mathbf{U}^{T}\mathbf{U})^{-1}\mathbf{U}^{T}\mathbf{T}\mathbf{U}.
First, note that if the real model indeed has a flat inverse-noise variance map, then \mathbf{T}^{+}=\mathbf{T}=\mathbf{I} and the preconditioner is perfect, \mathbf{M}\mathbf{A}=\mathbf{I}. Second, consider the case in which the inverse-noise variance maps are not flat, but rather 1) N_{\text{comp}}=N_{\text{obs}}; 2) there are no priors; and 3) the bandlimit of each sky observation matches that of its dominating component. In this case, \mathbf{U} is square and invertible, (\mathbf{U}^{T}\mathbf{U})^{-1}\mathbf{U}^{T}=\mathbf{U}^{-1}, and thus \mathbf{M}\mathbf{A}=\mathbf{U}^{-1}\mathbf{T}^{+}\mathbf{T}\mathbf{U}\approx\mathbf{I}, using the very good approximation developed in Sect. 3.2.
In the generic case, \mathbf{T}\neq\mathbf{I}, and \mathbf{U} has more rows than columns. First note that for an otherwise equivalent model with constant inverse-noise variance maps, the corresponding system matrix is \mathbf{A}_{\text{F}}\equiv\mathbf{U}^{T}\mathbf{U}. Further, we can define a system which acts just like \mathbf{A} in terms of effects that are spatially invariant, but which has the inverse effect when it comes to the scale-free spatial variations: \mathbf{A}_{\text{SPI}}\equiv\mathbf{U}^{T}\mathbf{T}^{+}\mathbf{U}. With these definitions we have
\displaystyle\mathbf{M}_{\text{PI}}\mathbf{A}=\mathbf{A}^{-1}_{\text{F}}\mathbf{A}_{\text{SPI}}\mathbf{A}^{-1}_{\text{F}}\mathbf{A}.
The operator above applies the spatially variant effects once forward and once inversely, and the spatially invariant effects twice forward and twice inversely. Everything that is done is also undone, and in this sense \mathbf{M}_{\text{PI}}\mathbf{A} can be said to approximate \mathbf{I}.
To see where the approximation breaks down, one must consider what \mathbf{A}_{\text{SPI}} actually represents, which is a linear combination of the spatial inverses of the inverse-noise maps and the priors. For each component, the \mathbf{U} matrix gives more weight to the bands carrying more information. For instance, in a Planck-type setup that includes a thermal dust component, the spatial inverse of the 857 GHz band will be given strong weight, while the spatial inverse of the 30 GHz band will be entirely neglected, as desired. Likewise, the prior terms will be given little weight in the data-dominated regime at low multipoles, and then gradually be introduced as the data becomes noise-dominated.
However, the weights between the spatial inverses ignore spatial variations, and only account for the average within each multipole \ell. Thus the preconditioner only works well when faced with modest spatial variations in the inverse-noise maps. This is illustrated in Fig. 5. For the spatially invariant part of the preconditioner, the crossover between the inverse-noise-dominated regime and the prior-dominated regime must happen at a single point in \ell-space, whereas in reality this point varies with spatial position for \ell in the midrange where the prior lines cross the inverse-noise bands.
Figure 6 summarizes the performance of the above preconditioner in a realistic full-sky component separation setup in terms of iteration count. The analysis setup corresponds to a standard nine-band Planck data set in terms of instrumental noise levels and beam characteristics (Planck Collaboration I, 2016). Of course, it should be noted that the pseudoinverse preconditioner requires extra SHTs compared to a standard diagonal preconditioner, and so requires somewhat more compute time per iteration. This penalty is highly model-specific, as it depends on N_{\text{comp}}, N_{\text{obs}} and the number of non-constant mixing maps N_{\text{mix}}. For this particular model with two flat-mixing components and one variable-mixing component, each multiplication with \mathbf{A} requires 2(2N_{\text{mix}}+N_{\text{obs}})=54 SHTs, whereas application of the pseudoinverse preconditioner requires 2N_{\text{obs}}=18 SHTs. If all components have the same resolution, this translates to each iteration of the pseudoinverse solver taking 33% longer than one iteration with the block-diagonal preconditioner; with a third of the iterations needed, we end up with a total runtime reduction of 60%. In the model benchmarked, the synchrotron component with a flat mixing map has much lower resolution than the dust component with a variable mixing map, and so the speedup is somewhat larger.
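The SHT bookkeeping above can be checked with a few lines of arithmetic; note that N_{\text{mix}}=9 is our inference from the quoted count of 54 SHTs, not a number stated explicitly in the text:

```python
def sht_counts(n_obs, n_mix):
    """SHT counts per CG iteration for the benchmark model in the text:
    one multiplication by A costs 2*(2*N_mix + N_obs) SHTs, and one
    application of the pseudoinverse preconditioner costs 2*N_obs SHTs."""
    return 2 * (2 * n_mix + n_obs), 2 * n_obs

# Planck-like setup: N_obs = 9 bands; N_mix = 9 reproduces the quoted 54.
matvec, precond = sht_counts(n_obs=9, n_mix=9)
per_iter_ratio = (matvec + precond) / matvec  # 72/54 = 4/3, i.e. 33% longer
total_ratio = per_iter_ratio / 3              # a third of the iterations
```

Here total_ratio comes out at roughly 0.44; the slightly larger reduction quoted in the text follows from the lower resolution of the flat-mixing components.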
4 Constrained realization under a mask
So far we have only considered the full sky case. In many practical applications, we additionally want to mask out parts of the sky, either because of missing data, or because we do not trust our statistical model in a given region of the sky. Within such scenarios, it is useful to distinguish between two very different cases:

i) Partial sky coverage, where only a small patch of the sky has been observed, and we wish to perform component separation only within this patch. The typical use-case for this setup is ground-based or suborbital CMB experiments.

ii) Natively full-sky coverage, but with foregrounds too strong in a given part of the sky to trust our model. In this case, one often masks out part of the sky, but still seeks a solution to the system under the mask, constrained by the observed sky at the edges of the mask and determined by the prior inside the mask. By ignoring data from this region we at least avoid contaminating the CMB component with foreground emission. Of course, the solution will not be the true CMB sky either, but it will have statistically correct properties for use inside of a Gibbs sampler (Jewell et al., 2004; Wandelt et al., 2004; Eriksen et al., 2004, 2008).
We expect the solver developed in the previous section to work well in case i) given appropriate modifications, but leave such modifications for future work, and focus solely on case ii) in what follows.
4.1 Incorporating a mask in the model while avoiding ringing effects
Recall from Sect. 2.2 that our data model reads
\mathbf{d}=\mathbf{Y}_{\nu}\mathbf{B}_{\nu}\widetilde{\mathbf{Q}}_{\nu,k}\mathbf{s}+\mathbf{n},
where \mathbf{d} are the pixels of the observed sky maps. In Eriksen et al. (2004) and Seljebotn et al. (2014) a mask is introduced into the model by declaring that the masked pixels are missing from \mathbf{d}, or, equivalently, that \mathbf{N}^{-1} is zero in these pixels. This is straightforward from a modelling perspective, but suffers from an inconvenient numerical problem: ideally, we want to specify the beam operator \mathbf{B} using a spherical harmonic transfer function b_{\ell}, but, because b_{\ell} must necessarily be bandlimited, \mathbf{B} exhibits ringing in its tails in pixel basis. Specifically, in pixel space the beam operator first exhibits an exponential decay, as desired, but then suddenly stops decaying before it hits zero. At this point, it starts to observe the entire sky through the ringing “floor” (see figures in Seljebotn et al., 2014), and it becomes non-local. When the signal-to-noise ratio of the data in question is high enough compared to such numerical effects, the model will try to predict the signal component \mathbf{s} within the mask through deconvolution of the pixels at the edge of the mask, regardless of their distance. This causes a major complication for all solvers of this type, and in Seljebotn et al. (2014) we had to carefully tune the solver to avoid this ringing effect.
In the present solver we sidestep this problem by introducing the mask in the mixing maps \mathbf{Q}_{\nu,k}, rather than in the noise model. The sky is then split cleanly into one set of pixels outside the mask that hits the full matrix \mathbf{S}^{-1}+\mathbf{P}^{T}\mathbf{N}^{-1}\mathbf{P}, and another set of pixels (those under the mask) that only hits the prior term \mathbf{S}^{-1}.
The statistical interpretation of this mask model is that we pretend that a massive screen has been installed far away in the universe, physically shielding the microwave radiation in the region of the mask. This model is of course not physically meaningful, but the numerical difference is only evident in how quickly the mask takes effect, within a region spanning one beamwidth around the mask border. Since the masks in use are mainly constructed by rules of thumb and by inspecting residual maps, the difference between this mask model and actually removing pixels from \mathbf{d} is statistically irrelevant.
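In code, this mask model amounts to nothing more than zeroing the mixing maps under the mask; a minimal numpy sketch, with an array layout that is ours rather than the cmbcr code's:

```python
import numpy as np

def mask_mixing_maps(Q, outside):
    """Introduce the mask through the mixing maps instead of N^{-1}.

    Q       : (n_obs, n_comp, n_pix) mixing map values per band and component
    outside : (n_pix,) boolean, True for pixels outside the mask (observed)

    Masked pixels then couple only to the prior term S^{-1}; a different
    mask per component could be applied by also indexing the component axis.
    """
    return Q * outside[np.newaxis, np.newaxis, :]
```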
Another advantage of this mask model is that it enables the use of different masks for different microwave components, although we have not yet tried this feature of our solver in a real analysis. A typical use-case could be joint estimation of CMB and cosmic infrared background (CIB) fluctuations, which would typically require different effective masks. Conversely, one disadvantage is that the same set of masks applies to all input sky maps.
4.2 Independently solving for signals under a mask
Consider the schematic description of each regime in the right panel of Fig. 5. The pseudoinverse preconditioner, which we will denote \mathbf{M}_{\text{PI}}, automatically finds a good split per component between the low- and high-\ell regimes, but is, in the same way as a block-diagonal preconditioner, blind to the different regimes inside and outside of the mask. As a result, it performs poorly for large scales inside the mask, i.e., for multipoles below the point of unity signal-to-noise ratio shown in Fig. 7.
To solve this, we supplement the pseudoinverse preconditioner with a second preconditioner that is designed to work well only inside the mask, where it is possible to simplify the system. Let \mathbf{Z} denote spherical harmonic synthesis to the pixels within the mask only; that is, we first apply \mathbf{Y}, and then select only the masked pixels. Then, building on standard domain decomposition techniques, a preconditioner that provides a solution only within the masked region is given by
\mathbf{M}_{\text{mask}}=\mathbf{Z}^{T}(\mathbf{Z}\mathbf{A}\mathbf{Z}^{T})^{-1}\mathbf{Z}=\mathbf{Z}^{T}(\mathbf{Z}\mathbf{S}^{-1}\mathbf{Z}^{T})^{-1}\mathbf{Z}.  (16)
We will develop a solver for the inner mask-restricted system in Sect. 4.4. For now we assume that we can efficiently apply a suitable approximation of (\mathbf{Z}\mathbf{S}^{-1}\mathbf{Z}^{T})^{-1} to a vector. It then remains to combine \mathbf{M}_{\text{PI}} and \mathbf{M}_{\text{mask}}. The simplest possible way of doing this is to add them together, \mathbf{M}_{\text{add}}\equiv\mathbf{M}_{\text{PI}}+\mathbf{M}_{\text{mask}}, and we find experimentally that this simple combination performs well for our purposes.
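The additive combination is trivial to implement; a hypothetical sketch, with both preconditioners supplied as callables:

```python
import numpy as np

def make_additive_preconditioner(apply_M_PI, apply_M_mask):
    """The simplest combination M_add = M_PI + M_mask: apply both
    preconditioners to the same residual and add the results."""
    def apply_M_add(r):
        return apply_M_PI(r) + apply_M_mask(r)
    return apply_M_add
```

Inside a preconditioned CG iteration, apply_M_add is then called once per residual, costing one application of each constituent preconditioner.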
This may, however, break down for more demanding models. We found that the most important factor in how well \mathbf{M}_{\text{add}} works is how far the inverse-noise term has decayed before it is overtaken by the prior term (see Fig. 7). In the case of analysing Planck data, we find that the inverse-noise term has decayed by a factor of roughly \lambda=0.15 at this point, which is unproblematic. In simulations with lower resolution or higher noise, such that \lambda=0.01, convergence is hurt substantially, and at \lambda=0.001 the preconditioner breaks down entirely.
Since our own use-cases (which are targeted towards Planck) are not affected by this restriction, we have not investigated this issue very closely. We have, however, diagnosed the effect at low resolution with a dense system solver for (\mathbf{Z}\mathbf{S}^{-1}\mathbf{Z}^{T})^{-1}. In these studies, we find that the problem is intimately connected with how the two preconditioners are combined, and it may well be that more sophisticated methods for combining preconditioners will do a better job. For the interested reader, we recommend Tang et al. (2009) for an introduction to the problem, as they cover many related methods arising from different fields using a common terminology and notation. In particular, it would be interesting to use deflation methods to “deflate” the mask subspace out of the solver.
Finally, we give a word of warning: in the full sky case, we have been able to rescale the CR system arbitrarily without affecting the essentials of the system. For instance, Eriksen et al. (2008) scales \mathbf{A} with \mathbf{S}^{1/2} so that the system matrix becomes \mathbf{I}+\mathbf{S}^{1/2}\mathbf{P}^{T}\mathbf{N}^{-1}\mathbf{P}\mathbf{S}^{1/2}. This has no real effect on the spherical harmonic preconditioners, but in pixel domain it changes the shape of each term. The natural, unscaled form of \mathbf{A} is localized in pixel domain: the inverse-noise term essentially looks like a sum of the instrumental beams, while the prior defines smoothness couplings between a pixel and its immediate neighbourhood. However, \mathbf{S}^{1/2} is a highly non-local operator, and multiplying with this factor decreases locality and causes a breakdown of our method. There may of course be other filters which would increase locality in pixel domain instead of decreasing it, in which case it could be beneficial to apply them.
4.3 Including a lowpass filter in the maskrestriction
The feature that most strongly defines the mask-restricted system \mathbf{Z}\mathbf{S}^{-1}\mathbf{Z}^{T} is the shape of the mask, and thus pixel basis is the natural domain in which to approach this system. The operator \mathbf{Z}\mathbf{S}^{-1}\mathbf{Z}^{T} acts as a convolution with an azimuthally symmetric kernel on the pixels within the mask. In Fig. 7 we plot a cut through this convolution kernel (blue in the bottom panel). Mainly due to the sharp truncation at the bandlimit L, oscillations extend far away from the center of the convolution kernel. To make the system easier to solve we follow Seljebotn et al. (2014) and insert a lowpass filter as part of the restriction operator \mathbf{Z}, so that the projection from spherical harmonics to the pixels within the mask is preceded by multiplication with the transfer function
r_{\ell}=\text{exp}(-\beta\ell^{2}(\ell+1)^{2}),  (17)
where we choose \beta so that r^{2}_{L/2}=0.05. The resulting system now has a transfer function r_{\ell}^{2}/C_{\ell}, whose associated convolution kernel is much more localized (dashed orange), making it easier to develop a good solver for \mathbf{Z}\mathbf{S}^{-1}\mathbf{Z}^{T}.
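The filter and the condition fixing \beta translate directly into code; solving r^{2}_{L/2}=0.05 for \beta gives \beta=\ln(20)/\left(2(L/2)^{2}(L/2+1)^{2}\right):

```python
import numpy as np

def lowpass_transfer(L):
    """The transfer function of Eq. (17), r_ell = exp(-beta ell^2 (ell+1)^2),
    with beta fixed by the condition r_{L/2}^2 = 0.05."""
    half = L // 2
    beta = np.log(1.0 / 0.05) / (2.0 * half**2 * (half + 1) ** 2)
    ell = np.arange(L + 1, dtype=float)
    return np.exp(-beta * ell**2 * (ell + 1.0) ** 2)
```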
After introducing this lowpass filter we no longer have equality in Eq. (16), but only approximately \mathbf{Z}\mathbf{A}\mathbf{Z}^{T}\approx\mathbf{Z}\mathbf{S}^{-1}\mathbf{Z}^{T}. This does not appear to hurt the overall method, as r_{\ell} is rather narrow when seen as a pixel-domain convolution (unlike \sqrt{C_{\ell}}, as noted above). Also note that we have now over-pixelized the system \mathbf{Z}\mathbf{S}^{-1}\mathbf{Z}^{T}, as higher-frequency information has been suppressed and the core of the convolution kernel is supported by two pixels. Our attempts at representing \mathbf{Z}\mathbf{S}^{-1}\mathbf{Z}^{T} on a coarser grid failed, however, because the solution will not converge along the edge of the mask unless the grid of \mathbf{Z} exactly matches the grid of the mixing map \mathbf{Q}_{\nu,k}. While the resulting system \mathbf{Z}\mathbf{S}^{-1}\mathbf{Z}^{T} on the full-resolution grid is poorly conditioned for the smallest scales, this does not prevent us from applying iterative methods to solve for the larger scales.
4.4 Multigrid solver for the maskrestricted system
Finally we turn our attention to constructing an approximate inverse for \mathbf{Z}\mathbf{S}^{-1}\mathbf{Z}^{T}. We now write this system matrix as
\mathbf{G}=\mathbf{Z}\mathbf{S}^{-1}\mathbf{Z}^{T}=\mathbf{Y}\mathbf{D}\mathbf{Y}^{T},  (18)
where \mathbf{D} is a diagonal matrix with d_{\ell}=r_{\ell}^{2}/C_{\ell} on the diagonal, and it should be understood that the spherical harmonic synthesis \mathbf{Y} only projects to grid points within the mask.
As noted in Seljebotn et al. (2014), in the case where d_{\ell}\propto\ell^{2} this is simply the Laplacian partial differential equation on the sphere, and the multigrid techniques commonly used for solving this system are also effective in our case. We will focus on the case where C_{\ell} is the CMB power spectrum; in this case 1/C_{\ell} starts out proportional to \ell^{2}, increasing to \ell^{6} around \ell\sim 1600, eventually reaching \ell^{8} at \ell\sim 6000. In theory this should make the system harder to solve than the Laplacian, but it seems that in our solver the application of the lowpass filter described in the previous section is able to work around this problem.
To solve the system \mathbf{G}\mathbf{x}=\mathbf{b} using iterative methods we might start out with a simple diagonal approximate inverse,
\mathbf{M}\equiv\text{diag}(\mathbf{G})^{-1},
which is in fact a constant scaling, since \mathbf{G} is spatially invariant. This turns out to work well as a preconditioner for the intermediate scales of the solution. For smaller scales (higher \ell) the quickly decaying restriction r_{\ell} starts to dominate over 1/C_{\ell}, such that the combined effect is that of a lowpass filter; such filters cannot, to our knowledge, be efficiently deconvolved in pixel domain, and a harmonic-domain preconditioner would be required. Luckily, we do not need to solve for these smaller scales, as the pseudoinverse preconditioner will find the correct solution in this regime, and the restriction operator \mathbf{Z} will at any rate filter out whatever contribution comes from the solution of \mathbf{G}.
The problem at larger scales (lower \ell) is that the approximate inverse would have to embed the inversion of couplings between two distant pixels through a series of intermediate pixels in between; this is beyond the reach of our simple diagonal preconditioner. For this reason we introduce a multigrid (MG) V-cycle, where we recursively solve the system at coarser resolutions. For each coarsening, the preconditioner is able to see further on the sphere, as indirect couplings in the full-resolution system are turned into direct couplings in the coarser versions of the system. For a basic introduction to multigrid methods see, e.g., Hackbush (1985) or the overview given in Seljebotn et al. (2014).
The first ingredient in MG is a hierarchy of grids, which are denoted relatively, with h denoting an arbitrary level and H denoting the next coarser level. We have opted for a HEALPix grid for \mathbf{Q}_{\nu,k} and \mathbf{G}, and use its hierarchical structure to define the coarser grids, simply letting N_{\text{side},H}=N_{\text{side},h}/2. We also need to consider which subset of grid points to include to represent the region within the mask. We obtained the best results by only including those pixels of H which are covered 100% by the mask on the fine grid h, so that no pixel on any level ever represents a region outside the full-resolution mask.
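Under the assumption of NESTED pixel ordering, where the four children of coarse pixel p are pixels 4p,\dots,4p+3, this coarsening rule is a one-line reduction:

```python
import numpy as np

def coarsen_mask(under_mask_nest):
    """Coarsen a boolean HEALPix mask by one level (NESTED ordering assumed).

    A coarse pixel is kept only if all four of its children lie under the
    full-resolution mask, so no coarse pixel ever represents a region
    outside the mask.
    """
    # In NESTED ordering the children of coarse pixel p are 4p .. 4p+3.
    return under_mask_nest.reshape(-1, 4).all(axis=1)
```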
The second ingredient in MG is the restriction operator \mathbf{I}_{h}^{H} which transfers a vector from grid h to grid H. We tried restriction operators both in pixel domain and spherical harmonic domain, and spherical harmonic restriction performed better by far. Thus we define
\mathbf{I}_{h}^{H}=\mathbf{Y}_{H}\mathbf{R}_{H}\mathbf{Y}^{T}_{h}\mathbf{W}_{h},  (19) 
where we use subscripts to indicate the grid of each operator, and where \mathbf{R}_{H} has some harmonic lowpass filter r_{H,\ell} on its diagonal. In Seljebotn et al. (2014) the corresponding filter had to be carefully tuned to avoid problems with ringing, because the \mathbf{N}^{-1} term created high contrasts in the system matrix. In the present method we no longer have to deal with the \mathbf{N}^{-1} term, and the requirements on the lowpass filter are much less severe, as long as it corresponds to a convolution kernel with a FWHM of roughly one pixel on the coarse grid. A Gaussian bandlimited at L_{H}=3N_{\text{side},H}=L_{h}/2 performed slightly better than the filter of Eq. (17) in our tests, even if it has somewhat more ringing at this bandlimit.
The third ingredient in MG is the coarsened linear system,
\mathbf{G}_{H}=\mathbf{I}_{h}^{H}\mathbf{G}_{h}\mathbf{I}_{H}^{h}=\mathbf{Y}_{H}\mathbf{D}_{H}\mathbf{Y}_{H}^{T},  (20)
where d_{H,\ell}=r^{2}_{H,\ell}d_{h,\ell} is bandlimited at L_{H}=L_{h}/2. (Note that the matrix coarsening must be done in another way if using a pixel-domain restriction operator. In that case \mathbf{D}_{H} is in principle a dense matrix due to pixelization irregularities, but can still be very well approximated by a diagonal matrix; details are given in Appendix B.) We stress again that the grid H embeds the structure of the mask, so that \mathbf{Y}_{H} in this context denotes spherical harmonic synthesis only to grid points within the mask. In computer code, zero-padding is used outside of the mask before applying \mathbf{Y}_{H}^{T}, and entries outside the mask are discarded after applying \mathbf{Y}_{H}.
The fourth ingredient in MG is an approximate inverse, in this context named the smoother. The name refers to removing small scales in the error \mathbf{x}_{\text{true}}-\mathbf{x}. Removing these scales happens through approximately solving the system, and should not be confused with applying a lowpass filter. In our case we use the simple constant smoother \mathbf{M} discussed above, in combination with a damping factor \omega=0.2, so that the eigenvalues of \omega\mathbf{M}\mathbf{G} are bounded above by 2, as required by the MG method. We write \mathbf{M}_{h}=\omega\;\text{diag}(\mathbf{G}_{h})^{-1} for the smoother on level h.
Finally, the ingredients are combined in the simplest possible MG Vcycle algorithm (see Fig. 8). It turns out that the restriction and interpolation operations can share one SHT each with the associated system matrix multiplication, so that 6 SHTs are required per level. Furthermore, the SHTs can be performed at different resolutions for additional savings.
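A schematic version of this V-cycle, with all per-level operators passed in as callables; the names are illustrative, and the coarsest level is handled with a few smoother sweeps rather than an exact solve:

```python
import numpy as np

def v_cycle(level, b, apply_G, diag_G, restrict, interpolate, omega=0.2):
    """One multigrid V-cycle for G x = b. Per level we supply apply_G[h]
    (multiply by G_h), diag_G[h] (the constant diagonal of G_h),
    restrict[h] (I_h^H) and interpolate[h] (I_H^h)."""
    def smooth(x, r):
        # one damped-Jacobi sweep with the constant smoother M_h
        return x + omega * r / diag_G[level]

    if level == len(apply_G) - 1:
        # coarsest level: a handful of smoother sweeps stand in for a solve
        x = np.zeros_like(b)
        for _ in range(20):
            x = smooth(x, b - apply_G[level](x))
        return x
    x = smooth(np.zeros_like(b), b)              # pre-smoothing
    r = b - apply_G[level](x)
    x = x + interpolate[level](                  # coarse-grid correction
        v_cycle(level + 1, restrict[level](r),
                apply_G, diag_G, restrict, interpolate, omega))
    return smooth(x, b - apply_G[level](x))      # post-smoothing
```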
Figure 9 shows the results of solving the full system when inserting this algorithm as an approximation of (\mathbf{Z}\mathbf{S}^{1}\mathbf{Z}^{T})^{1} in Eq. (16). While the diagonal preconditioner degrades, \mathbf{M}_{\text{add}} converges very quickly. The pseudoinverse preconditioner \mathbf{M}_{\text{PI}} by itself shows much the same behaviour as the diagonal preconditioner in this situation (not plotted).
5 Benchmark notes
The convergence plots in this paper were produced using a prototype implementation written in a mixture of Fortran, Cython and Python, which is available under an open source license at https://github.com/dagss/cmbcr. As a prototype, it does not support polarization or distributed computing with MPI. A full implementation in the production-quality Commander code is in progress.
As the solver does not support MPI parallelization, a full-resolution run is an overnight run. However, as shown in Fig. 10, the performance of our solver does not significantly degrade as resolution is increased, and in this paper we therefore present benchmarks of the downscaled systems that we have worked with during development. The downgrade procedure follows these steps:

Downgrade the RMS maps to a lower N_{\text{side}} using HEALPix routines

Find the best-fit Gaussian beam approximation to the instrumental beams, and make equivalent low-resolution beams based on scaling down the FWHM parameter

Downgrade each prior C_{\ell} by subsampling coefficients. For instance, for a downgrade from N_{\text{side}}=2048 to N_{\text{side}}=256, we take every 8th coefficient. Similarly, we divide each bandlimit L_{k} by the relevant downgrade factor.

Scale C_{\ell} in such a way that the diagonal of \mathbf{S}^{-1} crosses the diagonal of \mathbf{P}^{T}\mathbf{N}^{-1}\mathbf{P} at the same \ell, relative to the full L_{k}, ensuring that the system has the same signal-to-noise properties as the full-resolution system.
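The C_{\ell} subsampling and bandlimit division of step 3 are simple index manipulations; a sketch, with an illustrative function name:

```python
import numpy as np

def downgrade_prior(cl, L_k, factor):
    """Step 3 of the downgrade procedure: subsample every factor-th C_ell
    coefficient (e.g. factor = 8 for N_side 2048 -> 256) and divide the
    component bandlimit L_k by the same factor."""
    return cl[::factor], L_k // factor
```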
To save computing resources, the numerical experiments of Figs. 3, 6 and 9 are performed on systems downgraded to N_{\text{side}}=128.
Simulations are performed with a known \mathbf{x}_{\text{true}} drawn randomly from a Gaussian distribution, and a right hand side given by \mathbf{b}=\mathbf{A}\mathbf{x}_{\text{true}}. Then the convergence statistic denoted “relative error” in these figures simply reads
e_{i}\equiv\|\mathbf{x}_{i}-\mathbf{x}_{\text{true}}\|.
Finally, unless otherwise noted, we add regularization noise to the 1% highest signal-to-noise pixels in the RMS maps. As noted in Fig. 3, this is more of an advantage for the diagonal preconditioner than for the pseudoinverse preconditioner, but it mimics what one would typically do in real analysis cases.
In the present paper we have focused strictly on algorithm development, and as such the prototype code is not optimized; we have not invested the effort to benchmark the preconditioners in terms of CPU time spent. As detailed in Sects. 3.3 and 4.4, the additional cost for a particular usecase can be calculated from the number of extra SHTs.
The blockdiagonal preconditioner we use as a comparison point is described in further detail by Eriksen et al. (2004). In the notation of this paper, it can be written
\mathbf{M}_{\text{diag}}\equiv(\mathbf{U}^{T}\text{diag}(\mathbf{T})\mathbf{U})^{-1},
where each element of \text{diag}(\mathbf{T}) can be computed in O(L) time by a combination of Fourier transforms and evaluation of the associated Legendre polynomials, which is available, e.g., in the latest version of Libsharp (https://github.com/dagss/libsharp). Since the matrix \mathbf{U}^{T}\text{diag}(\mathbf{T})\mathbf{U} consists of ({\ell_{\text{max}}}+1)^{2} blocks of size N_{\text{comp}}\times N_{\text{comp}}, the inversion is cheap.
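A compact sketch of assembling and inverting this block-diagonal preconditioner with numpy; we take \text{diag}(\mathbf{T}) as a precomputed input, since its evaluation requires the Legendre recursions mentioned above, and the names are ours:

```python
import numpy as np

def block_diag_preconditioner(U_blocks, T_diag):
    """M_diag = (U^T diag(T) U)^{-1}, built one (ell, m) block at a time:
    each block is only N_comp x N_comp, so inversion is cheap.

    U_blocks : (n_blocks, n_rows, n_comp)
    T_diag   : (n_blocks, n_rows), the diagonal of T per block
    """
    A = np.einsum('bri,br,brj->bij', U_blocks, T_diag, U_blocks)
    return np.linalg.inv(A)  # inv broadcasts over the block axis
```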
6 Discussion and outlook
In this paper we have presented a versatile Bayesian model for the multiresolution CMB component separation and constrained-realization problem, as well as an efficient solver for the associated linear system. This model is currently in active use for component separation in the Planck 2017 analysis. The final result is the ability to perform exact, full-resolution, multiresolution component separation of full-sky microwave data within a reasonable number of conjugate gradient iterations.
To achieve such good convergence, several novel techniques were employed. First, we developed a pseudoinverse-based preconditioner. For the full-sky case this provides a speedup of 2–3x compared to a diagonal preconditioner, depending on model parameters. Second, we extended the model with a mask through the mixing maps, rather than through the noise covariance matrix, to avoid the ringing problems associated with going between spherical harmonic domain and pixel domain. Third, we solved for the solution under the mask using a dedicated multigrid solver in pixel domain, restricted to the area under the mask, where the linear system can be simplified.
We note that the pseudoinverse preconditioner not only performs very well for the full-sky case with reasonably uniform mixing maps, but is also very simple to implement; significantly simpler than the previously standard diagonal preconditioner. As mentioned earlier, this technique is of course not novel for, or restricted to, CMB component separation; it has been in use for solving Navier-Stokes equations for some time. The fundamental idea is to approximate the total inverse of a sum with the best linear combination of inverses of each term. A related problem which fits this description particularly well is the basic CMB mapmaking equation (e.g., Tegmark, 1997). This equation is a sum over individual time segments of observations, each of which can in isolation be inverted in Fourier domain. If the pseudoinverse preconditioner works well in this case, as we believe it will, it may speed up exact maximum-likelihood map-makers substantially.
Regarding more direct extensions of the work in the present paper, a natural next step is the use of some pixel-domain basis instead of spherical harmonic coefficients to represent the microwave components. We have so far assumed an isotropic prior for all components, which can be specified in the form of a power spectrum C_{\ell} with a sharp bandlimit L. This model has a tendency to excite ringing in the resulting maps around sharp objects, unless much time is spent tuning the priors, or one adopts a very high bandlimit L for all components. Working with pixel-domain vectors and, with the exception of the CMB, pixel-domain prior specifications, one could easily define models that are more robust against this problem. Also, we know that the diffuse foregrounds have much variation where their signal is strong, but should be more heavily stabilized where their signal is weak. Such a non-isotropic prior is easier to model using a sparse matrix in pixel domain. Of particular interest are the so-called Conditional Auto-Regressive (CAR) models, which have a natural interpretation and which directly produce sparse inverses \mathbf{S}^{-1}.
Acknowledgements.
We thank Jeff Jewell, Sigurd Næss, Martin Reinecke, and Mikolaj Szydlarski for useful input. Parts of this work were supported by the European Research Council grant StG2010-257080. Some of the results in this paper have been derived using the HEALPix (Górski et al., 2005), Libsharp (Reinecke & Seljebotn, 2013), Healpy, NumPy, and SciPy software packages.

References
 Bennett et al. (2013) Bennett, C. L., Larson, D., Weiland, J. L., et al. 2013, ApJS, 208, 20
 Benzi et al. (2005) Benzi, M., Golub, G. H., & Liesen, J. 2005, ACTA NUMERICA, 14, 1
 BICEP2 Collaboration (2014) BICEP2 Collaboration. 2014, Physical Review Letters, 112, 241101
 Elman (1999) Elman, H. C. 1999, SIAM J. Sci. Comput., 20, 1299
 Elman et al. (2006) Elman, H. C., Howle, V. E., Shadid, J., Shuttleworth, R., & Tuminaro, R. 2006, SIAM J. Sci. Comput., 27
 Elsner & Wandelt (2013) Elsner, F. & Wandelt, B. D. 2013, A&A, 549, A111
 Eriksen et al. (2008) Eriksen, H. K., Jewell, J. B., Dickinson, C., et al. 2008, ApJ, 676, 10
 Eriksen et al. (2004) Eriksen, H. K., O’Dwyer, I. J., Jewell, J. B., et al. 2004, ApJS, 155, 227
 Górski et al. (2005) Górski, K. M., Hivon, E., Banday, A. J., et al. 2005, ApJ, 622, 759
 Hackbush (1985) Hackbush, W. 1985, Multigrid methods and applications (Springer Verlag, Berlin)
 Jewell et al. (2004) Jewell, J., Levin, S., & Anderson, C. H. 2004, ApJ, 609, 1
 Planck Collaboration I (2016) Planck Collaboration I. 2016, A&A, 594, A1
 Planck Collaboration X (2015) Planck Collaboration X. 2015, A&A, 594, A10
 Planck Collaboration XIII (2016) Planck Collaboration XIII. 2016, A&A, 594, A13
 Reinecke & Seljebotn (2013) Reinecke, M. & Seljebotn, D. S. 2013, A&A, 512
 Seljebotn et al. (2014) Seljebotn, D. S., Mardal, K.A., Jewell, J. B., Eriksen, H. K., & Bull, P. 2014, ApJS, 210
 Shewchuk (1994) Shewchuk, J. R. 1994
 Smith et al. (2007) Smith, K. M., Zahn, O., & Doré, O. 2007, Phys. Rev. D, 27
 Tang et al. (2009) Tang, J. M., Nabben, R., Vuik, C., & Erlangga, Y. A. 2009, Journal of Scientific Computing, 39
 Tegmark (1997) Tegmark, M. 1997, ApJ, 480, L87
 Wandelt et al. (2004) Wandelt, B. D., Larson, D. L., & Lakshminarayanan, A. 2004, Phys. Rev. D, 70, 083511
Appendix A Details of the pseudoinverse preconditioner
A.1 Approximating the inverse-noise maps
We seek \alpha_{\nu} such that the distance between \widetilde{\mathbf{N}}_{\nu}^{-1} and the identity matrix is minimized:
\displaystyle\|\widetilde{\mathbf{N}}_{\nu}^{-1}(\alpha_{\nu})-\mathbf{I}\|_{2}  \displaystyle=\|\alpha^{-2}_{\nu}\mathbf{Y}^{T}_{\nu}\mathbf{N}^{-1}_{\nu}\mathbf{Y}_{\nu}-\mathbf{Y}_{\nu}^{T}\mathbf{W}_{\nu}\mathbf{Y}_{\nu}\|_{2}
\displaystyle=\|\mathbf{Y}^{T}_{\nu}(\alpha^{-2}_{\nu}\mathbf{N}^{-1}_{\nu}-\mathbf{W}_{\nu})\mathbf{Y}_{\nu}\|_{2}
\displaystyle=\|\mathbf{Y}^{T}_{\nu}\mathbf{W}_{\nu}^{1/2}(\alpha^{-2}_{\nu}\mathbf{W}_{\nu}^{-1}\mathbf{N}^{-1}_{\nu}-\mathbf{I})\mathbf{W}_{\nu}^{1/2}\mathbf{Y}_{\nu}\|_{2}
\displaystyle=\|\alpha^{-2}_{\nu}\mathbf{W}_{\nu}^{-1}\mathbf{N}^{-1}_{\nu}-\mathbf{I}\|_{2}.
The last equality follows because all the singular values of \mathbf{Y}^{T}\mathbf{W}^{1/2} are 1, at least for the Gauss-Legendre grid. For the HEALPix grid the statement is only approximate, within 10-20%, depending on resolution parameters, and this is close enough for our purposes. We conclude that the best choice is
\alpha_{\nu}=\sqrt{\frac{\sum_{i}\left(\tau_{\nu}(\hat{n}_{i})/w_{i}\right)^{2}}{\sum_{i}\tau_{\nu}(\hat{n}_{i})/w_{i}}},  (21)
where \tau_{\nu}(\hat{n}_{i}) denotes the pixel values of the inverse-noise variance map on the diagonal of \mathbf{N}^{-1}_{\nu}, and w_{i} are the quadrature weights of the associated grid. We have verified this expression experimentally by perturbing \alpha_{\nu} in either direction, and find that either perturbation leads to slower convergence. Ultimately, the method may even fail to converge if \alpha_{\nu} deviates too much from the optimal value.
In code, the easiest way to compute \tau_{\nu}(\hat{n}_{i})/w_{i} is by performing a pair of SHTs, \mathbf{W}^{-1}\tau=\mathbf{Y}\mathbf{Y}^{T}\tau. By replacing the usual analysis \mathbf{Y}^{T}\mathbf{W} with the adjoint synthesis \mathbf{Y}^{T}, we end up implicitly multiplying \tau by \mathbf{W}^{-1}.
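As an illustration, Eq. (21) can be evaluated directly from the pixel values of the inverse-noise map. The following numpy sketch uses a synthetic \tau and uniform HEALPix-style quadrature weights (both assumptions, not data from the paper), and checks that the resulting \alpha_{\nu} minimizes the least-squares distance of \alpha_{\nu}^{-2}\mathbf{W}^{-1}\mathbf{N}^{-1}_{\nu} from the identity:

```python
import numpy as np

# Sketch: evaluating Eq. (21) for a synthetic inverse-noise map.
# tau (the diagonal of N^-1) and the uniform HEALPix-style weights
# are placeholder assumptions for illustration only.
rng = np.random.default_rng(0)
npix = 12 * 64**2
w = np.full(npix, 4 * np.pi / npix)    # equal quadrature weights (HEALPix-style)
tau = rng.uniform(0.5, 2.0, npix)      # synthetic inverse-noise variance map
d = tau / w                            # diagonal entries of W^-1 N^-1
alpha = np.sqrt(d.dot(d) / d.sum())    # Eq. (21)

def resid(a):
    """Least-squares distance of a^-2 W^-1 N^-1 from the identity."""
    return np.sum((d / a**2 - 1.0)**2)
```

Perturbing alpha in either direction increases the residual, mirroring the experimental check described above.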
A.2 Approximating the mixing maps
Following a similar derivation to the previous section, the optimal scalar to approximate the mixing maps is given by minimizing
\left\|\widetilde{\mathbf{Q}}_{\nu,k}-q_{\nu,k}\mathbf{I}\right\|=\left\|\mathbf{Y}^{T}\mathbf{W}\mathbf{Q}_{\nu,k}\mathbf{Y}-q_{\nu,k}\mathbf{Y}^{T}\mathbf{W}\mathbf{Y}\right\|=\left\|\mathbf{Q}_{\nu,k}-q_{\nu,k}\mathbf{I}\right\|,
so that the best choice is
q_{\nu,k}=\frac{\sum_{i}q_{\nu,k}(\hat{n}_{i})^{2}}{\sum_{i}q_{\nu,k}(\hat{n}_{i})}.  (22)
In our cases, however, the difference between this quantity and the mean of the mixing map is negligible.
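The near-equality of Eq. (22) and the plain mean is easy to see numerically: the difference equals var(q)/mean(q), i.e., it is second order in the fluctuations. A minimal sketch with a synthetic mixing map (an assumption; the actual maps are data-dependent):

```python
import numpy as np

# Sketch: Eq. (22) versus the plain mean for a mixing map with small
# spatial variation. The map values are synthetic stand-ins.
rng = np.random.default_rng(1)
q = 1.0 + 0.01 * rng.standard_normal(10_000)  # hypothetical mixing-map pixels
q_opt = np.sum(q**2) / np.sum(q)              # Eq. (22)
q_mean = q.mean()
# q_opt - q_mean = var(q)/mean(q): second order in the fluctuations
```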
A.3 Possible future extension: Merging observations
Depending on the data model, it may be possible to reduce the number of SHTs required for each application of the pseudoinverse preconditioner. When N_{\text{obs}}>N_{\text{comp}}, the system in some sense supplies redundant information. Assume that two rows of \mathbf{U} are (at least approximately) identical up to a constant scaling factor; this requires that the corresponding sky maps have the same beams and the same normalized spatial inverse-noise structure, and that they are observed at the same frequency \nu with the same SED for each component. That is, we require both \widetilde{\mathbf{N}}_{1}^{-1}=\widetilde{\mathbf{N}}_{2}^{-1} and \mathbf{U}_{2}=\gamma\mathbf{U}_{1}, where \mathbf{U}_{\nu} denotes a row of \mathbf{U} and \gamma is an arbitrary scale factor. Note that this situation is very typical for experiments with several independent detectors within the same frequency channel, which is nearly always the case for modern experiments.
Under these assumptions, we have
\left[\begin{array}{cc}\mathbf{U}_{1}^{T}&\mathbf{U}_{2}^{T}\end{array}\right]\left[\begin{array}{cc}\mathbf{N}_{1}^{-1}&\\&\mathbf{N}_{2}^{-1}\end{array}\right]\left[\begin{array}{c}\mathbf{U}_{1}\\\mathbf{U}_{2}\end{array}\right]=\left[(\gamma^{2}+1)^{1/2}\mathbf{U}_{1}\right]^{T}\mathbf{N}_{1}^{-1}\left[(\gamma^{2}+1)^{1/2}\mathbf{U}_{1}\right].
Thus, we may combine the two rows without affecting the rest of the system, and thereby halve the number of SHTs required. Of course, two sky observations with such identical properties could have been coadded prior to solving the system, as is typically done when creating coadded frequency maps. In practice, however, there are typically many advantages to working with detector subsets, including an improved ability to isolate systematic effects (e.g., Planck Collaboration X 2015) and to more easily allow for cross-correlation analysis. In addition, there may be experiments where some sky maps do not have identical properties, so that one does not wish to coadd them, but where they are similar enough that coadding poses no problem if done in the preconditioner alone. One then needs to produce compromise versions of \mathbf{B}, \mathbf{N}^{-1} and \mathbf{M}_{\nu,k}, replace the relevant matrices with these, and finally use the row-merge procedure described above to create a new \mathbf{U} solely for use in the preconditioner.
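The row-merging identity is straightforward to verify numerically. In this sketch, \mathbf{U}_{1} and the shared inverse-noise matrix are random stand-ins (assumptions for illustration); the check confirms that the combined scaling (\gamma^{2}+1)^{1/2} leaves the normal-equation block unchanged:

```python
import numpy as np

# Sketch: numerical check of the row-merging identity with random
# stand-ins for U_1 and the shared inverse-noise structure.
rng = np.random.default_rng(2)
npix, ncomp = 400, 3
U1 = rng.standard_normal((npix, ncomp))
gamma = 1.7
U2 = gamma * U1                              # identical row up to scaling
Ninv = np.diag(rng.uniform(0.5, 2.0, npix))  # shared inverse-noise matrix
lhs = U1.T @ Ninv @ U1 + U2.T @ Ninv @ U2    # two-row contribution
merged = np.sqrt(gamma**2 + 1.0) * U1        # single merged row
rhs = merged.T @ Ninv @ merged
```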
Appendix B Alternative strategies for the maskrestricted solver
We spent some time exploring pixel-domain restriction operators before turning to spherical harmonic restrictions. The simple restriction we attempted, averaging the 4 nested pixels using the standard HEALPix ud_grade function, more than doubled the number of iterations required for a small mask when compared to a restriction in the spherical harmonic domain, and had trouble converging at all for a large mask. The spherical harmonic restrictions are therefore well worth the extra time spent performing SHTs. Still, it is probably possible to extract somewhat more performance by experimenting with averaging over a larger region, with a choice of weights that approximates a Gaussian low-pass filter.
When using a pixel-based restriction, the system can no longer be coarsened simply by multiplying spherical harmonic transfer functions. However, since the operator is rotationally invariant, acting identically at every location, it is simple to coarsen the system numerically. The idea is to image the action of the operator on a single pixel, and then solve for the spherical harmonic transfer function that would produce this image. Let \mathbf{u} represent a unit vector located on the equator of the coarse grid H. We then seek \mathbf{D}_{H} such that
\mathbf{Y}_{H}\mathbf{D}_{H}\mathbf{Y}^{T}_{H}\mathbf{u}=\mathbf{I}_{h}^{H}\mathbf{G}_{h}\mathbf{I}_{H}^{h}\mathbf{u}=\mathbf{I}_{h}^{H}\mathbf{Y}_{h}\mathbf{D}_{h}\mathbf{Y}_{h}^{T}\mathbf{I}_{H}^{h}\mathbf{u}
\mathbf{D}_{H}\mathbf{Y}^{T}_{H}\mathbf{u}=\mathbf{Y}_{H}^{T}\mathbf{W}_{H}\mathbf{I}_{h}^{H}\mathbf{Y}_{h}\mathbf{D}_{h}\mathbf{Y}_{h}^{T}\mathbf{I}_{H}^{h}\mathbf{u}.
Now, assuming that \mathbf{D}_{H} is diagonal we must have
d_{H,\ell,m}=\frac{\left(\mathbf{Y}_{H}^{T}\mathbf{W}_{H}\mathbf{I}_{h}^{H}\mathbf{Y}_{h}\mathbf{D}_{h}\mathbf{Y}_{h}^{T}\mathbf{I}_{H}^{h}\mathbf{u}\right)_{\ell,m}}{\left(\mathbf{Y}^{T}_{H}\mathbf{u}\right)_{\ell,m}}.
In practice, due to pixelization effects, \mathbf{D}_{H} cannot be fully diagonal and this equation cannot be satisfied for all \ell, m. However, assuming that \mathbf{D}_{h} is isotropic, it should be fully characterized by the m=0 modes, and, as \mathbf{u} was located on the equator, these produce a very good estimate. Using this coarsening procedure instead of the analytical coarsening procedure of Sect. 4.4 produces identical results when applied to the Gaussian restriction operator. With pixel-domain restriction operators, pixelization effects will degrade the approximation somewhat. We expect the degradation to be smaller if the averaging weights are a function of the physical distance between the grid points rather than the logical distance.
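The delta-imaging idea can be demonstrated in a simplified one-dimensional (azimuthally symmetric) analogue, where the orthonormal Legendre basis on a Gauss-Legendre grid plays the role of \mathbf{Y} and quadrature is exact, so the transfer function is recovered exactly. The grid size and transfer function below are arbitrary choices for illustration:

```python
import numpy as np
from numpy.polynomial.legendre import leggauss, legval

# 1-D analogue of the numerical coarsening: image the operator
# G = Y D Y^T on a unit vector u, then recover the transfer function
# as (Y^T W G u)_l / (Y^T u)_l, which is exact on a Gauss-Legendre grid.
n = 32
x, w = leggauss(n)                    # nodes and quadrature weights
lmax = n - 1
# orthonormal Legendre basis: Y[i, l] = sqrt((2l+1)/2) * P_l(x_i)
Y = np.stack(
    [np.sqrt((2 * l + 1) / 2) * legval(x, np.eye(l + 1)[l])
     for l in range(lmax + 1)],
    axis=1,
)
W = np.diag(w)
d_true = np.exp(-0.01 * np.arange(lmax + 1)**2)  # a known transfer function
G = Y @ np.diag(d_true) @ Y.T                    # pixel-domain operator
u = np.zeros(n); u[n // 2] = 1.0                 # unit vector at one node
d_est = (Y.T @ W @ G @ u) / (Y.T @ u)            # recovered transfer function
```

Here Y^T W Y = I holds exactly because Gauss-Legendre quadrature with n nodes integrates the degree-(2n-2) products of Legendre polynomials exactly; on the sphere, pixelization effects make the corresponding recovery only approximate, as discussed above.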
We have also experimented with using a Fourier basis to represent \mathbf{G}_{h}. When using a thin mask around the equator, or a small point source, applying torus or flat-sky approximations, respectively, allows the use of the much faster FFTs instead of SHTs. The operator should then be transferred using the same principle as above. Let \mathbf{F} denote a discrete Fourier transform from harmonic space to real space; then, within a narrow equatorial band or a small patch, we require
\mathbf{F}\mathbf{D}_{\text{FFT}}\mathbf{F}^{T}\mathbf{u}\approx\mathbf{Y}\mathbf{D}_{\text{SHT}}\mathbf{Y}^{T}\mathbf{u}  (23)
and solve for
\mathbf{D}_{\text{FFT},k,k^{\prime}}=\frac{\left(\mathbf{F}^{-1}\mathbf{Y}\mathbf{D}_{\text{SHT}}\mathbf{Y}^{T}\mathbf{u}\right)_{k,k^{\prime}}}{\left(\mathbf{F}^{T}\mathbf{u}\right)_{k,k^{\prime}}}.
Coarser systems can then be produced either analytically (restriction in the harmonic domain) or by appropriate modifications to the technique above (restriction in the pixel domain). We were able to produce a functional solver using this principle, but felt that the loss in flexibility was not worth the gain in performance compared to the solver presented in Sect. 4.4.
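The same delta-imaging principle in a Fourier basis reduces, in a one-dimensional periodic toy case, to diagonalizing a circulant operator by the FFT: imaging the operator on a unit vector and dividing by the transform of that vector yields the Fourier transfer function. A minimal sketch (the kernel and grid size are arbitrary illustrative choices):

```python
import numpy as np

# Sketch: recovering a Fourier-domain transfer function from the
# image of a translation-invariant (circulant) operator on a delta.
n = 64
i = np.arange(n)
kernel = np.exp(-0.5 * (np.minimum(i, n - i) / 3.0)**2)
kernel /= kernel.sum()                 # periodic smoothing kernel
# circulant operator G: column j is the kernel shifted by j
G = np.stack([np.roll(kernel, j) for j in range(n)], axis=1)
u = np.zeros(n); u[0] = 1.0            # unit vector (delta)
d_fft = np.fft.fft(G @ u) / np.fft.fft(u)  # Fourier transfer function
```

Applying G through the FFT with d_fft then matches the dense matrix-vector product, which is the flat-sky shortcut the text refers to.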