A Parallel Implementation of the Ensemble Kalman Filter Based on Modified Cholesky Decomposition
Abstract
This paper discusses an efficient parallel implementation of the ensemble Kalman filter based on the modified Cholesky decomposition. The proposed implementation starts with decomposing the domain into subdomains. In each subdomain a sparse estimation of the inverse background error covariance matrix is computed via a modified Cholesky decomposition; the estimates are computed concurrently on separate processors. The sparsity of this estimator is dictated by the conditional independence of model components for some radius of influence. Then, the assimilation step is carried out in parallel without the need of interprocessor communication. Once the local analysis states are computed, the analysis subdomains are mapped back onto the global domain to obtain the analysis ensemble. Computational experiments are performed using the Atmospheric General Circulation Model (SPEEDY) with the T63 resolution on the Blueridge cluster at Virginia Tech. The number of processors used in the experiments ranges from 96 to 2,048. The proposed implementation outperforms in terms of accuracy the wellknown local ensemble transform Kalman filter (LETKF) for all the model variables. The computational time of the proposed implementation is similar to that of the parallel LETKF method (where no covariance estimation is performed). Finally, for the largest number of processors, the proposed parallel implementation is 400 times faster than the serial version of the proposed method.
copyrightbox
Computer Science Technical Report CSTR3/2016
August 17, 2019
Elias D. Niño, Adrian Sandu and Xinwei Deng
A Parallel Implementation of the Ensemble Kalman Filter Based on Modified Cholesky Decomposition
Computational Science Laboratory
Computer Science Department
Virginia Polytechnic Institute and State University
Blacksburg, VA 24060
Phone: (540)2312193
Fax: (540)2316075
Email: sandu@cs.vt.edu
Web: http://csl.cs.vt.edu
Innovative Computational Solutions 
Keywords: ensemble Kalman filter, covariance matrix estimation, local domain analysis.
1 Introduction
In operational data assimilation, sequential and variational methods are required to posses the ability of being performed in parallel [1, 2, 3]. This obeys to current atmospheric and oceanic model resolutions in which the total number of components arises to the order of millions and the daily information to be assimilated in the order of terabytes [4, 5]. Thus, serial data assimilation methods are impractical under realistic operational scenarios. In sequential data assimilation, one of the best parallel ensemble Kalman filter (EnKF) implementations is the local ensemble transform Kalman filter (LETKF) [6]. This method is based on domain localization given a radius of influence . Usually, the assimilation process is performed for each model component in parallel making use of a deterministic formulation of the EnKF in the ensemble space. In this formulation, the unknown background error covariance matrix is estimated by the rankdeficient ensemble covariance matrix which, in ensemble space, is welldefined. The LETKF relies in the assumption that local domain analyses avoid the impact of spurious correlations, for instance, by considering only small values for . However, in operational data assimilation, can be large owing to circumstances such as sparse observational networks and/or long distance data error correlations (i.e., pressure fields) In such cases, the accuracy of the LETKF can be negatively impacted owing to spurious correlations.
We think there is an opportunity to provide a more robust parallel ensemble Kalman filter implementation via a better estimation of background error correlations. When two model components (i.e., grid points) are assumed to be conditionally independent, their corresponding entry in the estimated inverse background error covariance matrix is zero. Conditionally dependence/independence of model components can be forced making use of local domain analyses. For instance, when the distance of two model components in physical space is larger than , their corresponding entry in the inverse background error covariance matrix is zero. This can be exploited in order to obtain sparse estimators of such matrix which implies huge savings in terms of memory and computations. Even more, high performance computing can be used in order to speedup the assimilation process: the global domain can be decomposed according to an available number of processors, for all processors, local inverse background error covariance matrices are estimated and then, the stochastic EnKF formulation [7] can be used in order to compute local domain analyses. The local analyses and then mapped back onto the global domain from which the global analysis state is obtained.
This paper is organized as follows. In section 2 basic concepts regarding sequential data assimilation and covariance matrix estimation are presented, in section 3 a parallel implementation of the ensemble Kalman filter based on the modified Cholesky decomposition is proposed; experimental results are discussed in section 4 and future research directions are presented in section 5. Conclusions are drawn in section 6.
2 Preliminaries
2.1 Modified Cholesky decompositon
Let , the matrix whose columns are th dimensional random Gaussian vectors with probability distribution , where the number of columns denotes the number of samples. Denote by , the vector holding the th component across all the columns of , for . The modified Cholesky decomposition [8] arises from regressing each variable on its predecessors , , , , that is , fitting regressions:
(1) 
where denotes the error in the regression of the th component. Let be the diagonal matrix of error variances and let denote the unitary lowertriangular matrix containing the negative value of regression coefficients, for . An approximation of the inverse covariance matrix reads:
(2) 
and making use of basic linear algebra, an approximation of is:
(3) 
2.2 Local ensemble transform Kalman filter
Localization is commonly used in the context of sequential data assimilation in order to mitigate the impact of spurious correlations in the assimilation process. In general, two forms of localization methods are used: covariance matrix localization and domain localization, both have proven to be equivalent [9]. In practice, covariance matrix localization can be very difficult owing to the explicit representation in memory of the ensemble covariance matrix. On the other hand, domain localization methods avoid spurious correlations by considering only observations within a given radius of influence : in the twodimensional case, each model component is surrounded by a local box of dimension and the information within the scope of (observed components and background error correlations) is used in the assimilation process and conversely, the information out the local box is discarded. In figure 1, local boxes for different radii of influence are shown. The red grid point is the one to be assimilated, blue points are used in the assimilation process while black points are discarded. Based on this idea, the local ensemble transform Kalman filter is proposed (LETKF) [6]
The global formulation of the LETKF is defined as follows: for a given background ensemble
(4) 
and ensemble perturbation matrix
(5) 
where is the number of model components, is the ensemble size, is the th ensemble member, for , is the ensemble mean, is the th dimensional vector whose components are all ones and denotes the outer product of two vectors, an estimated of the analysis error covariance matrix in the ensemble space reads:
(6a)  
where , is the linear observational operator, is the number of observed components and, is the estimated data error covariance matrix. The optimal weights in such space reads:  
(6b)  
therefore, the optimal perturbations can be computed as follows:  
(6c)  
from which, in model space, the analysis reads:  
(6d) 
The set of equations (6) are applied to each model component in order to compute the global analysis state.
2.3 Ensemble Kalman Filter Based On Modified Cholesky
In [10], the modified Cholesky decomposition is used in order to obtain sparse estimators of the inverse background error covariance matrix. the columns of matrix (5) are assumed normally distributed with moments:
(7) 
where is the true unknown background error covariance matrix. Denote by the vector holding the th model component across all the columns of matrix (5), for , following the analysis of section 2.1, i.e., , an estimate of the inverse background error covariance matrix reads:
(8) 
and similar to (3),
(9) 
Based on (1), the resulting estimator can be dense. This implies no conditional independence of model components in space which, in practice, can be quite unrealistic for model variables such as wind components, specific humidity and temperature. Thus, a more realistic approximation of implies a sparse estimator . Readily, the structure of depends on the structure of this is, on the nonzero coefficients from the regression problems (1). Consequently, if we want to force a particular structure on some of the coefficients in (1) must be set to zero. Thus, we can condition the predecessors of a particular model component to be inside the scope of some radius . This will depend on the manner how the model components are labeled. In practice, rowmajor and columnmajor formats are commonly used in the context of data assimilation but, other formats can be used in order to exploit particular features of model discretizations and/or dynamics. For instance, making use of rowmajor format, consider we want to compute the corresponding set of coefficients for the grid point 6 in figure 2 for . The local box surrounding the grid point 6 provides the model components inside the scope of . Readily, the predecessors of 6 are the model components labeled from 1 to 5 according to the labelling system utilized.
In general, the analysis increments of the EnKF reads:
(10) 
where is known as the analysis increment. According to the primal formulation of the EnKF, is used in order to compute the analysis correction:
(11) 
while, in the dual formulation is implicitly used:
(12) 
where
(13) 
is the matrix of perturbed observation with data error distribution , and . The primal approach can be employed making use of iterative solvers in order to solve the implicit linear system in (11). On the other hand, the dual approach relies most of its computation in the solution of the unitary triangular linear system in (13). In general, there are good linear solvers in the current literature, some of them wellknown and used in operational data assimilation such as the case of LAPACK [11] and CuBLAS [12]. Compact representation of matrices can be used as well in order to exploit the structures of and in terms of memory allocation.
3 Proposed parallel implementation of the ensemble Kalman filter based on modified Cholesky decomposition
We consider the use of domain decomposition in order to reduce the dimension of the data assimilation problem. To start, the domain is split according to a given number of subdomains. Typically, the number of subdomains matches the number of threads/processors involved in the assimilation process. In figures 2(a), 2(b) and 2(c) the domain is decomposed in 12, 20 and 80 equitable subdomains, respectively. With no loose of generality, consider the number of subdomains to be a multiple of . The total number of model components at each subdomain is but, in order to estimate , boundary information is needed which adds model grid points to the procedure of background covariance matrix estimation. For instance, figure 2(d) shows a domain decomposed in 16 subdomains, the blue dashed squares denote boundary information for two particular subdomains.
If we consider subdomains, at the th subdomain, for , the analysis reads:
where , and at subdomain : are the model components, is the linear observational operator, is the number of observed components in the subdomain, is the subset of perturbed observations, is the local inverse estimation of the background error covariance matrix and is the local dataerror covariance information. Thus, for all , the analysis subdomains (3) are computed, the boundary points are discarded and then, analysis points are mapped back onto the global domain. Readily, the dual approach can be used as well. One desired property of the proposed EnKF implementation is that boundary information is not exchanged during the assimilation process, each subdomain works independently in the estimation of and posterior assimilation of . In the Algorithm 1, the parallel ensemble Kalman filter based on modified Cholesky decomposition is detailed. The analysis step of this method is shown in the Algorithm 2 wherein, the model state is divided according to the number of subdomains and then, in parallel, information of the background ensemble, the observed components, the observation operator, the estimated data error correlations at each subdomain are utilized in order to perform the local assimilations. The analysis subdomains are then merged into the global analysis state as can be seen in line 11 of the Algorithm 2. Atomicity is not needed for this operation since analysis subdomains do not intersect owing to all information concerning to boundaries is discarded after the assimilation step. The local assimilation process is detailed in the Algorithm (3).
We are now ready to test our proposed parallel implementation of EnKF based on modified Cholesky decomposition.
4 Experimental Settings
In this section we study the performance of the proposed parallel ensemble Kalman filter based on modified Cholesky decomposition (PAREnKFMC). The experiments are performed using the atmospheric general circulation model SPEEDY [13, 14]. SPEEDY is a hydrostatic, spectral coordinate, spectral transform model in the vorticitydivergence form, with semiimplicit treatment of gravity waves. The number of layers in the SPEEDY model is 8 and the T63 model resolution ( grids) is used for the horizontal space discretization of each layer. Four model variables are part of the assimilation process: the temperature (), the zonal and the meridional wind components (), and the specific humidity (). The total number of model components is . The number of ensemble members is for all the scenarios. The model state space is approximately 6,274 times larger than the number of ensemble members (). The tests are performed on the super computer Blueridge cluster at the university of Virginia Tech. BlueRidge is a 408node Cray CS300 cluster. Each node is outfitted with two octacore Intel Sandy Bridge CPUs and 64 GB of memory, for a total of 6528 cores and 27.3 TB of memory systemwide.
Starting with the state of the system at time , the model solution is propagated in time over one year:
The reference solution is used to build a perturbed background solution:
(15) 
The perturbed background solution is propagated over another year to obtain the background solution at time :
(16) 
This model propagation attenuates the random noise introduced in (15) and makes the background state (16) consistent with the physics of the SPEEDY model. Then, the background state (16) is utilized in order to build an ensemble of perturbed background states:
(17) 
from which, after three months of model propagation, the initial ensemble is obtained at time :
Again, the model propagation of the perturbed ensemble ensures that the ensemble members are consistent with the physics of the numerical model.
The experiments are performed over a period of 24 days, where observations are taken every 2 days (). At time synthetic observations are built as follows:
The observation operators are fixed throughout the time interval. We perform experiments with several operators characterized by different proportions of observed components from the model state (). We consider four different values for : 0.50, 0.12, 0.06 and 0.04 which represent 50%, 12 %, 6 % and 4 % of the total number of model components, respectively. Some of the observational networks used during the experiments are shown in Figure 4 with their corresponding percentage of observed components from the model state.
The analyses of the PAREnKFMC are compared against those obtained making use of the LETKF implementation proposed by Hunt et al in [15, 6, 16] . The analysis accuracy is measured by the root mean square error (RMSE)
(18) 
where and are the reference and the analysis solutions at time , respectively, and is the number of assimilation times.
During the assimilation steps, the data error covariance matrices are used and therefore, no representativeness errors are involved during the assimilation. The different EnKF implementations are performed making use of FORTRAN and specialized libraries such as BLAS and LAPACK are used in order to perform the algebraic computations.
4.1 Influence of the localization radius on analysis accuracy
We study the accuracy of the proposed PAREnKFMC and the LETKF implementations for different radii of influence. The relations between the accuracy of the methods and the radii for 96 and for 768 processors are shown in Figures 5 and 6, respectively. The results reveal that the accuracy of the PAREnKFMC formulation can be improved by increasing the radius of influence . This implies that the impact of spurious correlations is mitigated when background error correlations are estimated via the modified Cholesky decomposition. However, the larger the radius of influence, the larger the local data assimilation problem to solve. This will demand more computational time which can be mitigated by increasing the number of processors during the assimilation step. On the other hand, in the LETKF context, since background error correlations are estimated based on the empirical moments of the ensemble, spurious correlations affect the analysis when . Consequently, localization radius sizes beyond this value decreases the performance of the LETKF.
4.2 Computational times for different numbers of processors
We compare the elapsed times and the accuracy of both implementations when the number of processors (subdomains) is increased. We vary the number of compute nodes from 6 (96 processors) to 128 (2,048 processors), fix the radius of influence at , and use an observational network with . The elapsed times for different numbers of computing nodes for the PAREnKFMC and LETKF are shown in Figure 7. As expected, the elapsed time of the LETKF is smaller than that of PAREnKFMC formulation since no covariance estimation is performed. Nevertheless, the difference between the elapsed times is small (in the order of seconds), while the PAREnKFMC results are more accurate than those obtained by the LETKF.
4.3 Influence of the number of processors (subdomains) on accuracy of PAREnKFMC analyses
An important concern to address in the PAREnKFMC formulation is how its accuracy is impacted when the number of processors (subdomains) is increased. As we mentioned before, the model domain is decomposed in order to speedup computations but not for increasing the accuracy of the method (i.e., the impact of spurious correlations can be small for small subdomain sizes) Two main reasons are that we have a wellconditioned estimated of and even more, the conditional independence of model components makes the subdomain size to have no impact in the accuracy of the PAREnKFMC. As can be seen in figure 8, for the specific humidity variable and values of and , the PAREnKFMC provides almost the same accurate results among all configurations. The small variations in the RMSE values of the PAREnKFMC obey to the synthetic data built at different processors during the assimilation step. For instance, the random number generators used in the experiments depends on the processors id and therefore, the exact synthetic data is not replicated when the number of processors is changed. In the LETKF context we obtain the exact same results for all configurations since it is a deterministic filter and even more, the assimilation is performed for each grid point in the subdomain.
Lastly, figure 9 shows an estimate of a local inverse background error covariance matrix for some subdomain. Figure 8(a) shows the nonzero coefficients in that particular subdomain, figure 8(b) reflects the structure of based on . Figures 8(c) and 8(d) show the estimated background error covariance matrix from two different perspectives. As is expected, the correlations are dissipated in space but, they still quite large as can be seen in figure 8(d). Intuitively, when the subdomain size is small, high correlations are present between model components owing to their proximity. On the other hand, when the subdomain size is large, more disipation is expected on the correlation waves of .
5 Future Work
We think there is an opportunity to exploit even more high performance computing tools in the context of PAREnKFMC. Here, most of the computational time is spent in the estimation of the coefficients in (1). The approximation of those coefficients is performed making use of the singular value decomposition (SVD) SVD implementations are highly proposed in the context of accelerating devices such as Many Core Intel (MIC) [17] and the Compute Unified Device Architecture (CUDA) [18]. Since the analysis corrections are computed at each subdomain independently, each processor (subdomain) can submit to a given device the information needed in order to solve the linear regression problem (1). Once the solution is computed, the device returns the coefficients to the processor which assembles the received information in . Generally speaking the process is as follows:

The domain is split according to processors (subdomains)

At each subdomain a local inverse estimation of the background error covariance matrix is computed:

Submit the vectors to the assigned device in order to compute the weights in the linear regression (1).

In the device, compute the coefficients making use of SVD.

The subdomain receives the coefficients from the device.


The nonzero coefficients are placed in their respective positions in .

Continue until the coefficients for all local components have been computed.

Perform the local assimilation.
6 Conclusions
An efficient and parallel implementation of the ensemble Kalman filter based on a modified Cholesky decomposition is proposed. The method exploits the conditional independence of model components in order to obtain sparse estimators of via the modified Cholesky decomposition. High performance computing can be used in order to speedup the assimilation process: the global domain is decomposed according to the number of processors (subdomains), at each subdomain a local estimator of the inverse background error covariance matrix is computed and the local assimilation process is carried out. Each subdomain is then mapped back onto the global domain where then, the global analysis is obtained. The proposed EnKF implementation is compared against the wellknown local ensemble transform Kalman filter (LETKF) making use of the Atmospheric General Circulation Model (SPEEDY) with the T63 resolution in the super computer cluster Blueridge at Virginia Tech. The number of processors is ranged from 96 to 2,048. The accuracy of the proposed EnKF outperforms that of the LETKF. Even more, the computational time of the proposed implementation differs in seconds of the parallel LETKF method in which no covariance estimation is performed. Finally, for the largest number of processors, the proposed method is 400 times faster than its serial theoretical implementation.
7 Acknowledgments
This work was supported in part by awards NSF CCF –1218454, AFOSR FA9550–12–1–0293–DEF, and by the Computational Science Laboratory at Virginia Tech.
References
 L. Nerger, W. Hiller, Software for ensemblebased data assimilation systemsâimplementation strategies and scalability, Computers & Geosciences 55 (2013) 110–118.
 V. Rao, A. Sandu, A timeparallel approach to strongconstraint fourdimensional variational data assimilation, Journal of Computational Physics 313 (2016) 583–593.
 Y. Liu, A. Weerts, M. Clark, H. Hendricks Franssen, S. Kumar, H. Moradkhani, D. Seo, D. Schwanenberg, P. Smith, A. Van Dijk, et al., Advancing data assimilation in operational hydrologic forecasting: progresses, challenges, and emerging opportunities, Hydrology and Earth System Sciences, 16 (10), 2012.

M. Zupanski, Theoretical
and Practical Issues of Ensemble Data Assimilation in Weather and Climate,
in: S. Park, L. Xu (Eds.), Data Assimilation for Atmospheric, Oceanic and
Hydrologic Applications, Springer Berlin Heidelberg, 2009, pp. 67–84.
doi:10.1007/9783540710561_3.
URL http://dx.doi.org/10.1007/9783540710561_3  A. A. Emerick, A. C. Reynolds, Ensemble smoother with multiple data assimilation, Computers & Geosciences 55 (2013) 3–15.

E. Ott, B. R. Hunt, I. Szunyogh, A. V. Zimin, E. J. Kostelich, M. Corazza,
E. Kalnay, D. J. Patil, J. A. Yorke,
A Local Ensemble
Kalman Filter for Atmospheric Data Assimilation, Tellus A 56 (5) (2004)
415–428.
doi:10.1111/j.16000870.2004.00076.x.
URL http://dx.doi.org/10.1111/j.16000870.2004.00076.x 
G. Evensen, The Ensemble
Kalman Filter: Theoretical Formulation and Practical Implementation, Ocean
Dynamics 53 (4) (2003) 343–367.
doi:10.1007/s1023600300369.
URL http://dx.doi.org/10.1007/s1023600300369 
P. J. Bickel, E. Levina,
Regularized estimation of
large covariance matrices, Ann. Statist. 36 (1) (2008) 199–227.
doi:10.1214/009053607000000758.
URL http://dx.doi.org/10.1214/009053607000000758 
P. Sakov, L. Bertino,
Relation Between Two
Common Localisation Methods for the EnKF, Computational Geosciences 15 (2)
(2011) 225–237.
doi:10.1007/s1059601092026.
URL http://dx.doi.org/10.1007/s1059601092026  E. D. Nino, A. Sandu, W. Deng, An ensemble kalman filter implementation based on modified cholesky decomposition for inverse covariance matrix estimation, arXiv preprint arXiv:1572695.

E. Anderson, Z. Bai, J. Dongarra, A. Greenbaum, A. McKenney, J. Du Croz,
S. Hammerling, J. Demmel, C. Bischof, D. Sorensen,
LAPACK: A Portable
Linear Algebra Library for Highperformance Computers, in: Proceedings of
the 1990 ACM/IEEE Conference on Supercomputing, Supercomputing ’90, IEEE
Computer Society Press, Los Alamitos, CA, USA, 1990, pp. 2–11.
URL http://dl.acm.org/citation.cfm?id=110382.110385  L. S. Blackford, J. Demmel, J. Dongarra, I. Duff, S. Hammarling, G. Henry, M. Heroux, L. Kaufman, A. Lumsdaine, A. Petitet, R. Pozo, K. Remington, R. C. Whaley, An Updated Set of Basic Linear Algebra Subprograms (BLAS), ACM Transactions on Mathematical Software 28 (2001) 135–151.

F. Molteni, Atmospheric
simulations using a gcm with simplified physical parametrizations. i: model
climatology and variability in multidecadal experiments, Climate Dynamics
20 (23) (2003) 175–191.
doi:10.1007/s0038200202682.
URL http://dx.doi.org/10.1007/s0038200202682 
F. Kucharski, F. Molteni, A. Bracco,
Decadal interactions
between the western tropical pacific and the north atlantic oscillation,
Climate Dynamics 26 (1) (2006) 79–91.
doi:10.1007/s0038200500855.
URL http://dx.doi.org/10.1007/s0038200500855  K. Terasaki, M. Sawada, T. Miyoshi, Local ensemble transform kalman filter experiments with the nonhydrostatic icosahedral atmospheric model nicam, SOLA 11 (0) (2015) 23–26.
 T. Miyoshi, M. Kunii, The local ensemble transform kalman filter with the weather research and forecasting model: experiments with real observations, Pure and applied geophysics 169 (3) (2012) 321–333.

M. Huang, C. Lai, X. Shi, Z. Hao, H. You,
Study
of Parallel Programming Models on Computer Clusters with Intel MIC
Coprocessors, International Journal of High Performance Computing
ApplicationsarXiv:http://hpc.sagepub.com/content/early/2015/04/10/1094342015580864.full.pdf+html,
doi:10.1177/1094342015580864.
URL http://hpc.sagepub.com/content/early/2015/04/10/1094342015580864.abstract  S. Lahabar, P. Narayanan, Singular Value Decomposition on GPU Using CUDA, in: IEEE International Symposium On Parallel Distributed Processing, 2009. IPDPS 2009., 2009, pp. 1–10. doi:10.1109/IPDPS.2009.5161058.