An Asynchronous Distributed Expectation Maximization Algorithm For Massive Data: The DEM Algorithm

Abstract

The family of Expectation-Maximization (EM) algorithms provides a general approach to fitting flexible models for large and complex data. The expectation (E) step of EM-type algorithms is time-consuming in massive data applications because it requires multiple passes through the full data. We address this problem by proposing an asynchronous and distributed generalization of the EM called the Distributed EM (DEM). Using DEM, existing EM-type algorithms are easily extended to massive data settings by exploiting the divide-and-conquer technique and widely available computing power, such as grid computing. The DEM algorithm reserves two groups of computing processes called workers and managers for performing the E step and the maximization step (M step), respectively. The samples are randomly partitioned into a large number of disjoint subsets and are stored on the worker processes. The E step of the DEM algorithm is performed in parallel on all the workers, and every worker communicates its results to the managers at the end of the local E step. The managers perform the M step after they have received results from a $\gamma$-fraction of the workers, where $\gamma$ is a fixed constant in $(0, 1]$. The sequence of parameter estimates generated by the DEM algorithm retains the attractive properties of EM: convergence of the sequence of parameter estimates to a local mode and linear global rate of convergence. Across diverse simulations focused on linear mixed-effects models, the DEM algorithm is significantly faster than competing EM-type algorithms while having a similar accuracy. The DEM algorithm maintains its superior empirical performance on a movie ratings database consisting of 10 million ratings.


Keywords: Divide-and-conquer; iterative computations; large and complex data; linear mixed-effects model; EM-type algorithm; message passing interface (MPI).

1 Introduction

Developing efficient generalizations of the EM algorithm is an active area of research. The monotonic ascent of the EM algorithm can be extremely slow, especially near the optimum. Many EM extensions increase the speed of convergence of EM while retaining its stability and simplicity and reduce to the original EM algorithm under certain assumptions. EM-type algorithms are very slow in massive data applications because the E step requires multiple passes through the whole data for each iteration. Memory limitations further worsen the efficiency of the E step. This has motivated a rich literature on online EM algorithms, which modify the E step using stochastic approximation. Our goal is to propose an asynchronous and distributed generalization of the EM called the DEM algorithm. It leads to easy extensions of EM-type algorithms using the divide-and-conquer technique. Distributed computations allow scalability to arbitrarily large data sets, and asynchronous computations minimize the communication cost of each iteration. Both features are key in maintaining the efficiency of the DEM algorithm in massive data applications while retaining the simplicity and stability of EM-type algorithms.

The EM algorithm has been extended by generalizing the missing data augmentation schemes or by developing efficient M steps. In most of these extensions, the E step uses data from every sample. Such extensions are inefficient in massive data settings for two main reasons. First, every iteration is time-consuming due to a large number of samples. Second, if the data require many machines for storage, then extensive communication among all the machines further increases the time of each iteration. Therefore, EM (Dempster et al., 1977) and the family of EM-type algorithms, such as ECM, ECME, AECM, PXEM, and DECME (Meng & Rubin, 1993; Liu & Rubin, 1994; Meng & van Dyk, 1997; Liu et al., 1998; He & Liu, 2012), are inefficient in massive data settings, due either to the time-consuming E step or to the communication cost. The same is also true for EM extensions that modify the M step by borrowing ideas from optimization (Lange, 1995; Jamshidian & Jennrich, 1997; Neal & Hinton, 1998; Salakhutdinov & Roweis, 2003; Varadhan & Roland, 2008; Yu, 2012).

Current EM extensions for massive data applications are based on stochastic approximation and fall in the online EM family (Titterington, 1984; Lange, 1995). Cappé & Moulines (2009) generalized online EMs to statistical models that have their complete-data likelihood in the curved exponential family. All online EMs use the data sequentially rather than in a batch and continuously update parameter estimates as data arrive. Online EMs modify the E step of the classical EM to an online E step that computes the conditional expectation of the complete-data log likelihood obtained using a small fraction of the full data. As a greater fraction of the full data is processed, the online E step increases in accuracy and yields results similar to those of the E step of the classical EM. The M step of an online EM is the same as that of the classical EM. Online EMs retain the simplicity of implementation of the classical EM but fail to retain the monotone ascent of the likelihood for each iteration. Developing online EMs for complex models is an active area of research (Cappé, 2011; Le Corff et al., 2011).
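
To make the online E step concrete, here is a minimal sketch in R for a toy two-component Gaussian mixture with known unit variances and equal weights, where only the component means are estimated; the model, the step-size sequence, and all variable names are illustrative assumptions rather than any particular published recipe.

    ## Online-EM sketch (toy two-component Gaussian mixture; known unit
    ## variances and equal weights; only the means are estimated).
    ## All settings below are illustrative assumptions.
    set.seed(1)
    x  <- sample(c(rnorm(5000, -2), rnorm(5000, 3)))  # stream of observations
    mu <- c(-1, 1)                                    # initial mean estimates
    s  <- cbind(c(0.5, 0.5), c(0.5, 0.5) * mu)        # running stats: E[w], E[w * x]

    for (t in seq_along(x)) {
      ## online E step: responsibilities for the single new observation
      d   <- dnorm(x[t], mean = mu)
      w   <- d / sum(d)
      eta <- 1 / (t + 1)                              # stochastic-approximation step size
      s   <- (1 - eta) * s + eta * cbind(w, w * x[t])
      mu  <- s[, 2] / s[, 1]                          # M step on the running averages
    }
    mu                                                # approaches c(-2, 3) as data accrue

The M step is unchanged from the batch EM; only the source of the expected sufficient statistics differs.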

EM extensions based on the divide-and-conquer technique provide an alternative to online EMs in massive data settings. The methods in this class divide the data into smaller disjoint subsets and perform E steps in parallel on the subsets. The E steps could be performed in parallel on different nodes in a cluster, threads of a graphics processing unit (GPU), or processors in a computer, which are generically called processes. The results of all the parallel E steps are combined into an objective for maximization in the M step. Typically, all processes communicate the conditional expectations of sufficient statistics to a common process, which combines them before performing the M step. A variety of such algorithms exist for mixture models (Nowak, 2003; Gu, 2008; Zhou et al., 2010; Suchard et al., 2010; Weng et al., 2011; Altinigneli et al., 2013; Chen et al., 2013; Lee et al., 2016; Fajardo & Liang, 2017). However, a general extension of the EM algorithm that is tuned for applications in grid computing environments and has theoretical convergence guarantees remains unknown. The main challenge here is to retain the generality of the E step while minimizing the computational bottleneck due to extensive communication among processes.

The DEM algorithm is designed for extending EM-type algorithms, which work on a single machine, to the distributed setting with minimal modifications. DEM first reserves $(k + m)$ processes for computations, where $k$ and $m$ are the numbers of workers and managers, respectively. The samples in the full data are randomly partitioned into $k$ disjoint subsets that are stored on the workers. The managers manage the communications among workers and track the progress of DEM, including maintaining the latest copies of the E step results received from all the worker processes. The E step of the DEM algorithm consists of performing the local E step in parallel on the worker processes, and the result of every local E step is communicated to the managers. For a $\gamma \in (0, 1]$, the managers receive results from a $\gamma$-fraction of the workers and perform the M step using an objective that depends on the $\gamma$-fraction of new E step results and the $(1 - \gamma)$-fraction of old E step results. The managers stop the DEM iterations when the likelihood has reached a local mode. The local E step on every subset is fast and free of any memory limitations. The asynchronous M step minimizes computational bottlenecks due to extensive communication among workers and managers.

The DEM algorithm is a generalization of the EM algorithm in that it reduces to a classical but distributed EM algorithm if $\gamma = 1$. DEM differs from the previous EM extensions in its distributed and fractional updates. Many EM-based model fitting methods in robust statistics also use only a fraction of the data; see, for example, Neykov et al. (2007). This idea differs from that of DEM, which uses results from every subset for each iteration but only a $\gamma$-fraction of these results are new. DEM's sequence of parameter estimates converges to a local mode under the theoretical setup of the classical EM. This is a major advantage relative to existing distributed EM extensions that are restricted to a particular class of models or likelihoods. Existing distributed EM extensions are special cases of the DEM depending on $\gamma$. For example, the distributed EM algorithms for mixture models are a special case of DEM if $\gamma = 1$, and DEM reduces to the Incremental EM (IEM) (Neal & Hinton, 1998) if $\gamma = 1/n$, where $n$ is the sample size. Our numerical experiments show that DEM is also easy to implement on a cluster of computers using a non-distributed implementation of an EM-type algorithm and the message passing interface (MPI) (Gabriel et al., 2004).

The run-time efficiency of DEM relative to its non-distributed version depends on $\gamma$ and $k$. DEM with $\gamma < 1$ requires more iterations to reach a local mode than its non-distributed version because it uses only a $\gamma$-fraction of the full data for each iteration. If $\gamma$ is close to 1, then the number of iterations required for convergence in DEM is very similar to that of the non-distributed version but the communication overhead is large; if $\gamma$ is close to 0, then the communication overhead is small but the number of iterations required for convergence is relatively large. If $k$ is chosen to be large enough that the local E steps finish quickly and the communication cost per iteration remains comparatively small, then the DEM algorithm can be faster than its non-distributed version for a broad range of $\gamma$. Empirically, DEM is more than two times faster than its non-distributed version for intermediate values of $\gamma$, which achieve the optimal balance between the decreased computational burden due to efficient local E steps and the increased number of iterations required for convergence.

2 Motivating example: MovieLens ratings data

The MovieLens data are one of the largest publicly available movie ratings data sets (http://grouplens.org). The database contains 10 million ratings for about 11 thousand movies by about 72 thousand users of the online movie recommender service MovieLens, where a user has rated multiple movies. An observation in the database includes a movie rating from 0.5 to 5 in increments of 0.5, the time of the rating, and the 18 genres to which the movie belongs. A user rating can be predicted using a linear mixed-effects model with movie genres as fixed and random covariates. A variety of efficient EM-type algorithms are available for fitting mixed-effects models; however, they are slow due to the time-consuming E step. Motivated by similar problems in using the state-of-the-art lme4 R package (Bates et al., 2013), Perry (2017) proposed a new approach for fitting such models using the method of moments.

We extended an ECME algorithm in van Dyk (2000) using the DEM algorithm to achieve roughly twofold speed-ups in fitting the linear mixed-effects model to the MovieLens data for three choices of the fraction $\gamma$. We reserved 20 worker processes and a manager process on a cluster, randomly split the users into 20 disjoint subsets, and stored their data on the workers. The E step was performed in parallel on the 20 workers, and the manager performed the M step. DEM was implemented in R (R Core Team, 2016) using the Rmpi package (Yu, 2002) and van Dyk's ECME algorithm. Our real data analysis, based on the examples considered in Perry (2017), showed that DEM matches the accuracy of van Dyk's ECME in parameter estimation while being significantly faster for all three values of $\gamma$; see Section 5. The major advantage of DEM was its generality in that it scaled the ECME algorithm to massive data settings using its non-distributed implementation and MPI.

3 Basic setup of DEM algorithm

Consider a general framework for implementing any iterative statistical algorithm when the data are stored on multiple processes in a distributed or grid computing environment. Assume that the distributed environment has a collection of $k$ processes that store the $k$ disjoint data subsets $y_1, \ldots, y_k$ and that are responsible for the computations in each iteration of the statistical algorithm. At the $t$-th iteration of the algorithm, process $i$ computes quantities $s_i^{(t)}$ determined by its assigned data subset $y_i$ and the current state $\theta^{(t-1)}$ of a common quantity shared by all the data subsets. In the context of the EM algorithm, $s_i^{(t)}$ is the conditional expectation of the sufficient statistics given $y_i$ and the current estimate of the parameter $\theta$. The common or population quantity is updated by

$\theta^{(t)} = \Psi\bigl( s_1^{(t)}, \ldots, s_k^{(t)} \bigr),$   (1)

where $s_i^{(t)} = s_i(y_i, \theta^{(t-1)})$, $i = 1, \ldots, k$, and $\Psi$ is a model-specific map that combines the subset summaries into the updated common quantity. This provides a general setup for any distributed iterative statistical algorithm, including the DEM.
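
As a toy instance of (1), the sketch below uses an EM for the mean of a Gaussian with some missing entries: each of $k$ processes computes a subset summary $s_i$ (here via lapply, standing in for genuinely parallel work), and a combining map plays the role of $\Psi$; the example, the function names s_i and Psi, and all settings are illustrative assumptions.

    ## Toy instance of update (1): EM for the mean of a Gaussian with
    ## missing entries, with the data split across k subsets.
    ## The summary s_i and the combining map Psi are illustrative assumptions.
    set.seed(1)
    k <- 4
    y <- replicate(k, {z <- rnorm(250, mean = 5); z[sample(250, 50)] <- NA; z},
                   simplify = FALSE)

    s_i <- function(y_i, theta)               # local E step summary for subset i
      c(sum = sum(ifelse(is.na(y_i), theta, y_i)), n = length(y_i))
    Psi <- function(s_list) {                 # manager-side combining map in (1)
      s <- Reduce(`+`, s_list)
      unname(s["sum"] / s["n"])
    }

    theta <- 0
    for (t in 1:25) theta <- Psi(lapply(y, s_i, theta = theta))
    theta                                     # converges to the mean of the observed entries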

Consider the Inter-Process Communication (IPC) scheme for implementing distributed statistical algorithms that can be described using (1). For process $i$ ($i = 1, \ldots, k$),

Scheme 1

Starting with $\theta^{(0)}$, iterate between the following two steps for $t = 1, 2, \ldots$.

  1. Compute $s_i^{(t)} = s_i(y_i, \theta^{(t-1)})$ and send it to all the other processes.

  2. Upon receiving the values $s_j^{(t)}$ for all $j \neq i$, evaluate (1) to obtain $\theta^{(t)}$.

Although conceptually simple, such a generic scheme fails in a distributed setting where communication between processes is time-consuming or even unreliable. Chen et al. (2013) proposed an EM extension for mixture models based on this scheme and employed the single-program-multiple-data parallelization technique to achieve efficiency; however, generalizations of this approach to other EM-type algorithms are unclear.

Statistical thinking in terms of imputation-and-analysis steps for iterative algorithms, such as EM, makes it appealing to consider distributed computing environments with a few manager processes for "analysis" and a large number of worker processes for "imputation." Accordingly, the version of Scheme 1 for a distributed environment with one manager process and $k$ worker processes is given as follows.

Scheme 2

This scheme consists of two iterative sub-schemes, one for the worker processes and another for the manager process. The manager process starts with $\theta^{(0)}$ and iterates between the following two steps for $t = 1, 2, \ldots$.

  1. Send $\theta^{(t-1)}$ to all the worker processes.

  2. Wait to receive $s_i^{(t)}$ from all $i = 1, \ldots, k$, and use (1) to obtain $\theta^{(t)}$.

The worker process $i$ iterates between the following two steps for $t = 1, 2, \ldots$.

  1. Wait to receive $\theta^{(t-1)}$ from the manager process.

  2. Compute $s_i^{(t)} = s_i(y_i, \theta^{(t-1)})$ and send it to the manager process.

Scheme 2 avoids the expensive communication overheads of Scheme 1, but algorithms implemented using this scheme can be dramatically slow if the computational burden on a few worker processes is large. Unfortunately, such events are typical in distributed environments where limited computational resources are shared by many processes.

Scheme 2 is adopted by most distributed implementations of EM. Zhou et al. (2010) and Suchard et al. (2010) have used a GPU framework, where threads are worker processes and no manager is required because the computations are performed on a single node. Efficiency is maintained by exploiting the GPU architecture and avoiding data copying between the GPUs and the host machine. Lee et al. (2016) use a multi-threading framework that does not require data copying and communication. These approaches are best suited for computations on shared-memory architectures. The E step in all these methods uses data from all the worker threads, which can be slow if one thread has a large computational burden.

An asynchronous modification of Scheme 2 allows the manager to update $\theta$ once it has received at least a pre-specified $\gamma$-fraction of updated $s_i$'s. The manager updates $\theta$ with the remaining $s_i$'s fixed at their most recent values. A simplified version of this scheme is as follows.

Scheme 3

Given $\gamma \in (0, 1]$ and denoting the most recent value of $s_i$ available to the manager as $\tilde{s}_i$, the manager process starts with $\theta^{(0)}$ and $\tilde{s}_i = s_i(y_i, \theta^{(0)})$ ($i = 1, \ldots, k$), and iterates between the following two steps for $t = 1, 2, \ldots$.

  1. Wait until having received at least a $\gamma$-fraction of updated $s_i$'s, storing each received value in $\tilde{s}_i$.

  2. Compute

    $\theta^{(t)} = \Psi(\tilde{s}_1, \ldots, \tilde{s}_k)$

    and send $\theta^{(t)}$ to all the worker processes.

The worker process $i$ iterates between the following two steps.

  1. Wait to receive the latest $\theta$ from the manager process.

  2. Compute $s_i$ at the received value of $\theta$ and send it to the manager process.

Scheme 3 has many desirable properties for implementing iterative statistical algorithms, such as EM, in distributed environments. The DEM algorithm provides a general framework for implementing any EM-type algorithm in distributed computing environments using Scheme 3. The formal definition of the DEM algorithm is given in Section 4.2.
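
To see Scheme 3 end to end, the following serial emulation (a sketch under assumed settings, not the paper's MPI implementation) reuses the toy missing-data EM above: each iteration refreshes the summaries of a random $\gamma$-fraction of the workers, and the manager combines the fresh summaries with the stale ones.

    ## Serial emulation of Scheme 3 for the toy missing-data EM above.
    ## A random gamma-fraction of workers refresh their summaries each
    ## iteration; the manager combines fresh and stale summaries.
    ## All settings are illustrative assumptions.
    set.seed(2)
    k <- 20; gamma <- 0.5
    y <- replicate(k, {z <- rnorm(500, mean = 5); z[sample(500, 100)] <- NA; z},
                   simplify = FALSE)

    local_E <- function(y_i, theta)                 # worker-side E step summary
      c(sum = sum(ifelse(is.na(y_i), theta, y_i)), n = length(y_i))
    M_step <- function(s_list) {                    # manager-side M step
      s <- Reduce(`+`, s_list)
      unname(s["sum"] / s["n"])
    }

    theta    <- 0
    s_latest <- lapply(y, local_E, theta = theta)   # managers' copies of all summaries
    for (t in 1:100) {
      A_t <- sample(k, size = ceiling(gamma * k))   # workers reporting this iteration
      s_latest[A_t] <- lapply(y[A_t], local_E, theta = theta)
      theta <- M_step(s_latest)                     # uses fresh and stale summaries
    }
    theta                                           # close to the mean of the observed entries

Setting gamma to 1 refreshes every summary each iteration and recovers the synchronous Scheme 2; smaller values of gamma trade staler summaries for cheaper iterations.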

4 Basic theory of DEM

4.1 Notation and background

Consider a general EM algorithm setup with $y$ representing the full data consisting of $n$ samples. The samples in the full data are randomly partitioned into $k$ disjoint subsets. Represent the data in subset $i$ as $y_i$, where $y_i$ contains $n_i \geq 1$ samples for every $i$ and $\sum_{i=1}^{k} n_i = n$, so the full data $y = (y_1, \ldots, y_k)$. The data subsets are stored separately on the $k$ workers. Let $f(y \mid \theta)$ be the density of $y$ based on its probability model parametrized by $\theta$ lying in some space $\Theta$. The log likelihood of $\theta$ given the observed data $y$ is

$\ell(\theta \mid y) = \log f(y \mid \theta) = \sum_{i=1}^{k} \ell_i(\theta \mid y_i),$   (2)

where $\ell_i(\theta \mid y_i)$ represents the contribution of the data on process $i$ to the log likelihood ($i = 1, \ldots, k$). The maximum likelihood estimate (MLE) of $\theta$ in the parameter space $\Theta$ is

$\hat{\theta} = \arg\max_{\theta \in \Theta} \ell(\theta \mid y).$   (3)

Finding $\hat{\theta}$ by direct maximization in (3) is difficult in many statistical applications. The EM algorithm simplifies this problem by augmenting "missing" data $z_i$ to $y_i$, yielding complete-data $(y_i, z_i)$ for subset $i$ ($i = 1, \ldots, k$). The joint density of the complete-data still depends on $\theta$, and marginalizing the missing-data from the joint yields the density of the observed data

$f(y_i \mid \theta) = \int f(y_i, z_i \mid \theta) \, dz_i, \qquad i = 1, \ldots, k.$   (4)

The EM algorithm maximizes (2) by iteratively maximizing a modified form of $\ell(\theta \mid y)$. Let $\theta^{(t)}$ represent the estimate of $\theta$ at the end of the $t$-th iteration of EM. The E step at the $(t+1)$-th iteration replaces $\log f(y, z \mid \theta)$ by its conditional expectation with respect to the conditional density of $z = (z_1, \ldots, z_k)$ given $y$ with parameter $\theta^{(t)}$, denoted as $f(z \mid y, \theta^{(t)})$, to obtain

$Q(\theta \mid \theta^{(t)}) = \sum_{i=1}^{k} Q_i(\theta \mid \theta^{(t)}), \qquad Q_i(\theta \mid \theta^{(t)}) = \mathrm{E}_{z_i \mid y_i, \theta^{(t)}} \{ \log f(y_i, z_i \mid \theta) \},$   (5)

where $\mathrm{E}_{z_i \mid y_i, \theta^{(t)}}$ represents expectation with respect to $f(z_i \mid y_i, \theta^{(t)})$, and $Q_i(\theta \mid \theta^{(t)})$ represents the contribution of worker $i$ to $Q(\theta \mid \theta^{(t)})$. The M step finds the $(t+1)$-th update of $\theta$ as

$\theta^{(t+1)} = \arg\max_{\theta \in \Theta} Q(\theta \mid \theta^{(t)}).$   (6)

Let $H(\theta \mid \theta^{(t)}) = Q(\theta \mid \theta^{(t)}) - \ell(\theta \mid y)$. Then, at the $t$-th iteration,

$H(\theta \mid \theta^{(t)}) = \sum_{i=1}^{k} H_i(\theta \mid \theta^{(t)}), \qquad H_i(\theta \mid \theta^{(t)}) = \mathrm{E}_{z_i \mid y_i, \theta^{(t)}} \{ \log f(z_i \mid y_i, \theta) \},$   (7)

where $H_i(\theta \mid \theta^{(t)})$ represents the contribution of worker $i$ to $H(\theta \mid \theta^{(t)})$. As a function of $\theta$, $H(\theta \mid \theta^{(t)})$ is maximized at $\theta = \theta^{(t)}$, so $\ell(\theta^{(t+1)} \mid y) \geq \ell(\theta^{(t)} \mid y)$ for $\theta^{(t+1)}$ defined in (6); see Theorem 1 of Dempster et al. (1977). Any version of the EM algorithm that ensures the ascent of $\ell(\theta \mid y)$ is called a generalized EM (GEM) algorithm (Dempster et al., 1977).
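
For completeness, the one-line calculation behind this ascent property, written with the decomposition $\ell(\theta \mid y) = Q(\theta \mid \theta^{(t)}) - H(\theta \mid \theta^{(t)})$ used above, is

$\ell(\theta^{(t+1)} \mid y) - \ell(\theta^{(t)} \mid y) = \bigl\{ Q(\theta^{(t+1)} \mid \theta^{(t)}) - Q(\theta^{(t)} \mid \theta^{(t)}) \bigr\} - \bigl\{ H(\theta^{(t+1)} \mid \theta^{(t)}) - H(\theta^{(t)} \mid \theta^{(t)}) \bigr\} \geq 0,$

where the first brace is non-negative by the M step in (6) and the second is non-positive because $H(\theta \mid \theta^{(t)})$ is maximized at $\theta = \theta^{(t)}$.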

Neal & Hinton (1998) present an alternative interpretation of the EM algorithm that greatly simplifies the theory and results related to DEM. They show that the E and M steps of any GEM algorithm respectively maximize a common functional , where represents an unknown density of parametrized by . The E step estimates by maximizing the objective functional

where denotes expectation with respect to the density . After some algebra, this reduces to

$F(\tilde{P}, \theta) = \ell(\theta \mid y) - \mathrm{KL} \{ \tilde{P}(z) \,\|\, f(z \mid y, \theta) \},$   (8)

where $\mathrm{KL}\{\tilde{P}(z) \,\|\, f(z \mid y, \theta)\}$ is the Kullback-Leibler (KL) divergence between $\tilde{P}$ and $f(z \mid y, \theta)$. Theorem 1 of Neal & Hinton (1998) shows that the E step in the $(t+1)$-th iteration maximizes $F$ in (8) by setting $\tilde{P}^{(t+1)}(z) = f(z \mid y, \theta^{(t)})$ for a fixed $\theta^{(t)}$, where $\tilde{P}(z) = \prod_{i=1}^{k} \tilde{P}_i(z_i)$ factors across the data subsets ($i = 1, \ldots, k$). The M step then maximizes $F$ with respect to $\theta$ for a fixed $\tilde{P}^{(t+1)}$. These two steps are repeated until convergence to the stationary point $(\tilde{P}^*, \theta^*)$. Theorem 2 of Neal & Hinton (1998) shows that if $F$ has a global or local maximum at $(\tilde{P}^*, \theta^*)$ and $\tilde{P}^*(z) = f(z \mid y, \theta^*)$, then $\ell(\theta \mid y)$ has a global or local maximum at $\theta^*$. Based on this observation, Neal & Hinton (1998) propose the IEM algorithm that cyclically updates $\tilde{P}_j$ and $\theta$ separately based on the $j$-th sample for $j = 1, \ldots, n$; see equations (7), (8), and (9) in Neal & Hinton (1998).

4.2 E and M steps of DEM

The DEM algorithm is an asynchronous and distributed generalization of the EM algorithm based on Scheme 3. In any iteration of the DEM algorithm, the managers maintain copies of $\theta$ and of each $Q_i$ based on the last communication with worker $i$ ($i = 1, \ldots, k$). Worker $i$ performs its local E step of the DEM algorithm using its data subset $y_i$, calculates $Q_i$, and returns its $Q_i$ to the managers. The M step is performed by the managers when they have received $Q_i$'s from $r$ processes such that $r \geq \gamma k$. After the M step, the managers send the updated $\theta$ to all the processes for the next iteration of the DEM algorithm. This process is repeated until convergence to the local mode $\theta^*$. If $\gamma = 1$, then DEM follows the synchronous Scheme 2 and reduces to the classical but distributed EM.

The DEM iterations are defined using the notation introduced in the previous section. Denote $Q(\theta \mid \theta^{(t)})$, $Q_i(\theta \mid \theta^{(t)})$, and $H_i(\theta \mid \theta^{(t)})$ respectively as the $Q$, $Q_i$, and $H_i$ functions, and $\tilde{Q}_i$ as the latest copy of $Q_i$ maintained by the managers for worker $i$ ($i = 1, \ldots, k$) at the $t$-th iteration. Write $\tilde{P}_i^{(t)}(z_i) = f(z_i \mid y_i, \theta^{(t_i)})$ for the conditional density at which worker $i$ last refreshed its $Q_i$, and $\tilde{P}^{(t)} = \prod_{i=1}^{k} \tilde{P}_i^{(t)}$. At the start of the $(t+1)$-th iteration, the managers send the current parameter estimate $\theta^{(t)}$ to all the processes, and DEM proceeds as follows.

E step:

For $i = 1, \ldots, k$, worker $i$ computes its $Q_i(\theta \mid \theta^{(t)})$ and returns the result to the managers.

M step:

The managers wait until they have received $Q_i$'s from $r$ workers, where $r$ is such that $r \geq \gamma k$. Once the managers are done with receiving, they calculate the $(t+1)$-th update for $\theta$ as

$\theta^{(t+1)} = \arg\max_{\theta \in \Theta} \Bigl\{ \sum_{i \in A_t} Q_i(\theta \mid \theta^{(t)}) + \sum_{i \notin A_t} \tilde{Q}_i(\theta) \Bigr\},$   (9)

where $A_t$ contains the indices of the processes that returned their $Q_i$'s to the managers and $|A_t| = r \geq \gamma k$. The managers send $\theta^{(t+1)}$ to all the workers for the next iteration, including the workers that did not return their $Q_i$'s to the managers.

Later we assume that every worker returns its $Q$-function to the managers infinitely often. Under this assumption, if we relabel the processes that returned their $Q_i$'s as $1, \ldots, r$ and the remaining processes as $r+1, \ldots, k$, then the M step in (9) reduces to

$\theta^{(t+1)} = \arg\max_{\theta \in \Theta} \Bigl\{ \sum_{i=1}^{r} Q_i(\theta \mid \theta^{(t)}) + \sum_{i=r+1}^{k} Q_i(\theta \mid \theta^{(t_i)}) \Bigr\},$   (10)

where $\theta^{(t_i)}$, with $t_i < t$, is the parameter value at which worker $i$ last computed its $Q_i$.

The E and M steps of DEM are repeated until the sequence $\{\theta^{(t)}\}$ converges. Theorem 4.1 proves that the sequence $\{F(\tilde{P}^{(t)}, \theta^{(t)})\}$ indeed converges.

Theorem 4.1

The sequence $\{F(\tilde{P}^{(t)}, \theta^{(t)})\}$ does not decrease in DEM; that is, $F(\tilde{P}^{(t+1)}, \theta^{(t+1)}) \geq F(\tilde{P}^{(t)}, \theta^{(t)})$ for $t = 0, 1, 2, \ldots$. If $F$ is bounded above, then $F(\tilde{P}^{(t)}, \theta^{(t)}) \to F^*$ as $t \to \infty$ for some $F^* < \infty$. In particular, the sequence converges monotonically to its limit $F^*$.

The proof is in the supplementary material along with the other proofs. This is a desirable but weaker result in that DEM fails to maintain the monotone ascent of the likelihood sequence $\{\ell(\theta^{(t)} \mid y)\}$. While DEM is not a GEM, Theorem 4.1 implies that there exists a likelihood subsequence that maintains the monotone ascent of the likelihood in that $\ell(\theta^{(t_{s+1})} \mid y) \geq \ell(\theta^{(t_s)} \mid y)$ for $s = 1, 2, \ldots$. If we define a weak-GEM to be an EM-type algorithm that maintains the monotone ascent of a likelihood subsequence, then DEM is a weak-GEM, in the same fashion as online EMs and the IEM.

The M step in (10) can be modified based on any efficient extension of the classical EM, such as ECM, ECME, or PX-EM. The proof of Theorem 4.1 implies that DEM maintains the monotone ascent for $F$ in (8). The $Q_i$ functions in (10) can be replaced by any other functions such that the update does not decrease $F$. For example, the $(t+1)$-th update for $\theta$ can be defined as

$\theta^{(t+1)} \in \Bigl\{ \theta \in \Theta : \sum_{i=1}^{r} Q_i(\theta \mid \theta^{(t)}) + \sum_{i=r+1}^{k} Q_i(\theta \mid \theta^{(t_i)}) \geq \sum_{i=1}^{r} Q_i(\theta^{(t)} \mid \theta^{(t)}) + \sum_{i=r+1}^{k} Q_i(\theta^{(t)} \mid \theta^{(t_i)}) \Bigr\},$   (11)

which guarantees $F(\tilde{P}^{(t+1)}, \theta^{(t+1)}) \geq F(\tilde{P}^{(t)}, \theta^{(t)})$ for $t = 0, 1, 2, \ldots$. We use this idea in our simulated and real data analyses to implement distributed extensions of the ECME algorithm. The IEM algorithm of Neal & Hinton (1998) is obtained by fixing $k = n$ and by modifying (9) so that $A_t = \{j\}$ for a single index $j$, where $j$ can be chosen randomly or in a specific order. This implies that IEM is a DEM if the $n$ samples are treated as $n$ subsets and $\gamma = 1/n$.

Wu (1983) shows that more regularity conditions are needed to guarantee that the EM sequence converges to $\theta^*$, a local mode or stationary point. Proving a similar result for DEM is difficult because DEM is not a GEM and the sequence $\{\theta^{(t)}\}$ in DEM depends on multiple previous iterates. We modify the arguments in Wu (1983) using $F$ as the ascent function to obtain the global convergence result for the DEM sequence. Define $\phi^{(t)} = (\tilde{P}^{(t)}, \theta^{(t)})$ for $t = 0, 1, 2, \ldots$. Our setup has the following assumptions:

  1. $\Theta$ is a subset of the $d$-dimensional Euclidean space $\mathbb{R}^d$.

  2. The set $\{\phi : F(\phi) \geq F(\phi^{(0)})\}$ is compact for any starting point of the $(\tilde{P}, \theta)$ sequence, denoted as $\phi^{(0)} = (\tilde{P}^{(0)}, \theta^{(0)})$, that satisfies $F(\phi^{(0)}) > -\infty$ and $\ell(\theta^{(0)} \mid y) > -\infty$.

  3. $F$ is continuous in $\phi$ and differentiable in the interior of its domain.

  4. $\theta^{(t)}$ is in the interior of $\Theta$ for any $t$.

  5. The first order differential of $Q(\theta \mid \theta')$ with respect to $\theta$ is continuous in $(\theta, \theta')$.

  6. Worker $i$ returns $Q_i$ to the managers infinitely often for every $i = 1, \ldots, k$ as $t \to \infty$.

Assumptions 1–5 follow from Wu (1983). Assumption 2 implies that the sequence $\{\phi^{(t)}\}$ is bounded. Assumptions 1–3 and the definition of $F$ imply that $F(\phi^{(t)})$ is bounded above for any $t$. Assumption 4 guarantees the existence of the derivatives of $Q$, $H$, and $\ell$ at $\theta^{(t)}$. Assumption 5 is used to show that the sequence $\{F(\phi^{(t)})\}$ converges monotonically to $F^* = F(\phi^*)$ for a stationary point $\phi^*$, as in Theorem 2 of Wu (1983). Assumption 6 ensures that DEM uses the full data as $t \to \infty$ and is used later in deriving the matrix rate of convergence of the DEM in the next section. These assumptions hold for the linear mixed-effects model used in our simulated and real data analysis.

The next theorem describes the convergence of the $\{\phi^{(t)}\}$ sequence, which implies the convergence of the $\{\theta^{(t)}\}$ sequence. Let $\mathcal{S}$ be the set of stationary points and $\mathcal{M}$ be the set of local maxima of $F$ in the interior of its domain. For a given $F^*$, define the sets $\mathcal{S}(F^*) = \{\phi \in \mathcal{S} : F(\phi) = F^*\}$ and $\mathcal{M}(F^*) = \{\phi \in \mathcal{M} : F(\phi) = F^*\}$.

Theorem 4.2

Suppose Assumptions 1–6 hold. Then,

  1. if $\mathcal{S}(F^*)$ (resp. $\mathcal{M}(F^*)$) $= \{\phi^*\}$, where $F^*$ is the limit of the $\{F(\phi^{(t)})\}$ sequence, then $\phi^{(t)} \to \phi^*$, implying that $\theta^{(t)} \to \theta^*$, where $\theta^*$ is a stationary point (resp. local maximum) of $\ell(\theta \mid y)$; and

  2. if $\mathcal{S}(F^*)$ (resp. $\mathcal{M}(F^*)$) is discrete and $\|\phi^{(t+1)} - \phi^{(t)}\| \to 0$ as $t \to \infty$, where $\|\cdot\|$ is a norm on the space of $\phi$, then $\phi^{(t)} \to \phi^*$ for some $\phi^*$ in $\mathcal{S}(F^*)$ (resp. $\mathcal{M}(F^*)$), implying that $\theta^{(t)} \to \theta^*$, where $\theta^*$ is a stationary point (resp. local maximum) of $\ell(\theta \mid y)$.

This theorem strengthens Theorem 2 in Neal & Hinton (1998) because it describes the convergence of the $\{\phi^{(t)}\}$ sequence, which is not implied by the convergence of the $\{F(\phi^{(t)})\}$ sequence. There are exceptional cases, due to uneven load sharing on the workers, where Assumption 6 can be violated. Theorem 4.2 is inapplicable in those cases.

4.3 Matrix rate of convergence of DEM

We compare the matrix rate of convergence of the DEM sequence to that of its non-distributed version from the managers' perspective. Distributed and asynchronous computations are an important component of DEM, but for simplicity we use the tools developed in Dempster et al. (1977), Meng (1994), and Liu et al. (1998) for deriving the matrix rate of convergence of DEM and assume that the cost of communication among workers and managers is negligible. We show that the difference between the rates of convergence of DEM and the classical EM depends on the observed-data and complete-data information matrices calculated using the $\gamma$-fraction and the $(1 - \gamma)$-fraction of the full data.

Consider an EM sequence $\{\theta^{(t)}\}$. Each EM iteration defines a mapping $M$ so that $\theta^{(t+1)} = M(\theta^{(t)})$. If $\theta^*$ is a fixed point of the sequence, then a Taylor expansion at $\theta^*$ gives $\theta^{(t+1)} - \theta^* \approx DM(\theta^*)(\theta^{(t)} - \theta^*)$, where $DM(\theta^*)$ is the gradient of the map $M$ evaluated at $\theta^*$ and $\theta^{(t)} \to \theta^*$ as $t \to \infty$. The matrices $DM(\theta^*)$ and $S(\theta^*) = I - DM(\theta^*)$, where $I$ is an identity matrix of appropriate dimension, are called the matrix rate of convergence and speed matrix of EM, respectively. Let $\sigma_{\min}(A)$ and $\sigma_{\max}(A)$ represent the minimum and maximum singular values of a matrix $A$. Then, $\sigma_{\max}\{DM(\theta^*)\}$ and $\sigma_{\min}\{S(\theta^*)\}$ respectively are the global rate and global speed of convergence of EM. Dempster et al. (1977) show that

$DM(\theta^*) = I - I_{\mathrm{obs}}(\theta^*) \, I_{\mathrm{com}}(\theta^*)^{-1},$   (12)

where $I_{\mathrm{obs}}(\theta^*)$ and $I_{\mathrm{com}}(\theta^*)$ are the observed-data and complete-data information matrices.
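
As a small numerical illustration of (12), the snippet below computes the rate and speed matrices and the corresponding global rate and speed from an assumed pair of information matrices; the matrices are made up for illustration and do not come from any model in the paper.

    ## Matrix rate of convergence from (12) for an assumed pair of
    ## information matrices (values are illustrative only).
    I_com <- matrix(c(10, 2,
                       2, 8), nrow = 2, byrow = TRUE)  # complete-data information
    I_obs <- matrix(c( 6, 1,
                       1, 5), nrow = 2, byrow = TRUE)  # observed-data information

    DM <- diag(2) - I_obs %*% solve(I_com)             # matrix rate of convergence
    S  <- diag(2) - DM                                 # speed matrix
    c(global_rate  = max(svd(DM)$d),                   # largest singular value of DM
      global_speed = min(svd(S)$d))                    # smallest singular value of S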

These techniques require modification before they can be applied to derive the rate of convergence of DEM. Assume that $r = \lceil \gamma k \rceil$ workers report in every M step and consider the DEM sequence $\{(\theta^{(t)}, \tilde{P}_1^{(t)}, \ldots, \tilde{P}_k^{(t)})\}$ at the managers, where $\tilde{P}_i^{(t)}$ corresponds to the value of $\tilde{P}_i$ at the $t$-th iteration. If $i \in A_t$, then $\tilde{P}_i^{(t)}(z_i) = f(z_i \mid y_i, \theta^{(t)})$; otherwise, $\tilde{P}_i^{(t)}(z_i) = f(z_i \mid y_i, \theta^{(t_i)})$ for some $t_i < t$ ($i = 1, \ldots, k$). Since the $A_t$'s are independent random sets of indices, the gradient of a DEM mapping such that $\theta^{(t+1)} = M_{\mathrm{DEM}}(\theta^{(t)})$ is not well-defined; therefore, we choose a subsequence of iterations corresponding to those $A_t$'s such that the M step in (9) has $A_t = \{1, \ldots, r\}$ and the objective is (10). This implies that the index sets used along the subsequence are equal. We represent this DEM subsequence as $\{(\theta^{(t_s)}, \tilde{P}_1^{(t_s)}, \ldots, \tilde{P}_k^{(t_s)})\}$ for $s = 1, 2, \ldots$. The Borel–Cantelli lemma implies the existence of this subsequence with positive probability, using Assumption 6 in Theorem 4.2 and the independence of the $A_t$'s over $t$.

Our rate of convergence results from the managers' perspective are derived by focusing on $\{\theta^{(t_s)}\}$, the first coordinate of the DEM subsequence $\{(\theta^{(t_s)}, \tilde{P}_1^{(t_s)}, \ldots, \tilde{P}_k^{(t_s)})\}$. The subsequence $\{\theta^{(t_s)}\}$ and the subsequences $\{\tilde{P}_i^{(t_s)}\}$ ($i = 1, \ldots, k$) converge because they are subsequences of the convergent DEM sequence; see Theorem 4.2. Since the DEM subsequence is a vector of convergent sequences, its limit is one of its fixed points. Let $M_{\mathrm{DEM}}$ denote the mapping defined by one step of this subsequence, with fixed point $(\theta^*, \tilde{P}_1^*, \ldots, \tilde{P}_k^*)$. Define the observed-data and complete-data information matrices for the EM and DEM sequences as

(13)

where $\mathrm{blockdiag}\{\cdots\}$ represents a block-diagonal matrix with its arguments along the diagonal. Using (12), the rate and speed matrices of the EM sequence follow from these information matrices. If $M_{\mathrm{DEM}}$ is the DEM map that maps $\theta^{(t_s)}$ to $\theta^{(t_{s+1})}$, then its rate and speed matrices are defined in the same way because the first coordinates of the DEM subsequence equal $\{\theta^{(t_s)}\}$.

The speed matrices of EM and DEM are related through the information matrices in (13). Let $I_{\mathrm{com}}^{1-\gamma}(\theta^*)$ and $I_{\mathrm{obs}}^{1-\gamma}(\theta^*)$ respectively denote the complete-data and observed-data information matrices obtained using the $(1-\gamma)$-fraction of the full data ignored by the DEM subsequence generated using $A = \{1, \ldots, r\}$. The analytic forms of the rate and speed matrices in (12) imply that the speed matrices of EM and DEM differ through $I_{\mathrm{com}}^{1-\gamma}(\theta^*)$ and $I_{\mathrm{obs}}^{1-\gamma}(\theta^*)$. The following theorem relates these information matrices to the global speeds of EM and DEM for the parameter sequences computed using the objective in (10) with $A_t = \{1, \ldots, r\}$ in the M step.

Theorem 4.3

Let $\{\theta^{(t_s)}\}$ be the DEM subsequence generated using $A = \{1, \ldots, r\}$. If Assumptions 1–6 hold and $\theta^{(t_s)}$ is in a small neighborhood around $\theta^*$, then the global speed of DEM is bounded below by the global speed of the EM that uses only the $\gamma$-fraction of the full data refreshed by the DEM subsequence.

We interpret $I_{\mathrm{obs}}^{1-\gamma}(\theta^*) \, I_{\mathrm{obs}}(\theta^*)^{-1}$ as the (matrix) fraction of observed-data information ignored by the DEM in its fractional updates. Since this quantity is a product of two positive semi-definite matrices, its eigenvalues are non-negative, and Theorem 4.3 implies that the global speed of DEM is no smaller than that of an EM based on the remaining information. With our interpretation of the ignored information, this says that DEM cannot be slower than an EM that only uses a $\gamma$-fraction of the full data. This is true even if the DEM and EM converge to different values.

5 Experiments

5.1 Setup

We evaluate the performance of DEM in fitting linear mixed-effects models in large sample settings using the setup in van Dyk (2000). Let $p$, $q$, $n$, $N$, and $n_i$ be the number of fixed effects, the number of random effects, the sample size, the total number of observations, and the number of observations for sample $i$ ($i = 1, \ldots, n$), so that $N = \sum_{i=1}^{n} n_i$. If $y_i = (y_{i1}, \ldots, y_{i n_i})^\top$ is the vector of observations for sample $i$, then

$y_i = X_i \beta + Z_i u_i + e_i, \qquad u_i \sim N_q(0, \Sigma), \qquad e_i \sim N_{n_i}(0, \sigma^2 I_{n_i}), \qquad i = 1, \ldots, n,$   (14)

where $X_i$ and $Z_i$ are known $n_i \times p$ and $n_i \times q$ matrices of fixed and random effects covariates, respectively, $\beta$ is the fixed effects parameter vector, $\sigma^2$ is the error variance parameter, $\Sigma$ is a $q \times q$ symmetric positive definite matrix, $u_i$ is the random effects vector for sample $i$ that follows a $q$-dimensional Gaussian distribution with mean $0$ and covariance parameter $\Sigma$, and $I_{n_i}$ is the $n_i$-by-$n_i$ identity matrix. van Dyk (2000) developed many efficient extensions of EM-type algorithms for the estimation of $(\beta, \Sigma, \sigma^2)$, but every extension is slow if $N$ is large due to the time-consuming E step.
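
To make the setup concrete, the snippet below simulates a small data set from model (14); the dimensions, parameter values, and the use of MASS::mvrnorm are illustrative assumptions, chosen for readability rather than to match the simulation settings of this section.

    ## Simulate from the linear mixed-effects model (14); all settings
    ## below are illustrative assumptions (small n, p, q for readability).
    library(MASS)                                   # for mvrnorm
    set.seed(3)
    n <- 50; p <- 3; q <- 2
    beta   <- c(1, -0.5, 2)                         # fixed effects
    Sigma  <- matrix(c(1, 0.3, 0.3, 0.5), q, q)     # random-effects covariance
    sigma2 <- 0.25                                  # error variance

    sim_one <- function(i) {
      n_i <- sample(5:15, 1)                        # observations for sample i
      X_i <- cbind(1, matrix(rnorm(n_i * (p - 1)), n_i))  # fixed-effects design
      Z_i <- X_i[, 1:q, drop = FALSE]               # random intercept and slope
      u_i <- mvrnorm(1, mu = rep(0, q), Sigma = Sigma)
      y_i <- drop(X_i %*% beta + Z_i %*% u_i) + rnorm(n_i, sd = sqrt(sigma2))
      list(y = y_i, X = X_i, Z = Z_i)
    }
    dat <- lapply(seq_len(n), sim_one)              # one list element per sample

For a DEM run, these $n$ samples would then be randomly partitioned into $k$ subsets, keeping all observations of a sample on the same worker, as described next.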

We extended van Dyk's ECME algorithm, referred to below as ECME, using DEM. We randomly partitioned the $n$ samples into $k$ disjoint subsets such that observations specific to a sample were in the same subset. The model in (14) satisfies Assumptions 1–5 in Theorem 4.2; see the supplementary material for details. DEM ran using one manager and $k$ worker processes. The E step of the ECME algorithm was split into $k$ local E steps of DEM on the workers, whereas the M step of DEM was performed using (9) on the manager. We chose three values of $\gamma$ to demonstrate the trade-off between the number of iterations required to reach a local mode, faster local E steps, and the communication overhead. DEM reduced to IEM when every sample formed its own subset and $\gamma = 1/n$, but convergence to the local mode was too slow, so we used DEM results with the smallest $\gamma$ as IEM results. The maximum number of iterations in any ECME, IEM, or DEM run was fixed in advance, and convergence to the local mode was declared when the change in log likelihood between two successive iterations fell below a small tolerance. We implemented ECME, IEM, and DEM algorithms in R and used , , and