Dictionary Learning over Distributed Models


Jianshu Chen,  Zaid J. Towfic,  and Ali H. Sayed,  Copyright (c) 2014 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to pubs-permissions@ieee.org. This work was supported in part by NSF grants CCF-1011918 and ECCS-1407712. A short and preliminary version of this work appears in the conference publication [1]. J. Chen is with Microsoft Research, Redmond, WA, 98052. Email: cjs09@ucla.edu. Zaid J. Towfic is with MIT Lincoln Laboratory, Lexington, MA. Email: ztowfic@ucla.edu. A. H. Sayed is with the Department of Electrical Engineering, University of California, Los Angeles, CA 90095. Email: sayed@ee.ucla.edu. This work was completed while J. Chen and Z. J. Towfic were PhD students at UCLA.
Abstract

In this paper, we consider learning dictionary models over a network of agents, where each agent is only in charge of a portion of the dictionary elements. This formulation is relevant in Big Data scenarios where large dictionary models may be spread over different spatial locations and it is not feasible to aggregate all dictionaries in one location due to communication and privacy considerations. We first show that the dual function of the inference problem is an aggregation of individual cost functions associated with different agents, which can then be minimized efficiently by means of diffusion strategies. The collaborative inference step generates dual variables that are used by the agents to update their dictionaries without the need to share these dictionaries or even the coefficient models for the training data. This is a powerful property that leads to an effective distributed procedure for learning dictionaries over large networks (e.g., hundreds of agents in our experiments). Furthermore, the proposed learning strategy operates in an online manner and is able to respond to streaming data, where each data sample is presented to the network once.

Index Terms

Dictionary learning, distributed model, diffusion strategies, dual decomposition, conjugate functions, image denoising, novel document detection, topic modeling, bi-clustering.

I Introduction and Related Work

Dictionary learning is a useful procedure by which dependencies among input features can be represented in terms of suitable bases [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]. It has found applications in many machine learning and inference tasks including image denoising [5, 6], dimensionality-reduction [7, 8], bi-clustering [9], feature-extraction and classification [10], and novel document detection [11]. Dictionary learning usually alternates between two steps: (i) an inference (sparse coding) step and (ii) a dictionary update step. The first step finds a sparse representation for the input data using the existing dictionary by solving, for example, a regularized regression problem, while the second step usually employs a gradient descent iteration to update the dictionary entries.

With the increasing complexity of various learning tasks, it is not uncommon for the size of the learning dictionaries to be demanding in terms of memory and computing requirements. It is therefore important to study scenarios where the dictionary is not necessarily available in a single central location but its components are possibly spread out over multiple locations. This is particularly true in Big Data scenarios where large dictionary components may already be available at separate locations and it is not feasible to aggregate all dictionaries in one location due to communication and privacy considerations. This observation motivates us to examine how to learn a dictionary model that is stored over a network of agents, where each agent is in charge of only a portion of the dictionary elements. Compared with other works, the problem we solve in this article is how to learn a distributed dictionary model, which is, for example, different from the useful work in [12] where it is assumed instead that each agent maintains the entire dictionary model.

In this paper, we first formulate a general dictionary learning problem, where the residual error function and the regularization function can assume different forms in different applications. As we shall explain, this form turns out not to be directly amenable to distributed implementations. However, when the regularization is strongly convex, we will show that the problem has a dual function that can be solved in a distributed manner using diffusion strategies [13, 14, 15, 16]. In this solution, the agents will not need to share their (private) dictionary elements but only the dual variable. Useful consensus strategies [17, 18, 19, 20] can also be used for the same purpose. However, since it has been shown that diffusion strategies have enhanced stability and learning abilities over consensus strategies [21, 22, 23], we will continue our presentation by focusing on diffusion strategies.

We will test our proposed algorithm on two important applications of dictionary learning: (i) novel document detection [11, 24, 25], and (ii) bi-clustering on microarray data [9]. A third application related to image denoising is considered in [1]. In the novel document detection problem [11, 24, 25], each learner receives documents associated with certain topics, and wishes to determine if an incoming document is associated with a topic that has already been observed in previous data. This application is useful, for example, in finance when a company wishes to mine news streams for factors that may impact stock prices. Another example is the mining of social media streams for topics that may be unfavorable to a company. In these applications, our algorithm is able to perform distributed non-negative matrix factorization tasks, with the residual metric chosen as the Huber loss function [26], and is able to achieve a high area under the receiver operating characteristic (ROC) curve. In the bi-clustering experiment, our algorithm is used to learn relations between genes and types of cancer. From the learned dictionary, the patients are subsequently clustered into groups corresponding to different manifestations of cancer. We show that our algorithm can obtain similar clustering results to those in [9], which relies instead on a batched (centralized) implementation.

Tasks | $f(u)$ | $h_{y_k}(y_k)$ | $h_{W_k}(W_k)$ | $\mathcal{W}$
Sparse SVD | $\frac{1}{2}\|u\|^2$ | $\gamma\|y_k\|_1 + \frac{\delta}{2}\|y_k\|^2$ | $0$ | $\|w_q\| \le 1$
Bi-Clustering | $\frac{1}{2}\|u\|^2$ | $\gamma\|y_k\|_1 + \frac{\delta}{2}\|y_k\|^2$ | $\beta\|W_k\|_1$ (a) | $\|w_q\| \le 1$
Nonnegative Matrix Factorization | Huber $\rho(u)$ (c) | $\gamma\|y_k\|_1 + \frac{\delta}{2}\|y_k\|^2 + I_{y_k \succeq 0}(y_k)$ (b) | $0$ | $\|w_q\| \le 1$, $W \succeq 0$
  • The notation $\|A\|_1$ is used to denote the sum of all absolute entries in the matrix $A$: $\|A\|_1 \triangleq \sum_{i,j} |A_{ij}|$, which is different from the conventional matrix $1$-norm defined as the maximum absolute column sum: $\max_j \sum_i |A_{ij}|$.

  • The notation $I_{y \succeq 0}(y)$ is defined as $0$ if $y \succeq 0$ and $+\infty$ otherwise. It imposes an infinite penalty on any negative entry appearing in the vector $y$. Since negative entries of $y_k$ are already penalized through this indicator inside $h_{y_k}(y_k)$, there is no need to penalize them again in the other terms.

  • The scalar Huber loss function is defined as $\rho(u) \triangleq \frac{1}{2}u^2$ for $|u| \le \kappa$ and $\rho(u) \triangleq \kappa|u| - \frac{1}{2}\kappa^2$ for $|u| > \kappa$, where $\kappa$ is a positive parameter (a code sketch follows the table).

TABLE I: Examples of tasks solved by the general formulation (1)–(2). The loss functions are illustrated in Fig. 2.
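For concreteness, the following is a minimal sketch of the Huber loss from the last bullet above, written for vector arguments; the function name and the default threshold value are illustrative choices, not notation from the paper.

```python
import numpy as np

def huber(u, kappa=1.0):
    """Entry-wise Huber loss: quadratic for |u| <= kappa, linear beyond."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= kappa,
                    0.5 * u**2,
                    kappa * np.abs(u) - 0.5 * kappa**2)
```

Because its derivative is clipped to $[-\kappa, \kappa]$, the Huber loss is robust to outliers; the same clipping reappears later as a box constraint on the dual variable in (35).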

The paper is organized as follows. In Section II, we introduce the dictionary learning problem over distributed models. In Section III, using the concepts of conjugate function and dual decomposition, we transform the original dictionary learning problem into a form that is amenable to distributed optimization. In Section IV, we test our proposed algorithm on two applications. In Section V we conclude the exposition.

II Problem Formulation

II-A General Dictionary Learning Problem

We seek to solve the following general form of a global dictionary learning problem over a network of $N$ agents connected by a topology:

$\min_{W}\;\; \mathbb{E}\big[ f\big(\boldsymbol{x}_t - W \boldsymbol{y}^o(\boldsymbol{x}_t)\big) + h_y\big(\boldsymbol{y}^o(\boldsymbol{x}_t)\big) \big] + h_W(W)$   (1)
$\mathrm{s.t.}\;\; W \in \mathcal{W}$   (2)

where $\mathbb{E}$ denotes the expectation operator, $\boldsymbol{x}_t$ is the $M \times 1$ input data vector at time $t$ (we use boldface letters to represent random quantities), $y$ is a $K \times 1$ coding vector defined further ahead as the solution to (7), and $W$ is an $M \times K$ dictionary matrix. Moreover, the $q$-th column of $W$, denoted by $w_q$, is called the $q$-th dictionary element (or atom), $f(u)$ in (1) denotes a differentiable convex loss function for the residual error, $h_y(y)$ and $h_W(W)$ are convex (but not necessarily differentiable) regularization terms on $y$ and $W$, respectively, and $\mathcal{W}$ denotes the convex constraint set on $W$. Depending on the application problem of interest, there are different choices for $f(u)$, $h_y(y)$, $h_W(W)$, and $\mathcal{W}$. Table I lists some typical tasks and the corresponding choices for these functions. In regular dictionary learning [6], the constraint set is

$\mathcal{W} = \big\{ W : \|w_q\|_2 \le 1, \;\; q = 1, \ldots, K \big\}$   (3)

and in applications of nonnegative matrix factorization [6] and novel document detection (topic modeling) [11], it is

$\mathcal{W} = \big\{ W : \|w_q\|_2 \le 1, \; q = 1, \ldots, K, \;\; W \succeq 0 \big\}$   (4)

where the notation $W \succeq 0$ means that each entry of the matrix $W$ is nonnegative. We note that if there is a constraint on $y$, it can be absorbed into the regularization factor $h_y(y)$ by including an indicator function of the constraint into this regularization term. For example, if $y$ is required to satisfy $\mathbb{1}^{\mathsf{T}} y = 1$, where $\mathbb{1}$ denotes the all-one vector, we can modify the original regularization by adding an additional indicator function:

$h_y(y) \leftarrow h_y(y) + I_{\mathcal{C}}(y), \qquad \mathcal{C} \triangleq \{ y : \mathbb{1}^{\mathsf{T}} y = 1 \}$   (5)

where the indicator function for a set $\mathcal{C}$ is defined as

$I_{\mathcal{C}}(y) \triangleq \begin{cases} 0, & y \in \mathcal{C} \\ +\infty, & y \notin \mathcal{C} \end{cases}$   (6)

The vector $y^o(x_t)$ in (1) is the solution to the following general inference problem for each input data sample $x_t$ at time $t$ for a given $W$ (the regular font for $x_t$ and $y$ denotes realizations of the random quantities $\boldsymbol{x}_t$ and $\boldsymbol{y}$):

$y^o(x_t) \triangleq \arg\min_{y}\; \big[ f(x_t - W y) + h_y(y) \big]$   (7)

Note that dictionary learning consists of two steps: the inference step (sparse coding), which computes $y^o(x_t)$ at each time $t$ via (7), and the dictionary update step (learning) in (1)–(2).
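To make the two steps concrete before distributing them, here is a compact centralized sketch of the inference step (7) under the illustrative choices $f(u) = \frac{1}{2}\|u\|^2$ and elastic-net regularization $h_y(y) = \gamma\|y\|_1 + \frac{\delta}{2}\|y\|^2$ from Table I; the solver is plain proximal gradient (ISTA), and all names and parameter values are ours.

```python
import numpy as np

def soft_threshold(v, thr):
    # Entry-wise soft-thresholding operator T_thr(v).
    return np.sign(v) * np.maximum(np.abs(v) - thr, 0.0)

def infer(x, W, gamma=0.1, delta=0.05, iters=500):
    """Solve y = argmin 0.5*||x - W@y||^2 + gamma*||y||_1 + (delta/2)*||y||^2
    by proximal gradient (ISTA); a centralized sketch of the inference step (7)."""
    mu = 1.0 / (np.linalg.norm(W, 2) ** 2 + delta)     # step within 1/L
    y = np.zeros(W.shape[1])
    for _ in range(iters):
        grad = -W.T @ (x - W @ y) + delta * y          # gradient of smooth part
        y = soft_threshold(y - mu * grad, mu * gamma)  # prox of gamma*||.||_1
    return y
```

The dictionary update would then take a (projected) stochastic gradient step on $W$ using the recovered coefficients; Sec. III-G derives the distributed version of that step.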

II-B Dictionary Learning over Networked Agents

Fig. 1: The data sample $x_t$ at time $t$ is available to a subset $\mathcal{N}_I$ of the agents in the network, and each agent $k$ is in charge of one sub-dictionary, $W_k$, and of the corresponding optimal sub-vector of coefficients estimated at time $t$, $y_k^o(x_t)$. Each agent can only exchange information with its immediate neighbors (and with itself). We use $\mathcal{N}_k$ to denote the set of neighbors of agent $k$.

Let the matrix $W$ and the vector $y$ be partitioned in the following block forms:

$W \triangleq [\, W_1 \;\; W_2 \;\; \cdots \;\; W_N \,], \qquad y \triangleq \mathrm{col}\{ y_1, y_2, \ldots, y_N \}$   (8)

where $W_k$ is an $M \times K_k$ sub-dictionary matrix and $y_k$ is a $K_k \times 1$ sub-vector. Note that the sizes of the sub-dictionaries add up to the total size of the dictionary, $K$, i.e.,

$\sum_{k=1}^{N} K_k = K$   (9)

Furthermore, we assume the regularization terms $h_y(y)$ and $h_W(W)$ admit the following decompositions:

$h_y(y) = \sum_{k=1}^{N} h_{y_k}(y_k), \qquad h_W(W) = \sum_{k=1}^{N} h_{W_k}(W_k)$   (10)

Then, the objective function of the inference step (7) can be written as

$Q(W, y; x_t) \triangleq f\Big( x_t - \sum_{k=1}^{N} W_k y_k \Big) + \sum_{k=1}^{N} h_{y_k}(y_k)$   (11)

We observe from (11) that the sub-dictionary matrices $\{W_k\}$ are linearly combined to represent the input data $x_t$. By minimizing over $y$, the first term in (11) helps ensure that the representation error for $x_t$ is small. The second term in (11), which usually involves a combination of $\ell_1$ and $\ell_2$ measures, as indicated in Table I, helps ensure that each of the resulting combination coefficients is sparse and small. We will make the following assumption regarding $h_{y_k}(y_k)$ throughout the paper.

Assumption 1 (Strongly convex regularization).

The regularization terms $h_{y_k}(y_k)$ are assumed to be strongly convex for $k = 1, \ldots, N$. ∎

This assumption will allow us to develop a fully distributed strategy that enables the sub-dictionaries $\{W_k\}$ and the corresponding coefficients $\{y_k\}$ to be stored and learned in a distributed manner over the network; each agent $k$ will infer its own $y_k$ and update its own sub-dictionary $W_k$ with limited interaction with its neighboring agents. Requiring $h_{y_k}(y_k)$ to be strongly convex is not restrictive since we can always add a small $\ell_2$ term to make it strongly convex. For example, in Table I, we add an $\ell_2$ term to the $\ell_1$ regularization so that the resulting $h_{y_k}(y_k)$ amounts to elastic net regularization, in the manner advanced in [7].
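A practical side benefit of the elastic net is that the added $\frac{\delta}{2}\|\cdot\|^2$ term preserves a closed-form proximal operator: soft-threshold, then scale. A two-line sketch, reusing soft_threshold from the sketch in Sec. II-A:

```python
def prox_elastic_net(v, mu, gamma, delta):
    # prox of mu*(gamma*||.||_1 + (delta/2)*||.||^2):
    # shrink by mu*gamma, then scale by 1/(1 + mu*delta).
    return soft_threshold(v, mu * gamma) / (1.0 + mu * delta)
```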

Figure 1 shows the assumed configuration of the knowledge and data distribution over the network. The sub-dictionaries $\{W_k\}$ can be interpreted as the “wisdom” that is distributed over the network, and which we wish to combine in a distributed manner to form a greater “intelligence” for interpreting the data $x_t$. Observe that we are allowing $x_t$ to be observed by only a subset, $\mathcal{N}_I$, of the agents. By having the dictionary distributed over the agents, we would then like to develop a procedure that enables these networked agents to find the global solutions to both the inference problem (7) and the learning problem (1)–(2) with interactions that are limited to their neighborhoods.

Fig. 2: Illustration of the loss functions and the elastic net regularization.

II-C Relation to Prior Work

II-C1 Model Distributed vs. Data Distributed

The problem we are solving in this paper is different from the useful work [27, 12] on distributed dictionary learning and from the traditional distributed learning setting [13, 16, 14, 28], where it is assumed that the entire dictionary $W$ is maintained by each agent or that individual data samples generated by the same distribution are observed by the agents at each time $t$. That is, these previous works study data-distributed formulations. What we study in this paper is a distributed solution where each agent is only in charge of a portion of the dictionary ($W_k$ for each agent $k$) and where the incoming data, $x_t$, is observed by only a subset of the agents. This scenario corresponds to a model-distributed (or dictionary-distributed) formulation. A different formulation is also considered in [29] in the context of distributing deep neural network (DNN) models over computer networks. In these models, each computer is in charge of a portion of the neurons in the DNN, and the computing nodes exchange their private activation signals. As we will see further ahead, our distributed model requires exchanging neither the private combination coefficients $\{y_k\}$ nor the sub-dictionaries $\{W_k\}$.

The distributed-model setting we are studying is important in practice because agents tend to be limited in their memory and computing power and may not be able to store large dictionaries locally. Even if the agents were powerful enough, different agents may still have access to different databases and different sources of information. Rather than aggregate the information in the form of large dictionaries at every single location, it is often more advantageous to keep the information distributed, due to the cost of exchanging large datasets and dictionary models, and also due to privacy considerations, since agents may not be in favor of sharing their private information.

II-C2 Distributed Basis Pursuit

Other useful related works appear in the studies [30, 31, 32] on distributed basis pursuit, which also rely on dual decomposition arguments. However, there are some key differences in problem formulation, generality, and technique, as explained in [33]. For example, the works [30, 31, 32] do not deal with dictionary learning problems and focus instead on the solution of special cases of the inference problem (7). Specifically, the problem formulations in [30, 31, 32] focus on determining sparse solutions to (underdetermined) linear systems of equations, which can be interpreted as corresponding to scenarios where the dictionaries are static and not learned from data. In comparison, in this article, we show how the inference and learning problems (7) and (1)–(2) can be jointly integrated into a common framework. Furthermore, our proposed distributed dictionary learning strategy is an online algorithm, which updates the dictionaries sequentially in response to streaming data. We also only require the data sample $x_t$ to be available to a subset of the agents (e.g., a single agent), while it is assumed in [30, 31, 32] that all agents have access to the same data $x_t$.

For instance, one of the problems studied in [30] is the following inference problem (compare with (7)):

$\min_{y}\; \|y\|_1$   (12a)
$\mathrm{s.t.}\;\; x = W y$   (12b)

This formulation can be recast as a special case of (7) by selecting:

$h_y(y) = \|y\|_1$   (13a)
$f(u) = I_{\{0\}}(u)$   (13b)

where $I_{\{0\}}(u)$ is the indicator function defined by:

$I_{\{0\}}(u) \triangleq \begin{cases} 0, & u = 0 \\ +\infty, & u \ne 0 \end{cases}$   (14)

where $\{0\}$ is the set consisting of the zero vector in $\mathbb{R}^M$. Equality constraints of the form (12b), or a residual function of the form (13b), are generally problematic for problems that require both learning and inference, since modeling and measurement errors usually seep into the data and the dictionary $W$ may not be able to represent the data $x$ accurately with a precise equality as in (12b). To handle the modeling error, the work [31] considered instead:

$\min_{y}\; \|y\|_1$   (15a)
$\mathrm{s.t.}\;\; \|x - W y\|_2 \le \epsilon$   (15b)

for some $\epsilon > 0$, which again can be viewed as a special case of problem (7) for the same $h_y(y)$ from (13a) and with the indicator function in (13b) replaced by $I_{\mathcal{B}_\epsilon}(u)$ relative to the set

$\mathcal{B}_\epsilon \triangleq \{ u : \|u\|_2 \le \epsilon \}$   (16)

An alternative problem formulation that removes the indicator functions is considered in [34, 31], namely,

$\min_{y}\; \tfrac{1}{2}\|x - W y\|_2^2 + \gamma \|y\|_1$   (17)

Here, we now have $f(u) = \tfrac{1}{2}\|u\|_2^2$ and $h_y(y) = \gamma\|y\|_1$. However, for problem (17), the dictionary elements, as well as the entries of $x$, were partitioned in [34, 31] by rows across the network, as opposed to our column-wise partitioning in (8):

$x \triangleq \mathrm{col}\{ x_1, \ldots, x_N \}, \qquad W \triangleq \mathrm{col}\{ W_1, \ldots, W_N \}$   (18)

In this case, it is straightforward to rewrite problem (17) in the form

$\min_{y}\; \sum_{k=1}^{N} \Big[ \tfrac{1}{2}\|x_k - W_k y\|_2^2 + \tfrac{\gamma}{N} \|y\|_1 \Big]$   (19)

which is naturally in a “sum-of-costs” form; such forms are directly amenable to distributed optimization and do not require transformations — see (20) further ahead. However, the more challenging problem where the matrix $W$ is partitioned column-wise as in (8), which leads to the “cost-of-sum” form shown earlier in (11), was not examined in [31, 34].

In summary, we will solve the more challenging problem of joint inference and dictionary learning (instead of inference alone under static dictionaries), under a column-wise partitioning of $W$ (rather than a row-wise partitioning), and with general penalty functions $f(u)$ and $h_y(y)$ (instead of the special indicator choices in (14) and (16)).

III Learning over Distributed Models

III-A “Cost-of-Sum” vs. “Sum-of-Costs”

We thus start by observing that the cost function (11) is a regularized “cost-of-sum”: it consists of two terms, where the first term has a sum of quantities $\{W_k y_k\}$ associated with different agents appearing as the argument of the function $f(\cdot)$, and the second term is a collection of separable regularization terms $\{h_{y_k}(y_k)\}$. This formulation is different from the classical “sum-of-costs” problem, which usually seeks to minimize a global cost function, $J^{\mathrm{glob}}(w)$, that is expressed as the aggregate sum of individual costs $J_k(w)$, say, as:

$\min_{w}\; J^{\mathrm{glob}}(w) \triangleq \sum_{k=1}^{N} J_k(w)$   (20)

The “sum-of-costs” problem (20) is amenable to distributed implementations [13, 14, 15, 16, 17, 18, 19, 20, 21]. In comparison, minimizing the regularized “cost-of-sum” problem in (11) directly would require each agent to have knowledge of all the sub-dictionaries $\{W_k\}$ and coefficients $\{y_k\}$. Therefore, this formulation is not directly amenable to the distributed techniques from [13, 14, 15, 16, 17, 18, 19, 20, 21]. In [35], the authors proposed a useful consensus-based primal-dual perturbation method to solve a similar constrained “cost-of-sum” problem for smart grid control. In their method, an averaging consensus step was used to compute the sum inside the cost. We follow a different route and arrive at a more efficient distributed strategy by transforming the original optimization problem into a dual problem that has the same form as (20) — see (30a)–(30b) further ahead — which can then be solved efficiently by means of diffusion strategies. The agents will not need to exchange any information beyond the dual variable, nor to employ a separate consensus step to evaluate the sum inside the cost in order to update their own sub-dictionaries.

III-B Inference over Distributed Models: A Dual Formulation

To begin with, we first transform the minimization of (11) into the following equivalent optimization problem by introducing a splitting variable $z$:

$\min_{y, z}\;\; f(x_t - z) + \sum_{k=1}^{N} h_{y_k}(y_k)$   (21a)
$\mathrm{s.t.}\;\; z = \sum_{k=1}^{N} W_k y_k$   (21b)

Note that the above problem is convex over both $y$ and $z$ since the objective is convex and the equality constraint is linear. Problem (21a)–(21b) is a convex optimization problem with linear constraints so that strong duality holds [36, p.514], meaning that the optimal solution to (21a)–(21b) can be found by solving its corresponding dual problem (see (22) below) and then recovering the optimal primal variables $y^o$ and $z^o$ (to be discussed in Sec. III-E):

$\nu^o \triangleq \arg\max_{\nu}\; g(\nu)$   (22)

where $g(\nu)$ is the dual function associated with the optimization problem (21a)–(21b), and is defined as follows. First, the Lagrangian over the primal variables $y$ and $z$ is given by

$L(y, z; \nu) = f(x_t - z) + \sum_{k=1}^{N} h_{y_k}(y_k) + \nu^{\mathsf{T}}\Big( z - \sum_{k=1}^{N} W_k y_k \Big)$   (23)

Then, the dual function $g(\nu)$ can be expressed as:

$g(\nu) = \inf_{y, z}\; L(y, z; \nu)$   (24)
$\overset{(a)}{=} \nu^{\mathsf{T}} x_t - f^\star(\nu) - \sum_{k=1}^{N} h_{y_k}^\star(W_k^{\mathsf{T}} \nu)$   (25)

where in step (a) we introduced $u \triangleq x_t - z$, and $f^\star$ and $h_{y_k}^\star$ are the conjugate functions of $f$ and $h_{y_k}$, respectively, with the corresponding domains denoted by $\mathcal{V}$ and $\mathcal{V}_k$, respectively. We note that the conjugate function (or Legendre-Fenchel transform [37, p.37]), $f^\star(\nu)$, for a function $f(u)$ is defined as [38, pp.90-95]:

$f^\star(\nu) \triangleq \sup_{u}\; \big[ \nu^{\mathsf{T}} u - f(u) \big]$   (26)

where the domain $\mathcal{V}$ is defined as the set of $\nu$ where the above supremum is finite. The conjugate function $f^\star(\nu)$ and its domain $\mathcal{V}$ are convex regardless of whether $f(u)$ is convex or not [36, p.530][38, p.91]. In particular, the domain of the conjugate is the entire space if the original function is strongly convex [37, p.82]. Now since $h_{y_k}(y_k)$ is assumed in Assumption 1 to be strongly convex, the domain of $h_{y_k}^\star$ is the entire $\mathbb{R}^{K_k}$. If $f(u)$ happens to be strongly convex (rather than only convex, e.g., if $f(u) = \frac{1}{2}\|u\|^2$), then $\mathcal{V}$ would also be $\mathbb{R}^M$; otherwise it is a convex subset of $\mathbb{R}^M$. Therefore, the dual function in (25) becomes

$g(\nu) = \nu^{\mathsf{T}} x_t - f^\star(\nu) - \sum_{k=1}^{N} h_{y_k}^\star(W_k^{\mathsf{T}} \nu), \qquad \nu \in \mathcal{V}$   (27)
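The conjugate pairs used later in Table II can be checked numerically against the definition (26). The sketch below compares the closed-form conjugate of the scalar elastic net $h(y) = \gamma|y| + \frac{\delta}{2}y^2$, namely $h^\star(\nu) = \frac{1}{2\delta}T_\gamma(\nu)^2$ with $T_\gamma$ the soft-threshold, against a brute-force evaluation of the supremum; the closed form is a standard result, and the script is only a sanity check with illustrative parameter values.

```python
import numpy as np

gamma, delta = 0.1, 0.05

def h(y):                        # scalar elastic net
    return gamma * np.abs(y) + 0.5 * delta * y**2

def h_conj_closed(nu):           # (1/(2*delta)) * T_gamma(nu)^2
    t = np.sign(nu) * np.maximum(np.abs(nu) - gamma, 0.0)
    return t**2 / (2.0 * delta)

ys = np.linspace(-100, 100, 200001)       # brute-force grid for the sup in (26)
for nu in [-1.0, -0.2, 0.0, 0.3, 2.0]:
    brute = np.max(nu * ys - h(ys))
    print(nu, brute, h_conj_closed(nu))   # the two values should agree
```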

Now, maximizing $g(\nu)$ is equivalent to minimizing $-g(\nu)$, so that the dual problem (22) is equivalent to

$\min_{\nu}\;\; f^\star(\nu) - \nu^{\mathsf{T}} x_t + \sum_{k=1}^{N} h_{y_k}^\star(W_k^{\mathsf{T}} \nu)$   (28a)
$\mathrm{s.t.}\;\; \nu \in \mathcal{V}$   (28b)

Note that the objective function in the above optimization problem is an aggregation of (i) individual costs $h_{y_k}^\star(W_k^{\mathsf{T}}\nu)$ associated with the sub-dictionaries at different agents (last term in (28a)), (ii) a term associated with the data sample $x_t$ (second term in (28a)), and (iii) a term that is the conjugate function of the residual cost (first term in (28a)). In contrast to (11), the cost function in (28a) is now in a form that is amenable to distributed processing. In particular, diffusion strategies [39, 14, 21], consensus strategies [17, 18, 19, 20], or ADMM strategies [30, 31, 40, 41, 42, 33] can now be applied to obtain the optimal dual variable $\nu^o$ in a distributed manner at the various agents.

To arrive at the distributed solution, we proceed as follows. We denote the set of agents that observe the data sample $x_t$ by $\mathcal{N}_I$. Motivated by (28a), with each agent $k$, we associate the local cost function:

$J_k(\nu) \triangleq \begin{cases} \dfrac{1}{|\mathcal{N}_I|}\big[ f^\star(\nu) - \nu^{\mathsf{T}} x_t \big] + h_{y_k}^\star(W_k^{\mathsf{T}} \nu), & k \in \mathcal{N}_I \\ h_{y_k}^\star(W_k^{\mathsf{T}} \nu), & k \notin \mathcal{N}_I \end{cases}$   (29)

where $|\mathcal{N}_I|$ denotes the cardinality of $\mathcal{N}_I$. Then, the optimization problem (28a)–(28b) can be rewritten as

$\min_{\nu}\;\; \sum_{k=1}^{N} J_k(\nu)$   (30a)
$\mathrm{s.t.}\;\; \nu \in \mathcal{V}$   (30b)

In Sections III-C and III-D, we will first discuss the solution of (30a)–(30b) for the optimal dual variable, $\nu^o$, in a distributed manner. Then, in Sec. III-E, we will show how to recover the optimal primal variables $y^o$ and $z^o$ from $\nu^o$.

III-C Inference over Distributed Models: Diffusion Strategies

Note that the new equivalent form (30a) is an aggregation of individual costs associated with different agents; each cost $J_k(\nu)$ only requires knowledge of $\{W_k, x_t\}$. Consider first the case in which $f(u)$ is strongly convex. Then, it holds that $\mathcal{V} = \mathbb{R}^M$ and problem (30a)–(30b) becomes an unconstrained optimization problem of the same general form as problems studied in [16, 15]. Therefore, we can directly apply the diffusion strategies developed in these works to solve (30a)–(30b) in a fully distributed manner. The adapt-then-combine (ATC) implementation of the diffusion algorithm then takes the following form:

$\psi_{k,i} = \nu_{k,i-1} - \mu\, \nabla_{\nu} J_k(\nu_{k,i-1})$   (31a)
$\nu_{k,i} = \sum_{\ell \in \mathcal{N}_k} a_{\ell k}\, \psi_{\ell,i}$   (31b)

where $\nu_{k,i}$ denotes the estimate of the optimal $\nu^o$ at agent $k$ at iteration $i$ (we will use $i$ to denote the $i$-th iteration of the inference, and use $t$ to denote the $t$-th data sample), $\psi_{k,i}$ is an intermediate variable, $\mathcal{N}_k$ denotes the neighborhood of agent $k$, $\mu$ is the step-size parameter chosen to be a small positive number, and $a_{\ell k}$ is the combination coefficient that agent $k$ assigns to the information received from agent $\ell$, and it satisfies

$a_{\ell k} \ge 0, \qquad \sum_{\ell=1}^{N} a_{\ell k} = 1, \qquad a_{\ell k} = 0 \;\;\mathrm{if}\;\; \ell \notin \mathcal{N}_k$   (32)

Let $A$ denote the $N \times N$ matrix that collects $a_{\ell k}$ as its $(\ell, k)$-th entry. Then, it is shown in [16] that as long as the matrix $A$ is doubly-stochastic (i.e., it satisfies $A\mathbb{1} = \mathbb{1}$ and $A^{\mathsf{T}}\mathbb{1} = \mathbb{1}$) and $\mu$ is selected such that

$0 < \mu < \dfrac{2}{L}$   (33)

where $L$ is the Lipschitz constant of the gradient of $J_k(\nu)$ (if $J_k(\nu)$ is twice-differentiable, the Lipschitz gradient condition (34) is equivalent to requiring an upper bound on the Hessian of $J_k(\nu)$, i.e., $\nabla^2 J_k(\nu) \le L\, I_M$):

$\|\nabla_{\nu} J_k(\nu_1) - \nabla_{\nu} J_k(\nu_2)\| \le L\, \|\nu_1 - \nu_2\|, \qquad \forall\, \nu_1, \nu_2$   (34)

then algorithm (31a)–(31b) converges to a fixed point that is $O(\mu^2)$ away from the optimal solution of (30a) in squared Euclidean distance.
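The following self-contained sketch simulates (31a)–(31b) for the dual problem (30a)–(30b), under the illustrative choices $f(u) = \frac{1}{2}\|u\|^2$ (so that $f^\star(\nu) = \frac{1}{2}\|\nu\|^2$, $\mathcal{V} = \mathbb{R}^M$, and $\nabla f^\star(\nu) = \nu$) and elastic-net $h_{y_k}$ (so that $\nabla h_{y_k}^\star(a) = \frac{1}{\delta}T_\gamma(a)$, cf. Table II). The ring topology, the uniform weights, and all parameter values are assumptions of the sketch, not prescriptions from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, Kk = 20, 10, 5                  # data dim, # of agents, atoms per agent
gamma, delta = 0.1, 0.05
W = [rng.standard_normal((M, Kk)) / np.sqrt(M) for _ in range(N)]
x = rng.standard_normal(M)            # one data sample x_t
NI = {0}                              # agents that observe x_t

# Ring topology with uniform weights: A is doubly stochastic.
A = np.zeros((N, N))
for k in range(N):
    A[(k - 1) % N, k] = A[(k + 1) % N, k] = 1.0 / 3.0
    A[k, k] = 1.0 / 3.0

def T(v, thr):                        # entry-wise soft threshold
    return np.sign(v) * np.maximum(np.abs(v) - thr, 0.0)

def gradJ(k, nu):
    # Gradient of h*_{y_k}(W_k^T nu) plus, for observing agents, the data term.
    g = W[k] @ (T(W[k].T @ nu, gamma) / delta)
    if k in NI:
        g += (nu - x) / len(NI)
    return g

L = 1.0 + max(np.linalg.norm(Wk, 2) ** 2 for Wk in W) / delta
mu = 1.0 / L                          # step size within the bound (33)
nu = np.zeros((N, M))                 # one dual iterate per agent
for i in range(3000):
    psi = np.array([nu[k] - mu * gradJ(k, nu[k]) for k in range(N)])  # (31a)
    nu = A.T @ psi                                                    # (31b)
print("disagreement:", np.max(np.std(nu, axis=0)))
```

After convergence, every agent holds essentially the same estimate of $\nu^o$, even though only agent 0 ever saw $x_t$, and no agent exchanged its sub-dictionary $W_k$ or its coefficients $y_k$.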

Tasks | $f^\star(\nu)$ | $\mathcal{V}$ | $h_{y_k}^\star(\nu_k)$, $\nu_k \triangleq W_k^{\mathsf{T}}\nu$ | $y_k^o$
Sparse SVD | $\frac{1}{2}\|\nu\|^2$ | $\mathbb{R}^M$ | $S_{\gamma,\delta}(\nu_k)$ (b) | $\frac{1}{\delta} T_{\gamma}(\nu_k)$ (a)
Bi-Clustering | $\frac{1}{2}\|\nu\|^2$ | $\mathbb{R}^M$ | $S_{\gamma,\delta}(\nu_k)$ | $\frac{1}{\delta} T_{\gamma}(\nu_k)$
Nonnegative Matrix Factorization | $\frac{1}{2}\|\nu\|^2$ | $\{ \nu : \|\nu\|_\infty \le \kappa \}$ | $S^{+}_{\gamma,\delta}(\nu_k)$ (d) | $\frac{1}{\delta} T^{+}_{\gamma}(\nu_k)$ (c)

  • $T_{\gamma}(x)$ denotes the entry-wise soft-thresholding operator on the vector $x$: $[T_{\gamma}(x)]_n = \mathrm{sgn}(x_n)\max\{|x_n| - \gamma, 0\}$, where $\gamma > 0$.

  • $S_{\gamma,\delta}(x)$ is the function defined by $S_{\gamma,\delta}(x) \triangleq \frac{1}{2\delta}\|T_{\gamma}(x)\|^2$ for $x \in \mathbb{R}^{K_k}$.

  • $T^{+}_{\gamma}(x)$ denotes the entry-wise one-sided soft-thresholding operator on the vector $x$: $[T^{+}_{\gamma}(x)]_n = \max\{x_n - \gamma, 0\}$.

  • $S^{+}_{\gamma,\delta}(x)$ is defined by $S^{+}_{\gamma,\delta}(x) \triangleq \frac{1}{2\delta}\|T^{+}_{\gamma}(x)\|^2$ for $x \in \mathbb{R}^{K_k}$.

  • The functions $T_{\gamma}$, $S_{\gamma,\delta}$, $T^{+}_{\gamma}$, and $S^{+}_{\gamma,\delta}$ for the case of a scalar argument are illustrated in Fig. 3.

TABLE II: Conjugate functions used in this paper for different tasks

Consider now the case in which the constraint set $\mathcal{V}$ is not equal to $\mathbb{R}^M$ but is still known to all agents. This is a reasonable requirement. In general, we need to solve the supremum in (26) to derive the expression for $f^\star(\nu)$ and to determine the set $\mathcal{V}$ that makes the supremum in (26) finite. Fortunately, this step can be pursued in closed form for many typical choices of $f(u)$. We list in Table II the results that will be used in Sec. IV; part of these results are derived in Appendix A and the rest is from [38, pp.90-95]. Usually, the sets $\mathcal{V}$ for these typical choices of $f(u)$ are simple sets whose projection operators (the projection operator onto a set $\mathcal{V}$ is defined as $\Pi_{\mathcal{V}}(\nu) \triangleq \arg\min_{v \in \mathcal{V}} \|\nu - v\|$) can be found in closed form — see also [43]. For example, the projection operator onto the set

$\mathcal{V} = \{ \nu \in \mathbb{R}^M : \|\nu\|_\infty \le \kappa \}$   (35)

that is listed in the third row of Table II is given by

$[\Pi_{\mathcal{V}}(\nu)]_m = \mathrm{sgn}(\nu_m) \min\{ |\nu_m|, \kappa \}$   (36)

where $\nu_m$ denotes the $m$-th entry of the vector $\nu$ and $[\Pi_{\mathcal{V}}(\nu)]_m$ denotes the $m$-th entry of the vector $\Pi_{\mathcal{V}}(\nu)$. Once the constraint set $\mathcal{V}$ is found, it can be enforced either by incorporating local projections onto $\mathcal{V}$ into the combination step (31b) at each agent [44] or by using the penalized diffusion method [45]. For example, the projection-based strategy replaces (31a)–(31b) by:

$\psi_{k,i} = \nu_{k,i-1} - \mu\, \nabla_{\nu} J_k(\nu_{k,i-1})$   (37a)
$\nu_{k,i} = \Pi_{\mathcal{V}}\Big[ \sum_{\ell \in \mathcal{N}_k} a_{\ell k}\, \psi_{\ell,i} \Big]$   (37b)

where $\Pi_{\mathcal{V}}[\cdot]$ is the projection operator onto $\mathcal{V}$.
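When $f(u)$ is the Huber loss, the projection (36) is just an entry-wise clipping, so the projected combination step (37b) costs essentially nothing extra. A sketch, reusing A and psi from the simulation above (kappa is the Huber parameter):

```python
import numpy as np

def project_linf(nu, kappa):
    # Projection (36) onto V = {nu : ||nu||_inf <= kappa}: clip to [-kappa, kappa].
    return np.clip(nu, -kappa, kappa)

# Projected combine step (37b), applied row-wise to the agents' iterates:
# nu = project_linf(A.T @ psi, kappa=1.0)
```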

III-D Inference over Distributed Models: ADMM Strategies

An alternative approach to solving the dual inference problem (30a)–(30b) is the distributed alternating direction method of multipliers (ADMM) [30, 31, 46, 40, 41]. Depending on the configuration of the network, there are different variations of distributed ADMM strategies. For example, the method proposed in [40] relies on a set of bridge nodes for the distributed interactions among agents, and the method in [30, 31] uses a graph-coloring approach to partition the agents in the network into different groups and lets the optimization process alternate between the groups, with one group of agents engaged at a time. In [41] and [46], the authors developed ADMM strategies that adopt Jacobian-style updates with all agents engaged in the computation concurrently. Below, we describe the Jacobian-ADMM strategy from [46, p.356] and briefly compare it with the diffusion strategies.

The Jacobian-ADMM strategy solves (30a)–(30b) by first transforming it into the following equivalent optimization problem:

$\min_{\{\nu_k\}}\;\; \sum_{k=1}^{N} J_k(\nu_k)$   (38a)
$\mathrm{s.t.}\;\; \nu_k = \nu_\ell, \quad \ell \in \mathcal{N}_k, \;\; k = 1, \ldots, N$   (38b)

where the cost function is decoupled among the different $\{\nu_k\}$ and the constraints are coupled through the neighborhoods. Then, the following recursion is used to solve (38a)–(38b):

$\nu_{k,i} = \arg\min_{\nu}\; \Big\{ J_k(\nu) + \sum_{\ell=1}^{N} e_{\ell k} \Big[ \lambda_{k\ell,i-1}^{\mathsf{T}} \nu + \frac{c}{2} \big\| \nu - \tfrac{1}{2}(\nu_{k,i-1} + \nu_{\ell,i-1}) \big\|^2 \Big] \Big\}$   (39a)
$\lambda_{k\ell,i} = \lambda_{k\ell,i-1} + \frac{c}{2}\big( \nu_{k,i} - \nu_{\ell,i} \big), \quad \ell \in \mathcal{N}_k \setminus \{k\}$   (39b)

where $c > 0$ is a penalty parameter, $\lambda_{k\ell,i}$ is the multiplier associated with the constraint $\nu_k = \nu_\ell$, and $e_{\ell k}$ is the $(\ell, k)$-th entry of the adjacency matrix $E$ of the network, which is defined as:

$e_{\ell k} = \begin{cases} 1, & \ell \in \mathcal{N}_k \setminus \{k\} \\ 0, & \mathrm{otherwise} \end{cases}$   (40)
Fig. 3: Illustration of the functions $T_{\gamma}$, $S_{\gamma,\delta}$, $T^{+}_{\gamma}$, and $S^{+}_{\gamma,\delta}$.
Fig. 4: Comparison between the ADMM strategy and the diffusion strategy. The diffusion strategy has two time scales while the ADMM strategy may have three. The first time scale is the dictionary update over the data stream (see Sec. III-G), the second time scale is the iterative algorithm that solves the inference problem for each data sample $x_t$, and the third time scale in ADMM is the inner solver for the “argmin” in (39a).

From recursion (39a)–(39b), we observe that ADMM requires solving a separate optimization problem (the “argmin” in (39a)) at each ADMM step. This inner problem generally requires an iterative algorithm when it cannot be solved in closed form, which adds a third time scale to the algorithm, as explained in [33] in the context of dictionary learning. This situation is illustrated in Fig. 4. The need for a third time scale usually translates into requiring faster processing at the agents between data arrivals, which can be a hindrance for adaptation in real time.

III-E Recovery of the Primal Variables

Returning to the diffusion solution (31a)–(31b) or (37a)–(37b), once the optimal dual variable $\nu^o$ has been estimated by the various agents, the optimal primal variables $y^o$ and $z^o$ can now be recovered uniquely if $h_{y_k}(y_k)$ and $f(u)$ are strongly convex. In this case, the infima in (24) are attained and become minima. As a result, the optimal primal variables can be recovered via

$y_k^o = \arg\min_{y_k}\; \big[ h_{y_k}(y_k) - (W_k^{\mathsf{T}} \nu^o)^{\mathsf{T}} y_k \big]$   (41)
$z^o \overset{(a)}{=} x_t - \arg\min_{u}\; \big[ f(u) - (\nu^o)^{\mathsf{T}} u \big]$   (42)

where step (a) performs the variable substitution $u \triangleq x_t - z$. By (41)–(42), we obtain the optimal solutions of (21a)–(21b) (and also of the original inference problem (7)) after first solving the dual problem (22). For many typical choices of $f(u)$ and $h_{y_k}(y_k)$, the solutions of (41)–(42) can be expressed in closed form in terms of $\nu^o$. In Table II, we list the results that will be used later in Sec. IV, with the derivation given in Appendix A.
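For the elastic-net regularizer, the minimization in (41) has the closed form $y_k^o = \frac{1}{\delta} T_\gamma(W_k^{\mathsf{T}} \nu^o)$ (cf. Table II). Continuing the diffusion sketch from Sec. III-C, each agent can therefore recover its own coefficients locally, without any exchange:

```python
import numpy as np

def recover_yk(Wk, nu_opt, gamma=0.1, delta=0.05):
    # (41) for the elastic net h_{y_k}: a one-line closed form, purely local.
    a = Wk.T @ nu_opt
    return np.sign(a) * np.maximum(np.abs(a) - gamma, 0.0) / delta

# y_opt = [recover_yk(W[k], nu[k]) for k in range(N)]
```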

The strong convexity of $h_{y_k}(y_k)$ and $f(u)$ is needed if we want to uniquely recover $y^o$ and $z^o$ from the dual problem (22). As we will show further ahead in (56), the quantities $\{y_k^o\}$ are always needed in the dictionary update. For this reason, we assumed in Assumption 1 that the $\{h_{y_k}(y_k)\}$ are strongly convex, which can always be satisfied by means of elastic net regularization, as explained earlier. On the other hand, depending on the application, the recovery of $z^o$ is not always needed and neither is the strong convexity of $f(u)$ (in these cases, it is sufficient to assume that $f(u)$ is convex). For example, as explained in [1], the image denoising application requires the recovery of $z^o$ as the final reconstructed image. On the other hand, the novel document detection application discussed further ahead does not require the recovery of $z^o$ but only the maximum value of the dual function, $g(\nu^o)$, which, by strong duality, is equal to the minimum value of the cost function in (21a) and that of (7).

III-F Choice of Residual and Regularization Functions

In Table III, we list several typical choices for the residual function, $f(u)$, and the regularization functions, $h_{y_k}(y_k)$. In general, a careful choice of $f(u)$ and $h_{y_k}(y_k)$ can make the dual cost (28a) better conditioned than the primal cost (21a). Recall that the primal cost (21a) may not be differentiable due to the choice of $h_{y_k}(y_k)$ (e.g., the elastic net). However, if $f(u)$ is chosen to be strictly convex with Lipschitz gradients and the $h_{y_k}(y_k)$ are chosen to be strongly convex (not necessarily differentiable), then the conjugate function $f^\star(\nu)$ will be a differentiable strongly convex function with Lipschitz gradients and the $h_{y_k}^\star(\nu_k)$ will be differentiable convex functions with Lipschitz gradients [37, pp.79-84]. Adding $f^\star(\nu)$ and the $\{h_{y_k}^\star(W_k^{\mathsf{T}}\nu)\}$ together in (28a) essentially transforms a non-differentiable primal cost (21a) into a differentiable strongly convex dual cost (28a) with Lipschitz gradients. As a result, algorithms that optimize the dual problem (28a)–(28b) can generally enjoy a fast (geometric) convergence rate [47, 16, 22].

III-G Distributed Dictionary Updates

Now that we have shown how the inference task (7) can be solved in a distributed manner, we move on to explain how the local sub-dictionaries $\{W_k\}$ can be updated through the solution of the stochastic optimization problem (1)–(2), which is rewritten as:

$\min_{\{W_k\}}\;\; \mathbb{E}\Big[ f\Big( \boldsymbol{x}_t - \sum_{k=1}^{N} W_k\, \boldsymbol{y}_k^o(\boldsymbol{x}_t) \Big) + \sum_{k=1}^{N} h_{y_k}\big(\boldsymbol{y}_k^o(\boldsymbol{x}_t)\big) \Big] + \sum_{k=1}^{N} h_{W_k}(W_k)$   (43a)
$\mathrm{s.t.}\;\; W_k \in \mathcal{W}_k, \quad k = 1, \ldots, N$   (43b)

where the loss function inside the expectation is given by (11) evaluated at $y = y^o(x_t)$, the decomposition for $h_W(W)$ from (10) is used, and we assume the constraint set $\mathcal{W}$ can be decomposed into a set of constraints $\{\mathcal{W}_k\}$ on the individual sub-dictionaries $\{W_k\}$; this condition usually holds for typical dictionary learning applications — see Table I. Problem (43a)–(43b) can also be written as the following unconstrained optimization problem by introducing indicator functions for the sets $\{\mathcal{W}_k\}$:

$\min_{\{W_k\}}\;\; \mathbb{E}\Big[ f\Big( \boldsymbol{x}_t - \sum_{k=1}^{N} W_k\, \boldsymbol{y}_k^o(\boldsymbol{x}_t) \Big) + \sum_{k=1}^{N} h_{y_k}\big(\boldsymbol{y}_k^o(\boldsymbol{x}_t)\big) \Big] + \sum_{k=1}^{N} \Big[ h_{W_k}(W_k) + I_{\mathcal{W}_k}(W_k) \Big]$   (44)

Note that the cost function in (44) consists of two parts: the first term is differentiable with respect to $W_k$ (note from (11) that the loss depends on $W_k$ through $f(\cdot)$, which is assumed to be differentiable), while the second term, if present, is non-differentiable but usually consists of simple components — see Table I. A typical approach to optimizing cost functions of this type is the proximal gradient method [48, 49, 50, 43], which applies a gradient descent step to the first, differentiable, part followed by a proximal operator for the second, non-differentiable, part. This method is known to converge faster than applying the subgradient descent method to both parts. However, the proximal gradient methods in [48, 49, 50, 43] are developed for deterministic optimization, where the exact form of the objective function is known. In contrast, our objective function in (44) assumes a stochastic form and is unknown beforehand because the statistical distribution of the data $\boldsymbol{x}_t$ is not known. Therefore, our strategy is to apply the proximal gradient method to the cost function in (44) and remove the expectation operator to obtain an instantaneous approximation to the true gradient; this is the approach typically used in adaptation [51, 22, 21] and stochastic approximation [52]:

$W_{k,t} = \mathrm{prox}_{\mu_w [\, h_{W_k} + I_{\mathcal{W}_k} ]}\Big( W_{k,t-1} - \mu_w\, \nabla_{W_k} f\Big( x_t - \sum_{\ell=1}^{N} W_{\ell,t-1}\, y_\ell^o(x_t) \Big) \Big)$   (45)

where $\mu_w > 0$ is a step-size parameter and $\mathrm{prox}_{g}(\cdot)$ denotes the proximal operator of a function $g$.
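As a concrete instance of (45), suppose $h_{W_k} = 0$ and $\mathcal{W}_k$ constrains each atom to the unit ball as in (3); the proximal operator then reduces to a column-wise projection. Moreover, the smooth-part gradient with respect to $W_k$ is $-\nabla f(u^o)\,(y_k^o)^{\mathsf{T}}$, and the minimizer in (42) satisfies $\nabla f(u^o) = \nu^o$, so the step can be written in terms of the locally available quantities $\nu^o$ and $y_k^o$. The sketch below is our illustration of one such update under these assumptions, not a verbatim transcription of the paper's recursion:

```python
import numpy as np

def dict_update(Wk, nu_opt, yk_opt, mu_w=0.01):
    # One instantaneous proximal-gradient step on (44) for agent k, assuming
    # h_{W_k} = 0: gradient step W_k + mu_w * nu * y_k^T (since the gradient of
    # the smooth part is -grad_f(u) y_k^T with grad_f(u) = nu at the optimum),
    # followed by projecting each column onto the unit ball, cf. (3).
    Wk = Wk + mu_w * np.outer(nu_opt, yk_opt)
    norms = np.maximum(np.linalg.norm(Wk, axis=0), 1.0)
    return Wk / norms
```

Note that the update touches only agent k's own sub-dictionary and uses only the dual variable and the local coefficients, which is what makes the overall procedure fully distributed.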

Recursion (45) is effective as long as the proximal operator of