Scalable Online Convolutional Sparse Coding

Yaqing Wang, Quanming Yao, James T. Kwok, and Lionel M. Ni. Y. Wang, Q. Yao and J. T. Kwok are with the Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Hong Kong. L. M. Ni is with the Department of Computer and Information Science, University of Macau, Macau.
Abstract

Convolutional sparse coding (CSC) improves sparse coding by learning a shift-invariant dictionary from the data. However, existing CSC algorithms operate in the batch mode and are expensive, in terms of both space and time, on large data sets. In this paper, we alleviate these problems by using online learning. The key is a reformulation of the CSC objective so that convolution can be handled easily in the frequency domain and much smaller history matrices are needed. We use the alternating direction method of multipliers (ADMM) to solve the resultant optimization problem, and the ADMM subproblems have efficient closed-form solutions. Theoretical analysis shows that the learned dictionary converges to a stationary point of the optimization problem. Extensive experiments show that convergence of the proposed method is much faster and its reconstruction performance is also better. Moreover, while existing CSC algorithms can only run on a small number of images, the proposed method can handle at least ten times more images.

I Introduction

In recent years, sparse coding has been widely used in signal processing [1, 2] and computer vision [3, 4]. In sparse coding, each data sample is represented as a weighted combination of a few atoms from an over-complete dictionary learned from the data. Despite its popularity, sparse coding cannot capture shifted local patterns that are common in image samples. Often, it has to first extract overlapping image patches, which is analogous to manually convolving the dictionary with the samples. As each sample element (e.g., an image pixel) is contained in multiple overlapping patches, the separately learned representations may not be consistent. Moreover, the resultant representation is highly redundant [5].

Convolutional sparse coding (CSC) addresses this problem by learning a shift-invariant dictionary composed of many filters. Local patterns at translated positions of the samples are easily extracted by convolution, which eliminates the need for generating overlapping patches. Each sample is approximated by the sum of a set of filters convolved with the corresponding codes. The learned representations are consistent as they are obtained together. CSC has been used successfully in various image processing applications such as super-resolution image reconstruction [6], high dynamic range imaging [7], and image denoising and inpainting [8]. It is also popular in biomedical applications, e.g., cell identification [9], calcium image analysis [10], tissue histology classification [11] and segmentation of curvilinear structures [12]. CSC has also been used in audio processing applications such as piano music transcription [13].

A number of approaches have been proposed to solve the optimization problem in CSC. In the pioneering deconvolutional network (DeconvNet) [14], simple gradient descent is used. As convolution is slow in the spatial domain, fast convolutional sparse coding (FCSC) [5] formulates CSC in the frequency domain, and the alternating direction method of multipliers (ADMM) [15] is used to solve the resultant optimization problem. Its most expensive operation is the inversion of a convolution-related linear operator. To alleviate this problem, convolutional basis pursuit denoising (CBPDN) [16] exploits a special structure of the dictionary, while the global consensus ADMM (CONSENSUS) [17] utilizes the matrix inverse lemma to simplify computations. Fast and flexible convolutional sparse coding (FFCSC) [8] further introduces mask matrices so as to handle incomplete samples that are common in image/video inpainting and demosaicking applications. Note that all these algorithms operate in the batch mode (i.e., all the samples/codes have to be accessed in each iteration). Hence they can become expensive, in terms of both space and time, on large data sets.

In general, online learning has been commonly used to improve the scalability of machine learning algorithms [18, 19]. While batch learning algorithms train the model only after the whole data set has arrived, online learning algorithms observe the samples sequentially and update the model incrementally. Moreover, data samples need not be stored after being processed. This can significantly reduce the algorithm’s time and space complexities. In the context of sparse coding, an efficient online algorithm is proposed in [20]. In each iteration, the information necessary for the dictionary update is summarized in fixed-sized history matrices. The space complexity of the algorithm is thus independent of the sample size. Recently, this has also been extended to large-scale matrix factorization [21].

However, though CSC is similar to sparse coding, the online sparse coding algorithm in [20] cannot be directly used. This is because convolution in CSC needs to be performed in the frequency domain for efficiency. Moreover, the sizes of the history matrices depend on the dimensionality of the sparse codes, which is much larger in CSC than in sparse coding. Storing the resultant history matrices can be computationally infeasible.

In this paper, we propose a scalable online CSC algorithm for large data sets. The algorithm, which will be called Online Convolutional Sparse Coding (OCSC), is inspired by the online sparse coding algorithm of [20]. It avoids the above-mentioned problems by reformulating the CSC objective so that convolution can be handled easily in the frequency domain and much smaller history matrices are needed. We use ADMM to solve the resultant optimization problem. It will be shown that the ADMM subproblems have efficient closed-form solutions. Consequently, to process a given number of samples, OCSC has the same time complexity as state-of-the-art batch CSC methods but requires much less space. Empirically, as OCSC updates the dictionary after coding each sample, it converges much faster than batch CSC methods. Theoretical analysis shows that the learned dictionary converges to a stationary point of the optimization problem. Extensive experiments show that convergence of the proposed method is much faster and its reconstruction performance is also better. Moreover, while existing CSC algorithms can only run on a small number of images, the proposed method can handle at least ten times more images.

The rest of the paper is organized as follows. Section II briefly reviews online sparse coding, the ADMM, and batch CSC methods. Section III describes the proposed online convolutional sparse coding algorithm. Experimental results are presented in Section IV, and the last section gives some concluding remarks.

Notations: For a vector $a$, its $i$th element is denoted $a_i$, its $\ell_1$-norm is $\|a\|_1 = \sum_i |a_i|$, its $\ell_2$-norm is $\|a\|_2 = \sqrt{\sum_i a_i^2}$, and $\text{Diag}(a)$ reshapes $a$ to a diagonal matrix with elements $a_i$'s. Given another vector $b$, the convolution $a * b$ is a vector $c$, with $c_i = \sum_j a_j b_{i-j+1}$. For a matrix $A$ with elements $A_{ij}$'s, $\text{vec}(A)$ stacks the columns of $A$ into a vector. Given another matrix $B$, the Hadamard product $A \odot B$ has elements $(A \odot B)_{ij} = A_{ij} B_{ij}$. The identity matrix is denoted $I$, and the conjugate transpose of $A$ is denoted $A^H$.

The Fourier transform that maps from the spatial domain to the frequency domain is denoted $\mathcal{F}(\cdot)$, and $\mathcal{F}^{-1}(\cdot)$ is the inverse Fourier transform. For a variable $x$ in the spatial domain, its corresponding variable in the frequency domain is denoted $\tilde{x} = \mathcal{F}(x)$.

II Related Works

II-A Online Sparse Coding

Given $N$ samples $\{x_1, \dots, x_N\}$, where each $x_i \in \mathbb{R}^P$, sparse coding learns an over-complete dictionary $D = [d_1, \dots, d_K] \in \mathbb{R}^{P \times K}$ of $K$ atoms and sparse codes $Z = [z_1, \dots, z_N]$ [1]. It can be formulated as the following optimization problem:

$\min_{D \in \mathcal{D},\, Z}\; \frac{1}{N} \sum_{i=1}^{N} \big( \frac{1}{2} \| x_i - D z_i \|_2^2 + \beta \| z_i \|_1 \big)$   (1)

where $\mathcal{D} = \{ D : \| d_k \|_2 \le 1,\; k = 1, \dots, K \}$ and $\beta > 0$ is a regularization parameter. Many efficient algorithms have been developed for solving (1). Examples include K-SVD [1] and the active set method [2]. However, they require storing all the samples, which can become infeasible when $N$ is large.

To solve this problem, an online learning algorithm for sparse coding that processes samples one at a time is proposed in [20]. After observing the $t$th sample $x_t$, the sparse code $z_t$ is obtained as

$z_t = \arg\min_{z}\; \frac{1}{2} \| x_t - D_{t-1} z \|_2^2 + \beta \| z \|_1$   (2)

where $D_{t-1}$ is the dictionary obtained at the $(t-1)$th iteration. After obtaining $z_t$, $D_t$ is updated as

$D_t = \arg\min_{D \in \mathcal{D}}\; \frac{1}{t} \sum_{i=1}^{t} \big( \frac{1}{2} \| x_i - D z_i \|_2^2 + \beta \| z_i \|_1 \big)$   (3)
$\phantom{D_t} = \arg\min_{D \in \mathcal{D}}\; \frac{1}{2} \mathrm{Tr}(D^\top D A_t) - \mathrm{Tr}(D^\top B_t)$   (4)

where

$A_t = \sum_{i=1}^{t} z_i z_i^\top$   (5)
$B_t = \sum_{i=1}^{t} x_i z_i^\top$   (6)

Each column of $D_t$ in (4) can be obtained by coordinate descent. $A_t$ and $B_t$ can also be updated incrementally as

$A_t = A_{t-1} + z_t z_t^\top, \quad B_t = B_{t-1} + x_t z_t^\top.$   (7)

Using $A_t$ and $B_t$, one does not need to store all the samples and codes to update $D_t$. The whole algorithm is shown in Algorithm 1.

0:  samples $\{x_1, \dots, x_N\}$.
1:  Initialize: dictionary $D_0$ as a Gaussian random matrix, $A_0 = 0$, $B_0 = 0$;
2:  for $t = 1, 2, \dots, N$ do
3:     draw $x_t$ from $\{x_1, \dots, x_N\}$;
4:     obtain sparse code $z_t$ using (2);
5:     update history matrices $A_t, B_t$ using (7);
6:     update dictionary $D_t$ using (4) by coordinate descent;
7:  end for
8:  return  $D_N$.
Algorithm 1 Online sparse coding [20].
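
To make the flow of Algorithm 1 concrete, the following is a minimal NumPy sketch of the online sparse coding loop: the lasso subproblem (2) is solved here by a few ISTA steps (any lasso solver can be substituted), the history matrices are updated as in (7), and the dictionary columns are refined by block coordinate descent on (4). Variable names and iteration counts are illustrative and not taken from the implementation of [20].

```python
import numpy as np

def soft_threshold(v, tau):
    # Proximal operator of tau * ||.||_1 (elementwise soft-thresholding).
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def sparse_code(x, D, beta, n_iter=100):
    # A few ISTA steps for the lasso subproblem (2); any lasso solver works.
    z = np.zeros(D.shape[1])
    step = 1.0 / np.linalg.norm(D, 2) ** 2        # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        z = soft_threshold(z - step * D.T @ (D @ z - x), step * beta)
    return z

def dictionary_update(D, A, B, n_iter=10):
    # Block coordinate descent on (4): refine one dictionary column at a time,
    # then project it back onto the unit l2-ball.
    for _ in range(n_iter):
        for k in range(D.shape[1]):
            if A[k, k] > 0:
                D[:, k] += (B[:, k] - D @ A[:, k]) / A[k, k]
            D[:, k] /= max(1.0, np.linalg.norm(D[:, k]))
    return D

def online_sparse_coding(X, K, beta=0.1):
    # X: P x N matrix of samples, processed one column at a time (Algorithm 1).
    P, N = X.shape
    rng = np.random.default_rng(0)
    D = rng.standard_normal((P, K))
    D /= np.linalg.norm(D, axis=0, keepdims=True)
    A, B = np.zeros((K, K)), np.zeros((P, K))
    for t in range(N):
        x = X[:, t]
        z = sparse_code(x, D, beta)               # step 4: code update, eq. (2)
        A += np.outer(z, z)                       # step 5: history updates, eq. (7)
        B += np.outer(x, z)
        D = dictionary_update(D, A, B)            # step 6: dictionary update, eq. (4)
    return D
```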

The following assumptions are made in [20].

Assumption 1.
  1. Samples $x$ are generated i.i.d. from some distribution with $\|x\|_2$ bounded.

  2. The code $z_t$ is unique w.r.t. data $x_t$.

  3. The objective in (4) is strictly convex with lower-bounded Hessians.

Theorem 1 ([20]).

With Assumption 1, the distance between $D_t$ and the set of stationary points of the dictionary learning problem converges almost surely to 0 when $t \rightarrow \infty$.

II-B Alternating Direction Method of Multipliers (ADMM)

ADMM [15] has been popularly used for solving optimization problems of the form

$\min_{x, y}\; f(x) + g(y) \;\;\text{s.t.}\;\; Ax + By = c,$   (8)

where $f, g$ are convex functions, and $A, B$ (resp. $c$) are constant matrices (resp. vector). It first constructs the augmented Lagrangian of problem (8):

$\mathcal{L}(x, y, \nu) = f(x) + g(y) + \nu^\top (Ax + By - c) + \frac{\rho}{2} \| Ax + By - c \|_2^2,$   (9)

where $\nu$ is the dual variable, and $\rho > 0$ is a penalty parameter.

At the $\tau$th iteration, the values of $x$ and $y$ (denoted as $x_\tau$ and $y_\tau$) are updated by minimizing (9) w.r.t. $x$ and $y$ in an alternating manner. Define the scaled dual variable $u_\tau = \nu_\tau / \rho$. The ADMM updates can be written as

$x_{\tau+1} = \arg\min_x\; f(x) + \frac{\rho}{2} \| Ax + By_\tau - c + u_\tau \|_2^2, \quad y_{\tau+1} = \arg\min_y\; g(y) + \frac{\rho}{2} \| Ax_{\tau+1} + By - c + u_\tau \|_2^2,$   (10)
$u_{\tau+1} = u_\tau + Ax_{\tau+1} + By_{\tau+1} - c.$   (11)

The above procedure converges to the optimal solution at a rate of $O(1/\tau)$ [22], where $\tau$ is the number of iterations.
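
As an illustration of the updates (10) and (11), the following sketch applies scaled-form ADMM to the simplest relevant instance, the lasso problem; the problem sizes and parameter values are arbitrary.

```python
import numpy as np

def admm_lasso(M, b, beta, rho=1.0, n_iter=200):
    """Scaled-form ADMM for min_x 0.5*||M x - b||_2^2 + beta*||x||_1, written as
    f(x) + g(y) s.t. x - y = 0 (i.e., A = I, B = -I, c = 0 in (8)).
    A toy instance meant only to illustrate the updates (10)-(11)."""
    n = M.shape[1]
    x, y, u = np.zeros(n), np.zeros(n), np.zeros(n)
    MtM_rhoI = M.T @ M + rho * np.eye(n)      # the x-update is a ridge-type linear system
    Mtb = M.T @ b
    for _ in range(n_iter):
        x = np.linalg.solve(MtM_rhoI, Mtb + rho * (y - u))                    # x-update in (10)
        y = np.sign(x + u) * np.maximum(np.abs(x + u) - beta / rho, 0.0)      # y-update: soft-thresholding
        u = u + x - y                                                         # dual update (11)
    return y

# Example usage:
# rng = np.random.default_rng(0)
# M = rng.standard_normal((50, 100)); b = rng.standard_normal(50)
# x_hat = admm_lasso(M, b, beta=0.1)
```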

II-C Convolutional Sparse Coding

Convolutional sparse coding (CSC) learns a dictionary composed of $K$ filters $\{d_1, \dots, d_K\}$, each of length $M$, that can capture the same local pattern at different translated positions of the samples. This is achieved by replacing the multiplication between dictionary and code by convolution. While each $x_i$ in sparse coding is represented by a single code $z_i$, each $x_i$ in CSC is represented by $K$ codes $\{z_{i1}, \dots, z_{iK}\}$ stored together in the matrix $Z_i = [z_{i1}, \dots, z_{iK}]$.

The dictionary and codes are obtained by solving the following optimization problem:

$\min_{\{d_k\}, \{Z_i\}}\; \frac{1}{N} \sum_{i=1}^{N} \big( \frac{1}{2} \| x_i - \sum_{k=1}^{K} d_k * z_{ik} \|_2^2 + \beta \sum_{k=1}^{K} \| z_{ik} \|_1 \big) \;\;\text{s.t.}\;\; \| d_k \|_2 \le 1,\; k = 1, \dots, K,$   (12)

where $*$ denotes convolution in the spatial domain.

Convolution can be accelerated in the frequency domain via the convolution theorem [23]: $\mathcal{F}(d_k * z_{ik}) = \mathcal{F}(d_k) \odot \mathcal{F}(z_{ik})$, where $d_k$ is first zero-padded to be $P$-dimensional. Hence, recent CSC methods [5, 8, 16, 17] choose to operate in the frequency domain. Let $\tilde{x}_i = \mathcal{F}(x_i)$, $\tilde{d}_k = \mathcal{F}(d_k)$ and $\tilde{z}_{ik} = \mathcal{F}(z_{ik})$. (12) is reformulated as

$\min_{\{\tilde{d}_k\}, \{\tilde{z}_{ik}\}}\; \frac{1}{N} \sum_{i=1}^{N} \big( \frac{1}{2P} \| \tilde{x}_i - \sum_{k=1}^{K} \tilde{d}_k \odot \tilde{z}_{ik} \|_2^2 + \beta \sum_{k=1}^{K} \| \mathcal{F}^{-1}(\tilde{z}_{ik}) \|_1 \big)$
s.t. $\| C(\mathcal{F}^{-1}(\tilde{d}_k)) \|_2 \le 1,\; k = 1, \dots, K,$   (II-C)

where the factor $\frac{1}{P}$ in the objective comes from Parseval's theorem [24], and $C(\cdot)$ is the linear operation that removes the extra dimensions in $\mathcal{F}^{-1}(\tilde{d}_k)$ (i.e., crops it back to length $M$).
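
The gain from working in the frequency domain can be checked in a few lines of NumPy: zero-pad each filter to the sample length, and the (circular) convolutions become elementwise products of FFTs. The sizes below are arbitrary; this is only a numerical illustration of the convolution theorem used above.

```python
import numpy as np

rng = np.random.default_rng(0)
P, M, K = 64, 7, 4                       # sample length, filter length, number of filters
d = rng.standard_normal((K, M))          # spatial filters
z = rng.standard_normal((K, P))          # spatial codes (one sample)

def circular_conv(a, b):
    # Circular convolution of a (zero-padded implicitly) with b.
    n = len(b)
    return np.array([sum(a[j] * b[(i - j) % n] for j in range(len(a))) for i in range(n)])

# Reconstruction in the spatial domain: sum of convolutions d_k * z_k.
x_spatial = sum(circular_conv(d[k], z[k]) for k in range(K))

# Same reconstruction via the convolution theorem, with filters zero-padded to length P.
d_pad = np.zeros((K, P)); d_pad[:, :M] = d
x_freq = np.fft.ifft(np.sum(np.fft.fft(d_pad, axis=1) * np.fft.fft(z, axis=1), axis=0)).real

print(np.allclose(x_spatial, x_freq))    # True
```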

Problem (II-C) can be solved by block coordinate descent [5, 8, 16, 17], which updates the codes $\{\tilde{z}_{ik}\}$ and the dictionary $\{\tilde{d}_k\}$ alternately.

II-C1 Updating the Codes

Given the dictionary $\{\tilde{d}_k\}$, the codes can be obtained one by one for each $i$th sample as

$\min_{\{\tilde{z}_{ik}\}, \{y_{ik}\}}\; \frac{1}{2P} \| \tilde{x}_i - \sum_{k=1}^{K} \tilde{d}_k \odot \tilde{z}_{ik} \|_2^2 + \beta \sum_{k=1}^{K} \| y_{ik} \|_1$   (13)
s.t. $y_{ik} = \mathcal{F}^{-1}(\tilde{z}_{ik}),\; k = 1, \dots, K,$

where $y_{ik}$ is introduced to decouple the loss and the $\ell_1$-regularizer in (II-C). This can then be solved by ADMM [5, 8, 16, 17].
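
A possible shape of this ADMM solver is sketched below for a single 1-D sample: the frequency-domain codes carry the quadratic loss, a spatial-domain copy carries the l1 penalty, and each per-frequency system is rank-1 plus a scaled identity and hence has a closed-form solution. The splitting details and variable names are illustrative assumptions, not the exact formulation of [5, 8, 16, 17].

```python
import numpy as np

def csc_code_update(x, d_spatial, beta, rho=1.0, n_iter=50):
    """Illustrative ADMM sketch of a CSC code update for one 1-D sample (cf. (13)):
    split the frequency-domain codes from a spatial-domain copy that carries the
    l1 penalty, and solve the per-frequency rank-1-plus-identity systems in closed form."""
    P = len(x)
    K, M = d_spatial.shape
    d_pad = np.zeros((K, P)); d_pad[:, :M] = d_spatial
    D = np.fft.fft(d_pad, axis=1)                  # K x P frequency-domain filters
    Dc = np.conj(D)
    X = np.fft.fft(x)
    y = np.zeros((K, P)); u = np.zeros((K, P))     # spatial split variable and scaled dual
    for _ in range(n_iter):
        # z-update: per frequency p, solve (conj(D_p) D_p^T + rho I) z_p = conj(D_p) X_p + rho V_p.
        V = np.fft.fft(y - u, axis=1)
        rhs = Dc * X + rho * V                                   # K x P right-hand sides
        DtR = np.sum(D * rhs, axis=0)                            # D_p^T rhs_p for every p
        Z = (rhs - Dc * (DtR / (rho + np.sum(np.abs(D) ** 2, axis=0)))) / rho
        # y-update: soft-thresholding in the spatial domain (prox of (beta/rho)*||.||_1).
        z_spatial = np.fft.ifft(Z, axis=1).real
        y = np.sign(z_spatial + u) * np.maximum(np.abs(z_spatial + u) - beta / rho, 0.0)
        u += z_spatial - y                                       # dual update
    return y                                                     # spatial sparse codes
```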

II-C2 Updating the Dictionary

Given the codes $\{\tilde{z}_{ik}\}$, the dictionary $\{\tilde{d}_k\}$ can be obtained as

$\min_{\{\tilde{d}_k\}, \{h_k\}}\; \frac{1}{N} \sum_{i=1}^{N} \frac{1}{2P} \| \tilde{x}_i - \sum_{k=1}^{K} \tilde{d}_k \odot \tilde{z}_{ik} \|_2^2$   (14)
s.t. $\| C(h_k) \|_2 \le 1,\; h_k = \mathcal{F}^{-1}(\tilde{d}_k),\; k = 1, \dots, K,$

where $h_k$ is introduced to decouple the loss and the constraint in (II-C). This can again be solved by using ADMM [5, 8, 16, 17].

After obtaining $\{\tilde{z}_{ik}\}$ and $\{\tilde{d}_k\}$, the sparse codes can be recovered as $z_{ik} = \mathcal{F}^{-1}(\tilde{z}_{ik})$ for $k = 1, \dots, K$, and the dictionary filters as $d_k = C(\mathcal{F}^{-1}(\tilde{d}_k))$.

The above algorithms all need in space. They differ mainly in how to compute the linear system involved with in the ADMM subproblems. FCSC [5] directly solves the subproblem, which takes time. CBPDN [16] exploits a special structure in the dictionary and reduces the time complexity to , which is efficient for small . The CONSENSUS algorithm [17] utilizes the matrix inverse lemma to reduce the time complexity to . The state-of-the-art is FFCSC [8], which incorporates various linear algebra techniques (such as Cholesky factorization [25] and cached factorization [25]) to reduce the time complexity to .

III Online Convolutional Sparse Coding

Existing CSC algorithms operate in the batch mode, and need to store all the samples and codes, whose size grows linearly with the number of samples. This becomes infeasible when the data set is large. In this section, we will scale up CSC by using online learning.¹

¹After the initial arXiv posting of our paper [26], we became aware of some very recent independent works that also consider CSC in the online setting [27, 28]. These will be discussed in Section III-E.

After observing the $t$th sample $x_t$, online CSC considers the following optimization problem which is analogous to (12):

$\min_{\{d_k\}, \{Z_i\}}\; \frac{1}{t} \sum_{i=1}^{t} \big( \frac{1}{2} \| x_i - \sum_{k=1}^{K} d_k * z_{ik} \|_2^2 + \beta \sum_{k=1}^{K} \| z_{ik} \|_1 \big) \;\;\text{s.t.}\;\; \| d_k \|_2 \le 1,\; k = 1, \dots, K.$   (15)

To solve problem (15), some naive approaches are first considered in Section III-A. The proposed online convolutional sparse coding algorithm is then presented in Section III-B. It takes the same time complexity for one data pass as state-of-the-art batch CSC algorithms, but has a much lower space complexity (Section III-C). The convergence properties of the proposed algorithm are discussed in Section III-D.

III-A Naive Approaches

As in batch CSC, problem (15) can be solved by alternating minimization w.r.t. the codes and dictionary (as in Section II-C). Given the dictionary, the codes are updated as in (13). Given the codes, the dictionary is updated by solving the following optimization subproblem analogous to (14):

$\min_{\{\tilde{d}_k\}, \{h_k\}}\; \frac{1}{t} \sum_{i=1}^{t} \frac{1}{2P} \| \tilde{x}_i - \sum_{k=1}^{K} \tilde{d}_k \odot \tilde{z}_{ik} \|_2^2$   (16)
s.t. $\| C(h_k) \|_2 \le 1,\; h_k = \mathcal{F}^{-1}(\tilde{d}_k),\; k = 1, \dots, K.$

However, solving (16) as in Section II-C2 requires keeping all the samples and codes, and is computationally expensive on large data sets.

Alternatively, the objective in (16) can be rewritten as

(17)

where

(18)

and . This is of the form in (3). Hence, we may attempt to reuse the online sparse coding in Algorithm 1, and thus avoid storing all the samples and codes. However, recall that online learning the dictionary is possible because one can summarize the observed samples into the history matrices in (5), (6). For (17), the history matrices become

(19)

In typical CSC applications, the number of image pixels is at least in the tens of thousands, and may only be in the thousands. Hence, the space required for storing and is even higher than the space required for batch methods.

III-B Proposed Algorithm

Note that in (18) is composed of a number of diagonal matrices. By utilizing this special structure, the following Proposition rewrites the objective in (16) so that much smaller history matrices can be used.

Proposition 2.

The objective in (16) is equivalent to the following apart from a constant:

(20)

The proof is in Appendix A-A. Obviously, each in (20) can then be independently optimized. This avoids directly handling the much larger matrix in (18). Let

in (20). The total space required for is , which is much smaller than the space for storing in (19). Moreover, as in (7), and can be updated incrementally as

(21)
(22)
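
A minimal sketch of how such per-frequency history statistics can be maintained is given below; it assumes, following the description above, that for every frequency a small $K \times K$ matrix and a length-$K$ vector are accumulated and updated with rank-1 terms as in (7). The exact definitions in (20)-(22) may differ in scaling and conjugation conventions.

```python
import numpy as np

def init_history(P, K):
    # One K x K matrix and one length-K vector per frequency: O(K^2 P) space in total,
    # independent of how many samples have been processed.
    return np.zeros((P, K, K), dtype=complex), np.zeros((P, K), dtype=complex)

def update_history(A, b, x_t, z_t_freq):
    """Rank-1 update after observing sample x_t (length P) whose frequency-domain
    codes are z_t_freq (K x P). For each frequency p:
        A[p] += conj(z_t(p)) z_t(p)^T,   b[p] += conj(z_t(p)) x~_t(p),
    so that A[p] d(p) = b[p] are the least-squares normal equations for the
    K filter values d(p) at that frequency."""
    X = np.fft.fft(x_t)                            # frequency-domain sample
    Zp = z_t_freq.T                                # P x K: row p holds the codes at frequency p
    A += np.conj(Zp)[:, :, None] * Zp[:, None, :]  # per-frequency outer products
    b += np.conj(Zp) * X[:, None]
    return A, b
```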

The dictionary and codes can then be efficiently updated in an alternating manner as follows.

III-B1 Updating the Dictionary

With the codes fixed, using Proposition 2, the dictionary can be updated by solving the following optimization problem:

(23)

This can be solved using ADMM. At each ADMM iteration, the dictionary, the auxiliary variable and the ADMM dual variable are updated in turn; the update equations are shown below.

Updating : From (20), can be updated by solving the following subproblem:

where . Note that . Hence, . The rows of can then be obtained separately as

(24)

where . With , we do not need to store .

Computing the matrix inverse takes time. This can be simplified by noting from (21) that the matrix is the sum of rank-1 matrices and a (scaled) identity matrix. Using the Sherman-Morrison formula (for an invertible square matrix $A$ and vectors $u, v$, $(A + u v^\top)^{-1} = A^{-1} - \frac{A^{-1} u v^\top A^{-1}}{1 + v^\top A^{-1} u}$) [25], we have

(25)

This takes , instead of , time.
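
Since each per-frequency history matrix changes only by a rank-1 term when a new sample arrives, its inverse can be refreshed directly instead of being recomputed. Below is a self-contained sketch of the Sherman-Morrison update for a $K \times K$ matrix, checked against direct inversion; the per-frequency structure is the assumption described above.

```python
import numpy as np

def sherman_morrison_update(A_inv, u, v):
    """Given A_inv = A^{-1}, return (A + u v^H)^{-1} in O(K^2) time."""
    Au = A_inv @ u                       # A^{-1} u
    vA = np.conj(v) @ A_inv              # v^H A^{-1}
    return A_inv - np.outer(Au, vA) / (1.0 + np.conj(v) @ Au)

# Consistency check against direct inversion (toy sizes):
rng = np.random.default_rng(0)
K = 5
A = np.eye(K, dtype=complex) + 0.1 * rng.standard_normal((K, K))
z = rng.standard_normal(K) + 1j * rng.standard_normal(K)
A_inv = np.linalg.inv(A)
updated = sherman_morrison_update(A_inv, z, z)            # rank-1 update by z z^H
direct = np.linalg.inv(A + np.outer(z, np.conj(z)))
print(np.allclose(updated, direct))                       # True
```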

Updating : From (10), each column can be updated as

It has the following closed-form solution.

Proposition 3 ([29]).

, where .
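
Proposition 3 gives the closed-form solution of this subproblem. In frequency-domain CSC dictionary updates, this step typically amounts to cropping the filter back to its spatial support (the operator $C$), projecting onto the unit $\ell_2$-ball, and transforming back; the sketch below assumes that standard form and is not taken verbatim from [29].

```python
import numpy as np

def project_filter(d_freq_plus_dual, M):
    """Assumed form of the auxiliary-variable update in the dictionary ADMM:
    return to the spatial domain, keep only the first M entries (the operator C),
    project onto the unit l2-ball, then zero-pad and go back to the frequency domain."""
    P = len(d_freq_plus_dual)
    d_spatial = np.fft.ifft(d_freq_plus_dual).real[:M]        # crop to the filter support
    d_spatial /= max(1.0, np.linalg.norm(d_spatial))          # enforce ||d||_2 <= 1
    d_padded = np.zeros(P)
    d_padded[:M] = d_spatial
    return np.fft.fft(d_padded)
```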

Finally, the dual variable is updated as in (11). The whole dictionary update procedure (DictOCSC) is shown in Algorithm 2. As (23) is convex, convergence to the globally optimal solution is guaranteed [22].

0:  initial dictionary , , ;
1:  Initialize: , , ;
2:  for  do
3:     update using (24);
4:     update using Proposition 3;
5:     update as ;
6:  end for
7:  return  .
Algorithm 2 DictOCSC().

In batch CSC methods, the dictionary update in (14) is also based on ADMM (Section II-C2). However, our dictionary update step first reformulates the objective as in (20). This enables each ADMM subproblem to be solved with a much lower space complexity than the state-of-the-art [8], while keeping the same per-iteration time complexity.

III-B2 Updating the Code

Given the dictionary, as the codes for different samples are independent, they can be updated one by one as in batch CSC methods (Section II-C1).

The whole algorithm, which will be called Online Convolutional Sparse Coding (OCSC), is shown in Algorithm 3.

0:  samples .
1:  Initialize: dictionary as a Gaussian random matrix, , ;
2:  for  do
3:     , where is drawn from ;
4:     obtain using (13);
5:     update using (22);
6:     update using (25);
7:     .
8:  end for
9:  for  do
10:     ;
11:  end for
12:  return  .
Algorithm 3 Online convolutional sparse coding (OCSC).
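
The skeleton below mirrors Algorithm 3 at a high level under the per-frequency assumptions used in the earlier sketches: each incoming sample is coded against the current dictionary, folded into the history statistics, and the dictionary is then refreshed by a few ADMM iterations. It reuses the illustrative helpers csc_code_update, init_history, update_history and project_filter defined above; the dictionary refresh is a simplified stand-in for DictOCSC (Algorithm 2), with a direct per-frequency solve in place of the incrementally maintained inverses of (25).

```python
import numpy as np

def ocsc(samples, K, M, beta=0.1, rho=1.0, admm_iters=10):
    """High-level OCSC skeleton (cf. Algorithm 3); illustrative only."""
    P = len(samples[0])
    rng = np.random.default_rng(0)
    d = rng.standard_normal((K, M))
    d /= np.linalg.norm(d, axis=1, keepdims=True)            # unit-norm initial filters
    A, b = init_history(P, K)                                # per-frequency history statistics
    D = np.fft.fft(np.pad(d, ((0, 0), (0, P - M))), axis=1)  # K x P frequency-domain dictionary
    for t, x in enumerate(samples, start=1):
        z = csc_code_update(x, d, beta, rho)                 # code the new sample (spatial codes)
        A, b = update_history(A, b, x, np.fft.fft(z, axis=1))
        H, U = D.copy(), np.zeros_like(D)                    # constrained copy and scaled dual
        for _ in range(admm_iters):
            for p in range(P):                               # per-frequency linear systems
                D[:, p] = np.linalg.solve(A[p] / t + rho * np.eye(K),
                                          b[p] / t + rho * (H[:, p] - U[:, p]))
            for k in range(K):                               # assumed form of Proposition 3
                H[k] = project_filter(D[k] + U[k], M)
            U += D - H
        d = np.stack([np.fft.ifft(H[k]).real[:M] for k in range(K)])
    return d
```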

III-C Complexity Analysis

In Algorithm 3, the space requirement is dominated by , which takes space. For one data pass which processes samples, updating and takes time, dictionary update takes time, code update takes time, and FFT/inverse FFT takes time. Hence, one data pass takes a total of time.

A comparison with the existing batch CSC algorithms is shown in Table I. As can be seen, the proposed algorithm takes the same time complexity for one data pass as the state-of-the-art FFCSC algorithm, but has a much lower space complexity ( instead of ).

batch/online convolution operation space time for one data pass
DeconvNet [14] batch spatial
FCSC [5] batch frequency
FFCSC [8] batch frequency
CBPDN [16] batch frequency
CONSENSUS [17] batch frequency
OCSC online frequency
TABLE I: Comparing the proposed online CSC algorithm with existing batch CSC algorithms.

III-D Convergence

In this section, we show that Algorithm 3 outputs a stationary point of the CSC problem (15) as $t \rightarrow \infty$. This is achieved by connecting Algorithm 3 to a direct application of Algorithm 1 on (15).

The convolution operation in the spatial domain can be written as matrix multiplication [5, 14]. Specifically,

$d_k * z_{ik} = T(d_k)\, z_{ik},$   (26)

where $T(\cdot)$ is a linear operator which maps a vector to its associated Toeplitz matrix. The number of columns in $T(d_k)$ is equal to the dimension of $z_{ik}$ (i.e., $P$).
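
This matrix form is easy to verify numerically: the circulant matrix built from the zero-padded filter reproduces the circular convolution used in the frequency-domain formulation. The sizes below are arbitrary.

```python
import numpy as np

def conv_matrix(d, P):
    """Column i of T(d) is the zero-padded filter d circularly shifted by i,
    so that T(d) @ z equals the circular convolution d * z (cf. (26))."""
    d_pad = np.zeros(P)
    d_pad[:len(d)] = d
    return np.stack([np.roll(d_pad, i) for i in range(P)], axis=1)

rng = np.random.default_rng(0)
P, M = 16, 4
d = rng.standard_normal(M)
z = rng.standard_normal(P)

T = conv_matrix(d, P)                                # P x P Toeplitz (circulant) matrix
conv_ref = np.array([sum(d[j] * z[(i - j) % P] for j in range(M)) for i in range(P)])
print(np.allclose(T @ z, conv_ref))                  # True
```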

Let , be the inverse operator of which maps back to , and . Problem (15) can be rewritten as

(27)

where

(28)

Thus, the objective (27) is of the same form as that in (3). However, a direct use of Algorithm 1 is not feasible. First, the feasible region in (28) is more complex, and coordinate descent cannot be used as there is no simple projection to . Second, as is of length , the corresponding history matrices (analogous to those in (5)) require space.

Though a direct application of Algorithm 1 is not practical, Theorem 1 still holds. Indeed, Theorem 1 can be further extended by relaxing its feasible region on . As discussed in [20], can be, for example, . It is mentioned in [20] that has to be a union of independent constraints on each column of . However, this only serves to facilitate the use of coordinate descent (step 6 in Algorithm 1), but is not required in the proof. In general, Theorem 1 holds when is bounded, convex, and .

The following Lemma shows that in (28) satisfies the conditions. The proof is in Appendix A-B. Thus, Theorem 1 also holds for Algorithm 3, and a stationary point of problem (15) can be obtained.

Lemma 4.

in (28) is bounded, convex, and a subset of .

III-E Discussion with [27, 28]

Here, we discuss the very recent works of [27, 28], which also consider online learning of the dictionary in CSC. Extra experiments are performed in Section B, which show that our method is much faster than both of them.

In the online convolutional dictionary learning (OCDL) algorithm [27], convolution is performed in the spatial domain. It starts from the observation that convolution is commutative. Hence, for the summation in (15), we have

(29)

where with and . (15) can then be rewritten as

(30)

where (as each is repeated times in the Toeplitz matrix ). Using (30), the history matrices are constructed as

Recall that is the length of filter when CSC is solved in the spatial domain. The space complexity of [27] is dominated by (which takes space) or (which takes space), depending on the relative sizes of and . Though this is comparable to our space requirement, its time complexity is much larger. For one data pass, convolution in the spatial domain takes time and updating the history matrices above takes time. The dictionary update in total takes time. In contrast, the proposed algorithm takes time. In the experiments, , and ranges from to . Thus, the algorithm in [27] is much more expensive.

The algorithm in [28], also called online convolutional dictionary learning, considers the frequency domain and solves problem (16) as in the proposed method. First, they rewrite (16) as

where , and . The history matrices are then constructed as