Learning Fast Sparsifying Transforms


Cristian Rusu and John Thompson The authors are with the Institute for Digital Communications, School of Engineering, The University of Edinburgh, Scotland. Email: {c.rusu, john.thompson}@ed.ac.uk. Demo source code available online at https://udrc.eng.ed.ac.uk/sites/udrc.eng.ed.ac.uk/files/attachments/demo.zip. This work was supported by the Engineering and Physical Sciences Research Council (EPSRC) Grant number EP/K014277/1 and the MOD University Defence Research Collaboration (UDRC) in Signal Processing.
Abstract

Given a dataset, the task of learning a transform that allows sparse representations of the data bears the name of dictionary learning. In many applications, these learned dictionaries represent the data much better than the static well-known transforms (Fourier, Hadamard etc.). The main downside of learned transforms is that they lack structure and therefore they are not computationally efficient, unlike their classical counterparts. This poses several difficulties especially when using power-limited hardware such as mobile devices, thereby discouraging the application of sparsity techniques in such scenarios. In this paper we construct orthogonal and non-orthogonal dictionaries that are factorized as a product of a few basic transformations. In the orthogonal case, we solve exactly the dictionary update problem for one basic transformation, which can be viewed as a generalized Givens rotation, and then propose to construct orthogonal dictionaries that are a product of these transformations, guaranteeing their fast manipulation. We also propose a method to construct fast square but non-orthogonal dictionaries that are factorized as a product of a few transforms that can be viewed as a further generalization of Givens rotations to the non-orthogonal setting. We show how the proposed transforms balance data representation performance and computational complexity very well. We also compare with classical fast transforms and with learned general and orthogonal transforms.

I Introduction

Dictionary learning methods [1] represent a well-known class of algorithms that have seen many applications in signal processing [2], image processing [3], wireless communications [4] and machine learning [5]. The key idea of this approach is not to use an off-the-shelf transform like the Fourier, Hadamard or wavelet but to learn a new transform, often called an overcomplete dictionary, for a particular task (like coding and classification) from the data itself. While the dictionary learning problem is NP-hard [6] in general, it has been extensively studied and several good algorithms to tackle it exist. Alternating minimization methods like the method of optimal directions (MOD) [7], K–SVD [8, 9] and direct optimization [10] have been shown to work well in practice and also enjoy some theoretical performance guarantees. While learning a dictionary we need to construct two objects: the dictionary and the representation of the data in the dictionary.

One problem that arises in general when using learned dictionaries is the fact that they lack any structure. This is to be compared with the previously mentioned off-the-shelf transforms that have a rich structure, which is reflected in their low computational complexity, i.e., they can be applied using only $O(n \log n)$ computations for example [11]. Our goal in this paper is to provide a solution to the problem of constructing fast transforms, based upon the structure of Givens rotations, learned from training data.

We first choose to study orthogonal structures since sparse reconstruction is computationally cheaper in such a dictionary: we project the data onto the column space of the dictionary and keep the $s$ largest coefficients in magnitude to obtain the provably best $s$-term approximation. Working in an $n$-dimensional feature space, this operation has complexity $O(n^2)$. In a general non-orthogonal (and even overcomplete) dictionary, special non-linear reconstruction methods such as $\ell_1$ minimization [12], greedy approaches like orthogonal matching pursuit (OMP) [13] or variational Bayesian algorithms like approximate message passing (AMP) [14] need to be applied. Aside from the fact that in general these methods cannot guarantee to produce the best $s$-term approximation, they are also computationally expensive. For example, the classical OMP has complexity of order $O(sn^2)$ per data point [15] and, assuming that we are looking for sparse approximations with $s \ll n$, it is in general computationally cheaper than $\ell_1$ optimization. Therefore, considering a square orthogonal dictionary is a first step in the direction of constructing a fast transform. For the analysis dictionary, recent work based on transform learning [16, 17] has been proposed. Still, notice that computing sparse representations in such a dictionary has complexity $O(n^2)$ and therefore our goal of constructing a fast transform cannot be reached with just a general orthogonal dictionary. We make the case that our fundamental goal is to actually build a structured orthogonal dictionary such that matrix-vector multiplications with this dictionary can be achieved with fewer than $O(n^2)$ operations, preferably $O(n \log n)$. This connects our paper to previous work on approximating orthogonal (and symmetric) matrices [18] such that matrix-vector multiplications are computationally efficient.

When we talk about “learning fast sparsifying transforms” we do not refer to efficient learning procedures (although the proposed learning methods have polynomial complexity) but to the transforms themselves, i.e., once we have the transform, the computational complexity of using it is low, preferably $O(n \log n)$ operations to perform a matrix-vector multiplication.

Previous work [19, 20, 21, 22, 23, 24, 25] in the literature has already proposed various structured dictionaries to cope with the high computational complexity of learned transforms. Previous work also dealt with the construction of structured orthogonal dictionaries. Specifically, [26] proposed to build an orthogonal dictionary composed of a product of a few Householder reflectors. In this fashion, the computational complexity of the dictionary is controlled and a trade-off between representation performance and computational complexity is shown.

Learned dictionaries with low computational complexity can bridge the gap between the classical transforms that are preferred especially in power-limited hardware (or battery-operated devices) and the overcomplete, computationally cumbersome, learned dictionaries that provide state-of-the-art performance in many machine learning tasks. The contribution of this paper is twofold.

First, we consider the problem of constructing an orthogonal dictionary as a product of a given number of generalized Givens rotations. We start by showing the optimum solution to the dictionary learning problem when the dictionary is a single generalized Givens rotation and then move to expand on this result and propose an algorithm that sequentially builds a product of generalized Givens rotations to act as a dictionary for sparse representations. Each step of the algorithm solves exactly the proposed optimization problem and therefore we can guarantee that it monotonically converges to a local minimum. We show numerically that the fast dictionaries proposed in this paper outperform those based on Householder reflectors [26] in terms of representation error, for the same computational complexity.

Second, based on a structure similar to the generalized Givens rotation we then propose a learning method that constructs square, non-orthogonal, computationally efficient dictionaries. In order to construct the dictionary we again solve exactly a series of optimization problems. Unfortunately we cannot prove the monotonic convergence of the algorithm since the sparse reconstruction step, based in this paper on OMP, cannot guarantee in general a monotonic reduction in our objective function. Still, we are able to show that these fast non-orthogonal transforms perform very well, better than their orthogonal counterparts.

In the results section we compare the proposed methods among each other and to previously proposed dictionary learning methods in the literature. We show that the methods proposed in this paper provide a clear trade-off between representation performance and computational complexity. Interestingly, we are able to provide numerical examples where the proposed fast orthogonal dictionaries have higher computational efficiency and provide better representation performance than the well-known discrete cosine transform (DCT), the transform at the heart of the JPEG compression standard [27].

II A brief description of dictionary learning optimization problems

Given a real dataset $\mathbf{Y} \in \mathbb{R}^{n \times N}$ and sparsity level $s$, the general dictionary learning problem is to produce the factorization given by the optimization problem:

$$\underset{\mathbf{D},\ \mathbf{X}}{\text{minimize}}\ \ \|\mathbf{Y} - \mathbf{D}\mathbf{X}\|_F^2 \quad \text{subject to}\ \ \|\mathbf{x}_i\|_0 \le s\ \ \forall i, \quad \|\mathbf{d}_j\|_2 = 1\ \ \forall j, \quad (1)$$

where the objective function describes the Frobenius norm representation error achieved by the square dictionary $\mathbf{D} \in \mathbb{R}^{n \times n}$ with the sparse representations $\mathbf{X} \in \mathbb{R}^{n \times N}$, whose columns $\mathbf{x}_i$ are constrained by the $\ell_0$ pseudo-norm $\|\mathbf{x}_i\|_0$ (the number of non-zero elements of $\mathbf{x}_i$). To avoid trivial solutions, the dimensions obey $s \ll n \ll N$. Several algorithms that work very well in practice exist [7] [8] [15] to solve this factorization problem. Their approach, and the one we also adopt in this paper, is to keep the dictionary fixed and update the representations, and then reverse the roles by updating the dictionary with the representations fixed. This alternating minimization approach proves to work very well experimentally [7, 8] and allows some theoretical insights [28].

In this paper we also consider the dictionary learning problem (1) with an orthogonal dictionary $\mathbf{Q}$ [29] [30] [31] [32]. The orthogonal dictionary learning problem (which we call in this paper Q–DLA) [33] is formulated as:

$$\underset{\mathbf{Q},\ \mathbf{X}}{\text{minimize}}\ \ \|\mathbf{Y} - \mathbf{Q}\mathbf{X}\|_F^2 \quad \text{subject to}\ \ \mathbf{Q}^T\mathbf{Q} = \mathbf{I}, \quad \|\mathbf{x}_i\|_0 \le s\ \ \forall i. \quad (2)$$

Since the dictionary is orthogonal, the construction of $\mathbf{X}$ no longer involves $\ell_1$ [12], OMP [13] or AMP [14] approaches as in (1), but reduces to $\mathbf{X} = \mathcal{T}_s(\mathbf{Q}^T\mathbf{Y})$, where $\mathcal{T}_s$ is an operator that, given an input vector, zeros all entries except the $s$ largest in magnitude and, given an input matrix, applies the same operation on each column in turn. To solve (2) for variable $\mathbf{Q}$ and fixed $\mathbf{X}$, a problem also known as the orthogonal Procrustes problem [34], a closed form solution $\mathbf{Q} = \mathbf{U}\mathbf{V}^T$ is given by the singular value decomposition $\mathbf{Y}\mathbf{X}^T = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^T$.
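
As a concrete illustration of this alternating scheme, the following NumPy sketch implements one Q–DLA iteration under the notation above; the function names (hard_threshold, qdla_step) and the random initialization are our own and not part of the original description.

```python
import numpy as np

def hard_threshold(Z, s):
    """Keep, in each column of Z, only the s largest-magnitude entries."""
    X = np.zeros_like(Z)
    idx = np.argsort(-np.abs(Z), axis=0)[:s, :]   # row indices of the s largest entries per column
    cols = np.arange(Z.shape[1])
    X[idx, cols] = Z[idx, cols]
    return X

def qdla_step(Y, Q, s):
    """One alternating step: exact sparse coding, then orthogonal Procrustes update."""
    X = hard_threshold(Q.T @ Y, s)                # X = T_s(Q^T Y), exact for orthogonal Q
    U, _, Vt = np.linalg.svd(Y @ X.T)             # Procrustes: Q = U V^T from the SVD of Y X^T
    return U @ Vt, X

# usage: n-dimensional data, N samples, sparsity s
n, N, s = 16, 512, 4
Y = np.random.randn(n, N)
Q = np.linalg.qr(np.random.randn(n, n))[0]        # arbitrary orthogonal initialization
for _ in range(10):
    Q, X = qdla_step(Y, Q, s)
```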

III A building block for fast transforms

For indices $(i, j)$, with $1 \le i < j \le n$, and variables $\{a, b, c, d\}$ let us define the basic transform, which we call an R-transform:

$$\mathbf{R}_{ij} = \begin{bmatrix} \mathbf{I}_{i-1} & & & & \\ & a & & b & \\ & & \mathbf{I}_{j-i-1} & & \\ & c & & d & \\ & & & & \mathbf{I}_{n-j} \end{bmatrix} \in \mathbb{R}^{n \times n}, \quad (3)$$

where we have denoted by $\mathbf{I}_m$ the identity matrix of size $m \times m$. For simplicity, we denote the non-zero part of $\mathbf{R}_{ij}$, i.e., the $2 \times 2$ block acting on coordinates $i$ and $j$, as

$$\tilde{\mathbf{R}}_{ij} = \begin{bmatrix} a & b \\ c & d \end{bmatrix} \in \mathbb{R}^{2 \times 2}. \quad (4)$$

A multiplication between an R-transform and a matrix $\mathbf{X} \in \mathbb{R}^{n \times N}$ placed on its right operates only on rows $i$ and $j$ of $\mathbf{X}$ as

$$\mathbf{R}_{ij}\mathbf{X}: \quad \begin{bmatrix} \mathbf{x}^{(i)} \\ \mathbf{x}^{(j)} \end{bmatrix} \leftarrow \tilde{\mathbf{R}}_{ij} \begin{bmatrix} \mathbf{x}^{(i)} \\ \mathbf{x}^{(j)} \end{bmatrix} = \begin{bmatrix} a\,\mathbf{x}^{(i)} + b\,\mathbf{x}^{(j)} \\ c\,\mathbf{x}^{(i)} + d\,\mathbf{x}^{(j)} \end{bmatrix}, \quad (5)$$

where $\mathbf{x}^{(i)}$ is the $i$-th row of $\mathbf{X}$, while all other rows are left unchanged. The number of operations needed for this task is only $6N$. Left and right multiplications with an R-transform (or its transpose) are therefore computationally efficient. We use this matrix structure as a basic building block for the transforms learned in this paper.
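
A minimal NumPy sketch of the two-row update in (5) (function name of our choosing); it never forms the full $n \times n$ matrix $\mathbf{R}_{ij}$.

```python
import numpy as np

def apply_r_transform(X, i, j, R_tilde):
    """Left-multiply X by the R-transform acting on rows i and j.

    R_tilde is the 2x2 non-zero block [[a, b], [c, d]]; only rows i and j of X
    are touched, so the cost is O(N) instead of the O(n^2 N) of a dense product."""
    rows = X[[i, j], :]                 # copy of the two affected rows (2 x N)
    X[[i, j], :] = R_tilde @ rows       # linear combinations of the two rows
    return X

# usage
n, N = 8, 5
X = np.random.randn(n, N)
X = apply_r_transform(X, 1, 4, np.array([[0.8, 0.6], [-0.6, 0.8]]))
```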

Remark 1. Every square matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$ can be written as a product of a finite number (of order $n^2$) of R-transforms. Therefore, we can consider the R-transforms as fundamental building blocks for all square transforms.

Proof. Consider the singular value decomposition $\mathbf{A} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^T$. Each of $\mathbf{U}$ and $\mathbf{V}^T$ can be factored as a product of Givens rotations [35], which are in fact constrained R-transforms (with $a = d = \cos\theta$ and $b = -c = \sin\theta$ for some angle $\theta$), and a diagonal matrix containing only $\pm 1$ entries, while the diagonal $\mathbf{\Sigma}$ can be factored as a product of diagonal R-transforms (with $b = c = 0$).

In this paper we will be interested in using $\mathbf{R}_{ij}$ in least squares problems whose objective function develops as:

$$\|\mathbf{Y} - \mathbf{R}_{ij}\mathbf{X}\|_F^2 = \sum_{k \notin \{i, j\}} \left\|\mathbf{y}^{(k)} - \mathbf{x}^{(k)}\right\|_2^2 + \left\| \mathbf{Y}_{\{ij\}} - \tilde{\mathbf{R}}_{ij}\mathbf{X}_{\{ij\}} \right\|_F^2. \quad (6)$$

For simplicity of exposition we have defined

$$\mathbf{Y}_{\{ij\}} = \begin{bmatrix} \mathbf{y}^{(i)} \\ \mathbf{y}^{(j)} \end{bmatrix} \in \mathbb{R}^{2 \times N}, \qquad \mathbf{X}_{\{ij\}} = \begin{bmatrix} \mathbf{x}^{(i)} \\ \mathbf{x}^{(j)} \end{bmatrix} \in \mathbb{R}^{2 \times N}, \quad (7)$$

where $\mathbf{y}^{(i)}$ and $\mathbf{x}^{(i)}$ are the $i$-th rows of $\mathbf{Y}$ and $\mathbf{X}$, respectively.

We now introduce learning methods to create computationally efficient orthogonal and non-orthogonal dictionaries.

IV A method for designing fast orthogonal transforms: G–DLA

In this section we propose a method called G–DLA to learn orthogonal dictionaries that are factorized as a product of G-transforms (constrained R-transforms).

IV-A An overview of G-transforms

We call a G-transform an orthogonal, constrained R-transform (3) parameterized only by an angle $\theta$ and the indices $(i, j)$, such that the non-zero part of $\mathbf{G}_{ij}$, corresponding to (4), is given by

(8)

Classically, a Givens rotation is a matrix as in (3) with the first choice in (8), i.e., a proper rotation matrix: an orthogonal matrix with determinant one. These rotations are important since any orthogonal dictionary of size $n \times n$ can be factorized into a product of at most $n(n-1)/2$ Givens rotations [35] (up to a diagonal matrix of $\pm 1$ entries). In this paper, since we are interested in the computational complexity of these structures, we allow both options in (8), which together fully characterize all real $2 \times 2$ orthogonal matrices – these structures are discussed in [36, Chapter 2.1]. With the second (reflection) choice, the G-transform in (3) is in fact a Householder reflector $\mathbf{I} - 2\mathbf{v}\mathbf{v}^T$ where $\mathbf{v}$ has all entries equal to zero except for the $i$-th and $j$-th entries, which are $\sin(\theta/2)$ and $-\cos(\theta/2)$, respectively – one might call this a “Givens reflector” to highlight its distinguishing sparse structure. Givens rotations have been previously used in matrix factorization applications [37, 38].
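
For concreteness, the two admissible $2 \times 2$ blocks in (8) can be generated as in the short sketch below (our own illustration): the first is a proper rotation (determinant $+1$), the second a reflection (determinant $-1$), and both are orthogonal.

```python
import numpy as np

theta = 0.3
c, s = np.cos(theta), np.sin(theta)
rotation   = np.array([[c,  s], [-s,  c]])    # det = +1 (Givens rotation block)
reflection = np.array([[c,  s], [ s, -c]])    # det = -1 ("Givens reflector" block)
for B in (rotation, reflection):
    assert np.allclose(B.T @ B, np.eye(2))    # both blocks are orthogonal
    print(np.linalg.det(B))
```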

IV-B One G-transform as a dictionary

Consider now the dictionary learning problem in (2). Let us keep the sparse representations $\mathbf{X}$ fixed and consider a single G-transform as the dictionary. We reach the following optimization problem:

$$\underset{(i,j),\ \tilde{\mathbf{G}}_{ij}^T\tilde{\mathbf{G}}_{ij} = \mathbf{I}_2}{\text{minimize}}\ \ \|\mathbf{Y} - \mathbf{G}_{ij}\mathbf{X}\|_F^2. \quad (9)$$

When the indices $(i, j)$ are fixed, the problem reduces to constructing $\tilde{\mathbf{G}}_{ij}$, a constrained two dimensional optimization problem. To select the indices $(i, j)$, among the $n(n-1)/2$ possibilities, an appropriate strategy needs to be defined. We detail next how to deal with these two problems in order to provide an overall solution for (9).

To solve (9) for fixed coordinates $(i, j)$ we reach the optimization problem

$$\underset{\tilde{\mathbf{G}}_{ij};\ \tilde{\mathbf{G}}_{ij}^T\tilde{\mathbf{G}}_{ij} = \mathbf{I}_2}{\text{minimize}}\ \ \left\| \mathbf{Y}_{\{ij\}} - \tilde{\mathbf{G}}_{ij}\mathbf{X}_{\{ij\}} \right\|_F^2. \quad (10)$$

This is a two dimensional orthogonal Procrustes problem [34] whose optimum solution is $\tilde{\mathbf{G}}_{ij} = \mathbf{U}\mathbf{V}^T$, where $\mathbf{Y}_{\{ij\}}\mathbf{X}_{\{ij\}}^T = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^T$ is a singular value decomposition. It has been shown in [26] that the reduction in the objective function of (10) when considering the orthogonal dictionary given by the Procrustes solution, relative to not applying any transform (i.e., $\tilde{\mathbf{G}}_{ij} = \mathbf{I}_2$), is

$$2\left(\left\|\mathbf{Y}_{\{ij\}}\mathbf{X}_{\{ij\}}^T\right\|_* - \mathrm{tr}\!\left(\mathbf{Y}_{\{ij\}}\mathbf{X}_{\{ij\}}^T\right)\right), \quad (11)$$

where $\|\cdot\|_*$ is the nuclear norm of $\mathbf{Y}_{\{ij\}}\mathbf{X}_{\{ij\}}^T$, i.e., the sum of its singular values.

Choosing the indices $(i, j)$ in (9) requires a closer look at its objective function (6) for the constrained G-transform structure. Using (11) we can state a result in the special case of a G-transform. We need both (6) and (11) because for the chosen indices $(i, j)$ the objective invokes the nuclear norm, while for all the other indices it invokes the trace. We can analyze the two objective function values separately because the Frobenius norm is elementwise and as such also blockwise. Therefore, denoting $\mathbf{Z} = \mathbf{Y}\mathbf{X}^T$ and $\mathbf{Z}_{\{ij\}} = \begin{bmatrix} Z_{ii} & Z_{ij} \\ Z_{ji} & Z_{jj} \end{bmatrix} = \mathbf{Y}_{\{ij\}}\mathbf{X}_{\{ij\}}^T$, the objective of (9) is

$$\|\mathbf{Y}\|_F^2 + \|\mathbf{X}\|_F^2 - 2\sum_{k \notin \{i, j\}} Z_{kk} - 2\left\|\mathbf{Z}_{\{ij\}}\right\|_*. \quad (12)$$

Since we want to minimize this quantity, the choice of indices needs to be made as follows

$$(i^*, j^*) = \underset{(i, j),\ j > i}{\arg\max}\ C_{ij}, \qquad C_{ij} = \left\|\mathbf{Z}_{\{ij\}}\right\|_* - \mathrm{tr}\!\left(\mathbf{Z}_{\{ij\}}\right), \quad (13)$$

and then solve a Procrustes problem [34] to construct $\tilde{\mathbf{G}}_{i^*j^*}$.

These values are the optimum indices that lead to the maximal reduction in the objective function of (9). The expression in (13) is computationally cheap given that $\mathbf{Z}_{\{ij\}}$ is a real $2 \times 2$ matrix. Its trace is trivial to compute (one addition operation) while the singular values of $\mathbf{Z}_{\{ij\}}$ can be explicitly computed as

$$\sigma_{1,2} = \sqrt{\frac{\left\|\mathbf{Z}_{\{ij\}}\right\|_F^2 \pm \sqrt{\left\|\mathbf{Z}_{\{ij\}}\right\|_F^4 - 4\det\!\left(\mathbf{Z}_{\{ij\}}\right)^2}}{2}}. \quad (14)$$

Therefore, the full singular value decomposition can be avoided and the sum of the singular values from (14) can be computed in only 23 operations (three of which are taking square roots). The cost of computing $C_{ij}$ for all $n(n-1)/2$ index pairs is therefore $O(n^2)$ operations. The computational burden is still dominated by constructing $\mathbf{Z} = \mathbf{Y}\mathbf{X}^T$, which takes $O(n^2 N)$ operations.
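
The index selection of (13) can therefore be implemented as in the following sketch (our own naming), which scores every pair with the closed-form $2 \times 2$ singular values of (14) and never calls a full SVD.

```python
import numpy as np

def best_pair(Z):
    """Return the pair (i, j), i < j, maximizing C_ij = ||Z_ij||_* - tr(Z_ij).

    Returns (None, 0.0) if no pair gives a strict improvement."""
    n = Z.shape[0]
    best, best_score = None, 0.0
    for i in range(n):
        for j in range(i + 1, n):
            B = Z[np.ix_([i, j], [i, j])]              # 2x2 block of Z = Y X^T
            f = np.sum(B ** 2)                         # squared Frobenius norm
            d = np.abs(np.linalg.det(B))
            disc = max(f ** 2 - 4.0 * d ** 2, 0.0)     # guard against round-off
            nuc = (np.sqrt((f + np.sqrt(disc)) / 2.0)
                   + np.sqrt(max((f - np.sqrt(disc)) / 2.0, 0.0)))
            score = nuc - np.trace(B)                  # nuclear norm minus trace
            if score > best_score:
                best, best_score = (i, j), score
    return best, best_score
```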

Remark 2. Notice that $C_{ij} \ge 0$ always. In general, this is because the sum of the singular values of any square matrix is always greater than or equal to the sum of its eigenvalues (and, in particular, than its trace). To see this, use the singular value decomposition $\mathbf{Z}_{\{ij\}} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^T$ and develop:

$$\mathrm{tr}\!\left(\mathbf{Z}_{\{ij\}}\right) = \mathrm{tr}\!\left(\mathbf{U}\mathbf{\Sigma}\mathbf{V}^T\right) = \mathrm{tr}\!\left(\mathbf{\Sigma}\mathbf{V}^T\mathbf{U}\right) = \sum_{k} \sigma_k w_{kk} \le \sum_{k} \sigma_k = \left\|\mathbf{Z}_{\{ij\}}\right\|_*, \quad (15)$$

where we have used the circular property of the trace and where $w_{kk}$ are the diagonal entries of $\mathbf{W} = \mathbf{V}^T\mathbf{U}$, which obey $|w_{kk}| \le 1$ since both $\mathbf{U}$ and $\mathbf{V}$ are orthogonal and the entries of an orthogonal matrix are sub-unitary in magnitude. Therefore, in our particular case, we have that $C_{ij} = 0$ when $\mathbf{Z}_{\{ij\}}$ is symmetric and positive semidefinite (we then have $\mathbf{W} = \mathbf{I}$ in (15) and therefore $\mathrm{tr}(\mathbf{Z}_{\{ij\}}) = \|\mathbf{Z}_{\{ij\}}\|_*$). If we have that $C_{ij} = 0$ for all $i$ and $j$ then no G-transform can reduce the objective function in (9) and therefore the solution is $\mathbf{G}_{ij} = \mathbf{I}$.

Remark 3. We can extend the G-transform to multiple indices. For example, if we consider three coordinates $(i, j, k)$ then the transform has a $3 \times 3$ non-zero orthogonal block. For a transform over $p$ indices there are $\binom{n}{p}$ possible such blocks and its matrix-vector multiplication takes $O(p^2)$ operations.

Remark 4. There are some connections between the Householder [26] and the G-transform approaches. As previously explained, when the reflection option in (8) is chosen, the G-transform can also be viewed as a Householder reflector, i.e., $\mathbf{G}_{ij} = \mathbf{I} - 2\mathbf{v}\mathbf{v}^T$ where $\mathbf{v}$ is a 2-sparse vector. Following results from [26] we can also write

$$\|\mathbf{Y} - \mathbf{G}_{ij}\mathbf{X}\|_F^2 = \|\mathbf{Y} - \mathbf{X}\|_F^2 + 4\,\mathbf{v}^T\mathbf{X}\mathbf{Y}^T\mathbf{v}, \quad (16)$$

which, together with (12), leads to the conclusion that minimizing the representation error amounts to making $\mathbf{v}^T\mathbf{X}\mathbf{Y}^T\mathbf{v}$ as negative as possible. This means that choosing the indices to maximize $C_{ij}$ in (12) is equivalent to computing an eigenvector of sparsity two of $\mathbf{X}\mathbf{Y}^T + \mathbf{Y}\mathbf{X}^T$ associated with a negative eigenvalue.

There are also some differences between the two approaches. For example, matrix-vector multiplication with a G-transform takes 6 operations, but when using the Householder structure it takes 8 operations (4 operations to compute the scalar $\alpha = 2\mathbf{v}^T\mathbf{x}$, 2 operations to compute the 2-sparse vector $\alpha\mathbf{v}$ and 2 operations to compute the final result $\mathbf{x} - \alpha\mathbf{v}$). Therefore, the G-transform structure is computationally preferable to the Householder structure. Each Householder reflector has $n - 1$ degrees of freedom (because of the unit norm constraint on $\mathbf{v}$) while each G-transform has only one (the angle $\theta$ in (8)) plus 1 bit (the choice of rotation or reflection in (8)).

This concludes our discussion for the single G-transform case. Notice that the solution outlined in this section solves (9) exactly, i.e., it finds the optimum G-transform.
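
Putting the two ingredients together, a self-contained sketch of the single G-transform solution of (9) could look as follows (names ours); it scores all pairs with the closed-form nuclear norm and then solves the $2 \times 2$ Procrustes problem for the winning pair.

```python
import numpy as np

def best_g_transform(Y, X):
    """Return (i, j, G2) for the single G-transform minimizing ||Y - G X||_F^2."""
    Z = Y @ X.T
    n = Z.shape[0]
    best, best_score = (0, 1), -np.inf
    for i in range(n):
        for j in range(i + 1, n):
            B = Z[np.ix_([i, j], [i, j])]
            nuc = np.sqrt(np.sum(B ** 2) + 2.0 * np.abs(np.linalg.det(B)))  # sigma_1 + sigma_2
            score = nuc - np.trace(B)                                       # C_ij from (13)
            if score > best_score:
                best, best_score = (i, j), score
    i, j = best
    U, _, Vt = np.linalg.svd(Z[np.ix_([i, j], [i, j])])
    return i, j, U @ Vt                # 2x2 Procrustes solution for the chosen pair
```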

IV-C A method for designing fast orthogonal transforms: G–DLA

In this paper we propose to construct an orthogonal transform $\mathbf{Q}$ that is a product of $m$ G-transforms:

$$\mathbf{Q} = \mathbf{G}_{i_m j_m}\cdots\mathbf{G}_{i_2 j_2}\mathbf{G}_{i_1 j_1}. \quad (17)$$

The value of $m$ is a choice of the user. For example, if we choose $m$ to be of order $n \log n$ the transform can be applied with $O(n \log n)$ computational complexity – similar to the classical fast transforms. The goal of this section is to propose a learning method that constructs such a transform.

We fix the representations $\mathbf{X}$ and all G-transforms in (17) except for the $k$-th one, denoted $\mathbf{G}_{i_k j_k}$. To optimize the dictionary only over this transform we reach the objective function

$$\|\mathbf{Y} - \mathbf{Q}\mathbf{X}\|_F^2 = \left\| \left(\mathbf{G}_{i_m j_m}\cdots\mathbf{G}_{i_{k+1} j_{k+1}}\right)^T \mathbf{Y} - \mathbf{G}_{i_k j_k}\,\mathbf{G}_{i_{k-1} j_{k-1}}\cdots\mathbf{G}_{i_1 j_1}\mathbf{X} \right\|_F^2 = \left\| \bar{\mathbf{Y}} - \mathbf{G}_{i_k j_k}\bar{\mathbf{X}} \right\|_F^2, \quad (18)$$

where we have used the fact that multiplication by any orthogonal transform preserves the Frobenius norm. For simplicity we have denoted by $\bar{\mathbf{Y}} = \left(\mathbf{G}_{i_m j_m}\cdots\mathbf{G}_{i_{k+1} j_{k+1}}\right)^T \mathbf{Y}$ and $\bar{\mathbf{X}} = \mathbf{G}_{i_{k-1} j_{k-1}}\cdots\mathbf{G}_{i_1 j_1}\mathbf{X}$ the known quantities in (18).

Notice that we have reduced the problem to the formulation in (9) whose full solution is outlined in the previous section. We can apply this procedure for all G-transforms in the product of and therefore a full update procedure presents itself: we will sequentially update each transform and then the sparse representations until convergence. The full procedure we propose, called G–DLA, is detailed in Algorithm 1.

The initialization of G–DLA uses a known construction. It has been shown experimentally in the past [39] that a good initial orthogonal dictionary is the matrix $\mathbf{U}$ from the singular value decomposition of the dataset, $\mathbf{Y} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^T$. We can also provide a theoretical argument for this choice. Consider that

$$\|\mathbf{Y} - \mathbf{U}\mathbf{X}\|_F^2 = \left\|\mathbf{U}^T\mathbf{Y} - \mathbf{X}\right\|_F^2 = \left\|\mathbf{\Sigma}\mathbf{V}^T - \mathbf{X}\right\|_F^2. \quad (19)$$

A sub-optimal choice is to assume that the operator $\mathcal{T}_s$ keeps only the first $s$ rows of $\mathbf{\Sigma}\mathbf{V}^T$, i.e., $\mathbf{X} = \mathbf{\Sigma}_s\mathbf{V}^T$, where $\mathbf{\Sigma}_s$ is the matrix obtained from $\mathbf{\Sigma}$ by keeping only the leading principal submatrix of size $s \times s$ and setting everything else to zero. This is a good choice since the positive diagonal elements of $\mathbf{\Sigma}$ are sorted in decreasing order of their values and therefore we expect to keep entries with large magnitude. In fact, $\|\mathbf{\Sigma}\mathbf{V}^T - \mathbf{\Sigma}_s\mathbf{V}^T\|_F^2 = \sum_{k=s+1}^{n} \sigma_k^2$, where the $\sigma_k$'s are the diagonal elements of $\mathbf{\Sigma}$, due to the fact that the rows of $\mathbf{V}^T$ have unit magnitude. Furthermore, with the same $\mathbf{X}$ we have $\|\mathbf{Y} - \mathbf{U}\mathbf{X}\|_F^2 = \sum_{k=s+1}^{n} \sigma_k^2$. We expect this error term to be relatively small since we sum over the smallest squared singular values of $\mathbf{Y}$. Therefore, with this choice of dictionary and the optimal $\mathbf{X} = \mathcal{T}_s(\mathbf{U}^T\mathbf{Y})$ we have that $\|\mathbf{Y} - \mathbf{U}\mathbf{X}\|_F^2 \le \sum_{k=s+1}^{n} \sigma_k^2$, i.e., the representation error is never larger than the error given by the best rank-$s$ approximation of $\mathbf{Y}$.
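
The argument above can be checked numerically with a few lines (a small sketch, variable names ours): keeping only the first $s$ rows of $\mathbf{\Sigma}\mathbf{V}^T$ yields exactly the best rank-$s$ approximation error of $\mathbf{Y}$, so the optimal thresholding can only do better.

```python
import numpy as np

n, N, s = 16, 256, 4
Y = np.random.randn(n, N)
U, sig, Vt = np.linalg.svd(Y, full_matrices=False)              # economy size SVD, Y = U Sigma V^T

X0 = np.diag(np.concatenate([sig[:s], np.zeros(n - s)])) @ Vt   # keep only the first s rows
err = np.linalg.norm(Y - U @ X0, 'fro') ** 2
print(err, np.sum(sig[s:] ** 2))                                # both equal the best rank-s error
```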

In G–DLA, with the sparse representations $\mathbf{X}$ fixed, we proceed to iteratively construct each G-transform. At step $k$, the problem to be solved is similar to (18), but all transforms indexed above $k$ are currently the identity (not yet initialized) and will be computed in the following steps.

Notice that $\mathbf{Z} = \mathbf{Y}\mathbf{X}^T$, necessary to compute all the scores $C_{ij}$, is computed fully only once before the iterative process. At each iteration of the algorithm only two columns of $\mathbf{Z}$ need to be recomputed. Therefore, the update of $\mathbf{Z}$ is trivial since it involves linear combinations of two columns according to a G-transform multiplication (5).

Initialization:
  1. Perform the economy size singular value decomposition of the dataset $\mathbf{Y} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^T$.

  2. Compute sparse representations $\mathbf{X} = \mathcal{T}_s(\mathbf{U}^T\mathbf{Y})$.

  3. For $k = 1, \dots, m$: with $\mathbf{X}$ and all previous G-transforms fixed, and all following ones equal to the identity, construct the new $\mathbf{G}_{i_k j_k}$, where the indices $(i_k, j_k)$ are given by (13) and $\tilde{\mathbf{G}}_{i_k j_k} = \mathbf{U}\mathbf{V}^T$ by the singular value decomposition $\bar{\mathbf{Y}}_{\{i_k j_k\}}\bar{\mathbf{X}}_{\{i_k j_k\}}^T = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^T$, such that we minimize the objective as in (18) for $\bar{\mathbf{Y}} = \mathbf{Y}$ and $\bar{\mathbf{X}} = \mathbf{G}_{i_{k-1} j_{k-1}}\cdots\mathbf{G}_{i_1 j_1}\mathbf{X}$.

Iterations $1, \dots, K$:
  1. For $k = 1, \dots, m$: two-step update of $\mathbf{G}_{i_k j_k}$, with $\mathbf{X}$ and all other transforms fixed, such that (18) is minimized:

    1. Update the best indices $(i_k, j_k)$ by (13), applied to $\bar{\mathbf{Z}} = \bar{\mathbf{Y}}\bar{\mathbf{X}}^T$.

    2. With the new indices, update the transform $\tilde{\mathbf{G}}_{i_k j_k} = \mathbf{U}\mathbf{V}^T$ by the singular value decomposition $\bar{\mathbf{Y}}_{\{i_k j_k\}}\bar{\mathbf{X}}_{\{i_k j_k\}}^T = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^T$, where $\bar{\mathbf{Y}}$ and $\bar{\mathbf{X}}$ are as in (18).

  2. Compute sparse representations $\mathbf{X} = \mathcal{T}_s(\mathbf{Q}^T\mathbf{Y})$, where $\mathbf{Q}$ is given by (17).

Algorithm 1 – G–DLA.
Fast Orthonormal Transform Learning.

Input: The dataset $\mathbf{Y}$, the number of G-transforms $m$, the target sparsity $s$ and the number of iterations $K$.
Output: The sparsifying square orthogonal transform $\mathbf{Q} = \mathbf{G}_{i_m j_m}\cdots\mathbf{G}_{i_1 j_1}$ and sparse representations $\mathbf{X}$ such that $\|\mathbf{Y} - \mathbf{Q}\mathbf{X}\|_F^2$ is reduced.

The iterations of G–DLA update each G-transform sequentially, keeping all others constant, in order to minimize the current error term.

The algorithm is fast since the matrices involved in all computations can be updated from previous iterations. For example, when moving from step $k$ to step $k+1$, notice from (18) that $\bar{\mathbf{Y}}$ and $\bar{\mathbf{X}}$ change only through a single G-transform multiplication, i.e., only two of their rows change. The same observation holds for $\bar{\mathbf{Z}} = \bar{\mathbf{Y}}\bar{\mathbf{X}}^T$. Of course, at the first step $\bar{\mathbf{X}} = \mathbf{X}$ and $\bar{\mathbf{Y}} = \left(\mathbf{G}_{i_m j_m}\cdots\mathbf{G}_{i_2 j_2}\right)^T\mathbf{Y}$. We always need to construct $\bar{\mathbf{Z}}$ from scratch at the beginning of each iteration since $\mathbf{X}$ has been fully updated in the sparse reconstruction step.

After all the transforms are computed, the dictionary $\mathbf{Q}$ is never explicitly constructed. We always keep its factorization (17) and apply it (directly or inversely) by sequentially applying the G-transforms in its composition. The total computational complexity of applying this dictionary to a vector is $6m$ operations which, for $m$ of order $n \log n$, amounts to $O(n \log n)$. This is to be compared with the $O(n^2)$ operations of a general orthogonal dictionary. Additionally, when consecutive G-transforms operate on disjoint index pairs they can be applied in parallel, reducing the running time of G–DLA and that of manipulating the resulting dictionary.
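
A sketch (our own naming) of applying the factorized transform of (17), and its transpose/inverse, without ever forming the $n \times n$ matrix:

```python
import numpy as np

def apply_Q(x, transforms):
    """Apply Q = G_m ... G_1 to a vector x; `transforms` lists tuples (i, j, G2)
    with G2 the 2x2 orthogonal block, ordered so that the first entry acts first."""
    y = x.copy()
    for i, j, G2 in transforms:                  # right-most factor acts first
        y[[i, j]] = G2 @ y[[i, j]]               # 6 operations per G-transform
    return y

def apply_Qt(x, transforms):
    """Apply Q^T (the inverse of the orthogonal Q): reverse order, transpose each block."""
    y = x.copy()
    for i, j, G2 in reversed(transforms):
        y[[i, j]] = G2.T @ y[[i, j]]
    return y
```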

The number of transforms $m$ could also be decided during the runtime of G–DLA based on the magnitude of the largest score $C_{ij}$. Since this magnitude determines the reduction in the objective function of our problem, a threshold can be introduced to decide on the fly whether a new transform is worth adding to the factorization.

These observations are important from a computational perspective since the number of transforms is relatively high, $m = O(n \log n)$, and therefore their manipulation should be performed efficiently when learning the dictionary in order to keep the running time of G–DLA low.

Since each G-transform computed in our method maximally reduces the objective function and because the sparse reconstruction step is exact when using an orthogonal dictionary, we can guarantee that the proposed method monotonically converges to a local optimum point.

Remark 5. At each step of the proposed algorithm we update a single G-transform according to the maximum score $C_{ij}$. We have in fact the opportunity to update a maximum of $\lfloor n/2 \rfloor$ transforms simultaneously. We could for example partition the index set $\{1, \dots, n\}$ into pairs and construct the corresponding G-transforms such that the sum of their scores $C_{ij}$ is maximized. With such a strategy fewer iterations are necessary, but the problem of partitioning the indices such that the error is maximally reduced can be computationally demanding (all possible unique pairings of the indices need to be generated). We expect G–DLA, as it is, to produce better results (lower representation error) due to the one-by-one transform update mechanism. Compared to the Householder approach [26] we again expect G–DLA to perform better since the optimization is made over two coordinates at a time.

Even so, there are several options regarding the ordering. We can process the G-transforms in the order of their indices or in a random order, for example in an effort to avoid local minimum points.

Remark 6. After indices $(i^*, j^*)$ are selected we have that $C_{i^*j^*} = 0$ and therefore this pair cannot be selected again until either index $i^*$ or $j^*$ participates in the construction of a future G-transform. This is because after constructing $\tilde{\mathbf{G}}_{i^*j^*}$ to minimize (10) the block $\mathbf{Z}_{\{i^*j^*\}}$ is updated to $\mathbf{Z}_{\{i^*j^*\}}\tilde{\mathbf{G}}_{i^*j^*}^T = \mathbf{U}\mathbf{\Sigma}\mathbf{U}^T$, which is symmetric and positive semidefinite due to the Procrustes solution $\tilde{\mathbf{G}}_{i^*j^*} = \mathbf{U}\mathbf{V}^T$.

Remark 7. As previously discussed, the Procrustes solution $\mathbf{Q} = \mathbf{U}\mathbf{V}^T$ is the best orthogonal minimizer of (2) for fixed $\mathbf{X}$. It has been shown in [26] that with this $\mathbf{Q}$ the matrix $\mathbf{Q}^T\mathbf{Y}\mathbf{X}^T$ is symmetric and positive semidefinite. Since $\mathbf{Q}$ is the global minimizer, there cannot be a G-transform that further reduces the error. This means that all $2 \times 2$ symmetric sub-matrices of $\mathbf{Q}^T\mathbf{Y}\mathbf{X}^T$ are positive semidefinite, i.e., $\|\mathbf{Z}_{\{ij\}}\|_* = \mathrm{tr}(\mathbf{Z}_{\{ij\}})$ and $C_{ij} = 0$ for all pairs $(i, j)$. This observation holds for any symmetric positive semidefinite matrix. Unfortunately, the converse is not true in general: a matrix whose $2 \times 2$ principal sub-matrices are all positive semidefinite need not itself be positive semidefinite.

This means that even with an appropriately large $m$, G–DLA might not always be able to match the performance of Q–DLA. This is not a major concern since in this paper we explore fast transforms and therefore $m$ is kept relatively small, of order $n \log n$.

This concludes the presentation of the proposed G–DLA method. Based on similar principles, we next provide a learning method for fast, square but non-orthogonal dictionaries.

V A method for designing fast, general, non-orthogonal transforms: R–DLA

In the case of orthogonal dictionaries, the fundamental building blocks like Householder reflectors and Givens rotations are readily available. This is not the case for general dictionaries. In this section we propose a building block for non-orthogonal structures in subsection A and then show how this can be used in a similar fashion to the G-transform to learn computationally efficient square non-orthogonal dictionaries by deriving the R–DLA method in subsection B.

V-A A building block for fast non-orthogonal transforms

We assume no constraints on the variables $\{a, b, c, d\}$ (these are four degrees of freedom) and therefore $\mathbf{R}_{ij}$ from (3) is no longer orthogonal in general. We propose to solve the following optimization problem

$$\underset{(i,j),\ \tilde{\mathbf{R}}_{ij}}{\text{minimize}}\ \ \|\mathbf{Y} - \mathbf{R}_{ij}\mathbf{X}\|_F^2. \quad (20)$$

As in the G-transform case, we proceed by analyzing how the indices $(i, j)$ are selected and then how to solve the optimization problem (20) with the indices fixed. We define

$$\mathbf{Z} = \mathbf{Y}\mathbf{X}^T \quad \text{and} \quad \mathbf{W} = \mathbf{X}\mathbf{X}^T, \quad (21)$$

with entries $Z_{ij}$ and $W_{ij}$, respectively.

Solving (20) for fixed indices $(i, j)$ leads to a least squares optimization problem

$$\underset{\tilde{\mathbf{R}}_{ij}}{\text{minimize}}\ \ \left\| \mathbf{Y}_{\{ij\}} - \tilde{\mathbf{R}}_{ij}\mathbf{X}_{\{ij\}} \right\|_F^2, \quad (22)$$

where $\mathbf{Y}_{\{ij\}}$ and $\mathbf{X}_{\{ij\}}$ collect rows $i$ and $j$ of $\mathbf{Y}$ and $\mathbf{X}$ respectively, as in (7), and whose solution is $\tilde{\mathbf{R}}_{ij} = \mathbf{Y}_{\{ij\}}\mathbf{X}_{\{ij\}}^T\left(\mathbf{X}_{\{ij\}}\mathbf{X}_{\{ij\}}^T\right)^{-1}$.

Choosing the indices $(i, j)$ in (20) depends on the objective function value in (22) achieved by the least squares solution from above:

$$\left\|\mathbf{Y}_{\{ij\}}\right\|_F^2 - \mathrm{tr}\!\left(\mathbf{Y}_{\{ij\}}\mathbf{X}_{\{ij\}}^T\left(\mathbf{X}_{\{ij\}}\mathbf{X}_{\{ij\}}^T\right)^{-1}\mathbf{X}_{\{ij\}}\mathbf{Y}_{\{ij\}}^T\right). \quad (23)$$

This, together with the development in (6), leads to

$$\|\mathbf{Y}\|_F^2 - 2\sum_{k \notin \{i,j\}} Z_{kk} + \sum_{k \notin \{i,j\}} W_{kk} - \mathrm{tr}\!\left(\mathbf{Z}_{\{ij\}}\mathbf{W}_{\{ij\}}^{-1}\mathbf{Z}_{\{ij\}}^T\right). \quad (24)$$

Since the matrices involved in the computation are of size $2 \times 2$ (here $\mathbf{W}_{\{ij\}}$ denotes the $2 \times 2$ block of $\mathbf{W}$ with entries $W_{ii}, W_{ij}, W_{ji}, W_{jj}$, analogously to $\mathbf{Z}_{\{ij\}}$), we can use the trace formula and the explicit inversion formula for a $2 \times 2$ matrix to calculate the reduction in the objective function achieved by the pair $(i, j)$:

$$C_{ij} = \mathrm{tr}\!\left(\mathbf{Z}_{\{ij\}}\mathbf{W}_{\{ij\}}^{-1}\mathbf{Z}_{\{ij\}}^T\right) - 2\left(Z_{ii} + Z_{jj}\right) + W_{ii} + W_{jj}. \quad (25)$$

Finally, to solve (20) we select the indices as

$$(i^*, j^*) = \underset{(i, j),\ j > i}{\arg\max}\ C_{ij}, \quad (26)$$

and then solve a least squares problem to construct $\tilde{\mathbf{R}}_{i^*j^*}$. The scores $C_{ij}$ are computed only when the $2 \times 2$ matrix $\mathbf{W}_{\{ij\}}$ is invertible; otherwise we set $C_{ij} = 0$. To compute each $C_{ij}$ in (25) we need 24 operations and there are $n(n-1)/2$ such scores. The computational burden is dominated by constructing $\mathbf{Z} = \mathbf{Y}\mathbf{X}^T$ and $\mathbf{W} = \mathbf{X}\mathbf{X}^T$, which take $O(n^2 N)$ operations each.
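
The closed-form update for a fixed pair, together with the score $C_{ij}$, can be written compactly as below (a sketch with our own naming; it assumes the $2 \times 2$ matrix $\mathbf{W}_{\{ij\}}$ is invertible).

```python
import numpy as np

def r_transform_update(Y, X, i, j):
    """Least squares 2x2 block for rows (i, j): R_tilde = Y_ij X_ij^T (X_ij X_ij^T)^{-1},
    using the explicit 2x2 inverse, plus the resulting reduction in the objective."""
    Yij, Xij = Y[[i, j], :], X[[i, j], :]
    Z2 = Yij @ Xij.T                              # 2x2 block of Z = Y X^T
    W2 = Xij @ Xij.T                              # 2x2 block of W = X X^T
    det = W2[0, 0] * W2[1, 1] - W2[0, 1] * W2[1, 0]
    W2_inv = np.array([[ W2[1, 1], -W2[0, 1]],
                       [-W2[1, 0],  W2[0, 0]]]) / det
    R_tilde = Z2 @ W2_inv
    score = np.trace(Z2 @ W2_inv @ Z2.T) - 2.0 * np.trace(Z2) + np.trace(W2)
    return R_tilde, score
```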

Remark 8. A necessary condition for a dictionary $\mathbf{D}$ to be a local minimum point of the dictionary learning problem is that $C_{ij} = 0$ for all $1 \le i < j \le n$.

This concludes our discussion for one transform $\mathbf{R}_{ij}$. Notice that, just like in the case of one G-transform, the solution given here finds the optimum transform that minimizes (20).

V-B A method for designing fast general transforms: R–DLA

Initialization:
  1. Perform the economy size singular value decomposition of the dataset $\mathbf{Y} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^T$.

  2. Compute sparse representations $\mathbf{X} = \mathcal{T}_s(\mathbf{U}^T\mathbf{Y})$.

Iterations (construction phase):
  1. For $k = 1, \dots, m$: with $\mathbf{X}$ and all previous R-transforms fixed, and all following ones equal to the identity, construct the new $\mathbf{R}_{i_k j_k}$, where the indices $(i_k, j_k)$ are given by (26) and $\tilde{\mathbf{R}}_{i_k j_k}$ is given by the least squares solution that minimizes (28).

  2. Compute the diagonal $\mathbf{E}$ in (27) such that the columns of $\mathbf{D}$ have unit $\ell_2$ norm.

  3. Compute sparse representations $\mathbf{X}$ by OMP, where the dictionary $\mathbf{D}$ is given in (27).

Iterations (refinement phase):
  1. For $k = 1, \dots, m$: with $\mathbf{X}$, the indices $(i_k, j_k)$ and all transforms except the $k$-th fixed, update only the non-zero part of $\mathbf{R}_{i_k j_k}$, denoted $\tilde{\mathbf{R}}_{i_k j_k}$, such that (30) is minimized.

  2. Compute the diagonal $\mathbf{E}$ in (27) such that the columns of $\mathbf{D}$ have unit $\ell_2$ norm.

  3. Compute sparse representations $\mathbf{X}$ by OMP, where the dictionary $\mathbf{D}$ is given in (27).

Algorithm 2 – R–DLA.
Fast Non-orthogonal Transform Learning.

Input: The dataset $\mathbf{Y}$, the number of R-transforms $m$, the target sparsity $s$ and the number of iterations for each of the two phases.
Output: The sparsifying square non-orthogonal transform $\mathbf{D}$ and sparse representations $\mathbf{X}$ such that $\|\mathbf{Y} - \mathbf{D}\mathbf{X}\|_F^2$ is reduced.

Similarly to G–DLA, we now propose to construct a general dictionary $\mathbf{D}$ that is a product of $m$ R-transforms and a diagonal normalization:

$$\mathbf{D} = \mathbf{R}_{i_m j_m}\cdots\mathbf{R}_{i_2 j_2}\mathbf{R}_{i_1 j_1}\mathbf{E}, \quad \mathbf{E} \ \text{diagonal}. \quad (27)$$

The value of $m$ is a choice of the user. For example, if we choose $m$ to be of order $n \log n$ the dictionary can be applied with $O(n \log n)$ computational complexity – similar to the classical fast transforms. The goal of this section is to propose a learning method that constructs such a general dictionary. As the transformations are no longer orthogonal, the diagonal matrix $\mathbf{E}$ is there to ensure that all columns of $\mathbf{D}$ are normalized (as in the formulation (1)). This normalization does not affect the performance of the method since any representation $\mathbf{X}$ for the unnormalized product $\mathbf{R}_{i_m j_m}\cdots\mathbf{R}_{i_1 j_1}$ is equivalent to the representation $\mathbf{E}^{-1}\mathbf{X}$ for $\mathbf{D}$.

We fix the representations $\mathbf{X}$ and all transforms in (27) except for the $k$-th transform $\mathbf{R}_{i_k j_k}$. Moreover, all transforms following the $k$-th, i.e., $\mathbf{R}_{i_{k+1} j_{k+1}}, \dots, \mathbf{R}_{i_m j_m}$, are set to the identity. Because the transforms are not orthogonal we cannot access directly any transform in (27), but only the left-most one. In this case, to optimize the dictionary only over this transform we reach the objective

$$\left\| \mathbf{Y} - \mathbf{R}_{i_k j_k}\bar{\mathbf{X}} \right\|_F^2, \quad \text{with} \quad \bar{\mathbf{X}} = \mathbf{R}_{i_{k-1} j_{k-1}}\cdots\mathbf{R}_{i_1 j_1}\mathbf{X}. \quad (28)$$

Therefore, our goal is to solve

$$\underset{\mathbf{R}_{i_k j_k}}{\text{minimize}}\ \ \left\| \mathbf{Y} - \mathbf{R}_{i_k j_k}\bar{\mathbf{X}} \right\|_F^2. \quad (29)$$

Notice that we have reduced the problem to the formulation in (20) whose full solution is outlined in the previous section. We can apply this procedure for all R-transforms in the product (27) and therefore a full update procedure presents itself: we will sequentially update each transform in (27), from the right to the left, and then the sparse representations, until convergence or for a given total number of iterations.

Once these iterations terminate we can refine the result. As previously mentioned, we cannot arbitrarily update a transform sitting in the middle of the product because the surrounding transforms are not orthogonal. But we can update its non-zero part $\tilde{\mathbf{R}}_{i_k j_k}$ while keeping its indices fixed. Consider the development:

$$\|\mathbf{Y} - \mathbf{A}\mathbf{R}_{i_k j_k}\mathbf{B}\|_F^2 = \left\| \mathrm{vec}(\mathbf{Y} - \mathbf{A}\mathbf{B}) - \left(\mathbf{B}_{\{i_k j_k\}}^T \otimes \begin{bmatrix}\mathbf{a}_{i_k} & \mathbf{a}_{j_k}\end{bmatrix}\right) \mathrm{vec}\!\left(\tilde{\mathbf{R}}_{i_k j_k} - \mathbf{I}_2\right) \right\|_2^2, \quad (30)$$

where $\mathbf{A} = \mathbf{R}_{i_m j_m}\cdots\mathbf{R}_{i_{k+1} j_{k+1}}$ and $\mathbf{B} = \mathbf{R}_{i_{k-1} j_{k-1}}\cdots\mathbf{R}_{i_1 j_1}\mathbf{E}\mathbf{X}$. We have denoted by $\mathbf{a}_i$ the $i$-th column of $\mathbf{A}$, by $\mathbf{B}_{\{i_k j_k\}}$ the two rows $i_k$ and $j_k$ of $\mathbf{B}$, and $\otimes$ is the Kronecker product. To develop (30) we have used the fact that the Frobenius norm is an elementwise operator, the structure of $\mathbf{R}_{i_k j_k}$ and the fact that

$$\mathrm{vec}(\mathbf{A}\mathbf{M}\mathbf{B}) = \left(\mathbf{B}^T \otimes \mathbf{A}\right)\mathrm{vec}(\mathbf{M}). \quad (31)$$

The $\tilde{\mathbf{R}}_{i_k j_k}$ that minimizes (30) is given by the corresponding least squares solution. Therefore, once the product of transforms is constructed we can update the non-zero part of any transform to further reduce the objective function. What we cannot do is update the indices on which the computation takes place; these stay the same.
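
A sketch of this refinement step under the notation above (the helper name and the explicit arguments are ours): it updates the $2 \times 2$ block of a transform sitting anywhere in the product by solving a four-variable least squares problem built with the Kronecker identity (31).

```python
import numpy as np

def refine_block(Y, A, B, i, j):
    """Refine the 2x2 block of the transform acting on rows (i, j), given A (product of
    the transforms to its left) and B (product of the transforms to its right, already
    applied to the scaled representations). Returns the updated 2x2 block."""
    Aij = A[:, [i, j]]                              # n x 2, columns i and j of A
    Bij = B[[i, j], :]                              # 2 x N, rows i and j of B
    M = np.kron(Bij.T, Aij)                         # (nN) x 4 matrix from the vec identity (31)
    r = (Y - A @ B).reshape(-1, order='F')          # column-major vec of the residual at R = I
    v, *_ = np.linalg.lstsq(M, r, rcond=None)       # solve for vec(R_tilde - I_2)
    return np.eye(2) + v.reshape(2, 2, order='F')
```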

Therefore, we propose a learning procedure that has two sets of iterations: the first constructs the transforms one by one, from scratch, while the second refines the non-zero parts of the already constructed transforms, keeping their indices fixed.