Convergence Acceleration via Chebyshev Step: Plausible Interpretation of Deep-Unfolded Gradient Descent
Deep unfolding is a promising deep-learning technique, whose network architecture is based on expanding the recursive structure of existing iterative algorithms. Although convergence acceleration is a remarkable advantage of deep unfolding, its theoretical aspects have not been revealed yet. The first half of this study details the theoretical analysis of the convergence acceleration in deep-unfolded gradient descent (DUGD) whose trainable parameters are step sizes. We propose a plausible interpretation of the learned step-size parameters in DUGD by introducing the principle of Chebyshev steps derived from Chebyshev polynomials. The use of Chebyshev steps in gradient descent (GD) enables us to bound the spectral radius of a matrix governing the convergence speed of GD, leading to a tight upper bound on the convergence rate. The convergence rate of GD using Chebyshev steps is shown to be asymptotically optimal, although it has no momentum terms. We also show that Chebyshev steps numerically explain the learned step-size parameters in DUGD well. In the second half of the study, Chebyshev-periodical successive over-relaxation (Chebyshev-PSOR) is proposed for accelerating linear/nonlinear fixed-point iterations. Theoretical analysis and numerical experiments indicate that Chebyshev-PSOR exhibits significantly faster convergence for various examples such as Jacobi method and proximal gradient methods.
Deep learning is emerging as a fundamental technology in numerous fields such as image recognition, speech recognition, and natural language processing. Recently, deep learning techniques have been applied to general computational processes that are not limited to neural networks. The approach is called differential programming. If the internal processes of an algorithm are differentiable, then standard deep learning techniques such as back propagation and stochastic gradient decent (SGD) can be employed to adjust learnable parameters of the algorithm. Data-driven tuning often improves the performance of the original algorithm, and it provides flexibility to learn the environment in which the algorithm is operating.
A set of standard deep learning techniques can be naturally applied to internal parameter optimization of a differentiable iterative algorithm such as iterative minimization algorithm. By embedding the learnable parameters in an iterative algorithm, we can construct a flexible derived algorithm in a data-driven manner. This approach is often called deep unfolding [19, 21]. In , a deep-unfolded sparse signal recovery algorithm is proposed based on the iterative shrinkage-thresholding algorithm (ISTA); the proposed algorithm shows remarkable performance compared with the original ISTA.
Recently, numerous studies on signal processing algorithms focusing deep unfolding have been presented for sparse signal recovery [44, 52, 7, 51, 23, 46], image recovery [43, 24, 30, 25], wireless communications [34, 42, 20, 45, 53, 49, 47], and distributed computing . Theoretical aspects of deep unfolding have also been investigated [11, 28, 31], which mainly focus on analyses of deep-unfolded sparse signal recovery algorithms. More details on related works can be found in the excellent surveys on deep unfolding [1, 33].
Ito et al. proposed trainable ISTA (TISTA)  for sparse signal recovery which is an instance of deep-unfolded algorithms. This algorithm can be regarded as a proximal gradient method with trainable step-size parameters. They showed that the convergence speed of signal recovery processes can be significantly accelerated with deep unfolding. This phenomenon, called convergence acceleration, is the most significant advantage of deep unfolding.
However, a clear explanation on how convergence acceleration emerges when using deep unfolding is still lacking. When plotting the learned step sizes of TISTA as a function of the iterative step, we often obtain a zig-zag shape (Fig.11 of ), i.e., non-constant step sizes. Another example of convergence acceleration is shown in Fig.1. We can observe that the DUGD provides much faster convergence to the optimal point compared with a naive GD with the optimal constant step size. A zig-zag shape of a learned step-size sequence can be seen in Fig.1 (lower). The zig-zag strategy on step sizes seems preferable to accelerate the convergence; however, a reasonable interpretation still remains an open problem.
The primary goal of this study is to provide a theoretically convincing interpretation of learned parameters and to answer the question regarding the mechanism of the convergence acceleration. We expect that deepening our understanding regarding convergence acceleration is of crucial importance because knowing the mechanism would lead to a novel way for improving convergence behaviors of numerous iterative signal processing algorithms.
The study is divided into two parts. In the first part, we will pursue our primary goal, i.e., to provide a plausible theoretical interpretation of learned step sizes. The main contributions of this part are 1) a special step size sequence called Chebyshev steps is introduced by extending an acceleration method for average consensus , which can accelerate the convergence speed of GD without momentum terms and 2) a learned step size in DUGD can be well explained by Chebyshev steps, which can be considered a promising answer to the question posed above. The Chebyshev step is naturally derived by minimizing an upper bound on the spectral radius dominating the convergence rate using Chebyshev polynomials.
The second part is devoted to our secondary goal, i.e., to extend the theory of Chebyshev steps. We especially focus on accelerating the linear/nonlinear fixed-point iterations using Chebyshev steps. The main contributions of the second part are 1) a novel method, called Chebyshev-periodical successive over-relaxation (PSOR), for accelerating the convergence speed of fixed-point iterations and 2) its local convergence analysis are proposed. Chebyshev-PSOR can be regarded as a variant of SOR or Krasnosel’skiǐ-Mann iteration . One of the most notable features of Chebyshev-PSOR is that it can be applied to nonlinear fixed-point iterations in addition to linear fixed-point iterations. It appears particularly effective for accelerating proximal gradient methods such as ISTA.
This paper is organized as follows. In Section II, we describe brief reviews of background materials as preliminaries. In Section III, we provide a theoretical interpretation of learned step-size parameters in DUGD by introducing Chebyshev steps. In Section IV, we introduce Chebyshev-PSOR for fixed-point iterations as an application of Chebyshev steps. The last section provides a summary of this paper.
In this paper, represents an -dimensional column vector and denotes a matrix of size . In particular, denotes the identity matrix of size . For a complex matrix , denotes a Hermitian transpose of . For with real eigenvalues, we define and as the maximum and minimum eigenvalues of , respectively. Then, the condition number of is given as . The spectral radius of is given as , where is an eigenvalue of the matrix . For matrix , represents the operator norm (spectral norm) of . We define as the Gaussian distribution with mean and variance .
Ii-B Deep unfolding
In this subsection, we briefly introduce the principle of deep unfolding and demonstrate it using a toy problem.
Deep unfolding is a deep learning technique based on existing iterative algorithms. Generally, iterative algorithms have some parameters that control convergence properties. In deep unfolding, we first expand the recursive structure to the unfolded feed-forward network, in which we can embed some trainable parameters to increase the flexibility of the iterative algorithm. These embedded parameters can be trained by standard deep learning techniques if the algorithm consists of only differentiable sub-processes. Figure 2 shows a schematic diagram of (a) an iterative algorithm with a set of parameters and (b) an example of deep-unfolded algorithm. In the deep-unfolded algorithm, the embedded trainable parameters are updated to decrease the loss function between the true solution and output by an optimizer. This training process is executed using supervised data , where is an input of the algorithm and is the corresponding true solution.
Example of deep unfolding
We turn to an example of a deep-unfolded algorithm based on GD
where is a noiseless measurement vector with the measurement matrix () and the original signal . In the following experiment, is sampled from a Gaussian random matrix ensemble, whose elements independently follows
Although the solution of (1) is explicitly given by , GD can be used to evaluate its solution. The recursive formula of GD is given by
where is an initial vector, and is a step-size parameter.
By applying the principle of deep unfolding, we introduce a deep-unfolded GD (DUGD) by
where is a trainable step-size parameter that depends on the iteration index . The parameters can be trained by minimizing the mean squared error (MSE) between the estimate and in a mini batch containing supervised data . Especially, we employ incremental training  in the training process to avoid a gradient vanishing problem and escape from local minima of the loss function. During incremental training, the number of trainable parameters increases at every mini batch; that is, the DUGD of iterations (or layers) is trained from the th mini batch to the th mini batch, which we call the th generation of incremental training. After the th generation ends, we add an initialized parameter to the learned parameters to start the next generation.
Figure 1 shows empirical results of the MSE performance (upper) and learned parameters (lower) of DUGD () and the naive GD with optimal constant step size when . In the training process, the initial values of were set to , and mini batches of size and Adam optimizer  with learning rate were used in each generation. We found that the learned parameter sequence had a zig-zag shape and that DUGD yields a much steeper error curve after 20-25 iterations compared with the naive GD with a constant step size. These learned parameters are not intuitive or interpretable from conventional theory. Unveiling the reason for the zig-zag shape is the main objective of the next section.
Ii-C Brief review of accelerated gradient descent
In the first part of this study, we will focus on GD for minimizing a quadratic objective function , where and is a Hermitian matrix. Here, we will review first-order methods that use only first-order gradient information in their update rules.
GD with optimal constant step
An iterative process of a naive GD with the constant step size is given by
It is known [5, Section 1.3] that the convergence rate with an optimal step size is given as
where , , and . The momentum method exhibits faster convergence than the naive GD because of the momentum term . It was proved that the momentum method achieves a convergence rate . which coincides with the lower bound of the convergence rate of the first-order methods . This implies that the momentum method is an optimal first-order method in terms of the convergence rate.
Chebyshev semi-iterative method
The Chebyshev semi-iterative method is another first-order method with an optimal rate which is defined by
where the initial conditions are given by , , , and . The Chebyshev semi-iterative method also achieves the convergence rate .
Iii Chebyshev steps for gradient descent
In this section, we introduce Chebyshev steps that accelerate the convergence speed of GD and discuss the relation between Chebyshev steps and learned step-size sequences in DUGD.
Iii-a Deep-unfolded gradient descent
We consider the minimization of a convex quadratic function where is a Hermitian positive definite matrix and is its solution. Note that this minimization problem corresponds to the LMS problem (1) under proper transformation.
GD with the step-size sequence is given by
where is an arbitrary point in .
In DUGD, we first fix the total number of iterations
for any . Note that is a function of step-size parameters . We aim to show that a proper step-size sequence accelerates the convergence speed, i.e., reduces the spectral radius .
Iii-B Chebyshev steps
In this subsection, our aim is to determine a step-size sequence that reduces the spectral radius to understand the nontrivial step-size sequence of DUGD. In the simplest case, the optimal step size is analytically given by in (5). When , however, minimizing with respect to becomes a non-convex problem in general. We thus cannot rely on numerical optimization methods. We alternatively introduce a step-size sequence that tightly bounds the spectral radius from above and analyze its convergence rate. These ideas follow the worst-case optimal (WO) interpolation for average consensus . Whereas the WO interpolation is considered only for a graph Laplacian, the analysis in this paper treats GD with a general Hermitian matrix whose minimum eigenvalue may not be equal to zero.
Recall that the Hermitian positive definite matrix has positive eigenvalues including degeneracy. Hereafter, we assume that to avoid a trivial case.
To calculate eigenvalues of a matrix function, we use the following lemma [22, p. 4-11, 39].
For a polynomial with complex coefficients, if is an eigenvalue of associated with eigenvector , then is an eigenvalue of the matrix associated with the eigenvector .
Then, we have
where is a function containing step-size parameters .
In the general case in which , we focus on the step-size sequence which minimizes the upper bound of the spectral radius . The upper bound is given by
In the following discussions, we use Chebyshev polynomials  (see App. A in Supplemental Material). Chebyshev polynomials have the minimax property, in which has the smallest -norm in the Banach space (see Thm. 5). Using the affine transformation , the monic polynomial is given as
has the smallest -norm in . Let be reciprocals to the zeros of , i.e., . Then, from (53), are given by
In this study, this sequence is called Chebyshev steps. Note that this sequence is also proposed for accelerated Gibbs sampling .
As described above, Chebyshev steps minimize the -norm in in which a function is represented by . Although it is nontrivial that Chebyshev steps also minimize , we can prove that this is true.
For a given , we define Chebyshev steps of length as
Then, the Chebyshev steps form a sequence that minimizes the upper bound of the spectral radius of .
The proof is shown in App. B in Supplemental Material. Note that the Chebyshev step is identical to the optimal constant step size (5) when . Chebyshev steps are a natural extension of the optimal constant step size.
The reciprocal of the Chebyshev steps corresponds to Chebyshev nodes as zeros of . Figure 3 shows the Chebyshev nodes and Chebyshev steps when , , and . A Chebyshev node is defined as a point which is projected onto an axis from a point of degree on a semi-circle (see right part of Fig. 3). The Chebyshev nodes are located symmetrically with respect to the center of the semi-circle corresponding to . The Chebyshev steps are given by , which is shown in the left part of Fig. 3. We found that the Chebyshev steps are widely located compared with the optimal constant step size .
Iii-C Convergence analysis
We analyze the convergence rate of GD with Chebyshev steps (CHGD). Because is a Hermitian matrix, is a normal matrix and [22, p. 8-7, 19] holds. Let be the matrix including the Chebyshev steps of length . Then, we have
where the second equality is due to . This indicates that the MSE of every iteration is upper bounded by .
Calculating is similar to that in . Finally, we get the following lemma.
For and , we have
See App. C in Supplemental Material for the proof.
Let be the function with Chebyshev steps. Figure 4 shows and its upper bound for when and . We find that reaches its maximum value several times due to the extremal point property of Chebyshev polynomials. Moreover, because holds for , holds. This indicates that CHGD always converges to an optimal point.
We first compare the convergence rate obtained in the above argument with the spectral radius of the optimal constant step size. Letting be with the optimal constant step size , we get Then, we can prove the following theorem.
For , we have .
The proof can be found in App. D in Supplemental Material. This indicates that the convergence speed of CHGD is strictly faster than that of GD with the optimal step size when we compare every iteration.
The convergence rate of GD is defined as using the spectral radius per iteration. From (58), the convergence rate of GD with the Chebyshev steps (CHGD) of length is bounded by
which is lower than the convergence rate (5) of GD with the optimal constant step size. The rate approaches the lower bound of first-order methods from above. In Figure 5, we show the convergence rates of GD and CHGD as functions of the condition number . For CHGD, some upper bounds of convergence rates with different are plotted. We confirmed that CHGD with has a smaller convergence rate than GD with the optimal constant step size and the rate of CHGD approaches the lower bound (18) quickly as increases. This means that CHGD is asymptotically, i.e., in the large- limit, optimal in terms of convergence rate.
To summarize, in Sec. III-B-Sec. III-C, we introduced learning-free Chebyshev steps that minimize the upper bound of the spectral radius and improve the convergence rate. Additionally, we can show that the training process of DUGD also reduces the corresponding spectral radius (see App. E in Supplemental Material). These imply that a learned step-size sequence of DUGD is possibly explained by the Chebyshev steps because Chebyshev steps are reasonable step sizes that reduce the spectral radius .
Iii-D Numerical results
In this subsection, we examine DUGD following the setup in Sec III-A. The main goal is to examine whether Chebyshev steps explain the nontrivial learned step-size parameters in DUGD. Additionally, we also analyze the convergence property of DUGD and CHGD numerically and compare CHGD with other accelerated GDs.
We describe the details of the numerical experiments. DUGD was implemented using PyTorch 1.5 . Each training datum was given by a pair of random initial point and optimal solution . A random initial point was generated as the i.i.d. Gaussian random vector with unit mean and unit variance. The matrix is generated by using whose elements independently follow . Then, from the Marchenko-Pastur distribution law, as with fixed to a constant, the maximum and minimum eigenvalues approach and , respectively. The matrix was fixed throughout training process.
The training process was executed using incremental training. See Sec. II-B2 for details. All initial values of were set to unless otherwise noted. For each generation, the parameters were optimized to minimize the MSE loss function between the output of DUGD and the optimal solution using mini batches of size . We used Adam optimizer  with a learning rate of .
Learned step sizes
Next, we compared values of the learned step-size sequence with Chebyshev steps. Figure 6 shows examples of the sequences of length and when . To compare parameters clearly, the learned step sizes and Chebyshev steps were rearranged in descending order. The lines with symbols represent the Chebyshev steps with and . Other symbols indicate the learned step-size parameters corresponding to five trials, that is, different matrices of and the training process with different random seeds. We found that the learned step sizes agreed with each other, thus indicating the self-averaging property of random matrices and success of the training process. More importantly, they were close to the Chebyshev steps, particularly when was small.
The zig-zag shape of the learned step-size parameters is another nontrivial behavior of DUGD, although the order of parameters does not affect MSE performance after the th iteration. It is numerically suggested that the shape depends on incremental training and initial values of . Figure 7 shows the trajectories of step sizes in the second generation of DUGD incremental training. Squares represent the initial point where is a trained value of in the first generation and is the initial value of . We found that a convergent point clearly depends on the initial value of . This is natural because the optimal values of in the second generation show permutation symmetry. Indeed, we confirmed that the location of a convergent point is based on the landscape of the MSE loss used in the training process.
From this observation, we attempted to determine a permutation of the Chebyshev steps systematically by emulating incremental training. Figure 8 shows the learned step-size parameters in DUGD () with different initial values of and the corresponding permuted Chebyshev steps. We found that they agreed with each other including the order of parameters. Details of searching for such a permutation are provided in App. F in Supplemental Material.
Difference between Chebyshev steps and learned step sizes
Figure 9. shows absolute eigenvalues of as a function of eigenvalue in In the figure, circles represent using learned step-size parameters in DUGD, whereas cross marks are using Chebyshev steps when , , and . We found that of DUGD were smaller than those of the Chebyshev steps in the high spectral-density regime and were larger otherwise. This is because it reduced the MSE loss that all eigenvalues of the matrix contributed. By contrast, it increased the maximum value of corresponding to the spectral radius . In this case, the spectral radius of DUGD was , whereas that of the Chebyshev steps was . Note that the spectral radius using the optimal constant step size was . We found that DUGD accelerated the convergence speed in terms of the spectral radius, whereas the Chebyshev steps further improved the convergence rate.
Performance analysis and convergence rate
Here, we compare the performance of DUGD and CHGD with the convergence rate evaluated in Section III-C.
In the experiment DUGD was trained the step sizes with . CHGD was executed repeatedly with the Chebyshev steps of length . Figure 10 shows the MSE performance of DUGD, CHGD, and GD with the optimal constant step size when . We found that DUGD and CHGD converged faster than GD, indicating that a proper step-size parameter sequence accelerated the convergence speed. Although DUGD had slightly better MSE performance than CHGD when , CHGD exhibited faster convergence as the number of iterations increased. This is because the spectral radius of CHGD was smaller than that of DUGD, as discussed in the last subsection. Figure 10 also shows the MSE calculated by (16) using the convergence rates. We found that the upper bound of the convergence rate (18) correctly predicted the convergence properties of CHGD.
Comparison with accelerated GD
Figure 11 shows the MSE performance of CHGD () and those of other GD algorithms when . We found that CHGD improved its MSE performance as increased. In particular, CHGD () showed reasonable performance compared with MOM and CH-semi. This indicates that CHGD exhibits acceleration of the convergence speed comparable to MOM and CH-semi when is sufficiently large. We can conclude that CHGD is an accelerated GD algorithm without momentum terms, which does not require a training process like DUGD.
Iv Extension of Theory of Chebyshev Steps
In the previous section, we saw that Chebyshev steps successfully accelerate the convergence speed of GD. The discussion can be straightforwardly extended to local convergence analysis of the strongly convex objective functions because such a function can be well approximated by a quadratic function around the optimal point.
In this section, we present a novel method termed Chebyshev-periodical successive over-relaxation (PSOR) for accelerating the convergence speed of linear/nonlinear fixed-point iterations based on Chebyshev steps.
In this subsection, we briefly summarize some background materials required for this section.
A fixed-point iteration
is widely used for solving scientific and engineering problems such as inverse problems . An eminent example is GD for minimizing a convex or non-convex objective function . Fixed-point iterations are also widely employed for solving large-scale linear equations. For example, Gauss-Seidel methods, Jacobi methods, and conjugate gradient methods are well-known fixed-point iterations . Another example of fixed-point iterations is the proximal gradient descent method such as ISTA  for sparse signal recovery problems. The projected gradient method also is an example that can be subsumed within the proximal gradient method.
Successive over-relaxation (SOR) is a well-known method for accelerating the Gauss-Seidel and Jacobi methods to solve a linear equation , where . For example, Jacobi method is based on the following linear fixed-point iteration:
where is a diagonal matrix whose diagonal elements are identical to the diagonal elements of . The update rule of SOR for Jacobi method is simply given by
where . A constant SOR factor is often employed in practice. When the SOR factors are appropriately chosen, the SOR-Jacobi method achieves much faster convergence compared with the standard Jacobi method. SOR for linear equation solvers  is particularly useful for solving sparse linear equations. Readers can refer to  for a convergence analysis.
Assume that we want to solve the following minimization problem:
where and . The function is differentiable, and is a strictly convex function which is not necessarily differentiable. The iteration of the proximal gradient method is given by
where is a step size. It is known that converges to the minimizer of (23) if is included in the semi-closed interval , where is the Lipschitz constant of .
Iv-B Fixed-point iteration
Let be a differentiable function. Throughout the paper, we will assume that has a fixed point satisfying and that is locally contracting mapping around . Namely, there exists such that
holds for any , where is ball defined by
These assumptions clearly show that the fixed-point iteration
eventually converges to the fixed point if the initial point is included in .
SOR is a well-known technique for accelerating the convergence speed for linear fixed-point iterations such as the Gauss-Seidel method . The SOR (or Krasnosel’skiǐ-Mann) iteration was originally defined for a linear update function ; however, it can be generalized for a nonlinear update function. The SOR iteration is given by
where is a real coefficient, which is called the SOR factor. When is linear, and is a constant, the optimal choice of in terms of the convergence rate is well-known ; however, the optimization of iteration-dependent SOR factors for nonlinear update functions has not hitherto been studied to the best of the authors’ knowledge.
We define the SOR update function by
Note that holds for any . Because is also the fixed point of , the SOR iteration does not change the fixed point of the original fixed-point iteration.
Iv-C Periodical SOR
Because is a differentiable function, it would be natural to consider the linear approximation around the fixed point . The Jacobian matrix of around is given by . Matrix is the Jacobian matrix of at the fixed point :
where . In the following discussion, we use the abbreviation for simplicity:
Here, we assume that is a Hermitian matrix.
The function in the SOR iteration (27) can be approximated using
because of Taylor series expansion around the fixed point. In this context, for expository purposes, we will omit the residual terms such as appearing in the Taylor series expansion. A detailed argument which considers the residual terms can be found in the proof of Thm 3.
Let be a point sequence obtained by a SOR iteration. By letting , we have
In the following argument in this paper, we assume periodical SOR (PSOR) factors satisfying
where is a positive integer. SOR utilizing PSOR factors are called a PSOR.
Applying a norm inequality to (32), we can obtain
for sufficiently large . The inequality (34) shows that the local convergence behavior around the fixed point is dominated by the spectral radius of .
If the spectral radius satisfies spectral condition , we can expect local linear convergence around the fixed point.
Iv-D Chebyshev-PSOR factors
As assumed in the previous subsection, is a Hermitian matrix, and we can show that is also a positive definite matrix due to the assumption that is contractive mapping. Therefore, we can apply the theory of Chebyshev steps in the last section by replacing the matrix in Sec. III-B to . We define the Chebyshev-PSOR factors by
where are defined by (15). We will call the PSOR employing the Chebyshev-PSOR.
The asymptotic convergence rate of the Chebyshev-PSOR can be evaluated as in Sec. III-C. The following theorem clarifies the asymptotic local convergence behavior of a Chebyshev-PSOR.
Assume that is a positive definite Hermitian matrix. Suppose that there exists a convergent point sequence produced by a Chebyshev-PSOR, which is convergent to the fixed point . The following inequality
holds, where .
Iv-E Convergence acceleration by Chebyshev-PSOR
In the previous subsection, we observed that the spectral radius is strictly smaller than one for any Chebyshev-PSOR. We still need to confirm whether the Chebyshev-PSOR provides definitely faster convergence compared with the original fixed-point iteration.
holds if is a multiple of and is sufficiently close to . We show that is a bounded monotonically decreasing function. The limit value of is given by
using In the following discussion, we assume that and . In this case, we have and . We thus have, for any ,
Figure 12 presents the region where can exist. The figure implies that the Chebyshev-PSOR can definitely accelerate the local convergence speed if is sufficiently close to .