# Parsimonious Deep Learning: A Differential Inclusion Approach with Global Convergence

Yanwei Fu, Chen Liu, Donghao Li, Xinwei Sun, Jinshan Zeng, and Yuan Yao

Fudan University
Microsoft Research-Asia
Jiangxi Normal University

Hong Kong University of Science and Technology
###### Abstract

Over-parameterization is ubiquitous nowadays in training neural networks, benefiting both optimization in seeking global optima and generalization in reducing prediction error. However, compressive networks are desired in many real world applications, and direct training of small networks may be trapped in local optima. In this paper, instead of pruning or distilling an over-parameterized model to compressive ones, we propose a parsimonious learning approach based on differential inclusions of inverse scale spaces, which generates a family of models from simple to complex ones with better efficiency and interpretability than stochastic gradient descent in exploring the model space. It enjoys a simple discretization, the Split Linearized Bregman Iteration, with provable global convergence: from any initialization, the iterates converge to a critical point of the empirical risk. One may exploit the proposed method to boost the complexity of neural networks progressively. Numerical experiments on MNIST, Cifar-10/100, and ImageNet show that the method is promising in training large scale models with favorable interpretability.

## 1 Introduction

The expressive power of Deep Convolutional Neural Networks (DNNs) comes from their millions of parameters, which are optimized by Stochastic Gradient Descent (SGD) algorithms (Bottou, 2010) and variants like Adam (Kingma and Ba, 2015). Remarkably, model over-parameterization helps both optimization and generalization. For optimization, over-parameterization may simplify the landscape of empirical risks so that global optima can be located efficiently by gradient descent methods (Mei et al., 2018, 2019; Venturi et al., 2018; Allen-Zhu et al., 2018; Du et al., 2018). On the other hand, over-parameterization does not necessarily result in bad generalization or overfitting (Zhang et al., 2017), especially when some weight-size dependent complexities are controlled (Bartlett, 1997; Bartlett et al., 2017; Golowich et al., 2018; Neyshabur et al., 2019).

However, compressive networks are desired in many real world applications, e.g. robotics, self-driving cars, and augmented reality. Although ℓ1 regularization has been applied in deep learning to enforce sparsity on the weights toward compact, memory efficient networks, it sacrifices some prediction performance (Collins and Kohli, 2014). This is because the weights learned in neural networks are highly correlated, and ℓ1 regularization on such weights violates the incoherence or irrepresentable conditions needed for sparse model selection (Donoho and Huo, 2001; Tropp, 2004; Zhao and Yu, 2006), leading to spurious selections with poor generalization. On the other hand, ℓ2 regularization is often utilized for correlated weights as a kind of low-pass filtering, sometimes in the form of weight decay (Loshchilov and Hutter, 2019) or early stopping (Yao et al., 2007; Wei et al., 2017). Furthermore, group sparsity regularization (Yuan and Lin, 2006) has also been applied to neural networks, such as finding the optimal number of neuron groups (Alvarez and Salzmann, 2016) and exerting good data locality with structured sparsity (Wen et al., 2016; Yoon and Hwang, 2017).

Yet, without the aid of over-parameterization, directly training a compressive model architecture may be trapped in local optima in contemporary experience. Alternatively, practitioners typically start from training a big model on common task datasets like ImageNet, and then prune or distill such big models to small ones without sacrificing too much performance (Jaderberg et al., 2014; Han et al., 2015; Zhu et al., 2017; Zhou et al., 2017; Zhang et al., 2016; Li et al., 2017; Abbasi-Asl and Yu, 2017; Yang et al., 2018; Arora et al., 2018). In particular, a recent study (Frankle and Carbin, 2019) proposed the lottery ticket hypothesis based on empirical observations: “dense, randomly-initialized, feed-forward networks contain subnetworks (winning tickets) that – when trained in isolation – reach test accuracy comparable to the original network in a similar number of iterations”. How to effectively reduce an over-parameterized model thus becomes the key to compressive deep learning. Such a backward pruning approach is in sharp contrast to the traditional parsimonious or sparse learning approach, where one starts from simple yet interpretable models and delves into complex models progressively. Examples of such parsimonious learning include boosting as a functional gradient descent or coordinate descent method, and regularization paths associated with ℓ2 (Tikhonov regularization or Ridge regression) and/or ℓ1 (Lasso) penalties. None of these has been systematically studied in deep learning.

In this paper, we aim to fill this gap by exploiting a dynamic approach to forward parsimonious learning. We establish a family of neural networks, from simple to complex, by following regularization paths as solutions of differential inclusions of inverse scale spaces. Our key idea is to design dynamics that simultaneously exploit over-parameterized models and structural sparsity. To achieve this goal, the original network parameters are lifted to a coupled pair, with one set of parameters following standard gradient descent to explore the over-parameterized model space, while the other set learns structural sparsity in an inverse scale space, where large-scale important parameters are learned at a fast speed and small unimportant ones are learned at a slow speed. The two sets of parameters are coupled via an ℓ2 regularization. The dynamics enjoys a simple discretization, the Split Linearized Bregman Iteration, with a provable global convergence guarantee. Equipped with such a method, one may efficiently boost or grow a network with different sparse models, exploring the model space more efficiently than SGD and with better interpretability. For example, Figure 1 exhibits the performance of training ResNet-18 on ImageNet using SGD, Adam, and our algorithms, where a proper choice of parameters leads to more efficient training than SGD and Adam in early epochs with comparable accuracy in the end. Moreover, Figure 3 further demonstrates that the sparse convolutional filters trained on ImageNet focus more on non-semantic textures than on shapes and colors, in support of related studies (Abbasi-Asl and Yu, 2017; Geirhos et al., 2019). How to enhance semantic shape invariance learning is arguably a key to improving the robustness of convolutional neural networks.

The organization of the paper is as follows. Section 2 introduces the methodology; Section 3 establishes the global convergence theory; Section 4 discusses stochastic approximation algorithms and boosting; experimental results are presented in Section 5. Proofs are collected in Appendix A.

## 2 Methodology

The supervised learning task learns a mapping ΦW : X → Y, from input space X to output space Y, with a parameter W such as weights in neural networks, by minimizing certain loss functions on training samples {(xi, yi)}ni=1. For example, a neural network of l layers is defined as

 ΦW(x)=σl(Wlσl−1(Wl−1⋯σ1(W1x)))

where W=(W1,…,Wl), σi is the nonlinear activation function of the i-th layer, and the loss function is usually the cross-entropy.
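To make the notation concrete, the l-layer forward map ΦW above can be sketched in a few lines of numpy; the layer sizes and the ReLU activation below are illustrative choices, not prescribed by the paper:

```python
import numpy as np

def forward(x, weights, activations):
    # Compute Phi_W(x) = sigma_l(W_l sigma_{l-1}(... sigma_1(W_1 x))).
    h = x
    for W, sigma in zip(weights, activations):
        h = sigma(W @ h)
    return h

def relu(z):
    return np.maximum(z, 0.0)

# A toy 2-layer network mapping R^4 -> R^3 -> R^2.
rng = np.random.default_rng(0)
W1 = rng.standard_normal((3, 4))
W2 = rng.standard_normal((2, 3))
y = forward(np.ones(4), [W1, W2], [relu, relu])
```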

Differential Inclusion of Inverse Scale Space. Consider the following dynamics,

 ˙Wt/κ =−∇W¯L(Wt,Γt) (1a) ˙Vt =−∇Γ¯L(Wt,Γt) (1b) Vt ∈∂¯Ω(Γt) (1c)

where ¯Ω(Γ):=Ω(Γ)+(1/2κ)∥Γ∥22 for some sparsity-enforcing regularization Ω such as the Lasso or group Lasso penalty, with κ>0 a damping parameter such that the solution path of Γt is continuous, and the augmented loss function is

 ¯L(W,Γ)=ˆLn(W)+(1/2ν)∥W−Γ∥22, (2)

with ν>0 controlling the gap admitted between W and Γ. Compared to the original loss function ˆLn(W), the augmented loss ¯L additionally adopts the variable splitting strategy, lifting the original neural network parameter W to the pair (W,Γ), with Γ modeling the structural sparsity of W. For simplicity, we assume ˆLn is differentiable with respect to W here; otherwise the gradient in (1a) is understood as a subgradient and the equation becomes an inclusion.

The differential inclusion system (1) is called the Split Linearized Bregman Inverse Scale Space (SplitISS) for the following reasons. It can be understood as a gradient descent flow of W in the proximity of Γ, and a mirror descent (Nemirovski and Yudin, 1983) flow of Γ associated with the sparsity-enforcing penalty Ω. For a large enough ν, it reduces to the gradient descent method for W. Yet the solution path of Γt exhibits the following selection property in the separation of scales: starting at zero, important parameters of large scale are learned fast, popping up to be nonzero early, while unimportant parameters of small scale are learned slowly, appearing nonzero late. In fact, taking the Lasso penalty for Ω for simplicity, Vt undergoes a gradient descent flow before reaching the ℓ∞-unit box, which implies that Γt=0 in this stage and Wt follows gradient descent with a standard ℓ2-regularization; the earlier a component of Vt reaches the boundary of the ℓ∞-unit box, the earlier the corresponding component of Γt becomes nonzero and rapidly evolves toward a critical point of the loss under the gradient flow, which drives Wt to closely follow the dynamics of Γt whose important parameters are selected. Such a property is called the inverse scale space property in applied mathematics (Burger et al., 2006) and was recently shown to achieve statistical model selection consistency in high dimensional linear regression (Osher et al., 2016) and general linear models (Huang and Yao, 2018), with a reduction of bias as κ increases. In this paper, we shall see that the inverse scale space property still holds for neural networks trained by Eq. (1); e.g. Figure 2 shows a LeNet trained on MNIST by the discretized dynamics, where important sparse filters are selected in early epochs while the popular stochastic gradient descent method returns dense filters.

Compared with directly enforcing a penalty function Ω (e.g. the Lasso or group Lasso regularization),

 minWˆRn(W):=ˆLn(W)+ λΩ(W),   λ∈R+. (3)

the SplitISS avoids the parameter correlation problem in over-parameterized models. In fact, a necessary and sufficient condition for Lasso or ℓ1-type sparse model selection is the incoherence or irrepresentable condition (Tropp, 2004; Zhao and Yu, 2006), which is violated for highly correlated weight parameters, leading to spurious discoveries. In contrast, Huang et al. (2018) showed that, equipped with such a variable splitting where Γ enjoys an orthogonal design, the SplitISS can achieve model selection consistency under weaker conditions than the generalized Lasso, relaxing the incoherence or irrepresentable conditions when parameters are highly correlated. Instead of directly imposing ℓ1-sparsity on the weight parameter W, it adopts an ℓ2-regularization in the proximity of the sparse path of Γ, which admits simultaneously exploring highly correlated parameters in over-parameterized models and sparsity regularization.

Split Linearized Bregman Iterations. The SplitISS admits an extremely simple discrete approximation, using the Euler forward discretization of the dynamics above:

 Wk+1=Wk−καk⋅∇W¯L(Wk,Γk), (4a) Vk+1=Vk−αk⋅∇Γ¯L(Wk,Γk), (4b) Γk+1=κ⋅ProxΩ(Vk+1), (4c)

where Γ0=V0=0 and W0 can be initialized with small random numbers, e.g. drawn from a Gaussian distribution as is common in neural networks; the proximal map in Eq (4c) that controls the sparsity of Γ is given by

 ProxΩ(V)=argminΓ {(1/2)∥Γ−V∥22+Ω(Γ)}. (5)

We shall call such an iterative procedure the Split Linearized Bregman Iteration (SplitLBI). It was first proposed in (Huang et al., 2016) as an iterative regularization path for sparse modeling in high dimensional statistics, with improved model selection consistency over the generalized Lasso, and has later been used in medical image analysis (Sun et al., 2017) and computer vision (Zhao et al., 2018). In the application to neural networks, the loss becomes highly non-convex; the SplitLBI returns a sequence of sparse models from simple to complex ones with a global convergence guarantee shown below, while solving Eq (3) at various levels of λ might not be tractable except for over-parameterized models.
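As a minimal sketch of the iterations (4a)-(4c) on a toy sparse regression problem (the least-squares loss, the ℓ1 penalty, and all hyper-parameter values here are illustrative assumptions, not the paper's settings):

```python
import numpy as np

def soft_threshold(v, thr=1.0):
    # Proximal map of the l1 penalty: elementwise shrinkage.
    return np.sign(v) * np.maximum(np.abs(v) - thr, 0.0)

def split_lbi(grad_L, W, kappa=10.0, alpha=0.01, nu=1.0, n_iter=500):
    # Eqs. (4a)-(4c) with Omega the l1 penalty; grad_L is the gradient of
    # the empirical loss with respect to W.
    Gamma = np.zeros_like(W)
    V = np.zeros_like(W)
    for _ in range(n_iter):
        grad_W = grad_L(W) + (W - Gamma) / nu   # gradient of augmented loss in W
        grad_G = -(W - Gamma) / nu              # gradient of augmented loss in Gamma
        W = W - kappa * alpha * grad_W          # Eq. (4a)
        V = V - alpha * grad_G                  # Eq. (4b)
        Gamma = kappa * soft_threshold(V)       # Eq. (4c)
    return W, Gamma

# Toy sparse regression: recover a 2-sparse w_star from y = X @ w_star.
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 10))
w_star = np.zeros(10)
w_star[:2] = [3.0, -2.0]
y = X @ w_star
grad = lambda w: X.T @ (X @ w - y) / len(y)
W, Gamma = split_lbi(grad, np.zeros(10))
```

Along the iterations, components of Γ pop up from zero in the inverse-scale-space order, large important coordinates first, while W tracks Γ through the ℓ2 coupling.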

The sparsity-enforcing penalty Ω used in convolutional neural networks can be chosen as follows. Our choice in this paper regularizes groups of weight parameters using the group Lasso penalty (Yuan and Lin, 2006), defined by

 Ω(Γ):=∑g∥Γg∥2=∑g(∑|Γg|i=1(Γgi)2)1/2,

where |Γg| is the number of weights in group Γg. We treat convolutional layers and fully connected layers in different ways as follows.

• For a convolutional layer, denote the convolutional filters by Γ∈RS×S×Cin×Cout, where S denotes the kernel size and Cin and Cout denote the numbers of input and output channels, respectively. We regard each convolutional filter as one group.

• For a fully connected layer, Γ∈RCin×Cout, where Cin and Cout denote the numbers of inputs and outputs of the fully connected layer. Each group corresponds to a single element Γij, and the group Lasso penalty degenerates to the Lasso penalty.

With such choices, Eq. (4c) has a closed form solution for each filter or group.
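For example, the closed form of the proximal map in Eq. (5) under the group Lasso penalty is group-wise soft-thresholding of the ℓ2 norm; a small sketch (treating each row as one flattened convolutional filter, with an illustrative unit threshold):

```python
import numpy as np

def prox_group_lasso(V, thr=1.0):
    # Closed-form proximal map of the group Lasso penalty:
    # each row of V is one group; shrink its l2 norm by thr, zeroing
    # out groups whose norm falls below the threshold.
    norms = np.linalg.norm(V, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - thr / np.maximum(norms, 1e-12))
    return scale * V

V = np.array([[3.0, 4.0],    # norm 5 -> shrunk by factor (1 - 1/5)
              [0.3, 0.4]])   # norm 0.5 < 1 -> zeroed out entirely
G = prox_group_lasso(V)
```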

## 3 Global Convergence of SplitLBI for Neural Networks

We present a theorem that guarantees the global convergence of SplitLBI, i.e. from any initialization, the SplitLBI sequence converges to a critical point of the Lyapunov function defined below.

First of all, we reformulate the iterations (4a)-(4c) into an equivalent form. Let P=(W,Γ); then the SplitLBI algorithm in Eq (4a)-(4c) can be rewritten as the following standard Linearized Bregman Iteration (Huang and Yao, 2018),

 Pk+1=argminP{⟨P−Pk,α∇¯L(Pk)⟩+BpkΨ(P,Pk)}, (6)

where

 Ψ(P) =Ω(Γ)+(1/2κ)∥P∥22=Ω(Γ)+(1/2κ)∥W∥22+(1/2κ)∥Γ∥22, (7)

pk∈∂Ψ(Pk), and BqΨ(P,Q) is the Bregman divergence associated with the convex function Ψ, defined by

 BqΨ(P,Q) :=Ψ(P)−Ψ(Q)−⟨q,P−Q⟩, for some q∈∂Ψ(Q). (8)
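For instance, with the quadratic choice Ψ(P) = ½‖P‖₂², whose subgradient at Q is the singleton {Q}, the Bregman divergence in Eq. (8) reduces to the squared Euclidean distance:

```latex
B^{q}_{\Psi}(P,Q) = \tfrac{1}{2}\|P\|_2^2 - \tfrac{1}{2}\|Q\|_2^2 - \langle Q, P-Q\rangle = \tfrac{1}{2}\|P-Q\|_2^2 .
```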

Introducing the subgradient pk, Eq (6) can be rewritten as,

 Pk+1=ProxκR(Pk+κ(pk−α∇¯L(Pk))), (9a) pk+1=pk−κ−1(Pk+1−Pk+κα∇¯L(Pk)), (9b)

where R(P):=Ω(Γ) and pk∈∂R(Pk). Thus SplitLBI is equivalent to the following iterations,

 Wk+1=Wk−κα∇W¯L(Wk,Γk), (10a) Γk+1=ProxκΩ(Γk+κ(gk−α∇Γ¯L(Wk,Γk))), (10b) gk+1=gk−κ−1(Γk+1−Γk+κα⋅∇Γ¯L(Wk,Γk)). (10c)

Exploiting the equivalent reformulation (10a)-(10c), one can establish the global convergence of SplitLBI under the following assumptions.

###### Assumption 1.

Suppose that:

• ˆLn(W) is continuously differentiable and ∇ˆLn is Lipschitz continuous with a positive constant Lip;

• ¯L(W,Γ) has bounded level sets;

• ˆLn is lower bounded (without loss of generality, we assume that the lower bound is 0);

• Ω is a proper lower semi-continuous convex function and has locally bounded subgradients, that is, for every compact set S⊂Rn, there exists a constant C>0 such that for all Γ∈S and all g∈∂Ω(Γ), there holds ∥g∥≤C;

• the Lyapunov function

 F(P,g):=α¯L(W,Γ)+B~gΩ(Γ,~Γ), (11)

is a Kurdyka-Łojasiewicz function on any bounded set, where ~Γ∈∂Ω∗(g), ~g=g, and Ω∗ is the conjugate of Ω defined as

 Ω∗(g):=supU∈Rn{⟨U,g⟩−Ω(U)}.
###### Remark 1.

Assumptions 1 (a)-(c) are regular in the analysis of nonconvex algorithms (see Attouch et al. (2013) for instance), while Assumption 1 (d) is also mild, covering all Lipschitz continuous convex functions over a compact set. Typical examples satisfying Assumption 1 (d) are the ℓ1 norm, the group Lasso penalty, and every continuously differentiable penalty. By Eq (11) and the definition of the conjugate, the Lyapunov function can be rewritten as follows,

 F(W,Γ,g)=α¯L(W,Γ)+Ω(Γ)+Ω∗(g)−⟨Γ,g⟩. (12)

Now we are ready to present the main theorem.

###### Theorem 1 (Global Convergence of SplitLBI).

Suppose that Assumption 1 holds. Let {(Wk,Γk,gk)} be the sequence generated by SplitLBI (Eq (10a)-(10c)) with a finite initialization. If

 0<αk=α<2/(κ(Lip+ν−1)),

then {(Wk,Γk,gk)} converges to a critical point of F. Moreover, (Wk,Γk) converges to a stationary point of ¯L defined in Eq (2).

Applied to neural network training settings, typical examples satisfying the assumptions of this theorem are summarized in the following corollary.

###### Corollary 1.

Let {(Wk,Γk,gk)} be a sequence generated by SplitLBI (10a)-(10c) for the deep neural network training problem, where

1. ˆLn is any smooth definable loss function, such as the square loss (y−ˆy)2, exponential loss e−yˆy, logistic loss log(1+e−yˆy), and cross-entropy loss;

2. σi is any smooth definable activation, such as the linear activation, sigmoid 1/(1+e−t), hyperbolic tangent tanh(t), and softplus c−1log(1+ect) (for some c>0) as a smooth approximation of ReLU;

3. Ω is the group Lasso penalty (i.e., Ω(Γ)=∑g∥Γg∥2).

Moreover, suppose that the empirical risk ˆLn is lower bounded with bounded level sets and its gradient is Lipschitz continuous with a constant Lip. Then the sequence converges to a stationary point of ¯L if the step size satisfies 0<α<2/(κ(Lip+ν−1)).

Proofs of Theorem 1 and Corollary 1 are given in Appendix A.

## 4 Stochastic Approximation of SplitLBI and Applications

For neural network training with large datasets, a stochastic approximation of the gradients in SplitLBI over mini-batches is adopted to update the parameter W,

 ˜∇tW=∇WˆLn(W)∣(X,Y)batcht. (13)

Moreover, inspired by variants of SGD, a momentum term can also be incorporated into the standard SplitLBI, leading to the following update of W by replacing Eq (4a) with,

 vt+1 = τvt+˜∇W¯L(Wt,Γt) (14a) Wt+1 = Wt−καvt+1 (14b)

where τ is the momentum factor, empirically set to 0.9 by default. One immediate application of such stochastic algorithms of SplitLBI is to “boost networks”, i.e. growing a network from the null model to a complex one by sequentially applying our algorithm on subnets with increasing complexities.
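A minimal sketch of the momentum update in Eqs. (14a)-(14b), applied here to a toy quadratic loss (the gradient callback and all constants are illustrative assumptions):

```python
import numpy as np

def momentum_step(W, v, grad, kappa=1.0, alpha=0.1, tau=0.9):
    # Eq. (14a): accumulate the (stochastic) gradient into the momentum buffer.
    v = tau * v + grad
    # Eq. (14b): descend along the momentum direction, scaled by kappa * alpha.
    W = W - kappa * alpha * v
    return W, v

# Toy quadratic loss 0.5 * ||W||^2, whose gradient is W itself.
W = np.array([1.0, -2.0])
v = np.zeros(2)
for _ in range(200):
    W, v = momentum_step(W, v, W)
```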

Boosting Networks. We propose the task of boosting networks by SplitLBI, i.e., learning to expand the convolutional filters and network parameters during training. Remarkably, our boosting of networks is very different from previous tasks, including AutoML (Wong et al., 2018), life-long learning (Wang et al., 2017; Thrun and Mitchell, 1995; Pentina and Lampert, 2015; Li and Hoiem, 2016), and knowledge distillation (Hinton et al., 2014). Starting from very few filters in each convolutional layer, algorithms for boosting networks require not only efficiently optimizing the parameters of filters, but also adding more filters if the existing filters do not have enough capacity to model the distribution of training data. Note that the existing work mentioned above does not allow this without additional expanding or fine-tuning algorithms, which are computationally expensive in practice.

To apply SplitLBI to boost networks, we define a projection of the convolutional filters W onto the support set of Γ,

 ˜Wk+1=Projsupp(Γk+1)(Wk+1). (15)

The basic idea is to monitor the gap between Wk+1 and its projection ˜Wk+1 along the iterations: when the gap becomes small, we expand the network by adding new filters.

We now describe the procedure for expanding a network by SplitLBI. Starting from a small number of filters (e.g. 2) in each convolutional layer, more and more filters of Γ tend to take nonzero values as the algorithm iterates. At each step, we compute the ratio of nonzero filters in Γ. If this ratio passes a pre-set threshold (empirically set to 0.8), we add a few (e.g. 2) new convolutional filters, whose parameters are randomly initialized. Then we continue learning the network from training data. This process is repeated until the model converges or the maximum number of epochs is reached.
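The expansion rule above can be sketched as follows; `nonzero_filter_ratio`, `maybe_expand`, and the flattened-filter layout are hypothetical helpers for illustration, with the threshold 0.8 and the increment of 2 filters taken from the text:

```python
import numpy as np

def nonzero_filter_ratio(Gamma):
    # Fraction of filters (rows) currently selected, i.e. with nonzero Gamma.
    return float(np.mean(np.linalg.norm(Gamma, axis=1) > 0))

def maybe_expand(W, Gamma, rng, threshold=0.8, n_new=2, scale=0.01):
    # Add n_new randomly initialized filters once the ratio of nonzero
    # filters in Gamma exceeds the threshold; otherwise leave unchanged.
    if nonzero_filter_ratio(Gamma) < threshold:
        return W, Gamma
    new_W = scale * rng.standard_normal((n_new, W.shape[1]))
    new_G = np.zeros((n_new, Gamma.shape[1]))
    return np.vstack([W, new_W]), np.vstack([Gamma, new_G])

rng = np.random.default_rng(0)
W = rng.standard_normal((2, 9))    # 2 filters of size 3x3 (flattened)
Gamma = np.ones((2, 9))            # both filters selected -> ratio 1.0
W2, G2 = maybe_expand(W, Gamma, rng)
```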

## 5 Experiments

The proposed SplitLBI is evaluated on the MNIST, Cifar-10/100 and ImageNet-2012 datasets, which are widely utilized as benchmarks in deep learning. In particular, Sec. 5.1 compares SplitLBI against SGD and Adam in image classification. Furthermore, Sec. 5.2 shows that SplitLBI can efficiently boost a network from a null model to complex ones from data.

Implementation. The various algorithms are evaluated on LeNet (LeCun et al. (2015)), AlexNet, and ResNet (He et al. (2016)), respectively. By default, the minibatch-based SplitLBI is employed with batch size 256. For MNIST and Cifar-10, the hyper-parameters κ and ν are fixed, and the learning rate α starts from 0.1 and decays by a factor of 0.1 every 40 epochs. On ImageNet-2012, α starts from 0.1 and decays by a factor of 0.1 every 30 epochs. Adam follows the default PyTorch setting, and SGD follows He et al. (2016) with the standard initial learning rate, momentum, and weight decay.

### 5.1 Image Classification

In the supervised setting, SGD, Adam, and SplitLBI are adopted to optimize DNNs on these datasets. The results are shown in Table 1. Training ResNet-18 on the ImageNet-2012 dataset, the model trained by SplitLBI achieves a strong Top-1 accuracy within 40 epochs, somewhat lower than the best ResNet-18 result reported in He et al. (2016), which was optimized over 120 epochs. Still, this indicates that SplitLBI may train more efficiently than SGD. Note that: (1) in Fig. 1, we qualitatively compare our SplitLBI against SGD and Adam; remarkably, at the early stage, SplitLBI has much better efficiency than Adam and SGD within the same number of training epochs; (2) compared with SGD, SplitLBI requires some extra GPU memory to store the parameters Γ and V, but this additional cost is small, and the total running time of each epoch is comparable to that of SGD.

To interpret the learned filters by SplitLBI with structural sparsity, we visualize a LeNet-5 trained on MNIST and a ResNet-18 on ImageNet.

Visualization of LeNet-5 on MNIST. We visualize the solution paths and filter patterns of the third convolutional layer (i.e., conv.c5) of LeNet-5 trained by SplitLBI and by SGD on MNIST in Fig. 2. The X-axis and Y-axis indicate the training epochs and the magnitude of each filter, respectively. In particular, the left two figures compare the magnitude changes of each filter trained by SplitLBI and SGD, individually. The filters that are not in the support of Γ are drawn in red. In the right figure, the model parameters learned at 20 (blue), 40 (green), and 80 (black) epochs are visualized as in Erhan et al. (2009), with bounding boxes in the corresponding colors. The conv.c5 layer has 120 convolutional filters, which are ordered by magnitude. We observe that the filters of high magnitude learned by SplitLBI are much sparser and have much clearer patterns than those trained by SGD, showing that SplitLBI encourages structural sparsity of filters.

Visualization of ResNet-18 on ImageNet-2012. To further reveal insights into the patterns learned by SplitLBI, we visualize the first convolutional layer of ResNet-18 on ImageNet-2012 in Fig. 3. The left figure compares the training and validation accuracy of SplitLBI and SGD. The right figure compares visualizations of the filters learned by SplitLBI and SGD using Springenberg et al. (2014). Specifically, for each filter of the first convolutional layer, we compute the gradient of the sum of the corresponding feature map with respect to the input image (here a snake image). We further apply min-max normalization to the gradient image to generate the final visualization map. The right figure compares the visualized gradient images of the 64 filters of the first convolutional layer. We visualize the model parameters at 20 (purple), 40 (green), and 60 (black) epochs, respectively, with bounding boxes in the right figure annotated by the corresponding colors. We order the gradient images produced from the 64 filters by descending magnitude (ℓ2-norm) of the filters, i.e., images are ordered from the upper left to the bottom right. For comparison, we also provide the visualized gradients from randomly initialized weights. We find that the filters with high magnitudes mostly focus on texture and shape information, while color information is kept in the filters with small magnitudes. This phenomenon is in accordance with the observation of Abbasi-Asl and Yu (2017) that filters mainly containing color information can be pruned to save computational cost. Moreover, among the filters of high magnitudes, most capture non-semantic textures while few pursue shapes.
This shows that the first convolutional layer of ResNet-18 trained on ImageNet learns non-semantic textures rather than shapes for image classification, in accordance with recent studies (Geirhos et al., 2019).

### 5.2 Boosting a Network

In this experiment, we explore how SplitLBI boosts the filters of a network with a fixed number of layers. In each individual layer, we start from a small initial number of filters (Min) and gradually grow to a maximum number of filters (Max), adaptively learned by the SplitLBI-based boosting described in Sec. 4. We compare the results of directly using a small number of filters (Min), the maximum number of filters (Max), and our boosting method in Table 2. Note that the maximum number of filters is computed by our boosting algorithm; we do not know it beforehand. In addition, we also include networks with over-parameterized (O.P.) convolutional filters, which have twice the maximum number of filters in each corresponding layer. By default, all models are trained for 1000 epochs. We employ MNIST, Cifar-10 and Cifar-100 as the evaluation benchmarks here.

As shown in Table 2, SplitLBI can successfully boost the filters of networks from Min to Max. In Fig. 4, Fig. 5, Fig. 6, and Fig. 7, we give the full training paths of Min (Minimum Model), Max (Maximum Model) and Boosting (Grow) as compared in Table 2. Each black line indicates when filters are added; in our experiments, two filters are added each time. To the best of our knowledge, this is the first algorithm that can simultaneously learn the network structure and parameters from data. It is expected that Max and Boosting reach the same performance given enough training epochs; and impressively, after 1000 epochs, Max and Boosting indeed achieve almost comparable results on all of the two-layer networks. This further indicates the efficacy of the proposed SplitLBI. On Cifar-100 with a three-convolutional-layer network, the results of boosting are inferior to those of Max. A further inspection of the full training paths of Min, Max and Boosting (Fig. 7) shows that boosting may require more training epochs for such relatively complex structures. Thus, we fix the boosted network of Table 2 and continue training for another 500 epochs, and we find that the trained model indeed reports almost the same performance as Max.

## 6 Conclusion

In this paper, a parsimonious deep learning method is proposed based on differential inclusions of inverse scale spaces. Its simple discretization, SplitLBI, has provable global convergence on DNNs and thus can be employed to train deep networks. Furthermore, the proposed method can be applied to boost a network by simultaneously learning both the network structure and parameters. Extensive experiments and visualizations show that the proposed method may not only have better efficiency than SGD, but also capture the structural sparsity of convolutional filters for interpretability.

## Appendix A Proof of Theorem 1

### a.1 Kurdyka-Łojasiewicz Property

To introduce the definition of the Kurdyka-Łojasiewicz (KL) property, we need some notions and notations from variational analysis, which can be found in Rockafellar and Wets (1998).

The notion of subdifferential plays a central role in the following definitions. For each x∈dom h, the Fréchet subdifferential of h at x, written ˆ∂h(x), is the set of vectors v∈Rp which satisfy

 liminfy≠x,y→x (h(y)−h(x)−⟨v,y−x⟩)/∥x−y∥≥0.

When x∉dom h, we set ˆ∂h(x)=∅. The limiting-subdifferential (or simply subdifferential) of h introduced in Mordukhovich (2006), written ∂h(x) at x∈dom h, is defined by

 ∂h(x):={v∈Rp:∃xk→x,h(xk)→h(x),vk∈ˆ∂h(xk)→v}. (16)

A necessary (but not sufficient) condition for x∈Rp to be a minimizer of h is 0∈∂h(x). A point that satisfies this inclusion is called limiting-critical or simply critical. The distance from a point x to a subset S of Rp, written dist(x,S), is defined by dist(x,S)=inf{∥x−s∥:s∈S}, where ∥⋅∥ represents the Euclidean norm.

Let h:Rp→R∪{+∞} be an extended-real-valued function (respectively, h:Rp⇉Rq a point-to-set mapping); its graph is defined by

 Graph(h):={(x,y)∈Rp×R:y=h(x)}, (resp. Graph(h):={(x,y)∈Rp×Rq:y∈h(x)}),

and its domain by dom h:={x∈Rp:h(x)<+∞} (resp. dom h:={x∈Rp:h(x)≠∅}). When h is a proper function, i.e., when dom h≠∅, the set of its global minimizers (possibly empty) is denoted by

 argminh:={x∈Rp:h(x)=infh}.

The KL property (Łojasiewicz, 1963, 1993; Kurdyka, 1998; Bolte et al., 2007a, b) plays a central role in the convergence analysis of nonconvex algorithms (Attouch et al., 2013; Wang et al., 2019). The following definition is adopted from Bolte et al. (2007b).

###### Definition 1 (Kurdyka-Łojasiewicz property).

A function h:Rp→R∪{+∞} is said to have the Kurdyka-Łojasiewicz (KL) property at ¯u∈dom ∂h, if there exist a constant η∈(0,+∞], a neighborhood U of ¯u, and a function ϕ:[0,η)→R+, which is a concave function that is continuous at 0 and satisfies ϕ(0)=0, ϕ∈C1((0,η)), i.e., ϕ is continuously differentiable on (0,η), and ϕ′(s)>0 for all s∈(0,η), such that for all u∈U∩{u:h(¯u)<h(u)<h(¯u)+η}, the following inequality holds

 ϕ′(h(u)−h(¯u))⋅dist(0,∂h(u))≥1. (17)

If h satisfies the KL property at each point of dom ∂h, then h is called a KL function.

KL functions include real analytic functions, semialgebraic functions, tame functions defined in some o-minimal structures (Kurdyka, 1998; Bolte et al., 2007b), continuous subanalytic functions (Bolte et al., 2007a) and locally strongly convex functions. In the following, we provide some important examples that satisfy the Kurdyka-Łojasiewicz property.
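As a simple illustration of Definition 1, the strongly convex function h(x) = ½‖x‖₂² satisfies the KL inequality at ū = 0 with the desingularizing function ϕ(s) = √(2s):

```latex
\phi'\big(h(x)-h(0)\big)\cdot\mathrm{dist}\big(0,\partial h(x)\big)
  = \frac{1}{\sqrt{2\cdot\tfrac{1}{2}\|x\|_2^2}}\cdot\|x\|_2 = 1 \ge 1 .
```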

###### Definition 2 (Real analytic).

A function h, with domain an open set U⊂R and range the set of either all real or complex numbers, is said to be real analytic at u∈U if the function may be represented by a convergent power series on some interval of positive radius centered at u: h(x)=∑∞j=0 aj(x−u)j for some coefficients {aj}. The function is said to be real analytic on U if it is real analytic at each u∈U (Krantz and Parks, 2002, Definition 1.1.5). Real analytic functions over Rp for some positive integer p>1 can be defined similarly.

According to Krantz and Parks (2002), typical real analytic functions include polynomials, exponential functions, and the logarithm, trigonometric and power functions on any open set of their domains. One can verify whether a multivariable real function h on Rp is analytic by checking the analyticity of g(t):=h(x+ty) for any x,y∈Rp.

###### Definition 3 (Semialgebraic).
1. A set D⊂Rp is called semialgebraic (Bochnak et al., 1998) if it can be represented as

 D=s⋃i=1t⋂j=1{x∈Rp:Pij(x)=0,Qij(x)>0},

where Pij, Qij are real polynomial functions for 1≤i≤s, 1≤j≤t.

2. A function h:Rp→R∪{+∞} (resp. a point-to-set mapping h:Rp⇉Rq) is called semialgebraic if its graph is semialgebraic.

According to Łojasiewicz (1965); Bochnak et al. (1998) and Shiota (1997, I.2.9, page 52), the class of semialgebraic sets is stable under the operations of finite union, finite intersection, Cartesian product and complementation. Some typical examples include polynomial functions, the indicator function of a semialgebraic set, and the Euclidean norm (Bochnak et al., 1998, page 26).

### a.2 KL Property in Deep Learning and Proof of Corollary 1

In the following, we consider the deep neural network training problem. Consider an l-layer feedforward neural network including l−1 hidden layers. In particular, let di be the number of hidden units in the i-th hidden layer for i=1,…,l−1, and let d0 and dl be the numbers of units of the input and output layers, respectively. Let Wi be the weight matrix between the (i−1)-th layer and the i-th layer for any i=1,…,l (to simplify notations, we regard the input and output layers as the 0-th and the l-th layers, respectively, and absorb the bias of each layer into Wi).

According to Theorem 1, one major condition is to verify that the Lyapunov function F defined in (11) satisfies the Kurdyka-Łojasiewicz property. For this purpose, we need an extension of semialgebraic sets, called the o-minimal structure (see, for instance, Coste (1999), van den Dries (1986), Kurdyka (1998), Bolte et al. (2007b)). The following definition is from Bolte et al. (2007b).

###### Definition 4 (o-minimal structure).

An o-minimal structure on $(\mathbb{R}, +, \cdot)$ is a sequence of boolean algebras $\mathcal{O}_n$ of “definable” subsets of $\mathbb{R}^n$, such that for each $n \in \mathbb{N}$:

1. if $A$ belongs to $\mathcal{O}_n$, then $A \times \mathbb{R}$ and $\mathbb{R} \times A$ belong to $\mathcal{O}_{n+1}$;

2. if $\Pi : \mathbb{R}^{n+1} \to \mathbb{R}^n$ is the canonical projection onto $\mathbb{R}^n$, then for any $A$ in $\mathcal{O}_{n+1}$, the set $\Pi(A)$ belongs to $\mathcal{O}_n$;

3. $\mathcal{O}_n$ contains the family of algebraic subsets of $\mathbb{R}^n$, that is, every set of the form

 $$\{x \in \mathbb{R}^n : p(x) = 0\},$$

where $p : \mathbb{R}^n \to \mathbb{R}$ is a polynomial function;

4. the elements of $\mathcal{O}_1$ are exactly the finite unions of intervals and points.

Based on the definition of o-minimal structures, we can state the definition of a definable function.

###### Definition 5 (Definable function).

Given an o-minimal structure $\mathcal{O}$ (over $(\mathbb{R}, +, \cdot)$), a function $f : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ is said to be definable in $\mathcal{O}$ if its graph belongs to $\mathcal{O}_{n+1}$.

According to van den Dries and Miller (1996); Bolte et al. (2007b), there are some important facts about o-minimal structures, as follows.

1. The collection of semialgebraic sets is an o-minimal structure. Recall that semialgebraic sets are Boolean combinations of sets of the form

 $$\{x \in \mathbb{R}^n : p(x) = 0,\ q_1(x) < 0, \ldots, q_m(x) < 0\},$$

where $p$ and the $q_i$'s are polynomial functions on $\mathbb{R}^n$.

2. There exists an o-minimal structure that contains the sets of the form

 $$\{(x, t) \in [-1, 1]^n \times \mathbb{R} : f(x) = t\},$$

where $f : [-1, 1]^n \to \mathbb{R}$ is real analytic on a neighborhood of $[-1, 1]^n$.

3. There exists an o-minimal structure that contains simultaneously the graph of the exponential function and all semialgebraic sets.

4. The o-minimal structure is stable under the sum, composition, the inf-convolution and several other classical operations of analysis.

The Kurdyka-Łojasiewicz property for smooth definable functions and for non-smooth definable functions was established in (Kurdyka, 1998, Theorem 1) and (Bolte et al., 2007b, Theorem 11), respectively. Now we are ready to present the proof of Corollary 1.

###### Proof of Corollary 1.

To justify this corollary, we only need to verify that the associated Lyapunov function $F$ satisfies the Kurdyka-Łojasiewicz inequality. In this case and by (12), $F$ can be rewritten as follows:

 $$F(W, \Gamma, g) = \alpha\left( \hat{\mathcal{L}}_n(W) + \frac{1}{2\nu}\|W - \Gamma\|^2 \right) + \Omega(\Gamma) + \Omega^*(g) - \langle \Gamma, g \rangle.$$

Because the loss and the activation functions are definable by assumption, $\hat{\mathcal{L}}_n$ is definable as a composition of definable functions. Moreover, according to Krantz and Parks (2002), $\|\cdot\|^2$ and $\langle \cdot, \cdot \rangle$ are semialgebraic and thus definable. Since the group Lasso penalty is the composition of the $\ell_1$ and $\ell_2$ norms, and the conjugate of the group Lasso penalty is expressed via the maximum of the group-wise $\ell_2$ norms, where the $\ell_1$, $\ell_2$, and max norms are definable, the group Lasso penalty and its conjugate are definable as compositions of definable functions. Therefore, $F$ is definable and hence satisfies the Kurdyka-Łojasiewicz inequality by (Kurdyka, 1998, Theorem 1).

The verifications of these specific cases listed in assumptions can be found in the proof of (Zeng et al., 2019, Proposition 1). This finishes the proof of this corollary. ∎
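The duality between the group Lasso penalty and the maximum of group-wise $\ell_2$ norms used above can be sanity-checked numerically. The sketch below (all names ours) verifies the Fenchel-Young inequality $\langle \Gamma, g \rangle \le \Omega(\Gamma)$ for any $g$ in the dual unit ball:

```python
import math
import random

def group_lasso(x, groups):
    # Omega(x): sum over groups of the Euclidean norm of each block.
    return sum(math.sqrt(sum(x[i] ** 2 for i in g)) for g in groups)

def dual_group_norm(v, groups):
    # The dual norm: maximum over groups of the block-wise Euclidean norm.
    # Omega* is the indicator of the set {g : dual_group_norm(g) <= 1}.
    return max(math.sqrt(sum(v[i] ** 2 for i in g)) for g in groups)

random.seed(0)
groups = [(0, 1), (2, 3, 4)]
ok = True
for _ in range(1000):
    x = [random.uniform(-1.0, 1.0) for _ in range(5)]
    g = [random.uniform(-1.0, 1.0) for _ in range(5)]
    g = [v / dual_group_norm(g, groups) for v in g]  # scale into the dual ball
    inner = sum(a * b for a, b in zip(x, g))
    ok = ok and inner <= group_lasso(x, groups) + 1e-9
print(ok)
```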

### a.3 Proof of Theorem 1

Our analysis is mainly motivated by a recent paper (Benning et al., 2017), as well as the influential work (Attouch et al., 2013). According to (Attouch et al., 2013, Lemma 2.6), there are mainly four ingredients in the analysis, namely, the sufficient descent property, the relative error property, the continuity property of the generated sequence, and the Kurdyka-Łojasiewicz property of the function. More specifically, we first establish the sufficient descent property of the generated sequence by exploiting the Lyapunov function $F$ (see (11)) in Lemma A.4 in Section A.4, and then show the relative error property of the sequence in Lemma A.5 in Section A.5. The continuity property is guaranteed by the continuity of $F$ and the relation established in Lemma 1(i) in Section A.4. Thus, together with the Kurdyka-Łojasiewicz assumption on $F$, we establish the global convergence of SLBI following (Attouch et al., 2013, Lemma 2.6).

Let $(\bar{W}, \bar{\Gamma}, \bar{g})$ be a critical point of $F$; then the following holds:

 $$\begin{aligned} \partial_W F(\bar{W}, \bar{\Gamma}, \bar{g}) &= \alpha\left( \nabla\hat{\mathcal{L}}_n(\bar{W}) + \nu^{-1}(\bar{W} - \bar{\Gamma}) \right) = 0, \\ \partial_\Gamma F(\bar{W}, \bar{\Gamma}, \bar{g}) &= \alpha\nu^{-1}(\bar{\Gamma} - \bar{W}) + \partial\Omega(\bar{\Gamma}) - \bar{g} \ni 0, \qquad (18) \\ \partial_g F(\bar{W}, \bar{\Gamma}, \bar{g}) &= \bar{\Gamma} - \partial\Omega^*(\bar{g}) \ni 0. \end{aligned}$$

By the final inclusion and the convexity of $\Omega$, it holds that $\bar{g} \in \partial\Omega(\bar{\Gamma})$. Plugging this inclusion into the second inclusion yields $\alpha\nu^{-1}(\bar{\Gamma} - \bar{W}) = 0$, i.e., $\bar{W} = \bar{\Gamma}$. Together with the first equality, this implies

 $$\nabla\bar{\mathcal{L}}(\bar{W}, \bar{\Gamma}) = 0, \qquad \nabla\hat{\mathcal{L}}_n(\bar{W}) = 0.$$

This finishes the proof of this theorem.

### a.4 Sufficient Descent Property along Lyapunov Function

Let $P := (W, \Gamma)$, $P^k := (W^k, \Gamma^k)$, and $Q^k := (W^k, \Gamma^k, g^k)$. In the following, we present the sufficient descent property of $\{Q^k\}$ along the Lyapunov function $F$.

Lemma. Suppose that $\hat{\mathcal{L}}_n$ is continuously differentiable and $\nabla\hat{\mathcal{L}}_n$ is Lipschitz continuous with a constant $\mathrm{Lip}$. Let $\{Q^k\}$ be a sequence generated by SLBI with a finite initialization. If $0 < \kappa < \frac{2}{\alpha(\mathrm{Lip} + \nu^{-1})}$, then

 $$F(Q^{k+1}) \le F(Q^k) - \rho\|Q^{k+1} - Q^k\|_2^2,$$

where $\rho := \kappa^{-1} - \frac{\alpha(\mathrm{Lip} + \nu^{-1})}{2} > 0$.

###### Proof.

By the optimality condition of (9a) and the inclusion $g^k \in \partial\Omega(\Gamma^k)$, there holds

 $$\kappa\left( \alpha\nabla\bar{\mathcal{L}}(P^k) + p^{k+1} - p^k \right) + P^{k+1} - P^k = 0, \quad \text{where } p^k := (0, g^k),$$

which, after taking the inner product of both sides with $P^{k+1} - P^k$ and dividing by $\kappa$, implies

 $$-\langle \alpha\nabla\bar{\mathcal{L}}(P^k), P^{k+1} - P^k \rangle = \kappa^{-1}\|P^{k+1} - P^k\|_2^2 + D(\Gamma^{k+1}, \Gamma^k), \qquad (19)$$

where

 $$D(\Gamma^{k+1}, \Gamma^k) := \langle g^{k+1} - g^k, \Gamma^{k+1} - \Gamma^k \rangle.$$

Noting that $\bar{\mathcal{L}}(P) = \hat{\mathcal{L}}_n(W) + \frac{1}{2\nu}\|W - \Gamma\|^2$, the Lipschitz continuity of $\nabla\hat{\mathcal{L}}_n$ with constant $\mathrm{Lip}$ implies that $\nabla\bar{\mathcal{L}}$ is Lipschitz continuous with constant $\mathrm{Lip} + \nu^{-1}$. This implies

 $$\bar{\mathcal{L}}(P^{k+1}) \le \bar{\mathcal{L}}(P^k) + \langle \nabla\bar{\mathcal{L}}(P^k), P^{k+1} - P^k \rangle + \frac{\mathrm{Lip} + \nu^{-1}}{2}\|P^{k+1} - P^k\|_2^2.$$

Substituting the above inequality into (19) yields

 $$\alpha\bar{\mathcal{L}}(P^{k+1}) + D(\Gamma^{k+1}, \Gamma^k) + \rho\|P^{k+1} - P^k\|_2^2 \le \alpha\bar{\mathcal{L}}(P^k). \qquad (20)$$

Adding some terms to both sides of the above inequality and reformulating yields

 $$\begin{aligned} &\alpha\bar{\mathcal{L}}(P^{k+1}) + B_\Omega^{g^k}(\Gamma^{k+1}, \Gamma^k) \qquad (21) \\ &\le \alpha\bar{\mathcal{L}}(P^k) + B_\Omega^{g^{k-1}}(\Gamma^k, \Gamma^{k-1}) - \rho\|P^{k+1} - P^k\|_2^2 - \left( D(\Gamma^{k+1}, \Gamma^k) + B_\Omega^{g^{k-1}}(\Gamma^k, \Gamma^{k-1}) - B_\Omega^{g^k}(\Gamma^{k+1}, \Gamma^k) \right) \\ &= \alpha\bar{\mathcal{L}}(P^k) + B_\Omega^{g^{k-1}}(\Gamma^k, \Gamma^{k-1}) - \rho\|P^{k+1} - P^k\|_2^2 - B_\Omega^{g^{k+1}}(\Gamma^k, \Gamma^{k+1}) - B_\Omega^{g^{k-1}}(\Gamma^k, \Gamma^{k-1}), \end{aligned}$$

where the final equality holds by the identity $D(\Gamma^{k+1}, \Gamma^k) = B_\Omega^{g^k}(\Gamma^{k+1}, \Gamma^k) + B_\Omega^{g^{k+1}}(\Gamma^k, \Gamma^{k+1})$. That is,

 $$\begin{aligned} F(Q^{k+1}) &\le F(Q^k) - \rho\|P^{k+1} - P^k\|_2^2 - B_\Omega^{g^{k+1}}(\Gamma^k, \Gamma^{k+1}) - B_\Omega^{g^{k-1}}(\Gamma^k, \Gamma^{k-1}) \qquad (22) \\ &\le F(Q^k) - \rho\|P^{k+1} - P^k\|_2^2, \qquad (23) \end{aligned}$$

where the final inequality holds by the nonnegativity of the Bregman distances $B_\Omega^{g^{k+1}}(\Gamma^k, \Gamma^{k+1})$ and $B_\Omega^{g^{k-1}}(\Gamma^k, \Gamma^{k-1})$. Thus, we finish the proof of this lemma. ∎
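The three-point identity $D(\Gamma^{k+1}, \Gamma^k) = B_\Omega^{g^k}(\Gamma^{k+1}, \Gamma^k) + B_\Omega^{g^{k+1}}(\Gamma^k, \Gamma^{k+1})$ can be checked for a smooth toy potential (a sketch with our own names, using $\Omega(x) = x^2/2$ so that subgradients are unique):

```python
def omega(x):
    # A smooth toy potential with a unique (sub)gradient: Omega(x) = x**2 / 2.
    return 0.5 * x * x

def grad_omega(x):
    return x

def bregman(u, v):
    # Bregman distance B^{g}_Omega(u, v) with g = grad_omega(v).
    return omega(u) - omega(v) - grad_omega(v) * (u - v)

gamma_k, gamma_k1 = 0.7, -1.3
g_k, g_k1 = grad_omega(gamma_k), grad_omega(gamma_k1)
D = (g_k1 - g_k) * (gamma_k1 - gamma_k)  # D(Gamma^{k+1}, Gamma^k)
total = bregman(gamma_k1, gamma_k) + bregman(gamma_k, gamma_k1)
print(abs(D - total) < 1e-12)  # the two sides of the identity agree
```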

Based on Lemma A.4, we directly obtain the following lemma.

###### Lemma 1.

Suppose that the assumptions of Lemma A.4 hold. Suppose further that Assumption 1 (b)-(d) holds. Then

1. both $\{\alpha\bar{\mathcal{L}}(P^k)\}$ and $\{F(Q^k)\}$ converge to the same finite value,

2. the sequence $\{Q^k\}$ is bounded,

3. $\lim_{k\to\infty} B_\Omega^{g^{k-1}}(\Gamma^k, \Gamma^{k-1}) = 0$, $\lim_{k\to\infty} \|P^{k+1} - P^k\|_2 = 0$, and $\lim_{k\to\infty} D(\Gamma^{k+1}, \Gamma^k) = 0$, and

4. $\frac{1}{K}\sum_{k=0}^{K} \left( \rho\|P^{k+1} - P^k\|_2^2 + D(\Gamma^{k+1}, \Gamma^k) \right) \to 0$ at a rate of $O(1/K)$.

###### Proof.

By (20), $\{\alpha\bar{\mathcal{L}}(P^k)\}$ is monotonically decreasing due to $\rho > 0$ and $D(\Gamma^{k+1}, \Gamma^k) \ge 0$. Similarly, by (23), $\{F(Q^k)\}$ is also monotonically decreasing. By the lower boundedness assumption on $\hat{\mathcal{L}}_n$, both $\bar{\mathcal{L}}$ and $F$ are lower bounded by their definitions, i.e., (2) and (11), respectively. Therefore, both $\{\alpha\bar{\mathcal{L}}(P^k)\}$ and $\{F(Q^k)\}$ converge. By (22),

 $$B_\Omega^{g^{k-1}}(\Gamma^k, \Gamma^{k-1}) \le F(Q^k) - F(Q^{k+1}), \quad k = 1, 2, \ldots$$

By the convergence of $\{F(Q^k)\}$ and the nonnegativity of the Bregman distance, there holds

 $$\lim_{k\to\infty} B_\Omega^{g^{k-1}}(\Gamma^k, \Gamma^{k-1}) = 0.$$

By the definition of $F$ and the above equality, it yields $\lim_{k\to\infty} F(Q^k) = \lim_{k\to\infty} \alpha\bar{\mathcal{L}}(P^k)$.

Since $\hat{\mathcal{L}}_n$ has bounded level sets, $\{W^k\}$ is bounded. By the definition of $\bar{\mathcal{L}}$ and the finiteness of $\lim_{k\to\infty} \bar{\mathcal{L}}(P^k)$, $\{\Gamma^k\}$ is also bounded because $\{W^k\}$ is bounded. The boundedness of $\{g^k\}$ is due to $g^k \in \partial\Omega(\Gamma^k)$, condition (d), and the boundedness of $\{\Gamma^k\}$.

Summing up (20) over $k = 0, 1, \ldots, K$ yields

 $$\sum_{k=0}^{K} \left( \rho\|P^{k+1} - P^k\|_2^2 + D(\Gamma^{k+1}, \Gamma^k) \right) \le \alpha\bar{\mathcal{L}}(P^0) < \infty. \qquad (24)$$

Letting $K \to \infty$ and noting that both $\|P^{k+1} - P^k\|_2^2$ and $D(\Gamma^{k+1}, \Gamma^k)$ are nonnegative, we obtain

 $$\lim_{k\to\infty} \|P^{k+1} - P^k\|_2 = 0, \qquad \lim_{k\to\infty} D(\Gamma^{k+1}, \Gamma^k) = 0.$$

Again by (24),

 $$\frac{1}{K}\sum_{k=0}^{K} \left( \rho\|P^{k+1} - P^k\|_2^2 + D(\Gamma^{k+1}, \Gamma^k) \right) \le \frac{\alpha\bar{\mathcal{L}}(P^0)}{K},$$

which establishes the claimed $O(1/K)$ rate and finishes the proof. ∎
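The $O(1/K)$ averaged rate is simply the bounded-partial-sums argument; a toy numeric check (sequence and names ours):

```python
# If the partial sums S_K of a nonnegative sequence a_k are bounded by a
# constant C, then the running averages S_K / K decay at rate O(1/K).
# Toy sequence (ours): a_k = 2**(-k), whose partial sums stay below C = 2.
C = 2.0
S = 0.0
avgs = []
for K in range(1, 1001):
    S += 2.0 ** (-(K - 1))
    avgs.append(S / K)
print(avgs[9], avgs[999])  # the average keeps shrinking toward 0
```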