Parsimonious Deep Learning:
A Differential Inclusion Approach
with Global Convergence
Abstract
Overparameterization is ubiquitous nowadays in training neural networks, benefiting both optimization, in seeking global optima, and generalization, in reducing prediction error. However, compressive networks are desired in many real-world applications, and direct training of small networks may be trapped in local optima. In this paper, instead of pruning or distilling an overparameterized model to compressive ones, we propose a parsimonious learning approach based on differential inclusions of inverse scale spaces, which generates a family of models from simple to complex ones with better efficiency and interpretability than stochastic gradient descent in exploring the model space. It enjoys a simple discretization, the Split Linearized Bregman Iteration, with a provable global convergence guarantee: from any initialization, the algorithmic iterates converge to a critical point of the empirical risk. One may exploit the proposed method to boost the complexity of neural networks progressively. Numerical experiments on MNIST, Cifar10/100, and ImageNet show that the method is promising in training large-scale models with favorable interpretability.
1 Introduction
The expressive power of Deep Neural Networks (DNNs) comes from their millions of parameters, which are optimized by Stochastic Gradient Descent (SGD) algorithms (Bottou, 2010) and variants like Adam (Kingma and Ba, 2015). Remarkably, model overparameterization helps both optimization and generalization. For optimization, overparameterization may simplify the landscape of empirical risks toward locating global optima efficiently by gradient descent methods (Mei et al., 2018, 2019; Venturi et al., 2018; Allen-Zhu et al., 2018; Du et al., 2018). On the other hand, overparameterization does not necessarily result in bad generalization or overfitting (Zhang et al., 2017), especially when some weight-size dependent complexities are controlled (Bartlett, 1997; Bartlett et al., 2017; Golowich et al., 2018; Neyshabur et al., 2019).
However, compressive networks are desired in many real-world applications, e.g. robotics, self-driving cars, and augmented reality. Although $\ell_1$ regularization has been applied to deep learning to enforce sparsity on the weights toward compact, memory-efficient networks, it sacrifices some prediction performance (Collins and Kohli, 2014). This is because the weights learned in neural networks are highly correlated, and $\ell_1$ regularization on such weights violates the incoherence or irrepresentable conditions needed for sparse model selection (Donoho and Huo, 2001; Tropp, 2004; Zhao and Yu, 2006), leading to spurious selections with poor generalization. On the other hand, $\ell_2$ regularization is often utilized for correlated weights as a form of low-pass filtering, sometimes in the form of weight decay (Loshchilov and Hutter, 2019) or early stopping (Yao et al., 2007; Wei et al., 2017). Furthermore, group sparsity regularization (Yuan and Lin, 2006) has also been applied to neural networks, for example to find the optimal number of neuron groups (Alvarez and Salzmann, 2016) and to exert good data locality with structured sparsity (Wen et al., 2016; Yoon and Hwang, 2017).
Yet, without the aid of overparameterization, directly training a compressive model architecture may meet the obstacle of being trapped in local optima in contemporary experience. Alternatively, practitioners typically start from training a big model using common task datasets like ImageNet, and then prune or distill such big models to small ones without sacrificing too much performance (Jaderberg et al., 2014; Han et al., 2015; Zhu et al., 2017; Zhou et al., 2017; Zhang et al., 2016; Li et al., 2017; Abbasi-Asl and Yu, 2017; Yang et al., 2018; Arora et al., 2018). In particular, a recent study (Frankle and Carbin, 2019) proposed the lottery ticket hypothesis based on empirical observations: "dense, randomly-initialized, feed-forward networks contain subnetworks (winning tickets) that – when trained in isolation – reach test accuracy comparable to the original network in a similar number of iterations". How to effectively reduce an overparameterized model thus becomes the key to compressive deep learning. Such a backward pruning approach is in sharp contrast to traditional parsimonious or sparse learning approaches, where one starts from simple yet interpretable models and delves into complex models progressively. Examples of such parsimonious learning include boosting as a functional gradient descent or coordinate descent method, and regularization paths associated with $\ell_2$ (Tikhonov regularization or ridge regression) and/or $\ell_1$ (Lasso) penalties, etc. None of these has been systematically studied in deep learning.
In this paper, we aim to fill this gap by exploiting a dynamic approach to forward parsimonious learning. We establish a family of neural networks, from simple to complex, by following regularization paths as solutions of differential inclusions of inverse scale spaces. Our key idea is to design dynamics that simultaneously exploit overparameterized models and structural sparsity. To achieve this goal, the original network parameters are lifted to a coupled pair, with one set of parameters following standard gradient descent to explore the overparameterized model space, while the other set of parameters learns structural sparsity in an inverse scale space, in which the large-scale important parameters are learned at a fast speed while the small unimportant ones are learned at a slow speed. The two sets of parameters are coupled via an $\ell_2$ regularization. The dynamics enjoys a simple discretization, the Split Linearized Bregman Iteration, with a provable global convergence guarantee. Equipped with such a method, one may efficiently boost or grow a network with different sparse models, in a more efficient way than SGD in exploring the model space and in favor of better interpretability. For example, Figure 1 exhibits the performance of training ResNet18 on ImageNet, using SGD, Adam, and our algorithms, where a proper choice of parameters leads to more efficient training than SGD and Adam in early epochs with a comparable accuracy in the end. Moreover, Figure 3 further demonstrates that the sparse convolutional filters trained on ImageNet focus more on non-semantic textures than shapes and colors, in support of related studies (Abbasi-Asl and Yu, 2017; Geirhos et al., 2019). How to enhance semantic shape-invariant learning is arguably a key to improving the robustness of convolutional neural networks.
2 Methodology
The supervised learning task learns a mapping $f(\cdot\,; W): \mathcal{X} \to \mathcal{Y}$, from input space $\mathcal{X}$ to output space $\mathcal{Y}$, with a parameter $W$ such as the weights in neural networks, by minimizing certain loss functions $\frac{1}{n}\sum_{i=1}^{n} \ell\big(y_i, f(x_i; W)\big)$ on training samples $\{(x_i, y_i)\}_{i=1}^{n}$. For example, a neural network of $l$ layers is defined as
$$f(x; W) = W_l\, \sigma_{l-1}\big(W_{l-1} \cdots \sigma_1(W_1 x)\big),$$
where $W = \{W_i\}_{i=1}^{l}$, $\sigma_i$ is the nonlinear activation function of the $i$-th layer, and the loss function $\ell$ is usually the cross-entropy.
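For concreteness, a minimal NumPy sketch of such an $l$-layer forward pass with a cross-entropy loss is given below; the layer sizes, tanh activation, and initialization scale are illustrative assumptions, not the architecture used in the experiments.

```python
import numpy as np

def forward(x, weights, activation=np.tanh):
    """Forward pass of an l-layer network: alternate affine maps and activations."""
    h = x
    for W in weights[:-1]:
        h = activation(W @ h)
    return weights[-1] @ h  # last layer outputs raw logits

def cross_entropy(logits, label):
    """Cross-entropy loss with a numerically stable log-softmax."""
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

rng = np.random.default_rng(0)
weights = [rng.standard_normal((16, 8)) * 0.1,   # W_1: 8 inputs -> 16 hidden units
           rng.standard_normal((10, 16)) * 0.1]  # W_2: 16 hidden -> 10 classes
x = rng.standard_normal(8)
loss = cross_entropy(forward(x, weights), label=3)
```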
Differential Inclusion of Inverse Scale Space. Consider the following dynamics,
\begin{align}
\frac{\dot{W}_t}{\kappa} &= -\nabla_W \bar{\mathcal{L}}(W_t, \Gamma_t), \tag{1a}\\
\dot{V}_t &= -\nabla_\Gamma \bar{\mathcal{L}}(W_t, \Gamma_t), \tag{1b}\\
V_t &\in \partial \Omega_\kappa(\Gamma_t), \tag{1c}
\end{align}
where $\Omega_\kappa(\Gamma) := \Omega(\Gamma) + \frac{1}{2\kappa}\|\Gamma\|_F^2$ for some sparsity-enforcing regularization $\Omega$ such as the Lasso or grouped Lasso penalties, $\kappa > 0$ serves as a damping parameter such that the solution path of $\Gamma_t$ is continuous, and the augmented loss function is
$$\bar{\mathcal{L}}(W, \Gamma) = \mathcal{L}(W) + \frac{1}{2\nu}\|W - \Gamma\|_F^2, \tag{2}$$
with $\nu > 0$ controlling the gap admitted between $W$ and $\Gamma$. Compared to the original loss function $\mathcal{L}(W)$, the augmented loss $\bar{\mathcal{L}}$ additionally adopts the variable splitting strategy, by lifting the original neural network parameter $W$ to the coupled pair $(W, \Gamma)$, with $\Gamma$ modeling the structural sparsity of $W$. For simplicity, we assume $\mathcal{L}$ is differentiable with respect to $W$ here; otherwise the gradient in (1a) is understood as a subgradient and the equation becomes an inclusion.
The differential inclusion system (1) is called the Split Linearized Bregman Inverse Scale Space (Split ISS) for the following reasons. It can be understood as a gradient descent flow of $W_t$ in the proximity of $\Gamma_t$ and a mirror descent (Nemirovski and Yudin, 1983) flow of $\Gamma_t$ associated with the sparsity-enforcing penalty $\Omega$. For a large enough $\nu$, it reduces to the gradient descent method for $\mathcal{L}(W)$. Yet the solution path of $\Gamma_t$ exhibits the following selection property in the separation of scales: starting at zero, important parameters of large scale will be learned fast, popping up to be nonzero early, while unimportant parameters of small scale will be learned slowly, appearing to be nonzero late. In fact, taking $\Omega(\Gamma) = \|\Gamma\|_1$ and a large $\kappa$ for simplicity, $V_t$ undergoes a gradient descent flow before reaching the unit box, which implies that $\Gamma_t = 0$ in this stage and $W_t$ follows gradient descent with a standard $\ell_2$ regularization; the earlier a component of $V_t$ reaches the unit box, the earlier the corresponding component of $\Gamma_t$ becomes nonzero and rapidly evolves toward a critical point of $\bar{\mathcal{L}}$ under the gradient flow, which drives $W_t$ to closely follow the dynamics of $\Gamma_t$, whose important parameters are selected. Such a property is called the inverse scale space in applied mathematics (Burger et al., 2006) and recently was shown to achieve statistical model selection consistency in high dimensional linear regression (Osher et al., 2016) and general linear models (Huang and Yao, 2018), with a reduction of bias as $\kappa$ increases. In this paper, we shall see that the inverse scale space property still holds for neural networks trained by Eq. (1); e.g. Figure 2 shows a LeNet trained on MNIST by the discretized dynamics, where important sparse filters are selected in early epochs while the popular stochastic gradient descent method returns dense filters.
Compared with directly enforcing a sparsity penalty (e.g. $\ell_1$ or group Lasso regularization),
$$\min_W\ \mathcal{L}(W) + \lambda\,\Omega(W), \tag{3}$$
the Split ISS avoids the parameter correlation problem in overparameterized models. In fact, a necessary and sufficient condition for Lasso or $\ell_1$-type sparse model selection is the incoherence or irrepresentable condition (Tropp, 2004; Zhao and Yu, 2006), which is violated for highly correlated weight parameters, leading to spurious discoveries. In contrast, Huang et al. (2018) showed that, equipped with such a variable splitting where $\Gamma$ enjoys an orthogonal design, the Split ISS can achieve model selection consistency under weaker conditions than generalized Lasso, relaxing the incoherence or irrepresentable conditions when parameters are highly correlated. The weight parameter $W$, instead of directly being imposed with sparsity, adopts an $\ell_2$ regularization in the proximity of the sparse path of $\Gamma_t$, which admits simultaneously exploring highly correlated parameters in overparameterized models and sparsity regularization.
Split Linearized Bregman Iterations. The Split ISS admits an extremely simple discrete approximation, using the Euler forward discretization of the dynamics above:
\begin{align}
W_{k+1} &= W_k - \kappa\,\alpha_k\,\nabla_W \bar{\mathcal{L}}(W_k, \Gamma_k), \tag{4a}\\
Z_{k+1} &= Z_k - \alpha_k\,\nabla_\Gamma \bar{\mathcal{L}}(W_k, \Gamma_k), \tag{4b}\\
\Gamma_{k+1} &= \kappa \cdot \mathrm{Prox}_{\Omega}(Z_{k+1}), \tag{4c}
\end{align}
where $Z_0 = \Gamma_0 = 0$, $W_0$ can be small random numbers, e.g. drawn from a Gaussian distribution as is common in neural networks, and the proximal map in Eq. (4c), which controls the sparsity of $\Gamma$, is given by
$$\mathrm{Prox}_{\Omega}(Z) = \arg\min_{\Gamma}\ \frac{1}{2}\|Z - \Gamma\|^2 + \Omega(\Gamma). \tag{5}$$
We call such an iterative procedure the Split Linearized Bregman Iteration (SplitLBI). It was first proposed in (Huang et al., 2016) as an iterative regularization path for sparse modeling in high dimensional statistics, with improved model selection consistency over generalized Lasso, and has later been used in medical image analysis (Sun et al., 2017) and computer vision (Zhao et al., 2018). In the application to neural networks, the loss becomes highly nonconvex; SplitLBI returns a sequence of sparse models from simple to complex ones with the global convergence guarantee shown below, while solving Eq. (3) at various levels of $\lambda$ might not be tractable except for overparameterized models.
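To make the iterations (4a)-(4c) concrete, the following toy NumPy sketch runs SplitLBI on a least-squares loss with an $\ell_1$ penalty on $\Gamma$, so that the proximal map (5) is plain soft-thresholding; the problem sizes and hyper-parameters ($\kappa$, $\nu$, $\alpha$) are illustrative choices, not the paper's settings.

```python
import numpy as np

def soft_threshold(z, thr=1.0):
    # Proximal map of the l1 penalty, Eq. (5) with Omega = ||.||_1
    return np.sign(z) * np.maximum(np.abs(z) - thr, 0.0)

def split_lbi(X, y, kappa=10.0, nu=1.0, alpha=0.01, iters=500):
    n, p = X.shape
    W = np.zeros(p)        # lifted parameter following gradient descent
    Gamma = np.zeros(p)    # sparse parameter on the inverse scale space path
    Z = np.zeros(p)
    for _ in range(iters):
        grad_W = X.T @ (X @ W - y) / n + (W - Gamma) / nu  # d/dW of augmented loss (2)
        grad_G = (Gamma - W) / nu                          # d/dGamma of augmented loss (2)
        W = W - kappa * alpha * grad_W                     # Eq. (4a)
        Z = Z - alpha * grad_G                             # Eq. (4b)
        Gamma = kappa * soft_threshold(Z)                  # Eq. (4c)
    return W, Gamma

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
beta = np.array([3.0, -2.0, 0.0, 0.0, 0.0])   # sparse ground truth
y = X @ beta
W, Gamma = split_lbi(X, y)
```

On this noise-free sparse regression problem, the two large coefficients pop up in $\Gamma$ early while the null coefficients typically stay at zero, illustrating the inverse scale space property.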
The sparsity-enforcing penalty $\Omega$ used in convolutional neural networks can be chosen as follows. Our choice in this paper aims at regularizing groups of weight parameters using the group Lasso penalty (Yuan and Lin, 2006), defined by
$$\Omega_{\mathrm{GL}}(\Gamma) = \sum_{g} \|\Gamma_g\|_2, \qquad \|\Gamma_g\|_2 = \Big( \sum_{j=1}^{p_g} \Gamma_{g,j}^2 \Big)^{1/2},$$
where $p_g$ is the number of weights in group $\Gamma_g$. We treat convolutional layers and fully connected layers in different ways as follows.

For a convolutional layer, denote the convolutional filters by $W \in \mathbb{R}^{S \times S \times C_{\mathrm{in}} \times C_{\mathrm{out}}}$, where $S$ denotes the kernel size and $C_{\mathrm{in}}$ and $C_{\mathrm{out}}$ denote the numbers of input channels and output channels, respectively. We regard each convolutional filter, i.e. each slice $W_{:,:,:,j} \in \mathbb{R}^{S \times S \times C_{\mathrm{in}}}$, as one group.

For a fully connected layer, $W \in \mathbb{R}^{C_{\mathrm{in}} \times C_{\mathrm{out}}}$, where $C_{\mathrm{in}}$ and $C_{\mathrm{out}}$ denote the numbers of inputs and outputs of the fully connected layer. Each group corresponds to a single element $W_{i,j}$, and the group Lasso penalty degenerates to the Lasso penalty.
With such choices, Eq. (4c) hence has a closed form solution for the $g$-th filter, namely the group soft-thresholding $\Gamma_{k+1}^{g} = \kappa \cdot \max\big(0,\ 1 - 1/\|Z_{k+1}^{g}\|_2\big)\, Z_{k+1}^{g}$.
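Assuming the unweighted group Lasso penalty defined above, the closed-form group soft-thresholding used in Eq. (4c) for one group can be sketched as:

```python
import numpy as np

def group_soft_threshold(z_group, kappa=1.0):
    """Closed-form Prox of the group Lasso penalty for one filter/group:
    shrink the group's l2 norm by 1, or zero out the whole group."""
    norm = np.linalg.norm(z_group)
    scale = max(0.0, 1.0 - 1.0 / norm) if norm > 0 else 0.0
    return kappa * scale * z_group

small = group_soft_threshold(np.array([0.3, 0.4]))  # norm 0.5 < 1 -> whole group zeroed
large = group_soft_threshold(np.array([3.0, 4.0]))  # norm 5 -> shrunk to norm 4
```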
3 Global Convergence of SplitLBI for Neural Networks
We present a theorem that guarantees the global convergence of SplitLBI, i.e. from any initialization, the SplitLBI sequence converges to a critical point of $\bar{\mathcal{L}}$.
First of all, we reformulate the iterations (4a)-(4c) into an equivalent form. Let $P := (W, \Gamma)$; the SplitLBI algorithm in Eqs. (4a)-(4c) can be rewritten as the following standard Linearized Bregman Iteration (Huang and Yao, 2018),
$$P_{k+1} = \arg\min_{P}\ \big\{ \langle P - P_k,\ \alpha_k \nabla \bar{\mathcal{L}}(P_k) \rangle + B_{\Psi}^{p_k}(P, P_k) \big\}, \tag{6}$$
where
$$\Psi(P) := \Omega(\Gamma) + \frac{1}{2\kappa}\|P\|_F^2, \tag{7}$$
$p_k \in \partial \Psi(P_k)$, and $B_{\Psi}^{p}$ is the Bregman divergence associated with the convex function $\Psi$, defined by
$$B_{\Psi}^{p}(P, Q) := \Psi(P) - \Psi(Q) - \langle p,\ P - Q \rangle, \qquad p \in \partial \Psi(Q). \tag{8}$$
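Since the reformulation rests on the Bregman divergence (8), a quick numeric sanity check for a differentiable quadratic $\Psi$ (an illustrative choice) may help; for such a $\Psi$ the divergence reduces to $\|u - v\|^2 / (2\kappa)$ and is nonnegative, as convexity requires.

```python
import numpy as np

def bregman_divergence(psi, grad_psi, u, v):
    """B_psi(u, v) = psi(u) - psi(v) - <grad psi(v), u - v> (differentiable case)."""
    return psi(u) - psi(v) - grad_psi(v) @ (u - v)

kappa = 5.0
psi = lambda p: (p @ p) / (2 * kappa)   # smooth quadratic part, as in Eq. (7)
grad_psi = lambda p: p / kappa

rng = np.random.default_rng(1)
u, v = rng.standard_normal(4), rng.standard_normal(4)
b = bregman_divergence(psi, grad_psi, u, v)
```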
Denote $p_k = (u_k, z_k)$; then Eq. (6) can be rewritten as
\begin{align}
p_{k+1} &= p_k - \alpha_k \nabla \bar{\mathcal{L}}(P_k), \tag{9a}\\
p_{k+1} &\in \partial \Psi(P_{k+1}), \tag{9b}
\end{align}
where $u_k = W_k/\kappa$ and $z_k = g_k + \Gamma_k/\kappa$ with $g_k \in \partial \Omega(\Gamma_k)$. Thus SplitLBI is equivalent to the following iterations,
\begin{align}
W_{k+1} &= W_k - \kappa\,\alpha_k\,\nabla_W \bar{\mathcal{L}}(W_k, \Gamma_k), \tag{10a}\\
z_{k+1} &= z_k - \alpha_k\,\nabla_\Gamma \bar{\mathcal{L}}(W_k, \Gamma_k), \tag{10b}\\
\Gamma_{k+1} &= \kappa \cdot \mathrm{Prox}_{\Omega}(z_{k+1}). \tag{10c}
\end{align}
Exploiting the equivalent reformulation (10a)-(10c), one can establish the global convergence of the sequence $\{(W_k, \Gamma_k)\}$ under the following assumptions.
Assumption 1.
Suppose that:
(a) $\bar{\mathcal{L}}$ is continuously differentiable and $\nabla \bar{\mathcal{L}}$ is Lipschitz continuous with a positive constant $\mathrm{Lip}$;
(b) $\bar{\mathcal{L}}$ has bounded level sets;
(c) $\bar{\mathcal{L}}$ is lower bounded (without loss of generality, we assume that the lower bound is $0$);
(d) $\Omega$ is a proper lower semicontinuous convex function and has locally bounded subgradients, that is, for every compact set $\mathcal{S}$, there exists a constant $C > 0$ such that for all $\Gamma \in \mathcal{S}$ and all $g \in \partial \Omega(\Gamma)$, there holds $\|g\| \le C$;
(e) the Lyapunov function
$$F(P, q) := \bar{\mathcal{L}}(P) + \frac{1}{\alpha}\big( \Psi(P) + \Psi^*(q) - \langle q, P \rangle \big) \tag{11}$$
is a Kurdyka-Łojasiewicz function on any bounded set, where $P := (W, \Gamma)$, $\Psi$ is defined in (7), and $\Psi^*$ is the conjugate of $\Psi$ defined as
$$\Psi^*(q) := \sup_{Q}\ \big\{ \langle q, Q \rangle - \Psi(Q) \big\}.$$
Remark 1.
Assumption 1 (a)-(c) are regular in the analysis of nonconvex algorithms (see, e.g., Attouch et al. (2013)), while Assumption 1 (d) is also mild, including all Lipschitz continuous convex functions over a compact set. Some typical examples satisfying Assumption 1(d) are the $\ell_1$ norm, the group Lasso penalty, and every continuously differentiable penalty. By Eq. (11) and the definition of the conjugate, the Lyapunov function can be rewritten as follows,
$$F(P, q) = \bar{\mathcal{L}}(P) + \frac{1}{\alpha}\Big( \Psi(P) - \langle q, P \rangle + \sup_{Q}\ \big\{ \langle q, Q \rangle - \Psi(Q) \big\} \Big). \tag{12}$$
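As a sanity check on the conjugate $\Psi^*$ appearing in (11) (a standard computation, included here for the reader's convenience), for the smooth quadratic part $\Psi_0(P) = \frac{1}{2\kappa}\|P\|^2$ one obtains in closed form:

```latex
\Psi_0^*(q) \;=\; \sup_{P}\,\Big\{ \langle q, P \rangle - \tfrac{1}{2\kappa}\|P\|^2 \Big\}
\;=\; \frac{\kappa}{2}\,\|q\|^2,
\qquad \text{attained at } P = \kappa q,
```

so the conjugate grows with $\kappa$: a larger damping parameter lets the dual variable move on a coarser scale.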
Now we are ready to present the main theorem.
Theorem 1 (Global Convergence of SplitLBI).
Suppose that Assumption 1 holds. Let $\{P_k = (W_k, \Gamma_k)\}$ be a sequence generated by SLBI (10a)-(10c) with a finite initialization. If the step size is sufficiently small, then $P_k$ converges to a critical point of $\bar{\mathcal{L}}$.
Applied to neural network training settings, typical examples satisfying the assumptions of this theorem are summarized in the following corollary.
Corollary 1.
Let $\{P_k\}$ be a sequence generated by SLBI (10a)-(10c) for the deep neural network training problem, where
(a) $\ell$ is any smooth definable loss function, such as the square loss, exponential loss, logistic loss, and cross-entropy loss;
(b) $\sigma_i$ is any smooth definable activation, such as the linear activation, sigmoid $\frac{1}{1+e^{-t}}$, hyperbolic tangent $\frac{e^t - e^{-t}}{e^t + e^{-t}}$, and softplus $\frac{1}{c}\log(1 + e^{ct})$ (for some $c > 0$) as a smooth approximation of ReLU;
(c) $\Omega$ is the group Lasso penalty (i.e., $\Omega(\Gamma) = \sum_g \|\Gamma_g\|_2$).
Moreover, suppose that the empirical risk $\mathcal{L}(W)$ is lower bounded with bounded level sets and its gradient is Lipschitz continuous. Then the sequence $\{P_k\}$ converges to a stationary point of $\bar{\mathcal{L}}$ if the step size is sufficiently small.
4 Stochastic Approximation of SplitLBI and Applications
For neural network training with large datasets, a stochastic approximation of the gradients in SplitLBI over mini-batches is adopted to update the parameters; e.g. for $W_k$,
$$\nabla_W \bar{\mathcal{L}}(W_k, \Gamma_k) \approx \frac{1}{|B_k|} \sum_{i \in B_k} \nabla_W \ell\big(y_i, f(x_i; W_k)\big) + \frac{1}{\nu}\,(W_k - \Gamma_k), \tag{13}$$
where $B_k$ is the mini-batch sampled at iteration $k$.
Moreover, inspired by the variants of SGD, a momentum term can also be incorporated into the standard SplitLBI, leading to the following updates of $W_k$ by replacing Eq. (4a) with
\begin{align}
M_{k+1} &= \mu\, M_k - \kappa\, \alpha_k\, \nabla_W \bar{\mathcal{L}}(W_k, \Gamma_k), \tag{14a}\\
W_{k+1} &= W_k + M_{k+1}, \tag{14b}
\end{align}
where $\mu$ is the momentum factor, empirically set to 0.9 by default. One immediate application of such stochastic algorithms of SplitLBI is to "boost networks", i.e. growing a network from the null model to a complex one by sequentially applying our algorithm on subnets with increasing complexities.
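A hedged sketch of one momentum $W$-step on a toy quadratic loss follows; the buffer name `M`, its zero initialization, and the toy loss are our assumptions, mirroring how SGD-with-momentum is usually implemented.

```python
import numpy as np

def momentum_w_step(W, Gamma, grad_L, M, kappa=1.0, alpha=0.1, nu=1.0, mu=0.9):
    """One W-update of SplitLBI with a momentum buffer M (a sketch)."""
    grad = grad_L(W) + (W - Gamma) / nu   # gradient of the augmented loss (2)
    M = mu * M - kappa * alpha * grad     # velocity update
    return W + M, M

# usage on a toy quadratic loss L(W) = 0.5 * ||W||^2, with Gamma held at zero
grad_L = lambda W: W
W, Gamma, M = np.ones(3), np.zeros(3), np.zeros(3)
for _ in range(200):
    W, M = momentum_w_step(W, Gamma, grad_L, M)
```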
Boosting Networks. We propose the task of boosting a network by SplitLBI, i.e., learning to expand the convolutional filters and network parameters during training. Remarkably, our network boosting is very different from previous tasks, including AutoML (Wong et al., 2018), lifelong learning (Wang et al., 2017; Thrun and Mitchell, 1995; Pentina and Lampert, 2015; Li and Hoiem, 2016), and knowledge distillation (Hinton et al., 2014). Starting from very few filters in each convolutional layer, algorithms for boosting a network require not only efficiently optimizing the parameters of filters, but also adding more filters if the existing filters do not have enough capacity to model the distribution of training data. Note that the existing works mentioned above do not allow such expansion without additional fine-tuning algorithms, which are very computationally expensive in practice.
To apply SplitLBI to boost networks, we define a projection of the convolutional filters $W$ onto the support set of $\Gamma$,
$$\widetilde{W} := \mathrm{Proj}_{\mathrm{supp}(\Gamma)}(W) = W \odot \mathbb{1}(\Gamma \neq 0), \tag{15}$$
where $\odot$ denotes element-wise multiplication.
The basic idea is to monitor the gap between $W$ and its projection $\widetilde{W}$ along the iterations: when the gap becomes small, we expand the network by adding new filters.
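The projection (15) and the gap monitoring can be sketched as follows; the relative-gap formula is one reasonable choice for the monitored quantity, not necessarily the authors' exact criterion.

```python
import numpy as np

def project_on_support(W, Gamma):
    """Keep only the entries of W whose corresponding Gamma entry is nonzero,
    a sketch of the projection in Eq. (15)."""
    return W * (Gamma != 0)

def support_gap(W, Gamma):
    """Relative gap between W and its projection; a small gap suggests the
    current filters are saturated and the network may be expanded."""
    return np.linalg.norm(W - project_on_support(W, Gamma)) / (np.linalg.norm(W) + 1e-12)

W = np.array([1.0, 2.0, 3.0, 4.0])
Gamma = np.array([0.5, 0.0, 1.2, 0.0])
gap = support_gap(W, Gamma)
```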
We now describe the procedure of expanding a network by SplitLBI. Starting from a small number of filters (e.g. 2) in each convolutional layer, more and more filters tend to take nonzero values in $\Gamma$ as the algorithm iterates. At each step, we compute the ratio of filters whose $\Gamma$ is nonzero. If this ratio passes a preset threshold (empirically set to 0.8), we add a few (e.g. 2) new convolutional filters, whose parameters are randomly initialized. Then we continue learning the network from the training data. This process is repeated until the model converges or the maximum number of epochs is reached.
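The grow-when-saturated rule described above can be sketched as follows; the filter shapes, initialization scale, and list-based bookkeeping are illustrative assumptions.

```python
import numpy as np

def maybe_expand(W_filters, Gamma_filters, threshold=0.8, grow=2, rng=None):
    """If the fraction of filters with a nonzero Gamma exceeds `threshold`,
    append `grow` randomly initialized filters (a sketch of the procedure)."""
    rng = rng or np.random.default_rng(0)
    ratio = np.mean([np.linalg.norm(g) > 0 for g in Gamma_filters])
    if ratio > threshold:
        for _ in range(grow):
            W_filters.append(rng.standard_normal(W_filters[0].shape) * 0.01)
            Gamma_filters.append(np.zeros_like(Gamma_filters[0]))  # new Gamma starts at 0
    return W_filters, Gamma_filters

W = [np.ones((3, 3)), np.ones((3, 3))]
G = [np.ones((3, 3)), np.ones((3, 3))]   # both filters active -> ratio 1.0 > 0.8
W, G = maybe_expand(W, G)
```

New filters enter with $\Gamma = 0$, so the inverse scale space path decides later whether they become important.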
5 Experiments
The proposed SplitLBI is evaluated on the MNIST, Cifar10/100 and ImageNet2012 datasets, which are widely utilized as benchmarks in deep learning. In particular, Sec. 5.1 compares SplitLBI against SGD and Adam in image classification. Furthermore, Sec. 5.2 shows that SplitLBI can efficiently boost a network from the null model to complex ones from data.
Implementation. The various algorithms are evaluated over LeNet (LeCun et al. (2015)), AlexNet, and ResNet (He et al. (2016)), respectively. By default, the minibatch-based SplitLBI is employed with batch size 256. For MNIST and Cifar10, the hyper-parameters $(\kappa, \nu)$ are held fixed, and the learning rate starts from 0.1 and decays by a factor of 0.1 every 40 epochs. On ImageNet2012, the learning rate starts from 0.1 and decays by a factor of 0.1 every 30 epochs. Adam follows the default PyTorch setting, and SGD follows He et al. (2016) with initial learning rate 0.1, momentum 0.9 and weight decay $10^{-4}$.
Table 1: Test accuracy (%); Top-1/Top-5 accuracies are reported on ImageNet2012.

Dataset                  MNIST   Cifar10    ImageNet2012
Models                   LeNet   ResNet20   AlexNet       ResNet18
SGD                      99.21   91.25      56.55/79.09   69.76/89.18
Adam                     99.19   90.86      –/–           59.66/83.28
SplitLBI                 99.18   91.50      –/–           62.86/84.67
SplitLBI with momentum   99.19   91.57      56.09/78.86   68.55/87.85
5.1 Image Classification
In the supervised setting, SGD, Adam and SplitLBI are adopted to optimize DNNs on these datasets. The results are shown in Table 1. When training ResNet18 on the ImageNet2012 dataset, the model trained by SplitLBI achieves its Top-1 accuracy within 40 epochs, which is lower than the best ResNet18 result reported in He et al. (2016) optimized over 120 epochs. Still, this indicates that SplitLBI may train more efficiently than SGD. Note that: (1) in Fig. 1, we qualitatively compare our SplitLBI against SGD and Adam; remarkably, at the early stage, SplitLBI has much better efficiency than Adam and SGD within the same number of training epochs; (2) compared with SGD, SplitLBI adds a little extra GPU memory cost to store the additional parameters $\Gamma$ and $Z$, while the total running time of each epoch is comparable to that of SGD.
To interpret the filters learned by SplitLBI with structural sparsity, we visualize a LeNet5 trained on MNIST and a ResNet18 trained on ImageNet.
Visualization of LeNet5 on MNIST. We visualize the solution paths and filter patterns of the third convolutional layer (i.e., conv.c5) of LeNet5 trained by SplitLBI and by SGD on MNIST, as in Fig. 2. The X-axis and Y-axis indicate the training epochs and the magnitude of each filter, respectively. In particular, the left two figures compare the magnitude changes of each filter trained by SplitLBI and SGD, individually. The filters that are not in the support of $\Gamma$ are drawn in red. In the right figure, the model parameters learned at 20 (blue), 40 (green), and 80 (black) epochs are visualized following Erhan et al. (2009), with correspondingly colored bounding boxes. The conv.c5 layer has 120 convolutional filters, which are ordered by filter magnitude. We observe that the filters of high magnitude learned by our SplitLBI are much sparser and have much clearer patterns than those trained by SGD, showing that SplitLBI encourages structural sparsity of filters.
ResNet18@ImageNet (Conv1) 
Visualization of ResNet18 on ImageNet2012. To further reveal insights into the patterns learned by SplitLBI, we visualize the first convolutional layer of ResNet18 trained on ImageNet2012 in Fig. 3. The left figure compares the training and validation accuracy of SplitLBI and SGD. The right figure compares visualizations of the filters learned by SplitLBI and SGD using the method of Springenberg et al. (2014). To be specific, for each filter of the first convolutional layer, we compute the gradient of the sum of its feature map with respect to the input image (here a snake image). We further apply min-max normalization to the gradient image to generate the final visualization map. The right figure compares the visualized gradient images of the first convolutional layer of 64 filters with $7 \times 7$ receptive fields. We visualize the model parameters at 20 (purple), 40 (green), and 60 (black) epochs, respectively, corresponding to the bounding boxes in the right figure annotated by those colors. We order the gradient images produced from the 64 filters in descending order of filter magnitude ($\ell_2$ norm), i.e., images are ordered from the upper left to the bottom right. For comparison, we also provide the visualized gradients from randomly initialized weights. We find that the filters with high magnitudes mostly focus on texture and shape information, while color information is kept in the filters with small magnitudes. This phenomenon is in accordance with the observation of Abbasi-Asl and Yu (2017) that filters mainly containing color information can be pruned to save computational cost. Moreover, among the filters of high magnitudes, most capture non-semantic textures while few pursue shapes.
This shows that the first convolutional layer of ResNet18 trained on ImageNet learns non-semantic textures rather than shapes for image classification, in accordance with recent studies (Geirhos et al., 2019).
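The min-max normalization step used in the visualizations above can be sketched as follows (the small epsilon guarding against a constant image is our addition):

```python
import numpy as np

def minmax_normalize(grad_img):
    """Min-max normalization turning a raw input-gradient into a
    displayable visualization map in [0, 1]."""
    lo, hi = grad_img.min(), grad_img.max()
    return (grad_img - lo) / (hi - lo + 1e-12)

g = np.array([[-2.0, 0.0], [2.0, 6.0]])   # toy gradient image
vis = minmax_normalize(g)
```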
5.2 Boosting a Network
In this experiment, we explore how SplitLBI boosts the filters of a network with a fixed number of layers. In each individual layer, we start from a small initial number of filters (Min) and gradually grow to a maximum number of filters (Max), adaptively learned by the SplitLBI boosting procedure described in Sec. 4. We compare the results of directly using the small number of filters (Min), the maximum number of filters (Max), and our boosting method in Table 2. Note that the maximum number of filters is computed by our boosting algorithm; we do not know it beforehand. In addition, we also include networks with overparameterized (O.P.) convolutional filters, with twice the maximum number of filters in each corresponding layer. By default, all models are trained for 1000 epochs. We employ MNIST, Cifar10 and Cifar100 as the evaluation benchmarks here.

As shown in Table 2, our SplitLBI can successfully boost the filters of networks from Min to Max. In Fig. 4, Fig. 5, Fig. 6, and Fig. 7, we give the full training paths of Min (Minimum Model), Max (Maximum Model) and Boosting (Grow) compared in Table 2. Each black line indicates that filters are added; in our experiments, two filters are added each time. To the best of our knowledge, this is the first algorithm that can simultaneously learn the network structure and parameters from data. One would hope that Max and Boosting hit the same performance given enough training epochs; and impressively, after 1000 epochs, Max and Boosting indeed achieve almost comparable results on all of the two-layer networks. This further indicates the efficacy of the proposed SplitLBI. On Cifar100 with a three-convolutional-layer network, the results of boosting are inferior to those of Max. A further inspection of the full training paths of Min, Max and Boosting (Fig. 7) shows that boosting may require more training epochs for such a relatively complex structure. Thus, empirically, we fix the boosted network of Table 2 and continue training for another 500 epochs, and we find that the trained model indeed reports almost the same performance as Max.
6 Conclusion
In this paper, a parsimonious deep learning method is proposed based on differential inclusions of inverse scale spaces. Its simple discretization, SplitLBI, has provable global convergence on DNNs and is thus employed to train deep networks. Furthermore, the proposed method can be applied to boost a network by simultaneously learning both the network structure and its parameters. Extensive experiments and visualizations show that the proposed method may not only achieve better efficiency than SGD, but also capture the structural sparsity of convolutional filters for interpretability.
Appendix
Appendix A Proof of Theorem 1
A.1 Kurdyka-Łojasiewicz Property
To introduce the definition of the Kurdyka-Łojasiewicz (KL) property, we need some notions and notations from variational analysis, which can be found in Rockafellar and Wets (1998).
The notion of subdifferential plays a central role in the following definitions. For each $x \in \mathrm{dom}(f) := \{x \in \mathbb{R}^n : f(x) < +\infty\}$, the Fréchet subdifferential of $f$ at $x$, written $\hat{\partial} f(x)$, is the set of vectors $v \in \mathbb{R}^n$ which satisfy
$$\liminf_{y \neq x,\ y \to x}\ \frac{f(y) - f(x) - \langle v,\ y - x \rangle}{\|y - x\|} \;\ge\; 0.$$
When $x \notin \mathrm{dom}(f)$, we set $\hat{\partial} f(x) = \emptyset$. The limiting-subdifferential (or simply subdifferential) of $f$ introduced in Mordukhovich (2006), written $\partial f(x)$ at $x \in \mathrm{dom}(f)$, is defined by
$$\partial f(x) := \big\{ v \in \mathbb{R}^n : \exists\, x_k \to x,\ f(x_k) \to f(x),\ v_k \in \hat{\partial} f(x_k) \to v \big\}. \tag{16}$$
A necessary (but not sufficient) condition for $x \in \mathbb{R}^n$ to be a minimizer of $f$ is $0 \in \partial f(x)$. A point that satisfies this inclusion is called limiting-critical, or simply critical. The distance between a point $x$ and a subset $\mathcal{S}$ of $\mathbb{R}^n$, written $\mathrm{dist}(x, \mathcal{S})$, is defined by $\mathrm{dist}(x, \mathcal{S}) = \inf \{\|x - s\| : s \in \mathcal{S}\}$, where $\|\cdot\|$ represents the Euclidean norm.
Let $f: \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ be an extended-real-valued function (respectively, let $F: \mathbb{R}^n \rightrightarrows \mathbb{R}^m$ be a point-to-set mapping); its graph is defined by
$$\mathrm{Graph}(f) := \{(x, y) \in \mathbb{R}^n \times \mathbb{R} : y = f(x)\}$$
(resp. $\mathrm{Graph}(F) := \{(x, y) \in \mathbb{R}^n \times \mathbb{R}^m : y \in F(x)\}$), and its domain by $\mathrm{dom}(f) := \{x : f(x) < +\infty\}$ (resp. $\mathrm{dom}(F) := \{x : F(x) \neq \emptyset\}$). When $f$ is a proper function, i.e., when $\mathrm{dom}(f) \neq \emptyset$, the set of its global minimizers (possibly empty) is denoted by
$$\arg\min f := \{x \in \mathbb{R}^n : f(x) = \inf f\}.$$
The KL property (Łojasiewicz, 1963, 1993; Kurdyka, 1998; Bolte et al., 2007a, b) plays a central role in the convergence analysis of nonconvex algorithms (Attouch et al., 2013; Wang et al., 2019). The following definition is adopted from Bolte et al. (2007b).
Definition 1 (Kurdyka-Łojasiewicz property).
A function $f: \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ is said to have the Kurdyka-Łojasiewicz (KL) property at $\bar{x} \in \mathrm{dom}(\partial f)$ if there exist a constant $\eta \in (0, +\infty]$, a neighborhood $U$ of $\bar{x}$, and a function $\varphi: [0, \eta) \to \mathbb{R}_+$ which is concave, continuous at $0$, and satisfies $\varphi(0) = 0$, $\varphi \in C^1((0, \eta))$, i.e., $\varphi$ is continuously differentiable on $(0, \eta)$, and $\varphi'(s) > 0$ for all $s \in (0, \eta)$, such that for all $x \in U \cap \{x : f(\bar{x}) < f(x) < f(\bar{x}) + \eta\}$, the following inequality holds:
$$\varphi'\big(f(x) - f(\bar{x})\big) \cdot \mathrm{dist}\big(0, \partial f(x)\big) \;\ge\; 1. \tag{17}$$
If $f$ satisfies the KL property at each point of $\mathrm{dom}(\partial f)$, then $f$ is called a KL function.
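As a simple worked example (ours, for intuition): the quadratic $f(x) = \frac{1}{2}\|x\|^2$ satisfies the KL inequality (17) at its critical point $\bar{x} = 0$ with $\eta = +\infty$, $U = \mathbb{R}^n$, and the desingularizing function $\varphi(s) = \sqrt{2s}$, since $\varphi'(s) = 1/\sqrt{2s}$ and $\mathrm{dist}(0, \partial f(x)) = \|x\|$:

```latex
\varphi'\big(f(x) - f(\bar{x})\big)\cdot\operatorname{dist}\big(0, \partial f(x)\big)
\;=\; \frac{\|x\|}{\sqrt{2\cdot\tfrac{1}{2}\|x\|^2}}
\;=\; 1 \;\ge\; 1 .
```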
KL functions include real analytic functions, semialgebraic functions, tame functions defined in some o-minimal structures (Kurdyka, 1998; Bolte et al., 2007b), continuous subanalytic functions (Bolte et al., 2007a) and locally strongly convex functions. In the following, we provide some important examples that satisfy the Kurdyka-Łojasiewicz property.
Definition 2 (Real analytic).
A function $f$, with domain an open set $U \subseteq \mathbb{R}$ and range the set of either all real or complex numbers, is said to be real analytic at $u_0 \in U$ if the function may be represented by a convergent power series on some interval of positive radius centered at $u_0$:
$$f(u) = \sum_{j=0}^{\infty} a_j\, (u - u_0)^j,$$
for some sequence of coefficients $\{a_j\}$. The function is said to be real analytic on $V \subseteq U$ if it is real analytic at each $u_0 \in V$ (Krantz and Parks, 2002, Definition 1.1.5). Real analytic functions over $\mathbb{R}^n$ for some positive integer $n > 1$ can be defined similarly.
According to Krantz and Parks (2002), typical real analytic functions include polynomials, exponential functions, and the logarithm, trigonometric and power functions on any open set of their domains. One can verify whether a multivariable real function $f$ on $\mathbb{R}^n$ is analytic by checking the analyticity of $g(t) := f(x + ty)$ for any $x, y \in \mathbb{R}^n$.
Definition 3 (Semialgebraic).

A set $\mathcal{D} \subseteq \mathbb{R}^n$ is called semialgebraic (Bochnak et al., 1998) if it can be represented as
$$\mathcal{D} = \bigcup_{j=1}^{J} \bigcap_{i=1}^{I}\ \big\{ x \in \mathbb{R}^n : P_{ij}(x) = 0,\ Q_{ij}(x) < 0 \big\},$$
where $P_{ij}, Q_{ij}$ are real polynomial functions for $1 \le i \le I$, $1 \le j \le J$.

A function $f$ (resp. a point-to-set mapping $F$) is called semialgebraic if its graph is semialgebraic.
According to Łojasiewicz (1965); Bochnak et al. (1998) and Shiota (1997, I.2.9, page 52), the class of semialgebraic sets is stable under the operations of finite union, finite intersection, Cartesian product and complementation. Some typical examples include polynomial functions, the indicator function of a semialgebraic set, and the Euclidean norm (Bochnak et al., 1998, page 26).
A.2 KL Property in Deep Learning and Proof of Corollary 1
In the following, we consider the deep neural network training problem. Consider an $l$-layer feedforward neural network including $l-1$ hidden layers. Particularly, let $d_i$ be the number of hidden units in the $i$-th hidden layer for $i = 1, \ldots, l-1$, and let $d_0$ and $d_l$ be the numbers of units of the input and output layers, respectively. Let $W^{(i)} \in \mathbb{R}^{d_i \times d_{i-1}}$ be the weight matrix between the $(i-1)$-th layer and the $i$-th layer for any $i = 1, \ldots, l$ (to simplify notations, we regard the input and output layers as the $0$-th and the $l$-th layers, respectively, and absorb the bias of each layer into $W^{(i)}$).
According to Theorem 1, one major condition is to verify that the Lyapunov function defined in (11) satisfies the Kurdyka-Łojasiewicz property. For this purpose, we need an extension of semialgebraic sets, called the o-minimal structure (see, for instance, Coste (1999), van den Dries (1986), Kurdyka (1998), Bolte et al. (2007b)). The following definition is from Bolte et al. (2007b).
Definition 4 (o-minimal structure).
An o-minimal structure on $(\mathbb{R}, +, \cdot)$ is a sequence of boolean algebras $\mathcal{O}_n$ of "definable" subsets of $\mathbb{R}^n$, such that for each $n \in \mathbb{N}$:
(i) if $A$ belongs to $\mathcal{O}_n$, then $A \times \mathbb{R}$ and $\mathbb{R} \times A$ belong to $\mathcal{O}_{n+1}$;
(ii) if $\Pi: \mathbb{R}^{n+1} \to \mathbb{R}^n$ is the canonical projection onto $\mathbb{R}^n$, then for any $A$ in $\mathcal{O}_{n+1}$, the set $\Pi(A)$ belongs to $\mathcal{O}_n$;
(iii) $\mathcal{O}_n$ contains the family of algebraic subsets of $\mathbb{R}^n$, that is, every set of the form
$$\{x \in \mathbb{R}^n : p(x) = 0\},$$
where $p: \mathbb{R}^n \to \mathbb{R}$ is a polynomial function;
(iv) the elements of $\mathcal{O}_1$ are exactly the finite unions of intervals and points.
Based on the definition of o-minimal structure, we can now define definable functions.
Definition 5 (Definable function).
Given an o-minimal structure $\mathcal{O}$ (over $\mathbb{R}$), a function $f: \mathbb{R}^n \to \mathbb{R}$ is said to be definable in $\mathcal{O}$ if its graph belongs to $\mathcal{O}_{n+1}$.
According to van den Dries and Miller (1996); Bolte et al. (2007b), some important facts about o-minimal structures are as follows:
(i) The collection of semialgebraic sets is an o-minimal structure. Recall that semialgebraic sets are boolean combinations of sets of the form
$$\{x \in \mathbb{R}^n : p(x) = 0,\ q_1(x) < 0, \ldots, q_m(x) < 0\},$$
where $p$ and the $q_i$'s are polynomial functions on $\mathbb{R}^n$.
(ii) There exists an o-minimal structure that contains the sets of the form
$$\{(x, t) \in [-1, 1]^n \times \mathbb{R} : f(x) = t\},$$
where $f$ is real-analytic around $[-1, 1]^n$.
(iii) There exists an o-minimal structure that contains simultaneously the graph of the exponential function and all semialgebraic sets.
(iv) o-minimal structures are stable under the sum, composition, the inf-convolution and several other classical operations of analysis.
The Kurdyka-Łojasiewicz property for smooth definable functions and nonsmooth definable functions was established in (Kurdyka, 1998, Theorem 1) and (Bolte et al., 2007b, Theorem 11), respectively. Now we are ready to present the proof of Corollary 1.
Proof of Corollary 1.
To justify this corollary, we only need to verify that the associated Lyapunov function $F$ satisfies the Kurdyka-Łojasiewicz inequality. In this case, by (12), $F$ can be written in terms of the loss, the penalty, linear terms, and a supremum of definable functions. Because the loss $\ell$ and the activations $\sigma_i$'s are definable by assumption, the empirical risk and hence $\bar{\mathcal{L}}$ are definable as compositions of definable functions. Moreover, according to Krantz and Parks (2002), the quadratic terms are semialgebraic and thus definable. Since the group Lasso penalty is the composition of the $\ell_1$ and $\ell_2$ norms, and the conjugate of the group Lasso penalty is determined by the maximum of the group $\ell_2$ norms (its dual norm), where the $\ell_1$, $\ell_2$, and max norms are all definable, the group Lasso penalty and its conjugate are definable as compositions of definable functions. Therefore, $F$ is definable and hence satisfies the Kurdyka-Łojasiewicz inequality by (Kurdyka, 1998, Theorem 1).
The verification of the specific cases listed in the assumptions can be found in the proof of (Zeng et al., 2019, Proposition 1). This finishes the proof of this corollary. ∎
A.3 Proof of Theorem 1
Our analysis is mainly motivated by a recent paper (Benning et al., 2017), as well as the influential work (Attouch et al., 2013). According to (Attouch et al., 2013, Lemma 2.6), there are mainly four ingredients in the analysis, that is, the sufficient descent property, the relative error property, the continuity property of the generated sequence, and the Kurdyka-Łojasiewicz property of the function. More specifically, we first establish the sufficient descent property of the generated sequence via exploiting the Lyapunov function $F$ (see (11)) in Lemma A.4 in Section A.4, and then show the relative error property of the sequence in Lemma A.5 in Section A.5. The continuity property is guaranteed by the continuity of $\bar{\mathcal{L}}$ and the relation established in Lemma 1(i) in Section A.4. Thus, together with the Kurdyka-Łojasiewicz assumption on $F$, we establish the global convergence of SLBI following (Attouch et al., 2013, Lemma 2.6).
Let be a critical point of , then the following holds
(18)  
By the final inclusion and the convexity of $\Omega$, it implies . Plugging this inclusion into the second inclusion yields . Together with the first equality, this implies
This finishes the proof of this theorem.
A.4 Sufficient Descent Property along Lyapunov Function
In the following, we present the sufficient descent property of the generated sequence along the Lyapunov function $F$.
Lemma. Suppose that $\bar{\mathcal{L}}$ is continuously differentiable and $\nabla\bar{\mathcal{L}}$ is Lipschitz continuous with a constant $\mathrm{Lip}$. Let $\{(W_k, \Gamma_k)\}$ be a sequence generated by SLBI with a finite initialization. If the step size is small enough, then
where .
Proof.
By the optimality condition of (9a) and also the inclusion , there holds
which implies
(19) 
where
Noting that and by the Lipschitz continuity of with a constant implies is Lipschitz continuous with a constant . This implies
Substituting the above inequality into (19) yields
(20) 
Adding some terms to both sides of the above inequality and rearranging yields
(21)  
where the final equality holds for That is,
(22)  
(23) 
where the final inequality holds for and Thus, we finish the proof of this lemma. ∎
Based on Lemma A.4, we directly obtain the following lemma.
Lemma 1.
Proof.
By (20), is monotonically decreasing due to . Similarly, by (23), is also monotonically decreasing. By the lower boundedness assumption of , both and are lower bounded by their definitions, i.e., (2) and (11), respectively. Therefore, both and converge, and it is obvious that . By (22),
By the convergence of and the nonnegativity of , there holds
By the definition of and the above equality, it yields
Since $\bar{\mathcal{L}}$ has bounded level sets, the sequence is bounded. By the definition of and the finiteness of , is also bounded because is bounded. The boundedness of follows from condition (d) and the boundedness of .