Solution of linear ill-posed problems by model selection and aggregation

# Solution of linear ill-posed problems by model selection and aggregation

Felix Abramovich
Department of Statistics
and Operations Research
Tel Aviv University
Israel
felix@post.tau.ac.il
Daniela De Canditiis
Istituto per le Applicazioni
del Calcolo ”M. Picone”, CNR
Rome
Italy
d.decanditiis@iac.cnr.it
Marianna Pensky
Department of Mathematics
University of Central Florida
USA
Marianna.Pensky@ucf.edu
###### Abstract

We consider a general statistical linear inverse problem, where the solution is represented via a known (possibly overcomplete) dictionary that allows its sparse representation. We propose two different approaches. A model selection estimator selects a single model by minimizing the penalized empirical risk over all possible models. By contrast with direct problems, the penalty depends on the model itself rather than on its size only as for complexity penalties. A Q-aggregate estimator averages over the entire collection of estimators with properly chosen weights. Under mild conditions on the dictionary, we establish oracle inequalities both with high probability and in expectation for the two estimators. Moreover, for the latter estimator these inequalities are sharp. The proposed procedures are implemented numerically and their performance is assessed by a simulation study.

Keywords: Aggregation, ill-posed linear inverse problems, model selection, oracle inequality, overcomplete dictionary
AMS (2000) Subject Classification: Primary: 62G05. Secondary: 62C10

## 1 Introduction

Linear inverse problems, where the data is available not on the object of primary interest but only in the form of its linear transform, appear in a variety of fields: medical imaging (X-ray tomography, CT and MRI), astronomy (blured images), finance (model calibration of volatility) to mention just a few. The main difficulty in solving inverse problems is due to the fact that most of practically interesting and relevant cases fall into into the category of so-called ill-posed problems, where the solution cannot be obtained numerically by simple invertion of the transform. In statistical inverse problems the data is, in addition, corrupted by random noise that makes the solution even more challenging.

Statistical linear inverse problems have been intensively studied and there exists an enormous amount of literature devoted to various theoretical and applied aspects of their solutions. We refer a reader to to Cavalier (2009) for review and references therein.

Let and be two separable Hilbert spaces and be a bounded linear operator. Consider a general statistical linear inverse problem

 y=Af+σε, (1)

where is the observation, is the (unknown) object of interest, is a white noise and is a (known) noise level. For ill-posed problems does not exist as a linear bounded operator.

Most of approaches for solving (1) essentially rely on reduction of the original problem to a sequence model using the following general scheme:

1. Choose some orthonormal basis on and expand the unknown in (1) as

 f=∑j⟨f,ϕj⟩G ϕj (2)
2. Define as the solution of , where is the adjoint operator, that is, . Reduce (1) to the equivalent sequence model:

 ⟨y,ψj⟩H = ⟨Af,ψj⟩H+⟨ε,ψj⟩H = ⟨f,ϕj⟩G+⟨ε,ψj⟩H, (3)

where for ill-posed problems increases with . Following the common terminology, an inverse problem is called mildly ill-posed, if the variances increase polynomially and severely ill-posed if their growth is exponential (see, e.g., Cavalier, 2009).

3. Estimate the unknown coefficients from empirical coefficients : , where is some truncating/shrinking/thresholding procedure (see, e.g., Cavalier 2009, Section 1.2.2.2 for a survey), and reconstruct as

 ^f=∑jˆ⟨f,ϕj⟩G ϕj

Efficient representation of in a chosen basis in (2) is essential. In the widely-used singular value decomposition (SVD), ’s are the orthogonal eigenfunctions of the self-adjoint operator and , where is the corresponding eigenvalue. SVD estimators are known to be optimal in various minimax settings over certain classes of functions (e.g., Johnstone & Silverman, 1990; Cavalier & Tsybakov, 2002; Cavalier et al., 2002). A serious drawback of SVD is that the basis is defined entirely by the operator and ignores the specific properties of the object of interest . Thus, for a given , the same basis will be used regardless of the nature of a scientific problem at hand. While the SVD-basis could be very efficient for representing in one area, it might yield poor approximation in the other. The use of SVD, therefore, restricts one within certain classes depending on a specific operator . See Donoho (1995) for further discussion.

In wavelet-vaguelette decomposition (WVD), ’s are orthonormal wavelet series. Unlike SVD-basis, wavelets allows sparse representation for various classes of functions and the resulting WVD estimators have been studied in Donoho (1995), Abramovich & Silverman (1998), Kalifa & Mallat (2003), Johnstone et al. (2004). However, WVD imposes relatively stringent conditions on that are satisfied only for specific types of operators, mainly of convolution type.

A general shortcoming of orthonormal bases is due to the fact that they may be “too coarse” for efficient representation of unknown . Since 90s, there was a growing interest in the atomic decomposition of functions over overcomplete dictionaries (see, for example, Mallat & Zhang, 1993; Chen, Donoho & Saunders, 1999; Donoho & Elad, 2003). Every basis is essentially only a minimal necessary dictionary. Such “miserly” representation usually causes poor adaptivity (Mallat & Zhang, 1993). Application of overcomplete dictionaries improves adaptivity of the representation, because one can choose now the most efficient (sparse) one among many available. One can see here an interesting analogy with colors. Theoretically, every other color can be generated by combining three basic colors (green, red and blue) in corresponding proportions. However, a painter would definitely prefer to use the whole available palette (overcomplete dictionary) to get the hues he needs! Selection of appropriate subset of atoms (model selection) that allows a sparse representation of a solution is a core element in such an approach. Pensky (2016) was probably the first to use overcomplete dictionaries for solving inverse problems. She utilized the Lasso techniques for model selection within the overcomplete dictionary, established oracle inequalities with high probability and applied the proposed procedure to several examples of linear inverse problems. However, as usual with Lasso, it required somewhat restrictive compatibility conditions on the design matrix .

In this paper we propose two alternative approaches for overcomplete dictionaries based estimation in linear ill-posed problems. The first estimator is obtained by minimizing penalized empirical risk with a penalty on model proportional to . The second one is based on a Q-aggregation type procedure that is specifically designed for solution of linear ill-posed problems and is different from that of Dai et al. (2012) developed for solution of direct problems. We establish oracle inequalities for both estimators that hold with high probabilities and in expectation. Moreover, for the Q-aggregation estimator, the inequalities are sharp. Simulation study shows that the new techniques produce more accurate estimators than Lasso.

The rest of the paper is organized as follows. Section 2 introduces the notations and some preliminary results. The model selection and aggregation-type procedures are studied respectively in Section 3 and Section 4. The simulation study is described in Section 5. All proofs are given in the Appendix.

## 2 Setup and notations

Consider a discrete analog of a general statistical linear inverse problem (1):

 y=Af+σε, (4)

where is the vector of observations, is the unknown vector to be recovered, is a known (ill-posed) matrix with , and .

In what follows and denote respectively a Euclidean norm and an inner product in . Let with be a set of normalized vectors (dictionary), where typically (overcomplete dictionary). Let be the complete dictionary matrix with the columns , and is such that , that is, and .

In what follows, we assume that, for some positive , the minimal -sparse eigenvalue of is separated from zero:

 ν2r=minx∈Rn,∥x∥0≤r∥Φx∥2∥x∥2>0

and consider a set of models of sizes not larger than .

For a given model define a diagonal indicator matrix with diagonal entries . The design matrix corresponding to is then , while . Let be the projection matrix on a span of nonzero columns of and

 fM=HMf=ΦM(ΦTMΦM)−1ΦTM f=ΦM(ΦTMΦM)−1ΨTM Af

be the projection of on . Denote

 z=(ATA)−1ATy=f+ξ,whereξ=(ATA)−1ATε. (5)

Consider the corresponding projection estimator

 ^fM=HMz=HM(ATA)−1ATy=ΦM(ΦTMΦM)−1ΨTM y=ΦM^θM, (6)

where the vector of projection coefficients . By straightforward calculus, and the quadratic risk

 E∥^fM−f∥2=∥fM−f∥2+σ2Tr((ATA)−1HM) (7)

The oracle model is the one that minimizes (7) over all models and the ideal oracle risk is

 R(oracle)=infM∈ME∥^fM−f∥2=infM{∥fM−f∥2+σ2Tr((ATA)−1HM)} (8)

The oracle risk is obviously unachievable but can be used as a benchmark for a quadratic risk of any available estimator.

## 3 Model selection by penalized empirical risk

For a given model , in (6) minimizes the corresponding empirical risk , where was defined in (5). Select a model by minimizing the penalized empirical risk:

 ˆM=argminM∈M{∥z−^fM∥2+Pen(M)}=argminM∈M{−∥^fM∥2+Pen(M)}, (9)

where is a penalty function on a model . The proper choice of is the core of such an approach.

For direct problems (), the penalized empirical risk approach, with the complexity type penalties on a model size, has been intensively studied in the literature. In the last decade, in the context of linear regression, the in-depth theories (risk bounds, oracle inequalities, minimaxity) have been developed by a number of authors (see, e.g., Foster & George (1994), Birgé & Massart (2001, 2007), Abramovich & Grinshtein (2010), Rigollet & Tsybakov (2011), Verzelen (2012) among many others).

For inverse problems, Cavalier et al. (2002) considered a truncated orthonormal series estimator, where the cut-off point was chosen by SURE criterion corresponding to the AIC-type penalty and established oracle inequalities for the resulting estimator . It was further generalized and improved by risk hull minimization in Cavalier & Golubev (2006).

To the best of our knowledge, Pensky (2016) was the first to consider model selections within overcomplete dictionaries by empirical risk minimization for statistical inverse problems. She utilized Lasso penalty. However, as usual with Lasso, it required somewhat restrictive compatibility conditions on the design matrix (see Pensky, 2016 for more details).

In this paper, we utilize the penalty that depends on the Frobenius norm of the matrix :

 ∥ΨM∥2F=∑j∈M∥ψj∥2=Tr((ATA)−1ΦMΦTM)

The following theorem provides nonasymptotic upper bounds for the quadratic risk of the resulting estimator both with high probability and in expectation:

###### Theorem 1.

Consider the model (4) and the penalized empirical risk estimator , where the model is selected w.r.t. (9) with the penalty

 Pen(M)≥4σ2(δ+1)aν2r ∥ΨM∥2F lnp (10)

for some and . Then,

1. With probability at least

 (11)
2. If, in addition, we restrict the set of admissible models to for some constant ,

 E∥^fˆM−f∥2≤1+a1−a minM∈Mγ{∥fM−f∥2+32(1+a) Pen(M)}+4γ2σ2a(1−a)ν2r np−δ (12)

The additional restriction on the set of models in the second part of Theorem 1 is required to guarantee that the oracle risk in (8) does not grow faster than .

Note that for the direct problems, and the penalty (10) is the RIC-type complexity penalty of Foster & George (1994) of the form .

We can compare the quadratic risk of the proposed estimator with the oracle risk in (8). Consider the penalty

 Pen(M)=4σ2(δ+1)aν2r ∥ΨM∥2F lnp (13)

for some and . Assume that (overcomplete dictionary) and choose . Then, the last term in the RHS of (12) turns to be of a smaller order and we obtain

 E∥^fˆM−f∥2≤C1minM∈Mγ{∥fM−f∥2+C2γσ2ν2r∥ΨM∥2Flnp}

for some positive constants depending on and only. By standard linear algebra arguments, and, therefore, the following oracle inequality holds:

###### Corollary 1.

Assume that and consider the penalized empirical risk estimator from Theorem 1, where is selected w.r.t. (9) over with the penalty (13) for some and . Then,

 E∥^fˆM−f∥2≤C0κ2rν2rlnp R(oracle)

for some constant depending on , and only.

Thus, the quadratic risk of the proposed estimator is within -factor of the ideal oracle risk. The -factor is a common closest rate at which an estimator can approach an oracle even in direct (complete) model selection problems (see, e.g., Donoho & Johnstone, 1994; Birgé & Massart, 2001; Abramovich & Grinshtein, 2010; Rigollet & Tsybakov, 2011; and also Pensky (2016) for inverse problems). For an ordered model selection within a set of nested models, it is possible to construct estimators that achieve the oracle risk within a constant factor (see, e.g., Cavalier et al., 2002 and Cavalier & Golubev, 2006).

Similar oracle inequalities (even sharp with the coefficient in front of equals to one) with high probability were obtained for the Lasso estimator but under the additional compatibility assumption on the matrix (Pensky, 2016).

## 4 Q-aggregation

Note that inequalities (11) and (12) in Theorem 1 for model selection estimator are not sharp in the sense that the coefficient in front of the minimum is greater than one. In order to derive sharp oracle inequalities both in probability and expectation, one needs to aggregate the entire collection of estimators: rather than to select a single estimator .

Leung & Barron (2006) considered exponentially weighted aggregation (EWA) with , where is some (prior) probability measure on and is a tuning parameter. They established sharp oracle inequalities in expectation for EWA in direct problems. Dalalyan & Salmon (2012) proved sharp oracle inequalities in expectation for EWA of affine estimators in nonparametric regression model. Their paper offers limited extension of the theory to the case of mildly ill-posed inverse problems where increase at most polynomially with . However, their results are valid only for the SVD decomposition and require block design which seriously limits the scope of application of their theory. Moreover, Dai et al. (2012) argued that EWA cannot satisfy sharp oracle inequalities with high probability and proposed instead to use Q-aggregation.

Define a general Q-aggregation estimator of as

 ^f^θ=∑M∈MθM^fM, (14)

where the vector of weights is the solution of the following optimization problem:

 (15)

for some and a penalty , and is the simplex

 ΘM={θ∈Rcard(M):θM≥0,∑M∈MθM=1} (16)

In particular, Dai et al. (2012) considered proportional to the Kullback-Leibler divergence for some prior on . For direct problems, they derived sharp oracle inequalities both in expectation and with high probability for Q-aggregation with such penalty. In fact, EWA can also be viewed as an extreme case of Q-aggregation of Dai et al. for (see Rigolette & Tsybakov, 2012). However, the results for Q-aggregation with Kullback-Leibler-type penalty are not valid for ill-posed problems.

In this section we propose a different type of penalty for Q-aggregation in (15) that is specifically designed for the solution of inverse problems. In particular, this penalty allows one to obtain sharp oracle inequalities both in expectation and with high probability in both mild and severe ill-posed linear inverse problems. Namely, we consider the penalty

 Pen(θ)=4σ2(δ+1)lnpν2r∑M∈MθM∥ΨM∥2F (17)

with a tuning parameter . For such a penalty and , the resulting is

 ^θ=argminθ∈ΘM{∑M∈MθM(∥z−^fM∥2+8σ2(δ+1)lnpν2r∥ΨM∥2F)+∥∥z−∑M∈MθM^fM∥∥2} (18)

Note that the first term in the minimization criteria (18) is the same as in model selection (9) with the penalty (10) for . The presence of the second term is inherent for Q-aggregation. In fact, the model selection estimator from Section 3 is a particular case of a Q-aggregate estimator with the weights obtained by solution of problem (15) with .

The non-asymptotic upper bounds for the quadratic risk of , both with high probability and in expectation, are given by the following theorem :

###### Theorem 2.

Consider the model (4) and the Q-aggregate estimator given by (14), where the weights are selected as a solution of the optimization problem (18). Then,

1. With probability at least

 ∥^f^θ−f∥2≤minM∈M{∥fM−f∥2+10ν−2rσ2(δ+1) ∥ΨM∥2F lnp} (19)
2. If, in addition, we restrict the set of admissible models to for some constant , then

 E∥^fˆM−f∥2≤minM∈Mγ{∥fM−f∥2+ν−2rσ2∥ΨM∥2F(3+2(δ+1)lnp)}+2γ2np−δ (20)

Observe that unlike Theorem 1 for model selection estimator, inequalities in both (19) and (20) for Q-aggregation are sharp.

## 5 Simulation study

In this section we present results of a simulation study that illustrates the performance of the model selection estimator from (9) with the penalty (10) and the Q-aggregation estimator given by (14) where the weights are defined in (18).

The data were generated w.r.t. a (discrete) ill-posed statistical linear problem (4) corresponding to the convolution-type operator , on a regular grid :

 Aij=e−i−jnI(j≤i),i,j=1,…,n,

where is the indicator function and .

We considered the dictionary obtained by combining two wavelet bases of different type: the Daubechies 8 wavelet basis and the Haar basis , with the overall dictionary size . In our notations, and are the scaling functions, while and are the wavelets functions of the Daubechies and Haar bases respectively with the initial resolution level .

In order to investigate the behavior of the estimators, we considered various sparsity and noise levels. In particular, we used four test functions, presented in Figure 2, that correspond to different sparsity scenarios:

1. (high sparsity)

2. (moderate sparsity)

3. (low sparsity)

4. is the well-known HeaviSine function from Donoho & Johnstone (1994) (uncontrolled sparsity)

For each test function, we used three different values of that were chosen to ensure a signal-to-noise ratios , where .

The accuracy of each estimator was measured by its relative integrated error:

 R(^f)=∥f−^f∥2/∥f∥2

Since the model selection estimator involves minimizing a cost function of the form over the entire model space of a very large size, we used a Simulated Annealing (SA) stochastic optimization algorithm for an approximate solution. The SA algorithm is a kind of a Metropolis sampler where the acceptance probability is “cooled down” by a synthetic temperature parameter (see Brémaud 1999, Chapter 7, Section 8). More precisely, if is a solution at step of the algorithm, at step a tentative solution is selected according to a given symmetric proposal distribution and it is accepted with probability

 a(M∗,M(r))=min{1,exp(−π(M∗)−π(M(r))T(r))}. (21)

where is a temperature parameter at step . The expression (21) is motivated by the fact that while is always accepted if , it can still be accepted even if in spite of being worse than the current one. The chance of acceptance of for the same value of diminishes at every step as the temperature decreases with . The law that reduces the temperature is called the cooling schedule, in particular, here we choose .

In this paper we adopted the classical symmetric uniform proposal distribution and selected a starting solution according to the following initial probability

 p(j)=Cexp{⟨ψj,y⟩2−c∥ψj∥2},for j=1,…,p (22)

where and is the normalizing constant. Observe that the argument in the exponent in (22) is the difference of and where the first term is the squared -th empirical coefficient while the second term is the increase in the variance due to the addition of the -th dictionary function. Hence, the prior is more likely to choose dictionary functions with small variances that are highly correlated to the true function .

Thus, the adopted SA procedure can be summarized as follows:

• generate a random number . Set

• generate a starting solution with by sampling indices according to the probability given by equation (22)

• repeat for

1. generate a variable

2. if propose

else

propose

3. with probability given in equation (21) assign , otherwise

4. update the temperature parameter

While various stopping criteria could be used in the SA procedure, we found to be sufficient for obtaining a good approximation of the global minimum in (9). Once the algorithm is terminated, we evaluated , where was the “best” model in the chain of models generated by SA algorithm.

Similarly, the Q-aggregation estimator involves computationally expensive aggregation of estimators over the entire model space . We, therefore, approximated it by aggregating over the subset of the last 50 “visited” models in the SA chain, i.e. with being a solution of (18).

For , and we also considered the oracle projection estimator based on the true model. In addition, we compared the proposed estimators with the Lasso-based estimator of Pensky (2016), where the vector of coefficients is a solution of the following optimization problem

 ^θ=argminθ⎧⎪⎨⎪⎩∥∥ ∥∥p∑j=1θjϕj∥∥ ∥∥2−2p∑j=1θj⟨y,ψj⟩+λp∑j=1|θj|∥ψj∥2⎫⎪⎬⎪⎭,

and is a tuning parameter.

The tuning parameters for and , were chosen by minimizing the error on a grid of possible values. To reduce heavy computational costs we used the same of for all 50 aggregated models used for calculating .

Figure 1 presents the boxplots of over 100 independent runs for (for and ), , and . Performances of all estimators deteriorate as SNR decreases especially for the less sparse test functions. The estimators and always outperform and, as it is expected from our theoretical statements, yields somewhat better results than . We expect that the differences in precisions of and would be more significant if we carried out aggregation over a larger portion of the model space than the last 50 visited models. Figure 2 illustrates these conclusion by displaying examples of the estimators for . We should also mention that estimator was usually more sparse than .

## Acknowledgments

Marianna Pensky was partially supported by National Science Foundation (NSF), grants DMS-1407475 and DMS-1712977.

## References

• [1] Abramovich, F. and Grinshtein, V. (2010). MAP model selection in Gaussian regression. Electr. J. Statist. 4, 932–949.
• [2] Abramovich F. and Silverman B.W. (1998). Wavelet decomposition approaches to statistical inverse problems. Biometrika 85, 115–129.
• [3] Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. in Second International Symposium on Information Theory. (Eds. B.N. Petrov and F. Czáki). Akademiai Kiadó, Budapest, 267–281. Birgé, L. and Massart, P. (2001). Gaussian model selection. J. Eur. Math. Soc. 3, 203–268.
• [4] Birgé, L. and Massart, P. (2007). Minimal penalties for Gaussian model selection. Probab. Theory Relat. Fields 138, 33–73.
• [5] Brémaud, P. (1999). Markov Chains. Text in Applied Mathematics. Springer
• [6] Cavalier, L. (2012). Inverse problems in staistics. Inverse Problems and High Dimensional Estimation (Eds. Alquier, P., Gautier, E. and Gilles, G.), Lecture Notes in Statistics 203, Springer, 3–96.
• [7] Cavalier, L. and Golubev, Yu. (2006) Risk hull method and regularization by projections of ill-posed inverse problems. Ann. Statist. 34, 1653–1677.
• [8] Cavalier, L., Golubev, G.K., Picard, D. and Tsybakov, A.B. (2002). Oracle inequalities for inverse problems. Ann. Statist. 30, 843–874.
• [9] Cavalier, L. and Tsybakov, A. (2002). Sharp adaptation for inverse problems with random noise, Probab. Theory Related Fields 123, 323-â354.
• [10] Chen, S., Donoho, D.L. and Saunders, M. A. (1999). Atomic decomposition by basic pursuit. SIAM J. Scient. Comput. 20, 33–61.
• [11] Dai, D., Rigolette, P. and Zhang, T. (2012) Deviation optimal leaning using greedy Q-aggregation. Ann. Statist. 40, 1878–1905.
• [12] Dalalyan, A. and Salmon, J. (2012) Sharp oracle inequalities for aggregation of affine estimators. Ann. Statist. 40, 2327–2355.
• [13] Donoho, D.L. (1995). Nonlinear solutions of linear inverse problems by wavelet-vaguelette decomposition. Appl. Comput. Harmon. Anal. 2, 101â-126.
• [14] Donoho, D.L. and Elad, M. (2003). Optimally sparse representation in general (nonorthogonal) dictionaries via minimization. PNAS 100, 2197–2202.
• [15] Donoho, D.L. and Johnstone, I.M. (1994). Ideal spatial adaptation by wavelet shrinkage. Biometrika 81, 425–455.
• [16] Foster, D.P. and George, E.I. (1994) The risk inflation criterion for multiple regression. Ann. Statist. 22, 1947–1975.
• [17] Johnstone, I.M., Kerkyacharian, G., Picard, D. and Raimondo, M. (2004) Wavelet decomposition in a periodic setting. J. R. Statist. Soc. B 66, 547–573 (with discussion).
• [18] Johnstone, I.M. and Silverman, B.W. (1990) Speed of estimation in positron emission tomography and related inverse problems. Ann. Statist. 18, 251–280.
• [19] Kalifa, J. and Mallat, S. (2003) Thresholding estimators for linear inverse problems and deconvolution. Ann. Statist. 31, 58–109.
• [20] Leung, G. and Barron, A.R. (2006) Information theory and mixing least-squares regressions. IEEE Trans. Inf. Theory 52, 3396–3410.
• [21] Mallat, S. and Zhang, Z. (1993). Matching pursuit with time-frequency dictionaries. IEEE Trans. in Signal Proc. 41, 3397–3415.
• [22] Pensky, M. (2016) Solution of linear ill-posed problems using overcomplete dictionaries. Ann. Statist. 44, 1739–1754.
• [23] Rigollet, P. and Tsybakov, A. (2011) Exponential screening and optimal rates of sparse estimation. Ann. Statist. 39, 731–771.
• [24] Rigollet, P. and Tsybakov, A. (2012) Sparse estimation by exponential weighting. Statist. Science 27, 558–575.
• [25] Stephens, M. (2000) Bayesian Analysis of Mixture Models with an Unknown Number of Components- An Alternative to Reversible Jump Methods. The Annals of Statistics 28(1), 40–74
• [26] Verzelen, N. (2012). Minimax risks for sparse regressions: Ultra-high dimensionals phenomenon. Electr. J. Statist. 6, 38–90.

## Appendix

The proofs of the main results are based on the following auxiliary lemmas.

###### Lemma 1.

For any ,

 P(supM∈M{∥ΨTMε∥2−2σ2∥ΨM∥2F(lnp+x)}≤0)≥1−√2π e−x
###### Proof.

For any model , , where . By Mill’s ratio

 P((ψTjε)2>2σ2∥ψj∥2(lnp+x))≤√2π p−1e−x

for any . Then,

 P(⋂j∈M{(ψTjε)2−2σ2∥ψj∥2(lnp+x)≤0})≥1−∑j∈MP((ψTjε)2−2σ2 ∥ψj∥2(lnp+x)>0)≥1−√2π e−x

and, therefore,

 P(supM∈M{∥ΨTMε∥2−2σ2∥ΨM∥2F(lnp+x)≤0})≥P(⋂M∈M⋂j∈M{(ψTjε)2−2σ2 ∥ψj∥2(lnp+x)≤0})=P(p⋂j=1{(ψTjε)2−2σ2 ∥ψj∥2(lnp+x)≤0})≥1−√2π p−1e−x

###### Lemma 2.

For any , one has

 Tr(ΨTM1∪M2ΨM1∪M2) ≤ Tr(ΨTM1ΨM1)+Tr(ΨTM2ΨM2) (23) λmax(ΨTM1∪M2ΨM1∪M2) ≤ 2[λmax(ΨTM1ΨM1)+λmax(ΨTM2ΨM2)] (24)
###### Proof.

Inequality (23) follows from

 Tr(ΨTM1∪M2ΨM1∪M2)=∑j∈M1∪M2∥ψj∥2≤Tr(ΨTM1ΨM1)+% Tr(ΨTM2ΨM2)

In order to prove inequality (24), observe that

 λmax(ΨTM1∪M2ΨM1∪M2) = maxx∈R|M1+M2|∥ΨM1xM1+ΨM2∖M1xM2∖M1∥2∥x∥2 ≤ 2maxx∈R|M1+M2|[∥ΨM1xM1∥2∥x∥2+∥ΨM2xM2∥2∥x∥2]

which implies (24). ∎

###### Lemma 3.

If where is defined in (16), then, for any function

 ∑M∈M^θM∥~f−^fM∥2=∥~f−^f^θ∥2+∑M∈M^θM∥^f^θ−^fM∥2. (25)
###### Proof.

Note that for any one has

 ∑M∈M^θM∥~f−^fM∥2 = ∑M∈M^θM∥~f−^f^θ∥2+∑M∈M^θM∥^fM−^f^θ∥2 + 2⟨~f−^f^θ,∑M∈M^θM(^f^θ−^fM⟩.

Now validity of (25) follows from the fact that , so that the scalar product term in the last identity is equal to zero. ∎

###### Lemma 4.

Let be defined in formula (5) and be an arbitrary fixed model. Denote and

 Δ0=2σ⟨ξ,^f^θ−^fM0⟩+σ2ν2r∥ΨTM0ε∥2−32UM0−∑M∈M^θMUM−12∑M∈M^θM∥^fM−^fM0∥2. (26)

Then, with probability at least , one has . Moreover, if the set of models is restricted to , then

 E(Δ0)≤UM0[34(δ+1)lnp−32]+4σ2γ2ν2rnp−δ. (27)
###### Proof.

Note that due to , one has

 2σ⟨ξ,^f^θ−^fM0⟩ = 2σ∑M∈M^θM⟨ξ,^fM−^fM0⟩.

Recall that , hence, applying the inequality for any positive and , one obtains

 2σ|⟨ξ,^f^θ−^fM0⟩| = 2σ|ξTHM∪M0(HM−HM0)z| ≤ 2σ2ν−2r∥ΨTM∪M0ε∥2+0.5∥^fM−^fM0∥2.

Combining the last two inequalities, taking into account that and plugging them into (26), obtain

 Δ0≤2σ2ν2r∑M∈M^θM∥ΨTMε∥2+3σ2ν2r∥ΨTM0ε∥2−32UM0−∑M∈M^θMUM. (28)

Applying Lemma 1 with , expressing via and taking into account that , we derive that on the set with on which

 supM∈M{∥ΨTMε∥2−2σ2∥ΨM∥2Flnp(1+δ)}≤0.

In order to derive inequality (27), plugging in the expression for and