
# Regression from Dependent Observations

Constantinos Daskalakis
EECS & CSAIL, MIT
costis@csail.mit.edu
Supported by NSF awards CCF-1617730 and IIS-1741137, a Simons Investigator Award, a Google Faculty Research Award, and an MIT-IBM Watson AI Lab research grant.

Nishanth Dikkala
EECS & CSAIL, MIT
nishanthd@csail.mit.edu

Ioannis Panageas
ISTD & SUTD
ioannis@sutd.edu.sg
Supported by SRG ISTD 2018 136. Part of this work was done while the authors were visiting Northwestern University for a spring program in Econometrics.
###### Abstract

The standard linear and logistic regression models assume that the response variables are independent, but share the same linear relationship to their corresponding vectors of covariates. The assumption that the response variables are independent is, however, too strong. In many applications, these responses are collected on nodes of a network, or some spatial or temporal domain, and are dependent. Examples abound in financial and meteorological applications, and dependencies naturally arise in social networks through peer effects. Regression with dependent responses has thus received a lot of attention in the Statistics and Economics literature, but there are no strong consistency results unless multiple independent samples of the vectors of dependent responses can be collected from these models. We present computationally and statistically efficient methods for linear and logistic regression models when the response variables are dependent on a network. Given one sample from a networked linear or logistic regression model and under mild assumptions, we prove strong consistency results for recovering the vector of coefficients and the strength of the dependencies, recovering the rates of standard regression under independent observations. We use projected gradient descent on the negative log-likelihood, or negative log-pseudolikelihood, and establish their strong convexity and consistency using concentration of measure for dependent random variables.

## 1 Introduction

Linear and logistic regression are perhaps the two most prominent models in Statistics. In their most standard form, these models postulate that a collection of response variables y1, …, yn, which are scalar and binary respectively, are linearly related to a collection of covariates x1, …, xn ∈ ℝ^d through some coefficient vector θ, as follows:

• in vanilla linear regression it is assumed that:

• for all i: yi = θ⊤xi + ϵi,
where ϵi ∼ N(0, σ²); and

• ϵ1, …, ϵn are independent.

• in vanilla logistic regression it is assumed that:

• for all i and σi ∈ {−1, 1}: Pr[yi = σi] = 1/(1 + exp(−2(θ⊤xi)σi)); and

• y1, …, yn are independent.

It is well-known that, given examples (x1, y1), …, (xn, yn), where the yi’s are sampled independently as specified above, the coefficient vector θ can be estimated to within ℓ2-error O(√(d/n)) in both models, under mild assumptions about the smallest singular value of the matrix whose rows are the xi⊤’s. In both cases, this can be achieved by solving the corresponding Maximum Likelihood Estimation (MLE) problem, whose objective is concave. In fact, in linear regression, the optimum of the likelihood has a closed form, which is the familiar least-squares estimate.
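As a concrete reference point, the independent case can be sketched in a few lines of NumPy. The dimensions and synthetic data below are purely illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 3
theta = np.array([1.0, -2.0, 0.5])        # true coefficient vector

X = rng.normal(size=(n, d))                # rows are the covariate vectors x_i
y = X @ theta + rng.normal(size=n)         # independent Gaussian responses

# Closed-form optimum of the Gaussian MLE: the least-squares estimate.
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(np.linalg.norm(theta_hat - theta))   # error of order sqrt(d/n)
```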

The assumption that the response variables are independent is, however, too strong. In many applications, these variables are observed on nodes of a network, or some spatial or temporal domain, and are dependent. Examples abound in financial and meteorological applications, and dependencies naturally arise in social networks through peer effects, whose study has recently exploded in topics as diverse as criminal activity (see e.g. [24]), welfare participation (see e.g. [2]), school achievement (see e.g. [36]), participation in retirement plans [18], and obesity (see e.g. [39, 11]). A prominent dataset where peer effects have been studied is the data collected by the National Longitudinal Study of Adolescent Health, a.k.a. the AddHealth study [26]. This was a major national study of students in grades 7-12, who were asked to name up to 10 of their friends, so that friendship networks could be constructed, and to answer hundreds of questions about their personal and school life; the study also recorded information such as the age, gender, race, socio-economic background, and health of the students. Estimating models that combine peer and individual effects to predict behavior in such settings has been challenging; see e.g. [31, 5].

### 1.1 Modeling Dependence

In this paper, we generalize the standard linear and logistic regression models to capture dependencies between the response variables, and show that if the dependencies are sufficiently weak, then both the coefficient vector θ and the strength of the dependencies among the response variables can be estimated to within error O(√(d/n)). To define our models, we drop the assumption that the response variables y1, …, yn are independent, but maintain the form of the conditional distribution that each response variable yi takes, conditioning on a realization of the other response variables y−i. In particular, for all i, conditioning on a realization y−i of all other variables, the conditional distribution of yi:

• (in our linear regression model) is a Gaussian, as in standard linear regression, except that the mean of this Gaussian may depend on both xi and, in some restricted way, the realizations yj and the covariates xj, for j ≠ i;

• (in our logistic regression model) takes value ±1 with probability computed by the logistic function, as in standard logistic regression, except that the logistic function is evaluated at a point that may depend on both xi and, in some restricted way, the realizations yj and covariates xj, for j ≠ i.

To capture network effects we parametrize the afore-described general models through a (known) interaction matrix A and an (unknown) strength of interactions β, as follows.

• In linear regression with dependent samples we assume that:

• y ∼ N(Xθ, (D + βA)−1), where X is the matrix whose rows are the xi⊤ and D is a diagonal matrix such that D + βA ≻ 0.

• Or equivalently, for all i, conditioning on a realization y−i of the remaining response variables:

 yi=θ⊤xi+ϵi, (1)

where ϵi is a Gaussian whose mean and variance, conditioning on y−i, are given by the Schur-complement formula for conditionals of a multivariate Gaussian: the mean of ϵi is −(1/Dii)∑j≠i βAij(yj − θ⊤xj) and its variance is 1/Dii. By the transformation ϵi = −(1/Dii)∑j≠i βAij(yj − θ⊤xj) + ϵ′i we get that

 yi=θ⊤xi−1Dii⎡⎣∑j≠iβAij(yj−θ⊤xj)⎤⎦+ϵ′i, (2)

with ϵ′i ∼ N(0, 1/Dii).

• Interpretation: The conditional expectation of yi is additively perturbed from its unconditional expectation θ⊤xi by a weighted average, according to the weights βAij/Dii, of how much the other responses yj are perturbed from their expectations θ⊤xj in realization y−i.

• Remark 2: The model proposed in Eq. (2) falls in the realm of auto-regressive models studied by Manski [31] and Bramoullé et al. [5], where it is shown that the model can be identified under conditions on the interaction matrix A. In contrast to our work, one of the conditions imposed on A is that it can be partitioned into many identical blocks (i.e. the weighted graph defined by A has many identical connected components). Thus the response variables cluster into multiple groups that are independently and identically sampled, given the covariates. Instead, we want to identify θ and β even when A corresponds to one strongly connected graph, and therefore there is no independence to be exploited.

• In logistic regression with dependent samples it is assumed that:

• For all i and σi ∈ {−1, 1}, conditioning on a realization y−i of the remaining response variables:

 Pr[yi=σi] =11+exp(−2(θ⊤xi+β∑j≠iAijyj)σi). (3)
• Interpretation: The probability that the conditional distribution of yi assigns to σi is determined by the logistic function applied to θ⊤xi + β∑j≠i Aij yj instead of θ⊤xi, i.e. the point of evaluation is shifted by the weighted average, according to the weights βAij, of the other responses yj in realization y−i.

• Remark 3: It is easy to see that the joint distribution of random variables y1, …, yn satisfying the requirements of Eq. (3) is an instance of the Ising model; see Eq. (6). In this Ising model each variable yi has external field θ⊤xi, and β controls the inverse temperature of the model. The Ising model was originally proposed to study phase transitions in spin systems [27], and has since found myriad applications in diverse research disciplines, including probability theory, Markov chain Monte Carlo, computer vision, theoretical computer science, social network analysis, game theory, and computational biology [29, 7, 21, 16, 22, 19, 35].

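Since the conditionals in Eq. (3) are explicit, one can draw (approximate) samples from the networked logistic model with Gibbs sweeps. The following is a minimal sketch under assumed inputs (a feature matrix `X`, an interaction matrix `A` with zero diagonal, and parameters `theta`, `beta`); it is not the paper's estimation algorithm.

```python
import numpy as np

def gibbs_sample(X, A, theta, beta, sweeps=300, rng=None):
    """Approximate sample y in {-1,+1}^n whose conditionals follow Eq. (3)."""
    rng = rng or np.random.default_rng(0)
    n = X.shape[0]
    h = X @ theta                              # external fields theta^T x_i
    y = rng.choice([-1.0, 1.0], size=n)
    for _ in range(sweeps):
        for i in range(n):
            m_i = A[i] @ y                     # sum_{j != i} A_ij y_j (A_ii = 0)
            p_plus = 1.0 / (1.0 + np.exp(-2.0 * (h[i] + beta * m_i)))
            y[i] = 1.0 if rng.random() < p_plus else -1.0
    return y
```

Each update resamples one coordinate from its exact conditional, so the chain's stationary distribution is the joint model; the number of sweeps needed for mixing is not addressed here.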
A particularly simple instance of our model arises when all covariates are single dimensional and identical. In this case, our model only has two free parameters, and this setting has been well-studied. [12] consider the consistency of maximum likelihood estimation in this setting. More recent work of Chatterjee [10], Bhattacharya and Mukherjee [4], and Ghosal and Mukherjee [23] has identified conditions on the interaction matrix under which these parameters can be identified. Our work generalizes these works to the case of multi-dimensional covariates.

Now let us state our results for the above regression models with dependent response variables. We are given a set of observations (x1, y1), …, (xn, yn), where the covariates xi are deterministic and the response variables yi are assumed to have been sampled according to either of the models above, for a given interaction matrix A and an unknown scalar β and coefficient vector θ. Given our observations, we are interested in estimating θ and β. It is important to stress that we only have one sample of the variables y1, …, yn. In particular, we cannot redraw the response variables many times and derive statistical power from the independence of the samples. This is motivated by our application to network-collected data, where we often have no access to independent snapshots of the responses at the nodes of the network. From a technical standpoint, estimating from a single sample distinguishes our work from other works in the literature on auto-regressive models and graphical models, and requires us to deal with the challenges of concentration of measure of functions of dependent random variables.

Our main results are stated as Theorems 3.1, for logistic regression, and 4.1, for linear regression. In both cases, the parameters can be estimated to within error O(√(d/n)), the dependence of the rate on n matching that of vanilla logistic regression and vanilla linear regression respectively.² These results hold under the assumptions of Table 1. We note that the assumptions on θ, β, and the covariates are standard, even in the case of vanilla regression. Moreover, the bounds on the norms of A have been shown to be necessary for logistic regression by [4, 23]. And the minimum singular value condition for the matrix A is mild, and holds for various ensembles of A; see e.g. Corollary 4.1, shown using Ky Fan inequalities [20].

²The dependence on d in the rate is √d for both vanilla logistic regression and our result; hence we achieve the optimal rate with respect to all parameters in the setting of logistic regression. For linear regression with dependent samples, the dependence of our rate on d is off from the rate achieved for independent samples by a factor of √d.

##### Proof Overview:

The estimation algorithms in both Theorem 3.1 and Theorem 4.1 are instances of Projected Gradient Descent (PGD). In the linear case (Theorem 4.1), PGD is applied to the negative log-likelihood of the observations y1, …, yn. However, the log-likelihood is not convex in the original parameters, so we perform a re-parametrization of the model, in fact an over-parametrization, that renders it convex. Showing strong convexity of the re-parametrized negative log-likelihood requires some mild linear algebra. It must also be established that, despite the over-parametrization, the optimum collapses to the right dimensionality and can be used to recover the original parameters. A more complete overview of the approach is presented in the beginning of Section 4.

In the logistic case (Theorem 3.1), we do not run PGD on the negative log-likelihood but on the negative log-pseudolikelihood. The pseudolikelihood is the product of the conditional probabilities of each response yi, conditioning on all other responses y−i. The negative log-pseudolikelihood is trivially convex, but we need to establish that its optimum is close to the true parameters and also that it is strongly convex. We show both properties via concentration results for functions of dependent random variables. To show that the maximum of the pseudolikelihood is close to the true parameters we use exchangeable pairs, adapting [9]. To show that it is strongly convex we show additional properties of A which are implied by our assumptions. Combining these with a new concentration inequality, we obtain the desired bound. A more complete overview of the approach is presented in Section 3.2.

##### Other Related Work:

We have already reviewed the work that is most relevant to ours from the Economics, Probability Theory, and Statistics literature. The Econometrics and Statistics literature on the theory and applications of regression with dependent observations is discussed further in [30]. There is another strand of literature studying generalization bounds that can be attained when learning from sequences of dependent observations; see e.g. [33, 32, 41, 38, 34, 1]. These works assume, however, that the sequence of observations is a stationary process, which does not hold in our models, and they impose strong mixing conditions on that sequence. Finally, we note that generalized least squares, which accommodates dependencies among the response variables, cannot be applied directly to our linear regression setting to estimate θ, because the covariance matrix of our response variables depends on the parameter β, which is unknown and thus needs to be disentangled before bounding the error in the estimation of θ.

In the case of logistic regression, there has been a lot of work showing that under certain high-temperature conditions on the Ising model (which are similar to the assumptions we make in our paper), one can perform many statistical tasks such as learning, testing and sampling of Ising models efficiently [28, 14, 13, 15, 25, 17].

## 2 Preliminaries

We use bold letters such as x to denote vectors and capital letters such as A to denote matrices. All vectors are assumed to be column vectors, i.e. x = (x1, …, xd)⊤. We will refer to Aij as the (i, j) entry of matrix A. We will use the following matrix norms. For an n × n matrix A,

 ∥A∥∞ = max_{j∈[n]} ∑_{i=1}^{n} |Aij|,  ∥A∥F = √(∑_{i=1}^{n} ∑_{j=1}^{n} Aij²).

When A is a symmetric matrix we have that ∥A∥∞ = ∥A∥1. We use λ to denote eigenvalues of a matrix and σ to denote singular values. λmin refers to the smallest eigenvalue and λmax to the largest, and similar notation is used for the singular values as well.

We will say an estimator θ̂n is consistent with rate 1/√n (or equivalently, is √n-consistent) with respect to the true parameter θ0 if there exists an integer n0 and a constant C such that for every n ≥ n0, with probability at least 1 − o(1),

 ∥θ̂n − θ0∥2 ≤ C/√n.

We utilize the following two well-known examples of graphical models to characterize dependencies in our logistic and linear regression models respectively.

1. Ising Model: Given an unweighted undirected graph G = (V, E) with adjacency matrix A, an external field h ∈ ℝ^V, and an assignment σ ∈ {−1, 1}^V, an Ising model is the following probability distribution on the configurations of the system:

 Pr[y = σ] = exp(∑_{v∈V} hvσv + βσ⊤Aσ) / ZG(β, h) (4)

where

 Z(G)=∑~σexp(∑v∈Vhv~σv+β~σ⊤A~σ)

is the partition function of the system (or normalization factor). Moreover, the term ∑_{v∈V} hvσv is called the external field. It can be observed that, without loss of generality, we can restrict the matrix A to have zeros on its diagonal.

2. Gaussian Graphical Model: Let G = (V, E) be an undirected graph with V = [n]. A random vector x ∈ ℝ^n is said to be distributed according to the (undirected) Gaussian Graphical model with graph G if x has a multivariate Gaussian distribution N(μ, Σ) with

 (Σ−1)ij=0∀(i,j)∉E, (5)

where the density function of x is

 fμ,Σ(x) = exp(−(1/2)(x − μ)⊤Σ−1(x − μ)) / ((2π)^{n/2} det(Σ)^{1/2})

under the condition that Σ is positive definite (Σ−1 is also known as the precision matrix).
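For intuition, the Ising distribution of Eq. (4) can be materialized exactly on a tiny graph by brute-force enumeration; the random A, h, and β below are placeholders, chosen only to exercise the formula.

```python
import itertools
import numpy as np

n = 4
rng = np.random.default_rng(1)
A = rng.random((n, n)); A = (A + A.T) / 2
np.fill_diagonal(A, 0.0)                  # wlog zero diagonal
h = rng.normal(size=n)                    # external fields h_v
beta = 0.3

configs = np.array(list(itertools.product([-1, 1], repeat=n)), dtype=float)
energies = configs @ h + beta * np.einsum('ci,ij,cj->c', configs, A, configs)
Z = np.exp(energies).sum()                # partition function Z_G(beta, h)
probs = np.exp(energies) / Z
print(probs.sum())                        # 1.0 up to floating point
```

Enumeration costs 2^n, so this is only a sanity check; the intractability of Z for large n is precisely what motivates the pseudolikelihood approach later in the paper.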

### 2.1 Some Useful Lemmas from Literature

Weyl’s inequalities are useful to understand how the spectra of symmetric matrices change under addition. We state them here for reference.

###### Lemma 2.1 (Weyl’s Inequalities).

Let A, B, and C be three symmetric n × n matrices with real entries such that C = A + B. Let λ1(A) ≥ ⋯ ≥ λn(A), λ1(B) ≥ ⋯ ≥ λn(B), and λ1(C) ≥ ⋯ ≥ λn(C) be their eigenvalues respectively. Then for all i ∈ [n], λi(A) + λn(B) ≤ λi(C) ≤ λi(A) + λ1(B).
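A quick numerical sanity check of these inequalities on a random instance (NumPy's `eigvalsh` returns eigenvalues in ascending order, so the bounds read λi(A) + λmin(B) ≤ λi(A+B) ≤ λi(A) + λmax(B)):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6
A = rng.normal(size=(n, n)); A = (A + A.T) / 2
B = rng.normal(size=(n, n)); B = (B + B.T) / 2

eA, eB, eC = (np.linalg.eigvalsh(M) for M in (A, B, A + B))  # ascending order

assert np.all(eA + eB[0] <= eC + 1e-9)    # lambda_i(A) + lambda_min(B) <= lambda_i(C)
assert np.all(eC <= eA + eB[-1] + 1e-9)   # lambda_i(C) <= lambda_i(A) + lambda_max(B)
```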

We will use the following concentration inequality which is standard in literature.

###### Theorem 2.1 ([40], Remark 5.40).

Assume that X is an n × d matrix whose rows are independent sub-gaussian random vectors in ℝ^d with second moment matrix Σ. Then for every t ≥ 0, the following inequality holds with probability at least 1 − 2exp(−ct²):

 ∥∥∥1nX⊤X−Σ∥∥∥2≤max(δ,δ2)

with δ = C√(d/n) + t/√n, where c, C > 0 depend only on the sub-gaussian norm of the rows.

###### Remark 2.1.

By choosing t to be Θ(√ln n), it follows that with probability 1 − 1/poly(n) we get that

 ∥(1/n)∑_{i=1}^n xixi⊤ − Q∥2 is O(√(ln n / n)),

from which it follows that λmin((1/n)∑_{i=1}^n xixi⊤) is at least λmin(Q) − O(√(ln n / n)) with probability 1 − 1/poly(n) (by Weyl's inequality).
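The remark can be seen empirically: with i.i.d. standard Gaussian rows (so the second moment matrix Q is the identity), the operator-norm error of the empirical second moment matrix shrinks at roughly the advertised rate. The dimensions below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
d, Q = 4, np.eye(4)
errs = []
for n in (1_000, 10_000, 100_000):
    X = rng.normal(size=(n, d))            # independent sub-gaussian rows
    errs.append(np.linalg.norm(X.T @ X / n - Q, 2))
print(errs)                                # decreasing, roughly like 1/sqrt(n)
```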

###### Lemma 2.2 (Useful Inequalities on Singular Values).

The following inequalities hold:

1. Let be a matrix. It holds that (see [20]).

2. Let be matrices. It holds that (folklore).

3. Let be matrices, then (folklore).

###### Lemma 2.3 (Expectation and Variance of a Quadratic form of a Gaussian Distribution).

Let z ∼ N(μ, Σ) and consider the quadratic form f(z) = z⊤Az + b⊤z, for a symmetric matrix A. It holds that

 Ez[f(z)]=tr(AΣ)+f(μ),Vz[f(z)]=2tr(AΣAΣ)+4μ⊤AΣAμ+4b⊤ΣAμ+b⊤Σb.
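The expectation formula can be checked by Monte Carlo; the instance below is arbitrary (A is symmetrized, and Σ is built as LL⊤ plus a ridge to be a valid covariance matrix).

```python
import numpy as np

rng = np.random.default_rng(4)
d = 3
A = rng.normal(size=(d, d)); A = (A + A.T) / 2   # symmetric quadratic form
b, mu = rng.normal(size=d), rng.normal(size=d)
L = rng.normal(size=(d, d))
Sigma = L @ L.T + np.eye(d)                      # a valid covariance matrix

f = lambda z: z @ A @ z + b @ z
expected = np.trace(A @ Sigma) + f(mu)           # E_z[f(z)] per the lemma

Z = rng.multivariate_normal(mu, Sigma, size=500_000)
mc = np.mean(np.einsum('ni,ij,nj->n', Z, A, Z) + Z @ b)
print(abs(mc - expected))                        # small Monte Carlo error
```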

Table 1 lists the assumptions under which our main theorems for logistic and linear regression hold.

## 3 Logistic Regression with Dependent Data

In this section we look at the problem of logistic regression with dependent data.

### 3.1 Our model

We are interested in a generalization of the Ising model on a graph G = (V, E) with V = [n], where each vertex i has a feature vector xi ∈ ℝ^d. Moreover, there is an unknown parameter pair (θ, β), and the corresponding probability distribution is the following:

 Pr[y = σ] = exp(∑_{i=1}^n (θ⊤xi)σi + βσ⊤Aσ) / Z(G), (6)

where A is a symmetric matrix with zeros on the diagonal. Given one sample y and knowledge of the matrix A and the feature vectors x1, …, xn, we would like to infer θ and β.

We now study some conditions under which we can attain consistent estimates of the parameters of the model. Combined with some standard assumptions on the data-generating process of the feature vectors all our assumptions are listed in Table 1.

###### Theorem 3.1 (Logistic Regression with Dependent Samples).

Consider the model of (6). The Maximum Pseudo-Likelihood Estimate (MPLE) is consistent with rate 1/√n as long as A and the features satisfy the conditions of Column 2 in Table 1. Formally, for sufficiently large n,

 ∥∥(^θMPL,^βMPL)−(θ0,β0)∥∥2≤Od(√1n)

with probability 1 − o(1). Moreover, we can compute a vector (θ̃, β̃) with ∥(θ̃, β̃) − (θ̂MPL, β̂MPL)∥2 = O(1/√n) in polynomially many iterations of projected gradient descent (Algorithm in Section 5), where each iteration takes polynomial time, again with probability 1 − o(1).

###### Remark 3.1 (Necessity of an Upper Bound on ∥A∥∞ and boundedness of β0).

If ∥A∥∞ scales with n, then no consistent estimator might exist. This is because the peer effects, through βA, will dominate the outcome of the samples and nullify the signal coming from the covariates. Similarly, one requires β0 to be bounded as well, to preserve some signal and enable recovery of θ0.

###### Remark 3.2 (Necessity of the Lower Bound on ∥A∥F).

It was shown in [4] (Corollary 2.4 (b)) and [23] (Theorem 1.13) that when the lower bound on ∥A∥F is violated, there are specific examples where it is impossible to get consistent estimators for β0. The first instance is the Curie-Weiss model (Aij = 1/n for all i ≠ j). Note that ∥A∥F = O(1) in this case. The second instance is dense random graphs, i.e. G(n, p) where p is a constant independent of n, and A is chosen to be the adjacency matrix scaled down by the average degree of the graph, i.e. A = Adj(G)/(np).

###### Remark 3.3.

If the parameter β0 is known, the lower bound on ∥A∥F is not necessary for consistency of the MPL estimate θ̂MPL. For instance, consider the independent case where β0 = 0. Then, to recover θ0, we do not need any lower bound on ∥A∥F.

###### Remark 3.4.

Our approach achieves a rate of consistency if .

##### Example Instantiations of Theorem 3.1

Two example settings where the conditions required for Theorem 3.1 are satisfied:

• A is the adjacency matrix of a graph with bounded degree, scaled down so that ∥A∥∞ ≤ 1.

• A is the (suitably scaled) adjacency matrix of a random d-regular graph.
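As a concrete check of the first setting, the cycle graph (degree 2) scaled by 1/2 satisfies the ∥A∥∞ bound while its Frobenius norm grows like √n; the helper below is a hypothetical illustration, not code from the paper.

```python
import numpy as np

def scaled_cycle(n):
    """Adjacency matrix of the n-cycle, scaled so every row sums to 1."""
    A = np.zeros((n, n))
    for i in range(n):
        A[i, (i - 1) % n] = A[i, (i + 1) % n] = 0.5  # degree 2, scaled by 1/2
    return A

A = scaled_cycle(100)
print(np.abs(A).sum(axis=1).max())      # ||A||_inf = 1.0
print(np.linalg.norm(A, 'fro'))         # sqrt(n/2) ~ 7.07, i.e. Omega(sqrt(n))
```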

### 3.2 Technical Overview

Estimation in Ising models is a well-studied problem which offers a lot of interesting technical challenges. A first approach one considers is maximum likelihood estimation. However, the intractability of computing the partition function poses a serious obstacle for the MLE. Even if one could approximate the partition function, proving consistency of the MLE is a hard task. To circumvent these issues we take a maximum pseudo-likelihood approach. This was proposed by Julian Besag [3] and analyzed for inference problems on Ising models by Chatterjee [10] and others ([4], [23]). Given a sample of response variables y, let fi(θ, β, y) denote the conditional likelihood of observing yi conditioned on everyone else, y−i. The pseudo-likelihood estimator of (θ, β) is

 (^θMPL,^βMPL)=argmaxθ,βn∏i=1fi(θ,β,y). (7)

This does away with the problematic partition function and retains concavity in the parameters (θ, β). To show that the MPLE is consistent we need to show that its global optimum is close in ℓ2 distance to (θ0, β0). We achieve this by showing that two things hold simultaneously.

• The log pseudo-likelihood is strongly concave everywhere. This will tell us that the gradient of the log pseudo-likelihood quickly increases as we move away from where it is 0.

• The norm of the gradient of the log pseudo-likelihood is small when evaluated at (θ0, β0), hence implying proximity to the MPL estimates due to strong concavity.

We show that both these conditions are satisfied with high probability over the draw of our samples. Showing that the norm of the gradient is bounded involves obtaining variance bounds on two functions of the Ising model (Lemmas 3.2 and 3.3), and showing strong concavity amounts to showing a lower bound, linear in n, on a particular quadratic function (see the initial steps of the proof of Lemma 3.4). Both these properties are challenging to prove because of the dependencies between samples. To tackle the lack of independence, the proofs require a rich set of technical frameworks. In particular, to show the variance bounds we use the technique of exchangeable pairs developed by Chatterjee [8]. The boundedness of ∥A∥∞ is necessary to have these concentration results. To show strong concavity of the log pseudolikelihood we first prove some properties of the matrix A, together with an additional variance bound, again shown via exchangeable pairs. The lower bound on ∥A∥F is necessary to achieve strong concavity. Finally, we show in Section 5 that computing the MPLE can be achieved efficiently using projected gradient descent, where after each step we project back into the space restricted by the conditions of Table 1. We describe each of these steps formally now.
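The optimization step just described can be sketched generically: projected gradient ascent repeatedly moves along the gradient of a concave objective and projects back onto the feasible set. The toy objective and the Euclidean-ball projection below are stand-ins for the log pseudolikelihood and the parameter set of Table 1.

```python
import numpy as np

def projected_gradient_ascent(grad, project, x0, step=0.1, iters=500):
    """Maximize a concave objective: gradient step, then project back."""
    x = project(np.asarray(x0, dtype=float))
    for _ in range(iters):
        x = project(x + step * grad(x))
    return x

# toy usage: maximize -||x - c||^2 over the unit ball
c = np.array([2.0, 0.0])
project_ball = lambda x: x / max(1.0, np.linalg.norm(x))
x_star = projected_gradient_ascent(lambda x: -2.0 * (x - c), project_ball,
                                   np.zeros(2))
print(x_star)   # converges to [1, 0], the projection of c onto the ball
```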

### 3.3 Analyzing the Maximum Pseudolikelihood Estimator (MPLE)

We will treat terms not involving (θ, β) as constants for the purposes of our analysis. We start by analyzing the maximum pseudo-likelihood estimator. Given the feature vector xi of the i-th sample, we denote by xi,k the k-th element of xi. Let mi(y) := ∑j≠i Aijyj; the true parameters (θ0, β0) lie in the interior of the parameter set defined by the conditions of Table 1. The pseudolikelihood for a specific sample y is given by:

 PL(θ,β) := ∏_{i=1}^n exp(θ⊤xiyi + βmi(y)yi) / (exp(θ⊤xi + βmi(y)) + exp(−θ⊤xi − βmi(y))). (8)

The normalized log pseudolikelihood for a specific sample is given by:

 LPL(θ,β) := (1/n) log PL(θ,β) = −ln 2 + (1/n)∑_{i=1}^n [yiβmi(y) + yi(θ⊤xi) − ln cosh(βmi(y) + θ⊤xi)]. (9)
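Eq. (9) translates directly into code: each conditional probability exp(uiyi)/(2 cosh ui), with ui = βmi(y) + θ⊤xi, contributes exactly the summand yiui − ln cosh ui − ln 2. The function below is an illustrative sketch, assuming A has zero diagonal.

```python
import numpy as np

def log_pseudolikelihood(theta, beta, y, X, A):
    """Normalized log pseudolikelihood of Eq. (9); A has zero diagonal."""
    u = beta * (A @ y) + X @ theta          # beta*m_i(y) + theta^T x_i
    return np.mean(y * u - np.log(np.cosh(u))) - np.log(2)
```

At θ = 0, β = 0 every conditional is uniform on {−1, +1}, so the value is −ln 2 regardless of y, which makes a convenient sanity check.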

The first order conditions give:

 ∂LPL(θ̂MPL, β̂MPL)/∂β = (1/n)∑_{i=1}^n [yimi(y) − mi(y)tanh(β̂MPLmi(y) + θ̂⊤MPLxi)] = 0,
 ∂LPL(θ̂MPL, β̂MPL)/∂θk = (1/n)∑_{i=1}^n [yixi,k − xi,ktanh(β̂MPLmi(y) + θ̂⊤MPLxi)] = 0. (10)

The Hessian is given by:

 ∂²LPL(θ,β)/∂β² = −(1/n)∑_{i=1}^n m²i(y)/cosh²(βmi(y) + θ⊤xi),
 ∂²LPL(θ,β)/∂β∂θk = −(1/n)∑_{i=1}^n xi,kmi(y)/cosh²(βmi(y) + θ⊤xi),
 ∂²LPL(θ,β)/∂θl∂θk = −(1/n)∑_{i=1}^n xi,lxi,k/cosh²(βmi(y) + θ⊤xi). (11)

Writing the Hessian in a compact way we get

 H(θ,β) = −(1/n)∑_{i=1}^n (1/cosh²(βmi(y) + θ⊤xi)) XiX⊤i

where Xi := (xi,1, …, xi,d, mi(y))⊤. Thus −H(θ,β) is a positive semidefinite matrix and LPL is concave. Moreover, for parameters in the set defined by Table 1, it follows that

 (1/cosh²(B + d·M·Θ)) · ((1/n)∑_{i=1}^n XiX⊤i) ⪯ −H(θ,β) ⪯ (1/n)∑_{i=1}^n XiX⊤i. (12)
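Both the sign of the Hessian and a sandwich in the spirit of Eq. (12) (with the per-sample weights 1/cosh² used exactly) can be verified numerically; the random instance below is illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)
n, d = 50, 2
X = rng.normal(size=(n, d))
A = rng.random((n, n)); A = (A + A.T) / 2
np.fill_diagonal(A, 0.0)
A /= np.abs(A).sum(axis=1).max()           # enforce ||A||_inf <= 1
y = rng.choice([-1.0, 1.0], size=n)
theta, beta = rng.normal(size=d), 0.2

m = A @ y
Xi = np.hstack([X, m[:, None]])            # X_i = (x_i, m_i(y))
w = 1.0 / np.cosh(beta * m + X @ theta) ** 2
H = -(Xi.T * w) @ Xi / n                   # Hessian per Eq. (11)
S = Xi.T @ Xi / n                          # (1/n) sum X_i X_i^T

assert np.all(np.linalg.eigvalsh(-H) >= -1e-10)                # -H is PSD
assert np.all(np.linalg.eigvalsh(S + H) >= -1e-10)             # -H <= S
assert np.all(np.linalg.eigvalsh(-H - w.min() * S) >= -1e-10)  # w_min*S <= -H
```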
###### Remark 3.5.

Observe that λmax((1/n)∑i XiX⊤i) ≤ maxi ∥Xi∥²2 ≤ dM² + 1, assuming |xi,k| ≤ M for all i, k and ∥A∥∞ ≤ 1 (which trivially gives |mi(y)| ≤ 1). Hence LPL is an O(dM²)-smooth function, i.e. ∇LPL is O(dM²)-Lipschitz.

### 3.4 Consistency of the MPLE

Our argument for showing consistency of the MPLE uses Lemma 3.1.

###### Lemma 3.1.

Let (θ0, β0) be the true parameter. We define (θt, βt) := (θ0, β0) + t((θ̂MPL, β̂MPL) − (θ0, β0)), let D be the largest value in [0, 1] such that (θt, βt) stays in the parameter set of Table 1 for all t ∈ [0, D], and let B denote this segment (if the segment does not intersect the boundary of the parameter set, then D = 1). Then,

 ∥∇LPL(θ0, β0)∥2 ≥ D · min_{(θ,β)∈B} λmin(−H(θ,β)) · ∥(θ0 − θ̂MPL, β0 − β̂MPL)∥2 = min_{(θ,β)∈B} λmin(−H(θ,β)) · ∥(θ0 − θD, β0 − βD)∥2
###### Proof.

We drop the MPL subscript from the estimates for brevity. We set

 g(t):=(θ0−^θ,β0−^β)⊤∇LPL(θt,βt),
 g′(t)=−(θ0−^θ,β0−^β)⊤H(θt,βt)(θ0−^θ,β0−^β).

Observe that g(1) = 0, by the first-order optimality conditions at the MPLE. Since H is negative semidefinite, we have that g′(t) ≥ 0 for all t (*). It holds that

 ∥(θ0 − θ̂, β0 − β̂)∥2 · ∥∇LPL(θ0, β0)∥2 ≥ |(θ0 − θ̂, β0 − β̂)⊤∇LPL(θ0, β0)|
  = |g(1) − g(0)| = |∫₀¹ g′(t)dt|
  ≥ |∫₀ᴰ g′(t)dt| (by (*))
  ≥ D · min_{(θ,β)∈B} λmin(−H(θ,β)) · ∥(θ0 − θ̂, β0 − β̂)∥²2
  = min_{(θ,β)∈B} λmin(−H(θ,β)) · ∥(θD − θ0, βD − β0)∥2 · ∥(θ0 − θ̂, β0 − β̂)∥2

We apply Lemma 3.1 by showing

1. a concentration result for ∇LPL(θ0, β0) around 0 (Section 3.5); and

2. a (positive constant) lower bound for min_{(θ,β)∈B} λmin(−H(θ,β)) (Section 3.6).

We combine the above with the observation that D → 1 as n → ∞ (i.e., D = 1 for sufficiently large n). This is true because ∥(θ0 − θ̂MPL, β0 − β̂MPL)∥2 → 0 as n → ∞ (it is of order O(1/√n), by the promised concentration result and the lower bound). Also note that any point on the boundary of the parameter set is at a fixed positive distance from (θ0, β0), since the latter lies in the interior. Hence D < 1 would imply that ∥(θ0 − θD, β0 − βD)∥2 is bounded away from 0, which is impossible for sufficiently large n. This gives the desired rate of consistency, which we show in Section 3.7.

### 3.5 Variance Bounds using Exchangeable Pairs

In this section we state the lemmata which are required to show that the norm of the gradient of the log pseudo-likelihood is bounded at the true parameters (θ0, β0).

###### Lemma 3.2 (Variance Bound 1).

It holds that

 Eθ0,β0⎡⎣(n∑i=1yimi(y)−mi(y)tanh(β0mi(y)+θ⊤0xi))2⎤⎦≤(12+4B)n.
###### Proof.

We use the powerful technique of exchangeable pairs as introduced by Chatterjee ([8]) and employed by Chatterjee and Dembo (see [9]). First, it holds that E[yi | y−i] = tanh(β0mi(y) + θ⊤0xi). Also note that since ∥A∥∞ ≤ 1, it trivially follows that |mi(y)| ≤ 1 for all i and all y. Set

 Q(y):=∑i(yi−tanh(β0mi(y)+θ⊤0xi))mi(y), (13)

hence we get

 ∂Q(y)∂yj=∑i(1i=j−β0Aijcosh2(β0mi(y)+θ⊤0xi))mi(y)+(yi−tanh(β0mi(y)+θ⊤0xi))∂mi(y)∂yj. (14)

We will bound the absolute value of each summand. First, using |yi − tanh(β0mi(y) + θ⊤0xi)| ≤ 2 and |∂mi(y)/∂yj| = |Aij|, we can bound the second term as follows:

 ∣∣∣(yi−tanh(β0mi(y)+θ⊤0xi))∂mi(y)∂yj∣∣∣≤2∣∣Aij∣∣. (15)

Using the fact that cosh(x) ≥ 1 for all x, it also follows that

 ∣∣ ∣∣∑i(1i=j−β0Aijcosh2(β0mi(y)+θ⊤0xi))mi(y)∣∣ ∣∣≤|mj(y)|+∑i≠jβ0∣∣Aijmi(y)∣∣ (16)

Using (15) and (16) it follows that |∂Q(y)/∂yj| ≤ ∑i≠j |Aij|(2 + β0|mi(y)|) + |mj(y)|. Finally, let yj denote the vector y with its j-th coordinate flipped, and note that

 |Q(y)−Q(yj)|≤2⋅⎛⎝∑i≠j∣∣Aij∣∣(2+β0|mi(y−j,w)|)+|mj(y−j,w)|⎞⎠≤2⋅(∑i≠j∣∣Aij∣∣(2+B)+1)≤6+2B, (17)

where (y−j, w) is the point on the segment with endpoints y and yj at which the derivative bound is attained (by the mean value theorem). In the last inequality we used that ∑i≠j |Aij| ≤ ∥A∥∞ ≤ 1, |mi(·)| ≤ 1, and β0 ≤ B.

 ∑iEθ0,β0[(yi−tanh(β0mi(y)+θ⊤0xi))Q(yi)mi(y)]=0, (18)

since, conditioning on y−i, the conditional expectation of yi is tanh(β0mi(y) + θ⊤0xi), i.e.

 Eθ0,β0[yi − tanh(β0mi(y) + θ⊤0xi) ∣ y−i] = 0. (19)

Therefore it follows

 Eθ0,β0[Q2(y)] =Eθ0,β0[Q(y)⋅(∑i(yi−tanh(β0mi(y)+θ⊤0xi))mi(y))] =Eθ0,β0[∑i(Q(y)(yi−tanh(β0mi(y)+θ⊤0xi))mi(y))] =∑iEθ0,β0[(Q(y)−Q(yi))⋅(yi−tanh(β0mi(y)+θ⊤0xi))mi(y)] ≤∑i2⋅(6+2B)=(12+4B)n.

###### Lemma 3.3 (Variance Bound 2).
 Eθ0,β0⎡⎣d∑k=1(n∑i=1xi,kyi−xi,ktanh(β0mi(y)+θ⊤0xi))2⎤⎦≤(4+4B)M2⋅dn.
###### Proof.

We use the powerful technique of exchangeable pairs as employed by Chatterjee and Dembo (see [9]). Note that since ∥A∥∞ ≤ 1, it trivially follows that |mi(y)| ≤ 1 for all i and all y. We fix a coordinate k ∈ [d] and set

 Q(y):=∑i(yi−tanh(β0mi(y)+θ⊤0xi))xi,k, (20)

hence we get the analogue of (14) for this Q. We will bound the terms as follows