Memory-Sample Tradeoffs for Linear Regression with Small Error

The contributions of Vatsal and Gregory were supported by NSF awards AF-1813049 and CCF-1704417, and by ONR Young Investigator Award N00014-18-1-2295. Aaron's contributions were supported by NSF CAREER Award CCF-1844855.
We consider the problem of performing linear regression over a stream of $d$-dimensional examples, and show that any algorithm that uses a subquadratic amount of memory exhibits a slower rate of convergence than can be achieved without memory constraints. Specifically, consider a sequence of labeled examples $(a_1, b_1), (a_2, b_2), \ldots,$ with $a_i$ drawn independently from a $d$-dimensional isotropic Gaussian, and where $b_i = \langle a_i, x \rangle + \eta_i$ for a fixed $x \in \mathbb{R}^d$ with $\|x\|_2 = 1$ and with independent noise $\eta_i$ drawn uniformly from the interval $[-2^{-d/5}, 2^{-d/5}]$. We show that any algorithm with at most $d^2/4$ bits of memory requires at least $\Omega(d \log \log \frac{1}{\epsilon})$ samples to approximate $x$ to $\ell_2$ error $\epsilon$ with probability of success at least $2/3$, for $\epsilon$ sufficiently small as a function of $d$. In contrast, for such $\epsilon$, $x$ can be recovered to error $\epsilon$ with probability $1 - o(1)$ with memory $O\left(d^2 \log(1/\epsilon)\right)$ using $d$ examples. This represents the first nontrivial lower bounds for regression with super-linear memory, and may open the door for strong memory/sample tradeoffs for continuous optimization.
What are the implications of memory constraints on the ability to efficiently learn or optimize? As has been revealed in a recent series of striking results (Raz, 2016, 2017; Beame et al., 2018; Kol et al., 2017; Moshkovitz and Moshkovitz, 2017, 2018; Garg et al., 2018), for a broad class of natural learning problems over the Boolean hypercube and other finite fields, there is a sharp threshold for the amount of memory required to learn with a polynomial amount of data.
This line of work was sparked by Raz’ breakthrough result (Raz, 2016), which considered the problem of learning a parity: given access to a stream of labeled examples $(a_t, b_t)$, where each $a_t$ is drawn uniformly from the $d$-dimensional hypercube $\{0,1\}^d$ and $b_t = \langle a_t, x \rangle \bmod 2$ for some fixed vector $x \in \{0,1\}^d$, Raz showed that any algorithm with subquadratic memory would require an exponential number of examples to learn $x$ (with any significant probability of success). Of course, given a quadratic amount of memory, $x$ can be efficiently computed by taking the first $O(d)$ examples and solving the corresponding linear system over $\mathbb{F}_2$. Subsequent work extended this result to a broad class of discrete learning problems. Raz (2017) and Moshkovitz and Moshkovitz (2017, 2018) generalized the results to a class of Boolean learning problems that satisfy a certain combinatorial condition. Kol et al. (2017) extended the techniques to the problem of learning sparse parities (parities involving only $k$ of the $d$ coordinates), which implies hardness of several other natural Boolean learning problems, including learning small juntas, small decision trees, and small DNF formulae. Finally, Beame et al. (2018) and Garg et al. (2018) strengthened the approach of Raz (2017) to yield tight tradeoffs for a larger class of learning problems over finite fields, including homogeneous multivariate polynomials over $\mathbb{F}_2$.
For continuous, real-valued optimization and learning problems, much less is known about memory/sample tradeoffs. This is in spite of the fact that the problem of learning a linear regression—the real-valued analog of learning parities—lies at the core of machine learning and is a prototypical convex optimization problem. Indeed, one of the original motivations for the conjecture that learning a parity requires either quadratic memory or exponential time, originally stated in Steinhardt et al. (2016), was the question of the memory/sample tradeoffs for linear regression.
This question of the memory/sample tradeoffs for linear regression is also extremely important from a practical perspective. Gradient-based ‘first-order’ methods are the workhorse of modern machine learning, in contrast to ‘second-order’ methods. This is explained by the efficiency benefit conferred by the linear memory footprint of first-order methods as opposed to the quadratic memory requirements of second-order methods. For large-scale learning problems, this reduction in memory usage of first-order methods more than compensates for the increase in the number of iterations or datapoints processed. If methods with comparable memory usage to first-order methods (or at least significantly subquadratic) were capable of achieving similar convergence rates to second-order methods, that could have far-reaching practical implications.
The question of the memory/sample tradeoffs for linear regression is also a natural and largely unexplored frontier of continuous optimization research. There is a classic line of research proving information theoretic lower bounds for continuous optimization (Nemirovski and Yudin, 1983; Nesterov, 2014; Bubeck, 2015) in a restricted oracle model, where the input real-valued function can only be accessed via black-box queries to an oracle that returns local information about the function, e.g. function values, gradients, etc. Given only mild regularity assumptions on a function, e.g. Lipschitz continuity, smoothness, convexity, etc., proving tight bounds on the number of oracle queries needed to approximately minimize the function is well studied and has been a driving force behind the development of modern optimization theory. While there is some work studying the effect of parallelism on these lower bounds (Nemirovski, 1994; Balkanski and Singer, 2018; Diakonikolas and Guzmán, 2018), we are unaware of previous work proving gaps between the query complexity of optimization problems under differing memory constraints.
In this work, we provide the first nontrivial memory/sample tradeoffs for linear regression which apply in the regime where the available memory is significantly larger than what would be required to store each datapoint to high precision.
Theorem 4. Consider a sequence of labeled examples $(a_1, b_1), (a_2, b_2), \ldots,$ with $a_i$ drawn independently from a $d$-dimensional isotropic Gaussian with an identity covariance matrix, and where $b_i = \langle a_i, x \rangle + \eta_i$ for a fixed $x \in \mathbb{R}^d$ with $\|x\|_2 = 1$ and with independent noise $\eta_i$ drawn uniformly from the interval $[-2^{-d/5}, 2^{-d/5}]$. Let $\epsilon \ge 2^{-d/5}$ be sufficiently small as a function of $d$. Then any algorithm with at most $d^2/4$ bits of memory requires at least $\Omega(d \log \log \frac{1}{\epsilon})$ samples to approximate $x$ to $\ell_2$ error $\epsilon$ with probability of success at least $2/3$. In particular, for $\epsilon = 2^{-d/5}$, this implies that any algorithm with $d^2/4$ bits of memory requires at least $\Omega(d \log d)$ samples to approximate $x$ to error $\epsilon$ with probability of success at least $2/3$.
For comparison, note that for such $\epsilon$, the trivial algorithm recovers $x$ to error $\epsilon$ with probability $1 - o(1)$ using $O\left(d^2 \log(1/\epsilon)\right)$ bits of memory and $d$ examples.¹

¹This follows from the fact that the condition number of the system defined by $d$ examples is at most $\mathrm{poly}(d)$ with high probability, and hence we can solve the system to accuracy $\epsilon$ by doing all computations with $O\left(\log(d/\epsilon)\right)$ bits of precision.
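The trivial second-order approach can be sketched in a few lines. The code below is our own illustration, not code from the paper; the dimension, seed, and noise width are arbitrary choices. It stores $d$ full examples (quadratic memory) and solves the resulting $d \times d$ system, which is well conditioned with high probability for isotropic Gaussian examples, so the recovery error is driven down to roughly the noise level.

```python
import numpy as np

# Sketch of the "trivial" algorithm: store d examples, solve the system.
rng = np.random.default_rng(0)
d = 50
x_true = rng.standard_normal(d)
x_true /= np.linalg.norm(x_true)            # unit-norm ground truth x

A = rng.standard_normal((d, d))             # d isotropic Gaussian examples
noise = rng.uniform(-2.0**-40, 2.0**-40, size=d)
b = A @ x_true + noise                      # labels b_i = <a_i, x> + eta_i

x_hat = np.linalg.solve(A, b)               # O(d^2) memory, only d samples
err = np.linalg.norm(x_hat - x_true)        # error near the noise level
print(err)
```

The point of the contrast with the theorem is that this accuracy is achieved with only $d$ samples, at the cost of quadratic memory.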
1.1 Further Directions
Our result establishes the existence of a sharp gap in sample complexities for regression with bounded memory. Nevertheless, this still leaves a significant margin between our lower bounds on the convergence rate of bounded-memory algorithms and the rates achieved by the best known first-order methods, which use memory $O(d)$. For example, randomized Kaczmarz (Strohmer and Vershynin, 2006) can easily be shown to compute a point $\hat{x}$ such that $\|\hat{x} - x\|_2 \le \epsilon$ with constant probability using $O\left(d \log(1/\epsilon)\right)$ samples and $O(d)$ memory. This is the best known sample complexity for achieving error $\epsilon$ given this amount of memory, though our lower bound leaves open the question of whether it is optimal.
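Randomized Kaczmarz admits a very short streaming implementation. The sketch below is our own (illustrative dimension, step count, and noiseless labels): each fresh example is used once to project the current iterate onto the hyperplane it defines, so only the $O(d)$ iterate is kept in memory.

```python
import numpy as np

# Minimal streaming randomized Kaczmarz sketch (noiseless labels for
# illustration): each example projects the iterate onto {w : <a,w> = b}.
rng = np.random.default_rng(1)
d = 50
x_true = rng.standard_normal(d)
x_true /= np.linalg.norm(x_true)

w = np.zeros(d)                        # O(d) memory: just the iterate
for _ in range(20 * d):                # ~ d log(1/eps) samples overall
    a = rng.standard_normal(d)         # fresh isotropic Gaussian example
    b = a @ x_true
    w += (b - a @ w) / (a @ a) * a     # project onto the new hyperplane

err = np.linalg.norm(w - x_true)
print(err)
```

For Gaussian rows each projection removes about a $1/d$ fraction of the squared error in expectation, which is the source of the $O(d \log(1/\epsilon))$ sample complexity quoted above.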
Beyond tightening our result, it is also worth considering the analogous regression question in the setting where the datapoints are drawn from an ill-conditioned distribution. Both in practice and in theory, first-order methods suffer a convergence rate that degrades with the condition number. Over the past decade there has been extensive research on iterative methods which use subquadratic space and seek better dependencies on the eigenvalues of the covariance of the distribution of examples. Though there have been several improvements to randomized Kaczmarz and variants of SGD in recent years (Hu et al., 2009; Lan, 2012; Ghadimi and Lan, 2013; Lee and Sidford, 2013; Frostig et al., 2015; Liu and Wright, 2016; Needell et al., 2016; Dieuleveut et al., 2017; Jain et al., 2018), the number of samples required by these methods all depend polynomially on some measure of eigenvalue range or conditioning of the underlying covariance matrix. This is in sharp contrast to second-order methods, which can simply store $O(d)$ samples and invert the associated matrix to compute an accurate solution with $O\left(d^2 \log \kappa\right)$ memory, where $\kappa$ is a condition number measure of the matrix.
Conjecture. For any $\kappa$ bounded by a polynomial of $d$, there exists a distribution over $d$-dimensional Gaussian distributions whose covariance has condition number $\kappa$, such that given a sequence of examples $(a_1, b_1), (a_2, b_2), \ldots,$ with $a_i$ drawn from such a Gaussian and $b_i = \langle a_i, x \rangle$, any algorithm that recovers the unit vector $x$ to small constant error with constant probability either requires $\Omega(d^2)$ bits of memory, or a number of examples that grows polynomially with $\kappa$.
The work of this paper may be a first step towards proving the above conjecture. We hope this paper will inspire efforts to establish strong memory/query complexity trade-offs for continuous optimization more broadly.
In a different direction, it may be worth considering the extent to which results of the form of Theorem 4 apply beyond the stochastic streaming setting. Rather than considering a stream of independent examples, one could consider the analogous questions in a cell-probe setting: suppose there is a set of examples stored in read-only memory, and one is charged according to the number of times each example is ‘downloaded’ into working memory. What are the tradeoffs between the amount of working memory, error of the recovered linear regression, and number of ‘downloads’? This setting closely corresponds to the data pipeline employed in many large scale learning settings, and any strong results in this setting would be extremely interesting.
It is worth noting that, even in the setting of learning parities, the stochasticity of the examples is essential to the exponential sample complexity of memory-constrained algorithms; analogous results are not true in the above cell-probe model. For example, given the examples stored in read-only memory, there exists a successful learning algorithm for the parity problem with far less than quadratic working memory, at the cost of increased runtime and repeated cell-probes (Kong, 2018) (though, to the best of our knowledge, it is not known if there is a successful learning algorithm for the real-valued regression problem with comparably small working memory and few cell-probes). Still, establishing any nontrivial gap between memory-constrained and unconstrained learning (for either the real-valued regression or parity problems) in the cell-probe setting would be exciting, though it may be quite difficult.
1.2 Related Work
A number of recent works have examined learning problems such as sparse linear regression (Steinhardt and Duchi, 2015) and detecting correlations (Shamir, 2014; Dagan and Shamir, 2018) under information constraints such as limited memory or communication constraints. These results usually develop information-theoretic inequalities (Braverman et al., 2016; Steinhardt et al., 2016; Dagan and Shamir, 2018) to show that unless a set of parties exchange a minimum amount of information, they cannot solve the learning problem—with the memory bound following as a consequence of the communication lower bound. At a high level, the idea is to show that if the learning problem requires distinguishing between a set of distributions, and if the distributions are sufficiently uncorrelated, then a correspondingly large number of bits of communication is needed to solve the learning problem. While initial results only obtained lower bounds for settings where the memory budget is less than the size of each data point, the recent work of Dagan and Shamir (2018) circumvented this barrier and showed strong lower bounds for detecting correlations for natural distributions under information constraints.
Many of these information theoretic tools seem to break down for learning problems such as parity learning, where communication lower bounds do not directly give meaningful memory bounds. Hence these settings require explicitly taking into account the memory constraint of the algorithm; the recent line of work discussed in the introduction, starting with Raz (2016), achieves sharp lower bounds for memory-bounded learning by directly analyzing the structure of width-bounded branching programs for these problems (Raz, 2017; Kol et al., 2017; Moshkovitz and Moshkovitz, 2017, 2018; Beame et al., 2018; Garg et al., 2018). Our work directly builds on the analysis framework developed in Raz (2017), and extended in Beame et al. (2018) and Garg et al. (2018), with the crucial difference that the geometry of the continuous space corresponding to linear regression lacks many of the combinatorial properties that are leveraged in the analysis of these prior works.
There is also a large literature on memory lower bounds for streaming algorithms (see e.g. Alon et al. (1999); Bar-Yossef et al. (2004)), although these are mostly for non-learning problems and assume that the input stream is constructed in an adversarial fashion.
On the optimization side, there is a long history of proving information theoretic lower bounds on optimization methods. These results typically show that given a type of restricted local oracle to access the input, i.e. an oracle which only returns information about values, gradients, higher derivatives, separation oracles, etc., lower bounds can be formally proven on the number of queries needed to approximately minimize the function. Such results date back to early work of Nemirovski and Yudin (1983) on the oracle complexity of optimization, and there are too many results to do a complete review (see Nesterov (2014); Bubeck (2015) for more recent surveys). Key results in this broad area of research include tight oracle bounds for computing an approximate minimizer of a smooth convex function given a gradient oracle (Nemirovski and Yudin, 1983; Nesterov, 2014; Bubeck, 2015), even when randomization is allowed (Woodworth and Srebro, 2016); tight oracle bounds for computing an approximate minimizer of a Lipschitz convex function given by a subgradient oracle (Braun et al., 2017; Nemirovski and Yudin, 1983; Nesterov, 2014); and even tight oracle bounds for computing critical points, that is, points of small gradient, of smooth non-convex functions given by a gradient oracle (Nesterov, 2012; Cartis et al., 2012, 2017; Carmon et al., 2017a, b). There has also been extensive research on the oracle complexity of stochastic optimization (Nemirovski and Yudin, 1983; Devolder et al., 2013, 2014; Shamir, 2013; Duchi et al., 2015) and work on the tradeoff between oracle complexity and parallelism for nonsmooth optimization with a subgradient oracle (Nemirovski, 1994; Balkanski and Singer, 2018; Diakonikolas and Guzmán, 2018). However, to the best of the authors' knowledge, the problem of memory/query complexity tradeoffs for real-valued continuous optimization has been largely unexplored.
2 Setup and Proof Overview
In this section, we provide an overview of our proof approach. We begin by describing the notation and formalism we will use in analyzing the branching program representing a memory-bounded learning algorithm.
Branching Programs for Learning
We model the learner by a branching program $B$. A branching program is a general non-uniform model for space-bounded computation. The branching program has $m$ layers, corresponding to time steps, with each layer having at most $W$ vertices, where $W$ denotes the width of $B$. Each vertex of $B$ corresponds to a memory state, and a branching program with width $W$ corresponds to an algorithm with memory usage $\log_2 W$. A vertex with no outgoing edges is called a leaf, and all vertices in the last layer are leaves (though there may be additional leaves). Each non-leaf vertex $v$ has an associated transition function, representing the mapping from an example $(a, b)$ to a vertex in the next layer. Without loss of generality, we may assume that these transition functions are deterministic, as randomization cannot improve the probability of success.² To be consistent with the literature on branching programs, we will refer to this transition function as a series of ‘edges’ indexed by the (infinite number of) potential examples $(a, b)$. Finally, each leaf vertex $v$ of the branching program is labeled by a label $\tilde{x}(v)$, representing the output value that the corresponding algorithm would produce on the sequence of examples that led to vertex $v$.

²This can be easily seen by noting that any branching program with randomized transitions can be converted to a deterministic one by iteratively derandomizing each vertex, replacing its randomized transition function with the deterministic one that selects the transition maximizing the probability of success (breaking ties arbitrarily), where the probability is taken over the randomization in the subsequent examples and whatever randomization remains in the transition functions corresponding to other vertices.
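As a toy rendering of this correspondence (our own construction, not the paper's), a memory-bounded streaming learner is exactly a deterministic transition function from (current state, next example) to a new state, where the state occupies a bounded number of bits. The class below keeps a quantized SGD iterate, so the state fits in roughly `bits_per_coord * d` bits; all names and parameter values are illustrative assumptions.

```python
import numpy as np

class MemoryBoundedLearner:
    """Streaming learner whose entire state between examples is a quantized
    d-dimensional vector, i.e. roughly bits_per_coord * d bits of memory."""

    def __init__(self, d, bits_per_coord=12):
        self.levels = 2 ** bits_per_coord
        self.state = np.zeros(d)          # the memory state

    def _quantize(self, w):
        # Round each coordinate to one of `levels` values in [-1, 1].
        w = np.clip(w, -1.0, 1.0)
        idx = np.round((w + 1.0) / 2.0 * (self.levels - 1))
        return idx / (self.levels - 1) * 2.0 - 1.0

    def update(self, a, b, lr=0.02):
        # Deterministic transition: one SGD step, then re-quantize so the
        # state stays representable in the allotted bits.
        grad = (a @ self.state - b) * a
        self.state = self._quantize(self.state - lr * grad)

# Run the learner on a stream of Gaussian examples with noiseless labels.
rng = np.random.default_rng(0)
d = 20
x = rng.standard_normal(d)
x /= np.linalg.norm(x)                    # unit-norm ground truth

learner = MemoryBoundedLearner(d)
for _ in range(3000):
    a = rng.standard_normal(d)
    learner.update(a, a @ x)

err = np.linalg.norm(learner.state - x)
print(err)
```

The quantization caps the accuracy this particular learner can reach, which mirrors the theme of the paper: the interesting question is how the achievable error scales with the number of bits of state and the number of examples.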
The success probability of the branching program for a specified accuracy parameter $\epsilon$ is the probability that $\|\tilde{x} - x\|_2 \le \epsilon$, where $\tilde{x}$ is the vector returned by $B$, and the probability is with respect to the randomness in the sequence of examples and the choice of the true $x$.
We consider branching programs whose goal is to learn some true $x$ with $\|x\|_2 = 1$ to error $\epsilon$, in the setting where $x$ is drawn uniformly at random from the $d$-dimensional unit sphere. At every time step, a $d$-dimensional vector $a$ is sampled from $\mathcal{N}(0, I_d)$, and the branching program is given $a$ and the (noisy) inner product $b = \langle a, x \rangle + \eta$, where the noise $\eta$ is sampled uniformly from $[-\rho, \rho]$ for an exponentially small noise level $\rho$. The addition of this noise facilitates the analysis, and we could have equivalently assumed that the true inner product is discretized according to some exponentially small discretization error. Note that as long as the goal is to estimate $x$ up to accuracy $\epsilon$ for a small constant $\epsilon$, the small uniform noise or discretization does not create any information theoretic obstacles.
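The data stream the branching program observes can be sketched concretely as follows (our own variable names and an illustrative noise width, not the paper's parameters): a uniformly random unit vector $x$, isotropic Gaussian examples, and labels equal to the inner product plus small uniform noise.

```python
import numpy as np

# Sketch of the learning setup: x uniform on the unit sphere, examples
# a_t ~ N(0, I_d), labels b_t = <a_t, x> + eta_t with uniform noise.
rng = np.random.default_rng(2)
d = 20
x = rng.standard_normal(d)
x /= np.linalg.norm(x)                   # uniform random direction on sphere

NOISE_WIDTH = 2.0 ** -10                 # stands in for the tiny noise level

def next_example():
    a = rng.standard_normal(d)
    eta = rng.uniform(-NOISE_WIDTH, NOISE_WIDTH)
    return a, a @ x + eta

a, b = next_example()
consistent = abs(b - a @ x) <= NOISE_WIDTH   # label lies in the noise window
print(consistent)
```

The set of unit vectors $z$ with $\langle a, z \rangle$ within the noise window of $b$ is exactly the "slice" that the posterior update restricts to in the analysis below.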
Our proof follows and builds on the recent analytic framework for showing time-space lower bounds developed in Raz (2017), and further extended in Beame et al. (2018) and Garg et al. (2018). The analysis in our case is complicated by the fact that the gap in the sample complexity of first-order and second-order methods for regression on well-conditioned matrices is not very large, and depends on the desired error $\epsilon$. To capture this dependence of the sample complexity on $\epsilon$, we divide the branching program into multiple stages, where a stage is a group of consecutive layers of the branching program. Each stage will intuitively correspond to the branching program reducing the error of its estimate of $x$ by a factor of two. We will argue that each of these stages cannot be too short if the algorithm has small memory. We now sketch the proof, describing the high-level framework of Raz (2017) and how we adapt it to our setup.
As in Raz (2017), we define a truncated computation path $\mathcal{T}$ which follows the computation path of the branching program $B$, but may stop before reaching a leaf vertex. The conditions under which $\mathcal{T}$ stops before reaching a leaf vertex will depend on the stage of the branching program which $\mathcal{T}$ is in. For any vertex $v$ in the branching program, let $\mathbb{P}_{x|v}$ be the posterior distribution of $x$ conditioned on being at $v$. We will quantify the progress made by a vertex $v$ towards learning $x$ by the $\ell_2$ norm $\|\mathbb{P}_{x|v}\|_2$ of the posterior distribution of $x$; note that a large norm indicates a concentrated posterior with an accurate estimate of $x$. The truncated path stops at any significant vertex, namely a vertex $v$ where $\|\mathbb{P}_{x|v}\|_2$ is larger than some threshold, where the threshold is chosen as a function of the stage of the branching program being analyzed. Intuitively, if $\|\mathbb{P}_{x|v}\|_2$ is larger than a given threshold, then $v$ has more information about $x$ than we expect it to have at that stage. Most of the effort in Raz (2017) and in our work goes into ensuring that the probability of any significant vertex is small enough that the probability of stopping due to reaching a significant vertex is small.
We now sketch the argument for showing that the probability of reaching a significant vertex is small, for any stage $j$ of the branching program $B$. Let $V_{i,j}$ denote the set of all vertices in the $i$-th layer of the $j$-th stage $B_j$. For a fixed significant vertex $s$ in $B_j$ and a suitably chosen power $k$, the following potential function tracks the progress which the $i$-th layer of $B_j$ has made towards $s$:
$$Z_{i,j} \;=\; \sum_{v \in V_{i,j}} \Pr(v) \cdot \big\langle \mathbb{P}_{x|v}, \mathbb{P}_{x|s} \big\rangle^{k}.$$
We claim that if $Z_{i,j}$ is small and the significant vertex $s$ lies in the $i$-th layer of $B_j$, then the probability of reaching $s$ must also be small. This follows because we define significant vertices as those for which $\|\mathbb{P}_{x|s}\|_2$ is large, and if $s$ is in the $i$-th layer then $Z_{i,j} \ge \Pr(s) \cdot \|\mathbb{P}_{x|s}\|_2^{2k}$. Hence our goal will be to show that $Z_{i,j}$ is small. Note that raising the inner product to the power $k$ in our expression for $Z_{i,j}$ allows us to show that the probability of significant vertices is small enough that we can take a union bound over all vertices in the branching program, and $k$ is the largest power to which we can raise the inner product while keeping the contribution of the low-probability events small.
We prove that $Z_{i,j}$ is small via an induction argument. We first show that $Z_{1,j}$ must be small, as the previous stage of the branching program could not have made too much progress. We next show that $Z_{i+1,j}$ cannot be much larger than $Z_{i,j}$. To show this, we introduce another potential which tracks how much progress any edge of the branching program has made towards $s$. Let $\mathbb{P}_{x|e}$ be the posterior distribution of $x$ conditioned on the event of traversing edge $e$ in the branching program. Let $\Gamma_{i,j}$ denote the set of all edges from the $i$-th layer to the $(i+1)$-th layer of the $j$-th stage of the branching program, and let $\Pr(e)$ be the p.d.f. of the distribution over edges evaluated at edge $e$. We define the potential
$$Z'_{i,j} \;=\; \sum_{e \in \Gamma_{i,j}} \Pr(e) \cdot \big\langle \mathbb{P}_{x|e}, \mathbb{P}_{x|s} \big\rangle^{k}.$$
A straightforward convexity argument shows that $Z_{i+1,j} \le Z'_{i,j}$. Hence the main challenge is showing that $Z'_{i,j}$ cannot be much larger than $Z_{i,j}$. This is where our analysis differs significantly from Raz (2017) (and this is also where Beame et al. (2018) and Garg et al. (2018) differ the most from Raz (2017)). In these previous works, which concern learning over finite fields, the learning problem is viewed as a certain matrix, and properties of this matrix are used to show that $Z'_{i,j}$ cannot be much larger than $Z_{i,j}$. It is worth noting that in these settings, it is possible to argue that the example in the next time step looks almost random to the branching program if it does not have significant knowledge about the answer, and then use this to show that the branching program cannot make too much progress when it gets an example. In our case though, first-order methods which require only linear memory can learn $x$ up to non-negligible error with only linear sample complexity, hence the examples do not have as much randomness. Also, as we work over continuous spaces, we lack the combinatorial properties that enable the analysis in the previous works to go through, and need to develop different tools.
We now sketch our argument for showing that $Z'_{i,j}$ cannot be much larger than $Z_{i,j}$. For intuition, we first describe the argument as it would pertain to the branching program corresponding to the linear-memory, first-order method for regression. At a high level, by the end of the $j$-th stage of this branching program, the algorithm has learned $x$ up to some error $\epsilon_j$, and the posterior of a vertex in this stage roughly corresponds to a spherical Gaussian with standard deviation proportional to $\epsilon_j$ in every direction. A target significant vertex $s$ in the $j$-th stage will have posterior roughly corresponding to another spherical Gaussian, but with smaller standard deviation in every direction. This significant vertex represents a memory state that has learned significantly more than is expected of a vertex in this stage, and we will show that the probability of reaching such a vertex is small. As every example has some small uniform noise added to the label, if the branching program is initially at vertex $v$ and then gets the example $(a, b)$, the posterior is updated by restricting it to the thin slice of the spherical Gaussian where $\langle a, z \rangle$ agrees with $b$ up to the noise level. We need to argue that this slicing does not significantly increase the inner product with the posterior corresponding to the smaller, target Gaussian. This holds, provided the target Gaussian does not have significantly higher probability mass in the slice to which we are restricting. This is easy to analyze in this special setting where the posteriors are spherical Gaussians, by simply analyzing the projections of the two Gaussians along the random direction $a$. In our actual proof, to bound the rate of progress via this argument, we cannot assume that the posteriors have such a nice form. Nevertheless, we show a concentration result that guarantees that, for any distribution with sufficiently small $\ell_2$ norm, the projections cannot behave too much worse than projections of spherical Gaussians with the corresponding norms.
To sketch the argument more formally, we need to define some notation. We define $\mathbb{P}_1 \cdot \mathbb{P}_2$ as the point-wise product of the distributions $\mathbb{P}_1$ and $\mathbb{P}_2$, with suitable normalization: for any $z$ on the $d$-dimensional unit sphere,
$$(\mathbb{P}_1 \cdot \mathbb{P}_2)(z) \;=\; \frac{\mathbb{P}_1(z)\,\mathbb{P}_2(z)}{\int \mathbb{P}_1(u)\,\mathbb{P}_2(u)\,du}.$$
Let $I_{a,b}$ be the interval $[b - \rho, b + \rho]$, where $\rho$ is the noise level. For any distribution $\mathbb{P}$ and fixed example $(a, b)$, define $\mathbb{P}(I_{a,b}) = \Pr_{z \sim \mathbb{P}}[\langle a, z \rangle \in I_{a,b}]$. Note that for a vertex $v$ with posterior distribution $\mathbb{P}_{x|v}$, $\mathbb{P}_{x|v}(I_{a,b})$ is the probability mass on vectors which are consistent with the example $(a, b)$, up to the noise level $\rho$. With some technical work, we can approximately relate $\langle \mathbb{P}_{x|e}, \mathbb{P}_{x|s} \rangle$ and $\langle \mathbb{P}_{x|v}, \mathbb{P}_{x|s} \rangle$ for an edge $e$ out of $v$ labelled by $(a, b)$ as follows:
$$\big\langle \mathbb{P}_{x|e}, \mathbb{P}_{x|s} \big\rangle \;\approx\; \frac{\mathbb{P}_{x|s}(I_{a,b})}{\mathbb{P}_{x|v}(I_{a,b})} \cdot \big\langle \mathbb{P}_{x|v}, \mathbb{P}_{x|s} \big\rangle.$$
Intuitively, the above relation says that the progress that the truncated path makes towards some target distribution $\mathbb{P}_{x|s}$ after receiving example $(a, b)$ depends on the ratio of the probability mass of $\mathbb{P}_{x|s}$ which is consistent with $(a, b)$ to that of $\mathbb{P}_{x|v}$ which is consistent with $(a, b)$. Hence in order to bound $Z'_{i,j}$ in terms of $Z_{i,j}$, our goal will be to upper bound the ratio $\mathbb{P}_{x|s}(I_{a,b}) / \mathbb{P}_{x|v}(I_{a,b})$.
Note that as $b = \langle a, x \rangle + \eta$ where $x$ is distributed according to $\mathbb{P}_{x|v}$, we can show that examples $(a, b)$ for which $\mathbb{P}_{x|v}(I_{a,b})$ is too small have small probability. Hence we can lower bound the denominator by making the truncated path stop if $\mathbb{P}_{x|v}(I_{a,b})$ is too small, while still ensuring that the probability of stopping due to this reason is small.
It is more complicated to upper bound the numerator $\mathbb{P}_{x|s}(I_{a,b})$. Note that $\mathbb{P}_{x|s}(I_{a,b})$ is the probability mass of the distribution $\mathbb{P}_{x|s}$ which lies in the interval $I_{a,b}$ when $\mathbb{P}_{x|s}$ is projected onto the random direction $a$. The linear projection of a high-dimensional distribution onto a random direction is a well-studied topic, and it is known that under mild conditions on the distribution, such as bounded second moments, its projection onto a random direction is approximately Gaussian (Bobkov, 2003; Anttila et al., 2003; von Weizsäcker, 1997) or a mixture of Gaussians (Dasgupta et al., 2006) with high probability. However, these results typically only give an additive error guarantee for the difference between the probability mass of the distribution on any interval and that of an appropriate Gaussian on that interval (and this is tight given only second moment constraints). Note that in our case the intervals have exponentially small width $2\rho$, and we care about the multiplicative approximation error, hence these additive error guarantees are not strong enough. We show that we can obtain stronger guarantees in our case by ensuring that $\|\mathbb{P}_{x|s}\|_2$ is small, which we guarantee by appropriate conditions on the truncated path $\mathcal{T}$. With a bound on $\|\mathbb{P}_{x|s}\|_2$, we prove the following concentration result for projections of high-dimensional distributions onto a random direction.
Let $\mathbb{P}$ be a distribution over the $d$-dimensional sphere whose $\ell_2$ norm is suitably bounded. Then for an absolute constant $C$ and fixed $b$, the probability over a random direction $a$ that $\mathbb{P}(I_{a,b})$ exceeds $C$ times the mass that a spherical Gaussian of the corresponding norm places on $I_{a,b}$ is small.
Finally, note that the above bound is for a fixed $b$, but if the branching program knows $x$ to a small error then it also knows the inner product $\langle a, x \rangle$ for any $a$ to a small error, hence the distribution of $b$ is itself highly dependent on the posterior. To get around this, we prove a version of the above lemma where $b$ is obtained by first sampling $x'$ from the posterior and then adding noise to $\langle a, x' \rangle$. These concentration bounds may be useful beyond this work, and it may be interesting to further develop our understanding of properties of the projection of high-dimensional distributions with small $\ell_2$ norm onto a random direction.
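The classical phenomenon invoked above is easy to observe numerically. The sketch below is our own illustration (dimension, sample size, and slab width are arbitrary): it projects a decidedly non-Gaussian high-dimensional distribution, uniform on scaled hypercube vertices, onto a random direction, and compares the mass of a slab around zero with the Gaussian prediction.

```python
import numpy as np

# Project a non-Gaussian high-dimensional distribution (uniform on the
# vertices of a scaled hypercube) onto a random unit direction and compare
# slab masses with the Gaussian prediction.
rng = np.random.default_rng(3)
d, n = 256, 20_000
theta = rng.standard_normal(d)
theta /= np.linalg.norm(theta)          # random unit direction

Z = (2.0 * rng.integers(0, 2, size=(n, d)) - 1.0) / np.sqrt(d)
proj = Z @ theta                        # 1-d projection; variance is 1/d

# Empirical mass of the slab |t| <= 1/sqrt(d), versus the Gaussian value
# P(|g| <= sigma) ~ 0.683 for g ~ N(0, sigma^2) with sigma = 1/sqrt(d).
emp = np.mean(np.abs(proj) <= 1.0 / np.sqrt(d))
print(emp)
```

Note this only checks an additive, coarse-scale agreement; as discussed above, the analysis in the paper needs multiplicative control over exponentially narrow intervals, which is far more delicate.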
Let $\mathcal{S}^{d-1}$ be the set of all vectors on the $d$-dimensional unit sphere, and let $\mathcal{U}$ be the uniform distribution over $\mathcal{S}^{d-1}$. Hence $\mathcal{U}(z) = u$ for all $z \in \mathcal{S}^{d-1}$, for some constant $u$ which depends on $d$.
Let $E_v$ denote the event that the truncated path $\mathcal{T}$ reaches a vertex $v$. For any random variable $y$, we denote the distribution of $y$ by $\mathbb{P}_y$. We denote the probability of any vertex $v$ in the branching program by $\Pr(v)$. As the edges of the branching program are indexed by real-valued examples, for any edge $e$ we denote the p.d.f. of the distribution over all edges of the branching program evaluated at the edge $e$ by $\Pr(e)$. Let the sample at the $t$-th time step be $(a_t, b_t)$. Recall that the distribution of $a_t$ is the isotropic Gaussian $\mathcal{N}(0, I_d)$, and we will denote its p.d.f. at a vector $a$ by $\mathbb{P}_a(a)$. Similarly, we denote the p.d.f. of $b_t$ conditioned on being at vertex $v$ and seeing example $a_t$ by $\mathbb{P}_{b_t|v,a_t}$. For any function $f$ from $\mathcal{S}^{d-1}$ to $\mathbb{R}$, we denote by $\|f\|_2$ the norm of $f$ with respect to the uniform distribution over $\mathcal{S}^{d-1}$:
$$\|f\|_2 \;=\; \Big(\mathbb{E}_{z \sim \mathcal{U}}\big[f(z)^2\big]\Big)^{1/2}.$$
Recall from the previous section that for any distribution $\mathbb{P}$, $\mathbb{P}(I_{a,b}) = \Pr_{z \sim \mathbb{P}}[\langle a, z \rangle \in I_{a,b}]$, where $I_{a,b}$ is the interval $[b - \rho, b + \rho]$. For notational convenience, we will subsequently denote $\mathbb{P}_{x|v}(I_{a,b})$ for a vertex $v$ simply by $\mathbb{P}_v(I_{a,b})$.
4 Proof of Theorem 4
In this section, we formally define the stages of the branching program and the truncated computation path $\mathcal{T}$, and then provide a proof of Theorem 4.
Stages of the Branching Program
We partition the branching program $B$ into stages $B_1, B_2, \ldots$, the number of which depends on the desired accuracy $\epsilon$. The $j$-th stage continues for $m_j$ time steps, where the stage lengths $m_j$ depend on an absolute constant to be determined later. We define the stages inductively. The first stage $B_1$ consists of all vertices up until and including the $m_1$-th layer of the branching program $B$. The $j$-th stage $B_j$ consists of the next $m_j$ layers, beginning with and including the last layer of the previous stage $B_{j-1}$.
We define the truncated path $\mathcal{T}$ corresponding to the branching program $B$. The truncated path follows the same path as $B$, except that it sometimes stops before reaching a leaf vertex. The conditions under which the truncated path stops before reaching a leaf vertex will be different depending on the stage. For each stage $j$, define an accuracy parameter $\epsilon_j$; intuitively, $\epsilon_j$ determines the accuracy to which $B$ could know $x$ in the $j$-th stage. In the $j$-th stage of $B$, the truncated path stops at a non-leaf vertex $v$ for any of the following three reasons—
If $v$ is a significant vertex, that is, if $\|\mathbb{P}_{x|v}\|_2$ exceeds a threshold depending on $\epsilon_j$.
If the true $x$ belongs to the set of vectors which have non-trivial probability mass under $\mathbb{P}_{x|v}$.
If the branching program is about to traverse a bad edge. The set of bad edges for the vertex $v$ is defined as the set of edges labelled by $(a, b)$ for which either i) the label $b$ is atypically large in magnitude, or ii) the consistent probability mass $\mathbb{P}_v(I_{a,b})$ is too small.
If the truncated path does not stop at a non-leaf vertex, then it follows the same path as the computation path of the branching program $B$. Lemma 4, proved in Section 5, shows that the probability of the truncated path stopping at a non-leaf vertex is small.
If the number of samples $m$ and the width $W$ of the branching program satisfy the bounds of Theorem 4, then the probability of stopping at a non-leaf vertex is small.
To prove Lemma 4, we show that the probability of the truncated path stopping at a non-leaf vertex due to each of the three above reasons is small. Most of the effort goes into proving that the probability of stopping due to the first reason, reaching a significant vertex, is small. This is proved in Section 6. Using Lemma 4, we are now ready to prove our main theorem.
Let $B$ be a branching program for finding $x$ to error $\epsilon$, where $\epsilon \ge 2^{-d/5}$. For a small absolute constant $c$, if $B$ has length at most $c\, d \log \log(1/\epsilon)$ and width at most $2^{d^2/4}$, then the success probability of $B$ is at most $2/3$.
We partition the branching program into stages and consider the truncated path $\mathcal{T}$. We first bound the number of stages $J$ in the partition of a branching program of length $m$. As the $j$-th stage consists of $m_j$ steps, the number of steps in $J$ stages can be lower bounded by summing the stage lengths $m_j$.
Hence, for a suitable choice of the constants, the number of stages in $m$ steps is bounded. Note that if $\mathcal{T}$ does not stop before reaching a leaf, then it follows the same path as the branching program $B$. By Lemma 4, the probability that $\mathcal{T}$ stops before reaching a leaf is small. Hence we now only need to bound the probability that a non-significant leaf of $B$ outputs $\tilde{x}$ such that $\|\tilde{x} - x\|_2 \le \epsilon$. However, for a non-significant leaf $v$ we know that $\|\mathbb{P}_{x|v}\|_2$ is below the significance threshold. Further, the following lemma (proved in Section 7) shows that this condition implies that the probability of outputting an accurate answer is small.
Let $\mathbb{P}$ be a distribution over the $d$-dimensional sphere $\mathcal{S}^{d-1}$ whose $\ell_2$ norm is suitably bounded. Then for any fixed $\tilde{x}$, the probability mass that $\mathbb{P}$ places on the set of vectors within distance $\epsilon$ of $\tilde{x}$ is small.
Now for , . Hence by Lemma 3, the probability of a non-significant leaf outputting an accurate answer is at most . By a union bound over the probability of the truncated path stopping at a non-leaf vertex and the probability of a non-significant leaf outputting a valid answer, the probability of outputting an accurate answer is at most . ∎
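In generic notation (placeholders rather than the paper's symbols), writing $\delta_{\mathrm{stop}}$ for the probability that the truncated path stops at a non-leaf vertex and $\delta_{\mathrm{leaf}}$ for the probability that some non-significant leaf outputs an accurate answer, the final bound has the form

```latex
\Pr[\text{outputs an accurate answer}]
  \;\le\; \delta_{\mathrm{stop}} \;+\; \delta_{\mathrm{leaf}},
```

by a union bound over the two failure modes.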
5 Bounding the Probability of the Truncated Path Stopping Early
In this section, we show that the probability that the truncated path stops at a non-leaf vertex is small. Lemma 5 shows that the probability of stopping because of the first reason (reaching a significant vertex) is small. Most of the remainder of the paper will be devoted to proving Lemma 5.
If the total number of stages and the width of the branching program , then the probability that reaches a significant vertex in any stage is at most .
If is not a significant vertex of , then
Assume is in the -th stage in the branching program. Since is not a significant vertex,
Hence by Markov’s inequality,
Since conditioned on , the distribution of is , we get,
As , by standard concentration bounds for random variables, . Conditioned on , . As is generated by adding noise drawn uniformly at random from to the true inner product , , where we use our notation . Let be the p.d.f. of the uniform distribution on with support . Note that,
Therefore as and ,
By a union bound, it follows that ∎
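The two probabilistic tools invoked in the proof above can be stated in their standard generic forms (these are textbook facts, not the paper's specific instantiations):

```latex
% Markov's inequality: for a nonnegative random variable Z and any t > 0,
\Pr[Z \ge t] \;\le\; \frac{\mathbb{E}[Z]}{t}.
% Concentration of the Gaussian norm: for x \sim \mathcal{N}(0, I_d),
% there is an absolute constant c > 0 such that
\Pr\bigl[\|x\|_2 \ge 2\sqrt{d}\bigr] \;\le\; e^{-c d}.
```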
Using these results, we can show that the probability of stopping at a non-leaf vertex is small:
By Lemma 5, the probability that reaches a significant vertex and hence stops due to the first reason is at most . If does not reach a significant vertex, then by Lemma 5, the probability of stopping due to the second reason at any non-significant vertex is at most . Taking a union bound over all the steps, the probability of stopping due to the second reason is at most . By Lemma 6, the probability of getting a bad sample at any time step and hence stopping due to the third reason is at most . Taking a union bound over the time steps, the probability of stopping due to the third reason is at most . Hence the overall probability of the truncated path stopping at a non-leaf vertex is at most . ∎
6 Bounding the Probability of Significant Vertices
In this section, we bound the probability of the truncated path reaching a significant vertex in the -th stage, for any . We begin by finding an expression for the posterior distribution of conditioned on traversing an edge , and then upper bound the norm of a significant vertex in the -th stage of .
Relating and , and bounding
We relate and . Recall that is the interval . We claim that,
For any labeled by , such that ,
Let be an edge labeled by , such that . Since , the vertex is not significant, as otherwise stops on . Also, as , , as otherwise never traverses edge .
If reaches , it traverses the edge if and only if: (as otherwise stops on ) and the next sample received is . Also, note that , where the noise is uniform on . Hence the set of which are consistent with the example consists of those where . Therefore for any ,
where is a normalization factor, given by
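The posterior update described above follows the usual Bayes-rule template for uniform label noise. In generic notation (not the paper's; here $\epsilon$ is a hypothetical half-width for the noise interval), after observing an example $(x, y)$ at a vertex $v$:

```latex
% Prior restricted to the vectors consistent with the observed label,
% then renormalized:
\Pr[w \mid v, (x, y)]
  \;=\; \frac{\Pr[w \mid v]\,\mathbf{1}\{\,|y - \langle x, w\rangle| \le \epsilon\,\}}{Z},
\qquad
Z \;=\; \sum_{w'} \Pr[w' \mid v]\,\mathbf{1}\{\,|y - \langle x, w'\rangle| \le \epsilon\,\}.
```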
Since is not significant, by Lemma 5,
Also, since ,
Hence by a union bound and using the fact that ,
We now show that cannot be too large. To do so, we first show that cannot be too large for any edge such that the .
For any edge in the -th stage of the branching program such that , .
Let be an edge of the branching program from vertex to vertex such that . Since , the vertex is not significant (as otherwise stops on and ). As is not significant,
By Lemma 8,
where . Therefore, . ∎
We now use Lemma 9 to bound .
For any significant vertex in the -th stage of the branching program, .
Let be a significant vertex in . Let be the set of edges going into . We can write,
By the law of total probability, for every ,
By using Jensen’s inequality,
Summing over all ,
By Lemma 9, for any edge , . Hence . ∎
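The Jensen step used in the proof above is the standard convexity inequality for the square function: for weights $p_e \ge 0$ summing to one,

```latex
\Bigl(\sum_{e} p_e\, a_e\Bigr)^{2} \;\le\; \sum_{e} p_e\, a_e^{2},
```

which is how the squared quantity at the vertex, a convex combination over its incoming edges, is bounded by the weighted average of the squared edge quantities.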
Similarity to a Target Distribution
To show that the probability of reaching a significant vertex is small, we will argue that the posterior of on seeing a new example is not significantly similar to the target posterior distribution of a significant vertex. We use the inner product of two distributions to measure their similarity, and define it as follows. For two functions , define the inner product
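One standard way to instantiate such an inner product (a common convention, shown here only as an illustration since the paper's displayed definition is not reproduced) is the expectation of the pointwise product under the uniform measure on the domain:

```latex
% For functions f, g on a domain W:
\langle f, g \rangle \;\triangleq\; \mathbb{E}_{w \sim \mathrm{Unif}(W)}\bigl[f(w)\, g(w)\bigr],
\qquad
\|f\|_2 \;\triangleq\; \sqrt{\langle f, f \rangle}.
```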
Note that for a significant vertex in the -th stage,
We now bound the inner product of with all states in the first layer of the -th stage of .
For all states with in the first layer of the -th stage of ,
We claim that for all states in the first layer of . Consider the -th stage of the branching program . The truncated path stops at any significant vertex, and recall that a significant vertex for the -th stage is defined as a vertex where
Hence for all non-significant vertices in the -th stage of ,
Also, by Lemma 10 for all significant vertices in ,
Hence for all vertices in with ,
Note that , hence for all states with , as is also the last layer of . Now by Cauchy-Schwarz,
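The Cauchy-Schwarz step has the generic form

```latex
\langle f, g \rangle \;\le\; \|f\|_2 \, \|g\|_2,
```

applied here with one function being the posterior at a state and the other the target distribution.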
Note that the inner product of with itself is larger than the inner product of with for in the first layer by a factor of about , and in the next section we will argue that this inner product cannot increase too quickly in a small number of time steps, via a suitable potential function.
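A generic template for such a potential argument (with hypothetical notation: $q$ is the target posterior of a significant vertex and $\Phi_i$ tracks the best similarity achieved in layer $i$) is:

```latex
% Hypothetical potential: the largest similarity to the target q in layer i.
\Phi_i \;\triangleq\; \max_{v \in \text{layer } i}\; \langle \Pr[w \mid v],\, q \rangle.
% If each step can increase the potential by at most a factor (1 + \gamma), then
\Phi_T \;\le\; (1+\gamma)^{T}\, \Phi_0,
% so growing the potential by a factor R requires
% T \ge \log R / \log(1+\gamma) steps.
```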
Progress Towards Target Distribution
In this section, we bound how much progress can make towards a significant vertex in the -th stage of . For notational convenience, we will reindex all the layers in the -th stage so that the first layer in is labeled as .
Let denote the set of all vertices in the -th layer of the -th stage , with . Let denote the set of all edges from the -th layer to the -th layer of . For and , let
For , let