A functional central limit theorem for branching random walks, almost sure weak convergence, and applications to random trees
Let $W_\infty(\beta)$ be the limit of the Biggins martingale $W_n(\beta)$ associated to a supercritical branching random walk with mean number of offspring $m$. We prove a functional central limit theorem stating that as $n\to\infty$ the process
\[ D_n(u) := m^{\frac{1}{2} n} \left( W_\infty\left(\frac{u}{\sqrt{n}}\right) - W_n\left(\frac{u}{\sqrt{n}}\right) \right) \]
converges weakly, on a suitable space of analytic functions, to a Gaussian random analytic function with random variance. Using this result we prove central limit theorems for the total path length of random trees. In the setting of binary search trees, we recover a recent result of R. Neininger [Refined Quicksort Asymptotics, Rand. Struct. and Alg., to appear], but we also prove a similar theorem for uniform random recursive trees. Moreover, we replace weak convergence in Neininger’s theorem by the almost sure weak (a.s.w.) convergence of probability transition kernels. In the case of binary search trees, our result states that
\[ \mathcal{L}\left\{ \sqrt{\frac{n}{2\log n}} \left( \mathrm{EPL}_\infty - \frac{\mathrm{EPL}_n - 2n\log n}{n} \right) \,\Bigg|\, \mathcal{G}_n \right\} \to \{\omega \mapsto N_{0,1}\} \quad \text{a.s.w.,} \]
where $\mathrm{EPL}_n$ is the external path length of a binary search tree $X_n$ with $n$ vertices, $\mathrm{EPL}_\infty$ is the limit of the Régnier martingale, and $\mathcal{L}(\,\cdot\,|\,\mathcal{G}_n)$ denotes the conditional distribution w.r.t. the $\sigma$-algebra $\mathcal{G}_n$ generated by $X_1,\ldots,X_n$. A.s.w. convergence is stronger than weak and even stable convergence. We prove several basic properties of the a.s.w. convergence and study a number of further examples in which the a.s.w. convergence appears naturally. These include the classical central limit theorem for Galton–Watson processes and the Pólya urn.
Key words and phrases: Branching random walk, functional central limit theorem, Gaussian analytic function, binary search trees, random recursive trees, Quicksort distribution, stable convergence, mixing convergence, almost sure weak convergence, Pólya urns, Galton–Watson processes
2010 Mathematics Subject Classification: Primary, 60J80; secondary, 60F05, 60F17, 60B10, 68P10, 60G42
The research that led to the present paper was motivated by a question from the analysis of algorithms, specifically of the famous Quicksort and the closely related binary search tree (BST) algorithms. The question concerns the second-order (distributional) asymptotics of the number of comparisons needed by Quicksort or, equivalently, of the total path length of the associated random binary search trees, if the input to the algorithm is random.
Let the input sequence consist of independent random variables distributed uniformly on the interval . In the version considered here the Quicksort algorithm applied to the list proceeds as follows. It places , the first element of the list, at the root of a binary tree and divides the remaining elements into two sublists: The elements that are smaller than are collected into a sublist located to the left of , whereas the elements larger than are put into a sublist located to the right of . (Hence the first element of the list serves as the pivot, that is, the element used to subdivide the list). The procedure is then applied recursively to both sublists until only sublists of size remain. The random tree which is created in this way is called the binary search tree (BST); a more detailed description will be provided in Section 5.5.1.
For the analysis of the complexity of Quicksort the number of comparisons needed to sort the list is of major interest. In terms of the tree structure of sublists this is the sum of the depths of the nodes (also called the internal path length) of the binary search tree. As shown by Régnier , a suitable rescaling of leads to a martingale that converges almost surely to some limit variable as ,
The law of the limit is known as the Quicksort distribution; it has been characterized in terms of a stochastic fixed point equation by Rösler .
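The identity between the number of comparisons and the internal path length can be checked numerically. The following Python sketch is only an illustration, assuming distinct keys; the function names are ours and do not appear in the paper. It builds the BST by successive insertion, sums the node depths, and compares the result with a direct count of the comparisons made by Quicksort with the first element as pivot.

```python
import random


def bst_internal_path_length(keys):
    """Insert distinct keys into a BST one by one and return the sum of
    the depths of all nodes (the internal path length)."""
    left, right, depth = {}, {}, {}
    root = None
    total = 0
    for key in keys:
        if root is None:
            root = key
            depth[key] = 0
            continue
        node = root
        while True:
            # Descend to the left for smaller keys, to the right otherwise.
            children = left if key < node else right
            if node in children:
                node = children[node]
            else:
                children[node] = key
                depth[key] = depth[node] + 1
                total += depth[key]
                break
    return total


def quicksort_comparisons(keys):
    """Count the key comparisons of Quicksort with the first element as
    pivot: every remaining element is compared once with the pivot."""
    if len(keys) <= 1:
        return 0
    pivot, rest = keys[0], keys[1:]
    smaller = [x for x in rest if x < pivot]
    larger = [x for x in rest if x > pivot]
    return len(rest) + quicksort_comparisons(smaller) + quicksort_comparisons(larger)


if __name__ == "__main__":
    rng = random.Random(0)
    sample = [rng.random() for _ in range(1000)]
    print(bst_internal_path_length(sample), quicksort_comparisons(sample))
```

Both functions return the same number because an element at depth $d$ in the tree is compared exactly once with each of its $d$ ancestors, which are the successive pivots of the sublists containing it.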
where is the standard normal distribution. Neininger used the contraction method, which in the present context has been introduced by Rösler  in connection with the distributional convergence in (1). A proof based on the method of moments followed shortly .
The result (2) is surprising as for many martingales the step from a strong convergence result to a second-order distributional limit theorem leads to a variance mixture of normal distributions; see Hall and Heyde . Quite generally, whenever one has a martingale convergence result it is natural to ask whether there is a corresponding distributional limit theorem in the sense that, for some normalizing sequence and some non-degenerate random variable ,
Indeed, provided that appropriate technical conditions (which can be found in the references cited below) are satisfied, a distributional limit theorem of the type (3) is known to hold if
is the proportion of black balls in the Pólya urn after draws; see Hall and Heyde [17, pp. 80–81].
, where are i.i.d. random variables with zero mean, unit variance, and is an appropriate square summable deterministic sequence; see Loynes .
is the Biggins martingale of the branching random walk; see Rösler et al. .
In this list, (a), (c) and (d) can be related to the analysis of Quicksort, and in all three cases, the limit distribution is a nondegenerate mixture of normals.
We will use the well-known connection between the BST algorithm and the continuous-time branching random walk (BRW) to explain the degeneracy phenomenon. The state at time of a BRW is a random point measure recording the particle positions at that time; see Section 2 for a detailed description. A specific choice of branching mechanism and shift distribution leads to a representation of the point measure given by the depths of the external nodes in the BST with input size as the value at the random time of the birth of the th particle; see Chauvin et al. , , as well as the earlier work by Devroye  that connected Galton–Watson processes and random search trees. The BRW detour provides a new and independent proof of Neininger’s result. In addition, we obtain a stronger mode of convergence. Again, this is a topic familiar in connection with martingale central limit theorems, where it is known that a strengthening of distributional convergence to Rényi’s concept of stable convergence is often possible. In our situation we can go beyond even stable convergence, obtaining what we call almost sure weak convergence: With the martingale filtration we regard the conditional distribution of the left-hand side of (3) given as a random variable with values in the set of Borel probability measures on the real line; on this set we take the topology of weak convergence, and we show that the conditional distribution converges almost surely in this space as . In the Quicksort context, with the σ-field generated by , this results in
This can be applied to obtain strong prediction intervals; see Remark 5.21.
It turns out that in our context the familiar encoding of the BRW point measures by the Biggins martingale can best be exploited via a suitable functional central limit theorem for the latter. The Biggins martingale arises as a suitably standardized moment generating function of the point measures of particle positions and may thus be regarded, together with its limit, as a stochastic process indexed by a complex parameter $\beta$ that varies over some open set containing $0$. For $\beta$ fixed, an associated second order distributional limit has already been obtained by Rösler et al.; see (d) in the above list. Noting that the Régnier martingale appears as the derivative at $\beta=0$ of this process, we are led to rescale locally in order to obtain a functional version that captures the local behaviour. Of course, we also want a non-trivial limit. This is indeed possible and leads to Theorems 3.1 and 5.1, which we regard as our main results. Again, we obtain almost sure weak convergence, now on a suitable space of analytic functions. Further, the distribution of the limit can be represented as the distribution of the Gaussian random analytic function $\xi$ given by
\[ \xi(u) = \sum_{k=0}^{\infty} \xi_k \frac{u^k}{\sqrt{k!}}, \tag{4} \]
where $\xi_0, \xi_1, \ldots$ is a sequence of independent standard normals. Much as in the classical case of Donsker’s theorem, see Billingsley, this may serve as the starting point for distributional limit theorems for various functionals of the processes, but we believe that, apart from its applicability to the question that we started with, the BRW functional limit theorem is of interest in its own right.
Finally, the above approach is not limited to binary search trees: We also obtain an analogue of Neininger’s result for random recursive trees (RRTs). In fact, we obtain a new result even in the setting of the Pólya urn, see Section 4.2, and we treat Galton–Watson processes, BRW, BST, and RRT with a unified method.
The paper is organized as follows. In Section 2 we define the branching random walk and introduce the basic notation. The functional central limit theorem for the BRW is stated in Section 3. In Section 4 we define the almost sure weak convergence and prove some of its properties. A stronger version of the functional CLT involving the notion of the a.s.w. convergence is stated in Section 5. In the same section, we state a number of applications of the functional CLT including (2) and its analogues for other random trees. Proofs are given in Sections 6, 7, and 8.
2. Branching random walk
2.1. Description of the model
An informal picture of a branching random walk (BRW) is that of a time-dependent random cloud of particles located on the real line and evolving through a combination of splitting (branching) and shifting (random walk). The particles are replaced at the end of their possibly random lifetimes by a random number of offspring, with locations relative to their parent being random too. Our results will be valid for branching random walks both in discrete and continuous time. Let us describe both models.
Discrete-time branching random walk. At time we start with one particle located at zero. At any time every particle which is alive at this time disappears and is replaced (independently of all other particles and of the past of the process) by a random, non-empty cluster of particles whose displacements w.r.t. the original particle are distributed according to some fixed point process on . The number of particles in a cluster is (in general) random and is always assumed to be a.s. finite. Let be the number of particles which are alive at time . Note that is a Galton–Watson branching process. Denote by the positions of the particles at time . Let
be the point process recording the positions of the particles at time . The only parameter needed to identify the law of the discrete-time BRW is the law of the point process encoding the shifts of the offspring particles w.r.t. their parent.
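The discrete-time dynamics can be sketched in a few lines of Python. The sketch below is an illustration only: the cluster law (two offspring with independent standard normal displacements) is a hypothetical choice made for concreteness, and the function names are ours. Any other sampler returning a non-empty finite list of displacements could be passed instead.

```python
import random


def binary_gaussian_cluster(rng):
    """Hypothetical cluster law: two offspring, each displaced by an
    independent standard normal shift relative to the parent."""
    return [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]


def brw_step(positions, sample_cluster, rng):
    """Replace every particle by an independent copy of the cluster,
    shifted by the position of the parent."""
    new_positions = []
    for x in positions:
        for shift in sample_cluster(rng):
            new_positions.append(x + shift)
    return new_positions


def simulate_brw(n, sample_cluster, seed=0):
    """Return the list of particle positions at times 0, 1, ..., n."""
    rng = random.Random(seed)
    generations = [[0.0]]  # one particle at zero at time 0
    for _ in range(n):
        generations.append(brw_step(generations[-1], sample_cluster, rng))
    return generations


if __name__ == "__main__":
    generations = simulate_brw(5, binary_gaussian_cluster)
    print([len(g) for g in generations])
```

With this particular cluster law the generation sizes are deterministic (they double at every step); a random cluster size would turn them into a genuine Galton–Watson process.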
Continuous-time branching random walk. At time one particle is born at position . After its birth, any particle moves (independently of all other particles and of the past of the process) according to a Lévy process. After an exponential time with parameter , the particle disappears and at the same moment of time it is replaced by a random cluster of particles whose displacements w.r.t. the original particle are distributed according to some fixed point process . The new-born particles behave in the same way. All the random mechanisms involved are independent. Denote the number of particles at time by and note that is a branching process in continuous time. Let be the positions of the particles at time . Let
be the point process recording the positions of the particles at time . The law of the continuous-time BRW is determined by the parameters of the Lévy process, the intensity , and the law of the point process .
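A continuous-time version can be sketched similarly. The following Python code is an illustration under simplifying assumptions that are ours: the particles do not move between branching events (a trivial Lévy process), and the hypothetical cluster consists of two offspring, each shifted by $+1$, so that positions record generation depths. Function and parameter names are not taken from the paper.

```python
import random


def two_children_shift_one(rng):
    """Hypothetical cluster law: two offspring, each shifted by +1."""
    return [1.0, 1.0]


def simulate_ct_brw(t_end, rate, sample_cluster, rng):
    """Return the positions of the particles alive at time t_end.

    Every particle lives for an independent exponential time with the
    given rate and is then replaced by a cluster of offspring. Between
    branching events particles stay put (trivial Levy motion)."""
    alive = []
    stack = [(0.0, 0.0)]  # (birth time, position) of particles to process
    while stack:
        birth, x = stack.pop()
        death = birth + rng.expovariate(rate)
        if death > t_end:
            alive.append(x)  # still alive at time t_end
        else:
            for shift in sample_cluster(rng):
                stack.append((death, x + shift))
    return alive


if __name__ == "__main__":
    positions = simulate_ct_brw(3.0, 1.0, two_children_shift_one, random.Random(0))
    print(len(positions), sorted(positions))
```

For this binary example the weights $2^{-x}$, summed over the positions $x$ of the particles alive at any fixed time, add up to one, which gives a simple consistency check of the simulation.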
Both models can be treated by essentially the same methods. To simplify the notation, we will henceforth deal with the discrete-time BRW and indicate, whenever necessary, how the proofs should be modified in the continuous-time case.
2.2. Standing assumptions and the Biggins martingale
Let us agree that means a sum taken over all points of the point process , where the points are counted with multiplicities. We make the following standing assumptions on the BRW.
Assumption A: The cluster point process is a.s. non-empty, finite, and the probability that it consists of exactly one particle is strictly less than .
Assumption B: There are and such that for all ,
It follows from (5) that the function
is well-defined and analytic in the strip . Note that is the moment generating function of the intensity measure of . Assumption A implies that the BRW under consideration is supercritical, that is the mean number of particles at time satisfies
In a sufficiently small neighborhood of the function
is well-defined and analytic, and the restriction of to real is convex. By the martingale convergence theorem, there is a random variable such that
Since (by Assumption B) and the BRW never dies out (by Assumption A), we have a.s. The assumption that is non-empty could be removed (while retaining supercriticality); all results would then hold on the survival event.
A crucial role in the study of the branching random walk is played by the Biggins martingale:
Uchiyama  and Biggins  proved that if Assumption (5) holds with some , then there is such that the martingale is bounded in , , uniformly over all with . Furthermore, there is a random analytic function defined for such that a.s.,
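For a concrete cluster law the Biggins martingale can be evaluated directly. The Python sketch below is an illustration only. It assumes the usual normalization $W_n(\beta) = m(\beta)^{-n} \sum_i e^{\beta z_{i,n}}$, where $z_{i,n}$ are the particle positions at time $n$, and a hypothetical cluster law with two offspring and independent standard normal displacements, for which $m(\beta) = 2e^{\beta^2/2}$. All names in the code are ours.

```python
import cmath
import random


def simulate_positions(n, rng):
    """Positions at time n of a discrete-time BRW in which every particle
    has two offspring with independent N(0, 1) displacements
    (a hypothetical cluster law chosen for illustration)."""
    positions = [0.0]
    for _ in range(n):
        positions = [x + rng.gauss(0.0, 1.0) for x in positions for _child in range(2)]
    return positions


def m(beta):
    """Moment generating function of the intensity measure of the
    hypothetical cluster law: m(beta) = 2 * exp(beta^2 / 2)."""
    return 2.0 * cmath.exp(beta * beta / 2.0)


def biggins(positions, n, beta):
    """Assumed form of the Biggins martingale:
    W_n(beta) = m(beta)^(-n) * sum_i exp(beta * z_i)."""
    return sum(cmath.exp(beta * z) for z in positions) / m(beta) ** n


if __name__ == "__main__":
    rng = random.Random(1)
    for n in (2, 6, 10):
        print(n, biggins(simulate_positions(n, rng), n, 0.3 + 0.2j))
```

At $\beta = 0$ the sketch reduces to the particle count divided by $2^n$, which equals $1$ for this cluster law; complex values of $\beta$ are handled by `cmath`.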
We denote by the normal distribution with mean and variance . Given a non-negative random variable we denote by the mixture of zero mean normal distributions with random variance given by . Throughout the paper we will use the notation
A generic constant which may change from line to line is denoted by .
3. Functional Central Limit Theorem for the Biggins martingale
3.1. Statement of the FCLT
Under suitable conditions, Rösler et al.  proved for real in a certain interval around a CLT of the form
We will prove a functional version of (12). That is, we will consider the left-hand side of (12) as a random analytic function and prove weak convergence on a suitable function space. In order to obtain a non-degenerate limit process it will be necessary to introduce a spatial rescaling into the Biggins martingale. Namely, we consider
We have to be explicit about the function space to which belongs. Given let (resp., ) be the open (resp., closed) disk of radius centered at the origin. Denote by the set of functions which are continuous on and analytic in . Endowed with the supremum norm, becomes a Banach space. Note that is a closed linear subspace of the Banach space of continuous functions on . Being closed under multiplication, is even a Banach algebra. We always consider as a random element with values in (which is endowed with the Borel σ-algebra generated by the topology of uniform convergence). Recall that and are well defined on the disk for some , so that is indeed well defined as an element of for . Our results remain valid for some other choices of the function space; for example, one could replace by the Hardy space . Recall that and .
Fix any . The following convergence of random analytic functions holds weakly on the Banach space :
where is a random analytic function which is defined in Section 3.2 below, and which is independent of .
3.2. Gaussian analytic function
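The random analytic function $\xi$ appearing in Theorem 3.1 is defined as follows. Let $\xi_0, \xi_1, \ldots$ be independent real standard normal variables. Consider the random analytic function $\xi:\mathbb{C}\to\mathbb{C}$ defined by
\[ \xi(u) = \sum_{k=0}^{\infty} \xi_k \frac{u^k}{\sqrt{k!}}. \]
With probability $1$, the series converges uniformly on every bounded set because $|\xi_k| = O(\sqrt{\log k})$ a.s. Note that for every $d\in\mathbb{N}$ and $u_1,\ldots,u_d\in\mathbb{C}$, the $2d$-dimensional real random vector $(\operatorname{Re}\xi(u_1), \operatorname{Im}\xi(u_1), \ldots, \operatorname{Re}\xi(u_d), \operatorname{Im}\xi(u_d))$ is Gaussian with zero mean. The covariance structure of the process $\xi$ is given by
\[ \mathbb{E}[\xi(u)\xi(v)] = e^{uv}, \qquad \mathbb{E}[\xi(u)\overline{\xi(v)}] = e^{u\bar{v}}, \qquad u,v\in\mathbb{C}. \]
It follows that $X(t) := e^{-t^2/2}\xi(t)$, $t\in\mathbb{R}$, is a stationary real-valued Gaussian process with covariance function
\[ \operatorname{Cov}(X(t), X(s)) = e^{-\frac{1}{2}(t-s)^2}, \qquad t,s\in\mathbb{R}. \]
The spectral measure of $X$ is the standard normal distribution. We can view the process $\xi$ as an analytic continuation of the process $e^{t^2/2}X(t)$, $t\in\mathbb{R}$, to the complex plane.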
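The Gaussian analytic function can be sampled approximately by truncating its series. The Python sketch below is an illustration only; it assumes that the series has the form $\sum_{k\ge 0} \xi_k u^k/\sqrt{k!}$ with independent standard normal $\xi_k$, which is consistent with the spectral measure stated above, and the truncation level and function names are our choices.

```python
import math
import random


def sample_coefficients(num_terms, rng):
    """Independent standard normal coefficients xi_0, ..., xi_{num_terms-1}."""
    return [rng.gauss(0.0, 1.0) for _ in range(num_terms)]


def gaf(u, coeffs):
    """Truncated series sum_k xi_k * u^k / sqrt(k!) (assumed form)."""
    total = 0j
    term = 1 + 0j  # holds u^k / sqrt(k!)
    for k, xi_k in enumerate(coeffs):
        total += xi_k * term
        term *= u / math.sqrt(k + 1)
    return total


def series_covariance(u, v, num_terms):
    """E[xi(u) xi(v)] for the truncated series: sum_k (u v)^k / k!,
    which tends to exp(u v) as num_terms grows."""
    total, term = 0j, 1 + 0j
    for k in range(num_terms):
        total += term
        term *= u * v / (k + 1)
    return total


if __name__ == "__main__":
    rng = random.Random(2)
    coeffs = sample_coefficients(80, rng)
    print(gaf(0.5 + 0.5j, coeffs))
    print(series_covariance(0.5, 1.2, 80), math.exp(0.5 * 1.2))
```

For real arguments $t, s$ the covariance $e^{ts}$ multiplied by $e^{-t^2/2}e^{-s^2/2}$ equals $e^{-(t-s)^2/2}$, which is the stationary covariance of the rescaled process.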
A modification of in which the variables are independent complex standard normal is a fascinating object called the plane Gaussian Analytic Function (GAF) . A remarkable feature of the plane GAF is that its zeros form a point process whose distribution is invariant with respect to arbitrary translations and rotations of the complex plane. The law of the zero set of as defined in the present paper is invariant with respect to real translations only. The function and its complex analogue appeared as limits of certain random partition functions; see [21, 22].
4. Almost sure weak convergence of probability kernels
Our results are most naturally stated using the notion of almost sure weak (a.s.w.) convergence of probability kernels. This mode of convergence seems especially natural when dealing with randomly growing structures. In this section we define a.s.w. convergence and study its relation to other modes of convergence.
4.1. Basic definitions
Let be a complete separable metric (Polish) space endowed with the Borel σ-algebra . Let be the space of probability measures on . The weak convergence on is metrized by the Lévy–Prokhorov metric, which turns into a complete separable metric space.
A (probability transition) kernel is a random variable defined on a probability space and taking values in . We will write for the probability measure on corresponding to the outcome , and for the value assigned by the probability measure to a set . Instead of the above definition of kernels we can use the following: A kernel from a probability space to is a function such that
for every set , the map is -Borel-measurable;
for every , the map defines a probability measure on .
Probability kernels are also called random probability measures on .
In this paper, kernels will mostly appear in the form of a conditional distribution of a random variable given a σ-algebra. Let be a random variable defined on and taking values in a Polish space . Given a σ-algebra , a kernel is called (a version of) the conditional distribution of given if
is -measurable as a map from to ,
for all bounded Borel functions and all ,
In this case we use the notation .
Almost sure weak convergence
A sequence of kernels defined on a common probability space is said to converge almost surely with respect to weak convergence (a.s.w.) as if there exists a set with such that, for all , the probability measure converges weakly on to the probability measure , again as .
Let us state the above definition in a slightly different (but equivalent) form. Given a bounded Borel function and a kernel consider the random variable defined by
Then, a sequence of kernels converges to a kernel in the a.s.w. sense if and only if for every bounded continuous function we have
In fact, if we know that for every bounded continuous function , the random variable converges to some limit in the a.s. sense, then there is a kernel such that converges to a.s.w.; see .
A.s.w. convergence contains a.s. convergence as a special case. Indeed, let be random variables on the probability space . Then, the sequence converges a.s. to the random variable if and only if the sequence of kernels a.s.w. converges to the kernel .
A.s.w. convergence contains weak convergence as a special case. Let be probability measures on . The sequence converges weakly to if and only if the sequence of kernels converges a.s.w. to the kernel .
Stable and mixing convergence
The a.s.w. convergence is related to the stable convergence which was introduced by Rényi , , . We recall the definition of stable convergence referring to  for more details and references. A sequence of kernels converges stably to a kernel if for every set and every bounded continuous function , we have
Of particular interest for us will be the following special case of this definition. Let be a sequence of random variables defined on a probability space and taking values in a Polish space . We say that converges stably to a kernel if the sequence of kernels converges stably to . That is to say, for every set and every bounded continuous function , we have
Taking in this definition we see that stable convergence implies weak convergence of to the law obtained by mixing over .
A special case of stable convergence is the mixing convergence. We say that converges to a probability distribution on in the mixing sense if converges stably to the kernel . In this case, we write
By the above, mixing convergence implies weak convergence to the same limit.
Another way of expressing these definitions is the following: A sequence of random variables converges stably if for every event with the conditional distribution of given converges weakly to some probability distribution on . The limiting probability distribution is given by
and, in general, depends on . The limiting kernel can be seen as the Radon–Nikodym density of the -valued measure . If the limiting distribution does not depend on the choice of , then we have mixing convergence.
4.2. An example of a.s.w. convergence: The Pólya urn
Consider an urn initially containing black and red balls. In each step, draw a ball from the urn at random and replace it together with balls of the same color. Let and be the numbers of black and red balls after draws and let be the σ-algebra generated by the first draws. It is well known that the proportion of black balls after draws is a martingale w.r.t. the filtration and that
We claim that
where . The kernel on the right-hand side maps an outcome to the centered normal distribution on with variance . We will prove in Proposition 4.7 and Remark 4.8 below that (21) implies distributional convergence to the normal mixture:
One can establish (22) as a direct consequence of the de Moivre–Laplace CLT by noting that conditionally on , the results of individual draws are i.i.d. Bernoulli variables with parameter . Of course, (22) is well-known; see [20, Section 3] or [17, pp. 80–81] (where it is deduced as a special case of the CLT for martingales), but (21) is stronger than (22).
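The urn scheme itself is easy to simulate. The following Python sketch is an illustration only; the parameter names `black`, `red` and `added` (the number of balls of the drawn color that are added after each draw) are ours, and the initial composition in the usage example is a hypothetical choice.

```python
import random


def polya_urn(black, red, added, draws, rng):
    """Simulate a Polya urn and return the final composition together
    with the proportion of black balls after each draw."""
    proportions = []
    for _ in range(draws):
        # Draw a ball uniformly at random from the current content.
        if rng.random() < black / (black + red):
            black += added  # a black ball was drawn
        else:
            red += added  # a red ball was drawn
        proportions.append(black / (black + red))
    return black, red, proportions


if __name__ == "__main__":
    _, _, path = polya_urn(1, 1, 1, 10000, random.Random(3))
    print(path[99], path[999], path[9999])
```

Along a single run the printed proportions stabilize, in line with the almost sure convergence of the martingale; different seeds lead to different limits, which reflects the randomness of the limiting proportion.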
Proof of (21).
The random variables are -measurable. For the conditional law of given we have, recalling (20),
So, the conditional law on the left-hand side of (21) is given by the kernel
where denotes a random variable with distribution.
We will use the following CLT for the Beta distribution. Let be two sequences such that and , as . Then,
The proof of (23) is standard and proceeds as follows. Denote by independent random variables having Gamma distributions with shape parameters and respectively, and scale parameter . Since has the same distribution as , we can rewrite the left-hand side of (23) as follows:
The first factor converges weakly to the standard normal distribution (as one can easily see by computing its characteristic function), whereas the second factor converges in probability to . Slutsky’s lemma completes the proof of (23).
Now, we apply (23) to and . Noting that for a.a. , we have and , we obtain that converges weakly to , for a.a. . ∎
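The Gamma representation used in the proof of (23) can also be explored numerically. The Python sketch below is an illustration only: it samples a Beta variable as a ratio of independent Gamma variables with a common scale (taken to be $1$ here) and standardizes it by its exact mean and variance. This standardization is our choice and need not coincide with the normalization in (23); the function names are ours.

```python
import math
import random


def beta_via_gammas(a, b, rng):
    """Sample Beta(a, b) as G1 / (G1 + G2) with independent Gamma
    variables of shapes a and b and common scale 1."""
    g1 = rng.gammavariate(a, 1.0)
    g2 = rng.gammavariate(b, 1.0)
    return g1 / (g1 + g2)


def standardized_beta(a, b, rng):
    """Center and scale a Beta(a, b) sample by its exact mean and
    standard deviation (a choice made for this illustration)."""
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1.0))
    return (beta_via_gammas(a, b, rng) - mean) / math.sqrt(var)


if __name__ == "__main__":
    rng = random.Random(4)
    samples = [standardized_beta(200.0, 300.0, rng) for _ in range(20000)]
    inside = sum(1 for s in samples if abs(s) <= 1.96) / len(samples)
    print(inside)  # should be close to 0.95 if the normal approximation is good
```

When both shape parameters are large, the fraction of standardized samples in $[-1.96, 1.96]$ should be close to the standard normal value, in agreement with the CLT for the Beta distribution.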
4.3. Properties of the a.s.w. convergence
Taken together, the following proposition and examples show that a.s.w. convergence is strictly stronger than stable convergence.
Let be a sequence of kernels converging to a kernel in the a.s.w. sense. Then, converges to stably.
Let be a bounded continuous function. By definition of the a.s.w. convergence, the sequence converges to for a.a. . Also, is bounded by . By the dominated convergence theorem, (18) holds. So, converges to stably. ∎
Let us show that, in general, stable convergence does not imply a.s.w. convergence. Let be non-degenerate i.i.d. random variables with probability distribution . Then, the sequence of kernels converges stably (in fact, mixing) to the kernel . This is equivalent to saying that the i.i.d. sequence is mixing in the sense of ergodic theory. Alternatively, note that by the i.i.d. property, for every fixed , and apply [31, Thm. 2]. However, does not converge a.s.w. because the sequence does not converge a.s.
Let be i.i.d. random variables with , . Consider the random variables . Then, the kernels converge stably (in fact, mixing) to the kernel ; see [31, Thm. 4] or [1, Thm. 2]. However, does not converge a.s.w. because the sequence does not converge a.s. On the other hand, the central limit theorems for branching random walks which we will state and prove below hold not only stably but even in the a.s.w. sense.
Let be a filtration on a probability space . Let be a sequence of random variables defined on and taking values in a Polish space . Assume that for every , the random variable is measurable w.r.t. the -algebra (but not necessarily w.r.t. ). If the sequence of conditional laws converges to a kernel in the a.s.w. sense, then converges stably to .
In particular, converges in distribution to the probability measure obtained by mixing the probability measures over . That is, for every Borel set ,
Proof of Proposition 4.7.
Let be a bounded continuous function. We will show that for every bounded -measurable function ,
Let first for some , where is fixed. Because of the filtration property, for all . Applying (17) to the conditional law , we obtain that for all ,
For a.a. the probability measure converges weakly to , and hence, the sequence (which is bounded by ) converges as to . By the dominated convergence theorem we immediately obtain (24).
A standard approximation argument extends (24) to all -measurable bounded functions . Finally, let be -measurable and bounded. In this case, one can reduce (24) to the case of -measurable function . Namely, since is -measurable, we have
Similarly, since the -valued map is -measurable (as an a.s. limit of -measurable maps ),
So, it suffices to establish (24) for the function instead of , but this was already done above since is -measurable and bounded. ∎
Let be a filtration on a probability space . Write . Let be random variables defined on such that a.s. and for some constant . Then,
Let be a filtration on a probability space . Let , , be complex-valued random variables defined on . Suppose that for some kernel ,
If a.s., then converges to a.s.w.
If a.s., then converges to a.s.w.
Note that we do not assume to be -measurable. With this assumption, the proposition would become trivial.
Proof of part (a).
We can find a sequence of uniformly continuous, bounded functions with the property that a sequence of probability measures converges weakly on to a probability measure if and only if for every ,
Fix some . We know from (25) that
where denotes the random variable . Since is uniformly continuous and a.s., we have
This holds for every . Hence, converges a.s.w. to .
Proof of part (b). Part (b) can be reduced to part (a) by noting that and converges a.s. to . ∎
The following result shows that a.s.w. convergence of conditional laws is preserved under filtration coarsening.
Let be a filtration on a probability space . Let be random variables defined on and taking values in a Polish space . Suppose that the sequence of conditional laws converges as to the kernel in the a.s.w. sense. Let be another filtration on such that and let . Then,
Let be bounded continuous functions such that a sequence of probability measures on converges weakly to if and only if converges to as , for all . Let be the function and define similarly. Then, a.s.w. means that a.s., for all . Using the definition of conditional distributions, it is easy to check that . By Lemma 4.9, we have
Since this holds for every we obtain that a.s.w. ∎
5. Conditional Functional Central Limit Theorem and applications to random trees
5.1. Statement of the conditional FCLT
We are almost ready to state a stronger version of Theorem 3.1. Consider a branching random walk in discrete or continuous time defined on a probability space and satisfying the assumptions of Section 2.2. Denote by the -algebra generated by the BRW up to time (discrete-time case) or (continuous-time case). For our applications to the analysis of algorithms we need to state a functional CLT valid over an arbitrary increasing sequence of stopping times. Let be a monotone increasing sequence of stopping times w.r.t. the filtration such that a.s.,
In the discrete-time case we assume additionally that takes values in . Two special cases (which make sense both for discrete and continuous time) will be of interest to us:
is the time at which the -th particle is born.
The second special case will be needed for the above-mentioned applications. Let be the σ-algebra generated by the branching random walk up to the stopping time .
Fix . Consider the following random analytic function on the disk :
We will prove that the conditional distribution of under converges to some limiting kernel , in the a.s.w. sense. To describe the limiting kernel , we use the random variable from (8) (defined on the same probability space as the branching random walk) and the random analytic function described in Section 3.2 ( may be defined on a different probability space). For we define to be the distribution (on ) of the random analytic function