Polynomial approximations to continuous functions and stochastic compositions
This paper presents a stochastic approach to theorems concerning the behavior of iterations of the Bernstein operator $B_n$ taking a continuous function $f \in C[0,1]$ to a degree-$n$ polynomial when the number of iterations $k$ tends to infinity and $n$ is kept fixed or when $n$ tends to infinity as well. In the first instance, the underlying stochastic process is the so-called Wright-Fisher model, whereas, in the second instance, the underlying stochastic process is the Wright-Fisher diffusion. Both processes are probably the most basic ones in mathematical genetics. By using Markov chain theory and stochastic compositions, we explain probabilistically a theorem due to Kelisky and Rivlin, and by using stochastic calculus we compute a formula for the application of $B_n$ a number of times $k = k(n)$ to a polynomial $f$ when $k(n)/n$ tends to a constant.
About 100 years ago, Bernstein  introduced a concrete sequence of polynomials approximating a continuous function on a compact interval. That polynomials are dense in the set of continuous functions was shown by Weierstrass , but Bernstein was the first to give a concrete method, one that has withstood the test of time. We refer to  for a history of approximation theory, including inter alia historical references to Weierstrass’ life and work and to the subsequent work of Bernstein. Bernstein’s approach was probabilistic and is nowadays included in numerous textbooks on probability theory, see, e.g., [21, p. 54] or [4, Theorem 6.2].
Several years after Bernstein’s work, the stochastic model nowadays known as the Wright-Fisher model was introduced and proved to be a founding one for the area of quantitative genetics. The work was done in the context of Mendelian genetics by Ronald A. Fisher [11, 10] and Sewall Wright.
This paper aims to explain the relation between the Wright-Fisher model and the Bernstein operator $B_n$, that takes a function $f$ and outputs a degree-$n$ approximating polynomial. Bernstein’s original proof was probabilistic. It is thus natural to expect that subsequent properties of $B_n$ can also be explained via probability theory. In doing so, we shed new light on what happens when we apply the Bernstein operator $B_n$ a large number of times $k$ to a function $f$. In fact, things become particularly interesting when $n$ and $k$ converge simultaneously to $\infty$. This convergence can be explained by means of the original Wright-Fisher model as well as a continuous-time approximation to it known as Wright-Fisher diffusion.
Our paper was inspired by the Monthly paper of Abel and Ivan that gives a short proof of the Kelisky and Rivlin theorem regarding the limit of the iterates of $B_n$ when $n$ is fixed. We asked what is the underlying stochastic phenomenon that explains this convergence and found that it is the composition of independent copies of the empirical distribution function of $n$ i.i.d. uniform random variables. The composition turns out to be precisely the Wright-Fisher model. Being a Markov chain with absorbing states, $0$ and $1$, its distributional limit is a random variable that takes values in $\{0,1\}$; whence the Kelisky and Rivlin theorem.
Composing stochastic processes is in line with the first author’s current research interests . Indeed, such compositions often turn out to have interesting, nontrivial, limits . Stochastic compositions become particularly interesting when they explain some natural mathematical or physical principles. This is what we do, in a particular case, in this paper. Besides giving fresh proofs to some phenomena, stochastic compositions help find what questions to ask as well.
We will specifically provide probabilistic proofs for a number of results associated to the Bernstein operator (1). First, we briefly recall Bernstein’s probabilistic proof (Theorem 1) that says that $B_n f$ converges uniformly to $f$ as the degree $n$ converges to infinity. Second, we look at iterates $B_n^k$ of $B_n$, meaning that we compose $B_n$ $k$ times with itself, and give a probabilistic proof of the Kelisky and Rivlin theorem stating that $B_n^k f$ converges to $B_1 f$ as the number of iterations $k$ tends to infinity (Theorem 2). Third, we exhibit, probabilistically, a geometric rate of convergence to the Kelisky and Rivlin theorem (Proposition 1). Fourth, we examine the limit of $B_n^k f$ when both $n$ and $k$ converge to infinity in a way that $k/n$ converges to a constant (Theorem 3) and show that probability theory gives us a way to prove and set up computation methods for the limit for “simple” functions $f$ such as polynomials (Proposition 2). A crucial step is the so-called Voronovskaya’s theorem (Theorem 4) which gives a rate of convergence to Bernstein’s theorem but also provides the reason why the Wright-Fisher model converges to the Wright-Fisher diffusion; this is explained in Section 5.
Regarding notation, we let $C[0,1]$ be the set of continuous functions $f:[0,1]\to\mathbb{R}$, and $C^2[0,1]$ the set of functions having a continuous second derivative $f''$, including the boundary points, so $f''(0)$ (respectively, $f''(1)$) is interpreted as derivative from the right (respectively, left). For a bounded function $f:[0,1]\to\mathbb{R}$, we denote by $\|f\|$ the quantity $\sup_{0\le x\le 1}|f(x)|$.
2 Recalling Bernstein’s theorem
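The Bernstein operator $B_n$ maps any function $f:[0,1]\to\mathbb{R}$ into the polynomial
$$B_n f(x) := \sum_{j=0}^{n} \binom{n}{j} x^j (1-x)^{n-j} f\Bigl(\frac{j}{n}\Bigr). \tag{1}$$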
We are mostly interested in viewing $B_n$ as an operator on $C[0,1]$. Bernstein’s theorem is:
Theorem 1 (Bernstein, 1912).
If $f \in C[0,1]$ then $B_n f$ converges uniformly to $f$:
$$\lim_{n\to\infty} \max_{0\le x\le 1} |B_n f(x) - f(x)| = 0.$$
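To make the operator concrete, here is a minimal Python sketch of the Bernstein polynomial as a binomial average of the sampled values $f(j/n)$. The helper names `bernstein` and `sup_error`, the 201-point evaluation grid and the test function $|x - 1/2|$ are illustrative choices and do not come from the text; the grid maximum is only a stand-in for the true maximum over $[0,1]$.

```python
from math import comb

def bernstein(f, n):
    """Return the function x -> sum_j C(n,j) x^j (1-x)^(n-j) f(j/n)."""
    values = [f(j / n) for j in range(n + 1)]  # only these n+1 samples of f are used
    def Bnf(x):
        return sum(comb(n, j) * x**j * (1 - x)**(n - j) * values[j]
                   for j in range(n + 1))
    return Bnf

def sup_error(f, n, grid_size=201):
    """Approximate max |B_n f - f| on [0,1] by a maximum over a uniform grid."""
    Bnf = bernstein(f, n)
    xs = [i / (grid_size - 1) for i in range(grid_size)]
    return max(abs(Bnf(x) - f(x)) for x in xs)

# Hypothetical test function: continuous, but not differentiable at 1/2.
f = lambda x: abs(x - 0.5)
errors = [sup_error(f, n) for n in (4, 16, 64, 256)]
```

In this sketch `errors` shrinks as the degree grows, which is the behavior Theorem 1 asserts; the code illustrates the statement numerically and proves nothing.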
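The proof of this theorem is elementary if probability theory is used and goes like this. Let $X_1, \ldots, X_n$ be independent Bernoulli random variables with $P(X_i = 1) = x$, $P(X_i = 0) = 1-x$, for some $0 \le x \le 1$. If $S_n$ denotes the number of variables with value $1$ then $S_n$ has a binomial distribution:
$$P(S_n = j) = \binom{n}{j} x^j (1-x)^{n-j}, \quad j = 0, 1, \ldots, n. \tag{2}$$
Therefore $E f(S_n/n) = B_n f(x)$.
Since $f$ is continuous on the compact set $[0,1]$ it is also uniformly continuous and so $m(\epsilon) := \max_{|x-y| \le \epsilon} |f(x) - f(y)| \to 0$ as $\epsilon \to 0$. Let $A$ be the event that $|S_n/n - x| \le \epsilon$ and $\mathbf{1}_A$ the indicator of $A$ (a function that is $1$ on $A$ and $0$ on its complement). We then write
$$|B_n f(x) - f(x)| \le E\bigl[|f(S_n/n) - f(x)|\,\mathbf{1}_A\bigr] + E\bigl[|f(S_n/n) - f(x)|\,\mathbf{1}_{A^c}\bigr] \le m(\epsilon) + 2\|f\|\,P(A^c).$$
By Chebyshev’s inequality,
$$P(A^c) = P(|S_n/n - x| > \epsilon) \le \frac{x(1-x)}{n\epsilon^2} \le \frac{1}{4n\epsilon^2}, \quad\text{so}\quad \|B_n f - f\| \le m(\epsilon) + \frac{\|f\|}{2n\epsilon^2}.$$
Letting $n \to \infty$ the last term goes to $0$ and letting $\epsilon \to 0$ the first term vanishes too, thus establishing the theorem.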
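A variant of Bernstein’s theorem due to Marc Kac gives a better estimate if $f$ is Lipschitz or, more generally, Hölder continuous. Indeed, if $f$ satisfies $|f(x) - f(y)| \le C|x-y|^\alpha$ for some $0 < \alpha \le 1$ then,
$$|B_n f(x) - f(x)| \le C\,E|S_n/n - x|^\alpha \le C\,\bigl(E(S_n/n - x)^2\bigr)^{\alpha/2} = C\,\Bigl(\frac{x(1-x)}{n}\Bigr)^{\alpha/2},$$
where the first inequality used the Hölder continuity of $f$, while the second used Jensen’s inequality twice; indeed, if $Z$ is a positive random variable then $E Z^\alpha \le (EZ)^\alpha$, by the concavity of the function $z \mapsto z^\alpha$ when $0 < \alpha \le 1$.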
Hölder continuous functions with small $\alpha$ are “rough” functions. The previous remark tells us that we may not have a good rate of convergence for these functions. On the other hand, if $f$ is smooth can we expect a good rate of convergence? A simple calculation with $f(x) = x^2$ shows that $B_n f(x) - f(x) = x(1-x)/n$. Excluding the trivial case $f(x) = ax + b$ (the only functions for which $B_n f = f$), can the rate of convergence be better than $1/n$ for some smooth function $f$? No, and this is due to Voronovskaya’s theorem (Theorem 4 in Section 7).
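The calculation for $f(x) = x^2$ can be checked numerically. The sketch below evaluates the Bernstein polynomial directly from its binomial sum and compares $B_n f(x) - f(x)$ with $x(1-x)/n$; the helper names `bernstein_at` and `gap` and the chosen values of $n$ and $x$ are illustrative assumptions.

```python
from math import comb

def bernstein_at(f, n, x):
    """Evaluate B_n f(x) from the binomial sum."""
    return sum(comb(n, j) * x**j * (1 - x)**(n - j) * f(j / n)
               for j in range(n + 1))

def gap(f, n, x):
    """B_n f(x) - f(x)."""
    return bernstein_at(f, n, x) - f(x)

square = lambda x: x * x

# Each tuple holds (n, x, computed gap, predicted gap x(1-x)/n).
checks = [(n, x, gap(square, n, x), x * (1 - x) / n)
          for n in (5, 50) for x in (0.1, 0.5, 0.9)]

# n * gap at x = 1/2 stays at 1/4, so the error at this point is exactly of order 1/n.
scaled = [n * gap(square, n, 0.5) for n in (10, 100, 400)]
```

Since `scaled` does not decay with $n$, this smooth function already sits at the $1/n$ rate discussed above.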
Remark 3 (Some properties of the Bernstein operator).
3 Iterating Bernstein operators
Let $B_n^2 := B_n \circ B_n$ be the composition of $B_n$ with itself and, similarly, let $B_n^k := B_n \circ \cdots \circ B_n$ ($k$ times). Abel and Ivan give a short proof of the following.
Theorem 2 (Kelisky and Rivlin, 1967).
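For fixed $n$, and any function $f:[0,1]\to\mathbb{R}$,
$$\lim_{k\to\infty} \max_{0\le x\le 1} \bigl|B_n^k f(x) - f(0) - (f(1) - f(0))\,x\bigr| = 0.$$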
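Because $B_n g$ only uses the values of $g$ at the nodes $0, 1/n, \ldots, 1$, the iterates can be tracked exactly at those nodes by repeated weighted averaging. The following sketch does this and measures the distance to the chord $f(0) + (f(1) - f(0))x$. The function names and the cubic test function are hypothetical, and the distance is measured at the nodes only.

```python
from math import comb

def node_weights(n):
    """W[i][j] = C(n,j) (i/n)^j (1-i/n)^(n-j): weight of g(j/n) in B_n g(i/n)."""
    return [[comb(n, j) * (i / n)**j * (1 - i / n)**(n - j)
             for j in range(n + 1)] for i in range(n + 1)]

def iterate_on_nodes(f, n, k):
    """Values of B_n^k f at the nodes 0, 1/n, ..., 1 (k = 0 returns f itself)."""
    W = node_weights(n)
    v = [f(j / n) for j in range(n + 1)]
    for _ in range(k):
        v = [sum(W[i][j] * v[j] for j in range(n + 1)) for i in range(n + 1)]
    return v

def distance_to_chord(f, n, k):
    """Max over nodes of |B_n^k f - (f(0) + (f(1) - f(0)) x)|."""
    v = iterate_on_nodes(f, n, k)
    return max(abs(v[i] - (f(0) + (f(1) - f(0)) * i / n)) for i in range(n + 1))

f = lambda x: x**3 - x**2 + 0.5          # illustrative choice with f(0) = f(1)
d = [distance_to_chord(f, 6, k) for k in (1, 10, 50, 200)]
```

Since all iterates are polynomials of degree at most $n$, agreement at the $n+1$ nodes determines them everywhere, which is why tracking node values suffices in this sketch.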
A probabilistic proof for Theorem 2.
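To prepare the ground, we construct the earlier Bernoulli random variables in a different way. We take $U_1, \ldots, U_n$ to be independent random variables, all uniformly distributed on the interval $[0,1]$, and their empirical distribution function
$$G_n(x) := \frac{1}{n} \sum_{j=1}^{n} \mathbf{1}_{U_j \le x}, \quad 0 \le x \le 1. \tag{4}$$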
We shall think of $G_n$ as a random function. Note that, for each $x$, $nG_n(x)$ has the binomial distribution (2). The advantage of the current representation is that $x$, instead of being a parameter of the probability distribution, is now an explicit parameter of the new random object $G_n$. We are allowed to (and we will) pick a sequence $G_n^{(1)}, G_n^{(2)}, \ldots$ of independent copies of $G_n$. For a positive integer $k$ let
$$G_n^k := G_n^{(k)} \circ G_n^{(k-1)} \circ \cdots \circ G_n^{(1)} \tag{5}$$
be the composition of the first $k$ random functions. So $G_n^k$ is itself a random function. By using the independence and the definition of $B_n$ we have that (see also Section A1 in the Appendix)
$$B_n^k f(x) = E f\bigl(G_n^k(x)\bigr) \tag{6}$$
for any function $f:[0,1]\to\mathbb{R}$. Hence the limit over $k$ of the right-hand side is the expectation of the limit of the random variable $f(G_n^k(x))$, if this limit exists. (We make use of the fact that (6) is a finite sum!) To see that this is the case, we fix $n$ and $x$ and consider the sequence
$$G_n^k(x), \quad k = 1, 2, \ldots,$$
with values in
$$I_n := \{0, 1/n, 2/n, \ldots, (n-1)/n, 1\}.$$
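We observe that this has the Markov property, namely, $G_n^{k+1}(x)$ is independent of $G_n^1(x), \ldots, G_n^{k-1}(x)$, conditional on $G_n^k(x)$. (Admittedly, it is a bit unconventional to use an upper index for the time parameter of a Markov chain but, in our case, we keep it this way because it appears naturally in the composition operation.) By (2), the one-step transition probability of this Markov chain is
$$p\Bigl(\frac{i}{n}, \frac{j}{n}\Bigr) := P\Bigl(G_n^{k+1}(x) = \frac{j}{n} \Bigm| G_n^k(x) = \frac{i}{n}\Bigr) = \binom{n}{j} \Bigl(\frac{i}{n}\Bigr)^j \Bigl(1 - \frac{i}{n}\Bigr)^{n-j}. \tag{7}$$
Since $p(0,0) = p(1,1) = 1$, states $0$ and $1$ are absorbing, whereas for any $y \in I_n \setminus \{0,1\}$, we have $p(y,z) > 0$ for all $z \in I_n$. Define the absorption time
$$T(x) := \inf\{k \ge 1 : G_n^k(x) \in \{0,1\}\}.$$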
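Elementary Markov chain theory [12, Ch. 11] tells us that
For all $x \in [0,1]$, $P(T(x) < \infty) = 1$.
Therefore, with probability $1$, we have that
$$G_n^k(x) = G_n^{T(x)}(x) \quad \text{for all } k \ge T(x).$$
Hence $f(G_n^k(x)) = f(G_n^{T(x)}(x))$ for all but finitely many $k$ and so
$$\lim_{k\to\infty} E f\bigl(G_n^k(x)\bigr) = E f\bigl(G_n^{T(x)}(x)\bigr).$$
But the random variable $W := G_n^{T(x)}(x)$ takes two values: $0$ and $1$. Notice that $E G_n^k(x) = x$ for all $k$. Hence $EW = x$. But $EW = P(W = 1)$. Thus $P(W = 1) = x$, and $P(W = 0) = 1 - x$. Hence
$$\lim_{k\to\infty} B_n^k f(x) = E f(W) = f(0)(1 - x) + f(1)\,x.$$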
This proves the announced limit of Theorem 2 but without the uniform convergence. However, since all polynomials of the sequence $B_n^k f$, $k = 1, 2, \ldots$, are of degree at most $n$, and $n$ is a fixed number, convergence for each $x$ implies convergence of the coefficients of the polynomials, and hence uniform convergence on $[0,1]$. ∎
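The mechanism of the proof can be watched in a small simulation: compose independent empirical distribution functions of $n$ uniforms, starting from $x$, until the value is absorbed at $0$ or $1$. This is a Monte Carlo sketch under stated assumptions: the seed, the number of runs, the values $n = 8$ and $x = 0.3$, and the function names are all illustrative, and the estimate agrees with $x$ only up to sampling error.

```python
import random

def empirical_cdf_step(y, n, rng):
    """One draw of G_n(y): the fraction of n fresh uniforms that are <= y."""
    return sum(rng.random() <= y for _ in range(n)) / n

def run_until_absorbed(x, n, rng, max_steps=100_000):
    """Apply independent copies of G_n, starting from x, until the value is 0 or 1."""
    y, steps = x, 0
    while 0.0 < y < 1.0 and steps < max_steps:
        y = empirical_cdf_step(y, n, rng)
        steps += 1
    return y, steps

rng = random.Random(2024)            # fixed seed so that the run is reproducible
n, x, runs = 8, 0.3, 5000
results = [run_until_absorbed(x, n, rng) for _ in range(runs)]

# Empirical counterpart of P(W = 1) = x from the proof.
frac_absorbed_at_one = sum(y == 1.0 for y, _ in results) / runs
mean_absorption_time = sum(s for _, s in results) / runs
```

Every run ends in one of the two absorbing states, and the fraction absorbed at $1$ should be close to $x$, in line with the identity $P(W = 1) = x$ used above.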
Lemma 2 (Convexity preservation).
If $f$ is convex then so is $B_n f$.
Let and . We shall prove that
This can be done by direct computation using (2). Alternatively, we can give a probabilistic argument. Consider and compute first-order terms in . By (4), is nonzero if and only if at least one of the ’s falls in the interval . The probability that or more variables fall in this interval is , as . Hence, if is the event that exactly one of the variables falls in this interval, then
If we let be the event that only is in , then is independent of , so . So (10) becomes
Bring in now the assumption that is convex, whence , for all , and deduce that for all . So is a convex function. ∎
Figure: Convergence of the iterates $B_n^k f$ as $k \to \infty$ for a convex $f$ (left) and a nonconvex one (right).
We can now exhibit a rate of convergence.
For all , ,
We have, for all positive integers and ,
where we used the inequality , for all . Therefore,
we obtain the recursion
Taking into account that we find that . ∎
This should be compared with [1, Eq. (4)] that says that , for some constant which has not been computed in , whereas we have an explicit constant . Now, the factor is probably wasteful and this comes from the fact that the inequality is not good when is large. We only used it because of the simplicity of the right-hand side that enabled us to compute very easily. We have a better inequality, namely (11), but to make it explicit one needs to compute .
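The geometric decay can be observed numerically. The sketch below does not reproduce the constants of the proposition; it only measures, at the nodes $i/n$, the distance between $B_n^k f$ and the chord $B_1 f$ for $f(x) = x^2$. For this particular $f$, the identity $B_n f(x) - f(x) = x(1-x)/n$ from Section 2 gives $B_n^k f(x) - x = -(1 - 1/n)^k\, x(1-x)$, so consecutive distances should have ratio $1 - 1/n$. Function names and the choice $n = 10$ are assumptions made for illustration.

```python
from math import comb

def apply_on_nodes(v, n):
    """Node values of B_n g computed from the node values v[j] = g(j/n)."""
    return [sum(comb(n, j) * (i / n)**j * (1 - i / n)**(n - j) * v[j]
                for j in range(n + 1)) for i in range(n + 1)]

def chord_distances(f, n, k_max):
    """d[k] = max over nodes of |B_n^k f - B_1 f| for k = 0, ..., k_max."""
    chord = [f(0) + (f(1) - f(0)) * i / n for i in range(n + 1)]
    v = [f(i / n) for i in range(n + 1)]
    out = []
    for _ in range(k_max + 1):
        out.append(max(abs(a - b) for a, b in zip(v, chord)))
        v = apply_on_nodes(v, n)
    return out

n = 10
d = chord_distances(lambda x: x * x, n, 40)
ratios = [d[k + 1] / d[k] for k in range(40)]   # each should equal 1 - 1/n
```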
4 Interlude: population genetics and the Wright-Fisher model
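We now take a closer look at the Markov chain described by the sequence $G_n^k(x)$, $k = 1, 2, \ldots$, for fixed $n$. We repeat formula (7):
$$P\Bigl(G_n^{k+1}(x) = \frac{j}{n} \Bigm| G_n^k(x) = \frac{i}{n}\Bigr) = \binom{n}{j} \Bigl(\frac{i}{n}\Bigr)^j \Bigl(1 - \frac{i}{n}\Bigr)^{n-j}.$$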
We recognize that it describes the simplest stochastic model for reproduction in population genetics that goes as follows. There is a population of $N$ individuals each of which carries 2 genes. Genes come in 2 variants, I and II, say. Thus, an individual may have 2 genes of type I both, or of type II both, or one of each. Hence there are $n = 2N$ genes in total. We observe the population at successive generations and assume that generations are non-overlapping. Suppose that the $k$-th generation consists of $i$ genes of type I and $n - i$ of type II. In Figure 1 below, type I genes are yellow, and type II are red.
To specify generation $k+1$, we let each gene of generation $k+1$ select a single “parent” at random from the genes of the previous generation. The gene adopts the type of its parent. The parent selection is independent across genes. The probability that a specific gene selects a parent of type I is $i/n$. Since we have $n$ independent trials, the probability that generation $k+1$ will contain $j$ genes of type I is given by the right-hand side of formula (7). If we start the process at generation $0$ with genes of type I being chosen, independently, with probability $x$ each, then the number of type I genes at the $k$-th generation has the distribution of $nG_n^{k+1}(x)$.
This stochastic model we just described is known as the Wright-Fisher model, and is fundamental in mathematical biology for populations of fixed size. The model is very far from reality, but has nevertheless been extensively studied and used.
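The reproduction rule can be written out literally: every gene of the new generation copies the type of a uniformly chosen parent gene. The sketch below simulates one generation many times and compares the observed distribution of the number of type I genes with the binomial probabilities. The population of 10 genes with 3 of type I, the seed and the number of trials are arbitrary illustrative choices.

```python
import random
from collections import Counter
from math import comb

def next_generation(genes, rng):
    """Each new gene independently copies the type of a uniformly chosen parent."""
    return [rng.choice(genes) for _ in genes]

def binomial_pmf(n, p, j):
    """Probability of j successes in n independent trials with success probability p."""
    return comb(n, j) * p**j * (1 - p)**(n - j)

rng = random.Random(7)                       # fixed seed for reproducibility
parents = ["I"] * 3 + ["II"] * 7             # n = 10 genes, i = 3 of type I
n, i = len(parents), parents.count("I")
trials = 20_000

# Empirical distribution of the number of type I genes in the next generation.
counts = Counter(next_generation(parents, rng).count("I") for _ in range(trials))
max_gap = max(abs(counts[j] / trials - binomial_pmf(n, i / n, j))
              for j in range(n + 1))
```

A small `max_gap` indicates that the gene-by-gene resampling and the binomial transition rule describe the same one-step mechanism, up to Monte Carlo error.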
Early on, Wright and Fisher observed that performing exact computations with this model is hard. They devised a continuous approximation observing that the probability as a function of , and can, when is large, be approximated by a smooth function of and . (See, e.g., Kimura  and the recent paper by Tran, Hofrichter, and Jost .) Rather than approximating this probability, we follow modern methods of stochastic analysis in order to approximate the discrete stochastic process by a continuous-time continuous-space stochastic process that is nowadays known as Wright-Fisher diffusion.
5 The Wright-Fisher diffusion
Our eventual goal is to understand what happens when we consider the limit of $B_n^k f$, when both $n$ and $k$ tend to infinity. From the Bernstein and the Kelisky-Rivlin theorems we should expect that the order at which limits over $n$ and $k$ are taken matters. It turns out that the only way to obtain a limit is when the ratio $k/n$ tends to a constant, say, $t$. This is intimately connected to the Wright-Fisher diffusion that we introduce next. We assume that the reader has some knowledge of stochastic calculus, including the Itô formula and stochastic differential equations driven by a Brownian motion at the basic level of Øksendal or at the more advanced level of Bass.
We first explain why we expect that the Markov chain $G_n^k(x)$, $k = 0, 1, \ldots$, has a limit, in a certain sense, as $n \to \infty$. Our explanation here will be informal. We shall give rigorous proofs of only what we need in the following sections.
The first thing we do is to compute the conditional variance of the increment of the chain, and examine whether it converges to zero and at which rate: see (42), Theorem 5, Section A3 in the Appendix. The rate of convergence gives us the right time scale. Our case at hand is particularly simple because we have an exact formula:
$$E\Bigl[\bigl(G_n^{k+1}(x) - G_n^k(x)\bigr)^2 \Bigm| G_n^k(x) = y\Bigr] = \frac{y(1-y)}{n}. \tag{15}$$
This suggests that the right time scale at which we should run the Markov chain is such that the time steps are of size $1/n$. In other words, consider the points
$$(0, x), \quad (1/n, G_n^1(x)), \quad (2/n, G_n^2(x)), \quad \ldots \tag{16}$$
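and draw the random curve
$$X^{(n)}_t := G_n^{[nt]}(x) + (nt - [nt])\,\bigl(G_n^{[nt]+1}(x) - G_n^{[nt]}(x)\bigr), \quad t \ge 0,$$
(where $[nt]$ is the integer part of $nt$ and $G_n^0(x) := x$) as in Figure 2. This is at the right time scale.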
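As $n \to \infty$, one expects this curve to converge to a diffusion process $X = (X_t, t \ge 0)$, the Wright-Fisher diffusion, satisfying the stochastic differential equation
$$dX_t = \sqrt{X_t(1 - X_t)}\; dW_t \tag{18}$$
with initial condition $X_0 = x$, where $W_t$, $t \ge 0$, is a standard Brownian motion.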
It is actually possible to prove that $X^{(n)}$ converges weakly to $X$, but this requires an additional estimate on the size of the increments of the Markov chain that is, in our case, provided by the following inequality: for any $\epsilon > 0$,
$$P\Bigl(\bigl|G_n^{k+1}(x) - G_n^k(x)\bigr| > \epsilon \Bigm| G_n^k(x) = y\Bigr) \le 2 e^{-2n\epsilon^2}. \tag{19}$$
To see this, apply Hoeffding’s inequality (see (40), Section A2 in the Appendix).
We thus have that (15), (17), (19) are the conditions (42), (43) and (44) of Theorem 5, Section A3. In addition, it can be shown that the stochastic differential equation (18) admits a unique strong solution for any initial condition $x \in [0,1]$. This is, e.g., a consequence of the Yamada-Watanabe theorem [2, Theorem 24.4]. Hence, by Theorem 5, the sequence of continuous random curves $X^{(n)}$ converges weakly to the continuous random function $X$.
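One particular conclusion of weak convergence is that $E f(X^{(n)}_t) \to E f(X_t)$ as $n \to \infty$, for any $f \in C[0,1]$ or, equivalently, that
$$\lim_{n\to\infty} B_n^{[nt]} f(x) = E[f(X_t) \mid X_0 = x].$$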
Theorem 3 (joint limits theorem).
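For any $f \in C[0,1]$ and any $t \ge 0$,
$$\lim_{n\to\infty} \max_{0\le x\le 1} \bigl|B_n^{[nt]} f(x) - E[f(X_t) \mid X_0 = x]\bigr| = 0.$$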
Since understanding the theorem of Stroock and Varadhan requires advanced machinery, we shall prove Theorem 3 directly. The proof is deferred until Section 8. The nice thing about this theorem is that we have a way to compute the limit by means of stochastic calculus, the tools of which we shall assume as known.
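For $f(x) = x^2$ the joint limit can be checked by hand and by machine. The second-moment equation of the diffusion, $m_2' = m_1 - m_2$ with $m_1 = x$ and $m_2(0) = x^2$, has solution $E X_t^2 = x - x(1-x)e^{-t}$; this closed form is derived here for illustration rather than quoted from the text. The sketch compares it at $x = 1/2$ and $t = 1$ with $B_n^{[nt]} f$ computed exactly at the nodes. Function names and the values of $n$ are assumptions.

```python
from math import comb, exp, floor

def iterate_on_nodes(f, n, k):
    """Values of B_n^k f at the nodes 0, 1/n, ..., 1."""
    W = [[comb(n, j) * (i / n)**j * (1 - i / n)**(n - j)
          for j in range(n + 1)] for i in range(n + 1)]
    v = [f(j / n) for j in range(n + 1)]
    for _ in range(k):
        v = [sum(w * a for w, a in zip(row, v)) for row in W]
    return v

t = 1.0
limit = 0.5 - 0.25 * exp(-t)          # x - x(1-x) e^{-t} evaluated at x = 1/2

gaps = []
for n in (10, 40, 160):               # even n, so that 1/2 is the node with index n // 2
    v = iterate_on_nodes(lambda y: y * y, n, floor(n * t))
    gaps.append(abs(v[n // 2] - limit))
```

The entries of `gaps` decrease as $n$ grows with $k = [nt]$, which is the regime of Theorem 3; taking $k$ fixed or $k/n \to \infty$ would instead lead to the limits of Theorems 1 and 2.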
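If we let
$$Lf(x) := \tfrac{1}{2}\, x(1-x)\, f''(x), \tag{22}$$
take expectations in (21), and differentiate with respect to $t$, we obtain
$$\frac{\partial}{\partial t} E f(X_t) = E\,(Lf)(X_t),$$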
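the so-called forward equation of the diffusion. Now let, for all $x \in [0,1]$,
$$P_t f(x) := E[f(X_t) \mid X_0 = x].$$
Using the Markov property $P_{t+s} f = P_s (P_t f)$ and letting $s \downarrow 0$, we arrive at the backward equation
$$\frac{\partial}{\partial t} P_t f = L P_t f,$$
which is valid if $P_t f$ is twice continuously differentiable. The class of functions $f$ such that both $f$ and $P_t f$ are in $C^2[0,1]$ is nontrivial in our case. It contains, at least, polynomials. This is what we show next.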
6 Moments of the Wright-Fisher diffusion
It turns out that in order to prove Theorem 3 we need to compute $E f(X_t)$ when $f$ is a polynomial.
For a positive integer , the following holds for the Wright-Fisher diffusion:
(where, as usual, a product over an empty set equals ).
Write to save space. By Itô’s formula (21) applied to ,
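Since the first integral is (as a function of $t$) a martingale starting from $0$, its expectation is $0$. Thus, if we let $m_r(t) := E X_t^r$, we have
$$m_r(t) = x^r + \binom{r}{2} \int_0^t \bigl(m_{r-1}(s) - m_r(s)\bigr)\, ds.$$
Thus, $m_1(t) = x$, as expected, and
$$m_r'(t) = \binom{r}{2} \bigl(m_{r-1}(t) - m_r(t)\bigr), \quad r \ge 2.$$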
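Defining the Laplace transform
$$\widehat{m}_r(s) := \int_0^\infty e^{-st}\, m_r(t)\, dt$$
and using integration by parts to see that $\int_0^\infty e^{-st}\, m_r'(t)\, dt = s\,\widehat{m}_r(s) - x^r$ we have
$$\widehat{m}_r(s) = \frac{x^r}{s + \binom{r}{2}} + \frac{\binom{r}{2}}{s + \binom{r}{2}}\, \widehat{m}_{r-1}(s).$$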
Iterating this easy recursion yields
where the second equality was obtained by partial fraction expansion (and the notation is as in (26)). Since the inverse Laplace transform of $1/(s+a)$ is $e^{-at}$, the claim follows. ∎
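The Laplace-transform computation can be cross-checked numerically. Writing $c_i = \binom{i}{2}$, iterating the recursion and expanding in partial fractions gives $E X_t^r = \sum_{j=1}^{r} x^j \prod_{i=j+1}^{r} c_i \sum_{i=j}^{r} e^{-c_i t} \prod_{l=j,\, l\ne i}^{r} (c_l - c_i)^{-1}$. This arrangement is my own and may be organized differently from (25). The sketch evaluates it and compares it with a classical fourth-order Runge-Kutta integration of $m_r' = c_r(m_{r-1} - m_r)$; all function names, the step count and the sample values of $t$ and $x$ are assumptions.

```python
from math import exp

def rate(i):
    """c_i = i(i-1)/2, the coefficient in m_i' = c_i (m_{i-1} - m_i)."""
    return i * (i - 1) / 2

def moment_closed_form(r, t, x):
    """E X_t^r from the partial-fraction expansion (an empty product equals 1)."""
    total = 0.0
    for j in range(1, r + 1):
        lead = 1.0
        for i in range(j + 1, r + 1):
            lead *= rate(i)
        inner = 0.0
        for i in range(j, r + 1):
            denom = 1.0
            for l in range(j, r + 1):
                if l != i:
                    denom *= rate(l) - rate(i)
            inner += exp(-rate(i) * t) / denom
        total += x**j * lead * inner
    return total

def moments_rk4(r_max, t, x, steps=2000):
    """Integrate the moment equations by RK4; entry r of the result is m_r(t)."""
    def deriv(m):
        # m[0] is an unused placeholder so that m[r] is the r-th moment.
        return [0.0, 0.0] + [rate(r) * (m[r - 1] - m[r]) for r in range(2, r_max + 1)]
    m = [0.0] + [x**r for r in range(1, r_max + 1)]
    h = t / steps
    for _ in range(steps):
        k1 = deriv(m)
        k2 = deriv([a + 0.5 * h * b for a, b in zip(m, k1)])
        k3 = deriv([a + 0.5 * h * b for a, b in zip(m, k2)])
        k4 = deriv([a + h * b for a, b in zip(m, k3)])
        m = [a + h / 6 * (b + 2 * c + 2 * d + e)
             for a, b, c, d, e in zip(m, k1, k2, k3, k4)]
    return m

numeric = moments_rk4(4, 0.7, 0.3)
closed = [moment_closed_form(r, 0.7, 0.3) for r in range(1, 5)]
```

The rates $c_1 = 0, c_2 = 1, c_3 = 3, \ldots$ are distinct, so the partial-fraction denominators in `moment_closed_form` never vanish.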
Formula (25) was proved by Kelisky and Rivlin [17, Eq. (3.13)] and Karlin and Ziegler [16, Eq. (1.13)] by entirely different methods. (The latter paper contains a typo in the formula.) Eq. (3.13) of [17] reads:
valid for any integers with . This equality can be verified directly by simple algebra.
7 Convergence rate to Bernstein’s theorem: Voronovskaya’s theorem
An important result in the theory of approximation of continuous functions is Voronovskaya’s theorem. It is the simplest example of saturation, namely that, for certain operators, convergence cannot be too fast even for very smooth functions. See DeVore and Lorentz [7, Theorem 3.1]. Voronovskaya’s theorem gives a rate of convergence to Bernstein’s theorem. From a probabilistic point of view, the theorem is nothing else but the convergence of the generator of the discrete Markov chain to the generator of the Wright-Fisher diffusion. We shall not use anything from the theory of generators, but we shall give an independent probabilistic proof below for functions $f \in C^2[0,1]$, including a slightly improved form under the assumption that $f''$ is Lipschitz. In this case, its Lipschitz constant is
$$\mathrm{Lip}(f'') := \sup_{x \ne y} \frac{|f''(x) - f''(y)|}{|x - y|}.$$
Recall that $L$ is defined by (22).
Theorem 4 (Voronovskaya, 1932).
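For any $f \in C^2[0,1]$,
$$\lim_{n\to\infty} \max_{0\le x\le 1} \bigl|n\,(B_n f(x) - f(x)) - Lf(x)\bigr| = 0.$$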
If moreover is Lipschitz then, for any ,
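A quick numerical check of the first statement: for a smooth function, $n\,(B_n f(x) - f(x))$ should approach $\tfrac{1}{2}x(1-x)f''(x)$. The sketch uses $f = \exp$ (so that $f'' = \exp$ as well) at the point $x = 0.3$; these choices and the function names are illustrative assumptions. The binomial weights are computed in log space so that moderately large $n$ does not overflow.

```python
from math import exp, lgamma, log

def bernstein_at(f, n, x):
    """B_n f(x) for 0 < x < 1, with binomial weights computed in log space."""
    total = 0.0
    for j in range(n + 1):
        log_w = (lgamma(n + 1) - lgamma(j + 1) - lgamma(n - j + 1)
                 + j * log(x) + (n - j) * log(1 - x))
        total += exp(log_w) * f(j / n)
    return total

def voronovskaya_gap(f, f2, n, x):
    """| n (B_n f(x) - f(x)) - x(1-x) f''(x) / 2 |, where f2 is the second derivative of f."""
    return abs(n * (bernstein_at(f, n, x) - f(x)) - 0.5 * x * (1 - x) * f2(x))

gaps = [voronovskaya_gap(exp, exp, n, 0.3) for n in (25, 100, 400)]
```

The gap shrinks as $n$ grows, while $n\,(B_n f - f)$ itself stays bounded away from zero; this is the saturation phenomenon mentioned above, seen at a single point.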
Using Taylor’s theorem with the remainder in integral form,
Since , we have, from (15), . Therefore,
We estimate by splitting the expectation as
where $\delta$ is chosen by the uniform continuity of $f''$: for $\epsilon > 0$ let $\delta > 0$ be such that