Low-rank diffusion matrix estimation for high-dimensional time-changed Lévy processes
The estimation of the diffusion matrix of a high-dimensional, possibly time-changed Lévy process is studied, based on discrete observations of the process with a fixed distance. A low-rank condition is imposed on . Applying a spectral approach, we construct a weighted least-squares estimator with nuclear-norm-penalisation. We prove oracle inequalities and derive convergence rates for the diffusion matrix estimator. The convergence rates show a surprising dependency on the rank of and are optimal in the minimax sense for fixed dimensions. Theoretical results are illustrated by a simulation study.
[language=french] In French.
Low-rank diffusion matrix estimation for Lévy processes
Volatility estimation, Lasso-type estimator, minimax convergence rates, nonlinear inverse problem, oracle inequalities, time-changed Lévy process
Non-parametric statistical inference for Lévy-type processes has been attracting the attention of researchers for many years initiated by the works of Rubin and Tucker  as well as Basawa and Brockwell . The popularity of Lévy processes is related to their simplicity on the one hand and the ability to reproduce many stylised patterns presented in economic and financial data or to model stochastic phenomena in biology or physics, on the other hand. While non-parametric inference for one dimensional Lévy processes is nowadays well understood (see e.g. the lecture notes ), there are surprisingly few results for multidimensional Lévy or related jump processes. Possible applications demand however for multi- or even high-dimensional methods. As a first contribution to statistics for jump processes in high dimensions, we investigate the optimal estimation of the diffusion matrix of a -dimensional Lévy process where may grow with the number of observations.
More general, we consider a larger class of time-changed Lévy process. Let be a -dimensional Lévy process and let be a non-negative, non-decreasing stochastic process with . Then the time-changed Lévy process is defined via for . For instance, in the finance context, the change of time is motivated by the fact that some economical effects (as nervousness of the market which is indicated by volatility) can be better understood in terms of a “business time” which may run faster than the physical one in some periods (see, e.g. Veraart and Winkel ). In view of large portfolios and a huge number of assets traded at the stock markets we have high-dimensional observations.
Monroe  has shown that even in the case of the Brownian motion , the resulting class of time-changed Lévy processes is rather large and basically coincides, at least in the one-dimensional case, with the class of all semi-martingales. Nevertheless, a practical application of this fact for statistical modelling meets two major problems: first, the change of time can be highly intricate - for instance, if has discontinuous trajectories, cf. Barndorff-Nielsen and Shiryaev ; second, the dependence structure between and can be also quite involved. In order to avoid these difficulties and to achieve identifiability, we allow to be a general Lévy processes, but assume independence of and .
A first natural question is which parameters of the underlying Lévy process can be identified from discrete observations for some as This question has been recently addressed in the literature (see Belomestny  and references therein) and the answer turns out to crucially depend on the asymptotic behaviour of and on the degree of our knowledge about We study the so-called low-frequency regime, meaning that the observation distance is fixed. In this case, we have severe identifiability problems which can be easily seen from the fact that subordinated Lévy processes are again Lévy processes. Assuming the time-change is known, all parameters of the underlying Lévy process can be identified as Since any Lévy process can be uniquely parametrised by the so-called Lévy triplet with a positive semi-definite diffusion (or volatility) matrix , a drift vector and a jump measure we face a semi-parametric estimation problem. Describing the covariance of the diffusion part of the Lévy process , the matrix is of particular interest. Aiming for a high dimensional setting, we study the estimation of under the assumption that it has low rank. This low rank condition reflects the idea of a few principal components driving the whole process. We also discuss the case where the diffusion matrix can be decomposed into a low rank matrix and a sparse matrix as has been suggested by Fan et al. [17, 18].
For discretely observed Lévy processes it has been recently shown that estimation methods coming from low-frequency observations attain also in the high-frequency case, where , optimal convergence results, cf. Jacod and Reiß  for volatility estimation and Nickl et al.  for the estimation of the jump measure. In contrast to estimators which are tailor-made for high-frequency observations, these low-frequency methods thus turn out to be robust with respect to the sampling frequency. This observation is a key motivation for considering a fixed in the present article. It is very common in the literature on statistics for stochastic processes, that the observations are supposed to be equidistant. In view of the empirical results by , this model assumption might however be often violated in financial applications. Our observation scheme can be equivalently interpreted as random observation times of the Lévy process . For one-dimensional Lévy processes general observation distances as been studied by Kappus .
The research area on statistical inference for discretely observed stochastic processes is rapidly growing. Let us only mention some closely related contributions while we refer to  for a recent overview of estimation for Lévy processes. Non-parametric estimation for time-changed Lévy models has been studied in [6, 8, 13]. In a two-dimensional setting the jump measure of a Lévy process has been estimated in . For inference on the volatility matrix (in small dimensions) for continuous semi-martingales in a high-frequency regime we refer to [20, 10] and references therein. Large sparse volatility matrix estimation for continuous Itô processes has been recently studied in [42, 37, 38].
Our statistical problem turns out to be closely related to a deconvolution problem for a multidimensional normal distribution with zero mean and covariance matrix convolved with a nuisance distribution which is unknown except for some of its structural properties. Due to the time-change or random sampling, respectively, the normal vector is additionally multiplied with a non-negative random variable. Hence, we face a covariance estimation problem in a generalised mixture model. Since we have no direct access to a sample of the underlying normal distribution, our situation differs considerably from the inference problems which have been studied in the literature on high-dimensional covariance matrix estimation so far.
The analysis of many deconvolution and mixture models becomes more transparent in spectral domain. Our accordingly reformulated problem bears some similarity to the so-called trace regression problem and the related matrix estimation, which recently got a lot of attention in statistical literature (see, e.g. [34, 29, 1, 24]). We face a non-linear analogue to the estimation of low-rank matrices based on noisy observations. Adapting some ideas, especially from Koltchinskii et al. , we construct a weighted least squares estimator with nuclear norm penalisation in the spectral domain.
We prove oracle inequalities for the estimator of implying that the estimator adapts to the low-rank condition on and that it also applies in the miss-specified case of only approximated sparsity. The resulting convergence rate fundamentally depends on the time-change while the dimension may grow polynomially in the sample size. Lower bounds verify that the rates are optimal in the minimax sense for fixed dimensions. The influence of the rank of on the convergence rate reveals a new phenomenon which was not observed in multi-dimensional non-parametric statistics before. Namely, in a certain regime the convergence rate in is faster for a larger rank of . To understand this surprising phase transition from a more abstract perspective, we briefly discuss a related regression model.
The paper is organised as follows. In Section 2, we introduce notations and formulate the statistical model. In Section 3 the estimator for the diffusion matrix is constructed and the oracle inequalities for this estimator are derived. Based on these oracle inequalities, we derive in Section 4 upper and lower bounds for the convergence rate. Section 5 is devoted to several extensions, including mixing time-changes, incorporating a sparse component of the diffusion matrix and the related trace-regression model. Some numerical examples can be found in Section 6. In Sections 7 and 8 the proofs of the oracle inequalities and of the minimax convergence rates, respectively, are collected. The appendix contains some (uniform) concentration results for multivariate characteristic functions which are of independent interest.
Acknowledgement. The authors thank Markus Reiß and two anonymous referees for helpful comments. D.B. acknowledges the financial support from the Russian Academic Excellence Project “5-100” and from the Deutsche Forschungsgemeinschaft (DFG) through the SFB 823 “Statistical modelling of nonlinear dynamic processes”. M.T. gratefully acknowledges the financial support by the DFG research fellowship TR 1349/1-1. A part of this paper has been written while M.T. was affiliated to the Université Paris-Dauphine.
2 Main setup
Recall that a random time change is an increasing right-continuous process with left limits such that . For each the random variable is a stopping time with respect to the underlying filtration. For a Lévy process the time-changed Lévy process is given by . We throughout assume that is independent of . By reparametrising the time change we can assume without loss of generality that and that the observations are thus given by the increments
for . Note that in general the observations are not independent, in contrast to the special case of low-frequently observed Lévy processes. The estimation procedure relies on the following crucial insight: If the sequence
is stationary and admits an invariant measure , then the independence of and together with the Lévy-Khintchine formula yield that the observations have a common characteristic function given by
where is the Laplace transform of and where the characteristic exponent is given by
with the diffusion matrix , for a drift parameter and a jump measure . If and , we end up with the problem of estimating a covariance matrix. The function in (2.1) appears due to the presence of jumps and can be viewed as a nuisance parameter. Let us give some examples of typical time-changes and their Laplace transforms.
Low-frequency observations of with observation distance :
Poisson process time-change or exponential waiting times with intensity parameter :
Gamma process time-change with parameters :
Integrated CIR-process for some such that (which has -mixing increments, cf. ):
If the time change is unknown, one faces severe identifiability problems. Indeed, even in the case of a multivariate time-changed Brownian motion we can identify the matrix and only up to a multiplicative factor. More generally, if is not restricted to be Brownian motion there is much more ambiguity as the following example illustrates.
Consider two time changes and where is a Gamma subordinator with the unit parameters and is deterministic. We have and Note that Let be a one-dimensional Lévy process with the characteristic exponent and let be another Lévy process with the characteristic exponent (the existence of such Lévy process follows from the well-known fact that is the characteristic function of some infinitely divisible distribution). Then we obviously have for all Hence, the corresponding time-changed Lévy processes and satisfy
for any Moreover, the increments are independent in this case.
The above example shows that even under the additional assumption , we cannot, in general, consistently estimate the parameters of the one-dimensional time-changed Lévy process from the low-frequency observations. As shown by Belomestny  all parameters of a -dimensional (with ) Lévy process can be identified under additional assumption that all components of are independent and However this assumption would imply that the matrix is diagonal and is thus much too restrictive for our setting. Therefore, we throughout assume that the time-change and consequently its Laplace transform are known, see Section 5.1 for a discussion for unknown . Let us note that similar identifications problems appear even in high-frequency setting, see .
Before we construct the volatility matrix estimator in the next section, we should introduce some notation. Recall the Frobenius or trace inner product for matrices . For the Schatten- norm of is given by
with being the singular values of . In particular, , and denote the nuclear, the Frobenius and the spectral norm of , respectively. We will frequently apply the trace duality property
Moreover, we may consider the entry-wise norms for with the usual modification denoting the number of non-zero entries of .
For any matrices we write
For we write if there is a constant independent of and the involved parameters such that .
3 The estimator and oracle inequalities
In view of the Lévy-Khintchine formula, we apply a spectral approach. The natural estimator of is given by the empirical characteristic function
which is consistent whenever is ergodic. Even more, it concentrates around the true characteristic function with parametric rate uniformly on compact sets, cf. Theorems A.2 and A.4 in the appendix. A plug-in approach yields the estimator for the characteristic exponent given by
where denotes a continuously chosen inverse of the map . Since only appears in the real part of the characteristic exponent, we may use .
For high-frequency observations, that is , the characteristic function of the observations can be written as where is the characteristic exponent as before and where is the Laplace transform of the rescaled time-change increments , which may be assumed to be stationary. Under some regularity assumption on , we obtain the asymptotic expansion
Instead of the inversion in (3.1) the estimator can be used in this case, depending only on Laplace transform via . Note that the latter is given by the first moment of rescaled time-change increments. However, this asymptotic identification can only be applied in the high-frequency regime , while the estimator (3.1) is robust with respect to the sampling frequency.
Applying , estimating the low-rank matrix can thus be reformulated as the regression problem
where we have normalised the whole formula by the factor . The design matrix for an arbitrary unit vector is deterministic and degenerated. The second term in (3.2) is a deterministic error which will be small only for large . The last term reflects the stochastic error. Due to the identification via the Laplace transform , the estimation problem is non-linear and turns out to be ill-posed: the stochastic error grows for large frequencies.
Inspired by the regression formula (3.2), we define the diffusion matrix estimator as a penalised least squares type estimator with regularisation parameter and a spectral cut-off :
where is a subset of positive semi-definite matrices and is a weight function. We impose the following standing assumption on the weight function which is chosen by the practitioner.
Let be a radial non-negative function which is supported on the annulus . For any let .
In the special case of a Lévy process with a finite jump activity , we can write remainder from (2.1) as
for . Since the Fourier transform converges to zero as under a mild regularity assumption on the finite measure , we can reduce the bias of our estimator by the following modification
and some interval . The most interesting cases are , where and coincide, and . The factor in front of is natural, since the ill-posedness of the estimation problem for the jump activity is two degrees larger than for estimating the volatility, cf. Belomestny and Reiß . As a side product, is an estimator for the jump intensity. By penalising large , the estimator is pushed back to zero if the least squares part cannot profit from a finite . It thus coincides with the convention to set in the infinite intensity case.
Our estimators and are related to the weighted scalar product
with the usual notation . For convenience we abbreviate . The scalar product is the counterpart to the empirical scalar product in the matrix estimation literature. As the following lemma reveals, the weighted norm and the Frobenius norm are equivalent. As the isometry is not restricted to any sparse sub-class of matrices, it is especially implies the often imposed restricted isometry property. The proof is postponed to Section 7.1.
For any positive semi-definite matrix and any it holds
where and .
Using well-known calculations from the Lasso literature, we obtain the following elementary oracle inequality, which is proven in Section 7.2. The condition is trivially satisfied for .
Let be an arbitrary subset of matrices and define
where we set if . Suppose . On the event for some we have
If the set is convex, then sub-differential calculus can be used to refine the inequality (3.7). The proof is inspired by Koltchinskii et al. [24, Thm. 1] and postponed to Section 7.3. Note that this second oracle inequality improves (3.7) with respect to two aspects. Instead of we have in the second term on the right-hand side and the nuclear-norm of is replaced by its rank.
This oracle inequality is purely non-asymptotic and sharp in the sense that we have the constant one in front of on the right-hand side. Combining Lemma 3.2 and Theorem 3.4 immediately yields an oracle inequality for the error in the Frobenius norm.
Let be a convex subset of the positive semi-definite matrices and let . On the event for some and for from (3.6) we have
for a constant depending only on and .
The remaining question is how the parameters and should be chosen in order to control the event . The answer will be given in the next section.
4 Convergence rates
The above oracle inequalities hold true for any Lévy-process and any independent time-change with stationary increments. However, to find some and such that probability of the event with the error term from (3.6) is indeed large, we naturally need to impose some assumptions. For simplicity we will concentrate on the well-specified case noting that, thanks to Corollary 3.5, the results carry over to the miss-specified situation.
Let the jump measure of fulfil:
for some and for some constant let
for some .
Let the time-change satisfy:
for some .
The sequence , is mutually independent and identically distributed with some law on .
The derivatives of the Laplace transform satisfy for some for all with .
If the Laplace transform decays polynomially, we may impose the stronger assumption
for some for all with .
For Assumption B(i) allows for Lévy processes with infinite jump activity. In that case in (4.1) is an upper bound for the Blumenthal–Getoor index of the process. A more pronounced singularity of the jump measure at zero corresponds to larger values of . The lower bound is natural, since any jump measure satisfies . On the contrary, in (4.1) implies that is a finite measure. In that case we could further profit from its regularity which we measure by the decay of its Fourier transform. Altogether, the parameter will determine the approximation error that is due to the term in (3.2).
Assumption B(ii) implies that for all and together with the moment condition Assumption C(i), we conclude that , cf. Lemma 8.1. Assumption C(ii) implies that the increments are independent and identically distributed. Note that being stationary with invariant measure is necessary to construct the estimator of . The independence can, however, be relaxed to an -mixing condition as discussed in Section 5.2. Finally, Assumptions C(iii) and (iv) are used to linearise the stochastic error.
In the sequel we denote by the -dimensional ball of radius . We can now state the first main result of this section.
In order to prove this theorem, we decompose into a stochastic error term and a deterministic approximation error. More precisely, the regression formula (3.2) and for any yield
The order of the deterministic error is , cf. Lemma 8.3, which decays as . The rate deteriorates for higher jump activities. This is reasonable since even for high frequency observations it is very difficult to distinguish between small jumps and fluctuations due to the diffusion component, cf. Jacod and Reiß . The stochastic error is dominated by its linearisation
which is increasing in due to the denominator as . To obtain a sharp oracle inequality for the spectral norm of the linearised stochastic error, we use the noncommutative Bernstein inequality by Recht . To bound the remainder, we apply a concentration result (Theorem A.4) for the empirical characteristic function around , uniformly on .
The lower bound on reflects the typical dependence on the dimension that also appear is in theory on matrix completion, cf. Corollary 2 by Koltchinskii et al. . Our upper bound on ensures that the remainder term in the stochastic error is negligible.
The choice of is determined by the trade-off between approximation error and stochastic error. Since neither nor depend on the rank of , the theorem verifies that the estimator is adaptive with respect to . Supposing lower bounds for that depend only on the radius , the only term that depends on the dimension is . Owing to , it is uniformly bounded by the volume of the unit ball in which in turn is uniformly bounded in (in fact it is decreasing as ). If , we even have . We discuss this quite surprising factor further after Corollary 4.4 and, from a more abstract perspective, in Section 5.5.
For specific decay behaviours of we can now conclude convergence rates for the diffusion matrix estimator. In contrast to the nonparametric estimation of the coefficients of a diffusion process, as studied by Chorowski and Trabs , the convergence rates depend enormously on the sampling distribution (resp. time-change). We start with an exponential decay of the Laplace transform of . This is especially the case if the Lévy process is observed at equidistant time points for some and thus .
for some and a constant which can be chosen independently of and . Then we have for sufficiently large
with probability larger than for some depending only on and .
This corollary shows that the exponential decay of leads to a severely ill-posed problem and, as a consequence, the rate is only logarithmic. Therein, the interplay between sample size and dimension , i.e. the term has been also observed for matrix completion [24, Cor. 2]. Whenever there is some such that the logarithmic term in the upper bound simplifies to . The slow rate implies that is consistent in absolute error only if . Considering the relative error , this restriction vanishes. Note also that in a high-dimensional principal component analysis, the spectral norm may grow in , too, cf. the discussion for the factor model by Fan et al. .
If decays only polynomially, as for instance for the gamma subordinator, cf. Examples 2.1, the estimation problem is only mildly ill-posed and the convergence rates are polynomial in .
Grant Assumptions B and C(i)-(iv) and assume . For a convex subset of positive semi-definite matrices, let , and . Suppose and for with and for some such that . Denoting the smallest strictly positive eigenvalue of by , we set
for a constant independent of and . Then we have for sufficiently large
with probability larger than for some depending only on and .
The convergence rate reflects the regularity and the degree of ill-posedness of the statistical inverse problem, where is the decay rate of the characteristic function and we gain two degrees since grows like . The term appearing in the denominator is very surprising since the rate becomes faster as the rank of increases up to some critical threshold value . To see this, it remains to note that the assumption yields
In order to shed some light on this phenomenon, we consider a related regression problem in Section 5.5.
It is interesting to note that for very small we almost attain the parametric rate. Having the example of gamma-distributed increments in mind, a small corresponds to measures which are highly concentrated at the origin. In that case we expect many quite small increments where the jump component has only a small effect. On the other hand, for the remaining few rather large increments, the estimator is not worse than in a purely low-frequency setting. Hence, heuristically corresponds to an interpolation between high- and low-frequency observations, cf. also Kappus .
If the condition is not satisfied, the linearised stochastic error appears to be smaller than the remainder of the linearsation. It that case we only obtain the presumably suboptimal rate (for fixed ). It is still an open problem whether this condition is only an artefact of our proofs or if there is some intrinsic reason.
Let us now investigate whether the rates are optimal in the minimax sense. The dependence on the the rank of is the same as usual in matrix estimation problems and seems natural. The optimality of the influence of the dimension is less clear. In lower bounds for matrix completion, cf. Koltchinskii et al. [24, Thm. 6], a similar dependence appears except for the fact that we face non-parametric rates. In the following we prove lower bounds for the rate of convergence in the number of observations for a fixed . More general lower bounds in a high-dimensional setting where may grow with are highly intricate since usually lower bounds for high-dimensional statistical problems are based on Kullback-Leibler divergences in Gaussian models - an approach that cannot be applied in our model. This problem is left open for future research.
Let us introduce the class of all Lévy measures satisfying Assumption B with and a constant . In order to prove sharp lower bounds, we need the following stronger assumptions on the time change:
Grant (i) and (ii) from Assumption C. For and let the Laplace transform satisfy
for some and all
for some and all with