Regime variance testing - a quantile approach
This paper is devoted to testing time series that exhibit behavior related to two or more regimes with different statistical properties. Our study is motivated by two real data sets from plasma physics with an observable two-regime structure. We develop an estimation procedure for the critical point at which the structure of a time series changes. Moreover, we propose three tests for recognizing such behavior. The presented methodology is based on the empirical second moment, and its main advantage is that it requires no assumption about the distribution. We express the examined statistical properties in the language of empirical quantiles of the squared data, so the methodology extends the approach known from the literature [5, 17, 1, 13]. We confirm the theoretical results by simulations and by the analysis of real data from turbulent laboratory plasma.
Keywords: statistical test, nonstationarity, two regimes, empirical second moment, quantile
PACS: 02.50.Tt, 02.50.Cw, 02.70.Uu
1 Introduction
The main issue in real data analysis is testing the distribution. This problem appears not only in the case of an independent identically distributed (i.i.d.) sample [22, 20, 25] but also when we calibrate a model to a real data set [2, 12, 3]. In the latter case the distribution is fitted to the residual series, which is assumed to be i.i.d. However, many independent variables seem to display changes in the underlying data generating process over time, and therefore they cannot be considered an identically distributed sample. We observe this behavior also in the time series described in Section 2, which presents increments of floating potential fluctuations of turbulent laboratory plasma for the small torus radial position cm. For this data set the known statistical tests for stationarity mentioned in Section 3 [7, 8, 21, 18, 14] are not useful. What is more, under some assumptions they indicate that the data are i.i.d., which contradicts the behavior of the data observable in Figure 1.
Therefore in this paper we introduce three tests that can be useful for time series in which we observe more than one regime with different statistical properties. Two of them are visual, so we call them pre-tests and propose to use them in the preliminary analysis to identify this specific behavior. To confirm two or more regimes in the data set we have developed a statistical test for regime variance. We also introduce an estimation procedure for the critical point that divides the examined time series into two parts with different statistical properties. Because inspection of the data alone can sometimes lead to a wrong preliminary choice of the model, the mentioned tests are based on the behavior of the empirical second moment of the examined time series. The advantage of a methodology based on empirical second moments is emphasized in [4, 26] and is also confirmed by the bottom panel of Figure 1, which presents the squared data, for which the difference between the two regimes is more visible. In the presented methodology we do not assume the distribution, because the introduced tests exploit only the empirical properties of the examined data set. More importantly, they can be used for data in which the point of division into two regimes is well defined (clearly observable), but also for data in which the point is not visible. Moreover, we show by a simulation study that the proposed methodology can also be useful for infinite-variance time series.
The rest of the paper is organized as follows. In Section 2 we present the examined data sets that motivate the presented methodology. Next, in Section 3 we review the known statistical tests for stationarity and present the procedure for estimating the critical point introduced in [24]. In Section 4 we introduce two visual pre-tests that indicate the specific behavior of the examined time series, i.e. two regimes related to different statistical properties. In this Section we also propose a new estimation procedure for the critical point, based on the behavior of the empirical second moment of the real data set, and present a simulation study. In Section 5 we introduce the statistical method for testing regime variance and check the procedure on simulated data. In Section 6 we analyze real data sets from plasma physics in the context of the presented methodology. Finally, the last Section gives a few concluding remarks.
Our study is motivated by the real data set presented in Figure 1. This time series describes increments of floating potential fluctuations (in volts) of turbulent laboratory plasma for the small torus radial position cm. A precise description of the experiment is presented in . The signal was registered on 15 June 2006 with a movable probe in the scrape-off layer (SOL) plasma of the stellarator "URAGAN 3M". Because the signal was registered every second, the total length of the time series is , but for the analysis we take only the observations between .
As we observe in Figure 1, the empirical data set exhibits very special behavior: the statistical properties of the time series change in time. This may be because the first observations constitute a random sample from a different distribution than the last part, or because the two parts come from the same distribution with different parameters. Therefore we can suspect that the time series $X_1, \ldots, X_n$ satisfies the following property:
$$X_i \stackrel{d}{=} \begin{cases} X & \text{for } i \le l, \\ Y & \text{for } i > l, \end{cases} \qquad (1)$$
where $X$ and $Y$ are independent and have different statistical properties and $l$ is a fixed point. As we have mentioned in Section 1, inspection of the data can sometimes lead to wrong preliminary conclusions, therefore we propose to consider the squared time series. As we observe in Figure 1, the difference between the two parts is more visible for the squared data. We express the statistical properties in the language of quantiles of the squared time series, and we assume that the random variables $X^2$ and $Y^2$ in relation (1) have different quantiles $q_{\alpha/2}$ and $q_{1-\alpha/2}$ for a given confidence level $\alpha$. Here $q_p$ denotes the quantile of order $p$.
After a preliminary analysis of the data set and confirmation that it constitutes realizations of independent random variables (see Figure 11), we have tested the hypothesis that the time series has the same distribution throughout. The known statistical tests, such as the Augmented Dickey-Fuller, Phillips-Perron and Kwiatkowski–Phillips–Schmidt–Shin (KPSS) tests, reject the hypothesis that the data are nonstationary (in the sense presented in Section 3), which suggests that they are not useful for this data set. Therefore we propose three tests that can be used for data that exhibit behavior similar to that observed in Figure 1, but also for data for which, after a preliminary analysis, we cannot reject the hypothesis of the same distribution. An example is shown in Figure 2. This time series presents increments of floating potential fluctuations (in volts) of turbulent laboratory plasma for the small torus radial position cm. As for the first data set, the signal was registered on 15 June 2006 and the total number of observations was , but for illustration we take only the observations from to . After analyzing the time plots of the series and of the squared series we can suspect that the data cannot be considered an identically distributed sample, but here the point at which some statistical properties change is not as visible as for the first data set. Moreover, the tests for stationarity presented in Section 3 indicate that under some assumptions the time series can be considered a stationary process, but in the next Sections we will show that this hypothesis is not true. We will also find the point that divides the examined data into two i.i.d. samples.
3 Statistical tests for stationarity
In order to make any inferences about the structure of a time series we need some regularity over time in the behavior of the underlying series. This regularity can be formalized using the concept of stationarity, see . We say that a time series is weakly stationary if its mean is constant over time and the covariance between the observations at times $t$ and $s$ depends only on the absolute difference $|t-s|$.
However, stationarity is not a common feature of time series, and mostly we observe nonstationary behavior of the process. There are several types of nonstationarity. Trend nonstationarity means that the data possess some deterministic trend (for example a linear trend) but are otherwise stationary. This can be easily seen from the autocorrelation function (for instance, a linear trend shows up as a slow, linear-in-time decay of the autocorrelation function). The second type is called difference nonstationarity, which means that the process has to be differenced in order to become stationary. These two types of nonstationarity are often encountered in real-life data. The class of unit-root tests helps to distinguish difference from trend nonstationarity. For the null hypothesis that the series is difference nonstationary one can mention the Dickey-Fuller unit root tests [7, 8, 21] and the Phillips-Perron unit root tests [18]. For testing in the opposite direction, namely the null hypothesis that the time series is trend stationary against the alternative that it is difference nonstationary, one can apply the KPSS test of Kwiatkowski, Phillips, Schmidt and Shin [14].
The types of nonstationarity mentioned above can be successfully tested and recognized from the data, but they are not the only problems one may encounter during data analysis. Atypical observations, level shifts or variance changes are common features of many real-life data sets [10, 5, 17]. Neglecting such effects may lead to inaccurate estimation of the model parameters and, in consequence, to inaccurate or completely wrong predictions. In the present work we discuss the effect of a variance change in the data, so there is no trend and the differenced data behave in the same way as before differencing. Such a two-regime time series was also considered in [24], where the following model for the innovations (independent sample) was studied:
$$X_t = \begin{cases} a_t & \text{for } t \le l, \\ (1+\omega)\, a_t & \text{for } t > l, \end{cases} \qquad (2)$$
for some point $l$, a fixed number $\omega$, and under the assumption that $\{a_t\}$ constitutes i.i.d. random variables from the normal distribution. We can thus calculate the ratio of the variances after and before the structural change:
$$r_l = \frac{l \sum_{t=l+1}^{n} X_t^2}{(n-l) \sum_{t=1}^{l} X_t^2}, \qquad (3)$$
where $l$ and $n-l$ are greater than zero. The variance ratio $r_l$ is an estimate of $(1+\omega)^2$ and is the likelihood ratio test statistic of a variance change under the assumption of normality. The test is the most powerful for a step change in variance when the point $l$ is known. If the critical point is unknown, one can apply the variance ratio statistic to find it. In this case we compute the variance ratio statistic for the stochastically independent series and obtain its minimum and maximum values:
$$r_{\min} = \min_{h \le l \le n-h} r_l, \qquad r_{\max} = \max_{h \le l \le n-h} r_l,$$
where $h$ is a positive integer denoting the minimum number of observations used to estimate the variance at the beginning and at the end of the sample. Then we calculate
$$r = \max\left\{ r_{\min}^{-1},\, r_{\max} \right\}.$$
The critical point is the one at which $r$ occurs.
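The search for the critical point with the variance ratio can be sketched in Python as follows. This is a minimal illustration, not the implementation of the cited procedure: the function name `variance_ratio_change_point`, the argument `h` (minimum number of observations kept on each side) and the convention "mean square after the split divided by mean square before it" are assumptions made here, and the observations are assumed to be centered.

```python
import random


def variance_ratio_change_point(x, h):
    """Scan all splits that leave at least h points on each side and return
    (l, r): the size l of the first part at which r = max(1/r_min, r_max)
    is attained, together with r itself."""
    n = len(x)
    sq = [v * v for v in x]
    total = sum(sq)
    head = sum(sq[:h - 1])                # sum of the first h-1 squares
    best_l, best_r = None, 0.0
    for l in range(h, n - h + 1):
        head += sq[l - 1]                 # head is now the sum of the first l squares
        before = head / l                 # mean square of the first part
        after = (total - head) / (n - l)  # mean square of the second part
        if before == 0.0 or after == 0.0:
            continue                      # ratio undefined, skip this split
        ratio = after / before
        score = max(ratio, 1.0 / ratio)   # covers both r_max and 1/r_min
        if score > best_r:
            best_l, best_r = l, score
    return best_l, best_r


# Illustrative data: Gaussian noise whose standard deviation jumps from 1 to 5.
rng = random.Random(0)
sample = [rng.gauss(0, 1) for _ in range(500)] + [rng.gauss(0, 5) for _ in range(500)]
print(variance_ratio_change_point(sample, h=100))
```

Taking the maximum of the ratio and its reciprocal at every split is equivalent to comparing the largest ratio with the inverse of the smallest one, so a single pass over the splits is enough.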
A complete description of the procedure for detecting and adjusting time series with a two-regime structure of type (2) is presented in [24]. The main disadvantage of this methodology is that it is based on the assumption of normal distribution. Therefore in the next Section we introduce a new procedure for estimating the critical point that does not require any assumption about the distribution. This procedure is based on the behavior of the empirical second moment of the examined time series and is compatible with two visual pre-tests for the two-regime structure.
4 Visual pre-tests for regime variance
In the first part of this Section we present two visual pre-tests that can indicate whether the observed time series constitutes a sample that satisfies relation (1). Both pre-tests are based on the behavior of the empirical second moment of the data. In the first method we propose to consider the following statistic:
$$C_j = \sum_{i=1}^{j} X_i^2, \qquad j = 1, 2, \ldots, n. \qquad (4)$$
If the random variables $X$ and $Y$ given in relation (1) have distributions with finite second moments $\sigma_X^2$ and $\sigma_Y^2$, respectively, then the statistic $C_j$ has the following property:
$$E\,C_j = \begin{cases} j\,\sigma_X^2 & \text{for } j \le l, \\ l\,\sigma_X^2 + (j-l)\,\sigma_Y^2 & \text{for } j > l. \end{cases} \qquad (5)$$
If $\sigma_X^2 = \sigma_Y^2 = \sigma^2$, then the mean of the statistic is equal to $j\sigma^2$ for all $j$, so for an i.i.d. sample the expected value of the statistic is a linear function with the shift parameter equal to zero. Of course this relation is not satisfied for distributions with infinite variance, but even in such cases we observe significant changes in the behavior of the statistic. The results of this pre-test are presented in Figure 3 for different distributions of the random variables $X$ and $Y$ in relation (1). We consider two scenarios, each consisting of three cases: pure Gaussian, pure Lévy–stable, and Gaussian–Lévy–stable. In the first scenario the parameters of the distributions are close to each other, and thus the structure change point is not clearly visible in the visual pre-test (see the left panel of Figure 3). In the second scenario we consider distributions with very different parameters, so the critical point is observable in the simulated sample (the right panel of Figure 3).
For the first scenario we consider the following cases:
the pure Gaussian case with and distribution for first 800 and last 1000 observations, respectively,
the pure Lévy–stable case with and distribution for first 800 and last 1000 observations, respectively,
the Lévy–stable–Gaussian case with and distribution for first 800 and last 1000 observations, respectively.
In the second scenario we consider the following parameters of the distributions:
the pure Gaussian case with and distribution for first 800 and last 1000 observations, respectively,
the pure Lévy–stable case with and distribution for first 800 and last 1000 observations, respectively,
the Gaussian–Lévy–stable case with and distribution for first 800 and last 1000 observations, respectively.
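The first pre-test can be illustrated with a short Python sketch. It computes the running sum of squared observations and compares its average increment before and after a known change point. The helper names `cumulative_second_moment` and `slope` are hypothetical, and the Gaussian standard deviations 1 and 1.5 are chosen here only for illustration; they are not the parameters of the scenarios listed above.

```python
import random
from itertools import accumulate


def cumulative_second_moment(x):
    """Running sums of squares: element j-1 of the result is x_1^2 + ... + x_j^2."""
    return list(accumulate(v * v for v in x))


def slope(c, start, stop):
    """Average increment of the running sum between 1-based indices start and stop."""
    return (c[stop - 1] - c[start - 1]) / (stop - start)


# Illustrative two-regime Gaussian sample: 800 observations, then 1000 observations.
rng = random.Random(1)
sample = [rng.gauss(0, 1.0) for _ in range(800)] + [rng.gauss(0, 1.5) for _ in range(1000)]
c = cumulative_second_moment(sample)

# For finite-variance data each slope estimates the second moment of its regime,
# so a plot of c against the index bends at the change point.
print(slope(c, 1, 800), slope(c, 800, 1800))
```

For an i.i.d. sample with finite variance both slopes estimate the same second moment, and the plotted curve stays close to a single straight line through the origin.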
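In the second visual pre-test we observe the behavior of the empirical second moment of the data from windows of width $k$. The examined statistic has the following form:
$$D_j^{(k)} = \sum_{i=j}^{j+k-1} X_i^2, \qquad j = 1, 2, \ldots, n-k+1, \qquad (6)$$
where $k$ is a given positive number called the window width. We assume $k < l$. For finite-variance distributions of $X$ and $Y$ we can also calculate the expected value of the statistic, namely:
$$E\,D_j^{(k)} = \begin{cases} k\,\sigma_X^2 & \text{for } j \le l-k+1, \\ (l-j+1)\,\sigma_X^2 + (j+k-1-l)\,\sigma_Y^2 & \text{for } l-k+1 < j \le l, \\ k\,\sigma_Y^2 & \text{for } j > l, \end{cases} \qquad (7)$$
where $\sigma_X^2$ and $\sigma_Y^2$ are the second moments of the random variables $X$ and $Y$, respectively. As we observe in (7), the mean of the statistic for a given window width $k$ is constant when $j \le l-k+1$ or $j > l$. For $l-k+1 < j \le l$ the statistic has a mean that is a linear function of $j$. When $X$ and $Y$ have the same distribution, the expected value of the statistic defined in (6) is constant for a given $k$. The results of this pre-test are presented in Figure 4 for the two considered scenarios with the different distributions presented above.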
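The windowed pre-test can be sketched in the same spirit. The function name `windowed_second_moment` is hypothetical, and the sketch assumes that the statistic is the plain (unnormalized) sum of squares over a sliding window of `k` consecutive observations; dividing by `k` would only rescale the picture.

```python
import random


def windowed_second_moment(x, k):
    """Sliding sums of squares: element j-1 of the result is x_j^2 + ... + x_{j+k-1}^2."""
    sq = [v * v for v in x]
    window = sum(sq[:k])
    out = [window]
    for j in range(1, len(sq) - k + 1):
        window += sq[j + k - 1] - sq[j - 1]   # slide the window one step to the right
        out.append(window)
    return out


# Illustrative two-regime Gaussian sample (standard deviations 1 and 3 are assumed).
rng = random.Random(2)
sample = [rng.gauss(0, 1) for _ in range(800)] + [rng.gauss(0, 3) for _ in range(1000)]
d = windowed_second_moment(sample, k=200)

# Windows lying entirely in one regime fluctuate around a constant level;
# windows that straddle the change point move from one level to the other.
print(d[0], d[-1])
```

The rolling update keeps the cost linear in the sample length, which matters when several window widths are inspected.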
4.1 Estimation procedure for the critical point
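In this part we introduce a new method of estimating the critical point at which the statistical properties change in a sample that fulfills relation (1). The idea of the estimation procedure comes from the first visual pre-test described above. More precisely, we use the statistic $C_j$ defined in (4), i.e. the running sum of squared observations, and its mean function given in (5).
The algorithm starts with dividing, for a fixed $m$, the statistics $C_1, \ldots, C_n$ into two sets $\{C_j: j = 1, \ldots, m\}$ and $\{C_j: j = m+1, \ldots, n\}$. Next, we fit the linear regression lines $y = a_1(m)\,j + b_1(m)$ and $y = a_2(m)\,j + b_2(m)$ to the first and the second set, respectively. From ordinary regression theory, for such lines the sums of squared distances $\sum_{j=1}^{m} \left(C_j - a_1(m)\,j - b_1(m)\right)^2$ and $\sum_{j=m+1}^{n} \left(C_j - a_2(m)\,j - b_2(m)\right)^2$ are minimized, and therefore the line coefficients have the form, see [9]:
$$a_1(m) = \frac{\sum_{j=1}^{m} j\,C_j - \frac{1}{m}\sum_{j=1}^{m} j \sum_{j=1}^{m} C_j}{\sum_{j=1}^{m} j^2 - \frac{1}{m}\left(\sum_{j=1}^{m} j\right)^2}, \qquad b_1(m) = \frac{1}{m}\sum_{j=1}^{m} C_j - a_1(m)\,\frac{1}{m}\sum_{j=1}^{m} j.$$
The coefficients $a_2(m)$ and $b_2(m)$ have analogous formulas with summation from $m+1$ to $n$. Our estimator of the point $l$ in relation (1) is defined as the number $m$ that minimizes the sum of the mentioned sums of squared distances:
$$\hat{l} = \arg\min_{m} \left[ \sum_{j=1}^{m} \left(C_j - a_1(m)\,j - b_1(m)\right)^2 + \sum_{j=m+1}^{n} \left(C_j - a_2(m)\,j - b_2(m)\right)^2 \right]. \qquad (8)$$
Let us stress that the proposed estimator does not require any assumption about the distribution of the sample.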
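The estimation procedure can be sketched in Python as below. This is one possible reading of the procedure, not the authors' code: the function name `estimate_change_point` is hypothetical, the split index is restricted to values that leave at least two points in each segment (an assumption needed for both lines to be defined), and prefix sums are used so that the residual sum of squares of every candidate split is obtained in constant time.

```python
import random
from itertools import accumulate


def estimate_change_point(x):
    """Return the size m of the first segment that minimises the total squared
    error of two separate least-squares lines fitted to the points (j, C_j),
    where C_j is the running sum of squared observations."""
    n = len(x)
    c = list(accumulate(v * v for v in x))      # C_1, ..., C_n
    js = range(1, n + 1)

    def prefix(values):
        return [0.0] + list(accumulate(values)) # index 0 holds the empty sum

    pj, pjj = prefix(js), prefix(j * j for j in js)
    pc, pcc = prefix(c), prefix(v * v for v in c)
    pjc = prefix(j * v for j, v in zip(js, c))

    def sse(a, b):
        """Residual sum of squares of the least-squares line through (j, C_j), j = a..b."""
        cnt = b - a + 1
        sj = pj[b] - pj[a - 1]
        sc = pc[b] - pc[a - 1]
        sxx = (pjj[b] - pjj[a - 1]) - sj * sj / cnt
        sxy = (pjc[b] - pjc[a - 1]) - sj * sc / cnt
        syy = (pcc[b] - pcc[a - 1]) - sc * sc / cnt
        return max(syy - sxy * sxy / sxx, 0.0)  # clamp tiny negative rounding errors

    return min(range(2, n - 1), key=lambda m: sse(1, m) + sse(m + 1, n))


# Illustrative data: Gaussian noise whose standard deviation jumps from 1 to 3.
rng = random.Random(3)
sample = [rng.gauss(0, 1) for _ in range(400)] + [rng.gauss(0, 3) for _ in range(600)]
print(estimate_change_point(sample))
```

The two lines are fitted independently and are not forced to meet at the split, which mirrors the description above of two separate regressions.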
We compare the accuracy of detecting the critical point of a sample satisfying relation (1) with that of the method proposed in [24], which is based on the variance ratio statistic given in (3). Recall that the variance ratio statistic is intended to detect a change point under the assumption that the examined series is normally distributed. The procedure is the following: we simulate trajectories of length 1800 of stochastically independent random variables with the change point placed at the 800th observation. As for the visual pre-tests, we consider two scenarios, each consisting of three distributions. Details of the examined scenarios are presented above.
The results for the first scenario, where the critical point is not clearly visible, are presented in Figure 5, where the first boxplot shows the results of the estimator based on (3) and the second the results of the estimator defined in (8). One clearly sees that detection of the critical point based on the estimator (8) is far more accurate than detection based on the variance ratio (3), even in the case of the Gaussian distribution (see panel a in Figure 5).
The results for the second scenario, with a clear critical point, are presented in Figure 6. Also in this case one can see that the estimator (8) performs better than the one based on (3).
5 Statistical test for regime variance
In this Section we introduce the regime variance test, which verifies our assumption of the two-regime behavior given in relation (1). It also confirms the preliminary results obtained with the visual pre-tests presented in the previous Section.
The procedure is based on the analysis of the empirical second moment of a given sample. Let us point out that it can be used for distributions with a finite theoretical second moment but also for those with infinite variance, because even in that case the empirical second moment exists. Moreover, the test is based on quantiles, which can be determined from the empirical distribution function without any assumption about the distribution.
The $H_0$ hypothesis is defined as follows: the observed time series does not satisfy relation (1), which means that the quantiles of the squared series do not change in time. The $H_0$ hypothesis is satisfied in the case of i.i.d. random variables, but also when the distributions of the two parts (divided by the critical point) are different but the quantiles of the squared data of orders $\alpha/2$ and $1-\alpha/2$ are on the same level.
The $H_1$ hypothesis is formulated as: the observed time series has at least representation (1), i.e. there are at least two regimes of the data for which the appropriate quantiles of the squared time series are different. Let us point out that the $H_0$ hypothesis will also be rejected when the squared series has more than two regimes.
The test of regime variance assumes that the real data constitute a sample of independent variables, so before testing we have to confirm that the given sample is independent. We propose a simple visual method based on the autocorrelation function (ACF). For an independent sample the ACF is close to zero for all lags greater than zero. More information on this methodology and its basic properties can be found in .
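A numerical counterpart of the visual ACF check might look as follows. The function names are hypothetical, and the band of plus or minus 1.96 divided by the square root of the sample size is the usual approximate 95% band for a finite-variance white-noise sample; it is an assumption added here, not part of the described method, and it is only a rough guide for heavy-tailed data.

```python
import math
import random


def sample_acf(x, max_lag):
    """Sample autocorrelations for lags 0..max_lag (element 0 is always 1)."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    denom = sum(d * d for d in dev)
    return [sum(dev[i] * dev[i + lag] for i in range(n - lag)) / denom
            for lag in range(max_lag + 1)]


def lags_outside_band(x, max_lag):
    """Lags 1..max_lag whose sample autocorrelation leaves the +-1.96/sqrt(n) band."""
    bound = 1.96 / math.sqrt(len(x))
    acf = sample_acf(x, max_lag)
    return [lag for lag in range(1, max_lag + 1) if abs(acf[lag]) > bound]


# Illustrative independent Gaussian sample; only a few lags, if any,
# are expected to leave the band by chance.
rng = random.Random(4)
sample = [rng.gauss(0, 1) for _ in range(1000)]
print(lags_outside_band(sample, max_lag=20))
```

A long list of flagged lags would suggest dependence, in which case the independence assumption of the regime variance test is doubtful.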
The procedure of regime variance testing for a given time series $X_1, \ldots, X_n$ proceeds as follows:
1. Determine the critical point $\hat{l}$ according to the procedure presented in Section 4. Let us emphasize that also under the $H_0$ hypothesis the point $\hat{l}$ exists and lies between $1$ and $n$.
2. Divide the squared time series into two vectors: $W_1 = (X_1^2, \ldots, X_{\hat{l}}^2)$ and $W_2 = (X_{\hat{l}+1}^2, \ldots, X_n^2)$. Find the empirical standard deviations $S_1$ and $S_2$ of $W_1$ and $W_2$, respectively. For simplicity let us assume that $S_1 < S_2$. Note that for a distribution without a theoretical second moment the empirical standard deviation still exists and can be calculated from the observed data.
3. Construct quantiles from the distribution of the squared time series from the vector $W_1$ (the one with the smaller empirical standard deviation), i.e. numbers $q_{\alpha/2}$ and $q_{1-\alpha/2}$ that satisfy the relation
$$P\left(q_{\alpha/2} \le X_i^2 \le q_{1-\alpha/2}\right) = 1-\alpha, \qquad i = 1, \ldots, \hat{l},$$
where $\alpha$ is a given confidence level. Under the $H_0$ hypothesis, without any assumption about the distribution, the appropriate quantiles can be determined from the empirical cumulative distribution function. Because $X_{\hat{l}+1}^2, \ldots, X_n^2$ are independent, the statistic $B$, defined as the number of elements of $W_2$ that fall into the interval $[q_{\alpha/2}, q_{1-\alpha/2}]$, has the Binomial distribution with parameters $n-\hat{l}$ and $1-\alpha$. Therefore the p-value of the test is calculated from the observed value of $B$ and the distribution of a random variable $Z$ that has the Binomial distribution with these parameters.
4. If the calculated p-value is greater than the parameter $\alpha$, we accept the $H_0$ hypothesis. Otherwise, if the calculated p-value is smaller than $\alpha$, we reject $H_0$ and accept $H_1$.
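The steps above can be sketched in Python. Several choices in this sketch are assumptions, because the exact formulas are not available in this copy: the function names are hypothetical, the empirical quantile is taken as a simple order statistic, the counted statistic is the number of squared observations of the more dispersed part that fall between the two quantiles, and the p-value is the lower binomial tail (too few observations inside the band is read as evidence of a second regime).

```python
import math
import statistics


def binom_cdf(b, m, p):
    """P(Z <= b) for Z ~ Binomial(m, p), summed in log space for stability."""
    total = 0.0
    for i in range(b + 1):
        log_pmf = (math.lgamma(m + 1) - math.lgamma(i + 1) - math.lgamma(m - i + 1)
                   + i * math.log(p) + (m - i) * math.log(1 - p))
        total += math.exp(log_pmf)
    return min(total, 1.0)


def empirical_quantile(sorted_values, p):
    """Order statistic at position ceil(p * n): a simple inverse of the ECDF."""
    idx = max(math.ceil(p * len(sorted_values)) - 1, 0)
    return sorted_values[idx]


def regime_variance_test(x, l_hat, alpha=0.05):
    """Return (p_value, inside) for the split of the squared series at l_hat."""
    sq = [v * v for v in x]
    first, second = sq[:l_hat], sq[l_hat:]
    # The reference part is the one with the smaller empirical standard deviation.
    if statistics.stdev(first) <= statistics.stdev(second):
        ref, other = first, second
    else:
        ref, other = second, first
    ref_sorted = sorted(ref)
    low = empirical_quantile(ref_sorted, alpha / 2)
    high = empirical_quantile(ref_sorted, 1 - alpha / 2)
    inside = sum(low <= v <= high for v in other)    # observed count B
    p_value = binom_cdf(inside, len(other), 1 - alpha)  # assumed one-sided p-value
    return p_value, inside


# Illustrative use on a series whose scale clearly changes after 200 observations.
series = [0.01 * k for k in range(1, 201)] + [10.0 * k for k in range(1, 201)]
p_value, inside = regime_variance_test(series, l_hat=200)
print("reject H0" if p_value < 0.05 else "accept H0")
```

In practice `l_hat` would come from the change point estimator of Section 4, and the independence of the sample would be checked first.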
The complementary part of this Section is a simulation study of the proposed estimator (8) and of the regime variance test described above. First we check the type I error of our test, i.e. rejecting a true $H_0$ hypothesis. For this purpose we generate 1000 trajectories of length of stochastically independent random variables for each of three cases:
the Gaussian case with distribution,
the Lévy–stable case with distribution,
the Gaussian–Lévy–stable case with and distribution for each half of the sample, randomly permuted.
In our simulations we always assume the significance level and an unknown distribution of the samples. Therefore the empirical quantiles are applied in the testing procedure. We note that the first two cases (pure Gaussian and Lévy–stable) are special, simplified versions of $H_0$, i.e. i.i.d. data. Obviously, the constancy of the theoretical quantiles $q_{\alpha/2}$ and $q_{1-\alpha/2}$ implies that the empirical versions computed in the testing algorithm are close to each other. The Gaussian–Lévy–stable case concerns two different distributions that alternate randomly (randomly permuted) in the time domain. It therefore stands in contrast to the $H_1$ hypothesis, where the different distributions are concentrated in two disjoint time intervals.
The results of the simulations are presented in Table 1. In the testing procedure we use the sample mean of the estimators obtained from the generated samples. This mean value of $\hat{l}$ is 881.37, 916.34 and 104.31 for the three considered cases, respectively. They are close to half of the sample length, which is quite intuitive for data satisfying $H_0$. We see that in the vast majority of cases the test correctly does not reject the true null hypothesis and the type I error is very rare, see the $H_1$ column. Moreover, the p-values corresponding to the acceptance of $H_0$ are, as expected, higher than the significance level , see Figure 7. The p-value column of Table 1 contains the mean of these p-values. In the $H_0$ column and the $H_1$ column we present the numbers of correct acceptances and incorrect rejections of $H_0$, respectively.
Table 1: |Distribution of samples|$H_0$|$H_1$|p-value|
Our next task is to explore the statistical power of the examined test. This is equivalent to investigating the type II error, i.e. accepting a false $H_0$ hypothesis. In order to calculate the type II error, we simulate 1000 trajectories of length 1800 of stochastically independent random variables for each of the three cases from the first scenario described in Section 4, which satisfy the $H_1$ hypothesis.
In all three cases the differences between the distribution parameters are quite small, and the two-regime structure stated in $H_1$ can be invisible in the data or in their squares, see Figure 8. This means that we check the efficiency of the proposed test in very demanding cases.
We apply the estimator (8) and the regime variance test assuming an unknown data distribution. The results of the simulations with significance level are presented in Table 2. In the testing procedure we use the sample mean of the estimators obtained from the generated samples. This mean value of $\hat{l}$ is 822.28, 943.72 and 646.42 for the three considered cases from the first scenario, respectively. We see that in most cases the test correctly rejects the false null hypothesis and the type II error is rare, see the $H_0$ column. The worst result is obtained in the third case, with different distributions. However, the p-values corresponding to the rejection of $H_0$ are, as expected, lower than the significance level , see Figure 9. The p-value column of Table 2 contains the mean of these p-values. In the $H_0$ column and the $H_1$ column we present the numbers of incorrect acceptances of $H_0$ and correct acceptances of $H_1$ (the power of the test), respectively. We also stress that, by the construction of the studied test, the rejection of the $H_0$ hypothesis is equivalent to the acceptance of $H_1$. In other words, the rejection of $H_0$ is only possible when $H_1$ is true or in the case of a type I error.
Table 2: |Distribution of samples|$H_0$|$H_1$|p-value|
6 Plasma data analysis
In this Section we analyze the real data sets presented in Figures 1 and 2 by using the tests for regime variance described in Sections 4 and 5. In Figure 10 we show the results of the visual pre-tests for the increments of floating potential fluctuations of turbulent laboratory plasma for the small torus radial position cm (corresponding to Figure 1).
As we observe, the visual pre-tests indicate the behavior formulated in (1). Moreover, we can also determine the critical point that divides the time series into two independent samples with statistical properties that do not change in time. We estimate the point by using the procedure described in Section 4 and get . In the next step of our analysis we test $H_0$, i.e. the hypothesis that the squared time series has quantiles that do not change in time. According to the procedure presented in Section 5, we first confirm independence by using the ACF, see Figure 11.
The regime variance test confirms that the examined data set has at least two regimes, i.e. it has representation (1). At the confidence level the obtained p-value is equal to , so we reject $H_0$. Because we have estimated the critical point that divides the analyzed time series into two parts, we can examine whether the separate vectors can be considered independent samples with the same appropriate quantiles of the squared data. In order to do this, we use the regime variance test once again for each of the two parts. For the first part the test returns a p-value on the level , which indicates that we can assume that the squared data have quantiles that do not change in time. If we test the second part of the data set, namely the observations from to , we get a p-value equal to , so also for this vector we can conclude that the appropriate statistical properties do not change. Moreover, if we assume that the data from the two considered parts constitute i.i.d. samples (which is one of the possibilities when the $H_0$ hypothesis is satisfied), we can test the distributions. By using tests based on the empirical cumulative distribution function, completely described in , we conclude that the observations come from the Lévy–stable distribution with stable parameter equal to and , while the data have the Lévy–stable distribution with parameters and . For both samples we use McCulloch's estimation method [16].
As we have mentioned in Section 2, the testing procedure can also be used for data for which the critical point is not as visible as in the previous case, see Figure 2. In Figure 12 we present the results of the visual pre-tests described in Section 4 for the data that describe increments of floating potential fluctuations of turbulent laboratory plasma for the small torus radial position cm. As we observe, on the basis of the behavior of the statistics defined in (4) and (6) we cannot conclude that the data set exhibits the behavior described in (1). But the procedure of estimating the critical point returns .
Next we can test whether the $H_0$ hypothesis is satisfied for the time series presented in Figure 2. The obtained p-value, equal to , indicates that the data have at least two regimes with different statistical properties. As for the first data set, we divide the time series into two separate vectors and test whether we can consider them as samples whose characteristics do not change with respect to time. For the first part, namely the data from to , we get a p-value on the level , while for the second vector (i.e. the observations from to ) the p-value is equal to . These results indicate that the two considered vectors do not satisfy relation (1) and can be considered i.i.d. samples. Under this assumption we test the distributions and obtain that the two considered parts come from the Lévy–stable distribution. For the first vector we obtain the following estimates of the parameters: and , while the estimated values of the parameters for the vector containing the observations are: and .
7 Conclusions
This paper is devoted to the analysis of time series that exhibit two-regime behavior. We have introduced a new estimation procedure for recognizing the critical point that divides the observed time series into two regimes with different statistical properties, expressed in the language of the quantiles of the squared data (Section 4). We have also developed three tests that can confirm our assumption of two-regime behavior (Sections 4 and 5). The universality of the presented methodology comes from the fact that it does not assume the distribution of the examined time series, and therefore it can be applied to a rich class of real data sets. We have illustrated the theoretical results by using simulated time series and by the analysis of two real data sets related to turbulent laboratory plasma.
Acknowledgements
JG and GS would like to acknowledge partial support from a fellowship co-financed by the European Union within the European Social Fund.
References
- [1] Brown R.L., Durbin J., Evans J.M.: Techniques for testing the constancy of regression relationships over time. Journal of the Royal Statistical Society Series B 35, 149–162 (1975).
- [2] Burnecki K., Gajda J., Sikora G.: Stability and lack of memory of the returns of the Hang Seng index. Physica A 390, 3136–3146 (2011).
- [3] Burnecki K., Weron A.: Fractional Lévy stable motion can model subdiffusive dynamics. Phys. Rev. E 82, 021130 (2010).
- [4] Burnecki K., Wyłomańska A., Beletskii A., Gonchar V., Chechkin A.: Recognition of stable distribution with Lévy index close to . Preprint (2011).
- [5] Chow G.C.: Tests of equality between sets of coefficients in two linear regressions. Econometrica 28(3), 591–605 (1960).
- [6] Cryer J.D., Chan K.-S.: Time Series Analysis with Applications in R, 2 ed., Springer (2008).
- [7] Dickey D., Fuller W.: Distribution of the estimators for autoregressive time series with a unit root. Journal of the American Statistical Association 74, 427–431 (1979).
- [8] Dickey D., Fuller W.: Likelihood ratio statistics for autoregressive time series with a unit root. Econometrica 49, 1057–1072 (1981).
- [9] Draper N.R., Smith H.: Applied Regression Analysis, Wiley Series in Probability and Statistics (1998).
- [10] Gencn I.H., Arzaghi M.: A confidence interval test for the detection of structural breaks. Journal of the Franklin Institute 348, 1615–1626 (2011).
- [11] Gonchar V.Yu., Chechkin A.V., Sorokovoy E.L., Chechkin V.V., Grigoreva L.I., Volkov E.D.: Stable Lévy distributions of the density and potential fluctuations in the edge plasma of the U-3M torsatron. Plas. Phys. Rep. 29, 380–390 (2003).
- [12] Janczura J., Orzeł S., Wyłomańska A.: Subordinated α-stable Ornstein-Uhlenbeck process as a tool for financial data description. Physica A 390, 4379–4387 (2011).
- [13] Kim H.J., Siegmund D.: The likelihood ratio test for a change-point in simple linear regression. Biometrika 76(3), 409–423 (1989).
- [14] Kwiatkowski D., Phillips P.C.B., Schmidt P., Shin Y.: Testing the null hypothesis of stationarity against the alternative of a unit root: How sure are we that economic time series have a unit root? Journal of Econometrics 54, 159–178 (1992).
- [15] Maddala G.S., Kim I.-M.: Structural change and unit roots. Journal of Statistical Planning and Inference 49, 73–103 (1996).
- [16] McCulloch J.H.: Simple consistent estimators of stable distribution parameters. Commun. Statist. - Simul. 15(4), 1109–1136 (1986).
- [17] Quandt R.E.: Tests of hypothesis that a linear regression system obeys two separate regimes. Journal of the American Statistical Association 55, 324–330 (1960).
- [18] Phillips P.C.B., Perron P.: Testing for a unit-root in time series regression. Biometrika 75, 335–346 (1988).
- [19] Priestley M.B.: Spectral analysis and time series, Academic Press, London, New York (1982).
- [20] Repetowicz P., Richmond P.: Statistical inference of multivariate distribution parameters for non-Gaussian distributed time series. Acta Phys. Polon. B 36(9), 2785–2796 (2005).
- [21] Said S.E., Dickey D.: Testing for unit roots in autoregressive-moving average models of unknown order. Biometrika 71, 599–607 (1984).
- [22] Sarabia J.M., Prieto F.: The Pareto-positive stable distribution: A new descriptive model for city size data. Physica A 388(19), 4179–4191 (2009).
- [23] Stephens M.A.: Vector correlation. Biometrika 66(3), 591–595 (1979).
- [24] Tsay R.S.: Outliers, level shifts and variance changes in time series. Journal of Forecasting 7, 1–20 (1988).
- [25] Weron R.: Computationally intensive Value at Risk calculations, in Handbook of Computational Statistics: Concepts and Methods, eds. J.E. Gentle, W. Haerdle, Y. Mori, Springer, Berlin (2004).
- [26] Wyłomańska A.: How identify the proper model?, available: ,