Dynamic Large Spatial Covariance Matrix Estimation in Application to Semiparametric Model Construction via Variable Clustering: the SCE approach
Abstract
To better understand the spatial structure of large panels of economic and financial time series and to provide a guideline for constructing semiparametric models, this paper first considers estimating a large spatial covariance matrix of generalized dependent and mixing time series (with p variables and n observations) by hard thresholding regularization, under either a generalized dependence scheme (characterized by a time dependence measure) or a mixing scheme (characterized by the mixing coefficient). We quantify the interplay between the estimators’ consistency rate and the time dependence level, discuss an intuitive resampling scheme for threshold selection, and prove a general cross-validation result justifying this. Given a consistently estimated covariance (correlation) matrix, by utilizing its natural links with graphical models and semiparametrics, after “screening” the (explanatory) variables, we implement a novel forward (and backward) label permutation procedure to cluster the “relevant” variables and construct the corresponding semiparametric model, which is further estimated by the groupwise dimension reduction method with sign constraints. We call this the SCE (screen, cluster, estimate) approach for modeling high dimensional data with complex spatial structure. Finally, we apply this method to study the spatial structure of large panels of economic and financial time series and to find the proper semiparametric structure for estimating the consumer price index (CPI), illustrating its superiority over linear models.
Keywords: Time Series, Covariance Estimation, Regularization, Sparsity, Thresholding, Semiparametrics, Graphical Model, Variable Clustering
JEL classification: C13, C14, C32, E30, G10
1 Introduction
1.1 Large Spatial Covariance Matrix
Recent breakthroughs in technology have created an urgent need for high dimensional data analysis tools. Examples include economic and financial time series, genetic data, brain imaging, spectroscopic imaging, climate data and many others. To model high dimensional data, especially large panels of economic and financial time series as our focus here, it is very important to begin with understanding the “spatial” structure (over the space of variables instead of from a geographic point of view; this usage is kept below for convenience) instead of simply assuming any specific type of parametric (e.g. linear) model first. Estimation of the large spatial covariance matrix plays a fundamental role here since it can indicate a predictive relationship that can be exploited in practice. It is also very important in numerous other areas of economics and finance, including but not limited to handling heteroscedasticity of high dimensional econometric models, risk management of large portfolios, setting confidence intervals (or interval forecasts) on linear functions of the means of the components, variable grouping via graphs, dimension reduction by principal component analysis (PCA) and classification by linear or quadratic discriminant analysis (LDA and QDA). In recent years, many application areas where these tools are used have dealt with very high dimensional datasets with relatively small sample sizes, e.g. the typically low frequency macroeconomic data.
It is well known by now that the empirical covariance matrix for samples of size n from a p-variate Gaussian distribution is not a good estimator of the population covariance if p is large. If p/n → c > 0 and the covariance matrix is the identity, then the empirical distribution of the eigenvalues of the sample covariance matrix follows the Marčenko-Pastur law (Marčenko and Pastur (1967)) and the eigenvalues are supported on [(1 − √c)², (1 + √c)²]. Thus, the larger c is, the more spread out the eigenvalues are.
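This eigenvalue spreading is easy to check in a small simulation (a sketch; the dimensions and the seed below are arbitrary illustrative choices): with identity covariance and p/n = 1/2, the sample eigenvalues spread over roughly [(1 − √0.5)², (1 + √0.5)²] ≈ [0.086, 2.914] instead of concentrating near 1.

```python
import numpy as np

rng = np.random.default_rng(0)

# n i.i.d. draws from N(0, I_p); the true covariance is the identity,
# so all population eigenvalues equal 1.
p, n = 500, 1000
X = rng.standard_normal((n, p))
S = X.T @ X / n                      # sample covariance matrix
eig = np.linalg.eigvalsh(S)

c = p / n                            # aspect ratio p/n
mp_lo, mp_hi = (1 - np.sqrt(c)) ** 2, (1 + np.sqrt(c)) ** 2
print(eig.min(), eig.max())          # spread out near the Marcenko-Pastur edges
print(mp_lo, mp_hi)                  # ~0.086 and ~2.914
```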
Therefore, alternative estimators for large covariance matrices have attracted a lot of attention recently. Two broad classes of covariance estimators have emerged. One is to remedy the sample covariance matrix and construct a better estimate by using approaches such as banding, tapering and thresholding. The other is to reduce dimensionality by imposing some structure on the data, such as factor models (Fan et al. (2008)). In the first class, regularizing the covariance matrix by banding or tapering relies on a natural ordering among variables and assumes that variables far apart in the ordering are only weakly correlated, see Wu and Pourahmadi (2003), Bickel and Levina (2008b), Cai and Zhou (2011) among others. However, there are many applications, such as large panels of macroeconomic and financial time series, gene expression arrays and other spatial data, where there is no total ordering on the plane and no defined notion of distance among variables at all. These applications require estimators that are invariant under variable permutations, such as regularizing the covariance matrix by thresholding, El Karoui (2008) and Bickel and Levina (2008a). In this paper, we consider thresholding of the sample spatial covariance matrix for high dimensional time series, which extends the existing work from the i.i.d. to the dependent scenario. Under the time series setup, a very important question to ask is: how will the time dependence affect the estimator’s consistency? This is the first question this paper is going to answer.
For time series, there have been two recent works by Bickel and Gel (2011) and Xiao and Wu (2011) about banding and tapering the large autocovariance matrices for univariate time series. But our goal here is to better understand the spatial structure of high dimensional time series, so we need a consistent estimate of the large spatial covariance matrix, especially under a mixture of serial correlation (temporal dynamics), high dimensional (spatial) dependence structure and moderate sample size (relative to dimensionality).
1.2 Relation with Semiparametric Model Construction
As mentioned at the very beginning, when the spatial structure of the high dimensional data (time series) is complex, instead of simply assuming a specific type of parametric (e.g. linear) model first, we could adopt the flexible nonparametric approach. Due to the “curse of dimensionality” disadvantage of fully nonparametric methods, various semiparametric models have been considered to maintain flexibility in modeling while attempting to deal with the “curse of dimensionality” problem. However, most of the prior semiparametric works were carried out under some (prefixed) specific classes of semiparametric models without discussing which ones might be closer to the actual data structure. More specifically, to model some dependent variable Y using p explanatory variables (with very large p), they might suggest the following high dimensional single index (Huang et al. (2010)) or additive models (Meier et al. (2009), Ravikumar et al. (2009)) first and then perform various variable selection techniques to eliminate some of the explanatory variables to avoid overfitting.

Y = g(β⊤X) + ε, where g(·) is an unknown univariate link function and β is an unknown parameter vector in the parameter space;

Y = f_1(X_1) + … + f_p(X_p) + ε, where f_1(·), …, f_p(·) are unknown univariate functions to be estimated nonparametrically.
This approach encounters limitations from the following three perspectives. First, when the dimensionality p grows large, the prefixed assumption itself becomes more and more questionable. Is the single index model or the additive one closer to the actual data structure? Or maybe some other type of semiparametric structure is more suitable? We do not know. And this becomes more challenging when the sample size is small (relative to the dimensionality).
Second, this approach (prefixing some specific semiparametric classes first and then selecting variables accordingly) is also challenged by another characteristic of high dimensional economic and financial time series: strong spatial dependence (near-collinearity). Under near-collinearity, we expect variable selection to be unstable and very sensitive to minor perturbations of the data. In this sense, we do not expect variable selection to provide results that lead to clearer economic interpretation than principal components or ridge regression. This is because, although compared with the information criteria based and ridge regression type regularization methods, the Lasso type variable selection techniques (Tibshirani (1996)) can deal with large p and require weaker assumptions on the design matrix X, they still require the following restricted eigenvalue (RE) assumption (as one of many similar requirements) from Bickel et al. (2009): there exists a positive number κ(s, c_0) such that

κ(s, c_0) = min over index sets J_0 with |J_0| ≤ s, and over δ ≠ 0 with |δ_{J_0^c}|_1 ≤ c_0 |δ_{J_0}|_1, of |Xδ|_2 / (√n |δ_{J_0}|_2) > 0,
where |J_0| denotes the cardinality of the set J_0, J_0^c denotes the complement of the set of indices J_0, and δ_{J_0} denotes the vector formed by the coordinates of the vector δ w.r.t. the index set J_0. It is essentially a restriction on the eigenvalues of the Gram matrix X⊤X/n as a function of the sparsity s. To see this, recall the definitions of restricted eigenvalue and restricted correlation in Bickel et al. (2009):
where |J| denotes the cardinality of the index set J and X_J is the submatrix of X obtained by removing the columns that do not correspond to the indices in J. Lemma 4.1 in Bickel et al. (2009) shows that if the restricted eigenvalues of the Gram matrix satisfy a suitable lower bound for some integer s, Assumption RE holds. Under this condition, various oracle inequalities for Lasso type estimates can be derived, e.g. in Bickel et al. (2009), where the upper bounds typically depend negatively on the RE constant κ(s, c_0). From an economic point of view, this in fact requires that the dependence cannot be too strong, which, unfortunately, is often unsatisfied for large panels of macroeconomic and financial data.
Third, when the proposed high dimensional semiparametric model has a complex structure, finding a proper penalty term and the corresponding estimation method for variable selection might in general be very difficult, since, ideally, the penalty should depend not only on the coefficients but also on the (shapes of the) unknown nonparametric link functions. Several examples of regularizing high dimensional semiparametric models can be found in Bühlmann and van de Geer (2011) and Ravikumar et al. (2009), among others.
To this end, developing a specific model-free high dimensional spatial structure first and then constructing the right class of semiparametric models seems important. Specifically speaking, given the response Y and the explanatory variables X_1, …, X_p, we try to find the index sets A_1, …, A_d (possibly with overlapping elements) such that Y can be well approximated by:
(1)  Y = g_1(β_1⊤ X_{A_1}) + … + g_d(β_d⊤ X_{A_d}) + ε,

where

|A_l| denotes the cardinality of the index set A_l; d is the number of index sets and also the number of the unknown univariate nonparametric link functions g_1, …, g_d;

X_{A_l} is the vector of regressors w.r.t. the index set A_l; β_1, …, β_d are the unknown parameters in the parametric space; ε is the error term;

X_{A_1}, …, X_{A_d} are (conditionally) independent given the other variables.
If d ≪ p (although d is still possibly not small), we can strike a balance between dimension reduction and flexibility of modeling. Model (1) is very general and includes the single index model if d = 1 (Ichimura (1993)), the additive model if every |A_l| = 1 (Hastie and Tibshirani (1990)), the partial linear model if d = 2, g_1 is the identity function and |A_2| = 1 (Speckman (1988)), and the partial linear single index model if d = 2 and g_1 is the identity function (Ahn and Powell (1993), Carroll et al. (1995), Yu and Ruppert (2002)). Model (1) can also be viewed as an extension of the multiple index model (Stoker (1986), Ichimura and Lee (1991), Horowitz (1998), Xia (2008)) and can be further generalized if the RHS is wrapped by some known link function. As also considered by Li et al. (2010), if A_1, …, A_d are disjoint (no overlapping elements), then for each group of variables X_{A_l}, we can say β_l⊤X_{A_l} denotes (the only) one index. Thus, according to Li et al. (2010), model (1) is identifiable as long as the index subspace of every group is identifiable, and it can be solved efficiently by the groupwise dimension reduction method in Li et al. (2010), where they primarily assume that the grouping information is available. An immediate question is: given the data, how can we extract the groups of “relevant” variables with the corresponding index sets A_1, …, A_d and the number of groups d? This is the second question this paper is going to answer. From now on, we mainly study the case where A_1, …, A_d are disjoint, although in Section 4 we will also present the method generating overlapping index sets.
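As a concrete illustration of model (1) with disjoint groups, the following sketch simulates data with d = 2, A_1 = {1, 2} and A_2 = {3, 4, 5}; the link functions, coefficients and noise level are illustrative assumptions, not choices made in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 500, 5
X = rng.standard_normal((n, p))

# group A_1 = {1, 2} with index beta_1' X_{A_1} and link g_1 = tanh
beta1 = np.array([0.6, 0.8])
# group A_2 = {3, 4, 5} with index beta_2' X_{A_2} and link g_2 = square
beta2 = np.array([0.5, 0.5, 0.7])

y = (np.tanh(X[:, :2] @ beta1)          # g_1(beta_1' X_{A_1})
     + (X[:, 2:] @ beta2) ** 2          # g_2(beta_2' X_{A_2})
     + 0.1 * rng.standard_normal(n))    # error term
```

Because the two indices enter additively through separate groups, a method that recovers the grouping {1, 2} and {3, 4, 5} reduces a 5-dimensional nonparametric problem to two univariate ones.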
Before moving on, let us study the differences among various semiparametric models from the graphical point of view. If we use a vertex in the graph to represent a relevant variable, a solid edge in a “block” to represent a linear relationship among the variables inside, a curved edge (connecting a “block” with the dependent variable) to represent a nonparametric link function, and a crossed vertex to represent an “unrelated” variable, then we can visualize different semiparametric models through the corresponding graphs. For instance, we get Figure 1 (left) for the single index model, Figure 1 (right) for the additive model, and Figure 2 (left) for the (more general) multiple index model, among many others. As we can see, the underlying difference among various semiparametric models is where to allocate the nonparametric link functions and linearity through clustering variables. Consequently, assuming that all the variables have been included (complete graph), if we can find the corresponding type of graph, we can construct the right class of semiparametric models. Sparse concentration matrices are of special interest in graphical models because zero partial correlations help establish independence and conditional independence relations and thus imply a graphical structure. For example, if we have a sparse covariance matrix such as the one in Figure 2 (right), we know which variables are “relevant” to the response, and due to its “block” structure, we can construct the following class of semiparametric models as a specific case of (1):
(2) 
Now we have found the links among semiparametrics, graphical models and the sparse large spatial covariance matrix. Thus consistently estimating the large sparse covariance matrix first and then clustering the (explanatory) variables (or forming a block diagonal structure for the corresponding partition of the covariance matrix) are the key focuses. In this article, we assume that the grouping structure (or the corresponding covariance matrix) and the parametric coefficients are both time invariant to simplify the study.
Another related and potential application of clustering variables comes from group regularization (e.g. the group Lasso, Yuan and Lin (2006)) in modern sparsity analysis. Huang and Zhang (2009) show that if the underlying structure is strongly group-sparse, the group Lasso is more robust to noise due to the stability associated with the group structure and thus requires a smaller sample size to meet the sparse eigenvalue condition required in modern sparsity analysis. However, other than situations (e.g. multi-task learning) where we have clear background knowledge about how to group variables, in general it is hard to tell how to properly group the variables to make use of group regularization. An example can be found in a paper in preparation with Bickel, Song and Bickel (2011), where three types of estimates for large vector autoregressions w.r.t. different grouping methods are discussed. To this end, proper “grouping” of the variables is also significant.
For semiparametric modeling in econometrics, people usually “group” the variables in a “rule of thumb” way. For example, to model the consumer price index (CPI - all items), they might subjectively group the variables “CPI - apparel & upkeep; transportation; medical care; commodities; durables; services” in the first group, “CPI - all items less food; all items less shelter; all items less medical care” in the second group, “Producer Price Index (PPI) - Finished Goods; Finished Consumer Goods; Intermed Mat. Supplies & Components; Crude Materials” in the third group, “Implicit Price Deflator (of Personal Consumption Expenditures) PCE - all items; durables; nondurables; services” in the fourth group, and all other variables in the last group. Is this way of grouping closest to the actual data structure? Why not put “CPI - Durables; PCE - Durables” in one group and “CPI - Services; PCE - Services” in another group? We are going to provide a procedure for grouping these variables in a data-driven way.
In summary, the novelty of this article lies in the following two aspects. First, under the high dimensional time series setting, we show consistency (and the explicit rate of convergence) of the thresholded estimator in the operator norm, uniformly over the class of matrices that satisfy our notion of sparsity, for the generalized dependent time series (with the dependence measure presented later) and for the mixing process (with the mixing coefficient). Furthermore, we quantify the interplay between the estimator’s consistency rate and the time dependence level, which is novel in this context. There are various arguments showing that convergence in the operator norm implies convergence of eigenvalues and eigenvectors, El Karoui (2008) and Bickel and Levina (2008a), so this norm is particularly appropriate for various applications. We also discuss an intuitive resampling scheme for threshold selection for high dimensional time series, and prove a general cross-validation result that justifies this approach. Second, we propose an SCE (screen, cluster, estimate) approach for modeling high dimensional data with complex spatial structure. Specifically, given a consistently estimated large spatial covariance (correlation) matrix, by utilizing its natural links with graphical models and semiparametrics and using the correlation (or covariance for the standardized observations) between variables as a measure of similarity, after “screening” the (explanatory) variables, we propose a novel forward (and backward) label permutation procedure to cluster the “relevant” (explanatory) variables (or to form a block diagonal structure for the regularized large spatial matrix) and construct the corresponding semiparametric model, which is further estimated by the groupwise dimension reduction method (Li et al. (2010)) with sign constraints.
It is noteworthy that the “screening” in Step 1, the “clustering” in Step 2, and the “sign constraints” in Step 3 here are crucial for applying the groupwise dimension reduction method of Li et al. (2010) in the high dimensional situation. First, their method requires the use of a high dimensional kernel function, which faces some limitations when the dimensionality is large. The screening step here helps reduce the dimensionality from p to a more manageable level. Second, they primarily assume that the grouping information is available from background knowledge, which is often not the case for the typically (spatially) unordered high dimensional data sets. Although they also propose an information criterion based grouping method, this approach (“trying” many different combinations of grouping) is computationally very intensive and less practical. The clustering step here provides this grouping information in a data driven way with feasible computation. Third, without adding the “sign constraints”, the signs of the estimated parametric coefficients might violate economic laws (details presented in Section 5). Overall, together with Li et al. (2010)’s very timely and stimulating work, we provide an integrated approach for modeling high dimensional data with complex spatial structure.
The rest of the article is organized as follows. In the next section, we present the main notations of the thresholding estimator. The estimator’s properties are presented in Section 3. In Section 4 we state the details of the SCE procedure, and in Section 5 we apply it to study the spatial structure of large panels of macroeconomic and financial time series and find the proper semiparametric structure for estimating the consumer price index (CPI). Section 6 contains concluding remarks with a brief discussion. All technical proofs are sketched in the appendix.
2 Dynamic Large Spatial Covariance Matrix Estimation
We start by setting up notation and corresponding concepts for the covariance matrix Σ, mostly following Bickel and Levina (2008b) and Bickel and Levina (2008a). We write λ_1(M), …, λ_p(M) for the eigenvalues of a matrix M. For any 1 ≤ r, s ≤ ∞ and a matrix M, we define the operator norm ||M||_{(r,s)} = sup{ |Mx|_s : |x|_r = 1 }. In particular, we write ||M|| = ||M||_{(2,2)}, which for a symmetric matrix equals max_i |λ_i(M)|. We also use the Frobenius matrix norm, ||M||_F² = Σ_{i,j} m_{ij}². Dividing ||M||_F² by p gives the average of the eigenvalues of M⊤M, while the squared operator norm is the maximum of the same set of eigenvalues. Bickel and Levina (2008a) define the thresholding operator by

T_t(M) = [ m_{ij} 1(|m_{ij}| ≥ t) ],
which we refer to as M thresholded at t. Notice that T_t preserves symmetry; it is invariant under permutations of variable labels; and if ||T_t(M) − M|| ≤ ε and λ_min(M) > ε, it preserves positive definiteness.
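A minimal sketch of the entrywise hard thresholding operator T_t (the matrix and threshold below are arbitrary illustrations):

```python
import numpy as np

def hard_threshold(M, t):
    """Entrywise hard thresholding: keep m_ij when |m_ij| >= t, else set it to 0."""
    M = np.asarray(M, dtype=float)
    return np.where(np.abs(M) >= t, M, 0.0)

S = np.array([[1.00, 0.30, 0.05],
              [0.30, 1.00, -0.02],
              [0.05, -0.02, 1.00]])
T = hard_threshold(S, t=0.1)
print(T)   # the small entries 0.05 and -0.02 are set to zero; symmetry is preserved
```

Since the operation is entrywise, jointly permuting rows and columns (relabeling the variables) commutes with T_t, which is the permutation invariance noted above.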
We study the properties of the following uniformity class of covariance matrices invariant under permutations:

U_τ(q, c_0(p), M) = { Σ : σ_ii ≤ M, Σ_{j=1}^p |σ_ij|^q ≤ c_0(p) for all i },  0 ≤ q < 1.
We will mainly write c_0 for c_0(p) in the future. Suppose that we observe n p-dimensional observations X_1, …, X_n with E(X_t) = 0 (without loss of generality) and Cov(X_t) = Σ, which does not depend on t. We consider the sample covariance matrix
(3)  Σ̂ = n⁻¹ Σ_{t=1}^n (X_t − X̄)(X_t − X̄)⊤,

with X̄ = n⁻¹ Σ_{t=1}^n X_t.
Let us first recall the fractional cover theory based definition, which was introduced by Janson (2004) and can be viewed as a generalization of m-dependency. Given an index set I and random variables Y_i, i ∈ I, we say:

A subset A of I is independent if the corresponding random variables {Y_i, i ∈ A} are independent.

A family {A_j} of subsets of I is a cover of I if ∪_j A_j = I.

A family of pairs {(A_j, w_j)}, where A_j ⊆ I and w_j ∈ [0, 1], is a fractional cover of I if Σ_j w_j 1(i ∈ A_j) ≥ 1 for each i ∈ I.

A (fractional) cover is proper if each set A_j in it is independent.

χ(I) is the size of the smallest proper cover of I, i.e. the smallest m such that I is the union of m independent subsets.

χ*(I) is the minimum of Σ_j w_j over all proper fractional covers {(A_j, w_j)}.

Notice that, in the spirit of these notations, χ(I) and χ*(I) depend not only on I but also on the family {Y_i}. Further note that χ*(I) ≥ 1 (unless I = ∅) and that χ*(I) = 1 if and only if the variables are independent, i.e. χ*(I) is a measure of the dependence structure of {Y_i}. For example, if each Y_i only depends on Y_{i−1} but is independent of all the other variables, we will have two independent sets:

A_1 = {i ∈ I : i odd} and A_2 = {i ∈ I : i even}, s.t. A_1 ∪ A_2 = I. So χ*(I) ≤ 2 (if |I| ≥ 2).
Besides the generalized dependent process, we are also going to consider the mixing process, which is defined through the underlying measures of dependence between σ-fields. More precisely, let (Ω, F, P) be a probability space and A, B be two sub-σ-algebras of F; the mixing coefficient β(A, B) is a measure of dependence between A and B, which was defined by Kolmogorov and first appeared in the paper by Volkonskii and Rozanov (1959). By its definition, the closer β is to 0, the closer the time series is to being independent. For examples of mixing processes, we refer to Doukhan (1994). Throughout this article, we use β to denote the mixing coefficient for notational convenience.
3 Estimates’ Properties
We have the following two results which parallel those in Bickel and Levina (2008b) and Bickel and Levina (2008a).
3.1 Interplay Between Consistency Rate and Time Dependence Level
Theorem 3.1 (Dependence level affects consistency?)
Suppose for all , holds with a high probability and is bounded by some constant . Then, uniformly on , for sufficiently large also depending on , if
and , then
Not surprisingly, this theorem states that if we use the hard thresholding method to regularize large sample covariance matrices, the consistency rate gets slower when the dependence level χ* increases; in other words, the rate is fastest when χ* = 1, the same as what Bickel and Levina (2008a) show for the i.i.d. case. When χ* reaches the order of n, it is offset by the n in the denominator. The intuition behind this is clear: if dependence is strong, then the additional information brought by a “new” observation is effectively smaller, i.e. the overall information from n observations is correspondingly less, which results in a slower consistency rate. On the other hand, according to the requirement, when the dependence level increases, the dimensionality must decrease and the sample size must increase to retain the same amount of information.
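The effective-information intuition can be checked numerically (a sketch under assumed AR(1) temporal dynamics; the true spatial covariance is the identity in both cases): at the same sample size, stronger serial dependence inflates the entrywise error of the sample covariance.

```python
import numpy as np

rng = np.random.default_rng(2)

def max_entry_error(rho, n=400, p=20, reps=30):
    """Average sup-norm error |S - I| of the sample covariance when every
    coordinate follows a stationary AR(1) with coefficient rho."""
    errs = []
    for _ in range(reps):
        eps = rng.standard_normal((n, p))
        X = np.empty((n, p))
        X[0] = eps[0]
        for t in range(1, n):
            # AR(1) recursion scaled so that the stationary variance is 1
            X[t] = rho * X[t - 1] + np.sqrt(1 - rho ** 2) * eps[t]
        S = X.T @ X / n
        errs.append(np.abs(S - np.eye(p)).max())
    return float(np.mean(errs))

err_iid, err_dep = max_entry_error(0.0), max_entry_error(0.9)
print(err_iid, err_dep)   # stronger time dependence -> larger error at the same n
```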
A very natural question to ask next is: to what extent is the degree of dependence (in terms of mixing coefficients) allowed to grow while the consistency rate remains the same as in the i.i.d. case, i.e. to study the relationship among the high dimensionality p, the moderate sample size n and the mixing coefficient β.
Assumption 3.1

,

, ,

,
Theorem 3.2 (Balancing dimensionality, sample size and dependence to achieve a “good” consistency rate)
Assume the mixing sequence satisfies Assumption 3.1 with a high probability. Then, uniformly on , for sufficiently large also depending on , if , and the mixing coefficient , we have:
As we can see, when the dimensionality increases, since the mixing coefficient is controlled accordingly, the dependence level must decrease at the corresponding rate (skipping the slowly varying terms). When the dimensionality is very large, this means the observations are “nearly” independent, which again confirms the result of the previous theorem.
3.2 Choice of Threshold via Cross Validation
Choices of the threshold play a fundamental role in implementing this estimation procedure. We choose an optimal threshold by a cross-validation procedure as in Bickel and Levina (2008a) and Bickel and Levina (2008b). In particular, we divide a data segment into two consecutive pieces of sizes n_1 and n_2 respectively, where n_1 is typically about one third of the segment. Then we compare the regularized (via thresholding) “target” quantity T_t(Σ̂_{1,v}), estimated from the first piece, with the “target” quantity Σ̂_{2,v}, estimated from the second piece. Hence Σ̂_{2,v} can be viewed as a proxy to the population “target” quantity Σ. The subindex v in Σ̂_{1,v} and Σ̂_{2,v} indicates values from the v-th split out of a total of N repeats. The optimal threshold t̂ is then selected as a minimizer (w.r.t. t) of the empirical loss function over the N repeats, i.e.
(4)  t̂ = argmin_t N⁻¹ Σ_{v=1}^N || T_t(Σ̂_{1,v}) − Σ̂_{2,v} ||_F².
Similarly, the oracle threshold t_o is selected as a minimizer w.r.t. t of the oracle loss function over the N repeats, i.e.

(5)  t_o = argmin_t N⁻¹ Σ_{v=1}^N || T_t(Σ̂_{1,v}) − Σ ||_F².
Since the data are observed in time, the order of the observations is of importance, and hence a random split into two pieces is not appropriate in a time series context. Alternatively, we randomly select a consecutive segment from the data set first, and then take the first third of the segment as the first piece and the remaining two thirds as the second. Figure 3 provides an illustration of the cross-validation procedure. We repeat this N times as before. Our goal now is to show that the rates of convergence of the empirical loss function and the oracle loss function are of the same order and hence, asymptotically, the empirical threshold performs as well as the oracle threshold selection.
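The splitting scheme above can be sketched as follows (an illustrative implementation under assumed choices: the segment length, the Frobenius loss and the threshold grid are not pinned down by the text at this level of detail):

```python
import numpy as np

def hard_threshold(M, t):
    return np.where(np.abs(M) >= t, M, 0.0)

def cv_threshold(X, thresholds, n_repeats=20, seg_frac=0.75, seed=0):
    """Pick a threshold by repeatedly drawing a random consecutive segment,
    thresholding the covariance estimate from its first third and comparing
    it (Frobenius loss) with the plain estimate from the last two thirds."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    seg = int(seg_frac * n)                     # segment length (an assumption)
    losses = np.zeros(len(thresholds))
    for _ in range(n_repeats):
        start = rng.integers(0, n - seg + 1)
        block = X[start:start + seg]
        n1 = seg // 3
        S1 = np.cov(block[:n1], rowvar=False)   # first third: thresholded estimate
        S2 = np.cov(block[n1:], rowvar=False)   # last two thirds: proxy target
        for k, t in enumerate(thresholds):
            losses[k] += np.sum((hard_threshold(S1, t) - S2) ** 2)
    return thresholds[int(np.argmin(losses))]

rng = np.random.default_rng(3)
X = rng.standard_normal((300, 10))              # truth: identity covariance
t_hat = cv_threshold(X, np.linspace(0.0, 1.0, 21))
print(t_hat)   # a strictly positive threshold wins, since the off-diagonal truth is zero
```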
Our theoretical justification is based on adapting the results on optimal threshold selection in Bickel and Levina (2008a) and optimal band selection in Bickel and Gel (2011) to the case of an optimal choice of threshold for high dimensional mixing time series. Let X_1, …, X_n be p-dimensional vectors with common mean μ. Then the empirical and oracle estimates are defined as
(6)  
(7) 
respectively, where is estimated using .
We use Theorem 3 in Bickel and Levina (2008a) as Lemma 3.1 here, which states a result on the asymptotic relation between the empirical and oracle estimates.
Lemma 3.1 (Theorem 3 in Bickel and Levina (2008a))
If the following assumptions (A4, A5, A6) are satisfied

;

for ;

,
then we have
Without loss of generality, assume that the number of repeats N = 1. Notice that the empirical estimates defined in (6) and (7) play the roles of the empirical and oracle estimates in Lemma 3.1 respectively. Hence, if we can verify the conditions of Lemma 3.1, we can apply it to justify the choice of a threshold by cross-validation and show that the regularized covariance matrix of high dimensional time series, with an empirically selected threshold, asymptotically coincides with the regularized estimate selected by the oracle. To this end, we also need the auxiliary Lemma 3.2.
Lemma 3.2
Assume that is white noise satisfying and for . Let . For the mixing process satisfying the conditions of Theorem 3.2, we have
with some constants and .
4 The Screen-Cluster-Estimate (SCE) Procedure
To circumvent the problems in semiparametric modeling for high dimensional data with complex spatial structure, in the following three subsections, we state the threestep SCE procedure for constructing and estimating semiparametric models from a large number of unordered explanatory variables with a moderate sample size.
4.1 Screen

Estimate the large covariance (Spearman’s rank correlation) matrix of the dependent variable and all explanatory variables using hard thresholding, and only keep and consider the (say q) explanatory variables with nonzero correlation entries with the dependent variable for the following steps. Without loss of generality, we rename these variables as X_1, …, X_q.
Since all observations are standardized first, the previously considered covariance matrix is actually the (Pearson’s) correlation coefficient matrix. However, at the “screening” step, we estimate and threshold the large Spearman’s rank correlation matrix, where the Spearman’s rank correlation between X and Y is defined as:

(8)  ρ_S(X, Y) = 12 E[F_X(X) F_Y(Y)] − 3,

where F_X and F_Y are the cumulative distribution functions of X and Y respectively. It can be seen that the population version of Spearman’s rank correlation is just the classic Pearson’s correlation between F_X(X) and F_Y(Y). We consider Spearman’s rank correlation instead of the Pearson’s correlation coefficient because the latter is sensitive only to a linear relationship between two variables, while the former is more robust, that is, more sensitive to nonlinear relationships. It can be viewed as a nonparametric measure of correlation and is especially suitable for the non- and semiparametric situations we consider here. It assesses how well an arbitrary monotonic function can describe the relationship between two variables. Specifically, it measures the extent to which, as one variable increases, the other variable tends to increase, without requiring that increase to be represented by a linear relationship. If, as one variable increases, the other decreases, the rank correlation coefficient will be negative. Similar to the consistency results for the large spatial thresholded covariance (correlation) matrix studied here, Xu and Bickel (2010) established those for the large Spearman’s rank correlation matrix (for the i.i.d. case).
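A sketch of the screening step using the sample version of (8), i.e. the Pearson correlation of ranks (the threshold value, sample size and data-generating design below are illustrative assumptions):

```python
import numpy as np

def spearman(x, y):
    """Sample Spearman rank correlation: Pearson correlation of the ranks."""
    rx = np.argsort(np.argsort(x))
    ry = np.argsort(np.argsort(y))
    return float(np.corrcoef(rx, ry)[0, 1])

def screen(y, X, t):
    """Keep the indices of regressors whose absolute Spearman correlation
    with the response survives the hard threshold t."""
    return [j for j in range(X.shape[1]) if abs(spearman(X[:, j], y)) >= t]

rng = np.random.default_rng(4)
n = 200
X = rng.standard_normal((n, 5))
# the response depends on X_0 through a monotone but nonlinear transformation,
# which rank correlation picks up even though Pearson correlation is weakened
y = np.exp(X[:, 0]) + 0.1 * rng.standard_normal(n)
kept = screen(y, X, t=0.3)
print(kept)   # the relevant variable X_0 survives the threshold
```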
At this step, via hard thresholding, we single out the important predictors by using their Spearman’s rank correlations with the response variable and eliminate all explanatory variables that are “irrelevant” to the response. In light of equation (1), we actually obtain an estimate of the union of the index sets A_1, …, A_d. Thus we can reduce the feature space significantly from p to a lower dimensional and more manageable space. Correlation learning is a specific case of independent learning, which ranks the features according to the marginal utility of each feature. Computational expediency and stability are prominently featured in independent learning. This kind of idea is frequently used in applications (Guyon and Elisseeff (2003)) and recently has been carefully studied for its theoretical properties: by Fan and Lv (2008), using Pearson’s correlation for variable screening of linear models; by Huang et al. (2008), who proposed the use of marginal bridge estimators to select variables for sparse high dimensional regression models; by Fan et al. (2011), using the marginal strength of the marginal nonparametric regression for variable screening of additive models; and by Hall and Miller (2009), using the generalized correlation for variable selection of linear models.
It is also worth noticing that the threshold is a global measure (implicitly) depending on all variables. If we remove some variables from the original explanatory variable set, the threshold value will change correspondingly. Thus the “relevant” and “irrelevant” regressors will also change.
4.2 Cluster
Motivated by the fact that in a block diagonal matrix the nonzero entries inside the diagonal blocks are denser than those in the off-diagonal region, and by the assumption w.r.t. equation (1) that “X_{A_1}, …, X_{A_d} are (conditionally) independent given the other variables”, we define the following “averaged nonzero” score for an index set A: the proportion of nonzero off-diagonal entries of the thresholded correlation matrix among all pairs of variables in A. Here we do not distinguish between positive and negative correlation values, since their signs can also be reflected by the corresponding linear coefficients as in equation (1).

Perform the label permutation procedure to form clusters of (explanatory) variables by utilizing the "averaged nonzero" score.

Rank (in decreasing order) and relabel all explanatory variables accordingly to obtain the "new" ordering. Always assume the top-ranked variable is in the first block (index set);

Forward: include the next variable in the first index set if its "averaged nonzero" score w.r.t. that set is large enough, and continue searching until the first variable failing this criterion is reached. Without loss of generality (otherwise just relabel them), we assume the selected variables are the leading ones.

(For the case of no overlapping indices among the index sets)
Given the index set formed in the last step, perform Steps 2.1 and 2.2 again for the variables not yet selected and construct the next index set.
Backward (replace Step 2.3a, for the case allowing overlapping indices among the index sets)
Given the index sets formed so far, perform Step 2.1 again for the variables not yet included and start to construct the next index set. A previously selected variable enters this new set only if the corresponding score criterion holds, and the search continues until it fails. Notice that it is impossible for all variables of an earlier set to re-enter, because of the way the sets are constructed in the forward step. Continue to construct further index sets as in the forward step.
Continue this procedure until all variables have been included in some index set(s); the number of selected index sets is then determined. Given these, construct the corresponding semiparametric models by equation (1).
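The forward grouping above can be sketched, in a simplified one-pass form, on a hard-thresholded correlation matrix; the score follows the "averaged nonzero" idea, but the function name, acceptance level and toy matrix are our illustrative choices, and the paper's restarting and backward variants are omitted:

```python
import numpy as np

def cluster_by_nonzero_score(R_thr, order, min_score=0.5):
    """Greedy forward grouping on a hard-thresholded correlation matrix.
    Variables, visited in ranked order, join the current block when the
    share of their nonzero links to the block (the "averaged nonzero"
    score) reaches min_score; otherwise a new block is started."""
    A = (np.asarray(R_thr) != 0).astype(float)
    np.fill_diagonal(A, 0.0)
    blocks, current = [], [order[0]]
    for j in order[1:]:
        if A[j, current].mean() >= min_score:
            current.append(j)
        else:
            blocks.append(current)
            current = [j]
    blocks.append(current)
    return blocks

# Two-block toy correlation matrix after thresholding
R = np.eye(5)
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4)]:
    R[i, j] = R[j, i] = 0.6
blocks = cluster_by_nonzero_score(R, order=[0, 1, 2, 3, 4])
```

Because the input is the thresholding-regularized matrix, the adjacency structure is sparse and the pass over the ranked variables is cheap, in contrast to correlation clustering on the full sample matrix.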

At the clustering step, if we can permute the variables' labels so that the partition of the consistently estimated covariance matrix has a block diagonal structure, such as the one in Figure 2 (right), we can construct the corresponding class of semiparametric models as special cases of equation (1). In this step, we group the (explanatory) variables into highly correlated groups, which are, however, only weakly correlated with each other. This "independence" property (between the "new" grouped predictors) is actually also required by the groupwise dimension reduction method of Li et al. (2010) used for estimation. Besides using Spearman's rank correlation instead of Pearson's correlation for "screening", a second difference between this work and Fan and Lv (2008) is that we consider the covariance (correlation) matrix of all variables instead of just the correlations between each predictor and the response, because we need to further group the relevant explanatory variables for semiparametric model construction in the second step.
A very important feature of the proposed label permutation procedure is that it is based on the thresholding-regularized covariance matrix instead of the sample one. A related work that tries to discover the ordering of the variables through the metric multidimensional scaling method can be found in Wagaman and Levina (2009) for the i.i.d. Gaussian case. Their ultimate goal is to improve covariance matrix estimation rather than to order the variables themselves. Thus, by utilizing the discovered "order" based on the sample covariance matrix, they estimate the large covariance matrix through banding regularization to enjoy the benefits brought by ordering. But in the case of large panels of economic and financial variables considered here, our ultimate goal is to cluster the variables in order to construct proper semiparametric models rather than to "order" them. For example, in the multiple index model, the order of the first index (or first cluster of variables) relative to the second index (or second cluster of variables), and the order of variables inside each "cluster", are both unimportant.
This case is also related to hierarchical clustering, the k-means algorithm and the correlation clustering problem in computer science (Demaine and Immorlica (2003), Bansal et al. (2004)), which aims to partition a weighted graph with positive and negative edge weights so that negative edges are broken up and positive edges are kept together. However, the correlation clustering algorithm is also based on the sample correlation(s), and has been shown to be NP-hard. Thus, as a key difference from other works in the literature, instead of using the sample covariance (or correlation) matrix for ordering and clustering as Wagaman and Levina (2009), Demaine and Immorlica (2003) and Bansal et al. (2004) did, we first implement thresholding regularization of the sample covariance matrix and screening, and then find the corresponding groups through the stepwise label permutation procedure. It is simpler to implement than theirs, since the thresholding-regularized covariance matrix has only a limited number of nonzero entries. By doing so, w.r.t. the regression setup, we also simultaneously extract the "relevant" explanatory variables at the screening step. Thus, we actually combine dimension reduction and variable clustering, which is especially suitable for modeling high dimensional data via semiparametric methods.
This procedure is computationally simple for a typical macroeconomic and financial data set, since the thresholding regularization procedure removes the "irrelevant" variables first and ranks the remaining ones before they enter the label permutation procedure. Thus we avoid the NP-hard correlation clustering problem based on the sample covariance matrix.
4.3 Estimate

Groupwise dimension reduction with sign constraints.
For the estimation step, we implement the groupwise dimension reduction estimation procedure modified from Li et al. (2010). If we implemented their method directly, then, as we can see from Table 1 (details of the data presented later), the Spearman's rank correlations between the selected predictors and CPI are all positive, yet some of the corresponding parametric coefficients are estimated to be negative (details presented later in Table 2). This would mean that the consumer price index depends negatively on them, which is unlikely to be true from an economic point of view. Since we have disjoint groups of variables here, and given the meaning of the Spearman's rank correlation, ideally the sign of each parametric coefficient estimate should agree with the sign of the corresponding Spearman's rank correlation. This motivates us to add a sign constraint, as a refinement, to the groupwise dimension reduction method developed in Li et al. (2010) to secure sign consistency.
Let us first consider a simple linear regression model y = X*beta + eps with the entrywise constraint beta >= 0. The linear coefficient can be estimated as the minimizer of the Lagrangian ||y - X*beta||^2 - sum_j lambda_j * beta_j with the corresponding nonnegative Lagrange multipliers lambda_j >= 0, j = 1, ..., p. Setting the derivative w.r.t. beta to zero gives the estimate beta_hat = (X'X)^{-1}(X'y + lambda/2), where lambda = (lambda_1, ..., lambda_p)'. Intuitively, in case some entry of the unconstrained estimate (X'X)^{-1}X'y, say the j-th, is negative, which contradicts the initial requirement beta_j >= 0, lambda_j plays the role of adding a positive increment to it, s.t. the j-th entry of beta_hat is nonnegative.
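The role of the sign constraint can be checked numerically; below is a sketch comparing the unconstrained least-squares fit with a nonnegativity-constrained fit via scipy.optimize.nnls (an off-the-shelf solver used for illustration, not the paper's groupwise estimator), on simulated data of our own:

```python
import numpy as np
from scipy.optimize import nnls

# Simulated regression whose true coefficients respect beta >= 0
rng = np.random.default_rng(1)
n, p = 100, 3
X = rng.standard_normal((n, p))
beta_true = np.array([1.0, 0.5, 0.0])
y = X @ beta_true + 0.05 * rng.standard_normal(n)

beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]  # unconstrained fit
beta_nn, _ = nnls(X, y)                          # fit under beta >= 0
```

In small samples the unconstrained fit can return a slightly negative coefficient for the irrelevant regressor, while the constrained solution clips it at zero, mirroring the role of the Lagrange multipliers above.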
Similarly, in our setup, if we take the Spearman's rank correlation estimate between each selected predictor and y extracted from the regularized correlation matrix and add the corresponding sign constraint to the estimation procedure of Li et al. (2010), a simple calculation shows that we just need to replace their estimation equation (15) by (our notation here corresponds to theirs):
(9) 
where the added term is a diagonal matrix of Lagrange multipliers, which serve as hyperparameters; the other quantities are the same as in equation (15) of Li et al. (2010). In general, selecting the multipliers requires minimizing some loss function. Motivated by the discussion above for the simple linear regression case, when the sign of a coefficient estimate agrees with that of the corresponding rank correlation, we simply choose the multiplier to be zero; otherwise we choose it to be the minimum (positive) value such that the signs agree. In our experience this works well, and convergence of the iterative estimation procedure (up to a given tolerance) is achieved within a small number of iteration steps for modeling CPI, which is presented in Section 5. Then, by the properties of the convex minimization problem, if a local minimum exists, it is also a global minimum.
Overall, similar to Fan and Lv (2008)'s "screen first, fit later" approach for modeling high dimensional data, ours can be considered a "screen first, group second, fit third" approach. Alternatively, Bickel et al. (2009) and Meinshausen and Bühlmann (2006) consider the "fit first, screen later" approach. In general, a great deal of work is needed to compare "screen first, fit later" type methods with "fit first, screen later" type methods in terms of consistency and oracle properties. But when the spatial structure is complex (so that we need to deviate from linearity), in terms of semiparametric modeling, as discussed in Section 1, the latter may face several limitations, while ours, together with the estimation method modified from Li et al. (2010), as a special case of the former, can circumvent these issues and is faster when dealing with higher dimensionality.
5 Application
We use the dataset of Stock and Watson (2005). This dataset contains monthly macro indicators covering a broad range of categories including income, industrial production, capacity, employment and unemployment, consumer prices, producer prices, wages, housing starts, inventories and orders, stock prices, interest rates for different maturities, exchange rates, money aggregates and so on. The time span is from January to December . We apply logarithms to most of the series, except those already expressed in rates. The series are transformed to obtain stationarity by taking first- or second-order differences of the raw series (or of the logarithm of the raw series). Then all observations are standardized.
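The preprocessing described above (logarithms, differencing, standardization) can be sketched as follows; the function and its defaults are our illustrative choices:

```python
import numpy as np

def make_stationary(series, take_log=True, diff_order=1):
    """Stock-and-Watson-style preprocessing sketch: optional logarithm,
    then first- or second-order differencing, then standardization."""
    x = np.log(series) if take_log else np.asarray(series, dtype=float)
    for _ in range(diff_order):
        x = np.diff(x)  # each difference shortens the series by one
    return (x - x.mean()) / x.std()

# Example: a log random walk becomes (approximately) white noise
rng = np.random.default_rng(42)
prices = np.exp(np.cumsum(0.01 * rng.standard_normal(300)))
z = make_stationary(prices, take_log=True, diff_order=1)
```

Whether to log-transform and which differencing order to use is decided series by series, depending on whether the series is already expressed in rates.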
Table 1: Spearman's rank correlations among CPI and the selected predictor series of Stock and Watson (2005) (upper triangle of the correlation matrix; rows labeled by series number, each row listing the correlations with the subsequent series):

CPI   0.62  0.76  0.65  0.65  0.51  0.67  0.65  0.25  0.28  0.28  0.27  0.17  0.25  0.14  0.25
121         0.61  0.55  0.51  0.56  0.52  0.51  0.17  0.22  0.27  0.19  0.20  0.22  0.14  0.32
123               0.66  0.62  0.47  0.73  0.65  0.24  0.26  0.26  0.25  0.19  0.22  0.13  0.25
122                     0.65  0.55  0.70  0.71  0.29  0.28  0.24  0.23  0.13  0.21  0     0
124                           0.45  0.63  0.80  0.23  0.25  0.26  0.30  0     0.58  0.26  0
116                                 0.54  0.51  0.20  0.25  0.24  0     0.19  0     0.15  0
118                                       0.77  0.29  0.31  0.29  0.27  0.19  0     0     0
126                                             0.31  0.35  0.30  0.34  0     0.14  0     0
108                                                   0.87  0.46  0     0     0     0     0
109                                                         0.47  0     0