Smoothing spline ANOVA for super-large samples: Scalable computation via rounding parameters111A version of this paper will appear in the upcoming special issue of Statistics and Its Interface on Statistical and Computational Theory and Methodology for Big Data.
In the current era of big data, researchers routinely collect and analyze data of super-large sample sizes. Data-oriented statistical methods have been developed to extract information from super-large data. Smoothing spline ANOVA (SSANOVA) is a promising approach for extracting information from noisy data; however, the heavy computational cost of SSANOVA hinders its wide application. In this paper, we propose a new algorithm for fitting SSANOVA models to super-large sample data. In this algorithm, we introduce rounding parameters to make the computation scalable. To demonstrate the benefits of the rounding parameters, we present a simulation study and a real data example using electroencephalography data. Our results reveal that (using the rounding parameters) a researcher can fit nonparametric regression models to very large samples within a few seconds using a standard laptop or tablet computer.
Keywords: Smoothing spline ANOVA, Rounding parameter, Scalable algorithm
In the current era of big data, it is common for researchers to collect super-large sample data ranging from hundreds of thousands to hundreds of millions of observations. The ambitious BRAIN Initiative of NIH is expected to bring a torrent of data, e.g, 100 terabytes of data per day from a single brain lab. These super-large datasets provide a wealth of information. To effectively extract the information, numerous data-oriented statistical learning methods have been developed. Among these methods, data-driven nonparametric regression models (see Ruppert et al., 2003; Silverman, 1985) have achieved remarkable success in identifying subtle patterns and discovering functional relationships in large noisy data; such models require few assumptions about the observed data, but produce a powerful prediction.
For example, smoothing splines (see Silverman, 1985; Wahba, 1990) offer a powerful and flexible framework for nonparametric modeling. Smoothing spline analysis of variance (SSANOVA) models (Gu, 2013) further expand the research horizon of the smoothing spline; SSANOVAs can model multivariate data and provide nice interpretability of the modeling and prediction outcome. Furthermore, assuming that the smoothing parameters are selected via cross-validation, SSANOVA models have been shown to have desirable asymptotic properties (see Gu, 2013; Li, 1987; Wahba, 1990). The main drawback of the SSANOVA approach is its computational expense: the computational complexity of SSANOVA is on the order of , where is sample size.
Over the years, many efforts have been made to design scalable algorithms for SSANOVA. Generalized additive models (GAMs; Hastie & Tibshirani, 1990; Wood, 2006) provide scalable computation at the price of eliminating or reparameterizing all interaction terms of an SSANOVA model. By collapsing similar subspaces, Helwig and Ma (2015) provide an algorithm for modeling all interactions with affordable computational complexity. However, even using the most efficient SSANOVA approximation (Kim & Gu, 2004; Ma et al., 2015) and algorithm (Helwig & Ma, 2015), the computational burden grows linearly with the sample size, which makes the approach impractical for analyzing super-large datasets.
One possibility is to fit the model to a subset of the observed data. For example, when analyzing ultra large datasets, Ma et al. (2014) suggest fitting regression models to a randomly selected influential sample of the full dataset. This sort of smart-sampling approach works well, as long as a representative sample of observations is selected for analysis; however, the fitted model varies from time to time as the subsample is randomly taken. Furthermore, determining the appropriate size of the subsample could be difficult in some situations.
In this paper, we propose a new approach for fitting SSANOVA models to super-large samples. Specifically, we introduce user-tunable rounding parameters in the SSANOVA model, which makes it possible to control the precision of each predictor. As we demonstrate, fitting a nonparametric regression model to the rounded data can result in substantial computational savings without introducing much bias to the resulting estimate. In the following sections, we provide a brief introduction to SSANOVA (Section 2), develop the concept of rounding parameters for nonparametric regression (Section 3), present finite-sample and asymptotic results concerning the quality of the rounded SSANOVA estimator (Section 4), demonstrate the benefits of the rounding parameters with a simulation study (Section 5), and provide an example with real data to reveal the practical potential of the rounding parameters (Section 6).
2 Smoothing Splines
A typical (Gaussian) nonparametric regression model has the form
where is the response variable, is the predictor vector, is the unknown smooth function relating the response and predictors, and is unknown, normally-distributed measurement error (see Gu, 2013; Ruppert et al., 2003; Wahba, 1990). Typically, is estimated by minimizing the penalized least-squares functional
where the nonnegative penalty functional quantifies the roughness of , and the smoothing parameter balances the trade-off between fitting the data and smoothing .
Given fixed smoothing parameters and a set of selected knots , the minimizing Equation (2) can be approximated using
where are functions spanning the null space (i.e., ), is the reproducing kernel (RK) of the contrast space (i.e., ), and and are the unknown function coefficients (see Helwig & Ma, 2015; Kim & Gu, 2004; Gu & Wahba, 1991). Note that , where denotes the RK of the -th orthogonal contrast space, and are additional smoothing parameters with .
where denotes the squared Frobenius norm, , for and , with for and , and where for . Given a choice of , the optimal function coefficients are given by
where denotes the Moore-Penrose pseudoinverse.
The fitted values are given by , where
3 Rounding Parameters
When fitting a nonparametric regression model to ultra large samples, we propose including user-tunable rounding parameters in the model (see Helwig, 2013, for preliminary work). Assuming that all (continuous) predictors have been transformed to the interval [0,1], the rounding parameters are used to create locally-smoothed versions of the (continuous) predictor variables, such as
for and , where the rounding function rounds the input value to the nearest integer. Note that the scores are formed simply by rounding the original scores to the precision defined by the rounding parameter for the -th predictor variable, e.g., if , then each value is rounded to the nearest .02 to form .
Let with defined according to Equation (8), and let denote the rounded knots; then, the penalized least-squares function in Equation (4) can be approximated as , where , , and are defined according to Equation (4) with replacing . Similarly, the optimal basis function coefficients corresponding to the rounded data (i.e., and ) can be defined according to Equation (5) with with replacing . Finally, smoothing matrix corresponding to these coefficients (denoted by ) can be defined according to Equation (6) with with replacing .
One could calculate the fitted values using (and this is what we recommend for the smoothing parameter estimation), however this could introduce a small bias to each predicted score. So, when interpreting specific scores, we recommend using the and coefficients and basis function matrices with unrounded predictor variable scores
where and are defined according to Equation (4).
3.2 Computational Benefits
Let denote the set of unique observed vectors with , and note that has an upper-bound that is determined by the rounding parameters and the predictor variables. For example, suppose that with and ; then, defining , it is evident that , given that can have a maximum of 101 unique values for the first predictor, and maximum of unique values for the second predictor. As a second example, suppose that with ; then, defining , it is evident that , given that can have a maximum of 101 unique values for each predictor. Similar reasoning can be used to place an upper bound on for different combinations of rounding parameters and predictor variable types.
Note that the inner-portion of can be written as
where for and , where for and , and with denoting the number of that are equal to (for ). Next, define and define the reduced smoothing matrix , such as
Note that is a matrix, and note that if there are replicate predictor vectors after the rounding (which is guaranteed if is larger than ’s upper bound).
Next, suppose that the scores are ordered such that observations have predictor scores , observations have predictor scores , and so on. Then can be written in terms of , such as
where denotes a vector with a one in the -th position and zeros elsewhere, and denotes a vector of ones (for ). Furthermore, note that the fitted values corresponding to can be written as
where with and denoting the set of indices such that is equal to .
Now, let denote the fitted value corresponding to (for ), and note that the numerator of the GCV score in Equation (7) can be written as
In addition, note that the denominator of the GCV score can be written as using the relation in Equation (12).
The above formulas imply that, after initializing , , and , it is only necessary to calculate the reduced smoothing matrix to evaluate the GCV score. Furthermore, note that the optimal function coefficients can be estimated from the reduced smoothing matrix using
which implies that it is never necessary to construct the full smoothing matrix to estimate when using the rounding parameters.
3.3 Choosing Rounding Parameters
In many situations, a rounding parameter can be determined by the measurement precision of the predictor variable. For example, suppose we have one predictor recorded with the precision of two decimals on the interval [0,1], i.e., for . In this case, setting will produce the exact same solution as using the unrounded predictors (i.e., ) and can immensely reduce the computational burden. Note that even if is very large, and it is only necessary to evaluate the functions and for the unique predictor scores to estimate .
Now, for large , note that a cubic smoothing spline is approximately a weighted moving average smoother (see Silverman, 1985, Section 3). In particular, let denote the entry in the -th row and -th column of , and note that asymptotically depends on a kernel function whose influence decreases exponentially as increases (see Silverman, 1985, equations 3.1–3.4). Also, note that the rounding parameter proposed in this paper widens the peak of the kernel (see Figure 1).
For relatively smooth functions (e.g., ), the shape of the asymptotic kernel function is stable for ; however, for more jagged functions (e.g., ), the rounding parameter will need to be set smaller (e.g., ) for the rounded kernel function to resemble the true asymptotic kernel (see Figure 1).
4 Quality of Rounded Solution
4.1 A Taylor Heuristic
Note that the rounded predictor can be written as
where by definition and so that . This implies where and . Consider the linear approximation of at the point
where denotes the gradient of . If the gradient of were known, we could approximate the rounding error using
where ; note that the last line is due to the fact that .
where are scaled Bernoulli polynomials, are the selected knots, and
is the reproducing kernel of the contrast space. Using the properties of Bernoulli polynomials we have
Consequently, for polynomial splines we can approximate the rounding error using
where with for and for , and . Note that the contrast space reproducing kernel is rather smooth for the classic cubic smoothing spline, and the magnitude of the derivatives are rather small (see Figure 2). This implies that setting will not introduce much rounding error to the contrast kernel evaluation when using cubic smoothing splines on .
The rounding error depends on the norm , so the relative impact of a particular choice of rounding parameters will depend on the (unknown) function coefficients . For practical use, we can approximate the rounding error relative to the norm of the coefficients, such as
where is the largest eigenvalue of ; note that we have and by definition. For practical computation, it is possible to estimate by taking a random sample of observations, and then approximate the relative rounding error as . Clearly this sort of approach can be extended to assess the relative rounding error for tensor product smoothing splines, but the gradient formulas become a bit more complicated.
4.2 Finite Sample Performance
To quantify the finite-sample error introduced by rounding, define the loss function
where and are the smoothing matrices corresponding to the unrounded and rounded predictors (i.e., and , respectively). Denote the risk function as
where contains the realizations of the (unknown) true function . Note that the first term of corresponds to the (squared) bias difference between and , and the second term is related to (but not equal to) the variance difference. Also note that we can write
where are the eigenvalues of .
The risk depends on the squared norm of the unknown function , so the practical relevance of a particular value of , e.g., , differs depending on the situation, i.e., unknown true function. To overcome this practical issue, we can examine the risk relative to the squared norm of the unknown function, such as
where relates to the noise-to-signal ratio, i.e., inverse of signal-to-noise ratio (SNR). Furthermore, for a fixed SNR and a large enough , the second term in the upper-bound of the relative risk is negligible, and we have that . Consequently, it is only necessary to know the largest eigenvalue of to understand the expected performance of a given set of rounding parameters for a large sample size .
In practice, calculating and for various values of is a computational challenge for large . For practical computation, we recommend examining and/or using a random sample of observations. Using this approach, the unknown parameters (i.e., and can be estimated using the results of the unrounded solution. For example, the SNR can be estimated as where and are the estimated function and error variance using the observations with unrounded predictors. Or, if the approximate SNR is known, Equation (21) can be used to place an upper-bound on the relative risk .
We demonstrate this approach in Figures 3–4, which plot functions with various degrees of smoothness (Figure 3) and the median estimated rounding risk across five samples of observations (Figure 4). Note that Figure 4 illustrates that the expected difference between the unrounded and rounded solutions increases as the error variance increases. Furthermore, note that Figure 4 affirms that for setting can be expected to introduce minimal rounding error for a variety of functions and SNRs. Finally, Figure 4 reveals that setting will not introduce much rounding error whenever the underlying function is relatively smooth. For example, for the functions and , we should expect a negligible difference between the unrounded and rounded solutions using for a variety of different SNRs.
4.3 Asymptotic Bias and Variance
To establish the asymptotic properties of the proposed estimate, we employ an equivalent kernel approach developed in Nychka (1995). The key idea is that a smoothing spline estimate can be written as kernel estimate
where the kernel function can be well approximated by a Green’s function. Then the asymptotic properties of can be established via the analytical properties of the Green’s function.
Following Nychka (1995), we establish the asymptotic properties of our rounding estimate for the one dimensional case. In addition, we assume that we use a full basis where all distinct rounded data are used as knots, i.e., . Then our estimate is the minimizer of
Let denote the empirical distribution function for the rounded predictor , , let be the limiting distribution of the original predictor with a continuous and strictly positive density function on and let
and . Then we have the following theorem.
Assume that is a smoothing spline estimate of (23) with and are not equally spaced. Suppose that and satisfies the Hölder condition for some and some . Assume that has a uniformly continuous derivative and as . Choose and let and as . Then
uniformly for and as .
The theorem states that both the bias and variance of our estimate depend on , which is required to be sufficiently small relative to as . Consequently, the theorem reveals that the rounding parameter will have to be set smaller when
the true function is rougher
the spline order is larger
the predictor distribution is rougher
the sample size is larger.
These conclusions derive directly from the requirement that be sufficiently small relative to as .
5 Simulation Study
5.1 Design and Analyses
We conducted a simulation study to demonstrate the benefits of the rounding parameters. As a part of the simulation, we manipulated two conditions: (a) the function smoothness (8 levels: see Figure 3), and (b) the number of observations (3 levels: for ). Note that the functions are defined such that and for , so the function smoothness is systematically manipulated. We generated by (a) independently sampling the predictor(s) from a uniform distribution, (b) independently sampling from a standard normal distribution, and (c) defining the observed response as for .
Then, we fit a nonparametric regression model using six different methods: Method 1 is an SSANOVA using unrounded data (see Helwig & Ma, 2015), Method 2 is an SSANOVA with , Method 3 is an SSANOVA with , Method 4 is an SSANOVA with , Method 5 is standard GAM implemented through Wood’s (2015) gam.R function, and Method 6 is batch-processed GAM implemented through Wood’s (2015) bam.R function. Methods 1–4 are implemented through Helwig’s (2015a) bigspline.R function (for ) and bigssa.R function (for ).
For the functions we used knots to fit the model, and for functions we used knots. For Methods 1–4, we used a bin-sampling approach to select knots spread throughout the covariate domain (Helwig & Ma, in prep); for Methods 5 and 6, we used the default gam.R and bam.R knot-selection algorithm (see Wood, 2015). For each method, we used cubic splines and selected the smoothing parameters that minimized the GCV score. Given the optimal smoothing parameters, we calculated the fitted values, and then defined the true mean-squared-error (MSE) as . Finally, we used 100 replications of the above procedure within each cell of the simulation design.
The true MSE for each combination of simulation conditions is plotted in Figure 5.
First, note that for each method, the true MSE decreased as increased, which was expected. Next, note that all of the methods recovered quite well (i.e., all MSEs smaller than 0.01). Comparing Methods 1–4, it is evident that setting introduced minimal bias to the resulting solution. In contrast, setting produced a more noticeable bias, particularly when analyzing the more jagged and functions, i.e., those with larger . However, the bias introduced with was small relative to the norm of , so there is little practical difference between the solutions with . Examining the true MSEs of Methods 5 and 6, it is apparent that the standard GAM performed almost identical to the batch-processed GAM throughout the simulation.
Comparing the true MSEs of Methods 1–4 to those of Methods 5 and 6, it apparent that the SSANOVAs performed similar to the GAMs in every simulation condition. In the one-dimensional case ( functions), the GAMs have slightly smaller true MSEs for , but the difference is trivial compared to the norm of the functions. In the two-dimensional case ( functions), the SSANOVAs have slightly smaller true MSEs for . Differences between the SSANOVA and GAM solutions are most pronounced when analyzing the function; in this case, the median true MSE of the GAM solutions is over 10 times larger than the corresponding median of the SSANOVA solutions with . However, the difference is still quite small compared to the norm of the function.
The median analysis runtimes (in seconds) for each simulation condition are displayed in Tables 1 and 2. First, note that for each method, the runtime increased as increased, which was expected. Next, note that the runtimes for Methods 1, 5, and 6 were substantially larger than the corresponding runtimes of Methods 2–4. When analyzing the functions, the median runtimes for Methods 2–4 were less than one-tenth of a second for all examined , and were anywhere from 40–60 times faster than the median runtimes for Methods 5 and 6. When analyzing the functions, the median runtimes for Methods 3–4 were less than one second for all examined , and were anywhere from 10–20 times faster than the median runtimes for Methods 5 and 6.
|Method 1 ( NA)||0.35||0.64||1.31||0.37||0.64||1.28||0.30||0.64||1.31||0.36||0.64||1.31|
|Method 2 ()||0.02||0.03||0.07||0.02||0.03||0.07||0.02||0.03||0.07||0.02||0.03||0.07|
|Method 3 ()||0.02||0.03||0.06||0.02||0.03||0.06||0.01||0.03||0.06||0.01||0.03||0.06|
|Method 4 ()||0.01||0.03||0.06||0.02||0.02||0.06||0.01||0.02||0.06||0.01||0.02||0.06|
|Method 5 (GAM)||1.44||2.24||4.05||1.40||2.11||4.03||1.47||2.12||4.06||1.40||2.11||4.06|
|Method 6 (BAM)||1.35||2.02||4.26||1.37||2.05||4.30||1.32||2.05||4.28||1.38||2.05||4.29|
|Method 1 ( NA)||3.80||6.60||14.84||3.80||6.60||14.82||3.81||6.61||14.85||3.81||6.60||14.85|
|Method 2 ()||0.85||0.80||1.35||0.85||0.80||1.34||0.85||0.80||1.35||0.85||0.80||1.35|
|Method 3 ()||0.34||0.51||0.99||0.34||0.51||0.99||0.34||0.51||0.99||0.34||0.51||0.99|
|Method 4 ()||0.28||0.43||0.90||0.28||0.43||0.90||0.28||0.43||0.90||0.28||0.43||0.90|
|Method 5 (GAM)||4.48||9.16||22.31||4.45||9.12||22.29||4.45||9.16||22.38||4.50||9.20||22.43|
|Method 6 (BAM)||4.75||7.81||18.55||4.73||7.78||18.55||4.74||7.80||18.61||4.77||7.85||18.65|
6 Real Data Example
6.1 Data and Analyses
To demonstrate the practical benefits of the rounding parameters when working with real data, we use electroencephalography (EEG) data obtained from Bache and Lichman (2013). Note that EEG data consist of electrical activities that are recorded from various electrodes on the scalp, and EEG patterns are used to infer information about mental processing. The EEG data used in this example were recorded from both control and alcoholic subjects participating in an experiment at the Henri Begleiter Neurodynamic Lab at SUNY Brooklyn. The data were recorded during a standard visual stimulus event-related potential (ERP) experiment using a 61-channel EEG cap (see Figure 6). The data were recorded at a frequency of 256 Hz for one second following the presentation of the visual stimulus.
For the example, we analyzed data from the Pz electrode of 120 subjects (44 controls and 76 alcoholics), and we used 10 replications of the ERP experiment for each subject.222Note that data from subjects co2a0000425 and co2c0000391 were excluded from the analysis due to small amounts of data, and we used the first 10 replications for each subject. This resulted in 307,200 data points (120 subjects 256 time points 10 replications). We analyzed the data using a two-way SSANOVA on the domain , where the first predictor is the time effect and the second predictor is the group effect (control vs. alcoholic); see the Appendix for an explanation of how the rounding parameter can be applied when working with continuous and nominal predictors. We used a cubic spline for the time effect, a nominal spline for the group effect, and bin-sampled knots. Finally, we fit the model both with the unrounded data and with the time covariate rounded to the nearest .01 second (i.e., on the interval [0,1]); note that setting for the time covariate results in unique covariate vectors, which is substantially less than the original data points.
The predicted ERPs for the unrounded and rounded data are plotted in Figure 7.
Note that there are no practical differences between the two solutions (c.f. Figure 7a,b). Furthermore, note that both solutions produced a GCV score of GCV=85.96 and variance-accounted-for value of , suggesting that the rounded solution fits the data as well as the unrounded solution. It is also worth noting that the unrounded solution took over five times longer to fit compared to the rounded solution; furthermore, the unrounded solution required a substantial amount of RAM to fit the model, whereas the rounded solution is easily fittable on a standard laptop or tablet.
Comparing the estimated ERPs of the controls and alcoholics, there are obvious differences (see Figure 7). In particular, the alcoholic subjects are missing the P300 component of the ERP waveform (i.e., large positive peak occurring about 300 ms after the stimulus). Note that the P300 component is thought to relate to a subject’s internalization and/or categorization of stimuli, so these results suggest that alcoholic subjects have different information processing patterns for standard visual stimuli. This finding is consistent with previous findings regarding EEG patterns of alcoholic subjects (see Porjesz et al., 1980; Porjesz et al., 1987), and some research suggests that this sort of EEG pattern may predispose individuals to alcoholism (see Porjesz & Begleiter, 1990a, b).
This paper proposes the use of rounding parameters to overcome the computational burden of fitting nonparametric regression models to super-large samples of data. By rounding each predictor to a given precision (e.g., 0.01), it is possible to estimate using the unique rounded predictor variables. We have provided a simple Taylor heuristic that justifies the use of a small rounding parameter (e.g., ) when using cubic smoothing splines for . Furthermore, we have provided methods for assessing the finite sample and asymptotic performance of the rounded SSANOVA estimator in various situations.
The simulation study and EEG example clearly demonstrate the benefits of the proposed rounding parameters. When fitting nonparametric regression models with large , the simulation results reveal that setting can result in substantial computational savings without introducing much bias to the solution. Furthermore, the EEG data example reveals that there are no practical differences between the unrounded and rounded solutions (using ) when analyzing real data. Thus, the rounding parameters offer a fast and stable method for fitting nonparametric regression models to very large samples.
In addition to providing a fast method for smoothing large datasets, the rounding parameters are also quite memory efficient. Because the rounding approach only uses the unique rounded-covariate values, it is never necessary to construct the full model design matrix (or the smoothing matrix). So, using the rounding parameters, it is possible to fit nonparametric regression models to very large samples using a standard laptop or tablet, e.g., all of the rounded SSANOVA models in this paper are easily fittable on a laptop with 4 GB of RAM. As a result, typical researchers now have the ability to discover functional relationships in super-large data sets without needing access to supercomputers or computing clusters.
As a final point, it should be noted that in some cases (e.g., large ) the number of unique rounded-covariate values may be very large. In such cases, forming the model design matrix may require a substantial amount of memory (because is so large). However, as is noted in Helwig (2013) and Helwig and Ma (2015), fitting an SSANOVA model only depends on various crossproduct vectors and matrices. So, if is too large to form the full model design matrix, then the needed crossproduct statistics can be formed in a batch-processing manner similar to the approach used by Wood’s (2015) bam.R function.
Appendix: Rounding Algorithm
In this section, we provide algorithms for rounding SSANOVA predictors and obtaining the sufficient statistics for the SSANOVA estimation. The first algorithm assumes that all of the covariates are continuous; extensions for nominal covariates will be discussed after the presentation of the initial algorithm.
First, let denote the rounding parameter for the -th predictor, let denote the vector containing the -th predictor’s scores, and let denote the -th order statistic of the -th predictor. Next, initialize and , and then calculate
where the rounding function rounds the input to the nearest integer. After running the for loop, we have , where denotes the -th element of