Multiple Learning for Regression in Big Data
Abstract
Regression problems that have closed-form solutions are well understood and can be easily implemented when the dataset is small enough to be loaded entirely into RAM. Challenges arise when data are too big to be stored in RAM to compute the closed-form solutions. Many techniques have been proposed to overcome or alleviate the memory barrier problem, but the solutions are often local optima. In addition, most approaches require loading the raw data into memory again when updating the models. Parallel computing clusters are often expected in practice if multiple models need to be computed and compared. We propose multiple learning approaches that utilize an array of sufficient statistics (SS) to address the aforementioned big data challenges. The memory-oblivious approaches break the memory barrier when computing regressions with closed-form solutions, including but not limited to linear regression, weighted linear regression, linear regression with Box-Cox transformation (Box-Cox regression) and ridge regression models. The computation and update of the SS arrays can be handled at the per-row level or the per-mini-batch level, and updating a model is as easy as matrix addition and subtraction. Furthermore, the proposed approaches enable the computational parallelizability of multiple models, because multiple SS arrays for different models can be computed simultaneously with a single pass of slow disk I/O access to the dataset. We implemented our approaches on Spark and evaluated them over simulated datasets. Results showed that our approaches can achieve exact solutions of multiple models. The training time saved compared to the traditional methods is proportional to the number of models that need to be investigated.
I Introduction
Linear regression, weighted linear regression, linear regression with Box-Cox transformation (Box-Cox regression) and ridge regression have served society in many respects by modeling the relationship between a scalar response variable and explanatory variable(s). From housing price prediction to stock price prediction, and from face recognition to marketing analysis, the related applications span a wide spectrum [20, 1, 21]. After entering the big data era, these regression models remain prevalent in academia and industry. Even though more advanced models, such as XGBoost and deep learning, have seen significant successes lately, the regression models continue their impact in many fields due to their transparency, reliability and explainability [6, 10]. However, it is not easy to compute these models if the dataset is massive. Closed-form solutions would be impossible if physical memory cannot hold all the data or the intermediate results needed for the computation, and trade-offs must be made between accuracy and time if the iterative methods are to be applied. Hence, it is of high value to propose a set of big-data-oriented approaches that preserve the benefits of linear, weighted linear, Box-Cox and ridge regression.
For linear regression, academia and industry resort to two major techniques: ordinary least squares (OLS) and the iterative methods. The OLS method is designed to calculate the closed-form solution [13]. By solving the normal equation, OLS can immediately derive the solution from the data. The normal equation involves the term $X^\top X$; if $X^\top X$ is singular, the normal equation becomes unsolvable. One solution is to use the generalized inverse [9, 3, 4]. Although OLS is time-efficient in deriving the closed-form solution, it also introduces the memory barrier issue, in that the RAM needs to be big enough to store the entire dataset to solve the equation. To overcome the memory barrier, distributed matrix computation can be applied as a remedy [18], but the time cost nevertheless makes this algorithm infeasible, so its applications are limited. The other technique, the iterative methods, which include gradient descent, Newton's method and quasi-Newton methods, is commonly used to provide approximate solutions [14, 25].
Gradient descent, also known as steepest descent, aims to find the minimum of a function. It approaches the minimum by taking steps along the negative gradient of the function, with step sizes proportional to the magnitude of the gradient. It is more universal than OLS, as its variations, such as mini-batch gradient descent and stochastic gradient descent, overcome the memory barrier issue by performing the calculation in small batches instead of feeding all the data into memory at once [23]. However, gradient descent oscillates around the minimum region when the algorithm gets close to the minimum, and its asymptotic rate of convergence is inferior to that of many other iterative methods. If a more direct approach to the minimum or a higher asymptotic rate of convergence is demanded, Newton's method is an alternative.
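As a point of reference, the mini-batch variant described above can be sketched in a few lines of NumPy. The function below is a hypothetical illustration (the name, learning rate and batch size are our choices, not taken from any cited implementation) of how gradient descent for least squares touches only a small batch of rows per step.

```python
import numpy as np

# Hypothetical sketch of mini-batch gradient descent for least squares;
# the function name, learning rate and batch size are illustrative choices,
# not taken from any cited implementation.
def minibatch_gd(X, y, lr=0.1, batch=32, epochs=200, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(epochs):
        idx = rng.permutation(n)           # reshuffle each pass
        for start in range(0, n, batch):
            b = idx[start:start + batch]
            # Gradient of the mean squared error on this mini-batch only
            grad = 2.0 * X[b].T @ (X[b] @ beta - y[b]) / len(b)
            beta -= lr * grad
    return beta
```

Each step touches only one small batch, which is what lets the method sidestep the memory barrier, at the price of many passes over the data.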
Newton’s method is a root-finding algorithm utilizing the Taylor series. To find a minimum or maximum, it needs knowledge of the second derivatives. Unlike gradient descent, this enables Newton’s method to approach the optimum directly rather than oscillate around it. Besides, it has been proven that Newton’s method has a quadratic asymptotic rate of convergence. However, this algorithm is faster than gradient descent only if the Hessian matrix is known or easy to compute [25]. Unfortunately, the expressions for the second derivatives of large-scale optimization problems are often complicated and intractable.
Quasi-Newton methods, for instance DFP, BFGS and L-BFGS, were proposed as alternatives to Newton’s method when the Hessian matrix is unavailable or too expensive to calculate [7, 2, 16]. Instead of inverting the Hessian matrix as in Newton’s method, quasi-Newton methods build up an approximation to the inverse matrix to reduce the computational load. With this mechanism, quasi-Newton methods are usually faster than Newton’s method for large datasets. In linear regression, L-BFGS, a variation of BFGS, is one of the most widely used quasi-Newton methods [26]. Generally, L-BFGS outperforms gradient descent in linear regression.
The majority of the aforementioned approaches require multiple passes through the dataset. Donald Knuth describes an efficient solution that requires only a single pass through the dataset; however, this approach is only applicable to variance computation [15].
Weighted linear regression is a generalized version of linear regression that quantifies the importance of different observations [19]. A weighted version of OLS is designed to obtain the corresponding closed-form solution. The iterative methods, with slight modifications, are also applicable to weighted linear regression [12].
Box-Cox regression is linear regression with the response variable transformed by the Box-Cox transformation [5, 24]. The design philosophy of Box-Cox regression is to handle nonlinearity between the response variable and the explanatory variables by applying a power transformation to the response variable. Naturally, approaches for linear regression are applicable to Box-Cox regression.
As linear regression is deficient in handling highly correlated data, ridge regression was proposed [11]. The basic idea of ridge regression is to add a penalty term to the error sum of squares (SSE) cost function of linear regression [11, 17]. A constrained version of OLS can solve this problem, producing a similar closed-form solution. The only difference is that the component $X^\top X$ from OLS is substituted by $X^\top X + kI$, where $k$ is the penalty coefficient and $I$ is the identity matrix. By means of $kI$, the constrained OLS no longer has to deal with the singularity issue, but the memory barrier issue from OLS remains. Gradient descent, Newton’s method and quasi-Newton methods can be applied as well [14, 25, 8].
From the above discussion, it can be concluded that research gaps remain in two respects: (i) OLS and its extended versions struggle with the memory barrier issue; and (ii) the iterative methods are time-inefficient and require many iterations to train regression models well. In addition, parameter tuning is inevitable under most conditions; it may take days or even weeks for large-scale projects to reach the desired model performance. For Box-Cox regression or ridge regression, the situation gets worse, as a set of power or ridge parameters is usually tried to pick the best one, which multiplies the time cost [22].
In order to integrate the pros of the OLS-based approaches, which use closed-form solutions to produce exact results, and of the iterative methods, which overcome the memory barrier, we propose multiple learning approaches that utilize sufficient statistics (SS). The main contributions of our algorithms are summarized below:

- We introduce an SS array that can be computed at the per-row or per-mini-batch level for calculating closed-form solutions.

- Once the closed-form solutions are obtained, the optima are found, i.e., the prediction performance is at least as good as that of OLS.

- With SS, a dataset stored in large secondary storage, such as HDD or SSD, needs to be loaded into primary storage only once. Time efficiency is therefore greatly improved in contrast to the iterative methods, which require multiple slow disk I/Os.

- Because multiple SS arrays for different models can be computed simultaneously, multiple models can be computed and updated with a single pass over the entire dataset, i.e., one round of slow disk I/O.
II Background Concepts
For regression analysis, not only are the estimators of the regression model coefficients required, but the estimators of the variance and the variance-covariance matrices should also be computed for significance tests. For ease of presentation, the notions and notation closely relevant to linear regression, weighted linear regression, Box-Cox regression and ridge regression are explained below.
II-A Linear Regression
Assume the dataset contains $n$ observations, each of which has $p$ features. Consider a linear regression model
(1)  $y = X\beta + \varepsilon$
where $y$ is an $n \times 1$ vector of the response variable, $X$ is an $n \times p$ matrix of explanatory variables, $\beta$ is a $p \times 1$ vector of regression coefficient parameters, and $\varepsilon$ is the error term, an $n \times 1$ vector following the normal distribution $N(0, \sigma^2 I)$.
The estimators of the model coefficients, the variance and the variance-covariance matrix are shown in (3).
(3)  $\hat\beta = (X^\top X)^{-1} X^\top y, \qquad \hat\sigma^2 = \dfrac{(y - X\hat\beta)^\top (y - X\hat\beta)}{n - p}, \qquad \widehat{\operatorname{Var}}(\hat\beta) = \hat\sigma^2 (X^\top X)^{-1}$
Note that $x_1, x_2, \dots, x_p$ are all known observations. This means that values such as $x_j^2$ can be easily computed and included as explanatory variables in equation (1). As a result, this approach can also be used to fit polynomial regression models, in addition to linear regression models.
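For instance, a quadratic term can be precomputed and appended as an ordinary column of the design matrix. The NumPy sketch below (with illustrative, simulated data, not from the paper's experiments) fits a polynomial model this way.

```python
import numpy as np

# Illustrative sketch: fitting y = b0 + b1*x + b2*x^2 by treating the
# precomputed x^2 values as just another explanatory column (the data are
# simulated here for demonstration only).
rng = np.random.default_rng(0)
x = rng.uniform(-2.0, 2.0, size=200)
y = 1.0 + 2.0 * x - 3.0 * x ** 2 + rng.normal(scale=0.1, size=200)

# Design matrix with an intercept column, x, and the x^2 column
X = np.column_stack([np.ones_like(x), x, x ** 2])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # OLS via the normal equation
```

The fitted coefficients recover the generating polynomial up to the simulated noise.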
II-B Weighted Linear Regression
Weighted linear regression is similar to linear regression, except it assumes all the off-diagonal entries of the correlation matrix of the residuals are $0$. By minimizing the corresponding SSE cost function in (4), the estimators of the model coefficients, the variance and the variance-covariance matrix are obtained as shown in (5).
(4)  $SSE_w(\beta) = (y - X\beta)^\top W (y - X\beta)$
where $W$ is an $n \times n$ diagonal matrix of weights.
(5)  $\hat\beta = (X^\top W X)^{-1} X^\top W y, \qquad \hat\sigma^2 = \dfrac{(y - X\hat\beta)^\top W (y - X\hat\beta)}{n - p}, \qquad \widehat{\operatorname{Var}}(\hat\beta) = \hat\sigma^2 (X^\top W X)^{-1}$
II-C Box-Cox Regression
The Box-Cox regression model is a linear regression model with an additional power transformation on the response variable, as shown in (6).
(6)  $y^{(\lambda)} = X\beta + \varepsilon$
where $y^{(\lambda)}$ is the element-wise power transformation of $y$ defined in (7).
(7)  $y_i^{(\lambda)} = \begin{cases} \dfrac{y_i^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\ \ln y_i, & \lambda = 0 \end{cases}$
Normally, a set of candidate power parameters $\Lambda$ is applied to the response variable. In this case, the $\lambda \in \Lambda$ that maximizes the profile log-likelihood (8) is chosen as the best power parameter.
(8)  $\ell_p(\lambda) = -\dfrac{n}{2}\ln\!\left(\dfrac{SSE(\lambda)}{n}\right) + (\lambda - 1)\sum_{i=1}^{n}\ln y_i$, where $SSE(\lambda) = (y^{(\lambda)} - X\hat\beta_\lambda)^\top (y^{(\lambda)} - X\hat\beta_\lambda)$
The estimators of the model coefficients, the variance and the variance-covariance matrix for Box-Cox regression are
(9)  $\hat\beta_\lambda = (X^\top X)^{-1} X^\top y^{(\lambda)}, \qquad \hat\sigma_\lambda^2 = \dfrac{(y^{(\lambda)} - X\hat\beta_\lambda)^\top (y^{(\lambda)} - X\hat\beta_\lambda)}{n - p}, \qquad \widehat{\operatorname{Var}}(\hat\beta_\lambda) = \hat\sigma_\lambda^2 (X^\top X)^{-1}$
II-D Ridge Regression
Ridge regression is linear regression with an $L_2$ penalty term added. The corresponding SSE cost function is:
(10)  $SSE_r(\beta) = (y - X\beta)^\top (y - X\beta) + k\,\beta^\top \beta$
where $k$ is a nonnegative tuning parameter used to control the penalty magnitude. For any $k > 0$, (10) can be analytically minimized, yielding the estimator of $\beta$ as
(11)  $\hat\beta_k = (X^\top X + kI)^{-1} X^\top y$
III Methodology
The main goal is to find approaches that overcome the memory barrier issue of the closed-form solutions and make them as widely applicable in big data as the iterative methods. In pursuit of this goal, the array of sufficient statistics (SS) is formally defined, and the SS-based multiple learning algorithms are proposed in this section.
III-A Sufficient Statistics Array
The SS array is an array of sufficient statistics used to calculate the estimators of the models and the log-likelihood function (or SSE cost function) without a second visit to the dataset. It is inspired by the computation-wise row independence of the equivalent forms (12) of (3) for linear regression [28, 27].
Rewriting (3) as in (12), where $x_i^\top$ denotes the $i$-th row of $X$, the term $X^\top X$ is computation-wise row independent, i.e., for any two observations $x_i$ and $x_j$, calculating the summand $x_i x_i^\top$ does not depend on $x_j$. Likewise, $X^\top y$ and $y^\top y$ are computation-wise row independent as well.
(12)  $X^\top X = \sum_{i=1}^{n} x_i x_i^\top, \qquad X^\top y = \sum_{i=1}^{n} y_i x_i, \qquad y^\top y = \sum_{i=1}^{n} y_i^2$
Inspired by this thought, the array of SS is formally defined as follows.
Definition 1.
A sufficient statistics (SS) array is an array of sufficient statistics that is computed at the per-row or per-mini-batch level from the dataset and can be used to compute the estimators of the model coefficients $\hat\beta$, the variance $\hat\sigma^2$, the variance-covariance matrix $\widehat{\operatorname{Var}}(\hat\beta)$ and the log-likelihood (or SSE cost function) without revisiting the dataset.
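A minimal sketch of such an accumulator for linear regression is given below in NumPy. The class and member names are illustrative assumptions of ours, not the paper's Spark implementation.

```python
import numpy as np

# Minimal sketch of a row-level SS accumulator for linear regression.
# Class and member names are illustrative assumptions, not the paper's
# Spark implementation.
class LinRegSS:
    def __init__(self, p):
        self.n = 0                   # rows seen so far
        self.yty = 0.0               # running sum of y_i^2
        self.Xty = np.zeros(p)       # running sum of y_i * x_i
        self.XtX = np.zeros((p, p))  # running sum of x_i x_i^T

    def update_row(self, x, y):
        self.n += 1
        self.yty += y * y
        self.Xty += y * x
        self.XtX += np.outer(x, x)

    def estimators(self):
        # Model derived from the SS alone -- no second visit to the data
        beta = np.linalg.solve(self.XtX, self.Xty)
        sse = self.yty - self.Xty @ beta          # y'y - y'X beta
        p = self.Xty.shape[0]
        sigma2 = sse / (self.n - p)               # variance estimate
        cov = sigma2 * np.linalg.inv(self.XtX)    # variance-covariance matrix
        return beta, sigma2, cov
```

After a single pass over the rows, the accumulator holds everything needed to produce the coefficient, variance and covariance estimators.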
III-B Linear Regression
Based on (2) and (3), the array $S$ in (13) is presented as an array of SS for linear regression.
(13)  $S = (s_1, s_2, s_3) = \left(y^\top y,\; X^\top y,\; X^\top X\right)$
where $s_1$ is a scalar, $s_2$ is a $p \times 1$ vector, and $s_3$ is a $p \times p$ matrix.
By (13), we obtain the following
(14)  $\hat\beta = s_3^{-1} s_2, \qquad \hat\sigma^2 = \dfrac{s_1 - s_2^\top \hat\beta}{n - p}, \qquad \widehat{\operatorname{Var}}(\hat\beta) = \hat\sigma^2 s_3^{-1}$
Theorem 1.
$S$ is an array of SS for linear regression to derive $\hat\beta$, $\hat\sigma^2$ and $\widehat{\operatorname{Var}}(\hat\beta)$.
Proof.
From (13), the log-likelihood can be expressed as a function of $S$:
(15)  $\ell(\beta, \sigma^2) = -\dfrac{n}{2}\ln(2\pi\sigma^2) - \dfrac{s_1 - 2 s_2^\top \beta + \beta^\top s_3 \beta}{2\sigma^2}$
which depends on the data only through the SS for linear regression. ∎
To accelerate the computation, the row-by-row calculation can be optimized into batch-by-batch computation, i.e., $s_1$, $s_2$ and $s_3$ can be written in batched form:
(16)  $s_1 = \sum_{b=1}^{B} y_b^\top y_b, \qquad s_2 = \sum_{b=1}^{B} X_b^\top y_b, \qquad s_3 = \sum_{b=1}^{B} X_b^\top X_b$
where $B$ denotes the total number of batches, and $(y_b^\top y_b, X_b^\top y_b, X_b^\top X_b)$ denotes the SS array in batch $b$. Here $y_b$ is an $n_b \times 1$ vector, $X_b$ is an $n_b \times p$ matrix, and $n_b$ is the batch size of batch $b$. The multiple learning approach for linear regression by mini-batch is shown in Algorithm 1.
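Because the per-batch SS arrays combine by plain addition, the batched computation can be sketched as a map-reduce over batches. The NumPy code below is an illustrative sketch (function names are ours), not the paper's Spark implementation.

```python
import numpy as np

# Illustrative sketch of batched SS computation (function names are ours):
# each batch's SS is computed independently, batches merge by addition,
# and the model is solved from the merged SS.
def batch_ss(Xb, yb):
    return (len(yb), yb @ yb, Xb.T @ yb, Xb.T @ Xb)   # (n_b, y'y, X'y, X'X)

def merge_ss(a, b):
    # Merging is plain addition, so batches (or Spark partitions) combine
    # in any order -- this is what makes the computation parallelizable
    return tuple(u + v for u, v in zip(a, b))

def coefficients(ss):
    n, yty, Xty, XtX = ss
    return np.linalg.solve(XtX, Xty)

# Usage: one pass over the data, 128 rows at a time
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.01, size=1000)
parts = [batch_ss(X[i:i + 128], y[i:i + 128]) for i in range(0, 1000, 128)]
ss = parts[0]
for part in parts[1:]:
    ss = merge_ss(ss, part)
beta_hat = coefficients(ss)
```

The reduce step involves only small $p \times p$ objects, so the memory footprint is independent of the number of rows.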
III-C Weighted Linear Regression
Weighted linear regression uses weights to adjust the importance of different observations. Therefore, the SS array for weighted linear regression is slightly different.
(17)  $S_w = (s_1, s_2, s_3) = \left(y^\top W y,\; X^\top W y,\; X^\top W X\right)$
where $s_1$ is a scalar, $s_2$ is a $p \times 1$ vector, and $s_3$ is a $p \times p$ matrix.
The estimators are re-expressed as follows:
(18)  $\hat\beta = s_3^{-1} s_2, \qquad \hat\sigma^2 = \dfrac{s_1 - s_2^\top \hat\beta}{n - p}, \qquad \widehat{\operatorname{Var}}(\hat\beta) = \hat\sigma^2 s_3^{-1}$
Theorem 2.
$S_w$ is an array of SS for weighted linear regression to derive the estimators $\hat\beta$, $\hat\sigma^2$ and $\widehat{\operatorname{Var}}(\hat\beta)$.
Proof.
Similar to the proof of Theorem 1, the estimators in (18) and the weighted SSE cost function depend on the data only through the components of (17). ∎
Similar to the multiple learning approach for linear regression, calculating the SS batch by batch is also feasible:
(20)  $s_1 = \sum_{b=1}^{B} y_b^\top W_b y_b, \qquad s_2 = \sum_{b=1}^{B} X_b^\top W_b y_b, \qquad s_3 = \sum_{b=1}^{B} X_b^\top W_b X_b$
where $W_b$ is the diagonal weight matrix of batch $b$.
The multiple learning approach for weighted linear regression is shown in Algorithm 2.
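The weighted SS arrays also merge by plain addition. The NumPy sketch below (names are ours, not from the paper's implementation) computes the weighted SS per batch and recovers the weighted least-squares coefficients from the merged array.

```python
import numpy as np

# Illustrative sketch (names are ours) of per-batch SS for weighted linear
# regression; each statistic carries the observation weights w_i.
def weighted_ss(Xb, yb, wb):
    return (wb @ (yb * yb),             # y'Wy  (scalar)
            Xb.T @ (wb * yb),           # X'Wy  (p-vector)
            (Xb * wb[:, None]).T @ Xb)  # X'WX  (p x p matrix)

def wls_coefficients(ss):
    yWy, XWy, XWX = ss
    return np.linalg.solve(XWX, XWy)    # (X'WX)^{-1} X'Wy

# Usage: batches merge by plain addition of their SS components
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 2))
y = rng.normal(size=100)
w = rng.uniform(0.5, 2.0, size=100)
ss = tuple(a + b for a, b in zip(weighted_ss(X[:60], y[:60], w[:60]),
                                 weighted_ss(X[60:], y[60:], w[60:])))
beta_hat = wls_coefficients(ss)
```

The two-batch split above yields exactly the same coefficients as solving the full weighted normal equation in one shot.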
III-D Box-Cox Regression
Box-Cox regression requires a power transformation on the response variable. Commonly, a set of candidate power parameters $\Lambda$ is applied, and the $\lambda \in \Lambda$ that maximizes (8) is picked as the best parameter. As the profile log-likelihood is required for parameter picking, the statistic $\sum_{i=1}^{n} \ln y_i$ is also needed.
The arrays of SS for Box-Cox regression are shown in (21). For every $\lambda \in \Lambda$,
(21)  $S_\lambda = (s_{1,\lambda},\; s_{2,\lambda},\; s_3,\; s_4) = \left((y^{(\lambda)})^\top y^{(\lambda)},\; X^\top y^{(\lambda)},\; X^\top X,\; \sum_{i=1}^{n} \ln y_i\right)$
where $s_{1,\lambda}$ and $s_4$ are scalars, $s_{2,\lambda}$ is a $p \times 1$ vector, and $s_3$ is a $p \times p$ matrix. Notably, $s_3$ is sharable across all models.
Thus, for every $\lambda \in \Lambda$,
(22)  $\hat\beta_\lambda = s_3^{-1} s_{2,\lambda}, \quad \hat\sigma_\lambda^2 = \dfrac{s_{1,\lambda} - s_{2,\lambda}^\top \hat\beta_\lambda}{n - p}, \quad \widehat{\operatorname{Var}}(\hat\beta_\lambda) = \hat\sigma_\lambda^2 s_3^{-1}, \quad \ell_p(\lambda) = -\dfrac{n}{2}\ln\!\left(\dfrac{s_{1,\lambda} - s_{2,\lambda}^\top \hat\beta_\lambda}{n}\right) + (\lambda - 1)\, s_4$
Theorem 3.
For any $\lambda \in \Lambda$, the corresponding $S_\lambda$ is an array of SS for Box-Cox regression, which can be used to compute $\hat\beta_\lambda$, $\hat\sigma_\lambda^2$, $\widehat{\operatorname{Var}}(\hat\beta_\lambda)$ and $\ell_p(\lambda)$.
The batched version of the SS for any $\lambda \in \Lambda$ is shown in (24).
(24)  $s_{1,\lambda} = \sum_{b=1}^{B} (y_b^{(\lambda)})^\top y_b^{(\lambda)}, \quad s_{2,\lambda} = \sum_{b=1}^{B} X_b^\top y_b^{(\lambda)}, \quad s_3 = \sum_{b=1}^{B} X_b^\top X_b, \quad s_4 = \sum_{b=1}^{B} \sum_{i \in b} \ln y_i$
where $y_b^{(\lambda)}$ is an $n_b \times 1$ vector in batch $b$.
The SS-based Box-Cox regression algorithm by mini-batch is presented in Algorithm 3.
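The key property above is that $X^\top X$ and $\sum \ln y_i$ are shared by every candidate $\lambda$, so one pass over the data yields the SS for all models at once. The NumPy sketch below (names are ours, not the paper's implementation) computes the per-$\lambda$ statistics in a single scan and then evaluates the profile log-likelihood (8) for each candidate entirely in memory.

```python
import numpy as np

# Illustrative sketch (names are ours, not the paper's implementation) of the
# Box-Cox SS arrays: X'X and sum(log y) are shared across all candidate
# lambdas; only the transformed-response statistics differ per lambda.
def boxcox(y, lam):
    # Element-wise Box-Cox transform; y must be strictly positive
    return np.log(y) if lam == 0 else (y ** lam - 1.0) / lam

def boxcox_ss(X, y, lambdas):
    shared = (X.T @ X, np.log(y).sum(), len(y))   # X'X, sum(log y), n
    per_lam = {}
    for lam in lambdas:
        z = boxcox(y, lam)
        per_lam[lam] = (z @ z, X.T @ z)           # z'z, X'z for this lambda
    return shared, per_lam

def pick_lambda(shared, per_lam):
    # Evaluate the profile log-likelihood for each lambda, in memory only
    XtX, sumlog, n = shared
    best, best_ll = None, -np.inf
    for lam, (zz, Xtz) in per_lam.items():
        beta = np.linalg.solve(XtX, Xtz)
        sse = zz - Xtz @ beta                     # z'z - z'X beta
        ll = -0.5 * n * np.log(sse / n) + (lam - 1.0) * sumlog
        if ll > best_ll:
            best, best_ll = lam, ll
    return best
```

Adding a candidate $\lambda$ only adds one scalar and one $p$-vector to the state carried through the scan, which is why the time overhead of multiple models is marginal.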
III-E Ridge Regression
Although ridge regression requires a set of ridge parameters, the SS array is reusable across all ridge parameters and can be borrowed directly from linear regression.
Let $K$ denote the set of candidate ridge parameters. For every $k \in K$, the corresponding estimators $\hat\beta_k$, $\hat\sigma_k^2$, $\widehat{\operatorname{Var}}(\hat\beta_k)$ and the SSE cost function are:
(25)  $\hat\beta_k = (s_3 + kI)^{-1} s_2, \qquad \hat\sigma_k^2 = \dfrac{s_1 - 2 s_2^\top \hat\beta_k + \hat\beta_k^\top s_3 \hat\beta_k}{n - p}, \qquad \widehat{\operatorname{Var}}(\hat\beta_k) = \hat\sigma_k^2 (s_3 + kI)^{-1} s_3 (s_3 + kI)^{-1}$
(26)  $SSE(k) = s_1 - 2 s_2^\top \hat\beta_k + \hat\beta_k^\top s_3 \hat\beta_k + k\, \hat\beta_k^\top \hat\beta_k$
The best $k$ is selected by the ridge trace method.
Theorem 4.
$S = (s_1, s_2, s_3)$ of linear regression is the SS array for ridge regression.
The batched version of the SS is also identical to that of linear regression. The corresponding algorithm is presented in Algorithm 4.
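The reuse of a single SS array across all ridge parameters can be sketched in a few lines of NumPy (names are ours, not from the paper's implementation): each candidate $k$ costs only one $p \times p$ solve and requires no further access to the data.

```python
import numpy as np

# Illustrative sketch (names are ours): a single linear-regression SS array
# (X'X, X'y) serves every ridge parameter k; each candidate costs only a
# p x p solve and no further access to the data.
def ridge_path(XtX, Xty, ks):
    p = XtX.shape[0]
    return {k: np.linalg.solve(XtX + k * np.eye(p), Xty) for k in ks}

# Usage: compute the SS once, then sweep the whole set of ridge parameters
rng = np.random.default_rng(11)
X = rng.normal(size=(200, 5))
y = rng.normal(size=200)
XtX, Xty = X.T @ X, X.T @ y        # the shared SS array
path = ridge_path(XtX, Xty, [0.0, 1.0, 100.0])
```

With $k = 0$ this recovers the OLS solution, and larger $k$ shrinks the coefficients, which is the information the ridge trace inspects.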
IV Experiments
To evaluate the proposed multiple learning algorithms, extensive experiments were conducted on a four-node Spark cluster. All the algorithms were implemented and tested on Spark.
TABLE I: Hardware specifications of the cluster nodes

          Master     Slave1     Slave2       Slave3
CPU       i7-3770    i7-3770    Quad Q8400   Quad Q9400
Memory    16GB       16GB       4GB          4GB
Disk      1TB        1TB        250GB        250GB
IV-A Setup
The four-node Spark cluster was configured with one master node and three worker nodes. The hardware specifications of the four computers are shown in Table I.
IV-A1 Data Simulation
To understand how massive datasets could impact the computing, we simulated three datasets with 0.6 million, 6 million and 60 million observations, approximately 1GB, 10GB and 100GB in size, respectively. Generally, the 1GB and 10GB datasets can be loaded into memory easily; however, the 100GB dataset cannot be entirely loaded into memory at one time. Each row of the data has 100 features, and all features are continuous variables of double type. For each response $y_i$, the corresponding error $\varepsilon_i$ follows the normal distribution, i.e., $\varepsilon_i \sim N(0, \sigma^2)$. Additionally, another three similar datasets were generated with all responses set to be positive for proper Box-Cox regression.
IV-A2 Experiment Design
We designed two experiments, one for time performance and the other for prediction quality, to compare the results between the multiple learning algorithms and the traditional ones on Spark.
TABLE II: Training time comparison

Model            Method          Time Used (s)
                                 1GB       10GB      100GB
LR               Spark           41.86     338.27    3266.16
                 SS (batch 1)    19.59     154.16    1505.64
                 SS (batch 128)  15.67     126.33    1267.96
Weighted LR      Spark           42.23     339.54    3263.37
                 SS (batch 1)    19.76     155.47    1528.75
                 SS (batch 128)  16.73     125.35    1289.54
Box-Cox (1 λ)    Spark           42.63     341.31    3264.33
                 SS (batch 1)    19.16     156.41    1532.00
                 SS (batch 128)  15.19     122.49    1200.49
Box-Cox (31 λ)   Spark           431.29    3429.34   33701.51
                 SS (batch 1)    19.87     160.13    1674.62
                 SS (batch 128)  16.52     122.21    1206.17
Ridge (1 k)      Spark           41.58     328.48    3276.10
                 SS (batch 1)    19.87     152.47    1620.46
                 SS (batch 128)  16.10     127.92    1213.64
Ridge (20 k)     Spark           423.63    3342.58   32688.28
                 SS (batch 1)    20.56     154.34    1651.33
                 SS (batch 128)  16.80     125.63    1230.45
TABLE III: Prediction quality comparison

Model         Method   MSE
                       1GB          10GB         100GB
LR            Spark    1009520.77   993455.96    994025.56
              SS       1009520.77   993455.96    994025.56
Weighted LR   Spark    1009520.77   993455.96    994025.56
              SS       1009520.77   993455.96    994025.56
Box-Cox       Spark    1138432.54   1053491.23   1011557.43
              SS       1138432.54   1053491.23   1011557.43
Ridge         Spark    1009520.77   993455.96    994025.56
              SS       1009520.77   993455.96    994025.56
Experiment I: Time Performance Comparison
The first experiment evaluates the time used for training different models. In this experiment, we compared the time performance of the multiple learning approaches with that of the traditional approaches. For the multiple learning approaches, we measured the time performance with regard to different batch sizes.
Experiment II: Prediction Quality Comparison
To experimentally support that our algorithms are as accurate as the OLS algorithms with one pass through the datasets, we compared our algorithms with the traditional ones. In this experiment, we used the 1GB, 10GB and 100GB datasets as training sets and an additional 0.2GB, 2GB and 20GB of data for testing (the testing sets were sampled with the same strategy used to generate the training sets). To compare the prediction quality, the Mean Squared Error (MSE), defined in equation (28), is used as the performance metric.
(28)  $MSE = \dfrac{1}{n} \sum_{i=1}^{n} (y_i - \hat y_i)^2$
where $y_i$ is the real value for observation $i$, $\hat y_i$ is the predicted value, and $n$ is the total number of observations.
IV-B Results
Experiment I: Time Performance Comparison
Based on the results in Table II, the training time of our methods is about half that of the traditional ones on Spark. However, this is mainly ascribed to the embedded model summary functionality of Spark, which requires a second visit to the dataset. Excluding this factor, the performance of our algorithms is nearly the same as that of the traditional ones on Spark. But for model training with multiple parameters (e.g., model selection from a set of candidate models), the proposed multiple learning has a great advantage. As shown in Table II, the computation times needed to perform traditional Box-Cox and ridge regression are affected drastically by the number of power parameters and ridge parameters. In contrast, the time overhead of the proposed multiple learning algorithms increases only marginally when computing multiple parameters (or multiple models) simultaneously with multiple SS arrays. In Table II, our approaches are almost 20 times faster than the traditional approaches on Spark when computing 31 Box-Cox models or 20 ridge regression models at batch size 1. The speedup factor further increases to around 27 if we increase the batch size to 128, i.e., the sufficient statistics arrays are updated every 128 rows. Essentially, the training time saved by the multiple learning approach is proportional to the number of models that need to be trained.
It is also evident in Table II that a bigger batch size further decreases the training time, and the effect of batch size becomes more significant as the data size grows. Comparing batch size 128 against batch size 1, the time reductions for the 1GB, 10GB and 100GB datasets are approximately 16%, 22% and 30%, respectively. It can be inferred that more time is likely to be saved with bigger batch sizes on larger datasets.
From Experiment I, we conclude that if model selection is needed for a given large-scale dataset, the proposed multiple learning approach can significantly outperform the traditional approaches by reducing the disk I/O to a single pass. This feature is highly desirable when multiple models need to be calculated and compared in real-life applications.
Experiment II: Prediction Quality Comparison
Table III shows the prediction quality, measured by MSE, of the multiple learning approaches and the traditional ones on the 1GB, 10GB and 100GB datasets. As expected, the prediction accuracy of our approaches is identical to that of the built-in Spark algorithms, providing experimental support to the proofs presented in Section III. Given the same accuracy, the proposed approaches outperformed the traditional approaches with faster training time, and the larger the dataset, the more advantageous the proposed methods are.
V Conclusion
In this paper, multiple learning approaches for regression are proposed for big data. With only one pass through the dataset, an SS array is computed to derive the closed-form solutions for linear regression, weighted linear regression, Box-Cox regression and ridge regression. Theoretically and experimentally, it is shown that multiple learning is capable of overcoming the memory barrier issue.
Furthermore, multiple SS arrays can be applied to obtain multiple models at once. Unlike the traditional methods, which can only learn one model at a time, multiple learning outperforms the traditional techniques as far as time is concerned, and it is especially efficient when calculating multiple models. Essentially, the training time saved compared to the traditional methods is proportional to the number of models that need to be investigated.
We believe this is promising for big data for two main reasons. First, the coefficients of the models can be easily obtained as long as the SS arrays are calculated. Second, most models require a large amount of training and retraining, tuning and retuning, to reach better performance, and multiple learning is able to solve or largely alleviate this time-consuming problem.
Multiple learning approaches can be implemented on a single node as well as on parallel computing frameworks, e.g., Spark. Due to time and resource constraints, our work is currently limited to closed-form solutions. In future work, we would like to conduct more experiments over large-scale datasets from real-world applications and extend multiple learning to models with no closed-form solutions.
References
 [1] (2005) Stock market forecasting: artificial neural network and linear regression comparison in an emerging market. Journal of Financial Management & Analysis 18 (2), pp. 18. Cited by: §I.
 [2] (2003) Nonlinear programming: analysis and methods. Courier Corporation. Cited by: §I.
 [3] (2012) The Moore-Penrose pseudoinverse: a tutorial review of the theory. Brazilian Journal of Physics 42 (12), pp. 146–165. Cited by: §I.
 [4] (2003) Generalized inverses: theory and applications. Vol. 15, Springer Science & Business Media. Cited by: §I.
 [5] (1964) An analysis of transformations. Journal of the Royal Statistical Society: Series B (Methodological) 26 (2), pp. 211–243. Cited by: §I.
 [6] (2016) XGBoost: a scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794. Cited by: §I.
 [7] (1991) Variable metric method for minimization. SIAM Journal on Optimization 1 (1), pp. 1–17. Cited by: §I.
 [8] (1977) Quasi-Newton methods, motivation and theory. SIAM Review 19 (1), pp. 46–89. Cited by: §I.
 [9] (1920) The fourteenth western meeting of the American Mathematical Society. Bull. Amer. Math. Soc. 26 (9), pp. 385–396. External Links: Link Cited by: §I.
 [10] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §I.
 [11] (1970) Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12 (1), pp. 55–67. Cited by: §I.
 [12] (1977) Robust regression using iteratively reweighted least-squares. Communications in Statistics - Theory and Methods 6 (9), pp. 813–827. Cited by: §I.
 [13] (1962) Linear regression and correlation. Mathematics of statistics 1, pp. 252–285. Cited by: §I.
 [14] (2001) Convergence and efficiency of subgradient methods for quasiconvex minimization. Mathematical programming 90 (1), pp. 1–25. Cited by: §I, §I.
 [15] (2014) The Art of Computer Programming, Volume 2: Seminumerical Algorithms. 3rd edition, Addison-Wesley Professional. Cited by: §I.
 [16] (2002) A comparison of algorithms for maximum entropy parameter estimation. In Proceedings of the 6th Conference on Natural Language Learning - Volume 20, pp. 1–7. Cited by: §I.
 [17] (1970) Generalized inverses, ridge regression, biased linear estimation, and nonlinear estimation. Technometrics 12 (3), pp. 591–612. Cited by: §I.
 [18] (1986) Matrix computation on distributed memory multiprocessors. Hypercube Multiprocessors 86 (181195), pp. 31. Cited by: §I.
 [19] (1990) Classical and modern regression with applications. Vol. 2, Duxbury Press Belmont, CA. Cited by: §I.
 [20] (2010) Linear regression for face recognition. IEEE transactions on pattern analysis and machine intelligence 32 (11), pp. 2106–2112. Cited by: §I.
 [21] (2001) Predicting housing value: a comparison of multiple regression analysis and artificial neural networks. Journal of real estate research 22 (3), pp. 313–336. Cited by: §I.
 [22] (2011) Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12 (Oct), pp. 2825–2830. Cited by: §I.
 [23] (2016) An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747. Cited by: §I.
 [24] (1992) The Box-Cox transformation technique: a review. Journal of the Royal Statistical Society: Series D (The Statistician) 41 (2), pp. 169–178. Cited by: §I.
 [25] (1974) Quasi-likelihood functions, generalized linear models, and the Gauss-Newton method. Biometrika 61 (3), pp. 439–447. Cited by: §I, §I, §I.
 [26] (2010) Spark: cluster computing with working sets.. HotCloud 10 (1010), pp. 95. Cited by: §I.
 [27] (2017) An exact approach to ridge regression for big data. Computational Statistics 32 (3), pp. 909–928. Cited by: §IIIA.
 [28] (2017) Box–cox transformation in big data. Technometrics 59 (2), pp. 189–201. Cited by: §IIIA.