Robust Matrix Regression
Modern technologies are producing datasets with complex intrinsic structures, and they can be naturally represented as matrices instead of vectors. To preserve the latent data structures during processing, modern regression approaches incorporate the low-rank property to the model, and achieve satisfactory performance for certain applications. These approaches all assume that both predictors and labels for each pair of data within the training set are accurate. However, in real world applications, it is common to see the training data contaminated by noises, which can affect the robustness of these matrix regression methods. In this paper, we address this issue by introducing a novel robust matrix regression method. We also derive efficient proximal algorithms for model training. To evaluate the performance of our methods, we apply it on real world applications with comparative studies. Our method achieves the state-of-the-art performance, which shows the effectiveness and the practical value of our method.
Robust Matrix Regression
, , Department of Computer Science and Engineering, The Chinese University of Hong Kong College of Electronic and Information Engineering, Sichuan University firstname.lastname@example.org, email@example.com, firstname.lastname@example.org
Classical regression methods, such as ridge regression (Hoerl and Kennard, 1970) and lasso (Tibshirani, 1996) are basically designed for data in vector form. However, with the development of modern technology, it is common to meet datasets with sample unit not in vector form but instead in matrix form. Examples include the two-dimensional digital images, with quantized values of different colors at certain rows and columns of pixels; and electroencephalography (EEG) data with voltage fluctuations at multiple channels over a period of time. When using traditional regression methods to process these data, we have to reshape them into vectors, which may destroy the latent topological structural information, such as the correlation between different channels for EEG data (Zhou and Li, 2014), and the spatial relation within an image (Wolf, Jhuang, and Hazan, 2007).
To tackle this issue, several methods have been proposed to perform regression on data in matrix form directly. One such model is the regularized matrix regression (R-GLM) (Zhou and Li, 2014). Given a dataset , where is the sample size, denotes the th data matrix as predictor, and is the corresponding output, the R-GLM model aims to learn a function to identify the output given a newly observed data matrix with
where represents the trace of a matrix, is the regression matrix with low-rank property to preserve structural information of each data matrix, and denotes the offset. is zero-mean Gaussian noise to model the small uncertainty of the output. With this setting, the R-GLM has achieved satisfying results in several applications. However, there still exist certain issues that should be further addressed. Firstly, R-GLM uses the Gaussian noise for model fitting, and take all deviations of predicted values from labels into account. This setting can be reasonable in certain cases, but may not make sense for particular applications. As an example, consider the problem of head pose estimation (Sherrah and Gong, 2001), where for each data pair, the predictor is a two dimensional digital image for the head of a person, while the output denotes the angle of his head. Because the real angle of head cannot be measured precisely, there should exist certain deviations of provided labels from the real ones empirically. In this case, the regression model should be able to tolerant such small deviations instead of taking them all into account. Another important issue is that, the predictors are also assumed to be noise free, which can be irrational in certain applications. Practically, it is common to see signals corrupted by noise, such as image signals with occlusion, specular reflections or noise (Huang, Cabral, and De la Torre, 2016), and financial data with noise (Magdon-Ismail, Nicholson, and Abu-Mostafa, 1998). Thus, it is important for a regression model to be tolerant of noise on predictors and labels to enhance its robustness empirically.
In this paper, we introduce two novel matrix regression methods to tackle the above mentioned issues. We first propose a “Robust Matrix Regression” (RMR) to tackle the noisy label problem, by introducing hinge loss to model the uncertainty of regression labels. In this way, our method only considers error larger than a pre-specified value, and can tolerate error around each labeled output within a small range. This approach is also favored for other advantages in certain scenarios. As an example, in applications like financial time-series prediction, it is common to require not to lose more than money when dealing with data like exchange rates, and this issue can be well addressed with our setting. Even though the hinge loss error has been used in the support vector regression model (Smola and Schölkopf, 2004), it is an algorithm based on vector-form data, which can ruin the latent structure for matrix regression problem. We then propose efficient ADMM method to solve the optimization problem iteratively.
To further enhance the robustness of RMR with noisy predictors, we propose a generalized RMR (G-RMR) by decomposing each data matrix as latent clean signal plus sparse outliers. For model training, we also derive a proximal algorithm to estimate both the regression matrix and latent clean signals iteratively. To evaluate the performance of our methods, we conduct extensive experiments on both approaches with comparison of state-of-the-art ones. Our methods achieve superior performance consistently, which shows their efficiency in real world problems.
Notations: We present the scalar values with lower case letters (e.g., ); vectors by bold lower case letters (e.g., ); and matrix by bold upper case letters (e.g., ). For a matrix , its -entity is represented as . denotes the trace of a matrix, and . We further set and as the Frobenius norm and nuclear norm of a matrix respectively.
Robust Matrix Regression
We first introduce the RMR model to address the noisy label problem, with an ADMM algorithm for model training.
For matrix regression, classical techniques need to reshape each matrix into a vector , which will destroy its intrinsic structures, resulting in the loss of information. The R-GLM approach (Zhou and Li, 2014) addresses this issue by enforcing the regression matrix to be low-rank representable with nuclear norm penalty. However, this method is based on the Gaussian loss, which may affect the robustness with existence of noisy labels.
To tackle this issue, an intuitive idea is to ignore noises within a small margin around each label for robust model fitting. Motivated by this idea, we propose our RMR, by introducing the hinge loss for model fitting, where the residual corresponding to each data is defined as follows
With the above formulation of residuals, when learning the regression model, our approach only takes residuals larger than into account, thus, the labels contaminated by noise within a small margin is tolerable accordingly. Similar residual modeling approach has also been used in the method of support vector regression (Smola and Schölkopf, 2004). However, this approach is proposed for vector data regression and cannot capture the latent structure within each data matrix. Differently, our method can capture such latent structure by incorporating the spectral elastic net penalty (Luo et al., 2015) into the regression matrix , which can model the correlation of each data matrix effectively. And the corresponding optimization problem is defined as follows
with as the spectral elastic net penalty, we incorporate low-rank property into and consider the group effect of the eigenvalues, to capture the latent structures among data matrices (Luo et al., 2015).
As the object function contains both hinge loss and nuclear norm, the Nesterov method used in R-GLM (Zhou and Li, 2014) is no longer available because the derivative of our loss function is not Lipschitz-continuous. Nevertheless, since our model is convex with respect to both and , we here derive an efficient learning algorithm based on Alternating Direction Method of Multipliers (ADMM) (Boyd et al., 2011) with the restart rule (Goldstein et al., 2014) to solve the optimization problem. The optimization problem defined in Eq. (3) can be equivalently written as follows:
where is an auxiliary variable and . In this way, the original optimization has been split into two subproblems with respect to and respectively. In this way, we can develop efficient ADMM method to solve Eq. (4) by using the Augmented Lagrangian approach as follows:
where is a hyper parameter.
Optimization for Auxiliary Variable
We first derive the optimization method for solving the auxiliary variable . The first subproblem for solving is
Then we can get in the iteration by solving the problem in Eq. (6) and get the analytical solution as follows.
where and .
Optimization for Regression Matrix
To solve the regression matrix, we have
And one solution of this optimization problem is
where , and . and can be constructed from the solution of the following box constrained quadratic programming problem:
Here we have
and are independent of with,
The algorithm is summarized in Algorithm 1. And for its convergence, we have the following theorem.
Suppose the optimal solution of Problem (4) is . Then
Because the hinge loss and nuclear norm are weakly convex, the convergence property of Algorithm 1 can be proved immediately based on the result in (Goldstein et al., 2014; He and Yuan, 2015). That is, we have
We further theoretically analyze the excess risk of our RMR model. Following the framework of (Wimalawarne, Tomioka, and Sugiyama, 2016), we assume each entity of a data matrix follows standard Gaussian distribution. Then, the optimization problem of our RMR can be rewritten as
where and are certain constants, and is the hinge loss. Based on the relation between and in Eq. ((9)), the loss function can be simplified as
which is a -Lipschitz continuous function, and , with to be the empirical mean of data, which tends to be when is large. Thus, the can also be considered as standard Gaussian distributed.
Let and be the empirical risk and expected risk respectively (Maurer and Pontil, 2013). Also, we set be the optimal solution to
and be the optimal solution of
Then, we provide the upper bound of the excess risk of RMR in the following theorem, with proof in supplementary.
With probability at least , let be the rank of (), and the excess risk of RMR is bounded with
Proof. To prove the above theorem, We can first reformulate the excess risk with respect to and as follows
Here, the second term is negative naturally. And following the Hoeffding’s inequality, the third one can be bounded as , with probability .
For the first term, it is easy to obtain that
Further using the McDiarmid’s inequality, we can simply obtain the Rademacher complexity with probability , with
where represents the Rademacher variables. Let , we can obtain the upper bound of as follows
Further applying the Hölder’s inequality, we have
where is the dual norm of nuclear norm and
is equivalent to
whose optimal solution is .
Since is the sum of random variables, its entries should also be considered as Gaussian distributed, with variance . Thus, following Maurer and Pontil (2013), with the Gordan’s theorem, we have
Combining all the above together, we can obtain the upper bound of the excess risk with probability at least as follows
Generalized Robust Matrix Regression
The RMR can tolerate label noises for matrix regression. However, it cannot handle noise or outliers on each predictor empirically. To address this issue, we further introduce a generalize RMR (G-RMR) which assumes each noisy data matrix can be decomposed as a latent clean signal plus outliers. The clean signals can be recovered from each noisy ones when learning the regression model.
As discussed in (Candès et al., 2011), it is common for natural signals to contain correlation empirically. Thus, when stacking each vectorized latent clean data matrix, the resulting matrix should be low-rank representable. We also introduce the sparsity feature with the norm to the outliers for robust modeling. With these settings, the optimization problem of G-RMR can be defined as follows
Here, denotes the th noisy input matrix, and we assume that it can be decomposed as , with as the latent clean matrix signal, and as the outliers. is a matrix with the row as the vector form of , denotes the matrix with the row as the vector form of , which is assumed to a low-rank representable with a nuclear norm penalty; and contains the outliers for each data matrix, with its row as the vector form of and is encouraged to be sparse with the norm. It can be noticed that the decomposition form of each is the same as that in Robust Principal Component Analysis, which is effective in data recovery. But our approach is different because we update the regression matrix and recover the clean signals simultaneously within one optimization problem, which can benefit both tasks empirically.
The optimization problem for G-RMR contain three additional non-smooth term. Fortunately, the object function in Eq. (26) is still bi-convex, which means that it is convex with respect to with and fixed, and convex with respect to and with fixed. Therefore, we derive an iterative ADMM method to solve this problem, where the optimization problem in Eq. (26) is divided into two subproblems and each subproblem can be solved by ADMM individually.
The first subproblem is to solve with and fixed, which is equivalent to solve the problem in Eq. (3). The second subproblem is to solve and with fixed. Different from the first subproblem, an norm is introduced to the object function. And the subproblem can be written as
Since contains hinge loss function and other two non-smooth terms, the optimization problem is hard to solve. Thus, we relax the hinge loss to squared one and develop another ADMM method to solve this subproblem in with Augmented Lagrangian approach as follows:
where is the coefficient variable in vector form, and is a hyper parameter.
We first solve with fixed, and the optimization problem is as follows,
which can be further written as
where and . Both of them are convex and their derivatives are Lipschitz continuous. Thus, we use the proximal gradient method to update with,
where and is the size of a gradient step.
After solving , we proceed to solve with fixed. Similarly, we re-write the this subproblem as follows:
where and . Since both and are convex, is smooth and is non-smooth and their derivatives are Lipschitz continuous, we can update by
In this section, we conduct extensive experiments with comparative studies. The proposed method is implemented by Matlab R2015a in a machine with four-core 3.7GHz CPU and 16GB memory. We investigate the performance of our RMR and G-RMR with comparison of two state-of-the-art methods: 1) The classical Support Vector Regression (SVR) (Smola and Schölkopf, 2004); 2) Regularized Matrix Regression (R-GLM) (Zhou and Li, 2014).
To evaluate and compare the performance of these algorithms, we apply them on three empirical tasks. Firstly, we elaborate on the illustrative examples by examining different geometric and natural shapes on the regression matrix. Secondly, we apply them on real-world finical time series data, where each sample can be represented as a matrix. At last, we apply RMR and G-RMR on the application of human head pose estimation (Sherrah and Gong, 2001).
To evaluate the performance of each algorithm, we use the Relative Absolute Error (RAE) that measures error between true labels and estimated labels with . For compared method, SVR, RMR and G-RMR, we fix the coefficients and ,. And all the hyper-parameters are selected via cross validation.
We first conduct the illustrative examples by examining various signal shapes, where each of them represents a regression matrix. We use the regression matrix to generate each sample by the following equation:
where is the regression matrix illustrated by a signal shape, is a randomly generated sample, is a bias term and is the noise term on label , which is sampled from Laplacian distribution (The probability distribution function is ).
In the experiment, we randomly generate samples for 10 rounds. In each round, half of the samples are used for model training and the rest are for testing. Then we compute the mean and standard deviation of RAE error on classifier matrix for each approach. And detailed comparison of our RMR and G-RMR with other methods are shown in Table 1, where column “shape” denotes the type of signals. The illustration of true signal shapes followed by the estimation results from the four methods can be found in the Fig. 1.
It can be clearly seen from the illustrative examples that our method outperforms R-GLM in recovering lower rank signals, such as square, cross and T shape. Although they yield comparable results in recovering the high rank signal, it can be seen from Table 1 that our approaches still outperform R-GLM in terms of the RAE error on quantitatively. It clearly shows that the hinge loss in our approaches for model fitting is more robust empirically. Besides, it can be observed that our RMR and G-RMR methods substantially outperforms the traditional SVR method for all illustrative examples, because the SVR fails to capture the correlation among each data matrix. Specifically, we further use the square shaped signal to display the RAE error along the solution path of the nuclear norm for matrix regression, which is shown in Fig. 2. It can be seen that, when the weight is larger than , the performance is better, which shows the effectiveness of incorporating the nuclear norm as penalty.
Financial Time Series Data Analysis
We further evaluate our RMR on the financial time series data. We use daily price data (details can be found in the Table 2) from (Akbilgic, Bozdogan, and Balaban, 2014), where the prices are converted to returns. In this data set, there are 536 daily returns from January 5, 2009 to Feburary 22, 2011. Particularly, the days are excluded when the Turkish stock exchange was closed.
Intuitively, the value of an index in a certain date may be related to others as well as previous values. Thus, it is natural to process data in matrix form instead of a vector to preserve the latent topological structure of data. Besides, there exist many fluctuations in different indices due to the complicated stock market and many entries of the data are contaminated accordingly. Therefore, it is imperative to tackle the above issues with the proposed methods.
|ISE100||Istanbul stock exchange national 100 index|
|SP||Standard & poorâs 500 return index|
|DAX||Stock market return index of Germany|
|FTSE||Stock market return index of UK|
|NIK||Stock market return index of Japan|
|BVSP||Stock market return index of Brazil|
|EU||MSCI European index|
|EM||MSCI emerging markets index|
Besides using the RAE for evaluation, we further use 2 extra criterions to evaluate our results on the financial data set, i.e., percentage of correctly predicted (PCP) days, which can interpret our results in a simple and logical manner, and the Dollar 100 (D100) criterion, which gives us the theoretical future value of invested at the beginning of predicted term and traded accordingly. We use the first days’ data to train our model and predict the left days’ index values for evaluation.
As we can see from Table 3, our proposed methods outperform other state-of-the-art methods in terms of all 3 different criterions. Our approaches achieve better results than previous R-GLM, because there exist many fluctuations in everyday’s stock exchange rates due to the complicated situations, and allowing a range of error for the returns can be more robust than directly applying the squared error for model fitting. Also, the SVR ignores the latent topological structure among different indices and historical data, and is beaten by our proposed methods. It can be observed that our G-RMR achieves a significant improvement over the RMR. This is because financial data always contain serious noise problems (Magdon-Ismail, Nicholson, and Abu-Mostafa, 1998) not only in the label but also in the data matrix entries. It shows that considering noise on predictors in G-RMR is also effective in certain real world situations.
Head Pose Data Analysis
We further test the performance of our methods on the application of head pose estimation with dataset used in (Sherrah and Gong, 2001) . The dataset arises from a study to estimate the human pose via digital images captured from cameras, with 37 people in gray scale of image size, which can be represented as data in matrix form naturally. Each person has facial images covering a view sphere of degrees in yaw and degrees in tilt at degrees increment. Several example images can be found in the Fig. 3. As discussed before, it is difficult to measure the real angle of human head accurately, thus, it can be expected that the label of each image may contain several small errors, resulting in the noisy label problem.
For each person, we keep the degree in tilt fixed and use the degree in yaw angle as our label. We then set tilting degree of the face to , which denotes the frontal face. Each image is cropped around the face and resized to .
The numerical performance of each algorithm is shown in Table. 4, where column “Trn#” lists the corresponding number of training samples and column “G-RMR (C)” denotes that data used here is corrupted by adding white square blocks ( samples are corrupted). On one hand, as shown in Table 4, our methods outperform all competitive cones in terms of RAE error on the head angle, because our methods take both the correlation within each data and the noisy label issue into consideration, resulting in more robust estimation of the model. On the other hand, our G-RMR also achieves competitive result on corrupted data compared with the result on normal data, which shows the robustness of our model on noisy entries.
In this paper, we addressed the robust matrix regression issue with two novel methods proposed, i.e., RMR and G-RMR. For RMR, we introduced the hinge loss for model fitting, to enhance the robustness of matrix regression methods against the problem of noisy labels. An ADMM algorithm was further derived for model training. As an extension of RMR, the G-RMR was proposed to take noisy predictors into consideration by clean matrix signal recovery during model training procedure. We also conducted extensive empirical studies to evaluate the performance of RMR and G-RMR, and our methods achieve state-of-the-art performance. It shows that our approaches can address real-world problems effectively.
- Akbilgic, Bozdogan, and Balaban (2014) Akbilgic, O.; Bozdogan, H.; and Balaban, M. E. 2014. A novel hybrid rbf neural networks model as a forecaster. Statistics and Computing 24(3):365–375.
- Boyd et al. (2011) Boyd, S.; Parikh, N.; Chu, E.; Peleato, B.; and Eckstein, J. 2011. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning 3(1):1–122.
- Candès et al. (2011) Candès, E. J.; Li, X.; Ma, Y.; and Wright, J. 2011. Robust principal component analysis? Journal of the ACM (JACM) 58(3):11.
- Goldstein et al. (2014) Goldstein, T.; O’Donoghue, B.; Setzer, S.; and Baraniuk, R. 2014. Fast alternating direction optimization methods. SIAM Journal on Imaging Sciences 7(3):1588–1623.
- He and Yuan (2015) He, B., and Yuan, X. 2015. On non-ergodic convergence rate of douglas–rachford alternating direction method of multipliers. Numerische Mathematik 130(3):567–577.
- Hoerl and Kennard (1970) Hoerl, A. E., and Kennard, R. W. 1970. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics 12(1):55–67.
- Huang, Cabral, and De la Torre (2016) Huang, D.; Cabral, R.; and De la Torre, F. 2016. Robust regression. IEEE transactions on pattern analysis and machine intelligence 38(2):363–375.
- Keerthi and Gilbert (2002) Keerthi, S. S., and Gilbert, E. G. 2002. Convergence of a generalized smo algorithm for svm classifier design. Machine Learning 46(1-3):351–360.
- Luo et al. (2015) Luo, L.; Xie, Y.; Zhang, Z.; and Li, W.-J. 2015. Support matrix machines. In International Conference on Machine Learning (ICML).
- Magdon-Ismail, Nicholson, and Abu-Mostafa (1998) Magdon-Ismail, M.; Nicholson, A.; and Abu-Mostafa, Y. S. 1998. Financial markets: very noisy information processing. Proceedings of the IEEE 86(11):2184–2195.
- Maurer and Pontil (2013) Maurer, A., and Pontil, M. 2013. Excess risk bounds for multitask learning with trace norm regularization. In Conference on Learning Theory (COLT), volume 30, 55–76.
- Platt and others (1998) Platt, J., et al. 1998. Sequential minimal optimization: A fast algorithm for training support vector machines.
- Sherrah and Gong (2001) Sherrah, J., and Gong, S. 2001. Fusion of perceptual cues for robust tracking of head pose and position. Pattern Recognition 34(8):1565–1572.
- Smola and Schölkopf (2004) Smola, A. J., and Schölkopf, B. 2004. A tutorial on support vector regression. Statistics and computing 14(3):199–222.
- Tibshirani (1996) Tibshirani, R. 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological) 267–288.
- Wimalawarne, Tomioka, and Sugiyama (2016) Wimalawarne, K.; Tomioka, R.; and Sugiyama, M. 2016. Theoretical and experimental analyses of tensor-based regression and classification. Neural computation 28(4):686–715.
- Wolf, Jhuang, and Hazan (2007) Wolf, L.; Jhuang, H.; and Hazan, T. 2007. Modeling appearances with low-rank svm. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, 1–6. IEEE.
- Zhou and Li (2014) Zhou, H., and Li, L. 2014. Regularized matrix regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 76(2):463–483.