Robust Matrix Regression

Robust Matrix Regression

, ,
Department of Computer Science and Engineering, The Chinese University of Hong Kong
College of Electronic and Information Engineering, Sichuan University,,

Modern technologies are producing datasets with complex intrinsic structures, and they can be naturally represented as matrices instead of vectors. To preserve the latent data structures during processing, modern regression approaches incorporate the low-rank property to the model, and achieve satisfactory performance for certain applications. These approaches all assume that both predictors and labels for each pair of data within the training set are accurate. However, in real world applications, it is common to see the training data contaminated by noises, which can affect the robustness of these matrix regression methods. In this paper, we address this issue by introducing a novel robust matrix regression method. We also derive efficient proximal algorithms for model training. To evaluate the performance of our methods, we apply it on real world applications with comparative studies. Our method achieves the state-of-the-art performance, which shows the effectiveness and the practical value of our method.

Robust Matrix Regression

, , Department of Computer Science and Engineering, The Chinese University of Hong Kong College of Electronic and Information Engineering, Sichuan University,,


Classical regression methods, such as ridge regression (Hoerl and Kennard, 1970) and lasso (Tibshirani, 1996) are basically designed for data in vector form. However, with the development of modern technology, it is common to meet datasets with sample unit not in vector form but instead in matrix form. Examples include the two-dimensional digital images, with quantized values of different colors at certain rows and columns of pixels; and electroencephalography (EEG) data with voltage fluctuations at multiple channels over a period of time. When using traditional regression methods to process these data, we have to reshape them into vectors, which may destroy the latent topological structural information, such as the correlation between different channels for EEG data (Zhou and Li, 2014), and the spatial relation within an image (Wolf, Jhuang, and Hazan, 2007).

To tackle this issue, several methods have been proposed to perform regression on data in matrix form directly. One such model is the regularized matrix regression (R-GLM) (Zhou and Li, 2014). Given a dataset , where is the sample size, denotes the th data matrix as predictor, and is the corresponding output, the R-GLM model aims to learn a function to identify the output given a newly observed data matrix with


where represents the trace of a matrix, is the regression matrix with low-rank property to preserve structural information of each data matrix, and denotes the offset. is zero-mean Gaussian noise to model the small uncertainty of the output. With this setting, the R-GLM has achieved satisfying results in several applications. However, there still exist certain issues that should be further addressed. Firstly, R-GLM uses the Gaussian noise for model fitting, and take all deviations of predicted values from labels into account. This setting can be reasonable in certain cases, but may not make sense for particular applications. As an example, consider the problem of head pose estimation (Sherrah and Gong, 2001), where for each data pair, the predictor is a two dimensional digital image for the head of a person, while the output denotes the angle of his head. Because the real angle of head cannot be measured precisely, there should exist certain deviations of provided labels from the real ones empirically. In this case, the regression model should be able to tolerant such small deviations instead of taking them all into account. Another important issue is that, the predictors are also assumed to be noise free, which can be irrational in certain applications. Practically, it is common to see signals corrupted by noise, such as image signals with occlusion, specular reflections or noise (Huang, Cabral, and De la Torre, 2016), and financial data with noise (Magdon-Ismail, Nicholson, and Abu-Mostafa, 1998). Thus, it is important for a regression model to be tolerant of noise on predictors and labels to enhance its robustness empirically.

In this paper, we introduce two novel matrix regression methods to tackle the above mentioned issues. We first propose a “Robust Matrix Regression” (RMR) to tackle the noisy label problem, by introducing hinge loss to model the uncertainty of regression labels. In this way, our method only considers error larger than a pre-specified value, and can tolerate error around each labeled output within a small range. This approach is also favored for other advantages in certain scenarios. As an example, in applications like financial time-series prediction, it is common to require not to lose more than money when dealing with data like exchange rates, and this issue can be well addressed with our setting. Even though the hinge loss error has been used in the support vector regression model (Smola and Schölkopf, 2004), it is an algorithm based on vector-form data, which can ruin the latent structure for matrix regression problem. We then propose efficient ADMM method to solve the optimization problem iteratively.

To further enhance the robustness of RMR with noisy predictors, we propose a generalized RMR (G-RMR) by decomposing each data matrix as latent clean signal plus sparse outliers. For model training, we also derive a proximal algorithm to estimate both the regression matrix and latent clean signals iteratively. To evaluate the performance of our methods, we conduct extensive experiments on both approaches with comparison of state-of-the-art ones. Our methods achieve superior performance consistently, which shows their efficiency in real world problems.

Notations: We present the scalar values with lower case letters (e.g., ); vectors by bold lower case letters (e.g., ); and matrix by bold upper case letters (e.g., ). For a matrix , its -entity is represented as . denotes the trace of a matrix, and . We further set and as the Frobenius norm and nuclear norm of a matrix respectively.

Robust Matrix Regression

We first introduce the RMR model to address the noisy label problem, with an ADMM algorithm for model training.


For matrix regression, classical techniques need to reshape each matrix into a vector , which will destroy its intrinsic structures, resulting in the loss of information. The R-GLM approach (Zhou and Li, 2014) addresses this issue by enforcing the regression matrix to be low-rank representable with nuclear norm penalty. However, this method is based on the Gaussian loss, which may affect the robustness with existence of noisy labels.

To tackle this issue, an intuitive idea is to ignore noises within a small margin around each label for robust model fitting. Motivated by this idea, we propose our RMR, by introducing the hinge loss for model fitting, where the residual corresponding to each data is defined as follows


With the above formulation of residuals, when learning the regression model, our approach only takes residuals larger than into account, thus, the labels contaminated by noise within a small margin is tolerable accordingly. Similar residual modeling approach has also been used in the method of support vector regression (Smola and Schölkopf, 2004). However, this approach is proposed for vector data regression and cannot capture the latent structure within each data matrix. Differently, our method can capture such latent structure by incorporating the spectral elastic net penalty (Luo et al., 2015) into the regression matrix , which can model the correlation of each data matrix effectively. And the corresponding optimization problem is defined as follows



with as the spectral elastic net penalty, we incorporate low-rank property into and consider the group effect of the eigenvalues, to capture the latent structures among data matrices (Luo et al., 2015).


As the object function contains both hinge loss and nuclear norm, the Nesterov method used in R-GLM (Zhou and Li, 2014) is no longer available because the derivative of our loss function is not Lipschitz-continuous. Nevertheless, since our model is convex with respect to both and , we here derive an efficient learning algorithm based on Alternating Direction Method of Multipliers (ADMM) (Boyd et al., 2011) with the restart rule (Goldstein et al., 2014) to solve the optimization problem. The optimization problem defined in Eq. (3) can be equivalently written as follows:


where is an auxiliary variable and . In this way, the original optimization has been split into two subproblems with respect to and respectively. In this way, we can develop efficient ADMM method to solve Eq. (4) by using the Augmented Lagrangian approach as follows:


where is a hyper parameter.

Optimization for Auxiliary Variable

We first derive the optimization method for solving the auxiliary variable . The first subproblem for solving is


Then we can get in the iteration by solving the problem in Eq. (6) and get the analytical solution as follows.


where and .

Optimization for Regression Matrix

To solve the regression matrix, we have


And one solution of this optimization problem is


where , and . and can be constructed from the solution of the following box constrained quadratic programming problem:


Here we have


and are independent of with,

In this way, we can get and in each iteration with sequential minimization optimization algorithm (Platt and others, 1998; Keerthi and Gilbert, 2002).

The algorithm is summarized in Algorithm 1. And for its convergence, we have the following theorem.

Theorem 1 

Suppose the optimal solution of Problem (4) is . Then


Because the hinge loss and nuclear norm are weakly convex, the convergence property of Algorithm 1 can be proved immediately based on the result in (Goldstein et al., 2014; He and Yuan, 2015). That is, we have

Theorem 2 

For any and , the iteration sequence given by Algorithm 1 converges to the optimal solution of Problem (4).

2:for  do
7:     if  then
11:     else
16:     end if
17:end for
Algorithm 1 ADMM for Subproblem 1 in Eq. (4)

Theoretical Analysis

We further theoretically analyze the excess risk of our RMR model. Following the framework of (Wimalawarne, Tomioka, and Sugiyama, 2016), we assume each entity of a data matrix follows standard Gaussian distribution. Then, the optimization problem of our RMR can be rewritten as


where and are certain constants, and is the hinge loss. Based on the relation between and in Eq. ((9)), the loss function can be simplified as


which is a -Lipschitz continuous function, and , with to be the empirical mean of data, which tends to be when is large. Thus, the can also be considered as standard Gaussian distributed.

Let and be the empirical risk and expected risk respectively (Maurer and Pontil, 2013). Also, we set be the optimal solution to


and be the optimal solution of


Then, we provide the upper bound of the excess risk of RMR in the following theorem, with proof in supplementary.

Theorem 3 

With probability at least , let be the rank of (), and the excess risk of RMR is bounded with

Proof. To prove the above theorem, We can first reformulate the excess risk with respect to and as follows


Here, the second term is negative naturally. And following the Hoeffding’s inequality, the third one can be bounded as , with probability .

For the first term, it is easy to obtain that


Further using the McDiarmid’s inequality, we can simply obtain the Rademacher complexity with probability , with


where represents the Rademacher variables. Let , we can obtain the upper bound of as follows


Further applying the Hölder’s inequality, we have


where is the dual norm of nuclear norm and


is equivalent to


whose optimal solution is .

Since is the sum of random variables, its entries should also be considered as Gaussian distributed, with variance . Thus, following Maurer and Pontil (2013), with the Gordan’s theorem, we have


Combining all the above together, we can obtain the upper bound of the excess risk with probability at least as follows


Generalized Robust Matrix Regression

The RMR can tolerate label noises for matrix regression. However, it cannot handle noise or outliers on each predictor empirically. To address this issue, we further introduce a generalize RMR (G-RMR) which assumes each noisy data matrix can be decomposed as a latent clean signal plus outliers. The clean signals can be recovered from each noisy ones when learning the regression model.


As discussed in (Candès et al., 2011), it is common for natural signals to contain correlation empirically. Thus, when stacking each vectorized latent clean data matrix, the resulting matrix should be low-rank representable. We also introduce the sparsity feature with the norm to the outliers for robust modeling. With these settings, the optimization problem of G-RMR can be defined as follows


Here, denotes the th noisy input matrix, and we assume that it can be decomposed as , with as the latent clean matrix signal, and as the outliers. is a matrix with the row as the vector form of , denotes the matrix with the row as the vector form of , which is assumed to a low-rank representable with a nuclear norm penalty; and contains the outliers for each data matrix, with its row as the vector form of and is encouraged to be sparse with the norm. It can be noticed that the decomposition form of each is the same as that in Robust Principal Component Analysis, which is effective in data recovery. But our approach is different because we update the regression matrix and recover the clean signals simultaneously within one optimization problem, which can benefit both tasks empirically.


The optimization problem for G-RMR contain three additional non-smooth term. Fortunately, the object function in Eq. (26) is still bi-convex, which means that it is convex with respect to with and fixed, and convex with respect to and with fixed. Therefore, we derive an iterative ADMM method to solve this problem, where the optimization problem in Eq. (26) is divided into two subproblems and each subproblem can be solved by ADMM individually.

The first subproblem is to solve with and fixed, which is equivalent to solve the problem in Eq. (3). The second subproblem is to solve and with fixed. Different from the first subproblem, an norm is introduced to the object function. And the subproblem can be written as


Since contains hinge loss function and other two non-smooth terms, the optimization problem is hard to solve. Thus, we relax the hinge loss to squared one and develop another ADMM method to solve this subproblem in with Augmented Lagrangian approach as follows:


where is the coefficient variable in vector form, and is a hyper parameter.

We summarized our method to solve the problem in Eq. (Solver) in Algorithm 2. The key steps include the computations of and , where the derivation of both and are based on proximal gradient method.

We first solve with fixed, and the optimization problem is as follows,


which can be further written as


where and . Both of them are convex and their derivatives are Lipschitz continuous. Thus, we use the proximal gradient method to update with,


where and is the size of a gradient step.

After solving , we proceed to solve with fixed. Similarly, we re-write the this subproblem as follows:


where and . Since both and are convex, is smooth and is non-smooth and their derivatives are Lipschitz continuous, we can update by

1:Initialize .
2:for  do
7:     if  then
11:     else
16:     end if;
17:end for
Algorithm 2 ADMM for Subproblem 2 in Eq. (27)

We summarize our method for solving G-RMR in Algorithm 3, based on Algorithm 1 and Algorithm 2, .

1:Input: Input training data and related parameters for the solver.
3:     Update using Algorithm. 1;
4:     Update and using Algorithm. 2;
5:until Convergence or maximum iteration number;
6:Output: The estimation of .
Algorithm 3 Solve Robust Matrix Regression


In this section, we conduct extensive experiments with comparative studies. The proposed method is implemented by Matlab R2015a in a machine with four-core 3.7GHz CPU and 16GB memory. We investigate the performance of our RMR and G-RMR with comparison of two state-of-the-art methods: 1) The classical Support Vector Regression (SVR) (Smola and Schölkopf, 2004); 2) Regularized Matrix Regression (R-GLM) (Zhou and Li, 2014).

To evaluate and compare the performance of these algorithms, we apply them on three empirical tasks. Firstly, we elaborate on the illustrative examples by examining different geometric and natural shapes on the regression matrix. Secondly, we apply them on real-world finical time series data, where each sample can be represented as a matrix. At last, we apply RMR and G-RMR on the application of human head pose estimation (Sherrah and Gong, 2001).

To evaluate the performance of each algorithm, we use the Relative Absolute Error (RAE) that measures error between true labels and estimated labels with . For compared method, SVR, RMR and G-RMR, we fix the coefficients and ,. And all the hyper-parameters are selected via cross validation.

Shape Recovery

We first conduct the illustrative examples by examining various signal shapes, where each of them represents a regression matrix. We use the regression matrix to generate each sample by the following equation:


where is the regression matrix illustrated by a signal shape, is a randomly generated sample, is a bias term and is the noise term on label , which is sampled from Laplacian distribution (The probability distribution function is ).

In the experiment, we randomly generate samples for 10 rounds. In each round, half of the samples are used for model training and the rest are for testing. Then we compute the mean and standard deviation of RAE error on classifier matrix for each approach. And detailed comparison of our RMR and G-RMR with other methods are shown in Table 1, where column “shape” denotes the type of signals. The illustration of true signal shapes followed by the estimation results from the four methods can be found in the Fig. 1.



T Shape
Table 1: The RAE error on for different approaches.
Figure 1: Comparison of our methods and other regression methods on the illustrative examples; Columns from left to right are true signal, SVR estimate, R-GLM estimate, RMR estimate and G-RMR estimate respectively.

It can be clearly seen from the illustrative examples that our method outperforms R-GLM in recovering lower rank signals, such as square, cross and T shape. Although they yield comparable results in recovering the high rank signal, it can be seen from Table 1 that our approaches still outperform R-GLM in terms of the RAE error on quantitatively. It clearly shows that the hinge loss in our approaches for model fitting is more robust empirically. Besides, it can be observed that our RMR and G-RMR methods substantially outperforms the traditional SVR method for all illustrative examples, because the SVR fails to capture the correlation among each data matrix. Specifically, we further use the square shaped signal to display the RAE error along the solution path of the nuclear norm for matrix regression, which is shown in Fig. 2. It can be seen that, when the weight is larger than , the performance is better, which shows the effectiveness of incorporating the nuclear norm as penalty.

Figure 2: along the solution path for nuclear norm regularization in RMR.

Financial Time Series Data Analysis

We further evaluate our RMR on the financial time series data. We use daily price data (details can be found in the Table 2) from (Akbilgic, Bozdogan, and Balaban, 2014), where the prices are converted to returns. In this data set, there are 536 daily returns from January 5, 2009 to Feburary 22, 2011. Particularly, the days are excluded when the Turkish stock exchange was closed.

Intuitively, the value of an index in a certain date may be related to others as well as previous values. Thus, it is natural to process data in matrix form instead of a vector to preserve the latent topological structure of data. Besides, there exist many fluctuations in different indices due to the complicated stock market and many entries of the data are contaminated accordingly. Therefore, it is imperative to tackle the above issues with the proposed methods.

Variable name
Variable explanation


ISE100 Istanbul stock exchange national 100 index
SP Standard & poor’s 500 return index
DAX Stock market return index of Germany
FTSE Stock market return index of UK
NIK Stock market return index of Japan
BVSP Stock market return index of Brazil
EU MSCI European index
EM MSCI emerging markets index
Table 2: Abbreviations list for stock market indices

Besides using the RAE for evaluation, we further use 2 extra criterions to evaluate our results on the financial data set, i.e., percentage of correctly predicted (PCP) days, which can interpret our results in a simple and logical manner, and the Dollar 100 (D100) criterion, which gives us the theoretical future value of invested at the beginning of predicted term and traded accordingly. We use the first days’ data to train our model and predict the left days’ index values for evaluation.



PCF 55.43% 54.62% 57.07% 58.42%
D100 163.5 122.2 169.7 195.2
RAE 1.2729 1.0160 0.9914 0.9868
Table 3: Results comparison on financial data

As we can see from Table 3, our proposed methods outperform other state-of-the-art methods in terms of all 3 different criterions. Our approaches achieve better results than previous R-GLM, because there exist many fluctuations in everyday’s stock exchange rates due to the complicated situations, and allowing a range of error for the returns can be more robust than directly applying the squared error for model fitting. Also, the SVR ignores the latent topological structure among different indices and historical data, and is beaten by our proposed methods. It can be observed that our G-RMR achieves a significant improvement over the RMR. This is because financial data always contain serious noise problems (Magdon-Ismail, Nicholson, and Abu-Mostafa, 1998) not only in the label but also in the data matrix entries. It shows that considering noise on predictors in G-RMR is also effective in certain real world situations.

Head Pose Data Analysis

Figure 3: The example of the head pose estimation dataset.

We further test the performance of our methods on the application of head pose estimation with dataset used in (Sherrah and Gong, 2001) . The dataset arises from a study to estimate the human pose via digital images captured from cameras, with 37 people in gray scale of image size, which can be represented as data in matrix form naturally. Each person has facial images covering a view sphere of degrees in yaw and degrees in tilt at degrees increment. Several example images can be found in the Fig. 3. As discussed before, it is difficult to measure the real angle of human head accurately, thus, it can be expected that the label of each image may contain several small errors, resulting in the noisy label problem.

For each person, we keep the degree in tilt fixed and use the degree in yaw angle as our label. We then set tilting degree of the face to , which denotes the frontal face. Each image is cropped around the face and resized to .



5 0.3343 0.2637 0.2613 0.2613 0.2622
10 0.3996 0.2355 0.2195 0.2194 0.2292
15 0.3245 0.2086 0.2049 0.2049 0.2054
20 0.2918 0.2042 0.2029 0.2029 0.2040
Table 4: RAE error on the head angle of head pose dataset

The numerical performance of each algorithm is shown in Table. 4, where column “Trn#” lists the corresponding number of training samples and column “G-RMR (C)” denotes that data used here is corrupted by adding white square blocks ( samples are corrupted). On one hand, as shown in Table 4, our methods outperform all competitive cones in terms of RAE error on the head angle, because our methods take both the correlation within each data and the noisy label issue into consideration, resulting in more robust estimation of the model. On the other hand, our G-RMR also achieves competitive result on corrupted data compared with the result on normal data, which shows the robustness of our model on noisy entries.


In this paper, we addressed the robust matrix regression issue with two novel methods proposed, i.e., RMR and G-RMR. For RMR, we introduced the hinge loss for model fitting, to enhance the robustness of matrix regression methods against the problem of noisy labels. An ADMM algorithm was further derived for model training. As an extension of RMR, the G-RMR was proposed to take noisy predictors into consideration by clean matrix signal recovery during model training procedure. We also conducted extensive empirical studies to evaluate the performance of RMR and G-RMR, and our methods achieve state-of-the-art performance. It shows that our approaches can address real-world problems effectively.


  • Akbilgic, Bozdogan, and Balaban (2014) Akbilgic, O.; Bozdogan, H.; and Balaban, M. E. 2014. A novel hybrid rbf neural networks model as a forecaster. Statistics and Computing 24(3):365–375.
  • Boyd et al. (2011) Boyd, S.; Parikh, N.; Chu, E.; Peleato, B.; and Eckstein, J. 2011. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning 3(1):1–122.
  • Candès et al. (2011) Candès, E. J.; Li, X.; Ma, Y.; and Wright, J. 2011. Robust principal component analysis? Journal of the ACM (JACM) 58(3):11.
  • Goldstein et al. (2014) Goldstein, T.; O’Donoghue, B.; Setzer, S.; and Baraniuk, R. 2014. Fast alternating direction optimization methods. SIAM Journal on Imaging Sciences 7(3):1588–1623.
  • He and Yuan (2015) He, B., and Yuan, X. 2015. On non-ergodic convergence rate of douglas–rachford alternating direction method of multipliers. Numerische Mathematik 130(3):567–577.
  • Hoerl and Kennard (1970) Hoerl, A. E., and Kennard, R. W. 1970. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics 12(1):55–67.
  • Huang, Cabral, and De la Torre (2016) Huang, D.; Cabral, R.; and De la Torre, F. 2016. Robust regression. IEEE transactions on pattern analysis and machine intelligence 38(2):363–375.
  • Keerthi and Gilbert (2002) Keerthi, S. S., and Gilbert, E. G. 2002. Convergence of a generalized smo algorithm for svm classifier design. Machine Learning 46(1-3):351–360.
  • Luo et al. (2015) Luo, L.; Xie, Y.; Zhang, Z.; and Li, W.-J. 2015. Support matrix machines. In International Conference on Machine Learning (ICML).
  • Magdon-Ismail, Nicholson, and Abu-Mostafa (1998) Magdon-Ismail, M.; Nicholson, A.; and Abu-Mostafa, Y. S. 1998. Financial markets: very noisy information processing. Proceedings of the IEEE 86(11):2184–2195.
  • Maurer and Pontil (2013) Maurer, A., and Pontil, M. 2013. Excess risk bounds for multitask learning with trace norm regularization. In Conference on Learning Theory (COLT), volume 30, 55–76.
  • Platt and others (1998) Platt, J., et al. 1998. Sequential minimal optimization: A fast algorithm for training support vector machines.
  • Sherrah and Gong (2001) Sherrah, J., and Gong, S. 2001. Fusion of perceptual cues for robust tracking of head pose and position. Pattern Recognition 34(8):1565–1572.
  • Smola and Schölkopf (2004) Smola, A. J., and Schölkopf, B. 2004. A tutorial on support vector regression. Statistics and computing 14(3):199–222.
  • Tibshirani (1996) Tibshirani, R. 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological) 267–288.
  • Wimalawarne, Tomioka, and Sugiyama (2016) Wimalawarne, K.; Tomioka, R.; and Sugiyama, M. 2016. Theoretical and experimental analyses of tensor-based regression and classification. Neural computation 28(4):686–715.
  • Wolf, Jhuang, and Hazan (2007) Wolf, L.; Jhuang, H.; and Hazan, T. 2007. Modeling appearances with low-rank svm. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, 1–6. IEEE.
  • Zhou and Li (2014) Zhou, H., and Li, L. 2014. Regularized matrix regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 76(2):463–483.
Comments 0
Request Comment
You are adding the first comment!
How to quickly get a good reply:
  • Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
  • Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
  • Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minimum 40 characters and the title a minimum of 5 characters
Add comment
Loading ...
This is a comment super asjknd jkasnjk adsnkj
The feedback must be of minumum 40 characters
The feedback must be of minumum 40 characters

You are asking your first question!
How to quickly get a good answer:
  • Keep your question short and to the point
  • Check for grammar or spelling errors.
  • Phrase it like a question
Test description