Low-Rank Discriminative Least Squares Regression for Image Classification
Abstract
Recent least squares regression (LSR) methods aim to learn slack regression targets that replace the strict zero-one labels. However, enlarging the distance between different classes can also amplify the differences between intra-class targets, and aggressively pursuing relaxed targets may lead to overfitting. To solve these problems, we propose a low-rank discriminative least squares regression (LRDLSR) model for multi-class image classification. Specifically, LRDLSR imposes a class-wise low-rank constraint on the intra-class regression targets to encourage their compactness and similarity. Moreover, LRDLSR introduces an additional regularization term on the learned targets to avoid overfitting. We show that these two improvements help to learn a more discriminative projection for regression, thus achieving better classification performance. Experimental results on a range of image databases demonstrate the effectiveness of the proposed LRDLSR method.
I Introduction
Least squares regression (LSR) is a very popular method in the field of multi-category image classification. LSR aims at learning a projection that transforms the original data into the corresponding zero-one labels with a minimum loss. Over the past decades, many LSR-based variants have been developed, such as locally weighted LSR [1], local LSR [2], LASSO regression [3], kernel ridge LSR [4], kernel LSR [5], weighted LSR [6], the least-squares support vector machine (LS-SVM) [7] and partial LSR [8]. Besides these, the linear regression, sparse representation, collaborative representation, and probabilistic collaborative representation based classification methods (LRC [9], SRC [10], CRC [11] and ProCRC [12]) also take advantage of the LSR framework to find representation coefficients.
However, there are still many issues associated with the above LSR-based methods. First, taking the zero-one label matrix as the regression target is too strict. It is not ideal for classification, as the least squares loss between the extracted features and the binary targets cannot reflect the classification performance of a regression model, especially in the multi-class setting. For instance, the Euclidean distance between any two inter-class regression targets is a constant, i.e., $\sqrt{2}$, and for each sample, the difference between the targets of the true and the false classes always equals 1. These characteristics are contrary to the expectation that the transformed inter-class features should be as far from each other as possible. To address this problem, some representative algorithms, i.e., discriminative LSR (DLSR) [13], retargeted LSR (ReLSR) [14], and groupwise ReLSR (GReLSR) [15], were proposed to learn relaxed regression targets instead of the original binary targets. Concretely, DLSR utilizes a dragging technique to encourage the inter-class regression targets to move in opposite directions, thus enlarging the distances between different classes. Different from DLSR, ReLSR learns the regression targets from the original data rather than directly adopting the zero-one labels of the samples, forcing the margins between classes to be greater than 1. Lately, Wang and Pan [15] proved that DLSR is a special case of ReLSR, with the translation values set to zero, and proposed a new formulation of ReLSR. With this new formulation, GReLSR introduces a groupwise constraint to guarantee that intra-class samples have similar translation values.
Besides, the traditional LSR-based methods do not take the data correlation into account during the projection learning procedure, which may cause the loss of useful structural information and lead to overfitting. To explore the underlying relationships, Fang et al. [16] proposed regularized label relaxation linear regression (RLRLR), which constructs a class-compactness graph to ensure that the projected intra-class features are compact, so that the overfitting problem can be mitigated to some degree. Wen et al. [17] proposed a novel framework called inter-class sparsity based DLSR (ICS_DLSR), which introduces an inter-class sparsity constraint on the DLSR model to make the projected features of each class retain a sparse structure. In fact, both the RLRLR and ICS_DLSR algorithms are based on the DLSR model, that is, they adopt the dragging technique. In addition to learning slack regression targets, RLSL [19] jointly learns a latent feature subspace and the classification model so that the extracted data representation is more discriminative and compact for classification. The learned latent subspace can be regarded as a transition between the original samples and the binary labels.
The various measures adopted by the algorithms mentioned above improve the classification performance. However, the dragging technique or the margin constraint used to relax the label matrix also amplifies the differences among the intra-class regression targets, which may deteriorate the classification performance. In this paper, a novel relaxed-target based regression model named low-rank discriminative least squares regression (LRDLSR) is proposed to learn a more discriminative projection. Based on the model of DLSR, LRDLSR imposes a class-wise low-rank constraint on the relaxed regression targets to ensure that the intra-class targets are compact and similar. In this way, the dragging technique is exploited to better effect, so that both the intra-class similarity and the inter-class separability of the regression targets can be guaranteed. Moreover, LRDLSR minimizes the energy of the resulting dynamic regression targets to avoid the problem of overfitting.
The rest of this paper is organized as follows. First, the related works are briefly introduced in Section II. The proposed LRDLSR model and the corresponding optimization procedure are described in Section III. The properties of the algorithm are analysed in Section IV. The experimental results are presented in Section V, and Section VI concludes this paper.
II Related Works
In this section, we briefly review the related works. Let $X = [x_1, x_2, \dots, x_n] \in \mathbb{R}^{d \times n}$ denote the training samples drawn from $c$ classes, where $d$ is the dimensionality of the samples and $n$ is their number. $X_i \in \mathbb{R}^{d \times n_i}$ denotes the subset of the samples belonging to the $i$-th class. $Y = [y_1, y_2, \dots, y_n] \in \mathbb{R}^{c \times n}$ denotes the binary label matrix of $X$, where the $i$-th column of $Y$, i.e., $y_i$, corresponds to the training sample $x_i$. If sample $x_i$ belongs to the $j$-th class, then the $j$-th element of $y_i$ is 1 and all the others are 0.
II-A Original LSR
The main idea of LSR is to learn a projection matrix that maps the original training samples into the binary label space. The objective function of LSR can be formulated as

$\min_{W} \; \|Y - W X\|_F^2 + \lambda \|W\|_F^2,$  (1)

where $\|\cdot\|_F$ is the matrix Frobenius norm and $\lambda$ is a positive regularization parameter. $W \in \mathbb{R}^{c \times d}$ is the projection matrix. The first term in problem (1) is a least squares loss function, while the second term is used to avoid the problem of overfitting. Obviously, (1) has a closed-form solution

$W = Y X^{T} (X X^{T} + \lambda I)^{-1}.$  (2)

Given a new sample $x$, LSR calculates its label as $\arg\max_{j} (W x)_j$, where $(W x)_j$ is the $j$-th value of $W x$.
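To make the closed-form solution concrete, here is a minimal NumPy sketch of LSR training and prediction (the function names and the toy data are our own illustration, not part of the original method):

```python
import numpy as np

def lsr_fit(X, Y, lam=0.01):
    """Closed-form solution W = Y X^T (X X^T + lam*I)^{-1}.
    X: d x n training matrix, Y: c x n zero-one label matrix."""
    d = X.shape[0]
    return Y @ X.T @ np.linalg.inv(X @ X.T + lam * np.eye(d))

def lsr_predict(W, x):
    """Assign the class whose regression output (W x)_j is largest."""
    return int(np.argmax(W @ x))

# Toy example: two well-separated 2-D classes.
X = np.array([[1.0, 0.9, 0.0, 0.1],
              [0.0, 0.1, 1.0, 0.9]])
Y = np.array([[1.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0]])
W = lsr_fit(X, Y)
```

A query close to the first class direction is then assigned label 0 by `lsr_predict`.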
II-B DLSR and ReLSR
As discussed above, forcing the regression features to pursue strict zero-one outputs is inappropriate for classification tasks. Unlike the original LSR, DLSR [13] and ReLSR [14] aim at learning relaxed regression targets rather than using the binary labels as their targets. The main idea of DLSR is to enlarge the distance between the true and the false classes by using a dragging technique. Its regression model can be formulated as

$\min_{W, M} \; \|W X - (Y + B \odot M)\|_F^2 + \lambda \|W\|_F^2, \quad \text{s.t. } M \geq 0,$  (3)

where $\odot$ denotes the Hadamard-product operator, $M \in \mathbb{R}^{c \times n}$ is a nonnegative dragging (label relaxation) matrix, and $B \in \mathbb{R}^{c \times n}$ is a constant matrix defined as

$B_{ji} = \begin{cases} +1, & \text{if } Y_{ji} = 1 \\ -1, & \text{otherwise.} \end{cases}$  (4)
Compared to the original LSR, the regression targets in DLSR are extended to $Y + B \odot M$. To aid understanding, we use four samples to explain why the new relaxed target matrix is more discriminative than $Y$. Let $x_1$, $x_2$, $x_3$, $x_4$ be four training samples in which the first two samples are from the first class and the latter two are from the second class. Thus their binary label matrix is defined as

$Y = \begin{bmatrix} 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \end{bmatrix}.$  (5)

It is obvious that the distance between any two inter-class targets is $\sqrt{2}$. Such a fixed distance cannot reflect the classification ability of a regression model well. But if we use $T = Y + B \odot M$ to replace $Y$, then we have

$T = \begin{bmatrix} 1 + m_{11} & 1 + m_{12} & -m_{13} & -m_{14} \\ -m_{21} & -m_{22} & 1 + m_{23} & 1 + m_{24} \end{bmatrix},$  (6)

where $m_{ji} \geq 0$. In doing so, the distance between the first and the fourth target becomes $\sqrt{(1 + m_{11} + m_{14})^2 + (1 + m_{21} + m_{24})^2} \geq \sqrt{2}$ rather than a constant. The margin between the two classes is also enlarged by changing the regression outputs in opposite directions. For example, the class margin of the first regression target is $(1 + m_{11}) - (-m_{21}) = 1 + m_{11} + m_{21} \geq 1$. This meets the expectation that inter-class samples should be as far as possible from each other after being projected.
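The effect of the dragging technique on this four-sample example can be checked numerically; the following NumPy snippet (with randomly drawn nonnegative dragging values, purely for illustration) verifies that the relaxed inter-class targets are at least $\sqrt{2}$ apart:

```python
import numpy as np

rng = np.random.default_rng(0)
Y = np.array([[1.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0]])
B = np.where(Y == 1, 1.0, -1.0)   # +1 on the true class, -1 elsewhere
M = rng.random(Y.shape)           # nonnegative dragging values (illustrative)
T = Y + B * M                     # relaxed targets

d_binary = np.linalg.norm(Y[:, 0] - Y[:, 3])   # always sqrt(2)
d_relaxed = np.linalg.norm(T[:, 0] - T[:, 3])  # at least sqrt(2)
```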
Likewise, ReLSR directly learns relaxed regression targets from the original data to ensure that samples are correctly classified with large margins. The ReLSR model is defined as

$\min_{W, T} \; \|T - W X\|_F^2 + \lambda \|W\|_F^2, \quad \text{s.t. } T_{l_i, i} - \max_{j \neq l_i} T_{j, i} \geq 1, \; i = 1, \dots, n,$  (7)

where $l_i$ indicates the true label of sample $x_i$. $T$ is optimized from the data with a large-margin constraint which enhances the class separability. Hence ReLSR offers greater flexibility than DLSR.
III From DLSR and ReLSR to LRDLSR
III-A Problem Formulation and New Regression Model
Although DLSR and ReLSR can learn soft targets and maintain a closed-form solution for the projection, an undue focus on large margins can also result in overfitting. As indicated before, exploiting the data correlations is helpful in learning a discriminative data representation. From the classification point of view, both the intra-class similarity and the inter-class incoherence of the regression targets should be promoted. However, DLSR and ReLSR ignore the former, because their relaxation values are dynamic. Hence the dragging technique in DLSR and the margin constraint in ReLSR also encourage the intra-class regression targets to become scattered. If the intra-class similarity of the learned targets is weakened, the discriminative power will be compromised. Therefore, based on the model of DLSR, we propose a low-rank discriminative least squares regression (LRDLSR) model as follows
$\min_{W, T, M} \; \|W X - T\|_F^2 + \lambda_1 \|T - (Y + B \odot M)\|_F^2 + \lambda_2 \sum_{i=1}^{c} \|T_i\|_* + \lambda_3 \|T\|_F^2 + \lambda_4 \|W\|_F^2, \quad \text{s.t. } M \geq 0,$  (8)

where $\lambda_1$, $\lambda_2$, $\lambda_3$ and $\lambda_4$ are the regularization parameters and $\|\cdot\|_*$ denotes the nuclear norm (the sum of the singular values). $W$ and $T$ denote the projection matrix and the slack target matrix, respectively, and $T_i$ denotes the targets of the $i$-th class. The second term is used to learn relaxed regression targets with large inter-class margins, the third term is used to learn similar intra-class regression targets, and the fourth term is used to avoid the overfitting problem of $T$.
With our formulation, the major difference between LRDLSR and DLSR is that LRDLSR encourages the relaxed regression targets of each class to be low-rank, so that the compactness and similarity of the regression targets within each class are enhanced. Combined with the dragging technique, both the intra-class similarity and the inter-class separability of the regression targets are preserved, thus producing a discriminative projection. In fact, the proposed class-wise low-rank constraint can also be incorporated into the ReLSR and GReLSR models, or other relaxed-target learning based LSR models. In addition to the above difference, we also add a simple norm constraint on $T$, i.e., $\|T\|_F^2$, to restrict the energy of the targets. This is because DLSR places no restriction on the variation magnitude of the dynamically updated regression targets; without it, the slack target matrix may fluctuate widely and become scattered because of aggressively pursuing the largest class margins, thus leading to the problem of overfitting.
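For reference, the objective value of model (8) can be evaluated as follows; this is only a sketch under the notation assumed in this paper ($W$: projection, $T$: slack targets, $M$: dragging matrix, `labels`: class index per column), with illustrative function and argument names:

```python
import numpy as np

def lrdlsr_objective(W, T, M, X, Y, B, labels, l1, l2, l3, l4):
    """Objective of model (8). labels[i] is the class of column i of X."""
    R = Y + B * M                                  # relaxed targets Y + B (.) M
    obj = np.linalg.norm(W @ X - T) ** 2           # regression loss
    obj += l1 * np.linalg.norm(T - R) ** 2         # slack-target fitting term
    for i in np.unique(labels):                    # class-wise nuclear norms
        obj += l2 * np.linalg.norm(T[:, labels == i], 'nuc')
    obj += l3 * np.linalg.norm(T) ** 2             # energy of the targets
    obj += l4 * np.linalg.norm(W) ** 2             # ridge penalty on W
    return obj
```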
III-B Optimization of LRDLSR
It is difficult to solve the optimization problem in (8) directly because the three variables $W$, $T$ and $M$ are coupled. Therefore, an iterative update rule is devised so that each subproblem has a closed-form solution. In this paper, the alternating direction method of multipliers (ADMM) [20] is exploited to optimize LRDLSR. In order to make (8) separable, we first introduce an auxiliary variable $P$ as follows

$\min_{W, T, P, M} \; \|W X - T\|_F^2 + \lambda_1 \|T - (Y + B \odot M)\|_F^2 + \lambda_2 \sum_{i=1}^{c} \|P_i\|_* + \lambda_3 \|T\|_F^2 + \lambda_4 \|W\|_F^2, \quad \text{s.t. } T = P, \; M \geq 0.$  (9)

Then we obtain the augmented Lagrangian function of (9):

$\mathcal{L} = \|W X - T\|_F^2 + \lambda_1 \|T - (Y + B \odot M)\|_F^2 + \lambda_2 \sum_{i=1}^{c} \|P_i\|_* + \lambda_3 \|T\|_F^2 + \lambda_4 \|W\|_F^2 + \langle C, T - P \rangle + \frac{\mu}{2} \|T - P\|_F^2,$  (10)

where $C$ is the Lagrangian multiplier and $\mu > 0$ is the penalty parameter. Next we update the variables one by one.
Update $T$: By fixing the variables $W$, $P$ and $M$, $T$ can be obtained by minimizing the following problem

$\min_{T} \; \|W X - T\|_F^2 + \lambda_1 \|T - (Y + B \odot M)\|_F^2 + \lambda_3 \|T\|_F^2 + \langle C, T - P \rangle + \frac{\mu}{2} \|T - P\|_F^2.$  (11)

Obviously, (11) has a closed-form solution

$T = \dfrac{2 W X + 2 \lambda_1 (Y + B \odot M) + \mu P - C}{2 + 2 \lambda_1 + 2 \lambda_3 + \mu}.$  (12)
Update $P$: Given $T$, $C$ and $\mu$, $P$ can be class-wisely updated by

$\min_{P} \; \lambda_2 \sum_{i=1}^{c} \|P_i\|_* + \frac{\mu}{2} \left\| P - \left( T + \frac{C}{\mu} \right) \right\|_F^2.$  (13)

We can use the singular value thresholding algorithm [21] to class-wisely optimize (13). The optimal solution of each $P_i$ is

$P_i = D_{\lambda_2 / \mu} \left( T_i + \frac{C_i}{\mu} \right),$  (14)

where $D_{\tau}(\cdot)$ is the singular value shrinkage operator, defined as follows.

(I) Given a matrix $A$, its singular value decomposition can be formulated as

$A = U \Sigma V^{T}, \quad \Sigma = \mathrm{diag}(\sigma_1, \dots, \sigma_r),$  (15)

where $r$ is the rank of $A$, and $U$ and $V$ are column-orthogonal matrices.

(II) Given a threshold $\tau$,

$D_{\tau}(A) = U \, \mathrm{diag}\big( \max(\sigma_1 - \tau, 0), \dots, \max(\sigma_r - \tau, 0) \big) V^{T}.$  (16)
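The shrinkage operator itself is a few lines of NumPy; the helper name `svt` is our own:

```python
import numpy as np

def svt(A, tau):
    """Singular value shrinkage operator D_tau:
    shrink each singular value by tau and clip at zero."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

# Example: singular values 3, 1, 0.2 shrink to 2.5, 0.5, 0.
A = np.diag([3.0, 1.0, 0.2])
A_shrunk = svt(A, 0.5)
```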
Update $W$: Analogously, $W$ can be solved by minimizing

$\min_{W} \; \|W X - T\|_F^2 + \lambda_4 \|W\|_F^2.$  (17)

We set the derivative of (17) with respect to $W$ to zero, and obtain the following closed-form solution

$W = T X^{T} (X X^{T} + \lambda_4 I)^{-1}.$  (18)

Let $F = X^{T} (X X^{T} + \lambda_4 I)^{-1}$. We find that $F$ is independent of the other variables, and thus can be precalculated before starting the iteration.
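Since the inverse factor depends only on the training data and the regularization parameter, it can be computed once outside the loop, as in this sketch (function names are illustrative):

```python
import numpy as np

def precompute_factor(X, l4):
    """F = X^T (X X^T + l4*I)^{-1}: depends only on the data, so compute once."""
    d = X.shape[0]
    return X.T @ np.linalg.inv(X @ X.T + l4 * np.eye(d))

def update_w(T, F):
    """Closed-form W-update: W = T X^T (X X^T + l4*I)^{-1} = T F."""
    return T @ F
```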
Update $M$: After optimizing $W$, $T$ and $P$, we can update the nonnegative relaxation matrix $M$ by

$\min_{M \geq 0} \; \|T - (Y + B \odot M)\|_F^2.$  (19)

Let $R = T - Y$. According to [13], the optimal solution of (19) can be calculated by

$M = \max(B \odot R, 0).$  (20)
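Because the problem decouples entry-wise, the M-update reduces to a one-line NumPy operation (sketch, assuming the entrywise $\pm 1$ matrix $B$ defined earlier):

```python
import numpy as np

def update_m(T, Y, B):
    """Optimal nonnegative dragging matrix: M = max(B (.) (T - Y), 0),
    solved entry-wise since B has +/-1 entries."""
    return np.maximum(B * (T - Y), 0.0)
```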
The optimization procedure of LRDLSR is overviewed in Algorithm 1.
Algorithm 1. Optimizing LRDLSR by ADMM
Input: Normalized training samples $X$ and their label matrix $Y$; parameters $\lambda_1$, $\lambda_2$, $\lambda_3$, $\lambda_4$.
Initialization: $T = Y$, $P = Y$, $M = 0$, $C = 0$; penalty parameters $\mu$, $\rho$, $\mu_{\max}$ and tolerance $\varepsilon$ set empirically.
While not converged do:
1. Update $T$ by using Eq. (12).
2. Update $P$ by using Eq. (14).
3. Update $W$ by using Eq. (18).
4. Update $M$ by using Eq. (20).
5. Update the Lagrange multiplier as $C = C + \mu (T - P)$.
6. Update the penalty parameter as $\mu = \min(\rho \mu, \mu_{\max})$.
7. Check convergence: $\|T - P\|_{\infty} < \varepsilon$.
End While
Output: $W$ and $T$.
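Putting the updates together, a compact and simplified NumPy sketch of Algorithm 1 might look as follows; the initializations and the ADMM constants (`mu`, `rho`, `mu_max`, `tol`) are our own assumptions, not values prescribed by the paper:

```python
import numpy as np

def svt(A, tau):
    """Singular value shrinkage used by the class-wise P-update."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def lrdlsr(X, Y, labels, l1=0.1, l2=0.1, l3=0.01, l4=0.01,
           mu=0.1, rho=1.1, mu_max=1e5, tol=1e-5, max_iter=100):
    """Simplified ADMM loop for LRDLSR (constants are assumptions)."""
    d, n = X.shape
    c = Y.shape[0]
    B = np.where(Y == 1, 1.0, -1.0)
    T, P = Y.astype(float).copy(), Y.astype(float).copy()
    M, C = np.zeros((c, n)), np.zeros((c, n))
    F = X.T @ np.linalg.inv(X @ X.T + l4 * np.eye(d))  # precomputed factor
    W = T @ F
    for _ in range(max_iter):
        R = Y + B * M
        # T-update: closed form of the quadratic subproblem
        T = (2 * W @ X + 2 * l1 * R + mu * P - C) / (2 + 2 * l1 + 2 * l3 + mu)
        # P-update: class-wise singular value shrinkage
        for i in np.unique(labels):
            idx = labels == i
            P[:, idx] = svt(T[:, idx] + C[:, idx] / mu, l2 / mu)
        # W-update (closed form) and M-update (entry-wise clipping)
        W = T @ F
        M = np.maximum(B * (T - Y), 0.0)
        # multiplier and penalty updates, then convergence check
        C = C + mu * (T - P)
        mu = min(rho * mu, mu_max)
        if np.max(np.abs(T - P)) < tol:
            break
    return W, T
```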
III-C Classification
Once (8) is solved, we can obtain the optimal projection matrix $W$. Then, we use $W$ to obtain the projected features of the training samples, i.e., $W X$. Suppose $x_{test}$ is a test sample; then its projected feature is $W x_{test}$. For simplicity, we use the nearest-neighbor (NN) classifier on the projected features to perform classification.
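The classification stage can be sketched as a 1-NN search in the projected space (illustrative function name):

```python
import numpy as np

def nn_classify(W, X_train, train_labels, x_test):
    """1-NN in the projected space: compare W @ x_test
    against the columns of W @ X_train."""
    feats = W @ X_train
    f = W @ x_test
    dists = np.linalg.norm(feats - f[:, None], axis=0)
    return train_labels[int(np.argmin(dists))]
```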
IV Analysis of the Proposed Method
IV-A Computational Complexity
In this section, we analyze the computational complexity of Algorithm 1. Its main time-consuming steps are:
(1) the singular value decompositions in Eq. (13);
(2) the matrix inverse in Eq. (18).
Since the remaining steps only consist of simple matrix addition, subtraction and multiplication operations, and element-wise multiplication, similar to [17][18] we ignore their time complexity. The complexity of the singular value decompositions in Eq. (13) is $O\big(\sum_{i=1}^{c} c\, n_i \min(c, n_i)\big)$ per iteration, where $n_i$ is the number of samples of the $i$-th class. The complexity of precomputing the matrix inverse in Eq. (18) is $O(d^3 + d^2 n)$. Thus the overall time complexity of Algorithm 1 is about $O\big(d^3 + d^2 n + t \sum_{i=1}^{c} c\, n_i \min(c, n_i)\big)$, where $t$ is the number of iterations.
IV-B Convergence Validation
In this section, we experimentally validate the convergence of the ADMM optimization algorithm. Fig. 1 gives empirical evidence that Algorithm 1 converges very well: the value of the objective function decreases monotonically with the number of iterations on four different databases. This indicates the effectiveness of the optimization method. However, it is still difficult to prove strong convergence theoretically, because problem (8) has four different blocks and the overall LRDLSR model is non-convex.
V Experiments
In order to verify the effectiveness of the proposed LRDLSR model, we compare it with five state-of-the-art LSR-based classification methods, including DLSR [13], ReLSR [14], GReLSR [15], RLRLR [16] and RLSL [19], and three representation based classification methods, including LRC [9], CRC [11] and ProCRC [12], on five real image datasets. For LRDLSR, DLSR, ReLSR, GReLSR, RLRLR and RLSL, we use the NN classifier. For RLSL, the dimensionality of the latent subspace is set according to the number of classes $c$, following [19]. When testing the performance of CRC, LRC and ProCRC, all the training samples are used as the dictionary. To make fair comparisons, we directly use the released code of the compared methods and tune their parameters as carefully as possible. All the experiments are repeated ten times with random splits of training and test samples, and the average results and standard deviations (mean±std) are reported. The image datasets used in our experiments can be divided into two types:
(1) Face: the AR [22], the CMU PIE [23], the Extended Yale B [24] and the Labeled Faces in the Wild (LFW) [25] datasets;
(2) Object: the COIL20 [26] dataset.
TABLE I. Classification accuracies (mean±std, %) on the COIL20 dataset.
Train No.  LRC  CRC  ProCRC  DLSR  ReLSR  GReLSR  RLRLR  RLSL  LRDLSR(ours)
10  92.30±1.15  89.09±1.48  90.61±0.95  93.27±1.43  93.65±1.94  90.98±1.62  92.61±1.04  94.80±1.16  95.12±1.22
15  94.89±1.33  92.58±1.27  94.53±0.85  96.25±0.75  96.75±0.72  93.60±0.83  94.86±0.85  96.09±0.90  97.78±0.86
20  97.49±0.51  94.15±1.15  96.17±0.82  97.52±0.67  98.17±0.67  95.65±0.82  96.27±0.33  97.45±0.29  98.51±0.85
25  98.32±0.60  94.99±1.24  97.53±0.68  98.67±0.53  98.90±0.85  96.30±0.84  96.76±1.06  97.66±0.91  99.24±0.59
TABLE II. Classification accuracies (mean±std, %) on the Extended Yale B dataset.
Train No.  LRC  CRC  ProCRC  DLSR  ReLSR  GReLSR  RLRLR  RLSL  LRDLSR(ours)
10  82.18±0.92  91.85±0.61  91.74±0.86  87.95±1.10  89.68±0.94  88.46±1.00  90.21±0.84  89.02±0.88  91.18±0.65
15  89.43±0.58  94.76±0.66  95.41±0.76  93.37±0.99  93.98±0.52  93.13±0.82  94.80±0.64  93.29±0.73  95.07±0.66
20  92.00±0.77  96.39±0.56  96.74±0.26  95.73±0.68  96.14±0.54  95.25±0.50  96.37±0.71  95.18±0.62  96.84±0.36
25  93.73±0.79  97.69±0.40  97.58±0.37  97.34±0.55  97.75±0.64  97.06±0.37  97.34±0.50  96.69±0.63  98.16±0.46
TABLE III. Classification accuracies (mean±std, %) on the AR dataset.
Train No.  LRC  CRC  ProCRC  DLSR  ReLSR  GReLSR  RLRLR  RLSL  LRDLSR(ours)
3  28.73±0.99  71.42±0.59  76.16±1.12  73.58±1.63  73.53±1.47  74.77±1.45  76.39±1.56  75.70±1.01  78.80±0.76
4  37.21±1.13  78.50±0.67  83.58±0.82  80.47±1.36  81.46±0.79  82.54±1.24  83.55±1.35  83.02±0.79  86.20±0.45
5  44.69±1.22  83.54±0.67  87.33±0.74  85.33±0.93  86.43±0.94  87.35±1.21  86.68±0.54  86.37±0.40  90.16±0.75
6  52.95±1.54  86.79±0.71  90.32±0.66  88.18±0.78  88.98±0.99  89.96±0.73  89.41±0.89  88.80±0.48  92.23±0.80
TABLE IV. Classification accuracies (mean±std, %) on the CMU PIE dataset.
Train No.  LRC  CRC  ProCRC  DLSR  ReLSR  GReLSR  RLRLR  RLSL  LRDLSR(ours)
10  75.67±1.01  86.39±0.60  89.00±0.37  87.54±0.79  88.18±0.79  86.88±0.72  91.15±0.58  87.70±0.63  91.57±0.48
15  85.26±0.63  91.14±0.43  92.18±0.25  92.22±0.54  92.29±0.42  91.21±0.51  93.52±0.32  91.38±0.43  94.45±0.51
20  89.84±0.48  93.08±0.35  93.94±0.18  94.12±0.27  94.23±0.21  93.39±0.27  94.78±0.30  93.03±0.38  95.83±0.35
25  92.55±0.39  94.12±0.30  94.58±0.21  95.25±0.20  95.53±0.16  94.32±0.31  95.40±0.18  94.04±0.27  96.59±0.21
TABLE V. Classification accuracies (mean±std, %) on the LFW dataset.
Train No.  LRC  CRC  ProCRC  DLSR  ReLSR  GReLSR  RLRLR  RLSL  LRDLSR(ours)
5  29.99±2.21  31.67±1.16  33.19±0.99  30.43±1.38  31.43±1.13  36.76±1.37  36.21±1.60  36.10±1.82  37.20±1.66
6  32.37±1.36  34.27±1.04  35.90±0.93  32.35±1.62  34.46±1.51  39.22±0.92  39.37±1.65  38.48±1.59  39.99±1.22
7  35.53±1.69  35.96±1.40  36.87±1.55  34.67±2.45  37.50±2.61  43.02±2.19  42.03±1.42  41.43±1.58  43.82±1.23
8  36.98±1.82  37.92±1.50  38.24±1.15  36.27±1.65  38.72±1.22  44.39±1.77  43.30±1.59  42.18±1.37  44.88±1.58
V-A Experiments for Object Classification
In this section, we validate the performance of our LRDLSR model on the COIL20 object dataset, which has 1440 images of 20 classes. Each class consists of 72 images collected at pose intervals of 5 degrees. Some images from this database are shown in Fig. 2. In our experiments, all images are resized in advance. For each class, we randomly choose 10, 15, 20, and 25 samples to train the model and treat all the remaining images as the test set. The average classification accuracies are reported in Table I. As shown in Table I, our LRDLSR algorithm achieves much better classification results than all the other methods used in the comparison, which proves the effectiveness of LRDLSR for object classification tasks.
V-B Experiments for Face Classification
In this section, we evaluate the classification performance of LRDLSR on four real face datasets.
(1) The Extended Yale B Dataset: The Extended Yale B database consists of 2414 face images of 38 individuals, each having about 59–64 images. All images are resized to 32×32 pixels in advance. We randomly select 10, 15, 20, and 25 images of each individual as training samples, and use the remaining images as test samples.
(2) The AR Dataset: We select a subset which consists of 2600 images of 50 women and 50 men, and use the projected 540-dimensional features provided in [27]. For each individual, we randomly select 3, 4, 5, and 6 images as training samples, and the remaining images are used as test samples.
(3) The CMU PIE Dataset: We select a subset of this dataset in which each individual has 170 images collected under five different poses (C05, C07, C09, C27 and C29). All images are resized to 32×32 pixels. We randomly select 10, 15, 20, and 25 images of each individual as training samples, and treat the remaining images as test samples.
(4) The LFW Dataset: Similar to [17], we use a subset of this dataset which consists of 1251 images of 86 individuals, each having 11–20 images. In our experiments, all images are resized to 32×32. We randomly select 5, 6, 7, and 8 images from each individual as training samples and use the remaining images for testing. Some images from the above four face databases are shown in Fig. 3.
TABLE VI. Classification accuracies (mean±std, %) of LRDLSR with and without the class-wise low-rank constraint.
Database  AR (6)  EYB (15)  CMU PIE (15)  LFW (8)  COIL20 (15)
LRDLSR  92.21±0.54  94.67±0.87  94.64±0.28  45.56±0.63  97.78±0.85
LRDLSR ($\lambda_2 = 0$)  86.77±1.25  93.39±0.53  90.56±0.32  36.50±1.60  94.48±0.99
The average classification rates on these four face datasets are reported in Tables II–V, respectively. It can be observed that our LRDLSR outperforms all the other algorithms on the four face datasets. The main reason is that LRDLSR can simultaneously guarantee the intra-class compactness and the inter-class irrelevance of the slack regression targets, so that more discriminative information is preserved during projection learning. It is worth noting that the standard deviations of the accuracies of LRDLSR are also competitive, which demonstrates the robustness of LRDLSR. Besides, we find that the performance gain of LRDLSR is significant when the number of training samples per subject is small, which indicates that our model is applicable to small-sample-size problems. Fig. 4 shows the t-SNE [28] visualization of the features on the Extended Yale B dataset extracted by DLSR, ReLSR and LRDLSR, respectively. We randomly select 5 samples for each individual for this visualization. It is obvious that the features extracted by the LRDLSR model present ideal inter-class separability and intra-class compactness, which is favorable to classification.
In order to verify whether the low-rank constraint is useful, we set the parameter $\lambda_2 = 0$ and then test the classification performance. We randomly select 6, 15, 15, 8, and 15 samples per class as the training samples from the AR, Extended Yale B, CMU PIE, LFW and COIL20 databases, respectively, and the remaining samples are treated as test samples. We repeat all experiments ten times and report the average results. The comparative results are shown in Table VI. It is apparent that for $\lambda_2 = 0$ the classification performance is degraded. Especially on the LFW database, the difference is more than 9%, which indicates that pursuing low-rank intra-class regression targets is indeed helpful for classification.
V-C Classification Using Deep Features
In this section, we conduct experiments on the COIL20, CMU PIE and LFW databases to further verify whether our model is also effective for deep features. In our experiments, two deep networks, VGG16 [29] and ResNet50 [30], are used. After obtaining the deep features of the original samples, since their dimensionality is very high, we first perform dimensionality reduction using PCA so that 98% of the energy of the features is preserved. For the CMU PIE and COIL20 databases, we randomly select 10 samples of each class for training and use all the remaining samples for testing. For the LFW database, we randomly select 5 samples of each class for training. Similarly, we repeat all the experiments ten times and report the mean accuracy and standard deviation (mean±std) of the different algorithms. The experimental results are shown in Table VII. We see that both VGG and ResNet features achieve better classification accuracy than the original features; on the LFW database the improvement is nearly 20%. Our LRDLSR model with deep features is consistently superior to the other algorithms, which shows that LRDLSR is also appropriate for deep features.
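The PCA step that keeps 98% of the feature energy can be sketched as follows (our own helper, operating on a d×n feature matrix):

```python
import numpy as np

def pca_reduce(F, energy=0.98):
    """Project a d x n feature matrix onto the fewest principal
    components that retain `energy` of the total variance."""
    Fc = F - F.mean(axis=1, keepdims=True)          # center the features
    U, s, _ = np.linalg.svd(Fc, full_matrices=False)
    ratio = np.cumsum(s ** 2) / np.sum(s ** 2)      # cumulative energy
    k = int(np.searchsorted(ratio, energy)) + 1     # smallest k reaching it
    return U[:, :k].T @ Fc                          # k x n reduced features
```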
TABLE VII. Classification accuracies (mean±std, %) using the original and deep features.
Database  COIL20 (10)  CMU PIE (10)  LFW (5)
LRDLSR (ours)  95.12±1.22  91.57±0.48  37.20±1.66
VGG+LRDLSR (ours)  98.65±1.09  91.74±0.47  55.48±1.55
ResNet+LRDLSR (ours)  98.59±0.63  92.98±0.54  56.19±1.06
RLSL  94.80±1.16  87.70±0.63  36.10±1.82
VGG+RLSL  97.44±0.72  89.05±0.46  53.89±2.12
ResNet+RLSL  97.61±1.31  89.69±0.48  54.10±1.22
RLRLR  92.61±1.04  91.15±0.58  36.21±1.60
VGG+RLRLR  98.60±0.58  91.55±0.45  55.15±1.47
ResNet+RLRLR  98.40±0.67  93.66±0.40  55.43±1.55
GReLSR  90.98±1.62  86.88±0.72  36.76±1.37
VGG+GReLSR  97.79±0.86  87.04±0.63  52.18±1.57
ResNet+GReLSR  97.73±0.67  89.87±0.37  52.85±1.53
ReLSR  93.65±1.94  88.18±0.79  31.43±1.13
VGG+ReLSR  96.90±1.06  88.77±0.41  51.88±1.42
ResNet+ReLSR  96.92±0.89  89.84±0.53  52.91±1.75
DLSR  93.27±1.43  87.54±0.79  30.43±1.38
VGG+DLSR  96.84±1.43  87.47±0.82  49.84±1.95
ResNet+DLSR  96.70±1.65  89.66±0.63  52.07±1.91
V-D Parameter Sensitivity Validation
Selecting optimal parameters for different datasets is still an open problem. In this section, we conduct a sensitivity analysis of the parameters of our LRDLSR model. Note that there are four parameters, i.e., $\lambda_1$, $\lambda_2$, $\lambda_3$ and $\lambda_4$, to be selected in LRDLSR. Among them, $\lambda_1$ and $\lambda_2$ respectively balance the weight of the slack target learning term and the class-wise low-rank target learning term, while $\lambda_3$ and $\lambda_4$ are respectively used to avoid overfitting of the learned targets $T$ and of the projection matrix $W$. For convenience, we set $\lambda_3$ and $\lambda_4$ to 0.01 in advance and focus on selecting the optimal values of $\lambda_1$ and $\lambda_2$ from the candidate set {0.0001, 0.001, 0.01, 0.1, 1} by cross-validation. The classification accuracy as a function of the parameter values on three datasets is shown in Fig. 5. It is apparent that the optimal parameters differ across datasets, but our LRDLSR model is not very sensitive to the values of $\lambda_1$ and $\lambda_2$. This also demonstrates that compact and similar intra-class targets are critical to discriminative projection learning, but the classification performance does not completely depend on the choice of these parameters.
VI Conclusion
In this paper, we proposed a low-rank discriminative least squares regression (LRDLSR) model for multi-class image classification. LRDLSR aims at improving the intra-class similarity of the regression targets learned by the dragging technique. This ensures that the learned targets are not only relaxed but also discriminative, leading to a more effective projection. Besides, LRDLSR introduces an extra regularization term that avoids overfitting by restricting the energy of the learned regression targets. Experimental results on object and face databases demonstrate the effectiveness of the proposed method.
Acknowledgment
This work is supported by the National Natural Science Foundation of China under Grant Nos. 61672265, U1836218, the 111 Project of Ministry of Education of China under Grant No. B12018, and UK EPSRC Grant EP/N007743/1, MURI/EPSRC/DSTL GRANT EP/R018456/1.
References
 [1] D. Ruppert and M. P. Wand, "Multivariate locally weighted least squares regression," Ann. Statist., vol. 22, no. 3, pp. 1346–1370, 1994.
 [2] D. Ruppert, S. J. Sheather, and M. P. Wand, "An effective bandwidth selector for local least squares regression," J. Amer. Statist. Assoc., vol. 90, no. 432, pp. 1257–1270, 1995.
 [3] R. Tibshirani, "Regression shrinkage and selection via the lasso," J. Roy. Statist. Soc. B (Methodol.), vol. 58, no. 1, pp. 267–288, 1996.
 [4] S. An, W. Liu, and S. Venkatesh, "Face recognition using kernel ridge regression," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Minneapolis, MN, USA, pp. 1–8, Jun. 2007.
 [5] J. Gao, D. Shi, and X. Liu, "Significant vector learning to construct sparse kernel regression models," Neural Netw., vol. 20, no. 7, pp. 791–798, 2007.
 [6] T. Strutz, Data Fitting and Uncertainty: A Practical Introduction to Weighted Least Squares and Beyond. Wiesbaden, Germany: Vieweg, 2010.
 [7] L. Jiao, L. Bo, and L. Wang, "Fast sparse approximation for least squares support vector machine," IEEE Trans. Neural Netw., vol. 18, no. 3, pp. 685–697, May 2007.
 [8] H. Abdi, "Partial least squares regression and projection on latent structure regression (PLS regression)," Wiley Interdiscip. Rev. Comput. Stat., vol. 2, no. 1, pp. 97–106, 2010.
 [9] I. Naseem, R. Togneri, and M. Bennamoun, "Linear regression for face recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 11, pp. 2106–2112, 2010.
 [10] J. Wright, A. Y. Yang, A. Ganesh, et al., "Robust face recognition via sparse representation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 2, pp. 210–227, 2009.
 [11] L. Zhang, M. Yang, and X. Feng, "Sparse representation or collaborative representation: Which helps face recognition?" in Proc. IEEE Int. Conf. Comput. Vis., pp. 471–478, 2011.
 [12] S. Cai, L. Zhang, W. Zuo, et al., "A probabilistic collaborative representation based approach for pattern classification," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 2950–2959, 2016.
 [13] S. M. Xiang, F. P. Nie, G. F. Meng, C. H. Pan, and C. S. Zhang, "Discriminative least squares regression for multiclass classification and feature selection," IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 11, pp. 1738–1754, Nov. 2012.
 [14] X.-Y. Zhang, L. Wang, S. Xiang, and C.-L. Liu, "Retargeted least squares regression algorithm," IEEE Trans. Neural Netw. Learn. Syst., vol. 26, no. 9, pp. 2206–2213, Sep. 2015.
 [15] L. Wang and C. Pan, "Groupwise retargeted least-squares regression," IEEE Trans. Neural Netw. Learn. Syst., vol. 29, no. 4, pp. 1352–1358, Apr. 2018.
 [16] X. Fang, Y. Xu, X. Li, Z. Lai, W. K. Wong, and B. Fan, "Regularized label relaxation linear regression," IEEE Trans. Neural Netw. Learn. Syst., vol. 29, no. 4, pp. 1006–1018, Apr. 2018.
 [17] J. Wen, Y. Xu, Z. Y. Li, Z. L. Ma, and Y. R. Xu, "Inter-class sparsity based discriminative least square regression," Neural Netw., vol. 102, pp. 36–47, 2018.
 [18] Z. Chen, X.-J. Wu, and J. Kittler, "A sparse regularized nuclear norm based matrix regression for face recognition with contiguous occlusion," Pattern Recognit. Lett., 2019.
 [19] X. Z. Fang, S. H. Teng, Z. H. Lai, et al., "Robust latent subspace learning for image classification," IEEE Trans. Neural Netw. Learn. Syst., vol. 29, no. 6, pp. 2502–2515, 2018.
 [20] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, "Distributed optimization and statistical learning via the alternating direction method of multipliers," Found. Trends Mach. Learn., vol. 3, no. 1, pp. 1–122, 2011.
 [21] J.-F. Cai, E. J. Candès, and Z. Shen, "A singular value thresholding algorithm for matrix completion," SIAM J. Optim., vol. 20, no. 4, pp. 1956–1982, 2010.
 [22] A. M. Martinez and R. Benavente, "The AR face database," CVC Tech. Rep. 24, 1998.
 [23] T. Sim, S. Baker, and M. Bsat, "The CMU pose, illumination, and expression (PIE) database," in Proc. IEEE Int. Conf. Autom. Face Gesture Recognit., pp. 46–51, 2002.
 [24] A. S. Georghiades, P. N. Belhumeur, and D. Kriegman, "From few to many: Illumination cone models for face recognition under variable lighting and pose," IEEE Trans. Pattern Anal. Mach. Intell., vol. 23, no. 6, pp. 643–660, Jun. 2001.
 [25] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller, "Labeled faces in the wild: A database for studying face recognition in unconstrained environments," Univ. Massachusetts, Amherst, MA, USA, Tech. Rep. 07-49, Oct. 2007.
 [26] S. A. Nene, S. K. Nayar, and H. Murase, "Columbia Object Image Library (COIL-100)," Tech. Rep. CUCS-006-96, 1996.
 [27] Z. Jiang, Z. Lin, and L. S. Davis, "Label consistent K-SVD: Learning a discriminative dictionary for recognition," IEEE Trans. Pattern Anal. Mach. Intell., pp. 2651–2664, 2013.
 [28] Y.-L. Boureau, F. Bach, Y. LeCun, and J. Ponce, "Learning mid-level features for recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 2559–2566, 2010.
 [29] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
 [30] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 770–778, 2016.