Covariance-free Partial Least Squares:
An Incremental Dimensionality Reduction Method
Abstract
Dimensionality reduction plays an important role in computer vision problems since it reduces computational cost and is often capable of yielding more discriminative data representations. In this context, Partial Least Squares (PLS) has presented notable results in tasks such as image classification and neural network optimization. However, PLS is infeasible on large datasets (e.g., ImageNet) because it requires all the data to be in memory in advance, which is often impractical due to hardware limitations. Additionally, this requirement prevents us from employing PLS in streaming applications where the data are being continuously generated. Motivated by this, we propose a novel incremental PLS, named Covariance-free Incremental Partial Least Squares (CIPLS), which learns a low-dimensional representation of the data using a single sample at a time. In contrast to other state-of-the-art approaches, instead of adopting a partially-discriminative or SGD-based model, we extend Nonlinear Iterative Partial Least Squares (NIPALS), the standard algorithm used to compute PLS, for incremental processing. Among the advantages of this approach are the preservation of discriminative information across all components, the possibility of employing its score matrices for feature selection, and its computational efficiency. We validate CIPLS on face verification and image classification tasks, where it outperforms several other incremental dimensionality reduction methods. In the context of feature selection, CIPLS achieves comparable results when compared to state-of-the-art techniques.
1 Introduction
Dimensionality reduction is widely used in computer vision applications, from image classification [11] [2] to neural network optimization [8]. The idea behind this technique is to estimate a transformation matrix that projects the high-dimensional feature space onto a low-dimensional latent space [20] [7]. Previous works have demonstrated that dimensionality reduction can improve not only computational cost but also the effectiveness of the data representation [18] [31] [29]. In this context, Partial Least Squares (PLS) has presented remarkable results when compared to other dimensionality reduction methods [29]. This is mainly due to the criterion through which PLS finds the low-dimensional space, which is by capturing the relationship between independent and dependent variables. Another interesting aspect of PLS is that it can operate as a feature selection method, for instance, by employing Variable Importance in Projection (VIP) [21]. The VIP technique employs score matrices yielded by NIPALS (the standard algorithm used for traditional PLS) to compute the importance of each feature based on its contribution to the generation of the latent space.
Despite achieving notable results, PLS is not suitable for large datasets (e.g., ImageNet [5]) since it requires all the data to be in memory in advance, which is often impractical due to hardware limitations. Additionally, this requirement prevents us from employing PLS in streaming applications, where the data are generated continuously. Such a limitation is not particular to PLS; many dimensionality reduction methods, such as Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA), also suffer from this problem [32] [2].
To handle the aforementioned problem, many works have proposed incremental versions of traditional dimensionality reduction methods. The idea behind these methods is to estimate the projection matrix using a single data sample (or a small subset) at a time, while keeping some properties of the traditional dimensionality reduction methods. A well-known class of incremental methods is the one based on Stochastic Gradient Descent (SGD) [3] [2]. These methods interpret dimensionality reduction as a stochastic optimization problem over an unknown distribution. As shown by Weng et al. [32], incremental methods based on SGD are computationally expensive, present convergence problems, and require many parameters that depend on the nature of the data. To address this problem, Zeng et al. [34] proposed an efficient and low-cost incremental PLS (IPLS). In their work, the first dimension (component) of the latent space is found incrementally, while the other dimensions are estimated by projecting the first component onto a reconstructed covariance matrix, which is employed to address the impractical memory requirements of a full covariance matrix.
Even though IPLS achieves better performance than SGD-based and other state-of-the-art incremental methods, the discriminability of its higher-order components (i.e., all except the first) is not preserved, as shown in Figure 1 (a). This behavior appears because the higher-order components are estimated using only the independent variables, that is, based on an approximation of the covariance matrix (similar to PCA) instead of the relationship between independent and dependent variables employed in PLS. This can degrade the discriminability of the latent model since preserving the relationship between independent and dependent variables is an important property of the original PLS [7]. It is important to emphasize that, for high-dimensional data, employing several components often provides better results [29] [9] [10] [17]; hence, IPLS might not be suitable in these cases.
Motivated by limitations and drawbacks in incremental PLS-based approaches, we propose a novel incremental method (the code is available at https://github.com/arturjordao/IncrementalDimensionalityReduction). Our method relies on the hypothesis that the estimation of higher-order components using the covariance matrix, as proposed by Zeng et al. [34], is inadequate since the relationship between independent and dependent variables is lost. Therefore, to preserve this characteristic, we extend NIPALS [1] to avoid the computation of the covariance matrix and, consequently, enable it for incremental operation. Since our proposed extension is based on a simple algebraic decomposition, we preserve the simplicity and efficiency that make NIPALS popular, and we ensure that the relationship between independent and dependent variables is propagated to all components, differently from other methods.
As shown in Figure 1, our method is capable of separating the data classes better than IPLS, mainly on the second component (i.e., the y-axis). Since the proposed method does not use the covariance matrix to estimate higher-order components, we refer to it as Covariance-free Incremental Partial Least Squares (CIPLS). Besides providing superior performance, our method can easily be extended as a feature selection technique since it provides all the requirements to execute VIP. Existing incremental PLS methods, on the other hand, require more complex techniques to operate as feature selection methods [21].
We evaluate the proposed method on face verification and image classification tasks, where it outperforms several other incremental methods in terms of accuracy and efficiency. In addition, in the context of feature selection, we compare the proposed method to state-of-the-art methods, where it achieves competitive results.
2 Related Work
To enable PCA to operate in an incremental scheme, Weng et al. [32] proposed to compute the principal components without estimating the covariance matrix, which is unknown and impossible to calculate in incremental settings. For this purpose, their method (CCIPCA) updates the projection matrix for each arriving sample, replacing the unknown covariance matrix by its sample estimate. While CCIPCA provides a minimum reconstruction error of the data, it might not improve or even preserve the discriminability of the resulting subspace since label information is ignored (similarly to traditional PCA) [20].
To achieve discriminability, incremental methods based on Linear Discriminant Analysis (LDA) have been proposed [12] [19]. This class of methods is less explored since they present issues such as the small sample size problem [13], which makes them infeasible for some tasks. Different from incremental LDA methods, incremental PLS methods are more flexible and present better results [34]. Motivated by this, Arora et al. [3] proposed an incremental PLS based on stochastic optimization (SGD-PLS), where the idea is to optimize an objective function using a single sample at a time. Similarly to Arora et al. [3], Stott et al. [30] proposed applying stochastic gradient maximization to NIPALS, extending it for incremental processing. Even though it presents promising results on synthetic data, their approach presented convergence problems when evaluated on real-world datasets. Thus, in this work, we consider only the approach by Arora et al. [3], which converged on more of the evaluated datasets and presented better results.
While SGD-PLS is effective, as demonstrated by Weng et al. [32] and Zeng et al. [34], SGD-based methods applied to dimensionality reduction are computationally expensive and present convergence problems. In addition, this class of approaches requires careful parameter tuning, and the results are often sensitive to the type of dataset [32].
To address convergence problems in SGD-based PLS, Zeng et al. [34] proposed to decompose the relationship between the independent and dependent matrices (variables) into per-sample relationships (i.e., a single sample with its label). This process is performed only to compute the first component; the higher-order components are estimated by projecting the first component onto an approximated covariance matrix built from a few PCA components. As we mentioned earlier, since traditional PCA cannot be employed in incremental methods, Zeng et al. [34] used CCIPCA to reconstruct the principal components of the covariance matrix.
In contrast to existing incremental PLS methods, our method presents superior performance in both accuracy and execution time for estimating the projection matrix, which is an important requirement for time-sensitive and resource-constrained tasks. In particular, our method outperforms IPLS and SGD-PLS by and percentage points, respectively, when using only higher-order components. The reason for these results is the quality of our higher-order components, which keep the properties of traditional PLS.
Besides dimensionality reduction, another group of techniques widely employed to reduce computational cost is feature selection. One of the most recent and successful feature selection methods is the work by Roffo et al. [28], which proposed to interpret feature selection as a graph problem. In their method, named infinite feature selection (infFS), each feature represents a node in an undirected fully-connected graph, and the paths in this graph represent combinations of features. Following this model, the goal is to find the best path taking into account all possible paths (in this sense, all subsets of features) on the graph, by exploiting the convergence property of the geometric power series of a matrix. Improving upon this model, Roffo et al. [27] suggested quantizing the raw features into a small set of tokens before applying the process of Roffo et al. [28]. By using this preprocessing, their method (referred to as infinite latent feature selection, ilFS) achieved even better results than Roffo et al. [28]. Even though Roffo et al. [28] [27] achieved state-of-the-art results in the context of neural network optimization, Jordao et al. [16] showed that PLS+VIP attains superior performance. We show that CIPLS+VIP achieves comparable results when compared to PLS+VIP and other state-of-the-art feature selection techniques.
3 Proposed Approach
In this section, we start by describing traditional Partial Least Squares (PLS). Then, we present the proposed Covariance-free Incremental Partial Least Squares (CIPLS) and the Variable Importance in Projection (VIP) technique, which enables PLS and CIPLS to be employed for feature selection. Unless stated otherwise, let $X \in \mathbb{R}^{n \times d}$ be the matrix of independent variables, denoting $n$ training samples in a $d$-dimensional space. Furthermore, let $y \in \mathbb{R}^{n}$ be the vector of dependent variables representing the binary class labels. Finally, let $x_t$ and $y_t$ be a single sample of $X$ and $y$, respectively. We highlight that, in the context of streaming data, $x_t$ is a data sample acquired at time $t$.
3.1 Partial Least Squares
Given a high-dimensional space, PLS finds a projection matrix $W = [w_1, w_2, \dots, w_c] \in \mathbb{R}^{d \times c}$, which projects this space onto a low-dimensional space, where $c \ll d$. For this purpose, PLS aims at maximizing the covariance between the independent and dependent variables. Formally, PLS constructs $W$ such that

$w_i = \operatorname{arg\,max}_{\|w\| = 1} \operatorname{cov}(Xw, y), \qquad (1)$

where $w_i$ denotes the $i$-th component of the $c$-dimensional space. The exact solution to Equation 1 is given by

$w_i = \frac{X^{\top} y}{\|X^{\top} y\|}. \qquad (2)$
From Equation 2, we can compute all the components using either Nonlinear Iterative Partial Least Squares (NIPALS) [1] or Singular Value Decomposition (SVD). Most works employ NIPALS since it is capable of finding only the first $c$ components, while SVD always finds all the components, being computationally inefficient compared to NIPALS [1].
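To make the batch procedure concrete, the following is a minimal NIPALS sketch for a single binary response (often called PLS1): each weight vector follows Equation 2 on the deflated data. This is a simplified sketch, not the paper's exact implementation; function and variable names are ours.

```python
import numpy as np

def nipals_pls1(X, y, n_components):
    """Batch NIPALS sketch for a univariate response: w = X^T y / ||X^T y||
    (Equation 2), score t = Xw, loadings p and c, then deflation of X and y."""
    X = X - X.mean(axis=0)          # center the independent variables
    y = y.astype(float)             # work on a float copy of the labels
    W = []
    for _ in range(n_components):
        w = X.T @ y
        w /= np.linalg.norm(w)      # weight vector (Equation 2)
        t = X @ w                   # scores: samples projected onto w
        p = X.T @ t / (t @ t)       # X-loading of this projection
        c = y @ t / (t @ t)         # y-loading of this projection
        X = X - np.outer(t, p)      # deflate X by its reconstruction
        y = y - c * t               # deflate y by its reconstruction
        W.append(w)
    return np.array(W).T            # projection matrix, d x c
```

Each returned column has unit norm by construction, matching the constraint in Equation 1.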
3.2 Covariance-free Incremental PLS
The core idea in our method is to ensure that, as in traditional PLS, the relationship between independent and dependent variables (Equation 2) is kept in all the components. To achieve this goal, our method works as follows. First, we need to center the data to the mean $\bar{x}$ of the training samples. However, different from traditional methods, in incremental approaches the mean is unknown since we cannot assume that all the data are known a priori [32] [34]. To face this problem, we center the current data sample using an approximate centralization process [32], which consists of estimating an incremental mean using the $t$-th sample. According to Weng et al. [32], we can compute the incremental mean w.r.t. the $t$-th data sample as

$\bar{x}^{(t)} = \frac{t-1}{t}\,\bar{x}^{(t-1)} + \frac{1}{t}\,x_t. \qquad (3)$
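As a sketch, the incremental mean of Equation 3 can be maintained with a single running vector; names are ours, not the paper's:

```python
import numpy as np

def update_mean(mean, x, t):
    """Update the running mean with the t-th sample (t starts at 1),
    following Equation 3: mean_t = ((t-1)/t) * mean_{t-1} + (1/t) * x_t."""
    return ((t - 1) / t) * mean + (1.0 / t) * x

# After the whole stream has been seen, the incremental mean coincides
# with the batch mean of the samples.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
mean = np.zeros(5)
for t, x in enumerate(X, start=1):
    mean = update_mean(mean, x, t)
```

During streaming, of course, only the partial mean is available, which is why the paper calls this an approximate centralization.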
Once we have centralized the sample, the next step in our method is to compute the first component following Equation 2. As we mentioned, $X$ and its respective $y$ are unknown or not in memory in advance, which prohibits us from applying Equation 2 directly. However, as suggested by Zeng et al. [34], we employ the following decomposition:

$X^{\top} y = \sum_{t=1}^{n} y_t\, x_t. \qquad (4)$

By replacing $X^{\top} y$ in Equation 2 by the sum in Equation 4, it is possible to calculate the $i$-th component of PLS considering a single sample at a time. In other words, Equation 4 enables us to compute $w_i$ incrementally.
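A minimal sketch of this idea (names ours): accumulating the per-sample products of Equation 4 one sample at a time recovers, after normalization, exactly the batch solution of Equation 2.

```python
import numpy as np

def first_component_incremental(samples, labels):
    """Accumulate s = sum_t y_t * x_t one sample at a time (Equation 4);
    the first PLS weight vector is then s / ||s|| (Equation 2)."""
    s = np.zeros(samples.shape[1])
    for x, y in zip(samples, labels):
        s += y * x
    return s / np.linalg.norm(s)

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 8))
y = rng.choice([-1.0, 1.0], size=50)
w_inc = first_component_incremental(X, y)
w_batch = X.T @ y                     # batch computation for comparison
w_batch /= np.linalg.norm(w_batch)
```

Centering is omitted here to isolate the decomposition; in the full method each sample is first centered with the incremental mean of Equation 3.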
To compute the higher-order components ($w_i$, $i > 1$), we employ a deflation process, which consists of subtracting the contribution of the current component from the sample before estimating the next component. Following the NIPALS algorithm, the deflation process works as follows:

$t_i = X w_i, \qquad (5)$

$p_i = \frac{X^{\top} t_i}{t_i^{\top} t_i}, \qquad c_i = \frac{y^{\top} t_i}{t_i^{\top} t_i}, \qquad (6)$

$X \leftarrow X - t_i\, p_i^{\top}, \qquad y \leftarrow y - c_i\, t_i, \qquad (7)$

where $t_i$ denotes the samples projected onto the current component $w_i$, and $p_i$ and $c_i$ represent the loadings of this projection. It should be noted that while Equation 5 works in an incremental scheme (since we can project one sample at a time), $p_i$ and $c_i$ cannot be computed directly since $X$ and $y$ are neither known nor in memory in advance. However, in light of Equation 4, we can decompose $p_i$ and $c_i$ as

$p_i = \frac{\sum_{t=1}^{n} (x_t^{\top} w_i)\, x_t}{\sum_{t=1}^{n} (x_t^{\top} w_i)^2}, \qquad c_i = \frac{\sum_{t=1}^{n} (x_t^{\top} w_i)\, y_t}{\sum_{t=1}^{n} (x_t^{\top} w_i)^2}. \qquad (8)$

By embedding Equation 8 in the deflation process, we can remove the contribution of the current component and repeat the process to compute each subsequent component. Observe that Equation 7 deflates each sample by its reconstructed value; therefore, Equation 7 can be computed sample-by-sample, working in an incremental scheme. With this formulation, we are now capable of computing the components incrementally. Algorithm 1 summarizes the steps of the proposed method. It should be mentioned that the accumulated weight and loading matrices are initialized with zeros.
According to Algorithm 1, the proposed method maintains the property of capturing the relationship between $X$ and $y$ for all components. In addition, since we compute all components in a single pass over the data, our method has a time complexity of $O(ncd)$, where $n$, $c$ and $d$ denote the number of samples, number of components, and dimensionality of the data, respectively.
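The per-sample updates above can be sketched as follows. This is a simplified reading of Algorithm 1, not a verbatim reproduction; the running-sum variable names and the small epsilons guarding the divisions are ours.

```python
import numpy as np

def cipls_fit(X, y, n_components):
    """One-pass CIPLS sketch: for each sample, update the incremental mean
    (Eq. 3), center, then per component accumulate the weight sum (Eq. 4),
    project, accumulate the loading sums (Eq. 8), and deflate the sample
    and its label residual (Eq. 7)."""
    n, d = X.shape
    c = n_components
    W = np.zeros((c, d))       # running sums for the weight vectors w_i
    P = np.zeros((c, d))       # running numerators of the loadings p_i
    C = np.zeros(c)            # running numerators of the loadings c_i
    S = np.full(c, 1e-12)      # running sums of squared scores (denominators)
    mean = np.zeros(d)
    for t, (x, label) in enumerate(zip(X, y), start=1):
        mean = ((t - 1) / t) * mean + x / t        # Equation 3
        x = x - mean                               # approximate centering
        r = float(label)                           # label residual
        for i in range(c):
            W[i] += r * x                          # Equation 4, accumulated
            nrm = np.linalg.norm(W[i])
            w = W[i] / nrm if nrm > 0 else W[i]
            s = float(x @ w)                       # score of this sample
            P[i] += s * x                          # Equation 8, numerator of p_i
            C[i] += s * r                          # Equation 8, numerator of c_i
            S[i] += s * s                          # Equation 8, denominator
            x = x - s * (P[i] / S[i])              # Equation 7, per sample
            r = r - s * (C[i] / S[i])
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    return (W / np.where(norms > 0, norms, 1.0)).T  # projection matrix, d x c
```

On linearly separable synthetic data, projecting onto the first returned column separates the two classes, since that column approximates the direction of Equation 2.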
3.3 CIPLS for Feature Selection
An advantage of PLS is that, after estimating the projection matrix $W$, it is possible to estimate the importance of each feature, enabling PLS to operate as a feature selection method. For this purpose, it is possible to employ Variable Importance in Projection (VIP), which estimates the importance of each feature w.r.t. its contribution to yielding the low-dimensional space. According to [21], the VIP score of the $j$-th feature is defined as

$\mathrm{VIP}_j = \sqrt{\frac{d \sum_{i=1}^{c} c_i^{2}\, (t_i^{\top} t_i) \left( w_{ij} / \|w_i\| \right)^{2}}{\sum_{i=1}^{c} c_i^{2}\, (t_i^{\top} t_i)}}. \qquad (9)$
Once we have estimated the score of each feature, we can remove a percentage of the features based on their scores. As can be verified in Algorithm 1, CIPLS preserves the ability of traditional PLS to be employed as a feature selection method via VIP (Equation 9). On the other hand, it is important to emphasize that IPLS and SGD-PLS cannot be used to compute VIP since they do not provide the loading terms required by Equation 9.
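A sketch of the VIP computation of Equation 9, assuming the weight vectors and the per-component explained-variance terms (the products $c_i^2\, t_i^{\top} t_i$) are available from a fitted model; function and argument names are ours.

```python
import numpy as np

def vip_scores(W, ssy):
    """VIP sketch (Equation 9): W is d x c with columns w_i, and ssy[i]
    holds the share of y-variance explained by component i, i.e. the
    product c_i^2 * t_i^T t_i. Returns one importance score per feature."""
    d, c = W.shape
    Wn2 = (W / np.linalg.norm(W, axis=0)) ** 2   # (w_ij / ||w_i||)^2
    return np.sqrt(d * (Wn2 @ ssy) / ssy.sum())
```

A useful sanity check on this formula: the squared VIP scores sum to the number of features $d$, so features with a score above 1 contribute more than average and are natural candidates to keep.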
4 Experimental Results
In this section, we first introduce the experimental setup and the tasks employed to validate the proposed method. Then, we present the procedure conducted to calibrate the parameters of the methods. Next, we compare the proposed method with other incremental partial least squares methods as well as with traditional PLS. Afterwards, we present the influence of higher-order components on the classification performance. Finally, we discuss the time complexity of the methods and their performance in a streaming scenario, and compare our method in the feature selection context.
Experimental Setup. Throughout the experiments, we use a linear SVM for binary classification (face verification) because, according to Zeng et al. [34], a linear SVM coupled with dimensionality reduction is able to achieve remarkable results while being computationally efficient. In addition, it has been shown that simple classifiers, when fed with features from convolutional networks, are able to achieve results comparable to more sophisticated classifiers [6] [25]. For multiclass problems (image classification), on the other hand, we prefer to use a multilayer perceptron because it handles the multiclass problem naturally, avoiding the need to employ a binary classifier in a one-versus-rest fashion, which would be computationally expensive. All experiments and methods were executed on an Intel Core i5-8400, 2.4 GHz processor with 16 GB of RAM.
To assess the differences in efficacy and efficiency among the compared methods, throughout the experiments we follow the approach by Jain et al. [15] and perform statistical tests based on a paired t-test. We highlight that the statistical tests were conducted only for face verification due to the computational cost of retraining (i.e., fine-tuning) the convolutional neural network for image classification, which is considerably high since we employ large-scale datasets in our assessment.
Face Verification. Given a pair of face images, face verification determines whether the pair belongs to the same person. For this purpose, we use a three-stage pipeline [24] [4], which works as follows. First, we extract a feature vector for each face using a deep learning model. In this work, we use the feature maps from the last convolutional layer of the VGG16 model, learned on the VGGFaces dataset [23], as the feature vector. Then, we compute the distance between the two feature vectors and present the result of the distance metric to a classifier.
We conduct our evaluation on two face verification datasets, namely Labeled Faces in the Wild (LFW) [14] and Youtube Faces (YTF) [33].
Image Classification. Image classification consists of deciding to which one of a given set of categories an image belongs. Traditionally, this is done by extracting features from the samples and feeding these features to a classifier, which determines the category to which each image belongs. For this purpose, we use the feature maps from the last convolutional layer of the VGG16 model as features.
For the image classification task, we consider two versions of the ImageNet dataset, with images of size and pixels. The former is used since it is the original version of the dataset, while the latter is used because it has been demonstrated to be more challenging than the original version [26] [22].
Number of Components. One of the most important aspects of dimensionality reduction methods is the number of components of the resulting latent space. Therefore, to choose the best number of components for each method, we vary the number of components and select the value for which the method achieves the highest accuracy on the validation set (a subset of the training set). Once the best value is chosen, we use the training and validation sets to learn the projection method and the classifier. We repeat this process for each dataset.
[Table 1: verification accuracy on the LFW and YTF datasets for CCIPCA [32], SGD-PLS [3], IPLS [34], CIPLS, and traditional PLS; the numeric entries were not recoverable from this version of the document.]
Comparison with Incremental Methods. This experiment compares our CIPLS with other incremental dimensionality reduction methods. Table 1 summarizes the results and shows that, on LFW, our method outperformed SGD-PLS and IPLS by and percentage points (p.p.), respectively. Similarly, on the YTF dataset, CIPLS outperformed SGD-PLS and IPLS by and p.p., in this order.
Finally, on the ImageNet dataset, the difference in accuracy compared to IPLS was and p.p. for the two versions, respectively. It is important to mention that we do not consider SGD-PLS on these datasets due to convergence problems and high computational cost. Moreover, due to memory constraints, it was not possible to run traditional PLS on the ImageNet dataset.
[Table 2: average accuracy when the first component is removed, for CCIPCA [32], SGD-PLS [3], IPLS [34], and CIPLS (Ours); numeric entries not recoverable.]
Comparison with Partial Least Squares. As suggested by Weng et al. [32], we compare the incremental methods with the traditional approach (in our case, traditional PLS), where the closer a method gets to the accuracy of the baseline, the better.
According to Table 1, besides providing better results than IPLS and SGD-PLS, CIPLS achieved the results closest to traditional PLS. For instance, on LFW, the difference in accuracy between PLS and CIPLS was p.p., while on the YTF dataset it was p.p. In contrast, the difference in accuracy between PLS and SGD-PLS is higher – p.p. on LFW and p.p. on the YTF dataset. In addition, the difference in accuracy between PLS and IPLS is among the highest, and p.p. for LFW and YTF, respectively. In particular, the results of PLS and CIPLS are statistically equivalent, while IPLS and SGD-PLS present results statistically inferior to those of PLS.
It should be noted that the results of IPLS are closer to those of CCIPCA than to those of PLS, since only the first component of IPLS maintains the relationship between independent and dependent variables. The proposed method, on the other hand, preserves this relationship along the higher-order components, which provides better discriminability, as seen in our results.
Higher-order Components. This experiment assesses the discriminability of the higher-order components of CIPLS compared to each of the other incremental methods. For this purpose, we follow a process suggested by Martinez [20], which consists of removing the first component of the latent space before presenting the projected data to the classifier. This evaluates the performance of the remaining components, not only the first one, which tends to be the most discriminative.
Table 2 shows the results. According to Table 2, CIPLS outperformed IPLS by p.p. Observe that when all the components are used, CIPLS outperformed IPLS by p.p. This larger difference when removing the first component is an effect of the better discriminability achieved by the components extracted by CIPLS. As we have argued, CIPLS preserves the relationship between dependent and independent variables across higher-order components, yielding more accurate results. Compared to SGD-PLS, CIPLS outperforms it by p.p.
[Table 3: time complexity for estimating the projection matrix, expressed in terms of the number of samples, components, PCA components, and the data dimensionality, for CCIPCA [32], SGD-PLS [3], IPLS [34], and CIPLS (Ours); the complexity expressions were not recoverable from this version of the document.]
Time Issues. To demonstrate the efficiency of CIPLS, in this experiment, we compare its time complexity for computing the projection matrix with that of the evaluated incremental methods. Following Weng et al. [32] and Zeng et al. [34], we report this complexity w.r.t. the dimensionality of the original data ($d$), number of samples ($n$), number of components ($c$) and number of PCA components ($k$ — required only by IPLS and CCIPCA). Table 3 shows the time complexity of the methods.
According to Table 3, CIPLS presents a low time complexity for estimating the projection matrix. The complexity of CIPLS is not only in the same class as that of CCIPCA, which is the fastest among the compared methods, but it also has a very small constant factor: the number of components, $c$ for CIPLS and $k$ for CCIPCA. While, for fairness, the same number of components was adopted for all methods in Table 3, in practical applications PLS typically needs far fewer components to reach its highest accuracy. This is a known advantage of PLS; it has been shown to require substantially fewer components than PCA to achieve its optimal accuracy [29].
Finally, we report the average computation time (averaged over several executions) of the methods for estimating the projection matrix for one new sample. To make a fair comparison, we set the same number of components for all methods, and for the other parameters we use the values with which the methods achieved the best results in validation. As shown in Figure 2, SGD-PLS is the slowest incremental PLS method, which is a consequence of its strategy for estimating the projection matrix, where the convergence step is run multiple times for each sample; our experiments showed that many such iterations are required for good results. The computation time for estimating the projection matrix of our method was statistically equivalent (according to a paired t-test) to that of CCIPCA, which is the fastest among the incremental dimensionality reduction methods assessed. Moreover, CIPLS was statistically faster than IPLS and SGD-PLS, demonstrating that it is the fastest among the compared incremental PLS methods.
Incremental Methods in the Streaming Scenario. As we argued before, incremental methods can be employed in streaming applications, where the training data are continuously generated. To demonstrate the robustness of our method in these scenarios, in this experiment, we evaluate the methods in a synthetic streaming context, as proposed by Zeng et al. [34]. The procedure works as follows. First, the training data is divided into 10 blocks. The idea behind this process is to interpret each block as a new instance of arriving data. Then, we create a new training set by inserting one block at a time. Each time we insert a new block, we learn the projection method and evaluate its accuracy on the testing set. For instance, when adding the tenth block, all the blocks are being used for training. It is important to mention that a block contains more than one sample; however, this does not modify the strategy of the incremental methods, which is to estimate the projection matrix by using a single sample at a time.
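The block-wise protocol above can be sketched generically, with caller-supplied fit and evaluate functions standing in for the projection method plus classifier; the function and parameter names are ours.

```python
import numpy as np

def streaming_protocol(X_train, y_train, fit, evaluate, n_blocks=10):
    """Sketch of the synthetic streaming protocol: split the training data
    into blocks, and after each arriving block refit on everything seen so
    far and evaluate; returns one score per block."""
    blocks = np.array_split(np.arange(len(X_train)), n_blocks)
    seen = np.array([], dtype=int)
    scores = []
    for block in blocks:
        seen = np.concatenate([seen, block])       # data seen so far
        model = fit(X_train[seen], y_train[seen])  # relearn the projection
        scores.append(evaluate(model))             # accuracy on the test set
    return scores
```

With 10 blocks, the final evaluation uses the whole training set, matching the description of the tenth block above.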
Figure 3 (a) and (b) show the results on the LFW and YTF datasets, respectively. On LFW, up to the fifth block, it is not possible to determine the best method since the accuracy presents high variance; however, from the sixth block onwards, CIPLS outperforms all other methods. On YTF, our method achieves the highest accuracy for all blocks. These results show that the proposed method is more adequate for streaming applications than existing incremental PLS methods.
Comparison with Feature Selection Methods. Our last experiment evaluates the performance of CIPLS as a feature selection method. Table 4 shows the results for different percentages of kept features on LFW and YTF.
[Table 4: accuracy for different percentages of kept features on LFW and YTF, comparing infFS [28], ilFS [27], PLS+VIP, and CIPLS (Ours)+VIP; numeric entries not recoverable.]
According to Table 4, CIPLS achieves comparable results when compared to state-of-the-art feature selection techniques. For example, on LFW the average difference in accuracy from CIPLS to infFS and ilFS is and p.p., respectively. In contrast, on YTF, for some percentages of kept features, CIPLS outperforms infFS and ilFS. We highlight that these methods were designed specifically for feature selection. Additionally, the average difference between CIPLS and PLS is and p.p. on the LFW and YTF datasets, respectively. Moreover, the largest accuracy difference between PLS and CIPLS occurs on the LFW dataset. This result reinforces that the proposed decompositions to extend NIPALS and enable the employment of VIP are a good approximation of the original method.
Based on the results shown, it is possible to conclude that, besides dimensionality reduction, CIPLS achieves stateoftheart results in the context of feature selection.
5 Conclusions
This work presented a novel incremental partial least squares method, named Covariance-free Incremental Partial Least Squares (CIPLS). The method extends the NIPALS algorithm for incremental operation and enables computation of the projection matrix using one sample at a time while still presenting the main property of traditional PLS, namely preserving the relationship between dependent and independent variables. Compared to existing incremental partial least squares methods, CIPLS attains superior performance while being computationally efficient. In addition, different from previous incremental partial least squares methods, CIPLS can easily operate as a feature selection method. In this context, the proposed method is able to achieve results comparable to the state-of-the-art.
Acknowledgments
The authors would like to thank the Brazilian National Research Council – CNPq (Grants #311053/2016-5 and #438629/2018-3), the Minas Gerais Research Foundation – FAPEMIG (Grants APQ-00567-14 and PPM-00540-17) and the Coordination for the Improvement of Higher Education Personnel – CAPES (DeepEyes Project). The authors would also like to thank Xue-Qiang Zeng, Guo-Zheng Li, Raman Arora, Poorya Mianjy, Alexander Stott and Teodor Marinov for sharing their source code.
References
 [1] (2010) Partial least squares regression and projection on latent structure regression (PLS regression). Wiley Interdisciplinary Reviews: Computational Statistics.
 [2] (2019) An acceleration scheme for mini-batch, streaming PCA. British Machine Vision Conference (BMVC).
 [3] (2016) Stochastic optimization for multiview representation learning using partial least squares. In International Conference on Machine Learning (ICML), Vol. 48.
 [4] (2018) Unconstrained still/video-based face verification with deep convolutional neural networks. International Journal of Computer Vision 126.
 [5] (2009) ImageNet: a large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 248–255.
 [6] (2014) DeCAF: a deep convolutional activation feature for generic visual recognition. In International Conference on Machine Learning (ICML).
 [7] (1986) Partial least-squares regression: a tutorial. Analytica Chimica Acta 185, pp. 1–17.
 [8] (2015) Fast R-CNN. In IEEE International Conference on Computer Vision (ICCV), pp. 1440–1448.
 [9] (2011) Simultaneous dimensionality reduction and human age estimation via kernel partial least squares regression. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
 [10] (2013) Joint estimation of age, gender and ethnicity: CCA vs. PLS. In IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG).
 [11] (2016) PLSNet: a simple network using partial least squares regression for image classification. In International Conference on Pattern Recognition (ICPR), pp. 1601–1606.
 [12] (2000) Convergence analysis of online linear discriminant analysis. In IEEE International Joint Conference on Neural Networks (IJCNN), pp. 387–391.
 [13] (2006) Solving the small sample size problem in face recognition using generalized discriminant analysis. Pattern Recognition 39 (2), pp. 277–287.
 [14] (2012) Learning to align from scratch. In Neural Information Processing Systems (NIPS), pp. 773–781.
 [15] (1990) The art of computer systems performance analysis: techniques for experimental design, measurement, simulation, and modeling. Wiley Professional Computing, John Wiley & Sons.
 [16] (2019) Pruning deep neural networks using partial least squares. In British Machine Vision Conference (BMVC) Workshops, pp. 1–9.
 [17] (2018) Face verification strategies for employing deep models. In IEEE International Conference on Automatic Face & Gesture Recognition (FG), pp. 258–262.
 [18] (2019) Dimensionality reduction for representing the knowledge of probabilistic models. In International Conference on Learning Representations (ICLR).
 [19] (2012) Incremental learning of complete linear discriminant analysis for face recognition. Knowledge-Based Systems 31, pp. 19–27.
 [20] (2001) PCA versus LDA. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI).
 [21] (2012) A review of variable selection methods in partial least squares regression. Chemometrics and Intelligent Laboratory Systems.
 [22] (2019) Differentiable unrolled alternating direction method of multipliers for OneNet. British Machine Vision Conference (BMVC).
 [23] (2015) Deep face recognition. In British Machine Vision Conference (BMVC), pp. 41.1–41.12.
 [24] (2018) Deep learning for understanding faces: machines may be just as good, or better, than humans. Signal Processing Magazine 35.
 [25] (2014) CNN features off-the-shelf: an astounding baseline for recognition. In Conference on Computer Vision and Pattern Recognition (CVPR) Workshops.
 [26] (2017) Learning multiple visual domains with residual adapters. In Neural Information Processing Systems (NIPS), pp. 506–516.
 [27] (2017) Infinite latent feature selection: a probabilistic latent graph-based ranking approach. In IEEE International Conference on Computer Vision (ICCV), pp. 1407–1415.
 [28] (2015) Infinite feature selection. In IEEE International Conference on Computer Vision (ICCV), pp. 4202–4210.
 [29] (2009) Human detection using partial least squares analysis. In IEEE International Conference on Computer Vision (ICCV), pp. 24–31.
 [30] (2017) An online NIPALS algorithm for partial least squares. In International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4177–4181.
 [31] (2018) Learning low-dimensional temporal representations. In International Conference on Machine Learning (ICML).
 [32] (2003) Candid covariance-free incremental principal component analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 25 (8), pp. 1034–1040.
 [33] (2011) Face recognition in unconstrained videos with matched background similarity. In Computer Vision and Pattern Recognition (CVPR), pp. 529–534.
 [34] (2014) Incremental partial least squares analysis of big streaming data. Pattern Recognition 47, pp. 3726–3735.