Auto-associative models, nonlinear Principal component analysis, manifolds and projection pursuit
Laboratoire Paul Painlevé, 59655 Villeneuve d’Ascq Cedex, France.
Principal component analysis (PCA) is a well-known method for extracting linear structures from high-dimensional datasets. It computes the subspace best approaching the dataset from the Euclidean point of view. This method benefits from efficient implementations based either on solving an eigenvalue problem or on iterative algorithms. We refer to  for details. In a similar fashion, multi-dimensional scaling [3, 35, 44] addresses the problem of finding the linear subspace best preserving the pairwise distances. More recently, new algorithms have been proposed to compute low dimensional embeddings of high dimensional data. For instance, Isomap , LLE (Locally linear embedding)  and CDA (Curvilinear distance analysis)  aim at reproducing in the projection space the structure of the initial local neighborhood. These methods are mainly dedicated to visualization purposes. They cannot produce an analytic form of the transformation function, making it difficult to map new points into the dimensionality-reduced space. Besides, since they rely on local properties of pairwise distances, these methods are sensitive to noise and outliers. We refer to  for a comparison between Isomap and CDA and to  for a comparison between some features of LLE and Isomap.
Finding nonlinear structures is a challenging problem. An important family of methods focuses on self-consistent structures. The self-consistency concept is precisely defined in . Geometrically speaking, it means that each point of the structure is the mean of all points that project orthogonally onto it. For instance, it can be shown that the -means algorithm  converges to a set of self-consistent points. Principal curves and surfaces [8, 24, 37, 47] are examples of one-dimensional and two-dimensional self-consistent structures. Their practical computation requires to solve a nonlinear optimization problem. The solution is usually non robust and suffers from a high estimation bias. In , a polygonal algorithm is proposed to reduce this bias. Higher dimensional self-consistent structures are often referred to as self-consistent manifolds even though their existence is not guaranteed for arbitrary datasets. An estimation algorithm based on a grid approximation is proposed in . The fitting criterion involves two smoothness penalty terms describing the elastic properties of the manifold.
In this paper, auto-associative models are proposed as candidates to the generalization of PCA. We show in paragraph 2.1 that these models are dedicated to the approximation of the dataset by a manifold. Here, the word ”manifold” refers to the topology properties of the structure . The approximating manifold is built by a projection pursuit algorithm presented in paragraph 2.2. At each step of the algorithm, the dimension of the manifold is incremented. Some theoretical properties are provided in paragraph 2.3. In particular, we can show that, at each step of the algorithm, the mean residuals norm is not increased. Moreover, it is also established that the algorithm converges in a finite number of steps. Section 3 is devoted to the presentation of some particular auto-associative models. They are compared to the classical PCA and some neural networks models. Implementation aspects are discussed in Section 4. We show that, in numerous cases, no optimization procedure is required. Some illustrations on simulated and real data are presented in Section 5.
2 Auto-associative models
In this chapter, for each unit vector , we denote by the linear projection from to . Besides, for all set , the identity function is denoted by .
2.1 Approximation by manifolds
A function : is a -dimensional auto-associative function if there exist unit orthogonal vectors , called principal directions, and continuously differentiable functions : , called regression functions, such that
where is the Kronecker symbol and
The main feature of auto-associative functions is mainly a consequence of (1):
The equation , defines a differentiable -dimensional manifold of .
We refer to  for a proof. Thus, the equation defines a space in which every point has a neighborhood which resembles the Euclidean space , but in which the global structure may be more complicated. As an example, on a -dimensional manifold, every point has a neighborhood that resembles a line. In a -manifold, every point has a neighborhood that looks like a plane. Examples include the sphere or the surface of a torus.
Now, let be a square integrable random vector of . Assume, without loss of generality, that is centered and introduce . For all auto-associative function , let us consider . Note that, from the results of Subsection 2.3 below, is necessarily a centered random vector. In this context, is called the residual variance. Geometrically speaking, the realizations of the random vector are approximated by the manifold , and represents the variance of ”outside” the manifold.
Of course, such random vector always satisfies a -dimensional auto-associative model with and . Similarly, always satisfies a -dimensional auto-associative model with and . In practice, it is important to find a balance between these two extreme cases by constructing a -dimensional model with and . For instance, in the case where the covariance matrix of is of rank , then is located on a -dimensional linear subspace defined by the equation with
and where , are the eigenvectors of associated to the positive eigenvalues. A little algebra shows that (3) can be rewritten as , where is a -dimensional auto-associative function with linear regression functions for . Moreover, we have . Since (3) is the model produced by a PCA, it straightforwardly follows that PCA is a special (linear) case of auto-associative models. In the next section, we propose an algorithm to build auto-associative models with non necessarily linear regression functions, small dimension and small residual variance. Such models could also be called ”semi-linear” or ”semi-parametric” since they include a linear/parametric part through the use of linear projection operators and a non-linear/non-parametric part through the regression functions.
2.2 A projection pursuit algorithm
Let us recall that, given an unit vector , an index : is a functional measuring the interest of the projection with a non negative real number. The meaning of the word ”interest” depends on the considered data analysis problem. For instance, a possible choice of is the projected variance . Some other examples are presented in Section 4.2. Thus, the maximization of with respect to yields the most interesting direction for this given criteria. An algorithm performing such an optimization is called a projection pursuit algorithm. We refer to  and  for a review on this topic.
Let , and consider the following algorithm which consists in applying iteratively the following steps: [A] computation of the Axes, [P] Projection, [R] Regression and [U] Update:
Determine s.t. , .
The random variables are called principal variables and the random vectors residuals. Step [A] consists in computing an axis orthogonal to the previous ones and maximizing a given index . Step [P] consists in projecting the residuals on this axis to determine the principal variables, and step [R] is devoted to the estimation of the regression function of the principal variables best approximating the residuals. Step [U] simply consists in updating the residuals. Thus, Algorithm 1 can be seen as a projection pursuit regression algorithm [14, 32] since it combines a projection pursuit step [A] and a regression step [R]. The main problem of such approaches is to define an efficient way to iterate from to . Here, the key property is that the residuals are orthogonal to the axis since
Thus, it is natural to iterate the model construction in the subspace orthogonal to , see the orthogonality constraint in step [A]. The theoretical results provided in the next paragraph are mainly consequences of this property.
2.3 Theoretical results
Basing on (4), it is easily shown by induction that both the residuals and the regression functions computed at the iteration are almost surely (a.s.) orthogonal to the axes computed before. More precisely, one has
Besides, the residuals, principal variables and regression functions are centered:
for all . Our main result is the following:
Algorithm 1 builds a -dimensional auto-associative model with principal directions , regression functions and residual . Moreover, one has the expansion
where the principal variables and are centered and non-correlated for .
since (a.s.) in view of (5). Finally, note that the approximation properties of the conditional expectation entails that the sequence of the residual norms is almost surely non increasing. As a consequence, the following corollary will prove useful to select the model dimension similarly to the PCA case.
Let be the information ratio represented by the -dimensional auto-associative model:
Then, , and the sequence is non decreasing.
Note that all these properties are quite general, since they do not depend either on the index , nor on the estimation method for the conditional expectation. In the next section, we show how, in particular cases, additional properties can be obtained.
We first focus on the auto-associative models which can be obtained using linear estimators of the regression functions. The existing links with PCA are highlighted. Second, we introduce the intermediate class of additive auto-associative models and compare it to some neural network approaches.
3.1 Linear auto-associative models and PCA
Here, we limit ourselves to linear estimators of the conditional expectation in step [R]. At iteration , we thus assume
Standard optimization arguments (see , Proposition 2) shows that, necessarily, the regression function obtained at step [R] is located on the axis
with the covariance matrix of :
and where, for all matrix , the transposed matrix is denoted by . As a consequence of Theorem 2, we have the following linear expansion:
As an interesting additional property of these so-called linear auto-associative models, we have for all . This property is established in , Proposition 2. Therefore, the limitation to a family of linear functions in step [R] allows to recover an important property of PCA models: the non-correlation of the principal variables. It is now shown that Algorithm 1 can also compute a PCA model for a well suited choice of the index.
3.2 Additive auto-associative models and neural networks
A -dimensional auto-associative function is called additive if (2) can be rewritten as
In , the following characterization of additive auto-associative functions is provided. A -dimensional auto-associative function is additive if and only if
As a consequence, we have:
In the linear subspace spanned by , every -dimensional additive auto-associative model reduces to the PCA model.
A similar result can be established for the nonlinear PCA based on a neural network and introduced in . The proposed model is obtained by introducing a nonlinear function , called activation function, in the PCA model (3) to obtain
Note that (11) is an additive auto-associative model as defined in (10) if and only if , i.e. if and only if it reduces to the PCA model in the linear subspace spanned by . Moreover, in all cases, we have
which means that this model is included in the PCA one. More generally, the auto-associative Perceptron with one hidden layer  is based on multidimensional activation functions :
Unfortunately, it can be shown  that a single hidden layer is not sufficient. Linear activation functions (leading to a PCA) already yield the best approximation of the data. In other words, the nonlinearity introduced in (12) has no significant effect on the final approximation of the dataset. Besides, determining , is a highly nonlinear problem with numerous local minima, and thus very dependent on the initialization.
4 Implementation aspects
In this section, we focus on the implementation aspects associated to Algorithm 1. Starting from a -sample , two problems are addressed. In Subsection 4.1, we propose some simple methods to estimate the regression functions appearing in step [R]. In Subsection 4.2, the choice of the index in step [A] is discussed. In particular, we propose a contiguity index whose maximization is explicit.
4.1 Estimation of the regression functions
4.1.1 Linear auto-associative models
To estimate the regression functions, the simplest solution is to use a linear approach leading to a linear auto-associative model. In this case, the regression axis is explicit, see (8), and it suffices to replace defined in (9) by its empirical counterpart
where is the residual associated to at iteration .
4.1.2 Nonlinear auto-associative models
Let us now focus on nonlinear estimators of the conditional expectation , . Let us highlight that is a univariate function and thus its estimation does not suffer from the curse of dimensionality . This important property is a consequence of the ”bottleneck” trick used in (2) and, more generally, in neural networks approaches. The key point is that, even though is a - variate function, its construction only requires the nonparametric estimation of a univariate function thanks to the projection operator.
For the sake of simplicity, we propose to work in the orthogonal basis of obtained by completing . Let us denote by the -th coordinate of in . In view of (5), for . Besides, from step [P], . Thus, the estimation of reduces to the estimation of functions
Each coordinate of the estimator can be written in the basis as:
where represents the -th coordinate of the residual associated to the observation at the -th iteration in the basis , is the value of the -th principal variable for the observation and is a Parzen-Rosenblatt kernel, that is to say a bounded real function, integrating to one and such that as . For instance, one may use a a standard Gaussian density. The parameter is a positive number called window in this context. In fact, can be seen as a weighted mean of the residuals which are close to :
where the weights are defined by
and are summing to one:
The amplitude of the smoothing is tuned by . In the case of a kernel with bounded support, for instance if supp, the smoothing is performed on an interval of length . For an automatic choice of the smoothing parameter , we refer to , Chapter 6.
Each coordinate of the estimator is expanded on a basis of real functions as:
The coefficients appearing in the linear combination of basis functions are determined such that for . More precisely,
and it is well-known that this least-square problem benefits from an explicit solution which can be matricially written as
where is the matrix with coefficients , , . Note that this matrix does not depend on the coordinate . Thus, the matrix inversion in (15) is performed only once at each iteration . Besides, the size of this matrix is and thus does not depend either on the dimension of the space , nor on the sample size . As an example, one can use a basis of cubic splines . In this case, the parameter is directly linked to the number of knots: . Remark that, in this case, condition is required so that the matrix is is regular.
4.2 Computation of principal directions
The choice of the index is the key point of any projection pursuit problem where it is needed to find ”interesting” directions. We refer to  and  for a review on this topic. Let us recall that the meaning of the word ”interesting” depends on the considered data analysis problem. As mentioned in Subsection 2.2, the most popular index is the projected variance
used in PCA. Remarking that this index can be rewritten as
it appears that the ”optimal” axis maximizes the mean distance between the projected points. An attractive feature of the index (16) is that its maximization benefits from an explicit solution in terms of the eigenvectors of the empirical covariance matrix defined in (13). Friedman et al [15, 13], and more recently Hall , proposed an index to find clusters or use deviation from the normality measures to reveal more complex structures of the scatter-plot. An alternative approach can be found in  where a particular metric is introduced in PCA so as to detect clusters. We can also mention indices dedicated to outliers detection . Similar problems occur in the neural networks context where the focus is on the construction of nonlinear mappings to unfold the manifold. It is usually required that such a mapping preserves that local topology of the dataset. In this aim, Demartines and Herault  introduce an index to detect the directions in which the nonlinear projection approximatively preserves distances. Such an index can be adapted to our framework by restricting ourselves to linear projections:
The function is assumed to be positive and non increasing in order to favor the local topology preservation. According the authors, the application of this function to the outputs instead of the inputs allows to obtain better performances than the Kohonen’s self-organizing maps [33, 34]. Similarly, the criterion introduced in  yields in our case
However, in both cases, the resulting functions are nonlinear and thus difficult to optimize with respect to .
Our approach is similar to Lebart one’s . It consists in defining a contiguity coefficient whose minimization allows to unfold nonlinear structures. At each iteration , the following Rayleigh quotient  is maximized with respect to :
The matrix is a first order contiguity matrix, whose value is when is the nearest neighbor of , otherwise. The upper part of (17) is proportional to the usual projected variance, see (16). The lower part is the distance between the projection of points which are nearest neighbor in . Then, the maximization of (17) should reveal directions in which the projection best preserves the first order neighborhood structure (see Figure 1). In this sense, the index (17) can be seen as a first order approximation of the index proposed in . Thanks to this approximation, the maximization step benefits from an explicit solution: The resulting principal direction is the eigenvector associated to the maximum eigenvalue of where
is proportional to the local covariance matrix. should be read as the generalized inverse of the singular matrix . Indeed, since is orthogonal to from (5), is, at most, of rank . Note that this approach is equivalent to Lebart’s one when the contiguity matrix is symmetric.
5 Illustration on real and simulated data
Our first illustration is done on the ”DistortedSShape” simulated dataset
introduced in , paragraph 5.2.1 and available on-line at the
The dataset consists of 100 data points in and located around a one-dimensional curve (solid line on Figure 3). The bold dashed curve is the one-dimensional manifold estimated by the principal curves approach . The estimated curve fails to follow the shape of the original curve. Using the auto-associative model, the estimated one-dimensional manifold (dashed curve) is closer to the original one. In this experiment, we used one iteration of Algorithm 1 with the contiguity index (17) in combination with a projection estimate of the regression functions. A basis of cubic splines was used to compute the projection.
Our second illustration is performed on the
”dataset I - Five types of breast cancer”
provided to us by the organizers of the
”Principal Manifolds-2006” workshop.
The dataset is available on-line at the following address:
It consists of micro-array data containing logarithms of expression levels of genes in samples. The data is divided into five types of breast cancer (lumA, lumB, normal, errb2 and basal) plus an unclassified group. Before all, let us note that, since points are necessarily located on a linear subspace of dimension , the covariance matrix is at most of rank . Thus, as a preprocessing step, the dimension of the data is reduced to by a classical PCA, and this, without any loss of information. Forgetting the labels, i.e. without using the initial classification into five types of breast cancer, the information ratio (see Corollary 1) obtained by the classical PCA and the generalized one (basing on auto-associative models), are compared. Figure 3 illustrates the behavior of as the dimension of the model increases. The bold curve, corresponding to the auto-associate model, was computed with the contiguity index (17) in combination with a projection estimate of the regression functions. A basis of cubic splines was used to compute the projection. One can see that the generalized PCA yields far better approximation results than the classical one. As an illustration, the one-dimensional manifold is superimposed to the dataset on Figure 5. Each class is represented with a different gray level. For the sake of the visualization, the dataset as well as the manifold are projected on the principal plane. Similarly, the two-dimensional manifold is represented on Figure 6 on the linear space spanned by the three first principal axes. Taking into account the labels, it is also possible to compute the one-dimensional manifold associated to each type of cancer and to the unclassified points, see Figure 5. Each manifold then represents a kind of skeleton of the corresponding dataset.
Other illustrations can be found in , Chapter 4, where auto-associative models are applied to some image analysis problems.
-  R. Bellman. Dynamic programming. Princeton university press, 1957.
-  D. Bosq and J.P. Lecoutre. Théorie de l’estimation fonctionnelle. Economie et Statistiques avancées. Economica, Paris, 1987.
-  J.D. Carroll and P. Arabie. Multidimensionnal scaling. Annual Rev. of Psychology, 31:607–649, 1980.
-  H. Caussinus and A. Ruiz-Gazen. Metrics for finding typical structures by means of principal component analysis. In Data science and its Applications, pages 177–192. Harcourt Brace, Japan, 1995.
-  B. Chalmond. Modeling and inverse problems in image analysis. In Applied Mathematics Science series, volume 155. Springer, New-York, 2002.
-  B. Chalmond and S. Girard. Nonlinear modeling of scattered multivariate data and its application to shape change. IEEE Pattern Analysis and Machine Intelligence, 21(5):422–432, 1999.
-  B. Cheng and D.M. Titterington. Neural networks: A review from a statistical perspective. Statistical Science, 9(1):2–54, 1994.
-  P. Delicado. Another look at principal curves and surfaces. Journal of Multivariate Analysis, 77:84–116, 2001.
-  P. Demartines and J. Hérault. Curvilinear component analysis: A self-organizing neural network for nonlinear mapping of data sets. IEEE Trans. on Neural Networks, 8(1):148–154, 1997.
-  K.L. Diamantaras and Kung S.Y. Principal component neural networks. Wiley, New-York, 1996.
-  R.L. Eubank. Spline smoothing and non-parametric regression. Decker, 1990.
-  F. Ferraty and P. Vieu. Nonparametric modelling for functional data. Springer, 2005.
-  J.H. Friedman. Exploratory projection pursuit. Journal of the American Statistical Association, 82(397):249–266, 1987.
-  J.H. Friedman and W. Stuetzle. Projection pursuit regression. Journal of the American Statistical Association, 76(376):817–823, 1981.
-  J.H. Friedman and J.W. Tukey. A projection pursuit algorithm for exploratory data analysis. IEEE Trans. on Computers, C23(9):881–890, 1974.
-  S. Girard. A nonlinear PCA based on manifold approximation. Computational Statistics, 15(2):145–167, 2000.
-  S. Girard, B. Chalmond, and J-M. Dinten. Position of principal component analysis among auto-associative composite models. Comptes-Rendus de l’Académie des Sciences, Série I, 326:763–768, 1998.
-  S. Girard and S. Iovleff. Auto-associative models and generalized principal component analysis. Journal of Multivariate Analysis, 93(1):21–39, 2005.
-  A. Gorban and A. Zinovyev. Elastic principal graphs and manifolds and their practical applications. Computing, 75(4):359–379, 2005.
-  P.J. Green and B.W. Silverman. Non-parametric regression and generalized linear models. Chapman and Hall, London, 1994.
-  P. Hall. On polynomial-based projection indices for exploratory projection pursuit. The Annals of Statistics, 17(2):589–605, 1990.
-  W. Härdle. Applied nonparametric regression. Cambridge University Press, Cambridge, 1990.
-  J.A. Hartigan. Clustering algorithms. Wiley, New-York, 1995.
-  T. Hastie and W. Stuetzle. Principal curves. Journal of the American Statistical Association, 84 (406):502–516, 1989.
-  T. Hastie, R. Tibshirani, and J. Friedman. The elements of statistical learning. In Springer Series in Statistics. Springer, 2001.
-  P.J. Huber. Projection pursuit. The Annals of Statistics, 13(2):435–475, 1985.
-  I. Jolliffe. Principal Component Analysis. Springer-Verlag, New York, 1986.
-  M.C. Jones and R. Sibson. What is projection pursuit? Journal of the Royal Statistical Society A, 150:1–36, 1987.
-  J. Karhunen and J. Joutsensalo. Generalizations of principal component analysis, optimization problems and neural networks. Neural Networks, 8:549–562, 1995.
-  B. Kégl. ”Principal curves: learning, design, and applications”. PhD thesis, Concordia University, Canada, 1999.
-  B. Kégl, A. Krzyzak, T. Linder, and K. Zeger. A polygonal line algorithm for constructing principal curves. In Proceedings of 12h NIPS, pages 501–507, Denver, Colorado, 1998.
-  S. Klinke and J. Grassmann. Projection pursuit regression. In Wiley Series in Probability and Statistics, pages 471–496. Wiley, 2000.
-  T. Kohonen. Self-organization of topologically correct feature maps. Biological cybernetics, 43:135–140, 1982.
-  T. Kohonen. Self-organization and associative memory, 3rd edition. Springer-Verlag, Berlin, 1989.
-  J.B. Kruskal and M. Wish. Multidimensional scaling. Sage, Beverly Hills, 1978.
-  L. Lebart. Contiguity analysis and classification. In Opitz O. Gaul W. and Schader M., editors, Data Analysis, pages 233–244. Springer, Berlin, 2000.
-  M. LeBlanc and R. Tibshirani. Adaptive principal surfaces. Journal of the American Statistical Association, 89(425):53–64, 1994.
-  J.A. Lee, A. Lendasse, and M. Verleysen. Curvilinear distance analysis versus isomap. In European Symposium on Artifical Neural Networks, pages 185–192, Bruges, Belgium, 2002.
-  J. Milnor. Topology from the differentiable point of view. University press of Virginia, Charlottesville, 1965.
-  J-X. Pan, W-K. Fung, and K-T. Fang. Multiple outlier detection in multivariate data using projection pursuit techniques. Journal of Statistical Planning and Inference, 83(1):153–167, 2000.
-  B. N. Parlett. The symmetric eigenvalue problem. In Classics in Applied Mathematics, volume 20. SIAM, Philadelphia, 1997.
-  S.T. Roweis and L.K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323–2326, 2000.
-  J.W. Sammson. A nonlinear mapping algorithm for data structure analysis. IEEE Transactions on Computer, 18(5):401–409, 1969.
-  R.N. Shepard and J.D. Carroll. Parametric representation of nonlinear data structures. In P.R. Krishnaiah, editor, Int. Symp. on Multivariate Analysis, pages 561–592. Academic-Press, 1965.
-  T. Tarpey and B. Flury. Self-consistency: A fundamental concept in statistics. Statistical Science, 11(3):229–243, 1996.
-  J.B. Tenenbaum, V. de Silva, and J.C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290:2319–2323, 2000.
-  R. Tibshirani. Principal surfaces revisited. Statistics and computing, 2:183–190, 1992.
-  M. Vlachos, C. Domeniconi, D. Gunopulos, G. Kollios, and N. Koudas. Non-linear dimensionality reduction techniques for classification and visualization. In Proceedings of 8th SIGKDD, pages 23–26, Edmonton, Canada, 2002.