Auto-associative models, nonlinear principal component analysis, manifolds and projection pursuit
Stéphane Girard (Stephane.Girard@inrialpes.fr)
Serge Iovleff (serge.iovleff@univ-lille1.fr), Laboratoire Paul Painlevé, 59655 Villeneuve d'Ascq Cedex, France.
1 Introduction
Principal component analysis (PCA) is a well-known method for extracting linear structures from high-dimensional datasets. It computes the subspace best approaching the dataset from the Euclidean point of view. This method benefits from efficient implementations based either on solving an eigenvalue problem or on iterative algorithms. We refer to [27] for details. In a similar fashion, multidimensional scaling [3, 35, 44] addresses the problem of finding the linear subspace best preserving the pairwise distances. More recently, new algorithms have been proposed to compute low dimensional embeddings of high dimensional data. For instance, Isomap [46], LLE (Locally linear embedding) [42] and CCA (Curvilinear component analysis) [9] aim at reproducing in the projection space the structure of the initial local neighborhood. These methods are mainly dedicated to visualization purposes. They cannot produce an analytic form of the transformation function, making it difficult to map new points into the dimensionality-reduced space. Besides, since they rely on local properties of pairwise distances, these methods are sensitive to noise and outliers. We refer to [38] for a comparison between Isomap and CDA (Curvilinear distance analysis) and to [48] for a comparison between some features of LLE and Isomap.
Finding nonlinear structures is a challenging problem. An important family of methods focuses on self-consistent structures. The self-consistency concept is precisely defined in [45]. Geometrically speaking, it means that each point of the structure is the mean of all points that project orthogonally onto it. For instance, it can be shown that the k-means algorithm [23] converges to a set of self-consistent points. Principal curves and surfaces [8, 24, 37, 47] are examples of one-dimensional and two-dimensional self-consistent structures. Their practical computation requires solving a nonlinear optimization problem. The solution is usually not robust and suffers from a high estimation bias. In [31], a polygonal algorithm is proposed to reduce this bias. Higher dimensional self-consistent structures are often referred to as self-consistent manifolds, even though their existence is not guaranteed for arbitrary datasets. An estimation algorithm based on a grid approximation is proposed in [19]. The fitting criterion involves two smoothness penalty terms describing the elastic properties of the manifold.
In this paper, auto-associative models are proposed as candidates for the generalization of PCA. We show in paragraph 2.1 that these models are dedicated to the approximation of the dataset by a manifold. Here, the word "manifold" refers to the topological properties of the structure [39]. The approximating manifold is built by a projection pursuit algorithm presented in paragraph 2.2. At each step of the algorithm, the dimension of the manifold is incremented. Some theoretical properties are provided in paragraph 2.3. In particular, we show that, at each step of the algorithm, the mean residual norm is not increased. Moreover, it is also established that the algorithm converges in a finite number of steps. Section 3 is devoted to the presentation of some particular auto-associative models. They are compared to the classical PCA and to some neural network models. Implementation aspects are discussed in Section 4. We show that, in numerous cases, no optimization procedure is required. Some illustrations on simulated and real data are presented in Section 5.
2 Auto-associative models
In this chapter, for each unit vector $a \in \mathbb{R}^n$, we denote by $P_a : x \in \mathbb{R}^n \mapsto \langle a, x \rangle \in \mathbb{R}$ the linear projection from $\mathbb{R}^n$ to $\mathbb{R}$. Besides, for all set $E$, the identity function on $E$ is denoted by $\mathrm{Id}_E$.
2.1 Approximation by manifolds
A function $F : \mathbb{R}^n \to \mathbb{R}^n$ is a $d$-dimensional auto-associative function if there exist unit orthogonal vectors $a^1, \dots, a^d$, called principal directions, and continuously differentiable functions $s^1, \dots, s^d : \mathbb{R} \to \mathbb{R}^n$, called regression functions, such that

$$P_{a^j} \circ s^k = \delta_{j,k} \, \mathrm{Id}_{\mathbb{R}}, \quad 1 \leq j \leq k \leq d, \qquad (1)$$

where $\delta_{j,k}$ is the Kronecker symbol and

$$F = \left( \mathrm{Id}_{\mathbb{R}^n} - s^d \circ P_{a^d} \right) \circ \cdots \circ \left( \mathrm{Id}_{\mathbb{R}^n} - s^1 \circ P_{a^1} \right). \qquad (2)$$
The main feature of auto-associative functions is a consequence of (1):
Theorem 1
The equation $F(x) = 0$, $x \in \mathbb{R}^n$, defines a differentiable $d$-dimensional manifold of $\mathbb{R}^n$.
We refer to [16] for a proof. Thus, the equation $F(x) = 0$ defines a space in which every point has a neighborhood which resembles the Euclidean space $\mathbb{R}^d$, but in which the global structure may be more complicated. As an example, on a one-dimensional manifold, every point has a neighborhood that resembles a line. In a two-dimensional manifold, every point has a neighborhood that looks like a plane. Examples include the sphere or the surface of a torus.
Now, let $X$ be a square integrable random vector of $\mathbb{R}^n$. Assume, without loss of generality, that $X$ is centered, and introduce its covariance matrix $\Sigma$. For all auto-associative function $F$, let us consider the residual $R = F(X)$. Note that, from the results of Subsection 2.3 below, $R$ is necessarily a centered random vector. In this context, $\mathbb{E}\|R\|^2$ is called the residual variance. Geometrically speaking, the realizations of the random vector $X$ are approximated by the manifold $F(x) = 0$, and $\mathbb{E}\|R\|^2$ represents the variance of $X$ "outside" the manifold.
Of course, such a random vector always satisfies a $0$-dimensional auto-associative model with $F = \mathrm{Id}_{\mathbb{R}^n}$ and $R = X$. Similarly, $X$ always satisfies an $n$-dimensional auto-associative model with $R = 0$. In practice, it is important to find a balance between these two extreme cases by constructing a $d$-dimensional model with $d \ll n$ and a small residual variance. For instance, in the case where the covariance matrix $\Sigma$ of $X$ is of rank $d < n$, then $X$ is located on a $d$-dimensional linear subspace defined by the equation $F(x) = 0$ with

$$F(x) = x - \sum_{k=1}^{d} \langle a^k, x \rangle \, a^k, \qquad (3)$$

and where $a^1, \dots, a^d$ are the eigenvectors of $\Sigma$ associated to the positive eigenvalues. A little algebra shows that (3) can be rewritten as $F = (\mathrm{Id}_{\mathbb{R}^n} - s^d \circ P_{a^d}) \circ \cdots \circ (\mathrm{Id}_{\mathbb{R}^n} - s^1 \circ P_{a^1})$, where $F$ is a $d$-dimensional auto-associative function with linear regression functions $s^k(y) = y \, a^k$ for $k = 1, \dots, d$. Moreover, we have $R = 0$. Since (3) is the model produced by a PCA, it straightforwardly follows that PCA is a special (linear) case of auto-associative models. In the next section, we propose an algorithm to build auto-associative models with not necessarily linear regression functions, small dimension and small residual variance. Such models could also be called "semi-linear" or "semi-parametric" since they include a linear/parametric part through the use of linear projection operators and a nonlinear/nonparametric part through the regression functions.
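Since PCA is the linear special case, the construction above can be checked numerically. The following sketch (with hypothetical sizes and synthetic data) builds the linear auto-associative function of (3) from the eigenvectors of the empirical covariance matrix and verifies that the residual vanishes when the data actually lie on a $d$-dimensional subspace:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data of rank d = 2 embedded in R^5 (hypothetical dimensions).
N, n, d = 200, 5, 2
X = rng.normal(size=(N, d)) @ rng.normal(size=(d, n))
X -= X.mean(axis=0)  # center the sample, as assumed in the text

# Principal directions: eigenvectors of the empirical covariance matrix
# associated to its positive eigenvalues.
cov = X.T @ X / N
eigval, eigvec = np.linalg.eigh(cov)
A = eigvec[:, np.argsort(eigval)[::-1][:d]]  # columns a^1, ..., a^d

# Linear auto-associative function (3): F(x) = x - sum_k <a^k, x> a^k.
def F(x):
    return x - A @ (A.T @ x)

residuals = np.array([F(x) for x in X])
print(np.allclose(residuals, 0.0))  # the residual R vanishes on rank-d data
```

Any deviation of the data from the subspace would appear in the residuals, whose mean squared norm is precisely the residual variance discussed above.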
2.2 A projection pursuit algorithm
Let us recall that, given a unit vector $a$, an index $I$ is a functional measuring the interest of the projection $P_a(X)$ with a non-negative real number. The meaning of the word "interest" depends on the considered data analysis problem. For instance, a possible choice of $I$ is the projected variance $I(a) = \mathbb{E}\left[P_a^2(X)\right]$. Some other examples are presented in Section 4.2. Thus, the maximization of $I(a)$ with respect to $a$ yields the most interesting direction for this given criterion. An algorithm performing such an optimization is called a projection pursuit algorithm. We refer to [26] and [28] for a review on this topic.
Let $d \in \{1, \dots, n\}$, and consider the following algorithm which consists in applying iteratively the following steps: [A] computation of the Axes, [P] Projection, [R] Regression and [U] Update:

Algorithm 1

Define $R^0 = X$.

For $k = 1, \dots, d$:

[A] Determine $a^k = \arg\max_a I(a)$ s.t. $\|a\| = 1$ and $\langle a, a^j \rangle = 0$ for $1 \leq j < k$.

[P] Compute $Y^k = P_{a^k}(R^{k-1})$.

[R] Estimate $s^k(y) = \mathbb{E}\left(R^{k-1} \mid Y^k = y\right)$.

[U] Compute $R^k = R^{k-1} - s^k(Y^k)$.
The random variables $Y^1, \dots, Y^d$ are called principal variables and the random vectors $R^1, \dots, R^d$ residuals. Step [A] consists in computing an axis orthogonal to the previous ones and maximizing a given index $I$. Step [P] consists in projecting the residuals on this axis to determine the principal variables, and step [R] is devoted to the estimation of the regression function of the principal variables best approximating the residuals. Step [U] simply consists in updating the residuals. Thus, Algorithm 1 can be seen as a projection pursuit regression algorithm [14, 32] since it combines a projection pursuit step [A] and a regression step [R]. The main problem of such approaches is to define an efficient way to iterate from iteration $k$ to iteration $k+1$. Here, the key property is that the residuals $R^k$ are orthogonal to the axis $a^k$ since

$$P_{a^k}(R^k) = P_{a^k}(R^{k-1}) - \mathbb{E}\left( P_{a^k}(R^{k-1}) \mid Y^k \right) = Y^k - Y^k = 0. \qquad (4)$$

Thus, it is natural to iterate the model construction in the subspace orthogonal to $a^k$, see the orthogonality constraint in step [A]. The theoretical results provided in the next paragraph are mainly consequences of this property.
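A minimal numerical sketch of Algorithm 1 can be written with the simplest choices discussed later in the paper: the projected variance as index in step [A] and a linear regression in step [R] (both are simplifications chosen here for illustration; the nonlinear choices are the subject of Sections 3 and 4). Note that the orthogonality constraint of step [A] is automatically satisfied in this sketch, because the residuals are orthogonal to the previous axes:

```python
import numpy as np

def auto_associative(X, d):
    """Sketch of Algorithm 1 with the projected variance as index (step [A])
    and a linear regression s^k(y) = y * b^k (step [R]).
    Returns the principal directions, the slopes b^k and the final residuals."""
    R = X - X.mean(axis=0)               # R^0 = X (centered)
    axes, slopes = [], []
    for k in range(d):
        cov = R.T @ R / len(R)           # covariance Sigma^k of the residuals
        _, v = np.linalg.eigh(cov)
        a = v[:, -1]                     # [A] axis maximizing the projected variance
        Y = R @ a                        # [P] principal variable Y^k
        b = cov @ a / (a @ cov @ a)      # [R] slope of the linear regression
        R = R - np.outer(Y, b)           # [U] update the residuals
        axes.append(a)
        slopes.append(b)
    return np.array(axes), np.array(slopes), R
```

With these two choices the iteration reduces to a PCA-like deflation; replacing the regression step by a nonparametric estimate yields the nonlinear models of Section 4, while the orthogonality property (4) is preserved by construction.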
2.3 Theoretical results
Based on (4), it is easily shown by induction that both the residuals and the regression functions computed at iteration $k$ are almost surely (a.s.) orthogonal to the axes computed before. More precisely, one has

$$P_{a^j}(R^k) = 0 \ \text{a.s.} \quad \text{for all } 1 \leq j \leq k, \qquad (5)$$

$$P_{a^j}\left( s^k(Y^k) \right) = 0 \ \text{a.s.} \quad \text{for all } 1 \leq j < k. \qquad (6)$$

Besides, the residuals, principal variables and regression functions are centered:

$$\mathbb{E}(R^k) = 0, \quad \mathbb{E}(Y^k) = 0, \quad \mathbb{E}\left( s^k(Y^k) \right) = 0,$$

for all $k = 1, \dots, d$. Our main result is the following:
Theorem 2
Algorithm 1 builds a $d$-dimensional auto-associative model with principal directions $a^1, \dots, a^d$, regression functions $s^1, \dots, s^d$ and residual $R^d$. Moreover, one has the expansion

$$X = \sum_{k=1}^{d} s^k(Y^k) + R^d, \qquad (7)$$

where the principal variables $Y^k$ and the residuals $R^k$ are centered and non-correlated for $k = 1, \dots, d$.
The proof is a direct consequence of the orthogonality properties (5) and (6). Let us highlight that, for $d = n$, expansion (7) yields an exact expansion of the random vector $X$ as:

$$X = \sum_{k=1}^{n} s^k(Y^k),$$

since $R^n = 0$ (a.s.) in view of (5). Finally, note that the approximation properties of the conditional expectation entail that the sequence of the residual norms is almost surely non-increasing. As a consequence, the following corollary will prove useful to select the model dimension similarly to the PCA case.
Corollary 1
Let $Q_d$ be the information ratio represented by the $d$-dimensional auto-associative model:

$$Q_d = 1 - \mathbb{E}\|R^d\|^2 / \mathbb{E}\|X\|^2.$$

Then, $Q_d \in [0, 1]$, $Q_n = 1$, and the sequence $(Q_d)$ is non-decreasing.
Note that all these properties are quite general, since they depend neither on the index $I$, nor on the estimation method for the conditional expectation. In the next section, we show how, in particular cases, additional properties can be obtained.
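The behavior of the information ratio can be illustrated numerically. The sketch below assumes the standard definition $Q_d = 1 - \mathbb{E}\|R^d\|^2 / \mathbb{E}\|X\|^2$ and uses the residuals of the linear (PCA) model, for which everything is explicit; the dimensions and toy data are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3)) @ np.diag([3.0, 1.0, 0.3])  # anisotropic toy data
X -= X.mean(axis=0)

w, v = np.linalg.eigh(X.T @ X / len(X))
order = np.argsort(w)[::-1]

Q = []
for dim in range(X.shape[1] + 1):
    A = v[:, order[:dim]]          # first `dim` principal directions
    R = X - X @ A @ A.T            # residuals R^dim of the linear model
    Q.append(1.0 - (R ** 2).sum() / (X ** 2).sum())

print(Q)  # Q_0 = 0, Q_n = 1, and the sequence is non-decreasing
```

Each added direction removes a non-negative share of the total variance, which is why the sequence cannot decrease.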
3 Examples
We first focus on the auto-associative models which can be obtained using linear estimators of the regression functions. The existing links with PCA are highlighted. Second, we introduce the intermediate class of additive auto-associative models and compare it to some neural network approaches.
3.1 Linear auto-associative models and PCA
Here, we limit ourselves to linear estimators of the conditional expectation in step [R]. At iteration $k$, we thus assume

$$s^k(y) = y \, b^k, \quad b^k \in \mathbb{R}^n.$$

Standard optimization arguments (see [18], Proposition 2) show that, necessarily, the regression function obtained at step [R] is located on the axis

$$b^k = \Sigma^k a^k / \left( (a^k)' \Sigma^k a^k \right), \qquad (8)$$

with $\Sigma^k$ the covariance matrix of $R^{k-1}$:

$$\Sigma^k = \mathbb{E}\left( R^{k-1} (R^{k-1})' \right), \qquad (9)$$

and where, for all matrix $M$, the transposed matrix is denoted by $M'$. As a consequence of Theorem 2, we have the following linear expansion:

$$X = \sum_{k=1}^{d} Y^k b^k + R^d.$$

As an interesting additional property of these so-called linear auto-associative models, we have $\mathrm{cov}(Y^j, Y^k) = 0$ for all $j \neq k$. This property is established in [18], Proposition 2. Therefore, the limitation to a family of linear functions in step [R] allows us to recover an important property of PCA models: the non-correlation of the principal variables. It is now shown that Algorithm 1 can also compute a PCA model for a well-suited choice of the index.
Proposition 1

If, at each iteration $k$, the retained index is the projected variance $I(a) = \mathbb{E}\left[ P_a^2(R^{k-1}) \right]$, then the principal directions computed by Algorithm 1 are the eigenvectors of $\Sigma$ associated to the $d$ largest eigenvalues, and the resulting linear auto-associative model coincides with the $d$-dimensional PCA model (3).
3.2 Additive auto-associative models and neural networks
A $d$-dimensional auto-associative function is called additive if (2) can be rewritten as

$$F = \mathrm{Id}_{\mathbb{R}^n} - \sum_{k=1}^{d} s^k \circ P_{a^k}. \qquad (10)$$

In [17], the following characterization of additive auto-associative functions is provided. A $d$-dimensional auto-associative function is additive if and only if

$$P_{a^j} \circ s^k = \delta_{j,k} \, \mathrm{Id}_{\mathbb{R}} \quad \text{for all } 1 \leq j, k \leq d,$$

i.e. if and only if condition (1) holds for all pairs of indices, and not only for $j \leq k$.
As a consequence, we have:
Theorem 3
In the linear subspace spanned by $a^1, \dots, a^d$, every $d$-dimensional additive auto-associative model reduces to the PCA model.
A similar result can be established for the nonlinear PCA based on a neural network and introduced in [29]. The proposed model is obtained by introducing a nonlinear function $g$, called activation function, in the PCA model (3) to obtain

$$F(x) = x - \sum_{k=1}^{d} g\left( \langle a^k, x \rangle \right) a^k. \qquad (11)$$

Note that (11) is an additive auto-associative model as defined in (10) if and only if $g = \mathrm{Id}_{\mathbb{R}}$, i.e. if and only if it reduces to the PCA model in the linear subspace spanned by $a^1, \dots, a^d$. Moreover, in all cases, the manifold defined by (11) is included in this linear subspace, which means that this model is included in the PCA one. More generally, the auto-associative Perceptron with one hidden layer [7] is based on multidimensional activation functions $g : \mathbb{R}^d \to \mathbb{R}^n$:

$$F(x) = x - g\left( P_{a^1}(x), \dots, P_{a^d}(x) \right). \qquad (12)$$
Unfortunately, it can be shown [10] that a single hidden layer is not sufficient: linear activation functions (leading to a PCA) already yield the best approximation of the data. In other words, the nonlinearity introduced in (12) has no significant effect on the final approximation of the dataset. Besides, determining the parameters of (12) is a highly nonlinear problem with numerous local minima, and the solution is thus very dependent on the initialization.
4 Implementation aspects
In this section, we focus on the implementation aspects associated with Algorithm 1. Starting from a sample $\{x_1, \dots, x_N\}$ of $\mathbb{R}^n$, two problems are addressed. In Subsection 4.1, we propose some simple methods to estimate the regression functions appearing in step [R]. In Subsection 4.2, the choice of the index $I$ in step [A] is discussed. In particular, we propose a contiguity index whose maximization is explicit.
4.1 Estimation of the regression functions
4.1.1 Linear auto-associative models
To estimate the regression functions, the simplest solution is to use a linear approach leading to a linear auto-associative model. In this case, the regression axis is explicit, see (8), and it suffices to replace $\Sigma^k$ defined in (9) by its empirical counterpart

$$\hat{\Sigma}^k = \frac{1}{N} \sum_{i=1}^{N} R_i^{k-1} (R_i^{k-1})', \qquad (13)$$

where $R_i^{k-1}$ is the residual associated with $x_i$ at iteration $k$.
4.1.2 Nonlinear auto-associative models
Let us now focus on nonlinear estimators of the conditional expectation $s^k(y) = \mathbb{E}(R^{k-1} \mid Y^k = y)$. Let us highlight that $s^k$ is a univariate function and thus its estimation does not suffer from the curse of dimensionality [1]. This important property is a consequence of the "bottleneck" trick used in (2) and, more generally, in neural networks approaches. The key point is that, even though $F$ is an $n$-variate function, its construction only requires the nonparametric estimation of univariate functions thanks to the projection operators.

For the sake of simplicity, we propose to work in the orthogonal basis of $\mathbb{R}^n$ obtained by completing the set of axes $a^1, \dots, a^k$. Let us denote by $s^k_j$ the $j$th coordinate of $s^k$ in this basis. In view of (6), $s^k_j = 0$ for $j = 1, \dots, k-1$. Besides, from step [P], $s^k_k = \mathrm{Id}_{\mathbb{R}}$. Thus, the estimation of $s^k$ reduces to the estimation of the $n - k$ functions $s^k_j$, $j = k+1, \dots, n$.
This standard problem [22, 12] can be tackled either by kernel [2] or projection [20] estimates.
Kernel estimates
Each coordinate $\hat{s}^k_j$ of the estimator can be written in the basis as:

$$\hat{s}^k_j(y) = \sum_{i=1}^{N} R^{k-1}_{i,j} \, K\!\left( \frac{y - Y^k_i}{h} \right) \Big/ \sum_{i=1}^{N} K\!\left( \frac{y - Y^k_i}{h} \right), \qquad (14)$$

where $R^{k-1}_{i,j}$ represents the $j$th coordinate in the basis of the residual associated with the $i$th observation at the $k$th iteration, $Y^k_i$ is the value of the $k$th principal variable for the $i$th observation, and $K$ is a Parzen-Rosenblatt kernel, that is to say a bounded real function, integrating to one and such that $t K(t) \to 0$ as $|t| \to \infty$. For instance, one may use a standard Gaussian density. The parameter $h$ is a positive number called window in this context. In fact, $\hat{s}^k_j(y)$ can be seen as a weighted mean of the residuals which are close to $y$:

$$\hat{s}^k_j(y) = \sum_{i=1}^{N} w_i(y) \, R^{k-1}_{i,j},$$

where the weights are defined by

$$w_i(y) = K\!\left( \frac{y - Y^k_i}{h} \right) \Big/ \sum_{m=1}^{N} K\!\left( \frac{y - Y^k_m}{h} \right),$$

and are summing to one:

$$\sum_{i=1}^{N} w_i(y) = 1.$$

The amplitude of the smoothing is tuned by $h$. In the case of a kernel with bounded support, for instance if $\mathrm{supp}(K) = [-1, 1]$, the smoothing is performed on an interval of length $2h$. For an automatic choice of the smoothing parameter $h$, we refer to [25], Chapter 6.
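The kernel estimate (14) is a Nadaraya-Watson weighted mean and fits in a few lines. The sketch below uses a Gaussian kernel; the function name and signature are illustrative, not from the original implementation:

```python
import numpy as np

def kernel_estimate(y, Y, R, h):
    """Nadaraya-Watson form of (14): weighted mean of the residuals R (N x n)
    whose principal variables Y (length N) are close to y.
    Gaussian kernel, bandwidth h."""
    K = np.exp(-0.5 * ((y - Y) / h) ** 2)   # kernel evaluations K((y - Y_i) / h)
    w = K / K.sum()                          # weights w_i(y), summing to one
    return w @ R                             # estimate of s^k(y), all coordinates
```

Note that all coordinates are estimated at once by applying the same weights to the residual matrix, so the cost of the nonparametric step does not grow into an $n$-dimensional smoothing problem.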
Projection estimates
Each coordinate $\hat{s}^k_j$ of the estimator is expanded on a basis of $L$ real functions $b_1, \dots, b_L$ as:

$$\hat{s}^k_j(y) = \sum_{\ell=1}^{L} \beta_{j,\ell} \, b_\ell(y).$$

The coefficients $\beta_{j,\ell}$ appearing in the linear combination of basis functions are determined such that $\hat{s}^k_j(Y^k_i) \simeq R^{k-1}_{i,j}$ for $i = 1, \dots, N$. More precisely,

$$(\beta_{j,1}, \dots, \beta_{j,L}) = \arg\min_{\beta} \sum_{i=1}^{N} \left( R^{k-1}_{i,j} - \sum_{\ell=1}^{L} \beta_\ell \, b_\ell(Y^k_i) \right)^2,$$

and it is well-known that this least-squares problem benefits from an explicit solution which can be matricially written as

$$\beta_j = (B' B)^{-1} B' R_j, \qquad (15)$$

where $B$ is the $N \times L$ matrix with coefficients $B_{i,\ell} = b_\ell(Y^k_i)$, $i = 1, \dots, N$, $\ell = 1, \dots, L$, and $R_j$ collects the $j$th coordinates of the residuals. Note that the matrix $B' B$ does not depend on the coordinate $j$. Thus, the matrix inversion in (15) is performed only once at each iteration $k$. Besides, the size of this matrix is $L \times L$ and thus depends neither on the dimension $n$ of the space, nor on the sample size $N$. As an example, one can use a basis of cubic splines [11]. In this case, the parameter $L$ is directly linked to the number of knots. Remark that, in this case, the condition $N \geq L$ is required so that the matrix $B' B$ is regular.
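The projection estimate is a single least-squares solve shared by all coordinates. The sketch below uses a monomial basis for simplicity where the text uses cubic splines; that basis choice is the only assumption:

```python
import numpy as np

def projection_estimate(Y, R, L=4):
    """Projection estimate (15) with the monomial basis b_l(y) = y^(l-1)
    (illustrative; the text uses cubic splines). Solves one least-squares
    problem shared by all coordinates and returns the fitted function."""
    B = np.vander(Y, L, increasing=True)          # N x L design matrix B_il
    beta, *_ = np.linalg.lstsq(B, R, rcond=None)  # (B'B)^{-1} B' R, all coords at once
    return lambda y: np.vander(np.atleast_1d(y), L, increasing=True) @ beta
```

The $L \times L$ normal matrix is the same for every coordinate $j$, so it is handled once per iteration, and its size depends neither on $n$ nor on $N$ (provided $N \geq L$).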
4.2 Computation of principal directions
The choice of the index $I$ is the key point of any projection pursuit problem where it is needed to find "interesting" directions. We refer to [26] and [28] for a review on this topic. Let us recall that the meaning of the word "interesting" depends on the considered data analysis problem. As mentioned in Subsection 2.2, the most popular index is the projected variance

$$I(a) = \frac{1}{N} \sum_{i=1}^{N} P_a^2(x_i), \qquad (16)$$

used in PCA (for centered data). Remarking that this index can be rewritten as

$$I(a) = \frac{1}{2N^2} \sum_{i=1}^{N} \sum_{m=1}^{N} \left( P_a(x_i) - P_a(x_m) \right)^2,$$

it appears that the "optimal" axis maximizes the mean distance between the projected points. An attractive feature of the index (16) is that its maximization benefits from an explicit solution in terms of the eigenvectors of the empirical covariance matrix defined in (13). Friedman et al. [15, 13], and more recently Hall [21], proposed indices to find clusters or use deviation-from-normality measures to reveal more complex structures of the scatterplot. An alternative approach can be found in [4], where a particular metric is introduced in PCA so as to detect clusters. We can also mention indices dedicated to outlier detection [40]. Similar problems occur in the neural networks context, where the focus is on the construction of nonlinear mappings to unfold the manifold. It is usually required that such a mapping preserves the local topology of the dataset. To this aim, Demartines and Hérault [9] introduce an index to detect the directions in which the nonlinear projection approximately preserves distances. Such an index can be adapted to our framework by restricting ourselves to linear projections:

$$I(a) = \sum_{i=1}^{N} \sum_{m=1}^{N} \left( \|x_i - x_m\| - |P_a(x_i) - P_a(x_m)| \right)^2 \phi\left( |P_a(x_i) - P_a(x_m)| \right).$$

The function $\phi$ is assumed to be positive and non-increasing in order to favor the local topology preservation. According to the authors, the application of this function to the outputs instead of the inputs allows one to obtain better performances than Kohonen's self-organizing maps [33, 34]. Similarly, the criterion introduced in [43] yields in our case

$$I(a) = \sum_{i \neq m} \left( \|x_i - x_m\| - |P_a(x_i) - P_a(x_m)| \right)^2 / \|x_i - x_m\|.$$

However, in both cases, the resulting functions are nonlinear and thus difficult to optimize with respect to $a$.
Our approach is similar to Lebart's one [36]. It consists in defining a contiguity coefficient whose minimization allows one to unfold nonlinear structures. At each iteration $k$, the following Rayleigh quotient [41] is maximized with respect to $a$:

$$I(a) = \frac{\displaystyle \sum_{i=1}^{N} \sum_{m=1}^{N} \left( P_a(R^{k-1}_i) - P_a(R^{k-1}_m) \right)^2}{\displaystyle \sum_{i=1}^{N} \sum_{m=1}^{N} m_{i,m} \left( P_a(R^{k-1}_i) - P_a(R^{k-1}_m) \right)^2}. \qquad (17)$$

The matrix $M = (m_{i,m})$ is a first order contiguity matrix, whose value is $m_{i,m} = 1$ when $x_m$ is the nearest neighbor of $x_i$, and $m_{i,m} = 0$ otherwise. The numerator of (17) is proportional to the usual projected variance, see (16). The denominator is the distance between the projections of points which are nearest neighbors in $\mathbb{R}^n$. Then, the maximization of (17) should reveal directions in which the projection best preserves the first order neighborhood structure (see Figure 1). In this sense, the index (17) can be seen as a first order approximation of the index proposed in [6]. Thanks to this approximation, the maximization step benefits from an explicit solution: the resulting principal direction $a^k$ is the eigenvector associated to the maximum eigenvalue of $(V^k)^{+} \hat{\Sigma}^k$, where

$$V^k = \frac{1}{N} \sum_{i=1}^{N} \sum_{m=1}^{N} m_{i,m} \left( R^{k-1}_i - R^{k-1}_m \right) \left( R^{k-1}_i - R^{k-1}_m \right)'$$

is proportional to the local covariance matrix, and where $(V^k)^{+}$ should be read as the generalized inverse of the singular matrix $V^k$. Indeed, since $R^{k-1}$ is orthogonal to $a^1, \dots, a^{k-1}$ from (5), $V^k$ is, at most, of rank $n - k + 1$. Note that this approach is equivalent to Lebart's one when the contiguity matrix is symmetric.
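The explicit maximization can be sketched as follows. The construction of the contiguity matrix from first nearest neighbors and the use of a generalized inverse follow the text, while the exact normalizations are assumptions of this sketch:

```python
import numpy as np

def contiguity_direction(R):
    """Explicit maximization of a Rayleigh quotient like (17): top eigenvector
    of pinv(V) @ Sigma, where Sigma is the covariance of the (centered)
    residuals R (N x n) and V is built from first-nearest-neighbor
    differences."""
    N = len(R)
    Sigma = R.T @ R / N                   # covariance of the residuals
    D = ((R[:, None, :] - R[None, :, :]) ** 2).sum(axis=-1)  # squared distances
    np.fill_diagonal(D, np.inf)           # exclude each point itself
    nn = D.argmin(axis=1)                 # nearest neighbor of each point
    diff = R - R[nn]                      # differences to nearest neighbors
    V = diff.T @ diff / N                 # local covariance matrix
    w, vec = np.linalg.eig(np.linalg.pinv(V) @ Sigma)
    a = np.real(vec[:, np.argmax(np.real(w))])
    return a / np.linalg.norm(a)
```

On data concentrated around a curve, the returned direction tends to align with the direction in which nearest neighbors stay close after projection, which is the unfolding behavior sought here.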
5 Illustration on real and simulated data
Our first illustration is done on the "DistortedSShape" simulated dataset introduced in [30], paragraph 5.2.1, and available online at the following address: http://www.iro.umontreal.ca/kegl/research/pcurves. The dataset consists of 100 data points located around a one-dimensional curve (solid line on Figure 3). The bold dashed curve is the one-dimensional manifold estimated by the principal curves approach [24]. The estimated curve fails to follow the shape of the original curve. Using the auto-associative model, the estimated one-dimensional manifold (dashed curve) is closer to the original one. In this experiment, we used one iteration of Algorithm 1 with the contiguity index (17) in combination with a projection estimate of the regression functions. A basis of cubic splines was used to compute the projection.
Our second illustration is performed on the "dataset I - Five types of breast cancer" provided to us by the organizers of the "Principal Manifolds-2006" workshop. The dataset is available online at the following address: http://www.ihes.fr/zinovyev/princmanif2006/. It consists of microarray data containing logarithms of gene expression levels measured on a set of tissue samples. The data is divided into five types of breast cancer (lumA, lumB, normal, errb2 and basal) plus an unclassified group. Before all, let us note that, since the $N$ sample points are necessarily located on a linear subspace of dimension at most $N - 1$, the covariance matrix is at most of rank $N - 1$. Thus, as a preprocessing step, the dimension of the data is reduced accordingly by a classical PCA, and this without any loss of information.

Forgetting the labels, i.e. without using the initial classification into five types of breast cancer, the information ratios (see Corollary 1) obtained by the classical PCA and the generalized one (based on auto-associative models) are compared. Figure 3 illustrates the behavior of the information ratio as the dimension of the model increases. The bold curve, corresponding to the auto-associative model, was computed with the contiguity index (17) in combination with a projection estimate of the regression functions. A basis of cubic splines was used to compute the projection. One can see that the generalized PCA yields far better approximation results than the classical one.

As an illustration, the one-dimensional manifold is superimposed on the dataset on Figure 5. Each class is represented with a different gray level. For the sake of visualization, the dataset as well as the manifold are projected on the principal plane. Similarly, the two-dimensional manifold is represented on Figure 6 on the linear space spanned by the three first principal axes. Taking into account the labels, it is also possible to compute the one-dimensional manifold associated with each type of cancer and with the unclassified points, see Figure 5. Each manifold then represents a kind of skeleton of the corresponding dataset.
Other illustrations can be found in [5], Chapter 4, where auto-associative models are applied to some image analysis problems.
References
 [1] R. Bellman. Dynamic programming. Princeton university press, 1957.
 [2] D. Bosq and J.P. Lecoutre. Théorie de l’estimation fonctionnelle. Economie et Statistiques avancées. Economica, Paris, 1987.
 [3] J.D. Carroll and P. Arabie. Multidimensional scaling. Annual Rev. of Psychology, 31:607–649, 1980.
 [4] H. Caussinus and A. RuizGazen. Metrics for finding typical structures by means of principal component analysis. In Data science and its Applications, pages 177–192. Harcourt Brace, Japan, 1995.
 [5] B. Chalmond. Modeling and inverse problems in image analysis. In Applied Mathematics Science series, volume 155. Springer, NewYork, 2002.
 [6] B. Chalmond and S. Girard. Nonlinear modeling of scattered multivariate data and its application to shape change. IEEE Pattern Analysis and Machine Intelligence, 21(5):422–432, 1999.
 [7] B. Cheng and D.M. Titterington. Neural networks: A review from a statistical perspective. Statistical Science, 9(1):2–54, 1994.
 [8] P. Delicado. Another look at principal curves and surfaces. Journal of Multivariate Analysis, 77:84–116, 2001.
 [9] P. Demartines and J. Hérault. Curvilinear component analysis: A selforganizing neural network for nonlinear mapping of data sets. IEEE Trans. on Neural Networks, 8(1):148–154, 1997.
 [10] K.I. Diamantaras and S.Y. Kung. Principal component neural networks. Wiley, New York, 1996.
 [11] R.L. Eubank. Spline smoothing and nonparametric regression. Decker, 1990.
 [12] F. Ferraty and P. Vieu. Nonparametric modelling for functional data. Springer, 2005.
 [13] J.H. Friedman. Exploratory projection pursuit. Journal of the American Statistical Association, 82(397):249–266, 1987.
 [14] J.H. Friedman and W. Stuetzle. Projection pursuit regression. Journal of the American Statistical Association, 76(376):817–823, 1981.
 [15] J.H. Friedman and J.W. Tukey. A projection pursuit algorithm for exploratory data analysis. IEEE Trans. on Computers, C23(9):881–890, 1974.
 [16] S. Girard. A nonlinear PCA based on manifold approximation. Computational Statistics, 15(2):145–167, 2000.
 [17] S. Girard, B. Chalmond, and JM. Dinten. Position of principal component analysis among autoassociative composite models. ComptesRendus de l’Académie des Sciences, Série I, 326:763–768, 1998.
 [18] S. Girard and S. Iovleff. Autoassociative models and generalized principal component analysis. Journal of Multivariate Analysis, 93(1):21–39, 2005.
 [19] A. Gorban and A. Zinovyev. Elastic principal graphs and manifolds and their practical applications. Computing, 75(4):359–379, 2005.
 [20] P.J. Green and B.W. Silverman. Nonparametric regression and generalized linear models. Chapman and Hall, London, 1994.
 [21] P. Hall. On polynomialbased projection indices for exploratory projection pursuit. The Annals of Statistics, 17(2):589–605, 1990.
 [22] W. Härdle. Applied nonparametric regression. Cambridge University Press, Cambridge, 1990.
 [23] J.A. Hartigan. Clustering algorithms. Wiley, New York, 1975.
 [24] T. Hastie and W. Stuetzle. Principal curves. Journal of the American Statistical Association, 84 (406):502–516, 1989.
 [25] T. Hastie, R. Tibshirani, and J. Friedman. The elements of statistical learning. In Springer Series in Statistics. Springer, 2001.
 [26] P.J. Huber. Projection pursuit. The Annals of Statistics, 13(2):435–475, 1985.
 [27] I. Jolliffe. Principal Component Analysis. SpringerVerlag, New York, 1986.
 [28] M.C. Jones and R. Sibson. What is projection pursuit? Journal of the Royal Statistical Society A, 150:1–36, 1987.
 [29] J. Karhunen and J. Joutsensalo. Generalizations of principal component analysis, optimization problems and neural networks. Neural Networks, 8:549–562, 1995.
 [30] B. Kégl. ”Principal curves: learning, design, and applications”. PhD thesis, Concordia University, Canada, 1999.
 [31] B. Kégl, A. Krzyzak, T. Linder, and K. Zeger. A polygonal line algorithm for constructing principal curves. In Proceedings of 12th NIPS, pages 501–507, Denver, Colorado, 1998.
 [32] S. Klinke and J. Grassmann. Projection pursuit regression. In Wiley Series in Probability and Statistics, pages 471–496. Wiley, 2000.
 [33] T. Kohonen. Selforganization of topologically correct feature maps. Biological cybernetics, 43:135–140, 1982.
 [34] T. Kohonen. Selforganization and associative memory, 3rd edition. SpringerVerlag, Berlin, 1989.
 [35] J.B. Kruskal and M. Wish. Multidimensional scaling. Sage, Beverly Hills, 1978.
 [36] L. Lebart. Contiguity analysis and classification. In Opitz O. Gaul W. and Schader M., editors, Data Analysis, pages 233–244. Springer, Berlin, 2000.
 [37] M. LeBlanc and R. Tibshirani. Adaptive principal surfaces. Journal of the American Statistical Association, 89(425):53–64, 1994.
 [38] J.A. Lee, A. Lendasse, and M. Verleysen. Curvilinear distance analysis versus isomap. In European Symposium on Artificial Neural Networks, pages 185–192, Bruges, Belgium, 2002.
 [39] J. Milnor. Topology from the differentiable point of view. University press of Virginia, Charlottesville, 1965.
 [40] JX. Pan, WK. Fung, and KT. Fang. Multiple outlier detection in multivariate data using projection pursuit techniques. Journal of Statistical Planning and Inference, 83(1):153–167, 2000.
 [41] B. N. Parlett. The symmetric eigenvalue problem. In Classics in Applied Mathematics, volume 20. SIAM, Philadelphia, 1997.
 [42] S.T. Roweis and L.K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323–2326, 2000.
 [43] J.W. Sammon. A nonlinear mapping algorithm for data structure analysis. IEEE Transactions on Computer, 18(5):401–409, 1969.
 [44] R.N. Shepard and J.D. Carroll. Parametric representation of nonlinear data structures. In P.R. Krishnaiah, editor, Int. Symp. on Multivariate Analysis, pages 561–592. AcademicPress, 1965.
 [45] T. Tarpey and B. Flury. Selfconsistency: A fundamental concept in statistics. Statistical Science, 11(3):229–243, 1996.
 [46] J.B. Tenenbaum, V. de Silva, and J.C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290:2319–2323, 2000.
 [47] R. Tibshirani. Principal surfaces revisited. Statistics and computing, 2:183–190, 1992.
 [48] M. Vlachos, C. Domeniconi, D. Gunopulos, G. Kollios, and N. Koudas. Nonlinear dimensionality reduction techniques for classification and visualization. In Proceedings of 8th SIGKDD, pages 23–26, Edmonton, Canada, 2002.