Complex-Valued Autoencoders

Pierre Baldi, Department of Computer Science, UCI, Irvine, CA 92697-3435
Zhiqin Lu, Department of Mathematics, UCI, Irvine, CA 92697-3875
Abstract

Autoencoders are unsupervised machine learning circuits, with typically one hidden layer, whose learning goal is to minimize an average distortion measure between inputs and outputs. Linear autoencoders correspond to the special case where only linear transformations between visible and hidden variables are used. While linear autoencoders can be defined over any field, only real-valued linear autoencoders have been studied so far. Here we study complex-valued linear autoencoders where the components of the training vectors and adjustable matrices are defined over the complex field with the $L_2$ norm. We provide simpler and more general proofs that unify the real-valued and complex-valued cases, showing that in both cases the landscape of the error function is invariant under certain groups of transformations. The landscape has no local minima, a family of global minima associated with Principal Component Analysis, and many families of saddle points associated with orthogonal projections onto subspaces spanned by sub-optimal subsets of eigenvectors of the covariance matrix. The theory yields several iterative, convergent, learning algorithms and a clear understanding of the generalization properties of the trained autoencoders, and can equally be applied to the hetero-associative case when external targets are provided. Partial results on deep architectures as well as the differential geometry of autoencoders are also presented. The general framework described here is useful to classify autoencoders and identify general properties that ought to be investigated for each class, illuminating some of the connections between autoencoders, unsupervised learning, clustering, Hebbian learning, and information theory.

keywords:
autoencoders, unsupervised learning, complex numbers, complex neural networks, critical points, linear networks, Principal Component Analysis, EM algorithm, deep architectures, differential geometry
journal: Neural Networks

1 Introduction

Autoencoder circuits, which try to minimize a distortion measure between inputs and outputs, play a fundamental role in machine learning. They were introduced in the 1980s by the Parallel Distributed Processing (PDP) group Rumelhart et al. (1986) as a way to address the problem of unsupervised learning, in contrast to supervised learning in backpropagation networks, by using the inputs as learning targets. More recently, autoencoders have been used extensively in the “deep architecture” approach Hinton et al. (2006); Hinton and Salakhutdinov (2006); Bengio and LeCun (2007); Erhan et al. (2010), where autoencoders in the form of Restricted Boltzmann Machines (RBMs) are stacked and trained bottom-up in an unsupervised fashion to extract hidden features and efficient representations that can then be used to address supervised classification or regression tasks. In spite of the interest they have generated, and with a few exceptions Roux and Bengio (2010), little theoretical understanding of autoencoders and deep architectures has been obtained to date. One possible strategy for addressing these issues is to partition the autoencoder universe into different classes, for instance linear versus non-linear autoencoders, and identify classes that can be analyzed mathematically, with the hope that the precise understanding of several specific classes may lead to a clearer general picture. Within this background and strategy, the main purpose of this article is to provide a complete theory for a particular class of autoencoders, namely linear autoencoders over the complex field.

In addition to trying to progressively derive a more complete theoretical understanding of autoencoders, there are several other reasons, primarily theoretical ones, for looking at linear complex-valued autoencoders. First, linear autoencoders over the real numbers were solved by Baldi and Hornik (1988) (see also Bourlard and Kamp (1988)). It is thus natural to ask whether linear autoencoders over the complex numbers share the same basic properties or not, and whether unified proofs can be derived to cover both the real- and complex-valued cases. More generally, linear autoencoders can be defined over any field and therefore one can raise similar questions for linear autoencoders over other fields, such as finite Galois fields Lang (1984).

Second, a specific class of non-linear autoencoders was recently introduced and analyzed mathematically Baldi (2011). This is the class of Boolean autoencoders where all circuit operations are Boolean functions. It can be shown that this class of autoencoders is intimately connected to clustering and so it is reasonable to both compare Boolean autoencoders to linear autoencoders, and to examine linear autoencoders from a clustering perspective.

Third, there has been a trend in recent years towards the use of linear networks and methods to address difficult tasks, such as building recommender systems (e.g. the Netflix prize challenge Bell and Koren (2007); Takács, Pilászy, Németh, and Tikk (2008)) or modeling the development of sensory systems, in clever ways by introducing particular restrictions on the relevant matrices, such as sparsity or low rank Candès and Wakin (2008); Candès and Recht (2009). Autoencoders discussed in this paper can be viewed as linear, low-rank approximations to the identity function and therefore fall within this general trend.

Finally, complex vector spaces and matrices have several areas of specific application, ranging from quantum mechanics, to fast Fourier transforms, to complex-valued neural networks Hirose (2003), and ought to be studied in their own right. Complex-valued linear autoencoders can be viewed as a particular class of complex-valued neural networks and may be used in applications involving complex-valued data.

With these motivations in mind, in order to provide a complete treatment of linear complex-valued autoencoders, here we first introduce a general framework and notation, essential for a better understanding and classification of autoencoders, and for the identification of common properties that ought to be studied in any new specific autoencoder case. We then proceed to analytically solve the complex-valued linear autoencoder. While in the end the results obtained in the complex-valued case are similar to those previously obtained in the real-valued case Baldi and Hornik (1988), interchanging conjugate transposition with simple transposition, the approach adopted here allows us to derive simpler and more general proofs that unify both cases. In addition, we derive several new properties and results, addressing for instance learning algorithms and their convergence properties, and some of the connections to clustering, deep architectures, and other kinds of autoencoders. Finally, in the Appendix, we begin the study of real- and complex-valued autoencoders from a differential geometry perspective.

2 General Autoencoder Framework and Preliminaries

2.1 General Autoencoder Framework

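To derive a fairly general framework, an $n/p/n$ autoencoder (Figure 1) is defined by a tuple $n, p, m, \mathbb{F}, \mathbb{G}, \mathcal{A}, \mathcal{B}, \mathcal{X}, \Delta$ where:

1. $\mathbb{F}$ and $\mathbb{G}$ are sets.

2. $n$ and $p$ are positive integers. Here we consider primarily the case where $0 < p < n$.

3. $\mathcal{A}$ is a class of functions from $\mathbb{G}^p$ to $\mathbb{F}^n$.

4. $\mathcal{B}$ is a class of functions from $\mathbb{F}^n$ to $\mathbb{G}^p$.

5. $\mathcal{X} = \{x_1, \ldots, x_m\}$ is a set of $m$ (training) vectors in $\mathbb{F}^n$. When external targets are present, we let $\mathcal{Y} = \{y_1, \ldots, y_m\}$ denote the corresponding set of target vectors in $\mathbb{F}^n$.

6. $\Delta$ is a dissimilarity or distortion function defined over $\mathbb{F}^n \times \mathbb{F}^n$.

For any $A \in \mathcal{A}$ and $B \in \mathcal{B}$, the autoencoder transforms an input vector $x \in \mathbb{F}^n$ into an output vector $A \circ B(x) \in \mathbb{F}^n$ (Figure 1). The corresponding autoencoder problem is to find $A \in \mathcal{A}$ and $B \in \mathcal{B}$ that minimize the overall distortion (or error/energy) function:

$$\min E(A,B) = \min_{A,B} \sum_{t=1}^{m} E(x_t) = \min_{A,B} \sum_{t=1}^{m} \Delta\big(A \circ B(x_t), x_t\big) \tag{1}$$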

In the non auto-associative case, when external targets are provided, the minimization problem becomes:

$$\min E(A,B) = \min_{A,B} \sum_{t=1}^{m} E(x_t) = \min_{A,B} \sum_{t=1}^{m} \Delta\big(A \circ B(x_t), y_t\big) \tag{2}$$

Note that $p < n$ corresponds to the regime where the autoencoder tries to implement some form of compression or feature extraction. The case $p \ge n$ is not treated here but can be interesting in situations which either (1) prevent the use of trivial solutions by enforcing additional constraints, such as sparsity, or (2) include noise in the hidden layer, corresponding to transmission over a noisy channel.

Obviously, from this general framework, different kinds of autoencoders can be derived depending, for instance, on the choice of sets $\mathbb{F}$ and $\mathbb{G}$, transformation classes $\mathcal{A}$ and $\mathcal{B}$, distortion function $\Delta$, as well as the presence of additional constraints. Linear autoencoders correspond to the case where $\mathbb{F}$ and $\mathbb{G}$ are fields and $\mathcal{A}$ and $\mathcal{B}$ are the classes of linear transformations, hence $A$ and $B$ are matrices of size $n \times p$ and $p \times n$ respectively. The linear real case where $\mathbb{F} = \mathbb{G} = \mathbb{R}$ and $\Delta$ is the squared Euclidean distance was addressed in Baldi and Hornik (1988) (see also Bourlard and Kamp (1988)).

2.2 Complex Linear Autoencoder

Here we consider the corresponding complex linear case where $\mathbb{F} = \mathbb{G} = \mathbb{C}$ and the goal is the minimization of the squared Euclidean distance

$$\min E(A,B) = \min_{A,B} \sum_{t=1}^{m} \|x_t - AB(x_t)\|^2 = \min_{A,B} \sum_{t=1}^{m} \big(x_t - AB(x_t)\big)^{*} \big(x_t - AB(x_t)\big) \tag{3}$$

Unless otherwise specified, all vectors are column vectors and we use $x^{*}$ (resp. $X^{*}$) to denote the conjugate transpose of a vector $x$ (resp. of a matrix $X$). Note that the same notation works for both the complex and real case. As we shall see, in the linear complex case as in the linear real case, one can also address the case where external targets are available, in which case the goal is the minimization of the distance

$$\min E(A,B) = \min_{A,B} \sum_{t=1}^{m} \|y_t - AB(x_t)\|^2 = \min_{A,B} \sum_{t=1}^{m} \big(y_t - AB(x_t)\big)^{*} \big(y_t - AB(x_t)\big) \tag{4}$$
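As a concrete illustration, the following is a minimal NumPy sketch of the error in Equations 3 and 4. It assumes the training vectors are stored as the columns of arrays `X` and `Y`; the function name `reconstruction_error` and the random data are illustrative choices, not part of the original text.

```python
import numpy as np

def reconstruction_error(A, B, X, Y=None):
    """Sum over t of ||y_t - A B x_t||^2; columns of X and Y are the vectors x_t and y_t."""
    if Y is None:                      # auto-associative case: targets equal inputs
        Y = X
    R = Y - A @ B @ X                  # residuals, one column per training vector
    return float(np.sum(np.abs(R) ** 2))

rng = np.random.default_rng(0)
n, p, m = 5, 2, 20                     # illustrative sizes with p < n
X = rng.standard_normal((n, m)) + 1j * rng.standard_normal((n, m))
A = rng.standard_normal((n, p)) + 1j * rng.standard_normal((n, p))
B = rng.standard_normal((p, n)) + 1j * rng.standard_normal((p, n))

E = reconstruction_error(A, B, X)

# The same quantity written with conjugate transposes, as in Equation 3.
R = X - A @ B @ X
E_star = np.trace(R.conj().T @ R).real
```

Both expressions agree because the sum of $r_t^{*} r_t$ over the columns of the residual matrix is the trace of $R^{*}R$.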

In practical applications, it is often preferable to work with centered data, after subtraction of the mean. The centered and non-centered versions of the problem are two different problems with, in general, two different solutions. The general equations to be derived apply equally to both cases.

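In general, we define the covariance matrices as follows

$$\Sigma_{XY} = \sum_t x_t y_t^{*} \tag{5}$$

Using this definition, $\Sigma_{XX}$ and $\Sigma_{YY}$ are Hermitian matrices, $(\Sigma_{XX})^{*} = \Sigma_{XX}$ and $(\Sigma_{YY})^{*} = \Sigma_{YY}$, and $(\Sigma_{XY})^{*} = \Sigma_{YX}$. We let also

$$\Sigma = \Sigma_{YX} \Sigma_{XX}^{-1} \Sigma_{XY} \tag{6}$$

$\Sigma$ is also Hermitian. In the auto-associative case, $x_t = y_t$ for all $t$, resulting in $\Sigma = \Sigma_{XX}$. Note that any Hermitian matrix admits a set of orthonormal eigenvectors and all its eigenvalues are real. Finally, we let $I_k$ denote the $k \times k$ identity matrix.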
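The definitions in Equations 5 and 6 can be checked numerically. The sketch below assumes column-wise data storage and uses a hypothetical helper `cov`; it is only meant to make the Hermitian structure tangible.

```python
import numpy as np

def cov(X, Y):
    """Sigma_XY = sum_t x_t y_t^* for data stored column-wise (Equation 5)."""
    return X @ Y.conj().T

rng = np.random.default_rng(1)
n, m = 4, 30
X = rng.standard_normal((n, m)) + 1j * rng.standard_normal((n, m))
Y = rng.standard_normal((n, m)) + 1j * rng.standard_normal((n, m))

S_xx, S_xy, S_yx = cov(X, X), cov(X, Y), cov(Y, X)
Sigma = S_yx @ np.linalg.inv(S_xx) @ S_xy       # Equation 6

# eigvalsh treats its argument as Hermitian and returns real eigenvalues.
eigvals = np.linalg.eigvalsh(Sigma)

# Auto-associative case: y_t = x_t, so Sigma collapses to Sigma_XX.
Sigma_auto = S_xx @ np.linalg.inv(S_xx) @ S_xx
```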

For several results, we make the assumption that $\Sigma_{XX}$ is invertible. This is not a very restrictive assumption for several reasons. First, by adding a small amount of noise to the data, a non-invertible $\Sigma_{XX}$ could be converted to an invertible $\Sigma_{XX}$, although this could potentially raise some numerical issues. More importantly, in most settings one can expect the training vectors to span the entire input space and thus $\Sigma_{XX}$ to be invertible. If the training vectors span a smaller subspace, then the original problem can be transformed to an equivalent problem defined on the smaller subspace.

2.3 Useful Reminders

Standard Linear Regression. Consider the standard linear regression problem of minimizing $E(B) = \sum_t \|y_t - B x_t\|^2$, where $B$ is a matrix of the appropriate size, corresponding to a linear neural network without any hidden layers. Then we can write

$$E(B) = \sum_t \Big( x_t^{*} B^{*} B x_t - 2\,\mathrm{Re}\big(y_t^{*} B x_t\big) + \|y_t\|^2 \Big) \tag{7}$$

Thus $E$ is a convex function in $B$ because the associated quadratic form is equal to

$$\sum_t x_t^{*} C^{*} C x_t = \sum_t \|C x_t\|^2 \ge 0 \tag{8}$$

Let $B$ be a critical point. Then by definition for any matrix $C$ we must have $\lim_{\epsilon \to 0} [E(B + \epsilon C) - E(B)]/\epsilon = 0$. Expanding and simplifying this expression gives

$$\sum_t \Big( x_t^{*} B^{*} C x_t - y_t^{*} C x_t \Big) = 0 \tag{9}$$

for all matrices $C$. Using the linearity of the trace operator and its invariance under circular permutation of its arguments (it is easy to show directly that for any matrices $A$ and $B$ of the proper size, $\mathrm{Tr}(AB) = \mathrm{Tr}(BA)$, Lang (1984); therefore for any matrices $A$, $B$, and $C$ of the proper size, we have $\mathrm{Tr}(ABC) = \mathrm{Tr}(CAB) = \mathrm{Tr}(BCA)$), this is equivalent to

$$\mathrm{Tr}\big((\Sigma_{XX} B^{*} - \Sigma_{XY})\,C\big) = 0 \tag{10}$$

for any $C$. Thus we have $\Sigma_{XX} B^{*} - \Sigma_{XY} = 0$ and therefore

$$B \Sigma_{XX} = \Sigma_{YX} \tag{11}$$

If $\Sigma_{XX}$ is invertible, then $C x_t = 0$ for any $t$ is equivalent to $C = 0$, and thus the function $E(B)$ is strictly convex in $B$. The unique critical point is the global minimum given by $B = \Sigma_{YX} \Sigma_{XX}^{-1}$. As we shall see, the solution to the standard linear regression problem, together with the general approach given here to solve it, is also key for solving the more general linear autoencoder problem. The solution will also involve projection matrices.
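The closed-form regression solution can be compared with a generic least-squares solver. The sketch below is an illustration under assumed random complex data (names such as `B_closed` are hypothetical); it checks the normal equation of Equation 11 and that perturbing the solution raises the error.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 3, 25
X = rng.standard_normal((n, m)) + 1j * rng.standard_normal((n, m))
Y = rng.standard_normal((n, m)) + 1j * rng.standard_normal((n, m))

S_xx = X @ X.conj().T
S_yx = Y @ X.conj().T
B_closed = S_yx @ np.linalg.inv(S_xx)           # B = Sigma_YX Sigma_XX^{-1}

# Reference solution: lstsq solves X^T M ~ Y^T in the least-squares sense,
# which is the same problem as B X ~ Y with B = M^T (plain transpose).
B_lstsq = np.linalg.lstsq(X.T, Y.T, rcond=None)[0].T

def E(B):
    return np.sum(np.abs(Y - B @ X) ** 2)

# A random direction used to perturb the minimizer.
C = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
```

Because the problem is strictly convex when $\Sigma_{XX}$ is invertible, any nonzero perturbation such as `B_closed + 0.1 * C` should give a strictly larger error.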

Projection Matrices. For any $n \times k$ matrix $A$ with $k \le n$, let $P_A$ denote the orthogonal projection onto the subspace generated by the columns of $A$. Then $P_A$ is a Hermitian matrix and $P_A^2 = P_A$, $P_A A = A$, since the image of $P_A$ is spanned by the columns of $A$ and these are invariant under $P_A$. The kernel of $P_A$ is the space orthogonal to the space spanned by the columns of $A$. Obviously, we have $P_A (I_n - P_A) = 0$ and $A^{*} P_A = A^{*}$. The projection onto the space orthogonal to the space spanned by the columns of $A$ is given by $I_n - P_A$. In addition, if the columns of $A$ are independent (i.e. $A$ has full rank $k$), then the matrix of the orthogonal projection is given by $P_A = A (A^{*} A)^{-1} A^{*}$ Meyer (2000) and $P_A^{*} = P_A$. Note that all these relationships are true even when the columns of $A$ are not orthonormal.
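The projection identities above are easy to verify numerically. This is a minimal sketch assuming a random full-column-rank complex matrix; `projector` is a hypothetical helper name implementing the formula $A (A^{*}A)^{-1} A^{*}$.

```python
import numpy as np

def projector(A):
    """Orthogonal projection onto the column space of a full-column-rank A."""
    Ah = A.conj().T
    return A @ np.linalg.inv(Ah @ A) @ Ah

rng = np.random.default_rng(3)
n, k = 5, 2
A = rng.standard_normal((n, k)) + 1j * rng.standard_normal((n, k))
P = projector(A)

# A vector in the orthogonal complement of span(A): project a random vector
# with I - P, then check that P sends it to zero.
v = (np.eye(n) - P) @ (rng.standard_normal(n) + 1j * rng.standard_normal(n))
```

The columns of `A` here are not orthonormal, which illustrates that the identities do not depend on orthonormality.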

2.4 Some Misconceptions

As we shall see, in the complex case as in the real case, the global minimum corresponds to Principal Component Analysis. While the global minimum solution of linear autoencoders over infinite fields can be expressed analytically, it is often not well appreciated that there is more to be understood about linear autoencoders and the landscape of $E$. In particular, if one is interested in learning algorithms that proceed through incremental and somewhat “blind” weight adjustments, then one must study the entire landscape of $E$, including all the critical points of $E$, and derive and compare different learning algorithms. A second misconception is to believe that the problem is a convex optimization problem, hence somewhat trivial, since after all the error function is quadratic and the transformation is linear. The problem with this argument is that the small layer of size $p$ forces the overall transformation $AB$ to be of rank $p$ or less, and the set of matrices of rank at most $p$ is not convex. Furthermore, the problem is not convex when finite fields are considered. What is true and crucial for solving the linear autoencoders over infinite fields is that the problem becomes convex when $A$ or $B$ is fixed. A third misconception, related to the illusion of convexity, is that the landscape of linear neural networks never has any local minima. In general this is not true, especially if there are additional constraints on the linear transformation, such as restricted connectivity between layers so that some of the matrix entries are constrained to assume fixed values.

3 Group Invariances

For any autoencoder, it is important to investigate whether there is any group of transformations that leaves its properties invariant.

Change of Coordinates in the Hidden Layer. Note that for any invertible $p \times p$ complex matrix $C$, we have $AB = (AC)(C^{-1}B)$ and $E(A,B) = E(AC, C^{-1}B)$. Thus all the properties of the linear autoencoder are fundamentally invariant with respect to any change of coordinates in the hidden layer.

Change of Coordinates in the Input/Output Spaces. Consider an orthonormal change of coordinates in the output space defined by an orthogonal (or unitary) matrix $D$, and any change of coordinates in the input space defined by an invertible matrix $C$. This leads to a new autoencoder problem with input vectors $Cx_1, \ldots, Cx_m$ and target output vectors of the form $Dy_1, \ldots, Dy_m$ with reconstruction error of the form

$$E(A', B') = \sum_t \|D y_t - A' B' C x_t\|^2 \tag{12}$$

If we use the one-to-one mapping between pairs of matrices $(A, B)$ and $(A', B')$ defined by $A' = DA$ and $B' = BC^{-1}$, we have

$$E(A', B') = \sum_t \|D y_t - A' B' C x_t\|^2 = \sum_t \|D y_t - D A B x_t\|^2 = \sum_t \|y_t - A B x_t\|^2 \tag{13}$$

the last equality using the fact that $D$ is an isometry which preserves distances. Thus, using the transformation $A' = DA$ and $B' = BC^{-1}$, the original problem and the transformed problem are equivalent and the functions $E(A,B)$ and $E(A',B')$ have the same landscape. In particular, in the auto-associative case, we can take $C = D$ to be a unitary matrix. This leads to an equivalent autoencoder problem with input vectors $Cx_t$ and covariance matrix $C \Sigma_{XX} C^{*} = C \Sigma_{XX} C^{-1}$. For the proper choice of $C$ there is an equivalent problem where the basis of the space is provided by the eigenvectors of the covariance matrix and the covariance matrix is a diagonal matrix with diagonal entries equal to the eigenvalues of the original covariance matrix $\Sigma_{XX}$.
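Both invariances can be illustrated with a short numerical check. The sketch below assumes random complex data and matrices; the names `H`, `C`, and `D` are illustrative, with `H` playing the role of the hidden-layer change of coordinates and `D` obtained from a QR factorization so that it is unitary.

```python
import numpy as np

def error(A, B, X, Y):
    return np.sum(np.abs(Y - A @ B @ X) ** 2)

rng = np.random.default_rng(4)
n, p, m = 4, 2, 15

def cplx(*shape):
    return rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

X, Y, A, B = cplx(n, m), cplx(n, m), cplx(n, p), cplx(p, n)
E_ref = error(A, B, X, Y)

# Hidden layer: replace (A, B) by (A H, H^{-1} B) for an invertible p x p matrix H.
H = cplx(p, p)
E_hidden = error(A @ H, np.linalg.inv(H) @ B, X, Y)

# Input/output spaces: invertible C on the inputs, unitary D on the outputs,
# with A' = D A and B' = B C^{-1} as in Equation 13.
C = cplx(n, n)
D, _ = np.linalg.qr(cplx(n, n))
E_io = error(D @ A, B @ np.linalg.inv(C), C @ X, D @ Y)
```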

4 Fixed-Layer and Convexity Results

A key technique for studying any autoencoder is to simplify the problem by fixing all its transformations but one. Thus in this section we study what happens to the complex-valued linear autoencoder problem when either $A$ or $B$ is fixed, essentially reducing the problem to standard linear regression. The same approach can be applied to an autoencoder with more than one hidden layer (see section on Deep Architectures).

Theorem 1.

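(Fixed A) For any fixed $n \times p$ matrix $A$, the function $E(A,B)$ is convex in the coefficients of $B$ and attains its minimum for any $B$ satisfying the equation

$$A^{*} A B \Sigma_{XX} = A^{*} \Sigma_{YX} \tag{14}$$

If $\Sigma_{XX}$ is invertible and $A$ is of full rank $p$, then $E$ is strictly convex and has a unique minimum reached when

$$B = (A^{*} A)^{-1} A^{*} \Sigma_{YX} \Sigma_{XX}^{-1} \tag{15}$$

In the auto-associative case, if $\Sigma_{XX}$ is invertible and $A$ is of full rank $p$, then the optimal $B$ has full rank $p$ and does not depend on the data. It is given by

$$B = (A^{*} A)^{-1} A^{*} \tag{16}$$

and in this case, $W = AB = A (A^{*} A)^{-1} A^{*} = P_A$ and $BA = I_p$.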

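Proof. We write

$$E(A,B) = \sum_t \Big( x_t^{*} B^{*} A^{*} A B x_t - 2\,\mathrm{Re}\big(y_t^{*} A B x_t\big) + \|y_t\|^2 \Big) \tag{17}$$

Then for fixed $A$, $E$ is a convex function because the associated quadratic form is equal to

$$\sum_t x_t^{*} C^{*} A^{*} A C x_t = \sum_t \|A C x_t\|^2 \ge 0 \tag{18}$$

for any $p \times n$ matrix $C$. Let $B$ be a critical point. Then by definition for any $p \times n$ matrix $C$ we must have $\lim_{\epsilon \to 0} [E(A, B + \epsilon C) - E(A, B)]/\epsilon = 0$. Expanding and simplifying this expression gives

$$\sum_t \Big( x_t^{*} B^{*} A^{*} A C x_t - y_t^{*} A C x_t \Big) = 0 \tag{19}$$

for all $p \times n$ matrices $C$. Using the linearity of the trace operator and its invariance under circular permutation of its arguments, this is equivalent to

$$\mathrm{Tr}\big((\Sigma_{XX} B^{*} A^{*} A - \Sigma_{XY} A)\,C\big) = 0 \tag{20}$$

for any $C$. Thus we have $\Sigma_{XX} B^{*} A^{*} A - \Sigma_{XY} A = 0$ and therefore

$$A^{*} A B \Sigma_{XX} = A^{*} \Sigma_{YX} \tag{21}$$

Finally, if $\Sigma_{XX}$ is invertible and if $A$ is of full rank, then $A C x_t = 0$ for any $t$ is equivalent to $C = 0$, and thus the function $E(A,B)$ is strictly convex in $B$. Since $A^{*}A$ is invertible, the unique critical point is obtained by solving Equation 14.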
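The fixed-$A$ result can be sketched in a few lines of NumPy. The helper name `optimal_B` and the random data are assumptions made for illustration; the function implements Equation 15 and is then specialized to the auto-associative case.

```python
import numpy as np

def optimal_B(A, S_yx, S_xx):
    """Equation 15, assuming A has full column rank and S_xx is invertible."""
    Ah = A.conj().T
    return np.linalg.solve(Ah @ A, Ah @ S_yx) @ np.linalg.inv(S_xx)

def error(A, B, X, Y):
    return np.sum(np.abs(Y - A @ B @ X) ** 2)

rng = np.random.default_rng(5)
n, p, m = 5, 2, 40

def cplx(*shape):
    return rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

X, Y, A = cplx(n, m), cplx(n, m), cplx(n, p)
S_xx, S_yx = X @ X.conj().T, Y @ X.conj().T

B_hetero = optimal_B(A, S_yx, S_xx)    # general case with external targets
B_auto = optimal_B(A, S_xx, S_xx)      # auto-associative case: Sigma_YX = Sigma_XX
```

In the auto-associative call the data cancels out, so `B_auto` should coincide with $(A^{*}A)^{-1}A^{*}$ and satisfy $BA = I_p$.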

In similar fashion, we have the following theorem.

Theorem 2 (Fixed B).

For any fixed $p \times n$ matrix $B$, the function $E(A,B)$ is convex in the coefficients of $A$ and attains its minimum for any $A$ satisfying the equation

$$A B \Sigma_{XX} B^{*} = \Sigma_{YX} B^{*} \tag{22}$$

If $\Sigma_{XX}$ is invertible and $B$ is of full rank, then $E$ is strictly convex and has a unique minimum reached when

$$A = \Sigma_{YX} B^{*} (B \Sigma_{XX} B^{*})^{-1} \tag{23}$$

In the auto-associative case, if $\Sigma_{XX}$ is invertible and $B$ is of full rank, then the optimal $A$ has full rank and depends on the data. It is given by

$$A = \Sigma_{XX} B^{*} (B \Sigma_{XX} B^{*})^{-1} \tag{24}$$

and $BA = I_p$.

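Proof. From Equation 17, the function $E(A,B)$ is a convex function in $A$. The condition for $A$ to be a critical point is

$$\sum_t \Big( x_t^{*} B^{*} A^{*} C B x_t - y_t^{*} C B x_t \Big) = 0 \tag{25}$$

for any $n \times p$ matrix $C$, which is equivalent to

$$\mathrm{Tr}\big((B \Sigma_{XX} B^{*} A^{*} - B \Sigma_{XY})\,C\big) = 0 \tag{26}$$

for any matrix $C$. Thus $B \Sigma_{XX} B^{*} A^{*} - B \Sigma_{XY} = 0$, which implies Equation 22 by conjugate transposition. The other assertions of the theorem can easily be deduced.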
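A companion sketch for the fixed-$B$ case follows. As before, the helper name `optimal_A` and the random data are illustrative assumptions; the function implements Equation 23.

```python
import numpy as np

def optimal_A(B, S_yx, S_xx):
    """Equation 23, assuming B has full row rank and S_xx is invertible."""
    Bh = B.conj().T
    return S_yx @ Bh @ np.linalg.inv(B @ S_xx @ Bh)

def error(A, B, X, Y):
    return np.sum(np.abs(Y - A @ B @ X) ** 2)

rng = np.random.default_rng(6)
n, p, m = 5, 2, 40

def cplx(*shape):
    return rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

X, Y, B = cplx(n, m), cplx(n, m), cplx(p, n)
S_xx, S_yx = X @ X.conj().T, Y @ X.conj().T

A_hetero = optimal_A(B, S_yx, S_xx)    # general case with external targets
A_auto = optimal_A(B, S_xx, S_xx)      # auto-associative case, Equation 24
```

Unlike the optimal $B$ of Theorem 1, the optimal $A$ keeps its dependence on $\Sigma_{XX}$ even in the auto-associative case.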

Remark 1.

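Note that from Theorems 1 and 2 and their proofs, we have that $(A,B)$ is a critical point of $E(A,B)$ if and only if Equation 14 and Equation 22 are simultaneously satisfied, that is if and only if $A^{*} A B \Sigma_{XX} = A^{*} \Sigma_{YX}$ and $A B \Sigma_{XX} B^{*} = \Sigma_{YX} B^{*}$.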

5 Critical Points and the Landscape of E

In this section we further study the landscape of $E$, its critical points, and the properties of the global map $W = AB$ at those critical points.

Theorem 3.

(Critical Points) Assume that $\Sigma_{XX}$ is invertible. Then two matrices $A$ and $B$ define a critical point of $E$ if and only if the global map $W = AB$ is of the form

$$W = P_A \Sigma_{YX} \Sigma_{XX}^{-1} \tag{27}$$

with $A$ satisfying

$$P_A \Sigma = P_A \Sigma P_A = \Sigma P_A \tag{28}$$

In the auto-associative case, the above becomes

$$W = AB = P_A \tag{29}$$

and

$$P_A \Sigma_{XX} = P_A \Sigma_{XX} P_A = \Sigma_{XX} P_A \tag{30}$$

If $A$ is of full rank, then the pair $(A,B)$ defines a critical point of $E$ if and only if $A$ satisfies Equation 28 and $B$ satisfies Equation 16. Hence $B$ must also be of full rank.

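Proof. If $(A,B)$ is a critical point of $E$, then from Equation 14, we must have

$$A^{*} \big(AB - \Sigma_{YX} \Sigma_{XX}^{-1}\big) = 0 \tag{31}$$

Let

$$S = AB - P_A \Sigma_{YX} \Sigma_{XX}^{-1} \tag{32}$$

Then since $A^{*} P_A = A^{*}$, we have $A^{*} S = 0$. Thus the space spanned by the columns of $S$ is a subset of the space orthogonal to the space spanned by the columns of $A$ (i.e. $\mathrm{span}(S) \subseteq \mathrm{span}(A)^{\perp}$). On the other hand, since

$$P_A S = S \tag{33}$$

$S$ is also in the space spanned by the columns of $A$ (i.e. $\mathrm{span}(S) \subseteq \mathrm{span}(A)$). Taken together, these two facts imply that $S = 0$, resulting in $W = AB = P_A \Sigma_{YX} \Sigma_{XX}^{-1}$, which proves Equation 27. Note that for this result, we need only $B$ to be critical (i.e. optimized with respect to $A$). Using the definition of $\Sigma$, we have

$$P_A \Sigma P_A = P_A \Sigma_{YX} \Sigma_{XX}^{-1} \Sigma_{XX} \Sigma_{XX}^{-1} \Sigma_{XY} P_A \tag{34}$$

Since $S = 0$, we have $AB = P_A \Sigma_{YX} \Sigma_{XX}^{-1}$ and thus

$$P_A \Sigma P_A = P_A \Sigma_{YX} \Sigma_{XX}^{-1} \Sigma_{XX} \Sigma_{XX}^{-1} \Sigma_{XY} P_A = A B \Sigma_{XX} B^{*} A^{*} \tag{35}$$

Similarly, we have

$$P_A \Sigma = A B \Sigma_{XY} \tag{36}$$

and

$$\Sigma P_A = \Sigma_{YX} B^{*} A^{*} \tag{37}$$

Then Equation 28 results immediately by combining Equations 35, 36, and 37 using Equation 22. The rest of the theorem follows easily.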

Remark 2.

The above proof unifies the cases when $A$ is of rank $p$ and of rank less than $p$, and avoids the need for two separate proofs, as was done in earlier work Baldi and Hornik (1988) for the real-valued case.

Theorem 4.

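(Critical Points of Full Rank) Assume that $\Sigma$ is of full rank with $n$ distinct eigenvalues $\lambda_1 > \cdots > \lambda_n$ and let $u_1, \ldots, u_n$ denote a corresponding basis of orthonormal eigenvectors. If $I = \{i_1, \ldots, i_p\}$ ($1 \le i_1 < \cdots < i_p \le n$) is any ordered set of indices of size $p$, let $U_I = (u_{i_1}, \ldots, u_{i_p})$ denote the matrix formed using the corresponding column eigenvectors. Then two full rank matrices $A$ and $B$ define a critical point of $E$ if and only if there exists an ordered $p$-index set $I$ and an invertible $p \times p$ matrix $C$ such that

$$A = U_I C \quad \text{and} \quad B = C^{-1} U_I^{*} \Sigma_{YX} \Sigma_{XX}^{-1} \tag{38}$$

For such a critical point, we have

$$W = AB = P_{U_I} \Sigma_{YX} \Sigma_{XX}^{-1} \tag{39}$$

and

$$E(A,B) = \mathrm{Tr}\,\Sigma_{YY} - \sum_{i \in I} \lambda_i \tag{40}$$

In the auto-associative case, these equations reduce to

$$A = U_I C \quad \text{and} \quad B = C^{-1} U_I^{*} \tag{41}$$

$$W = AB = P_{U_I} \tag{42}$$

and

$$E(A,B) = \mathrm{Tr}\,\Sigma - \sum_{i \in I} \lambda_i = \sum_{i \in \bar{I}} \lambda_i \tag{43}$$

where $\bar{I} = \{1, \ldots, n\} \setminus I$ is the complement of $I$.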

Proof. Since $P_A \Sigma = \Sigma P_A$, we have

$$P_A \Sigma A = \Sigma P_A A = \Sigma A \tag{44}$$

Thus the columns of $A$ span an invariant subspace of $\Sigma$, and therefore $A$ is of the form $U_I C$. The conclusion for $B$ follows from Equation 27 and the rest is easily deduced, as in the real case. Equation 43 can be derived easily by using the remarks in Section 3 and using the unitary change of coordinates under which $\Sigma_{XX}$ becomes a diagonal matrix. In this system of coordinates, we have

$$E(A,B) = \sum_t \|y_t\|^2 + \sum_t \mathrm{Tr}\big(x_t^{*} (AB)^{*} AB\, x_t\big) - 2 \sum_t \mathrm{Tr}\big(y_t^{*} AB\, x_t\big)$$

Therefore, using the invariance property of the trace under permutation, we have

$$E(A,B) = \mathrm{Tr}(\Sigma) + \mathrm{Tr}\big((AB)^2 \Sigma\big) - 2\,\mathrm{Tr}(AB\,\Sigma)$$

Since $AB$ is a projection operator, this yields Equation 43. In the auto-associative case with these coordinates it is easy to see that $W x_t$ and $E(A,B)$ are easily computed from the values of $W u_i$. In particular, $E(A,B) = \sum_{i=1}^{n} \lambda_i \|u_i - W u_i\|^2$. In addition, at the critical points, we have $W u_i = u_i$ if $i \in I$, and $W u_i = 0$ otherwise.
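To make the full-rank critical points concrete, the sketch below builds pairs $(A, B)$ from Equation 41 for a few index sets in the auto-associative case and compares the resulting error with the sum of the unused eigenvalues (Equation 43). Indices are 0-based in the code, and the data, sizes, and helper names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)
n, p, m = 5, 2, 50
X = rng.standard_normal((n, m)) + 1j * rng.standard_normal((n, m))
S = X @ X.conj().T                    # Sigma = Sigma_XX in the auto-associative case

lam, U = np.linalg.eigh(S)            # eigh returns eigenvalues in ascending order
lam, U = lam[::-1], U[:, ::-1]        # reorder so that lam[0] > lam[1] > ...

def critical_point(I, C):
    """Equation 41: A = U_I C and B = C^{-1} U_I^*."""
    U_I = U[:, I]
    return U_I @ C, np.linalg.inv(C) @ U_I.conj().T

def error(A, B):
    return np.sum(np.abs(X - A @ B @ X) ** 2)

C = rng.standard_normal((p, p)) + 1j * rng.standard_normal((p, p))
index_sets = [(0, 1), (0, 3), (2, 4)]
errors = {I: error(*critical_point(list(I), C)) for I in index_sets}
```

Among the index sets tried here, the one selecting the two largest eigenvalues, `(0, 1)`, gives the smallest error, in line with the absence of other minima established later in the text.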

Remark 3.

All the previous theorems are true in the hetero-associative case with targets $y_t$. Thus they can readily be applied to address the linear denoising autoencoder Vincent et al. (2008); Vincent (2011) over $\mathbb{R}$ or $\mathbb{C}$. The linear denoising autoencoder is an autoencoder trained to remove noise by having to associate noisy versions of the inputs with the correct inputs. In other words, using the current notation, it is an autoencoder where the inputs $x_t$ are replaced by $x_t + n_t$, where $n_t$ is the noise vector, and the target outputs are of the form $y_t = x_t$. Thus the previous theorems can be applied using the following replacements: $\Sigma_{XX} \to \Sigma_{XX} + \Sigma_{NN} + \Sigma_{NX} + \Sigma_{XN}$, $\Sigma_{XY} \to \Sigma_{XX} + \Sigma_{NX}$, $\Sigma_{YX} \to \Sigma_{XX} + \Sigma_{XN}$. Further simplifications can be obtained using particular assumptions on the noise, such as $\Sigma_{NX} = \Sigma_{XN} = 0$.
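Under the covariance definition of Equation 5, the covariances of the denoising problem expand into sums of signal and noise terms. The following sketch checks that expansion numerically; the noise scale, the names `N` and `cov`, and the data are illustrative assumptions.

```python
import numpy as np

def cov(U, V):
    """Sigma_UV = sum_t u_t v_t^* for data stored column-wise."""
    return U @ V.conj().T

rng = np.random.default_rng(8)
n, m = 3, 20
X = rng.standard_normal((n, m)) + 1j * rng.standard_normal((n, m))
N = 0.1 * (rng.standard_normal((n, m)) + 1j * rng.standard_normal((n, m)))

X_noisy, Y = X + N, X                 # noisy inputs, clean targets

# Covariances of the denoising problem, expanded into signal and noise terms.
S_xx_new = cov(X, X) + cov(N, N) + cov(N, X) + cov(X, N)
S_xy_new = cov(X, X) + cov(N, X)
S_yx_new = cov(X, X) + cov(X, N)
```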

Theorem 5.

(Absence of Local Minima) The global minimum of the complex linear autoencoder is achieved by full rank matrices $A$ and $B$ associated with the index set $I = \{1, \ldots, p\}$ of the $p$ largest eigenvalues of $\Sigma$, with $A = U_I C$ and $B = C^{-1} U_I^{*}$ (and where $C$ is any invertible $p \times p$ matrix). When $C = I_p$, $A = B^{*}$. All other critical points are saddle points associated with corresponding projections onto non-optimal sets of eigenvectors of $\Sigma$ of size $p$ or less.

Proof. The proof is by a perturbation argument, as in the real case, showing that for critical points that are not associated with the global minimum there is always a direction of escape that can be derived using unused eigenvectors associated with higher eigenvalues in order to lower the error (see Baldi and Hornik (1988) for more details). The proof can be made very simple by using the group invariance properties under transformation of the coordinates by a unitary matrix. With such a transformation, it is sufficient to study the landscape of $E$ when $\Sigma$ is a diagonal matrix and $A = B^{*} = U_I$.
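The escape direction used in the perturbation argument can be illustrated numerically in the auto-associative case. The sketch below starts from the critical point built on a non-optimal index set (0-based columns 0 and 2) and rotates one column of $A$ slightly towards an unused eigenvector. The rotation angle, data, and helper names are assumptions chosen for illustration, not the construction used in the cited proof.

```python
import numpy as np

rng = np.random.default_rng(9)
n, m = 5, 60
X = rng.standard_normal((n, m)) + 1j * rng.standard_normal((n, m))
S = X @ X.conj().T

lam, U = np.linalg.eigh(S)
lam, U = lam[::-1], U[:, ::-1]        # descending eigenvalues

def error(A, B):
    return np.sum(np.abs(X - A @ B @ X) ** 2)

def rotated(theta, unused):
    """Critical point on columns (0, 2), second column rotated towards U[:, unused]."""
    a2 = np.cos(theta) * U[:, 2] + np.sin(theta) * U[:, unused]
    A = np.column_stack([U[:, 0], a2])
    return A, A.conj().T              # orthonormal columns, so B = A^* is optimal for A

E_crit = error(*rotated(0.0, 1))      # the unperturbed critical point
E_down = error(*rotated(0.1, 1))      # towards an unused eigenvector with larger eigenvalue
E_up = error(*rotated(0.1, 4))        # towards an unused eigenvector with smaller eigenvalue
```

The error goes down in one direction and up in the other, which is the signature of a saddle point.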

Remark 4.

At the global minimum, if $C$ is the identity matrix ($C = I_p$), then in the auto-associative case the activities in the hidden layer are given by $u_1^{*} x, \ldots, u_p^{*} x$, corresponding to the coordinates of $x$ along the first $p$ eigenvectors of $\Sigma_{XX}$. These are the so-called principal components of $x$ and the autoencoder implements a form of Principal Component Analysis (PCA) also closely related to Singular Value Decomposition (SVD).
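The link between the global minimum, PCA, and the SVD can be sketched as follows, assuming column-wise data and $C = I_p$. The variable names are illustrative; the check relies on the standard fact that the squared singular values of the data matrix are the eigenvalues of $\Sigma_{XX}$.

```python
import numpy as np

rng = np.random.default_rng(10)
n, p, m = 5, 2, 40
X = rng.standard_normal((n, m)) + 1j * rng.standard_normal((n, m))
S = X @ X.conj().T

lam, U = np.linalg.eigh(S)
lam, U = lam[::-1], U[:, ::-1]        # descending eigenvalues
U_p = U[:, :p]                        # first p eigenvectors of Sigma_XX

H = U_p.conj().T @ X                  # hidden activities u_i^* x_t, one column per input
X_hat = U_p @ H                       # reconstructions, equal to the projections of x_t

# The same subspace recovered from the SVD of the data matrix.
U_svd, s, _ = np.linalg.svd(X, full_matrices=False)
P_svd = U_svd[:, :p] @ U_svd[:, :p].conj().T
```

Individual eigenvectors are only defined up to a phase factor, so the comparison is made between projection matrices rather than between the vectors themselves.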

The theorem above shows that when $\Sigma$ is full rank, there is a special class of critical points associated with $C = I_p$. In the auto-associative case, this class is characterized by the fact that $A$ and $B$ are conjugate transposes of each other ($A = B^{*}$) in the complex-valued case, or transposes of each other ($A = B^{\mathsf{T}}$) in the real-valued case. This class of critical points is special for several reasons. For instance, in the related Restricted Boltzmann Machine autoencoders the weights between visible and hidden units are required to be symmetric, corresponding to $A = B^{\mathsf{T}}$. More importantly, these critical points are closely connected to Hebbian learning (see also Oja (1982, 1989, 1992)). In particular, for linear real-valued autoencoders, if $A = B^{\mathsf{T}}$ and $E = 0$ so that inputs are equal to outputs, any learning rule that is symmetric with respect to the pre- and post-synaptic activities, which is typically the case for Hebbian rules, will modify $A$ and $B$ but preserve the property that $A = B^{\mathsf{T}}$. This remains roughly true even if $E$ is not exactly zero. Thus for linear real-valued autoencoders, there is something special about transposition operating on $A$ and $B$, and more generally one can suspect that a similar role is played by conjugate transposition in the case of linear complex-valued autoencoders. The next theorem and the following section on learning algorithms further clarify this point.

Theorem 6.

(Conjugate Transposition) Assume $\Sigma_{XX}$ is of full rank in the auto-associative case. Consider any point $(A,B)$ where $B$ has been optimized with respect to $A$, including all critical points. Then

$$W = AB = B^{*} A^{*} A B = B^{*} A^{*} = W^{*} \quad \text{and} \quad E(A,B) = E(B^{*}, A^{*}) \tag{45}$$

Furthermore, when $A$ is full rank

$$W = P_A = P_A^{*} = W^{*} \tag{46}$$

Proof. By Theorem 1, in the auto-associative case, we have

$$A^{*} A B = A^{*}$$

Thus, by taking the conjugate transpose of each side, we have

$$B^{*} A^{*} A = A$$

It follows that

$$B^{*} A^{*} = B^{*} A^{*} A B = A B$$

which proves Equation 45. If in addition $A$ is full rank, then by Theorem 1 $W = AB = P_A$ and the rest follows immediately.
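A quick numerical check of the conjugate-transposition property is sketched below for a random full-rank $A$ with $B$ optimized by Equation 16. Names and data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(12)
n, p, m = 5, 2, 30

def cplx(*shape):
    return rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

X, A = cplx(n, m), cplx(n, p)
B = np.linalg.solve(A.conj().T @ A, A.conj().T)   # Equation 16: B optimized for A

def error(A, B):
    return np.sum(np.abs(X - A @ B @ X) ** 2)

W = A @ B
E_AB = error(A, B)
E_swapped = error(B.conj().T, A.conj().T)         # E(B^*, A^*)
```

Note that `A` itself is not a critical point here; only `B` has been optimized, which is all the theorem requires.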

Remark 5.

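Note the following. Starting from a pair $(A,B)$ with $W = AB$ and where $B$ has been optimized with respect to $A$, let $A' = B^{*}$ and optimize $B$ again so that $B' = (A'^{*} A')^{-1} A'^{*}$. Then we also have

$$W' = A'B' = W^{*} = W = P_A \quad \text{and} \quad E(A,B) = E(A',B') \tag{47}$$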

6 Optimization or Learning Algorithms

Although mathematical formulas for the global minimum solution of the linear autoencoder have been derived, the global solution may not be available immediately to a self-adjusting learning circuit capable of making only small adjustments at each learning step. Small adjustments may also be preferable in a non-stationary environment where the set of training vectors changes with time. Furthermore, the study of small adjustment algorithms in linear circuits may shed some light on similar incremental algorithms applied to non-linear circuits where the global optimum cannot be derived analytically. Thus, from a learning algorithm standpoint, it is still useful to consider incremental optimization algorithms, such as gradient descent or partial EM steps, even when such algorithms are slower or less accurate than direct global optimization. The previous theorems suggest two kinds of operations that could be used in various combinations to iteratively minimize $E$, taking full or partial steps: (1) Partial minimization: fix $A$ (resp. $B$) and minimize for $B$ (resp. $A$); (2) Conjugate transposition: fix $A$ (resp. $B$), and set $B = A^{*}$ (resp. $A = B^{*}$). The latter operation is reserved for the auto-associative case, and particularly so if one is interested in converging to solutions where $A$ and $B$ are conjugate transposes of each other, i.e. where $A = B^{*}$.

Theorem 7.

(Alternate Minimization) Consider the algorithm where $A$ and $B$ are optimized in alternation (starting from $A$ or $B$), holding the other one fixed. This algorithm will converge to a critical point of $E$. Furthermore, if the starting value of $A$ or $B$ is initialized randomly, then with probability one the algorithm will converge to a critical point where both $A$ and $B$ are full rank.

Proof: A direct proof of convergence is given in Appendix B. Here we give an indirect, but perhaps more illuminating proof, by remarking that the alternate minimization algorithm is in fact an instance of the general EM algorithm Dempster et al. (1977) combined with a hard decision, similar to the Viterbi learning algorithm for HMMs or the k-means clustering algorithm with hard assignment. For this, consider that we have a probabilistic model over the data with parameters $B$ and hidden variables $A$, or vice versa, with parameters $A$ and hidden variables $B$. The conditional probability of the data and the hidden variables is given by:

$$P(X, Y, A \mid B) = \frac{1}{Z_1} e^{-E(A,B)} \tag{48}$$

or

$$P(X, Y, B \mid A) = \frac{1}{Z_2} e^{-E(A,B)} \tag{49}$$

where $Z_1$ and $Z_2$ denote the proper normalizing constants (partition functions). During the E step, we find the most probable value of the hidden variables given the data and current value of the parameters. Since $E$ is quadratic, the model in Equation 48 is Gaussian and the mean and mode are identical. Thus the hard assignment of the hidden variables in the E step corresponds to optimizing $A$ or $B$ using Theorem 1 or Theorem 2. During the M step, the parameters are optimized given the value of the hidden variables. Thus the M step also corresponds to optimizing $A$ or $B$ using Theorem 1 or Theorem 2. As a result, convergence to a critical point of $E$ is ensured by the general convergence theorem of the EM algorithm Dempster et al. (1977). Since $A$ and $B$ are initialized randomly, they are full rank with probability one and, by Theorems 1 and 2, they retain their full rank after each optimization step. Note that the error $E$ is always positive, strictly convex in $A$ or $B$, decreases at each optimization step, and thus must converge to a limit. By looking at every other step in the algorithm, it is easy to see that $A$ must converge. From which one can see that $B$ must converge, and so must $W = AB$.
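The alternate minimization scheme can be sketched directly from Equations 16 and 24 in the auto-associative case. The row scaling of the data (chosen to give a well-separated spectrum so that the run converges quickly), the iteration count, and the helper names are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(11)
n, p, m = 6, 2, 80

def cplx(*shape):
    return rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

# Rows are scaled so that the covariance eigenvalues are well separated.
scales = np.array([5.0, 4.0, 3.0, 2.0, 1.0, 0.5])
X = scales[:, None] * cplx(n, m)
S = X @ X.conj().T

def opt_B(A):                         # Equation 16: optimal B for fixed A
    Ah = A.conj().T
    return np.linalg.solve(Ah @ A, Ah)

def opt_A(B):                         # Equation 24: optimal A for fixed B
    Bh = B.conj().T
    return S @ Bh @ np.linalg.inv(B @ S @ Bh)

def error(A, B):
    return np.sum(np.abs(X - A @ B @ X) ** 2)

A = cplx(n, p)                        # random start, full rank with probability one
errors = []
for _ in range(100):
    B = opt_B(A)
    A = opt_A(B)
    errors.append(error(A, B))

lam = np.linalg.eigvalsh(S)           # ascending eigenvalues
E_pca = lam[: n - p].sum()            # sum of the n - p smallest eigenvalues
```

In this run the error decreases monotonically and its limit matches the PCA error `E_pca`. The theorem itself only guarantees convergence to a critical point, so reaching the global minimum here reflects the random initialization rather than a general guarantee.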

Given the importance of conjugate transposition (Theorem 6) in the auto-associative case, one may also consider algorithms where the operations of conjugate transposition and partial optimization of and are interleaved. This can be carried in many ways. Let denote that is obtained from by optimization (Equation 16) and denote that is obtained from by conjugate transposition (), and similarly for (Equation 24) and (). Let also denote the operation where both and are obtained by simultaneous conjugate transposition from their current values. Then starting from (random) and , here are several possible algorithms:

• Algorithm 1: .

• Algorithm 2: .

• Algorithm 3: .

• Algorithm 4: .

• Algorithm 5: .

• Algorithm 6: .

• Algorithm 7: .

The theory presented so far allows us to understand their behavior easily (Figure 3), considering a consecutive update of and as one iteration. Algorithms 1 and 2 converge with probability one to a critical point where and are full rank. Algorithm 1 may be slightly faster than Algorithm 2 at the beginning since in the first step Algorithm 1 takes into account the data (Equation 24), whereas Algorithm 2 ignores it. Algorithms 3, 4, and 5 converge and lead to a solution where (or, equivalently, ). Algorithms 3 and 5 take the same time and are faster than Algorithm 4. Algorithm 2 and Algorithm 4 take the same time. Algorithm 3 requires almost twice the number of steps of Algorithm 1. But Algorithm 4 is faster than Algorithm 3. This is because in Algorithm 3, the steps is basically like switching the matrices and , and the error after the step is the same as the error after the step . Algorithms 6 and 7 in general will not converge. Only optimization steps with respect to the matrix are being carried and therefore the data is never considered.

7 Generalization Properties

One of the most fundamental problems in machine learning is to understand the generalization properties of a learning system. Although in general this is not a simple problem, in the case of the autoencoder the generalization properties can easily be understood. After learning, $A$ and $B$ must be at a critical point. Assuming without much loss of generality that $A$ is also full rank and $\Sigma_{XX}$ is invertible, then from Theorem 1 we know in the auto-associative case that $W = AB = P_A$. Thus we have the following result.

Theorem 8.

(Generalization Properties) Assume in the auto-associative case that is invertible. For any learning algorithm that converges to a point where is optimized with respect to and is full rank (including all full rank critical points), then for any vector we have and

$$E(x) = \|x - ABx\|^2 = \|x - P_A x\|^2 \tag{50}$$

Remark 6.

Thus the reconstruction error of any vector is equal to the square of its distance to the subspace spanned by the columns of , or the square of the norm of its projection onto the orthogonal subspace. The general hetero-associative case can also be treated using Theorem 1. In this case, under the same assumptions, we have: .
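The identity in Equation 50 can be checked numerically on a vector that plays no role in training. This is a minimal sketch under stated assumptions: $A$ is a random full-rank matrix, $B$ is taken to be $(A^*A)^{-1}A^*$ so that $AB = P_A$, and the sizes and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5, 2
A = rng.standard_normal((n, p)) + 1j * rng.standard_normal((n, p))  # full rank with probability one
B = np.linalg.solve(A.conj().T @ A, A.conj().T)  # B = (A* A)^{-1} A*
P_A = A @ B                                      # orthogonal projection onto the columns of A

x = rng.standard_normal(n) + 1j * rng.standard_normal(n)  # arbitrary test vector
recon_error = np.linalg.norm(x - A @ B @ x) ** 2

# Squared distance from x to the column space of A, computed independently by least squares
coeffs, *_ = np.linalg.lstsq(A, x, rcond=None)
dist_sq = np.linalg.norm(x - A @ coeffs) ** 2

# Squared norm of the projection of x onto the orthogonal complement
orth_sq = np.linalg.norm((np.eye(n) - P_A) @ x) ** 2
```

The three quantities `recon_error`, `dist_sq`, and `orth_sq` should agree up to rounding, which is the content of Remark 6.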

8 Recycling or Iteration Properties

Likewise, for the linear auto-associative case, one can also easily understand what happens when the outputs of the network are recycled into the inputs after learning. In the case of RBMs, this is similar to alternately sampling from the input and hidden layers. Interestingly, this also provides an alternative characterization of the critical points. At a critical point where is a projection, we must have . Thus, after learning, the iterates are easy to understand: they converge, and all points become stable, after a single cycle. If is in the space spanned by the columns of we have for any . If is not in the space spanned by the columns of , then for , where is the projection of onto the space spanned by the columns of ().

Theorem 9.

(Recycling Properties) Assume in the auto-associative case that is invertible. For any learning algorithm that converges to a point where is optimized with respect to and is full rank (including all full rank critical points), then for any vector and any integer , we have

$$W^m(x) = P_A^m(x) = P_A(x) \tag{51}$$

Remark 7.

There is a partial converse to this result, in the following sense. Assume that is a projection () and therefore . If is of full rank, then . Furthermore, if is of full rank, then (note that immediately implies that ). Multiplying this relation by on the left and on the right, yields after simplification, and therefore . Thus, according to Theorem 1, is critical and . Note that under the sole assumption that is a projection, there is no reason for to be critical, since there is no reason for to depend on the data and on .
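The single-cycle stability in Theorem 9 rests on the idempotence of the projection. As a short worked step, assuming the usual expression $P_A = A(A^*A)^{-1}A^*$ for the orthogonal projection onto the columns of a full-rank $A$:

```latex
\begin{aligned}
P_A^2 &= A(A^*A)^{-1}\,(A^*A)\,(A^*A)^{-1}A^* = A(A^*A)^{-1}A^* = P_A, \\
P_A^m &= P_A^{m-2}\,P_A^2 = P_A^{m-1} = \cdots = P_A \qquad (m \ge 1).
\end{aligned}
```

Hence the first pass through the network maps $x$ to $P_A x$, and every later pass leaves $P_A x$ unchanged.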

9 Deep Architectures

Autoencoders can be composed vertically (Figure 4), as in the deep architecture approach described in Hinton et al. (2006); Hinton and Salakhutdinov (2006), where a stack of RBMs is trained in an unsupervised way, in a bottom-up fashion, by using the activity in the hidden layer of an RBM in the stack as the input for the next RBM in the stack. Similar architectures and algorithms can be applied to linear networks. Consider for instance training a 10/5/10 autoencoder and then using the activities in the hidden layer to train a 5/3/5 autoencoder. This architecture can be contrasted with a 10/5/3/5/10 architecture, or a 10/3/10 architecture. In all cases, the overall transformation is linear and constrained in rank by the size of the smallest layer in the architecture. Thus all three architectures have the same optimal solution, associated with Principal Component Analysis using the eigenvectors of the 3 largest eigenvalues. However, the landscapes of the error functions and the learning trajectories may be different, and other considerations may play a role in the choice of an architecture.
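The equivalence of the stacked and direct architectures at their optima can be illustrated numerically. This is a minimal sketch under stated assumptions: each autoencoder is represented by one particular global minimum, with the encoder given by the conjugate transpose of the top eigenvectors and the decoder by the eigenvectors themselves. The helper `top_eigvecs` and the synthetic data are hypothetical.

```python
import numpy as np

def top_eigvecs(S, k):
    """Return the k eigenvectors of Hermitian S with the largest eigenvalues, as columns."""
    _, vecs = np.linalg.eigh(S)  # eigh returns eigenvalues in ascending order
    return vecs[:, ::-1][:, :k]

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 500)) + 1j * rng.standard_normal((10, 500))
X *= np.arange(10, 0, -1)[:, None]  # spread the variances so the eigenvalues are well separated
S = X @ X.conj().T

# Stacked: a 10/5/10 autoencoder, then a 5/3/5 autoencoder trained on the hidden activities
U5 = top_eigvecs(S, 5)
hidden = U5.conj().T @ X
V3 = top_eigvecs(hidden @ hidden.conj().T, 3)
W_stacked = U5 @ V3 @ V3.conj().T @ U5.conj().T

# Direct: a 10/3/10 autoencoder
U3 = top_eigvecs(S, 3)
W_direct = U3 @ U3.conj().T
```

Both overall maps should coincide with the orthogonal projection onto the span of the three leading eigenvectors, even though the individual eigenvectors are only defined up to a phase.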

In any case, the theory developed here can be adapted to multi-layer real-valued or complex-valued linear networks. Overall, such networks implement a linear transformation with a rank restriction associated with the smallest hidden layer. As in the single hidden layer case, the overall distortion is convex in any single matrix while all the other matrices are held fixed. Any algorithm that successively, or randomly, optimizes each matrix with respect to all the others will converge to a critical point, which will be full rank with probability one if the matrices are initialized randomly. For instance, to be more precise, consider a network with five stages associated with the five matrices $A$, $B$, $C$, $D$, and $F$ of the proper sizes and the error function $E(A,B,C,D,F) = \sum_t \|y_t - ABCDFx_t\|^2$.

Theorem 10.

For any fixed set of matrices $A$, $B$, $D$, and $F$, the function $E(A,B,C,D,F)$ is convex in the coefficients of $C$ and attains its minimum for any $C$ satisfying the equation

$$B^*A^*ABCDF\Sigma_{XX}F^*D^* = B^*A^*\Sigma_{YX}F^*D^* \tag{52}$$

If $\Sigma_{XX}$ is invertible and $AB$ and $DF$ are of full rank, then $E$ is strictly convex in $C$ and has a unique minimum reached when

$$C = (B^*A^*AB)^{-1}B^*A^*\Sigma_{YX}F^*D^*(DF\Sigma_{XX}F^*D^*)^{-1} \tag{53}$$

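Proof: We write

$$E(A,B,C,D,F) = \sum_t \left( x_t^*F^*D^*C^*B^*A^*ABCDFx_t - 2\,\mathrm{Re}(y_t^*ABCDFx_t) + \|y_t\|^2 \right) \tag{54}$$

Then for fixed $A$, $B$, $D$, and $F$, $E$ is a convex function of $C$ because the associated quadratic form is equal to

$$\sum_t x_t^*F^*D^*L^*B^*A^*ABLDFx_t = \sum_t \|ABLDFx_t\|^2 \ge 0 \tag{55}$$

for any matrix $L$ of the proper size. Let $C$ be a critical point. Then by definition, for any matrix $L$ of the proper size, the first-order variation of $E$ at $C$ in the direction $L$ must vanish. Expanding and simplifying this condition gives

$$\sum_t \left( x_t^*F^*D^*C^*B^*A^*ABLDFx_t - y_t^*ABLDFx_t \right) = 0 \tag{56}$$

for all matrices $L$ of the proper size. Using the linearity of the trace operator and its invariance under circular permutation of its arguments, this is equivalent to

$$\mathrm{Tr}\left( (DF\Sigma_{XX}F^*D^*C^*B^*A^*AB - DF\Sigma_{XY}AB)\,L \right) = 0 \tag{57}$$

for any $L$. Thus we have $DF\Sigma_{XX}F^*D^*C^*B^*A^*AB - DF\Sigma_{XY}AB = 0$ and therefore, taking the conjugate transpose,

$$B^*A^*ABCDF\Sigma_{XX}F^*D^* = B^*A^*\Sigma_{YX}F^*D^* \tag{58}$$

Finally, if $\Sigma_{XX}$ is invertible and $AB$ and $DF$ are of full rank, then $ABLDFx_t = 0$ for any $t$ is equivalent to $L = 0$, and thus the function $E$ is strictly convex in $C$. Thus in this case we can solve Equation 58 for $C$ to get Equation 53.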
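The closed form of Equation 53 can be checked numerically against the normal equation (Equation 52) and against random perturbations. This is a minimal sketch under stated assumptions: the layer sizes, the random matrices, the unnormalized covariances, and the helpers `crandn`, `ct`, and `error` are all hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def crandn(*shape):
    """Random complex Gaussian matrix (hypothetical helper for this sketch)."""
    return rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

def ct(M):
    """Conjugate transpose."""
    return M.conj().T

# Layer sizes 6 -> 5 -> 3 -> 2 -> 4 -> 6, chosen so that AB and DF are full rank
F, D, B, A = crandn(5, 6), crandn(3, 5), crandn(4, 2), crandn(6, 4)
X, Y = crandn(6, 50), crandn(6, 50)  # inputs and targets, one column per example
S_xx, S_yx = X @ ct(X), Y @ ct(X)    # unnormalized covariance matrices

AB, DF = A @ B, D @ F

# Equation 53: C = (B*A*AB)^{-1} B*A* S_yx F*D* (DF S_xx F*D*)^{-1}
left = np.linalg.solve(ct(AB) @ AB, ct(AB) @ S_yx @ ct(DF))
C = left @ np.linalg.inv(DF @ S_xx @ ct(DF))

def error(C_candidate):
    """Total squared error for a candidate middle matrix, all other matrices held fixed."""
    return np.linalg.norm(Y - AB @ C_candidate @ DF @ X) ** 2

# Both sides of the normal equation (Equation 52)
lhs = ct(AB) @ AB @ C @ DF @ S_xx @ ct(DF)
rhs = ct(AB) @ S_yx @ ct(DF)
```

Here `lhs` and `rhs` should agree up to rounding, and `error(C + eps * L)` should exceed `error(C)` for any nonzero perturbation `L`, in line with strict convexity.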

10 Conclusion

We have provided a fairly complete and general analysis of complex-valued linear autoencoders. The analysis can readily be applied to special cases, for instance when the vectors are real-valued and the matrices are complex-valued, or the vectors are complex-valued and the matrices are real-valued. More importantly, the analysis provides a unified view of real-valued and complex-valued linear autoencoders. In the Appendix, we further extend the treatment of linear autoencoders over infinite fields by looking at their properties from a differential geometry perspective.

More broadly, the framework used here identifies key questions and strategies that ought to be studied for any class of autoencoders, whether linear or non-linear. For instance:

1. What are the relevant group actions and invariances for the problem?

2. Can one of the transformations ($A$ or $B$) be solved for while the other is held fixed? Are there useful convex relaxations or restrictions?

3. Are there any critical points, and how can they be characterized?

4. Is there a notion of symmetry or transposition between the transformations $A$ and $B$ around critical points?

5. Is there an overall analytical solution? Is the problem NP-hard? What is the landscape of $E$?

6. What are the learning algorithms and their properties?

7. What are the generalization properties?

8. What happens if the outputs are recycled?

9. What happens if autoencoders are stacked vertically?

All these questions can be raised anew for other linear autoencoders, for instance over or with the norm (), or over other fields, in particular over finite fields with the Hamming distance. While results for finite fields will be published elsewhere, it is clear that these questions have different answers in the finite field case. For instance, the notion of using convexity to analytically solve for or , while holding the other one fixed, breaks down in the finite field case.

These questions can also be applied to non-linear autoencoders. While in general non-linear autoencoders are difficult to treat analytically, the case of Boolean autoencoders was recently solved using this framework Baldi (2011). Boolean autoencoders implement a form of clustering when and, in retrospect, all linear autoencoders implement also a form of clustering when . In the linear case, for any vector and any , we have . is the kernel of which contains the kernel of , and is equal to it when is of full-rank. Thus, in general, linear autoencoders implement clustering “by hyperplane” associated with the kernel of . Taken together, these facts point to the more general unity connecting unsupervised learning, clustering, Hebbian learning, and autoencoders.
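The clustering-by-kernel observation can be illustrated with a small numerical check. This is a minimal sketch under stated assumptions: $A$ and $B$ are random matrices of hypothetical sizes $n \times p$ and $p \times n$ with $p < n$, and a basis of the kernel of $B$ is obtained from its singular value decomposition.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 4, 2
A = rng.standard_normal((n, p)) + 1j * rng.standard_normal((n, p))
B = rng.standard_normal((p, n)) + 1j * rng.standard_normal((p, n))  # full rank with probability one

# Basis of Ker B: the last n - p right singular vectors of B
_, _, Vh = np.linalg.svd(B)
kernel_basis = Vh[p:].conj().T  # n x (n - p) matrix with B @ kernel_basis close to 0

x = rng.standard_normal(n) + 1j * rng.standard_normal(n)
u = kernel_basis @ (rng.standard_normal(n - p) + 1j * rng.standard_normal(n - p))

# x and x + u differ by a kernel element, so they receive the same output
same_output = np.allclose(A @ B @ (x + u), A @ B @ x)
```

All vectors in the affine subspace $x + \mathrm{Ker}\,B$ share the same hidden representation and the same output, which is the sense in which the network groups inputs.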

Finally, there is the case of autoencoders, linear or non-linear, with which has not been addressed here. Clearly, additional restrictions or conditions must be imposed in this case, such as sparse encoding in the hidden layer or sparse matrices using L1 regularization, to avoid trivial solutions associated with the identity function. Although beyond the scope of this paper, these autoencoders are also of interest. For instance, the linear case over finite fields with noise added to the hidden layer subsumes the theory of linear codes in coding theory McEliece (1977). Thus, in short, one can expect autoencoders to continue to play an important role in machine learning and provide fertile connections to other areas, from clustering to information and coding theory.

Appendix A: Differential Geometry of Autoencoders

Methods from differential geometry have been applied effectively to statistical machine learning in previous studies by Amari (1990); Amari and Nagaoka (2007) and others. Here, however, we introduce a novel approach for looking at the manifolds of relevant parameters for linear autoencoders over the real or complex fields. While the basic results in this section are not difficult, they do assume some understanding of the most basic concepts of differential geometry Spivak (1999).

Let be the set of complex matrices of rank at most equal to . Obviously, . In general, is a singular variety (a Brill-Noether variety). We let also be the set of matrices of rank exactly . As we shall see, is a complex manifold.

Definition 1.

We let

$$F_p(W) = \sum_{t=1}^{m} \|y_t - Wx_t\|^2 \tag{59}$$

where .

Let be the set of all complex matrices. Define the mapping

$$\iota: M_{n\times p} \times M_{p\times n} \to R_p \quad \text{with} \quad \iota(A,B) = AB \tag{60}$$

by taking the product of the corresponding matrices. Then we have . We are going to show that is surjective and the differential of is of full rank at any point.

Lemma 1.

$R_p \setminus R_{p-1}$ is a complex manifold of dimension $2pn - p^2$.

Proof. Let . To construct a set of local coordinates of near , we write

$$W = (w_1, \cdots, w_n) \tag{61}$$

where are column vectors. Without any loss of generality, we assume that are linearly independent. Thus we must have

$$w_j = \sum_{i=1}^{p} \xi_{ij} w_i \tag{62}$$

for , with complex coefficients . The local coordinates of are and . Thus

$$\dim(R_p \setminus R_{p-1}) = p(n-p) + pn = 2pn - p^2 \tag{63}$$

Next, we consider the tangent space of at . By definition, a basis of is given by

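$$\frac{\partial}{\partial w_{ik}}, \quad 1 \le i \le p,\ 1 \le k \le n; \tag{64}$$

$$\frac{\partial}{\partial \xi_{ij}}, \quad 1 \le i \le p,\ p < j \le n.$$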

Let be the standard basis of . Then the corresponding matrices of the tangent vectors are

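$$\frac{\partial}{\partial w_{ik}} \longrightarrow (0, \cdots, \underbrace{e_k}_{i\text{-th place}}, \cdots, 0,\ \xi_{i,p+1}e_k, \cdots, \xi_{in}e_k);$$

$$\frac{\partial}{\partial \xi_{ij}} \longrightarrow (0, \cdots, 0,\ 0, \cdots, \underbrace{w_i}_{j\text{-th place}}, \cdots, 0).$$

Lemma 2.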

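Let $W = AB$, where $A, B$ are full-rank $n \times p$ and $p \times n$ matrices, respectively. Let $A_1, B_1$ be $n \times p$ and $p \times n$ matrices such that

$$AB_1 + A_1B = 0 \tag{66}$$

Then there is an invertible $p \times p$ matrix $V$ such that

$$A_1 = AV, \quad B_1 = -VB \tag{67}$$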

Proof. Multiplying Equation 66 on the left by $A^*$, we have

$$A^*AB_1 + A^*A_1B = 0. \tag{68}$$

Since $A$ is full rank, $A^*A$ is an invertible matrix. Thus

$$B_1 = -(A^*A)^{-1}A^*A_1B \tag{69}$$

Substituting the above into Equation 66 yields

$$-A(A^*A)^{-1}A^*A_1B + A_1B = 0 \tag{70}$$

Since $B$ is of full rank, we get

$$-A(A^*A)^{-1}A^*A_1 + A_1 = (1 - P_A)A_1 = 0 \tag{71}$$

which implies that the columns of $A_1$ lie in the image of $P_A$, i.e. in the space spanned by the columns of $A$. Hence $A_1 = AV$ for some matrix $V$. Substituting $A_1 = AV$ into Equation 69 then gives $B_1 = -(A^*A)^{-1}A^*AVB = -VB$.
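The relations in Lemma 2 can be verified numerically in one direction: pairs of the form $A_1 = AV$, $B_1 = -VB$ satisfy Equation 66, and $V$ can be recovered from $A_1$ as in the proof. This is a minimal sketch with hypothetical sizes and random matrices; it does not address the invertibility of $V$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5, 2
A = rng.standard_normal((n, p)) + 1j * rng.standard_normal((n, p))  # full rank with probability one
B = rng.standard_normal((p, n)) + 1j * rng.standard_normal((p, n))
V = rng.standard_normal((p, p)) + 1j * rng.standard_normal((p, p))

A1, B1 = A @ V, -V @ B
residual = A @ B1 + A1 @ B  # left-hand side of Equation 66, should vanish

# Recover V from A1 as in the proof: V = (A* A)^{-1} A* A1
V_rec = np.linalg.solve(A.conj().T @ A, A.conj().T @ A1)
```

The matrix `residual` should be zero up to rounding, and `V_rec` should reproduce `V`, so that `B1` equals `-V_rec @ B` as in Equation 69.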

Lemma 3.

The tangent space