A deep matrix factorization method for learning attribute representations
Abstract
Semi-Nonnegative Matrix Factorization is a technique that learns a low-dimensional representation of a dataset that lends itself to a clustering interpretation. It is possible that the mapping between this new representation and our original data matrix contains rather complex hierarchical information with implicit lower-level hidden attributes, which classical one-level clustering methodologies cannot interpret. In this work we propose a novel model, Deep Semi-NMF, that is able to learn such hidden representations, which lend themselves to a clustering interpretation according to different, unknown attributes of a given dataset. We also present a semi-supervised version of the algorithm, named Deep WSF, which can exploit (partial) prior information for each of the known attributes of a dataset, allowing the model to be used on datasets with mixed attribute knowledge. Finally, we show that our models are able to learn low-dimensional representations that are better suited not only for clustering but also for classification, outperforming Semi-Nonnegative Matrix Factorization as well as other state-of-the-art methodologies and their variants.
1 Introduction
Matrix factorization is a particularly useful family of techniques in data analysis. In recent years, there has been a significant amount of research on factorization methods that focus on particular characteristics of both the data matrix and the resulting factors. Nonnegative matrix factorization (NMF), for example, focuses on the decomposition of a nonnegative multivariate data matrix $X$ into two factors $Z$ and $H$ that are also nonnegative, such that $X \approx ZH$. The application area of the family of NMF algorithms has grown significantly during the past years. It has been shown that they can be a successful dimensionality reduction technique over a variety of areas including, but not limited to, environmetrics [1], microarray data analysis [2, 3], document clustering [4], face recognition [5, 6], blind audio source separation [7] and more. What makes NMF algorithms particularly attractive is the nonnegativity constraints imposed on the factors they produce, allowing for better interpretability. Moreover, it has been shown that NMF variants (such as Semi-NMF) are equivalent to a soft version of k-means clustering, and that, in fact, NMF variants are expected to perform better than k-means clustering particularly when the data is not distributed in a spherical manner [8, 9].
In order to extend the applicability of NMF to cases where our data matrix $X$ is not strictly nonnegative, [8] introduced Semi-NMF, an NMF variant that imposes nonnegativity constraints only on the second factor $H$, but allows mixed signs in both the data matrix $X$ and the first factor $Z$. This was motivated from a clustering perspective, where $Z$ represents cluster centroids and $H$ represents soft membership indicators for every data point, allowing Semi-NMF to learn new lower-dimensional features from the data that have a convenient clustering interpretation.
It is possible that the mapping between this new representation and our original data matrix contains rather complex hierarchical and structural information. Such a complex dataset is produced by a multimodal data distribution which is a mixture of several distributions, where each of these constitutes an attribute of the dataset. Consider for example the problem of mapping images of faces to their identities: a face image also contains information about attributes like pose and expression that can help identify the person depicted. One could argue that by further factorizing this mapping $Z$, in a way that each factor adds an extra layer of abstraction, one could automatically learn such latent attributes and the intermediate hidden representations that are implied, allowing for a better higher-level feature representation $H_m$. In this work, we propose Deep Semi-NMF, a novel approach that is able to factorize a matrix into multiple factors in an unsupervised fashion (see Figure 3), and is therefore able to learn multiple hidden representations of the original data. As Semi-NMF has a close relation to k-means clustering, Deep Semi-NMF also has a clustering interpretation according to the different latent attributes of our dataset, as demonstrated in Figure 4. Using a nonlinear deep model for matrix factorization also allows us to project datapoints which are not initially linearly separable into a representation that is, a fact which we demonstrate in subsection 6.1.
It might be the case that the different attributes of our data are not latent. If those are known and we actually have some label information about some or all of our data, we would naturally want to leverage it and learn representations that would make the data more separable according to each of these attributes. To this effect, we also propose a weakly-supervised Deep Semi-NMF (Deep WSF), a technique that is able to learn, in a semi-supervised manner, a hierarchy of representations for a given dataset. Each level of this hierarchy corresponds to a specific attribute that is known a priori, and we show that by incorporating partial label information via graph-regularization techniques we are able to perform better than with a fully unsupervised Deep Semi-NMF in the task of classifying our dataset of faces according to different attributes, when those are known. We also show that by initializing an unsupervised Deep Semi-NMF with the weights learned by a Deep WSF we are able to improve the clustering performance of the Deep Semi-NMF. This could be particularly useful if we have, as in our example, a small dataset of images of faces with partial attribute labels and a larger one with no attribute labels. By initializing a Deep Semi-NMF with the weights learned with Deep WSF from the small labelled dataset, we can leverage all the information we have and allow our unsupervised model to uncover better representations of our initial data on the task of clustering faces.
Relevant to our proposal are hierarchical clustering algorithms [10, 11], which are popular in gene and document clustering applications. These algorithms typically abstract the initial data distribution as a form of tree called a dendrogram, which is useful for analysing the data and helps identify genes that can be used as biomarkers, or topics of a collection of documents. However, this tree abstraction makes it hard to incorporate out-of-sample data and prohibits the use of techniques other than clustering.
Another line of work which is related to ours is multi-label learning [12]. Multi-label learning techniques rely on the correlations [13] that exist between different attributes to extract better features. We are not interested in cases where there is complete knowledge about each of the attributes of the dataset; rather, we propose a new paradigm of learning representations where we have data with only partly annotated attributes. An example of this is a mixture of datasets where each one has label information about a different set of attributes. In this new paradigm we cannot leverage the correlations between the attribute labels, and we rather rely on the hierarchical structure of the data to uncover relations between the different dataset attributes. To the best of our knowledge this is the first piece of work that tries to automatically discover the representations for different (known and unknown) attributes of a dataset, with an application to a multimodal problem such as face clustering.
The novelty of this work can be summarised as follows: (1) we outline a novel deep framework for matrix factorization suitable for clustering of multimodally distributed objects such as faces (a preliminary version of this work has appeared in [14]); (2) we present a greedy algorithm to optimize the factors of the Semi-NMF problem, inspired by recent advances in deep learning [15]; (3) we evaluate the representations learned by different NMF variants in terms of clustering performance; (4) we present the Deep WSF model that can use already known (partial) information for the attributes of our data distribution to extract better features; and (5) we demonstrate how to improve the performance of Deep Semi-NMF by using the existing weights from a trained Deep WSF model.
2 Background
In this work, we assume that our data is provided in a matrix form $X \in \mathbb{R}^{p \times n}$, i.e., $X = [x_1, x_2, \ldots, x_n]$ is a collection of $n$ data vectors as columns, each with $p$ features. Matrix factorization aims at finding factors of $X$ that satisfy certain constraints. In Singular Value Decomposition (SVD) [16], the method that underlies Principal Component Analysis (PCA) [17], we factorize $X$ into two factors: the loadings or bases $Z$ and the features or components $H$, without imposing any sign restrictions on either our data or the resulting factors. In Nonnegative Matrix Factorization (NMF) [18] we assume that all matrices involved contain only nonnegative elements (when not clear from the context, we will write $A^{+}$ to state that a matrix $A$ contains only nonnegative elements, and $A^{\pm}$ to state that $A$ may contain any real number), so we try to approximate a factorization $X^{+} \approx Z^{+}H^{+}$.
2.1 SemiNMF
In turn, Semi-NMF [8] relaxes the nonnegativity constraints of NMF and allows the data matrix $X$ and the loadings matrix $Z$ to have mixed signs, while restricting only the features matrix $H$ to comprise strictly nonnegative components, thus approximating the following factorization:

$$X^{\pm} \approx Z^{\pm} H^{+}. \qquad (1)$$
This is motivated from a clustering perspective. If we view $Z$ as the cluster centroids, then $H$ can be viewed as the cluster indicators for each datapoint.
In fact, if we had a matrix $H$ that was not only nonnegative but also orthogonal, such that $HH^{\top} = I$ [8], then every column vector of $H$ would have only one positive element, making Semi-NMF equivalent to $k$-means, with the following cost function:

$$C_{k\text{-means}} = \sum_{i=1}^{n} \sum_{k=1}^{K} h_{ki}\, \| x_i - z_k \|_2^2, \qquad (2)$$

where $\|\cdot\|_2$ denotes the $L_2$ norm of a vector and $\|\cdot\|_F$ the Frobenius norm of a matrix.
Thus Semi-NMF, which does not impose an orthogonality constraint on its features matrix, can be seen as a soft clustering method where the features matrix describes the compatibility of each component with a cluster centroid, a base in $Z$. In fact, the cost function we optimize for approximating the Semi-NMF factors is indeed:

$$C_{\text{semi}} = \| X - ZH \|_F^2, \quad \text{subject to } H \geq 0. \qquad (3)$$
We optimize $C_{\text{semi}}$ via an alternating optimization of $Z$ and $H$: we iteratively update each of the factors while fixing the other, imposing the nonnegativity constraints only on the features matrix $H$:

$$Z = XH^{\dagger}, \qquad (4)$$

where $H^{\dagger}$ is the Moore–Penrose pseudo-inverse of $H$, and

$$H \leftarrow H \odot \sqrt{\frac{[Z^{\top}X]^{\mathrm{pos}} + [Z^{\top}Z]^{\mathrm{neg}}H}{[Z^{\top}X]^{\mathrm{neg}} + [Z^{\top}Z]^{\mathrm{pos}}H + \epsilon}}, \qquad (5)$$

where $\epsilon$ is a small number to avoid division by zero, $A^{\mathrm{pos}}$ is a matrix that has the negative elements of matrix $A$ replaced with 0, and similarly $A^{\mathrm{neg}}$ is one that has the positive elements of $A$ replaced with 0:

$$A^{\mathrm{pos}}_{ik} = \frac{|A_{ik}| + A_{ik}}{2}, \qquad A^{\mathrm{neg}}_{ik} = \frac{|A_{ik}| - A_{ik}}{2}. \qquad (6)$$
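The alternating scheme of Equations 4–6 can be sketched in a few lines of NumPy. This is an illustrative sketch under assumed defaults (a random strictly positive initialization of $H$ and a fixed iteration count), not the reference implementation:

```python
import numpy as np

def semi_nmf(X, k, n_iter=200, eps=1e-9, seed=0):
    """Illustrative Semi-NMF: X (p x n, mixed signs) ~= Z H with H >= 0."""
    rng = np.random.default_rng(seed)
    H = np.abs(rng.standard_normal((k, X.shape[1])))        # positive init (an assumption)
    for _ in range(n_iter):
        Z = X @ np.linalg.pinv(H)                           # Eq. (4): Z = X H^dagger
        A, B = Z.T @ X, Z.T @ Z
        Ap, An = (np.abs(A) + A) / 2, (np.abs(A) - A) / 2   # Eq. (6): positive/negative parts
        Bp, Bn = (np.abs(B) + B) / 2, (np.abs(B) - B) / 2
        H *= np.sqrt((Ap + Bn @ H) / (An + Bp @ H + eps))   # Eq. (5): multiplicative update
    return Z, H
```

Because the update for $H$ is multiplicative, entries that start at zero stay at zero, which is why the initialization above is strictly positive.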
2.2 State-of-the-art for learning features for clustering based on NMF variants
In this work, we compare our method with, among others, the state-of-the-art NMF techniques for learning features for the purpose of clustering. [19] proposed a graph-regularized NMF (GNMF), which takes into account the intrinsic geometric and discriminating structure of the data space, which is essential to real-world applications, especially in the area of clustering. To accomplish this, GNMF constructs a nearest-neighbor graph to model the manifold structure. By preserving the graph structure, it allows the learned features to have more discriminating power than the standard NMF algorithm in cases where the data is sampled from a submanifold lying in a higher-dimensional ambient space.
Closest to our proposal is recent work that has presented NMF variants that factorize $X$ into more than two factors. Specifically, [20] have demonstrated the concept of Multi-layer NMF on a set of facial images, and [21, 22, 23] have proposed similar NMF models that can be used for blind source separation, classification of digit images (MNIST), and documents. The representations of the Multi-layer NMF, however, do not lend themselves to a clustering interpretation, unlike the representations learned by our model. Although the Multi-layer NMF is a promising technique for learning hierarchies of features from data, we show in this work that our proposed model, the Deep Semi-NMF, outperforms the Multi-layer NMF and, in fact, all models we compared it with on the task of feature learning for clustering images of faces.
2.3 Semi-supervised matrix factorization
For the case of the proposed Deep WSF algorithms, we also evaluate our method against previous semi-supervised nonnegative matrix factorization techniques. These include the Constrained Nonnegative Matrix Factorization (CNMF) [24], and the Discriminant Nonnegative Matrix Factorization (DNMF) [25]. Although both take label information as additional constraints, the difference between them is that CNMF uses the label information as hard constraints on the resulting features $H$, whereas DNMF uses the Fisher criterion in order to incorporate discriminant information in the decomposition [25]. Both approaches only work for cases where we want to encode the prior information of a single attribute, in contrast to the proposed Deep WSF model.
3 Deep SemiNMF
In Semi-NMF the goal is to construct a low-dimensional representation $H$ of our original data $X$, with the bases matrix $Z$ serving as the mapping between our original data and its lower-dimensional representation (see Equation 1). In many cases the data we wish to analyze is rather complex and has a collection of distinct, often unknown, attributes. In this work, for example, we deal with datasets of human faces where the variability in the data does not only stem from the difference in the appearance of the subjects, but also from other attributes, such as the pose of the head in relation to the camera, or the facial expression of the subject. The multi-attribute nature of our data calls for a hierarchical framework that is better at representing it than a shallow Semi-NMF.
We therefore propose here the Deep Semi-NMF model, which factorizes a given data matrix $X$ into $m+1$ factors, as follows:

$$X^{\pm} \approx Z_1^{\pm} Z_2^{\pm} \cdots Z_m^{\pm} H_m^{+}. \qquad (7)$$
This formulation, illustrated in Figures 3 and 4 and written out in Equation 9, allows for a hierarchy of $m$ layers of implicit representations of our data, given by the following factorizations:

$$H_{m-1}^{+} \approx Z_m^{\pm}H_m^{+}, \quad \ldots, \quad H_2^{+} \approx Z_3^{\pm}\cdots Z_m^{\pm}H_m^{+}, \quad H_1^{+} \approx Z_2^{\pm}\cdots Z_m^{\pm}H_m^{+}. \qquad (8)$$
As one can see above, we further restrict these implicit representations ($H_1, \ldots, H_{m-1}$) to also be nonnegative. By doing so, every layer of this hierarchy of representations also lends itself to a clustering interpretation, which makes our method radically different from other multi-layer NMF approaches [21, 22, 23]. By examining Figure 4, one can better understand the intuition of how that happens. In this case the input to the model, $X$, is a collection of face images from different subjects (identity), expressing a variety of facial expressions taken from many angles (pose). A Semi-NMF model would find a representation $H$ of $X$, which would be useful for performing clustering according to the identity of the subjects, and the mapping $Z$ between these identities and the face images. A Deep Semi-NMF model also finds a representation of our data that has a similar interpretation at the top layer, its last factor $H_3$. However, the mapping from identities to face images is now further analyzed as a product of three factors $Z_1Z_2Z_3$, with $Z_3$ corresponding to the mapping of identities to expressions, $Z_2Z_3$ corresponding to the mapping of identities to poses, and finally $Z_1Z_2Z_3$ corresponding to the mapping of identities to the face images. That means that, as shown in Figure 4, we are able to decompose our data in three different ways according to our three different attributes:

$$X^{\pm} \approx Z_1^{\pm}H_1^{+}, \qquad X^{\pm} \approx Z_1^{\pm}Z_2^{\pm}H_2^{+}, \qquad X^{\pm} \approx Z_1^{\pm}Z_2^{\pm}Z_3^{\pm}H_3^{+}. \qquad (9)$$
Moreover, due to the nonnegativity constraints we enforce on the latent features $H_1, \ldots, H_{m-1}$, it should be noted that this model does not collapse to a Semi-NMF model. Our hypothesis is that by further factorizing $Z$ we are able to construct a deep model that is able to (1) automatically learn what this latent hierarchy of attributes is; (2) find representations of the data that are most suitable for clustering according to the attribute that corresponds to each layer in the model; and (3) find a better high-level, final-layer representation for clustering according to the attribute with the lowest variability, in our case the identity of the face depicted. In our example in Figure 4 we would expect to find better features for clustering according to identities by learning the hidden representations at each layer most suitable for each of the attributes in our data, in this example: $H_1$ for clustering our original images in terms of poses and $H_2$ for clustering the face images in terms of expressions.
In order to expedite the approximation of the factors in our model, we pretrain each of the layers to have an initial approximation of the matrices $Z_i$, $H_i$, as this greatly improves the training time of the model. This is a tactic that has been employed successfully before [15] on deep autoencoder networks. To perform the pretraining, we first decompose the initial data matrix $X \approx Z_1H_1$. Following this, we decompose the features matrix $H_1 \approx Z_2H_2$, continuing to do so until we have pretrained all of the layers. Afterwards, we fine-tune the weights of each layer by alternating minimization of the two factors in each layer, in order to reduce the total reconstruction error of the model, according to the cost function in Equation 10:

$$C_{\text{deep}} = \frac{1}{2}\, \| X - Z_1 Z_2 \cdots Z_m H_m \|_F^2, \quad \text{subject to } H_i \geq 0. \qquad (10)$$
Update rule for the weights matrix $Z_i$. We fix the rest of the weights for the $i$th layer and we minimize the cost function with respect to $Z_i$. That is, we set $\partial C_{\text{deep}}/\partial Z_i = 0$, which gives us the update:

$$Z_i = \Psi^{\dagger} X \tilde{H}_i^{\dagger}, \qquad (11)$$

where $\Psi = Z_1 \cdots Z_{i-1}$, $\dagger$ denotes the Moore–Penrose pseudo-inverse, and $\tilde{H}_i$ is the reconstruction of the $i$th layer's feature matrix.
Update rule for the features matrix $H_i$. Utilizing a similar proof to [8], we can formulate the update rule for $H_i$ which enforces its nonnegativity:

$$H_i \leftarrow H_i \odot \sqrt{\frac{[(\Psi Z_i)^{\top}X]^{\mathrm{pos}} + [(\Psi Z_i)^{\top}\Psi Z_i]^{\mathrm{neg}}H_i}{[(\Psi Z_i)^{\top}X]^{\mathrm{neg}} + [(\Psi Z_i)^{\top}\Psi Z_i]^{\mathrm{pos}}H_i + \epsilon}}. \qquad (12)$$
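Putting the greedy pretraining scheme and the per-layer updates of Equations 11–12 together, a compact sketch of the full pipeline might look as follows. This is an illustrative NumPy sketch, not the reference code: the iteration counts and random initialization are assumptions, and for simplicity each layer keeps its own explicit $H_i$ rather than tying $H_{i-1} = Z_i H_i$ exactly:

```python
import numpy as np
from functools import reduce

def _semi_nmf(A, k, n_iter, rng, eps=1e-9):
    """One Semi-NMF factorization used for greedy pretraining: A ~= Z H, H >= 0."""
    H = np.abs(rng.standard_normal((k, A.shape[1])))
    for _ in range(n_iter):
        Z = A @ np.linalg.pinv(H)
        G, B = Z.T @ A, Z.T @ Z
        Gp, Gn = (np.abs(G) + G) / 2, (np.abs(G) - G) / 2
        Bp, Bn = (np.abs(B) + B) / 2, (np.abs(B) - B) / 2
        H *= np.sqrt((Gp + Bn @ H) / (Gn + Bp @ H + eps))
    return Z, H

def deep_semi_nmf(X, layer_sizes, pre_iter=30, fine_iter=30, eps=1e-9, seed=0):
    rng = np.random.default_rng(seed)
    # Greedy pretraining: X ~= Z1 H1, then H1 ~= Z2 H2, and so on.
    Zs, Hs, A = [], [], X
    for k in layer_sizes:
        Z, A = _semi_nmf(A, k, pre_iter, rng)
        Zs.append(Z); Hs.append(A)
    # Fine-tuning: per layer, update Z_i (Eq. 11) and H_i (Eq. 12)
    # with the other factors held fixed.
    for _ in range(fine_iter):
        for i in range(len(Zs)):
            Psi = reduce(lambda a, b: a @ b, Zs[:i], np.eye(X.shape[0]))
            Zs[i] = np.linalg.pinv(Psi) @ X @ np.linalg.pinv(Hs[i])
            M = Psi @ Zs[i]                      # full mapping from layer i down to the data
            G, B = M.T @ X, M.T @ M
            Gp, Gn = (np.abs(G) + G) / 2, (np.abs(G) - G) / 2
            Bp, Bn = (np.abs(B) + B) / 2, (np.abs(B) - B) / 2
            Hs[i] *= np.sqrt((Gp + Bn @ Hs[i]) / (Gn + Bp @ Hs[i] + eps))
    return Zs, Hs
```

The final reconstruction is then `Zs[0] @ ... @ Zs[-1] @ Hs[-1]`, and each `Hs[i]` retains its nonnegativity by construction of the multiplicative update.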
Complexity
The computational complexity of the pretraining stage of Deep Semi-NMF grows with the number of layers $m$, the number of iterations $t$ until convergence, and the maximum number of components $k$ out of all the layers. The complexity of the fine-tuning stage additionally depends on the number $t'$ of further iterations needed.
3.1 Nonlinear Representations
By having a linear decomposition of the initial data distribution we may fail to describe efficiently the nonlinearities that exist between the latent attributes of the model. Introducing nonlinear functions between the layers can enable us to extract, for each of the latent attributes of the model, features that are not linearly separable in the initial input space.
This is motivated further by neurophysiology paradigms, as theoretical and experimental evidence suggests that the human visual system follows a hierarchical and rather nonlinear approach [26] to processing image structure, in which neurons become selective to progressively more complex features of the image structure. As argued by Malo et al. [27], employing an adaptive nonlinear image representation algorithm results in a reduction of the statistical and the perceptual redundancy amongst the representation elements.
From a mathematical point of view, one can use a nonlinear function $g(\cdot)$ between each of the implicit representations ($H_i$), in order to better approximate the nonlinear manifolds on which the given data matrix $X$ originally lies. In other words, by using a nonlinear squashing function we enhance the expressibility of our model and allow for a better reconstruction of the initial data. It has been proved in [28], by use of the Stone–Weierstrass theorem, that multilayer feedforward network structures, of which Semi-NMF is an instance, equipped with arbitrary squashing functions can approximate virtually any function of interest to any desired degree of accuracy, provided sufficiently many hidden units are available.
To introduce nonlinearities in our model we modify the $i$th feature matrix $H_i$, by setting

$$H_i = g(Z_{i+1}H_{i+1}), \qquad (13)$$

which in turn changes the objective function of the model to be:

$$C_{\text{deep}} = \frac{1}{2}\, \big\| X - Z_1\, g\big(Z_2\, g(\cdots g(Z_m H_m))\big) \big\|_F^2. \qquad (14)$$
In order to compute the derivative for the $i$th feature layer, we make use of the chain rule and get:

$$\frac{\partial C_{\text{deep}}}{\partial H_i} = Z_i^{\top}\left[ \frac{\partial C_{\text{deep}}}{\partial H_{i-1}} \odot \nabla g(Z_iH_i) \right], \quad i > 1.$$

The derivative for the first feature layer is then identical to the version of the model with one layer, $\partial C_{\text{deep}}/\partial H_1 = Z_1^{\top}(Z_1H_1 - X)$. Similarly we can compute the derivative for the weight matrices $Z_i$,

$$\frac{\partial C_{\text{deep}}}{\partial Z_i} = \left[ \frac{\partial C_{\text{deep}}}{\partial H_{i-1}} \odot \nabla g(Z_iH_i) \right] H_i^{\top}, \quad i > 1,$$

and

$$\frac{\partial C_{\text{deep}}}{\partial Z_1} = (Z_1H_1 - X)\, H_1^{\top}.$$
Using these derivatives we can make use of gradient descent optimizations such as Nesterov’s optimal gradient [29], to minimize the cost function with respect to each of the weights of our model.
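For concreteness, the nonlinear reconstruction of Equations 13–14 can be written as a short forward pass. The choice of the logistic sigmoid for $g$ below is purely an illustrative assumption; any squashing function with the properties discussed above could be substituted:

```python
import numpy as np

def nonlinear_reconstruction(Zs, H_top, g=lambda a: 1.0 / (1.0 + np.exp(-a))):
    """Compute X_hat = Z1 g(Z2 g(... g(Zm Hm))) as in Eqs. (13)-(14).
    Zs is the list [Z1, ..., Zm]; H_top is Hm."""
    H = H_top
    for Z in reversed(Zs[1:]):      # apply g between every pair of layers
        H = g(Z @ H)                # Eq. (13): H_i = g(Z_{i+1} H_{i+1})
    return Zs[0] @ H                # no nonlinearity after the outermost Z1

def cost(X, Zs, H_top):
    """Objective of Eq. (14)."""
    return 0.5 * np.linalg.norm(X - nonlinear_reconstruction(Zs, H_top)) ** 2
```

In practice the gradients above would be evaluated against this forward pass, exactly as in backpropagation for a feedforward network.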
4 WeaklySupervised Attribute Learning
As before, consider a dataset of faces as in Figure 4. In this dataset, we have a collection of subjects, where each one has a number of images expressing different expressions, taken from different angles (pose information). A three-layer Deep Semi-NMF model could be used here to automatically learn representations in an unsupervised manner ($H_1, H_2, H_3$) that conform to this latent hierarchy of attributes. Of course, these features are extracted without accounting for the (partially) available information that may exist for each of these attributes of the dataset.
To this effect we propose a Deep Semi-NMF approach that can incorporate partial attribute information, which we name Weakly-Supervised Deep Semi-Nonnegative Matrix Factorization (Deep WSF). Deep WSF is able to learn, in a semi-supervised manner, a hierarchy of representations, each level of which corresponds to a specific attribute for which we may have only partial labels. As depicted in Figure 5, we show that by incorporating some label information via graph-regularization techniques we are able to do better than the Deep Semi-NMF for classifying faces according to pose, expression, and identity. We also show that by initializing a Deep Semi-NMF with the weights learned by a Deep WSF we are able to improve the performance of the Deep Semi-NMF for the task of clustering faces according to identity.
4.1 Incorporating known attribute information
Consider that we have an undirected graph $G$ with $n$ nodes, where each of the nodes corresponds to one data point in our initial dataset. A node $i$ is connected to another node $j$ iff we have a priori knowledge that those samples share the same label, and this edge has a weight $w_{ij}$.

In the simplest-case scenario, we use a binary weight matrix $W$ defined as:

$$w_{ij} = \begin{cases} 1 & \text{if } x_i \text{ and } x_j \text{ are known to share the same label,} \\ 0 & \text{otherwise.} \end{cases} \qquad (15)$$
Instead, one can also choose a radial basis function kernel,

$$w_{ij} = \exp\!\left( -\frac{\| x_i - x_j \|^2}{2\sigma^2} \right), \qquad (16)$$

or a dot-product weighting, where

$$w_{ij} = x_i^{\top} x_j. \qquad (17)$$
Using the graph weight matrix $W$, we formulate the graph Laplacian $L$ [30], which stores our prior knowledge about the relationship of our samples and is defined as $L = D - W$, where $D$ is a diagonal matrix whose entries are column (or row, since $W$ is symmetric) sums of $W$, $D_{ii} = \sum_j w_{ij}$. In order to control the amount of information embedded in the graph, we introduce, as in [31, 32, 33], a term $R$ which controls the smoothness of the low-dimensional representation:

$$R = \frac{1}{2} \sum_{i,j=1}^{n} \| h_i - h_j \|^2\, w_{ij} = \mathrm{tr}(HLH^{\top}), \qquad (18)$$

where $h_i$ is the low-dimensional feature vector for sample $i$ that we obtain from the decomposed model.

By minimizing this term $R$, we ensure that the Euclidean distance between the final-level representations of any two data points $x_i$ and $x_j$ is low when we have prior knowledge that those samples are related, producing similar features $h_i$ and $h_j$. On the other hand, when we do not have any expert information about some, or even all, of the class information for an attribute, the term has no influence on the rest of the optimization.
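As a sketch, the binary graph of Equation 15 and the smoothness penalty of Equation 18 can be computed as follows (encoding an unlabeled sample as `None` is our own illustrative convention, not something prescribed above):

```python
import numpy as np

def graph_laplacian(labels):
    """L = D - W for the binary weight graph of Eq. (15); unlabeled
    samples (None) stay disconnected and are thus unpenalized."""
    n = len(labels)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and labels[i] is not None and labels[i] == labels[j]:
                W[i, j] = 1.0
    return np.diag(W.sum(axis=1)) - W

def smoothness(H, L):
    """R = tr(H L H^T) of Eq. (18); H holds one feature column per sample."""
    return np.trace(H @ L @ H.T)
```

For example, with labels `[0, 0, 1, None]` only the first two samples are pulled together, and identical feature columns give $R = 0$, since every row of $L$ sums to zero.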
Before deriving the update rules and the algorithm for the multi-layer Deep WSF model, we first show the simpler one-layer version, which will come into use for pretraining the model, just as Semi-NMF can be used to pretrain the purely unsupervised Deep Semi-NMF. We call this model Weakly-Supervised Semi-NMF (WSF).

By combining the term $R$ introduced in Equation 18 with the cost function of Semi-NMF, we obtain the cost function for Weakly-Supervised Factorization (WSF):

$$C_{\text{WSF}} = \| X - ZH \|_F^2 + \lambda\, \mathrm{tr}(HLH^{\top}), \quad \text{subject to } H \geq 0. \qquad (19)$$
The update rules, as well as the algorithm for training a WSF model, can be found in the supplementary material.
We incorporate the available partially labelled information for pose, expression, and identity by forming a graph Laplacian for pose for the first layer ($L^{(1)}$), for expression for the second layer ($L^{(2)}$), and for identity for the third layer ($L^{(3)}$) of the model. We can then tune the regularization parameters $\lambda_i$ accordingly for each of the layers, to express the importance of each of these attributes to the Deep WSF model. Using the modified version of our objective function in Equation 20, we can derive Algorithm 2.

$$C_{\text{DWSF}} = \frac{1}{2}\, \| X - Z_1 \cdots Z_m H_m \|_F^2 + \sum_{i=1}^{m} \frac{\lambda_i}{2}\, \mathrm{tr}(H_i L^{(i)} H_i^{\top}), \quad \text{subject to } H_i \geq 0. \qquad (20)$$
In order to compute the derivative for the $i$th feature layer, we make use of the chain rule and get:

$$\frac{\partial C_{\text{DWSF}}}{\partial H_i} = Z_i^{\top}\left[ \frac{\partial C_{\text{DWSF}}}{\partial H_{i-1}} \odot \nabla g(Z_iH_i) \right] + \lambda_i H_i L^{(i)}, \quad i > 1.$$

The derivative for the first feature layer is then:

$$\frac{\partial C_{\text{DWSF}}}{\partial H_1} = Z_1^{\top}(Z_1H_1 - X) + \lambda_1 H_1 L^{(1)}.$$

Similarly we can compute the derivative for the weight matrices $Z_i$,

$$\frac{\partial C_{\text{DWSF}}}{\partial Z_i} = \left[ \frac{\partial C_{\text{DWSF}}}{\partial H_{i-1}} \odot \nabla g(Z_iH_i) \right] H_i^{\top}, \quad i > 1,$$

and

$$\frac{\partial C_{\text{DWSF}}}{\partial Z_1} = (Z_1H_1 - X)\, H_1^{\top}.$$

Using these derivatives we can make use of gradient-descent optimizations, as with the nonlinear Deep Semi-NMF model, to minimize the cost function with respect to each of the factors of our model. If instead we use the linear version of the algorithm, where $g$ is the identity function, then we can derive a multiplicative-update version of Deep WSF, as described in Algorithm 2.
4.2 Weakly Supervised Factorization with Multiple Label Constraints
Another approach we propose within this framework is a single-layer WSF model that learns a single representation based on information from multiple attributes. This Multiple-Attribute extension of WSF, WSF-MA, accounts for the case of having $a$ attributes for our data matrix $X$ by using one regularization term per attribute: prior information from each available attribute is used to construct Laplacian graphs $L^{(1)}, \ldots, L^{(a)}$, each with its own regularization factor $\lambda_i$.

This constitutes WSF-MA, whose cost function is

$$C_{\text{WSF-MA}} = \| X - ZH \|_F^2 + \sum_{i=1}^{a} \lambda_i\, \mathrm{tr}(H L^{(i)} H^{\top}), \quad \text{subject to } H \geq 0. \qquad (21)$$
The update rules used, and the algorithm can be found in the supplementary material.
5 Outofsample projection
After learning an internal model of the data, either using the purely unsupervised Deep Semi-NMF or performing semi-supervised learning using the Deep WSF model, with learned weights $Z_1, \ldots, Z_m$ and features $H_1, \ldots, H_m$, we can project an out-of-sample data point $x^*$ to the new lower-dimensional embedding $h^*$. We can accomplish this using one of the two methods presented below.
Method 1: Basis matrix reconstruction.
Each testing sample $x^*$ is projected into the linear space defined by the weight matrices $Z_i$. Although this method has been used by various previous works [34, 35] using the NMF model, it does not guarantee the nonnegativity of $h^*$.

For the linear case of Deep WSF, this would lead to

$$h^* = Z_m^{\dagger} \cdots Z_1^{\dagger}\, x^*, \qquad (22)$$

and for the nonlinear case

$$h^* = Z_m^{\dagger}\, g^{-1}\!\big( \cdots\, g^{-1}\!\big( Z_1^{\dagger} x^* \big) \big). \qquad (23)$$
Method 2: Using nonnegativity update rules.
Using the same process as in Deep Semi-NMF, we can instead learn the new features $h^*$ iteratively, by assuming that the weight matrices $Z_i$ remain fixed:

$$h^* = \arg\min_{h \geq 0} \| x^* - Z_1 \cdots Z_m h \|_2^2, \qquad (24)$$

and for the nonlinear case

$$h^* = \arg\min_{h \geq 0} \big\| x^* - Z_1\, g\big(Z_2 \cdots g(Z_m h)\big) \big\|_2^2, \qquad (25)$$

where $h^*$ corresponds to the feature layer for the out-of-sample data point $x^*$. This problem is then solved as in Deep Semi-NMF, using Algorithm 1, but without updating the weight matrices $Z_i$.
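Method 1, for instance, amounts to a chain of pseudo-inverse multiplications. Below is a sketch of the linear case of Equation 22; as noted above, the result is not guaranteed to be nonnegative:

```python
import numpy as np

def project_out_of_sample(Zs, x_new):
    """h* = Zm^dagger ... Z1^dagger x* (Eq. 22), with Zs = [Z1, ..., Zm]."""
    h = x_new
    for Z in Zs:
        h = np.linalg.pinv(Z) @ h
    return h
```

When each $Z_i$ has full column rank and $x^*$ lies exactly in the range of the learned model, this chain recovers the true top-level features.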
6 Experiments
Our main hypothesis is that a Deep Semi-NMF is able to learn better high-level representations of our original data than a one-layer Semi-NMF for clustering according to the attribute with the lowest variability in the dataset. In order to evaluate this hypothesis, we have compared the performance of Deep Semi-NMF with that of other methods on the task of clustering images of faces in two distinct datasets. These datasets are:

CMU PIE: We used a freely available version of CMU PIE [36], which comprises grayscale face images of 68 subjects. Each person has 42 facial images under different light and illumination conditions. In this database we only know the identity of the face in each image.

XM2VTS: The Extended Multi Modal Verification for Teleservices and Security applications (XM2VTS) database [37] contains frontal images of different subjects. Each subject has two available images for each of the four different laboratory sessions, for a total of 8 images. The images were eye-aligned and resized.
In order to evaluate the performance of our Deep Semi-NMF model, we compared it not only against Semi-NMF [8], but also against other NMF variants that could be useful in learning such representations. More specifically, for each of our two datasets we performed the following experiments:

Pixel Intensities: Using only the pixel intensities of the images in each of our datasets, which of course give us a strictly nonnegative input data matrix $X$, we compare the reconstruction error and the clustering performance of our Deep Semi-NMF method against Semi-NMF, NMF with multiplicative update rules [18], Multi-layer NMF [23], GNMF [19], and NeNMF [38].

Image Gradient Orientations (IGO): In general, the trend in computer vision is to use complicated engineered features like HOGs, SIFT, LBPs, etc. As a proof of concept, we choose to conduct experiments with simple image gradient orientations [39] as features instead of pixel intensities, which results in a data matrix of mixed signs; we expect that we can learn better data representations for clustering faces according to identities. In this case, we only compared our Deep Semi-NMF with its one-layer Semi-NMF equivalent, as the other techniques are not able to deal with mixed-sign matrices.
In subsection 6.6, having demonstrated the effectiveness of the purely unsupervised Deep Semi-NMF model, we show how pretraining a Deep WSF model on an auxiliary dataset and using the learned weights to initialize an unsupervised Deep Semi-NMF can lead to significant improvements in clustering accuracy.
Finally, in subsection 6.7, we examine the classification abilities of the proposed models for each of the three attributes of the CMU Multi-PIE dataset (pose/expression/identity) and use this to further test our secondary hypothesis, i.e., that the representation at each layer is in fact most suited for learning according to the attribute that corresponds to that layer.
6.1 An example with multimodal synthetic data
As previously mentioned, images of faces form multimodal distributions which are composed of multiple attributes such as pose and identity. A simplified example of such a dataset is shown in Figure 6, where we have two subjects depicting two poses each. This example two-dimensional dataset was generated using 100 samples from each of four normal distributions.

As previously discussed in subsection 3.1, Semi-NMF is an instance of a single-layer neural network. As such, there cannot exist a linear projection which maps the original data distribution into a subspace such that the two subjects (red and blue) of the dataset become linearly separable.
Instead, by employing a deep factorization model, using the labels for pose and identity for the first and second layer respectively, we can find a nonlinear mapping which separates the two identities, as shown in Figure 9.
6.2 Implementation Details
To initiate the matrix factorization process, NMF and Semi-NMF algorithms start from some initial point $(Z^{(0)}, H^{(0)})$, where usually $Z^{(0)}$ and $H^{(0)}$ are randomly initialized matrices.

A problem with this approach is not only that the initialization point may be far from the final convergence point, but also that it makes the process non-deterministic.

The initialization of Semi-NMF proposed by its authors is instead based on the k-means algorithm [40]. Nonetheless, k-means is computationally heavy when the number of components is fairly high. As an alternative, we implemented the approach of [41], which suggests exact and heuristic algorithms that solve Semi-NMF decompositions using an SVD-based initialization. We have found that using this method for Semi-NMF, Deep Semi-NMF, and WSF helps the algorithms converge much sooner.
Similarly, to speed up the convergence rate of NMF we use the Nonnegative Double Singular Value Decomposition (NNDSVD) suggested by Boutsidis et al. [42]. NNDSVD is a method based on two SVD processes, one approximating the initial data matrix and the other approximating the positive sections of the resulting partial SVD factors.
For the GNMF experimental setup, we chose a suitable number of neighbours to create the regularizing graph, by visualizing our datasets using Laplacian Eigenmaps [43], such that we had visually distinct clusters (in our case 5).
6.3 Number of layers
Important for the experimental setup is the selected structure of the multi-layered models. After careful preliminary experimentation, we focused on experiments that involve two-hidden-layer architectures for the Deep Semi-NMF and Multi-layer NMF, with a fixed number of features in the first hidden representation and a range of sizes for the second. This allowed us to have comparable configurations between the different datasets, and it was a reasonable compromise between speed and accuracy. Nonetheless, in Figure 12 we show experiments with more than two layers on our two datasets. In the latter experiment we generated two hundred configurations of the Deep Semi-NMF with a varying number of layers, and we evaluated the final feature layer according to its clustering accuracy on the XM2VTS and CMU Multi-PIE datasets. To make these models comparable, we kept a constant number of components for the last layer and generated the number of components for the remaining layers by drawing from an exponential distribution with a mean of 400 components and arranging the draws in decreasing order. We did so to comply with our main assumption: the first layers of our hierarchical model capture attributes with a larger variance and thus the model needs a larger capacity to encode them, whereas the last layers capture attributes with a lower variance.
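The sampling scheme described above can be sketched as follows. The constant size of the final layer is elided in the text here, so `last` is a placeholder value.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_architecture(n_layers, last=20, mean=400):
    """Draw the widths of all but the last layer from an exponential
    distribution (mean 400 components) and sort them in decreasing
    order, so earlier layers get the larger capacity. The fixed size
    of the final layer (`last`) is a placeholder."""
    widths = rng.exponential(mean, n_layers - 1).astype(int) + 1
    return sorted(widths, reverse=True) + [last]

arch = sample_architecture(4)  # e.g. a four-layer configuration
```

Sorting in decreasing order enforces the paper's assumption directly: wide early layers for high-variance attributes, narrow late layers for low-variance ones.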
6.4 Reconstruction Error Results
Our first experiment was to evaluate whether the extra layers, which naturally introduce more factors and are therefore more difficult to optimize, result in a lower-quality local optimum. We evaluated how well the matrix decomposition is performed by calculating the reconstruction error, the Frobenius norm of the difference between the original data and its reconstruction, for all the methods we compared. Note that, in order to have comparable results, all of the methods share the same stopping criteria. We set the maximum number of iterations to 1000 (usually fewer suffice) and stop the process early when the change in reconstruction error between the current and previous update falls below a small tolerance. Table I shows the change in reconstruction error with respect to the selected number of features for all the methods we used on the Multi-PIE dataset.
Table I: Reconstruction error (Frobenius norm) on CMU Multi-PIE versus the number of components.

Model            |   20  |   30  |   40  |   50  |   60  |   70
-----------------|-------|-------|-------|-------|-------|------
Deep Semi-NMF    |  9.18 |  7.61 |  6.50 |  5.67 |  4.99 |  4.39
GNMF             | 10.56 |  9.35 |  8.73 |  8.18 |  7.81 |  7.48
Multi-layer NMF  | 11.11 | 10.16 |  9.28 |  8.49 |  7.63 |  6.98
NMF (MUL)        | 10.53 |  9.36 |  8.51 |  7.91 |  7.42 |  7.00
NeNMF            |  9.83 |  8.39 |  7.39 |  6.60 |  5.94 |  5.36
Semi-NMF         |  9.14 |  7.57 |  6.43 |  5.53 |  4.76 |  4.13
The results show that Semi-NMF consistently reaches a much lower reconstruction error than the other methods, which matches our expectations, as it does not constrain the weights to be nonnegative. What is important to note here is that the Deep Semi-NMF models do not have a significantly higher reconstruction error than the equivalent Semi-NMF models, even though the approximation involves more factors. Multi-layer NMF and GNMF have a larger reconstruction error, in return for uncovering more meaningful features than their NMF counterpart.
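The error metric and the stopping rule used in this comparison can be written down directly; the tolerance value used in the paper is elided in the text here, so `tol` below is a placeholder.

```python
import numpy as np

def reconstruction_error(X, X_hat):
    """Frobenius norm of the residual, as reported in Table I."""
    return np.linalg.norm(X - X_hat, 'fro')

def has_converged(err_prev, err_curr, tol=1e-6):
    """Stop when the change in reconstruction error between two
    consecutive updates is small enough (tol is a placeholder)."""
    return abs(err_prev - err_curr) < tol * max(err_prev, 1.0)
```

Using the same `has_converged` rule (and the same iteration cap) for every method keeps the reported errors comparable across algorithms.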
6.5 Clustering Results
After achieving satisfactory reconstruction error with our method, we proceeded to evaluate the features learned at the final representation layer by using k-means clustering, as in [19]. To assess the clustering quality of the representations produced by each of the algorithms, we take advantage of the fact that the datasets are already labelled. The two metrics used were the accuracy (AC) and the normalized mutual information (NMI), as defined in [44]. For a cleaner presentation, we have included all the experiments that use NMI in the supplement.
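The AC metric maps predicted cluster labels to ground-truth labels via the best one-to-one assignment before counting matches [44]. A small sketch using the Hungarian algorithm (as implemented in SciPy):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """AC metric: find the best one-to-one mapping between cluster
    labels and ground-truth labels, then report the match rate."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    k = int(max(y_true.max(), y_pred.max())) + 1
    # Contingency matrix: counts of (predicted cluster, true label) pairs.
    cost = np.zeros((k, k), dtype=int)
    for t, p in zip(y_true, y_pred):
        cost[p, t] += 1
    rows, cols = linear_sum_assignment(-cost)  # negate to maximise matches
    return cost[rows, cols].sum() / len(y_true)

# A pure label permutation still scores perfectly:
acc = clustering_accuracy([0, 0, 1, 1], [1, 1, 0, 0])  # -> 1.0
```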
We made use of two main nonlinearities for our experiments: the scaled hyperbolic tangent g(x) = 1.7159 tanh(2x/3) [45], and a square auxiliary function g(x) = x^2.
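Both nonlinearities can be written directly; the scaled-tanh constants below are those recommended in [45], which we assume the experiments follow.

```python
import numpy as np

def scaled_tanh(x, a=1.7159, b=2.0 / 3.0):
    """Scaled hyperbolic tangent g(x) = a * tanh(b * x); the constants
    a = 1.7159 and b = 2/3 follow the recommendation in [45]."""
    return a * np.tanh(b * x)

def square(x):
    """Square auxiliary function g(x) = x**2."""
    return np.square(x)
```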
Figures 13-14 show the comparison in clustering accuracy when using k-means on the feature representations produced by each of the techniques we compared, when our input matrix contained only the pixel intensities of each image. Our method significantly outperforms every method we compared it with, on all the datasets, in terms of clustering accuracy.
By using IGOs (image gradient orientations), the Deep Semi-NMF was able to outperform the single-layer Semi-NMF, as shown in Figures 15-16. Making use of these simple mixed-sign features improved the clustering accuracy considerably. It should be noted that in all cases, with the exception of the CMU PIE pose experiment with IGOs, our Deep Semi-NMF outperformed all other methods with a statistically significant difference in performance (paired t-test).
6.6 Supervised pretraining
As the optimization process of deep architectures is highly non-convex, the initialization point of the process is an important factor in obtaining a good final representation of the initial dataset. Following trends in deep learning [46], we show that supervised pretraining of our model on an auxiliary dataset, with the learned weights used as the initialization point for the unsupervised Deep Semi-NMF algorithm, can lead to significant improvements in clustering accuracy.
As an auxiliary dataset we use XM2VTS, where we resize all the images to a 32x32 resolution to match that of CMU PIE, our primary dataset. Splitting the XM2VTS dataset into training/validation sets, we learn weights using a Deep WSF model with suitable regularization parameters.
We then use the weights obtained from the supervised task as an initialization point and perform unsupervised fine-tuning on the CMU PIE dataset. To evaluate the resulting features, we once again perform clustering using the k-means algorithm.
In our experiments, all the models with supervised pretraining outperformed those without, as shown in Figure 17, in terms of clustering accuracy. This further supports our claim that pretraining can be exploited to obtain better representations from unlabelled data.
6.7 Learning with Respect to Different Attributes
Finally, we conducted classification experiments using each of the three representations learned by our three-layered Deep WSF models, when the input was the raw pixel intensities of the images of a larger subset of the CMU Multi-PIE dataset.
CMU Multi-PIE contains images of subjects captured under laboratory conditions in four different sessions. In this work, we used a subset of images of subjects in different poses and expressing 6 different emotions: the samples for which we had annotations and which were captured under the same illumination conditions. Using the annotations from [47, 48], we aligned these images based on a common frame and then resized them to a smaller resolution. The database comes with labels for each of the attributes mentioned above: identity, illumination, pose, and expression. We only used CMU Multi-PIE for this experiment, since we only had identity labels for our other datasets. We split this subset into a training and validation set of 2025 images, and used the rest for testing.
Table II: Classification accuracy (%) for each attribute on CMU Multi-PIE.

                 Model          |  Pose  | Expression | Identity
Unsupervised     Semi-NMF       |  99.73 |      81.50 |    36.46
                 NMF            | 100.00 |      80.68 |    49.12
                 Deep Semi-NMF  |  99.86 |      80.54 |    61.22
Semi-supervised  CNMF           |  89.21 |      33.88 |    28.30
                 DNMF           | 100.00 |      82.22 |    55.78
Proposed         WSF            | 100.00 |      81.50 |    63.81
                 WSF-MA         | 100.00 |      81.50 |    64.08
                 Deep WSF       | 100.00 |      82.90 |    65.17
We compare the classification performance of an SVM classifier using the data representations of (i) the NMF, Semi-NMF, and Deep Semi-NMF models, which use no attribute information; (ii) the CNMF [24], DNMF [25], and our WSF models, which have attribute labels only for the attribute being classified; and (iii) our WSF-MA and Deep WSF models, which learn data representations based on all the available attribute information. In Table II, we report the accuracy of each method. In all of the methods, each feature layer has 100 components.
We also compared the performance of our Deep WSF with that of WSF and WSF-MA, to see whether the different levels of representation lead to better performance in classification tasks for each of the represented attributes. In both cases, and also in comparison with the rest of the state-of-the-art unsupervised and semi-supervised matrix factorization techniques, our proposed solution manages to extract better features for the task at hand, as seen in Table II.
7 Conclusion
We have introduced a novel deep architecture for semi-nonnegative matrix factorization, the Deep Semi-NMF, which is able to automatically learn a hierarchy of attributes of a given dataset, as well as representations suited for clustering according to these attributes. Furthermore, we have presented an algorithm for optimizing the factors of our Deep Semi-NMF, and we have evaluated its performance against the single-layered Semi-NMF and other related work on the problem of clustering faces with respect to their identities. We have shown that our technique is able to learn a high-level, final-layer representation for clustering with respect to the attribute with the lowest variability on two popular datasets of face images, outperforming the considered range of powerful NMF-based techniques.
We further proposed Deep WSF, which incorporates whatever knowledge of the known attributes of a dataset may be available. Deep WSF can be used for datasets that have (partially) annotated attributes, or even for combinations of different data sources where each source provides different attribute information. We demonstrated the abilities of this model on the CMU Multi-PIE dataset, where, using the additional pose, emotion, and identity information provided during training, we were able to uncover better features for each of the attributes by having the model learn from all the available attributes simultaneously. Moreover, we have shown that Deep WSF can be used to pretrain models on auxiliary datasets, not only to speed up the learning process, but also to uncover better representations for the attribute of interest.
Acknowledgments
George Trigeorgis is a recipient of the fellowship of the Department of Computing, Imperial College London, and this work was partially funded by it. The work of Konstantinos Bousmalis was funded partially from the Google Europe Fellowship in Social Signal Processing. The work of Stefanos Zafeiriou was partially funded by the EPSRC project EP/J017787/1 (4DFAB). The work of Björn W. Schuller was partially funded by the European Community’s Horizon 2020 Framework Programme under grant agreement No. 645378 (ARIAVALUSPA). The responsibility lies with the authors.
References
 [1] P. Paatero and U. Tapper, “Positive matrix factorization: A nonnegative factor model with optimal utilization of error estimates of data values,” Environmetrics, vol. 5, no. 2, pp. 111–126, 1994.
 [2] J.P. Brunet, P. Tamayo, T. R. Golub, and J. P. Mesirov, “Metagenes and molecular pattern discovery using matrix factorization,” PNAS, vol. 101, no. 12, pp. 4164–4169, 2004.
 [3] K. Devarajan, “Nonnegative matrix factorization: an analytical and interpretive tool in computational biology,” PLoS computational biology, vol. 4, no. 7, p. e1000029, 2008.
 [4] M. W. Berry and M. Browne, “Email surveillance using nonnegative matrix factorization,” Computational & Mathematical Organization Theory, vol. 11, no. 3, pp. 249–264, 2005.
 [5] S. Zafeiriou, A. Tefas, I. Buciu, and I. Pitas, “Exploiting discriminant information in nonnegative matrix factorization with application to frontal face verification,” TNN, vol. 17, no. 3, pp. 683–695, 2006.
 [6] I. Kotsia, S. Zafeiriou, and I. Pitas, “A novel discriminant nonnegative matrix factorization algorithm with applications to facial image characterization problems.” TIFS, vol. 2, no. 32, pp. 588–595, 2007.
 [7] F. Weninger and B. Schuller, “Optimization and parallelization of monaural source separation algorithms in the openBliSSART toolkit,” Journal of Signal Processing Systems, vol. 69, no. 3, pp. 267–277, 2012.
 [8] C. H. Ding, T. Li, and M. I. Jordan, “Convex and seminonnegative matrix factorizations,” IEEE TPAMI, vol. 32, no. 1, pp. 45–55, 2010.
 [9] C. Ding, X. He, and H. Simon, “On the equivalence of nonnegative matrix factorization and spectral clustering,” in Proc. SIAM Data Mining, 2005.
 [10] J. Herrero, A. Valencia, and J. Dopazo, “A hierarchical unsupervised growing neural network for clustering gene expression patterns.” Bioinformatics (Oxford, England), vol. 17, pp. 126–136, 2001.
 [11] Y. Zhao and G. Karypis, “Hierarchical clustering algorithms for document datasets,” Data Mining and Knowledge Discovery, vol. 10, pp. 141–168, 2005.
 [12] G. Tsoumakas and I. Katakis, “Multilabel classification: An overview,” International Journal of Data Warehousing and Mining, vol. 3, pp. 1–13, 2007.
 [13] Y. Zhang and Z.H. Zhou, “Multilabel dimensionality reduction via dependence maximization,” pp. 1–21, 2010.
 [14] G. Trigeorgis, K. Bousmalis, S. Zafeiriou, and B. W. Schuller, “A Deep Semi-NMF Model for Learning Hidden Representations,” in ICML, vol. 32, 2014.
 [15] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, 2006.
 [16] G. H. Golub and C. Reinsch, “Singular value decomposition and least squares solutions,” Numerische Mathematik, vol. 14, no. 5, pp. 403–420, 1970.
 [17] S. Wold, K. Esbensen, and P. Geladi, “Principal component analysis,” Chemometrics and intelligent laboratory systems, vol. 2, no. 1, pp. 37–52, 1987.
 [18] D. D. Lee and H. S. Seung, “Algorithms for nonnegative matrix factorization,” Advances in neural information processing systems, vol. 13, pp. 556–562, 2001.
 [19] D. Cai, X. He, J. Han, and T. S. Huang, “Graph regularized nonnegative matrix factorization for data representation,” TPAMI, vol. 33, no. 8, pp. 1548–1560, 2011.
 [20] J.H. Ahn, S. Choi, and J.H. Oh, “A multiplicative uppropagation algorithm,” in ICML. ACM, 2004, p. 3.
 [21] S. Lyu and X. Wang, “On algorithms for sparse multifactor nmf,” in Advances in Neural Information Processing Systems, 2013, pp. 602–610.
 [22] A. Cichocki and R. Zdunek, “Multilayer nonnegative matrix factorization,” Electronics Letters, vol. 42, pp. 947–948, 2006.
 [23] H. A. Song and S.Y. Lee, “Hierarchical data representation model  multilayer nmf,” ICLR, vol. abs/1301.6316, 2013.
 [24] H. Liu, Z. Wu, X. Li, D. Cai, and T. S. Huang, “Constrained Nonnegative Matrix Factorization for Image Representation,” PAMI, vol. 34, pp. 1299–1311, 2012.
 [25] I. Kotsia, S. Zafeiriou, and I. Pitas, “Novel discriminant nonnegative matrix factorization algorithm with applications to facial image characterization problems,” TIFS, vol. 2, pp. 588–595, 2007.
 [26] M. Riesenhuber and T. Poggio, “Hierarchical models of object recognition in cortex.” Nature neuroscience, vol. 2, no. 11, pp. 1019–1025, 1999.
 [27] J. Malo, I. Epifanio, R. Navarro, and E. P. Simoncelli, “Nonlinear image representation for efficient perceptual coding,” TIP, vol. 15, pp. 68–80, 2006.
 [28] K. Hornik, M. Stinchcombe, and H. White, “Multilayer feedforward networks are universal approximators,” Neural networks, vol. 2, no. 5, pp. 359–366, 1989.
 [29] Y. Nesterov et al., “Gradient methods for minimizing composite objective function,” 2007.
 [30] D. M. Cvetkovic, M. Doob, and H. Sachs, Spectra of graphs: Theory and application. Academic press New York, 1980, vol. 413.
 [31] M. Belkin and P. Niyogi, “Using Manifold Structure for Partially Labelled Classification,” in NIPS 2002, 2002, pp. 271–277.
 [32] M. Belkin, P. Niyogi, and V. Sindhwani, “Manifold regularization: A geometric framework for learning from labeled and unlabeled examples,” JMLR, vol. 7, pp. 2399–2434, 2006.
 [33] Y. Hao, C. Han, G. Shao, and T. Guo, “Generalized graph regularized nonnegative matrix factorization for data representation,” in Lecture Notes in Electrical Engineering, vol. 210 LNEE, 2013, pp. 1–12.
 [34] M. Turk and A. Pentland, “Eigenfaces for Recognition,” 1991.
 [35] S. Z. Li, X. W. Hou, H. J. Zhang, and Q. S. Cheng, “Learning spatially localized, parts-based representation,” CVPR, vol. 1, 2001.
 [36] T. Sim, S. Baker, and M. Bsat, “The CMU pose, illumination, and expression database,” TPAMI, vol. 25, no. 12, pp. 1615–1618, 2003.
 [37] K. Messer, J. Matas, J. Kittler, J. Luettin, and G. Maitre, “Xm2vtsdb: The extended m2vts database,” in International conference on audio and videobased biometric person authentication, vol. 964. Citeseer, 1999, pp. 965–966.
 [38] N. Guan, D. Tao, Z. Luo, and B. Yuan, “NeNMF: an optimal gradient method for nonnegative matrix factorization,” TSP, vol. 60, no. 6, pp. 2882–2898, 2012.
 [39] G. Tzimiropoulos, S. Zafeiriou, and M. Pantic, “Subspace learning from image gradient orientations,” 2012.
 [40] C. Ding, T. Li, and M. I. Jordan, “Convex and seminonnegative matrix factorizations,” TPAMI, vol. 32, pp. 45–55, 2010.
 [41] N. Gillis and A. Kumar, “Exact and Heuristic Algorithms for SemiNonnegative Matrix Factorization,” arXiv preprint arXiv:1410.7220, 2014.
 [42] C. Boutsidis and E. Gallopoulos, “Svd based initialization: A head start for nonnegative matrix factorization,” Pattern Recognition, vol. 41, no. 4, pp. 1350–1362, 2008.
 [43] M. Belkin and P. Niyogi, “Laplacian eigenmaps and spectral techniques for embedding and clustering.” in NIPS, vol. 14, 2001, pp. 585–591.
 [44] W. Xu, X. Liu, and Y. Gong, “Document clustering based on nonnegative matrix factorization,” in SIGIR. ACM, 2003, pp. 267–273.
 [45] Y. A. LeCun, L. Bottou, G. B. Orr, and K. R. Müller, “Efficient backprop,” Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 7700 LECTU, pp. 9–48, 2012.
 [46] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, “Greedy layerwise training of deep networks,” Advances in neural information processing systems, vol. 19, p. 153, 2007.
 [47] C. Sagonas, G. Tzimiropoulos, S. Zafeiriou, and M. Pantic, “300 faces inthewild challenge: The first facial landmark Localization Challenge,” in CVPR, 2013, pp. 397–403.
 [48] ——, “A semiautomatic methodology for facial landmark annotation,” in CVPRW, 2013, pp. 896–903.
 [49] S. Zafeiriou, “Discriminant nonnegative tensor factorization algorithms,” TNN, vol. 20, no. 2, pp. 217–235, 2009.
 [50] ——, “Algorithms for nonnegative tensor factorization,” in Tensors in Image Processing and Computer Vision. Springer, 2009, pp. 105–124.
George Trigeorgis is pursuing a Ph.D. degree from Imperial College London. 
Konstantinos Bousmalis is a researcher working with Google Robotics, California. 
Stefanos Zafeiriou is a Lecturer in the Department of Computing, Imperial College London. 
Björn W. Schuller is a Lecturer in the Department of Computing, Imperial College London. 