# Tensor Valued Common and Individual Feature Extraction: Multi-dimensional Perspective

###### Abstract

A novel method for common and individual feature analysis from exceedingly large-scale data is proposed, in order to ensure both computational and storage tractability and thus mitigate the curse of dimensionality, a major bottleneck in modern data science. This is achieved by making use of the inherent redundancy in so-called multi-block data structures, which represent multiple observations of the same phenomenon taken at different times, angles or recording conditions. Upon providing an intrinsic link between the properties of the outer vector product and the extracted features in tensor decompositions (TDs), the proposed common and individual information extraction from multi-block data is performed by imposing physical meaning on otherwise unconstrained factorisation approaches. This is shown to dramatically reduce the dimensionality of the search spaces for subsequent classification procedures and to yield greatly enhanced accuracy. Simulations on a multi-class classification task of large-scale extraction of individual features from a collection of partially related real-world images demonstrate the advantages of the “blessing of dimensionality” associated with TDs.


Index terms— Tensor decomposition, tensor rank, feature extraction, common and individual features, classification

## 1 Introduction

Modern datasets in data science applications have immense volume, veracity, velocity and variety (the four V’s of big data) [1, 2], and often exhibit a large degree of structural richness among their entries. These data characteristics are often prohibitive to the application of classical matrix algebra, as its “flat-view” way of operation cannot cope with the sheer volume of data and the corresponding imbalanced matrix structures, such as “tall and narrow” or “short and wide” ones. On the other hand, when arranged in multi-dimensional structures (tensors), the same data often admit much more convenient and mathematically tractable ways of analysis, by virtue of the associated multi-linear algebra. However, until recently, such an approach to data analysis was not very popular, due to the high demand for storage and computational resources.

There are several ways to tensorize data prior to further analysis, such as through: (i) natural tensor formation, (ii) experimental design, or (iii) mathematical construction [3]. This flexibility and the highly informative nature of multi-way data representations are supported by tensor decompositions (TDs), which allow for storage- and memory-efficient low-rank approximation of otherwise intractable large data, and are being exploited in a diverse range of disciplines including brain science [4, 5], chemometrics [6], psychometrics [7], machine learning [8, 9] and signal processing [3].

The generalisation of a matrix to a tensor, as in Fig. 1, is intuitive but highly non-trivial, not least due to multi-linear algebra having different properties to linear algebra. Along these lines, the authors in [10] consider the physical meaning of factor matrices obtained through TDs. Missing data can also be handled through tensor dictionary learning [11], whereby the tensor structure allows for a simultaneous retrieval of local patterns and establishing of the global information. Algorithms for the classification of multi-dimensional data have been proposed in [12, 13].

We here consider the problem of extracting information and classifying reduced-dimension features from large-scale multi-block data. A typical example of such a structure is a set of recordings of the same phenomenon under different experimental setups, such as multiple images of objects recorded under different lighting and angle combinations; such a set is referred to as multi-block data. Intuitively, the so-obtained ensemble of images contains some common and some individual features, and machine learning tasks would benefit from exploiting either only the common features (for clustering) or only the individual features (for identification), both of much lower dimensionality than the original data.

In this work, to resolve the computational and storage issues in large-scale classification problems, the identification of common features is achieved by first providing an additional insight into the physical meaning of the outer product of multiple vectors in the tensor setting, supported by an intuitive example. The separation of the common and individual feature subspaces is then achieved by the multi-linear rank (LL1) decomposition, whereby the number of “simplest” data structures in such a decomposition is equivalent to the number of multi-linear tensor ranks [14]. The non-negativity constraint is further imposed on the so-extracted factors, to preserve the physical properties of the images considered. Simulations on the benchmark ORL dataset demonstrate that the proposed method provides significant advantages in terms of accuracy, mathematical tractability and ease of interpretation, when used in conjunction with standard classification algorithms.

## 2 Common and Individual Components in Data

Consider a set, $\mathcal{X}$, of $N$ observations in matrix form, given by

$$\mathcal{X} = \left\{ \mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_N \right\}, \qquad \mathbf{X}_n \in \mathbb{R}^{I \times J} \tag{1}$$

where the so-called block-matrix structure could be a representation of medical images, EEG recordings, or financial stock characteristics. All members of such a set of matrices are naturally linked together and it would be beneficial to analyse them simultaneously; however, the representation in (1) yields an imbalanced (tall and narrow) structure which is cumbersome for further processing. The main goal of common and individual feature analysis is, therefore, to make use of the “blessing of dimensionality” associated with tensor structures, in order to find a much lower-dimensional unique subspace that is common across all $\mathbf{X}_n$. In this way, the common subspace, $\mathbf{C}_n$, can be separated from the individual information, $\mathbf{I}_n$, for every $\mathbf{X}_n \in \mathcal{X}$.

The flat-view matrix methods typically stack all entries of $\mathcal{X}$ into a tall and narrow matrix, $\mathbf{X}$, and subsequently perform matrix factorisations, such as the principal component analysis (PCA) [15], to give:

$$\mathbf{X} = \begin{bmatrix} \mathbf{X}_1 \\ \mathbf{X}_2 \\ \vdots \\ \mathbf{X}_N \end{bmatrix} = \mathbf{A}\mathbf{B}^{T} + \mathbf{E} \tag{2}$$

where $\mathbf{X} \in \mathbb{R}^{NI \times J}$. In [16], this method was applied to neuroimaging data of patients with Alzheimer's disease, whereby the term $\mathbf{A}\mathbf{B}^{T}$ is interpreted as the well-established knowledge about the disease (common components), while $\mathbf{E}$ represents the individual state of a specific patient. However, as with all matrix models, this approach does not generalise well and is only appropriate when all components, $\mathbf{X}_n$, of the tall and narrow matrix, $\mathbf{X}$, exhibit exactly the same common information.
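The flat-view stacking of (2) can be illustrated with a minimal NumPy sketch; the ensemble below is hypothetical, and truncated SVD stands in for the PCA step. The residual after removing the leading component is small only because every block here shares the same common component, which is exactly the limitation noted above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ensemble: N = 4 observations of size 50 x 20, all
# sharing one dominant common component plus small individual noise.
common = np.outer(rng.standard_normal(50), rng.standard_normal(20))
X_n = [common + 0.1 * rng.standard_normal((50, 20)) for _ in range(4)]

# Flat-view method: stack into a tall-and-narrow matrix (200 x 20).
X = np.vstack(X_n)

# PCA via truncated SVD: the leading component models the shared
# information, while the residual E holds the individual variation.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
common_part = s[0] * np.outer(U[:, 0], Vt[0])
E = X - common_part

# Relative residual: small because all blocks share the same pattern.
print(np.linalg.norm(E) / np.linalg.norm(X))
```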

Approaches to common and individual feature extraction presented in [17, 18] also employ a PCA-like factorisation of every entry of the naturally linked dataset given in (1), to yield

$$\mathbf{X}_n = \mathbf{C}_n + \mathbf{I}_n = \mathbf{U}_C \mathbf{A}_n + \mathbf{U}_n \mathbf{B}_n, \qquad n = 1, 2, \ldots, N \tag{3}$$

where the matrices, $\mathbf{C}_n$, are the common components across the dataset $\mathcal{X}$, while the matrices, $\mathbf{I}_n$, are the individual components for every $\mathbf{X}_n$ in $\mathcal{X}$. The matrices $\mathbf{U}_C$ and $\mathbf{U}_n$ are the basis matrices for $\mathbf{C}_n$ and $\mathbf{I}_n$ respectively, while the matrices $\mathbf{A}_n$ and $\mathbf{B}_n$ represent the mixing coefficients, so that $\mathbf{C}_n = \mathbf{U}_C \mathbf{A}_n$ and $\mathbf{I}_n = \mathbf{U}_n \mathbf{B}_n$.

Remark 1. Due to the linear separability of the matrices $\mathbf{C}_n$ and $\mathbf{I}_n$, it is sufficient to establish the basis of the common information, $\mathbf{U}_C$, which can be estimated through an iterative minimisation of the cost function, formulated in [19] as:

$$J(\mathbf{u}) = \sum_{n=1}^{N} \left\| \mathbf{u} - \mathbf{Q}_n \mathbf{Q}_n^{T} \mathbf{u} \right\|_2^2 \tag{4}$$

where the orthogonal matrix $\mathbf{Q}_n$ is obtained from $\mathbf{X}_n$, and $\mathbf{u}$ is a column vector of $\mathbf{U}_C$, which defines a common component of the basis $\mathbf{U}_C$ if the cost in (4) is smaller than a predefined threshold. Thus, the weak but consistently present similarities among the data matrices in the dataset $\mathcal{X}$ contribute to the total cost in (4) by the same amount as the very prominent ones and, therefore, play an important part in the multi-block data analysis.
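A projection cost of this kind can be sketched in NumPy; the data, the QR step used to obtain an orthonormal basis from each block, and the function name `common_cost` are all illustrative assumptions, not the implementation of [19]. A direction shared by every block incurs (near) zero cost, while a random direction does not.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical multi-block data: each X_n spans a subspace that
# contains one shared direction u_true plus its own directions.
u_true = rng.standard_normal(30)
u_true /= np.linalg.norm(u_true)
X_n = [np.column_stack([u_true, rng.standard_normal((30, 3))])
       for _ in range(5)]

def common_cost(u, blocks):
    """Total projection error of u onto each block's column space,
    in the spirit of Eq. (4), with Q_n from a QR factorisation."""
    cost = 0.0
    for X in blocks:
        Q, _ = np.linalg.qr(X)               # orthonormal basis of X
        cost += np.linalg.norm(u - Q @ Q.T @ u) ** 2
    return cost

# Shared direction -> near-zero cost; random direction -> large cost.
print(common_cost(u_true, X_n), common_cost(rng.standard_normal(30), X_n))
```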

The matrix approaches have their pros and cons, and are powerful if exploited appropriately; however, they do not account directly for the intrinsically multidimensional form of data. To this end, we propose a novel method for common and individual feature extraction which exploits the multi-modal properties of tensor decompositions.

## 3 Notation and Theoretical Background

A tensor of order $N$ is an $N$-dimensional array and is denoted by a bold underlined capital letter, $\underline{\mathbf{X}} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$. A particular dimension of $\underline{\mathbf{X}}$ is usually referred to as a mode. An element of a tensor is a scalar, $x_{i_1 i_2 \cdots i_N}$, which has $N$ indices. A fiber is a vector obtained by fixing all but one of the indices, e.g. $\underline{\mathbf{X}}(i_1, :, i_3)$ is a mode-2 fiber. Fixing all but two of the indices yields a matrix called a slice of a tensor, e.g. $\underline{\mathbf{X}}(:, :, i_3)$ is a frontal slice. Mode-$n$ unfolding is the process of element mapping from a tensor to a matrix, e.g. $\mathbf{X}_{(2)} \in \mathbb{R}^{I_2 \times I_1 I_3}$ is the mode-2 unfolding. The mode-$n$ product of a tensor, $\underline{\mathbf{X}}$, with a matrix, $\mathbf{A}$, is denoted by $\underline{\mathbf{Y}} = \underline{\mathbf{X}} \times_n \mathbf{A}$ and is equivalent to

$$\mathbf{Y}_{(n)} = \mathbf{A} \mathbf{X}_{(n)} \tag{5}$$

The outer product of $N$ vectors results in a rank-1 tensor of order $N$, e.g. $\underline{\mathbf{X}} = \mathbf{a} \circ \mathbf{b} \circ \mathbf{c} \in \mathbb{R}^{I \times J \times K}$, with entries $x_{ijk} = a_i b_j c_k$.
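These two operations can be made concrete with a short NumPy sketch (the vectors and the matrix `A` are hypothetical): a rank-1 tensor is built as an outer product, and a mode-2 product is computed through the mode-2 unfolding as in (5).

```python
import numpy as np

# Outer product of three vectors -> rank-1 tensor of order 3,
# with entries x_ijk = a_i * b_j * c_k.
a, b, c = np.array([1., 2.]), np.array([1., 0., 1.]), np.array([2., 3.])
X = np.einsum('i,j,k->ijk', a, b, c)          # shape (2, 3, 2)

# Mode-2 product with a matrix A, computed through the mode-2
# unfolding as in Eq. (5): Y_(2) = A X_(2).
A = np.array([[1., 1., 0.], [0., 1., 2.]])
X2 = np.moveaxis(X, 1, 0).reshape(3, -1)      # mode-2 unfolding
Y = (A @ X2).reshape(2, 2, 2)                 # multiply and fold back
Y = np.moveaxis(Y, 0, 1)                      # restore the mode order

print(Y.shape)  # (2, 2, 2)
```

The unfolding-based computation agrees with a direct contraction over the second index, which is a convenient sanity check when implementing mode-$n$ products.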

### 3.1 Basic Tensor Decompositions

The Canonical Polyadic Decomposition (CPD), illustrated in Fig. 2, represents a given tensor as a sum of rank-1 tensors. For a third-order tensor, $\underline{\mathbf{X}} \in \mathbb{R}^{I \times J \times K}$, of rank $R$, the CPD is given by

$$\underline{\mathbf{X}} = \sum_{r=1}^{R} \mathbf{a}_r \circ \mathbf{b}_r \circ \mathbf{c}_r = \underline{\mathbf{G}} \times_1 \mathbf{A} \times_2 \mathbf{B} \times_3 \mathbf{C} \tag{6}$$

where $\underline{\mathbf{G}} \in \mathbb{R}^{R \times R \times R}$ is a superdiagonal core tensor that guarantees a “one to one” relation for the factor vectors $\mathbf{a}_r$, $\mathbf{b}_r$ and $\mathbf{c}_r$, while $\mathbf{A}$, $\mathbf{B}$ and $\mathbf{C}$ are the factor matrices composed of the corresponding factor vectors, e.g. $\mathbf{A} = [\mathbf{a}_1, \ldots, \mathbf{a}_R]$. Despite soft uniqueness conditions, in practice the CPD in (6) does not provide an exact decomposition of the original data tensor [20]. On the other hand, the Higher Order Singular Value Decomposition (HOSVD) requires orthogonality constraints to be imposed on the factor matrices, is always exact [21], and takes the form

$$\underline{\mathbf{X}} = \underline{\mathbf{S}} \times_1 \mathbf{U}^{(1)} \times_2 \mathbf{U}^{(2)} \times_3 \mathbf{U}^{(3)} = \sum_{p=1}^{P} \sum_{q=1}^{Q} \sum_{t=1}^{T} s_{pqt} \, \mathbf{u}_p^{(1)} \circ \mathbf{u}_q^{(2)} \circ \mathbf{u}_t^{(3)} \tag{7}$$

where $\underline{\mathbf{S}} \in \mathbb{R}^{P \times Q \times T}$ is a dense core tensor, $\mathbf{U}^{(1)}$, $\mathbf{U}^{(2)}$ and $\mathbf{U}^{(3)}$ are the orthogonal factor matrices, and the $n$-tuple $(P, Q, T)$ is called the multi-linear rank. Observe that the HOSVD also decomposes multi-dimensional data into a sum of rank-1 terms. However, as opposed to the “one to one” relation for the CPD, the HOSVD models all possible combinations of its factor vectors, hence providing enhanced flexibility. To make use of the desirable properties of both the CPD and the HOSVD, the LL1 decomposition efficiently combines their concepts [22], by decomposing the tensor into a linear combination of $R$ tensors, whereby each term has a multi-linear rank $(L_r, L_r, 1)$, that is

$$\underline{\mathbf{X}} = \sum_{r=1}^{R} \left( \mathbf{A}_r \mathbf{B}_r^{T} \right) \circ \mathbf{c}_r = \sum_{r=1}^{R} \mathbf{E}_r \circ \mathbf{c}_r \tag{8}$$

The LL1 decomposition is illustrated in Fig. 3, where $\mathbf{E}_r = \mathbf{A}_r \mathbf{B}_r^{T}$, and the “one to one” relation between the factor matrices $\mathbf{A}_r$, $\mathbf{B}_r$ and the factor vector $\mathbf{c}_r$ is preserved. Moreover, upon employing the matrix-tensor duality, we can represent the matrix $\mathbf{E}_r$ as a tensor of order three, $\underline{\mathbf{E}}_r \in \mathbb{R}^{I \times J \times 1}$, so that $\underline{\mathbf{X}}_r = \underline{\mathbf{E}}_r \times_3 \mathbf{c}_r$.

Remark 2. The matrix $\mathbf{E}_r$ in (8) is no longer rank-1 and is consequently more informative. However, the so-obtained tensor is still considered to exhibit the simplest structure as far as the LL1 decomposition is concerned.
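As an illustration of (8) and Remark 2, the following NumPy sketch (with hypothetical dimensions) builds a tensor from $R = 2$ multi-linear rank-$(L, L, 1)$ terms and checks that a frontal slice is richer than rank-1.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical sizes: X is I x J x K, built from R = 2 terms of
# multi-linear rank (L, L, 1), as in Eq. (8).
I, J, K, L, R = 6, 5, 4, 3, 2
X = np.zeros((I, J, K))
for _ in range(R):
    A_r = rng.standard_normal((I, L))   # I x L factor matrix
    B_r = rng.standard_normal((J, L))   # J x L factor matrix
    c_r = rng.standard_normal(K)        # factor vector
    X += np.einsum('ij,k->ijk', A_r @ B_r.T, c_r)   # (A_r B_r^T) o c_r

# Each frontal slice mixes the R rank-L matrices E_r = A_r B_r^T,
# so it is no longer rank-1 (cf. Remark 2).
slice_rank = np.linalg.matrix_rank(X[:, :, 0])
print(slice_rank)  # > 1 for generic random factors
```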

## 4 Common and Individual Feature Extraction

The intuition behind the proposed common and individual feature analysis is given in the following examples.

Example 1. Observe a rank-1 tensor, $\underline{\mathbf{X}}$, of order 3, expressed as

$$\underline{\mathbf{X}} = \mathbf{a} \circ \mathbf{b} \circ \mathbf{c}, \qquad \mathbf{c} = [1, 4, 8]^{T} \tag{9}$$

According to the values of $\mathbf{c}$ and the definition of the outer product, the values in the first frontal slice of $\underline{\mathbf{X}}$ are respectively four and eight times smaller than those in the second and third frontal slices. Hence, each observation stored as a frontal slice of $\underline{\mathbf{X}}$ exhibits the same pattern (the base matrix $\mathbf{a}\mathbf{b}^{T}$), which can be considered as a common feature. At this point, no individual information can be extracted, since there is only one base matrix.
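A short NumPy illustration of Example 1, with hypothetical base vectors `a` and `b` and the scaling vector `c = [1, 4, 8]` implied by the slice ratios above:

```python
import numpy as np

# Rank-1 tensor as in Example 1; a and b are hypothetical.
a = np.array([1., 2., 3.])
b = np.array([2., 1., 4.])
c = np.array([1., 4., 8.])
X = np.einsum('i,j,k->ijk', a, b, c)

# Every frontal slice is the same base matrix a b^T, scaled by c_k:
# this base matrix is the common feature, with no individual part.
base = np.outer(a, b)
print(np.allclose(X[:, :, 1], 4 * base), np.allclose(X[:, :, 2], 8 * base))
```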

Example 2. Consider a collection, $\underline{\mathbf{Y}}$, of five different color matrices stacked along the third dimension, as illustrated in Fig. 4. The tensor rank of such a 3rd-order tensor (color ensemble) is three, that is, equivalent to the number of base colors (red, green, blue), which are the simplest structures from which all the data can be generated through a mixing matrix $\mathbf{C}$. Thus, adopting the multi-linear notation and the RGB representation of colors, we can write

$$\underline{\mathbf{Y}} = \underline{\mathbf{P}} \times_3 \mathbf{C} = \sum_{r=1}^{3} \mathbf{P}_r \circ \mathbf{c}_r \tag{10}$$

Here, $\underline{\mathbf{Y}}$ is the original data, and $\mathbf{C} \in \mathbb{R}^{5 \times 3}$ contains the intensity values of the red, green and blue colors. These three base colors are stored in the different rank-1 frontal slices, $\mathbf{P}_r$, of the tensor $\underline{\mathbf{P}}$ and represent the common information among the frontal slices, $\mathbf{Y}_n$, of $\underline{\mathbf{Y}}$. The individual features can be obtained by subtracting the weighted common features, to give

$$\mathbf{I}_n = \mathbf{Y}_n - \sum_{r=1}^{3} c_{nr} \, \mathbf{P}_r \tag{11}$$

where $\{c_{nr} \mathbf{P}_r\}$ is the subset of weighted common features for $\mathbf{Y}_n$, with respect to the values in the $n$-th row of $\mathbf{C}$.
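The subtraction in (11) can be sketched on a synthetic stand-in for the color ensemble (all sizes and values below are hypothetical): each observation is a non-negative mixture of common rank-1 base slices plus a small individual part, and removing the weighted common part recovers exactly that individual part.

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic stand-in: K = 5 observations, each a non-negative mixture
# of R = 3 rank-1 common base slices P_r plus an individual part E_n.
I, J, K, R = 4, 4, 5, 3
P = [np.outer(rng.random(I), rng.random(J)) for _ in range(R)]
C = rng.random((K, R))                              # mixing matrix
E = [0.05 * rng.standard_normal((I, J)) for _ in range(K)]
Y = [sum(C[n, r] * P[r] for r in range(R)) + E[n] for n in range(K)]

# Individual features via Eq. (11): subtract the weighted common part.
n = 2
individual = Y[n] - sum(C[n, r] * P[r] for r in range(R))
print(np.allclose(individual, E[n]))  # True
```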

### 4.1 LL1 decomposition with non-negativity constraint

If a slice $\mathbf{P}_r$ belongs to the set of common features for the data sample $\mathbf{Y}_n$, then an intuitive implication is that the corresponding value of $c_{nr}$ is positive. However, this cannot be guaranteed for a general implementation of TDs, and hence the non-negativity constraint should be imposed on the factor matrix $\mathbf{C}$, since it corresponds to the mode along which the members of the ensemble are stacked together. In order to obtain more descriptive common features, we employ the LL1 decomposition from (8). In this way, the rank of a frontal slice $\mathbf{E}_r$ is increased (see Remark 2), whereby the extraction of the common features is given by

$$\underline{\mathbf{Y}} = \sum_{r=1}^{R} \left( \mathbf{A}_r \mathbf{B}_r^{T} \right) \circ \mathbf{c}_r = \sum_{r=1}^{R} \mathbf{E}_r \circ \mathbf{c}_r, \qquad \mathbf{C} \geq 0 \tag{12}$$

and requires the minimization of the cost function

$$\min_{\{\mathbf{A}_r, \mathbf{B}_r\},\, \mathbf{C} \geq 0} \; \left\| \underline{\mathbf{Y}} - \sum_{r=1}^{R} \left( \mathbf{A}_r \mathbf{B}_r^{T} \right) \circ \mathbf{c}_r \right\|_F^2 \tag{13}$$

Notice that this problem is similar to the computation of the CPD in (6). Therefore, our solution is based on the ALS-CPD algorithm (we refer to [20] for more detail) and is summarized in Algorithm 1, where $\odot$ and $\circledast$ denote respectively the Khatri-Rao and Hadamard products, $(\cdot)^{\dagger}$ is the Moore–Penrose pseudoinverse, NonNegLeastSq performs least squares in which the coefficients relating an input to an output are constrained to be non-negative [23], while repeat($\cdot$) duplicates an input a given number of times.
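Algorithm 1 itself is not reproduced here; as an illustration of the ALS-CPD strategy it builds on, the following NumPy/SciPy sketch implements alternating least squares for the plain CPD (the $L_r = 1$ special case of (12)), with `scipy.optimize.nnls` providing the non-negativity constraint on the mode-3 factor $\mathbf{C}$. All sizes, the rank and the function names are hypothetical, and this is a sketch rather than the paper's exact algorithm.

```python
import numpy as np
from scipy.optimize import nnls

def unfold(T, mode):
    """Mode-n unfolding, with the remaining indices in C order."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def khatri_rao(A, B):
    """Column-wise Kronecker (Khatri-Rao) product."""
    return np.einsum('ir,jr->ijr', A, B).reshape(-1, A.shape[1])

def als_nn_cpd(X, R, n_iter=200, seed=0):
    """ALS for a rank-R CPD with non-negativity imposed on the mode-3
    factor C (the mode along which the observations are stacked)."""
    rng = np.random.default_rng(seed)
    A = rng.random((X.shape[0], R))
    B = rng.random((X.shape[1], R))
    C = rng.random((X.shape[2], R))
    for _ in range(n_iter):
        # Unconstrained least-squares updates for A and B.
        A = unfold(X, 0) @ np.linalg.pinv(khatri_rao(B, C)).T
        B = unfold(X, 1) @ np.linalg.pinv(khatri_rao(A, C)).T
        # Non-negative least-squares update for C: X_(3) ~ C (A ⊙ B)^T.
        KR = khatri_rao(A, B)
        C = np.vstack([nnls(KR, row)[0] for row in unfold(X, 2)])
    return A, B, C

# Sanity check on a synthetic non-negative rank-2 tensor.
rng = np.random.default_rng(1)
At, Bt, Ct = rng.random((5, 2)), rng.random((4, 2)), rng.random((3, 2))
X = np.einsum('ir,jr,kr->ijk', At, Bt, Ct)
A, B, C = als_nn_cpd(X, R=2)
X_hat = np.einsum('ir,jr,kr->ijk', A, B, C)
print(np.linalg.norm(X - X_hat) / np.linalg.norm(X))  # small residual
```

The per-row `nnls` loop plays the role of the NonNegLeastSq step in Algorithm 1; for large ensembles a block non-negative least-squares solver would be the practical choice.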

Remark 3. For the illustration of the proposed approach, we used a tensor of order three; however, unlike the matrix methods, the proposed approach generalises well and allows for the common and individual features to be extracted from a tensor of any order, with the only requirement that the observations must be concatenated along the same mode.

## 5 Simulations and Analysis

The proposed approach was employed for the classification of face images from the benchmark ORL dataset [24]. This database includes a total of 400 greyscale images of 40 subjects, taken under ten different illumination conditions and facial expressions. Ten sets of 40 images were created by randomly choosing one image of every subject. Six of these sets were arbitrarily selected for the training set, with the remaining four forming the test set. Each group from the training set was represented as a tensor in which the images of the 40 different subjects were stacked along the third dimension, as in Fig. 5. Their individual features were extracted by applying the proposed framework with the non-negativity constraint imposed on the mode-3 factor matrix, while the number of common features was found empirically. The classification models used were SVM, NN, QD and cKNN, trained on the so-obtained individual information. The classification scores were calculated over 100 realizations. Note that during the test stage, we used the original images in order to make a fair and realistic evaluation.

**Table 1: Classification Performance in %**

| | SVM | NN | QD | cKNN |
| --- | --- | --- | --- | --- |
| Original | 83.9 | 4.35 | 91.5 | 79.0 |
| CPD | 91.5 | 85.8 | 89.8 | 85.5 |
| LL1 | 94.7 | 92.2 | 86.8 | 84.3 |

Fig. 5 illustrates examples of images used in the experiments, and the extracted individual information and common features for the CPD and LL1 decomposition. Table 1 summarizes the performance of multi-class classification of the original images based on the corresponding individual information, extracted through the proposed method for the CPD and LL1 decompositions.

The most significant improvement can be observed for the NN based classifier. Here, the poor accuracy on the original data is associated with the high variance of the original samples and the small size of the training set which resulted in overfitting. On the other hand, the extracted individual features were of lower variance, which allowed the NN classifier to find a decision boundary that is less prone to fluctuations in the training data, leading to much higher classification accuracy.

## 6 Conclusion

We have proposed a novel framework for common and individual feature extraction based on the CPD and LL1 tensor decompositions with the non-negativity constraint. The multi-modal relations expressed through the outer product have been shown to play a key role in the extraction of the shared information from multi-block data. In this way, the performance of machine learning algorithms can be greatly enhanced, as the classification models use only the much lower-dimensional and significantly more discriminative individual information during the training stage. Simulations have employed the ORL database of images taken from various angles, under several illumination conditions, and with different facial expressions of the subjects, and have achieved excellent results. Unlike the matrix methods, the proposed method is very flexible and is not restricted to input data of a specific shared structure or to images of the same dimensions.

## References

- [1] A. Cichocki, “Era of big data processing: A new approach via tensor networks and tensor decompositions,” arXiv preprint arXiv:1403.2048, 2014.
- [2] A. Cichocki, N. Lee, I. Oseledets, A. H. Phan, Q. Zhao, and D. P. Mandic, “Tensor networks for dimensionality reduction and large-scale optimization. Part 1: Low-rank tensor decompositions,” Foundations and Trends® in Machine Learning, vol. 9, no. 4-5, pp. 249–429, 2016.
- [3] A. Cichocki, D. P. Mandic, L. De Lathauwer, G. Zhou, Q. Zhao, C. Caiafa, and H. A. Phan, “Tensor decompositions for signal processing applications: From two-way to multiway component analysis,” IEEE Signal Processing Magazine, vol. 32, no. 2, pp. 145–163, 2015.
- [4] A. Cichocki, “Tensor decompositions: A new concept in brain data analysis?,” arXiv preprint arXiv:1305.0395, 2013.
- [5] J. Escudero, E. Acar, A. Fernández, and R. Bro, “Multiscale entropy analysis of resting-state magnetoencephalogram with tensor factorisations in Alzheimer’s disease,” Brain Research Bulletin, vol. 119, pp. 136–144, 2015.
- [6] A. Smilde, R. Bro, and P. Geladi, “Multi-way analysis: Applications in the chemical sciences,” Technometrics, pp. 1–380, 2005.
- [7] H. A. Kiers and I. V. Mechelen, “Three-way component analysis: Principles and illustrative application,” Psychological Methods, vol. 6, no. 1, pp. 84–110, 2001.
- [8] N. D. Sidiropoulos, L. De Lathauwer, X. Fu, K. Huang, E. E. Papalexakis, and C. Faloutsos, “Tensor decomposition for signal processing and machine learning,” arXiv preprint arXiv:1607.01668, 2016.
- [9] T. D. Nguyen, T. Tran, D. Q. Phung, and S. Venkatesh, “Tensor-variate restricted Boltzmann machines,” In Proceedings of the 29th Conference on Artificial Intelligence, pp. 2887–2893, 2015.
- [10] F. Cong, Q. Lin, L. Kuang, X. Gong, P. Astikainen, and T. Ristaniemi, “Tensor decomposition of EEG signals: A brief review,” Journal of Neuroscience Methods, vol. 248, pp. 59–69, 2015.
- [11] H. A. Phan, A. Cichocki, P. Tichavský, G. Luta, and A. Brockmeier, “Tensor completion through multiple Kronecker product decomposition,” In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3233–3237, 2013.
- [12] B. Savas and L. Eldén, “Handwritten digit classification using higher order singular value decomposition,” Pattern Recognition, vol. 40, no. 3, pp. 993–1003, 2007.
- [13] R. Zink, B. Hunyadi, S. Van Huffel, and M. De Vos, “Tensor-based classification of an auditory mobile BCI without a subject-specific calibration phase,” Journal of Neural Engineering, vol. 13, no. 2, pp. 1–10, 2016.
- [14] L. Sorber, M. Van Barel, and L. De Lathauwer, “Optimization-based algorithms for tensor decompositions: Canonical polyadic decomposition, decomposition in rank-(L,L,1) terms, and a new generalization,” SIAM Journal on Optimization, vol. 23, no. 2, pp. 695–720, 2013.
- [15] Y. Guo and G. Pagnoni, “A unified framework for group independent component analysis for multi-subject fMRI data,” NeuroImage, vol. 42, no. 3, pp. 1078–1093, 2008.
- [16] A. R. Groves, C. F. Beckmann, S. M. Smith, and M. W. Woolrich, “Linked independent component analysis for multimodal data fusion,” NeuroImage, vol. 54, no. 3, pp. 2198–2217, 2011.
- [17] E. F. Lock, K. A. Hoadley, J. S. Marron, and A. B. Nobel, “Joint and individual variation explained (JIVE) for integrated analysis of multiple data types,” The Annals of Applied Statistics, vol. 7, no. 1, pp. 523–550, 2013.
- [18] G. Zhou, A. Cichocki, Y. Zhang, and D. P. Mandic, “Group component analysis for multiblock data: Common and individual feature extraction,” IEEE Transactions on Neural Networks and Learning Systems, vol. 27, no. 11, pp. 2426–2439, 2016.
- [19] G. Zhou, A. Cichocki, and D. P. Mandic, “Common components analysis via linked blind source separation,” In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2150–2154, 2015.
- [20] T. G. Kolda and B. W. Bader, “Tensor decompositions and applications,” SIAM Review, vol. 51, no. 3, pp. 455–500, 2009.
- [21] L. De Lathauwer, B. De Moor, and J. Vandewalle, “A multilinear singular value decomposition,” SIAM Journal on Matrix Analysis and Applications, vol. 21, no. 4, pp. 1253–1278, 2000.
- [22] L. De Lathauwer, “Decompositions of a higher-order tensor in block terms. Part II: Definitions and uniqueness,” SIAM Journal on Matrix Analysis and Applications, vol. 30, no. 3, pp. 1033–1066, 2008.
- [23] D. D. Lee and H. S. Seung, “Algorithms for non-negative matrix factorization,” In Proceedings of the 2000 Conference on Advances in Neural Information Processing Systems (NIPS), pp. 556–562, 2001.
- [24] F. S. Samaria and A. C. Harter, “Parameterisation of a stochastic model for human face identification,” In Proceedings of the Second IEEE Workshop on Applications of Computer Vision, pp. 138–142, 1994.