A Recurrent Probabilistic Neural Network with Dimensionality Reduction Based on Time-series Discriminant Component Analysis

Hideaki Hayashi, Taro Shibanoki, Keisuke Shima, Yuichi Kurita, and Toshio Tsuji

H. Hayashi is with the Department of Advanced Information Technology, Kyushu University, 744 Motooka, Nishi-ku, Fukuoka-shi, 819-0395 Japan. e-mail: hayashi@ait.kyushu-u.ac.jp
T. Shibanoki is with the College of Engineering, Ibaraki University, Hitachi, Japan.
K. Shima is with the Faculty of Engineering, Yokohama National University, Yokohama, 240-8501 Japan.
Y. Kurita and T. Tsuji are with the Institute of Engineering, Hiroshima University, Higashi-hiroshima, 739-8527 Japan.
Abstract

This paper proposes a probabilistic neural network developed on the basis of time-series discriminant component analysis (TSDCA) that can be used to classify high-dimensional time-series patterns. TSDCA involves the compression of high-dimensional time series into a lower-dimensional space using a set of orthogonal transformations and the calculation of posterior probabilities based on a continuous-density hidden Markov model with a Gaussian mixture model expressed in the reduced-dimensional space. The analysis can be incorporated into a neural network, named a time-series discriminant component network (TSDCN), so that the parameters of dimensionality reduction and classification can be obtained simultaneously as network coefficients according to a backpropagation through time-based learning algorithm with the Lagrange multiplier method. The TSDCN is expected to enable high-accuracy classification of high-dimensional time-series patterns and to reduce the computation time required for network training. The validity of the TSDCN is demonstrated through experiments on high-dimensional artificial data and EEG signals.

neural network, dimensionality reduction, pattern classification, hidden Markov model, Gaussian mixture model.

Nomenclature


Dimensionality in the original space

Dimensionality in the subspace

Number of classes

Number of states

Number of components

Probability

Time-series vector in the original space

Time-series vector in the subspace

Orthogonal transformation matrix

Element of

Mean vector in the original space

Mean vector in the subspace

Gaussian distribution

Mixture proportion of a GMM

Covariance matrix in the subspace

Element of

State change probability of an HMM

Prior probability of an HMM

Transformed vector in the original space

Transformed vector in the subspace

Weight between the first/second layers

Element of

Weight between the third/fourth layers

Element of

Input of the th unit

Output of the th unit

Teacher vector

Negative log-likelihood function

Lagrange function

Lagrange multiplier

Orthogonality conditions

Modification amount for

I Introduction

Time-series pattern classification has a wide range of applications, such as speech recognition, gesture recognition, and biosignal classification, and many studies have been conducted to achieve higher classification performance [1, 2, 3, 4, 5, 6, 7].

Time-series pattern classification methods can be divided into three broad categories: sequence distance-based classification, feature-based classification, and model-based classification [8]. Sequence distance-based methods measure the similarity between a pair of patterns using a distance function such as the Euclidean distance, the Mahalanobis distance [9, 10], or dynamic time warping, and then classify the patterns using conventional classification algorithms typified by the k-nearest neighbor classifier. In feature-based methods, features are extracted from the original time series and classified using, for example, support vector machines, decision trees, or neural networks (NNs). In particular, NNs have been extended to time-series classification, as exemplified by the Jordan network [11], the Elman network [12], and time-delay neural networks [13]. The most popular approach among model-based methods is the hidden Markov model (HMM). In an HMM, the system for each class is modeled by a Markov process with unobserved states. Time-series patterns are then classified based on the likelihood calculated from the model. All of the above methods, however, have drawbacks: sequence distance-based and model-based methods need a large amount of training data to estimate the distribution of the input data precisely, and feature-based methods are prone to overfitting because they have many free parameters and complex structures.

In recent years, a fourth option, “model-based NNs”, has been proposed as a hybrid of NNs and model-based methods [14, 15, 16, 17, 18, 19, 20]. Model-based NNs are developed by integrating prior knowledge of the input data into the network structure, thereby reducing the amount of training data required and preventing overfitting. Tsuji et al. [20] proposed a recurrent probabilistic neural network based on HMMs known as the recurrent log-linearized Gaussian mixture network (R-LLGMN) and showed that the R-LLGMN achieves high classification performance with less training data than HMMs and conventional NNs.

Although such model-based NNs achieve high classification performance, they often suffer from problems caused by high dimensionality. The increased input dimensionality of NNs caused by high-dimensional features (e.g., signals measured with numerous electrodes and frequency spectra) leads to problems such as poor generalization capability, difficulty in parameter learning, and longer computation time. These phenomena are collectively called the “curse of dimensionality” [21].

To avoid these problems, dimensionality reduction techniques are used before classification [22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35]. For example, Güler et al. [22] applied principal component analysis (PCA) to the frequency spectra of electromyograms, and classified them using a multi-layer perceptron and a support vector machine. Bu et al. [27] classified time-series data by combining linear discriminant analysis (LDA) and a recurrent probabilistic neural network. However, there is no guarantee that the reduced input data can be classified correctly, because the dimensionality reduction stage and the classification stage are designed separately.

In contrast, the authors previously proposed a novel time-series pattern classification model called time-series discriminant component analysis (TSDCA) [36], which reduces the dimensionality of input data using several orthogonal transformation matrices and calculates posterior probabilities for classification under the assumption that the reduced feature vectors obey an HMM. We also proposed a probabilistic neural network based on TSDCA in which the parameters of TSDCA are obtained as weight coefficients of the NN. In the learning process of that network, Gram-Schmidt orthonormalization was conducted to maintain the orthogonality of the transformation matrices used for dimensionality reduction; thus, the convergence of the learning was not guaranteed.

This paper proposes a novel recurrent probabilistic neural network, a time-series discriminant component network (TSDCN), that improves the learning algorithm of our previously proposed network. In the new learning algorithm, the Lagrange multiplier method is integrated with a backpropagation through time-based learning process to guarantee the convergence of learning while maintaining the orthogonality of the transformation matrices. In this way, the TSDCN can obtain the parameters of dimensionality reduction and classification simultaneously, thereby supporting the accurate classification of time-series data with high dimensionality.

The rest of this paper is organized as follows: Section II describes TSDCA. The structure and the learning algorithm of the TSDCN are explained in Section III. The classification ability is verified using high-dimensional artificial data and electroencephalograms (EEGs) in Sections IV and V, respectively. Finally, Section VI concludes the paper.

II Time-series discriminant component analysis (TSDCA)

II-A Model structure [36]

Fig. 1 shows an overview of TSDCA. TSDCA consists of several orthogonal transformation matrices and an HMM that incorporates a GMM as the probability density function. The model reduces the dimensionality of the input data and calculates posterior probabilities for each class.

In regard to classifying a -dimensional time-series vector () into one of the given classes, the posterior probability () is examined. First, is projected into a -dimensional vector using several orthogonal transformation matrices . This can be described as follows:

(1)

where is the mean vector of the component (; is the number of states, ; is the number of components), and is the orthogonal transformation matrix that projects from into .

In the compressed feature space, the projected data obey a probability density function as follows:

(2)
(3)

where is the covariance matrix in the compressed feature space.
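To make the computation in (1)-(3) concrete, the following Python fragment is a minimal sketch, assuming illustrative names (W for an orthogonal transformation matrix with orthonormal rows, mu for a mean vector in the original space, Sigma for a covariance matrix in the subspace). It is not the paper's implementation; it only illustrates the projection-then-Gaussian-density idea described above.

```python
import numpy as np

def project(x, W, mu):
    """Eq. (1)-style projection: subtract the mean in the original space,
    then map the D-dimensional x into the D'-dimensional subspace."""
    return W @ (x - mu)            # W has shape (D', D) with orthonormal rows

def gaussian_density(s, Sigma):
    """Eq. (2)/(3)-style zero-mean Gaussian density in the compressed space."""
    d = s.shape[0]
    inv = np.linalg.inv(Sigma)
    norm = 1.0 / np.sqrt(((2 * np.pi) ** d) * np.linalg.det(Sigma))
    return norm * np.exp(-0.5 * s @ inv @ s)

# Toy usage with assumed sizes D = 50, D' = 2.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((50, 2)))  # orthonormal columns
W = Q.T                                            # (2, 50), W @ W.T = I
x = rng.standard_normal(50)
mu = np.zeros(50)
s = project(x, W, mu)
print(gaussian_density(s, np.eye(2)))
```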

Fig. 1: Overview of time-series discriminant component analysis (TSDCA). TSDCA is an expansion of a hidden Markov model (HMM). In this analysis, each class is assumed to be a Markov process with unobserved states. Each state has a Gaussian mixture model (GMM) that incorporates several orthogonal transformation matrices, allowing a reduction in the dimensionality of the time-series data and a calculation of the posterior probabilities of input data for each class.

Assuming that the projected data obey an HMM, the posterior probability of given is calculated as

(4)
(5)
(6)

where is the probability of a state change from to in class , is defined as the likelihood of corresponding to the state in class , and the prior probability is equal to . Here, can be derived with the form

(7)

where represents the mixture proportion.

Incidentally, TSDCA can be simplified in the particular case where and . Since becomes in this case, can be calculated instead of (4) as

(8)

This is equivalent to the posterior probability calculation for static data based on a GMM. TSDCA therefore includes classification based on a simple GMM as a special case and can be applied not only to time-series data classification but also to static data classification.
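As a rough illustration of the forward-type computation in (4)-(7), the sketch below evaluates a GMM likelihood per state and accumulates it over time with a state-change matrix before normalizing over classes. The variable names (emission, A, pi) and array layout are assumptions made for this example, not the paper's notation.

```python
import numpy as np

def gmm_pdf(s_list, weights, covs):
    """Eq. (7)-style mixture likelihood for one state: s_list holds the
    projected (mean-subtracted) vectors, one per mixture component."""
    total = 0.0
    for s, w, Sigma in zip(s_list, weights, covs):
        d = s.shape[0]
        norm = 1.0 / np.sqrt(((2 * np.pi) ** d) * np.linalg.det(Sigma))
        total += w * norm * np.exp(-0.5 * s @ np.linalg.inv(Sigma) @ s)
    return total

def class_posteriors(emission, A, pi):
    """Forward recursion per class, then normalization over classes (Eq. (4)).
    emission[c][t][k] is the state-k GMM likelihood of x(t) for class c,
    A[c] is the state-change matrix, pi[c] the prior state probabilities."""
    C = len(emission)
    totals = np.zeros(C)
    for c in range(C):
        alpha = pi[c] * emission[c][0]                 # initial step, cf. Eq. (5)
        for t in range(1, len(emission[c])):
            alpha = (alpha @ A[c]) * emission[c][t]    # Eq. (6)-style update
        totals[c] = alpha.sum()
    return totals / totals.sum()                       # posterior for each class
```

In the special case noted above (a single state and no temporal dependence), the recursion collapses to a single normalization of GMM likelihoods, i.e., the static-data case in (8).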

II-B Log-linearization

An effective way to acquire dimensionality reduction appropriate for classification is to train the dimensionality reduction part and classification part together with a single criterion [37]. In this paper, the authors address this issue by incorporating TSDCA into a neural network so that the parameters of TSDCA can be trained simultaneously as network coefficients through a backpropagation training algorithm.

In preparation for this incorporation, (1) and (7) are rewritten as linear combinations of coefficient matrices and input vectors. First, is transformed as follows:

(9)

where corresponds to the image of the mean vector mapped onto the compressed space from the input space. Hence, is expressed by the multiplication of the coefficient matrix and the new input vector . Secondly, setting

(10)

and taking the log-linearization of gives

(11)

where are elements of the inverse matrix , and is the Kronecker delta, which is 1 if and 0 otherwise. Additionally, is defined as

(12)

As stated above, the parameters of TSDCA can be expressed with a smaller number of coefficients and using log-linearization. If these coefficients are appropriately obtained, the parameters and the structure of the model can be determined and the posterior probability of high-dimensional time-series data for each class can be calculated.
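To make the log-linearization concrete, the sketch below shows how the logarithm of a weighted Gaussian evaluated at a projected vector becomes a dot product between a single coefficient vector and a fixed nonlinear expansion of that vector (a constant term plus the pairwise products of its elements). The packing order and function names are assumptions for illustration; the paper's exact definition is given in (10)-(12).

```python
import numpy as np

def nonlinear_expansion(s):
    """Eq. (12)-style conversion of the projected vector: [1, s_i * s_j for i <= j]."""
    d = len(s)
    quad = [s[i] * s[j] for i in range(d) for j in range(i, d)]
    return np.concatenate(([1.0], quad))

def log_linear_coefficients(weight, Sigma):
    """Pack log(mixture weight), the Gaussian normalization constant, and the
    elements of -0.5 * Sigma^{-1} into one coefficient vector (cf. Eq. (11))."""
    d = Sigma.shape[0]
    inv = np.linalg.inv(Sigma)
    const = np.log(weight) - 0.5 * np.log(((2 * np.pi) ** d) * np.linalg.det(Sigma))
    coeffs = [const]
    for i in range(d):
        for j in range(i, d):
            factor = -0.5 if i == j else -1.0   # off-diagonal terms appear twice
            coeffs.append(factor * inv[i, j])
    return np.array(coeffs)

# Check: the dot product reproduces log(weight * N(s; 0, Sigma)).
rng = np.random.default_rng(1)
s = rng.standard_normal(3)
Sigma = np.diag([1.0, 2.0, 0.5])
lhs = log_linear_coefficients(0.3, Sigma) @ nonlinear_expansion(s)
rhs = (np.log(0.3)
       - 0.5 * np.log(((2 * np.pi) ** 3) * np.linalg.det(Sigma))
       - 0.5 * s @ np.linalg.inv(Sigma) @ s)
print(np.isclose(lhs, rhs))   # True
```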

The next section describes how and are acquired as weight coefficients of an NN through learning.

III Time-series discriminant component network (TSDCN)

III-A Network structure

Fig. 2 shows the structure of the TSDCN, which is a seven-layer recurrent network with weight coefficients and between the first/second and third/fourth layers, respectively, and a feedback connection between the fifth and sixth layers. The computational complexity of each layer is proportional to the number of units, which is specified below.

Fig. 2: Structure of the time-series discriminant component network (TSDCN). The TSDCN is constructed by incorporating the calculation of TSDCA into the network structure, and consequently consists of seven layers. , , and in the figure stand for a linear sum unit, an identity unit, and a multiplication unit, respectively. The weight coefficients between the first and second layers correspond to the orthogonal transformation matrices and the mean vectors of compressed data, and conduct dimensionality reduction of the input data. The weight coefficients between the third and fourth layers include the probabilistic parameters of GMMs and HMMs. The recurrent connections between the fifth and sixth layers correspond to the state changes of HMMs. Because of this structure, a calculation equivalent to TSDCA can be implemented.

The first layer consists of units corresponding to the dimensions of the input data . The relationships between the input and the output are defined as

(13)
(14)

where and are the input and output of the th unit, respectively. This layer corresponds to the construction of in (9).

The second layer is composed of units, each receiving the output of the first layer weighted by the coefficient . The relationships between the input and the output of the unit , , , are described as

(15)
(16)

where the weight coefficient corresponds to each element of the matrix described as follows:

(17)

This layer is equal to the multiplication of and in (9).

The third layer comprises () units. The relationships between the input and the output of the units are defined as

(18)
(19)

where ( = 1,,), and (18) corresponds to the nonlinear conversion shown in (12).

The fourth layer comprises units. Unit receives the output of the third layer weighted by the coefficient . The input and the output are defined as

(20)
(21)

where the weight coefficient corresponds to each element of the vector .

(22)

Equation (20) represents the multiplication of and in (11), and (21) corresponds to the exponential function of the Gaussian distribution in (2).

The fifth layer consists of units. The outputs of the fourth layer are summed and input into this layer. The output of the sixth layer at the previous time step is also fed back to the fifth layer. These operations are expressed as follows:

(23)
(24)

where for the initial phase. This layer represents the summation over components in (7) and the multiplication of , , and in (6).

The sixth layer has units. The relationships between the input and the output of the unit {} are described as

(25)
(26)

This layer corresponds to the summation over states in (6) and the normalization in (4).

Finally, the seventh layer consists of units, and its input and output are

(27)
(28)

corresponds to the posterior probability for class . Hence, the posterior probability based on TSDCA can be calculated if the NN coefficients , are appropriately determined.
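The layer-by-layer description above can be summarized in a single forward pass. The sketch below follows only the flow of computation stated in the text (dimensionality reduction, nonlinear expansion, weighting and exponentiation, recurrent accumulation over states, normalization, and summation per class); the array shapes, weight indexing, and the initialization of the feedback to 1 are assumptions, not the paper's exact formulation.

```python
import numpy as np

def expansion(s):
    """Layer-3 nonlinear conversion: constant term plus pairwise products."""
    d = len(s)
    return np.concatenate(([1.0], [s[i] * s[j] for i in range(d) for j in range(i, d)]))

def tsdcn_forward(X, W1, W2, C, K, M):
    """X: (T, D) time series; W1[c][k][kp][m]: (D', D+1) reduction weights;
    W2[c][k][kp][m]: packed log-linear coefficients between layers 3 and 4."""
    T = X.shape[0]
    prev = {c: np.ones(K) for c in range(C)}           # layer-6 feedback, assumed 1 at t=0
    for t in range(T):
        x_aug = np.concatenate(([1.0], X[t]))          # layer 1: append bias term
        out6 = {}
        for c in range(C):
            layer5 = np.zeros(K)
            for k in range(K):
                for kp in range(K):
                    acc = 0.0
                    for m in range(M):
                        s = W1[c][k][kp][m] @ x_aug                    # layer 2
                        acc += np.exp(W2[c][k][kp][m] @ expansion(s))  # layers 3-4
                    layer5[k] += prev[c][kp] * acc                     # layer 5 (recurrent)
            out6[c] = layer5
        total = sum(out6[c].sum() for c in range(C))
        prev = {c: out6[c] / total for c in range(C)}  # layer 6: normalization
    return np.array([prev[c].sum() for c in range(C)]) # layer 7: class posteriors
```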

III-B Learning algorithm

The learning process of the TSDCN is required to achieve maximization of the likelihood and orthogonalization of the transformation matrices simultaneously. In our previous paper, weight modifications based on maximum likelihood and the Gram-Schmidt process for orthogonalization were conducted alternately [36]. Unfortunately, the Gram-Schmidt process interfered with the monotonic increase of the likelihood; hence, the convergence of network learning was not theoretically guaranteed.

This paper addresses this issue by introducing the Lagrange multiplier method to the learning algorithm. In preparation, let us redefine as follows:

(29)

where is the th column vector of . A set of input vectors (; is the number of training data) is given for training with the teacher vector for the th input, where and () for the training sample of class . Now consider the following optimization problem:

(30)

Here, is a negative log-likelihood function described as

(31)

where is the output for the th input at . is a constraint that represents orthogonality conditions defined as

(32)

is the number of constraints given as . Using the method of Lagrange multipliers, the constrained optimization problem can be converted into an unconstrained problem defined as

(33)

where is a Lagrange multiplier. By differentiating with respect to , , and , the necessary conditions of the above optimization problem are derived as

(34)
(35)

where is a differential operator with respect to and . The negative log-likelihood function can therefore be minimized while maintaining orthogonality by obtaining a point that satisfies (34) and (35) using an arbitrary nonlinear optimization algorithm.

Now we consider the minimization of using the gradient method. The partial derivatives of with respect to each weight coefficient are given as

(36)
(37)

thus can be minimized without constraint with respect to . The weight modification for is then defined as

(38)

where is the learning rate. If the learning rate is appropriately chosen, the convergence of the gradient method is guaranteed.

With regard to , let be an arbitrary point that satisfies (35); the differential vector to the next point is then given as

(39)

where is calculated as

(40)

Here, the Taylor expansion for (35) at is

(41)

Let us denote a solution of the above equation by ; (41) can then be rewritten as

(42)

Hence (39) and (42) can be combined as follows:

(43)

where is an identity matrix. Solving the above simultaneous equations, can be updated using as follows:

(44)

The detailed calculation of the gradient vectors is described in the Appendix.

Using this algorithm, the weight coefficients for dimensionality reduction and classification can be trained collectively and simultaneously.
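A minimal sketch of one training step in this scheme is given below, assuming Python/NumPy and hypothetical gradient inputs: the unconstrained weights are updated by a plain gradient step as in (38), and each orthogonal transformation matrix is updated by solving a linear system that couples the gradient with the linearized orthogonality constraints, in the spirit of (39)-(44). The constraint Jacobian is computed numerically here as a stand-in for the analytic derivatives given in the Appendix.

```python
import numpy as np

def orthogonality_constraints(U):
    """Eq. (32)-style conditions: the rows of U must stay orthonormal."""
    G = U @ U.T - np.eye(U.shape[0])
    iu = np.triu_indices(U.shape[0])
    return G[iu]                                   # independent entries only

def constraint_jacobian(U, eps=1e-6):
    """Numerical Jacobian of the constraints w.r.t. the entries of U
    (a stand-in for the analytic gradients in the Appendix)."""
    base = orthogonality_constraints(U)
    J = np.zeros((base.size, U.size))
    for i in range(U.size):
        Up = U.reshape(-1).copy()
        Up[i] += eps
        J[:, i] = (orthogonality_constraints(Up.reshape(U.shape)) - base) / eps
    return J

def gradient_step(V, grad_J_V, eta=0.01):
    """Unconstrained update corresponding to Eq. (38)."""
    return V - eta * grad_J_V

def constrained_step(U, grad_J_U, eta=0.01):
    """Update an orthogonal transformation matrix: the step follows the negative
    gradient within the constraint tangent space and corrects any constraint
    violation (cf. the combined linear system in Eq. (43))."""
    A = constraint_jacobian(U)                     # (n_constraints, n_weights)
    n = U.size
    KKT = np.block([[np.eye(n), A.T],
                    [A, np.zeros((A.shape[0], A.shape[0]))]])
    rhs = np.concatenate([-eta * grad_J_U.reshape(-1),
                          -orthogonality_constraints(U)])
    sol = np.linalg.solve(KKT, rhs)                # [step; Lagrange multipliers]
    return U + sol[:n].reshape(U.shape)
```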

IV Simulation experiments

To investigate the characteristics of the TSDCN, artificial data classification experiments were conducted. The purposes of these experiments were as follows:

A: To reveal the classification ability of the TSDCN under ideal conditions and the influence of network parameter variation.
B: To show that the TSDCN can reduce training time while maintaining classification accuracy equivalent to that of conventional NNs.
C: To compare the dimensionality reduction ability of the TSDCN with that of conventional methods.
D: To compare the TSDCN with our previous network.

In terms of the examination of network parameters, Tsuji et al. have already discussed the influence of parameters derived from an HMM, such as the number of states , the number of components , the number of training data , and the time-series length , in the R-LLGMN [20]. Subsection A therefore examines the influence of the number of input dimensions and the number of reduced dimensions , since these constitute the greatest difference between the TSDCN and the R-LLGMN. The variation of the number of classes is also examined because it is the most important issue for classification.

In subsection B, the training time of the TSDCN is compared with those of the R-LLGMN [20] and the Elman network [12], neither of which performs dimensionality reduction.

In subsection C, the authors show the differences in dimensionality reduction characteristics between the TSDCN and PCA, and between the TSDCN and LDA, using two kinds of visualizable problems, and then verify how these differences influence high-dimensional data classification.

Finally, a comparison with our previously proposed network is conducted in subsection D. Since the TSDCN is an improvement over our previous network [36] with respect to the learning algorithm, this comparison demonstrates that the improvement indeed yields better accuracy and convergence.

The computer used in each experiment had an Intel Core(TM) i7-3770K (3.5 GHz) processor and 16.0 GB of RAM. Furthermore, a terminal attractor was introduced to reduce the training time [38].

IV-A Classification under ideal conditions and network parameter variation

IV-A1 Method

Fig. 3: Example of used in the classification experiment. The data were generated based on an HMM for each class (, ).

The data used in the experiment were generated from an HMM for each class because the ideal condition for the TSDCN is that the input signals obey HMMs. The HMMs used were fully connected, with two states and two components per state. The length of the time series was set as . Fig. 3 shows an example of . The parameters used in this figure were , . The vertical and horizontal axes in the figure indicate the value of each dimension and time, respectively. In terms of the HMM parameters used in the generation of , the mean vectors and the covariance matrices were set using uniform random numbers in ; the mixture proportions, the initial state probabilities, and the state change probabilities were determined by uniform random vectors normalized in . In the validation of classification accuracy, 5 samples were treated as training data, and 50 samples were used as test data for each class. The classification accuracy is defined as , where is the number of test data correctly classified and is the total number of test data. Average classification accuracy and training time were then calculated by changing the HMM parameters, regenerating the training/test data sets 10 times, and resetting the initial weight coefficients of the TSDCN 10 times for each data set. For fixed numbers of states and components , the number of input dimensions and the number of reduced dimensions were varied as and , respectively. The number of classes was also varied as for fixed and .
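For reference, a hedged sketch of the data-generation protocol described above is shown below: each class is assigned a fully connected HMM with two states and two Gaussian components per state whose parameters are drawn at random, and time series are sampled from it. The parameter ranges, function names, and the use of diagonal covariance matrices are illustrative assumptions where the text omits exact values.

```python
import numpy as np

def random_hmm(D, K=2, M=2, rng=None):
    """Draw random HMM/GMM parameters for one class (ranges are assumptions)."""
    rng = rng or np.random.default_rng()
    means = rng.uniform(-1.0, 1.0, size=(K, M, D))                   # mean vectors
    covs = np.array([[np.diag(rng.uniform(0.1, 1.0, D))              # covariances
                      for _ in range(M)] for _ in range(K)])
    mix = rng.random((K, M)); mix /= mix.sum(axis=1, keepdims=True)  # mixture proportions
    pi = rng.random(K); pi /= pi.sum()                               # initial state probs
    A = rng.random((K, K)); A /= A.sum(axis=1, keepdims=True)        # state change probs
    return means, covs, mix, pi, A

def sample_series(hmm, T, rng=None):
    """Sample one time series of length T from the given HMM."""
    rng = rng or np.random.default_rng()
    means, covs, mix, pi, A = hmm
    X = np.zeros((T, means.shape[2]))
    k = rng.choice(len(pi), p=pi)
    for t in range(T):
        m = rng.choice(mix.shape[1], p=mix[k])
        X[t] = rng.multivariate_normal(means[k, m], covs[k, m])
        k = rng.choice(A.shape[1], p=A[k])
    return X

def accuracy(predicted, true):
    """Classification accuracy: correctly classified test data / total test data."""
    return np.mean(np.asarray(predicted) == np.asarray(true))
```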

IV-A2 Results

Fig. 4: Average training time and standard deviation for each combination of the number of input dimensions and the number of reduced dimensions ().

Fig. 4 shows the average training time and its standard deviation for each combination of the number of input dimensions and the number of reduced dimensions (). In this experiment, the classification accuracy was [%] for all combinations of and .

Fig. 5: Average classification accuracy and training time of the TSDCN for each number of classes (, ). Note that the vertical axis scale from 0 to 99.0 was omitted in (a).

Fig. 5 shows the average classification accuracy and training time for each number of classes . The classification accuracies for to were [%]. Although the classification accuracies for and were uneven ( [%] and [%], respectively), there were no significant differences among , , and based on the Holm method [39].

IV-A3 Discussion

TABLE I: Calculation complexity of each calculation part: forward calculation, backpropagation, and the Lagrange multiplier method.

In Fig. 4 (a), the increase in training time with the increase of is almost linear for a small number of , such as and . In contrast, the training time increases as a quadratic function for a larger number of . This can be explained in terms of calculation complexity. Table I summarizes the approximate complexity of each part of the calculation. When the number of reduced dimensions is smaller, the calculation amounts of the forward calculation and the backpropagation are substantially larger than that of the Lagrange multiplier method; hence, the total calculation complexity with respect to can be regarded as linear. If is larger, however, the calculation complexity of the Lagrange multiplier method rapidly increases, and the quadratic term of cannot be ignored. In other words, the TSDCN can keep the increase in the amount of calculation with respect to the number of input dimensions linear if the number of reduced dimensions is sufficiently small, whereas conventional methods such as HMMs and the R-LLGMN have a complexity of because they are required to calculate the covariance matrices of the input vectors.

In the relationship between the classification accuracy and the number of classes shown in Fig. 5, the TSDCN achieved exceedingly high classification accuracy for every although the training time increased as increased. These results indicate that the TSDCN can classify high-dimensional time-series data very accurately if the assumption that the input data obey HMMs is satisfied. In addition, the TSDCN achieved high classification performance even for highly multi-class problems such as although the number of reduced dimensions was very small (). This is because the TSDCN has a different orthogonal transformation matrix for each class; hence, class-specific mappings onto reduced-dimensional spaces appropriate for classification can be acquired. Therefore, there is no need to use a large number of , and the previously mentioned restriction that should be sufficiently small is easily satisfied.

Fig. 6: Average classification accuracy and training time of each method for comparison with conventional NNs.

IV-B Comparison with conventional NNs

IV-B1 Method

The data used in this experiment were also generated by an HMM (, , ). The classification accuracy and training time of each NN were calculated in the same manner as in subsection A. The parameters of the TSDCN were , , and , and the numbers of states and components of the R-LLGMN were identical to those of the TSDCN. The Elman network had three layers (one hidden) with 30, 10, and 3 units in the input, hidden, and output layers, respectively. The input/output function was a log-sigmoid function, and backpropagation with a learning rate of 0.01 and an error threshold of 0.01 was used for training.

IV-B2 Results

Fig. 6 shows the classification accuracy and the training time of each method. The classification accuracies of both the TSDCN and the R-LLGMN were [%]. The accuracy of the Elman network was [%], and significant differences from the other methods were confirmed. The training times of the TSDCN, the R-LLGMN, and the Elman network were [s], [s], and [s], respectively, and there were significant differences among all the methods.

IV-B3 Discussion

It can be considered that the Elman network estimated the distribution of the high-dimensional input data in the original feature space and consequently became structurally complex, resulting in low classification accuracy and a significantly long training time. The R-LLGMN achieved high classification performance since it is based on HMMs. However, its training time was long because the R-LLGMN modeled all the input dimensions using numerous parameters. The training time of the TSDCN was significantly reduced in spite of its high classification accuracy. These results show that the TSDCN can reduce the training time while maintaining high classification performance.

IV-C Comparison with conventional dimensionality reduction

IV-C1 Method

Fig. 7: Example of the problem used in the comparison with PCA. (a) shows an example of the time-series signal of each dimension, and (b) shows a scattergram of the distribution of the population used in the data generation. The auxiliary axes and show the first and second principal components, respectively.
Fig. 8: Example of the problem used in the comparison with LDA. The scattergram in (b) shows that the problem is nonlinearly separable.
Fig. 9: Example of used in the comparison with conventional dimensionality reduction techniques. The data were prepared by combining (HMM-based signals) and white Gaussian noise . The noise ratio was in this figure.

Fig. 7 shows an example of the problem used in the comparison with PCA, where Fig. 7 (a) is an example of the time series , and Fig. 7 (b) is a scattergram that indicates the distribution of the population used in data generation. The auxiliary axes and in Fig. 7 (b) show the direction with the largest variance (the first principal component) and the direction orthogonal to it (the second principal component), respectively. The signals were generated based on the following equation so that is an insignificant component for classification:

(45)

where is a Gaussian random number with a mean of 0 and a covariance matrix of [1 -1; -1 1], generated independently for each time and each class. Reduced-dimensional signals were calculated using the TSDCN and PCA and compared with each other. For the reduced-dimensional signals of PCA, classification accuracy was then calculated using the R-LLGMN by changing the combination of 5 training samples and 100 test samples 10 times and resetting the initial weight coefficients 10 times for each data set. The classification accuracy of the TSDCN for the original problem was also calculated.
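The specific form of (45) is not reproduced here; the following sketch merely generates two-dimensional data with the property described in the text, namely that the noise with covariance [1 -1; -1 1] places the largest variance along a direction shared by both classes, while the class difference lies along the orthogonal, low-variance direction that PCA would discard. The class offset and series length are illustrative assumptions.

```python
import numpy as np

def generate_pca_problem(n_samples, T=20, offset=0.5, rng=None):
    """Generate time series for two classes whose first principal component
    carries no class information (an illustrative stand-in for Eq. (45))."""
    rng = rng or np.random.default_rng()
    cov = np.array([[1.0, -1.0], [-1.0, 1.0]])      # noise varies along (1, -1)
    X, y = [], []
    for c in (0, 1):
        shift = offset * np.array([1.0, 1.0]) * (1 if c == 0 else -1)
        for _ in range(n_samples):
            noise = rng.multivariate_normal(np.zeros(2), cov, size=T)
            X.append(shift + noise)                 # class information along (1, 1)
            y.append(c)
    return np.array(X), np.array(y)
```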

An example of the problem used in the comparison with LDA is shown in Fig. 8, where Fig. 8 (a) and (b) represent an example of a time series and a scattergram indicating the distribution of the population, respectively. In Fig. 8 (b), the data of class 1 are distributed in the first and third quadrants, and the data of class 2 are distributed in the second and fourth quadrants, in reference to the XOR problem, a typical nonlinear classification problem. To be more precise, the data are uniform random numbers in the region where for class 1, and