A Recurrent Probabilistic Neural Network with Dimensionality Reduction Based on Time-series Discriminant Component Analysis
Abstract
This paper proposes a probabilistic neural network developed on the basis of time-series discriminant component analysis (TSDCA) that can be used to classify high-dimensional time-series patterns. TSDCA involves the compression of high-dimensional time series into a lower-dimensional space using a set of orthogonal transformations and the calculation of posterior probabilities based on a continuous-density hidden Markov model with a Gaussian mixture model expressed in the reduced-dimensional space. The analysis can be incorporated into a neural network, which is named a time-series discriminant component network (TSDCN), so that the parameters of dimensionality reduction and classification can be obtained simultaneously as network coefficients according to a backpropagation-through-time-based learning algorithm with the Lagrange multiplier method. The TSDCN is considered to enable high-accuracy classification of high-dimensional time-series patterns and to reduce the computation time taken for network training. The validity of the TSDCN is demonstrated for high-dimensional artificial data and EEG signals in the experiments conducted during the study.
Nomenclature

- Dimensionality in the original space
- Dimensionality in the subspace
- Number of classes
- Number of states
- Number of components
- Probability
- Time-series vector in the original space
- Time-series vector in the subspace
- Orthogonal transformation matrix
- Element of the orthogonal transformation matrix
- Mean vector in the original space
- Mean vector in the subspace
- Gaussian distribution
- Mixture proportion of a GMM
- Covariance matrix in the subspace
- Element of the covariance matrix
- State change probability of an HMM
- Prior probability of an HMM
- Transformed vector in the original space
- Transformed vector in the subspace
- Weight between the first/second layers
- Element of the first/second-layer weight
- Weight between the third/fourth layers
- Element of the third/fourth-layer weight
- Input of each unit
- Output of each unit
- Teacher vector
- Negative log-likelihood function
- Lagrange function
- Lagrange multiplier
- Orthogonality conditions
- Weight modification amount
I. Introduction
Time-series pattern classification has a wide range of applications, such as speech recognition, gesture recognition, and biosignal classification, and many studies have been performed to investigate higher classification performance [1, 2, 3, 4, 5, 6, 7].
Time-series pattern classification methods can be divided into three broad categories: sequence-distance-based classification, feature-based classification, and model-based classification [8]. Sequence-distance-based methods measure the similarity between a pair of patterns based on a distance function, such as the Euclidean distance, the Mahalanobis distance [9, 10], or dynamic time warping, and then classify the patterns using conventional classification algorithms typified by a nearest-neighbor classifier. In feature-based methods, features are extracted from the original time series and are classified using support vector machines, decision trees, or neural networks (NNs). In particular, NNs have been extended for time-series classification, as seen in the Jordan network [11], the Elman network [12], and time-delay neural networks [13]. The most popular approach among model-based methods is the hidden Markov model (HMM). In the HMM, the system for each class is modeled by a Markov process with unobserved states; time-series patterns are then classified based on a likelihood calculated from the model. All the above methods, however, have some drawbacks: sequence-distance-based and model-based methods need a large amount of training data to estimate the distribution of the input data precisely, while feature-based methods are likely to cause overfitting because they have too many free parameters and overly complex structures.
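As an illustration of the sequence-distance-based category, the following sketch (hypothetical data and function names, not from the paper) classifies a query sequence with a 1-nearest-neighbor rule under a dynamic-time-warping distance:

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D sequences."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j]: best cumulative cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # insertion
                                 cost[i][j - 1],      # deletion
                                 cost[i - 1][j - 1])  # match
    return cost[n][m]

def nearest_neighbor_classify(query, templates):
    """templates: list of (sequence, label); returns the label of the closest template."""
    return min(templates, key=lambda t: dtw_distance(query, t[0]))[1]

templates = [([0, 1, 2, 3], "rising"), ([3, 2, 1, 0], "falling")]
print(nearest_neighbor_classify([0, 0.9, 2.1, 3.2], templates))  # prints "rising"
```

The quadratic-time alignment table is exactly the kind of per-pair cost that motivates the model-based alternatives discussed next.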
In recent years, a fourth option, “model-based NNs”, has been proposed as a hybrid of NNs and model-based methods [14, 15, 16, 17, 18, 19, 20]. Model-based NNs are developed by integrating prior knowledge of the input data into the network structure, which reduces the amount of training data required and prevents overfitting. Tsuji et al. [20] proposed a recurrent probabilistic neural network based on HMMs known as the recurrent log-linearized Gaussian mixture network (RLLGMN) and showed that the RLLGMN achieves high classification performance with a smaller amount of training data than HMMs and conventional NNs.
Although such model-based NNs have high classification performance, they often suffer from problems caused by high dimensionality. The increased input dimensionality of NNs arising from high-dimensional features (e.g., signals measured with numerous electrodes and frequency spectra) causes problems such as poor generalization capability, parameter learning difficulty, and longer computation time. These phenomena are called the “curse of dimensionality” [21].
To avoid these problems, dimensionality reduction techniques are used before classification [22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35]. For example, Güler et al. [22] applied principal component analysis (PCA) to the frequency spectra of electromyograms and classified them using a multilayer perceptron and a support vector machine. Bu et al. [27] classified time-series data by combining linear discriminant analysis (LDA) and a recurrent probabilistic neural network. However, there is no guarantee that the reduced input data can be classified correctly, because the dimensionality reduction stage and the classification stage are designed separately.
In contrast, the authors previously proposed a novel time-series pattern classification model called time-series discriminant component analysis (TSDCA) [36], which allows a reduction in the dimensionality of input data using several orthogonal transformation matrices and enables the calculation of posterior probabilities for classification under the assumption that the reduced feature vectors obey an HMM. We also proposed a probabilistic neural network based on TSDCA in which the parameters of TSDCA are obtained as weight coefficients of the NN. In the learning process of that network, Gram-Schmidt orthonormalization was conducted to maintain the orthogonality of the transformation matrices for dimensionality reduction; thus, the convergence of the learning was not guaranteed.
This paper proposes a novel recurrent probabilistic neural network, the time-series discriminant component network (TSDCN), which improves the learning algorithm of our previously proposed network. In the new learning algorithm, the Lagrange multiplier method is integrated with a backpropagation-through-time-based learning process to guarantee the convergence of learning while maintaining the orthogonality of the transformation matrices. In this way, the TSDCN can obtain the parameters of dimensionality reduction and classification simultaneously, thereby supporting the accurate classification of time-series data with high dimensionality.
The rest of this paper is organized as follows: Section II describes TSDCA. The structure and the learning algorithm of the TSDCN are explained in Section III. The verification of the classification ability using high-dimensional artificial data and electroencephalograms (EEGs) is presented in Sections IV and V. Finally, Section VI concludes the paper.
II. Time-series discriminant component analysis (TSDCA)
II-A. Model structure [36]
Fig. 1 shows an overview of TSDCA, which consists of several orthogonal transformation matrices and an HMM that incorporates a GMM as the probability density function. The model allows a reduction in the dimensionality of input data and enables the calculation of posterior probabilities for each class.
In regard to classifying a time-series vector in the original space into one of the given classes, the posterior probability of each class is examined. First, the input vector is projected into the lower-dimensional subspace using several orthogonal transformation matrices. This can be described as follows:
(1)
where the mean vector is defined for each component (one for each state and each component of each class), and the orthogonal transformation matrix projects the input from the original space into the subspace.
In the compressed feature space, the projected data obey a probability density function as follows:
(2)
(3)
where the covariance matrix is defined in the compressed feature space.
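The bodies of (1)-(3) did not survive extraction. Under the definitions in the Nomenclature, a standard form consistent with the surrounding text would read as follows; this is a hedged reconstruction, with the subscripting (c for class, k for state, m for component) and the subspace dimensionality M assumed rather than taken from the original:

```latex
% Assumed reconstruction of (1): projection into the subspace by an
% orthogonal transformation matrix W and a mean vector mu.
\mathbf{s}^{(c,k,m)}(t)
  = \mathbf{W}^{(c,k,m)} \bigl( \mathbf{x}(t) - \boldsymbol{\mu}^{(c,k,m)} \bigr)

% Assumed reconstruction of (2)-(3): Gaussian density of the projected
% vector in the M-dimensional subspace.
p\bigl(\mathbf{s} \mid c,k,m\bigr)
  = \frac{1}{(2\pi)^{M/2}\,\lvert \boldsymbol{\Sigma}^{(c,k,m)} \rvert^{1/2}}
    \exp\!\Bigl( -\tfrac{1}{2}\,
      \mathbf{s}^{\mathsf{T}} \bigl( \boldsymbol{\Sigma}^{(c,k,m)} \bigr)^{-1}
      \mathbf{s} \Bigr)
```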
Assuming that the projected data obey an HMM, the posterior probability of each class given the input sequence is calculated as
(4)
(5)
(6)
where one factor is the probability of a state change between states within the class, another is defined as the likelihood of the projected vector corresponding to the state in the class, and the prior probability of each class is assumed equal. Here, the state-dependent likelihood can be derived in the form
(7)
where the mixture proportions weight the Gaussian components.
Incidentally, TSDCA can be simplified in the particular case where the number of states and the time-series length both reduce to one. In this case, the posterior probability can be calculated, instead of (4), as
(8)
This is equivalent to the posterior probability calculation for static data based on a GMM. TSDCA therefore includes classification based on a simple GMM and can be applied not only to time-series data classification but also to static data classification.
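The static special case corresponding to (8) can be sketched as follows. This is a minimal 1-D illustration with hypothetical parameters, assuming equal class priors; it is not the paper's implementation:

```python
import math

def gauss_pdf(x, mean, var):
    """Univariate Gaussian density."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def gmm_likelihood(x, components):
    """components: list of (weight, mean, var); mixture density at x."""
    return sum(w * gauss_pdf(x, m, v) for w, m, v in components)

def gmm_posteriors(x, class_gmms):
    """With equal class priors, the posterior is each class likelihood over the total."""
    likes = [gmm_likelihood(x, gmm) for gmm in class_gmms]
    total = sum(likes)
    return [l / total for l in likes]

# two classes, two components each (weights sum to 1 within a class)
class_gmms = [
    [(0.5, -2.0, 1.0), (0.5, -1.0, 1.0)],  # class 1: mass around x < 0
    [(0.5,  1.0, 1.0), (0.5,  2.0, 1.0)],  # class 2: mass around x > 0
]
post = gmm_posteriors(-1.5, class_gmms)
print(post)  # class 1 dominates for x = -1.5
```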
II-B. Log-linearization
An effective way to achieve dimensionality reduction appropriate for classification is to train the dimensionality reduction part and the classification part together under a single criterion [37]. In this paper, the authors address this issue by incorporating TSDCA into a neural network so that the parameters of TSDCA can be trained simultaneously as network coefficients through a backpropagation training algorithm.
In preparation for the incorporation, (1) and (7) are reformulated as linear combinations of coefficient matrices and input vectors. First, (1) is transformed as follows:
(9)
where the new bias term corresponds to the image of the mean vector mapped from the input space onto the compressed space. Hence, the projection is expressed as the multiplication of a coefficient matrix and a new input vector. Secondly, setting
(10) 
and taking the log-linearization of the likelihood gives
(11)
where the coefficients are elements of the inverse covariance matrix, and the Kronecker delta is 1 if its indices are equal and 0 otherwise. Additionally, the nonlinear input vector is defined as
(12)  
As stated above, the parameters of TSDCA can be expressed with a smaller number of coefficient matrices and vectors using log-linearization. If these coefficients are appropriately obtained, the parameters and the structure of the model can be determined, and the posterior probability of high-dimensional time-series data for each class can be calculated.
The next section describes how these coefficients are acquired as weight coefficients of an NN through learning.
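The gist of the log-linearization in (11)-(12) can be checked numerically in one dimension: the log of a Gaussian density is exactly linear in an augmented nonlinear feature vector (1, x, x²), so the density parameters fold into a single coefficient vector. The following is an assumed 1-D simplification, not the paper's multivariate form:

```python
import math, random

# 1-D Gaussian: log N(x; mu, var) = w0 + w1*x + w2*x^2, i.e. linear in (1, x, x^2)
mu, var = 1.5, 0.8
w0 = -mu ** 2 / (2 * var) - 0.5 * math.log(2 * math.pi * var)
w1 = mu / var
w2 = -1.0 / (2 * var)

def log_gauss(x):
    """Log of the Gaussian density, computed directly."""
    return -(x - mu) ** 2 / (2 * var) - 0.5 * math.log(2 * math.pi * var)

def log_linear(x):
    """Same quantity as an inner product of coefficients and nonlinear features."""
    features = (1.0, x, x * x)   # the nonlinear input vector, cf. (12)
    weights = (w0, w1, w2)       # density parameters folded into coefficients
    return sum(w * f for w, f in zip(weights, features))

for _ in range(5):
    x = random.uniform(-3, 3)
    assert abs(log_gauss(x) - log_linear(x)) < 1e-9
print("log-density is exactly linear in the augmented features")
```

Once the density is in this form, its parameters can sit inside a network layer as ordinary weights, which is precisely what motivates the TSDCN structure of the next section.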
III. Time-series discriminant component network (TSDCN)
III-A. Network structure
Fig. 2 shows the structure of the TSDCN, which is a seven-layer recurrent network with weight coefficients between the first/second and third/fourth layers, respectively, and a feedback connection between the fifth and sixth layers. The computational complexity of each layer is proportional to the number of units specified below.
The first layer consists of units corresponding to the dimensions of the input data. The relationships between the input and the output are defined as
(13)
(14)
where the input and output of each unit are given accordingly. This layer corresponds to the construction of the new input vector in (9).
The second layer is composed of units, each receiving the output of the first layer weighted by the coefficients. The relationships between the input and the output of each unit are described as
(15)
(16)
where the weight coefficients correspond to the elements of the matrix described as follows:
(17)
This layer is equal to the multiplication of the coefficient matrix and the input vector in (9).
The third layer comprises units. The relationships between the input and the output of the units are defined as
(18)
(19)
where (18) corresponds to the nonlinear conversion shown in (12).
The fourth layer comprises units. Each unit receives the output of the third layer weighted by the coefficients. The input and the output are defined as
(20)
(21)
where the weight coefficients correspond to the elements of the vector:
(22)
Equation (20) stands for the multiplication of the coefficient vector and the nonlinear input vector in (11), and (21) corresponds to the exponential function of a Gaussian distribution in (2).
The fifth layer consists of units. The outputs of the fourth layer are summed and input into this layer. The one-time-prior output of the sixth layer is also fed back to the fifth layer. These operations are expressed as follows:
(23)
(24)
where the feedback term takes a predefined initial value in the initial phase. This layer represents the summation over components in (7) and the multiplications of the state change probabilities, the likelihoods, and the fed-back probabilities in (6).
The sixth layer has units. The relationships between the input and the output of each unit are described as
(25)
(26)
This layer corresponds to the summation over states in (6) and the normalization in (4).
Finally, the seventh layer consists of units, and its input and output are
(27)  
(28) 
The output corresponds to the posterior probability for each class. Here, the posterior probability based on TSDCA can be calculated if the NN weight coefficients are appropriately established.
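The recurrent computation of the fifth through seventh layers amounts to a normalized HMM forward recursion. The following sketch (hypothetical sizes and parameters; the fourth-layer outputs are assumed to be the state-wise likelihoods) illustrates one time step:

```python
def forward_step(prev_alpha, trans, likelihood):
    """One recurrent pass through the fifth and sixth layers.

    prev_alpha[c][k] : fed-back sixth-layer output (normalized forward probability)
    trans[c][j][k]   : state-change probability for class c, state j -> k
    likelihood[c][k] : fourth-layer output (state-wise likelihood of the input)
    Returns the normalized alpha and the class posteriors (seventh layer).
    """
    C = len(trans)
    # fifth layer: propagate the fed-back probabilities and weight by likelihoods
    alpha = [[sum(prev_alpha[c][j] * trans[c][j][k] for j in range(len(trans[c])))
              * likelihood[c][k]
              for k in range(len(trans[c][0]))]
             for c in range(C)]
    # sixth layer: normalize over all classes and states
    total = sum(sum(row) for row in alpha)
    alpha = [[a / total for a in row] for row in alpha]
    # seventh layer: sum over states to get class posteriors
    posteriors = [sum(row) for row in alpha]
    return alpha, posteriors

# two classes, two states; uniform initial feedback
alpha = [[0.25, 0.25], [0.25, 0.25]]
trans = [[[0.9, 0.1], [0.2, 0.8]],
         [[0.5, 0.5], [0.5, 0.5]]]
likelihood = [[0.8, 0.6], [0.1, 0.1]]  # class 1 consistently more likely
for _ in range(3):
    alpha, post = forward_step(alpha, trans, likelihood)
print(post)  # the posterior for class 1 grows toward 1
```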
III-B. Learning algorithm
The learning process of the TSDCN is required to achieve maximization of the likelihood and orthogonalization of the transformation matrices simultaneously. In our previous paper, the weight modifications based on maximum likelihood and the Gram-Schmidt process for orthogonalization were conducted alternately [36]. Unfortunately, the Gram-Schmidt process interfered with the monotonic increase of the likelihood; hence, the convergence of network learning was not theoretically guaranteed.
This paper addresses this issue by introducing the Lagrange multiplier method into the learning algorithm. In preparation, let us redefine the weight vector as follows:
(29) 
where the components are the column vectors of the transformation matrices. A set of input vectors is given for training, together with a teacher vector for each input whose element for the correct class is 1 and whose other elements are 0. Now consider the following optimization problem:
(30) 
Here, the objective is a negative log-likelihood function described as
(31)
where the network output for each input and time appears in the likelihood. The constraint represents the orthogonality conditions defined as
(32)
The number of constraints is determined by the orthogonality conditions. Using the method of Lagrange multipliers, the constrained optimization problem can be converted into an unconstrained problem defined as
(33) 
where the Lagrange multipliers weight the constraints. By differentiating the Lagrange function with respect to the weight coefficients and the multipliers, the necessary conditions of the above optimization problem are derived as
(34)
(35)
where the differentiation is taken with respect to the weights. The negative log-likelihood function can therefore be minimized while maintaining orthogonality by obtaining a point that satisfies (34) and (35) using an arbitrary nonlinear optimization algorithm.
Now we consider the minimization using the gradient method. The partial derivatives of the Lagrange function with respect to the unconstrained weight coefficients are given as
(36)
(37)
thus the function can be minimized without constraints with respect to these weights. The corresponding weight modification is then defined as
(38)
where the step size is the learning rate. If the learning rate is appropriately chosen, the convergence of the gradient method is guaranteed.
With regard to the constrained weights, let an arbitrary point satisfying (35) be chosen; the differential vector to the next point is then given as
(39)
where the gradient term is calculated as
(40)
Here, the Taylor expansion of (35) at this point is
(41)
Let us denote a solution of the above equation accordingly; thereby, (41) can be rewritten as
(42)
Hence, (39) and (42) can be combined as follows:
(43)
where the identity matrix appears in the coefficient matrix of the system. Solving the above simultaneous equations, the constrained weights can be modulated as follows:
(44)
The detailed calculation of the gradient vectors is described in the Appendix.
Using this algorithm, collective training can be applied to the weight coefficients for dimensionality reduction and discrimination simultaneously.
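The constraint-preserving update scheme can be illustrated on a toy problem: minimizing a quadratic objective under a unit-norm constraint, with the Lagrange multiplier chosen so that the linearized constraint holds after each step. This is a sketch of the general idea behind (33)-(44), with hypothetical numbers, not the paper's exact gradients:

```python
def constrained_step(w, grad_f, eta):
    """One Lagrange-corrected gradient step for the constraint g(w) = w.w - 1 = 0.

    The multiplier lam is chosen so that the linearized constraint
    g(w) + grad_g . dw = 0 holds after the step.
    """
    g = sum(x * x for x in w) - 1.0
    grad_g = [2.0 * x for x in w]
    gg = sum(a * a for a in grad_g)
    gf = sum(a * b for a, b in zip(grad_g, grad_f))
    # dw = -eta * (grad_f + lam * grad_g); solving g + grad_g . dw = 0 gives:
    lam = (g / eta - gf) / gg
    return [x - eta * (df + lam * dg) for x, df, dg in zip(w, grad_f, grad_g)]

# toy objective: f(w) = 0.5 * |w - v|^2, minimized over the unit circle;
# the constrained minimum is v / |v| (here [0.6, 0.8])
v = [3.0, 4.0]
w = [1.0, 0.0]   # feasible starting point, |w| = 1
eta = 0.1
for _ in range(300):
    grad_f = [wi - vi for wi, vi in zip(w, v)]
    w = constrained_step(w, grad_f, eta)
print(w)  # close to [0.6, 0.8], with |w| ~= 1
```

On the constraint surface, this step reduces to a projected gradient move along the tangent direction, which is why the iterate stays (to first order) orthogonality-preserving while the objective decreases.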
IV. Simulation experiments
To investigate the characteristics of the TSDCN, artificial data classification experiments were conducted. The purposes of these experiments were as follows: A: to reveal the classification ability of the TSDCN under ideal conditions and the influence of network parameter variation; B: to show that the TSDCN can reduce training time while maintaining classification accuracy equivalent to that of conventional NNs; C: to compare the dimensionality reduction ability of the TSDCN with conventional methods; and D: to compare the TSDCN with our previous network.
In terms of the examination of network parameters, Tsuji et al. have already discussed the influence of parameters derived from an HMM, such as the number of states, the number of components, the number of training data, and the time-series length, in the RLLGMN [20]. Subsection A will therefore examine the influence of the number of input dimensions and the number of reduced dimensions, since these constitute the greatest difference between the TSDCN and the RLLGMN. The variation of the number of classes will also be examined because it is the most important issue for classification.
In subsection B, the training time of the TSDCN will be compared with those of the RLLGMN [20] and the Elman network [12], which do not include dimensionality reduction.
The authors will show the differences in dimensionality reduction characteristics between the TSDCN and PCA, and between the TSDCN and LDA, using two visualizable problems, and then verify how these differences influence high-dimensional data classification in subsection C.
Finally, a comparison with our previously proposed network will be conducted in subsection D. Since the TSDCN improves on our previous network [36] with respect to the learning algorithm, this comparison demonstrates that the improvement yields better accuracy and convergence.
The computer used in each experiment had an Intel Core(TM) i7-3770K (3.5 GHz) processor and 16.0 GB of RAM. Furthermore, a terminal attractor was introduced to curtail the training time [38].
IV-A. Classification under ideal conditions and network parameter variation
IV-A.1 Method
The data used in the experiment were generated based on an HMM for each class, because the ideal condition for the TSDCN is that the input signals obey HMMs. The HMMs used were fully connected and had two states with two components for each state. The length of the time series was fixed. Fig. 3 shows an example of the generated data; the vertical axis and the horizontal axis indicate the value of each dimension and time, respectively. In terms of the HMM parameters used in the data generation, the mean vectors and the covariance matrices were set using uniform random numbers; the mixture proportions, the initial state probabilities, and the state change probabilities were determined by normalized uniform random vectors. In the validation of classification accuracy, 5 samples were treated as training data and 50 samples were used as test data for each class. The classification accuracy is defined as the ratio of the amount of correctly classified test data to the total amount of test data. Average classification accuracy and training time were then calculated by changing the HMM parameters, regenerating the training/test data sets 10 times, and resetting the initial weight coefficients of the TSDCN 10 times for each data set. For fixed numbers of states and components, the number of input dimensions and the number of reduced dimensions were varied over several values; the number of classes was also varied for fixed input and reduced dimensionalities.
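The data generation described above can be sketched as follows. The parameter values here are hypothetical (the paper draws its HMM parameters at random), and a 1-D emission is used for brevity:

```python
import random

def sample_hmm(length, init, trans, states):
    """Sample a 1-D time series from an HMM with GMM emissions.

    init[k]     : initial state probabilities
    trans[j][k] : state-change probabilities from state j to state k
    states[k]   : list of (weight, mean, std) mixture components for state k
    """
    def draw(probs):
        """Sample an index from a discrete distribution."""
        r, acc = random.random(), 0.0
        for i, p in enumerate(probs):
            acc += p
            if r < acc:
                return i
        return len(probs) - 1

    k = draw(init)
    series = []
    for _ in range(length):
        m = draw([w for w, _, _ in states[k]])   # pick a mixture component
        _, mean, std = states[k][m]
        series.append(random.gauss(mean, std))   # emit an observation
        k = draw(trans[k])                       # move to the next state
    return series

random.seed(0)
init = [0.5, 0.5]
trans = [[0.9, 0.1], [0.1, 0.9]]               # two fully connected states
states = [[(0.5, 0.0, 0.1), (0.5, 1.0, 0.1)],  # state 1: two components
          [(0.5, 4.0, 0.1), (0.5, 5.0, 0.1)]]  # state 2: two components
x = sample_hmm(100, init, trans, states)
print(len(x))  # 100
```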
IV-A.2 Results
Fig. 4 shows the average training time and its standard deviation for each combination of the number of input dimensions and the number of reduced dimensions. In this experiment, the classification accuracies were [%] for all combinations of the numbers of input and reduced dimensions.
Fig. 5 shows the average classification accuracy and training time for each number of classes. The classification accuracies for the smaller numbers of classes were [%]. Although the classification accuracies for the two largest class counts were uneven ([%] and [%], respectively), there were no significant differences among them based on the Holm method [39].
IV-A.3 Discussion
[Table I: approximate computational complexity of the forward calculation, the backpropagation, and the Lagrange multiplier method]
In Fig. 4 (a), the increase in training time with the number of input dimensions is almost linear when the number of reduced dimensions is small. In contrast, the training time increases quadratically for larger numbers of reduced dimensions. This can be explained on the basis of computational complexity. Table I summarizes the approximate complexity of each part of the calculation. When the number of reduced dimensions is smaller, the calculation amounts of the forward calculation and the backpropagation are substantially larger than that of the Lagrange multiplier method; hence, the total computational complexity with respect to the number of input dimensions can be regarded as linear. If the number of reduced dimensions is larger, however, the computational complexity of the Lagrange multiplier method rapidly increases, and its quadratic term cannot be ignored. In other words, the TSDCN can keep the growth of the amount of calculation linear in the number of input dimensions if the number of reduced dimensions is sufficiently small, whereas conventional methods such as HMMs and the RLLGMN incur quadratic complexity in the input dimensionality because they are required to calculate the covariance matrices of the input vectors.
In the relationship between classification accuracy and the number of classes shown in Fig. 5, the TSDCN achieved exceedingly high classification accuracy for every number of classes, although the training time increased as the number of classes increased. These results indicate that the TSDCN can classify high-dimensional time-series data very accurately if the assumption that the input data obey HMMs is satisfied. In addition, the TSDCN achieved high classification performance even for problems with many classes, although the number of reduced dimensions was very small. This is because the TSDCN has a different orthogonal transformation matrix for each class; hence, separate mappings for each class onto reduced-dimensional spaces appropriate for classification can be acquired. Therefore, there is no need to use a large number of reduced dimensions, and the previously mentioned requirement that the number of reduced dimensions be sufficiently small is easily satisfied.
IV-B. Comparison with conventional NNs
IV-B.1 Method
The data used in this experiment were also generated by an HMM. The classification accuracy and training time for each NN were calculated in the same manner as in subsection A. The numbers of states and components of the RLLGMN were identical to those of the TSDCN. The Elman network had three layers (one hidden) with 30, 10, and 3 units for the input, hidden, and output layers, respectively. The input/output function was a log-sigmoid function, and backpropagation with a learning rate of 0.01 and an error threshold of 0.01 was used for training.
IV-B.2 Results
Fig. 6 shows the classification accuracy and the training time for each method. The classification accuracies of both the TSDCN and the RLLGMN were [%]. The accuracy of the Elman network was [%], and significant differences from the other methods were confirmed. The training times of the TSDCN, the RLLGMN, and the Elman network were [s], [s], and [s], respectively, and there were significant differences among all the methods.
IV-B.3 Discussion
It can be considered that the Elman network estimated the distribution of the high-dimensional input data in the original feature space and thus took on a complex structure, resulting in low classification accuracy and a significantly long training time. The RLLGMN achieved high classification performance since it was developed based on HMMs; however, its training time was long because the RLLGMN modeled all the input dimensions using numerous parameters. The training time of the TSDCN was significantly reduced despite its high classification accuracy. These results show that the TSDCN can reduce training time while maintaining high classification performance.
IV-C. Comparison with conventional dimensionality reduction
IV-C.1 Method
Fig. 7 shows an example of the problem used in the comparison with PCA, where Fig. 7 (a) is an example of a time series and Fig. 7 (b) is a scattergram that indicates the distribution of the population used in data generation. The auxiliary axes in Fig. 7 (b) show the direction with the largest variance (the first principal component) and the direction orthogonal to it (the second principal component), respectively. The signals were generated based on the following equation so that the first principal component is an insignificant component for classification.
(45) 
where the noise term is a Gaussian random number with a mean of 0 and a covariance matrix of [1 1; 1 1], which is independently generated for each time and each class. Reduced-dimensional signals were calculated using the TSDCN and PCA and were compared with each other. For the reduced-dimensional signals of PCA, classification accuracy was then calculated using the RLLGMN by changing the combination of 5 training samples and 100 test samples 10 times and resetting the initial weight coefficients 10 times for each data set. The classification accuracy of the TSDCN for the original problem was also calculated.
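The point of this comparison can be reproduced in miniature: when the class-separating direction carries far less variance than a shared noise direction, the first principal component captures the noise and the projected class means nearly coincide. A sketch with hypothetical numbers, not the paper's exact data:

```python
import math, random

random.seed(1)

def gen_class(offset, n=500):
    """Points = large shared noise along (1,1)/sqrt2 + class offset along (1,-1)/sqrt2."""
    u1 = (1 / math.sqrt(2), 1 / math.sqrt(2))
    u2 = (1 / math.sqrt(2), -1 / math.sqrt(2))
    pts = []
    for _ in range(n):
        a = random.gauss(0.0, 3.0)      # high-variance, class-independent component
        b = random.gauss(offset, 0.3)   # low-variance component carrying the class label
        pts.append((a * u1[0] + b * u2[0], a * u1[1] + b * u2[1]))
    return pts

def first_pc(points):
    """Leading eigenvector of the 2x2 sample covariance (closed form)."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    # eigenvector for the larger eigenvalue of [[sxx, sxy], [sxy, syy]]
    lam = 0.5 * (sxx + syy + math.sqrt((sxx - syy) ** 2 + 4 * sxy ** 2))
    v = (sxy, lam - sxx)
    norm = math.hypot(*v)
    return (v[0] / norm, v[1] / norm)

c1, c2 = gen_class(-1.0), gen_class(1.0)
pc = first_pc(c1 + c2)
proj = lambda p: p[0] * pc[0] + p[1] * pc[1]
gap = abs(sum(map(proj, c1)) / len(c1) - sum(map(proj, c2)) / len(c2))
print(pc, gap)  # pc lies near the noise axis; the class-mean gap after projection is tiny
```

A discriminative reduction would instead pick the low-variance axis, where the class means are two units apart; this is the separation that an accuracy-driven, jointly trained projection can retain.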
An example of the problem used in the comparison with LDA is shown in Fig. 8, where Fig. 8 (a) and (b) represent an example of a time series and a scattergram that indicates the distribution of the population, respectively. In Fig. 8 (b), the data of class 1 are distributed in the first and third quadrants, and the data of class 2 are distributed in the second and fourth quadrants, by analogy with the XOR problem, a typical nonlinear classification problem. To be more precise, the data are uniform random numbers in the region where for class 1, and