
# Conditional probability calculation using restricted Boltzmann machine with application to system identification

Erick de la Rosa, Wen Yu
Departamento de Control Automatico
CINVESTAV-IPN (National Polytechnic Institute)
Mexico City, 07360, Mexico. yuw@ctrl.cinvestav.mx
###### Abstract

There are many advantages to using probability methods for nonlinear system identification: the noises and outliers in the data set do not affect the probability models significantly, and the input features can be extracted in probability form. The biggest obstacle of the probability model is that the probability distributions are not easy to obtain.

In this paper, we formulate nonlinear system identification as solving a conditional probability. We then modify the restricted Boltzmann machine (RBM) such that the joint probability, the input distribution, and the conditional probability can be calculated by RBM training. Binary encoding and continuous-valued methods are discussed. A universal approximation analysis for the conditional probability based modelling is proposed. We use two benchmark nonlinear systems to compare our probability modelling method with other black-box modelling methods. The results show that this novel method is much better when there are big noises and the system dynamics are complex.

Keywords: system identification, conditional probability, restricted Boltzmann machine

## 1 Introduction

Data-based system identification uses experimental data from the system input and output. Usually, a mathematical model of the input-output behavior is applied to predict the system response (output) from the excitation (input) to the system. There is always uncertainty when a model is chosen to represent the system. Including probability theory in dynamic system identification can improve the modelling capability with respect to noise and uncertainties [1]. The common probability approaches for system identification parameterize the model and apply Bayes' Theorem [2], such as maximizing the posterior [3], maximizing the likelihood function [5], or matching the output by the least-squares method [4]. There are several computational difficulties with the above estimation methods: for example, the probability models for the likelihood construction cannot be expected to be perfect, and the parameter estimates are often not unique, because true values of the parameters may not exist.

In the sense of probability theory, the objective of system modelling is to find the best conditional probability distribution p(y|x) [6], where x is the input and y is the output. So system identification can be transformed into the calculation of the conditional probability distribution; it is no longer a parameter estimation problem. For nonlinear system identification, there are two correlated time series, the input u(k) and the output y(k). The popular Monte Carlo method cannot obtain the best prediction of y from the conditional probability between x and y [7].

Recent results show that the restricted Boltzmann machine (RBM) [8] can learn the probability distribution of the input data using unsupervised learning and obtain hidden features [9][10]. These latent features help the RBM obtain better representations of the empirical distribution. Feature extraction and pre-training have been two important properties of RBMs for solving classification problems in the past years [11].

RBMs are also applied to data regression and time series modelling. With RBM pre-training, the prediction accuracy for time series can be improved [12][13]. The hidden and visible units of a normal RBM are binary, so the prediction accuracy for time series with continuous values is not satisfactory [14]. In [15], the binary units are replaced by linear units with Gaussian noise. Denoising autoencoders are used for continuous-valued data in [16]. In [17], the time series are assumed to have the Gaussian property.

In order to find the relation between the input u and the output y, the RBM was used to obtain the conditional probabilities between the input and the hidden units of the RBM in our previous papers [18][19]. The hidden units were used as the initial weights of neural networks, and supervised learning was then implemented to obtain the input-output relation. To the best of our knowledge, the conditional probability approach for system identification has still not been applied in the literature.

In this paper, we use the conditional probability to model the nonlinear system by RBM training. The RBMs are modified such that the input distribution, the joint probability and the conditional probability can be calculated. We propose two probability calculation methods: binary encoding and continuous values. A probability gradient algorithm is used to maximize the log-likelihood of the conditional probability between the input and output. Comparisons with other black-box identification methods are carried out using two nonlinear benchmark systems.

## 2 Nonlinear system modeling with conditional probability

We use the following difference equation to describe a discrete-time nonlinear system,

 y(k)=f[x(k),ξ(k),⋯ξ(k−nξ)] (1)

where ξ(k) are noise sequences, f is an unknown nonlinear function, and

 x(k)=[y(k−1),⋯y(k−ny),u(k),⋯u(k−nu)]T (2)

represents the plant dynamics; u(k) and y(k) are the measurable input and output of the nonlinear plant; ny and nu correspond to the system orders; nξ is the maximum lag for the noise. x(k) can be regarded as a new input to the nonlinear function f. This is the well-known NARMAX model [20].

The objective of system modelling is to use the input-output data to construct a model ^f such that ^f[x(k)]≈y(k). The loss function is defined by

 L=∑k[y(k)−^f[x(k)]]2

The mathematical expectation of the modeling error is

 E{[y(k)−^f[x(k)]]2}=∫[y−^f(x)]2p(dx,dy) (3)

where the joint probability p(X,Y) satisfies

 p(X,Y)=p(X)p(Y∣X)

Here p(Y∣X) is the conditional probability. The modeling error (3) becomes

 E{[y(k)−^f[x(k)]]2}=∫[y−^f]2p(dx)p(dy∣dx)=ExE(y∣x){[y−^f]2∣x}

The system identification becomes the minimization problem as

 minE{[y−^f]2}→min^fE(y∣x){[y−^f]2∣x}

The best prediction of y(k) at x(k) is the conditional mean (conditional expectation)

 ^f[x(k)]=E{^y∣x}
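As a quick numerical check of this property (a hypothetical toy example, not from the data sets used later), the conditional mean can be verified to minimize the expected squared error:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 4 discrete input levels, y = sin(x) + noise.
x = rng.integers(0, 4, size=20000)
y = np.sin(x) + 0.3 * rng.standard_normal(x.size)

# Estimate the conditional mean E{y | x} per input level.
cond_mean = np.array([y[x == v].mean() for v in range(4)])
mse_cond = np.mean((y - cond_mean[x]) ** 2)

# Any other predictor (here the conditional mean shifted by 0.2) does worse.
mse_other = np.mean((y - (cond_mean[x] + 0.2)) ** 2)
assert mse_cond < mse_other
```

The residual MSE of the conditional mean approaches the noise variance (here 0.09), which no predictor can beat.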

The nonlinear system modeling becomes

 maxΛ{p(y∣x,Λ)} (4)

where Λ is the parameter vector of the model. The existing methods use some probability approaches, such as Bayes' Theorem, to estimate Λ. In this paper, we do not use parameterized models. We calculate the conditional probability distribution directly.

The loss function for the conditional distribution is defined as

 Jo(D)=∑Dlogp(y|x) (5)

where D is the training set, and x(k) and y(k) are the k-th training input vector and output respectively. So the objective of the nonlinear system identification (4) becomes

 maxΛ[∑Dlogp(y|x)] (6)

In this paper, we use the restricted Boltzmann machine (RBM) to obtain the best conditional probability. The RBM is a stochastic neural network. It can learn the probability distribution of a given data set. The input (the visible nodes) to the RBM is x. The output (the hidden nodes) of the RBM is ¯h. For the hidden and visible nodes, the conditional probabilities are calculated as

 p(¯hi=1∣x)=ϕ[Wx+b]p(xj=1∣h)=ϕ[WTh+c] (7)

where ϕ is the sigmoid function, W is the weight matrix, the binary value ¯hi is obtained by comparing p(¯hi=1∣x) with a number sampled from the uniform distribution over [0,1], and c and b are the visible and hidden biases respectively. The probability vector is defined as

 h=[p(¯h1=1∣x)⋯p(¯hs=1∣x)]=[h1⋯hs]
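A minimal sketch of the conditional probabilities (7), with hypothetical dimensions and random weights:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
W = 0.1 * rng.standard_normal((2, 3))    # weight matrix (2 hidden, 3 visible)
b = np.zeros(2)                          # hidden bias
c = np.zeros(3)                          # visible bias

x = np.array([1.0, 0.0, 1.0])            # a binary visible vector

p_h = sigmoid(W @ x + b)                 # p(h_i = 1 | x)
h = (rng.random(2) < p_h).astype(float)  # Bernoulli sampling of the hidden state
p_x = sigmoid(W.T @ h + c)               # p(x_j = 1 | h)
```

Sampling compares each probability with a uniform random number, exactly as described after (7).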

The training objective of the RBM is to modify the parameters such that the probability distribution of the hidden units is near the distribution of the input x. We use the Kullback–Leibler (KL) divergence to measure the distance between the two probability distributions p(x) and q(x), where q(x) is the empirical distribution of the data:

 KL(p,q)=∑Dq(x)logq(x)p(x)=∑Dq(x)logq(x)−∑Dq(x)logp(x) (8)

Here the first term is the negative entropy of the input; it is independent of the RBM training.

The RBM training objective becomes

 minKL(p,q)→max∑Dq(x)logp(x) (9)

The expectation over q(x) cannot be computed directly; it is estimated by the Monte Carlo method

 ∑Dq(x)logp(x)≈1n∑Dlogp(x)

where n is the number of training data. So

 minKL(p,q)→max∑xlogp(x)
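The equivalence between minimizing the KL divergence and maximizing the Monte Carlo average of log p(x) can be checked on a hypothetical discrete example:

```python
import numpy as np

rng = np.random.default_rng(6)

# Data drawn from an empirical distribution q over 4 patterns.
q = np.array([0.1, 0.2, 0.3, 0.4])
data = rng.choice(4, size=5000, p=q)

def avg_log_lik(p_model):
    # Monte Carlo estimate of sum_x q(x) log p(x)
    return np.mean(np.log(p_model[data]))

# The model matching q attains the higher average log-likelihood
# (i.e., the smaller KL divergence) than a mismatched uniform model.
uniform = np.full(4, 0.25)
assert avg_log_lik(q) > avg_log_lik(uniform)
```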

The following lemma and theorem give the universal approximation of the nonlinear system (1) with the conditional probability and the RBM.

###### Lemma 1

Any marginal probability distribution p(x), x∈{0,1}^n, can be approximated arbitrarily well in the sense of the KL divergence (8) by an RBM with k+1 hidden units, where k is the number of input vectors whose probability is not zero [22].

###### Theorem 1

If the number of hidden units of the RBM (7) is big enough, the conditional probability p(y|x) of the nonlinear system (1) can be approximated arbitrarily well by an RBM and the pair (x,y) in the sense of the KL divergence (8).

Proof. Consider the nonlinear system (1). The vectors x and y are from finite sets. For each pair (xk,yl), the conditional probability distribution p(yl|xk) and the joint distribution p(xk,yl) are defined. The conditional probability of the pair is

 p(yl|xk)=p(xk,yl)p(xk)=p(xk,yl)∑jp(xk,yj) (10)

Here the value of p(yl|xk) is assumed to be known. The sum ∑jp(xk,yj) can be separated as p(xk,yl) and ∑j≠lp(xk,yj). From (10),

 p(xk,yl)=p(yl|xk)∑j≠lp(xk,yj)1−p(yl|xk) (11)

For each pair (xk,yl), p(xk,yl) can be calculated for the indexes k and l from (11). There is one equation for each pair. They are considered as the desired conditional distribution of p(y|x); i.e., we can use them to create the joint distribution p(x,y), which contains the original conditional distribution p(y|x). Now we apply all pairs x and y to the RBM, whose hidden unit number is big enough.

If x and y are binary values, we define the pair (x,y) as a single random variable. From Lemma 1, we can construct an RBM with enough hidden units (one per input pattern with non-zero probability, plus one), which models the distribution p(x,y) with the desired conditional distribution p(y|x).

If x and y are not binary values, we encode the input variable x and the output y into binary values with a resolution of m bits. This means that y is encoded into 2^m different levels with a step of 1/2^m. Similarly, each component of the input x is encoded into 2^m different levels,

 x∈Rn⟶x∈{0,1}n×my∈R⟶y∈{0,1}m (12)

We can construct an RBM with at most 2^((n+1)m) hidden units. The desired conditional distribution of p(y|x) can be approximated arbitrarily well in the sense of the KL divergence.
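A sketch of the m-bit encoding and decoding of (12), with hypothetical helper names:

```python
def encode(v, m):
    """Encode v in [0, 1) into m bits; the quantization step is 1/2**m."""
    level = min(int(v * 2 ** m), 2 ** m - 1)
    return [(level >> k) & 1 for k in reversed(range(m))]

def decode(bits):
    """Decode bits back to the centre of the quantization level."""
    m = len(bits)
    level = int("".join(str(b) for b in bits), 2)
    return (level + 0.5) / 2 ** m

bits = encode(0.7, 4)                          # 0.7 -> level 11 -> [1, 0, 1, 1]
assert abs(decode(bits) - 0.7) <= 0.5 / 2 ** 4  # error within half a step
```

The reconstruction error is bounded by half the quantization step, which shrinks as m grows.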

The above theorem can be regarded as a probability version of the universal approximation theory of neural networks [23][24]. One RBM with one hidden layer can learn the probability of the nonlinear system to any accuracy, but the hidden node number of the RBM should be of the same order as the total data number. This causes serious over-fitting and computational problems.

In order to improve the approximation accuracy, we can use several RBMs in cascade connection. This is called a deep Boltzmann machine (or deep belief net) [21]. Instead of using more hidden nodes, it uses more layers (more RBMs) to learn the probability, which avoids the above problems. This architecture is shown in Figure 1. Each RBM has the form of (7). The output of the current RBM is its hidden vector h; it is the input of the next RBM.
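The cascade of Figure 1 amounts to feeding each RBM's hidden probabilities into the next RBM; a minimal forward-pass sketch with hypothetical layer sizes:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dbn_forward(x, layers):
    """Propagate x through a cascade of RBMs; each layer is (W, b)."""
    h = x
    for W, b in layers:
        h = sigmoid(W @ h + b)   # hidden probabilities become the next input
    return h

rng = np.random.default_rng(2)
layers = [(0.1 * rng.standard_normal((8, 6)), np.zeros(8)),
          (0.1 * rng.standard_normal((4, 8)), np.zeros(4))]
h_top = dbn_forward(rng.random(6), layers)
assert h_top.shape == (4,)
```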

## 3 Joint distribution for nonlinear system identification

Because the conditional probability satisfies

 p(y|x)=p(x,y)p(x),logp(y|x)=logp(x,y)−logp(x) (13)

the objective of system identification (6) can also be formulated as

 maxΛ[∑Dlogp(x,y)] and minΛ[−∑Dlogp(x)] (14)

The identification objective (14) is an alternative to the conditional probability (6); it needs the joint distribution p(x,y) and the input distribution p(x). It is easier to calculate the joint distribution than the conditional probability. However, it is impossible to optimize p(x,y) and p(x) at the same time. We can use part of the training data to minimize −log p(x) first, then use the rest of the data to maximize the joint probability p(x,y).

In our previous works [18][19], we first used the input data to pre-train the RBM. The training results were used as the initial weights of neural networks. Then we used supervised learning to train the neural model. In this paper, after the unsupervised learning for p(x), we train the RBM to obtain p(x,y). For nonlinear system identification, this is a sub-optimization process.

### 3.1 Pre-training for p(x)

The goal of the unsupervised training is to obtain p(x) by reconstructing the RBM input x. The parameters Λ are trained such that the hidden vector h is a representation of x (feature extraction). The probability distribution p(x) is the following energy-based model

 p(x)=∑hp(x,h)=∑he−E(x,h)Z (15)

where ∑h denotes the sum over all possible values of h, Z is the normalization constant (partition function), and E(x,h) is the energy function, which is defined by

 E(x,h)=−cTx−bTh−hTWx (16)

The loss function for the training is

 L(Λ)=log∏xp(x)=∑x(log[∑he−E(x,h)]−log[∑x,he−E(x,h)])

The weights are updated as

 Λ(k+1)=Λ(k)−η1∂[−logp(x)]∂Λ (17)

where η1 is the learning rate and ϝ(x) is the free energy, defined as

 ϝ(x)=−cTx−∑p=1⋯li log∑hp ehp(bp+Wpx)

Here the expectation of ∂ϝ(x)/∂Λ over p(x) is estimated by Monte Carlo sampling over a set S of s samples,

 ∑xp(x)∂ϝ(x)∂Λ≈1s∑x∈S∂ϝ(x)∂Λ (18)

The above algorithm is for one RBM. The training process of multiple RBMs is shown in Figure 2. After the first model is trained, its weights are fixed. The codes (hidden representations) of the first model are sent to the second model. The second model is trained using this input, and it generates its own hidden representations, which are the input of the third model.
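The update (17) with the Monte Carlo estimate (18) is commonly implemented as contrastive divergence; a sketch of CD-1 for one binary RBM, on hypothetical random data:

```python
import numpy as np

rng = np.random.default_rng(3)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

n_vis, n_hid, eta = 6, 4, 0.05
W = 0.01 * rng.standard_normal((n_hid, n_vis))
b = np.zeros(n_hid)                    # hidden bias
c = np.zeros(n_vis)                    # visible bias

data = (rng.random((200, n_vis)) < 0.5).astype(float)   # toy binary inputs

for x in data:
    # Positive phase: h ~ p(h|x)
    p_h0 = sigmoid(W @ x + b)
    h0 = (rng.random(n_hid) < p_h0).astype(float)
    # Negative phase: one Gibbs step (CD-1)
    x1 = (rng.random(n_vis) < sigmoid(W.T @ h0 + c)).astype(float)
    p_h1 = sigmoid(W @ x1 + b)
    # Stochastic gradient step on -log p(x), approximated by CD-1
    W += eta * (np.outer(p_h0, x) - np.outer(p_h1, x1))
    b += eta * (p_h0 - p_h1)
    c += eta * (x - x1)
```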

If x is encoded into binary values as in (12), the binary hidden units are

 p(hp=1|x)=ϕ[Wpx+bp], p=1⋯li
 p(xt=1|h)=ϕ[WtTh+ct], t=1⋯li−1 (19)

If the input uses continuous values, we first normalize x into [0,1]. The conditional probability for the j-th visible node is

 P(xj|h)=e(VjTh+cj)xj∫ˆxje(VjTh+cj)ˆxjdˆxj

Define aj=VjTh+cj. The cumulative conditional probability obtained from the sampling process is

 PC(xj|h)=eajxj−1eaj−1 (20)

The expected value of this distribution is e^{aj}/(e^{aj}−1)−1/aj.
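Inverting the cumulative distribution (20) gives a direct sampler for a continuous visible unit; a sketch with a hypothetical value of aj:

```python
import numpy as np

rng = np.random.default_rng(4)
a = 2.0                                  # a_j = V_j^T h + c_j (hypothetical)

# Invert u = (e^{a x} - 1) / (e^a - 1) to sample x_j | h on [0, 1].
u = rng.random(200000)
x = np.log1p(u * np.expm1(a)) / a

mean_theory = np.exp(a) / np.expm1(a) - 1.0 / a   # expected value of the unit
assert abs(x.mean() - mean_theory) < 5e-3
assert 0.0 <= x.min() and x.max() <= 1.0
```

`log1p`/`expm1` keep the inversion numerically stable when a is small.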

### 3.2 Joint distribution p(x,y)

Here we discuss how to maximize the joint probability p(x,y) by RBM training. The parameters of the RBMs are updated as

 Λ(k+1)=Λ(k)−η2∂[−logp(x,y)]∂Λ (21)

where η2 is the learning rate. By the chain rule,

 ∂logp(x,y)∂Λ=E(x,y,h)[∂E(x,y,h)∂Λ]−E(h|x,y)[∂E(x,y,h)∂Λ] (22)

where h is the hidden variable that captures the relationship between x and y.

The expectations in (22) cannot be computed directly. The contrastive divergence (CD) approach [9] is applied in this paper. From the starting point (x,y), we sample the hidden state h using this pair as the trigger. The conditional probability p(h|x,y) then drives the sampling processes of x and y. This process is repeated κ times. These conditional probabilities are

 p(h|x,y)=∏jp(hj|x,y), p(x|h)=∏ip(xi|h), p(y|h)=∏κp(yκ|h) (23)

which are used as the intermediate results in CD.

If x and y are encoded into binary values as in (12),

 p(hj=1|x,y)=ϕ(cj+∑κVjκyκ+∑iWjixi)
 p(xi=1|h)=ϕ(bi+∑jWjihj)
 p(yκ=1|h)=ϕ(dκ+∑jVjκhj) (24)

The binary encoding method dramatically enlarges the training data: the dimension of x increases from n to n×m.

If x and y are used as continuous values, the above three conditional probabilities are calculated as follows.

1) The conditional probability of x given h:

 p(x|h)=p(x,h)p(h)=∫¯yp(x,h,¯y)d¯y∫¯y∫¯xp(¯x,h,¯y)d¯xd¯y=ehTWx+bTx∫¯xehTW¯x+bT¯xd¯x=∏ip(¯xi|h)

where ¯x, ¯y and ¯h denote the integration (dummy) variables of x, y and h, and

 p(xi|h)=exi(bi+∑jwjihj)/∫¯xie¯xi(bi+∑jwjihj)d¯xi (25)

We explore three different cases for the domain of xi. If we define ai=bi+∑jwjihj, we can directly evaluate the integral in (25). In order to ensure that each integral converges, the evaluations of the three integrals are presented in Table 1.

 Table 1. Probability expressions for p(x|h)

2) Probability of y given h:

 p(y|h)=p(y,h)p(h)=∫¯xp(¯x,h,y)d¯x∫¯y∫¯xp(¯x,h,¯y)d¯xd¯y=ehTVy+dTy∫¯yehTV¯y+dT¯yd¯y=∏κp(yκ|h) (26)

where

 p(y|h)=e(hTV+d)y∫¯ye(hTV+d)¯yd¯y (27)

If we define a=hTV+d, the evaluations of p(y|h) are presented in Table 2.

 Table 2. Probability expressions for p(y|h)

3) Probability of h given x and y. The hidden units have binary values, while x and y have continuous values. So p(h|x,y) is

 p(h|x,y)=p(x,y,h)p(x,y)=∏jp(hj|x,y)

where hj denotes the j-th element of the vector h.

To calculate the gradient (22), we use the following modified CD algorithm.

Algorithm 1:

1.- Take a training pair (x,y)

2.- Initialize x0=x and y0=y

3.- Sample h0 from p(h|x0,y0)

4.- Sample x1 and y1 from p(x|h0) and p(y|h0)

5.- Sample h1 from p(h|x1,y1)

This algorithm replaces the expectations with a κ-step Gibbs sampling process. The process is initialized with the pre-trained weights for p(x).
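One sampling sweep of Algorithm 1 in the binary case, using the sigmoid units of (24) and hypothetical dimensions:

```python
import numpy as np

rng = np.random.default_rng(5)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

n_x, n_y, n_h = 5, 1, 3
W = 0.1 * rng.standard_normal((n_h, n_x))   # x-to-h weights
V = 0.1 * rng.standard_normal((n_h, n_y))   # y-to-h weights
b, c, d = np.zeros(n_x), np.zeros(n_h), np.zeros(n_y)

def bernoulli(p):
    return (rng.random(p.shape) < p).astype(float)

x0 = bernoulli(np.full(n_x, 0.5))             # training pair (here random)
y0 = bernoulli(np.full(n_y, 0.5))
h0 = bernoulli(sigmoid(c + W @ x0 + V @ y0))  # h ~ p(h|x, y)
x1 = bernoulli(sigmoid(b + W.T @ h0))         # x ~ p(x|h)
y1 = bernoulli(sigmoid(d + V.T @ h0))         # y ~ p(y|h)
h1 = bernoulli(sigmoid(c + W @ x1 + V @ y1))  # h ~ p(h|x, y), closing the sweep
```

Repeating the sweep κ times and contrasting the first and last samples yields the CD gradient estimate.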

## 4 RBM training for conditional probability

The unsupervised pre-training (17) generates the probability distribution p(x) and extracts the features of the input x. The supervised learning (21) can only obtain a sub-optimal model for the system identification index (14). However, the training of (21) discussed above is relatively simple.

Now we use the supervised learning to obtain the conditional distribution via RBM training. The parameters of RBMs are updated as

 Λ(k+1)=Λ(k)−η3∂[−logp(y|x)]∂Λ (28)

where η3 is the learning rate; the gradient of logp(y|x) will be calculated as follows. This is the final goal of the nonlinear system modelling (6). Because p(y|x)=p(x,y)/p(x), from (15)

 p(y|x)=∑he−E(x,y,h)∑¯y,he−E(x,¯y,h) (29)

So the gradient of the negative log-likelihood with respect to the parameter Λ is

 −∂logp(y|x)∂Λ=∑he−E[x,y,h]∂E[x,y,h]∂Λ∑he−E(x,y,h)−∑y,he−E[x,y,h]∂E[x,y,h]∂Λ∑y,he−E[x,y,h] (30)

In the form of mathematical expectation, the gradient is

 −∂logp(y|x)∂Λ=E(h|x,y)[∂E(x,y,h)∂Λ]−E(y,h|x)[∂E(x,y,h)∂Λ] (31)

Both probability expectations in (31) can be computed using Gibbs sampling and the CD algorithm. The CD algorithm needs alternating sampling processes over the distributions (23). However, there are no compact expressions for p(y,h|x) and p(x|y,h). In order to implement the CD algorithm for p(y|x), we calculate p(y,h|x) as

 p(y,h|x)=e−E(x,y,h)∑¯y,¯he−E(x,¯y,¯h)=ehTWx+bTx+cTh+dTy+hTVy∑¯y,¯he¯hTWx+bTx+cT¯h+dT¯y+¯hTV¯y (32)

The calculation of (32) is expensive, because it requires evaluating all possible values of (y,h). It is still tractable for system identification. After this distribution is obtained, we just sample all possible values for y and h. Once p(y,h|x) is calculated, we complete the Gibbs sampling as

 p(x|y,h)=p(x,y,h)p(y,h)=∏ip(¯xi|h) (33)

The calculation of the expectations in (31) cannot be done directly. We must again use the CD algorithm. The expectation terms can be approximated as

 E(x,y,h)[∂E(x,y,h)∂Λ]≈∂E(x2,y2,h2)∂Λ
 E(h|x,y)[∂E(x,y,h)∂Λ]≈∂E(x,y,h2)∂Λ (34)

In this paper, we modify the learning algorithm of the RBM such that the nonlinear system can be modeled with continuous values. In order to train the parameters in (28), we need the three conditional probabilities discussed in the above section, and the following three conditional probabilities.

4) Probability of y given x. We assume that y is a scalar, the bias d is a real number, and the weight matrix V is a real vector.

 p(y|x)=p(x,y)p(x)=∑¯hp(x,y,¯h)∫¯y∑¯hp(x,¯y,¯h)d¯y=∑¯he¯hTWx+bTx+cT¯h+dy+¯hTVy∫¯y∑¯hehTWx+bTx+cT¯h+d¯y+¯hTV¯yd¯y (35)

Using Fubini's Theorem, the integral and the sum can be exchanged:

 p(y|x)=edy∏j(1+eτj(x,y))∫¯yed¯y∏j(1+eτj(x,¯y))d¯y (36)
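With a scalar y on an assumed domain [0,1], the distribution (36) can also be normalized numerically on a grid, and the prediction taken as the conditional mean; a sketch with hypothetical weights:

```python
import numpy as np

rng = np.random.default_rng(7)
n_x, n_h = 4, 3
W = 0.1 * rng.standard_normal((n_h, n_x))
c = np.zeros(n_h)
V = 0.1 * rng.standard_normal(n_h)        # y is scalar, so V is a vector
d = 0.2

x = rng.random(n_x)
ys = np.linspace(0.0, 1.0, 201)           # grid over the assumed domain
dy = ys[1] - ys[0]

# Unnormalized p(y|x) from (36), with tau_j(x, y) = W_j x + c_j + V_j y.
tau = (W @ x + c)[:, None] + V[:, None] * ys[None, :]
unnorm = np.exp(d * ys) * np.prod(1.0 + np.exp(tau), axis=0)
p = unnorm / (unnorm.sum() * dy)          # numeric normalization

y_hat = (ys * p).sum() * dy               # conditional-mean prediction
assert abs(p.sum() * dy - 1.0) < 1e-9
assert 0.0 <= y_hat <= 1.0
```

The grid normalization avoids the closed-form integral when only a point prediction is required.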

5) Probability of (y,h) given x. The second term of the negative log-likelihood (30) can be computed as

 p[(y,h)|x(k)]=e−E(x(k),¯y,¯h)∑¯y,¯he−E(x(k),¯y,¯h)=ehTWx(k)+bTx(k)+cTh+dy+hTVy∫¯y∑¯heh′TWx(k)+bTx(k)+cT¯h+d¯y+h′TV¯yd¯y (37)

The integral of the denominator is expanded using the product over the hidden units. In order to find a closed form of the solution, we define P, an incomplete power set of the hidden-unit indexes (incomplete because the empty set is not included in P). Its elements contain all possible combinations of the indexes. The finite product can be expressed as

 ∏j(1+eτj)=1+∑Pϝie∑τγ (38)

where γ is an index which takes values over the elements of P. Considering the expression for τj(x,y), the integral then becomes

 ∫¯y⎛⎝ed¯y+∑Pϝie∑wγx(k)+cγe(d+∑vγ)¯y⎞⎠d¯y (39)

For the interval

 −1D−∑Pϝi1D+∑vγe∑wγx(k)+cγ (40)

For the interval

 1d(eD−1)+∑Pϝie∑wγx(k)+cγD+∑vγ(eD+∑vγ−1) (41)

For the interval

 1d(eDδ−e−Dδ)+∑Pϝie∑wγx(k)+cγD+∑vγ(e(D+∑vγ)δ−e−(D+∑vγ)δ) (42)

The sum is performed along the elements of the power set, which is computationally expensive: the number of elements is 2^s−1, where s is the number of hidden units, representing all possible combinations. For system identification, the number of visible and hidden units is not so big, so the procedure remains tractable.

6) Probability of x given (y,h). We have shown that p(x|y,h)=∏ip(xi|h). In the same intervals, we get the expressions presented in Table 1 for p(xi|h).

Finally, we use the following CD algorithm to calculate the gradient (31).

Algorithm 2

1.- Take a training pair (x,y)

2.- Initialize x0=x

3.- Sample y0 and h0 from p(y,h|x0)

4.- Sample x1 from p(x|y0,h0)

5.- Sample y1 and h1 from p(y,h|x1)

## 5 Simulations

In this section, we use two benchmark examples to show the effectiveness of the conditional probability based system identification.

Gas furnace process

One of the most utilized benchmark examples in system identification is the famous gas furnace [25]. Air and methane are mixed to create a gas mixture which contains carbon dioxide. The control input u is the methane gas flow; the output y is the CO2 concentration. This process is modeled as (1). In this paper, we use the simulation model, i.e., only the input u is used to obtain the modeled output,

 ^y(k)=NN[u(k),⋯u(k−nu)]T (43)

This model is more difficult than the following prediction model, which uses both the input and the past outputs:

 ^y(k)=NN[y(k−1),⋯y(k−ny),u(k),⋯u(k−nu)]T (44)

The big advantage of the simulation model (43) is that the on-line measurement of the gas furnace output is not needed.

The gas furnace is sampled continuously at fixed time intervals. The data set is composed of successive pairs of [u(k),y(k)]. The first part of the samples is used as training data; the remaining samples are for testing. We compare our conditional probability calculation with the restricted Boltzmann machine (RBM-C) and the joint distribution with the restricted Boltzmann machine (RBM-J) against feedforward neural networks (multilayer perceptrons, MLP) and support vector machines (SVM). The MLP has the same structure as the RBMs, i.e., the same hidden layer and hidden node numbers. The SVMs use three types of kernel: polynomial kernel (SVM-P), radial basis function kernel (SVM-R), and linear kernel (SVM-L).

We use the random search method [26] to determine how many RBMs we need. The search ranges cover the number of RBMs and the number of hidden nodes. From the random search results, three RBMs are used, and each hidden layer has the same number of nodes. The following steps are applied for RBM training.

a) Binary encoding and decoding. We use two bit resolutions for u and y. The output data are sampled from the binary codes and decoded to their continuous equivalent values.

b) Probability distribution with continuous values. The data are normalized into the interval [0,1],

 ¯x=x−xminxmax−xmin (45)

c) Training: the RBM is trained using the coded data or the continuous values. The stochastic gradient descents (17), (21), (28) are applied over the data set for a fixed number of training epochs.

If we use the prediction model (44) to describe the gas furnace process, MLP, SVM, RBM-J and RBM-C all work well: the mean squared errors (MSE) on the testing data are similarly small. In order to show the noise resistance of the conditional probability methods, we add noise to the data set,

 x(k)=x(k)+z(k) (46)

where z(k) is drawn from a normal distribution with given average and standard deviation. The testing errors are shown in Table 3.

 Table 3. MSE of different prediction models with noise(×10−3)

The probabilistic models have a great advantage over MLP and SVM with respect to noises and disturbances. The main reason is that we model the probability distributions of the input and output, so the noises and outliers in the data do not affect the conditional distributions significantly.

Now we use the simulation model (43) to compare the above models. For small input orders, MLP and SVM cannot model the process; the RBM can, although the MSE of the testing error is relatively large. For larger input orders, all models work. The testing errors are shown in Table 4.

 Table 4. MSE of different simulation models without noise (×10−3)

Besides the supervised learning, RBMs can extract the input features with unsupervised learning. So RBMs work better than MLP and SVM when the input vector does not include the previous outputs.

To show the effectiveness of the deep structure and the binary encoding method, we use different numbers of RBMs. The testing errors are given in Table 5.

 Table 5. MSE of different RBMs-C (×10−3)

By adding a new feature extraction block, the MSE drops significantly. If the number of RBMs grows further, the MSE becomes worse. This means that it is not necessary to add new RBMs to extract more information.

Both encoding resolutions give good approximation results; see Figure 3. Higher-precision encoding helps to improve the model accuracy. However, adding one bit in the encoding procedure immediately doubles the computation time.

Finally, we test the continuous-valued algorithm with the simulation model (43). Each RBM is trained for a fixed number of epochs, and the output layer is trained on the coded features. The testing MSE is better than that of binary encoding, because the continuous representation provides more information on the real axis.

Wiener-Hammerstein system

The Wiener–Hammerstein system [27] has a static nonlinear part surrounded by two dynamic linear systems. The data consist of input/output pairs u(k) and y(k). We use the first part of the samples for training and the rest for testing. This is a big-data modelling problem.

We use the simulation model (43) to compare these models. We only show the feature extraction of the proposed method. The random search method selects the number of RBMs and the number of hidden nodes; five RBMs are used. The RBMs are trained using Gibbs sampling. We use the same structure for the MLP: five hidden layers, each with the same number of nodes.

Table 6 shows the testing errors of different methods for the simulation model (43).

We can see that, even compared with the best result among the other methods, obtained by SVM-R, RBM-C is much better when the input to the models is only the control u(k).

Now we show how the training data size affects the conditional probability modelling. The stochastic gradient descent is a batch process. The training samples are divided into several packages; all packages have the same size. Several package sizes are compared. The probability distributions are calculated with binary encoding over the corresponding intervals.

We can see that the RBMs cannot model the probability distribution properly with a small batch number. The training error with a medium package size is a little bigger than with a larger size. With the largest size, the fluctuations of the training error vanish, but the distribution becomes unstable in some intervals. So a large batch size may let some misleading samples affect the distributions.

## 6 Conclusions

In this paper, a conditional probability based method is applied to nonlinear system modelling. We show that this method is better than other data-based models when there are noises and the previous outputs are not available on-line. The modelling accuracy is satisfactory with both the binary encoding and the continuous-valued versions of the modified RBMs. The training algorithms are obtained by maximizing the conditional likelihood of the data set. Two nonlinear modelling problems are used to validate the proposed methods. The simulation results show that the modelling accuracy is improved considerably when there are noises and the simulation models have to be used.