Parallel Training Considered Harmful?: Comparing Series-Parallel and Parallel Feedforward Network Training


Neural network models for dynamic systems can be trained either in parallel or in series-parallel configurations. Influenced by early arguments, several papers justify the choice of series-parallel rather than parallel configuration by claiming that it has a lower computational cost, has better stability properties during training, and provides more accurate results. Other published results, on the other hand, defend parallel training as being more robust and capable of yielding more accurate long-term predictions. The main contribution of this paper is to present a study comparing both methods under the same unified framework. We focus on three aspects: i) robustness of the estimation in the presence of noise; ii) computational cost; and, iii) convergence. A unifying mathematical framework and simulation studies show situations where each training method provides better validation results, with parallel training being better in what are believed to be more realistic scenarios. An example using measured data seems to reinforce this claim. We also show, with a novel complexity analysis and numerical examples, that both methods have similar computational cost, although series-parallel training is more amenable to parallelization. Some informal discussion about stability and convergence properties is presented and explored in the examples.


Neural networks are widely used and studied for modeling nonlinear dynamic systems [1]. In the seminal paper by Narendra and Parthasarathy [1], a distinction is made between two training methods: series-parallel and parallel. In the series-parallel configuration, the weights are estimated by minimizing the one-step-ahead prediction error, while, in the parallel configuration, the free-run simulation error is minimized.

In [1], the series-parallel configuration is said to be preferred to the parallel configuration for three main reasons, later mentioned in several other works: i) all signals generated in the identification procedure for the series-parallel configuration are bounded, while this is not guaranteed for the parallel configuration [2]; ii) for the parallel configuration a modified version of backpropagation is needed, resulting in a greater computational cost, while standard backpropagation can be used for the series-parallel configuration [4]; and, iii) assuming that the error tends asymptotically to small values, the simulated output is not so different from the actual one and, therefore, the results obtained from the two configurations would not be significantly different [11]. An additional reason that also appears sometimes in the literature is that: iv) series-parallel training provides better results because of more accurate inputs to the neural network during training [5].

Results presented in other papers, however, show the strengths of parallel training: according to [14], neural networks trained in parallel yield more accurate long-term predictions than the ones trained in series-parallel; [16] shows, for diverse types of models, including neural networks, that parallel training can be more robust than series-parallel training in the presence of some kinds of noise; and, in [17], neural network parallel models present better validation results than series-parallel models for modeling a boiler unit. Furthermore, free-run simulation error minimization (used in parallel training) seems to be successful when dealing with other types of models: some state-of-the-art structure selection techniques for polynomial models are based on it [18]; and, in [23], it provided the best results when estimating the parameters of a battery.

The main contribution of this paper is to compare these two training methods. We focus on three aspects: i) robustness of the estimation in the presence of noise; ii) computational cost; and, iii) convergence. Our findings suggest that parallel training may provide the most accurate models under some common circumstances. Furthermore, its computational cost is not significantly different from that of series-parallel training. We believe this to be relevant because it contradicts some of the frequently cited reasons for using series-parallel rather than parallel training.

The rest of the paper is organized as follows: Section 2 presents both training modes as prediction error methods [24] and Section 3 formulates neural network training in both configurations as a nonlinear least-squares problem. It is a secondary contribution of this paper to present both training methods under the same framework, and the comparison in subsequent sections is built on top of this formulation. Section 4 presents a complexity analysis comparing the methods and Section 5 discusses signal unboundedness and the possibility of convergence to “bad” local solutions. In Section 6, numerical examples with measured and simulated data investigate the effect of the noise on the estimation and the running time of both methods. Final comments and future work ideas are presented in Section 7.

2 Unifying Framework

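Consider the following data set: $\mathcal{Z} = \{(\mathbf{u}[k], \mathbf{y}[k]),\; k = 1, 2, \ldots, N\}$, containing a sequence of sampled input-output pairs. Here $\mathbf{u}[k]$ and $\mathbf{y}[k]$ are vectors containing all the inputs and outputs of interest at instant $k$. The output $\mathbf{y}[k]$ is correlated with its own past values $\mathbf{y}[k-1]$, $\mathbf{y}[k-2]$, $\mathbf{y}[k-3]$, $\ldots$, and with past input values $\mathbf{u}[k-\tau_d]$, $\mathbf{u}[k-\tau_d-1]$, $\mathbf{u}[k-\tau_d-2]$, $\ldots$. The integer $\tau_d$ is the input-output delay (when changes in the input take some time to affect the output).

This paper is focused on trying to find a difference equation model:

$$\mathbf{y}[k] = \mathbf{F}(\mathbf{y}[k-1], \ldots, \mathbf{y}[k-n_y], \mathbf{u}[k-\tau_d], \ldots, \mathbf{u}[k-n_u]; \mathbf{\Theta}) \tag{1}$$

that best describes the data in $\mathcal{Z}$. The model is described by: the parameterized nonlinear function $\mathbf{F}$ (e.g. a neural network); a parameter vector $\mathbf{\Theta}$; the maximum input and output lags $n_u$ and $n_y$; and, the input-output delay $\tau_d$. The assumption that a finite number of past terms can be used to describe the output is implicit here.

From now on, the simplified notation $\underline{\mathbf{y}}[k-1] = (\mathbf{y}[k-1], \ldots, \mathbf{y}[k-n_y])$, $\underline{\mathbf{u}}[k-\tau_d] = (\mathbf{u}[k-\tau_d], \ldots, \mathbf{u}[k-n_u])$
will be used; hence, Equation (1) can be rewritten as: $\mathbf{y}[k] = \mathbf{F}(\underline{\mathbf{y}}[k-1], \underline{\mathbf{u}}[k-\tau_d]; \mathbf{\Theta})$.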

In this section we present parallel and series-parallel training in the prediction error methods framework. Prediction error methods were initially developed for linear systems [24]. Although there are works that extend the results to nonlinear models [25], the authors are not aware of a mathematical derivation in an entirely nonlinear setup, such as the one presented here, having been described anywhere else.

2.1 Output Error vs Equation Error

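To study the previously described problem it is assumed that, for a given input sequence $\mathbf{u}[k]$ and a set of initial conditions $\mathbf{y}_0$, the output $\mathbf{y}[k]$ was generated by a “true system”, described by the following equations:

$$\begin{aligned} \mathbf{y}^*[k] &= \mathbf{F}^*(\mathbf{y}^*[k-1], \ldots, \mathbf{y}^*[k-n_y], \mathbf{u}[k-\tau_d], \ldots, \mathbf{u}[k-n_u]; \mathbf{\Theta}^*) + \mathbf{v}[k], \\ \mathbf{y}[k] &= \mathbf{y}^*[k] + \mathbf{w}[k], \end{aligned} \tag{2}$$

where $\mathbf{F}^*$ and $\mathbf{\Theta}^*$ are the “true” function and parameter vector that describe the system; $\mathbf{v}[k]$ and $\mathbf{w}[k]$ are random variable vectors that cause the deviation of the deterministic model from its true value; $\mathbf{u}[k]$ and $\mathbf{y}[k]$ are the measured input and output vectors; and, $\mathbf{y}^*[k]$ is the output vector without the effect of the output error.

The random variable $\mathbf{v}[k]$ affects the system dynamics and is called equation error, while the random variable $\mathbf{w}[k]$ only affects the measured values and is called output error.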
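The distinction between the two error sources can be made concrete with a small simulation. The sketch below is only illustrative: the scalar function `f`, the zero initial condition and the function name `simulate_true_system` are assumptions made here, not the system studied in this paper. It shows that the equation error enters the recursion and is propagated by the dynamics, whereas the output error is added to the measurement only.

```python
import math
import random

def simulate_true_system(u, sigma_v, sigma_w, seed=0):
    """Simulate a scalar system with equation error v and output error w.

    y_star[k] = f(y_star[k-1], u[k-1]) + v[k]   (v enters the dynamics)
    y[k]      = y_star[k] + w[k]                (w only corrupts the measurement)
    """
    rng = random.Random(seed)

    def f(y_prev, u_prev):
        # Hypothetical "true" function, chosen only for illustration.
        return 0.8 * math.tanh(y_prev) + 0.5 * u_prev

    y_star = [0.0]  # assumed zero initial condition
    y = [y_star[0] + rng.gauss(0.0, sigma_w)]
    for k in range(1, len(u)):
        v = rng.gauss(0.0, sigma_v)
        w = rng.gauss(0.0, sigma_w)
        y_star.append(f(y_star[k - 1], u[k - 1]) + v)
        y.append(y_star[k] + w)
    return y_star, y

# Square-wave input, then one data set per noise configuration.
u = [1.0 if (k // 10) % 2 == 0 else -1.0 for k in range(100)]
y_star_oe, y_oe = simulate_true_system(u, sigma_v=0.0, sigma_w=0.3)  # output error only
y_star_ee, y_ee = simulate_true_system(u, sigma_v=0.3, sigma_w=0.0)  # equation error only
```

With output error only, the noise-free trajectory `y_star_oe` coincides with the deterministic response; with equation error only, the measured output equals `y_star_ee`, but that trajectory itself has been perturbed.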

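2.2 Free-run Simulation and One-step-ahead Prediction

For a given function $\mathbf{F}$ and parameter vector $\mathbf{\Theta}$, two predictions of the output can be defined. The one-step-ahead prediction uses only measured values:

$$\hat{\mathbf{y}}_1[k] = \mathbf{F}(\mathbf{y}[k-1], \ldots, \mathbf{y}[k-n_y], \mathbf{u}[k-\tau_d], \ldots, \mathbf{u}[k-n_u]; \mathbf{\Theta}).$$

The free-run simulation is initialized with a set of initial conditions $\mathbf{y}_0$ and, from then on, uses its own past values in place of the measured outputs:

$$\hat{\mathbf{y}}_s[k] = \mathbf{F}(\hat{\mathbf{y}}_s[k-1], \ldots, \hat{\mathbf{y}}_s[k-n_y], \mathbf{u}[k-\tau_d], \ldots, \mathbf{u}[k-n_u]; \mathbf{\Theta}).$$

The corresponding errors are the one-step-ahead error $\mathbf{e}_1[k] = \hat{\mathbf{y}}_1[k] - \mathbf{y}[k]$ and the free-run simulation error $\mathbf{e}_s[k] = \hat{\mathbf{y}}_s[k] - \mathbf{y}[k]$.

2.3 Optimal Predictor

If the measured values of $\mathbf{y}$ and $\mathbf{u}$ are known at all instants previous to $k$, the optimal prediction of $\mathbf{y}[k]$ is the following conditional expectation:

$$\hat{\mathbf{y}}_*[k] = E\{\mathbf{y}[k] \mid \mathbf{y}[k-1], \mathbf{y}[k-2], \ldots, \mathbf{u}[k-1], \mathbf{u}[k-2], \ldots\}$$

where $\hat{\mathbf{y}}_*[k]$ denotes the optimal prediction and $E\{\cdot\}$ indicates the mathematical expectation.

Consider the following situations:

Situation 1 (white equation error): the output error is zero, $\mathbf{w}[k] = 0$, and the equation error $\mathbf{v}[k]$ is zero-mean white noise.

Situation 2 (white output error): the equation error is zero, $\mathbf{v}[k] = 0$, and the output error $\mathbf{w}[k]$ is zero-mean white noise.

The next two lemmas give the optimal prediction for the two situations above:

Lemma 1. In Situation 1, the optimal prediction is the one-step-ahead prediction computed with the true function and parameters: $\hat{\mathbf{y}}_*[k] = \hat{\mathbf{y}}_1[k]$ for $\mathbf{F} = \mathbf{F}^*$ and $\mathbf{\Theta} = \mathbf{\Theta}^*$.

Proof. Since the output error is zero, it follows that $\mathbf{y}[k] = \mathbf{y}^*[k]$ and therefore Equation (2) reduces to $\mathbf{y}[k] = \mathbf{F}^*(\mathbf{y}[k-1], \ldots, \mathbf{y}[k-n_y], \mathbf{u}[k-\tau_d], \ldots, \mathbf{u}[k-n_u]; \mathbf{\Theta}^*) + \mathbf{v}[k]$.

And, because $\mathbf{v}[k]$ is white noise with zero mean, it follows that:

$$\hat{\mathbf{y}}_*[k] = \mathbf{F}^*(\mathbf{y}[k-1], \ldots, \mathbf{y}[k-n_y], \mathbf{u}[k-\tau_d], \ldots, \mathbf{u}[k-n_u]; \mathbf{\Theta}^*) + E\{\mathbf{v}[k]\} = \hat{\mathbf{y}}_1[k].$$

Lemma 2. In Situation 2, the optimal prediction is the free-run simulation computed with the true function, parameters and initial conditions: $\hat{\mathbf{y}}_*[k] = \hat{\mathbf{y}}_s[k]$ for $\mathbf{F} = \mathbf{F}^*$ and $\mathbf{\Theta} = \mathbf{\Theta}^*$.

Proof. There is no equation error and therefore:

$$\hat{\mathbf{y}}_*[k] = \mathbf{y}^*[k] + E\{\mathbf{w}[k]\} = \mathbf{y}^*[k] = \hat{\mathbf{y}}_s[k],$$

where it was used that, for matching initial conditions and parameters and in the absence of equation error, the noise-free output is exactly equal to the free-run simulation ($\mathbf{y}^*[k] = \hat{\mathbf{y}}_s[k]$).

In series-parallel configuration the parameters are estimated by minimizing the one-step-ahead error, $\mathbf{e}_1[k]$, while in parallel training the free-run simulation error, $\mathbf{e}_s[k]$, is the one minimized. For $\mathbf{F} = \mathbf{F}^*$, both training methods minimize an error that approaches the optimal predictor error as $\mathbf{\Theta} \to \mathbf{\Theta}^*$. Series-parallel training does it for Situation 1, and parallel training for Situation 2. Hence, both training procedures may be regarded as prediction error methods.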
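The difference between the two errors being minimized can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the first-order model `model` and the helper names are hypothetical and are not the networks used in this paper. The one-step-ahead predictor is fed with measured outputs, while the free-run simulation is fed with its own previous values.

```python
import math

def model(y_prev, u_prev, theta):
    """Hypothetical first-order model F(y[k-1], u[k-1]; theta)."""
    a, b = theta
    return a * math.tanh(y_prev) + b * u_prev

def one_step_ahead(y, u, theta):
    """Series-parallel: every prediction uses the measured y[k-1]."""
    return [model(y[k - 1], u[k - 1], theta) for k in range(1, len(y))]

def free_run(y0, u, theta):
    """Parallel: every prediction uses the previously simulated value."""
    y_sim = [y0]
    for k in range(1, len(u)):
        y_sim.append(model(y_sim[k - 1], u[k - 1], theta))
    return y_sim[1:]

def half_sse(pred, meas):
    """Half the sum of squared errors between predictions and measurements."""
    return 0.5 * sum((p - m) ** 2 for p, m in zip(pred, meas))

theta = (0.8, 0.5)
u = [math.sin(0.3 * k) for k in range(30)]
y = [0.0] + free_run(0.0, u, theta)          # noise-free data from the model itself
cost_sp = half_sse(one_step_ahead(y, u, theta), y[1:])   # series-parallel cost
cost_p = half_sse(free_run(y[0], u, theta), y[1:])       # parallel cost
```

On noise-free data generated by the same model both costs vanish; as soon as the measured outputs are perturbed, the two predictors see different regressors and the two costs differ.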

Neural networks are an appropriate choice for the function because they are universal approximators. A neural network with as few as one hidden layer can approximate any measurable function with arbitrary accuracy within a given compact set [28].

Let $\mathbf{e}[k] = \hat{\mathbf{y}}[k] - \mathbf{y}[k]$ be the prediction error vectors. In this paper we focus on the minimization of the sum of squared errors $\frac{1}{2}\sum_k \|\mathbf{e}[k]\|^2$. This loss function is optimal in the maximum likelihood sense if the residuals are considered to be Gaussian white noise [29]. An algorithm for minimizing the squared errors is described in the following section in the context of neural network models.

3 Nonlinear Least-Squares Network Training

Unlike other machine learning applications (e.g. natural language processing and computer vision), where enormous datasets are available to train neural network models, the datasets available for system identification are usually of moderate size. The available data is usually obtained through tests of limited duration for practical and economic reasons. And, even when there is a long record of input-output data, it either does not contain meaningful dynamic behavior [30] or the system cannot be considered time-invariant over the entire record, which makes it necessary to select smaller portions of this longer dataset for training.

Due to the unavailability of large datasets, the use of neural networks in system identification is usually restricted to networks with a few hundred weights. The Levenberg-Marquardt method provides a fast convergence rate [31] and has been described as very efficient for batch training of moderate size problems [32]. Hence, it will be the method of choice for training neural networks in this paper.

This section presents parallel and series-parallel training as nonlinear least-squares problems. Sections 3.1 and 3.2 give some background on the Levenberg-Marquardt algorithm and on the backpropagation algorithm proposed by [32]. Series-parallel and parallel training are discussed in Sections 3.3 and 3.4. Backpropagation can be directly applied to series-parallel training, while for parallel training we introduce a new formula for computing the derivatives. This formula can be interpreted either as a variation of dynamic backpropagation [1] adapted to compute the Jacobian instead of the gradient; or, as a specific case of real-time recurrent learning [33] with a special type of recurrent connection.

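3.1 Nonlinear Least-Squares

Let $\mathbf{\Theta} \in \mathbb{R}^{N_\Theta}$ be a vector of parameters containing all neural network weights and bias terms and $\mathbf{e}(\mathbf{\Theta}) \in \mathbb{R}^{N_e}$ an error vector. In order to estimate the parameter vector $\mathbf{\Theta}$, the sum of squared errors $V(\mathbf{\Theta}) = \frac{1}{2}\|\mathbf{e}(\mathbf{\Theta})\|^2$ is minimized. Its gradient vector and Hessian matrix may be computed as [31]:

$$\nabla V = J^T \mathbf{e}, \qquad \nabla^2 V = J^T J + \sum_{i=1}^{N_e} e_i \nabla^2 e_i,$$

where $J = \partial \mathbf{e} / \partial \mathbf{\Theta}^T$ is the Jacobian matrix associated with $\mathbf{e}(\mathbf{\Theta})$. Nonlinear least-squares algorithms usually update the solution iteratively ($\mathbf{\Theta}^{n+1} = \mathbf{\Theta}^n + \Delta\mathbf{\Theta}^n$) and exploit the special structure of the gradient and Hessian of $V(\mathbf{\Theta})$ in order to compute the parameter update $\Delta\mathbf{\Theta}^n$.

The Levenberg-Marquardt algorithm considers a parameter update [34]:

$$\Delta\mathbf{\Theta}^n = -\left[J^{nT} J^n + \lambda^n D^n\right]^{-1} J^{nT} \mathbf{e}^n \tag{5}$$

for which $\lambda^n$ is a non-negative scalar and $D^n$ is a non-negative diagonal matrix. Furthermore, $\mathbf{e}^n$ and $J^n$ are the error and its Jacobian matrix evaluated at $\mathbf{\Theta}^n$.

There are different ways of updating $\lambda^n$ and $D^n$. The update strategy presented here is similar to [35]. The elements of the diagonal matrix $D^n$ are chosen equal to the elements in the diagonal of $J^{nT} J^n$. And $\lambda^n$ is increased or decreased according to the agreement between the local model $m^n(\Delta\mathbf{\Theta}) = \frac{1}{2}\|\mathbf{e}^n + J^n \Delta\mathbf{\Theta}\|^2$ and the real objective function $V(\mathbf{\Theta}^n + \Delta\mathbf{\Theta})$. The degree of agreement is measured using the following ratio:

$$\rho^n = \frac{V(\mathbf{\Theta}^n) - V(\mathbf{\Theta}^n + \Delta\mathbf{\Theta}^n)}{m^n(0) - m^n(\Delta\mathbf{\Theta}^n)}$$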

One iteration of the algorithm is summarized next:
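As an illustration of how such an iteration may be organized, the Python sketch below implements one damped step with the diagonal scaling taken from the Gauss-Newton matrix and a step acceptance test based on the agreement ratio. It is a sketch under stated assumptions, not the implementation used in this paper: the thresholds 0.25 and 0.75, the factors 2 and 0.5, the floor `1e-12` on the scaling, and the helper names `solve`, `half_sse` and `lm_iteration` are all choices made here for illustration.

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def half_sse(e):
    """Objective: half the sum of squared errors."""
    return 0.5 * sum(ei * ei for ei in e)

def lm_iteration(residual, jacobian, theta, lam):
    """One Levenberg-Marquardt iteration; returns the new (theta, lam)."""
    e, J, n = residual(theta), jacobian(theta), len(theta)
    JtJ = [[sum(row[i] * row[j] for row in J) for j in range(n)] for i in range(n)]
    Jte = [sum(row[i] * ei for row, ei in zip(J, e)) for i in range(n)]
    # Damped normal equations (J^T J + lam * D) step = -J^T e, with D = diag(J^T J).
    A = [[JtJ[i][j] + (lam * max(JtJ[i][i], 1e-12) if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    step = solve(A, [-g for g in Jte])
    candidate = [t + s for t, s in zip(theta, step)]
    # Reduction predicted by the local model 0.5 * ||e + J step||^2.
    e_lin = [ei + sum(a * s for a, s in zip(row, step)) for row, ei in zip(J, e)]
    predicted = half_sse(e) - half_sse(e_lin)
    try:
        actual = half_sse(e) - half_sse(residual(candidate))
    except OverflowError:
        actual = -math.inf  # an overflowing candidate counts as a failed step
    rho = actual / predicted if predicted > 0 else -1.0
    if not rho > 0:  # rejected step (also catches NaN): keep theta, increase damping
        return theta, 2.0 * lam
    if rho > 0.75:   # good agreement: trust the local model more
        lam *= 0.5
    elif rho < 0.25:  # poor agreement: trust it less
        lam *= 2.0
    return candidate, lam

# Usage: fit y = a * exp(b * x) to noise-free data generated with a = 2, b = -1.
xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [2.0 * math.exp(-x) for x in xs]
residual = lambda th: [th[0] * math.exp(th[1] * x) - y for x, y in zip(xs, ys)]
jacobian = lambda th: [[math.exp(th[1] * x), th[0] * x * math.exp(th[1] * x)] for x in xs]
theta, lam = [1.0, 0.0], 1.0
for _ in range(50):
    theta, lam = lm_iteration(residual, jacobian, theta, lam)
```

Because a step is kept only when the objective actually decreases, the sequence of objective values is non-increasing, whatever the thresholds chosen for adapting the damping.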

3.2 Modified Backpropagation

Consider a multi-layer feedforward network, such as the three-layer network in Figure 1. This network can be seen as a function that relates the input to the output. The parameter vector contains all weights and bias terms of the network. This subsection presents a modified version of backpropagation [32] for computing the neural network output and its Jacobian matrix for a given input. The notation used is the one displayed in Figure 1.

Figure 1: Three-layer feedforward network.

Forward Stage

For an layer network the output nodes can be computed using the following recursive matrix relation:

where, for the $n$-th layer, the weights $w_{i,j}^{(n)}$ are collected in a matrix, the bias terms $\gamma_i^{(n)}$ are collected in a vector, and the nonlinear activation function is applied element-wise. The output is given by:

Backward Stage

The following recurrence relation can be used to compute the derivatives of the output with respect to the internal signals of every layer:

where is given by the following diagonal matrix:

The recursive expression (9) follows from applying the chain rule layer by layer.

Computing Derivatives

The derivatives of in relation to the weights and the bias can be used to form the Jacobian matrix and can be computed using the following expressions:

Furthermore, the derivatives of in relation to the inputs are:
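The forward stage, the backward stage and the resulting derivatives can be illustrated for the simplest case of a single hidden layer with hyperbolic tangent activation and a scalar linear output. The sketch below is an assumption-laden illustration, not the notation of Figure 1: the names `W1`, `g1`, `W2`, `g2` and the function `forward_and_jacobian` are hypothetical. It returns the output together with its derivatives with respect to every weight, every bias term and every input.

```python
import math

def forward_and_jacobian(x, W1, g1, W2, g2):
    """Output z = W2 . tanh(W1 x + g1) + g2 and its partial derivatives."""
    # Forward stage: hidden pre-activations, hidden outputs, network output.
    s = [sum(w * xj for w, xj in zip(row, x)) + gi for row, gi in zip(W1, g1)]
    h = [math.tanh(si) for si in s]
    z = sum(w * hi for w, hi in zip(W2, h)) + g2
    # Backward stage: dz/ds_i = W2_i * tanh'(s_i), with tanh' = 1 - tanh^2.
    delta = [w * (1.0 - hi * hi) for w, hi in zip(W2, h)]
    # Derivatives with respect to weights and bias terms.
    dW1 = [[d * xj for xj in x] for d in delta]
    dg1 = delta[:]
    dW2 = h[:]
    dg2 = 1.0
    # Derivatives with respect to the inputs (needed later for parallel training).
    dx = [sum(delta[i] * W1[i][j] for i in range(len(delta))) for j in range(len(x))]
    return z, dW1, dg1, dW2, dg2, dx

x = [0.3, -0.7]
W1 = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]
g1 = [0.1, -0.1, 0.0]
W2 = [0.7, -0.5, 0.2]
g2 = 0.05
z, dW1, dg1, dW2, dg2, dx = forward_and_jacobian(x, W1, g1, W2, g2)
```

Stacking `dW1`, `dg1`, `dW2` and `dg2` into one row gives the row of the Jacobian matrix associated with this input, while `dx` is the ingredient that the recursive derivative computation for parallel training reuses.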

3.3 Series-parallel Training

In the series-parallel configuration the parameters are estimated by minimizing the one-step-ahead prediction error, which can be done using the algorithm described in Section 3.1. The required Jacobian matrix can be computed according to the following well-known result.

Results from differentiating ( ?).

3.4 Parallel Training

In the parallel configuration the parameters are estimated by minimizing the free-run simulation error. There are two different ways to take the initial conditions into account: (i) to fix the initial conditions and estimate only the model parameters; and, (ii) to define an extended parameter vector that also contains the initial conditions and estimate them simultaneously with the model parameters.

When using formulation (i), a suitable choice is to set the initial conditions equal to the measured outputs. When using formulation (ii), the measured outputs may be used as an initial guess to be refined by the optimization algorithm.

The optimal choice for the initial conditions would be the noise-free outputs at the initial instants. Formulation (i) uses the non-optimal choice of the measured outputs. Formulation (ii) goes one step further and includes the initial conditions in the optimization problem, so it converges to the optimal choice and hence improves the parameter estimation.

The Jacobian matrices and can be computed according to the following proposition.

Results from differentiating ( ?) and applying the chain rule.
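The idea behind this recursive computation can be sketched for a scalar first-order model. The example below is illustrative only: the model `a * tanh(y[k-1]) + b * u[k-1]` and the function name `free_run_with_sensitivities` are assumptions made here. The total derivative of each simulated value is the explicit derivative of the model plus the derivative with respect to the fed-back output times the sensitivity computed at the previous instant; the initial condition is treated as one more parameter, as in the extended formulation.

```python
import math

def free_run_with_sensitivities(theta, y0, u):
    """Free-run simulation of y[k] = a*tanh(y[k-1]) + b*u[k-1] and its Jacobian.

    Returns (y_sim, J), where row k of J holds the derivatives of y_sim[k]
    with respect to (a, b, y0).
    """
    a, b = theta
    y_sim = [y0]
    J = [[0.0, 0.0, 1.0]]  # y_sim[0] = y0 depends only on the initial condition
    for k in range(1, len(u)):
        t = math.tanh(y_sim[k - 1])
        dF_dy = a * (1.0 - t * t)      # derivative of the model w.r.t. the fed-back output
        direct = [t, u[k - 1], 0.0]    # explicit derivatives w.r.t. (a, b, y0)
        # Chain rule: total derivative = explicit term + dF/dy * previous sensitivity.
        J.append([d + dF_dy * s for d, s in zip(direct, J[k - 1])])
        y_sim.append(a * t + b * u[k - 1])
    return y_sim, J

u = [math.sin(0.4 * k) for k in range(25)]
y_sim, J = free_run_with_sensitivities((0.9, 0.5), 0.2, u)
```

Each row depends on the previous one, which is the source of the sequential dependency that makes this computation harder to parallelize than its series-parallel counterpart.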

4 Complexity Analysis

In this section we present a novel complexity analysis comparing series-parallel training (SP), parallel training with fixed initial conditions (P) and parallel training with extended parameter vector (P). We show that the training methods have similar computational cost for the nonlinear least-squares formulation. The number of floating point operations (flops) is estimated based on [36]. Low order terms, as usual, are neglected in the analysis.

4.1 Neural Network Output and its Partial Derivatives

The backpropagation algorithm described in Section 3.2 can be used for training both fully and partially connected networks. What would have to change is the internal representation of the weight matrices: for a partially connected network the matrices would be stored using a sparse representation, e.g. compressed sparse column (CSC) representation.

The total number of flops required to evaluate the output and to compute its partial derivatives for the feedforward network is summarized in Table ? considering a fully connected network. The total number of network weights is denoted as and the total number of bias terms as , such that . For this fully connected network:

Modified backpropagation number of flops for a fully connected network.
i) Compute the network output (forward stage): Eq. (7)-(8)
ii) Backward stage: Eq. (9)
iii) Compute the derivatives with respect to the weights and bias terms: Eq. (10)-( ?)
iv) Compute the derivatives with respect to the inputs: Eq. (11)

Since the more relevant terms of the complexity analysis in Table ? are being expressed in terms of and the results for a fully connected network are similar to the ones that would be obtained for a partially connected network using a sparse representation.

4.2 Number of Flops for Series-Parallel and Parallel Training

Levenberg-Marquardt number of flops per iteration for series-parallel training (SP), parallel training with fixed initial conditions (P) and parallel training with extended parameter vector (P). A mark signals which calculation is required in each method.
i) Compute the network output
ii) Backward stage
iii) Compute the derivatives with respect to the weights and bias terms
iv) Compute the derivatives with respect to the inputs
v) Equation ( ?)
vi) Equation ( ?)
vii) Solve (5)
viii) Solve (5)

The number of flops of each iteration of the Levenberg-Marquardt algorithm is summarized in Table ?. Entries (i) to (iv) in Table ? follow directly from Table ?, considering and multiplying the costs by because of the number of different inputs being evaluated. Furthermore, in entries (v) and (vi), it is considered that the evaluation of ( ?) and ( ?) is done, not by recomputing the entire summation each time, but by storing each computed value and computing only one new matrix-matrix product per evaluation.

The cost of solving (5) is about where the cost is due to the multiplication of the Jacobian matrix by its transpose and is due to the needed Cholesky factorization. For an extended parameter vector is replaced by in the analysis.

4.3 Comparing Methods

Assuming the number of nodes in the last hidden layer is greater than the number of outputs (), the following inequalities follow directly from this paper's definitions:

From Table ? and from the above inequalities it follows that the cost of each Levenberg-Marquardt iteration is dominated by the cost of solving Equation (5). Furthermore, , therefore, S training method has the same asymptotic computational cost of SP and S methods: .

From Table ? it is also possible to analyze the cost of each of the major steps needed in each full iteration of the algorithm:

  • Computing Error:

    The cost of computing the error is the same for all of the training methods.

  • Computing Partial Derivative:

    The computation of partial derivatives has a cost of for the SP training method and a cost for both P and P. For many cases of interest in system identification, the number of model outputs is small. Furthermore, (see Eq. (12)). That is why the cost of computing the partial derivatives for parallel training is comparable to the cost for series-parallel training.

  • Solving Equation (5):

    It has already been established that the cost of this step dominates the computational cost for all the training methods. Furthermore is usually much smaller than such that and the number of flops of this stage is basically the same for all the training methods.

4.4 Memory Complexity

Considering that is significantly larger than , it follows that, for the three training methods, the storage capacity is dominated by the storage of the Jacobian matrix or of the matrix resulting from the product . Therefore the size of the memory used by the algorithm is about .

For very large datasets and a large number of parameters, these storage requirements may be prohibitive and other methods should be used (e.g. stochastic gradient descent or variations). Nevertheless, for datasets of moderate size and networks with a few hundred parameters, as is usually the case for system identification, the use of nonlinear least-squares is a viable option.

5 Practical Aspects

5.1 Convergence towards a Local Minimum

The optimization problems that result from both series-parallel and parallel training of neural networks are non-convex and may have multiple local minima. The Levenberg-Marquardt algorithm discussed in this paper, as well as most algorithms of interest for training neural networks (e.g. stochastic gradient descent, L-BFGS, conjugate gradient), converges towards a local minimum. There is no guarantee, for either series-parallel or parallel training, that the solution found is the global optimum. Convergence to “bad” local solutions may happen for both training methods; however, as illustrated in the numerical examples, it seems to happen more often for parallel training.

5.2 Signal Unboundedness During Training

Signals obtained in intermediary steps of parallel identification may become unbounded. The one-step-ahead predictor, used in series-parallel training, is always bounded since the prediction depends only on measured values – it has a FIR (Finite Impulse Response) structure – while, for the parallel training, the free-run simulation could be unbounded for some choice of parameters because of its dependence on its own past simulation values.

This is a potential problem because, during an intermediary stage of parallel training, a choice of parameters that results in unbounded values of the free-run simulation may need to be evaluated, resulting in overflows. Hence, optimization algorithms that work well when minimizing one-step-ahead prediction errors may fail when minimizing simulation errors.

For instance, steepest descent algorithms with a fixed step size may, for a poor choice of step size, fall into a region in the parameter space for which the signals are unbounded and the computation of the gradient will suffer from overflow and prevent the algorithm from continuing (since it does not have a direction to follow). This may also happen in more practical line search algorithms (e.g. the one described in [31]).

The Levenberg-Marquardt algorithm, on the other hand, is robust against this kind of problem because every step that causes overflow in the objective function computation yields a negative agreement ratio (Section 3.1); hence, the step is rejected by the algorithm and the damping parameter $\lambda$ is increased. The increase in $\lambda$ causes the length of the parameter update to decrease. Therefore, the step length is decreased until a point is found close enough to the current iterate such that overflow does not occur. Hence, the Levenberg-Marquardt algorithm does not fail or stall due to overflows. Similar reasoning could be used for any trust-region method or for backtracking line-search.
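The rejection mechanism can be sketched as follows. This is a minimal illustration under stated assumptions: the cubic model is a hypothetical example of an unbounded nonlinearity, and `free_run_cost` and `accept_step` are names introduced here. A candidate whose free-run simulation overflows is assigned an infinite cost, so it can never be accepted and the optimizer simply retries with a shorter step.

```python
import math

def free_run_cost(theta, y, u):
    """Half sum of squared free-run errors; math.inf if the simulation overflows."""
    a, b = theta
    y_sim, total = y[0], 0.0
    try:
        for k in range(1, len(y)):
            # Hypothetical model with an unbounded (cubic) nonlinearity.
            y_sim = a * y_sim ** 3 + b * u[k - 1]
            total += 0.5 * (y_sim - y[k]) ** 2
    except OverflowError:
        return math.inf
    return total if math.isfinite(total) else math.inf

def accept_step(cost_old, cost_new):
    """A step is accepted only if it yields a finite, strictly smaller cost."""
    return math.isfinite(cost_new) and cost_new < cost_old

u = [0.5] * 30
y = [2.0] + [0.3] * 29
cost_stable = free_run_cost((0.1, 0.5), y, u)    # bounded simulation
cost_unstable = free_run_cost((2.0, 1.0), y, u)  # simulation diverges and overflows
```

The same guard is unnecessary when the hidden layers use bounded activation functions, since the simulated output then stays bounded for any parameter choice.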

Regardless of the optimization algorithm, signal unboundedness is not a problem for feedforward networks with bounded activation functions (e.g. logistic or hyperbolic tangent) in the hidden layers, because their output is always bounded. Hence, parallel training of these particular neural networks is not affected by the previously mentioned difficulties.

6 Implementation and Test Results

The implementation is in Julia and runs on a computer with an Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz. For all examples in this paper, the activation function used in the hidden layers is the hyperbolic tangent, the initial values of the weights $w_{i,j}^{(n)}$ are drawn from a normal distribution with standard deviation $\sigma = (N_{s(n-1)})^{-0.5}$, and the bias terms $\gamma_i^{(n)}$ are initialized with zeros [38]. Furthermore, in all parallel training examples we include the initial conditions as parameters of the optimization process (P training).
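The initialization scheme can be written compactly. The sketch below is in Python rather than Julia and is not taken from the paper's code; it assumes that $N_{s(n-1)}$ in the caption of Figure 3 denotes the number of nodes feeding layer $n$, here called `n_in`, and the function name `init_layer` is hypothetical.

```python
import random

def init_layer(n_in, n_out, rng):
    """Draw weights from N(0, sigma^2) with sigma = n_in ** -0.5; zero bias terms.

    n_in is assumed to be the number of nodes in the previous layer.
    """
    sigma = n_in ** -0.5
    W = [[rng.gauss(0.0, sigma) for _ in range(n_in)] for _ in range(n_out)]
    gamma = [0.0] * n_out
    return W, gamma

rng = random.Random(1)
W, gamma = init_layer(n_in=400, n_out=50, rng=rng)
```

Scaling the standard deviation with the inverse square root of the fan-in keeps the variance of each pre-activation roughly independent of the layer width, which helps keep the hyperbolic tangent units away from saturation at the start of training.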

The free-run simulation mean square error (MSE) is used as a goodness-of-fit measure to compare the models in the validation window.

The first example compares the training methods using data from a real process and the second one investigates different noise configurations on computer-generated data. The code and data used in the numerical examples are available in the GitHub repository:

6.1 Example 1: Data from a Pilot Plant

In this example, input-output signals were collected from a LabVolt Level Process Station (model 3503-MO [39]). This plant consists of a tank that is filled with water from a reservoir. The water is pumped at fixed speed, while the flow is regulated using a pneumatically operated control valve driven by a voltage. The water column height is indirectly measured using a pressure sensor at the bottom of the tank. Figure 2 shows the free-run simulation in the validation window of models obtained for this process using parallel and series-parallel training.

Figure 2: Free-run simulation in the validation window for models obtained using series-parallel (SP) and parallel (P) training. The mean square errors are $\text{MSE}_{\text{SP}} = 1144.6$ and $\text{MSE}_{\text{P}} = 296.2$. The models have $n_y = n_u = 1$ and 10 nodes in the hidden layer and were trained on a two-hour-long dataset sampled at $T_s = 10\,\mathrm{s}$. The same initial parameter guess was used for both training methods. The training was 100 epochs long, which took 3.3 and 3.9 seconds, respectively, for series-parallel and parallel training.

Since the initial parameter guess is drawn at random, different realizations will yield different results. Figure 3 shows the validation errors for both training methods for randomly drawn initial guesses. While parallel training consistently provides models with better validation results than series-parallel training, it also has some outliers with very bad validation results. We interpret these outliers in the boxplot as a consequence of parallel training getting trapped in “bad” local minima, as mentioned in Section 5.

Figure 3: Boxplots showing the distribution of the free-run simulation MSE over the validation window for models trained using series-parallel (SP) and parallel (P) methods under the circumstances specified in Figure ?. There are 100 realizations of the training in each boxplot; in each realization the weights $w_{i,j}^{(n)}$ are drawn from a normal distribution with standard deviation $\sigma = (N_{s(n-1)})^{-0.5}$ and the bias terms $\gamma_i^{(n)}$ are initialized with zeros.

The training of the neural network was performed using normalized data. However, if unscaled data were used instead, parallel training would yield models with over the validation window while series-parallel training can still yield solutions with a reasonable fit to the validation data. We understand this as another indicator of the greater sensitivity of parallel training to the initial parameter guess: for unscaled data, the initial guess is far away from meaningful solutions of the problem, and, while series-parallel training converges to acceptable results, parallel training gets trapped in “bad” local solutions.

6.2 Example 2: Investigating the Noise Effect

The non-linear system: [40]

was simulated and the generated dataset was used to build neural network models. Figure 4 shows the validation results for models obtained for a training set generated with white Gaussian equation and output errors $v$ and $w$. In this section, we repeat this same experiment for diverse random processes applied to $v$ and $w$ in order to investigate how different noise configurations affect parallel and series-parallel training.

Figure 4: First 100 samples of the free-run simulation in the validation window for models trained using series-parallel (SP) and parallel (P) methods. The mean square errors are $\text{MSE}_{\text{SP}} = 0.39$ and $\text{MSE}_{\text{P}} = 0.06$. The models have $n_y = n_u = 2$ and a single hidden layer with 10 nodes. The training set has $N = 1000$ samples and was generated with ( ?) for $v$ and $w$ white Gaussian noise with standard deviations $\sigma_v = 0.1$ and $\sigma_w = 0.5$. The validation window is generated without the noise effect. For both, the input $u$ is randomly generated with standard Gaussian distribution, each randomly generated value held for 5 samples. The training was 100 epochs long, which took 5.0 and 6.1 seconds for, respectively, series-parallel and parallel training.
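The input signal described in the caption of Figure 4 can be generated as sketched below. This Python snippet is an illustration, not the paper's Julia code; the function name `held_gaussian_input` and the fixed seed are assumptions made here.

```python
import random

def held_gaussian_input(n_samples, hold=5, seed=0):
    """Standard Gaussian values, each one held constant for `hold` samples."""
    rng = random.Random(seed)
    u = []
    while len(u) < n_samples:
        u.extend([rng.gauss(0.0, 1.0)] * hold)
    return u[:n_samples]

u = held_gaussian_input(1000)
```

Holding each value for several samples concentrates the input power at lower frequencies, so that the system has time to respond to each level before the input changes again.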

White Noise

Let $v$ be white Gaussian noise with standard deviation $\sigma_v$ and let $w$ be zero. Figure ? (a) shows the free-run simulation error on the validation window using parallel or series-parallel training for increasing values of $\sigma_v$. Figure ? (b) shows the complementary experiment, for which $v$ is zero and $w$ is white Gaussian noise, with increasingly larger values of $\sigma_w$ being tried out.

In Section 2, series-parallel training was derived considering only the presence of white equation error and, in this situation, the numerical results illustrate that the model obtained with this training method presents the best validation results (Figure ? (a)). On the other hand, parallel training was derived considering only the presence of white output error and is significantly better in this alternative situation (Figure ? (b)).

Colored Noise

Consider $w = 0$ and $v$ a white Gaussian noise filtered by a low pass filter with cutoff frequency $\omega_c$. Figure ? shows the free-run simulation error in the validation window for both parallel and series-parallel training for a sequence of values of $\omega_c$ and different noise intensities. The result indicates that parallel training provides the best results unless the equation error has a very large bandwidth.

More extensive tests are summarized in Table ?, which shows the validation errors for a training set with colored Gaussian errors in different frequency bands. Again, except for white or large bandwidth equation error, parallel training seems to provide the smallest validation errors.

Equation error can be interpreted as the effect of unaccounted inputs and unmodeled dynamics; hence, situations where this error is not auto-correlated are very unlikely. Therefore, the only situations in which we found series-parallel training to perform better (when the equation error power spectral density occupies almost the whole frequency spectrum) seem unlikely to appear in practice. This may explain why parallel training produces better models for real application problems such as the pilot plant in Example 1, the battery modeling described in [23], or the boiler unit in [17].

Free-run simulation MSE on the validation window for parallel and series-parallel training. Both the mean and the standard deviation are displayed (30% trimmed estimation computed from 12 realizations). In situation (a), the training data was generated with zero output error () and is a Gaussian random process. In (b), the training data was generated with and is a Gaussian random process. The Gaussian random process has standard deviation and power spectral density confined to the given frequency band. In both situations, the rows where the frequency ranges from to (the whole spectrum) corresponds to white noise, in the remaining rows we apply a 4th-order lowpass (or highpass) Butterworth filter to white Gaussian noise (in both the forward and reverse directions) in order to obtain the signal in the desired frequency band. The cell of the training method with the best validation results between the two models is colored. Its colored red when the difference in the MSE is larger than the sum of standard deviations and yellow when it is not.
range SP P SP P
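The summary statistics and the cell-coloring rule described in the caption can be sketched in Python as follows. This is only an illustration: the caption does not say how the 30% trimming is split, so the code assumes that 30% of the realizations are discarded in total (half from each end), and the MSE values are made-up placeholders, not results from the paper. The names `trimmed`, `summarize`, and `cell_color` are hypothetical.

```python
import statistics


def trimmed(values, fraction=0.30):
    """Drop extreme values; `fraction` is the total share removed (assumption)."""
    ordered = sorted(values)
    cut = int(len(ordered) * fraction / 2)  # samples dropped at each end
    return ordered[cut:len(ordered) - cut] if cut else ordered


def summarize(mse_values, fraction=0.30):
    """Trimmed mean and trimmed sample standard deviation."""
    kept = trimmed(mse_values, fraction)
    return statistics.mean(kept), statistics.stdev(kept)


def cell_color(mean_sp, std_sp, mean_p, std_p):
    """Return (winner, color) following the rule stated in the caption."""
    winner = "SP" if mean_sp < mean_p else "P"
    clear_margin = abs(mean_sp - mean_p) > std_sp + std_p
    return winner, ("red" if clear_margin else "yellow")


# Placeholder MSE values for 12 realizations (NOT data from the paper).
sp_runs = [0.9, 1.0, 1.1, 1.0, 0.95, 1.05, 1.0, 0.98, 1.02, 1.0, 5.0, 0.1]
p_runs = [0.45, 0.5, 0.55, 0.5, 0.48, 0.52, 0.5, 0.49, 0.51, 0.5, 0.2, 0.9]

mean_sp, std_sp = summarize(sp_runs)
mean_p, std_p = summarize(p_runs)
winner, color = cell_color(mean_sp, std_sp, mean_p, std_p)
```

Trimming keeps a single diverged realization from dominating the mean, which matters when one of the training runs ends in a poor local solution.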


In Section 4 we derived the computational complexity of . The first term seems to dominate, and Figure ? shows that the running time grows linearly with the number of training samples and quadratically with the number of parameters .

Because the running time grows at the same rate for both training methods, the ratio between the running times of series-parallel and parallel training is bounded by a constant. For the examples presented in this paper, parallel training takes about 20% longer than series-parallel training. Hence the difference in running time for sequential execution does not justify the use of one method over the other. Parallel training is, however, much less amenable to parallelization because of the dependencies introduced by the recursive relations used for computing its error and Jacobian matrix. We propose a possible solution to this problem in the final remarks.
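The bounded-ratio argument can be written out as a short sketch. The symbols $N$ (number of training samples), $N_\Theta$ (number of parameters), and the constants $c_{\mathrm{SP}}$ and $c_{\mathrm{P}}$ are notation assumed here for illustration; only the linear and quadratic growth rates and the roughly 20% overhead come from the text.

```latex
T_{\mathrm{SP}}(N, N_\Theta) \approx c_{\mathrm{SP}}\, N\, N_\Theta^{2},
\qquad
T_{\mathrm{P}}(N, N_\Theta) \approx c_{\mathrm{P}}\, N\, N_\Theta^{2}
\quad\Longrightarrow\quad
\frac{T_{\mathrm{P}}}{T_{\mathrm{SP}}} \approx \frac{c_{\mathrm{P}}}{c_{\mathrm{SP}}} \approx 1.2
```

Under this assumption the ratio does not depend on $N$ or $N_\Theta$, so enlarging the data set or the network should not change which method is faster in sequential execution.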

7 Final Remarks

In this paper we have studied different aspects of parallel training under a nonlinear least squares formulation. Based on the results of the numerical examples, we have reason to believe that parallel training can provide models with smaller generalization error than series-parallel training under more realistic noise assumptions. Furthermore, for sequential execution, both the complexity analysis and the numerical examples suggest that the computational cost of the two methods is not significantly different. Other reasons mentioned in the literature to avoid parallel training, such as the possibility of signal unboundedness [1], are also easy to circumvent (see Section 5).

Several published works take for granted that series-parallel training always provides better results with lower computational cost [1]. The results presented in this paper show that this is not always the case and that parallel training does provide advantages that justify its use for training neural networks in several situations (see the numerical examples).

Series-parallel training, however, has two advantages over parallel training: i) it seems less likely to be trapped in “bad” local solutions; and ii) it is more amenable to parallelization. For the examples presented in this paper, the possibility of being trapped in “bad” local solutions is only a small inconvenience, requiring the data to be carefully normalized and, in some rare situations, the model to be retrained. We believe this to be the case in many situations. Chaotic systems are an exception: small variations in the parameters may cause large variations in the free-run simulation trajectory, making the parallel training objective function very intricate and full of undesirable local minima. In [41] a technique called multiple shooting is introduced in the prediction error methods framework as a way of reducing the possibility of parallel training getting trapped in “bad” local minima. Multiple shooting can also make the algorithm much more amenable to parallelization and seems to be a promising way to address the shortcomings of parallel training.
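The relation between the two error definitions and a segmented simulation can be illustrated with a small Python sketch. It is not the method of [41]: the one-parameter linear `model` is a hypothetical stand-in for the neural network, and each segment simply restarts from the measured output, whereas multiple shooting formulations usually treat the segment initial conditions as additional decision variables. All names and data values below are assumptions chosen for illustration.

```python
def model(theta, y_prev, u_prev):
    """Hypothetical one-parameter model standing in for the neural network."""
    return theta * y_prev + u_prev


def segment_errors(theta, u, y, start, stop):
    """Free-run simulation errors for samples start+1 .. stop-1, started at y[start]."""
    y_sim = y[start]
    errors = []
    for k in range(start + 1, stop):
        y_sim = model(theta, y_sim, u[k - 1])  # feeds back the simulated output
        errors.append(y[k] - y_sim)
    return errors


def shooting_errors(theta, u, y, interval):
    """Concatenate the errors of independent segments of `interval` steps each."""
    n = len(y)
    errors = []
    for start in range(0, n - 1, interval):
        stop = min(start + interval + 1, n)
        # Segments do not depend on each other, so they could run in parallel.
        errors.extend(segment_errors(theta, u, y, start, stop))
    return errors


# Illustrative data generated by the model itself with theta = 0.5.
u = [1.0, 0.0, -1.0, 0.5, 0.0, 1.0]
y = [0.0]
for k in range(1, len(u)):
    y.append(model(0.5, y[k - 1], u[k - 1]))

theta = 0.8  # deliberately wrong parameter value
one_step = shooting_errors(theta, u, y, interval=1)           # series-parallel error
free_run = shooting_errors(theta, u, y, interval=len(y) - 1)  # parallel error
two_step = shooting_errors(theta, u, y, interval=2)           # in between
```

With `interval=1` every prediction starts from a measured output, which reproduces the one-step-ahead (series-parallel) error; with a single segment covering the whole record the recursion is never reset, which reproduces the free-run (parallel) error. Intermediate intervals shorten the recursion and make the segments independent of each other.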


This work has been supported by the Brazilian agencies CAPES, CNPq and FAPEMIG.



  1. The prediction is optimal in the sense that the expected squared prediction error is minimized [27].
  2. A white noise process has zero mean by definition.
  3. In the context of prediction error methods the nomenclature NARX (nonlinear autoregressive model with exogenous input) and NOE (nonlinear output error model) is often used to refer to the models obtained using, respectively, series-parallel and parallel training.
  4. It is proved in [37] that the Levenberg-Marquardt algorithm (not exactly the one discussed here) converges towards a local minimum or a stationary point under simple assumptions.
  5. Programming languages such as Matlab, C, C++ and Julia return the floating point value encoded for infinity when an overflow occurs. In this case formula (Equation 6) yields a negative .
  6. This inverse relation between and is explained in [31].


  1. K. S. Narendra, K. Parthasarathy, Identification and control of dynamical systems using neural networks, IEEE Transactions on Neural Networks 1 (1) (1990) 4–27.
  2. D.-y. Zhang, L.-p. Sun, J. Cao, Modeling of temperature-humidity for wood drying based on time-delay neural network, Journal of Forestry Research 17 (2) (2006) 141–144.
  3. M. Singh, I. Singh, A. Verma, Identification on non linear series-parallel model using neural network, MIT Int. J. Electr. Instrumen. Eng 3 (1) (2013) 21–23.
  4. M. Saad, P. Bigras, L.-A. Dessaint, K. Al-Haddad, Adaptive robot control using neural networks, IEEE Transactions on Industrial Electronics 41 (2) (1994) 173–181.
  5. M. H. Beale, M. T. Hagan, H. B. Demuth, Neural network toolbox for use with MATLAB, Tech. Rep., Mathworks, 2017.
  6. M. Saggar, T. Meriçli, S. Andoni, R. Miikkulainen, System identification for the Hodgkin-Huxley model using artificial neural networks, in: Neural Networks, 2007. IJCNN 2007. International Joint Conference on, IEEE, 2239–2244, 2007.
  7. E. Petrović, Ž. Ćojbašić, D. Ristić-Durrant, V. Nikolić, I. Ćirić, S. Matić, Kalman filter and NARX neural network for robot vision based human tracking, Facta Universitatis, Series: Automatic Control And Robotics 12 (1) (2013) 43–51.
  8. I. B. Tijani, R. Akmeliawati, A. Legowo, A. Budiyono, Nonlinear identification of a small scale unmanned helicopter using optimized NARX network with multiobjective differential evolution, Engineering Applications of Artificial Intelligence 33 (2014) 99–115.
  9. E. A. Khan, M. A. Elgamal, S. M. Shaarawy, Forecasting the Number of Muslim Pilgrims Using NARX Neural Networks with a Comparison Study with Other Modern Methods, British Journal of Mathematics & Computer Science 6 (5) (2015) 394.
  10. E. Diaconescu, The use of NARX neural networks to predict chaotic time series, WSEAS Transactions on Computer Research 3 (3) (2008) 182–191.
  11. K. Warwick, R. Craddock, An introduction to radial basis functions for system identification. A comparison with other neural network methods, in: Decision and Control, 1996., Proceedings of the 35th IEEE Conference on, vol. 1, IEEE, 464–469, 1996.
  12. W. Kamiński, P. Strumiłło, E. Tomczak, Genetic algorithms and artificial neural networks for description of thermal deterioration processes, Drying Technology 14 (9) (1996) 2117–2133.
  13. M. F. Rahman, R. Devanathan, Z. Kuanyi, Neural network approach for linearizing control of nonlinear process plants, IEEE Transactions on Industrial Electronics 47 (2) (2000) 470–477.
  14. H. T. Su, T. J. McAvoy, P. Werbos, Long-term predictions of chemical processes using recurrent neural networks: A parallel training approach, Industrial & Engineering Chemistry Research 31 (5) (1992) 1338–1352.
  15. H.-T. Su, T. J. McAvoy, Neural model predictive control of nonlinear chemical processes, in: Intelligent Control, 1993., Proceedings of the 1993 IEEE International Symposium on, IEEE, 358–363, 1993.
  16. L. A. Aguirre, B. H. Barbosa, A. P. Braga, Prediction and simulation errors in parameter estimation for nonlinear systems, Mechanical Systems and Signal Processing 24 (8) (2010) 2855–2867.
  17. K. Patan, J. Korbicz, Nonlinear model predictive control of a boiler unit: A fault tolerant control study, International Journal of Applied Mathematics and Computer Science 22 (1) (2012) 225–237.
  18. L. Piroddi, W. Spinelli, An identification algorithm for polynomial NARX models based on simulation error minimization, International Journal of Control 76 (17) (2003) 1767–1781.
  19. M. Farina, L. Piroddi, Some convergence properties of multi-step prediction error identification criteria, in: Decision and Control, 2008. CDC 2008. 47th IEEE Conference on, IEEE, 756–761, 2008.
  20. M. Farina, L. Piroddi, An iterative algorithm for simulation error based identification of polynomial input–output models using multi-step prediction, International Journal of Control 83 (7) (2010) 1442–1456.
  21. M. Farina, L. Piroddi, Simulation error minimization identification based on multi-stage prediction, International Journal of Adaptive Control and Signal Processing 25 (5) (2011) 389–406.
  22. M. Farina, L. Piroddi, Identification of polynomial input/output recursive models with simulation error minimisation methods, International Journal of Systems Science 43 (2) (2012) 319–333.
  23. C. Zhang, K. Li, Z. Yang, L. Pei, C. Zhu, A new battery modelling method based on simulation error minimization, in: 2014 IEEE PES General Meeting| Conference & Exposition, 2014.
  24. L. Ljung, System Identification, Springer, 1998.
  25. J. Sjöberg, Q. Zhang, L. Ljung, A. Benveniste, B. Delyon, P.-Y. Glorennec, H. Hjalmarsson, A. Juditsky, Nonlinear black-box modeling in system identification: a unified overview, Automatica 31 (12) (1995) 1691–1724.
  26. P. M. Nørgård, O. Ravn, N. K. Poulsen, L. K. Hansen, Neural Networks for Modelling and Control of Dynamic Systems-A Practitioner’s Handbook, Springer-London, 2000.
  27. T. Hastie, R. Tibshirani, J. Friedman, The elements of statistical learning, vol. 1, Springer Series in Statistics, second edn., 2008.
  28. K. Hornik, M. Stinchcombe, H. White, Multilayer feedforward networks are universal approximators, Neural networks 2 (5) (1989) 359–366.
  29. O. Nelles, Nonlinear System Identification: from classical approaches to neural networks and fuzzy models, Springer Science & Business Media, 2001.
  30. A. H. Ribeiro, L. A. Aguirre, Selecting Transients Automatically for the Identification of Models for an Oil Well, IFAC-PapersOnLine 48 (6) (2015) 154–158.
  31. J. Nocedal, S. Wright, Numerical Optimization, Springer Science & Business Media, second edn., 2006.
  32. M. T. Hagan, M. B. Menhaj, Training feedforward networks with the Marquardt algorithm, IEEE Transactions on Neural Networks 5 (6) (1994) 989–993.
  33. R. J. Williams, D. Zipser, Experimental Analysis of the Real-Time Recurrent Learning Algorithm, Connection Science 1 (1) (1989) 87–111.
  34. D. W. Marquardt, An algorithm for least-squares estimation of nonlinear parameters, Journal of the Society for Industrial and Applied Mathematics 11 (2) (1963) 431–441.
  35. R. Fletcher, A modified Marquardt subroutine for non-linear least squares, 1971.
  36. G. H. Golub, C. F. Van Loan, Matrix Computations, vol. 3, JHU Press, 2012.
  37. J. J. Moré, The Levenberg-Marquardt algorithm: implementation and theory, in: Numerical analysis, Springer, 105–116, 1978.
  38. Y. A. LeCun, L. Bottou, G. B. Orr, K.-R. Müller, Efficient backprop, in: Neural Networks: Tricks of the trade, Springer, 9–48, 2012.
  39. LabVolt, Mobile Instrumentation and Process Control Training Systems, Tech. Rep., Festo, 2015.
  40. S. Chen, S. A. Billings, P. M. Grant, Non-linear system identification using neural networks, International Journal of Control 51 (6) (1990) 1191–1214.
  41. A. H. Ribeiro, L. A. Aguirre, Shooting Methods for Parameter Estimation of Output Error Models, IFAC-PapersOnLine 50 (1) (2017) 13998 – 14003, ISSN 2405-8963.