Estimating the Number of Components in a Mixture of Multilayer Perceptrons

Estimating the Number of Components in a Mixture of Multilayer Perceptrons

M. Olteanu and J. Rynkiewicz
SAMOS-MATISSE-CES Universite Paris 1, UMR 8174
90 Rue de Tolbiac, 75013 Paris, France

BIC criterion is widely used by the neural-network community for model selection tasks, although its convergence properties are not always theoretically established. In this paper we will focus on estimating the number of components in a mixture of multilayer perceptrons and proving the convergence of the BIC criterion in this frame. The penalized marginal-likelihood for mixture models and hidden Markov models introduced by Keribin (2000) and, respectively, Gassiat (2002) is extended to mixtures of multilayer perceptrons for which a penalized-likelihood criterion is proposed. We prove its convergence under some hypothesis which involve essentially the bracketing entropy of the generalized score-functions class and illustrate it by some numerical examples.

1 Introduction

Although linear models have been the standard tool for time series analysis for a long time, their limitations have been underlined during the past twenty years. Real data often exhibit characteristics that are not taken into account by linear models. Financial series, for instance, alternate strong and weak volatility periods, while economic series are often related to the business cycle and switch from recession to normal periods. Several solutions such as heteroscedatic ARCH, GARCH models, threshold models, multilayer perceptrons or autoregressive switching Markov models were proposed to overcome these problems.

In this paper, we consider models which allow the series to switch between regimes and more particularly we study the case of mixtures of multilayer perceptrons. In this frame, rather than using a single global model, we estimate several local models from the data. For the moment, we assume that switches between different models occur independently, the next step of this approach being to also learn how to split the input space and to consider the more general case of gated experts or mixtures of experts models (Jacobs et al., 1991). The problem we address here is how to select the number of components in a mixture of multilayer perceptrons. This is typically a problem of non-identifiability which leads to a degenerate Fisher information matrix and the classical chi-square theory on the convergence of the likelihood ratio fails to apply. One possible method to answer this problem is to consider penalized criteria. The consistency of the BIC criterion was recently proven for non-identifiable models such as mixtures of densities or hidden Markov models (Keribin, 2000 and Gassiat, 2002). We extend these results to mixtures of nonlinear autoregressive models and prove the consistency of a penalized estimate for the number of components under some good regularity conditions.

The rest of the paper is organized as follows : in Section 2 we give the definition of the general model and state sufficient conditions for regularity. Then, we introduce the penalized likelihood estimate for the number of components and state the result of consistency. Section 3 is concerned with applying the main result to mixtures of multilayer perceptrons. Some open questions, as well as some possible extensions are discussed in the conclusion.

2 Penalized-likelihood estimate for the number of components in a mixture of nonlinear autoregressive models

This section is devoted to the setting of the general theoretical frame : model, definition and consistency of the penalized-likelihood estimate for the number of components.

The model - definition and regularity conditions

Throughout the paper, we shall consider that the number of lags is known and, for ease of writing, we shall set the number of lags equal to one, the extension to time-lags being immediate.

Let us consider the real-valued time series which verifies the following model


  • is an iid sequence of random variables valued in a finite space and with probability distribution ;

  • for every , and

    is the family of possible regression functions. We suppose throughout the rest of the paper that are sublinear, that is they are continuous and there exist such that ;

  • for every , is an iid noise such that is independent of . Moreover, has a centered Gaussian density .

The sublinearity condition on the regression functions is quite general and the consistency for the number of components holds for various classes of processes, such as mixtures of densities, mixtures of linear autoregressive functions or mixtures of multilayer perceptrons.

Let us also remark that besides its necessity in the proof of the theoretical result, the compactness hypothesis for the parameter space is also useful in practice. Indeed, one needs to bound the parameter space in order to avoid numerical problems in the multilayer perceptrons such as hidden-unit saturation. In our case, seems to be an acceptable bound for the computations. On the other hand, mixture probabilities are naturally bounded.

The next example of a linear mixture illustrates the model introduced by (1). The hidden process is a sequence of iid variables with Bernoulli(0.5) distribution. We define as follows, using and a standard Gaussian noise :

The penalized-likelihood estimate which we introduce in the next subsection converges in probability to the true number of components of the model under some regularity conditions on the process . More precisely, we need the following hypothesis which implies, according to Yao and Attali (2000), strict stationarity and geometric ergodicity for :


Let us remark that hypothesis (HS) does not request every component to be stationary and that it allows non-stationary “regimes” as long as they do not appear too often. Since multilayer perceptrons are bounded function, this hypothesis will be naturally fulfilled.

Construction of the penalized likelihood criterion

Let us consider an observed sample of the time series . Then, for every observation , the conditional density with respect to the previous and marginally in is

As the goal is to estimate , the number of regimes of the model, let us consider all possible conditional densities up to a maximal number of regimes , a fixed positive integer. We shall consider the class of functions

where , , and is a centered Gaussian density.

For every we define the number of regimes as

and let be the true number of regimes.

We can now define the estimate as the argument maximizing the penalized criterion


is the log-likelihood marginal in and is a penalty term.

Convergence of the penalized likelihood estimate

Several statistical and probabilistic notions such as mixing processes, bracketing entropy or Donsker classes will be used hereafter. For parcimony purposes we shall not remind them, but the reader may refer to Doukhan (1995) and Van der Vaart (2000) for complete monographs on the subject.

The consistency of is given by the next result, which in an extension of Gassiat (2002):

Theorem 1 : Consider the model defined by (1) and the penalized-likelihood criterion introduced in (2). Let us introduce the next assumptions :

(A1) is an increasing function of , when for every and when for every

(A2) the model verifies the weak identifiability assumption (HI)

(A3) the parameterization is continuous for every and there exists an integrable map with respect to the stationary measure of such that

(A4) is strictly stationary and geometrically -mixing, and the family of generalized score functions associated to

where is the stationary measure of and for every

being the bracketing entropy of with respect to the -norm.

Then, under hypothesis (A1)-(A4), (HS) et (HC), in probability.

Proof of Theorem 1

First, let us show that does not overestimate . We shall need the following likelihood ratio inequality which is an immediate generalization of Gassiat (2002) to multivariate dependent data.

Let be a parametric family of conditional densities containing the true model and consider the generalized score functions

where is the stationary measure of . Then,



Then we have :

Under the hypothesis (HS), there exists a unique strictly stationary solution which is also geometrically ergodic and this implies that is in particular geometrically -mixing. Then, by remarking that

we obtain that the bivariate series is also strictly stationary and geometrically -mixing.

This fact, together with the assumption on the -bracketing entropy of with respect to the norm and the condition that ensures that Theorem 4 in Doukan, Massart and Rio (1995) holds and

is uniformly tight and verifies a functional central limit theorem. Then,

On the other hand, , thus and using the -entropy condition is Glivenko-Cantelli. Since is ergodic and strictly stationary, we obtain the following uniform convergence in probability :

To finish the first part, let us prove that

If we suppose, on the contrary, that , then there exists a sequence of functions , such that . The -convergence implies that in and a.s. for a subsequence . Since and , where , we obtain that and thus in and a.s. for a subsequence . The hypothesis (A4) ensures the existence of a square-integrable dominating function for and, finally, we get that a subsequence of converges to a.s. and in , which contradicts the fact that for every , so that :

Then, by the uniform tightness above and the hypothesis (A1),

Let us now prove that does not underestimate :

Now, and under the hypothesis (A3), the class of functions is -Glivenko-Cantelli (the general proof for a parametric family can be found in Van der Vaart, 2000) and since is ergodic and strictly stationary, we obtain the following uniform convergence in probability :

Since and using assumption (A2), the limit is negative. By hypothesis (A1), converges to when , so we finally have that and the proof is done.

3 Mixtures of multilayer perceptrons

In this section, we consider the model defined in (1) such that, for every , is a multilayer perceptron. Since non-identifiability problems also arise in multilayer perceptrons (see, for instance, Rynkiewicz, 2006), we shall simplify the problem by considering one hidden layer and a fixed number of units on every layer, . Then, we have that for every

where is the hyperbolic tangent and

is the true parameter with the true variance.Let us check if the hypothesis of the main result of section 2 apply in the case of mixtures of multilayer perceptrons.

Hypothesis (HS) : The stationarity and ergodicity assumption (HS) is immediately verified since the output of every perceptron is bounded, by construction. Thus, every regime is stationary and the global model is also stationary.

Let us consider the class of all possible conditional densities up to a maximum number of components :

, where

  • and we may suppose quite naturally that for every ,

  • for every , is a multilayer perceptron

, where

belongs to a compact set.

Hypothesis (A1) : may be chosen, for instance, equal to the BIC penalizing term, .

Hypothesis (A2)-(A3) : Since the noise is normally distributed, the weak identifiability hypothesis is verified according to the result of Teicher (1963), while assumption (A3) is a regularity condition verified by Gaussian densities.

Hypothesis (A4) : We consider the class of generalized score functions

The difficult part will be to show that for all which, since we are on a functional space, is equivalent to prove that “the dimension” of can be controlled. For , let us denote and , so that the global parameter will be and the associated generalized score function .

Proving that a parametric family like verifies the condition on the bracketing entropy is usually immediate under good regularity conditions (see, for instance, Van der Vaart, 2000). A sufficient condition is that the bracketing number grows as a polynomial function of . In this particular case, the problems arise when and the limits in of have to be computed. Let us split into two classes of functions. We shall consider a neighborhood of such that and . On , it can be easily seen that

Hence, on , it is sufficient that


Now, is a parametric class. Since the derivatives of the transfer functions are bounded, according to the example 19.7 of Van der Vaart (2000), it exists a constant so that the bracketing number of is lower than

where is the diameter of the smallest sphere of including the set of possible parameters. So, we get that , where is the number of -brackets necessary to cover and the bracketing entropy is computed as .

As for , the idea is to reparameterize the model in a convenient manner which will allow a Taylor expansion around the identifiable part of the true value. For that, we shall use a slight modification of the method proposed by Liu and Shao (2003). Let us remark that when , the weak identifiability hypothesis (A2) and the fact that for every , , imply that there exists a vector such that and, modulo a permutation, can be rewritten as follows : , , . With this remark, one can define in the general case and so that, for every , ,

and the new parameterization will be ,

, , with containing all the identifiable parameters of the model and the non-identifiable ones. Then, for , we will have that

This reparameterization allows to write a second-order Taylor expansion of at . For ease of writing, we shall first denote

Then, the density ratio becomes :

By remarking that when , does not vary with , we will study the variation of this ratio in a neighborhood of and for fixed . Let us note the vector of derivatives of with respect of each components of and the vector of second derivatives of with respect of each components of . Assuming that , and , where

are linearly independent in , one can prove the following :

Proposition 1 : Let us denote . For any fixed , there exists the second-order Taylor expansion at :




Proof of Proposition 1

The first term in the developpement can be computed easily by remarking that the gradient of at is :

  • for and ,

  • for ,

The term of second order can be obtained by direct computations once the hessian in computed at :

  • , and

  • , and

  • ,

  • , and

  • , and

  • the other crossed derivatives of and are zero

It remains to prove that the rest is but this follows directly from Yao (2000) and the fact that, since the noise is normally distributed, has moments of any order.

Using the Taylor expansion above, for belonging to , is the sum of a linear combination of

and of a term whose norm is negligible compared to the norm of this combination when goes to 0. By assumption (A3), a strictly positive number exists so that for any vector of norm 1 with components

and sufficiently small:

Since any function can be written:

belongs to the set of functions:

whose bracketing number is smaller or equal to