Renormalized Normalized Maximum Likelihood and ThreePart Code Criteria For Learning Gaussian Networks
Abstract
Score based learning (SBL) is a promising approach for learning Bayesian networks in the discrete domain. However, when employing SBL in the continuous domain, one is either forced to move the problem to the discrete domain or use metrics such as BIC/AIC, and these approaches are often lacking. Discretization can have an undesired impact on the accuracy of the results, and BIC/AIC can fall short of achieving the desired accuracy. In this paper, we introduce two new scoring metrics for scoring Bayesian networks in the continuous domain: the threepart minimum description length and the renormalized normalized maximum likelihood metric. We rely on the minimum description length principle in formulating these metrics. The metrics proposed are free of hyperparameters, decomposable, and are asymptotically consistent. We evaluate our solution by studying the convergence rate of the learned graph to the generating network and, also, the structural hamming distance of the learned graph to the generating network. Our evaluations show that the proposed metrics outperform their competitors, the BIC/AIC metrics. Furthermore, using the proposed RNML metric, SBL will have the fastest rate of convergence with the smallest structural hamming distance to the generating network.
Introduction
A Bayesian network (BN) over a set of variables is a probabilistic graphical model where the dependencies between the variables are represented through a collection of edges among them in a directed acyclic graph (DAG)[\citeauthoryearPearl2014]. BNs have found extensive applications in diverse areas of engineering such as bioinformatics, image processing, and decision systems [\citeauthoryearCowell et al.2006, \citeauthoryearSpirtes et al.2000a, \citeauthoryearFriedman2004].
In simple cases, experts can design BNs using their domain knowledge. However, in most applications, this approach is impractical. It is, therefore, important to be able to learn and estimate a BN from observational data. There are two general approaches for learning a BN from data: constraintbased learning [\citeauthoryearSpirtes et al.2000b] and score based learning [\citeauthoryearHeckerman, Geiger, and Chickering1995].
Score based learning has proved to be a promising approach for learning Bayesian networks [\citeauthoryearChickering2002, \citeauthoryearScutari2009, \citeauthoryearTeyssier and Koller2012, \citeauthoryearSilander and Myllymaki2012]. In score based learning, the learning of a Bayesian network from data is viewed as an optimization task. The optimization task consists of finding the network with the highest score where the candidate networks are scored with respect to the observations using a statistically suitable scoring metric. Understandably, the choice of scoring metric plays an important role in the success of score based learning algorithms.
For Bayesian networks on discrete domains, various scoring metrics have been proposed, each formulated based on different set of statistical assumptions and motivations [\citeauthoryearHeckerman, Geiger, and Chickering1995, \citeauthoryearSilander et al.2008, \citeauthoryearBouckaert1993]. For Bayesian networks over continuous domains, however, there are very few scoring criteria. Therefore, to employ score based learning in the continuous domain, one is most often either forced to use the BIC/AIC metrics or to convert the variables into discrete counterparts. These approaches, however, are often lacking.
The most common discretization method used in practice is to heuristically partition the range of the continuous features into several mutually exclusive and exhaustive regions. Doubtless, the choice of discretization policy can have a significant undesired impact on the accuracy of the results [\citeauthoryearDaly, Shen, and Aitken2011]. To minimize the unfavorable effects of discretization, Friedman et al. proposed a more principled discretization policy where discretization of a continuous feature is based on its relation to other variables in the network [\citeauthoryearFriedman, Goldszmidt, and others1996]. However, the scoring metric resulting from this discretization policy is not decomposable. Lack of decomposability makes the search for the highest scoring network computationally challenging as most search algorithms require the decomposability of the scoring metric.
In the absence of discretization, there are only three scoring metrics that are directly applicable to BNs on domains containing continuous variables; the AIC, BIC and the BGe score [\citeauthoryearAkaike1998, \citeauthoryearSchwarz and others1978, \citeauthoryearGeiger and Heckerman1994]. The BIC/AIC metrics can fall short of achieving the desired accuracy. The BGe score lacks a closed form expression and requires specification of a set of hyperparameters which demands extensive domain knowledge or the use of a portion of the data.
In this paper, we describe two novel scoring metrics for Bayesian networks in the continuous domain based on the minimum description length principle (MDL). Our work draws inspiration from the use of the MDL principle in problems of variable selection in regression [\citeauthoryearBarron, Rissanen, and Yu1998, \citeauthoryearHansen and Yu2001]. Thus we expect it to carry the advantages that MDL offers versus the AIC and BIC in the problems of variable selection in regression to the problem of finding a suitable Bayesian network [\citeauthoryearGrünwald2007]; learning a Bayesian network can be thought of as selecting predictors for a set of variables where the predictors are constrained to follow a particular order. Both scoring metrics proposed here are free of hyperparameters, decomposable, and are asymptotically consistent.
In the next section, we formally introduce the Bayesian Gaussian networks. We then formulate a crude threepart code scoring metric for Bayesian networks in part 3 of our paper. In section 4, we propose a renormalized normalized maximum likelihood scoring scheme for continous domain Bayesian networks under the assumption that the model class under consideration consists of only Bayesian Gaussian networks. Afterward, we study the asymptotic properties of the two proposed scoring metrics in section 5. We evaluate and compare the performance of the proposed scoring metrics to the BIC and the AIC metrics using simulated data in section 6. Our results suggest that the scoring metrics proposed here consistently perform better than BIC and AIC for both sparse and dense graphs.
Gaussian Networks
Throughout this paper, we consider a domain, of continuous variables. Let denote the joint probability density function over . A Bayesian network over is a pair , where encodes a set of conditional independence statements and is a set of local conditional probabilities associated with . In particular, is a collection of sets such that renders and conditionally independent:
(1) 
Assuming that the joint probability distribution function of is a multivariate normal distribution, we can write:
(2) 
where is the mdimensional mean vector, is the covariance matrix, is the determinant operation, and is the transpose operation. This distribution can be factorized into the product of conditional distributions:
(3)  
where is the residual variance of the node , is the unconditional mean of , and is a measure of the extent of partial correlation between nodes and . Such a distribution corresponds to a Bayesian network if:
(4)  
where ’s correspond to the parent sets specified in . In other words, a multivariate Gaussian distribution corresponds to a BN, , if [\citeauthoryearShachter and Kenley1989]. The Bayesian network is minimal when there is an arc from to if and only if . This network is also referred to as the Imap of the probability distribution.
Instead of the above parametrization of a multivariate Normal distribution, we opt to work with the following parametrization:
(5) 
Every instantiation of parameters of one model corresponds to an instantiation of parameters of the second model; note that the new parametrization corresponds to a network with nodes with , and the unconditional means of all other nodes set to zero. Thus a Bayesian network, , can be parametrized by . We call such a Bayesian network, a Gaussian network; if further, the network is minimal, we call it a minimal Gaussian network. In the rest of the paper, for notational convenience, we represent the set simply by . We also write for the cardinality of the , , and for the values of the variables .
A Crude ThreePart Code Scoring Metric
In this section, we propose our first scoring metric for Bayesian networks based on the MDL principle. We assume that the reader is familiar with the basics of MDL principle [\citeauthoryearJorma1998, \citeauthoryearRissanen1986]. In short, the MDL principle states that the model most suitable for a given set of observations is the one that encodes the observations using the fewest symbols.
We wish to measure how well a Bayesian network structure, fits the observed data. Motivated by the MDL principle, we can alternatively evaluate how compact a description a Bayesian network structure can provide for the observations. We propose two methods for encoding observations using Gaussian networks. The first method proposed in this section is a crude and a heuristic encoding scheme. In the next section of the paper, we optimize this coding scheme to formulate a minmax optimal encoding of the observations known as normalized maximum likelihood coding [\citeauthoryearGrünwald2007].
The threepart code encodes the observations in three steps. The first part of the code describes the Bayesian network structure and the second and the third part encode the observations using the network structure coded in the first part. The total description length then serves as a measure of how well the network fits the observations.
In this threepart code, we extend the MDL formulations of Lam et al. for Bayesian networks over discrete domains to continuous domains [\citeauthoryearLam and Bacchus1994]. First, to encode the structure of the network, we simply enumerate the parents of each node. For a node with parents we use nats to list its parents. Therefore, the encoding length of the structure of the network can be written as:
(6) 
Having coded the network structure, we now describe how we encode the observations in the remainder of the code. Assume that we have observed the data of sample size . Our coding scheme is to first encode the values of the root nodes (nodes without parents) and then to encode the values of nodes whose parent values has already been encoded. We continue this process iteratively, descending the Bayesian network until we reach the leaves of the network.
Now, suppose that we have already encoded the values for and we wish to encode the values of . Since we have assumed that data is sampled from a multivariate normal distribution, we have:
(7) 
where is the matrix of the values of the parents of . Within this framework, the problem of optimal encoding of and the parameters is equivalent to the problem of encoding a response variable given the values of predictor variables in linear regression. Here, the response variable is , and the predictor variables are . Following works of Rissanen, in such a setting, the shortest code for encoding the values of has a length of [\citeauthoryearRissanen1983, \citeauthoryearJorma1998]:
(8)  
where and are the maximum likelihood estimates (MLE) of and . More specifically, we encode the values of in two parts; in the first part, we encode the MLE parameters , and in the second part we encode the values of using the distribution . Such a coding scheme is referred to as a crude two part coding in the MDL literature [\citeauthoryearGrünwald2007].
Thus the total description length of the observed data will be:
(9)  
It is insightful to compare this coding metric to BIC for Gaussian networks and the MDL metric proposed for BNs over discrete domains. Comparing the penalty terms in the proposed MDL scoring metric for Gaussian networks with the MDL scoring metric of discrete networks, one observes that the penalty term for discrete networks is exponential in the number of parents while it only grows linearly for Gaussian networks. The reason is that the dimensionality of the parameter space for discrete Bayesian networks increases exponentially with an increasing number of parents while the parameter space of multivariate Gaussian distribution is polynomial in the number of parents. The proposed MDL metric is essentially the same as the BIC metric with the addition of the penalty term which accounts for the network structure, .
Inefficieny of Two Part Codes
The coding scheme introduced above is not Krafttight [\citeauthoryearRissanen1986]. In particular, consider the code proposed for encoding values of given values of :
(10)  
Note that once we decode ( contains information on the MLE values for the regression coefficients ) the set of possible values for become restricted to those for which . Therefore, this coding scheme is inefficient and the data can be coded using fewer bits. Normalized maximum likelihood (NML) codes are a variation of the twopart coding scheme where this inefficiency of the crude twopart coding is addressed [\citeauthoryearRissanen1986, \citeauthoryearRissanen2000].
Normalized Maximum Likelihood
The normalized maximum likelihood distribution with respect to a class of probability distributions parametrized by a dimensional parameter vector , , is defined as:
(11) 
where
(12) 
In normalized maximum likelihood codes, instead of using a threepart code, we encode each observation with a single code using the NML distribution:
(13) 
We are now ready to formulate the NML pdf for a Gaussian network.
A Gaussian network structure over m variables defines a class of probability distributions parametrized by :
(14)  
where:
(15)  
Let denote the MLE estimates of and :
(16)  
The integral in the denominator of the NML distribution does not exist for [\citeauthoryearMiyaguchi2017]. We write down the constrained NML density as below:
(17) 
the constrained NML density is only defined for :
(18) 
In the case of , we can specify using the following set of hyperparameters:
(19)  
Let these hyperparameters define the as below:
(20)  
and
(21)  
The numerator of 17 can be easily calculated as:
(22)  
We can calculate the denominator by first writing down the factored form of the density:
(23)  
Since only the factor is a function of , we can take the other factors out of the last integral. We now write down this last integral for a given value of :
(24) 
Note that both the region of integration and the MLE estimates are functions of . Using sufficient statistics, the value of this integral was calculated in [\citeauthoryearRoos2004] :
(25)  
Interestingly, the above factor is independent of the values of . As we will see later, this independence will make way for a scoring metric that is decomposable and local to the nodes of the graph.
The denominator of the constrained NML distribution is then:
(26) 
In the expression above, due to the terms , the hyperparameters have different effects on the score of different networks structures. To get rid of such effects, Rissanen proposes a second level normalization [\citeauthoryearRissanen2000].
Renormalized Normalized Maximum Likelihood Scoring Metric
Let and denote the MLE estimates of and :
(27)  
where
(28) 
The renormalized NML (RNML) probability distribution is then given by:
(29) 
where the region of integration is given by the hyperparameters:
(30)  
with defined as:
(31)  
Inserting the density for the NML distribution into (29), with the boundry conditions above, the RNML distribution can be calculated as:
(32) 
where we have used the RNML calculations for Gaussian distribution from [\citeauthoryearRoos2004], together with the property of our parametriziation that allows the integrals to be calculated independent of each other in solving the RNML distribution. Dropping the terms independent of the network structure, the RNML code can be written as:
(33)  
Using Stirling’s approximation, we can simplify the above expression:
(34)  
Equations (34) and (33) provide a closedform expression for the scoring of a Gaussian Bayesian network based on the RNML metric. Note that both these expressions are free of hyperparameter and are decomposable.
Asymptotic Behavior
It is well known that BIC prefers minimal Imaps over other network structures for large sample sizes [\citeauthoryearBouckaert1993]. Examining the equation (9), it is clear that the asymptotic behavior of the threepart coding metric is equivalent to that of the BIC metric. We now show that the RNML scoring metric also prefers networks that are minimal Imaps.
Theorem 1. Let be a set of variables, be an ordering on the variables in and be a probability distribution over . Let be a sample generated from . Let be a minimal Imap Bayesian network of and let be any other network structure. Furthermore, let both and be consistent with the ordering . We have:
(35) 
That is the network corresponding to the minimal Imap has the lowest description length based on the RNML distribution.
Proof. We consider two cases. In the first case, we assume that is a nonminimal Imap of . In the second case, we consider a that is not an Imap of .
Assuming that is a nonminimal Imap of , then the variance of the residual, , of each node is the same in both and . Dropping the factors less than we have:
(36)  
Since is a nonminimal Imap of we also have:
(37) 
Therefore, we have:
(38) 
Now assume that is a nonminimal Imap of . Thus there exists at least one node, , such that or equivalently, such that . Without loss of generality, we assume that this consists of the only difference between the two networks. Now consider a network structure, , that is exactly equal to network except for the parent set of the node where . Therefore, is a nonminimal Imap of the probability distribution and from the previous result we know that . We now show that .
We can transform the network to the network by removing the nodes from . Doing so will increase the residual variance of the node by at least . More specifically, using equation 34 dropping the node will increase the :
(39)  
Keeping only the terms of the factor , we can simplify the above expression:
(40)  
Hence we can see that asymptotically as , is smaller than by a factor of .
Numerical Evaluation
We evaluate and compare the performance of the proposed scoring metrics to the BIC and the AIC metrics using simulated data. We study the properties of these four metrics on two levels: a low dimensional setting where the number of the nodes of the graph is small, , and lager graphs having nodes. In low dimensional setting, the number of possible generating graphs is limited. It is, therefore, possible to evaluate the performance of these four metrics in detail since we can calculate the score of all the possible generating networks. In this setting, we examine how the true generating network structure is ranked among all other networks by the different scoring criteria.
With larger DAGs, the number of possible generating graphs is exponentially high and such evaluation of the performance of the scoring metrics is computationally challenging. Hence, we decided to compare the performance of these metrics by using a structural distance measure between the highest scoring networks (found using dynamic programming [\citeauthoryearSilander and Myllymaki2012]) and the true generating network.
We chose Structural Hamming Distance (SHD) as our measure of performance. SHD is a function of the number of edge addition/deletion or reversals required to convert one DAG to another [\citeauthoryearTsamardinos, Brown, and Aliferis2006]. While such a structural distance measure can only provide a heuristic summary of the performance of these metrics, nevertheless, a comparison in terms of such structural errors has a desirable intuitive interpretation. Furthermore, SHD compares the similarity of BNs in a causal context [\citeauthoryearTsamardinos, Brown, and Aliferis2006, \citeauthoryearde Jongh and Druzdzel2009]. Therefore, it also serves as a tool to examine the applicability of the proposed metrics in identifying the generating causal structure.
Performance in Low Dimensional Setting
In this experiment, we compare how the generating network structure is ranked among all the possible networks by the different scoring criteria.
First, a random DAG was chosen among all DAGs on nodes, . A sample was then recursively generated from the selected DAG starting from the root node down to the leaves using the following equation:
(41) 
where network parameters , and were chosen as independent realizations of Uniform([0.1, 1]). Afterward, we computed the scores of all possible DAGs for the simulated data and calculated the rank of the generating DAG among all other DAGs. The simulation was repeated for 500 iterations. Figure 1 shows the average rank of the generating network structure for different scoring metrics as a function of the sample size.
As is shown in Figure 1, the convergence rate of the rank is the fastest for the RNML metric. The MDL metric and the BIC metric show similar performance while the AIC metric has the lowest convergence rate.
Performance For Larger Networks
In this section, we analyze the performance of the four metrics against the sparsity of the generating graph for graphs having nodes, . We compare the performance of these metrics by evaluating the SHD of the prime DAG to the generating DAG.
Our simulations start by selecting a random graph having a specified sparsity. Similar to the work of Kalisch et al., we simulated graphs of different sparsity by controlling the expected number of connections, neighbours, of each node, [\citeauthoryearKalisch and Bühlmann2007]. Thus, we extended the scope of our simulations beyond ErdosRenyi (uniformly random) networks and examined the performance of the metrics for networks of varying sparsity. Afterward, a dataset of sample size , , was simulated based on the selected generating graph similar to the previous section. To find the prime DAG, we used the optimal dynamic programming method of Silander et al. [\citeauthoryearSilander and Myllymaki2012]. In selecting the maximum parent size parameter in the implementation of this algorithm, we chose it equal to the maximum parent size of the generating graph itself. This way, the generating graph would be included in the possible solution set of our algorithm. For every of the 27 combinations of the control parameters (the control parameters being the number of nodes, expected number of neighbours, and sample size) we repeated the simulation for 100 iterations, each time for a randomly selected generating DAG. We then computed the average SHD between the generating and the prime DAG across these 100 iterations. Results are shown in Figures 24.
Figures 24 show that the RNML metric consistently outperforms the other metrics irrespective of the size and the sparsity of the generating network. The proposed threepart MDL metric comes second in terms of performance, surpassing the AIC and the BIC metric. On average, aggregated across all sample size, sparsity, and network size values, the RNML metric outperformed the MDL metric by an SHD value of and the other two metrics by an SHD value of . The average reduction in the SHD value for the RNML metric as compared to MDL and the best (smallest) of AIC/BIC is shown in Table 1 as a function of sample size.
Sample Size  MDL  BIC/AIC 

50  0.7333  3.5900 
500  0.6078  1.4422 
1000  0.5656  0.5789 
Conclusion
In this paper, we introduced two new scoring metrics based on the MDL principle for Gaussian networks. These scores are asymptotically consistent, have simple to calculate closedform expressions, and are parameterfree. They are furthermore decomposable and therefore compatible with the most Bayesian network search procedures. Our evaluation of the proposed metrics suggests that the proposed metrics have better performance than the AIC and the BIC metrics. The proposed RNML metric specially outperforms all other metrics consistently and regardless of the size and the sparsity of the generating graph and the sample size (see Figure 1 and Figures 24).
The RNML metric proposed here can be thought of as a continuous domain extension to the FNML metric proposed for discrete Bayesian networks [\citeauthoryearSilander et al.2008]. However, unlike the FNML metric, the RNML metric was not tuned to be decomposable, rather, decomposability came naturally to our formulation. More specifically, a major theoretical contribution of ours in this paper was to show that a global RNML formulation for a Gaussian network is equivalent to applying a RNML model selection criterion at each local distribution.
References

Akaike, H.
1998.
Information theory and an extension of the maximum likelihood
principle.
In Selected Papers of Hirotugu Akaike. Springer.
199–213.

Barron, A.; Rissanen, J.; and Yu, B.
1998.
The minimum description length principle in coding and modeling.
IEEE Transactions on Information Theory 44(6):2743–2760.

Bouckaert, R. R.
1993.
Probabilistic network construction using the minimum description
length principle.
In European conference on symbolic and quantitative approaches
to reasoning and uncertainty, 41–48.
Springer.

Chickering, D. M.
2002.
Optimal structure identification with greedy search.
Journal of machine learning research 3(Nov):507–554.

Cowell, R. G.; Dawid, P.; Lauritzen, S. L.; and Spiegelhalter, D. J.
2006.
Probabilistic networks and expert systems: Exact computational
methods for Bayesian networks.
Springer Science & Business Media.

Daly, R.; Shen, Q.; and Aitken, S.
2011.
Learning bayesian networks: approaches and issues.
The knowledge engineering review 26(2):99–157.

de Jongh, M., and Druzdzel, M. J.
2009.
A comparison of structural distance measures for causal bayesian
network models.
Recent Advances in Intelligent Information Systems, Challenging
Problems of Science, Computer Science series 443–456.

Friedman, N.; Goldszmidt, M.; et al.
1996.
Discretizing continuous attributes while learning bayesian networks.
In ICML, 157–165.

Friedman, N.
2004.
Inferring cellular networks using probabilistic graphical models.
Science 303(5659):799–805.

Geiger, D., and Heckerman, D.
1994.
Learning gaussian networks.
In Uncertainty Proceedings 1994. Elsevier.
235–243.

Grünwald, P. D.
2007.
The minimum description length principle.
MIT press.

Hansen, M. H., and Yu, B.
2001.
Model selection and the principle of minimum description length.
Journal of the American Statistical Association
96(454):746–774.

Heckerman, D.; Geiger, D.; and Chickering, D. M.
1995.
Learning bayesian networks: The combination of knowledge and
statistical data.
Machine learning 20(3):197–243.

Jorma, R.
1998.
Stochastic complexity in statistical inquiry, volume 15.
World scientific.

Kalisch, M., and Bühlmann, P.
2007.
Estimating highdimensional directed acyclic graphs with the
pcalgorithm.
Journal of Machine Learning Research 8(Mar):613–636.

Lam, W., and Bacchus, F.
1994.
Learning bayesian belief networks: An approach based on the mdl
principle.
Computational intelligence 10(3):269–293.

Miyaguchi, K.
2017.
Normalized maximum likelihood with luckiness for multivariate normal
distributions.
arXiv preprint arXiv:1708.01861.

Pearl, J.
2014.
Probabilistic reasoning in intelligent systems: networks of
plausible inference.
Elsevier.

Rissanen, J.
1983.
A universal prior for integers and estimation by minimum description
length.
The Annals of statistics 416–431.

Rissanen, J.
1986.
Stochastic complexity and modeling.
The annals of statistics 1080–1100.

Rissanen, J.
2000.
Mdl denoising.
IEEE Transactions on Information Theory 46(7):2537–2543.

Roos, T.
2004.
Mdl regression and denoising.
Unpublished manuscript.

Schwarz, G., et al.
1978.
Estimating the dimension of a model.
The annals of statistics 6(2):461–464.

Scutari, M.
2009.
Learning bayesian networks with the bnlearn r package.
arXiv preprint arXiv:0908.3817.

Shachter, R. D., and Kenley, C. R.
1989.
Gaussian influence diagrams.
Management science 35(5):527–550.

Silander, T., and Myllymaki, P.
2012.
A simple approach for finding the globally optimal bayesian network
structure.
arXiv preprint arXiv:1206.6875.

Silander, T.; Roos, T.; Kontkanen, P.; and Myllymäki, P.
2008.
Factorized normalized maximum likelihood criterion for learning
bayesian network structures.
In Proceedings of the 4th European Workshop on Probabilistic
Graphical Models.

Spirtes, P.; Glymour, C.; Scheines, R.; Kauffman, S.; Aimale, V.; and Wimberly,
F.
2000a.
Constructing bayesian network models of gene expression networks from
microarray data.

Spirtes, P.; Glymour, C. N.; Scheines, R.; Heckerman, D.; Meek, C.; Cooper, G.;
and Richardson, T.
2000b.
Causation, prediction, and search.
MIT press.

Teyssier, M., and Koller, D.
2012.
Orderingbased search: A simple and effective algorithm for learning
bayesian networks.
arXiv preprint arXiv:1207.1429.

Tsamardinos, I.; Brown, L. E.; and Aliferis, C. F.
2006.
The maxmin hillclimbing bayesian network structure learning
algorithm.
Machine learning 65(1):31–78.