Deep Broad Learning: Big Models for Big Data
Abstract
Deep learning has demonstrated the power of detailed modeling of complex high-order (multivariate) interactions in data. For some learning tasks there is power in learning models that are not only Deep but also Broad. By Broad, we mean models that incorporate evidence from large numbers of features. This is of especial value in applications where many different features and combinations of features all carry small amounts of information about the class. The most accurate models will integrate all that information. In this paper, we propose an algorithm for Deep Broad Learning called DBL. The proposed algorithm has a tunable parameter $n$ that specifies the depth of the model. It provides straightforward paths towards out-of-core learning for large data. We demonstrate that DBL learns models from large quantities of data with accuracy that is highly competitive with the state of the art.
Nayyar A. Zaidi, Francois Petitjean and Geoffrey I. Webb. Submitted 09/15. Running head: Deep Broad Learning (Zaidi, Webb, Carman and Petitjean).
Keywords: Classification, Big Data, Deep Learning, Broad Learning, Discriminative-Generative Learning, Logistic Regression, Extended Logistic Regression
1 Introduction
The rapid growth in data quantity (Ganz and Reinsel, 2012) makes it increasingly difficult for machine learning to extract maximum value from current data stores. Most state-of-the-art learning algorithms were developed in the context of small datasets. However, the amount of information present in big data is typically much greater than that present in small quantities of data. As a result, big data can support the creation of very detailed models that encode complex higher-order multivariate distributions, whereas, for small data, very detailed models will tend to overfit and should be avoided (Brain and Webb, 2002; Martinez et al., 2015). We highlight this phenomenon in Figure 1. We know that the error of most classifiers decreases as they are provided with more data. This can be observed in Figure 1, where the variation in error-rate of two classifiers is plotted against increasing quantities of training data on the poker-hand dataset (Frank and Asuncion, 2010). One is a low-bias high-variance learner (KDB, taking into account quintic features (Sahami, 1996)) and the other is a low-variance high-bias learner (naive Bayes, a linear classifier). For small quantities of data, the low-variance learner achieves the lowest error. However, as the data quantity increases, the low-bias learner comes to achieve the lower error, as it can better model the higher-order distributions from which the data might be sampled.
The capacity to model different types of interactions among variables in the data is a major determinant of a learner’s bias. The greater the capacity of a learner to model differing distributions, the lower its bias will tend to be. However, many learners have limited capacity to model complex higherorder interactions.
Deep learning (deep neural networks, convolutional deep neural networks, deep belief networks, etc.) has demonstrated some remarkable successes through its capacity to create detailed (deep) models of complex multivariate interactions in structured data (e.g., data in computer vision, speech recognition, bioinformatics, etc.). Deep learning can be characterized in several different ways, but the underlying theme is that of learning higher-order interactions among features using a cascade of many layers. This process is known as 'feature extraction' and can be unsupervised, as it leverages the structure within the data to create new features. Higher-order features are created from lower-order features, creating a hierarchical structure. We conjecture that the deeper the model, the higher the order of interactions captured in the data and the lower the bias that the model exhibits.
We argue that in many domains there is value in creating models that are broad as well as deep. For example, when using web browsing history, or social network likes, or when analyzing text, it is often the case that each feature provides only an extremely small amount of information about the target class. It is only by combining very large amounts of this micro-evidence that reliable classification is possible.
We call a model broad if it utilizes large numbers of variables. We call a model deep and broad if it captures many complex interactions, each between numerous variables. For example, typical linear classifiers such as Logistic Regression (LR) and naive Bayes (NB) are broad learners, in that they utilize all variables. However, these models are not deep, as they do not directly model interactions between variables. In contrast, Logistic Regression with cubic features, denoted LR$^3$ (we write LR$^n$ for LR taking into account all $n$-level features, e.g., LR$^2$ with quadratic features, LR$^3$ with cubic features, LR$^4$ with quartic features, etc.) (Langford et al., 2007), and Averaged 2-Dependence Estimators (A2DE) (Webb et al., 2011; Zaidi and Webb, 2012), both of which consider all combinations of 3 variables, are both deep and broad. The parameters of the former are fit discriminatively through computationally intensive gradient-descent-based search, while the parameters of the latter are fit generatively using computationally efficient maximum-likelihood estimation. This efficient estimation of A2DE parameters makes it computationally well-suited for big data. In contrast, we argue that LR$^3$'s discriminative parameterization can more closely fit the data than A2DE, making it lower bias and hence likely to have lower error when trained on large training sets. However, optimizing the parameters of LR$^3$ becomes computationally intensive even on moderate-dimensional data.
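To make the combinatorics of such broad higher-order models concrete, the following minimal Python sketch (our own illustration, not the authors' implementation; all names are hypothetical) enumerates the attribute subsets from which an LR$^n$-style model would build its features:

```python
from itertools import combinations
from math import comb

def n_way_feature_subsets(attributes, n):
    """All size-n attribute subsets from which an LR^n-style model builds features."""
    return list(combinations(attributes, n))

attributes = ["a1", "a2", "a3", "a4", "a5"]
subsets = n_way_feature_subsets(attributes, 3)
assert len(subsets) == comb(5, 3)  # 10 cubic feature subsets for 5 attributes
```

The number of subsets, and hence of weight blocks, grows as $\binom{a}{n}$, which is why discriminative optimization becomes expensive even on moderate-dimensional data.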
Recently, it has been shown that it is possible to form a hybrid generative-discriminative learner that exploits the strengths of both naive Bayes (NB) and Logistic Regression (LR) by creating a weighted variant of NB in which the weights are optimized using discriminative maximization of conditional log-likelihood (Zaidi et al., 2013, 2014). From one perspective, the resulting learner can be viewed as using weights to alleviate the attribute independence assumption of NB. From another perspective, it can be seen to use the maximum-likelihood parameterization of NB to precondition the discriminative search of LR. The result is a learner that learns models that are exactly equivalent to LR, but does so much more efficiently.
In this work, we show how to achieve the same result with LR$^n$, creating a hybrid generative-discriminative learner named DBL$^n$ for categorical data that learns deep broad models equivalent to those of LR$^n$, but does so more efficiently. We further demonstrate that the resulting models have low bias and very low error on large quantities of data. However, to create this hybrid learner we must first create an efficient generative counterpart to LR$^n$.
In short, the contributions of this work are:

developing an efficient generative counterpart to LR$^n$, named Averaged $n$-Join Estimators (AnJE),

developing DBL$^n$, a hybrid of LR$^n$ and AnJE,

demonstrating that DBL$^n$ has equivalent error to LR$^n$, but is more efficient,

demonstrating that DBL$^n$ has low error on large data.
2 Notation
We seek to assign a value $y \in \Omega_Y$ of the class variable $Y$ to a given example $\mathbf{x} = (x_1, \ldots, x_a)$, where the $x_i$ are value assignments for the $a$ attributes $\mathcal{A} = \{X_1, \ldots, X_a\}$. We define $\binom{\mathcal{A}}{n}$ as the set of all subsets of $\mathcal{A}$ of size $n$, where each subset in the set is denoted as $\alpha$:
\[
\binom{\mathcal{A}}{n} = \{\alpha \subseteq \mathcal{A} : |\alpha| = n\}.
\]
We use $x_\alpha$ to denote the set of values taken by the attributes in the subset $\alpha$ for any data object $\mathbf{x}$.
LR for categorical data learns a weight for every attribute value per class. Therefore, for LR, we denote $\beta_y$ to be the weight associated with class $y$, and $\beta_{y,i,x_i}$ to be the weight associated with attribute $i$ taking value $x_i$ with class label $y$. For LR$^n$, $\beta_{y,\alpha,x_\alpha}$ specifies the weight associated with class $y$ and attribute subset $\alpha$ taking value $x_\alpha$. The equivalent weights for DBL$^n$ are denoted by $w_y$ and $w_{y,\alpha,x_\alpha}$.
The probability of attribute $X_i$ taking value $x_i$ given class $y$ is denoted by $P(x_i \mid y)$. Similarly, the probability of attribute subset $\alpha$ taking value $x_\alpha$ is denoted by $P(x_\alpha \mid y)$.
3 Using generative models to precondition discriminative learning
There is a direct equivalence between a weighted NB and LR (Zaidi et al., 2013, 2014). We write LR for categorical features as:
\[
P_{\text{LR}}(y \mid \mathbf{x}) = \frac{\exp\!\left(\beta_y + \sum_{i=1}^{a} \beta_{y,i,x_i}\right)}{\sum_{y' \in \Omega_Y} \exp\!\left(\beta_{y'} + \sum_{i=1}^{a} \beta_{y',i,x_i}\right)} \quad (1)
\]
and NB as:
\[
P_{\text{NB}}(y \mid \mathbf{x}) = \frac{P(y) \prod_{i=1}^{a} P(x_i \mid y)}{\sum_{y' \in \Omega_Y} P(y') \prod_{i=1}^{a} P(x_i \mid y')}.
\]
One can add weights to NB to alleviate the attribute independence assumption, resulting in the WANBIA-C formulation, which can be written as:
\[
P_{\text{W}}(y \mid \mathbf{x}) = \frac{P(y)^{w_y} \prod_{i=1}^{a} P(x_i \mid y)^{w_{y,i,x_i}}}{\sum_{y' \in \Omega_Y} P(y')^{w_{y'}} \prod_{i=1}^{a} P(x_i \mid y')^{w_{y',i,x_i}}}. \quad (2)
\]
When conditional log-likelihood (CLL) is maximized for LR and weighted NB using Equations 1 and 2 respectively, we get an equivalence such that $\beta_y = w_y \log P(y)$ and $\beta_{y,i,x_i} = w_{y,i,x_i} \log P(x_i \mid y)$. Thus, WANBIA-C and LR generate equivalent models. While it might seem less efficient to use WANBIA-C, which has twice the number of parameters of LR, the probability estimates are learned very efficiently using maximum-likelihood estimation, and provide useful information about the classification task that in practice serves to effectively precondition the search for the parameterization of weights that maximizes conditional log-likelihood.
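The preconditioning idea can be sketched numerically. The following toy example (our own illustration, not the paper's code; the array shapes and function name are assumptions) computes a weighted-NB posterior of the form in Equation 2; with all weights fixed at 1 it reduces to ordinary NB:

```python
import numpy as np

def weighted_nb_posterior(log_prior, log_cond, w_class, w_cond):
    """Posterior under a WANBIA-C-style weighted NB (Equation 2).

    log_prior: (C,) log P(y); log_cond: (C, F) log P(x_i | y) at the observed values.
    w_class: (C,) weights on the prior; w_cond: (C, F) weights on each conditional.
    """
    scores = w_class * log_prior + (w_cond * log_cond).sum(axis=1)
    scores -= scores.max()  # numerical stability; does not change the softmax
    p = np.exp(scores)
    return p / p.sum()

log_prior = np.log(np.array([0.6, 0.4]))
log_cond = np.log(np.array([[0.2, 0.7],
                            [0.5, 0.1]]))
# With all weights fixed at 1 the model reduces to ordinary naive Bayes.
uniform = weighted_nb_posterior(log_prior, log_cond, np.ones(2), np.ones((2, 2)))
assert abs(uniform[0] - 0.084 / 0.104) < 1e-9  # 0.6*0.2*0.7 / (0.6*0.2*0.7 + 0.4*0.5*0.1)
```

The discriminative step then searches over the weights while the log-probabilities, estimated once by counting, stay fixed.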
4 Deep Broad Learner (DBL)
In order to create an efficient and effective low-bias learner, we want to perform the same trick that WANBIA-C uses for LR with higher-order categorical features. We define LR$^n$ as:
\[
P_{\text{LR}^n}(y \mid \mathbf{x}) = \frac{\exp\!\left(\beta_y + \sum_{\alpha \in \binom{\mathcal{A}}{n}} \beta_{y,\alpha,x_\alpha}\right)}{\sum_{y' \in \Omega_Y} \exp\!\left(\beta_{y'} + \sum_{\alpha \in \binom{\mathcal{A}}{n}} \beta_{y',\alpha,x_\alpha}\right)}. \quad (3)
\]
We do not include lower-order terms. For example, if $n = 2$ we do not include terms for individual values $x_i$ as well as for pairs $x_{i,j}$, because doing so does not increase the space of distinct distributions that can be modeled but does increase the number of parameters that must be optimized.
To precondition this model using generative learning, we need a generative model of the form
\[
P(y, \mathbf{x}) = P(y) \prod_{\alpha \in \binom{\mathcal{A}}{n}} P(x_\alpha \mid y), \quad (4)
\]
yielding
\[
P(y \mid \mathbf{x}) = \frac{P(y) \prod_{\alpha \in \binom{\mathcal{A}}{n}} P(x_\alpha \mid y)}{\sum_{y' \in \Omega_Y} P(y') \prod_{\alpha \in \binom{\mathcal{A}}{n}} P(x_\alpha \mid y')}. \quad (5)
\]
The only existing generative model of this form is a log-linear model, which requires computationally expensive conditional log-likelihood optimization and consequently would not be efficient to employ. It is not possible to create a Bayesian network of this form, as it would require that each overlapping subset $x_\alpha$ be independent of every other given the class. However, we can use a variant of the AnDE (Webb et al., 2005, 2011) approach of averaging many Bayesian networks. Unlike AnDE, we cannot use the arithmetic mean, as we require a product of terms in Equation 4 rather than a sum, so we must instead use a geometric mean.
4.1 Averaged n-Join Estimators (AnJE)
Let $\Psi$ be a partition of the attributes $\mathcal{A}$. By assuming independence only between the sets of attributes, one obtains an $n$-joint estimator:
\[
P_\Psi(\mathbf{x} \mid y) = \prod_{\alpha \in \Psi} P(x_\alpha \mid y).
\]
For example, if there are four attributes $X_1$, $X_2$, $X_3$ and $X_4$ that are partitioned into the sets $\{X_1, X_2\}$ and $\{X_3, X_4\}$, then by assuming conditional independence between the sets we obtain
\[
P(x_1, x_2, x_3, x_4 \mid y) = P(x_1, x_2 \mid y)\, P(x_3, x_4 \mid y).
\]
Let $\mathcal{P}$ be the set of all partitions of $\mathcal{A}$ such that every set in a partition is of size $n$. For convenience we assume that $a$ is a multiple of $n$. Let $\Psi^*$ be a subset of $\mathcal{P}$ that includes each size-$n$ set of attributes exactly once.
The AnJE model is the geometric mean of the set of $n$-joint estimators for the partitions $\Psi \in \Psi^*$.
The AnJE estimate of conditional likelihood on a per-datum basis can be written as:
\[
P_{\text{AnJE}}(y \mid \mathbf{x}) \propto P(y) \prod_{\alpha \in \binom{\mathcal{A}}{n}} P(x_\alpha \mid y)^{\,a / \left(n \binom{a}{n}\right)}. \quad (6)
\]
This is derived as follows. Each $\Psi$ is of size $a/n$. There are $\binom{a}{n}$ attribute-value $n$-tuples. Each must occur in exactly one partition, so the number of partitions must be
\[
|\Psi^*| = \binom{a}{n} \Big/ \frac{a}{n} = \frac{n}{a} \binom{a}{n}. \quad (7)
\]
The geometric mean of all the $n$-joint estimators in $\Psi^*$ is thus
\[
P_{\text{AnJE}}(y \mid \mathbf{x}) \propto \left( P(y)^{|\Psi^*|} \prod_{\alpha \in \binom{\mathcal{A}}{n}} P(x_\alpha \mid y) \right)^{1/|\Psi^*|} = P(y) \prod_{\alpha \in \binom{\mathcal{A}}{n}} P(x_\alpha \mid y)^{\,a / \left(n \binom{a}{n}\right)}. \quad (8)
\]
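The counting argument of Equations 7 and 8 can be reproduced in a few lines. This sketch (our own illustration; names are hypothetical) computes the exponent placed on each subset probability, assuming as the text does that $a$ is a multiple of $n$:

```python
from fractions import Fraction
from math import comb

def anje_subset_exponent(a, n):
    """Exponent on each P(x_alpha | y) in the AnJE geometric mean.

    With a attributes (a divisible by n) there are C(a, n) size-n subsets and each
    partition holds a/n of them, so |Psi*| = (n/a) * C(a, n) partitions (Equation 7);
    the geometric mean raises each factor to 1/|Psi*| = a / (n * C(a, n)) (Equation 8).
    """
    num_partitions = Fraction(n, a) * comb(a, n)
    return 1 / num_partitions

# Four attributes in pairs: 6 subsets, 3 partitions, exponent 1/3.
assert anje_subset_exponent(4, 2) == Fraction(1, 3)
```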
Using Equation 6, we can write the log of $P_{\text{AnJE}}$ as:
\[
\log P_{\text{AnJE}}(y \mid \mathbf{x}) \propto \log P(y) + \frac{a}{n \binom{a}{n}} \sum_{\alpha \in \binom{\mathcal{A}}{n}} \log P(x_\alpha \mid y). \quad (9)
\]
4.2 DBL$^n$
It can be seen that AnJE is a simple model that places the single fixed weight derived from Equation 7 on all feature subsets in the ensemble. The main advantage of this weighting scheme is that it requires no optimization, making AnJE learning extremely efficient. All that is required for training is to calculate the counts from the data. However, the disadvantage of AnJE is its inability to perform any form of discriminative learning. Our proposed algorithm, DBL$^n$, uses AnJE to precondition LR$^n$ by placing weights on all probabilities in Equation 4 and learning these weights by optimizing the conditional likelihood. (One can initialize these weights with the weights derived from Equation 7 for faster convergence.) One can rewrite the AnJE model with this parameterization as:
\[
P_{\text{DBL}^n}(y \mid \mathbf{x}) = \frac{P(y)^{w_y} \prod_{\alpha \in \binom{\mathcal{A}}{n}} P(x_\alpha \mid y)^{w_{y,\alpha,x_\alpha}}}{\sum_{y' \in \Omega_Y} P(y')^{w_{y'}} \prod_{\alpha \in \binom{\mathcal{A}}{n}} P(x_\alpha \mid y')^{w_{y',\alpha,x_\alpha}}}. \quad (10)
\]
Note that we can compute the likelihood and class-prior probabilities using either MLE or MAP estimates. Therefore, we can write Equation 10 as:
\[
P_{\text{DBL}^n}(y \mid \mathbf{x}) = \frac{\hat{P}(y)^{w_y} \prod_{\alpha \in \binom{\mathcal{A}}{n}} \hat{P}(x_\alpha \mid y)^{w_{y,\alpha,x_\alpha}}}{\sum_{y' \in \Omega_Y} \hat{P}(y')^{w_{y'}} \prod_{\alpha \in \binom{\mathcal{A}}{n}} \hat{P}(x_\alpha \mid y')^{w_{y',\alpha,x_\alpha}}}. \quad (11)
\]
Assuming a Dirichlet prior, a MAP estimate of $P(y)$ is
\[
\hat{P}(y) = \frac{\#_y + m/|\Omega_Y|}{N + m},
\]
where $\#_y$ is the number of instances in the dataset with class $y$, $N$ is the total number of instances, and $m$ is the smoothing parameter. We set $m = 1$ in this work. Similarly, a MAP estimate of $P(x_\alpha \mid y)$ is
\[
\hat{P}(x_\alpha \mid y) = \frac{\#_{x_\alpha,y} + m/|\Omega_\alpha|}{\#_y + m},
\]
where $\#_{x_\alpha,y}$ is the number of instances in the dataset with class $y$ and attribute values $x_\alpha$, and $|\Omega_\alpha|$ is the number of distinct values the attribute subset $\alpha$ can take.
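The smoothed estimates above can be sketched as follows (a minimal illustration assuming the standard m-estimate form of Dirichlet smoothing; function and variable names are our own):

```python
def map_estimate(count, total, num_values, m=1.0):
    """Dirichlet-smoothed (MAP) estimate: (count + m/num_values) / (total + m)."""
    return (count + m / num_values) / (total + m)

# Class prior: 30 of 100 instances have class y, 2 classes, m = 1.
p_y = map_estimate(30, 100, 2)
# Conditional: 12 of those 30 instances show the attribute-value tuple x_alpha,
# which can take 4 distinct values.
p_x_given_y = map_estimate(12, 30, 4)
assert 0.0 < p_x_given_y < 1.0
```

The smoothing matters for DBL$^n$ because the model takes logs of these estimates; an unsmoothed zero count would produce an infinite term.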
DBL$^n$ computes weights by optimizing the CLL. Therefore, one can compute the gradient of Equation 11 with respect to the weights and rely on gradient-descent-based methods to find their optimal values. Since we do not want to become stuck in local minima, a natural question to ask is whether the resulting objective function is convex (Boyd and Vandenberghe, 2008). It turns out that the objective function of DBL$^n$ is indeed convex. Roos et al. (2005) proved that the conditional-likelihood objective function optimized by a conditional Bayesian network model is convex if and only if the structure of the Bayesian network is perfect, that is, all its nodes are moral nodes. DBL$^n$ is a geometric mean of several sub-models, where each sub-model models $n$-way interactions conditioned on the class attribute, and each sub-model has a structure that is perfect. Since the log-likelihood of a product of models is the sum of their log-likelihoods, and a sum of convex functions is convex, one can see that DBL$^n$'s optimization also leads to a convex objective function.
Let us first calculate the gradient of Equation 11 with respect to the weights associated with the class priors. We can write:
\[
\frac{\partial \log P_{\text{DBL}^n}(y \mid \mathbf{x})}{\partial w_{y'}} = \left(\mathbb{1}_{y = y'} - P(y' \mid \mathbf{x})\right) \log \hat{P}(y'), \quad (12)
\]
where $\mathbb{1}_{y = y'}$ denotes an indicator function that is $1$ if the derivative is taken with respect to the weight of the true class ($y' = y$) and $0$ otherwise. Computing the gradient with respect to the weights associated with the attribute subsets gives:
\[
\frac{\partial \log P_{\text{DBL}^n}(y \mid \mathbf{x})}{\partial w_{y',\alpha,x_\alpha}} = \left(\mathbb{1}_{y = y'} - P(y' \mid \mathbf{x})\right) \log \hat{P}(x_\alpha \mid y'), \quad (13)
\]
where $\mathbb{1}_{y = y'}$ is as above and the derivative is taken with respect to the weight of attribute subset $\alpha$ at its observed value $x_\alpha$.
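As a sanity check on Equations 12 and 13, the following self-contained sketch (our own illustration, not the authors' implementation; all array shapes and names are assumptions) computes the weighted posterior and the per-datum CLL gradient with respect to the subset weights, then verifies one entry against a central finite difference:

```python
import numpy as np

def posterior(w_class, w_cond, log_prior, log_cond):
    """Weighted posterior P(y | x): softmax of w_y*log P(y) + sum_a w_{y,a}*log P(x_a | y)."""
    scores = w_class * log_prior + (w_cond * log_cond).sum(axis=1)
    scores -= scores.max()  # numerical stability; does not change the softmax
    p = np.exp(scores)
    return p / p.sum()

def cll_grad_subsets(w_class, w_cond, log_prior, log_cond, true_y):
    """Per-datum CLL gradient wrt subset weights: (indicator - P(y|x)) * log P(x_a | y)."""
    p = posterior(w_class, w_cond, log_prior, log_cond)
    delta = -p
    delta[true_y] += 1.0
    return delta[:, None] * log_cond

rng = np.random.default_rng(0)
C, S = 3, 4  # classes, attribute subsets
log_prior = np.log(rng.dirichlet(np.ones(C)))
log_cond = np.log(rng.uniform(0.05, 0.95, size=(C, S)))
w_class, w_cond = np.ones(C), rng.normal(size=(C, S))
g = cll_grad_subsets(w_class, w_cond, log_prior, log_cond, true_y=1)

# Central finite-difference check on one weight.
eps = 1e-6
w_plus, w_minus = w_cond.copy(), w_cond.copy()
w_plus[2, 3] += eps
w_minus[2, 3] -= eps
num = (np.log(posterior(w_class, w_plus, log_prior, log_cond)[1])
       - np.log(posterior(w_class, w_minus, log_prior, log_cond)[1])) / (2 * eps)
assert abs(num - g[2, 3]) < 1e-5
```

In a full implementation these per-datum gradients would be summed over the training set and handed to an optimizer such as L-BFGS.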
4.3 Alternative Parameterization
Let us reparameterize DBL$^n$ such that:
\[
\beta_y = w_y \log \hat{P}(y), \qquad \beta_{y,\alpha,x_\alpha} = w_{y,\alpha,x_\alpha} \log \hat{P}(x_\alpha \mid y). \quad (14)
\]
Now, we can rewrite Equation 11 as:
\[
P(y \mid \mathbf{x}) = \frac{\exp\!\left(\beta_y + \sum_{\alpha \in \binom{\mathcal{A}}{n}} \beta_{y,\alpha,x_\alpha}\right)}{\sum_{y' \in \Omega_Y} \exp\!\left(\beta_{y'} + \sum_{\alpha \in \binom{\mathcal{A}}{n}} \beta_{y',\alpha,x_\alpha}\right)}. \quad (15)
\]
It can be seen that this leads to Equation 3. We call this parameterization LR$^n$.
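The equivalence of the two parameterizations is easy to check numerically. In this toy sketch (our own example, not the paper's code), mapping $\beta = w \log \hat{P}$ as in Equation 14 yields identical posteriors under both forms:

```python
import numpy as np

def softmax(scores):
    scores = scores - scores.max()
    e = np.exp(scores)
    return e / e.sum()

rng = np.random.default_rng(1)
C, S = 3, 4  # classes, attribute subsets
log_prior = np.log(rng.dirichlet(np.ones(C)))            # log P(y)
log_cond = np.log(rng.uniform(0.05, 0.95, size=(C, S)))  # log P(x_a | y)
w_class = rng.normal(size=C)
w_cond = rng.normal(size=(C, S))

# Weighted (DBL^n) parameterization.
p_w = softmax(w_class * log_prior + (w_cond * log_cond).sum(axis=1))

# Direct (LR^n) parameterization obtained via Equation 14: beta = w * log P.
beta_class = w_class * log_prior
beta_cond = w_cond * log_cond
p_beta = softmax(beta_class + beta_cond.sum(axis=1))

assert np.allclose(p_w, p_beta)  # identical posteriors under both parameterizations
```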
Like DBL$^n$, LR$^n$ also leads to a convex optimization problem, and, therefore, its parameters can also be optimized by simple gradient-descent-based algorithms. Let us compute the gradient of the objective function in Equation 15 with respect to $\beta_{y'}$. In this case, we can write:
\[
\frac{\partial \log P(y \mid \mathbf{x})}{\partial \beta_{y'}} = \mathbb{1}_{y = y'} - P(y' \mid \mathbf{x}). \quad (16)
\]
Similarly, computing the gradient with respect to $\beta_{y',\alpha,x_\alpha}$, we can write:
\[
\frac{\partial \log P(y \mid \mathbf{x})}{\partial \beta_{y',\alpha,x_\alpha}} = \mathbb{1}_{y = y'} - P(y' \mid \mathbf{x}). \quad (17)
\]
4.4 Comparative analysis of DBL$^n$ and LR$^n$
It can be seen that the two models are actually equivalent and each is a reparameterization of the other. However, there are subtle distinctions between the two. The most important distinction is the utilization of MAP or MLE probability estimates in DBL$^n$. Therefore, DBL$^n$ is a two-step learning algorithm:

Step 1 is the optimization of the log-likelihood of the data ($\log P(y, \mathbf{x})$) to obtain the estimates of the prior and likelihood probabilities. One can view this step as generative learning.

Step 2 is the introduction of weights on these probabilities and the learning of these weights by maximizing the CLL ($\log P(y \mid \mathbf{x})$) objective function. This step can be interpreted as discriminative learning.
Thus DBL$^n$ employs generative-discriminative learning, as opposed to the purely discriminative learning of LR$^n$.
One can expect a similar bias-variance profile and very similar classification performance from the two models, as both will converge to a similar point in the optimization space, the only difference in the final parameterization being due to gradient descent being terminated before absolute optimization. However, the rates of convergence of the two models can be very different. Zaidi et al. (2014) show that for NB, this style of parameterization with generative-discriminative learning can greatly speed up convergence relative to purely discriminative training. Note that discriminative training with NB as the graphical model is vanilla LR. We expect to see the same trend in the convergence performance of DBL$^n$ and LR$^n$.
Another distinction between the two models becomes explicit if a regularization penalty is added to the objective function. One can see that in the case of DBL$^n$, regularizing the weights towards $1$ will effectively pull the parameters back towards the generative training estimates. For smaller datasets, one can expect to obtain better performance by using a large regularization parameter and pulling estimates back towards the generative ones. However, one cannot do this for LR$^n$. Therefore, DBL$^n$ models can very elegantly combine generative and discriminative parameters.
An analysis of the gradients of DBL$^n$ in Equations 12 and 13 and those of LR$^n$ in Equations 16 and 17 also reveals an interesting comparison. We can write DBL$^n$'s gradients in terms of LR$^n$'s gradients as follows:
\[
\frac{\partial \log P(y \mid \mathbf{x})}{\partial w_{y'}} = \frac{\partial \log P(y \mid \mathbf{x})}{\partial \beta_{y'}} \log \hat{P}(y'), \qquad
\frac{\partial \log P(y \mid \mathbf{x})}{\partial w_{y',\alpha,x_\alpha}} = \frac{\partial \log P(y \mid \mathbf{x})}{\partial \beta_{y',\alpha,x_\alpha}} \log \hat{P}(x_\alpha \mid y').
\]
It can be seen that DBL$^n$ has the effect of rescaling LR$^n$'s gradient by the log of the conditional probabilities. We conjecture that such rescaling has the effect of preconditioning the parameter space and, therefore, will lead to faster convergence.
5 Related Work
Averaged n-Dependence Estimators (AnDE) is the inspiration for AnJE. An AnDE model is the arithmetic mean of Bayesian network classifiers in each of which all attributes depend on the class and on some $n$ attributes. A simple depiction of A1DE in graphical form is shown in Figure 2.
There are $\binom{a}{n}$ possible combinations of $n$ attributes that can be used as parents, producing $\binom{a}{n}$ sub-models which are combined by averaging.
AnDE and AnJE both use simple generative learning, merely counting the relevant sufficient statistics from the data. Second, both have only one tweaking parameter, $n$, which controls the bias-variance trade-off. Higher values of $n$ lead to low bias and high variance, and vice versa.
It is important to note the equivalence (in terms of the level of interactions they model) of AnJE and AnDE models. That is, the following holds:
\[
\mathcal{L}(\text{A}n\text{JE}) = \mathcal{L}(\text{A}(n\!-\!1)\text{DE}),
\]
where $\mathcal{L}(\cdot)$ is a function that returns the number of interactions that the algorithm models. Thus, an AnJE model uses the same core statistics as an A(n-1)DE model. At training time, AnJE and A(n-1)DE must learn the same information from the data. However, at classification time, each of these statistics is accessed once by AnJE and multiple times by A(n-1)DE, making AnJE more efficient. However, as we will show, it turns out that AnJE's use of the geometric mean results in a more biased estimator than the arithmetic mean used by AnDE. As a result, in practice, an AnJE model is less accurate than the equivalent AnDE model.
However, due to AnDE's use of the arithmetic mean, its weighted version would be much more difficult to optimize than AnJE's, as, transformed to log space, it does not admit a simple linear model.
A work relevant to DBL$^n$ is that of Greiner et al. (2004); Greiner and Zhou (2002). The technique proposed in these papers, named ELR, has a number of traits similar to DBL$^n$. For example, the parameters associated with a Bayesian network classifier (naive Bayes or TAN) are learned by optimizing the CLL. Both ELR and DBL$^n$ can be viewed as feature-engineering frameworks. An ELR model (say, with a TAN structure) is a subset of DBL$^2$ models. Comparison of DBL$^n$ with ELR is not the goal of this work, but in our preliminary results, DBL$^n$ produces models of much lower bias than ELR (TAN). Modelling higher-order interactions is also an issue with ELR. One could learn a Bayesian network structure, create features based on it, and then use ELR. But several restrictions need to be imposed on the structure, that is, it has to fulfill the property of perfectness, to make sure that it leads to a convex optimization problem. With DBL$^n$, as we discussed in Section 4.2, there are no such restrictions. Needless to say, ELR is neither broad nor deep. Some ideas related to ELR are also explored in Pernkopf and Bilmes (2005); Pernkopf and Wohlmayr (2009); Su et al. (2008).
6 Experiments
In this section, we compare and analyze the performance of our proposed algorithms and related methods on natural domains from the UCI repository of machine learning (Frank and Asuncion, 2010).
Table 1: Details of the datasets (Case: number of instances; Att: number of attributes; Class: number of classes).
Domain  Case  Att  Class  Domain  Case  Att  Class 

Kddcup  5209000  41  40  Vowel  990  14  11 
Pokerhand  1175067  10  10  TicTacToeEndgame  958  10  2 
MITFaceSetC  839000  361  2  Annealing  898  39  6 
Covertype  581012  55  7  Vehicle  846  19  4 
MITFaceSetB  489400  361  2  PimaIndiansDiabetes  768  9  2 
MITFaceSetA  474000  361  2  BreastCancer(Wisconsin)  699  10  2 
CensusIncome(KDD)  299285  40  2  CreditScreening  690  16  2 
Localization  164860  7  3  BalanceScale  625  5  3 
Connect4Opening  67557  43  3  Syncon  600  61  6 
Statlog(Shuttle)  58000  10  7  Chess  551  40  2 
Adult  48842  15  2  Cylinder  540  40  2 
LetterRecognition  20000  17  26  Musk1  476  167  2 
MAGICGammaTelescope  19020  11  2  HouseVotes84  435  17  2 
Nursery  12960  9  5  HorseColic  368  22  2 
Sign  12546  9  3  Dermatology  366  35  6 
PenDigits  10992  17  10  Ionosphere  351  35  2 
Thyroid  9169  30  20  LiverDisorders(Bupa)  345  7  2 
Pioneer  9150  37  57  PrimaryTumor  339  18  22 
Mushrooms  8124  23  2  Haberman’sSurvival  306  4  2 
Musk2  6598  167  2  HeartDisease(Cleveland)  303  14  2 
Satellite  6435  37  6  Hungarian  294  14  2 
OpticalDigits  5620  49  10  Audiology  226  70  24 
PageBlocksClassification  5473  11  5  NewThyroid  215  6  3 
Wallfollowing  5456  25  4  GlassIdentification  214  10  3 
Nettalk(Phoneme)  5438  8  52  SonarClassification  208  61  2 
Waveform5000  5000  41  3  AutoImports  205  26  7 
Spambase  4601  58  2  WineRecognition  178  14  3 
Abalone  4177  9  3  Hepatitis  155  20  2 
Hypothyroid(Garavan)  3772  30  4  TeachingAssistantEvaluation  151  6  3 
Sickeuthyroid  3772  30  2  IrisClassification  150  5  3 
Kingrookvskingpawn  3196  37  2  Lymphography  148  19  4 
SplicejunctionGeneSequences  3190  62  3  Echocardiogram  131  7  2 
Segment  2310  20  7  PromoterGeneSequences  106  58  2 
CarEvaluation  1728  8  4  Zoo  101  17  7 
Volcanoes  1520  4  4  PostoperativePatient  90  9  3 
Yeast  1484  9  10  LaborNegotiations  57  17  2 
ContraceptiveMethodChoice  1473  10  3  LungCancer  32  57  3 
German  1000  21  2  Contactlenses  24  5  3 
LED  1000  8  10 
The experiments are conducted on the datasets described in Table 1, 77 datasets in total, ranging from a few dozen instances to several million. The datasets with the largest numbers of instances are shown in bold font in Table 1.
Each algorithm is tested on each dataset using repeated rounds of cross-validation. (The exceptions are MITFaceSetA, MITFaceSetB and Kddcup, where, due to their size, results are reported over fewer rounds.)
We compare four different metrics: 0-1 loss, RMSE, bias and variance. (As discussed in Section 1, the reason for performing bias-variance estimation is that it provides insights into how the learning algorithm will perform with varying amounts of data. We expect low-variance algorithms to have relatively low error for small data and low-bias algorithms to have relatively low error for large data (Brain and Webb, 2002).)
We report Win-Draw-Loss (W/D/L) results when comparing the 0-1 loss, RMSE, bias and variance of two models. A two-tail binomial sign test is used to determine the significance of the results. Results are considered significant if the resulting $p$-value is below the chosen threshold.
The datasets in Table 1 are divided into two categories. We call the following datasets Big: KDDCup, Poker-hand, USCensus1990, Covertype, MITFaceSetB, MITFaceSetA, Census-income, Localization. All remaining datasets are denoted as Little in the results. Due to their size, experiments for most of the Big datasets had to be performed in a heterogeneous environment (grid computing) for which CPU wall-clock times are not commensurable. In consequence, when comparing classification and training time, the following datasets constitute the Big category: Localization, Connect4, Shuttle, Adult, Letter-recog, Magic, Nursery, Sign, Pendigits.
When comparing average results across Little and Big datasets, we normalize each dataset's results with respect to a baseline and present a geometric mean.
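For concreteness, averaging normalized results with a geometric mean can be sketched as follows (a minimal illustration; the particular baseline used for normalization is our assumption, as the text does not specify it):

```python
from math import prod

def normalized_geo_mean(errors, baseline):
    """Normalize each dataset's error by a baseline's error, then geometric-mean them."""
    ratios = [e / b for e, b in zip(errors, baseline)]
    return prod(ratios) ** (1.0 / len(ratios))

# Two datasets: twice the baseline error on one, a quarter of it on the other.
score = normalized_geo_mean([0.2, 0.1], [0.1, 0.4])
assert abs(score - 0.5 ** 0.5) < 1e-12
```

The geometric mean is the natural choice here because ratios on very different scales would otherwise be dominated by the largest datasets.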
Numeric attributes are discretized by using the Minimum Description Length (MDL) discretization method (Fayyad and Irani, 1992). A missing value is treated as a separate attribute value and taken into account exactly like other values.
We employed the L-BFGS quasi-Newton method (Zhu et al., 1997) for solving the optimization. (The original L-BFGS implementation of Byrd et al. (1995) from http://users.eecs.northwestern.edu/~nocedal/lbfgsb.html is used.)
We used Random Forest (Breiman, 2001), an ensemble of decision trees, as a baseline.
Both DBL$^n$ and LR$^n$ are regularized. The regularization constant is not tuned and is set to a fixed value for all experiments.
The detailed 01 Loss and RMSE results on Big datasets are also given in Appendix A.
6.1 DBL$^n$ vs. AnJE
A W/D/L comparison of the 0-1 loss, RMSE, bias and variance of DBL$^n$ and AnJE on Little datasets is shown in Table 2. We compare DBL$^2$ with A2JE and DBL$^3$ with A3JE only. It can be seen that DBL$^n$ has significantly lower bias but significantly higher variance. The 0-1 loss and RMSE results are not in favour of either algorithm. However, on Big datasets, DBL$^n$ wins on 7 out of 8 datasets in terms of both RMSE and 0-1 loss. The results are not significant, since the $p$-value is greater than our set threshold. One can infer that DBL$^n$ successfully reduces the bias of AnJE, at the expense of increasing its variance.
Table 2: Win-Draw-Loss comparison of DBL$^n$ and AnJE.
  DBL$^2$ vs. A2JE  DBL$^3$ vs. A3JE  
  W/D/L  $p$  W/D/L  $p$  
Little Datasets  
Bias  66/4/5  0.001  58/2/15  0.001 
Variance  16/3/56  0.001  19/2/54  0.001 
01 Loss  42/5/28  0.119  37/3/35  0.906 
RMSE  37/1/37  1.000  30/1/44  0.130 
Big Datasets  
01 Loss  7/0/1  0.070  7/0/1  0.070 
RMSE  7/0/1  0.070  7/0/1  0.070 
Normalized 0-1 loss and RMSE results for both models are shown in Figure 3.
It can be seen that DBL$^n$ has lower averaged 0-1 loss and RMSE than AnJE. This difference is substantial on Big datasets. The training and classification times of AnJE are, however, substantially lower than those of DBL$^n$, as can be seen in Figure 4. This is to be expected, as DBL$^n$ adds discriminative training to AnJE and uses twice the number of parameters at classification time.
6.2 DBL$^n$ vs. AnDE
A W/D/L comparison of the 0-1 loss, RMSE, bias and variance of the two models relative to the corresponding AnDE models is presented in Table 3. We compare DBL$^2$ with A1DE and DBL$^3$ with A2DE only. It can be seen that DBL$^n$ has significantly lower bias and significantly higher variance than the AnDE models. Recently, AnDE models have been proposed as fast and effective Bayesian classifiers when learning from large quantities of data (Zaidi and Webb, 2012). These bias-variance results make DBL$^n$ a suitable alternative to AnDE when dealing with big data. The 0-1 loss and RMSE results (with the exception of the RMSE comparison of DBL$^3$ vs. A2DE) are similar.
Table 3: Win-Draw-Loss comparison of DBL$^n$ and AnDE.
  DBL$^2$ vs. A1DE  DBL$^3$ vs. A2DE  
  W/D/L  $p$  W/D/L  $p$  
Little Datasets  
Bias  65/3/7  0.001  53/5/17  0.001 
Variance  21/5/49  0.001  26/5/44  0.041 
01 Loss  42/4/29  0.1539  39/3/33  0.556 
RMSE  30/1/44  0.130  22/1/52  0.001 
Big Datasets  
01 Loss  8/0/0  0.007  7/0/1  0.073 
RMSE  7/0/1  0.073  6/0/2  0.289 
Normalized 0-1 loss and RMSE are shown in Figure 5. It can be seen that the DBL$^n$ models have lower 0-1 loss and RMSE than the corresponding AnDE models.
A comparison of the training time of DBL$^n$ and AnDE is given in Figure 6. As expected, due to its additional discriminative learning, DBL$^n$ requires substantially more training time than AnDE. However, AnDE does not share such a consistent advantage with respect to classification time, the relativities depending on the dimensionality of the data. For high-dimensional data, the large number of permutations of attributes that AnDE must consider results in greater computation.
6.3 DBL$^n$ vs. LR$^n$
In this section, we compare the two DBL$^n$ models with their equivalent LR$^n$ models. As discussed before, we expect to see a similar bias-variance profile and similar classification performance, as the two models are reparameterizations of each other.
We compare the two parameterizations in terms of the scatter of their 0-1 loss and RMSE values on Little datasets in Figures 7 and 9 respectively, and on Big datasets in Figures 8 and 10 respectively. It can be seen that the two parameterizations (with the exception of one dataset, Wall-following) have a similar spread of 0-1 loss and RMSE values for both values of $n$.
The comparative scatter of the number of iterations each parameterization takes to converge is shown in Figures 11 and 12 for Little and Big datasets respectively. It can be seen that DBL$^n$ requires far fewer iterations than LR$^n$. With a similar spread of 0-1 loss and RMSE values, it is very encouraging to see that DBL$^n$ converges in far fewer iterations.
The number of iterations to converge plays a major part in determining an algorithm's training time. The training time of the two parameterizations is shown in Figures 13 and 14 for Little and Big datasets, respectively. It can be seen that the DBL$^n$ models are much faster than the equivalent LR$^n$ models.
A comparison of the rate of convergence of the negative log-likelihood (NLL) of the DBL$^n$ and LR$^n$ parameterizations on some sample datasets is shown in Figure 15. It can be seen that DBL$^n$ has a steeper curve, asymptoting to its global minimum much faster; on almost all datasets, it follows a steeper, and hence more desirable, path toward convergence. This is extremely advantageous when learning from very few iterations (for example, when learning using stochastic gradient descent based optimization) and, therefore, is a desirable property for scalable learning.
A similar trend can be seen in Figure 16 for DBL$^3$ and LR$^3$.
Finally, let us present some comparative results on the speed of convergence of DBL$^n$ vs. LR$^n$ as we increase $n$. In Figure 17, we compare the convergence of the two parameterizations at increasing $n$ on the sample Localization dataset. It can be seen that the improvement that DBL$^n$ provides over LR$^n$ grows as we go to deeper structures, i.e., as $n$ becomes larger. Similar behaviour was observed for several datasets and, although studying rates of convergence is a complicated matter and outside the scope of this work, we anticipate this phenomenon to be an interesting avenue of investigation for future work.
6.4 DBL$^n$ vs. Random Forest
The two DBL$^n$ models are compared with Random Forest in terms of W/D/L of 0-1 loss, RMSE, bias and variance in Table 4. On Little datasets, it can be seen that DBL$^n$ has significantly lower bias than RF. The variance of DBL$^n$ is somewhat higher than RF's, though the difference is not significant for either DBL$^2$ or DBL$^3$. The 0-1 loss results of DBL$^n$ and RF are similar. However, RF has better RMSE results than DBL$^n$ on Little datasets. On Big datasets, DBL$^n$ wins on the majority of datasets in terms of both 0-1 loss and RMSE.
Table 4: Win-Draw-Loss comparison of DBL$^n$ and Random Forest.
  DBL$^2$ vs. RF  DBL$^3$ vs. RF  
  W/D/L  $p$  W/D/L  $p$  
Little Datasets  
Bias  51/3/21  0.001  52/2/21  0.001 
Variance  33/3/39  0.556  28/5/42  0.119 
01 Loss  40/3/32  0.409  37/3/35  0.906 
RMSE  26/1/48  0.014  27/1/47  0.026 
Big Datasets  
01 Loss  5/0/3  0.726  6/0/2  0.289 
RMSE  5/0/3  0.726  5/0/3  0.726 
The averaged 0-1 loss and RMSE results are given in Figure 18. It can be seen that DBL$^2$, DBL$^3$ and RF have similar 0-1 loss and RMSE across Little datasets. However, on Big datasets, the lower bias of DBL$^n$ results in much lower error than RF in terms of both 0-1 loss and RMSE. These averaged results corroborate the W/D/L results in Table 4, showing DBL$^n$ to be a less biased model than RF.
A comparison of the training and classification time of DBL$^n$ and RF is given in Figure 19. It can be seen that the DBL$^n$ models are worse than RF in terms of training time but better in terms of classification time.
7 Conclusion and Future Work
We have presented an algorithm for deep broad learning. DBL$^n$ consists of parameters that are learned using both generative and discriminative training. To obtain the generative parameterization for DBL$^n$, we first developed AnJE, a generative counterpart of higher-order logistic regression (LR$^n$). We showed that DBL$^n$ and LR$^n$ learn equivalent models, but that DBL$^n$ is able to exploit the information gained generatively to effectively precondition the optimization process. DBL$^n$ converges in fewer iterations, reaching its global minimum much more rapidly and resulting in faster training. We also compared DBL$^n$ with the equivalent AnJE and AnDE models and showed that DBL$^n$ has lower bias than both. We compared DBL$^n$ with the state-of-the-art classifier Random Forest and showed that DBL$^n$ models are indeed lower biased than RF and on bigger datasets often obtain lower 0-1 loss than RF.
There are a number of exciting new directions for future work.

We have shown that DBL is a low-bias classifier with minimal tuning parameters and the ability to handle multiple classes. The obvious extension is to make it out-of-core. We argue that DBL is well suited to stochastic gradient descent-based methods, as it can converge to the global minimum very quickly.

It may be desirable to utilize a hierarchical DBL, such that DBL_n incorporates all the parameters up to order n. This may be useful for smoothing the parameters: for example, if a certain interaction does not occur in the training data, at classification time one can fall back on lower values of n.

In this work, we have constrained the values of n to two and three. Scaling up to higher values of n is greatly desirable. One can exploit the fact that many interactions at higher values of n will not occur in the data, and hence develop sparse implementations of DBL models.

Exploring other objective functions, such as Mean Squared Error or Hinge Loss, may improve performance and is left as future work.

The preliminary version of DBL that we have developed is restricted to categorical data and hence requires that numeric data be discretized. While our results show that this is often highly competitive with Random Forest using local cut-points, on some datasets it is not. Consequently, there is much scope for investigating deep broad techniques for numeric data.

DBL presents a credible path towards deep broad learning for big data. We have demonstrated very competitive error on big data and expect future refinements to deliver even more efficient and effective outcomes.
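As a rough sketch of the out-of-core direction listed above (plain mini-batch SGD on a binary logistic objective; the function and its setup are illustrative, not the paper's algorithm):

```python
import numpy as np

def sgd_logistic(batches, n_features, lr=0.1):
    """Mini-batch SGD for binary logistic regression: each batch can be
    streamed from disk in turn, so the full dataset never needs to fit
    in memory."""
    w, b = np.zeros(n_features), 0.0
    for X, y in batches:                        # one update per batch
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(y = 1)
        err = p - y
        w -= lr * (X.T @ err) / len(y)
        b -= lr * err.mean()
    return w, b

# toy stream: 10 batches of 100 linearly separable points
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)
batches = [(X[i:i + 100], y[i:i + 100]) for i in range(0, 1000, 100)]
w, b = sgd_logistic(batches * 5, 2)             # 5 streaming passes
acc = np.mean(((X @ w + b) > 0) == (y > 0.5))
```

A good generative starting point, as argued above, would let such a learner reach low error in very few passes over the data.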
8 Code and Datasets
Code with running instructions can be downloaded from https://www.dropbox.com/sh/iw33mgcku9m2quc/AABXwYewVtm0mVE6KoyMPEVFa?dl=0.
9 Acknowledgments
This research has been supported by the Australian Research Council (ARC) under grants DP140100087 and DP120100553, and by the Asian Office of Aerospace Research and Development, Air Force Office of Scientific Research, under contracts FA2386-12-1-4030, FA2386-15-1-4017 and FA2386-15-1-4007.
Appendix A Detailed Results
In this appendix, we compare the 0-1 Loss and RMSE results of DBL, AnDE and RF. The goal is to assess the performance of each model on Big datasets; therefore, results are reported for the 8 big datasets only, in Tables 5 and 6 for 0-1 Loss and RMSE respectively. We also compare with AnJE (note that A1JE is naive Bayes). DBL results are also included; note that DBL_1 is WANBIA-C (Zaidi et al., 2013).
The best results are shown in bold font.
Table 5: 0-1 Loss.

         Localization  Poker-hand  Census-income  Covtype  Kddcup  MITFaceSetA  MITFaceSetB  MITFaceSetC
  A1JE   0.4938        0.4988      0.2354         0.3143   0.0091  0.0116       0.0268       0.0729
  A2JE   0.3653        0.0763      0.2031         0.2546   0.0061  0.0106       0.0239       0.0630
  A3JE   0.2813        0.0763      0.1674         0.1665   0.0053  0.0096       0.0215       0.0550
  A1DE   0.3584        0.4640      0.0986         0.2387   0.0025  0.0124       0.0322       0.0417
  A2DE   0.2844        0.1348      0.0682         0.1552   0.0023  0.0105       0.0325       0.0339
  DBL_1  0.4586        0.4988      0.0433         0.2576   0.0017  0.0012       0.0047       0.0244
  DBL_2  0.3236        0.0021      0.0686         0.1381   0.0014  0.0002       0.0007       0.0007
  DBL_3  0.2974        0.0056      0.0557         0.0797   0.0013  0.0001       0.0005       0.0005
  RF     0.2976        0.0687      0.0494         0.0669   0.0015  0.0012       0.0022       0.0013
Table 6: RMSE.

         Localization  Poker-hand  Census-income  Covtype  Kddcup  MITFaceSetA  MITFaceSetB  MITFaceSetC
  A1JE   0.2386        0.2382      0.4599         0.2511   0.0204  0.1053       0.1607       0.2643
  A2JE   0.2115        0.1924      0.4231         0.2256   0.0170  0.1006       0.1516       0.2455
  A3JE   0.1972        0.1721      0.3812         0.1857   0.0160  0.0954       0.1436       0.2293
  A1DE   0.2090        0.2217      0.2780         0.2174   0.0103  0.1079       0.1746       0.1989
  A2DE   0.1890        0.2044      0.2269         0.1779   0.0098  0.0983       0.1745       0.1530
  DBL_1  0.2330        0.2382      0.1807         0.2254   0.0072  0.0347       0.0602       0.1360
  DBL_2  0.2179        0.1970      0.250          0.1802   0.0068  0.0123       0.0248       0.0257
  DBL_3  0.2273        0.0323      0.2332         0.1494   0.0065  0.0105       0.0198       0.0241
  RF     0.1939        0.1479      0.1928         0.1336   0.0072  0.0296       0.0484       0.0651
References
 Boyd and Vandenberghe (2008) S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2008.
 Brain and Webb (2002) Damien Brain and Geoffrey I. Webb. The need for low bias algorithms in classification learning from small data sets. In PKDD, pages 62–73, 2002.
 Breiman (2001) L Breiman. Random forests. Machine Learning, 45:5–32, 2001.
 Byrd et al. (1995) R. H. Byrd, P. Lu, and J. Nocedal. A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific and Statistical Computing, 16(5):1190–1208, 1995.
 Fayyad and Irani (1992) Usama M. Fayyad and Keki B. Irani. On the handling of continuous-valued attributes in decision tree generation. Machine Learning, 8(1):87–102, 1992.
 Frank and Asuncion (2010) A Frank and A Asuncion. UCI machine learning repository, 2010. URL http://archive.ics.uci.edu/ml.
 Ganz and Reinsel (2012) J. Ganz and D. Reinsel. The Digital Universe Study, 2012.
 Greiner and Zhou (2002) R. Greiner and W. Zhou. Structural extension to logistic regression: Discriminative parameter learning of belief net classifiers. In AAAI, 2002.
 Greiner et al. (2004) Russell Greiner, Wei Zhou, Xiaoyuan Su, and Bin Shen. Structural extensions to logistic regression: Discriminative parameter learning of belief net classifiers. Journal of Machine Learning Research, 2004.
 Langford et al. (2007) J. Langford, L. Li, and A. Strehl. Vowpal wabbit online learning project, 2007.
 Martinez et al. (2015) S. Martinez, A. Chen, G. I. Webb, and N. A. Zaidi. Scalable learning of bayesian network classifiers. Journal of Machine Learning Research, 2015.
 Pernkopf and Bilmes (2005) F. Pernkopf and J. Bilmes. Discriminative versus generative parameter and structure learning of bayesian network classifiers. In ICML, 2005.
 Pernkopf and Wohlmayr (2009) F. Pernkopf and M. Wohlmayr. On discriminative parameter learning of bayesian network classifiers. In ECML PKDD, 2009.
 Roos et al. (2005) T Roos, H Wettig, P Grünwald, P Myllymäki, and H Tirri. On discriminative Bayesian network classifiers and logistic regression. Machine Learning, 59(3):267–296, 2005.
 Sahami (1996) M Sahami. Learning limited dependence Bayesian classifiers. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pages 334–338, Menlo Park, CA, 1996. AAAI Press.
 Su et al. (2008) J. Su, H. Zhang, C. Ling, and S. Matwin. Discriminative parameter learning for bayesian networks. In ICML, 2008.
 Webb et al. (2005) G. I. Webb, J. Boughton, and Z. Wang. Not so naive Bayes: Averaged one-dependence estimators. Machine Learning, 58(1):5–24, 2005.
 Webb et al. (2011) Geoffrey I. Webb, Janice Boughton, Fei Zheng, Kai Ming Ting, and Houssam Salem. Learning by extrapolation from marginal to full-multivariate probability distributions: decreasingly naive Bayesian classification. Machine Learning, pages 1–40, 2011. ISSN 0885-6125. URL http://dx.doi.org/10.1007/s10994-011-5263-6.
 Zaidi and Webb (2012) N. A. Zaidi and G. I. Webb. Fast and efficient single pass Bayesian learning. In Proceedings of the 17th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), 2012.
 Zaidi et al. (2013) N. A. Zaidi, J. Cerquides, M. J Carman, and G. I. Webb. Alleviating naive Bayes attribute independence assumption by attribute weighting. Journal of Machine Learning Research, 14:1947–1988, 2013.
 Zaidi et al. (2014) N. A. Zaidi, M. J. Carman, J. Cerquides, and G. I. Webb. Naive-Bayes inspired effective preconditioners for speeding-up logistic regression. In IEEE International Conference on Data Mining, 2014.
 Zhu et al. (1997) C. Zhu, R. H. Byrd, and J. Nocedal. LBFGSB, fortran routines for large scale bound constrained optimization. ACM Transactions on Mathematical Software, 23(4):550–560, 1997.