# Dichotomize and Generalize: PAC-Bayesian Binary Activated Deep Neural Networks

Gaël Letarte
Université Laval
gael.letarte.1@ulaval.ca
Pascal Germain
Inria
France
pascal.germain@inria.fr
Benjamin Guedj
Inria and University College London
France and United Kingdom
benjamin.guedj@inria.fr
François Laviolette
Université Laval
francois.laviolette@ift.ulaval.ca
###### Abstract

We present a comprehensive study of multilayer neural networks with binary activation, relying on the PAC-Bayesian theory. Our contributions are twofold: (i) we develop an end-to-end framework to train a binary activated deep neural network, (ii) we provide nonvacuous PAC-Bayesian generalization bounds for binary activated deep neural networks. Our results are obtained by minimizing the expected loss of an architecture-dependent aggregation of binary activated deep neural networks. Our analysis inherently overcomes the fact that binary activation function is non-differentiable. The performance of our approach is assessed on a thorough numerical experiment protocol on real-life datasets.

## 1 Introduction

The remarkable practical successes of deep learning make the need for better theoretical understanding all the more pressing. The PAC-Bayesian theory has recently emerged as a fruitful framework to analyze the generalization abilities of deep neural networks. Inspired by the precursor work of Langford and Caruana (2001), nonvacuous risk bounds for multilayer architectures have been obtained by Dziugaite and Roy (2017); Zhou et al. (2019). Although informative, these results do not explicitly take into account the network architecture (number of layers, neurons per layer, type of activation function). A notable exception is the work of Neyshabur et al. (2018), which provides a PAC-Bayesian analysis relying on the network architecture and the choice of ReLU activation function. The latter bound arguably gives insights on the generalization mechanism of neural networks (namely in terms of the spectral norms of the learned weight matrices), but it holds only under some margin assumptions, and it is likely to be numerically vacuous.

We focus our study on deep neural networks with a sign activation function. We call such networks binary activated multilayer (BAM) networks. This specialization leads to nonvacuous generalization bounds which hold under the sole assumption that training samples are i.i.d. We provide a PAC-Bayesian bound holding on the generalization error of a continuous aggregation of BAM networks. This leads to an original approach to train BAM networks, named PBGNet. The building block of PBGNet arises from the specialization of PAC-Bayesian bounds to linear classifiers (Germain et al., 2009), which we adapt to deep neural networks. The term binary neural networks has been coined by Bengio (2009), and further studied in Hubara et al. (2016); Soudry et al. (2014); Hubara et al. (2017): it refers to neural networks for which both the activation functions and the weights are binarized (in contrast with BAM networks, where only the activations are binary). These architectures are motivated by the desire to reduce the computation and memory footprints of neural networks.

Our theory-driven approach is validated on real-life datasets, showing competitive accuracy with $\tanh$-activated multilayer networks, and providing nonvacuous generalization bounds.

Organisation of the paper. We formalize our framework and notation in Section 2, along with a presentation of the PAC-Bayes framework and its specialization to linear classifiers. Section 3 illustrates the key ideas we develop in the present paper, on the simple case of a two-layer neural network. This is then generalized to deep neural networks in Section 4. We present our main theoretical result in Section 5: a PAC-Bayesian generalization bound for binary activated deep neural networks, and the associated learning algorithm. Section 6 presents the numerical experiment protocol and results. The paper closes with avenues for future work in Section 7.

## 2 Framework and notation

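We stand in the supervised binary classification setting: given a real input vector $\mathbf{x}\in\mathbb{R}^{d_0}$ (bold uppercase letters denote matrices, bold lowercase letters denote vectors), one wants to predict a label $y\in\{-1,1\}$. Let us consider a neural network of $L$ fully connected layers with a (binary) sign activation function: $\mathrm{sgn}(a)=1$ if $a>0$ and $\mathrm{sgn}(a)=-1$ otherwise (we consider the activation function as an element-wise operator when applied to vectors or matrices). We let $d_k$ denote the number of neurons of the $k$th layer, for $k\in\{1,\dots,L\}$; $d_0$ is the input data point dimension, and $D:=\sum_{k=1}^{L}d_{k-1}d_k$ is the total number of parameters. The output of the (deterministic) BAM network on an input data point $\mathbf{x}$ is given by

$$f_\theta(\mathbf{x})=\mathrm{sgn}\Big(\mathbf{W}_L\,\mathrm{sgn}\big(\mathbf{W}_{L-1}\,\mathrm{sgn}\big(\dots\mathrm{sgn}(\mathbf{W}_1\mathbf{x})\big)\big)\Big), \tag{1}$$

where $\mathbf{W}_k\in\mathbb{R}^{d_k\times d_{k-1}}$ denotes the weight matrices. The network is thus parametrized by $\theta=\mathrm{vec}\big(\{\mathbf{W}_k\}_{k=1}^{L}\big)\in\mathbb{R}^{D}$. The $i$th line of matrix $\mathbf{W}_k$ will be denoted $\mathbf{w}_k^i$. For binary classification, the BAM network final layer $\mathbf{W}_L$ has one line ($d_L=1$), that is a vector $\mathbf{w}_L\in\mathbb{R}^{d_{L-1}}$, and $f_\theta:\mathbb{R}^{d_0}\to\{-1,1\}$. We study the classification accuracy under the linear loss $\ell(y',y):=\frac12(1-yy')$.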
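To make Equation (1) concrete, here is a minimal sketch of the deterministic BAM forward pass in plain Python. The toy weights and the helper names (`sgn`, `bam_forward`, `linear_loss`) are illustrative assumptions, and the linear loss is taken to be $\frac12(1-yy')$.

```python
def sgn(a):
    """Sign activation used in the paper: +1 if a > 0, and -1 otherwise."""
    return 1.0 if a > 0 else -1.0


def bam_forward(weights, x):
    """Evaluate Equation (1).

    `weights` is a list [W_1, ..., W_L] of matrices stored as lists of rows;
    the last matrix has a single row, so the output is a scalar in {-1, +1}.
    """
    h = list(x)
    for W in weights:
        h = [sgn(sum(w_ij * h_j for w_ij, h_j in zip(row, h))) for row in W]
    return h[0]


def linear_loss(y_pred, y):
    """Linear loss 0.5 * (1 - y * y_pred), assumed here for labels in {-1, +1}."""
    return 0.5 * (1.0 - y * y_pred)


# Hypothetical toy network: d_0 = 2 inputs, d_1 = 3 hidden neurons, d_2 = 1 output.
W1 = [[1.0, -1.0], [0.5, 2.0], [-1.0, 0.0]]
W2 = [[1.0, 1.0, -1.0]]
x = [2.0, 1.0]

y_hat = bam_forward([W1, W2], x)  # hidden activations are (+1, +1, -1)
print(y_hat, linear_loss(y_hat, 1.0), linear_loss(y_hat, -1.0))
```

For a deterministic network the linear loss takes only the values 0 and 1, so it coincides with the zero-one loss; it becomes a genuinely continuous quantity once predictors are aggregated.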

### 2.1 Elements from the PAC-Bayesian theory

The Probably Approximately Correct (PAC) framework (introduced by Valiant, 1984) holds under the frequentist assumption that data is sampled in an i.i.d. fashion from a data distribution $\mathcal{D}$ over the input-output space. The learning algorithm observes a finite training sample $S=\{(\mathbf{x}_i,y_i)\}_{i=1}^{n}\sim\mathcal{D}^n$ and outputs a predictor $f:\mathbb{R}^{d_0}\to[-1,1]$. We define $\mathcal{L}_\mathcal{D}(f)$ as the generalization loss on the data generating distribution $\mathcal{D}$, and $\hat{\mathcal{L}}_S(f)$ as the empirical error on the training set, given by

$$\mathcal{L}_\mathcal{D}(f)=\mathbf{E}_{(\mathbf{x},y)\sim\mathcal{D}}\,\ell(f(\mathbf{x}),y),\quad\text{and}\quad\hat{\mathcal{L}}_S(f)=\frac1n\sum_{i=1}^{n}\ell(f(\mathbf{x}_i),y_i).$$

PAC-Bayes considers the expected loss of an aggregation of predictors: considering a distribution $Q$ (called the posterior) over a family of binary classifiers $\mathcal{F}$, one obtains PAC upper bounds on $\mathbf{E}_{f\sim Q}\mathcal{L}_\mathcal{D}(f)$. Under the linear loss, this is equivalent to the loss of the predictor $F_Q(\mathbf{x}):=\mathbf{E}_{f\sim Q}f(\mathbf{x})$, performing a $Q$-aggregation of all classifiers in $\mathcal{F}$. In other words, we may upper bound with an arbitrarily high probability the generalization loss $\mathcal{L}_\mathcal{D}(F_Q)=\mathbf{E}_{f\sim Q}\mathcal{L}_\mathcal{D}(f)$ by its empirical counterpart $\hat{\mathcal{L}}_S(F_Q)=\mathbf{E}_{f\sim Q}\hat{\mathcal{L}}_S(f)$ and a complexity term, the Kullback-Leibler divergence between $Q$ and a reference measure $P$ (called the prior distribution) chosen independently of the training set $S$, given by $\mathrm{KL}(Q\|P):=\int\ln\frac{Q(\theta)}{P(\theta)}\,Q(d\theta)$. Since the seminal works of Shawe-Taylor and Williamson (1997), McAllester (1999, 2003) and Catoni (2004, 2003, 2007), the celebrated PAC-Bayesian theorem has been derived in many forms (see Guedj, 2019, for a survey). The following Theorems 1 and 2 will be useful in the sequel.

###### Theorem 1 (Seeger (2002)).

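Given a prior $P$ on $\mathcal{F}$, with probability at least $1-\delta$ over $S\sim\mathcal{D}^n$,

$$\text{for all } Q \text{ on } \mathcal{F}:\quad \mathrm{kl}\Big(\hat{\mathcal{L}}_S(F_Q)\,\Big\|\,\mathcal{L}_\mathcal{D}(F_Q)\Big)\;\le\;\frac{\mathrm{KL}(Q\|P)+\ln\frac{2\sqrt{n}}{\delta}}{n}, \tag{2}$$

where $\mathrm{kl}(q\|p):=q\ln\frac{q}{p}+(1-q)\ln\frac{1-q}{1-p}$ is the Kullback-Leibler divergence between Bernoulli distributions with probability of success $q$ and $p$, respectively.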
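Theorem 1 bounds the generalization loss only implicitly, through the Bernoulli divergence $\mathrm{kl}(q\|p)=q\ln\frac qp+(1-q)\ln\frac{1-q}{1-p}$. A bound value is obtained by inverting this divergence numerically. The sketch below does so by bisection; the function names and the numbers in the usage lines are illustrative assumptions, not values from the paper.

```python
import math


def kl_bernoulli(q, p):
    """KL divergence between Bernoulli(q) and Bernoulli(p)."""
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)  # keep the logarithms finite
    out = 0.0
    if q > 0.0:
        out += q * math.log(q / p)
    if q < 1.0:
        out += (1.0 - q) * math.log((1.0 - q) / (1.0 - p))
    return out


def seeger_bound(emp_loss, kl_qp, n, delta):
    """Largest p such that kl(emp_loss || p) <= (KL + ln(2 sqrt(n) / delta)) / n."""
    rhs = (kl_qp + math.log(2.0 * math.sqrt(n) / delta)) / n
    lo, hi = emp_loss, 1.0
    for _ in range(100):  # kl(q || p) is increasing in p on [q, 1)
        mid = 0.5 * (lo + hi)
        if kl_bernoulli(emp_loss, mid) <= rhs:
            lo = mid
        else:
            hi = mid
    return lo


# Hypothetical values: empirical loss 0.1, KL(Q||P) = 5, n = 1000, delta = 0.05.
print(seeger_bound(0.1, 5.0, 1000, 0.05))
```

A larger KL term or a smaller sample size widens the interval of admissible $p$ and therefore loosens the bound.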

###### Theorem 2 (Catoni (2007)).

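Given a prior $P$ on $\mathcal{F}$ and $C>0$, with probability at least $1-\delta$ over $S\sim\mathcal{D}^n$,

$$\text{for all } Q \text{ on } \mathcal{F}:\quad \mathcal{L}_\mathcal{D}(F_Q)\;\le\;\frac{1}{1-e^{-C}}\left(1-\exp\left(-C\,\hat{\mathcal{L}}_S(F_Q)-\frac{\mathrm{KL}(Q\|P)+\ln\frac1\delta}{n}\right)\right). \tag{3}$$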

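From Theorems 1 and 2, we obtain PAC-Bayesian bounds on the linear loss of the $Q$-aggregated predictor $F_Q$. The obtained bounds can be turned into bounds on the zero-one loss $\ell_{01}(y',y):=\mathbb{1}[\mathrm{sgn}(y')\neq y]$ with an extra multiplicative factor of 2 (using the elementary inequality $\ell_{01}(F_Q(\mathbf{x}),y)\le2\,\ell(F_Q(\mathbf{x}),y)$).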
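The multiplicative factor relating the two losses is 2, and the underlying inequality can be checked in two cases. The short derivation below assumes the zero-one loss $\ell_{01}(y',y)=\mathbb{1}[\mathrm{sgn}(y')\neq y]$ and the linear loss $\ell(y',y)=\frac12(1-yy')$, with $y'\in[-1,1]$ and $y\in\{-1,1\}$.

```latex
\begin{align*}
\mathrm{sgn}(y') \neq y
  \;&\Longrightarrow\; y\,y' \le 0
  \;\Longrightarrow\; \ell(y',y) = \tfrac12\,(1 - y\,y') \ge \tfrac12
  \;\Longrightarrow\; 2\,\ell(y',y) \ge 1 = \ell_{01}(y',y), \\
\mathrm{sgn}(y') = y
  \;&\Longrightarrow\; \ell_{01}(y',y) = 0 \le 2\,\ell(y',y)
  \quad\text{since } \ell(y',y) \ge 0 \text{ whenever } |y'| \le 1 .
\end{align*}
```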

### 2.2 Elementary building block: PAC-Bayesian learning of linear classifiers

The PAC-Bayesian specialization to linear classifiers has been proposed by Langford and Shawe-Taylor (2002), and used for providing tight generalization bounds and a model selection criterion (further studied by Langford, 2005; Ambroladze et al., 2006; Parrado-Hernández et al., 2012). This paved the way to the PAC-Bayesian bound minimization algorithm of Germain et al. (2009), which learns a linear classifier $f_\mathbf{w}(\mathbf{x}):=\mathrm{sgn}(\mathbf{w}\cdot\mathbf{x})$, with $\mathbf{w}\in\mathbb{R}^d$. The strategy is to consider a Gaussian posterior $Q_\mathbf{w}:=\mathcal{N}(\mathbf{w},I_d)$ and a Gaussian prior $P_{\mathbf{w}_0}:=\mathcal{N}(\mathbf{w}_0,I_d)$ over the space of all linear predictors $\mathcal{F}_d:=\{f_\mathbf{v}\mid\mathbf{v}\in\mathbb{R}^d\}$ (where $I_d$ denotes the $d\times d$ identity matrix). The posterior is used to define a linear predictor $f_\mathbf{w}$ and the prior may have been learned on previously seen data; a common uninformative prior is the null vector $\mathbf{w}_0=\mathbf{0}$. With such a parametrization, $\mathrm{KL}(Q_\mathbf{w}\|P_{\mathbf{w}_0})=\frac12\|\mathbf{w}-\mathbf{w}_0\|^2$. Moreover, the $Q_\mathbf{w}$-aggregated output can be written in terms of the Gauss error function $\mathrm{erf}(\cdot)$. In Germain et al. (2009), the erf function is introduced as a loss function to be optimized. Here we interpret it as the predictor output, to be in phase with our neural network approach. Likewise, we study the linear loss of an aggregated predictor instead of the Gibbs risk of a stochastic classifier. We obtain (explicit calculations are provided in Appendix A.1 for completeness)

$$F_\mathbf{w}(\mathbf{x}):=\mathbf{E}_{\mathbf{v}\sim Q_\mathbf{w}}f_\mathbf{v}(\mathbf{x})=\mathrm{erf}\left(\frac{\mathbf{w}\cdot\mathbf{x}}{\sqrt2\,\|\mathbf{x}\|}\right),\quad\text{with}\quad\mathrm{erf}(x):=\frac{2}{\sqrt\pi}\int_0^x e^{-t^2}dt. \tag{4}$$
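Equation (4) can be sanity-checked by simulation: sample linear classifiers from the Gaussian posterior, average their signed outputs, and compare with the closed form. The sketch below uses the $\sqrt2\,\|\mathbf{x}\|$ normalization obtained in the derivation of Appendix A.1; the vectors `w` and `x` and the function names are illustrative assumptions.

```python
import math
import random


def aggregated_output(w, x):
    """Closed form of Equation (4): erf(w.x / (sqrt(2) * ||x||))."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    return math.erf(dot / (math.sqrt(2.0) * norm_x))


def monte_carlo_output(w, x, n_samples=100_000, seed=0):
    """Average of sgn(v.x) over v ~ N(w, I), estimated by sampling."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        v_dot_x = sum(rng.gauss(wi, 1.0) * xi for wi, xi in zip(w, x))
        total += 1.0 if v_dot_x > 0 else -1.0
    return total / n_samples


w = [0.5, -0.2, 0.3]   # hypothetical posterior mean
x = [1.0, 2.0, -1.0]   # hypothetical input
print(aggregated_output(w, x), monte_carlo_output(w, x))
```

The two values should agree up to Monte Carlo noise, which shrinks as the inverse square root of the number of samples.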

Given a training set $S\sim\mathcal{D}^n$, Germain et al. (2009) propose to minimize a PAC-Bayes upper bound on $\mathcal{L}_\mathcal{D}(F_\mathbf{w})$ by gradient descent on $\mathbf{w}$. This approach is appealing as the bounds are valid uniformly for all $Q_\mathbf{w}$ (see Equations 2 and 3). In other words, the algorithm provides both a learned predictor and a generalization guarantee that is rigorously valid (under the i.i.d. assumption) even when the optimization procedure did not find the global minimum of the cost function (either because it converges to a local minimum, or because early stopping is used). Germain et al. (2009) investigate the optimization of several versions of Theorems 1 and 2. The minimization of Theorem 1 generally leads to tighter bound values, but empirical studies show lower accuracy, as the procedure conservatively prevents overfitting. The best empirical results are obtained by minimizing Theorem 2 for a fixed hyperparameter $C$, selected by cross-validation. Minimizing Equation (3) amounts to minimizing

$$C\,n\,\hat{\mathcal{L}}_S(F_\mathbf{w})+\mathrm{KL}(Q_\mathbf{w}\|P_{\mathbf{w}_0})\;=\;C\,\frac12\sum_{i=1}^{n}\mathrm{erf}\left(-y_i\frac{\mathbf{w}\cdot\mathbf{x}_i}{\sqrt2\,\|\mathbf{x}_i\|}\right)+\frac12\|\mathbf{w}-\mathbf{w}_0\|^2, \tag{5}$$

where the equality holds up to the additive constant $Cn/2$, which does not depend on $\mathbf{w}$ and thus leaves the minimizer unchanged.
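The objective of Equation (5) is cheap to evaluate for a linear classifier. The following sketch computes the left-hand side directly, that is $C$ times the summed linear loss of $F_\mathbf{w}$ plus the Gaussian KL term, using the $\sqrt2\,\|\mathbf{x}\|$ normalization derived in Appendix A.1. The function name and the toy data are illustrative assumptions.

```python
import math


def pbgd_objective(w, w0, X, y, C):
    """C * n * (empirical linear loss of F_w) + KL(Q_w || P_w0)."""
    summed_loss = 0.0
    for x_i, y_i in zip(X, y):
        dot = sum(a * b for a, b in zip(w, x_i))
        norm = math.sqrt(sum(b * b for b in x_i))
        F = math.erf(dot / (math.sqrt(2.0) * norm))   # aggregated output, Eq. (4)
        summed_loss += 0.5 * (1.0 - y_i * F)          # linear loss of one example
    kl = 0.5 * sum((a - b) ** 2 for a, b in zip(w, w0))
    return C * summed_loss + kl


# Hypothetical toy data: two examples with opposite labels.
X = [[1.0, 0.0], [0.0, 2.0]]
y = [1.0, -1.0]
w0 = [0.0, 0.0]

print(pbgd_objective(w0, w0, X, y, C=3.0))            # prior itself: loss 1/2 per example
print(pbgd_objective([2.0, -2.0], w0, X, y, C=100.0))  # a fitted w pays a KL price
```

The hyperparameter $C$ sets the exchange rate between fitting the data and staying close to the prior: with a small $C$ the prior mean is hard to beat, with a large $C$ the data term dominates.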

In their discussion, Germain et al. (2009) observe that the objective in Equation (5) is similar to the one optimized by the soft-margin Support Vector Machines (Cortes and Vapnik, 1995), by roughly interpreting the hinge loss $\max(0,1-yy')$ as a convex surrogate of the probit loss $\mathrm{erf}(-yy')$. Likewise, Langford and Shawe-Taylor (2002) present this parameterization of the PAC-Bayes theorem as a margin bound. In the following, we develop an original approach to neural networks based on a slightly different observation: the predictor output given by Equation (4) is reminiscent of the $\tanh$ activation used in classical neural networks (see Figure 3 in the appendix for a visual comparison). Therefore, as the linear perceptron is viewed as the building block of modern multilayer neural networks, the PAC-Bayesian specialization to binary classifiers is the cornerstone of our theoretical and algorithmic framework for BAM networks.

## 3 The simple case of a one hidden layer network

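Let us first consider a network with one hidden layer of size $d_1$. Hence, this network is parameterized by weights $\theta=\mathrm{vec}(\{\mathbf{W}_1,\mathbf{w}_2\})$, with $\mathbf{W}_1\in\mathbb{R}^{d_1\times d_0}$ and $\mathbf{w}_2\in\mathbb{R}^{d_1}$. Given an input $\mathbf{x}\in\mathbb{R}^{d_0}$, the output of the network is

$$f_\theta(\mathbf{x})=\mathrm{sgn}\big(\mathbf{w}_2\cdot\mathrm{sgn}(\mathbf{W}_1\mathbf{x})\big). \tag{6}$$

Following Section 2, we consider an isotropic Gaussian posterior distribution centered in $\theta$, denoted $Q_\theta=\mathcal{N}(\theta,I_D)$, over the family of all networks $\mathcal{F}_D=\{f_{\tilde\theta}\mid\tilde\theta\in\mathbb{R}^D\}$. Thus, the prediction of the $Q_\theta$-aggregate predictor is given by $F_\theta(\mathbf{x})=\mathbf{E}_{\tilde\theta\sim Q_\theta}f_{\tilde\theta}(\mathbf{x})$. Note that Langford and Caruana (2001); Dziugaite and Roy (2017) also consider Gaussian distributions over neural network parameters. However, as their analysis is not specific to a particular activation function (experiments are performed with typical activation functions such as sigmoid and ReLU), the prediction relies on sampling the parameters according to the posterior. An originality of our approach is that, by studying the sign activation function, we can calculate the exact form of $F_\theta(\mathbf{x})$, as detailed below.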

### 3.1 Deterministic network

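Prediction. To compute the value of $F_\theta(\mathbf{x})$, we first need to decompose the probability of each $\tilde\theta=\mathrm{vec}(\{\mathbf{V}_1,\mathbf{v}_2\})\sim Q_\theta$ as $Q_\theta(\tilde\theta)=Q_1(\mathbf{V}_1)\,Q_2(\mathbf{v}_2)$, with $Q_1=\mathcal{N}(\mathbf{W}_1,I_{d_0d_1})$ and $Q_2=\mathcal{N}(\mathbf{w}_2,I_{d_1})$.

$$F_\theta(\mathbf{x})=\int_{\mathbb{R}^{d_1\times d_0}}Q_1(\mathbf{V}_1)\int_{\mathbb{R}^{d_1}}Q_2(\mathbf{v}_2)\,\mathrm{sgn}\big(\mathbf{v}_2\cdot\mathrm{sgn}(\mathbf{V}_1\mathbf{x})\big)\,d\mathbf{v}_2\,d\mathbf{V}_1$$

$$=\int_{\mathbb{R}^{d_1\times d_0}}Q_1(\mathbf{V}_1)\,\mathrm{erf}\left(\frac{\mathbf{w}_2\cdot\mathrm{sgn}(\mathbf{V}_1\mathbf{x})}{\sqrt2\,\|\mathrm{sgn}(\mathbf{V}_1\mathbf{x})\|}\right)d\mathbf{V}_1 \tag{7}$$

$$=\sum_{\mathbf{s}\in\{-1,1\}^{d_1}}\mathrm{erf}\left(\frac{\mathbf{w}_2\cdot\mathbf{s}}{\sqrt{2d_1}}\right)\int_{\mathbb{R}^{d_1\times d_0}}\mathbb{1}\big[\mathbf{s}=\mathrm{sgn}(\mathbf{V}_1\mathbf{x})\big]\,Q_1(\mathbf{V}_1)\,d\mathbf{V}_1 \tag{8}$$

$$=\sum_{\mathbf{s}\in\{-1,1\}^{d_1}}\mathrm{erf}\left(\frac{\mathbf{w}_2\cdot\mathbf{s}}{\sqrt{2d_1}}\right)\Psi_\mathbf{s}(\mathbf{x},\mathbf{W}_1), \tag{9}$$

where, from $Q_1(\mathbf{V}_1)=\prod_{i=1}^{d_1}Q_1^i(\mathbf{v}_1^i)$ with $Q_1^i:=\mathcal{N}(\mathbf{w}_1^i,I_{d_0})$, we obtain

$$\Psi_\mathbf{s}(\mathbf{x},\mathbf{W}_1):=\prod_{i=1}^{d_1}\int_{\mathbb{R}^{d_0}}\mathbb{1}\big[s_i\,\mathbf{x}\cdot\mathbf{v}_1^i>0\big]\,Q_1^i(\mathbf{v}_1^i)\,d\mathbf{v}_1^i=\prod_{i=1}^{d_1}\underbrace{\left[\frac12+\frac{s_i}{2}\,\mathrm{erf}\left(\frac{\mathbf{w}_1^i\cdot\mathbf{x}}{\sqrt2\,\|\mathbf{x}\|}\right)\right]}_{\psi_{s_i}(\mathbf{x},\mathbf{w}_1^i)}. \tag{10}$$

Line (7) states that the output neuron is a linear predictor over the hidden layer's activation values $\mathbf{s}=\mathrm{sgn}(\mathbf{V}_1\mathbf{x})$; based on Equation (4), the integral on $\mathbf{v}_2$ becomes $\mathrm{erf}\big(\mathbf{w}_2\cdot\mathbf{s}/(\sqrt2\,\|\mathbf{s}\|)\big)$. As a function of $\mathbf{V}_1$, the latter expression is piecewise constant. Thus, Line (8) discretizes the integral on $\mathbf{V}_1$ as a sum over the $2^{d_1}$ different values of $\mathbf{s}=(s_i)_{i=1}^{d_1}$, with $s_i\in\{-1,1\}$. Note that $\|\mathbf{s}\|^2=\mathbf{s}\cdot\mathbf{s}=d_1$.

Finally, one can compute the exact output of $F_\theta(\mathbf{x})$, provided one accepts to compute a sum combinatorial in the number of hidden neurons (Equation 9). We show in the forthcoming Section 3.2 that it is possible to circumvent this computational burden and approximate $F_\theta(\mathbf{x})$ by a sampling procedure.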
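For a small hidden layer, the combinatorial sum of Equation (9) can be evaluated by brute force. The sketch below enumerates all $2^{d_1}$ binary representations, weights each by its probability $\Psi_\mathbf{s}$ from Equation (10), and aggregates the output neuron's votes. The helper names and the toy weights are illustrative assumptions.

```python
import itertools
import math


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def psi(s_i, x, w):
    """psi_{s_i}(x, w) = 1/2 + (s_i / 2) * erf(w.x / (sqrt(2) * ||x||))."""
    norm_x = math.sqrt(dot(x, x))
    return 0.5 + 0.5 * s_i * math.erf(dot(w, x) / (math.sqrt(2.0) * norm_x))


def representation_probabilities(W1, x):
    """Map every s in {-1, +1}^{d_1} to Psi_s(x, W1), as in Equation (10)."""
    d1 = len(W1)
    return {s: math.prod(psi(s_i, x, w) for s_i, w in zip(s, W1))
            for s in itertools.product((-1.0, 1.0), repeat=d1)}


def exact_output(W1, w2, x):
    """Equation (9): sum over all 2^{d_1} hidden representations."""
    d1 = len(W1)
    return sum(math.erf(dot(w2, s) / math.sqrt(2 * d1)) * p
               for s, p in representation_probabilities(W1, x).items())


# Hypothetical toy parameters with d_0 = 2 and d_1 = 3.
W1 = [[0.3, -0.5], [1.0, 0.2], [-0.4, 0.7]]
w2 = [0.8, -1.2, 0.5]
x = [1.0, 2.0]

print(sum(representation_probabilities(W1, x).values()))  # probabilities add up to 1
print(exact_output(W1, w2, x))                            # a value in [-1, 1]
```

The cost grows as $2^{d_1}$, so this enumeration is only practical for a handful of hidden neurons, which motivates a sampling alternative.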

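Derivatives. Following contemporary approaches in deep neural networks (Goodfellow et al., 2016), we minimize the empirical loss $\hat{\mathcal{L}}_S(F_\theta)$ by stochastic gradient descent (SGD). This requires computing the partial derivative of the cost function with respect to the parameters $\theta$:

$$\frac{\partial\hat{\mathcal{L}}_S(F_\theta)}{\partial\theta}=\frac1n\sum_{i=1}^{n}\frac{\partial\ell(F_\theta(\mathbf{x}_i),y_i)}{\partial\theta}=\frac1n\sum_{i=1}^{n}\frac{\partial F_\theta(\mathbf{x}_i)}{\partial\theta}\,\ell'(F_\theta(\mathbf{x}_i),y_i), \tag{11}$$

with $\ell'(F_\theta(\mathbf{x}),y)=-\frac12y$ the derivative of the linear loss with respect to its first argument.

The partial derivatives of the prediction function (Equation 9) with respect to the hidden layer parameters $\mathbf{w}_1^k\in\{\mathbf{w}_1^1,\dots,\mathbf{w}_1^{d_1}\}$ and the output neuron parameters $\mathbf{w}_2$ are

$$\frac{\partial}{\partial\mathbf{w}_1^k}F_\theta(\mathbf{x})=\frac{\mathbf{x}}{2^{\frac32}\|\mathbf{x}\|}\,\mathrm{erf}'\left(\frac{\mathbf{w}_1^k\cdot\mathbf{x}}{\sqrt2\,\|\mathbf{x}\|}\right)\sum_{\mathbf{s}\in\{-1,1\}^{d_1}}s_k\,\mathrm{erf}\left(\frac{\mathbf{w}_2\cdot\mathbf{s}}{\sqrt{2d_1}}\right)\left[\frac{\Psi_\mathbf{s}(\mathbf{x},\mathbf{W}_1)}{\psi_{s_k}(\mathbf{x},\mathbf{w}_1^k)}\right], \tag{12}$$

$$\frac{\partial}{\partial\mathbf{w}_2}F_\theta(\mathbf{x})=\frac{1}{\sqrt{2d_1}}\sum_{\mathbf{s}\in\{-1,1\}^{d_1}}\mathbf{s}\,\mathrm{erf}'\left(\frac{\mathbf{w}_2\cdot\mathbf{s}}{\sqrt{2d_1}}\right)\Psi_\mathbf{s}(\mathbf{x},\mathbf{W}_1),\quad\text{with}\quad\mathrm{erf}'(x):=\frac{2}{\sqrt\pi}e^{-x^2}. \tag{13}$$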

Note that this is an exact computation. A salient fact is that even though we work on non-differentiable BAM networks, we get a structure trainable by (stochastic) gradient descent by aggregating networks.
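The closed-form derivatives can be validated against central finite differences of the exact output. The sketch below implements the two gradient formulas (with respect to one hidden row $\mathbf{w}_1^k$ and with respect to $\mathbf{w}_2$) by enumeration; all function names and toy values are illustrative assumptions.

```python
import itertools
import math

SQRT2 = math.sqrt(2.0)


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def erf_prime(a):
    """Derivative of erf: 2 / sqrt(pi) * exp(-a^2)."""
    return 2.0 / math.sqrt(math.pi) * math.exp(-a * a)


def psi(s_i, x, w):
    return 0.5 + 0.5 * s_i * math.erf(dot(w, x) / (SQRT2 * math.sqrt(dot(x, x))))


def big_psi(s, x, W1):
    return math.prod(psi(s_i, x, w) for s_i, w in zip(s, W1))


def exact_output(W1, w2, x):
    d1 = len(W1)
    return sum(math.erf(dot(w2, s) / math.sqrt(2 * d1)) * big_psi(s, x, W1)
               for s in itertools.product((-1.0, 1.0), repeat=d1))


def grad_w2(W1, w2, x):
    """Gradient with respect to the output weights (Equation 13)."""
    d1 = len(W1)
    grad = [0.0] * d1
    for s in itertools.product((-1.0, 1.0), repeat=d1):
        coef = erf_prime(dot(w2, s) / math.sqrt(2 * d1)) * big_psi(s, x, W1)
        for j in range(d1):
            grad[j] += s[j] * coef / math.sqrt(2 * d1)
    return grad


def grad_w1k(W1, w2, x, k):
    """Gradient with respect to the k-th hidden row (Equation 12)."""
    d1 = len(W1)
    norm_x = math.sqrt(dot(x, x))
    prefactor = erf_prime(dot(W1[k], x) / (SQRT2 * norm_x)) / (2 ** 1.5 * norm_x)
    acc = 0.0
    for s in itertools.product((-1.0, 1.0), repeat=d1):
        vote = math.erf(dot(w2, s) / math.sqrt(2 * d1))
        acc += s[k] * vote * big_psi(s, x, W1) / psi(s[k], x, W1[k])
    return [prefactor * acc * x_i for x_i in x]


def finite_difference(f, v, eps=1e-6):
    """Central finite-difference gradient of the scalar function f at v."""
    grad = []
    for i in range(len(v)):
        up, down = list(v), list(v)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


W1 = [[0.3, -0.5], [1.0, 0.2], [-0.4, 0.7]]   # hypothetical toy parameters
w2 = [0.8, -1.2, 0.5]
x = [1.0, 2.0]
print(grad_w2(W1, w2, x))
print(finite_difference(lambda v: exact_output(W1, v, x), w2))
```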

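Majority vote of learned representations. Note that $\Psi_\mathbf{s}$ (Equation 10) defines a distribution on $\mathbf{s}$. Indeed, $\sum_\mathbf{s}\Psi_\mathbf{s}(\mathbf{x},\mathbf{W}_1)=1$, as $\psi_{1}(\mathbf{x},\mathbf{w}_1^i)+\psi_{-1}(\mathbf{x},\mathbf{w}_1^i)=1$ for every $i\in\{1,\dots,d_1\}$. Thus, by Equation (9) we can interpret $F_\theta$ akin to a majority vote predictor, which performs a convex combination of the linear predictor outputs $F_{\mathbf{w}_2}(\mathbf{s}):=\mathrm{erf}\big(\mathbf{w}_2\cdot\mathbf{s}/\sqrt{2d_1}\big)$. The vote aggregates the predictions on the $2^{d_1}$ possible binary representations. Thus, the algorithm does not learn the representations per se, but rather the weights $\Psi_\mathbf{s}(\mathbf{x},\mathbf{W}_1)$ associated to every $\mathbf{s}$ given an input $\mathbf{x}$, as illustrated by Figure 1.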

### 3.2 Stochastic approximation

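Since $\Psi_\mathbf{s}$ (Equation 10) defines a distribution, we can interpret the function value as the probability of mapping input $\mathbf{x}$ into the hidden representation $\mathbf{s}$ given the parameters $\mathbf{W}_1$. Using a different formalism, we could write $\Pr(\mathbf{s}\mid\mathbf{x},\mathbf{W}_1)=\Psi_\mathbf{s}(\mathbf{x},\mathbf{W}_1)$. This viewpoint suggests a sampling scheme to approximate both the predictor output (Equation 9) and the partial derivatives (Equations 12 and 13), which can be framed as a variant of the REINFORCE algorithm (Williams, 1992) (see the discussion below): we avoid computing the $2^{d_1}$ terms by resorting to a Monte Carlo approximation of the sum. Given an input $\mathbf{x}$ and a sampling size $T$, the procedure goes as follows.

Prediction. We generate $T$ random binary vectors $Z:=\{\mathbf{s}^t\}_{t=1}^{T}$ according to the $\Psi_\mathbf{s}(\mathbf{x},\mathbf{W}_1)$-distribution. This can be done by uniformly sampling $z_i^t\in[0,1]$, and setting $s_i^t=\mathrm{sgn}\big(\psi_1(\mathbf{x},\mathbf{w}_1^i)-z_i^t\big)$.
A stochastic approximation of $F_\theta(\mathbf{x})$ is given by $\hat F_\theta(Z):=\frac1T\sum_{t=1}^{T}\mathrm{erf}\left(\frac{\mathbf{w}_2\cdot\mathbf{s}^t}{\sqrt{2d_1}}\right)$.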
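The sampling step can be sketched in a few lines: each hidden neuron is set to $+1$ with probability $\psi_{1}(\mathbf{x},\mathbf{w}_1^i)$ by thresholding a uniform draw, and the output neuron's erf votes are averaged over the $T$ sampled representations. Function names and toy values below are illustrative assumptions.

```python
import math
import random


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def psi_plus(x, w):
    """Probability that a hidden neuron with mean weights w outputs +1 on x."""
    norm_x = math.sqrt(dot(x, x))
    return 0.5 + 0.5 * math.erf(dot(w, x) / (math.sqrt(2.0) * norm_x))


def sample_representations(W1, x, T, rng):
    """Draw T binary hidden vectors; neuron i is +1 when a uniform draw is below psi_plus."""
    probs = [psi_plus(x, w) for w in W1]
    return [[1.0 if rng.random() < p else -1.0 for p in probs] for _ in range(T)]


def approx_output(w2, Z):
    """Monte Carlo estimate of the aggregated output from the sampled set Z."""
    d1 = len(w2)
    return sum(math.erf(dot(w2, s) / math.sqrt(2 * d1)) for s in Z) / len(Z)


rng = random.Random(0)
W1 = [[0.3, -0.5], [1.0, 0.2], [-0.4, 0.7]]   # hypothetical toy parameters
w2 = [0.8, -1.2, 0.5]
x = [1.0, 2.0]

Z = sample_representations(W1, x, T=1000, rng=rng)
print(approx_output(w2, Z))
```

The cost is now linear in $T$ and $d_1$ instead of exponential in $d_1$, at the price of Monte Carlo noise of order $1/\sqrt{T}$.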

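Derivatives. Note that for a given sample $\{\mathbf{s}^t\}_{t=1}^{T}$, the approximate derivatives with respect to $\mathbf{w}_2$ (Equation 15 below) can be computed numerically by the automatic differentiation mechanism of deep learning frameworks while evaluating $\hat F_\theta(Z)$ (e.g., Paszke et al., 2017). However, we need the following Equation (14) to approximate the gradient with respect to $\mathbf{W}_1$ because $\partial\hat F_\theta(Z)/\partial\mathbf{w}_1^k=0$.

$$\frac{\partial}{\partial\mathbf{w}_1^k}F_\theta(\mathbf{x})\approx\frac{\mathbf{x}}{T\,2^{\frac32}\|\mathbf{x}\|}\,\mathrm{erf}'\left(\frac{\mathbf{w}_1^k\cdot\mathbf{x}}{\sqrt2\,\|\mathbf{x}\|}\right)\sum_{t=1}^{T}\frac{s_k^t}{\psi_{s_k^t}(\mathbf{x},\mathbf{w}_1^k)}\,\mathrm{erf}\left(\frac{\mathbf{w}_2\cdot\mathbf{s}^t}{\sqrt{2d_1}}\right); \tag{14}$$

$$\frac{\partial}{\partial\mathbf{w}_2}F_\theta(\mathbf{x})\approx\frac{1}{T\sqrt{2d_1}}\sum_{t=1}^{T}\mathbf{s}^t\,\mathrm{erf}'\left(\frac{\mathbf{w}_2\cdot\mathbf{s}^t}{\sqrt{2d_1}}\right)=\frac{\partial}{\partial\mathbf{w}_2}\hat F_\theta(Z). \tag{15}$$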
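The hidden-layer gradient estimator of Equation (14) is an importance-weighted average: each sampled representation contributes $s_k^t/\psi_{s_k^t}$ times the output vote. The sketch below compares it with the exact enumeration of Equation (12) on a toy network; names, weights and the sample size are illustrative assumptions.

```python
import itertools
import math
import random

SQRT2 = math.sqrt(2.0)


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def erf_prime(a):
    return 2.0 / math.sqrt(math.pi) * math.exp(-a * a)


def psi(s_i, x, w):
    return 0.5 + 0.5 * s_i * math.erf(dot(w, x) / (SQRT2 * math.sqrt(dot(x, x))))


def prefactor(W1, x, k):
    """Common factor erf'(w_1^k . x / (sqrt(2) ||x||)) / (2^{3/2} ||x||)."""
    norm_x = math.sqrt(dot(x, x))
    return erf_prime(dot(W1[k], x) / (SQRT2 * norm_x)) / (2 ** 1.5 * norm_x)


def exact_grad_w1k(W1, w2, x, k):
    """Equation (12), by enumerating all 2^{d_1} representations."""
    d1 = len(W1)
    acc = 0.0
    for s in itertools.product((-1.0, 1.0), repeat=d1):
        Psi = math.prod(psi(s_i, x, w) for s_i, w in zip(s, W1))
        acc += s[k] * math.erf(dot(w2, s) / math.sqrt(2 * d1)) * Psi / psi(s[k], x, W1[k])
    return [prefactor(W1, x, k) * acc * x_i for x_i in x]


def sampled_grad_w1k(W1, w2, x, k, T, rng):
    """Equation (14), using T representations sampled from the Psi distribution."""
    d1 = len(W1)
    probs = [psi(1.0, x, w) for w in W1]
    acc = 0.0
    for _ in range(T):
        s = [1.0 if rng.random() < p else -1.0 for p in probs]
        acc += s[k] / psi(s[k], x, W1[k]) * math.erf(dot(w2, s) / math.sqrt(2 * d1))
    return [prefactor(W1, x, k) * acc / T * x_i for x_i in x]


W1 = [[0.3, -0.5], [1.0, 0.2], [-0.4, 0.7]]   # hypothetical toy parameters
w2 = [0.8, -1.2, 0.5]
x = [0.6, 0.8]

print(exact_grad_w1k(W1, w2, x, k=0))
print(sampled_grad_w1k(W1, w2, x, k=0, T=40_000, rng=random.Random(0)))
```

Because of the division by $\psi_{s_k^t}$, the estimator's variance grows when a neuron is almost deterministic, which is one reason the sample size $T$ matters in practice.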

Similar approaches to stochastic networks. Random activation functions are commonly used in generative neural networks, and tools have been developed to train these by gradient descent (see Goodfellow et al. (2016, Section 20.9) for a review). Our analysis differs from these approaches in that the stochastic operations are introduced to estimate a deterministic objective. That being said, Equation (14) can be interpreted as a variant of the REINFORCE algorithm (Williams, 1992) to apply the back-propagation method along with discrete activation functions. Interestingly, the formulation we obtain through our PAC-Bayes objective is similar to a commonly used REINFORCE variant (e.g., Bengio et al., 2013; Yin and Zhou, 2019), where the activation function is given by a Bernoulli variable with probability of success $\sigma(a)$, where $a$ is the neuron input and $\sigma$ is the sigmoid function. The latter can be interpreted as a surrogate of our $\psi_{s_i}(\mathbf{x},\mathbf{w}_1^i)$.

## 4 Generalization to multilayer networks

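In the following, we extend the strategy introduced in Section 3 to BAM architectures with an arbitrary number of layers $L\in\mathbb{N}^*$ (Equation 1). An apparently straightforward approach to achieve this generalization would have been to consider a Gaussian posterior distribution $\mathcal{N}(\theta,I_D)$ over the BAM family $\{f_{\tilde\theta}\mid\tilde\theta\in\mathbb{R}^D\}$. However, doing so leads to a deterministic network relying on undesirable sums over a number of elements that is exponential in the number of neurons. Instead, we define a mapping $f_\theta\mapsto g_{\zeta(\theta)}$ which transforms the BAM network into a computation tree. An illustration of this mapping is given by Figure 2.

BAM to tree architecture map. Given a BAM network $f_\theta$ of $L$ layers with sizes $d_0,d_1,\dots,d_L$ (reminder: $d_L=1$), we obtain a computation tree by decoupling the neurons (i.e., the computation graph nodes): the tree leaves contain $\prod_{k=1}^{L}d_k$ copies of each of the $d_0$ BAM input neurons, and the tree root node corresponds to the single BAM output neuron. Each input-output path of the original BAM network becomes a path of length $L$ from one leaf to the tree root. Each tree edge has its own parameter (a real-valued scalar); the total number of edges/parameters is $D^\dagger:=\sum_{k=0}^{L-1}d_k^\dagger$, with $d_k^\dagger:=\prod_{i=k}^{L}d_i$. We define a set of tree parameters $\eta$ recursively according to the tree structure. From level $k$ to $k+1$, the tree has $d_k^\dagger$ edges. That is, each node at level $k+1$ has its own parameters subtree $\eta^{k+1}:=\{\eta_i^{k+1}\}_{i=0}^{d_k}$, where each $\eta_i^{k+1}$ is either a weight vector containing the input edges parameters (by convention, $\eta_0^{k+1}\in\mathbb{R}^{d_k}$) or a parameter set (thus, $\eta_1^{k+1},\dots,\eta_{d_k}^{k+1}$ are themselves parameter subtrees). Hence, the deepest elements of the recursive parameters set $\eta$ are weight vectors $\eta^1\in\mathbb{R}^{d_0}$. Let us now define the output tree $g_\eta(\mathbf{x}):=g_L(\mathbf{x},\eta)$ on an input $\mathbf{x}$ as a recursive function:

$$g_1(\mathbf{x},\{\mathbf{w}\})=\mathrm{sgn}(\mathbf{w}\cdot\mathbf{x}),$$

$$g_{k+1}\big(\mathbf{x},\underbrace{\{\mathbf{w},\eta_1,\dots,\eta_{d_k}\}}_{\eta}\big)=\mathrm{sgn}\big(\mathbf{w}\cdot\underbrace{(g_k(\mathbf{x},\eta_1),\dots,g_k(\mathbf{x},\eta_{d_k}))}_{\mathbf{g}_k(\mathbf{x},\eta)}\big)\quad\text{for } k=1,\dots,L-1.$$

BAM to tree parameters map. Given BAM parameters $\theta$, we denote $\theta_{1:k}:=\mathrm{vec}\big(\{\mathbf{W}_i\}_{i=1}^{k}\big)$. The mapping from $\theta$ into the corresponding (recursive) tree parameters set is $\zeta(\theta)=\{\mathbf{w}_L,\zeta_1(\theta_{1:L-1}),\dots,\zeta_{d_{L-1}}(\theta_{1:L-1})\}$, such that $\zeta_i(\theta_{1:k})=\{\mathbf{w}_k^i,\zeta_1(\theta_{1:k-1}),\dots,\zeta_{d_{k-1}}(\theta_{1:k-1})\}$, and $\zeta_i(\theta_{1:1})=\{\mathbf{w}_1^i\}$. Note that the parameters tree obtained by the transformation $\zeta(\theta)$ is highly redundant, as each weight vector $\mathbf{w}_k^i$ (the $i$th line of the $\mathbf{W}_k$ matrix from $\theta$) is replicated $d_{k+1}^\dagger$ times. This construction is such that $f_\theta(\mathbf{x})=g_{\zeta(\theta)}(\mathbf{x})$ for all $\mathbf{x}\in\mathbb{R}^{d_0}$.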
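The redundancy of the tree can be quantified with a short computation. The sketch below assumes that the number of edges between tree levels $k$ and $k+1$ is $d_k^\dagger=\prod_{i=k}^{L}d_i$ and that each row of $\mathbf{W}_k$ is replicated $d_{k+1}^\dagger$ times, which is what the decoupling argument suggests; the function names and layer sizes are hypothetical.

```python
import math


def bam_parameter_count(sizes):
    """D = sum_k d_{k-1} * d_k for layer sizes [d_0, ..., d_L]."""
    return sum(sizes[k - 1] * sizes[k] for k in range(1, len(sizes)))


def tree_edge_counts(sizes):
    """Edges between tree levels k and k+1, for k = 0, ..., L-1 (product of d_k, ..., d_L)."""
    L = len(sizes) - 1
    return [math.prod(sizes[k:]) for k in range(L)]


def replication_factor(sizes, k):
    """Number of copies of each row of W_k in the tree (product of d_{k+1}, ..., d_L)."""
    return math.prod(sizes[k + 1:])


sizes = [4, 3, 2, 1]   # hypothetical network: d_0 = 4, two hidden layers, d_L = 1
print(bam_parameter_count(sizes))                         # parameters of the BAM network
print(tree_edge_counts(sizes), sum(tree_edge_counts(sizes)))  # edges per level and in total
print([replication_factor(sizes, k) for k in (1, 2, 3)])  # copies of rows of W_1, W_2, W_3
```

Only the layers far from the output are replicated: the last two weight matrices appear once, while earlier layers are duplicated once per path leading to the root.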

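Deterministic network. With a slight abuse of notation, we let $\tilde\eta\sim Q_\eta:=\mathcal{N}(\eta,I_{D^\dagger})$ denote a parameter tree of the same structure as $\eta$, where every weight is sampled i.i.d. from a normal distribution. We denote $G_\theta(\mathbf{x}):=\mathbf{E}_{\tilde\eta\sim Q_{\zeta(\theta)}}\,g_{\tilde\eta}(\mathbf{x})$, and we compute the output value of this predictor recursively. In the following, we denote by $G^{(j)}_{\theta_{1:k+1}}(\mathbf{x})$ the function returning the $j$th neuron value of the layer $k+1$. Hence, the output of this network is $G_\theta(\mathbf{x})=G^{(1)}_{\theta_{1:L}}(\mathbf{x})$. As such,

$$G^{(j)}_{\theta_{1:1}}(\mathbf{x})=\int_{\mathbb{R}^{d_0}}Q_{\mathbf{w}_1^j}(\mathbf{v})\,\mathrm{sgn}(\mathbf{v}\cdot\mathbf{x})\,d\mathbf{v}=\mathrm{erf}\left(\frac{\mathbf{w}_1^j\cdot\mathbf{x}}{\sqrt2\,\|\mathbf{x}\|}\right),$$

$$G^{(j)}_{\theta_{1:k+1}}(\mathbf{x})=\sum_{\mathbf{s}\in\{-1,1\}^{d_k}}\mathrm{erf}\left(\frac{\mathbf{w}_{k+1}^j\cdot\mathbf{s}}{\sqrt{2d_k}}\right)\Psi_\mathbf{s}^k(\mathbf{x},\theta),\quad\text{with}\quad\Psi_\mathbf{s}^k(\mathbf{x},\theta)=\prod_{i=1}^{d_k}\left(\frac12+\frac12s_i\times G^{(i)}_{\theta_{1:k}}(\mathbf{x})\right). \tag{16}$$

The complete mathematical calculations leading to the above results are provided in Appendix A.2. The computation tree structure and the parameter mapping are crucial to obtain the recursive expression of Equation (16). However, note that this abstract mathematical structure is never manipulated explicitly. Instead, it allows computing each hidden layer vector sequentially; a summation of $2^{d_k}$ terms is required for each layer $k$.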
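The layer-by-layer recursion of Equation (16) can be written as a short loop: the first layer applies the erf closed form to the input, and every subsequent layer sums over the binary configurations of the previous one. The sketch below is an exact (exponential-time) evaluation for tiny networks; function names and weights are illustrative assumptions.

```python
import itertools
import math


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def first_layer(W1, x):
    """Aggregated first-layer outputs erf(w_1^j . x / (sqrt(2) ||x||))."""
    norm_x = math.sqrt(dot(x, x))
    return [math.erf(dot(w, x) / (math.sqrt(2.0) * norm_x)) for w in W1]


def next_layer(W_next, G_prev):
    """One step of the recursion: sum over the 2^{d_k} configurations of layer k."""
    d_k = len(G_prev)
    out = [0.0] * len(W_next)
    for s in itertools.product((-1.0, 1.0), repeat=d_k):
        # Probability of configuration s, given the aggregated outputs of layer k.
        Psi = math.prod(0.5 + 0.5 * s_i * g for s_i, g in zip(s, G_prev))
        for j, w in enumerate(W_next):
            out[j] += math.erf(dot(w, s) / math.sqrt(2 * d_k)) * Psi
    return out


def pbg_forward(weights, x):
    """Aggregated network output for weights [W_1, ..., W_L] (last matrix has one row)."""
    G = first_layer(weights[0], x)
    for W in weights[1:]:
        G = next_layer(W, G)
    return G[0]


# Hypothetical network with sizes d_0 = 2, d_1 = 3, d_2 = 2, d_3 = 1.
W1 = [[0.3, -0.5], [1.0, 0.2], [-0.4, 0.7]]
W2 = [[0.5, -0.3, 0.8], [-0.6, 0.9, 0.1]]
W3 = [[1.1, -0.7]]
print(pbg_forward([W1, W2, W3], [1.0, 2.0]))
```

Each layer only needs the vector of aggregated outputs of the previous layer, so the tree never has to be built explicitly.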

Stochastic approximation. Following the sampling procedure of Section 3.2 for the one hidden layer network, we propose to perform a stochastic approximation of the network prediction output by Monte Carlo sampling at each layer. Likewise, we recover exact and approximate derivatives in a layer-by-layer scheme. The related equations are given in Appendix A.3.

## 5 PBGNet: PAC-Bayesian SGD learning of binary activated networks

We design an algorithm to learn the parameters $\theta\in\mathbb{R}^D$ of the predictor $G_\theta$ by minimizing a PAC-Bayesian upper bound on the generalization loss $\mathcal{L}_\mathcal{D}(G_\theta)$. We name our algorithm PBGNet (PAC-Bayesian Binary Gradient Network), as it is a generalization of the PBGD (PAC-Bayesian Gradient Descent) learning algorithm for linear classifiers (Germain et al., 2009) to deep binary activated neural networks.

Kullback-Leibler regularization. The computation of a PAC-Bayesian bound value relies on two key elements: the empirical loss on the training set and the Kullback-Leibler divergence between the prior and the posterior. Sections 3 and 4 present exact computation and approximation schemes for the empirical loss $\hat{\mathcal{L}}_S(G_\theta)$ (which is equal to $\hat{\mathcal{L}}_S(F_\theta)$ when $L=2$). Equation (17) introduces the $\mathrm{KL}$-divergence associated to the parameter maps of Section 4. We use the shortcut notation $\mathcal{K}(\theta,\mu)$ to refer to the divergence between two multivariate Gaussians of $D^\dagger$ dimensions, corresponding to learned parameters $\theta=\mathrm{vec}\big(\{\mathbf{W}_k\}_{k=1}^{L}\big)$ and prior parameters $\mu=\mathrm{vec}\big(\{\mathbf{U}_k\}_{k=1}^{L}\big)$.

$$\mathcal{K}(\theta,\mu):=\mathrm{KL}\big(Q_{\zeta(\theta)}\,\big\|\,P_{\zeta(\mu)}\big)=\frac12\left(\|\mathbf{w}_L-\mathbf{u}_L\|^2+\sum_{k=1}^{L-1}d_{k+1}^\dagger\,\|\mathbf{W}_k-\mathbf{U}_k\|_F^2\right), \tag{17}$$

where the factors $d_{k+1}^\dagger=\prod_{i=k+1}^{L}d_i$ are due to the redundancy introduced by the transformation $\zeta(\cdot)$. This has the effect of penalizing the weights of the first layers more heavily. It might have a considerable influence on the bound value for very deep networks. On the other hand, we observe that this is consistent with the fine-tuning practice performed when training deep neural networks for a transfer learning task: prior parameters are learned on a first dataset, and the posterior weights are learned by adjusting the last layer weights on a second dataset (see Bengio, 2009; Yosinski et al., 2014).
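The layer-dependent regularizer of Equation (17) is a weighted sum of squared Frobenius distances. The sketch below assumes the replication factor of layer $k$ is $\prod_{i=k+1}^{L}d_i$ (equal to 1 for the last two layers, since $d_L=1$); the function name and toy sizes are illustrative assumptions.

```python
import math


def kl_tree(weights, prior, sizes):
    """KL divergence of Equation (17) for weights/prior given as [W_1, ..., W_L].

    `sizes` is [d_0, ..., d_L]; each matrix is a list of rows.
    """
    L = len(weights)
    total = 0.0
    for k in range(1, L + 1):
        factor = math.prod(sizes[k + 1:])   # replication of layer k (1 when k >= L - 1)
        sq_dist = sum((a - b) ** 2
                      for row_w, row_u in zip(weights[k - 1], prior[k - 1])
                      for a, b in zip(row_w, row_u))
        total += factor * sq_dist
    return 0.5 * total


def zeros(sizes):
    """All-zero parameters for the given layer sizes."""
    return [[[0.0] * sizes[k - 1] for _ in range(sizes[k])] for k in range(1, len(sizes))]


sizes = [2, 3, 2, 1]            # hypothetical network with two hidden layers
prior = zeros(sizes)

first = zeros(sizes)
first[0][0][0] = 1.0            # move one first-layer weight by 1
last = zeros(sizes)
last[2][0][0] = 1.0             # move one output weight by 1

print(kl_tree(first, prior, sizes), kl_tree(last, prior, sizes))
```

The same unit displacement costs twice as much KL in the first layer as in the output layer here, because each first-layer row appears $d_2=2$ times in the tree.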

Bound minimization. PBGNet minimizes the bound of Theorem 1 (rephrased as Equation 18). However, this is done indirectly by minimizing a variation on Theorem 2, which has been used in a deep learning context by Zhou et al. (2019) (Equation 19). Theorem 3 links both results (proof in Appendix A.4).

###### Theorem 3.

Given prior parameters $\mu\in\mathbb{R}^D$, with probability at least $1-\delta$ over $S\sim\mathcal{D}^n$, we have for all $\theta$ on $\mathbb{R}^D$:

$$\mathcal{L}_\mathcal{D}(G_\theta)\le\sup_{0\le p\le1}\left\{p:\mathrm{kl}\big(\hat{\mathcal{L}}_S(G_\theta)\,\big\|\,p\big)\le\frac1n\left[\mathcal{K}(\theta,\mu)+\ln\frac{2\sqrt n}{\delta}\right]\right\} \tag{18}$$

$$=\inf_{C>0}\left\{\frac{1}{1-e^{-C}}\left(1-\exp\left(-C\,\hat{\mathcal{L}}_S(G_\theta)-\frac1n\left[\mathcal{K}(\theta,\mu)+\ln\frac{2\sqrt n}{\delta}\right]\right)\right)\right\}. \tag{19}$$

We use stochastic gradient descent (SGD) as the optimization procedure to minimize Equation (19) with respect to $\theta$ and $C$. It optimizes the same trade-off as in Equation (5), but choosing the $C$ value which minimizes the bound. (We also note that our training objective can be seen as a generalized Bayesian inference one (Knoblauch et al., 2019), where the tradeoff between the loss and the KL divergence is given by the PAC-Bayes bound.) The originality of our SGD approach is that not only do we induce gradient randomness by selecting mini-batches among the training set $S$, we also approximate the loss gradient by sampling $T$ elements for the combinatorial sum at each layer. Our experiments show that, for some learning problems, reducing the sample size of the Monte Carlo approximation can be beneficial to the stochastic gradient descent. Thus the sample size value $T$ has an influence on the cost function space exploration during the training procedure (see the figure on the effect of the sample size in the appendix). Hence, we consider $T$ as a PBGNet hyperparameter.
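The equality between the kl-inversion form (Equation 18) and the infimum over $C$ (Equation 19) can be checked numerically for given values of the empirical loss and of the complexity term. The sketch below inverts the Bernoulli kl by bisection and minimizes the Catoni-type expression over a grid of $C$ values; the function names, the grid and the numbers are illustrative assumptions.

```python
import math


def kl_bernoulli(q, p):
    """KL divergence between Bernoulli(q) and Bernoulli(p)."""
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)
    out = 0.0
    if q > 0.0:
        out += q * math.log(q / p)
    if q < 1.0:
        out += (1.0 - q) * math.log((1.0 - q) / (1.0 - p))
    return out


def kl_inverse(q, c):
    """sup {p in [q, 1] : kl(q || p) <= c}, by bisection (Equation 18)."""
    lo, hi = q, 1.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if kl_bernoulli(q, mid) <= c:
            lo = mid
        else:
            hi = mid
    return lo


def catoni_form(q, c, C):
    """Expression inside the infimum of Equation (19) for a fixed C > 0."""
    return (1.0 - math.exp(-C * q - c)) / (1.0 - math.exp(-C))


def tightest_catoni(q, c, grid_size=20_000, c_max=20.0):
    """Approximate the infimum over C by a fine grid search on (0, c_max]."""
    return min(catoni_form(q, c, c_max * i / grid_size) for i in range(1, grid_size + 1))


q = 0.1        # hypothetical empirical loss
c = 0.0121     # hypothetical value of (K + ln(2 sqrt(n) / delta)) / n
print(kl_inverse(q, c), tightest_catoni(q, c))
```

Any fixed $C$ gives a valid but looser value than the kl inversion, which is why treating $C$ as a trainable quantity recovers the tighter bound while keeping a differentiable objective.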

## 6 Numerical experiments

Experiments were conducted on six binary classification datasets, described in Appendix B.

Learning algorithms. In order to get insights on the trade-offs promoted by the PAC-Bayes bound minimization, we compared PBGNet to variants focusing on empirical loss minimization. We train the models using multiple network architectures (depth and layer size) and hyperparameter choices. The objective is to evaluate the efficiency of our PAC-Bayesian framework both as a learning algorithm design tool and as a model selection criterion. For all methods, the network parameters are trained using the Adam optimizer (Kingma and Ba, 2015). Early stopping is used to interrupt the training when the cost function value is not improved for 20 consecutive epochs. Network architectures explored range from 1 to 3 hidden layers, with a common hidden size $h$ ($d_k=h$ for $1\le k<L$). Unless otherwise specified, the same randomly initialized parameters are used as a prior in the bound and as a starting point for SGD optimization (as in Dziugaite and Roy, 2017). Also, for all models except MLP, we select the binary activation sampling size $T$ in a range going from 10 to 10000. More details about the experimental setting are given in Appendix B.

MLP. We compare to a standard network with $\tanh$ activation, as this activation resembles the $\mathrm{erf}$ function of PBGNet. We optimize the linear loss as the cost function and use 20% of training data as validation for hyperparameter selection. A weight decay parameter is also selected on the validation set. Using weight decay corresponds to adding an $L_2$ regularizer to the cost function, but contrary to the regularizer of Equation (17) promoted by PBGNet, this regularization is uniform for all layers.

PBGNet$_\ell$. This variant minimizes the empirical loss $\hat{\mathcal{L}}_S(G_\theta)$, with an $L_2$ regularization term. The corresponding weight decay, as well as other hyperparameters, are selected using a validation set, exactly as for the MLP. The bound expression is not involved in the learning process and is computed on the model selected by the validation set technique.

PBGNet$_{\ell\text{-bnd}}$. Again, the empirical loss $\hat{\mathcal{L}}_S(G_\theta)$ with an $L_2$ regularization term is minimized. However, only the weight decay hyperparameter is selected on the validation set; the other ones are selected by the bound. This method is motivated by an empirical observation: our PAC-Bayesian bound is a great model selection tool for most hyperparameters, except the weight decay term.

PBGNet. As described in Section 5, the generalization bound is directly optimized as the cost function during the learning procedure and used solely for hyperparameter selection: no validation set is needed and all training data are exploited for learning.

PBGNet$_{\text{pre}}$. We also explore the possibility of using a part of the training data as a pre-training step. To do so, we split the training set into two halves. First, we minimize the empirical loss for a fixed number of 20 epochs on the first 50% of the training set. Then, we use the learned parameters as initialization and prior for PBGNet and learn on the second 50% of the training set.

Analysis. Results are summarized in Table 1, which highlights the strengths and weaknesses of the models. Both MLP and PBGNet$_\ell$ obtain competitive error scores but lack generalization guarantees. By introducing the bound value in the model selection process, even with the linear loss as the cost function, PBGNet$_{\ell\text{-bnd}}$ yields nonvacuous generalization bound values, although with an increase in error scores. Using the bound expression for the cost function in PBGNet improves bound values while keeping similar performances. The Ads dataset is a remarkable exception, where the small amount of training examples seems to radically constrain the network in the learning process, as it hinders the KL divergence growth in the bound expression. With an informative prior from pre-training, PBGNet$_{\text{pre}}$ is able to recover competitive error scores while offering tight generalization guarantees. All selected hyperparameters are presented in the appendix (Table 4).

A notable observation is the impact of the bound exploitation for model selection on the train-test error gap. Indeed, PBGNet$_{\ell\text{-bnd}}$, PBGNet and PBGNet$_{\text{pre}}$ display test errors closer to their train errors, as compared to MLP and PBGNet$_\ell$. This behavior is more noticeable as the dataset size grows and suggests potential robustness to overfitting when the bound is involved in the learning process.

## 7 Conclusion and perspectives

We made theoretical and algorithmic contributions towards a better understanding of the generalization abilities of binary activated multilayer networks, using PAC-Bayes. Note that the computational complexity of a learning epoch of PBGNet is higher than the cost induced in binary neural networks (Bengio, 2009; Hubara et al., 2016; Soudry et al., 2014; Hubara et al., 2017). Indeed, we focus on the optimization of the generalization guarantee more than on computational complexity. We nevertheless propose a sampling scheme that considerably reduces the learning time required by our method, achieving a nontrivial tradeoff.

We intend to investigate how we could leverage the bound to learn suitable priors for PBGNet or, equivalently, to find (from the bound point of view) the best network architecture. We also plan to extend our analysis to multiclass and multilabel prediction, and convolutional networks. We believe that this line of work is part of a necessary effort to give rise to a better understanding of the behavior of deep neural networks.

#### Acknowledgments

We would like to thank Mario Marchand for the insight leading to Theorem 3, Gabriel Dubé and Jean-Samuel Leboeuf for their input on the theoretical aspects, Frédérik Paradis for his help with the implementation, and Robert Gower for his insightful comments. This work was supported in part by the French Project APRIORI ANR-18-CE23-0015, in part by NSERC and in part by Intact Financial Corporation. We gratefully acknowledge the support of NVIDIA Corporation with the donation of Titan Xp GPUs used for this research.

## References

• Ambroladze et al. [2006] Amiran Ambroladze, Emilio Parrado-Hernández, and John Shawe-Taylor. Tighter PAC-Bayes bounds. In NIPS, 2006.
• Bengio [2009] Yoshua Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009.
• Bengio et al. [2013] Yoshua Bengio, Nicholas Léonard, and Aaron C. Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. CoRR, abs/1308.3432, 2013.
• Catoni [2003] Olivier Catoni. A PAC-Bayesian approach to adaptive classification. preprint, 840, 2003.
• Catoni [2004] Olivier Catoni. Statistical learning theory and stochastic optimization: Ecole d’Eté de Probabilités de Saint-Flour XXXI-2001. Springer, 2004.
• Catoni [2007] Olivier Catoni. PAC-Bayesian supervised classification: the thermodynamics of statistical learning, volume 56. Inst. of Mathematical Statistic, 2007.
• Cortes and Vapnik [1995] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3), 1995.
• Dziugaite and Roy [2017] Gintare Karolina Dziugaite and Daniel M. Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. In UAI. AUAI Press, 2017.
• Germain et al. [2009] Pascal Germain, Alexandre Lacasse, François Laviolette, and Mario Marchand. PAC-Bayesian learning of linear classifiers. In ICML, pages 353–360. ACM, 2009.
• Goodfellow et al. [2016] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
• Guedj [2019] Benjamin Guedj. A primer on PAC-Bayesian learning. arXiv preprint arXiv:1901.05353, 2019.
• Hubara et al. [2016] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks. In NIPS, pages 4107–4115, 2016.
• Hubara et al. [2017] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. JMLR, 18(1):6869–6898, 2017.
• Kingma and Ba [2015] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
• Knoblauch et al. [2019] Jeremias Knoblauch, Jack Jewson, and Theodoros Damoulas. Generalized variational inference, 2019.
• Lacasse [2010] Alexandre Lacasse. Bornes PAC-Bayes et algorithmes d’apprentissage. PhD thesis, Université Laval, 2010.
• Langford [2005] John Langford. Tutorial on practical prediction theory for classification. JMLR, 6, 2005.
• Langford and Caruana [2001] John Langford and Rich Caruana. (Not) Bounding the True Error. In NIPS, pages 809–816. MIT Press, 2001.
• Langford and Shawe-Taylor [2002] John Langford and John Shawe-Taylor. PAC-Bayes & margins. In NIPS, 2002.
• McAllester [1999] David McAllester. Some PAC-Bayesian theorems. Machine Learning, 37(3), 1999.
• McAllester [2003] David McAllester. PAC-Bayesian stochastic model selection. Machine Learning, 51(1), 2003.
• Neyshabur et al. [2018] Behnam Neyshabur, Srinadh Bhojanapalli, and Nathan Srebro. A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. In ICLR, 2018.
• Paradis [2018] Frédérik Paradis. Poutyne: A Keras-like framework for PyTorch, 2018.
• Parrado-Hernández et al. [2012] Emilio Parrado-Hernández, Amiran Ambroladze, John Shawe-Taylor, and Shiliang Sun. PAC-Bayes bounds with data dependent priors. JMLR, 13, 2012.
• Paszke et al. [2017] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, 2017.
• Seeger [2002] Matthias Seeger. PAC-Bayesian generalization bounds for gaussian processes. JMLR, 3, 2002.
• Shawe-Taylor and Williamson [1997] John Shawe-Taylor and Robert C. Williamson. A PAC analysis of a Bayesian estimator. In COLT, 1997.
• Soudry et al. [2014] Daniel Soudry, Itay Hubara, and Ron Meir. Expectation backpropagation: Parameter-free training of multilayer neural networks with continuous or discrete weights. In NIPS, pages 963–971, 2014.
• Valiant [1984] Leslie G Valiant. A theory of the learnable. In Proceedings of the sixteenth annual ACM symposium on Theory of computing, pages 436–445. ACM, 1984.
• Williams [1992] Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3):229–256, May 1992.
• Yin and Zhou [2019] Mingzhang Yin and Mingyuan Zhou. ARM: augment-reinforce-merge gradient for stochastic binary networks. In ICLR (Poster), 2019.
• Yosinski et al. [2014] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? In NIPS, pages 3320–3328, 2014.
• Zhou et al. [2019] Wenda Zhou, Victor Veitch, Morgane Austern, Ryan P. Adams, and Peter Orbanz. Non-vacuous generalization bounds at the imagenet scale: a PAC-bayesian compression approach. In ICLR, 2019.

## Appendix A Supplementary Material

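### A.1 From the sign activation to the erf function

For completeness, we present the detailed derivation of Equation (4). This result appears notably in Langford and Shawe-Taylor [2002], Langford [2005], Germain et al. [2009].

Given $\mathbf{x}\in\mathbb{R}^d$, we have

$$F_\mathbf{w}(\mathbf{x})=\mathbf{E}_{\mathbf{v}\sim\mathcal{N}(\mathbf{w},I)}\,\mathrm{sgn}(\mathbf{v}\cdot\mathbf{x})=\int_{\mathbb{R}^d}\mathrm{sgn}(\mathbf{v}\cdot\mathbf{x})\left(\frac{1}{\sqrt{2\pi}}\right)^{d}e^{-\frac12\|\mathbf{v}-\mathbf{w}\|^2}d\mathbf{v}$$

$$=\left(\frac{1}{\sqrt{2\pi}}\right)^{d}\int_{\mathbb{R}^d}\mathbb{1}[\mathbf{v}\cdot\mathbf{x}>0]\,e^{-\frac12\|\mathbf{v}-\mathbf{w}\|^2}d\mathbf{v}-\left(\frac{1}{\sqrt{2\pi}}\right)^{d}\int_{\mathbb{R}^d}\mathbb{1}[\mathbf{v}\cdot\mathbf{x}<0]\,e^{-\frac12\|\mathbf{v}-\mathbf{w}\|^2}d\mathbf{v}.$$

Without loss of generality, let us consider a vector basis where $\frac{\mathbf{x}}{\|\mathbf{x}\|}$ is the first coordinate. In this basis, the first elements of the vectors $\mathbf{v}=(v_1,v_2,\dots,v_d)$ and $\mathbf{w}=(w_1,w_2,\dots,w_d)$ are

$$v_1=\frac{\mathbf{v}\cdot\mathbf{x}}{\|\mathbf{x}\|},\qquad w_1=\frac{\mathbf{w}\cdot\mathbf{x}}{\|\mathbf{x}\|}.$$

Hence, $\mathbf{v}\cdot\mathbf{x}=v_1\|\mathbf{x}\|$ with $\|\mathbf{x}\|>0$. Looking at the left side of the subtraction from the previous equation, we thus have

$$\left(\frac{1}{\sqrt{2\pi}}\right)^{d}\int_{\mathbb{R}^d}\mathbb{1}[\mathbf{v}\cdot\mathbf{x}>0]\,e^{-\frac12\|\mathbf{v}-\mathbf{w}\|^2}d\mathbf{v}$$

$$=\int_{\mathbb{R}}\mathbb{1}[v_1>0]\,\frac{1}{\sqrt{2\pi}}e^{-\frac12(v_1-w_1)^2}\left[\int_{\mathbb{R}^{d-1}}\left(\frac{1}{\sqrt{2\pi}}\right)^{d-1}e^{-\frac12\|\mathbf{v}_{2:d}-\mathbf{w}_{2:d}\|^2}d\mathbf{v}_{2:d}\right]dv_1$$

$$=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}\mathbb{1}[t>-w_1]\,e^{-\frac12t^2}dt,$$

where the bracketed integral equals 1 and $t:=v_1-w_1$. Hence,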

 \Esp\vbf∼\Ncal(\wbf,I)\sgn(\vbf⋅\xbf) =1√2π∫∞−∞1[t>−w1]e−12t2dt−1√2π∫∞−∞1[t<−w1]e−12t2dt =1√2π∫∞−w1e−12t2dt−1√2π∫−w1−∞e−12t2dt =12+1√2π∫w10e−12t2dt−12+1√2π∫w10e−12t2dt =√2√