# PAC-Bayesian Contrastive Unsupervised Representation Learning

## Abstract

Contrastive unsupervised representation learning (CURL) is the state-of-the-art technique to learn representations (as a set of features) from unlabelled data. While CURL has collected several empirical successes recently, theoretical understanding of its performance was still missing. In a recent work, Arora et al. (2019) provide the first generalisation bounds for CURL, relying on a Rademacher complexity. We extend their framework to the flexible PAC-Bayes setting, allowing us to deal with the non-iid setting. We present PAC-Bayesian generalisation bounds for CURL, which are then used to derive a new representation learning algorithm. Numerical experiments on real-life datasets illustrate that our algorithm achieves competitive accuracy, and yields non-vacuous generalisation bounds.

## 1 Introduction

Unsupervised representation learning (Bengio et al., 2013) aims at extracting feature representations from an unlabelled dataset for downstream tasks such as classification and clustering (see Mikolov et al., 2013; Noroozi and Favaro, 2016; Zhang et al., 2016; Caron et al., 2018; Devlin et al., 2019). An unsupervised representation learning model is typically learnt by solving a pretext task without supervised information. The trained model then works as a feature extractor for supervised tasks.

In unsupervised representation learning, contrastive losses form a widely used class of objective functions. A contrastive loss uses two types of data pairs, namely similar pairs and dissimilar pairs. Their similarity is defined without label information from a supervised task. For example, in word representation learning, Mikolov et al. (2013) define a similar pair as words co-occurring in the same context, while dissimilar pairs are randomly sampled from a fixed distribution. Intuitively, by minimising a contrastive loss, similar data samples are mapped to similar representations in feature space in terms of some underlying metric (such as the inner product), and dissimilar samples are not mapped to similar representations.

Contrastive unsupervised representation learning improves the performance of supervised models in practice, and has attracted a lot of research interest lately (see Chen et al., 2020, and references therein), although usage is still quite far ahead of theoretical understanding. Recently, Arora et al. (2019) introduced a theoretical framework for contrastive unsupervised representation learning and derived the first generalisation bounds for CURL. In parallel, PAC-Bayes is emerging as a principled tool to understand and quantify the generalisation ability of many machine learning algorithms, including deep neural networks (as recently studied by Dziugaite and Roy, 2017; Neyshabur et al., 2018; Letarte et al., 2019).

Our contributions. We extend the framework introduced by Arora et al. (2019), by adopting a PAC-Bayes approach to contrastive unsupervised representation learning. We derive the first PAC-Bayes generalisation bounds for CURL, both in iid and non-iid settings. Our bounds are then used to derive new CURL algorithms, for which we provide a complete implementation. The paper closes with numerical experiments on two real-life datasets (CIFAR-100 and AUSLAN) showing that our bounds are non-vacuous in the iid setting.

## 2 Contrastive Unsupervised Representation Learning

### 2.1 Learning Framework

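Inputs are denoted $\mathbf{x} \in \mathcal{X}$, and outputs are denoted $y \in \mathcal{Y}$, where $\mathcal{Y}$ is a discrete and finite set. The representation is learnt from a (large) unlabelled dataset $U = \{\mathbf{z}_i\}_{i=1}^m$, where $\mathbf{z}_i = (\mathbf{x}_i, \mathbf{x}_i^+, \mathbf{x}_{i1}^-, \dots, \mathbf{x}_{ik}^-)$ is a tuple of $k+2$ elements; $\mathbf{x}_i$ being similar to $\mathbf{x}_i^+$ and dissimilar to every element of the negative sample set $\{\mathbf{x}_{ij}^-\}_{j=1}^k$. The predictor is learnt from a labelled dataset $S = \{(\mathbf{x}_i, y_i)\}_{i=1}^n$.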

In the following, we present the contrastive framework proposed by Arora et al. (2019) in a simplified scenario in order to highlight the key ideas, where the supervised prediction task is binary and the negative sample sets for unsupervised representation learning contain one element. Thus, we choose the label set to be $\mathcal{Y} = \{-1, 1\}$, and the unsupervised set $U$ contains triplets $\mathbf{z}_i = (\mathbf{x}_i, \mathbf{x}_i^+, \mathbf{x}_i^-)$. The extension to a more generic setting (for $|\mathcal{Y}| > 2$ and $k > 1$) bears no particular difficulty and is deferred to section A.2. It is important to note at this stage that both $U$ and $S$ are assumed to be iid (independent, identically distributed) collections, as also assumed by Arora et al. (2019).

Latent classes and data distributions. The main assumption is the existence of a set of latent classes $\mathcal{C}$. Let us denote by $\rho$ a probability distribution over $\mathcal{C}$. Moreover, with each class $c \in \mathcal{C}$ comes a class distribution $\mathcal{D}_c$ over the input space $\mathcal{X}$. A similar pair $(\mathbf{x}, \mathbf{x}^+)$ is such that both $\mathbf{x}$ and $\mathbf{x}^+$ are generated by the same class distribution. Note that an input possibly belongs to multiple classes: take the example of $\mathbf{x}$ being an image and $\mathcal{C}$ a set of latent classes including “the image depicts a dog” and “the image depicts a cat” (the two classes are not mutually exclusive).

{definition}

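Let $\rho^2$ be a shorthand notation for the joint distribution $(\rho, \rho)$. We refer to the unsupervised data distribution $\mathcal{U}$ as the process that generates an unlabelled sample $\mathbf{z} = (\mathbf{x}, \mathbf{x}^+, \mathbf{x}^-)$ according to the following scheme:

1. Draw two latent classes $(c^+, c^-) \sim \rho^2$;
2. Draw two similar samples $(\mathbf{x}, \mathbf{x}^+) \sim (\mathcal{D}_{c^+})^2$;
3. Draw a negative sample $\mathbf{x}^- \sim \mathcal{D}_{c^-}$.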
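The three-step scheme above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names, the class probabilities in `rho`, and the one-dimensional Gaussian class distributions in `class_samplers` are hypothetical and do not come from the paper.

```python
import random

def sample_triplet(rho, class_samplers, rng):
    """Draw one unlabelled triplet (x, x_plus, x_minus)."""
    classes = list(rho)
    weights = [rho[c] for c in classes]
    # Step 1: two latent classes, drawn independently from rho
    # (with replacement, so the two classes may coincide).
    c_plus, c_minus = rng.choices(classes, weights=weights, k=2)
    # Step 2: two similar samples from the class distribution of c_plus.
    x = class_samplers[c_plus](rng)
    x_plus = class_samplers[c_plus](rng)
    # Step 3: one negative sample from the class distribution of c_minus.
    x_minus = class_samplers[c_minus](rng)
    return (x, x_plus, x_minus)

# Hypothetical toy setting: three latent classes, each a 1-D Gaussian.
rho = {"a": 0.5, "b": 0.3, "c": 0.2}
class_samplers = {
    "a": lambda rng: rng.gauss(-2.0, 0.5),
    "b": lambda rng: rng.gauss(0.0, 0.5),
    "c": lambda rng: rng.gauss(2.0, 0.5),
}
rng = random.Random(0)
U = [sample_triplet(rho, class_samplers, rng) for _ in range(1000)]
```

Because the two classes are drawn with replacement, some triplets have a "negative" sample that actually shares the class of the anchor; the probability of this event is the collision term that appears later in the bounds.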

The labelled sample is obtained by fixing two classes (from now on, the shorthand notation is used to refer to a pair of latent classes). Each class is then mapped on a label of . We fix and ;

Thus we can write as an ordered set. The label is obtained from the latent class distribution restricted to two values :

$$\rho_{c^\pm}(c^-) = \frac{\rho(c^-)}{\rho(c^-) + \rho(c^+)}, \qquad \rho_{c^\pm}(c^+) = \frac{\rho(c^+)}{\rho(c^-) + \rho(c^+)}.$$
{definition}

We refer to the supervised data distribution as the process that generates a labelled sample according to the following scheme:

1. Draw a class and set label  ;
2. Draw a sample  .

Loss function. The learning process is divided into two sequential steps, the unsupervised and supervised steps. In order to relate these two steps, the key is to express them in terms of a common convex loss function $\ell$. Typical choices are

$$\ell_{\log}(v) \triangleq \log_2\left(1 + e^{-v}\right) \quad \text{(logistic loss)}, \tag{1}$$

$$\ell_{\mathrm{hinge}}(v) \triangleq \max\{0, 1 - v\} \quad \text{(hinge loss)}, \tag{2}$$

where the loss argument $v$ expresses a notion of margin.
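As a quick illustration, the two losses of Equations (1) and (2) can be written as follows. The function names are ours, and the branch on the sign of `v` is only a numerical-stability device, not part of the definition.

```python
import math

def logistic_loss(v):
    """Logistic loss log2(1 + exp(-v)), evaluated without overflow."""
    if v >= 0:
        return math.log1p(math.exp(-v)) / math.log(2.0)
    # For negative v, use log(1 + e^{-v}) = -v + log(1 + e^{v}).
    return (-v + math.log1p(math.exp(v))) / math.log(2.0)

def hinge_loss(v):
    """Hinge loss max{0, 1 - v}."""
    return max(0.0, 1.0 - v)

# Both losses are normalised so that a zero margin costs exactly 1.
zero_margin = (logistic_loss(0.0), hinge_loss(0.0))
```

The base-2 logarithm is what makes the logistic loss equal to 1 at zero margin, matching the hinge loss at that point.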

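In the first step, an unsupervised representation learning algorithm produces a feature map $\mathbf{f} : \mathcal{X} \to \mathbb{R}^d$. The contrastive loss associated with $\mathbf{f}$ is defined as

$$L_{\mathrm{un}}(\mathbf{f}) \triangleq \mathbb{E}_{(\mathbf{x}, \mathbf{x}^+, \mathbf{x}^-) \sim \mathcal{U}} \, \ell\left(\mathbf{f}(\mathbf{x}) \cdot \left[\mathbf{f}(\mathbf{x}^+) - \mathbf{f}(\mathbf{x}^-)\right]\right).$$

More precisely, from the unsupervised training dataset

$$U = \left\{(\mathbf{x}_i, \mathbf{x}_i^+, \mathbf{x}_i^-)\right\}_{i=1}^m \sim \mathcal{U}^m, \tag{3}$$

we are interested in learning the feature map $\mathbf{f}$ that minimises the following empirical contrastive loss:

$$\widehat{L}_{\mathrm{un}}(\mathbf{f}) \triangleq \frac{1}{m} \sum_{i=1}^m \ell\left(\mathbf{f}(\mathbf{x}_i) \cdot \left[\mathbf{f}(\mathbf{x}_i^+) - \mathbf{f}(\mathbf{x}_i^-)\right]\right). \tag{4}$$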
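The empirical contrastive loss of Equation (4) is an average of margin losses over triplets. The sketch below uses the logistic loss and a hypothetical one-dimensional feature map `identity_map`; both are illustrative choices, not the networks used in the experiments.

```python
import math

def logistic_loss(v):
    """Logistic loss log2(1 + exp(-v)) (fine for moderate |v|)."""
    return math.log2(1.0 + math.exp(-v))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def empirical_contrastive_loss(f, triplets, loss=logistic_loss):
    """Average of loss(f(x) . [f(x_plus) - f(x_minus)]) over the triplets."""
    total = 0.0
    for x, x_plus, x_minus in triplets:
        fx, fp, fm = f(x), f(x_plus), f(x_minus)
        margin = dot(fx, [p - q for p, q in zip(fp, fm)])
        total += loss(margin)
    return total / len(triplets)

# Hypothetical feature maps on scalar inputs.
def identity_map(x):
    return [x]

def zero_map(x):
    return [0.0]

triplets = [(1.0, 1.2, -1.0), (-0.5, -0.7, 0.9)]
loss_identity = empirical_contrastive_loss(identity_map, triplets)
loss_zero = empirical_contrastive_loss(zero_map, triplets)
```

A feature map that collapses every input to the origin gives a margin of zero and hence a loss of exactly 1, which is a useful sanity baseline: any informative representation should fall below it.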

In the second step, a supervised learning algorithm is given the mapped dataset , with ,

and returns a predictor . For a fixed pair , the predicted label on an input is then obtained from (recall that ), and we aim to minimise the supervised loss

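$$L_{\mathrm{sup}}(g \circ \mathbf{f}) \triangleq \mathbb{E}_{c \sim \rho_{c^\pm}} \, \mathbb{E}_{\mathbf{x} \sim \mathcal{D}_c} \, \ell\left(y_c \, g(\mathbf{f}(\mathbf{x}))\right) = \mathbb{E}_{(\mathbf{x}, y) \sim \mathcal{S}} \, \ell\left(y \, g(\mathbf{f}(\mathbf{x}))\right).$$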

Given a labelled dataset , the empirical counterpart of the above supervised loss is

$$\widehat{L}_{\mathrm{sup}}(g \circ \mathbf{f}) \triangleq \frac{1}{n} \sum_{i=1}^n \ell\left(y_i \, g(\mathbf{f}(\mathbf{x}_i))\right).$$

Mean classifier. Following Arora et al. (2019), we study the mean classifier defined from the linear function

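$$g_{c^\pm}(\hat{\mathbf{x}}) \triangleq \mathbf{w}_{c^\pm} \cdot \hat{\mathbf{x}},$$

where $\mathbf{w}_{c^\pm} \triangleq \boldsymbol{\mu}_{c^+} - \boldsymbol{\mu}_{c^-}$, and $\boldsymbol{\mu}_c \triangleq \mathbb{E}_{\mathbf{x} \sim \mathcal{D}_c} \mathbf{f}(\mathbf{x})$.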

Then, the supervised average loss of the mean classifier is the expected loss on a dataset whose pair of labels is sampled from the latent class distribution $\rho$:

$$L^{\mu}_{\mathrm{sup}}(\mathbf{f}) \triangleq \mathbb{E}_{c^\pm \sim \rho^2_{\mathrm{wo}}} \, L_{\mathrm{sup}}(g_{c^\pm} \circ \mathbf{f}), \tag{5}$$

with $\rho^2_{\mathrm{wo}}$ being a shorthand notation for the sampling without replacement of two classes among $\mathcal{C}$. Indeed, we want positive and negative samples that are generated by distinct latent class distributions, i.e., $c^+ \neq c^-$.
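A minimal sketch of the mean classifier follows, assuming (as in Arora et al., 2019) that the weight vector is the difference between the two class means of the mapped features, with the expectations replaced by empirical averages. The feature vectors below are hypothetical.

```python
def mean_vector(vectors):
    """Coordinate-wise average of a list of equal-length vectors."""
    d = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(d)]

def mean_classifier(features_plus, features_minus):
    """Return (g, w), where w = mu_plus - mu_minus and g(z) = w . z."""
    mu_plus = mean_vector(features_plus)
    mu_minus = mean_vector(features_minus)
    w = [p - q for p, q in zip(mu_plus, mu_minus)]

    def g(z):
        return sum(wi * zi for wi, zi in zip(w, z))

    return g, w

# Hypothetical mapped features f(x) for the two classes.
features_plus = [[1.0, 0.0], [3.0, 0.0]]     # class c+
features_minus = [[-1.0, 1.0], [-1.0, 3.0]]  # class c-
g, w = mean_classifier(features_plus, features_minus)
score = g([2.0, 0.5])
label = 1 if score > 0 else -1
```

No parameter is fitted in the supervised step here: the classifier is entirely determined by the representation, which is what lets the supervised loss be controlled by the unsupervised one.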

### 2.2 Generalisation Guarantees

A major contribution of the framework introduced by Arora et al. (2019) is that it rigorously links the unsupervised representation learning task and the subsequent prediction task: it provides generalisation guarantees on the supervised average loss of eq. 5 in terms of the empirical contrastive loss in eq. 4. Central to this result is the following lemma, which relates the supervised average loss of the mean classifier to its unsupervised loss.

{lemma}[Arora et al., 2019, Lemma 4.3] Given a latent class distribution $\rho$ on $\mathcal{C}$ and a convex loss $\ell$, for any feature map $\mathbf{f}$, we have

$$L^{\mu}_{\mathrm{sup}}(\mathbf{f}) \le \frac{1}{1 - \tau}\left(L_{\mathrm{un}}(\mathbf{f}) - \tau\right),$$

where $\tau$ is the probability of sampling twice the same latent class ($\mathbb{1}[\cdot]$ is the indicator function):

$$\tau \triangleq \mathbb{E}_{c^\pm \sim \rho^2} \, \mathbb{1}[c^+ = c^-] = \sum_{c \in \mathcal{C}} [\rho(c)]^2. \tag{6}$$
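To make the role of $\tau$ concrete, the sketch below computes the collision probability of Equation (6) and applies the lemma's transformation to a contrastive loss value. The uniform distribution over 100 classes and the loss value 0.5 are hypothetical inputs chosen for illustration.

```python
def collision_probability(rho):
    """tau = sum_c rho(c)^2: probability of drawing the same class twice."""
    return sum(p * p for p in rho.values())

def supervised_bound_from_contrastive(l_un, tau):
    """Right-hand side of the lemma: (L_un - tau) / (1 - tau)."""
    return (l_un - tau) / (1.0 - tau)

# Hypothetical uniform latent class distribution over 100 classes.
rho_uniform = {c: 1.0 / 100 for c in range(100)}
tau = collision_probability(rho_uniform)
bound = supervised_bound_from_contrastive(0.5, tau)

# A skewed distribution collides more often, so tau is larger.
rho_skewed = {"a": 0.9, "b": 0.1}
tau_skewed = collision_probability(rho_skewed)
```

With many balanced classes $\tau$ is small and the supervised bound is close to the contrastive loss itself; with few or imbalanced classes the factor $1/(1-\tau)$ inflates the bound.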

Arora et al. (2019) upper bound the unsupervised contrastive loss in the lemma above by its empirical estimate. The resulting generalisation guarantee is stated in the theorem below. The bound focuses on a class of feature map functions $\mathcal{F}$ through its empirical Rademacher complexity on a training dataset $U$, defined by

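$$\mathcal{R}_U(\mathcal{F}) \triangleq \mathbb{E}_{\boldsymbol{\sigma} \sim \{\pm 1\}^{3dm}} \left( \sup_{\mathbf{f} \in \mathcal{F}} \left[ \boldsymbol{\sigma} \cdot \mathbf{f}_{|U} \right] \right),$$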

where is the concatenation of all feature mapping given by on , and denotes the uniformly sampled Rademacher variables over that “representation” space. {theorem}[Arora et al., 2019, Theorem 4.1] Let be such that , with probability over training samples ,

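$$L^{\mu}_{\mathrm{sup}}(\hat{\mathbf{f}}) \le \frac{1}{1 - \tau}\left(L_{\mathrm{un}}(\mathbf{f}) - \tau\right) + \frac{1}{1 - \tau} \, O\left( B \, \frac{\mathcal{R}_U(\mathcal{F})}{m} + B^2 \sqrt{\frac{\ln \frac{1}{\delta}}{m}} \right),$$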

where  .

## 3 PAC-Bayes Analysis

Among the different techniques to analyse generalisation in statistical learning theory, PAC-Bayes has emerged in the late 90s as a promising alternative to Rademacher complexity. PAC-Bayes (pioneered by Shawe-Taylor and Williamson, 1997; McAllester, 1998; Catoni, 2003, 2004, 2007 – see Guedj, 2019 for a recent survey) consists in obtaining PAC (probably approximately correct, Valiant, 1984) generalisation bounds for Bayesian-flavoured predictors. PAC-Bayes bounds typically hold with arbitrarily high probability and express a trade-off between the empirical risk on the training set and a measure of complexity of the predictors class. A particularity of PAC-Bayes is that the complexity term relies on a divergence measure between a prior belief and a data-dependent posterior distribution (typically the Kullback-Leibler divergence).

### 3.1 Supervised Learning Framework

Let be a prior over a predictor class , which cannot depend on training data, and let be a posterior over the predictor class , which can depend on the training data. Any predictor is a classification function . Most PAC-Bayes results measure the discrepancy between the prior and the posterior distributions through the Kullback–Leibler divergence,

$$\mathrm{KL}(\mathcal{P} \| \mathcal{Q}) \triangleq \mathbb{E}_{h \sim \mathcal{P}} \ln \frac{\mathcal{P}(h)}{\mathcal{Q}(h)}. \tag{7}$$

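Moreover, PAC-Bayes provides bounds on the expected loss of the predictors under the distribution $\mathcal{Q}$. Let us present the classical setup, where the zero-one loss is used. We refer to this loss as the classification risk, denoted by $r$. Given a data-generating distribution $\mathcal{S}$ on $\mathcal{X} \times \mathcal{Y}$, the expected $\mathcal{Q}$-risk is

$$R(\mathcal{Q}) \triangleq \mathbb{E}_{(\mathbf{x}, y) \sim \mathcal{S}} \, \mathbb{E}_{h \sim \mathcal{Q}} \, r(y, h(\mathbf{x})),$$

and the empirical counterpart, i.e., the $\mathcal{Q}$-weighted empirical risk on a training set $S = \{(\mathbf{x}_i, y_i)\}_{i=1}^n \sim \mathcal{S}^n$, is given by

$$\widehat{R}(\mathcal{Q}) \triangleq \frac{1}{n} \sum_{i=1}^n \mathbb{E}_{h \sim \mathcal{Q}} \, r(y_i, h(\mathbf{x}_i)).$$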

The following section 3.1 expresses an upper bound on the risk , from the empirical risk and the posterior-prior divergence . {theorem}[Catoni, 2007, Theorem 1.2.6] Given and a prior over , with probability at least over training samples , over ,

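$$R(\mathcal{Q}) \le \frac{1 - \exp\left(-\lambda \widehat{R}(\mathcal{Q}) - \frac{\mathrm{KL}(\mathcal{Q} \| \mathcal{P}) + \ln \frac{1}{\delta}}{n}\right)}{1 - \exp(-\lambda)}. \tag{8}$$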
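The right-hand side of Equation (8) is straightforward to evaluate once the empirical risk and the divergence are known. The sketch below does so; the numerical inputs are hypothetical and only meant to show the order of magnitude of the complexity penalty.

```python
import math

def catoni_bound(emp_risk, kl, n, lam, delta):
    """Right-hand side of Eq. (8) for a loss with values in [0, 1]."""
    exponent = -lam * emp_risk - (kl + math.log(1.0 / delta)) / n
    return (1.0 - math.exp(exponent)) / (1.0 - math.exp(-lam))

# Hypothetical values: 10% empirical risk, KL of 50 nats, 10,000 samples.
bound = catoni_bound(emp_risk=0.1, kl=50.0, n=10_000, lam=1.0, delta=0.05)

# The bound grows with the divergence between posterior and prior.
bound_large_kl = catoni_bound(emp_risk=0.1, kl=5000.0, n=10_000, lam=1.0, delta=0.05)
```

The parameter `lam` trades off the weight of the empirical risk against the complexity term; in practice several values are tried and the tightest valid bound is reported.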

### 3.2 PAC-Bayes Representation Learning

We now proceed to the first of our contributions. We prove a PAC-Bayesian bound on the contrastive unsupervised representation loss, by replacing the Rademacher complexity in the theorem of section 2.2 with a Kullback-Leibler divergence. To do so, we consider a prior $\mathcal{P}$ and a posterior $\mathcal{Q}$ over a class of feature mapping functions $\mathcal{F}$. Note that our PAC-Bayesian analysis for a multi-class extension can be found in section A.2.

First, let us remark that we can adapt section 3.1 to a bound on the unsupervised expected contrastive risk defined as

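$$R_{\mathrm{un}}(\mathcal{Q}) \triangleq \mathbb{E}_{(\mathbf{x}, \mathbf{x}^+, \mathbf{x}^-) \sim \mathcal{U}} \, \mathbb{E}_{\mathbf{f} \sim \mathcal{Q}} \, r\left(\mathbf{f}(\mathbf{x}^+) - \mathbf{f}(\mathbf{x}^-), \mathbf{f}(\mathbf{x})\right),$$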

where is the zero-one loss extended to vector arguments. We denote the empirical counterpart of computed on the unsupervised training set . Once expressed this way, section 3.1—devoted to classical supervised learning—can be straightforwardly adapted for the expected contrastive risk. Thus, we obtain the following section 3.2. {corollary} Given and a prior over , with probability at least over training samples , over ,

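$$R_{\mathrm{un}}(\mathcal{Q}) \le \frac{1 - \exp\left(-\lambda \widehat{R}_{\mathrm{un}}(\mathcal{Q}) - \frac{\mathrm{KL}(\mathcal{Q} \| \mathcal{P}) + \ln \frac{1}{\delta}}{m}\right)}{1 - \exp(-\lambda)}.$$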

Unfortunately, the bound on the contrastive risk does not translate directly to a bound on the supervised average risk

$$R^{\mu}_{\mathrm{sup}}(\mathbf{f}) \triangleq \mathbb{E}_{c^\pm \sim \rho^2_{\mathrm{wo}}} \, R_{\mathrm{sup}}(g_{c^\pm} \circ \mathbf{f}). \tag{9}$$

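This is because the zero-one loss is not convex, preventing us from applying the lemma of section 2.2 to obtain a result analogous to the theorem of section 2.2. However, note that both loss functions defined by Equations (1-2) are upper bounds on the zero-one loss:

$$\forall \mathbf{y}, \hat{\mathbf{y}} \in \mathbb{R}^d : \; r(\mathbf{y}, \hat{\mathbf{y}}) \le \ell(\mathbf{y} \cdot \hat{\mathbf{y}}), \quad \text{with } \ell \in \{\ell_{\log}, \ell_{\mathrm{hinge}}\}.$$

Henceforth, we study the expected loss

$$L^{\mu}_{\mathrm{sup}}(\mathcal{Q}) = \mathbb{E}_{\mathbf{f} \sim \mathcal{Q}} \, L^{\mu}_{\mathrm{sup}}(\mathbf{f})$$

with regard to

$$L_{\mathrm{un}}(\mathcal{Q}) = \mathbb{E}_{\mathbf{f} \sim \mathcal{Q}} \, L_{\mathrm{un}}(\mathbf{f}).$$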

By assuming that the representation vectors are bounded, \ie, for some as in section 2.2, we also have that the loss function is bounded. Thus, by rescaling in the loss function, section 3.1 can be used to derive the following section 3.2, which is the PAC-Bayesian doppelgänger of section 2.2. {theorem} Let such that for all . Given and a prior over , with probability at least over training samples , over ,

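$$L^{\mu}_{\mathrm{sup}}(\mathcal{Q}) \le \frac{1}{1 - \tau}\left( B_\ell \, \frac{1 - \exp\left(-\frac{\lambda}{B_\ell} \widehat{L}_{\mathrm{un}}(\mathcal{Q}) - \frac{\mathrm{KL}(\mathcal{Q} \| \mathcal{P}) + \ln \frac{1}{\delta}}{m}\right)}{1 - \exp(-\lambda)} - \tau \right), \tag{10}$$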

with and given by Eq. (6).

###### Proof.

Since , we have :

$$-2B \le \mathbf{f}(\mathbf{x}) \cdot \left[\mathbf{f}(\mathbf{x}^+) - \mathbf{f}(\mathbf{x}^-)\right] \le 2B.$$

Thus, , as is both convex and positive. Therefore, the output of the rescaled loss function belongs to . From that point, we apply section 3.1 to obtain2, with probability at least ,

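$$\frac{1}{B_\ell} L_{\mathrm{un}}(\mathcal{Q}) \le \frac{1 - \exp\left(-\frac{\lambda}{B_\ell} \widehat{L}_{\mathrm{un}}(\mathcal{Q}) - \frac{\mathrm{KL}(\mathcal{Q} \| \mathcal{P}) + \ln \frac{1}{\delta}}{m}\right)}{1 - \exp(-\lambda)}.$$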

Also, since the inequality stated in section 2.2 holds true for all , taking the expected value according to gives

$$L^{\mu}_{\mathrm{sup}}(\mathcal{Q}) \le \frac{1}{1 - \tau}\left(L_{\mathrm{un}}(\mathcal{Q}) - \tau\right).$$

The desired result is obtained by replacing in the equation above by its bound in terms of . ∎
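The two steps of the proof (a Catoni-type bound on the rescaled contrastive loss, followed by the class-collision correction) compose into a single expression, which the sketch below evaluates. All numerical inputs, including the loss bound `b_ell`, are hypothetical and passed in as plain arguments.

```python
import math

def curl_pac_bayes_bound(emp_loss, kl, m, lam, delta, b_ell, tau):
    """Right-hand side of Eq. (10).

    emp_loss : empirical contrastive loss of the posterior, in [0, b_ell]
    b_ell    : upper bound on the loss, used to rescale it into [0, 1]
    tau      : class-collision probability of Eq. (6)
    """
    # Step 1: Catoni-type bound applied to the rescaled loss.
    exponent = -(lam / b_ell) * emp_loss - (kl + math.log(1.0 / delta)) / m
    rescaled = (1.0 - math.exp(exponent)) / (1.0 - math.exp(-lam))
    # Step 2: undo the rescaling, then apply the lemma of section 2.2.
    return (b_ell * rescaled - tau) / (1.0 - tau)

# Hypothetical values for illustration only.
bound = curl_pac_bayes_bound(
    emp_loss=0.6, kl=100.0, m=50_000, lam=1.0, delta=0.05, b_ell=2.0, tau=0.01
)
plug_in = (0.6 - 0.01) / (1.0 - 0.01)  # lemma applied to the empirical loss
```

Comparing `bound` with `plug_in` shows how much of the guarantee is paid for by the complexity term and the choice of `lam`, as opposed to the empirical loss itself.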

The Rademacher bound of section 2.2 and the PAC-Bayes bound of section 3.2 convey a similar message: finding a good representation mapping (in terms of the empirical contrastive loss) guarantees good generalisation, on average, on the supervised tasks. An asset of the PAC-Bayesian bound lies in the fact that its exact value is easier to compute than the Rademacher one. Indeed, for a well chosen prior-posterior family, the complexity term $\mathrm{KL}(\mathcal{Q} \| \mathcal{P})$ has a closed-form solution, while computing $\mathcal{R}_U(\mathcal{F})$ involves a combinatorial complexity. From an algorithm design perspective, the fact that $\mathrm{KL}(\mathcal{Q} \| \mathcal{P})$ varies with $\mathcal{Q}$ suggests a trade-off between accuracy and complexity to drive the learning process, while $\mathcal{R}_U(\mathcal{F})$ is constant for a given choice of class $\mathcal{F}$. We leverage these assets to propose a bound-driven optimisation procedure for neural networks in section 4.

Note that one could be interested to study the risk of a predictor learned on the representation of the supervised data instead of the mean classifier’s risk. As discussed in section A.3, the loss of the best supervised predictor is at least as good as the mean classifier’s one.

### 3.3 Relaxing the Iid Assumption

An interesting byproduct of Arora et al. (2019)’s approach is that the proof of the main bound (section 2.2) is modular: we mean that in the proof of section 3.2, instead of plugging in Catoni’s bound (section 3.1), we can use any relevant bound. We therefore leverage the recent work of Alquier and Guedj (2018) who proved a PAC-Bayes generalisation bound which no longer needs to assume that data are iid, and even holds when the data-generating distribution is heavy-tailed. We can therefore cast our results onto the non-iid setting.

We believe removing the iid assumption is especially relevant for contrastive unsupervised learning, as we deal with triplets of data points governed by a relational causal link (similar and dissimilar examples). In fact, several contrastive representation learning algorithms violate the iid assumption (Goroshin et al., 2015; Logeswaran and Lee, 2018).

Alquier and Guedj (2018)’s framework replaces the Kullback-Leibler divergence in the PAC-Bayes bound with the more general class of $f$-divergences (see Csiszár and Shields, 2004, for an introduction). Given a convex function $f$ such that $f(1) = 0$, the $f$-divergence between two probability distributions is given by

$$D_f(\mathcal{P} \| \mathcal{Q}) = \mathbb{E}_{h \sim \mathcal{Q}} \, f\left(\frac{\mathcal{P}(h)}{\mathcal{Q}(h)}\right). \tag{11}$$
{theorem}

Given and a prior over , with probability at least over ,

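$$L^{\mu}_{\mathrm{sup}}(\mathcal{Q}) \le \frac{1}{1 - \tau}\left(\widehat{L}_{\mathrm{un}}(\mathcal{Q}) - \tau\right) + \frac{1}{1 - \tau}\left(\frac{\mathcal{M}_q}{\delta}\right)^{\frac{1}{q}} \left(D_{\phi_p - 1}(\mathcal{Q} \| \mathcal{P}) + 1\right)^{\frac{1}{p}}, \tag{12}$$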

where (recall that depends on , see Eqs. 3 and 4) and . The proof is a straightforward combination of aforementioned results, substituting Theorem 1 in Alquier and Guedj (2018) to Catoni’s bound (section 3.1) in the proof of section 3.2.

To our knowledge, section 3.3 is the first generalisation bound for contrastive unsupervised representation learning that holds without the iid assumption, therefore extending the framework introduced by Arora et al. (2019) in a non-trivial and promising direction. Note that section 3.3 does not require the iid assumption for either the unsupervised or the supervised step.

## 4 From Bounds to Algorithms

In this section, we propose contrastive unsupervised representation learning algorithms derived from the PAC-Bayes bounds stated in sections 3.2 and 3.3. The algorithms are obtained by optimising the weights of a neural network by minimising the right-hand side of (10) and (12), respectively. Our training method is inspired by the work of Dziugaite and Roy (2017), who optimise a PAC-Bayesian bound in a supervised classification framework, and show that it leads to non-vacuous bound values and accurately detects overfitting.

### 4.1 Neural Network Optimisations

#### Algorithm based on section 3.2

We consider a neural network architecture with $N$ real-valued learning parameters. Let us denote by $\mathbf{w} \in \mathbb{R}^N$ the concatenation into a single vector of all the weights, and by $\mathbf{f}_{\mathbf{w}}$ the neural network whose output is a $d$-dimensional representation vector of its input. From now on, $\mathcal{F}$ is the set of all possible neural networks for the chosen architecture. We restrict the posterior and prior over $\mathcal{F}$ to be Gaussian distributions, that is

$$\mathcal{Q} \triangleq \mathcal{N}\left(\boldsymbol{\mu}_{\mathcal{Q}}, \mathrm{diag}(\boldsymbol{\sigma}^2_{\mathcal{Q}})\right), \qquad \mathcal{P} \triangleq \mathcal{N}\left(\boldsymbol{\mu}_{\mathcal{P}}, \sigma^2_{\mathcal{P}} I\right),$$

where , , and .

Given a fixed in section 3.2, since is a constant value, minimising the upper bound is equivalent to minimising the following expression3

$$\lambda \, m \, \widehat{L}_{\mathrm{un}}(\mathcal{Q}) + \mathrm{KL}(\mathcal{Q} \| \mathcal{P}) + \ln \frac{1}{\delta}. \tag{13}$$

Since is still intractable (as it is expressed as the expectation with respect to the posterior distribution on predictors), we resort to an unbiased estimator; the weights parameter is sampled at each iteration of a gradient descent, according to

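$$\mathbf{w} = \boldsymbol{\mu}_{\mathcal{Q}} + \boldsymbol{\sigma}_{\mathcal{Q}} \odot \boldsymbol{\epsilon}, \quad \text{with } \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, I),$$

the symbol $\odot$ being the element-wise product. Therefore we optimise the posterior’s parameters $\boldsymbol{\mu}_{\mathcal{Q}}$ and $\boldsymbol{\sigma}^2_{\mathcal{Q}}$. In addition, we optimise the prior variance $\sigma^2_{\mathcal{P}}$ in the same way as Dziugaite and Roy (2017, Section 3.1). That is, given fixed $b$ and $c$, we consider the bound value for

$$\sigma^2_{\mathcal{P}} \in \left\{ c \exp\left(-\frac{j}{b}\right) \;\middle|\; j \in \mathbb{N} \right\}. \tag{14}$$

From the union bound argument, the obtained result is valid with probability $1 - \delta$ by computing each bound with a confidence parameter $\delta_j \triangleq \frac{6 \delta}{\pi^2 j^2}$, where $j = b \ln \frac{c}{\sigma^2_{\mathcal{P}}}$.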
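The two ingredients above, the reparameterised weight sampling and the discretised prior variance with its union-bound confidence, can be sketched as follows. This is a stdlib-only illustration: the values of `b`, `c` and `delta` are hypothetical, and the per-index confidence $\delta_j = 6\delta/(\pi^2 j^2)$ is assumed to follow Dziugaite and Roy (2017). A real implementation would use an automatic-differentiation library so that gradients flow through the sampled weights.

```python
import math
import random

def sample_weights(mu_q, sigma_q, rng):
    """Reparameterised draw w = mu_Q + sigma_Q * eps, with eps ~ N(0, I)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu_q, sigma_q)]

def prior_variance(j, b, c):
    """Grid point sigma_P^2 = c * exp(-j / b) of Eq. (14)."""
    return c * math.exp(-j / b)

def grid_index(sigma2_p, b, c):
    """Inverse map j = b * ln(c / sigma_P^2)."""
    return b * math.log(c / sigma2_p)

def confidence_for_index(delta, j):
    """delta_j = 6 delta / (pi^2 j^2); summing over j = 1, 2, ... gives delta."""
    return 6.0 * delta / (math.pi ** 2 * j ** 2)

b, c, delta = 100.0, 0.1, 0.05  # hypothetical constants
rng = random.Random(0)
w = sample_weights([0.0, 1.0], [0.1, 0.2], rng)
total_delta = sum(confidence_for_index(delta, j) for j in range(1, 100_001))
```

Because the confidences $\delta_j$ sum to $\delta$, every grid point can be tried and the best bound kept without invalidating the overall guarantee; the price is the extra logarithmic term that appears in the final objective.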

Given and , our final objective based on section 3.2 is

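$$\min_{\boldsymbol{\mu}_{\mathcal{Q}}, \boldsymbol{\sigma}^2_{\mathcal{Q}}, \sigma^2_{\mathcal{P}}} \; \lambda \, m \, \widehat{L}_{\mathrm{un}}(\mathcal{Q}) + \mathrm{KL}(\mathcal{Q} \| \mathcal{P}) + 2 \ln\left(b \ln \frac{c}{\sigma^2_{\mathcal{P}}}\right),$$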

where

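$$\mathrm{KL}(\mathcal{Q} \| \mathcal{P}) = \frac{1}{2}\left( \frac{\|\boldsymbol{\mu}_{\mathcal{Q}} - \boldsymbol{\mu}_{\mathcal{P}}\|_2^2}{\sigma^2_{\mathcal{P}}} - N + \frac{\|\boldsymbol{\sigma}^2_{\mathcal{Q}}\|_1}{\sigma^2_{\mathcal{P}}} + N \ln \sigma^2_{\mathcal{P}} - \sum_{i=1}^N \ln \sigma^2_{\mathcal{Q}, i} \right).$$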
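The closed-form divergence between a diagonal Gaussian posterior and an isotropic Gaussian prior translates directly into code. The sketch below is a plain-Python transcription of the expression above; the function and argument names are ours.

```python
import math

def kl_diag_vs_isotropic(mu_q, sigma2_q, mu_p, sigma2_p):
    """KL( N(mu_q, diag(sigma2_q)) || N(mu_p, sigma2_p * I) ).

    mu_q, mu_p : mean vectors (lists of length N)
    sigma2_q   : per-coordinate posterior variances (list of length N)
    sigma2_p   : scalar prior variance
    """
    n = len(mu_q)
    sq_dist = sum((a - b) ** 2 for a, b in zip(mu_q, mu_p))
    trace_term = sum(sigma2_q) / sigma2_p
    log_det_term = n * math.log(sigma2_p) - sum(math.log(s) for s in sigma2_q)
    return 0.5 * (sq_dist / sigma2_p - n + trace_term + log_det_term)

# Identical distributions have zero divergence.
kl_same = kl_diag_vs_isotropic([0.3, -1.0], [0.5, 0.5], [0.3, -1.0], 0.5)
# Unit-variance Gaussians whose means differ by 1 in one dimension: KL = 1/2.
kl_shift = kl_diag_vs_isotropic([1.0], [1.0], [0.0], 1.0)
```

Since every term is differentiable in the posterior mean, the posterior variances and the prior variance, this expression can be minimised jointly with the empirical loss by gradient descent.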

#### Algorithm based on section 3.3

We consider the same neural network architecture, prior, and posterior as in section 4.1.1.

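We specify $p = q = 2$ in the bound of section 3.3 to use a more familiar divergence, the $\chi^2$-divergence, rather than the general $D_{\phi_p - 1}$-divergence. Then, minimising the upper bound is equivalent to minimising the following expression:

$$\widehat{L}_{\mathrm{un}}(\mathcal{Q}) + \sqrt{\frac{\mathcal{M}_2}{\delta}\left(\chi^2(\mathcal{Q} \| \mathcal{P}) + 1\right)}. \tag{15}$$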

Even though we use the unbiased estimator to evaluate the first term, as in the iid algorithm, the objective is still intractable since the moment $\mathcal{M}_2$ requires the test loss $L_{\mathrm{un}}$. Thus we assume the existence of an upper bound on the covariance of the contrastive loss to bound $\mathcal{M}_2$ as follows:

$$\mathrm{Cov}\left(\ell(\mathbf{z}_i), \ell(\mathbf{z}_j)\right) \begin{cases} \le B_\ell^2 & \text{if } i - T \le j \le i + T, \\ = 0 & \text{otherwise}, \end{cases} \tag{16}$$

where $T$ is the length of dependency to generate similarity pairs $(\mathbf{x}, \mathbf{x}^+)$.

Note that Popoviciu’s inequality (Sharma et al., 2010) gives a tighter bound on the covariance: $\mathrm{Cov}\left(\ell(\mathbf{z}_i), \ell(\mathbf{z}_j)\right) \le B_\ell^2 / 4$, when $i = j$.

Intuitively, the assumption sounds natural for CURL on sequential data (Mikolov et al., 2013; Goroshin et al., 2015), where a positive sample $\mathbf{x}^+$ appears among the neighbours of sample $\mathbf{x}$ in a time series.

Given and , our final objective is

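$$\min_{\boldsymbol{\mu}_{\mathcal{Q}}, \boldsymbol{\sigma}^2_{\mathcal{Q}}, \sigma^2_{\mathcal{P}}} \; \widehat{L}_{\mathrm{un}}(\mathcal{Q}) + \pi \left(b \ln \frac{c}{\sigma^2_{\mathcal{P}}}\right) \sqrt{\frac{B_\ell^2}{24 \, m \, \delta} (1 + 8T) \left(\chi^2(\mathcal{Q} \| \mathcal{P}) + 1\right)}, \tag{17}$$

where the full expression of the $\chi^2$-divergence is found in appendix C. The objective is obtained by using the covariance assumption and the union bound for the prior’s variance $\sigma^2_{\mathcal{P}}$.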
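One way to recover the constants in the objective is sketched below. It assumes that $\mathcal{M}_2$ is the prior-averaged second moment of the deviation between the empirical and expected contrastive losses (as in Alquier and Guedj, 2018), that the empirical loss is unbiased, that each index has at most $2T$ dependent neighbours, and that the per-index confidence is $\delta_j = 6\delta / (\pi^2 j^2)$ with $j = b \ln (c / \sigma^2_{\mathcal{P}})$. These assumptions are ours and are stated only to make the arithmetic explicit.

$$
\begin{aligned}
\mathcal{M}_2
&= \mathbb{E}_{\mathbf{f} \sim \mathcal{P}} \, \frac{1}{m^2} \sum_{i=1}^m \sum_{j=1}^m \mathrm{Cov}\left(\ell(\mathbf{z}_i), \ell(\mathbf{z}_j)\right)
\le \frac{1}{m^2} \sum_{i=1}^m \left( \frac{B_\ell^2}{4} + 2T \, B_\ell^2 \right)
= \frac{B_\ell^2}{4m} (1 + 8T), \\
\sqrt{\frac{\mathcal{M}_2}{\delta_j} \left(\chi^2(\mathcal{Q} \| \mathcal{P}) + 1\right)}
&\le \sqrt{\frac{\pi^2 j^2}{6 \delta} \cdot \frac{B_\ell^2}{4m} (1 + 8T) \left(\chi^2(\mathcal{Q} \| \mathcal{P}) + 1\right)}
= \pi \, j \, \sqrt{\frac{B_\ell^2}{24 \, m \, \delta} (1 + 8T) \left(\chi^2(\mathcal{Q} \| \mathcal{P}) + 1\right)}.
\end{aligned}
$$

The diagonal terms use the Popoviciu bound $B_\ell^2/4$, while the off-diagonal terms within the dependency window use the cruder bound $B_\ell^2$, which is where the factor $(1 + 8T)$ comes from.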

The objective tells us that its value becomes large if is large, where data dependency is long. Therefore collecting independent time-series samples is a more effective way to tighten the bound instead of increasing .

Interestingly, eq. 15 with eq. 16 can be viewed as a generalised bound of Corollary 10 in Bégin et al. (2016). In fact, our objective becomes their bound if our data hold the iid assumption and is the zero-one loss.

### 4.2 Parameter Selection

In the forthcoming experiments (Section 5), we empirically compare the following three criteria for parameter selection: (i) the validation contrastive risk according to the posterior , (ii) the validation contrastive risk of the maximum-a-posteriori network, and (iii) the PAC-Bayes bound associated to the learned .

For the first validation contrastive risk criterion, we select a model with the best hyper-parameters such that it achieves the lowest contrastive risk on the validation data. We approximate in a Monte Carlo fashion by sampling several from .

Empirically, stochastic neural networks learnt by minimising the PAC-Bayes bound perform quite conservatively (Dziugaite and Roy, 2017). Therefore we also use a validation contrastive risk computed with the deterministic neural network being the most likely according to the posterior (i.e., the neural network weights are taken as the mean vector of the posterior, rather than sampled from it).

The last criterion, the PAC-Bayes bound, does not use validation data; it only requires training data. For the algorithm described in section 4.1.1, we select a model with the best hyper-parameters such that it minimises the following PAC-Bayes bound on the contrastive supervised risk :

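$$\min_{\lambda > 0} \left[ \frac{1 - \exp\left(-\lambda \widehat{R}_{\mathrm{un}}(\mathcal{Q}) - \frac{\mathrm{KL}(\mathcal{Q} \| \mathcal{P}) + \ln \frac{\pi^2 j^2}{6} + \ln \frac{2 \sqrt{m}}{\delta}}{m}\right)}{1 - \exp(-\lambda)} \right]. \tag{18}$$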

This criterion is given by section 3.2, where the term $\ln \frac{1}{\delta}$ is replaced by $\ln \frac{\pi^2 j^2}{6} + \ln \frac{2 \sqrt{m}}{\delta}$. The first summand comes from the union bound over the prior’s variances (see eq. 14).

The second summand replaces $\ln \frac{1}{\delta}$ by $\ln \frac{2 \sqrt{m}}{\delta}$, as Letarte et al. (2019, Theorem 3) showed that this suffices to make the bound valid uniformly for all $\lambda > 0$, which allows for minimising the bound over $\lambda$. Note that the learning algorithm minimises a bound on the (differentiable) convex loss, but our model selection bound focuses on the zero-one loss as our task is a classification one.
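The model-selection criterion can be approximated by a simple grid search over $\lambda$, as sketched below. The grid and all numerical inputs are hypothetical; a finer grid or a one-dimensional optimiser could be used instead.

```python
import math

def model_selection_bound(emp_risk, kl, m, delta, j, lambdas):
    """Minimise the bracket of Eq. (18) over a finite grid of lambda values.

    Returns the best bound value and the lambda that achieves it.
    """
    complexity = (
        kl
        + math.log(math.pi ** 2 * j ** 2 / 6.0)   # union bound over prior variances
        + math.log(2.0 * math.sqrt(m) / delta)    # uniformity over lambda
    ) / m
    best_value, best_lam = float("inf"), None
    for lam in lambdas:
        value = (1.0 - math.exp(-lam * emp_risk - complexity)) / (1.0 - math.exp(-lam))
        if value < best_value:
            best_value, best_lam = value, lam
    return best_value, best_lam

# Hypothetical inputs: 20% empirical zero-one contrastive risk, KL of 500 nats.
lambdas = [0.05 * i for i in range(1, 201)]
bound, lam = model_selection_bound(
    emp_risk=0.2, kl=500.0, m=40_000, delta=0.05, j=10, lambdas=lambdas
)
```

Because the bound holds uniformly over $\lambda$, reporting the minimum over the grid does not require any further correction of the confidence level.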

## 5 Numerical Experiments

Our experimental code is publicly available. We implemented all algorithms with PyTorch (Paszke et al., 2019). Herein, we report experiments for the algorithm described in section 4.1.1.

Experiments for the non-iid algorithm are provided in appendix E.

### 5.1 Protocol

Datasets. We use the CIFAR-100 (Krizhevsky, 2009) image classification task, containing images, equally distributed into labels. We created train/validation/test splits of images. We preprocess the images by normalising all pixels per channel based on the training data. We build the unsupervised contrastive learning dataset by considering each label as a latent class, using a block size of and a number of negative samples of  (see appendix A in the supplementary material for the extended theory for block samples and more than one negative sample).

We also use the AUSLAN dataset (Kadous, 2002), which contains labels, each one being a sign language motion, and having dimensional features. We split the dataset into training/validation/test sets. As pre-processing, we normalise feature vectors per dimension based on the training data. The contrastive learning dataset then contains latent classes. The block size and the number of negative samples are the same as in the CIFAR-100 setting. More details are provided in section D.1.

Neural network architectures. For the CIFAR-100 experiments, we use a convolutional neural network (CNN) with two hidden layers. The two hidden layers are convolutions (kernel size of and channels) with the ReLU activation function, followed by max-pooling (kernel size of and stride of ). The final layer is a fully connected linear layer ( neurons) without activation function. For the AUSLAN experiments, we use a fully connected network with one hidden layer and the ReLU activation function. Both the hidden and the last layer have neurons. More architecture details are given in section D.2.

PAC-Bayes bound optimisation. We learn the neural network parameters by minimising the bound given by section 3.2, using the strategy proposed in section 4.1.1. We rely on the logistic loss given by eq. 1. We fix the following PAC-Bayes bound’s parameters: and . The prior variance is initialised at . The prior mean parameters coincide with the random initialisation of the gradient descent.

We repeat the optimisation procedure with different combinations of hyper-parameters. Namely, the PAC-Bayes bound constant is chosen in for CIFAR-100, and in for AUSLAN. We also consider as a hyper-parameter the choice of the gradient descent optimiser, here between RMSProp (Tieleman and Hinton, 2012) and Adam (Kingma and Lei Ba, 2015). The learning rate is in . In all cases, epochs are performed and the learning rate is divided by at the epoch. To select the final model among the ones given by all these hyper-parameter combinations, we experiment with three parameter selection criteria based on the approaches described in section 4.2, as detailed below.

Stochastic validation (s-valid). This metric is obtained by randomly sampling set of network parameters according to the learnt posterior , and averaging the corresponding empirical contrastive loss values computed on validation data. The same procedure is used to perform early stopping during optimisation (we stop the learning process when the loss stops decreasing for consecutive epochs).

Deterministic validation (det-valid). This metric corresponds to the empirical contrastive loss value computed on validation data of the deterministic network , which corresponds to the mean parameters of the posterior (i.e., the maximum-a-posteriori network given by ). Early stopping is performed in the same way as for s-valid.

PAC-Bayes bound (PB). The bound values of the learnt posterior are computed by using eq. 18. Note that since this method does not require validation data, we perform optimisation over the union of the validation data and the training data. We do not perform early stopping since the optimised objective function is directly the parameter selection metric.

Benchmark methods. We compare our results with two benchmarks, described below (more details are provided in section D.3).

Prior contrastive unsupervised learning (Arora et al., 2019). Following the original work, we minimise the empirical contrastive loss $\widehat{L}_{\mathrm{un}}$. Hyper-parameter selection is performed on the validation dataset as for s-valid and det-valid described above.

Supervised learning (supervised). We also train the neural network in a supervised way, using the label information. Following the experiment of Arora et al. (2019), we add a prediction linear layer to our architectures (with output neurons for CIFAR-100, and output neurons for AUSLAN), and minimise the multiclass logistic loss function

$$\ell_{\log}(\mathbf{v}) \triangleq \log_2\left(1 + \sum_{i=1}^{|\mathcal{Y}|} e^{-v_i}\right).$$

Once done, we drop the last (prediction) layer. Then, we use the remaining network to extract feature representation.

### 5.2 Experimental Results

Supervised classification. Table 1 contains supervised accuracies obtained from the representation learnt with the two benchmark methods, as well as with our three parameter selection strategies on the PAC-Bayes learning algorithms. For each method, two types of supervised predictor are used: and - (as in Arora et al., 2019). The classifier is obtained by averaging, per supervised label, the feature vectors mapped from the training data, and the - classifier by averaging the feature vectors of random training samples. For -, we report evaluation scores averaged over samplings for each experiment.

For the two datasets, we report three accuracies on the testing set, described below. Values are calculated by averaging over three repetitions of the whole experiments using different random seeds.
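Average binary accuracy (AVG-2). This is the empirical counterpart of eq. 9, i.e., given a test dataset $T = \{(\mathbf{x}_i, y_i)\}$ where $y_i$ is a latent class, we define $\text{AVG-2}(\mathbf{f}_{\mathbf{w}}) \triangleq 1 - \widehat{R}^{\mu}_{\mathrm{sup}}(\mathbf{f}_{\mathbf{w}})$, given

$$\widehat{R}^{\mu}_{\mathrm{sup}}(\mathbf{f}_{\mathbf{w}}) \triangleq \frac{2}{C(C-1)} \sum_{1 \le c^+ < c^- \le C} \widehat{R}_{T_{c^\pm}}\left(\hat{g}_{c^\pm} \circ \hat{\mathbf{f}}_{\mathbf{w}}\right),$$

where $C$ is the number of latent classes (e.g., $C = 100$ for the CIFAR-100 dataset), $\hat{\mathbf{f}}_{\mathbf{w}}$ is a feature map learnt from the training data, $\hat{g}_{c^\pm}$ is the predictor based on the centre of mass of the training data mapped features of classes $c^\pm$, and $\widehat{R}_{T_{c^\pm}}$ is the supervised risk on the dataset $T_{c^\pm}$:

$$\widehat{R}_{T_{c^\pm}}\left(\hat{g}_{c^\pm} \circ \hat{\mathbf{f}}_{\mathbf{w}}\right) \triangleq \frac{1}{|T_{c^\pm}|} \sum_{(\mathbf{x}, y) \in T_{c^\pm}} r\left(\hat{g}_{c^\pm}(\hat{\mathbf{f}}_{\mathbf{w}}(\mathbf{x})), y\right).$$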

Top-1 accuracy (TOP-1). This is the accuracy on the multiclass labelled test data $T$. We predict the label $\hat{y}_i$ of each test instance. Therefore,

$$\text{TOP-1}(\mathbf{f}_{\mathbf{w}}) \triangleq \frac{1}{|T|} \sum_{i=1}^{|T|} \mathbb{1}[y_i = \hat{y}_i].$$

Top-5 accuracy (TOP-5). For each test instance $\mathbf{x}_i$, let $\hat{Y}_i$ be the set of the five labels having the highest inner products. Then,

$$\text{TOP-5}(\mathbf{f}_{\mathbf{w}}) \triangleq \frac{1}{|T|} \sum_{i=1}^{|T|} \mathbb{1}[y_i \in \hat{Y}_i].$$
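The top-$k$ accuracies can be computed as sketched below. The sketch assumes that each label is scored by the inner product between its class-mean feature vector and the test feature vector, which is our reading of the mean classifier in the multiclass case; the class means and test features are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def top_k_accuracy(features, labels, class_means, k):
    """Fraction of test points whose true label is among the k best-scoring labels."""
    correct = 0
    for z, y in zip(features, labels):
        scores = {c: dot(mu, z) for c, mu in class_means.items()}
        top = sorted(scores, key=scores.get, reverse=True)[:k]
        if y in top:
            correct += 1
    return correct / len(labels)

# Hypothetical class means and mapped test features.
class_means = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [-1.0, 0.0]}
features = [[2.0, 0.5], [0.1, 1.0], [-1.0, 0.2]]
labels = [0, 1, 1]

top1 = top_k_accuracy(features, labels, class_means, k=1)
top2 = top_k_accuracy(features, labels, class_means, k=2)
```

In this toy example the third point is misclassified at rank one but recovered at rank two, which is exactly the kind of near miss that top-$k$ metrics are meant to credit.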

Note that the \topone and \topfive metrics are not supported by theoretical results, in the present paper or the work of Arora et al. (2019). Nevertheless, we report those as an empirical hint of how representations are learnt by our contrastive unsupervised representation learning algorithm.

We observe that the det-valid algorithm achieves results competitive with those of the CURL algorithm studied by Arora et al. (2019).

PAC-Bayesian generalisation bounds. Table 2 shows the PAC-Bayes bound values obtained from eq. 18. The bounds were calculated by using the same models used in Table 1. We also report a training risk and a test risk that we calculated by using only the mean parameter of the posterior as the neural network’s weights. The rows of indicated the optimised values that minimised eq. 18, and thus that correspond to the reported PAC-Bayes bounds. Let us stress that all reported bound values are non-vacuous.

The generalisation bounds obtained with the PB parameter selection criterion are naturally the tightest. For this method, the gap between the empirical risk and the test risk is consistently small. This highlights that the PAC-Bayesian bound minimisation is not prone to overfitting. On the downside, this behaviour seems to promote “conservative” solutions, which in turn gives lower supervised accuracy compared to methods relying on a validation set (as reported in Table 1).

## 6 Conclusion

We extended the framework introduced by Arora et al. (2019), by adopting a PAC-Bayes approach to contrastive unsupervised representation learning. This allows us in particular to (i) derive new algorithms by minimising the bounds, and (ii) remove the iid assumption. While supported by novel generalisation bounds, our approach is also validated by numerical experiments, as the bound yields non-trivial (non-vacuous) values.

## Appendix A Extended PAC-Bayes Bounds

Arora et al. (2019) show two extended generalisation error bounds based on the theorem of section 2.2. We derive the PAC-Bayesian counterpart of each of these extended bounds, based on the theorem of section 3.2. In addition, we provide a PAC-Bayesian analysis of a general supervised classifier instead of the mean classifier.

### A.1 Block Bound

The first extension is to use block pairs for positive and negative samples to make the bound tighter. We also derive a tighter PAC-Bayes bound on the same setting.

Let be the size of blocks. We change the data generation process; Given , we sample and . Given block pairs, unsupervised block loss is defined as

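$$L^{\mathrm{block}}_{\mathrm{un}}(\mathbf{f}) = \mathbb{E}\left\{ \ell\left[ \mathbf{f}(\mathbf{x}) \cdot \left( \frac{\sum_{i=1}^b \mathbf{f}(\mathbf{x}_i^+)}{b} - \frac{\sum_{i=1}^b \mathbf{f}(\mathbf{x}_i^-)}{b} \right) \right] \right\}. \tag{19}$$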

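This block loss lower bounds $L_{\mathrm{un}}$ (Arora et al., 2019, Proposition 6.2): $L^{\mathrm{block}}_{\mathrm{un}}(\mathbf{f}) \le L_{\mathrm{un}}(\mathbf{f})$. Based on this lower bound, when we define $L^{\mathrm{block}}_{\mathrm{un}}(\mathcal{Q}) \triangleq \mathbb{E}_{\mathbf{f} \sim \mathcal{Q}} L^{\mathrm{block}}_{\mathrm{un}}(\mathbf{f})$, we obtain the following lower bound on the unsupervised risk for all $\mathcal{Q}$ over $\mathcal{F}$ by taking the expected value according to $\mathcal{Q}$,

$$L^{\mathrm{block}}_{\mathrm{un}}(\mathcal{Q}) \le L_{\mathrm{un}}(\mathcal{Q}).$$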

Therefore we derive the tighter block bound by combining the previous lower bound and section 3.2. {proposition} over ,

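$$L^{\mu}_{\mathrm{sup}}(\mathcal{Q}) \le \frac{1}{1 - \tau}\left( B_\ell \, \frac{1 - \exp\left(-\frac{\lambda}{B_\ell} \widehat{L}^{\mathrm{block}}_{\mathrm{un}}(\mathcal{Q}) - \frac{\mathrm{KL}(\mathcal{Q} \| \mathcal{P}) + \ln \frac{1}{\delta}}{m}\right)}{1 - \exp(-\lambda)} - \tau \right). \tag{20}$$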

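### A.2 k-Negative Samples Bound

The second extension is to use $k$ negative samples in their framework as a general setting. Following Arora et al. (2019), we consider the data generation process with $k$ negative samples for each pair. Let $\mathcal{U}$ be the process that generates an unlabelled sample $\mathbf{z} = (\mathbf{x}, \mathbf{x}^+, \mathbf{x}_1^-, \dots, \mathbf{x}_k^-)$ according to the following scheme:

1. Draw $k+1$ latent classes $(c^+, c_1^-, \dots, c_k^-) \sim \rho^{k+1}$;
2. Draw two similar samples $(\mathbf{x}, \mathbf{x}^+) \sim (\mathcal{D}_{c^+})^2$;
3. Draw $k$ negative samples $\mathbf{x}_i^- \sim \mathcal{D}_{c_i^-}$, for $i = 1, \dots, k$.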

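We extend the loss functions to a vector $\mathbf{v}$ of size $k$. We use two convex loss functions:

$$\ell_{\log}(\mathbf{v}) \triangleq \log_2\left(1 + \sum_{i=1}^k e^{-v_i}\right) \quad \text{(logistic loss)}, \tag{21}$$

$$\ell_{\mathrm{hinge}}(\mathbf{v}) \triangleq \max\left[0, 1 + \max_i(-v_i)\right] \quad \text{(hinge loss)}. \tag{22}$$

Then we define the unsupervised contrastive loss and the empirical contrastive loss with $k$ negative samples:

$$L_{\mathrm{un}}(\mathbf{f}) \triangleq \mathbb{E}_{\mathbf{z} \sim \mathcal{U}} \, \ell\left(\left\{\mathbf{f}(\mathbf{x}) \cdot \left[\mathbf{f}(\mathbf{x}^+) - \mathbf{f}(\mathbf{x}_i^-)\right]\right\}_{i=1}^k\right), \tag{23}$$

$$\widehat{L}_{\mathrm{un}}(\mathbf{f}) \triangleq \frac{1}{m} \sum_{j=1}^m \ell\left(\left\{\mathbf{f}(\mathbf{x}_j) \cdot \left[\mathbf{f}(\mathbf{x}_j^+) - \mathbf{f}(\mathbf{x}_{ji}^-)\right]\right\}_{i=1}^k\right). \tag{24}$$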
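The vector-valued losses and the resulting empirical loss with $k$ negative samples can be sketched as follows. The feature map and the data are hypothetical, and each unlabelled sample is represented as a tuple `(x, x_plus, negatives)` for convenience.

```python
import math

def logistic_loss_k(v):
    """log2(1 + sum_i exp(-v_i)) for a vector v of k margins."""
    return math.log2(1.0 + sum(math.exp(-vi) for vi in v))

def hinge_loss_k(v):
    """max[0, 1 + max_i(-v_i)] for a vector v of k margins."""
    return max(0.0, 1.0 + max(-vi for vi in v))

def dot(u, w):
    return sum(a * b for a, b in zip(u, w))

def empirical_contrastive_loss_k(f, samples, loss=logistic_loss_k):
    """Average loss over samples of the form (x, x_plus, [x_minus_1, ..., x_minus_k])."""
    total = 0.0
    for x, x_plus, negatives in samples:
        fx, fp = f(x), f(x_plus)
        margins = [
            dot(fx, [p - q for p, q in zip(fp, f(x_minus))])
            for x_minus in negatives
        ]
        total += loss(margins)
    return total / len(samples)

# Hypothetical 1-D feature map and two samples with k = 2 negatives each.
def identity_map(x):
    return [x]

samples = [(1.0, 1.2, [-1.0, -0.5]), (-0.5, -0.7, [0.9, 1.1])]
loss_value = empirical_contrastive_loss_k(identity_map, samples)
```

With a single negative sample both losses reduce to their binary counterparts, since the vector of margins then has one entry.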

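We analyse a mean classifier as in the $k = 1$ scenario. Let $\mathcal{T}$ be the set of supervised classes, whose size is $k+1$, let $\mathcal{D}$ be the distribution over $\mathcal{T}$, and let $\mathcal{D}_{\mathcal{T}}$ be the distribution over the classes in $\mathcal{T}$. The supervised average loss of the mean classifier with $k$ negative samples is defined as

$$L^{\mu}_{\mathrm{sup}}(\mathbf{f}) = \mathbb{E}_{\mathcal{T} \sim \mathcal{D}} \, L^{\mu}_{\mathrm{sup}}(\mathcal{T}, \mathbf{f}) = \mathbb{E}_{\mathcal{T} \sim \mathcal{D}} \, \mathbb{E}_{c \sim \mathcal{D}_{\mathcal{T}}} \, \mathbb{E}_{\mathbf{x} \sim \mathcal{D}_c} \left[ \ell\left(\left\{\mathbf{f}(\mathbf{x}) \cdot (\boldsymbol{\mu}_c - \boldsymbol{\mu}_{c'})\right\}_{c' \neq c}\right) \right]. \tag{25}$$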

To introduce the counterpart of section 2.2 for negative samples, we introduce notations related to the extended class collision. Let