Learning De-biased Representations with Biased Representations
Many machine learning algorithms are trained and evaluated by splitting data from a single source into training and test sets. While such focus on in-distribution learning scenarios has led to interesting advancements, it cannot tell whether models are relying on dataset biases as shortcuts for successful prediction (e.g., using snow cues for recognising snowmobiles). Such biased models fail to generalise when the bias shifts to a different class. The cross-bias generalisation problem has been addressed by de-biasing training data through augmentation or re-sampling, which are often prohibitive due to the data collection cost (e.g., collecting images of a snowmobile in a desert) and the difficulty of quantifying or expressing biases in the first place. In this work, we propose a novel framework to train a de-biased representation by encouraging it to be different from a set of representations that are biased by design. This tactic is feasible in many scenarios where it is much easier to define a set of biased representations than to define and quantify bias. We demonstrate the efficacy of our method across a variety of synthetic and real-world biases. Our experiments and analyses show that the method discourages models from taking bias shortcuts, resulting in improved generalisation.
Most machine learning algorithms are trained and evaluated by randomly splitting a single source of data into training and test sets. Although this is a standard protocol, it is blind to a critical problem: the reliance on dataset bias (Torralba & Efros, 2011). For instance, many frog images are taken in swamp scenes, but swamp itself is not a frog. Nonetheless, a model will exploit this bias (i.e., take “shortcuts”) if it yields correct predictions for the majority of training examples. If the bias is sufficient to achieve high accuracy, there is little motivation for models to learn the complexity of the intended task, despite its full capacity to do so. Consequently, a model that relies on bias will achieve high in-distribution accuracy, yet fail to generalise when the bias shifts.
We tackle this “cross-bias generalisation” problem where a model does not exploit its full capacity due to the “sufficiency” of bias cues for prediction of the target label in the training data. For example, language models make predictions based on the presence of certain words (e.g., “not” for “contradiction”) (Gururangan et al., 2018) without much reasoning on the actual meaning of sentences, even though they are in principle capable of sophisticated reasoning. Similarly, convolutional neural networks (CNNs) achieve high accuracy on image classification by using local texture cues as a shortcut, as opposed to more reliable global shape cues (Geirhos et al., 2019; Brendel & Bethge, 2019). 3D CNNs achieve high accuracy on video action recognition by relying on static cues as a shortcut rather than capturing temporal actions (Weinzaepfel & Rogez, 2019; Li et al., 2018; Li & Vasconcelos, 2019).
Existing methods attempt to remove a model’s dependency on bias by de-biasing the training data through augmentation (Geirhos et al., 2019) or by introducing a pre-defined set of biases that a model is trained to be independent of (Wang et al., 2019a). Other approaches (Clark et al., 2019; Cadene et al., 2019) learn a biased model given the source of bias as input, and de-bias through logit re-weighting or logit ensembling. These prior studies assume that biases can be easily defined or quantified, but real-world biases often cannot (e.g., the texture or static biases above).
To address this limitation, we propose a novel framework to train a de-biased representation by encouraging it to be statistically independent from a set of representations that are biased by design. We use the Hilbert-Schmidt Independence Criterion (Gretton et al., 2005) to formulate the statistical independence as a regularisation term. Our insight is that there are certain types of bias that can be easily captured by defining a bias-characterising model (e.g., CNNs of smaller receptive fields for texture bias; 2D CNNs for static bias in videos). Experiments show that our method is effective in reducing a model’s dependency on “shortcuts” in training data, as evidenced by improved accuracy in test data where the bias is shifted or removed.
2 Problem Definition
We provide a rigorous definition of our over-arching goal: overcoming the bias in models trained on biased data. We systematically categorise the learning scenarios and cross-bias generalisation strategies.
2.1 Cross-bias generalisation
We first define random variables: the signal $S$ and the bias $B$, as cues for the recognition of an input $X$ as a certain target variable $Y$. Signals $S$ are the cues essential for the recognition of $X$ as $Y$; examples include the shape and skin patterns of frogs for frog image classification. Biases $B$, on the other hand, are cues not essential for the recognition but correlated with the target $Y$; many frog images are taken in swamp scenes, so swamp scenes can be considered as $B$. A key property of $B$ is that intervening on $B$ should not change $Y$; moving a frog from a swamp to a desert scene does not change the “frogness”. We assume that the true predictive distribution factorises as $p(y \mid s, b) = p(y \mid s)$, signifying the sufficiency of $S$ for recognition.
Under this framework, three learning scenarios are identified depending on how the joint distribution $p(s, b, y)$ changes across the training and test distributions, $p^{\text{tr}}(s, b, y)$ and $p^{\text{te}}(s, b, y)$, respectively: in-distribution, cross-domain, and cross-bias generalisation. See Figure 1 for a summary.
In-distribution. $p^{\text{tr}}(s, b, y) = p^{\text{te}}(s, b, y)$. This is the standard learning setup utilised in many benchmarks by splitting data from a single source into training and test data at random.
Cross-domain. $p^{\text{tr}}(s, y) = p^{\text{te}}(s, y)$ and furthermore $p^{\text{tr}}(b) \neq p^{\text{te}}(b)$, with $b$ independent of $y$ within each distribution. $B$ in this case is often referred to as the “domain”. For example, training data consist of images with ($Y$=frog, $B$=wilderness) and ($Y$=bird, $B$=wilderness), while test data contain ($Y$=frog, $B$=indoors) and ($Y$=bird, $B$=indoors). This scenario is typically simulated by training and testing on different datasets (Ben-David et al., 2007).
Cross-bias. $p^{\text{tr}}(b \mid y) \neq p^{\text{te}}(b \mid y)$: the bias is correlated with the target in the training data, and the correlation shifts or disappears at test time. For example, training data consist of ($Y$=frog, $B$=swamp) and ($Y$=bird, $B$=sky), while test data contain the unusual combinations ($Y$=frog, $B$=sky) and ($Y$=bird, $B$=swamp). This is the scenario addressed in this work.
2.2 Existing cross-bias generalisation methods and their assumptions
Under cross-bias generalisation scenarios, the dependency between $B$ and $Y$ in the training data makes the bias a viable cue for recognition. A model trained on such data becomes susceptible to interventions on $B$, limiting its generalisability when the bias is changed or removed in the test data. There exist prior approaches to this problem, but with different types and amounts of assumptions on $B$. We briefly recap the approaches based on the assumptions they require. In the next part, §2.3, we define our problem setting, which requires an assumption distinct from the ones in prior approaches.
When an algorithm to disentangle bias and signal exists. Being able to disentangle $S$ and $B$ lets one collapse the feature space corresponding to $B$ in both training and test data. A model trained on such normalised data then becomes free of biases. As ideal as it is, building a model to perfectly disentangle $S$ and $B$ is often unrealistic (e.g., texture bias (Geirhos et al., 2019)). Thus, existing methods have proposed different approaches to tackle cross-bias generalisation.
When a data collection procedure or generative algorithm for unusual bias-target combinations exists. When additional examples can be supplied for the under-represented $(b, y)$ combinations, the training dataset itself can be de-biased, i.e., $B$ is made independent of $Y$ in the training data. Such a data augmentation strategy is indeed a valid solution adopted by many prior studies. Some approaches have proposed to collect additional data to balance out the bias (Panda et al., 2018). Other approaches have proposed to synthesise data with a generative algorithm through image stylisation (Geirhos et al., 2019), object removal (Agarwal et al., 2019; Shetty et al., 2019), or generation of diverse, semantically similar linguistic variations (Shah et al., 2019; Ray et al., 2019). However, collecting unusual inputs can be expensive (Peyre et al., 2017), and building a generative model with pre-defined bias types (Geirhos et al., 2019) may suffer from bias mis-specification or a lack of realism.
When a ground truth or predictive algorithm for the bias exists. Conversely, when one can tell the bias $b$ for every input $x$, we can remove the dependency between the model predictions $f(X)$ and the bias $B$. The knowledge of $B$ is provided in many realistic scenarios. For example, when the aim is to remove gender biases in a job application process, applicants’ genders are supplied as ground truths. Many existing approaches for fairness in machine learning have proposed independence-based regularisers to encourage $f(X) \perp B$ (Zemel et al., 2013) or the conditional independence $f(X) \perp B \mid Y$ (Quadrianto et al., 2019; Hardt et al., 2016). Other approaches have proposed to remove the predictability of $B$ from $f(X)$ through domain adversarial losses (Wang et al., 2019b) or mutual information minimisation (Kim et al., 2019). When the ground truth of $B$ is not provided, another approach quantifies texture bias via the neural gray-level co-occurrence matrix and encourages independence through projection (Wang et al., 2019a). However, there exist cases where $B$ is difficult even to define or quantify and can only be indirectly specified.
2.3 Our scenario: Capturing bias with a set of models
Under the cross-bias generalisation scenario, some biases are not easily addressed by the above methods. Take texture bias as an example (§1, Geirhos et al. (2019)): (1) texture and shape cannot easily be disentangled, (2) collecting unusual images or building a generative model is expensive, (3) building the predictive model for texture requires enumeration (classifier) or embedding (regression) of all possible textures, which is not feasible.
However, slightly modifying the third assumption results in a problem setting that allows interesting application scenarios. Instead of assuming explicit knowledge of $B$, we can approximate it by defining a set $G$ of models that are biased towards $B$ by design. For texture biases, for example, we define $G$ to be the set of CNN architectures with small receptive fields. Then, any learned $g \in G$ can by design make predictions based only on the patterns that can be captured with small receptive fields (i.e., textures), becoming liable to overfit to texture.
More precisely, we define $G$ to be a bias-characterising model class for the bias-signal pair $(B, S)$ if for every possible joint distribution $p(s, b, y)$ there exists a $g \in G$ that captures the bias (recall condition) and every $g \in G$ is unable to capture the signal (precision condition). In practice, $G$ may not necessarily include all biases and may also capture important signals (i.e., imperfect recall and precision). With this in mind, we formulate our framework as a regulariser added to the original task loss, so that $f$ does not blindly discard every cue captured by $G$. We do not require $G$ to be perfect.
There exist many scenarios where such a $G$ can be characterised, based on empirical evidence for the type of bias. For instance, action recognition models rely heavily on static cues without learning temporal cues (Li et al., 2018; Li & Vasconcelos, 2019); we can regularise 3D CNNs towards better generalisation across static biases by defining G to be the set of 2D CNNs. VQA models rely overly on language biases rather than visual cues (Agrawal et al., 2018); G can be defined as the set of models that only look at the language modality (Clark et al., 2019; Cadene et al., 2019). Entailment models are biased towards word overlap rather than understanding the underlying meaning of sentences (McCoy et al., 2019; Niven & Kao, 2019); we can design G to be the set of bag-of-words classifiers (Clark et al., 2019). Generally, these scenarios exemplify situations where the added architectural capacity is not fully utilised due to the sufficiency of simpler cues for solving the task in the given training set.
There are recent approaches that attempt to capture bias with bias-characterising models and remove the dependency on $B$ via logit ensembling (Clark et al., 2019) or logit re-weighting (Cadene et al., 2019). In §4, we empirically measure their performance on various types of synthetic and realistic biases.
3 Proposed Method
We present a solution for cross-bias generalisation when the bias-characterising model class $G$ is known (see §2.3); the method is referred to as ReBias. The solution consists of training a model $f$ for the task with a regularisation term encouraging independence between the prediction $f(X)$ and the set of all possible biased predictions $\{g(X) \mid g \in G\}$. We will introduce the precise definition of the regularisation term and discuss why and how it leads to an unbiased model.
3.1 ReBias: Removing bias with bias
If $B$ is fully known, we can directly encourage $f(X) \perp B$. Since we only have access to the set of biased models $G$ (§2.3), we instead seek to promote $f(X) \perp g(X)$ for every $g \in G$. Simply put, we de-bias a representation $f$ by designing a set of biased models $G$ and letting $f$ run away from $G$. This leads to independence from bias cues while leaving signal cues as valid recognition cues; see §2.3. We will specify the ReBias learning objective after introducing our independence criterion, HSIC.
Hilbert-Schmidt Independence Criterion (HSIC). Since we need to measure the degree of independence between continuous random variables in high-dimensional spaces, it is infeasible to resort to histogram-based measures; we use HSIC (Gretton et al., 2005). For two random variables $U$ and $V$ and kernels $k$ and $l$, HSIC is defined as $\mathrm{HSIC}^{k,l}(U, V) = \lVert C^{k,l}_{UV} \rVert^2_{\mathrm{HS}}$, where $C^{k,l}_{UV}$ is the cross-covariance operator in the Reproducing Kernel Hilbert Spaces (RKHS) of $k$ and $l$ (Gretton et al., 2005), an RKHS analogue of covariance matrices, and $\lVert \cdot \rVert_{\mathrm{HS}}$ is the Hilbert-Schmidt norm, a Hilbert-space analogue of the Frobenius norm. It is known that for radial basis function (RBF) kernels $k$ and $l$, $\mathrm{HSIC}^{k,l}(U, V) = 0$ if and only if $U \perp V$. A finite-sample estimate of HSIC has been used in practice for statistical testing (Gretton et al., 2005; 2008), feature similarity measurement (Kornblith et al., 2019), and model regularisation (Quadrianto et al., 2019; Zhang et al., 2018). We employ an unbiased estimator with $m$ samples (Song et al., 2012), defined as
$$\mathrm{HSIC}_1^{k,l}(U, V) = \frac{1}{m(m-3)} \left[ \mathrm{tr}(\tilde{K}\tilde{L}) + \frac{\mathbf{1}^\top \tilde{K} \mathbf{1}\, \mathbf{1}^\top \tilde{L} \mathbf{1}}{(m-1)(m-2)} - \frac{2}{m-2} \mathbf{1}^\top \tilde{K} \tilde{L} \mathbf{1} \right],$$
where $\tilde{K}_{ij} = (1 - \delta_{ij})\, k(u_i, u_j)$, i.e., the diagonal entries of $\tilde{K}$ are set to zero, and $\tilde{L}$ is defined similarly.
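For concreteness, the unbiased estimator above can be implemented in a few lines of numpy; this is a sketch, and the function names and fixed RBF bandwidth are ours:

```python
import numpy as np

def rbf_kernel(x, sigma=1.0):
    # Pairwise RBF kernel matrix for rows of x: k(a, b) = exp(-||a - b||^2 / (2 sigma^2)).
    sq = np.sum(x ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * x @ x.T
    return np.exp(-d2 / (2.0 * sigma ** 2))

def hsic1(u, v, sigma=1.0):
    """Unbiased HSIC estimator (Song et al., 2012) on m paired samples."""
    m = u.shape[0]
    assert m > 3, "the unbiased estimator needs m > 3 samples"
    K = rbf_kernel(u, sigma)
    L = rbf_kernel(v, sigma)
    np.fill_diagonal(K, 0.0)  # K~: zero the diagonal
    np.fill_diagonal(L, 0.0)  # L~: likewise
    one = np.ones(m)
    term1 = np.trace(K @ L)
    term2 = (one @ K @ one) * (one @ L @ one) / ((m - 1) * (m - 2))
    term3 = 2.0 / (m - 2) * (one @ K @ L @ one)
    return (term1 + term2 - term3) / (m * (m - 3))
```

On independent samples the estimate fluctuates around zero, while identical inputs give a clearly positive value; in practice the bandwidth `sigma` would be tuned (e.g., by the median heuristic mentioned in §4.1).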
Minimax optimisation for bias removal. We define
$$\mathrm{HSIC}_1(f, g) := \mathrm{HSIC}_1^{k,k}(f(X), g(X))$$
with an RBF kernel $k$ as the degree of independence between the representation $f(X)$ and a biased representation $g(X)$. The learning objective for $f$ against the set $G$ is then
$$\min_f \; \mathcal{L}(f) + \lambda \max_{g \in G} \mathrm{HSIC}_1(f, g),$$
where $\mathcal{L}(f) := \mathbb{E}_{(x,y)}\, \ell(f(x), y)$ is the loss for the main task; we write $\mathcal{L}(g)$ as a shorthand likewise. Having specified $G$ to represent the bias $B$, we need to train each $g$ on the original task so that it intentionally overfits to the bias. Thus, the inner optimisation involves both the independence criterion and the original task loss $\mathcal{L}(g)$. The final learning objective for ReBias is then
$$\min_f \max_{g \in G} \; \mathcal{L}(f) - \lambda_g \mathcal{L}(g) + \lambda\, \mathrm{HSIC}_1(f, g). \qquad (3)$$
We solve equation 3 by alternating updates of $f$ and $g$.
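The alternating scheme can be sketched in numpy on a two-dimensional toy problem. As a simplification for analytic gradients, we replace the RBF-kernel HSIC penalty with a linear-kernel surrogate (the squared covariance of the two logits); the toy data, model forms, and all names below are ours, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy batch: column 0 is a noisy signal, column 1 is a bias that agrees
# with the label 95% of the time (illustrative data).
m = 256
y = rng.integers(0, 2, size=m).astype(float)
sign = 2.0 * y - 1.0
agree = rng.choice([1.0, -1.0], size=m, p=[0.95, 0.05])
x = np.stack([sign + rng.normal(size=m), sign * agree], axis=1)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
centred = lambda v: v - v.mean()

f_logit = lambda w_f: x @ w_f          # f sees both dimensions
g_logit = lambda w_g: x[:, 1] * w_g    # g sees only the bias dimension

def bce(logit):
    p = sigmoid(logit)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def cov_fg(w_f, w_g):  # linear-kernel stand-in for HSIC_1(f, g)
    return np.mean(centred(f_logit(w_f)) * centred(g_logit(w_g)))

lam, lam_g, lr = 1.0, 1.0, 0.01

def f_objective(w_f, w_g):  # f: task loss + dependence penalty
    return bce(f_logit(w_f)) + lam * cov_fg(w_f, w_g) ** 2

def g_objective(w_f, w_g):  # g: task loss while maximising dependence on f
    return bce(g_logit(w_g)) - lam_g * cov_fg(w_f, w_g) ** 2

def f_grad(w_f, w_g):
    grad_task = x.T @ (sigmoid(f_logit(w_f)) - y) / m
    c = cov_fg(w_f, w_g)
    dcov = np.mean(centred(g_logit(w_g))[:, None] * (x - x.mean(axis=0)), axis=0)
    return grad_task + lam * 2.0 * c * dcov

def g_grad(w_f, w_g):
    grad_task = np.mean(x[:, 1] * (sigmoid(g_logit(w_g)) - y))
    c = cov_fg(w_f, w_g)
    dcov = np.mean(centred(f_logit(w_f)) * centred(x[:, 1]))
    return grad_task - lam_g * 2.0 * c * dcov

# One round of the alternating updates: an f-step, then a g-step.
w_f, w_g = np.array([0.1, 0.1]), 0.5
jf0 = f_objective(w_f, w_g)
w_f = w_f - lr * f_grad(w_f, w_g)
jf1 = f_objective(w_f, w_g)

jg0 = g_objective(w_f, w_g)
w_g = w_g - lr * g_grad(w_f, w_g)
jg1 = g_objective(w_f, w_g)
```

Each gradient step decreases the corresponding player's objective; repeated rounds realise the minimax game of equation 3, with $g$ staying bias-locked by construction while $f$ is pushed away from it.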
3.2 Why and how does it work?
Independence describes relationships between random variables, but we use it for function pairs. Which functional relationship does statistical independence translate to? In this part, we argue with proofs and observations that the answer to the above question is the dissimilarity of invariance types learned by a pair of models.
Linear case: Equivalence between independence and orthogonality. We study the set of function pairs $(f, g)$ satisfying $f(X) \perp g(X)$ for a suitable random variable $X$. Assuming linearity of the involved functions and normality of $X$, we obtain the equivalence between statistical independence and functional orthogonality.
Lemma 1. Assume that $f$ and $g$ are affine mappings $f(x) = A^\top x + a$ and $g(x) = B^\top x + b$, where $A \in \mathbb{R}^{d \times d_f}$ and $B \in \mathbb{R}^{d \times d_g}$. Assume further that $X$ follows a normal distribution with mean $\mu$ and covariance matrix $\Sigma$. Then, $f(X) \perp g(X)$ if and only if $A \perp_\Sigma B$. For a positive semi-definite matrix $M$, we define $u \perp_M v$ if and only if $u^\top M v = 0$, and the set orthogonality $A \perp_M B$ likewise (column-wise). The proof is in Appendix.
In particular, when $f$ and $g$ have 1-dimensional outputs, the independence condition translates to the orthogonality of their weight vectors and decision boundaries. From a machine learning point of view, $f$ and $g$ are models with orthogonal invariance types.
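Lemma 1 can be sanity-checked numerically: for jointly Gaussian outputs, independence is equivalent to zero cross-covariance, so choosing $A \perp_\Sigma B$ should give (near-)zero sample covariance between $f(X)$ and $g(X)$, while a non-orthogonal pair should not. A small numpy check with illustrative values:

```python
import numpy as np

rng = np.random.default_rng(0)

# X ~ N(mu, Sigma) with an anisotropic covariance (values are illustrative).
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])
mu = np.array([1.0, -2.0, 0.5])
X = rng.multivariate_normal(mu, Sigma, size=200_000)

A = np.array([1.0, 0.0, 0.0])        # f(x) = A.x + a
v = Sigma @ A                         # A ⊥_Sigma B requires v . B = 0
B = np.array([-v[1], v[0], 0.0])      # orthogonal to v by construction
assert abs(A @ Sigma @ B) < 1e-12     # A Sigma B^T = 0

f_out = X @ A + 1.0                   # affine maps as in Lemma 1
g_out = X @ B - 2.0
cov_orth = np.cov(f_out, g_out)[0, 1]      # near 0: independent for Gaussians

g_bad = X @ A                          # a non-orthogonal direction
cov_nonorth = np.cov(f_out, g_bad)[0, 1]   # = A Sigma A^T = 2, up to noise
```

The orthogonal pair's sample covariance shrinks towards zero as the sample grows, while the non-orthogonal pair's stays at $A \Sigma A^\top$.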
Non-linear case: HSIC as a metric learning objective. We lack theories to fully characterise general, possibly non-linear, function pairs achieving $f(X) \perp g(X)$; it is an interesting open question. For now, we make a set of observations in this general case, using the finite-sample independence criterion $\mathrm{HSIC}_0(U, V) = \frac{1}{(m-1)^2}\, \mathrm{tr}(\bar{K}\bar{L})$, where $\bar{K} = HKH$, with the centering matrix $H = I - \frac{1}{m}\mathbf{1}\mathbf{1}^\top$, is the mean-subtracted kernel matrix of $K_{ij} = k(u_i, u_j)$, and likewise for $\bar{L}$. Unlike in the loss formulation (§3.1), we use the biased HSIC statistic for simplicity.
Note that $\mathrm{tr}(\bar{K}\bar{L}) = \sum_{ij} \bar{K}_{ij}\bar{L}_{ij}$ is an inner product between the flattened (symmetric) matrices $\bar{K}$ and $\bar{L}$. We consider the inner-product-minimising solution for $f$ on an input pair $(x_1, x_2)$ given a fixed $g$. The problem can be written as $\min_f \bar{K}_{12}\, \bar{L}_{12}$, which is equivalent to $\min_f \operatorname{sign}(\bar{L}_{12})\, \bar{K}_{12}$.
When $\bar{L}_{12} > 0$, $g$ is relatively invariant on the pair $(x_1, x_2)$, since the kernel value $l(g(x_1), g(x_2))$ is larger than average. Then, the above problem boils down to $\min_f \bar{K}_{12}$, signifying the relative variance of $f$ on $(x_1, x_2)$. Following a similar argument, we obtain the converse statement: if $g$ is relatively variant on a pair of inputs ($\bar{L}_{12} < 0$), invariance of $f$ on the pair minimises the objective.
We conclude that $\mathrm{HSIC}_0(f, g)$ against a fixed $g$ is a metric-learning objective for the embedding $f$, where the ground-truth pairwise matches and mismatches are the relative mismatches and matches for $g$, respectively. As a result, $f$ and $g$ learn different sorts of invariances.
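The biased statistic and its inner-product reading can be made concrete in numpy (a sketch; names and bandwidth are ours):

```python
import numpy as np

def rbf(x, sigma=1.0):
    # Pairwise RBF kernel matrix over the rows of x.
    sq = np.sum(x ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * x @ x.T
    return np.exp(-d2 / (2.0 * sigma ** 2))

def hsic0(u, v, sigma=1.0):
    """Biased HSIC estimator tr(Kbar Lbar) / (m-1)^2, Kbar = H K H."""
    m = u.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m   # centering (mean-subtraction) matrix
    Kbar = H @ rbf(u, sigma) @ H
    Lbar = H @ rbf(v, sigma) @ H
    # For symmetric matrices, tr(Kbar Lbar) equals the elementwise (flattened)
    # inner product — the quantity read as a metric-learning objective above.
    return np.sum(Kbar * Lbar) / (m - 1) ** 2
```

Since $\bar{K}$ and $\bar{L}$ are positive semi-definite, the statistic is non-negative; it is small (up to the $O(1/m)$ bias) for independent inputs and clearly positive when the two embeddings share invariances.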
Effect of HSIC regularisation on toy data. We have established that HSIC regularisation encourages a difference in model invariances. To see how it helps to de-bias a model, we have prepared synthetic two-dimensional training data following the cross-bias generalisation case in Figure 1: one input dimension carries the signal and the other carries a bias that is perfectly correlated with the label in the training data. Since the training data is perfectly biased, a multi-layer perceptron (MLP) trained on the data only shows 55% accuracy on de-biased test data (see the decision boundary figure in Appendix). To overcome the bias, we have trained another MLP with equation 3, where the bias-characterising class $G$ is defined as the set of MLPs that take only the bias dimension as input. This model exhibits de-biased decision boundaries (Appendix) with an improved accuracy of 89% on the de-biased test data.
4 Experiments
In the previous section, ReBias was introduced and theoretically justified. In this section, we present experimental results for ReBias. We first introduce the setup, including the biases tackled in the experiments, the difficulties inherent to cross-bias evaluation, and the implementation details (§4.1). Results on Biased MNIST (§4.2), ImageNet (§4.3), and action recognition (§4.4) follow.
4.1 Experimental setup
Which biases do we tackle? Our work tackles the types of biases that arise due to the existence of shortcut cues that are sufficient for recognition in the training data. In the experiments, we tackle the “texture” biases for image classification and the “static” biases for video action recognition. Even if a CNN image classifier has wide receptive fields, empirical evidence indicates that it relies heavily on local texture cues for recognition instead of global shape cues (Geirhos et al., 2019). Similarly, a 3D CNN action recognition model that possesses the capacity to model temporal cues still relies on static cues like scenes or objects rather than human motion (Weinzaepfel & Rogez, 2019). While it is difficult to precisely define and quantify all texture and scene types in the above examples, it is easy to intentionally design a model to be biased towards such cues. For the texture bias in image recognition, we design $G$ as CNNs with small receptive fields; for the static bias in action recognition, we design $G$ as 2D CNNs.
Evaluating cross-bias generalisation is difficult. To measure the performance of a model across real-world biases, one requires an unbiased dataset or one where the types and degrees of biases can be controlled. Unfortunately, real-world data arise with biases. To de-bias a frog-and-bird image dataset with swamp and sky biases (see §2.1), rare data samples must either be collected or generated; both are expensive procedures (Peyre et al., 2017).
We thus evaluate our method along two axes: (1) synthetic biases (Biased MNIST) and (2) realistic biases (ImageNet classification and action recognition). Biased MNIST contains colour biases which we control in training and test data for an in-depth analysis of ReBias. For ImageNet classification, on the other hand, we use clustering-based proxy ground truths for texture bias to measure cross-bias generalisability. For action recognition, we use the publicly available unbiased Mimetics dataset (Weinzaepfel & Rogez, 2019), albeit small in quantity, for the unbiased test-set accuracies, while training on the biased Kinetics dataset (Carreira & Zisserman, 2017). The sets of experiments complement each other in terms of experimental control and realism.
Table 1: Biased and unbiased accuracies on Biased MNIST for Vanilla, Biased, HEX, LearnedMixin, RUBi, and ReBias (ours).
Implementation of ReBias. We describe the specific design choices in ReBias implementation (equation 3). The source code is in the supplementary materials.
For texture biases, we define the biased model architecture families as CNNs with small receptive fields (RFs). The biased models in will by design learn to predict the target class of an image only through the local texture cues. On the other hand, we define a larger search space with larger RFs for our unbiased representations.
In our work, all networks $f$ and $g$ are fully convolutional networks followed by a global average pooling (GAP) layer and a linear classifier. $f(x)$ and $g(x)$ denote the outputs of the GAP layer (pooled feature maps), on which we compute the independence measure using HSIC (§3.1).
For the Biased MNIST, $f$ is a fully convolutional network with four convolutional layers with 7×7 kernels. Each convolutional layer is followed by batch normalisation (Ioffe & Szegedy, 2015) and ReLU. $g$ has the same architecture as $f$, except that the kernel sizes are 1×1. On ImageNet, we use the ResNet18 (He et al., 2016) architecture for $f$, whose receptive field size is 435; $g$ is defined as BagNet18 (Brendel & Bethge, 2019) with a receptive field of 43. For action recognition, we use 3D-ResNet18 and 2D-ResNet18 for $f$ and $g$, whose receptive fields along the temporal dimension are 19 and 1, respectively.
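The gap between $f$ and $g$ can be made concrete with the standard receptive-field recursion (rf grows by (kernel − 1) × jump per layer, where jump is the product of the strides so far); a small helper with illustrative layer configurations:

```python
def receptive_field(layers):
    """Receptive field of a stack of (kernel_size, stride) conv/pool layers.

    Standard recursion: rf += (k - 1) * jump; jump *= stride.
    """
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

# Four 7x7 stride-1 convolutions (an f-like stack) vs four 1x1 convolutions
# (a g-like stack that only ever sees single pixels, i.e., pure colour).
f_rf = receptive_field([(7, 1)] * 4)  # 25
g_rf = receptive_field([(1, 1)] * 4)  # 1
```

A 1×1-kernel stack has a receptive field of a single pixel regardless of depth, which is precisely why such a $g$ can only exploit colour on Biased MNIST.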
We conduct experiments using the same batch size, learning rate, and number of epochs across methods for fair comparison. For the Biased MNIST experiments, we set the RBF kernel radius to one, while for the ImageNet and action recognition experiments the radius is set to the median of pairwise distances. More implementation details, including the choices of λ and λ_g, are provided in Appendix.
Comparison methods. There are prior methodologies that can be applied in our cross-bias generalisation task (§2.2). We empirically compare ReBias against them. The prior methods include RUBi (Cadene et al., 2019) and LearnedMixin+H (Clark et al., 2019), which reduce the dependency of the model on biases captured by $g$ via logit re-weighting and logit ensembling, respectively. While the prior works additionally alter the training data, we only compare the objective functions themselves in our experiments. We additionally compare two methods that tackle texture bias: HEX (Wang et al., 2019a) and StylisedImageNet (Geirhos et al., 2019). HEX attempts to reduce the dependency of a model on “superficial statistics”: it measures texture via neural grey-level co-occurrence matrices (NGLCM) and projects out the NGLCM feature from the model. StylisedImageNet reduces the model’s reliance on texture by augmenting the training data with texturised images.
4.2 Biased MNIST
We first verify our model on a dataset where we have full control over the type and amount of bias during training and evaluation. We describe the dataset and present the experimental results.
Dataset and evaluation
We construct a new dataset called Biased MNIST, designed to measure the extent to which models generalise to bias shift. We modify MNIST (LeCun et al., 1998) by introducing a colour bias that highly correlates with the label during training. With the colour alone, a CNN can achieve high accuracy without having to learn the inherent signals for digit recognition, such as shape, providing little motivation for the model to learn beyond these superficial cues.
We inject the colour bias by adding a colour to the training image backgrounds (Figure 2). We pre-select ten distinct colours, one for each digit $y \in \{0, \dots, 9\}$. Then, for each image of digit $y$, we assign the pre-defined colour with probability $\rho$ and any other colour with probability $1 - \rho$. $\rho$ then controls the bias-target correlation in the training data: $\rho = 1$ leads to complete bias and $\rho = 0.1$ leads to an unbiased dataset. We consider values of $\rho$ close to 1 (e.g., 0.997 and 0.999) to simulate significant amounts of bias during training. We evaluate the model’s generalisability to bias shift under the following criteria:
Biased. The in-distribution case in §2.1: whatever bias the training set contains is replicated in the test set (the same $\rho$). This measures the ability of de-biased models to maintain high in-distribution performance.
Unbiased. We assign biases to test images independently of the labels ($\rho = 0.1$). Bias is no longer predictive of $Y$, and a model needs to utilise actual signals to yield correct predictions.
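The colour-injection protocol can be sketched as follows (the palette values are illustrative placeholders, not the ones used in the paper):

```python
import numpy as np

# Ten pre-selected background colours, one per digit class (illustrative).
PALETTE = np.array([
    [255, 0, 0], [0, 255, 0], [0, 0, 255], [255, 255, 0], [255, 0, 255],
    [0, 255, 255], [255, 128, 0], [128, 0, 255], [0, 128, 255], [128, 128, 128],
], dtype=np.uint8)

def colourise(image, label, rho, rng):
    """Colour the background of a 28x28 grayscale digit.

    With probability rho the colour pre-assigned to `label` is used
    (bias-aligned); otherwise one of the other nine colours is drawn.
    """
    if rng.random() < rho:
        colour = PALETTE[label]
    else:
        others = [c for c in range(10) if c != label]
        colour = PALETTE[rng.choice(others)]
    rgb = np.stack([image] * 3, axis=-1)   # grayscale -> RGB
    background = (image == 0)              # zero pixels form the background
    rgb[background] = colour
    return rgb
```

With $\rho = 1$ every digit gets its own colour (complete bias); with $\rho = 0.1$ the colour carries no information about the label.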
Results on the Biased MNIST are shown in Table 1.
ReBias lets a model overcome bias. We observe that the vanilla model achieves 100% accuracy under the “biased” metric (the same bias between training and test data) on Biased MNIST for all $\rho$. This is how most machine learning tasks are evaluated, yet it does not show the extent to which the model depends on bias for prediction. When the bias cues are randomly assigned to the labels at evaluation, the vanilla model’s accuracy collapses to 10.4% under the “unbiased” metric when the training correlation $\rho$ is large. The intentionally biased models $g$ reach 10.0% on Biased MNIST, the random-chance performance, for all $\rho$. This exemplifies the case where a seemingly high-performing model has in fact overfitted to bias and does not generalise to new situations.
On the other hand, ReBias achieves robust generalisation across all settings by learning to be different from the biased representations $g$. In particular, ReBias boosts the unbiased accuracy over the vanilla model under the highly correlated settings $\rho = 0.999$ and $\rho = 0.997$.
Comparison against other methods. As HEX pre-defines bias as patterns captured by NGLCM, we observe that it does not improve generalisability to colour bias (18.0%), while also hurting the in-distribution accuracy (74.1%) compared to the vanilla model. LearnedMixin achieves a performance gain in unbiased accuracy (57.2%) yet suffers a severe drop in biased accuracy (15.2%). RUBi achieves robust generalisation across biased and unbiased accuracies (99.7% and 60.2%, respectively). We show in the following experiments that LearnedMixin and RUBi achieve sub-optimal performances on realistic texture and static biases.
Analysis of per-bias performances. In Figure 3, we provide more fine-grained results by visualising the accuracies per bias-class pair $(b, y)$. The diagonal average corresponds to the biased accuracy and the overall average to the unbiased accuracy. We observe that the vanilla model has higher accuracies on the diagonal and lower ones off the diagonal, showing its heavy reliance on the colour (bias) cues. HEX and RUBi demonstrate sporadic improvements in certain off-diagonal cells, but the overall improvements are limited. LearnedMixin shows further improvements, yet near-zero accuracies on the diagonal entries (also seen in Table 1). ReBias uniformly improves the off-diagonal accuracies without sacrificing the diagonal ones.
Learning Curves. In Figure 4, we plot the evolution of the unbiased accuracy and the HSIC value as ReBias is trained. ReBias is trained on a highly biased training set and tested on the unbiased test set. While the classification loss alone, i.e., the vanilla model, leads to a collapsed unbiased accuracy, the unbiased accuracy increases dramatically as the HSIC between $f$ and $g$ is minimised during training. We observe a strong correlation between the HSIC values and the unbiased accuracies.
4.3 ImageNet
In the ImageNet experiments, we further validate the applicability of ReBias to texture bias in realistic images (i.e., objects in natural scenes). Texture bias often lets a model achieve good in-distribution performance by exploiting local texture shortcuts (e.g., recognising the swan class not by the swan's shape but by the background water texture).
Dataset and evaluation
We construct 9-Class ImageNet, a subset of ImageNet (Russakovsky et al., 2015) containing 9 super-classes as done in Ilyas et al. (2019), since a full-scale analysis on ImageNet is computationally prohibitive. We additionally balance the ratios of sub-class images for each super-class to focus on the effect of texture bias.
Since it is difficult to evaluate the cross-bias generalisability on realistic unbiased data (§4.1), we settle for alternative evaluations:
Biased. Accuracy is measured on the in-distribution validation set. Though widely used, this metric is blind to a model’s generalisability to unseen bias-target combinations.
Unbiased. As a proxy for perfectly de-biased test data, which is difficult to collect (§4.1), we use texture cluster IDs, obtained by k-means clustering, as ground-truth labels for the texture bias. For full details of the texture clustering algorithm, see Appendix. For the unbiased accuracy measurement, we compute the accuracy for every set of images corresponding to a texture-class combination $(t, y)$. The combination-wise accuracy is computed as $N^{\text{correct}}_{t,y} / N_{t,y}$, where $N^{\text{correct}}_{t,y}$ is the number of correctly predicted samples in $(t, y)$ and $N_{t,y}$ is the total number of samples in $(t, y)$, called the population at $(t, y)$. The unbiased accuracy is then the mean accuracy over all $(t, y)$ whose population exceeds a threshold. This measure gives more weight to samples of unusual texture-class combinations (smaller $N_{t,y}$) that are under-represented in the usual biased accuracy. Under this unbiased metric, a biased model basing its recognition on textures is likely to show sub-optimal results on unusual combinations, leading to a drop in the unbiased accuracy. Since k-means clustering is non-convex, we report the average unbiased accuracy over three clustering runs with different initial points.
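The unbiased-accuracy computation can be sketched as follows (a sketch; the function name is ours, and the population threshold is left as a free parameter since its exact value is detailed in the paper's appendix):

```python
import numpy as np

def unbiased_accuracy(preds, labels, textures, min_population=1):
    """Mean per-(texture, class) accuracy, weighting rare combinations equally.

    `textures` are cluster IDs serving as proxy bias ground truths.
    Combinations with population below `min_population` are skipped.
    """
    accs = []
    for t in np.unique(textures):
        for y in np.unique(labels):
            mask = (textures == t) & (labels == y)
            n = int(mask.sum())
            if n >= min_population:
                accs.append(float((preds[mask] == y).mean()))
    return float(np.mean(accs))
```

Because every surviving combination contributes equally to the mean, an unusual texture-class cell with few images counts as much as a heavily represented one, which is exactly what penalises texture-reliant models.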
ImageNet-A. ImageNet-A (Hendrycks et al., 2019) contains the failure cases of an ImageNet-trained ResNet50 among web images. The images represent failure modes in which “frequently appearing background elements” (Hendrycks et al., 2019) become erroneous cues for recognition (e.g., an image of a bee feeding on a hummingbird feeder is recognised as a hummingbird). Improved performance on ImageNet-A is an indirect signal that the model learns beyond bias shortcuts.
We measure the performance of ResNet18 trained under the ReBias framework to be different from BagNet18, using the metrics in the previous part. Results are shown in Table 2.
Vanilla models are biased. ResNet18 shows good performance on the biased accuracy (90.8%) but a drop on the texture-unbiased accuracy (88.8%). BagNet18 performs worse than the vanilla ResNet18, as it is heavily biased towards texture by design (i.e., small receptive fields). The drop signifies the bias of vanilla models towards texture cues; by basing their predictions on texture cues, they obtain generally better accuracies on texture-class pairs that are more represented. The drop also shows the limitation of current evaluation schemes, where cross-bias generalisation is not measured.
ReBias leads to less biased models. When ReBias is applied on ResNet18 to make it learn cues beyond those captured by BagNet18, we observe a general boost in the biased, unbiased, and ImageNet-A accuracies (Table 2). The unbiased accuracy of ResNet18 improves from 88.8% to 90.5%, thus generalising robustly to less represented texture-class combinations at test time. Our method also shows improvements on the challenging ImageNet-A subset (e.g., from 24.9% to 29.6%), which further indicates improved generalisation. While StylisedImageNet attempts to mitigate texture bias by stylisation, it does not increase generalisability on either the unbiased or the ImageNet-A accuracy (86.6% and 24.6%, respectively). Similar to the Biased MNIST results, LearnedMixin suffers a collapse in the in-distribution accuracy (from 90.8% to 67.9%) and does not improve generalisability to less represented texture-class combinations or the challenging ImageNet-A. RUBi only shows improvement on ImageNet-A (from 24.9% to 27.7%).
Table 3. ImageNet and action recognition results (see Appendix D for standard errors).

| Method | Biased | Unbiased | ImageNet-A |
|---|---|---|---|
| Vanilla (ResNet18) | 90.8 | 88.8 | 24.9 |
| Biased (BagNet18) | 67.7 | 65.9 | 18.8 |
| StylisedIN (Geirhos et al., 2019) | 88.4 | 86.6 | 24.6 |
| LearnedMixin (Clark et al., 2019) | 64.1 | 62.7 | 15.0 |
| RUBi (Cadene et al., 2019) | 90.5 | 88.6 | 27.7 |
| ReBias (ours) | 91.9 | 90.5 | 29.6 |

| Method | Kinetics | Mimetics |
|---|---|---|
| Vanilla (3D-ResNet18) | 54.5 | 18.9 |
| Biased (2D-ResNet18) | 50.7 | 18.4 |
| LearnedMixin (Clark et al., 2019) | 12.3 | 11.4 |
| RUBi (Cadene et al., 2019) | 22.4 | 13.4 |
| ReBias (ours) | 55.8 | 22.4 |
4.4 Action recognition
To further assess the effectiveness of ReBias at reducing static biases in video understanding, we conduct action recognition experiments with 3D CNNs. 3D CNNs achieve state-of-the-art performance on action recognition benchmarks such as Kinetics (Carreira & Zisserman, 2017), but recent studies (Sevilla-Lara et al., 2019; Li et al., 2018; Li & Vasconcelos, 2019) have shown that such action datasets carry strong static biases towards the scenes or objects in videos. As a result, 3D CNNs base their predictions predominantly on static cues, despite their capacity to capture temporal signals, and achieve high accuracies even when temporal cues are removed (e.g., by shuffling frames or masking out the human actor) (Weinzaepfel & Rogez, 2019). This bias leads to performance drops when static cues shift between training and test settings (e.g., predicting the “swimming” class when a person plays football near a swimming pool).
We train on the Kinetics dataset (Carreira & Zisserman, 2017), which is known to be biased towards static cues. To evaluate cross-bias generalisability, we use the Mimetics dataset (Weinzaepfel & Rogez, 2019), which consists of videos of mime artists performing actions without any context. The Mimetics classes are fully covered by the Kinetics classes, so we use it as the unbiased validation set. Since training and evaluating on the full action datasets is computationally expensive, we sub-sample 10 classes from both. Detailed dataset descriptions are in the Appendix.
We evaluate the performances of 3D-ResNet18 trained to be different from the biased model 2D-ResNet18. Main results are shown in Table 3.
Vanilla model is biased. The vanilla 3D-ResNet18 achieves a reasonable 54.5% accuracy on the biased Kinetics set, but drops sharply to 18.9% on the unbiased Mimetics set. Although 3D-ResNet18 is designed to capture temporal signals within videos, it relies heavily on static cues, performing on par with the purely static 2D-ResNet18 (18.4% on Mimetics).
ReBias reduces the static bias. Applying ReBias on 3D-ResNet18 encourages it to utilise its temporal modelling capacity by forcing it to reason differently from 2D-ResNet18. ReBias improves the accuracies on both Kinetics and Mimetics beyond the vanilla model: from 54.5% to 55.8% and from 18.9% to 22.4%, respectively. We also compare against the two baseline methods, LearnedMixin (Clark et al., 2019) and RUBi (Cadene et al., 2019), as in the previous sections. ReBias outperforms both baselines at reducing the static bias for action recognition. We believe the difficulty of the action recognition task on Kinetics hampers the logit-modification step of the baseline methods, severely hindering convergence with respect to the cross-entropy loss. The training of ReBias, in contrast, remains stable because the independence loss acts only as a regularisation term.
We have identified a practical problem faced by many machine learning algorithms: learned models exploit bias shortcuts to recognise the target, the cross-bias generalisation problem (§2). Models tend to under-utilise their capacity to extract non-bias signals (e.g., global shapes for object recognition, or temporal actions for action recognition) when bias shortcuts provide sufficient cues in the training data (e.g., texture for object recognition, or static contexts for action recognition) (Geirhos et al., 2019; Weinzaepfel & Rogez, 2019). We have addressed this problem with the ReBias method. Given an identified set of models that encode the bias to be removed, ReBias encourages the target model to be statistically independent of them (§3). We have provided theoretical justifications (§3.2) and validated the effectiveness of ReBias at removing bias through experiments on Biased MNIST, ImageNet classification, and the Mimetics action recognition benchmark (§4).
We thank Clova AI Research team for the discussion and advice, especially Dongyoon Han, Youngjung Uh, Yunjey Choi, Byeongho Heo, Junsuk Choe, Muhammad Ferjad Naeem, and Hyojin Park for their internal reviews. Naver Smart Machine Learning (NSML) platform (Kim et al., 2018) has been used in the experiments. Kay Choi has helped the design of Figure 1.
Appendix A Statistical Independence is Equivalent to Functional Orthogonality for Linear Maps
We provide a proof for the following lemma in §3.2.
Lemma 1. Assume that $f$ and $g$ are affine mappings $f(x) = Ax + a$ and $g(x) = Bx + b$, where $A \in \mathbb{R}^{m \times d}$ and $B \in \mathbb{R}^{n \times d}$. Assume further that $X$ follows a normal distribution with mean $\mu$ and covariance matrix $\Sigma$. Then, $f(X) \perp\!\!\!\perp g(X)$ if and only if the rows of $A$ are $\Sigma$-orthogonal to the rows of $B$. $\perp\!\!\!\perp$ denotes independence. For a positive semi-definite matrix $\Sigma$, we define $u \perp_\Sigma v \iff u^\top \Sigma v = 0$. The $\Sigma$-orthogonality of two subspaces is defined likewise.
Due to linearity and normality, the independence is equivalent to the covariance condition $\mathrm{Cov}(f(X), g(X)) = 0$. The covariance is computed as

$$\mathrm{Cov}(f(X), g(X)) = \mathrm{Cov}(AX + a, BX + b) = A\,\mathrm{Cov}(X, X)\,B^\top = A \Sigma B^\top,$$

which vanishes exactly when every row of $A$ is $\Sigma$-orthogonal to every row of $B$.
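The covariance identity above is easy to verify empirically. The following numpy sketch (the dimensions and random matrices are arbitrary choices of ours) draws Gaussian samples and checks that the sample cross-covariance of the two linear features matches $A \Sigma B^\top$:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
M = rng.normal(size=(d, d))
Sigma = M @ M.T                      # an arbitrary PSD covariance matrix
mu = rng.normal(size=d)

A = rng.normal(size=(2, d))          # f(x) = A x + a
B = rng.normal(size=(3, d))          # g(x) = B x + b

# Sample X ~ N(mu, Sigma) and compare the empirical cross-covariance of
# f(X) and g(X) with the closed form A Sigma B^T (offsets a, b drop out).
X = rng.multivariate_normal(mu, Sigma, size=200_000)
F, G = X @ A.T, X @ B.T
emp = np.cov(F.T, G.T)[:2, 2:]       # 2 x 3 cross-covariance block
print(np.abs(emp - A @ Sigma @ B.T).max())
assert np.allclose(emp, A @ Sigma @ B.T, atol=0.5)
```

Under normality, this vanishing cross-covariance is exactly the independence condition of Lemma 1.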
Appendix B Algorithm
Algorithm 1 shows the detailed algorithm for solving the minimax problem in equation 3 of the main paper.
$\mathcal{L}(f)$ denotes the original loss for the main task, e.g., the cross-entropy loss. We solve the minimax problem on mini-batches by employing the unbiased finite-sample estimator of HSIC, $\widehat{\mathrm{HSIC}}_1$ (Song et al., 2012), to measure the independence of $f$ and $g$ in the mini-batch, defined as

$$\widehat{\mathrm{HSIC}}_1(f, g) = \frac{1}{m(m-3)} \left[ \mathrm{tr}(\tilde{K}\tilde{L}) + \frac{\mathbf{1}^\top \tilde{K} \mathbf{1}\, \mathbf{1}^\top \tilde{L} \mathbf{1}}{(m-1)(m-2)} - \frac{2}{m-2} \mathbf{1}^\top \tilde{K} \tilde{L} \mathbf{1} \right],$$

where $m$ is the mini-batch size, $\tilde{K}$ and $\tilde{L}$ are the kernel matrices of the $f$ and $g$ outputs with diagonal entries set to zero, and $\mathbf{1}$ is the all-ones vector.
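For reference, a direct numpy transcription of this estimator might look as follows. The RBF kernel and its bandwidth are illustrative choices of ours; the kernels used in the paper's experiments may differ:

```python
import numpy as np

def rbf_kernel(X, sigma=1.0):
    """Illustrative RBF kernel on the row vectors of X."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-d2 / (2.0 * sigma ** 2))

def hsic1(K, L):
    """Unbiased finite-sample HSIC estimator (Song et al., 2012).
    K, L are m x m kernel matrices of the two representations; m >= 4."""
    m = K.shape[0]
    Kt = K - np.diag(np.diag(K))              # zero the diagonals
    Lt = L - np.diag(np.diag(L))
    term1 = np.trace(Kt @ Lt)
    term2 = Kt.sum() * Lt.sum() / ((m - 1) * (m - 2))
    term3 = 2.0 * (Kt @ Lt).sum() / (m - 2)
    return (term1 + term2 - term3) / (m * (m - 3))

rng = np.random.default_rng(0)
U = rng.normal(size=(128, 2))
V = rng.normal(size=(128, 2))                 # drawn independently of U
h_dep = hsic1(rbf_kernel(U), rbf_kernel(U))   # maximally dependent pair
h_ind = hsic1(rbf_kernel(U), rbf_kernel(V))   # independent pair, near zero
print(h_dep, h_ind)
```

Because the estimator is unbiased, it can be slightly negative for truly independent inputs; dependent pairs score clearly positive.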
Appendix C Implementation Details
Training setup. We solve the minimax problem in Algorithm 1 through alternating stochastic gradient descent with the ADAM optimiser (Kingma & Ba, 2015). The regularisation parameters for the two models are set to 1.0 in all experiments. We use batch sizes of (256, 128, 128) for the (Biased MNIST, ImageNet, action recognition) experiments. For Biased MNIST, the learning rate is initially set to 0.001 and decayed by a factor of 0.1 every 20 epochs. For ImageNet and action recognition, the learning rates are initially set to 0.001 and 0.1, respectively, and decayed by cosine annealing. For action recognition, we modify the output logits for the sake of stable training. For each dataset, we train every method (vanilla, biased, comparison methods, and ReBias) for the same number of epochs: (80, 120, 120) epochs for the (Biased MNIST, ImageNet, action recognition) experiments. All experiments are implemented in PyTorch (Paszke et al., 2019).
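To make the de-biasing objective concrete, here is a deliberately tiny numpy sketch of the outer (target-model) step on a linear toy problem. It is not the paper's implementation: HSIC is replaced by a squared-covariance surrogate (which, by Lemma 1, already characterises independence for linear features of Gaussians), the biased model g is fixed rather than trained, and gradients are taken by finite differences; all names and constants are ours:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training distribution: the label equals a signal s; feature x1 is a
# noisy view of s (the intended cue), feature x2 tracks s almost exactly
# (the bias shortcut, assumed to shift at test time).
n = 5000
s = rng.normal(size=n)
x1 = s + 1.0 * rng.normal(size=n)
x2 = s + 0.1 * rng.normal(size=n)
X = np.stack([x1, x2], axis=1)
y = s
g_out = x2                          # biased-by-design model: sees only x2

def dep(f_out):
    # Squared covariance with g's output: a crude stand-in for HSIC.
    fc, gc = f_out - f_out.mean(), g_out - g_out.mean()
    return (fc @ gc / len(fc)) ** 2

def objective(w, lam):
    f_out = X @ w
    return np.mean((f_out - y) ** 2) + lam * dep(f_out)

def train(lam, lr=0.05, steps=500):
    w = np.zeros(2)
    for _ in range(steps):
        grad = np.zeros(2)
        for i in range(2):          # finite-difference gradient (toy scale)
            e = np.zeros(2); e[i] = 1e-5
            grad[i] = (objective(w + e, lam) - objective(w - e, lam)) / 2e-5
        w -= lr * grad
    return w

w_plain = train(lam=0.0)            # leans heavily on the shortcut x2
w_rebias = train(lam=5.0)           # penalised for co-varying with g
print(w_plain, w_rebias)
assert abs(w_rebias[1]) < abs(w_plain[1])
```

The plain fit puts most of its weight on the clean shortcut feature; the independence penalty shifts weight back to the intended cue, mirroring how ReBias steers f away from what g can already explain.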
Training details for comparison methods. For training LearnedMixin, we pre-train and fix the biased model before training the target model, as done in the original paper. We pre-train it for 5 epochs for Biased MNIST, and for 30 epochs for ImageNet and action recognition. For training RUBi, we update the biased and target models simultaneously without pre-training, as done in the original paper. For training HEX, we substitute the biased network with the neural grey-level co-occurrence matrix (NGLCM) to represent the “superficial statistics”. For StylisedImageNet, we augment 9-Class ImageNet with its stylised version (i.e., twice the original dataset size), while keeping the training setup otherwise identical.
| layer name | 3D-ResNet18 | 2D-ResNet18 | output size (T×C×H×W) |
|---|---|---|---|
| input | input video | input video | 8×3×224×224 |
Appendix D Standard Errors in Experimental Results
Biased MNIST results (%) at varying train correlations ρ (mean ± standard error):

| ρ | Vanilla | Biased | HEX | LearnedMixin | RUBi | ReBias (ours) |
|---|---|---|---|---|---|---|
| .999 | 10.4 ± 0.5 | 10.0 ± 0.0 | 10.8 ± 0.4 | 12.1 ± 0.8 | 13.7 ± 0.7 | 22.7 ± 0.4 |
| .997 | 33.4 ± 12.1 | 10.0 ± 0.0 | 16.6 ± 0.8 | 50.2 ± 4.5 | 43.0 ± 1.1 | 64.2 ± 0.8 |
| .995 | 72.1 ± 1.9 | 10.0 ± 0.0 | 19.7 ± 1.9 | 78.2 ± 0.7 | 90.4 ± 0.4 | 76.0 ± 0.6 |
| .990 | 89.1 ± 0.1 | 10.0 ± 0.0 | 24.7 ± 1.6 | 88.3 ± 0.7 | 93.6 ± 0.4 | 88.1 ± 0.6 |
ImageNet results (%) (mean ± standard error):

| Method | Biased | Unbiased | ImageNet-A |
|---|---|---|---|
| Vanilla (ResNet18) | 90.8 ± 0.6 | 88.8 ± 0.6 | 24.9 ± 1.1 |
| Biased (BagNet18) | 67.7 ± 0.3 | 65.9 ± 0.3 | 18.8 ± 1.1 |
| StylisedIN (Geirhos et al., 2019) | 88.4 ± 0.5 | 86.6 ± 0.6 | 24.6 ± 1.4 |
| LearnedMixin (Clark et al., 2019) | 64.1 ± 4.0 | 62.7 ± 3.1 | 15.0 ± 1.6 |
| RUBi (Cadene et al., 2019) | 90.5 ± 0.3 | 88.6 ± 0.4 | 27.7 ± 2.1 |
| ReBias (ours) | 91.9 ± 1.7 | 90.5 ± 1.7 | 29.6 ± 1.6 |
Action recognition results (%) (mean ± standard error):

| Method | Kinetics | Mimetics |
|---|---|---|
| Vanilla (3D-ResNet18) | 54.5 ± 3.2 | 18.9 ± 0.4 |
| Biased (2D-ResNet18) | 50.7 ± 3.3 | 18.4 ± 2.3 |
| LearnedMixin (Clark et al., 2019) | 12.3 ± 2.3 | 11.4 ± 0.4 |
| RUBi (Cadene et al., 2019) | 22.4 ± 2.0 | 13.4 ± 1.5 |
| ReBias (ours) | 55.8 ± 3.1 | 22.4 ± 1.3 |
Appendix E Decision Boundary Visualisation for Toy Experiment
(Figure: decision boundaries on the toy training data, comparing Vanilla (baseline) and ReBias (ours).)
Appendix F Texture Clustering on ImageNet
In our ImageNet experiments (§4.3), we have obtained the proxy ground truths for texture bias using texture feature clustering. We extract the texture features from images by computing the gram matrices of low-layer feature maps, as done in texturisation methods (Gatys et al., 2015; Johnson et al., 2016), to capture the edge and colour cues. Specifically, we use the feature maps from layer relu1_2 of the ImageNet pre-trained VGG16 (Simonyan & Zisserman, 2015).
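A minimal sketch of the Gram-matrix descriptor on a generic (C, H, W) feature map (the paper uses the relu1_2 activations of VGG16, which we do not reproduce here; the helper name is ours):

```python
import numpy as np

def gram_texture_feature(fmap):
    """fmap: (C, H, W) feature map. The Gram matrix of channel activations
    discards spatial layout and keeps channel co-occurrence statistics,
    i.e., texture-like information."""
    C, H, W = fmap.shape
    F = fmap.reshape(C, H * W)
    G = F @ F.T / (H * W)                # C x C Gram matrix
    return G[np.triu_indices(C)]         # flatten the upper triangle

fmap = np.random.default_rng(0).random((4, 8, 8))
feat = gram_texture_feature(fmap)
print(feat.shape)                        # (10,): C*(C+1)/2 entries for C=4

# The descriptor ignores where a pattern occurs: circularly shifting the
# map leaves the Gram matrix unchanged.
shifted = gram_texture_feature(np.roll(fmap, 3, axis=2))
assert np.allclose(shifted, feat)
```

The shift invariance is exactly why Gram features serve as a texture (rather than shape) representation.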
To approximate the texture ground-truth labels, we cluster the texture features of the 9-Class ImageNet data using the mini-batch k-means algorithm with batch size 1024. As k-means clustering is non-convex, we repeat the ImageNet experiments with three texture clustering results, each obtained from a different initialisation, and report the average performance across the three trials.
We show an example texture clustering in Figure A2. Clusters capture similar texture patterns. The texture clusters exhibit strong correlations with semantic classes. See Figure A3 for the top-3 correlated classes per cluster. For example, the “water” texture is strongly associated with the turtle, fish, and bird classes. In the presence of such bias-class correlations, the model is motivated to take the bias shortcut by utilising the texture cue for recognition. In this case, the model shows sub-optimal performances on unusual class-texture combinations (e.g., crab on grass texture).
Appendix G Further Analysis on Texture Bias of ImageNet-Trained Models
We further analyse the texture bias of the models trained on ImageNet (§4.3). Figure A4 shows the texture-class-wise accuracies of the vanilla-trained and the ReBias-trained ResNet18. To quantify the texture biases present in the dataset, we count the number of samples for each texture-class pair (denoted “population” in the main paper). For each class, we define the dominant texture cluster as the largest cluster, provided it contains more than a threshold fraction of the class samples. 4 out of the 9 classes considered have a dominant texture cluster: (“Dog”, “Face”), (“Cat”, “Face”), (“Bird”, “Eye”), and (“Monkey”, “Face”).
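The dominant-cluster rule can be sketched as follows; the threshold value and the per-class texture assignments below are hypothetical, since the exact fraction is not restated here:

```python
from collections import Counter

def dominant_cluster(cluster_ids, threshold):
    """Largest texture cluster of a class, if it covers more than
    `threshold` of the class samples; otherwise None."""
    cluster, count = Counter(cluster_ids).most_common(1)[0]
    return cluster if count / len(cluster_ids) > threshold else None

# Hypothetical texture assignments for two classes.
dog  = ["face"] * 7 + ["grass"] * 2 + ["water"] * 1
crab = ["water"] * 4 + ["grass"] * 3 + ["face"] * 3
print(dominant_cluster(dog, threshold=0.5))   # face
print(dominant_cluster(crab, threshold=0.5))  # None
```

A class with a dominant cluster ("dog" above) offers the texture shortcut; a class with spread-out textures ("crab" above) does not.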
We measure the average accuracy over the classes with dominant texture clusters (biased classes) and the average over the remaining classes. We observe that ResNet18 shows higher accuracy on the biased classes than on the less biased ones, signifying its bias towards texture. ReBias, on the other hand, achieves similar accuracies on the two groups. We stress that ReBias overcomes the bias and enhances generalisation across distributions even when the training dataset itself is biased.
Appendix H Action Recognition Datasets
We provide an overview of the action recognition datasets. The Kinetics dataset (Carreira & Zisserman, 2017) contains 300K videos of 400 action classes. The Mimetics dataset (Weinzaepfel & Rogez, 2019) contains 713 videos of 50 action classes (a subset of the Kinetics classes). To simplify our investigation, we sub-sample 10 common classes from Kinetics and Mimetics: “canoeing or kayaking”, “climbing a rope”, “driving car”, “golf driving”, “opening bottle”, “playing piano”, “playing volleyball”, “shooting goal (soccer)”, “surfing water”, and “writing”. Examples of Kinetics and Mimetics are shown in Figure A5. While Kinetics samples are biased towards static cues such as scenes and objects, Mimetics samples are relatively free of such correlations. Mimetics is thus a suitable benchmark for validating cross-bias generalisation.
- Agarwal, V., Shetty, R., and Fritz, M. Towards causal vqa: Revealing and reducing spurious correlations by invariant and covariant semantic editing. arXiv preprint arXiv:1912.07538, 2019.
- Agrawal, A., Batra, D., Parikh, D., and Kembhavi, A. Don’t just assume; look and answer: Overcoming priors for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4971–4980, 2018.
- Ben-David, S., Blitzer, J., Crammer, K., and Pereira, F. Analysis of representations for domain adaptation. In Advances in neural information processing systems, pp. 137–144, 2007.
- Brendel, W. and Bethge, M. Approximating CNNs with bag-of-local-features models works surprisingly well on imagenet. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=SkfMWhAqYQ.
- Cadene, R., Dancette, C., Cord, M., Parikh, D., et al. Rubi: Reducing unimodal biases for visual question answering. In Advances in Neural Information Processing Systems, pp. 839–850, 2019.
- Carreira, J. and Zisserman, A. Quo vadis, action recognition? a new model and the kinetics dataset. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308, 2017.
- Clark, C., Yatskar, M., and Zettlemoyer, L. Don’t take the easy way out: Ensemble based methods for avoiding known dataset biases. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 4069–4082, 2019.
- Feichtenhofer, C., Fan, H., Malik, J., and He, K. Slowfast networks for video recognition. In Proceedings of the IEEE International Conference on Computer Vision, pp. 6202–6211, 2019.
- Gatys, L., Ecker, A. S., and Bethge, M. Texture synthesis using convolutional neural networks. In Advances in neural information processing systems, pp. 262–270, 2015.
- Geirhos, R., Rubisch, P., Michaelis, C., Bethge, M., Wichmann, F. A., and Brendel, W. Imagenet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=Bygh9j09KX.
- Gretton, A., Bousquet, O., Smola, A., and Schölkopf, B. Measuring statistical dependence with hilbert-schmidt norms. In International conference on algorithmic learning theory, pp. 63–77. Springer, 2005.
- Gretton, A., Fukumizu, K., Teo, C. H., Song, L., Schölkopf, B., and Smola, A. J. A kernel statistical test of independence. In Advances in neural information processing systems, pp. 585–592, 2008.
- Gururangan, S., Swayamdipta, S., Levy, O., Schwartz, R., Bowman, S., and Smith, N. A. Annotation artifacts in natural language inference data. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics, pp. 107–112, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/N18-2017.
- Hardt, M., Price, E., Srebro, N., et al. Equality of opportunity in supervised learning. In Advances in neural information processing systems, pp. 3315–3323, 2016.
- He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
- Hendrycks, D., Zhao, K., Basart, S., Steinhardt, J., and Song, D. Natural adversarial examples. arXiv preprint arXiv:1907.07174, 2019.
- Ilyas, A., Santurkar, S., Tsipras, D., Engstrom, L., Tran, B., and Madry, A. Adversarial examples are not bugs, they are features. In Advances in Neural Information Processing Systems, pp. 125–136, 2019.
- Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Bach, F. and Blei, D. (eds.), Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pp. 448–456, Lille, France, 07–09 Jul 2015. PMLR.
- Johnson, J., Alahi, A., and Fei-Fei, L. Perceptual losses for real-time style transfer and super-resolution. In European conference on computer vision, pp. 694–711. Springer, 2016.
- Kim, B., Kim, H., Kim, K., Kim, S., and Kim, J. Learning not to learn: Training deep neural networks with biased data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9012–9020, 2019.
- Kim, H., Kim, M., Seo, D., Kim, J., Park, H., Park, S., Jo, H., Kim, K., Yang, Y., Kim, Y., et al. Nsml: Meet the mlaas platform with a real-world case study. arXiv preprint arXiv:1810.09957, 2018.
- Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
- Kornblith, S., Norouzi, M., Lee, H., and Hinton, G. Similarity of neural network representations revisited. In International Conference on Machine Learning, 2019.
- LeCun, Y., Bottou, L., Bengio, Y., Haffner, P., et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
- Li, Y. and Vasconcelos, N. Repair: Removing representation bias by dataset resampling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9572–9581, 2019.
- Li, Y., Li, Y., and Vasconcelos, N. Resound: Towards action recognition without representation bias. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 513–528, 2018.
- McCoy, R. T., Pavlick, E., and Linzen, T. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019.
- Niven, T. and Kao, H.-Y. Probing neural network comprehension of natural language arguments. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019.
- Panda, R., Zhang, J., Li, H., Lee, J.-Y., Lu, X., and Roy-Chowdhury, A. K. Contemplating visual emotions: Understanding and overcoming dataset bias. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 579–595, 2018.
- Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pp. 8024–8035, 2019.
- Peyre, J., Sivic, J., Laptev, I., and Schmid, C. Weakly-supervised learning of visual relations. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5179–5188, 2017.
- Quadrianto, N., Sharmanska, V., and Thomas, O. Discovering fair representations in the data domain. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8227–8236, 2019.
- Ray, A., Sikka, K., Divakaran, A., Lee, S., and Burachas, G. Sunny and dark outside?! improving answer consistency in vqa through entailed question generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 5860–5865, 2019.
- Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115(3):211–252, 2015.
- Sevilla-Lara, L., Zha, S., Yan, Z., Goswami, V., Feiszli, M., and Torresani, L. Only time can tell: Discovering temporal data for temporal modeling. arXiv preprint arXiv:1907.08340, 2019.
- Shah, M., Chen, X., Rohrbach, M., and Parikh, D. Cycle-consistency for robust visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6649–6658, 2019.
- Shetty, R., Schiele, B., and Fritz, M. Not using the car to see the sidewalk–quantifying and controlling the effects of context in classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8218–8226, 2019.
- Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.
- Song, L., Smola, A., Gretton, A., Bedo, J., and Borgwardt, K. Feature selection via dependence maximization. Journal of Machine Learning Research, 13(May):1393–1434, 2012.
- Torralba, A. and Efros, A. A. Unbiased look at dataset bias. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1521–1528, June 2011.
- Tran, D., Wang, H., Torresani, L., and Feiszli, M. Video classification with channel-separated convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5552–5561, 2019.
- Wang, H., He, Z., and Xing, E. P. Learning robust representations by projecting superficial statistics out. In International Conference on Learning Representations, 2019a. URL https://openreview.net/forum?id=rJEjjoR9K7.
- Wang, T., Zhao, J., Yatskar, M., Chang, K.-W., and Ordonez, V. Balanced datasets are not enough: Estimating and mitigating gender bias in deep image representations. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5310–5319, 2019b.
- Weinzaepfel, P. and Rogez, G. Mimetics: Towards understanding human actions out of context. arXiv preprint arXiv:1912.07249, 2019.
- Zemel, R., Wu, Y., Swersky, K., Pitassi, T., and Dwork, C. Learning fair representations. In International Conference on Machine Learning, pp. 325–333, 2013.
- Zhang, C., Liu, Y., Liu, Y., Hu, Q., Liu, X., and Zhu, P. Fish-mml: Fisher-hsic multi-view metric learning. In International Joint Conference on Artificial Intelligence, pp. 3054–3060, 2018.