Bilateral Asymmetry Guided Counterfactual Generating Network for Mammogram Classification



Benign/malignant classification of mammograms with only image-level labels is challenging due to the absence of lesion annotations. Motivated by the symmetric prior that lesions on one breast rarely appear in the corresponding area of the other, given a diseased image we can pose a counterfactual question: how would the features have behaved if there were no lesions in the image? Answering it allows us to identify the lesion areas. We derive a new theoretical result for counterfactual generation based on the symmetric prior. By building a causal model that entails such a prior for bilateral images, we obtain two optimization goals for counterfactual generation, which are accomplished by our newly proposed counterfactual generative network. The proposed model is mainly composed of a Generative Adversarial Network and a prediction feedback mechanism, which are optimized jointly and promote each other. Specifically, the former improves classification performance by generating counterfactual features used to compute lesion areas, while the latter helps counterfactual generation through the supervision of the classification loss. The utility of our method and the effectiveness of each module are verified by state-of-the-art performance on INBreast and an in-house dataset, together with ablation studies.

Domain Knowledge, Bilateral Asymmetry, Counterfactual, Mammogram Classification

I Introduction

Breast cancer is the leading cause of cancer death among women [29]. Mammography-based Benign/Malignant Classification (BMC) is considered an effective way to diagnose breast cancer early. Note that only images with lesions need benign/malignant classification: it is meaningless to assess the malignancy of healthy images, since they contain no lesions, and whether an image contains lesions can be parsed from clinical reports. Since the existence of lesions is a necessary condition for a malignant diagnosis, we focus on benign/malignant classification for samples with lesions. Annotating lesion areas, e.g., with bounding boxes [6, 18, 32, 24, 30] or binary segmentation masks [5], requires expert domain knowledge and is costly and difficult to obtain. Therefore, addressing BMC with only image-level labels is valuable for clinical application. The key to BMC with only image-level labels as supervision is to extract abnormal features for classification from the full mammogram. Such abnormalities can appear as masses, calcification clusters, structural distortions, and associated signs such as skin retraction and skin thickening. However, high-intensity breast tissue in the 2D image (a projection of the 3D organ) may partially obscure the lesions, making the problem more challenging.

To solve this problem, existing works mainly rely on specific rules or attention modules for feature selection, such as selecting the local features with the maximum response or the largest prediction score [36], or selecting the most discriminative region via an attention branch supervised by a classification signal [7, 34]. The common problem of these methods lies in their failure to exploit mammogram domain knowledge, which can be very valuable for lesion localization.

Fig. 1: (a) Two cases showing how unhealthy breasts look asymmetrical. (b) Illustration that healthy breasts are roughly bilaterally symmetrical, although the patterns and appearance (e.g., structure, distribution, density, and morphology) of breast tissue can be very diverse among them.

One important piece of mammogram domain knowledge is “Anatomical Symmetry”, which is recognized by the BI-RADS standard of the American College of Radiology [27]. It states that a lesion area in the target image (the image of the side to be classified) rarely appears in the corresponding area of the reference image (the image of the opposite side), as shown in Fig. 1. Owing to this prior, radiologists commonly compare bilateral breasts to find asymmetric regions for further diagnosis.

Such a prior naturally motivates the counterfactual generation question: what would the features of the target image have looked like had the lesions been removed, given the observed target image with lesions and the reference image that is lesion-free in the corresponding area? Once such counterfactual features are generated, the residue between the original target features and the counterfactual ones carries the lesion information and hence provides informative and interpretable guidance for BMC. We answer the above question by constructing a structural causal model (SCM) [23], in which counterfactual learning is well defined. Specifically, the proposed SCM introduces latent bilateral variables that generate the bilateral images. To depict the bilateral symmetry, we further introduce a hidden confounder (including DNA, environment, etc.) that generates such bilateral features via the same causal mechanism. This naturally leads to the following conclusion: the counterfactual target features share the same distribution (i) with the reference features in lesion areas and (ii) with the target features in lesion-free areas; we call these the counterfactual constraints. Based on this theoretical finding, we propose a novel Counterfactual Generation Network (CGN). Because pixel-to-pixel registration between bilateral images is challenging, owing to spatial distortion during image capture and imperfect anatomical symmetry, we apply counterfactual generation at the feature level, motivated by [17]. Moreover, this achieves faster training without losing predictive power, which is also why many domain adaptation methods work in feature space. Our CGN iteratively optimizes counterfactual generation under the counterfactual constraints and lesion-area estimation via an attention-based prediction feedback mechanism. The lesion-area estimation and the counterfactual generation are optimized jointly and promote each other, supervised by the classification loss. Finally, the residual features, which carry accurate lesion information, and the original target features, which encode contextual information, are concatenated for the final classification.

In contrast to existing GAN-based works [35, 28, 25] for counterfactual generation, our method is endowed with a theoretical guarantee regarding the counterfactual distribution [4] by exploiting the symmetric prior. Specifically, AnoGAN [25] learns the latent space of healthy data and assumes that lesions cannot be reconstructed within such a latent space, so that areas with large reconstruction errors are more likely to be lesions. Its performance relies heavily on how well the healthy data are modeled. However, in our mammogram application, the glandular structure and characteristics of healthy images can be very diverse, and healthy patterns can sometimes even resemble lesions, as shown in Fig. 1. It is thus challenging to model healthy patterns well and distinguish lesions at the same time using only healthy data. Another line of work targets lesion removal with a cycle consistency loss [35, 28]. Although these methods can exploit lesion information by learning a back translation (i.e., from the counterfactual to the original), they also suffer from the healthy modeling problem in the forward translation (i.e., from the original to the counterfactual). Moreover, these methods all assume that the translated data can be translated back to the original data [13, 21]. In our application, this means the back-translation network should be able to model the location and appearance of the removed lesion. However, mammogram lesions can appear anywhere, i.e., their location is unpredictable. Translating the counterfactual data back to the corresponding original data perfectly is therefore an ill-posed problem.

In this paper, we introduce the symmetry prior into counterfactual learning and propose a bilateral asymmetry guided counterfactual generating network (CGN) that improves the performance of mammogram classification. Instead of learning from healthy images, our CGN performs counterfactual generation conditioned on the bilateral information. Based on the symmetry prior, we tie the generated counterfactual features and the estimated lesion areas together through the counterfactual constraints: the counterfactual features should follow a distribution similar to that of the reference features in lesion areas and retain most of the information of the target features in lesion-free areas. To this end, we first apply a deep generator with the AdaIN [14] mechanism to provide the feature generation ability. We then design a prediction feedback mechanism to help estimate the lesion areas. Meanwhile, an adversarial reference loss, a feedback triplet loss, and an auxiliary negative embedding loss are proposed to encourage the generated features to satisfy the above counterfactual constraints. The lesion-area estimation and the counterfactual generation are optimized jointly and promote each other. Further, we obtain the residual features by computing the difference between the target features and the generated counterfactual features. Finally, we aggregate the residual features with the target features for the final classification.

We evaluate the proposed method on a public dataset INBreast [20] and an in-house dataset. Our CGN achieves an area under the curve (AUC) of 91.1% on INBreast and 78.1% on the in-house dataset, which largely outperforms the representative methods. To summarize, our contributions are mainly three-fold:

  1. First, for benign/malignant classification with only image-level labels, we propose a novel counterfactual-based method that learns the healthy features of the target image, which helps localize the lesions and thereby facilitates classification;

  2. Second, we introduce the bilateral symmetry prior of mammograms into counterfactual generation, so that counterfactual features are learned reasonably and effectively;

  3. Third, we achieve state-of-the-art performance for mammogram classification on both the public and in-house datasets.

II Related Work

II-A BMC with only image-level labels

Previous approaches that can address BMC with only image-level labels, without any extra annotations, fall roughly into two classes: (i) attention-based methods, e.g., Zhu et al. [36], Zhou et al. [34] and Fukui et al. [7]; (ii) simple multi-view fusion methods, e.g., Wu et al. [33]. Methods in class (i) extend a response-based visual explanation model with an attention module or specific rules. However, they all ignore medical domain knowledge, which is valuable for BMC, and, since they do not learn from bilateral information, they are fragile when facing dense breasts. As for class (ii), since bilateral breasts are not pixel-to-pixel symmetric, simple multi-view fusion can be very sensitive to bilateral misalignment. Motivated by the above, we take advantage of domain knowledge and design CGN to improve BMC.

II-B Counterfactual Generation

Existing GAN-based models for counterfactual generation fall roughly into two classes: (i) healthy modeling methods, e.g., AnoGAN [25], and (ii) cycle consistency based methods, e.g., CycleGAN [35] and Fixed-Point GAN [28]. Methods in class (i), which learn to model the pattern of healthy data, suffer from unstable results because the glandular structure and characteristics of healthy images are highly diverse and hence difficult to model. Methods in class (ii) use a cycle consistency loss to incorporate bi-directional translation: forward translation (from the original to the counterfactual) and back translation (from the counterfactual to the original). These methods suffer from two problems: a) the healthy modeling problem in the forward translation, as in class (i); b) the ill-posed problem in the back translation, since the location and appearance of the removed lesion are diverse and unpredictable. In contrast to existing works, our method learns the healthy pattern by exploiting the symmetric prior, which avoids the problems mentioned above and hence yields more robust counterfactual generation.

III Methodology

Fig. 2: (a): Our causal graph with observed variables marked in yellow and unobserved variables marked in gray. For notations, denotes the DNA, growth environment that can explain the common properties shared between and ; denote lesion states ( if there are lesions in ; and if not ); respectively denote the hidden features of the image. (a) is mathematically expressed in our Eq. (1). (b): Our counterfactual learning framework, motivated by the symmetric prior (as shown in the top blue box). Our theoretical result (Theorem III.1) is illustrated in the bottom orange box, in which the denotes the counterfactual result of the target side with the removal of lesion areas, i.e., the counterfactual result of under counterfactual event . The blue arrows denote “distributional equivalence”. As shown, the distribution of is the same as that of , described by Eq. (2); the distribution of is the same as that of , described by Eq. (3).

Problem Setup and Notations The goal of mammogram benign/malignant classification is to learn a classifier that predicts the disease label of the target side , where () denotes the input space of bilateral breast images with denoting the target side of the bilateral breast image and correspondingly denoting the other side, a.k.a. the reference side, and denotes the disease label of the target side (1 denotes malignant and 0 denotes benign). To achieve this goal, we are given training data ( for any integer ). At test time, our goal is to predict for a new instance .

III-A Counterfactual Learning

Symmetric Prior [27] For paired image data, if the target image contains lesions, the corresponding symmetrical area in the reference image almost certainly has no lesions.

This symmetric prior provides guidance for localizing lesion areas, as the residue obtained by subtracting from the target image features the features with the corresponding lesions removed. The generation of the latter, which can leverage the information of the reference features thanks to the symmetric prior, is a counterfactual problem, i.e., what would the features of the target image have looked like had the lesions been removed, given the observed target image with lesions and the reference image that is lesion-free in the corresponding area? Such a counterfactual problem is well defined and has been explored in the framework of the Structural Causal Model (SCM) [23], which describes the generating process of the observational variables, with assumptions entailed in the corresponding causal graph.

To describe bilateral images, we propose an SCM that introduces a hidden common factor (denoted as which can refer to DNA, growth environment, etc.) that generates bilateral variables, which depicts our symmetric prior, as shown in Fig. 2 (a). Besides, our SCM incorporates bilateral latent features, denoted as ( denotes target side and denotes reference side), as abstractions/concepts of the bilateral images. Such bilateral features are affected by and disease status () that is determined by lesion status . The distributions of these variables are specified by the following structural equations:


Equipped with such an SCM, we can mathematically formulate the symmetric prior as , with denoting the lesion areas of the target image ; and the counterfactual generation problem as that can be read as the value of on in situation had [23]. Since the situation is induced by the factual event , our counterfactual distribution can be denoted as . Under our SCM and the symmetric prior, we have the following results for counterfactual generation:

Theorem III.1.

Under the symmetric prior, the structural equation model defined in Eq. (1) for Fig. 2 (a) has the following results for counterfactual distribution of target features:


The proof of Theorem III.1 is given in our appendix. This theorem implies that the generated counterfactual features should be equal (i) to the reference features in lesion areas and (ii) to the target features in lesion-free areas, which leads to the following two goals for counterfactual generation:


where denotes a generalized distance measure, e.g., the KL divergence. With such counterfactual learning, it is expected that the lesion areas, obtained by subtracting the counterfactual generation of (with lesions removed) from the original , can be detected precisely and hence lead to accurate classification. To achieve the above two goals, we propose a counterfactual generating network (CGN), which cooperatively localizes the lesion areas and achieves counterfactual generation simultaneously. We explain the CGN in detail in the subsequent section.

Fig. 3: The schematic overview of CGN. First, two feature extractors with weight sharing extract the features for input paired target and reference images, respectively. Then the bilateral features are processed by AdaIN mechanism and fed into the generator to generate the counterfactual features. The counterfactual features are constrained by adversarial learning with a feedback triplet loss , and a negative embedding loss . Then, the residual features are obtained by computing the difference between the target features and counterfactual features. Finally, the residual features are fed into a Fusion network with target features and outputs prediction of benign/malignant.

III-B Counterfactual Generating Network (CGN)

As illustrated in Fig. 3, our counterfactual generation network for mammogram classification contains the following steps: (i) generation of target and reference features and from images and , via a feature extractor chosen from a backbone network, e.g., AlexNet [16] or ResNet [11]; (ii) a counterfactual generation module designed to generate counterfactual features from both and ; (iii) a classification module designed to predict malignant/benign, with aggregated and as input. To accurately identify for generating in step (ii), a prediction feedback mechanism and a set of counterfactual constraints motivated by Eq. (4) and (5) are designed. In what follows, we explain the above mechanisms in more detail.

Counterfactual Generation Module The Adaptive Instance Normalization (AdaIN) [14], which has been proved to be effective for style transfer tasks, is adopted as the generator (as shown in Fig. 3) for counterfactual generation, with as content and as style in our case:


with and denoting the mean and standard deviation functions. As suggested by [14], an interpolated and AdaIN are fed into a generator network containing nine residual blocks to generate counterfactual features :


where is a hyper-parameter of the interpolation weight.
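As a rough illustration of the AdaIN step and the interpolation described above, the following NumPy sketch applies the standard AdaIN formula of [14] to feature maps of shape (C, H, W). Treating the target features as content and the reference features as style, the linear form of the interpolation, and the names `adain`, `generator_input` and `alpha` are assumptions made for this example; the generator with nine residual blocks is not reproduced here.

```python
import numpy as np

def adain(content, style, eps=1e-5):
    """Standard AdaIN on (C, H, W) feature maps (formula from Huang & Belongie)."""
    # Per-channel statistics over the spatial dimensions.
    c_mean = content.mean(axis=(1, 2), keepdims=True)
    c_std = content.std(axis=(1, 2), keepdims=True) + eps
    s_mean = style.mean(axis=(1, 2), keepdims=True)
    s_std = style.std(axis=(1, 2), keepdims=True)
    # Normalize the content, then rescale and shift with the style statistics.
    return s_std * (content - c_mean) / c_std + s_mean

def generator_input(target_feat, reference_feat, alpha=0.5):
    """Assumed linear interpolation between target features and their AdaIN output."""
    stylized = adain(target_feat, reference_feat)
    return (1.0 - alpha) * target_feat + alpha * stylized

rng = np.random.default_rng(0)
f_t = rng.normal(1.0, 2.0, size=(4, 8, 8))   # hypothetical target features
f_r = rng.normal(-1.0, 0.5, size=(4, 8, 8))  # hypothetical reference features
stylized = adain(f_t, f_r)
mixed = generator_input(f_t, f_r, alpha=0.5)
```

With `alpha = 0` the generator would receive the unchanged target features, and with `alpha = 1` it would receive features whose per-channel statistics match the reference side.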

Classification Module The residual features (entailing lesion information) obtained by and (carrying additional contextual information, which has been shown to be useful for medical image inference [2], besides the lesion-related information) are concatenated and fed into a classifier. This classifier, which uses a convolutional block as a FusionLayer to obtain the fused features, is trained via the commonly used cross-entropy loss:


where (with ) is the classification probability.
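The data flow into the classifier can be sketched as follows, assuming (C, H, W) feature maps, channel-wise concatenation, and a binary form of the cross-entropy loss. The function names are hypothetical and the convolutional FusionLayer itself is omitted.

```python
import numpy as np

def fusion_input(target_feat, counterfactual_feat):
    """Concatenate residual and target features along the channel axis."""
    # Residual = target minus counterfactual, which should highlight lesion areas.
    residual = target_feat - counterfactual_feat
    return np.concatenate([residual, target_feat], axis=0)

def binary_cross_entropy(prob_malignant, label, eps=1e-12):
    """Cross-entropy for one sample; label is 1 (malignant) or 0 (benign)."""
    p = np.clip(prob_malignant, eps, 1.0 - eps)
    return float(-(label * np.log(p) + (1 - label) * np.log(1.0 - p)))

rng = np.random.default_rng(0)
f_t = rng.normal(size=(4, 8, 8))  # hypothetical target features
f_c = rng.normal(size=(4, 8, 8))  # hypothetical counterfactual features
fused_in = fusion_input(f_t, f_c)
```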

Prediction Feedback Mechanism This mechanism estimates the lesion areas for better counterfactual generation. Specifically, we use the attention map, in which locations with higher values imply higher lesion probabilities, as the final estimate of . Such an attention map is calculated by normalization/softmax following the class activation map (CAM) [34], i.e., . is the corresponding predicted probability of being a lesion at each position.
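A minimal sketch of such an attention map is given below, assuming the usual CAM construction (a class-weighted sum of the feature channels) followed by a softmax over spatial positions. The function name and the choice of a spatial softmax, rather than another normalization, are assumptions of this example.

```python
import numpy as np

def attention_from_cam(features, class_weights):
    """features: (C, H, W); class_weights: (C,) classifier weights of one class."""
    # CAM: weighted sum of the feature channels, giving an (H, W) map.
    cam = np.tensordot(class_weights, features, axes=([0], [0]))
    # Softmax over all spatial positions (shifted by the max for stability).
    expd = np.exp(cam - cam.max())
    return expd / expd.sum()

rng = np.random.default_rng(0)
feats = rng.normal(size=(3, 4, 4))   # hypothetical feature maps
weights = rng.normal(size=3)         # hypothetical class weights
attention = attention_from_cam(feats, weights)
```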

| Method | AUC (a) | AUC (b) | AUC (c) | AUC (d) |
|---|---|---|---|---|
| Pretrained CNN [6] | 0.690 | - | - | - |
| Pretrained CNN+Random Forest [6] | 0.760 | - | - | - |
| Vanilla AlexNet, Zhu et al. [36] | 0.790 | - | - | - |
| Zhu et al. [36] | 0.890 | - | - | - |
| Vanilla* | 0.820 | 0.827 | 0.780 | 0.697 |
| AnoGAN [25]* | 0.803 | 0.796 | 0.774 | 0.720 |
| Fixed-Point GAN [28]* | 0.835 | 0.837 | 0.805 | 0.734 |
| CycleGAN [35]* | 0.852 | 0.838 | 0.808 | 0.741 |
| Wu et al. [33] | 0.863 | 0.860 | 0.810 | 0.723 |
| Zhu et al. [36]* | 0.860 | 0.862 | 0.830 | 0.720 |
| Vanilla*+GAP [34]* | 0.857 | 0.827 | 0.780 | 0.718 |
| Vanilla*+ABN [7]* | 0.858 | 0.846 | 0.814 | 0.723 |
| Proposed Method | 0.910 | 0.911 | 0.885 | 0.781 |

TABLE I: AUC evaluation of comparative experiments on (a) INBreast + AlexNet (mass); (b) INBreast + ResNet50 (mass); (c) INBreast + ResNet50 (mixed lesions); (d) In-house + AlexNet (mixed lesions). Note that '*' denotes our re-implementation and '-' means that no officially reported result is available.

Counterfactual Constraints Since the direct optimization of Eq. (4) and (5) can be intractable/unstable for a general distance measure such as the KL divergence, we adopt an adversarial learning strategy [8]. For the optimization of Eq. (4), the GAN generates features similar to those of the whole reference image and can thus constrain the desired features to match the reference features in lesion areas. Specifically, a Discriminator (which learns to classify and ) and a Generator (which fools the discriminator) are designed and trained in a competing way:


However, the features generated through the GAN loss alone are undesirable in lesion-free areas. For the optimization of Eq. (5), we use the prediction feedback mechanism to localize lesion areas. One intuitive way to use the feedback is to directly constrain the generated features to be the same as the target features in lesion-free areas, or to only constrain the generated features to be the same as the reference features in lesion areas within the discriminator. However, such designs suffer from slow convergence and easily fall into local minima; motivated by [26], a triplet loss can work better, and we analyze and evaluate these variants in Sec. IV-G. Thus, we propose a feedback triplet loss to minimize the distance between the target features and the counterfactual features in lesion-free areas, where the target-counterfactual distance is measured by a weighted mean square error:


where and denote the height and width of the CAM, respectively. Motivated by the minimization of the distance between and enforced by Eq. (9), we choose a between and as an adaptive reference for minimizing . The is measured by the chamfer distance [1] to tolerate misalignment, and is defined by


Therefore, the feedback triplet loss is defined as:


The triplet loss makes be closer to than in the lesion-free areas. Further, the GAN loss makes and close in the lesion areas. Through the cooperation of the GAN loss and the triplet loss, the generated satisfies Eq. (4) and (5). Besides, as a margin term can prevent learning an identity mapping from to while minimizing . Accounting for misalignment is not needed for since is for the “target” and hence perfectly aligned with pixel-wise.
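The interplay of the two distances can be sketched as below. Everything specific here is an assumption made for illustration: the weights are taken as one minus the attention map, the chamfer distance treats each spatial position as a C-dimensional point, and the hinge form with a fixed `margin` is the generic triplet loss, whereas the text above describes an adaptive reference distance. The exact definitions in the paper's equations may differ.

```python
import numpy as np

def weighted_mse(target, counterfactual, attention):
    """Target-counterfactual distance, down-weighting likely lesion positions."""
    weights = 1.0 - attention                                 # (H, W), assumed form
    sq_err = ((target - counterfactual) ** 2).mean(axis=0)    # (H, W)
    return float((weights * sq_err).mean())

def chamfer_distance(a, b):
    """Symmetric chamfer distance between two (C, H, W) maps viewed as point sets."""
    pa = a.reshape(a.shape[0], -1).T                          # (H*W, C)
    pb = b.reshape(b.shape[0], -1).T
    d2 = ((pa[:, None, :] - pb[None, :, :]) ** 2).sum(axis=2)
    # Each point is matched to its nearest neighbour in the other set.
    return float(d2.min(axis=1).mean() + d2.min(axis=0).mean())

def feedback_triplet_loss(target, reference, counterfactual, attention, margin=0.0):
    """Hinge-style triplet: pull counterfactual toward target relative to reference."""
    d_tc = weighted_mse(target, counterfactual, attention)
    d_rc = chamfer_distance(reference, counterfactual)
    return max(d_tc - d_rc + margin, 0.0)

rng = np.random.default_rng(0)
f_t = rng.normal(size=(3, 4, 4))          # hypothetical target features
f_r = rng.normal(size=(3, 4, 4))          # hypothetical reference features
att = np.full((4, 4), 1.0 / 16.0)         # hypothetical attention map
# Spatially shuffled copy of f_t: same points, different positions.
shuffled = f_t.reshape(3, -1)[:, rng.permutation(16)].reshape(3, 4, 4)
```

Because the chamfer distance matches nearest neighbours regardless of position, it is insensitive to spatial shuffling, which is the property that makes it suitable for comparing imperfectly aligned bilateral features.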

Besides, since the lesion regions of have been removed in , the must also be non-malignant. Such knowledge can be reflected via an auxiliary negative embedding loss as a constraint:


where denotes the malignant probability of .

Joint Optimization The final loss is a combination of the losses defined in Eq. (8), (9), (12) and (13):


where denotes the sample index; that is, we calculate the corresponding losses for each sample and derive the final joint loss. By optimizing the loss , these modules are optimized cooperatively and compatibly: the counterfactual generation helps discover the lesions for classification, while the classification module helps counterfactual generation in a supervised way. The effect of these modules is validated by our ablation study, which is detailed in the next section.

IV Experiments

IV-A Implementation Details

Mammogram images are commonly stored in a 14-bit DICOM format. A simple linear mapping is used to convert them into 8-bit grayscale images. Then, Otsu's method [22] is used for breast region segmentation and background removal. The segmented images are resized into and fed to the networks. We implement all models with PyTorch. The models are initialized with ImageNet pre-trained weights for a fair comparison with the representative method [36]. For training, we use the Adam optimizer with a learning rate of and train for 50 epochs. For all experiments, we select the best model on the validation set for testing. Both target and reference features are extracted from the last convolutional layer.
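The preprocessing steps can be sketched in NumPy as follows. The fixed 0 to 16383 input range for the linear mapping and the function names are assumptions of this example, and the Otsu threshold is the textbook between-class-variance formulation rather than the authors' code.

```python
import numpy as np

def to_8bit(img14):
    """Linearly map 14-bit intensities (0..16383) to 8-bit (0..255)."""
    scaled = img14.astype(np.float64) * (255.0 / 16383.0)
    return np.round(scaled).astype(np.uint8)

def otsu_threshold(img8):
    """Return the gray level that maximizes the between-class variance."""
    hist = np.bincount(img8.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()
    omega = np.cumsum(prob)                     # probability of class "<= level"
    mu = np.cumsum(prob * np.arange(256))       # cumulative mean up to each level
    denom = omega * (1.0 - omega)
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = (mu[-1] * omega - mu) ** 2 / denom
    sigma_b[denom == 0] = 0.0                   # levels where one class is empty
    return int(np.argmax(sigma_b))

def breast_mask(img8):
    """Foreground (breast) pixels are those brighter than the Otsu threshold."""
    return img8 > otsu_threshold(img8)

# Hypothetical toy image: dark background with a bright rectangular "breast" region.
toy = np.zeros((8, 8), dtype=np.uint16)
toy[:, :4] = 10000
toy8 = to_8bit(toy)
mask = breast_mask(toy8)
```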

| Methodology | Top-1 error (b) | Top-1 error (d) |
|---|---|---|
| ResNet50 [10] | 0.635 | 0.727 |
| AnoGAN [25]* | 0.684 | 0.789 |
| Fixed-Point GAN [28]* | 0.646 | 0.737 |
| CycleGAN [35]* | 0.632 | 0.667 |
| Wu et al. [33]* | 0.627 | 0.650 |
| ABN [7] | 0.632 | 0.722 |
| Zhu et al. [36]* | 0.627 | 0.625 |
| Proposed Method | 0.421 | 0.455 |

TABLE II: Top-1 localization error on (b) INBreast dataset for mass classification with ResNet50; (d) INBreast dataset for mixed-lesion classification with ResNet50.

IV-B Datasets

We evaluate our method on the public INBreast dataset [20], chosen for its high quality compared to other public datasets [36], and on an in-house dataset. The INBreast dataset contains 115 cases and 410 mammograms. INBreast provides a BI-RADS assessment for each image as image-wise ground truth, and we use the same labeling process as Zhu et al. [36] (malignant if BI-RADS > 3; benign otherwise). Our experimental setting on INBreast otherwise follows Zhu et al. [36], who use 100 mammograms with masses and report image-wise malignancy classification performance. We consider two settings: mass classification, and mixed-lesion classification in which the lesions can be masses, calcification clusters, or distortions. First, following [36], we select only the images containing masses for mass malignancy classification. We discard 9 of the 100 images because their contralateral (reference) images are missing; the remaining 91 images all have opposite sides, giving 91 pairs. Second, for generality, we also evaluate mixed-lesion malignancy classification. We use five-fold cross-validation for evaluation and the area under the curve (AUC) as the metric.

The in-house dataset contains 2500 images, of which 1303 have image-level malignancy annotations. These comprise 589 images with only masses, 120 with only suspicious calcifications, 34 with only architectural distortions, 197 with only asymmetries, and 363 with multiple lesions, from 642 patients. All these 1303 images have opposite sides, i.e., 1303 pairs (note that a target image A with a malignancy annotation is paired with B, counting as one pair; if B also has a malignancy annotation, B can conversely be the target and A the reference, counting as another pair). We randomly divide the dataset into training, validation and testing sets by the proportion of patient-wise.

IV-C Experiment Settings

To compare our method fairly with others in a more general way, we use AlexNet as the backbone on both INBreast (for mass malignancy classification) and the in-house dataset (for mixed-lesion malignancy classification). We also use ResNet50 as the backbone on INBreast (for both mass and mixed-lesion malignancy classification).

IV-D Bilateral Distribution Verification

In this section, we verify the symmetric prior assumption that motivates our proposed framework. Specifically, we choose 1,000 unhealthy pairs of bilateral images from the in-house dataset, each of which contains at least one lesion. For comparison, we choose another 1,000 healthy pairs. We do not use the public INBreast dataset since it contains few healthy pairs. To measure the distance between image distributions, we use the Fréchet Inception Distance (FID) [12], which has been used to evaluate medical images [9, 19]. After calculating the FID value of the healthy set and the unhealthy set , we conduct hypothesis testing with the null hypothesis and the alternative hypothesis defined as:


We obtain a p-value of , which provides evidence for rejecting , i.e., the bilateral distribution distance of unhealthy cases is significantly larger than that of healthy cases. This result supports our symmetric prior assumption.

IV-E Experimental Analysis

Compared Baselines for Malignancy Classification. We conduct our experiments on both mass malignancy classification (the 2nd and 3rd columns of Table I) and mixed-lesion malignancy classification (the last two columns of Table I).

The first four rows in Table I summarize the officially reported results of the representative methods. To be fair, we compare results with the AlexNet [16] and ResNet50 [11] backbones separately. Because the number of images used differs slightly after discarding images without a reference, we re-implement some baselines for a fair comparison: the vanilla methods, i.e., plain AlexNet [16] / ResNet50 [11]; classification methods [36, 33]; natural image classification methods [34, 7]; and counterfactual generation methods [35, 25, 28].

Result Analysis. As shown in Table I, we achieve state-of-the-art performance. We outperform attention-based methods (Zhu [36], ABN [7] and CAM [34]) largely by to , the multi-view method (Wu [33]) largely by to and GAN-based methods (AnoGAN [25], Fixed-Point GAN [28] and CycleGAN [35]) largely by to . Specifically, Zhu [36], ABN [7] and CAM [34] take advantage of the attention mechanism, and they all outperform the vanilla baseline. However, without exploiting the domain knowledge of mammograms, their performance is limited. Wu [33] uses simple multi-view fusion. Its better results compared with the vanilla baseline indicate that the bilateral information is useful. However, it is inferior to our method since mammograms cannot be aligned pixel-to-pixel. As for AnoGAN [25], relative to the vanilla baseline, it performs slightly worse on the INBreast dataset than on the in-house dataset. We argue this is because the in-house dataset contains relatively more healthy images, leading to better healthy modeling. However, its results are still much lower than ours because of the varied healthy patterns in mammograms. Fixed-Point GAN [28] and CycleGAN [35] achieve similar performance due to their similar cycle consistency constraints. They outperform AnoGAN since they can make use of the image-level annotations. However, their performance is limited by the ill-posed translation in lesion removal.

Fig. 4: Visualization of class activation maps of Vanilla CNN, AnoGAN [25], Fixed-Point GAN [28], CycleGAN [35], Wu et al. [33], Zhu et al. [36], ABN[7] and CGN. Each row represents a pair of mammograms from bilateral breasts in the INBreast. The target containing lesions is bounded by a red rectangle. The ground truth bounding boxes are labeled by green rectangles.

Localization Evaluation To verify whether the proposed model focuses on the lesion areas, we evaluate the localization error using CAM [34]. As in [34], we first calculate the CAMs based on the predicted category. Then, to generate a bounding box from a CAM, we segment the regions whose CAM value is larger than 20% of the maximum CAM value and take the bounding box of the largest connected component in the segmentation map. We use the top-1 localization error as in ILSVRC, except that the intersection over union (IoU) threshold is set to 0.1: since our main concern is classification performance, precise localization is not necessary. As shown in Table II, the proposed method obtains a localization error of 0.421 for masses and 0.455 for all lesions, outperforming the other methods.
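The box-extraction procedure described above can be sketched as follows. The 4-connectivity used for connected components, the `(x0, y0, x1, y1)` box convention with exclusive upper bounds, and the function names are assumptions of this example.

```python
import numpy as np
from collections import deque

def cam_to_bbox(cam, rel_threshold=0.2):
    """Box (x0, y0, x1, y1) of the largest connected region above 20% of the max."""
    mask = cam > rel_threshold * cam.max()
    h, w = mask.shape
    seen = np.zeros((h, w), dtype=bool)
    best = []
    for i in range(h):
        for j in range(w):
            if mask[i, j] and not seen[i, j]:
                # Breadth-first search over one 4-connected component.
                comp, queue = [], deque([(i, j)])
                seen[i, j] = True
                while queue:
                    y, x = queue.popleft()
                    comp.append((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            queue.append((ny, nx))
                if len(comp) > len(best):
                    best = comp
    if not best:
        return None  # no pixel exceeds the threshold
    ys, xs = zip(*best)
    return min(xs), min(ys), max(xs) + 1, max(ys) + 1

def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    iw = max(min(a[2], b[2]) - max(a[0], b[0]), 0)
    ih = max(min(a[3], b[3]) - max(a[1], b[1]), 0)
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union

def is_localized(pred_label, true_label, box, gt_box, iou_threshold=0.1):
    """Top-1 localization is correct if the class is right and the IoU is high enough."""
    return pred_label == true_label and box is not None and iou(box, gt_box) >= iou_threshold

# Hypothetical CAM with one large and one small activated region.
cam = np.zeros((10, 10))
cam[2:5, 3:7] = 1.0
cam[8, 8] = 0.5
box = cam_to_bbox(cam)
```

The top-1 localization error over a test set would then be the fraction of images for which `is_localized` returns `False`.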

Visualization. To verify the effectiveness of CGN in learning lesion areas, we visualize the class activation maps in Fig. 4. The asymmetry of lesions on bilateral images validates the bilateral asymmetric prior (the first three columns). The proposed CGN succeeds in focusing on all lesions since it incorporates the bilateral symmetry prior. In contrast, the other methods show uneven results. In the first two cases, the other methods also show reasonable attention since the mass areas differ strongly from the background. In the last two cases, however, the lesions are relatively indistinct, so it is quite challenging to find them without bilateral information.

IV-F Counterfactual Validation

Since there are no ground-truth images under counterfactual conditions, we validate the effectiveness and reasonableness of our generated counterfactual features in two ways, FID measurement and feature visualization, which are motivated by the counterfactual evidence in [3].

Bilateral Triplet Loss     AUC(a)  AUC(b)  AUC(c)  AUC(d)
Vanilla                    0.820   0.827   0.780   0.697
SBF                        0.862   0.858   0.807   0.721
TF-GAN                     0.883   0.873   0.857   0.731
BF-GAN                     0.860   0.842   0.849   0.720
AdaIN-GAN                  0.886   0.873   0.858   0.734
AdaIN-GAN                  0.891   0.898   0.874   0.777
AdaIN-GAN Non-feedback     0.873   0.863   0.858   0.741
Feedback                   0.837   0.851   0.836   0.716
AdaIN-GAN Feedback         0.905   0.902   0.884   0.771
AdaIN-GAN Feedback         0.910   0.911   0.885   0.781
TABLE III: AUC evaluation of ablation study on (a) INBreast dataset for mass classification with Alexnet; (b) INBreast dataset for mass classification with Resnet50; (c) INBreast dataset for mixed-lesion classification with Resnet50; (d) in-house dataset for mixed-lesion classification with Alexnet.

Counterfactual Visualization. We visualize the target features, reference features, and generated counterfactual features in Fig. 5 to further verify the effectiveness of our counterfactual generation qualitatively. Since all three kinds of features are high-dimensional, we perform max-pooling across the channel dimension to generate a visualization heatmap for each of them. The heatmaps are shown in the last three columns, respectively. The activated lesion features in the target features, marked by green rectangles, disappear in the counterfactual features, while the counterfactual features in lesion-free areas remain similar to the target features. This means that the proposed method can effectively generate a healthy version of the target features, i.e., counterfactual features.
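The channel-wise max-pooling used for these heatmaps amounts to taking, at every spatial location, the largest activation over all channels. A minimal sketch in plain Python follows; the function name `channel_max_heatmap` and the nested-list C x H x W layout are assumptions made for illustration, and a real pipeline would more likely apply the same reduction to a framework tensor.

```python
def channel_max_heatmap(features):
    """Collapse a C x H x W feature tensor (nested lists) to an H x W heatmap
    by taking the maximum over the channel axis at every spatial location."""
    height, width = len(features[0]), len(features[0][0])
    return [
        [max(channel[y][x] for channel in features) for x in range(width)]
        for y in range(height)
    ]


# Toy example with 2 channels on a 2 x 2 grid.
toy_features = [
    [[1, 2], [3, 4]],
    [[5, 0], [1, 9]],
]
toy_heatmap = channel_max_heatmap(toy_features)  # [[5, 2], [3, 9]]
```

Applying the same function to the target, reference and counterfactual features gives three heatmaps that can be compared side by side.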

We also visualize the predicted lesion locations during iterative training in Fig. 6 to further verify the effectiveness of CGN. As training proceeds, the predicted lesion locations become increasingly accurate.

FID measurement. To further evaluate the effectiveness of the generated counterfactual features, we calculate the mean FID [9] to measure the distances between feature distributions on INBreast. The mean FID between the target and reference features is 56.15. The counterfactual-reference mean FID is 27.04. The target-counterfactual mean FID is 25.42, while it drops to 0.60 after the ground-truth lesion areas are removed. Comparing the four distances, we find that the learned counterfactual features contain both reference information and target information in healthy areas.
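For reference, the Fréchet distance that FID is based on compares two feature sets A and B through the means and covariances of Gaussians fitted to them. The formula below is the standard definition; the symbols are introduced here for illustration, and how the per-pair distances are averaged into the reported mean FID is not specified in this section.

```latex
\mathrm{FID}(A, B) = \lVert \mu_A - \mu_B \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_A + \Sigma_B - 2\,(\Sigma_A \Sigma_B)^{1/2} \right)
```

Here \mu_A, \Sigma_A and \mu_B, \Sigma_B are the sample mean and covariance of the two feature sets. The distance is zero when both statistics coincide, which matches the reading that a small target-counterfactual value outside the lesion areas indicates nearly unchanged healthy features.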

IV-G Ablation Study

We evaluate several variant models to verify the effectiveness of each component. The ablation results in Table III show that removing or changing any component leads to a drop in classification performance. Specifically, even naive fusion of bilateral features (SBF) improves over the vanilla model in every setting, which shows that the bilateral symmetric prior is quite helpful for malignancy classification. Meanwhile, the proposed prediction feedback mechanism outperforms the non-feedback variant by a large margin. We attribute this to the classification module providing additional useful supervision for lesion localization, which makes learning more accurate and stable. The additional counterfactual constraint of the negative embedding loss further improves performance. The variants are defined as follows:

First row: Vanilla single-view network;

SBF: Simple Bilateral features. The bilateral features are directly concatenated and fed into the fusion layer;

TF-GAN: Target-feature GAN. Replace AdaIN input by target features only;

BF-GAN: Bilateral-feature GAN. Replace AdaIN input by simple combination of bilateral features;

Non-feedback: Estimate lesion areas by the areas with the largest target-counterfactual distance.
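To make the Non-feedback variant concrete, the sketch below computes a per-location distance between target and counterfactual features and picks the location where they differ most. It is only an illustration under assumptions: the Euclidean distance across channels, the nested-list C x H x W layout, and the names `distance_map` and `most_changed_location` are not taken from the paper, which does not state here how the distance or the extent of the area is defined.

```python
import math


def distance_map(target, counterfactual):
    """Per-location Euclidean distance across channels between two
    C x H x W feature tensors given as nested lists."""
    height, width = len(target[0]), len(target[0][0])
    return [
        [
            math.sqrt(sum((t[y][x] - c[y][x]) ** 2
                          for t, c in zip(target, counterfactual)))
            for x in range(width)
        ]
        for y in range(height)
    ]


def most_changed_location(target, counterfactual):
    """Return the (row, col) where target and counterfactual differ most."""
    dist = distance_map(target, counterfactual)
    locations = ((y, x) for y in range(len(dist)) for x in range(len(dist[0])))
    return max(locations, key=lambda p: dist[p[0]][p[1]])
```

With the feedback mechanism, by contrast, the lesion estimate is additionally supervised by the classification loss rather than read off this distance alone.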

To further verify the effectiveness of the proposed adversarial loss and feedback triplet loss, we evaluate two variants:

Methodology        AUC(a)  AUC(b)  AUC(c)  AUC(d)
Variant (1)        0.884   0.886   0.878   0.767
Variant (2)        0.860   0.863   0.850   0.739
Proposed Method    0.910   0.911   0.885   0.781
TABLE IV: AUC evaluation on (a) INBreast dataset for mass classification with Alexnet; (b) INBreast dataset for mass classification with Resnet50; (c) INBreast dataset for mixed-lesion classification with Resnet50; (d) in-house dataset for mixed-lesion classification with Alexnet.

Methodology        AUC(a)  AUC(b)  AUC(c)  AUC(d)
SBF                0.862   0.858   0.807   0.721
GF                 0.865   0.862   0.812   0.726
SFF                0.864   0.862   0.813   0.724
Proposed Method    0.910   0.911   0.885   0.781
TABLE V: AUC evaluation of bilateral comparative experiments. We compare our method against other bilateral methods under four experimental settings: (a) INBreast dataset for mass malignancy classification with Alexnet; (b) INBreast dataset for mass malignancy classification with Resnet50; (c) INBreast dataset for mixed-lesion malignancy classification with Resnet50; (d) in-house dataset for mixed-lesion malignancy classification with Alexnet.

Variant (1): As to the discriminator loss, we directly minimize the distance between counterfactual features and reference features in lesion areas. We still estimate the lesion areas by the prediction feedback mechanism.

Compared with the competing losses we used for discriminator and generator in our paper:


We denote the modified discriminator loss and generator loss of variant (1) as:


therefore we have the final losses:


which are iteratively trained with .

Variant (2): In place of the feedback triplet loss, we design a variant feedback loss that directly constrains the generated features in lesion-free areas to be similar to the target features.

The is defined as:


where is defined as Eq. (10);

Therefore we have the final losses:


which are iteratively trained with . The and are the generator loss and the discriminator loss respectively, as we used in the competing loss in .
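Both variants replace a learned objective with a direct feature-matching term restricted to a spatial region: Variant (1) matches counterfactual and reference features inside the estimated lesion areas, and Variant (2) matches counterfactual and target features in lesion-free areas. The sketch below shows one plausible form of such a masked term. The mean squared distance, the binary H x W mask, and the name `masked_mean_squared_distance` are assumptions for illustration only and are not the paper's loss definitions.

```python
def masked_mean_squared_distance(a, b, mask):
    """Mean over masked locations of the squared channel-wise difference
    between two C x H x W tensors; mask is H x W with truthy = included."""
    total, count = 0.0, 0
    for y, row in enumerate(mask):
        for x, keep in enumerate(row):
            if keep:
                total += sum((ca[y][x] - cb[y][x]) ** 2 for ca, cb in zip(a, b))
                count += 1
    # An empty mask contributes no penalty (a choice made for this sketch).
    return total / count if count else 0.0


def invert_mask(mask):
    """Swap lesion and lesion-free locations in a binary H x W mask."""
    return [[not keep for keep in row] for row in mask]
```

Under these assumptions, a Variant (1) style term would be `masked_mean_squared_distance(counterfactual, reference, lesion_mask)` and a Variant (2) style term would be `masked_mean_squared_distance(counterfactual, target, invert_mask(lesion_mask))`.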

Fig. 5: Visualization. Left three columns: the target images, the target images with ground truth annotations which are marked by green rectangles on lesion areas, and reference images which are flipped horizontally for convenient comparison; Right three columns: feature maps of target images, feature maps of reference images, and feature maps of our generated counterfactual features. All visualized features are obtained by taking the maximum value of 256 channels. The green rectangles in each row mark the features in lesion areas before and after the counterfactual generation.
Fig. 6: Iterative Training Process. Left three columns: the left-side images and the right-side images, each labeled below as target or reference, and the target images with ground-truth annotations marked by green rectangles on lesion areas; Right five columns: the lesion locations predicted by CGN during training, shown every ten epochs.

The experimental results of the two variants against our proposed method are shown in Table IV. Modifying either the adversarial loss or the feedback triplet loss leads to a drop in performance, which suggests that our proposed losses are robust and effective. As discussed earlier, because bilateral images cannot be registered pixel-to-pixel, we perform counterfactual generation at the feature level instead of the image level. In our experiments, feature-level generation performs better than image-level generation, supporting this choice. Moreover, feature-level training is faster than image-level training, at 6.6 s/epoch vs. 23.5 s/epoch.

IV-H Bilateral Analysis

For bilateral analysis, we re-implement some interesting modules used in recent papers.

SBF: Simple Bilateral Features, as described in the ablation study, e.g., as applied by Kim et al. [15] to tomosynthesis;

GF: Gated fusion in SBF. Learning more weights for asymmetric enhancement based on SBF [17];

SFF: Simple Four-view Features fusion. A simple ensemble of cross-view and contralateral-view features [31].

SFF and GF can be seen as variants of SBF. As shown in Table V, SFF and GF slightly outperform SBF because they use more information, but they are inferior to our proposed method because they use view-wise information naively. Both share the same disadvantage as SBF: even for healthy breasts, bilateral mammograms are only roughly symmetric rather than pixel-to-pixel symmetric, so the similarity of bilateral features cannot be guaranteed. In contrast, our method exploits the symmetric prior through counterfactual generation with an improved GAN. Our method therefore suffers less from these problems and achieves better results.

V Conclusion

In this paper, we propose a novel approach called bilateral asymmetry guided Counterfactual Generating Network (CGN) to improve mammogram classification performance. The proposed method performs counterfactual generation by effectively exploiting the symmetric prior. Experimental results indicate that the proposed CGN achieves state-of-the-art results on both public and in-house datasets. Our work can serve as a showcase of exploiting the symmetric prior, which holds widely across human organs, e.g., brains, eyes, skeletal structures, and kidneys. We therefore believe that our method can generalize to the corresponding medical imaging problems, which we leave to future work.

Appendix A Proof of Theorem 3.1

Lemma A.1.

If the causal graph satisfies that the common factor influences the bilateral variables simultaneously, then,


Lemma A.1 shows that the causal factor influences the bilateral mammograms in equal function relationship.

Proof of Theorem 3.1.

Proof of Eq. (2):


where the first equation is due to that the is the only parent node of ; the second equation is according to Markov condition that , the third equation is due to the symmetric prior.

Proof of Eq. (3): Since in the lesion-free areas, there are , the probabilities are derived by the actual hidden features directly, i.e.,



  1. P. Achlioptas, O. Diamanti, I. Mitliagkas and L. Guibas (2017) Learning representations and generative models for 3d point clouds. arXiv preprint arXiv:1707.02392.
  2. J. Amores and P. Radeva (2005) Retrieval of ivus images using contextual information and elastic matching. Int. J. Intell. Syst. 20, pp. 541–559.
  3. M. Besserve, A. Mehrjou, R. Sun and B. Schölkopf (2020) Counterfactuals uncover the modular structure of deep generative models. In International Conference on Learning Representations.
  4. V. Chernozhukov, I. Fernández-Val and B. Melly (2013) Inference on counterfactual distributions. Econometrica 81 (6), pp. 2205–2268.
  5. Y. Chen, J. Li, H. Xiao, X. Jin, S. Yan and J. Feng (2017) Dual path networks. In Advances in Neural Information Processing Systems, pp. 4467–4475.
  6. N. Dhungel, G. Carneiro and A. P. Bradley (2016) The automated learning of deep features for breast mass classification from mammograms. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 106–114.
  7. H. Fukui, T. Hirakawa, T. Yamashita and H. Fujiyoshi (2019) Attention branch network: learning of attention mechanism for visual explanation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10705–10714.
  8. I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville and Y. Bengio (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680.
  9. C. Haarburger, N. Horst, D. Truhn, M. Broeckmann, S. Schrading, C. Kuhl and D. Merhof (2019) Multiparametric magnetic resonance image synthesis using generative adversarial networks. The Eurographics Association.
  10. K. He, X. Zhang, S. Ren and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778.
  11. A. Hermans, L. Beyer and B. Leibe (2017) In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737.
  12. M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler and S. Hochreiter (2017) Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, pp. 6626–6637.
  13. X. Hu, Y. Jiang, C. Fu and P. Heng (2019) Mask-shadowgan: learning to remove shadows from unpaired data. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2472–2481.
  14. X. Huang and S. Belongie (2017) Arbitrary style transfer in real-time with adaptive instance normalization. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1501–1510.
  15. D. H. Kim, S. T. Kim and Y. M. Ro (2016) Latent feature representation with 3-d multi-view deep convolutional neural network for bilateral analysis in digital breast tomosynthesis. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 927–931.
  16. A. Krizhevsky, I. Sutskever and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105.
  17. Y. Liu, Z. Zhou, S. Zhang, L. Luo, Q. Zhang, F. Zhang, X. Li, Y. Wang and Y. Yu (2019) From unilateral to bilateral learning: detecting mammogram masses with contrasted bilateral network. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 477–485.
  18. W. Lotter, G. Sorensen and D. Cox (2017) A multi-scale cnn and curriculum learning strategy for mammogram classification. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, pp. 169–177.
  19. I. Malkiel, S. Ahn, V. Taviani, A. Menini, L. Wolf and C. J. Hardy (2019) Conditional wgans with adaptive gradient balancing for sparse mri reconstruction. arXiv preprint arXiv:1905.00985.
  20. I. C. Moreira, I. Amaral, I. Domingues, A. Cardoso, M. J. Cardoso and J. S. Cardoso (2012) Inbreast: toward a full-field digital mammographic database. Academic radiology 19 (2), pp. 236–248.
  21. O. Nizan and A. Tal (2019) Breaking the cycle–colleagues are all you need. arXiv preprint arXiv:1911.10538.
  22. N. Otsu (1979) A threshold selection method from gray-level histograms. IEEE Transactions on Systems, Man, and Cybernetics 9 (1), pp. 62–66.
  23. J. Pearl (2009) Causality: models, reasoning, and inference. Cambridge University Press. ISBN 978-94-009-7798-3.
  24. D. Ribli, A. Horváth, Z. Unger, P. Pollner and I. Csabai (2018) Detecting and classifying lesions in mammograms with deep learning. Scientific reports 8 (1), pp. 4165.
  25. T. Schlegl, P. Seeböck, S. M. Waldstein, U. Schmidt-Erfurth and G. Langs (2017) Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. In International conference on information processing in medical imaging, pp. 146–157.
  26. F. Schroff, D. Kalenichenko and J. Philbin (2015) Facenet: a unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 815–823.
  27. E. Sickles, C. D’Orsi and L. Bassett (2013) ACR BI-RADS® mammography. In ACR BI-RADS® Atlas, Breast Imaging Reporting and Data System. American College of Radiology.
  28. M. M. R. Siddiquee, Z. Zhou, N. Tajbakhsh, R. Feng, M. B. Gotway, Y. Bengio and J. Liang (2019) Learning fixed points in generative adversarial networks: from image-to-image translation to disease detection and localization. In Proceedings of the IEEE International Conference on Computer Vision, pp. 191–200.
  29. R. L. Siegel, K. D. Miller and A. Jemal (2019) Cancer statistics, 2019. CA: A Cancer Journal for Clinicians 69 (1), pp. 7–34.
  30. S. Tai, Z. Chen and W. Tsai (2013) An automatic mass detection system in mammograms based on complex texture features. IEEE journal of biomedical and health informatics 18 (2), pp. 618–627.
  31. J. Wei, H. Chan, C. Zhou, Y. Wu, B. Sahiner, L. M. Hadjiiski, M. A. Roubidoux and M. A. Helvie (2011) Computer-aided detection of breast masses: four-view strategy for screening mammography. Medical physics 38 (4), pp. 1867–1876.
  32. E. Wu, K. Wu, D. Cox and W. Lotter (2018) Conditional infilling gans for data augmentation in mammogram classification. In Image Analysis for Moving Organ, Breast, and Thoracic Images, pp. 98–106.
  33. N. Wu, J. Phang, J. Park, Y. Shen, Z. Huang, M. Zorin, S. Jastrzebski, T. Fevry, J. Katsnelson and E. Kim (2019) Deep neural networks improve radiologists’ performance in breast cancer screening. IEEE transactions on medical imaging.
  34. B. Zhou, A. Khosla, A. Lapedriza, A. Oliva and A. Torralba (2016) Learning deep features for discriminative localization. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2921–2929.
  35. J. Zhu, T. Park, P. Isola and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pp. 2223–2232.
  36. W. Zhu, Q. Lou, Y. S. Vang and X. Xie (2017) Deep multi-instance networks with sparse label assignment for whole mammogram classification. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 603–611.