GazeGAN: A Robust Generative Adversarial Saliency Model based on Invariance Analysis of Human Gaze During Scene Free Viewing
Data size is the bottleneck for developing deep saliency models, because collecting eye-movement data is very time-consuming and expensive. Most current studies on human attention and saliency modeling have used high-quality stereotype stimuli. In the real world, however, captured images undergo various types of transformations. Can we use these transformations to augment existing saliency datasets? Here, we first create a novel saliency dataset including fixations of 10 observers over 1900 images degraded by 19 types of transformations. Second, by analyzing eye movements, we find that observers look at different locations over transformed versus original images. Third, we utilize the new data over transformed images, called data augmentation transformation (DAT), to train deep saliency models. We find that label-preserving DATs with negligible impact on human gaze boost saliency prediction, whereas some other DATs that severely impact human gaze degrade the performance. These label-preserving valid augmentation transformations provide a solution to enlarge existing saliency datasets. Finally, we introduce a novel saliency model based on a generative adversarial network (dubbed GazeGAN). A modified U-Net is proposed as the generator of GazeGAN, which combines classic “skip connections” with a novel “center-surround connection (CSC)” in order to leverage multi-level features. We also propose a histogram loss based on the Alternative Chi-Square Distance (ACS HistLoss) to refine the saliency map in terms of luminance distribution. Extensive experiments and comparisons over 3 datasets indicate that GazeGAN achieves the best performance in terms of popular saliency evaluation metrics and is more robust to various perturbations. We also provide a comprehensive quantitative comparison of 22 state-of-the-art saliency models on distorted images, which contributes a robustness benchmark to the saliency community.
Our code and data are available at: https://github.com/CZHQuality/Sal-CFS-GAN.
Visual attention is an advanced internal mechanism for selecting informative and conspicuous regions in external visual stimuli. Bottom-up saliency is an efficient front-end process to complex back-end high-level vision tasks such as scene understanding, object recognition, detection, segmentation and visual description [3, 4, 5].
A plethora of saliency models have been proposed in the past decades to predict human gaze by simulating biological attention mechanisms [2, 6, 7]. Early saliency models extract hand-crafted features [8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19], while recent deep saliency models [20, 21, 22, 23, 24] learn relevant features automatically.
The SALICON dataset is currently the largest available saliency dataset, containing 10,000 training images, 5,000 validation images and 5,000 test images. Unlike traditional eye-movement datasets such as MIT1003 and CAT2000, the authors of SALICON used a mouse-clicking method to simulate the eye-tracking process in order to reduce time and labor costs. However, this mouse-clicking method is not as accurate as an eye-tracking apparatus. Besides, SALICON is still far smaller than ImageNet (containing 14 million images), which is dedicated to image classification and object recognition tasks.
To the best of our knowledge, most state-of-the-art saliency models and cognitive attention studies have used stereotype stimuli (e.g. upright images). However, in most practical circumstances, external stimuli are corrupted by diverse transformations. Some saliency-guided applications such as image/video quality assessment and object detection and recognition have to deal with distorted images. Kim et al. investigated visual saliency in noisy images. They found that noise significantly degrades the accuracy of saliency models, and proposed a robust model for noise-corrupted images. Judd et al. investigated gaze over low-resolution images in detail, and compared gaze dispersion across image resolutions. Zhang et al. investigated the optimal strategy to integrate attention cues into perceptual quality assessment, and showed that eye-tracking data on distorted images improve perceptual quality assessment methods.
The above-mentioned studies considered only certain types of distortions, a limited amount of data and a small set of saliency models. Further, they did not investigate the potential of various transformations for boosting saliency prediction (e.g. by serving as data augmentation). In this paper, we conduct a comprehensive study of the influence of several transformations on both human gaze and saliency models. Our contributions include:
A Novel Eye-movement Dataset: We first collect a novel eye-movement dataset including 1900 images corrupted by 19 types of common transformations.
Human Gaze Invariance Analysis: We thoroughly analyze human gaze behavior when viewing stimuli corrupted by various transformations against the original distortion-free stimuli.
A General Data Augmentation Strategy: We verify that an ensemble of label-preserving transformations that have only slight impacts on human gaze can serve as a useful data augmentation strategy to boost deep saliency models.
A Robust Saliency Model: We propose a saliency model based on generative adversarial networks (GANs) and show that it achieves encouraging performance on both normal and distorted stimuli.
II The Proposed Eye-Movement Database
II-A Stimuli and transformation types
We selected 100 distortion-free reference images from the CAT2000 eye-movement database since it covers various scenes such as indoor and outdoor scenes, natural and man-made scenes, synthetic patterns, fractals, and cartoon images. Since the reference images have different aspect ratios, we padded each image by adding two gray bands to the left and right sides and adjusted the image scale so that all images have the same resolution (1920 × 1080, matching the monitor used in the eye-tracking setup).
To systematically assess the influence of ubiquitous transformations on human attention behavior, we choose 19 common transformations occurring during the whole image acquisition, transmission, and displaying chain, including:
Acquisition: 2 levels of motion blur and 2 levels of Gaussian noise,
Transmission: 2 levels of JPEG compression,
Displaying: 2 levels of contrast change, 2 rotation degrees, and 3 shearing transformations,
Other: inversion, mirroring, line drawing (boundary maps), and 2 types of cropping distortions (to explore gaze variations under extremely abnormal conditions).
Eventually, we derive 18 transformed images for each reference image, for a total of 1900 images (18 × 100 transformed images + 100 reference images). Details of the transformation types and generation code are shown in Table I. Notably, these transformations are widely used as data augmentation transformations for training deep neural networks to mitigate overfitting.
Table I: Transformation types and generation code.
|Distortion|Generation code (using Matlab)|sAUC|CC|NSS|
|---|---|---|---|---|
|Reference|100 distortion-free images (img) from CAT2000|0.733|0.954|3.435|
|MotionBlur1|imfilter(img, fspecial('motion', 15, 0))|0.664|0.923|2.572|
|MotionBlur2|imfilter(img, fspecial('motion', 35, 90))|0.651|0.920|2.588|
|Noise1|imnoise(img, 'gaussian', 0, 0.1)|0.706|0.940|3.032|
|Noise2|imnoise(img, 'gaussian', 0, 0.2)|0.696|0.939|3.026|
|JPEG1|imwrite(img, saveroutine, 'Quality', 5)|0.703|0.902|2.919|
|JPEG2|imwrite(img, saveroutine, 'Quality', 0)|0.705|0.903|2.863|
|Contrast1|imadjust(img, [ ], [0.3,0.7])|0.722|0.931|3.008|
|Contrast2|imadjust(img, [ ], [0.4,0.6])|0.702|0.931|3.430|
|Rotation1|imrotate(img, -45, 'bilinear', 'loose')|0.680|0.893|2.287|
|Rotation2|imrotate(img, -135, 'bilinear', 'loose')|0.654|0.892|2.098|
|Shearing1|imwarp(img, affine2d([1 0 0; 0.5 1 0; 0 0 1]))|0.711|0.943|3.011|
|Shearing2|imwarp(img, affine2d([1 0.5 0; 0 1 0; 0 0 1]))|0.687|0.927|2.576|
|Shearing3|imwarp(img, affine2d([1 0.5 0; 0.5 1 0; 0 0 1]))|0.665|0.888|2.118|
|Inversion|imrotate(img, -180, 'bilinear', 'loose')|0.695|0.934|3.062|
|Mirroring|mirror symmetry version of reference images|0.726|0.930|3.360|
|Boundary|edge(img, 'canny', 0.3, sqrt(2))|0.667|0.888|2.312|
|Cropping1|a band from the left of img|0.697|0.934|2.630|
|Cropping2|a band from the top side of img|0.692|0.938|2.641|
II-B Eye-tracking setup
As indicated by Bylinskii et al., the eye-tracking experimental parameters (e.g. observers’ distance to the screen, calibration error, image size) influence the quality of the collected data. To address these challenges, we used the Tobii X120 eye tracker to record eye movements. We used the LG 47LA6600 CA monitor with a horizontal resolution of 1920 and a vertical resolution of 1080, so that the resolution of the stimuli matches that of the monitor screen. The height and width of the monitor were 60cm and 106cm, respectively. The distance between the subject and the eye tracker was 60cm. Following Bylinskii et al., one degree of visual angle is used both as 1) an estimate of the size of the human fovea, and 2) to account for measurement error. In our experiment, one degree of horizontal visual angle corresponds to 56.91 pixels, and one degree of vertical visual angle to 56.55 pixels.
Two types of ground-truth data have traditionally been used for training and measuring the performance of saliency models: 1) binary fixation maps made up of discrete gaze points recorded by an eye tracker, and 2) continuous density maps representing the probability of human gaze. The former can be converted into the latter by a Gaussian smoothing filter with standard deviation equal to one degree of visual angle, which is the setting we chose in this paper.
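As an illustration of this conversion, the following minimal Python sketch blurs a binary fixation map with a separable Gaussian and normalizes the result to a probability map. It is a sketch under stated assumptions, not the pipeline used to produce the dataset: the function names, the 3-sigma kernel truncation, the zero padding at the borders and the toy 9 × 9 map are all illustrative choices, and plain Python lists are used only to keep the example dependency-free.

```python
import math

def gaussian_kernel_1d(sigma, truncate=3.0):
    """Return a normalized 1-D Gaussian kernel truncated at `truncate` sigmas (assumed cutoff)."""
    radius = max(1, int(truncate * sigma + 0.5))
    weights = [math.exp(-0.5 * (k / sigma) ** 2) for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def blur_rows(grid, kernel):
    """Convolve every row with the kernel; pixels outside the map count as zero."""
    radius = len(kernel) // 2
    width = len(grid[0])
    out = []
    for row in grid:
        new_row = []
        for x in range(width):
            acc = 0.0
            for k, w in enumerate(kernel):
                xx = x + k - radius
                if 0 <= xx < width:
                    acc += w * row[xx]
            new_row.append(acc)
        out.append(new_row)
    return out

def transpose(grid):
    """Swap rows and columns so the row blur can be reused for columns."""
    return [list(col) for col in zip(*grid)]

def fixations_to_density(fixation_map, sigma_px):
    """Blur a binary fixation map (with at least one fixation) and normalize it to sum to 1."""
    kernel = gaussian_kernel_1d(sigma_px)
    blurred = transpose(blur_rows(transpose(blur_rows(fixation_map, kernel)), kernel))
    total = sum(sum(row) for row in blurred)
    return [[v / total for v in row] for row in blurred]

# Toy example: one fixation at the center of a 9 x 9 map, sigma of 1.5 pixels.
fix = [[0] * 9 for _ in range(9)]
fix[4][4] = 1
density = fixations_to_density(fix, sigma_px=1.5)
```

With the figures reported above, one degree of visual angle spans about 57 pixels, so `sigma_px` would be set to that value on full-resolution maps instead of the toy 1.5.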
We recruited 40 subjects (24 males and 16 females, aged 18 to 35 years) to participate in the eye-tracking experiment under the free-viewing condition. None of the participants had been exposed to the stimulus set before. The duration time for each stimulus was . We inserted a gray image with duration between each two consecutive images to reset gaze to the image center and to reduce memory effects. Besides, the order of stimulus presentation was randomized for each subject to mitigate carryover effects from the previous image.
III Analysis of Human Gaze Invariance
In this section, we quantify the discrepancies between human gaze over transformed and reference images using Pearson’s Linear Correlation Coefficient (CC), Histogram Intersection Measure (SIM), and Kullback-Leibler divergence (KL) metrics. The CC/SIM similarity matrices and the KL dissimilarity matrix are shown in Fig. 2, where the transformation types are ranked by their similarity/dissimilarity values compared to the Reference images. Values decrease from left to right and from top to bottom. Since the Inversion, Mirroring, Rotation and Shearing distortions change the locations of pixels, we align the gaze maps of these transformations with the Reference gaze map (via the inverses of the corresponding transformations in Table I) for fair comparison. We display some qualitative comparisons between human gaze on transformed versus reference stimuli in Figs. 3-4. We notice that:
1. Most transformations affect human gaze, and the magnitude of the impact depends strongly on the transformation type. As shown in Fig. 2, quantitative comparisons of the CC, SIM and KL metrics indicate that Mirroring and Shearing1 have slight influences on human attention compared to other transformations, whereas Rotation2, Cropping2 and Shearing3 have significant influences on human gaze.
2. As shown in Fig. 2, different magnitudes of the same transformation have similar impacts on human gaze, e.g. Noise1 vs. Noise2, JPEG1 vs. JPEG2, and MotionBlur1 vs. MotionBlur2. These transformation pairs have high similarity values in terms of the CC and KL metrics (when compared to each other). Besides, the higher the distortion magnitude, the more severe the impact on human gaze (when compared to the Reference).
3. In contrast to object detection, segmentation and classification tasks, we cannot directly use all of these transformations as data augmentation transformations for saliency prediction. This is because some transformations are not label-preserving in terms of human gaze. For example, Rotation2 and Shearing3 distract human attention significantly, and we find that humans tend to pay more attention to the salient objects appearing in the center regions of spatially transformed stimuli, as shown in the 3rd row of Fig. 3.
4. We also find some other patterns in the eye-movement data, but these patterns depend not only on the transformation type, but also on scene category and stimulus complexity. For example, Cropping1 distracts human attention from salient regions appearing on the left side of the stimulus, but the main part of the stimulus is not severely affected. Cropping2 alters human gaze severely, because salient objects containing semantic information are often framed in the center part. As a result, the risk of damaging objects with semantic information is higher for Cropping2 than for Cropping1. Besides, for the Boundary transformation, humans prefer to look at regions with dense edges when color and luminance features are lacking. More details about these patterns are provided in the Supplementary Material.
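The three measures used in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the evaluation code behind Fig. 2: gaze maps are taken as flattened lists of non-negative values, SIM and KL are computed after normalizing each map to sum to one, and the function names and the regularization constant `eps` are hypothetical choices.

```python
import math

def to_distribution(values):
    """Normalize a non-negative map so that its entries sum to one."""
    total = sum(values)
    return [v / total for v in values]

def cc(a, b):
    """Pearson's linear correlation coefficient between two maps."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def sim(a, b):
    """Histogram intersection: sum of element-wise minima of the two distributions."""
    p, q = to_distribution(a), to_distribution(b)
    return sum(min(x, y) for x, y in zip(p, q))

def kl(reference, other, eps=1e-12):
    """KL divergence incurred when `other` is used to approximate `reference`."""
    q, p = to_distribution(reference), to_distribution(other)
    return sum(qi * math.log(eps + qi / (pi + eps)) for qi, pi in zip(q, p))

# Toy gaze maps: a reference map and a spatially shifted version of it.
reference_map = [0.0, 0.1, 0.6, 0.3]
shifted_map = [0.3, 0.6, 0.1, 0.0]
scores = (cc(reference_map, shifted_map),
          sim(reference_map, shifted_map),
          kl(reference_map, shifted_map))
```

CC and SIM are similarity scores (higher means closer to the Reference gaze), whereas KL is a dissimilarity score (lower means closer), which is why the matrices in Fig. 2 are read in opposite directions.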
IV Analysis of Data Augmentation
The most common data augmentation strategy is to enlarge the training set using some label-preserving transformations, such as Cropping, Inversion, ContrastChange, and Shearing. However, different from classical image classification and object detection problems, the common data augmentation methods may produce label noise for the saliency prediction problem. This is because different transformations will change the ground truth at different levels. This paper carries important implications as to which of these types of transformations are valid and which ones provide approximations of human gaze. We divide common transformations included in the proposed dataset into two sets: valid and invalid augmented sets, and explore how fine-tuning on different sets of augmented data can improve or degrade the performance of deep models with respect to ground truth.
On the one hand, we select Reference, Mirroring, Inversion, Contrast1, Shearing1, JPEG1 and Noise1 to generate a valid augmented set, because these transformations have slight effects on human gaze. On the other hand, Rotation1, Rotation2, Shearing2, Shearing3, Cropping1, Cropping2 and MotionBlur2 serve as an invalid set, because these transformations are not able to preserve human gaze labels as approximations of the Reference. We select 4 state-of-the-art deep saliency models pretrained on the SALICON dataset, which is the largest saliency dataset. Considering that Reference images are selected from the CAT2000 dataset, we first fine-tune these models using normal images of CAT2000 as the control group. Then, we use the valid and invalid augmented sets to fine-tune these models separately.
We design two experiments in this section:
1. Which transformations can improve the model robustness on distorted stimuli?
2. Do the valid augmentation transformations increase the model performance on normal stimuli?
For the first experiment, each of the valid, invalid and CAT2000 sets is divided into a training set (550 images) and a test set (150 images). We compare the performance of the different fine-tuning strategies on the test sets of the valid, invalid and normal sets separately, as shown in the 1st and 2nd rows of Fig. 5. For the second experiment, we select 1500 normal images from CAT2000 as the original training set, and 400 distortion-free images as the test set. We use the valid transformations to extend the training set to 10,500 images, and measure the performance gain when fine-tuning with the extended training set, as shown in the 3rd row of Fig. 5.
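A minimal Python sketch of how a valid augmented set could be built from an existing stimulus and gaze-map pair is given below. It rests on explicit assumptions: geometric transformations (Mirroring, Inversion) are applied identically to the stimulus and its gaze map, while the photometric Contrast1 transformation leaves the gaze map unchanged, which is exactly the label-preserving premise of the valid set. Only three of the valid transformations are shown; Shearing1, JPEG1 and Noise1 are omitted because they need an image library. All function names are hypothetical, and images are nested lists with intensities in [0, 1].

```python
def mirror(grid):
    """Horizontal flip, corresponding to the Mirroring transformation."""
    return [row[::-1] for row in grid]

def invert(grid):
    """180-degree rotation, corresponding to the Inversion transformation."""
    return [row[::-1] for row in grid[::-1]]

def compress_contrast(grid, low=0.3, high=0.7):
    """Map intensities in [0, 1] linearly to [low, high], as Contrast1 does."""
    return [[low + (high - low) * v for v in row] for row in grid]

def augment(image, gaze_map):
    """Return (name, image, gaze_map) triples for a few valid transformations."""
    return [
        ("Reference", image, gaze_map),
        # Geometric transformations move pixels, so the gaze map moves with them.
        ("Mirroring", mirror(image), mirror(gaze_map)),
        ("Inversion", invert(image), invert(gaze_map)),
        # Photometric transformation: the gaze label is assumed to be preserved.
        ("Contrast1", compress_contrast(image), gaze_map),
    ]

# Toy 2 x 2 stimulus with a single fixation in the top-left corner.
image = [[0.0, 0.5], [1.0, 0.25]]
gaze = [[1, 0], [0, 0]]
samples = augment(image, gaze)
```

Applying all seven members of the valid set in this way turns 1500 training images into 10,500, which matches the extended training set size used in the second experiment.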
For fair comparison, we unify the experimental setup across the data augmentation strategies. For each model, we set the training hyper-parameters as follows: 1) For the 4 deep models mentioned in Fig. 5, SGD (stochastic gradient descent) serves as the optimizer with a momentum of 0.9, a weight decay of 0.0005, a batch size of 1, and 20 training epochs, 2) For ML-Net, learning rate is , 3) For OpenSALICON, learning rate is , and 4) For SAM-VGG and SAM-ResNet, learning rates are set to .
Experimental results shown in Fig. 5 (a)-(d) verify that fine-tuning on the valid set improves the deep models’ robustness on the distorted test set, compared to using CAT2000, which contains only distortion-free images. However, as shown in Fig. 5 (e)-(h), fine-tuning on the invalid set degrades the deep models’ performance compared to using normal stimuli. Fig. 5 (i)-(l) indicates that valid transformations provide a useful way to leverage expensive eye-movement data to boost deep saliency models.
V The Proposed GazeGAN Model
The general idea behind the proposed GazeGAN includes:
GAN architecture: Train the generator with the goal of fooling the discriminator, which is trained to distinguish synthetic saliency maps from human gaze maps. The GAN architecture pushes the generator to produce more accurate saliency maps whose spatial and intensity distributions are consistent with real human gaze;
Skip-connection: Shallower encoder layers extract rich low-level features, which help to build sparse spatial representations from massive numbers of pixels, while deeper decoder layers locate semantically salient objects accurately;
Center-surround connection: Inspired by the center-surround antagonism mechanism of human vision, we further emphasize cross-scale short connections through a nonlinear pooling operation. This helps to increase the model’s sensitivity to local spatial discontinuities in the presence of perturbations;
Local-global GANs: Multiple generators learn different groups of spatial representations at different scales, while multiple discriminators supervise the intermediate prediction results from coarse to fine. Integrating these representations refines the prediction results from coarse to fine. Besides, a multi-scale architecture is less vulnerable to perturbations, because it learns high-level representations in tandem with fine details and appears better equipped to suppress otherwise distracting pixel noise.
V-A The generator
As shown in Fig. 6, the basic generator of GazeGAN includes two parts, a 16-layer modified U-Net equipped with a novel “center-surround connection module” (CSC module), and a 6-layer encoder-decoder module containing two residual blocks. The general idea behind this combination is that the modified U-Net serves as a feature extractor to leverage multi-level features and generates a coarse saliency map, while the 6-layer module aims to refine the residual information between coarse result and ground truth human gaze.
The core of our generator is the modified U-Net with a CSC module. U-Net is a powerful fully convolutional network presented by Ronneberger et al., which made a breakthrough in biomedical image segmentation by predicting each pixel’s class. In saliency prediction, the goal of U-Net is to predict each pixel’s probability of being salient. Compared to the generator of the SalGAN saliency model, U-Net consists of symmetric encoder and decoder layers, and the skip connections between encoder and decoder layers apply a concatenation operation to fuse multi-scale information. To improve the robustness of the deep saliency model, we inject the bionic “center-surround” mechanism into the CNN model for the first time. Typical visual neurons are most sensitive in a small region of the visual space (i.e. the center of the receptive field), while stimuli presented in an antagonistic border region concentric to the center (the surround) inhibit the neuronal response. The “center-surround” operation is sensitive to local spatial discontinuities and is well suited to detecting locations that stand out from their surround. Here we implement the “center-surround” operation as an across-scale connection module on the basis of U-Net, as shown in Fig. 6. Specifically, we select the feature maps in coarse scale (the surround) from the -th decoder layer, where , and the corresponding fine scale maps (the center) are from the -th decoder layer. We first use a transposed convolution layer to upsample the surround feature maps to the same resolution (height × width) as the center maps. Next, we employ 1×1 convolution layers to unify the channels of the center and surround maps while keeping the resolution unchanged. Then, we compute the difference maps between the center and surround maps by a point-to-point nonlinear subtraction as equation 1:
where and represent the -th unified surround feature maps, and the -th unified center feature maps, respectively. represents the difference map between center and surround. Notably, for each CSC module, we utilize 3 nonlinear activation functions, i.e. Tanh, to improve the model robustness by increasing the model nonlinearity. Finally, the difference maps between the -th surround and the -th center are concatenated with the feature maps at the -th decoder layer in channel direction. This way, U-Net has a large number of diverse feature maps in the decoding path due to the skip connections and the CSC module, which allows more context information to be transferred. Notice that all activation functions of the encoder layers are leaky ReLUs with slope 0.2, while those of the decoder layers are normal ReLUs.
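The center-surround step can be illustrated with a single-channel Python sketch. Several simplifications are assumptions made only for illustration: nearest-neighbour upsampling stands in for the learned transposed convolution, the 1×1 convolutions are omitted because only one channel is used, and the three Tanh activations are assumed to act on the center map, the surround map and their difference. That placement of the activations is a guess for the sake of a runnable example, not a confirmed definition of the CSC module, and the function names are hypothetical.

```python
import math

def upsample_nearest(grid, factor):
    """Nearest-neighbour upsampling (a stand-in for the transposed convolution)."""
    out = []
    for row in grid:
        wide = [v for v in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out

def center_surround_difference(center, surround):
    """Point-to-point nonlinear subtraction between center and upsampled surround maps."""
    factor = len(center) // len(surround)
    surround_up = upsample_nearest(surround, factor)
    return [
        # Assumed form: tanh(tanh(center) - tanh(surround)), i.e. three Tanh activations.
        [math.tanh(math.tanh(c) - math.tanh(s)) for c, s in zip(c_row, s_row)]
        for c_row, s_row in zip(center, surround_up)
    ]

# Toy maps: a fine-scale center map (4 x 4) and a coarse-scale surround map (2 x 2).
center = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
surround = [
    [0.5, 0.0],
    [0.0, 0.0],
]
diff = center_surround_difference(center, surround)
```

In this toy case the isolated strong response in the center map yields a positive difference, its weaker neighbours under the same surround cell yield negative differences, and flat regions yield zero, which is the local-discontinuity behaviour the module is meant to expose. In the model, such difference maps would then be concatenated with decoder feature maps along the channel axis.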
On top of the modified U-Net, we append a 6-layer network containing 2 residual blocks, denoted as the refinement part. Residual blocks increase the overall depth of the network, and can avoid the vanishing gradient problem in very deep networks.
In the final architecture, as shown in Fig. 6, we further append a local generator on the basis of the global generator, in order to extract more fine-scale features. The global generator is able to locate the obvious semantic salient objects, while the local generator encodes more detailed features that are useful for refinement, e.g. tiny text regions. The local generator perceives a higher stimulus resolution (from 360×640 to 720×1280 for the proposed dataset), so as to extract richer information that allows the network to better capture salient objects. Specifically, we concatenate the feature maps from the last decoder layer of the global generator with the feature maps from the second encoder layer of the local generator, to integrate the global semantic information from coarse to fine. We feed the original image into the local generator, and feed the downsampled image into the global generator. The two generators are jointly trained end to end.
V-B The discriminator
To discriminate human gaze from synthetic saliency maps, we train a 5-layer discriminator, which contains 4 convolution layers with an increasing number of convolution kernels, doubling from 64 to 512 kernels. On top of the 512 feature maps generated by the fourth discriminator layer, we append a sigmoid layer with filter kernels and sigmoid activation function to obtain the final probability of being the real human gaze. Notice that we concatenate the saliency map (or human gaze) with the original input color image in channel direction, and feed them to the discriminator simultaneously. Thus, GazeGAN is a conditional GAN because both the generator and the discriminator can observe the input source image. In this way, for distorted stimuli, the discriminator can perceive the specific distortion type, which improves its discriminating ability. We append two discriminators at the ends of the global and local generators respectively, in order to supervise intermediate prediction results from coarse to fine.
V-C Loss function
Previous works [42, 43, 44] have proved it beneficial to mix the adversarial loss with some task-specific content losses, such as L1 and L2 losses, to train a GAN. This way, the discriminator’s goal remains unchanged, but the generator is tasked to not only fool the discriminator but also to approximate the ground truth in terms of visual content.
V-C1 The content loss
It has been shown that a linear combination of different saliency evaluation metrics achieves good prediction performance when serving as the loss function to train deep saliency models. Huang et al. selected the KL, CC and NSS metrics to train a deep saliency model because these metrics are differentiable and can therefore be used by gradient descent during back-propagation. They also found that the KL loss achieved a good compromise against other losses when used alone. Cornia et al. used a weighted summation of the KL, CC and NSS metrics to train their model, and achieved better results than with a single loss. These evaluation metrics perform well in measuring the pixel-level similarity between the ground truth and the predicted saliency map. However, when we use only the linear combination of pixel-level losses to train our model, we notice that there is still an obvious discrepancy between the grey-level histograms of the generated saliency map and the human gaze map. As shown in Fig. 7 (b)-(d), the luminance of non-salient background pixels is gathered in the darkest node (i.e. X=0). Besides, the luminance of salient pixels of human gaze is smoothly distributed across the image according to Fig. 7 (b). The luminance of the most salient pixels of the saliency map is gathered in the brightest node (i.e. X=255), as in Fig. 7 (c). This is why some local salient regions look very bright in Fig. 7 (c).
We propose a histogram loss in this paper to reduce the histogram discrepancy between the generated saliency map and the human gaze map. The histogram loss includes two steps, i.e. histogram distribution estimation and histogram similarity calculation. To construct the differentiable histogram loss, we first devise the histogram estimation method based on Ustinova’s work. We denote the pixel luminance of saliency map as , , where represents the number of pixels in the saliency map. Suppose that the distribution of is estimated as the -dimensional histogram with the nodes = 0, = , …, = 255 uniformly filling with the step . Then, we use equation 2 to estimate the probability distribution (denoted as , where ) for each node of the histogram.
We then adopt the min-max normalization method to normalize as , to guarantee that .
Next, we utilize the Alternative Chi-Square Distance (ACS) to measure the histogram similarity.
where and represent the normalized probability distribution at the -th node of histograms of generated saliency map and ground-truth human gaze, respectively. We set to make sure the divisor is always nonzero, and set to 255. To validate the proposed histogram loss, we compare 3 different histogram similarity measures against the sAUC score, because sAUC is a benchmark evaluation metric for estimating the accuracy of a saliency map. As shown in Fig. 8, ACS aligns better with the sAUC score than the other measures. (Footnote 1: The derivative of the proposed ACS loss is provided in the Supplementary Material.)
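The two steps of the histogram loss can be sketched in Python as follows. The sketch makes several assumptions that should be read as illustrative: the soft histogram uses linear (triangular-kernel) assignment of each luminance value to its two nearest nodes, in the spirit of Ustinova and Lempitsky's differentiable histogram; the histogram has 256 nodes with unit step; and the Alternative Chi-Square distance is taken in its common form 2 Σ (p − q)² / (p + q + ε). The function names and the value of `eps` are hypothetical.

```python
def soft_histogram(pixels, num_nodes=256, max_value=255.0):
    """Soft-assign each luminance value to its two nearest nodes (triangular kernel)."""
    step = max_value / (num_nodes - 1)
    hist = [0.0] * num_nodes
    for x in pixels:
        pos = min(max(x, 0.0), max_value) / step  # clamp to the histogram range
        lower = min(int(pos), num_nodes - 2)
        frac = pos - lower
        hist[lower] += 1.0 - frac
        hist[lower + 1] += frac
    return [h / len(pixels) for h in hist]

def min_max_normalize(hist):
    """Rescale histogram values to [0, 1]; a constant histogram maps to zeros."""
    lo, hi = min(hist), max(hist)
    if hi == lo:
        return [0.0] * len(hist)
    return [(h - lo) / (hi - lo) for h in hist]

def acs_distance(p, q, eps=1e-8):
    """Alternative Chi-Square distance (assumed form); eps keeps the divisor nonzero."""
    return 2.0 * sum((a - b) ** 2 / (a + b + eps) for a, b in zip(p, q))

def histogram_loss(pred_pixels, gaze_pixels):
    """ACS distance between the normalized soft histograms of two luminance maps."""
    p = min_max_normalize(soft_histogram(pred_pixels))
    q = min_max_normalize(soft_histogram(gaze_pixels))
    return acs_distance(p, q)

# Toy luminance values: a prediction saturating at 255 versus a smoother gaze map.
pred = [0, 0, 0, 128, 255]
gaze = [0, 0, 64, 64, 64]
loss = histogram_loss(pred, gaze)
```

Because every operation is piecewise linear or a smooth ratio, the same computation remains differentiable with respect to the pixel values when written in an automatic-differentiation framework, which is the property a training loss needs.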
We introduce four popular pixel-wise loss functions in equations 5-8. We use , and to represent the ground truth density distribution, the ground truth binary fixation map, and the predicted saliency map, respectively.
The L1 distance is a widely used loss function for pixel-to-pixel image translation tasks. This loss calculates the pixel-wise Manhattan distance between the predicted saliency map and the ground-truth density distribution. Compared to the L2 loss, the L1 loss encourages less blurring in the generated results.
where represents the -th pixel, and is the total number of pixels of the ground truth density map.
The KL function evaluates the loss of information when the predicted distribution is used to approximate the ground truth distribution , thus taking a probabilistic interpretation of predicted saliency map and ground truth density map as follows:
where is a small regularization constant.
The CC function treats the generated saliency map and ground-truth density distribution as random variables, and measures the linear relationship between them, as shown in equation 7.
where is the covariance of and .
The NSS function was defined specifically for the evaluation of saliency models. NSS aims to quantify the saliency map values at the fixated locations and to normalize them by the saliency map variance, as shown in equation 8.
where is the total number of fixation points, and is normalized to have zero mean and unit standard deviation.
As shown in equation 9, the final content loss is a linear combination of the four pixel-level losses (L1, KL, CC and NSS) and the histogram loss. We further discuss the performance gain of this content loss in the ablation analysis.
where the five scalar weights balance the five losses and are set to 10, 100, -20, -20 and 10, respectively. Smaller values of the L1, KL and histogram losses indicate higher similarity between the predicted result and the ground truth, whereas for CC and NSS, higher values indicate higher similarity, which is why their weights are negative.
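The weighted combination can be made concrete with a small Python sketch. It assumes the standard definitions of the four pixel-level terms (mean absolute difference for L1, KL divergence between maps normalized to sum to one, Pearson correlation for CC, and the mean standardized saliency at fixated pixels for NSS) and assumes the weights 10, 100, -20, -20 and 10 apply to L1, KL, CC, NSS and the histogram loss in that order. Maps are flattened lists, the histogram loss is passed in as a precomputed number, and all function names and the constant `eps` are hypothetical.

```python
import math

def _mean(values):
    return sum(values) / len(values)

def l1_loss(pred, density):
    """Mean pixel-wise Manhattan distance."""
    return _mean([abs(p - q) for p, q in zip(pred, density)])

def kl_loss(pred, density, eps=1e-8):
    """KL divergence when the prediction approximates the ground-truth density."""
    p_sum, q_sum = sum(pred), sum(density)
    return sum((q / q_sum) * math.log(eps + (q / q_sum) / (p / p_sum + eps))
               for p, q in zip(pred, density))

def cc_score(pred, density):
    """Pearson correlation between prediction and ground-truth density."""
    mp, mq = _mean(pred), _mean(density)
    cov = sum((p - mp) * (q - mq) for p, q in zip(pred, density))
    sp = math.sqrt(sum((p - mp) ** 2 for p in pred))
    sq = math.sqrt(sum((q - mq) ** 2 for q in density))
    return cov / (sp * sq)

def nss_score(pred, fixations):
    """Mean of the standardized prediction at fixated pixels."""
    mp = _mean(pred)
    std = math.sqrt(_mean([(p - mp) ** 2 for p in pred]))
    return _mean([(p - mp) / std for p, f in zip(pred, fixations) if f])

def content_loss(pred, density, fixations, hist_loss=0.0,
                 weights=(10.0, 100.0, -20.0, -20.0, 10.0)):
    """Weighted sum of L1, KL, CC, NSS and a precomputed histogram loss."""
    w_l1, w_kl, w_cc, w_nss, w_hist = weights
    return (w_l1 * l1_loss(pred, density)
            + w_kl * kl_loss(pred, density)
            + w_cc * cc_score(pred, density)
            + w_nss * nss_score(pred, fixations)
            + w_hist * hist_loss)

# Toy data: a density map, its single fixation, and a good and a bad prediction.
density = [0.05, 0.15, 0.6, 0.2]
fixations = [0, 0, 1, 0]
good_pred = list(density)
bad_pred = density[::-1]
good_loss = content_loss(good_pred, density, fixations)
bad_loss = content_loss(bad_pred, density, fixations)
```

The negative weights turn the two similarity scores into terms to be minimized, so a prediction that matches the ground truth obtains a lower (here negative) total than a mismatched one.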
V-C2 The adversarial loss
The adversarial loss is expressed as
where means the original input image, while G and D represent generator and discriminator. G represents the global and local generators (i.e. and ), while D represents the fine-scale and coarse-scale discriminators. G tries to minimize this adversarial loss against an adversarial D which tries to maximize it, i.e. .
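For orientation, the min-max game can be sketched in Python using the standard conditional GAN objective, log D(x, y) + log(1 − D(x, G(x))), averaged over a batch. That this standard form matches the adversarial loss used here term by term is an assumption; the function names and the constant `eps` are hypothetical, and the discriminator outputs are given directly as probabilities rather than computed by a network.

```python
import math

def adversarial_loss(d_real, d_fake, eps=1e-12):
    """Standard conditional GAN objective from discriminator probabilities.

    d_real: D's outputs for (image, human gaze map) pairs.
    d_fake: D's outputs for (image, generated saliency map) pairs.
    """
    real_term = sum(math.log(d + eps) for d in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - d + eps) for d in d_fake) / len(d_fake)
    return real_term + fake_term

def discriminator_objective(d_real, d_fake):
    """Quantity minimized when updating D: the negated adversarial loss."""
    return -adversarial_loss(d_real, d_fake)

# A confident discriminator scores higher than an undecided one ...
confident = adversarial_loss([0.9, 0.8], [0.1, 0.2])
undecided = adversarial_loss([0.5, 0.5], [0.5, 0.5])
# ... and a generator that fools D (fake scored 0.9) pushes the loss down.
fooled = adversarial_loss([0.9], [0.9])
not_fooled = adversarial_loss([0.9], [0.1])
```

With two discriminators operating at coarse and fine scales, one natural reading is that this quantity is summed over both; that aggregation is likewise an assumption of the sketch.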
VI Experiments and Results
VI-A Experimental setup
We use 3 datasets to ensure a comprehensive comparison: 1) the SALICON dataset contains 10,000 distortion-free training images, 5,000 validation images, and 5,000 test images. Since the ground truth maps of the test set are not publicly available, we select 4,000 images from the validation set as the test set and the remaining 1,000 images serve as the validation set; 2) the MIT1003 dataset includes 1003 distortion-free images with indoor and outdoor scenes; and 3) the proposed dataset contains 1900 distorted images with indoor, outdoor, graphics, and cartoon scenes.
For the SALICON and MIT1003 datasets, we resize input images to , in order to save computing resources. Since the images of MIT1003 have different resolutions, we apply zero padding to bring them to a unified aspect ratio of 4:3 and resize them to the same size. Images of the proposed dataset have the same input size of , hence we resize them to .
For the SALICON dataset, we first adopt the proposed valid data augmentation transformations to enlarge the 10,000 training images. This way, we obtain another 60,000 distorted stimuli with 6 types of label-preserving transformations. We use all 70,000 stimuli to pre-train the proposed GazeGAN from scratch. For the MIT1003 dataset, we randomly divide it into a training set with 600 images, a validation set with 100 images, and a test set with 303 images. Similarly, we use the same data augmentation method to enlarge the training set of the MIT1003 dataset. We first pre-train our model on the enlarged SALICON training set, then fine-tune the model on the enlarged MIT1003 training set. For the proposed dataset, we first pre-train the whole network of GazeGAN on the enlarged SALICON training set, then fine-tune the refinement part of the generators while freezing the parameters of the modified U-Net parts. This is intended to reduce overfitting of the deep model on the limited distorted stimuli.
In the training stage, we encourage the generator to minimize the linear combination of content loss and adversarial loss . Besides, rather than training the discriminator to maximize , we instead minimize -. Adam optimizer with a fixed learning rate , and the momentum parameter of serves as the optimization method to update the model parameters. We alternately update the generators and discriminators as suggested by Goodfellow et al. The batch size is set to 1. In contrast to other deep saliency models (e.g., [22, 24]), which initialize their network parameters from parameters pre-trained on ImageNet, the parameters of the proposed GazeGAN are initialized from a Gaussian distribution with mean 0 and standard deviation 0.02. This is because U-Net can work with very few training images and still yield satisfying performance. Therefore, GazeGAN provides an efficient solution to saliency prediction where training samples are limited. Our implementation is based on TensorFlow and two NVIDIA Tesla GPUs with 16GB of GPU memory.
Table II: Performance of GazeGAN trained with different loss combinations.
| + KL + CC + NSS + | 0.772 | 2.249 | 0.728 | 0.737 | 0.514 |
| + KL + CC + NSS + HistLoss + | 0.808 | 2.914 | 0.743 | 0.764 | 0.496 |
| + KL + CC + NSS + | 0.614 | 2.011 | 0.734 | 0.438 | 1.187 |
| + KL + CC + NSS + HistLoss + | 0.633 | 2.302 | 0.747 | 0.446 | 1.042 |
Table III: Ablation analysis of the modules of GazeGAN.
| SALICON | Plain U-Net | 0.685 | 1.996 | 0.624 | 0.656 | 0.744 |
| | + Refinement Part | 0.780 | 2.340 | 0.698 | 0.713 | 0.580 |
| | + CSC Module | 0.789 | 2.577 | 0.725 | 0.741 | 0.535 |
| | + Local Generator | 0.808 | 2.914 | 0.743 | 0.764 | 0.496 |
| Proposed Data | Plain U-Net | 0.615 | 1.647 | 0.567 | 0.532 | 1.664 |
| | + Refinement Part | 0.697 | 1.735 | 0.590 | 0.574 | 1.201 |
| | + CSC Module | 0.702 | 1.894 | 0.624 | 0.597 | 1.023 |
| | + Local Generator | 0.760 | 2.140 | 0.643 | 0.663 | 0.781 |
VI-B Ablation analysis
In this section, we evaluate the contribution of each component of the proposed model on different datasets. We first compare the performance of GazeGAN when trained with different losses, as shown in Table II. We find that the proposed loss function, which combines the pixel-level losses with the histogram loss, achieves superior performance across the evaluation metrics.
Next, we focus on the contributions of the different modules of our model. For this purpose, we construct four variants: (1) the plain U-Net; (2) the plain U-Net followed by the refinement part containing 2 residual blocks; (3) the modified U-Net equipped with the CSC module and followed by the refinement part; and (4) the third variant with the local generator added. Table III shows the ablation results of the proposed model. Every module contributes to the final performance.
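To make the per-module contribution concrete, the following sketch computes the change in each reported score as modules are added. The numbers are copied from Table III; the column order follows the table, and no metric names are assumed. The helper names `stepwise_deltas` and `column_trend` are hypothetical.

```python
# Values copied from Table III; the five columns are kept in table order.
ABLATION = {
    "SALICON": [
        ("Plain U-Net",        (0.685, 1.996, 0.624, 0.656, 0.744)),
        ("+ Refinement Part",  (0.780, 2.340, 0.698, 0.713, 0.580)),
        ("+ CSC Module",       (0.789, 2.577, 0.725, 0.741, 0.535)),
        ("+ Local Generator",  (0.808, 2.914, 0.743, 0.764, 0.496)),
    ],
    "Proposed Data": [
        ("Plain U-Net",        (0.615, 1.647, 0.567, 0.532, 1.664)),
        ("+ Refinement Part",  (0.697, 1.735, 0.590, 0.574, 1.201)),
        ("+ CSC Module",       (0.702, 1.894, 0.624, 0.597, 1.023)),
        ("+ Local Generator",  (0.760, 2.140, 0.643, 0.663, 0.781)),
    ],
}

def stepwise_deltas(rows):
    """Change in every column when each module is added to the previous variant."""
    return [
        (name, tuple(round(cur - prev, 3) for prev, cur in zip(prev_vals, cur_vals)))
        for (_, prev_vals), (name, cur_vals) in zip(rows, rows[1:])
    ]

def column_trend(rows, column):
    """Classify one column as strictly increasing, strictly decreasing, or mixed."""
    values = [vals[column] for _, vals in rows]
    pairs = list(zip(values, values[1:]))
    if all(b > a for a, b in pairs):
        return "increasing"
    if all(b < a for a, b in pairs):
        return "decreasing"
    return "mixed"

for dataset, rows in ABLATION.items():
    print(dataset, [column_trend(rows, c) for c in range(5)])
    for name, deltas in stepwise_deltas(rows):
        print("  ", name, deltas)
```

On both datasets the first four columns rise and the fifth column falls with every added module, which is consistent with each module contributing to the final performance.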
VI-C Comparison with the state-of-the-art
We first quantitatively compare GazeGAN with state-of-the-art models on SALICON, MIT1003 and the proposed dataset. Experimental results are reported in Tables IV-VI. Models are sorted by their NSS scores, as suggested in the MIT saliency benchmark. GazeGAN achieves top-ranked performance on the SALICON validation set and on the proposed dataset across the evaluation metrics. It also shows competitive performance on the MIT1003 dataset.
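For readers unfamiliar with the ranking metric, the sketch below gives the commonly used definitions of NSS (mean of the standardized saliency map at fixated pixels) and CC (Pearson correlation between a predicted map and a fixation density map). It is a minimal pure-Python illustration on toy 3x3 maps, not the benchmark's reference code; it assumes the population standard deviation, and the function names are hypothetical.

```python
import math

def _flatten(grid):
    """Flatten a 2D list into a 1D list, row by row."""
    return [v for row in grid for v in row]

def _standardize(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if std == 0:
        raise ValueError("map is constant; standardization is undefined")
    return [(v - mean) / std for v in values]

def nss(saliency, fixations):
    """Normalized Scanpath Saliency: mean standardized saliency at fixated pixels."""
    z = _standardize(_flatten(saliency))
    hits = [zv for zv, f in zip(z, _flatten(fixations)) if f]
    if not hits:
        raise ValueError("fixation map contains no fixations")
    return sum(hits) / len(hits)

def cc(saliency, density):
    """Pearson linear correlation coefficient between two maps."""
    a = _standardize(_flatten(saliency))
    b = _standardize(_flatten(density))
    return sum(x * y for x, y in zip(a, b)) / len(a)

# Toy example: one fixation at the image center.
fixations = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
predictions = {
    "center_peak": [[0, 0, 0], [0, 1, 0], [0, 0, 0]],
    "corner_peak": [[1, 0, 0], [0, 0, 0], [0, 0, 0]],
}

# Rank the toy predictions by NSS, highest first.
ranking = sorted(predictions, key=lambda k: nss(predictions[k], fixations), reverse=True)
print(ranking)
```

A prediction that peaks at the fixated location receives a positive NSS, while one that peaks elsewhere receives a negative NSS, so sorting by NSS places the better-aligned map first.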
VI-D Finer-grained Comparison on Distorted Dataset
As shown in Fig. 12, we further provide a finer-grained comparison of 22 saliency models on each transformation type of the proposed dataset.
For a comprehensive comparison, we select 15 early saliency models based on hand-crafted features: IttiKoch, GBVS, Torralba, CovSal (CovSal-1 uses the covariance feature and CovSal-2 uses both the covariance and mean features), AIM, Hou (Hou-Lab and Hou-RGB adopt the Lab and RGB color spaces, respectively), LS, LGS, BMS, RC, Murray, AWS and ContextAware. We also select 7 deep saliency models: GazeGAN, ML-Net, SalGAN, OpenSALICON, Sal-Net, SAM-ResNet and SAM-VGG.
We observe the following points from Fig. 12:
1) Rotation2, Shearing3, Noise2 and Contrast2 are the most challenging distortions for saliency models; most models underperform on them. Rotation2 and Shearing3 are severe geometric distortions, while Noise2 and Contrast2 are strong surface-level variations. The former change the spatial structure of the image, while the latter alter its local contrast. Recall that Rotation2 and Shearing3 also have severe impacts on human gaze.
2) LS and LGS fail on Boundary. Sal-Net fails on Contrast2. ML-Net and OpenSALICON fail on Noise2 and Contrast2. CovSal-1 and CovSal-2 fail on the sAUC metric, especially on Rotation and Boundary, because the CovSal model overemphasizes center bias, which the sAUC metric penalizes.
3) Deep saliency models clearly outperform the early models based on hand-crafted features. GazeGAN achieves the top-ranked average performance across the metrics. In addition, GazeGAN is robust to the various transformation types, with no obvious failure.
VI-E Discussion on the robustness of GazeGAN
As indicated in Fig. 9, Fig. 12 and Table VI, the proposed GazeGAN achieves greater robustness against various transformations. In this section, we examine the robustness of the models from different perspectives.
1) Hendrycks et al. pointed out that multiscale architectures achieve better robustness by propagating features across different scales at each layer, rather than slowly building a global representation of the input as in traditional CNNs. GazeGAN uses skip connections, CSC connections, and a local-global GAN architecture, all of which leverage multiscale features. To verify this perspective, we compare GazeGAN and SalGAN in the feature space, because the generator of SalGAN is a vanilla VGG-16 network without multiscale skip connections. We plot the visualization results of GazeGAN and SalGAN in the representation space, as shown in Fig. 13. This shows what the deep models have learned in the feature space. We adopt two feature visualization methods, i.e., Backpropagation (BP) and Feature Mapping (FM), to probe the representations of deep saliency models on distorted stimuli. For a fair comparison, we select the 4th decoder layer of the generator for both GazeGAN and SalGAN to generate the visualization results. We notice that, in the feature space, GazeGAN still focuses on the salient “Human Face” and “Bubble” regions under Noise and JPEG perturbations, while SalGAN is vulnerable to these corruptions.
2) Goodfellow et al. showed that most deep models are too linear to resist linear perturbations. Bastani et al. investigated the correlation between model robustness and model linearity, and indicated that deep models with higher linearity are more vulnerable to corruptions. For GazeGAN, as shown in Fig. 6, the proposed CSC modules introduce both nonlinear activation functions and nonlinear point-to-point subtraction steps, which aim to increase the nonlinearity.
3) Hybrid adversarial training is a defense strategy for improving the robustness of deep CNN models. This method trains the deep models on an ensemble of the original images and adversarial examples. Adversarial examples are images generated manually by adding slight perturbations to the original images. The proposed valid data augmentation strategy provides a similar solution: it trains the deep CNNs on examples corrupted by an ensemble of several transformations. As pointed out by Goodfellow et al., this hybrid adversarial training strategy is currently the most effective method to improve model robustness, and it prevents overfitting to a specific distortion type.
In this work, we construct an eye-tracking dataset containing several common image transformations. Based on our analyses of eye-movement data, we introduce a valid data augmentation strategy using some label-preserving transformations for boosting deep-learning based saliency models. Besides, we propose a new model called GazeGAN which combines skip-connections and center-surround connections to adequately leverage multi-level features. We also design a histogram loss to refine the prediction result. GazeGAN achieves encouraging performance on both normal and distorted datasets. We share our dataset and code with the community to promote research in improving the robustness of saliency models against non-canonical stimuli.
-  X. Huang, C. Shen, X. Boix, and Q. Zhao. Salicon: Reducing the semantic gap in saliency prediction by adapting deep neural networks. In Proceedings of IEEE International Conference on Computer Vision, pp. 262-270, 2015.
-  A. Borji and L. Itti. State-of-the-art in visual attention modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 35, No. 1, pp. 185-207, 2013.
-  A. Oliva, A. Torralba, M. S. Castelhano, and J. M. Henderson. Top-down control of visual attention in object detection. In Proceedings of IEEE International Conference on Image Processing, pp. I-253-I-257, 2003.
-  S. Frintrop. A visual attention system for object detection and goal-directed search. Springer, 2005.
-  A. Mishra, Y. Aloimonos, and C. L. Fah. Active segmentation with fixation. In Proceedings of IEEE 12th International Conference on Computer Vision, pp. 468-475, 2009.
-  A. Borji, D. N. Sihite, and L. Itti. Quantitative analysis of human-model agreement in visual saliency modeling: A comparative study. IEEE Transactions on Image Processing, Vol. 22, No. 1, pp. 55-69, 2013.
-  A. Borji, H. R. Tavakoli, D. N. Sihite, and L. Itti. Analysis of scores, datasets, and models in visual saliency prediction. In Proceedings of IEEE International Conference on Computer Vision, pp. 921-928, 2013.
-  L. Itti, C. Koch, and E. Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 20, No. 11, pp. 1254-1259, 1998.
-  J. Harel, C. Koch, and P. Perona. Graph-based visual saliency. In Proceedings of Advances in Neural Information Processing Systems, pp. 545-552, 2007.
-  A. Torralba, A. Oliva, and M. S. Castelhano. Contextual guidance of eye movements and attention in real-world scenes: the role of global features in object search. Psychological Review, pp. 766, 2006.
-  E. Erdem and A. Erdem. Visual saliency estimation by nonlinearly integrating features using region covariances. Journal of Vision, Vol. 13, No. 4, pp. 11-23, 2013.
-  N. Bruce and J. Tsotsos. Attention based on information maximization. Journal of Vision, Vol. 7, No. 9, pp. 950-962, 2007.
-  X. Hou, J. Harel, and C. Koch. Image signature: Highlighting sparse salient regions. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 34, No. 1, pp. 194-201, 2012.
-  A. Borji and L. Itti. Exploiting local and global patch rarities for saliency detection. In Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, pp. 478-485, 2012.
-  J. Zhang and S. Sclaroff. Saliency detection: A boolean map approach. In Proceedings of IEEE International Conference on Computer Vision, pp. 153-160, 2013.
-  M. Cheng, N. J. Mitra, and X. Huang. Global contrast based salient region detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 37, No. 3, pp. 569-582, 2015.
-  N. Murray, M. Vanrell, and X. Otazu. Saliency estimation using a non-parametric low-level vision model. In Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, pp. 433-440, 2011.
-  A. Garcia-Diaz, V. Leboran, and X. R. Fdez-Vidal. On the relationship between optical variability, visual saliency, and eye fixations: A computational approach. Journal of Vision, Vol. 12, No. 6, pp. 17-29, 2012.
-  S. Goferman, L. Manor, and A. Tal. Context-aware saliency detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 34, No. 10, pp. 1915-1926, 2012.
-  M. Cornia, L. Baraldi, G. Serra, and R. Cucchiara. A deep multi-level network for saliency prediction. In Proceedings of IEEE International Conference on Pattern Recognition, pp. 3488-3493, 2016.
-  J. Pan, C. Canton, K. McGuinness, et al. Salgan: Visual saliency prediction with generative adversarial networks. In arXiv preprint arXiv:1701.01081, 2017.
-  C. Thomas. Opensalicon: An open source implementation of the salicon saliency model. In arXiv preprint arXiv:1606.00110, 2016.
-  J. Pan, K. McGuinness, E. Sayrol, N. O'Connor, et al. Shallow and deep convolutional networks for saliency prediction. In Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, pp. 598-606, 2016.
-  M. Cornia, L. Baraldi, G. Serra, et al. Predicting human eye fixations via an lstm-based saliency attentive model. IEEE Transactions on Image Processing, Vol. 27, No. 10, pp. 5142-5154, 2018.
-  M. Jiang, S. Huang, J. Duan, and Q. Zhao. Salicon: Saliency in context. In Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, pp. 1072-1080, 2015.
-  T. Judd, K. Ehinger, and F. Durand. Learning to predict where humans look. In Proceedings of International Conference on Computer Vision, pp. 2106-2113, 2009.
-  A. Borji and L. Itti. Cat2000: A large scale fixation dataset for boosting saliency research. In arXiv preprint cs.CV, 2015.
-  J. Deng, W. Dong, R. Socher, L. Li, K. Li, and F. Li. Imagenet: A large-scale hierarchical image database. In Proceedings of International Conference on Computer Vision and Pattern Recognition, pp. 248-255, 2009.
-  W. Zhang, A. Borji, Z. Wang, P. Le Callet, and H. Liu. The application of visual saliency models in objective image quality assessment: A statistical evaluation. IEEE Transactions on Neural Networks and Learning Systems, Vol. 27, No. 6, pp. 1266-1278, 2016.
-  D. Gao, S. Han, and N. Vasconcelos. Discriminant saliency, the detection of suspicious coincidences, and applications to visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 31, No. 6, pp. 989-1005, 2009.
-  C. Kim and P. Milanfar. Visual saliency in noisy images. Journal of Vision, Vol. 13, No. 4, pp. 5-17, 2013.
-  T. Judd, F. Durand, and A. Torralba. Fixations on low-resolution images. Journal of Vision, Vol. 11, No. 4, pp. 14-26, 2011.
-  W. Zhang and H. Liu. Toward a reliable collection of eye-tracking data for image quality research: Challenges, solutions, and applications. IEEE Transactions on Image Processing, Vol. 26, No. 5, pp. 2424-2437, 2017.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Proceedings of Advances in Neural Information Processing Systems, pp. 1097-1105, 2012.
-  Z. Bylinskii, T. Judd, A. Oliva, A. Torralba, and F. Durand. What do different evaluation metrics tell us about saliency models? In arXiv preprint cs.CV, 2017.
-  Z. Bylinskii. Code for computing visual angle. https://github.com/cvzoya/saliency/tree/master/computeVisualAngle, 2014.
-  O. Le Meur and T. Baccino. Methods for comparing scanpaths and saliency maps: strengths and weaknesses. Behavior Research Methods, 2013.
-  A. G. Greenwald. Within-subjects designs: To use or not to use? Psychological Bulletin, 1976.
-  Z. Bylinskii, T. Judd, A. Borji, L. Itti, F. Durand, A. Oliva, and A. Torralba. Mit saliency benchmark. http://saliency.mit.edu/.
-  O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234-241, 2015.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of International Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016.
-  P. Isola, J. Zhu, T. Zhou, and A. Efros. Image-to-image translation with conditional adversarial nets. In Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, 2017.
-  C. Ledig, L. Theis, and F. Huszár. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of International Conference on Computer Vision and Pattern Recognition, Vol. 2, No. 3, 2017.
-  O. Kupyn, V. Budzan, and M. Mykhailych. Deblurgan: Blind motion deblurring using conditional adversarial networks. In arXiv preprint arXiv:1711.07064, 2017.
-  E. Ustinova and V. Lempitsky. Learning deep embeddings with histogram loss. In Proceedings of Advances in Neural Information Processing Systems, pp. 4170-4178, 2016.
-  R. Peters, A. Iyer, L. Itti, and C. Koch. Components of bottom up gaze allocation in natural images. Vision Research, Vol. 45, No. 8, pp. 2397-2416, 2005.
-  D. Kingma and J. Ba. Adam: A method for stochastic optimization. In arXiv preprint arXiv:1412.6980, 2014.
-  I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Proceedings of Advances in Neural Information Processing Systems, pp. 2672-2680, 2014.
-  W. Wang and J. Shen. Deep visual attention prediction. IEEE Transactions on Image Processing, Vol. 27, No. 5, pp. 2368-2378, 2018.
-  M. Kümmerer, T. Wallis, and L. Gatys. Understanding low-and high-level contributions to fixation prediction. In Proceedings of IEEE International Conference on Computer Vision, pp. 4799-4808, 2017.
-  M. Kümmerer, L. Theis, and M. Bethge. Deep gaze i: Boosting saliency prediction with feature maps trained on imagenet. In arXiv preprint arXiv:1411.1045, 2014.
-  J. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller. Striving for simplicity: The all convolutional net. In CoRR, abs/1412.6806, 2014.
-  B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning deep features for discriminative localization. In Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, 2016.
-  D. Hendrycks and T. G. Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. In Proceedings of International Conference on Learning Representations, 2019.
-  I. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. In Proceedings of International Conference on Learning Representations, 2015.
-  O. Bastani, Y. Ioannou, L. Lampropoulos, D. Vytiniotis, A. Nori, and A. Criminisi. Measuring neural net robustness with constraints. In Proceedings of Advances in Neural Information Processing Systems, pp. 2613-2621, 2016.