Identifying Reliable Annotations for Large Scale Image Segmentation
Abstract
Challenging computer vision tasks, in particular semantic image segmentation, require large training sets of annotated images. While obtaining the actual images is often unproblematic, creating the necessary annotation is a tedious and costly process. Therefore, one often has to work with unreliable annotation sources, such as Amazon Mechanical Turk or (semi-)automatic algorithmic techniques.
In this work, we present a Gaussian process (GP) based technique for simultaneously identifying which images of a training set have unreliable annotation and learning a segmentation model in which the negative effect of these images is suppressed. Alternatively, the model can also just be used to identify the most reliably annotated images from the training set, which can then be used for training any other segmentation method.
By relying on "deep features" in combination with a linear covariance function, our GP can be learned and its hyperparameters determined efficiently using only matrix operations and gradient-based optimization. This makes our method scalable even to large datasets with several million training instances.
1 Introduction
The recent emergence of large image datasets has led to drastic progress in computer vision. In order to achieve state-of-the-art performance for various visual tasks, models are trained from millions of annotated images [14, 28]. However, manually creating expert annotation for large datasets requires a tremendous amount of resources and is often impractical, even with support by major industrial Internet companies. For example, it has been estimated that creating bounding box annotation for object detection tasks takes 25 seconds per box [27], and several minutes of human effort per image can be required to create pixel-wise annotation for semantic image segmentation tasks [16].
In order to facilitate the data annotation process and make it manageable, researchers often utilize sources of annotation that are less reliable but that scale more easily to large amounts of data. For example, one harvests images from Internet search engines [24] or uses Amazon Mechanical Turk (MTurk) to create annotation. Another approach is to create annotation in a (semi-)automatic way, e.g. using knowledge transfer methods [10, 11].
A downside of such cheap data sources, in particular automatically created annotations, is that they can contain a substantial amount of mistakes. Moreover, these mistakes are often strongly correlated: for example, MTurk workers will make similar annotation errors in all images they handle, and an automatic tool, such as segmentation transfer, will work better on some classes of images than others.
Using such noisily annotated data for training can lead to suboptimal performance. As a consequence, many learning techniques try to identify and suppress the wrong or unreliable annotations in the dataset before training. However, this leads to a classical chicken-and-egg problem: one needs a good data model to identify mislabeled parts of the data, and one needs reliable data to estimate a good model.
Our contribution in this work is a Gaussian process (GP) [20] treatment of the problem of learning with unreliable annotation. It avoids the above chicken-and-egg problem by adopting a Bayesian approach, jointly learning a distribution of suitable models and confidence values for each training image (see Figure 1). Afterwards, we use the most likely such model to make predictions. All (hyper)parameters are learned from data, so no model selection over free parameters, such as a regularization constant or noise strength, is required.
We also describe an efficient and optionally distributed implementation of Gaussian processes with a low-rank covariance matrix that scales to segmentation datasets with more than 100,000 images (16 million superpixel training instances). Conducting experiments on the task of foreground/background image segmentation with large training sets, we demonstrate that the proposed method outperforms other approaches for identifying unreliably annotated images and that this leads to improved segmentation quality.
1.1 Related work
The problem of unreliable annotation has appeared in the literature previously in different contexts.
For the task of dataset creation, it has become common practice to collect data from unreliable sources, such as MTurk, but have each sample annotated by more than one worker and combine the obtained labels, e.g. by a (weighted) majority vote [21, 26]. For segmentation tasks even this strategy can be too costly, and it is not clear how annotations could be combined. Instead, it has been suggested to have each image annotated only by a single worker, but require workers to first fulfill a grading task [16]. When using images retrieved from search engines, it has been suggested to make use of additional available information, e.g. keywords, to filter out mislabeled images [24].
For learning a classifier from unreliable data, the easiest option is to ignore the problem and rely on the fact that many discriminative learning techniques are to some extent robust against label noise. We use this strategy as one of the baselines for our experiments in Section 4, finding however that it leads to suboptimal results. Alternatively, outlier filtering based on the self-learning heuristic is popular: a prediction model is first trained on all data, then its outputs are used to identify a subset of the data consistent with the learned model. Afterwards, the model is retrained on the subset. Optionally, these steps can be repeated multiple times [4]. We use this idea as a second baseline for our experiments, showing that it improves the performance, but not as much as the method we propose.
Special variants of popular classification methods, such as support vector machines and logistic regression, have been proposed that are more tolerant to label noise by explicitly modeling in the objective function the possibility of label changes. However, these usually result in more difficult optimization problems that need to be solved, and they can only be expected to work if certain assumptions about the noise are fulfilled, in particular that the label noise is statistically independent between different training instances. For an in-depth discussion of these and more methods we recommend the recent survey on learning with label noise [8].
Note that recently a method has been proposed by which a classifier is able to self-assess the quality of its predictions [30]. While also based on Gaussian processes, that work differs significantly from ours: it aims at evaluating outputs of a learning system using a GP's posterior distribution, while in this work our goal is to assess the quality of inputs for a learning system, and we do so using the GP's ability to infer hyperparameters from data.
2 Learning with unreliable annotations
We are given a training set, \mathcal{D}=\{(\mathbf{I}_{j},\mathbf{M}_{j})\}_{j=1}^{n}, that consists of n pairs of images and segmentation masks. Each image \mathbf{I}_{j} is represented as a collection of r_{j} superpixels, (x_{1},\dots,x_{r_{j}}), with x_{k}\in\mathcal{X} for each k\in\{1,\dots,r_{j}\}, where \mathcal{X} is a universe of superpixels. Correspondingly, any segmentation mask \mathbf{M}_{j} is a collection (y_{1},\dots,y_{r_{j}}), where each y_{k}\in\mathcal{Y} is the semantic label of the superpixel x_{k} and \mathcal{Y} is a finite label set. For convenience we combine all superpixels and semantic labels from the training data and form vectors \mathbf{X} and \mathbf{y} of length N, denoting individual superpixels and semantic labels by a lower index i. In the scope of this work we consider the foreground-background segmentation problem with \mathcal{Y}=\{+1,-1\}, where +1 stands for foreground and -1 for background. An extension of our technique to the multi-class scenario is possible, but beyond the scope of this manuscript.
The main goal of this work is to learn a prediction function, f:\mathcal{X}\rightarrow\mathcal{Y}, in the presence of a significant number of mistakes in the labels of the training data. We address this learning problem using Gaussian processes.
2.1 Gaussian processes
Gaussian processes (GPs) are a prominent Bayesian machine learning technique, which in particular is able to reason about noise in data and allows principled, gradient-based hyperparameter tuning. In this section we reiterate key results from the Gaussian processes literature from a practitioner's view. For a more complete discussion, see [20].
A GP is defined by a positive-definite covariance (or kernel) function, \kappa:\mathcal{X}\times\mathcal{X}\rightarrow\mathbb{R}, that can depend on hyperparameters \bm{\theta}. For any test input, \bar{x}, the GP defines a Gaussian posterior (or predictive) distribution,
p(\bar{y}\,|\,\bar{x},\mathbf{X},\mathbf{y},\bm{\theta}) = \mathcal{N}\left(m(\bar{x}),\sigma(\bar{x})\right).  (1)
The mean function,
m(\bar{x}) = \bar{\kappa}(\bar{x})^{\top}\mathbf{K}^{-1}\mathbf{y},  (2)
allows us to make predictions (by taking its sign), and the variance,  
\sigma(\bar{x}) = \kappa(\bar{x},\bar{x}) - \bar{\kappa}(\bar{x})^{\top}\mathbf{K}^{-1}\bar{\kappa}(\bar{x}),  (3)
reflects our confidence in this prediction, where \mathbf{K} is the N\times N covariance matrix of the training data with entries \mathbf{K}_{ij}=\kappa(\mathbf{X}_{i},\mathbf{X}_{j}) for i,j\in\{1,\dots,N\} and \bar{\kappa}(\bar{x})=[\kappa(\mathbf{X}_{1},\bar{x}),\dots,\kappa(\mathbf{X}_{N},\bar{x})]^{\top}\in\mathbb{R}^{N}. Note that the mean function (2) is the same as one would obtain from kernel ridge regression [15], which has proven effective also for classification tasks [22].
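To make equations (2) and (3) concrete, here is a minimal NumPy sketch of the predictive mean and variance (our own illustration, not the implementation described in Section 3; the kernel matrix and inputs are assumed to be given):

```python
import numpy as np

def gp_predict(K, y, k_bar, k_star):
    """GP predictive mean (Eq. 2) and variance (Eq. 3).

    K      : (N, N) covariance matrix of the training inputs
    y      : (N,)   training labels
    k_bar  : (N,)   covariances between the training inputs and the test input
    k_star : scalar covariance kappa(x_bar, x_bar)
    """
    # Solve linear systems instead of forming K^{-1} explicitly.
    alpha = np.linalg.solve(K, y)
    v = np.linalg.solve(K, k_bar)
    mean = k_bar @ alpha          # m(x_bar), Eq. (2)
    var = k_star - k_bar @ v      # sigma(x_bar), Eq. (3)
    return mean, var
```

For a linear kernel, the mean computed this way coincides with the (primal) kernel ridge regression prediction, which reflects the equivalence noted above.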
Due to their probabilistic nature, Gaussian processes can incorporate uncertainty about labels in the training set. One assumes that the label, y, of any training example is perturbed by Gaussian noise with zero mean and variance \varepsilon^{2}. Different noise variances for different examples reflect the situation in which certain example labels are more trustworthy than others.
The specific form of the GP allows us to integrate out the label noise from the posterior distribution. The integral can be computed in closed form, resulting in a new posterior distribution with mean function,
m(\bar{x}) = \bar{\kappa}(\bar{x})^{\top}\mathbf{K}_{\mathcal{E}}^{-1}\mathbf{y},  (4)
and variance \sigma(\bar{x})=\kappa(\bar{x},\bar{x})-\bar{\kappa}(\bar{x})^{\top}\mathbf{K}_{\mathcal{E}}^{-1}\bar{\kappa}(\bar{x}), for an augmented covariance matrix \mathbf{K}_{\mathcal{E}}=\mathbf{K}+\mathcal{E}, where \mathcal{E} is the diagonal matrix that contains the noise variances of all training examples. (Alternatively, we can think of \mathbf{K}_{\mathcal{E}} as the data covariance matrix for a modified covariance function.) We consider potential hyperparameters of \mathcal{E} as a part of \bm{\theta}.
2.2 Hyperparameter learning
A major advantage of GPs over other regression techniques is that their probabilistic interpretation offers a principled method for hyperparameter tuning based on continuous, gradient-based optimization instead of partitioning-based techniques such as cross-validation. We treat the unknown hyperparameters as random variables and study the joint probability p(\mathbf{y},\bm{\theta}\,|\,\mathbf{X}) over hyperparameters and semantic labels. Employing type-II likelihood estimation (see [20], chapter 5), we obtain optimal hyperparameters, \bm{\theta}^{*}, by solving the following optimization problem,
\bm{\theta}^{*}=\mathrm{argmax}_{\bm{\theta}}\;\ln p(\mathbf{y}\,|\,\bm{\theta},\mathbf{X}).  (5)
The expression p(\mathbf{y}\,|\,\bm{\theta},\mathbf{X}) in the objective (5) is known as the marginal likelihood. Its value and gradient can be computed in closed form,
\ln p(\mathbf{y}\,|\,\bm{\theta},\mathbf{X}) = -\dfrac{1}{2}\left(\mathbf{y}^{\top}\mathbf{K}_{\mathcal{E}}^{-1}\mathbf{y} + \ln|\mathbf{K}_{\mathcal{E}}| + N\ln(2\pi)\right),  (6)
\dfrac{\partial\ln p(\mathbf{y}\,|\,\bm{\theta},\mathbf{X})}{\partial\theta} = \dfrac{1}{2}\mathrm{tr}\left((\alpha\alpha^{\top} - \mathbf{K}_{\mathcal{E}}^{-1})\dfrac{\partial\mathbf{K}_{\mathcal{E}}}{\partial\theta}\right),  (7)
where \alpha=\mathbf{K}_{\mathcal{E}}^{-1}\mathbf{y}, \theta is any entry of \bm{\theta}, \dfrac{\partial\mathbf{K}_{\mathcal{E}}}{\partial\theta} is an element-wise partial derivative and |\mathbf{K}_{\mathcal{E}}| denotes the determinant of \mathbf{K}_{\mathcal{E}}. If the entries of \mathbf{K}_{\mathcal{E}} depend smoothly on \bm{\theta}, then the maximization problem (5) is also smooth and one can apply standard gradient-based techniques, even for high-dimensional \bm{\theta} (i.e. many hyperparameters). While the solution is not guaranteed to be globally optimal, since (5) is not convex, the procedure has been observed to result in good estimates which are largely insensitive to the initialization [20].
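As a small illustration of equations (6) and (7), the following NumPy sketch evaluates the log marginal likelihood and its gradient with respect to a single hyperparameter. It is a naive dense version for clarity, not the low-rank implementation of Section 3:

```python
import numpy as np

def log_marginal_likelihood(K_eps, y):
    """Log marginal likelihood of a GP, Eq. (6)."""
    N = len(y)
    alpha = np.linalg.solve(K_eps, y)
    _, logdet = np.linalg.slogdet(K_eps)   # K_eps is positive definite
    return -0.5 * (y @ alpha + logdet + N * np.log(2 * np.pi))

def lml_gradient(K_eps, y, dK_dtheta):
    """Gradient of Eq. (6) w.r.t. one hyperparameter, Eq. (7)."""
    alpha = np.linalg.solve(K_eps, y)
    inner = np.outer(alpha, alpha) - np.linalg.inv(K_eps)
    return 0.5 * np.trace(inner @ dK_dtheta)
```

A standard sanity check is to compare the analytic gradient against a finite-difference approximation, e.g. for K_\mathcal{E}(\theta) = K + \theta I with \partial K_\mathcal{E}/\partial\theta = I.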
2.3 A Gaussian process with group-wise confidences
Our main contribution in this work is a new approach, GPGC, for handling unreliably annotated data in which some training examples are more trustworthy than others. Earlier GP-based approaches either assume that the noise variance is constant for all training examples, i.e. \mathcal{E}=\lambda\mathbf{I} for some \lambda>0, or that the noise variance is a smooth function of the inputs, \mathcal{E}=\text{diag}(g(\mathbf{X}_{1}),\dots,g(\mathbf{X}_{N})), where g is also a Gaussian process function [9, 12]. Neither approach is suitable for our situation: a constant noise variance makes it impossible to distinguish between more and less reliable annotations, and an input-dependent noise variance can reflect only errors due to image contents, which is not adequate for errors due to an unreliable annotation process. For example, in image segmentation even identical-looking superpixels need not share the same noise level if they originate from different images or were annotated by different MTurk workers.
The above insight suggests allowing arbitrary noise levels, \mathcal{E}=\text{diag}(\varepsilon_{1},\dots,\varepsilon_{N}), for all training instances. However, without additional constraints this would give too much freedom in modelling the data and lead to overfitting. Therefore, we propose to take an intermediate route, based on the idea of estimating label confidence in groups. In particular, for image segmentation problems it is sufficient to model confidences for entire image segmentation masks, instead of confidences for every individual superpixel. We obtain such per-image confidence scores by assuming that all superpixel labels from the same image share the same confidence value, i.e. \varepsilon_{i}=\varepsilon_{j} if \mathbf{X}_{i} and \mathbf{X}_{j} belong to the same image. We treat the unknown noise levels as hyperparameters and learn their values in the way described above. Since our confidence about labels is based on the learned noise variances, we also refer to the above procedure as "learning label confidence". We call the resulting algorithm Gaussian Process with Group-wise Confidences, or GPGC.
Note that we avoid the chicken-and-egg problem mentioned in the introduction, because we simultaneously obtain the hyperparameters \bm{\theta}, in particular the noise levels \bm{\varepsilon}=[\varepsilon_{1},\dots,\varepsilon_{N}], and the predictive distribution.
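The group-wise noise structure can be sketched in a few lines: all superpixels of one image share a single noise level. This is our own illustration; the data layout (group ids, one epsilon per image) is an assumption for the example:

```python
import numpy as np

def groupwise_noise(group_ids, eps_per_group):
    """Build the diagonal noise matrix E with one variance per image (group).

    group_ids     : (N,) array assigning each superpixel to its image
    eps_per_group : dict mapping group id -> noise level epsilon (std-dev)
    """
    eps = np.array([eps_per_group[g] for g in group_ids])
    # E carries the noise *variances* eps_i^2 on its diagonal; all
    # superpixels from the same image receive the same value.
    return np.diag(eps ** 2)
```

In GPGC the values in `eps_per_group` are not fixed by hand but treated as hyperparameters and learned by maximizing the marginal likelihood (5).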
2.4 Instance reweighting
For unbalanced datasets, e.g. in the image segmentation case, where the background class is more frequent than the foreground, it makes sense to balance the data before training. A possible mechanism for this is to duplicate training instances of the minority class. Done naively, however, this would unnecessarily increase the computational complexity. Instead, we propose a computational shortcut that allows us to incorporate duplicate instances without overhead. Let w\in\mathbb{N}^{N} be a vector of duplicate counts, i.e. w_{i} is the number of copies of the training instance \mathbf{X}_{i}. Elementary transformations reveal that for the mean function (4), a duplication of training instances is equivalent to changing each hyperparameter \varepsilon_{i} to \varepsilon_{i}/\sqrt{w_{i}}. We denote by \bm{\theta}_{w} the vector of hyperparameters in which \bm{\varepsilon} is rescaled in this way. To incorporate duplicates into the marginal likelihood, we rescale \bm{\varepsilon} in the same way and additionally add correction terms, resulting in the following reweighted marginal likelihood,
\ln p_{w}(\mathbf{y}\,|\,\bm{\theta}) \;\hat{=}\; \ln p(\mathbf{y}\,|\,\bm{\theta}_{w}) + \dfrac{1}{2}\sum\limits_{i=1}^{N}\left[\ln(w_{i}\varepsilon^{2}_{i}) - w_{i}\ln\varepsilon^{2}_{i}\right],  (8)
where “\hat{=}” means equality up to a constant that does not depend on the hyperparameters.
Note that the above expressions are well-defined also for non-integer weights, w, which gives us not only the possibility to increase the importance of samples, but also to decrease it, if required.
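The shortcut for the mean function can be sketched as follows: instead of physically duplicating instances, each noise variance \varepsilon_i^2 is divided by the weight w_i inside the mean function (4). This is our own sketch with hypothetical variable names, not the released code:

```python
import numpy as np

def reweighted_mean(K, E, y, weights, k_bar):
    """Predictive mean (4) with duplicate counts folded into the noise term.

    Duplicating instance i w_i times is equivalent to shrinking its noise
    variance from eps_i^2 to eps_i^2 / w_i.

    K       : (N, N) noise-free covariance matrix
    E       : (N,)   noise variances eps_i^2 (diagonal of the noise matrix)
    y       : (N,)   labels
    weights : (N,)   duplicate counts w_i
    k_bar   : (N,)   test covariances
    """
    E_w = E / weights
    return k_bar @ np.linalg.solve(K + np.diag(E_w), y)
```

On a tiny example, this shortcut produces exactly the same prediction as explicitly replicating a training instance and running the unweighted mean function on the enlarged set.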
3 Efficient Implementation
Gaussian processes have a reputation for being computationally demanding. Generally, their computational complexity scales cubically with the number of training instances and their memory consumption grows quadratically, because they have to store and invert the augmented data covariance matrix, \mathbf{K}_{\mathcal{E}}. Thus, standard implementations of Gaussian processes become computationally prohibitive for large-scale datasets.
Nevertheless, if the sample covariance matrix has a low-rank structure, all necessary computations can be carried out much faster by utilizing the Sherman-Morrison-Woodbury identity and the matrix determinant lemma [17, Corollary 4.3.1]. To benefit from this, many techniques for approximating GPs by low-rank GPs have been developed using, e.g., the Nyström decomposition [32], random subsampling [6], k-means clustering [33], approximate kernel feature maps [19, 29], or inducing points [3, 18].
In this work we follow the general trend in computer vision and rely on an explicit feature map (obtained from a pretrained deep network [5, 25]) in combination with a linear covariance function. This allows us to develop a parallel and distributed implementation of Gaussian processes with exact inference, even in the large-scale regime. Formally, we use a linear covariance function, \kappa, of the following form,
\kappa(x_{1},x_{2})=\phi(x_{1})^{\top}\Sigma\,\phi(x_{2}),  (9) 
where \phi:\mathcal{X}\rightarrow\mathbb{R}^{k} is a k-dimensional feature map with k\ll N, and \Sigma=\text{diag}(\sigma_{1}^{2},\dots,\sigma_{k}^{2})\in\mathbb{R}^{k\times k} is a diagonal matrix of feature scales. The entries of \Sigma are assumed to be unknown and included in the vector of hyperparameters \bm{\theta}. The feature map \phi induces a feature matrix \mathbf{F}=[\phi(\mathbf{X}_{1}),\dots,\phi(\mathbf{X}_{N})]\in\mathbb{R}^{k\times N} of the training set. As a result, the augmented covariance matrix has a special structure as the sum of a diagonal and a low-rank matrix,
\mathbf{K}_{\mathcal{E}}=\mathcal{E}+\mathbf{F}^{\top}\Sigma\mathbf{F}.  (10) 
This low-rank representation allows us to store \mathbf{K}_{\mathcal{E}} implicitly by storing the matrices \mathcal{E}, \Sigma and \mathbf{F}, which reduces the memory requirements from \mathcal{O}(N^{2}) to \mathcal{O}(Nk). Moreover, all necessary computations for the predictive distribution (1) and marginal likelihood (6) require only \mathcal{O}(Nk^{2}) operations instead of \mathcal{O}(N^{3}) [32].
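As a sketch of how the structure (10) is exploited, the following computes \mathbf{K}_{\mathcal{E}}^{-1}\mathbf{y} in \mathcal{O}(Nk^{2}) via the Sherman-Morrison-Woodbury identity without ever forming the N\times N matrix (our illustration; matrix layouts follow the notation above):

```python
import numpy as np

def lowrank_solve(E, F, Sigma, y):
    """Compute K_E^{-1} y for K_E = diag(E) + F^T diag(Sigma) F in O(N k^2).

    E     : (N,) diagonal of the noise matrix (variances)
    F     : (k, N) feature matrix
    Sigma : (k,) diagonal of the feature-scale matrix
    y     : (N,) label vector
    """
    # Woodbury: (E + F^T S F)^{-1} = E^{-1} - E^{-1} F^T (S^{-1} + F E^{-1} F^T)^{-1} F E^{-1}
    Einv_y = y / E
    Einv_Ft = (F / E).T                       # E^{-1} F^T, shape (N, k)
    inner = np.diag(1.0 / Sigma) + F @ Einv_Ft  # k x k system only
    return Einv_y - Einv_Ft @ np.linalg.solve(inner, F @ Einv_y)
```

The only dense system solved is k\times k, so the cost is dominated by the matrix products involving \mathbf{F}.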
Computing the gradients (7) with respect to unknown hyperparameters generally imposes a computational overhead that scales linearly with the number of hyperparameters [18, 20]. For GPGC, however, we can exploit the homogeneous structure of the hyperparameters, \bm{\theta}=[\bm{\varepsilon},\bm{\sigma}], where \bm{\varepsilon}=[\varepsilon_{1},\dots,\varepsilon_{N}] and \bm{\sigma}=[\sigma_{1},\dots,\sigma_{k}], to derive an expression for the gradient without such overhead:
\nabla_{\bm{\varepsilon}}\ln p(\mathbf{y}\,|\,\bm{\theta})=\mathrm{diag}\left((\alpha\alpha^{\top}-\mathbf{K}_{\mathcal{E}}^{-1})\mathcal{E}^{\prime}\right),  (11)
\nabla_{\bm{\sigma}}\ln p(\mathbf{y}\,|\,\bm{\theta})=\mathrm{diag}\left(\mathbf{F}(\alpha\alpha^{\top}-\mathbf{K}_{\mathcal{E}}^{-1})\mathbf{F}^{\top}\Sigma^{\prime}\right),  (12)
where \alpha=\mathbf{K}_{\mathcal{E}}^{-1}\mathbf{y}, and \mathcal{E}^{\prime} and \Sigma^{\prime} are diagonal matrices formed by the vectors \bm{\varepsilon} and \bm{\sigma}, respectively.
The computational bottleneck of lowrank Gaussian process learning is constituted by standard linear algebra routines, in particular matrix multiplication and inversion. Thus, a significant reduction in runtime can be achieved by relying on multithreaded linear algebra libraries or even GPUs.
3.1 Distributed implementation
Despite great improvements in performance from utilizing a low-rank covariance function and parallel matrix operations, Gaussian processes still remain computationally challenging for truly large datasets with high-dimensional feature maps. For example, one of the datasets we use in our experiments has more than 100,000 images, 16 million superpixels and a 4,113-dimensional feature representation. Storing the feature matrix alone requires more than 512 GB RAM, which is typically not available on a single workstation, but easily achievable if the representation is distributed across multiple machines.
In order to overcome memory limitations and further improve the computational performance, we developed a distributed version of low-rank Gaussian processes. It relies on the insight that the feature matrix \mathbf{F} itself is not required for computing the prediction function (4), the marginal likelihood (6) and its gradient (7), if an oracle is available for answering the following four queries:

(i) compute \mathbf{F}v for any v\in\mathbb{R}^{N},
(ii) compute \mathbf{F}^{\top}u for any u\in\mathbb{R}^{k},
(iii) compute \mathbf{F}D\mathbf{F}^{\top} for any diagonal D\in\mathbb{R}^{N\times N},
(iv) compute \mathrm{diag}(\mathbf{F}^{\top}A\mathbf{F}) for any A\in\mathbb{R}^{k\times k}.
See Appendix A for a detailed explanation. On top of such an oracle we need only \mathcal{O}(k^{2}+N) bytes of memory and \mathcal{O}(k^{3}+N) operations to accomplish all necessary computations, which is orders of magnitude less than the original requirements of \mathcal{O}(Nk) bytes and \mathcal{O}(Nk^{2}) operations.
Implementing a distributed version of the oracle is straightforward: suppose that p computational nodes are available. We then split the feature matrix \mathbf{F}=[\mathbf{F}_{1},\mathbf{F}_{2},\dots,\mathbf{F}_{p}] into p roughly equally-sized parts, each of which is stored on one of the nodes. All oracle operations naturally decompose with respect to the parts of the feature matrix:

(i) \mathbf{F}v=\sum_{i=1}^{p}\mathbf{F}_{i}v_{i},
(ii) \mathbf{F}^{\top}u=[(\mathbf{F}_{1}^{\top}u)^{\top},\dots,(\mathbf{F}_{p}^{\top}u)^{\top}]^{\top},
(iii) \mathbf{F}D\mathbf{F}^{\top}=\sum_{i=1}^{p}\mathbf{F}_{i}D_{i}\mathbf{F}_{i}^{\top},
(iv) \mathrm{diag}(\mathbf{F}^{\top}A\mathbf{F})=[\mathrm{diag}(\mathbf{F}_{1}^{\top}A\mathbf{F}_{1})^{\top},\dots,\mathrm{diag}(\mathbf{F}_{p}^{\top}A\mathbf{F}_{p})^{\top}]^{\top},
where we split the vector v and the diagonal matrix D into p parts in the same fashion as we split \mathbf{F}, obtaining v_{i} and D_{i} for all i\in\{1,\dots,p\}. A master node takes care of distributing the objects v, u, D and A over the computational nodes. Each computational node i calculates \mathbf{F}_{i}v_{i}, (\mathbf{F}_{i}^{\top}u)^{\top}, \mathbf{F}_{i}D_{i}\mathbf{F}_{i}^{\top} and \mathrm{diag}(\mathbf{F}_{i}^{\top}A\mathbf{F}_{i})^{\top} and sends the results to the master node, which aggregates them by taking the sum for operations (i) and (iii) or the concatenation for operations (ii) and (iv). The communication between the master node and the computational nodes requires sending messages of size at most \mathcal{O}(k^{2}+N) bytes, which is small in relation to the size of the training data.
Consequently, our distributed implementation reduces the time and per-machine memory requirements by a factor of p, at the expense of a minor overhead for network communication and computations on the master node.
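The decomposition of the oracle operations can be simulated on a single machine. The sketch below (a single-process illustration without actual network communication) splits \mathbf{F} column-wise and implements operations (i) and (iii); operations (ii) and (iv) concatenate per-part results instead of summing them:

```python
import numpy as np

def split_cols(F, p):
    """Split the feature matrix column-wise into p roughly equal parts."""
    return np.array_split(F, p, axis=1)

def oracle_Fv(parts, v_parts):
    # (i)  F v = sum_i F_i v_i  -- each node contributes a length-k vector
    return sum(Fi @ vi for Fi, vi in zip(parts, v_parts))

def oracle_FDFt(parts, d_parts):
    # (iii) F D F^T = sum_i F_i D_i F_i^T  (D diagonal, passed as a vector)
    return sum((Fi * di) @ Fi.T for Fi, di in zip(parts, d_parts))
```

In the real system each `parts[i]` lives on a separate node and only the small per-part results (size \mathcal{O}(k) or \mathcal{O}(k^{2})) travel to the master.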
4 Experiments
Table 1: Average class accuracy of GPGC and the baselines; training set sizes in parentheses.

Method           | SVM  | GP   | GPGC
HDSeg dataset    |      |      |
Horses (19,060)  | 82.5 | 82.5 | 83.7
Dogs (111,668)   | 80.6 | 80.5 | 81.3
AutoSeg dataset  |      |      |
Horses (9,007)   | 81.2 | 80.3 | 82.5
Dogs (41,777)    | 77.1 | 77.1 | 79.4
Cats (3,006)     | 73.1 | 72.4 | 73.5
Sheep (5,079)    | 75.6 | 75.4 | 80.0
We implemented GPGC in Python, relying on the OpenBLAS library (http://openblas.net) for linear algebra operations and L-BFGS [2] for gradient-based optimization. The code will be made publicly available.
We perform experiments on two largescale datasets for foregroundbackground image segmentation, see Figure 2 for example images.
1) HDSeg [13] (http://ist.ac.at/~akolesnikov/HDSeg/). We use the 19,060 images of horses and 111,668 images of dogs with segmentation masks created automatically by the segmentation transfer method [11] for training. The test images are 241 and 306 manually segmented images of horses and dogs, respectively.
2) AutoSeg, a new dataset that we collated from public sources and augmented with additional annotations; we will publish the dataset, including precomputed features. The training images for this dataset are taken from the ImageNet project (http://www.imagenet.org). There are four categories: horses (9,007 images), dogs (41,777 images), cats (3,006 images), sheep (5,079 images). All training images are annotated with segmentation masks generated automatically by the GrabCut algorithm [23] from the OpenCV library (http://opencv.org) with default parameters. We initialize GrabCut with bounding boxes that were also generated automatically by the ImageNet AutoAnnotation method [30] (http://groups.inf.ed.ac.uk/calvin/projimagenet/page). The test set consists of 1001 images of horses, 1521 images of dogs, 1480 images of cats and 489 images of sheep with manually created per-pixel segmentation masks that were taken from the validation part of the MS COCO dataset (http://mscoco.org/).
As evaluation metric for both datasets we use the average class accuracy [13]: we compute the percentage of correctly classified foreground pixels and the percentage of correctly classified background pixels across all images and average both values. To assess the significance of reported results, this single number is not sufficient. Therefore, we use a closely related quantity for this purpose: we compute an average class accuracy as above separately for every image and perform a Wilcoxon signed-rank test [31] with significance level 10^{-3}.
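For clarity, the average class accuracy over a set of predicted labels can be computed as follows (a sketch under the label convention \mathcal{Y}=\{+1,-1\} from Section 2; the array representation is our assumption):

```python
import numpy as np

def average_class_accuracy(pred, gt):
    """Mean of foreground and background accuracies.

    pred, gt : arrays of labels in {+1, -1}; +1 = foreground, -1 = background
    """
    fg = gt == 1
    bg = gt == -1
    acc_fg = np.mean(pred[fg] == 1)    # fraction of foreground pixels correct
    acc_bg = np.mean(pred[bg] == -1)   # fraction of background pixels correct
    return 0.5 * (acc_fg + acc_bg)
```

Averaging the two per-class accuracies prevents the majority (background) class from dominating the score.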
Table 2: Average class accuracy when training a prediction model (SVM or GP) on the 25% of training images selected either by SVM margin or by GPGC confidence.

                | Prediction model: SVM       | Prediction model: GP
Selection rule  | SVM margin | GPGC confidence | SVM margin | GPGC confidence
HDSeg dataset   |            |                 |            |
Horses          | 83.8       | 84.3            | 83.5       | 84.3
Dogs            | 81.7       | 82.0            | 81.2       | 81.7
AutoSeg dataset |            |                 |            |
Horses          | 82.5       | 83.2            | 82.7       | 83.9
Dogs            | 79.2       | 80.9            | 79.7       | 81.2
Cats            | 71.9       | 72.9            | 72.5       | 73.9
Sheep           | 80.2       | 81.7            | 81.2       | 82.9
Table 3: Average class accuracy of a GP trained on the top q% of images that GPGC was most confident about, compared to GPGC trained on the full training set.

Method | GPGC | Top 1% | Top 2% | Top 5% | Top 10% | Top 15% | Top 25% | Top 50% | Top 75%
HDSeg dataset
Horses | 83.7 | 82.7 | 83.0 | 83.7 | 83.9 | 84.1 | 84.3 | 84.4 | 83.9
Dogs   | 81.3 | 80.0 | 80.3 | 80.8 | 81.2 | 81.4 | 81.7 | 81.9 | 81.6
AutoSeg dataset
Horses | 82.5 | 82.4 | 82.8 | 83.3 | 83.8 | 83.8 | 83.9 | 83.4 | 83.6
Dogs   | 79.4 | 77.1 | 77.7 | 78.9 | 80.1 | 80.7 | 81.2 | 80.8 | 79.9
Cats   | 73.5 | 57.0 | 69.3 | 72.3 | 72.6 | 73.3 | 73.9 | 74.4 | 74.3
Sheep  | 80.0 | 78.1 | 80.4 | 81.6 | 82.6 | 82.5 | 82.9 | 81.4 | 79.4
4.1 Image Representation
We split every image into superpixels using the SLIC [1] method from the scikit-image library (http://scikit-image.org). Each superpixel is assigned a semantic label based on the majority vote of the pixel labels inside it. For each superpixel we compute appearance-based features using the OverFeat [25] library (http://cilvr.nyu.edu/doku.php?id=software:overfeat:start). We extract a 4096-dimensional vector from the output of the 20th layer of the pretrained model referred to as the fast model in the library documentation. Additionally, we add features that describe the position of a superpixel in its image. For this we split each image into a 4x4 uniform grid and describe the position of each superpixel by 16 values, each one specifying the ratio of pixels from the superpixel falling into the corresponding grid cell. We also add a constant (bias) feature, resulting in an overall feature map, \phi:\mathcal{X}\rightarrow\mathbb{R}^{k}, with k=4113. The features within each of the three homogeneous groups (appearance, position, constant) share the same scale hyperparameter in the covariance function (9), i.e. \sigma_{i}=\sigma_{j} if the feature dimensions i and j are within the same group.
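The 16 position values for a superpixel can be computed roughly as follows (our own sketch; the boolean-mask representation of a superpixel and the grid handling are assumptions, not the paper's code):

```python
import numpy as np

def position_features(superpixel_mask, grid=4):
    """Position features of one superpixel: the fraction of its pixels
    falling into each cell of a uniform grid x grid partition of the image.

    superpixel_mask : (H, W) boolean mask marking the superpixel's pixels
    """
    H, W = superpixel_mask.shape
    rows, cols = np.nonzero(superpixel_mask)
    # map each pixel to its grid cell (clamped for safety at the border)
    cell_r = np.minimum(rows * grid // H, grid - 1)
    cell_c = np.minimum(cols * grid // W, grid - 1)
    counts = np.zeros(grid * grid)
    np.add.at(counts, cell_r * grid + cell_c, 1)
    return counts / len(rows)   # ratios sum to 1 over the 16 cells
```

The resulting vector is appended to the appearance features together with the constant bias feature to form \phi.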
4.2 Baseline approaches
We compare GPGC against two baselines. The first baseline is also a Gaussian process, but we assume that all superpixels have the same noise variance. All hyperparameters are again learned by type-II maximum likelihood. We refer to this method simply as GP. This baseline is meant to study whether a selective estimation of the confidence values indeed has a positive effect on prediction performance.
As a second baseline we use a linear support vector machine (SVM), relying on the LibLinear implementation with squared slack variables, which is known to deliver state-of-the-art optimization speed and prediction quality [7]. For training SVM models we always perform 5-fold cross-validation to determine the regularization constant C\in\{2^{-20},2^{-19},\dots,2^{1}\}.
4.3 Foreground-background Segmentation
We conduct experiments on the HDSeg and AutoSeg datasets, analyzing the potential of GPGC for two tasks: either as a dedicated method for semantic image segmentation, or as a tool for identifying reliably annotated images, which can be used afterwards, e.g., as a training set for other approaches. For all experiments we reweight training data so that foreground and background classes are balanced and all instances with the same semantic label have the same weight, but the overall weight remains unchanged, i.e. \sum_{i}w_{i}=N. This step removes the effect of different ratios of foreground and background labels for different datasets and their subsets.
The first set of experiments compares GPGC with the baselines, GP and SVM, on the task of foreground-background segmentation. Numeric results are presented in Table 1. They show that GPGC achieves the best results for all datasets and all semantic classes. According to a Wilcoxon signed-rank test, GPGC's improvement over the baselines is significant at the 10^{-3} level in all cases.
We obtain two insights from this. First, the fact that GPGC improves over GP confirms that it is indeed beneficial to learn different confidence hyperparameters for different images. Second, the results also confirm that classification using Gaussian process regression with gradient-based hyperparameter selection yields results comparable with other state-of-the-art classifiers, such as SVMs, whose regularization parameter has to be chosen by a more tedious cross-validation procedure.
In a second set of experiments we benchmark GPGC's ability to suppress images with unreliable annotation. For this, we apply GPGC to the complete training set and use the learned hyperparameter values (see Figure 2 for an illustration) to form a new dataset that consists only of the 25% of images that GPGC was most confident about. We compare this approach to SVM-based filtering similar to what has been done in the computer vision literature before [4]: we train an SVM on the original dataset and form per-image confidence values by averaging the SVM margins of the contained superpixels. Afterwards we use the same construction as above, forming a new dataset from the 25% of images with the highest confidence scores.
We benchmark how useful the resulting datasets are by using them as training sets for either a GP (with a single noise variance) or an SVM. Table 2 shows the results. Comparing them to Table 1, one sees that both methods for filtering out images with unreliable annotation improve the segmentation accuracy. However, the improvement from filtering using GPGC is higher than when using the data filtered by the SVM approach, regardless of the classifier used afterwards. This indicates that GPGC is a more reliable method for suppressing bad annotation. According to a Wilcoxon test, GPGC’s improvement over the other method is significant at the 10^{-3} level in 11 out of 12 cases (all except AutoSeg sheep for the SVM classifier).
To understand this effect in more detail, we performed another experiment: we used GPGC to create training sets of different sizes (1% to 75% of the original training sets) and trained the GP model on each of them. The results in Table 3 show that the best results are consistently obtained when using 25%–50% of the data. For example, for the largest dataset (HDSeg dog), the quality of the prediction model keeps increasing up to a training set of over 55,000 images (8 million superpixels). This shows that having many training images (even with unreliable annotations) is beneficial for the overall performance and that scalability is an important feature of our approach.
5 Summary
In this work we presented GPGC, an efficient and parameter-free method for learning from datasets with unreliable annotation, in particular for image segmentation tasks. The main idea is to use a Gaussian process to jointly learn the prediction model and confidence scores for the individual annotations in the training data. The confidence values are shared within groups of examples, e.g. all superpixels within an image, and can be obtained automatically using Bayesian reasoning and gradient-based hyperparameter tuning. As a consequence, there are no free parameters that need to be tuned.
In experiments on two large-scale image segmentation datasets, we showed that by learning individual confidence values GPGC is able to cope better with unreliable annotation than other classification methods. Furthermore, we showed that the estimated confidences allow us to filter out examples with unreliable annotation, thereby providing a way to create a cleaner dataset that can afterwards also be used by other learning methods.
By relying on an explicit feature map and a low-rank kernel, GPGC training is very efficient and easily implemented in a parallel or even distributed way. For example, training with 20 machines on the HDSeg dog segmentation dataset, which consists of over 100,000 images (16 million superpixels), takes only a few hours.
Appendix A Reduction to the oracle
We demonstrate that having the oracle from Section 3.1 is sufficient to compute the mean function (2), the marginal likelihood (6), and its gradient (7) without access to the feature matrix \mathbf{F} itself. We highlight terms that the oracle can compute by braces annotated with the number of the corresponding oracle operation.
We first apply the Sherman–Morrison–Woodbury identity and the matrix determinant lemma to the matrix \mathbf{K}_{\mathcal{E}}:
\displaystyle\mathbf{K}_{\mathcal{E}}^{-1}=(\mathcal{E}+\mathbf{F}^{\top}\Sigma\mathbf{F})^{-1}=\mathcal{E}^{-1}-\mathcal{E}^{-1}\mathbf{F}^{\top}C^{-1}\mathbf{F}\mathcal{E}^{-1},  (13)  
\displaystyle\ln|\mathbf{K}_{\mathcal{E}}|=\ln|\mathcal{E}+\mathbf{F}^{\top}\Sigma\mathbf{F}|=\ln|\mathcal{E}|+\ln|\Sigma|+\ln|C|,  (14) 
where we denote C=\Sigma^{-1}+\underbrace{\mathbf{F}\mathcal{E}^{-1}\mathbf{F}^{\top}}_{(\mathrm{iii})}.
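Both identities can be checked numerically. The sketch below is not part of the paper; it assumes diagonal \mathcal{E} and \Sigma as in the text, with arbitrary small dimensions and random data.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 50, 5                                  # N instances, D-dimensional features
F = rng.standard_normal((D, N))               # feature matrix F (D x N)
E = np.diag(rng.uniform(0.5, 2.0, N))         # diagonal noise matrix (script E)
S = np.diag(rng.uniform(0.5, 2.0, D))         # Sigma: diagonal feature scales

K = E + F.T @ S @ F                           # K_E, the N x N kernel matrix
Einv = np.diag(1.0 / np.diag(E))
C = np.linalg.inv(S) + F @ Einv @ F.T         # oracle operation (iii), only D x D

# (13): Sherman-Morrison-Woodbury, no N x N inversion needed
K_inv = Einv - Einv @ F.T @ np.linalg.inv(C) @ F @ Einv
assert np.allclose(K_inv, np.linalg.inv(K))

# (14): matrix determinant lemma, only D x D (log-)determinants of S and C
logdet = (np.linalg.slogdet(E)[1] + np.linalg.slogdet(S)[1]
          + np.linalg.slogdet(C)[1])
assert np.allclose(logdet, np.linalg.slogdet(K)[1])
```

The point of the reduction is visible here: all expensive operations involve only the D-by-D matrix C, never an explicit inverse of the N-by-N matrix \mathbf{K}_{\mathcal{E}}.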
For convenience we introduce \tilde{y}=\mathcal{E}^{-1}y. Relying on (13) we compute the following expressions:
\displaystyle y^{\top}\mathbf{K}_{\mathcal{E}}^{-1}y=y^{\top}\tilde{y}-{\underbrace{(\mathbf{F}\tilde{y})}_{(\mathrm{i})}}^{\top}C^{-1}\underbrace{(\mathbf{F}\tilde{y})}_{(\mathrm{i})},  (15)  
\displaystyle\mathbf{F}\mathbf{K}_{\mathcal{E}}^{-1}y=\underbrace{\mathbf{F}\tilde{y}}_{(\mathrm{i})}-\underbrace{(\mathbf{F}\mathcal{E}^{-1}\mathbf{F}^{\top})}_{(\mathrm{iii})}C^{-1}\underbrace{(\mathbf{F}\tilde{y})}_{(\mathrm{i})},  (16)  
\displaystyle\mathbf{F}\mathbf{K}_{\mathcal{E}}^{-1}\mathbf{F}^{\top}=\underbrace{(\mathbf{F}\mathcal{E}^{-1}\mathbf{F}^{\top})}_{(\mathrm{iii})}(\mathbf{I}-C^{-1}\underbrace{(\mathbf{F}\mathcal{E}^{-1}\mathbf{F}^{\top})}_{(\mathrm{iii})}),  (17)  
\displaystyle\alpha=\mathbf{K}_{\mathcal{E}}^{-1}y=\tilde{y}-\mathcal{E}^{-1}\overbrace{\mathbf{F}^{\top}C^{-1}\underbrace{\mathbf{F}\tilde{y}}_{(\mathrm{i})}}^{(\mathrm{ii})},  (18)  
\displaystyle\mathrm{diag}(\mathbf{K}_{\mathcal{E}}^{-1})=\mathrm{diag}(\mathcal{E}^{-1})-\underbrace{\mathrm{diag}(\mathbf{F}^{\top}C^{-1}\mathbf{F})}_{(\mathrm{iv})}\odot\mathrm{diag}(\mathcal{E}^{-2}),  (19) 
where \odot denotes element-wise (Hadamard) vector multiplication.
Using the above identities, we obtain the mean of the predictive distribution,
m(\bar{x})=\bar{\kappa}(\bar{x})^{\top}\mathbf{K}_{\mathcal{E}}^{-1}y=\phi(\bar{x})^{\top}\underbrace{\mathbf{F}\mathbf{K}_{\mathcal{E}}^{-1}y}_{(16)},  (20) 
and the marginal likelihood,
\ln p(\mathbf{y}|\bm{\theta})=-\dfrac{1}{2}\Big(\underbrace{\mathbf{y}^{\top}\mathbf{K}_{\mathcal{E}}^{-1}\mathbf{y}}_{(15)}+\underbrace{\ln|\mathbf{K}_{\mathcal{E}}|}_{(14)}+N\ln(2\pi)\Big).  (21) 
Finally, we compute the gradient of the marginal likelihood with respect to the noise variances \bm{\varepsilon},
\displaystyle\nabla_{\bm{\varepsilon}}\ln p(\mathbf{y}|\bm{\theta})=\frac{1}{2}\,\mathrm{diag}\left((\alpha\alpha^{\top}-\mathbf{K}_{\mathcal{E}}^{-1})\mathcal{E}^{\prime}\right)  (22)  
\displaystyle=\frac{1}{2}\Big(\alpha\odot\alpha-\underbrace{\mathrm{diag}(\mathbf{K}_{\mathcal{E}}^{-1})}_{(19)}\Big)\odot\mathrm{diag}(\mathcal{E}^{\prime}), 
and with respect to the feature scales \bm{\sigma},
\displaystyle\nabla_{\bm{\sigma}}\ln p(\mathbf{y}|\bm{\theta})=\frac{1}{2}\,\mathrm{diag}\left(\mathbf{F}(\alpha\alpha^{\top}-\mathbf{K}_{\mathcal{E}}^{-1})\mathbf{F}^{\top}\Sigma^{\prime}\right)  (23)  
\displaystyle=\frac{1}{2}\Big(\underbrace{(\mathbf{F}\alpha)}_{(\mathrm{i})}\odot\underbrace{(\mathbf{F}\alpha)}_{(\mathrm{i})}-\mathrm{diag}(\underbrace{\mathbf{F}\mathbf{K}_{\mathcal{E}}^{-1}\mathbf{F}^{\top}}_{(17)})\Big)\odot\mathrm{diag}(\Sigma^{\prime}). 
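As a sanity check (not part of the paper), the noise-variance gradient can be verified against finite differences of the marginal likelihood. The sketch below assumes \mathcal{E}^{\prime}=\mathbf{I}, i.e. the variances \bm{\varepsilon} are parametrized directly, and uses the standard GP gradient form \frac{1}{2}\,\mathrm{tr}\big((\alpha\alpha^{\top}-\mathbf{K}^{-1})\,\partial\mathbf{K}\big) on small random data.

```python
import numpy as np

rng = np.random.default_rng(1)
N, D = 30, 4
F = rng.standard_normal((D, N))               # feature matrix
S = np.diag(rng.uniform(0.5, 2.0, D))         # Sigma: feature scales
y = rng.standard_normal(N)
eps = rng.uniform(0.5, 2.0, N)                # per-instance noise variances

def log_marginal(eps):
    """Gaussian log marginal likelihood with K = diag(eps) + F^T Sigma F."""
    K = np.diag(eps) + F.T @ S @ F
    _, logdet = np.linalg.slogdet(K)
    return -0.5 * (y @ np.linalg.solve(K, y) + logdet + N * np.log(2 * np.pi))

# closed-form gradient, assuming direct parametrization (E' = I):
# grad_i = 0.5 * (alpha_i^2 - (K^{-1})_{ii})
K = np.diag(eps) + F.T @ S @ F
alpha = np.linalg.solve(K, y)
grad = 0.5 * (alpha**2 - np.diag(np.linalg.inv(K)))

# central finite difference in the first coordinate
h = 1e-6
e0 = np.eye(N)[0] * h
fd = (log_marginal(eps + e0) - log_marginal(eps - e0)) / (2 * h)
```

For other parametrizations of \bm{\varepsilon} (e.g. a log-space parametrization to keep the variances positive), the diagonal matrix \mathcal{E}^{\prime} simply contributes an extra chain-rule factor per coordinate.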
References
 [1] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk. SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2012.
 [2] R. H. Byrd, P. Lu, J. Nocedal, and C. Zhu. A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific Computing, 1995.
 [3] J. Chen, N. Cao, K. H. Low, R. Ouyang, C. K.-Y. Tan, and P. Jaillet. Parallel Gaussian process regression with low-rank covariance matrix approximations. In Uncertainty in Artificial Intelligence (UAI), 2013.
 [4] T. Deselaers, B. Alexe, and V. Ferrari. Weakly supervised localization and learning with generic knowledge. International Journal of Computer Vision (IJCV), 100(3):275–293, 2012.
 [5] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. In International Conference on Machine Learning (ICML), 2014.
 [6] P. Drineas and M. W. Mahoney. On the Nyström method for approximating a Gram matrix for improved kernel-based learning. Journal of Machine Learning Research (JMLR), 6:2153–2175, 2005.
 [7] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research (JMLR), 2008.
 [8] B. Frénay and M. Verleysen. Classification in the presence of label noise: a survey. IEEE Transactions on Neural Networks and Learning Systems (TNN), 25(5):845–869, 2014.
 [9] P. W. Goldberg, C. K. Williams, and C. M. Bishop. Regression with input-dependent noise: A Gaussian process treatment. In Conference on Neural Information Processing Systems (NIPS), 1997.
 [10] M. Guillaumin and V. Ferrari. Largescale knowledge transfer for object localization in ImageNet. In Computer Vision and Pattern Recognition (CVPR). IEEE, 2012.
 [11] M. Guillaumin, D. Küttel, and V. Ferrari. ImageNet auto-annotation with segmentation propagation. International Journal of Computer Vision (IJCV), 2014.
 [12] K. Kersting, C. Plagemann, P. Pfaff, and W. Burgard. Most likely heteroscedastic Gaussian process regression. In International Conference on Machine Learning (ICML), 2007.
 [13] A. Kolesnikov, M. Guillaumin, V. Ferrari, and C. H. Lampert. Closed-form approximate CRF training for scalable image segmentation. In European Conference on Computer Vision (ECCV), 2014.
 [14] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Conference on Neural Information Processing Systems (NIPS), 2012.
 [15] C. H. Lampert. Kernel methods in computer vision. Foundations and Trends in Computer Graphics and Vision, 4(3):193–285, 2009.
 [16] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision (ECCV), 2014.
 [17] K. P. Murphy. Machine learning: a probabilistic perspective. The MIT Press, 2012.
 [18] J. QuiñoneroCandela and C. E. Rasmussen. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research (JMLR), 2005.
 [19] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In Conference on Neural Information Processing Systems (NIPS), 2007.
 [20] C. E. Rasmussen and C. K. I. Williams. Gaussian processes for machine learning. The MIT Press, 2006.
 [21] V. C. Raykar, S. Yu, L. H. Zhao, G. H. Valadez, C. Florin, L. Bogoni, and L. Moy. Learning from crowds. Journal of Machine Learning Research (JMLR), 11:1297–1322, 2010.
 [22] R. Rifkin, G. Yeo, and T. Poggio. Regularized least-squares classification. In Advances in Learning Theory: Methods, Models and Applications, chapter 7. IOS Press, 2003.
 [23] C. Rother, V. Kolmogorov, and A. Blake. GrabCut: Interactive foreground extraction using iterated graph cuts. ACM Transactions on Graphics (TOG), 23(3):309–314, 2004.
 [24] F. Schroff, A. Criminisi, and A. Zisserman. Harvesting image databases from the web. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 33(4):754–766, 2011.
 [25] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. arXiv:1312.6229 [cs.CV], 2013.
 [26] A. Sorokin and D. Forsyth. Utility data annotation with Amazon Mechanical Turk. In Computer Vision and Pattern Recognition (CVPR), 2008.
 [27] H. Su, J. Deng, and L. FeiFei. Crowdsourcing annotations for visual object detection. In AAAI Workshops on Human Computation, 2012.
 [28] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. DeepFace: Closing the gap to humanlevel performance in face verification. In Computer Vision and Pattern Recognition (CVPR), 2014.
 [29] A. Vedaldi and A. Zisserman. Efficient additive kernels via explicit feature maps. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 34(3):480–492, 2012.
 [30] A. Vezhnevets and V. Ferrari. Associative embeddings for large-scale knowledge transfer with self-assessment. In Computer Vision and Pattern Recognition (CVPR), 2014.
 [31] F. Wilcoxon. Individual comparisons by ranking methods. Biometrics Bulletin, 1945.
 [32] C. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In Conference on Neural Information Processing Systems (NIPS), 2001.
 [33] K. Zhang, I. W. Tsang, and J. T. Kwok. Improved Nyström low-rank approximation and error analysis. In International Conference on Machine Learning (ICML), 2008.