GOGGLES: Automatic Image Labeling with Affinity Coding
Abstract.
Generating large labeled training data is becoming the biggest bottleneck in building and deploying supervised machine learning models. Recently, the data programming paradigm has been proposed to reduce the human cost in labeling training data. However, data programming relies on designing labeling functions which still requires significant domain expertise. Also, it is prohibitively difficult to write labeling functions for image datasets as it is hard to express domain knowledge using raw features for images (pixels).
We propose affinity coding, a new domain-agnostic paradigm for automated training data labeling. The core premise of affinity coding is that the affinity scores of instance pairs belonging to the same class on average should be higher than those of pairs belonging to different classes, according to some affinity functions. We build the GOGGLES system that implements affinity coding for labeling image datasets by designing a novel set of reusable affinity functions for images, and propose a novel hierarchical generative model for class inference using a small development set.
We compare GOGGLES with existing data programming systems on image labeling tasks from diverse domains. GOGGLES achieves labeling accuracies ranging from a minimum of to a maximum of without requiring any extensive human annotation. In terms of end-to-end performance, GOGGLES outperforms the state-of-the-art data programming system Snuba by and a state-of-the-art few-shot learning technique by , and is only away from the fully supervised upper bound.
1. Introduction
Machine learning (ML) is being increasingly used by organizations to gain insights from data and to solve a diverse set of important problems, such as fraud detection on structured tabular data, identifying product defects on images, and sentiment analysis on texts. A fundamental necessity for the success of ML algorithms is the existence of sufficient high-quality labeled training data. For example, the current ConvNet revolution would not be possible without big labeled datasets such as the 1M labeled images from ImageNet (russakovsky2015imagenet). Modern deep learning methods often need tens of thousands to millions of training examples to reach peak predictive performance (sun2017revisiting). However, for many real-world applications, large hand-labeled training datasets either do not exist or are extremely expensive to create, as manually labeling data usually requires domain experts (davis2013ctd).
Existing Work. We are not the first to recognize the need for addressing the challenges arising from the lack of sufficient training data. The ML community has made significant progress in designing different model training paradigms to cope with limited labeled examples, such as semi-supervised learning techniques (zhu2005semi), transfer learning techniques (pan2010survey) and few-shot learning techniques (DBLP:journals/corr/XianLSA17; fei2006one; DBLP:journals/corr/abs190405046; 2019ChenLKWH19). In particular, the most related learning paradigm that shares a similar setup with ours, few-shot learning, usually requires users to pre-select a source dataset or pre-trained model that is in the same domain as the target classification task to achieve best performance. In contrast, our proposal can incorporate any number of available sources of information as affinity functions.
Only recently, the data programming paradigm (ratner2016data), and the Snorkel (ratner2017snorkel) and Snuba (varma2018snuba) systems that implement it, were proposed in the data management community. Data programming focuses on reducing the human effort in training data labeling, particularly in unstructured data classification tasks (images, text). Instead of asking humans to label each instance, data programming ingests domain knowledge in the form of labeling functions (LFs). Each LF takes an unlabeled instance as input and outputs a label with better-than-random accuracy (or abstains). Based on the agreements and disagreements of labels provided by a set of LFs, Snorkel/Snuba then infer the accuracy of the different LFs as well as the final probabilistic label for every instance. The primary difference between Snorkel and Snuba is that while Snorkel requires human experts to write LFs, Snuba learns a set of LFs using a small set of labeled examples.
While data programming alleviates human effort significantly, it still requires the construction of a new set of LFs for every new labeling task. In addition, we find that it is extremely challenging to design LFs for image labeling tasks, primarily because raw pixel values are not informative enough for expressing LFs using either Snorkel or Snuba. After consulting with the data programming authors, we confirmed that Snorkel/Snuba require images to have associated metadata, which are either text annotations (e.g., medical notes associated with X-ray images) or primitives (e.g., bounding boxes for X-ray images). These associated text annotations or primitives are usually difficult to come by in practice.
Example 1.
Figure 1 shows two example labeling functions for labeling an X-ray image as either benign or malignant (varma2018snuba). As we can see, these two functions rely on the bounding box primitive for each image and use two properties (area and perimeter) of the primitive for labeling. We observe that these domain-specific primitives are difficult to obtain. Indeed, (varma2018snuba) states that, in this particular example, radiologists have pre-extracted these bounding boxes for all images.
Our Proposal. We propose affinity coding, a new domain-agnostic paradigm for automated training data labeling without requiring any domain-specific functions. The core premise of the proposed affinity coding paradigm is that the affinity scores of instance pairs belonging to the same class on average should be higher than those of instance pairs belonging to different classes, according to some affinity functions. Note that this is quite a natural assumption: if two instances belong to the same class, then by definition they should be similar to each other in some sense.
Example 2.
Figure 2 shows the affinity score distributions of a real dataset we use in our experiments (CUB) under three of the affinity functions discussed in Section 3. In this particular case, the first affinity function is able to distinguish pairs in the same class from pairs in different classes very well; the second affinity function has limited power in separating the two cases; and the third affinity function is not useful at all in separating the classes.
We build the GOGGLES system that implements the affinity coding paradigm for labeling image datasets (Figure 3). First, GOGGLES includes a novel set of affinity functions that can capture various kinds of image affinities. Given a new unlabeled dataset and the set of affinity functions, we construct an affinity matrix. Second, using a very small set of labeled examples (development set), we can assign classes to unlabeled images based on the affinity score distributions we learn from the affinity matrix. Compared with the state-of-the-art data programming systems, our affinity coding system GOGGLES has the following distinctive features.

- Data programming systems need some kind of metadata (text annotations or domain-specific primitives) associated with each image to express LFs, while GOGGLES makes no such assumptions.

- Assuming the existence of metadata, data programming still requires a new set of LFs for every new dataset. In contrast, GOGGLES is a domain-agnostic system that leverages affinity functions, which are populated once and can be reused for any new dataset.

- Both Snorkel/Snuba and GOGGLES can be seen as systems that leverage many sources of weak supervision to infer labels. Intuitively, the more weak supervision sources a system has, the better labeling accuracy it can potentially achieve. In data programming, the number of sources is the number of LFs. In contrast, affinity coding uses affinity scores between instance pairs under many affinity functions. Therefore, the number of sources GOGGLES has is essentially the number of instances multiplied by the number of affinity functions, a significantly bigger set of weak supervision sources.
Challenges. We address the following major challenges with GOGGLES:

- The success of affinity coding depends on a set of affinity functions that can capture similarities of images in the same class. However, without knowing which classes and labeling tasks we may have in the future, we do not even know what the potential distinctive features for each class are. Even if we knew the particular distinctive features, they might be spatially located in different regions of images in the same class, which makes it more difficult to design domain-agnostic affinity functions.

- Given an affinity matrix constructed using the set of affinity functions, we need to design a robust class inference module that can infer class membership for all unlabeled instances. This is challenging for multiple reasons. First, some of the affinity functions are indicative for the current labeling task, while many others are just noise, as shown in Example 2. Our class inference module needs to identify which affinity functions are useful for a given labeling task. Second, the affinity matrix is high-dimensional, with the number of dimensions equal to the number of instances multiplied by the number of affinity functions. In this high-dimensional space, the distance between any two rows of the affinity matrix becomes extremely small, making it even more challenging to infer class assignments. Third, while we can infer from the affinity matrix which instances belong to the same class by essentially performing clustering, we still need to figure out which cluster corresponds to which class, relying only on a small development set.
Contributions. We make the following contributions:

- The affinity coding paradigm. We propose affinity coding, a new domain-agnostic paradigm for the automatic generation of training data. The affinity coding paradigm consists of two main components: a set of affinity functions and a class inference algorithm. To the best of our knowledge, we are the first to propose a domain-agnostic approach for automated training data labeling.

- Designing affinity functions. GOGGLES features a novel approach that defines affinity functions based on a pre-trained VGG16 model (simonyan2014very). VGG16 is a commonly used network for representation learning. Our intuition is that different layers of the VGG16 network capture different high-level semantic concepts. Each layer may show different activation patterns depending on where a high-level concept is located in an image. We thus leverage all five max-pooling layers of the network, extracting 10 affinity functions per layer, for a total of 50 affinity functions.

- Class inference using a hierarchical generative model. GOGGLES proposes a novel hierarchical model to identify instances of the same class by maximizing the data likelihood under the generative model. The hierarchical generative model consists of two layers: the base layer consists of multiple Gaussian Mixture Models (GMMs), each modeling one affinity function; and the ensemble layer takes the predictions from each GMM and uses another generative model, based on a multivariate Bernoulli distribution, to obtain the final labels. We show that our hierarchical generative model addresses both the curse-of-dimensionality problem and the affinity function selection problem. We also give theoretical justification on the size of the development set needed to obtain a correct cluster-to-class assignment.
GOGGLES achieves labeling accuracies ranging from a minimum of to a maximum of . In terms of end-to-end performance, GOGGLES outperforms the state-of-the-art data programming system Snuba by and a state-of-the-art few-shot learning technique by , and is only away from the fully supervised upper bound. We also make our implementation of GOGGLES open-source on GitHub.
2. Preliminary
We formally state the problem of automatic training data labeling in Section 2.1. We then introduce affinity coding, a new paradigm for addressing the problem in Section 2.2.
2.1. Problem Setup
In traditional supervised classification applications, the goal is to learn a classifier based on a labeled training set $\{(x_i, y_i)\}$, where $x_i$ is a feature vector and $y_i$ is its label. The classifier is then used to make predictions on a test set.
In our setting, we only have the instances $\{x_i\}$ and no labels. Let $N$ denote the total number of unlabeled data points, and let $y_i$ denote the unknown true label for $x_i$. Our goal is to assign a probabilistic label $\tilde{y}_i = (\tilde{y}_{i,1}, \dots, \tilde{y}_{i,K})$ to every $x_i$, where $K$ is the number of classes in the labeling task and $\sum_{k=1}^{K} \tilde{y}_{i,k} = 1$.
These probabilistic labels can then be used to train downstream ML models. For example, we can generate a discrete label $\arg\max_k \tilde{y}_{i,k}$ for every instance $x_i$. Another, more principled, approach is to use the probabilistic labels directly in the loss function $\ell$, i.e., minimize the expected loss with respect to $\tilde{y}$: $\mathbb{E}_{y \sim \tilde{y}}[\ell(f(x), y)]$. It has been shown that as the amount of unlabeled data increases, the generalization error of a model trained with probabilistic labels decreases at the same asymptotic rate as that of supervised models with additional labeled data (ratner2016data).
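As a concrete illustration, both downstream options can be computed directly from the probabilistic labels. The sketch below (function names and toy values are our own, not part of GOGGLES) shows hardening the labels via argmax and taking the expected cross-entropy under the soft labels:

```python
import numpy as np

def expected_cross_entropy(pred_probs, soft_labels, eps=1e-12):
    """Expected cross-entropy of model predictions under probabilistic
    (soft) labels: E_{y ~ y_tilde}[ -log p_model(y | x) ]."""
    return float(-np.mean(np.sum(soft_labels * np.log(pred_probs + eps), axis=1)))

def harden(soft_labels):
    """Alternative: discretize each probabilistic label to its most likely class."""
    return np.argmax(soft_labels, axis=1)

# Two instances, three classes (hypothetical values).
soft = np.array([[0.7, 0.2, 0.1],
                 [0.1, 0.1, 0.8]])   # probabilistic labels from GOGGLES
pred = np.array([[0.6, 0.3, 0.1],
                 [0.2, 0.2, 0.6]])   # downstream model predictions
loss = expected_cross_entropy(pred, soft)
hard = harden(soft)                  # discrete labels: class 0 and class 2
```

The expected-loss form keeps the uncertainty of the labeler in the training signal instead of discarding it at the argmax step.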
2.2. The Affinity Coding Paradigm
We propose affinity coding, a domainagnostic paradigm for automatic labeling of training data. Figure 3 depicts an overview of GOGGLES, an implementation of the paradigm.
Step 1: Affinity Matrix Construction. An affinity function takes two instances and outputs a real value representing their similarity. Given a library of $M$ affinity functions $\{f_1, \dots, f_M\}$, a set of $N$ unlabeled instances, and a small set of labeled examples as the development set, we construct an affinity matrix $A$ that encodes the affinity scores between all pairs of instances under all affinity functions. Specifically, the $i$-th row of $A$ corresponds to instance $x_i$, and each column of $A$ corresponds to a pair of an affinity function $f_k$ and an instance $x_j$, namely, $A[i, (k-1)N + j] = f_k(x_i, x_j)$.
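A minimal sketch of Step 1, assuming affinity functions are plain Python callables (the quadratic loop is for clarity only; toy instances and functions are hypothetical):

```python
import numpy as np

def build_affinity_matrix(instances, affinity_functions):
    """Build the N x (N*M) affinity matrix: row i holds f_k(x_i, x_j)
    for every affinity function f_k and every instance x_j."""
    n, m = len(instances), len(affinity_functions)
    A = np.empty((n, n * m))
    for k, f in enumerate(affinity_functions):
        for i, xi in enumerate(instances):
            for j, xj in enumerate(instances):
                A[i, k * n + j] = f(xi, xj)
    return A

# Toy example: scalar "instances" and two hypothetical affinity functions.
xs = [0.0, 0.1, 5.0]
fs = [lambda a, b: -abs(a - b),   # negative distance as similarity
      lambda a, b: float(a * b)]  # product as similarity
A = build_affinity_matrix(xs, fs)
```

Each row of `A` is the feature vector $a_i$ used by the class inference module in Step 2.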
Step 2: Class Inference. Given $A$, we would like to infer the class membership of all unlabeled instances. For every unlabeled instance $x_i$, we associate a hidden variable $y_i$ representing its unknown class. We aim to maximize the data likelihood $P(A \mid \theta)$, where $\theta$ denotes the parameters of the generative model used to generate $A$.
Discussion. The affinity coding paradigm offers a domain-agnostic approach to training data labeling. Our assumption is that, for a new dataset, there exist one or more affinity functions in our library that can capture some kind of similarity between instances in the same class. We verify that this assumption holds on all five datasets we tested. It is particularly worth noting that, out of the five datasets, three are from completely different domains than the ImageNet dataset the VGG16 model is trained on. This suggests that our current library of affinity functions is quite comprehensive. We acknowledge that there may certainly exist new labeling tasks for which our current set of affinity functions would fail.
3. Designing Affinity Functions
Our affinity coding paradigm is based on the proposition that examples belonging to the same class should have certain similarities. For image datasets, this proposition translates to the following: images from the same class should share certain visually discriminative high-level semantic features. However, it is non-trivial to design affinity functions that capture these high-level semantic features, due to two challenges: (1) without knowing which classes and labeling tasks we may have in the future, we do not even know what those features are; and (2) even assuming we know the particular features that are useful for a given class, they might be spatially located in different regions of images in the same class.
To address these challenges, GOGGLES leverages pre-trained convolutional neural networks (the VGG16 network (simonyan2014very) in our current implementation) to transplant the data representation from the raw pixel space to a semantic space. It has been shown that intermediate layers of a trained neural network are able to encode different levels of semantic features, such as edges and corners in initial layers; and textures, objects and complex patterns in final layers (zeiler2014visualizing).
Algorithm 1 gives the overall procedure for leveraging the VGG16 network to code multiple affinity functions. Specifically, to address the issue of not knowing which high-level features might be needed in the future, we use different layers of the VGG16 network to capture different high-level features that might be useful for different future labeling tasks (Line 1). We call each such high-level feature a prototype (Line 2). As not all prototypes are actually informative features, we keep the top-$Z$ most "activated" prototypes, which we treat as informative high-level semantic features (Line 3). For every one of the informative prototypes extracted from an image, we design an affinity function that checks whether another image has a similar prototype (Line 5). Since these prototypes might be located in different regions, the affinity function is defined as the maximum similarity between the prototypes of the two images (Line 6).
We discuss prototype extraction and selection in Section 3.1, and the computation of affinity functions based on prototypes in Section 3.2.
3.1. Extracting Prototypes
In this subsection, we discuss (1) how to extract all prototypes from a given image using a particular layer of the VGG16 network; and (2) how to select the top-$Z$ most informative prototypes among all the extracted ones.
Extracting all prototypes. To begin, we pass an image $x$ through a series of layers until reaching a max-pooling layer of the CNN to obtain its output, known as a filter map. We choose max-pooling layers as they condense the preceding convolutional operations to provide compact feature representations. The filter map has dimensions $C \times H \times W$, where $C$, $H$ and $W$ are the number of channels, the height and the width of the filter map, respectively. Let $h$ and $w$ index over the height and width dimensions of the filter map. Each vector $v_{h,w} \in \mathbb{R}^{C}$ (spanning the channel axis) in the filter map can be backtracked to a rectangular patch in the input image $x$, formally known as the receptive field of $v_{h,w}$. The location of the corresponding patch of a vector $v_{h,w}$ can be determined via gradient computation. Since any change in this patch will induce a change in the vector $v_{h,w}$, we say that $v_{h,w}$ encodes the semantic concept present in the patch. Formally, the set of all prototypes we extract for $x$ is $\{v_{h,w} : 1 \le h \le H,\ 1 \le w \le W\}$.
Example 3.
Figure 4 shows the representation of an image patch in semantic space using a tiger image. The image is passed through VGG16 until a max-pooling layer to obtain the filter map of dimensions $C \times H \times W$. In this particular example, the yellow rectangular patch highlighted in the image is the receptive field of the orange prototype, which, as we can see, captures the "tiger's head" concept.
Selecting the top-$Z$ informative prototypes. In an image $x$, obviously not every patch and its corresponding prototype is a good signal. In fact, many patches in an image correspond to background noise that is uninformative for determining its class. Therefore, we need a way to intelligently select the top-$Z$ most informative semantic prototypes from all the possible ones.
In this regard, we first select the top-$Z$ channels that have the highest magnitudes of activation. Note that each channel $c$ is a matrix $F_c \in \mathbb{R}^{H \times W}$, and the activation of a channel is defined to be the maximum value of this matrix (typically known as the 2D global max-pooling operation in CNNs). We denote the indexes of these top-$Z$ channels as $c_1, \dots, c_Z$. Based on the top-$Z$ channels, we locate the most activated element of each selected channel:

(1) $(h_z, w_z) = \arg\max_{h,w} F_{c_z}[h, w], \quad z = 1, \dots, Z.$

The top-$Z$ prototypes we extract for image $x$ are $\{v_{h_z, w_z} : z = 1, \dots, Z\}$.
The pair $(h_z, w_z)$ may not be unique across the top-$Z$ channels, yielding duplicate concept prototypes. Hence, we drop the duplicates and keep only the unique prototypes.
Example 4.
We illustrate our approach for selecting the top-$Z$ prototypes with an example. Suppose we would like to select the top-2 prototypes in a layer that produces a filter map with three channels. First, we sort the three channels in descending order of their maximum activation, i.e., the maximum element of each channel's matrix. Then, we select the first $Z = 2$ channels. Next, for each of the selected channels, we identify the index of its maximum element on the $H$ and $W$ axes. Finally, we obtain the $Z = 2$ prototypes by stacking, for each identified index, the values over all channels that share that $H$ and $W$ index.
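The selection procedure in this example can be sketched with NumPy; the filter map values below are hypothetical, chosen only to make the ranking visible:

```python
import numpy as np

def select_top_z_prototypes(filter_map, z):
    """Select top-Z prototypes from a (C, H, W) filter map:
    rank channels by their max activation (2D global max pooling),
    locate each top channel's argmax (h, w), and stack the C values
    at that spatial index to form a prototype vector."""
    C, H, W = filter_map.shape
    channel_max = filter_map.reshape(C, -1).max(axis=1)
    top_channels = np.argsort(channel_max)[::-1][:z]
    prototypes, seen = [], set()
    for c in top_channels:
        h, w = np.unravel_index(np.argmax(filter_map[c]), (H, W))
        if (h, w) not in seen:          # drop duplicate (h, w) pairs
            seen.add((h, w))
            prototypes.append(filter_map[:, h, w])
    return prototypes

# Hypothetical 3-channel, 2x2 filter map.
fm = np.array([[[0.1, 0.9], [0.2, 0.3]],
               [[0.8, 0.1], [0.0, 0.4]],
               [[0.5, 0.6], [0.7, 0.2]]])
protos = select_top_z_prototypes(fm, z=2)
```

Here channel 0 (max 0.9 at position (0, 1)) and channel 1 (max 0.8 at (0, 0)) are selected, so the two prototypes are the channel-axis vectors at those positions.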
3.2. Computing Affinity
Having extracted prototypes for each image, we are ready to define affinity functions and compute affinity scores for a pair of images $(x_i, x_j)$. Affinity functions are supposed to capture various types of similarity between a pair of images. Intuitively, two images are similar if they share some high-level semantic concepts that are captured by our extracted prototypes. Based on this observation, we define multiple affinity functions, each corresponding to a particular type of semantic concept (prototype). Therefore, the number of affinity functions we can define equals the number of max-pooling layers of the VGG16 network (five) multiplied by the number of top-$Z$ prototypes extracted per layer.
Let us consider a particular prototype $p_z^{(i)}$, that is, the $z$-th most informative prototype of $x_i$ extracted from layer $l$. We define an affinity function as follows:

(2) $f_{l,z}(x_i, x_j) = \max_{h,w} \, \mathrm{sim}\big(p_z^{(i)}, v_{h,w}^{(j)}\big),$

where $v_{h,w}^{(j)}$ ranges over the vectors in the filter map of $x_j$ at layer $l$.
As we can see, we calculate the similarity between a prototype of $x_i$ and every vector contained in the filter map of $x_j$ using a similarity function $\mathrm{sim}$, and pick the highest value as the affinity score. In other words, our approach tries to find the "most similar patch" in each image with respect to a given patch corresponding to one of the top-$Z$ prototypes of image $x_i$. We use the cosine similarity metric as the similarity function, defined over two vectors $u$ and $v$ as follows:

(3) $\mathrm{sim}(u, v) = \dfrac{u^\top v}{\|u\| \, \|v\|}.$
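Equations 2 and 3 together can be sketched as follows; the toy prototype and filter map are hypothetical:

```python
import numpy as np

def cosine_sim(u, v):
    """Equation 3: cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def affinity(prototype, filter_map):
    """Equation 2: affinity of an image (given its (C, H, W) filter map)
    to a prototype, i.e. the best cosine match over all spatial
    positions -- the "most similar patch" in the image."""
    C, H, W = filter_map.shape
    vectors = filter_map.reshape(C, H * W).T   # one C-dim vector per patch
    return max(cosine_sim(prototype, v) for v in vectors)

p = np.array([1.0, 0.0])          # a hypothetical 2-channel prototype
fm = np.array([[[0.0, 2.0]],      # channel 0 of a 1x2 filter map
               [[1.0, 0.0]]])     # channel 1
score = affinity(p, fm)           # patch (0, 1) = [2, 0] aligns with p
```

Because cosine similarity is scale-invariant, the patch `[2, 0]` matches the prototype `[1, 0]` perfectly, so the affinity score is 1.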
Example 5.
Figure 5 shows an example affinity matrix for the CUB dataset we use in the experiments. It shows only three of the affinity functions, the same ones used in Example 2. The rows and columns are sorted by class, purely for visual intuition. As we can see, some affinity functions are more informative than others for this labeling task.
Discussion. We use all 5 max-pooling layers of VGG16. For each max-pooling layer, we use the top $Z = 10$ prototypes, which we empirically find to be sufficient. Note that while we choose VGG16 to define affinity functions in the current GOGGLES implementation, GOGGLES can easily be extended to use any other representation learning technique.
In summary, our approach automatically identifies semantically meaningful prototypes from the dataset, and leverages these prototypes for defining affinity functions to produce an affinity matrix.
4. Class Inference
In this section, we describe GOGGLES' class inference module: given the affinity matrix $A$ constructed over the $N$ unlabeled examples and a very small development set (e.g., 10 labeled examples) using $M$ affinity functions, we would like to assign a class label to every unlabeled example $x_i$. In other words, our goal is to predict $P(y_i \mid a_i)$, where $a_i$ denotes the feature vector of $x_i$, namely the $i$-th row in $A$, and $y_i$ is a hidden variable representing the class assignment of $x_i$.
Generative Modeling of the Inference Problem. Recall that the main assumption of affinity coding is that the affinity scores of instance pairs belonging to the same class should be different from the affinity scores of instance pairs belonging to different classes. In other words, the feature vectors of one class should look different from those of another class. This suggests a generative approach that models how $a_i$ is generated according to different classes. Generative models obtain $P(y_i \mid a_i)$ by invoking Bayes' rule:

(4) $P(y_i = k \mid a_i) = \dfrac{P(y_i = k)\, P(a_i \mid y_i = k)}{\sum_{k'=1}^{K} P(y_i = k')\, P(a_i \mid y_i = k')},$
where $\pi_k = P(y_i = k)$ is known as the prior probability, with $\sum_{k=1}^{K} \pi_k = 1$, which is the probability that a randomly chosen instance is in class $k$, and $P(y_i = k \mid a_i)$ is known as the posterior probability. To use Equation 4 for labeling, we need to learn $\pi_k$ and $P(a_i \mid y_i = k)$ for every class $k$. $P(a_i \mid y_i = k)$ is commonly assumed to follow a known distribution family parameterized by $\theta_k$, and is written as $P(a_i \mid \theta_k)$. Therefore, the entire set of parameters we need in order to compute Equation 4 is $\theta = \{\pi_k, \theta_k\}_{k=1}^{K}$. A common way to estimate $\theta$ is by maximizing the log-likelihood function:
(5) $\log P(A \mid \theta) = \sum_{x_i \notin D} \log \sum_{k=1}^{K} \pi_k\, P(a_i \mid \theta_k) \;+\; \sum_{x_i \in D} \sum_{k=1}^{K} \mathbb{1}[y_i = k] \log\big(\pi_k\, P(a_i \mid \theta_k)\big),$ where $D$ denotes the development set,
and $\mathbb{1}[\cdot]$ is the indicator function that evaluates to 1 if the condition is true and 0 otherwise. Therefore, the main questions we need to address are (i) what generative models to use, namely, the parameterized distributions $P(a_i \mid \theta_k)$; and (ii) how to maximize Equation 5.
Limitations of Existing Models. A commonly used distribution is the multivariate Gaussian, where $P(a_i \mid \theta_k) = \mathcal{N}(a_i; \mu_k, \Sigma_k)$, $\mu_k$ is the mean vector, $\Sigma_k$ is the covariance matrix, and $\mathcal{N}$ is the Gaussian PDF:

(6) $\mathcal{N}(a; \mu, \Sigma) = \dfrac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\!\left(-\tfrac{1}{2}(a - \mu)^\top \Sigma^{-1} (a - \mu)\right),$ where $d$ is the dimensionality of $a$.
This yields the popular Gaussian Mixture Model (GMM), and there are known algorithms for maximizing the likelihood function under a GMM. However, naively invoking a GMM on our affinity matrix is problematic:

- High dimensionality. The number of features in the affinity matrix is $N \times M$. In the naive GMM, the mean vectors and covariance matrices for all $K$ classes (components) have $O(K(NM)^2)$ parameters, which is much larger than the number of examples $N$. It is widely known that the eigenstructure of the estimated covariance matrix will be significantly and systematically distorted when the number of features exceeds the number of examples (Dempster1972Mar; Velasco).

- Affinity function selection. Not all affinity functions are useful, as shown in Figure 5. If the number of noisy functions is small, a GMM naturally handles feature selection, as the components will not be well separated by noisy functions but will be well separated by "good" functions. However, at such high dimensionality, there may be so many noisy features that false correlations form among them and eventually undermine the accuracy of GMMs or other generic clustering methods.
4.1. A Hierarchical Generative Model
A fundamental reason for the above two challenges when using a GMM is that the GMM needs to model correlations between all pairs of columns, which creates a huge number of parameters and makes it difficult for the GMM to determine which affinity functions are more informative. In light of this observation, we propose a hierarchical generative model that consists of a set of base models and an ensemble model, as shown in Figure 6. Each base model captures the correlations of the subset of columns in $A$ that originate from the same affinity function $f_k$; we denote this "subset" matrix as $A_k$. The output of each base model is a label prediction matrix $L_k$, whose $i$-th row stores the probabilistic class assignments of $x_i$ using affinity function $f_k$. All $M$ label prediction matrices are concatenated together to form the concatenated label prediction matrix $L$. The ensemble model takes $L$, models the correlations across all affinity functions, and produces the final probabilistic labels for each unlabeled instance.
The Base Models. Given the part $A_k$ of the affinity matrix generated by a particular affinity function $f_k$, the base model aims to predict $P(y_i \mid a_i^{(k)})$, where $a_i^{(k)}$ denotes the subset of the feature vector $a_i$ corresponding to $A_k$.

We design a base generative model for computing $P(y_i \mid a_i^{(k)})$. As discussed before, a generative model requires specifying the class generative distributions $P(a_i^{(k)} \mid \theta_c)$, parameterized by $\theta_c$. We use the popular GMM for this purpose, with an important modification. Instead of the full covariance matrix that models the correlations between all pairs of columns in $A_k$, we use a diagonal covariance matrix, which reduces the number of parameters per component significantly from $O(N^2)$ to $O(N)$. Note that this simplification is only possible under the base generative model, as each column of $A_k$ corresponds to an independent example.
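One way to realize a base model, assuming scikit-learn is available, is a `GaussianMixture` with `covariance_type="diag"`; the helper name and toy data below are ours, not part of GOGGLES:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def base_model_predictions(A_k, n_classes, seed=0):
    """Fit one base model on the N x N block of the affinity matrix
    produced by a single affinity function, using a GMM with a
    *diagonal* covariance matrix, and return the N x K label
    prediction matrix of soft class assignments."""
    gmm = GaussianMixture(n_components=n_classes,
                          covariance_type="diag",   # key simplification
                          random_state=seed)
    gmm.fit(A_k)
    return gmm.predict_proba(A_k)

# Toy block: two well-separated groups of rows (hypothetical scores).
rng = np.random.default_rng(0)
A_k = np.vstack([rng.normal(0.0, 0.1, (20, 4)),
                 rng.normal(5.0, 0.1, (20, 4))])
L_k = base_model_predictions(A_k, n_classes=2)
```

The diagonal covariance means each component is parameterized by a mean and a per-column variance only, matching the parameter reduction described above.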
The output of the base model for affinity function $f_k$ is a label prediction matrix $L_k$, where $L_k[i, c] = P(y_i = c \mid a_i^{(k)})$, namely, the probability that affinity function $f_k$ believes example $x_i$ is in class $c$.
The Ensemble Model. We concatenate the label prediction matrices from all base models to obtain the concatenated label prediction matrix $L$. Let $l_i$ denote the new feature vector given by the $i$-th row in $L$. The goal of the ensemble model is to predict $P(y_i \mid l_i)$.
We again design a generative model for performing the final prediction. As before, we need to decide on a class generative distribution $P(l_i \mid \theta_c)$ parameterized by $\theta_c$. The Gaussian distribution used for the base models is not appropriate for the ensemble model. This is because the values in the concatenated label prediction matrix are very close to either 0 or 1. Indeed, in an ideal scenario where all base models work perfectly, all values in $L$ would be exactly 0 or 1, corresponding to the ground truth. This kind of discrete, or close to discrete, data is problematic for a Gaussian distribution, which is designed for continuous data. Fitting a Gaussian distribution to such data typically incurs the singularity problem and yields poor predictions (bishop2006pattern). In light of this observation, we convert $L$ to a one-hot encoded matrix by setting, for each instance and each affinity function, the highest class prediction to 1 and the remaining predictions to 0, and we use a discrete distribution to model $P(l_i \mid \theta_c)$.
After converting $L$ into a truly discrete matrix, the multivariate Bernoulli distribution is a natural fit for modeling $P(l_i \mid \theta_c)$, parameterized by $\mu_c = (\mu_{c,1}, \dots, \mu_{c,D})$:

(7) $P(l_i \mid \theta_c) = \prod_{d=1}^{D} \mu_{c,d}^{\,l_{i,d}} (1 - \mu_{c,d})^{1 - l_{i,d}},$

where $D = K \times M$ is the dimension of the binary vector $l_i$. The output of the ensemble model is the final label prediction matrix, whose entries are $P(y_i = c \mid l_i)$, namely, the probability that the ensemble model believes example $x_i$ is in class $c$.
The Hierarchical Model Addresses the Two Challenges. First, the total number of parameters in the hierarchical model grows linearly with $N$ and $M$, much smaller than the quadratic number of parameters in the naive GMM, effectively addressing the high-dimensionality problem. Second, by consolidating the affinity scores produced by each affinity function into a binary $L$, the ensemble model only needs to model the accuracy of the affinity functions rather than the original features, and thus can better distinguish the good affinity functions from the bad ones.
4.2. Parameter Learning
We need to learn the parameters of the base models and the ensemble model under their respective generative distributions. The expectation-maximization (EM) algorithm is the canonical algorithm for maximizing a log-likelihood function in the presence of hidden variables (dempster1977maximum). We first show the EM algorithm for maximizing the general data log-likelihood function in Equation 5, and then discuss how it needs to be modified to learn the base models and the ensemble model.

EM for Maximizing Equation 5. Each iteration of the EM algorithm consists of two steps: an Expectation (E) step and a Maximization (M) step. Intuitively, the E-step determines the (soft) class assignment of every instance based on the parameter estimates $\theta^{(t-1)}$ from the last iteration. In other words, the E-step computes the posterior probabilities. The M-step takes the new class assignments and re-estimates all parameters by maximizing Equation 5. More precisely, the M-step maximizes the expected value of Equation 5, since the E-step produces soft assignments.

E-step. Given the parameter estimates $\theta^{(t-1)}$ from the previous iteration, compute the posterior probabilities:

(8) $\gamma_{i,k} = P(y_i = k \mid a_i, \theta^{(t-1)}) = \dfrac{\pi_k^{(t-1)}\, P(a_i \mid \theta_k^{(t-1)})}{\sum_{k'=1}^{K} \pi_{k'}^{(t-1)}\, P(a_i \mid \theta_{k'}^{(t-1)})}.$

M-step. Given the new class assignments defined by $\gamma_{i,k}$, re-estimate $\theta$ by maximizing the following expected log-likelihood function:

(9) $\theta^{(t)} = \arg\max_{\theta} \sum_{i} \sum_{k=1}^{K} \gamma_{i,k} \log\big(\pi_k\, P(a_i \mid \theta_k)\big).$
EM for Maximizing the Base Model. For each base model associated with affinity function $f_k$, $P(a_i \mid \theta_c)$ in the EM algorithm is replaced with $P(a_i^{(k)} \mid \theta_c)$, which is a multivariate Gaussian distribution as shown in Equation 6, but with a diagonal covariance matrix $\mathrm{diag}(\sigma_{c,1}^2, \dots, \sigma_{c,N}^2)$. The entire set of parameters is $\{\pi_c, \mu_c, \sigma_c^2\}_{c=1}^{K}$, which we update in each M-step as follows:

(10) $\pi_c = \dfrac{\sum_i \gamma_{i,c}}{N}, \qquad \mu_c = \dfrac{\sum_i \gamma_{i,c}\, a_i^{(k)}}{\sum_i \gamma_{i,c}}, \qquad \sigma_{c,d}^2 = \dfrac{\sum_i \gamma_{i,c}\, \big(a_{i,d}^{(k)} - \mu_{c,d}\big)^2}{\sum_i \gamma_{i,c}}.$
EM for Maximizing the Ensemble Model. For the ensemble model, $P(a_i \mid \theta_c)$ in the EM algorithm is replaced with $P(l_i \mid \theta_c)$, which is a multivariate Bernoulli distribution, as shown in Equation 7. The entire set of parameters is $\{\pi_c, \mu_c\}_{c=1}^{K}$, which we update in each M-step as follows:

(11) $\pi_c = \dfrac{\sum_i \gamma_{i,c}}{N}, \qquad \mu_{c,d} = \dfrac{\sum_i \gamma_{i,c}\, l_{i,d}}{\sum_i \gamma_{i,c}}.$
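The ensemble model's EM loop (Equations 8, 9, and 11 specialized to the multivariate Bernoulli) can be sketched in NumPy; the toy matrix `L` below is a hypothetical one-hot concatenation from two affinity functions over two classes:

```python
import numpy as np

def bernoulli_mixture_em(L, n_classes, n_iter=50, seed=0, eps=1e-9):
    """EM for the ensemble model: fit a mixture of multivariate
    Bernoulli distributions to the one-hot concatenated label
    prediction matrix L (N x D binary); returns soft labels (N x K)."""
    rng = np.random.default_rng(seed)
    N, D = L.shape
    pi = np.full(n_classes, 1.0 / n_classes)       # priors pi_c
    mu = rng.uniform(0.25, 0.75, (n_classes, D))   # Bernoulli means mu_{c,d}
    for _ in range(n_iter):
        # E-step (Eq. 8): posterior responsibilities via log-probabilities
        log_p = (L @ np.log(mu + eps).T
                 + (1 - L) @ np.log(1 - mu + eps).T
                 + np.log(pi + eps))
        log_p -= log_p.max(axis=1, keepdims=True)
        gamma = np.exp(log_p)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step (Eq. 11): update priors and per-dimension Bernoulli means
        Nk = gamma.sum(axis=0)
        pi = Nk / N
        mu = (gamma.T @ L) / Nk[:, None]
    return gamma

L = np.array([[1, 0, 1, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 1, 1, 0],
              [0, 1, 0, 1]], dtype=float)
post = bernoulli_mixture_em(L, n_classes=2)
labels = post.argmax(axis=1)
```

Working in log-space in the E-step avoids underflow when the number of dimensions $D = K \times M$ is large.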
4.3. Exploiting Development Set
Consider a scenario without any labeled development set: in this case, the hierarchical model essentially clusters all unlabeled examples into $K$ clusters without knowing which cluster corresponds to which class. Following the data programming system (varma2018snuba), we assume access to a small development set that is typically too small to train any machine learning model, but is powerful enough to determine the correct "cluster-to-class" assignment. Note that the theory developed here can also be used to provide theoretical guarantees on the mapping feasibility in the "cluster-then-label" category of semi-supervised learning approaches (zhu2005semi; peikari2018cluster).
Let $D_k$ denote the set of labeled examples for class $k$. To make our analysis easier, we assume the size of $D_k$ is the same, $m$, for all classes. Intuitively, we want to map cluster $j$ to class $k$ if most examples from $D_k$ are in cluster $j$. However, this simple cluster-to-class mapping strategy may create conflicting assignments, namely, the same cluster being mapped to multiple classes. We propose a more principled way to obtain a one-to-one mapping $g$, where $g(k)$ denotes the cluster assigned to class $k$. We first define the “goodness” $G(g)$ of a mapping as:
(12) $G(g) = \sum_{k=1}^{K} \sum_{x_i \in D_k} L_{i,\, g(k)}$
To see why $G(g)$ can represent the “goodness” of a mapping, we represent the development set with a one-hot encoded ground truth matrix $Y$, where each element is obtained by:
(13) $Y_{ik} = 1$ if $x_i \in D_k$, and $Y_{ik} = 0$ otherwise
$G(g)$ is essentially the summation of the element-wise multiplication between the ground truth matrix $Y$ and the label prediction matrix $L$, with columns rearranged by the mapping $g$, on the development set. Therefore, $G(g)$ is expected to be maximized when a mapping makes the two matrices the most similar under cosine distance. Given $G$, the final mapping $g^*$ is obtained by:
(14) $g^* = \operatorname{argmax}_{g \in \Pi} G(g)$, where $\Pi$ is the set of all one-to-one mappings from classes to clusters
In other words, the final mapping is a one-to-one mapping that maximizes $G$. When $K = 2$, Equation 14 reduces to a choice between the identity mapping and the swapped mapping:
(15) $g^* = \operatorname{argmax}_{g \in \{g_{\mathrm{id}},\, g_{\mathrm{swap}}\}} G(g)$
Algorithm for Solving Equation 14. Instead of enumerating all possible mappings with a complexity of $O(K!)$ (which is actually feasible for a small $K$), we convert it to the assignment problem, which can be solved in $O(K^3)$. Let $A_{kj}$ denote $\sum_{x_i \in D_k} L_{ij}$; then Equation 12 becomes:
(16) $G(g) = \sum_{k=1}^{K} A_{k,\, g(k)}$
Finding a $g$ that maximizes Equation 16 is exactly the assignment problem, and there are known algorithms (jonker1987shortest) that solve it with a worst-case time complexity of $O(K^3)$.
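A minimal sketch of this reduction, assuming a development-set count matrix where entry $(k, j)$ holds the number of class-$k$ development examples that fall in cluster $j$; SciPy's `linear_sum_assignment` (a Jonker-Volgenant-style solver) maximizes the total score over one-to-one mappings:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# A[k, j]: number of dev examples of class k landing in cluster j.
A = np.array([
    [4, 1, 0],   # class 0: majority in cluster 0
    [0, 1, 4],   # class 1: majority in cluster 2
    [1, 4, 0],   # class 2: majority in cluster 1
])

# Maximize sum_k A[k, g(k)] over one-to-one mappings g.
rows, cols = linear_sum_assignment(A, maximize=True)
mapping = dict(zip(rows, cols))  # class k -> cluster g(k)

assert mapping == {0: 0, 1: 2, 2: 1}
assert A[rows, cols].sum() == 12
```

Unlike the per-class majority vote, the assignment solver cannot produce conflicting assignments, since it optimizes over one-to-one mappings only.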
This “cluster-to-class” mapping is performed after obtaining the base model predictions and the ensemble model predictions. After the mapping is obtained, we rearrange the columns in the label prediction matrix produced by each base model, and in the final label matrix produced by the ensemble, according to the mapping $g^*$, so that the true classes are aligned with the clusters.
4.4. The Size of Development Set Needed
In this section, we give an analysis of the size of the development set needed for GOGGLES to produce the correct “cluster-to-class” mapping, where the correct mapping is defined as the mapping that achieves the highest labeling accuracy, which we denote as $\alpha$. Intuitively, the higher $\alpha$ is, the smaller the development set we need. Consider an extreme scenario with two classes where our hierarchical generative model produces two clusters that perfectly separate the two classes. In this case, we only need one labeled example to determine which cluster corresponds to which class with full confidence. Figure 7 shows, based on the theory discussed in the following, the development set size required at different accuracy levels to produce the correct cluster-to-class mapping with a probability close to 1. However, as we will show in the experiment section, the development set size required in practice is actually much smaller. This is because the theoretical lower bound we provide is a rather loose one, for ease of derivation.
A Theory on the Size of the Development Set. Let us first assume the mapping of each class is independent, so the probability of a completely correct mapping is $\prod_{k=1}^{K} P_k$, where $P_k$ denotes the probability that class $k$ is correctly mapped to its corresponding cluster.
To simplify the derivation, we further assume “hard” assignment of class labels: an example is only assigned to one cluster; in other words, $L$ contains only 0s and 1s. This is a natural assumption because the values in $L$ will be converted to binary anyway when evaluating the accuracy of the algorithm. In the development set, we have a labeled set $D_k$ with a size of $m$ for every class $k$. Let $n_j$ denote the number of examples of $D_k$ that are in the $j$-th cluster, so $\sum_{j=1}^{K} n_j = m$. Under the independence assumption, Equation 14 becomes
(17) $g^*(k) = \operatorname{argmax}_{j} \sum_{x_i \in D_k} L_{ij} = \operatorname{argmax}_{j}\, n_j$
where $g^{-1}$ denotes the inverse mapping of $g$, that is, $g^{-1}(g(k)) = k$. Equation 17 means that each class is mapped to the cluster in which its majority lies, so class $k$ is mapped to its correct cluster only when the majority of $D_k$ are in that cluster. Assume the $j_0$-th cluster is the correct cluster for class $k$; then the probability of the class being mapped correctly is:
(18) $P_k \geq P\left(n_{j_0} > n_j,\ \forall j \neq j_0\right)$
The $\geq$ sign holds because the right-hand side does not count situations with ties in the majority vote, in which we break ties randomly and a correct mapping is still possible. The probability of a completely correct mapping is then lower bounded by:
(19) $\prod_{k=1}^{K} P_k \geq p^{K}$, where $p$ denotes the right-hand side of Equation 18
Suppose the accuracy $\alpha$ of our algorithm is known, so the probability of an example being predicted to be its true label equals $\alpha$. An example in the development set is predicted to be its true label by the algorithm only when it is in the correct cluster, so the probability of it being in the correct cluster equals $\alpha$. In the case of an incorrect assignment, we assume the probability of being assigned to each possible incorrect class is equal, namely $\frac{1-\alpha}{K-1}$. Then $(n_1, \ldots, n_K)$ follows a multinomial distribution:
(20) $(n_1, \ldots, n_K) \sim \operatorname{Multinomial}(m;\, q)$, where $q_{j_0} = \alpha$ and $q_j = \frac{1-\alpha}{K-1}$ for $j \neq j_0$
The correct mapping under the independence assumption requires the mapping of every class to be correct on its own. This is a rather strict requirement. Without assuming independence, Equation 14 is able to produce a completely correct mapping even when some classes would otherwise fail to be mapped correctly on their own. In other words, the probability of a completely correct mapping satisfies:
(21) $P(g^* \text{ is correct}) \geq \prod_{k=1}^{K} P_k$
Combining Equation 21 and Equation 19, we get the following theorem.
Theorem 1.
The probability that Equation 14 gives the optimal mapping is lower bounded by $p^K$, where $p$ is obtained by Equation 18.
Therefore, the size of the development set that produces an optimal mapping with a probability of at least $\delta$ is given by $m^*$ examples per class, where $m^*$ is the smallest value of $m$ that makes $p^K \geq \delta$.
The time complexity of solving for the right-hand side of Equation 18 by a brute-force iteration over all combinations of $(n_1, \ldots, n_K)$ is $O(m^K)$, but it can be solved in polynomial time using a dynamic programming based approach.
For ease of notation, we assume the 1st cluster is the correct cluster for class $k$. Conditioning on $n_1 = c$, the remaining $m - c$ examples fall uniformly into the other $K - 1$ clusters, and the mapping for class $k$ is correct when every one of those clusters receives fewer than $c$ examples. Let $W_j(t)$ denote the factorial-weighted mass of placing $t$ examples into $j$ of the incorrect clusters with every count below $c$, so that:
(22) $p = \sum_{c=0}^{m} \binom{m}{c} \alpha^{c} (1-\alpha)^{m-c} \cdot (m-c)!\, (K-1)^{-(m-c)}\, W_{K-1}(m-c)$
so $W_1(t) = \frac{1}{t!}$ if $t < c$ and $0$ otherwise, and for each $j > 1$:
(23) $W_j(t) = \sum_{v=0}^{\min(t,\, c-1)} \frac{W_{j-1}(t-v)}{v!}$
The time complexity of obtaining $p$ by dynamic programming using Equation 23 is polynomial: each of the $O(Km)$ table entries takes $O(m)$ time for every value of $c$, giving $O(Km^3)$ overall.
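The quantity in Equation 18 can also be computed exactly by brute force for small $m$ and $K$, which is useful for sanity-checking the bound. The sketch below enumerates all multinomial outcomes and sums the probability mass of those with a strict majority in the correct (here, the 1st) cluster; the function name and interface are ours, not the paper's.

```python
import itertools
from math import factorial, prod

def p_correct(m, K, alpha):
    """P(n_1 > n_j for all j != 1), where (n_1, ..., n_K) ~ Multinomial(m; q)
    with q_1 = alpha and q_j = (1 - alpha) / (K - 1) otherwise (brute force)."""
    q = [alpha] + [(1 - alpha) / (K - 1)] * (K - 1)
    total = 0.0
    # Enumerate all compositions of m into K nonnegative parts (stars and bars).
    for cuts in itertools.combinations(range(m + K - 1), K - 1):
        ns, prev = [], -1
        for c in cuts:
            ns.append(c - prev - 1)
            prev = c
        ns.append(m + K - 2 - prev)
        if ns[0] > max(ns[1:]):  # strict majority in the correct cluster
            pmf = factorial(m) * prod(qi ** ni / factorial(ni)
                                      for qi, ni in zip(q, ns))
            total += pmf
    return total

# With perfect accuracy, one example per class already suffices.
assert abs(p_correct(1, 2, 1.0) - 1.0) < 1e-12
# With m = 5 and alpha = 0.9, a correct per-class mapping is near-certain;
# Theorem 1 then bounds the fully correct mapping by p ** K.
assert 0.99 < p_correct(5, 2, 0.9) <= 1.0
```

The dynamic program above computes the same value without enumerating all compositions.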
5. Experiments
We conduct extensive experiments to evaluate the accuracy of labels generated by GOGGLES. Specifically, we focus on the following three dimensions:

Feasibility and Performance of GOGGLES (Section 5.2). Is it possible to automatically label image datasets using a domain-agnostic approach? How does GOGGLES compare with existing data programming systems?

Ablation Study (Section 5.3). How do the two primary innovations in GOGGLES (namely, affinity matrix construction and class inference) compare against other techniques?

Sensitivity Analysis (Section 5.4). Is GOGGLES sensitive to the set of affinity functions? What size of development set does GOGGLES need to correctly determine the “cluster-to-class” mapping?
5.1. Setup
Datasets
We consider real-world image datasets from varying domains to evaluate the versatility and robustness of GOGGLES. Since our approach internally uses a pre-trained VGG16 model for defining affinity functions, we select datasets that have minimal or no overlap with classes of images from the ImageNet dataset (russakovsky2015imagenet), on which the VGG16 model was originally trained. Robust performance across these datasets shows that GOGGLES is domain-agnostic with respect to the underlying pre-trained model. We perform our experiments on the following datasets, which are roughly ordered by domain overlap with ImageNet:

CUB: The Caltech-UCSD Birds-200-2011 dataset (wah2011caltech) comprises 11,788 images of 200 bird species. The dataset also provides 312 binary image-level attribute annotations that describe the visual characteristics of the bird in an image, e.g., white head, grey wing, etc. We use this metadata for designing binary labeling functions, which are used by a data programming system. To evaluate the task of generating binary labels, we randomly sample 10 class-pairs from the 200 classes in the dataset and report the average performance across these 10 pairs for each experiment. These sampled class-pairs are not present in the ImageNet dataset. However, since ImageNet and CUB contain common images of other bird species, this dataset may have a higher degree of domain overlap with the images that VGG16 was trained on.

GTSRB: The German Traffic Sign Recognition Benchmark dataset (Stallkamp2012) contains 51,839 images for 43 classes of traffic signs. Again, for testing the performance of binary label generation, we sample 10 random class-pairs from the dataset and use them for all the experiments. Although this dataset contains images labeled by specific traffic signs, ImageNet contains a generic “street sign” class, and hence this dataset may also have some degree of domain overlap.

Surface: The surface finish dataset (louhichi2019automated) contains 1,280 images of industrial metallic parts that are classified as having a “good” (smooth) or “bad” (rough) metallic surface finish. This is a more challenging dataset since the metallic components look very similar to the untrained eye, and it has a minimal degree of domain overlap with ImageNet.

TB-Xray: The Shenzhen Hospital X-ray set (jaeger2013automatic) has 662 images belonging to 2 classes: normal lung X-rays and abnormal X-rays showing various manifestations of tuberculosis. These images are from the medical imaging domain and have absolutely no domain overlap with ImageNet.

PN-Xray: The pneumonia chest X-ray dataset (kermany2018identifying) consists of 5,856 chest X-ray images classified by trained radiologists as normal or as showing different types of pneumonia. These images are also from the medical imaging domain and have no domain overlap with ImageNet.
Development Set. GOGGLES uses a small development set to determine the optimal class mapping for a given label assignment, the same assumption made in Snuba (varma2018snuba). By default, we use only 5 label annotations arbitrarily chosen from each class. Hence, for the task of generating binary labels, we use a development set of 10 images for all the experiments. We report the performance of GOGGLES on the remaining images from each dataset.
Data Programming Systems
We compare GOGGLES with existing systems: Snorkel (ratner2017snorkel) and Snuba (varma2018snuba).
Snorkel is the first system that implements the data programming paradigm (ratner2016data). Snorkel requires humans to design several labeling functions that output a noisy label (or abstain) for each instance in the dataset. Snorkel then models the high-level interdependencies between the possibly conflicting labeling functions to produce probabilistic labels, which are then used to train an end model. For image datasets, these labeling functions typically work on metadata or extraneous annotations rather than image-based features, since it is very hard to hand-design functions based on such features.
Since CUB is the only dataset with such metadata available, we report the mean performance of Snorkel on the 10 class-pairs sampled from the dataset, using the attribute annotations as labeling functions. More specifically, we combine CUB's image-level attribute annotations (which describe visual characteristics present in an image, such as white head, grey wing, etc.) with the class-level attribute information provided (e.g., class A has white head, class B has grey wing, etc.) in order to design labeling functions. Hence, each attribute annotation in the union of the class-specific attributes acts as a labeling function that outputs a binary label corresponding to the class that the attribute belongs to. If an attribute belongs to both classes of a class-pair, the labeling function abstains. We used the open-source implementation provided by the authors with our labeling functions to generate the probabilistic labels for the CUB dataset.
Snuba extends Snorkel by further reducing the human effort in writing labeling functions. However, Snuba requires users to provide per-instance primitives for a dataset (cf. Example 1), and the system automatically generates a set of labeling functions using a small labeled development set.
Since none of the datasets comes with user-provided primitives, to ensure a fair comparison with Snuba, we consulted with Snuba's authors multiple times. They suggested that we use a rich feature representation extracted from the images as primitives, which would allow Snuba to learn labeling functions. As such, we use the logits layer of the pre-trained VGG16 model for this purpose, as it is well documented in the computer vision literature that such feature representations encode meaningful higher-order semantics for images (donahue2014decaf; oquab2014learning). For the VGG16 model trained on ImageNet, this yields feature vectors with 1,000 dimensions for each image. To obtain densely rich primitives that are more tractable for Snuba, we project the logits output onto the feature space of the top-10 principal components of the entire dataset, determined using principal component analysis (wold1987principal). We use these projected 10-dimensional features as primitives for Snuba. Empirical testing revealed that providing more components does not change the results significantly. We also use the same development set size for Snuba and GOGGLES. We used the open-source implementation provided by the authors for learning labeling functions with the automatically extracted primitives and for generating the final probabilistic labels.
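The primitive-extraction step can be sketched as follows; the random matrix is a stand-in for the 1000-dimensional VGG16 logits computed per image in the actual experiments.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
logits = rng.normal(size=(200, 1000))  # stand-in for per-image VGG16 logits

# Project onto the top-10 principal components of the whole dataset
# to obtain dense, low-dimensional primitives for Snuba.
pca = PCA(n_components=10)
primitives = pca.fit_transform(logits)

assert primitives.shape == (200, 10)
```

Each row of `primitives` then serves as the per-instance primitive vector for one image.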
Few-shot Learning (FSL)
Our affinity coding setup, which uses 5 development set labels from each class, is comparable to the 2-way 5-shot setup for few-shot learning in the computer vision domain. Hence, we compare GOGGLES's end-to-end performance with a recent FSL approach (2019ChenLKWH19) that achieves state-of-the-art performance on domain adaptation. We use the same development set used by GOGGLES as the few-shot labeled examples for training the FSL model.
The original FSL Baseline implementation uses a model trained on miniImageNet for domain adaptation to the CUB dataset, and achieves better performance than other state-of-the-art FSL methods. For a more comparable analysis, we use a VGG16 model trained on ImageNet, which is the same pre-trained model GOGGLES uses for affinity coding. Note that our adaptation of the FSL Baseline method achieved much better performance for domain adaptation on CUB than the original results reported in (2019ChenLKWH19). The FSL models as well as all end models are trained with the Adam optimizer, using the same learning rate as in (2019ChenLKWH19).
Empirical upper bound (supervised approach).
We also compare GOGGLES' performance with an empirical upper bound, obtained via a typical supervised transfer learning approach for image classification. Specifically, we freeze the convolutional layers of the VGG16 model and only update the weights of the fully connected layers during training. We also modify the last fully connected “logits” layer of the architecture to match our number of classes.
Ablation Study: Other image representation techniques for computing affinity
GOGGLES computes affinity scores by extracting prototype representations from intermediate layers of a pre-trained model. We compare the efficacy of this representation technique with two other typical methods of image representation used in the computer vision domain. We compare the predictive capacity of each representation technique by constructing an affinity matrix from each candidate feature representation using pairwise cosine similarity, and then using our class inference approach for labeling.
HOG. We compare with the histogram of oriented gradients (HOG) descriptor, a very popular feature representation technique for recognition tasks in the classical computer vision literature (weinland2010making; yang2012recognizing). The HOG descriptor (dalal2005histograms) represents an image by counting the occurrences of gradient orientations in localized portions of the image.
Logits. We also compare with a modern deep learning-based approach, leveraged by recent works in computer vision (sharif2014cnn; akilan2017late), that uses an intermediate output from a convolutional neural network as an image's feature representation. We use the logits layer from the trained VGG16 model in our comparison, which is the output of the last fully connected layer before it is fed to the softmax operation.
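For both baselines, the affinity matrix is built the same way; a sketch with stand-in feature vectors (HOG descriptors or VGG16 logits in the actual comparison):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
features = rng.normal(size=(6, 128))  # one row per image (e.g., HOG or logits)

# Pairwise cosine similarity yields one affinity matrix per representation,
# which is then fed to the class inference module.
affinity = cosine_similarity(features)

assert affinity.shape == (6, 6)
assert np.allclose(np.diag(affinity), 1.0)  # self-similarity is 1
assert np.allclose(affinity, affinity.T)    # symmetric
```

The same inference module is run on each baseline affinity matrix, isolating the effect of the representation.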
Ablation Study: Baseline methods for class inference
The class inference method in GOGGLES consists of a clustering step followed by class mapping. We compare our proposed hierarchical model for clustering with other baseline methods, including K-means clustering, Gaussian mixture modeling with expectation maximization (GMM), and spectral co-clustering (Spectral). Since these clustering methods are unaware of the structural semantics of our affinity-based features, which are derived from multiple affinity functions, we simply concatenate the affinity matrices of all affinity functions to create the feature set for each dataset, and then feed these features to the baseline methods. As we would like to see the absolute best performance of the baseline clustering approaches, we use the optimal “cluster-to-class” mapping for all baselines.
Evaluation Metrics.
We use the train/test split as originally defined in each dataset. We report the labeling accuracy on the training set when comparing the data labeling systems Snorkel, Snuba, and GOGGLES. We follow the approach used in (ratner2017snorkel; varma2018snuba) to train an end discriminative model, using the probabilistic labels generated by each data labeling system as training data, and report the end-to-end accuracy as the end model's performance on the held-out test set. For labeling tasks, all experiments, including baselines, are conducted 10 times, and we report the average.
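The labeling-accuracy metric reduces to taking the argmax of each probabilistic label and comparing against ground truth; a minimal sketch with made-up values:

```python
import numpy as np

# Probabilistic labels from a labeling system (each row sums to 1).
prob_labels = np.array([
    [0.9, 0.1],
    [0.2, 0.8],
    [0.6, 0.4],
    [0.3, 0.7],
])
y_true = np.array([0, 1, 1, 1])

# Convert soft labels to hard labels, then measure agreement.
hard_labels = prob_labels.argmax(axis=1)          # -> [0, 1, 0, 1]
labeling_accuracy = (hard_labels == y_true).mean()

assert labeling_accuracy == 0.75
```

For end-to-end accuracy, the soft labels themselves (not the argmax) are used as training targets for the end model.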
Table 1. Labeling accuracy (%) on the training set: GOGGLES versus data programming systems (Snorkel, Snuba), representation baselines (HOG, Logits), and class inference baselines (K-Means, GMM, Spectral).

Dataset  | GOGGLES | Snorkel | Snuba | HOG   | Logits | K-Means | GMM   | Spectral
CUB      | 97.83   | 89.17   | 58.83 | 62.93 | 96.35  | 98.67   | 97.62 | 72.08
GTSRB    | 70.51   | n/a     | 62.74 | 75.48 | 64.77  | 70.74   | 69.64 | 62.40
Surface  | 89.18   | n/a     | 57.86 | 85.82 | 54.08  | 69.08   | 69.14 | 60.82
TB-Xray  | 76.89   | n/a     | 59.47 | 69.13 | 67.16  | 76.33   | 76.70 | 75.00
PN-Xray  | 74.39   | n/a     | 55.50 | 53.11 | 71.18  | 50.66   | 68.66 | 75.90
Average  | 81.76   | n/a     | 58.88 | 69.30 | 70.71  | 73.09   | 76.35 | 69.24
5.2. Feasibility and Performance
Table 1 shows the labeling accuracy of GOGGLES and of the data programming systems Snorkel and Snuba. (1) GOGGLES achieves labeling accuracies ranging from a minimum of 70.51% (GTSRB) to a maximum of 97.83% (CUB), an average improvement of 23% over the state-of-the-art data programming system Snuba. (2) To ensure a fair comparison, we consulted with the authors of Snuba and took their suggested approach of automatically extracting the required primitives. As we can see, Snuba's performance on all datasets is only slightly better than random guessing. This is primarily because Snuba is really designed to operate on human-annotated primitives (cf. Example 1). Furthermore, Snuba's performance degrades if the development set is not sufficiently large. Our experiments showed that if we increase the development set size for Snuba from 10 to 100 (a 10x increase) for the PN-Xray dataset, its performance jumps considerably; in comparison, GOGGLES still performs better with a development set of only 10 images. (3) We can only use Snorkel on CUB, as CUB is the only dataset that comes with annotations that we can leverage as labeling functions. These labeling functions may be considered perfect in terms of coverage and accuracy since they are essentially human annotations. GOGGLES uses minimal human supervision and still outperforms Snorkel on CUB.
5.3. Ablation Study
We conduct a series of experiments to understand the goodness of different components in GOGGLES, including the proposed affinity functions and the proposed class inference method. Results are shown in Table 1.
Goodness of Proposed Affinity Functions. We compare GOGGLES' affinity functions with two common methods of measuring the distance between two images: HOG and Logits. We use the two baseline methods to generate affinity matrices and run GOGGLES' inference module on them. GOGGLES' affinity functions outperform the other two on almost all datasets. This is because GOGGLES' affinity functions cover features at different scales and locations.
Goodness of Proposed Class Inference. We compare GOGGLES' inference module with three representative clustering methods: K-means, GMM, and spectral co-clustering. All methods use the GOGGLES affinity matrix as input data. Note that the three clustering methods are not able to map clusters to classes automatically; as we would like to see the absolute best performance of the baseline clustering approaches, we use the optimal “cluster-to-class” mapping for all baselines. GOGGLES' inference module has the best average performance. The primary reason for our improvement over generic clustering methods is that our generative model adapts to the design of our affinity matrix. Specifically, our generative model is better at (1) handling the high dimensionality by using the hierarchical structure and reducing the number of parameters in the base model via diagonal covariance matrices; and (2) selecting affinity functions through the ensemble model (cf. Section 4.1).
In terms of running time, without parallelization, our generative model is slower than the GMM model (the best baseline method) by a factor equal to the number of base models. However, in practice (and in our experiments), we can run all of the base models in parallel on different slices of the affinity matrix.
5.4. Sensitivity Analysis
Varying Size of the Development Set. We vary the size of the development set from 0 to 40 to understand how it affects performance (Figure 8). As the development set size increases, the accuracy increases initially and finally converges. This is expected: when the development set is small, the mapping obtained by Equation 14 has a low probability of being optimal, as predicted in Figure 7; when the development set is large enough, the mapping given by Equation 14 converges to the optimal mapping, so the accuracy converges. Another observation is that datasets with higher accuracy converge at a smaller development set size. For example, the CUB dataset has an accuracy of 97.63% and its accuracy converges at a development set size of 2, while the GTSRB dataset requires a development set size of 8 to converge, as it achieves a lower accuracy of 70.75%. Finally, the development set size required to converge in practice is much smaller than the theory in Figure 7 predicts; a development set with 5 examples per class is enough for all datasets.
Varying Number of Affinity Functions. We vary the number of affinity functions to study its effect on the results (Figure 9). Accuracy increases as the number of affinity functions increases for all datasets. This is understandable, as more affinity functions bring more information that the inference module can exploit.
5.5. End-to-End Performance Comparison
We also use the probabilistic labels generated by Snorkel, Snuba, and GOGGLES to train downstream discriminative models, following a similar approach to (ratner2017snorkel; varma2018snuba). Specifically, we use VGG16 as the downstream ML model architecture and tune the weights of the last fully connected layers using the cross-entropy loss. For training the FSL model, we use the same development set used by Snuba and GOGGLES for labeling. For training the upper bound model, we use the entire labeled training set. The performance of each approach on the held-out test sets is reported in Table 2.
First, GOGGLES outperforms Snuba by an average of 21%, a number similar to the 23% labeling performance improvement GOGGLES has over Snuba (cf. Table 1), and the end model performance of Snuba is worse than that of FSL. This is because the labels generated by Snuba are only slightly better than random guessing, and having many extremely noisy labels can be more harmful than having fewer labels when training an end model. Second, GOGGLES outperforms the fine-tuned state-of-the-art FSL method (cf. Section 5.1.3) by an average of 5%, which is significant considering GOGGLES is only 7% away from the upper bound. Third, not surprisingly, the more accurate the generated labels are, the more performance gain GOGGLES has over FSL (e.g., the improvements are more significant on CUB and Surface, which have higher labeling accuracies than the other datasets).
This experiment demonstrates the advantage GOGGLES has over FSL and data programming systems: GOGGLES has exactly the same inputs as FSL (both only have access to the pre-trained VGG16 and the development set), and it does not require the dataset-specific labeling functions needed by data programming systems.
Table 2. End model accuracy (%) on held-out test sets. The last column is the supervised upper bound trained on the full labeled training set.

Dataset  | FSL   | Snorkel | Snuba | GOGGLES | Supervised
CUB      | 84.74 | 87.85   | 56.32 | 95.30   | 98.44
GTSRB    | 90.72 | n/a     | 70.11 | 91.54   | 98.94
Surface  | 76.00 | n/a     | 51.67 | 83.33   | 92.00
TB-Xray  | 66.42 | n/a     | 62.71 | 70.90   | 82.09
PN-Xray  | 68.28 | n/a     | 62.19 | 69.06   | 74.22
Average  | 77.23 | n/a     | 60.60 | 82.03   | 89.14
6. Related Work
ML Model Training with Insufficient Data. Semi-supervised learning techniques (zhu2005semi) combine labeled and unlabeled examples for model training, and active learning techniques aim to involve human labelers judiciously to minimize labeling cost (settles2012active). Though semi-supervised learning and active learning can reduce the number of labeled examples required to obtain a competent model, they still need many labeled examples to start with. Transfer learning (pan2010survey) and few-shot learning techniques (DBLP:journals/corr/XianLSA17; fei2006one; DBLP:journals/corr/abs190405046; 2019ChenLKWH19) often use models trained on source tasks with many labeled examples to help train models on new target tasks with limited labeled examples. Not surprisingly, these techniques often require users to select a source dataset or pre-trained model that is in a similar domain as the target task to achieve the best performance. In contrast, our proposal can incorporate several sources of information as affinity functions.
Data Programming. Data programming (ratner2016data), and the Snuba (varma2018snuba) and Snorkel (ratner2017snorkel) systems that implement the paradigm, were recently proposed in the data management community. Data programming focuses on reducing the human effort in training data labeling and is the work most relevant to ours. Data programming ingests domain knowledge in the form of labeling functions: each labeling function takes an unlabeled instance as input and outputs a label with better-than-random accuracy (or abstains). As we show in this paper, using data programming for image labeling tasks is particularly challenging, as it requires images to have associated metadata (e.g., text annotations or primitives), and a different set of labeling functions is required for every new dataset. In contrast, affinity coding and GOGGLES offer a domain-agnostic and automated approach to image labeling.
Other Related Work in the Database Community. Many problems in the database community share similar challenges with our work. In particular, data fusion/truth discovery (pochampally2014fusing; rekatsinas2017slimfast), crowdsourcing (das2016towards), and data cleaning (rekatsinas2017holoclean), in one form or another, all need to reconcile information from multiple sources to reach one answer. While the information sources are assumed as input in these problems, labeling training data faces the challenge of lacking enough information sources. In fact, one primary contribution of GOGGLES is the affinity coding paradigm, under which each unlabeled instance becomes an information source.
7. Conclusion
We proposed affinity coding, a new paradigm that offers a domain-agnostic way of automating training data labeling. Affinity coding is based on the proposition that the affinity scores of instance pairs belonging to the same class should on average be higher than those of instance pairs belonging to different classes, according to some affinity functions. We built the GOGGLES system, which implements the affinity coding paradigm for labeling image datasets. GOGGLES includes a novel set of affinity functions defined using the VGG16 network, and a hierarchical generative model for class inference. GOGGLES is able to label images with high accuracy without any domain-specific input from users, except a very small development set, which is economical to obtain.
8. Acknowledgements
We thank the SIGMOD'20 anonymous reviewers for their thoughtful and highly constructive feedback. This work was supported by NSF grants CNS-1704701 and IIS-1563816, and an Intel gift for ISTC-ARSA.
References
Footnotes
 journalyear: 2020
 copyright: acmlicensed
 conference: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data; June 14–19, 2020; Portland, OR, USA
 booktitle: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data (SIGMOD’20), June 14–19, 2020, Portland, OR, USA
 price: 15.00
 doi: 10.1145/3318464.3380592
 isbn: 978-1-4503-6735-6/20/06
 ccs: Mathematics of computing Probabilistic inference problems
 ccs: Computing methodologies Computer vision representations
 ccs: Computing methodologies Cluster analysis
 ccs: Computing methodologies Learning settings
https://github.com/chu-data-lab/GOGGLES