Co-training for Demographic Classification Using Deep Learning from Label Proportions
Deep learning algorithms have recently produced state-of-the-art accuracy in many classification tasks, but this success typically depends on access to many annotated training examples. For domains without such data, an attractive alternative is to train models with light, or distant, supervision. In this paper, we introduce a deep neural network for the Learning from Label Proportion (LLP) setting, in which the training data consist of bags of unlabeled instances with associated label distributions for each bag. We introduce a new regularization layer, Batch Averager, that can be appended to the last layer of any deep neural network to convert it from supervised learning to LLP. This layer can be implemented readily with existing deep learning packages. To further support domains in which the data consist of two conditionally independent feature views (e.g. image and text), we propose a co-training algorithm that iteratively generates pseudo bags and refits the deep LLP model to improve classification accuracy. We demonstrate our models on demographic attribute classification (gender and race/ethnicity), which has many applications in social media analysis, public health, and marketing. We conduct experiments to predict demographics of Twitter users based on their tweets and profile image, without requiring any user-level annotations for training. We find that the deep LLP approach outperforms baselines for both text and image features separately. Additionally, we find that the co-training algorithm improves image and text classification by 4% and 8% absolute F1, respectively. Finally, an ensemble of text and image classifiers further improves the absolute F1 measure by 4% on average.
Deep learning methods have produced state-of-the-art accuracy on many different classification tasks, especially image classification [1, 2, 3, 4, 5]. Because these networks typically have millions of parameters, they rely on access to large labeled data sets such as ImageNet, which has over one million annotated images. While transfer learning can help adapt a network to a new domain, it still requires labeled data on the target domain.
An attractive alternative is Learning from Label Proportion (LLP), in which the training samples are divided into a set of bags, and only the label distribution of each bag is known. The main advantage of LLP is that it does not require annotations for individual instances. Furthermore, in many domains label proportions are readily available — for example, by associating geolocated social media messages with county population statistics, we can fit a model of demographics without annotating individual users.
While many LLP models have been proposed based on the logistic hypothesis, SVM, and graphical models, there has been little work that considers deep learning. In this paper, we propose an approach that converts a deep neural network from a supervised classifier to an LLP model. This method can readily be implemented in popular deep learning packages by introducing a new regularization layer into the network. We propose such a layer, called the Batch Averager. Similar to label regularization, this layer computes the average of its input as the output. Like other regularization layers, this layer is only applied at training time. The Batch Averager is typically appended to the last layer of a network to convert it from a supervised model to a deep LLP model. We use KL-divergence as the error function to train the network to produce predictions that match the provided label proportions.
This deep LLP framework has these key advantages:
Simplicity: The Batch Averager is based on a very simple tensor operation (average) and, unlike the Batch Normalizer layer, it does not have any training weights and does not noticeably affect training time.
Availability: The framework can convert almost every supervised classification network to LLP.
Accuracy: Our empirical results indicate that the deep LLP has a comparable accuracy to supervised models subject to proper constraints to generate bags.
In addition, we propose co-training with deep LLP to further improve accuracy. In some applications, multiple views of the data are available (e.g. text and image). For example, many social media sites contain both image and textual data. While there have been many deep learning methods that directly combine text and image features, we instead require a model that is robust to cases where one feature view is missing. For example, some users may not post images, yet we would still like to classify them based on their text. An attractive alternative is co-training. Because traditional co-training requires labeled data, we propose a new algorithm that is more suitable for the LLP setting. In this method, we use an image-based LLP model to create bags with estimated label proportions, which we call pseudo-bags. We then fit a text-based LLP model on these pseudo-bags, and use it in turn to create pseudo-bags for the image-based model.
We conduct experiments classifying Twitter users into demographic categories (gender and race/ethnicity) based on the profile image and tweets. We find that the proposed deep LLP framework outperforms both supervised and LLP baselines, and that co-training further improves accuracy of the models.
The remainder of the paper is organized as follows. In Section II, we review related work on LLP and co-training, and Section III provides our deep LLP framework and co-training algorithm. In Section IV we describe the data collected for the experiments. In Section V we present our empirical results; Section VI concludes and describes plans for future work.
II Related Work
Deep learning methods have produced state-of-the-art results on many classification tasks, but typically require many labeled training instances. For image classification, researchers usually compare their models by training on large datasets such as ImageNet. The VGG-16 network (2014) achieves 90.1% top-5 accuracy (single crop) on ImageNet by stacking 16 layers on top of each other. The ResNet-152 network (2015) improves that to 93.3% by introducing residual networks. The Inception V1-V3 networks present the inception module for convolutional neural networks [4, 11, 16], and Inception-V3 (2015) achieves 94.1% top-5 accuracy on ImageNet. The inception module can be combined with residual networks to create Inception-ResNet-v2 (2016), with 95.1% accuracy. The Xception (2016) network introduces an extreme inception module using depthwise separable convolution layers and achieves 94.5% top-5 accuracy.
All of these models rely heavily on vast amounts of labeled data. With transfer learning, the pre-trained weights (learned on ImageNet) can be carried over to other image classification tasks. However, we still need annotated data on the target domain, and alternative approaches such as LLP are required in the absence of labeled data.
Only a handful of researchers have attempted to use deep learning in LLP settings. Kotzias et al. propose a model for the particular case of LLP in which bag-level labels are available. For example, for a text classification task in which we have the label of a bag of multiple sentences, they propose a model to infer the label of each sentence, using an objective function that smooths the posterior probability of samples based on sample similarity and bag constraints. They use a convolutional neural network to infer sentence similarity for text classification.
Other researchers provide models for image segmentation. In this setting, each image is split into multiple small images (as a bag), and the classifier tries to infer the labels of these regions using the label proportion of the bag. Li et al. suggest a convolutional neural network with a probabilistic method to estimate labels by considering the proportion bias, using the Expectation Maximization algorithm to determine model parameters. In their approach, they use satellite images with known ice area ratios to train a classifier to predict the labels of small image segments.
While these methods have promising results on a particular domain, to the best of our knowledge, no method has proposed a framework that can be readily applied to diverse classification tasks. Inspired by label regularization, we fill this gap by introducing Batch Averager as a regularizer layer.
Traditional L2 regularization appears to be insufficient for deep neural networks, because overfitting is a severe problem in these networks. Srivastava et al. introduce a Dropout layer that randomly drops some units during training and show that it can significantly reduce overfitting. Furthermore, the Batch Normalizer is introduced to normalize the output of cells and reduce internal covariate shift. Like these two layers, our proposed Batch Averager layer behaves differently at training and testing time: it applies only at training time (at testing time, Batch Normalizer normalizes the output with the stored moving average and standard deviation, without updating them).
Recently, researchers have proposed demographic classification with deep learning. Zhang et al. offer a method to infer demographic attributes (gender) from unconstrained images in the wild. Similarly, Liu et al. propose a convolutional neural network to classify attributes such as age, gender, and race in the wild. Ranjan et al. provide a fast RCNN to localize and recognize all faces in a scene along with their attributes (e.g. gender, pose). Few works attempt to use both textual and image features together.
Co-training is a semi-supervised method that trains on two feature views of a small set of labeled data and iteratively adds pseudo labels from a large set of unlabeled data. Gupta et al. demonstrate a co-training algorithm that trains on captioned images to recognize human actions with SVMs. Their model also learns from videos of athletic events with commentary.
The majority of these works use captions to improve image classification accuracy through co-training, and cannot be applied to text-only classification. Furthermore, they require annotated data. In contrast, our proposed co-training model trains both image and text classifiers that can be used separately. Additionally, we do not need any labeled data, and if the testing sample has both image and text features, we can apply an ensemble model (by soft voting) to improve the classification accuracy.
In this section, we first define our proposed regularization layer, Batch Averager. Then, we illustrate the deep LLP framework with this layer. Next, we provide an algorithm to create bags with appropriate label distributions for training. Finally, we offer a co-training approach for training on two views of data (e.g. text and image) in LLP settings.
III-A Batch Averager Layer
In the Learning from Label Proportion setting, the training data are divided into bags, and only the label distribution for each bag is known. For bag $i$, let $T_i$ be the set of samples in this bag; i.e. $T_i = \{x_{i,j}\}$, where $x_{i,j}$ is the feature vector for instance $j$ in bag $i$. Let $\tilde{y}_i$ be the provided label proportion for bag $i$, and $n_i$ be the number of samples in this bag (i.e. $n_i = |T_i|$). E.g., for binary classification, $\tilde{y}_i$ is the fraction of positive instances in bag $i$.
To implement Deep LLP, we assume that each bag is assigned to exactly one batch for gradient optimization. Also, because deep neural networks are typically trained by GPU cores with limited memory, there is usually a maximum batch size determined by the number of parameters that can fit into GPU memory. As a result, if a bag is too large, we need to break it down into smaller bags.
Our work is inspired by label regularization, and we define a regularization layer for a neural network. The label regularization model estimates bag posterior probabilities by the average of the posterior probabilities of the instances in that bag. In a neural network, this can be implemented simply by computing the average of the output of the last layer per batch. As a result, we name this layer Batch Averager.
Let $q_{i,j}$ be the (unobserved) output of the last layer of a supervised network for instance $j$ in bag $i$. Because in classification tasks the last layer is typically a logit function (either softmax or logistic), it returns a vector of posterior label probabilities; i.e. for bag $i$, instance $j$, and class $k$ we have:

$$q_{i,j,k} = p(y = k \mid x_{i,j})$$
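Inspired by label regularization, we estimate the posterior probability of bag $i$, $\hat{y}_i$, as the average of the posterior probabilities of all instances in bag $i$; i.e., with $q_{i,j}$ the posterior vector of instance $j$ and $n_i$ the number of instances in the bag,

$$\hat{y}_i = \frac{1}{n_i} \sum_{j=1}^{n_i} q_{i,j}$$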
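We expect that the bag posterior probability ($\hat{y}_i$) should be close to the bag prior probability ($\tilde{y}_i$). Again, similar to label regularization, we use KL-divergence as the error function between posterior and prior, and the target of the neural network is to minimize this error function:

$$E_i = \mathrm{KL}(\tilde{y}_i \,\|\, \hat{y}_i) = \sum_{k} \tilde{y}_{i,k} \log \frac{\tilde{y}_{i,k}}{\hat{y}_{i,k}}$$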
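The bag-level error described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names `bag_posterior` and `kl_divergence` are hypothetical, the toy probabilities are made up, and the direction of the divergence (prior against posterior) is an assumption rather than something the text specifies.

```python
import math

def bag_posterior(instance_probs):
    """Average the per-instance class probabilities over one bag."""
    n = len(instance_probs)
    num_classes = len(instance_probs[0])
    return [sum(p[k] for p in instance_probs) / n for k in range(num_classes)]

def kl_divergence(prior, posterior, eps=1e-12):
    """KL(prior || posterior) for two discrete distributions (assumed direction)."""
    return sum(
        p * math.log((p + eps) / (q + eps))
        for p, q in zip(prior, posterior)
        if p > 0
    )

# Toy bag: four instances, two classes (hypothetical softmax outputs).
probs = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.4, 0.6]]
prior = [0.6, 0.4]                  # provided label proportion for the bag
posterior = bag_posterior(probs)    # average of the four rows
loss = kl_divergence(prior, posterior)  # close to zero: the averages match the prior
```

Note that no individual prediction equals the prior here; only the bag average does, which is all the error function constrains.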
The Batch Averager layer is very similar to the Batch Normalizer layer. The latter normalizes the output of the batch to have zero mean and unit variance; the former replaces its input with the average over the batch. As a regularization layer, Batch Averager is only applied at training time, not at testing time. Typically, Batch Averager would be applied to the last layer of the network, immediately after the Logit layer.
Implementing the Batch Averager is straightforward with most current open-source deep learning packages. However, most of these packages assume that the output of the network is a tensor whose first dimension equals the batch size ($n_i$). As a result, we need to repeat the average $n_i$ times. Similarly, we need to repeat the prior ($\tilde{y}_i$) $n_i$ times. Because the number of samples per bag (batch size) typically differs between bags, we need to add a sample weight of $1/n_i$ for each sample in bag $i$. As a result, the entire bag has sample weight one.
More formally, for bag $i$, we train the network in a similar fashion to traditional supervised learning, with feature vectors $x_{i,j}$, labels $\tilde{y}_i$ (repeated for every instance), and sample weights $1/n_i$. We additionally use KL-divergence as the error function for back propagation.
We use Keras (http://www.keras.io) to implement this layer, and it supports both the Tensorflow and Theano backends. The implementation is very simple (just one line in Keras), and can extend to other deep learning packages. Another implementation aspect to take into consideration is that, by default, deep learning packages assume a fixed batch size. However, the bag size in LLP is typically dynamic. To support dynamic batch sizes, we implement a generator that builds batches and feeds them to the network. This generator also automatically breaks large bags down at random into smaller bags that can fit in GPU memory.
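A batch generator of the kind described above might look like the following sketch. It is not the authors' code: the name `bag_batches` and the data layout are hypothetical, and it assumes that each smaller bag simply inherits the label proportion of the bag it was split from.

```python
import random

def bag_batches(bags, max_batch_size, rng=random):
    """Yield (samples, targets, weights), with one bag or sub-bag per batch.

    bags: list of (samples, proportion) pairs, where proportion is a list
    of class probabilities for the whole bag.
    """
    for samples, proportion in bags:
        shuffled = list(samples)
        rng.shuffle(shuffled)  # large bags are split differently on every pass
        for start in range(0, len(shuffled), max_batch_size):
            chunk = shuffled[start:start + max_batch_size]
            n = len(chunk)
            # Assumption: a sub-bag keeps the proportion of its parent bag.
            targets = [list(proportion) for _ in range(n)]  # prior repeated n times
            weights = [1.0 / n] * n                         # weights sum to one per batch
            yield chunk, targets, weights
```

Calling the generator once per epoch re-shuffles each bag, so a bag larger than the maximum batch size is broken into different sub-bags each time.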
To implement this layer with dynamic batch size, we use the broadcasting feature in tensor operations. Supposing $K$ is the backend (Tensorflow or Theano), we define this function as the activation function for the Batch Averager layer:

$$f(x) = K.\mathrm{zeros\_like}(x) + K.\mathrm{mean}(x, \mathrm{axis}=0)$$
where $x$ is the input tensor of the layer. Because the Batch Averager is appended after the logit layer, $x$ is a two-dimensional tensor with size $n_i \times c$, where $n_i$ is the batch size and $c$ is the number of output classes. This function first creates a zero tensor with the same size as $x$. Then it computes the average of $x$ over the first axis as a vector tensor with size $c$. Finally, it adds them together using broadcasting, creating a tensor with the same size as $x$ in which the average of $x$ is repeated over the first axis $n_i$ times.
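The three steps above (zero tensor, mean over the first axis, broadcast addition) can be emulated with plain Python lists instead of backend tensors. The function name `batch_averager` is hypothetical, and the sketch only mirrors the shapes involved; it does not reproduce gradient computation.

```python
def batch_averager(x):
    """Replace every row of a (batch_size x classes) matrix with the column mean."""
    n = len(x)
    mean = [sum(column) / n for column in zip(*x)]   # vector of size: classes
    zeros = [[0.0] * len(mean) for _ in range(n)]    # same shape as x
    # Emulate broadcasting: add the mean vector to every zero row.
    return [[z + m for z, m in zip(row, mean)] for row in zeros]

# Toy batch of four instances and two classes (hypothetical probabilities).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 0.0]]
averaged = batch_averager(x)   # four identical rows holding the batch average
```

The output keeps the input's shape whatever the batch size, which is what lets the layer work with bags of varying size.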
III-B Deep LLP framework
In this section, we provide a framework to implement deep LLP networks. This framework uses the Batch Averager regularizer as the last layer, and can be applied in any classification application (e.g. text, image). We furthermore propose text classification and image classification networks.
As in any deep classification network, the input of the network in the Deep LLP framework is the feature vector (i.e. textual or image features). We then feed it through the layers that are typically used for the classification task at hand. The next layer is the Logit (softmax) layer, which computes class probabilities. Finally, we add the Batch Averager for label regularization, which sends the average of these class probabilities over the batch to the output.
For text classification, we simply use a shallow network with only one dense layer (with 16 cells), followed by a dropout layer to avoid overfitting. We also use a temperature parameter, as suggested in previous work. The temperature can be readily implemented in the Logit layer by a tensor operation.
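One common way to add a temperature to a softmax layer is to divide the pre-softmax scores by the temperature. The text does not say exactly how the temperature enters the Logit layer, so the sketch below is an assumption, and `softmax_with_temperature` is a hypothetical name.

```python
import math

def softmax_with_temperature(scores, temperature=1.0):
    """Softmax over scores / temperature (assumed form of the temperature)."""
    scaled = [s / temperature for s in scores]
    largest = max(scaled)                      # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

sharp = softmax_with_temperature([2.0, 0.0], temperature=1.0)
flat = softmax_with_temperature([2.0, 0.0], temperature=10.0)
```

Under this form, a temperature above one flattens the predicted distribution and a temperature below one sharpens it.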
For image classification, we use Xception, a state-of-the-art image classification model . We use the pre-trained weights (trained on ImageNet), and freeze the first two blocks of the network to train it with the maximum batch size 32 and fit the network with color images with dimension . Figure 1 shows a more detailed network for this model.
III-C Bag creation algorithm
In some domains, bags are naturally available with soft constraints such as geolocation, allowing us to assign label proportions to groups of instances. For example, in experiments below, we attach the U.S. Census statistics of county demographics to Twitter users from the same location. However, in other domains, our constraint is a prior probability based on an attribute of an instance. For example, according to US Census, 67% of people with the last name ‘Taylor’ are white. The problem is to construct a bag that combines many of these constraints and associate an accurate label proportion for training.
For class $k$ and unlabeled samples $U$, we assume we have a prior probability $p(y = k \mid x)$ that each sample $x \in U$ belongs to this class (e.g., the prior of 67% for the class white given the last name ‘Taylor’). The goal is to select instances from $U$ to construct a bag, then to assign the expected proportion of that bag that have class $k$. Algorithm 1 shows how to create bags for this class. The algorithm takes as input a maximum bag size ($b$) and a threshold ($t$). In the results below, we set $b$ to 64 for all experiments. The algorithm first removes samples with prior probability lower than the threshold $t$. The threshold is typically greater than .5 (in preliminary experiments, the results were not very sensitive to this parameter, and we tune this threshold for each task; e.g. .8 for gender classification and .9 for race classification). Then, samples are sorted in decreasing order of their prior probability. Next, we move the first $b$ sorted samples to the first bag, the next $b$ samples to the next bag, and continue until all remaining samples (above the threshold) are assigned to bags. Finally, we compute the bag label proportion as the average of the prior probabilities of the samples inside the bag. These bags are then used as training data in the deep LLP model.
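The bag creation steps described above (filter by threshold, sort by decreasing prior, cut into fixed-size bags, average the priors) can be sketched as follows. The function name `create_bags` and the dictionary-based input are hypothetical choices for illustration.

```python
def create_bags(priors, max_bag_size=64, threshold=0.8):
    """Group samples into bags for one class.

    priors: dict mapping a sample id to its prior probability for the class.
    Returns a list of (sample_ids, label_proportion) pairs.
    """
    # Remove samples whose prior is lower than the threshold.
    kept = [(sample, p) for sample, p in priors.items() if p >= threshold]
    # Sort the remaining samples by decreasing prior probability.
    kept.sort(key=lambda pair: pair[1], reverse=True)
    bags = []
    for start in range(0, len(kept), max_bag_size):
        chunk = kept[start:start + max_bag_size]
        ids = [sample for sample, _ in chunk]
        proportion = sum(p for _, p in chunk) / len(chunk)  # average prior in the bag
        bags.append((ids, proportion))
    return bags

# Toy priors (made-up values) with a small bag size to show the grouping.
toy = {"a": 0.95, "b": 0.85, "c": 0.5, "d": 0.9, "e": 0.8}
toy_bags = create_bags(toy, max_bag_size=2, threshold=0.8)
```

Because samples are sorted before being cut into bags, each bag groups samples with similar priors, so early bags carry higher label proportions than later ones.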
The advantage of this bag creation approach is that it allows us to associate label proportions with instances using pre-existing population statistics. Of course, these label proportions are likely to be inexact, and contain some selection bias because Twitter users are not representative of the population. However, several prior works have found that LLP methods are robust to this type of noise; we also find this to be the case in the present work.
III-D Co-training for LLP
In this section, we provide an algorithm for applying co-training in the LLP setting. This algorithm is useful when we have two conditionally independent feature views of data. An interesting case is when we have both text and image views of the data. Some views may naturally have more bags than others, or may have more accurate label proportions associated with them. Thus, our goal is to combine the advantages of each view to produce a more accurate classifier.
To apply our algorithm, we assume we have two sets of bags, one with textual features and one with image features. Let $T^{(v)}$, $\tilde{y}^{(v)}$, and $U^{(v)}$ be the bags, label proportions, and unlabeled data for view $v$ (e.g. text or image), where the unlabeled sets of the two views refer to the same set of instances. We propose Algorithm 2 for co-training in LLP settings. The algorithm proceeds by using the model trained on one view to create pseudo bags for the other view. The algorithm is initialized with empty pseudo bags ($\emptyset$). In each iteration, it first trains a deep LLP model on the union of the bags of the current view (e.g. text or image) and the pseudo bags for that view. Next, it predicts the posterior probability of the unlabeled samples with the current view. Then, it calls Algorithm 1 to create pseudo bags for the other view, using the current view’s posteriors as the priors. Finally, it switches views for the next iteration. In the experiments below, this algorithm tends to converge quickly (e.g., six iterations).
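The alternating loop described above can be outlined as below. This is a sketch under assumptions, not Algorithm 2 itself: `co_train`, `fit`, `predict`, and `make_bags` are hypothetical names supplied by the caller, bags are represented as (sample ids, proportion) pairs, and newly created pseudo bags are assumed to replace the previous ones rather than accumulate.

```python
def co_train(bags, unlabeled, fit, predict, make_bags, iterations=6):
    """Alternate between two views, passing pseudo bags from one to the other.

    bags, unlabeled: dicts keyed by view name (exactly two views).
    fit(view, bag_list) -> model
    predict(model, view, samples) -> dict of sample id -> posterior probability
    make_bags(priors) -> list of bags (e.g. a bag creation routine)
    """
    views = list(bags)
    pseudo = {view: [] for view in views}   # start with empty pseudo bags
    models = {}
    current, other = views
    for _ in range(iterations):
        # Train on the current view's bags plus the pseudo bags made for it.
        models[current] = fit(current, bags[current] + pseudo[current])
        # Predict the unlabeled samples and turn posteriors into pseudo bags.
        posteriors = predict(models[current], current, unlabeled[current])
        pseudo[other] = make_bags(posteriors)   # assumption: replace, not append
        current, other = other, current         # switch views
    return models
```

Since both views index the same instances, the sample ids inside a pseudo bag created from one view can be looked up in the other view's features by `fit`.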
To evaluate our deep LLP framework, we consider the task of predicting the gender and race/ethnicity of a Twitter user based on their tweets and profile image. In this section, we first describe how we collect data from Twitter and create bags, as well as the annotated data used for validation. (Replication code and data will be made available upon publication.)
IV-A Twitter data
Image data commonly comes with metadata, which is useful for creating bags. For example, images on Flickr (http://www.flickr.com) have captions and descriptions. Also, Twitter users often have a profile image. To demonstrate how the metadata can be used as a constraint to create bags, we collect data from Twitter. For the purpose of this study, we use geolocation and name in the metadata as constraints to create bags.
First, we use the Twitter Streaming API to collect roughly 120K tweets and remove users without a profile image; approximately 33K tweets with a profile image remain. However, not all users use their own photo. To decrease noise, we want to ensure that there is only one face in the profile image. To do so, we apply the Viola-Jones object detection algorithm. This algorithm detects multiple faces in a scene with a low false positive rate. After this filtering, roughly 10.5K images with exactly one identified face remain. (Please note that this filtering only affects training data; validation data may contain multiple faces.) For each user, we download the most recent 200 tweets for use by the text-based model.
Next, we use the county and name metadata to create bags. We use the county constraint as described in recent work by associating each user with a county in the U.S., based on the geolocation information provided in their tweets. This results in 85 bags with an average of 124 users per bag. Figure 2 illustrates a generated bag using the county constraint with 53% prior probability of class ‘white’. This figure shows that there are some photos with multiple faces or cartoon faces, due to errors in the face detection algorithm. So, we expect that our photo bags will contain some noise.
We also use the name attribute (where available in the user’s profile) to estimate label proportions. For gender classification, we use the first name together with data from the US Social Security Administration (SSA). For example, according to SSA baby name data (https://www.ssa.gov/oact/babynames/), for the first name ‘Casey’, the probability of being a man is 59%. Then we use Algorithm 1 with a maximum bag size of 64 and a threshold of .8 to create bags. Figure 3 shows a bag using name constraints with 87% male prior probability. There is again some noise due to face detection. Additionally, for race/ethnicity classification, we use the last name as described in recent research to estimate class priors for users who provide their last name, and run Algorithm 1 to create bags. For example, according to the US Census, 67% of people with the last name ‘Taylor’ are white.
Finally, for evaluation purposes, we manually annotate 320 photos; the class distribution is shown in Table I. We use this dataset only for evaluation, not for training.
IV-B Google search
Figure 2 and Figure 3 show that Twitter profile images are often noisy. Furthermore, the class distributions are unbalanced, since African-Americans are less frequent in both county and name constraints. This motivates a third type of constraint for image classification. We submit keywords (e.g. ‘Latino woman’ and ‘Black American man’) to Google image search to identify images that are likely to come from the desired class. Then, we apply the Viola-Jones algorithm to remove photos without exactly one face in the scene. Because Google search sorts images in decreasing order of relevance, we associate decreasing label proportions as we descend the list. Specifically, we group search results into bags of size 64 for each contiguous set of results, up to a maximum of 800 results. The first bag is assigned a label proportion of 95% for the positive class, and the proportion is reduced by 5% for each subsequent bag, to a minimum of 55%.
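The rank-based schedule above can be written as a short sketch. The function name `search_bags` is hypothetical, the input is assumed to be the list of results that already passed the face filter, and integer percentages are used to avoid floating-point drift when stepping down by 5%.

```python
def search_bags(ranked_results, bag_size=64, max_results=800,
                start_pct=95, step_pct=5, min_pct=55):
    """Group ranked search results into bags with decreasing label proportions."""
    results = ranked_results[:max_results]
    bags = []
    for index, start in enumerate(range(0, len(results), bag_size)):
        pct = max(min_pct, start_pct - step_pct * index)  # 95, 90, ..., never below 55
        bags.append((results[start:start + bag_size], pct / 100))
    return bags

# Toy ranking: integers stand in for image identifiers.
bags_from_search = search_bags(list(range(1000)))
```

With these settings the proportion reaches its floor of 55% at the ninth bag, and any later bags keep that value.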
IV-C CFD dataset
For additional validation, we use the Chicago Face Database (CFD), which contains high-quality frontal images (without background) for both genders and four racial/ethnic categories (white, black, Hispanic, and Asian), as well as facial expressions. We remove Asian photos (because only a few Asian samples are in our Twitter data) and use this dataset as an additional testing set. The advantage of this database is that it has more samples for the Hispanic category than our validation set. Table II shows the class distribution of the CFD database.
However, this dataset has a different distribution from our training set. Our training set mostly consists of images in the wild (not frontal, with background, and of lower quality). As a result, we expect the classifier to behave differently on this set. As we will discuss below in the experimental results, because these images have very high quality, our race/ethnicity classifier is extremely accurate (with 98% F1 measure for the black and white classes). However, the F1 metric drops for gender classification because of the lack of body features in the CFD dataset.
For the purpose of this study, we train image and textual classifiers for the demographic attribute task (gender and race/ethnicity). For race classification, we consider three categories (white, black, and Hispanic). However, according to the US Census (http://www.census.gov/prod/cen2010/briefs/c2010br-04.pdf), Hispanic is a debated term that refers to Spanish culture or origin regardless of race and can be of any ethnicity. As a result, we consider another classification task restricted only to the black and white classes. We refer to the latter as the ‘race2’ and the former as the ‘race3’ classification task. Since labeled samples for both the Twitter and CFD datasets are imbalanced, we report the weighted average F1 measure to compare results.
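The text does not define the weighting used for the weighted average F1. A common reading, assumed here, is that each class's F1 is weighted by that class's share of the true labels. The function name `weighted_f1` and the toy labels are hypothetical.

```python
def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights proportional to class support (assumed)."""
    total = len(y_true)
    score = 0.0
    for label in sorted(set(y_true)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        support = tp + fn                     # number of true samples of this class
        score += f1 * support / total
    return score

# Toy example with an imbalanced label set (made-up labels).
truth = ["white", "white", "white", "black"]
predicted = ["white", "white", "black", "black"]
toy_score = weighted_f1(truth, predicted)
```

Under this weighting, frequent classes dominate the average, which is why it is often preferred over plain accuracy or an unweighted mean on imbalanced data.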
In this section, we first provide experimental results for text classification task and compare it with the state of the art baselines. Then, we describe image classification results and compare them with third-party face APIs. Finally, we demonstrate how co-training improves both text and image classification, as well as the advantage of an ensemble approach.
V-A Text classification result
Because the search constraints created by searching Google images do not have any textual features, for the text classification task we only use county and name constraints. Table III shows the weighted F1 measure for different tasks with all combinations of constraints. According to this table, the county constraint results in higher accuracy for race classification, and name bags have higher accuracy for gender classification. That means that location is more informative for race classification, and first name is more informative for gender classification. Also, for the race task, using both constraints together increases the classification accuracy, but for gender classification, the result of combining constraints is almost the same as using only name bags.
Since using both constraints improves (or at least does not harm) the classification task, we use both name and county constraints in all subsequent text classification experiments. We use the maximum batch size of 32 to generate Table III; to do so, any larger bags are randomly split into smaller bags with the maximum size of 32 at each iteration of training.
Table IV presents the weighted average F1 metric for different maximum batch sizes using county and name constraints. According to this table, while gender classification is stable for different batch sizes, the maximum batch size of 32 works better for race tasks. This result is not surprising, because a batch size of 16 is too small for a bag, and a batch size of 32 has the advantage of randomly splitting bags (which are created with size 64) to avoid overfitting. An interesting case is a batch size of 1, which associates a label distribution with individual instances. The problem with this approach is that it is less robust to noise: for example, if a label proportion of 90% positive is applied to a negative instance, the classifier will attempt to fit this noisy example. However, if this instance is the only negative one in a batch of size 10, then the classifier is given the flexibility of predicting it as negative while still optimizing the loss function. Indeed, we find that batch size 1 is not effective on this task.
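The noise argument above can be checked numerically with a toy calculation. The predictions below are made up, `binary_kl` is a hypothetical helper, and the direction of the divergence (target against prediction) is an assumption.

```python
import math

def binary_kl(target, predicted, eps=1e-7):
    """KL divergence between two Bernoulli distributions (target || predicted)."""
    q = min(max(predicted, eps), 1 - eps)   # keep the prediction away from 0 and 1
    loss = 0.0
    for p, r in ((target, q), (1 - target, 1 - q)):
        if p > 0:
            loss += p * math.log(p / r)
    return loss

# Nine confident positives and one confident negative, all classified correctly.
predictions = [0.99] * 9 + [0.01]
proportion = 0.9   # label proportion of the bag

# One batch of size 10: only the batch average has to match the proportion.
batch_loss = binary_kl(proportion, sum(predictions) / len(predictions))

# Batches of size 1: every instance is pushed toward 0.9 on its own.
single_losses = [binary_kl(proportion, p) for p in predictions]
```

In this toy case the batch loss is far smaller than any of the per-instance losses, and the largest per-instance loss falls on the correctly predicted negative instance, which is the one a batch size of 1 would push toward the wrong label.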
[Table IV: weighted average F1 for each maximum batch size. Columns: Max batch size, race2, race3, gender, Average.]
Finally, we compare our deep LLP model with state-of-the-art shallow LLP models: we use ridge regression for LLP and label regularization for comparison. Table V compares our model with maximum batch size 32 with other models for county and name constraints. According to this table, while our model has the same F1 as ridge LLP for the race2 task, it has higher F1 for all other tasks. More specifically, on average, while both ridge LLP and label regularization have the same F1 score (76), our model has the highest F1 measure (79).
V-B Image classification result
In this section, we present image classification results using our proposed deep LLP model. For all experiments, we use the Xception model with its pre-trained weights fit on ImageNet. The ImageNet dataset that is commonly used for image classification has over a million images for 1,000 objects but does not have any class related to a human face or body. Because we obtained the best result with the maximum batch size 32 in the last section, we use it again for the image classification task. To reduce memory consumption, we freeze the first two blocks of the Xception network in the training phase, and train the remaining layers.
To avoid overfitting and make the model robust to the wild condition of images, we apply various image distortions before each training iteration; we randomly rotate, flip (horizontally), shift (vertically and horizontally), shear, and zoom photos. Also, with probability of 80%, we randomly crop the Viola-Jones detected face to avoid overfitting the background in images. We use Adam, an adaptive stochastic gradient descent algorithm, as the optimization algorithm, and we train all models for up to 20 epochs and report the accuracy on the validation set (Twitter annotated profile images).
Table VI shows the weighted average F1 measure for different tasks using all constraint combinations. According to this table, the search constraint is the most informative in all tasks, and using the county and name constraints together gives a poor result for race classification. Also, except for the race3 task, using all constraints together has the best result. Comparing this with text classification, it is apparent that image classification has higher accuracy, owing to the search constraints.
This table also presents the F1 metric for CFD dataset. For race2, the model has a very high result of 98% F1 measure. This result reveals that race classification is much easier with high-quality frontal photos, as opposed to the noisier images from Twitter. On the other hand, while the gender classifier has very high F1 (95%) for Twitter images, it has lower F1 for CFD dataset. We believe that is in part because the CFD images omit body features, and so the classifier must rely solely on face features.
Since, on average, using all constraints has the highest average F1 measure on all tasks, in all subsequent experiments we present the results of models that train with all constraints. Figure (b) illustrates the validation accuracy (on Twitter labeled data) at each training epoch for the different classification tasks. Clearly, gender classification has the highest accuracy and converges faster than the other tasks, and race3 has the lowest accuracy and converges more slowly.
To illustrate the impact of adding image distortion to make the model robust to the wild condition of pictures, Table VII compares a model trained with random distortion using all constraints with the model without any distortion (Xception-no-distortion). According to this table, clearly, we need image manipulations for almost all tasks, and adding random distortion to photos has on average 7% absolute improvement of the F1 measure. Similarly, Figure (a)a shows the training and validation loss (KL-divergence) of race2 classification per epoch. According to this figure, the model that trains without image distortion overfits by converging to a lower training loss but with a higher validation loss.
To measure the effect of the underlying deep neural network, Table VII reports the average F1 of deep LLP models using various neural networks (with all constraints). We compare Xception with the Inception-v3, Inception-v4, and Inception-v4-aux networks; the last of these is the same as Inception-v4 but has an auxiliary output layer as proposed by Szegedy et al. (2016). We add another Batch Averager after the auxiliary layer and use KL-divergence for that output too.
According to this table, Inception-v3 performs poorly, and Inception-v4 does not converge for the race3 task, although it has the highest F1 measure (99%) for race2 on the CFD dataset. The need for an auxiliary layer in Inception-v4 is also clear, and the resulting improvement is significant. However, on average, in contrast to the results reported on ImageNet, Xception performs slightly better than Inception-v4-aux. That may be in part because Inception-v4-aux is very deep and requires more training data.
Figures (c)-(e) illustrate the training loss, validation accuracy, and validation loss of the gender classification task using different models. According to these figures, Xception converges faster and has better results, and Inception-v3 has the worst results. It is also apparent that adding the auxiliary layer to Inception-v4 improves it, but its results remain slightly below those of Xception.
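The training objective for the network with an auxiliary output, as described above, can be sketched as follows: each output's per-instance class probabilities are averaged over the batch (the role of the Batch Averager), and the average is compared with the bag's known label proportions by KL-divergence. The direction of the divergence, the `aux_weight` value, and all function names are assumptions for illustration, not details confirmed by the text.

```python
import math

def batch_average(probs):
    """Average per-instance class probabilities over one batch (one bag)."""
    n = len(probs)
    k = len(probs[0])
    return [sum(row[j] for row in probs) / n for j in range(k)]

def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted); eps guards against log(0)."""
    return sum(t * math.log((t + eps) / (p + eps))
               for t, p in zip(target, predicted) if t > 0)

def llp_loss(bag_prior, main_probs, aux_probs=None, aux_weight=0.4):
    """Bag-level loss for the main output plus an optional auxiliary output.
    aux_weight is a hypothetical mixing coefficient."""
    loss = kl_divergence(bag_prior, batch_average(main_probs))
    if aux_probs is not None:
        loss += aux_weight * kl_divergence(bag_prior, batch_average(aux_probs))
    return loss
```

Note that the loss depends only on the batch average, so individual predictions can be confident and varied as long as their mean matches the bag's label proportions.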
Finally, we compare deep LLP with four baselines. The first two are supervised deep learning models: we use Xception with its pre-trained weights and train it for 50 epochs on labeled data. For the Xception-wild model, the training data is the labeled Twitter evaluation data in wild conditions; for the Xception-HQ model, we train on the CFD dataset of high-quality images. The next two baselines are public face APIs, which can detect multiple faces in a scene along with multiple attributes. The Microsoft Face API (https://www.microsoft.com/cognitive-services/en-us/face-api) can identify gender but does not predict race. For race classification, we use the Sightcorp API (https://face.sightcorp.com/), which to the best of our knowledge is the only API supporting race recognition, but it is still in beta and its results are not very accurate. Table VIII compares these baselines to our deep LLP model using all constraints. Our approach outperforms Xception-wild, Xception-HQ, and Sightcorp on all tasks and outperforms the Microsoft API in the wild condition. However, the Microsoft API performs better on the CFD dataset of clean, frontal images, on which our method was not trained.
V-C Co-training result
In this section, we provide experimental results for Algorithm 2. For these results, we initialize the algorithm with the image view. (It is also possible to initialize it with the textual view, but we do not expect a significant difference.) In each iteration, the algorithm creates pseudo-bags, which are used in the next iteration after switching the view. In this experiment, the unlabeled samples are the same unlabeled Twitter data (with both image and text) used earlier; we simply organize them into different bags with different label proportions for training. For text bags, we use the county and name constraints, and for image bags, we use the county, name, and search constraints.
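One plausible way to build pseudo-bags from the current view's predictions is sketched below: instances are ranked by the predicted probability of one class, split into fixed-size groups, and each group's label proportions are set to the mean of its predicted posteriors. This grouping rule, the `bag_size` and `sort_class` parameters, and the function name are assumptions for illustration; the actual procedure is the one defined in Algorithm 2.

```python
def make_pseudo_bags(ids, posteriors, bag_size, sort_class=0):
    """Group instances into pseudo-bags using one view's predictions.

    ids        -- instance identifiers (e.g. user IDs)
    posteriors -- per-instance class-probability lists from the current view
    Returns a list of (member_ids, label_proportions) pairs.
    """
    # Rank instances from most to least likely to belong to sort_class.
    order = sorted(range(len(ids)),
                   key=lambda i: posteriors[i][sort_class], reverse=True)
    k = len(posteriors[0])
    bags = []
    for start in range(0, len(order), bag_size):
        members = order[start:start + bag_size]
        # The bag's label proportions are the mean predicted posterior.
        proportions = [sum(posteriors[i][j] for i in members) / len(members)
                       for j in range(k)]
        bags.append(([ids[i] for i in members], proportions))
    return bags
```

The resulting bags would then serve as training data for the other view's deep LLP model in the next iteration.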
Table IX presents the results of the co-training algorithm for image classification. In this table, the first column indicates the iteration of Algorithm 2, and the first row (iteration zero) gives the F1 measure of the deep LLP model without co-training. The table shows only the odd iterations (image classification steps) of the co-training algorithm. According to this table, the biggest improvement comes in the first iteration, which improves the absolute F1 of race2 by 2% (an error reduction of 25%) and the absolute F1 of race3 by 4% (an error reduction of 17%).
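The error-reduction figures quoted above can be checked with a short calculation, treating 1 - F1 as the error. The baseline values used in the usage comments (0.92 and 0.76) are not taken from Table IX; they are back-solved from the stated gains and error reductions and serve only as an illustration.

```python
def relative_error_reduction(f1_before, f1_after):
    """Fraction of the remaining error (1 - F1) removed by an improvement."""
    return (f1_after - f1_before) / (1.0 - f1_before)

# Illustrative, back-solved baselines (assumptions, not table values):
race2 = relative_error_reduction(0.92, 0.94)  # 2-point gain on an 8-point error
race3 = relative_error_reduction(0.76, 0.80)  # 4-point gain on a 24-point error
```

Under these assumed baselines, the 2-point gain removes a quarter of the remaining error, while the 4-point gain removes about a sixth, which is why the smaller absolute gain corresponds to the larger relative reduction.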
With co-training, the largest gain is for race3, with an absolute improvement of 7% on Twitter and 12% on the CFD dataset. We believe that is in part because the text classifier can detect Spanish words for the Hispanic class and thus builds better pseudo-bags for it. For the race2 classification task, our co-training algorithm improves absolute F1 by 3% on the Twitter dataset. The CFD dataset already has a very high F1 measure (98%) and shows no gain from the co-training steps.
Because gender classification already has a very high F1 measure (95%) on Twitter, co-training improves it by only 1%. However, it increases the F1 on the CFD dataset by 5%. We believe the lower accuracy for the gender classification task on the CFD dataset is in part due to the lack of body features. In the absence of body features, the classifier often misclassifies a woman with short hair as a man or a man with long hair as a woman. Also, because the CFD dataset includes images with facial expressions, which our training data does not account for, the classifier sometimes misclassifies a woman with an angry or fearful expression as a man. Figure 5 shows some misclassified samples with their predicted probabilities. (Images are blurred to protect privacy.)
Table X shows the results of the even iterations (text classification steps) of Algorithm 2. Again, the first row shows the result of text classification without co-training. According to this table, the second iteration of co-training yields the largest improvement (7% on average across all tasks). The largest gain is for race2, which improves by 10% in absolute F1. This improvement is likely because the image classifier has a very high F1 for this task and, as a result, creates accurate pseudo-bags for the text classifier. Finally, gender classification improves by only 3%, since it gains only a little from image classification.
In our final experiment, we select the last co-training iteration for text classification (6th iteration) and for image classification (5th iteration). We then apply both models to the Twitter dataset and blend their predictions using soft voting. Table XI shows the result. According to this table, the ensemble improves image classification by 2% on average and text classification by 6% on average. The ensemble produces the highest F1 for all tasks (except image classification for gender, where the accuracy is unchanged).
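Soft voting, as used for the ensemble above, averages the class probabilities produced by the two classifiers and predicts the class with the highest blended probability. The sketch below assumes equal weights for the text and image views; the weights and the function name are illustrative assumptions.

```python
def soft_vote(text_probs, image_probs, weights=(0.5, 0.5)):
    """Blend two class-probability vectors and return (label, blended)."""
    w_text, w_image = weights
    blended = [w_text * t + w_image * i
               for t, i in zip(text_probs, image_probs)]
    total = sum(blended)
    blended = [b / total for b in blended]   # renormalize to sum to 1
    label = max(range(len(blended)), key=blended.__getitem__)
    return label, blended

# Usage: a confident image view can override a weakly confident text view.
label, blended = soft_vote([0.6, 0.4], [0.1, 0.9])
```

Unlike majority (hard) voting, this lets a confident classifier outweigh an uncertain one, which is useful when only two views are available.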
VI Conclusions and future work
In this paper, we introduced two enhancements to deep learning methods for image and text classification: (1) a Batch Averager layer to enable LLP deep learning, and (2) a co-training method for combining deep LLP models trained on different views. We found that a deep model trained on population-level data with proper constraints is comparable to traditional supervised learning. This approach decreases the burden of human annotation and can readily be implemented with almost any publicly available deep learning software package.
We also found that for applications with two feature views (image and text), a co-training algorithm improves classification for both views, and that it can be further enhanced by ensemble learning to achieve the highest F1.
In the future, we will investigate additional co-training algorithms for LLP, particularly investigating methods for improving pseudo-bag generation and applying it to a larger set of unlabeled data.
This research was funded in part by the National Science Foundation under grants #IIS-1526674 and #IIS-1618244, and CCC Information Services Inc. provided a server (with multiple GPUs) to run deep learning models.
-  K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” CoRR, vol. abs/1409.1556, 2014. [Online]. Available: http://arxiv.org/abs/1409.1556
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” CoRR, vol. abs/1512.03385, 2015. [Online]. Available: http://arxiv.org/abs/1512.03385
-  F. Chollet, “Xception: Deep learning with depthwise separable convolutions,” CoRR, vol. abs/1610.02357, 2016. [Online]. Available: http://arxiv.org/abs/1610.02357
-  C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Computer Vision and Pattern Recognition (CVPR), 2015. [Online]. Available: http://arxiv.org/abs/1409.4842
-  C. Szegedy, S. Ioffe, and V. Vanhoucke, “Inception-v4, inception-resnet and the impact of residual connections on learning,” CoRR, vol. abs/1602.07261, 2016. [Online]. Available: http://arxiv.org/abs/1602.07261
-  O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet Large Scale Visual Recognition Challenge,” International Journal of Computer Vision (IJCV), vol. 115, no. 3, pp. 211–252, 2015.
-  J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, “How transferable are features in deep neural networks?” CoRR, vol. abs/1411.1792, 2014. [Online]. Available: http://arxiv.org/abs/1411.1792
-  G. S. Mann and A. McCallum, “Simple, robust, scalable semi-supervised learning via expectation regularization,” in Proceedings of the 24th International Conference on Machine Learning, ser. ICML ’07. New York, NY, USA: ACM, 2007, pp. 593–600. [Online]. Available: http://doi.acm.org/10.1145/1273496.1273571
-  K.-T. Lai, F. X. Yu, M.-S. Chen, and S.-F. Chang, “Video event detection by inferring temporal instance labels,” in Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, ser. CVPR ’14. Washington, DC, USA: IEEE Computer Society, 2014, pp. 2251–2258. [Online]. Available: http://dx.doi.org/10.1109/CVPR.2014.288
-  J. Graça, K. Ganchev, and B. Taskar, “Expectation maximization and posterior constraints.” in NIPS, vol. 20, 2007, pp. 569–576.
-  S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” CoRR, vol. abs/1502.03167, 2015. [Online]. Available: http://arxiv.org/abs/1502.03167
-  R. Al-Rfou, G. Alain, A. Almahairi, C. Angermüller, and Others, “Theano: A python framework for fast computation of mathematical expressions,” CoRR, vol. abs/1605.02688, 2016. [Online]. Available: http://arxiv.org/abs/1605.02688
-  M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, and Others, “Tensorflow: Large-scale machine learning on heterogeneous distributed systems,” CoRR, vol. abs/1603.04467, 2016. [Online]. Available: http://arxiv.org/abs/1603.04467
-  T. Sakaki, M. Okazaki, and Y. Matsuo, “Earthquake shakes twitter users: Real-time event detection by social sensors,” in Proceedings of the 19th International Conference on World Wide Web, ser. WWW ’10. New York, NY, USA: ACM, 2010, pp. 851–860. [Online]. Available: http://doi.acm.org/10.1145/1772690.1772777
-  A. Blum and T. Mitchell, “Combining labeled and unlabeled data with co-training,” in Proceedings of the Eleventh Annual Conference on Computational Learning Theory, ser. COLT’ 98. New York, NY, USA: ACM, 1998, pp. 92–100. [Online]. Available: http://doi.acm.org/10.1145/279943.279962
-  C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” CoRR, vol. abs/1512.00567, 2015. [Online]. Available: http://arxiv.org/abs/1512.00567
-  L. Sifre and S. Mallat, “Rigid-motion scattering for texture classification,” CoRR, vol. abs/1403.1687, 2014. [Online]. Available: http://arxiv.org/abs/1403.1687
-  D. Kotzias, M. Denil, N. de Freitas, and P. Smyth, “From group to individual labels using deep features,” in Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD ’15. New York, NY, USA: ACM, 2015, pp. 597–606. [Online]. Available: http://doi.acm.org/10.1145/2783258.2783380
-  F. Li and G. Taylor, “Alter-cnn: An approach to learning from label proportions with application to ice-water classification,” in Neural Information Processing Systems Workshops (NIPSW) on Learning and privacy with incomplete data and weak supervision, 2015.
-  N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” J. Mach. Learn. Res., vol. 15, no. 1, pp. 1929–1958, Jan. 2014. [Online]. Available: http://dl.acm.org/citation.cfm?id=2627435.2670313
-  N. Zhang, M. Paluri, M. Ranzato, T. Darrell, and L. D. Bourdev, “PANDA: pose aligned networks for deep attribute modeling,” CoRR, vol. abs/1311.5591, 2013. [Online]. Available: http://arxiv.org/abs/1311.5591
-  Z. Liu, P. Luo, X. Wang, and X. Tang, “Deep learning face attributes in the wild,” CoRR, vol. abs/1411.7766, 2014. [Online]. Available: http://arxiv.org/abs/1411.7766
-  R. Ranjan, V. M. Patel, and R. Chellappa, “Hyperface: A deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition,” CoRR, vol. abs/1603.01249, 2016. [Online]. Available: http://arxiv.org/abs/1603.01249
-  S. Gupta, J. Kim, K. Grauman, and R. Mooney, “Watch, listen & learn: Co-training on captioned images and videos,” in Proceedings of the 2008 European Conference on Machine Learning and Knowledge Discovery in Databases - Part I, ser. ECML PKDD ’08. Berlin, Heidelberg: Springer-Verlag, 2008, pp. 457–472. [Online]. Available: http://dx.doi.org/10.1007/978-3-540-87479-9_48
-  P. A. Viola and M. J. Jones, “Rapid object detection using a boosted cascade of simple features,” in 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001), with CD-ROM, 8-14 December 2001, Kauai, HI, USA. IEEE Computer Society, 2001, pp. 511–518. [Online]. Available: http://dx.doi.org/10.1109/CVPR.2001.990517
-  E. Mohammady and A. Culotta, “Using county demographics to infer attributes of twitter users,” in ACL Joint Workshop on Social Dynamics and Personal Attributes in Social Media, 2014.
-  D. S. Ma, J. Correll, and B. Wittenbrink, “The chicago face database: A free stimulus set of faces and norming data,” Behavior Research Methods, vol. 47, no. 4, pp. 1122–1135, 2015. [Online]. Available: http://dx.doi.org/10.3758/s13428-014-0532-5
-  D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” CoRR, vol. abs/1412.6980, 2014. [Online]. Available: http://arxiv.org/abs/1412.6980