# Deep Co-Training for Semi-Supervised Image Recognition

## Abstract

In this paper, we study the problem of semi-supervised image recognition, which is to learn classifiers using both labeled and unlabeled images. We present Deep Co-Training, a deep learning based method inspired by the Co-Training framework [[1]]. The original Co-Training learns two classifiers on two views which are data from different sources that describe the same instances. To extend this concept to deep learning, Deep Co-Training trains multiple deep neural networks to be the different views and exploits adversarial examples to encourage view difference, in order to prevent the networks from collapsing into each other. As a result, the co-trained networks provide different and complementary information about the data, which is necessary for the Co-Training framework to achieve good results. We test our method on SVHN, CIFAR-10/100 and ImageNet datasets, and our method outperforms the previous state-of-the-art methods by a large margin.

###### Keywords:

Co-Training, Deep Networks, Semi-Supervised Learning

## 1 Introduction

Deep neural networks achieve state-of-the-art performance in many visual tasks such as image recognition [[2], [3], [4], [5], [6], [7], [8], [9], [10]] and object detection [[11], [12], [13], [14]]. However, training networks requires large-scale labeled datasets [[15], [16]], which are usually expensive and difficult to collect. Given the massive amounts of unlabeled natural images, the idea of using datasets without human annotations becomes very appealing [[17]]. In this paper, we study semi-supervised learning with a particular interest in the image recognition problem, where the task is to use unlabeled images in addition to labeled images in order to build better classifiers.

Formally, in the semi-supervised image recognition problem [[18], [19], [20]], we are provided with an image dataset $\mathcal{D} = \mathcal{S} \cup \mathcal{U}$ where images in $\mathcal{S}$ are labeled and images in $\mathcal{U}$ are not. The task is to build classifiers on the categories in $\mathcal{S}$ using the data in $\mathcal{D}$. The test data contains only the categories that appear in $\mathcal{S}$. The problem of learning models on supervised datasets has been extensively studied, and the state-of-the-art methods are deep convolutional networks [[2], [3]]. The core problem is how to use the unlabeled $\mathcal{U}$ to help learning on $\mathcal{S}$.

The method proposed in this paper is inspired by the Co-Training framework [[1]], which is an award-winning method for semi-supervised learning in general. It assumes that each data point $x$ in $\mathcal{D}$ has two views, i.e. $x$ is given as $x = (v_1, v_2)$, and each view $v_i$ is sufficient for learning an effective model. For example, the views can have different data sources [[1]] or different representations [[21], [22], [23]]. Let $\mathcal{X}$ be the distribution that $\mathcal{D}$ is drawn from. Co-Training assumes that $f_1$ and $f_2$ trained on view $v_1$ and $v_2$ respectively have consistent predictions on $\mathcal{X}$, i.e.,

$$f(x) = f_1(v_1) = f_2(v_2), \quad \forall x = (v_1, v_2) \sim \mathcal{X} \tag{1}$$

Based on this assumption, Co-Training proposes a dual-view self-training algorithm: it first learns a separate classifier for each view on $\mathcal{S}$, and then the predictions of the two classifiers on $\mathcal{U}$ are gradually added to $\mathcal{S}$ to continue the training. Blum and Mitchell [[1]] further show that under an additional assumption that the two views of each instance are conditionally independent given the category, Co-Training has PAC-like guarantees on semi-supervised learning.

Given the superior performance of deep neural networks on supervised image recognition, we are interested in extending the Co-Training framework to apply deep learning to semi-supervised image recognition. A naive implementation is to train two neural networks simultaneously on $\mathcal{D}$ by modeling Eq. 1. But this method suffers from a critical drawback: there is no guarantee that the views provided by the two networks give different and complementary information about each data point. Yet Co-Training is beneficial only if the two views are different, ideally conditionally independent given the category; after all, there is no point in training two identical networks. Moreover, the Co-Training assumption encourages the two models to make similar predictions on both $\mathcal{S}$ and $\mathcal{U}$, which can even lead to collapsed neural networks, as we will show by experiments in Section 3. Therefore, in order to extend the Co-Training framework to take advantage of deep learning, it is necessary to have a force that pushes the networks apart to balance the Co-Training assumption that pulls them together.

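The force we add to the Co-Training assumption is the *View Difference Constraint* formulated by Eq. 2, which encourages the networks to be different:

$$\exists \mathcal{X}': \; f_1(v_1) \neq f_2(v_2), \quad \forall x = (v_1, v_2) \sim \mathcal{X}' \tag{2}$$

The challenge is to find a proper and sufficient $\mathcal{X}'$ that is compatible with Eq. 1 (e.g. $\mathcal{X}' \cap \mathcal{X} = \emptyset$) and our tasks. We construct $\mathcal{X}'$ by adversarial examples [[24]].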

In this paper, we present Deep Co-Training (DCT) for semi-supervised image recognition, which extends the Co-Training framework without the drawback discussed above. Specifically, we model the Co-Training assumption by minimizing the expected Jensen-Shannon divergence between the predictions of the two networks on $\mathcal{U}$. To prevent the neural networks from collapsing into each other, we impose the view difference constraint by training each network to be resistant to the adversarial examples [[24], [25]] of the other. The result of the training is that each network can keep its predictions unaffected on the examples that the other network fails on. In other words, the two networks provide different and complementary information about the data because they are trained not to make errors at the same time on the adversarial examples for them. To summarize, the main contribution of DCT is a differentiable modeling that takes into account both the Co-Training assumption and the view difference constraint. It is an end-to-end solution which minimizes a loss function defined on the datasets $\mathcal{S}$ and $\mathcal{U}$. Naturally, we extend the dual-view DCT to a scalable multi-view DCT. We test our method on four datasets, SVHN [[26]], CIFAR-10/100 [[27]] and ImageNet [[15]], and DCT outperforms the previous state-of-the-art methods by a large margin.

## 2 Deep Co-Training

In this section, we present our model of Deep Co-Training (DCT) and naturally extend dual-view DCT to multi-view DCT. First, we formulate the Co-Training assumption and the view difference constraint in 2.1 and 2.2, respectively.

### 2.1 Co-Training Assumption in DCT

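We start with the dual-view case where we are interested in co-training two deep neural networks for image recognition. Following the notations in Section 1, we use $\mathcal{S}$ and $\mathcal{U}$ to denote the labeled and the unlabeled dataset. Let $\mathcal{D} = \mathcal{S} \cup \mathcal{U}$ denote all the provided data. Let $v_1(x)$ and $v_2(x)$ denote the two views of data $x$. In this paper, $v_1(x)$ and $v_2(x)$ are convolutional representations of $x$ before the final fully-connected layer $f_i(\cdot)$ that classifies $v_i(x)$ to one of the categories in $\mathcal{S}$. On the supervised dataset $\mathcal{S}$, we use the standard cross entropy loss

$$\mathcal{L}_{\mathrm{sup}}(x, y) = H\big(y, f_1(v_1(x))\big) + H\big(y, f_2(v_2(x))\big) \tag{3}$$

for any data $(x, y)$ in $\mathcal{S}$, where $y$ is the label for $x$ and $H(p, q)$ is the cross entropy between distributions $p$ and $q$. Training neural networks using only the supervision from $\mathcal{S}$ minimizes the expected loss $\mathbb{E}[\mathcal{L}_{\mathrm{sup}}]$ on all the data of $\mathcal{S}$.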

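Next, we model the Co-Training assumption. Co-Training assumes that on the distribution $\mathcal{X}$ where $x$ is drawn from, $f_1(v_1(x))$ and $f_2(v_2(x))$ agree on their predictions. In other words, we want networks $p_1(x) = f_1(v_1(x))$ and $p_2(x) = f_2(v_2(x))$ to have close predictions on $\mathcal{U}$. Therefore, we use a natural measure of similarity, the Jensen-Shannon divergence between $p_1(x)$ and $p_2(x)$, i.e.,

$$\mathcal{L}_{\mathrm{cot}}(x) = H\Big(\tfrac{1}{2}\big(p_1(x) + p_2(x)\big)\Big) - \tfrac{1}{2}\Big(H\big(p_1(x)\big) + H\big(p_2(x)\big)\Big) \tag{4}$$

where $x \in \mathcal{U}$ and $H(p)$ is the entropy of $p$. Training neural networks based on the Co-Training assumption minimizes the expected loss $\mathbb{E}[\mathcal{L}_{\mathrm{cot}}]$ on the unlabeled set $\mathcal{U}$. As for the labeled set $\mathcal{S}$, minimizing the loss $\mathcal{L}_{\mathrm{sup}}$ already encourages the networks to have close predictions on $\mathcal{S}$ since they are trained with labels; therefore, minimizing $\mathcal{L}_{\mathrm{cot}}$ on $\mathcal{S}$ is unnecessary, and we only implement it on $\mathcal{U}$ (i.e. not on $\mathcal{S}$).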
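As a small illustration of the co-training loss in Eq. 4, the following sketch computes the Jensen-Shannon divergence between two predicted class distributions as the entropy of their average minus the average of their entropies. It uses plain Python lists in place of network outputs; the function names `entropy`, `js_divergence` and `cot_loss` are illustrative choices and not taken from the authors' implementation.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)


def js_divergence(p1, p2):
    """Jensen-Shannon divergence: H(average) minus the average of the entropies."""
    mixture = [0.5 * (a + b) for a, b in zip(p1, p2)]
    return entropy(mixture) - 0.5 * (entropy(p1) + entropy(p2))


def cot_loss(batch_p1, batch_p2):
    """Mean JS divergence over a batch of predictions on unlabeled images."""
    pairs = list(zip(batch_p1, batch_p2))
    return sum(js_divergence(a, b) for a, b in pairs) / len(pairs)


# Two networks that agree exactly incur no loss.
agree = js_divergence([0.7, 0.2, 0.1], [0.7, 0.2, 0.1])
# Two confident but contradictory predictions reach the maximum, log 2.
disagree = js_divergence([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
```

The loss is symmetric in the two networks and bounded by $\log 2$, so neither network is treated as the teacher of the other.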

### 2.2 View Difference Constraint in DCT

The key condition for Co-Training to be successful is that the two views are different and provide complementary information about each data point $x$. But minimizing Eq. 3 and 4 only encourages the neural networks to output the same predictions on $\mathcal{D}$. Therefore, it is necessary to encourage the networks to be different and complementary. To achieve this, we create another set of images $\mathcal{D}'$ where $p_1(x) \neq p_2(x)$, $\forall x \in \mathcal{D}'$, which we will generate by adversarial examples [[24], [25]].

Since Co-Training assumes that $p_1(x) = p_2(x)$, $\forall x \in \mathcal{D}$, we know that $\mathcal{D} \cap \mathcal{D}' = \emptyset$. But $\mathcal{D}$ is all the data we have; therefore, $\mathcal{D}'$ must be built up by a generative method. On the other hand, suppose that $p_1(x)$ and $p_2(x)$ can achieve very high accuracy on naturally obtained data (e.g. $\mathcal{D}$); then assuming $p_1(x) \neq p_2(x)$, $\forall x \in \mathcal{D}'$, also implies that $\mathcal{D}'$ should be constructed by a generative method.

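We consider a simple form of generative method $g(x)$ which takes data $x$ from $\mathcal{D}$ to build $\mathcal{D}'$, i.e. $\mathcal{D}' = \{g(x) \mid x \in \mathcal{D}\}$. For any $x \in \mathcal{D}$, we want $g(x) - x$ to be small so that $g(x)$ also looks like a natural image. But when $g(x) - x$ is small, it is very possible that $p_1(g(x)) = p_1(x)$ and $p_2(g(x)) = p_2(x)$. Since Co-Training assumes $p_1(x) = p_2(x)$, $\forall x \in \mathcal{D}$, and we want $p_1(g(x)) \neq p_2(g(x))$, when $p_1(g(x)) = p_1(x)$, it follows that $p_2(g(x)) \neq p_2(x)$. These considerations imply that $g(x)$ is an adversarial example [[24]] of $p_2$ that fools the network $p_2$ but not the network $p_1$. Therefore, in order to prevent the deep networks from collapsing into each other, we propose to train the network $p_1$ (or $p_2$) to be resistant to the adversarial examples $g_2(x)$ of $p_2$ (or $g_1(x)$ of $p_1$) by minimizing the cross entropy between $p_2(x)$ and $p_1(g_2(x))$ (or between $p_1(x)$ and $p_2(g_1(x))$), i.e.,

$$\mathcal{L}_{\mathrm{dif}}(x) = H\big(p_1(x), p_2(g_1(x))\big) + H\big(p_2(x), p_1(g_2(x))\big) \tag{5}$$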

Using artificially created examples in image recognition has been studied before.
They can serve as regularization techniques to smooth outputs [[28]], or create negative examples to tighten decision boundaries [[20], [29]].
Here, they are used to make the networks different.
To summarize Co-Training with the view difference constraint in a sentence, we want the models to have *the same* predictions on $\mathcal{D}$ but make *different* errors when they are exposed to adversarial attacks.
By minimizing Eq. 5 on $\mathcal{D}$, we encourage the models to generate complementary representations, each of which is resistant to the adversarial examples of the other.

### 2.3 Training DCT

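In Deep Co-Training, the objective function is of the form

$$\mathcal{L} = \mathbb{E}_{(x, y) \in \mathcal{S}}\, \mathcal{L}_{\mathrm{sup}}(x, y) + \lambda_{\mathrm{cot}}\, \mathbb{E}_{x \in \mathcal{U}}\, \mathcal{L}_{\mathrm{cot}}(x) + \lambda_{\mathrm{dif}}\, \mathbb{E}_{x \in \mathcal{D}}\, \mathcal{L}_{\mathrm{dif}}(x) \tag{6}$$

which linearly combines Eq. 3, Eq. 4 and Eq. 5 with hyperparameters $\lambda_{\mathrm{cot}}$ and $\lambda_{\mathrm{dif}}$. We present one iteration of the training loop in Algorithm 2.3. The full training procedure repeats the computations in Algorithm 2.3 for many iterations and epochs using gradient descent with decreasing learning rates.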

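**Algorithm 2.3** One iteration of the training loop of Deep Co-Training.

**Data Sampling** Sample data batch $b_1$ for $p_1$ and $b_2$ for $p_2$ from $\mathcal{S}$ s.t. $|b_1| = |b_2|$. Sample data batch $b_u$ from $\mathcal{U}$.

**Create Adversarial Examples** Compute the adversarial examples $g_1(x)$ of $p_1$ for all $x \in b_1 \cup b_u$ and $g_2(x)$ of $p_2$ for all $x \in b_2 \cup b_u$ using FGSM [[24]].

**Update** Compute the gradients with respect to $\mathcal{L}$ by backpropagation and update the parameters of $p_1$ and $p_2$ using gradient descent.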
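The adversarial-example step and the view difference loss of Eq. 5 can be sketched on a toy problem. The example below is only an illustration under stated assumptions: each "network" is a linear softmax classifier given by a weight matrix, so the input gradient needed by FGSM has a closed form, and the attack target is the one-hot argmax of the network's own prediction (an assumption of this sketch, chosen so that the gradient is nonzero). All names, including `fgsm` and `dif_loss`, are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(weights, x):
    """Linear softmax classifier; weights holds one row of weights per class."""
    return softmax([sum(w * xi for w, xi in zip(row, x)) for row in weights])


def cross_entropy(target, pred):
    return -sum(t * math.log(max(q, 1e-12)) for t, q in zip(target, pred))


def one_hot_argmax(p):
    k = max(range(len(p)), key=p.__getitem__)
    return [1.0 if i == k else 0.0 for i in range(len(p))]


def sign(g):
    return (g > 0) - (g < 0)


def fgsm(weights, x, target, eps):
    """One FGSM step: move x by eps along the sign of d(cross entropy)/dx."""
    p = predict(weights, x)
    # For a linear softmax model, dCE/dx_j = sum_k (p_k - target_k) * W[k][j].
    grad = [sum((p[k] - target[k]) * weights[k][j] for k in range(len(weights)))
            for j in range(len(x))]
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]


def dif_loss(w1, w2, x, eps):
    """H(p1(x), p2(g1(x))) + H(p2(x), p1(g2(x))) for two toy networks."""
    p1, p2 = predict(w1, x), predict(w2, x)
    g1 = fgsm(w1, x, one_hot_argmax(p1), eps)  # adversarial example of network 1
    g2 = fgsm(w2, x, one_hot_argmax(p2), eps)  # adversarial example of network 2
    return cross_entropy(p1, predict(w2, g1)) + cross_entropy(p2, predict(w1, g2))


w1 = [[1.0, -0.5], [-1.0, 0.5]]
w2 = [[0.2, 1.0], [-0.2, -1.0]]
x = [0.3, 0.4]
target = one_hot_argmax(predict(w1, x))
x_adv = fgsm(w1, x, target, eps=0.1)
loss = dif_loss(w1, w2, x, eps=0.1)
```

In a deep network the closed-form gradient would be replaced by backpropagation to the input, but the structure stays the same: each network is asked to reproduce the other's clean prediction on the other's adversarial example.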

Note that in each iteration of the training loop of DCT, the two neural networks receive different supervised data.
This is to increase the difference between them by providing them with supervised data in different time orders.
Consider that the data of the two networks are provided by two data streams $s_1$ and $s_2$.
Each data $d_1$ from $s_1$ and $d_2$ from $s_2$ is of the form $[d_s, d_u]$, where $d_s$ and $d_u$ denote a batch of supervised data and unsupervised data, respectively.
We call $(s_1, s_2)$ *a bundle of data streams* if their $d_u$ are the same and the sizes of $d_s$ are the same.
Algorithm 2.3 uses a bundle of data streams to provide data to the two networks.
The idea of using bundles of data streams is important for scalable multi-view Deep Co-Training, which we will present in the following subsections.
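A bundle of data streams, as described above, can be sketched as a generator that yields one batch per network at every iteration: the unsupervised parts are identical, while the supervised parts have equal size but come from independently shuffled orders. This is a minimal sketch with hypothetical names (`make_bundle`, `sup_size`, `unsup_size`) and integer ids standing in for images; it is not the authors' data loader.

```python
import random


def make_bundle(labeled, unlabeled, sup_size, unsup_size, seed=0):
    """Yield (batch_for_net1, batch_for_net2) pairs indefinitely.

    Both batches share the same unsupervised part; the supervised parts
    have the same size but follow independently shuffled orders.
    """
    rng = random.Random(seed)
    order1, order2 = labeled[:], labeled[:]
    rng.shuffle(order1)
    rng.shuffle(order2)
    n = len(labeled)
    step = 0
    while True:
        start = (step * sup_size) % n
        sup1 = [order1[(start + k) % n] for k in range(sup_size)]
        sup2 = [order2[(start + k) % n] for k in range(sup_size)]
        unsup = rng.sample(unlabeled, unsup_size)
        yield (sup1, unsup), (sup2, unsup)
        step += 1


bundle = make_bundle(list(range(10)), list(range(100, 150)), sup_size=2, unsup_size=4)
(first_sup, first_unsup), (second_sup, second_unsup) = next(bundle)
```

Because the unsupervised part is shared, the two networks can be compared on exactly the same unlabeled images at every iteration.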

### 2.4 Multi-View DCT

In the previous subsection, we introduced our model of dual-view Deep Co-Training. But dual-view is only a special case of multi-view learning, and multi-view co-training has also been studied for other problems [[30], [31]]. In this subsection, we present a scalable method for multi-view Deep Co-Training. Here, scalability means that the hyperparameters $\lambda_{\mathrm{cot}}$ and $\lambda_{\mathrm{dif}}$ in Eq. 6 that work for dual-view DCT are also suitable for increased numbers of views. Recall that in the previous subsection, we proposed the concept of a bundle of data streams, which provides data to the two neural networks in the dual-view setting. Here, we will use multiple data stream bundles to provide data to different views so that the dual-view DCT can be adapted to the multi-view setting.

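Specifically, we consider $n$ views $v_i(\cdot)$, $i = 1, \ldots, n$, in the multi-view DCT. We assume that $n$ is an even number for simplicity of presenting the multi-view algorithm. Next, we build $n/2$ independent data stream bundles $B_1, \ldots, B_{n/2}$. Let $B_i(t)$ denote the training data that bundle $B_i$ provides at iteration $t$. Let $\mathcal{L}(v_i, v_j, B_k(t))$ denote the loss $\mathcal{L}$ of Algorithm 2.3 when dual training $v_i$ and $v_j$ using data $B_k(t)$. Then, at each iteration $t$, we consider the training scheme implied by the following loss function

$$\mathcal{L}_{\text{fake } n\text{-view}}(t) = \sum_{i=1}^{n/2} \mathcal{L}\big(v_{2i-1}, v_{2i}, B_i(t)\big) \tag{7}$$

We call this *fake multi-view DCT* because Eq. 7 can be considered as $n/2$ independent dual-view DCTs.
Next, we adapt Eq. 7 to the *real multi-view DCT*.
In our multi-view DCT, at each iteration $t$, we consider an index list $l$ randomly shuffled from {1, 2, .., n}.
Then, we use the following training loss function

$$\mathcal{L}_{n\text{-view}}(t) = \sum_{i=1}^{n/2} \mathcal{L}\big(v_{l_{2i-1}}, v_{l_{2i}}, B_i(t)\big) \tag{8}$$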

Compared with Eq. 7, Eq. 8 randomly chooses a pair of views to train for each data stream bundle at each iteration. The benefits of this modeling are multifold. Firstly, Eq. 8 is converted from $n/2$ independent dual-view trainings; therefore, the hyperparameters for the dual-view setting are also suitable for multi-view settings. Thus, we can save effort in tuning parameters for different numbers of views. Secondly, because of the relationship between Eq. 7 and Eq. 8, we can directly compare the training dynamics between different numbers of views. Thirdly, compared with computing the expected loss on all the possible pairs and data at each iteration, this modeling is also computationally efficient.
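The random pairing behind Eq. 8 can be sketched in a few lines: shuffle the view indices at each iteration, pair consecutive entries, and hand each pair the batch from one data stream bundle. The helper names `pair_views` and `multi_view_loss` are hypothetical, views are indexed from 0 rather than 1, and `dual_loss` stands in for the dual-view loss of Eq. 6.

```python
import random


def pair_views(n_views, rng):
    """Shuffle the view indices and pair consecutive entries, one pair per bundle."""
    if n_views % 2 != 0:
        raise ValueError("this sketch assumes an even number of views")
    order = list(range(n_views))
    rng.shuffle(order)
    return [(order[2 * i], order[2 * i + 1]) for i in range(n_views // 2)]


def multi_view_loss(n_views, bundle_batches, dual_loss, rng):
    """Sum the dual-view loss over randomly paired views, one pair per bundle."""
    pairs = pair_views(n_views, rng)
    return sum(dual_loss(a, b, batch) for (a, b), batch in zip(pairs, bundle_batches))


rng = random.Random(0)
pairs = pair_views(8, rng)
# A constant dual-view loss makes the sum easy to check: n/2 terms of 1.0 each.
total = multi_view_loss(4, ["batch_0", "batch_1"], lambda a, b, batch: 1.0, rng)
```

Each iteration evaluates only $n/2$ dual-view losses rather than all $n(n-1)/2$ possible pairs, which is the computational saving mentioned above.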

### 2.5 Implementation Details

To fairly compare with the previous state-of-the-art methods, we use the training and evaluation framework of Laine and Aila [[19]]. We port their implementation to PyTorch for easy multi-GPU support. Our multi-view implementation automatically spreads the models across different devices to maximize utilization. For SVHN and CIFAR, we use a network architecture similar to [[19]]: we only change their weight normalization and mean-only batch normalization layers [[32]] to the natively supported batch normalization layers [[33]]. This change results in performance slightly worse than, but close to, that reported in their paper. [[19]] is thus the most natural baseline. For ImageNet, we use a small model, ResNet-18 [[2]], for fast experiments. In the following, we introduce the datasets SVHN, CIFAR and ImageNet, and how we train our models on them.

**SVHN** The Street View House Numbers (SVHN) dataset [[26]] contains real-world images of house numbers, each of which is of size $32 \times 32$.
The label for each image is the centermost digit.
Therefore, this is a classification problem with 10 categories.
Following Laine and Aila [[19]], we only use 1000 images out of 73257 official training images as the supervised part $\mathcal{S}$ to learn the models and the full test set of 26032 images for testing.
The remaining 72257 images are considered as the unsupervised part $\mathcal{U}$.
We train our method with the standard data augmentation, and our method significantly outperforms the previous state-of-the-art methods.
Here, the data augmentation is only the random translation by at most 2 pixels.
We do not use any other types of data augmentation.

**CIFAR** CIFAR [[27]] has two image datasets, CIFAR-10 and CIFAR-100.
Both of them contain color natural images of size $32 \times 32$,
while CIFAR-10 includes 10 categories and CIFAR-100 contains 100 categories.
Both of them have 50000 images for training and 10000 images for testing.
Following Laine and Aila [[19]], for CIFAR-10, we only use 4000 images out of 50000 training images as the supervised part $\mathcal{S}$ and the remaining 46000 images are used as the unsupervised part $\mathcal{U}$. As for CIFAR-100, we use 10000 images out of 50000 training images as the supervised part $\mathcal{S}$ and the remaining 40000 images as the unsupervised part $\mathcal{U}$.
We use the full 10000 test images for evaluation for both CIFAR-10 and CIFAR-100.
We train our methods with the standard data augmentation, which is the combination of random horizontal flip and translation by at most 2 pixels.

**ImageNet** The ImageNet dataset contains about 1.3 million natural color images for training and 50000 images for validation.
The dataset includes 1000 categories, each of which typically has 1300 images for training and 50 for evaluation.
Following the prior work that reported results on ImageNet [[18], [34], [35]], we uniformly choose 10% of the data from the 1.3 million training images as supervised $\mathcal{S}$ and the rest as unsupervised $\mathcal{U}$.
We report the single center crop error rates on the validation set.
We train our models with data augmentation, which includes random resized crop to $224 \times 224$ and random horizontal flip.
We do not use other advanced augmentation techniques such as color jittering or PCA lighting [[4]].

For SVHN and CIFAR, following [[19]], we use a warmup scheme for the hyperparameters and . Specifically, we warmup them in the first epochs such that when the epoch , and after that. For SVHN and CIFAR, we set . For SVHN and CIFAR-10, , and for CIFAR-100 . For training, we train the networks using stochastic gradient descent with momentum and weight decay . The total number of training epochs is and we use a cosine learning rate schedule at epoch . The batch size is set to for SVHN, CIFAR-10 and CIFAR-100.

For ImageNet, we choose a different training scheme. Before using any data from , we first train two ResNet-18 individually with different initializations and training sequences on only the labeled data . Following ResNet [[2]], we train the models using stochastic gradient descent with momentum , weight decay and batch size for epochs, the time of which is the same as training epochs with full supervision. The learning rate is initialized as and multiplied by at the st epoch. Then, we take the two pre-trained models to our unsupervised training loops. This time, we directly set to the maximum values because the previous epochs have already warmed up the models. Here, and . In the unsupervised loops, we use a cosine learning rate and we train the networks for epochs on both and . The batch size is set to .

To make the loss $\mathcal{L}$ stable across different training iterations, we require that each data stream provides data batches whose proportions of supervised data are close to the ratio of the size of $\mathcal{S}$ to the size of $\mathcal{D}$. To achieve this, we evenly divide the supervised and the unsupervised data to build each data batch in the data streams. As a result, the difference in the number of supervised images between any two batches is no greater than 1.
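One simple way to realize such an even division is to give every batch the integer quotient of labeled images and hand the remainder out one image at a time. The sketch below only illustrates that counting argument; the function name and the example sizes (1000 labeled images over 733 batches) are hypothetical and not taken from the paper.

```python
def supervised_counts(n_labeled, n_batches):
    """Number of labeled images per batch when spread as evenly as possible."""
    base, extra = divmod(n_labeled, n_batches)
    # The first `extra` batches receive one additional labeled image.
    return [base + 1 if i < extra else base for i in range(n_batches)]


counts = supervised_counts(n_labeled=1000, n_batches=733)
```

Every labeled image is used exactly once per pass, and any two batches differ by at most one labeled image, so the supervised fraction of each batch stays close to the global ratio.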

## 3 Results

In this section, we present the experimental results on four datasets, i.e. SVHN [[26]], CIFAR-10, CIFAR-100 [[27]] and ImageNet [[15]].

Method | SVHN | CIFAR-10 |
---|---|---|
GAN [[36]] | | |
Stochastic Transformations [[18]] | – | |
Π Model [[19]] | | |
Temporal Ensembling [[19]] | | |
Mean Teacher [[35]] | | |
Bad GAN [[20]] | | |
VAT [[28]] | | |
Deep Co-Training with 2 Views | | |
Deep Co-Training with 4 Views | | |
Deep Co-Training with 8 Views | | |

### 3.1 SVHN and CIFAR-10

SVHN and CIFAR-10 are the datasets that the previous state-of-the-art methods for semi-supervised image recognition mostly focus on. Therefore, we first present the performances of our method and show the comparisons with the previous state-of-the-art methods on these two datasets. Next, we will also provide ablation studies on the two datasets for better understandings of the dynamics and characteristics of dual-view and multi-view Deep Co-Training.

Table 1 compares our method Deep Co-Training with the previous state-of-the-art methods on the SVHN and CIFAR-10 datasets. To make sure these methods are fairly compared, we do not ensemble the models of our method even though there are multiple well-trained models after the entire training procedure. Instead, we only report the average performance of those models. Compared with other state-of-the-art methods, Deep Co-Training achieves significant performance improvements when 2, 4 or 8 views are used. As we will discuss in Section 4, all the methods listed in Table 1 require implicit or explicit computations of multiple models, e.g. GAN [[36]] has a discriminative and a generative network, Bad GAN [[20]] adds another encoder network based on GAN, and Mean Teacher [[35]] has an additional EMA model. Therefore, the dual-view Deep Co-Training does not require more computation in terms of the total number of networks.

Another trend we observe is that although 4-view DCT gives significant improvements over 2-view DCT, we do not see similar improvements when we increase the number of views to 8. We speculate that this is because, compared with 2 views, 4 views can use the majority vote rule when we encourage them to have close predictions on $\mathcal{U}$. When we increase the number of views to 8, although it is expected to perform better, the advantages over 4 views are not as strong as those of 4 views over 2 views. But 8-view DCT converges faster than 4-view DCT, which in turn converges faster than dual-view DCT. The training dynamics of DCT with different numbers of views will be presented in the later subsections. We first provide our results on the CIFAR-100 and ImageNet datasets in the next subsection.

Method | CIFAR-100 | CIFAR-100+ |
---|---|---|
Π Model [[19]] | | |
Temporal Ensembling [[19]] | – | |
Dual-View Deep Co-Training | | |

### 3.2 CIFAR-100 and ImageNet

Compared with SVHN and CIFAR-10, CIFAR-100 and ImageNet are considered harder benchmarks [[19]] for the semi-supervised image recognition problem because their numbers of categories are 100 and 1000, respectively, greater than the 10 categories in SVHN and CIFAR-10. Here, we provide our results on these two datasets. Table 2 compares our method with the previous state-of-the-art methods that report performance on the CIFAR-100 dataset, i.e. Π Model and Temporal Ensembling [[19]]. Dual-view Deep Co-Training, even without data augmentation, achieves performance similar to that of the previous state-of-the-art methods that use data augmentation. When our method also uses data augmentation, the error rate drops significantly. These results demonstrate the effectiveness of the proposed Deep Co-Training when the number of categories and the difficulty of the datasets increase.

Method | Architecture | # Param | Top-1 Error (%) | Top-5 Error (%) |
---|---|---|---|---|
Stochastic Transformations [[18]] | AlexNet | 61.1M | – | 39.84 |
VAE [[34]] with 10% Supervised | Customized | 30.6M | 51.59 | 35.24 |
Mean Teacher [[35]] | ResNet-18 | 11.6M | 49.07 | 23.59 |
100% Supervised | ResNet-18 | 11.6M | 30.43 | 10.76 |
10% Supervised | ResNet-18 | 11.6M | 52.23 | 27.54 |
Dual-View Deep Co-Training | ResNet-18 | 11.6M | 46.50 | 22.73 |

Next, we show our results on ImageNet with 1000 categories and 10% of the images labeled in Table 3. Our method performs better than the supervised-only baseline but is still behind the accuracy obtained when 100% supervision is used. When compared with the previous state-of-the-art methods, however, DCT shows significant improvements on both the Top-1 and Top-5 error rates. Here, the performances of [[18]] and [[34]] are quoted from their papers, and the performance of Mean Teacher [[35]] with ResNet-18 [[2]] is from running their official implementation on GitHub. When using the same architecture, DCT outperforms Mean Teacher by 2.57% for Top-1 error rate (49.07 vs. 46.50), and 0.86% for Top-5 error rate (23.59 vs. 22.73). Compared with [[18]] and [[34]], which use networks with more parameters and larger input sizes, Deep Co-Training also achieves lower error rates.

### 3.3 Ablation Study

In this subsection, we will provide several ablation studies for better understandings of our proposed Deep Co-Training method.

#### On $\mathcal{L}_{\mathrm{cot}}$ and $\mathcal{L}_{\mathrm{dif}}$

Recall that the loss function used in Deep Co-Training has three parts, the supervision loss $\mathcal{L}_{\mathrm{sup}}$, the co-training loss $\mathcal{L}_{\mathrm{cot}}$ and the view difference constraint $\mathcal{L}_{\mathrm{dif}}$. Both $\mathcal{L}_{\mathrm{cot}}$ and $\mathcal{L}_{\mathrm{dif}}$ provide certain amounts of supervision based on different assumptions. Therefore, it is of interest to study the changes when the loss functions $\mathcal{L}_{\mathrm{cot}}$ and $\mathcal{L}_{\mathrm{dif}}$ are used alone in addition to $\mathcal{L}_{\mathrm{sup}}$ in $\mathcal{L}$. Fig. 1 shows the plots of the training dynamics of Deep Co-Training when different loss functions are used on the SVHN and CIFAR-10 datasets. In both plots, the blue lines represent the loss function that we use in practice in training DCT, the green lines represent the case where only the co-training loss $\mathcal{L}_{\mathrm{cot}}$ and $\mathcal{L}_{\mathrm{sup}}$ are applied, and the orange lines represent the case where only the view difference constraint $\mathcal{L}_{\mathrm{dif}}$ and $\mathcal{L}_{\mathrm{sup}}$ are used. From Fig. 1, we can see that the Co-Training assumption ($\mathcal{L}_{\mathrm{cot}} + \mathcal{L}_{\mathrm{sup}}$) performs better at the beginning, but is soon overtaken by $\mathcal{L}_{\mathrm{dif}} + \mathcal{L}_{\mathrm{sup}}$. $\mathcal{L}_{\mathrm{cot}} + \mathcal{L}_{\mathrm{sup}}$ even falls into an extreme case on the SVHN dataset where its validation accuracy drops suddenly partway through training. However, the losses $\mathcal{L}_{\mathrm{sup}}$ and $\mathcal{L}_{\mathrm{cot}}$ around that epoch look smooth and normal. We speculate that this is because the networks have collapsed into each other, which motivates us to investigate the dynamics of the loss $\mathcal{L}_{\mathrm{dif}}$. If our speculation is correct, there will also be abnormalities in the loss $\mathcal{L}_{\mathrm{dif}}$ around that epoch, which indeed we show in the next subsection. Moreover, this also supports our argument at the beginning of the paper that a force to push models apart is necessary for co-training multiple neural networks for semi-supervised learning.

#### On the view difference

This is a sanity check on whether, in dual-view training, two models tend to collapse into each other when we only model the Co-Training assumption, and whether $\mathcal{L}_{\mathrm{dif}}$ can push them apart during training. To study this, we plot $\mathcal{L}_{\mathrm{dif}}$ when it is minimized as in Deep Co-Training and when it is not minimized, i.e. $\lambda_{\mathrm{dif}} = 0$. Fig. 2 shows the plots of $\mathcal{L}_{\mathrm{dif}}$ for the SVHN dataset and the CIFAR dataset, which correspond to the validation accuracies shown in Fig. 1. It is clear that when $\mathcal{L}_{\mathrm{dif}}$ is not minimized, as in the “$\mathcal{L}_{\mathrm{cot}} + \mathcal{L}_{\mathrm{sup}}$” case, $\mathcal{L}_{\mathrm{dif}}$ is far greater than 0, indicating that each model is vulnerable to the adversarial examples of the other. Like the extreme case we observe in Fig. 1 for the SVHN dataset (left), we also see a sudden increase of $\mathcal{L}_{\mathrm{dif}}$ here in Fig. 2 for SVHN at similar epochs. This means that every adversarial example of one model fools the other model, i.e. they collapse into each other. The collapse directly causes a significant drop of the validation accuracy in the left of Fig. 1. These experimental results demonstrate the positive correlation between the view difference and the validation error. They also show that the models in dual-view training tend to collapse into each other when no force is applied to push them apart. Finally, these results also support the effectiveness of our proposed $\mathcal{L}_{\mathrm{dif}}$ as a loss function to increase the difference between the models.

#### On the number of views

We have provided the performances of Deep Co-Training with different numbers of views for SVHN and CIFAR-10 datasets in Table 1, where we show that increasing the number of the views from 2 to 4 improves the performances of each individual model. But we also observe that the improvement becomes smaller when we further increase the number of views to 8. In Fig. 3, we show the training dynamics of Deep Co-Training when different numbers of views are trained simultaneously.

As shown in Fig. 3, we observe a faster convergence speed when we increase the number of views trained simultaneously. We focus on the epochs from 100 to 200 where the differences between different numbers of views are clearest. The performances of different views are directly comparable because of the scalability of the proposed multi-view Deep Co-Training. Like the improvements of 8 views over 4 views on the final validation accuracy, the improvements in convergence speed also decrease compared with those of 4 views over 2 views. The experimental results in Table 1 and here suggest that 4-view DCT achieves a good balance between performance and computational efficiency.

## 4 Discussions

In this section, we discuss the relationship between Deep Co-Training and the previous methods. We also present perspectives alternative to the Co-Training framework for discussing Deep Co-Training.

### 4.1 Related Work

Deep Co-Training is also inspired by the recent advances in semi-supervised image recognition techniques [[18], [19], [28], [37], [38]] which train deep neural networks $p(\cdot)$ to be resistant to noises $\epsilon(x)$, i.e. $p(x) = p(x + \epsilon(x))$. We notice that their computations in one iteration require double feedforwardings and backpropagations, one for $x$ and one for $x + \epsilon(x)$. We ask the question: what would happen if we train two individual models, given that doing so requires the same amount of computation? We soon realized that training two models and encouraging them to have close predictions is related to the Co-Training framework [[1]], which has good theoretical results, provided that the two models are conditionally independent given the category. However, training models with only the Co-Training assumption is not sufficient for getting good performance because the models tend to collapse into each other, which destroys the view difference between models that the Co-Training framework requires.

As stated in Section 2.2, we need a generative method to generate images on which two models predict differently. Generative Adversarial Networks (GANs) [[20], [36], [39]] are popular generative models for vision problems, and have also been used for semi-supervised image recognition. A problem of GANs is that they will introduce new networks to the Co-Training framework for generating images, which also need to be learned. Compared with GANs, Introspective Generative Models [[29], [40]] can generate images from discriminative models in a lightweight manner, which bears some similarities with the adversarial examples [[24]]. The generative methods that use discriminative models also include DeepDream [[41]], Neural Artistic Style [[42]], etc. We use adversarial examples in our Deep Co-Training because they apply naturally to preventing models from collapsing into each other: each model is trained with the adversarial examples of the others.

Before the work discussed above, semi-supervised learning in general has already been widely studied. For example, the mutual-exclusivity loss used in [[18]] and the entropy minimization used in [[28]] resemble soft implementations of the self-training technique [[43], [44]], one of the earliest approaches for semi-supervised classification tasks. [[17]] provides a good survey for the semi-supervised learning methods in general.

### 4.2 Alternative Perspectives

In this subsection, we discuss the proposed Deep Co-Training method from several perspectives alternative to the Co-Training framework.

#### Model Ensemble

Ensembling multiple independently trained models to get a more accurate and stable classifier is a widely used technique to achieve higher performances [[45]]. This is also applicable to deep neural networks [[46], [47]]. In other words, this suggests that when multiple networks with the same architecture are initialized differently and trained using data sequences in different time orders, they can achieve similar performances but in a complementary way [[48]]. In multi-view Deep Co-Training, we also train multiple models in parallel, but not independently, and our evaluation is done by taking one of them as the final classifier instead of averaging their predicted probabilities. Deep Co-Training in effect is searching for an initialization-free and data-order-free solution.

#### Multi-Agent Learning

After a literature review of the most recent semi-supervised learning methods for image recognition, we find that almost all of them are within the multi-agent learning framework [[49]]. To name a few, GAN-based methods at least have a discriminative network and a generative network. Bad GAN [[20]] adds an encoder network based on GAN. The agents in GANs interact in an adversarial way. As we stated in Section 4.1, the methods that train deep networks to be resistant to noises also exhibit the interacting behaviors that two individual models would have, i.e. double feedforwardings and backpropagations. The agents in these methods interact in a cooperative way. Deep Co-Training explicitly models cooperative multi-agent learning, which trains multiple agents from the supervised data and from cooperative interactions between different agents. In the multi-agent learning framework, $\mathcal{L}_{\mathrm{dif}}$ can be understood as learning from the errors of the others, and the loss function Eq. 8 resembles the simulation of interactions within a crowd of agents.

#### Knowledge Distillation

One characteristic of Deep Co-Training is that the models learn not only from the supervised data but also from the predictions of the other models. This is reminiscent of knowledge distillation [[50]], where student models learn from teacher models instead of from the supervision in the datasets. In Deep Co-Training, all the models are students, and they learn not only from the predictions of the other student models but also from the errors those models make.
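
As an illustration of students learning from each other's predictions, the sketch below computes a symmetric agreement signal between two models' outputs on one unlabeled image. We use the Jensen-Shannon divergence here as an assumption for illustration; it should be checked against the loss definitions given earlier in the paper before being read as the exact training objective. The function names are hypothetical.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0.0)


def js_divergence(p1, p2):
    """Jensen-Shannon divergence: H(mean) - mean of the two entropies.

    It is symmetric in its arguments, so neither model plays the role
    of a fixed teacher; each is pulled toward the other's prediction.
    """
    mixture = [(a + b) / 2.0 for a, b in zip(p1, p2)]
    return entropy(mixture) - 0.5 * (entropy(p1) + entropy(p2))


# Hypothetical predictions of two student models on one unlabeled image.
student_a = [0.7, 0.2, 0.1]
student_b = [0.4, 0.5, 0.1]

disagreement = js_divergence(student_a, student_b)
```

The value is zero when the two students agree exactly and reaches its maximum of log 2 when they put all their mass on different classes, so minimizing it on unlabeled data drives the students toward similar predictions.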

## 5 Conclusion

In this paper, we present Deep Co-Training, a method for semi-supervised image recognition. It extends the Co-Training framework, which assumes that the data has two complementary views from which two effective classifiers can be built, and that these classifiers make close predictions on the unlabeled images. Motivated by the recent successes of deep neural networks in supervised image recognition, we apply deep networks within the Co-Training framework to the task of semi-supervised image recognition. In our experiments, we notice that the models easily collapse into each other, which violates the view difference requirement of the Co-Training framework. To prevent the models from collapsing, we use adversarial examples to generate data on which the views have different predictions. The experiments show that this additional force, which pushes the models apart, is helpful for training and improves accuracy significantly compared with Co-Training-only modeling.

Since Co-Training is a special case of multi-view learning, we also naturally extend the dual-view DCT to a scalable multi-view Deep Co-Training method in which the hyperparameters chosen for two views remain suitable for larger numbers of views. We test our proposed Deep Co-Training on the SVHN and CIFAR-10 datasets, which are the benchmarks on which the previous state-of-the-art methods are tested. Our method outperforms them by a large margin across the numbers of views we evaluate. We further provide results on the harder CIFAR-100 and ImageNet benchmarks, on which most of the previous methods have not reported performance. The experimental results demonstrate the effectiveness of our method for the problem of semi-supervised image recognition. We also provide alternative perspectives for discussing the proposed Deep Co-Training method, including model ensemble, multi-agent learning and knowledge distillation.

### References

- Blum, A., Mitchell, T.M.: Combining labeled and unlabeled data with co-training. In: Proceedings of the Eleventh Annual Conference on Computational Learning Theory, COLT 1998, Madison, Wisconsin, USA, July 24-26, 1998. (1998) 92–100
- He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. IEEE Conference on Computer Vision and Pattern Recognition, CVPR (2016)
- Huang, G., Liu, Z., Weinberger, K.Q.: Densely connected convolutional networks. IEEE Conference on Computer Vision and Pattern Recognition, CVPR (2017)
- Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In Pereira, F., Burges, C.J.C., Bottou, L., Weinberger, K.Q., eds.: Advances in Neural Information Processing Systems 25. (2012) 1097–1105
- Qiao, S., Liu, C., Shen, W., Yuille, A.L.: Few-shot image recognition by predicting parameters from activations. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR. (2018)
- Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556 (2014)
- Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: Computer Vision and Pattern Recognition (CVPR). (2015)
- Wang, Y., Xie, L., Liu, C., Qiao, S., Zhang, Y., Zhang, W., Tian, Q., Yuille, A.: SORT: Second-Order Response Transform for Visual Recognition. IEEE International Conference on Computer Vision (2017)
- Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional networks. CoRR abs/1311.2901 (2013)
- Qiao, S., Zhang, Z., Shen, W., Wang, B., Yuille, A.L.: Gradually updated neural networks for large-scale image recognition. CoRR abs/1711.09280 (2017)
- Chen, L., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Semantic image segmentation with deep convolutional nets and fully connected CRFs. In: International Conference on Learning Representations. (2015)
- Girshick, R.B., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2014. (2014)
- Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015. (2015)
- Ren, S., He, K., Girshick, R.B., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada. (2015)
- Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115(3) (2015) 211–252
- Lin, T., Maire, M., Belongie, S.J., Bourdev, L.D., Girshick, R.B., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: common objects in context. CoRR abs/1405.0312 (2014)
- Zhu, X.: Semi-supervised learning literature survey. Technical report, Computer Science, University of Wisconsin-Madison (2006)
- Sajjadi, M., Javanmardi, M., Tasdizen, T.: Regularization with stochastic transformations and perturbations for deep semi-supervised learning. In: Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain. (2016) 1163–1171
- Laine, S., Aila, T.: Temporal ensembling for semi-supervised learning. International Conference on Learning Representations, ICLR, 2017 (2017)
- Dai, Z., Yang, Z., Yang, F., Cohen, W.W., Salakhutdinov, R.: Good semi-supervised learning that requires a bad GAN. In: Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA. (2017) 6513–6523
- Nigam, K., Ghani, R.: Analyzing the effectiveness and applicability of co-training. In: Proceedings of the 2000 ACM CIKM International Conference on Information and Knowledge Management, McLean, VA, USA, November 6-11, 2000. (2000) 86–93
- Bai, X., Wang, B., Yao, C., Liu, W., Tu, Z.: Co-transduction for shape retrieval. IEEE Transactions on Image Processing 21(5) (May 2012) 2747–2757
- Xia, R., Wang, C., Dai, X., Li, T.: Co-training for semi-supervised sentiment classification based on dual-view bags-of-words representation. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL 2015, July 26-31, 2015, Beijing, China, Volume 1: Long Papers. (2015) 1054–1063
- Goodfellow, I., Shlens, J., Szegedy, C.: Explaining and harnessing adversarial examples. International Conference on Learning Representations, ICLR, 2015 (2015)
- Xie, C., Wang, J., Zhang, Z., Zhou, Y., Xie, L., Yuille, A.L.: Adversarial examples for semantic segmentation and object detection. In: IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017. (2017) 1378–1387
- Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., Ng, A.Y.: Reading digits in natural images with unsupervised feature learning. In: NIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011. (2011)
- Krizhevsky, A., Hinton, G.: Learning multiple layers of features from tiny images. Master’s thesis, Department of Computer Science, University of Toronto (2009)
- Miyato, T., Maeda, S., Koyama, M., Ishii, S.: Virtual adversarial training: a regularization method for supervised and semi-supervised learning. CoRR abs/1704.03976 (2017)
- Jin, L., Lazarow, J., Tu, Z.: Introspective classification with convolutional nets. In: Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA. (2017) 823–833
- Zhou, Z.H., Li, M.: Tri-training: exploiting unlabeled data using three classifiers. IEEE Transactions on Knowledge and Data Engineering 17(11) (Nov 2005) 1529–1541
- Xu, C., Tao, D., Xu, C.: A survey on multi-view learning. CoRR abs/1304.5634 (2013)
- Salimans, T., Kingma, D.P.: Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In: Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain. (2016) 901
- Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: Proceedings of the 32nd International Conference on Machine Learning, ICML. (2015)
- Pu, Y., Gan, Z., Henao, R., Yuan, X., Li, C., Stevens, A., Carin, L.: Variational autoencoder for deep learning of images, labels and captions. In: Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain. (2016) 2352–2360
- Tarvainen, A., Valpola, H.: Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In: Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA. (2017) 1195–1204
- Salimans, T., Goodfellow, I.J., Zaremba, W., Cheung, V., Radford, A., Chen, X.: Improved techniques for training GANs. In: Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain. (2016) 2226–2234
- Bachman, P., Alsharif, O., Precup, D.: Learning with pseudo-ensembles. In: Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada. (2014) 3365–3373
- Rasmus, A., Berglund, M., Honkala, M., Valpola, H., Raiko, T.: Semi-supervised learning with ladder networks. In: Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada. (2015) 3546–3554
- Arjovsky, M., Chintala, S., Bottou, L.: Wasserstein GAN. CoRR abs/1701.07875 (2017)
- Tu, Z.: Learning generative models via discriminative approaches. In: 2007 IEEE Conference on Computer Vision and Pattern Recognition. (June 2007) 1–8
- Mordvintsev, A., Olah, C., Tyka, M.: Deepdream - a code example for visualizing neural networks. Google Research (2015)
- Gatys, L.A., Ecker, A.S., Bethge, M.: A neural algorithm of artistic style. CoRR abs/1508.06576 (2015)
- Scudder III, H.J.: Probability of error of some adaptive pattern-recognition machines. IEEE Trans. Information Theory 11(3) (1965) 363–371
- Fralick, S.C.: Learning to recognize patterns without a teacher. IEEE Trans. Information Theory 13(1) (1967) 57–64
- Breiman, L.: Bagging predictors. Machine Learning 24(2) (1996) 123–140
- Zhou, Z., Wu, J., Tang, W.: Ensembling neural networks: Many could be better than all. Artif. Intell. 137(1-2) (2002) 239–263
- Zhou, Z.H.: Ensemble Methods: Foundations and Algorithms. Chapman & Hall/CRC Press (2012)
- Breiman, L.: Statistical modeling: The two cultures (with comments and a rejoinder by the author). Statist. Sci. 16(3) (Aug 2001) 199–231
- Panait, L., Luke, S.: Cooperative multi-agent learning: The state of the art. Autonomous Agents and Multi-Agent Systems 11 (2005) 387–434
- Hinton, G.E., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. CoRR abs/1503.02531 (2015)