Unsupervised High-level Feature Learning by Ensemble Projection for Semi-supervised Image Classification and Image Clustering

Abstract

This paper investigates the problem of image classification with limited or no annotations, but abundant unlabeled data. The setting exists in many tasks such as semi-supervised image classification, image clustering, and image retrieval. Unlike previous methods, which develop or learn sophisticated regularizers for classifiers, our method learns a new image representation by exploiting the distribution patterns of all available data. Particularly, a rich set of visual prototypes are sampled from all available data, and are taken as surrogate classes to train discriminative classifiers; images are projected via the classifiers; the projected values, similarities to the prototypes, are stacked to build the new feature vector. The training set is noisy. Hence, in the spirit of ensemble learning we create a set of such training sets which are all diverse, leading to diverse classifiers. The method is dubbed Ensemble Projection (EP). EP captures not only the characteristics of individual images, but also the relationships among images. It is conceptually simple and computationally efficient, yet effective and flexible. Experiments on eight standard datasets show that: (1) EP outperforms previous methods for semi-supervised image classification; (2) EP produces promising results for self-taught image classification, where unlabeled samples are a random collection of images rather than being from the same distribution as the labeled ones; and (3) EP improves over the original features for image clustering. The code of the method is available at the project page.

1 Introduction

Providing efficient solutions to image classification has always been a major focus in computer vision. Recent years have witnessed considerable progress in image classification. However, most popular systems [?] heavily rely on manually labeled training data, which is expensive and sometimes impractical to acquire. Despite substantial efforts towards efficient annotation by developing online games [?] or appealing software tools [?], collecting training data for classification is still very time-consuming and tedious. The scarcity of annotations, combined with the explosion of image data, starts shifting focus towards learning with less supervision. As a result, numerous techniques have been developed to learn classification models with cheaper annotations. The most notable ones include unsupervised feature learning [?], semi-supervised learning [?], active learning [?], transfer learning [?], weakly-supervised learning [?], self-taught learning [?], and image clustering [?].

In this paper, we are interested in the problem of image classification with limited or no annotation. Instead of regularizing the classifiers like most previous methods [?], we learn a new feature representation from all the available data (labeled + unlabeled) by exploiting the distribution patterns of the data to be handled. The setting assumes the availability of unlabeled data from the same or a similar distribution as the test data. This form of weak supervision is naturally available in applications such as semi-supervised image classification and image clustering. The learned feature is specifically tuned for the data distribution of interest and performs better on that data than the standard features the method started with. The starting features can be hand-crafted features [?], features learned in a supervised manner [?], or features learned in an unsupervised way [?].

Learning with unlabeled data has been quite successful in many fields, for instance in semi-supervised learning (SSL) [?], in image clustering [?], and in unsupervised feature representation learning [?]. Typically these methods build upon the local-consistency assumption that data samples with high similarity should share the same label. This is also called manifold smoothness, and it is often used to regularize the training process for the classifiers or feature representations. In this paper, we propose another way to exploit the local-consistency assumption to learn a new feature representation. The new feature representation is learned in a discriminative way to capture not only the information of individual images, but also the relationships among images. The learning is conceptually straightforward and computationally simple. The learned features can be fed into any classifier for the final classification of the unlabeled samples. Thus, the method is agnostic to the classifier choice. This facilitates the deployment of SSL methods, as users often have their favorite classifiers and are reluctant to drop them. For image clustering, we apply the same feature learning method to all provided images, and then feed the learned features to standard clustering methods such as k-means and Spectral Clustering. Below, we present our motivations and outline the method.

Figure 1: The pipeline of Ensemble Projection (EP). EP consists of unsupervised feature learning (left panel) and plain classification or clustering (right panel). For feature learning, EP samples an ensemble of T diverse prototype sets from all known images and learns discriminative classifiers on them for the projection functions. Images are then projected using these functions to obtain their new representation. These features are fed into standard classifiers and clustering methods for image classification and clustering respectively.

1.1 Motivations

People learn and abstract the concepts of object classes well from their intrinsic characteristics, such as colors, textures, and shapes. For instance, the sky is blue, and a football is spherical. We also do so by comparing new object classes to those classes that have already been learned. For example, a leopard is similar in appearance to a jaguar, but is smaller. This paradigm of learning-by-comparison or characterization-by-comparison is part of Eleanor Rosch’s prototype theory [?], which states that an object’s class is determined by its similarity to prototypes which represent object classes. The theory has been used successfully in transfer learning [?], when labeled data of different classes are available. An important question is whether the theory can also be used for feature representation learning when a large amount of unlabeled data is available. This paper investigates this problem.

To use this paradigm, we first need to create the prototypes automatically from the available data. In keeping with Eleanor Rosch’s prototype theory [?], ideal prototypes should have two properties: 1) images in the same prototype are to be from the same class; and 2) images of different prototypes are to be from different classes. These properties guarantee meaningful comparisons and reduce ambiguity. Without access to labels of data samples, the prototypes have to be created in an unsupervised way, based on some assumptions. In addition to the widely used local-consistency assumption, we propose another one called exotic-consistency, which states that samples that are far apart in the feature space are very likely to come from different classes. Both assumptions are verified experimentally in Section 3.1. Based on these two assumptions, it stands to reason that samples along with their closest neighbors can be “good” prototypes, and such prototypes that are far apart can play the role of different classes. According to this observation, we design a method to sample the prototype set from all available images by encoding them on a graph with links reflecting their affinity.

The sampled prototypes are taken as surrogate classes, and discriminative learning yields projection functions tuned to these classes. Images are then linked to the prototypes via their projection values (classification scores) under the functions. Since the information carried by a single prototype set is limited and can be noisy, we borrow ideas from ensemble learning [?] to create an ensemble of diverse prototype sets, which in turn leads to an ensemble of projection functions, to mitigate the influence of the deficiencies of each training set. The idea is that if the deficiency modes of the individual training sets are different or ‘orthogonal’, ensemble learning is able to cancel out or at least mitigate their effect. This conjecture is verified with a simulated experiment in Section 3.2, and is also supported by the superior performance of our method in real applications. With the ensemble of classifiers, images are then represented by the concatenation of their classification scores (their similarities to all the sampled image prototypes) for the final classification, which is in keeping with prototype theory [?]. We call the method Ensemble Projection (EP). Its schematic diagram is sketched in Figure 1.

1.2 Contributions

EP was evaluated on eight image classification datasets, ranging from texture classification, through object classification and scene classification, to style classification. For SSL, EP is compared to three baselines and three other methods. For image clustering, EP is compared to the original features it started with. Two standard clustering methods are used: k-means and Spectral Clustering. The experiments show that: (1) EP improves over the original features by exploiting the data distribution of interest, and outperforms competing SSL methods; (2) EP produces promising results for self-taught image classification, where the unlabeled data does not follow the same distribution as the labeled data; (3) EP improves over the original features for image clustering.

This paper is an extension of our conference papers [?]. In addition to putting the two tasks, image clustering and semi-supervised image classification, into the same framework, this paper brings several new contributions. First, in the conference papers, EP was validated only with hand-crafted features, such as LBP [?], GIST [?], and PHOG [?]. These features, however, are obsolete for image classification. Recently, features learned by CNNs have resulted in state-of-the-art performance in various classification tasks [?]. In this paper, we validate the efficacy of EP also with CNN features. Second, experiments are conducted on eight standard classification datasets instead of only four in [?]. Third, more analyses and insights are given. Our feature learning method can be used for other tasks as well. For instance, [?] extended the idea to generate hashing functions for efficient image retrieval.

The rest of this paper is organized as follows. Section 2 reports on related work. Section 3 describes the observations that motivate the method. Section 4 is devoted to the approach, followed by experiments in Section 5. Section 6 concludes the paper.

2 Related Work

Our method is generally relevant to image feature learning, semi-supervised learning, ensemble learning, and image clustering.

Supervised Feature Learning: Over the past years, a wide spectrum of features, from pixel-level to semantic-level, has been designed and used for different vision tasks. Due to the semantic gap, recent work extracts high-level features, which go beyond single images and are likely to carry semantic information. Notable examples are Image Attributes [?], Classemes [?], and Object Bank [?]. While obtaining pleasing results, these methods all require additional labeled training data, which is exactly what we want to avoid. There have been attempts [?] to avoid the extra attribute-level supervision, but they still require canonical class-level supervision. Our representation learning, however, is fully unsupervised. Pre-trained CNN features [?] have shown state-of-the-art performance on various classification tasks. Our feature learning is complementary to these methods. As shown in the experiments, our method can improve on top of the CNN features by exploiting the distribution patterns of the data to be classified. Although fine-tuning can boost the performance of CNN features for the specific task at hand [?], it needs labeled data of a moderate size, which is not always available in our setting. Our method can be understood as unsupervised feature enhancing or fine-tuning.

Unsupervised Feature Learning: Our method is akin to methods which learn middle- or high-level image representations in an unsupervised manner. [?] employs k-means to mine filters from image patches and then applies the filters for feature computation. [?] generates surrogate classes by augmenting each patch with its transformed versions under a set of transformations such as translation, scaling, and rotation, and trains a CNN on top of these surrogate classes to generate features. The idea is very similar to ours, but our surrogate classes are generated by augmenting seed images with their close neighbors. The learning methods are also different. [?] discovers a set of representative patches by training discriminative classifiers with small, compact patch clusters from one dataset, and testing them on another dataset to find similar patches. The found patches are then used to train new classifiers, which are applied back to the first dataset. The process iterates and terminates after a number of rounds, resulting in a set of representative patches and their corresponding ‘filters’. The idea of learning ‘filters’ from compact clusters shares similarities with what we do, but our clusters consist of images rather than patches. Other forms of weak supervision have also been exploited to learn good feature representations without human-labeled data, and they all obtain very promising results. For instance, [?] uses the spatial relationships of image windows in an image as the supervision to train a neural network; [?] exploits the tracking results of objects in videos to guide the training of a neural network to learn feature representations; and [?] exploits the ego-motion of cameras for the training. These methods aim for general feature representations. Our method, however, is designed to ‘tune’ or enhance vision features specifically for the datasets on which the vision tasks are performed.

Semi-supervised Learning: SSL aims at enhancing the performance of classification systems by exploiting an additional set of unlabeled data. Due to its great practical value, SSL has a rich literature [?]. Amongst existing methods, the simplest methodology for SSL is based on the self-training scheme [?], where the system iterates between training classification models with the current ‘labeled’ training data and augmenting the training set by adding its highly confident predictions on the set of unlabeled data; the process starts from human-labeled data and stops when some termination condition is reached, e.g. the maximum number of iterations. [?] and [?] presented two methods in this stream for image classification. While obtaining promising results, they both require additional supervision: [?] needs image tags and [?] image attributes.

The second group of SSL methods is based on label propagation over a graph, where nodes represent data examples and edges reflect their similarities. The optimal labels are those that are maximally consistent with the supervised class labels and the graph structure. Well known examples include Harmonic-Function [?], Local-Global Consistency [?], Manifold Regularization [?], and Eigenfunctions [?]. While having strong theoretical support, these methods are unable to exploit the power of discriminative learning for image classification.

Another group of methods utilizes the unlabeled data to regularize the classifying functions, enforcing the boundaries to pass through regions with a low density of data samples. The most notable methods are transductive SVMs [?], Semi-supervised SVMs [?], and semi-supervised random forests [?]. These methods are difficult to extend to large-scale applications, and developing an efficient optimization for them is still an open question. Readers are referred to [?] for a thorough overview of SSL.

Ensemble Learning: Our method learns the representation from an ensemble of prototype sets, thus sharing ideas with ensemble learning (EL). EL builds a committee of base learners, and finds solutions by maximizing the agreement. Popular ensemble methods that have been extended to semi-supervised scenarios are Boosting [?] and Random Forests [?]. However, these methods still differ significantly from ours. They focus on the problem of improving classifiers by using unlabeled data. Our method learns new representations for images using all data available. Thus, it is independent of the classification method. The reason we use EL is to capture rich visual attributes from a series of prototype sets, and to mitigate the deficiency of the sampled prototype sets. Other work close to ours is Random Ensemble Metrics [?], where images are projected to randomly subsampled training classes for supervised distance learning.

Image Clustering: A plethora of methods have been developed for image clustering. [?] modeled objects as constellations of visual parts and estimated parameters using the expectation-maximization algorithm for unsupervised classification. [?] proposed using aspect models to discover object classes from an unordered image collection. Later on, [?] used Hierarchical Latent Dirichlet Allocation to automatically discover object class hierarchies. For scene class discovery, [?] proposed to combine information projection and clustering sampling. These methods assume explicit distributions for the samples. Image classes, nevertheless, are arranged in complex and widely diverging shapes, making the design of explicit models difficult. An alternative strand, which is more versatile in handling structured data, builds on similarity-based methods. [?] applied the affinity propagation algorithm of [?] for unsupervised image categorization. [?] developed partially matching image features to compute image similarity and used spectral methods for image clustering. The main difficulty of this strand is how to measure image similarity as the semantic level goes up. Readers are referred to [?] for a survey.

3 Observations

In this section, we motivate our approach and explain why it works. We experimentally verify our assumptions. First, given a standard distance metric over images, do the assumptions of local-consistency and exotic-consistency hold, and to what extent? Second, is ensemble learning able to cancel out the deficiency of the individual training sets, given that the number of such training sets is sufficiently large and their deficiency modes are different or ‘orthogonal’?

3.1 Observation 1

The assumptions of local-consistency and exotic-consistency do hold for real image datasets. An ideal image representation along with a distance metric should ensure that all images of the same class are more similar to each other than to those of other classes. However, this does not strictly hold for most vision systems in reality. In this section, we verify whether the relaxed assumptions of local-consistency and exotic-consistency hold. These state that images are very likely from the same class as their close neighbors, and very likely from different classes than those far from them. To examine the assumptions, we tabulate how often an image is from the same class as its k-th nearest neighbor. We refer to this frequency as the label co-occurrence probability, which is averaged across images and class labels in the dataset. Four features were tested: GIST [?], PHOG [?], LBP [?], and the CNN feature [?]. The Euclidean distance is used here.

Figure ? shows the results on six datasets (datasets and features are introduced in Section 5). The results reveal that using the distance metric in conventional ways (e.g. clustering by k-means and spectral methods) will result in very noisy training sets, because the label co-occurrence probability drops very quickly with k. Sampling in the very close neighborhood of a given image is likely to generate more instances of the same class, whereas sampling far away tends to gather samples of different classes. This suggests that samples along with a few very close neighbors, namely “compact” image clusters, can form a training set for a single class, and a set of such image clusters far away from each other in feature space can serve as good prototype sets for different classes. Furthermore, sampling in this way provides the chance of creating a large number of diverse prototype sets, due to the small size of each sampled prototype set. Also, from this figure, it is evident that the CNN feature performs significantly better than the rest, which suggests that using the CNN feature in our system is advisable.
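The label co-occurrence statistic described above can be illustrated with a minimal sketch. It assumes Euclidean distance and, as a simplification, averages over images only rather than over images and class labels; the function name `label_cooccurrence` and the toy data are hypothetical and not taken from the paper.

```python
import math


def label_cooccurrence(features, labels, max_k):
    """For k = 1..max_k, return the fraction of images that share the
    label of their k-th nearest neighbor (Euclidean distance)."""
    n = len(features)
    hits = [0] * max_k
    for i in range(n):
        # Rank all other images by their distance to image i.
        order = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(features[i], features[j]),
        )
        for k in range(max_k):
            if labels[order[k]] == labels[i]:
                hits[k] += 1
    return [h / n for h in hits]


# Toy data (assumption): two well-separated clusters of three points each.
features = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = [0, 0, 0, 1, 1, 1]
curve = label_cooccurrence(features, labels, max_k=3)
print(curve)  # [1.0, 1.0, 0.0]
```

On this toy data the first two neighbors always come from the same cluster and the third never does, which mirrors, in an exaggerated form, the quick drop with k discussed above.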

3.2 Observation 2

Ensemble learning is able to cancel out or substantially mitigate the deficiency of individual training sets, given that the number of such training sets is sufficiently large and the modes of the deficiency are different or ‘orthogonal’.

We examined this idea in supervised image categorization. Given ground-truth data divided into a training set and a test set, (i) we artificially synthesized a set of weak training sets (training sets with different modes of deficiency) from the training data, and (ii) ensemble learning was then performed on these sets and its performance on classifying the test data was measured.

In order to guarantee the diversity of the training sets (for ensemble learning), each weak training set is formed by randomly taking a fixed fraction of the images in the training set, and randomly re-assigning the labels of a fixed percentage of these images. Hence, a label noise percentage of zero corresponds to the ‘oracle’ performance, as every sample is assigned its true label. A classifier is trained for each of these weak training sets. At test time, each of these classifiers returns the class label of each image in the test set. The winning label is the mode of the results returned by all the classifiers. Figure ? evaluates this for the Scene-15 dataset [?]. Logistic Regression is used as the classifier with the CNN feature [?] as input. When the label noise percentage is low, the classification precision starts out high and levels off quickly as the number of weak training sets grows, as one would expect. But interestingly, even when the label noise percentage is high, the classification precision, which starts low, converges to a similarly high precision given sufficiently many weak training sets. This suggests that ensemble learning is able to cancel out the deficiency of individual training sets. It learns the essence of image classes when the modes of deficiency are different for different training sets, and given a sufficiently large number of such training sets.
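The simulated experiment can be mimicked on synthetic data. The sketch below is only an illustration under stated assumptions: 2-D Gaussian blobs replace image features, a nearest-centroid rule stands in for the Logistic Regression base classifier, and the subset fraction (`keep`), noise level (`noise`), and number of weak sets are arbitrary choices rather than the values used in the paper.

```python
import random
from collections import Counter


def make_blobs(rng, n_per_class, centers, spread=1.0):
    """Draw 2-D points around each class center; returns (points, labels)."""
    pts, labs = [], []
    for c, (cx, cy) in enumerate(centers):
        for _ in range(n_per_class):
            pts.append((rng.gauss(cx, spread), rng.gauss(cy, spread)))
            labs.append(c)
    return pts, labs


def weak_training_set(rng, pts, labs, n_classes, keep=0.3, noise=0.4):
    """Random subset of the training data in which a fixed share of the
    labels is re-assigned uniformly at random."""
    idx = rng.sample(range(len(pts)), int(keep * len(pts)))
    flip = set(rng.sample(idx, int(noise * len(idx))))
    return [(pts[i], rng.randrange(n_classes) if i in flip else labs[i])
            for i in idx]


def fit_centroids(samples):
    """Stand-in base learner: the mean point of every (noisy) label."""
    sums = {}
    for (x, y), c in samples:
        sx, sy, n = sums.get(c, (0.0, 0.0, 0))
        sums[c] = (sx + x, sy + y, n + 1)
    return {c: (sx / n, sy / n) for c, (sx, sy, n) in sums.items()}


def predict(centroids, p):
    """Label of the centroid closest to point p."""
    return min(centroids, key=lambda c: (p[0] - centroids[c][0]) ** 2
               + (p[1] - centroids[c][1]) ** 2)


def ensemble_predict(models, p):
    """The winning label is the mode of the base classifiers' outputs."""
    return Counter(predict(m, p) for m in models).most_common(1)[0][0]


rng = random.Random(0)
centers = [(0, 0), (6, 0), (0, 6)]
train_pts, train_labs = make_blobs(rng, 30, centers)
test_pts, test_labs = make_blobs(rng, 30, centers)
models = [fit_centroids(weak_training_set(rng, train_pts, train_labs, 3))
          for _ in range(25)]
acc = sum(ensemble_predict(models, p) == y
          for p, y in zip(test_pts, test_labs)) / len(test_pts)
print(f"ensemble accuracy: {acc:.2f}")
```

Varying `noise` and the number of models in this toy setting gives a rough feel for the trade-off between label noise and ensemble size described above, although it says nothing about the real image data.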

We were inspired by the two observations, and would like to investigate whether the assumptions of local-consistency and exotic-consistency are enough to generate a set of such weak training sets in an unsupervised manner, with which ensemble learning is able to learn useful visual attributes for semi-supervised image classification and image clustering.

4 Our Approach

The training data consists of both labeled data and unlabeled data. Every image is described by a feature vector, and every labeled image additionally carries a class label. For image clustering, there is no labeled data, and the unlabeled data comprises all images. Most previous semi-supervised learning (SSL) methods learn a classifier from the labeled data with a regularization term learned from the unlabeled data. Our method learns a new image representation from all known data (labeled and unlabeled), and trains standard classifier models on the labeled data with the new representation. The new representation of an image is a vector of its similarities to a series of sampled image prototypes.

Let us assume that Ensemble Projection (EP) learns knowledge from T prototype sets. Each prototype set is a collection of pairs, where each pair consists of the index of a chosen image and a pseudo-label indicating which prototype the image belongs to. A prototype set is characterized by its number of prototypes (surrogate classes) and by the number of images sampled for each prototype (class); see Figure 1 for an example. Below, we first present our sampling method for creating a single prototype set in one trial, followed by EP.

4.1 Max-Min Sampling

As stated, we want the prototypes to be inter-distinct and intra-compact, so that each one represents a different visual concept. To this end, we design a 2-step sampling method, termed Max-Min Sampling. The Max-step is based on the exotic-consistency assumption and caters for the inter-distinct property; the Min-step is based on the local-consistency assumption and caters for the intra-compact requirement. In particular, we first sample a skeleton of the prototype set by looking for image candidates that are strongly spread out, i.e. at large distances from each other. We then enrich the skeleton to a prototype set by including the closest neighbors of the skeleton images. The algorithm for creating a prototype set is given in Algorithm ?. For the skeleton, we sample a number of hypotheses, each consisting of randomly sampled images, one seed image per prototype. For each hypothesis, the average pairwise distance between the images is then computed. Finally, we take the hypothesis yielding the largest average mutual distance as the skeleton. This simple procedure guarantees that the sampled seed images are far from each other. Once the skeleton is created, the Min-step extends each seed image to an image prototype by adding its nearest neighbors (the seed itself included), in order to enrich the characteristics of each image prototype and reduce the risk of introducing noisy images. The pseudo-labels are shared by all images specifying the same prototype. It is worth pointing out that the randomized Max-step may not generate the optimal skeleton. However, it serves its purpose well. For one thing, we do not need the optimal one: we only need the prototypes to be far apart, not farthest apart. Moreover, randomization allows diverse visual concepts to be captured in different prototype sets. The influence of the optimality of each single skeleton is tested in Section ?. The Euclidean distance is used here, but it is easy to change to other distance metrics if needed.
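The two steps can be sketched as follows. This is a minimal illustration under stated assumptions rather than the referenced algorithm itself: the function and parameter names (`max_min_sampling`, `n_prototypes`, `n_per_prototype`, `n_hypotheses`) are hypothetical, distances are recomputed on the fly instead of being cached, and nothing prevents the neighborhoods of two seeds from overlapping.

```python
import math
import random
from itertools import combinations


def max_min_sampling(features, n_prototypes, n_per_prototype, n_hypotheses, rng):
    """Return one prototype set as a list of (image_index, pseudo_label) pairs."""
    n = len(features)

    # Max-step: among random hypotheses, keep the set of seeds with the
    # largest average pairwise distance (the skeleton).
    best_seeds, best_score = None, -1.0
    for _ in range(n_hypotheses):
        seeds = rng.sample(range(n), n_prototypes)
        pairs = list(combinations(seeds, 2))
        score = sum(math.dist(features[a], features[b]) for a, b in pairs) / len(pairs)
        if score > best_score:
            best_seeds, best_score = seeds, score

    # Min-step: extend every seed with its nearest neighbors; the seed is
    # its own nearest neighbor, so it is always included.
    prototype_set = []
    for pseudo_label, seed in enumerate(best_seeds):
        nearest = sorted(range(n), key=lambda j: math.dist(features[seed], features[j]))
        prototype_set += [(j, pseudo_label) for j in nearest[:n_per_prototype]]
    return prototype_set


# Toy data (assumption): three compact clusters at the corners of a triangle.
centers = [(0.0, 0.0), (10.0, 0.0), (5.0, 8.66)]
offsets = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1)]
features = [(cx + dx, cy + dy) for cx, cy in centers for dx, dy in offsets]

proto = max_min_sampling(features, n_prototypes=3, n_per_prototype=3,
                         n_hypotheses=200, rng=random.Random(0))
for pseudo_label in range(3):
    print(pseudo_label, sorted(i for i, lab in proto if lab == pseudo_label))
```

In this toy configuration each pseudo-label ends up covering exactly one cluster. Calling the function repeatedly with different random states would yield the diverse prototype sets that the ensemble relies on.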

4.2 Ensemble Projection

We now explore the use of the image prototype sets created in Section 4.1 for a new image representation. Because the prototypes are compact in feature space, each of them implicitly defines a visual concept. This is especially true when the dataset is large, which is to be expected given the vast number of unlabeled images that are available. Since the information carried by a single prototype set is quite limited and noisy, we borrow an idea from ensemble learning (EL), namely to create an ensemble of such sets to accumulate wisdom from a broad set of training images. A sanity check of this was already presented for a simulated situation in Section 3.2.

As is well-known [?], EL benefits from the precision of its base learners and their diversity. To obtain good precision, a discriminative learning method is employed for the base learner; logistic regression is used in our implementation to project each input image onto the image prototypes to measure the similarities. This choice is due both to its training efficiency and to the fact that lower-capacity models are better suited for the sparse, small-size datasets under consideration. To achieve high diversity, randomness is introduced in different trials of Max-Min Sampling to create an ensemble of diverse prototype sets, so that a rich set of image attributes is captured. The vectors of similarities from all prototype sets are then concatenated and used as a new image representation for the final classification. A standard classifier (SVMs, Boosting, or Random Forest) can then be trained on the labeled data with the learned feature for semi-supervised classification, as the unlabeled data has already been explored when obtaining the learned feature. Likewise, image clustering is performed by feeding the learned feature into a standard clustering method. The whole procedure of EP is presented in Algorithm ?. By now, the whole pipeline in Figure 1 has been explained.
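The projection step can be illustrated with a small NumPy sketch. It is a simplified stand-in, not the released implementation: the logistic regression is a plain gradient-descent softmax model with arbitrary hyperparameters, the prototype sets in the usage example are drawn at random purely as placeholders for Max-Min Sampling, and all names (`fit_logreg`, `ensemble_projection`, `T`, `r`, `n`) are chosen for this example.

```python
import numpy as np


def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def fit_logreg(x, y, n_classes, lr=0.1, steps=500, l2=1e-3):
    """Multinomial logistic regression fitted by batch gradient descent."""
    w = np.zeros((x.shape[1], n_classes))
    b = np.zeros(n_classes)
    onehot = np.eye(n_classes)[y]
    for _ in range(steps):
        grad = (softmax(x @ w + b) - onehot) / len(x)
        w -= lr * (x.T @ grad + l2 * w)
        b -= lr * grad.sum(axis=0)
    return w, b


def ensemble_projection(features, prototype_sets, n_prototypes):
    """Train one classifier per prototype set and concatenate the scores.

    prototype_sets: list of (image_indices, pseudo_labels) array pairs.
    """
    blocks = []
    for idx, pseudo in prototype_sets:
        w, b = fit_logreg(features[idx], pseudo, n_prototypes)
        # Scores of every image w.r.t. the prototypes of this set.
        blocks.append(softmax(features @ w + b))
    return np.hstack(blocks)


# Toy usage (assumption): random features and random prototype sets.
rng = np.random.default_rng(0)
features = rng.normal(size=(40, 5))
T, r, n = 4, 3, 2  # prototype sets, prototypes per set, images per prototype
prototype_sets = []
for _ in range(T):
    idx = rng.choice(len(features), size=r * n, replace=False)
    prototype_sets.append((idx, np.repeat(np.arange(r), n)))

new_features = ensemble_projection(features, prototype_sets, r)
print(new_features.shape)  # (40, 12)
```

The resulting matrix has one column per prototype across the whole ensemble, and it can be passed to any standard classifier or clustering routine in place of the original features.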

5 Experiments

The effectiveness of the approach is evaluated in the situations of: (1) semi-supervised image classification, where the amount of labeled data is sparse relative to the total amount of data; and (2) image clustering, where no labeled data is provided. In this section, we will first introduce the datasets and the features used, followed by experimental results for the two tasks and their corresponding analysis.

Datasets:

The method is evaluated on diverse classification tasks: texture classification, object classification, scene classification, event classification, style classification, and satellite image classification. Eight standard datasets are used for the evaluation:

  • Texture-25 [?]: 25 texture classes, with an equal number of samples per class.

  • Caltech-101 [?]: 101 object classes, with a varying number of images per class.

  • STL-10 [?]: 10 object classes (airplane, bird, car, cat, deer, dog, horse, monkey, ship, truck), with a fixed number of training and test images per class, plus a separate set of unlabeled images for unsupervised learning.

  • Scene-15 [?]: 15 scene classes covering both indoor and outdoor environments, with a varying number of images per class.

  • Indoor-67 [?]: 67 indoor classes such as shoe shop, mall, and garage, with a guaranteed minimum number of images per class.

  • Event-8 [?]: 8 sports event classes: rowing, badminton, polo, bocce, snowboarding, croquet, sailing, and rock climbing.

  • Building-25 [?]: 25 architectural styles such as American craftsman, Baroque, and Gothic.

  • LandUse-21 [?]: 21 classes of satellite images in terms of land usage, such as agricultural, airplane, and forest, with an equal number of images per class.

Features:

The following three features were used in our earlier papers [?] due to their simplicity and low dimensionality: GIST [?], Pyramid of Histogram of Oriented Gradients (PHOG) [?], and Local Binary Patterns (LBP) [?]. However, these features are obsolete and yield results inferior to those of alternative features recently developed for image classification. In this paper, we replaced them with the CNN features [?]. These were obtained from an off-the-shelf CNN pre-trained on the ImageNet data. They were chosen because CNN features have achieved state-of-the-art performance for image classification [?]. For the implementation, we used the MatConvNet [?] toolbox with a pre-trained CNN model. The convolutional results at one layer of the network were stacked as the CNN feature vector. We also tested the LLE-coded SIFT feature [?]. However, it is not on par with the CNN features.

Competing methods:

For semi-supervised classification, six classifiers were adopted to evaluate the method: three baselines, namely k-NN, Logistic Regression (LR), and SVMs with RBF kernels, and three semi-supervised classifiers, namely Harmonic Function (HF) [?], LapSVM [?], and Anchor Graph (AG) [?]. HF formulates the SSL problem as a Gaussian Random Field on a graph for label propagation. LapSVM extends SVMs by including a smoothness penalty term defined on the Laplacian graph. AG aims to address the scalability issue of graph-based SSL, and constructs a tractable large graph by coupling anchor-based label prediction and adjacency matrix design. For image clustering, we compare our learned feature to the original CNN feature with two standard clustering algorithms: k-means and Spectral Clustering. Existing systems for image clustering often report performance on relatively easy datasets, and it is hard to compare with them on these standard classification datasets.

Experimental settings:

We conducted four sets of experiments: (1) compare our method with competing methods for semi-supervised image classification on the eight datasets, where the unlabeled images are from the same classes as the labeled ones; (2) evaluate the robustness of our method against the choice of its parameters and classifier models in the context of semi-supervised image classification; (3) evaluate the performance of our method for the task of self-taught image classification on the STL-10 dataset, where the feature is learned from the unlabeled images and the performance is tested on the labeled set; and (4) evaluate our method for the task of image clustering on the eight datasets.

For all experimental setups except (2), the same set of parameters was used for all the classifiers. We used for the k-NN classifier, L2-regularized LR of LIBLINEAR [?] with , and the SVMs with RBF kernel of LIBSVM [?] with and the default , . For LapSVM, we used the scheme suggested by [?]: was set as the inductive model, and was set as . For HF, the weight matrix was computed with the Gaussian function , where is automatically set by using the self-tuning method [?]. For AG, we followed the suggestion from the original work [?] and used the following for both our learned feature and the original CNN feature: anchors and features reduced to dimensions via PCA.

As to the parameters of our method, a wide variety of values for them were tested in experimental setup (2). In experimental setups (1), (3) and (4), we fixed them to the following values: , , , and , which leads to a feature vector of dimensions. Note that the learned feature may contain redundancy across different dimensions, as some prototype sets are similar to others. We leave the task of selecting useful features to the discriminative classifiers.

5.1Semi-supervised Image Classification

In this section, we evaluate all methods across all datasets for semi-supervised image classification. Different numbers of training images per class were tested: Scene-15 and Indoor-67 with {1, 2, 5, 10, 20, 50, 100}, LandUse-21 with {1, 2, 5, 10, 20, 30, 50}, Texture-25 with {1, 2, 3, 5, 7, 10, 15}, Building-25, Event-8, and Caltech-101 with {1, 2, 5, 10, 15, 20, 30}, and STL-10 with {1, 5, 10, 20, 50, 100, 500}. The choices differ because the datasets differ in structure: they have different numbers of classes and different numbers of images per class. In keeping with most existing systems for semi-supervised classification [?], we evaluate the method in a transductive manner: we take the training and test samples as a whole, randomly choose labeled samples from the whole dataset for learning, and infer the labels of the remaining samples, whose labels are held back so that they serve as the unlabeled samples. The reported results are the average performance over runs with random labeled-unlabeled splits.
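The transductive protocol described above can be sketched as follows. This is only an illustrative sketch, not the authors' evaluation code: the function names, the `predict_fn` callback, and the seed handling are assumptions introduced here, and the number of runs is left as a parameter because it is not recoverable from the text.

```python
import random
from collections import defaultdict

def transductive_split(labels, n_per_class, rng):
    """Choose n_per_class labeled indices per class; all other indices are unlabeled."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    labeled = []
    for idxs in by_class.values():
        labeled.extend(rng.sample(idxs, min(n_per_class, len(idxs))))
    labeled_set = set(labeled)
    unlabeled = [i for i in range(len(labels)) if i not in labeled_set]
    return sorted(labeled), unlabeled

def average_accuracy(labels, n_per_class, predict_fn, n_runs, seed=0):
    """Average accuracy on the held-back samples over n_runs random splits.

    predict_fn(labeled, unlabeled) is a hypothetical callback that returns one
    predicted label per unlabeled index; it may use the features of all samples
    but only the labels of the `labeled` indices.
    """
    rng = random.Random(seed)
    accuracies = []
    for _ in range(n_runs):
        labeled, unlabeled = transductive_split(labels, n_per_class, rng)
        predictions = predict_fn(labeled, unlabeled)
        correct = sum(p == labels[i] for p, i in zip(predictions, unlabeled))
        accuracies.append(correct / len(unlabeled))
    return sum(accuracies) / len(accuracies)
```

Any classifier can be plugged in through `predict_fn`, which keeps the split logic independent of the feature (learned or original) being evaluated.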

Comparison to baselines: Figure ? shows the results of the three baseline classifiers with our learned feature and the original CNN feature as input, and Table ? lists the results of all methods when labeled training samples are available for each class. The figure shows that the three plain classifiers k-NN, LR, and SVMs perform consistently better with our feature than with the original CNN feature. This is, of course, not a fair comparison, as our feature has been learned with the help of unlabeled samples, while the CNN feature has not. However, this experiment serves as a good sanity check: given access to the unlabeled samples, does the proposed feature learning improve the performance of the system over the original feature? The figure shows clear advantages of our method over the original CNN feature across different datasets and classifiers. The most pronounced improvement occurs when only a small number of labeled training samples is available, from to . This is exactly what the method is designed for: classification tasks where labeled training samples are sparse relative to the available unlabeled samples. Since LR generally performs best with our learned feature, we take LR + EP as our method when comparing to other SSL methods. That comparison follows.

Comparison to other SSL Methods: In this section, we compare our method (LR + EP) with the three SSL methods HF, AG, and LapSVM. Classification precision is reported for HF and AG, while the mean average precision (mAP) over rounds of binary classification is used for LapSVM, because the authors' implementation of LapSVM performs binary classification [?]. Because LapSVM is computationally expensive, we compare our method to it only for the scenario where labeled training samples per class are used.
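For readers unfamiliar with the metric, the sketch below shows one common way to compute mAP from binary rounds. It assumes one one-vs-rest round per class and the usual non-interpolated definition of average precision; both are assumptions made here, since the text does not spell out the exact variant, and the function names are illustrative.

```python
def average_precision(scores, is_positive):
    """AP of one binary round: mean of precision@k over the ranks k of the positives."""
    ranked = sorted(zip(scores, is_positive), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, positive) in enumerate(ranked, start=1):
        if positive:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(score_rows, labels, classes):
    """Average the per-class AP; score_rows[i][c] is the score of sample i for class c."""
    aps = []
    for c in classes:
        scores = [row[c] for row in score_rows]
        aps.append(average_precision(scores, [y == c for y in labels]))
    return sum(aps) / len(aps)
```

Each class is treated as the positive class once, with every other class as negative, so a binary classifier such as LapSVM can be scored on a multi-class dataset.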

Figure ? shows the results of our method (LR + EP) and those of HF and AG, and Table ? lists the precision of the methods when labeled training examples per class are used. Table ? lists the mAP of our method, HF and LapSVM, when labeled training samples are available for each class. The figure and the tables show that our method outperforms the competing SSL methods consistently for semi-supervised image classification. For instance, if labeled training examples per class are used, our method (LR + EP) improves over the best competing method AG by 7.2% in terms of precision on Scene-15, and by 11.9% on Indoor-67. This suggests that our method can achieve superior results for semi-supervised image classification, even when combined with very standard classifiers. The figure and tables also show that graph-based SSL methods such as HF and AG are not very stable. This is mainly due to their sensitivity to the graph structure, which was observed in [?] as well.

The superior performance of our method over other SSL methods can be ascribed to two factors: (1) in addition to the local-consistency assumption, our method also exploits the exotic-consistency assumption; (2) the discriminative projections abstract high-level attributes from the sampled prototypes, e.g., being more “yellow-smooth” than “dark-structured”. As already shown in fully supervised scenarios [?], prototype-linked, attribute-based features are very helpful for image classification. The superior performance of our method over the original feature [?] arises because our method learns the statistics of the to-be-classified dataset, while standard CNN features are trained on a different, albeit very large, dataset. The exploitation of dataset-specific properties by EP can be understood as feature enhancement or fine-tuning in an unsupervised manner.

We further investigate the complementarity of our learned feature with other SSL methods for semi-supervised classification. It is interesting to see from the bottom panel of Table ? that such combinations also boost performance. This suggests that our scheme of exploiting unlabeled data and the previous schemes capture complementary information. However, standard Logistic Regression generally yields the best results for our learned feature.

Robustness to Parameters

In this section, we examine the influence of the parameters of our method on its classification performance. They are the total number of prototype sets , the number of prototypes in each set , the number of images in each prototype , and the number of skeleton hypotheses used in Max-Min Sampling. LR was used as the classifier here. The parameters were evaluated one at a time: the value of one parameter was varied while the others were kept fixed at the values described in the experimental settings.
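The one-at-a-time evaluation scheme can be sketched as below. The parameter names `T`, `r`, `n`, and `m` and all numeric values are placeholders chosen for illustration only; they are not the symbols or settings reported in this paper, which are not recoverable from the text.

```python
def one_at_a_time_configs(defaults, candidates):
    """Yield (name, value, config) where only `name` deviates from the defaults."""
    for name, values in candidates.items():
        for value in values:
            config = dict(defaults)  # copy, so the defaults are never mutated
            config[name] = value
            yield name, value, config

# Hypothetical names and values, for illustration only.
defaults = {"T": 4, "r": 3, "n": 2, "m": 5}
candidates = {"T": [1, 2, 8], "r": [6, 12]}
sweep = list(one_at_a_time_configs(defaults, candidates))
```

Each yielded configuration would then be passed to the feature-learning and evaluation routine, so that any change in accuracy can be attributed to the single varied parameter.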

Figure ? shows the results over a range of their values. The figure shows that the performance of our method increases rapidly with , but then stabilizes quickly. It implies that the method benefits from exploiting more “novel” visual attributes (image prototypes). After increases to some threshold ( for the eight datasets), basically no new attributes are added, and performance stops going up much. For , the figure shows that the performance generally increases with it. This is expected because a large leads to precise attribute assignment. In other words, a large generates more prototypes per set, thus increasing the possibility of linking every image to its desirable attribute. However, we see that when outpaces , the increase is not worth the computing time. A larger would lead to confusing attributes, as it starts to draw very similar or even identical samples into different prototypes. Also, a large results in high-dimensional features, which in turn cause over-fitting.

For , a similar trend was obtained – as increases, the characteristics of the prototypes are enriched, thus boosting the performance. But beyond some threshold ( in our experiments), more noisy images are introduced, thus degrading the performance. One possible solution to further enrich the training samples of each prototype is to perform image transformations such as translation, rotation, and scaling to the seed images, and to add the transformed images into the prototype. This technique of enriching training data has been successfully used recently for image classification [?] and for feature learning [?]. For , Figure ? shows that it does not affect the performance as much as the three parameters analyzed so far. This does not mean that there is no need to use the exotic-consistency assumption. Instead, it suggests that a random selection of images from a dataset of images already fulfills the requirement of the assumption: images should be apart from each other. This is generally true because holds for the datasets considered.
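To make the role of the number of skeleton hypotheses concrete, the sketch below shows one plausible reading of Max-Min Sampling: draw several random candidate skeletons and keep the one whose closest pair of images is farthest apart. This reading is an assumption inferred from the description above (images in a skeleton should be apart from each other); the function names, the Euclidean distance, and the symbols `r` and `m` are illustrative and not taken from the paper.

```python
import math
import random

def min_pairwise_distance(points):
    """Smallest Euclidean distance between any two points (needs at least two points)."""
    return min(
        math.dist(a, b)
        for i, a in enumerate(points)
        for b in points[i + 1:]
    )

def max_min_sample(features, r, m, rng):
    """Draw m random hypotheses of r indices each; keep the most spread-out one."""
    best, best_score = None, -1.0
    for _ in range(m):
        idxs = rng.sample(range(len(features)), r)
        score = min_pairwise_distance([features[i] for i in idxs])
        if score > best_score:
            best, best_score = idxs, score
    return best, best_score
```

With `m = 1` this reduces to plain random selection, which matches the observation that random selection already works well when the skeleton is much smaller than the dataset.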

Although the performance of EP will be affected by the choice of its parameters, we can see from Figure ? that each of the parameters has a wide range of reasonable values to choose from. It is not difficult to choose a set of parameter values that produces better results than competing methods (Figure ? and Table ?). Also, the parameters are quite intuitive and their roles are similar to the parameters of some other methods: analogues of , and can be found in RANSAC, -NN, and Bagging, for instance.

Robustness to Classifier Models

In this section, we evaluate the robustness of our learned features against classifier models. Different values of the balancing parameter between model accuracy and model complexity were tested for the LR classifier across the eight datasets. labeled training examples per class were used. A set of values were tested for the parameter of LR. Figure ? shows the results. It is evident from the figure that our learned feature consistently outperforms the original CNN feature over a large range of parameter values for the classifier models. This property is important for semi-supervised classification, as labeled data is limited in this scenario and probably cannot afford model selection techniques such as Cross-Validation.

Efficiency

Although additional time is needed for feature learning (the direct use of the original feature needs no training at this stage), our method is efficient. The efficiency is due to two reasons: (1) training logistic regression is very efficient; and (2) the performance of our method stabilizes quickly with respect to as Figure ? shows. The training on the datasets takes minutes on a Core i5 2.80 GHz desktop PC. Furthermore, our method is inherently parallelizable and can take advantage of multi-core processors. It is worth noting that this extra training time is compensated for by using a simpler classifier, such as logistic regression, for the classification.

5.2Self-taught Image Classification

In order to evaluate the generality of our method, we tested it in a more general scenario, where the unlabeled data is the set of unlabeled images from the STL-10 dataset. Projection functions were learned from this unlabeled dataset and the performance was tested on the STL-10 dataset. Again, we took the training and test images as a whole, and chose only a small fraction as training images (for the classifiers), with the others used as test images for evaluation. The average accuracy of runs with random training-test splits is reported. Figure ? shows the classification performance with different numbers of labeled training images per class. From the figure and table, it can be observed that our feature learned from the random image collection still outperforms the original CNN feature when the number of labeled training images is small. This is a very helpful property for semi-supervised learning, as it happens quite often that one has no prior access to the data to be classified. The success could be ascribed to the fact that the “universal visual world” (the random image collection) contains abundant high-level, valuable visual attributes such as “blue and open” in some image clusters and “textured and man-made” in others. Exploiting these “hidden” visual attributes is very beneficial for narrowing the semantic gap between low-level features and high-level classification tasks.

However, the figure also shows that as the number of labeled training images increases, the advantage of our learned feature vanishes. The method even produces worse results than the original CNN feature when the number of training samples is large. This is to be expected as the method is designed to improve classification systems by exploiting unlabeled data. Therefore, when a sufficient number of labeled images are available, introducing additional unlabeled ones may hurt the system. This is a general, open problem for semi-supervised learning (self-taught learning) [?]. One possible solution is to study when the classification systems should switch from semi-supervised learning to fully supervised learning. Another solution could be to use the labeled training images directly as the skeleton to generate the prototype sets. This strategy, however, is more limited than ours, and is difficult to use for tasks, such as image clustering, where no labeled samples are available. We leave these issues as future work.

5.3Image Clustering

In this section, we evaluate our learned feature for the task of image clustering. Given a collection of images without any labels, the task is to group them so that images in the same group are more (semantically) similar to each other than to those in other groups. We follow existing work [?] and evaluate the task on the image classification datasets, in particular on the eight datasets used for semi-supervised image classification. To the best of our knowledge, we are the first to evaluate the performance of image clustering on as many as eight standard classification datasets, some of which are still very challenging for supervised image classification. Most clustering methods have been tested only on relatively simple datasets, such as , and classes of the Caltech dataset, and classes of the ETH shape dataset.

Since our main aim is to validate whether the proposed learning is able to boost the performance of the original feature for image clustering, we chose two standard clustering algorithms, Spectral Clustering and k-means, to compare the two features. As to the implementation, we use the parallel implementation of [?] for Spectral Clustering and the VLFeat library of [?] for the k-means algorithm. Since Spectral Clustering and k-means both require the number of clusters as a parameter, we set it to the number of semantic classes of the datasets, leading to weakly-supervised image clustering.

Table ? lists the results of the two features when combined with k-means and Spectral Clustering. Purity is used as the evaluation criterion, which measures the percentage of images from the dominant class within their clusters, averaged over all clusters. The dominant class of a cluster is the (semantic) class that has more image members than other classes in the cluster. The table shows that features learned by EP outperform the original CNN features for image clustering by a considerable margin. For instance, when k-means is used, EP outperforms the CNN feature by on Event-8, and by on STL-10; when Spectral Clustering is used, the improvement is on Scene-15, and on Indoor-67. Again, our feature is learned from the original CNN feature, but goes beyond one single image and captures the similarity relationship among images. The superior performance of the learned feature suggests that it is worth some effort to analyze properties of the datasets to learn a better feature representation before performing image clustering. This is useful for the task of clustering, as all the data is available from the very beginning. This pre-processing step of analyzing datasets has not yet received much attention in the community. We hope that this work will stimulate more efforts in this direction.
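The purity criterion as worded above can be written as a few lines of code. The sketch follows the per-cluster average described in the text; a size-weighted variant (total dominant-class members divided by the total number of images) is also in common use, and the text does not settle which one produced the reported numbers. The function name is illustrative.

```python
from collections import Counter, defaultdict

def cluster_purity(cluster_ids, class_labels):
    """Fraction of dominant-class members per cluster, averaged over all clusters."""
    members = defaultdict(list)
    for cluster_id, label in zip(cluster_ids, class_labels):
        members[cluster_id].append(label)
    fractions = []
    for labels_in_cluster in members.values():
        # Size of the class with the most members in this cluster.
        dominant_count = Counter(labels_in_cluster).most_common(1)[0][1]
        fractions.append(dominant_count / len(labels_in_cluster))
    return sum(fractions) / len(fractions)
```

For example, a cluster holding two images of one class and one of another contributes 2/3, while a cluster holding a single class contributes 1.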

6Conclusion

This paper has tackled the problem of feature learning for the tasks of semi-supervised image classification and image clustering. We proposed a simple, yet effective feature learning method to exploit the available unlabeled data. By using two consistency assumptions, we generate a diverse set of training data for surrogate classes to learn visual attributes in a discriminative way. By doing so, images are classified and linked to the surrogate classes: images are represented by their affinities to a rich set of discovered image attributes for classification and clustering. Experiments on eight datasets showed the superior performance of the learned feature for both semi-supervised image classification and image clustering. In addition, the method is conceptually simple, computationally efficient, and flexible to use. Future work will extend the method to related tasks, such as image segmentation.

Acknowledgements. The work is supported by the ERC Advanced Grant Varcity (#273940).
