Scalable Deep Learning Logo Detection

Hang Su, Shaogang Gong, Xiatian Zhu
Abstract

Existing logo detection methods usually consider a small number of logo classes and limited images per class with a strong assumption of requiring tedious object bounding box annotations, therefore not scalable to real-world dynamic applications. In this work, we tackle these challenges by exploring the webly data learning principle without the need for exhaustive manual labelling. Specifically, we propose a novel incremental learning approach, called Scalable Logo Self-co-Learning (SL), capable of automatically self-discovering informative training images from noisy web data for progressively improving model capability in a cross-model co-learning manner. Moreover, we introduce a very large (2,190,757 images of 194 logo classes) logo dataset “WebLogo-2M” by an automatic web data collection and processing method. Extensive comparative evaluations demonstrate the superiority of the proposed SL method over the state-of-the-art strongly and weakly supervised detection models and contemporary webly data learning approaches.

Index Terms: Webly Learning, Scalable Logo Detection, Incremental Learning, Self-Learning, Co-Learning.

I Introduction

Automated logo detection from unconstrained “in-the-wild” images benefits a wide range of applications, e.g. brand trend prediction for commercial research and vehicle logo recognition for intelligent transportation [1, 2, 3]. This is inherently a challenging task due to the presence of many logos in diverse context with uncontrolled illumination, low-resolution, and background clutter (Fig. 1).

Existing logo detection methods typically consider a small number of logo classes and require large training sets annotated at the logo object instance level, i.e. with object bounding boxes [4, 5, 1, 6, 2, 7, 8, 3]. Whilst this controlled setting allows a straightforward adoption of state-of-the-art object detection models [9, 10, 11], it does not scale to real-world logo detection applications where a much larger number of logo classes are of interest. Two factors limit scalability: (1) the extremely high cost of constructing a large scale logo dataset with exhaustive logo instance bounding box labelling, which is why no such dataset is available [12]; and (2) the lack of incremental model learning to progressively update and expand the model with increasingly more training data without fine-grained labelling. Existing models are mostly trained in one pass and then applied statically to new test data.

Fig. 1: Logo detection challenges: significant logo variation in object size, illumination, background clutter, and occlusion.
Dataset | Logo Classes | Images | Supervision | Noisy | Construction | Scalability | Availability
TopLogo-10 [13] | 10 | 700 | Object-Level | | Manually | Weak |
TennisLogo-20 [14] | 20 | 2,000 | Object-Level | | Manually | Weak |
FlickrLogos-27 [5] | 27 | 810 | Object-Level | | Manually | Weak |
FlickrLogos-32 [1] | 32 | 2,240 | Object-Level | | Manually | Weak |
Logo32-270 [15] | 32 | 8,640 | Object-Level | | Manually | Weak |
BelgaLogos [4] | 37 | 1,321 | Object-Level | | Manually | Weak |
LOGO-NET [16] | 160 | 73,414 | Object-Level | | Manually | Weak |
WebLogo-2M (Ours) | 194 | 2,190,757 | Image-Level | | Automatically | Strong |
TABLE I: Statistics and characteristics of existing logo detection benchmarking datasets.

In this work, we consider scalable logo detection learning in a very large collection of unconstrained images without exhaustive fine-grained object instance level labelling for model training. Given that existing datasets mostly have small numbers of logo classes, one possible strategy is to learn from a small set of labelled training classes and adapt the model to other novel (test) logo classes, that is, Zero-Shot Learning (ZSL) [17, 18, 19]. This class-to-class model transfer and generalisation in ZSL is achieved by knowledge sharing through an intermediate semantic representation for all classes, such as mid-level attributes [18] or a class embedding space of word vectors [19]. However, such methods are limited here because many logos do not share attributes or other forms of semantic representation, owing to their unique characteristics. A lack of large scale logo datasets (Table I), in both class numbers and image instances per class, severely limits the learning of scalable logo detection models. This study explores the webly data learning principle to address both large scale dataset construction and incremental logo detection model learning, without exhaustive manual labelling as the image data grows. We call this setting scalable logo detection.

The contributions of this work are three-fold: (1) We investigate the scalable logo detection problem, characterised by modelling a large quantity of logo classes without exhaustive bounding box labelling. This differs from existing methods, which typically consider only a small number of logo classes and need exhaustive manual labelling at the fine-grained object bounding box level for each class. This scalability problem is under-studied in the literature. (2) We propose a novel incremental learning approach to scalable deep learning logo detection by exploiting multi-class detection with synthetic context augmentation. We call this method Scalable Logo Self-co-Learning (SL), since it automatically discovers potential positive logo images from noisy web data to progressively improve the model discrimination and generalisation capability in an iterative joint self-learning and co-learning manner. (3) We introduce a large logo detection dataset of 2,190,757 images from 194 logo classes, called WebLogo-2M, created by automatically sampling webly logo images from the social media platform Twitter. Importantly, this dataset construction scheme allows the dataset to be expanded easily with new logo classes and images, therefore offering a favourable solution for scalable dataset construction. Extensive experiments on the newly introduced WebLogo-2M dataset demonstrate the superiority of the SL method over not only state-of-the-art strongly (Faster R-CNN [9], SSD [20], YOLOv2 [11]) and weakly (WSL [21]) supervised detection models but also webly learning methods (WLOD [22]).

II Related Works

Logo Detection Early logo detection methods are built on hand-crafted visual features (e.g. SIFT and HOG) and conventional classification models (e.g. SVM) [8, 6, 2, 7, 5]. These methods were only evaluated on small logo datasets with a limited number of both logo images and classes. A few deep methods [23, 16, 13, 14] have recently been proposed by exploiting state-of-the-art object detection models such as R-CNN [24, 9, 10]. This in turn inspired large data construction [16]. However, none of these existing models is scalable to real world deployments, due to two stringent requirements: (1) accurately labelled training data per logo class; (2) strong object-level bounding box annotations. Both requirements give rise to time-consuming training data collection and annotation, which does not scale to a realistically large number of logo classes given limited human labelling effort. In contrast, our method eliminates both needs by allowing the detection model to learn from image-level weakly annotated and noisy images automatically collected from social media (webly). As such, we enable the automated introduction of any quantity of new logos for both dataset construction/expansion and model updating without the need for exhaustive manual labelling.

Logo Datasets A number of logo detection benchmarking datasets exist in the literature (Table I). All existing datasets are constructed manually and are typically small in both image number and logo category, and thus insufficient for deep learning. Recently, Hoi et al. [16] attempted to address this small logo dataset problem by creating a larger LOGO-NET dataset. However, this dataset is not publicly accessible. To address this scalability problem, we propose to collect logo images automatically from social media. This brings two unique benefits: (1) weak image-level labels can be obtained for free; (2) the dataset can easily be upgraded by expanding the logo category set and collecting new logo images without human labelling, and is therefore scalable to any quantity of logo images and logo categories. To our knowledge, this is the first attempt to construct a large scale logo dataset by exploiting inherently noisy web data.

Model Self-Learning Self-training is a special type of incremental learning wherein the new training data are labelled by the model itself: it predicts logo positions and class labels in weakly labelled or unlabelled images, then converts the most confident predictions into training data [25]. An approach similar to ours is the detection model by Rosenberg et al. [26], which also explores the self-training mechanism. However, that method needs a number of strongly and accurately labelled training data per class at the object instance level to initialise the detection model. Moreover, it assumes that all unlabelled images belong to the target object categories. These two assumptions severely limit model effectiveness and scalability given webly collected training data, which have no object bounding box labelling and contain a high ratio of noisy, irrelevant images.

Model Co-Learning Model co-learning is a generic meta-learning strategy originally designed for semi-supervised learning, based on two sufficient and yet conditionally independent feature representations with a single model algorithm [27]. Later on, co-learning was further developed into designs that use different model parameter settings [28] or different models [29, 30, 31] on the same feature representation. Overall, the key is that both models in co-learning need to be independently effective but also complementary to each other. Recently, there have been some attempts at semi-supervised classification methods in the co-learning spirit [32, 33, 34, 35]. Beyond these, we extend the co-learning concept from semi-supervised learning to webly learning for scalable logo detection. In particular, we unite co-learning and self-learning in a single deep learning detection framework capable of incrementally improving the logo detection models. To our knowledge, this is the first attempt at exploiting such a self-co-learning approach in the logo detection literature.

III WebLogo-2M Logo Detection Dataset

We present a scalable method to automatically construct a large logo detection dataset, called WebLogo-2M, with 2,190,757 webly images from 194 logo classes (Table II).

Logos | Raw Images | Filtered Images | Noise Rate (%)
194 | 4,941,317 | 2,190,757 | Varying
- | - | (6/2,583/179,789) | (25.0/90.2/99.8)
TABLE II: Statistics of the WebLogo-2M dataset. Numbers in parentheses: the minimum/median/maximum per class.

III-A Logo Image Collection and Filtering

Logo Selection A total of 194 logo classes from 13 different categories are selected in the WebLogo-2M dataset (Fig. 4). They are popular logos and brands in daily life, including the 32 logo classes of FlickrLogos-32 [1] and the 10 logo classes of TopLogo-10 [13]. Specifically, the logo class selection was guided by an extensive review of social media reports on brand popularity (footnotes 1-3) and market value (footnotes 4-5).
Footnote 1: http://www.ranker.com/crowdranked-list/ranking-the-best-logos-in-the-world
Footnote 2: http://zankrank.com/Ranqings/?currentRanqing=logos
Footnote 3: http://uk.complex.com/style/2013/03/the-50-most-iconic-brand-logos-of-all-time
Footnote 4: http://www.forbes.com/powerful-brands/list/#tab:rank
Footnote 5: http://brandirectory.com/league_tables/table/apparel-50-2016

Image Source Selection We selected the social media website Twitter as the data source of WebLogo-2M. Twitter offers well structured multi-media data stream sources and, more critically, unlimited data access permission, which facilitates the collection of large scale logo images (footnote 6).
Footnote 6: We also tried the Google and Bing search engines and three other social media platforms (Facebook, Instagram, and Flickr). However, all of them are rather restrictive in data access, which limits incremental big data collection; e.g. Instagram allows only 500 image downloads per hour through the official web API.

Image Collection We collected 4,941,317 webly logo images. Specifically, through the Twitter API, one can automatically retrieve images from tweets by matching query keywords against tweets in real time. In our case, we query the logo brand names so that images in tweets containing the query words can be extracted. The retrieved images are then labelled with the corresponding logo name at the image level, i.e. weakly labelled.

Logo Image Filtering We obtained a total of 2,190,757 images after a two-step automatic filtering: (1) Noise Removal: We removed images of small width and/or height (e.g. less than 100 pixels), since we observed statistically that such images mostly contain no logo objects (noise). (2) Duplicate Removal: We identified and discarded exact duplicates (i.e. multiple copies of the same image). Specifically, given a reference image, we removed those with identical width and height. This image spatial size based scheme is not only computationally cheaper than the appearance based alternative [36], but also very effective. For example, we manually examined the de-duplicating process on randomly selected reference images and found that over 90% of the images are true duplicates.
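The two filtering steps can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the `ImageMeta` record and `filter_images` function are hypothetical names, the 100-pixel threshold is the example value quoted above, and the scope of duplicate removal (per class or over the whole collection) is not specified in the text, so the sketch simply deduplicates whatever collection it is given.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple

MIN_SIDE = 100  # pixels; the text gives 100 only as an example threshold


@dataclass(frozen=True)
class ImageMeta:
    """Hypothetical record for one downloaded image (not from the paper)."""
    path: str
    width: int
    height: int


def filter_images(images: Iterable[ImageMeta], min_side: int = MIN_SIDE) -> List[ImageMeta]:
    """Apply noise removal, then size-based duplicate removal, in input order."""
    kept: List[ImageMeta] = []
    seen_sizes: Set[Tuple[int, int]] = set()
    for img in images:
        # Step 1 (noise removal): drop images with a small width and/or height.
        if img.width < min_side or img.height < min_side:
            continue
        # Step 2 (duplicate removal): an image whose (width, height) was
        # already seen is treated as a copy of the earlier reference image.
        size = (img.width, img.height)
        if size in seen_sizes:
            continue
        seen_sizes.add(size)
        kept.append(img)
    return kept
```

Because only the spatial size is compared, no pixel data need to be decoded, which is where the saving over an appearance based comparison comes from.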

III-B Properties of WebLogo-2M

Compared to existing logo detection databases [4, 1, 16, 13], this webly logo image dataset presents three unique properties inherent to large scale data exploration for learning scalable logo models:

(I) Weak Annotation All WebLogo-2M images are weakly labelled at the image level by the query keywords. These labels are obtained automatically during data collection, without fine-grained human labelling. This is much more scalable than manually annotating accurate individual logo bounding boxes, particularly when the numbers of both logo images and classes are very large.

(II) Noisy (False Positives) Images collected from online web sources are inherently noisy, e.g. often no logo object appears in an image, which provides plenty of natural false positive samples. To estimate the degree of noisiness, we randomly sampled at most 1,000 web images per class for all 194 classes and manually examined whether they are true or false logo images (footnote 7). As shown in Fig. 2, the true logo image ratio varies significantly among the 194 logos, e.g. 75% for “Rittersport” vs. 0.2% for “3M”. On average, true logo images form only a minority, with the remainder being false positives (the median per-class noise rate is 90.2%, Table II). Such noisy images pose significant challenges to model learning, even though plenty of data are available, scalable to very large size in both class numbers and samples per class.
Footnote 7: For sparse logo classes with fewer than 1,000 webly collected images, we examined all available images.

Fig. 2: True logo image ratios (%). This was estimated from at most 1,000 random logo images per class over 194 classes.
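The sampling protocol behind Fig. 2 can be sketched as below. This is an illustrative sketch only: `estimate_true_ratio` is a hypothetical helper, and the `is_true_logo` callback stands in for the manual examination, which in the paper is done by a human rather than by code.

```python
import random
from typing import Callable, Dict, Sequence


def estimate_true_ratio(
    images_by_class: Dict[str, Sequence[str]],
    is_true_logo: Callable[[str, str], bool],
    max_samples: int = 1000,
    seed: int = 0,
) -> Dict[str, float]:
    """Return the true logo image ratio (%) per class from a bounded random sample."""
    rng = random.Random(seed)
    ratios: Dict[str, float] = {}
    for name, images in images_by_class.items():
        if not images:
            continue  # nothing to examine for an empty class
        if len(images) > max_samples:
            sample = rng.sample(list(images), max_samples)
        else:
            # Sparse classes: examine all available images.
            sample = list(images)
        positives = sum(1 for image in sample if is_true_logo(name, image))
        ratios[name] = 100.0 * positives / len(sample)
    return ratios
```

The noise rate reported in Table II is the complement of this ratio, i.e. 100 minus the true logo image ratio.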

(III) Class Imbalance The WebLogo-2M dataset presents a natural logo object occurrence imbalance in public scenes. Specifically, logo images collected from web streams exhibit a power-law distribution (Fig. 3). This property is often artificially eliminated in most existing logo datasets by careful manual filtering, which not only requires extra labelling effort but also renders the model learning challenges unrealistic. We preserve the inherent class imbalance nature in the data for achieving fully automated dataset construction and retaining realistic model learning challenges. This requires minimising model learning bias towards densely-sampled classes [37].

Fig. 3: Imbalanced logo image class distribution, ranging from 6 images (“Soundrop”) to 179,789 images (“Youtube”), i.e. an imbalance ratio of about 29,965.
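The imbalance ratio quoted in the Fig. 3 caption is the size of the largest class divided by the size of the smallest. A small sketch of that arithmetic, where `imbalance_ratio` is a hypothetical helper name:

```python
from typing import Dict


def imbalance_ratio(class_counts: Dict[str, int]) -> float:
    """Ratio between the most and the least populated class."""
    counts = list(class_counts.values())
    return max(counts) / min(counts)


# The two extreme classes reported for WebLogo-2M.
ratio = imbalance_ratio({"Soundrop": 6, "Youtube": 179_789})
print(round(ratio))  # 179,789 / 6 rounds to 29,965
```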
Fig. 4: A glimpse of the WebLogo-2M dataset. (a) Example webly (Twitter) logo images randomly selected from the class “Adidas”, with logo instances manually labelled by green dashed bounding boxes only to facilitate viewing. Most images contain no “Adidas” object, i.e. they are false positives, suggesting a high noise degree in webly collected data. (b) Clean images of 194 logo classes automatically collected from Google Image Search, used in synthetic training image generation and augmentation. (c) One example true positive webly image per logo class, 194 images in total, showing the rich and diverse context in unconstrained images where typical logo objects reside in practice, as compared to the clean logo images in (b).

Further Remark Since the proposed dataset construction method is completely automated, new logo classes can be added easily without human labelling. The dataset can therefore be enlarged cumulatively, in contrast to existing methods [12, 16, 38, 39, 4, 1, 13] that require exhaustive human labelling, which hampers further dataset updating and enlarging. This automation is particularly important for creating object detection datasets, which need expensive explicit labelling of object bounding boxes, more so than for constructing cheaper image-level class annotation datasets [40]. While being more scalable, the WebLogo-2M dataset also provides more realistic challenges for model learning, given weaker label information, noisy image data, unknown scene context, and significant class imbalance.

III-C Benchmarking Training and Test Data

We define a benchmarking logo detection setting here. In the scalable webly learning context, we deploy the whole WebLogo-2M dataset (2,190,757 images) as the training data. For performance evaluation, a set of images with fine-grained object-level annotation groundtruth is required. To that end, we construct an independent test set of 6,558 logo images with logo bounding box labels by (1) assembling 2,870 labelled images from the FlickrLogos-32 [1] and TopLogo-10 [13] datasets and (2) manually labelling 3,688 images independently collected from Twitter. Note that the only purpose of labelling this test set is the performance evaluation of different detection methods; it is independent of the WebLogo-2M auto-construction.

IV Training A Multi-Class Logo Detector

We aim to automatically train a multi-class logo detection model incrementally from noisy and weakly labelled web images. Unlike existing methods, which build a detector in a one-pass “batch” learning procedure, we propose to enhance the model capability incrementally and “sequentially”, in a joint spirit of self-learning [25] and co-learning [27]. This is necessary because sufficient accurate fine-grained training images per logo class are unavailable. In other words, the model must self-select trustworthy images from the noisy webly labelled data (WebLogo-2M) to progressively develop and refine itself. This is a catch-22 problem: the lack of sufficient good-quality training data leads to a suboptimal model, which in turn produces error-prone predictions. This may cause model drift, where the errors in model prediction are propagated through the iterations and can corrupt the model knowledge structure. Also, the inherent data imbalance over different logo classes may bias model learning towards a small number of majority classes, leading to significantly weaker capability in detecting minority classes. Moreover, the two problems above are intrinsically interdependent, with one possibly negatively affecting the other. It is non-trivial to solve these challenges without exhaustive fine-grained human annotations.

Formulation Rationale In this work, we present a scalable logo detection deep learning solution capable of addressing the aforementioned two issues in a self-co-learning manner. The intuition is as follows: web knowledge provides ambiguous but still useful coarse image-level logo annotations; self-learning offers a scalable means to explore such weak/noisy information iteratively; and co-learning allows for mining the complementary advantages of different modelling approaches to further improve the self-learning effectiveness. We call our method Scalable Logo Self-co-Learning (SL).

Model Design To establish a more effective SL framework, we select strongly supervised rather than weakly supervised object detection deep learning models for two reasons: (1) the performance of weakly supervised models [41] is much inferior to that of strongly supervised counterparts; (2) the noisy webly weak labels may further hamper the effectiveness of weakly supervised learning. In our model instantiation, we choose the Faster R-CNN [9] and YOLOv2 [11] models for self-co-learning. Conceptually, this model selection is independent of the SL notion, and stronger deep learning detector models generally lead to a more advanced SL solution. A schematic overview of the SL framework is depicted in Fig. 5.

Fig. 5: Overview of the Scalable Logo Self-co-Learning (SL) method. (a) Model initialisation by using synthetic logo training images (Sec. IV-A). (b) Incrementally self-mining positive logo images from noisy web data pool (Sec. IV-B). (c) Incrementally co-learning the detection models by mined web images and context-enhanced synthetic data (Sec. IV-C). This process is repeated iteratively for progressive training data mining and model update.

IV-A Model Bootstrap

To start the SL process, we first provide co-learning logo detection models that are discriminative enough to discover sufficient bootstrapping training data. Both Faster R-CNN and YOLOv2 need strongly supervised learning from object-level bounding box annotations to achieve logo object detection discrimination; such annotations, however, are not available in our scalable webly learning setting.

To address this problem in the scalable logo detection context, we propose to synthesise fine-grained training logo images, which maintains model learning scalability for accommodating a large number of logo classes. In particular, we generate synthetic training images as in [13]: logo icon images are overlaid at random locations of non-logo background images, so that bounding box annotations can be generated automatically and completely. The logo icon images are automatically collected from Google Image Search by querying the corresponding logo class name (Fig. 4 (b)). The background images can be chosen flexibly, e.g. the non-logo images in the FlickrLogos-32 dataset [1] and others retrieved by irrelevant query words from web search engines. To enhance appearance variations in synthetic logos, colour and geometric transformations can be applied [13].
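The core of this synthesis step, pasting a logo icon at a random location and deriving the bounding box for free, can be sketched as follows. This is a simplified illustration, not the SCL implementation of [13]: images are represented as plain nested lists of RGB tuples, `overlay_logo` is a hypothetical helper, and the colour and geometric transformations mentioned above are omitted.

```python
import random
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]          # rows of RGB pixels
Box = Tuple[int, int, int, int]    # (x_min, y_min, x_max, y_max), max exclusive


def overlay_logo(background: Image, logo: Image, rng: random.Random) -> Tuple[Image, Box]:
    """Paste `logo` at a random location of `background` and return the box."""
    bg_h, bg_w = len(background), len(background[0])
    lg_h, lg_w = len(logo), len(logo[0])
    if lg_h > bg_h or lg_w > bg_w:
        raise ValueError("logo must fit inside the background")
    # Sample the top-left corner so that the logo stays fully inside the image.
    top = rng.randint(0, bg_h - lg_h)
    left = rng.randint(0, bg_w - lg_w)
    out = [row[:] for row in background]  # copy, leave the background untouched
    for dy in range(lg_h):
        out[top + dy][left:left + lg_w] = logo[dy]
    return out, (left, top, left + lg_w, top + lg_h)
```

Since the paste location is chosen by the program, the bounding box annotation is exact by construction and no manual labelling is involved.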

Training Details We synthesised 1,000 training images per logo class, 194,000 images in total. The Faster R-CNN and YOLOv2 models were each trained with a fixed learning rate and a fixed number of learning iterations. Following [13], we pre-trained the detector models on ImageNet 1000-class object classification images [12] for model warmup.

IV-B Incremental Self-Mining Noisy Web Images

After the logo detector models are discriminatively bootstrapped, we proceed to improve their detection capability with incrementally self-mined positive (likely) logo images from the weakly labelled WebLogo-2M data. To identify the most compatible training images, we define a selection function using the detection score of the up-to-date model:

$S(M_t, x, y) = S_{\mathrm{det}}(y \mid M_t, x)$ (1)

where $M_t$ denotes the $t$-th step detector model (Faster R-CNN or YOLOv2), and $x$ denotes a logo image with the web image-level label $y \in \{1, \dots, m\}$, with $m$ the total logo class number. $S_{\mathrm{det}}(y \mid M_t, x) \in [0, 1]$ indicates the maximal detection score of $x$ on the logo class $y$ by model $M_t$. For positive logo image selection, we consider a high detection confidence threshold (0.9 in our experiments) [42], to strictly control the impact of model detection errors in degrading the incremental learning benefits. This new training data discovery process is summarised in Alg. 1.
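The selection rule described above (the maximal detection score on the web-labelled class, compared against the 0.9 threshold) can be sketched as follows. The `Detection` record and both function names are hypothetical, and whether the comparison is strict or inclusive is not stated in the text, so the inclusive form used here is an assumption.

```python
from typing import List, NamedTuple, Tuple


class Detection(NamedTuple):
    """Hypothetical detector output for one predicted box."""
    class_id: int
    score: float                      # confidence in [0, 1]
    box: Tuple[int, int, int, int]    # (x_min, y_min, x_max, y_max)


def selection_score(detections: List[Detection], web_label: int) -> float:
    """Maximal detection score on the class given by the web image-level label."""
    scores = [d.score for d in detections if d.class_id == web_label]
    return max(scores, default=0.0)   # no detection of that class scores 0


def is_selected(detections: List[Detection], web_label: int, threshold: float = 0.9) -> bool:
    """Keep the image as a likely positive only if the score reaches the threshold."""
    return selection_score(detections, web_label) >= threshold
```

Note that detections of classes other than the web label are ignored, so a confident detection of an unrelated logo does not cause the image to be selected.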

With this self-mining process at the $t$-th iteration, we obtain a separate set of updated training data for Faster R-CNN and YOLOv2, denoted as $\mathcal{T}_t^{f}$ and $\mathcal{T}_t^{y}$ respectively. Each of the two sets reflects detection performance with distinct characteristics, due to the different formulation designs of the two models, e.g. Faster R-CNN is based on region proposals whilst YOLOv2 relies on pre-defined-grid centred regression. This satisfies the condition for cross-model co-learning, i.e. diverse and independent modelling.

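Input: Current model $M_t$, Unexplored data $\mathcal{D}_t$, Self-discovered logo training data $\mathcal{T}_{t-1}$ ($\mathcal{T}_0 = \emptyset$);
Output: Updated self-discovered training data $\mathcal{T}_t$, Updated unlabelled data pool $\mathcal{D}_{t+1}$;
Initialisation:
$\mathcal{T}_t = \mathcal{T}_{t-1}$;
$\mathcal{D}_{t+1} = \mathcal{D}_t$;
for image $x$ in $\mathcal{D}_t$
     Apply $M_t$ to get the detection results;
     Evaluate image $x$ as a potential positive logo image;
          if Meeting selection criterion
                $\mathcal{T}_t = \mathcal{T}_t \cup \{x\}$;
                $\mathcal{D}_{t+1} = \mathcal{D}_{t+1} \setminus \{x\}$;
          end if
end for
Return $\mathcal{T}_t$ and $\mathcal{D}_{t+1}$.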

Algorithm 1 Incremental Self-Mining Noisy Web Images
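A compact Python rendering of Alg. 1 is given below as a sketch. The detector is abstracted as a `detect_score(image, label)` callable returning the maximal detection score on the web-labelled class; this interface, the function name `self_mine`, and the inclusive threshold comparison are assumptions made for illustration. The sketch tracks only image identifiers and labels; in practice the predicted boxes of selected images would also need to be stored as training annotations.

```python
from typing import Callable, List, Tuple

Sample = Tuple[str, int]  # (image identifier, web image-level label)


def self_mine(
    detect_score: Callable[[str, int], float],
    unexplored: List[Sample],
    mined: List[Sample],
    threshold: float = 0.9,
) -> Tuple[List[Sample], List[Sample]]:
    """One pass of incremental self-mining over the unexplored web data pool."""
    updated = list(mined)            # start from the previously mined data
    remaining: List[Sample] = []     # pool left for later iterations
    for image, label in unexplored:
        # Apply the current model and check the selection criterion.
        if detect_score(image, label) >= threshold:
            updated.append((image, label))    # move into the training data
        else:
            remaining.append((image, label))  # keep it in the unexplored pool
    return updated, remaining
```

Images that fail the criterion stay in the pool, so they can still be selected in a later iteration once the model has improved.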

IV-C Incremental Model Co-Learning

Given the two up-to-date training sets, $\mathcal{T}_t^{f}$ mined by Faster R-CNN and $\mathcal{T}_t^{y}$ mined by YOLOv2, we conduct co-learning on the two detection models (Fig. 5(c)). Specifically, we incrementally update the Faster R-CNN model with the set self-mined by YOLOv2, and vice versa. As such, the complementary modelling advantages can be exploited incrementally in a self-mining cross-updating manner.
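The cross-updating pattern can be sketched as below. The four callables are hypothetical stand-ins for the real mining and fine-tuning routines, and mining with both models before updating either one is an assumption of this sketch, made so that each model is updated with data selected by the other model's previous state.

```python
from typing import Callable, List


def co_learning_round(
    mine_frcnn: Callable[[], List[str]],
    mine_yolo: Callable[[], List[str]],
    finetune_frcnn: Callable[[List[str]], None],
    finetune_yolo: Callable[[List[str]], None],
) -> None:
    """One co-learning round: each model is updated with the other's mined set."""
    # Self-mining: each detector selects its own confident web images.
    mined_by_frcnn = mine_frcnn()
    mined_by_yolo = mine_yolo()
    # Cross-updating: swap the mined sets between the two detectors.
    finetune_frcnn(mined_by_yolo)
    finetune_yolo(mined_by_frcnn)
```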

Recall that the logo images across classes are imbalanced (Fig. 3). This can lead to biased model learning towards well-sampled classes (the majority classes), resulting in poor performance against sparsely-labelled classes (the minority classes) [37]. To address this problem, we propose the idea of cross-class context augmentation for not only fully exploring the contextual richness of WebLogo-2M data but also addressing the intrinsic imbalanced logo class problem, inspired by the potential of context enhancement [13].

Specifically, we ensure that at least $N_{\mathrm{cls}}$ images will be newly introduced into the training data pool in each self-discovery iteration for each detection model. Suppose $N_{\mathrm{web}}^{i}$ web images are self-discovered for the logo class $i$ (Alg. 1); we then generate $N_{\mathrm{syn}}^{i}$ synthetic images, where

$N_{\mathrm{syn}}^{i} = \max(0, N_{\mathrm{cls}} - N_{\mathrm{web}}^{i})$ (2)

Therefore, we only perform synthetic data augmentation for those classes with fewer than $N_{\mathrm{cls}}$ real web images mined in the current iteration. We keep $N_{\mathrm{cls}}$ moderate, considering that too many synthetic images may bring negative effects due to the imperfect rendering of logo appearance against the background. Importantly, we choose the self-mined logo images of other classes ($j \neq i$) as the background images, specifically to enrich the contextual diversity of logo class $i$ (Fig. 6). We utilise the SCL synthesising method [13] as in model bootstrap (Sec. IV-A).
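The top-up rule can be illustrated with a short sketch: a class that already has enough mined web images receives no synthetic images, and every other class is topped up to the per-class minimum. The function name `synthetic_counts` is hypothetical, and the minimum of 500 in the usage example is an arbitrary illustrative value, not the setting used in the paper.

```python
from typing import Dict


def synthetic_counts(mined_counts: Dict[str, int], n_cls: int) -> Dict[str, int]:
    """Number of synthetic images to generate per class in one iteration."""
    return {
        name: max(0, n_cls - n_web)   # only top up classes below the minimum
        for name, n_web in mined_counts.items()
    }


# Illustrative values only; 500 is an assumed per-class minimum.
example = synthetic_counts({"Adidas": 120, "Youtube": 700, "Soundrop": 0}, n_cls=500)
print(example)  # {'Adidas': 380, 'Youtube': 0, 'Soundrop': 500}
```

Under this rule, minority classes receive the most synthetic images, which is how the augmentation counteracts the class imbalance of the mined web data.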

Fig. 6: Example logo images by synthetic context augmentation. Red box: model detection; Green box: synthetic logo ground truth.

Once we have the self-mined web training images and the context enriched synthetic data, we fine-tune the detection models at a fixed learning rate, with the number of iterations depending on the training data size at each iteration. We adopt the original deep learning loss formulation for both Faster R-CNN and YOLOv2. Model generalisation improves when the training data quality is sufficient in terms of both the true-false logo image ratio and the context richness.

IV-D Incremental Learning Stop Criterion

We conduct the incremental model self-co-learning until meeting some stop criterion, for example, the model performance gain becomes marginal or zero. We adopt the YOLOv2 as the deployment logo detection model due to its superior efficiency and accuracy (see Table V). In practice, we can assess the model performance on an independent set, e.g. the test evaluation data.
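One way to express this stop criterion is sketched below. The `evaluate` and `step` callables are hypothetical placeholders for measuring mAP on a held-out set and for running one round of self-mining plus co-learning; the `min_gain` and `max_iters` parameters are assumptions of the sketch rather than settings from the paper.

```python
from typing import Callable, List


def self_co_learn(
    evaluate: Callable[[], float],
    step: Callable[[], None],
    min_gain: float = 0.0,
    max_iters: int = 20,
) -> List[float]:
    """Iterate until the performance gain is no larger than `min_gain`."""
    history = [evaluate()]            # iteration 0: the bootstrapped model
    for _ in range(max_iters):
        step()                        # one round of self-mining and co-learning
        history.append(evaluate())
        if history[-1] - history[-2] <= min_gain:
            break                     # gain is marginal or zero: stop
    return history
```

Replaying the mAP sequence reported in Table IV through this loop stops at iteration 8, the first iteration whose gain is zero.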

V Experiments

Competitors We compared the proposed SL model with six state-of-the-art alternative detection approaches:
(1) Faster R-CNN [9]: A competitive region proposal driven object detection model which is characterised by jointly learning region proposal generation and object classification in a single deep model. In our scalable webly learning context, the Faster R-CNN is optimised with synthetic training data generated by the SCL [13] method, exactly the same as our SL model.
(2) SSD [20]: A state-of-the-art regression optimisation based object detection model. We similarly learn this strongly supervised model with synthetic logo instance bounding box labels as Faster R-CNN above.
(3) YOLOv2 [11]: A contemporary bounding box regression based multi-class object detection model. We learned this model with the same training data as SSD and Faster R-CNN.
(4) Weakly Supervised object Localisation (WSL) [21]: A state-of-the-art weakly supervised detection model that can be trained with image-level logo label annotations in a multi-instance learning framework. We can therefore directly utilise the webly labelled WebLogo-2M images to train the WSL detection model. Note that the noisy logo labels inherent to web data may pose additional challenges, on top of the high complexity in logo appearance and context.
(5) Webly Learning Object Detection (WLOD) [22]: A state-of-the-art weakly supervised object detection method where clean Google images are used to train exemplar classifiers, which are deployed to classify region proposals generated by EdgeBox [43]. In our implementation, we further improved the classification component by exploiting a VGG-16 [44] model trained on ImageNet-1K and Pascal VOC as the deep feature extractor and the L2 distance as the matching metric. We adopted the nearest neighbour classification model with Google logo images (Fig. 4(b)) as the labelled training data.
(6) WLOD+SCL: a variant of WLOD [22] with context enriched training data by exploiting SCL [13] to synthesise various context for Google logo images.
Overall, these existing methods cover both strongly and weakly supervised learning based detection models.

Performance Metrics For the quantitative performance measure of logo detection, we utilised the Average Precision (AP) for each individual logo class, and the mean Average Precision (mAP) over all classes [45]. A detection is considered correct when the Intersection over Union (IoU) between the predicted and groundtruth bounding boxes exceeds a preset threshold.
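The IoU test and the averaging step of mAP can be sketched as follows. The default threshold of 0.5 is an assumption that follows the common PASCAL VOC convention, and the function names are hypothetical. Computing the per-class AP itself (ranking detections and integrating the precision-recall curve) is left out of this sketch.

```python
from typing import Dict, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_correct(pred: Box, truth: Box, threshold: float = 0.5) -> bool:
    """A detection counts as correct when its IoU exceeds the threshold."""
    return iou(pred, truth) > threshold


def mean_average_precision(ap_by_class: Dict[str, float]) -> float:
    """mAP is the unweighted mean of the per-class AP values."""
    return sum(ap_by_class.values()) / len(ap_by_class)
```

Because mAP averages over classes rather than over images, a minority class contributes as much to the final score as a majority class.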

V-A Comparative Evaluations

Method | mAP (%)
SSD [20] | 8.8
Faster R-CNN [9] | 14.9
YOLOv2 [11] | 18.4
WSL [21] | 3.6
WLOD [22] | 19.3
WLOD [22] + SCL [13] | 7.8
SL (Ours) | 46.9
TABLE III: Logo detection performance on WebLogo-2M.
Fig. 7: Qualitative evaluations of the (a) WLOD and (b) SL models. Green dashed boxes: ground truth. Red solid boxes: detected. WLOD fails to detect the visually ambiguous logo instance, succeeds on the relatively clean logo instances, and fires only partially on the salient one. The SL model correctly detects all these logo instances with varying context and appearance quality.
Iteration | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8
mAP (%) | 18.4 | 28.6 | 33.2 | 39.1 | 42.2 | 44.4 | 45.6 | 46.9 | 46.9
mAP Gain (%) | N/A | 10.2 | 4.6 | 5.9 | 3.1 | 2.2 | 1.2 | 1.3 | 0.0
Training Images | 5,862 | 21,610 | 41,314 | 54,387 | 74,855 | 86,599 | 98,055 | 107,327 | Stop
TABLE IV: Model performance development over incremental SL iterations.

We compared the scalable logo detection performance on the WebLogo-2M benchmarking test data in Table III. It is evident that the proposed SL model significantly outperforms all other alternative methods, e.g. surpassing the best baseline WLOD by 27.6% (46.9%-19.3%) in mAP. We also have the following observations:
(1) The weakly supervised learning based model WSL produces the worst result, due to the joint effects of complex logo appearance variation against unconstrained context and the large proportions of false positive logo images (Fig. 2).
(2) The WLOD method performs reasonably well suggesting that the knowledge learned from auxiliary data sources (ImageNet and Pascal VOC) is transferable to some degree, confirming the similar findings as in [46, 47].
(3) By utilising synthetic training images with rich context and background, the fully supervised detection models YOLOv2 and Faster R-CNN achieve the second and third best results among all competitors. This suggests that context augmentation is critical for object detection model optimisation, and that combining a strongly supervised learning model with automatic training data synthesis is a better strategy than weakly supervised learning in the webly learning setting.
(4) Another supervised model, SSD, yields much lower detection performance. This is consistent with the original finding that this model is sensitive to the bounding box size of objects, with weaker detection performance on small objects such as in-the-wild logo instances.
(5) WLOD+SCL produces a weaker result (7.8%) than WLOD alone (19.3%). This indicates that joint supervised learning is critical for exploiting context enriched data augmentation; without it, the synthetic data is likely to introduce distracting effects that degrade matching.

Qualitative Evaluation. For visual comparison, we show a number of qualitative logo detection examples from three classes by the SL and WLOD models in Fig. 7.

V-B Further Analysis and Discussions

V-B1 Effects of Incremental Model Self-Co-Learning

We evaluated the effects of incremental model self-co-learning on self-discovered training data and context enriched synthetic images by examining the SL model performance at individual iterations. Table IV and Fig. 9 show that the SL model improves consistently over successive iterations of self-co-learning until the mAP saturates at 46.9%. In particular, the first data mining iteration brings about the maximal mAP gain of 10.2% (28.6%-18.4%), with the per-iteration benefit mostly dropping gradually thereafter. This suggests that our model design is capable of effectively addressing the notorious error propagation challenge thanks to (1) a proper detection model initialisation by logo context synthesising, which provides a sufficiently good starting-point detection; (2) a strict selection on self-evaluated detections, which reduces the number of false positives and suppresses the likelihood of error propagation; and (3) cross-model co-learning with cross-logo context enriched synthetic training data augmentation, which addresses the imbalanced data learning problem whilst enhancing the model robustness against diverse unconstrained background clutter. We also observed that more images are mined as the incremental data mining process proceeds, suggesting that the SL model becomes increasingly capable of handling more complex context over time. This may simultaneously introduce more false positives, which can slow the rate of model improvement, as indicated in Fig. 8.
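The per-iteration gains discussed above can be reproduced directly from the mAP row of Table IV. The following is a minimal sketch, not part of the SL implementation: the function names and the `min_gain` threshold are illustrative assumptions, and the stopping rule shown is only one plausible reading of the "Stop" entry in Table IV.

```python
# mAP (%) at SL iterations 0..8, copied from Table IV.
map_per_iter = [18.4, 28.6, 33.2, 39.1, 42.2, 44.4, 45.6, 46.9, 46.9]


def map_gains(values):
    """Return the mAP gain of each iteration over the previous one (0.1 precision)."""
    return [round(curr - prev, 1) for prev, curr in zip(values, values[1:])]


def first_saturated_iteration(values, min_gain=0.1):
    """Hypothetical stop rule: first iteration whose gain falls below min_gain.

    Returns None if every iteration still improves by at least min_gain.
    """
    for iteration, gain in enumerate(map_gains(values), start=1):
        if gain < min_gain:
            return iteration
    return None


gains = map_gains(map_per_iter)
print("Per-iteration gains:", gains)
print("Largest gain at iteration:", gains.index(max(gains)) + 1)
print("Saturates at iteration:", first_saturated_iteration(map_per_iter))
```

Under these assumptions the largest gain falls on the first iteration and the gain drops to zero at the eighth, matching the "mAP Gain" row of Table IV.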

Fig. 8: Randomly selected images self-discovered in three different iterations, shown in panels (a), (b), and (c), for the logo class “Android”. Red box: SL model detection. Red cross: false detection. The images mined in the iteration of panel (a) have clean logo instances and background, whilst those discovered in the iterations of panels (b) and (c) have more diverse logo appearance variations in richer and more complex context. More false detections are likely to be produced in the self-discovery of panels (b) and (c).
Fig. 9: Evaluating the model co-learning and self-learning strategies, and the effect of Context Enhancement (CE) based training data class balancing in SL on WebLogo-2M.
Method mAP (%)
Self-Learning (Faster R-CNN) 36.8
Self-Learning (YOLO) 39.4
Co-Learning (Faster R-CNN) 44.2
Co-Learning (YOLO) (SL) 46.9
TABLE V: Co-learning versus self-learning.

V-B2 Effects of Cross-Model Co-Learning

We assessed the benefits of cross-model co-learning between Faster R-CNN and YOLOv2 in SL in comparison with the single-model self-learning strategy. In contrast to co-learning, self-learning exploits self-mined new training data for incremental model update without the benefit of cross-model complementary advantages. Table V and Fig. 9 show that both models gain clearly from co-learning, e.g. 7.4% (44.2-36.8) for Faster R-CNN and 7.5% (46.9-39.4) for YOLOv2. This verifies our motivation for exploiting the co-learning principle to maximise the complementary information of different detection modelling formulations in scalable logo detection model optimisation.
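As a quick check of the arithmetic above, the co-learning gains can be recomputed from Table V. This is an illustrative sketch only; the dictionary layout and variable names are assumptions made for this example.

```python
# mAP (%) from Table V for each detector under the two training strategies.
table_v = {
    "Faster R-CNN": {"self_learning": 36.8, "co_learning": 44.2},
    "YOLOv2": {"self_learning": 39.4, "co_learning": 46.9},
}

# Gain of co-learning over self-learning for each detector (0.1 precision).
co_learning_gain = {
    model: round(scores["co_learning"] - scores["self_learning"], 1)
    for model, scores in table_v.items()
}

for model, gain in co_learning_gain.items():
    print(f"{model}: +{gain} mAP points from co-learning")
```

Both detectors improve by a similar margin, which is consistent with the claim that each model benefits from the other's complementary detections.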

Iteration 0 1 2 3 4 5
With CE 18.4 28.6 33.2 39.1 42.2 44.4
Without CE 18.4 25.3 27.7 28.7 28.9 28.0
TABLE VI: Effects of training data Context Enhancement (CE). Metric: mAP (%).

V-B3 Effects of Synthetic Context Enhancement

We evaluated the impact of training data context enhancement (i.e. the cross-class context enriched synthetic training data) on the SL model performance. Table VI shows that context enhancement not only provides clear model improvement across iterations, owing to the suppression of the negative imbalanced learning effect and the enriched context knowledge, but also simultaneously enlarges the data mining capacity, potentially owing to less noisy training data aggregation. Without context enhancement and training class balancing, the model stops improving by the 4th iteration of the incremental learning, with much weaker generalisation performance (28.9% vs 46.9% for the full SL). This suggests the importance of context and data balance in detection model learning, therefore validating our model design considerations.
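The widening gap between the two settings can be made explicit from the rows of Table VI. The snippet below is a small sketch under the assumption that the two rows are aligned by iteration (0 to 5); the variable names are illustrative.

```python
# mAP (%) at iterations 0..5, copied from Table VI.
with_ce = [18.4, 28.6, 33.2, 39.1, 42.2, 44.4]
without_ce = [18.4, 25.3, 27.7, 28.7, 28.9, 28.0]

# Advantage of context enhancement at each iteration (0.1 precision).
ce_gap = [round(a - b, 1) for a, b in zip(with_ce, without_ce)]

# Iteration at which the model without CE reaches its best mAP.
peak_iter_without_ce = max(range(len(without_ce)), key=without_ce.__getitem__)

print("CE advantage per iteration:", ce_gap)
print("Without CE, mAP peaks at iteration", peak_iter_without_ce,
      "with", without_ce[peak_iter_without_ce], "%")
```

The gap grows at every iteration, and without CE the mAP peaks at the 4th iteration (28.9%) before dropping slightly.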

VI Conclusion

We present a scalable end-to-end logo detection solution, including logo dataset establishment and multi-class logo detection model learning, realised by exploring the webly data learning principle without the tedious cost of manually labelling fine-grained logo annotations. In particular, we propose a new incremental learning method named Scalable Logo Self-co-Learning (SL), which enables reliable self-discovery and auto-labelling of new training images from noisy web data to progressively improve the model detection capability in a cross-model co-learning manner given unconstrained in-the-wild images. Moreover, we construct a very large logo detection benchmarking dataset, WebLogo-2M, by automatically collecting and processing web stream data (Twitter) in a scalable manner, therefore facilitating and motivating further investigation of scalable logo detection by the wider community. Using WebLogo-2M, we have validated the advantages of the proposed SL approach over state-of-the-art alternative methods, ranging from strongly supervised and weakly supervised detection models to webly data learning models, through extensive comparative evaluations and an analysis of the benefits of incremental model training and context enhancement. We also provide an in-depth SL model component analysis and evaluation, with insights on model performance gain and formulation.

Acknowledgement

This work was partially supported by the China Scholarship Council, Vision Semantics Ltd., the Royal Society Newton Advanced Fellowship Programme (NA150459), and InnovateUK Industrial Challenge Project on Developing and Commercialising Intelligent Video Analytics Solutions for Public Safety.

References

  • [1] S. Romberg, L. G. Pueyo, R. Lienhart, and R. Van Zwol, “Scalable logo recognition in real-world images,” in Proceedings of the 1st ACM International Conference on Multimedia Retrieval. ACM, 2011, p. 25.
  • [2] S. Romberg and R. Lienhart, “Bundle min-hashing for logo recognition,” in Proceedings of the 3rd ACM conference on International conference on multimedia retrieval. ACM, 2013, pp. 113–120.
  • [3] C. Pan, Z. Yan, X. Xu, M. Sun, J. Shao, and D. Wu, “Vehicle logo recognition based on deep learning architecture in video surveillance for intelligent traffic system,” in IET International Conference on Smart and Sustainable City, 2013, pp. 123–126.
  • [4] A. Joly and O. Buisson, “Logo retrieval with a contrario visual query expansion,” in ACM International Conference on Multimedia, 2009, pp. 581–584.
  • [5] Y. Kalantidis, L. G. Pueyo, M. Trevisiol, R. van Zwol, and Y. Avrithis, “Scalable triangulation-based logo recognition,” in ACM International Conference on Multimedia Retrieval, 2011, p. 20.
  • [6] J. Revaud, M. Douze, and C. Schmid, “Correlation-based burstiness for logo retrieval,” in ACM International Conference on Multimedia, 2012, pp. 965–968.
  • [7] R. Boia, A. Bandrabur, and C. Florea, “Local description using multi-scale complete rank transform for improved logo recognition,” in IEEE International Conference on Communications, 2014, pp. 1–4.
  • [8] K.-W. Li, S.-Y. Chen, S. Su, D.-J. Duh, H. Zhang, and S. Li, “Logo detection with extendibility and discrimination,” Multimedia tools and applications, vol. 72, no. 2, pp. 1285–1310, 2014.
  • [9] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in Advances in Neural Information Processing Systems, 2015, pp. 91–99.
  • [10] R. Girshick, “Fast r-cnn,” in IEEE International Conference on Computer Vision, 2015.
  • [11] J. Redmon and A. Farhadi, “Yolo9000: better, faster, stronger,” in IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • [12] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “Imagenet large scale visual recognition challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
  • [13] H. Su, X. Zhu, and S. Gong, “Deep learning logo detection with data expansion by synthesising context,” IEEE Winter Conference on Applications of Computer Vision, 2017.
  • [14] Y. Liao, X. Lu, C. Zhang, Y. Wang, and Z. Tang, “Mutual enhancement for detection of multiple logos in sports videos,” in IEEE International Conference on Computer Vision, 2017.
  • [15] Y. Li, Q. Shi, J. Deng, and F. Su, “Graphic logo detection with deep region-based convolutional networks,” in IEEE Visual Communications and Image Processing, 2017, pp. 1–4.
  • [16] S. C. Hoi, X. Wu, H. Liu, Y. Wu, H. Wang, H. Xue, and Q. Wu, “Logo-net: Large-scale deep logo detection and brand recognition with deep region-based convolutional networks,” arXiv preprint arXiv:1511.02462, 2015.
  • [17] X. Xu, T. Hospedales, and S. Gong, “Transductive zero-shot action recognition by word-vector embedding,” International Journal of Computer Vision, pp. 1–25, 2017.
  • [18] C. H. Lampert, H. Nickisch, and S. Harmeling, “Attribute-based classification for zero-shot visual object categorization,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 3, pp. 453–465, 2014.
  • [19] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, T. Mikolov et al., “Devise: A deep visual-semantic embedding model,” in Advances in Neural Information Processing Systems, 2013.
  • [20] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, and S. Reed, “Ssd: Single shot multibox detector,” in European Conference on Computer Vision, 2016.
  • [21] L. Dong, J.-B. Huang, Y. Li, S. Wang, and M.-H. Yang, “Weakly supervised object localization with progressive domain adaptation,” in IEEE Conference on Computer Vision and Pattern Recognition, 2015.
  • [22] X. Chen and A. Gupta, “Webly supervised learning of convolutional networks,” in IEEE International Conference on Computer Vision, 2015, pp. 1431–1439.
  • [23] F. N. Iandola, A. Shen, P. Gao, and K. Keutzer, “Deeplogo: Hitting logo recognition with the deep neural network hammer,” arXiv, 2015.
  • [24] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in IEEE Conference on Computer Vision and Pattern Recognition, 2014.
  • [25] K. Nigam and R. Ghani, “Analyzing the effectiveness and applicability of co-training,” in Proceedings of the International Conference on Information and Knowledge Management, 2000.
  • [26] C. Rosenberg, M. Hebert, and H. Schneiderman, “Semi-supervised self-training of object detection models,” in Seventh IEEE Workshop on Applications of Computer Vision, 2005.
  • [27] A. Blum and T. Mitchell, “Combining labeled and unlabeled data with co-training,” in Proceedings of the eleventh annual conference on Computational learning theory. ACM, 1998, pp. 92–100.
  • [28] W. Wang and Z.-H. Zhou, “Analyzing co-training style algorithms,” in European Conference on Machine Learning. Springer, 2007, pp. 454–465.
  • [29] S. Goldman and Y. Zhou, “Enhancing supervised learning with unlabeled data,” in ICML, 2000, pp. 327–334.
  • [30] Z.-H. Zhou and M. Li, “Semi-supervised learning by disagreement,” Knowledge and Information Systems, vol. 24, no. 3, pp. 415–439, 2010.
  • [31] Z. Jiang, S. Zhang, and J. Zeng, “A hybrid generative/discriminative method for semi-supervised classification,” Knowledge-Based Systems, vol. 37, pp. 137–145, 2013.
  • [32] X. Xu, W. Li, D. Xu, and I. W. Tsang, “Co-labeling for multi-view weakly labeled learning,” IEEE transactions on pattern analysis and machine intelligence, vol. 38, no. 6, pp. 1113–1125, 2016.
  • [33] T. Batra and D. Parikh, “Cooperative learning with visual attributes,” arXiv preprint arXiv:1705.05512, 2017.
  • [34] C. Gong, D. Tao, S. J. Maybank, W. Liu, G. Kang, and J. Yang, “Multi-modal curriculum learning for semi-supervised image classification,” IEEE Transactions on Image Processing, vol. 25, no. 7, pp. 3249–3260, 2016.
  • [35] A. Fakeri-Tabrizi, M.-R. Amini, C. Goutte, and N. Usunier, “Multiview self-learning,” Neurocomputing, vol. 155, pp. 117–127, 2015.
  • [36] O. M. Parkhi, A. Vedaldi, and A. Zisserman, “Deep face recognition,” in British Machine Vision Conference, vol. 1, no. 3, 2015, p. 6.
  • [37] H. He and E. A. Garcia, “Learning from imbalanced data,” IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 9, pp. 1263–1284, 2009.
  • [38] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in European Conference on Computer Vision, 2014.
  • [39] M. Everingham, S. A. Eslami, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes challenge: A retrospective,” International Journal of Computer Vision, vol. 111, no. 1, pp. 98–136, 2015.
  • [40] J. Hoffman, S. Guadarrama, E. S. Tzeng, R. Hu, J. Donahue, R. Girshick, T. Darrell, and K. Saenko, “Lsda: Large scale detection through adaptation,” in Advances in Neural Information Processing Systems, 2014, pp. 3536–3544.
  • [41] R. G. Cinbis, J. Verbeek, and C. Schmid, “Weakly supervised object localization with multi-fold multiple instance learning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 1, pp. 189–203, 2017.
  • [42] F. Yu, A. Seff, Y. Zhang, S. Song, T. Funkhouser, and J. Xiao, “Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop,” arXiv, 2015.
  • [43] C. L. Zitnick and P. Dollár, “Edge boxes: Locating object proposals from edges,” in European Conference on Computer Vision, 2014.
  • [44] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
  • [45] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes (voc) challenge,” International journal of computer vision, vol. 88, no. 2, pp. 303–338, 2010.
  • [46] A. Sharif Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, “Cnn features off-the-shelf: an astounding baseline for recognition,” in Workshop of IEEE Conference on Computer Vision and Pattern Recognition, 2014.
  • [47] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, “How transferable are features in deep neural networks?” in Advances in Neural Information Processing Systems, 2014.