Scalable Deep Learning Logo Detection
Abstract
Existing logo detection methods usually consider a small number of logo classes and limited images per class with a strong assumption of requiring tedious object bounding box annotations, therefore not scalable to real-world dynamic applications. In this work, we tackle these challenges by exploring the webly data learning principle without the need for exhaustive manual labelling. Specifically, we propose a novel incremental learning approach, called Scalable Logo Self-co-Learning (SL), capable of automatically self-discovering informative training images from noisy web data for progressively improving model capability in a cross-model co-learning manner. Moreover, we introduce a very large (2,190,757 images of 194 logo classes) logo dataset “WebLogo-2M” by an automatic web data collection and processing method. Extensive comparative evaluations demonstrate the superiority of the proposed SL method over the state-of-the-art strongly and weakly supervised detection models and contemporary webly data learning approaches.
I Introduction
Automated logo detection from unconstrained “in-the-wild” images benefits a wide range of applications, e.g. brand trend prediction for commercial research and vehicle logo recognition for intelligent transportation [1, 2, 3]. This is inherently a challenging task due to the presence of many logos in diverse context with uncontrolled illumination, low-resolution, and background clutter (Fig. 1).
Existing logo detection methods typically consider a small number of logo classes and need large training sets annotated at the logo object instance level, i.e. with object bounding boxes [4, 5, 1, 6, 2, 7, 8, 3]. Whilst this controlled setting allows for a straightforward adoption of state-of-the-art object detection models [9, 10, 11], it does not scale to real-world logo detection applications in which a much larger number of logo classes are of interest. Two factors limit scalability: (1) the extremely high cost of constructing a large scale logo dataset with exhaustive logo instance bounding box labelling, which is why no such dataset is available [12]; and (2) the lack of incremental model learning that progressively updates and expands the model with increasingly more training data without fine-grained labelling. Existing models are mostly trained in one pass and then applied statically to new test data.

Dataset | Logo Classes | Images | Supervision | Noisy | Construction | Scalability | Availability |
---|---|---|---|---|---|---|---|
TopLogo-10 [13] | 10 | 700 | Object-Level | ✗ | Manually | Weak | ✓ |
TennisLogo-20 [14] | 20 | 2,000 | Object-Level | ✗ | Manually | Weak | ✗ |
FlickrLogos-27 [5] | 27 | 810 | Object-Level | ✗ | Manually | Weak | ✓ |
FlickrLogos-32 [1] | 32 | 2,240 | Object-Level | ✗ | Manually | Weak | ✓ |
Logo32-270 [15] | 32 | 8,640 | Object-Level | ✗ | Manually | Weak | ✗ |
BelgaLogos [4] | 37 | 1,321 | Object-Level | ✗ | Manually | Weak | ✓ |
LOGO-NET [16] | 160 | 73,414 | Object-Level | ✗ | Manually | Weak | ✗ |
WebLogo-2M (Ours) | 194 | 2,190,757 | Image-Level | ✓ | Automatically | Strong | ✓ |
In this work, we consider scalable logo detection learning in a very large collection of unconstrained images without exhaustive fine-grained object instance level labelling for model training. Given that existing datasets mostly have small numbers of logo classes, one possible strategy is to learn from a small set of labelled training classes and adapt the model to other novel (test) logo classes, that is, Zero-Shot Learning (ZSL) [17, 18, 19]. This class-to-class model transfer and generalisation in ZSL is achieved by knowledge sharing through an intermediate semantic representation for all classes, such as mid-level attributes [18] or a class embedding space of word vectors [19]. However, such methods are limited because many logos do not share attributes or other forms of semantic representations due to their unique characteristics. A lack of large scale logo datasets (Table I), in both class numbers and image instance numbers per class, severely limits the learning of scalable logo detection models. This study explores the webly data learning principle for addressing both large scale dataset construction and incremental logo detection model learning without exhaustive manual labelling as the image data grow. We call this setting scalable logo detection.
The contributions of this work are three-fold: (1) We investigate the scalable logo detection problem, characterised by modelling a large quantity of logo classes without exhaustive bounding box labelling. This differs from existing methods, which typically consider only a small number of logo classes and need exhaustive manual labelling at the fine-grained object bounding box level for each class. This scalability problem is under-studied in the literature. (2) We propose a novel incremental learning approach to scalable deep learning logo detection by exploiting multi-class detection with synthetic context augmentation. We call this method Scalable Logo Self-co-Learning (SL), since it automatically discovers potential positive logo images from noisy web data to progressively improve the model discrimination and generalisation capability in an iterative joint self-learning and co-learning manner. (3) We introduce a large logo detection dataset of 2,190,757 images from 194 logo classes, called WebLogo-2M, created by automatically sampling webly logo images from the social media site Twitter. Importantly, this dataset construction scheme allows the dataset to be easily expanded with new logo classes and images, therefore offering a favourable solution for scalable dataset construction. Extensive experiments demonstrate the superiority of the SL method over not only the state-of-the-art strongly (Faster R-CNN [9], SSD [20], YOLOv2 [11]) and weakly (WSL [21]) supervised detection models but also webly learning methods (WLOD [22]), on the newly introduced WebLogo-2M dataset.
II Related Works
Logo Detection Early logo detection methods are built on hand-crafted visual features (e.g. SIFT and HOG) and conventional classification models (e.g. SVM) [8, 6, 2, 7, 5]. These methods were only evaluated on small logo datasets with a limited number of both logo images and classes. A few deep methods [23, 16, 13, 14] have recently been proposed by exploiting state-of-the-art object detection models such as R-CNN [24, 9, 10]. This in turn inspired large data construction [16]. However, all these existing models are not scalable to real world deployments due to two stringent requirements: (1) accurately labelled training data per logo class; (2) strong object-level bounding box annotations. Both requirements give rise to time-consuming training data collection and annotation, which does not scale to a realistically large number of logo classes given limited human labelling effort. In contrast, our method eliminates both needs by allowing the detection model to learn from image-level weakly annotated and noisy images automatically collected from social media (webly). As such, we enable automated introduction of any quantity of new logos for both dataset construction/expansion and model updating without the need for exhaustive manual labelling.
Logo Datasets A number of logo detection benchmarking datasets exist in the literature (Table I). All existing datasets are constructed manually and are typically small in both image number and logo category, thus insufficient for deep learning. Recently, Hoi et al. [16] attempted to address this small logo dataset problem by creating a larger LOGO-NET dataset. However, this dataset is not publicly accessible. To address this scalability problem, we propose to collect logo images automatically from social media. This brings about two unique benefits: (1) Weak image level labels can be obtained for free; (2) We can easily upgrade the dataset by expanding the logo category set and collecting new logo images without human labelling, so the approach scales to any quantity of logo images and logo categories. To our knowledge, this is the first attempt to construct a large scale logo dataset by exploiting inherently noisy web data.
Model Self-Learning Self-training is a special type of incremental learning wherein the new training data are labelled by the model itself: the model predicts logo positions and class labels in weakly labelled or unlabelled images and then converts the most confident predictions into training data [25]. An approach similar to ours is the detection model by Rosenberg et al. [26], which also explores the self-training mechanism. However, this method needs a number of strongly and accurately labelled training data per class at the object instance level to initialise the detection model. Moreover, it assumes all unlabelled images belong to the target object categories. These two assumptions severely limit model effectiveness and scalability given webly collected training data without any object bounding box labelling and with a high ratio of noisy irrelevant images.
Model Co-Learning Model co-learning is a generic meta-learning strategy originally designed for semi-supervised learning, based on two sufficient and yet conditionally independent feature representations with a single model algorithm [27]. Later on, co-learning was further developed into designs that use different model parameter settings [28] or models [29, 30, 31] on the same feature representation. Overall, the key is that both models in co-learning need to be independently effective but also complementary to each other. Recently, there have been some attempts at semi-supervised classification methods in the co-learning spirit [32, 33, 34, 35]. Beyond these, we further extend the co-learning concept from semi-supervised learning to webly learning for scalable logo detection. In particular, we unite co-learning and self-learning in a single deep learning detection framework with the capability of incrementally improving the logo detection models. To our knowledge, this is the first attempt at exploiting such a self-co-learning approach in the logo detection literature.
III WebLogo-2M Logo Detection Dataset
We present a scalable method to automatically construct a large logo detection dataset, called WebLogo-2M, with 2,190,757 webly images from 194 logo classes (Table II).
Logos | Raw Images | Filtered Images | Noise Rate (%) |
---|---|---|---|
194 | 4,941,317 | 2,190,757 | Varying |
- | - | (6/2,583/179,789) | (25.0/90.2/99.8) |
III-A Logo Image Collection and Filtering
Logo Selection A total of 194 logo classes from 13 different categories are selected in the WebLogo-2M dataset (Fig. 4). They are popular logos and brands in our daily life, including the 32 logo classes of FlickrLogos-32 [1] and the 10 logo classes of TopLogo-10 [13]. Specifically, the logo class selection was guided by an extensive review of social media reports regarding brand popularity (http://www.ranker.com/crowdranked-list/ranking-the-best-logos-in-the-world, http://zankrank.com/Ranqings/?currentRanqing=logos, http://uk.complex.com/style/2013/03/the-50-most-iconic-brand-logos-of-all-time) and market value (http://www.forbes.com/powerful-brands/list/#tab:rank, http://brandirectory.com/league_tables/table/apparel-50-2016).
Image Source Selection We selected the social media website Twitter as the data source of WebLogo-2M. Twitter offers well structured multi-media data stream sources and, more critically, unlimited data access permission, therefore facilitating the collection of large scale logo images. (We also tried the Google and Bing search engines and three other social media sites: Facebook, Instagram, and Flickr. However, all of them are rather restrictive in data access and limit incremental big data collection, e.g. Instagram allows only 500 image downloads per hour through the official web API.)
Image Collection We collected 4,941,317 webly logo images. Specifically, through the Twitter API, one can automatically retrieve images from tweets by matching query keywords against tweets in real time. In our case, we query the logo brand names so that images in tweets containing the query words can be extracted. The retrieved images are then labelled with the corresponding logo name at the image level, i.e. weakly labelled.
Logo Image Filtering We obtained a total of 2,190,757 images after conducting a two-step auto-filtering: (1) Noise Removal: We removed images of small width and/or height (e.g. less than 100 pixels), since we observed statistically that such images mostly contain no logo objects (noise). (2) Duplicate Removal: We identified and discarded exact duplicates (i.e. multiple copies of the same image). Specifically, given a reference image, we removed those with identical width and height. This image spatial size based scheme is not only computationally cheaper than the appearance based alternative [36], but also very effective. For example, we manually examined the de-duplicating process on randomly selected reference images and found that over 90% of the images are true duplicates.
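The two filtering steps above can be illustrated with a minimal sketch. The `WebImage` record, its field names, and the `filter_images` function are hypothetical and introduced only for illustration; the 100-pixel bound is the example value given in the text. Restricting the size comparison to images that share the same weak label is an assumption of this sketch, since the text does not state the scope of the comparison.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WebImage:
    # Hypothetical record for one collected image; field names are illustrative.
    url: str
    label: str   # weak image-level label, i.e. the queried brand name
    width: int
    height: int


def filter_images(images, min_side=100):
    """Two-step auto-filtering: small-image removal, then size-based de-duplication."""
    kept = []
    seen = set()
    for img in images:
        # Step 1 (noise removal): drop images whose width or height is too small.
        if img.width < min_side or img.height < min_side:
            continue
        # Step 2 (duplicate removal): an image whose (width, height) matches an
        # earlier kept image is treated as a copy of that reference image.
        # Assumption: the comparison is scoped to images with the same label.
        key = (img.label, img.width, img.height)
        if key in seen:
            continue
        seen.add(key)
        kept.append(img)
    return kept
```

Comparing only image sizes avoids decoding pixels, which is what makes the scheme cheap, at the cost of occasionally discarding distinct images that happen to share a size.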
III-B Properties of WebLogo-2M
Compared to existing logo detection databases [4, 1, 16, 13], this webly logo image dataset presents three unique properties inherent to large scale data exploration for learning scalable logo models:
(I) Weak Annotation All WebLogo-2M images are weakly labelled at the image level by the query keywords. These labels are obtained automatically in data collection without human fine-grained labelling. This is much more scalable than manually annotating accurate individual logo bounding boxes, particularly when the number of both logo images and classes are very large.
(II) Noisy (False Positives) Images collected from online web sources are inherently noisy, e.g. often no logo objects appear in the images, therefore providing plenty of natural false positive samples. To estimate the degree of noisiness, we randomly sampled at most 1,000 web images per class for all 194 classes and manually examined whether they are true or false logo images (for sparse logo classes with fewer than 1,000 webly collected images, we examined all available images). As shown in Fig. 2, the true logo image ratio varies significantly among the 194 logos, e.g. 75% for “Rittersport” vs. 0.2% for “3M”. On average, true logo images account for only a minority of the sampled images, the remainder being false positives. Such noisy images pose significant challenges to model learning, even though there are plenty of data scalable to very large size in both class numbers and samples per class.

(III) Class Imbalance The WebLogo-2M dataset presents a natural logo object occurrence imbalance in public scenes. Specifically, logo images collected from web streams exhibit a power-law distribution (Fig. 3). This property is often artificially eliminated in most existing logo datasets by careful manual filtering, which not only requires extra labelling effort but also renders the model learning challenges unrealistic. We preserve the inherent class imbalance nature in the data for achieving fully automated dataset construction and retaining realistic model learning challenges. This requires minimising model learning bias towards densely-sampled classes [37].

Further Remark Since the proposed dataset construction method is completely automated, new logo classes can be easily added without human labelling. This permits the dataset to be enlarged cumulatively, in contrast to existing methods [12, 16, 38, 39, 4, 1, 13] that require exhaustive human labelling and therefore hamper further dataset updating and enlarging. This automation is particularly important for creating object detection datasets, which need expensive explicit labelling of object bounding boxes, more so than for constructing cheaper image-level class annotation datasets [40]. While being more scalable, this WebLogo-2M dataset also provides more realistic challenges for model learning given weaker label information, noisy image data, unknown scene context, and significant class imbalance.
III-C Benchmarking Training and Test Data
We define a benchmarking logo detection setting here. In the scalable webly learning context, we deploy the whole WebLogo-2M dataset (2,190,757 images) as the training data. For performance evaluation, a set of images with fine-grained object-level annotation groundtruth is required. To that end, we construct an independent test set of 6,558 logo images with logo bounding box labels by (1) assembling 2,870 labelled images from the FlickrLogos-32 [1] and TopLogo-10 [13] datasets and (2) manually labelling 3,688 images independently collected from Twitter. Note that the only purpose of labelling this test set is the performance evaluation of different detection methods, independent of WebLogo-2M auto-construction.
IV Training a Multi-Class Logo Detector
We aim to automatically train a multi-class logo detection model incrementally from noisy and weakly labelled web images. Different from existing methods that build a detector in a one-pass “batch” learning procedure, we propose to incrementally enhance the model capability “sequentially”, in a joint spirit of self-learning [25] and co-learning [27]. This is due to the unavailability of sufficient accurate fine-grained training images per logo class. In other words, the model must self-select trustworthy images from the noisy webly labelled data (WebLogo-2M) to progressively develop and refine itself. This is a catch-22 problem: the lack of sufficient good-quality training data leads to a suboptimal model, which in turn produces error-prone predictions. This may cause model drift: the errors in model prediction are propagated through the iterations and can corrupt the model knowledge structure. Also, the inherent data imbalance over different logo classes may bias model learning towards only a few majority classes, leading to significantly weaker capability in detecting minority classes. Moreover, the two problems above are intrinsically interdependent, with one possibly aggravating the other. It is non-trivial to solve these challenges without exhaustive fine-grained human annotations.
Formulation Rationale In this work, we present a scalable logo detection deep learning solution capable of addressing the aforementioned two issues in a self-co-learning manner. The intuition is as follows: web knowledge provides ambiguous but still useful coarse image level logo annotations; self-learning offers a scalable means of iteratively exploring such weak/noisy information; and co-learning allows for mining the complementary advantages of different modelling approaches in order to further improve the self-learning effectiveness. We call our method Scalable Logo Self-co-Learning (SL).
Model Design To establish a more effective SL framework, we select strongly-supervised rather than weakly-supervised object detection deep learning models for two reasons: (1) The performance of weakly-supervised models [41] is much inferior to that of strongly supervised counterparts; (2) The noisy webly weak labels may further hamper the effectiveness of weakly supervised learning. In our model instantiation, we choose the Faster R-CNN [9] and YOLOv2 [11] models for self-co-learning. Conceptually, this model selection is independent of the SL notion, and stronger deep learning detector models generally lead to a more advanced SL solution. A schematic overview of the SL framework is depicted in Fig. 5.

IV-A Model Bootstrap
To start the SL process, we first provide reasonably discriminative logo detection co-learning models with sufficient bootstrapping training data discovery. Both Faster R-CNN and YOLOv2 need strongly supervised learning from object-level bounding box annotations to achieve logo object detection discrimination, which however is not available in our scalable webly learning setting.
To address this problem in the scalable logo detection context, we propose to exploit the idea of synthesising fine-grained training logo images, therefore maintaining model learning scalability for accommodating a large quantity of logo classes. In particular, this is achieved by generating synthetic training images as in [13]: overlaying logo icon images at random locations of non-logo background images so that complete bounding box annotations can be generated automatically. The logo icon images are automatically collected from Google Image Search by querying the corresponding logo class name (Fig. 4 (b)). The background images can be chosen flexibly, e.g. the non-logo images in the FlickrLogos-32 dataset [1] and others retrieved by irrelevant query words from web search engines. To enhance appearance variations in synthetic logos, colour and geometric transformations can be applied [13].
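As an illustration of why overlaying yields bounding boxes for free, the following is a minimal sketch, not the synthesis method of [13]. Images are represented as plain 2-D lists of pixel values to keep the example self-contained, and the function name `overlay_logo` is hypothetical. The colour and geometric transformations mentioned above are omitted.

```python
import random


def overlay_logo(background, logo, rng):
    """Paste `logo` onto a copy of `background` at a random location.

    Both images are 2-D lists of pixel values (rows of equal length).
    Returns the synthetic image and its bounding box
    (x_min, y_min, x_max, y_max), with x_max and y_max exclusive.
    """
    bg_h, bg_w = len(background), len(background[0])
    lg_h, lg_w = len(logo), len(logo[0])
    if lg_h > bg_h or lg_w > bg_w:
        raise ValueError("logo must fit inside the background")
    # The paste position is chosen by us, so the annotation is known exactly.
    x0 = rng.randint(0, bg_w - lg_w)
    y0 = rng.randint(0, bg_h - lg_h)
    canvas = [row[:] for row in background]  # leave the background untouched
    for dy in range(lg_h):
        canvas[y0 + dy][x0:x0 + lg_w] = logo[dy]
    return canvas, (x0, y0, x0 + lg_w, y0 + lg_h)
```

Because the paste location is sampled by the generator itself, each synthetic image comes with an exact box label at no annotation cost, which is the property the bootstrap step relies on.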
Training Details We synthesised 1,000 training images per logo class, 194,000 images in total. For learning the Faster R-CNN and YOLOv2 models, we set the learning rate at and the learning iterations at . Following [13], we pre-trained the detector models on ImageNet 1000-class object classification images [12] for model warmup.
IV-B Incremental Self-Mining Noisy Web Images
After the logo detector models are discriminatively bootstrapped, we proceed to improve their detection capability with incrementally self-mined positive (likely) logo images from weakly labelled WebLogo-2M data. To identify the most compatible training images, we define a selection function using the detection score of up-to-date model:
$$S(\mathcal{M}_t, \boldsymbol{x}, y) = S_{\text{det}}(y \mid \mathcal{M}_t, \boldsymbol{x}) \qquad (1)$$
where $\mathcal{M}_t$ denotes the $t$-th step detector model (Faster R-CNN or YOLOv2), and $\boldsymbol{x}$ denotes a logo image with the web image-level label $y \in \mathcal{Y} = \{1, 2, \dots, m\}$, with $m$ the total logo class number. $S_{\text{det}}(y \mid \mathcal{M}_t, \boldsymbol{x}) \in [0, 1]$ indicates the maximal detection score of $\boldsymbol{x}$ on the logo class $y$ by model $\mathcal{M}_t$. For positive logo image selection, we apply a high detection confidence threshold (0.9 in our experiments) [42] to strictly control the impact of model detection errors, which would otherwise degrade the incremental learning benefits. This new training data discovery process is summarised in Alg. 1.
With this self-mining process at the $t$-th iteration, we obtain two separate sets of updated training data, one mined by Faster R-CNN and one mined by YOLOv2. Each of the two sets reflects detection behaviour with some distinct characteristics due to the different formulation designs of the two models, e.g. Faster R-CNN is based on region proposals whilst YOLOv2 relies on pre-defined-grid centred regression. This creates a satisfactory condition (i.e. diverse and independent modelling) for cross-model co-learning.
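Input:
Current model $\mathcal{M}_{t-1}$,
Unexplored data $\mathcal{D}_{t-1}$,
Self-discovered logo training data $\mathcal{T}_{t-1}$ ($\mathcal{T}_0 = \emptyset$);
Output:
Updated self-discovered training data $\mathcal{T}_t$,
Updated unlabelled data pool $\mathcal{D}_t$;
Initialisation:
$\mathcal{T}_t = \mathcal{T}_{t-1}$;
$\mathcal{D}_t = \mathcal{D}_{t-1}$;
for image $i$ in $\mathcal{D}_{t-1}$
  Apply $\mathcal{M}_{t-1}$ to get the detection results;
  Evaluate image $i$ as a potential positive logo image;
  if Meeting selection criterion
    $\mathcal{T}_t = \mathcal{T}_t \cup \{i\}$;
    $\mathcal{D}_t = \mathcal{D}_t \setminus \{i\}$;
  end if
end for
Return $\mathcal{T}_t$ and $\mathcal{D}_t$.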
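The self-mining pass summarised above can be sketched as follows. This is an illustrative reading rather than the authors' implementation: the `detect` callable stands in for the current detector and is assumed to return `(class_label, score)` pairs, the function name `self_mine` is hypothetical, and treating a score equal to the threshold as passing is an assumption. The 0.9 default is the threshold reported in the text.

```python
def self_mine(detect, unexplored, mined, threshold=0.9):
    """One self-mining pass over the unexplored web images.

    detect(image) -> list of (class_label, score) detections; a stand-in
        for the current detector (hypothetical interface).
    unexplored: list of (image, web_label) pairs not yet selected.
    mined: list of (image, web_label) pairs selected in earlier iterations.
    Returns the updated mined set and the updated unexplored pool.
    """
    new_mined = list(mined)
    remaining = []
    for image, web_label in unexplored:
        # Selection score: maximal detection score on the image's own web label.
        scores = [s for cls, s in detect(image) if cls == web_label]
        best = max(scores, default=0.0)
        if best >= threshold:
            new_mined.append((image, web_label))   # move into the training set
        else:
            remaining.append((image, web_label))   # keep for later iterations
    return new_mined, remaining
```

Scoring only against the image's own web label means a confident detection of a different logo does not promote the image, so the weak label acts as a consistency check on the detector.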
IV-C Incremental Model Co-Learning
Given the two up-to-date training sets self-mined by Faster R-CNN and YOLOv2, we conduct co-learning on the two detection models (Fig. 5(d)). Specifically, we incrementally update the Faster R-CNN model with the set self-mined by YOLOv2, and vice versa. As such, the complementary modelling advantages can be exploited incrementally in a self-mining cross-updating manner.
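The cross-updating direction can be made explicit with a small sketch. The `mine` and `fine_tune` callables are hypothetical placeholders for the self-mining step and the detector fine-tuning step; only the swap of training sets between the two models is the point of the example.

```python
def co_learning_round(model_a, model_b, unexplored, mine, fine_tune):
    """One cross-updating round between two detectors.

    mine(model, unexplored) -> training data self-mined by `model`.
    fine_tune(model, data)  -> updated model (both callables are placeholders).
    """
    mined_by_a = mine(model_a, unexplored)
    mined_by_b = mine(model_b, unexplored)
    # Cross update: each model learns from what the *other* model mined.
    new_a = fine_tune(model_a, mined_by_b)
    new_b = fine_tune(model_b, mined_by_a)
    return new_a, new_b
```

In plain self-learning each model would instead be fine-tuned on its own mined set, so it would only reinforce what it already detects confidently.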
Recall that the logo images across classes are imbalanced (Fig. 3). This can lead to biased model learning towards well-sampled classes (the majority classes), resulting in poor performance against sparsely-labelled classes (the minority classes) [37]. To address this problem, we propose the idea of cross-class context augmentation for not only fully exploring the contextual richness of WebLogo-2M data but also addressing the intrinsic imbalanced logo class problem, inspired by the potential of context enhancement [13].
Specifically, we ensure that at least $N_{\text{cls}}$ images will be newly introduced into the training data pool in each self-discovery iteration for each detection model. Suppose $N_i$ web images are self-discovered for the logo class $i$ (Alg. 1); we then generate $N_i^{\text{syn}}$ synthetic images, where
$$N_i^{\text{syn}} = \max(0,\; N_{\text{cls}} - N_i). \qquad (2)$$
Therefore, we only perform synthetic data augmentation for those classes with fewer than $N_{\text{cls}}$ real web images mined in the current iteration. The value of $N_{\text{cls}}$ is set considering that too many synthetic images may bring in negative effects due to the imperfect logo appearance rendering against background. Importantly, we choose the self-mined logo images of other classes ($j \neq i$) as the background images, specifically for enriching the contextual diversity of logo class $i$ (Fig. 6). We utilise the SCL synthesising method [13] as in model bootstrap (Sec. IV-A).
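A minimal sketch of the per-class top-up rule described above, assuming it amounts to generating just enough synthetic images to reach the per-class minimum. The function names are hypothetical, and the minimum is passed in as a parameter rather than fixed, since its value is not given here.

```python
import random


def synthetic_quota(mined_counts, n_cls):
    """Synthetic images to generate per class: max(0, n_cls - mined count)."""
    return {label: max(0, n_cls - n) for label, n in mined_counts.items()}


def pick_backgrounds(mined_by_class, target, k, rng):
    """Draw k background images from the images mined for *other* classes."""
    pool = [img
            for label, imgs in mined_by_class.items()
            if label != target
            for img in imgs]
    if not pool:
        return []
    return [rng.choice(pool) for _ in range(k)]
```

Classes that already mined enough real images receive a quota of zero, so synthetic data is concentrated on the minority classes.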

Once we have self-mined web training images and context enriched synthetic data, we perform detection model fine-tuning at the learning rate of by iterations depending on the training data size at each iteration. We adopt the original deep learning loss formulation for both Faster R-CNN and YOLOv2. Model generalisation is improved when the training data quality is sufficient in terms of both true-false logo image ratio and the context richness.
IV-D Incremental Learning Stop Criterion
We conduct the incremental model self-co-learning until meeting some stop criterion, for example, the model performance gain becomes marginal or zero. We adopt the YOLOv2 as the deployment logo detection model due to its superior efficiency and accuracy (see Table V). In practice, we can assess the model performance on an independent set, e.g. the test evaluation data.
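The stop rule can be sketched as a loop that ends once the performance gain is no longer positive. The `evaluate` and `step` callables and the `min_gain` and `max_iters` parameters are hypothetical; the example replays the per-iteration mAP values reported in Table IV rather than training any model.

```python
def run_until_converged(evaluate, step, model, min_gain=0.0, max_iters=50):
    """Repeat `step` until the evaluation gain drops to `min_gain` or below."""
    history = [evaluate(model)]
    for _ in range(max_iters):
        candidate = step(model)
        score = evaluate(candidate)
        if score - history[-1] <= min_gain:
            break  # marginal or zero gain: stop and keep the previous model
        model = candidate
        history.append(score)
    return model, history


# Replay of the mAP (%) values from Table IV; a "model" is just its iteration index.
map_per_iteration = [18.4, 28.6, 33.2, 39.1, 42.2, 44.4, 45.6, 46.9, 46.9]
best_iter, history = run_until_converged(
    evaluate=lambda i: map_per_iteration[i],
    step=lambda i: i + 1,
    model=0,
)
print(best_iter, history[-1])  # prints: 7 46.9
```

With these values the loop stops when iteration 8 brings no further gain, matching the "Stop" entry in Table IV.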
V Experiments
Competitors We compared the proposed SL model with six state-of-the-art alternative detection approaches:
(1) Faster R-CNN [9]: A competitive region proposal driven object detection model, characterised by jointly learning region proposal generation and object classification in a single deep model. In our scalable webly learning context, Faster R-CNN is optimised with synthetic training data generated by the SCL [13] method, exactly the same as for our SL model.
(2) SSD [20]: A state-of-the-art regression optimisation based object detection model. We similarly learn this strongly supervised model with synthetic logo instance bounding box labels, as for Faster R-CNN above.
(3) YOLOv2 [11]: A contemporary bounding box regression based multi-class object detection model. We learned this model with the same training data as SSD and Faster R-CNN.
(4) Weakly Supervised object Localisation (WSL) [21]: A state-of-the-art weakly supervised detection model that can be trained with image-level logo label annotations in a multi-instance learning framework. Therefore, we can directly utilise the webly labelled WebLogo-2M images to train the WSL detection model. Note that noisy logo labels inherent to web data may pose additional challenges on top of the high complexity in logo appearance and context.
(5) Webly Learning Object Detection (WLOD) [22]: A state-of-the-art weakly supervised object detection method where clean Google images are used to train exemplar classifiers, which are deployed to classify region proposals generated by EdgeBox [43]. In our implementation, we further improved the classification component by exploiting a VGG-16 [44] model trained on ImageNet-1K and Pascal VOC as the deep feature extractor and the L2 distance as the matching metric. We adopted the nearest neighbour classification model with Google logo images (Fig. 4(b)) as the labelled training data.
(6) WLOD+SCL: A variant of WLOD [22] with context enriched training data, obtained by exploiting SCL [13] to synthesise various contexts for Google logo images.
Overall, these existing methods cover both strongly and weakly supervised learning based detection models.
Performance Metrics For the quantitative performance measure of logo detection, we utilised the Average Precision (AP) for each individual logo class, and the mean Average Precision (mAP) for all classes [45]. A detection is considered correct when the Intersection over Union (IoU) between the predicted and groundtruth bounding boxes exceeds a fixed threshold.
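For reference, the IoU criterion can be computed as in the sketch below. Boxes are assumed to be axis-aligned tuples `(x_min, y_min, x_max, y_max)`; this box convention and the function names are choices of the sketch. The threshold is left as a required argument because its value is not stated here.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes (x_min, y_min, x_max, y_max)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    # Width and height of the overlap, clamped at zero for disjoint boxes.
    iw = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    ih = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = iw * ih
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


def is_correct(pred_box, gt_box, threshold):
    """A detection counts as correct when its IoU with the groundtruth exceeds the threshold."""
    return iou(pred_box, gt_box) > threshold
```

For example, two 2×2 boxes offset by one pixel in each direction overlap in a 1×1 region, giving an IoU of 1/7.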
V-A Comparative Evaluations
Method | mAP (%) |
---|---|
SSD [20] | 8.8 |
Faster R-CNN [9] | 14.9 |
YOLOv2 [11] | 18.4 |
WSL [21] | 3.6 |
WLOD [22] | 19.3 |
WLOD[22] + SCL[13] | 7.8 |
SL (Ours) | 46.9 |

Iteration | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
---|---|---|---|---|---|---|---|---|---|
mAP (%) | 18.4 | 28.6 | 33.2 | 39.1 | 42.2 | 44.4 | 45.6 | 46.9 | 46.9 |
mAP Gain (%) | N/A | 10.2 | 4.6 | 5.9 | 3.1 | 2.2 | 1.2 | 1.3 | 0.0 |
Training Images | 5,862 | 21,610 | 41,314 | 54,387 | 74,855 | 86,599 | 98,055 | 107,327 | Stop |
We compared the scalable logo detection performance on the WebLogo-2M benchmarking test data in Table III. It is evident that the proposed SL model significantly outperforms all other alternative methods, e.g. surpassing the best baseline WLOD by 27.6% (46.9%-19.3%) in mAP. We also have the following observations:
(1) The weakly supervised learning based model WSL produces the worst result, due to the joint effects of complex logo appearance variation against unconstrained context and the large proportions of false positive logo images (Fig. 2).
(2) The WLOD method performs reasonably well, suggesting that the knowledge learned from auxiliary data sources (ImageNet and Pascal VOC) is transferable to some degree, confirming similar findings in [46, 47].
(3) By utilising synthetic training images with rich context and background, the fully supervised detection models YOLOv2 and Faster R-CNN are able to achieve the second and third best results among all competitors, respectively. This suggests that context augmentation is critical for object detection model optimisation, and that combining a strongly supervised learning model with automatic training data synthesis is a superior strategy over weakly supervised learning in the webly learning setting.
(4) Another supervised model, SSD, yields much lower detection performance. This is consistent with the original finding that this model is sensitive to the bounding box size of objects, with weaker detection performance on small objects such as in-the-wild logo instances.
(5) WLOD+SCL produces a weaker result (7.8%) compared to WLOD (19.3%). This indicates that joint supervised learning is critical for exploiting context enriched data augmentation; without it, the synthetic context is likely to introduce distracting effects that degrade matching.
Qualitative Evaluation For visual comparison, we show a number of qualitative logo detection examples from three classes by the SL and WLOD models in Fig. 7.
V-B Further Analysis and Discussions
V-B1 Effects of Incremental Model Self-Co-Learning
We evaluated the effects of incremental model self-co-learning on self-discovered training data and context enriched synthetic images by examining the SL model performance at individual iterations. Table IV and Fig. 9 show that the SL model improves consistently across the iterations of self-co-learning. In particular, the first round of data mining brings about the maximal mAP gain of 10.2% (28.6%-18.4%), with the per-iteration benefit mostly dropping gradually afterwards. This suggests that our model design is capable of effectively addressing the notorious error propagation challenge thanks to (1) a proper detection model initialisation by logo context synthesising, which provides a sufficiently good starting-point detection; (2) a strict selection on self-evaluated detections, which reduces the amount of false positives and suppresses the likelihood of error propagation; and (3) cross-model co-learning with cross-logo context enriched synthetic training data augmentation, which addresses the imbalanced data learning problem whilst enhancing the model robustness against diverse unconstrained background clutter. We also observed that more images are mined along the incremental data mining process, suggesting that the SL model becomes increasingly capable of tackling more complex context over time, although this potentially leads to more false positives at the same time, which can slow down model improvement, as indicated in Fig. 8.


Method | mAP (%) |
---|---|
Self-Learning (Faster R-CNN) | 36.8 |
Self-Learning (YOLO) | 39.4 |
Co-Learning (Faster R-CNN) | 44.2 |
Co-Learning (YOLO) (SL) | 46.9 |
V-B2 Effects of Cross-Model Co-Learning
We assessed the benefits of cross-model co-learning between Faster R-CNN and YOLOv2 in SL in comparison with the single-model self-learning strategy. In contrast to co-learning, self-learning exploits self-mined new training data for incremental model update without the benefit of cross-model complementary advantages. Table V and Fig. 9 show that both models gain clearly from co-learning, e.g. 7.4% (44.2-36.8) for Faster R-CNN and 7.5% (46.9-39.4) for YOLOv2. This verifies our motivation for exploiting the co-learning principle to maximise the complementary information of different detection model formulations in scalable logo detection model optimisation.
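As a quick check of the arithmetic, the co-learning gains can be recomputed from the reported mAP values. The dictionary layout and function name below are illustrative choices, not part of the original evaluation code.

```python
# mAP (%) values copied from the self-learning vs co-learning comparison.
map_scores = {
    "Faster R-CNN": {"self": 36.8, "co": 44.2},
    "YOLOv2": {"self": 39.4, "co": 46.9},
}


def co_learning_gain(scores):
    """Absolute mAP gain of co-learning over self-learning, rounded to 0.1."""
    return round(scores["co"] - scores["self"], 1)


gains = {model: co_learning_gain(s) for model, s in map_scores.items()}
print(gains)
```

The two gains are almost identical, which is consistent with both detectors benefiting from the exchange rather than one model simply inheriting the other's strength.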
Iteration | 0 | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|---|
With CE | 18.4 | 28.6 | 33.2 | 39.1 | 42.2 | 44.4 |
Without CE | 18.4 | 25.3 | 27.7 | 28.7 | 28.9 | 28.0 |
V-B3 Effects of Synthetic Context Enhancement
We evaluated the impact of training data context enhancement (i.e. the cross-class context enriched synthetic training data) on the SL model performance. Table VI shows that context enhancement not only provides clear model improvement across iterations, owing to the suppression of the negative effect of imbalanced learning and to the enriched context knowledge, but also simultaneously enlarges the data mining capacity owing to potentially less noisy training data aggregation. Without context enhancement and training class balancing, the model stops improving by the 4th iteration of the incremental learning, with much weaker generalisation performance at 28.9% vs 46.9% by the full SL. This suggests the importance of context and data balance in detection model learning, therefore validating our model design considerations.
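The per-iteration trend can be made explicit by differencing the mAP values in the context enhancement (CE) table. The snippet below is only a convenience calculation over the reported numbers; the variable and function names are illustrative.

```python
# mAP (%) at iterations 0..5, copied from the context enhancement table.
with_ce = [18.4, 28.6, 33.2, 39.1, 42.2, 44.4]
without_ce = [18.4, 25.3, 27.7, 28.7, 28.9, 28.0]


def per_iteration_gains(maps):
    """mAP change between consecutive iterations, rounded to 0.1."""
    return [round(later - earlier, 1) for earlier, later in zip(maps, maps[1:])]


gains_with = per_iteration_gains(with_ce)
gains_without = per_iteration_gains(without_ce)

# First iteration whose mAP does not exceed that of the previous iteration.
first_stall = next(
    (i + 1 for i, g in enumerate(gains_without) if g <= 0), None
)

print("With CE:   ", gains_with)
print("Without CE:", gains_without)
print("Without CE, first non-improving iteration:", first_stall)
```

With CE every step yields a positive gain, whereas without CE the gain shrinks to 0.2 at iteration 4 and turns negative at iteration 5, matching the observation that the model stops improving around the 4th iteration.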
VI Conclusion
We present a scalable end-to-end logo detection solution including logo dataset establishment and multi-class logo detection model learning, realised by exploring the webly data learning principle without the tedious cost of manually labelling fine-grained logo annotations. In particular, we propose a new incremental learning method named Scalable Logo Self-co-Learning (SL) that enables reliable self-discovery and auto-labelling of new training images from noisy web data to progressively improve the model detection capability in a cross-model co-learning manner given unconstrained in-the-wild images. Moreover, we construct a very large logo detection benchmarking dataset, WebLogo-2M, by automatically collecting and processing web stream data (Twitter) in a scalable manner, thereby facilitating and motivating further investigation of scalable logo detection by the wider community. Using the newly introduced WebLogo-2M benchmark, we have validated the advantages of the proposed SL approach over state-of-the-art alternative methods, ranging from strongly supervised and weakly supervised detection models to webly data learning models, through extensive comparative evaluations and analysis of the benefits of incremental model training and context enhancement. We also provide in-depth SL model component analysis and evaluation with insights into model performance gain and formulation.
Acknowledgement
This work was partially supported by the China Scholarship Council, Vision Semantics Ltd., the Royal Society Newton Advanced Fellowship Programme (NA150459), and InnovateUK Industrial Challenge Project on Developing and Commercialising Intelligent Video Analytics Solutions for Public Safety.
References
- [1] S. Romberg, L. G. Pueyo, R. Lienhart, and R. Van Zwol, “Scalable logo recognition in real-world images,” in Proceedings of the 1st ACM International Conference on Multimedia Retrieval. ACM, 2011, p. 25.
- [2] S. Romberg and R. Lienhart, “Bundle min-hashing for logo recognition,” in Proceedings of the 3rd ACM conference on International conference on multimedia retrieval. ACM, 2013, pp. 113–120.
- [3] C. Pan, Z. Yan, X. Xu, M. Sun, J. Shao, and D. Wu, “Vehicle logo recognition based on deep learning architecture in video surveillance for intelligent traffic system,” in IET International Conference on Smart and Sustainable City, 2013, pp. 123–126.
- [4] A. Joly and O. Buisson, “Logo retrieval with a contrario visual query expansion,” in ACM International Conference on Multimedia, 2009, pp. 581–584.
- [5] Y. Kalantidis, L. G. Pueyo, M. Trevisiol, R. van Zwol, and Y. Avrithis, “Scalable triangulation-based logo recognition,” in ACM International Conference on Multimedia Retrieval, 2011, p. 20.
- [6] J. Revaud, M. Douze, and C. Schmid, “Correlation-based burstiness for logo retrieval,” in ACM International Conference on Multimedia, 2012, pp. 965–968.
- [7] R. Boia, A. Bandrabur, and C. Florea, “Local description using multi-scale complete rank transform for improved logo recognition,” in IEEE International Conference on Communications, 2014, pp. 1–4.
- [8] K.-W. Li, S.-Y. Chen, S. Su, D.-J. Duh, H. Zhang, and S. Li, “Logo detection with extendibility and discrimination,” Multimedia tools and applications, vol. 72, no. 2, pp. 1285–1310, 2014.
- [9] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in Advances in Neural Information Processing Systems, 2015, pp. 91–99.
- [10] R. Girshick, “Fast r-cnn,” in IEEE International Conference on Computer Vision, 2015.
- [11] J. Redmon and A. Farhadi, “Yolo9000: better, faster, stronger,” in IEEE Conference on Computer Vision and Pattern Recognition, 2017.
- [12] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “Imagenet large scale visual recognition challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
- [13] H. Su, X. Zhu, and S. Gong, “Deep learning logo detection with data expansion by synthesising context,” IEEE Winter Conference on Applications of Computer Vision, 2017.
- [14] Y. Liao, X. Lu, C. Zhang, Y. Wang, and Z. Tang, “Mutual enhancement for detection of multiple logos in sports videos,” in IEEE International Conference on Computer Vision, 2017.
- [15] Y. Li, Q. Shi, J. Deng, and F. Su, “Graphic logo detection with deep region-based convolutional networks,” in IEEE Visual Communications and Image Processing, 2017, pp. 1–4.
- [16] S. C. Hoi, X. Wu, H. Liu, Y. Wu, H. Wang, H. Xue, and Q. Wu, “Logo-net: Large-scale deep logo detection and brand recognition with deep region-based convolutional networks,” arXiv preprint arXiv:1511.02462, 2015.
- [17] X. Xu, T. Hospedales, and S. Gong, “Transductive zero-shot action recognition by word-vector embedding,” International Journal of Computer Vision, pp. 1–25, 2017.
- [18] C. H. Lampert, H. Nickisch, and S. Harmeling, “Attribute-based classification for zero-shot visual object categorization,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 3, pp. 453–465, 2014.
- [19] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, T. Mikolov et al., “Devise: A deep visual-semantic embedding model,” in Advances in Neural Information Processing Systems, 2013.
- [20] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, and S. Reed, “Ssd: Single shot multibox detector,” in European Conference on Computer Vision, 2016.
- [21] L. Dong, J.-B. Huang, Y. Li, S. Wang, and M.-H. Yang, “Weakly supervised object localization with progressive domain adaptation,” in IEEE Conference on Computer Vision and Pattern Recognition, 2015.
- [22] X. Chen and A. Gupta, “Webly supervised learning of convolutional networks,” in IEEE International Conference on Computer Vision, 2015, pp. 1431–1439.
- [23] F. N. Iandola, A. Shen, P. Gao, and K. Keutzer, “Deeplogo: Hitting logo recognition with the deep neural network hammer,” arXiv, 2015.
- [24] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in IEEE Conference on Computer Vision and Pattern Recognition, 2014.
- [25] K. Nigam and R. Ghani, “Analyzing the effectiveness and applicability of co-training,” in Proceedings of the International Conference on Information and Knowledge Management, 2000.
- [26] C. Rosenberg, M. Hebert, and H. Schneiderman, “Semi-supervised self-training of object detection models,” in Seventh IEEE Workshop on Applications of Computer Vision, 2005.
- [27] A. Blum and T. Mitchell, “Combining labeled and unlabeled data with co-training,” in Proceedings of the eleventh annual conference on Computational learning theory. ACM, 1998, pp. 92–100.
- [28] W. Wang and Z.-H. Zhou, “Analyzing co-training style algorithms,” in European Conference on Machine Learning. Springer, 2007, pp. 454–465.
- [29] S. Goldman and Y. Zhou, “Enhancing supervised learning with unlabeled data,” in ICML, 2000, pp. 327–334.
- [30] Z.-H. Zhou and M. Li, “Semi-supervised learning by disagreement,” Knowledge and Information Systems, vol. 24, no. 3, pp. 415–439, 2010.
- [31] Z. Jiang, S. Zhang, and J. Zeng, “A hybrid generative/discriminative method for semi-supervised classification,” Knowledge-Based Systems, vol. 37, pp. 137–145, 2013.
- [32] X. Xu, W. Li, D. Xu, and I. W. Tsang, “Co-labeling for multi-view weakly labeled learning,” IEEE transactions on pattern analysis and machine intelligence, vol. 38, no. 6, pp. 1113–1125, 2016.
- [33] T. Batra and D. Parikh, “Cooperative learning with visual attributes,” arXiv preprint arXiv:1705.05512, 2017.
- [34] C. Gong, D. Tao, S. J. Maybank, W. Liu, G. Kang, and J. Yang, “Multi-modal curriculum learning for semi-supervised image classification,” IEEE Transactions on Image Processing, vol. 25, no. 7, pp. 3249–3260, 2016.
- [35] A. Fakeri-Tabrizi, M.-R. Amini, C. Goutte, and N. Usunier, “Multiview self-learning,” Neurocomputing, vol. 155, pp. 117–127, 2015.
- [36] O. M. Parkhi, A. Vedaldi, and A. Zisserman, “Deep face recognition,” in British Machine Vision Conference, vol. 1, no. 3, 2015, p. 6.
- [37] H. He and E. A. Garcia, “Learning from imbalanced data,” IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 9, pp. 1263–1284, 2009.
- [38] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in European Conference on Computer Vision, 2014.
- [39] M. Everingham, S. A. Eslami, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes challenge: A retrospective,” International Journal of Computer Vision, vol. 111, no. 1, pp. 98–136, 2015.
- [40] J. Hoffman, S. Guadarrama, E. S. Tzeng, R. Hu, J. Donahue, R. Girshick, T. Darrell, and K. Saenko, “Lsda: Large scale detection through adaptation,” in Advances in Neural Information Processing Systems, 2014, pp. 3536–3544.
- [41] R. G. Cinbis, J. Verbeek, and C. Schmid, “Weakly supervised object localization with multi-fold multiple instance learning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 1, pp. 189–203, 2017.
- [42] F. Yu, A. Seff, Y. Zhang, S. Song, T. Funkhouser, and J. Xiao, “Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop,” arXiv, 2015.
- [43] C. L. Zitnick and P. Dollár, “Edge boxes: Locating object proposals from edges,” in European Conference on Computer Vision, 2014.
- [44] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
- [45] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes (voc) challenge,” International journal of computer vision, vol. 88, no. 2, pp. 303–338, 2010.
- [46] A. Sharif Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, “Cnn features off-the-shelf: an astounding baseline for recognition,” in Workshop of IEEE Conference on Computer Vision and Pattern Recognition, 2014.
- [47] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, “How transferable are features in deep neural networks?” in Advances in Neural Information Processing Systems, 2014.