Universal Sketch Perceptual Grouping

Universal Perceptual Grouping


In this work we aim to develop a universal sketch grouper. That is, a grouper that can be applied to sketches of any category in any domain to group constituent strokes/segments into semantically meaningful object parts. The first obstacle to this goal is the lack of large-scale datasets with grouping annotation. To overcome this, we contribute the largest sketch perceptual grouping (SPG) dataset to date, consisting of unique sketches evenly distributed over object categories. Furthermore, we propose a novel deep universal perceptual grouping model. The model is learned with both generative and discriminative losses. The generative losses improve the generalisation ability of the model to unseen object categories and datasets. The discriminative losses include a local grouping loss and a novel global grouping loss to enforce global grouping consistency. We show that the proposed model significantly outperforms the state-of-the-art groupers. Further, we show that our grouper is useful for a number of sketch analysis tasks including sketch synthesis and fine-grained sketch-based image retrieval (FG-SBIR).

Sketch Perceptual grouping, Universal grouper, Deep grouping model, Dataset.
Figure 1: Examples of the SPG dataset. Stroke groups are colour coded.

1 Introduction

Humans effortlessly detect objects and object parts out of a cluttered background. The Gestalt school of psychologists [1, 2] argued that this ability to perceptually group visual cues/patterns into objects is built upon a number of grouping principles, termed Gestalt laws of grouping. These include five categories, namely proximity, similarity, continuity, closure, and symmetry [3].

Computer vision research in grouping or segmentation has long exploited these perceptual grouping principles. For example, in image segmentation [4, 5, 6, 7], pixel visual appearance similarity and local proximity are often used to group pixels into objects. These principles are exploited by the human visual system to robustly perform perceptual grouping in diverse contexts and for diverse object categories. Exploiting them is important for a universal grouping algorithm.

We aim to develop such a universal grouper for human free-hand sketches which takes a sketch as input and groups the constituent strokes into semantic parts. Note that this is different from semantic segmentation for either photos [7] or sketches [8, 9, 10], where each segmented part is given a label, and the labels are often object category-dependent. Only the group relationship between strokes is predicted so that the grouper can universally be applied to any object category.

Very few existing studies [11, 12] investigate sketch perceptual grouping. These approaches compute hand-crafted features from each stroke and use the proximity and continuity principles to compute a stroke affinity matrix for subsequent clustering/grouping. They thus have a number of limitations: (i) Only two out of the five principles are exploited, while the unused ones such as closure are clearly useful in grouping human sketches which can be fragmented (see Fig. 1). (ii) How the principles are formulated is determined manually rather than learned from data. (iii) Fixed weightings of different principles are used which are either manually set [11] or learned [12]. However, for different sketches, different principles could be used by humans with different weightings. Therefore a more dynamic sketch-specific grouping strategy is preferable. Nevertheless, the existing sketch perceptual grouping datasets [8, 12] are extremely small, containing 2,000 annotated sketches at most. This prevents more powerful and flexible deep neural network models from being developed.

The first contribution of this paper is to provide the first large-scale sketch perceptual grouping (SPG) dataset consisting of 20,000 sketches with ground truth grouping annotation, i.e., 10 times larger than the largest dataset to date [12]. The sketches are collected from 25 representative object categories with 800 sketches per category. Some examples of the sketches and their annotation can be seen in Fig. 1. A dataset of such size makes the development of a deep universal grouper possible.

Even with sufficient training samples, learning a deep universal sketch grouper is non-trivial. In particular, there are two challenges: how to make the deep grouper generalisable to unseen object categories and domains without any training data from them; and how to design training losses that enforce both local (stroke pairwise) grouping consistency and global (whole sketch level) grouping consistency given variable number of strokes per sketch. Most losses used by existing deep models are for supervised classification tasks; grouping is closer to clustering than classification so few options exist.

In this paper, we propose a novel deep sketch grouping model to overcome both challenges. Specifically, treating a sketch as a sequence of strokes/segments, our model is a sequence-to-sequence variational autoencoder (VAE). The reconstruction loss in this deep generative model forces the learned representation to preserve information richer than required for the discriminative grouping task alone. This has been proven to be useful for improving model generalisation ability [13], critical for making the grouper universal. As for the discriminative grouping learning objectives, we deploy two losses: a pairwise stroke grouping loss enforcing local grouping consistency and a global grouping loss to enforce global grouping consistency. This separation of the local and global grouping losses enables us to balance the two and makes our model more robust against annotation noise. Based on the proposed grouper we develop a simple sketch synthesis model by grouping and abstracting photo edge maps. The synthesised sketches can be used to learn a strong unsupervised fine-grained sketch-based image retrieval (FG-SBIR) model, i.e., using photos only.

Our contributions are as follows: (1) We contribute the largest sketch perceptual grouping dataset to date with extensive human annotation. To drive future research, we will make the dataset publicly available. (2) For the first time, a deep universal sketch grouper is developed based on a novel deep sequence-to-sequence VAE with both generative and discriminative losses. (3) Extensive experiments show the superiority of our grouper against existing ones, especially when evaluated on new categories or new domains. Its usefulness on a number of sketch analysis tasks including sketch synthesis and FG-SBIR is also demonstrated.

2 Related work

Perceptual Grouping:  Humans can easily extract salient visual structure from apparent noise. Gestalt psychologists referred to this phenomenon as perceptual organisation [1, 2] and introduced the concept of perceptual grouping, which accounts for the observation that humans naturally group visual patterns into objects. A set of simple Gestalt principles were further developed, including proximity, similarity and continuity [3], with closure, connectedness and common fate introduced later, primarily for studying human vision systems [14, 4].

Sketch Groupers:  Very few studies exist on grouping sketch strokes into parts. The most related studies are [11, 12]. They compute an affinity matrix between strokes using hand-crafted features based on proximity and continuity principles. The two principles are combined with fixed weights learned from human annotated stroke groups. In contrast, we assume that when humans draw sketches and annotate them into groups, all grouping principles could be used. Importantly, using which ones and by how much are dependent on the specific sketch instance. Our model is thus a deep neural network that takes the sketch as input and aims to model all principles implicitly via both generative and discriminative grouping losses. It thus has the potential to perform principle selection and weighting dynamically according to the sketch input. We also provide a much larger dataset compared to the one provided in [12]. We show that on both datasets, our model outperforms that in [12] by a big margin. Note that perceptual grouping has been modelled using a deep autoencoder in [15]. However, the objective is to group discrete graphical patterns which has richer visual cues that make them more akin to the problem of image segmentation, and thus easier than grouping line drawings in sketches.

Sketch Semantic Segmentation:  A closely related problem to sketch grouping is sketch semantic segmentation [8, 9, 10]15. The key difference is that a sketch grouper is universal in that it can be applied to any object category as it only predicts whether strokes belong to the same group rather than what group. In contrast, sketch segmentation models need to predict the label of each group. As a result, typically one model is needed for each object category. Note that although two different problems are tackled, our work can potentially benefit sketch semantic segmentation in two ways: (i) The grouping principles modelled implicitly in our grouping model could be used for semantic segmentation, e.g., by modifying/fine-tuning our model to a fully supervised one. (ii) The SPG dataset also contains group ID labels for each category so can be used for developing deep segmentation models, which has not been possible to date due to the small sizes of existing sketch segmentation datasets [8, 9, 10].

Sketch Stroke Analysis:  Like our model, a number of recent sketch models are based on stroke modelling. [10] studied stroke semantic segmentation. A sequence-to-sequence variational autoencoder is used in [16] for a different purpose of conditional sketch synthesis. The work in [17] uses a sketch RNN for sketch abstraction problem by sequentially removing redundant strokes. A stroke-based model is naturally suited for perceptual grouping. Modelling Gestalt principles is harder if a sketch is treated as a 2D pixel array instead of strokes.

Fine-grained SBIR:  FG-SBIR has been a recent focus in sketch analysis [18, 19, 20, 21, 22, 23]. Training a FG-SBIR model typically requires expensive photo-sketch pair collection, which severely restricts its applicability to large number of object categories. In this work, we show that our universal grouper is general enough to be applied to edge maps computed from object photos. The edge maps can then be abstracted by removing the least important groups. The abstracted edge map can be used to substitute human sketches and form synthetic sketch-photo pairs for training a FG-SBIR model. We show that the performance of a model trained in this way approaches that of the same model trained with human labelled data, and is superior to the state-of-the-art unsupervised alternative [17].

3 Free-hand Sketch Perceptual Grouping Dataset

We contribute the Sketch Perceptual Grouping dataset, the largest free-hand sketch perceptual grouping dataset to date. It contains sketches distributed over categories with each sketch manually annotated into parts.

Category Selection  The sketches come from the QuickDraw dataset [16], which is by far the largest sketch dataset. It contains categories of everyday objects. Out of these, are selected for SPG (see Table 1) based on following criteria: (i) Complexity: the category should contain at least three semantic parts, meaning categories such as cloud and moon are out. (ii) Variety: The selected categories need to be sufficiently different from each other to be appropriate for testing the grouper’s generalisation ability to unseen classes. For example, only one of the four-legged animal classes is chosen.

Figure 2: Examples to illustrate our sketch selection process. See details in text.

Sketch Instance Selection:  Each QuickDraw category contains at least 100,000 sketches. Annotating all of them is not feasible. So 800 sketches are chosen from each category. First, some quality screening is performed. Specifically, since all QuickDraw sketches were drawn within 20 seconds, there are a large number of badly drawn sketches that are unrecognisable by humans, making part grouping impossible. We thus first discard sketches which could not be recognised by an off-the-shelf sketch classifier [24]. The remaining sketches are then subject to the following instance selection criteria: (i) Majority: Sketches in each category can form subcategories which can be visually very different from each other. Only the sketches from the majority subcategory are selected. For example, the top row of Fig. 2 shows that most sketches from the alarm clock category belong to the “with hand” subcategory, whilst a small minority depicts digital clocks without hands. Only sketches from the former are selected. (ii) Complexity: Over-abstract sketches with less than three parts are removed. (iii) Ambiguity: Finally, we eliminate sketches that contain both the target object and other objects/background to avoid ambiguity of the object category. Fig. 2 shows examples of how these criteria are enforced during instance selection.

Annotation:  Now given a sketch, each annotator is asked to group the strokes into groups. Each group has a semantic meaning and typically corresponds to an object part. So apart from the grouping label, the group ID is also annotated. Even though the group ID information is not used in our perceptual grouping model, it can be used when the task is sketch semantic segmentation. To obtain consistent grouping annotation, 25 annotators are recruited and each only annotates one category. Examples of the annotation can be seen in Fig. 1.

4 Deep Universal Sketch Perceptual Grouper

4.1 Model Overview

Our deep sketch grouper is a variant of the sequence-to-sequence variational autoencoder (VAE) [25, 26]. As shown in Fig. 3, it is essentially a deep encoder-decoder with both the encoder and decoder being RNNs for modelling a sketch as a set of strokes. The encoder produces a global representation of the sketch, which is used as a condition for a variational decoder that aims to reconstruct the input sketch. Such sketch synthesis is a side task here. Our main aim is for the decoder to produce a representation of each stroke useful for grouping them. Once learned, the decoder should implicitly model all the grouping principles used by the annotators in producing the grouping labels, so that the learned stroke representation can be used to compute a stroke affinity matrix indicating the correct stroke grouping. To this end, the decoder has two branches: a generative branch to reconstruct the input sketch; and a discriminative branch that produces the discriminative stroke feature/affinity matrix.

Figure 3: A schematic of the proposed deep perceptual grouper.

4.2 Encoder and Decoder Architecture

Traditional perceptual grouping methods treat sketches as images composed of static pixels, thus neglecting the dependency between different segments and strokes (each stroke consists of a variable number of line segments). In our dataset, all the sketches are captured in a vectorised format, making sequential modelling of sketches possible. More specifically, we first represent a sketch as a sequence of stroke-segments . Each segment is a tuple , where and denote the offsets along the horizontal and vertical directions respectively, while represents the drawing state, following the same representation used for human handwriting [27].

With these stroke segments as inputs, both the encoder and decoder are RNNs. In particular, we adopt the same architecture as in sketch-rnn [16] for conditional sketch synthesis. That is, a bi-directional RNN [28] is used as the encoder to extract the global embedding of the input sketch. The final state output of the encoder is then projected to a mean and a variance vector, to define an IID Gaussian distribution. That distribution is then sampled to produce a random vector as the representation of the input sketch. Thus is not a deterministic output of the encoder given a sketch, but a random vector conditional on the input. The decoder is an LSTM model. Its initial state is conditional on via a single fully connected (FC) layer. At each time step, it then predicts the offset for each stroke segment in order to reconstruct the input sketch. For further details on the encoder/decoder architecture, please refer to [16].

4.3 Formulation

The decoder splits into two branches after the LSTM hidden cell outputs: a generative branch to synthesise a sketch and a discriminative branch for grouping. Different learning objectives are used for the two branches: in the generative branch, two losses encourage the model to reconstruct the input sketch; in the discriminative branch, the sketch grouping annotation is used to train the decoder to produce an accurate stroke affinity matrix for grouping.

Group Affinity Matrix:  The grouping annotation is represented as a sparse matrix denoting the group relationship between segments . Denoting the sketch segment as , we have:


Where each element of the matrix indicates whether the and segments belong to the same group or not. A straightforward design of the discriminative learning objective is to make the affinity matrix computed using the learned stroke feature as similar as possible to , via an or loss. However, we found that in practice this works very poorly. This is because conveys two types of grouping constraints: each element enforces a binary pairwise constraint for two segments, whilst the whole matrix also enforces global grouping constraint, e.g., if and are in the same group, and and are also in the same group, then global grouping consistency dictates that and must also belong to the same group. Balancing these two is critical because pairwise grouping predictions are typically noisy and can lead to global grouping inconsistency. However, using a single loss makes it impossible to achieve a balance. We thus propose to use two losses to implements the two constraints.

Local Grouping Loss:  This loss requires that the pairwise relationship between two segments are kept when the pairwise affinity is measured using the learned stroke segment feature. The decoder LSTM learns a mapping function and map the stroke segment to a 128D feature vector . To measure the affinity of any two segments in the input sketch, the absolute element-wise feature difference is computed to obtain a symmetric absolute feature difference matrix, as:


Each vector is then subject to a binary classification loss (cross-entropy) to obtain the local affinity prediction , between the and segments. The local grouping loss, , is thus computed as:


Global Grouping Loss:  Using only a local grouping loss may lead to global grouping inconsistency. However, formulating the global grouping consistency into a loss for a deep neural network is not straightforward. Our strategy is to first derive a global grouping representation for each segment using the local affinity prediction . We then use a triplet ranking loss to enforce that the segments belonging the same group have more similar grouping relationships to each other, than to a segment outside the group. Although the triplet ranking loss involves three segments only, since each segment is represented by its grouping relationship to all other segments, this loss is a global grouping loss. More concretely, we first construct the local affinity prediction matrix with as elements. Each row vector of , is then used as a global grouping relationship vector to represent . The final global grouping loss is:


where represents an anchor segment, a positive segment in the same group and a negative segment from a different group, is a margin and denotes a distance function between two feature inputs. Here we take the squared Euclidean distance under the normalisation.

Generative Losses:  For the generative branch, we use the same generative losses as in [16]. These include a reconstruction loss and a KL loss measuring the difference between the latent random vector and a IID Gaussian vector with zero-mean and unit variance.

Full Learning Objective:  Our full loss can be written as:


where the hyper-parameters , , and describe the relative importance of the different losses in the full training objective.

Model Testing:  During testing stage, given a sketch, the trained model is used to compute an estimated segment affinity matrix, . This affinity matrix is then used to generate the final grouping. Since the number of groups varies for different sketches, the group number also needs to be estimated. To this end, we adopt a recent agglomerative clustering method [29] to obtain the final grouping. Note that the method does not introduce any additional free parameters.

4.4 Applications to Sketch Analysis

Sketch Synthesis from Edge Map:  A simple sketch synthesis method can be developed based on the proposed universal grouper. The method is based on grouping edge maps extracted from photo images and removing the least important groups. Assume that the segments of an edge map have been grouped in groups, denoted as . An importance measure is defined as:


where , and measure the importance from the perspectives of length, numbers and distribution of the segments in group respectively. A less important group has smaller number of segments with shorter lengths but occupies a bigger region. We thus have:


where is the number of segments in , is the length of segment , and are the width and height of the object, respectively, denotes the average position of group in the image plane, represents the average position of segment , and Euclidean distance is used. With the importance measure computed for each group, we can then drop the least important groups defined as those with where is a threshold.

Fine-Grained Sketch Based Image Retrieval:  We further develop an unsupervised FG-SBIR method following [17]. Specifically, we apply our grouper to edge maps extracted from photos to synthesise human style sketches. Three threshold values of are used for each photo to accounts for the variable levels of abstraction among human sketchers. The photos and corresponding synthesised sketches are then used as data to train an off-the-shelf FG-SBIR model [19]. During testing, the grouping and group removal processes are applied to the human sketches, again with three different thresholds. The matching scores using the three abstracted sketches plus the original query sketch are then fused to produce the final retrieval results. Note that for this unsupervised FG-SBIR model to work well, our grouper must be truly universal: it needs to work well on both human sketches which it was trained on, and photo edge maps.

5 Experiments

5.1 Datasets and Settings

Dataset Splits and Preprocessing:  Among the 25 categories in the new SPG dataset, we randomly select 20 as seen categories, and use the remaining 5 categories as unseen categories to test the generalisation of our universal grouper. In each seen category, we select 650 sketches for training, 50 for validation, and 100 for testing. For the unseen categories, no data are used for training and we randomly select 100 sketches per category for testing to have the same per-category size as the seen categories. We normalise all the sketch strokes, and augment the sketch via stroke removal and distortion [24].

Implementation Details:  Our deep grouper is implemented on Tensorflow on a single Titan X GPU. For model training, we set the importance weights , and for different losses (Eq. (4)) to 0.5, 0.6, and 1, respectively. The Adam optimiser [30] is applied with the parameters , , . The initial learning rate is set to 0.0003 with exponential weight decay. The model is trained for 22,000 iterations with a batch size of 100.

Evaluation Metrics:  Sketch perceptual grouping shares many common characteristics with the unsupervised image segmentation problem [5]. We thus adopt the same metrics including variation of information (VOI), probabilistic rand index (PRI), and segmentation covering (SC) as defined in [31]. More detailed definition of these metrics in the context of sketch grouping are: (i) Variation of Information: In this metric, the distance between two groups in terms of their average conditional entropy is calculated. (ii) Probabilistic Rand Index: Rand index compares the compatibility of assignments between pairs of stroke segments in each group. (iii) Segmentation Covering: the overlapping between the machine grouping and human grouping is measured. For SC and PRI, higher scores are better, while for VOI, a lower score indicates better results.

Competitors:  Very few sketch perceptual grouping methods exist. The state-of-the-art model Edge-PG [12] uses two Gestalt principles, namely proximity (spatial closeness) and continuity (slope trend) to compute an affinity matrix and feeds the matrix to a graph cut algorithm to get the groups. The weightings of the two principles are learned from data using RankSVM. This method thus differs from ours in that hand-crafted features are used and only two principles are modelled. Beyond sketch grouping, many semantic image segmentation methods have been proposed lately based on fully convolutional networks (FCN). We choose one of the state-of-the-art models, DeepLab [7] as a baseline. It is trained to take images as input and output the semantic grouping, i.e., each pixel is assigned a class label. A conditional random field (CRF) is integrated to the network to enforce the proximity and similarity principles. Note that: (1) DeepLab is a supervised semantic segmentation method. It thus needs not only grouping annotation as our model does, but also group ID annotation, which is not used by our model and Edge-PG. This gives it an unfair advantage. (2) It performs grouping at the pixel level whilst both our model and Edge-FG do it at the stroke/segment level.

5.2 Results on Perceptual Grouping

Category Ours Edege-PG [12] DeepLab [7]
Airplane 0.58 0.88 0.78 0.72 0.80 0.71 1.09 0.72 0.65
Alarm clock 0.46 0.93 0.83 0.59 0.84 0.73 0.86 0.80 0.70
Ambulance 0.67 0.86 0.77 1.35 0.67 0.60 1.19 0.71 0.63
Ant 0.86 0.83 0.69 1.32 0.68 0.62 1.38 0.69 0.60
Apple 0.25 0.92 0.91 0.54 0.88 0.79 0.82 0.83 0.72
Backpack 0.57 0.88 0.79 1.29 0.70 0.61 1.59 0.67 0.59
Basket 0.76 0.84 0.74 1.27 0.71 0.59 1.37 0.69 0.61
Butterfly 0.83 0.76 0.65 1.30 0.69 0.58 1.58 0.66 0.58
Cactus 0.51 0.90 0.83 0.86 0.82 0.71 0.90 0.79 0.68
Calculator 0.50 0.86 0.83 0.98 0.77 0.68 1.17 0.72 0.64
Camp fire 0.28 0.95 0.91 1.05 0.71 0.65 0.77 0.85 0.74
Candle 0.89 0.78 0.69 1.47 0.65 0.57 1.54 0.67 0.60
Coffee cup 0.38 0.91 0.86 0.85 0.83 0.68 0.98 0.79 0.66
Crab 0.69 0.81 0.74 1.29 0.69 0.56 1.58 0.67 0.60
Duck 0.86 0.83 0.69 0.95 0.74 0.68 1.63 0.65 0.57
Face 0.81 0.84 0.74 1.24 0.69 0.61 0.80 0.82 0.73
Ice-cream 0.41 0.94 0.85 0.79 0.82 0.71 1.40 0.68 0.62
Pig 0.63 0.84 0.78 1.55 0.63 0.50 0.98 0.77 0.67
Pineapple 0.50 0.93 0.82 0.63 0.83 0.72 1.05 0.74 0.65
Suitcase 0.54 0.89 0.83 0.58 0.82 0.75 1.10 0.73 0.64
Average 0.59 0.87 0.79 1.03 0.75 0.65 1.20 0.73 0.65
Table 1: Comparative grouping results on seen categories.
Category Ours Edge-PG [12]
Angel 0.70 0.87 0.73 1.19 0.69 0.60
Bulldozer 0.81 0.85 0.73 1.37 0.65 0.58
Drill 0.67 0.78 0.77 1.45 0.61 0.53
Flower 0.39 0.90 0.84 0.79 0.75 0.64
House 0.46 0.91 0.83 0.85 0.77 0.69
Average 0.64 0.86 0.77 1.13 0.69 0.61
Table 2: Perceptual grouping results on unseen categories.

Results on Seen Categories:  In this experiment, the model are trained on the seen category training set and tested on the seen category testing set. From Table 1, we can see that: (i) Our model achieves the best performance across all 25 categories on each metric, except the VOI metric on face where our model is slightly inferior to DeepLab. The improvement on VOI is particularly striking indicating that the groups discovered by our model in each sketch are distinctive to each other. In contrast, the two compared models tend to split a semantic part into multiple groups resulting similar groups (see Fig. 4). (ii) Edge-PG is much worse than our method because it is based on hand-crafted features for only two principles, while our model implicitly learns the features and combination strategy based on end-to-end learning from human group annotation. (iii) Although DeepLab also employs a deep neural network and uses additional annotations, its result is no better than Edge-PG. This suggests that for sketch perceptual grouping, it is important to treat sketches as a set of strokes rather than pixels, as strokes already grouping pixels. These constraints are ignored by the DeepLab types of models designed for photo image segmentation.

Figure 4: Qualitative grouping results on seen categories.
Figure 5: Qualitative grouping results on unseen categories.

Some examples of the grouping results are shown in Fig. 4. As expected, ignoring the stroke level grouping constraint on pixels, each stroke is often split into multiple groups by DeepLab [7]. Edge-PG [12] does not suffer from that problem. However, it suffers from the limitations on modelling only two principles. For example, to group the clock contour (second column) into one group, the closure principle needs to be used. It is also unable to model even the two principles effectively due to the limited expressive power of hand-crafted features: in the airplane example (first column), the two wings should be grouped together using the continuity principle, but broken into two by Edge-PG. In contrast, our model produces more consistent groupings using multiple principles dynamically. For instance, both DeepLab and our model successfully deploy the similarity principle to group the two legs of both the alarm clock (second column) and duck (third column) together. But DeepLab does so by explicitly encoding the principle in its CRF layer, while our model does it implicitly. In the cactus example (last column), to produce the correct grouping of those spikes, both continuity, similarity and less prevalent principles such as common fate need to be combined. Only our model is able to do that because it has implicitly learned to model all the principles used by humans to annotate the groupings.

Results on Unseen Categories:  In this experiment, models learned using seen categories are tested directly on unseen categories without any fine-tuning. This experiment is thus designed to evaluate whether the grouper is indeed universal, i.e., can be applied to any new object category. Note that as a supervised segmentation method, DeepLab cannot be applied here because each category has a unique set of group IDs. The results of our model and Edge-GP are shown in Table 2. It can been seen that our model outperforms Edge-GP by a big margin. Importantly, comparing Table 2 with Table 1, our model’s performance on PRI and SC hardly changed. In contrast, the Edge-PG’s performance on the unseen categories is clearly worse than that on the seen categories. This suggests that our grouper is more generalisable and universal. Some qualitative results are shown in Fig. 5. It again shows that the lack of powerful feature learning and limitation on only two principles contribute to the weaker results of Edge-GP.

Edge-PG [12] 1.69 0.62 0.53
Ours 0.96 0.78 0.71
Table 3: Compare Edge-PG [12] with ours on Edge-PG [12]’s datasets.
Method Seen Categories Unseen Categories
Ours - A - G 1.45 0.65 0.59 1.53 0.64 0.56
Ours - R - G 1.12 0.71 0.64 1.36 0.68 0.59
Ours - R - A 1.27 0.69 0.63 1.48 0.64 0.57
Ours - G 0.63 0.86 0.78 0.71 0.84 0.73
Ours - A 0.75 0.80 0.72 0.95 0.78 0.67
Ours - R 0.68 0.83 0.76 0.86 0.78 0.69
Ours + 2.68 0.58 0.49 2.63 0.59 0.49
Ours full model 0.59 0.87 0.79 0.64 0.86 0.77
Table 4: Performance of different variants of our model on seen and unseen categories.
Method Shoe-V2 Chair-V2
Top1 Top10 Top1 Top10
Scribbler [32] 8.86% 32.28% 31.27% 78.02%
LDSA [17] 21.17% 55.86% 41.80% 84.21%
Ours 26.88% 61.86% 45.57% 88.61%
Upper Bound 34.38% 79.43% 48.92% 90.71%
Table 5: FG-SBIR performance on Shoe-V2 and Chair-V2 dataset

Results on Unseen Dataset:  To further demonstrate the generalisation ability of our universal grouper, we test the trained model on a different dataset. Specifically, we choose 10 categories from the dataset in [33] including 5 categories overlapping with our dataset and 5 new categories. Note that the sketches in this datasets are from the database proposed in [34], which are drawn without the 20 second constraint, thus exhibiting much more details with better quality in general. This dataset thus represents a different domain. Table 3 shows that our model again demonstrates better generalisation ability than Edge-PG.

Ablation Study:  Our model is trained with a combination of generative and discriminative losses (Sec. 4.3). These include the local grouping loss , global grouping loss , reconstruction loss and KL loss . Among them, all but the KL loss can be removed, leading to six variants of our model, e.g., Ours - A - G is obtained by removing and . In addition, we implement Ours + which is having an loss on the predicted affinity matrix w.r.t. the ground truth matrix to examine the importance of having separate local and global grouping losses. The results are shown in Table 4. Clearly all three losses contribute to the performance of our model. The poorest result was obtained when an loss is added on the predicted affinity matrix, suggesting that balancing the local and global grouping losses is critical for learning a good grouper. We further show that the improvement of our full model over Ours - R on the unseen categories (0.64 vs 0.86) is bigger on the seen categories (0.59 vs. 0.68). This indicates that the generative loss helps the model generalise to unseen categories.

Figure 6: Applying our grouper to synthesise abstract sketches from photo edge map. (a) columns show the photos; (b) columns give the edge maps extracted from the photos and the grouping results; (c) columns provide synthesised abstract sketches.

5.3 Applications on Sketch Synthesis and FG-SBIR

One application of our grouper is to use it as an abstraction model so that edge maps extracted from photos can be grouped and abstracted to synthesise human-like sketches. Fig. 6 shows some examples of edge map grouping results and the synthesised sketches. It can be seen that our grouper is generalisable to photo edges and our abstraction method produces visually appealing sketches. The synthesised sketches are then used to train a state-of-the-art FG-SBIR model [19] without using any real human sketches. We use the largest FG-SBIR datasets QMUL Shoe-V2 and Chair-V2 [35]. We first compare with the same FG-SBIR model trained using synthesised sketches from the deep conditional GAN network in [32] (denoted as Scribbler). As can be seen in Table 5, our model performs much better. This suggests that our edge abstraction model, albeit simple, synthesises more realistic sketches from edge maps. We further compare with a recently proposed unsupervised FG-SBIR model LDSA [17] which is also based on abstracting photo edge maps to synthesise sketches. Table 5 shows that our model outperforms LDSA model by 5.71% and 3.63% on top 1 accuracy on Shoe-V2 and Chair-V2, respectively. The results are not far off the upper-bound which is obtained using the same FG-SBIR model trained with the real sketch-photo pairs in Shoe-V2 and Chair-V2. This shows that our method enables FG-SBIR to be used without the expensive collection of sketch-photo pairs.

6 Conclusion

We have proposed an end-to-end sketch perceptual grouping model. This is made possible by collecting a new large-scale sketch grouping dataset SPG. Our grouper is trained with generative losses to make it generalisable to new object categories and datasets/domains. Two grouping losses were also formulated to balance the local and global grouping constraints. Extensive experiments showed that our model significantly outperforms existing groupers. We also demonstrated our grouper’s application to sketch synthesis and FG-SBIR.


  1. email: {like1990,zhhg}@bupt.edu.cn, {t.hospedales}@ed.ac.uk
  2. email: {kaiyue.pang,j.song,yizhe.song,t.xiang}@qmul.ac.uk
  3. email: {like1990,zhhg}@bupt.edu.cn, {t.hospedales}@ed.ac.uk
  4. email: {kaiyue.pang,j.song,yizhe.song,t.xiang}@qmul.ac.uk
  5. email: {like1990,zhhg}@bupt.edu.cn, {t.hospedales}@ed.ac.uk
  6. email: {kaiyue.pang,j.song,yizhe.song,t.xiang}@qmul.ac.uk
  7. email: {like1990,zhhg}@bupt.edu.cn, {t.hospedales}@ed.ac.uk
  8. email: {kaiyue.pang,j.song,yizhe.song,t.xiang}@qmul.ac.uk
  9. email: {like1990,zhhg}@bupt.edu.cn, {t.hospedales}@ed.ac.uk
  10. email: {kaiyue.pang,j.song,yizhe.song,t.xiang}@qmul.ac.uk
  11. email: {like1990,zhhg}@bupt.edu.cn, {t.hospedales}@ed.ac.uk
  12. email: {kaiyue.pang,j.song,yizhe.song,t.xiang}@qmul.ac.uk
  13. email: {like1990,zhhg}@bupt.edu.cn, {t.hospedales}@ed.ac.uk
  14. email: {kaiyue.pang,j.song,yizhe.song,t.xiang}@qmul.ac.uk
  15. Their relationship is analogous to that between unsupervised image segmentation [5, 6] and semantic segmentation [7].


  1. Wagemans, J., Elder, J.H., Kubovy, M., Palmer, S.E., Peterson, M.A., Singh, M., von der Heydt, R.: A century of gestalt psychology in visual perception: I. Perceptual grouping and figure–ground organization. Psychological bulletin (2012)
  2. Wagemans, J., Feldman, J., Gepshtein, S., Kimchi, R., Pomerantz, J.R., van der Helm, P.A., van Leeuwen, C.: A century of gestalt psychology in visual perception: II. Conceptual and theoretical foundations. Psychological bulletin (2012)
  3. Wertheimer, M.: Laws of organization in perceptual forms. London, England: Kegan Paul, Trench, Trubner & Company (1938)
  4. Ren, X., Malik, J.: Learning a classification model for segmentation. In: ICCV. (2003)
  5. Xia, X., Kulis, B.: W-net: A deep model for fully unsupervised image segmentation. ArXiv e-prints (2017)
  6. Wang, C., Yang, B., Liao, Y.: Unsupervised image segmentation using convolutional autoencoder with total variation regularization as preprocessing. In: ICASSP. (2017)
  7. Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. arXiv:1606.00915 (2016)
  8. Sun, Z., Wang, C., Zhang, L., Zhang, L.: Free hand-drawn sketch segmentation. In: ECCV. (2012)
  9. Huang, Z., Fu, H., Lau, R.W.: Data-driven segmentation and labeling of freehand sketches. TOG (2014)
  10. Schneider, R.G., Tuytelaars, T.: Example-based sketch segmentation and labeling using crfs. TOG (2016)
  11. Qi, Y., Guo, J., Li, Y., Zhang, H., Xiang, T., Song, Y.Z.: Sketching by perceptual grouping. In: ICIP. (2013)
  12. Qi, Y., Song, Y.Z., Xiang, T., Zhang, H., Hospedales, T., Li, Y., Guo, J.: Making better use of edges via perceptual grouping. In: CVPR. (2015)
  13. Hinton, G., Salakhutdinov, R.: Reducing the dimensionality of data with neural networks. Science (2006)
  14. Amir, A., Lindenbaum, M.: A generic grouping algorithm and its quantitative analysis. TPAMI (1998)
  15. Lun, Z., Zou, C., Huang, H., Kalogerakis, E., Tan, P., Cani, M.P., Zhang, H.: Learning to group discrete graphical patterns. TOG (2017)
  16. Ha, D., Eck, D.: A neural representation of sketch drawings. arXiv preprint arXiv:1704.03477 (2017)
  17. Muhammad, U.R., Song, Y.Z., Xiang, T., Hospedales, T.: Learning deep sketch abstraction. In: CVPR. (2018)
  18. Li, Y., Hospedales, T.M., Song, Y.Z., Gong, S.: Fine-grained sketch-based image retrieval by matching deformable part models. In: BMVC. (2014)
  19. Yu, Q., Liu, F., Song, Y.Z., Xiang, T., Hospedales, T.M., Loy, C.C.: Sketch me that shoe. In: CVPR. (2016)
  20. Song, J., Song, Y.Z., Xiang, T., Hospedales, T., Ruan, X.: Deep multi-task attribute-driven ranking for fine-grained sketch-based image retrieval. In: BMVC. (2016)
  21. Pang, K., Song, Y.Z., Xiang, T., Hospedales, T.M.: Cross-domain generative learning for fine-grained sketch-based image retrieval. In: BMVC. (2017)
  22. Hu, C., Li, D., Song, Y.Z., Xiang, T., Hospedales, T.: Sketch-a-classifier: Sketch-based photo classifier generation. In: CVPR. (2018)
  23. Song, J., Qian, Y., Song, Y.Z., Xiang, T., Hospedales, T.: Deep spatial-semantic attention for fine-grained sketch-based image retrieval. In: ICCV. (2017)
  24. Yu, Q., Yang, Y., Liu, F., Song, Y.Z., Xiang, T., Hospedales, T.M.: Sketch-a-net: A deep neural network that beats humans. IJCV (2017)
  25. Bowman, S.R., Vilnis, L., Vinyals, O., Dai, A.M., Jozefowicz, R., Bengio, S.: Generating sentences from a continuous space. Computer Science (2015)
  26. Kingma, D.P., Welling, M.: Auto-encoding variational bayes. ArXiv e-prints (2013)
  27. Graves, A.: Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850 (2013)
  28. Schuster, M., Paliwal, K.K.: Bidirectional recurrent neural networks. TSP (1997)
  29. Yang, J., Parikh, D., Batra, D.: Joint unsupervised learning of deep representations and image clusters. In: CVPR. (2016)
  30. Kingma, D., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
  31. Arbeláez, P., Maire, M., Fowlkes, C., Malik, J.: Contour detection and hierarchical image segmentation. TPAMI (2011)
  32. Sangkloy, P., Lu, J., Fang, C., Yu, F., Hays, J.: Scribbler: Controlling deep image synthesis with sketch and color. CVPR (2017)
  33. Qi, Y., Guo, J., Song, Y.Z., Xiang, T., Zhang, H., Tan, Z.H.: Im2sketch: Sketch generation by unconflicted perceptual grouping. Neurocomputing (2015)
  34. Eitz, M., Hays, J., Alexa, M.: How do humans sketch objects? TOG (2012)
  35. Yu, Q., Song, Y.Z., Xiang, T., Hospedales, T.M.: SketchX – Shoe/Chair fine-grained SBIR dataset. http://sketchx.eecs.qmul.ac.uk (2017)
Comments 0
Request Comment
You are adding the first comment!
How to quickly get a good reply:
  • Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
  • Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
  • Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minumum 40 characters
Add comment
Loading ...
This is a comment super asjknd jkasnjk adsnkj
The feedback must be of minumum 40 characters
The feedback must be of minumum 40 characters

You are asking your first question!
How to quickly get a good answer:
  • Keep your question short and to the point
  • Check for grammar or spelling errors.
  • Phrase it like a question
Test description