Learning like a Child:
Fast Novel Visual Concept Learning from Sentence Descriptions of Images
In this paper, we address the task of learning novel visual concepts, and their interactions with other concepts, from a few images with sentence descriptions. Using linguistic context and visual features, our method is able to efficiently hypothesize the semantic meaning of new words and add them to its word dictionary so that they can be used to describe images which contain these novel concepts. Our method has an image captioning module based on [mao2014deep] with several improvements. In particular, we propose a transposed weight sharing scheme, which not only improves performance on image captioning, but also makes the model more suitable for the novel concept learning task. We propose methods to prevent overfitting the new concepts. In addition, three novel concept datasets are constructed for this new task, and are publicly available on the project page. In the experiments, we show that our method effectively learns novel visual concepts from a few examples without disturbing the previously learned concepts. The project page is: www.stat.ucla.edu/~junhua.mao/projects/child_learning.html.
Recognizing, learning and using novel concepts is one of the most important cognitive functions of humans. When we were very young, we learned new concepts by observing the visual world and listening to the sentence descriptions of our parents. The process was slow at the beginning, but became much faster after we accumulated enough learned concepts [bloom2002children]. In particular, it is known that children can form quick and rough hypotheses about the meaning of new words in a sentence based on their knowledge of previously learned words [carey1978acquiring, heibeck1987word], associate these words with objects or their properties, and describe novel concepts using sentences containing the new words [bloom2002children]. This phenomenon has been studied for over 30 years by psychologists and linguists who research the process of word learning [swingley2010fast].
In the computer vision field, several methods have been proposed [fei2006one, salakhutdinov2010one, tommasi2014learning, lazaridou2014wampimuk] to handle the problem of learning new categories of objects from a handful of examples. This task is important in practice because we sometimes do not have enough data for novel concepts and hence need to transfer knowledge from previously learned categories. Moreover, we do not want to retrain the whole model every time we add a few images with novel concepts, especially when the amount of data or the number of model parameters is very large.
However, these previous methods concentrate on learning classifiers, or mappings, between single words (e.g. a novel object category) and images. We are unaware of any computer vision studies into the task of learning novel visual concepts from a few sentences and then using these concepts to describe new images – a task that children seem to do effortlessly. We call this the Novel Visual Concept learning from Sentences (NVCS) task (see Figure 2).
In this paper, we present a novel framework to address the NVCS task. We start with a model that has already been trained with a large amount of visual concepts. We propose a method that allows the model to enlarge its word dictionary to describe the novel concepts using a few examples and without extensive retraining. In particular, we do not need to retrain models from scratch on all of the data (all the previously learned concepts and the novel concepts). We propose three datasets for the NVCS task to validate our model, which are available on the project page.
Our method requires a base model for image captioning which will be adapted to perform the NVCS task. We choose the m-RNN model [mao2014deep], which performs at the state of the art, as our base model. Note that we could use most of the current image captioning models as the base model in our method. But we make several changes to the model structure of m-RNN partly motivated by the desire to avoid overfitting, which is a particular danger for NVCS because we want to learn from a few new images. We note that these changes also improve performance on the original image captioning task, although this improvement is not the main focus of this paper. In particular, we introduce a transposed weight sharing (TWS) strategy (motivated by auto-encoders [bengio2009learning]) which reduces, by a factor of one half, the number of model parameters that need to be learned. This allows us to increase the dimension of the word-embedding and multimodal layers, without overfitting the data, yielding a richer word and multimodal dense representation. We train this image captioning model on a large image dataset with sentence descriptions. This is the base model which we adapt for the NVCS task.
Now we address the task of learning the new concepts from a small set of new data that contains these concepts. There are two main difficulties. Firstly, the weights for the previously learned concepts may be disturbed by the new concepts, although this can be solved by fixing those weights. Secondly, learning the new concepts from positive examples only can introduce bias. Intuitively, the model assigns a baseline probability to each word, roughly proportional to the frequency of the word in the sentences. When we train the model on the new data, the baseline probabilities of the new words will be unreliably high. We propose a strategy that addresses this problem by fixing the baseline probability of the new words.
We construct three datasets to validate our method, which involve new concepts of man-made objects, animals, and activities. The first two datasets are derived from the MS-COCO dataset [lin2014microsoft]. The third new dataset is constructed by adding three uncommon concepts that do not occur in MS-COCO or other standard datasets: quidditch, t-rex and samisen (see section 5). The dataset is available at www.stat.ucla.edu/~junhua.mao/projects/child_learning.html. We are adding more novel concepts to this dataset; the latest version contains 8 additional novel concepts: tai-ji, huangmei opera, kiss, rocket gun, tempura, waterfall, wedding dress, and windmill. The experiments show that training our method on only a few examples of the new concepts gives performance as good as retraining the entire model on all the examples.
2 Related Work
Deep neural networks Recently there has been dramatic progress in deep neural networks for natural language and computer vision. For natural language, Recurrent Neural Networks (RNNs [elman1990finding, mikolov2010recurrent]) and Long Short-Term Memory networks (LSTMs [hochreiter1997long]) achieve state-of-the-art performance on many NLP tasks such as machine translation [kalchbrenner2013recurrent, cho2014learning, sutskever2014sequence] and speech recognition [mikolov2010recurrent]. For computer vision, deep Convolutional Neural Networks (CNNs [lecun2012efficient]) outperform previous methods by a large margin on the tasks of object classification [krizhevsky2012imagenet, simonyan2014very] and detection [girshick2014rcnn, ouyang2014deepid, zhu2014learning]. The success of these methods for language and vision motivates their use for multimodal learning tasks (e.g. image captioning and sentence-image retrieval).
Multimodal learning of language and vision Methods for image-sentence retrieval [frome2013devise, socher2014grounded], image description generation [kulkarni2011baby, mitchell2012midge, gupta2012image] and visual question answering [gao2015you, malinowski2014multi, antol2015vqa] have developed rapidly in recent years. Very recent work on image captioning includes [mao2014explain, kiros2014unifying, karpathy2014deep, vinyals2014show, donahue2014long, fang2014captions, chen2014learning, lebret2014simple, malinowski2014pooling, klein2014fisher, xu2015show, ma2015multimodal]. Many of these (e.g. [mao2014deep, vinyals2014show]) adopt an RNN-CNN framework that optimizes the log-likelihood of the caption given the image and trains the network end-to-end. An exception is [fang2014captions], which incorporates visual detectors, language models, and multimodal similarity models in a high-performing pipeline. Evaluation metrics for the image captioning task have also been discussed [elliott2014comparing, vedantam2014cider]. All of these image captioning methods use a pre-specified and fixed word dictionary and train their models on a large dataset. Our method can be directly applied to any captioning model that adopts an RNN-CNN framework, and our strategy to avoid overfitting is useful for most of these models in the novel visual concept learning task.
Zero-shot and one-shot learning For zero-shot learning, the task is to associate dense word vectors or attributes with image features [socher2013zero, frome2013devise, elhoseiny2013write, antol2014zero, lazaridou2014wampimuk]. The dense word vectors in these papers are pre-trained on a large text corpus, and the semantic representation of a word is captured from its co-occurrence with other words [mikolov2013distributed]. [lazaridou2014wampimuk] extended this idea to the setting where novel words are shown only a few times. In addition, [sharmanska2012augmented] adopted auto-encoders with attribute representations to learn new class labels, and [weston2010large] proposed a method that scales to large datasets using label embeddings.
Another related task is one-shot learning of new categories [fei2006one, lake2011one, tommasi2014learning], where new objects are learned from only a few examples. However, these works consider only words or attributes rather than sentences, so their learning target differs from the task addressed in this paper.
3 The Image Captioning Model
We need an image captioning model as the base model, which will be adapted in the NVCS task. Our base model builds on the m-RNN model [mao2014deep]. Its architecture is shown in Figure 2(a). We make two main modifications to the architecture to make it more suitable for the NVCS task which, as a side effect, also improve performance on the original image captioning task. Firstly, and most importantly, we propose a transposed weight sharing strategy that significantly reduces the number of parameters in the model (see section 3.2). Secondly, we replace the recurrent layer in [mao2014deep] with a Long Short-Term Memory (LSTM) layer [hochreiter1997long]. The LSTM is a recurrent neural network designed to alleviate the exploding and vanishing gradient problems. We briefly introduce the framework of the model in section 3.1 and describe the details of the transposed weight sharing strategy in section 3.2.
3.1 The Model Architecture
As shown in Figure 2(a), the input of our model for each word in a sentence is the index of the current word in the word dictionary as well as the image. We represent this index as a one-hot vector (a binary vector with only one non-zero element indicating the index). The output is the index of the next word. The model has three components: the language component, the vision component and the multimodal component. The language component contains two word embedding layers and an LSTM layer. It maps the index of the word in the dictionary into a semantically dense word embedding space and stores the word context information in the LSTM layer. The vision component contains a 16-layer deep convolutional neural network (CNN [simonyan2014very]) pre-trained on the ImageNet classification task [ILSVRCarxiv14]. We remove the final SoftMax layer of the deep CNN and connect the top fully connected layer (a 4096-dimensional layer) to our model. The activation of this 4096-dimensional layer can be treated as image features that contain rich visual attributes for objects and scenes. The multimodal component contains a one-layer representation where the information from the language part and the vision part merge together. We build a SoftMax layer after the multimodal layer to predict the index of the next word. The weights are shared across the sub-models of the words in a sentence. As in the m-RNN model [mao2014deep], we add a start sign and an end sign to each training sentence. In the testing stage for image captioning, we input the start sign into the model and pick the best words with maximum probabilities according to the SoftMax layer. We repeat the process until the model generates the end sign.
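The test-time generation procedure above can be sketched as a greedy decoding loop. This is a minimal illustration, not the authors' implementation: `next_word_distribution` is a hypothetical stand-in for the full CNN-LSTM network and is assumed to return a word-to-probability mapping.

```python
def generate_caption(next_word_distribution, image, start, end, max_len=20):
    """Greedy decoding: repeatedly feed the sentence so far (plus the image)
    and pick the highest-probability next word until the end sign appears."""
    words = [start]
    for _ in range(max_len):
        probs = next_word_distribution(image, words)  # dict: word -> probability
        best = max(probs, key=probs.get)
        words.append(best)
        if best == end:
            break
    # Strip the start sign (and the end sign, if generated) before returning.
    return words[1:-1] if words[-1] == end else words[1:]

# Toy stand-in "model" that emits a scripted sequence of words.
script = iter(["a", "cat", "<END>"])
toy = lambda image, words: {next(script): 1.0}
assert generate_caption(toy, image=None, start="<START>", end="<END>") == ["a", "cat"]
```

In the full model a beam keeping several high-probability candidates is a common refinement, but the greedy loop suffices to show the control flow.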
3.2 The Transposed Weight Sharing (TWS)
For the original m-RNN model [mao2014deep], most of the weights (i.e. 98.49%) are contained in the following two weight matrices: $U_D \in \mathbb{R}^{512 \times N}$ and $U_M \in \mathbb{R}^{N \times 1024}$, where $N$ represents the size of the word dictionary.
The weight matrix $U_D$ between the one-hot layer and the first word embedding layer is used to compute the input of the first word embedding layer $\mathbf{w}(t)$:

$$\mathbf{w}(t) = f\big(U_D\,\mathbf{w}_{oh}(t)\big) \qquad (1)$$

where $f(\cdot)$ is an element-wise non-linear function and $\mathbf{w}_{oh}(t)$ is the one-hot vector of the current word. Note that it is fast to calculate Equation 1 because there is only one non-zero element in $\mathbf{w}_{oh}(t)$. In practice, we do not need to calculate the full matrix multiplication, since only one column of $U_D$ is used for each word in the forward and backward propagation.
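The column-selection shortcut can be made concrete with a small numpy sketch. The names and sizes here (a 512-dimensional embedding, a 1,000-word dictionary) are illustrative assumptions:

```python
import numpy as np

def embed_word(U_D, word_index):
    """Look up a word embedding: because the one-hot input has a single
    non-zero entry, U_D @ one_hot reduces to selecting one column of U_D."""
    return U_D[:, word_index]

# Equivalence check against the full matrix multiplication.
rng = np.random.default_rng(0)
U_D = rng.standard_normal((512, 1000))  # embedding dim x dictionary size (assumed)
one_hot = np.zeros(1000)
one_hot[42] = 1.0
assert np.allclose(U_D @ one_hot, embed_word(U_D, 42))
```

The same observation applies in the backward pass: the gradient of the loss with respect to $U_D$ is non-zero only in that one column.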
The weight matrix $U_M$ between the multimodal layer and the SoftMax layer is used to compute the activation of the SoftMax layer $\mathbf{y}(t)$:

$$\mathbf{y}(t) = g\big(U_M\,\mathbf{m}(t) + \mathbf{b}\big) \qquad (2)$$

where $\mathbf{m}(t)$ is the activation of the multimodal layer, $\mathbf{b}$ is a bias term, and $g(\cdot)$ is the SoftMax non-linear function.
Intuitively, the role of the weight matrix $U_D$ in Equation 1 is to encode the one-hot vector $\mathbf{w}_{oh}(t)$ into a dense semantic vector $\mathbf{w}(t)$. The role of the weight matrix $U_M$ in Equation 2 is to decode a dense semantic vector back to a pseudo one-hot vector with the help of the SoftMax function, which is very similar to the inverse operation of Equation 1. The difference is that $\mathbf{m}(t)$ is in the dense multimodal semantic space while $\mathbf{w}(t)$ is in the dense word semantic space.
To reduce the number of parameters, we decompose $U_M$ into two parts. The first part maps the multimodal layer activation vector $\mathbf{m}(t)$ to an intermediate vector $\mathbf{x}(t)$ in the word semantic space. The second part maps the intermediate vector to the pseudo one-hot word vector, which is the inverse operation of Equation 1. The matrix of the second part can share parameters with $U_D$ in a transposed manner, which is motivated by the tied-weights strategy in auto-encoders for unsupervised learning tasks [bengio2009learning]. Here is an example of a linear decomposition: $U_M = U_D^T U_I$, where $U_I \in \mathbb{R}^{512 \times 1024}$. Equation 2 is accordingly changed to:

$$\mathbf{y}(t) = g\big(U_D^T\,f'(U_I\,\mathbf{m}(t)) + \mathbf{b}\big) \qquad (3)$$

where $f'(\cdot)$ is an element-wise function. If $f'(\cdot)$ is the identity mapping, this is equivalent to linearly decomposing $U_M$ into $U_D^T$ and $U_I$. In our experiments, we find that setting $f'(\cdot)$ to the scaled hyperbolic tangent function leads to slightly better performance than the linear decomposition. This strategy can be viewed as adding an intermediate layer with dimension 512 between the multimodal and SoftMax layers, as shown in Figure 2(b). The weight matrix between the intermediate layer and the SoftMax layer is shared with $U_D$ in a transposed manner. This Transposed Weight Sharing (TWS) strategy enables us to use a much larger dimension for the word-embedding layer than the m-RNN model [mao2014deep] without increasing the number of parameters. We also benefit from this strategy when addressing the novel concept learning task.
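A minimal numpy sketch of the TWS decoder makes the parameter saving concrete. The dictionary size, the choice of LeCun's scaled tanh ($1.7159\tanh(2x/3)$) for the element-wise function, and all variable names are illustrative assumptions:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def decode_tws(m, U_D, U_I, b):
    """TWS decoder: map multimodal activation m to word probabilities.
    The intermediate layer applies an element-wise scaled tanh; the final
    projection reuses the embedding matrix U_D in transposed form."""
    x = 1.7159 * np.tanh((2.0 / 3.0) * (U_I @ m))  # f': scaled tanh (assumed form)
    return softmax(U_D.T @ x + b)

rng = np.random.default_rng(1)
N, D_emb, D_mm = 10_000, 512, 1024          # dictionary, embedding, multimodal dims
U_D = rng.standard_normal((D_emb, N)) * 0.01
U_I = rng.standard_normal((D_emb, D_mm)) * 0.01
b = np.zeros(N)
p = decode_tws(rng.standard_normal(D_mm), U_D, U_I, b)
assert p.shape == (N,) and np.isclose(p.sum(), 1.0)

# Parameter comparison: a separate U_M would introduce N * D_mm new weights,
# whereas TWS adds only U_I (whose size is independent of the dictionary).
assert U_I.size < N * D_mm
```

Because $U_I$ has no dictionary-sized dimension, the decoder's parameter count no longer grows with $N$ beyond the shared matrix $U_D$.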
4 The Novel Concept Learning (NVCS) Task
Suppose we have trained a model on a large amount of images and sentences, and we then encounter images of novel concepts whose sentence annotations contain words not in our dictionary. What should we do? It is time-consuming and unnecessary to retrain the whole model from scratch using all the data; in many cases, we cannot even access the original training data of the model. But fine-tuning the whole model using only the new data causes severe overfitting on the new concepts and decreases the performance of the model on the originally trained ones.
To solve these problems, we propose the following strategies that learn the new concepts with a few images without losing the accuracy on the original concepts.
4.1 Fixing the originally learned weights
Under the assumption that we have learned the weights of the original words from a large amount of data and that the amount of data for the new concepts is relatively small, it is straightforward to fix the originally learned weights of the model during the incremental training. More specifically, the weight matrix $U_D$ can be separated into two parts: $U_D = [U_D^o, U_D^n]$, where $U_D^o$ and $U_D^n$ associate with the original words and the new words respectively. E.g., as shown in Figure 3, for the novel visual concept “cat”, $U_D^n$ is associated with 29 new words, such as cat, kitten and pawing. We fix the sub-matrix $U_D^o$ and update the sub-matrix $U_D^n$ as illustrated in Figure 3.
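One simple way to realize this weight fixing in a gradient update is to apply the step only to the new-word columns. This is a hypothetical numpy sketch, not the authors' training code; the column layout (new words appended last) and all names are assumptions:

```python
import numpy as np

def masked_update(U_D, grad, new_word_cols, lr=0.01):
    """One SGD step that updates only the columns of U_D belonging to the
    newly added words (U_D^n) while leaving the original columns (U_D^o) fixed."""
    U_new = U_D.copy()
    U_new[:, new_word_cols] -= lr * grad[:, new_word_cols]
    return U_new

rng = np.random.default_rng(2)
U_D = rng.standard_normal((4, 6))       # toy: 4-dim embeddings, 6-word dictionary
grad = rng.standard_normal((4, 6))
new_cols = [4, 5]                       # columns for the novel-concept words
U_after = masked_update(U_D, grad, new_cols)
assert np.allclose(U_after[:, :4], U_D[:, :4])            # original words untouched
assert not np.allclose(U_after[:, new_cols], U_D[:, new_cols])
```

In a framework with automatic differentiation, the same effect is usually obtained by marking $U_D^o$ as non-trainable rather than masking gradients by hand.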
4.2 Fixing the baseline probability
In Equation 3, there is a bias term $\mathbf{b}$. Intuitively, each element in $\mathbf{b}$ represents the tendency of the model to output the corresponding word. We can think of this term as the baseline probability of each word. Similar to $U_D$, $\mathbf{b}$ can be separated into two parts: $\mathbf{b} = [\mathbf{b}_o; \mathbf{b}_n]$, where $\mathbf{b}_o$ and $\mathbf{b}_n$ associate with the original words and the new words respectively. If we only present the new data to the network, the estimation of $\mathbf{b}_n$ is unreliable: the network will tend to increase the values in $\mathbf{b}_n$, which causes overfitting to the new data.
The easiest way to solve this problem is to fix $\mathbf{b}_o$ during the training for novel concepts, but this is not enough. Because the average activation $\bar{\mathbf{x}}$ of the intermediate layer across all the training samples is not $\mathbf{0}$, the weight matrix $U_D^T$ plays a similar role to $\mathbf{b}$ in changing the baseline probability. To avoid this problem, we centralize the activation of the intermediate layer $\mathbf{x}(t)$ and turn the original bias term $\mathbf{b}$ into $\mathbf{b}'$ as follows:

$$\mathbf{y}(t) = g\big(U_D^T(\mathbf{x}(t) - \bar{\mathbf{x}}) + \mathbf{b}'\big), \qquad \mathbf{b}' = \mathbf{b} + U_D^T\bar{\mathbf{x}} \qquad (4)$$
After that, we set every element in $\mathbf{b}'_n$ to the average value of the elements in $\mathbf{b}'_o$ and fix $\mathbf{b}'$ when we train on the new images. We call this strategy Baseline Probability Fixation (BPF).
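The BPF initialization step can be sketched as follows. This assumes, for illustration, that the new-word bias entries are simply appended after the original ones:

```python
import numpy as np

def init_fixed_bias(b_orig, num_new_words):
    """Baseline Probability Fixation: extend the (centralized) bias b' so that
    each new word starts at the average original baseline. The whole returned
    vector would then be held fixed during novel-concept training."""
    b_new = np.full(num_new_words, b_orig.mean())
    return np.concatenate([b_orig, b_new])

b_orig = np.array([-1.0, 0.5, 0.5])          # toy original-word biases (mean 0.0)
b_full = init_fixed_bias(b_orig, 2)
assert b_full.shape == (5,)
assert np.allclose(b_full[3:], b_orig.mean())
```

Starting the new words at the average baseline, rather than letting training inflate their biases, is what keeps their output probabilities from being unreliably high on the small new dataset.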
In the experiments, we adopt a stochastic gradient descent algorithm with an initial learning rate of 0.01 and use AdaDelta [zeiler2012adadelta] as the adaptive learning rate algorithm for both the base model and the novel concept model.
4.3 The Role of Language and Vision
In the novel concept learning (NVCS) task, the sentences serve as a weak labeling of the image. The language part of the model (the word embedding layers and the LSTM layer) hypothesizes the basic properties (e.g. the parts of speech) of the new words and whether the new words are closely related to the content of the image. It also hypothesizes which words in the original dictionary are semantically and syntactically close to the new words. For example, suppose the model meets a new image with the sentence description “A woman is playing with a cat”. Also suppose there are images in the original data with sentence descriptions such as “A man is playing with a dog”. Then although the model has not seen the word “cat” before, it will hypothesize that the words “cat” and “dog” are close to each other.
The vision part is pre-trained on the ImageNet classification task [ILSVRCarxiv14] with 1.2 million images and 1,000 categories. It provides rich visual attributes of the objects and scenes that are useful not only for the 1,000 classification task itself, but also for other vision tasks [donahue2013decaf].
Combining cues from both language and vision, our model can effectively learn the new concepts using only a few examples as demonstrated in the experiments.
5.1 Strategies to Construct Datasets
We use the annotations and images from MS COCO [lin2014microsoft] to construct our Novel Concept (NC) learning datasets. The current release of COCO contains 82,783 training images and 40,504 validation images, with object instance annotations and five sentence descriptions for each image. To construct an NC dataset for a specific new concept (e.g. “cat”), we remove all images containing the object “cat” according to the object annotations. We also check whether any images are left whose sentence descriptions contain cat-related words. The remaining images are treated as the Base Set, on which we train, validate and test our base model. The removed images are used to construct the Novel Concept set (NC set), which is used to train, validate and test our model for the task of novel concept learning.
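The partitioning procedure can be sketched as a simple filter. This is an illustrative sketch, not the authors' script; the data layout (dicts with `objects` and `captions` fields) and the whitespace tokenization are assumptions:

```python
def split_base_and_nc(images, concept_words):
    """Partition a captioned dataset: images whose object annotations or
    caption words mention any concept word go to the NC set; all others
    form the Base Set."""
    concept_words = {w.lower() for w in concept_words}
    base, nc = [], []
    for img in images:
        in_objects = any(o.lower() in concept_words for o in img["objects"])
        in_captions = any(w.lower() in concept_words
                          for cap in img["captions"] for w in cap.split())
        (nc if in_objects or in_captions else base).append(img)
    return base, nc

# Toy example: "kitten" in a caption also routes an image to the NC set.
data = [
    {"objects": ["dog"], "captions": ["a dog runs"]},
    {"objects": ["cat"], "captions": ["a cat sits"]},
    {"objects": ["car"], "captions": ["a kitten near a car"]},
]
base, nc = split_base_and_nc(data, {"cat", "kitten"})
assert len(base) == 1 and len(nc) == 2
```

A real pipeline would also handle punctuation and morphological variants (e.g. “cats”) when matching concept-related words, which simple whitespace splitting misses.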
5.2 The Novel Visual Concepts Datasets
The NC-3 dataset contains 150 images in total: 120 for training and 30 for testing.