Meta-Learning via Feature-Label Memory Network

# Meta-Learning via Feature-Label Memory Network

Dawit Mureja    Hyunsin Park    Chang D. Yoo
Korea Advanced Institute of Science and Technology
School of Electrical Engineering
{dawitmureja,hs.park,cd_yoo}@kaist.ac.kr
###### Abstract

Deep learning typically requires training a high-capacity architecture on a large dataset. However, many important learning problems demand the ability to draw valid inferences from a small dataset, and such problems pose a particular challenge for deep learning. For this reason, research on “meta-learning” is being actively conducted. Recent work has suggested a Memory Augmented Neural Network (MANN) for meta-learning. MANN is an implementation of a Neural Turing Machine (NTM) with the ability to rapidly assimilate new data in its memory and use this data to make accurate predictions. In models such as MANN, the input data samples and their appropriate labels from the previous step are bound together in the same memory locations. This often leads to memory interference when performing a task, as these models have to retrieve a feature of an input from a certain memory location and read only the label information bound to that location. In this paper, we address this issue by presenting a more robust MANN. We revisit the idea of meta-learning and propose a new memory augmented neural network that explicitly splits the external memory into feature and label memories. The feature memory stores the features of input data samples and the label memory stores their labels. Hence, when predicting the label of a given input, the network uses the feature memory unit as a reference to extract the stored feature of the input and, based on that feature, retrieves the label information of the input from the label memory unit. For the network to function in this framework, we design a new memory-writing module that encodes label information into the label memory in accordance with the meta-learning task structure. We demonstrate that our memory-augmented network outperforms MANN by a large margin in supervised one-shot classification tasks using the Omniglot and MNIST datasets.

## 1 Introduction

Deep learning is heavily dependent on big data. Traditional gradient-based neural networks require extensive, iterative training on large datasets. In these models, learning occurs through continuous updates of the weight parameters to optimize a loss function. However, when there is only a little data to learn from, deep learning is prone to poor performance: the network cannot acquire enough knowledge about the specific task through weight updates alone, and hence fails to make accurate predictions when tested.

Previous works have approached the task of learning from few samples using different methods such as probabilistic models based on Bayesian learning (?), generative models using probability density functions (??), Siamese neural networks (?), and meta-learning based memory augmented models (??).

In this work, we revisit the problem of meta-learning using memory augmented neural networks. Meta-learning is a two-tiered learning framework in which an agent learns not only about a specific task, for instance image classification, but also about how the task structure varies across target domains (??). Neural architectures with an external memory, such as Neural Turing Machines (NTMs) (?) and memory networks (?), have demonstrated the ability to meta-learn.

Recent memory augmented neural networks for meta-learning such as MANN (?) use a plain memory matrix as an external memory. In these models, the input data samples and their appropriate labels from the previous step are bound together in the same memory locations. This often leads to memory interference when performing a task, as the models have to retrieve a feature of an input from a certain memory location and read only the label information bound to that location.

Our primary contribution in this work is designing a different version of NTM (?) by splitting the external memory into feature and label memories to avoid any catastrophic interference. The feature memory is used to store input data features and the label memory is used to encode the label information of the inputs. Therefore, during testing, ideal performance in our model requires using the feature memory as a reference to accurately retrieve the stored feature of an input image and effectively reading the corresponding label information from the label memory. In order to accomplish this, we designed a new memory writing module based on the meta-learning task structure that monitors the way in which information is written into the label memory.

## 2 Related Work

Our work is based on the recent work of Santoro et al. (2016). They approached the problem of one-shot learning through the notion of meta-learning and suggested a Memory Augmented Neural Network (MANN). MANN is an implementation of NTM (?) with the ability to rapidly assimilate new data and use this data to make accurate predictions after only a few samples.

In the previous implementation of NTM, memory was addressed both by content and by location. In their work, however, they presented a new memory access module called Least Recently Used Access (LRUA) (?). It is a pure content-based memory writer that writes memories either to the least recently used location or to the most recently used location of the memory. According to this module, new information is written into rarely used locations (preserving recently encoded information) or to the last used location (updating the memory with newer, and possibly more relevant, information).

In this work, we used a task structure similar to that of recent works (??). Because we implemented supervised learning, the model is tasked with inferring information from labelled training data. This usually involves presenting the label $y_t$ along with the input $x_t$ at time step $t$. In our work, however, the training data was presented in the following manner: $\mathcal{D} = \{(x_t, y_{t-1})\}$, where $\mathcal{D}$ is the dataset, $x_t$ is the input at time step $t$, and $y_{t-1}$ is the class label from the previous time step $t-1$. Therefore, the model sees the following input sequence: $(x_1, \text{null}), (x_2, y_1), (x_3, y_2), \ldots$ (Figure 1(a)).

Moreover, the label used for a particular class of input images in a certain episode is not necessarily the same as the label used for the same class of input images in another episode. Random shuffling of labels is used from episode to episode in order to prevent the model from slowly learning sample-class bindings in its weights. Instead, it learns to store input information into the feature memory and store the corresponding output information into the label memory, when presented at the next time step, after which sample-class bindings, between the input features in the feature memory and the class labels in the label memory, will be formed for later use (Figure 1(b)).
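The episode structure described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `make_episode`, the dictionary layout of `class_samples`, and the use of `None` for the missing first label are assumptions made for the example, not details given in the paper.

```python
import random


def make_episode(class_samples, num_classes, samples_per_class, rng):
    """Build one episode as time-offset (x_t, y_{t-1}) pairs plus targets y_t.

    class_samples: dict mapping a class id to a list of samples
    (a hypothetical data layout chosen for this sketch).
    """
    # Pick the classes for this episode and give them freshly shuffled labels,
    # so a class does not keep the same label from episode to episode.
    chosen = rng.sample(sorted(class_samples), num_classes)
    labels = list(range(num_classes))
    rng.shuffle(labels)
    label_of = dict(zip(chosen, labels))

    # Draw samples for every chosen class and shuffle the presentation order.
    items = []
    for c in chosen:
        for x in rng.sample(class_samples[c], samples_per_class):
            items.append((x, label_of[c]))
    rng.shuffle(items)

    # Offset the labels by one step: at time t the model sees (x_t, y_{t-1}).
    inputs, targets = [], []
    prev_label = None  # stands in for the "null" label at the first step
    for x, y in items:
        inputs.append((x, prev_label))
        targets.append(y)
        prev_label = y
    return inputs, targets
```

Calling `make_episode(data, 5, 10, random.Random(0))` on a dictionary with at least five classes of at least ten samples each would produce a 50-step episode, matching the 5-class, 10-sample setting used in the experiments.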

## 4 Memory Augmented Model

The Neural Turing Machine (NTM) (?) is a memory augmented neural network with two main components: a controller and an external memory. It can be seen as a differentiable version of a Turing machine. The controller is a neural network that provides an internal representation of the input, which the read and write heads use to interact with the external memory. The controller can be either a feed-forward or a recurrent neural network.

In this work, we designed a memory augmented neural network, a different version of NTM, with its memory split into two partitions: feature memory ($M^f$) and label memory ($M^l$). The feature memory is used as a reference memory to retrieve the stored representation of an input. The label memory is used to read the output information of the input based on the information retrieved from the feature memory. In our model, we used Long Short Term Memory (LSTM) (?) as the controller due to its better performance compared to other controller models. Figure 2 shows the high-level diagram of our model.

The Feature-Label Memory Network (FLMN) has two memories and hence two write heads. The feature memory write head writes into the feature memory ($M^f$), and the label memory write head writes into the label memory ($M^l$). Even though information is encoded in both memories, output information is read only from the label memory using the label read head.

Our model works as follows. Given some input $x_t$ at time step $t$, the controller produces three interface vectors: $k_t$, $a^f_t$, and $a^l_t$. The key vector ($k_t$) is used to retrieve a particular memory, $i$, from a row of the feature memory; i.e. $M^f_t(i)$. The add vectors ($a^f_t$ and $a^l_t$) are used to modify the content of the feature memory ($M^f_t$) and the label memory ($M^l_t$), respectively.

### 4.1 Reading from the Label Memory

Before the output information of the input image $x_t$ is read from the label memory, the corresponding feature of $x_t$ is retrieved from the feature memory using a key $k_t$. When retrieving memory, each row of the feature memory ($M^f_t(i)$) is addressed using the cosine similarity measure,

$$K\left[k_t, M^f_t(i)\right] = \frac{k_t \cdot M^f_t(i)}{\lVert k_t \rVert \, \lVert M^f_t(i) \rVert} \tag{1}$$

This measure, $K\left[k_t, M^f_t(i)\right]$, is then used to produce a read-weight vector ($w^r_t$) whose elements are computed according to the following softmax:

$$w^r_t(i) \leftarrow \frac{\exp\left(K\left[k_t, M^f_t(i)\right]\right)}{\sum_j \exp\left(K\left[k_t, M^f_t(j)\right]\right)} \tag{2}$$

The read weights are then used to read from the label memory ($M^l_t$). The read memory, $r_t$, is computed as follows,

$$r_t \leftarrow \sum_i w^r_t(i) \, M^l_t(i) \tag{3}$$
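As a rough illustration of the read path in Eqs. (1)-(3), the following pure-Python sketch addresses the feature memory by cosine similarity and then reads from the label memory with the resulting weights. The function names and the small constant `eps` (added to avoid division by zero for empty memory rows) are assumptions of this sketch and do not appear in the paper.

```python
import math


def cosine_similarity(u, v, eps=1e-8):
    """Eq. (1); eps is an assumed guard against all-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def read_label_memory(key, feature_memory, label_memory):
    """Address the feature memory by content, then read the label memory."""
    # Eq. (1): similarity between the key and every feature-memory row.
    sims = [cosine_similarity(key, row) for row in feature_memory]
    # Eq. (2): read weights as a softmax over the similarities.
    read_weights = softmax(sims)
    # Eq. (3): weighted sum of the label-memory rows.
    width = len(label_memory[0])
    read_vector = [
        sum(w * row[j] for w, row in zip(read_weights, label_memory))
        for j in range(width)
    ]
    return read_weights, read_vector
```

Note that the similarity is computed only against the feature memory, while the returned vector is built only from the label memory; the two memories are tied together solely through the shared row index.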

### 4.2 Writing into the Feature Memory

To write into the feature memory, we implemented the LRUA module (?) with slight modifications. According to this module, new information is written either into rarely used locations or to the last used location. The distinction between these two options is accomplished by an interpolation using the usage weight vector $w^u_t$.

The usage weight vector at a given time step is computed by decaying the previous usage weights and adding the current feature memory write weights and read weights as follows,

$$w^u_t \leftarrow \gamma \, w^u_{t-1} + w^{wf}_t + w^r_t \tag{4}$$

where $\gamma$ is a decay parameter.

To access the least-used location of the feature memory, the least-used weight vector $w^{lu}_t$ is defined from the usage weight vector $w^u_t$,

$$w^{lu}_t(i) = \begin{cases} 1 & \text{if } w^u_t(i) = \min(w^u_t) \\ 0 & \text{otherwise} \end{cases} \tag{5}$$

The write weights for the feature memory ($w^{wf}_t$) are then obtained by using a learnable sigmoid gate parameter to compute a convex combination of the previous read weights and the previous least-used weights,

$$w^{wf}_t \leftarrow \sigma(\alpha) \, w^r_{t-1} + \left(1 - \sigma(\alpha)\right) w^{lu}_{t-1} \tag{6}$$

where $\sigma(\cdot)$ is the sigmoid function and $\alpha$ is a scalar gate parameter used to interpolate between the weights.

Therefore, new content is written either to the previously used memory (if $\sigma(\alpha)$ is 1) or to the least-used memory (if $\sigma(\alpha)$ is 0). Before writing into the feature memory, the least-used location of the memory is cleared. This can be done via element-wise multiplication using the least-used weights from the previous time step:

$$M^f_t(i) \leftarrow M^f_{t-1}(i) \cdot \left(1 - w^{lu}_{t-1}(i)\right), \quad \forall i \tag{7}$$

Writing into memory then occurs in accordance with the computed weight vectors using the feature add vector ($a^f_t$) as follows,

$$M^f_t(i) \leftarrow M^f_t(i) + w^{wf}_t(i) \, a^f_t, \quad \forall i \tag{8}$$
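The write procedure of Eqs. (4)-(8) can be sketched as below. This is a minimal pure-Python illustration rather than the implementation used in the experiments: memories are plain lists of rows, the gate parameter `alpha` is passed in as a fixed number instead of being learned, and all function names are hypothetical.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def least_used_weights(usage):
    """Eq. (5): 1 at every location with minimal usage, 0 elsewhere."""
    lowest = min(usage)
    return [1.0 if u == lowest else 0.0 for u in usage]


def write_feature_memory(memory, add_vector, prev_read_w, prev_least_used_w, alpha):
    """Clear the least-used rows, then add the new content (Eqs. 6-8)."""
    gate = sigmoid(alpha)
    # Eq. (6): convex combination of previous read and least-used weights.
    write_w = [
        gate * r + (1.0 - gate) * lu
        for r, lu in zip(prev_read_w, prev_least_used_w)
    ]
    new_memory = []
    for row, lu, w in zip(memory, prev_least_used_w, write_w):
        cleared = [m * (1.0 - lu) for m in row]                       # Eq. (7)
        new_memory.append([m + w * a for m, a in zip(cleared, add_vector)])  # Eq. (8)
    return new_memory, write_w


def update_usage(prev_usage, write_w, read_w, gamma):
    """Eq. (4): decay the old usage and add the current write and read weights."""
    return [gamma * u + w + r for u, w, r in zip(prev_usage, write_w, read_w)]
```

One consequence of following Eq. (5) literally is that ties in usage mark several locations as least used at once, so all of them are cleared by Eq. (7); an implementation could break ties differently, but the sketch keeps the equation as written.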

### 4.3 Writing into the Label Memory

According to (3), the read memory $r_t$ is retrieved from the label memory $M^l_t$ using the read weights $w^r_t$, whose elements are computed using (2), which involves the feature memory $M^f_t$. Hence, the label memory should be written in a similar manner to the feature memory so that, when an input image $x_t$ is provided to the network at time step $t$, the network retrieves the stored feature of the input from $M^f_t$ and, based on that feature, extracts the label of the input image from $M^l_t$.

To accomplish this, we designed a new memory writing module for the label memory. The new module is based on the task setup in which the model was trained. As mentioned earlier, during training the model sees the following input sequence: $(x_1, \text{null}), (x_2, y_1), (x_3, y_2), \ldots$ The label $y_{t-1}$ at time step $t$ is the appropriate label for the input $x_{t-1}$, which was presented along with the label $y_{t-2}$ at time step $t-1$. Based on this observation, we designed a recursive memory writing module.

According to this module, the label memory write-weight vector $w^{wl}_t$ at time step $t$ is computed from the previous feature memory write-weight vector $w^{wf}_{t-1}$ in a recursive manner as follows,

$$w^{wl}_t(i) \leftarrow w^{wf}_{t-1}(i) \tag{9}$$

The label memory ($M^l_t$) is then written according to the write weights using the label add vector $a^l_t$,

$$M^l_t(i) \leftarrow M^l_{t-1}(i) + w^{wl}_t(i) \, a^l_t, \quad \forall i \tag{10}$$

This memory is then read as shown in (3) to give a read memory, $r_t$, which is used by the controller as an input to a softmax classifier and as an additional input for the next controller state.

Based on this module, the label $y_{t-1}$ at time step $t$ is written into the label memory in the same manner as the input $x_{t-1}$ (from the previous time step $t-1$) was written into the feature memory. This enables the model to accurately retrieve input information from the feature memory and use this feature to effectively read the corresponding output information from the label memory without interference.
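The label-memory write of Eqs. (9) and (10) is small enough to sketch directly. The snippet below is an illustrative pure-Python version with a hypothetical function name; it assumes the caller keeps the feature-memory write weights from the previous time step and passes them in.

```python
def write_label_memory(label_memory, label_add_vector, prev_feature_write_w):
    """Write the label with the weights used for the previous feature write."""
    # Eq. (9): reuse the feature-memory write weights from time step t-1.
    label_write_w = list(prev_feature_write_w)
    # Eq. (10): add the label content to every row, scaled by its write weight.
    new_memory = [
        [m + w * a for m, a in zip(row, label_add_vector)]
        for row, w in zip(label_memory, label_write_w)
    ]
    return new_memory, label_write_w
```

For example, if the feature of $x_{t-1}$ was written entirely into row 1 of the feature memory, passing the weights `[0.0, 1.0, 0.0]` at the next step places the label content in row 1 of the label memory, so a later read that matches row 1 of the feature memory returns that label.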

## 5 Experimental Results

We tested our model in one-shot image classification tasks using the Omniglot and miniMNIST datasets. The Omniglot dataset consists of 1623 characters from 50 different alphabets, with 20 samples per class (character). The dataset is also called the transpose of MNIST because it contains a large number of classes with relatively few samples per class. This makes it well suited to one-shot learning.

### 5.1 Experiment Setup

In this work, we implemented both our model and MANN (?) and compared their performance in supervised one-shot classification tasks. However, the experimental settings we used for implementing MANN are slightly different from the implementation of MANN in the original paper (?).

In the original paper, four reads from memory were used. Data augmentation was performed by randomly translating and rotating character images. New classes were also created through $90^\circ$, $180^\circ$, and $270^\circ$ rotations of existing data. A minibatch size of 16 was used.

In our case, one read from memory was used. To make a fair comparison, we tried to balance the memory of the two models. We used an x memory matrix for MANN, where is the number of memory locations, and is the size at each location. For our model, we split the memory into two and we used x memory matrix for each memory. Using these settings, we performed three types of experiments.

### 5.2 Experiment: Type I

In the first experiment, the original Omniglot dataset was used without any data modification. Of the 1623 available classes, 1209 were used for training and the remaining 414 were used for testing the models. These two sets are disjoint, so after training both models were tested on never-seen Omniglot classes. For computational simplicity, image sizes were downscaled to . One-hot vector representations were used for class labels, and training was run for 100,000 episodes. Several experiments were performed with different numbers of classes (and different numbers of samples per class) in an episode. Figure 3 shows the training accuracy of the models for 5 classes and 10 samples per class in an episode.

As we can see from Figure 3, our model has outperformed MANN in making accurate predictions. The instance accuracy of our model has reached nearly accuracy within the first 20,000 episodes of training, while the instance accuracy of MANN could only reach accuracy.

### 5.3 Experiment: Type II

In our second experiment, we performed data augmentation without creating new classes. The dataset was augmented by rotating and translating random character images of an episode. The angle for rotation was chosen randomly from a uniform distribution with a size of an episode. This was accompanied by a translation in the x and y dimensions with values uniformly sampled between -10 and 10 pixels. Images were then downscaled to

In a similar manner as the previous experiment, 1209 classes (plus augmentations) were used for training and 414 classes (plus augmentations) were used for testing. Figure 4 shows the training accuracy of MANN and our model for 100,000 episodes.

Our model not only made more accurate predictions but also learned faster than MANN. This can be seen by plotting the training loss for the two experiments (Figure 5).

In both types of experiments, training was stopped after 100,000 episodes. Without any further training, the models were tested on never-seen Omniglot classes from the testing set. The testing results are summarized in Table 1. We borrowed the test result of MANN from Santoro et al. (2016) for reference.

As the table shows, our model demonstrated higher classification accuracy than MANN in both experiments. FLMN reached an accuracy of 85.6% (Experiment I) and 86.5% (Experiment II) on just the second presentation of an input sample from a class within an episode, reaching up to 94.1% and 94.4% accuracy by the instance, respectively. On the other hand, MANN achieved an accuracy of 66.7% (Experiment I) and 65.5% (Experiment II) in the instance, reaching up to 78.1% and 77.2% accuracy by the instance, respectively.
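The per-instance accuracy reported above (accuracy on the first, second, and later presentations of a class within an episode) can be computed as in the following sketch. The function name and the list-based inputs are assumptions made for illustration; the paper does not describe its evaluation code.

```python
from collections import defaultdict


def accuracy_by_instance(targets, predictions):
    """Accuracy on the k-th presentation of each class within one episode."""
    seen = defaultdict(int)     # presentations of each label so far
    correct = defaultdict(int)  # correct predictions per instance index k
    total = defaultdict(int)    # predictions made per instance index k
    for y, y_hat in zip(targets, predictions):
        seen[y] += 1
        k = seen[y]
        total[k] += 1
        correct[k] += int(y == y_hat)
    return {k: correct[k] / total[k] for k in sorted(total)}
```

In practice these per-episode values would be averaged over many test episodes before being reported.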

### 5.4 Zero-shot learning

In this experiment, the models were tasked with MNIST classification after being trained on the Omniglot dataset. We used 1209 classes of the Omniglot dataset for training. For testing, we prepared a miniMNIST dataset, which contains only 20 image samples per class, randomly selected from the original MNIST dataset. The images were downscaled to . After 100,000 episodes of training, the models were tested on never-seen MNIST classes. Testing results are summarized in the following table.

As Table 2 shows, FLMN was able to achieve accuracy on the instance in classifying never-before-seen images from the miniMNIST dataset after being trained on the Omniglot dataset.

## 6 Conclusion

In this paper, we implemented a meta-learning framework and proposed the Feature-Label Memory Network (FLMN). The novelty of our model is that it stores input data samples and their matching labels in separate memories, preventing memory interference. We also introduced a new memory writing method associated with the task structure of meta-learning. We have shown that our model outperforms MANN in supervised one-shot classification tasks using the Omniglot and miniMNIST datasets. Future work includes testing our model on more complex datasets and evaluating its performance on other tasks.

## References

• [Christophe, Ricardo, and Pavel] Christophe, G.-C.; Ricardo, V.; and Pavel, B. 2004. Introduction to the special issue on meta-learning. Machine learning 54(3):187–193.
• [Fei-Fei, Fergus, and Perona] Fei-Fei, L.; Fergus, R.; and Perona, P. 2006. One-shot learning of object categories. IEEE transactions on pattern analysis and machine intelligence 28(4):594–611.
• [Graves, Wayne, and Danihelka] Graves, A.; Wayne, G.; and Danihelka, I. 2014. Neural turing machines. arXiv preprint arXiv:1410.5401.
• [Hochreiter and Schmidhuber] Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural Computation 9(8):1735–1780.
• [Koch] Koch, G. 2015. Siamese neural networks for one-shot image recognition. PhD thesis, University of Toronto.
• [Lake et al.] Lake, B. M.; Salakhutdinov, R.; Gross, J.; and Tenenbaum, J. B. 2011. One shot learning of simple visual concepts. In Proceedings of the Annual Conference of the Cognitive Science Society 72:2.
• [Rezende et al.] Rezende, D. J.; Mohamed, S.; Danihelka, I.; Gregor, K.; and Wierstra, D. 2016. One-shot generalization in deep generative models. In Proceedings of the International Conference on Machine Learning, JMLR:W&CP 48.
• [Salakhutdinov, Tenenbaum, and Torralba] Salakhutdinov, R.; Tenenbaum, J.; and Torralba, A. 2012. One-shot learning with a hierarchical nonparametric bayesian model. Proceedings of ICML Workshop on Unsupervised and Transfer Learning, PMLR 27:195–206.
• [Santoro et al.] Santoro, A.; Bartunov, S.; Botvinick, M.; Wierstra, D.; and Lillicrap, T. 2016. Meta-learning with memory-augmented neural networks. In Proceedings of The 33rd International Conference on Machine Learning, 1842–1850.
• [Vinyals et al.] Vinyals, O.; Blundell, C.; Lillicrap, T.; Kavukcuoglu, K.; and Wierstra, D. 2016. Matching networks for one shot learning. In Advances in Neural Information Processing Systems 3630–3638.
• [Weston, Chopra, and Bordes] Weston, J.; Chopra, S.; and Bordes, A. 2014. Memory networks. arXiv preprint arXiv:1410.3916.
• [Woodward and Finn] Woodward, M., and Finn, C. 2017. Active one-shot learning. arXiv preprint arXiv:1702.06559.