Chameleon: Learning Model Initializations Across Tasks With Different Schemas

Lukas Brinkmeyer, Rafael Rego Drumond*,
Randolf Scholz, Josif Grabocka, Lars Schmidt-Thieme
Information Systems and Machine-Learning Lab - University of Hildesheim
Samelsonplatz 1
Hildesheim - 31141 - Germany
{brinkmeyer,radrumond,scholz,josif,schmidt-thieme}@ismll.uni-hildesheim.de
*Equal Contribution
Abstract

Parametric models, and particularly neural networks, require weight initialization as a starting point for gradient-based optimization. In most current practices, this is accomplished by using some form of random initialization. Instead, recent work shows that a specific initial parameter set can be learned from a population of tasks, i.e., dataset and target variable for supervised learning tasks. Using this initial parameter set leads to faster convergence for new tasks (model-agnostic meta-learning). Currently, methods for learning model initializations are limited to a population of tasks sharing the same schema, i.e., the same number, order, type and semantics of predictor and target variables.

In this paper, we address the problem of meta-learning parameter initialization across tasks with different schemas, i.e., when the number of predictors varies across tasks while the tasks still share some variables. We propose Chameleon, a model that learns to align different predictor schemas to a common representation, using permutations and masks of the predictors of the training tasks at hand. In experiments on real-life data sets, we show that Chameleon can successfully learn parameter initializations across tasks with different schemas, providing an average accuracy lift of 26% over random initialization and of 5% over a state-of-the-art method for learning model initializations with a fixed schema. To the best of our knowledge, this is the first work on learning model initializations across tasks with different schemas.

Introduction

Figure 1: The main idea of Chameleon is to shift feature vectors of different tasks to a shared representation. The top part represents schemas from different tasks of the same domain; the bottom part represents the aligned features in a fixed schema.

Due to the rapid growth of Deep Learning [Goodfellow-et-al-2016], the field of Machine Learning has gained immense popularity among researchers and in industry. Even people without much knowledge of the area are starting to build their own models for research or business applications. However, this lack of expertise can become an obstacle, for instance when choosing adequate hyperparameters for the task at hand. Meta-learning [finn2018one] and hyperparameter optimization [wistuba2015learning, schilling2016scalable] have aided both experts and non-experts in finding good settings for model size, step length, regularization and parameter initialization [kotthoff2017auto].

Current approaches optimize virtually all of these hyperparameters through search methods, with the exception of the weight initialization. Instead, the weights are in almost all cases drawn from a suitable distribution [glorot2010understanding, He_2015_ICCV]. The reason is clear: even a simple binary classification model with one fully-connected hidden layer of 64 neurons, input size 2 and output size 1 already has at least 192 continuous weight parameters (2 × 64 weights in the hidden layer plus 64 × 1 in the output layer, not counting biases). This makes the hyperparameter space of the initial network weights, even for small networks, far too complex to search for a good combination in feasible time.

Recent approaches in meta-learning and multi-task learning have shown that it is possible to learn such a per-weight initialization across different tasks if they share the same schema. For that reason, all of these approaches rely on image tasks, which can easily be scaled to a fixed shape. We want to extend this work to classical vector data, e.g., learning an initialization from a heart disease data set which can then be used for a diabetes detection task that relies on similar features. In contrast to image data, there is no trivial way to map different feature vectors to a common representation. Simply padding the smaller task so that both data sets have the same number of features still does not guarantee that similar features contained in both tasks are aligned. Thus, we require an encoder that maps heart and diabetes data to one feature representation, which can then be used to find a general initialization via popular meta-learning algorithms like reptile [reptile2018].

We propose a set-wise feature transformation model called chameleon, named after the reptile capable of adjusting its colors to the environment it finds itself in. chameleon deals with different schemas by projecting them to a fixed input space while keeping features that are of the same type or distribution but come from different tasks in the same positions, as illustrated in Figure 1.

Our main contributions are as follows: (1) We show how current meta-learning approaches can work with tasks of different representation as long as they share some variables with respect to type or semantics. (2) We propose an architecture, chameleon, that can generalize a projection of any feature space to a shared predictor vector. (3) We demonstrate that we can improve state-of-the-art meta-learning methods by combining them with this component.

Related Work

Our goal is to extend recent meta-learning approaches that make use of multi-task and transfer learning methods by adding a feature alignment to cast different inputs to a fixed representation. In this section we will discuss various works related to our approach.

Research on Transfer Learning [sung2018comp, pan2010survey] has shown that training a model on different auxiliary tasks before actually fitting it to the target problem may provide better results. For instance, [zoph2018learning] pre-train blocks of convolutional neural networks on a small data set before training a joint model on a much larger data set. In contrast, using only a single image data set with different tasks, [ranjan2019hyperface] train a truncated version of AlexNet [krizhevsky2012imagenet] with different heads. Each head tackles one specific task such as detecting human faces, gender, position, landmarks and rotation of the image. Although they work on the same input data, each task has a specific loss that back-propagates to the initial layers of the network.

Similarly, the work of [kendall2018multi] trains an encoder with three different decoder heads, aiming to reconstruct the image pixel-wise in terms of semantic segmentation, object segmentation, and pixel depth. These papers show that different tasks, each with a distinct loss, can be used in parallel during the training or pre-training of a model to further improve performance on the next task. Moreover, the positive effect of transfer learning has also been successfully demonstrated across tasks from different domains [ganin2016domain, tzeng2015domain, tzeng2017domain].

Motivated by transfer learning, few-shot learning research aims to generalize a representation of a new class given only a few instances [duan2017one, finn2017one, snell2017prototypical]. Several meta-learning approaches tackle this problem by introducing architectures and parameterization techniques specifically suited for few-shot classification [munk2017meta, mishra2018snail]. Moreover, [finn2017model] showed that an adapted learning paradigm can be sufficient for learning across tasks. Other research directions, such as the work of [jomaa2019dataset2vec], use meta-learning to extract useful meta-features from data sets to improve hyperparameter optimization. The Model-Agnostic Meta-Learning (maml) method learns a model initialization by training an arbitrary model across different tasks. Instead of sequentially training the model one task at a time, it uses update steps from different tasks to find a common gradient direction that achieves fast convergence for any of the objectives. In other words, for each meta-learning update we need an initial value $\theta$ for the model parameters. We then sample tasks $T_i \sim p(T)$, and for each task we compute an updated version $\theta_i'$ of $\theta$ using examples from that task, performing gradient descent as in:

$$\theta_i' = \theta - \alpha \, \nabla_{\theta} \, \mathcal{L}_{T_i}\big(f_{\theta}\big) \qquad (1)$$

The final update of $\theta$ is then:

$$\theta \leftarrow \theta - \beta \, \nabla_{\theta} \sum_{T_i \sim p(T)} \mathcal{L}_{T_i}\big(f_{\theta_i'}\big) \qquad (2)$$

Finn et al. state that maml neither requires learning an update rule [ravi2016optimization] nor restricting the model architecture [santoro2016meta]. They extended their approach with a probabilistic component such that, for a new task, the model is sampled from a distribution of models to guarantee a higher model diversification for ambiguous tasks [finn2018one]. However, maml requires computing second-order derivatives and is therefore computationally heavy. reptile [reptile2018] simplifies maml by numerically approximating Equation (2), replacing the second derivative with the difference between the updated and the initial weights:

$$\theta \leftarrow \theta + \epsilon \, \frac{1}{n} \sum_{i=1}^{n} \big(\theta_i' - \theta\big) \qquad (3)$$

That is, the difference between the updated and the initial weights serves as an approximation of the second-order term, reducing the computational cost. The serial version is presented in Algorithm 1.
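To make the serial procedure concrete, the following is a minimal numpy sketch of one reptile meta-update as in Equation (3); the quadratic toy tasks, the step sizes and the function names are illustrative assumptions, not the setup used in our experiments.

```python
import numpy as np

def reptile_step(theta, task_grad, inner_steps=10, alpha=0.001, epsilon=0.01):
    """One serial Reptile meta-update (Equation (3)): adapt a copy of theta to the
    sampled task with plain SGD, then move the initialization towards the result."""
    theta_task = theta.copy()
    for _ in range(inner_steps):
        theta_task -= alpha * task_grad(theta_task)   # inner update on the task loss
    return theta + epsilon * (theta_task - theta)     # approximate meta-gradient step

# toy usage: each task is a quadratic loss ||theta - c||^2 with its own center c
rng = np.random.default_rng(0)
theta = np.zeros(4)
for _ in range(100):
    c = rng.normal(size=4)                            # sample a task
    theta = reptile_step(theta, lambda t: 2.0 * (t - c))
```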

All of these approaches rely not only on a fixed feature space but also on an identical alignment of the features across all tasks. However, even similar data sets usually share only a subset of their features and often have a different order or representation. To make the feature space identical across all data sets, the features have to be aligned. The work of [jomaa2019dataset2vec] uses meta-learning to extract useful meta-features from data sets independent of their size to improve hyperparameter optimization. However, the data set features are compressed without aligning them, since the goal is to generate a fixed number of meta-features rather than an instance-wise transformation.

Different approaches in the literature deal with feature alignment in various fields. The work of [pan2010cross] describes a procedure for sentiment analysis which aligns words of the same sentiment across different domains. Recently, [zhang2018local] proposed an unsupervised framework called Local Deep-Feature Alignment, which computes a global alignment shared across all local neighborhoods by multiplying the local representations with a transformation matrix. None of these methods, however, finds a common alignment between the features of different data sets.

We propose a novel feature-alignment component named chameleon which enables state-of-the-art methods to work on top of tasks whose predictor vectors differ not only in their length but also in their concrete alignment.

Methodology

Methods like reptile and maml try to find the best initialization of a specific model for a set of similar tasks. Typically, these approaches select a simple two-layer feed-forward neural network as their base model, in this work referred to as $\hat{y}$. A task $t$ consists of predictor data $X_t$, a target $Y_t$, a predefined training/test split and a loss function $\mathcal{L}_t$. However, every task has to share the same predictor space of common size $K$, in which similar features shared across tasks sit in the same positions. An encoder is needed to map a data representation with feature length $F_t$ to a shared latent representation with fixed feature length $K$:

$$\text{enc}: \mathbb{R}^{N \times F} \rightarrow \mathbb{R}^{N \times K} \qquad (4)$$

where $N$ is the number of instances in $X_t$, $F$ is the number of features of task $t$, and $K$ is the size of the desired feature space. By combining this encoder with a model that works on a fixed input size and outputs the predicted target, e.g. for binary classification, it is possible to apply the reptile algorithm to learn an initialization across tasks with different predictor vectors. The optimization objective then becomes the meta loss of the combined network $\hat{y} \circ \text{enc}$ over a set of tasks $\mathcal{T}$:

$$\min_{\theta} \; \mathbb{E}_{t \sim \mathcal{T}} \Big[ \mathcal{L}_t\Big( \big(\hat{y} \circ \text{enc}\big)_{\theta_t^{(m)}}(X_t),\, Y_t \Big) \Big] \qquad (5)$$

where $\theta$ is the concatenation of the initial weights of the encoder and of the model $\hat{y}$, and $\theta_t^{(m)}$ are the updated weights after applying the learning procedure for $m$ iterations on task $t$, as defined in Algorithm 1 for the inner updates of reptile.

Figure 2: The Chameleon architecture: $N$ is the number of samples in $X_t$, $F$ is the feature length of $X_t$, and $K$ is the size of the desired feature space. "Conv" denotes a convolution operation with the given number of input channels, output channels and kernel length. "Dense Layer" is a fully connected layer with the given number of neurons that transforms each of the $F$ inputs to a feature vector of that dimensionality.

It is important to mention that learning one weight parameterization across an arbitrary heterogeneous set of tasks is extremely difficult, since it is most likely impossible to find one initialization for two tasks with a vastly different number and type of features. By contrast, if two tasks are of similar length and share similar features, one can align the similar features in a shared representation (schema) so that a model can directly learn across the different tasks by transforming the predictors, as illustrated in Figure 1.

Chameleon

Consider a set of tasks for which a reordering matrix $\Pi_t$ exists that transforms the predictor data $X_t$ of each task $t$ into $\tilde{X}_t$, such that all $\tilde{X}_t$ share the same schema:

$$\tilde{X}_t = X_t \cdot \Pi_t, \qquad X_t \in \mathbb{R}^{N \times F_t},\; \Pi_t \in [0,1]^{F_t \times K},\; \tilde{X}_t \in \mathbb{R}^{N \times K} \qquad (6)$$

Every $x_{n,f}$ represents feature $f$ of sample $n$. Every $\pi_{f,k}$ represents how much of feature $f$ (from the samples in $X_t$) is shifted to position $k$ of the adapted input $\tilde{X}_t$. Finally, every $\tilde{x}_{n,k}$ represents the new feature $k$ of sample $n$ in $\tilde{X}_t$ with the adapted shape and size. This can also be expressed as $\tilde{X}_t = X_t \cdot \Pi_t$. In order to obtain the same $\tilde{X}_t$ when permuting two features of a task $X_t$, we simply have to permute the corresponding rows of $\Pi_t$.

For example, consider that task $t_1$ has the features [apples, bananas, melons] and task $t_2$ the features [lemons, bananas, apples]. Both can be transformed to the same representation [apples, lemons, bananas, melons] by reordering them and filling missing features with zeros. This transformation must yield the same result for $t_1$ and $t_2$ independently of their feature order. In a real-life scenario, features might come with different names, or their similarity might not be obvious to the human eye.
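As a worked instance of Equation (6), the following numpy sketch encodes the fruit example above; the feature values are made up, and each $\Pi$ here is a binary permutation-and-padding matrix whose rows simply get swapped when the task's columns are swapped.

```python
import numpy as np

# target schema: [apples, lemons, bananas, melons]  (K = 4)
# task 1 features: [apples, bananas, melons]; task 2 features: [lemons, bananas, apples]
X1 = np.array([[1., 2., 3.]])          # one sample per task, placeholder values
X2 = np.array([[4., 5., 6.]])

# Pi[f, k] = 1 if feature f of the task is moved to position k of the shared schema
Pi1 = np.array([[1, 0, 0, 0],          # apples  -> position 0
                [0, 0, 1, 0],          # bananas -> position 2
                [0, 0, 0, 1]])         # melons  -> position 3
Pi2 = np.array([[0, 1, 0, 0],          # lemons  -> position 1
                [0, 0, 1, 0],          # bananas -> position 2
                [1, 0, 0, 0]])         # apples  -> position 0

X1_tilde = X1 @ Pi1                    # [[1., 0., 2., 3.]]
X2_tilde = X2 @ Pi2                    # [[6., 4., 5., 0.]]
```

Both results share the same schema: apples land in column 0 and bananas in column 2, while positions of missing features stay zero.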

Our proposed component, denoted by $\phi$, takes the predictor data $X_t$ of a task and outputs the corresponding reordering matrix:

$$\phi: \mathbb{R}^{N \times F} \rightarrow \mathbb{R}^{F \times K}, \qquad \hat{\Pi}_t = \phi(X_t) \qquad (7)$$

The component $\phi$ is a neural network parameterized by weights $\vartheta$. It consists of two 1D convolutions and two dense layers, where the last one is the output layer that estimates the alignment matrix via a softmax activation. The input, transposed to size $[F \times N]$ (where $N$ is the number of samples), represents for each feature the values it takes across all samples. We apply a linear transformation using 1D convolutions with kernel length 1 and an output size of 32 channels, twice. The resulting output of size $[F \times 32]$ is followed by one regular linear transformation of size 64 and another one of size $K$, resulting in one vector per original feature that encodes its relation to each of the $K$ positions of the target space. Each of these vectors is passed through a softmax layer, computing the probability that a feature of $X_t$ is shifted to each position of $\tilde{X}_t$. This reduces the objective of the network to a simple classification of the new position of each feature. The overall architecture is shown in Figure 2. The predicted reordering can be used to encode the tasks to the shared representation as defined in Equation (6). The encoder needed for training reptile across tasks with different predictor vectors by optimizing Equation (5) is then given as:

$$\text{enc}(X_t) = X_t \cdot \phi(X_t) \qquad (8)$$
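The following is a minimal tf.keras sketch of this component, assuming a fixed number of samples $N$ per task; the layer sizes follow the text above, while the linear activations before the final softmax and the exact input handling are our own reading of the description rather than the authors' code.

```python
import tensorflow as tf

def build_chameleon(N, K):
    """phi: takes X transposed to shape [F, N] (one row per feature) and returns a
    soft reordering matrix of shape [F, K], i.e. a softmax over the K target positions."""
    inp = tf.keras.Input(shape=(None, N))                    # [F, N], F may vary per task
    h = tf.keras.layers.Conv1D(32, kernel_size=1)(inp)       # [F, 32]
    h = tf.keras.layers.Conv1D(32, kernel_size=1)(h)         # [F, 32]
    h = tf.keras.layers.Dense(64)(h)                         # [F, 64]
    pi = tf.keras.layers.Dense(K, activation="softmax")(h)   # [F, K], rows sum to 1
    return tf.keras.Model(inp, pi)

def encode(chameleon, X):
    """enc(X) = X . phi(X), Equation (8): project a task with F features onto K slots."""
    X = tf.cast(tf.convert_to_tensor(X), tf.float32)
    pi = chameleon(tf.transpose(X)[tf.newaxis])[0]           # [F, K]
    return tf.matmul(X, pi)                                  # [N, K]
```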

Reordering Training

Simply training the joint network as described above will not teach $\phi$ how to reorder the features to a shared representation. That is why a training phase with the explicit objective of reordering features is necessary. In order to do so, the model is optimized on a reordering objective using a meta data set. This meta data set contains similar tasks, meaning that for every task $t$ there exists a reordering matrix $\Pi_t$ that maps $X_t$ to the shared representation. If $\Pi_t$ is known beforehand, we can optimize chameleon by minimizing the expected reordering loss over the meta data set:

$$\vartheta = \arg\min_{\vartheta} \; \mathbb{E}_{t \sim \mathcal{T}} \Big[ \text{CE}\big(\Pi_t,\, \phi(X_t; \vartheta)\big) \Big] \qquad (9)$$

where $\text{CE}$ is the softmax cross-entropy loss, $\Pi_t$ is the ground truth (the one-hot encoding of the new position of each variable), and $\phi(X_t; \vartheta)$ is the prediction. This training procedure is shown in Algorithm 2. The trained chameleon model can then be used to compute the reordering matrix $\hat{\Pi}_t$ for any unseen task $t$.
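A sketch of the reordering objective in Equation (9) and of a pre-training loop in the spirit of Algorithm 2 is given below; `sample_subtask` is a placeholder for the sub-task sampling described further down, and the Adam learning rate mirrors the one reported in the experimental section.

```python
import tensorflow as tf

def reordering_loss(pi_true, pi_pred):
    """Equation (9): cross-entropy between the one-hot target positions (rows of Pi)
    and the predicted soft reordering matrix phi(X)."""
    return tf.reduce_mean(
        tf.keras.losses.categorical_crossentropy(pi_true, pi_pred))

def pretrain_chameleon(chameleon, sample_subtask, steps=1000, lr=1e-5):
    """Algorithm 2 (sketch): minimize the reordering loss over sampled sub-tasks."""
    opt = tf.keras.optimizers.Adam(lr)
    for _ in range(steps):
        X, pi_true = sample_subtask()                     # ground-truth Pi comes for free
        with tf.GradientTape() as tape:
            pi_pred = chameleon(tf.transpose(tf.cast(X, tf.float32))[tf.newaxis])[0]
            loss = reordering_loss(tf.cast(pi_true, tf.float32), pi_pred)
        grads = tape.gradient(loss, chameleon.trainable_variables)
        opt.apply_gradients(zip(grads, chameleon.trainable_variables))
```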

After this training procedure we can use the learned weights as initialization for $\phi$ before optimizing the joint model with reptile, without any further use of the ground-truth reordering matrices. Experiments show that this procedure improves our results significantly compared to only optimizing the joint meta loss.

Training the chameleon component to reorder similar tasks to a shared representation requires not only a meta data set, but one for which the true reordering matrix $\Pi_t$ is available for every task. In application, this means manually matching similar features of different training tasks so that novel tasks can afterwards be matched automatically. However, it is possible to generate a large number of tasks from a single data set by sampling smaller sub-tasks from it, selecting a random subset of features in arbitrary order for a random set of instances. In this case it is not necessary to manually match the features: the ground-truth reordering matrices of all these sub-tasks follow directly from the positions of the selected features in the original data set, differing only by the respective permutation of their rows as mentioned above.
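The sub-task sampling can be sketched as follows for a single data set whose columns define the shared schema; the number of instances per task and the feature-count range are illustrative choices.

```python
import numpy as np

def make_subtask_sampler(X, n_instances=30, seed=0):
    """X is the full data set with N samples and K features (the shared schema).
    Each sub-task keeps a random subset of features in random order; the ground-truth
    Pi places them back into their original columns, so no manual matching is needed."""
    rng = np.random.default_rng(seed)
    N, K = X.shape

    def sample_subtask():
        n_feat = rng.integers(2, K + 1)                        # how many features to keep
        positions = rng.choice(K, size=n_feat, replace=False)  # original column indices
        rows = rng.choice(N, size=n_instances, replace=False)
        X_sub = X[np.ix_(rows, positions)]                     # [n_instances, n_feat]
        pi = np.zeros((n_feat, K))
        pi[np.arange(n_feat), positions] = 1.0                 # one-hot target positions
        return X_sub, pi

    return sample_subtask
```

Note that `X_sub @ pi` reproduces the sampled instances in the shared schema, with zeros in the columns of the dropped features.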

Experiments

In this section, we describe our experimental setup and present the results of our approach on different data sets. For the sake of simplicity we restrict ourselves to binary classification tasks. As described in the previous section, our architecture consists of chameleon and a base model $\hat{y}$. In all of our experiments we compare the performance of four approaches by training on randomly sampled tasks: (i) the model $\hat{y}$ from a random initialization, (ii) the model $\hat{y}$ from the initialization obtained by reptile training on the meta training data padded to size $K$, (iii) the joint model $\hat{y} \circ \text{enc}$ from the initialization obtained by reptile training on the meta training data, and (iv) the joint model $\hat{y} \circ \text{enc}$ from the initialization obtained by training chameleon as defined in Equation (9) before applying reptile.

The number of gradient updates on a new task in the reptile algorithm is set to 10. Likewise, a task from the meta test data is evaluated by initializing with the current weights $\theta$, performing 10 updates on its training data and then scoring the respective test data. All experiments are conducted with the same model architecture. The base model is a feed-forward neural network with two dense hidden layers of 64 neurons each. chameleon is used as described in the previous section and combined with the base network. This architecture was chosen to match the one used in the original maml and reptile papers.
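For reference, here is a sketch of the base network and the meta-test protocol just described; the ReLU and sigmoid activations, the full-batch updates and the compile settings are assumptions, since the text only fixes the layer sizes, the inner learning rate and the number of updates.

```python
import tensorflow as tf

def build_base_model(K):
    """Base model: two dense hidden layers with 64 neurons each on the K aligned inputs."""
    inp = tf.keras.Input(shape=(K,))
    h = tf.keras.layers.Dense(64, activation="relu")(inp)
    h = tf.keras.layers.Dense(64, activation="relu")(h)
    out = tf.keras.layers.Dense(1, activation="sigmoid")(h)   # binary classification head
    return tf.keras.Model(inp, out)

def evaluate_task(model, theta, task, inner_updates=10, lr=1e-4):
    """Meta-test protocol: reset to the current initialization, take 10 updates on the
    task's training instances, then score its validation instances."""
    X_train, y_train, X_val, y_val = task
    model.set_weights(theta)
    model.compile(optimizer=tf.keras.optimizers.Adam(lr),
                  loss="binary_crossentropy", metrics=["accuracy"])
    model.fit(X_train, y_train, batch_size=len(X_train), epochs=inner_updates, verbose=0)
    return model.evaluate(X_val, y_val, verbose=0)            # [loss, accuracy]
```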

Data set     Samples   Features   Features in Train
Wine            6497         12                   8
Abalone         4177          8                   5
Telescope      19020         11                   8
Hearts           597         13                 13*
Diabetes         768          8                  0*

Table 1: List of data sets used in our experiments. *Used in conjunction: all Hearts features in training and all Diabetes features in test.

Meta Data sets

For experimental purposes, we typically use a single data set as meta data set by sampling both the training and the test tasks from it. This allows us to evaluate our method on different domains without manually matching training tasks, since the ground-truth reordering matrix $\Pi_t$ is naturally given. Novel features can also be introduced during testing by splitting not only the instances but also the features of a data set into a train and a test partition. Training tasks are then sampled by selecting a random subset of the training features in arbitrary order for a random set of instances, while stratified sampling guarantees that test tasks contain features from both the train and the test partition, with their instances drawn from the test set only. In order to demonstrate our method on two genuinely separate tasks, we additionally use one data set for meta-training and a different, related data set for meta-testing.

Data Sets.

We evaluate our approach on five UCI data sets [uciref] in four experimental setups. The characteristics of the individual data sets are described in Table 1. The Hearts and the Diabetes data sets are used in conjunction to evaluate the performance when testing on a similar but separate data set. We binarize both data sets from the severity level of the disease to "Disease" or "No Disease". For the Wine data set we use binary labels according to the color of the wine (red or white). For the Abalone data set, the binary labels indicate whether the number of rings of an abalone shell is smaller than 10 or greater than or equal to 10. The Telescope data set is already a binary classification task.
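As an illustration of this label construction (not the authors' preprocessing code), a short pandas sketch is shown below; the file paths and column names are hypothetical, except that the UCI wine-quality data indeed ships as separate semicolon-delimited red and white files.

```python
import pandas as pd

# Wine: label = color; the two UCI wine-quality files are concatenated
red = pd.read_csv("winequality-red.csv", sep=";").assign(label=1)
white = pd.read_csv("winequality-white.csv", sep=";").assign(label=0)
wine = pd.concat([red, white], ignore_index=True)

# Abalone: label = 1 if the shell has at least 10 rings ("rings" is a hypothetical column name)
abalone = pd.read_csv("abalone.csv")
abalone["label"] = (abalone["rings"] >= 10).astype(int)

# Hearts: collapse the severity score into disease / no disease ("num" is hypothetical)
hearts = pd.read_csv("hearts.csv")
hearts["label"] = (hearts["num"] > 0).astype(int)
```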

In contrast to our baseline approach [reptile2018] and most related literature [finn2017model, snell2017prototypical, finn2018one], we do not evaluate our approach on image data. Realigning features is not meaningful for images, since only the original sequence of pixels represents the image and the channels (for colored inputs) already follow a standard order (red, green, blue). Additionally, as mentioned before, image schemas can be aligned trivially by scaling to the same size, which is why we restrict our scope to standard one-dimensional, non-sequential data.

Design Setup.

For the experiments on the Wine, Abalone and Telescope data sets, 30 percent of the instances are used for initializing chameleon, 50 percent for training and 20 percent for testing. In the final experiment, the Hearts and the Diabetes data sets are used in conjunction: the full training is performed on the Hearts data set, while the Diabetes data set is used for testing. For all experiments, we sample 30 training and 30 validation instances for a new task. During the reordering-training phase and the inner updates of joint training we use the adam optimizer [adam] with initial learning rates of 0.00001 and 0.0001, respectively. The meta updates of reptile are carried out with a learning rate of 0.01. The reordering-training phase is run for 1000 epochs. All experiments are conducted in two variants: (i) with all features being available during training and (ii) with novel features appearing at test time. For the latter we impose a train/test split on the features as stated in Table 1, and when sampling a validation task we draw 25 to 75% of its features from the unseen ones. We call this setting "Split", while (i) is referred to as "No Split". In the final experiment, we use the Hearts data for training while evaluating the resulting initialization on the Diabetes data. These data sets share a subset of features, e.g. age and blood pressure, while other features are strongly correlated due to the shared domain. Our work is built on top of reptile [reptile2018] but can be used in conjunction with any meta-learning method. We opted for reptile since it does not require second derivatives, its code is publicly available [gitreptile], and it is easy to adapt to our problem.

Figure 3: Meta test loss (cf. Equation (5)) on the Wine data set. Graph (a) shows the average meta-test loss over 10 runs during reptile training. The graph is smoothed with a moving average of size 20 due to the variance caused by using a meta batch size of one when sampling test tasks. Each point represents the test loss on a sampled task after training on it for 10 epochs. Graphs (b), (c) and (d) show a snapshot of the corresponding training of a single task until convergence after initializing with the currently learned parameters $\theta$. The dotted black line marks the value plotted in graph (a) after 10 epochs. Note that the line for scratch training does not improve since it represents the model randomly initialized for every task.
Figure 4: Heat map of the feature shifts for the Wine data set computed with chameleon after reordering-training. The x-axis represents the twelve features of the Wine data set in their correct order, and the y-axis shows the positions to which these features are shifted.
Figure 5: Meta test loss (cf. Equation (5)) on the Hearts vs. Diabetes data sets. The graph shows the average meta-test loss over 10 runs during reptile training. The graph is smoothed with a moving average of size 20 due to the variance caused by using a meta batch size of one when sampling test tasks. Each point represents the test loss on a sampled task after training on it for 10 epochs.

Initial Experiment.

In our first experiment we use the Wine data set [uciref, wineref] to illustrate how our approach can learn a useful initialization across tasks of different representation (contribution 1). We analyze the four approaches described above by training the model $\hat{y}$ or the joint model $\hat{y} \circ \text{enc}$, respectively. The models for approaches (ii) and (iii) are trained with reptile, while for (iv) chameleon is first trained for 1000 epochs before performing reptile training on the joint model. In every reptile epoch we evaluate the current initialization by training on a task sampled from the evaluation set (the updates from the evaluation set are always discarded). In parallel, we repeat this evaluation for (i) on the model with randomly initialized weights; we refer to this baseline as Scratch.

Our experimental results can be analyzed by observing the test loss of a task after performing the inner updates. Figure 3(a) shows the validation loss during the reptile training for the Wine experiments. Each data point corresponds to training a validation task for 10 epochs and evaluating its test data afterwards. Figures 3(b), (c) and (d) show an exemplary snapshot of the test loss when training a sampled task with the proposed method (iv). The snapshots show the training of a task until convergence when initializing with the current $\theta$, whereas the actual reptile training only computes 10 updates, indicated by the dotted line, before performing a meta update. The snapshots show the expected reptile behavior, namely faster convergence when using the currently learned initialization compared to a random one. Note that Figure 3(a) shows the test loss after 10 updates, which is why the snapshot (b) after 1000 meta epochs still converges slowly while the graph in (a) already outperforms training from scratch. One can see that all three approaches trained with reptile produce an initialization that converges faster than a random one. However, simply adding the chameleon component only leads to a faster meta-training convergence. Only when including the pre-training phase defined in Algorithm 2 do we observe a lower meta loss, which corresponds to a more effective initialization.

                            Loss                            Accuracy
Data Set           Scratch  Reptile  Cha.   Cha.+   Scratch  Reptile  Cha.   Cha.+
Wine (NS)             0.82     0.49  0.48    0.37      0.52     0.78  0.79    0.85
Wine (S)              0.83     0.48  0.48    0.43      0.52     0.79  0.80    0.82
Abalone (NS)          0.83     0.54  0.54    0.52      0.52     0.72  0.72    0.74
Abalone (S)           0.82     0.54  0.54    0.52      0.52     0.72  0.72    0.73
Telescope (NS)        0.82     0.51  0.46    0.36      0.52     0.76  0.78    0.84
Telescope (S)         0.82     0.52  0.48    0.42      0.52     0.75  0.77    0.80
Hearts+Diabetes       0.84     0.59  0.59    0.56      0.51     0.69  0.68    0.70

Table 2: Loss and accuracy scores for the experiments with each data set and each model, averaged over 20 tasks. All results are measured after performing 10 update steps on a new task and evaluating the loss/accuracy on a validation set. "S" stands for "Split" and "NS" for "No Feature Split". "Cha." denotes the reptile experiments using chameleon with the same base model, while in "Cha.+" chameleon is additionally pre-trained. Best results are achieved by Cha.+ in all cases.

The result of pre-training chameleon on the Wine data set can be seen in Figure 4. The x-axis shows the twelve distinct features of the Wine data, and the y-axis represents the predicted position. The color indicates the average portion of a feature that is shifted to the corresponding position. One can see that the component manages to learn the true feature position in almost all cases. Moreover, this illustration also shows that chameleon can be used to detect similarity between different features by indicating which pairs are confused most often. For example, features two and three show a strong correlation, which is plausible since they depict the "free sulfur dioxide" and "total sulfur dioxide" levels of the wine. This demonstrates that our proposed architecture is able to learn an alignment between different feature spaces (contribution 2).

Final Results.

Figure 5 shows the results for the Hearts vs. Diabetes experiment (the supplementary material contains the remaining results). In all experiments, simply adding chameleon does not improve performance by a significant margin. However, adding reordering-training always results in a clear performance lift over the other approaches. The fact that we only see performance improvements when using reordering-training shows that the higher model capacity alone is not responsible for the results. In all experiments, pre-training chameleon gives superior performance (contribution 3). When novel features are added at test time, chameleon is still able to outperform the other setups, although by a smaller margin. Most importantly, when training on the Hearts data while evaluating on the Diabetes data, our method still provides noticeable improvements. These findings are also reflected in the numerical results stated in Table 2.

Hardware.

We run our experiments on CPU using Intel Xeon E5410 processors. Pre-training takes on average 10 minutes, while joint training takes around 7 hours to complete for all the tasks of each experiment. Our code is available online for reproduction purposes at https://github.com/radrumond/Chameleon.

Conclusion

In this paper we show how multi-task meta-learners can work across tasks with different predictor vectors. To this end we present chameleon, a network capable of realigning features based on their distributions. We also propose a novel pre-training framework that is shown to learn useful permutations across tasks in a supervised fashion without requiring actual labels. Our model shows performance lifts even when presented with features not seen during training.

By experimenting with different data sets, and even combining two distinct ones from the same domain, we show that chameleon generalizes across different kinds of tasks. Another advantage of chameleon is that it can be combined with any model and optimization method.

As future work, we would like to extend chameleon to time-series features and reinforcement learning, as these tend to present more variation across tasks and would benefit greatly from our model.

References

Supplementary Material

Figure 6: The graphs show the average meta-test loss over 10 runs during reptile training for the remaining experiments. Each graph is smoothed with a moving average of size 20 due to the variance caused by using a meta batch size of one when sampling test tasks. Each point represents the test loss on a sampled task after training on it for 10 epochs.