Learnable Parameter Similarity
Most of the existing approaches focus on specific visual tasks while ignoring the relations between them. Estimating task relation sheds light on the learning of high-order semantic concepts, e.g., transfer learning. How to reveal the underlying relations between different visual tasks remains largely unexplored. In this paper, we propose a novel Learnable Parameter Similarity (LPS) method that learns an effective metric to measure the similarity of second-order semantics hidden in trained models. LPS is achieved by using a second-order neural network to align high-dimensional model parameters and learning second-order similarity in an end-to-end way. In addition, we create a model set called ModelSet500 as a parameter similarity learning benchmark that contains 500 trained models. Extensive experiments on ModelSet500 validate the effectiveness of the proposed method. Code will be released at https://github.com/Wanggcong/learnable-parameter-similarity.
Guangcong Wang (Sun Yat-sen University, email@example.com), Jianhuang Lai* (Sun Yat-sen University, firstname.lastname@example.org), Wenqi Liang (Sun Yat-sen University, email@example.com), Guangrun Wang (Sun Yat-sen University, firstname.lastname@example.org)
*Corresponding author: Jianhuang Lai.
Preprint. Under review.
1 Introduction
Purpose-specific visual tasks have achieved great commercial success by focusing on specific optimization problems, e.g., face recognition, object classification, object detection, visual object tracking, and instance segmentation. Can we exploit the underlying relations between different tasks and extend these task-specific methods to task-generic ones? Or can we connect shallow AI to general AI via task relations? Task relation sheds light on the learning of high-order semantic concepts.
Considerable evidence shows that transfer learning approaches [14, 18, 9], which exploit the underlying relations between different tasks, can further improve purpose-specific visual tasks with less labeled data. For instance, domain adaptation methods [18, 9] attempt to gain knowledge from source tasks and then apply it to a different but related target task. It is assumed that the knowledge learned from source tasks can help the learning of the target task.
These transfer learning methods raise a natural question: how can we measure the relations between different tasks? Existing methods that offer partial solutions to this problem can be categorized into two groups. In the first group, a wide variety of transfer learning methods simply assume that tasks are related or unrelated based on human intuition or experience. For example, the knowledge gained from car recognition could be applied to truck recognition because cars intuitively look like trucks. However, one drawback of these methods is that human intuition can differ from machine learning principles. Negative transfer can happen when human intuition is unreliable and the source domain data reduces the performance in the target domain. When the number of source domains is very large, it is hard to tell directly which one is the best for transfer learning.
In the second group, some methods attempt to jointly optimize multiple tasks and estimate task relations by cross-validation. For example, the taskonomy method [20] computes an affinity matrix among tasks based on whether the solution for one task can be sufficiently easily read out of the representation trained for another task. It uses transfer networks for the first-order transfer of 26 tasks. However, this pipeline requires a large amount of computation to jointly train all of the subsets of a task set. When a new task arrives, it must be jointly trained with the old tasks, which strongly limits the applicability of the method in real-world scenarios.
To address these drawbacks of existing methods, we propose a novel Learnable Parameter Similarity (LPS) method that learns second-order similarity to measure task relations by using trained task-specific models. Our observation is that the distance between intra-task models is smaller than that between inter-task models. Let $\mathcal{T}$ denote a set of $N$ tasks. For each task $T_i \in \mathcal{T}$, we repeat the training procedure $K$ times and thus obtain $K$ trained models. We then use these task-specific models as metadata points to train a second-order neural network to measure the parameter similarity.
Different from existing transfer learning methods, the LPS method measures task relation using task-specific models that are trained on independent task-specific datasets, without jointly optimizing too many subsets of a task set. LPS pays attention to higher-order semantics/concepts, as illustrated in Figure 1. Data points produce a data similarity metric. Data similarity metrics produce a parameter similarity metric. If data is the zero-order similarity and data similarity is the first-order similarity, then parameter similarity can be regarded as the second-order similarity. LPS is also different from learning to learn methods. The former learns second-order similarity based on first-order similarity, while the latter learn hyper-parameters to know how to learn, which still focuses on the optimization of the first-order similarity.
Overall, the key contributions of this paper are:
We propose a novel Learnable Parameter Similarity (LPS) method that learns second-order similarity to measure task relation by using trained task-specific models instead of jointly training a large number of transfer networks.
We introduce a hierarchical second-order network to deal with high-dimensional unaligned deep models and learn an effective parameter representation.
We create a parameter similarity learning benchmark called ModelSet500 and extensive experiments on ModelSet500 validate the effectiveness of the proposed method.
2 Related work
2.1 Transfer Learning
Transfer learning [14, 2, 13, 16, 8, 18, 9, 12, 6] transfers knowledge from a source domain to a target domain and has already achieved significant success in many areas, including classification, regression, and clustering. Many approaches simply assume that source and target tasks are related or unrelated. For example, Tzeng et al. [18] proposed to transfer the knowledge from RGB image based classification to depth image based classification. Liang et al. [9] proposed to transfer the person re-identification knowledge from one scene to another scene. In addition, some methods jointly optimize multiple tasks and estimate task relations by cross-validation. For example, Rosenstein et al. [16] proposed to detect and avoid negative transfer using very little data from the target task and empirically showed that dissimilar tasks may hurt the performance of the target task. Zamir et al. [20] proposed to transfer multiple different tasks to target tasks based on an affinity matrix of transferabilities. These approaches either assume that tasks that are intuitively related have a positive impact on transfer learning or attempt to learn task relation by jointly optimizing multiple labeled datasets. Different from these methods, the proposed method learns task relation from trained task-specific models.
2.2 Meta-learning/Learning to Learn
Meta-learning, also known as learning to learn, is a subfield of machine learning that focuses on automated learning algorithms. Recently, many approaches [21, 1, 4, 10, 11] have aimed to automatically search hyper-parameters by using learning to learn models. For example, Zoph and Le [21] used a recurrent network to generate the model descriptions of neural networks and trained the RNN with reinforcement learning to maximize the expected accuracy of the generated architectures on a validation set. Liu et al. [10] used a sequential model-based optimization strategy to search for structures in order of increasing complexity, while simultaneously learning a surrogate model to guide the search through structure space. Andrychowicz et al. [1] attempted to learn an LSTM based neural optimizer to learn how to optimize neural networks. Finn et al. [4] introduced a meta-learning method based on learning easily adaptable model parameters through gradient descent. However, these methods aim to automatically learn a better first-order similarity by optimizing hyper-parameters, while the proposed method learns second-order similarity.
3 Second-Order Similarity
We first consider data similarity, which is also termed first-order similarity in this paper. For a learning task, let $D = \{x_i\}_{i=1}^{n}$ denote a training dataset with $n$ data points $x_i$. Let $f(\theta; D)$ be an objective function with parameters $\theta$ on $D$. The objective function is optimized by standard gradient descent as follows
$$\theta_{t+1} = \theta_t - \alpha_t \nabla_{\theta} f(\theta_t; D), \tag{1}$$
where $\alpha_t$ is a step size multiplier at step $t$. After a sequence of updates, the algorithm converges to a minimizer $\theta^*$. With the parameter $\theta^*$, the learning task can measure similarity between data points. In this way, given a set of learning tasks, we can obtain a set of trained task-specific parameters. The question now is how to measure the similarity between different task-specific parameters.
Data similarity/metric learning aims to minimize the distance between intra-class data points and maximize the distance between inter-class data points. Similar to data similarity learning, the goal of parameter similarity/metric learning is to minimize the distance between intra-task models and maximize the distance between inter-task models, as illustrated in Figure 1. Intra-task models are obtained by repeating the training procedure many times for each task.
The intra-task models are diverse and contain rich statistical semantics due to the intractable non-convexity of deep networks and the coupled relationship between data points and parameters in convolution operation. Data points and parameters share the same semantics. For example, car classification and bus classification share a similar model while car classification and bird classification could produce dissimilar models. That is, the distance between car classification and bus classification models is closer than that between car classification and bird classification models. Learning to distinguish intra-task and inter-task parameters can uncover the underlying parameter patterns of local solutions and thus provides a parameter similarity metric for task relation.
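Let $\mathcal{T} = \{T_1, \dots, T_N\}$ denote a set of $N$ tasks. For each $T_i \in \mathcal{T}$, we train $K$ deep models with a task label $y_i$. We obtain a model set $M = \{(\theta_j, y_j)\}_{j=1}^{m}$ with $m$ models from $N$ tasks, where $m = NK$. Each $\theta_j$ is a trained model, which can be regarded as a metadata point. We then use these metadata points to train the second-order similarity learning by
$$\phi_{t+1} = \phi_t - \beta_t \nabla_{\phi} g(\phi_t; M), \tag{2}$$
where $g$ is a second-order similarity objective function with parameters $\phi$ on $M$ and $\beta_t$ is a step size multiplier at step $t$.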
Second-order similarity learning is naturally different from existing learning to learn methods. First, the goal of second-order similarity learning is to learn parameter similarity based on first-order similarity, while learning to learn methods learn the hyper-parameters of an optimizer to guide data similarity learning, which is still first-order similarity learning. Second, second-order similarity learning optimizes model parameters and second-order model parameters (which are different from "hyper-parameters") separately, while learning to learn methods optimize both model parameters and hyper-parameters simultaneously. Such a disjoint optimization ensures that the trained deep models have already converged to a good local solution before the second-order model is trained.
4 Second-Order Neural Networks
In this section, we introduce a hierarchical second-order neural network to learn an effective parameter representation for second-order similarity, as shown in Figure 2. Specifically, the second-order network consists of $L$ branches to deal with the $L$ semantic levels of layers. For each branch, we attempt to align the metadata and then project it into a common space for parameter similarity learning with a fully connected layer. Finally, we weight the losses of different branches to control their relative importance.
Although the idea of parameter similarity learning is similar to data similarity learning, it is a challenging problem because of high-dimensional metadata points (higher than images, e.g., 96M for ResNet-50), changeable order of filters and their channels at a layer, and hierarchical semantics.
One of the most challenging problems is that the order of filters and their channels is changeable when repeating the training procedure. Therefore, directly using fully connected layers to project these metadata points into a common space cannot obtain good performance. We show that the order of filters and their channels at a layer can be shuffled across different training procedures as follows.
Consider the convolution operation in visual tasks. Let $X^l$ be feature maps of size $C_l \times H_l \times W_l$ at the $l$-th convolutional layer. $X^l_c$ is the $c$-th feature map of $X^l$ ($c = 1, \dots, C_l$). $*$ is the convolution operator. For simplicity, we re-use the symbol $\theta$ and write $\theta^l$ for the filters of size $C_{l+1} \times C_l \times k \times k$ at the $l$-th convolutional layer (not to be confused with the per-iteration and per-model parameters of Section 3). $\theta^l_i$ is the $i$-th filter of $\theta^l$ ($i = 1, \dots, C_{l+1}$). $\theta^l_{i,c}$ is the $c$-th channel of $\theta^l_i$. $\forall i$, $\theta^l_i$ produces a feature map $X^{l+1}_i$. The convolution operation is given by
$$X^{l+1}_i = \sum_{c=1}^{C_l} \theta^l_{i,c} * X^l_c. \tag{3}$$
In this way, $C_{l+1}$ filters produce $C_{l+1}$ feature maps as output. We can see that the convolution operation is independent of the order of the filters $\theta^l_i$. Although shuffled filters at the $l$-th layer naturally lead to shuffled channels of $X^{l+1}$, this does not reduce the informative features. The convolution operation for each filter at a layer is symmetric. Therefore, the order of $\theta^l_i$ is changeable. We perform the next convolution operation by
$$X^{l+2}_j = \sum_{i=1}^{C_{l+1}} \theta^{l+1}_{j,i} * X^{l+1}_i. \tag{4}$$
According to Eq. (3), shuffled $\theta^l_i$ leads to shuffled $X^{l+1}_i$, where $i = 1, \dots, C_{l+1}$. According to Eq. (4), each output channel $X^{l+2}_j$ is determined by the order of the channels of $X^{l+1}$ and the channel order of the filter $\theta^{l+1}_j$. Because the order of $X^{l+1}$ is changed, the channel order of the filter $\theta^{l+1}_j$ has to be changed as well if we fix the output $X^{l+2}$. For example, if we exchange the order of two filters $\theta^l_a$ and $\theta^l_b$, and also exchange the channels $a$ and $b$ of $\theta^{l+1}_j$ for every $j$, the output of the network is unchanged. That is, there are a large number of filter map permutations for which the output is the same.
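The permutation argument above can be checked numerically. The following is a minimal sketch, assuming 1x1 filters (so each filter map is a single scalar), integer weights, and an elementwise ReLU between the two layers; the helper names `conv1x1` and `relu` are illustrative and are not taken from the paper.

```python
import random


def conv1x1(filters, x):
    """Apply a 1x1 convolution.

    filters[i][c] is the scalar weight of channel c in filter i;
    x[c] is the flattened feature map of input channel c.
    Output channel i is the sum over c of filters[i][c] * x[c].
    """
    num_pixels = len(x[0])
    return [
        [sum(f[c] * x[c][p] for c in range(len(x))) for p in range(num_pixels)]
        for f in filters
    ]


def relu(x):
    """Elementwise ReLU (an assumption; any elementwise function would do)."""
    return [[max(0, v) for v in channel] for channel in x]


rng = random.Random(0)
x = [[rng.randint(-3, 3) for _ in range(5)] for _ in range(3)]   # 3 channels, 5 pixels
w1 = [[rng.randint(-2, 2) for _ in range(3)] for _ in range(4)]  # layer l: 4 filters
w2 = [[rng.randint(-2, 2) for _ in range(4)] for _ in range(2)]  # layer l+1: 2 filters

perm = [2, 0, 3, 1]
w1_shuffled = [w1[i] for i in perm]                # shuffle the filters of layer l
w2_realigned = [[f[i] for i in perm] for f in w2]  # shuffle the channels of layer l+1 the same way

hidden = relu(conv1x1(w1, x))
hidden_shuffled = relu(conv1x1(w1_shuffled, x))
out_reference = conv1x1(w2, hidden)
out_realigned = conv1x1(w2_realigned, hidden_shuffled)

# Shuffling filters only permutes the intermediate channels ...
channels_are_permuted = hidden_shuffled == [hidden[i] for i in perm]
# ... and re-ordering the next layer's channels restores the same output.
output_is_unchanged = out_realigned == out_reference
print(channels_are_permuted, output_is_unchanged)
```

Integer weights are used so that the comparison is exact; with floating-point weights the two outputs would agree only up to rounding, because the summation order differs.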
Let $O_f^l$ and $O_c^l$ denote the filter order and channel order of $\theta^l$. We denote the order of $X^l$ as $O_x^l$. With $L$ convolutional layers, we define the order chain rule of convolutional filters as follows
If $l = 1$, then $O_c^1$ is fixed because $O_x^1$ is constrained by the channels of natural images, e.g., RGB. $O_f^1$ determines $O_c^2$.
If $1 < l < L$, then $O_c^l$ is determined by $O_f^{l-1}$ while $O_f^l$ determines $O_c^{l+1}$.
If $l = L$, then $O_c^L$ is determined by $O_f^{L-1}$. $O_f^L$ is fixed because it is constrained by the loss of neural networks (or labels).
The order chain rule describes an ideal case in which intra-task models are only affected by the changeable order of filters and their channels. In fact, intra-task deep models are also affected by hierarchical semantics and non-convex optimization. Therefore, given two deep models with a shared task label, it is difficult to align the changeable order of filters and their channels, which confuses parameter similarity learning.
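To address this problem, we propose to transform $\theta^l$ of different intra-task models into a "standard" order to align these model parameters. Let $\Phi^l$ be the second-order filters of size $P_l \times Q_l \times k \times k$, with second-order filter maps $\phi^l_{p,q}$. $\theta^l$ and $\Phi^l$ could have different numbers of filters at a layer. $\forall i$, $\forall c$, $\forall p$, and $\forall q$, we first compute the Frobenius inner product between filter maps and second-order filter maps by
$$s^l_{i,c,p,q} = \langle \theta^l_{i,c}, \phi^l_{p,q} \rangle_F, \tag{5}$$
where $\langle \cdot, \cdot \rangle_F$ denotes the Frobenius inner product. In Eq. (5), we compute the Frobenius inner product between all pairs of $\theta^l_{i,c}$ and $\phi^l_{p,q}$ filter maps. Without any prior information, we cannot directly compute the standard $O_c^l$. We estimate the standard $O_c^l$ based on the max matching between $\theta^l_{i,c}$ and $\phi^l_{p,q}$. We have
$$r^l_{i,p,q} = \max_{c} \, s^l_{i,c,p,q}. \tag{6}$$
When $l = 1$, we do not need to align $O_c^1$ by the max operation. $\forall i$, $\forall p$, and $\forall q$, we directly compute $r^1_{i,p,q}$ by
$$r^1_{i,p,q} = \langle \theta^1_{i,q}, \phi^1_{p,q} \rangle_F. \tag{7}$$
$\forall p, q$, we then estimate the standard $O_f^l$ by using the sort operation
$$(\hat{r}^l_{:,p,q}, \pi) = \operatorname{sort}\big(r^l_{:,p,q}\big), \tag{8}$$
where $\pi$ is the new filter order indexes. Based on the order chain rule, $O_f^l$ is fixed when $l = L$. Therefore, we do not need to align $O_f^L$.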
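To make the alignment idea concrete, here is a minimal sketch of one possible reading of the steps above: take the Frobenius inner product of every first-order filter map with each second-order filter map, keep the maximum over channels, and sort the result over filters. The function names, the flat list of second-order maps, and the descending sort direction are all assumptions made for illustration; this is not the authors' implementation.

```python
import random


def frobenius(a, b):
    """Frobenius inner product of two equally sized 2-D maps."""
    return sum(u * v for row_a, row_b in zip(a, b) for u, v in zip(row_a, row_b))


def aligned_responses(filters, second_order_maps):
    """Order-invariant responses of a layer to a set of second-order maps.

    filters[i][c] is the k x k map of channel c in filter i.
    second_order_maps is a flat list of k x k maps (hypothetical layout).
    """
    responses = []
    for phi in second_order_maps:
        # Max over channels: removes the dependence on the channel order.
        per_filter = [max(frobenius(channel, phi) for channel in f) for f in filters]
        # Sort over filters: removes the dependence on the filter order.
        responses.append(sorted(per_filter, reverse=True))
    return responses


rng = random.Random(0)


def random_map(k=3):
    return [[rng.randint(-2, 2) for _ in range(k)] for _ in range(k)]


filters = [[random_map() for _ in range(3)] for _ in range(4)]  # 4 filters, 3 channels
second_order_maps = [random_map() for _ in range(2)]

# Simulate a second training run that differs only in filter and channel order.
filter_perm = [3, 1, 0, 2]
channel_perm = [1, 2, 0]
shuffled = [[filters[i][c] for c in channel_perm] for i in filter_perm]

invariant = (
    aligned_responses(filters, second_order_maps)
    == aligned_responses(shuffled, second_order_maps)
)
print(invariant)
```

Under this reading the max and sort operations make the layer representation independent of how a particular training run happened to order its filters and channels, which is the property the fully connected layer alone lacks.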
After aligning $\theta^l$, we add a fully connected layer to project the aligned parameter representation into a common space. Finally, we train a second-order model by using a conventional metric learning loss, e.g., cross-entropy loss [17] or triplet loss [3]. In this paper, we simply use the cross-entropy loss for the model similarity measure
$$\mathcal{L} = -\sum_{l=1}^{L} \lambda_l \sum_{n=1}^{N} y_n \log p^l_n, \tag{9}$$
where $y_n$ is the $n$-th dimensional value of the one-hot label $y$. $p^l_n$ represents the probability of the $n$-th visual task at the $l$-th branch. $\lambda_l$ controls the relative importance of the branches. When measuring the parameter similarity between two deep models, we extract parameter features from the last fully connected layer of the second-order network. We normalize the parameter features and use the cosine similarity to compute the parameter similarity.
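The branch-weighted cross-entropy loss and the cosine similarity between normalized parameter features can be sketched as follows. This is a minimal illustration for a single model with a one-hot label; the function names and the list-based inputs are assumptions, and branches with zero weight are skipped so that a zero probability in an unused branch does not matter.

```python
import math


def weighted_cross_entropy(branch_probs, label, weights):
    """Cross-entropy summed over branches, weighted per branch.

    branch_probs[l][n] is the predicted probability of task n at branch l;
    label is the index of the true task (the non-zero entry of the one-hot label).
    """
    return -sum(
        w * math.log(probs[label])
        for w, probs in zip(weights, branch_probs)
        if w > 0
    )


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_similarity(u, v):
    """Dot product of the two L2-normalized feature vectors."""
    return sum(a * b for a, b in zip(l2_normalize(u), l2_normalize(v)))


# Two branches; only the first one contributes to the loss.
loss = weighted_cross_entropy([[0.5, 0.5], [0.1, 0.9]], label=0, weights=[1.0, 0.0])
print(loss)                                        # equals log(2)
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))   # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))   # orthogonal
```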
5 Experiments
In this section, we conduct extensive experiments to validate the effectiveness of our proposed method. All experiments are conducted with PyTorch [15].
Dataset and Modelset. The CIFAR-100 dataset [7] has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class.
To measure model-based parameter similarity, we create a model set called ModelSet500 based on CIFAR-100. Specifically, we split the 100 classes into 50 groups, which serve as 50 visual tasks with task labels 0 to 49. Each task contains 2 classes. For each task, we repeat the training procedure 10 times to obtain 10 deep models. Finally, we obtain 500 deep models for 50 tasks.
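The bookkeeping behind ModelSet500 can be sketched in a few lines. The pairing of consecutive class ids into a task is an assumption made here for illustration, since the text does not state how the 100 classes are grouped; the dictionary layout is likewise hypothetical.

```python
NUM_CLASSES = 100
CLASSES_PER_TASK = 2
RUNS_PER_TASK = 10

num_tasks = NUM_CLASSES // CLASSES_PER_TASK  # 50 tasks with labels 0..49

# Assumption: task t uses the consecutive class ids (2t, 2t + 1).
task_classes = {
    task: tuple(range(task * CLASSES_PER_TASK, (task + 1) * CLASSES_PER_TASK))
    for task in range(num_tasks)
}

# One entry per trained model: the task label plus the index of the training run.
model_index = [
    {"task": task, "run": run, "classes": task_classes[task]}
    for task in range(num_tasks)
    for run in range(RUNS_PER_TASK)
]

print(num_tasks, len(model_index))
```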
First-order and Second-order Network Implementation. We implement a ResNet-20 network [5] as the first-order network in our experiments. The network consists of one convolutional layer, three residual blocks, one global average pooling layer, and one fully-connected layer. Each residual block consists of 6 convolutional layers (excluding the convolutional layers used for downsampling in residual blocks). When computing the parameter similarity measure, we omit the last fully-connected layer and the downsampling convolutional layers. We use SGD with a mini-batch size of 128. The learning rate starts from 0.1 and is divided by 10 after 80 epochs and again after 120 epochs. We train ResNet-20 for 160 epochs. We use a weight decay of 0.0001 and a momentum of 0.9. We set the size of padding to 4 and perform random cropping to $32 \times 32$ and random horizontal flipping.
When implementing the second-order network, we use SGD with a mini-batch size of 1. The learning rate starts from 0.001 and is divided by 10 after 40 epochs and again after 80 epochs. We train for 100 epochs. We set the loss weights $\lambda_1 = 1$ and $\lambda_l = 0$ for $l > 1$ in Sections 5.1, 5.2, and 5.3 because the first branch plays a major role in Eq. (9). We analyze the performance of different branches in Sections 5.4 and 5.5.
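Both step schedules described above follow the same pattern, which the small helper below makes explicit. It is a sketch that assumes zero-based epoch indices and that the rate drops at the start of the milestone epoch; the function name `learning_rate` is illustrative.

```python
def learning_rate(epoch, base_lr, milestones, gamma=0.1):
    """Step schedule: scale the rate by gamma once per milestone already reached."""
    reached = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** reached


# First-order network (ResNet-20): 160 epochs, drops after 80 and 120 epochs.
first_order = [learning_rate(e, 0.1, (80, 120)) for e in range(160)]

# Second-order network: 100 epochs, drops after 40 and 80 epochs.
second_order = [learning_rate(e, 0.001, (40, 80)) for e in range(100)]

print(first_order[0], first_order[80], first_order[120])
print(second_order[0], second_order[40], second_order[80])
```

The printed values may show small floating-point rounding, so comparisons against the nominal rates should use a tolerance.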
5.1 Task classification
We first adopt the conventional image classification evaluation metric for task classification on ModelSet500. The goal of task classification is to predict the task labels of trained models. For each task, we sample eight deep models for training and use the other two for testing. Therefore, the training set contains 400 deep models while the test set contains 100 deep models. We train the second-order network by classifying 50 task classes.
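The split and the top-k metric used for task classification can be sketched as below. Which eight runs go to training is not specified in the text, so taking the first eight is an assumption; the function names are illustrative.

```python
def split_models(num_tasks=50, runs_per_task=10, train_runs=8):
    """Return (train, test) lists of (task, run) pairs.

    Assumption: the first `train_runs` runs of each task are used for training.
    """
    train = [(t, r) for t in range(num_tasks) for r in range(train_runs)]
    test = [(t, r) for t in range(num_tasks) for r in range(train_runs, runs_per_task)]
    return train, test


def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda n: row[n], reverse=True)
        hits += label in ranked[:k]
    return hits / len(labels)


train_set, test_set = split_models()
print(len(train_set), len(test_set))

# Toy scores for three models over four task classes.
toy_scores = [[0.1, 0.7, 0.1, 0.1], [0.4, 0.3, 0.2, 0.1], [0.1, 0.2, 0.3, 0.4]]
toy_labels = [1, 1, 0]
print(top_k_accuracy(toy_scores, toy_labels, 1), top_k_accuracy(toy_scores, toy_labels, 2))
```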
To evaluate the effectiveness of the second-order network, we set three baselines, i.e., random prediction without using any second-order model, only fully connected layer, Frobenius inner product without aligning filters and their channels. Compared with these three baselines, we can see that the proposed method achieves better performance. As shown in Table 1, our method achieves 79.8% top-1 accuracy while the three baselines are lower than 6%. The main reason may be that the changeable order of filters and their channels confuses the second-order neural networks.
| methods | top-1 (%) | top-5 (%) | top-10 (%) |
|---|---|---|---|
| only one FC layer | 5.3 | 17.5 | 29.3 |
| Frobenius+FC, w/ alignment (Ours) | 79.8 | 94.8 | 98.0 |
5.2 Task retrieval
We then adopt the conventional image retrieval evaluation metric for task retrieval on ModelSet500. Given a query deep model, the goal of task retrieval is to retrieve the target deep models from a gallery. We split ModelSet500 into a training set and a test set. The training set includes 40 tasks and the test set includes 10 tasks. For each task in the test set, we sample two deep models for query models while the others are used for gallery models. Task retrieval is challenging and practical in real-world scenarios. It can be used to retrieve similar visual tasks for transfer learning. Different from task classification, we train the second-order network by classifying 40 task classes and extract the fully connected layer as parameter representation during the test phase. We use the normalized parameter representation for task retrieval.
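A rank-k retrieval score of the kind described above can be computed as in the following sketch. It assumes the parameter features are already L2-normalized, so that the dot product equals the cosine similarity, and it counts a query as a hit if any of its k most similar gallery models shares its task label; the function name and argument layout are illustrative.

```python
def rank_k_accuracy(query_feats, query_tasks, gallery_feats, gallery_tasks, k):
    """Fraction of queries with at least one same-task model among the top-k gallery matches."""
    hits = 0
    for q, q_task in zip(query_feats, query_tasks):
        sims = [sum(a * b for a, b in zip(q, g)) for g in gallery_feats]
        order = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)
        hits += any(gallery_tasks[i] == q_task for i in order[:k])
    return hits / len(query_feats)


# Toy example with unit-length 2-D features.
gallery_feats = [[0.0, 1.0], [0.6, 0.8], [1.0, 0.0]]
gallery_tasks = [1, 1, 0]
query_feats = [[1.0, 0.0], [0.0, 1.0]]
query_tasks = [0, 0]

print(rank_k_accuracy(query_feats, query_tasks, gallery_feats, gallery_tasks, 1))
print(rank_k_accuracy(query_feats, query_tasks, gallery_feats, gallery_tasks, 3))
```

With the protocol in the text (10 test tasks, two query models per task), this would be evaluated on 20 queries against a gallery of 80 models.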
We use the same three baselines as in task classification. Compared with these three baselines, the proposed method achieves better performance. As shown in Table 2, our method achieves 80.7% rank-1 accuracy while the baselines reach at most 15.0%. The reason is the same as in Section 5.1.
| methods | rank-1 (%) | rank-5 (%) | rank-10 (%) |
|---|---|---|---|
| only one FC layer | 15.0 | 55.8 | 83.3 |
| Frobenius+FC, w/o alignment | 13.3 | 58.3 | 80.0 |
| Frobenius+FC, w/ alignment (Ours) | 80.7 | 95.0 | 95.8 |
5.3 Task transferability
To validate that the parameter similarity can be used to measure task transferability, we conduct an experiment to study task transferability with respect to parameter similarity. We use the task retrieval setting to obtain the parameter representation, as discussed in Section 5.2. We evaluate task transferability by training first-order models for one task (a labeled source domain) and applying these models to another task (an unlabeled target domain) for data feature extraction and k-means clustering. We use the adjusted Rand index (ARI) [19] for evaluation, which is widely used in cluster analysis.
As shown in Figure 3(a), although there are some outliers, we observe that parameter similarity is roughly proportional to task transferability (ARI). In most cases, the higher the parameter similarity score between the source and target tasks, the better the task transferability they achieve.
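For reference, the adjusted Rand index used above can be computed from the contingency counts of the two labelings. The following is a small standard-library sketch of the usual pair-counting formula, not the evaluation code of the paper; it requires Python 3.8 or later for `math.comb` and at least two samples.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between a ground-truth labeling and a clustering."""
    n = len(labels_true)
    # Pairs of samples that fall in the same cell of the contingency table.
    same_cell = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    same_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    same_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = same_true * same_pred / comb(n, 2)
    maximum = (same_true + same_pred) / 2
    if maximum == expected:
        # Degenerate case, e.g. both labelings put everything in one cluster.
        return 1.0
    return (same_cell - expected) / (maximum - expected)


# Cluster ids are arbitrary, so a relabeled perfect clustering still scores 1.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
# A clustering that splits every true class evenly scores below 0.
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))
```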
5.4 Effect of different branches
To show the relative importance of different branches for the parameter similarity measure, we conduct an experiment on ModelSet500 by keeping one of the branches while removing the others. That is, when evaluating the $b$-th branch, we set the loss weight $\lambda_l = 1$ if $l = b$, otherwise $\lambda_l = 0$. We use the task classification setting. Different from the other experiments, the learning rate in this experiment is divided by 10 after 150 epochs and again after 225 epochs. We train each branch for 300 epochs because it takes more time to align the order of both filters and their channels.
As shown in Figure 3(b), the first branch is the most important branch and achieves 79.8% top-1 accuracy. From the 2nd to the 13th branch, the second-order network can still learn a weakly discriminative parameter representation, since the results are higher than the top-1 accuracy of random prediction, i.e., 2.0% (50 classes). However, from the 13th to the 19th layer, the second-order network cannot learn a discriminative parameter representation, since the results drop to nearly 2.0% top-1 accuracy. The reason could be that the order chain rule leads to a cumulative alignment error at higher layers.
Besides, it is still hard to fuse all of the branches to achieve better performance because the accuracy of the 2nd to 19th branches is much lower. This is the reason why we only use the first branch for task classification and retrieval.
5.5 Effectiveness of the alignment methods
To show the effectiveness of the alignment methods, we conduct an experiment that isolates the different alignment methods, i.e., filter alignment (filt.) and channel alignment (chan.). The training setting is the same as in Section 5.4. We only analyze the 2nd to 19th branches because the 1st branch only contains one unaligned case, i.e., unaligned filters (see Eq. (7)). As shown in Figure 3(c), both alignment methods are important for parameter similarity learning.
6 Advantages and disadvantages
The proposed method comes with advantages and disadvantages compared with existing methods. One advantage is that the learnable parameter similarity model does not need to jointly train a large number of transfer networks to estimate task relations. In addition, the second-order network does not need to align convolutional filters if the first-order network does not contain convolutional layers, as in some non-image scenarios. Learnable parameter similarity can also be used to quickly retrieve related tasks from a model market (set) for transfer learning.
The disadvantages are that the proposed learnable parameter similarity cannot measure the similarity between different network architectures, and that it is still hard to fuse all of the branches to achieve better performance. The reason could be that the alignment at higher layers is still unsatisfactory.
7 Conclusion
In this paper, we present a Learnable Parameter Similarity (LPS) method to reveal the underlying relations between different visual tasks. Our approach learns task relations by using task-specific trained deep models instead of jointly training a large number of transfer networks. We present a hierarchical second-order network to deal with high-dimensional unaligned deep models. We evaluate the LPS method on a new parameter similarity learning benchmark, ModelSet500, and extensive experiments show the effectiveness of the proposed method. Future work includes further exploring the fusion of different layers, improving the alignment of the convolutional filters at higher layers, and developing a parameter similarity metric for heterogeneous networks.
References
[1] Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando De Freitas. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, pages 3981–3989, 2016.
[2] Yoshua Bengio. Deep learning of representations for unsupervised and transfer learning. In Proceedings of the 2011 International Conference on Unsupervised and Transfer Learning Workshop, Volume 27, pages 17–37. JMLR.org, 2011.
[3] Shengyong Ding, Liang Lin, Guangrun Wang, and Hongyang Chao. Deep feature learning with relative distance comparison for person re-identification. Pattern Recognition, 48(10):2993–3003, 2015.
[4] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1126–1135. JMLR.org, 2017.
[5] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[6] Judy Hoffman, Eric Tzeng, Taesung Park, Jun-Yan Zhu, Phillip Isola, Kate Saenko, Alexei A Efros, and Trevor Darrell. CyCADA: Cycle-consistent adversarial domain adaptation. arXiv preprint arXiv:1711.03213, 2017.
[7] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
[8] Bin Li, Qiang Yang, and Xiangyang Xue. Transfer learning for collaborative filtering via a rating-matrix generative model. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 617–624. ACM, 2009.
[9] Wenqi Liang, Guangcong Wang, Jianhuang Lai, and Junyong Zhu. M2M-GAN: Many-to-many generative adversarial transfer learning for person re-identification. arXiv preprint arXiv:1811.03768, 2018.
[10] Chenxi Liu, Barret Zoph, Maxim Neumann, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan Huang, and Kevin Murphy. Progressive neural architecture search. In Proceedings of the European Conference on Computer Vision (ECCV), pages 19–34, 2018.
[11] Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: Differentiable architecture search. arXiv preprint arXiv:1806.09055, 2018.
[12] Ming-Yu Liu and Oncel Tuzel. Coupled generative adversarial networks. In Advances in Neural Information Processing Systems, pages 469–477, 2016.
[13] Mingsheng Long, Han Zhu, Jianmin Wang, and Michael I Jordan. Deep transfer learning with joint adaptation networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 2208–2217. JMLR.org, 2017.
[14] Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2010.
[15] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.
[16] Michael T Rosenstein, Zvika Marx, Leslie Pack Kaelbling, and Thomas G Dietterich. To transfer or not to transfer. In NIPS 2005 Workshop on Transfer Learning, volume 898, pages 1–4, 2005.
[17] Yi Sun, Xiaogang Wang, and Xiaoou Tang. Deep learning face representation from predicting 10,000 classes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1891–1898, 2014.
[18] Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. Adversarial discriminative domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7167–7176, 2017.
[19] Nguyen Xuan Vinh, Julien Epps, and James Bailey. Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. Journal of Machine Learning Research, 11(Oct):2837–2854, 2010.
[20] Amir R Zamir, Alexander Sax, William Shen, Leonidas J Guibas, Jitendra Malik, and Silvio Savarese. Taskonomy: Disentangling task transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3712–3722, 2018.
[21] Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.