Learnable Parameter Similarity
Abstract
Most existing approaches focus on specific visual tasks while ignoring the relations between them. Estimating task relations sheds light on the learning of high-order semantic concepts, e.g., transfer learning. How to reveal the underlying relations between different visual tasks remains largely unexplored. In this paper, we propose a novel Learnable Parameter Similarity (LPS) method that learns an effective metric to measure the similarity of second-order semantics hidden in trained models. LPS is achieved by using a second-order neural network to align high-dimensional model parameters and learning second-order similarity in an end-to-end way. In addition, we create a model set called ModelSet500 as a parameter similarity learning benchmark that contains 500 trained models. Extensive experiments on ModelSet500 validate the effectiveness of the proposed method. Code will be released at https://github.com/Wanggcong/learnableparametersimilarity.
Guangcong Wang, Sun Yat-sen University, wanggc3@mail2.sysu.edu.cn; Jianhuang Lai (corresponding author), Sun Yat-sen University, stsljh@mail.sysu.edu.cn; Wenqi Liang, Sun Yat-sen University, liangwq8@mail2.sysu.edu.cn; Guangrun Wang, Sun Yat-sen University, wanggrun@mail2.sysu.edu.cn
Preprint. Under review.
1 Introduction
Purpose-specific visual tasks have achieved great commercial success by focusing on specific optimization problems, e.g., face recognition, object classification, object detection, visual object tracking, and instance segmentation. Can we exploit the underlying relations between different tasks and extend these task-specific methods to task-generic ones? Or can we connect shallow AI to general AI via task relations? Task relations shed light on the learning of high-order semantic concepts.
A large body of evidence reveals that transfer learning approaches [14, 18, 9], which exploit the underlying relations between different tasks, can further improve purpose-specific visual tasks with less labeled data. For instance, domain adaptation methods [18, 9] attempt to gain knowledge from source tasks and then apply it to a different but related target task. It is assumed that the knowledge learned from the source tasks can help the learning of the target task.
Driven by these transfer learning methods, one may ask: how can we measure the relations between different tasks? Existing methods that offer partial solutions to this problem can be categorized into two groups. In the first group, a wide variety of transfer learning methods simply assume that tasks are related or unrelated based on human intuition or experience. For example, the knowledge gained from car recognition could be applied to truck recognition because cars intuitively look like trucks. However, one drawback of these methods is that human intuition can differ from machine learning principles. Negative transfer [16] can happen when human intuition is unreliable, in which case the source domain data leads to reduced performance in the target domain. Moreover, when the number of source domains is very large, it is hard to directly tell which one is best for transfer learning.
In the second group, some methods attempt to jointly optimize multiple tasks and estimate task relations by cross-validation. For example, the taskonomy method [20] computes an affinity matrix among tasks based on whether the solution for one task can be sufficiently easily read out of the representation trained for another task. It uses transfer networks for the first-order transfer of 26 tasks. However, this pipeline requires a large amount of computation to jointly train all of the subsets of a task set. When a new task comes, it must be jointly trained with the old tasks, which strongly limits the applications of this approach in real-world scenarios.
To address these drawbacks of existing methods, we propose a novel Learnable Parameter Similarity (LPS) method that learns second-order similarity to measure task relations by using trained task-specific models. Our observation is that the distance between intra-task models is smaller than that between inter-task models. Let $\mathcal{T}$ denote a set of tasks. For each task $T_i \in \mathcal{T}$, we repeat the training procedure $m$ times and thus obtain $m$ trained models. We then use these task-specific models as meta-data points to train a second-order neural network to measure the parameter similarity.
Different from existing transfer learning methods, the LPS method measures task relations using task-specific models that are trained on independent task-specific datasets, without jointly optimizing many subsets of a task set. LPS pays attention to higher-order semantics/concepts, as illustrated in Figure 1. Data points produce a data similarity metric; data similarity metrics produce a parameter similarity metric. If data is the zero-order similarity and data similarity is the first-order similarity, then parameter similarity can be regarded as the second-order similarity. LPS also differs from learning-to-learn methods: the former learns second-order similarity based on first-order similarity, while the latter learns hyperparameters to know how to learn, which still focuses on the optimization of the first-order similarity.
Overall, the key contributions of this paper are:

We propose a novel Learnable Parameter Similarity (LPS) method that learns second-order similarity to measure task relations by using trained task-specific models instead of jointly training a large number of transfer networks.

We introduce a hierarchical second-order network to deal with high-dimensional unaligned deep models and learn an effective parameter representation.

We create a parameter similarity learning benchmark called ModelSet500, and extensive experiments on ModelSet500 validate the effectiveness of the proposed method.
2 Related work
2.1 Transfer Learning
Transfer learning [14, 2, 13, 16, 8, 18, 9, 12, 6] aims to transfer knowledge from a source domain to a target domain and has already achieved significant success in many areas, including classification, regression, and clustering. Many approaches simply assume that source and target tasks are related or unrelated. For example, Tzeng et al. [18] proposed to transfer knowledge from RGB-image-based classification to depth-image-based classification. Liang et al. [9] proposed to transfer person re-identification knowledge from one scene to another. In addition, some methods jointly optimize multiple tasks and estimate task relations by cross-validation. For example, Rosenstein et al. [16] proposed to detect and avoid negative transfer using very little data from the target task and empirically showed that dissimilar tasks may hurt the performance of the target task. Zamir et al. [20] proposed to transfer multiple different tasks to target tasks based on an affinity matrix of transferabilities. These approaches either assume that tasks that are intuitively related have a positive impact on transfer learning or attempt to learn task relations by jointly optimizing multiple labeled datasets. Different from these methods, the proposed method learns task relations from trained task-specific models.
2.2 Meta-learning/Learning to Learn
Meta-learning, also known as learning to learn, is a subfield of machine learning that focuses on automated learning algorithms. Recently, many approaches [21, 1, 4, 10, 11] have aimed to automatically search hyperparameters by using learning-to-learn models. For example, Zoph and Le [21] used a recurrent network to generate the model descriptions of neural networks and trained the RNN with reinforcement learning to maximize the expected accuracy of the generated architectures on a validation set. Liu et al. [10] used a sequential model-based optimization strategy to search for structures in order of increasing complexity, while simultaneously learning a surrogate model to guide the search through structure space. Andrychowicz et al. [1] attempted to learn an LSTM-based neural optimizer that learns how to optimize neural networks. Finn et al. [4] introduced a meta-learning method based on learning easily adaptable model parameters through gradient descent. However, these methods aim to automatically learn a better first-order similarity by optimizing hyperparameters, while the proposed method learns second-order similarity.
3 Second-Order Similarity
We first consider data similarity, which is also termed first-order similarity in this paper. For a learning task, let $\mathcal{D} = \{x_1, \dots, x_N\}$ denote a training dataset with $N$ data points. Let $L(\theta; \mathcal{D})$ be an objective function with parameters $\theta$ on $\mathcal{D}$. The objective function is optimized by standard gradient descent as follows:

$\theta_{t+1} = \theta_t - \alpha_t \nabla_\theta L(\theta_t; \mathcal{D})$,   (1)

where $\alpha_t$ is a step size multiplier at step $t$. After a sequence of updates, the algorithm converges to a minimizer $\theta^*$. With the parameter $\theta^*$, the learning task can measure the similarity between data points. In this way, given a set of learning tasks, we can obtain a set of trained task-specific parameters. The question now is: how do we measure the similarity between different task-specific parameters?
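The update in Eq. (1) can be sketched in a few lines; the quadratic objective and the constant step sizes below are illustrative stand-ins for a real task loss, not part of the paper's setup.

```python
# Minimal sketch of the gradient descent update in Eq. (1), using a
# 1-D quadratic objective L(theta) = (theta - 3)^2 as a stand-in for
# the task loss; the step sizes alpha_t are illustrative choices.

def gradient_descent(grad, theta0, alphas):
    """Apply theta_{t+1} = theta_t - alpha_t * grad(theta_t)."""
    theta = theta0
    for alpha in alphas:
        theta = theta - alpha * grad(theta)
    return theta

# L(theta) = (theta - 3)^2, so grad L = 2 * (theta - 3).
grad = lambda theta: 2.0 * (theta - 3.0)
theta_star = gradient_descent(grad, theta0=0.0, alphas=[0.25] * 50)
print(round(theta_star, 4))  # converges to the minimizer theta* = 3
```

Each trained task yields such a minimizer; LPS treats these minimizers as data points of their own.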
Data similarity/metric learning aims to minimize the distance between intra-class data points and maximize the distance between inter-class data points. Similarly, the goal of parameter similarity/metric learning is to minimize the distance between intra-task models and maximize the distance between inter-task models, as illustrated in Figure 1. Intra-task models are obtained by repeating the training procedure many times for each task.
The intra-task models are diverse and contain rich statistical semantics due to the intractable non-convexity of deep networks and the coupled relationship between data points and parameters in the convolution operation. Data points and parameters share the same semantics. For example, car classification and bus classification share similar models, while car classification and bird classification could produce dissimilar models. That is, the distance between car classification and bus classification models is smaller than that between car classification and bird classification models. Learning to distinguish intra-task and inter-task parameters can uncover the underlying parameter patterns of local solutions and thus provides a parameter similarity metric for task relations.
Let $\mathcal{T}$ denote a set of $n$ tasks $\{T_1, \dots, T_n\}$. For each task $T_i$, we train $m$ deep models with task label $i$. We obtain a model set $\mathcal{M} = \{\theta^*_{i,j}\}$ with $nm$ models from the $n$ tasks, where $i = 1, \dots, n$ and $j = 1, \dots, m$. Each $\theta^*_{i,j}$ is a trained model, which can be regarded as a meta-data point. We then use these meta-data points to train the second-order similarity model by

$\phi_{t+1} = \phi_t - \beta_t \nabla_\phi L'(\phi_t; \mathcal{M})$,   (2)

where $L'(\phi; \mathcal{M})$ is a second-order similarity objective function with parameters $\phi$ on $\mathcal{M}$ and $\beta_t$ is a step size multiplier at step $t$.
Second-order similarity learning naturally differs from existing learning-to-learn methods. First, the goal of second-order similarity learning is to learn parameter similarity based on first-order similarity, while learning-to-learn methods learn the hyperparameters of an optimizer to navigate data similarity learning, which is naturally first-order similarity learning. Second, second-order similarity learning optimizes model parameters and second-order model parameters (different from "hyperparameters") separately, while learning-to-learn methods optimize both model parameters and hyperparameters simultaneously. Such a disjoint optimization guarantees that the trained deep models have converged to good local solutions.
4 Second-Order Neural Networks
In this section, we introduce a hierarchical second-order neural network to learn an effective parameter representation for second-order similarity, as shown in Figure 2. Specifically, the second-order network consists of multiple branches that deal with the different semantic levels of the layers. For each branch, we attempt to align the meta-data and then project it into a common space for parameter similarity learning with a fully connected layer. Finally, we weight the losses of the different branches to control their relative importance.
Although the idea of parameter similarity learning is similar to data similarity learning, it is a challenging problem because of the high-dimensional meta-data points (far higher-dimensional than images, e.g., 96M for ResNet-50), the changeable order of filters and their channels at a layer, and hierarchical semantics.
One of the most challenging problems is that the order of the filters and their channels changes when the training procedure is repeated. Therefore, directly using fully connected layers to project these meta-data points into a common space cannot achieve good performance. We show below that the order of the filters and their channels at a layer can be shuffled across different training runs.
Consider the convolution operation in visual tasks. Let $X^l$ denote the feature maps of size $c_l \times h_l \times w_l$ at the $l$-th convolutional layer, where $X^l_i$ is the $i$-th feature map of $X^l$ ($i = 1, \dots, c_l$), and let $*$ be the convolution operator. Let $W^l$ denote the filters of size $c_{l+1} \times c_l \times k \times k$ at the $l$-th convolutional layer, where $W^l_j$ is the $j$-th filter of $W^l$ ($j = 1, \dots, c_{l+1}$) and $W^l_{j,i}$ is the $i$-th channel of $W^l_j$. Each filter $W^l_j$ produces a feature map $X^{l+1}_j$. The convolution operation is given by

$X^{l+1}_j = \sum_{i=1}^{c_l} W^l_{j,i} * X^l_i$.   (3)

In this way, $c_{l+1}$ filters produce $c_{l+1}$ feature maps as output. We can see that the convolution operation is independent of the order of the filters $W^l_j$: shuffling the filters at the $l$-th layer naturally shuffles the channels of $X^{l+1}$, but no informative features are lost, because the convolution operation for each filter at a layer is symmetric. Therefore, the order of the $X^{l+1}_j$ is changeable. We perform the next convolution operation by

$X^{l+2}_j = \sum_{i=1}^{c_{l+1}} W^{l+1}_{j,i} * X^{l+1}_i$.   (4)

According to Eq. (3), a shuffled $W^l$ leads to a shuffled $X^{l+1}$. According to Eq. (4), the channel order of $X^{l+2}$ is determined by the order of $X^{l+1}$ and the channel order of the filters $W^{l+1}_j$. Because the order of $X^{l+1}$ changes, the channel order of the filters $W^{l+1}_j$ has to change accordingly if we fix the channel order of the output $X^{l+2}$. For example, if we exchange two filters $W^l_{j_1}$ and $W^l_{j_2}$ and also exchange channels $j_1$ and $j_2$ of $W^{l+1}_j$ for every $j$, the output of the network is unchanged. That is, a large number of filter/channel combinations produce exactly the same output.
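This symmetry can be checked numerically. The sketch below uses 1x1 convolutions so each layer reduces to a matrix product, and shows that shuffling the filters of one layer together with the corresponding input channels of the next layer's filters leaves the output unchanged; all shapes and the random seed are illustrative.

```python
import numpy as np

# Sketch of the filter-order symmetry behind Eqs. (3)-(4): permuting the
# filters of layer l and applying the same permutation to the input
# channels of the layer-(l+1) filters leaves the network output unchanged.

rng = np.random.default_rng(0)
x = rng.standard_normal(4)        # input with 4 channels (1x1 spatial size)
W1 = rng.standard_normal((6, 4))  # layer l:   4 -> 6 channels
W2 = rng.standard_normal((3, 6))  # layer l+1: 6 -> 3 channels

out = W2 @ (W1 @ x)               # original forward pass

perm = rng.permutation(6)         # shuffle the 6 filters of layer l ...
W1_shuf = W1[perm]                # ... and the matching input channels
W2_shuf = W2[:, perm]             # of the layer-(l+1) filters
out_shuf = W2_shuf @ (W1_shuf @ x)

print(np.allclose(out, out_shuf))  # True: many parameter orders, same output
```

Any of the 6! filter permutations here yields identical network behavior, which is exactly why raw parameter vectors of intra-task models cannot be compared element-wise.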
Let $o^{\mathrm{filt}}_l$ and $o^{\mathrm{chan}}_l$ denote the filter order and the channel order of $W^l$, respectively. We define the order chain rule of convolutional filters as follows:

- If $l = 1$, then $o^{\mathrm{chan}}_1$ is fixed because it is constrained by the channels of natural images, e.g., RGB, and $o^{\mathrm{filt}}_1$ determines $o^{\mathrm{chan}}_2$.

- If $1 < l < L$, then $o^{\mathrm{chan}}_l$ is determined by $o^{\mathrm{filt}}_{l-1}$, while $o^{\mathrm{filt}}_l$ determines $o^{\mathrm{chan}}_{l+1}$.

- If $l = L$, then $o^{\mathrm{chan}}_L$ is determined by $o^{\mathrm{filt}}_{L-1}$, and $o^{\mathrm{filt}}_L$ is fixed because it is constrained by the loss of the neural network (i.e., the labels).
The order chain rule describes an ideal case in which intra-task models differ only in the changeable order of filters and their channels. In fact, intra-task deep models are also affected by hierarchical semantics and non-convex optimization. Therefore, given two deep models with a shared task label, it is difficult to align the changeable order of filters and their channels, which confuses parameter similarity learning.
To address this problem, we propose to transform the filters $W^l$ of different intra-task models into a "standard" order so as to align these model parameters. Let $V^l$ denote a set of $c'_l$ second-order filters at layer $l$; $W^l$ and $V^l$ can have different numbers of filters at a layer. For every filter $j$, every second-order filter $j'$, and every channel $i$, we first compute the Frobenius inner product between the filter maps and the second-order filter maps by

$S^l_{j,j',i} = \langle W^l_{j,i}, V^l_{j'} \rangle_F$,   (5)

where $\langle \cdot, \cdot \rangle_F$ denotes the Frobenius inner product. In Eq. (5), we compute the Frobenius inner product between all pairs of $W^l$ and $V^l$ filter maps. Without any prior information, we cannot directly compute the standard channel order $o^{\mathrm{chan}}_l$, so we estimate the standard order based on the max matching between $W^l$ and $V^l$:

$s^l_j = \max_{j'} \max_{i} S^l_{j,j',i}$.   (6)

When $l = 1$, we do not need to align $o^{\mathrm{chan}}_1$ by the max operation over channels. For every $j$ and $j'$, we directly compute

$s^1_j = \max_{j'} \langle W^1_j, V^1_{j'} \rangle_F$.   (7)

For every layer $l$, we then estimate the standard filter order by using the sort operation

$\hat{o}^{\mathrm{filt}}_l = \mathrm{argsort}_j\, s^l_j$,   (8)

where $\hat{o}^{\mathrm{filt}}_l$ gives the new filter order indexes. Based on the order chain rule, $o^{\mathrm{filt}}_L$ is fixed when $l = L$, so we do not need to align $W^L$.
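The alignment steps in Eqs. (5)-(8) can be sketched as follows. Since the exact pairing scheme is not fully specified here, the max-matching and sorting below are one plausible reading rather than the authors' verified formulation; all shapes and values are illustrative.

```python
import numpy as np

# Toy sketch of the alignment step: score each first-order filter against a
# set of second-order filters with Frobenius inner products (Eq. (5)), take
# a max over channel pairings when the channel order is unknown (Eq. (6)),
# and sort filters by their best response for a "standard" order (Eq. (8)).

rng = np.random.default_rng(1)
W = rng.standard_normal((5, 3, 3, 3))  # 5 filters, 3 channels, 3x3 kernels
V = rng.standard_normal((4, 3, 3))     # 4 single-channel second-order filters

# Eq. (5): Frobenius inner products between every (filter channel, V filter).
S = np.einsum('jihw,phw->jip', W, V)   # shape (5, 3, 4)

# Eq. (6): max matching over the unknown channel order, then over V filters.
scores = S.max(axis=1).max(axis=1)     # best response per filter, shape (5,)

# Eq. (8): sorting responses yields the new filter order indexes.
new_order = np.argsort(-scores)
W_aligned = W[new_order]

print(new_order)
```

Because every intra-task model is sorted against the same second-order filters, models that differ only by a filter permutation end up in comparable orders.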
After aligning $W^l$, we add a fully connected layer to project the aligned parameter representation into a common space. Finally, we train the second-order model with a conventional metric learning loss, e.g., the cross-entropy loss [17] or the triplet loss [3]. In this paper, we simply use the cross-entropy loss for the model similarity measure:

$L = -\sum_{b=1}^{B} \lambda_b \sum_{i=1}^{n} y_i \log p^b_i$,   (9)

where $y_i$ is the $i$-th dimensional value of the one-hot label $y$, $p^b_i$ represents the probability of the $i$-th visual task at the $b$-th branch, and $\lambda_b$ controls the relative importance of the branches, with $\sum_{b=1}^{B} \lambda_b = 1$. When measuring the parameter similarity between two deep models, we extract parameter features from the last fully connected layer of the second-order network. We normalize the parameter features and use the cosine similarity to compute the parameter similarity.
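The final similarity measure (normalized features compared by cosine similarity) can be sketched as follows; the feature vectors are illustrative placeholders for parameter features extracted from the second-order network.

```python
import math

# Sketch of the final similarity measure: parameter features from the
# second-order network are compared with cosine similarity (normalization
# is implicit in the formula). Feature vectors below are illustrative.

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

feat_car = [1.0, 2.0, 0.5]    # hypothetical car-classification model feature
feat_bus = [1.1, 1.9, 0.4]    # hypothetical bus-classification model feature
feat_bird = [-2.0, 0.1, 3.0]  # hypothetical bird-classification model feature

# Related tasks should score higher than unrelated ones.
print(cosine_similarity(feat_car, feat_bus) >
      cosine_similarity(feat_car, feat_bird))  # True
```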
5 Experiments
In this section, we conduct extensive experiments to validate the effectiveness of the proposed method. All experiments are conducted with PyTorch [15].
Dataset and Modelset. The CIFAR-100 dataset [7] has 100 classes containing 600 images each; there are 500 training images and 100 testing images per class.
To measure model-based parameter similarity, we create a model set called ModelSet500 based on CIFAR-100. Specifically, we split the 100 classes into 50 groups that serve as 50 visual tasks with task labels 0-49. Each task contains 2 classes. For each task, we repeat the training procedure 10 times to obtain 10 deep models. In total, we obtain 500 deep models for the 50 tasks.
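The construction of ModelSet500 can be sketched as follows; the actual model training is stubbed out with placeholder identifiers, and the consecutive-class grouping is an illustrative assumption.

```python
# Sketch of the ModelSet500 construction: CIFAR-100's 100 classes are split
# into 50 two-class tasks, and each task is trained 10 times, giving 500
# (model, task-label) meta-data points. Training itself is stubbed out.

classes = list(range(100))
tasks = [classes[2 * t: 2 * t + 2] for t in range(50)]  # 50 tasks, 2 classes each

model_set = []
for task_label, class_pair in enumerate(tasks):
    for run in range(10):  # repeat training to obtain 10 intra-task models
        model_id = f"task{task_label:02d}_run{run}"  # stand-in for trained weights
        model_set.append((model_id, task_label))

print(len(tasks), len(model_set))  # 50 500
```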
First-Order and Second-Order Network Implementation. We implement a ResNet-20 network [5] as the first-order network in our experiments. The network consists of one convolutional layer, three residual blocks, one global average pooling layer, and one fully-connected layer. Each residual block of ResNet-20 consists of 6 convolutional layers (excluding the convolutional layers used for downsampling in the residual blocks). When computing the parameter similarity measure, we omit the last fully-connected layer and the downsampling convolutional layers. We use SGD with a mini-batch size of 128. The learning rate starts from 0.1 and is divided by 10 after 80 epochs and 120 epochs, respectively. We train ResNet-20 for 160 epochs. We use a weight decay of 0.0001 and a momentum of 0.9. We set the padding size to 4 and perform random cropping and random horizontal flipping.
When implementing the second-order network, we use SGD with a mini-batch size of 1. The learning rate starts from 0.001 and is divided by 10 after 40 epochs and 80 epochs, respectively. We train for 100 epochs. We set the loss weights so that only the first branch is used in Sections 5.1, 5.2, and 5.3 because the first branch plays a major role in Eq. (9). We analyze the performance of the different branches in Sections 5.4 and 5.5.
5.1 Task classification
We first adopt the conventional image classification evaluation metric for task classification on ModelSet500. The goal of task classification is to predict the task labels of trained models. For each task, we sample eight deep models for training and use the other two for testing. Therefore, the training set contains 400 deep models and the test set contains 100 deep models. We train the second-order network to classify the 50 task classes.
To evaluate the effectiveness of the second-order network, we set three baselines: (i) random prediction without any second-order model, (ii) only a fully connected layer, and (iii) the Frobenius inner product without aligning the filters and their channels. Compared with these three baselines, the proposed method achieves much better performance. As shown in Table 1, our method achieves 79.8% top-1 accuracy while all three baselines are below 6%. The main reason may be that the changeable order of filters and their channels confuses the second-order neural network.
Table 1: Task classification on ModelSet500.

methods | top-1 (%) | top-5 (%) | top-10 (%)
random prediction | 2.0 | 9.6 | 18.3
only one FC layer | 5.3 | 17.5 | 29.3
Frobenius+FC, w/o alignment | 5.5 | 18.3 | 29.0
Frobenius+FC, w/ alignment (Ours) | 79.8 | 94.8 | 98.0
5.2 Task retrieval
We then adopt the conventional image retrieval evaluation metric for task retrieval on ModelSet500. Given a query deep model, the goal of task retrieval is to retrieve the target deep models from a gallery. We split ModelSet500 into a training set and a test set; the training set includes 40 tasks and the test set includes 10 tasks. For each task in the test set, we sample two deep models as query models while the others are used as gallery models. Task retrieval is challenging and practical in real-world scenarios; for example, it can be used to retrieve similar visual tasks for transfer learning. Different from task classification, we train the second-order network to classify the 40 training task classes and extract the output of the fully connected layer as the parameter representation during the test phase. We use the normalized parameter representation for task retrieval.
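The rank-k evaluation used for task retrieval can be sketched as follows; the similarity scores and labels below are illustrative, not measured values.

```python
# Sketch of the task-retrieval protocol: each query model is ranked against
# the gallery by parameter similarity, and rank-k accuracy counts queries
# whose top-k gallery models share the query's task label.

def rank_k_accuracy(sim_rows, query_labels, gallery_labels, k):
    hits = 0
    for sims, q_label in zip(sim_rows, query_labels):
        ranked = sorted(range(len(sims)), key=lambda g: -sims[g])
        if any(gallery_labels[g] == q_label for g in ranked[:k]):
            hits += 1
    return hits / len(query_labels)

# 2 query models vs. a 4-model gallery; each row holds one query's scores.
sims = [[0.9, 0.2, 0.1, 0.3],   # query 0: best match is gallery model 0
        [0.1, 0.3, 0.8, 0.2]]   # query 1: best match is gallery model 2
q_labels = [0, 1]
g_labels = [0, 0, 1, 1]
print(rank_k_accuracy(sims, q_labels, g_labels, k=1))  # 1.0
```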
We set the same three baselines as in task classification. Compared with these baselines, the proposed method again performs far better: as shown in Table 2, our method achieves 80.7% rank-1 accuracy while the best baseline reaches only 15.0%. The reason is the same as in Section 5.1.
Table 2: Task retrieval on ModelSet500.

methods | rank-1 (%) | rank-5 (%) | rank-10 (%)
random prediction | 10.0 | 41.0 | 65.1
only one FC layer | 15.0 | 55.8 | 83.3
Frobenius+FC, w/o alignment | 13.3 | 58.3 | 80.0
Frobenius+FC, w/ alignment (Ours) | 80.7 | 95.0 | 95.8
5.3 Task transferability
To validate that the parameter similarity can be used to measure task transferability, we conduct an experiment studying task transferability with respect to parameter similarity. We use the task retrieval setting to obtain the parameter representation, as discussed in Section 5.2. We evaluate task transferability by training first-order models on one task (a labeled source domain) and applying these models to another task (an unlabeled target domain) for data feature extraction and k-means clustering. We use the adjusted Rand index (ARI) [19] for evaluation, which is widely used in cluster analysis.
As shown in Figure (a), although there are some outliers, we observe that parameter similarity is roughly proportional to task transferability (ARI). In most cases, the higher the parameter similarity score between the source and target tasks, the better the task transferability they achieve.
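The ARI used above can be computed directly from the contingency table; the sketch below follows the standard definition [19], with two toy labelings for illustration.

```python
import math
from collections import Counter

# Sketch of the adjusted Rand index (ARI) used to score k-means clusters on
# the target task against ground-truth labels; toy labelings for illustration.

def adjusted_rand_index(labels_true, labels_pred):
    n = len(labels_true)
    contingency = Counter(zip(labels_true, labels_pred))
    a = Counter(labels_true)   # row sums of the contingency table
    b = Counter(labels_pred)   # column sums of the contingency table
    index = sum(math.comb(c, 2) for c in contingency.values())
    sum_a = sum(math.comb(c, 2) for c in a.values())
    sum_b = sum(math.comb(c, 2) for c in b.values())
    expected = sum_a * sum_b / math.comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    return (index - expected) / (max_index - expected)

print(adjusted_rand_index([0, 0, 1, 1], [0, 0, 1, 1]))  # 1.0 for a perfect match
```

ARI is corrected for chance, so random cluster assignments score near 0 while a perfect (even relabeled) clustering scores 1.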
5.4 Effect of different branches
To show the relative importance of the different branches for the parameter similarity measure, we conduct an experiment on ModelSet500 that keeps one branch while removing the others. That is, when evaluating the $b$-th branch, we set $\lambda_b = 1$ and $\lambda_{b'} = 0$ for all $b' \neq b$. We use the task classification setting. Different from the other experiments, the learning rate in this experiment is divided by 10 after 150 epochs and 225 epochs, respectively. We train each branch for 300 epochs because it takes more time to align the order of both the filters and their channels.
As shown in Figure (b), the first branch is the most important one and achieves 79.8% top-1 accuracy. From the 2nd to the 13th branch, the second-order network can still learn a weakly discriminative parameter representation, since the results are higher than the top-1 accuracy of random prediction, i.e., 2.0% (50 classes). However, from the 13th to the 19th branch, the second-order network cannot learn a discriminative parameter representation, since the results drop to nearly 2.0% top-1 accuracy. The reason could be that the order chain rule leads to cumulative alignment errors at higher layers.
Besides, it is still hard to fuse all of the branches to further improve performance because the results of the 2nd to 19th branches are much lower. This is why we only use the first branch for task classification and retrieval.
5.5 Effectiveness of the alignment methods
To show the effectiveness of the alignment methods, we conduct an experiment that isolates the different alignment methods, i.e., filter alignment (filt.) and channel alignment (chan.). The training setting is the same as in Section 5.4. We only analyze the 2nd to 19th branches because the 1st branch contains only one unaligned case, i.e., unaligned filters (see Eq. (7)). As shown in Figure (c), both alignment methods are important for parameter similarity learning.
6 Advantages and disadvantages
The proposed method comes with both advantages and disadvantages compared with existing methods. The advantages are that the learnable parameter similarity model does not need to jointly train a large number of transfer networks to estimate task relations, and that the second-order network does not need to align convolutional filters if the first-order network contains no convolutional layers, as in some non-image scenarios. Learnable parameter similarity can also be used to quickly retrieve related tasks from a model market (set) for transfer learning.
The disadvantages are that the proposed learnable parameter similarity cannot measure the similarity between different network architectures, and it is still hard to fuse all of the branches to further improve performance. The reason could be that the alignment at higher layers is still unsatisfactory.
7 Conclusion
In this paper, we present a Learnable Parameter Similarity (LPS) method to reveal the underlying relations between different visual tasks. Our approach learns task relations by using task-specific trained deep models instead of jointly training a large number of transfer networks. We present a hierarchical second-order network to deal with high-dimensional unaligned deep models. We evaluate the LPS method on a new parameter similarity learning benchmark, ModelSet500, and extensive experiments show the effectiveness of the proposed method. Future work will further explore the fusion of different layers, the alignment of the convolutional filters at higher layers, and a parameter similarity metric for heterogeneous networks.
References
 [1] Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando De Freitas. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, pages 3981–3989, 2016.
 [2] Yoshua Bengio. Deep learning of representations for unsupervised and transfer learning. In Proceedings of the 2011 International Conference on Unsupervised and Transfer Learning Workshop - Volume 27, pages 17–37. JMLR.org, 2011.
 [3] Shengyong Ding, Liang Lin, Guangrun Wang, and Hongyang Chao. Deep feature learning with relative distance comparison for person re-identification. Pattern Recognition, 48(10):2993–3003, 2015.
 [4] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1126–1135. JMLR.org, 2017.
 [5] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
 [6] Judy Hoffman, Eric Tzeng, Taesung Park, Jun-Yan Zhu, Phillip Isola, Kate Saenko, Alexei A Efros, and Trevor Darrell. CyCADA: Cycle-consistent adversarial domain adaptation. arXiv preprint arXiv:1711.03213, 2017.
 [7] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
 [8] Bin Li, Qiang Yang, and Xiangyang Xue. Transfer learning for collaborative filtering via a rating-matrix generative model. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 617–624. ACM, 2009.
 [9] Wenqi Liang, Guangcong Wang, Jianhuang Lai, and Junyong Zhu. M2M-GAN: Many-to-many generative adversarial transfer learning for person re-identification. arXiv preprint arXiv:1811.03768, 2018.
 [10] Chenxi Liu, Barret Zoph, Maxim Neumann, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan Huang, and Kevin Murphy. Progressive neural architecture search. In Proceedings of the European Conference on Computer Vision (ECCV), pages 19–34, 2018.
 [11] Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: Differentiable architecture search. arXiv preprint arXiv:1806.09055, 2018.
 [12] Ming-Yu Liu and Oncel Tuzel. Coupled generative adversarial networks. In Advances in Neural Information Processing Systems, pages 469–477, 2016.
 [13] Mingsheng Long, Han Zhu, Jianmin Wang, and Michael I Jordan. Deep transfer learning with joint adaptation networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 2208–2217. JMLR.org, 2017.
 [14] Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on knowledge and data engineering, 22(10):1345–1359, 2010.
 [15] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017.
 [16] Michael T Rosenstein, Zvika Marx, Leslie Pack Kaelbling, and Thomas G Dietterich. To transfer or not to transfer. In NIPS 2005 workshop on transfer learning, volume 898, pages 1–4, 2005.
 [17] Yi Sun, Xiaogang Wang, and Xiaoou Tang. Deep learning face representation from predicting 10,000 classes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1891–1898, 2014.
 [18] Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. Adversarial discriminative domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7167–7176, 2017.
 [19] Nguyen Xuan Vinh, Julien Epps, and James Bailey. Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. Journal of Machine Learning Research, 11(Oct):2837–2854, 2010.
 [20] Amir R Zamir, Alexander Sax, William Shen, Leonidas J Guibas, Jitendra Malik, and Silvio Savarese. Taskonomy: Disentangling task transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3712–3722, 2018.
 [21] Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.