# Structured Discriminative Tensor Dictionary Learning for

Unsupervised Domain Adaptation

###### Abstract

Unsupervised Domain Adaptation (UDA) addresses the problem of performance degradation caused by the domain shift between training and testing sets, which is common in computer vision applications. Most existing UDA approaches operate on vector-form data, although the typical format of data or features in visual applications is a multi-dimensional tensor. Moreover, current methods, including deep network approaches, assume that abundant labeled source samples are provided for training. In practice, however, the number of labeled source samples is often limited due to expensive annotation costs, leading to sub-optimal performance. In this paper, we propose to seek discriminative representations for multi-dimensional data by learning a structured dictionary in tensor space. The dictionary separates domain-specific information from class-specific information to guarantee that the representation is robust to domain shift. In addition, a pseudo-label estimation scheme is combined with discriminant analysis in the algorithm iteration to avoid designing an external classifier. We perform extensive experiments on different datasets with limited source samples. Experimental results demonstrate that the proposed method outperforms state-of-the-art approaches.

Songsong Wu, Yan Yan, Hao Tang, Jianjun Qian, Jian Zhang, Xiao-Yuan Jing

Nanjing University of Posts and Telecommunications,
Texas State University

University of Trento,
Nanjing University of Science and Technology

University of Technology Sydney,
Wuhan University

sswu@njupt.edu.cn, tomyan@txstate.edu, hao.tang@unitin.it, csjqian@njust.edu.cn, Jian.Zhang@uts.edu.au, jingxy2000@126.com

## 1 Introduction

A typical assumption in learning-based visual recognition is that training and test data obey an identical distribution because they belong to the same domain. In practical applications, this assumption is easily violated due to the distribution divergence between training data from the source domain and test data from the target domain. Such domain shift [?] is a universal issue in applications such as image recognition under varying lighting conditions and camera shooting angles, challenging traditional recognition models. Domain adaptation [?] addresses this issue by training the model with data from both domains so as to transfer discriminative knowledge from the source to the target.

Based on the amount of labeled samples available in the target domain, domain adaptation can be performed in two scenarios [?]: semi-supervised domain adaptation (SDA) and unsupervised domain adaptation (UDA). In SDA, a small number of target samples carry class labels, so the key is to learn a discriminative model with the assistance of labeled source samples. In UDA, target labels are unavailable, so one must model the distributional relation between the domains to achieve cross-domain recognition. In this paper, we aim to tackle the domain shift problem in the UDA scenario, which is more challenging and more widespread in reality.

Instance adaptation [?; ?] assigns importance weights to source samples in the objective function to match the data distributions of the source and target domains. This principle works well only when the support of the target distribution is contained in that of the source distribution. Feature adaptation seeks domain-invariant representations of samples so that their distributions coincide and discriminative information is preserved. Domain-invariant features can be obtained through linear projection [?; ?], kernel mapping [?; ?], sparse coding [?; ?; ?], and metric learning [?; ?]. Classifier adaptation retrains a predefined classifier by learning its parameters to guarantee good generalization in the target domain [?; ?]. Besides the aforementioned shallow-learning-based domain adaptation methods, domain adaptation via deep learning [?; ?; ?; ?; ?; ?] achieves notable improvement and is becoming increasingly popular. Deep DA methods extract nonlinear domain-invariant features and train domain-robust classifiers in an end-to-end manner.

Most shallow UDA methods treat data as vectors, meaning that multi-dimensional data such as images and videos, or their features, need to be converted into vectors beforehand. This operation incurs several obstacles to domain adaptation: (1) vectorization breaks the internal structure of data, which has been demonstrated to be essential for recognition [?]; (2) vectorization increases the risk of model over-fitting because the resulting vector is typically long. Deep-learning-based domain adaptation methods face the same loss of structure information, because feature maps from convolutional layers are flattened into vectors before being fed into the fully connected layers. In addition, the number of parameters in the fully connected layers becomes large when a feature map is transformed from tensor to vector, increasing the over-fitting risk of the deep model, especially when training data are insufficient.
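To make the over-fitting concern concrete, a rough parameter count can be sketched. The numbers below are illustrative assumptions (a 512×7×7 feature map and a 4096-unit dense layer), not figures taken from the paper:

```python
# Parameter count of a dense layer on a flattened conv feature map versus
# one square factor matrix per mode of the same tensor (illustrative only).
c, h, w, out = 512, 7, 7, 4096
dense_params = c * h * w * out          # flatten, then fully connect
factor_params = c * c + h * h + w * w   # mode-wise factor matrices
print(dense_params, factor_params)      # 102760448 262242
```

The flattened dense layer needs several hundred times more parameters than the mode-wise factorization, which is the intuition behind tensor-based modeling under limited training data.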

To address the aforementioned issues, we propose a Structured Discriminative Tensor Dictionary Learning (SDTDL) approach for unsupervised domain adaptation. SDTDL seeks data representations that are discriminative and robust to domain shift by separating the domain factor and the class factor in tensor space (Fig. 1). Specifically, a sample is factorized into a domain part and a class part, characterized by a domain-specific sub-dictionary and a class-specific sub-dictionary, respectively. The resulting representation is a block-diagonal sparse tensor whose nonzero blocks consist of the domain-specific representation and the class-specific representation. Classification is accomplished based on the reconstruction error associated with the class-specific representation.

Overall, our main contributions are threefold: (1) we propose a discriminative dictionary learning approach based on a tensor model for UDA; the method preserves the internal structure information of data and is able to tackle the small-sample-size problem. (2) we model the domain factor and the class factor separately to build a structured dictionary that guarantees both the discriminativeness and the domain invariance of the feature. (3) extensive experiments on object recognition and digit recognition tasks demonstrate that the proposed SDTDL outperforms existing shallow methods and achieves competitive results compared with deep learning approaches.

## 2 Related Work

Feature adaptation methods based on shallow learning include feature augmentation, feature alignment and feature transformation. [?] and GFK [?] are two representative feature augmentation methods that use intermediate subspaces to model domain shift. Subspace Alignment (SA) [?] extracts linear features by aligning the subspaces of the source and target domains. The feature alignment idea is extended in CORAL [?] through covariance recoloring. Feature transformation methods seek a common latent feature space in which source samples and target samples are indistinguishable. The features can be obtained by linear projection [?; ?] or nonlinear mapping [?]. Most recently, TAISL [?] was proposed to learn tensor-form features via Tucker decomposition, and is the work most closely related to ours. In contrast, the proposed SDTDL is able to exploit the valuable label information in source samples and does not need to train a classifier, which improves both performance and efficiency in UDA.

Recently, deep convolutional neural network (CNN) based methods have been developed with promising performance. The Domain-Adversarial Neural Network (DANN) [?] combines a CNN with adversarial learning to achieve end-to-end unsupervised domain adaptation. DDC [?] learns two feature extractors, for the source and target domains respectively, with a GAN. DIFA [?] extends the feature augmentation principle to generative adversarial networks. As deep UDA methods require a large number of samples for parameter training, their effectiveness tends to be limited in small-sample-size scenarios. By comparison, the proposed SDTDL is more suitable for addressing the small-sample-size problem in domain adaptation, as demonstrated by the experimental results in Sec. 5.

## 3 Notations and Background

| Symbol | Description | Symbol | Description |
|---|---|---|---|
| , | Tensor samples | | Class labels |
| , | Tensor dictionaries | , | Factor matrices |
| , | Sparse coefficients | | Mode- flattening of |
| | The stack of and | | Product of with |
| | Identity matrix | | Vector with all ones |

Tensor Preliminaries. Table 1 lists the symbols used in this paper. An -th order tensor is an -dimensional data array, with elements denoted as . The squared Frobenius norm of is defined as . The mode- flattening of reorders its elements into a matrix . The -mode product of a tensor with a matrix , denoted as , performs matrix multiplication along the -th mode; it can be computed equivalently by mode- flattening, matrix multiplication, and a re-tensorization that undoes the flattening. For conciseness and clarity, we denote the product of a tensor with a set of matrices by

(1)

Similarly, we define .
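As a concrete illustration of these operations, the following NumPy sketch implements mode-n flattening, its inverse, and the n-mode product. The function names are our own, not notation from the paper:

```python
import numpy as np

def unfold(X, mode):
    """Mode-n flattening: move axis `mode` to the front, then reshape to a matrix."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def fold(M, mode, shape):
    """Inverse of `unfold`: re-tensorize a mode-n flattened matrix."""
    full_shape = [shape[mode]] + [s for i, s in enumerate(shape) if i != mode]
    return np.moveaxis(M.reshape(full_shape), 0, mode)

def mode_product(X, U, mode):
    """n-mode product: matrix-multiply U against the mode-n flattening, then re-fold."""
    new_shape = list(X.shape)
    new_shape[mode] = U.shape[0]
    return fold(U @ unfold(X, mode), mode, new_shape)

X = np.random.randn(3, 4, 5)
U = np.random.randn(6, 4)           # acts on mode 1 (size 4 -> 6)
Y = mode_product(X, U, mode=1)
print(Y.shape)                      # (3, 6, 5)
```

The flatten-multiply-refold route is exactly the equivalence stated above; the result agrees elementwise with contracting U against the chosen mode.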

The Tucker decomposition of tensor is defined as

(2)

where is a scalar, and is a rank-one tensor produced by the outer product of factor vectors. Given , the core tensor can be obtained as , where . Tucker decomposition can be written in matrix format as

(3)

where . Note that the factor matrix in each mode satisfies the constraint .
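A minimal way to compute a Tucker decomposition with orthogonal factor matrices is the truncated higher-order SVD (HOSVD), sketched below in NumPy. This is a generic sketch under our own helper names, not the paper's exact procedure:

```python
import numpy as np

def unfold(X, mode):
    """Mode-n flattening of a tensor into a matrix."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def multi_mode_product(X, mats):
    """Product of a tensor with one matrix per mode."""
    for n, U in enumerate(mats):
        X = np.moveaxis(np.tensordot(U, X, axes=(1, n)), 0, n)
    return X

def hosvd(X, ranks):
    """Truncated HOSVD: each factor holds leading left singular vectors of an unfolding."""
    U = [np.linalg.svd(unfold(X, n), full_matrices=False)[0][:, :r]
         for n, r in enumerate(ranks)]
    G = multi_mode_product(X, [u.T for u in U])   # core tensor
    return G, U

X = np.random.randn(8, 9, 10)
G, U = hosvd(X, ranks=(4, 4, 4))
X_hat = multi_mode_product(G, U)                  # Tucker reconstruction
print(G.shape, X_hat.shape)                       # (4, 4, 4) (8, 9, 10)
```

Because each factor consists of left singular vectors, the orthogonality constraint on the factor matrices is satisfied by construction.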

Problem Definition. A domain is composed of a feature space with a marginal probability distribution , where . A task associated with a specific domain is defined by a label space and the conditional probability distribution , where . Domain adaptation considers a source domain and a target domain satisfying , and .

In this paper, we are given a set of labeled source samples , where is a th-order tensor and is its class label. We are also given a set of unlabeled target samples . We aim to infer the class label of by learning from both the source and target samples.

## 4 The Proposed SDTDL

### 4.1 Formalization

For ease of understanding, we assume for now that the labels of target samples have been predicted, and provide the details of label prediction and target sample selection in Sec. 4.3. We select a subset of target samples, based on their prediction confidence, as additional training samples to aid model training. The selected target samples from the -th class are denoted as , and the source samples belonging to the -th class are denoted as .

We model the generation process of cross-domain data as the combination of a domain factor and a class factor, in which a sample () is factorized as

(4)

where is determined by the unique character of the domain from which is sampled and is determined by the semantic information of the class to which belongs.

In order to obtain “parsimonious” representations of -th order tensor samples, we propose to learn a structured tensor dictionary composed of factor matrices, i.e., . The structure of arises from the structure of each factor matrix. Specifically, is composed of a domain-specific sub-dictionary and a class-shared sub-dictionary , i.e., . In order to distinguish the source domain from the target domain, is further divided into a source-specific sub-dictionary and a target-specific sub-dictionary . This leads us to the following factorization of a source sample

(5)

where is the source-specific sub-dictionary, and is the domain representation of in tensor format. Similarly, we have for target sample with the target-specific sub-dictionary .

Model (4) indicates that is determined solely by the class factor, so it is safe to assume that and can be represented over a shared sub-dictionary . Motivated by the success of structured discriminative dictionary learning in image classification [?], we divide into a series of class-specific sub-dictionaries for discriminative representation. For a sample from class , its tensor representation over is given by

(6)

where is the class-specific representation. Similarly, we have , where provides the class-specific representation.

Based on the notation of and in Section 3, we define and as the representations of the source sample set and the target sample set of class over the shared sub-dictionary , respectively. In order to correct the domain shift, the class-conditional distributions of the representations in the source and target domains should be aligned. Here we adopt the Maximum Mean Discrepancy [?] to measure the distribution divergence, which gives

(7)

where and . Beyond that, the intra-class variance of the representations should be small to promote discriminativeness. To that end, the following objective is minimized for the source domain

(8)

where is produced by duplicating so that and have the same size. In the same way, should be minimized for the target domain. To satisfy both (7) and (8), we minimize the following objective

(9)
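The two criteria can be illustrated on flattened representations: a linear-kernel MMD term as in (7), i.e. the squared distance between the class-conditional means, plus intra-class variance terms as in (8). The following NumPy sketch uses hypothetical variable names and is schematic, not the paper's implementation:

```python
import numpy as np

def mean_discrepancy(Zs, Zt):
    """Squared MMD with a linear kernel: squared distance between the
    empirical means of source and target class-conditional representations."""
    return float(np.sum((Zs.mean(axis=0) - Zt.mean(axis=0)) ** 2))

def intra_class_variance(Z):
    """Sum of squared deviations of the representations from their class mean."""
    return float(np.sum((Z - Z.mean(axis=0)) ** 2))

rng = np.random.default_rng(0)
Zs = rng.normal(0.0, 1.0, size=(50, 16))   # source representations of one class
Zt = rng.normal(0.5, 1.0, size=(40, 16))   # target representations, shifted mean
objective = mean_discrepancy(Zs, Zt) + intra_class_variance(Zs) + intra_class_variance(Zt)
print(objective > 0.0)
```

Driving `mean_discrepancy` to zero aligns the class-conditional distributions across domains, while small `intra_class_variance` keeps each class compact, matching the combined objective (9).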

By considering the above criteria together, our learning model can be written as

(10)

s. t.

The first and second terms are the fidelity of the reconstruction over the structured tensor dictionary. The third term can be viewed as discriminant analysis of the representation. The parameter determines the weighting of the target domain relative to the source domain, and trades off the fidelity term against the discriminative term. The constraints require that the factor matrices in each mode are orthogonal.

### 4.2 Optimization

In this section, we solve model (10) using an alternating optimization strategy, in which we seek the optimal solution for certain variables while keeping all the others fixed at their values from the previous iteration, until convergence.

Optimize . With fixed, the fidelity loss with regard to class can be written as , where . Considering all the source samples, model (10) becomes

(11)

s. t.

Model (11) is a typical best rank- tensor approximation problem that can be solved by the HOOI algorithm [?].
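For reference, HOOI (Higher-Order Orthogonal Iteration) can be sketched in a few lines of NumPy. This is a generic best rank-(r1, …, rN) approximation routine under our own names, not the paper's exact solver:

```python
import numpy as np

def unfold(X, mode):
    """Mode-n flattening of a tensor into a matrix."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def mode_product(X, U, mode):
    """n-mode product of tensor X with matrix U."""
    return np.moveaxis(np.tensordot(U, X, axes=(1, mode)), 0, mode)

def hooi(X, ranks, n_iter=10):
    """Best rank-(r1,...,rN) approximation via Higher-Order Orthogonal Iteration."""
    # Initialize factors with a truncated HOSVD.
    U = [np.linalg.svd(unfold(X, n), full_matrices=False)[0][:, :r]
         for n, r in enumerate(ranks)]
    for _ in range(n_iter):
        for n in range(X.ndim):
            Y = X
            for m in range(X.ndim):
                if m != n:
                    Y = mode_product(Y, U[m].T, m)   # project all modes except n
            U[n] = np.linalg.svd(unfold(Y, n), full_matrices=False)[0][:, :ranks[n]]
    G = X
    for n in range(X.ndim):
        G = mode_product(G, U[n].T, n)               # core tensor
    return G, U

X = np.random.default_rng(0).normal(size=(8, 9, 10))
G, U = hooi(X, ranks=(4, 4, 4))
print(G.shape)                                       # (4, 4, 4)
```

Each mode update is the optimal choice with the other factors fixed, so the reconstruction error is non-increasing over iterations; HOOI therefore refines the HOSVD initialization.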

Optimize . In the same way as in (11), the target-specific dictionary and the domain-specific representations of target samples are obtained by applying HOOI to the following optimization problem,

(12)

s. t.

where .

Optimize . We seek the optimal sub-dictionary class by class, giving the model for class as

(13)

s. t.

where , .

We adopt the alternating optimization strategy in [?] to update and alternately. With and fixed, the optimal is provided by Theorem 1. With fixed, the optimal is given by and .

###### Theorem 1.

Let be the augmented sample tensor of class , generated by concatenating and along the -th mode. Define the matrix as

(14) |

where and are identity matrices, and and are column vectors of all ones. Let be the mode- flattening matrix of . Then the optimal in (13) is given by , with columns the eigenvectors corresponding to the first largest eigenvalues of the following eigenvalue problem

(15) |

The proof is given in Appendix A.
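Computationally, the final step of Theorem 1 is a standard symmetric eigendecomposition. A sketch follows, using a random stand-in for the symmetric matrix (the actual matrix is built from the mode-flattened augmented tensor, which we do not reproduce here):

```python
import numpy as np

def top_eigenvectors(S, d):
    """Return, as columns, the eigenvectors of a symmetric matrix S belonging
    to its d largest eigenvalues. np.linalg.eigh returns eigenvalues in
    ascending order, so we reverse before truncating."""
    _, V = np.linalg.eigh(S)
    return V[:, ::-1][:, :d]

# Random stand-in for the symmetric matrix of the theorem's eigenvalue problem.
rng = np.random.default_rng(0)
A = rng.normal(size=(10, 40))
S = A @ A.T                      # symmetric positive semi-definite
U = top_eigenvectors(S, d=3)
print(U.shape)                   # (10, 3)
```

The orthogonality constraint on the factor matrix is then satisfied automatically, since eigenvectors of a symmetric matrix are orthonormal.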

### 4.3 Label Prediction and Sample Selection

The probability of belonging to class can be computed based on the fidelity error, i.e.

(16) |

where is the parameter of the exponential function, whose value is set to the median value of the denominator. The posterior probability can also be computed based on the deviation of from the centroid of class . Thus we have

(17) |

is adopted to replace for two reasons: (1) is more reliable because it is computed according to the true source labels; (2) it helps alleviate the domain shift by pulling the target sample towards the corresponding class center in the source domain.

Through a convex combination of the two kinds of probabilities, we can ultimately predict the class label of by

(18) |

In order to select target samples with reliable pseudo-labels for training, we sort the values in descending order and add the target samples with the highest posterior probabilities into the training sample set. The ratio of selected target samples to the whole target sample set is a parameter of our model.
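The prediction-and-selection scheme of (16)-(18) can be sketched as follows. All names (`recon_err`, `centroid_dist`, `eta`, `select_ratio`) are hypothetical, and the error-to-probability mapping is a simplified reading of the text (exponential of the negative error with a median-valued bandwidth, normalized over classes):

```python
import numpy as np

def pseudo_labels(recon_err, centroid_dist, eta=0.5, select_ratio=0.6):
    """recon_err[i, c]: class-c reconstruction error of target sample i.
    centroid_dist[i, c]: its distance to the class-c source centroid.
    Returns predicted labels and indices of the most confident samples."""
    def to_prob(errs):
        sigma = np.median(errs)                     # bandwidth = median error
        p = np.exp(-errs / sigma)
        return p / p.sum(axis=1, keepdims=True)     # normalize over classes
    p = eta * to_prob(recon_err) + (1 - eta) * to_prob(centroid_dist)
    labels = p.argmax(axis=1)
    conf = p.max(axis=1)
    n_keep = int(np.ceil(select_ratio * len(labels)))
    keep = np.argsort(-conf)[:n_keep]               # most confident first
    return labels, keep

rng = np.random.default_rng(0)
err = rng.random((8, 3))                            # 8 target samples, 3 classes
dist = rng.random((8, 3))
labels, keep = pseudo_labels(err, dist)
print(labels.shape, len(keep))                      # (8,) 5
```

Only the samples indexed by `keep` would re-enter training, which is what limits the damage done by wrongly pseudo-labeled targets.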

### 4.4 Initialization

The initialization process includes the following three steps. In step 1, the class-specific dictionary is initialized by structured discriminative dictionary learning (e.g., [?]) based on the labeled source samples, followed by computing the class-wise sparse codings and . The domain-specific dictionary is then initialized through (11). In step 2, the target labels are predicted by (18) without the influence of , i.e., setting in (16). Note that at this stage, although the estimated target labels may deviate from the actual ones, they provide a reasonable starting point for the iteration because of the underlying correlation between source and target. In step 3, we select a subset of target samples with their estimated labels to initialize the target-specific dictionary through (12). The proposed method is summarized in Algorithm 1.

## 5 Experiments

### 5.1 Experimental Setup

Datasets. We employ two public datasets to evaluate the proposed method. (1) The Office+Caltech dataset is released by [?]; it consists of images of object classes from domains, i.e., Amazon (A), Webcam (W), Dslr (D) and Caltech (C). We randomly select labeled images per class from Webcam/DSLR/Caltech and from Amazon as source samples, respectively, following [?]. We run different trials and report the average recognition accuracy and its standard deviation. For fair comparison, we use the tensor data provided by [?], produced by the CONV5_3 layer of the VGG-16 model. For the other methods, we report the results from the literature. (2) To evaluate the performance of the methods in small-sample-size settings, we adopt the USPS+MNIST dataset released by [?], which consists of digit images from USPS and digit images in MNIST from to . These two domains lead to two DA tasks. The tensor samples are produced by the CONV5_3 layer of a VGG-16 model pre-trained with all the data in MNIST.

Baseline Models. The proposed SDTDL is compared with seven competitive UDA methods, i.e., No Adaptation (NA), TCA [?], GFK [?], DIP [?], SA [?], LTSL [?], LSSA [?], and three state-of-the-art UDA methods, i.e., CORAL [?], TAISL [?] and JGSA [?]. For the digit recognition task, two deep UDA methods, DANN [?] and DDC [?], are added to the comparison to evaluate SDTDL in the small-sample-size scenario.

Parameter Settings. The optimal parameters of SDTDL are set empirically via grid search. Specifically, for object recognition the parameters are set as , , , , ; for digit recognition they are set as , , , , . The parameters of the other methods in the comparison are set according to the corresponding papers.

### 5.2 Experimental Results

Feature Visualization. To qualitatively evaluate the discriminativeness and the robustness to domain shift of the features extracted by SDTDL, we visualize the feature embeddings for the domain pair Webcam to Caltech (W→C). We compare SDTDL with CORAL, TAISL and JGSA in terms of the D scatter plot given by t-distributed stochastic neighbor embedding (t-SNE) [?]. Fig. 2 (a-d) illustrate the visualized distributions of the features of source and target samples. The features extracted by SDTDL are more prone to form separate clusters associated with the categories than those of the other baselines. For both source and target samples, the intra-class scatter is small and the inter-class scatter is large, indicating that SDTDL is able to guarantee that the features are discriminative. Besides, the distributions of source samples and target samples are aligned for each category, which suggests that our method can suppress the interference of the domain factor with the transfer of discriminative information from the source domain to the target domain.

Method | C→A | C→W | C→D | A→C | A→W | A→D | W→C | W→A | W→D | D→C | D→A | D→W | MEAN
---|---|---|---|---|---|---|---|---|---|---|---|---|---
NA | 89.0(2.0) | 79.4(2.7) | 86.2(4.0) | 77.3(1.8) | 74.6(3.1) | 82.8(2.2) | 63.7(2.1) | 74.0(2.5) | 94.9(2.4) | 70.5(1.9) | 81.1(1.9) | 91.1(1.7) | 80.4
TCA | 78.1(6.1) | 69.0(6.6) | 74.3(5.2) | 56.7(4.5) | 55.5(6.4) | 59.9(6.7) | 54.7(3.8) | 68.3(4.1) | 90.6(3.2) | 51.9(2.2) | 61.2(4.2) | 89.9(2.2) | 67.5
GFK | 87.6(2.3) | 81.9(4.9) | 84.8(4.5) | 75.1(3.9) | 74.3(5.2) | 81.4(4.3) | 79.1(2.7) | 84.0(4.4) | 95.2(2.2) | 82.2(2.4) | 90.4(1.4) | 92.8(2.2) | 84.1
DIP | 84.8(4.3) | 73.5(4.9) | 82.8(7.7) | 59.8(5.7) | 45.5(9.1) | 52.2(8.1) | 65.2(4.5) | 69.3(6.9) | 94.1(3.1) | 61.9(6.3) | 76.4(3.7) | 90.9(2.3) | 71.4
SA | 82.0(2.6) | 65.9(4.0) | 73.7(4.3) | 67.7(4.2) | 61.1(5.1) | 67.8(4.8) | 70.4(4.1) | 80.1(4.3) | 91.1(3.3) | 66.9(3.3) | 77.4(6.0) | 87.3(3.1) | 74.3
LTSL | 87.5(2.8) | 75.3(4.2) | 82.3(4.1) | 70.2(2.4) | 66.7(4.6) | 77.7(4.6) | 59.1(4.4) | 66.6(5.7) | 90.0(3.8) | 60.8(3.1) | 69.2(4.5) | 86.0(2.9) | 74.3
LSSA | 86.4(1.7) | 45.4(6.6) | 73.5(2.3) | 80.3(2.3) | 84.0(1.7) | 90.9(1.7) | 29.5(7.0) | 86.6(4.5) | 85.8(4.7) | 65.9(6.5) | 92.3(0.6) | 93.4(2.2) | 76.2
CORAL | 80.3(1.9) | 63.8(3.1) | 62.1(3.0) | 77.6(1.2) | 61.2(2.4) | 64.3(2.9) | 66.6(2.2) | 69.1(2.6) | 82.8(2.8) | 72.0(1.7) | 74.2(2.2) | 89.6(1.6) | 72.0
TAISL | 90.0(1.9) | 85.3(3.1) | 90.6(1.9) | 80.1(1.4) | 77.9(2.6) | 85.1(2.2) | 82.6(2.2) | 85.6(3.5) | 97.7(1.5) | 84.0(1.0) | 87.6(2.1) | 95.9(1.0) | 86.9
JGSA | 87.0(0.8) | 69.4(6.7) | 77.3(7.0) | 79.6(1.2) | 67.8(4.8) | 76.3(6.1) | 81.4(1.0) | 87.1(0.7) | 96.9(1.8) | 82.2(0.7) | 88.5(0.8) | 94.9(1.0) | 82.1
SDTDL | 94.8(3.2) | 89.5(4.4) | 90.4(4.7) | 86.4(2.5) | 82.8(5.7) | 88.8(3.6) | 84.4(2.2) | 91.7(3.6) | 97.9(1.7) | 83.9(1.1) | 92.1(1.4) | 98.1(1.2) | 90.1

Recognition Accuracy. Table 2 shows that SDTDL achieves the highest accuracy in pairs out of and gains an improvement in average accuracy of over the best comparison method. We observe that in C→D and D→C, our method is a close second to the best results ( vs. and vs. , respectively). The leading performance of SDTDL over other vector-based UDA methods indicates that the internal information of high-dimensional visual data is indeed crucial to cross-domain recognition, and that SDTDL effectively preserves this useful internal information. We also observe that SDTDL outperforms TAISL in all pairs, which demonstrates that the proposed method is able to restrain the interference of domain factors and facilitate the discriminativeness of the feature. In Table 3, we can see that SDTDL outperforms both the competitive shallow and deep UDA methods on the digit datasets. On the one hand, this demonstrates the strong power of SDTDL for discriminative domain-invariant feature extraction. On the other hand, the results validate the advantage of SDTDL over other methods when large numbers of training samples are unavailable for cross-domain recognition.

Small Sample Size Scenarios. We evaluate the performance of SDTDL in addressing the small-sample-size problem through the cross-domain recognition tasks W→C and MNIST→USPS. For W→C, random samples per class from domain W and all the target samples of domain C are selected to compose the dataset. As shown in Fig. 2 (e-f), SDTDL outperforms the other three methods when labeled source samples are limited, suggesting that SDTDL can achieve knowledge transfer between domains even when few labeled samples are available. We also note that SDTDL underperforms when only one source sample per class is available; in this case, each source sample's deviation from its class mean becomes zero, thwarting the discriminative term in model (10). In addition, we select all the source samples and target samples per class to simulate the scenario of small sample size in the target domain. Fig. 2 (f) shows that SDTDL offers advantages over the other three competitive shallow methods when the number of target samples is limited. For MNIST→USPS, random source samples per class and all the target samples are selected to compose the dataset. Fig. 2 (g-h) show that SDTDL outperforms the other three competitors when source samples are scarce in cross-domain digit recognition. Besides, the advantage of SDTDL over TAISL in recognition accuracy demonstrates that the structured discriminative dictionary learning strategy of SDTDL can effectively address the small-sample-size problem in cross-domain recognition.

Method | TCA | GFK | SA | JDA | CORAL | TAISL | JGSA | DANN | DDC | SDTDL
---|---|---|---|---|---|---|---|---|---|---
 | 56.3 | 61.2 | 67.8 | 67.2 | 83.6 | 83.0 | 82.3 | 77.1 | 79.1 | 90.7
 | 51.2 | 46.5 | 48.8 | 59.7 | 78.5 | 82.6 | 87.8 | 73.0 | 66.5 | 89.1
MEAN | 53.8 | 53.9 | 58.3 | 63.4 | 81.1 | 82.8 | 85.1 | 75.1 | 72.8 | 89.9

Parameter Sensitivity Analysis. We investigate the parameter sensitivity of SDTDL w.r.t. the target domain weighting parameter , the intra-class variance parameter and the target sample selection parameter . Fig. 2 (i-j) validate that SDTDL achieves stable performance over a wide range of settings for and . The observation from Fig. 2 (k) is two-fold: (1) a large proportion of target samples should be selected in SDTDL to ensure that samples from each category are provided for training; (2) the proportion should be controlled within a certain range to prevent the negative effects of false labels.

Convergence Analysis. We evaluate the convergence property of SDTDL by checking the prediction accuracy on target samples in each iteration. Fig. 2 (l) shows that prediction accuracy increases along with the dictionary learning process, indicating that the dictionary becomes increasingly transferable and discriminative. This also demonstrates the effectiveness of our pseudo-label selection strategy in model training. Besides, we observe that the dictionary evolution reaches the balance between domain robustness and discriminativeness within iterations in most cases.

Dictionary Property Analysis. To demonstrate the efficacy of the learned domain-specific and class-specific sub-dictionaries in extracting domain information and class information, we analyze the reconstructed samples associated with the two sub-dictionaries. Concretely, we apply SDTDL to domain adaptation from MNIST to USPS (M→U) and compare the original images with the domain-specific and class-specific reconstructions. From the results in Fig. 1, we observe that the images in (b) and (e) contain more domain information, e.g., the light typeface style of MNIST and the boldface style of USPS, than category information of the digits. We also note that the images in (d) and (f) contain far more category information than typeface information. The results demonstrate that the sub-dictionaries learned by SDTDL can separately capture the domain factor and extract class information from data under domain shift.

## 6 Conclusion

Previous unsupervised domain adaptation methods vectorize multi-dimensional data in advance, leading to the loss of internal information that is critical to visual recognition applications. Besides, most existing methods rely on the assumption of plentiful samples, which rarely holds in practice. In this paper, we propose to learn a structured discriminative dictionary using a tensor model. The dictionary is composed of multi-linear factor matrices, providing the capability to represent tensors. Moreover, the domain-specific information and class-specific information of the cross-domain samples are depicted by the corresponding sub-dictionaries, respectively. Our method shows strong power of feature extraction through knowledge transfer between domains, not only in the traditional domain adaptation setting, but also in the rarely explored setting of limited samples.

## References

- [Aja-Fernández et al., 2009] Santiago Aja-Fernández, Rodrigo de Luis García, Dacheng Tao, and Xuelong Li. Tensors in Image Processing and Computer Vision. Springer, 2009.
- [Aljundi et al., 2015] Rahaf Aljundi, Rémi Emonet, Damien Muselet, and Marc Sebban. Landmarks-based kernelized subspace alignment for unsupervised domain adaptation. In CVPR, 2015.
- [Baktashmotlagh et al., 2013] Mahsa Baktashmotlagh, Mehrtash Tafazzoli Harandi, Brian C. Lovell, and Mathieu Salzmann. Unsupervised domain adaptation by domain invariant projection. In ICCV, 2013.
- [Ben-David et al., 2010] Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. ML, 79(1-2):151–175, 2010.
- [Bousmalis et al., 2016] Konstantinos Bousmalis, George Trigeorgis, Nathan Silberman, Dilip Krishnan, and Dumitru Erhan. Domain separation networks. In NIPS, 2016.
- [Duan et al., 2009] Lixin Duan, Ivor W. Tsang, Dong Xu, and Tat-Seng Chua. Domain adaptation from multiple sources via auxiliary classifiers. In ICML, 2009.
- [Fernando et al., 2013] Basura Fernando, Amaury Habrard, Marc Sebban, and Tinne Tuytelaars. Unsupervised visual domain adaptation using subspace alignment. In ICCV, 2013.
- [Ganin and Lempitsky, 2015] Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation. In ICML, 2015.
- [Gong et al., 2012] Boqing Gong, Yuan Shi, Fei Sha, and Kristen Grauman. Geodesic flow kernel for unsupervised domain adaptation. In CVPR, 2012.
- [Gopalan et al., 2011] Raghuraman Gopalan, Ruonan Li, and Rama Chellappa. Domain adaptation for object recognition: An unsupervised approach. In ICCV, 2011.
- [Gretton et al., 2012] A. Gretton, K. Borgwardt, M. Rasch, B. Schölkopf, and A. Smola. A kernel two-sample test. JMLR, 13:723–773, 2012.
- [Herath et al., 2017] Samitha Herath, Mehrtash Tafazzoli Harandi, and Fatih Porikli. Learning an invariant hilbert space for domain adaptation. In CVPR, 2017.
- [Hoffman et al., 2018] Judy Hoffman, Eric Tzeng, Taesung Park, Jun-Yan Zhu, Phillip Isola, Kate Saenko, Alexei A. Efros, and Trevor Darrell. Cycada: Cycle-consistent adversarial domain adaptation. In ICML, 2018.
- [Kolda and Bader, 2009] Tamara G. Kolda and Brett W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, 2009.
- [Kulis et al., 2011] Brian Kulis, Kate Saenko, and Trevor Darrell. What you saw is not what you get: Domain adaptation using asymmetric kernel transforms. In CVPR, 2011.
- [Lathauwer et al., 2000] Lieven De Lathauwer, Bart De Moor, and Joos Vandewalle. On the best rank-1 and rank-(r1,r2,. . .,rn) approximation of higher-order tensors. SIAM JMAA, 21(4):1324–1342, 2000.
- [Long et al., 2014] Mingsheng Long, Jianmin Wang, Guiguang Ding, Jiaguang Sun, and Philip S. Yu. Transfer joint matching for unsupervised domain adaptation. In CVPR, 2014.
- [Long et al., 2015] Mingsheng Long, Yue Cao, Jianmin Wang, and Michael Jordan. Learning transferable features with deep adaptation networks. In ICML, 2015.
- [Lu et al., 2017] Hao Lu, Lei Zhang, Zhiguo Cao, Wei Wei, Ke Xian, Chunhua Shen, and Anton van den Hengel. When Unsupervised Domain Adaptation Meets Tensor Representations. In ICCV, 2017.
- [Mansour et al., 2009] Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain adaptation with multiple sources. In NIPS. 2009.
- [Pan and Yang, 2010] S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, Oct 2010.
- [Pan et al., 2011] S. J. Pan, I. W. Tsang, J. T. Kwok, and Q. Yang. Domain adaptation via transfer component analysis. TNN, 22(2):199–210, 2011.
- [Patel et al., 2015] V. M. Patel, R. Gopalan, R. Li, and R. Chellappa. Visual domain adaptation: A survey of recent advances. Signal Processing Magazine, 32(3):53–69, 2015.
- [Saito et al., 2018] Kuniaki Saito, Kohei Watanabe, Yoshitaka Ushiku, and Tatsuya Harada. Maximum classifier discrepancy for unsupervised domain adaptation. In CVPR, 2018.
- [Shao et al., 2014] Ming Shao, Dmitry Kit, and Yun Fu. Generalized transfer subspace learning through low-rank constraint. IJCV, 109(1-2):74–93, 2014.
- [Shekhar et al., 2013] Sumit Shekhar, Vishal M. Patel, Hien Van Nguyen, and Rama Chellappa. Generalized domain-adaptive dictionaries. In CVPR, 2013.
- [Sun et al., 2016] Baochen Sun, Jiashi Feng, and Kate Saenko. Return of frustratingly easy domain adaptation. In AAAI, 2016.
- [Tang et al., 2018] Hao Tang, Heng Wei, Wei Xiao, Wei Wang, Dan Xu, Yan Yan, and Nicu Sebe. Deep micro-dictionary learning and coding network. In WACV, 2018.
- [Tzeng et al., 2017] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell. Adversarial discriminative domain adaptation. In CVPR, 2017.
- [van der Maaten and Hinton, 2008] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. JMLR, 9:2579–2605, 2008.
- [Volpi et al., 2018] Riccardo Volpi, Pietro Morerio, Silvio Savarese, and Vittorio Murino. Adversarial feature augmentation for unsupervised domain adaptation. In CVPR, 2018.
- [Xu et al., 2018] Ruijia Xu, Ziliang Chen, Wangmeng Zuo, Junjie Yan, and Liang Lin. Deep cocktail network: Multi-source unsupervised domain adaptation with category shift. In CVPR, 2018.
- [Yang et al., 2011] Meng Yang, Lei Zhang, Xiangchu Feng, and David Zhang. Fisher discrimination dictionary learning for sparse representation. In ICCV, 2011.
- [Yang et al., 2018] Baoyao Yang, Andy Jinhua Ma, and Pong C. Yuen. Domain-shared group-sparse dictionary learning for unsupervised domain adaptation. In AAAI, 2018.
- [Yu and Szepesvári, 2012] Yaoliang Yu and Csaba Szepesvári. Analysis of kernel mean matching under covariate shift. In ICML, 2012.
- [Zhang et al., 2017] Jing Zhang, Wanqing Li, and Philip Ogunbona. Joint geometrical and statistical alignment for visual domain adaptation. In CVPR, 2017.
- [Zhang et al., 2018] Zhen Zhang, Mianzhi Wang, Yan Huang, and Arye Nehorai. Aligning infinite-dimensional covariance matrices in reproducing kernel hilbert spaces for domain adaptation. In CVPR, 2018.

## Appendix A Proof of Theorem 1

The proof of Theorem 1 in the main paper is presented in this section. The theorem, restated below, provides the solution to the following optimization problem

(19)

s. t.

###### Theorem 1 (restated).

Let be the augmented sample tensor of class , generated by concatenating and along the -th mode. Define the matrix as

(20) |

where and are identity matrices, and and are column vectors of all ones. Let be the mode- flattening matrix of . Then the optimal in (19) is given by , with columns the eigenvectors corresponding to the first largest eigenvalues of the following eigenvalue problem

(21) |

###### Proof.

Based on formulas (4.3) and (4.4) in [?], we have

(22) |

Similarly, we have

(23) |

Defining , we can derive the following equivalence

(24) |

Taking (22), (23) and (24) into account, the optimization problem (19) is equivalent to the following problem

(25) |

To improve readability, we define the intermediate variable

(26) |

So far, we can obtain the optimal factor matrix for each mode by solving the following optimization problem

(27) |

where is the mode- flattening matrix of . According to the Lagrange multiplier method, the optimal solution of (27) is , with columns the eigenvectors corresponding to the first largest eigenvalues of the eigenvalue problem (21). ∎

## Appendix B Dictionary Property Analysis

In this section, we provide additional experimental results to demonstrate the efficacy of the learned domain-specific and class-specific sub-dictionaries in extracting domain information and class information. We apply SDTDL to the task of transferring from USPS to MNIST (U→M), in which we compare the original images with the domain-specific and class-specific reconstructions. The results in Fig. 3 demonstrate that the domain-specific sub-dictionary and the class-specific sub-dictionary learned by SDTDL are able to extract the domain information and class information from cross-domain data, respectively.