Generalized Adaptive Dictionary Learning via Domain Shift Minimization
Abstract
Visual-data-driven dictionaries have been successfully employed for various object recognition and classification tasks. However, the task becomes more challenging when the training and test data come from contrasting domains. In this paper, we propose a novel and generalized approach to learning an adaptive dictionary common to multiple domains. Specifically, we project the data from different domains onto a low-dimensional space while preserving the intrinsic structure of the data from each domain, and we minimize the domain shift between the data of each pair of domains. Simultaneously, we learn a common adaptive dictionary. Our algorithm can also be modified to learn class-specific dictionaries, which can be used for classification. We additionally propose a discriminative manifold regularization which imposes the intrinsic structure of class-specific features onto the sparse coefficients. Experiments on image classification show that our approach fares better than existing state-of-the-art methods.
1 Introduction
The study of sparse representation of signals has received enormous interest in recent years. The idea behind sparse representation is to approximate a signal by a combination of very few elements from an overcomplete set of bases called a dictionary; i.e., any natural signal can be reconstructed by a sparse combination of elements of an overcomplete dictionary. Much of the earlier work on sparse representation was devoted to building dictionaries from off-the-shelf or parametric bases. The notion of building a dictionary from data instead of a predefined set of bases was studied by Olshausen and Field [14] in their seminal work. Data-driven dictionaries have since yielded encouraging results in tasks such as restoration [3], super-resolution [26, 23] and classification [25].
The effectiveness of these dictionaries in such a diverse range of applications can be attributed to their superior ability to adapt to a particular set of data. However, we might encounter situations in which the target data has a distribution different from the data used to train the dictionary. Such situations occur frequently in many computer vision problems, e.g., changes in the resolution, illumination or pose of images, and often lead to degraded classification performance [2]. Learning dictionaries that are adaptive to these changes is a challenging task which has been garnering increased interest of late. Earlier works focused on learning a dictionary for each domain; Jia et al. [10] considered such a case. But the dimension of the features is often high, so learning a dictionary for each domain is cumbersome and computationally expensive, making it infeasible for many practical applications.
The idea of adapting classifiers to new domains has attracted a tremendous amount of interest recently, and a number of methods [19, 11, 5, 9] have been proposed. Jhuo et al. [9] proposed learning a transformation of the source data onto the target space such that the joint representation is low-rank. However, they do not effectively utilize the labeled data to learn the projections. Han et al. [8] learned a shared embedding for different domains with a sparsity constraint on the representation. However, they treat the embedding of the data onto a common domain as a separate step rather than a joint one and assume pre-learned projections, which may not result in optimal performance. Among dictionary-based methods, Yang et al. [26] and Wang et al. [23] proposed learning dictionary pairs for cross-modal synthesis. Qiu et al. [17] proposed learning adaptive dictionaries for smooth domain shifts using regression. However, in practice, domain shifts are wide and often result in abrupt changes among features (e.g., the increase in resolution from a webcam image to a DSLR image). Shekhar et al. [21] jointly projected the data onto a low-dimensional space while preserving the manifold structure of the data from each domain, and learned a common adaptive dictionary for multiple domains, which can also be modified to learn discriminative dictionaries. However, the projected data may still exhibit a significant shift between the domain distributions, which may not result in an optimal solution.
Considering the above challenges, we present a robust method that learns a common dictionary adapted to both source and target data. As the dimension of the features may vary across domains, we project the data onto a common low-dimensional space by learning a projection matrix for each domain. In the process, we preserve the intrinsic geometry of the data from each domain and minimize the shift across the domains. Simultaneously, we learn an efficient and compact dictionary common to both domains. Since our final goal is classification, we extend our framework to learning class-specific discriminative dictionaries. We additionally propose a discriminative manifold regularization, which imposes the intrinsic structure of class-specific features onto the sparse coefficients obtained in the dictionary learning step.
Our joint learning framework offers several advantages in terms of generalizability. First, learning domain specific projection matrices makes it easy to handle changes in feature dimensions. It also makes our algorithm kernelizable. Second, learning the dictionary in a low dimensional space makes our algorithm faster and tractable. It also helps in discarding any redundant information present in the original features. Further, our method can be generalized to handle data from multiple domains. We present an efficient optimization approach to solve our problem, which has simple update steps.
The paper is organized as follows. In Section 2, we formulate our dictionary learning framework, and the optimization scheme is described in Section 3. The evaluation approach using test data is described in Section 4. Experimental results are presented in Section 5, and Section 6 concludes our work.
2 Learning Framework
The classic dictionary learning problem minimizes the representation error of the given data samples subject to a sparsity constraint. Let $\mathbf{Y} \in \mathbb{R}^{n \times N}$ be the data matrix. Then the dictionary $\mathbf{D} \in \mathbb{R}^{n \times K}$ with $K$ atoms can be obtained by solving the following problem:
$$\min_{\mathbf{D}, \mathbf{X}} \; \|\mathbf{Y} - \mathbf{D}\mathbf{X}\|_F^2 \quad \text{s.t.} \quad \|\mathbf{x}_i\|_0 \le T_0 \;\; \forall i,$$
where $\mathbf{X}$ is a sparse representation matrix of $\mathbf{Y}$ over $\mathbf{D}$ and $T_0$ is the sparsity level. The $\ell_0$ pseudo-norm $\|\cdot\|_0$ counts the number of nonzero elements in a vector and $\|\cdot\|_F$ is the Frobenius norm of a matrix.
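The alternating structure of this problem can be sketched with a greedy OMP sparse-coding step and a MOD-style dictionary update. This is a minimal illustration under those assumptions, not the solver used in the paper, and the function names are ours:

```python
import numpy as np

def omp(D, y, T0):
    """Greedy Orthogonal Matching Pursuit: select up to T0 atoms of D to approximate y."""
    residual, support = y.copy(), []
    for _ in range(T0):
        j = int(np.argmax(np.abs(D.T @ residual)))   # most correlated atom
        support.append(j)
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
    x = np.zeros(D.shape[1])
    x[support] = coef
    return x

def dict_learn(Y, K, T0, iters=10, seed=0):
    """Alternate sparse coding (OMP) and a MOD dictionary update D = Y X^+."""
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((Y.shape[0], K))
    D /= np.linalg.norm(D, axis=0)
    for _ in range(iters):
        X = np.stack([omp(D, y, T0) for y in Y.T], axis=1)
        D = Y @ np.linalg.pinv(X)                    # least-squares dictionary update
        D /= np.linalg.norm(D, axis=0) + 1e-12       # renormalize atoms
    return D, X
```

Each column of the returned code matrix has at most $T_0$ nonzero entries, matching the constraint above.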
We consider a case where we have data from two domains, $\mathbf{Y}_1 \in \mathbb{R}^{n_1 \times N_1}$ and $\mathbf{Y}_2 \in \mathbb{R}^{n_2 \times N_2}$. Our goal is to find projection matrices $\mathbf{P}_1$ and $\mathbf{P}_2$ which map $\mathbf{Y}_1$ and $\mathbf{Y}_2$ onto a common low-dimensional space, and simultaneously learn a common dictionary $\mathbf{D}$ for both domains. We enforce an orthonormality constraint, $\mathbf{P}_i\mathbf{P}_i^T = \mathbf{I}$, on the projection matrices $\mathbf{P}_1$ and $\mathbf{P}_2$ in order to prevent the solution from becoming degenerate. We will see later that this particular assumption paves the way for an efficient optimization approach.
While bringing the data from the two domains to a low-dimensional space, it is desirable that the projections preserve much of the information available in the original domains. To facilitate such preservation, we minimize the following cost function, which includes a manifold regularization [1] term for the data from each domain:
$$C_1(\mathbf{P}_1, \mathbf{P}_2) = \mathrm{tr}\big((\mathbf{P}_1\mathbf{Y}_1)\,\mathbf{L}_1\,(\mathbf{P}_1\mathbf{Y}_1)^T\big) + \mathrm{tr}\big((\mathbf{P}_2\mathbf{Y}_2)\,\mathbf{L}_2\,(\mathbf{P}_2\mathbf{Y}_2)^T\big),$$
where $\mathrm{tr}(\cdot)$ is the trace of a matrix and $\mathbf{L}_1$, $\mathbf{L}_2$ are the normalized graph-Laplacian matrices associated with the nearest-neighborhood graphs constructed from the data matrices $\mathbf{Y}_1$, $\mathbf{Y}_2$ respectively.
The above cost function enforces the condition that if two points in a domain are close to each other in the original space, they remain close to each other in the projected space as well. This is known as the manifold assumption [1], which has been used successfully in nonlinear dimensionality reduction and semi-supervised learning techniques [1].
To make the learned dictionary adaptive to both domains, it should capture the commonality among them. But the data from the two domains will in general have quite different distributions, so a large domain shift can remain even in the reduced space. We seek to minimize this domain shift. A natural strategy is to make the data distributions of the two domains as close as possible. In our work, we follow [6, 15, 12] and use the Maximum Mean Discrepancy (MMD) as the distance measure between the data distributions. It computes the distance between the sample means of the two projected distributions:
$$C_2(\mathbf{P}_1, \mathbf{P}_2) = \Big\| \frac{1}{N_1}\sum_{i=1}^{N_1} \mathbf{P}_1\mathbf{y}_{1,i} - \frac{1}{N_2}\sum_{j=1}^{N_2} \mathbf{P}_2\mathbf{y}_{2,j} \Big\|_2^2.$$
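In this form the MMD term reduces to a squared distance between projected sample means, which is straightforward to compute. A minimal sketch (the function name is ours):

```python
import numpy as np

def mmd_means(P1, Y1, P2, Y2):
    """Squared Euclidean distance between the projected sample means of two domains."""
    m1 = (P1 @ Y1).mean(axis=1)   # mean of projected source samples
    m2 = (P2 @ Y2).mean(axis=1)   # mean of projected target samples
    return float(((m1 - m2) ** 2).sum())
```

The term is zero exactly when the two projected means coincide, which is what the minimization drives the projections toward.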
After projecting the data onto the common low-dimensional space, we seek to minimize the following representation error:
$$C_3(\mathbf{P}_1, \mathbf{P}_2, \mathbf{D}, \mathbf{X}) = \|\mathbf{P}_1\mathbf{Y}_1 - \mathbf{D}\mathbf{X}_1\|_F^2 + \|\mathbf{P}_2\mathbf{Y}_2 - \mathbf{D}\mathbf{X}_2\|_F^2.$$
The above costs $C_1$, $C_2$, $C_3$ can be rewritten in block-matrix form as:
$$C_1 = \mathrm{tr}\big((\tilde{\mathbf{P}}\tilde{\mathbf{Y}})\,\tilde{\mathbf{L}}\,(\tilde{\mathbf{P}}\tilde{\mathbf{Y}})^T\big), \quad C_2 = \mathrm{tr}\big((\tilde{\mathbf{P}}\tilde{\mathbf{Y}})\,\mathbf{M}\,(\tilde{\mathbf{P}}\tilde{\mathbf{Y}})^T\big), \quad C_3 = \|\tilde{\mathbf{P}}\tilde{\mathbf{Y}} - \mathbf{D}\mathbf{X}\|_F^2, \tag{1}$$
where $\tilde{\mathbf{P}} = [\mathbf{P}_1 \; \mathbf{P}_2]$, $\tilde{\mathbf{L}} = \mathrm{blkdiag}(\mathbf{L}_1, \mathbf{L}_2)$ and $\mathbf{X} = [\mathbf{X}_1 \; \mathbf{X}_2]$. Here, $\tilde{\mathbf{Y}} = \mathrm{blkdiag}(\mathbf{Y}_1, \mathbf{Y}_2)$ denotes the block-diagonal matrix formed from the data matrices $\mathbf{Y}_1$ and $\mathbf{Y}_2$. The MMD matrix $\mathbf{M}$ is computed as:
$$\mathbf{M}_{ij} = \begin{cases} \frac{1}{N_1^2}, & \mathbf{y}_i, \mathbf{y}_j \in \mathbf{Y}_1, \\[2pt] \frac{1}{N_2^2}, & \mathbf{y}_i, \mathbf{y}_j \in \mathbf{Y}_2, \\[2pt] -\frac{1}{N_1 N_2}, & \text{otherwise.} \end{cases} \tag{2}$$
The overall optimization is given as:
$$\min_{\tilde{\mathbf{P}}, \mathbf{D}, \mathbf{X}} \; C_3 + \lambda_1 C_1 + \lambda_2 C_2 \quad \text{s.t.} \quad \mathbf{P}_i\mathbf{P}_i^T = \mathbf{I}, \; i = 1, 2, \qquad \|\mathbf{x}_j\|_0 \le T_0 \;\; \forall j, \tag{3}$$
where $\lambda_1$ and $\lambda_2$ are regularization weights.
The above formulation can be conveniently extended to multiple domains. For an $M$-domain problem, the block matrices can be constructed as $\tilde{\mathbf{P}} = [\mathbf{P}_1 \; \cdots \; \mathbf{P}_M]$, $\tilde{\mathbf{Y}} = \mathrm{blkdiag}(\mathbf{Y}_1, \ldots, \mathbf{Y}_M)$ and $\tilde{\mathbf{L}} = \mathrm{blkdiag}(\mathbf{L}_1, \ldots, \mathbf{L}_M)$.
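The block-matrix bookkeeping of (1)–(2) and its multi-domain extension can be sketched as follows (helper names are ours; the two-domain MMD matrix is shown). The useful identity is that $\mathrm{tr}(\mathbf{Z}\mathbf{M}\mathbf{Z}^T)$ equals the squared distance between the per-domain means of the projected data $\mathbf{Z} = \tilde{\mathbf{P}}\tilde{\mathbf{Y}}$:

```python
import numpy as np

def build_blocks(Ys, Ps):
    """Stack per-domain projections side by side and data block-diagonally,
    so that P_blk @ Y_blk = [P1 Y1, P2 Y2, ...]."""
    P = np.concatenate(Ps, axis=1)
    n = sum(Y.shape[0] for Y in Ys)
    N = sum(Y.shape[1] for Y in Ys)
    Yb = np.zeros((n, N))
    r = c = 0
    for Y in Ys:
        Yb[r:r + Y.shape[0], c:c + Y.shape[1]] = Y
        r += Y.shape[0]
        c += Y.shape[1]
    return P, Yb

def mmd_matrix(N1, N2):
    """Two-domain MMD coefficient matrix M as in (2)."""
    M = np.empty((N1 + N2, N1 + N2))
    M[:N1, :N1] = 1.0 / N1**2
    M[N1:, N1:] = 1.0 / N2**2
    M[:N1, N1:] = M[N1:, :N1] = -1.0 / (N1 * N2)
    return M
```

Since $\mathbf{M} = \mathbf{a}\mathbf{a}^T$ with $a_i = 1/N_1$ for source samples and $-1/N_2$ for target samples, the trace form and the mean-distance form of $C_2$ agree exactly.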
2.1 Manifold Regularization
To make the atoms of the dictionary respect the intrinsic structure of the data, Cai et al. [28] proposed the Graph Regularized Sparse Coding (GraphSC) method, which further exploits the manifold assumption [1]. GraphSC assumes that if two points $\mathbf{z}_i$ and $\mathbf{z}_j$ are close in the intrinsic geometry of the data on the projected space, then their sparse representations $\mathbf{x}_i$ and $\mathbf{x}_j$ are also close. Adding this regularization to the cost $C_3$:
$$C_3' = \|\tilde{\mathbf{P}}\tilde{\mathbf{Y}} - \mathbf{D}\mathbf{X}\|_F^2 + \beta\,\mathrm{tr}(\mathbf{X}\mathbf{L}_s\mathbf{X}^T), \tag{4}$$
where $\mathbf{L}_s$ is the normalized graph-Laplacian associated with the nearest-neighborhood graph formed from the data in the projected space and $\beta$ is a regularization weight.
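The effect of the regularizer in (4) can be seen on a toy pair of neighbouring points. This illustrative sketch uses an unnormalized two-node Laplacian, and the function name is ours:

```python
import numpy as np

def graphsc_cost(Z, D, X, L, beta):
    """Reconstruction error plus the graph regularizer beta * tr(X L X^T),
    which penalizes sparse codes that differ across neighbouring points."""
    return float(np.linalg.norm(Z - D @ X, 'fro') ** 2
                 + beta * np.trace(X @ L @ X.T))
```

For two connected points, identical codes make the regularizer vanish, so the cost of matching codes is strictly lower than the cost of contradictory ones.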
2.2 Discriminative Dictionaries
The dictionary learned using the above approach can reconstruct data from multiple domains well, but it cannot discriminate among the data from different classes. Following recent advances [18, 27] in learning discriminative dictionaries, we split the dictionary into class-specific dictionaries $\mathbf{D} = [\mathbf{D}_1, \ldots, \mathbf{D}_C]$, where $C$ is the total number of classes. We modify the cost function as:
$$\min_{\mathbf{D}_c, \mathbf{X}} \; \|\tilde{\mathbf{P}}\tilde{\mathbf{Y}}_c - \mathbf{D}_c\mathbf{X}_c\|_F^2 + \mu_1 \sum_{k \neq c} \|\mathbf{D}_k\mathbf{X}_k\|_F^2 + \mu_2\,\mathrm{tr}(\mathbf{X}_c\mathbf{L}_c\mathbf{X}_c^T), \tag{5}$$
where $\tilde{\mathbf{Y}}_c$ contains the data of class $c$, $\mathbf{X}_k$ is the representation of that data over the sub-dictionary $\mathbf{D}_k$, $\mathbf{L}_c$ is the graph-Laplacian built from the class-$c$ data, and the weights $\mu_1$ and $\mu_2$ influence the discriminative power of the dictionary $\mathbf{D}$. This way, we learn the dictionary of one class at a time, i.e. (5) encourages reconstruction by the dictionary of the corresponding class and penalizes reconstruction by the dictionaries of the other classes. We note that the manifold regularization term in (5) is now discrimination-aware, as it involves only the data from the corresponding class and omits the data from the other classes.
2.3 Kernelization
Due to the nonlinear structure of the data, projecting the original features directly may not be efficient. To overcome this drawback, we map the original features onto a high-dimensional space before projecting them. Let $\Phi_i$ be a mapping from the space of domain $i$ to the reproducing kernel Hilbert space $\mathcal{H}_i$, and let the projection $\mathcal{P}_i$, which maps $\mathcal{H}_i$ to the low-dimensional space, be a compact linear operator. Let $\mathbf{K}_i = \Phi_i(\mathbf{Y}_i)^T\Phi_i(\mathbf{Y}_i)$ be the kernel matrix associated with $\mathbf{Y}_i$. The representer theorem [20] states that $\mathcal{P}_i$ can be represented as
$$\mathcal{P}_i = \mathbf{A}_i\,\Phi_i(\mathbf{Y}_i)^T$$
for some matrix $\mathbf{A}_i$. Using the above expression for the projection matrices, we redefine the cost functions and the equality constraints in terms of the kernel matrices; in particular, the projected training data become $\mathcal{P}_i\Phi_i(\mathbf{Y}_i) = \mathbf{A}_i\mathbf{K}_i$ and the orthonormality constraints become $\mathbf{A}_i\mathbf{K}_i\mathbf{A}_i^T = \mathbf{I}$. (6)
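The representer-theorem parameterization can be sketched numerically. The kernel below mirrors the histogram intersection kernel used in the experiments of Section 5; the function names are ours:

```python
import numpy as np

def hist_intersection_kernel(Ya, Yb):
    """Histogram intersection kernel k(a, b) = sum_i min(a_i, b_i),
    evaluated between all column pairs of two nonnegative feature matrices."""
    return np.minimum(Ya[:, :, None], Yb[:, None, :]).sum(axis=0)

def project_training_data(A, K):
    """Representer theorem: with P = A Phi(Y)^T, the projected training data
    P Phi(Y) equal A K, so only the Gram matrix K is ever needed."""
    return A @ K
```

Note that the feature map $\Phi$ never has to be evaluated explicitly; all computations go through the Gram matrix.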
3 Optimization
The above optimization problem (6) is non-convex in the projections, the dictionary $\mathbf{D}$ and the sparse codes $\mathbf{X}$ jointly. We solve it in iterative alternating steps. At each iteration, three update steps are performed, namely the projection update, the dictionary update and the sparse-code update.
3.1 Projection Update
3.2 Dictionary and Sparse code Update
When the projection $\tilde{\mathbf{P}}$ is fixed, this problem boils down to discriminative dictionary learning with the data matrix $\tilde{\mathbf{P}}\tilde{\mathbf{Y}}$. We use the discriminative dictionary learning approach presented in [27] to update $\mathbf{D}$ and $\mathbf{X}$.
4 Test Evaluation
As our goal is classification, given a test sample $\mathbf{y}_t$ from domain $i$, we propose the following steps, similar to [21, 13]:

1. Map the sample into the kernel space via $\Phi_i(\mathbf{y}_t)$.

2. Compute the low-dimensional embedding of the sample using the projection matrix $\mathcal{P}_i$: $\mathbf{z}_t = \mathcal{P}_i\Phi_i(\mathbf{y}_t) = \mathbf{A}_i\mathbf{k}_t$, where $\mathbf{k}_t = \Phi_i(\mathbf{Y}_i)^T\Phi_i(\mathbf{y}_t)$ is the vector of kernel evaluations between the training samples and the test sample.

3. Compute the sparse code $\mathbf{x}_t$ of the embedded test sample over the dictionary $\mathbf{D}$ using the OMP algorithm [16].

4. Allocate the test sample to class $c$ if the reconstruction error using the class-specific dictionary $\mathbf{D}_c$ and the corresponding part of the sparse code is minimum. For better discriminative results, it is desirable to compute the error in the original feature space rather than in the low-dimensional space. So, we map the dictionary back onto $\mathcal{H}_i$ and allocate the test sample as:
$$c^* = \arg\min_c \; \big\|\Phi_i(\mathbf{y}_t) - \mathcal{P}_i^T\mathbf{D}_c\mathbf{x}_t^c\big\|_2,$$
where $\mathbf{x}_t^c$ denotes the entries of $\mathbf{x}_t$ associated with $\mathbf{D}_c$.
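The minimum class-wise reconstruction error rule can be sketched as follows. This simplified version operates in the projected space rather than the original feature space, and the names are ours:

```python
import numpy as np

def classify_by_reconstruction(z, x, class_dicts):
    """Assign embedding z to the class whose sub-dictionary, together with
    the matching entries of the sparse code x, reconstructs z best."""
    errs, start = [], 0
    for Dc in class_dicts:
        k = Dc.shape[1]
        errs.append(np.linalg.norm(z - Dc @ x[start:start + k]))
        start += k
    return int(np.argmin(errs))
```

A sample whose code uses only the atoms of one sub-dictionary is reconstructed exactly by that class and poorly by the others, so the argmin picks the correct label.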
5 Experiments
We conduct experiments on image classification to validate the effectiveness of our proposed method. We show the performance of our method on two adaptation databases and compare it with existing state-of-the-art adaptation algorithms. For each database, the results are averaged over 20 runs with random train/test splits.
5.1 Office and Caltech datasets
Office [19] is a popular benchmark dataset for visual domain adaptation. The dataset contains three domains of images: Amazon, which consists of images downloaded from online merchants; DSLR, which consists of high-resolution images; and Webcam, which consists of low-resolution images. It has 31 object classes. In addition, we choose the Caltech-256 dataset [7] as a fourth domain. Fig. 1 shows some BACKPACK images from all four domains. We choose two different scenarios to test our algorithm. In the first scenario, we use the 10 classes common to all four domains: BACKPACK, TOURING-BIKE, CALCULATOR, HEAD-PHONES, COMPUTER-KEYBOARD, LAPTOP, COMPUTER-MONITOR, COMPUTER-MOUSE, COFFEE-MUG and VIDEO-PROJECTOR. There are a total of 2533 images in this scenario, with 8 to 151 images in each class. In the second scenario, we restrict ourselves to the Office dataset and test on all 31 of its classes; here we test our method using multiple source domains. In both scenarios, we use 20 samples per class for Amazon/Caltech and 8 samples per class for Webcam/DSLR when used as a source domain, and 3 samples per class for all four domains when used as the target. We compare our results with those obtained from [19, 4, 27, 5, 21].


Figure 1: Example BACKPACK images from the four domains (panels (a)–(d)).
Table 1: Recognition accuracy (%) for single source–target domain pairs (A: Amazon, C: Caltech, D: DSLR, W: Webcam).

Methods      | C→A        | C→D        | A→C        | A→W        | W→C        | W→A        | D→A        | D→W
Metric [19]  | 33.7 ± 0.8 | 35.0 ± 1.1 | 27.3 ± 0.7 | 36.0 ± 1.0 | 21.7 ± 0.5 | 32.3 ± 0.8 | 32.0 ± 0.8 | 55.6 ± 0.7
SGF [5]      | 40.2 ± 0.7 | 36.6 ± 0.8 | 37.7 ± 0.5 | 37.9 ± 0.7 | 29.2 ± 0.7 | 38.2 ± 0.6 | 39.2 ± 0.7 | 69.5 ± 0.9
GFK [4]      | 46.1 ± 0.6 | 55.0 ± 0.9 | 39.6 ± 0.4 | 56.9 ± 1.0 | 32.8 ± 0.1 | 46.2 ± 0.6 | 46.2 ± 0.6 | 80.2 ± 0.4
FDDL [27]    | 39.3 ± 2.9 | 55.0 ± 2.8 | 24.3 ± 2.2 | 50.4 ± 3.5 | 22.9 ± 2.6 | 41.1 ± 2.6 | 36.7 ± 2.5 | 65.9 ± 4.9
SDDL [21]    | 49.5 ± 2.6 | 76.7 ± 3.9 | 27.4 ± 2.4 | 72.0 ± 4.8 | 29.7 ± 1.9 | 49.4 ± 2.1 | 48.9 ± 3.8 | 72.6 ± 2.1
Ours         | 52.8 ± 3.6 | 79.7 ± 4.9 | 29.1 ± 2.6 | 74.9 ± 5.0 | 33.1 ± 2.7 | 53.1 ± 4.0 | 52.2 ± 4.4 | 77.5 ± 3.5
Table 2: Recognition accuracy (%) with multiple source domains on the Office dataset.

Source          | Target  | SGF [5]   | RDALR [9]  | FDDL [27]  | SDDL [21]  | Ours
dslr, amazon    | webcam  | 52 ± 2.5  | 36.9 ± 1.1 | 41.0 ± 2.4 | 57.8 ± 2.4 | 60.2 ± 3.5
amazon, webcam  | dslr    | 39 ± 1.1  | 31.2 ± 1.3 | 38.4 ± 3.4 | 56.7 ± 2.3 | 58.4 ± 3.2
webcam, dslr    | amazon  | 28 ± 0.8  | 20.9 ± 0.9 | 19.0 ± 1.2 | 24.1 ± 1.6 | 26.2 ± 2.2
Features for images.
Parameter settings.
We used the nonparametric histogram intersection kernel in all our experiments, and fixed the regularization weights, the sparsity level and the final projected dimension across all experiments. For the first scenario, we choose to learn 4 dictionary atoms per class, i.e. 40 atoms in total for the ten classes. For the second scenario, we choose 6 dictionary atoms per class, i.e. 186 atoms in total for the thirty-one classes. For SDDL [21] and FDDL [27], we fix the parameters as given in [21], as they were found to give the best results.
5.1.1 Results using single source
The comparison of our results with those obtained from other methods is shown in Table 1. Our algorithm performs best for 6 domain pairs and second best for 1 pair. Further, our method outperforms SDDL on all domain pairs. We can therefore infer that our domain-shift-minimizing framework improves performance over [21], specifically when the training data and test data come from different distributions.
5.1.2 Results using multiple sources
5.2 USPS and MNIST datasets
USPS and MNIST are handwritten-digit image datasets widely used for digit recognition and classification. The USPS dataset consists of 7,291 training images and 2,007 test images of size 16×16. The MNIST dataset has a training set of 60,000 images and a test set of 10,000 images, each of size 28×28. Some images from both datasets are shown in Fig. 2. For our experiments, we adopt the publicly available USPS+MNIST datasets provided by Long et al. [12], which contain USPS and MNIST images from the 10 shared digit classes. All the images are scaled to 16×16, and each is represented by a vector encoding the gray-level values. For each domain of this database, we use 20 samples per class when used as the source and 3 samples per class when used as the target. We use the same kernel and the same set of parameters as for the earlier database. We choose to learn 4 dictionary atoms per class, i.e. 40 atoms in total for the ten classes, for this database. We compare the performance of our method with those obtained from [27, 21].
5.2.1 Results
Table 3 compares our results with those of the other methods. We evaluated our method with USPS as the source and MNIST as the target, and vice versa. The results obtained using our approach outperform those of the other methods.
6 Conclusion
We presented a generalized framework for adapting dictionaries to multiple domains by minimizing the domain shift. Furthermore, we showed that the method can be kernelized and can be modified to learn discriminative dictionaries for class-specific data. The dictionary is learned on a common low-dimensional space onto which the original data is projected. We showed that our method outperforms the current state-of-the-art methods on different adaptation databases. Future work includes leveraging unlabeled data during training and implementing tractable, online adaptation of dictionaries for large-scale data.
References
 [1] M. Belkin, P. Niyogi, and V. Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. The Journal of Machine Learning Research, 7:2399–2434, 2006.
 [2] H. Daumé III. Frustratingly easy domain adaptation. arXiv preprint arXiv:0907.1815, 2009.
 [3] M. Elad and M. Aharon. Image denoising via sparse and redundant representations over learned dictionaries. Image Proc., IEEE Trans. on, 15(12):3736–3745, 2006.
 [4] B. Gong, Y. Shi, F. Sha, and K. Grauman. Geodesic flow kernel for unsupervised domain adaptation. In CVPR, pages 2066–2073. IEEE, 2012.
 [5] R. Gopalan, R. Li, and R. Chellappa. Domain adaptation for object recognition: An unsupervised approach. In ICCV, pages 999–1006. IEEE, 2011.
 [6] A. Gretton, K. M. Borgwardt, M. Rasch, B. Schölkopf, and A. J. Smola. A kernel method for the twosampleproblem. In NIPS, pages 513–520, 2006.
 [7] G. Griffin, A. Holub, and P. Perona. Caltech256 object category dataset. 2007.
 [8] Y. Han, F. Wu, D. Tao, J. Shao, Y. Zhuang, and J. Jiang. Sparse unsupervised dimensionality reduction for multiple view data. Circuits and Sys. for Video Tech., IEEE Trans. on, 22(10):1485–1496, 2012.
 [9] I.H. Jhuo, D. Liu, D. Lee, and S.F. Chang. Robust visual domain adaptation with lowrank reconstruction. In CVPR, pages 2168–2175. IEEE, 2012.
 [10] Y. Jia, M. Salzmann, and T. Darrell. Factorized latent spaces with structured sparsity. In NIPS, pages 982–990, 2010.
 [11] B. Kulis, K. Saenko, and T. Darrell. What you saw is not what you get: Domain adaptation using asymmetric kernel transforms. In CVPR, pages 1785–1792. IEEE, 2011.
 [12] M. Long, J. Wang, G. Ding, J. Sun, and P. Yu. Transfer joint matching for unsupervised domain adaptation. In Proc. of IEEE CVPR, pages 1410–1417, 2013.
 [13] H. V. Nguyen, V. M. Patel, N. M. Nasrabadi, and R. Chellappa. Sparse embedding: A framework for sparsity promoting dimensionality reduction. In ECCV, pages 414–427. Springer, 2012.
 [14] B. A. Olshausen and D. J. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37(23):3311–3325, 1997.
 [15] S. J. Pan, I. W. Tsang, J. T. Kwok, and Q. Yang. Domain adaptation via transfer component analysis. Neural Networks, IEEE Trans. on, 22(2):199–210, 2011.
 [16] Y. C. Pati, R. Rezaiifar, and P. Krishnaprasad. Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition. In Signals, Systems and Computers, 1993. 1993 Conference Record of The TwentySeventh Asilomar Conference on, pages 40–44. IEEE, 1993.
 [17] Q. Qiu, V. M. Patel, P. Turaga, and R. Chellappa. Domain adaptive dictionary learning. In ECCV, pages 631–645. Springer, 2012.
 [18] I. Ramirez, P. Sprechmann, and G. Sapiro. Classification and clustering via dictionary learning with structured incoherence and shared features. In CVPR, pages 3501–3508. IEEE, 2010.
 [19] K. Saenko, B. Kulis, M. Fritz, and T. Darrell. Adapting visual category models to new domains. In ECCV, pages 213–226. Springer, 2010.
 [20] B. Schölkopf and A. J. Smola. Learning with Kernels. The MIT Press, 2002.
 [21] S. Shekhar, V. M. Patel, H. V. Nguyen, and R. Chellappa. Generalized domainadaptive dictionaries. In CVPR, pages 361–368. IEEE, 2013.
 [22] A. Shrivastava, J. K. Pillai, V. M. Patel, and R. Chellappa. Learning discriminative dictionaries with partially labeled data. In ICIP, pages 3113–3116. IEEE, 2012.
 [23] S. Wang, D. Zhang, Y. Liang, and Q. Pan. Semicoupled dictionary learning with applications to image superresolution and photosketch synthesis. In CVPR, pages 2216–2223. IEEE, 2012.
 [24] Z. Wen and W. Yin. A feasible method for optimization with orthogonality constraints. Mathematical Programming, 142(12):397–434, 2013.
 [25] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma. Robust face recognition via sparse representation. Pat. Analy. and Mach. Int., IEEE Trans. on, 31(2):210–227, 2009.
 [26] J. Yang, Z. Wang, Z. Lin, S. Cohen, and T. Huang. Coupled dictionary training for image superresolution. Image Proc., IEEE Trans. on, 21(8):3467–3478, 2012.
 [27] M. Yang, D. Zhang, and X. Feng. Fisher discrimination dictionary learning for sparse representation. In ICCV, pages 543–550. IEEE, 2011.
 [28] M. Zheng, J. Bu, C. Chen, C. Wang, L. Zhang, G. Qiu, and D. Cai. Graph regularized sparse coding for image representation. Image Proc., IEEE Trans. on, 20(5):1327–1336, 2011.