###### Abstract

This work introduces a transformation-based learner model for classification forests. The weak learner at each split node plays a crucial role in a classification tree. We propose to optimize the splitting objective by learning a linear transformation on subspaces, using the nuclear norm as the optimization criterion. The learned linear transformation restores a low-rank structure for data from the same class and, at the same time, maximizes the separation between different classes, thereby improving the performance of the split function. Theoretical and experimental results support the proposed framework.

Learning Transformations for Classification Forests

Qiang Qiu qiang.qiu@duke.edu

Department of Electrical and Computer Engineering,

Duke University,

Durham, NC 27708, USA

Guillermo Sapiro guillermo.sapiro@duke.edu

Department of Electrical and Computer Engineering,

Department of Computer Science,

Department of Biomedical Engineering,

Duke University,

Durham, NC 27708, USA

Classification Forests (Breiman, 2001; Criminisi & Shotton, 2013) have recently shown great success for a large variety of classification tasks, such as pose estimation (Shotton et al., 2012), data clustering (Moosmann et al., 2007), and object recognition (Gall & Lempitsky, 2009). A classification forest is an ensemble of randomized classification trees. A classification tree is a set of hierarchically connected tree nodes, i.e., split (internal) nodes and leaf (terminal) nodes. Each split node is associated with a different weak learner with binary outputs (here we focus on binary trees). The splitting objective at each node is optimized using the training set. During testing, a split node evaluates each arriving data point and sends it to the left or right child based on the weak learner output.

The weak learner associated with each split node plays a crucial role in a classification tree. An analysis of the effect of various popular weak learner models can be found in (Criminisi & Shotton, 2013), including decision stumps, general oriented hyperplane learners, and conic section learners. In general, even for high-dimensional data, we usually seek low-dimensional weak learners that separate different classes as much as possible.

High-dimensional data often have a small intrinsic dimension. For example, in the area of computer vision, face images of a subject (Basri & Jacobs, 2003), (Wright et al., 2009), handwritten images of a digit (Hastie & Simard, 1998), and trajectories of a moving object (Tomasi & Kanade, 1992), can all be well-approximated by a low-dimensional subspace of the high-dimensional ambient space. Thus, multi-class data often lie in a union of low-dimensional subspaces. These theoretical low-dimensional intrinsic structures are often violated for real-world data. For example, under the assumption of Lambertian reflectance, (Basri & Jacobs, 2003) show that face images of a subject obtained under a wide variety of lighting conditions can be accurately approximated with a 9-dimensional linear subspace. However, real-world face images are often captured under additional pose variations; in addition, faces are not perfectly Lambertian, and exhibit cast shadows and specularities (Candès et al., 2011).

When data from the same low-dimensional subspace are arranged as columns of a single matrix, the matrix should be approximately low-rank. Thus, a promising way to handle corrupted underlying structures of realistic data, and as such, deviations from ideal subspaces, is to restore such low-rank structure. Recent efforts have been invested in seeking transformations such that the transformed data can be decomposed as the sum of a low-rank matrix component and a sparse error one (Peng et al., 2010; Shen & Wu, 2012; Zhang et al., 2011). (Peng et al., 2010) and (Zhang et al., 2011) are proposed for image alignment (see (Kuybeda et al., 2013) for the extension to multiple classes with applications in cryo-tomography), and (Shen & Wu, 2012) is discussed in the context of salient object detection. All these methods build on recent theoretical and computational advances in rank minimization.

In this paper, we present a new formulation for random forests, and propose to learn a linear discriminative transformation at each split node in each tree to improve the class separation capability of weak learners. We optimize the data splitting objective using matrix rank, via its nuclear norm convex surrogate, as the learning criterion. We show that the learned discriminative transformation recovers a low-rank structure for data from the same class and, at the same time, maximizes the subspace angles between different classes. Intuitively, the proposed method shares some of the attributes of the Linear Discriminant Analysis (LDA) method, but with a significantly different metric. Similar to LDA, our method reduces intra-class variations and increases inter-class separations to achieve improved data splitting. However, we adopt the matrix nuclear norm as the key criterion to learn a transformation, which is appropriate for data expected to lie in (the union of) subspaces. As shown later, our method significantly outperforms the LDA method, as well as state-of-the-art learners in classification forests. The learned transformations help in other classification tasks as well, e.g., subspace-based methods (Qiu & Sapiro, 2013).

A classification forest is an ensemble of binary classification trees, where each tree consists of hierarchically connected split (internal) nodes and leaf (terminal) nodes. Each split node corresponds to a weak learner; it evaluates each arriving data point and sends it to the left or right child based on the weak learner's binary output. Each leaf node stores the statistics of the data points that arrived during training. During testing, each classification tree returns a class posterior probability for a test sample, and the forest output is often defined as the average of tree posteriors. In this section, we introduce transformation learning at each split node to dramatically improve the class separation capability of a weak learner. Such a learned transformation is virtually computationally free at testing time.
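As a minimal sketch of the forest output described above, the snippet below averages per-tree class posteriors for one test sample and picks the most probable class. The function name `forest_posterior` and the toy posterior values are illustrative assumptions and are not taken from the original implementation.

```python
import numpy as np

def forest_posterior(tree_posteriors):
    """Average the class posteriors returned by the individual trees.

    tree_posteriors: array-like of shape (n_trees, n_classes); each row is
    the posterior one tree assigns to the same test sample.
    """
    p = np.asarray(tree_posteriors, dtype=float)
    return p.mean(axis=0)

# Toy values made up for illustration: 3 trees, 2 classes.
posteriors = [[0.9, 0.1],
              [0.6, 0.4],
              [0.3, 0.7]]
avg = forest_posterior(posteriors)        # averaged posterior
predicted_class = int(np.argmax(avg))     # index of the most probable class
```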

Consider two-class data points $\mathbf{Y} = \{\mathbf{y}_i\}_{i=1}^{N} \subseteq \mathbb{R}^d$, with each data point $\mathbf{y}_i$ in one of the low-dimensional subspaces of $\mathbb{R}^d$, and the data arranged as columns of $\mathbf{Y}$. We assume the class labels are known beforehand for training purposes. $\mathbf{Y}^+$ and $\mathbf{Y}^-$ denote the set of points in each of the two classes respectively, points again arranged as columns of the corresponding matrix.

We propose to learn a $d \times d$ linear transformation $\mathbf{T}$ (we can also consider learning an $r \times d$ matrix, $r < d$, and simultaneously reducing the data dimension),

$$\arg\min_{\mathbf{T}} \; \|\mathbf{T}\mathbf{Y}^+\|_* + \|\mathbf{T}\mathbf{Y}^-\|_* - \|\mathbf{T}[\mathbf{Y}^+, \mathbf{Y}^-]\|_*, \quad \text{s.t. } \|\mathbf{T}\|_2 = 1, \tag{1}$$

where $[\mathbf{Y}^+, \mathbf{Y}^-]$ denotes the concatenation of $\mathbf{Y}^+$ and $\mathbf{Y}^-$, and $\|\cdot\|_*$ denotes the matrix nuclear norm, i.e., the sum of the singular values of a matrix. The nuclear norm is the convex envelope of the rank function over the unit ball of matrices (Fazel, 2002). As the nuclear norm can be optimized efficiently, it is often adopted as the best convex approximation of the rank function in the literature on rank optimization (see, e.g., (Candès et al., 2011) and (Recht et al., 2010)). The normalization condition $\|\mathbf{T}\|_2 = 1$ prevents the trivial solution $\mathbf{T} = \mathbf{0}$; however, the effect of different normalizations is interesting and is the subject of future research. Throughout this paper we keep this particular form of the normalization, which has already been shown to lead to excellent results.
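To make the objective concrete, the following sketch evaluates it for a given transformation: the sum of the per-class nuclear norms minus the nuclear norm of the concatenated transformed data. The helper names `nuclear_norm` and `split_objective`, the synthetic data, and the choice of the identity as the transformation are assumptions made purely for illustration.

```python
import numpy as np

def nuclear_norm(M):
    """Sum of the singular values of M."""
    return np.linalg.svd(M, compute_uv=False).sum()

def split_objective(T, Y_pos, Y_neg):
    """Per-class nuclear norms minus the nuclear norm of the concatenation."""
    TY_pos = T @ Y_pos
    TY_neg = T @ Y_neg
    return (nuclear_norm(TY_pos) + nuclear_norm(TY_neg)
            - nuclear_norm(np.hstack([TY_pos, TY_neg])))

rng = np.random.default_rng(0)
d = 5
# Two synthetic classes living in orthogonal coordinate subspaces of R^5.
Y_pos = np.vstack([rng.standard_normal((2, 20)), np.zeros((3, 20))])
Y_neg = np.vstack([np.zeros((2, 20)), rng.standard_normal((3, 20))])
T = np.eye(d)  # the identity has spectral norm 1

# Orthogonal class subspaces: the objective is numerically zero.
value_orthogonal = split_objective(T, Y_pos, Y_neg)
# Overlapping class subspaces: the objective is strictly positive.
value_overlap = split_objective(T, Y_pos, Y_pos + 0.1 * Y_neg)
```

The two evaluations illustrate the behaviour the text relies on: the value vanishes when the transformed classes occupy orthogonal subspaces and grows when they overlap.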

As shown later, such a linear transformation restores a low-rank structure for data from the same class and, at the same time, maximizes the subspace angles between classes. In this way, we reduce the intra-class variation and introduce inter-class separations to improve the class separation capability of a weak learner.

One fundamental factor that affects the performance of weak learners in a classification tree is the separation between different class subspaces. An important notion to quantify the separation between two subspaces $\mathcal{S}_i$ and $\mathcal{S}_j$ is the smallest principal angle $\theta_{ij}$ (Miao & Ben-Israel, 1992), (Elhamifar & Vidal, 2013), defined as

$$\theta_{ij} = \min_{\mathbf{u} \in \mathcal{S}_i,\, \mathbf{v} \in \mathcal{S}_j} \arccos\left(\frac{\mathbf{u}^\top \mathbf{v}}{\|\mathbf{u}\|_2 \, \|\mathbf{v}\|_2}\right). \tag{2}$$

Note that $\theta_{ij} \in [0, \pi/2]$. We show next that the learned transformation using the objective function (1) maximizes the angle between subspaces of different classes, leading to improved data splitting in a tree node. We start by presenting some basic norm relationships for matrices and their corresponding concatenations.

###### Theorem 1.

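Let $\mathbf{A}$ and $\mathbf{B}$ be matrices of the same row dimensions, and let $[\mathbf{A}, \mathbf{B}]$ be the concatenation of $\mathbf{A}$ and $\mathbf{B}$. Then

$$\|[\mathbf{A}, \mathbf{B}]\|_* \le \|\mathbf{A}\|_* + \|\mathbf{B}\|_*,$$

with equality obtained if the column spaces of $\mathbf{A}$ and $\mathbf{B}$ are orthogonal.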

###### Proof.

See the Appendix. ∎

Based on this result we have that

$$\|\mathbf{T}\mathbf{Y}^+\|_* + \|\mathbf{T}\mathbf{Y}^-\|_* - \|\mathbf{T}[\mathbf{Y}^+, \mathbf{Y}^-]\|_* \ge 0, \tag{3}$$

and the proposed objective function (1) reaches the minimum $0$ if the column spaces of the two classes are orthogonal after applying the learned transformation $\mathbf{T}$; or equivalently, (1) reaches the minimum $0$ when the angle between subspaces of the two classes is maximized after transformation, i.e., the smallest principal angle between subspaces equals $\pi/2$.
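The smallest principal angle used in this argument can be computed numerically from orthonormal bases of the two subspaces: the singular values of the product of the bases are the cosines of the principal angles, so the largest singular value gives the smallest angle. The sketch below assumes full-column-rank inputs; the function name `smallest_principal_angle` and the toy vectors are illustrative only.

```python
import numpy as np

def smallest_principal_angle(A, B):
    """Smallest principal angle (radians) between the column spaces of A and B.

    Assumes A and B have full column rank, so QR yields orthonormal bases.
    """
    Qa, _ = np.linalg.qr(A)
    Qb, _ = np.linalg.qr(B)
    # Singular values of Qa^T Qb are the cosines of the principal angles.
    cosines = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    return float(np.arccos(np.clip(cosines.max(), -1.0, 1.0)))

e1 = np.array([[1.0], [0.0], [0.0]])
e2 = np.array([[0.0], [1.0], [0.0]])
diag = np.array([[1.0], [1.0], [0.0]])

angle_orthogonal = smallest_principal_angle(e1, e2)   # pi/2
angle_diagonal = smallest_principal_angle(e1, diag)   # pi/4
```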

We now discuss the advantages of adopting the nuclear norm in (1) over the rank function and other (popular) matrix norms, e.g., the induced 2-norm and the Frobenius norm. When we replace the nuclear norm in (1) with the rank function, the objective function reaches the minimum when subspaces are disjoint, but not necessarily maximally distant. If we replace the nuclear norm in (1) with the induced 2-norm or the Frobenius norm, as shown in the Appendix, the objective function is minimized at the trivial solution $\mathbf{T} = \mathbf{0}$, which is prevented by the normalization condition $\|\mathbf{T}\|_2 = 1$.

Thus, we adopt the nuclear norm in (1) for two major advantages that the rank function and other (popular) matrix norms do not offer: (a) the nuclear norm is the best convex approximation of the rank function (Fazel, 2002), which helps to reduce the variation within classes (first term in (1)); (b) the objective function (1) is in general optimized when the distance between subspaces of different classes is maximized after transformation, which helps to introduce separations between the classes.

We now illustrate the properties of the learned transformation using synthetic examples in Fig. 1 (real-world examples are presented in the experimental evaluation). We adopt a simple gradient descent optimization method (though other modern nuclear norm optimization techniques could be considered) to search for the transformation matrix $\mathbf{T}$ that minimizes (1). As shown in Fig. 1, the learned transformation via (1) increases the inter-class subspace angle towards the maximum $\pi/2$, and reduces the intra-class subspace angle towards the minimum $0$.
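The optimization step can be sketched as a plain subgradient descent. This is a sketch under stated assumptions rather than the authors' implementation: it uses the standard subgradient $\mathbf{U}\mathbf{V}^\top$ of the nuclear norm at a matrix with thin SVD $\mathbf{U}\Sigma\mathbf{V}^\top$, rescales the transformation to unit spectral norm after every step as one simple way to respect the normalization, and keeps the best iterate seen. The step size, iteration count, function names, and synthetic data are all hypothetical.

```python
import numpy as np

def nuclear_norm(M):
    return np.linalg.svd(M, compute_uv=False).sum()

def objective(T, Y_pos, Y_neg):
    return (nuclear_norm(T @ Y_pos) + nuclear_norm(T @ Y_neg)
            - nuclear_norm(T @ np.hstack([Y_pos, Y_neg])))

def nuclear_subgrad(T, Y, tol=1e-10):
    """A subgradient of ||T Y||_* with respect to T."""
    U, s, Vt = np.linalg.svd(T @ Y, full_matrices=False)
    keep = s > tol                      # ignore numerically zero directions
    return (U[:, keep] @ Vt[keep, :]) @ Y.T

def learn_transformation(Y_pos, Y_neg, step=1e-2, n_iter=200):
    d = Y_pos.shape[0]
    Y_all = np.hstack([Y_pos, Y_neg])
    T = np.eye(d)                       # identity initialization, spectral norm 1
    best_T, best_val = T.copy(), objective(T, Y_pos, Y_neg)
    for _ in range(n_iter):
        G = (nuclear_subgrad(T, Y_pos) + nuclear_subgrad(T, Y_neg)
             - nuclear_subgrad(T, Y_all))
        T = T - step * G
        T = T / np.linalg.norm(T, 2)    # rescale to unit spectral norm (assumption)
        val = objective(T, Y_pos, Y_neg)
        if val < best_val:              # keep the best iterate seen so far
            best_T, best_val = T.copy(), val
    return best_T, best_val

rng = np.random.default_rng(0)
# Two noisy one-dimensional subspaces in R^3 separated by a small angle.
u = np.array([[1.0], [0.0], [0.0]])
v = np.array([[1.0], [0.3], [0.0]])
Y_pos = u @ rng.standard_normal((1, 30)) + 0.01 * rng.standard_normal((3, 30))
Y_neg = v @ rng.standard_normal((1, 30)) + 0.01 * rng.standard_normal((3, 30))

initial_val = objective(np.eye(3), Y_pos, Y_neg)
T_learned, final_val = learn_transformation(Y_pos, Y_neg)
```

Because the problem is non-convex, such a loop only finds a local minimum, and tracking the best iterate guards against steps that temporarily increase the objective.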

During training, at the $i$-th split node, we denote the arriving training samples as $\mathbf{Y}_i^+$ and $\mathbf{Y}_i^-$. When more than two classes are present at a node, we randomly divide the classes into two categories. This step purposely introduces node randomness to avoid duplicated trees, as discussed later. We then learn a transformation matrix $\mathbf{T}_i$ using (1), and represent the subspaces of $\mathbf{T}_i\mathbf{Y}_i^+$ and $\mathbf{T}_i\mathbf{Y}_i^-$ as $\mathbf{D}_i^+$ and $\mathbf{D}_i^-$ respectively. The weak learner model at the $i$-th split node is now defined as $\theta_i = (\mathbf{T}_i, \mathbf{D}_i^+, \mathbf{D}_i^-)$. During both training and testing, at the $i$-th split node, each arriving sample $\mathbf{y}$ uses $\mathbf{T}_i\mathbf{y}$ as the feature, and is assigned to $\mathbf{D}_i^+$ or $\mathbf{D}_i^-$, whichever gives the smaller reconstruction error.

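Various techniques are available to perform the above evaluation. In our implementation, we obtain $\mathbf{D}_i^+$ and $\mathbf{D}_i^-$ using the K-SVD method (Aharon et al., 2006) and denote a transformation learner as $\theta_i = (\mathbf{T}_i, \mathbf{P}_i^+, \mathbf{P}_i^-)$, where $\mathbf{P}_i^{\pm} = \mathbf{D}_i^{\pm}\left(\mathbf{D}_i^{\pm\top}\mathbf{D}_i^{\pm}\right)^{-1}\mathbf{D}_i^{\pm\top}$. The split evaluation of a test sample $\mathbf{y}$, $\|\mathbf{P}_i^{\pm}\mathbf{T}_i\mathbf{y} - \mathbf{T}_i\mathbf{y}\|_2$, only involves matrix multiplication, which is of low computational complexity at testing time.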
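The split evaluation can be sketched as follows: build an orthogonal projector for each class subspace once, then route a sample to the side whose projector reconstructs the transformed sample with the smaller error. Note the substitution: instead of K-SVD, this sketch obtains each subspace basis from a truncated SVD, which is an assumption made to keep the example self-contained. The function names, subspace dimension, and identity transformation are likewise illustrative.

```python
import numpy as np

def subspace_basis(X, dim):
    """Orthonormal basis of the best rank-`dim` fit to the columns of X."""
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :dim]

def projector(D):
    """Orthogonal projector onto the column space of D: D (D^T D)^{-1} D^T."""
    return D @ np.linalg.solve(D.T @ D, D.T)

def split(y, T, P_pos, P_neg):
    """Return +1 or -1 according to the smaller reconstruction error of T y."""
    z = T @ y
    err_pos = np.linalg.norm(P_pos @ z - z)
    err_neg = np.linalg.norm(P_neg @ z - z)
    return 1 if err_pos <= err_neg else -1

rng = np.random.default_rng(1)
T = np.eye(4)                                   # placeholder transformation
B_pos = np.eye(4)[:, :2]                        # span{e1, e2}
B_neg = np.eye(4)[:, 2:]                        # span{e3, e4}
Y_pos = B_pos @ rng.standard_normal((2, 50))
Y_neg = B_neg @ rng.standard_normal((2, 50))

# Projectors are computed once during training and reused at test time.
P_pos = projector(subspace_basis(T @ Y_pos, 2))
P_neg = projector(subspace_basis(T @ Y_neg, 2))

label_a = split(B_pos @ np.array([1.0, -2.0]), T, P_pos, P_neg)   # +1
label_b = split(B_neg @ np.array([0.5, 3.0]), T, P_pos, P_neg)    # -1
```

Since the projectors are precomputed, evaluating a sample at a node costs only a few matrix-vector products.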

Given a data point $\mathbf{y} \in \mathbb{R}^d$, in this paper, we considered a square linear transformation $\mathbf{T}$ of size $d \times d$. Note that, if we learn a “fat” linear transformation $\mathbf{T}$ of size $r \times d$, where $r < d$, we enable dimension reduction along with transformation to handle very high-dimensional data.

During the training phase, we introduce randomness into the forests through a combination of random training set sampling and randomized node optimization. We train each classification tree on a different randomly selected training set. As discussed in (Breiman, 2001; Criminisi & Shotton, 2013), this reduces possible overfitting and improves the generalization of classification forests, while also significantly reducing the training time. The randomized node optimization is achieved by randomly dividing the classes arriving at each split node into two categories (given more than two classes), to obtain the training sets $\mathbf{Y}^+$ and $\mathbf{Y}^-$. In (1), we learn a transformation optimized for a two-class problem. This random class-division strategy reduces a multi-class problem to a two-class problem at each node for transformation learning; furthermore, it introduces node randomness to avoid generating duplicated trees. Note that (1) is non-convex and the employed gradient descent method converges to a local minimum. Initializing the transformation $\mathbf{T}$ with different random matrices might lead to different local minima. The identity matrix initialization of $\mathbf{T}$ in this paper leads to excellent performance; however, understanding the node randomness introduced by adopting different initializations of $\mathbf{T}$ is the subject of future research.
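The random class division at a node can be sketched in a few lines. The text does not specify how the sizes of the two categories are chosen, so drawing the cut point uniformly while keeping both sides non-empty is an assumption, as is the function name `random_class_split`.

```python
import random

def random_class_split(class_labels, rng):
    """Randomly divide the classes at a node into two non-empty categories."""
    labels = sorted(set(class_labels))
    if len(labels) < 2:
        raise ValueError("need at least two classes to split")
    rng.shuffle(labels)
    # Assumed rule: uniform cut point, each side keeps at least one class.
    cut = rng.randint(1, len(labels) - 1)
    return set(labels[:cut]), set(labels[cut:])

rng = random.Random(0)
positive, negative = random_class_split(range(10), rng)
```

Samples whose class falls in `positive` would form one training set and the rest the other; different random draws at different nodes are what keep the trees from duplicating one another.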

This section presents experimental evaluations using public datasets: the MNIST handwritten digit dataset, the Extended YaleB face dataset, and the 15-Scenes natural scene dataset. The MNIST dataset consists of 8-bit grayscale handwritten digit images of “0” through “9” and 7000 examples for each class. The Extended YaleB face dataset contains 38 subjects with near frontal pose under 64 lighting conditions (Fig. 2). All the images are resized to $16 \times 16$ for the MNIST and the Extended YaleB datasets, which gives a 256-dimensional feature. The 15-Scenes dataset contains 4485 images falling into 15 natural scene categories (Fig. 3). The 15 categories include images of living rooms, kitchens, streets, industrials, etc. We also present results for 3D data from the Kinect dataset in (Denil et al., 2013). We first compare many learners in a tree context for accuracy and testing time; then we compare with learners that are common for random forests.

We construct classification trees on the extended YaleB face dataset to compare different learners. We split the dataset into two halves by randomly selecting 32 lighting conditions for training, and the other half for testing. Fig. 4 illustrates the proposed transformation learner model in a classification tree constructed on faces of all 38 subjects. The third column shows that transformation learners at each split node enforce separation between two randomly selected categories, and clearly demonstrates how data in each class is concentrated while the different classes are separated.

Method | Accuracy (%) | Testing time (s) |
---|---|---|
Non-tree based methods | | |
D-KSVD (Zhang & Li, 2010) | 94.10 | - |
LC-KSVD (Jiang et al., 2011) | 96.70 | - |
SRC (Wright et al., 2009) | 97.20 | - |
Classification trees | | |
Decision stump (1 tree) | 28.37 | 0.09 |
Decision stump (100 trees) | 91.77 | 13.62 |
Conic section (1 tree) | 8.55 | 0.05 |
Conic section (100 trees) | 78.20 | 5.04 |
C4.5 (1 tree) (Quinlan, 1993) | 39.14 | 0.21 |
LDA (1 tree) | 38.32 | 0.12 |
LDA (100 trees) | 94.98 | 7.01 |
SVM (1 tree) | 95.23 | 1.62 |
Identity learner (1 tree) | 84.95 | 0.29 |
Transformation learner (1 tree) | 98.77 | 0.15 |

A maximum tree depth is typically specified for random forests to limit the size of a tree (Criminisi & Shotton, 2013), which is different from algorithms like C4.5 (Quinlan, 1993) that grow the tree relying only on a termination criterion. The tree depth in this paper is the maximum tree depth. To avoid under/over-fitting, we choose the maximum tree depth through a validation process. We also implement additional termination criteria to prevent further training of a branch, e.g., the number of samples arriving at a node.

In Table 1, we construct classification trees with a maximum depth of 9 using different learners (no maximum depth is defined for the C4.5 tree). For reference, we also include the performance of several subspace learning methods, which provide state-of-the-art classification accuracies on this dataset. Using a single classification tree, the proposed transformation learner already significantly outperforms the popular weak learners decision stump and conic section (Criminisi & Shotton, 2013), even where 100 trees are used (30 tries are adopted here). We observe that the proposed learner also outperforms the more complex split functions SVM and LDA. The identity learner denotes the proposed framework with the learned transformation replaced by the identity matrix. Using a single tree, the proposed approach already outperforms state-of-the-art results reported on this dataset. As shown later, with randomness introduced, the performance in general increases further by employing more trees.

While our learner has higher complexity compared to weak learners like decision stump, the performance for random forests is judged by the accuracy and test time. Increasing the number of trees (sublinearly) increases accuracy, at the cost of (linearly) increased test time (Criminisi & Shotton, 2013). As shown in Table 1, our learner exhibits a test time similar to that of other weak learners, but with significantly improved accuracy. By increasing the number of trees, other learners may approach our accuracy but at the cost of orders of magnitude more test time. Thus, the fact that 1-2 orders of magnitude fewer trees with our learned matrix outperform standard random forests illustrates the importance of the proposed general transform learning framework.

We now evaluate the effect of random training set sampling using the MNIST dataset. The MNIST dataset has a training set of 60000 examples, and a test set of 10000 examples. We train 20 classification trees with a depth of 9, each using only randomly selected training samples (in this paper, we select the random training selection rate to provide each tree with about 5000 training samples). As shown in Fig. 5(a), the classification accuracy increases from 93.74% to 97.30% by increasing the number of trees to 20. Fig. 6 illustrates in detail the proposed transformation learner model in one of the trees. As discussed, increasing the number of trees (sublinearly) increases accuracy, at the cost of (linearly) increased test time. Though reporting a better accuracy with hundreds of trees is an option (with limited pedagogical value), a few (20) trees are sufficient to illustrate the trade-off between accuracy and test time.

Using the 15-Scenes dataset in Fig. 3, we further evaluate the effect of randomness introduced by randomly dividing classes arriving at each split node into two categories. We randomly use 100 images per class for training and use the remaining data for testing. We train 20 classification trees with a depth of 5, each using all training samples. As shown in Fig. 5(b), the classification accuracy increases from 66.23% to 79.06% by increasing the number of trees to 20. We notice that, with only 20 trees, the accuracy is already comparable to state-of-the-art results reported on this dataset shown in Table 2. We in general expect the performance to increase further by employing more trees.

Method | Accuracy (%) |
---|---|
ScSPM (Yang et al., 2009) | 80.28 |
KSPM (Lazebnik et al., 2006) | 76.73 |
KC (Gemert et al., 2008) | 76.67 |
LSPM (Yang et al., 2009) | 65.32 |
Transformation forests | 79.06 |

We finally evaluate the proposed transformation learner in the task of predicting human body part labels from a depth image. We adopt the Kinect dataset provided in (Denil et al., 2013), where pairs of resolution depth and body part images are rendered from the CMU mocap dataset. The 19 body parts and one background class are represented by 20 unique color identifiers in the body part image. For this experiment, we only use the 500 testing poses from this dataset. We use the first 450 poses for training and the remaining 50 poses for testing. During training, we sample 10 pixels for each body part in each pose and produce 190 data points for each depth image. Each pixel is represented using depth difference from its 96 neighbors with radius 8, 32 and 64 respectively, forming a 288-dim descriptor. We train 30 classification trees with a depth of 9, each using randomly selected training samples. As shown in Fig. 5(c), the classification accuracy increases from 55.48% to 73.12% by increasing the number of trees to 30. Fig. 7 shows an example input depth image, the ground truth body parts, and the prediction using the proposed method.

We introduced a transformation-based learner model for classification forests. Using the nuclear norm as the optimization criterion, we learn a transformation at each split node that reduces variations/noise within the classes, and increases separations between the classes. The final classification result combines multiple random trees; we therefore expect the proposed framework to be very robust to noise. We demonstrated the effectiveness of the proposed learner for classification forests, and provided theoretical support for the experimental results reported on very diverse datasets.

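###### Proof of Theorem 1.

We know that (Srebro et al., 2005)

$$\|\mathbf{A}\|_* = \min_{\mathbf{U}, \mathbf{V}:\, \mathbf{A} = \mathbf{U}\mathbf{V}^\top} \frac{1}{2}\left(\|\mathbf{U}\|_F^2 + \|\mathbf{V}\|_F^2\right).$$

We denote $\mathbf{U}_A$ and $\mathbf{V}_A$ the matrices that achieve the minimum; same for $\mathbf{B}$, $\mathbf{U}_B$ and $\mathbf{V}_B$; and same for the concatenation $[\mathbf{A}, \mathbf{B}]$, $\mathbf{U}_{[A,B]}$ and $\mathbf{V}_{[A,B]}$. We then have

$$\|\mathbf{A}\|_* = \frac{1}{2}\left(\|\mathbf{U}_A\|_F^2 + \|\mathbf{V}_A\|_F^2\right), \qquad \|\mathbf{B}\|_* = \frac{1}{2}\left(\|\mathbf{U}_B\|_F^2 + \|\mathbf{V}_B\|_F^2\right).$$

The matrices $[\mathbf{U}_A, \mathbf{U}_B]$ and $[\mathbf{V}_A, \mathbf{V}_B]$ obtained by concatenating the matrices that achieve the minimum for $\mathbf{A}$ and $\mathbf{B}$ when computing their nuclear norm, are not necessarily the ones that achieve the corresponding minimum in the nuclear norm computation of the concatenation matrix $[\mathbf{A}, \mathbf{B}]$ (arranging $\mathbf{V}_A$ and $\mathbf{V}_B$ block-diagonally, which leaves the Frobenius norm unchanged, gives the valid factorization $[\mathbf{A}, \mathbf{B}] = [\mathbf{U}_A, \mathbf{U}_B]\,\mathrm{diag}(\mathbf{V}_A, \mathbf{V}_B)^\top$). It is easy to show that

$$\|[\mathbf{A}, \mathbf{B}]\|_F^2 = \|\mathbf{A}\|_F^2 + \|\mathbf{B}\|_F^2,$$

where $\|\cdot\|_F$ denotes the Frobenius norm. Thus, we have

$$\begin{aligned}
\|[\mathbf{A}, \mathbf{B}]\|_* &= \frac{1}{2}\left(\|\mathbf{U}_{[A,B]}\|_F^2 + \|\mathbf{V}_{[A,B]}\|_F^2\right) \\
&\le \frac{1}{2}\left(\|[\mathbf{U}_A, \mathbf{U}_B]\|_F^2 + \|[\mathbf{V}_A, \mathbf{V}_B]\|_F^2\right) \\
&= \frac{1}{2}\left(\|\mathbf{U}_A\|_F^2 + \|\mathbf{V}_A\|_F^2\right) + \frac{1}{2}\left(\|\mathbf{U}_B\|_F^2 + \|\mathbf{V}_B\|_F^2\right) \\
&= \|\mathbf{A}\|_* + \|\mathbf{B}\|_*.
\end{aligned}$$

We now show the equality condition. We perform the singular value decomposition of $\mathbf{A}$ and $\mathbf{B}$ as

$$\mathbf{A} = [\mathbf{U}_{A1}, \mathbf{U}_{A2}] \begin{bmatrix} \Sigma_A & 0 \\ 0 & 0 \end{bmatrix} [\mathbf{V}_{A1}, \mathbf{V}_{A2}]^\top, \qquad \mathbf{B} = [\mathbf{U}_{B1}, \mathbf{U}_{B2}] \begin{bmatrix} \Sigma_B & 0 \\ 0 & 0 \end{bmatrix} [\mathbf{V}_{B1}, \mathbf{V}_{B2}]^\top,$$

where the diagonal entries of $\Sigma_A$ and $\Sigma_B$ contain non-zero singular values. We have

$$\mathbf{A}\mathbf{A}^\top = [\mathbf{U}_{A1}, \mathbf{U}_{A2}] \begin{bmatrix} \Sigma_A^2 & 0 \\ 0 & 0 \end{bmatrix} [\mathbf{U}_{A1}, \mathbf{U}_{A2}]^\top, \qquad \mathbf{B}\mathbf{B}^\top = [\mathbf{U}_{B1}, \mathbf{U}_{B2}] \begin{bmatrix} \Sigma_B^2 & 0 \\ 0 & 0 \end{bmatrix} [\mathbf{U}_{B1}, \mathbf{U}_{B2}]^\top.$$

The column spaces of $\mathbf{A}$ and $\mathbf{B}$ are considered to be orthogonal, i.e., $\mathbf{U}_{A1}^\top \mathbf{U}_{B1} = \mathbf{0}$. The above can be written as

$$\mathbf{A}\mathbf{A}^\top = [\mathbf{U}_{A1}, \mathbf{U}_{B1}] \begin{bmatrix} \Sigma_A^2 & 0 \\ 0 & 0 \end{bmatrix} [\mathbf{U}_{A1}, \mathbf{U}_{B1}]^\top, \qquad \mathbf{B}\mathbf{B}^\top = [\mathbf{U}_{A1}, \mathbf{U}_{B1}] \begin{bmatrix} 0 & 0 \\ 0 & \Sigma_B^2 \end{bmatrix} [\mathbf{U}_{A1}, \mathbf{U}_{B1}]^\top.$$

Then, we have

$$[\mathbf{A}, \mathbf{B}][\mathbf{A}, \mathbf{B}]^\top = \mathbf{A}\mathbf{A}^\top + \mathbf{B}\mathbf{B}^\top = [\mathbf{U}_{A1}, \mathbf{U}_{B1}] \begin{bmatrix} \Sigma_A^2 & 0 \\ 0 & \Sigma_B^2 \end{bmatrix} [\mathbf{U}_{A1}, \mathbf{U}_{B1}]^\top.$$

The nuclear norm $\|\mathbf{A}\|_*$ is the sum of the square roots of the singular values of $\mathbf{A}\mathbf{A}^\top$. Thus, $\|[\mathbf{A}, \mathbf{B}]\|_* = \|\mathbf{A}\|_* + \|\mathbf{B}\|_*$. ∎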
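The characterization from (Srebro et al., 2005) that opens the proof expresses the nuclear norm as the minimum of half the sum of squared Frobenius norms over all factorizations of the matrix. As a small numerical illustration, the sketch below checks that the balanced factorization built from the SVD attains the nuclear norm, while a rescaled factorization of the same matrix gives a larger value. The variable names and the random test matrix are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
nuc = s.sum()                                   # nuclear norm of A

# Balanced factorization A = F G^T with F = U sqrt(S), G = V sqrt(S).
F = U * np.sqrt(s)
G = Vt.T * np.sqrt(s)
balanced = 0.5 * (np.linalg.norm(F, "fro") ** 2
                  + np.linalg.norm(G, "fro") ** 2)

# Rescaling the factors keeps the product equal to A but raises the cost.
c = 3.0
rescaled = 0.5 * (np.linalg.norm(c * F, "fro") ** 2
                  + np.linalg.norm(G / c, "fro") ** 2)
```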

###### Proposition 2.

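Let $\mathbf{A}$ and $\mathbf{B}$ be matrices of the same row dimensions, and let $[\mathbf{A}, \mathbf{B}]$ be the concatenation of $\mathbf{A}$ and $\mathbf{B}$. Then

$$\|[\mathbf{A}, \mathbf{B}]\|_2 \le \|\mathbf{A}\|_2 + \|\mathbf{B}\|_2,$$

with equality if at least one of the two matrices is zero.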

###### Proposition 3.

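Let $\mathbf{A}$ and $\mathbf{B}$ be matrices of the same row dimensions, and let $[\mathbf{A}, \mathbf{B}]$ be the concatenation of $\mathbf{A}$ and $\mathbf{B}$. Then

$$\|[\mathbf{A}, \mathbf{B}]\|_F \le \|\mathbf{A}\|_F + \|\mathbf{B}\|_F,$$

with equality if and only if at least one of the two matrices is zero.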
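To contrast the three norms discussed in the paper, the sketch below computes the gap between the sum of the norms of two matrices and the norm of their concatenation, for a pair with orthogonal column spaces. The nuclear-norm gap vanishes in this case, while the induced 2-norm and Frobenius gaps remain strictly positive, consistent with those norms only closing the gap when a matrix shrinks to zero. The helper `gap` and the synthetic matrices are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# A and B have orthogonal column spaces (disjoint coordinate blocks).
A = np.vstack([rng.standard_normal((2, 5)), np.zeros((2, 5))])
B = np.vstack([np.zeros((2, 5)), rng.standard_normal((2, 5))])
AB = np.hstack([A, B])

def gap(norm):
    """norm(A) + norm(B) - norm([A, B]) for a given matrix norm."""
    return norm(A) + norm(B) - norm(AB)

gap_nuclear = gap(lambda M: np.linalg.norm(M, "nuc"))    # numerically zero
gap_spectral = gap(lambda M: np.linalg.norm(M, 2))       # strictly positive
gap_frobenius = gap(lambda M: np.linalg.norm(M, "fro"))  # strictly positive
```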

## References

- Aharon et al. (2006) Aharon, M., Elad, M., and Bruckstein, A. K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Trans. on Signal Processing, 54(11):4311–4322, Nov. 2006.
- Basri & Jacobs (2003) Basri, R. and Jacobs, D. W. Lambertian reflectance and linear subspaces. IEEE Trans. on Patt. Anal. and Mach. Intell., 25(2):218–233, 2003.
- Belkin & Niyogi (2003) Belkin, M. and Niyogi, P. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15:1373–1396, 2003.
- Breiman (2001) Breiman, L. Random forests. Machine Learning, 45(1):5–32, 2001.
- Candès et al. (2011) Candès, E. J., Li, X., Ma, Y., and Wright, J. Robust principal component analysis? J. ACM, 58(3):11:1–11:37, June 2011.
- Criminisi & Shotton (2013) Criminisi, A. and Shotton, J. Decision Forests for Computer Vision and Medical Image Analysis. Springer, 2013.
- Denil et al. (2013) Denil, M., Matheson, D., and de Freitas, N. Consistency of online random forests. In International Conference on Machine Learning, 2013.
- Elhamifar & Vidal (2013) Elhamifar, E. and Vidal, R. Sparse subspace clustering: Algorithm, theory, and applications. IEEE Trans. on Patt. Anal. and Mach. Intell., 2013. To appear.
- Fazel (2002) Fazel, M. Matrix Rank Minimization with Applications. PhD thesis, Stanford University, 2002.
- Gall & Lempitsky (2009) Gall, J. and Lempitsky, V. Class-specific Hough forests for object detection. In Proc. IEEE Computer Society Conf. on Computer Vision and Patt. Recn., 2009.
- Gemert et al. (2008) Gemert, J. C., Geusebroek, J., Veenman, C. J., and Smeulders, A. W. Kernel codebooks for scene categorization. In Proc. European Conference on Computer Vision, 2008.
- Hastie & Simard (1998) Hastie, T. and Simard, P. Y. Metrics and models for handwritten character recognition. Statistical Science, 13(1):54–65, 1998.
- Jiang et al. (2011) Jiang, Z., Lin, Z., and Davis, L. S. Learning a discriminative dictionary for sparse coding via label consistent K-SVD. In Proc. IEEE Computer Society Conf. on Computer Vision and Patt. Recn., Colorado Springs, CO, 2011.
- Kuybeda et al. (2013) Kuybeda, O., Frank, G. A., Bartesaghi, A., Borgnia, M., Subramaniam, S., and Sapiro, G. A collaborative framework for 3D alignment and classification of heterogeneous subvolumes in cryo-electron tomography. Journal of Structural Biology, 181:116–127, 2013.
- Lazebnik et al. (2006) Lazebnik, S., Schmid, C., and Ponce, J. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Proc. IEEE Computer Society Conf. on Computer Vision and Patt. Recn., 2006.
- Miao & Ben-Israel (1992) Miao, J. and Ben-Israel, A. On principal angles between subspaces in $\mathbb{R}^n$. Linear Algebra and its Applications, 171:81–98, 1992.
- Moosmann et al. (2007) Moosmann, F., Triggs, B., and Jurie, F. Fast discriminative visual codebooks using randomized clustering forests. In Advances in Neural Information Processing Systems, 2007.
- Peng et al. (2010) Peng, Y., Ganesh, A., Wright, J., Xu, W., and Ma, Y. RASL: Robust alignment by sparse and low-rank decomposition for linearly correlated images. In Proc. IEEE Computer Society Conf. on Computer Vision and Patt. Recn., San Francisco, USA, 2010.
- Qiu & Sapiro (2013) Qiu, Q. and Sapiro, G. Learning transformations for clustering and classification. CoRR, abs/1309.2074, 2013.
- Quinlan (1993) Quinlan, J. Ross. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc., 1993.
- Recht et al. (2010) Recht, B., Fazel, M., and Parrilo, P. A. Guaranteed minimum rank solutions to linear matrix equations via nuclear norm minimization. SIAM Review, 52(3):471–501, 2010.
- Shen & Wu (2012) Shen, X. and Wu, Y. A unified approach to salient object detection via low rank matrix recovery. In Proc. IEEE Computer Society Conf. on Computer Vision and Patt. Recn., Rhode Island, USA, 2012.
- Shotton et al. (2012) Shotton, J., Girshick, R., Fitzgibbon, A., Sharp, T., Cook, M., Finocchio, M., Moore, R., Kohli, P., Criminisi, A., Kipman, A., and Blake, A. Efficient human pose estimation from single depth images. IEEE Trans. on Patt. Anal. and Mach. Intell., 99(PrePrints), 2012.
- Srebro et al. (2005) Srebro, N., Rennie, J., and Jaakkola, T. Maximum margin matrix factorization. In Advances in Neural Information Processing Systems, Vancouver, Canada, 2005.
- Tomasi & Kanade (1992) Tomasi, C. and Kanade, T. Shape and motion from image streams under orthography: a factorization method. International Journal of Computer Vision, 9:137–154, 1992.
- Wright et al. (2009) Wright, J., Yang, M., Ganesh, A., Sastry, S., and Ma, Y. Robust face recognition via sparse representation. IEEE Trans. on Patt. Anal. and Mach. Intell., 31(2):210–227, 2009.
- Yang et al. (2009) Yang, J., Yu, K., Gong, Y., and Huang, T. Linear spatial pyramid matching using sparse coding for image classification. In Proc. IEEE Computer Society Conf. on Computer Vision and Patt. Recn., 2009.
- Zhang & Li (2010) Zhang, Q. and Li, B. Discriminative K-SVD for dictionary learning in face recognition. In Proc. IEEE Computer Society Conf. on Computer Vision and Patt. Recn., San Francisco, CA, 2010.
- Zhang et al. (2011) Zhang, Z., Liang, X., Ganesh, A., and Ma, Y. TILT: transform invariant low-rank textures. In Proc. Asian Conference on Computer Vision, Queenstown, New Zealand, 2011.