A Statistical Approach to Set Classification by Feature Selection with Applications to Classification of Histopathology Images

# A Statistical Approach to Set Classification by Feature Selection with Applications to Classification of Histopathology Images

Sungkyu Jung Corresponding author Department of Statistics
University of Pittsburgh
Pittsburgh, Pennsylvania 15260, U.S.A.
E-mail: sungkyu@pitt.edu
Xingye Qiao Department of Mathematical Sciences
Binghamton University, State University of New York
Binghamton, New York 13902-6000, U.S.A.
E-mail: qiao@math.binghamton.edu
###### Abstract

Set classification problems arise when classification tasks are based on sets of observations as opposed to individual observations. In set classification, a classification rule is trained with sets of observations, where each set is labeled with class information, and the prediction of a class label is performed also with a set of observations. Data sets for set classification appear, for example, in diagnostics of disease based on multiple cell nucleus images from a single tissue. Relevant statistical models for set classification are introduced, which motivate a set classification framework based on context-free feature extraction. By understanding a set of observations as an empirical distribution, we employ a data-driven method to choose those features which contain information on location and major variation. In particular, the method of principal component analysis is used to extract the features of major variation. Multidimensional scaling is used to represent features as vector-valued points on which conventional classifiers can be applied. The proposed set classification approaches achieve better classification results than competing methods in a number of simulated data examples. The benefits of our method are demonstrated in an analysis of histopathology images of cell nuclei related to liver cancer.

KEYWORDS: Bioinformatics; Canonical angles; Discriminant analysis; Hotelling’s -square; Principal component analysis; Multidimensional scaling; Set classification.

## 1 Introduction

As advances in technology ease semi-automated segmentation and preprocessing of cell nucleus images, more pathologists are relying on histopathology to discriminate diseased tissues from benign tissues. The classification of tissues based on microscopic examination is often achieved using many cell nucleus images. Figure 1 illustrates two sets of cell nucleus images from human liver tissues. The eight nuclei in the left panel belong to a set labeled as normal tissues, while another set of eight nuclei from hepatoblastoma tissues is shown on the right. See Section 6 for the background and analysis of the data. An eminent statistical task to aid pathologists is to develop a method to classify a new tissue sample consisting of many nucleus images into the normal or the malignant. The classification rule in need is learned from sets of observations and also should be able to predict a single class label for a new set of observations. Such a problem, which we call set classification, has not been studied much in the statistical literature, although it appears to be useful in image-based pathology (Samsudin and Bradley, 2010, Wang et al., 2010).

To precisely define the set classification problem, suppose there are tissue samples, each of which is represented by a set consisting of images of cell nuclei in the tissue, i.e., . Each set or the corresponding tissue sample has its label (say, normal, cancerous). Based on these sets of cell nuclei images, we wish to predict the label for a new tissue sample containing images, .

A few characteristics of such a data set make the task challenging. First, the order of observations in each set may be only given by convenience, meaning that there is no correspondence between observations in different sets. In such a case, and in different sets () should not be directly compared. Moreover, the number of observations in each set may be different from one another. These characteristics rule out a naive approach for set classification, which is based on a long vector consisting of the observations . Instead, a more appropriate statistical model is obtained when a set is regarded as an empirical distribution of observations.

To facilitate the understanding of the set as a distribution, a scatterplot of the liver cell nuclei data is overlaid with contours of the estimated normal densities in Figure 2. A point in Figure 2 corresponds to an image of cell nucleus and each is from a tissue labeled as normal or hepatoblastoma if marked by x or o, respectively. Each observation in a set is assumed to be drawn from a distribution, whose mean and covariance are represented by the contour of the density. A useful insight of the data is that different sets have different distributional parameters. A visual inspection of the plot leads us to believe that the mean and covariance parameters will be useful for classification of sets. Classification based on the parameters of the distribution, or the features of the set, is the initial idea which we develop further in this paper.

Previous approaches to the classification of sets of images are mostly focused on extracting context-based features. For example, Tsantis et al. (2009) used morphological features and Wang et al. (2010) used shape and texture features. To the best knowledge of the authors, Ning and Karypis (2008) and Wang et al. (2010) were among the first to use set information in classification. Although not stated explicitly, they seem to have assumed a simple model of set classification, in that the set information of observations is not used in training but only in prediction.

Having examined more general set classification models, we are proposing a context-free feature selection approach to set classification. Our method is based on extracting statistical features of sets such as the mean and principal components (PCs; Jolliffee, 2002). The proposed feature extraction–selection method transforms empirical PC directions and subspaces into feature vectors so that conventional classifiers can be applied in the transformed space. Multidimensional scaling (Borg and Groenen, 2005) is extensively used in mapping PC subspaces into a real vector space. Our procedure is not a single method but a general framework which can be coupled with any off-the-shelf classifier. We demonstrate the use of variants of linear and quadratic discriminant analysis (Friedman, 1989, Srivastava and Kubokawa, 2007), Support Vector Machine (Cristianini and Shawe-Taylor, 2000) and Distance-Weighted Discrimination (Marron et al., 2007, Qiao et al., 2010) in this article. Prediction of the class label for a new observation is also based on its extracted features using multidimensional scaling.

Since we use only statistical features which are free of context, the proposed approach can be applied to many other types of data beyond images.

In the next section, we formulate relevant statistical models for set classification. We then propose our set classification framework in Section 3 and feature selection procedure in Section 4. A simulation study is presented in Section 5 to examine the effectiveness of our methods in high-dimensional situations. In Section 6, the proposed methods are illustrated with a histopathological cell nucleus image data set. We conclude the article with a discussion in Section 7.

## 2 Statistical models for set classification

In this section we describe statistical models from which the set classification problem arises. In set classification, each observation is labeled by two indices: for the class label and for the set membership. The distribution of is different for different sets and classes. A natural model incorporating this characteristic of the data is a model with hierarchy, i.e., a top level for the class label and a lower level for the set membership.

In the hierarchical model, a random observation follows a probability distribution where is the idiosyncratic parameter of each set. The parameter of a set is considered as a random variable whose distribution is determined by a different class label, . That is, the dependence of on the class label is defined only through its set membership. Denote as a random parameter of the distribution of the th set. The hierarchical model for set classification is

 Θi∣(Yi=k) ∼Hk(k=1,…,K), (1) Xi∣(Θi=θ) ∼f(Xi;θ)=ni∏j=1f(xij;θ)(i=1,…,N),

where is the distribution of the parameter for the th class. In the model above, we have assumed that the observations in a set are independent and follow an absolutely continuous distribution with density function .

To elucidate the use of the hierarchical model, consider a model from which the data in Figure 2 may follow. Assuming normal distribution for each observation , it is visually evident from Figure 2 that the sets (or the corresponding distributions) within either the normal or the hepatoblastoma group have different means and covariance matrices. The difference, however, is small compared to the difference between the normal and hepatoblastoma groups. The (random) parameters of the th set may be modeled as and , where denotes the Wishart distribution of dimension with mean and degrees of freedom. The hyper-parameters differ by the class label normal, hepatoblastoma. The set is then a set of i.i.d. observations following .

Other examples of the hierarchical model include situations where only the mean parameter is random while the covariance parameter is fixed, or vice versa. These models are exemplified in the top two panels of Figure 3. Their population structures, displayed as ellipses, clearly visualize the underlying hierarchical model. In both examples, classification of sets can be done by utilizing the parameters (features) of the distribution, represented as the locations and orientations of the ellipses. The separating hyperplane in Figure 3 by the linear discriminant analysis, pooling all observations in different sets together, is clearly less useful than using the distributional features.

A special case of the hierarchical model can be obtained by fixing the random conditioned on as a deterministic . This model is then no longer hierarchical and is referred to as a simple model. As illustrated in the bottom panels of Figure 3, all observations with the same class label have a common distribution. Since set membership plays no role here, a conventional classification method with the weighted voting strategy is Bayes optimal (see Web Appendix A). In our experience with simulated data analysis, our method based on the hierarchical model is superior to the weighted voting method, which is shown in Section 5.

## 3 Methods

Our framework for set classification is two-step. First, features of a set, or the parameters of the corresponding probability distribution, are estimated, selected and transformed. Then, a classifier is built based on the features extracted from the training data set.

### 3.1 Feature extraction

In choosing appropriate features for sets, it is worth revisiting the characteristics of the liver cell nuclei data illustrated in Figure 1 and 2. First, each observation in the data lies in a fairly high-dimensional space while the number of sets and the number of observations are relatively small. Modeling with a large number of parameters is problematic due to collinearity and can lead to overfitting. Second, as can be seen in Figure 2, the location and orientational information of the distribution (the principal axis of the ellipses) is useful to visually separate sets between different groups. These findings are what motivated us to consider using principal component analysis (PCA; Jolliffee, 2002) to learn orientational information as well as the sample mean for the location to extract succinct but useful features of a set.

In the following, we discuss a method of extracting the orientation of the distribution through PCA. An application of multidimensional scaling (MDS) follows in order to represent the principal component features in a low-dimensional vector space. The vector-valued features which contain orientational information are then attached to the sample means for use in the sequel.

#### 3.1.1 Principal component spaces

Denote the sample mean of the observations in the th set () by . The empirical covariance matrix for the th set is eigen-decomposed into , where the eigenvectors (PC directions) are orthonormal to each other and the eigenvalues (PC variances) are in descending order. In the high-dimensional, low-sample size situation, i.e., , we have for all .

The direction of the major variation is represented by the first few empirical PC directions for some . However, using the direction vectors as features can lead to an erroneous conclusion. Since the PC directions are axial, e.g., is essentially the same as , a special care is needed. Moreover, denote for the eigen-pair from the population covariance matrix. The smaller is, the larger the variation in the empirical PC directions and . In the worst case, where , the population PC directions and are indistinguishable and thus unidentifiable. In such a case, the estimates and are only meaningful if they are understood as a whole. To be more precise,

###### Proposition 1.

If then the subspace spanned by the first two population PC directions is identifiable, and is a consistent estimator of .

A proof of Proposition 1 can be found in Web Appendix A, in which the consistency of subspaces is formally defined.

On this end, denote the subspace spanned by the first empirical PC directions of the th set by for . Which to use is an important matter for model selection, and in Section 4 we propose a data-driven method to choose . The arguments in the current section are applied to any given .

Note that the collection does not lie in a Euclidean vector space, but in the Grassmannian manifold (Chikuse, 2003). Therefore instead of analyzing elements in the curved manifold, s are mapped onto a linear vector space. Such a mapping cannot be obtained by conventional dimension reduction methods such as PCA but can be achieved by multidimensional scaling. We first point out a distance function between s, on which the mapping is defined.

#### 3.1.2 Distances between principal component spaces

The distance between two subspaces of can be measured in terms of the canonical angles formed by the subspaces (cf. Stewart and Sun, 1990). As an intuitive example for the canonical angles, consider a simple case where both subspaces are of dimension 1. The two subspaces are each spanned by unit vectors and . Then the distance between the subspaces is measured as the smallest angle formed by the basis vectors, i.e., , which is also called the canonical angle. For more general situations, canonical angles are computed as follows. Without loss of generality, let and be two subspaces with dimensions and respectively, and and be matrices consisting of basis vectors of and , respectively. In particular, the orthonormal matrix is a basis matrix for the PC subspace . Let be the th largest singular value of . Then the canonical angles between and are for

A distance between and is now defined using the canonical angles. We use a modified Euclidean sine metric (cf. Stewart and Sun, 1990), defined by where is a constant. Since , is at most . The role of the constant is to make the distance commensurate with other features such as the mean. Specifically, when measuring pairwise distances among the PC spaces s, we choose to be the average of the empirical total variance in s, . This choice of leads to good classification results, which are reported in Web Appendix B.

#### 3.1.3 Mapping L(r)i via multidimensional scaling

Multidimensional scaling (MDS) maps objects with distances into corresponding points in an -dimensional space . The vector-valued configuration is found by minimizing a loss function, which measures how well the original distances are approximated by . MDS can be used to map data in a non-Euclidean space into a vector space so that conventional multivariate methods (based on the Euclidean geometry) can be used. In extracting features corresponding to the PC spaces , we use MDS to construct a mapping from the non-Euclidean objects into vector-valued points. We use the classical multidimensional scaling (CMDS) since it gives an analytic solution requiring no numerical optimization. Detailed discussion on methods of MDS can be found in Borg and Groenen (2005) and Cox and Cox (2008). Note that PCA is not applicable for , which lies on a curved manifold .

Mapping training sample: In CMDS, the matrix of configuration is obtained from the matrix of pairwise squared-distances . Writing the matrix of pairwise squared Euclidean distances between columns of as , we minimize the loss function

 L(Z)=∥∥∥−12C(D(Z)−Δ)C∥∥∥2F, (2)

where is the centering matrix, and is the Frobenius norm of matrix . An analytic solution that minimizes (2) is available, due to Gower (1966).

###### Theorem 2.

Let . From the eigen-decomposition of , let be the diagonal matrix consisting of the positive eigenvalues and the corresponding columns of . Then minimizes (2).

The configuration lies in -dimensional real vector space, where is the number of positive eigenvalues of . Note that is the Gram matrix (consisting of the inner products of columns of ), and that if all eigenvalues of are nonnegative, coincides with . The coordinate matrix is the mapped image in of the PC spaces , based on which a classifier can be trained.

Mapping for prediction: For a new observation , we wish to extend the loss function (2) in order to map the new PC space for classification. We include the new observation in the loss function while the training points are fixed. Define the symmetric matrix of pairwise squared-distances

 Δ†=(Δδ12δ210),

where . Let be the matrix of coordinates, where the first columns of represent the training points and the last column represents the new point. That is,

 Z†(z)=(Z0TNz),

where is the minimizer of (2). While the training configuration spans dimensions, the new point may not be found in span(). Therefore the dimension of is increased by 1. We generalize (2) to minimize

 LN(z)=∥∥∥−12PN(D(Z†(z))−Δ†)PTN∥∥∥2F, (3)

where is a projection matrix that works similarly to the centering matrix except that the mean of the first columns (i.e., the training points ) is used for the centering.

For the purpose of classification, the last coordinate of the new is not needed, because a classification rule learned from only uses the first coordinates. An analytic solution for the first coordinates of , hereafter denoted by , is also available. Define

 B†=−12PNΔ†PTN=(Bb12b21b2).

It can be seen that is the vector of inner products , i.e., We then have the coordinates of the new point

 z†=Λ−12+QT+b12. (4)

The following theorem justifies the use of (4) as the feature vector of the new observation .

###### Theorem 3.

If , then

 (5)

are the local minima of (3). If the equality holds, i.e., , then (5) is the global minimum of . Moreover, holds if is nonnegative definite.

To the best of our knowledge, Theorem 3 is the first attempt to obtain a closed form solution for mapping new observations onto coordinates of training configuration by CMDS. The loss function is a fourth order polynomial of multiple arguments, thus finding a global minimum of is a challenging task, involving iterative numerical methods (cf. Nie, 2006). Theorem 3 provides sufficient conditions for the existence of closed form local (or global) minima of . A sufficient condition for to be nonnegative definite is that the distance involved be Euclidean (Borg and Groenen, 2005, Ch. 19). Although our metric is not Euclidean, appears to be nonnegative definite in data analyses we have encountered, and the use of has shown a good performance.

### 3.2 Proposed set classification procedure

The principal component features obtained from the previous section are now combined with the sample mean , on which a classification rule is trained. Specifically, our set classification rule is learned in two steps:

Step 1:

(Feature extraction)
(a) For a given , obtain the sample mean and PC spaces for each set (Section 3.1.1);
(b) Use CMDS to obtain the empirical PC space features (Sections 3.1.2 and 3.1.3);
(c) Collect combined features .

Step 2:

Build a classification rule with any off-the-shelf methods, with inputs .

Prediction of the class label of a new observation is also done in two steps. First obtain the features of the new set, where is the mean vector of the observation in the set and is from (4). The classification rule trained in Step 2 above is applied to the input to predict the label of the new set .

## 4 Principal component space selection

The choice of an appropriate dimension for the PC subspaces is a crucial step in feature extraction. We propose selecting an optimal dimension using training samples without making reference to the specific classification rules used in Step 2 above. A test to check whether one should choose , i.e., discarding the PC features, is also considered.

Our strategy for selection of the dimension of the PC spaces is to use a modified Hotelling’s -squared statistic defined for each , where is the greatest meaningful dimension we can choose. Here, denote the coordinate matrix of for each by , as defined in Theorem 2. Let and be a partition of , where the set (or ) contains the indices corresponding to group (or 2). Let , . Denote the mean difference by , where , the pooled covariance matrix by and the diagonal matrix consisting of the diagonal elements of by .

The modified Hotelling’s statistic for comparing two multivariate populations is the sum of squared marginal -statistics (Srivastava and Du, 2008). We choose an that gives the greatest value of , i.e.,

 ^r=^r(T)=\operatornamewithlimitsargmaxr=1,…,RT(r), (6)

since the greater is, the more separable the groups are. A rationale for the diagonal matrix in comes from the use of the CMDS in Section 3.1.3. The coordinates of the features are chosen so that they are in fact the principal axes; there is no correlation between coordinates of . Moreover, with a small , the inversion of the matrix may cause numerical artifacts. Thus it is natural to exploit only the variances while discarding the covariances of . Accordingly, is used instead of .

A permutation test for usefulness of the chosen PC features is now discussed. The features may not contain any discriminating information when, for example, the sub-populations differ only by location. This is translated into the null hypothesis:

 H0:z(^r)i, i=1,…,N,  are identically % disributed.

The test statistic tends to be large when the alternative hypothesis (not ) is true.

Since the group labels are exchangeable under the null hypothesis, the sampling distribution of under is obtained by permuting the class labels. Let be the re-calculated statistics using permuted data (the th random permutation). The test then rejects if the test statistic is larger than the th percentile of the permuted statistics for a level of test . In other words, the -value of is given by using random permutations.

The dimension chosen by (6) is used for the PC features if ; otherwise we update so that the insignificant PC spaces are discarded.

When the true underlying model has no covariance difference between groups, the proposed permutation test greatly enhances the performance of our classification rules, as demonstrated in Web Appendix D.

## 5 Numerical studies

### 5.1 Competing methods

Our feature selection–set classification methods are denoted by PCF-‘classifier.’ For example, PCF-LDA denotes a set classification rule trained by linear discriminant analysis (LDA) on the extracted features, as that obtained in Sections 3 and 4. For the level of the permutation test, was used. Classifiers considered in numerical studies include LDA, quadratic discriminant analysis (QDA), and Support Vector Machines (SVM, Cristianini and Shawe-Taylor, 2000), Distance-Weighted Discrimination (DWD, Marron et al., 2007), and Minimum Distance Empirical Bayes rule (MDEB, Srivastava and Kubokawa, 2007).

Competing methods include voting classifiers, including two previous methods in Ning and Karypis (2008), Wang et al. (2010). Consider binary classification with and equal class sizes. Using LDA trained from all training samples, ignoring the set memberships, each in a new set can be classified to . Wang et al. (2010) proposed to use a set classifier determined by a majority vote of the individual predictions of , thus called an LDA-MV classifier. Weighted voting of the individual predictors, on the other hand, takes place when an LDA-WV classifier is used. The LDA-WV classifier is an optimal classifier under the simple model (cf. Section 2). Classifiers QDA-(MV,WV), MDEB-(MV,WV), SVM-MV and DWD-MV are defined similar to LDA-(MV,WV) using the chosen classifiers. Precise definitions of these classifiers are provided in Web Appendix E. Ning and Karypis (2008) proposed using SVM-MV in classifications of sets.

In summary, the set classification methods we have considered are categorized into the following.

1. Proposed set classifiers: PCF-LDA, PCF-QDA, PCF-SVM, PCF-DWD and PCF-MDEB;

2. Existing methods: LDA-MV and SVM-MV;

3. Other competing methods: LDA-WV, QDA-(MV,WV), MDEB-(MV,WV) and DWD-MV.

### 5.2 Simulation models and results

We use the following four hierarchical models for binary set classification. Denote for the vector whose components are all zeros except for a 1 in its th. We have , where and

1. No covariance difference: ;

2. Wishart: , where and ;

3. Inverse Wishart: , where ;

4. von Mises–Fisher: , where and

We use a class of covariance matrices to consider highly correlated variables (Srivastava and Kubokawa, 2007, p.129). The is a modified auto-regressive covariance matrix and is , for , where are independently drawn from the uniform distribution on . The von Mises–Fisher distribution used in model (4) is a Gausssian-like distribution for direction vectors, with mode at and a concentration parameter (cf. Mardia and Jupp, 2000). In all of the models, and are independent.

For each model, the performances of the methods in Section 5.1 are evaluated by empirical misclassification rates. The parameters of the models are set as , , and . We tested the number of sets (equal class sizes) with , and the set size was independently selected by .

Table 1 summarizes the result of experiments based on 100 repetitions for . Throughout different models and various settings of dimension–sample size pairs, the proposed methods exhibit much smaller misclassification rates than the competing methods. This is not surprising because the models (2–4) from which the data are generated are the hierarchical models with a difference in PC structures. Since our method choose the PC spaces as features, good performance is to be expected. Moreover, even when there is no covariance difference in model (1), the proposed methods are comparable to the weighted voting methods. In model (1) with , the permutation test in Section 4 makes the PC space features used only 2–8% of the time, thus effectively discarding unimportant PC information. On the other hand, in models (2–4) where the PC spaces possess discriminating information related to class, the tests resulted in rejecting the null hypothesis with empirical power of 95–100%, thus successfully utilizing the PC space features. When the variables are highly correlated (), our methods show rates smaller or comparable to those obtained using MDEB-WV, which is designed to work well for correlated variables. The simulation result also confirms that LDA-WV is not optimal in the general hierarchical model.

## 6 Classification of liver cell nucleus images

One of the common procedures used to diagnose hepatoblastoma (a rare malignant liver cancer) is biopsy–a sample tissue of a tumor is removed and examined under a microscope. A tissue sample contains a number of nuclei, a subset of which is then processed to obtain segmented images of nuclei. The data we analyzed contain 5 sets of nuclei from normal liver tissues and 5 sets of nuclei from cancerous tissues; see Figure 1. Each set contains images. The data set is available at http://www.andrew.cmu.edu/user/gustavor/software.html and is introduced in Wang et al. (2010, 2011), in which one can find details on registration and segmentation of the nucleus images.

We tested the proposed method on the liver cell nuclei image data set in two different ways. First, using all 50 observations in each set, we evaluated the leave-one-out classification error rates by fixing one set as test data and training with the remaining 9 sets. This experiment was repeated for dimension-reduced data (using PCA for combined data). Table 2 summarizes the results. Our methods only misclassified the 9th set, showing the best performance (together with LDA-MV) among all methods considered. The 9th set was misclassified by many methods in different dimensions. This can be explained by a visual inspection of the scatterplot in Figure 2. The orientation of the distribution of the 9th set, illustrated as the thicker orange ellipse in the right panel, is pointing north-east, which is closer to the orientation of the normal group than the north-west orientation of the hepatoblastoma group.

Second, the performances of the set classifiers were evaluated when the number of observations in each set is smaller than 50. Although technology enables processing of an image through semi-automatic registration and segmentation, the cost of obtaining images is still expensive. Thus a method exhibiting solid performance in small setting has a clear advantage over other methods. In particular, we randomly chose images in each of sets as training data and also chose non-overlapping images in each set as testing data. After training the set classifiers, the empirical misclassification rate was computed using the remaining testing data. This procedure was repeated for 100 random subsets of size , for each choice of . Figure 4 summarizes the results of the small- experiment. Our methods exhibited significantly better performance than the others, with the exception of MDEB-WV, for both the original and dimension-reduced data sets.

The features of the data we chose only include ‘statistical features’ of the raw images, but do not include any other features that needs expert input, such as shape or texture features (cf. Chen et al., 2006).

## 7 Discussion

We have introduced a novel set classification framework and proposed a feature extraction–selection procedure. The method is based on the hierarchical modeling of set classification data. The features of sets, chosen by a data-driven approach, are used as inputs for any off-the-shelf classification method. When the orientation of major variations in sets possesses discriminating information, the proposed method adaptively uses such information for classification. Classical multidimensional scaling with the modified Euclidean sine metric between PC subspaces are shown to work well in mapping nonlinear features into vector-valued points. An analytic solution for the mapping of new observations is derived.

Note that Cox and Ferry (1993) and Witten and Tibshirani (2011) also proposed using multidimensional scaling (MDS) for classification. However, the method proposed in Cox and Ferry only works for vector inputs and uses a linear regression for mapping of new observations. Witten and Tibshirani proposed modifying the nonmetric multidimensional scaling, which requires an iterative algorithm to solve, but they did not consider using MDS for set classification.

There is the potential to extend our framework to more general cases. In particular, the independence assumption in the hierarchical model may be relaxed to model dependencies among observations in a set. This is especially relevant in classification of tissues based on sets of images because the cells in a tissue have location information. Images observed with close locations can be assumed to have high correlation. We conjecture that including the location in the hierarchical model greatly increases the accuracy of classification. We have focused on the independent hierarchical model in this paper, as it serves well as an introduction to set classification.

The performance of our set classifiers can be greatly limited when the set sizes are severely imbalanced. The principal component estimates for each set use only samples. If vary greatly, so does the credibility of the estimated features. Moreover, the choice of the dimension for the PC features in Section 4 is also limited up to . These problems are mitigated by excluding those sets with a small sample size or by applying a sensible dimension reduction, but the classifiers could be further honed to properly accommodate imbalanced set sizes.

Another issue is the use of classifiers. We have demonstrated the use of conventional classifiers (LDA, QDA and linear SVM), but have not examined using more sophisticated methods such as kernel machines (cf. Cristianini and Shawe-Taylor, 2000). While use of kernel machines may benefit the classification task as shown in Wang et al. (2010), we chose to focus on our feature selection procedure using simpler methods. On the other hand, since our method is a general framework that works with any classifier, the presented methods have the potential to be improved by advanced kernel machines.

## 8 Supplementary Materials

Web Appendices referenced in Sections 2, 3.1, 4 and 5 as well as a zipped code and example data are available.

## Acknowledgements

We thank Gustavo Rohde for discussions on related topics and providing the histopathology data sets, and the Editor, Associate Editor and anonymous referees for their valuable comments. Jung was partially supported by NSF grant DMS-1307178. Qiao was partially supported by the Simons Foundation #246649.

## References

• Ahn and Marron (2010) Ahn, J. and Marron, J. S. (2010), “The maximal data piling direction for discrimination,” Biometrika, 97, 254–259.
• Ahn et al. (2007) Ahn, J., Marron, J. S., Muller, K. M., and Chi, Y.-Y. (2007), “The high-dimension, low-sample-size geometric representation holds under mild conditions,” Biometrika, 94, 760–766.
• Bickel and Levina (2004) Bickel, P. and Levina, E. (2004), “Some theory for Fisher’s linear discriminant function, ‘naive Bayes’, and some alternatives when there are many more variables than observations,” Bernoulli, 10, 989–1010.
• Borg and Groenen (2005) Borg, I. and Groenen, P. J. (2005), Modern Multidimensional Scaling: Theory and Applications, Springer Series in Statistics, 2nd ed.
• Bradley (2005) Bradley, R. C. (2005), ‘‘Basic properties of strong mixing conditions. A survey and some open questions,” Probab. Surv., 2, 107–144 (electronic), update of, and a supplement to, the 1986 original.
• Casella and Berger (2002) Casella, G. and Berger, R. L. (2002), Statistical inference, Pacific Grove, CA: Duxbury, 2nd ed.
• Chen et al. (2006) Chen, Y. C., Hu, K. H., Li, F. Z., Li, S. Y., Su, W. F., Huang, Z. Y., and Hu, Y. X. (2006), “Studies on quantitative analysis and automatic recognition of cell types of lung cancer,” Biomed. Mater. Eng., 16, 119–128.
• Chikuse (2003) Chikuse, Y. (2003), Statistics on special manifolds, New York: Springer-Verlag.
• Cox and Cox (2008) Cox, T. F. and Cox, M. a. a. (2008), “Multidimensional Scaling,” in Springer Handbooks of Computational Statistics 2008, Handbook of Data Visualization, III, eds. Chen, C.-h., Härdle, W., and Unwin, A., Springer, vol. 38, pp. 315–347.
• Cox and Ferry (1993) Cox, T. F. and Ferry, G. (1993), ‘‘Discriminant analysis using non-metric multidimensional scaling,” Pattern Recog., 26, 145–153.
• Cristianini and Shawe-Taylor (2000) Cristianini, N. and Shawe-Taylor, J. (2000), An Introduction to Support Vector Machines, Cambridge University Press.
• Dempster (1960) Dempster, A. P. (1960), “A significance test for the separation of two highly multivariate small samples,” Biometrics, 16, 41–50.
• Friedman (1989) Friedman, J. H. (1989), “Regularized discriminant analysis,” J. Amer. Statist. Assoc., 84, 165–175.
• Gower (1966) Gower, J. C. (1966), “Some Distance Properties of Latent Root and Vector Methods Used in Multivariate Analysis,” Biometrika, 53, 325–338.
• Hall et al. (2005) Hall, P., Marron, J. S., and Neeman, A. (2005), “Geometric representation of high dimension, low sample size data,” Journal of the Royal Statistical Society - Series B: Statistical Methodology, 67, 427–444.
• Jolliffee (2002) Jolliffee, I. T. (2002), Principal Component Analysis, New York: Springer-Verlag, 2nd ed.
• Jung and Marron (2009) Jung, S. and Marron, J. S. (2009), “PCA consistency in high dimension, low sample size context,” The Annals of Statistics, 37, 4104–4130.
• Jung et al. (2012) Jung, S., Sen, A., and Marron, J. S. (2012), “Boundary behavior in high dimension, low sample size asymptotics of PCA,” The Journal of Multivariate Analysis, 109, 190–203.
• Lee (2007) Lee, M. H. (2007), “Continuum Direction Vectors in High Dimensional Low Sample Size Data,” Ph.D. thesis, University of North Carolina at Chapel Hill.
• Mardia and Jupp (2000) Mardia, K. V. and Jupp, P. E. (2000), Directional statistics, Chichester: John Wiley & Sons Ltd.
• Marron et al. (2007) Marron, J. S., Todd, M. J., and Ahn, J. (2007), “Distance Weighted Discrimination,” J. Amer. Statist. Assoc., 102, 1267–1271.
• Nie (2006) Nie, J. (2006), “Global Optimization of Polynomial Functions and Applications,” Ph.D. thesis, University of California, Berkeley.
• Ning and Karypis (2008) Ning, X. and Karypis, G. (2008), “The Set Classification Problem and Solution Methods,” in IEEE International Conference on Data Mining Workshops, pp. 720–729.
• Qiao et al. (2010) Qiao, X., Zhang, H. H., Liu, Y., Todd, M. J., and Marron, J. S. (2010), “Weighted distance weighted discrimination and its asymptotic properties,” J. Amer. Statist. Assoc., 105, 401–414.
• Samsudin and Bradley (2010) Samsudin, N. A. and Bradley, A. P. (2010), “Nearest neighbour group-based classification,” Pattern Recogn., 43, 3458–3467.
• Srivastava and Du (2008) Srivastava, M. S. and Du, M. (2008), “A test for the mean vector with fewer observations than the dimension,” J. Multivariate Anal., 99, 386–402.
• Srivastava and Fujikoshi (2006) Srivastava, M. S. and Fujikoshi, Y. (2006), “Multivariate analysis of variance with fewer observations than the dimension,” J. Multivariate Anal., 97, 1927–1940.
• Srivastava and Kubokawa (2007) Srivastava, M. S. and Kubokawa, T. (2007), “Comparison of discrimination methods for high dimensional data,” J. Japan Statist. Soc., 37, 123–134.
• Stewart and Sun (1990) Stewart, G. W. and Sun, J. G. (1990), Matrix perturbation theory, Boston: Academic Press.
• Tsantis et al. (2009) Tsantis, S., Dimitropoulos, N., Cavouras, D., and Nikiforidis, G. (2009), “Morphological and wavelet features towards sonographic thyroid nodules evaluation,” Comput. Med. Imag. Grap., 33, 91–99.
• Wang et al. (2010) Wang, W., Ozolek, J. A., and Rohde, G. K. (2010), “Detection and classification of thyroid follicular lesions based on nuclear structure from histopathology images.” Cytom. Part A, 77, 485–494.
• Wang et al. (2011) Wang, W., Ozolek, J. A., Slepčev, D., Lee, A. B., Chen, C., and Rohde, G. K. (2011), “An optimal transportation approach for nuclear structure-based pathology,” IEEE Trans. Med. Imaging, 30, 621631.
• Witten and Tibshirani (2011) Witten, D. M. and Tibshirani, R. (2011), “Supervised multidimensional scaling for visualization, classification, and bipartite ranking,” Comput. Stat. Data Anal., 55, 789 – 801.
• Yata and Aoshima (2012) Yata, K. and Aoshima, M. (2012), “Effective PCA for high-dimension, low-sample-size data with noise reduction via geometric representations,” J. Multivariate Anal., 105, 193–215.

Web-based Supplementary Materials: “A Statistical Approach to Set Classification by Feature Selection with Applications to Classification of Histopathology images”

by Sungkyu Jung and Xingye Qiao

University of Pittsburgh and State University of New York, Binghamton

## Summary

This supplementary document for ‘A Statistical Approach to Set Classification by Feature Selection with Applications to Classification of Histopathology images’ includes

1. Optimal decision rules in set classification as referenced in Section 2 (Web Appendix A.1–A.3);

2. A proof of Proposition 1 as referenced in Section 3.1.1 (Web Appendix A.4);

3. A proof of Theorem 2 as referenced in Section 3.1.3 (Web Appendix A.5);

4. A discussion on the scale factor as referenced in Section 3.1.2 (Web Appendix B);

5. Additional tables and figures as referenced in Section 4 (Web Appendix C and D);

6. Additional definitions and tables as referenced in Section 5 (Web Appendix E and F).

7. A collection of Matlab codes for the proposed set classification and example data are available as a zipped file.

## Appendix A Technical details

In this section, we first provide theoretical optimal classifiers for the set classification models introduced in Section 2, and discuss their implications to the proposed method. In sections A.4 and A.5 proofs of Proposition 1 and Theorem 3 are provided.

### a.1 Simple model

The simple model for set classification, referenced in Section 2, is a special case of the hierarchical model and can be written as

 Xi∣(Y=k)∼ni∏j=1fk(xij)

with .

Let be a new random object from the simple model. Let be a collection of set classification rules, and . The theoretical Bayes optimal decision rule minimizes the risk function among all . Since

 1−R(ϕ) =K∑k=1πkP(ϕ(X)=k∣Y=k) =∑kπk∫\mathbbm1{ϕ(X)=k}fk(X)dX,

the optimal must satisfy for any .

With an assumption that each is a -variate normal distribution with mean and covariance matrix , a classifier using the linear discriminant analysis can be shown to be optimal in the following sense. For simplicity, consider binary classification with and assume . Then the Bayes decision rule is

 ϕ∗ (X)=\operatornamewithlimitsargmaxk∈{+1,−1}n∏j=1fk(xj) =sign{(n∏j=1f+(xj)/n∏j=1f−(xj))−1} =sign{n∑j=1(xj−μ++μ−2)TΣ−1(μ+−μ1)}. (7)

An estimator of the optimal rule can be obtained by plugging-in the estimates and . If the order of the sign operator and summation in (7) is switched, then a related, but suboptimal, classifier is obtained. The corresponding empirical classifier

 n∑j=1sign⎧⎨⎩(xj−ˆμ++ˆμ−2)TˆΣ−1(ˆμ+−ˆμ1)⎫⎬⎭ (8)

works as if the set label is determined by a majority vote of the individual predictions of . We call (8) as an LDA-MV classifier. In contrast, the optimal classifier (7) works as if the individual predictions vote with weights, thus called a LDA-WV (weighted voting) classifier.

The LDA-WV is optimal only if the simple model can be assumed. In the hierarchical model, both LDA-MV and LDA-WV exhibit poor performances in the simulated and real data analyses, shown in Section 5 and 6. The majority voting strategy has been suggested in Ning and Karypis (2008) and Wang et al. (2010), but they have not discussed its relationship to the theoretical Bayes rule.

### a.2 Hierarchical model

The hierarchical model (equation 1, Section 2) enables us to model appropriate characteristics of set classification data. In the hierarchical model, the optimal classification rule for prediction of a new set minimizes the risk . Writing , we have

 P (ϕ(X)=Y)=K∑k=1πkP(ϕ(X)=k∣Y=k) = K∑k=1πkEΘ[P(ϕ(X)=k∣Θ)∣Y=k] = (9)

where is the conditional density function of given , provided that the density exists. From (9), the optimal rule predicts if maximizes

 πk∫f(X;θ)hk(θ)dθ. (10)

Since the estimation of (10) is challenging, we look at the problem at a different angle. First consider the special case that the parameter is fixed for each class , i.e., . Then (10) is simplified to , which is the expression used in the optimal classifier of the simple model. On the other hand, if the distribution of given is concentrated at only one point, i.e., for an invertible function , then the optimal classifier is

 ϕ(X)=\operatornamewithlimitsargmaxk=1,…,Kπkhk(g(X))=\operatornamewithlimitsargmaxk=1,…,Kπkhk(θ(X)). (11)

In the latter case, a classifier based on is as good as a classifier based on the parameter of the set. This theoretical observation, as well as the empirical observation from the real data in Fig. 2, has contributed in considering a classification based on the parameter (or the features of a set) rather than individual observations in .

The proposed set classification procedure can be thought of as building a classification rule on the predicted parameter (feature) . The parameter is modeled to include the mean and PC spaces. A remaining question is the quality of the prediction of . If is close to , then the empirical classifier is as good as the optimal classifier (11). In the next section we show that the empirical PC space features are as good as the true PC spaces in the high dimensional asymptotic context.

### a.3 High dimensional asymptotic theory

The quality of the empirical features is now assessed in the high-dimensional, low-sample size (HDLSS) asymptotic context to support the theoretical motivation in Section A.2. Our image dataset suits well in HDLSS context since it is in very high dimension with being tens of thousands while the number of sets and the number of observations in the set are both small. An asymptotic investigation where and both fixed is thus relevant to the analysis of such HDLSS data (Hall et al., 2005, Ahn et al., 2007).

Since we are interested in the empirical PC spaces, assume for , the th random observation in the th set, and . Here, the subscript is used to emphasize the dependence on the dimension . Let be the true PC space, which is the target of the empirical PC space .

The following theorem states that when the first eigenvalues are much larger than the others, then the set classification based on the empirical PC spaces is as good as using the true PC spaces in the limit.

For two sequences and let stand for the relation and .

We need the following moment condition:

(C) Each components of have finite fourth moments for all , and there exists a permutation so that the permuted sequence of loadings of the scaled PCs are -mixing (Bradley, 2005).

###### Theorem 4.

Assume that , for and the rest of eigenvalues are fixed, for example for . Suppose is fixed and (C) holds.

1. If , then is consistent with in the sense that in probability as .

2. If , then is strongly inconsistent with in the limit in the sense that

3. If , weakly converges to a distribution with support on as .

Theorem 4 tells us that, in the limit , can be used in place of when signal from the PCs is strong (i.e., large ). Theorem 4 can be shown by an application of Theorem 2 of Jung and Marron (2009) (for ) and Theorem 3 of Jung et al. (2012) (for ) with the fact that for .

Next, we show that the classification based on is optimal in a simplified setting. Suppose that in the binary classification we have;

###### Assumption 1.

is modeled with a single factor as , where and .

The following theorem is the result of directly applying the geometric representation theory in Hall et al. (2005), Ahn et al. (2007) and Qiao et al. (2010). Note that we have assumed normality for to simplify the situation.

###### Proposition 5.

Under Assumption 1, the pairwise distances between the (, resp.) leading eigenvectors for sets from the positive (negative, resp.) class are approximately the same. In particular, scaled by , the squared distances satisfy and

Now if we further assume that the squared mean difference for the ’s is , then we can derive the following asymptotic result which outlines the conditions under which the SVM classifier can always correctly classify the set data based on the leading eigenvectors.

###### Theorem 6.

Without loss of generality, assume that , and hence .

1. If , then with probability converging to 1 as , a new set from either class will be correctly classified by the SVM classifier.

2. If , then with probability converging to 1 as , a new set from either class will be misclassified by the SVM classifier.

3. If , then when , with probability converging to 1 as , a new set from either class will be correctly classified by the SVM classifier.

### a.4 Proof of Proposition 1

The identifiability of the parameter is assured by the assumption and the fact that the spectral decomposition is unique up to sign change. See, for example, Casella and Berger (2002, p. 523) for the notion of identifiability.

The consistency of