Classification and sparsesignature extraction from geneexpression data
Abstract
In this work we suggest a statistical mechanics approach to the classification of highdimensional data according to a binary label. We propose an algorithm whose aim is twofold: First it learns a classifier from a relatively small number of data, second it extracts a sparse signature, i.e. a lowerdimensional subspace carrying the information needed for the classification. In particular the second part of the task is NPhard, therefore we propose a statisticalmechanics based messagepassing approach. The resulting algorithm is firstly tested on artificial data to prove its validity, but also to elucidate possible limitations.
As an important application, we consider the classification of geneexpression data measured in various types of cancer tissues. We find that, despite the currently low quantity and quality of available data (the number of available samples is much smaller than the number of measured genes, limiting thus strongly the predictive capacities), the algorithm performs slightly better than many stateoftheart approaches in bioinformatics.
pacs:
07.05.Kf Data analysis: algorithms and implementation, data management; 02.50.Tt Inference methods; 98.52.Cf Classification and classification systems; 05.00.00 Statistical physics, thermodynamics, and nonlinear dynamical systemsI Introduction
Extracting information from high dimensional data has become a major challenge in biological research. The development of experimental highthroughput techniques allows to monitor simultaneously the behavior of genes, proteins and other cellular constituents on a genomewide scale. Linking the geneexpression profiles of specific tissues to global phenotypic properties, as, e.g., the emergence of pathologies, is one of the most important goals of this kind of studies. Particular attention is paid on cancer tissues, where the ability of classifying such tissues according to their cancer type has immediate impact on establishing an appropriate medical treatment Bair_Tibshirani02; Beer_etal_NatMed_02; Bhattacharjee_PNAS_2002; Ramaswamy_Nat_Gen_2003. Global geneexpression profiling gives a new perspective in this direction.
In this work, we consider microarray data, which measure the abundance of messenger RNA as a mark of gene expression. More precisely, the logarithm of the relative abundance compared to a reference pattern (e.g. normal tissue or average over many tissues) is recorded, such that positive values correspond to over, negative to underexpression with respect to the reference values. Each microarray contains the information about the expression levels of thousands of genes, while the number of microarrays available for a given problem usually ranges from some tens to few hundreds. These relatively few, but highdimensional and, as a further complication, noisy expression patterns render the classification task computationally hard.
A promising strategy for solving this problem is that of systematically reducing the space of the system by isolating small set of genes that are thought to be particularly relevant for the problem. Such a set is often called a gene signature. There are three major reasons: (i) One may hope that the selected genes have some biological relevance elucidating processes related to the pathology. (ii) The reduction of the number of considered genes improves the a priori unfavorable ratio between data dimension and pattern number, and reduces the risk of overfitting data, so that better predictive capabilities are to be expected. (iii) Customdesigned chips monitoring only the selected genes can be constructed, reducing noise and cost of the experiments. This would lead to an increased quality and quantity of data, and thus to better possibilities of extracting biologically relevant information.
The problem of extracting a sparse signature from highdimensional datasets is ubiquitous in many different fields ranging from computer science, combinatorial chemistry, text processing and finally to bioinformatics , and different approach have been proposed so far (see Guyon_Elisseeff2003; Guyon_etal_2006 for a general introduction to the problem and EinDorZukDomany2006 for a detailed discussion about cancer classification based on expression patterns). The literature about feature selection has so far concentrated around three different strategies: (i) wrapper which utilizes learning to score signatures according to their predictive value, (ii) filters that fix the signature as a preprocessing step independent from the classification strategy used in the second step. In our work we will present a new wrapper strategy which fall into the subclass of embedded methods where variable selection is performed in the training process of the classifier.
Two complementary approaches can be considered to face a classification problem: An unsupervised approach tries to suitably regroup/cluster samples according to some inherent similarity measure jain99data; Duda2000PCL, whereas supervised approaches exploit a (partial) labeling of the data (e.g. available diagnoses for part of the patients) and aim at at learning a rule which allows to label also previously unlabeled data (e.g. patients without given diagnosis). The main problem of a supervised approach for microarray data is the already mentioned small number of available labeled samples (tissues already classified) compared with the high dimensionality (the number of genes that a priori have to be taken into account) of each of them. On the other hand, a supervised approach allows to capture the information that is relevant for the sought classification, avoiding to be misled by other structures present in the data keymartin.
In this paper we propose a generalization of a messagepassing based algorithm to supervised classification. The power of the algorithm lies in its statistical physics approach, that allows (i) to deal with the combinatorial nature of the effect of relevant genes and (ii) to characterize the statistical properties of the set of all possible classifiers, weighted by their performance on the training data set, and by the number of genes on which the classifier actually depends. The method consists in translating the problem into a constraint satisfaction problem (CSP), where each constraint corresponds to one classified training pattern. This CSP is solved using belief propagation Yedidia, which in statistical physics is also known as the BethePeierls approximation. The final output of the algorithm is twofold: First a gene signature is provided, and, second, the classifier itself is given.
We test this algorithm on artificial data, we then move to data sets of leukemia, prostate and colon cancer, used as benchmarks for several other algorithms Dettling and we finally consider a set of breast cancer data obtained as the union of two different experiments vandevijver; vantveer.
The paper is organized as follows: In Sec. II we introduce the problem of classification using first a statisticalmechanics formulation, then we show the equivalence of this formulation with the Bayesian formalism in analogy with the work of Kabashima et al. keykabashima. In Sec. III we give the details of the algorithm. In Sec. IV we show results on artificial data and in Sec. V we show and comment the results on RNA microarray data. Finally, in Sec. VI we present conclusions and perspectives of our approach.
Ii The classification problem
Given are patterns (microarrays) measuring the simultaneous activity of genes, with and . The value of gives the logarithm of the ratio between the activity of gene in pattern and some reference activity of gene , so positive correspond to overexpression, negative to underexpression of gene in pattern . In addition we have an dimensional output vector that assigns a binary label to each pattern (e.g. cancer vs. normal tissue, or cancer type1 vs. cancer type2). The full data set is thus defined as a set of inputoutput pairs .
Here we concentrate fully on binary classifications, but this restriction can be easily overcome. A straightforward generalization to deal with many classes is the socalled one against all classification, where one single label is classified separately against a set unifying all other labels. This reformulation allows to treat a label problem via binary problems Clark_Boswell_2001.
As explained before, we aim at extracting a sparse signature out of the set of all genes, and at establishing a functional relation between the selected genes and the labels . This task seems infeasible in its full generality, therefore we restrict ourselves to the simpler task: we will provide a ternary variable for each gene, which describes the influence of gene on the output label:
(1) 
We further on aim at finding variables such that as few entries as possible are nonzero, forming thereby the desired sparse signature. Note that in this scheme does not imply that the output is independent of the input , but that gene does not carry additional information about the output as compared to the genes in the signature (i.e., with non zero coupling).
This classification scheme is clearly an oversimplification with respect to biomedical reality, where a whole range of positive and negative interaction strengths is to be expected. On the other hand, given the peculiar restriction posed by the limited number of available experimental geneexpression patterns, having a simple (and hopefully meaningful) model reduces the risk of overfitting and produces results which are easier to interpret.
An algorithm for binary classification based on Belief Propagation was already proposed by Kabashima keykabashima, where he considers continuous values for the couplings , coupled with binary variables establishing the presence of the gene in the classification task () or its absence (). The algorithm is tested only the Colon dataset and our results give a sensibly better generalization error.
We further explore the possibility to improve our results allowing each to take discrete values. In all the performed simulations the generalization ability of the algorithm remains stationary or decreases when increasing . We thus keep the simpler scheme considering the only possibilities .
A model of the functional relationship between input and output variables (the data set ) has to be formulated to proceed. Again we aim at keeping this model as simple as possible. We consider functions depending only on the sum of the input variables weighted by the coupling vector. This choice, given the Boolean nature of the output variables, results in a linear classifier Hertz_Krogh_Palmer_1991, i.e. a perceptron,
(2) 
where the function gives the sign of the input , and is set to +1.
The full classification problem reduces thus to the inference of a coupling vector .
ii.1 Inference as a constraint satisfaction problem
The coupling vector and the threshold are the free parameters of the problem. Following the usual strategy in statistical mechanics Braunstein_Zec_PRL2005, we can define a cost function (or Hamiltonian) that, given the data set , counts the number of patterns contradicting our threshold model as a function of the coupling vector (and of the threshold ; this dependence will be taken as implicit from now on):
(3) 
with being the Heaviside step function. This cost function does not include any dependence on the sparsity of the coupling vector , so, to obtain a vector with as many zero entries as possible, we have to add an external pressure to our system. From the point of view of a model Hamiltonian, this can be obtained by an external field (or chemical potential) being coupled to the number of nonzero entries in , i.e. to
(4) 
The complete Hamiltonian for our system thus reads
(5) 
Searching for the minimum of this cost function is the analogous of solving a zero temperature problem in statistical mechanics.
This would be the correct procedure if the data were completely clean from noise and perfectly linearly separable. When this is not the case, imposing the correct classification for all the patterns could lead to an improper selection of the coupling vector and consequently to a poor prediction ability.
Keeping this in mind, we consider a “finite temperature problem” instead, where also solutions that leave unsatisfied some clauses are allowed. We see that in many cases these are the solutions that give the better predictions on new data. At this aim, we introduce a formal inverse temperature and the related Gibbs measure
(6) 
with . The values of and set the relative importance of satisfying the constraints given by the patterns versus the sparsity of the coupling vector. Large enforces satisfaction of the constraints, large favors many zeroelements in . At the moment these two parameters are free model parameters and we describe later on a strategy to fix them for specific data sets .
However, our approach is not seeking for one single configuration that minimizes the cost function; the objective is to characterize the statistical properties of all lowcost coupling vectors as weighted by . Once known these statistical properties, a subsequent problem is to characterize a proper classifier. We will discuss this point in the next subsection, where, to clarify the relation between our statistical mechanics and the Bayesian approach, we reformulate the problem using the point of view of the latter.
ii.2 A Bayesian point of view
The statisticalmechanics approach outlined so far can be reinterpreted in terms of the following Bayesian inference scheme McKayBook. As a first step, we define a model that describes the likelihood of a labeling given a parameter vector and the expression data . As usually we assume different data points to be generated independently under the model, i.e.
(7) 
with the likelihood of a single label being defined coherently with the previous subsection,
(8) 
Further on we define a prior on the space of parameters which favors again sparse vectors: and where we used the Kronecker delta function to enforce that .
A straightforward application on Bayes theorem allows us to derive the posterior probability of given the knowledge of the dataset as:
(9) 
which is the analogous of Eq. (6).
Let us now see how the knowledge of this posterior probability can be used in a given case of classification. Imagine that we have experimental access to a dataset of genearray measurements . Now a new expression measure becomes available, but we do not know whether the experimental sample comes from a cancerous tissue or not, i.e. we do not know its annotation label . We can, however, compute the conditional probability of given the knowledge of all experimental measurements in and the new expression profile . Applying the sum rule:
(10) 
we are ready to establish two probabilistic rule to assign the label to a sample on the basis of the posterior probability . To simplify the notation, let us define the quantity:
(11) 
We can then classify the new pattern according to the Bayesian rule (10). If we assume that (11) is a sum of random independent variables, we get that the field is approximately a Gaussian distributed random variable. In this way we can relate the probability distribution of with the conditional probability of the output :
(12)  
where is a Gaussian distribution with mean and variance . In our case these two parameters are given by:
(13)  
(14) 
The averages in the previous equation are taken over the posterior probability distribution . The choice of the label can be simply given by the maximum posterior probability criterion:
(15) 
It is worth pointing out that for computing both and we do not need the knowledge of the whole posterior probability. The knowledge of the singlecoupling marginal posterior probabilities is sufficient. In order to lighten notations, we indicate marginal posterior probabilities simply by , dropping the explicit dependence on the data set .
In the following section we introduce an efficient algorithm for estimating these quantities.
Iii The algorithm
The messagepassing strategy introduced in the following allows to efficiently estimate marginal probability distributions for single entries of the coupling vector. Given the Gibbs measure (6) one could in principle compute the marginal probability distribution of variable using the standard definition:
(16) 
Unfortunately this strategy involves a sum over terms and becomes computationally infeasible already for larger then 30, as compared to thousands of geneexpression values measured in each microarray experiment. In the next section we will explain how to overcome this difficulty by using the BethePeierls approximation.
Note that the marginal distributions contain valuable information for the extraction of a sparse signature. Genes with a high probability of having a nonzero coupling even for large have to be included in such a signature, they are likely to carry crucial information about the output label. Genes with high probability on the other hand can be excluded from most signatures, so their information content is either low or already provided by other genes.
iii.1 The message passing approach
The belief propagation (BP) method, also known as BethePeierls approximation, or cavity approach, in statistical physics, is exact on treeshaped graphical models. As an approximate tool, it was therefore mostly used on sparse graphs Yedidia; SumProd; BMZ, where the influence of loops is expected to be not very strong. More recently, several applications to dense graphs have been successfully proposed Kabashima2007; keykabashima; keyBickson; Braunstein_Zec_PRL2005; Frey; LeoneSumedhaWeigt. Our problem can in fact be considered as a graphical problem over a fully connected bipartite factor graph, cf. Fig. 1, with vertices (variable nodes) representing the variables , and vertices (factor nodes) representing the constraints (2). All pairs of these two vertex types are connected, since each constraint depends on all variables. In addition there are local fields on variables corresponding to the diluting term in Hamiltonian (5).
The BP algorithm provides a strategy for estimating marginal probability distributions. It works via iterative updates of messages, which are exchanged between variable and factor nodes. Let be one of the factor nodes and one of the variable nodes. We can introduce the following messages, which travel according to the direction indicated by the indices:

describes a weight imposed by constraint on the value of the coupling of variable .

is the probability that variable takes value in the absence of the constraint set by factor node .
The abovedefined quantities satisfy the following set of selfconsistent BP equations:
(17)  
(18) 
where is a normalization constant. A more detailed derivation of the equations in the case of the perceptron without dilution can be found in Braunstein_Zec_PRL2005, and in keymartin in the case of dilution.
In the algorithm both types of messages are initialized randomly, and the iteration proceeds via a random sequential update scheme. The algorithm stops when the convergence is reached, i.e. when the difference between each message at time and the corresponding one at time is less than a predefined threshold ( in all our simulations). The number of needed iterations depends on the particular problem and on the values settled for the parameters and . However, we do not exceed a hundred of iterations in the simulations reported above.
Once all messages are evaluated, the desired marginal probability is given by the messages sent from all factor nodes and by the diluting field,
(19) 
where the are normalization constants which can be easily determined by tracing the unnormalized expression over the three values .
Note that in the definition of messages (17) we still have a sum over configurations, but they enter in the expression only through a linear combination, which enables us to use again a Gaussian approximation Braunstein_Zec_PRL2005. The sum over couplings is replaced by a single Gaussian integral, which can be performed explicitly:
(20)  
(21) 
where is a Gaussian variable with mean and variance,
(22) 
The averages are performed over messages . In this way the complexity of Eq. (17) is reduced from to and that of the overall iteration to (The apparent complexity of updating messages in time can be reduced to by a simple trick: The sums in Eqs. (22) can be calculated over all once for each , so only the contribution of has to be removed in the update of for each . This allows to make the single update step in constant time). A precise estimate of the overall complexity of the algorithm would require to control the scaling of the number of iterations needed for convergence. A theoretical analysis of BP convergence times in a general setting (including the perceptron case) remains elusive. Some recent progress for the simpler matching problem can be found in Zec_matching.
The BP equations become feasible even for very large and , and can therefore be applied to biological highthroughput data sets. Note that, even if the central limit theorem is meant to be valid in the limit of , in practice it works very well already for (where the exact computation is clearly feasible).
The Bethe entropy Yedidia can be computed from marginals and messages:
(23)  
In the first line, we have used the distribution
(24) 
which describes the influence of a single factor node by conserved marginal distributions. In the second line, we use the corresponding single sample energy defined as
(25)  
In writing the last expression we have used again the Gaussian approximation as in Eqs. (2022).
Up to this point, the BP equations still depend on two undetermined parameters, namely the inverse temperature coupled to the datagiven constraints, and the diluting field . To implement the algorithm we have to define a strategy for fixing these free parameters:

The diluting field is the conjugate variable of the number of effective link , so we can equivalently fix one of the two quantities. We decided to fix the number of effective links, and thus the size of the searched gene signature, and to choose accordingly. To find the correct value of we apply a cooling procedure, where after each interaction of the BP equations step we increase (resp. decrease) depending on whether the effective number of link is higher (resp. lower) than the desired value. Since the true number of relevant genes is an unknown quantity, the chosen value for is, however, a free parameter. Practically, since the algorithm finds marginal probabilities and not a single configuration, we define the number of relevant links via its thermodynamic average:
(26) Comparing results for different values of we see that the algorithm is fairly robust, as will be seen in the following sections.

The inverse temperature is again fixed by a cooling procedure starting from a low value and increasing it until one of the following two conditions is met: (i) the energy reaches a small enough value ( formally corresponds to zero energy), (ii) the entropy goes to zero (a signal for a freezing transition into a nonextensive number of coupling configurations at finite temperature). In this last case we use the marginals computed at the zeroentropy temperature.
The diluting field drives the system toward solutions of the desired average number of effective coupling . This is not yet enough for determining explicitly one signature of the desired size, since results are still of probabilistic nature. If we want to select, and use in our algorithm, only a desired number of genes, we have to couple the BP algorithm with the decimation procedure presented in ScienceZec; Zec_Mezard_2002:

Random initialization of messages.

Computation of marginal probability distributions using Eq. (19).

Setting to zero the coupling variable having the largest weight in zero (i.e., set for this variable).

GOTO 2. Repeat until classification variables are set to zero.
Practically, this procedure turned out to be computationally much too expensive, so we opted for a faster variant of Step 4 of the decimation procedure: after each convergence step of the BP equations, we rank genes according to , and we set to zero an extensive part of couplings at the top of this ranking. The same procedure is iterated until we reach the desired number of non zero weights .
iii.2 A note on the centroids algorithm
It is interesting to compare the results of the BP algorithm with simpler, but widely applied techniques. The centroids algorithm is based on the notion of distance (Euclidean distance in dimensional space in this case) of a given pattern from the centers of mass of the two sets of patterns with annotations or . The algorithm works in the following way:

Let , and the number of patterns with label . Here we consider only, but the algorithm works equivalently for multiple classes. We compute the centers of mass (or centroids) of the expression data with labels :
(27) 
We assign the label to a new sample if its distance from is smaller than its distance from all other centroids,
(28) where can be the Euclidean distance in or, depending on the problem, any other meaningful notion of distance.
In order to take into account that only a subset of the genes is relevant (sparse signature), we rank genes according to the absolute value of their Pearson correlation with the output:
(29) 
The highest scoring genes are selected as the signature, and the distance of the new pattern from the centroids is computed taking into account only these genes, i.e. the full problem is projected onto the dimensional subspace of all genes. Since the signature size yielding the best classification is not known a priori, one has to consider the performance of the algorithm as a function of it.
Iv Test on artificially generated data
Before running the algorithm on microarray data, it is useful to test it on artificial data generated by a controlled inputoutput relation. In the simulations reported here the are drawn independently from a normal distribution.
In order to compare the performance of the BP and the centroids algorithms, we label the random generated patterns according to two different rules, reflecting the perceptron and the centroids idea respectively.
iv.1 Description of the datasets.
In the first series of simulations, we draw the couplings from the following distribution:
(30) 
We consider the case in order to have a sparse gene signature, the expectation value of (defined as the number of nonzero entries of ) equals . Labels are determined according to a rule similar to the one used for BP inference,
(31) 
If only exist, values are excluded. In this case the data are feasible under model if Eq. 2, and data can be learned at zero energy (no wrongly assigned labels by the inferred vector ). The theoretical analysis presented in keymartin shows a phasetransition line in the plane , with . One phase, at low and , is paramagnetic. It perfectly memorizes the inputoutput relation given by the data, but there is no correspondence between the exponential number of possible coupling vectors and the data generating rule, the predictive properties for a new pattern are poor. For higher and/or , the solution discontinuously jumps to a perfect retrieval of the inputoutput association vector . A particularly important point is that at sufficiently high diluting field perfect inference is possible at values much lower than the critical threshold for perfect inference at .
In the case the data are infeasible, since the structure of the data generator is richer than the one used for inference. Couplings of the data generator are allowed to take 5 values ({), and inference tries to fit data just using ternary couplings ().
In the second series of simulations the patterns are labeled according to their similarity to two vectors, and , chosen as representative of the two classes. In order to take into account the sparsity of the relevant genes, we consider the variables if and if , with . We thus classify the patterns according to the restricted Euclidean distance:
(32) 
iv.2 Results.
We investigate the goodness of the two classifier, in the two different situations described above, according to two different measures. First, we consider the generalization error. In this case patterns are divided into disjoint training and test sets. Learning is done on the training set, and the inferred inputoutput rule is tested against the test set. The generalization error is defined as the fraction of misclassified patterns in the test set. Results are shown in Fig. 2. For the first type of dataset (perceptronlike) one can appreciate how BP outperforms the simple centroids algorithm. In the second type of dataset (centroidslike), the centroids algorithm outperforms BP in the high signature cases but the two algorithms are comparable for low signatures.
As a second test on the accuracy of BP, we use the fact that the datagenerating signature is known. It can be compared directly to the inferred one, allowing to group genes into four classes: True positives (TP) are those genes which are contained in both signatures, i.e. genes which are correctly identified as being relevant by the inference algorithms. False positives (FP) are in the inferred signature, but they are not in the true one. In analogy we define true negatives (TN) as those which are in neither signature, and false negatives (FN) are those genes which are in the data generator, but they are not recognized by the algorithm. The recall RC (or sensitivity) and the precision PR (or specificity) are thus defined as:
(33) 
The recall thus measure the fraction of correctly inferred genes in the signature, while precision is the fraction of inferred couplings which are actually true ones. These two quantities are obviously competing: An algorithm that includes all genes into the signature has very low precision () but maximum recall (), while including only the genes with very strong signal into the signature may result in a good precision, but at the cost of a potentially poor recall. A perfect algorithm would have recall and precision both equal to one, so the interplay between both quantities is the relevant ”observable”. A curve precisionversusrecall can be constructed by ranking all genes according to their probability of having a non zero weight, and introducing different cutoffs in the ranking: In Fig. 3 we show the numerical results for the two datasets, while a more detailed theoretical analysis for the perceptronlike dataset can be found in keymartin. We see that for the data generated by the perceptron like rule, BP actually performs considerably better than ranking by correlations, whereas results are comparably good in the case of the centroidlike generator.
V Test on tumor data
As stated in the introduction, a common problem of microarray experiments is the small number of samples in the data set (from tens to hundreds) compared with the number (thousands) of monitored genes. Values of the parameter in the considered data sets range from to . In principle these values of are below the threshold at which BP outperforms simple pairwise correlation based methods, at least on data artificially generated as described in the previous section and more extensively discussed in keymartin. On the other hand, the simple ratio between the number of patterns and the number of genes might not be the relevant parameter on real data sets, due to the non trivial correlations between different patterns, and between genes. We first consider three data sets (leukemia, colon and prostate cancer), already analyzed in Dettling as benchmarks for other algorithmic strategies (e.g. BagBoost, Random Forest, Support Vector Machine, knearest neighbors, Diagonal Linear Discriminant Analysis). We further study a newer and larger data set on breast cancer from two different laboratories vandevijver; vantveer.
Class A/B  NDP  

Leukemia  3571  72  25/47  48  24  200 
Prostate  6033  102  52/50  68  34  100 
Colon  2000  62  40/22  42  20  200 
Breast  6401  174  86/88  20160  15414  100 
v.1 Description of the data sets
We have analyzed four different data sets of cancer tissues:

Leukemia: It consists of 72 samples of two subtypes of leukemia – 25 samples of acute myeloid leukemia (Class A) and 47 samples of acute lymphoblastic leukemia (Class B) – measured over 3571 genetic probes Golub.

Prostate It consists of 102 samples – 52 from prostate tumor tissues (Class A) and 50 from normal prostate tissues (Class B) – measured over 6033 genetic probes Singh.

Colon: It consists of 62 samples – 40 from colon adenocarcinoma tissues (Class A) and 22 from normal colon tissues (Class B) – measured over 2000 genetic probes Alon.

Breast: The data set we have analyzed is the union of two different experiments presented in vandevijver; vantveer: it contains 311 samples measured over 24496 probes. We labeled the samples according to the following criterion: a metastasis event occurred in the first 5 years after the appearance of the tumor (Class P [poor prognosis]), the remaining samples did not develop a metastasis in this time window and were labeled as Class G (good prognosis). In order to reduce the noise we removed all probes with nearly constant expression values across the data set. More specifically a probe is included in the data set if at least one of the of the following two conditions is met: (i) its variance is larger than 0.1, (ii) in at least ten samples its expression value is outside the window (0.3,0.3). Eventually we ended up with 6401 probes. We further notice that the number of elements in Class P (86) is much smaller than those in Class G (225). Since a major aim in this context is to correctly classify all members of Class P (those developing metastasis) and not to misclassify them as Class G cases, a larger influence of class P data is needed. To obtain a more balanced dataset, we therefore randomly removed 137 elements from Class G and we end up with a set of 174 patients.
In Tab. 1 we resume the details of the data sets.
v.2 Results
For each data set we construct different realizations of the training and test sets by randomly permuting the samples. The first patterns will be the training set and the remaining patterns will be used as test set. In this way we are able to obtain results, which are not dependent on a single specific arbitrary partitioning of the patterns into training and test sets, and to attribute statistical errors to measured observables. In the last three columns of Tab. 1 we give the actual values of , , and the number of different partitions for each of the data sets.
In all data sets discussed here, we have run BP starting at finite temperature and using the cooling scheme described in Sec. III.1, which reached zero temperature. Running the algorithm at various fixed finite temperatures, we observed the generalization error to decrease monotonously with temperature, and to saturate at low temperature. Here we present directly the zerotemperature results obtained using cooling.
For the first three data sets (Leukemia, Prostate, and Colon) we first run BP on the entire probeset with the diluting field defined in Eq. (6). We show in Fig. 4 the generalization error for the three data set as a function of , which, as already explained in Sec. III.1, can be fixed by tuning the field .
In Fig. 4 two different scenarios emerge: the colon and prostate curves display a minimum generalization error at a relatively low value of , while the leukemia curve is monotonously decreasing. This seems to indicate the presence of a small set of probes relevant for the classification in the prostate and colon case, while in the leukemia data set it seems that all probes are relevant for the classification.
Let us recall that a given does not necessarily indicate the actual size of the set of relevant probes. This would be the case only when all probabilities were completely polarized (i.e. is either 0 or 1). Upon direct inspection of our BP results it turns out that this is not the case. No clear threshold is shown on the marginals . The dilution thus seems to be an effective strategy to attribute differential weights to the probes, but it is not clear how to use it to select a relevant signature in the data set. A possible way is to implement a decimation procedure as explained in Sec. III.1, in order to test the performance of the algorithm on a restricted and selected set of probes (signatures) of different size.
Domany et al. in EinDorZukDomany2006 studied the stability of different signatures as a function of the number of samples used for learning. They interestingly noted that the small overlap between different signatures is not only due to the different classification strategies used by different groups, but in particular resulting from the small number of available cases. Such a lack of robustness emerges even when using the same algorithm on data generated with the same probabilistic framework EinDorZukDomany2006. We have investigated how stable are the lists obtained with our decimation procedure. It turns out that an analogous phenomenon of instability occurs in our lists. We observed very few genes appearing in all lists: the actual numbers differs among the different data sets but it ranges from 10% in the case of Leukemia, to 0.02% in the case of prostate tissues, where however the number of relevant genes seems to be very small as we will see in the following.
To understand how BP compares with simpler algorithmic strategies, we used the centroids method selecting signatures of the same size. The results for the first three data set are displayed in Fig. 5. The curve for Leukemia displays a monotonously decreasing profile in agreement with what we have obtained fixing shown in Fig. 4. This seems to indicate that no signature can be defined in this case. The generalization error obtained with the centroids algorithm is slightly worse then that of BP. The Prostate case displays dramatic difference in the two algorithms: from the BP point of view the curve displays a plateau of minimal generalization error for signature larger than 100, at odd with what obtained fixing where a minimum is present around . Also in this case it seems difficult to determine a clear signature in the data set. Centroids behaves in the opposite way displaying a monotonously increasing function with minimal generalization error at values of the signature lower than 10. The optimal generalization error achieved by the two algorithm in this case is compatible. The Colon case is analogous to that of Prostate from the BP point of view (the minimum generalization error is obtained for but for ), while centroids seems to be rather insensitive to signature size in the whole interval analyzed. In this case BP shows an overall better generalization ability. The discrepancy between the values of the signature and of the for which the generalization error is minimum, comes from the not unique set of possible relevant genes. While the algorithm is able to select genes which are relevant and weight them adequately ( small), it is not able to define a single clear signature which alone is sufficient for the classification. This implies that, in the decimation procedure, a much higher number of genes, even if with small weight, is necessary to achieve a good predictive ability (so that the minimum generalization error is reached for signatures much bigger than the probabilistic expectation given by ). It is worth pointing out that the fact that BP behaves worse for very small sizes, is already present in the case artificial data Fig. 2.
The best generalization errors of both BP and centroids are displayed in Tab. 2 compared with the results presented by Dettling in Dettling where no statistical error was associated with measures. We see that BP outperforms 4 out of 6 other algorithms on all three data sets, the other two algorithms perform better on a single data set each.
BP  centroids  BagBoost  RanFor  SVM  kNN  DLDA  

Leukemia  0.025(2)  0.029(2)  0.0408  0.025  0.035  0.0383  0.0292 
Prostate  0.070(4)  0.071(4)  0.0753  0.0788  0.0682  0.1059  0.1418 
Colon  0.140(6)  0.156(4)  0.1610  0.1543  0.1667  0.1638  0.1286 
In the breastcancer case, we consider first the generalization error for various sizes of the training set (). The relatively large balanced data set containing 174 patients allows for the direct analysis of sample sizes ranging over almost a full order of magnitude. The results are presented in the left panel of Fig. 6, each point is averaged over 50 random selections of the training set. We observe a strong monotonous decrease of the generalization error. This is a very encouraging sign and should motivate for collecting larger data sets. The generalization error of 30% obtained for is compatible with the one () reported in keyMichiels.
However, in the breast cancer set we can go into more detail: The generalization error treats good and poor cases (Classes G/P) in the same way. Indeed, one of the open challenges in breast cancer treatment is to recognize the correct cancer subtype at the earliest possible stage of disease, indicating, e.g., the possible sensitivity of the cancer to chemotherapy. In the present case, it is strongly preferable to erroneously include a Class G patient into Class P than vice versa; it has to be avoided to predict the absence of metastasis for a patient who actually develops one, and possibly not to provide the necessary medical treatment to such a patient.
We therefore introduce four different subclasses:

patients in Class G (no metastasis) correctly classified as Class G, their number is called ;

patients in Class P (metastasis) misclassified as Class G, their number is called ;

patients in Class G (no metastasis) misclassified as Class P, their number is called ;

patients in Class P (metastasis) correctly classified as Class P, their number is called .
As said before, our primary aim is to classify correctly Class P patients. The fraction of misclassified Class P patients has to be kept as small as possible; we refer to it as the poor error fraction (PEF). Once this is kept small, we also would like to recognize correctly as many Class G patients as possible. The corresponding fraction in the total Class G sample is , we refer to it as the good prediction fraction (GPF).
The GPF vs. PEF curve for a perfect classifier would be constantly equal to zero for the full GPF range, while a random classifier would produce a curve PEF=GPF. Here we want to characterize the relation of these two competing quantities (a constant prediction as P would have PEF=GPF=0, whereas a constant prediction G would have PEF=GPF=1), keeping however in mind that we want to keep the PEF low. The full curve is obtained done by changing the threshold defined in Eq. (2) of the inferred classifier.
The result is displayed in the right panel of Fig. 6 for training sets of size . For each possible PEF value we have averaged the curve over 50 random balanced partitionings of our data into training / test sets. When training with 100 samples, we have 37 points corresponding to the possible numbers of misclassified Class P patients (0 to 36). When training with 160 samples, this number reduced to 8 points. Note that the finding that the curve for larger training sets is located right of the other curve is consistent with the observation done in the left panel of Fig. 6: Larger training sets lead to better results. It is particularly striking that the PEF starts to grow much later, the GPF at zero PEF grows from about 10% to about 40%. Again we see that larger data sets should be collected for developing more precise prognostic tools.
In order to compare our result with that of vandevijver, we measured the value of the GPF corresponding to an average PEF value around . This can be obtained by averaging over cases with being at most one. In this case we obtained a PEF of 0.076(1) and a GPF of 0.50(2). These values are comparable to vandevijver where a PEF=0.071 and GPF=0.53 are reported for a set of 180 patients and a leaveoneout procedure ().
In conclusion, for the breast cancer dataset, we obtain results comparable with state of the art studies, and we find a not saturated increase of performance with growing training set size.
Vi Conclusion and Perspectives
In this work we introduced and analyzed a message passing algorithm for the problem of classification and applied it to genomewide expression patterns of cancer tissues. The aim of the algorithm is twofold: on one hand one would like to get a reliable classifier, on the other hand one wants to base this result on a minimal set of discriminating genes. This additional requirement of parsimony is based on the biological intuition that, as long as the expression level is concerned, only a relatively small subset of genes is actually significant for the development of the disease. Furthermore the possibility of identifying a meaningful (possibly small) signature of the disease will help the classification in two ways: (i) computationally, the dimensional reduction of the patterns makes the classification less prone to overfitting, (ii) experimentally, specific targeted genechip scanning could improve early stage cancer diagnostic (e.g. in the case of breastcancer).
The performance of our algorithm is found to be slightly better, although comparable with the one of other stateoftheart classification techniques. Using three different data sets, the BP performed better than 4 out of 6 algorithms on all data sets, and better than the other 2 algorithms on two out of the three data sets. The obtained results are compatible with the following theoretical intuition: The sparse quality (experimental and biological noise) and quantity (few expression patterns) of available data does not allow to extract substantially more information than the one seen also by simple algorithms. From this point of view the situation will improve in the near future since genechip technologies are becoming more and more diffused, and the emergence of new array technologies based on longer oligonucleotides should reduce the experimental noise of the expression measurements. This ongoing technological revolution calls for the development of sophisticated global classification techniques which are able to unravel, e.g., combinatorial control.
Another line of development in tumor detection is the integration of different sources of information. From this point of view SNPs (Single Nucleotide Polymorphism) appear to be the most promising highthroughput genomewide technology.
Vii Acknowledgments
We thank Alfredo Braunstein, Yoshiyuki Kabashima, Michele Leone, and Riccardo Zecchina for very interesting discussion on the problem of inference and methodological aspects of message passing algorithms. We also thank Enzo Medico for pointing out the Breast Cancer data set, and the centroids algorithm.