Epitope profiling via mixture modeling of ranked data


Cristina Mollica
Dipartimento di Scienze Statistiche, Sapienza Università di Roma, Piazzale A. Moro 5, 00185 Roma, Italy
cristina.mollica@uniroma1.it

and

Luca Tardella
Dipartimento di Scienze Statistiche, Sapienza Università di Roma, Piazzale A. Moro 5, 00185 Roma, Italy
luca.tardella@uniroma1.it
Abstract.

We propose the use of probability models for ranked data as a useful alternative to a quantitative data analysis for investigating the outcome of bioassay experiments, when the preliminary choice of an appropriate normalization method for the raw numerical responses is difficult or subject to criticism. We review standard distance-based and multistage ranking models and, in this latter context, we propose an original generalization of the Plackett-Luce model that accounts for the order of the ranking elicitation process. The usefulness of the novel model is illustrated with its maximum likelihood estimation for a real data set. Specifically, we address the heterogeneous nature of the experimental units via model-based clustering and detail the steps needed for a successful likelihood maximization through a hybrid version of the Expectation-Maximization algorithm. The performance of the mixture model using the new distribution as mixture components is compared with that of alternative mixture models for random rankings. A discussion on the interpretation of the identified clusters and a comparison with more standard quantitative approaches are finally provided.

Key words and phrases:
Ranking data, Plackett-Luce model, Multistage ranking models, Mixture models, EM algorithm, Epitope mapping
Version July 5, 2019

1. Introduction

Ranked data arise in several contexts, especially when objective and precise measurement of the phenomena of interest is impossible or deemed unreliable and the observer gathers ordinal information in terms of orderings, preferences, judgments, or relative or absolute rankings among competitors. Research fields where the analysis of ranked data is frequently required are the social and behavioral sciences, where studies often ask a sample of people to rank a finite set of items according to certain criteria, typically their personal preferences or attitudes. In marketing, political surveys or psychological experiments, the items to rank can be consumer goods, political candidates or goals, or words or topics considered to be more or less associated with a reference one according to individual perception. Another typical context is sport, where teams, horses or cars compete and the final outcome is a ranking among competitors. A detailed and well-structured reference monograph concerning ranking data analysis and modeling is (Marden, 1995).

Statistical analysis of observed rankings is less usual in experimental research, where the availability of (sometimes sophisticated) measuring devices makes it possible to express the phenomena of interest in terms of precise quantitative information. In this work we verified the usefulness of probability models for ranked data in an experimental study, where quantitative outcomes are indeed available but, due to the numerical instability of the measurements and the absence of universally accepted ways of rescaling the original data, one may prefer to investigate the ranking evidence. For this purpose we used a real data set from the Large Fragment Phage Display (LFPD) bioassay experiment described in (Gabrielli et al., 2013). Researchers set up a new promising technology in order to gain further insight into the molecular recognition of the immune system via epitope mapping of the HER2 oncoprotein. They employed a sample of patients and recorded for each subject the binding level, expressed on a quantitative scale, between antibodies and specific partially overlapping fragments of the HER2 oncoprotein. The sample was actually composed of three different groups according to the known breast cancer status. A preliminary exploratory analysis of the LFPD data showed differential outcome profiles for the three cancer-specific groups (Gabrielli et al., 2013). Hence, we regarded the sample as drawn from a heterogeneous, multimodal population and opted to describe it through a mixture modeling approach for the individual ranked binding sequences. Two well-studied probability distributions for ranked data, the distance-based and the Plackett-Luce model, and a new extension of the latter have been employed as mixture components and the resulting performances have been compared. Maximum likelihood estimates (MLE) have been obtained with the implementation of the Expectation-Maximization (EM) algorithm or with hybrid versions thereof.

This article is organized as follows: in Section 2 we define the notation and review the distance-based and the Plackett-Luce model. The presentation and motivation of our extended Plackett-Luce model follow in Section 2.3 and the MLE for a finite mixture is discussed in Section 3. The application to the LFPD data and the comparison of the novel model with alternative ranking probability distributions are detailed in Section 4, with an interpretation of the inferential findings. The article ends with conclusions and proposals for future developments in Section 5.

2. Statistical models for ranked data

2.1. Notation and basic definitions

Before reviewing some of the approaches for the probabilistic modeling of ranked data, it is convenient to fix some notation. Formally, a full (or complete) ranking is a bijective mapping $\pi$ of a finite set of $n$ labeled items $\{x_1,\dots,x_n\}$ into a set of ranks $\{1,\dots,n\}$, that is

$$\pi:\{x_1,\dots,x_n\}\longrightarrow\{1,\dots,n\}.$$

With some abuse of notation, each item label will be identified with its subscript: instead of writing a ranking as $(\pi(x_1),\dots,\pi(x_n))$, we will simplify it as $\pi=(\pi_1,\dots,\pi_n)$. In this way positions refer to items and entries give the corresponding assigned ranks, which means that $\pi_i$ must be read as the rank attributed to the $i$-th item. The underlying convention is that if $\pi_i<\pi_{i'}$, then item $i$ is ranked higher than item $i'$, and hence preferred to it.

In the literature, one distinguishes a full from a partial (or incomplete) ranking, in which the rank assignment process is not completely carried out. This happens, for instance, when a judge expresses only her first $k$ preferences out of $n$ items ($k<n$), producing the so-called top-$k$ partial ranking. In the present context, the above restrictive definition of ranking is adopted, so that ties are not allowed, due to the injectivity of the mapping $\pi$, and partial rankings are not contemplated, because of its surjectivity.

The inverse $\pi^{-1}$ of a ranking $\pi$ is called an ordering. Positions of the components of $\pi^{-1}=(\pi^{-1}_1,\dots,\pi^{-1}_n)$ refer to ranks and elements correspond to the items. Hence, $\pi^{-1}_j=\pi^{-1}(j)$ is the item ranked in the $j$-th position. In order to avoid confusion with $\pi$, we will henceforth make explicit use of the inverse function notation $\pi^{-1}$ to denote the corresponding ordering.
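To make the notation concrete, here is a minimal illustrative sketch (in Python with 0-based indices, not the authors' R code) of the correspondence between a ranking $\pi$ and its ordering $\pi^{-1}$; the function name is ours.

```python
# Sketch: converting a ranking (item -> rank) into its inverse ordering
# (rank -> item). Indices are 0-based for Python convenience.

def ranking_to_ordering(pi):
    """Invert a ranking: ordering[j] is the item ranked in position j."""
    ordering = [0] * len(pi)
    for item, rank in enumerate(pi):
        ordering[rank] = item
    return ordering

pi = [2, 0, 3, 1]                    # item 0 has rank 2, item 1 has rank 0, ...
print(ranking_to_ordering(pi))       # -> [1, 3, 0, 2]: item 1 is ranked first
```

Since inversion is involutive, applying the function twice recovers the original ranking.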

We denote with $\mathcal{S}_n$ the set of all $n!$ possible permutations. This special finite subset of $\mathbb{N}^n$ is endowed with a composition operation $\circ$ such that two elements $\pi$ and $\sigma$ in $\mathcal{S}_n$ may yield either the permutation $\pi\circ\sigma$ or $\sigma\circ\pi$. In particular, $\pi\circ\sigma$ indicates the ranks under $\pi$ of the items ranked by $\sigma$, whereas $\sigma\circ\pi$ gives the items to which $\sigma$ assigns the ranks that $\pi$ has attributed to the items.

When a judge proceeds from the elicitation of her best choice (rank $1$) up to the worst one (rank $n$), we have the so-called forward ranking process; the inverse ranking procedure is named the backward ranking process. This formal definition was originally introduced in (Fligner and Verducci, 1988) but, to our knowledge, the rank assignment scheme has not received explicit consideration in a model setup in the attempt to improve the description of random ranked data. Obviously, any other order for the rank assignment process is admissible and potentially leads to different models. This aspect has inspired us to expand an existing and well-known parametric ranking model and to employ the new class in the analysis of the LFPD data, in order to verify whether and how the reference order can influence the inferential results and the final model-based clustering.

2.2. Probability models for random rankings

In this section we give a brief account of rank data modeling. For a more systematic review see (Marden, 1995).

The collection of all discrete distributions for random rankings can be identified with the whole $(n!-1)$-dimensional simplex. This is equivalent to saying that a random ranking and its distribution can be denoted with $\pi\sim P$, where the set of all such $P$ can be regarded as the most general statistical model on rankings, parameterized by $n!-1$ free parameters, i.e., the probabilities of each ordered sequence. This general form can be considered and named the saturated model (SM). Within this very general class, a special role is played by the uniform or null model (UM), represented by the single flat distribution which assigns equal probability $1/n!$ to each ranking, and by its opposites, the degenerate models (DM), which concentrate all the probability mass on a single ranking. Although the SM allows for the maximum degree of flexibility, it becomes intractable and cumbersome to interpret even with a relatively small number of items, because of the fast-growing dimension of the ranking space. These practical limitations have motivated the introduction of simplifying assumptions on the ranking process, in order to deal with lower-dimensional subsets of the simplex, and justify the wide assortment of restricted parametric models developed in rank data theory.
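The growth of the parameter count $n!-1$ can be made tangible with a one-line computation (a sketch of ours; recall that the LFPD study below involves 11 ranked fragments):

```python
# Number of free parameters of the saturated model, n! - 1,
# for a few values of n.
import math

for n in (3, 5, 8, 11):
    print(n, math.factorial(n) - 1)   # 3 -> 5, 5 -> 119, 8 -> 40319, 11 -> 39916799
```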

2.2.1. Distance-based models

A fundamental class of parametric distributions is the so-called distance-based model (DB). Roughly speaking, the DB can be interpreted as the analogue of the normal distribution on the finite discrete space $\mathcal{S}_n$ endowed with the group structure; in fact, it is an exponential location-scale model indexed by a discrete parameter $\rho$, called the modal or central ranking, and a non-negative real concentration parameter $\lambda$. Each distribution in a DB model has the following form

$$P(\pi\mid\rho,\lambda)=\frac{e^{-\lambda\,d(\pi,\rho)}}{Z(\lambda)},\qquad\qquad(2.1)$$

where $Z(\lambda)=\sum_{\pi\in\mathcal{S}_n}e^{-\lambda\,d(\pi,\rho)}$ is the normalization constant and $d(\cdot,\cdot)$ is a metric on $\mathcal{S}_n$. The probability mass function in (2.1) attains its maximum at $\rho$ and decreases as the distance from $\rho$ increases. Under (2.1) rankings at the same distance from the modal sequence are equally probable. The central ranking $\rho$ expresses the so-called global consensus in the population, whereas the concentration/precision parameter $\lambda$ calibrates the effect of the distance on the probability of the ranking: the higher the value of $\lambda$, the more concentrated the distribution around its mode. Hence, when $\lambda\to\infty$, equation (2.1) becomes the DM at $\rho$; conversely, when $\lambda=0$ it turns out to be the UM.

Changing the distance measure in (2.1), one can define different families of parametric distributions for ranked data. Formally, a function $d:\mathcal{S}_n\times\mathcal{S}_n\to\mathbb{R}^{+}$ is a distance between rankings if it satisfies the usual three properties of a metric (identity, symmetry and triangle inequality) and the additional fourth condition of right-invariance, that is, for all $\pi,\sigma,\tau\in\mathcal{S}_n$

$$d(\pi,\sigma)=d(\pi\circ\tau,\sigma\circ\tau).\qquad\qquad(2.2)$$

Condition (2.2) guarantees the desirable property of invariance of $d$ w.r.t. arbitrary relabeling of the items. Examples of metrics for rankings are:

the Kendall distance

$$d_K(\pi,\sigma)=\sum_{i<i'}\mathbb{1}\{(\pi_i-\pi_{i'})(\sigma_i-\sigma_{i'})<0\},$$

which counts the number of pairwise disagreements, i.e., the pairs of items with discordant relative order under $\pi$ and $\sigma$. It is also equal to the minimum number of adjacent transpositions needed to transform $\pi$ into $\sigma$;

the Spearman distance $d_S(\pi,\sigma)=\sum_{i=1}^{n}(\pi_i-\sigma_i)^2$;

the Spearman Footrule $d_F(\pi,\sigma)=\sum_{i=1}^{n}|\pi_i-\sigma_i|$;

the Cayley distance $d_C(\pi,\sigma)=n-C(\pi\circ\sigma^{-1})$, where $C(\cdot)$ is the number of cycles of its permutation argument, corresponding to the minimum number of arbitrary transpositions required to convert $\pi$ into $\sigma$.

The reader is referred to (Critchlow, 1985) for a more complete and detailed description of the metrics on rankings.

The computation of $Z(\lambda)$ can be demanding, as it requires a summation over all $n!$ possible rankings. As advised by (Fligner and Verducci, 1986), one way to derive a simpler expression for $Z(\lambda)$ is to consider its relation with the moment generating function of the random variable $d(\pi,\rho)$ under the UM on $\mathcal{S}_n$. In the wide variety of distances, only some specific ones lead to a closed-form expression for $Z(\lambda)$. Hence, in performing a statistical analysis of ranked data one should balance interpretation purposes, choosing the metric which best accommodates the problem at hand, against computational feasibility. For our application to the LFPD data we employed the Kendall distance $d_K$.

2.2.2. The Plackett-Luce models and related extensions

The Plackett-Luce model (PL) is a very popular parametric family for random rankings. Its name arises from the contributions of (Luce, 1959), whose monograph provides an in-depth theoretical description of individual choice behavior based on a general axiom, and of (Plackett, 1975), who derived this model in the context of horse races. Its probabilistic expression stems from the decomposition of the ranking process into independent stages, one for each rank that has to be assigned, combined with the underlying assumption of the standard forward procedure for the ranking elicitation. In fact, a ranking can be elicited through a series of sequential comparisons in which a single item is preferred to all the remaining ones and, after being selected, is removed from the next comparisons. For this reason, the PL is said to belong to the family of multistage ranking models. Specifically, the PL probability distribution is completely specified by the so-called support parameter vector $\underline{p}=(p_1,\dots,p_n)$, where $p_i\geq 0$ for all $i$ and $\sum_{i=1}^{n}p_i=1$. Note that in the present formulation the parameters are constrained to add up to one to avoid the non-identifiability due to possible multiplication by an arbitrary positive constant. The generic parameter $p_i$ expresses the probability that item $i$ is selected at the first stage of the ranking process and hence preferred among all the other items. The probability of choosing item $i$ at lower preference levels is proportional to its support value $p_i$. Taking into account that the set of available items in the sequence of random selections is reduced by one element after each step, the computation of the choice probabilities at each stage requires a suitable normalization of the support probabilities w.r.t. the set of items remaining at that stage. It follows that under the PL the probability of the random ordering $\pi^{-1}$ is

$$P(\pi^{-1}\mid\underline{p})=\prod_{t=1}^{n}\frac{p_{\pi^{-1}(t)}}{\sum_{v=t}^{n}p_{\pi^{-1}(v)}}.\qquad\qquad(2.3)$$

The vase model metaphor originally introduced by (Silverberg, 1980) is an alternative way to interpret the random stage-wise item selections and a useful representation of the PL for understanding the extensions developed in the literature (see (Marden, 1995) for a review). Let us consider a vase containing item-labelled balls such that the vector $\underline{p}$ expresses the starting composition of the vase. The vase differs from an urn simply because the former contains an infinite number of balls, in order to allow continuous values of the proportions. At the first stage one draws a ball and ranks the corresponding item first. At the second stage one draws another ball from the vase: if its label differs from the first one, one assigns rank 2 to the corresponding item; otherwise the ball is put back and one keeps drawing until a distinct item is chosen and then ranked second. The multistage experiment ends when there is only one item not yet selected, which is automatically ranked last. The probability of a generic sequence of drawings turns out to be (2.3). In such a scheme the vase configuration is constant over all stages and interactions among items are not contemplated. A first attempt to generalize this basic scheme consists in retaining the absence of item interactions but letting the vase composition vary among stages, as formalized in (Silverberg, 1980). In this model setting the support parameters become stage-dependent, that is $p_{ti}$ for stage $t$ and item $i$. Setting the special form $p_{ti}\propto p_i^{w_t}$ one obtains the Benter model (BM) introduced by (Benter, 1994), where the parameter vector $\underline{w}=(w_1,\dots,w_n)$ with $w_t\in[0,1]$ for all $t$ is named the dampening parameter and accommodates the possibly different degree of accuracy with which the choice at each selection stage is made. The PL corresponds to the BM with $w_t=1$ for all $t$. Relaxing also the non-interaction hypothesis, meaning that the vase composition at each stage relies on the previously selected items, (Plackett, 1975) defined a hierarchy of further extensions of the PL. They are referred to in (Marden, 1995) as Lag $l$ models, where $l$ indicates that the vase at stage $t$ depends on the previous choices only through the last $l$ selected items. The Lag 0 model coincides with the ordinary PL. The Lag 1 model is such that at each choice step the vase depends only on the item selected at the previous stage. The total number of parameters grows with $l$, and with maximal lag the model corresponds to the SM.
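The stage-wise selection mechanism described above can be sketched as a sampler (a Python illustration of ours, not the authors' implementation): at each stage an item is drawn with probability proportional to its support and then removed, which is equivalent to the vase scheme without replacement of distinct labels.

```python
# Sampling one ordering (best to worst) from a PL with support vector p.
import random

def sample_pl_ordering(p, rng=random):
    """Draw an ordering from a PL: at each stage pick one of the remaining
    items with probability proportional to its support, then remove it."""
    items, weights = list(range(len(p))), list(p)
    ordering = []
    while items:
        total = sum(weights)
        u, cum = rng.random() * total, 0.0
        for k, w in enumerate(weights):
            cum += w
            if u <= cum:
                ordering.append(items.pop(k))
                weights.pop(k)
                break
    return ordering
```

Every draw returns a complete permutation of the item set; items with larger support tend to occupy the first positions.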

2.3. Novel extension of the Plackett-Luce model

In this section we introduce an original proposal to generalize the standard PL. The multistage ranking models reviewed above implicitly suppose that preferences are expressed with the canonical forward procedure, proceeding from the assignment of the first rank up to the last one. This is just a conventional assumption and other reference orders can be contemplated but, to our knowledge, this aspect has not been addressed in the literature. Indeed, everyday experience in choice problems suggests the plausibility of alternative paths for the ranking elicitation. For example, one can think of situations where the judge has a clearer perception of her most- and least-liked items but only a vaguer idea relative to the middle ranks; alternatively, the ranker can single out her best alternatives through an exclusion process starting with the final position, which would be described by a backward model. Besides the motivation of characterizing typical behaviors in real choice/selection problems, we also aim at obtaining a more flexible tool to improve the description of observed phenomena collected in the form of ordered data. All these intuitive and practical instances make the forward hypothesis too restrictive for a flexible inferential analysis of a ranking data set. Hence, we propose to extend the PL as follows: rather than fixing a priori the stepwise order leading the judge to her final ranked sequence, we represent it with a specific free parameter in the model and let the data guide inference about the reference order followed in the rank assignment scheme. Such an approach also alleviates the asymmetry toward ranks assigned at the extreme (first and last) stages of the ranking procedure, which by nature affects the PL with a hypothesized known reference order. It turns out that the reference order is a bijection $\rho$ between the stage set $\{1,\dots,n\}$ and the rank set $\{1,\dots,n\}$, i.e.,

$$\rho:\{1,\dots,n\}\longrightarrow\{1,\dots,n\},$$

where the entry $\rho(t)$ indicates the rank attributed at the $t$-th stage of the ranking process. Then, $\rho$ identifies a discrete parameter taking values in $\mathcal{S}_n$. The composition of an ordering $\pi^{-1}$ with a reference order $\rho$ yields

$$\left(\pi^{-1}(\rho(1)),\dots,\pi^{-1}(\rho(n))\right),$$

the sequence listing the items selected at each stage. This means that $\pi^{-1}(\rho(t))$ is the item chosen at step $t$ and receiving rank $\rho(t)$. The probability of a random ordering under the extended Plackett-Luce model can be written as

$$P(\pi^{-1}\mid\rho,\underline{p})=\prod_{t=1}^{n}\frac{p_{\pi^{-1}(\rho(t))}}{\sum_{v=t}^{n}p_{\pi^{-1}(\rho(v))}},\qquad\qquad(2.4)$$

where the additional discrete parameter $\rho$ acts directly, by composition on the right, on the outcome generated by a standard PL. Hereafter we will refer to (2.4) shortly as EPL($\rho$,$\underline{p}$). The vector $\underline{p}$ continues to denote the support parameters, with $p_i$ the probability for item $i$ to be selected at the first stage and to receive the rank given by the first entry of $\rho$. Obviously, the standard PL is a special case of the EPL, obtained by setting $\rho$ equal to the identity permutation $\rho=(1,2,\dots,n)$. Similarly, when $\rho=(n,n-1,\dots,1)$ one has the backward PL.

From a theoretical point of view, (2.4) is a proper generalization of (2.3) if and only if the new class covers a wider portion of the SM, i.e., if the novel EPL describes additional probability distributions that cannot be derived from the PL under any parameter specification. In other words, one should give a formal proof of the existence of a ranking distribution, generated by the new EPL, which does not belong to the standard PL family. Such a proof is given in the Appendix. In Section 3.2 we describe in detail the MLE of the new model.

2.4. Finite mixture modeling for ranked data

One of the nice formal properties satisfied by the DB (2.1) is strong unimodality, meaning that the probability decreases as the distance from the modal ranking increases, see (Marden, 1995). On the other hand, strong unimodality is expected to be violated in real data, especially when the sample composition is heterogeneous w.r.t. factors related to the ranking elicitation. A well-established statistical tool to address inference in the presence of unobserved heterogeneity is the finite mixture approach. A finite mixture model assumes that the population consists of a finite number $G$ of subpopulations. In this setting the probability of observing the ranking $\pi_s$ for the $s$-th unit is

$$P(\pi_s)=\sum_{g=1}^{G}\omega_g\,P(\pi_s\mid\theta_g),$$

where $P(\cdot\mid\theta_g)$ denotes the $g$-th component of the mixture, i.e., the statistical distribution of the data within the $g$-th group, and $\omega_g$ is the probability for an observation to belong to the $g$-th group. The membership probabilities $\omega_1,\dots,\omega_G$ are usually termed the weights of the mixture components. Mixture components are often modeled with members of the same parametric family, and thus they are identified by the group-specific parameter $\theta_g$. For a more extensive introduction to finite mixture models the reader can refer to (McLachlan and Peel, 2000). In the ranking literature one can find several recent applications of mixture models which make ranked data modeling more flexible and account for unobserved heterogeneity. For example, (Murphy and Martin, 2003) analyzed the popular 1980 APA (American Psychological Association) presidential election data set, in particular the sub-data set of complete rankings, with a mixture of distance-based models. They aimed at investigating voters' orientation towards candidates within the electorate, assessing the possible adequacy of incorporating a noise component (UM) in the mixture. Such a component, in fact, can collect outliers and/or observations characterized by untypical preference profiles, with a possible final improvement of the model fitting. A similar method was adopted in other preference studies. (Gormley and Murphy, 2006) fitted a mixture of PL to the 2000 CAO (Central Applications Office) data set to investigate the motivations driving Irish college applicants in their degree course choices. (Gormley and Murphy, 2008a) applied mixtures of both PL and BM to infer the structure of the Irish political electorate and characterize voting blocs. In subsequent works the same authors extended this approach in different directions (for further details see (Gormley and Murphy, 2008b) and (Gormley and Murphy, 2009)).

In section 4.2 we present our application of mixture models for ranking to data originated from the LFPD bioassay experiment. For the analysis of the LFPD data set we implemented different mixture models, adopting as mixture components elements from the following parametric families:

  • DB with the Kendall distance $d_K$;

  • PL with known reference order, both forward ($\rho=(1,\dots,n)$) and backward ($\rho=(n,\dots,1)$);

  • our novel EPL.

DB and PL represent two of the most frequently used distributions for modeling ranking data and both parameterizations allow a clear interpretation: in the former, the central ranking summarizes the profile of the population in assessing the orderings of the items and the concentration parameter expresses how representative the modal ranking is; in the latter, the higher the item support parameter value, the greater the probability for that item to be preferred at each selection level. For the argument on the choice of the EPL in the analysis of the LFPD data, the reader is referred to Section 4.2.

3. Inferring ranking models

3.1. MLE of the mixture of distance-based models

We briefly summarize here the fundamental steps to derive the MLE for a mixture of $G$ DB components with the Kendall distance, when a sample $\underline{\pi}=(\pi_1,\dots,\pi_N)$ of $N$ rankings is available. We basically reproduce the algorithm described in (Murphy and Martin, 2003). Let $z_s=(z_{s1},\dots,z_{sG})$ be the latent variable indicating the individual component membership, such that $z_{sg}=1$ if the $s$-th unit belongs to the $g$-th group and $z_{sg}=0$ otherwise, for $s=1,\dots,N$. From (2.1) it follows that the complete log-likelihood can be written as

$$\ell_c(\underline{\omega},\underline{\lambda},\underline{\rho})=\sum_{s=1}^{N}\sum_{g=1}^{G}z_{sg}\left[\log\omega_g-\lambda_g\,d(\pi_s,\rho_g)-\log Z(\lambda_g)\right],$$

where $\underline{\omega}=(\omega_1,\dots,\omega_G)$ and $\underline{\lambda}=(\lambda_1,\dots,\lambda_G)$ are vectors representing, respectively, the prior group membership probabilities and the component-specific concentration parameters, whereas $\underline{\rho}$ is a $G\times n$ matrix whose rows indicate the central rankings of the mixture components. To derive parameter estimates the EM algorithm can be implemented; it represents the major scheme for addressing inferential analysis in the presence of missing data (Dempster et al., 1977). For the present model the EM algorithm consists of the following steps:

  • set initial values for the parameters to be estimated (we used random starting values).

  • at iteration $l+1$ compute

$$\hat z_{sg}^{(l+1)}=\frac{\hat\omega_g^{(l)}\,e^{-\hat\lambda_g^{(l)}d(\pi_s,\hat\rho_g^{(l)})}/Z(\hat\lambda_g^{(l)})}{\sum_{g'=1}^{G}\hat\omega_{g'}^{(l)}\,e^{-\hat\lambda_{g'}^{(l)}d(\pi_s,\hat\rho_{g'}^{(l)})}/Z(\hat\lambda_{g'}^{(l)})}$$

    for $s=1,\dots,N$ and $g=1,\dots,G$, which is the current estimate of the posterior probability that subject $s$ belongs to the $g$-th component;

  • at iteration $l+1$ compute

$$\hat\omega_g^{(l+1)}=\frac{1}{N}\sum_{s=1}^{N}\hat z_{sg}^{(l+1)},\qquad \hat\rho_g^{(l+1)}=\arg\min_{\rho\in\mathcal{S}_n}\sum_{s=1}^{N}\hat z_{sg}^{(l+1)}\,d(\pi_s,\rho),$$

    and determine $\hat\lambda_g^{(l+1)}$ as the solution of

$$\mathbb{E}_{\lambda_g}\!\left[d(\pi,\rho)\right]=\frac{\sum_{s=1}^{N}\hat z_{sg}^{(l+1)}\,d(\pi_s,\hat\rho_g^{(l+1)})}{\sum_{s=1}^{N}\hat z_{sg}^{(l+1)}}$$

for $g=1,\dots,G$. We run the algorithm with a suitably large number of different starting values to address the issue of local maxima.
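The E-step and the weight update above can be sketched generically (a Python illustration of ours with hypothetical names; the component log-density is passed in as a callable, so the same skeleton serves DB and EPL components):

```python
# One EM sweep for a G-component mixture: posterior membership
# probabilities (E-step) and mixture weight update (part of the M-step).
import math

def e_step(rankings, weights, comp_logpdf, params):
    """Return z with z[s][g] = posterior prob. that unit s is in group g."""
    z = []
    for pi in rankings:
        logs = [math.log(weights[g]) + comp_logpdf(pi, params[g])
                for g in range(len(weights))]
        m = max(logs)                          # log-sum-exp for stability
        probs = [math.exp(v - m) for v in logs]
        tot = sum(probs)
        z.append([q / tot for q in probs])
    return z

def update_weights(z):
    """New mixture weights: posterior proportions of units per group."""
    N, G = len(z), len(z[0])
    return [sum(z[s][g] for s in range(N)) / N for g in range(G)]
```

Each row of `z` sums to one by construction, and so do the updated weights.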

3.2. MLE of the mixture of Extended Plackett-Luce models

As mentioned before, the conventional forward PL is a special case of the wider family of EPL distributions, obtained by setting the reference order parameter equal to the identity permutation. It follows that the estimation procedure for the mixture of PL can be easily deduced from the one derived for the mixture of EPL by fixing all the reference orders as known. Explicit estimation formulas for this special case can be found in (Gormley and Murphy, 2006). In this section we restrict ourselves to giving inferential details only for the extended model, starting with the simpler case of a homogeneous population ($G=1$).

Postulating the EPL($\rho$,$\underline{p}$) as the underlying mechanism generating the observed orderings $\pi_1^{-1},\dots,\pi_N^{-1}$, the log-likelihood has the following expression

$$\ell(\rho,\underline{p})=\sum_{s=1}^{N}\sum_{t=1}^{n}\left[\log p_{\pi_s^{-1}(\rho(t))}-\log\sum_{v=t}^{n}p_{\pi_s^{-1}(\rho(v))}\right].\qquad\qquad(3.1)$$

Note that, in order to find the MLE, the direct maximization of the log-likelihood w.r.t. the $p_i$'s is made arduous by the presence of the term $-\log\sum_{v=t}^{n}p_{\pi_s^{-1}(\rho(v))}$. We therefore derived the estimation formula for the support parameters borrowing the approach detailed in (Hunter, 2004), based on the Minorization/Maximization (MM) algorithm. This iterative optimization method is reviewed in general in (Lange et al., 2000) and (Hunter and Lange, 2004), whereas (Hunter, 2004) discusses the specific application of the MM algorithm to the estimation of the PL. The basic idea consists of performing the optimization step for the $p_i$'s on a surrogate objective function rather than on (3.1). The surrogate is obtained by exploiting the strict convexity of $-\log x$ and in particular the supporting hyperplane property of convex functions. From Taylor's linear expansion, in fact, one has

$$-\log x\;\geq\;1-\log y-\frac{x}{y}\qquad\text{for all }x,y>0,$$

and, disregarding terms not depending on $\underline{p}$, the minorizing auxiliary objective function can be written as

$$Q(\underline{p}\mid\underline{p}^{(l)})=\sum_{s=1}^{N}\sum_{t=1}^{n}\left[\log p_{\pi_s^{-1}(\rho(t))}-\frac{\sum_{v=t}^{n}p_{\pi_s^{-1}(\rho(v))}}{\sum_{v=t}^{n}p^{(l)}_{\pi_s^{-1}(\rho(v))}}\right].\qquad\qquad(3.2)$$

As emphasized by (Hunter, 2004), the convenience of optimizing the more tractable (3.2) in place of (3.1) lies in the possibility of estimating each support parameter separately. Furthermore, the iterative maximization of the surrogate returns a sequence $\{\underline{p}^{(l)}\}$ that is guaranteed to converge at least to a local maximum of the original objective function. Thus, we can differentiate $Q$ w.r.t. each $p_i$ and get

$$\frac{\partial Q(\underline{p}\mid\underline{p}^{(l)})}{\partial p_i}=\frac{N}{p_i}-\sum_{s=1}^{N}\sum_{t=1}^{n}\frac{\delta^{(l)}_{sti}}{\sum_{j=1}^{n}\delta^{(l)}_{stj}\,p^{(l)}_j},\qquad\qquad(3.3)$$

where

$$\delta^{(l)}_{sti}=\mathbb{1}\!\left\{i\in\{\pi_s^{-1}(\hat\rho^{(l)}(t)),\dots,\pi_s^{-1}(\hat\rho^{(l)}(n))\}\right\}$$

corresponds to the binary indicator for the event that item $i$ is still available at stage $t$ or, equivalently, that $i$ has not been selected by unit $s$ before stage $t$. Notice that the binary array carries a superscript because of its dependence on the $\hat\rho^{(l)}$ available at the current iteration. Equating (3.3) to zero, the updating rule at the current iteration for $p_i$ is

$$\hat p_i^{(l+1)}\;\propto\;\left[\sum_{s=1}^{N}\sum_{t=1}^{n}\frac{\delta^{(l)}_{sti}}{\sum_{j=1}^{n}\delta^{(l)}_{stj}\,\hat p^{(l)}_j}\right]^{-1},$$

with the proportionality constant fixed by the constraint $\sum_{i}\hat p_i^{(l+1)}=1$.
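Our reconstruction of this MM update can be sketched as follows (an illustrative Python version with names of our own, not the authors' R implementation; orderings are best-to-worst lists of 0-based item indices, and for complete orderings each item enters the numerator once per unit):

```python
# One MM update of the PL support parameters in the spirit of
# Hunter (2004): for every unit and stage, each still-available item
# accumulates the reciprocal of the current total support of the
# available set; the new support is the normalized reciprocal sum.
def mm_update(orderings, p):
    n = len(p)
    denom = [0.0] * n
    for seq in orderings:
        remaining = sum(p[i] for i in seq)
        for t, item in enumerate(seq):
            for i in seq[t:]:                 # items still available at stage t
                denom[i] += 1.0 / remaining
            remaining -= p[item]              # remove the selected item
    new_p = [len(orderings) / denom[i] for i in range(n)]
    total = sum(new_p)
    return [v / total for v in new_p]
```

Iterating `mm_update` monotonically increases the PL log-likelihood; items selected early by many units receive larger support.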

The update of the reference order parameter is obtained using the original log-likelihood as follows

$$\hat\rho^{(l+1)}=\arg\max_{\rho\in\mathcal{S}_n}\;\ell(\rho,\hat{\underline{p}}^{(l+1)}).\qquad\qquad(3.4)$$

Solving (3.4) with a global search in $\mathcal{S}_n$ is prohibitive when $n$ is large, as in our application to the LFPD data. We therefore implemented a local search similar to the method suggested by (Busse et al., 2007), constraining the optimization within a fixed Kendall distance from the current estimate $\hat\rho^{(l)}$ of the reference order.
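A simple variant of such a local search is a hill-climb over the adjacent-transposition neighborhood (Kendall distance 1 from the incumbent); this is a simplified sketch of ours, not the exact procedure of (Busse et al., 2007):

```python
# Hill-climbing local search over permutations: repeatedly move to the
# best-improving neighbor obtained by one adjacent transposition.
def local_search(rho, score):
    """'score' maps a permutation (tuple) to the value to be maximized."""
    best, best_val = tuple(rho), score(tuple(rho))
    improved = True
    while improved:
        improved = False
        for i in range(len(best) - 1):
            cand = list(best)
            cand[i], cand[i + 1] = cand[i + 1], cand[i]   # swap neighbors
            val = score(tuple(cand))
            if val > best_val:
                best, best_val, improved = tuple(cand), val, True
    return best, best_val
```

For instance, with the score set to minus the number of inversions of the sequence, the search bubble-sorts any starting permutation into the identity.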

Now we relax the hypothesis of a homogeneous population and consider a more flexible mixture model with $G$ EPL components, discussing the related inferential issues. If we assume our random sample to be drawn from a mixture of EPL, the probability of the generic ordering $\pi_s^{-1}$ is written as the average of its probability under each subpopulation, weighted with the corresponding mixture component weight, i.e.,

$$P(\pi_s^{-1})=\sum_{g=1}^{G}\omega_g\,P(\pi_s^{-1}\mid\rho_g,\underline{p}_g).$$

Hence, the observed log-likelihood turns out to be

$$\ell(\underline{\omega},\underline{\rho},\underline{p})=\sum_{s=1}^{N}\log\sum_{g=1}^{G}\omega_g\,P(\pi_s^{-1}\mid\rho_g,\underline{p}_g).$$

Augmenting the data with the missing individual group membership indicators $z_s=(z_{s1},\dots,z_{sG})$, one obtains the following expression for the complete log-likelihood

$$\ell_c(\underline{\omega},\underline{\rho},\underline{p})=\sum_{s=1}^{N}\sum_{g=1}^{G}z_{sg}\left[\log\omega_g+\log P(\pi_s^{-1}\mid\rho_g,\underline{p}_g)\right].$$

In the EM algorithm the maximization problem is transferred to the expectation of $\ell_c$ w.r.t. the posterior distribution of the latent variables, represented by the estimated membership probabilities $\hat z_{sg}$, that is

$$\mathbb{E}\left[\ell_c\right]=\sum_{s=1}^{N}\sum_{g=1}^{G}\hat z_{sg}\left[\log\omega_g+\log P(\pi_s^{-1}\mid\rho_g,\underline{p}_g)\right],$$

where

$$\hat z_{sg}=\frac{\omega_g\,P(\pi_s^{-1}\mid\rho_g,\underline{p}_g)}{\sum_{g'=1}^{G}\omega_{g'}\,P(\pi_s^{-1}\mid\rho_{g'},\underline{p}_{g'})}$$

for $s=1,\dots,N$ and $g=1,\dots,G$. Similarly to (Gormley and Murphy, 2006), we combined the EM with the MM algorithm into a hybrid version of the former, called the EMM algorithm, using a minorizing surrogate function analogous to (3.2), with each term weighted by $\hat z_{sg}$.

Thus, differentiating the surrogate w.r.t. each $p_{gi}$,

$$\frac{\partial Q}{\partial p_{gi}}=\frac{\sum_{s=1}^{N}\hat z_{sg}}{p_{gi}}-\sum_{s=1}^{N}\sum_{t=1}^{n}\hat z_{sg}\,\frac{\delta^{(l)}_{gsti}}{\sum_{j=1}^{n}\delta^{(l)}_{gstj}\,p^{(l)}_{gj}},\qquad\qquad(3.5)$$

and equating (3.5) to zero, the updating rule for $p_{gi}$ at the current iteration is

$$\hat p_{gi}^{(l+1)}\;\propto\;\left(\sum_{s=1}^{N}\hat z^{(l+1)}_{sg}\right)\left[\sum_{s=1}^{N}\sum_{t=1}^{n}\hat z^{(l+1)}_{sg}\,\frac{\delta^{(l)}_{gsti}}{\sum_{j=1}^{n}\delta^{(l)}_{gstj}\,\hat p^{(l)}_{gj}}\right]^{-1}$$

for $g=1,\dots,G$ and $i=1,\dots,n$, with

$$\delta^{(l)}_{gsti}=\mathbb{1}\!\left\{i\in\{\pi_s^{-1}(\hat\rho_g^{(l)}(t)),\dots,\pi_s^{-1}(\hat\rho_g^{(l)}(n))\}\right\}$$

indicating whether, under the group-specific reference order $\rho_g$, unit $s$ has not yet selected the $i$-th item before stage $t$, and hence whether at that step it still belongs to the set of available alternatives. The update of the reference order for each subgroup is

$$\hat\rho_g^{(l+1)}=\arg\max_{\rho\in\mathcal{S}_n}\sum_{s=1}^{N}\hat z^{(l+1)}_{sg}\,\log P(\pi_s^{-1}\mid\rho,\hat{\underline{p}}_g^{(l+1)}).$$

As for (3.4), the above maximization is performed locally. The M-step ends with the update of the mixture weights, computed as the posterior proportions of sample units belonging to each group,

$$\hat\omega_g^{(l+1)}=\frac{1}{N}\sum_{s=1}^{N}\hat z^{(l+1)}_{sg}.$$

3.3. Algorithm convergence and model selection

We conducted MLE inference for the DB and EPL mixture models relying on the EM algorithm and on a hybrid version thereof. We developed a suite of functions written in the R language (R Core Team, 2012), which are available upon request from the first author. In these estimation procedures the log-likelihood is iteratively maximized until convergence is achieved. As suggested by (McLachlan and Peel, 2000), the Aitken acceleration criterion has been employed to assess convergence, in place of the standard lack-of-progress criterion based on the absolute/relative increment of the log-likelihood. For a discussion on the relative merits of the Aitken acceleration criterion and other related proposals, see (McNicholas et al., 2010).
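One common form of the Aitken stopping rule compares the estimated asymptotic log-likelihood with the current value; the following is a hedged sketch of that rule (our Python, not the authors' R functions, and variants of the comparison exist in the literature):

```python
# Aitken acceleration stopping rule for an EM log-likelihood sequence:
# stop when the estimated asymptotic value is within eps of the
# current log-likelihood.
def aitken_converged(loglik, eps=1e-6):
    if len(loglik) < 3:
        return False
    l0, l1, l2 = loglik[-3], loglik[-2], loglik[-1]
    if l1 == l0:                            # flat sequence: already converged
        return True
    a = (l2 - l1) / (l1 - l0)               # Aitken acceleration factor
    l_inf = l1 + (l2 - l1) / (1.0 - a)      # asymptotic estimate
    return abs(l_inf - l2) < eps
```

On a geometrically converging sequence the rule fires well before the raw increments are comparable to `eps`, which is its advantage over the lack-of-progress criterion.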

Another crucial issue in a mixture modeling setting is the choice of the number of components. In the statistical literature this problem is addressed with several criteria; we opted for the popular Bayesian Information Criterion

$$\mathrm{BIC}=-2\,\hat\ell+\nu\log N,$$

where $\hat\ell$ is the maximized log-likelihood and $\nu$ is the number of free parameters. The BIC, introduced by (Schwarz, 1978), is a measure which balances two conflicting goals typically pursued when fitting a statistical model: good fit and parameter parsimony, the latter being enforced through the penalty term. In the presence of competing mixture models, the one associated with the lowest value of the BIC is preferred. In the next section we detail the MLE results derived from alternative mixture models fitted to the LFPD data set.
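In code the criterion is a one-liner (a sketch with our own names; how $\nu$ is counted for a given mixture, e.g. whether the discrete reference orders enter the penalty, is a modeling choice not specified here):

```python
# Bayesian Information Criterion, in the "lower is better" convention.
import math

def bic(max_loglik, n_free_params, n_obs):
    return -2.0 * max_loglik + n_free_params * math.log(n_obs)
```

Holding the penalty fixed, a better fit (larger maximized log-likelihood) yields a lower, i.e. preferred, BIC.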

4. Statistical analysis of LFPD data

4.1. The LFPD data set

Our investigation is motivated by a real data set coming from a new technology for epitope mapping of the binding between the antibodies present in a biological tissue and a target protein. The biological foundation of the experiment is detailed in (Gabrielli et al., 2013) and consists of repeated binding measurements of human blood exposed to partially overlapping fragments of the HER2 oncoprotein, denoted sequentially by Hum 1, …, Hum 11 (see Figure 1). Researchers were originally interested in testing the validity of their innovative biotechnology, which consists in a new way of isolating protein fragments without losing the conformational structure of the protein portions. To achieve this goal they employed a phage as a vector for hosting each protein fragment. Then they compared the binding outcome detected on each of the 11 fragments via a standard Enzyme-Linked Immunosorbent Assay (ELISA) with those of the whole protein (Hum 12) and the empty vector (Hum 13), used respectively as positive and negative controls (see Figure 1).

Figure 1. 1-D scheme of the HER2 oncoprotein and its segmentation into 11 partially overlapping fragments (Hum) employed in the LFPD bioassay experiment. Hum 12 and Hum 13 indicate respectively the whole HER2 oncoprotein (positive control) and the empty phage vector (negative control).

They first checked with monoclonal antibodies that the expected binding at some specific fragment was actually detected. Then they gathered samples of human blood taken from three different disease groups: i) HD, healthy subjects; ii) EBC, patients diagnosed with breast cancer at an early stage; iii) MBC, patients diagnosed with metastatic breast cancer. Binding outcomes from the ELISA experiment were detected by a laser scanner, so that binding intensities were measured and recorded in terms of absorbance levels in nanometers (nm). In the next section we motivate our statistical analysis of the LFPD data based on the ordinal information rather than on the original quantitative scale measurements.

4.2. Mixture models for the LFPD data

The original raw absorbance data derived from the LFPD experiment fluctuated wildly and looked indeed very heterogeneous, as is apparent in Figure 2.

Figure 2. Raw absorbance profiles for the three groups of patients in the LFPD study: HD healthy (green), EBC diagnosed with breast cancer at an early stage (red), MBC diagnosed with metastatic breast cancer (blue). Each broken line represents the absorbance levels on the HER2 oncoprotein fragments (Hum) of a single experimental unit. Hum 12 and Hum 13 indicate respectively the whole HER2 oncoprotein (positive control) and the empty phage vector (negative control).

However, there were certainly some manifest peaks at recurrent fragments, especially high for some patients, most frequently those diagnosed with cancer. It is also apparent that the individual absorbance profiles are measured at different mean levels and with different variability across patients. A simple logarithmic transformation and recentering w.r.t. the individual mean provided somewhat more stable evidence of differential profiles among groups. However, some specific profiles from different subgroups remain largely overlapping, although with somewhat different overall patterns (Figure 3).

Figure 3. Mean-centered log-absorbance profiles for the three groups of patients in the LFPD study: HD healthy (green), EBC diagnosed with breast cancer at an early stage (red), MBC diagnosed with metastatic breast cancer (blue). Each broken line represents the mean-centered log-absorbance levels on the HER2 oncoprotein fragments (Hum) of a single experimental unit. Hum 12 and Hum 13 indicate respectively the whole HER2 oncoprotein (positive control) and the empty phage vector (negative control).

Since the data emerged from the development of an innovative technology, miscalibrations or inaccuracies of the measuring device may occur and/or subject-specific characteristics may somehow alter the observed numerical outcome, making it more difficult to adjust the statistical analysis based on raw or ad-hoc pre-processed data. Unfortunately, for this kind of data a consolidated and fully shared normalization technique is lacking. For all these reasons, rather than basing our analysis on the quantitative output of the LFPD study, we investigated the possible usefulness of the ranking profiles as a more robust and unambiguously defined source of evidence, capable of capturing and characterizing the sample heterogeneity w.r.t. the disease status and the specific characteristic profiles of each subgroup. Hence, we first derived ordered sequences ranking the absorbance levels of the individual protein fragments in decreasing order (rank 1 = highest value, rank $K$ = lowest value). We performed a simple exploratory analysis by cancer status, computing both the first-order marginal matrices, whose generic entry indicates the observed relative frequency with which item $i$ is ranked $j$-th, and the so-called Borda orderings, listing items from the highest to the lowest mean rank. These matrices are displayed as image plots in Figure 4,

Figure 4. Image plots of the first-order marginal matrices for the three groups of patients in the LFPD study: HD healthy (left), EBC diagnosed with early stage breast cancer (center), MBC diagnosed with metastatic breast cancer (right). The upper panel refers to the data with the $K=11$ protein fragments, whereas the lower one concerns the case $K=13$, with the addition of Hum 12 and Hum 13, indicating respectively the whole HER2 oncoprotein (positive control) and the empty phage vector (negative control). The Borda ordering lists items taken in order from the highest to the lowest mean rank.

together with the Borda sequences at the bottom of each panel. The color intensity is an increasing function of the corresponding observed frequency. The analysis of the first-order marginal matrices suggests that some protein fragments are very often associated with lower ranks, as indicated by darker rectangles at the bottom positions. This occurs consistently across all disease subgroups for Hum 10, but some interesting differential evidence emerges for EBC subjects with Hum 5 and 6, for MBC with Hum 2 and 13 and also for HD patients with Hum 9 (Figure 4). Such valuable discriminant information could be better captured by our EPL. To validate this claim we fitted both the PL and the new EPL to the three disease subgroups separately. For the former we used known reference orders, alternatively forward (PL-F) and backward (PL-B), whereas for the latter the reference order is a parameter to be estimated. Estimation performances are shown in terms of BIC values in Table 1 and reveal that the fit of the EPL is better than, or at worst comparable with, those of the PL with fixed reference orders. The interest in relaxing the canonical forward assumption is also supported by the BIC values for the PL-B, showing that such a model consistently outperforms the PL-F when fitted to HD and MBC subjects. These BIC results represent strong evidence motivating the need for an extension of the PL.
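The exploratory summaries above (first-order marginals and Borda orderings) can be computed directly from the matrix of observed ranks. A minimal sketch, assuming ranks are coded 1 to $K$ (function names are ours):

```python
import numpy as np

def first_order_marginals(rankings):
    """rankings: (N, K) array where rankings[u, i] is the rank (1..K)
    assigned by sample unit u to item i. Entry (i, j) of the returned
    K x K matrix is the relative frequency with which item i is
    ranked (j+1)-th."""
    rankings = np.asarray(rankings)
    K = rankings.shape[1]
    M = np.zeros((K, K))
    for j in range(K):
        M[:, j] = (rankings == j + 1).mean(axis=0)
    return M

def borda_ordering(rankings):
    """Borda ordering: item indices sorted by mean rank, listed from
    the highest to the lowest mean rank as in the figure captions."""
    mean_ranks = np.asarray(rankings).mean(axis=0)
    return list(np.argsort(mean_ranks)[::-1])
```

Each row of the marginal matrix sums to one, since every item receives exactly one rank per sample unit; darker cells in the image plots correspond to larger entries.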


Model	HD	EBC	MBC	HD	EBC	MBC
PL-F
PL-B
EPL
Table 1. BIC values resulting from the MLE of the PL (with forward, PL-F, and backward, PL-B, reference order) and of the EPL on the three disease groups, for a different number $K$ of binding probes included in the ranking. Groups of patients are defined as follows: HD healthy, EBC diagnosed with early stage breast cancer, and MBC diagnosed with metastatic breast cancer.
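The difference between the PL with a fixed reference order and the EPL lies in the stage-wise construction of the ranking probability: positions are filled sequentially, and the order in which they are filled is either fixed (forward or backward) or treated as a parameter. A minimal sketch of this stage-wise probability, under the standard Plackett-Luce choice form and with hypothetical names for the support parameters and reference order:

```python
import numpy as np

def epl_probability(ordering, support, ref_order):
    """Probability of an observed ordering under a Plackett-Luce-type
    stage-wise model with an explicit reference order.

    ordering:  items listed from rank 1 to rank K, e.g. [2, 0, 1]
    support:   positive support parameter for each item
    ref_order: permutation of the rank positions giving the order in
               which they are filled; range(K) gives the forward PL,
               reversed(range(K)) the backward PL.
    """
    support = np.asarray(support, dtype=float)
    remaining = list(ordering)             # items not yet assigned
    prob = 1.0
    for pos in ref_order:
        item = ordering[pos]               # item filling the pos-th rank
        denom = support[remaining].sum()   # normalize over unassigned items
        prob *= support[item] / denom
        remaining.remove(item)
    return prob
```

For any fixed reference order the probabilities over all $K!$ orderings sum to one; fitting the EPL amounts to maximizing the likelihood jointly over the support parameters and this discrete reference order.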

Subsequently we considered a more comprehensive analysis in a mixture model setting. With this approach we aimed at:

  • addressing the heterogeneous nature of the LFPD data using the evidence provided by the orderings of absorbance levels;

  • assessing if and how the path in the sequential ranking process can affect the final model-based classification of experimental units, and selecting the most appropriate one;

  • looking for possible characteristic subgroups related to the disease status;

  • characterizing each subgroup with the estimates of the cluster-specific parameters.

4.3. Empirical findings from mixture models fitted to LFPD data

Considering all 67 available orderings, we fitted mixtures of DB (DBmix), mixtures of PL with both forward and backward reference order (PLmix-F and PLmix-B) and mixtures of EPL (EPLmix), the novel model we presented in section 2.3, where the reference order is a parameter to be inferred. All mixtures have been implemented with a number of components $G$ varying from 1 to 10. Of course, the case $G=1$ coincides with the assumption that observations come from a homogeneous population without an underlying group structure. We separately applied the mixture models to the ranking of absorbance levels relative to the $K=11$ partially overlapping protein fragments, as well as to all $K=13$ binding probes (spots), obtained when we additionally included the whole HER2 oncoprotein (positive control) and the empty phage vector (negative control).

Focusing on the BIC for $G>1$ compared to $G=1$, the MLE of the DBmix provided overall evidence in favor of heterogeneity with both $K=11$ and $K=13$ binding probes. We highlight a remarkable decreasing trend of the associated BIC, which persists when the fitting is carried out with up to $G=10$ components, as shown in Table 2.

$G$	1	2	3	4	5	6	7	8	9	10
$K=11$	2078.77	2003.65	1940.86	1899.33	1882.32	1863.17	1846.98	1829.81	1817.06	1798.12
$K=13$	2700.02	2617.66	2551.38	2512.19	2483.25	2451.38	2421.71	2392.10	2366.78	2342.60
Table 2. BIC values resulting from the MLE of the DBmix on the LFPD data with a varying number $G$ of components, when either $K=11$ or $K=13$ binding probes are included in the ranking.
(a)
(b)
Figure 5. BIC trends resulting from the MLE of the PLmix-F, the PLmix-B and the EPLmix on the LFPD data with a varying number of mixture components, when either $K=11$ (left) or $K=13$ (right) binding probes are included in the ranking. The symbol ✷ indicates the minimum BIC values for the final selection of the number of groups.

Indeed, fitting the DBmix with an increasing number of groups pointed out a particular feature of the DB, probably due to the sparse nature of the LFPD data. We recall, in fact, that in the present application the sample size is small w.r.t. the cardinality ($11!$ or $13!$) of the discrete ranking space. As the value of $G$ in the DBmix increases, some components start to represent just a single observation. This can be explained, perhaps, by the fact that, once the modal ranking has been fixed, the DB has only one remaining parameter to fit the shape of the uncertainty. It follows that for these components the concentration parameter is typically estimated at a very high value. This behavior, of course, could make the DBmix model insufficiently parsimonious and promising in some sparse-data situations, because its fit would lead to a sparser clustering of the observations and to less enlightening inferential findings. When stage-wise models were fitted to the LFPD data, we found again evidence in favor of the heterogeneous structure, as indicated in Table 3; in the $K=11$ case all types of mixtures consistently identify 4 groups in the sample, whereas one additional component is selected by both the PLmix and the EPLmix when the control probes are also included in the ordered sequences. Bold BIC values in Table 3 indicate the EPLmix as the best model and are, in both cases, significantly smaller than those of the competing mixtures. Indeed, this holds consistently for every considered dimension of the mixture, as elucidated by Figure 5. This supports the successful introduction of the discrete reference order parameter, which drastically improves the fit to the data. Moreover, the EPLmix exhibits good accuracy in the discrimination of sample units w.r.t. the real disease status.
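Once a mixture has been fitted, the model-based classification of experimental units follows from the posterior membership probabilities computed in the E-step; each unit is assigned to the component with the largest responsibility (MAP rule). A minimal sketch, assuming the component-wise log-likelihoods of each ranking are already available (function name is ours):

```python
import numpy as np

def map_classify(component_loglik, weights):
    """MAP cluster assignment from a fitted mixture.

    component_loglik: (N, G) log-likelihood of each ranking under each
                      mixture component
    weights:          (G,) estimated mixture weights
    Returns the (N, G) posterior membership probabilities and the (N,)
    hard cluster assignments.
    """
    log_post = np.asarray(component_loglik) + np.log(weights)
    log_post -= log_post.max(axis=1, keepdims=True)   # avoid underflow
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)           # responsibilities
    return post, post.argmax(axis=1)
```

Subtracting the row-wise maximum before exponentiating is the usual safeguard against underflow, which matters here because ranking likelihoods over $11!$ or $13!$ permutations can be extremely small.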

Mixture model	$G$	BIC ($K=11$)	$G$	BIC ($K=13$)
PLmix-F
PLmix-B
EPLmix
Table 3. BIC values and number $G$ of components of the best PLmix-F, PLmix-B and EPLmix fitted to the LFPD data, for a different number $K$ of binding probes included in the ranking.

The two resulting clusterings agree with the most relevant distinction of the real disease status (healthy/non-healthy), as pointed out in Tables 4(a) and 4(b). Specifically, collapsing also the model-based group memberships into this basic bipartition, we recognize that healthy subjects are well isolated, with only 1 or 2 false positive cases, whereas for diseased patients we have 7 misclassifications when $K=11$ but only 2 with the addition of the control spots, see Tables 4(a) and 4(b). As expected, the inclusion of the positive and negative controls produced fruitful discriminant evidence, as suggested by the global reduction of misclassifications for clusterings based on 13 ranks. Healthy patients are always modeled with two components in all mixtures fitted to the LFPD data. This hints at possibly different subtypes of healthy profiles. In fact, we easily verified that this subdivision reflects two different absorbance patterns in cancer-free units, made evident in Figure 2 by the green broken lines: one subgroup whose immune defenses essentially did not react at all to the exposure to the HER2 oncoprotein (lower panel) and one with a manifest and characterized binding profile (upper panel).

	Group
Disease Status	1	2	3	4
HD	0	2	10	8
EBC	13	12	2	1
MBC	0	15	3	1
(a) $K=11$
	Group
Disease Status	1	2	3	4	5
HD	1	10	0	1	8
EBC	12	0	9	7	0
MBC	14	2	0	3	0
(b) $K=13$
Table 4. Correspondence between the model-based clustering derived by the MLE of the EPLmix and the true disease status of the LFPD experimental units: HD healthy, EBC diagnosed with early stage breast cancer and MBC diagnosed with metastatic breast cancer.
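The collapse of the model-based clustering into the healthy/diseased bipartition can be reproduced directly from the counts in Table 4(a); labeling groups 3 and 4 as the "healthy-type" components (where HD subjects concentrate) is our reading of that table:

```python
import numpy as np

# Cross-tabulation of true status vs model-based group, from Table 4(a)
# (K = 11, G = 4). Rows: HD, EBC, MBC; columns: groups 1..4.
table4a = np.array([[0, 2, 10, 8],
                    [13, 12, 2, 1],
                    [0, 15, 3, 1]])

healthy_groups = [2, 3]    # groups 3 and 4 (0-based), where HD concentrates
diseased_groups = [0, 1]   # groups 1 and 2

# HD units falling in diseased-type groups (false positives)
false_positives = table4a[0, diseased_groups].sum()
# EBC/MBC units falling in healthy-type groups (misclassified diseased)
misclassified_diseased = table4a[1:, healthy_groups].sum()
```

Under this labeling the bipartition yields 2 false positives and 7 misclassified diseased patients for $K=11$; applying the same computation to Table 4(b) reproduces the reduction to 2 misclassifications reported for $K=13$.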

On the other hand, among the selected components that can be categorized as corresponding to diseased patients, the sub-classification between EBC and MBC is only partially recovered, especially for the latter group of patients. This is shown by the presence of at least one model-based group entirely composed of EBC subjects in all fitted mixtures, whereas MBC patients always belong to mixed-type components.

The varying correspondence between the real cancer status and the inferred clustering structure confirms the presumed dependence of the classification results on the adopted reference order. Furthermore, the good agreement obtained with the EPLmix, pointed out by Tables 4(a) and 4(b), suggests that researchers should not focus exclusively on differential epitope identification but should extend their analysis to a more global understanding of differential bindings. Hence, in order to characterize disease groups w.r.t. ranking profiles, it is interesting to interpret the component-specific modal orderings (Table 5), derived by ordering the corresponding support parameter estimates (Figure 6). Weight and reference order estimates for the identified clusters are shown in Table 6.

Mixture model	D.C. ($K=11$)	D.C. ($K=13$)
PLmix-F	*	*
PLmix-B	*	*
EPLmix
Note: The symbol * indicates mixture components which are very close to the UM.
Table 5. Modal orderings and composition w.r.t. the real cancer status of the components identified with the best PLmix-F, PLmix-B and EPLmix fitted to the LFPD data, for a different number $K$ of binding probes included in the ranking. "D.C." stands for "disease composition", listing sequentially the number of HD healthy, EBC diagnosed with early stage breast cancer and MBC diagnosed with metastatic breast cancer patients in each group.
Mixture model PLmix-