Efficient Monte Carlo Methods for Multi-Dimensional Learning with Classifier Chains

Jesse Read (corresponding author, jesse@tsc.uc3m.es), Luca Martino, David Luengo
Dept. of Signal Theory and Communications, Universidad Carlos III de Madrid, Madrid 28911, Spain (jesse,luca@tsc.uc3m.es).
Dept. of Circuits and Systems Engineering, Universidad Politécnica de Madrid, Madrid 28031, Spain (david.luengo@upm.es).
Abstract

Multi-dimensional classification (MDC) is the supervised learning problem where an instance is associated with multiple classes, rather than with a single class, as in traditional classification problems. Since these classes are often strongly correlated, modeling the dependencies between them allows MDC methods to improve their performance – at the expense of an increased computational cost. In this paper we focus on the classifier chains (CC) approach for modeling dependencies, one of the most popular and highest-performing methods for multi-label classification (MLC), a particular case of MDC which involves only binary classes (i.e., labels). The original CC algorithm makes a greedy approximation, and is fast but tends to propagate errors along the chain. Here we present novel Monte Carlo schemes, both for finding a good chain sequence and performing efficient inference. Our algorithms remain tractable for high-dimensional data sets and obtain the best predictive performance across several real data sets.

keywords:
classifier chains, multi-dimensional classification, multi-label classification, Monte Carlo methods, Bayesian inference
journal: Pattern Recognition

1 Introduction

Multi-dimensional classification (MDC) is the supervised learning problem where an instance may be associated with multiple classes, rather than with a single class as in traditional binary or multi-class single-dimensional classification (SDC) problems. So-called MDC (e.g., in Bielza et al. (2011)) is also known in the literature as multi-target, multi-output Kocev et al. (2013), or multi-objective Kocev et al. (2007) classification [Footnote 2: Multi-output, multi-target, multi-variate etc. can also refer to the regression case, where the outputs are continuous, and is related to multi-task clustering and multi-task learning]. The recently popularised task of multi-label classification (see Tsoumakas and Katakis (2007); Carvalho and Freitas (2009); Read (2010); Tsoumakas et al. (2010) for overviews) can be viewed as a particular case of the multi-dimensional problem that only involves binary classes, i.e., labels that can be turned on or off for any data instance. The MDC learning context is receiving increased attention in the literature, since it arises naturally in a wide variety of domains, such as image classification Boutell et al. (2004); Qi et al. (2009), information retrieval and text categorization Zhang and Zhou (2006), automated detection of emotions in music Trohidis et al. (2008) or bioinformatics Zhang and Zhou (2006); Barutcuoglu et al. (2006).

The main challenge in this area is modeling label dependencies while being able to deal with the scale of real-world problems. A basic approach to MDC is the independent classifiers (IC) method (commonly known as binary relevance in multi-label circles), which decomposes the MDC problem into a set of SDC problems (one per label) and uses a separate classifier for each label variable [Footnote 3: Throughout this work we use the term label to refer generally to a class variable that takes a number of discrete values (i.e., classes); not necessarily binary as in the multi-label case]. In this way, MDC is turned into a series of standard SDC problems that can be solved with any off-the-shelf binary classifier (e.g., a logistic regressor or a support vector machine [Footnote 4: Support vector machines (SVMs) are naturally binary, but can be easily adapted to a multi-class scenario by using a pairwise voting scheme, as in Hastie and Tibshirani (1998)]). Unfortunately, although IC has a low computational cost, it obtains unsatisfactory performance on many data sets and performance measures, because it does not take into account the dependencies between labels Read (2010); Tsoumakas and Vlahavas (2007); Read et al. (2011); Cheng et al. (2010); Guo and Gu (2011); Zaragoza et al. (2011).

In order to model dependencies explicitly, several alternative schemes have been proposed, such as the so-called label powerset (LP) method Tsoumakas and Katakis (2007). LP considers each potential combination of labels in the MDC problem as a single label. In this way, the multi-dimensional problem is turned into a traditional multi-class SDC problem that can be solved using standard methods. Unfortunately, given the huge number of class values produced by this transformation (especially for non-binary labels), this method is usually unfeasible for practical application, and suffers from issues like overfitting. This was recognised by Tsoumakas and Vlahavas (2007); Read et al. (2008), which provide approximations to the LP scheme that reduce these problems, although such methods have been superseded in recent years (as shown in Madjarov et al. (2012)).

A more recent idea is using classifier chains (CC), which improves the performance of IC and LP on some measures (e.g., the subset 0/1 loss) by constructing a sequence of classifiers that make use of previous outputs of the chain (see Dembczyński et al. (2012) for a detailed discussion on MLC methods and loss functions). The original CC method Read et al. (2011) performs a greedy approximation, and is fast (similar to IC in terms of complexity) but is susceptible to error propagation along the chain of classifiers. Nevertheless, a very recent extensive experimental comparison reaffirmed that CC is among the highest-performing methods for MLC, and recommended it as a benchmark algorithm Madjarov et al. (2012).

A CC-based Bayes-optimal method, probabilistic classifier chains (PCC), was recently proposed Cheng et al. (2010). However, although it improves the performance of CC, its computational cost is too large for most real-world applications. Some approaches have been proposed to reduce the computational cost of PCC at test time Zaragoza et al. (2011); Kumar et al. (2012); Dembczyński et al. (2012), but the problem is still open. Furthermore, the performance of all CC-based algorithms depends on the label order established at training time, an issue that so far has only been considered by Kumar et al. (2012) using a heuristic search algorithm called beam search.

In this paper we introduce novel methods that attain the performance of PCC, but remain tractable for high-dimensional data sets both at training and test times. Our approaches are based on a double Monte Carlo optimization technique that, aside from tractable inference, also explicitly searches the space of possible chain-sequences during the training stage. Another advantage of the proposed algorithms is that predictive performance can be traded off for scalability depending on the application. Furthermore, we demonstrate our methods with support vector machines (SVMs) as base classifiers (PCC methods have only been used under a logistic regression scheme so far). Finally, unlike the bulk of related literature, we address the general multi-dimensional scenario (as in Zaragoza et al. (2011); Kocev et al. (2013)) and provide a theoretical and empirical analysis of payoff functions for searching the chain space.

A preliminary version of this work has been published in Read et al. (2013). With respect to that paper, here we introduce three major improvements: we consider the more challenging scenario of multi-dimensional classification (i.e., multi-class labels); at the training stage, we address the problem of finding the optimum label order instead of accepting the original one or using a random label order; for the test stage, we develop a more sophisticated and efficient population Monte Carlo approach for inference.

The paper is organized as follows. In Section 2 we review MDC and the important developments leading up to this paper. In Section 3 and Section 4 we detail our novel methods for training (including learning the optimum chain sequence) and inference, respectively. In Section 5 we present an empirical evaluation of the proposed algorithms and, finally, in Section 6 we draw some conclusions and mention possible future work.

2 Multi-Dimensional Classification (MDC)

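Let us assume that we have a set of training data composed of $N$ labelled examples, $\mathcal{D} = \{(\mathbf{x}^{(n)}, \mathbf{y}^{(n)})\}_{n=1}^{N}$, where

$\mathbf{x}^{(n)} = [x_1^{(n)}, \ldots, x_D^{(n)}]^\top \in \mathcal{X} = \mathcal{X}_1 \times \cdots \times \mathcal{X}_D$

is the $n$-th feature vector (input), and

$\mathbf{y}^{(n)} = [y_1^{(n)}, \ldots, y_L^{(n)}]^\top \in \mathcal{Y} = \mathcal{Y}_1 \times \cdots \times \mathcal{Y}_L$

is the $n$-th label vector (output), with $\mathcal{Y}_\ell = \{1, \ldots, K_\ell\}$

and $K_\ell$ being the finite number of classes associated to the $\ell$-th label. The goal of MDC is learning a classification function, $\mathbf{h} = [h_1, \ldots, h_L]^\top : \mathcal{X} \to \mathcal{Y}$, that maps each feature vector to a label vector [Footnote 5: We consider $\mathbf{h}$ as a vector because this fits naturally into the independent classifier and classifier chain context, but this is not universal, and a single function $h : \mathcal{X} \to \mathcal{Y}$ is possible in other contexts (such as LP)].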

Let us assume that the unknown true posterior probability density function (PDF) of the data is $p(\mathbf{y}|\mathbf{x})$. From a Bayesian point of view, the optimal label assignment for a given test instance, $\mathbf{x}^*$, is provided by the maximum a posteriori (MAP) label estimate,

$\hat{\mathbf{y}}_{\text{MAP}} = \arg\max_{\mathbf{y} \in \mathcal{Y}} p(\mathbf{y}|\mathbf{x}^*)$,    (1)

where the search must be performed over all possible test labels, $\mathbf{y} \in \mathcal{Y}$. The MAP label estimate is the one most commonly used in the literature, although other approaches are possible, as shown in Cheng et al. (2010). Indeed, Cheng et al. (2010) shows that Eq. (1) minimizes the exact match or subset 0/1 loss, whereas the Hamming loss is minimized by finding individual classifiers that maximize the conditional probability for each label. Unfortunately, the problem is further complicated by the fact that the true density, $p(\mathbf{y}|\mathbf{x})$, is usually unknown, and the classifier has to work with an approximation, $\hat{p}(\mathbf{y}|\mathbf{x})$, constructed from the training data. Hence, the (possibly sub-optimal) label prediction is finally given by

$\hat{\mathbf{y}} = \mathbf{h}(\mathbf{x}^*) = \arg\max_{\mathbf{y} \in \mathcal{Y}} \hat{p}(\mathbf{y}|\mathbf{x}^*)$.    (2)

Table 1 summarizes the main notation used throughout this work.

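Notation    Description
$\mathbf{x} = [x_1, \ldots, x_D]^\top$    $D$-dimensional feature/input vector, with $x_d \in \mathcal{X}_d$, $d = 1, \ldots, D$.
$\mathbf{y} = [y_1, \ldots, y_L]^\top$    $L$-dimensional label/output vector, with $y_\ell \in \mathcal{Y}_\ell = \{1, \ldots, K_\ell\}$, $\ell = 1, \ldots, L$.
$\mathbf{X}$    Input matrix with all the features.
$\mathbf{Y}$    Output matrix with all the labels.
$\mathcal{D}$    Training data set, $\mathcal{D} = \{(\mathbf{x}^{(n)}, \mathbf{y}^{(n)})\}_{n=1}^{N}$.
$p(\mathbf{y}|\mathbf{x})$    Unknown true PDF of the data.
$\hat{p}(\mathbf{y}|\mathbf{x})$    Empirical PDF built by the classifier.
$\mathbf{x}^*$    Test feature vector.
$\mathbf{h} = [h_1, \ldots, h_L]^\top$    Classification function built from $\mathcal{D}$.
$\hat{\mathbf{y}} = \mathbf{h}(\mathbf{x}^*)$    Generic classifier's output.
$\hat{\mathbf{Y}} = \mathbf{h}(\mathbf{X})$    Classification matrix applied to $\mathbf{X}$.
$\mathbf{s} = [s_1, \ldots, s_L]^\top$    Label order, with $s_\ell \in \{1, \ldots, L\}$.
$\mathbf{y}_{\mathbf{s}} = [y_{s_1}, \ldots, y_{s_L}]^\top$    $L$-dimensional permuted label vector.
$\mathbf{h}_{\mathbf{s}}$    Permuted classification function.
Table 1: Summary of the main notation used in this work.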

2.1 Multi-Dimensional Classification vs. Multi-Label Classification

Although binary-only multi-label problems can be considered as a subset of multi-dimensional problems, the reverse is not true, and there are some important quantitative and qualitative differences. Quantitatively, there is a higher dimensionality (for the same value of $L$); MLC deals with $2^L$ possible values, whereas MDC deals with $\prod_{\ell=1}^{L} K_\ell$. Note that this affects the inference space, but not the sequence space (i.e., the $L!$ possible orderings of variables). Qualitatively, in MDC the distribution of “labellings” is different, even with binary class variables. In typical MLC problems, the binary classes indicate relevance (e.g., the label beach is relevant (or not) to a particular image). Hence, in practice only slightly more than labels are typically relevant to each example on average Read (2010) (see also Table 5), i.e., where is the probability of being relevant. This means that a relatively small part of the $\mathcal{Y}$-space is used. In MDC, classes (including binary classes) are used differently – e.g., a class gender ( M/F) – with a less-skewed distribution of classes; prior-knowledge of the problem aside, we expect . In summary, in MDC the practical $\mathcal{Y}$-space is much greater than in MLC, making probabilistic inference more challenging.

2.2 Independent Classifiers (IC)

The method of using independent classifiers (IC) on each label is commonly mentioned in the MLC and MDC literature Tsoumakas and Katakis (2007); Tsoumakas et al. (2010); Read et al. (2011); Zaragoza et al. (2011). For each $\ell = 1, \ldots, L$ a [standard, off-the-shelf binary] classifier $h_\ell$ is employed to map new data instances to the relevance of the $\ell$-th label, i.e.,

$\hat{\mathbf{y}} = \mathbf{h}(\mathbf{x}^*) = [h_1(\mathbf{x}^*), \ldots, h_L(\mathbf{x}^*)]^\top$,

where, probabilistically speaking, we can define each $h_\ell$ as

$h_\ell(\mathbf{x}^*) = \arg\max_{y_\ell \in \mathcal{Y}_\ell} \hat{p}(y_\ell|\mathbf{x}^*)$.    (3)

As we remarked in Section 1, this method is easy to build using off-the-shelf classifiers, but it does not explicitly model label dependencies, and its performance suffers as a result [Footnote 6: An exception to this rule is the minimization of the Hamming loss, which can be attained by considering each of the individual labels separately. Thus, modeling label dependencies does not provide an advantage in this case, as already discussed in Cheng et al. (2010); Dembczyński et al. (2012)]. In fact, it assumes complete independence, i.e., it approximates the density of the data as

$\hat{p}(\mathbf{y}|\mathbf{x}) = \prod_{\ell=1}^{L} \hat{p}(y_\ell|\mathbf{x})$.    (4)

We always expect label dependencies in a multi-label problem (otherwise we are simply dealing with a collection of unrelated problems); some labels are more likely to occur together, while others are mutually exclusive. Thus, it is important to model these dependencies, because doing so can greatly influence the outcome of the predictions.

2.3 Classifier Chains (CC)

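The classifier chains (CC) approach Read et al. (2011) is based on modeling the correlation among labels using the chain rule of probability (see Figure 1). Given a test instance, $\mathbf{x}^*$, the true label probability may be expressed exactly as

$p(\mathbf{y}|\mathbf{x}^*) = p(y_1|\mathbf{x}^*) \prod_{\ell=2}^{L} p(y_\ell|\mathbf{x}^*, y_1, \ldots, y_{\ell-1})$.    (5)

Theoretically, label order is irrelevant in Eq. (5), as all the label orderings result in the same PDF. However, since in practice we are modelling an approximation of $p$ (i.e., $\hat{p}$), label order can be very important for attaining a good classification performance, as recognized in Cheng et al. (2010); Dembczyński et al. (2012). Given some label order, $\mathbf{s} = [s_1, \ldots, s_L]^\top$ (a permutation of $\{1, \ldots, L\}$), CC approximates the true data density as

$\hat{p}(\mathbf{y}_{\mathbf{s}}|\mathbf{x}^*) = \hat{p}(y_{s_1}|\mathbf{x}^*) \prod_{\ell=2}^{L} \hat{p}(y_{s_\ell}|\mathbf{x}^*, y_{s_1}, \ldots, y_{s_{\ell-1}})$,    (6)

where $\mathbf{y}_{\mathbf{s}} = [y_{s_1}, \ldots, y_{s_L}]^\top$ is the permuted label vector (see Figure 2).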

Figure 1: General scheme of the Classifier Chains (CC) approach.
Figure 2: Example of the permuted label vector in a classifier chain with . In this example we have , so that .

First of all, CC considers an arbitrary label order, $\mathbf{s}$, and learns all the conditional probabilities in (6) from the labelled data during the training stage, thus effectively constructing a chain of classifiers like the one shown in Figure 1. Then, during the test stage, given a new (test) instance, $\mathbf{x}^*$, CC predicts $\hat{y}_{s_1}$ using only the feature vector, whereas for the $\ell$-th permuted label ($\ell = 2, \ldots, L$) it also makes use of all the previous predictions ($\hat{y}_{s_1}, \ldots, \hat{y}_{s_{\ell-1}}$), predicting each $\hat{y}_{s_\ell}$ as

$\hat{y}_{s_\ell} = h_{s_\ell}(\mathbf{x}^*) = \arg\max_{y_{s_\ell}} \hat{p}(y_{s_\ell}|\mathbf{x}^*, \hat{y}_{s_1}, \ldots, \hat{y}_{s_{\ell-1}})$.    (7)

Note that, given a data instance $\mathbf{x}^*$ and a label order $\mathbf{s}$, each possible realization of the vector $\mathbf{y}_{\mathbf{s}}$ can be seen as a path along a tree of depth $L$, and $\hat{p}(\mathbf{y}_{\mathbf{s}}|\mathbf{x}^*)$ is the payoff or utility corresponding to this path. CC follows a single path of labels greedily down the chain of binary classifiers, as shown in Figure 3 through a simple example. In carrying out classification down a chain in this way, CC models label dependencies and, as a result, usually performs much better than IC, while being similar in memory and time requirements in practice. However, due to its greedy approach (i.e., only one path is explored) and depending on the choice of $\mathbf{s}$, its performance can be very sensitive to errors, especially in the initial links of the chain Cheng et al. (2010).

Figure 3: Example of the possible paths along the tree of class labels (). The best path, , with probability , is shown with dashed lines.

2.4 Probabilistic Classifier Chains (PCC) and extensions

Probabilistic classifier chains (PCC) was introduced in Cheng et al. (2010). In the training phase, PCC is identical to CC: it considers a particular order of labels (either chosen randomly, or the default order in the dataset). However, during the test stage PCC provides Bayes-optimal inference by exploring all the possible paths (note that Cheng et al. (2010) only considers the MLC case, where $K_\ell = 2$ for $\ell = 1, \ldots, L$). Hence, for a given test instance, $\mathbf{x}^*$, PCC provides the optimum $\hat{\mathbf{y}}_{\mathbf{s}}$ that minimizes the subset 0/1 loss by maximizing the probability of the complete label vector, rather than the individual labels (as in Eq. (7)), i.e.,

$\hat{\mathbf{y}}_{\mathbf{s}} = \arg\max_{\mathbf{y}_{\mathbf{s}}} \hat{p}(\mathbf{y}_{\mathbf{s}}|\mathbf{x}^*)$,    (8)

where $\hat{p}(\mathbf{y}_{\mathbf{s}}|\mathbf{x}^*)$ is given by (6) [Footnote 7: Interestingly, it has been shown in Cheng et al. (2010); Dembczyński et al. (2012) that the optimum set of labels that minimize the Hamming loss is given by (3), i.e., the IC approach is optimal for the Hamming loss and no gain is to be expected from any other method that models correlation among labels]. In Cheng et al. (2010) an overall improvement of PCC over CC is reported, but at the expense of a high computational complexity: it is intractable for more than about labels ( paths), which represents the majority of practical problems in the multi-label domain. Moreover, since all the conditional densities in (6) are estimated from the training data, the results can also depend on the chosen label order $\mathbf{s}$, as in CC.
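The difference between greedy CC inference and exhaustive PCC inference can be illustrated with a minimal sketch. The two-label binary chain below, its probability values, and the function names (`joint`, `greedy_cc`, `exhaustive_pcc`) are all made up for illustration and are not taken from the paper; the point is only that following the locally best branch can miss the globally most probable path.

```python
from itertools import product

# Hypothetical toy chain with L = 2 binary labels (values 0/1).
# All probabilities are illustrative assumptions, not results from the paper.
P_Y1 = {0: 0.6, 1: 0.4}                        # p(y1 | x)
P_Y2_GIVEN_Y1 = {0: {0: 0.55, 1: 0.45},        # p(y2 | x, y1 = 0)
                 1: {0: 0.10, 1: 0.90}}        # p(y2 | x, y1 = 1)

def joint(y):
    """Chain-rule probability of a full path y = (y1, y2)."""
    return P_Y1[y[0]] * P_Y2_GIVEN_Y1[y[0]][y[1]]

def greedy_cc():
    """CC-style inference: pick the most probable value link by link."""
    y1 = max(P_Y1, key=P_Y1.get)
    y2 = max(P_Y2_GIVEN_Y1[y1], key=P_Y2_GIVEN_Y1[y1].get)
    return (y1, y2)

def exhaustive_pcc():
    """PCC-style inference: score every path and keep the most probable one."""
    return max(product([0, 1], repeat=2), key=joint)

print("greedy path:    ", greedy_cc(), joint(greedy_cc()))
print("exhaustive path:", exhaustive_pcc(), joint(exhaustive_pcc()))
```

In this toy setting the greedy path commits to the first label's most probable value and ends on a path with lower joint probability than the one found by enumeration, which is the error-propagation effect described above. The enumeration visits every path, so its cost grows exponentially with the number of labels.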

An approximate PCC-based inference method with a reduced computational cost has been proposed in Dembczyński et al. (2012). This approach, named $\epsilon$-approximate inference, is based on performing a depth-first search in the probabilistic tree with a cutting-off list. It is characterized by quite strong theoretical guarantees regarding the worst-case regret for the subset 0/1 loss and shows a good performance in the experiments, but does not tackle the chain ordering problem. An alternative approach, ‘beam search’ for PCC, has been proposed in Kumar et al. (2012). Beam search is a heuristic search algorithm that speeds up inference considerably and also allows experimentation with chain orderings. Furthermore, the authors of Kumar et al. (2012) mention the (promising) possibility of using Monte Carlo methods in future works. A simple Monte Carlo-based PCC approach has been considered in Dembczyński et al. (2012); Dembczynski et al. (2011) for optimizing the Hamming loss and the F-measure, respectively, during the test (i.e., inference) stage. We have independently developed a Monte Carlo-based approach in Read et al. (2013), which considers not only the test stage but also the training (i.e., chain order optimization) stage. In this paper we elaborate on this work, providing more sophisticated Monte Carlo algorithms that speed up both the training and test stages.

2.5 Bayesian Network Classifiers

Conditional dependency networks (CDN) Guo and Gu (2011) are used as a way of avoiding choosing a specific label order . Whereas both CC and PCC are dependent on the order of labels appearing in the chain, CDN is a fully connected network comprised of label-nodes for . Gibbs sampling is used for inference over steps, and the marginal probabilities collected over the final steps. However, due to having links, inference may not scale to large .

Bayesian Classifier Chains Zaragoza et al. (2011) finds a more tractable (non fully-connected) network based on a maximum spanning tree of label dependencies; although they again use the faster classifier chain-type inference, i.e., by treating the resulting graph as a directed one (by electing one of the nodes to be a root, and thus turning the graph into a tree). This method is similar to CC in the sense that classification depends on the order of nodes, but, unlike CC, it does not model all dependencies (e.g., the dependence between leaf variables is not necessarily modelled).

2.6 Inference in MDC: our approach

As explained in the previous sections, the optimal solution to the classifier chain problem is twofold:

  1. Find the best label order $\mathbf{s}$, exploring all the $L!$ possible label orders.

  2. Find the best label vector $\mathbf{y}$ within a space composed of $\prod_{\ell=1}^{L} K_\ell$ possible label vectors.

Unfortunately, this task is unfeasible except for very small values of $L$ and $K_\ell$ ($\ell = 1, \ldots, L$). Indeed, the total space has a cardinality of $\left(\prod_{\ell=1}^{L} K_\ell\right) \times L!$ (i.e., exponential times factorial). For this reason, in the following we design efficient Monte Carlo techniques to provide good solutions to both problems: finding a good label order at the training stage (see Section 3), and then a good label vector at the test (i.e., inference) stage (see Section 4).
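To give a feel for how quickly this joint space grows, the following sketch counts the label orders, the label vectors, and their product for a given list of per-label class counts. The function name `search_space_size` and the example class counts are illustrative assumptions.

```python
from math import factorial, prod

def search_space_size(num_classes):
    """Return (#label orders, #label vectors, total) for per-label class counts."""
    n_orders = factorial(len(num_classes))   # factorial part: chain orderings
    n_vectors = prod(num_classes)            # exponential part: label vectors
    return n_orders, n_vectors, n_orders * n_vectors

# Ten binary labels (an MLC-style problem) versus four multi-class labels.
print(search_space_size([2] * 10))
print(search_space_size([3, 4, 2, 5]))
```

Even with only ten binary labels the combined space already contains several billion (order, label vector) pairs, which is why exhaustive search over both is ruled out.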

3 Training stage: Finding the best classifier chain

In the training step, we want to learn each of the individual classifiers, $h_{s_\ell}$ for $\ell = 1, \ldots, L$, and, at the same time, we also wish to find the best chain order, $\mathbf{s}$, out of the $L!$ possibilities. We use a Monte Carlo approach to search this space efficiently.

3.1 Learning the label order

A first, simple, exploration of the label-sequence space is summarized in Algorithm 1. This algorithm can start either with a randomly chosen label order or with the default label order in the dataset, $\mathbf{s}_0$. In each iteration a new candidate sequence $\mathbf{s}'$ is generated randomly according to a chosen proposal density $\pi(\mathbf{s}'|\mathbf{s}_{t-1})$ (see Section 3.2 for further details). Then, a suitable payoff function $J(\mathbf{s}')$ is evaluated (see Section 3.3 for a discussion on possible payoff functions). The new candidate label order, $\mathbf{s}'$, is accepted if the value of the payoff function is increased w.r.t. the current one, $\mathbf{s}_{t-1}$ (i.e., if $J(\mathbf{s}') > J(\mathbf{s}_{t-1})$, then $\mathbf{s}_t = \mathbf{s}'$). Otherwise, it is rejected and we set $\mathbf{s}_t = \mathbf{s}_{t-1}$. After a fixed number of iterations $T_s$, the stored label order is returned as the output of the algorithm, i.e., the estimation of the best chain order provided is $\hat{\mathbf{s}} = \mathbf{s}_{T_s}$ [Footnote 8: In order to avoid overfitting, this is typically performed using internal train/test split or cross validation, i.e., using part of the training set for building the model and the rest for calculating its payoff. See Section 5 for further details].

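Input:

  • $\mathcal{D}$: training data.

  • $\pi(\mathbf{s}'|\mathbf{s})$: proposal density.

  • $\mathbf{s}_0$, $T_s$: initial label order and number of iterations.

Algorithm:

  1. For $t = 1, \ldots, T_s$:

    1. Draw $\mathbf{s}' \sim \pi(\mathbf{s}'|\mathbf{s}_{t-1})$.

    2. if $J(\mathbf{s}') > J(\mathbf{s}_{t-1})$

      • $\mathbf{s}_t \leftarrow \mathbf{s}'$ (accept).

    3. else

      • $\mathbf{s}_t \leftarrow \mathbf{s}_{t-1}$ (reject).

Output:

  • $\hat{\mathbf{s}} = \mathbf{s}_{T_s}$: estimated label order.

Algorithm 1: Finding a good label order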
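The accept/reject loop of Algorithm 1 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `swap_proposal`, `search_label_order` and `toy_payoff` are hypothetical, the proposal is a uniform swap of two positions, and the payoff is a stand-in function. In a real setting the payoff would require training a classifier chain for the candidate order and scoring it on held-out data.

```python
import random

def swap_proposal(order, rng):
    """Propose a new order by swapping two uniformly chosen positions."""
    i, j = rng.sample(range(len(order)), 2)
    new_order = list(order)
    new_order[i], new_order[j] = new_order[j], new_order[i]
    return new_order

def search_label_order(payoff, s0, n_iter, seed=0):
    """Keep a candidate order only when it strictly improves the payoff."""
    rng = random.Random(seed)
    current, current_payoff = list(s0), payoff(s0)
    for _ in range(n_iter):
        candidate = swap_proposal(current, rng)
        candidate_payoff = payoff(candidate)
        if candidate_payoff > current_payoff:      # accept
            current, current_payoff = candidate, candidate_payoff
        # otherwise reject: the current order is kept unchanged
    return current, current_payoff

def toy_payoff(order):
    """Stand-in payoff (an assumption): counts positions matching [3, 2, 1, 0]."""
    target = [3, 2, 1, 0]
    return sum(a == b for a, b in zip(order, target))

best_order, best_payoff = search_label_order(toy_payoff, [0, 1, 2, 3], n_iter=200)
print(best_order, best_payoff)
```

Because only strict improvements are accepted, the payoff of the stored order never decreases, but the search can stall in a local optimum when no single proposal improves on the current order.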

As we show in Section 3.3, the payoff function $J(\mathbf{s})$ is based on $\hat{p}(\mathbf{y}|\mathbf{x})$, an approximation of the true data density, $p(\mathbf{y}|\mathbf{x})$. Hence, in order to decrease the dependence on the training step we can consider a population of $M$ estimated label orders, $\{\mathbf{s}^{(i)}\}_{i=1}^{M}$, instead of a single one. This method is detailed in Algorithm 2. The underlying idea is similar to the previous Algorithm 1, but returning the best $M$ label orders (the ones with the highest payoff) after $T_s$ iterations instead of a single label order.

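Input:

  • $\mathcal{D}$: training data.

  • $\pi(\mathbf{s}'|\mathbf{s})$: proposal density.

  • $\mathbf{s}_0$, $T_s$, $M$: initial order, number of iterations and population size.

Algorithm:

  1. For $t = 1, \ldots, T_s$:

    1. Draw $\mathbf{s}' \sim \pi(\mathbf{s}'|\mathbf{s}_{t-1})$.

    2. if $J(\mathbf{s}') > J(\mathbf{s}_{t-1})$

      • $\mathbf{s}_t \leftarrow \mathbf{s}'$ (accept).

      • $w_t \leftarrow J(\mathbf{s}')$ (set).

    3. else

      • $\mathbf{s}_t \leftarrow \mathbf{s}_{t-1}$ (reject).

      • $w_t \leftarrow J(\mathbf{s}_{t-1})$ (set).

  2. Sort $\mathbf{s}_1, \ldots, \mathbf{s}_{T_s}$ decreasingly w.r.t. $w_1, \ldots, w_{T_s}$, taking the top $M$.

Output:

  • $\{\mathbf{s}^{(i)}\}_{i=1}^{M}$: population of best estimated label orders.

  • $\{w^{(i)}\}_{i=1}^{M}$: corresponding weights.

Algorithm 2: Finding a good population of label orders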
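A population variant of the same search can be sketched as follows. The function name `search_order_population`, the swap proposal and the toy payoff are assumptions made for illustration. One further assumption is labeled in the code: orders that are kept over several iterations are stored once rather than once per iteration, so the returned population contains distinct orders.

```python
import random

def search_order_population(payoff, s0, n_iter, pop_size, seed=0):
    """Run the accept/reject search and return the top `pop_size` orders visited."""
    rng = random.Random(seed)
    current, current_payoff = tuple(s0), payoff(s0)
    visited = {current: current_payoff}          # order -> weight (its payoff)
    for _ in range(n_iter):
        i, j = rng.sample(range(len(current)), 2)
        candidate = list(current)
        candidate[i], candidate[j] = candidate[j], candidate[i]
        candidate = tuple(candidate)
        candidate_payoff = payoff(candidate)
        if candidate_payoff > current_payoff:    # accept
            current, current_payoff = candidate, candidate_payoff
        # Assumption: repeated orders are de-duplicated via the dictionary key.
        visited[current] = current_payoff
    ranked = sorted(visited.items(), key=lambda item: item[1], reverse=True)
    return ranked[:pop_size]

def toy_payoff(order):
    """Stand-in payoff (an assumption): counts positions matching (3, 2, 1, 0)."""
    return sum(a == b for a, b in zip(order, (3, 2, 1, 0)))

population = search_order_population(toy_payoff, [0, 1, 2, 3], n_iter=200, pop_size=3)
for order, weight in population:
    print(order, weight)
```

The returned weights can later serve as (unnormalised) selection probabilities when a label order has to be drawn from the population at test time.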

Once we have described the two proposed Monte Carlo approaches for the training step, the following two sections are devoted to the critical issues for both of the algorithms: the choice of the proposal (Section 3.2) and of the payoff function (Section 3.3).

3.2 Choice of the proposal function

In order to explore the sequence space, $\mathcal{S}$, a proposal mechanism is required. We remark that performing a search in $\mathcal{S}$ requires (a) learning a probabilistic model and (b) building a new classifier chain for each sequence we want to try. Hence, this stage is inherently much more expensive than searching the label space and the number of label orders that can be explored is thus very limited. Therefore, the proposal density must be both simple and effective. Below, we describe two possibilities.

First proposal scheme: As a first approach we consider a very simple proposal. Specifically, given a sequence

$\mathbf{s}_{t-1} = [s_{t-1}(1), \ldots, s_{t-1}(L)]^\top$,

the proposal function consists of choosing uniformly two positions of the label order ($1 \leq \ell, m \leq L$) and swapping the labels corresponding to those positions, so that $s_t(\ell) = s_{t-1}(m)$ and $s_t(m) = s_{t-1}(\ell)$.

Second proposal scheme: The previous proposal does not make full use of all the available information. For instance, due to the chain structure, changing the initial ‘links’ in the chain (e.g., $s(1)$ or $s(2)$) implies a larger jump in the sequence space than changing the final links (e.g., $s(L-1)$ or $s(L)$). Indeed, if the first $\ell$ labels in $\mathbf{s}_t$ remain unchanged w.r.t. $\mathbf{s}_{t-1}$, only $L - \ell$ classifiers need to be re-trained, thus saving valuable computation time. In light of this observation, we propose an improvement of the previous proposal based on freezing the links in the chain progressively from the beginning to the end [Footnote 9: This idea follows the line of the different tempering strategies found in the literature, such as simulated annealing or simulated tempering Kirkpatrick et al. (1983). However, from a Monte Carlo point of view there is an important difference: our tempering is applied to the proposal, whereas the classical tempering is used to change the target]. This allows the algorithm to explore the whole sequence space uniformly in the initial iterations (i.e., potentially requiring re-training of the whole classifier chain), but focuses gradually on the last labels of the sequence, which require almost no re-training and are very cheap to explore. In this case, the first label at the $t$-th iteration is drawn from

(9)

with the second label drawn from

(10)

where $\beta$ is a user-defined and constant parameter. First of all, note that the expressions (9)-(10) indicate only the proportionality of the probabilities w.r.t. $t$ and $\ell$ or $m$, i.e., in order to obtain the probability mass function we have to normalize the weights above. Moreover, observe that for $\beta > 0$ the probability of choosing an index $\ell$ (resp. $m$) depends on the position $\ell$ (resp. $m$) and the time $t$. More specifically, this probability increases with the value of $\ell$ (resp. $m$), and this effect grows as $t$ increases, with the probability mass function becoming a delta located at the last possible position when $t \to \infty$. The speed of convergence is controlled by the parameter $\beta$: the higher the value of $\beta$, the faster Eqs. (9) and (10) become delta functions.
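The qualitative behaviour of the tempered proposal can be sketched as below. The exact expressions of Eqs. (9)-(10) are not reproduced here, so the weight form $(\ell/L)^{\beta t}$ used in the code is purely an assumption, chosen only because it has the properties stated in the text: it is uniform at $t = 0$, it increases with the position, and it concentrates on the last position as $t$ grows, at a rate set by $\beta$. The names `tempered_weights` and `tempered_swap` are hypothetical.

```python
import random

def tempered_weights(n_positions, t, beta):
    """Normalised position weights under the ASSUMED form (l / L) ** (beta * t)."""
    raw = [(l / n_positions) ** (beta * t) for l in range(1, n_positions + 1)]
    total = sum(raw)                      # the last raw weight is always 1.0
    return [w / total for w in raw]

def tempered_swap(order, t, beta, rng):
    """Swap two distinct positions, both drawn with a preference for late links."""
    n = len(order)
    probs = tempered_weights(n, t, beta)
    first = rng.choices(range(n), weights=probs)[0]
    # Second position: same weights, restricted to the remaining positions.
    rest = [i for i in range(n) if i != first]
    rest_weights = [probs[i] for i in rest]
    if sum(rest_weights) == 0.0:          # all remaining weights underflowed
        second = rest[-1]
    else:
        second = rng.choices(rest, weights=rest_weights)[0]
    new_order = list(order)
    new_order[first], new_order[second] = new_order[second], new_order[first]
    return new_order

print(tempered_weights(4, t=0, beta=1.0))    # uniform over the four positions
print(tempered_weights(4, t=50, beta=1.0))   # almost all mass on the last position
print(tempered_swap([0, 1, 2, 3], t=5, beta=0.5, rng=random.Random(1)))
```

Under this assumed form, early iterations behave like the uniform swap of the first scheme, while late iterations mostly touch the final links of the chain, where few classifiers need re-training.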

3.3 Cost functions: Bayesian risk minimization

Let us define two matrices, $\mathbf{X}$ and $\mathbf{Y}$, containing all the features and observations in the training set respectively. Furthermore, let us assume that the data associated with different training instances are independent, i.e.,

$p(\mathbf{Y}|\mathbf{X}) = \prod_{n=1}^{N} p(\mathbf{y}^{(n)}|\mathbf{x}^{(n)})$.    (11)

From a Bayesian point of view, the best model (i.e., the best chain or label order) is the one that minimizes the Bayesian risk Cheng et al. (2010); Dembczyński et al. (2012); Trees (2001). Let us define a generic cost function,

(12)

where we have used and to simplify the notation, is a generic functional and is some appropriate loss function, . The Bayesian risk is the expected cost over the joint density of the data given the model,

(13)

with denoting the mathematical expectation w.r.t. the joint conditional density , and the optimum chain corresponding to the label order which minimizes this risk. For a given set of training data, the best label order can be determined in a pointwise way by taking the expectation w.r.t. the conditional probability Cheng et al. (2010); Dembczyński et al. (2012) [Footnote 10: Note that in Cheng et al. (2010); Dembczyński et al. (2012) this approach is followed to find the best classifier for a given label order, whereas here we use it to find the best label order (i.e., the best model)]:

(14)

where we have made use of (11) to obtain the last expression [Footnote 11: In practice, we use internal validation to avoid overfitting: the training set is divided into two: a first part for training the classifiers and a second part for validation. Thus, all the expressions in this section should really consider only the validation set, which will be a subset of the training set. However, in the following we always consider for the sake of simplicity].

In the following, we explore several cost and loss functions commonly used in MLC and MDC, showing their probabilistic interpretation from the point of view of finding the best label order.

3.3.1 Additive cost functions

In this section we consider the functional , i.e., an additive cost function. Thus, we have

(15)

Inserting (15) into (14), and after some algebra, we obtain the following estimator for additive cost functions:

(16)

Unfortunately, minimizing (16) for a generic loss function can be unfeasible in practice. However, by focusing on two of the most common losses used in MLC and MDC (the exact match and the Hamming losses), simple expressions with a straightforward probabilistic interpretation may be found. First of all, let us consider the exact match loss [Footnote 12: Also called by some authors the subset 0/1 loss (cf. Cheng et al. (2010))], which is defined as

(17)

where returns 1 if its predicate holds and 0 otherwise. Using (17), (16) can be expressed as

(18)

From (18) it can be seen that minimizing the exact match loss is equivalent to maximizing the sum of the likelihoods of the predictions for each of the instances in the validation set [Footnote 13: Note that this is equivalent to the result obtained in Cheng et al. (2010); Dembczyński et al. (2012) for the test stage, i.e., for inferring the best $\hat{\mathbf{y}}$ for a given label order $\mathbf{s}$]. Therefore, in order to minimize the exact match loss we should use the following payoff function:

(19)
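The step from the expected exact match loss to a sum of likelihoods can be made explicit with a short worked derivation. This is a sketch in assumed notation, not a reproduction of the paper's own equations: $\hat{\mathbf{y}}^{(n)}_{\mathbf{s}}$ denotes the prediction for the $n$-th instance under label order $\mathbf{s}$, $\hat{p}(\cdot|\mathbf{x}^{(n)},\mathbf{s})$ the model learnt for that order, and $\mathbb{1}[\cdot]$ the indicator function.

```latex
% Expected 0/1 loss of the prediction for a single instance: every label
% vector except the predicted one contributes its probability mass.
\mathbb{E}\left[ \mathbb{1}\big[\mathbf{y}^{(n)} \neq \hat{\mathbf{y}}^{(n)}_{\mathbf{s}}\big] \right]
  = \sum_{\mathbf{y} \in \mathcal{Y}}
      \mathbb{1}\big[\mathbf{y} \neq \hat{\mathbf{y}}^{(n)}_{\mathbf{s}}\big]\,
      \hat{p}(\mathbf{y}|\mathbf{x}^{(n)},\mathbf{s})
  = 1 - \hat{p}(\hat{\mathbf{y}}^{(n)}_{\mathbf{s}}|\mathbf{x}^{(n)},\mathbf{s}).

% Summing over the N instances (additive cost), the constant N drops out
% of the optimisation and the minimisation becomes a maximisation.
\arg\min_{\mathbf{s}} \sum_{n=1}^{N}
    \left[ 1 - \hat{p}(\hat{\mathbf{y}}^{(n)}_{\mathbf{s}}|\mathbf{x}^{(n)},\mathbf{s}) \right]
  = \arg\max_{\mathbf{s}} \sum_{n=1}^{N}
    \hat{p}(\hat{\mathbf{y}}^{(n)}_{\mathbf{s}}|\mathbf{x}^{(n)},\mathbf{s}).
```

The second line is the "sum of the likelihoods of the predictions" interpretation stated in the text: a label order is preferred when its chain assigns high probability to its own predictions on the validation instances.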

As a second example, we consider the Hamming loss [Footnote 14: The name is due to the fact that it corresponds to the Hamming distance for the binary labels used in MLC. Although this is no longer true for the non-binary labels that can appear in MDC, this definition is still valid and we keep the name used in MLC]:

(20)

Unlike the exact match loss, which returns the same value when $\hat{\mathbf{y}} \neq \mathbf{y}$ regardless of how dissimilar they are, the Hamming loss looks at each label component separately. Using (20), it can be shown (see the Appendix) that, for the Hamming loss, (16) becomes

(21)

Hence, from (21) we notice that the Hamming loss is minimized by maximizing the sum of the likelihoods of the individual label predictions, given only the data, for each of the instances in the validation set [Footnote 15: Once more this is equivalent to the result obtained in Cheng et al. (2010); Dembczyński et al. (2012) for the test stage] [Footnote 16: Note that the CC approach returns $\hat{p}(y_{s_\ell}|\mathbf{x}, y_{s_1}, \ldots, y_{s_{\ell-1}})$ instead of $\hat{p}(y_{s_\ell}|\mathbf{x})$. However, an estimate of the probabilities required by (21) and (22) can be easily obtained by summing over the unnecessary variables, i.e., $\hat{p}(y_{s_\ell}|\mathbf{x}) = \sum_{y_{s_1}} \cdots \sum_{y_{s_{\ell-1}}} \hat{p}(y_{s_1}, \ldots, y_{s_\ell}|\mathbf{x})$]. Thus, the corresponding payoff required for minimizing the Hamming loss is

(22)

3.3.2 Multiplicative cost functions

As a second family of cost functions we consider multiplicative cost functions, i.e., we consider a functional , which leads us to

(23)

Inserting (23) into (14), the estimator is now given by

(24)

which has a similar functional form to (16), with the of the inner sum inside the outer sum. Hence, following an identical procedure to the one in Eq. (18) for the exact match loss, we obtain

(25)

which corresponds to the maximum of the likelihood function. Hence, the corresponding payoff function is precisely the likelihood function:

(26)

Similarly, following the steps shown in the Appendix for the additive cost function, we may obtain the estimator for the Hamming loss in the multiplicative case:

(27)

which is similar to (25), but now the product is on the individual label likelihoods instead of the global likelihoods of the different instances. The payoff function in this case is

(28)
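The four payoffs described in words in this section (sum or product, over joint or per-label likelihoods of the predictions) can be compared with a small sketch. The function names and the probability values are illustrative assumptions. Here `joint_probs[n]` stands for the model probability of the full predicted label vector of instance `n`, and `marginal_probs[n][l]` for the probability of the predicted value of label `l` given only the features.

```python
from math import prod

def additive_exact_match(joint_probs):
    """Sum over instances of the likelihood of each full predicted label vector."""
    return sum(joint_probs)

def multiplicative_exact_match(joint_probs):
    """Product over instances, i.e. the likelihood of all predictions together."""
    return prod(joint_probs)

def additive_hamming(marginal_probs):
    """Sum over instances and labels of the individual label likelihoods."""
    return sum(sum(row) for row in marginal_probs)

def multiplicative_hamming(marginal_probs):
    """Product over instances and labels of the individual label likelihoods."""
    return prod(prod(row) for row in marginal_probs)

# Two validation instances with two labels each (made-up values).
joint_probs = [0.5, 0.25]
marginal_probs = [[0.5, 0.5], [0.5, 0.25]]

print(additive_exact_match(joint_probs))
print(multiplicative_exact_match(joint_probs))
print(additive_hamming(marginal_probs))
print(multiplicative_hamming(marginal_probs))
```

For validation sets of realistic size the products become extremely small, so an implementation would typically compare sums of log-probabilities instead; this leaves the ranking of label orders unchanged because the logarithm is monotonic.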

4 Test (inference) stage: Finding the best label vector

In the test stage, for a given test instance $\mathbf{x}^*$ and a label order $\mathbf{s}$, our aim is finding the optimal label vector $\hat{\mathbf{y}}$ that maximizes Eq. (8). The PCC method Cheng et al. (2010) solves this part analytically (by performing an exhaustive search). However, this method becomes computationally intractable for anything but small $L$ (the full space involves $\prod_{\ell=1}^{L} K_\ell$ possible paths).

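The goal is providing a Monte Carlo (MC) approximation of the estimated label vector,

$\hat{\mathbf{y}}_{\mathbf{s}} = \arg\max_{\mathbf{y}_{\mathbf{s}}} \hat{p}(\mathbf{y}_{\mathbf{s}}|\mathbf{x}^*)$    (29)

for the minimization of the exact-match loss or

$\hat{y}_{s_\ell} = \arg\max_{y_{s_\ell}} \hat{p}(y_{s_\ell}|\mathbf{x}^*), \quad \ell = 1, \ldots, L,$    (30)

for the Hamming loss, such that the MC estimate converges to this label vector when $T_y \to \infty$, with $T_y$ being the number of iterations of the MC algorithm.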

A first possible MC approach for the minimization of the exact match loss is provided by Algorithm 3 [Footnote 17: Algorithm 3 can also be used to minimize the Hamming loss, simply replacing the acceptance condition in step 1 with the corresponding comparison of the Hamming payoff in Eq. (32)] [Footnote 18: An MC-based approach like the one shown in Algorithm 3 has been independently proposed in Dembczyński et al. (2012) for the minimization of the exact match loss during the test stage]. Given a test instance $\mathbf{x}^*$ and a label order $\mathbf{s}$, this algorithm starts from an initial label vector $\mathbf{y}_0$ arbitrarily chosen (e.g., randomly or from the greedy inference offered by standard CC), and draws $T_y$ samples $\mathbf{y}_t$ ($t = 1, \ldots, T_y$) directly from the model learnt in the training stage, $\hat{p}(\mathbf{y}_{\mathbf{s}}|\mathbf{x}^*)$ [Footnote 19: Note that this is equivalent to generating random paths in the tree of class labels according to the corresponding weights associated to each branch (see Figure 3)]. Then, the label vector with the highest payoff is returned as the output, i.e., , with

(31)

for the minimization of the exact-match loss and

(32)

when the goal is minimizing the Hamming loss. From a Monte Carlo point of view, it is important to remark that all the candidate vectors are always drawn directly from the target density, , i.e., is always a valid path on a tree selected according to the weights of the different branches. This is an important consideration, since it guarantees that the estimated label vector, , will always be a feasible path.

Input:

  • : test instance and given label order.

  • : probabilistic model.

  • : initial label vector and number of iterations.

Algorithm:

  1. For :

    1. Draw .

    2. if

      • accept.

    3. else

      • reject.

Output:

  • : predicted label assignment.

Algorithm 3 Obtaining that minimizes the exact-match loss for a given test instance .
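The accept/reject loop of Algorithm 3 can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: labels are binary, the payoff for the exact-match loss is taken to be the model's joint probability of a path, and the trained chain is replaced by a hypothetical toy model, `cond_prob_one`. All function and parameter names are invented for the example.

```python
import random

def cond_prob_one(j, prefix):
    """Hypothetical stand-in for a trained chain link: P(y_j = 1 | x, prefix)."""
    if j == 0:
        return 0.6
    return 0.8 if prefix[-1] == 1 else 0.3

def joint_prob(y):
    """Chain-rule probability of a full binary label vector y."""
    p = 1.0
    for j, yj in enumerate(y):
        p1 = cond_prob_one(j, y[:j])
        p *= p1 if yj == 1 else 1.0 - p1
    return p

def sample_path(num_labels, rng):
    """Draw one label vector from the chain, one label at a time."""
    y = []
    for j in range(num_labels):
        y.append(1 if rng.random() < cond_prob_one(j, y) else 0)
    return tuple(y)

def mc_inference(num_labels, y_init, num_iters, seed=0):
    """Keep the most probable path among y_init and num_iters sampled paths."""
    rng = random.Random(seed)
    best, best_p = tuple(y_init), joint_prob(y_init)
    for _ in range(num_iters):
        cand = sample_path(num_labels, rng)
        cand_p = joint_prob(cand)
        if cand_p > best_p:  # accept the candidate
            best, best_p = cand, cand_p
        # otherwise reject it and keep the current vector
    return best, best_p

y_hat, p_hat = mc_inference(num_labels=4, y_init=(0, 0, 0, 0), num_iters=200)
print(y_hat, round(p_hat, 4))
```

Because every candidate is generated by sampling the chain itself, high-probability paths are proposed often, so the search tends to reach a good vector after far fewer evaluations than an exhaustive enumeration would need.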

As previously discussed, the inference of Algorithm 3 depends strictly on the chosen label order . For this reason, we also propose another scheme that uses a population of label orders (chosen randomly or obtained using Algorithm 2). A naive procedure to incorporate this information in the inference technique would be running parallel algorithms to find sequences of labels (like Algorithm 3) using different label orders () and then selecting the best one. However, this approach is computationally inefficient. In Algorithm 4 we propose a more sophisticated approach that makes use of the information within the entire population but requires running only one random search. The main steps of the method can be summarized as follows:

  1. A label order is selected according to some weights (e.g., those provided by Algorithm 2) proportional to a certain payoff function.

  2. A good label vector is found by following Algorithm 3.

  3. The procedure is repeated times, with the best label vector for the label orders explored being returned as the output.

Input:

  • : test instance.

  • : population of label orders.

  • : corresponding weights.

  • : probabilistic model.

  • : number of iterations for searching and resp.

  • : initial label vector.

Algorithm:

  1. For :

    1. Choose for .

    2. Set .

    3. For :

      1. .

      2. if

        • accept.

      3. else

        • reject.

    4. Set .

Output:

  • : predicted label assignment.

Algorithm 4 Obtaining that minimizes the exact-match loss given , and a population .
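A rough sketch of the population-based search of Algorithm 4 is given below. Several details are assumptions made for the example: labels are binary, a label order is drawn with probability proportional to its weight, each inner search starts from the best vector found so far, and candidates are compared under the model of the currently selected order. The toy model `cond_prob_one` and the numbers in `BASE` are hypothetical.

```python
import random

BASE = [0.7, 0.4, 0.6]  # made-up base rates for three labels

def cond_prob_one(order, j, y):
    """Hypothetical chain link: P(label order[j] = 1 | x, earlier labels in order)."""
    base = BASE[order[j]]
    if j == 0:
        return base
    # Toy dependency on the label that precedes this one in the chain.
    return base + 0.2 if y[order[j - 1]] == 1 else base - 0.2

def joint_prob(order, y):
    """Probability of label vector y under the chain built on the given order."""
    p = 1.0
    for j, label in enumerate(order):
        p1 = cond_prob_one(order, j, y)
        p *= p1 if y[label] == 1 else 1.0 - p1
    return p

def sample_path(order, rng):
    """Sample the labels in chain order; return them indexed by label."""
    y = [0] * len(order)
    for j, label in enumerate(order):
        y[label] = 1 if rng.random() < cond_prob_one(order, j, y) else 0
    return tuple(y)

def population_mc_inference(orders, weights, y_init, outer_iters, inner_iters, seed=0):
    rng = random.Random(seed)
    best = tuple(y_init)
    for _ in range(outer_iters):
        # Choose a label order with probability proportional to its weight.
        order = rng.choices(orders, weights=weights, k=1)[0]
        current = best  # start the inner search from the best vector so far
        for _ in range(inner_iters):
            cand = sample_path(order, rng)
            if joint_prob(order, cand) > joint_prob(order, current):
                current = cand  # accept
            # otherwise reject and keep the current vector
        best = current
    return best

orders = [[0, 1, 2], [2, 1, 0]]
y_hat = population_mc_inference(orders, [0.5, 0.5], (0, 0, 0), outer_iters=5, inner_iters=50)
print(y_hat)
```

Only one random search is run in total, with the label order resampled at each outer iteration, rather than one full search per order in the population.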

5 Experiments

To compare both the performance and the computational effort fairly, we progressively apply the ideas introduced in the previous sections to form four novel methods:

  • MCC (Algorithm 3): given a classifier chain trained on some previously-determined label order (e.g., randomly as in CC or PCC), we infer the label vector for all test instances using a simple MC approach.

  • MCC (Algorithm 1 plus Algorithm 3): like MCC, but we additionally search for a suitable label order during the training stage. Specifically, we use Algorithm 1 with the simplest proposal density described in the first part of Section 3.2.

  • PMCC (Algorithm 2 plus Algorithm 4): population version of MCC, still using the simplest proposal density described in Section 3.2.

  • PMCC (Algorithm 2 plus Algorithm 4): PMCC with the improved proposal described in the last part of Section 3.2.

Note that the first two methods differ in how they obtain the label order (randomly chosen for the first, estimated using Algorithm 1 for the second), whereas the two population-based methods differ in the proposal used to search for the best label order.

5.1 Comparison of different cost functions

In this section we analyze the performance of different payoff functions: the two additive payoffs given by (19) and (22), and the multiplicative payoff of Eq. (26). Initially we focus on the Music dataset, because it is faster to run and easier to visualise than other datasets. Indeed, since (see Table 5) we can find the optimum label order for the exact match payoff () by performing an exhaustive search over the possibilities. Table 2 shows that the proposed Monte Carlo approach (Algorithm 1) arrives at the optimum label order under two separate initializations after 1935 and 1626 iterations respectively, although after a much smaller number of iterations (310 and 225 respectively) the difference in payoff is already minimal. The search also converges when maximizing (Table not displayed), although to a different maximum, specifically, .

[5, 2, 4, 1, 0, 3] 164.92 -1079.26 2889.89
[5, 4, 0, 1, 2, 3] 166.14 -1084.91 2887.3