A snapshot on nonstandard supervised learning problems
Machine learning is a field which studies how machines can alter and adapt their behavior, improving their actions according to the information they are given. This field is subdivided into multiple areas, among which the best known are supervised learning (e.g. classification and regression) and unsupervised learning (e.g. clustering and association rules).
Within supervised learning, most research focuses on well-known standard tasks, such as binary classification, multiclass classification and regression with one dependent variable. However, there are many other, less known problems, which we generically call nonstandard supervised learning problems. The literature about them is much sparser, and each study addresses a specific task. Therefore, the definitions, relations and applications of these kinds of problems are hard to find.
The goal of this paper is to provide the reader with a broad view on the distinct variations of nonstandard supervised problems. A comprehensive taxonomy summarizing their traits is proposed. A review of the common approaches followed to accomplish them and their main applications is provided as well.
Keywords: Machine learning, Supervised learning, Nonstandard learning
MSC: 68T05, 68T10
According to Mitchell learning-mitchell (), a machine is said to learn from experience $E$ related to a class of tasks $T$ and performance metric $P$ when its performance at tasks in $T$ improves according to $P$ after experience $E$.
Supervised learning is one of the fundamental areas of machine learning learning-marsland (). From object detection to ecological modeling to emotion recognition, it covers all kinds of applications. It essentially consists in learning a function by training with a set of input-output pairs. The training stage can be seen as $E$ in the previous definition, and the specific task $T$ may vary, but usually involves predicting an appropriate output given a new input.
Traditionally, supervised learning problems have been divided into two categories: classification and regression classification (); pattern-rec (). In the former, information is divided into discrete categories, while the latter involves patterns associated with a value in a continuous spectrum.
These problems can be processed by learning from a training dataset, which is composed of instances. Typically, these instances or samples take the form $(x, y)$ where $x$ is a vector of values in the space of input variables and $y$ is a value of the target variable. Each problem can be described by the type of its instances: inputs will usually belong to a subset of $\mathbb{R}^n$, and outputs will take values in a specific one-dimensional set, finite or continuous. Once trained, the obtained model can be used to predict the target variable on unseen instances.
Standard classification problems are those where labels are either binary or multiclass classification-duda (); multiclass (). In the binary case, an instance can only be associated with one of two values: positive or negative, which is equivalent to 0 or 1. For example, email messages may be classified as spam or legitimate, and tumours can be categorized as either benign or malignant. Multiclass problems, on the other hand, involve any finite number of classes. That is, any given instance will belong to one of possibly many categories, which is equivalent to it being assigned a natural number below a convenient threshold. As an example, a photograph of a plant or a sound recording from an animal could correspond to one of a variety of species.
A standard regression problem learning-tibshirani (); regression () consists in finding a function which is able to predict, for a given example, a real value among a continuous range, usually an interval or the set of real numbers $\mathbb{R}$. For example, the height of a person may be estimated from several characteristics such as age or country of origin.
Even though these standard problems are applicable in a multitude of cases, there are situations whose correct modeling requires modifications of their structure. For example, a newspaper article can be categorized according to its contents, but it could be desirable to assign several categories simultaneously. Similarly, a social media post could be described by not one but two input vectors, an image and a piece of text. These special circumstances cannot be covered by the traditional one-vector input and one-dimensional output schema. As a consequence, since performance metrics which measure improvements in standard tasks assume the common structure, they lose applicability or meaning in these cases. Thus, not only are new techniques needed to tackle the problems, but also new ways of measuring and comparing their success.
This work studies variations on classic supervised problems where the traditional structure is not obeyed, which we call nonstandard variations. These emerge when the structure of the classical components of the problems does not suffice to describe complex situations, such as multiplicity of inputs or outputs, or order restrictions. As a consequence, this manuscript does not cover other singular supervised problems, such as high dimensionality of the feature space highdim () or unbalanced training sets imbalanced (); imbalanced-krawczyk (), nor time-dependent problems, such as data streams streams (); streams2 () or time series timeseries ().
The rest of the paper is structured as follows. Section 2 formally defines and describes each nonstandard variation. This is followed by Section 3 establishing relations among the introduced problems and proposing a taxonomy of them. Section 4 describes the most common techniques used to solve them. After that, Section 5 enumerates popular applications of each problem. Section 6 covers other variations further from the ones previously detailed. Lastly, Section 7 draws some conclusions.
2 Definitions of nonstandard variations
The problems introduced in this section are generalizations over the traditional versions of classification and regression. The focus is on fully supervised problems, where inputs are always paired with outputs during training. An alternative taxonomy based on different supervision models is introduced in weak-nonstandard ().
In this work we will establish a notation which intends to be as simple to understand as possible, while being able to encompass every nonstandard variation. First, any supervised learning problem consists in finding a function which will classify, rank or perform regression. It will be denoted as
$$f: X \to Y$$
where $X$ is an input set, or domain, and $Y$ is an output set, or codomain. It will be assumed that a training dataset $D$ is provided, including a finite number of input-output pairs:
$$D = \{(x_i, y_i) : x_i \in X,\ y_i \in Y,\ i = 1, \dots, m\}$$
This way, a learning algorithm will be able to generate the desired function $f$. An additional notation will be the set of labels $L$ where convenient.
For example, in standard binary classification $X \subset \mathbb{R}^n$ and $Y = \{0, 1\}$. Similarly, standard regression problems can be defined with the same kind of set $X$ and $Y \subset \mathbb{R}$. Thus, we can define very distinct supervised problems by particularizing sets $X$ or $Y$ in different ways.
Other usual notations are based on probability theory, thus involving random variables and probability distributions gaussianproc (); learning-murphy (). In that case, $X$ and $Y$ would be the sample spaces of the input and output random variables, respectively. Predictors would usually attempt to infer a discriminant model from the training dataset.
The multi-instance (MI) framework mil () assumes a single feature space for all instances, but each training pattern may consist of more than one instance. In this case, a training pattern is composed of a finite multiset or bag of instances and a label. Formally, assuming instances are drawn from a set $A$, the domain can be described as follows:
$$X = \{ b : b \text{ is a finite multiset of elements of } A \}$$
In this case, the learning algorithm will not know the label associated with each instance, only the one associated with each bag. In addition, instances may not all share the same relevance or be equally related to the label.
Some MI problems assume that hidden labels are present for each instance in a bag: for example, a training set of drug tests where, for each test, several drug types are analyzed. Additionally, a typical MI assumption in the binary scenario states that a bag is positive when at least one of its instances is positive, and it is negative otherwise mil-assumptions ().
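The standard MI assumption can be stated compactly in code. The following Python sketch is only an illustration: the function name `bag_label` and the hidden instance labels are hypothetical, and real MI datasets do not expose per-instance labels during training.

```python
from typing import Sequence


def bag_label(instance_labels: Sequence[int]) -> int:
    """Standard MI assumption: a bag is positive (1) if and only if
    at least one of its instances is positive; otherwise it is negative (0)."""
    return int(any(label == 1 for label in instance_labels))


# Hypothetical hidden instance labels for three bags (illustration only).
hidden_labels = {
    "bag_a": [0, 0, 1],
    "bag_b": [0, 0, 0],
    "bag_c": [1, 1],
}

bag_labels = {name: bag_label(labels) for name, labels in hidden_labels.items()}
print(bag_labels)
```

Only `bag_a` and `bag_c` are labeled positive, since `bag_b` contains no positive instance.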
Other MI problems differ in that a per-instance labeling may not be possible or may not make sense: for example, if each bag represents an image and instances are image segments, class beach can only apply to bags with water and sand segments, but it cannot apply to an individual instance.
A learning problem is considered to be multi-view (MV) mviewl () when inputs are composed of several components of very different nature.
For example, if a learning pattern consists of an image as well as a piece of text representing the same instance, they can be seen as two views on it. In that case, images and texts would belong to distinct feature spaces $A_1$ and $A_2$ respectively, an input pattern being $(x_1, x_2) \in A_1 \times A_2$. More generally, we can describe the input space as:
$$X = A_1 \times A_2 \times \dots \times A_t, \quad A_i \subset \mathbb{R}^{n_i}$$
where $t$ is the number of views offered by the problem and $n_i$ is the dimension of the feature space of the $i$-th view.
The multi-label (ML) learning field mlc (); mltutorial () studies problems related to simultaneously assigning multiple labels to a single instance. That is, if $L$ is the set of available labels, the codomain consists of all possible selections of these labels, also known as labelsets:
$$Y = 2^L \cong \{0, 1\}^{|L|}$$
As shown by this formulation, it is equivalent to think of a selection of labels as a subset of $L$ and as a binary vector. For example, the labelset composed of the first and third labels can be represented either by $\{l_1, l_3\}$ or $(1, 0, 1, 0, \dots, 0)$.
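The equivalence between the subset and binary-vector representations of a labelset can be illustrated with a short Python sketch. The label names and the helper functions `labelset_to_vector` and `vector_to_labelset` are assumptions made for this example and do not come from any specific library.

```python
from typing import FrozenSet, Iterable, List, Sequence


def labelset_to_vector(labelset: Iterable[str], labels: Sequence[str]) -> List[int]:
    """Encode a labelset as a binary vector with one component per label."""
    active = set(labelset)
    return [1 if label in active else 0 for label in labels]


def vector_to_labelset(vector: Sequence[int], labels: Sequence[str]) -> FrozenSet[str]:
    """Decode a binary vector back into the subset of active labels."""
    if len(vector) != len(labels):
        raise ValueError("vector and label list must have the same length")
    return frozenset(label for label, bit in zip(labels, vector) if bit == 1)


# Hypothetical label space; the first and third labels are active.
LABELS = ["economy", "politics", "sports", "science"]
vector = labelset_to_vector({"economy", "sports"}, LABELS)
recovered = vector_to_labelset(vector, LABELS)
print(vector, sorted(recovered))
```

Both functions are inverses of each other for a fixed ordering of the labels, which is what makes the two formulations interchangeable.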
The difference that arises when comparing ML problems to binary or multiclass ones is that labels may interact with each other. For example, a news piece classified in economy is more likely to be labeled politics than sports. Similarly, a photograph labeled ocean is less likely to have the mountains label than the beach label. Methods may take advantage of label co-occurrence scumble () in order to reduce the search space when predicting a labelset.
Multi-dimensional (MD) learning mdc () is a generalized classification problem where categorization is performed simultaneously along several dimensions. Each instance can belong to one of many classes in each dimension, thus the output space can be formally described as:
$$Y = L_1 \times L_2 \times \dots \times L_p$$
where $L_i$ is the label space for the $i$-th dimension.
As with ML learning, label dimensions may be related in some way and treating them independently would only be a naive solution to the problem.
2.6 Label distribution learning
In label distribution learning (LDL) problems ldl (), otherwise known as probabilistic class label problems ldl-prob (), any instance can be described in different degrees by each label. This can be modeled as a discrete distribution over the labels, where the probability of a label given a specific instance is called its degree of description. Analytically, the objective is, for each instance, to predict a real-valued vector whose components sum to exactly 1:
$$Y = \left\{ y \in [0, 1]^{|L|} : \sum_{i=1}^{|L|} y_i = 1 \right\}$$
In this case, we would say that the $i$-th label in $L$ describes an instance with degree $y_i$.
2.7 Label ranking
In a label ranking (LR) problem lrankpairwise (); lranksurvey () the objective is not to find a function able to choose one or several labels from the label space. Instead, it must evaluate their relevance for each unseen instance. The most general version of the problem involves a training set $D \subset X \times Y$ where $Y$ is the set of all partial orders of $L$, and the obtained function also maps individual instances to partial orders. This way, for each test instance the function will output a sequence of preferences where some labels will be seen as more relevant than others.
However, the typical situation in label ranking problems is that the orders are total, which means any two labels can always be compared. This is called a ranking and does not exclude the possibility of ties. When ties are not allowed it is said to be a sorting or permutation, and can be formulated as follows:
$$Y = \{ \sigma : L \to \{1, \dots, k\} \mid \sigma \text{ is bijective} \}$$
where $k = |L|$ is the number of labels. $Y$ can also be seen as the set of all permutations of the labels in $L$, usually known as the symmetric group on $k$ elements and denoted $S_k$.
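As a small illustration of the sorting case, the following Python sketch enumerates the output space for a hypothetical three-label problem and derives a permutation from per-label relevance scores. The label names, the scores and the tie-breaking rule (alphabetical order) are assumptions chosen for this example.

```python
from itertools import permutations
from math import factorial
from typing import Dict, Tuple


def scores_to_permutation(scores: Dict[str, float]) -> Tuple[str, ...]:
    """Sort labels by decreasing relevance score.

    Ties are broken alphabetically so that the result is always a strict
    order (a permutation), never a ranking with ties.
    """
    return tuple(sorted(scores, key=lambda label: (-scores[label], label)))


# Hypothetical label space with k = 3 labels.
labels = ["beach", "city", "mountain"]

# The output space contains every permutation of the labels: k! elements.
output_space = list(permutations(labels))
print(len(output_space), factorial(len(labels)))

ranking = scores_to_permutation({"beach": 0.7, "mountain": 0.1, "city": 0.2})
print(ranking)
```

The size of the output space grows factorially with the number of labels, which is one reason why LR is usually tackled through decompositions rather than by treating each permutation as a class.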
2.8 Multi-target regression
A regression problem where the output space has more than just one dimension is usually called multi-target regression (MTR) and is also known as multi-output, multi-variate or multi-response moutr (). In this case, a formal description is simply that the codomain is a continuous multi-dimensional real set:
$$Y \subset \mathbb{R}^p$$
where $p$ is the number of target variables.
As with other multiple target extensions, the key difference with single-target regression in this case is the possible interactions among output variables.
2.9 Ordinal regression
A problem where the target space is discrete but ordered is called ordinal regression (OR) or, alternatively, ordinal classification ord-survey (). It can be located midway between classification and regression. More specifically, it consists in labeling instances with a finite number of choices where these are ordered:
$$Y = \{l_1, l_2, \dots, l_c\}, \quad l_1 < l_2 < \dots < l_c$$
In OR, the training phase consists in learning from a set of feature vectors which have a specific label associated with them, and testing can be performed over individual instances. This means that, although labels are ordered, the main objective is not to rank or sort instances as in learning to rank ltr (), but to simply classify them. The labels themselves do not provide any metric information either: they only carry qualitative information about the order among themselves.
2.10 Monotonicity constraints
Order relations can exist not only in the label space but in the feature space as well. Partial orders among real-valued feature vectors are always possible, and there may be cases where the order among instances is determined by just one or a few of their attributes.
When inputs as well as outputs are at least partially ordered, it is common to look for predictions which respect their order relations. In that case, the objective is to obtain a classifier or regression function $f$ which enforces the following constraint:
$$x \leq x' \implies f(x) \leq f(x') \quad \text{for all } x, x' \in X$$
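The monotonicity constraint can be checked directly on a labeled dataset. The sketch below assumes the componentwise partial order on feature vectors, which is one common choice but not the only possible one; the function names and the toy data are hypothetical.

```python
from typing import List, Sequence, Tuple


def leq(x: Sequence[float], x2: Sequence[float]) -> bool:
    """Componentwise partial order: x <= x2 iff every coordinate of x is <= that of x2."""
    return all(a <= b for a, b in zip(x, x2))


def monotonicity_violations(
    X: Sequence[Sequence[float]], y: Sequence[float]
) -> List[Tuple[int, int]]:
    """Return index pairs (i, j) with X[i] <= X[j] but y[i] > y[j]."""
    return [
        (i, j)
        for i in range(len(X))
        for j in range(len(X))
        if i != j and leq(X[i], X[j]) and y[i] > y[j]
    ]


# Toy data: instance 1 is dominated by instance 3 but has a higher label.
X = [(1, 1), (2, 3), (3, 2), (3, 3)]
y = [0, 2, 1, 1]
violations = monotonicity_violations(X, y)
print(violations)
```

Instances 1 and 2 are not comparable under this order, so they impose no constraint on each other regardless of their labels.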
2.11 Absence or partiality of information
Some problems do not directly alter the structure of and from the standard supervised problem. Instead, they restrict which data can belong to a training set, or remove labelings from training examples. In this case, training information is presented partially or with some exclusions.
According to which kind of information is missing from the training set, a learning task can usually be categorized as semi-supervised semi-sup (), one-class learning oneclass (), PU-learning pu-learn (), zero-shot learning zeroshot () or one-shot learning oneshot (). These are described further in Section 6.1.
2.12 Variation combinations
Some of the components described above can be combined to compose a more complex problem overall. Usually, one of these combinations will take components from different variation types, for example, simultaneous multiplicity of inputs and outputs.
More specifically, there exist several studies involving MI ML scenarios miml (); miml2 (). In this case, examples from the input space are composed of several feature vectors and are associated with various labels. As a consequence, this model can represent many complicated problems where inputs and outputs have more structure than usual.
Other, less common situations are MV MI ML problems mvmiml (), where patterns have several instances which may or may not belong to the same space; a multi-output version of OR named graded ML classification graded-ml (); and more complex input structures such as multi-layer MI MV mlmimv (), where a hierarchy of instances is present in each example.
A first categorization of the variations analyzed in this work can be made according to how they differ from the standard problem. There can be multiplicity in the input space or the output space, order constraints may exist, or only partial information may be given in some cases. Fig. 1 shows ways in which the traditional problems can be generalized.
Problems introducing multiple inputs are MI and MV, whereas multiple outputs can be found in ML, MD, LR, LDL and MTR. Problems where orders are present are OR, monotonic classification (MC) and isotonic regression (IR). Likewise, tasks with only partial information are, among others, semi-supervised learning, one-shot classification and zero-shot classification.
Finally, a generalized problem can be built out of combining several of these components: for example, a multiple-input multiple-output problem where the inputs and outputs can belong to structures like the ones defined above.
The rest of this section studies variations on the structure of the input space and output space, establishes relations among problems, and describes how they can be particularized or generalized to one another.
3.1 Input structure
In a standard supervised problem, the input space consists of single feature vectors and does not impose a specific order.
Problems where learning patterns are composed of multiple instances can usually be categorized into either MI, if the inputs share the same structure, or MV, otherwise. Their combination can also be considered, e.g. a problem where an example is composed of one or more photographs and one or more pieces of text. This would be a case of an MV MI problem.
There are also problems where there exists a partial or total order among instances, which is coupled with an order constraint in relation to the outputs. These are MC and IR.
Fig. 2 summarizes these structural traits in a hierarchy and indicates problems where these traits are present.
3.2 Output structure
The diversity in output variations is higher than that of the input ones. A first sorting criterion is whether the codomain is discrete or continuous. This way, problems are either classification or regression ones.
Further subdivision of problems allows separating these traits according to whether outputs remain scalars or become vectors. In the first case we consider order in the discrete scenario a nonstandard variation, which is present in OR and MC. In the second case, classification problems are divided into ML, LR and MD, and regression ones into LDL and MTR.
Fig. 3 organizes these traits in a hierarchy based on the previous criteria. Each leaf of the tree also lists the problems where the corresponding trait is present.
The variations in the structure of target spaces in supervised problems can be seen as generalizations of the standard problems. Furthermore, some of them are also more general than others. For example, ML problems can be seen as LR ones where, for a given instance, labels over a threshold are active and those below are not. Thus, LR is a generalization of the ML scenario. More relations of this kind are displayed in Fig. 4.
As shown in the graph, an inclusion of more target variables of the same type transforms a binary problem into ML, a multiclass problem into MD and a single-target regression one into MTR. Similarly, inclusion of more values into each variable allows generalizing binary problems to multiclass, and ordinal to single-target regression, as well as ML ones to MD and these to MTR. LDL can be seen as a generalization of ML where real numbers between 0 and 1 are also allowed as values for a label. LR is a generalization of ML by the argument discussed before.
In this section input and output variations of standard supervised problems have been categorized and related. Table 1 identifies specific problems according to which input and output traits are present.
| | Unordered outputs | | Ordered outputs | | | |
| Unordered inputs | standard classification classification () | ML/MD classification mlc (); mdc () | OR ord-survey () | standard regression regression () | Graded ML graded-ml () | MTR moutr () |
| Ordered inputs | - | - | MC mc-salva () | IR ir-book () | - | - |
| Multiple instances | MI classification mil () | MIML/MIMD classification miml () | - | MI regression mil () | - | - |
| Multiple views | MV classification mviewl () | MVML/MVMD classification mvmiml () | - | MV regression mviewl () | - | - |
4 Common approaches to tackle nonstandard problems
When tackling a nonstandard problem, most techniques follow one of two main approaches: problem transformation or algorithm adaptation. The first one relies on appropriate transformations of the data which result in one or more simpler, standard problems. The latter implies an extension or development of previously existing algorithms, in order to adapt them to the complexities induced by the structure of the data.
In the following subsections several methods based on both approaches are enumerated for each analysed problem.
4.1 Problem transformation
Problem transformation methods assume that a solution can be achieved by extracting one or more simpler problems out of the original one. For example, a problem with multi-dimensional targets could be transformed into many problems with scalar outputs. Then, these problems could be solved independently by a classical algorithm. A solution for the original problem would be the concatenation of those extracted from the simpler ones.
Next, the most common transformation techniques are described for each nonstandard supervised learning task previously introduced.
For MI problems, the taxonomy proposed in mic-taxonomy () describes an Embedded Space paradigm, where each bag is transformed into a single feature vector representing the relevant information about the whole bag. This transformation turns the MI problem into a single-instance one. Most of these methods are vocabulary-based, which means that the embedding uses a set of concepts to classify each bag according to its instances, resulting in a single vector with one component per concept.
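A vocabulary-based embedding can be sketched as follows. This is a minimal histogram-style variant, assuming that the vocabulary of concepts is already available as a list of prototype points (in practice it would typically be obtained by clustering the training instances). The function `embed_bag`, the concepts and the bag are hypothetical.

```python
from math import dist
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def embed_bag(bag: Sequence[Point], concepts: Sequence[Point]) -> List[float]:
    """Map a bag to a vector with one component per concept.

    Component j is the fraction of instances in the bag whose nearest
    concept (in Euclidean distance) is concept j.
    """
    if not bag:
        raise ValueError("a bag must contain at least one instance")
    counts = [0] * len(concepts)
    for instance in bag:
        nearest = min(range(len(concepts)), key=lambda j: dist(instance, concepts[j]))
        counts[nearest] += 1
    return [count / len(bag) for count in counts]


# Hypothetical vocabulary of three concepts and a bag of four instances.
concepts = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
bag = [(0.1, 0.2), (0.9, 1.1), (1.2, 0.8), (4.8, 5.1)]
embedding = embed_bag(bag, concepts)
print(embedding)
```

After this step every bag is a fixed-length vector, so any standard single-instance classifier can be trained on the embedded dataset.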
For MV problems, some naive transformations consist in ignoring every view except one, or concatenating feature vectors from all views, thus training a single-view model in both cases mv-spectral (). A preprocessing based on Canonical Correlation Analysis mv-cca () is able to project data from multiple views onto a lower-dimensional, single-view space.
Transformation methods for ML classification mlmethods () are diverse: Binary Relevance trains separate binary classifiers for each label. Label Powerset reduces the problem to a multiclass one by treating each individual labelset as an independent class label, and Random k-Labelsets ml-rakel () extracts an ensemble of multiclass problems similarly. Classifier chains ml-chains () trains subsequent binary classifiers accumulating previous predictions as inputs. ML problems can also be transformed to LR ml-clr ().
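The data transformations behind Binary Relevance and Label Powerset can be sketched in a few lines of Python. Only the construction of the target variables is shown; the classifiers trained afterwards are omitted. The function names and the toy labelsets are assumptions made for this example.

```python
from typing import Dict, FrozenSet, Iterable, List, Sequence, Tuple


def binary_relevance(
    Y: Sequence[Iterable[str]], labels: Sequence[str]
) -> Dict[str, List[int]]:
    """Binary Relevance: one binary target vector per label."""
    return {label: [1 if label in set(ys) else 0 for ys in Y] for label in labels}


def label_powerset(
    Y: Sequence[Iterable[str]],
) -> Tuple[List[int], Dict[FrozenSet[str], int]]:
    """Label Powerset: each distinct labelset becomes one multiclass label.

    Class identifiers are assigned in order of first appearance.
    """
    class_of: Dict[FrozenSet[str], int] = {}
    targets: List[int] = []
    for ys in Y:
        key = frozenset(ys)
        if key not in class_of:
            class_of[key] = len(class_of)
        targets.append(class_of[key])
    return targets, class_of


# Toy labelsets for four training instances.
labels = ["economy", "politics", "sports"]
Y = [{"economy", "politics"}, {"sports"}, {"economy", "politics"}, set()]

br_targets = binary_relevance(Y, labels)
lp_targets, lp_classes = label_powerset(Y)
print(br_targets)
print(lp_targets)
```

Binary Relevance produces one problem per label and ignores label interactions, whereas Label Powerset keeps them at the cost of a number of classes that can grow up to the number of distinct labelsets in the training data.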
In MD problems, independent classifiers can in some cases be trained for the different dimensions mdc (); mdc-indep (), but this method ignores possible correlations among dimensions. An alternative transformation, building a different label from each combination of classes, would produce a much larger label space and thus is not typically applied.
An LDL problem can be reduced to multiclass classification by extracting as many single-label examples as labels for each one of the training instances ldl (). These new examples are assigned a class corresponding to each label and weighted according to its degree of description. During the prediction process, the classifier must be able to output the score/confidence for each label, which can be used as its description degree.
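The reduction of LDL to weighted multiclass classification can be sketched as a dataset expansion. In the Python example below, the function name, the tolerance used to validate each distribution and the toy data are assumptions; the weighted classifier that would consume the expanded dataset is not shown.

```python
from typing import List, Sequence, Tuple

Instance = Tuple[float, ...]


def ldl_to_weighted_multiclass(
    X: Sequence[Instance], D: Sequence[Sequence[float]]
) -> List[Tuple[Instance, int, float]]:
    """Expand each (instance, label distribution) pair into one weighted
    single-label example per label: (instance, label index, weight)."""
    examples = []
    for x, distribution in zip(X, D):
        if abs(sum(distribution) - 1.0) > 1e-9:
            raise ValueError("each label distribution must sum to 1")
        for label_index, degree in enumerate(distribution):
            examples.append((x, label_index, degree))
    return examples


# Two toy instances described over three labels.
X = [(0.2, 0.5), (0.9, 0.1)]
D = [(0.7, 0.2, 0.1), (0.0, 0.5, 0.5)]
weighted = ldl_to_weighted_multiclass(X, D)
for example in weighted:
    print(example)
```

Examples with zero weight are kept here for clarity; an implementation could drop them, since they do not contribute to a weighted loss.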
A reduction of an LR problem to several binary problems can be achieved by learning pairwise preferences lrankpairwise (). This transforms a $k$-label problem into $k(k-1)/2$ binary problems, each describing a comparison between two labels. An alternative reduction by means of constraint classification lr-constraint () builds a single binary classification dataset by expanding each label preference into a new positive instance and a new negative instance. The feature space of the new binary problem has dimension $nk$, where $n$ is the original dimension and $k$ the number of labels, due to the constraints embedded in it by Kesler’s construction nilsson ().
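The pairwise decomposition can be illustrated with the following Python sketch, which builds one binary dataset per pair of labels. It assumes that each training ranking is given as a tuple of labels from most to least preferred, possibly incomplete; the function name and the toy data are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Sequence, Tuple

Instance = Tuple[float, ...]


def pairwise_datasets(
    X: Sequence[Instance], rankings: Sequence[Sequence[str]], labels: Sequence[str]
) -> Dict[Tuple[str, str], List[Tuple[Instance, int]]]:
    """Build one binary dataset per unordered pair of labels (a, b).

    The target is 1 when a is preferred over b and 0 otherwise. Instances
    whose ranking does not contain both labels are skipped for that pair.
    """
    datasets = {}
    for a, b in combinations(labels, 2):
        rows = []
        for x, ranking in zip(X, rankings):
            if a in ranking and b in ranking:
                rows.append((x, 1 if ranking.index(a) < ranking.index(b) else 0))
        datasets[(a, b)] = rows
    return datasets


labels = ["a", "b", "c", "d"]
X = [(0.1,), (0.9,)]
rankings = [("a", "b", "c", "d"), ("d", "c", "a")]  # the second one is incomplete
datasets = pairwise_datasets(X, rankings, labels)
print(len(datasets))
print(datasets[("a", "c")])
```

With four labels, six binary problems are produced, in agreement with the $k(k-1)/2$ count. At prediction time, the outputs of the binary models are usually aggregated, for instance by voting, to produce a full ranking.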
There are several ways to transform an MTR problem into several single-target regression ones. Some of them are inspired by the ML field, such as a one-vs-all single-target reduction, multi-target stacking and regressor chains mtrviaml (). All of them train single-target regressors for several extracted problems, and then combine the obtained predictions. A different approach based on support vectors mtr-lssvr () extends the feature space so that the multi-output problem is expressed as a single-target one that can be solved using least squares support vector regression machines.
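The idea behind regressor chains can be sketched with any single-target base learner. The Python example below uses a deliberately simple 1-nearest-neighbor regressor as a stand-in, which is an assumption made to keep the sketch self-contained; all class and function names, as well as the toy data, are hypothetical.

```python
from math import dist
from typing import List, Sequence, Tuple

Vector = Tuple[float, ...]


class OneNNRegressor:
    """Stand-in base learner: predicts the target of the closest training point."""

    def fit(self, X: Sequence[Vector], y: Sequence[float]) -> "OneNNRegressor":
        self.X, self.y = list(X), list(y)
        return self

    def predict_one(self, x: Vector) -> float:
        nearest = min(range(len(self.X)), key=lambda i: dist(x, self.X[i]))
        return self.y[nearest]


def fit_chain(X: Sequence[Vector], Y: Sequence[Vector]) -> List[OneNNRegressor]:
    """Train one regressor per target; model j sees the inputs plus the
    true values of targets 0..j-1 as additional features."""
    models = []
    for j in range(len(Y[0])):
        X_aug = [tuple(x) + tuple(y[:j]) for x, y in zip(X, Y)]
        models.append(OneNNRegressor().fit(X_aug, [y[j] for y in Y]))
    return models


def predict_chain(models: Sequence[OneNNRegressor], x: Vector) -> Vector:
    """Predict targets in chain order, feeding earlier predictions forward."""
    predictions: List[float] = []
    for model in models:
        predictions.append(model.predict_one(tuple(x) + tuple(predictions)))
    return tuple(predictions)


# Toy problem with one input feature and two targets.
X = [(0.0,), (1.0,), (2.0,)]
Y = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]
chain = fit_chain(X, Y)
prediction = predict_chain(chain, (1.1,))
print(prediction)
```

Dropping the augmentation step (always using only `x` as input) yields the independent single-target reduction, which ignores relations among targets.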
An ordinal problem with $c$ classes can be transformed into $c-1$ binary classification problems by using each class from the second to the last one as a threshold for the positive class ord-simple (). This decomposition can be called ordered partitions and is not the only possible one: others are one-vs-next, one-vs-followers and one-vs-previous ord-survey (). Several 3-class problems can also be obtained by using, for the $i$-th problem, classes “$l < l_i$”, “$l = l_i$” and “$l > l_i$”.
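The ordered partitions decomposition can be sketched as follows, assuming that the ordinal classes are encoded as integers from 0 to c-1. The decoding rule shown (counting the thresholds predicted as exceeded) is one simple choice made for this example and not necessarily the one used in the cited works; the function names are hypothetical.

```python
from typing import List, Sequence


def ordered_partitions(y: Sequence[int], n_classes: int) -> List[List[int]]:
    """Build the c-1 binary target vectors of the decomposition.

    For threshold k = 1, ..., c-1, the binary target of an instance is 1
    when its ordinal class is at least k, and 0 otherwise.
    """
    return [[1 if label >= k else 0 for label in y] for k in range(1, n_classes)]


def decode(binary_predictions: Sequence[int]) -> int:
    """Recover an ordinal class as the number of thresholds exceeded.

    This assumes the binary predictions are used as given; inconsistent
    patterns such as (0, 1, 0) are still mapped to a class.
    """
    return sum(binary_predictions)


# Four instances with ordinal classes in {0, 1, 2, 3}.
y = [0, 2, 1, 3]
binary_targets = ordered_partitions(y, n_classes=4)
for k, targets in enumerate(binary_targets, start=1):
    print(f"class >= {k}:", targets)

print(decode([1, 1, 0]))
```

Each binary problem can then be solved with any standard classifier, and the resulting predictions are combined per instance to obtain the ordinal label.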
The authors in monotonicity () describe a procedure to tackle binary MC problems by means of IR. Multiclass MC cases can be reduced to several binary MC ones, which in turn are solved as IR problems.
4.2 Algorithm adaptation
Existing methods for classical problems can be extended in order to introduce the necessary complexities of nonstandard variations. As an example, nearest neighbor methods could be coupled with new distance metrics in order to be able to measure similarity among multiple inputs.
The rest of this section presents some algorithm adaptations which can be used to tackle nonstandard supervised tasks.
MI methods that work at the instance level are adaptations of algorithms from single-instance classification whose responses are then aggregated to build the bag-level classification mic-taxonomy (). They typically assume that one positive instance implies a positive bag. Adaptations of common algorithms have been proposed with support vector machines (SVM) mi-svm () and neural networks mi-nn (), whereas some original methods in this area are Axis-Parallel Rectangles mi-apr () and Diverse Density mi-framework (). In the bag-space paradigm, methods treat bags as a whole and use specific distance metrics with distance-based as well as kernel-based classifiers, such as k-nearest neighbor (k-NN) mi-knn () or SVM mi-kernel ().
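A bag-space method can be sketched by pairing a nearest neighbor rule with a distance between bags. The example below uses the Hausdorff distance, a common choice for comparing point sets; whether it matches the metric used in the cited works is not asserted here. The function names and the toy bags are hypothetical.

```python
from math import dist
from typing import Sequence, Tuple

Point = Tuple[float, ...]
Bag = Sequence[Point]


def hausdorff(bag_a: Bag, bag_b: Bag) -> float:
    """Hausdorff distance between two non-empty bags of instances."""

    def directed(source: Bag, target: Bag) -> float:
        # Largest distance from a point in source to its closest point in target.
        return max(min(dist(p, q) for q in target) for p in source)

    return max(directed(bag_a, bag_b), directed(bag_b, bag_a))


def nearest_bag_label(query: Bag, train_bags: Sequence[Bag], train_labels: Sequence[int]) -> int:
    """1-NN at bag level: return the label of the closest training bag."""
    closest = min(range(len(train_bags)), key=lambda i: hausdorff(query, train_bags[i]))
    return train_labels[closest]


# Two toy training bags and one query bag.
train_bags = [[(0, 0), (0, 1)], [(5, 5), (6, 5)]]
train_labels = [0, 1]
query = [(5, 6), (6, 6)]

predicted = nearest_bag_label(query, train_bags, train_labels)
print(predicted)
```

Because the distance is defined on whole bags, no assumption about hidden instance labels is needed, which is the main difference with instance-level methods.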
Supervised methods for MV are comparatively less developed than semi-supervised ones. Nonetheless, there is an extension of SVM mv-svm () which simultaneously looks for two SVMs, one in each of the feature spaces of a two-view problem. There is an extension of Fisher discriminant analysis as well mv-fda ().
The most relevant algorithm adaptations for ML classification mlmethods () are based on standard classification algorithms with added support for choosing more than one class at a time: adaptations exist for k-NN ml-knn (), decision trees ml-dt (), SVMs ml-svm (), association rules ml-rules () and ensembles mlensembles ().
Proposals for LDL in ldl () are adaptations of k-NN, with a special derivation of the label distribution of an unseen instance given its neighbors, and of neural networks trained by backpropagation, where the output layer indicates the label distribution of an instance. Other proposed methods are based on the optimization algorithms BFGS and Improved Iterative Scaling.
The first methods able to treat MTR problems were actually generalizations of statistical methods for single-target regression mtr-rank (); mtr-canon (). Other common methods which have been extended to predict multiple regression variables are support vector regression mtr-svr1 (); mtr-svr2 (), kernel-based methods mtr-kern1 (); mtr-kern2 (), and regression trees mtr-trees () as well as random forests mtr-rf ().
Neural networks can be used to tackle OR with slight changes in the loss function or the output layer or-nn (); or-nn2 (). Similarly, extreme learning machines have also been applied to this problem or-elm (); or-elm2 (). Common techniques such as k-NN or decision trees have been coupled with global constraints for OR or-knn-dt (), and extensions of other well known algorithms such as Gaussian processes or-gp () and AdaBoost or-ada () have been proposed as well.
Algorithm adaptations for MC generally take a well-known technique and add monotonicity constraints. For example, there exist in the literature adaptations of k-NN mc-knn (), decision trees mc-trees (), decision rules mc-rules (); mc-rules2 () and artificial neural networks mc-monnets ().
Table 2 gathers all the methods described previously to tackle nonstandard supervised tasks.
| Task | Problem transformation | Algorithm adaptation |
| MI | Embedded-space mic-taxonomy () | SVM mi-svm (); mi-kernel () |
| | | Neural networks mi-nn () |
| | | k-NN mi-knn () |
| MV | Canonical correlation analysis mv-cca () | SVM mv-svm () |
| | | Fisher discriminant analysis mv-fda () |
| ML | Binary Relevance mlmethods () | k-NN ml-knn () |
| | Label Powerset mlmethods () | Decision trees ml-dt () |
| | Classifier chains ml-chains () | SVM ml-svm () |
| | | Association rules ml-rules () |
| | | Ensembles mlensembles () |
| MD | Independent classifiers mdc (); mdc-indep () | Bayesian networks md-bayes (); md-bayes2 () |
| | | Maximum Entropy mdc (); mdc-indep () |
| LDL | Multiclass reduction ldl () | k-NN ldl () |
| | | Neural networks ldl () |
| LR | Pairwise preferences lrankpairwise () | Boosting lr-boost () |
| | Constraint classification lr-constraint () | SVM lranksurvey () |
| | | Perceptron lr-online () |
| MTR | ML mtrviaml () | Generalizations mtr-rank (); mtr-canon () |
| | Support vectors mtr-lssvr () | Support vector regression mtr-svr1 (); mtr-svr2 () |
| | | Kernel-based mtr-kern1 (); mtr-kern2 () |
| | | Regression trees mtr-trees () |
| | | Random forests mtr-rf () |
| OR | Ordered partitions ord-simple () | Neural networks or-nn (); or-nn2 () |
| | One-vs-next, One-vs-followers, One-vs-previous ord-survey () | Extreme learning machines or-elm (); or-elm2 () |
| | 3-class problems ord-survey () | Decision trees or-knn-dt () |
| | | Gaussian processes or-gp () |
| | | AdaBoost or-ada () |
| MC | Reduction to IR monotonicity () | k-NN mc-knn () |
| | | Decision trees mc-trees () |
| | | Decision rules mc-rules (); mc-rules2 () |
| | | Neural networks mc-monnets () |
5 Applications. Original real-world scenarios
The problems studied in this work have their origins in real-world scenarios, which are described below:
Problems modeled under MI learning are drug activity prediction mi-apr (), where each pattern describes a molecule and its different forms are represented by instances; image classification mic-taxonomy (), and bankruptcy prediction ami-bank (). Most of the datasets used in experiments, however, are synthetic.
Some situations where data is described in multiple views are multilingual text categorization amv-multilingual (), face detection with several poses amv-face (), user localization in a WiFi network amv-wifi (), advertisements described by their image and surrounding text amv-ads-webkb () and image classification with several color-based views and texture-based views amv-corel ().
Applications of MD classification include classification of biomedical text mdc (), where predicted dimensions for a given document are its focus, evidence type, certainty level, polarity and trend; gene function identification md-bayes (); tumor classification, and illness diagnosis in animals md-bayes2 ().
Applications modeled as MTR problems are diverse, including modeling of vegetation condition in ecosystems by assigning several scores which depend on the vegetation type amtr-eco (), prediction of audio spectra of wind tunnel tests amtr-wind (), and estimation of several biophysical parameters from remote sensing images amtr-remote ().
Monotonicity constraints are found in problems related to customer satisfaction analysis amc-customer (), in which overall appreciation of a product must increase along with the evaluation of its features; house pricing mc-trees (); bankruptcy risk evaluation amc-bank (), and cancer prediction amc-cancer (), among others.
6 Other nonstandard variations
This section covers variations of the standard supervised problem which lie further from the central focus of this paper and are less related to those above.
6.1 Learning with partial information
In a standard supervised classification setting, it is assumed that every training example is labeled accordingly and that there exist examples for every class that may appear in the testing phase. When only a fraction of the training instances are labeled, the problem is considered semi-supervised semi-sup (), but generally there still exist labeled samples for each class.
In positive-unlabeled learning pu-learn (); pu-text (), however, the labeled examples provided within the training set are only positive. This means the learning algorithm only knows the class of the labeled positive instances, while unlabeled ones may belong to either class.
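One common way to approach this setting can be sketched as follows, in the spirit of the rescaling idea of Elkan and Noto (listed in the references). The sketch assumes that labeled positives are selected completely at random among all positives and that the classes do not overlap; it also uses a single discrete feature, a frequency-based estimator and the training set itself in place of a held-out set, all of which are simplifications chosen for illustration. All function names are hypothetical.

```python
from collections import defaultdict

def fit_labeling_rate(xs, s):
    """Estimate g(x) = P(s=1 | x) for each discrete value x, where s=1 marks a labeled example."""
    labeled = defaultdict(int)
    total = defaultdict(int)
    for x, flag in zip(xs, s):
        total[x] += 1
        labeled[x] += flag
    return {x: labeled[x] / total[x] for x in total}

def estimate_label_frequency(g, xs, s):
    """Average g over the labeled positives: an estimate of c = P(s=1 | y=1)."""
    scores = [g[x] for x, flag in zip(xs, s) if flag == 1]
    return sum(scores) / len(scores)

def positive_probability(g, c, x):
    """Rescale the labeling rate into an estimate of P(y=1 | x), clipped to at most 1."""
    return min(1.0, g[x] / c)

# Toy data: value 0 holds 30 negatives; values 1 and 2 hold 10 and 20 positives,
# of which only 4 and 8 are labeled (s=1). Every other example is unlabeled (s=0).
xs = [0] * 30 + [1] * 10 + [2] * 20
s = [0] * 30 + [1] * 4 + [0] * 6 + [1] * 8 + [0] * 12

g = fit_labeling_rate(xs, s)             # how often each value of x is labeled
c = estimate_label_frequency(g, xs, s)   # fraction of positives that get a label
probs = {x: positive_probability(g, c, x) for x in sorted(g)}
```

In this toy example only 40% of the positives are labeled, yet rescaling by the estimated label frequency recovers a positive-class probability close to 1 for the values of the feature that contain positives and 0 for the value that contains none.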
A different scenario arises when the training set consists only of negative (or only of positive) instances, and no unlabeled examples are provided. This is known as one-class classification oneclass (), and data of this nature can be obtained from outlier detection applications, where positive examples are rarely recorded.
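A very simple one-class model can be sketched as a centroid with a distance threshold, both fitted from examples of the single available class. This is only an illustrative baseline under the assumption that the known class forms one compact cluster; it is not the method of the cited work, and the function names and the quantile parameter are hypothetical.

```python
import math

def fit_one_class(train, quantile=0.95):
    """Fit a centroid and a distance threshold using examples of a single class."""
    dim = len(train[0])
    centroid = [sum(p[i] for p in train) / len(train) for i in range(dim)]
    dists = sorted(math.dist(p, centroid) for p in train)
    # Radius: the training distance found at position `quantile` of the sorted list.
    idx = min(len(dists) - 1, int(quantile * len(dists)))
    return centroid, dists[idx]

def is_inlier(model, point):
    """A point is accepted as a member of the known class if it lies within the radius."""
    centroid, radius = model
    return math.dist(point, centroid) <= radius

# Toy training set: only examples of the known class are available.
train = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
model = fit_one_class(train)
```

With this model, a point near the centre of the training data such as (0.6, 0.4) is accepted, while a distant point such as (5.0, 5.0) is flagged as an outlier.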
A problem which may be seen as a generalization of one-class classification is zero-shot learning zeroshot (), a situation where unseen classes are to be predicted in the testing stage. That is, the label space includes some values which are not present in any training pattern, but the classifier must be able to predict them. For example, if in a speech recognition problem the label space is the set of all words in English, the training set is unlikely to contain at least one instance for each word, so the classifier will only succeed if it is capable of assigning words unseen during training to test examples.
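One way to make such predictions possible is to describe every class, seen or unseen, by a vector of semantic attributes, learn to predict attributes from inputs, and then choose the class whose description is closest. The sketch below follows that idea under strong simplifying assumptions: binary attributes, one input feature acting as a cue for each attribute, and a midpoint threshold as the attribute predictor. The class names, attribute codes and function names are hypothetical.

```python
def fit_attribute_thresholds(X, labels, codes):
    """Learn one threshold per attribute: the midpoint between the mean feature
    value of examples having the attribute and of those lacking it."""
    n_attr = len(next(iter(codes.values())))
    thresholds = []
    for j in range(n_attr):
        on = [x[j] for x, y in zip(X, labels) if codes[y][j] == 1]
        off = [x[j] for x, y in zip(X, labels) if codes[y][j] == 0]
        thresholds.append((sum(on) / len(on) + sum(off) / len(off)) / 2)
    return thresholds

def predict_class(x, thresholds, codes):
    """Predict attributes, then return the class whose code is closest (Hamming distance)."""
    attrs = [1 if x[j] > t else 0 for j, t in enumerate(thresholds)]
    return min(codes, key=lambda c: sum(a != b for a, b in zip(attrs, codes[c])))

# Hypothetical semantic codes: (is_furry, lives_in_water).
codes = {"dog": (1, 0), "fish": (0, 1), "otter": (1, 1)}

# The training data only covers "dog" and "fish"; feature j is a cue for attribute j.
X_train = [(0.9, 0.1), (0.8, 0.2), (0.1, 0.9), (0.2, 0.8)]
y_train = ["dog", "dog", "fish", "fish"]

thresholds = fit_attribute_thresholds(X_train, y_train, codes)
unseen_prediction = predict_class((0.85, 0.9), thresholds, codes)
```

Here the test input shows both attributes, so the closest code belongs to "otter", a class for which no training example was available.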
A relaxation on the obstacles of zero-shot learning is present in one-shot learning oneshot (), where algorithms attempt to generalize from very few (1 to 5) examples of each class. This is a common circumstance in the field of image classification, where the cost of collecting and labeling data samples is high.
A classification of these problems according to the type of missing information can be found in Table 3.
|Type of missing information||Problems|
|Presence of unlabeled instances||Semi-supervised semi-sup (); positive-unlabeled pu-learn ()|
|No representation of some classes||One-class oneclass (); positive-unlabeled pu-learn (); zero-shot zeroshot ()|
|Scarce representation of some classes||One-shot oneshot ()|
6.2 Prediction of structured data
The nonstandard variations described in this work generalize traditional supervised problems where the predicted output is at most a vector whose components take values in either a finite set or the real numbers. Further generalizations are possible if other kinds of structures are allowed. For example, the target may take the form of an ordered sequence or a tree. In this case, the problem usually enters the scope of structured prediction str-pred (), a generalization of supervised learning where methods must build structured data associated with input instances.
A particular case of supervised problem which can be seen under the umbrella of structured prediction is learning to rank ltr (), which does not involve a label space as such. Instead, training consists in learning from a set of feature vectors with a series of preferences among them, that is, a partial or total order in the training set. During testing a set of feature vectors is provided and the desired output is a ranking (with a predefined number of relevance levels, allowing ties) or a sorting (simply an ordering of the instances). This problem differs from OR in that individual classifications are usually meaningless: only relative distances among ranked instances matter.
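A pairwise approach to this problem can be sketched as follows: a linear scoring function is adjusted with perceptron-style updates until every preferred item scores higher than the item it is preferred over, and test items are then sorted by score. This is a minimal sketch assuming the preferences can be satisfied by a linear scorer; the function names, learning rate and toy data are hypothetical and are not taken from the cited work.

```python
def score(w, x):
    """Linear score of a feature vector."""
    return sum(wi * xi for wi, xi in zip(w, x))

def train_pairwise_ranker(preferences, n_features, epochs=100, lr=0.1):
    """Learn weights w so that score(better) > score(worse) for every preference pair."""
    w = [0.0] * n_features
    for _ in range(epochs):
        updated = False
        for better, worse in preferences:
            diff = [a - b for a, b in zip(better, worse)]
            if score(w, diff) <= 0:  # preference not yet satisfied
                w = [wi + lr * di for wi, di in zip(w, diff)]
                updated = True
        if not updated:  # every pair is ordered correctly
            break
    return w

def rank(items, w):
    """Sort items from highest to lowest score."""
    return sorted(items, key=lambda x: score(w, x), reverse=True)

# Training preferences: each pair is (preferred item, less preferred item).
preferences = [((3, 1), (1, 1)), ((2, 0), (1, 2)), ((3, 2), (2, 2))]
w = train_pairwise_ranker(preferences, n_features=2)

# At test time a set of feature vectors is given and an ordering is returned.
ordering = rank([(1, 5), (4, 0), (2, 3)], w)
```

Note that the individual scores carry no meaning as labels; only the order they induce among the test items is used.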
7 Conclusions
Traditional supervised learning comprises two well known problems in machine learning: classification and regression. However, the multitude of applications which do not strictly fit the structure of the standard versions of those problems has favored the development of alternative versions which are more flexible and allow the analysis of more complex situations.
In this work an overview of nonstandard variations of supervised learning problems has been presented. A novel taxonomy under several criteria has described relationships among these variations, where the main differentiating properties are multiplicity of inputs, multiplicity of outputs, presence of order relations and constraints, and partial information. Afterwards, common methods for tackling these problems have been outlined and their main applications have been mentioned. Finally, some additional variants which were left out of the scope of the previous analysis have been introduced.
The design of novel algorithms for nonstandard supervised tasks is scarcer than adaptations and transformations, but some approaches exist, and many possibilities remain open for tackling these tasks from classical algorithmic perspectives, such as probabilistic and heuristic methods, information theory and linear algebra, among others.
Acknowledgements. D. Charte is supported by the Spanish Ministry of Science, Innovation and Universities under the FPU National Program (Ref. FPU17/04069). This work has been partially supported by projects TIN2017-89517-P (FEDER Funds) of the Spanish Ministry of Economy and Competitiveness and TIN2015-68454-R of the Spanish Ministry of Science, Innovation and Universities.
- (1) Alvarez, M.A., Rosasco, L., Lawrence, N.D.: Kernels for vector-valued functions: A review. In: Foundations and Trends in Machine Learning. Now Publishers (2012). DOI 10.1561/2200000036
- (2) Amini, M., Usunier, N., Goutte, C.: Learning from multiple partially observed views - an application to multilingual text categorization. In: Advances in neural information processing systems, pp. 28–36 (2009)
- (3) Amores, J.: Multiple instance classification: Review, taxonomy and comparative study. Artificial Intelligence 201, 81–105 (2013). DOI 10.1016/j.artint.2013.06.003
- (4) Andrews, S., Tsochantaridis, I., Hofmann, T.: Support vector machines for multiple-instance learning. In: Advances in neural information processing systems, pp. 577–584 (2003)
- (5) Baccianella, S., Esuli, A., Sebastiani, F.: Feature selection for ordinal text classification. Neural computation 26(3), 557–591 (2014). DOI 10.1162/NECO_a_00558
- (6) Barlow, R.E.: Statistical inference under order restrictions; the theory and application of isotonic regression. Wiley (1972)
- (7) Bender, R., Grouven, U.: Ordinal logistic regression in medical research. Journal of the Royal College of physicians of London 31(5), 546–551 (1997)
- (8) Bielza, C., Li, G., Larranaga, P.: Multi-dimensional classification with bayesian networks. International Journal of Approximate Reasoning 52(6), 705–727 (2011)
- (9) Błaszczyński, J., Słowiński, R., Szelag, M.: Sequential covering rule induction algorithm for variable consistency rough set approaches. Information Sciences 181(5), 987–1002 (2011). DOI 10.1016/j.ins.2010.10.030
- (10) Bolón-Canedo, V., Sánchez-Maroño, N., Alonso-Betanzos, A.: Feature Selection for High-Dimensional Data. Springer International Publishing, Cham (2015). DOI 10.1007/978-3-319-21858-8. URL https://doi.org/10.1007/978-3-319-21858-8
- (11) Borchani, H., Varando, G., Bielza, C., Larrañaga, P.: A survey on multi-output regression. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 5(5), 216–233 (2015). DOI 10.1002/widm.1157
- (12) Boutell, M., Luo, J., Shen, X., Brown, C.: Learning multi-label scene classification. Pattern Recognition 37(9), 1757–1771 (2004). DOI 10.1016/j.patcog.2004.03.009
- (13) Burges, C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamilton, N., Hullender, G.: Learning to rank using gradient descent. In: Proceedings of the 22nd international conference on Machine learning, pp. 89–96. ACM (2005). DOI 10.1145/1102351.1102363
- (14) Cardoso, J.S., Sousa, R.: Classification models with global constraints for ordinal data. In: 2010 Ninth International Conference on Machine Learning and Applications, pp. 71–77. IEEE (2010). DOI 10.1109/ICMLA.2010.18
- (15) Chang, K.Y., Chen, C.S., Hung, Y.P.: Ordinal hyperplanes ranker with cost sensitivities for age estimation. In: Computer vision and pattern recognition (cvpr), 2011 ieee conference on, pp. 585–592. IEEE (2011). DOI 10.1109/CVPR.2011.5995437
- (16) Chapelle, O., Schölkopf, B., Zien, A.: Semi-Supervised Learning, 1st edn. The MIT Press (2010)
- (17) Charte, F., Rivera, A.J., del Jesus, M.J., Herrera, F.: Quinta: A question tagging assistant to improve the answering ratio in electronic forums. In: EUROCON 2015 - International Conference on Computer as a Tool (EUROCON), IEEE, pp. 1–6 (2015). DOI 10.1109/EUROCON.2015.7313677
- (18) Charte, F., Rivera, A.J., del Jesus, M.J., Herrera, F.: Dealing with difficult minority labels in imbalanced mutilabel data sets. Neurocomputing (2017). DOI 10.1016/j.neucom.2016.08.158
- (19) Chaudhuri, K., Kakade, S.M., Livescu, K., Sridharan, K.: Multi-view clustering via canonical correlation analysis. In: Proceedings of the 26th annual international conference on machine learning, pp. 129–136. ACM (2009). DOI 10.1145/1553374.1553391
- (20) Chen, Q., Sun, S.: Hierarchical multi-view fisher discriminant analysis. In: International Conference on Neural Information Processing, pp. 289–298. Springer (2009). DOI 10.1007/978-3-642-10684-2_32
- (21) Cheng, J., Wang, Z., Pollastri, G.: A neural network approach to ordinal regression. In: Neural Networks, 2008. IJCNN 2008.(IEEE World Congress on Computational Intelligence). IEEE International Joint Conference on, pp. 1279–1284. IEEE (2008). DOI 10.1109/IJCNN.2008.4633963
- (22) Cheng, W., Hüllermeier, E., Dembczynski, K.J.: Graded multilabel classification: The ordinal case. In: Proceedings of the 27th international conference on machine learning (ICML-10), pp. 223–230 (2010)
- (23) Chu, W., Ghahramani, Z.: Gaussian processes for ordinal regression. Journal of machine learning research 6(Jul), 1019–1041 (2005)
- (24) Clare, A., King, R.D.: Knowledge discovery in multi-label phenotype data. In: European Conference on Principles of Data Mining and Knowledge Discovery, pp. 42–53. Springer (2001). DOI 10.1007/3-540-44794-6_4
- (25) Costa, M.: Probabilistic interpretation of feedforward network outputs, with relationships to statistical prediction of ordinal quantities. International journal of neural systems 7(05), 627–637 (1996). DOI 10.1142/S0129065796000610
- (26) De Waal, P.R., Van Der Gaag, L.C.: Inference and learning in multi-dimensional bayesian network classifiers. In: European Conference on Symbolic and Quantitative Approaches to Reasoning and Uncertainty, pp. 501–511. Springer (2007). DOI 10.1007/978-3-540-75256-1_45
- (27) De’Ath, G.: Multivariate regression trees: a new technique for modeling species–environment relationships. Ecology 83(4), 1105–1117 (2002). DOI 10.1890/0012-9658(2002)083[1105:MRTANT]2.0.CO;2
- (28) Dekel, O., Singer, Y., Manning, C.D.: Log-linear models for label ranking. In: Advances in neural information processing systems, pp. 497–504 (2004)
- (29) Dembczyński, K., Kotłowski, W., Słowiński, R.: Ensemble of decision rules for ordinal classification with monotonicity constraints. In: International Conference on Rough Sets and Knowledge Technology, pp. 260–267. Springer (2008). DOI 10.1007/978-3-540-79721-0_38
- (30) Deng, W.Y., Zheng, Q.H., Lian, S., Chen, L., Wang, X.: Ordinal extreme learning machine. Neurocomputing 74(1-3), 447–456 (2010). DOI 10.1016/j.neucom.2010.08.022
- (31) Dietterich, T.G., Lathrop, R.H., Lozano-Pérez, T.: Solving the multiple instance problem with axis-parallel rectangles. Artificial intelligence 89(1-2), 31–71 (1997). DOI 10.1016/S0004-3702(96)00034-3
- (32) Diplaris, S., Tsoumakas, G., Mitkas, P., Vlahavas, I.: Protein classification with multiple algorithms. In: Proc. 10th Panhellenic Conference on Informatics, Volos, Greece, PCI05, pp. 448–456 (2005). DOI 10.1007/11573036_42
- (33) Duda, R.O., Hart, P.E., Stork, D.G.: Pattern classification. John Wiley & Sons (2012)
- (34) Duivesteijn, W., Feelders, A.: Nearest neighbour classification with monotonicity constraints. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 301–316. Springer (2008). DOI 10.1007/978-3-540-87479-9_38
- (35) Eisen, M.B., Spellman, P.T., Brown, P.O., Botstein, D.: Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences 95(25), 14863–14868 (1998)
- (36) Elisseeff, A., Weston, J.: A kernel method for multi-labelled classification. In: Advances in neural information processing systems, pp. 681–687 (2002)
- (37) Elkan, C., Noto, K.: Learning classifiers from only positive and unlabeled data. In: Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 213–220. ACM (2008). DOI 10.1145/1401890.1401920
- (38) Farquhar, J., Hardoon, D., Meng, H., Shawe-taylor, J.S., Szedmak, S.: Two view learning: Svm-2k, theory and practice. In: Advances in neural information processing systems, pp. 355–362 (2006)
- (39) Fei-Fei, L., et al.: A bayesian approach to unsupervised one-shot learning of object categories. In: Computer Vision, 2003. Proceedings. Ninth IEEE International Conference on, pp. 1134–1141. IEEE (2003). DOI 10.1109/ICCV.2003.1238476
- (40) Fernández, A., García, S., Galar, M., Prati, R.C., Krawczyk, B., Herrera, F.: Learning from Imbalanced Data Sets. Springer International Publishing (2018). DOI 10.1007/978-3-319-98074-4
- (41) Foulds, J., Frank, E.: A review of multi-instance learning assumptions. The Knowledge Engineering Review 25(1), 1–25 (2010). DOI 10.1017/S026988890999035X
- (42) Frank, E., Hall, M.: A simple approach to ordinal classification. In: European Conference on Machine Learning, pp. 145–156. Springer (2001). DOI 10.1007/3-540-44795-4_13
- (43) Fukunaga, K.: Introduction to statistical pattern recognition. Elsevier (2013)
- (44) Fürnkranz, J., Hüllermeier, E., Mencía, E.L., Brinker, K.: Multilabel classification via calibrated label ranking. Machine learning 73(2), 133–153 (2008). DOI 10.1007/s10994-008-5064-8
- (45) Fürnkranz, J., Hüllermeier, E., Mencía, E.L., Brinker, K.: Multilabel classification via calibrated label ranking. Machine learning 73(2), 133–153 (2008). DOI 10.1007/s10994-008-5064-8
- (46) Gama, J.: Knowledge discovery from data streams. Chapman and Hall/CRC (2010)
- (47) Geng, X.: Label distribution learning. IEEE Transactions on Knowledge and Data Engineering 28(7), 1734–1748 (2016). DOI 10.1109/TKDE.2016.2545658
- (48) Gibaja, E., Ventura, S.: A tutorial on multilabel learning. ACM Computing Surveys (CSUR) 47(3), 52 (2015). DOI 10.1145/2716262
- (49) Greco, S., Matarazzo, B., Slowinski, R.: A new rough set approach to evaluation of bankruptcy risk. In: Operational tools in the management of financial risks, pp. 121–136. Springer (1998). DOI 10.1007/978-1-4615-5495-0_8
- (50) Greco, S., Matarazzo, B., Słowiński, R.: Rough set approach to customer satisfaction analysis. In: International Conference on Rough Sets and Current Trends in Computing, pp. 284–295. Springer (2006). DOI 10.1007/11908029_31
- (51) Gutiérrez, P.A., García, S.: Current prospects on ordinal and monotonic classification. Progress in Artificial Intelligence 5(3), 171–179 (2016). DOI 10.1007/s13748-016-0088-y. URL https://doi.org/10.1007/s13748-016-0088-y
- (52) Gutiérrez, P.A., Pérez-Ortiz, M., Sánchez-Monedero, J., Fernández-Navarro, F., Hervás-Martínez, C.: Ordinal regression methods: Survey and experimental study. IEEE Transactions on Knowledge and Data Engineering 28(1), 127–146 (2016). DOI 10.1109/TKDE.2015.2457911
- (53) Har-Peled, S., Roth, D., Zimak, D.: Constraint classification for multiclass classification and ranking. In: Advances in neural information processing systems, pp. 809–816 (2003)
- (54) Hernández-González, J., Inza, I., Lozano, J.A.: Weak supervision and other non-standard classification problems: A taxonomy. Pattern Recognition Letters 69, 49–55 (2016). DOI 10.1016/j.patrec.2015.10.008
- (55) Herrera, F., Charte, F., Rivera, A.J., Del Jesus, M.J.: Multilabel classification. Springer (2016)
- (56) Herrera, F., Ventura, S., Bello, R., Cornelis, C., Zafra, A., Sánchez-Tarragó, D., Vluymans, S.: Multiple instance learning: foundations and algorithms. Springer (2016). DOI 10.1007/978-3-319-47759-6
- (57) Hüllermeier, E., Fürnkranz, J., Cheng, W., Brinker, K.: Label ranking by learning pairwise preferences. Artificial Intelligence 172(16-17), 1897–1916 (2008). DOI 10.1016/j.artint.2008.08.002
- (58) Hyndman, R.J., Athanasopoulos, G.: Forecasting: principles and practice. OTexts (2018)
- (59) Izenman, A.J.: Reduced-rank regression for the multivariate linear model. Journal of multivariate analysis 5(2), 248–264 (1975). DOI 10.1016/0047-259X(75)90042-1
- (60) Jain, A.K., Duin, R.P., Mao, J.: Statistical pattern recognition: A review. IEEE Transactions on pattern analysis and machine intelligence 22(1), 4–37 (2000)
- (61) James, G., Witten, D., Hastie, T., Tibshirani, R.: An Introduction to Statistical Learning: with Applications in R. Springer New York, New York, NY (2013). DOI 10.1007/978-1-4614-7138-7
- (62) Katakis, I., Tsoumakas, G., Vlahavas, I.: Multilabel text classification for automated tag suggestion. In: Proc. ECML PKDD08 Discovery Challenge, Antwerp, Belgium, pp. 75–83 (2008)
- (63) Kocev, D., Džeroski, S., White, M.D., Newell, G.R., Griffioen, P.: Using single-and multi-target regression trees and ensembles to model a compound index of vegetation condition. Ecological Modelling 220(8), 1159–1168 (2009). DOI 10.1016/j.ecolmodel.2009.01.037
- (64) Kocev, D., Vens, C., Struyf, J., Džeroski, S.: Tree ensembles for predicting structured outputs. Pattern Recognition 46(3), 817–833 (2013). DOI 10.1016/j.patcog.2012.09.023
- (65) Kotlowski, W., Slowinski, R.: On nonparametric ordinal classification with monotonicity constraints. IEEE Transactions on Knowledge and Data Engineering 25(11), 2576–2589 (2013). DOI 10.1109/TKDE.2012.204
- (66) Kotsiantis, S., Kanellopoulos, D., Tampakas, V.: Financial application of multi-instance learning: two greek case studies. Journal of Convergence Information Technology 5(8), 42–53 (2010)
- (67) Krawczyk, B.: Learning from imbalanced data: open challenges and future directions. Progress in Artificial Intelligence 5(4), 221–232 (2016). DOI 10.1007/s13748-016-0094-0. URL https://doi.org/10.1007/s13748-016-0094-0
- (68) Kumar, A., Rai, P., Daume, H.: Co-regularized multi-view spectral clustering. In: Advances in neural information processing systems, pp. 1413–1421 (2011)
- (69) Kuznar, D., Mozina, M., Bratko, I.: Curve prediction with kernel regression. In: Proceedings of the 1st Workshop on Learning from Multi-Label Data, pp. 61–68 (2009)
- (70) Kwon, Y.S., Han, I., Lee, K.C.: Ordinal pairwise partitioning (opp) approach to neural networks training in bond rating. Intelligent Systems in Accounting, Finance & Management 6(1), 23–40 (1997). DOI 10.1002/(SICI)1099-1174(199703)6:1<23::AID-ISAF113>3.0.CO;2-4
- (71) Laghmari, K., Marsala, C., Ramdani, M.: An adapted incremental graded multi-label classification model for recommendation systems. Progress in Artificial Intelligence 7(1), 15–29 (2018). DOI 10.1007/s13748-017-0133-5
- (72) Li, S.Z., Zhu, L., Zhang, Z., Blake, A., Zhang, H., Shum, H.: Statistical learning of multi-view face detection. In: European Conference on Computer Vision, pp. 67–81. Springer (2002). DOI 10.1007/3-540-47979-1_5
- (73) Lin, H.T., Li, L.: Combining ordinal preferences by boosting. In: Proceedings ECML/PKDD 2009 Workshop on Preference Learning, pp. 69–83 (2009)
- (74) Liu, B., Dai, Y., Li, X., Lee, W.S., Yu, P.S.: Building text classifiers using positive and unlabeled examples. In: Data Mining, 2003. ICDM 2003. Third IEEE International Conference on, pp. 179–186. IEEE (2003). DOI 10.1109/ICDM.2003.1250918
- (75) López-Cruz, P.L., Bielza, C., Larrañaga, P.: Learning conditional linear gaussian classifiers with probabilistic class labels. In: Conference of the Spanish Association for Artificial Intelligence, pp. 139–148. Springer (2013). DOI 10.1007/978-3-642-40643-0_15
- (76) Lyons, M., Akamatsu, S., Kamachi, M., Gyoba, J.: Coding facial expressions with gabor wavelets. In: Automatic Face and Gesture Recognition, 1998. Proceedings. Third IEEE International Conference on, pp. 200–205. IEEE (1998). DOI 10.1109/AFGR.1998.670949
- (77) Maron, O., Lozano-Pérez, T.: A framework for multiple-instance learning. In: Advances in neural information processing systems, pp. 570–576 (1998)
- (78) Marsland, S.: Machine Learning: An Algorithmic Perspective. Chapman & Hall (2014)
- (79) Micchelli, C.A., Pontil, M.: On learning vector-valued functions. Neural computation 17(1), 177–204 (2005). DOI 10.1162/0899766052530802
- (80) Mitchell, T.M.: Machine learning. McGraw Hill series in computer science. McGraw-Hill (1997)
- (81) Moya, M.M., Koch, M.W., Hostetler, L.D.: One-class classifier networks for target recognition applications. NASA STI/Recon Technical Report N 93 (1993)
- (82) Moyano, J.M., Gibaja, E.L., Cios, K.J., Ventura, S.: Review of ensembles of multi-label classifiers: Models, experimental study and prospects. Information Fusion 44, 33–45 (2018). DOI 10.1016/j.inffus.2017.12.001
- (83) Murphy, K.P.: Machine Learning: A Probabilistic Perspective. The MIT Press (2012)
- (84) Nguyen, C.T., Wang, X., Liu, J., Zhou, Z.H.: Labeling complicated objects: Multi-view multi-instance multi-label learning. In: AAAI, pp. 2013–2019 (2014)
- (85) Nilsson, N.J.: Learning machines: foundations of trainable pattern-classifying systems. McGraw-Hill (1965)
- (86) Palatucci, M., Pomerleau, D., Hinton, G.E., Mitchell, T.M.: Zero-shot learning with semantic output codes. In: Advances in neural information processing systems, pp. 1410–1418 (2009)
- (87) Pan, F.: Multi-dimensional fragment classification in biomedical text. Queen’s University (2006)
- (88) Pan, S.J., Kwok, J.T., Yang, Q., Pan, J.J.: Adaptive localization in a dynamic wifi environment through multi-view learning. In: AAAI, pp. 1108–1113 (2007)
- (89) Potharst, R., Feelders, A.J.: Classification trees for problems with monotonicity constraints. ACM SIGKDD Explorations Newsletter 4(1), 1–10 (2002). DOI 10.1145/568574.568577
- (90) Ramon, J., De Raedt, L.: Multi instance neural networks. In: Proceedings of the ICML-2000 workshop on attribute-value and relational learning, pp. 53–60 (2000)
- (91) Read, J., Pfahringer, B., Holmes, G., Frank, E.: Classifier chains for multi-label classification. Machine learning 85(3), 333 (2011). DOI 10.1007/s10994-011-5256-5
- (92) Ryu, Y.U., Chandrasekaran, R., Jacob, V.S.: Breast cancer prediction using the isotonic separation technique. European Journal of Operational Research 181(2), 842–854 (2007). DOI 10.1016/j.ejor.2006.06.031
- (93) Sánchez-Fernández, M., de Prado-Cumplido, M., Arenas-García, J., Pérez-Cruz, F.: Svm multiregression for nonlinear channel estimation in multiple-input multiple-output systems. IEEE transactions on signal processing 52(8), 2298–2307 (2004). DOI 10.1109/TSP.2004.831028
- (94) Sánchez-Monedero, J., Gutiérrez, P.A., Hervás-Martínez, C.: Evolutionary ordinal extreme learning machine. In: International Conference on Hybrid Artificial Intelligence Systems, pp. 500–509. Springer (2013). DOI 10.1007/978-3-642-40846-5_50
- (95) Shalev-Shwartz, S., Singer, Y.: A unified algorithmic approach for efficient online label ranking. In: Artificial Intelligence and Statistics, pp. 452–459 (2007)
- (96) Shatkay, H., Pan, F., Rzhetsky, A., Wilbur, W.J.: Multi-dimensional classification of biomedical text: Toward automated, practical provision of high-utility text to diverse users. Bioinformatics 24(18), 2086–2093 (2008). DOI 10.1093/bioinformatics/btn381
- (97) Sill, J.: Monotonic networks. In: Advances in neural information processing systems, pp. 661–667 (1998)
- (98) Silva, J.A., Faria, E.R., Barros, R.C., Hruschka, E.R., De Carvalho, A.C., Gama, J.: Data stream clustering: A survey. ACM Computing Surveys (CSUR) 46(1), 13 (2013)
- (99) Smola, A.J., Schölkopf, B.: On a kernel-based method for pattern recognition, regression, approximation, and operator inversion. Algorithmica 22(1-2), 211–231 (1998)
- (100) Sousa, R., Gama, J.: Multi-label classification from high-speed data streams with adaptive model rules and random rules. Progress in Artificial Intelligence 7(3), 177–187 (2018). DOI 10.1007/s13748-018-0142-z
- (101) Spyromitros-Xioufis, E., Tsoumakas, G., Groves, W., Vlahavas, I.: Multi-label classification methods for multi-target regression. arXiv preprint arXiv 1211 (2012)
- (102) Sun, S., Chao, G.: Multi-view maximum entropy discrimination. In: IJCAI, pp. 1706–1712 (2013)
- (103) Surdeanu, M., Tibshirani, J., Nallapati, R., Manning, C.D.: Multi-instance multi-label learning for relation extraction. In: Proceedings of the 2012 joint conference on empirical methods in natural language processing and computational natural language learning, pp. 455–465. Association for Computational Linguistics (2012)
- (104) Taskar, B., Chatalbashev, V., Koller, D., Guestrin, C.: Learning structured prediction models: A large margin approach. In: Proceedings of the 22nd international conference on Machine learning, pp. 896–903. ACM (2005). DOI 10.1145/1102351.1102464
- (105) Tax, D.M., Duin, R.P.: Using two-class classifiers for multiclass classification. In: Pattern Recognition, 2002. Proceedings. 16th International Conference on, vol. 2, pp. 124–127. IEEE (2002)
- (106) Thabtah, F.A., Cowling, P., Peng, Y.: Mmac: A new multi-class, multi-label associative classification approach. In: Data Mining, 2004. ICDM’04. Fourth IEEE International Conference on, pp. 217–224. IEEE (2004). DOI 10.1109/ICDM.2004.10117
- (107) Tian, Q., Chen, S., Tan, X.: Comparative study among three strategies of incorporating spatial structures to ordinal image regression. Neurocomputing 136, 152–161 (2014). DOI 10.1016/j.neucom.2014.01.017
- (108) Tsoumakas, G., Vlahavas, I.: Random k-labelsets: An ensemble method for multilabel classification. In: European conference on machine learning, pp. 406–417. Springer (2007). DOI 10.1007/978-3-540-74958-5_38
- (109) Tuia, D., Verrelst, J., Alonso, L., Pérez-Cruz, F., Camps-Valls, G.: Multioutput support vector regression for remote sensing biophysical parameter estimation. IEEE Geoscience and Remote Sensing Letters 8(4), 804–808 (2011). DOI 10.1109/LGRS.2011.2109934
- (110) Tzortzis, G., Likas, A.: Kernel-based weighted multi-view clustering. In: Data Mining (ICDM), 2012 IEEE 12th International Conference on, pp. 675–684. IEEE (2012). DOI 10.1109/ICDM.2012.43
- (111) Van Der Merwe, A., Zidek, J.: Multivariate regression analysis and canonical variates. Canadian Journal of Statistics 8(1), 27–39 (1980). DOI 10.2307/3314667
- (112) Vazquez, E., Walter, E.: Multi-output support vector regression. In: 13th IFAC Symposium on System Identification, pp. 1820–1825. Citeseer (2003)
- (113) Vembu, S., Gärtner, T.: Label ranking algorithms: A survey. In: Preference learning, pp. 45–64. Springer (2010). DOI 10.1007/978-3-642-14125-6_3
- (114) Wang, J., Zucker, J.D.: Solving multiple-instance problem: a lazy learning approach. In: International Conference on Machine Learning, pp. 1119–1126. Morgan Kaufmann Publishers (2000)
- (115) Williams, C.K., Barber, D.: Bayesian classification with gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence 20(12), 1342–1351 (1998)
- (116) Wu, B., Zhong, E., Horner, A., Yang, Q.: Music emotion recognition by multi-label multi-layer multi-instance multi-view learning. In: Proceedings of the 22nd ACM international conference on Multimedia, pp. 117–126. ACM (2014). DOI 10.1145/2647868.2654904
- (117) Zhang, M.L., Zhou, Z.H.: Ml-knn: A lazy learning approach to multi-label learning. Pattern recognition 40(7), 2038–2048 (2007). DOI 10.1016/j.patcog.2006.12.019
- (118) Zhang, M.L., Zhou, Z.H.: A review on multi-label learning algorithms. IEEE transactions on knowledge and data engineering 26(8), 1819–1837 (2014). DOI 10.1109/TKDE.2013.39
- (119) Zhang, W., Liu, X., Ding, Y., Shi, D.: Multi-output ls-svr machine in extended feature space. In: Computational Intelligence for Measurement Systems and Applications (CIMSA), 2012 IEEE International Conference on, pp. 130–134. IEEE (2012). DOI 10.1109/CIMSA.2012.6269600
- (120) Zhao, J., Xie, X., Xu, X., Sun, S.: Multi-view learning overview: Recent progress and new challenges. Information Fusion 38, 43–54 (2017). DOI 10.1016/j.inffus.2017.02.007
- (121) Zhou, Z.H., Sun, Y.Y., Li, Y.F.: Multi-instance learning by treating instances as non-iid samples. In: Proceedings of the 26th annual international conference on machine learning, pp. 1249–1256. ACM (2009). DOI 10.1145/1553374.1553534
- (122) Zhou, Z.H., Zhang, M.L., Huang, S.J., Li, Y.F.: Multi-instance multi-label learning. Artificial Intelligence 176(1), 2291–2320 (2012). DOI 10.1016/j.artint.2011.10.002