# Learning from Imprecise and Fuzzy Observations: Data Disambiguation through Generalized Loss Minimization

Eyke Hüllermeier
Department of Mathematics and Computer Science
University of Marburg, Germany
eyke@mathematik.uni-marburg.de
###### Abstract

Methods for analyzing or learning from “fuzzy data” have attracted increasing attention in recent years. In many cases, however, existing methods (for precise, non-fuzzy data) are extended to the fuzzy case in an ad-hoc manner, and without carefully considering the interpretation of a fuzzy set when being used for modeling data. Distinguishing between an ontic and an epistemic interpretation of fuzzy set-valued data, and focusing on the latter, we argue that a “fuzzification” of learning algorithms based on an application of the generic extension principle is not appropriate. In fact, the extension principle fails to properly exploit the inductive bias underlying statistical and machine learning methods, although this bias, at least in principle, offers a means for “disambiguating” the fuzzy data. Alternatively, we therefore propose a method which is based on the generalization of loss functions in empirical risk minimization, and which performs model identification and data disambiguation simultaneously. Elaborating on the fuzzification of specific types of losses, we establish connections to well-known loss functions in regression and classification. We compare our approach with related methods and illustrate its use in logistic regression for binary classification.
Keywords: Imprecise data, fuzzy sets, machine learning, extension principle, inductive bias, data disambiguation, loss function, risk minimization, logistic regression.

## 1 Introduction

The learning of models from imprecise data, such as interval data or, more generally, data modeled in terms of fuzzy subsets of an underlying reference space, has gained increasing interest in recent years [Sanchez and Couso, 2007, Denoeux, 2011, Denoeux, 2013, Cour et al., 2011, Viertl, 2011]. Indeed, while problems such as fuzzy regression analysis [Diamond, 1988, Diamond and Tanaka, 1998, Tanaka and Guo, 1999, Chang and Ayyub, 2001, Gonzalez-Rodriguez et al., 2009, Ferraro et al., 2010] have already been studied for a long time, the scope is currently broadening, both in terms of the problems tackled (e.g., classification, clustering, ranking) and the uncertainty formalisms used (e.g., probability distributions, histograms, intervals, fuzzy sets, belief functions).

Needless to say, learning from imprecise and uncertain data also requires the extension of corresponding learning algorithms. Unfortunately, this is often done without clarifying the actual meaning of an uncertain observation, although representations such as intervals or fuzzy sets can obviously be interpreted in different ways. In particular, an ontic interpretation of (fuzzy) set-valued data should be carefully distinguished from an epistemic one [Dubois, 2011]. This difference is reflected, for example, in different approaches to fuzzy statistics, where fuzzy random variables can be formalized in an epistemic [Kwakernaak, 1978, Kwakernaak, 1979, Kruse and Meyer, 1987] as well as an ontic way [Puri and Ralescu, 1986]; see [Couso and Dubois, 2009] for a comparison of these views in this context. Surprisingly, however, the fact that these two interpretations also call for very different types of extensions of existing learning algorithms and methods for data analysis seems to be largely ignored in the literature.

Under the ontic view, a variable can assume a fuzzy set as its “true value”; for example, one may argue that assigning a precise value to the variable “daily sunshine duration” is not very meaningful, and that a specification of sunshine durations in terms of intervals or fuzzy sets is more appropriate. This interpretation suggests the learning of models that produce fuzzy sets as predictions, that is to say, models that reproduce the observed data. As opposed to this, a reproduction of the data appears less reasonable under the epistemic view, where fuzzy sets are used to describe, not the data itself, but the uncertain or imprecise knowledge about the data: A fuzzy set defines a possibility distribution that specifies a degree of plausibility for each potential precise value. As we shall explain in more detail later on, one should then rather try to “disambiguate” the data instead of reproducing it.

The possibilistic interpretation of fuzzy sets in the epistemic case, which we focus on in this paper, naturally suggests a “fuzzification” of learning algorithms based on an application of the generic extension principle [Cerny and Rada, 2011, Xiang and Kreinovich, 2013]. As we shall argue, however, this approach is not appropriate and prone to fail in the context of data analysis. The main reason, to be detailed in Section 3, is a lack of differentiation between the possible data instantiations (i.e., the instantiation of each imprecise observation by a precise value). Such a differentiation, however, is typically suggested by the model assumptions through which the learning algorithm justifies its generalization beyond the data observed.

This idea of differentiating between instantiations of the data leads us to the notion of “data disambiguation” that we already mentioned above: When learning from imprecise data under the epistemic view, model identification and data disambiguation should go hand in hand. To this end, we propose an approach based on the generalization of loss functions in empirical risk minimization.

The rest of the paper is organized as follows. In the next section, we introduce the basic setting that we consider and the main notation that we shall use throughout the paper (see Table 1 for a summary). In Section 3, we explain the aforementioned problems caused by the use of the extension principle and elaborate on our idea of data disambiguation. Our new approach to learning from fuzzy data based on generalized loss functions is then introduced in Section 4. Section 5 is devoted to a comparison with an alternative and closely related method that was recently introduced by Denoeux [Denoeux, 2011, Denoeux, 2013]. In Section 6, we illustrate our approach on a concrete learning problem. Finally, we conclude with a summary and some additional remarks in Section 7.

## 2 Notation and Basic Setting

We consider the problem of model induction, which, roughly speaking, consists of passing from a specific data sample to a general (though hypothetical) model describing the data generating process or at least certain properties of this process. In this setting, a learning (data analysis) algorithm is given as input a set

$$\mathcal{D} = \{ z_i \}_{i=1}^N \in \mathcal{Z}^N \tag{1}$$

of data points $z_i \in \mathcal{Z}$. As output, the algorithm produces a model $M \in \mathbf{M}$, where $\mathbf{M}$ is a predefined model class. Formally, the algorithm can hence be seen as a mapping

$$\mathrm{ALG}: \mathbb{D} \rightarrow \mathbf{M}, \tag{2}$$

where $\mathbb{D} = \mathcal{Z}^N$ is the space of potentially observable data samples. For instance, the data points might be vectors in $\mathbb{R}^d$, and the model could be a partitioning of the data into a finite set of disjoint groups (clusters). Or, the model could be a probability density function characterizing the underlying data generating process. In fact, the data points are typically assumed to be independent and identically distributed (i.i.d.) according to an underlying (though unknown) probability distribution. Moreover, the model class is often parameterized, which means that each model $M = M_\theta$ is uniquely identified by a parameter $\theta \in \Theta$ (in other words, there is a bijection between the model space $\mathbf{M}$ and the parameter space $\Theta$).

In supervised learning, the data space is split into an input (instance) space $\mathcal{X}$ and an output space $\mathcal{Y}$, that is, $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$. The interest, then, is to learn a mapping from $\mathcal{X}$ to $\mathcal{Y}$ that models, in one way or the other, the dependence of outputs (responses) on inputs (predictors); correspondingly, the model space $\mathbf{M}$ typically consists of a class of such mappings. To this end, the learning algorithm is given a set

$$\mathcal{D} = \{ (x_i, y_i) \}_{i=1}^N \in (\mathcal{X} \times \mathcal{Y})^N$$

of training examples as input. Important special cases of this setting include classification, where $\mathcal{Y}$ is a finite (usually small) set comprised of $K$ classes $\{ \lambda_1, \ldots, \lambda_K \}$, and regression, where outputs are real numbers ($\mathcal{Y} = \mathbb{R}$).

In this paper, we are interested in the case where observations are imprecise and, therefore, characterized in terms of set-valued or fuzzy set-valued data. Subsequently, we therefore assume that, instead of precise data, the observations are given in the form of a sample of fuzzy data

$$\mathbf{D} = \{ Z_i \}_{i=1}^N \in \mathcal{F}(\mathcal{Z})^N, \tag{3}$$

where $\mathcal{F}(\mathcal{Z})$ is the set of all fuzzy subsets of the underlying data space $\mathcal{Z}$.

We would like to emphasize that, in this setting, a fuzzy set $Z_i$ is supposed to represent information about an observation $z_i$, not about any kind of underlying “true” value or distribution; correspondingly, the specification of $Z_i$ will typically not involve any kind of statistical inference. In particular, our setting is completely coherent with the common statistical view of a data point $z_i$ as the realization of a random variable characterized by a probability distribution, for example a normal distribution with mean $\mu$ and standard deviation $\sigma$. Then, $Z_i$ would represent knowledge about the realization $z_i$ and not about its expectation $\mu$.

## 3 Data Disambiguation

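Given a learning algorithm $\mathrm{ALG}$ for precise data, the most straightforward approach to handling a fuzzy sample (3) is to apply the well-known extension principle [Zadeh, 1975] to the mapping (2). More formally, we define an instantiation of the fuzzy sample (3) as a sample

$$\mathcal{D} = \{ z_i \}_{i=1}^N$$

of precise data points, where $z_i \in \mathcal{Z}$ for all $i \in \{1, \ldots, N\}$. The degree of membership of $\mathcal{D}$ in the fuzzy set of instantiations is given by

$$\min_{i = 1, \ldots, N} \mu_{Z_i}(z_i),$$

with $\mu_{Z_i}(z_i)$ the degree of membership of $z_i$ in $Z_i$. Then, according to the extension principle, the result of applying $\mathrm{ALG}$ to the fuzzy data (3) is a fuzzy set of models in $\mathbf{M}$, with the degree of membership of $M \in \mathbf{M}$ given by

$$\sup_{\mathcal{D} \,:\, \mathrm{ALG}(\mathcal{D}) = M} \; \min_{i = 1, \ldots, N} \mu_{Z_i}(z_i). \tag{4}$$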

We argue, however, that the application of the extension principle is not very meaningful in the context of learning from data. To explain our reservations, let us consider the special case where the imprecise data is set-valued, i.e., the $Z_i$ are sets instead of fuzzy sets; as will be seen, our arguments apply (“level-wise”) to the more general fuzzy case in exactly the same way. If data is set-valued, then the extension principle simply yields a subset of models from $\mathbf{M}$, namely

$$\mathcal{M} = \bigcup_{\mathcal{D} \in \mathrm{INS}(\mathbf{D})} \mathrm{ALG}(\mathcal{D}) \subseteq \mathbf{M}, \tag{5}$$

where $\mathrm{INS}(\mathbf{D})$ is the (crisp) set of instantiations of $\mathbf{D}$.

Now, according to (5), all instantiations are treated as equal, in the sense that each instantiation contributes a possible model and all the models thus produced are seen as equally plausible candidates. While this equal treatment of all instantiations is reasonable in common applications of the extension principle, where the variables of the function to be extended do not interact with each other, it can be questioned in the context of learning from data: A method inducing a model from a set of data always comes with certain model assumptions, and under these assumptions, specific selections may appear more plausible than others! Or, stated differently, the underlying model assumptions introduce an implicit dependency between the data points $z_i$. This dependency, however, is ignored by the extension principle, which simply selects the $z_i$ independently of each other.

This point is best explained by means of a simple example. Consider the problem of learning a regression function from observations of the form $(x_i, y_i) \in \mathbb{R} \times \mathbb{R}$. More specifically, suppose that the observed outputs $y_i$ are imprecise and therefore modeled as intervals $Y_i$ (whereas the inputs $x_i$ are precise). Our learning algorithm assumes a linear dependency (i.e., the model space is given by $\mathbf{M} = \{ x \mapsto \theta_0 + \theta_1 x \mid \theta_0, \theta_1 \in \mathbb{R} \}$) and fits the intercept and the slope of the regression line using the method of least squares.

Figure 1 shows a concrete example with two different instantiations of the same set-valued data and the corresponding regression lines. In this case, the first data/model combination (left picture) is arguably more plausible than the second one (right picture), simply because the first instantiation allows for a much better fit than the second one. In fact, the first instantiation is much more in agreement with the assumption of a linear relationship between inputs and outputs than the second one. Consequently, we argue that the first regression line should be considered as more plausible than the second one, at least in light of our assumption of a linear dependency. According to (5) and the extension principle (4), however, there is no difference between the two models.
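The point can be made concrete with a small numerical sketch. The data below are hypothetical (they are not the data of Figure 1), and each interval is restricted to a coarse grid so that the instantiations can be enumerated. The extension principle (5) collects one least-squares line per instantiation without ranking them, although the residual sum of squares shows that some instantiations agree far better with the linearity assumption than others.

```python
from itertools import product


def least_squares_line(xs, ys):
    """Closed-form simple linear regression; returns (intercept, slope, sse)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return intercept, slope, sse


def grid(lo, hi, k=4):
    """k equally spaced candidate values in [lo, hi] (an illustrative discretization)."""
    return [lo + j * (hi - lo) / (k - 1) for j in range(k)]


# Hypothetical toy data: precise inputs, interval-valued outputs.
xs = [1.0, 2.0, 3.0, 4.0]
intervals = [(0.5, 2.0), (1.0, 3.5), (2.5, 4.0), (3.0, 5.5)]

# One fitted line per (discretized) instantiation of the interval data.
fits = [least_squares_line(xs, ys)
        for ys in product(*(grid(lo, hi) for lo, hi in intervals))]

# The extension principle returns all of these models as equally plausible ...
n_models = len(fits)
# ... although the quality of fit differs widely between instantiations.
best_sse = min(fit[2] for fit in fits)
worst_sse = max(fit[2] for fit in fits)
```

Ranking the instantiations by `sse` is exactly the kind of differentiation that (5) leaves out.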

Another example is shown in Figure 2, where the problem is to cluster two-dimensional data points. For three of the observations, one coordinate is not known precisely and only characterized in terms of an interval; these observations are shown as grey rectangles in the picture. Now, assuming that the data is indeed well separated into subgroups, the instantiation in the left figure (red triangles) is arguably more plausible than the one in the right figure (blue circles). In fact, while the first instantiation allows for inducing a simple structure with two well-formed clusters, the second would imply a much less convenient structure.

What these examples show is that, in the context of learning from data, not only does the data provide information about the (unknown) model, but also the other way around: Against the background of the model assumptions underlying the model class $\mathbf{M}$ and learning algorithm $\mathrm{ALG}$, some instantiations of the imprecise or ambiguous data appear to be more plausible than others. Exploiting this insight in order to differentiate between more and less plausible instantiations is something that we refer to as data disambiguation [Hüllermeier and Beringer, 2006]. In other words, we consider an extension of standard model induction, in which we are not only interested in inferring properties of the data generating process, but also of the imprecisely observed data. Or, stated differently, we are not only interested in learning about the model given the data, but in learning about the model and the data simultaneously.

As an aside, we note that, just like in standard statistics and machine learning, our approach takes the underlying model assumptions for granted and does not question them. Thus, model induction should be seen as a kind of conditional inference, making hypothetical claims about the data generating process (and in our case even about the observed data itself) given the validity of the underlying model assumptions. Needless to say, these assumptions are not always correct and, therefore, are often adapted or corrected by a data analyst if they seem to be incoherent with the data. The corresponding search for a proper model class, however, is outside the model induction process itself.

## 4 A Loss Minimization Approach

How can model induction be combined with data disambiguation? Here, we propose an approach based on the notion of (direct) loss minimization. Roughly speaking, instead of generalizing the learning algorithm, as done by the extension principle, we “fuzzify” an underlying loss function to be minimized by this algorithm. Thus, instead of fixing an instantiation first and fitting a model to this data afterward, we look for an optimal instantiation given a model; the model itself is then evaluated on the basis of this instantiation.

In supervised learning, the main goal is typically to find a model with minimal risk, that is, expected loss

$$R(M) = \int L(y, M(x)) \, dP(x, y), \tag{6}$$

where $L: \mathcal{Y} \times \mathcal{Y} \rightarrow \mathbb{R}$ is a loss function: For an input $x$, this function compares the prediction $\hat{y} = M(x)$ with the true output $y$ and quantifies a corresponding penalty in terms of $L(y, \hat{y})$. Roughly speaking, the risk is a weighted average of these losses, with each input/output tuple weighted according to its probability of occurrence. Thus, a risk minimizer

$$M^* \in \operatorname*{argmin}_{M \in \mathbf{M}} R(M)$$

is a model that, on average, performs well in terms of the loss $L$.

Obviously, the risk of a model cannot be computed directly, since the probability measure $P$ in (6), which specifies the data generating process, is unknown. What is often minimized as a substitute, therefore, is the empirical risk

$$R_{emp}(M) = \frac{1}{N} \sum_{i=1}^N L(y_i, M(x_i)), \tag{7}$$

i.e., the average loss on the training data $\mathcal{D}$. Or, in order to avoid the problem of possibly overfitting the data, a regularized version of (7) is minimized:

$$R_{reg}(M) = \frac{1}{N} \sum_{i=1}^N L(y_i, M(x_i)) + \lambda \, C(M), \tag{8}$$

where $C(M)$ is a measure of the complexity of the model $M$ and $\lambda \geq 0$ is a regularization parameter. In the following, we shall mostly stick to (7), keeping in mind that an extension to the regularized version (8) can be realized in a rather straightforward way.

### 4.1 The Case of Set-Valued Data

Again, for the ease of exposition, we consider the set-valued case first, before turning to the more general fuzzy case; moreover, we consider imprecision only for the output part while the inputs are supposed to be precise.

Consider a candidate model $M \in \mathbf{M}$ and an imprecise observation $(x, Y)$. With $\hat{y} = M(x)$, the set of possible losses of $M$ on this observation is then given by

$$\{ L(y, \hat{y}) \,|\, y \in Y \}.$$

In agreement with the idea of data disambiguation, we should look at the smallest of these losses, namely

$$L(Y, \hat{y}) = \min \{ L(y, \hat{y}) \,|\, y \in Y \}, \tag{9}$$

and the value $y^*$ for which it is obtained (we assume that $Y$ is closed and the minimum exists):

$$y^* = \operatorname{argmin} \{ L(y, \hat{y}) \,|\, y \in Y \}.$$

Given the model $M$, this value appears to be the most plausible in $Y$.

The function $L$ as defined in (9) can be seen as a generalized loss function, which, instead of comparing a (precise) prediction with a precise observation, compares a (precise) prediction with an imprecise (set-valued) observation. On the basis of this loss function, we can also generalize the empirical risk (7):

$$R_{emp}(M) = \frac{1}{N} \sum_{i=1}^N L(Y_i, M(x_i)). \tag{10}$$

A minimizer

$$M^* \in \operatorname*{argmin}_{M \in \mathbf{M}} R_{emp}(M) \tag{11}$$

of this risk (or, alternatively, a regularized version thereof) is an optimal model and, at the same time, suggests a disambiguation of the data: For each imprecise observation $Y_i$, the most plausible precise value is

$$y_i^* = \operatorname{argmin} \{ L(y, M^*(x_i)) \,|\, y \in Y_i \}. \tag{12}$$

Thus, the minimization of (10) serves our original purpose and solves two problems simultaneously, namely the induction of a plausible model (11) and a plausible disambiguation of the data (12). This approach is connected to the “minimin” strategy for model selection under imprecision as proposed in [Utkin and Coolen, 2011].

So far, we have assumed that only the output value is imprecise, while the input values are precisely observed. Obviously, the whole approach can be generalized quite easily to the case of imprecise observations of the form $(X, Y)$, with $X \subseteq \mathcal{X}$ and $Y \subseteq \mathcal{Y}$. To this end, the loss function (9) is further generalized as follows:

$$L(M, X, Y) = \min \{ L(y, M(x)) \,|\, (x, y) \in X \times Y \}. \tag{13}$$
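As an illustration of (9)–(12), the following sketch fits a regression line to interval-valued outputs under the squared error loss. It rests on several assumptions that are not part of the text: the toy data are hypothetical, the model class is restricted to lines, and the minimization of (10) is carried out by a simple alternating scheme (disambiguate the data given the model, then refit the model by least squares). This scheme is one possible way to decrease (10), not necessarily the procedure the author has in mind.

```python
def clip(value, lo, hi):
    """Closest point to `value` in the interval [lo, hi]."""
    return max(lo, min(hi, value))


def fit_line(xs, ys):
    """Least-squares line on precise data; returns (intercept, slope)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def disambiguate(model, xs, intervals):
    """Most plausible precise outputs (12): for squared loss, clip the prediction."""
    b0, b1 = model
    return [clip(b0 + b1 * x, lo, hi) for x, (lo, hi) in zip(xs, intervals)]


def generalized_risk(model, xs, intervals):
    """Generalized empirical risk (10) with squared error as the underlying loss."""
    b0, b1 = model
    y_star = disambiguate(model, xs, intervals)
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, y_star)) / len(xs)


def fit_with_disambiguation(xs, intervals, iterations=100):
    """Alternate between data disambiguation and model fitting (illustrative scheme)."""
    model = fit_line(xs, [(lo + hi) / 2 for lo, hi in intervals])
    for _ in range(iterations):
        ys = disambiguate(model, xs, intervals)
        model = fit_line(xs, ys)
    return model, disambiguate(model, xs, intervals)


# Hypothetical toy data: precise inputs, interval-valued outputs.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
intervals = [(0.0, 1.0), (2.5, 3.0), (2.0, 2.5), (4.5, 6.0), (4.0, 4.5)]

midpoint_model = fit_line(xs, [(lo + hi) / 2 for lo, hi in intervals])
model, y_star = fit_with_disambiguation(xs, intervals)
```

Each step of the alternation can only decrease the joint objective, so the returned model has a generalized risk no larger than that of the line fitted to the interval midpoints; `y_star` is the disambiguation of the data suggested by the returned model.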

### 4.2 The Case of Fuzzy Data

In the set-valued case, each candidate model is evaluated in terms of a generalized empirical risk, that is, a risk function based on a generalized loss. This evaluation can be expressed equivalently in terms of a standard empirical risk on a properly selected (instantiated) data sample:

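$$R_{emp}(M) = \frac{1}{N} \sum_{i=1}^N L(y_i^M, M(x_i^M)), \tag{14}$$

where

$$(x_i^M, y_i^M) = \mathrm{SEL}(X_i, Y_i, M) = \operatorname{argmin} \{ L(y_i, M(x_i)) \,|\, (x_i, y_i) \in X_i \times Y_i \} \tag{15}$$

is the disambiguation of $(X_i, Y_i)$ under $M$. A best model

$$M^* = \operatorname*{argmin}_{M \in \mathbf{M}} R_{emp}(M), \tag{16}$$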

supposed to be unique here, is then chosen, which in turn leads to a unique disambiguation

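$$\{ (x_i^{M^*}, y_i^{M^*}) \}_{i=1}^N$$

of the original (imprecise) data. In the more general case of fuzzy data, the same approach can be realized level-wise, i.e., for each level-cut

$$\{ ([X_i]_\alpha, [Y_i]_\alpha) \}_{i=1}^N$$

of the fuzzy data

$$\{ (X_i, Y_i) \}_{i=1}^N \subseteq \mathcal{F}(\mathcal{X}) \times \mathcal{F}(\mathcal{Y}).$$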

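Then, for a fixed model $M$, data disambiguation does not yield a unique selection (15), but instead a potentially different selection for each level cut. In other words, the selection is now a mapping

$$\alpha \mapsto (x_i^M(\alpha), y_i^M(\alpha)) = \operatorname{argmin} \{ L(y_i, M(x_i)) \,|\, (x_i, y_i) \in [X_i]_\alpha \times [Y_i]_\alpha \}.$$

In [Dubois and Prade, 2008], a mapping of that type is called a gradual element (in a fuzzy set). Likewise, a mapping from levels to (empirical) risk values can be associated with each model $M$:

$$r_M: (0, 1] \rightarrow \mathbb{R}, \quad \alpha \mapsto \frac{1}{N} \sum_{i=1}^N L(y_i^M(\alpha), M(x_i^M(\alpha))) \tag{17}$$

Note that the risk function $r_M$ thus defined is non-decreasing in $\alpha$: the level cuts shrink as $\alpha$ grows, so the minimum in the selection is taken over smaller sets.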

The problem of comparing models now comes down to comparing risk functions. This problem is non-trivial, since there is no natural total order on such functions. Obviously, a model $M$ is (weakly) preferred to another model $M'$, written $M \succeq M'$, if $r_M \leq r_{M'}$, i.e., $r_M(\alpha) \leq r_{M'}(\alpha)$ for all $\alpha \in (0, 1]$. The relation $\succeq$ thus defined is only a partial order on the model class $\mathbf{M}$, as models $M$ and $M'$ may also be incomparable (i.e., neither $M \succeq M'$ nor $M' \succeq M$). Figure 3 shows a simple (one-dimensional) example for the case of regression, namely three regression lines approximating four observations with fuzzy output; all three models are incomparable amongst each other, that is, none of them dominates any other one in terms of the associated risk function.

This situation can be handled in different ways. First, one may accept the non-uniqueness of the result, i.e., the existence of several (Pareto) optimal models; here, a model $M$ is optimal (non-dominated) if there is no model $M'$ such that $M' \succ M$, that is, $M' \succeq M$ and $M \not\succeq M'$.

Second, one may refine the partial order $\succeq$ as defined above into a total order. For example, a model could be evaluated in terms of the aggregated risk

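$$\overline{R}_{emp}(M) = \int_0^1 r_M(\alpha) \, d\alpha, \tag{18}$$

and models could then be compared in terms of these values:

$$(M \succeq M') \Leftrightarrow \left( \overline{R}_{emp}(M) \leq \overline{R}_{emp}(M') \right)$$

The model induction problem then comes down to finding a minimizer of (18):

$$M^* \in \operatorname*{argmin}_{M \in \mathbf{M}} \overline{R}_{emp}(M) \tag{19}$$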
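The difference between the partial order on risk functions and the total order induced by (18) can be seen in a small sketch. Everything concrete here is an assumption made for illustration: the two observations are triangular fuzzy outputs given as (center, spread) pairs, the underlying loss is the absolute error, the two "models" are constant predictions, and the integral in (18) is approximated by a midpoint rule over a grid of levels.

```python
def triangular_cut(center, spread, alpha):
    """Level cut of a triangular fuzzy set with the given center and spread."""
    half_width = (1.0 - alpha) * spread
    return center - half_width, center + half_width


def set_loss(cut, y_hat):
    """Generalized absolute loss (9): distance from y_hat to the interval."""
    lo, hi = cut
    return max(lo - y_hat, 0.0, y_hat - hi)


def risk_function(prediction, fuzzy_outputs, alpha):
    """Value r_M(alpha) of the risk function (17) for a constant prediction."""
    losses = [set_loss(triangular_cut(c, s, alpha), prediction)
              for c, s in fuzzy_outputs]
    return sum(losses) / len(losses)


def weakly_preferred(pred_a, pred_b, fuzzy_outputs, alphas):
    """True if pred_a has risk no larger than pred_b on every level of the grid."""
    return all(risk_function(pred_a, fuzzy_outputs, a)
               <= risk_function(pred_b, fuzzy_outputs, a) for a in alphas)


def aggregated_risk(prediction, fuzzy_outputs, alphas):
    """Midpoint-rule approximation of the aggregated risk (18)."""
    return sum(risk_function(prediction, fuzzy_outputs, a) for a in alphas) / len(alphas)


# Hypothetical fuzzy outputs (center, spread) and a grid of levels in (0, 1).
fuzzy_outputs = [(0.0, 2.0), (2.0, 0.2)]
alphas = [(j + 0.5) / 200 for j in range(200)]

incomparable = (not weakly_preferred(2.0, 1.0, fuzzy_outputs, alphas)
                and not weakly_preferred(1.0, 2.0, fuzzy_outputs, alphas))
risk_a = aggregated_risk(2.0, fuzzy_outputs, alphas)
risk_b = aggregated_risk(1.0, fuzzy_outputs, alphas)
```

In this example the two risk functions cross, so neither prediction dominates the other, whereas the aggregated risk still yields a strict preference.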

Interestingly, by exchanging summation and integration, (18) can also be written as a standard (empirical) risk with a modified loss function:

$$\overline{R}_{emp}(M) = \frac{1}{N} \sum_{i=1}^N L(Y_i, M(x_i)), \tag{20}$$

where

$$L(Y, \hat{y}) = \int_0^1 L([Y]_\alpha, \hat{y}) \, d\alpha \tag{21}$$

is a “fuzzy” loss function that compares a (precise) prediction with a fuzzy set-valued observation.

Expression (20) holds in the case of precise input and fuzzy output data but needs to be generalized further if input data is fuzzy, too.
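A fuzzy loss of the form (21) is easy to approximate numerically. The sketch below assumes a trapezoidal fuzzy output described by four hypothetical parameters `(a, b, c, d)` (support `[a, d]`, core `[b, c]`), the absolute error as underlying loss, and a midpoint rule for the integral over the levels.

```python
def alpha_cut(trapezoid, alpha):
    """Level cut of a trapezoidal fuzzy set with support [a, d] and core [b, c]."""
    a, b, c, d = trapezoid
    return a + alpha * (b - a), d - alpha * (d - c)


def set_loss(cut, y_hat):
    """Generalized absolute loss (9): distance from y_hat to the interval."""
    lo, hi = cut
    return max(lo - y_hat, 0.0, y_hat - hi)


def fuzzy_loss(trapezoid, y_hat, steps=1000):
    """Midpoint-rule approximation of the fuzzy loss (21)."""
    h = 1.0 / steps
    return h * sum(set_loss(alpha_cut(trapezoid, (j + 0.5) * h), y_hat)
                   for j in range(steps))


observation = (0.0, 1.0, 2.0, 3.0)        # hypothetical fuzzy output
loss_in_core = fuzzy_loss(observation, 1.5)
loss_in_boundary = fuzzy_loss(observation, 2.5)
loss_outside = fuzzy_loss(observation, 5.0)
```

A prediction inside the core incurs no loss, a prediction in the boundary region is penalized only on those levels whose cut it has already left, and a prediction outside the support is penalized on every level.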

### 4.3 Fuzzy Losses for Regression

The fuzzy loss function (21) compares a fuzzy value $Y$ with a (predicted) precise value $\hat{y}$. An example of such a loss is shown in Figure 4 for the case of regression. More specifically, this function is a fuzzy version of the absolute ($L_1$) loss

$$L(y, \hat{y}) = |y - \hat{y}|,$$

which is shown as a dashed line (as a function of $\hat{y}$ for fixed $y$). The fuzzy loss (solid line) is given by the map $\hat{y} \mapsto L(Y, \hat{y})$, where $Y$ is the trapezoidal fuzzy set shown in grey.

Interestingly, a fuzzification of the $L_1$ loss based on a triangular fuzzy set with mid-point $y$ and support $[y - \delta, y + \delta]$ leads to a kind of Huber-loss [Huber, 1981]:

$$L(Y, \hat{y}) = \begin{cases} \frac{1}{2} (y - \hat{y})^2 / \delta & \text{if } |y - \hat{y}| \leq \delta \\ |y - \hat{y}| - \frac{1}{2} \delta & \text{if } |y - \hat{y}| > \delta \end{cases}$$

This loss behaves like the quadratic ($L_2$) loss for small errors and like the $L_1$ loss for larger deviations. This kind of loss function is very popular in robust statistics, as it combines two interesting properties: Like the absolute error $L_1$, it is much less sensitive toward outliers than, for example, $L_2$, but at the same time, it avoids the non-differentiability of $L_1$.
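The connection to the Huber loss can be checked numerically. The sketch below integrates the generalized absolute loss over the level cuts $[y - (1-\alpha)\delta, \, y + (1-\alpha)\delta]$ of a triangular fuzzy set and compares the result with the closed form, whose quadratic branch applies for $|y - \hat{y}| \leq \delta$. The function names and the midpoint-rule integration are illustrative choices.

```python
def fuzzy_l1_loss_triangular(y, delta, y_hat, steps=2000):
    """Midpoint-rule approximation of (21) for a triangular fuzzy set around y."""
    h = 1.0 / steps
    total = 0.0
    for j in range(steps):
        alpha = (j + 0.5) * h
        half_width = (1.0 - alpha) * delta      # half-width of the alpha-cut
        total += max(abs(y - y_hat) - half_width, 0.0)
    return total * h


def huber_like_loss(y, delta, y_hat):
    """Closed form: quadratic for errors up to delta, linear beyond."""
    error = abs(y - y_hat)
    if error <= delta:
        return 0.5 * error ** 2 / delta
    return error - 0.5 * delta


y, delta = 0.0, 1.0
predictions = [0.0, 0.3, 1.0, 2.5]
gaps = [abs(fuzzy_l1_loss_triangular(y, delta, p) - huber_like_loss(y, delta, p))
        for p in predictions]
```

Up to the discretization error of the integration, both functions agree on all test points.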

As can be seen, our approach to learning from fuzzy data based on generalized loss functions includes methods such as M-estimation with Huber-loss as specific cases; methods for Huber M-estimation have been studied quite intensively in the literature [Mangasarian and Musicant, 2000]. It needs to be mentioned, however, that our approach is in a sense more general, especially as it allows for modeling each fuzzy value and, therefore, the corresponding loss function individually instead of applying the same loss function to each observation (recall, for instance, the example of an asymmetric fuzzy loss in Figure 4 (right)). In other words, our approach is sample-specific in the sense that a specific loss function can be defined for each sample point. To make this more clear, we may also write the fuzzy loss (21) using a slightly different notation:

$$L(Y, \hat{y}) = L_Y(y, \hat{y}),$$

where $y$ could be an observed value, and the fuzzy set $Y$ is used to specify a region of imprecision around this observation. Thus, the fuzzy loss is a standard loss function (defined on pairs of precise values) “modulated” by the fuzzy set $Y$ around $y$.

Another important loss function we can mimic is the $\epsilon$-insensitive loss that plays an important role in support vector regression [Schölkopf and Smola, 2001]:

$$L(y, \hat{y}) = \begin{cases} 0 & \text{if } |y - \hat{y}| \leq \epsilon \\ |y - \hat{y}| - \epsilon & \text{if } |y - \hat{y}| > \epsilon \end{cases}$$

This loss is obtained as a special case of (21) with $Y$ given by the interval $[y - \epsilon, y + \epsilon]$.

The use of a trapezoidal fuzzy set nicely combines the two types of loss discussed above: The resulting loss is insensitive in the core, behaves quadratically in the boundary region and like $L_1$ outside the support.
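The combination of both behaviors can be sketched as follows. The parametrization is an assumption made for illustration: the trapezoid is symmetric around $y$, with core half-width `eps` and a boundary region of width `delta` on each side, so that every level cut is an interval around $y$ and the set-valued loss on it is an $\epsilon$-insensitive loss with a level-dependent $\epsilon$.

```python
def epsilon_insensitive(y, y_hat, eps):
    """Absolute loss with respect to the interval [y - eps, y + eps]."""
    return max(abs(y - y_hat) - eps, 0.0)


def trapezoidal_fuzzy_loss(y, y_hat, eps, delta, steps=2000):
    """Midpoint-rule approximation of (21) for a symmetric trapezoid around y.

    The alpha-cut has half-width eps + (1 - alpha) * delta (assumed parametrization).
    """
    h = 1.0 / steps
    return h * sum(
        epsilon_insensitive(y, y_hat, eps + (1.0 - (j + 0.5) * h) * delta)
        for j in range(steps)
    )


eps, delta = 0.5, 1.0
loss_core = trapezoidal_fuzzy_loss(0.0, 0.4, eps, delta)       # inside the core
loss_boundary = trapezoidal_fuzzy_loss(0.0, 1.0, eps, delta)   # boundary region
loss_far = trapezoidal_fuzzy_loss(0.0, 3.0, eps, delta)        # outside the support
loss_crisp = trapezoidal_fuzzy_loss(0.0, 2.0, eps, 0.0)        # delta = 0: plain interval
```

With `delta = 0` the trapezoid degenerates to an interval and the $\epsilon$-insensitive loss is recovered; for `delta > 0`, the loss grows quadratically in the boundary region and linearly beyond it.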

### 4.4 Fuzzy Losses for Classification

In classification problems, the output space $\mathcal{Y}$ is a finite set comprised of $K$ classes $\{ \lambda_1, \ldots, \lambda_K \}$. The most typical loss function is the 0/1 loss, which is 0 if $\hat{y} = y$ and 1 otherwise. Now, suppose the output is characterized by a fuzzy subset $Y$ of $\mathcal{Y}$, that is, by a membership degree $\mu_Y(\lambda)$ for each class label $\lambda \in \mathcal{Y}$. The fuzzy loss function (21) is then given as follows:

$$L(Y, \hat{y}) = 1 - \mu_Y(\hat{y}).$$

Thus, the higher the membership degree of the predicted class $\hat{y}$, the smaller the loss. An interesting special case is obtained for a fuzzy set of the type

$$\mu_Y(\lambda) = \begin{cases} 1 & \text{if } \lambda = \lambda_k \\ 1 - w & \text{if } \lambda \neq \lambda_k \end{cases}, \tag{22}$$

for some $k \in \{1, \ldots, K\}$ and $w \in [0, 1]$. Using this fuzzy set for modeling the observation of class label $\lambda_k$ corresponds to a discounting of this observation: Although $\lambda_k$ is regarded as completely plausible, the other class labels are not fully excluded either; or, stated differently, $w$ can be seen as a degree of certainty that the observed class is indeed $\lambda_k$. For a fuzzy observation of that kind,

$$L(Y, \hat{y}) = \begin{cases} 0 & \text{if } \hat{y} = \lambda_k \\ w & \text{if } \hat{y} \neq \lambda_k \end{cases},$$

which means that the penalty for a misclassification is effectively reduced from 1 to $w$. In other words, the training example is weighted by the factor $w$. Again, learning from weighted examples (aka instance weighting) has been studied intensively in the literature [Shimodaira, 2000].
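The correspondence between discounted labels and instance weights can be spelled out in a few lines. The representation of a fuzzy label as a dictionary from class labels to membership degrees, as well as the class names, are assumptions made for this sketch.

```python
def fuzzy_zero_one_loss(membership, predicted):
    """Fuzzy 0/1 loss: one minus the membership degree of the predicted class."""
    return 1.0 - membership.get(predicted, 0.0)


def discounted_label(labels, observed, w):
    """Fuzzy set (22): the observed label is fully plausible, all others to degree 1 - w."""
    return {label: (1.0 if label == observed else 1.0 - w) for label in labels}


labels = ["a", "b", "c"]                       # hypothetical class labels
fuzzy_label = discounted_label(labels, observed="b", w=0.7)

loss_correct = fuzzy_zero_one_loss(fuzzy_label, "b")
loss_wrong = fuzzy_zero_one_loss(fuzzy_label, "a")
```

A misclassification of this example costs `w` instead of 1, which is the same as using the standard 0/1 loss with the example weighted by `w`.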

An important class of loss functions in binary classification is the so-called margin losses [Rosset et al., 2003]. Instead of merely checking whether a prediction is on the right or the wrong side of the decision boundary, as the 0/1 loss does, such losses depend on how much on the right or wrong side the prediction is. By preferring “very correct” predictions to simply correct ones, they enforce a “large margin” between the classes, i.e., they tend to separate the classes as much as possible.

More formally, let $\mathcal{Y} = \{-1, +1\}$ encode the two classes (negative and positive), and suppose that $\mathbf{M}$ is a class of scoring classifiers $M: \mathcal{X} \rightarrow \mathbb{R}$; a positive score $s = M(x) > 0$ suggests that $x$ belongs to the positive class, whereas a negative score suggests that $x$ is negative. A margin loss is a function of the form

$$L(y, s) = f(ys), \tag{23}$$

where $f: \mathbb{R} \rightarrow \mathbb{R}$ is non-increasing. Thus, a margin loss penalizes scores instead of binary predictions, and the larger (smaller) the score in the case of a positive (negative) class, the smaller the loss. Important examples of (23) include the hinge loss

$$L(y, s) = f(ys) = \max(1 - ys, 0) \tag{24}$$

used in support vector machines [Vapnik, 1998, Schölkopf and Smola, 2001], the exponential loss

$$L(y, s) = f(ys) = \exp(-ys) \tag{25}$$

used in boosting algorithms [Schapire, 1990], and the logistic loss

$$L(y, s) = f(ys) = \log(1 + \exp(-ys)) \tag{26}$$

closely connected with logistic regression.

Now, suppose again that the output is characterized by a fuzzy subset $Y$ of $\mathcal{Y}$, that is, by membership degrees $\mu_Y(-1)$ and $\mu_Y(+1)$ for the negative and positive class, respectively. More specifically, consider again the special case (22):

$$\mu_Y(\lambda) = \begin{cases} 1 & \text{if } \lambda = y \\ 1 - w & \text{if } \lambda = \bar{y} \end{cases}, \tag{27}$$

where $\bar{y} = -y$ is the complementary class and $w \in [0, 1]$ can be interpreted as a degree of confidence in $y$. Then, it is not difficult to show that the fuzzy loss function (21) is given by

$$L(Y, s) = f_w(ys) = w \cdot f(ys) + (1 - w) \cdot f(|ys|). \tag{28}$$

Please note that $f_w(ys)$ coincides with the original margin loss $f(ys)$ if $ys \geq 0$, i.e., if the prediction is in favor of the more likely class $y$; thus, the difference only concerns the negative part. Figure 5 shows the graph of (28) for different margin losses and different values of $w$.

As can be seen, the loss (28) loses the properties of monotonicity and convexity for sufficiently small values of $w$. Apart from the fact that this is certainly undesirable from a computational perspective, as it makes optimization more difficult, the non-monotone behavior of the loss may also be surprising at first sight. At second sight, however, it makes perfect sense. In fact, one has to keep in mind that, in contrast to the simple 0/1 loss, a margin loss pursues two goals at the same time, namely correct classification and separation of the data. To comply with the first goal, the penalty should decrease with decreasing $w$, just like in the case of the 0/1 loss; this is why $f_w(ys) \leq f(ys)$ for $ys < 0$. At the same time, however, an increase of the margin is rewarded. Taking both effects together, that is, a discounted penalty for misclassification and a reward for an increased margin, it is possible that an incorrect classification with a large margin is penalized less than a correct classification with a small margin.

Moreover, one should note that the fuzzy margin losses are fully in agreement with our idea of data disambiguation. This can be seen most clearly for $w = 0$, which corresponds to the case where both labels, positive and negative, are considered completely plausible (in other words, no label information is given). Here, the loss is a symmetric function around 0: Putting an instance directly on the decision boundary, and thereby expressing maximal ambiguity, is the worst solution and penalized with the highest loss. The larger the distance from the decision boundary, regardless of the side, the smaller the loss becomes. Or, stated differently, the more pronounced the prediction in favor of one of the classes, the better it is.
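The properties of the fuzzy margin loss (28) discussed above can be verified directly. The sketch below implements (28) for an arbitrary non-increasing function `f`, with the hinge and logistic losses as examples; the function names and the particular margins and values of `w` are illustrative.

```python
import math


def hinge(margin):
    return max(1.0 - margin, 0.0)


def logistic(margin):
    return math.log(1.0 + math.exp(-margin))


def fuzzy_margin_loss(f, w, margin):
    """Fuzzy margin loss (28) for a label observed with confidence w."""
    return w * f(margin) + (1.0 - w) * f(abs(margin))


# Full confidence (w = 1) recovers the original margin loss.
same_as_original = fuzzy_margin_loss(hinge, 1.0, -2.0) == hinge(-2.0)

# For non-negative margins, the confidence w has no influence.
unchanged_on_positive_side = abs(fuzzy_margin_loss(hinge, 0.3, 0.5) - hinge(0.5)) < 1e-12

# No label information (w = 0): the loss is symmetric around 0.
symmetric = abs(fuzzy_margin_loss(logistic, 0.0, -2.0)
                - fuzzy_margin_loss(logistic, 0.0, 2.0)) < 1e-12

# Non-monotonicity for small w: a confident mistake can be penalized less
# than a correct prediction close to the decision boundary.
loss_confident_mistake = fuzzy_margin_loss(hinge, 0.2, -3.0)
loss_hesitant_correct = fuzzy_margin_loss(hinge, 0.2, 0.1)
```

With `w = 0.2` and the hinge loss, the margin `-3.0` yields a loss of 0.8, whereas the margin `0.1` yields 0.9.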

So far, we only considered imprecision of the dependent variables and assumed the predictor variables to be precise. Without going into detail, we note that the predictors can of course be affected by imprecision, too, and that the effect on the loss function is different in this case. For example, suppose that a predictor is represented by a (closed) contiguous region $X \subseteq \mathcal{X}$, such as a rectangle or a ball. The scores that can be produced for this instance by a model $M$ are then given in the form of an interval

$$\left[ \min \{ M(x) \,|\, x \in X \}, \; \max \{ M(x) \,|\, x \in X \} \right],$$

which can also be written as $[s - d, s + d]$ with some $d \geq 0$ and $s$ the middle point. Applying the generic loss function (13) to this case (with precise output $y$), and assuming $L$ to be a margin loss (23), we obtain

$$L(M, X, y) = f(\max \{ y(s - d), y(s + d) \}). \tag{29}$$

Since $\max \{ y(s - d), y(s + d) \} = ys + d$ for $y \in \{-1, +1\}$, the loss function is “shifted to the left” by $d$ units; see Figure 6 for an illustration.
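The shift can be checked against a brute-force minimization over the score interval. The sketch assumes the logistic loss as margin loss, and the values of `s` and `d` are hypothetical.

```python
import math


def logistic(margin):
    return math.log(1.0 + math.exp(-margin))


def imprecise_input_loss(f, y, s, d):
    """Loss (29) for a precise label y and scores in the interval [s - d, s + d]."""
    return f(max(y * (s - d), y * (s + d)))


def brute_force_loss(f, y, s, d, steps=1000):
    """Minimum of the margin loss over a grid of scores in [s - d, s + d]."""
    scores = [s - d + 2.0 * d * j / steps for j in range(steps + 1)]
    return min(f(y * score) for score in scores)


s, d = 0.3, 0.5
results = {y: (imprecise_input_loss(logistic, y, s, d),
               logistic(y * s + d),
               brute_force_loss(logistic, y, s, d))
           for y in (-1, 1)}
```

For both labels, the three values agree: the most favorable score in the interval is always the endpoint that increases the margin by `d`.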

## 5 Comparison with Denoeux’s Approach

Denoeux addressed a quite similar problem in his recent articles [Denoeux, 2011, Denoeux, 2013]. More specifically, he addressed the problem of learning from imprecise data, represented in terms of fuzzy sets or belief functions, within a probabilistic framework and, for this purpose, proposed an extension of maximum likelihood inference. Without going into technical details, we shall try to highlight the main conceptual differences between Denoeux’s approach (subsequently referred to as GMLI for Generalized Maximum Likelihood Inference) and ours, presenting ideas of the former in terms of our notation (a special case of this approach was already introduced in [Côme et al., 2009]).

Roughly speaking, given a sample of imprecise data $\mathbf{D}$, Denoeux defines the plausibility of a model $M$ identified by a parameter $\theta$ in terms of a normalized likelihood; the likelihood of $\theta$ is in turn defined by the probability that the data-generating process specified by $\theta$ produces an instantiation $\mathcal{D}$ of $\mathbf{D}$:

$$\pi(M) = \pi(\theta) \propto P(\mathcal{D} \in \mathbf{D} \,|\, \theta) = \prod_{i=1}^N P(Z_i \,|\, \theta)$$

This probability can also be written as

$$\int_{\mathcal{D} \in \mathrm{INS}(\mathbf{D})} P(\mathcal{D} \,|\, \theta) \, d\mu(\mathcal{D}) = \int_{\mathcal{D} \in \mathrm{INS}(\mathbf{D})} \prod_{i=1}^N P(z_i \,|\, \theta) \, d\mu(\mathcal{D}), \tag{30}$$

where $\mu$ measures the plausibility of instantiations. This already reveals the most important difference between Denoeux’s approach and ours: In the former, a model is evaluated, not by looking at how it fits the most favorable instantiation of the imprecise data, but how it fits all possible instantiations simultaneously. In fact, as can be seen from (30), the score of a model is obtained by summing (averaging) its likelihood degrees (on precise samples) over all instantiations.

The difference may perhaps become even clearer when looking at GMLI from our loss minimization perspective. As already mentioned earlier, likelihood maximization and loss minimization are closely connected, and maximizing the log-likelihood can typically be considered as minimizing an additive loss on the training data, namely the log-loss. As an illustration, consider the simple case of (one-dimensional) regression, where the observed response is supposed to follow a normal distribution. Thus, given (precise) training data $\{(x_i, y_i)\}_{i=1}^{N}$, the likelihood is of the form

$$c \prod_{i=1}^{N} \frac{1}{\sigma} \exp\left( -\frac{1}{2} \left( \frac{M(x_i) - y_i}{\sigma} \right)^{2} \right), \tag{31}$$

where $c$ is a normalizing constant, and maximizing the logarithm of that likelihood is obviously equivalent to computing the least squares estimator

$$M^{*} = \operatorname*{argmin}_{M} \sum_{i=1}^{N} \bigl( M(x_i) - y_i \bigr)^{2}.$$
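For completeness, the step from (31) to the least squares estimator can be spelled out as follows, assuming that $\sigma$ is held fixed (a short sketch in the notation above, not a derivation reproduced from the paper):

```latex
\begin{align*}
\log\Biggl( c \prod_{i=1}^{N} \frac{1}{\sigma}
  \exp\Bigl( -\tfrac{1}{2} \Bigl( \tfrac{M(x_i) - y_i}{\sigma} \Bigr)^{2} \Bigr) \Biggr)
  &= \log c - N \log \sigma
     - \frac{1}{2\sigma^{2}} \sum_{i=1}^{N} \bigl( M(x_i) - y_i \bigr)^{2}
\end{align*}
```

Only the last term depends on $M$, and it enters with a negative sign, so maximizing the log-likelihood over $M$ amounts to minimizing the sum of squared errors.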

In the case where a response $Y_i$ is imprecise, the contribution to the likelihood is a factor of the form $P(\hat{Y}_i \in Y_i)$, and the negative logarithm of this factor can be seen as the loss caused by the prediction $\hat{y}_i$ on the observation $Y_i$; thus, in GMLI, the counterpart to our generalized loss function (21) is given by

$$L(Y_i, \hat{y}_i) = -\log\bigl( P(\hat{Y}_i \in Y_i) \bigr), \tag{32}$$

where $\hat{Y}_i$ is a random variable defined by $\hat{y}_i$ and the underlying probabilistic model. More concretely, suppose that $Y_i$ is an interval, and recall our assumption of an underlying normal distribution (31). Then, $\hat{Y}_i$ is a Gaussian centered at $\hat{y}_i$. As shown in Figure 7, the loss caused by the prediction $\hat{y}_i$ is determined by the area of this distribution outside the interval $Y_i$: by (32), it equals the negative logarithm of one minus that area.

The overall loss function produced in this way (with $\sigma$ in (31) given by 1) is shown in Figure 8, together with the loss functions for other intervals (of different width) for comparison. It is noteworthy that the loss in GMLI is never 0, not even when predicting the center of the interval: Even in that case, the Gaussian centered at that value is not completely inside the interval $Y_i$, i.e., $P(\hat{Y}_i \in Y_i) < 1$.
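The following minimal sketch evaluates the loss (32) for an interval-valued observation under the Gaussian model (31). The interval $[-1, 1]$, the unit standard deviation, and the function names are illustrative assumptions, not values taken from the paper.

```python
import math

def normal_cdf(z):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def gmli_interval_loss(prediction, lower, upper, sigma=1.0):
    """Equation (32): -log P(lower <= Y_hat <= upper), Y_hat ~ N(prediction, sigma^2).

    For predictions very far from the interval the mass underflows to 0
    and math.log raises an error; this sketch does not guard against that.
    """
    mass_inside = (normal_cdf((upper - prediction) / sigma)
                   - normal_cdf((lower - prediction) / sigma))
    return -math.log(mass_inside)

# Hypothetical observation [-1, 1]; predicting its center still costs something.
center_loss = gmli_interval_loss(0.0, -1.0, 1.0)
off_center_loss = gmli_interval_loss(0.5, -1.0, 1.0)
wide_interval_loss = gmli_interval_loss(0.0, -3.0, 3.0)
```

In this setting the loss at the center of $[-1, 1]$ is strictly positive, it grows as the prediction moves away from the center, and it shrinks when the interval is widened, in line with the discussion above.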

The loss can only come close to 0 if either the interval $Y_i$ is very wide or the Gaussian is very narrow, i.e., if the standard deviation $\sigma$ in (31) is very small. This standard deviation, however, is normally estimated globally and not specifically adapted to a single observation; and even if this could be done (the case of heteroscedasticity), the standard deviation would need to be fitted to the data-generating process, not to our knowledge about the data.

Anyway, for a fixed standard deviation, there is a constant and unavoidable penalty that only depends on the width of the interval (and similarly for fuzzy sets): The smaller the interval, the higher the penalty. Please note, however, that this shift of the function does not have any influence on loss minimization: It is simply a constant term in the empirical risk that does not change its minimizer. For better comparison with our approach, we can therefore “normalize” the GMLI loss functions by subtracting the constant penalty.

The result is shown in Figure 9. As can be seen, the discounting of the loss due to an increased imprecision of the observation is quite different in GMLI and our approach: The former favors the mid-point of the interval, while the width of the interval (imprecision) leads to a global scaling of the whole loss function: The smaller the interval, the steeper the loss function. (Note that this may lead to technical problems in the limit case of a precise observation, where the width of the interval tends to 0. Then, even a small prediction error may yield an extreme loss. Obviously, the idea of data inclusion does not naturally apply to the special case of precise data: If $Y_i$ in (32) reduces from a proper set to a singleton, the probability goes to 0 and hence the loss to infinity.) As opposed to this, our approach treats all points inside the interval as equal; likewise, the increase of the loss outside the interval is always the same. Roughly speaking, our approach leads to stretching the loss function “horizontally”, while GMLI scales it “vertically”.
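The contrast can be reproduced numerically with a small sketch. It assumes the squared error as the base loss for our approach, so that the generalized loss of an interval is the squared distance to its closest point, and it normalizes the GMLI loss by subtracting its value at the mid-point. The intervals and function names are hypothetical.

```python
import math

def normal_cdf(z):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def gmli_loss(prediction, lower, upper, sigma=1.0):
    """Equation (32) for an interval under the Gaussian model (31)."""
    mass_inside = (normal_cdf((upper - prediction) / sigma)
                   - normal_cdf((lower - prediction) / sigma))
    return -math.log(mass_inside)

def normalized_gmli_loss(prediction, lower, upper, sigma=1.0):
    """GMLI loss minus its constant penalty, i.e. its value at the mid-point."""
    midpoint = (lower + upper) / 2
    return (gmli_loss(prediction, lower, upper, sigma)
            - gmli_loss(midpoint, lower, upper, sigma))

def disambiguation_loss(prediction, lower, upper):
    """Squared error to the most favorable instantiation in [lower, upper]."""
    closest = min(max(prediction, lower), upper)
    return (prediction - closest) ** 2

# The same prediction 0.4 against intervals of increasing width.
for half_width in (0.5, 1.0, 2.0):
    a, b = -half_width, half_width
    print(half_width, normalized_gmli_loss(0.4, a, b), disambiguation_loss(0.4, a, b))
```

For a prediction inside the interval, the squared-distance loss is exactly 0, whereas the normalized GMLI loss is positive away from the mid-point and larger for narrower intervals.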

Qualitative differences of a similar kind can also be seen for other types of loss functions, for example the logistic loss (26) used in binary classification. For the special type of a discounted (weighted) observation (27), GMLI yields the loss

$$L(Y, s) = -\log\left( 1 - \frac{w \cdot \exp(-ys)}{1 + \exp(-ys)} \right). \tag{33}$$

Figure 10 shows these loss functions for different values of $w$ and, for comparison, reproduces the corresponding functions for our approach (already shown in Figure 5). In Section 6 below, these two loss functions will be compared with each other in a numerical experiment.
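A direct transcription of (33) makes its two limiting cases easy to check. The function names are illustrative, and the sketch uses the identity $\exp(-ys)/(1+\exp(-ys)) = 1/(1+\exp(ys))$ to evaluate the fraction stably.

```python
import math

def sigmoid(t):
    """Logistic function 1 / (1 + exp(-t)), computed stably."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def gmli_logistic_loss(y, s, w):
    """Equation (33): -log(1 - w * exp(-y s) / (1 + exp(-y s)))."""
    prob_other_label = sigmoid(-y * s)  # equals exp(-ys) / (1 + exp(-ys))
    return -math.log(1.0 - w * prob_other_label)

def logistic_loss(margin):
    """Standard logistic loss log(1 + exp(-margin)), for comparison."""
    return math.log1p(math.exp(-margin))

full_weight = gmli_logistic_loss(+1, 0.7, w=1.0)
half_weight = gmli_logistic_loss(+1, 0.7, w=0.5)
zero_weight = gmli_logistic_loss(+1, 0.7, w=0.0)
```

With $w = 1$ the expression reduces to the standard logistic loss $\log(1 + \exp(-ys))$, and with $w = 0$ it vanishes for every score.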

Although the cases we considered here are specific ones, they already suggest that Denoeux’s approach is not in agreement with our notion of data disambiguation, which is perhaps not surprising, given that it was never intended to implement this idea. In GMLI, the compatibility of a model with an imprecise sample is based on the idea of data inclusion: When comparing a predicted data point $\hat{y}$ with an imprecise observation $Y$, the loss (negative log-likelihood) depends on how well $\hat{y}$ (or, more specifically, the probability distribution associated with that point) is included in $Y$. Naturally, this leads to a preference for points “in the middle” of $Y$. As already mentioned above, the approach therefore tends to fit these middle points, while the imprecision of the information leads to a global decrease of the loss. Our method, on the other hand, starts without any bias in the form of preferences on instantiations and instead tries to figure out the most likely ones.

## 6 Illustration

This section presents an illustration of our approach in a simple classification setting. Before explaining the setup, we emphasize that our experiments are not meant as an empirical validation of our approach, let alone a comparison with alternative methods in terms of specific performance measures. Since we consider the contribution of this paper as being more of a conceptual than methodological nature, and indeed proposed a conceptual framework rather than a concrete method, such a comparison is arguably not appropriate at this point.

Nevertheless, we would like to show the potential usefulness of our fuzzy loss functions by means of a practical example. To this end, we consider a simple binary classification problem with two normally distributed classes, a positive and a negative one, that differ in their means. As training data, we assume a sample consisting of 100 randomly generated instances from each class; a typical example is shown in Figure 11. On a sample of that kind, we train a linear classifier using logistic regression. Since the true conditional class distributions are known, it is not difficult to determine the generalization performance of such a model in terms of the error rate, i.e., the probability of an incorrect classification (which corresponds to the risk (6) with the 0/1 loss).

In a first experiment, the class information was partly removed from the training instances. More specifically, each of the 200 instances was declared “unlabeled” with a fixed probability (and kept its original label otherwise); thus, we are in a setting of semi-supervised learning, in which only a fraction of the instances is labeled (see Figure 11 for a typical data set of that kind). In our approach, the unlabeled instances can be modeled in terms of a fuzzy set that assigns a membership degree of 1 to both the positive and the negative class. Then, a model is trained using the fuzzy loss function (28) with the log-loss (26). (Minimization of the empirical loss was done by means of a simple gradient method, which, due to the non-convexity of the loss, may of course end up in local optima.) Standard logistic regression, on the other hand, cannot directly exploit the unlabeled instances, and therefore only used the remaining labeled ones.
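To make the treatment of unlabeled instances concrete, here is a minimal sketch of the empirical risk in this setting. It assumes that the generalized loss of an unlabeled instance is the log-loss at its most favorable label, which is how we read the fuzzy loss for a set with full membership of both classes. The linear model, the toy sample, and all function names are hypothetical.

```python
import math

def log_loss(margin):
    """Logistic loss log(1 + exp(-margin)), computed stably."""
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

def generalized_log_loss(score, label):
    """Loss of a score against a label in {+1, -1}, or None if unlabeled.

    Assumption: an unlabeled instance is the set {+1, -1} with full
    membership, so the loss is taken at the most favorable label.
    """
    if label is None:
        return min(log_loss(score), log_loss(-score))
    return log_loss(label * score)

def empirical_risk(weights, bias, instances, labels):
    """Average generalized loss of the linear model w.x + b on the sample."""
    total = 0.0
    for x, y in zip(instances, labels):
        score = sum(w * xi for w, xi in zip(weights, x)) + bias
        total += generalized_log_loss(score, y)
    return total / len(instances)

# Hypothetical toy sample: two labeled points and two unlabeled ones.
instances = [(2.0, 1.0), (-1.5, -2.0), (1.0, 1.5), (-2.0, -0.5)]
labels = [+1, -1, None, None]
risk = empirical_risk([1.0, 1.0], 0.0, instances, labels)
```

Under this assumption the loss of an unlabeled instance depends only on the absolute value of its score and is largest at score 0, so unlabeled instances close to the decision boundary are penalized most. The minimum over the two labels is also what makes the empirical risk non-convex.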

The results are shown in Figure 12 in terms of the expected classification error (derived as an average over a large number of repetitions of this experiment) as a function of the probability of removing a label. As expected, the larger this probability becomes, i.e., the fewer labeled and the more unlabeled examples the training data contains, the worse the generalization performance of both methods. Obviously, however, the drop in performance is much more significant for standard logistic regression. From these results, we may conclude that our fuzzy loss (28) allows for exploiting the unlabeled instances, in addition to the labeled ones, in a meaningful way.

As a side remark, we note that Denoeux’s GMLI will produce exactly the same result as standard logistic regression: Although the unlabeled data could be modeled in the same way as in our approach, it will effectively be ignored by GMLI: Since the probability to observe either the positive or the negative label (the sure event) is 1, the unlabeled instances will not influence the likelihood function.
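This can be checked directly under the logistic model underlying (33), in which the probability of the positive label given a score $s$ is $1/(1+\exp(-s))$ (a short verification in the notation above):

```latex
\begin{align*}
P(\hat{Y} \in \{-1, +1\})
  &= \frac{1}{1 + \exp(-s)} + \frac{\exp(-s)}{1 + \exp(-s)} = 1, \\
L(\{-1, +1\}, s) &= -\log(1) = 0 .
\end{align*}
```

The contribution is the same for every score $s$, so it has no gradient with respect to the model parameters; this matches (33) with $w = 0$.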

In a second experiment, we assume that the label of each example is switched (from positive to negative and vice versa) with a fixed probability, which can be seen as a kind of noise level. This noise level is supposed to be known, whereas for each individual training example, it is not known whether the observed label corresponds to the original one or has been switched. In our approach as well as in GMLI, we can use the idea of attaching a degree of certainty to an observation: The label information is modeled in terms of a fuzzy set (27), assigning a membership degree of 1 to the observed label and a discounted degree to the other label. For our approach, we again use the fuzzy loss function (28) with the log-loss (26), whereas GMLI is based on the minimization of the loss (33). Standard logistic regression simply uses the observed label information, which is the best it can do.

Figure 13 shows the average classification error of the three methods as a function of the noise level. Overall, the picture is quite similar to the first experiment: Compared to our approach, the drop in performance is much more significant for standard logistic regression. This time, GMLI is not exactly equivalent to standard logistic regression, but the difference in performance is negligible. Apparently, our fuzzy loss function (28) is more apt to exploit the uncertain training information than the modified loss (33) underlying GMLI.

## 7 Conclusion

We have introduced a conceptual framework for (supervised) learning from imprecise and fuzzy data, which is based on the generalization of loss functions in empirical risk minimization. In contrast to the generic extension principle, our approach implicitly exploits the inductive bias underlying the learning method and performs model identification and data disambiguation simultaneously.

Our extended loss functions allow for directly “comparing” a (precise) prediction with an imprecise observation, and thereby provide the basis for fitting a precise model to imprecise data. The principle that we used for extending a standard loss function is coherent with our idea of data disambiguation and can be seen as a sample-specific “modulation” of the original loss.

Interestingly enough, our fuzzy set-based generalization of loss functions covers several existing methods as special cases, including instance weighting, robust regression (Huber loss) and support vector regression ($\epsilon$-insensitive loss). Thus, it may have the potential to serve as a unifying framework of such methods. Apart from that, however, it also allows for deriving new methods in a systematic and conceptually sound manner. For example, while the well-known Huber loss and the $\epsilon$-insensitive loss are obtained by modulating the loss with a symmetric triangular fuzzy set and an interval, respectively, a trapezoidal fuzzy set leads to a new loss function that elegantly combines both effects (insensitivity and robustness) at the same time.

Needless to say, while being conceptually simple, our framework can become quite challenging from a computational perspective. In particular, solving the generalized risk minimization problems (16) and (19) is far from trivial. Therefore, developing efficient algorithms for specific problem classes is an important topic of future work. Such algorithms will also provide the basis for a proper empirical evaluation of our framework.

## References

• [Cerny and Rada, 2011] Cerny, M. and Rada, M. (2011). On the possibilistic approach to linear regression with rounded or interval-censored data. Measurement Science Review, 11(2):34–40.
• [Changa and Ayyubb, 2001] Changa, Y. and Ayyubb, B. (2001). Fuzzy regression methods—a comparative assessment. Fuzzy Sets and Systems, 119(2):187–203.
• [Côme et al., 2009] Côme, E., Oukhellou, L., Denoeux, T., and Aknin, P. (2009). Learning from partially supervised data using mixture models and belief functions. Pattern Recognition, 42(3):334–348.
• [Cour et al., 2011] Cour, T., Sapp, B., and Taskar, B. (2011). Learning from partial labels. The Journal of Machine Learning Research, 12:1501–1536.
• [Couso and Dubois, 2009] Couso, I. and Dubois, D. (2009). On the variability of the concept of variance for fuzzy random variables. IEEE Transactions on Fuzzy Systems, 17:1070–1080.
• [Denoeux, 2011] Denoeux, T. (2011). Maximum likelihood estimation from fuzzy data using the EM algorithm. Fuzzy Sets and Systems, 183(1):72–91.
• [Denoeux, 2013] Denoeux, T. (2013). Maximum likelihood estimation from uncertain data in the belief function framework. IEEE Transactions on Knowledge and Data Engineering, 25(1):119–130.
• [Diamond, 1988] Diamond, P. (1988). Fuzzy least squares. Information Sciences, 46:141–157.
• [Diamond and Tanaka, 1998] Diamond, P. and Tanaka, H. (1998). Fuzzy regression analysis. In Slowinski, R., editor, Fuzzy Sets in Decision Analysis, Operations Research and Statistics, pages 349–387. Kluwer.
• [Dubois, 2011] Dubois, D. (2011). Ontic vs. epistemic fuzzy sets in modeling and data processing tasks. In Madani, K., Kacprzyk, J., and Filipe, J., editors, Proc. IJCCI (NCTA), International Conference on Neural Computation Theory and Applications, Paris.
• [Dubois and Prade, 2008] Dubois, D. and Prade, H. (2008). Gradual elements in a fuzzy set. Soft Computing, 12(2):165–175.
• [Ferraro et al., 2010] Ferraro, M., Coppi, R., Gonzalez-Rodriguez, G., and Colubi, A. (2010). A linear regression model for imprecise response. Int. Journal of Approximate Reasoning, 51:759–770.
• [Gonzalez-Rodriguez et al., 2009] Gonzalez-Rodriguez, G., Blanco, A., Colubi, A., and Lubiano, M. (2009). Estimation of a simple linear regression model for fuzzy random variables. Fuzzy Sets and Systems, 160(3):357–370.
• [Huber, 1981] Huber, P. (1981). Robust Statistics. Wiley.
• [Hüllermeier and Beringer, 2006] Hüllermeier, E. and Beringer, J. (2006). Learning from ambiguously labeled examples. Intelligent Data Analysis, 10(5):419–440.
• [Kruse and Meyer, 1987] Kruse, R. and Meyer, D. (1987). Statistics with Vague Data. D. Reidel, Dordrecht.
• [Kwakernaak, 1978] Kwakernaak, H. (1978). Fuzzy random variables I: Definitions and theorems. Information Sciences, 15:1–29.
• [Kwakernaak, 1979] Kwakernaak, H. (1979). Fuzzy random variables II: Algorithms and examples for the discrete case. Information Sciences, 17:253–278.
• [Mangasarian and Musicant, 2000] Mangasarian, O. and Musicant, D. (2000). Robust linear and support vector regression. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(9).
• [Puri and Ralescu, 1986] Puri, M. and Ralescu, D. (1986). Fuzzy random variables. Journal of Mathematical Analysis and Applications, 114:409–422.
• [Rosset et al., 2003] Rosset, S., Zhu, J., and Hastie, T. (2003). Margin maximizing loss functions. In Proceedings NIPS-2003, Advances in Neural Information Processing Systems.
• [Sanchez and Couso, 2007] Sanchez, L. and Couso, I. (2007). Advocating the use of imprecisely observed data in genetic fuzzy systems. IEEE Transactions on Fuzzy Systems, 15(4):551–562.
• [Schapire, 1990] Schapire, R. (1990). The strength of weak learnability. Machine Learning, 5(2):197–227.
• [Schölkopf and Smola, 2001] Schölkopf, B. and Smola, A. (2001). Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press.
• [Shimodaira, 2000] Shimodaira, H. (2000). Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227–244.
• [Tanaka and Guo, 1999] Tanaka, H. and Guo, P. (1999). Possibilistic Data Analysis for Operations Research. Physica-Verlag, Heidelberg.
• [Utkin and Coolen, 2011] Utkin, L. and Coolen, F. (2011). Interval-valued regression and classification models in the framework of machine learning. In Proc. ISIPTA 2011, 7th International Symposium on Imprecise Probability: Theories and Applications, Innsbruck, Austria.
• [Vapnik, 1998] Vapnik, V. (1998). Statistical Learning Theory. John Wiley & Sons.
• [Viertl, 2011] Viertl, R. (2011). Statistical Methods for Fuzzy Data. Wiley.
• [Xianga and Kreinovich, 2013] Xianga, G. and Kreinovich, V. (2013). Towards fast and accurate algorithms for processing fuzzy data: Interval computations revisited. International Journal of General Systems.
• [Zadeh, 1975] Zadeh, L. (1975). The concept of a linguistic variable and its application to approximate reasoning, parts 1-3. Information Sciences, 8/9.