We focus on several learning approaches that employ a max operator to evaluate the margin; such approaches are commonly used in multi-class learning and top-rank learning tasks. In general, to estimate the theoretical generalization risk, we need to individually evaluate the complexity of each hypothesis class used by these learning approaches. In this paper, we provide a technique that estimates a theoretical generalization risk for such learning approaches in a unified fashion. The key idea is to "redundantly" reformulate the learning problem as one-class multiple-instance learning by redefining a specific input space based on the original input space. Surprisingly, we succeed in improving the generalization risk bounds for some multi-class learning and top-rank learning algorithms.
An Application of Multiple-Instance Learning
to Estimate Generalization Risk
Kyushu University / RIKEN
Many margin-based learning approaches, such as SVMs, have strong theoretical generalization risk bounds and work well in practice.
In this paper, we focus on learning approaches that define a margin based on a max operator. More precisely, we focus on two learning problems: the multi-class learning (MCL) problem and the top-rank learning (TRL) problem. MCL is a standard learning problem in which the goal of the learner is to find a function that predicts the label of unseen data. A typical approach is to find $k$ linear hyperplanes $w_1, \ldots, w_k$ for $k$-class classification, and to predict the label of data $x$ by $\hat{y} = \operatorname{argmax}_{y} w_y \cdot x$. In this approach, the margin for a labeled instance $(x, y)$ can be defined as $w_y \cdot x - \max_{y' \neq y} w_{y'} \cdot x$. TRL [13, 1, 10] is a variant of the bipartite ranking problem. Given positive and negative instances, the goal of the learner is to obtain a scoring function that gives high values only to the absolute top of the positive list. In other words, the learner wishes to maximize the number of positive instances ranked higher than the highest-ranked negative instance. When we consider a linear hypothesis class for the scoring function, the margin can be defined for a positive instance $x^+$ and a negative set $X^-$ as $w \cdot x^+ - \max_{x^- \in X^-} w \cdot x^-$.
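To make the two max-based margins concrete, here is a minimal Python sketch (ours, purely for illustration; the function names are not from the paper):

```python
# Illustrative sketch: the two max-based margins discussed above.
# All names here are hypothetical helpers, not part of any formal development.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def multiclass_margin(W, x, y):
    """Margin of labeled instance (x, y) for linear class scorers W[0..k-1]:
    W[y].x - max_{y' != y} W[y'].x."""
    return dot(W[y], x) - max(dot(W[yp], x) for yp in range(len(W)) if yp != y)

def toprank_margin(w, x_pos, X_neg):
    """Margin of a positive instance against the highest-scoring negative:
    w.x_pos - max_{x' in X_neg} w.x'."""
    return dot(w, x_pos) - max(dot(w, xn) for xn in X_neg)

W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # 3 classes, 2-dim inputs
print(multiclass_margin(W, [2.0, 1.0], 0))  # 2 - max(1, -3) = 1.0
print(toprank_margin([1.0, 0.0], [3.0, 0.0], [[1.0, 0.0], [2.0, 0.0]]))  # 3 - 2 = 1.0
```

Both margins are large exactly when the correct score beats the *best* competing score, which is the role the max operator plays throughout this paper.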
Since a different margin and a different hypothesis class are used in each problem, we need to evaluate the theoretical generalization risk individually. A typical approach to guaranteeing a generalization risk is to estimate the Rademacher complexity of the hypothesis class. The notion of Rademacher complexity allows us to easily estimate generalization bounds in several learning problems, but we still need to evaluate the Rademacher complexity of each individual learning approach. In this paper, based on the notion of Rademacher complexity, we propose a technique that estimates a theoretical generalization risk in a unified fashion for different learning approaches.
The key idea is to "redundantly" reformulate the learning problem as one-class Multiple-Instance Learning by redefining a specific input space based on the original input space. The Multiple-Instance Learning (MIL) problem is a fundamental problem in the machine learning field. A standard MIL setting is described as follows: the learner receives sets called bags, each of which contains multiple instances. In the training phase, the learner can observe the label of each bag, but cannot observe the labels of individual instances. The goal of the learner is to find a decision function that accurately predicts the labels of unseen bags. As a decision function, the following form is commonly used in practice: $h(B) = \max_{x \in B} w \cdot x$.
Our motivation for this idea is that the max operator of a hypothesis can play the role of the target margin if we define a suitably special bag. Roughly speaking, when we consider a max-based margin function $\rho$ and a non-max-based hypothesis $h$ for the target learning problem, there may exist a non-max-based margin function $\rho'$, a max-based hypothesis $h'$, and some bag $B$ that satisfy $\rho(h, x) = \rho'(h', B)$. If we can find $\rho'$, $h'$, and the definition of $B$, we can apply the existing Rademacher complexity bounds of MIL approaches (e.g., [14, 16]) instead of calculating that of the target learning approach.
Our derivation approach for the generalization risk is very simple and finishes in two steps. First, we reformulate the target learning problem as a one-class MIL problem. Second, we apply an existing Rademacher complexity bound of the MIL approach. In this paper, using this derivation approach, we derive the generalization risk of Multi-class SVM in MCL and of InfinitePush in TRL. We think it is an interesting result that three different learning problems (MCL, TRL, and MIL; see Figure 1) are connected by our approach.
The main contribution of this paper is summarized as follows:
We provide a derivation technique which allows us to easily estimate the generalization risk of some margin-based approaches.
We improve the generalization bounds of several learning algorithms in MCL and TRL.
2 Related works
In this paper, the MIL setting is the key to our theoretical derivation technique. Since Dietterich et al. first proposed MIL, many researchers have introduced various theories and applications of MIL [7, 2, 14, 17, 6, 4]. For example, the aforementioned decision function employing a max operator is widely used for local feature analysis [4, 16]. Sankaranarayanan and Davis proposed a real application of one-class MIL, which we use in this paper. The theoretical generalization performance of several MIL algorithms has been investigated [14, 16]. Our approach is thus a quite different application of MIL from the existing works.
To the best of our knowledge, the best known generalization bound for multi-class SVM is based on the Rademacher complexity. (A tighter bound is known via local Rademacher complexity, but it requires additional assumptions, e.g., strong convexity of the loss function; in this paper, we do not consider such assumptions and use only the global Rademacher complexity.) Previous work suggested that the generalization performance depends linearly on the class size. Lei et al. showed that, for the $p$-norm regularized multi-class SVM, the generalization bound depends only logarithmically on the class size for certain choices of $p$. However, for other choices (e.g., the standard 2-norm regularized multi-class SVM), the bound has a radical (square-root) dependence on the class size. Therefore, they suggested a learning algorithm for $p$-norm regularized multi-class SVM to control the complexity. Surprisingly, however, our generalization bound suggests that the 2-norm multi-class SVM also has a logarithmic dependence on the class size.
TRL is a variant of the bipartite ranking problem. In a typical bipartite ranking problem, the learner observes a binary-labeled sample and finds a ranking function that gives higher values to positive instances than to negative instances. In TRL, the learner wishes to find a ranking function that maximizes the number of positive instances ranked above the highest-ranked negative instance. Several approaches have been proposed for the TRL problem setting [13, 1, 10]. The latest theoretical generalization risk bound depends on the dimension $d$ of the input space and on the sizes $m$ and $n$ of the positive and negative samples, respectively. However, this bound depends strongly on the dimension (note that the dependence on the dimension is omitted in the main paper; see the full version of that paper). Thus, it may be an uninformative bound for high-dimensional data. By contrast, we show a generalization risk bound that has no direct dependence on the dimension $d$.
Figure 1: The connection among the three learning problems: MCL, TRL, and MIL.
3 Problem settings
In this section, we introduce three learning problems: the multi-class learning problem, the top-rank learning problem, and the (binary classification) multiple-instance learning problem. Many learning approaches have been provided for each problem; in this paper, we focus on margin-maximization approaches. First, we introduce the $\rho$-margin loss function, defined for any $z \in \mathbb{R}$ as
$$\Phi_\rho(z) = \min\left\{1, \max\left\{0, 1 - \frac{z}{\rho}\right\}\right\}.$$
3.1 Multi-class learning problem
Let $\mathcal{X} \subseteq \mathbb{R}^d$ be an instance space, and let $\mathcal{Y} = \{1, \ldots, k\}$ be an output space. The learner receives a set of labeled instances $S = ((x_1, y_1), \ldots, (x_m, y_m))$, where each instance is drawn i.i.d. according to some unknown distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$. Given a hypothesis set $H$ of functions $h = (h_1, \ldots, h_k)$ that map $\mathcal{X}$ to $\mathbb{R}^k$, the learner predicts the label of an unseen instance $x$ using the following mapping:
$$\hat{y} = \operatorname{argmax}_{y \in \mathcal{Y}} h_y(x).$$
The goal of the learner is to find a set of functions from $H$ with small expected margin loss:
$$R_\rho(h) = \mathbb{E}_{(x,y) \sim \mathcal{D}}\left[\Phi_\rho\left(h_y(x) - \max_{y' \neq y} h_{y'}(x)\right)\right].$$
The empirical margin loss for the multi-class learning problem can be defined as:
$$\hat{R}_\rho(h) = \frac{1}{m} \sum_{i=1}^{m} \Phi_\rho\left(h_{y_i}(x_i) - \max_{y' \neq y_i} h_{y'}(x_i)\right).$$
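A small Python sketch (ours, with hypothetical names) of the $\rho$-margin loss and the empirical multi-class margin loss:

```python
# Illustrative sketch: the rho-margin loss Phi_rho(z) = min(1, max(0, 1 - z/rho))
# and the empirical margin loss averaged over a labeled sample.

def phi(z, rho):
    """rho-margin loss: 1 for z <= 0, 0 for z >= rho, linear in between."""
    return min(1.0, max(0.0, 1.0 - z / rho))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def empirical_margin_loss(W, sample, rho):
    """Average Phi_rho of the multi-class margin over labeled pairs (x, y)."""
    total = 0.0
    for x, y in sample:
        margin = dot(W[y], x) - max(dot(W[yp], x) for yp in range(len(W)) if yp != y)
        total += phi(margin, rho)
    return total / len(sample)

W = [[1.0, 0.0], [0.0, 1.0]]
sample = [([2.0, 0.0], 0), ([0.0, 2.0], 1), ([1.0, 1.0], 0)]
# The first two points have margin 2 (zero loss); the third has margin 0 (loss 1).
print(empirical_margin_loss(W, sample, rho=1.0))  # → 0.3333333333333333
```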
In this paper, we consider the set of linear functions $h_y(x) = w_y \cdot x$ as the hypothesis set $H$.
Learning approach for MCL
There are many learning approaches for MCL. In this paper, we focus on Multi-class SVM, one of the most popular learning algorithms for MCL. The original optimization problem of Multi-class SVM is as follows:
OP 1: Multi-class SVM
$$\min_{w_1, \ldots, w_k,\, \xi} \ \frac{1}{2} \sum_{y=1}^{k} \|w_y\|^2 + C \sum_{i=1}^{m} \xi_i$$
$$\text{s.t.}\quad w_{y_i} \cdot x_i - w_y \cdot x_i \geq 1 - \xi_i \quad (\forall i, \forall y \neq y_i), \qquad \xi_i \geq 0 \quad (\forall i),$$
where $C$ is a constant hyper-parameter. Thus, the Multi-class SVM algorithm finds a regularized linear function; that is, the hypothesis set is represented as
$$H = \left\{ (x, y) \mapsto w_y \cdot x \ \middle|\ \sum_{y=1}^{k} \|w_y\|^2 \leq \Lambda^2 \right\}$$
for some constant $\Lambda > 0$.
3.2 Top-Rank learning problem
Let $\mathcal{X} \subseteq \mathbb{R}^d$ be an instance space. The learner receives a set of positive instances $X^+ = (x_1^+, \ldots, x_m^+)$ drawn i.i.d. according to some unknown distribution $\mathcal{D}_+$, and a set of negative instances $X^- = (x_1^-, \ldots, x_n^-)$ drawn i.i.d. according to some unknown distribution $\mathcal{D}_-$. Given a hypothesis set $F$ of functions that map $\mathcal{X}$ to $\mathbb{R}$, the goal of the learner is to find a function $f \in F$ with small expected misranking risk with a margin $\rho$:
$$R_\rho(f) = \mathbb{E}_{x \sim \mathcal{D}_+}\left[\Phi_\rho\left(f(x) - \sup_{x' \in \operatorname{supp}(\mathcal{D}_-)} f(x')\right)\right],$$
where $\operatorname{supp}(\mathcal{D}_-)$ denotes the support of $\mathcal{D}_-$. For any $f \in F$, the empirical risk of $f$ can be defined as:
$$\hat{R}_\rho(f) = \frac{1}{m} \sum_{i=1}^{m} \Phi_\rho\left(f(x_i^+) - \max_{1 \leq j \leq n} f(x_j^-)\right).$$
In this paper, we consider the set of linear functions $F = \{x \mapsto w \cdot x \mid \|w\| \leq \Lambda\}$ for the TRL problem.
Learning approach for TRL
We introduce InfinitePush, a support-vector-style algorithm for TRL. The optimization problem of InfinitePush is as follows:
OP 2: InfinitePush
$$\min_{w,\, \xi} \ \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{m} \xi_i$$
$$\text{s.t.}\quad w \cdot x_i^+ - w \cdot x_j^- \geq 1 - \xi_i \quad (\forall i, \forall j), \qquad \xi_i \geq 0 \quad (\forall i),$$
where $C$ is a constant hyper-parameter. InfinitePush finds a regularized linear function; that is, the hypothesis set is represented as
$$F = \left\{ x \mapsto w \cdot x \ \middle|\ \|w\| \leq \Lambda \right\}$$
for some constant $\Lambda > 0$.
3.3 Multiple-instance learning problem
Let $\mathcal{X} \subseteq \mathbb{R}^d$ be an instance space. A bag $B$ is a finite set of instances chosen from $\mathcal{X}$. The learner receives a sequence of (binary) labeled bags $S = ((B_1, Y_1), \ldots, (B_n, Y_n))$ called a sample, where each labeled bag is independently drawn according to some unknown distribution $\mathcal{D}$ over $2^{\mathcal{X}} \times \{-1, +1\}$. The goal of the learner is to find a hypothesis $h$ with small expected margin risk:
$$R_\rho(h) = \mathbb{E}_{(B, Y) \sim \mathcal{D}}\left[\Phi_\rho\left(Y h(B)\right)\right].$$
We define the empirical margin risk for any $h$ as
$$\hat{R}_\rho(h) = \frac{1}{n} \sum_{i=1}^{n} \Phi_\rho\left(Y_i h(B_i)\right).$$
In the multiple-instance learning problem, the following hypothesis class is commonly used in practice:
$$H_{\mathrm{MIL}} = \left\{ B \mapsto \max_{x \in B} w \cdot x \ \middle|\ \|w\| \leq \Lambda \right\}.$$
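The max-based bag hypothesis can be sketched in a few lines (an illustration of the common form above, with hypothetical helper names): a bag is scored by its best instance and classified by the sign of that score.

```python
# Illustrative sketch: the max-based MIL hypothesis h(B) = max_{x in B} w.x,
# with the bag classified by the sign of the score.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def bag_score(w, bag):
    # The max operator makes the bag positive iff at least one instance
    # lies on the positive side of the hyperplane w.
    return max(dot(w, x) for x in bag)

def predict(w, bag):
    return 1 if bag_score(w, bag) >= 0 else -1

w = [1.0, -1.0]
positive_bag = [[-2.0, 0.0], [3.0, 1.0]]  # one instance scores 3 - 1 = 2 > 0
negative_bag = [[-1.0, 0.0], [0.0, 2.0]]  # best score is -1 < 0
print(predict(w, positive_bag), predict(w, negative_bag))  # → 1 -1
```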
Learning approach for MIL
In this paper, we focus on Multiple-Instance SVM (MI-SVM)  as an algorithm for MIL problem. The optimization problem of MI-SVM (without bias) is as follows:
OP 3: Multi-Instance SVM
$$\min_{w,\, \xi} \ \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \xi_i$$
$$\text{s.t.}\quad Y_i \max_{x \in B_i} w \cdot x \geq 1 - \xi_i \quad (\forall i), \qquad \xi_i \geq 0 \quad (\forall i),$$
where $C$ is a constant hyper-parameter. The hypothesis set of MI-SVM can be represented as
$$H = \left\{ B \mapsto \max_{x \in B} w \cdot x \ \middle|\ \|w\| \leq \Lambda \right\}$$
for some constant $\Lambda > 0$.
4 One-class multiple-instance learning problem
In this section, we introduce our key learning problem, the One-Class Multiple-Instance Learning (OCMIL) problem. This is a simple extension of the binary-classification MIL problem. Moreover, we show an upper bound on its generalization risk obtained simply by applying existing theorems.
4.1 Problem setting
Let $\mathcal{X}$ be an instance space. A bag $B$ is a finite set of instances chosen from $\mathcal{X}$. The learner receives a sequence of bags $S = (B_1, \ldots, B_n)$, where each bag is independently drawn according to some unknown distribution $\mathcal{D}$ over $2^{\mathcal{X}}$. For any hypothesis $h$, we define the expected risk as follows:
$$R_\rho(h) = \mathbb{E}_{B \sim \mathcal{D}}\left[\Phi_\rho\left(-h(B)\right)\right].$$
We define the margin-based empirical risk for any $h$ as
$$\hat{R}_\rho(h) = \frac{1}{n} \sum_{i=1}^{n} \Phi_\rho\left(-h(B_i)\right).$$
For convenience, we assume that all bags are negative (i.e., the label of every bag is $-1$). Replacing all labels in OP 3 by $-1$, we obtain the optimization problem of One-Class MI-SVM as below.
OP 4: One-Class MI-SVM
$$\min_{w,\, \xi} \ \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \xi_i$$
$$\text{s.t.}\quad -\max_{x \in B_i} w \cdot x \geq 1 - \xi_i \quad (\forall i), \qquad \xi_i \geq 0 \quad (\forall i),$$
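As an illustration only (this is our toy sketch, not the paper's algorithm or an exact solver), the unconstrained hinge form of the one-class objective, $\frac{1}{2}\|w\|^2 + C \sum_i \max(0,\, 1 + \max_{x \in B_i} w \cdot x)$, can be minimized approximately with a crude subgradient loop:

```python
# Toy subgradient sketch (hypothetical) for the one-class MI-SVM objective
# (1/2)||w||^2 + C * sum_i max(0, 1 + max_{x in B_i} w.x), all bags negative.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def train(bags, C=1.0, lr=0.1, epochs=200):
    d = len(bags[0][0])
    w = [0.0] * d
    for _ in range(epochs):
        grad = list(w)  # gradient of the regularizer (1/2)||w||^2
        for bag in bags:
            x_star = max(bag, key=lambda x: dot(w, x))  # witness of the max
            if 1.0 + dot(w, x_star) > 0.0:  # hinge term is active
                for j in range(d):
                    grad[j] += C * x_star[j]
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w

bags = [[[1.0, 0.0], [0.5, 0.5]], [[0.0, 1.0]]]
w = train(bags)
# Every bag should end up on the negative side: max_{x in B} w.x < 0.
print([max(dot(w, x) for x in bag) for bag in bags])
```

The witness instance $x^\star$ is the only instance in a bag that contributes a subgradient, which is the computational signature of the max-based constraint.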
4.2 Generalization bound
To derive the generalization bound for OCMIL, we introduce the definition of Rademacher complexity.
Definition (The Rademacher complexity).
Given a sample $S = (z_1, \ldots, z_n)$, the empirical Rademacher complexity of a class $H$ w.r.t. $S$ is defined as
$$\mathfrak{R}_S(H) = \mathbb{E}_{\sigma}\left[\sup_{h \in H} \frac{1}{n} \sum_{i=1}^{n} \sigma_i h(z_i)\right],$$
where $\sigma = (\sigma_1, \ldots, \sigma_n)$ and each $\sigma_i$ is an independent uniform random variable in $\{-1, +1\}$.
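The expectation in this definition can be approximated numerically; the following Monte-Carlo sketch (ours, over a toy finite class where the sup is a plain max) may help build intuition:

```python
# Illustrative Monte-Carlo estimate of the empirical Rademacher complexity:
# draw random sign vectors sigma and average sup_h (1/n) sum_i sigma_i h(z_i).

import random

def empirical_rademacher(values_per_h, trials=2000, seed=0):
    """values_per_h[h][i] = h(z_i); the sup is a max over this finite class."""
    rng = random.Random(seed)
    n = len(values_per_h[0])
    total = 0.0
    for _ in range(trials):
        sigma = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        total += max(sum(s * v for s, v in zip(sigma, vals)) / n
                     for vals in values_per_h)
    return total / trials

# Two "hypotheses" evaluated on a 4-point sample; values are bounded by 1,
# so the estimate must lie in (0, 1).
H = [[1.0, -1.0, 1.0, -1.0], [0.5, 0.5, -0.5, -0.5]]
print(empirical_rademacher(H))
```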
Based on the Rademacher complexity, the following generalization bound is known: for a fixed $\rho > 0$ and any $\delta > 0$, with probability at least $1 - \delta$, for all $h \in H$,
$$R(h) \leq \hat{R}_\rho(h) + \frac{2}{\rho} \mathfrak{R}_S(H) + \sqrt{\frac{\log(1/\delta)}{2n}}.$$
For MI-SVM, two generalization bounds [14, 16] have been provided. In fact, the generalization bounds for binary-classification MI-SVM also hold for One-Class MI-SVM. This is because, for any bag $B$ with label $Y = -1$, the margin $Y h(B)$ equals $-h(B)$, so the risks coincide with the OCMIL risks defined above.
We show two incomparable generalization risk bounds for One-Class MI-SVM below.
Corollary 1 (Rademacher complexity bounds for One-Class MI-SVM).
The above two bounds are incomparable, but they share the same advantage: the Rademacher complexity depends only logarithmically on the total number of instances. Note that neither bound depends directly on the dimension. For simplicity, we refer to both bounds by a single unified expression in the sequel.
5 Connection between OCMIL and MCL
In this section and the next, we present our main results: OCMIL-based approaches for deriving generalization risk bounds. Our approach is very simple. First, we design specialized input bags for OCMIL according to the target learning problem. Table 1 summarizes the bag samples of OCMIL constructed from a given sample of the target problem.
Table 1: Bag samples of OCMIL for each target problem (columns: problem, original sample, bag, bag sample, bag size).
Next, we show the relationship between the Rademacher complexities of OCMIL and the target problem. Finally, we apply the generalization bound of One-Class MI-SVM to the target learning problem.
5.1 Specialized OCMIL setting for MCL
We consider a special OCMIL setting corresponding to an MCL problem.
We first prepare some definitions. For a pair of different integers $a$ and $b$ in $\{1, \ldots, k\}$ and an instance $x \in \mathbb{R}^d$, let $z_{a,b}(x) \in \mathbb{R}^{kd}$ denote the vector such that the block of elements from $(a-1)d+1$ to $ad$ is $x$, the block of elements from $(b-1)d+1$ to $bd$ is $-x$, and all other elements are zero. For example, the column vector of $z_{1,2}(x)$ is as follows:
$$z_{1,2}(x) = \left(x^\top, -x^\top, \mathbf{0}^\top, \ldots, \mathbf{0}^\top\right)^\top,$$
where $\mathbf{0}$ is the $d$-dimensional all-zero column vector. We define $\Phi_{\mathrm{MC}}$, which maps a labeled instance $(x, y)$ to the bag $B = \{z_{y',y}(x) \mid y' \in \mathcal{Y}, y' \neq y\}$.
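This construction can be checked numerically. The following sketch (ours, with illustrative names) builds the bag for a labeled instance and verifies that the max-based bag score, under the stacked weight vector $w = (w_1^\top, \ldots, w_k^\top)^\top$, equals the negated multi-class margin:

```python
# Toy check of the MCL-to-OCMIL mapping: for (x, y), build one kd-dim vector
# per y' != y, with +x in block y' and -x in block y; then
# max_z w.z = max_{y' != y} (w_{y'}.x - w_y.x) = -(multi-class margin).

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def make_mcl_bag(x, y, k):
    d = len(x)
    bag = []
    for yp in range(k):
        if yp == y:
            continue
        z = [0.0] * (k * d)
        z[yp * d:(yp + 1) * d] = x                # +x in block y'
        z[y * d:(y + 1) * d] = [-a for a in x]    # -x in block y
        bag.append(z)
    return bag

def bag_score(w, bag):  # max-based MIL hypothesis
    return max(dot(w, z) for z in bag)

W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
w = [c for row in W for c in row]  # stacked weight vector
x, y = [2.0, 1.0], 0
margin = dot(W[y], x) - max(dot(W[yp], x) for yp in range(3) if yp != y)
assert abs(bag_score(w, make_mcl_bag(x, y, 3)) + margin) < 1e-9
print("bag score equals negated multi-class margin")
```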
As we mention below, we derive a generalization bound for MCL by considering an OCMIL problem in which the bag sample is given by mapping the original MCL sample through $\Phi_{\mathrm{MC}}$.
5.2 Generalization bound of Multi-class SVM
Here we derive the generalization bound for Multi-class SVM by using the generalization bound of One-class Multiple-Instance SVM.
Theorem 2 (Generalization bound for Multi-class SVM).
Let $\rho > 0$ and let $S$ be an MCL sample of size $m$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, the expected risk of every Multi-class SVM hypothesis is bounded by the empirical margin loss plus the Rademacher complexity term of Corollary 1 for the mapped bag sample $\Phi_{\mathrm{MC}}(S)$; in particular, the bound depends only logarithmically on the class size $k$.
Under the MCL problem described in Section 3.1, we consider an OCMIL problem in which the learner receives the bag sample $\Phi_{\mathrm{MC}}(S)$ distributed according to the distribution induced by $\mathcal{D}$. Then, for any hypothesis and any sample $S$, the empirical Rademacher complexity of the MCL hypothesis class coincides with that of the corresponding OCMIL hypothesis class over the mapped bags.
Then, we have
5.3 Multiple-instance SVM and multi-class SVM
We now show an interesting result: the MI-SVM algorithm over the specialized setting coincides with the Multi-class SVM algorithm.
The max-based constraints of OP 4 can be decomposed into constraints on all instances in each bag. Then, we have
Replacing each bag instance by its representation using $z_{y',y}(x)$, we have:
OP 5: One-class MI-SVM for MCL
Therefore, the optimization problem of Multi-class SVM (OP 1) is equivalent to that of One-Class MI-SVM for MCL (OP 5) when given the specialized bag sample $\Phi_{\mathrm{MC}}(S)$. This fact does not immediately yield a new result, but it makes the relationship between MCL and MIL more convincing.
6 Connection between MIL and TRL
6.1 Specialized OCMIL setting for TRL
We define $\Phi_{\mathrm{TR}}$, which maps a positive instance $x^+$ and the negative set $X^-$ to the bag $B = \{x^- - x^+ \mid x^- \in X^-\}$. As in the MCL case, we consider an OCMIL problem in which the bag sample is given by mapping the original positive sample $X^+$ and negative sample $X^-$ in TRL through $\Phi_{\mathrm{TR}}$.
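As with the MCL mapping, this can be checked numerically; the following sketch (ours, illustrative names) forms the difference bag for each positive instance and verifies that the max-based bag score is the negation of the top-rank margin:

```python
# Toy check of the TRL-to-OCMIL mapping: for each positive x+, form the bag
# {x- - x+ : x- in X_neg}; then max_z w.z = max_j w.x_j^- - w.x+
# = -(top-rank margin of x+).

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def make_trl_bag(x_pos, X_neg):
    return [[xn_k - xp_k for xn_k, xp_k in zip(xn, x_pos)] for xn in X_neg]

def bag_score(w, bag):
    return max(dot(w, z) for z in bag)

w = [1.0, -0.5]
X_pos = [[3.0, 0.0], [1.0, 2.0]]
X_neg = [[1.0, 0.0], [2.0, 1.0]]
for xp in X_pos:
    margin = dot(w, xp) - max(dot(w, xn) for xn in X_neg)
    assert abs(bag_score(w, make_trl_bag(xp, X_neg)) + margin) < 1e-9
print("bag scores equal negated top-rank margins")
```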
6.2 Generalization bound for InfinitePush
Under the TRL problem described in Section 3.2, we consider an OCMIL problem in which the learner receives a bag sample of size $m$ constructed from the positive sample of size $m$ and the negative sample of size $n$ distributed according to $\mathcal{D}_+$ and $\mathcal{D}_-$, respectively. Let $\rho > 0$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, the expected misranking risk is bounded by the empirical margin risk plus the Rademacher complexity term of Corollary 1 over the mapped bag sample.
Under the TRL problem described in Section 3.2, we consider an OCMIL problem in which the learner receives the bag sample $\Phi_{\mathrm{TR}}(X^+, X^-)$ according to the induced unknown distribution. Then, for any hypothesis and any sample, the empirical Rademacher complexity of the TRL hypothesis class coincides with that of the corresponding OCMIL class over the mapped bags.
Then, we have
7 Comparison with the existing bounds
Comparison with the existing MCL approaches
Lei et al. proposed the $p$-norm regularized multi-class SVM, whose generalization bound depends on the class size $k$ through the choice of the regularization norm. Their bound indicates that the standard (i.e., 2-norm regularized) Multi-class SVM still has a radical (square-root) dependence on $k$. However, our bound suggests that the 2-norm regularized Multi-class SVM also depends only logarithmically on $k$.
7.1 Comparison with the existing TRL approaches
The generalization bound provided for InfinitePush is based on the covering number (see Theorem 5.1 of that paper). However, this bound is hard to evaluate in terms of the size $n$ of the negative sample, because it contains the sum of a term increasing in $n$ and a term decreasing in $n$. Li et al. proposed the TopPush algorithm (note that TopPush employs a slightly different loss function from InfinitePush and ours: a quadratic hinge loss for efficient optimization), and provided a generalization bound that depends on the dimension $d$ and the size $m$ of the positive sample. In contrast to InfinitePush, this bound has very little dependence on $n$. However, it depends strongly on the dimension $d$, and thus may be uninformative for very high-dimensional data. Our bound has little dependence on the number of negative instances $n$ and no direct dependence on the dimension $d$.
8.1 Non-linear case
The aforementioned derivations can basically be applied to the non-linear case by considering a reproducing kernel Hilbert space. Let $\phi$ denote a feature map associated with a kernel $\kappa$ for a Hilbert space $\mathbb{H}$, i.e., $\kappa(x, x') = \langle \phi(x), \phi(x') \rangle$. We replace each instance $x$ by $\phi(x)$ in the bag constructions. For example, in the MCL case, the column vector of $z_{1,2}(x)$ becomes
$$z_{1,2}(x) = \left(\phi(x)^\top, -\phi(x)^\top, \mathbf{0}^\top, \ldots, \mathbf{0}^\top\right)^\top,$$
where $\mathbf{0}$ is the all-zero column vector whose dimension corresponds to that of the feature vector $\phi(x)$.
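For a kernel with a small explicit feature map, the replacement is mechanical; here is a sketch (ours, assuming the degree-2 homogeneous polynomial kernel $\kappa(x, x') = (x \cdot x')^2$ in two dimensions, with illustrative helper names):

```python
# Illustrative non-linear extension via an explicit feature map: for
# k(x, x') = (x.x')^2 in 2 dims, phi(x) = (x1^2, sqrt(2) x1 x2, x2^2),
# and bag vectors are built from phi(x) instead of x.

import math

def phi(x):
    return [x[0] ** 2, math.sqrt(2.0) * x[0] * x[1], x[1] ** 2]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# phi reproduces the kernel: <phi(x), phi(x')> == (x.x')^2
x, xp = [1.0, 2.0], [3.0, -1.0]
assert abs(dot(phi(x), phi(xp)) - dot(x, xp) ** 2) < 1e-9

def make_kernel_bag(x_pos, X_neg):
    # TRL-style bag in feature space: differences phi(x-) - phi(x+)
    fp = phi(x_pos)
    return [[a - b for a, b in zip(phi(xn), fp)] for xn in X_neg]

print(make_kernel_bag([1.0, 0.0], [[0.0, 1.0]]))  # → [[-1.0, 0.0, 1.0]]
```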
8.2 Generality of our derivation technique
In this paper, we focus only on MCL and TRL. However, we believe that margin losses based on a max operator are employed in other learning problems as well. Even when they are, we do not know whether a generalization bound derived by our approach is tighter than the existing one. Nevertheless, our approach is very simple and makes it easy to derive generalization bounds for several learning problems.
In this paper, we showed a technique for deriving generalization bounds for several margin-based learning approaches in which the margin is based on a max operator. For different hypothesis classes in different learning problems, our derivation technique allows us to estimate a theoretical generalization risk in a unified fashion.
We focused on the multi-class learning and top-rank learning problems, and derived improved generalization risk bounds for each.
As future work, we would like to apply our derivation to other learning models, such as maximum entropy learning, on which we have already started.
-  Shivani Agarwal. The infinite push: A new support vector ranking algorithm that directly optimizes accuracy at the absolute top of the list. In Proceedings of the SIAM International Conference on Data Mining, 2011.
-  Stuart Andrews, Ioannis Tsochantaridis, and Thomas Hofmann. Support vector machines for multiple-instance learning. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 577–584. MIT Press, 2003.
-  Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2003.
-  Marc-André Carbonneau, Veronika Cheplygina, Eric Granger, and Ghyslain Gagnon. Multiple instance learning: A survey of problem characteristics and applications. Pattern Recognition, 77:329 – 353, 2018.
-  Thomas G. Dietterich, Richard H. Lathrop, and Tomás Lozano-Pérez. Solving the multiple instance problem with axis-parallel rectangles. Artificial Intelligence, 89(1-2):31–71, January 1997.
-  Gary Doran and Soumya Ray. A theoretical and empirical analysis of support vector machine methods for multiple-instance classification. Machine Learning, 97(1-2):79–102, October 2014.
-  Thomas Gärtner, Peter A. Flach, Adam Kowalczyk, and Alex J. Smola. Multi-instance kernels. In Proceedings of the 19th International Conference on Machine Learning, pages 179–186. Morgan Kaufmann, 2002.
-  Yunwen Lei, Urun Dogan, Alexander Binder, and Marius Kloft. Multi-class svms: From tighter data-dependent generalization bounds to novel algorithms. In Advances in Neural Information Processing Systems, pages 2035–2043, 2015.
-  Jian Li, Yong Liu, Rong Yin, Hua Zhang, Lizhong Ding, and Weiping Wang. Multi-class learning: from theory to algorithm. In Advances in Neural Information Processing Systems, pages 1586–1595, 2018.
-  Nan Li, Rong Jin, and Zhi-Hua Zhou. Top rank optimization in linear time. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 1502–1510. Curran Associates, Inc., 2014.
-  Nan Li, Rong Jin, and Zhi-Hua Zhou. Top rank optimization in linear time. arXiv preprint arXiv:1410.1462, 2014.
-  Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of machine learning. MIT press, 2018.
-  C Rudin. Ranking with a P-Norm Push. In Proceedings of 19th Annual Conference on Learning Theory, pages 589–604, 2006.
-  Sivan Sabato and Naftali Tishby. Multi-instance learning with any hypothesis class. Journal of Machine Learning Research, 13(1):2999–3039, October 2012.
-  Karthik Sankaranarayanan and James W Davis. One-class multiple instance learning and applications to target tracking. In Asian Conference on Computer Vision, pages 126–139. Springer, 2012.
-  Daiki Suehiro, Kohei Hatano, Eiji Takimoto, Shuji Yamamoto, Kenichi Bannai, and Akiko Takeda. Multiple-instance learning by boosting infinitely many shapelet-based classifiers. arXiv preprint arXiv:1811.08084, 2018.
-  Dan Zhang, Jingrui He, Luo Si, and Richard Lawrence. Mileage: Multiple instance learning with global embedding. In Sanjoy Dasgupta and David McAllester, editors, Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 82–90, Atlanta, Georgia, USA, 17–19 Jun 2013. PMLR.