Abstract
We focus on several learning approaches that employ the max operator to evaluate the margin. For example, such approaches are commonly used in the multiclass learning task and the top-rank learning task. In general, to estimate the theoretical generalization risk, we need to individually evaluate the complexity of each hypothesis class used by these learning approaches. In this paper, we provide a technique to estimate a theoretical generalization risk for such learning approaches in a unified fashion. The key idea is to "redundantly" reformulate the learning problem as one-class multiple-instance learning by redefining a specific input space based on the original input space. Surprisingly, we succeed in improving the generalization risk bounds for some multiclass learning and top-rank learning algorithms.
An Application of Multiple-Instance Learning to Estimate Generalization Risk
Daiki Suehiro
Kyushu University / RIKEN
1 Introduction
Many margin-based learning approaches, such as SVMs, have strong theoretical generalization risk bounds and work well in practice.
In this paper, we focus on learning approaches that define a margin based on the max operator. More precisely, we focus on two learning problems: the multiclass learning (MCL) problem and the top-rank learning (TRL) problem. MCL is a standard learning problem in which the goal of the learner is to find a function that predicts the label of unseen data. One typical approach [12] is to find $K$ linear hyperplanes $w_1, \dots, w_K$ for $K$-class classification and to predict the label by $\hat{y} = \operatorname{argmax}_{y} w_y \cdot x$ for data $x$. In this approach, the margin for a labeled example $(x, y)$ can be defined as $w_y \cdot x - \max_{y' \neq y} w_{y'} \cdot x$. TRL [13, 1, 10] is a variant of the bipartite ranking problem [12]. Given positive and negative instances, the goal of the learner is to obtain a scoring function that gives high values only to the absolute top of the positive list. In other words, the learner wishes to maximize the number of positive instances ranked higher than the highest-ranked negative instance. When we consider a linear hypothesis class for the scoring function, the margin for a positive instance $x^+$ and the negative set $\{x^-_1, \dots, x^-_n\}$ can be defined as $w \cdot x^+ - \max_{j} w \cdot x^-_j$.
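To make the two max-based margins concrete, they can be sketched in a few lines of code (an illustrative sketch; the names `mcl_margin` and `trl_margin` and the explicit list-based weight representation are our own, not notation from the paper):

```python
def dot(a, b):
    """Inner product of two vectors given as lists of floats."""
    return sum(ai * bi for ai, bi in zip(a, b))

def mcl_margin(W, x, y):
    """Multiclass margin: true-class score minus the best rival score,
    i.e., w_y . x - max_{y' != y} w_{y'} . x for per-class weights W."""
    scores = [dot(w, x) for w in W]
    rival = max(s for k, s in enumerate(scores) if k != y)
    return scores[y] - rival

def trl_margin(w, x_pos, X_neg):
    """Top-rank margin: positive score minus the highest negative score,
    i.e., w . x+ - max_j w . x-_j."""
    return dot(w, x_pos) - max(dot(w, xn) for xn in X_neg)
```

A positive `mcl_margin` means the example is classified correctly with that margin, and a positive `trl_margin` means the positive instance is ranked above every negative instance.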
Since different margins and different hypothesis classes are used in different problems, we need to evaluate the theoretical generalization risk individually [12]. One typical approach to guaranteeing a generalization risk is to estimate the Rademacher complexity [3] of the hypothesis class. The notion of Rademacher complexity allows us to easily estimate generalization bounds in several learning problems, but we still need to evaluate the Rademacher complexity for each individual learning approach. In this paper, based on the notion of Rademacher complexity, we propose a technique to estimate a theoretical generalization risk in the same fashion for different learning approaches.
The key idea is to "redundantly" reformulate the learning problem as one-class multiple-instance learning by redefining a specific input space based on the original input space. The multiple-instance learning (MIL) problem is a fundamental problem in the machine learning field. A standard MIL setting is described as follows. The learner receives sets called bags, each of which contains multiple instances. In the training phase, the learner can observe the label of each bag, but cannot observe the labels of the individual instances. The goal of the learner is to find a decision function that predicts the labels of unseen bags accurately. As a decision function, the following max-based form is commonly used in practice: $H(B) = \max_{x \in B} h(x)$ for a bag $B$ and an instance-level hypothesis $h$.
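The max-based decision rule can be sketched as follows (a minimal sketch; `bag_score` and `bag_label` are our own illustrative names):

```python
def bag_score(h, bag):
    """Max-based MIL decision: a bag's score is its best instance-level score,
    H(B) = max_{x in B} h(x)."""
    return max(h(x) for x in bag)

def bag_label(h, bag, threshold=0.0):
    """A bag is predicted positive iff some instance scores above the threshold."""
    return 1 if bag_score(h, bag) > threshold else -1
```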
Our motivation for this idea is that the max operator of the hypothesis can play the role of the target margin if we define a suitably special bag. Roughly speaking, when we consider a max-based margin function and a non-max-based hypothesis for the target learning problem, there may exist a non-max-based margin function, a max-based hypothesis, and some bag construction that together yield the same margin value. If we can find such a hypothesis, margin function, and bag definition, we can apply the existing Rademacher complexity results for MIL approaches (e.g., [14, 16]) instead of calculating the Rademacher complexity of the target learning approach.
Our derivation of the generalization risk is very simple and finishes in two steps. First, we reformulate the target learning problem as a one-class MIL problem. Second, we apply an existing Rademacher complexity bound of the MIL approach. In this paper, using this derivation, we derive the generalization risk of Multiclass SVM in MCL and of InfinitePush [1] in TRL. We think it is an interesting result that three different learning problems (see Figure 1), the MCL, TRL, and MIL problems, are connected by our approach.
Surprisingly, as a result, we can improve the existing generalization risk bounds thanks to the strong bounds known for MIL [14, 16].
The main contributions of this paper are summarized as follows:

- We provide a derivation technique which allows us to easily estimate the generalization risk of some margin-based approaches.

- We improve the generalization bounds of several learning algorithms in MCL and TRL.
2 Related work
2.1 MIL
In this paper, the MIL setting is the key to our theoretical derivation technique. Since Dietterich et al. first proposed MIL in [5], many researchers have introduced various theories and applications of MIL [7, 2, 14, 17, 6, 4]. For example, the aforementioned decision function employing the max operator is widely used for local feature analysis [4, 16]. Sankaranarayanan and Davis [15] proposed a real application of one-class MIL, which we use in this paper. The theoretical generalization performance of several MIL algorithms has been investigated [14, 16]. We can say that our approach is a quite different application of MIL from the existing works.
2.2 MCL
To our knowledge, the best known generalization bound for multiclass SVM is shown in [8], based on the Rademacher complexity ([9] shows a tighter bound by considering the local Rademacher complexity, which requires some additional assumptions, e.g., strong convexity of the loss function; in this paper, we do not consider such assumptions and consider only the global Rademacher complexity). In earlier work, it was said that the generalization performance depends linearly on the class size (see, e.g., [12]). The authors of [8] showed that, for $\ell_p$-norm regularized multiclass SVM, the generalization bound depends only logarithmically on the class size for certain values of $p$. However, in the remaining cases (e.g., for the standard 2-norm regularized multiclass SVM), the generalization bound has a radical dependence on the class size. Therefore, they suggested a learning algorithm for $\ell_p$-norm regularized multiclass SVM to control the complexity. Surprisingly, however, our generalization bound suggests that the 2-norm multiclass SVM also has a logarithmic dependence on the class size.
2.3 TRL
TRL is a variant of the bipartite ranking problem. In a typical bipartite ranking problem, the learner observes a binary-labeled sample and finds a ranking function that gives higher values to positive instances than to negative instances. In TRL, the learner wishes to find a ranking function that maximizes the number of positive instances ranked above the highest-ranked negative instance. Several approaches have been proposed for the TRL problem setting [13, 1, 10]. The latest theoretical generalization risk bound [10] depends on the dimension $d$ of the input space and on the sizes $m$ and $n$ of the positive and negative samples, respectively. However, this bound has a high dependence on the dimension (note that the dependence on the dimension is omitted in the main paper; see the full version [11]). Thus, it may be uninformative for high-dimensional data. By contrast, we show a generalization risk bound that does not have a direct dependence on the dimension $d$.
Figure 1: The connection among the three problems: (MCL), (TRL), and (MIL).
3 Settings
In this section, we introduce three learning problems: the multiclass learning problem, the top-rank learning problem, and the (binary classification) multiple-instance learning problem. For each problem, many learning approaches have been provided; in this paper, we focus on margin-maximization approaches. First, we introduce the margin loss function, defined for any $\rho > 0$ and $t \in \mathbb{R}$ as

$\Phi_\rho(t) = \min\left(1, \max\left(0, 1 - \frac{t}{\rho}\right)\right).$
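Assuming the standard ramp-type margin loss of [12], the function can be sketched as:

```python
def margin_loss(t, rho=1.0):
    """Ramp margin loss Phi_rho(t): 1 for t <= 0, linear on (0, rho), 0 for t >= rho."""
    return min(1.0, max(0.0, 1.0 - t / rho))
```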
3.1 Multiclass learning problem
Let $\mathcal{X}$ be an instance space and $\mathcal{Y} = \{1, \dots, K\}$ be an output space. The learner receives a set of labeled instances $S = ((x_1, y_1), \dots, (x_m, y_m))$, where each instance is drawn i.i.d. according to some unknown distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$. Given a hypothesis set $\mathcal{H}$ of functions that map $\mathcal{X} \times \mathcal{Y}$ to $\mathbb{R}$, the learner predicts the label of an unseen $x$ using the following mapping: $x \mapsto \operatorname{argmax}_{y \in \mathcal{Y}} h(x, y)$.
The goal of the learner is to find a function $h$ from $\mathcal{H}$ with small expected margin loss:

(1) $R_\rho(h) = \mathbb{E}_{(x, y) \sim \mathcal{D}} \left[ \Phi_\rho \left( h(x, y) - \max_{y' \neq y} h(x, y') \right) \right].$

The empirical margin loss for the multiclass learning problem can be defined as:

(2) $\hat{R}_\rho(h) = \frac{1}{m} \sum_{i=1}^{m} \Phi_\rho \left( h(x_i, y_i) - \max_{y' \neq y_i} h(x_i, y') \right).$
In this paper, we consider linear functions as the hypothesis set $\mathcal{H}$.
Learning approach for MCL
There are many learning approaches for MCL. In this paper, we focus on Multiclass SVM [12], which is one of the most popular learning algorithms for MCL. The original optimization problem of Multiclass SVM is as follows:

OP 1: Multiclass SVM

$\min_{w_1, \dots, w_K, \xi} \quad \frac{1}{2} \sum_{y=1}^{K} \|w_y\|^2 + C \sum_{i=1}^{m} \xi_i$

sub. to: $\quad w_{y_i} \cdot x_i - w_y \cdot x_i \geq 1 - \xi_i \quad (\forall i \in [m], \forall y \neq y_i), \qquad \xi_i \geq 0,$

where $C$ is a constant hyperparameter. Thus, the Multiclass SVM algorithm finds a regularized linear function; that is, the hypothesis set is represented as

$\mathcal{H}_{MC} = \left\{ (x, y) \mapsto w_y \cdot x : \sum_{y=1}^{K} \|w_y\|^2 \leq \Lambda^2 \right\}$

for any $\Lambda > 0$.
3.2 Top-rank learning problem
Let $\mathcal{X}$ be an instance space. The learner receives a positive sample $S^+ = (x^+_1, \dots, x^+_m)$ drawn i.i.d. according to some unknown distribution $\mathcal{D}^+$ and a negative sample $S^- = (x^-_1, \dots, x^-_n)$ drawn i.i.d. according to some unknown distribution $\mathcal{D}^-$. Given a hypothesis set $\mathcal{F}$ of functions that map $\mathcal{X}$ to $\mathbb{R}$, the goal of the learner is to find a function $f$ with small expected misranking risk with a margin $\rho$:

(3) $R_\rho(f) = \mathbb{E}_{x^+ \sim \mathcal{D}^+} \left[ \Phi_\rho \left( f(x^+) - \sup_{x^- \in \mathrm{supp}(\mathcal{D}^-)} f(x^-) \right) \right],$

where $\mathrm{supp}(\mathcal{D}^-)$ denotes the support of $\mathcal{D}^-$. For any $\rho > 0$, the empirical risk of $f$ can be defined as:

(4) $\hat{R}_\rho(f) = \frac{1}{m} \sum_{i=1}^{m} \Phi_\rho \left( f(x^+_i) - \max_{j \in [n]} f(x^-_j) \right).$
In this paper, we consider the set of linear functions $x \mapsto w \cdot x$ as the hypothesis set $\mathcal{F}$ for the TRL problem.
Learning approach for TRL
We introduce InfinitePush [1], which is a support-vector-style algorithm for TRL. The optimization problem of InfinitePush is as follows:

OP 2: InfinitePush

$\min_{w, \xi} \quad \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{m} \xi_i$

sub. to: $\quad w \cdot x^+_i - w \cdot x^-_j \geq 1 - \xi_i \quad (\forall i \in [m], \forall j \in [n]), \qquad \xi_i \geq 0,$

where $C$ is a constant hyperparameter. InfinitePush finds a regularized linear function; that is, the hypothesis set is represented as

$\mathcal{F} = \{ x \mapsto w \cdot x : \|w\| \leq \Lambda \}$

for any $\Lambda > 0$.
3.3 Multiple-instance learning problem
Let $\mathcal{X}$ be an instance space. A bag $B$ is a finite set of instances chosen from $\mathcal{X}$. The learner receives a sequence of (binary) labeled bags $S = ((B_1, Y_1), \dots, (B_N, Y_N))$ called a sample, where each labeled bag is independently drawn according to some unknown distribution $\mathcal{D}$ over $2^{\mathcal{X}} \times \{-1, +1\}$. The goal of the learner is to find a hypothesis $H$ with small expected margin risk:

$R_\rho(H) = \mathbb{E}_{(B, Y) \sim \mathcal{D}} \left[ \Phi_\rho(Y H(B)) \right].$

We define the empirical margin loss for any $\rho > 0$ as:

(5) $\hat{R}_\rho(H) = \frac{1}{N} \sum_{i=1}^{N} \Phi_\rho(Y_i H(B_i)).$

In the multiple-instance learning problem, the following hypothesis class is commonly used in practice:

$\mathcal{H}_{MI} = \left\{ B \mapsto \max_{x \in B} w \cdot x \right\}.$
Learning approach for MIL
In this paper, we focus on Multiple-Instance SVM (MISVM) [2] as an algorithm for the MIL problem. The optimization problem of MISVM (without bias) is as follows:

OP 3: Multi-Instance SVM

$\min_{w, \xi} \quad \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{N} \xi_i$

sub. to: $\quad Y_i \max_{x \in B_i} w \cdot x \geq 1 - \xi_i \quad (\forall i \in [N]), \qquad \xi_i \geq 0,$

where $C$ is a constant hyperparameter. The hypothesis set of MISVM can be represented as

$\mathcal{H}_{MI} = \left\{ B \mapsto \max_{x \in B} w \cdot x : \|w\| \leq \Lambda \right\}$

for any $\Lambda > 0$.
4 One-class multiple-instance learning problem
In this section, we introduce our key learning problem, the One-Class Multiple-Instance Learning (OCMIL) problem. This is a simple extension of the binary-classification MIL problem. Moreover, we show an upper bound on the generalization risk by simply using an existing theorem.
4.1 Problem setting
Let $\mathcal{X}$ be an instance space. A bag $B$ is a finite set of instances chosen from $\mathcal{X}$. The learner receives a sequence of bags $S = (B_1, \dots, B_N)$, where each bag is independently drawn according to some unknown distribution $\mathcal{D}$ over $2^{\mathcal{X}}$. For convenience, we assume that all bags are negative (i.e., the label of every bag is $-1$). For any hypothesis $H$, we define the expected risk as follows:

$R_\rho(H) = \mathbb{E}_{B \sim \mathcal{D}} \left[ \Phi_\rho(-H(B)) \right].$

We define the margin-based empirical risk for any $\rho > 0$ as:

(6) $\hat{R}_\rho(H) = \frac{1}{N} \sum_{i=1}^{N} \Phi_\rho(-H(B_i)).$

Replacing every $Y_i$ in OP 3 by $-1$, we obtain the optimization problem of One-Class MISVM as below.
OP 4: One-Class MISVM

$\min_{w, \xi} \quad \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{N} \xi_i$

sub. to: $\quad \max_{x \in B_i} w \cdot x \leq -1 + \xi_i \quad (\forall i \in [N]), \qquad \xi_i \geq 0.$
4.2 Generalization bound
To derive the generalization bound for OCMIL, we introduce the definition of Rademacher complexity.
Definition 1 (The Rademacher complexity [3]).
Given a sample $S = (z_1, \dots, z_m)$, the empirical Rademacher complexity of a class $\mathcal{H}$ w.r.t. $S$ is defined as

$\mathcal{R}_S(\mathcal{H}) = \mathbb{E}_{\sigma} \left[ \sup_{h \in \mathcal{H}} \frac{1}{m} \sum_{i=1}^{m} \sigma_i h(z_i) \right],$

where $\sigma = (\sigma_1, \dots, \sigma_m)$ and each $\sigma_i$ is an independent uniform random variable in $\{-1, +1\}$.
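For intuition, the empirical Rademacher complexity of a small finite hypothesis class can be approximated by Monte Carlo sampling over the sign vector (an illustrative sketch only; the hypothesis classes in this paper are infinite and are handled analytically):

```python
import random

def empirical_rademacher(H, S, n_trials=2000, seed=0):
    """Monte Carlo estimate of R_S(H) = E_sigma[ sup_h (1/m) sum_i sigma_i h(z_i) ]
    for a finite class H (list of functions) and a sample S (list of points)."""
    rng = random.Random(seed)
    m = len(S)
    total = 0.0
    for _ in range(n_trials):
        sigma = [rng.choice((-1, 1)) for _ in range(m)]
        total += max(sum(s * h(z) for s, z in zip(sigma, S)) / m for h in H)
    return total / n_trials
```

For example, the class $\{z \mapsto z,\ z \mapsto -z\}$ on a single point $z = 1$ can always match the sign $\sigma$, so its empirical Rademacher complexity is exactly 1, while a singleton class has complexity near 0.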
Based on the Rademacher complexity, the following generalization bound is known to hold [12]: for a fixed $\rho > 0$ and any $\delta > 0$, with probability at least $1 - \delta$, for all $h \in \mathcal{H}$ with margin loss $\Phi_\rho$,

$R(h) \leq \hat{R}_\rho(h) + \frac{2}{\rho} \mathcal{R}_S(\mathcal{H}) + 3 \sqrt{\frac{\log(2/\delta)}{2m}}.$
For MISVM, two generalization bounds, [14] and [16], have been provided. In fact, the generalization bounds for the binary-classification MISVM also hold for One-Class MISVM, because One-Class MISVM uses the same hypothesis class with every bag label fixed to $-1$.
We show the two incomparable generalization risk bounds of One-Class MISVM below.
Corollary 1 (Rademacher complexity bounds for OneClass MISVM).
The above two bounds are incomparable, but they share the same advantage: the Rademacher complexity depends only logarithmically on the total number of instances. Note that neither bound depends directly on the number of dimensions. For simplicity, we use a unified form of the two bounds, which we denote by (8).
5 Connection between OCMIL and MCL
In this section and the next, we show our main result: OCMIL-based approaches for deriving generalization risk bounds. Our approach is very simple. First, we consider specialized input bags in OCMIL according to the target learning problem. In Table 1, we summarize the bag samples of OCMIL given a sample of the target problem.
Problem | original sample | bag sample | bag size
MCL | labeled sample of size $m$ | one bag per labeled instance | $K - 1$
TRL | positive sample of size $m$, negative sample of size $n$ | one bag per positive instance | $n$
Next, we show the relationship of the Rademacher complexity between OCMIL and the target problem. Finally, we apply the generalization bound of One-Class MISVM to the target learning problem.
5.1 Specialized OCMIL setting for MCL
We consider a special OCMIL setting according to the MCL problem.
We first prepare some definitions. For a pair of different integers $y$ and $y'$ in $[K]$, let $v_{y, y'}(x)$ denote the $Kd$-dimensional vector such that the block of elements from $(y-1)d + 1$ to $yd$ is $-x$, the block of elements from $(y'-1)d + 1$ to $y'd$ is $x$, and all other elements are zero. For example, the column vector of $v_{1,2}(x)$ is as follows:

$v_{1,2}(x) = (-x^\top, x^\top, 0^\top, \dots, 0^\top)^\top,$

where $0$ is the $d$-dimensional all-zero column vector. We define $\Psi$ that maps $(x, y)$ to the bag $\{ v_{y, y'}(x) : y' \in [K], y' \neq y \}$.
As we mention below, we derive a generalization bound for MCL by considering the OCMIL problem in which the bag sample is given by applying $\Psi$ to the original sample in the MCL setting.
5.2 Generalization bound of Multiclass SVM
Here we derive the generalization bound for Multiclass SVM by using the generalization bound of One-Class Multiple-Instance SVM.
Theorem 2 (Generalization bound for Multiclass SVM).
Let $\mathcal{H}$ be the hypothesis set of Multiclass SVM. Then, for any $\delta > 0$, with high probability, the expected risk is bounded as:
(9) 
Proof.
Under the MCL problem described in Section 3.1, we consider the OCMIL problem in which the learner receives the bag sample obtained by applying $\Psi$ to a sample distributed according to $\mathcal{D}$. We then relate the empirical Rademacher complexity of the specialized OCMIL hypothesis class to that of $\mathcal{H}$.
Finally, we apply the bound (8). The coefficient of the theorem follows from the fact that $\|v_{y, y'}(x)\| = \sqrt{2} \|x\|$ and that the bound (8) depends linearly on the radius of the instances. ∎
5.3 Multiple-instance SVM and multiclass SVM
We would like to show an interesting result: the MISVM algorithm over the specialized setting coincides with the Multiclass SVM algorithm.
The max-based constraints of OP 4 can be decomposed into constraints on all instances in each bag. Then, we have

sub. to: $\quad w \cdot z \leq -1 + \xi_i \quad (\forall z \in B_i, \forall i), \qquad \xi_i \geq 0.$

Replacing each $z$ by its representation $v_{y_i, y'}(x_i)$, we have:

OP 5: One-Class MISVM for MCL

$\min_{w_1, \dots, w_K, \xi} \quad \frac{1}{2} \sum_{y=1}^{K} \|w_y\|^2 + C \sum_{i=1}^{m} \xi_i$

sub. to: $\quad w_{y_i} \cdot x_i - w_{y'} \cdot x_i \geq 1 - \xi_i \quad (\forall i \in [m], \forall y' \neq y_i), \qquad \xi_i \geq 0.$
Therefore, the optimization problem of Multiclass SVM (OP 1) is equivalent to that of One-Class Multi-Instance SVM for MCL (OP 5) when given the specialized bag sample. This fact does not immediately yield a valuable result, but it makes the relationship between MCL and MIL more convincing.
6 Connection between OCMIL and TRL
6.1 Specialized OCMIL setting for TRL
We define $\Psi'$ that maps a positive instance together with the negative sample, $(x^+, \{x^-_1, \dots, x^-_n\})$, to the bag $\{ x^-_j - x^+ : j \in [n] \}$. Similarly to the MCL case, we consider the OCMIL problem in which the bag sample is given by applying $\Psi'$ to the original samples in TRL.
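Our reading of this mapping can be sketched as follows (a sketch under our assumptions: forming one bag per positive instance that holds the differences $x^-_j - x^+$ is chosen so that the max over the bag equals the negated top-rank margin; `trl_bags` is our own name):

```python
def trl_bags(X_pos, X_neg):
    """One bag per positive instance; each bag holds x_neg - x_pos for every
    negative, so that max_{z in bag} w . z = max_j w . x-_j - w . x+."""
    return [[[xn - xp for xp, xn in zip(x_pos, x_neg)] for x_neg in X_neg]
            for x_pos in X_pos]

def dot(a, b):
    """Inner product of two vectors given as lists of floats."""
    return sum(ai * bi for ai, bi in zip(a, b))
```

Each bag then has size equal to the number of negative instances, with one bag per positive instance.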
6.2 Generalization bound for InfinitePush
Theorem 3 (Generalization bound for InfinitePush).
Under the TRL problem described in Section 3.2, consider the OCMIL problem in which the learner receives a bag sample constructed from a positive sample of size $m$ and a negative sample of size $n$ distributed according to $\mathcal{D}^+$ and $\mathcal{D}^-$, respectively. Let $\mathcal{F}$ be the hypothesis set of InfinitePush. Then, for any $\delta > 0$, with high probability, the expected risk is bounded as:

(10)
Proof.
Under the TRL problem described in Section 3.2, we consider the OCMIL problem in which the learner receives the bag sample obtained by applying $\Psi'$ to samples distributed according to $\mathcal{D}^+$ and $\mathcal{D}^-$. We then relate the empirical Rademacher complexity of the specialized OCMIL hypothesis class to that of $\mathcal{F}$.
Finally, we apply the bound (8). The coefficient of the theorem follows from the fact that $\|x^-_j - x^+\| \leq 2 \max(\|x^+\|, \|x^-_j\|)$ and that the bound (8) depends linearly on the radius of the instances. ∎
7 Comparison with the existing bounds
7.1 Comparison with the existing MCL approaches
Lei et al. [8] provided the $\ell_p$-norm regularized multiclass SVM, and their generalization bound depends on:

(11)

This bound indicates that a standard (i.e., 2-norm regularized) Multiclass SVM still has a radical dependence on the class size. However, our bound suggests that the 2-norm regularized Multiclass SVM also has only a logarithmic dependence on the class size.
7.2 Comparison with the existing TRL approaches
InfinitePush [1] provided a generalization bound based on the covering number (see Theorem 5.1 of [1]). However, this bound is hard to evaluate in terms of the size $n$ of the negative sample, because it contains the sum of a term that increases and a term that decreases with increasing $n$. Li et al. provided the TopPush algorithm [10] (note that TopPush employs a slightly different loss function from InfinitePush and ours: they employ the quadratic hinge loss for efficient optimization), and also provided a generalization bound which depends on the dimension $d$ and the size $m$ of the positive sample. In contrast to InfinitePush, this bound has very little dependence on $n$. However, it depends highly on the dimension $d$, and thus it may be uninformative for very high-dimensional data. Our bound has little dependence on the number of negative instances $n$ and does not have a direct dependence on the dimension $d$.
8 Discussion
8.1 Nonlinear case
The derivations above can basically be applied to the nonlinear case by considering a reproducing kernel Hilbert space. Let $\phi$ denote a feature map associated with a kernel $k$ for a Hilbert space $\mathbb{H}$, i.e., $k(x, x') = \phi(x) \cdot \phi(x')$. We replace each instance $x$ by $\phi(x)$ and each weight vector $w$ by an element of $\mathbb{H}$. For example, in the MCL case, the corresponding block vector becomes

$(-\phi(x)^\top, \phi(x)^\top, 0^\top, \dots, 0^\top)^\top,$

where $0$ is the zero column vector whose dimension corresponds to the dimension of the feature vector $\phi(x)$.
8.2 Generality of our derivation technique
In this paper, we focus only on MCL and TRL. However, we believe that margin losses based on the max operator are employed in other learning problems as well. Even if so, we do not know whether a generalization bound derived by our approach would be tighter than the existing one in each case. Nevertheless, we can say that our approach is very simple and makes it easy to derive generalization bounds for several learning problems.
9 Conclusion
In this paper, we showed a technique to derive generalization bounds for several margin-based learning approaches in which the margin is based on the max operator. For different hypothesis classes in different learning problems, our derivation technique allows us to estimate a theoretical generalization risk in the same fashion.
We focused on the multiclass learning and top-rank learning problems, and we derived improved generalization risk bounds for each.
As future work, we would like to apply our derivation to other learning models, such as maximum entropy learning [12], which we have already initiated.
References
 [1] Shivani Agarwal. The infinite push: A new support vector ranking algorithm that directly optimizes accuracy at the absolute top of the list. In Proceedings of the SIAM International Conference on Data Mining, 2011.
 [2] Stuart Andrews, Ioannis Tsochantaridis, and Thomas Hofmann. Support vector machines for multiple-instance learning. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 577–584. MIT Press, 2003.
 [3] Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2003.
 [4] Marc-André Carbonneau, Veronika Cheplygina, Eric Granger, and Ghyslain Gagnon. Multiple instance learning: A survey of problem characteristics and applications. Pattern Recognition, 77:329–353, 2018.
 [5] Thomas G. Dietterich, Richard H. Lathrop, and Tomás Lozano-Pérez. Solving the multiple instance problem with axis-parallel rectangles. Artificial Intelligence, 89(1-2):31–71, January 1997.
 [6] Gary Doran and Soumya Ray. A theoretical and empirical analysis of support vector machine methods for multiple-instance classification. Machine Learning, 97(1-2):79–102, October 2014.
 [7] Thomas Gärtner, Peter A. Flach, Adam Kowalczyk, and Alex J. Smola. Multi-instance kernels. In Proceedings of the 19th International Conference on Machine Learning, pages 179–186. Morgan Kaufmann, 2002.
 [8] Yunwen Lei, Urun Dogan, Alexander Binder, and Marius Kloft. Multi-class SVMs: From tighter data-dependent generalization bounds to novel algorithms. In Advances in Neural Information Processing Systems, pages 2035–2043, 2015.
 [9] Jian Li, Yong Liu, Rong Yin, Hua Zhang, Lizhong Ding, and Weiping Wang. Multiclass learning: from theory to algorithm. In Advances in Neural Information Processing Systems, pages 1586–1595, 2018.
 [10] Nan Li, Rong Jin, and Zhi-Hua Zhou. Top rank optimization in linear time. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 1502–1510. Curran Associates, Inc., 2014.
 [11] Nan Li, Rong Jin, and Zhi-Hua Zhou. Top rank optimization in linear time. arXiv preprint arXiv:1410.1462, 2014.
 [12] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of machine learning. MIT press, 2018.
 [13] Cynthia Rudin. Ranking with a P-Norm Push. In Proceedings of the 19th Annual Conference on Learning Theory, pages 589–604, 2006.
 [14] Sivan Sabato and Naftali Tishby. Multi-instance learning with any hypothesis class. Journal of Machine Learning Research, 13(1):2999–3039, October 2012.
 [15] Karthik Sankaranarayanan and James W Davis. Oneclass multiple instance learning and applications to target tracking. In Asian Conference on Computer Vision, pages 126–139. Springer, 2012.
 [16] Daiki Suehiro, Kohei Hatano, Eiji Takimoto, Shuji Yamamoto, Kenichi Bannai, and Akiko Takeda. Multiple-instance learning by boosting infinitely many shapelet-based classifiers. arXiv preprint arXiv:1811.08084, 2018.
 [17] Dan Zhang, Jingrui He, Luo Si, and Richard Lawrence. Mileage: Multiple instance learning with global embedding. In Sanjoy Dasgupta and David McAllester, editors, Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 82–90, Atlanta, Georgia, USA, 17–19 Jun 2013. PMLR.