Abstract
We study the problem of learning with label proportions, in which the training data is provided in groups and only the proportion of each class in each group is known. We propose a new method called proportion-SVM, or ∝SVM, which explicitly models the latent unknown instance labels together with the known group label proportions in a large-margin framework. Unlike existing works, our approach avoids making restrictive assumptions about the data. The ∝SVM model leads to a non-convex integer programming problem. In order to solve it efficiently, we propose two algorithms: one based on simple alternating optimization and the other based on a convex relaxation. Extensive experiments on standard datasets show that ∝SVM outperforms the state-of-the-art, especially for larger group sizes.
∝SVM for Learning with Label Proportions
Felix X. Yu yuxinnan@ee.columbia.edu
Dong Liu dongliu@ee.columbia.edu
Sanjiv Kumar sanjivk@google.com
Tony Jebara jebara@cs.columbia.edu
Shih-Fu Chang sfchang@cs.columbia.edu
Columbia University, New York, NY 10027
Google Research, New York, NY 10011
The problem of learning with label proportions has recently drawn attention in the learning community (Quadrianto et al., 2009; Rüeping, 2010). In this setting, the training instances are provided as groups or “bags”. For each bag, only the proportions of the labels are available. The task is to learn a model to predict labels of the individual instances.
Learning with label proportions raises multiple issues. On one hand, it enables interesting applications such as modeling voting behaviors from aggregated proportions across different demographic regions. On the other hand, the feasibility of such a learning method also raises concerns about the sensitive personal information that could potentially be leaked simply by observing label proportions.
To address this learning setting, this article explicitly models the unknown instance labels as latent variables. This alleviates the need for making restrictive assumptions on the data, either parametric or generative. We introduce a large-margin framework called proportion-SVM, or ∝SVM (∝ is the symbol for “proportional-to”), which jointly optimizes over the unknown instance labels and the known label proportions. In order to solve ∝SVM efficiently, we propose two algorithms: one based on simple alternating optimization, and the other based on a convex relaxation. We show that our approach outperforms the existing methods for various datasets and settings. The gains are especially large in the more challenging settings where the bag size is large.
MeanMap: Quadrianto et al. (2009) proposed a theoretically sound method to estimate the mean of each class using the mean of each bag and the label proportions. These estimates are then used in a conditional exponential model to maximize the log likelihood. The key assumption in MeanMap is that the class-conditional distribution of data is independent of the bags. Unfortunately, this assumption does not hold for many real-world applications. For example, in modeling voting behaviors, in which the bags are different demographic regions, the data distribution can be highly dependent on the bags.
Inverse Calibration (InvCal): Rüeping (2010) proposed treating the mean of each bag as a “super-instance”, which was assumed to have a soft label corresponding to the label proportion. The “super-instances” can be poor in representing the properties of the bags. Our work also utilizes a large-margin framework, but we explicitly model the instance labels. A detailed comparison with InvCal is given after the ∝SVM formulation.
Figure 1 provides a toy example to highlight the problems with MeanMap and InvCal, which are the state-of-the-art methods.
Related Learning Settings: In semi-supervised learning, Mann and McCallum (2007) and Bellare et al. (2009) used an expectation regularization term to encourage model predictions on the unlabeled data to match the given proportions. Similar ideas were also studied in the generalized regularization method (Gillenwater et al., 2011). Li et al. (2009a) proposed a variant of semi-supervised SVM to incorporate the label mean of the unlabeled data. Unlike semi-supervised learning, the learning setting we are considering requires no instance labels for training.
As an extension to multiple-instance learning, Kuck and de Freitas (2005) designed a hierarchical probabilistic model to generate consistent label proportions. Besides the inefficiency in optimization, the method was shown to be inferior to MeanMap (Quadrianto et al., 2009). Similar ideas have also been studied by Chen et al. (2006) and Musicant et al. (2007).
Stolpe and Morik (2011) proposed an evolutionary strategy paired with a labeling heuristic for clustering with label proportions. Different from clustering, the proposed ∝SVM framework jointly optimizes the latent instance labels and a large-margin classification model. The ∝SVM formulation is related to large-margin clustering (Xu et al., 2004), with an additional objective to utilize the label proportions. Specifically, the convex relaxation method we used is inspired by the works of Li et al. (2009a) and Xu et al. (2004).
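We consider a binary learning setting as follows. The training set $\{x_i\}_{i=1}^N$ is given in the form of $K$ bags,
$$\{x_i \mid i \in B_k\}_{k=1}^K, \qquad \cup_{k=1}^K B_k = \{1, \dots, N\}. \tag{1}$$
In this paper, we assume that the bags are disjoint, i.e., $B_k \cap B_l = \emptyset$, $\forall k \neq l$. The $k$-th bag is with label proportion $p_k$:
$$p_k := \frac{|\{i \mid i \in B_k,\ y_i^* = 1\}|}{|B_k|}, \tag{2}$$
in which $y_i^* \in \{-1, 1\}$ denotes the unknown ground-truth label of $x_i$, $i = 1, \dots, N$. We use $f(x) = \mathrm{sign}(w^T \varphi(x) + b)$ for predicting the binary label of an instance $x$, where $\varphi(\cdot)$ is a map of the input data.
We explicitly model the unknown instance labels as $y = (y_1, \dots, y_N)^T$, in which $y_i \in \{-1, 1\}$ denotes the unknown label of $x_i$, $i = 1, \dots, N$. Thus the label proportion of the $k$-th bag can be straightforwardly modeled as
$$\tilde{p}_k(y) = \frac{|\{i \mid i \in B_k,\ y_i = 1\}|}{|B_k|} = \frac{\sum_{i \in B_k} y_i}{2|B_k|} + \frac{1}{2}. \tag{3}$$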
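As a small illustration of the modeled label proportion, the sketch below computes the fraction of positive labels per bag in two equivalent ways: by counting, and by the closed form "sum of the ±1 labels divided by twice the bag size, plus one half". The function names and the toy data are hypothetical and only serve the example.

```python
def bag_proportions(labels, bags):
    """Fraction of +1 labels in each bag, obtained by counting.

    labels: list of +1/-1 values, one per instance.
    bags:   list of disjoint lists of instance indices.
    """
    return [sum(1 for i in bag if labels[i] == 1) / len(bag) for bag in bags]


def bag_proportions_from_sum(labels, bags):
    """Same quantity via the closed form sum(y_i) / (2 |B_k|) + 1/2."""
    return [sum(labels[i] for i in bag) / (2 * len(bag)) + 0.5 for bag in bags]


labels = [1, -1, 1, 1, -1, -1]
bags = [[0, 1, 2, 3], [4, 5]]
print(bag_proportions(labels, bags))           # [0.75, 0.0]
print(bag_proportions_from_sum(labels, bags))  # [0.75, 0.0]
```

The closed form matters later because it is linear in the labels, which is what makes the per-bag label update tractable.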
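We formulate the ∝SVM under the large-margin framework as below.
$$\min_{y, w, b}\ \frac{1}{2} w^T w + C \sum_{i=1}^N L(y_i, w^T \varphi(x_i) + b) + C_p \sum_{k=1}^K L_p(\tilde{p}_k(y), p_k) \quad \text{s.t.} \quad y_i \in \{-1, 1\},\ \forall i, \tag{4}$$
in which $L(\cdot) \geq 0$ is the loss function for classic supervised learning. $L_p(\cdot) \geq 0$ is a function to penalize the difference between the true label proportion and the estimated label proportion based on $y$. The task is to simultaneously optimize the labels $y$ and the model parameters $w$ and $b$.
The above formulation permits using different loss functions for $L(\cdot)$ and $L_p(\cdot)$. One can also add weights for different bags. Throughout this paper, we consider $L(\cdot)$ as the hinge loss, which is widely used for large-margin learning: $L(y_i, w^T \varphi(x_i) + b) = \max(0, 1 - y_i (w^T \varphi(x_i) + b))$. The alternating and convex-relaxation algorithms described later can be easily generalized to different $L(\cdot)$.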
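To make the three terms of the objective concrete, here is a minimal sketch that evaluates it for a linear model (so the feature map is the identity), with the hinge loss for the supervised term and the absolute difference for the proportion term. The function names, argument names, and toy numbers are assumptions made for the illustration.

```python
def hinge(y, score):
    """Hinge loss max(0, 1 - y * score) for a label y in {-1, +1}."""
    return max(0.0, 1.0 - y * score)


def psvm_objective(w, b, labels, X, bags, target_props, C, C_p):
    """Regularizer + C * hinge losses + C_p * proportion losses (linear model)."""
    regularizer = 0.5 * sum(wj * wj for wj in w)
    scores = [sum(wj * xj for wj, xj in zip(w, x)) + b for x in X]
    supervised = sum(hinge(y, s) for y, s in zip(labels, scores))
    proportion = 0.0
    for bag, p in zip(bags, target_props):
        p_hat = sum(1 for i in bag if labels[i] == 1) / len(bag)
        proportion += abs(p_hat - p)  # absolute loss between proportions
    return regularizer + C * supervised + C_p * proportion


X = [[2.0], [-2.0], [0.5], [-0.5]]
labels = [1, -1, 1, -1]          # current guess of the latent labels
bags = [[0, 1], [2, 3]]
value = psvm_objective([1.0], 0.0, labels, X, bags, [0.5, 1.0], C=1.0, C_p=2.0)
print(value)  # 0.5 + 1.0 * 1.0 + 2.0 * 0.5 = 2.5
```

Changing a single latent label moves both the supervised term and the proportion term, which is why the labels act as the coupling variable between the two.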
Compared to (Rüeping, 2010; Quadrianto et al., 2009), ∝SVM requires no restrictive assumptions on the data. In fact, in the special case where no label proportions are provided, ∝SVM becomes large-margin clustering (Xu et al., 2004; Li et al., 2009a), whose solution depends only on the data distribution. ∝SVM can naturally incorporate any amount of supervised data without modification. The labels for such instances will be observed variables instead of being hidden. ∝SVM can be easily extended to the multi-class case, similar to (Keerthi et al., 2012).
As stated in the discussion of related work, the Inverse Calibration method (InvCal) (Rüeping, 2010) treats the mean of each bag as a “super-instance”, which is assumed to have a soft label corresponding to the label proportion. It is formulated as below.
(5)  
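in which the $k$-th bag mean is $m_k = \frac{1}{|B_k|} \sum_{i \in B_k} \varphi(x_i)$, $k = 1, \dots, K$. Unlike ∝SVM, the proportion of the $k$-th bag is modeled on top of this “super-instance” $m_k$ as:
$$q_k := \left(1 + \exp(-w^T m_k - b)\right)^{-1}. \tag{6}$$
The second term of the objective function (5) tries to impose $q_k \approx p_k$, $\forall k$, albeit in an inverse way.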
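The gap between a proportion read off the bag mean and a proportion counted over instances can be seen in a tiny example. The sketch below assumes a linear model and a logistic link on the decision value of the bag mean (a common choice for inverse calibration, taken here as an assumption), and compares it with the fraction of instances the same model classifies as positive. All names and numbers are illustrative.

```python
import math


def bag_mean(X, bag):
    """Coordinate-wise mean of the instances in one bag (the "super-instance")."""
    dim = len(X[0])
    return [sum(X[i][j] for i in bag) / len(bag) for j in range(dim)]


def super_instance_proportion(w, b, X, bag):
    """Logistic function of the decision value at the bag mean (assumed link)."""
    m = bag_mean(X, bag)
    score = sum(wj * mj for wj, mj in zip(w, m)) + b
    return 1.0 / (1.0 + math.exp(-score))


def instance_proportion(w, b, X, bag):
    """Fraction of instances in the bag that the model labels as positive."""
    positives = sum(
        1 for i in bag if sum(wj * xj for wj, xj in zip(w, X[i])) + b > 0
    )
    return positives / len(bag)


# One bag with three mildly positive points and one strongly negative outlier.
X = [[1.0], [1.0], [1.0], [-9.0]]
bag = [0, 1, 2, 3]
print(instance_proportion([1.0], 0.0, X, bag))        # 0.75
print(super_instance_proportion([1.0], 0.0, X, bag))  # below 0.5: the mean is at -1.5
```

With high-variance data the bag mean can sit on the opposite side of the decision boundary from most of the bag, which is the failure mode the toy experiment in Figure 1 is built around.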
Though InvCal is shown to outperform other alternatives, including MeanMap (Quadrianto et al., 2009) and several simple large-margin heuristics, it has a crucial limitation. Note that (6) is not a good way of measuring the proportion predicted by the model, especially when the data has high variance, or the data distribution is dependent on the bags. In our formulation (4), by explicitly modeling the unknown instance labels $y$, the label proportion can be directly modeled as given in (3). The advantage of our method is illustrated in a toy experiment shown in Figure 1 (details are given with the experiments).
The ∝SVM formulation is fairly intuitive and straightforward. It is, however, a non-convex integer programming problem, which is NP-hard. Therefore, one key issue lies in how to find an efficient algorithm to solve it approximately. In this paper, we provide two solutions: a simple alternating optimization method, and a convex relaxation method.
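In ∝SVM, the unknown instance labels $y$ can be seen as a bridge between supervised learning loss and label proportion loss. Therefore, one natural way for solving (4) is via alternating optimization:
For a fixed $y$, the optimization of (4) with respect to $w$ and $b$ becomes a classic SVM problem.
When $w$ and $b$ are fixed, the problem becomes:
$$\min_{y}\ \sum_{i=1}^N L(y_i, w^T \varphi(x_i) + b) + \frac{C_p}{C} \sum_{k=1}^K L_p(\tilde{p}_k(y), p_k) \quad \text{s.t.} \quad y_i \in \{-1, 1\},\ \forall i. \tag{7}$$
We show that the second step above can be solved efficiently. Because the influence of each bag $\{y_i \mid i \in B_k\}$, $k = 1, \dots, K$, on the objective is independent, we can optimize (7) on each bag separately. In particular, solving $\{y_i \mid i \in B_k\}$ yields the following optimization problem:
$$\min_{\{y_i \mid i \in B_k\}}\ \sum_{i \in B_k} L(y_i, w^T \varphi(x_i) + b) + \frac{C_p}{C} L_p(\tilde{p}_k(y), p_k) \quad \text{s.t.} \quad y_i \in \{-1, 1\},\ \forall i \in B_k. \tag{8}$$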
Proposition 1
For a fixed $\tilde{p}_k(y) = \theta$, (8) can be optimally solved by the steps below.
Initialize $y_i = -1$, $i \in B_k$. The optimal solution can be obtained by flipping the signs as below.
By flipping the sign of $y_i$, $i \in B_k$, suppose the reduction of the first term in (8) is $\delta_i$. Sort $\delta_i$, $i \in B_k$. Then flip the signs of the top-$R$ $y_i$'s which have the highest reduction $\delta_i$, where $R = \theta |B_k|$.
For bag $B_k$, we only need to sort the corresponding $\delta_i$, $i \in B_k$, once. Sorting takes $O(|B_k| \log |B_k|)$ time. After that, for each $\theta \in \{0, \frac{1}{|B_k|}, \frac{2}{|B_k|}, \dots, 1\}$, the optimal solution can be computed incrementally, each taking $O(1)$ time. We then pick the solution with the smallest objective value, yielding the optimal solution of (8).
Proposition 2
Following the above steps, (7) can be solved in $O(N \log J)$ time, $J = \max_{k = 1, \dots, K} |B_k|$.
The proofs of the above propositions are given in the supplementary material.
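The sort-and-flip procedure of Proposition 1 can be sketched for a single bag as follows. The sketch assumes the hinge loss for the supervised term and the absolute loss for the proportion term; `scores` holds the decision values of the instances in the bag under the current model, and `ratio` stands for the relative weight of the proportion term. These names are hypothetical.

```python
def hinge(y, score):
    """Hinge loss max(0, 1 - y * score)."""
    return max(0.0, 1.0 - y * score)


def solve_bag_labels(scores, p, ratio):
    """Optimal labels for one bag given fixed decision values.

    scores: decision values of the instances in the bag.
    p:      target proportion of positive labels for the bag.
    ratio:  weight of the proportion loss relative to the hinge loss.
    Returns (labels, objective value).
    """
    n = len(scores)
    # Reduction of the hinge loss when label i is flipped from -1 to +1.
    delta = [hinge(-1, s) - hinge(+1, s) for s in scores]
    order = sorted(range(n), key=lambda i: delta[i], reverse=True)  # sort once
    loss = sum(hinge(-1, s) for s in scores)  # all labels initialized to -1
    best_obj, best_r = None, 0
    for r in range(n + 1):  # r = number of labels flipped to +1
        obj = loss + ratio * abs(r / n - p)
        if best_obj is None or obj < best_obj:
            best_obj, best_r = obj, r
        if r < n:
            loss -= delta[order[r]]  # O(1) incremental update
    labels = [-1] * n
    for i in order[:best_r]:
        labels[i] = 1
    return labels, best_obj


print(solve_bag_labels([2.0, 0.5, -0.5, -2.0], p=0.75, ratio=10.0))
# ([1, 1, 1, -1], 2.0): a heavy proportion weight forces three positives
print(solve_bag_labels([2.0, 0.5, -0.5, -2.0], p=0.75, ratio=0.1)[0])
# [1, 1, -1, -1]: a light proportion weight follows the classifier instead
```

For a fixed number of positives the best choice is always the instances with the largest reduction, so scanning all counts after one sort finds the per-bag optimum.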
By alternating between solving ($w$, $b$) and $y$, the objective is guaranteed to converge. This is due to the fact that the objective function is lower bounded, and non-increasing. In practice, we terminate the procedure when the objective no longer decreases (or if its decrease is smaller than a threshold). Empirically, the alternating optimization typically terminates within tens of iterations, but one obvious problem is the possibility of local solutions.
To alleviate this problem, similar to TSVM (Joachims, 1999; Chapelle et al., 2008), the proposed alter-∝SVM algorithm (Algorithm 1) takes an additional annealing loop to gradually increase $C$. Because the non-convexity of the objective function mainly comes from the second term of (4), the annealing can be seen as a “smoothing” step to protect the algorithm from sub-optimal solutions. Following (Chapelle et al., 2008), we set in Algorithm 1 throughout this work. The convergence and annealing are further discussed in the supplementary material.
In the implementation of alter-∝SVM, we consider $L_p(\cdot)$ as the absolute loss: $L_p(\tilde{p}_k(y), p_k) = |\tilde{p}_k(y) - p_k|$. Empirically, each alter-∝SVM loop given an annealing value terminates within a few iterations. From Proposition 2, optimizing $y$ has linear complexity in $N$ (when $J$ is small). Therefore the overall complexity of the algorithm depends on the SVM solver. Specifically, when linear SVM is used (Joachims, 2006), alter-∝SVM has linear complexity. In practice, to further alleviate the influence of the local solutions, similar to clustering, e.g., k-means, we repeat alter-∝SVM multiple times by randomly initializing $y$, and then picking the solution with the smallest objective value.
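Algorithm 1 itself is not reproduced in this text, so the following is only one plausible skeleton of an annealed alternating loop. The geometric schedule, its starting factor `1e-5`, the growth rate `delta=0.5`, and the callables `fit_svm`, `update_labels`, and `objective` are all assumptions for illustration: `fit_svm` stands for any SVM solver with the labels fixed, and `update_labels` for a per-bag label update with the model fixed.

```python
def annealing_schedule(c_final, start_factor=1e-5, delta=0.5):
    """Yield geometrically increasing weights that end exactly at c_final."""
    c = start_factor * c_final
    while c < c_final:
        c = min((1.0 + delta) * c, c_final)
        yield c


def alternate(y, c_final, fit_svm, update_labels, objective,
              tol=1e-4, max_iter=100):
    """Annealed alternating minimization over the model and the labels y."""
    model = None
    for c in annealing_schedule(c_final):
        previous = float("inf")
        for _ in range(max_iter):
            model = fit_svm(y, c)        # step 1: labels fixed, solve for the model
            y = update_labels(model, c)  # step 2: model fixed, solve for the labels
            current = objective(model, y, c)
            if previous - current < tol:  # stop once the decrease is negligible
                break
            previous = current
    return model, y


print(list(annealing_schedule(1.0, start_factor=0.1))[-1])  # 1.0
```

Restarting this loop from several random label vectors and keeping the run with the smallest objective mirrors the multiple-initialization strategy described above.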
In this section, we show that with proper relaxation of the ∝SVM formulation (4), the objective function can be transformed to a convex function of $M := yy^T$. We then relax the solution space of $M$ to its convex hull, leading to a convex optimization problem of $M$. The conv-∝SVM algorithm is proposed to solve the relaxed problem. Unlike alter-∝SVM, conv-∝SVM does not require multiple initializations. This method is motivated by the techniques used in large-margin clustering (Li et al., 2009b; Xu et al., 2004).
We change the label proportion term in the objective function (4) as a constraint $y \in \mathcal{Y}$, and we drop the bias term $b$ (Footnote 2: If the bias term is not dropped, there will be constraint $\alpha^T y = 0$ in the dual, leading to non-convexity. Such difficulty has also been discussed in (Xu et al., 2004). Fortunately, the effect of removing the bias term can be alleviated by zero-centering the data or augmenting the feature vector with an additional dimension with value 1.). Then, (4) is rewritten as:
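$$\min_{y \in \mathcal{Y}} \min_{w}\ \frac{1}{2} w^T w + C \sum_{i=1}^N L(y_i, w^T \varphi(x_i)), \qquad \mathcal{Y} = \left\{ y \,\middle|\, |\tilde{p}_k(y) - p_k| \leq \epsilon,\ y_i \in \{-1, 1\},\ \forall k = 1, \dots, K \right\}, \tag{9}$$
in which $\epsilon$ controls the compatibility of the label proportions. The constraint $y \in \mathcal{Y}$ can be seen as a special loss function:
$$L_p(\tilde{p}_k(y), p_k) = \begin{cases} 0, & \text{if } |\tilde{p}_k(y) - p_k| \leq \epsilon, \\ \infty, & \text{otherwise.} \end{cases} \tag{10}$$
We then write the inner problem of (9) as its dual:
$$\min_{y \in \mathcal{Y}} \max_{\alpha \in \mathcal{A}}\ -\frac{1}{2} \alpha^T \left( K \odot yy^T \right) \alpha + \alpha^T \mathbf{1}, \tag{11}$$
in which $\alpha \in \mathbb{R}^N$, $\odot$ denotes pointwise multiplication, $K$ is the kernel matrix with $K_{ij} = \varphi(x_i)^T \varphi(x_j)$, $\forall i, j = 1, \dots, N$, and $\mathcal{A} = \{\alpha \mid 0 \leq \alpha \leq C\}$.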
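The objective in (11) is non-convex in $y$, but convex in $M := yy^T$. So, following (Li et al., 2009b; Xu et al., 2004), we instead solve the optimal $M$. However, the feasible space of $M$ is
$$\mathcal{M}_0 = \{ yy^T \mid y \in \mathcal{Y} \}, \tag{12}$$
which is a non-convex set. In order to get a convex optimization problem, we relax $\mathcal{M}_0$ to its convex hull, the tightest convex relaxation of $\mathcal{M}_0$:
$$\mathcal{M} = \Big\{ \sum_{y \in \mathcal{Y}} \mu_{(y)}\, yy^T \,\Big|\, \mu \in \mathcal{U} \Big\}, \tag{13}$$
in which $\mathcal{U} = \{ \mu \mid \sum_{y \in \mathcal{Y}} \mu_{(y)} = 1,\ \mu_{(y)} \geq 0 \}$.
Thus solving the relaxed $M$ is identical to finding $\mu$:
$$\min_{\mu \in \mathcal{U}} \max_{\alpha \in \mathcal{A}}\ -\frac{1}{2} \alpha^T \Big( \sum_{y \in \mathcal{Y}} \mu_{(y)}\, K \odot yy^T \Big) \alpha + \alpha^T \mathbf{1}. \tag{14}$$
(14) can be seen as Multiple Kernel Learning (MKL) (Bach et al., 2004), which is a widely studied problem. However, because $|\mathcal{Y}|$ is very large, it is not tractable to solve (14) directly.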
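Fortunately, we can assume that at optimality only a small number of $y$'s are active in (14). Define $\mathcal{Y}_{\text{active}} \subset \mathcal{Y}$ as the set containing all the active $y$'s. We show that $y \in \mathcal{Y}_{\text{active}}$ can be incrementally found by the cutting plane method.
Because the objective function of (14) is convex in $\mu$, and concave in $\alpha$, it is equivalent to (Fan, 1953),
$$\max_{\alpha \in \mathcal{A}} \min_{\mu \in \mathcal{U}}\ -\frac{1}{2} \alpha^T \Big( \sum_{y \in \mathcal{Y}} \mu_{(y)}\, K \odot yy^T \Big) \alpha + \alpha^T \mathbf{1}. \tag{15}$$
It is easy to verify that the above is equivalent to:
$$\max_{\alpha \in \mathcal{A},\, \beta}\ -\beta \tag{16}$$
$$\text{s.t.} \quad \beta \geq \frac{1}{2} \alpha^T \left( K \odot yy^T \right) \alpha - \alpha^T \mathbf{1}, \quad \forall y \in \mathcal{Y}.$$
This form enables us to apply the cutting plane method (Kelley Jr, 1960) to incrementally include the most violated $y$ into $\mathcal{Y}_{\text{active}}$, and then solve the MKL problem, (14) with $\mathcal{Y}$ replaced as $\mathcal{Y}_{\text{active}}$. The above can be repeated until no violated $y$ exists.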
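In the cutting plane training, the critical step is to obtain the most violated $y \in \mathcal{Y}$:
$$\arg\max_{y \in \mathcal{Y}}\ \alpha^T \left( K \odot yy^T \right) \alpha, \tag{17}$$
which is equivalent to
$$\arg\max_{y \in \mathcal{Y}}\ \sum_{i, j = 1}^N \alpha_i \alpha_j y_i y_j\, \varphi(x_i)^T \varphi(x_j). \tag{18}$$
This is a 0/1 concave QP, for which there exists no efficient solution. However, instead of finding the most violated constraint, if we find any violated constraint $y$, the objective function still decreases. We therefore relax the objective in (18), which can be solved efficiently. Note that the objective of (18) is equivalent to an $\ell_2$ norm $\| \sum_{i=1}^N \alpha_i y_i \varphi(x_i) \|_{\ell_2}$. Following (Li et al., 2009b), we approximate it as the $\ell_\infty$ norm:
$$\Big\| \sum_{i=1}^N \alpha_i y_i \varphi(x_i) \Big\|_{\ell_\infty} := \max_{j = 1, \dots, d} \Big| \sum_{i=1}^N \alpha_i y_i x_i^{(j)} \Big|, \tag{19}$$
in which $x_i^{(j)}$ is the $j$-th dimension of the $i$-th feature vector. These can be obtained by eigendecomposition of the kernel matrix $K$, when a nonlinear kernel is used. The computational complexity is $O(dN^2)$. In practice, we choose $d$ such that 90% of the variance is preserved. We further rewrite (19) as:
$$\max_{j = 1, \dots, d}\ \max\Big( \sum_{k=1}^K \max_{y \in \mathcal{Y}} \sum_{i \in B_k} \alpha_i y_i x_i^{(j)},\ \sum_{k=1}^K \max_{y \in \mathcal{Y}} \sum_{i \in B_k} -\alpha_i y_i x_i^{(j)} \Big). \tag{20}$$
Therefore the approximation from (18) to (19) enables us to consider each dimension and each bag separately. For the $j$-th dimension, and the $k$-th bag, we only need to solve two subproblems $\max_{y \in \mathcal{Y}} \sum_{i \in B_k} \alpha_i y_i x_i^{(j)}$, and $\max_{y \in \mathcal{Y}} \sum_{i \in B_k} -\alpha_i y_i x_i^{(j)}$. The former, as an example, can be written as
$$\min_{\{y_i \mid i \in B_k\}}\ \sum_{i \in B_k} \left( -\alpha_i x_i^{(j)} \right) y_i \quad \text{s.t.} \quad |\tilde{p}_k(y) - p_k| \leq \epsilon,\ y_i \in \{-1, 1\}. \tag{21}$$
This can be solved in the same way as (8), which takes $O(|B_k| \log |B_k|)$ time. Because we have $d$ dimensions, similar to Proposition 2, one can show that: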
Proposition 3
(18) with the $\ell_2$ norm approximated as the $\ell_\infty$ norm can be solved in $O(dN \log J)$ time, $J = \max_{k = 1, \dots, K} |B_k|$.
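The per-dimension, per-bag subproblem is a linear objective in the ±1 labels under a proportion constraint, so it can be handled with the same sort-and-scan idea used for the alternating method. The sketch below is a hypothetical helper: `coeffs` stands for the per-instance coefficients of one bag in one dimension (the dual variable times the feature value), and `eps` for the allowed deviation from the target proportion `p`.

```python
import math


def max_linear_under_proportion(coeffs, p, eps):
    """Maximize sum(c_i * y_i) over y_i in {-1, +1} subject to
    |fraction of +1 labels - p| <= eps. Returns (value, labels) or None."""
    n = len(coeffs)
    # Feasible numbers of positive labels (small slack guards float rounding).
    lo = max(0, math.ceil((p - eps) * n - 1e-9))
    hi = min(n, math.floor((p + eps) * n + 1e-9))
    if lo > hi:
        return None
    order = sorted(range(n), key=lambda i: coeffs[i], reverse=True)
    value = -sum(coeffs)  # objective when every label is -1
    best_value, best_r = None, 0
    for r in range(hi + 1):
        if r >= lo and (best_value is None or value > best_value):
            best_value, best_r = value, r
        if r < hi:
            value += 2.0 * coeffs[order[r]]  # flipping a label adds 2 * c_i
    labels = [-1] * n
    for i in order[:best_r]:
        labels[i] = 1
    return best_value, labels


coeffs = [3.0, -1.0, 2.0, -4.0]
print(max_linear_under_proportion(coeffs, p=0.5, eps=0.0))
# (10.0, [1, -1, 1, -1])
```

Calling the helper once with `coeffs` and once with the negated coefficients gives the two subproblems needed for the absolute value in the max-norm approximation.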
The overall algorithm, called conv-∝SVM, is shown in Algorithm 2. Following (Li et al., 2009b), we use an adapted SimpleMKL algorithm (Rakotomamonjy et al., 2008) to solve the MKL problem.
As an additional step, we need to recover $y$ from $M$. This is achieved by rank-1 approximation of $M$ (as $yy^T$). (Footnote 3: Note that $yy^T = (-y)(-y)^T$. This ambiguity can be resolved by validation on the training bags.) Because of the convex relaxation, the computed $y$ is not binary. However, we can use the real-valued $y$ directly in our prediction model (with dual):
$$f(x) = \mathrm{sign}\Big( \sum_{i=1}^N \alpha_i y_i\, k(x_i, x) \Big), \qquad k(x_i, x) = \varphi(x_i)^T \varphi(x). \tag{22}$$
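A dual-form predictor that accepts real-valued labels can be sketched in a few lines. The kernel functions, the tie-breaking rule at a score of exactly zero, and the toy numbers are assumptions for the illustration.

```python
import math


def linear_kernel(u, v):
    """Plain inner product."""
    return sum(a * b for a, b in zip(u, v))


def rbf_kernel(u, v, gamma=1.0):
    """Gaussian kernel exp(-gamma * ||u - v||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def predict(x, train_points, alphas, ys, kernel=linear_kernel):
    """Sign of sum_i alpha_i * y_i * k(x_i, x); ys may be real-valued."""
    score = sum(a * y * kernel(xi, x) for a, y, xi in zip(alphas, ys, train_points))
    return 1 if score >= 0 else -1  # ties are sent to +1 by convention here


train_points = [[1.0], [-1.0]]
alphas = [1.0, 1.0]
relaxed_labels = [0.8, -0.6]  # not binary, as produced by a relaxation
print(predict([2.0], train_points, alphas, relaxed_labels))   # 1
print(predict([-2.0], train_points, alphas, relaxed_labels))  # -1
```

The magnitude of a relaxed label simply scales that training point's vote, so no rounding step is required before prediction.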
Similar to alter-∝SVM, the objective of conv-∝SVM is guaranteed to converge. In practice, we terminate the algorithm when the decrease of the objective is smaller than a threshold. Typically the SimpleMKL converges in less than 5 iterations, and conv-∝SVM terminates in less than 10 iterations. The SimpleMKL takes (computing the gradient) time, or the complexity of SVM, whichever is higher. Recovering $y$ takes time and computing eigendecomposition with the first $d$ singular values takes $O(dN^2)$ time.
MeanMap (Quadrianto et al., 2009) was shown to outperform alternatives including kernel density estimation, discriminative sorting and MCMC (Kuck and de Freitas, 2005). InvCal (Rüeping, 2010) was shown to outperform MeanMap and several large-margin alternatives. Therefore, in the experiments, we only compare our approach with MeanMap and InvCal.
To visually demonstrate the advantage of our approach, we first show an experiment on a toy dataset with two bags. Figure 1 (a) and (b) show the data of the two bags, and Figure 1 (c) and (d) show the learned separating hyperplanes from different methods. A linear kernel is used in this experiment. For this specific dataset, the restrictive data assumptions of MeanMap and InvCal do not hold: the mean of the first bag (60% positive) is on the “negative side”, whereas the mean of the second bag (40% positive) is on the “positive side”. Consequently, both MeanMap and InvCal completely fail, with a classification accuracy of 0%. On the other hand, our method, which does not make strong data assumptions, achieves perfect performance with 100% accuracy.
Datasets. We compare the performance of different techniques on various datasets from the UCI repository (http://archive.ics.uci.edu/ml/) and the LibSVM collection (http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/). The details of the datasets are listed in Table 1.
In this paper, we focus on the binary classification settings. For the datasets with multiple classes (dna and satimage), we test the one-vs-rest binary classification performance, by treating data from one class as positive, and randomly selecting the same amount of data from the remaining classes as negative. For each dataset, the attributes are scaled to .
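Table 1: Details of the datasets.

| Dataset | Size | Attributes | Classes |
|---|---|---|---|
| heart | 270 | 13 | 2 |
| heart-c | 303 | 13 | 2 |
| colic | 366 | 22 | 2 |
| vote | 435 | 16 | 2 |
| breast-cancer | 683 | 10 | 2 |
| australian | 690 | 14 | 2 |
| credit-a | 690 | 15 | 2 |
| breast-w | 699 | 9 | 2 |
| a1a | 1,605 | 119 | 2 |
| dna | 2,000 | 180 | 3 |
| satimage | 4,435 | 36 | 6 |
| cod-rna.t | 271,617 | 8 | 2 |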
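Table 2: Accuracy (%, mean ± standard deviation) with linear kernel, for bag sizes 2, 4, 8, 16, 32, 64.

| Dataset | Method | 2 | 4 | 8 | 16 | 32 | 64 |
|---|---|---|---|---|---|---|---|
| heart | MeanMap | 81.85±1.28 | 80.39±0.47 | 79.63±0.83 | 79.46±1.46 | 79.00±1.42 | 76.06±1.25 |
| heart | InvCal | 81.78±0.55 | 80.98±1.35 | 79.45±3.07 | 76.94±3.26 | 73.76±2.69 | 73.04±6.46 |
| heart | alter-∝SVM | 83.41±0.71 | 81.80±1.25 | 79.91±2.11 | 79.69±0.64 | 77.80±2.52 | 76.58±2.00 |
| heart | conv-∝SVM | 83.33±0.59 | 80.61±2.48 | 81.00±0.75 | 80.72±0.82 | 79.32±1.14 | 79.40±0.72 |
| colic | MeanMap | 80.00±0.80 | 76.14±1.69 | 75.52±0.72 | 74.17±1.61 | 76.10±1.92 | 76.74±6.10 |
| colic | InvCal | 81.25±0.24 | 78.82±3.24 | 77.34±1.62 | 74.84±4.14 | 69.63±4.12 | 69.47±6.06 |
| colic | alter-∝SVM | 81.42±0.02 | 80.79±1.48 | 79.59±1.38 | 79.40±1.06 | 78.59±3.32 | 78.49±2.93 |
| colic | conv-∝SVM | 81.42±0.02 | 80.63±0.77 | 78.84±1.32 | 77.98±1.14 | 77.49±0.66 | 76.94±1.07 |
| vote | MeanMap | 87.76±0.20 | 91.90±1.89 | 90.84±2.33 | 88.72±1.45 | 87.63±0.26 | 88.42±0.80 |
| vote | InvCal | 95.57±0.11 | 95.57±0.42 | 94.43±0.24 | 94.00±0.61 | 91.47±2.57 | 91.13±1.07 |
| vote | alter-∝SVM | 95.62±0.33 | 96.09±0.41 | 95.56±0.47 | 94.23±1.35 | 91.97±1.56 | 92.12±1.20 |
| vote | conv-∝SVM | 91.66±0.19 | 90.80±0.34 | 89.55±0.25 | 88.87±0.37 | 88.95±0.39 | 89.07±0.24 |
| australian | MeanMap | 86.03±0.39 | 85.62±0.17 | 84.08±1.36 | 83.70±1.45 | 83.96±1.96 | 82.90±1.96 |
| australian | InvCal | 85.42±0.28 | 85.80±0.37 | 84.99±0.68 | 83.14±2.54 | 80.28±4.29 | 80.53±6.18 |
| australian | alter-∝SVM | 85.42±0.30 | 85.60±0.39 | 85.49±0.78 | 84.96±0.96 | 85.29±0.92 | 84.47±2.01 |
| australian | conv-∝SVM | 85.51±0.00 | 85.54±0.08 | 85.90±0.54 | 85.67±0.24 | 85.67±0.81 | 85.47±0.89 |
| dna-1 | MeanMap | 86.38±1.33 | 82.71±1.26 | 79.89±1.55 | 78.46±0.53 | 80.20±1.44 | 78.83±1.73 |
| dna-1 | InvCal | 93.05±1.45 | 90.81±0.87 | 86.27±2.43 | 81.58±3.09 | 78.31±3.28 | 72.98±2.33 |
| dna-1 | alter-∝SVM | 94.93±1.05 | 94.31±0.62 | 92.86±0.78 | 90.72±1.35 | 90.84±0.52 | 89.41±0.97 |
| dna-1 | conv-∝SVM | 92.78±0.66 | 90.08±1.18 | 85.38±2.05 | 84.91±2.43 | 82.77±3.30 | 85.66±0.20 |
| dna-2 | MeanMap | 88.45±0.68 | 83.06±1.68 | 78.69±2.11 | 79.94±5.68 | 79.72±3.73 | 74.73±4.26 |
| dna-2 | InvCal | 93.30±0.88 | 90.32±1.89 | 87.30±1.80 | 83.17±2.18 | 79.47±2.55 | 76.85±3.42 |
| dna-2 | alter-∝SVM | 94.74±0.56 | 94.49±0.46 | 93.06±0.85 | 91.82±1.59 | 90.81±1.55 | 90.08±1.45 |
| dna-2 | conv-∝SVM | 94.35±1.01 | 92.08±1.48 | 89.72±1.26 | 88.27±1.87 | 87.58±1.54 | 86.55±1.18 |
| satimage-2 | MeanMap | 97.21±0.38 | 96.27±0.77 | 95.85±1.12 | 94.65±0.31 | 94.49±0.37 | 94.52±0.28 |
| satimage-2 | InvCal | 88.41±3.14 | 94.65±0.56 | 94.70±0.20 | 94.49±0.31 | 92.90±1.05 | 93.82±0.60 |
| satimage-2 | alter-∝SVM | 97.83±0.51 | 97.75±0.43 | 97.52±0.48 | 97.52±0.51 | 97.51±0.20 | 97.11±0.26 |
| satimage-2 | conv-∝SVM | 96.87±0.23 | 96.63±0.09 | 96.40±0.22 | 96.87±0.38 | 96.29±0.40 | 96.50±0.38 |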
Experimental Setup. Following (Rüeping, 2010), we first randomly split the data into bags of a fixed size. Bag sizes of 2, 4, 8, 16, 32, 64 are tested. We then conduct experiments with 5-fold cross validation. The performance is evaluated based on the average classification accuracy on the individual test instances. We repeat the above processes 5 times (randomly selecting negative examples for the multi-class datasets, and randomly splitting the data into bags), and report the mean accuracies with standard deviations.
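The bag construction in this setup can be sketched as a random partition followed by computing the proportion that is revealed for each bag. The function names and the handling of a final, smaller bag are assumptions; the text does not say how a remainder is treated.

```python
import random


def split_into_bags(n_instances, bag_size, seed=0):
    """Randomly partition instance indices into disjoint bags of a fixed size.

    The last bag may be smaller when bag_size does not divide n_instances
    (an assumption made here for simplicity).
    """
    rng = random.Random(seed)
    indices = list(range(n_instances))
    rng.shuffle(indices)
    return [indices[s:s + bag_size] for s in range(0, n_instances, bag_size)]


def proportions(labels, bags):
    """Fraction of +1 labels per bag: the only supervision given to the learner."""
    return [sum(1 for i in bag if labels[i] == 1) / len(bag) for bag in bags]


hidden_labels = [1, -1, 1, 1, -1, -1, 1, -1, -1, 1]
bags = split_into_bags(len(hidden_labels), bag_size=4, seed=1)
print([len(b) for b in bags])            # [4, 4, 2]
print(proportions(hidden_labels, bags))  # values depend on the shuffle
```

The instance labels are used only to derive the proportions and, later, to score test accuracy; they are never passed to the training routine.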
The parameters are tuned by an inner cross-validation loop on the training subset of each partition of the 5-fold validation. Because no instance-level labels are available during training, we use the bag-level error on the validation bags to tune the parameters:
$$\mathrm{Err} = \sum_{k=1}^{T} \left| \tilde{p}_k - p_k \right|, \tag{23}$$
in which $\tilde{p}_k$ and $p_k$ denote the predicted and the ground-truth proportions for the $k$-th validation bag, and $T$ is the number of validation bags.
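A bag-level validation error can be computed without any instance labels. The sketch below assumes the error is the sum over validation bags of the absolute difference between the predicted and the given proportion; the function name and the toy data are hypothetical.

```python
def bag_level_error(predictions, bags, true_props):
    """Sum over bags of |predicted proportion - given proportion|.

    predictions: list of +1/-1 model outputs, one per validation instance.
    bags:        list of lists of instance indices.
    true_props:  given proportion of positives for each bag.
    """
    error = 0.0
    for bag, p in zip(bags, true_props):
        p_hat = sum(1 for i in bag if predictions[i] == 1) / len(bag)
        error += abs(p_hat - p)
    return error


predictions = [1, 1, -1, -1, 1, -1]
bags = [[0, 1, 2, 3], [4, 5]]
print(bag_level_error(predictions, bags, [0.75, 0.5]))  # 0.25
```

A parameter setting is then chosen by minimizing this quantity over the candidate grid, in place of the usual instance-level validation accuracy.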
For MeanMap, the parameter is tuned from . For InvCal, the parameters are tuned from , and . For alter-∝SVM, the parameters are tuned from , and . For conv-∝SVM, the parameters are tuned from , and . Two kinds of kernels are considered: linear and RBF. The parameter of the RBF kernel is tuned from .
We randomly initialize alter-∝SVM 10 times, and pick the result with the smallest objective value. Empirically, the influence of random initialization on the other algorithms is minimal.
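Table 3: Accuracy (%, mean ± standard deviation) on cod-rna.t with linear kernel, for three bag sizes.

| Method | | | |
|---|---|---|---|
| InvCal | 88.79±0.21 | 88.20±0.62 | 87.89±0.79 |
| alter-∝SVM | 90.32±1.22 | 90.28±0.94 | 90.21±1.53 |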
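Table 4: Accuracy (%, mean ± standard deviation) with RBF kernel, for bag sizes 2, 4, 8, 16, 32, 64.

| Dataset | Method | 2 | 4 | 8 | 16 | 32 | 64 |
|---|---|---|---|---|---|---|---|
| heart | MeanMap | 82.69±0.71 | 80.80±0.97 | 79.65±0.82 | 79.44±1.21 | 80.03±2.05 | 77.26±0.85 |
| heart | InvCal | 83.15±0.56 | 81.06±0.70 | 80.26±1.32 | 79.61±3.84 | 76.36±3.72 | 73.90±3.00 |
| heart | alter-∝SVM | 83.15±0.85 | 82.89±1.30 | 81.51±0.54 | 80.07±1.21 | 79.10±0.96 | 78.63±1.85 |
| heart | conv-∝SVM | 82.96±0.26 | 82.20±0.52 | 81.38±0.53 | 81.17±0.55 | 80.94±0.86 | 78.87±1.37 |
| colic | MeanMap | 82.45±0.88 | 81.38±1.26 | 81.71±1.16 | 79.94±1.33 | 76.36±2.43 | 77.84±1.69 |
| colic | InvCal | 82.20±0.61 | 81.20±0.87 | 81.17±1.74 | 78.59±2.19 | 74.09±5.26 | 72.81±4.80 |
| colic | alter-∝SVM | 83.28±0.50 | 82.97±0.39 | 82.03±0.44 | 81.62±0.46 | 81.53±0.21 | 81.39±0.34 |
| colic | conv-∝SVM | 82.74±1.15 | 81.83±0.46 | 79.58±0.57 | 79.77±0.84 | 78.22±1.19 | 77.31±1.76 |
| vote | MeanMap | 91.15±0.33 | 90.52±0.62 | 91.54±0.20 | 90.28±1.63 | 89.58±1.09 | 89.38±1.33 |
| vote | InvCal | 95.68±0.19 | 94.77±0.44 | 93.95±0.43 | 93.03±0.37 | 87.79±1.64 | 86.63±4.74 |
| vote | alter-∝SVM | 95.80±0.20 | 95.54±0.25 | 94.88±0.94 | 92.44±0.60 | 90.72±1.11 | 90.93±1.30 |
| vote | conv-∝SVM | 92.99±0.20 | 92.01±0.69 | 90.57±0.68 | 88.98±0.35 | 88.74±0.43 | 88.62±0.60 |
| australian | MeanMap | 85.97±0.72 | 85.88±0.34 | 85.34±1.01 | 83.36±2.04 | 83.12±1.52 | 80.58±5.41 |
| australian | InvCal | 86.06±0.30 | 86.11±0.26 | 86.32±0.45 | 84.13±1.62 | 82.73±1.70 | 81.87±3.29 |
| australian | alter-∝SVM | 85.74±0.22 | 85.71±0.21 | 86.26±0.61 | 85.65±0.43 | 83.63±1.83 | 83.62±2.21 |
| australian | conv-∝SVM | 85.97±0.53 | 86.46±0.23 | 85.30±0.70 | 84.18±0.53 | 83.69±0.78 | 82.98±1.32 |
| dna-1 | MeanMap | 91.53±0.25 | 90.58±0.34 | 86.00±1.04 | 80.77±3.69 | 77.35±3.59 | 68.47±4.30 |
| dna-1 | InvCal | 89.32±3.39 | 92.73±0.53 | 87.99±1.65 | 81.05±3.14 | 74.77±2.95 | 67.75±3.86 |
| dna-1 | alter-∝SVM | 95.67±0.40 | 94.65±0.52 | 93.71±0.47 | 92.52±0.63 | 91.85±1.42 | 90.64±1.32 |
| dna-1 | conv-∝SVM | 93.36±0.53 | 86.75±2.56 | 81.03±3.58 | 75.90±4.56 | 76.92±5.91 | 77.94±2.48 |
| dna-2 | MeanMap | 92.08±1.54 | 91.03±0.69 | 87.50±1.58 | 82.21±3.08 | 76.77±4.33 | 72.56±5.32 |
| dna-2 | InvCal | 89.65±4.05 | 93.12±1.37 | 89.19±1.17 | 83.52±2.57 | 77.94±2.82 | 72.64±3.89 |
| dna-2 | alter-∝SVM | 95.63±0.45 | 95.05±0.75 | 94.25±0.50 | 93.95±0.93 | 92.74±0.93 | 92.46±0.90 |
| dna-2 | conv-∝SVM | 94.06±0.86 | 90.68±1.18 | 87.64±0.76 | 87.32±1.55 | 85.74±1.03 | 85.33±0.79 |
| satimage-2 | MeanMap | 97.08±0.48 | 96.82±0.38 | 96.50±0.43 | 96.45±1.16 | 95.51±0.73 | 94.26±0.22 |
| satimage-2 | InvCal | 97.53±1.33 | 98.33±0.13 | 98.38±0.23 | 97.99±0.54 | 96.27±1.15 | 94.47±0.27 |
| satimage-2 | alter-∝SVM | 98.83±0.36 | 98.69±0.37 | 98.62±0.27 | 98.72±0.37 | 98.51±0.22 | 98.25±0.41 |
| satimage-2 | conv-∝SVM | 96.55±0.13 | 96.45±0.19 | 96.45±0.39 | 96.14±0.49 | 96.16±0.35 | 95.93±0.45 |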
Results. Table 2 and Table 4 show the results with linear kernel and RBF kernel, respectively. Additional experimental results are provided in the supplementary material. Our methods outperform MeanMap and InvCal in most settings, with p-value < 0.05 for most of the comparisons (more than 70%). For larger bag sizes, the problem of learning from label proportions becomes more challenging due to the limited amount of supervision. For these harder cases, the gains from ∝SVM are typically even more significant. For instance, on the dna-2 dataset, with RBF kernel and bag size 64, alter-∝SVM and conv-∝SVM outperform the better of the former works (InvCal, 72.64%) by 19.82% and 12.69%, respectively (Table 4).
A Large-Scale Experiment. We also conduct a large-scale experiment on the cod-rna.t dataset containing about 271K points. The performance of InvCal and alter-∝SVM with linear kernel is compared. The experimental setting is the same as for the other datasets. The results in Table 3 show that alter-∝SVM consistently outperforms InvCal. For smaller bag sizes, alter-∝SVM also outperforms InvCal, though the improvement margin shrinks because a sufficient amount of supervision is available.
In the experiments above, because the bags were randomly generated, the distribution of $\{p_k\}_{k=1}^K$ is approximately Gaussian for moderate to large . It is intuitive that the performance will depend on the distribution of proportions $\{p_k\}_{k=1}^K$. If each $p_k$ is either 0 or 1, the bags are most informative, because this leads to the standard supervised learning setting. On the other hand, if the $p_k$'s are close to each other, the bags will be least informative. In fact, both MeanMap and InvCal cannot reach a numerically stable solution in such a case. For MeanMap, the linear equations for solving class means will be ill-posed. For InvCal, because all the “super-instances” are assumed to have the same regression value, the result is similar to random guess.
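Why nearly equal proportions make the class-mean equations ill-posed can be seen in a simplified two-bag illustration. This is not the estimator from the MeanMap paper, only a sketch under its key assumption that the class-conditional distributions do not depend on the bag; the symbols $\mu_k$ (mean of bag $k$) and $\mu^{+}, \mu^{-}$ (class means) are introduced here for the example.

```latex
% Bag means as mixtures of the two class means:
\begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}
=
\begin{pmatrix} p_1 & 1 - p_1 \\ p_2 & 1 - p_2 \end{pmatrix}
\begin{pmatrix} \mu^{+} \\ \mu^{-} \end{pmatrix},
\qquad
\det \begin{pmatrix} p_1 & 1 - p_1 \\ p_2 & 1 - p_2 \end{pmatrix}
= p_1 (1 - p_2) - p_2 (1 - p_1) = p_1 - p_2 .
% Solving for the class means divides by (p_1 - p_2):
\mu^{+} = \frac{(1 - p_2)\,\mu_1 - (1 - p_1)\,\mu_2}{p_1 - p_2},
\qquad
\mu^{-} = \frac{p_1\,\mu_2 - p_2\,\mu_1}{p_1 - p_2}.
```

As $p_1 \to p_2$ the determinant vanishes, so any noise in the bag means is amplified without bound in the recovered class means.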
∝SVM, on the other hand, can achieve good performance even in this challenging situation. For example, when using the vote dataset, with bag sizes 8 and 32, , (same as prior), with linear kernel, alter-∝SVM has accuracies(%) and , and conv-∝SVM has accuracies(%) and , respectively. These results are close to those obtained for randomly generated bags in Table 2. This indicates that our method is less sensitive to the distribution of $\{p_k\}_{k=1}^K$.
Empirically, when a nonlinear kernel is used, the run time of alter-∝SVM is longer than that of conv-∝SVM, because we are repeating alter-∝SVM multiple times to pick the solution with the smallest objective value. For instance, on a machine with a 4-core 2.5GHz CPU, on the vote dataset with RBF kernel and 5-fold cross validation, the alter-∝SVM algorithm (repeating 10 times with the annealing loop, and one set of parameters) takes 15.0 seconds on average, while the conv-∝SVM algorithm takes only 4.3 seconds. But as shown in the experimental results, for many datasets, the performance of conv-∝SVM is marginally worse than that of alter-∝SVM. This can be explained by the multiple relaxations used in conv-∝SVM, and also the 10 random initializations of alter-∝SVM. As a heuristic solution for speeding up the computation, one can use conv-∝SVM (or InvCal) to initialize alter-∝SVM. For large-scale problems, in which linear SVM is used, alter-∝SVM is preferred, because its computational complexity is $O(N)$.
The speed of both alter-∝SVM and conv-∝SVM can be improved further by solving the SVM in their inner loops incrementally. For example, one can use warm start and partial active-set methods (Shilton et al., 2005). Finally, one can linearize kernels using explicit feature maps (Rahimi and Recht, 2007; Vedaldi and Zisserman, 2012), so that alter-∝SVM has linear complexity even for certain nonlinear kernels.
We have proposed the ∝SVM framework for learning with label proportions, and introduced two algorithms to efficiently solve the optimization problem. Experiments on several standard datasets and one large-scale dataset show the advantage of the proposed approach over the existing methods. The simple, yet flexible form of the ∝SVM framework naturally spans supervised, unsupervised and semi-supervised learning. Due to the usage of latent labels, ∝SVM can also be potentially used in learning with label errors. In the future, we will design algorithms to handle bags with overlapping data. Also, we plan to investigate the theoretical conditions under which the label proportions can be preserved with the convex relaxations.
Acknowledgment. We thank Novi Quadrianto and Yu-Feng Li for their help. We thank Jun Wang, Yadong Mu and anonymous reviewers for the insightful suggestions.
References
Bach et al. (2004). Multiple kernel learning, conic duality, and the SMO algorithm. In Proceedings of the 21st International Conference on Machine Learning, pp. 6.
Bellare et al. (2009). Alternating projections for learning with expectation constraints. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence, pp. 43–50.
Chapelle et al. (2008). Optimization techniques for semi-supervised support vector machines. The Journal of Machine Learning Research 9, pp. 203–233.
Chen et al. (2006). Learning from aggregate views. In Proceedings of the 22nd International Conference on Data Engineering, pp. 3.
Fan (1953). Minimax theorems. Proceedings of the National Academy of Sciences of the United States of America 39 (1), pp. 42.
Gillenwater et al. (2011). Posterior sparsity in unsupervised dependency parsing. The Journal of Machine Learning Research 12, pp. 455–490.
Joachims (1999). Transductive inference for text classification using support vector machines. In Proceedings of the 16th International Conference on Machine Learning, pp. 200–209.
Joachims (2006). Training linear SVMs in linear time. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 217–226.
Keerthi et al. (2012). Extension of TSVM to multi-class and hierarchical text classification problems with general losses. In Proceedings of the 24th International Conference on Computational Linguistics, pp. 1091–1100.
Kelley Jr (1960). The cutting-plane method for solving convex programs. Journal of the Society for Industrial & Applied Mathematics 8 (4), pp. 703–712.
Kuck and de Freitas (2005). Learning about individuals from group statistics. In Proceedings of the 21st Conference on Uncertainty in Artificial Intelligence, pp. 332–339.
Li et al. (2009a). Semi-supervised learning using label mean. In Proceedings of the 26th International Conference on Machine Learning, pp. 633–640.
Li et al. (2009b). Tighter and convex maximum margin clustering. In Proceedings of the 12th International Conference on Artificial Intelligence and Statistics, pp. 344–351.
Mann and McCallum (2007). Simple, robust, scalable semi-supervised learning via expectation regularization. In Proceedings of the 24th International Conference on Machine Learning, pp. 593–600.
Musicant et al. (2007). Supervised learning by training on aggregate outputs. In Proceedings of the 7th International Conference on Data Mining, pp. 252–261.
Quadrianto et al. (2009). Estimating labels from label proportions. The Journal of Machine Learning Research 10, pp. 2349–2374.
Rahimi and Recht (2007). Random features for large-scale kernel machines. Advances in Neural Information Processing Systems 20, pp. 1177–1184.
Rakotomamonjy et al. (2008). SimpleMKL. The Journal of Machine Learning Research 9, pp. 2491–2521.
Rüeping (2010). SVM classifier estimation from group probabilities. In Proceedings of the 27th International Conference on Machine Learning, pp. 911–918.
Shilton et al. (2005). Incremental training of support vector machines. Neural Networks, IEEE Transactions on 16 (1), pp. 114–131.
Stolpe and Morik (2011). Learning from label proportions by optimizing cluster model selection. In Proceedings of the 2011 European Conference on Machine Learning and Knowledge Discovery in Databases, Volume Part III, pp. 349–364.
Vedaldi and Zisserman (2012). Efficient additive kernels via explicit feature maps. Pattern Analysis and Machine Intelligence, IEEE Transactions on 34 (3), pp. 480–492.
Xu et al. (2004). Maximum margin clustering. Advances in Neural Information Processing Systems 17, pp. 1537–1544.