MEMOIR: Multi-class Extreme Classification with Inexact Margin
Multi-class classification with a very large number of classes, or extreme classification, is a challenging problem from both the statistical and the computational perspectives. Most classical approaches to multi-class classification, including one-vs-rest and multi-class support vector machines, require the exact estimation of the classifier's margin at both the training and the prediction steps, making them intractable in extreme classification scenarios. In this paper, we study the impact of computing an approximate margin using approximate nearest neighbor (ANN) search structures combined with locality-sensitive hashing (LSH). This approximation dramatically reduces both the training and the prediction time without a significant loss in performance. We theoretically prove that this approximation does not lead to a significant increase in the risk of the model and provide empirical evidence over five publicly available large-scale datasets, showing that the proposed approach is highly competitive with state-of-the-art approaches in terms of time, memory, and performance measures.
Recently, the problem of large-scale multi-class classification has become very popular in the machine learning community owing to numerous applications in computer vision [1, 2], text categorization [3], recommendation and ranking systems [4, 5]. Publicly available text and image repositories such as Wikipedia, Yahoo! Directory (www.dir.yahoo.com), ImageNet (www.image-net.org) or Directory Mozilla DMOZ (www.dmoz.org) contain millions of objects from thousands of classes. For instance, an LSHTC dataset built from the Mozilla DMOZ collection contains 163K objects belonging to 12K classes and described by 409K features. Classical multi-class classification approaches, such as one-vs-one and one-vs-rest, remain popular mainly because of their high accuracy and robustness to noisy input. On the other hand, their direct application to extreme classification problems is questionable due to their highly non-linear time and memory requirements [6, 7].
Some promising attempts have been made to reduce the computation time of these models, by applying either locality-sensitive hashing or reinforcement learning policies in an online convex optimization context. Despite empirical evidence of how well these approaches perform, to the best of our knowledge none of these studies proposes a well-founded theoretical strategy for large-scale multi-class classification that guarantees a gain in computation time without a significant loss in the statistical risk of the model.
In this paper, we propose a novel method for approximating the output estimation of multi-class classification models using approximate nearest neighbor (ANN) search structures with locality-sensitive hashing (LSH). We show that the proposed inexact margin computation significantly reduces time and memory requirements, allowing popular techniques, such as multi-class support vector machines and the one-vs-rest approach, to scale without a significant loss in their true risk.
The contribution of this paper is threefold. Namely, we
propose an inexact margin multi-class classification framework and provide a theoretical analysis of its behavior;
design efficient numerical methods for $\ell_1$- and $\ell_2$-regularized inexact-margin multi-class support vector machines;
provide empirical evidence of its ability to learn efficient models compared to state-of-the-art approaches over multiple extreme classification datasets and make available the corresponding open-source software for research purposes.
In the next section, we introduce the framework of multi-class margin classification and describe the inexact margin idea in more detail. We then provide a theoretical analysis of the statistical performance of multi-class classification methods with an inexact margin, supported by excess risk bounds for the corresponding classification algorithms. Further, we present experimental results obtained on publicly available extreme classification benchmarks; finally, we briefly discuss the proposed algorithms and compare them to existing solutions.
Multi-class Margin Classification
Let $S = \{(x_i, y_i)\}_{i=1}^m$ be an independently and identically distributed (i.i.d.) sample with respect to a fixed yet unknown distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$, where $\mathcal{X}$ is a feature space and $\mathcal{Y} = \{1, \dots, K\}$ is a set of classes. Given a hypothesis set $\mathcal{F}$ of functions mapping $\mathcal{X} \times \mathcal{Y}$ to $\mathbb{R}$, the exact margin of a labeled example $(x, y)$ with respect to a function $f \in \mathcal{F}$ is defined as
$$m(x, y, f) = f(x, y) - \max_{y' \neq y} f(x, y').$$
An observation $(x, y)$ is misclassified by a function $f$ if and only if $m(x, y, f) \le 0$. We refer to the class of margin loss functions as
$$\mathcal{L}_\gamma = \{(x, y) \mapsto \phi_\gamma(m(x, y, f)) : f \in \mathcal{F}\},$$
where $\phi_\gamma$ is the margin loss defined below.
The main problem here is that the computation of the margin for an observation $x$ requires the estimation of $f(x, y')$ for each $y' \in \mathcal{Y}$, which is intractable when the number of classes $K$ is too large. For instance, in the case of linear classifiers, margin computation amounts to finding the maximal element of a matrix-vector product at each iteration, which is challenging in large-scale scenarios. In order to overcome this problem in such cases, we estimate an approximate, or inexact, margin for each observation by first choosing a candidate rival class $\hat{y} \neq y$ and then estimating
$$\hat{m}(x, y, f) = f(x, y) - f(x, \hat{y}).$$
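As an illustration, the following minimal numpy sketch contrasts the exact margin of a linear model with an inexact one obtained by probing only a few rival classes (a stand-in for the ANN/LSH search discussed later); all sizes and names here (`W`, `n_probes`) are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
K, d = 1000, 64                     # number of classes, feature dimension (toy sizes)
W = rng.standard_normal((K, d))     # one weight vector per class
x, y = rng.standard_normal(d), 3    # a labeled example

def exact_margin(W, x, y):
    """m(x, y, f) = <w_y, x> - max_{y' != y} <w_{y'}, x>: needs all K scores."""
    scores = W @ x
    rival = np.max(np.delete(scores, y))
    return scores[y] - rival

def inexact_margin(W, x, y, n_probes=32):
    """Probe only a few rival classes; the probed maximum can only
    under-estimate the true rival score."""
    rivals = rng.choice(np.delete(np.arange(len(W)), y), size=n_probes, replace=False)
    return W[y] @ x - np.max(W[rivals] @ x)

m, m_hat = exact_margin(W, x, y), inexact_margin(W, x, y)
```

Because the probed maximum can only under-estimate the true rival score, the inexact margin in this sketch never under-estimates the exact one.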
In this paper, we focus on the influence of inexact margin computation on the true risk of a classifier,
$$R(f) = \mathbb{E}_{(x, y) \sim \mathcal{D}}\left[\mathbb{1}_{m(x, y, f) \le 0}\right],$$
where $\mathbb{1}_\pi$ is equal to $1$ if the predicate $\pi$ is true and $0$ otherwise. More precisely, we are interested in the case where the classifier is found following the Empirical Risk Minimization principle under the assumption that the approximate margin is $\epsilon$-inexact, that is, for a given $\epsilon > 0$ and with probability at least $1 - \delta$, we have
$$|\hat{m}(x, y, f) - m(x, y, f)| \le \epsilon.$$
The empirical loss considered in this work is based on the hinge $\gamma$-loss function $\phi_\gamma$, for $\gamma > 0$, defined as
$$\phi_\gamma(t) = \min\left(1, \max\left(0, 1 - t/\gamma\right)\right),$$
so that $\phi_\gamma(t) = 1$ for $t \le 0$ and $\phi_\gamma(t) = 0$ for $t \ge \gamma$.
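Assuming the standard ramp form of the hinge $\gamma$-loss used in margin-based bounds, a one-line sketch is:

```python
def ramp_loss(m, gamma):
    """Hinge gamma-loss (ramp): 1 for m <= 0, linearly decreasing to 0 at m = gamma."""
    return min(1.0, max(0.0, 1.0 - m / gamma))
```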
Our main result is stated in Theorem 1; it provides an upper bound on the true risk of a multi-class classifier based on its empirical loss estimated with an inexact margin. The notion of function class capacity used in the bound is the Rademacher complexity of the set of functions $\mathcal{F}$:
$$\mathfrak{R}_m(\mathcal{F}) = \mathbb{E}_{S, \sigma}\left[\sup_{f \in \mathcal{F}} \frac{1}{m} \sum_{i=1}^m \sigma_i f(x_i, y_i)\right],$$
where the $\sigma_i$'s are independent uniform random variables taking values in $\{-1, +1\}$, i.e., $\mathbb{P}(\sigma_i = 1) = \mathbb{P}(\sigma_i = -1) = 1/2$.
Let $S$ be an i.i.d. sample from an unknown distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$, and let $\mathcal{F}$ be a class of functions from $\mathcal{X} \times \mathcal{Y}$ to $\mathbb{R}$. Then for any $\delta > 0$, with probability at least $1 - \delta$, the expected risk of any $f \in \mathcal{F}$ trained with an $\epsilon$-inexact margin is upper-bounded by
Moreover, for kernel-based hypotheses with a PDS kernel $k$ and its associated feature mapping $\phi$, the hypothesis class is defined as
$$\mathcal{F} = \left\{(x, y) \mapsto \langle w_y, \phi(x) \rangle : W = (w_1, \dots, w_K),\ \|W\|_{2, q} \le \Lambda\right\},$$
where $W$ is the matrix formed by the weight vectors defining the kernel-based hypotheses and $\|W\|_{2, q}$ is the group norm of $W$; then one has
The standard Rademacher complexity margin bound, according to Theorem 4.4 of [10], gives, with probability at least $1 - \delta$, for a class of functions $\mathcal{F}$ the following bounds:
Similarly to [11], for all $f \in \mathcal{F}$, due to the monotonicity of the Rademacher complexity in the number of examples, we have
Finally, with probability at least $1 - \delta$, under the conditions of the theorem, for all training objects the margin approximation error is at most $\epsilon$; thus
Combining this with the inequality above, one gets
To prove the remaining inequalities, we note that with probability at least $1 - \delta$, under the conditions of the theorem, for all training objects the margin approximation error is at most $\epsilon$. ∎
We consider two principal approaches to inexact margin computation: locality-sensitive hashing and convex optimization. Our main focus here is on linear classification and maximal inner product approximation.
A hash $h$ is said to be an $(S, cS, p_1, p_2)$-LSH for a similarity function $\mathrm{sim}$ over the pair of spaces $(\mathcal{X}, \mathcal{Y})$ if for any $x \in \mathcal{X}$ and $y \in \mathcal{Y}$:
if $\mathrm{sim}(x, y) \ge S$ then $\mathbb{P}[h(x) = h(y)] \ge p_1$;
if $\mathrm{sim}(x, y) \le cS$ then $\mathbb{P}[h(x) = h(y)] \le p_2$, with $p_1 > p_2$.
As the optimal value of the similarity is unknown in advance (and might even be negative), to guarantee the utility of LSH for approximating the margin we require the LSH to be universal, i.e., for every $S$ and $c < 1$ it is an $(S, cS, p_1, p_2)$-LSH. Neyshabur and Srebro also propose the SimpleLSH algorithm, based on random Gaussian projections, together with a bound on its hashing quality.
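A toy sketch of the SimpleLSH construction (with illustrative sizes) shows its two ingredients: an augmentation that turns inner products into cosine similarities, and sign hashes of random Gaussian projections:

```python
import numpy as np

rng = np.random.default_rng(1)

def augment(vectors):
    """SimpleLSH preprocessing: scale data into the unit ball and append a
    coordinate so every augmented point has unit norm; maximal inner product
    then coincides with maximal cosine similarity."""
    scaled = vectors / np.linalg.norm(vectors, axis=1).max()
    extra = np.sqrt(1.0 - (scaled ** 2).sum(axis=1, keepdims=True))
    return np.hstack([scaled, extra])

def sign_hash(points, proj):
    """Random-hyperplane LSH: one bit per Gaussian projection."""
    return points @ proj >= 0.0

W = rng.standard_normal((500, 32))          # class weight vectors (toy sizes)
proj = rng.standard_normal((33, 64))        # 64 hash bits in the augmented space
codes = sign_hash(augment(W), proj)

x = rng.standard_normal(32)
q = np.append(x / np.linalg.norm(x), 0.0)   # queries get a zero extra coordinate
q_code = sign_hash(q, proj)
# candidate rival = the class whose code agrees with the query on the most bits
candidate = int(np.argmax((codes == q_code).sum(axis=1)))
```

In a real index the codes would be bucketed rather than scanned linearly; the scan here only keeps the sketch short.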
Following this construction, one needs to distinguish between the score of the true class and those of its rivals with probability at least $1 - \delta$. Assume below, for simplicity, that $\|x\| \le 1$ for any $x$. Applying LSH recursively until the remaining candidate set is small enough, one gets a sublinear time to construct the margin approximation.
We also note that the maximal inner product search can be equivalently stated as a stochastic convex optimization problem:
where we use the superscript index to denote a coordinate of a vector and denote by $U$ the uniform probability measure over the coordinates. As is known from the seminal result of Nemirovski et al. [12] on the stochastic mirror descent algorithm and of [13] on the stochastic Frank-Wolfe method, there exist potentially sub-linear time complexity algorithms to solve this problem approximately over sparse data. Nevertheless, a discussion of the optimization approach is outside the scope of this paper.
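To make the stochastic view concrete, here is a hedged sketch of an unbiased coordinate-sampled estimator of all inner products; the function name `sampled_argmax_ip` and all sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

def sampled_argmax_ip(W, x, n_samples):
    """Estimate every inner product <w_k, x> from a random subset of coordinates:
    since <w, x> = d * E_i[w^i * x^i] for a uniform coordinate i, averaging
    d/n_samples times the sampled products gives an unbiased estimate."""
    d = x.shape[0]
    idx = rng.choice(d, size=n_samples, replace=False)
    estimates = (d / n_samples) * (W[:, idx] @ x[idx])
    return int(np.argmax(estimates))

W = rng.standard_normal((50, 200))   # toy weight matrix, 50 classes
x = rng.standard_normal(200)
```

With `n_samples` equal to the full dimension the estimate is exact; smaller samples trade accuracy for a proportionally cheaper pass over sparse data.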
Multi-Class Support Vector Machines
Multi-class support vector machines (M-SVMs) still remain among the top classification methods owing to their accuracy and robustness. In this section, we analyse simple sub-gradient descent methods to train M-SVMs with $\ell_1$ and $\ell_2$ regularization in terms of the influence of inexact margins on their accuracy. Our treatment is mainly inspired by seminal work on support vector machine optimization with an inexact oracle.
According to Crammer and Singer [16], the multi-class SVM classifier decision rule is
$$h_W(x) = \operatorname*{argmax}_{k \in \mathcal{Y}} \langle w_k, x \rangle,$$
where $w_k$ is the weight vector of the $k$-th classifier and $W = (w_1, \dots, w_K)$ is the matrix of weight vectors of all classifiers.
The learning objective is
$$\min_{W}\; \frac{\lambda}{2}\|W\|_2^2 + \frac{1}{n}\sum_{i=1}^{n} \ell_\gamma\big(W; (x_i, y_i)\big),$$
where $\ell_\gamma\big(W; (x, y)\big) = \max\big(0,\ \gamma - \langle w_y, x\rangle + \max_{k \neq y} \langle w_k, x\rangle\big)$ is the $\gamma$-Hinge loss function.
Algorithm 1 is essentially an extension of the Pegasos algorithm [7] to train the $\ell_2$-regularized multi-class support vector machine. We further refer to it as MEMOIR-$\ell_2$. The convergence rate of the algorithm is established in Theorem 2.
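The following sketch shows a Pegasos-style step for the $\ell_2$-regularized Crammer-Singer objective, with an exact rival search that an ANN/LSH oracle could replace; the toy data, names, and hyperparameters are illustrative, not the paper's Algorithm 1 verbatim:

```python
import numpy as np

rng = np.random.default_rng(3)

def exact_rival(W, x, y):
    """Exact rival search; an inexact MIPS oracle could be substituted here."""
    scores = W @ x
    scores[y] = -np.inf
    return int(np.argmax(scores))

def pegasos_step(W, x, y, t, lam, find_rival=exact_rival):
    """One sub-gradient step on the l2-regularized Crammer-Singer objective."""
    eta = 1.0 / (lam * t)                       # Pegasos step size
    k = find_rival(W, x, y)
    violated = 1.0 - (W[y] - W[k]) @ x > 0.0    # gamma = 1 hinge is active
    W *= 1.0 - eta * lam                        # shrinkage from the regularizer
    if violated:
        W[y] += eta * x
        W[k] -= eta * x
    return W

# Three well-separated Gaussian blobs (toy data).
centers = 5.0 * np.eye(3, 5)
X = np.vstack([c + 0.1 * rng.standard_normal((20, 5)) for c in centers])
Y = np.repeat(np.arange(3), 20)

W, t = np.zeros((3, 5)), 0
for _ in range(5):                              # a few epochs
    for i in rng.permutation(len(Y)):
        t += 1
        W = pegasos_step(W, X[i], Y[i], t, lam=0.1)

accuracy = float(np.mean(np.argmax(X @ W.T, axis=1) == Y))
```

Swapping `find_rival` for an inexact oracle is exactly where the margin approximation enters the training loop.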
We treat margin inexactness as adversarial noise on the gradient and derive its total influence on the minimization. The bound on the total distortion introduced by the inexact margin is nothing more than an application of the Hoeffding inequality. The full proof is provided in the supplementary material. ∎
Theorem 2 requires the inexactness of the margin computation to be bounded at every iteration, and these bounds in turn should be summable for the consistency of the algorithm. This requirement is important from a theoretical perspective, as it limits the performance of numerical schemes based on convex optimization and LSH for margin approximation. On the other hand, it is much less crucial from a practical perspective, as we see in our numerical study.
In the area of text classification, objects are often described by TF-IDF or word/collocation frequencies and have a sparse representation, which is crucial for large-scale machine learning problems. To control the sparsity of $W$, we use a simple truncated stochastic gradient descent, given in Algorithm 2. It is worth mentioning the variety of efficient optimization schemes for $\ell_1$-regularized minimization [17, 18, 19, 20]. We believe that a similar technique could be utilized for any of the schemes above, as well as for the sub-gradient descent we consider here. We also refer to this algorithm as MEMOIR-$\ell_1$.
The $\ell_1$-regularized multi-class support vector machine problem is to minimize
A step of the stochastic sub-gradient descent method is
The details of the method are provided in Algorithm 2. Note the important truncation step, which zeroes out sufficiently small elements of $W$ and significantly reduces memory consumption. We apply it only when the resulting norm of the truncated elements is sufficiently small.
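A possible form of the guarded truncation step (names and thresholds illustrative) is:

```python
import numpy as np

def truncate(w, theta, max_drop):
    """Zero entries of w smaller than theta in magnitude, but only if the l2
    norm of everything removed stays below max_drop, so each truncation
    perturbs the iterate by a controlled amount."""
    small = np.abs(w) < theta
    if np.linalg.norm(w[small]) <= max_drop:
        return np.where(small, 0.0, w)
    return w
```

The guard is what lets the convergence analysis treat truncation as a small, bounded perturbation of the iterate.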
Table 1: datasets with their number of classes, dimension, and train, test, and heldout sizes.
After $T$ iterations of Algorithm 2 with the full update step, appropriate step sizes, and corresponding inexact margins, the expected loss at the output iterate is bounded with probability at least $1 - \delta$ as
where the optimum is the optimal value in Eq. (11), the sub-gradient's $\ell_2$-norm is bounded, and each truncation operation changes the norm of $W$ by only a bounded amount after each iteration.
The proof is a direct adaptation of the standard sub-gradient convergence analysis to the case of sub-gradient errors due to the inexact margin and truncation. The condition on the inexact margin also guarantees that, with probability at least $1 - \delta$, each distortion in the margin is bounded. ∎
In this section we provide an empirical evaluation of the proposed algorithms and compare them to several state-of-the-art approaches. We also discuss how hyperparameter tuning affects the algorithms' performance from the time, memory, and quality perspectives.
We use datasets from the Large Scale Hierarchical Text Classification Challenge (LSHTC) 1 and 2 [21], which were converted from a multi-label to a multi-class format by replicating the instances belonging to several class labels. These datasets are provided in a pre-processed format using both stemming and stop-word removal. Their characteristics, such as train, test, and heldout sizes, are listed in Table 1. We would like to thank the authors of [22] for providing us with these datasets.
During the experiments, two quality measures were evaluated: the accuracy and the Macro-Averaged F1 Measure (MaF1). The former represents the fraction of the test data classified correctly; the latter is the harmonic average of macro-precision and macro-recall; higher values correspond to better performance. Being insensitive to class imbalance, the MaF1 is commonly used for comparing multi-class classification algorithms.
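For reference, MaF1 can be computed from scratch as the unweighted mean of per-class F1 scores:

```python
import numpy as np

def macro_f1(y_true, y_pred, n_classes):
    """Macro-averaged F1: unweighted mean of per-class F1 scores, so rare
    classes count as much as frequent ones."""
    f1s = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        f1s.append(0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn))
    return float(np.mean(f1s))
```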
We compare the MEMOIR algorithms with the following multi-class classification algorithms:
OVR: the one-vs-rest SVM implementation from LIBLINEAR [23].
M-SVM: the multi-class SVM implementation from LIBLINEAR, proposed in [24].
RecallTree: a logarithmic-time one-vs-some tree-based classifier. It utilises trees to select a small subset of labels with high recall and scores them with high precision [25].
FastXML: a computationally efficient algorithm for extreme multi-labeling problems. It uses hierarchical partitioning of the feature space together with direct optimization of the nDCG ranking measure [26].
PfastReXML: a tree-ensemble algorithm which is an enhanced version of FastXML: the nDCG loss is replaced with its propensity-scored variant, which is unbiased and assigns higher rewards to accurate tail-label predictions [27].
PD-Sparse: a recent classifier with a multi-class margin loss and Elastic Net regularisation. To find an optimal solution, the Dual-Space Fully-Corrective Block-Coordinate Frank-Wolfe algorithm is used [18].
In our experiments we use a system with two 20-core Intel(R) Xeon(R) Silver 4114 2.20GHz CPUs and 128 GB of RAM. A maximum of 16 cores is used in each experiment for training and predicting. The proposed algorithms run in a single main thread, but querying and updating the ANN search structure is parallelized. We were able to achieve a perfect 8x training time speed-up when using 8 cores, an 11x speed-up when using 16 cores, and a 20x speed-up when using 32 cores (compared to a single-core run of the MEMOIR algorithm on the LSHTC1 dataset). This means that the proposed algorithm scales well and has a reasonable, yet sublinear, training time improvement with the number of cores used.
In all algorithms we employ "lazy" matrix scaling: we accumulate a shared matrix multiplier $a$ during all matrix-scalar multiplications, storing the weights in the form $aW$, and divide by $a$ in addition and subtraction operations when necessary. This keeps the update time sublinear in the problem size. Additionally, we store the weight matrix as a list of CSR sparse vectors, one per class, thus requiring memory proportional to the number of non-zero weights. Finally, in each algorithm we predict the class of a test object using exact MIPS computation. This is feasible because all algorithms produce highly sparse weight matrices, making the weight matrix-test vector multiplication very fast.
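The lazy scaling trick can be sketched as follows; the class and method names are illustrative:

```python
class LazyScaledMatrix:
    """Store W implicitly as a * M: scaling by c is O(1) (a *= c), while row
    updates divide by a so that reads of a * M[k] stay consistent."""

    def __init__(self, n_rows):
        self.a = 1.0
        self.rows = [dict() for _ in range(n_rows)]   # sparse rows (toy layout)

    def scale(self, c):
        self.a *= c                                   # O(1) instead of O(nnz)

    def add_to_row(self, k, idx, vals):
        # Adding v to a * M[k][j] means adding v / a to the stored M[k][j].
        for j, v in zip(idx, vals):
            self.rows[k][j] = self.rows[k].get(j, 0.0) + v / self.a

    def dot_row(self, k, x_idx, x_vals):
        r = self.rows[k]
        return self.a * sum(r.get(j, 0.0) * v for j, v in zip(x_idx, x_vals))
```

A production version would use CSR vectors per class, as described above; the dict-based rows only keep the sketch short.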
Inexact Margin Computation.
In the case of linear classifiers, the problem of inexact margin computation is known as the Maximum Inner Product Search (MIPS) problem. Many techniques, such as cone trees or locality-sensitive hashing, have been proposed to tackle this problem in a high-dimensional setup. We refer to a recent survey [29] for more details.
To be effective in a large-scale setup, a MIPS solver should maintain sublinear costs for query and incremental update operations. In practice, the majority of solutions lack theoretical guarantees, providing only empirical evaluations [31].
In our experiments we use two different MIPS solvers: SimpleLSH, an LSH-based MIPS solver [28], and the Navigable Small-World Graph (SW-Graph), a graph-based algorithm introduced in [30]. We provide theoretical guarantees regarding the properties of SimpleLSH as an inexact oracle, but our framework can be extended to any similar LSH implementation.
In contrast to SimpleLSH, SW-Graph is a purely empirical algorithm [30], yet we included it in our experiments due to its high performance in comparison with other solutions [31]. In our implementation we use the Non-Metric Space Library (nmslib, https://github.com/nmslib/nmslib), which provides a parallel implementation of the SW-Graph algorithm with incremental insertions and deletions. One useful property of nmslib is its ability to work with non-metric similarity functions, including the negative dot product, which makes it a suitable implementation of a MIPS oracle.
In this subsection we discuss a reasonable choice of the algorithms' parameters and how this choice affects training time, memory consumption, and quality. The complete list of exact values of all hyperparameters is provided in the Appendix.
Learning rate. The learning rate is a decreasing function of the iteration number $t$; its constants are hyperparameters.
Total number of iterations was chosen for each dataset independently by observing the quality improvement on the heldout set. When the heldout MaF1 does not improve significantly for 5-10 iterations, we stop the learning process.
Batch size was optimized using grid search on a logarithmic scale and turned out to be identical for all algorithms and datasets. Generally, increasing the batch size while keeping the number of observed objects fixed first improves accuracy at the cost of increased learning time; see Table 2 for details.
The choice of the regularization parameter $\lambda$ in the MEMOIR-$\ell_2$ algorithm affects quality, but has almost no effect on time and memory. This is because the value of $\lambda$ only affects the matrix multiplier and has no sparsifying effect on it, unlike in the $\ell_1$-regularized SVM. We found a single value of $\lambda$ to be a reasonable choice in all our experiments.
The regularization parameter of the MEMOIR-$\ell_1$ algorithm has little effect on training time (it takes only 33% longer to train MEMOIR-$\ell_1$ at one end of the tested range compared to the other), but provides a way to trade quality for memory; Figure 1 illustrates this effect. Decreasing the parameter increases the number of non-zero elements in the weight matrix, which in turn leads to more accurate predictions.
The results for the baselines and the proposed methods in terms of training and prediction time, total memory usage, and predictive performance, evaluated with accuracy and MaF1, are provided in Table 3. For a visual comparison, we also display the results of the seven best models graphically in Figure 2.
All proposed algorithms consume substantially less memory than the existing solutions, with the MEMOIR-$\ell_1$ algorithm achieving the most impressive memory reduction. The MEMOIR-LSH algorithm achieves the fastest training time (due to the reduced feature space and the fast computation of the Hamming distance), at the cost of the highest memory usage among the proposed methods. MEMOIR-$\ell_2$ is a good compromise between the previous two algorithms in terms of test quality, memory, and training time.
|LSHTC1||Method||OVR||M-SVM||RecallTree||FastXML||PfastReXML||PD-Sparse||MEMOIR-$\ell_2$||MEMOIR-$\ell_1$||MEMOIR-LSH|
|n = 163589||Predict time||328s||314s||21s||339s||164s||67s||18s||6s||12s|
|d = 409774||Total memory||40.3G||40.3G||122M||470M||471M||10.5G||119M||46M||218M|
|C = 12294||Accuracy||44.1%||36.4%||18.1%||39.3%||39.8%||45.7%||34.5%||31.9%||34.6%|
|DMOZ||Method||OVR||M-SVM||RecallTree||FastXML||PfastReXML||PD-Sparse||MEMOIR-$\ell_2$||MEMOIR-$\ell_1$||MEMOIR-LSH|
|n = 510943||Predict time||2797s||3981s||47s||424s||505s||482s||76s||22s||77s|
|d = 594158||Total memory||131.9G||131.9G||256M||1339M||1242M||28.1G||271M||52M||417M|
|C = 27875||Accuracy||37.7%||32.2%||16.9%||33.4%||33.7%||40.8%||25.1%||20.6%||25.2%|
|WIKI-Small||Method||OVR||M-SVM||RecallTree||FastXML||PfastReXML||PD-Sparse||MEMOIR-$\ell_2$||MEMOIR-$\ell_1$||MEMOIR-LSH|
|n = 1000772||Predict time||2270s||NA||24s||453s||871s||382s||88s||36s||61s|
|d = 380078||Total memory||109.1G||109.1G||178M||949M||947M||12.4G||121M||39M||213M|
|C = 36504||Accuracy||15.6%||NA||7.9%||11.1%||12.1%||15.6%||19.0%||17.0%||19.4%|
|WIKI-50K||Method||OVR||M-SVM||RecallTree||FastXML||PfastReXML||PD-Sparse||MEMOIR-$\ell_2$||MEMOIR-$\ell_1$||MEMOIR-LSH|
|n = 1384693||Predict time||NA||NA||45s||1110s||2461s||790s||120s||59s||120s|
|d = 951558||Total memory||330G||330G||226M||1327M||1781M||35G||185M||42M||424M|
|C = 50000||Accuracy||NA||NA||17.9%||25.8%||27.3%||33.8%||29.6%||24.2%||30.0%|
|WIKI-100K||Method||OVR||M-SVM||RecallTree||FastXML||PfastReXML||PD-Sparse||MEMOIR-$\ell_2$||MEMOIR-$\ell_1$||MEMOIR-LSH|
|n = 2750663||Predict time||NA||NA||90s||1687s||3210s||3121s||457s||161s||504s|
|d = 1271710||Total memory||1017G||1017G||370M||2622M||2834M||40.3G||346M||35M||660M|
|C = 100000||Accuracy||NA||NA||8.4%||15.0%||16.1%||22.2%||22.3%||14.0%||22.6%|
This paper addressed extreme multi-class classification using inexact margin computation. We provided a theoretical analysis of the classification risk in multi-class (one-vs-rest and Crammer-Singer) settings and showed that inexact margin computation does not lead to a significant increase in the models' risk. We then designed three efficient methods, MEMOIR-$\ell_2$, MEMOIR-$\ell_1$, and MEMOIR-LSH, that solve the Crammer-Singer multi-class SVM problem with an inexact approximation of the margin using two different MIPS solvers, SW-Graph and SimpleLSH. We illustrated the empirical performance of these methods on five extreme classification datasets, achieving good results in terms of quality, memory, and training time.
Finally, we discussed how parameter tuning affects the algorithms' performance and provided practical advice on how to choose hyperparameters for the proposed algorithms. Our implementation is publicly available in the form of an open-source library.
-  Florent Perronnin and Christopher Dance. Fisher kernels on visual vocabularies for image categorization. In IEEE conference on computer vision and pattern recognition, pages 1–8, 2007.
-  Armand Joulin, Francis Bach, and Jean Ponce. Multi-class cosegmentation. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 542–549. IEEE, 2012.
-  Thorsten Joachims. Text categorization with support vector machines: Learning with many relevant features. In European conference on machine learning, pages 137–142. Springer, 1998.
-  Yang Song, Ziming Zhuang, Huajing Li, Qiankun Zhao, Jia Li, Wang-Chien Lee, and C Lee Giles. Real-time automatic tag recommendation. In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, pages 515–522. ACM, 2008.
-  Amnon Shashua and Anat Levin. Ranking with large margin principle: Two approaches. In Advances in neural information processing systems, pages 961–968, 2003.
-  Jesse Read, Bernhard Pfahringer, Geoff Holmes, and Eibe Frank. Classifier chains for multi-label classification. Machine learning, 85(3):333, 2011.
-  Shai Shalev-Shwartz, Yoram Singer, Nathan Srebro, and Andrew Cotter. Pegasos: Primal estimated sub-gradient solver for SVM. Mathematical Programming, 127(1):3–30, 2011.
-  Behnam Neyshabur and Nathan Srebro. On symmetric and asymmetric lshs for inner product search. arXiv preprint arXiv:1410.5518, 2014.
-  Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2-3):235–256, 2002.
-  Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of machine learning. MIT press, 2012.
-  Yunwen Lei, Urun Dogan, Alexander Binder, and Marius Kloft. Multi-class svms: From tighter data-dependent generalization bounds to novel algorithms. In Advances in Neural Information Processing Systems, pages 2035–2043, 2015.
-  Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on optimization, 19(4):1574–1609, 2009.
-  Elad Hazan and Haipeng Luo. Variance-reduced and projection-free stochastic optimization. In International Conference on Machine Learning, pages 1263–1271, 2016.
-  Ryan Rifkin and Aldebaro Klautau. In defense of one-vs-all classification. Journal of machine learning research, 5(Jan):101–141, 2004.
-  Zhuang Wang, Koby Crammer, and Slobodan Vucetic. Multi-class Pegasos On A Budget. International Conference on Machine Learning (ICML), pages 1143–1150, 2010.
-  Koby Crammer and Yoram Singer. On The Algorithmic Implementation of Multiclass Kernel-based Vector Machines. Journal of Machine Learning Research (JMLR), 2:265–292, 2001.
-  Martin Jaggi. Revisiting frank-wolfe: projection-free sparse convex optimization. In Proceedings of the 30th International Conference on International Conference on Machine Learning-Volume 28, pages I–427. JMLR. org, 2013.
-  I En-Hsu Yen, X Huang, P Ravikumar, K Zhong, and I Dhillon. PD-Sparse : A Primal and Dual Sparse Approach to Extreme Multiclass and Multilabel Classification. Proceedings of The 33rd International Conference on Machine Learning, 48:3069–3077, 2016.
-  Donald Goldfarb, Garud Iyengar, and Chaoxu Zhou. Linear convergence of stochastic frank wolfe variants. In Artificial Intelligence and Statistics, pages 1066–1074, 2017.
-  Ji Zhu, Saharon Rosset, Robert Tibshirani, and Trevor J Hastie. 1-norm support vector machines. In Advances in neural information processing systems, pages 49–56, 2004.
-  Ioannis Partalas, Aris Kosmopoulos, Nicolas Baskiotis, Thierry Artieres, George Paliouras, Eric Gaussier, Ion Androutsopoulos, Massih-Reza Amini, and Patrick Gallinari. LSHTC: A Benchmark for Large-Scale Text Classification. pages 1–9, March 2015.
-  Bikash Joshi, Massih R Amini, Ioannis Partalas, Franck Iutzeler, and Yury Maximov. Aggressive sampling for multi-class to binary reduction with applications to text classification. In Advances in Neural Information Processing Systems, pages 4159–4168, 2017.
-  Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. LIBLINEAR: A Library for Large Linear Classification. Journal of Machine Learning Research, 9:1871–1874, 2008.
-  Koby Crammer and Yoram Singer. On the algorithmic implementation of multiclass kernel-based vector machines. J. Mach. Learn. Res., 2:265–292, 2002.
-  Hal Daume, Nikos Karampatziakis, John Langford, and Paul Mineiro. Logarithmic Time One-Against-Some. pages 1–13, 2016.
-  Yashoteja Prabhu and Manik Varma. Fastxml: A fast, accurate and stable tree-classifier for extreme multi-label learning. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 263–272. ACM, 2014.
-  Himanshu Jain, Yashoteja Prabhu, and Manik Varma. Extreme Multi-label Loss Functions for Recommendation, Tagging, Ranking & Other Missing Label Applications. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD ’16, pages 935–944, New York, New York, USA, 2016. ACM Press.
-  Behnam Neyshabur and Nathan Srebro. On Symmetric and Asymmetric LSHs for Inner Product Search. Proceedings of The 32nd International Conference on Machine Learning, 37:1926–1934, 2015.
-  George H. Chen and Devavrat Shah. Explaining the Success of Nearest Neighbor Methods in Prediction. Foundations and Trends® in Machine Learning, 10(5-6):337–588, 2018.
-  Yury Malkov, Alexander Ponomarenko, Andrey Logvinov, and Vladimir Krylov. Approximate nearest neighbor algorithm based on navigable small world graphs. Information Systems, 45:61–68, sep 2014.
-  Martin Aumüller, Erik Bernhardsson, and Alexander Faithfull. ANN-Benchmarks: A Benchmarking Tool for Approximate Nearest Neighbor Algorithms. volume 8199 of Lecture Notes in Computer Science, pages 34–49. Springer Berlin Heidelberg, Berlin, Heidelberg, 2017.
Proof of Theorem 2
Following the main text, we treat the difference between the exact margin and the inexact one as a gradient error.
For convenience, we denote by $\ell_t(W)$ the instantaneous loss on the $t$-th object. Thus
The relative progress towards the optimal solution at the $t$-th iteration is
where the last inequality is due to the contraction property of the projection onto a convex set.
By the strong convexity of the objective we have
By the boundedness of the iterates and the strong convexity of the objective, dividing both sides of the equation above by the step size and rearranging, we get:
Summing over all $t$ one has
Rearranging the terms, we have
where each error term is bounded with probability at least $1 - \delta$ and is always bounded due to the bound on the sub-gradient norm. Thus, with probability at least $1 - \delta$, by the Hoeffding inequality,
which finishes the proof of the theorem. ∎
Parameters in Numerical Study
The hash string length is set to 64 for all five datasets.