MEMOIR: Multi-class Extreme Classification with Inexact Margin

Anton Belyy
ITMO University

Multi-class classification with a very large number of classes, or extreme classification, is a challenging problem from both statistical and computational perspectives. Most classical approaches to multi-class classification, including one-vs-rest and multi-class support vector machines, require exact estimation of the classifier’s margin at both the training and the prediction steps, which makes them intractable in extreme classification scenarios. In this paper, we study the impact of computing an approximate margin using approximate nearest neighbor (ANN) search structures combined with locality-sensitive hashing (LSH). This approximation dramatically reduces both the training and the prediction time without a significant loss in performance. We prove that this approximation does not lead to a significant loss in the risk of the model and provide empirical evidence over five publicly available large-scale datasets, showing that the proposed approach is highly competitive with state-of-the-art approaches on time, memory and performance measures.


Recently, the problem of large-scale multi-class classification has become very popular in the machine learning community owing to numerous applications in computer vision [1, 2], text categorization [3], and recommendation and ranking systems [4, 5]. Publicly available text and image repositories such as Wikipedia, Yahoo!, or Directory Mozilla (DMOZ) contain millions of objects from thousands of classes. For instance, the LSHTC dataset from the Mozilla DMOZ collection contains 163K objects belonging to 12K classes and described by 409K features. Classical multi-class classification approaches, such as one-vs-one and one-vs-rest, remain popular mainly because of their high accuracy and robustness to noisy input. On the other hand, their direct application to extreme classification problems is doubtful because of their highly non-linear time and memory costs [6, 7].

Some promising attempts have been made to reduce the computation time of these models, using either locality-sensitive hashing [8] or reinforcement learning policies in an online convex optimization context [9]. Despite empirical evidence of how well these approaches perform, to the best of our knowledge none of these studies proposes a well-founded theoretical strategy for large-scale multi-class classification that guarantees a gain in computation time without a significant loss in the statistical risk of the model.

In this paper, we propose a novel method for approximating the output estimation of multi-class classification models using approximate nearest neighbor (ANN) search structures with locality-sensitive hashing (LSH). We show that the proposed inexact margin computation significantly reduces time and memory requirements, allowing popular techniques, such as multi-class support vector machines and the one-vs-rest approach, to scale to extreme classification problems without a significant loss in their true risk.

The contribution of this paper is threefold. Namely, we

  1. propose an inexact margin multi-class classification framework and provide a theoretical analysis of its behavior;

  2. design efficient numerical methods for $\ell_1$- and $\ell_2$-regularized inexact margin multi-class support vector machines;

  3. provide empirical evidence of its ability to learn efficient models compared to state-of-the-art approaches over multiple extreme classification datasets and make available the corresponding open-source software for research purpose.

In the next section we introduce a framework of multi-class margin classification and describe the inexact margin idea in more detail. Then we provide a theoretical analysis of the statistical performance of multi-class classification methods with inexact margin supporting it by the excess risk bounds of the corresponding classification algorithms. Further, we present experimental results obtained on publicly available extreme classification benchmarks, and finally, we briefly discuss the proposed algorithms and compare them to existing solutions.

Multi-class Margin Classification

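Let $S = \{(x_i, y_i)\}_{i=1}^{n}$ be an independent and identically distributed (i.i.d.) sample drawn from a fixed yet unknown distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$, where $\mathcal{X} \subseteq \mathbb{R}^d$ is a feature space and $\mathcal{Y}$, with $|\mathcal{Y}| = C$, is a set of classes. Given a hypothesis set $\mathcal{H}$ of functions mapping $\mathcal{X} \times \mathcal{Y}$ to $\mathbb{R}$, the exact margin of a labeled example $(x, y)$ with respect to a function $h \in \mathcal{H}$ is defined as

$$m_h(x, y) = h(x, y) - \max_{y' \in \mathcal{Y},\, y' \neq y} h(x, y').$$

An observation $(x, y)$ is misclassified by a function $h \in \mathcal{H}$ if and only if $m_h(x, y) \le 0$. We refer to a class of margin loss functions as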

The main problem here is that computing the margin for an observation $(x, y)$ requires estimating $h(x, y')$ for each class $y'$, which is intractable when the number of classes is too large. For instance, in the case of linear classifiers, margin computation amounts to finding the maximal element of a matrix-vector product at each iteration, which is challenging in large-scale scenarios. To overcome this problem, we estimate an approximate, or inexact, margin for each observation by first choosing randomly a class and then estimating


In this paper, we focus on the influence of inexact margin computation over the true risk of a classifier

where $\mathbb{1}_{\pi}$ is equal to $1$ if the predicate $\pi$ is true and $0$ otherwise. More precisely, we are interested in the case where the classifier is found following the Empirical Risk Minimization principle by supposing that the approximate margin is inexact, that is for a given and with probability at least , , we have

The empirical loss considered in this work is based on the Hinge -loss function , (resp. ) for and defined as

Our main result is stated in Theorem 1, and it provides an upper-bound on the true risk of a multi-class classifier based on its empirical loss estimated with an inexact margin. The notion of function class capacity used in the bound is the Rademacher complexity of the set of functions [10]:

where the $\sigma_i$’s are independent uniform random variables taking values in $\{-1, +1\}$; i.e. $\mathbb{P}(\sigma_i = -1) = \mathbb{P}(\sigma_i = +1) = 1/2$.

Theorem 1.

Let be an i.i.d. sample from an unknown distribution over , , and be a class of functions from . Then for any with probability at least , the expected risk of any trained with a inexact margin is upper-bounded by


Moreover, for kernel-based hypotheses, with a PDS kernel and its associated feature mapping function, defined as :

where is the matrix formed by the weight vectors defining the kernel-based hypotheses, and is the group norm of , then if one has


The standard Rademacher complexity margin bound according to Theorem 4.4 of [10] gives with probability for a class of functions the following bounds:




where .

Similarly to [11], for all , let satisfy we have due to monotonicity of the Rademacher complexity in the number of examples


Finally, with probability at least in the conditions of the theorem, for all training objects we have , thus

Combining it with Ineq. (Multi-class Margin Classification) one gets


Applying Ineq. (8) to Ineq. (6) proves Ineq. (3) in the statement of the theorem. Theorem 4.3 of [10] gives for the linear classifiers, which proves Ineq. (4).

To prove the remaining inequalities, we note that with probability at least in the conditions of the theorem, for all training objects we have , thus

Margin approximation.

We consider two principal approaches to inexact margin computation: locality-sensitive hashing and convex optimization. Our main focus here is linear classification and the maximal inner product approximation.

Locality-Sensitive Hashing (LSH) is one paradigm for approximating the maximal inner product, a problem known as Maximum Inner Product Search (MIPS). Following [8], we introduce Definition 1.

Definition 1.

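A hash is said to be an $(S, cS, p_1, p_2)$-LSH for a similarity function $\mathrm{sim}$ over the pair of spaces $\mathcal{X}, \mathcal{Y}$ if for any $x \in \mathcal{X}$ and $y \in \mathcal{Y}$:

  • if $\mathrm{sim}(x, y) \ge S$ then $\mathbb{P}_h[h(x) = h(y)] \ge p_1$;

  • if $\mathrm{sim}(x, y) \le cS$ then $\mathbb{P}_h[h(x) = h(y)] \le p_2$, where $0 < c < 1$ and $p_1 > p_2$.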

As the optimal value of $S$ is unknown in advance (and might even be negative), to guarantee the utility of LSH for approximating the margin we require the LSH to be universal, i.e. for every $S$ and $0 < c < 1$ it is an $(S, cS, p_1, p_2)$-LSH [8]. The authors of [8] also propose the simple LSH algorithm based on random Gaussian projections, with hashing quality


Following [8], one needs to distinguish between , and with probability at least . Assume below for simplicity that for any . Applying LSH recursively until

one gets time to construct margin approximation, where

We also note that the maximal inner product can be equally stated as a stochastic convex optimization problem:

where we use the superscript index to denote a coordinate of the vectors, and denote by the uniform probability measure over . As is known from the seminal result of [12] on the stochastic mirror descent algorithm and from [13] on the stochastic Frank-Wolfe algorithm, there exist algorithms of potentially sub-linear time complexity that solve the problem approximately over sparse data. Nevertheless, a discussion of the optimization approach is out of the scope of this paper.

Multi-Class Support Vector Machines

Multi-class support vector machines (M-SVMs) remain among the top classification methods owing to their accuracy and robustness [14]. In this section, we analyse simple sub-gradient descent methods for training M-SVMs with $\ell_1$ and $\ell_2$ regularization in terms of the influence of inexact margins on their accuracy. Our consideration is mainly inspired by the seminal work of [15] on support vector machine optimization with an inexact oracle.


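According to [16], the multi-class SVM classifier decision rule is

$$\hat{y}(x) = \arg\max_{k \in \{1, \dots, C\}} \langle w_k, x \rangle,$$

where $w_k$ is the weight vector of the $k$-th classifier, $C$ is the number of classes, and

$$W = (w_1, \dots, w_C)$$

is the matrix of weight vectors of all classifiers.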

The learning objective is


where is -Hinge loss function.

Algorithm 1 is essentially an extension of the Pegasos algorithm for training $\ell_2$-regularized support vector machines [7]. We further refer to it as MEMOIR-$\ell_2$. The convergence rate of the algorithm is established in Theorem 2.

Theorem 2.

Assume that for all , . Let be an optimal solution for the Problem (10) and also the batch size for all in Algorithm 1. Then for one has with probability at least ,

where on each -th training step we use inexact margin, , and .


Following [15], we treat the margin inexactness as adversarial noise in the gradient and then derive its total influence on the minimization. The bound on is nothing more than an application of the Hoeffding inequality to the total distortion introduced by the inexact margin. The full proof is provided in the supplementary material. ∎

Theorem 2 requires the inexactness in margin computation to be bounded by the which in its turn should sum up to for the consistency of the algorithm. This requirement is important from a theoretical perspective, as it limits the performance of numerical schemes based on convex optimization and LSH in margin approximation. On the other hand, it is much less crucial from a practical perspective, as we see in our numerical study.

  for  do
     Get batch uniformly at random
     for  do
        Compute approximately
        if  then
        end if
     end for
  end for
Algorithm 1: $\ell_2$-regularized Multi-class Support Vector Machines with Approximate Maximal Inner Product Search
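As a rough illustration of this kind of update, the sketch below runs a Pegasos-style mini-batch sub-gradient method for the Crammer-Singer hinge loss with a pluggable oracle that returns the highest-scoring wrong class. It is not the authors' implementation: the step size $1/(\lambda t)$, the projection radius $1/\sqrt{\lambda}$, all function names, the toy data, and the `sampled_rival` oracle (an exact search restricted to a random subset of classes, standing in for an ANN structure) are assumptions made for illustration.

```python
import numpy as np


def exact_rival(W, x, y, rng=None):
    """Exact oracle: highest-scoring class other than the true class y."""
    scores = W @ x
    scores[y] = -np.inf
    return int(np.argmax(scores))


def sampled_rival(W, x, y, rng, n_candidates=2):
    """Inexact oracle (illustrative): search only a random subset of classes."""
    wrong = np.array([c for c in range(W.shape[0]) if c != y])
    cand = rng.choice(wrong, size=min(n_candidates, len(wrong)), replace=False)
    return int(cand[np.argmax(W[cand] @ x)])


def train_msvm(X, Y, n_classes, oracle=exact_rival, lam=0.01,
               n_iters=200, batch_size=8, seed=0):
    """Pegasos-style sub-gradient descent for an l2-regularized multi-class SVM."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = np.zeros((n_classes, d))
    for t in range(1, n_iters + 1):
        eta = 1.0 / (lam * t)                      # assumed step size
        batch = rng.choice(n, size=batch_size, replace=False)
        G = np.zeros_like(W)                       # hinge-loss sub-gradient
        for i in batch:
            x, y = X[i], Y[i]
            r = oracle(W, x, y, rng)               # (approximate) max inner product
            if 1.0 + W[r] @ x - W[y] @ x > 0:      # margin violated
                G[y] -= x
                G[r] += x
        W = (1.0 - eta * lam) * W - (eta / batch_size) * G
        norm = np.linalg.norm(W)
        if norm > 1.0 / np.sqrt(lam):              # project onto a norm ball
            W *= 1.0 / (np.sqrt(lam) * norm)
    return W


def accuracy(W, X, Y):
    """Fraction of rows of X whose arg-max score matches the label."""
    return float(np.mean(np.argmax(X @ W.T, axis=1) == Y))


# Toy data: four well-separated classes in four dimensions.
rng = np.random.default_rng(1)
Y = rng.integers(0, 4, size=200)
X = 3.0 * np.eye(4)[Y] + 0.3 * rng.standard_normal((200, 4))
W_exact = train_msvm(X, Y, n_classes=4)
W_inexact = train_msvm(X, Y, n_classes=4, oracle=sampled_rival)
```

Swapping the oracle is the only change needed to move from exact to inexact margins, which is the property the analysis in Theorem 2 relies on.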


  for  do
     Get batch
     for  do
        Compute approximately
        if  then
        end if
     end for
     for  do
         (Eq. 13)
     end for
  end for
Algorithm 2: $\ell_1$-regularized Multi-class Support Vector Machines with Approximate Maximal Inner Product Search

In the area of text classification, objects are often described by TF-IDF or word/collocation frequencies and have a sparse representation, which is crucial for large-scale machine learning problems. To control the sparsity of , we use a simple truncated stochastic gradient descent given in Algorithm 2. It is worth mentioning a variety of efficient optimization schemes for minimization [17, 18, 19, 20]. We believe that a similar technique could be used with any of the schemes above as well as with the sub-gradient descent we consider here. We also refer to this algorithm as MEMOIR-$\ell_1$.
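A truncation step of this kind can be sketched as follows. The sketch zeroes out the smallest-magnitude weights as long as the total $\ell_1$ mass removed stays within a budget; the function name, the choice of the $\ell_1$ norm, and the `budget` parameter are assumptions for illustration and do not reproduce the exact rule of Algorithm 2.

```python
import numpy as np


def truncate_smallest(w, budget):
    """Zero out the smallest-magnitude entries of w while the l1 norm of the
    removed entries does not exceed `budget` (a hypothetical threshold)."""
    w = np.asarray(w, dtype=float).copy()
    order = np.argsort(np.abs(w))                  # smallest magnitudes first
    removed_mass = np.cumsum(np.abs(w[order]))     # l1 mass removed so far
    k = int(np.searchsorted(removed_mass, budget, side="right"))
    w[order[:k]] = 0.0
    return w


w = np.array([0.05, -2.0, 0.01, 0.3, -0.02])
w_sparse = truncate_smallest(w, budget=0.1)
```

In this example the three smallest entries (total mass 0.08) are removed, while removing the next one would exceed the budget, so the two large weights are kept unchanged.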

The problem of multi-class support vector machines [20] is to minimize


A step of stochastic sub-gradient descent method




The details of the method are provided in Algorithm 2. Note the important truncation step, which zeroes out sufficiently small elements of and significantly reduces memory consumption. We apply it in our algorithm only when the resulting norm of the truncated elements is sufficiently small. In particular, for :



| Dataset | # of Classes | Dimension | Train Size | Test Size | Heldout Size |
|---|---|---|---|---|---|
| LSHTC1 | 12294 | 409774 | 126871 | 31718 | 5000 |
| DMOZ | 27875 | 594158 | 381149 | 95288 | 34506 |
| WIKI-Small | 36504 | 380078 | 796617 | 199155 | 5000 |
| WIKI-50k | 50000 | 951558 | 1102754 | 276939 | 5000 |
| WIKI-100k | 100000 | 1271710 | 2195530 | 550133 | 5000 |

Table 1: Description of datasets used for the numerical evaluation

Theorem 3.

After iterations of Algorithm 2 with the full update step and step sizes and corresponding inexact margins the expected loss at is bounded with probability as


where , and is the optimal value in Eq. (11), the subgradient’s -norm is bounded by , for any , and each truncation operation does not change the norm of on more than after each iteration.


The proof is a direct implication of the standard sub-gradient convergence analysis to the case of sub-gradient errors due to inexact margin and truncation. The condition on the inexact margin also guarantees that with probability at least each distortion in margin is bounded by

Numerical Experiments

In this section we provide an empirical evaluation of the proposed algorithms and compare them to several state-of-the-art approaches. We also discuss how hyperparameter tuning affects the algorithms’ performance from the time, memory and quality perspectives.


We use datasets from the Large Scale Hierarchical Text Classification Challenge (LSHTC) 1 and 2 [21], which were converted from multi-label to multi-class format by replicating the instances belonging to different class labels. These datasets are provided in a pre-processed format using both stemming and stop-words removal. Their characteristics, such as train, test, and heldout sizes, are listed in Table 1. We would like to thank authors of [22] for providing us with these datasets.

Evaluation Measures.

During the experiments two quality measures were evaluated: the accuracy and the Macro-Averaged F1 Measure (MaF1). The former is the fraction of the test data classified correctly; the latter is the harmonic mean of macro-precision and macro-recall. Higher values correspond to better performance. Being insensitive to class imbalance, the MaF1 is commonly used for comparing multi-class classification algorithms.
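For concreteness, the two measures can be computed as in the sketch below. It follows the definition given above (harmonic mean of macro-precision and macro-recall), which differs from averaging per-class F1 scores. The function name and the convention of averaging only over classes that occur in the test labels are assumptions.

```python
import numpy as np


def accuracy_and_maf1(y_true, y_pred, n_classes):
    """Accuracy and MaF1, the harmonic mean of macro-precision and macro-recall."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.bincount(y_true[y_true == y_pred], minlength=n_classes)
    n_pred = np.bincount(y_pred, minlength=n_classes)
    n_true = np.bincount(y_true, minlength=n_classes)
    present = n_true > 0                           # classes seen in the test set
    precision = np.divide(tp, n_pred, out=np.zeros(n_classes), where=n_pred > 0)
    recall = np.divide(tp, n_true, out=np.zeros(n_classes), where=present)
    ma_p = precision[present].mean()
    ma_r = recall[present].mean()
    maf1 = 0.0 if ma_p + ma_r == 0 else 2 * ma_p * ma_r / (ma_p + ma_r)
    return float(np.mean(y_true == y_pred)), float(maf1)


acc, maf1 = accuracy_and_maf1([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0], n_classes=3)
```

On this toy input the macro-precision is 13/18 and the macro-recall is 2/3, so MaF1 equals 52/75, while the accuracy is 4/6.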


We compare the MEMOIR-$\ell_1$ and MEMOIR-$\ell_2$ SVM algorithms with the following multi-class classification algorithms:

  • OVR: One-vs-rest SVM implementation from LIBLINEAR [23].

  • M-SVM: Multi-class SVM implementation from LIBLINEAR proposed in [24].

  • RecallTree: A logarithmic time one-vs-some tree-based classifier. It utilises trees for selecting a small subset of labels with high recall and scores them with high precision [25].

  • FastXML: A computationally efficient algorithm for extreme multi-labeling problems. It uses hierarchical partitioning of feature space together with direct optimization of nDCG ranking measure [26].

  • PfastReXML: A tree ensemble based algorithm which is an enhanced version of FastXML: the nDCG loss is replaced with its propensity scored variant which is unbiased and assigns higher rewards for accurate tail label predictions [27].

  • PD-Sparse: A recent classifier with a multi-class margin loss and Elastic Net regularisation. To establish an optimal solution the Dual-Space Fully-Corrective Block-Coordinate Frank-Wolfe algorithm is used [18].

Hardware Platform.

In our experiments we use a system with two 20-core Intel(R) Xeon(R) Silver 4114 2.20GHz CPUs and 128 GB of RAM. A maximum of 16 cores are used in each experiment for training and predicting. The proposed algorithms run in a single main thread, but querying and updating the ANN search structure is parallelized. We achieved a perfect 8x training time speed-up when using 8 cores, an 11x speed-up when using 16 cores, and a 20x speed-up when using 32 cores (compared to a single-core run of the MEMOIR- algorithm on the LSHTC1 dataset). This means that the proposed algorithm scales well, with a reasonable yet sublinear improvement in training time as the number of cores grows.
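The reported speed-ups can be restated as parallel efficiency, that is, speed-up divided by the number of cores. The short computation below uses only the figures quoted above; the variable names are illustrative.

```python
# Speed-up relative to a single-core run, keyed by the number of cores used.
speedups = {8: 8.0, 16: 11.0, 32: 20.0}

# Parallel efficiency: 1.0 means perfectly linear scaling.
efficiency = {cores: s / cores for cores, s in speedups.items()}
```

Efficiency drops from 1.0 on 8 cores to 0.6875 on 16 cores and 0.625 on 32 cores, which quantifies the sublinear scaling.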

Implementation Details.

In all algorithms we employ “lazy” matrix scaling, accumulating a shared matrix multiplier during all matrix-scalar multiplications in a form and later dividing by in addition and subtraction operations when necessary. This keeps the update time sublinear in the problem size. Additionally, we store the weight matrix as lists of CSR sparse vectors, one per class, thus requiring memory. Finally, in each algorithm we predict the class of a test object using exact MIPS computation. This is feasible because all algorithms produce highly sparse weight matrices , which makes the multiplication of the weight matrix by a test object vector very fast.
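A minimal sketch of lazy scaling is given below, assuming the weight matrix is represented as a shared scalar times a list of sparse rows. The class and method names are hypothetical, and plain dictionaries stand in for the CSR vectors used in the actual implementation.

```python
class LazyScaledMatrix:
    """Sparse rows with a shared multiplier: the represented matrix is scale * rows."""

    def __init__(self, n_rows):
        self.scale = 1.0
        self.rows = [dict() for _ in range(n_rows)]

    def scale_by(self, c):
        """Multiply the whole matrix by c in O(1) time."""
        if c == 0.0:
            # A zero multiplier cannot be divided out later, so reset storage.
            self.rows = [dict() for _ in self.rows]
            self.scale = 1.0
        else:
            self.scale *= c

    def add_to_row(self, k, x, coef=1.0):
        """Add coef * x to row k, touching only the non-zeros of x."""
        row = self.rows[k]
        for j, v in x.items():
            row[j] = row.get(j, 0.0) + coef * v / self.scale

    def dot_row(self, k, x):
        """Inner product of row k with a sparse vector x."""
        row = self.rows[k]
        return self.scale * sum(row.get(j, 0.0) * v for j, v in x.items())


M = LazyScaledMatrix(n_rows=2)
M.add_to_row(0, {1: 2.0})
M.scale_by(0.5)                      # no per-element work
M.add_to_row(0, {1: 1.0, 3: 4.0})
score = M.dot_row(0, {1: 1.0, 3: 1.0})
```

The design choice is that a regularization shrink, which would otherwise touch every stored weight, costs a single multiplication, while additions pay only for the non-zeros of the update vector.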

Inexact Margin Computation.

| Batch size | | | | | |
|---|---|---|---|---|---|
| Train time | 1569s | 885s | 1331s | 1984s | 1830s |
| Accuracy | 8.4% | 26.6% | 30.8% | 34.5% | 29.3% |
| MaF1 | 6.9% | 19.5% | 22.7% | 24.7% | 21.0% |

Table 2: Batch size tuning experiment

In the case of linear classifiers, the problem of inexact margin computation is known as the Maximum Inner Product Search (MIPS) problem. Many techniques, such as cone trees or locality-sensitive hashing [28] have been proposed to tackle this problem in a high dimensional setup. We refer to a recent survey [29] for more details.
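To make the link to MIPS concrete, the following sketch computes the exact margin of a linear classifier from the full matrix-vector product and an inexact margin from a candidate list such as an approximate MIPS solver might return. It is illustrative only: the function names and the toy weight matrix are assumptions, not part of the paper's implementation. Because the inexact version maximizes over a subset of the wrong classes, it can only overestimate the exact margin.

```python
import numpy as np


def exact_margin(W, x, y):
    """Score of the true class minus the best score among all other classes."""
    scores = W @ x
    return float(scores[y] - np.delete(scores, y).max())


def inexact_margin(W, x, y, candidates):
    """Same quantity, but the maximum is taken only over `candidates`,
    e.g. the class ids returned by an approximate MIPS query."""
    rivals = [c for c in candidates if c != y]
    return float(W[y] @ x - max(W[c] @ x for c in rivals))


W = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # one row per class
x = np.array([2.0, 1.0])
m_exact = exact_margin(W, x, y=0)                          # uses all classes
m_inexact = inexact_margin(W, x, y=0, candidates=[0, 1])   # misses class 2
```

Here the candidate list misses the strongest rival (class 2), so the inexact margin is positive although the example is actually misclassified; this is the kind of distortion the theoretical analysis has to control.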

To be effective in a large-scale setup, a MIPS solver should maintain sublinear costs of query and incremental update operations. In practice, the majority of solutions lack theoretical guarantees and provide only empirical evaluations [29].

In our experiments we use two different MIPS solvers: SimpleLSH, an LSH-based MIPS solver [28], and Navigable Small-World Graph (SW-Graph), a graph-based algorithm introduced in [30]. Here we provide theoretical guarantees regarding the properties of SimpleLSH as an inexact oracle, but our framework can be extended to any similar LSH implementation.
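The idea behind an LSH-based MIPS oracle can be sketched as follows, following the general SimpleLSH construction of [28] as we understand it: database vectors are scaled to norm at most one and augmented with one extra coordinate so that they become unit vectors, queries are normalized and padded with a zero, and both are hashed with sign random projections. The function names, the toy data, the 64-bit code length, and the brute-force Hamming scan are assumptions for illustration rather than the implementation used in the experiments.

```python
import numpy as np


def augment_database(W):
    """Scale rows to norm <= 1 and append sqrt(1 - ||w||^2), giving unit vectors."""
    scale = np.linalg.norm(W, axis=1).max()
    Ws = W / scale
    extra = np.sqrt(np.clip(1.0 - (Ws ** 2).sum(axis=1), 0.0, None))
    return np.hstack([Ws, extra[:, None]]), scale


def augment_query(q):
    """Normalize the query and append a zero coordinate."""
    return np.append(q / np.linalg.norm(q), 0.0)


def sign_hash(V, planes):
    """One bit per random hyperplane: the sign of the projection."""
    return (V @ planes.T) >= 0


def mips_candidates(codes, query_code, k):
    """Indices of the k database codes closest to the query in Hamming distance."""
    hamming = (codes != query_code).sum(axis=1)
    return np.argsort(hamming, kind="stable")[:k]


rng = np.random.default_rng(0)
W = rng.standard_normal((50, 8))             # toy class weight vectors
planes = rng.standard_normal((64, 9))        # 64 random hyperplanes
P, scale = augment_database(W)
codes = sign_hash(P, planes)

q = W[np.argmax(np.linalg.norm(W, axis=1))]  # query aligned with one row
cand = mips_candidates(codes, sign_hash(augment_query(q), planes), k=5)
best = cand[np.argmax(W[cand] @ q)]          # exact re-ranking of candidates
```

After augmentation the inner product between a database vector and a query is preserved up to the factors `scale` and the query norm, so ranking by angular similarity of the augmented vectors agrees with ranking by inner product; the hash codes approximate that angular ranking.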

In contrast to SimpleLSH, SW-Graph is a purely empirical algorithm, according to [30], yet we included it in our experiments due to its high performance in comparison to other solutions [31]. In our implementation we use the Non-Metric Space Library (nmslib), which provides a parallel implementation of the SW-Graph algorithm with incremental insertions and deletions. One useful property of nmslib is its ability to work with non-metric similarity functions, including the negative dot product, which makes it a suitable implementation of a MIPS oracle.


Figure 1: $\ell_1$ regularizer tuning experiment

In this subsection we discuss a reasonable choice of the algorithms’ parameters and how this choice affects training time, memory consumption and quality. The complete list with exact values of all hyperparameters is provided in the Appendix.

Common parameters.

The learning rate has the form , where $t$ is the iteration number and , are hyperparameters.

The total number of iterations was chosen for each dataset independently by observing the quality improvement on the heldout set. When the heldout MaF1 does not improve significantly for 5–10 iterations, we stop the learning process.
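A stopping rule of this kind can be sketched as below. The function name, the `patience` window and the `min_delta` threshold for a "significant" improvement are hypothetical; the text only states that training stops after 5 to 10 iterations without significant improvement of the heldout MaF1.

```python
def should_stop(history, patience=5, min_delta=1e-3):
    """Return True if the best heldout score over the last `patience`
    evaluations does not beat the best earlier score by at least `min_delta`."""
    if len(history) <= patience:
        return False                     # not enough evaluations yet
    best_before = max(history[:-patience])
    best_recent = max(history[-patience:])
    return best_recent < best_before + min_delta


plateau = [0.10, 0.20, 0.24, 0.24, 0.24, 0.24, 0.24, 0.24]
improving = [0.10, 0.20, 0.21, 0.22, 0.23, 0.24, 0.25, 0.30]
stop_on_plateau = should_stop(plateau, patience=5)
stop_on_improving = should_stop(improving, patience=5)
```

The rule fires on the plateau sequence and not on the improving one; in practice `history` would hold the heldout MaF1 measured after each iteration.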

Batch size was optimized using grid search on logarithmic scale and turned out to be identical for all algorithms and datasets. Generally, increasing batch size while keeping the number of observed objects fixed first leads to improvement in accuracy at a cost of increased learning time, see Table 2 for details.


The choice of the regularizer’s in the MEMOIR-$\ell_2$ algorithm affects quality, but has almost no effect on time and memory. This is because its value only affects the matrix multiplier and has no sparsifying effect on the weight matrix, unlike in the $\ell_1$ SVM. We found to be a reasonable choice in all our experiments.


Figure 2: Comparison in time, memory usage, MaF1 and accuracy of the seven best performing methods

The regularizer’s parameter in the MEMOIR-$\ell_1$ algorithm has little effect on training time (it takes only 33% longer to train MEMOIR-$\ell_1$ with compared to ), but provides a way to trade quality for memory. Figure 1 illustrates this effect. Decreasing increases the number of non-zero elements in the weight matrix, which leads to more accurate predictions made using these weights.


The results for the baselines and the proposed methods in terms of training and predicting time, total memory usage and predictive performance evaluated with accuracy and MaF1 are provided in Table 3. For a visual comparison we also display the results of seven best models graphically, see Figure 2.

All proposed algorithms consume substantially less memory than the existing solutions, with the MEMOIR-$\ell_1$ algorithm achieving the most impressive memory reduction. Among the proposed algorithms, MEMOIR-LSH achieves the fastest training time (due to the reduced feature space and fast computation of the Hamming distance) but has the highest memory usage. MEMOIR-$\ell_2$ is a good compromise between the previous two algorithms in terms of test quality, memory and training time.

| Data | Measure | OVR | M-SVM | RecallTree | FastXML | PfastReXML | PD-Sparse | MEMOIR-$\ell_2$ | MEMOIR-$\ell_1$ | MEMOIR-LSH |
|---|---|---|---|---|---|---|---|---|---|---|
| LSHTC1 | Train time | 23056s | 48313s | 701s | 8564s | 3912s | 5105s | 1673s | 2204s | 655s |
| n = 163589 | Predict time | 328s | 314s | 21s | 339s | 164s | 67s | 18s | 6s | 12s |
| d = 409774 | Total memory | 40.3G | 40.3G | 122M | 470M | 471M | 10.5G | 119M | 46M | 218M |
| C = 12294 | Accuracy | 44.1% | 36.4% | 18.1% | 39.3% | 39.8% | 45.7% | 34.5% | 31.9% | 34.6% |
| | MaF1 | 27.4% | 18.8% | 3.8% | 21.3% | 22.4% | 27.7% | 24.4% | 23.6% | 24.8% |
| DMOZ | Train time | 180361s | 212356s | 2212s | 14334s | 15492s | 63286s | 5709s | 7226s | 3637s |
| n = 510943 | Predict time | 2797s | 3981s | 47s | 424s | 505s | 482s | 76s | 22s | 77s |
| d = 594158 | Total memory | 131.9G | 131.9G | 256M | 1339M | 1242M | 28.1G | 271M | 52M | 417M |
| C = 27875 | Accuracy | 37.7% | 32.2% | 16.9% | 33.4% | 33.7% | 40.8% | 25.1% | 20.6% | 25.2% |
| | MaF1 | 22.2% | 14.3% | 1.8% | 15.1% | 15.9% | 22.7% | 18.8% | 17.1% | 19.1% |
| WIKI-Small | Train time | 212438s | 4d | 1610s | 10646s | 21702s | 16309s | 4791s | 7055s | 4165s |
| n = 1000772 | Predict time | 2270s | NA | 24s | 453s | 871s | 382s | 88s | 36s | 61s |
| d = 380078 | Total memory | 109.1G | 109.1G | 178M | 949M | 947M | 12.4G | 121M | 39M | 213M |
| C = 36504 | Accuracy | 15.6% | NA | 7.9% | 11.1% | 12.1% | 15.6% | 19.0% | 17.0% | 19.4% |
| | MaF1 | 8.8% | NA | 1% | 4.6% | 5.6% | 9.9% | 12.7% | 12.1% | 13.1% |
| WIKI-50K | Train time | NA | NA | 4188s | 30459s | 48739s | 41091s | 6755s | 17303s | 6215s |
| n = 1384693 | Predict time | NA | NA | 45s | 1110s | 2461s | 790s | 120s | 59s | 120s |
| d = 951558 | Total memory | 330G | 330G | 226M | 1327M | 1781M | 35G | 185M | 42M | 424M |
| C = 50000 | Accuracy | NA | NA | 17.9% | 25.8% | 27.3% | 33.8% | 29.6% | 24.2% | 30.0% |
| | MaF1 | NA | NA | 5.5% | 14.6% | 16.3% | 23.4% | 22.9% | 20.0% | 23.3% |
| WIKI-100K | Train time | NA | NA | 8593s | 42359s | 73371s | 155633s | 21061s | 46730s | 14323s |
| n = 2750663 | Predict time | NA | NA | 90s | 1687s | 3210s | 3121s | 457s | 161s | 504s |
| d = 1271710 | Total memory | 1017G | 1017G | 370M | 2622M | 2834M | 40.3G | 346M | 35M | 660M |
| C = 100000 | Accuracy | NA | NA | 8.4% | 15.0% | 16.1% | 22.2% | 22.3% | 14.0% | 22.6% |
| | MaF1 | NA | NA | 1.4% | 8.0% | 9.0% | 15.1% | 16.9% | 11.8% | 17.2% |

Table 3: Comparison of the results of various baselines and the proposed methods in terms of time, memory, accuracy, and macro F1-measure


Our paper is about extreme multi-class classification using inexact margin computation. We provided a theoretical analysis of the classification risk in multi-class (one-vs-rest and Crammer-Singer) settings and showed that inexact margin computation does not lead to a significant loss in the models’ risk. We then designed three efficient methods, MEMOIR-$\ell_1$, MEMOIR-$\ell_2$ and MEMOIR-LSH, that solve the Crammer-Singer multi-class SVM problem with an inexact approximation of the margin using two different MIPS solvers, SW-Graph and SimpleLSH. We illustrated the empirical performance of these methods on five extreme classification datasets, on which we achieved good results in terms of quality, memory and training time.

Finally, we discussed how parameter tuning affects the algorithms’ performance and provided practical advice on how to choose hyperparameters for the proposed algorithms. Our implementation is publicly available in the form of an open-source library.


  • [1] Florent Perronnin and Christopher Dance. Fisher kernels on visual vocabularies for image categorization. In IEEE conference on computer vision and pattern recognition, pages 1–8, 2007.
  • [2] Armand Joulin, Francis Bach, and Jean Ponce. Multi-class cosegmentation. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 542–549. IEEE, 2012.
  • [3] Thorsten Joachims. Text categorization with support vector machines: Learning with many relevant features. In European conference on machine learning, pages 137–142. Springer, 1998.
  • [4] Yang Song, Ziming Zhuang, Huajing Li, Qiankun Zhao, Jia Li, Wang-Chien Lee, and C Lee Giles. Real-time automatic tag recommendation. In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, pages 515–522. ACM, 2008.
  • [5] Amnon Shashua and Anat Levin. Ranking with large margin principle: Two approaches. In Advances in neural information processing systems, pages 961–968, 2003.
  • [6] Jesse Read, Bernhard Pfahringer, Geoff Holmes, and Eibe Frank. Classifier chains for multi-label classification. Machine learning, 85(3):333, 2011.
  • [7] Shai Shalev-Shwartz, Yoram Singer, Nathan Srebro, and Andrew Cotter. Pegasos: Primal estimated sub-gradient solver for SVM. Mathematical Programming, 127(1):3–30, 2011.
  • [8] Behnam Neyshabur and Nathan Srebro. On symmetric and asymmetric lshs for inner product search. arXiv preprint arXiv:1410.5518, 2014.
  • [9] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2-3):235–256, 2002.
  • [10] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of machine learning. MIT press, 2012.
  • [11] Yunwen Lei, Urun Dogan, Alexander Binder, and Marius Kloft. Multi-class svms: From tighter data-dependent generalization bounds to novel algorithms. In Advances in Neural Information Processing Systems, pages 2035–2043, 2015.
  • [12] Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on optimization, 19(4):1574–1609, 2009.
  • [13] Elad Hazan and Haipeng Luo. Variance-reduced and projection-free stochastic optimization. In International Conference on Machine Learning, pages 1263–1271, 2016.
  • [14] Ryan Rifkin and Aldebaro Klautau. In defense of one-vs-all classification. Journal of machine learning research, 5(Jan):101–141, 2004.
  • [15] Zhuang Wang, Koby Crammer, and Slobodan Vucetic. Multi-class Pegasos On A Budget. International Conference on Machine Learning (ICML), pages 1143–1150, 2010.
  • [16] Koby Crammer and Yoram Singer. On The Algorithmic Implementation of Multiclass Kernel-based Vector Machines. Journal of Machine Learning Research (JMLR), 2:265–292, 2001.
  • [17] Martin Jaggi. Revisiting frank-wolfe: projection-free sparse convex optimization. In Proceedings of the 30th International Conference on International Conference on Machine Learning-Volume 28, pages I–427. JMLR. org, 2013.
  • [18] I En-Hsu Yen, X Huang, P Ravikumar, K Zhong, and I Dhillon. PD-Sparse : A Primal and Dual Sparse Approach to Extreme Multiclass and Multilabel Classification. Proceedings of The 33rd International Conference on Machine Learning, 48:3069–3077, 2016.
  • [19] Donald Goldfarb, Garud Iyengar, and Chaoxu Zhou. Linear convergence of stochastic frank wolfe variants. In Artificial Intelligence and Statistics, pages 1066–1074, 2017.
  • [20] Ji Zhu, Saharon Rosset, Robert Tibshirani, and Trevor J Hastie. 1-norm support vector machines. In Advances in neural information processing systems, pages 49–56, 2004.
  • [21] Ioannis Partalas, Aris Kosmopoulos, Nicolas Baskiotis, Thierry Artieres, George Paliouras, Eric Gaussier, Ion Androutsopoulos, Massih-Reza Amini, and Patrick Galinari. LSHTC: A Benchmark for Large-Scale Text Classification. pages 1–9, mar 2015.
  • [22] Bikash Joshi, Massih R Amini, Ioannis Partalas, Franck Iutzeler, and Yury Maximov. Aggressive sampling for multi-class to binary reduction with applications to text classification. In Advances in Neural Information Processing Systems, pages 4159–4168, 2017.
  • [23] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. LIBLINEAR: A Library for Large Linear Classification. Journal of Machine Learning Research, 9:1871–1874, 2008.
  • [24] Koby Crammer and Yoram Singer. On the algorithmic implementation of multiclass kernel-based vector machines. J. Mach. Learn. Res., 2:265–292, 2002.
  • [25] Hal Daume, Nikos Karampatziakis, John Langford, and Paul Mineiro. Logarithmic Time One-Against-Some. pages 1–13, 2016.
  • [26] Yashoteja Prabhu and Manik Varma. Fastxml: A fast, accurate and stable tree-classifier for extreme multi-label learning. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 263–272. ACM, 2014.
  • [27] Himanshu Jain, Yashoteja Prabhu, and Manik Varma. Extreme Multi-label Loss Functions for Recommendation, Tagging, Ranking & Other Missing Label Applications. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD ’16, pages 935–944, New York, New York, USA, 2016. ACM Press.
  • [28] Behnam Neyshabur and Nathan Srebro. On Symmetric and Asymmetric LSHs for Inner Product Search. Proceedings of The 32nd International Conference on Machine Learning, 37:1926–1934, 2015.
  • [29] George H. Chen and Devavrat Shah. Explaining the Success of Nearest Neighbor Methods in Prediction. Foundations and Trends® in Machine Learning, 10(5-6):337–588, 2018.
  • [30] Yury Malkov, Alexander Ponomarenko, Andrey Logvinov, and Vladimir Krylov. Approximate nearest neighbor algorithm based on navigable small world graphs. Information Systems, 45:61–68, sep 2014.
  • [31] Martin Aumüller, Erik Bernhardsson, and Alexander Faithfull. ANN-Benchmarks: A Benchmarking Tool for Approximate Nearest Neighbor Algorithms. volume 8199 of Lecture Notes in Computer Science, pages 34–49. Springer Berlin Heidelberg, Berlin, Heidelberg, 2017.

Appendix for
MEMOIR: Multi-class Extreme Classification with Inexact Margin

Proof of Theorem 2

Theorem 2.

Assume that for all , . Let be an optimal solution for the Problem (10) and also the batch size for all in Algorithm 1. Then for one has with probability at least ,

where on each -th training step we use inexact margin, , and .


Following [15], we treat the difference between the exact margin and the inexact one as the gradient error.

For convenience, we denote by

the instantaneous loss on -th object. Thus

where is a normed ball, according to Lemma 1 of [7] and Lemma 1 of [15]. The distortion is bounded as with probability at least by the definition of the inexact margin.

The relative progress towards the optimal solution at -th iteration is


where the last inequality is due to the contraction property of projection on a convex set, and .

By the strong convexity of we have

As and the strong convexity of the objective we have . Dividing both sides of Eq. (Proof of Theorem 2) by and rearranging we get:

Summing over all one has


Rearranging the terms and using that we have


Finally combining the inequalities (Proof of Theorem 2), (Proof of Theorem 2), (Proof of Theorem 2) and Jensen’s inequality we get

where each is bounded as with probability at least , and always bounded as due to the bound on . Thus with probability at least by the Hoeffding inequality

which finishes the proof of the theorem. ∎

Parameters in Numerical Study

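| Algorithm | Parameter | LSHTC1 | DMOZ | WIKI-Small | WIKI-50K | WIKI-100K |
|---|---|---|---|---|---|---|
| OVR | C | 10 | 10 | 1 | NA | NA |
| RecallTree | --b | 30 | 30 | 30 | 30 | 28 |
| | --l | 1 | 0.7 | 0.7 | 0.5 | 0.5 |
| | --loss_function | Hinge | Hinge | Logistic | Hinge | Hinge |
| | --passes | 5 | 5 | 5 | 5 | 5 |
| FastXML | -t | 100 | 50 | 50 | 100 | 50 |
| | -c | 100 | 100 | 10 | 10 | 10 |
| PfastReXML | -t | 50 | 50 | 100 | 200 | 100 |
| | -c | 100 | 100 | 10 | 10 | 10 |
| PD-Sparse | -l | 0.01 | 0.01 | 0.001 | 0.0001 | 0.01 |
| | Hashing | multiTrainHash | multiTrainHash | multiTrainHash | multiTrainHash | multiTrainHash |
| MEMOIR-* | | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 |
| | | 0.02 | 0.02 | 0.02 | 0.02 | 0.02 |
| | | 1 | 1 | 1 | 1 | 1 |
| | | 25 | 40 | 48 | 60 | 80 |
| | | 25 | 40 | 48 | 60 | 80 |
| MEMOIR-LSH | | 1 | 1 | 1 | 1 | 1 |
| | | 25 | 40 | 48 | 60 | 80 |
| | hash string length | 64 | 64 | 64 | 64 | 64 |

Table 4: Hyper-parameters used in the final experiments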