MEMOIR: Multiclass Extreme Classification with Inexact Margin
Abstract
Multiclass classification with a very large number of classes, or extreme classification, is a challenging problem from both statistical and computational perspectives. Most of the classical approaches to multiclass classification, including one-vs-rest or multiclass support vector machines, require the exact estimation of the classifier's margin at both the training and the prediction steps, making them intractable in extreme classification scenarios. In this paper, we study the impact of computing an approximate margin using approximate nearest neighbor (ANN) search structures combined with locality-sensitive hashing (LSH). This approximation dramatically reduces both the training and the prediction time without a significant loss in performance. We theoretically prove that this approximation does not lead to a significant increase of the risk of the model, and we provide empirical evidence over five publicly available large-scale datasets showing that the proposed approach is highly competitive with state-of-the-art approaches in terms of time, memory and predictive performance.
Introduction
Recently, the problem of large-scale multiclass classification has become very popular in the machine learning community owing to numerous applications in computer vision [1, 2], text categorization [3], and recommendation and ranking systems [4, 5]. Publicly available text and image repositories such as Wikipedia, Yahoo! Directory (www.dir.yahoo.com), ImageNet (www.imagenet.org) or Directory Mozilla DMOZ (www.dmoz.org) contain millions of objects from thousands of classes. For instance, an LSHTC dataset from the Mozilla DMOZ collection contains 163K objects belonging to 12K classes and described by 409K features. Classical multiclass classification approaches, such as one-vs-one and one-vs-rest, remain popular mainly because of their high accuracy and robustness to noisy input. On the other hand, their direct application to extreme classification problems is doubtful due to their highly superlinear time and memory costs [6, 7].
Some promising attempts have been made to reduce the computation time of these models, either via locality-sensitive hashing [8] or via reinforcement learning policies in an online convex optimization context [9]. Despite empirical evidence of how well these approaches perform, to the best of our knowledge none of these studies proposes a well-founded theoretical strategy for large-scale multiclass classification that guarantees a gain in computation time without a significant loss in the statistical risk of the model.
In this paper, we propose a novel method for approximating the output estimation of multiclass classification models using approximate nearest neighbor (ANN) search structures with locality-sensitive hashing (LSH). We show that the proposed inexact margin computation significantly reduces time and memory requirements, allowing popular techniques, such as multiclass support vector machines and the one-vs-rest approach, to scale up without a significant loss of their true risk.
The contribution of this paper is threefold. Namely, we

- propose an inexact margin multiclass classification framework and provide a theoretical analysis of its behavior;
- design efficient numerical methods for $\ell_1$- and $\ell_2$-regularized inexact margin multiclass support vector machines;
- provide empirical evidence of its ability to learn efficient models compared to state-of-the-art approaches over multiple extreme classification datasets, and make the corresponding open-source software available for research purposes.
In the next section, we introduce the framework of multiclass margin classification and describe the inexact margin idea in more detail. Then we provide a theoretical analysis of the statistical performance of multiclass classification methods with inexact margin, supporting it with excess risk bounds for the corresponding classification algorithms. Further, we present experimental results obtained on publicly available extreme classification benchmarks, and finally, we briefly discuss the proposed algorithms and compare them to existing solutions.
Multiclass Margin Classification
Let $S = \{(x_i, y_i)\}_{i=1}^{n}$ be an identically and independently distributed (i.i.d.) sample with respect to a fixed yet unknown distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$, where $\mathcal{X}$ is a feature space and $\mathcal{Y} = \{1, \dots, K\}$ is a set of classes. Given a hypothesis set $\mathcal{F}$ of functions mapping $\mathcal{X} \times \mathcal{Y}$ to $\mathbb{R}$, the exact margin of a labeled example $(x, y)$ with respect to a function $f \in \mathcal{F}$ is defined as

$$\rho_f(x, y) = f(x, y) - \max_{y' \neq y} f(x, y'). \qquad (1)$$

An observation $(x, y)$ is misclassified by a function $f$ if and only if $\rho_f(x, y) \le 0$. We refer to the class of margin loss functions as $\mathcal{L}_\gamma = \{(x, y) \mapsto \Phi_\gamma(\rho_f(x, y)) : f \in \mathcal{F}\}$, where $\Phi_\gamma$ is the margin loss defined below.
The main problem here is that the computation of the margin for an observation $(x, y)$ requires the estimation of $f(x, y')$ for each $y' \in \mathcal{Y}$, which is intractable when the number of classes $K$ is too large. For instance, in the case of linear classifiers, margin computation amounts to finding the maximal element of a matrix-vector product on each iteration, which is challenging in large-scale scenarios. In order to overcome this problem in such cases, we estimate an approximate, or inexact, margin for each observation by first choosing, via a randomized search structure, a candidate class $\hat{y} \neq y$ approximately maximizing $f(x, \cdot)$, and then estimating

$$\hat{\rho}_f(x, y) = f(x, y) - f(x, \hat{y}). \qquad (2)$$
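To make the distinction concrete, the following sketch (function and variable names are ours, not the paper's software) contrasts the exact margin (1), which costs one inner product per class, with the inexact margin (2), which needs only two inner products once a candidate class is supplied by a search structure:

```python
import numpy as np

def exact_margin(W, x, y):
    # Exact margin (1): score every class, O(K * d) for K classes.
    scores = W @ x
    competitors = np.delete(scores, y)  # scores of all classes y' != y
    return scores[y] - competitors.max()

def inexact_margin(W, x, y, y_hat):
    # Inexact margin (2): y_hat is a candidate class returned by an
    # ANN/MIPS oracle, so only two inner products are required.
    return W[y] @ x - W[y_hat] @ x
```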
In this paper, we focus on the influence of inexact margin computation on the true risk of a classifier

$$R(f) = \mathbb{E}_{(x, y) \sim \mathcal{D}}\left[\mathbb{1}_{\rho_f(x, y) \le 0}\right],$$

where $\mathbb{1}_{\pi}$ is equal to $1$ if the predicate $\pi$ is true and $0$ otherwise. More precisely, we are interested in the case where the classifier is found following the Empirical Risk Minimization principle by supposing that the approximate margin is $(\epsilon, \delta_1)$-inexact, that is, for a given $\epsilon \ge 0$ and with probability at least $1 - \delta_1$, we have

$$0 \le \hat{\rho}_f(x, y) - \rho_f(x, y) \le \epsilon.$$

The empirical loss considered in this work is based on the Hinge loss function $\Phi_\gamma(t) = \min\left(1, \max\left(0, 1 - t/\gamma\right)\right)$ applied to the exact margin $\rho_f$ (resp. the inexact margin $\hat{\rho}_f$) for $\gamma > 0$, and is defined as

$$\hat{R}_\gamma(f) = \frac{1}{n} \sum_{i=1}^{n} \Phi_\gamma(\rho_f(x_i, y_i)) \qquad \left(\text{resp. } \hat{R}_{\gamma, \epsilon}(f) = \frac{1}{n} \sum_{i=1}^{n} \Phi_\gamma(\hat{\rho}_f(x_i, y_i))\right).$$
Our main result is stated in Theorem 1; it provides an upper bound on the true risk of a multiclass classifier based on its empirical loss estimated with an inexact margin. The notion of function class capacity used in the bound is the Rademacher complexity of the set of functions $\mathcal{F}$ [10]:

$$\mathfrak{R}_n(\mathcal{F}) = \mathbb{E}_{S, \sigma}\left[\sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \sigma_i f(x_i, y_i)\right],$$

where the $\sigma_i$'s are independent uniform random variables taking values in $\{-1, +1\}$; i.e. $\mathbb{P}(\sigma_i = +1) = \mathbb{P}(\sigma_i = -1) = 1/2$.
Theorem 1.

Let $S = \{(x_i, y_i)\}_{i=1}^{n}$ be an i.i.d. sample from an unknown distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$, $\mathcal{Y} = \{1, \dots, K\}$, and let $\mathcal{F}$ be a class of functions from $\mathcal{X} \times \mathcal{Y}$ to $\mathbb{R}$. Then for any $\delta > 0$, with probability at least $1 - \delta - \delta_1$, the expected risk of any $f \in \mathcal{F}$ trained with an $(\epsilon, \delta_1)$-inexact margin is upper bounded by

$$R(f) \le \hat{R}_{\gamma, \epsilon}(f) + \frac{\epsilon}{\gamma} + \frac{4K}{\gamma} \mathfrak{R}_n(\mathcal{F}) + \sqrt{\frac{\log \log_2(2/\gamma)}{n}} + \sqrt{\frac{\log(2/\delta)}{2n}}. \qquad (3)$$

Moreover, for kernel-based hypotheses, with $K(\cdot, \cdot)$ a PDS kernel and $\Phi : \mathcal{X} \to \mathbb{H}$ its associated feature mapping function, defined as

$$\mathcal{F} = \left\{(x, y) \in \mathcal{X} \times \mathcal{Y} \mapsto \langle w_y, \Phi(x) \rangle : W = (w_1, \dots, w_K), \ \|W\|_{\mathbb{H}, 2} \le \Lambda\right\},$$

where $W$ is the matrix formed by the weight vectors defining the kernel-based hypotheses and $\|W\|_{\mathbb{H}, 2}$ is the group norm of $W$, then if $\sup_x K(x, x) \le r^2$ one has

$$R(f) \le \hat{R}_{\gamma, \epsilon}(f) + \frac{\epsilon}{\gamma} + \frac{4K}{\gamma} \sqrt{\frac{r^2 \Lambda^2}{n}} + \sqrt{\frac{\log \log_2(2/\gamma)}{n}} + \sqrt{\frac{\log(2/\delta)}{2n}}. \qquad (4)$$
Proof.

The standard Rademacher complexity margin bound (Theorem 4.4 of [10]) gives, with probability $1 - \delta$, for a class of functions $\mathcal{F}$ the following bounds:

$$R(f) \le \hat{R}_\gamma(f) + \frac{4K}{\gamma} \mathfrak{R}_n(\mathcal{F}) + \sqrt{\frac{\log(1/\delta)}{2n}} \qquad (5)$$

and

$$R(f) \le \hat{R}_\gamma(f) + \frac{4K}{\gamma} \mathfrak{R}_n(\mathcal{F}) + \sqrt{\frac{\log \log_2(2/\gamma)}{n}} + \sqrt{\frac{\log(2/\delta)}{2n}}, \qquad (6)$$

where $\hat{R}_\gamma(f) = \frac{1}{n} \sum_{i=1}^{n} \Phi_\gamma(\rho_f(x_i, y_i))$ is the empirical loss computed with the exact margin.

Similarly to [11], using the monotonicity of the Rademacher complexity in the number of examples, we obtain  (7)

Finally, with probability at least $1 - \delta_1$, in the conditions of the theorem, for all training objects we have $\hat{\rho}_f(x_i, y_i) - \rho_f(x_i, y_i) \le \epsilon$; thus, since $\Phi_\gamma$ is $1/\gamma$-Lipschitz,

$$\Phi_\gamma(\rho_f(x_i, y_i)) \le \Phi_\gamma(\hat{\rho}_f(x_i, y_i)) + \frac{\epsilon}{\gamma}.$$

Combining it with Ineq. (7) one gets

$$\hat{R}_\gamma(f) \le \hat{R}_{\gamma, \epsilon}(f) + \frac{\epsilon}{\gamma}. \qquad (8)$$

Application of Ineq. (8) to Ineq. (6) proves Ineq. (3) in the statement of the theorem. Theorem 4.3 of [10] gives $\mathfrak{R}_n(\mathcal{F}) \le \sqrt{r^2 \Lambda^2 / n}$ for linear classifiers, which, together with the same argument on the inexact margin, proves Ineq. (4). ∎
Margin approximation.
We consider two principal approaches to inexact margin computation: locality-sensitive hashing and convex optimization. Our main focus here is linear classification and the approximation of the maximal inner product.
Locality-Sensitive Hashing (LSH) is a paradigm to approximate the maximal inner product $\max_{y'} \langle w_{y'}, x \rangle$, a problem known as Maximum Inner Product Search (MIPS). Following [8], we introduce Definition 1.
Definition 1.

A hash family $\mathcal{H}$ is said to be an $(S, cS, p_1, p_2)$-LSH for a similarity function $\mathrm{sim}$ over the pair of spaces $(\mathcal{X}, \mathcal{Q})$ if for any $x \in \mathcal{X}$ and $q \in \mathcal{Q}$:

- if $\mathrm{sim}(x, q) \ge S$ then $\mathbb{P}_{h \sim \mathcal{H}}[h(x) = h(q)] \ge p_1$;

- if $\mathrm{sim}(x, q) \le cS$ then $\mathbb{P}_{h \sim \mathcal{H}}[h(x) = h(q)] \le p_2$, with $p_1 > p_2$.
As the optimal value of the inner product $S$ is unknown in advance (and might even be negative), to guarantee the utility of LSH for approximating the margin we require the LSH to be universal, i.e. for every $S$ and $c < 1$ it is an $(S, cS, p_1, p_2)$-LSH [8]. The authors of [8] also propose the SimpleLSH algorithm based on random Gaussian projections, with hashing quality

$$\rho = \frac{\log\left(1 - \cos^{-1}(S)/\pi\right)}{\log\left(1 - \cos^{-1}(cS)/\pi\right)}. \qquad (9)$$
Following [8], one needs to distinguish between the similarity levels $S$ and $cS$ with probability at least $1 - \delta$. Assume below for simplicity that $\|x\|_2 \le 1$ for any $x$. Applying the LSH scheme recursively until the margin is resolved up to the accuracy $\epsilon$, one gets an $O(K^{\rho} \log K)$ time to construct the margin approximation, where $\rho < 1$ is the hashing quality (9) and $K$ is the number of classes.
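As an illustration, here is a minimal sketch of SimpleLSH under the above assumptions: data points with norm at most one are augmented so that the inner product with a normalized query becomes a cosine similarity, which random Gaussian hyperplanes then hash into sign bits. The transformation follows [8]; the function names are ours.

```python
import numpy as np

def augment_data(x):
    # P(x) = [x; sqrt(1 - ||x||^2)]: assumes ||x|| <= 1.
    return np.append(x, np.sqrt(max(0.0, 1.0 - float(x @ x))))

def augment_query(q):
    # Q(q) = [q / ||q||; 0]: then <P(x), Q(q)> = <x, q> / ||q||.
    return np.append(q / np.linalg.norm(q), 0.0)

def simple_lsh(v, A):
    # One sign bit per random Gaussian hyperplane (row of A); codes that
    # agree on many bits correspond to large inner products with the query.
    return (A @ v > 0).astype(np.uint8)

rng = np.random.default_rng(0)
A = rng.standard_normal((64, 101))     # 64 hash bits for 100-dim vectors
code = simple_lsh(augment_data(rng.random(100) * 0.05), A)
```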
We also note that the maximal inner product search can be equivalently stated as a stochastic convex optimization problem:

$$\max_{y' \in \mathcal{Y}} \langle w_{y'}, x \rangle = \max_{\lambda \in \Delta_K} \sum_{y'=1}^{K} \lambda_{y'} \langle w_{y'}, x \rangle = \max_{\lambda \in \Delta_K} d \, \mathbb{E}_{j \sim U}\Big[\sum_{y'=1}^{K} \lambda_{y'} w_{y'}^{j} x^{j}\Big],$$

where we use the superscript index to denote a coordinate of a vector, $\Delta_K$ is the probability simplex, and $U$ denotes the uniform probability measure over the coordinates $\{1, \dots, d\}$. As is known from the seminal result of [12] on the stochastic mirror descent algorithm and of [13] on the stochastic Frank-Wolfe method, there exist potentially sublinear time complexity algorithms to solve this problem approximately over sparse data. Nevertheless, the discussion of the optimization approach is out of the scope of this paper.
MultiClass Support Vector Machines
Multiclass support vector machines (MSVMs) still remain among the top classification methods owing to their accuracy and robustness [14]. In this section, we analyse simple subgradient descent methods to train MSVMs with $\ell_1$ and $\ell_2$ regularization in terms of the influence of inexact margins on their accuracy. Our consideration is mainly inspired by the seminal work of [15] on support vector machine optimization with an inexact oracle.
$\ell_2$ regularization.
According to [16], the multiclass SVM classifier decision rule is

$$\hat{y}(x) = \arg\max_{y \in \mathcal{Y}} \langle w_y, x \rangle,$$

where $w_y$ is the weight vector of the $y$-th classifier and $W = (w_1, \dots, w_K)^\top$ is the matrix of the weight vectors of all classifiers.
The learning objective is

$$\min_{W} \ \frac{\lambda}{2} \|W\|_2^2 + \frac{1}{n} \sum_{i=1}^{n} \Phi(\rho_W(x_i, y_i)), \qquad (10)$$

where $\Phi(t) = \max(0, 1 - t)$ is the Hinge loss function and $\rho_W(x, y) = \langle w_y, x \rangle - \max_{y' \neq y} \langle w_{y'}, x \rangle$.
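To fix ideas, below is a minimal sketch of one stochastic update for this objective with an inexact margin; the helper `mips_oracle` is a hypothetical stand-in for the ANN search structure maintained over the rows of $W$.

```python
import numpy as np

def memoir_l2_step(W, x, y, lam, t, mips_oracle):
    # One Pegasos-style update for objective (10) with an inexact margin.
    eta = 1.0 / (lam * t)                       # standard Pegasos step size
    y_hat = mips_oracle(W, x, exclude=y)        # approx. argmax_{y' != y} <w_y', x>
    violated = 1.0 - (W[y] - W[y_hat]) @ x > 0  # is the hinge loss active?
    W *= (1.0 - eta * lam)                      # l2 shrinkage (done lazily in practice)
    if violated:                                # subgradient of the hinge term
        W[y] += eta * x
        W[y_hat] -= eta * x
    return W
```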
Algorithm 1 is essentially an extension of the Pegasos algorithm [7] to train $\ell_2$-regularized multiclass support vector machines with an inexact margin. We further refer to it as MEMOIR-$\ell_2$. The convergence rate of the algorithm is established in Theorem 2.
Theorem 2.
Proof.
Following [15], we treat the margin inexactness as adversarial noise added to the gradient and derive its total influence on the minimization. The probabilistic bound is nothing more than an application of Hoeffding's inequality to the total distortion introduced by the inexact margin. The full proof is provided in the supplementary material. ∎
Theorem 2 requires the inexactness of the margin computation at each iteration to be bounded, and the accumulated inexactness to grow sublinearly with the number of iterations for the consistency of the algorithm. This requirement is important from the theoretical perspective, as it limits the performance of the numerical schemes based on convex optimization and LSH for margin approximation. On the other hand, it is much less crucial from the practical perspective, as we show in our numerical study.
$\ell_1$ regularization.
In the area of text classification, objects are often described by TF-IDF or word/collocation frequencies and have a sparse representation, which is crucial for large-scale machine learning problems. To control the sparsity of $W$, we use a simple truncated stochastic gradient descent given in Algorithm 2. It is worth mentioning the variety of efficient optimization schemes for $\ell_1$-regularized minimization [17, 18, 19, 20]. We believe that a similar truncation technique could be combined with any of the schemes above as well as with the subgradient descent we consider here. We refer to this algorithm as MEMOIR-$\ell_1$.
The $\ell_1$-regularized multiclass support vector machine problem [20] is to minimize

$$F(W) = \lambda \|W\|_1 + \frac{1}{n} \sum_{i=1}^{n} \Phi(\rho_W(x_i, y_i)). \qquad (11)$$

A step of the stochastic subgradient descent method is

$$W_{t+1} = W_t - \eta_t g_t, \qquad (12)$$

where $\eta_t$ is the step size and $g_t$ is a stochastic subgradient of the objective (11) at $W_t$, computed with the inexact margin on a random batch of training examples.
The details of the method are provided in Algorithm 2. Note the important truncation step, which zeroes out sufficiently small elements of $W$ and significantly reduces memory consumption. We apply it only when the norm of the truncated elements is sufficiently small. In particular, for a threshold $\theta > 0$:

$$\mathrm{trunc}(w, \theta)_j = \begin{cases} 0, & |w_j| \le \theta, \\ w_j, & \text{otherwise}, \end{cases} \qquad (13)$$

where $w_j$ denotes the $j$-th coordinate of a weight vector $w$.
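A minimal sketch of the truncation operator (13), including the guard on the norm of the removed entries, might look as follows (parameter names are ours):

```python
import numpy as np

def truncate(w, theta, max_change):
    # Zero out entries below theta, but only if the l2 norm of the removed
    # mass stays below max_change, keeping the iterate close to the original.
    small = np.abs(w) <= theta
    if np.linalg.norm(w[small]) <= max_change:
        w = w.copy()
        w[small] = 0.0
    return w
```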
Table 1: Dataset statistics.

Datasets | # of Classes, C | Dimension, d | Train Size | Test Size | Heldout Size
LSHTC1 | 12294 | 409774 | 126871 | 31718 | 5000
DMOZ | 27875 | 594158 | 381149 | 95288 | 34506
WIKI-Small | 36504 | 380078 | 796617 | 199155 | 5000
WIKI-50K | 50000 | 951558 | 1102754 | 276939 | 5000
WIKI-100K | 100000 | 1271710 | 2195530 | 550133 | 5000
Theorem 3.

After $T$ iterations of Algorithm 2 with the full update step, step sizes $\eta_t$, and corresponding $(\epsilon_t, \delta_t)$-inexact margins, the expected loss at $\bar{W} = \frac{1}{T} \sum_{t=1}^{T} W_t$ is bounded with probability $1 - \sum_{t=1}^{T} \delta_t$ as

$$F(\bar{W}) - F(W^*) \le \frac{\|W_1 - W^*\|_2^2 + G^2 \sum_{t=1}^{T} \eta_t^2}{2 \sum_{t=1}^{T} \eta_t} + \frac{\sum_{t=1}^{T} \eta_t (\epsilon_t + \tau_t)}{\sum_{t=1}^{T} \eta_t}, \qquad (14)$$

where $W^*$ is the minimizer and $F(W^*)$ the optimal value in Eq. (11), the subgradient's norm is bounded by $G$, i.e. $\|g_t\|_2 \le G$ for any $t$, and each truncation operation does not change the norm of $W_t$ by more than $\tau_t$ after each iteration.

Proof.

The proof is a direct adaptation of the standard subgradient convergence analysis to the case of subgradient errors due to the inexact margin and truncation. The condition on the inexact margin also guarantees that, with probability at least $1 - \sum_t \delta_t$, each distortion of the margin is bounded by $\epsilon_t$. ∎
Numerical Experiments
In this section, we provide an empirical evaluation of the proposed algorithms and compare them to several state-of-the-art approaches. We also discuss how hyperparameter tuning affects the algorithms' performance from the time, memory and quality perspectives.
Datasets.
We use datasets from the Large Scale Hierarchical Text Classification Challenge (LSHTC) 1 and 2 [21], which were converted from multilabel to multiclass format by replicating the instances belonging to different class labels. These datasets are provided in a preprocessed format, with stemming and stopword removal applied. Their characteristics, such as train, test, and heldout sizes, are listed in Table 1. We would like to thank the authors of [22] for providing us with these datasets.
Evaluation Measures.
During the experiments, two quality measures were evaluated: accuracy and the Macro-Averaged F1 Measure (MaF1). The former represents the fraction of the test data classified correctly; the latter is the harmonic mean of macro-precision and macro-recall. In both cases, higher values correspond to better performance. Being insensitive to class imbalance, MaF1 is commonly used for comparing multiclass classification algorithms.
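For reference, MaF1 as defined above can be computed in a few lines (a toy illustration with scikit-learn; the labels are made up):

```python
from sklearn.metrics import precision_score, recall_score

y_true = [0, 1, 2, 2, 1]
y_pred = [0, 2, 2, 2, 1]
maP = precision_score(y_true, y_pred, average="macro", zero_division=0)
maR = recall_score(y_true, y_pred, average="macro", zero_division=0)
maf1 = 2 * maP * maR / (maP + maR)  # harmonic mean of macro-P and macro-R
```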
Baselines.
We compare the proposed MEMOIR SVM algorithms with the following multiclass classification algorithms:

- OVR: One-vs-rest SVM implementation from LIBLINEAR [23].
- MSVM: Multiclass SVM implementation from LIBLINEAR, proposed in [24].
- RecallTree: A logarithmic-time one-vs-some tree-based classifier. It utilises trees to select a small subset of labels with high recall and scores them with high precision [25].
- FastXML: A computationally efficient algorithm for extreme multilabeling problems. It uses hierarchical partitioning of the feature space together with direct optimization of the nDCG ranking measure [26].
- PfastReXML: A tree-ensemble-based algorithm which is an enhanced version of FastXML: the nDCG loss is replaced with its propensity-scored variant, which is unbiased and assigns higher rewards to accurate tail-label predictions [27].
- PDSparse: A recent classifier with a multiclass margin loss and Elastic Net regularisation. To find an optimal solution, the Dual-Space Fully-Corrective Block-Coordinate Frank-Wolfe algorithm is used [18].
Hardware Platform.
In our experiments, we use a system with two 20-core Intel(R) Xeon(R) Silver 4114 2.20GHz CPUs and 128 GB of RAM. A maximum of 16 cores is used in each experiment for training and prediction. The proposed algorithms run in a single main thread, but querying and updating the ANN search structure is parallelized. We were able to achieve a perfect 8x training-time speed-up when using 8 cores, an 11x speed-up when using 16 cores, and a 20x speed-up when using 32 cores (compared to a single-core run of the MEMOIR algorithm on the LSHTC1 dataset). This means that the proposed algorithm scales well, with a substantial, yet sublinear, training-time improvement in the number of cores.
Implementation Details.
In all algorithms, we employ "lazy" matrix scaling: we accumulate a shared matrix multiplier $a$ during all matrix-scalar multiplications, representing the weights in the form $W = a \hat{W}$, and divide by $a$ in addition and subtraction operations when necessary. This keeps the update time sublinear in the problem size. Additionally, we store the weight matrix $W$ as a list of CSR sparse vectors, one per class, thus requiring memory proportional to the number of nonzero entries. Finally, in each algorithm, we perform class prediction for a test object using exact MIPS computation. This is feasible because all our algorithms produce highly sparse weight matrices $W$, which makes the multiplication of the weight matrix by a test object vector very fast.
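A minimal sketch of the lazy scaling trick (the class and method names are ours, not those of the actual implementation):

```python
class LazyScaledRows:
    """Represent W = a * W_hat so that W <- c * W costs O(1)."""

    def __init__(self, n_classes):
        self.a = 1.0
        self.rows = [dict() for _ in range(n_classes)]  # sparse rows of W_hat

    def scale(self, c):
        # W <- c * W without touching any stored entry.
        self.a *= c

    def add(self, y, idx, vals):
        # W[y, j] += v: divide by the multiplier to keep W = a * W_hat.
        row = self.rows[y]
        for j, v in zip(idx, vals):
            row[j] = row.get(j, 0.0) + v / self.a

    def dot(self, y, idx, vals):
        # <W[y], x> for a sparse x given as (indices, values).
        row = self.rows[y]
        return self.a * sum(row.get(j, 0.0) * v for j, v in zip(idx, vals))
```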
Inexact Margin Computation.
Table 2: Influence of the batch size on the LSHTC1 dataset (batch size increases from left to right).

LSHTC1 | | | | |
Train time | 1569s | 885s | 1331s | 1984s | 1830s
Accuracy | 8.4% | 26.6% | 30.8% | 34.5% | 29.3%
MaF1 | 6.9% | 19.5% | 22.7% | 24.7% | 21.0%
In the case of linear classifiers, the problem of inexact margin computation is known as the Maximum Inner Product Search (MIPS) problem. Many techniques, such as cone trees or locality-sensitive hashing [28], have been proposed to tackle this problem in a high-dimensional setup. We refer to a recent survey [29] for more details.
To be effective in a large-scale setup, a MIPS solver should maintain sublinear costs for query and incremental update operations. In practice, the majority of solutions lack theoretical guarantees, providing only empirical evaluations [29].
In our experiments, we use two different MIPS solvers: SimpleLSH, an LSH-based MIPS solver [28], and the Navigable Small-World Graph (SW-graph), a graph-based algorithm introduced in [30]. Here we provide theoretical guarantees regarding the properties of SimpleLSH as an inexact oracle, but our framework can be extended to any similar LSH implementation.
In contrast to SimpleLSH, SW-graph is a purely empirical algorithm, according to [30]; yet we included it in our experiments due to its high performance in comparison to other solutions [31]. In our implementation, we use the Non-Metric Space Library (nmslib, https://github.com/nmslib/nmslib), which provides a parallel implementation of the SW-graph algorithm with incremental insertions and deletions. One useful property of nmslib is its ability to work with non-metric similarity functions, including the negative dot product, which makes it a suitable implementation of a MIPS oracle.
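For illustration, a MIPS oracle over the rows of $W$ might be set up with nmslib roughly as follows (a sketch assuming the negdotprod space and the sw-graph method shipped with recent nmslib versions):

```python
import numpy as np
import nmslib

# Negative dot product turns maximum inner product search into
# nearest neighbor search over the rows of W.
index = nmslib.init(method="sw-graph", space="negdotprod")
W = np.random.rand(1000, 64).astype(np.float32)  # toy weight matrix
index.addDataPointBatch(W)
index.createIndex({"NN": 16}, print_progress=False)

x = np.random.rand(64).astype(np.float32)
ids, dists = index.knnQuery(x, k=5)  # candidate classes for the inexact margin
```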
Hyperparameters
In this subsection, we discuss a reasonable choice of the algorithms' parameters and the ways this choice affects training time, memory consumption and quality. The complete list of exact values of all hyperparameters is provided in the Appendix.
Common parameters.
The learning rate decays with the iteration number $t$; its initial value and decay rate are hyperparameters.
The total number of iterations was chosen for each dataset independently by observing the quality improvement on the heldout set: when the heldout MaF1 does not improve significantly for 5–10 iterations, we stop the learning process.
The batch size was optimized using grid search on a logarithmic scale and turned out to be identical for all algorithms and datasets. Generally, increasing the batch size while keeping the number of observed objects fixed first leads to an improvement in accuracy at the cost of increased learning time; see Table 2 for details.
L2.

The choice of the regularization parameter in the case of the MEMOIR-$\ell_2$ algorithm affects quality, but has almost no effect on time and memory. This is because its value only affects the matrix multiplier and has no sparsifying effect on $W$, unlike in the $\ell_1$-regularized SVM. We found a single value (see the Appendix) to be a reasonable choice in all our experiments.
L1.

The regularization parameter in the MEMOIR-$\ell_1$ algorithm has little effect on training time (the training time varies by only 33% across the values we tried), but provides a way to trade quality for memory. Figure 1 illustrates this effect: decreasing the regularization parameter increases the number of nonzero elements in the weight matrix, which in turn yields more accurate predictions.
Results
The results for the baselines and the proposed methods, in terms of training and prediction time, total memory usage, and predictive performance evaluated with accuracy and MaF1, are provided in Table 3. For a visual comparison, we also display the results of the seven best models graphically; see Figure 2.

All the proposed algorithms consume substantially less memory than the existing solutions, with MEMOIR-$\ell_1$ achieving the most impressive memory reduction. The MEMOIR-LSH algorithm achieves the fastest training time (due to the reduced feature space and the fast computation of the Hamming distance) at the cost of the highest memory usage among our methods. MEMOIR-$\ell_2$ is a good compromise between the previous two algorithms in terms of test quality, memory and training time.
Table 3: Comparison of the proposed methods with the baselines on all five datasets.

Data | Metric | OVR | MSVM | RecallTree | FastXML | PfastReXML | PDSparse | MEMOIR-$\ell_2$ | MEMOIR-$\ell_1$ | MEMOIR-LSH

LSHTC1 | Train time | 23056s | 48313s | 701s | 8564s | 3912s | 5105s | 1673s | 2204s | 655s
n = 163589 | Predict time | 328s | 314s | 21s | 339s | 164s | 67s | 18s | 6s | 12s
d = 409774 | Total memory | 40.3G | 40.3G | 122M | 470M | 471M | 10.5G | 119M | 46M | 218M
C = 12294 | Accuracy | 44.1% | 36.4% | 18.1% | 39.3% | 39.8% | 45.7% | 34.5% | 31.9% | 34.6%
 | MaF1 | 27.4% | 18.8% | 3.8% | 21.3% | 22.4% | 27.7% | 24.4% | 23.6% | 24.8%

DMOZ | Train time | 180361s | 212356s | 2212s | 14334s | 15492s | 63286s | 5709s | 7226s | 3637s
n = 510943 | Predict time | 2797s | 3981s | 47s | 424s | 505s | 482s | 76s | 22s | 77s
d = 594158 | Total memory | 131.9G | 131.9G | 256M | 1339M | 1242M | 28.1G | 271M | 52M | 417M
C = 27875 | Accuracy | 37.7% | 32.2% | 16.9% | 33.4% | 33.7% | 40.8% | 25.1% | 20.6% | 25.2%
 | MaF1 | 22.2% | 14.3% | 1.8% | 15.1% | 15.9% | 22.7% | 18.8% | 17.1% | 19.1%

WIKI-Small | Train time | 212438s | 4d | 1610s | 10646s | 21702s | 16309s | 4791s | 7055s | 4165s
n = 1000772 | Predict time | 2270s | NA | 24s | 453s | 871s | 382s | 88s | 36s | 61s
d = 380078 | Total memory | 109.1G | 109.1G | 178M | 949M | 947M | 12.4G | 121M | 39M | 213M
C = 36504 | Accuracy | 15.6% | NA | 7.9% | 11.1% | 12.1% | 15.6% | 19.0% | 17.0% | 19.4%
 | MaF1 | 8.8% | NA | 1.0% | 4.6% | 5.6% | 9.9% | 12.7% | 12.1% | 13.1%

WIKI-50K | Train time | NA | NA | 4188s | 30459s | 48739s | 41091s | 6755s | 17303s | 6215s
n = 1384693 | Predict time | NA | NA | 45s | 1110s | 2461s | 790s | 120s | 59s | 120s
d = 951558 | Total memory | 330G | 330G | 226M | 1327M | 1781M | 35G | 185M | 42M | 424M
C = 50000 | Accuracy | NA | NA | 17.9% | 25.8% | 27.3% | 33.8% | 29.6% | 24.2% | 30.0%
 | MaF1 | NA | NA | 5.5% | 14.6% | 16.3% | 23.4% | 22.9% | 20.0% | 23.3%

WIKI-100K | Train time | NA | NA | 8593s | 42359s | 73371s | 155633s | 21061s | 46730s | 14323s
n = 2750663 | Predict time | NA | NA | 90s | 1687s | 3210s | 3121s | 457s | 161s | 504s
d = 1271710 | Total memory | 1017G | 1017G | 370M | 2622M | 2834M | 40.3G | 346M | 35M | 660M
C = 100000 | Accuracy | NA | NA | 8.4% | 15.0% | 16.1% | 22.2% | 22.3% | 14.0% | 22.6%
 | MaF1 | NA | NA | 1.4% | 8.0% | 9.0% | 15.1% | 16.9% | 11.8% | 17.2%
Conclusion
This paper addressed extreme multiclass classification using inexact margin computation. We provided a theoretical analysis of the classification risk in multiclass (one-vs-rest and Crammer-Singer) settings and showed that inexact margin computation does not lead to a significant increase of the models' risk. We then designed three efficient methods, MEMOIR-$\ell_1$, MEMOIR-$\ell_2$ and MEMOIR-LSH, that solve the Crammer-Singer multiclass SVM problem with an inexact approximation of the margin using two different MIPS solvers, SW-graph and SimpleLSH. We illustrated the empirical performance of these methods on five extreme classification datasets, on which we achieved good results in terms of quality, memory and training time.

Finally, we discussed how parameter tuning affects the algorithms' performance and provided practical advice on how to choose hyperparameters for the proposed algorithms. Our implementation is publicly available in the form of an open-source library.
References
 [1] Florent Perronnin and Christopher Dance. Fisher kernels on visual vocabularies for image categorization. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8, 2007.
 [2] Armand Joulin, Francis Bach, and Jean Ponce. Multi-class cosegmentation. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 542–549. IEEE, 2012.
 [3] Thorsten Joachims. Text categorization with support vector machines: Learning with many relevant features. In European Conference on Machine Learning, pages 137–142. Springer, 1998.
 [4] Yang Song, Ziming Zhuang, Huajing Li, Qiankun Zhao, Jia Li, Wang-Chien Lee, and C. Lee Giles. Real-time automatic tag recommendation. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 515–522. ACM, 2008.
 [5] Amnon Shashua and Anat Levin. Ranking with large margin principle: Two approaches. In Advances in Neural Information Processing Systems, pages 961–968, 2003.
 [6] Jesse Read, Bernhard Pfahringer, Geoff Holmes, and Eibe Frank. Classifier chains for multi-label classification. Machine Learning, 85(3):333, 2011.
 [7] Shai Shalev-Shwartz, Yoram Singer, Nathan Srebro, and Andrew Cotter. Pegasos: Primal estimated sub-gradient solver for SVM. Mathematical Programming, 127(1):3–30, 2011.
 [8] Behnam Neyshabur and Nathan Srebro. On symmetric and asymmetric LSHs for inner product search. arXiv preprint arXiv:1410.5518, 2014.
 [9] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.
 [10] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. MIT Press, 2012.
 [11] Yunwen Lei, Urun Dogan, Alexander Binder, and Marius Kloft. Multi-class SVMs: From tighter data-dependent generalization bounds to novel algorithms. In Advances in Neural Information Processing Systems, pages 2035–2043, 2015.
 [12] Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
 [13] Elad Hazan and Haipeng Luo. Variance-reduced and projection-free stochastic optimization. In International Conference on Machine Learning, pages 1263–1271, 2016.
 [14] Ryan Rifkin and Aldebaro Klautau. In defense of one-vs-all classification. Journal of Machine Learning Research, 5(Jan):101–141, 2004.
 [15] Zhuang Wang, Koby Crammer, and Slobodan Vucetic. Multi-class Pegasos on a budget. In International Conference on Machine Learning (ICML), pages 1143–1150, 2010.
 [16] Koby Crammer and Yoram Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research (JMLR), 2:265–292, 2001.
 [17] Martin Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In Proceedings of the 30th International Conference on Machine Learning, volume 28, pages 427–435. JMLR.org, 2013.
 [18] Ian En-Hsu Yen, Xiangru Huang, Pradeep Ravikumar, Kai Zhong, and Inderjit Dhillon. PD-Sparse: A primal and dual sparse approach to extreme multiclass and multilabel classification. In Proceedings of the 33rd International Conference on Machine Learning, 48:3069–3077, 2016.
 [19] Donald Goldfarb, Garud Iyengar, and Chaoxu Zhou. Linear convergence of stochastic Frank-Wolfe variants. In Artificial Intelligence and Statistics, pages 1066–1074, 2017.
 [20] Ji Zhu, Saharon Rosset, Robert Tibshirani, and Trevor J. Hastie. 1-norm support vector machines. In Advances in Neural Information Processing Systems, pages 49–56, 2004.
 [21] Ioannis Partalas, Aris Kosmopoulos, Nicolas Baskiotis, Thierry Artières, George Paliouras, Éric Gaussier, Ion Androutsopoulos, Massih-Reza Amini, and Patrick Gallinari. LSHTC: A benchmark for large-scale text classification. pages 1–9, March 2015.
 [22] Bikash Joshi, Massih-Reza Amini, Ioannis Partalas, Franck Iutzeler, and Yury Maximov. Aggressive sampling for multi-class to binary reduction with applications to text classification. In Advances in Neural Information Processing Systems, pages 4159–4168, 2017.
 [23] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871–1874, 2008.
 [24] Koby Crammer and Yoram Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2:265–292, 2002.
 [25] Hal Daumé III, Nikos Karampatziakis, John Langford, and Paul Mineiro. Logarithmic time one-against-some. pages 1–13, 2016.
 [26] Yashoteja Prabhu and Manik Varma. FastXML: A fast, accurate and stable tree-classifier for extreme multi-label learning. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 263–272. ACM, 2014.
 [27] Himanshu Jain, Yashoteja Prabhu, and Manik Varma. Extreme multi-label loss functions for recommendation, tagging, ranking & other missing label applications. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '16), pages 935–944, New York, NY, USA, 2016. ACM Press.
 [28] Behnam Neyshabur and Nathan Srebro. On symmetric and asymmetric LSHs for inner product search. In Proceedings of the 32nd International Conference on Machine Learning, 37:1926–1934, 2015.
 [29] George H. Chen and Devavrat Shah. Explaining the success of nearest neighbor methods in prediction. Foundations and Trends in Machine Learning, 10(5-6):337–588, 2018.
 [30] Yury Malkov, Alexander Ponomarenko, Andrey Logvinov, and Vladimir Krylov. Approximate nearest neighbor algorithm based on navigable small world graphs. Information Systems, 45:61–68, September 2014.
 [31] Martin Aumüller, Erik Bernhardsson, and Alexander Faithfull. ANN-Benchmarks: A benchmarking tool for approximate nearest neighbor algorithms. volume 8199 of Lecture Notes in Computer Science, pages 34–49. Springer Berlin Heidelberg, Berlin, Heidelberg, 2017.
Appendix for
MEMOIR: Multiclass Extreme Classification with Inexact Margin
Proof of Theorem 2
Theorem 2.
Proof.
Following [15], we treat the difference between the exact margin and the inexact one as a gradient error.

For convenience, we denote the instantaneous loss on the $t$-th object by $f_t(W)$. Thus the subgradient used by the algorithm equals the exact subgradient of $f_t$ plus a distortion term, and the iterates stay inside a norm ball according to Lemma 1 of [7] and Lemma 1 of [15]. The distortion is bounded by $\epsilon_t$ with probability at least $1 - \delta_t$ by the definition of the inexact margin.
The relative progress towards the optimal solution at the $t$-th iteration is

(1)

where the last inequality is due to the contraction property of the projection onto a convex set.

By the strong convexity of the objective, the inner product between the subgradient and the deviation from the optimum is lower bounded by the suboptimality gap. Dividing both sides of Eq. (1) by the step size and rearranging, we get a per-iteration bound on the instantaneous suboptimality.

Summing over all $t = 1, \dots, T$ one has

(2)

Rearranging the terms and using the bound on the subgradient norm, we have

(3)

Finally, combining inequalities (1), (2) and (3) with Jensen's inequality, we get the claimed bound, where each distortion term is bounded by $\epsilon_t$ with probability at least $1 - \delta_t$, and is always bounded due to the bound on the norm of the iterates. Thus, with probability at least $1 - \delta$, by the Hoeffding inequality the total distortion is bounded, which finishes the proof of the theorem. ∎
Parameters in Numerical Study
Algorithm | Parameter | LSHTC1 | DMOZ | WIKI-Small | WIKI-50K | WIKI-100K
OVR | C | 10 | 10 | 1 | NA | NA
MSVM | C | 1 | 1 | NA | NA | NA
RecallTree | –b | 30 | 30 | 30 | 30 | 28
 | –l | 1 | 0.7 | 0.7 | 0.5 | 0.5
 | –loss_function | Hinge | Hinge | Logistic | Hinge | Hinge
 | –passes | 5 | 5 | 5 | 5 | 5
FastXML | t | 100 | 50 | 50 | 100 | 50
 | c | 100 | 100 | 10 | 10 | 10
PfastReXML | t | 50 | 50 | 100 | 200 | 100
 | c | 100 | 100 | 10 | 10 | 10
PDSparse | l | 0.01 | 0.01 | 0.001 | 0.0001 | 0.01
 | Hashing | multiTrainHash | multiTrainHash | multiTrainHash | multiTrainHash | multiTrainHash
MEMOIR* | | 0.1 | 0.1 | 0.1 | 0.1 | 0.1
 | | 0.02 | 0.02 | 0.02 | 0.02 | 0.02
 | | 1 | 1 | 1 | 1 | 1
 | | 25 | 40 | 48 | 60 | 80
 | | 25 | 40 | 48 | 60 | 80
MEMOIR-LSH | | 1 | 1 | 1 | 1 | 1
 | | 25 | 40 | 48 | 60 | 80
 | hash string length | 64 | 64 | 64 | 64 | 64