Candidates v.s. Noises Estimation for Large Multi-Class Classification Problem


Lei Han and Tong Zhang
Tencent AI Lab, Shenzhen, China

This paper proposes a method for multi-class classification problems where the number of classes is large. The method, referred to as Candidates v.s. Noises Estimation (CANE), selects a small subset of candidate classes and samples the remaining classes. We show that CANE is always consistent and computationally efficient. Moreover, the resulting estimator has low statistical variance approaching that of the maximum likelihood estimator when the observed label belongs to the selected candidates with high probability. In practice, we use a tree structure with leaves as classes to promote fast beam search for candidate selection. We also apply the CANE method to estimate word probabilities in neural language models. Experiments show that CANE achieves better prediction accuracy than Noise-Contrastive Estimation (NCE), its variants and a number of state-of-the-art tree classifiers, while gaining significant speedups compared to standard methods.

1 Introduction

In practice one often encounters multi-class classification problems with a large number of classes. For example, applications in image classification [33] and language modeling [18] usually involve tens to hundreds of thousands of classes, and in this regime training the standard softmax logistic or one-against-all (OAA) models becomes impractical.

One promising way to handle the large class size is sampling. In language models, a commonly adopted technique is Noise-Contrastive Estimation (NCE) [13]. This method was originally proposed for estimating probability densities and has been applied to various language modeling settings, such as learning word embeddings, context generation and neural machine translation [27, 26, 38, 36]. NCE reduces multi-class classification to a binary classification problem that discriminates between a target class distribution and a noise distribution, where a few noise classes are sampled to represent the entire noise distribution. In general, the noise distribution is given a priori. For example, a power-raised unigram distribution has been shown to be effective in language models [24, 16, 27]. Recently, some variants of NCE have been proposed. Negative Sampling [24] is a simplified version of NCE that ignores the numerical probabilities in the distributions and discriminates between only the target class and noise samples; One v.s. Each [37] solves a very similar problem motivated by bounding the softmax logistic log-likelihood. Two other variants, BlackOut [16] and complementary sum sampling [4], employ parametric forms of the noise distribution and use sampled noises to approximate the normalization factor. NCE and its variants use only the observed class against the noises. By sampling the noises, these methods avoid the costly computation of the normalization factor and achieve fast training speed. In this paper, we generalize this idea by using a subset of classes (which can be automatically learned), called candidate classes, against the remaining noise classes. Compared to NCE, this approach can significantly improve statistical efficiency when the true class belongs to the candidate classes with high probability.

Another type of popular method for large class spaces is the tree structured classifier [3, 2, 9, 6, 7, 15]. In these methods, a tree structure is defined over the classes, which are treated as leaves. Each internal node of the tree is assigned a local classifier that routes examples to one of its descendants, and decisions are made from the root until reaching a leaf. Therefore, the multi-class classification problem is reduced to solving a number of small local models defined by a tree, which typically admits a complexity logarithmic in the total number of classes. Generally speaking, tree classifiers gain training and prediction speed at a loss of accuracy, and the performance of a tree classifier may rely heavily on the quality of the tree structure [25]. Earlier approaches use a fixed tree, such as the Filter Tree [3] and the Hierarchical Softmax (HSM) [28]. Recent tree classifiers, such as the LOMTree [6] and Recall Tree [7], are able to adjust the tree and learn the local classifiers simultaneously. Our approach is complementary to these tree classifiers, because we study the orthogonal issue of consistent class sampling, which in principle can be combined with many of these tree methods. In fact, a tree structure will be used in our approach to select a small subset of candidate classes. Since we focus on the class sampling aspect, we do not necessarily employ the best tree construction method in our experiments.

In this paper, we propose a method to efficiently deal with the large class problem by paying attention to a small subset of candidate classes instead of the entire class space. Given a data point $x$ (without observing its label $y$), we select a small number of competitive candidate classes as a set $\mathcal{C}_x$. Then, we sample the remaining classes, which are treated as noises, to represent the entire noise contribution in the large normalization factor. The estimation method is referred to as Candidates v.s. Noises Estimation (CANE). We show that CANE is consistent and that its computation using a stochastic gradient method is independent of the class size $K$. Moreover, the statistical variance of the CANE estimator can approach that of the maximum likelihood estimator (MLE) of the softmax logistic regression when $\mathcal{C}_x$ covers the target class with high probability. This statistical efficiency is a key advantage of CANE over NCE, and its effect can be observed in practice.

We then describe two concrete algorithms: the first one is a generic stochastic optimization procedure for CANE; and the second algorithm employs a tree structure with leaves as classes to enable fast beam search for candidate selection. We also apply the CANE method to solve the word probability estimation problem in neural language modeling. Experimental results conducted on both classification problems and neural language modeling problems show that CANE achieves significant speedup compared to the standard softmax logistic regression. Moreover, it achieves superior performance over NCE, its variants, and a number of the state-of-the-art tree classifiers.

2 Candidates v.s. Noises Estimation

Consider a $K$-class classification problem ($K$ is large) with $n$ training examples $(x_1, y_1), \ldots, (x_n, y_n)$, where $x_i$ is from an input space $\mathcal{X}$ and $y_i \in \{1, \ldots, K\}$. The softmax logistic regression solves

$$\max_\theta \; \sum_{i=1}^n \log \frac{e^{s_{y_i}(x_i,\theta)}}{\sum_{k=1}^K e^{s_k(x_i,\theta)}}, \qquad (1)$$

where $s_k(x, \theta)$ for $k = 1, \ldots, K$ is a model parameterized by $\theta$. Solving Eq. (1) requires computing a score for every class and the summation in the normalization factor $\sum_{k=1}^K e^{s_k(x,\theta)}$, which is very expensive when $K$ is large.

Generally speaking, given $x$, only a small number of classes in the entire class space might be competitive with the true class. Therefore, we propose to find a small subset of classes as a candidate set $\mathcal{C}_x$ and treat the classes outside $\mathcal{C}_x$ as noises, so that we can focus on the small set instead of the entire class space. We will discuss one way to choose $\mathcal{C}_x$ in Section 4. Denote the remaining noise classes as a set $\mathcal{N}_x$, so $\mathcal{N}_x$ is the complementary set of $\mathcal{C}_x$. We propose to sample some noise class to represent the entire $\mathcal{N}_x$. That is, we replace the partial summation $\sum_{k \in \mathcal{N}_x} e^{s_k(x,\theta)}$ in the denominator of Eq. (1) by using some sampled class $j \in \mathcal{N}_x$ with an arbitrary sampling probability $q_j$, where $q_j > 0$ and $\sum_{j \in \mathcal{N}_x} q_j = 1$. Thus, the denominator will be approximated as $\sum_{k \in \mathcal{C}_x} e^{s_k(x,\theta)} + e^{s_j(x,\theta)}/q_j$.

Given example $(x, y)$ and its candidate set $\mathcal{C}_x$, if $y \in \mathcal{C}_x$, then for some sampled noise class $j \in \mathcal{N}_x$, the probability will be defined as

$$p(y, j \mid x) = \frac{e^{s_y(x,\theta)}}{\sum_{k \in \mathcal{C}_x} e^{s_k(x,\theta)} + e^{s_j(x,\theta)}/q_j}; \qquad (2)$$

otherwise, if $y \in \mathcal{N}_x$, we define the probability as

$$p(y \mid x) = \frac{e^{s_y(x,\theta)}/q_y}{\sum_{k \in \mathcal{C}_x} e^{s_k(x,\theta)} + e^{s_y(x,\theta)}/q_y}. \qquad (3)$$
Now, with Eqs. (2) and (3), in expectation, we will need to solve the following objective:

$$\max_\theta \; \mathbb{E}_{x,y} \Bigg[ \mathbb{1}(y \in \mathcal{C}_x) \sum_{j \in \mathcal{N}_x} q_j \log \frac{e^{s_y(x,\theta)}}{\sum_{k \in \mathcal{C}_x} e^{s_k(x,\theta)} + e^{s_j(x,\theta)}/q_j} + \mathbb{1}(y \in \mathcal{N}_x) \log \frac{e^{s_y(x,\theta)}/q_y}{\sum_{k \in \mathcal{C}_x} e^{s_k(x,\theta)} + e^{s_y(x,\theta)}/q_y} \Bigg]. \qquad (4)$$

Empirically, we will need to solve

$$\max_\theta \; \sum_{i=1}^n \Bigg[ \mathbb{1}(y_i \in \mathcal{C}_{x_i}) \sum_{j \in \mathcal{N}_{x_i}} q_j \log \frac{e^{s_{y_i}(x_i,\theta)}}{\sum_{k \in \mathcal{C}_{x_i}} e^{s_k(x_i,\theta)} + e^{s_j(x_i,\theta)}/q_j} + \mathbb{1}(y_i \in \mathcal{N}_{x_i}) \log \frac{e^{s_{y_i}(x_i,\theta)}/q_{y_i}}{\sum_{k \in \mathcal{C}_{x_i}} e^{s_k(x_i,\theta)} + e^{s_{y_i}(x_i,\theta)}/q_{y_i}} \Bigg]. \qquad (5)$$
Eq. (5) consists of two summations, over the data points and over the classes in the noise set $\mathcal{N}_x$. Therefore, we can employ a 'doubly' stochastic gradient optimization method that samples both data points $(x_i, y_i)$ and noise classes $j$. It is not difficult to check that each stochastic gradient is bounded under reasonable conditions, which means that the computational cost for solving Eq. (5) using stochastic gradient is independent of the class number $K$. Since we only choose a small number of candidates in $\mathcal{C}_x$, the computation of each stochastic gradient in Eq. (5) is efficient. The above method is referred to as Candidates v.s. Noises Estimation (CANE).
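To make the two cases above concrete, the per-example log-probability that CANE maximizes can be sketched in a few lines of Python. This is a schematic with hypothetical names, assuming precomputed class scores and a known sampling distribution $q$ over the noise classes:

```python
import numpy as np

def cane_log_prob(scores, q, cand, y, j):
    """Log-probability of one (example, sampled-noise) pair under CANE.

    scores: length-K array of class scores s_k(x, theta)
    q:      length-K array, q[j] = sampling probability of noise class j
    cand:   list of candidate class indices C_x
    y:      observed label
    j:      sampled noise class (only used when y is a candidate)
    """
    denom_c = np.sum(np.exp(scores[cand]))
    if y in cand:                                  # Eq. (2)
        denom = denom_c + np.exp(scores[j]) / q[j]
        return scores[y] - np.log(denom)
    else:                                          # Eq. (3): y plays the noise role
        denom = denom_c + np.exp(scores[y]) / q[y]
        return scores[y] - np.log(q[y]) - np.log(denom)
```

Exponentiating the candidate terms together with the scaled noise term shows that they form a proper probability distribution over $\mathcal{C}_x \cup \{j\}$.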

3 Properties

In this section, we investigate the statistical properties of CANE. The parameter space of the softmax logistic model in Eq. (1) has redundancy, observing that adding any function $c(x)$ to all of the scores $s_k(x, \theta)$ will not change the objective. A similar situation happens for Eqs. (4) and (5). To avoid this redundancy, one can add constraints on the scores or simply fix one of them at zero, e.g., let $s_K(x, \theta) \equiv 0$. To facilitate the analysis, we will fix $s_K(x, \theta) = 0$ and consider the scores $s_1, \ldots, s_{K-1}$ within this section. First, we have the following result.

Theorem 1 (Infinite-Sample Consistency).

By viewing the objective in Eq. (4) as a function of the scores $s_1(x), \ldots, s_{K-1}(x)$, it achieves its maximum if and only if $s_k(x) = \log \frac{p(y = k \mid x)}{p(y = K \mid x)}$ for $k = 1, \ldots, K-1$.

In Theorem 1, the global optimum is exactly the log-odds function with class $K$ as the reference class. Now, considering the parametric form $s_k(x, \theta)$, there exists a true parameter $\theta^*$ such that $s_k(x, \theta^*) = \log \frac{p(y = k \mid x)}{p(y = K \mid x)}$ if the model is correctly specified. The following theorem shows that the CANE estimator $\hat{\theta}_n$ obtained from Eq. (5) is consistent with $\theta^*$.

Theorem 2 (Finite-Sample Asymptotic Consistency).

Given $x$, denote the candidate set as $\mathcal{C}_x$ and the noise set as $\mathcal{N}_x$. Assume that $s_k(x, \theta)$ and its first and second order derivatives for $k = 1, \ldots, K-1$ are bounded under some norm defined on the parameter space of $\theta$. Furthermore, assume the matrix $V$ is positive definite, where

Then, as $n \to \infty$, the estimator $\hat{\theta}_n$ converges to $\theta^*$ in probability.

The above theorem shows that, similar to the maximum likelihood estimator of Eq. (1), the CANE estimator in Eq. (5) is also consistent. Next, we have the asymptotic normality of $\hat{\theta}_n$ as follows.

Theorem 3 (Asymptotic Normality).

Under the same assumptions used in Theorem 2, as $n \to \infty$, $\sqrt{n}\,(\hat{\theta}_n - \theta^*)$ converges to an asymptotic normal distribution:


Theorem 3 characterizes the statistical variance of the CANE estimator. As we will see in the next corollary, if one can choose the candidate set $\mathcal{C}_x$ so that it covers the observed label with high probability, then the difference between the statistical variance of CANE and that of Eq. (1) is small. Therefore, choosing a good candidate set can be important for practical applications. Moreover, under standard conditions, the computation of CANE using stochastic gradient is independent of the class size $K$, because the variance of the stochastic gradient is bounded.

Corollary 1 (Low Statistical Variance).

The variance of the maximum likelihood estimator for the softmax logistic regression in Eq. (1) takes an analogous form. If $P(y \in \mathcal{C}_x) \to 1$, i.e., the probability that $\mathcal{C}_x$ covers the observed class label approaches 1, then the variance of the CANE estimator approaches that of the maximum likelihood estimator.

4 Algorithm

In this section, we propose two algorithms. The first one is a general optimization procedure for CANE. The second provides an efficient way to select a competitive set $\mathcal{C}_x$ using a tree structure defined on the classes.

4.1 A General Optimization Algorithm

1:  Input: training data $\{(x_i, y_i)\}_{i=1}^n$, number of candidates $|\mathcal{C}_x|$, number of sampled noises $m$, sampling strategy $q$ and learning rate $\eta$.
2:  Output: $\hat{\theta}$.
3:  Initialize $\theta$;
4:  for every sampled example do
5:     Receive example $(x, y)$;
6:     Find the candidate set $\mathcal{C}_x$;
7:     if $y \in \mathcal{C}_x$ then
8:         Sample $m$ noises outside $\mathcal{C}_x$ according to $q$ and denote the selected noise set as $\mathcal{N}$;
9:         Update $\theta$ with the gradient given by Eq. (7)
10:     else
11:         Update $\theta$ with the gradient given by Eq. (8)
12:     end if
13:  end for
Algorithm 1 A general optimization procedure for CANE.

Eq. (5) suggests an efficient algorithm using a 'doubly' stochastic gradient descent (SGD) method that samples both the data points and the classes. That is, after sampling a data point $(x, y)$, we find the candidate set $\mathcal{C}_x$. If $y \in \mathcal{C}_x$, we sample a noise class according to the probability $q$ $m$ times and denote the selected noises as a set $\mathcal{N}$ ($|\mathcal{N}| = m$). We then optimize

with gradient given by Eq. (7). Otherwise, if $y \in \mathcal{N}_x$, we optimize

with gradient given by Eq. (8). This general procedure is given in Algorithm 1. Algorithm 1 has a per-example complexity of $O(|\mathcal{C}_x| + m)$ gradient terms, which is independent of the class size $K$. In step 6, any method can be used to select $\mathcal{C}_x$.
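As a rough illustration of Algorithm 1 for a linear model $s_k(x) = w_k^\top x$ in the case $y \in \mathcal{C}_x$, one 'doubly stochastic' update can be sketched as follows. This is our own minimal rendering, not the paper's implementation; the update is the gradient of the sampled log-probability, averaged over the drawn noise classes:

```python
import numpy as np

def cane_sgd_step(W, x, y, cand, noise, q, lr):
    """One doubly stochastic CANE ascent step for a linear model s_k(x) = W[k] @ x.

    cand:  candidate class indices C_x (assumed to contain y here)
    noise: sampled noise class indices N (outside C_x)
    q:     q[j] = sampling probability of noise class j
    """
    for j in noise:                     # average per-noise gradients, as in Eq. (5)
        idx = list(cand) + [j]
        # unnormalized terms: e^{s_k} for candidates, e^{s_j}/q_j for the noise
        terms = np.exp(W[idx] @ x)
        terms[-1] /= q[j]
        p = terms / terms.sum()
        grad = -p                       # from differentiating -log(denominator)
        grad[idx.index(y)] += 1.0       # from differentiating s_y
        W[idx] += lr * np.outer(grad, x) / len(noise)
    return W
```

A single step with a small learning rate increases the sampled log-probability of the observed class, which is the ascent direction Eq. (5) prescribes.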

4.2 Beam Tree Algorithm

In the second algorithm, we provide an efficient way to find a competitive $\mathcal{C}_x$. An attractive strategy is to use a tree defined on the classes, because a tree structure supports fast heuristic search for pruning uncompetitive classes. Indeed, any structure, e.g., a graph, can be used instead, as long as it allows low quality branches to be pruned efficiently. We use a tree structure for candidate selection in this paper.

Given a tree structure defined on the classes, the model is interpreted as the tree model illustrated in Fig. 1. For simplicity, Fig. 1 uses a binary tree over the labels as an example, while any tree structure can be used for selecting $\mathcal{C}_x$. In the example, circles denote internal nodes and squares indicate classes. The parameters are kept on the edges and denoted as $\theta_{(v,c)}$, where $v$ indicates an internal node and $c$ is the index of the $c$-th child of node $v$. Therefore, a pair $(v, c)$ represents an edge from node $v$ to its $c$-th child. The dashed circles indicate that we do not keep any parameters in the internal nodes.

Figure 1: Illustration of the tree model. Suppose an example $x$ arrives, and two candidate classes 1 and 2 are selected by beam search. Class 6 is sampled as noise.

Now, we define the score $s_y(x, \theta)$ as

$$s_y(x, \theta) = \sum_{(v, c) \in \mathrm{Path}(y)} \theta_{(v,c)}^\top h(x), \qquad (9)$$

where $h(\cdot)$ is a function parameterized by $\theta_h$ that maps the input $x$ to a representation $h(x) \in \mathbb{R}^d$ for some $d$. For example, in image classification, a good choice for the representation of the raw pixels is usually a deep neural network. $\mathrm{Path}(y)$ denotes the path from the root to the class $y$. Eq. (9) implies that the score of an example belonging to a class is calculated by summing up the scores along the corresponding path. Now, in Fig. 1, suppose that we are given an example $x$ with class 2 (blue color). Using beam search, we find two candidates with high scores, i.e., class 1 (green color) and class 2. Then, we let $\mathcal{C}_x = \{1, 2\}$. In this case, we have $y \in \mathcal{C}_x$, so we need to sample noises. Suppose we sample class 6 (orange color). According to Eq. (7), the parameters along the corresponding paths (red color) will be updated.
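Eq. (9) amounts to a sum of edge scores along the root-to-leaf path. A minimal sketch, with hypothetical containers for the paths and the edge parameters:

```python
import numpy as np

def class_score(y, paths, theta, h):
    """Score s_y(x): sum of edge parameters along the root-to-leaf path of
    class y, each applied to the representation h = h(x) (Eq. (9)).

    paths: dict mapping a class to its list of (node, child_index) edges
    theta: dict mapping an edge (node, child_index) to its weight vector
    """
    return sum(float(theta[edge] @ h) for edge in paths[y])
```

Candidate and noise classes share prefixes of these paths, which is what makes the gradient cancellation in Proposition 1 possible.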

1:  Input: training data $\{(x_i, y_i)\}_{i=1}^n$, representation function $h(\cdot)$, number of candidates $|\mathcal{C}_x|$, number of sampled noises $m$, sampling strategy $q$ and learning rate $\eta$.
2:  Output: $\hat{\theta}$.
3:  Construct a tree on the classes;
4:  Initialize $\theta$;
5:  for every sampled example do
6:     Receive example $(x, y)$;
7:     Given $x$, use beam search to find the classes with high scores to compose $\mathcal{C}_x$;
8:     if $y \in \mathcal{C}_x$ then
9:         Sample $m$ noises outside $\mathcal{C}_x$ according to $q$ and denote the selected noise set as $\mathcal{N}$;
10:         Find the paths with respect to the classes in $\mathcal{C}_x \cup \mathcal{N}$;
11:     else
12:         Find the paths with respect to the classes in $\mathcal{C}_x \cup \{y\}$;
13:     end if
14:     Sum up the scores along each selected path for the corresponding class;
15:     // Update parameters in the tree model
16:     Update $\theta_{(v,c)}$ for each edge $(v, c)$ included in the selected paths according to Eqs. (10) and (11);
17:     // Update parameters in the representation function if it is parameterized
18:     Update $\theta_h$;
19:  end for
Algorithm 2 The Beam Tree Algorithm.

Formally, given example $(x, y)$, if $y \in \mathcal{C}_x$, we sample the noises as a set $\mathcal{N}$. Then, for each edge $(v, c)$ on the selected paths, the gradient with respect to $\theta_{(v,c)}$ is thus


Note that an edge may be included in multiple selected paths. For example, the paths to classes 1 and 2 share the edges near the root in Fig. 1. The case of $y \in \mathcal{N}_x$ can be illustrated similarly. The gradient with respect to $\theta_{(v,c)}$ when $y \in \mathcal{N}_x$ is


The gradients in Eqs. (10) and (11) enjoy the following property.

Proposition 1.

If an edge $(v, c)$ is included in every selected path, $\theta_{(v,c)}$ does not need to be updated.

The proof of Proposition 1 is straightforward: if $(v, c)$ belongs to every selected path, then the gradients in Eqs. (10) and (11) are 0. This property allows fast detection of parameters that do not need to be updated in SGD and hence saves further computation.
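The cancellation behind Proposition 1 is easy to verify numerically: the per-class gradient coefficients sum to zero, so an edge shared by all selected paths receives a zero net update. A small sketch in our own notation:

```python
import numpy as np

def edge_gradient_coefficients(scores, q, cand, y, j):
    """Per-class coefficients a_c multiplying h(x) in the CANE gradient:
    a_c = 1{c = y} minus the normalized term of class c. An edge shared by
    the paths of ALL selected classes receives gradient (sum_c a_c) * h(x),
    which vanishes because the coefficients sum to zero."""
    idx = list(cand) + [j]
    terms = np.exp(np.asarray([scores[c] for c in idx]))
    terms[-1] /= q[j]                 # sampled noise term e^{s_j}/q_j
    a = -terms / terms.sum()
    a[idx.index(y)] += 1.0
    return dict(zip(idx, a))
```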

Since we use beam search to choose the candidates in a tree structure, the proposed algorithm is referred to as Beam Tree and is depicted in Algorithm 2 (the beam search procedure in step 7 is provided in the appendix). For the tree construction method in step 3, we can use hierarchical clustering based methods, which are detailed in the experiments and the supplementary material. In the algorithm, the beam search needs $O(c\,|\mathcal{C}_x| \log K)$ operations, where $c$ is a constant related to the tree structure, e.g., $c = 2$ for a binary tree. The parameter updating needs $O((|\mathcal{C}_x| + m) \log K)$ operations. Therefore, Algorithm 2 has a complexity that is logarithmic with respect to $K$. The $\log K$ factor comes from the tree structure used in this specific candidate selection method, so it does not conflict with the complexity of the general Algorithm 1, which is independent of $K$. Another advantage of the Beam Tree algorithm is that it allows fast prediction and can naturally output the top-$s$ predictions using beam search; the prediction time is on the order of $O(c\,s \log K)$ for the top-$s$ predictions.
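Since the exact beam search procedure is deferred to the appendix, the following is a generic top-down beam search over a class tree that matches the description above. It is a sketch with hypothetical data structures, not the paper's implementation:

```python
import heapq

def beam_search(children, edge_score, root, beam_size):
    """Top-down beam search over a class tree: keep the `beam_size`
    highest-scoring partial paths per level; return the surviving leaf
    classes (a sketch of step 7 of Algorithm 2).

    children:   dict mapping a node to its child nodes (leaves are absent/empty)
    edge_score: edge_score(node, child) -> contribution of that edge, e.g.
                theta[(node, child)] @ h(x)
    """
    frontier = [(0.0, root)]
    leaves = []
    while frontier:
        expanded = []
        for score, node in frontier:
            kids = children.get(node, [])
            if not kids:                  # reached a class (leaf)
                leaves.append((score, node))
            for c in kids:
                expanded.append((score + edge_score(node, c), c))
        frontier = heapq.nlargest(beam_size, expanded)
    return [node for _, node in heapq.nlargest(beam_size, leaves)]
```

Note that the beam can recover a leaf whose early edges score poorly, as long as its ancestors survive each level's pruning.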

5 Application to Neural Language Modeling

In this section, we apply the CANE method to neural language modeling, which solves a probability density estimation problem. In neural language models, the conditional probability distribution of the target word $w$ given context $c$ is defined as

$$p(w \mid c) = \frac{e^{s(w, c;\, \theta)}}{\sum_{w' \in \mathcal{V}} e^{s(w', c;\, \theta)}},$$

where $s(w, c;\, \theta)$ is the scoring function with parameter $\theta$ and $\mathcal{V}$ is the vocabulary. A word in the context is represented by an embedding vector. Given context $c$, the model computes the score for the target word $w$ as

$$s(w, c;\, \theta) = v_w^\top h_c,$$

where $h_c$ is the average embedding vector of the words in context $c$ and $v_w$ is the weight parameter for the target word $w$. Both the word embeddings and the weight parameters need to be estimated. In language models, the vocabulary size is usually very large and the computation of the normalization factor is extremely expensive. Therefore, instead of estimating the exact probability distribution $p(w \mid c)$, sampling methods [26, 16] such as NCE and its variants are typically adopted to approximate it.
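The scoring scheme above can be sketched as follows (hypothetical names: `E` is the embedding matrix and `V` the output weight matrix), together with the full-softmax log-probability whose normalization the sampling methods avoid:

```python
import numpy as np

def word_score(context_ids, target_id, E, V):
    """Score of a target word given an n-gram context: the target word's
    weight vector dotted with the average embedding of the context words."""
    h = E[np.asarray(context_ids)].mean(axis=0)
    return float(V[target_id] @ h)

def word_logprob(context_ids, target_id, E, V):
    """Full-softmax log-probability, O(|V|) per word: this normalization over
    the whole vocabulary is exactly what CANE and NCE avoid during training."""
    h = E[np.asarray(context_ids)].mean(axis=0)
    scores = V @ h
    return float(scores[target_id] - np.log(np.exp(scores).sum()))
```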

Figure 2: Illustration of the CANE model used in neural language modeling.

In order to apply the CANE method, we need to select the candidates given any context $c$. For multi-class classification problems, we devised the Beam Tree method in Algorithm 2, which uses a tree structure to select candidates, and the tree can be obtained by hierarchical clustering over the observations before learning. However, different from the classification problem, the word embeddings in the language model are not known before training, and a hierarchical clustering based tree is not easy to obtain. Therefore, we propose another candidate set selection method for CANE. Suppose we are using an $n$-gram model and predicting the next word. Given a context $c$, there exists a 'next word' set for the last word $u$ in $c$, containing all the words that appear immediately after $u$ at least once in the training texts. Denote this set as $\mathcal{S}_u$, and its complement with respect to the entire vocabulary as $\bar{\mathcal{S}}_u$; then we may construct a light tree for every last word in the contexts as shown in Fig. 2, and we may re-parameterize the score for the target word $w$ as

where $v_w$ is the parameter of the target word $w$, and it is shared by every last word $u$. The two parameters associated with $u$ determine from which subset the candidates will be selected. Given the context $c$, suppose the model chooses the set $\mathcal{S}_u$; then we sample all the candidates from $\mathcal{S}_u$ according to some distribution such as the power-raised unigram distribution. For the noises in CANE, we directly sample words outside $\mathcal{C}_x$ according to the distribution $q$, which can be chosen as the power-raised unigram distribution as well.
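A minimal sketch of this candidate selection, assuming the 'next word' sets and word counts have been collected from the training text (all names here are ours, not the paper's):

```python
import numpy as np

def power_unigram(counts, alpha=0.75):
    """Power-raised unigram distribution over the given counts."""
    p = np.asarray(counts, dtype=float) ** alpha
    return p / p.sum()

def sample_candidates(last_word, successors, counts, n_cand, rng, alpha=0.75):
    """Sample candidate words from the observed 'next word' set of the last
    context word, with power-raised unigram probabilities.

    successors: dict mapping a word to the set of words seen directly after it
    counts:     dict mapping a word to its unigram count
    """
    pool = sorted(successors[last_word])
    p = power_unigram([counts[w] for w in pool], alpha)
    k = min(n_cand, len(pool))
    return list(rng.choice(pool, size=k, replace=False, p=p))
```

Noise words can be drawn the same way from the complement of the candidate set, using the same power-raised unigram weights.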

6 Related Algorithms

We provide a discussion comparing CANE with existing techniques for the large class problem. Given $(x, y)$, NCE and its variants [13, 26, 24, 16, 37, 4] use the observed class $y$ as the only 'candidate', while CANE chooses a subset of candidates according to $x$ without observing $y$. NCE assumes the entire noise distribution is known. For example, a power-raised unigram distribution is shown to be effective in language models because the word frequency represents the noise (word) distribution well. However, in general multi-class classification problems, where knowledge of the noise distribution is absent, NCE may produce unstable estimates from an inaccurate noise distribution. CANE works for general multi-class classification problems and does not rely on a known noise distribution; instead, it only focuses on a small candidate set $\mathcal{C}_x$. Once the true class label is contained in $\mathcal{C}_x$ with high probability, CANE produces robust estimates, as supported by our theoretical results. The variants of NCE [24, 16, 37, 4] also sample one or multiple noises to replace the large normalization factor; however, theoretical guarantees on the consistency and variance of their estimates are not provided. In addition, NCE and its variants cannot speed up prediction, while the Beam Tree algorithm for CANE reduces the prediction complexity to be logarithmic in the class size $K$.

The Beam Tree algorithm is related to the tree classifiers, though the tree structure is used only for selecting candidates in CANE. The Beam Tree method itself also differs from existing tree classifiers. Most of the state-of-the-art tree classifiers, e.g., LOMTree [6] and Recall Tree [7], store local classifiers in their internal nodes; examples are then pushed from the root until reaching a leaf. The Beam Tree algorithm shown in Fig. 1 does not maintain local classifiers; it only uses the tree structure to perform a global heuristic search for candidate selection. We also compare our approach to other state-of-the-art tree classifiers in our experiments.

7 Experiments

We evaluate the CANE method in various applications in this section, including both multi-class classification problems and neural language modeling.

7.1 Classification Problems

In this section, we consider multi-class classification problem with a large number of classes. We compare CANE with NCE, its variants and a number of the state-of-the-art tree classifiers that have been used for large class space problems. The evaluated methods include:

  • Softmax: the standard softmax logistic regression used as a baseline.

  • NCE: the Noise-Contrastive Estimation method [26, 27].

  • BlackOut: a variant of NCE [16]. In [4], the results show that complementary sum sampling [4] is comparable to BlackOut, and both of them outperform Negative Sampling [24], so we compare with BlackOut in our experiments.

  • Filter Tree: the Filter Tree classifier [3] with the implementation in Vowpal Wabbit (VW), a public fast learning system.

  • LOMTree: the LOMTree classifier proposed in [6] with implementation in VW.

  • Recall Tree: the Recall Tree classifier proposed in [7] with implementation in VW.

  • CANE: our CANE method with Beam Tree algorithm.

For simplicity, we use a fixed-arity tree for CANE in all the experiments. We trade off between the number of candidates in $\mathcal{C}_x$ and the number of selected noises in $\mathcal{N}$ to see how these parameters affect the learning performance. Different settings are referred to as 'CANE-($a$ v.s. $b$)', where $a$ is the number of candidates and $b$ the number of noises. We always let $a + b$ equal the number of noises used in NCE and BlackOut, so these methods consider the same number of classes. We use 'NCE-$m$' and 'BlackOut-$m$' to represent the corresponding method with $m$ noises. We uniformly sample noises in CANE. For NCE and BlackOut, following [27, 26, 16, 4], we use the power-raised unigram distribution with the power factor selected from a range of values to sample the noises. However, when the classes are balanced, as in many of the classification datasets, this distribution reduces to the uniform distribution.

All the methods use SGD with the learning rate selected from a range of values. The Beam Tree algorithm requires a tree structure, and we use a tree generated by a simple hierarchical clustering method on the centers of the individual classes (this only needs to scan the data once and clusters the class centers; its time cost is negligible compared to the training phase; the tree construction method is provided in the appendix). All the evaluations are performed on a single machine with a 3.3GHz quad-core Intel Core i5 processor.

For the compared tree classifiers, the Filter Tree generates a fixed tree itself in VW, while the LOMTree and Recall Tree methods use binary trees and are able to dynamically learn the tree structure (more details of the experimental setting are provided in the appendix).

We test all the compared methods on 6 large multi-class classification datasets:

  • Sector: a multi-class classification dataset with 105 classes [5]. Data are split into 90% training and 10% testing.

  • ALOI: a color image classification problem with 1000 classes [12]. Data are split into 90% training and 10% testing.

  • LSHTC1 and Dmoz: two multilingual datasets from [31]. Both of them have around 12K classes. Training and testing sets are from [41].

  • ImageNet 2010: the image classification task in the ImageNet 2010 competition [33]. There are 1000 classes and 1.3M images for training. We use the validation set, which contains 50,000 images, as the test set.

  • ImageNet-10K: the large image classification task in the ImageNet Fall 2009 release [8]. There are approximately 10K classes and 9M images. Following the protocols in [8, 34, 19], we randomly split the data into two halves for training and testing.

Data # classes # features # train # test
Sector 105 55K 8K 1K
ALOI 1K 128 100K 10K
LSHTC1 12K 347K 84K 5K
Dmoz 12K 833K 335K 38K
ImageNet 2010 1K 4K 1.3M 50K
ImageNet-10K 10K 4K 4.5M 4.5M
Table 1: Statistics of the number of classes, number of features, training size and testing size in different datasets.

Table 1 shows the statistics of the datasets. For the Sector, ALOI, LSHTC1 and Dmoz datasets, we use the original features downloaded from the provided sources and use linear models. For the ImageNet 2010 and ImageNet-10K datasets, similar to [30], we transfer the mid-level representations of the VGG-16 net [35] pre-trained on the ImageNet 2012 data [33] to our case. Then, we stack CANE or the other compared methods on top of the partial VGG-16 net as the top layer (the network structure is provided in the supplementary material). Similar to [30], the parameters of the partial VGG-16 net are pre-trained and kept fixed (the pre-trained parameters are publicly available); only the parameters in the top layer are trained on the target datasets, i.e., ImageNet 2010 and ImageNet-10K. We run all the methods for 50 epochs on the Sector, ALOI, LSHTC1, Dmoz and ImageNet 2010 datasets and for 20 epochs on ImageNet-10K, and report the accuracy v.s. epoch relationship.

(a) Sector
(b) ALOI
(c) LSHTC1
(d) Dmoz
(e) ImageNet 2010
(f) ImageNet-10K
Figure 3: Results of test accuracy v.s. epoch on different classification datasets.
Data NCE-10 BlackOut-10 CANE-1v9 CANE-5v5 CANE-9v1 Softmax
Sector 24s / 0.75s 20s / 0.76s 54s / 0.05s 1m45s / 0.14s 2m15s / 0.19s 6m7s / 0.88s
ALOI 3m / 5.9s 3m / 5.8s 4m / 0.1s 7m / 0.3s 8m / 0.5s 27m35s / 6.5s
Data NCE-20 BlackOut-20 CANE-5v15 CANE-10v10 CANE-15v5 Softmax
LSHTC1 26m / 23m57s 26m / 24m14s 31m / 3s 44m / 6s 52m / 9s 5d11h / 13m7s
Dmoz 1h20m / 32m4s 1h20m / 31m48s 1h58m / 30s 3h36m / 1m4s 3h53m / 1m27s 25d / 33m
ImageNet 2010 3h27m / 7m53s 3h23m / 7m50s 4h3m / 22s 5h48m / 41s 6h25m / 51s 4d / 8m42s
ImageNet-10K 13h / 5d 12h / 5d 20h / 51m 1d9h / 1h33m 1d15h / 2h 140d / 5d
Table 2: Training / testing time of the sampling methods. Running Softmax and testing NCE and BlackOut on the large datasets are time consuming; we use multi-thread implementations for these methods and report estimated running times.
Data Filter Tree LOMTree Recall Tree
Sector Acc. 84.67 84.91 86.89
Time 21s / 0.4s 32s / 0.2s 42s / 0.2s
ALOI Acc. 20.07 82.70 83.03
Time 58s / 0.2s 3m20s / 1.0s 2m30s / 0.2s
LSHTC1 Acc. 12.38 11.80 11.58
Time 1h16m / 12s 1h36m / 13s 3h20m / 23s
Dmoz Acc. 24.79 24.13 18.11
Time 2h42m / 19s 3h35m / 23s 17h30m / 39s
ImageNet 2010 Acc. 48.29 49.87 61.28
Time 6h50m / 7s 17h50m / 20s 1d8h / 30s
ImageNet-10K Acc. 4.49 9.72 22.74
Time 2d7h / 15m 2d10h / 20m 7d2h / 1h14m
Table 3: Test accuracy and training / testing time of the tree classifiers.
# candidates Sector ALOI # candidates LSHTC1 Dmoz ImageNet 2010 ImageNet-10K
1 (CANE-1v9) 68.89 44.84 5 (CANE-5v15) 28.28 41.26 76.59 39.59
5 (CANE-5v5) 96.57 86.47 10 (CANE-10v10) 33.44 52.65 87.29 53.28
9 (CANE-9v1) 97.92 93.59 15 (CANE-15v5) 39.16 58.02 91.17 60.22
Table 4: The probability (%) that the true label is included in the selected candidate set on the test set, i.e., the top-$a$ accuracy for $a$ candidates.

Fig. 3 and Table 2 show the accuracy v.s. epoch curves and the training / testing time for NCE, BlackOut, CANE and Softmax. The tree classifiers in VW cannot directly output test accuracy after each epoch, so we report their final results in Table 3. For the ImageNet-10K data, the Softmax method is very time consuming (even with a multi-thread implementation) and we do not report this result. As we can observe, for a fixed total number of sampled classes, using more candidates than noises in CANE achieves better performance, because a larger $\mathcal{C}_x$ increases the chance of covering the target class $y$. The probability that the target class is included in the selected candidate set on the test data is reported in Table 4; this probability is exactly the top-$a$ accuracy for $a$ candidates. On all the datasets, CANE with a larger candidate set achieves considerable improvement over the other methods in terms of accuracy, and sometimes even surpasses Softmax. The per-epoch training time of CANE is slightly slower than that of NCE and BlackOut because of the beam search; however, CANE converges much faster over the training epochs. Moreover, the prediction time of CANE is much faster than those of NCE and BlackOut. It is worth noting that CANE also exceeds some state-of-the-art results on the ImageNet-10K data, e.g., the 19.2% top-1 accuracy reported in [19] and the 21.9% top-1 accuracy reported in [22], while it still underperforms the recent result of 28.4% in [14]. This is probably because the VGG-16 net works better on this dataset than the neural net structure used in [19] and the distance-based method in [22], while [14] adopts a deep neural network combined with metric learning to achieve better feature embedding, leading to better prediction performance.

7.2 Neural Language Modeling

In this experiment, we apply the CANE method to neural language modeling. We test on the Penn TreeBank corpus and the Gutenberg corpus. The Penn TreeBank dataset contains 1M words, and we choose the 12K words appearing at least 5 times as the vocabulary. The Gutenberg dataset contains 50M words, and we choose the 116K words appearing at least 10 times as the vocabulary. For both datasets, the embedding vectors have dimension 200. We use a 4-gram model and set the learning rate to 0.025. We sample 20 noises for NCE and BlackOut, and sample 10 candidates and 10 noises for CANE. The tree classifiers evaluated in the previous classification problems cannot be directly applied to language modeling, so we compare CANE with the NCE and BlackOut methods. For all the compared methods, we use the power-raised unigram distribution to sample the noises, with the power factor selected from a range of values. For the CANE method, given context $c$, when the word-wise tree (introduced in Section 5) directs to one set ($\mathcal{S}_u$ or $\bar{\mathcal{S}}_u$), we use the power-raised unigram distribution to sample the candidates in that set.

(a) Penn TreeBank
(b) Gutenberg
Figure 4: Test perplexity v.s. training epoch on Penn TreeBank and Gutenberg datasets.
Data NCE BlackOut CANE Softmax
Penn TreeBank 6.3k 6.2k 5.5k 63
Gutenberg 778 657 420 12
Table 5: Training speed (words/sec) of different methods.

The test perplexity is shown in Fig. 4. As we can observe, the CANE method converges faster than NCE and BlackOut and reaches the same test perplexity as Softmax within a few epochs. Similar to some cases observed in [16], the perplexities of both NCE and BlackOut converge quite slowly compared to those of Softmax (and CANE). Table 5 shows the training speed of the different methods. From the table, CANE shows comparable per-word processing speed to NCE and BlackOut, while its test performance converges significantly faster, showing the better statistical efficiency expected by our theory. Moreover, CANE achieves a 35 to 90 times training speedup compared to Softmax (5.5k v.s. 63 words/sec on Penn TreeBank and 420 v.s. 12 words/sec on Gutenberg). For testing, all the methods have the same complexity because we evaluate the perplexity over the entire distribution.

8 Conclusion

We proposed Candidates v.s. Noises Estimation (CANE) for fast learning in multi-class classification problems with many classes, and applied this method to word probability estimation in neural language models. We showed that CANE is consistent and that its computation using SGD is efficient (that is, independent of the class size). Moreover, the new estimator has low statistical variance approaching that of softmax logistic regression when the observed class label belongs to the candidate set with high probability. Empirical results demonstrated that CANE is effective for speeding up both training and prediction in large multi-class classification problems, as well as in neural language modeling. We note that this work employs a fixed distribution, such as the power-raised unigram distribution, to sample noises in CANE. However, it can be very useful in practice to estimate the noise distribution during training and to select noise classes according to this estimate.


  • [1] R. Agrawal, A. Gupta, Y. Prabhu, and M. Varma. Multi-label learning with millions of labels: Recommending advertiser bid phrases for web pages. In Proceedings of The International Conference on World Wide Web (WWW), pages 13–24, 2013.
  • [2] S. Bengio, J. Weston, and D. Grangier. Label embedding trees for large multi-class tasks. In Advances in Neural Information Processing Systems (NIPS), pages 163–171, 2010.
  • [3] A. Beygelzimer, J. Langford, and P. Ravikumar. Error-correcting tournaments. In International Conference on Algorithmic Learning Theory (ALT), pages 247–262. Springer, 2009.
  • [4] A. Botev, B. Zheng, and D. Barber. Complementary sum sampling for likelihood approximation in large scale classification. In Artificial Intelligence and Statistics (AISTATS), pages 1030–1038, 2017.
  • [5] C.-C. Chang and C.-J. Lin. Libsvm: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27, 2011.
  • [6] A. E. Choromanska and J. Langford. Logarithmic time online multiclass prediction. In Advances in Neural Information Processing Systems (NIPS), pages 55–63, 2015.
  • [7] H. Daume III, N. Karampatziakis, J. Langford, and P. Mineiro. Logarithmic time one-against-some. arXiv preprint arXiv:1606.04988, 2016.
  • [8] J. Deng, A. C. Berg, K. Li, and L. Fei-Fei. What does classifying more than 10,000 image categories tell us? In European Conference on Computer Vision (ECCV), pages 71–84, 2010.
  • [9] J. Deng, S. Satheesh, A. C. Berg, and F. Li. Fast and balanced: Efficient label tree learning for large scale object recognition. In Advances in Neural Information Processing Systems (NIPS), pages 567–575, 2011.
  • [10] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. Annals of Statistics, 28(2):337–407, 2000.
  • [11] J. H. Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics, pages 1189–1232, 2001.
  • [12] J.-M. Geusebroek, G. J. Burghouts, and A. W. Smeulders. The amsterdam library of object images. International Journal of Computer Vision (IJCV), 61(1):103–112, 2005.
  • [13] M. U. Gutmann and A. Hyvärinen. Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. Journal of Machine Learning Research (JMLR), 13(Feb):307–361, 2012.
  • [14] C. Huang, C. C. Loy, and X. Tang. Local similarity-aware deep feature embedding. In Advances in Neural Information Processing Systems, pages 1262–1270, 2016.
  • [15] Y. Jernite, A. Choromanska, D. Sontag, and Y. LeCun. Simultaneous learning of trees and representations for extreme classification with application to language modeling. arXiv preprint arXiv:1610.04658, 2016.
  • [16] S. Ji, S. Vishwanathan, N. Satish, M. J. Anderson, and P. Dubey. Blackout: Speeding up recurrent neural network language models with very large vocabularies. arXiv preprint arXiv:1511.06909, 2015.
  • [17] A. Z. Kouzani and G. Nasireding. Multilabel classification by bch code and random forests. International Journal of Recent Trends in Engineering, 2(1):113–116, 2009.
  • [18] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 1097–1105, 2012.
  • [19] Q. V. Le. Building high-level features using large scale unsupervised learning. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 8595–8598, 2013.
  • [20] B. Liu, F. Sadeghi, M. Tappen, O. Shamir, and C. Liu. Probabilistic label trees for efficient large scale image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 843–850, 2013.
  • [21] L. Liu, P. M. Comar, S. Saha, P.-N. Tan, and A. Nucci. Recursive nmf: Efficient label tree learning for large multi-class problems. In The International Conference on Pattern Recognition (ICPR), pages 2148–2151, 2012.
  • [22] T. Mensink, J. Verbeek, F. Perronnin, and G. Csurka. Distance-based image classification: Generalizing to new classes at near-zero cost. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 35(11):2624–2637, 2013.
  • [23] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
  • [24] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems (NIPS), pages 3111–3119, 2013.
  • [25] A. Mnih and G. E. Hinton. A scalable hierarchical distributed language model. In Advances in Neural Information Processing Systems (NIPS), pages 1081–1088, 2009.
  • [26] A. Mnih and K. Kavukcuoglu. Learning word embeddings efficiently with noise-contrastive estimation. In Advances in Neural Information Processing Systems (NIPS), pages 2265–2273, 2013.
  • [27] A. Mnih and Y. W. Teh. A fast and simple algorithm for training neural probabilistic language models. In Proceedings of the 29th International Conference on Machine Learning (ICML), pages 1751–1758, 2012.
  • [28] F. Morin and Y. Bengio. Hierarchical probabilistic neural network language model. In The International Conference on Artificial Intelligence and Statistics (AISTATS), volume 5, pages 246–252, 2005.
  • [29] P. Norvig. Paradigms of artificial intelligence programming: case studies in Common LISP. Morgan Kaufmann, 1992.
  • [30] M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Learning and transferring mid-level image representations using convolutional neural networks. In Proceedings of The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1717–1724, 2014.
  • [31] I. Partalas, A. Kosmopoulos, N. Baskiotis, T. Artieres, G. Paliouras, E. Gaussier, I. Androutsopoulos, M.-R. Amini, and P. Galinari. Lshtc: A benchmark for large-scale text classification. arXiv preprint arXiv:1503.08581, 2015.
  • [32] Y. Prabhu and M. Varma. Fastxml: A fast, accurate and stable tree-classifier for extreme multi-label learning. In Proceedings of The ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 263–272, 2014.
  • [33] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. Imagenet large scale visual recognition challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
  • [34] J. Sánchez and F. Perronnin. High-dimensional signature compression for large-scale image classification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1665–1672, 2011.
  • [35] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • [36] A. Sordoni, M. Galley, M. Auli, C. Brockett, Y. Ji, M. Mitchell, J.-Y. Nie, J. Gao, and B. Dolan. A neural network approach to context-sensitive generation of conversational responses. arXiv preprint arXiv:1506.06714, 2015.
  • [37] M. K. Titsias. One-vs-each approximation to softmax for scalable estimation of probabilities. In Advances in Neural Information Processing Systems (NIPS), pages 4161–4169, 2016.
  • [38] A. Vaswani, Y. Zhao, V. Fossum, and D. Chiang. Decoding with large-scale neural language models improves translation. In The Conference on Empirical Methods on Natural Language Processing (EMNLP), pages 1387–1392. Citeseer, 2013.
  • [39] P. Vincent, A. de Brébisson, and X. Bouthillier. Efficient exact gradient update for training deep networks with very large sparse targets. In Advances in Neural Information Processing Systems (NIPS), pages 1108–1116, 2015.
  • [40] J. Weston, A. Makadia, and H. Yee. Label partitioning for sublinear ranking. In Proceedings of The International Conference on Machine Learning (ICML), pages 181–189, 2013.
  • [41] I. E.-H. Yen, X. Huang, P. Ravikumar, K. Zhong, and I. Dhillon. Pd-sparse: A primal and dual sparse approach to extreme multiclass and multilabel classification. In Proceedings of The International Conference on Machine Learning (ICML), pages 3069–3077, 2016.

Supplementary Material

Appendix A Proofs

In the theoretical analysis, we fix . Then, we only need to consider . Now, the normalization factor becomes

with some sampled class . Now, we can rewrite and as

In the proofs, we will use the pointwise notations , , and to represent , , and , respectively, for simplicity.

A.1 Useful Lemma

We will need the following lemma in our analysis.

Lemma 1.

For any norm defined on the parameter space of , assume the quantities , and for are bounded. Then, for any compact set defined on the parameter space, we have


For fixed , let

Then we have and . By the law of large numbers, converges pointwise to in probability.

According to the assumption, there exists a constant such that

Given any , we may find a finite cover such that for any , there exists with . Since is finite, as , converges to in probability. Therefore, as , with probability , we have

Letting , we obtain the first bound. The second and third bounds can be obtained similarly. ∎
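The finite-cover step above is the standard uniform law of large numbers argument. Since the specific symbols are elided in this extraction, the following sketch uses generic placeholders ($f_n$, $f$, $\Theta$, $C$ are not the paper's notation):

```latex
% Suppose $f_n(\theta) \to f(\theta)$ pointwise in probability on a compact
% set $\Theta$, with $f_n$ and $f$ both $C$-Lipschitz in $\theta$.
% Take a finite $\epsilon$-cover $\{\theta_1, \dots, \theta_J\}$ of $\Theta$.
% For any $\theta$ with $\|\theta - \theta_j\| \le \epsilon$,
\begin{align*}
|f_n(\theta) - f(\theta)|
  &\le |f_n(\theta) - f_n(\theta_j)| + |f_n(\theta_j) - f(\theta_j)|
       + |f(\theta_j) - f(\theta)| \\
  &\le 2C\epsilon + \max_{j \le J} |f_n(\theta_j) - f(\theta_j)|.
\end{align*}
% The maximum over the finite cover vanishes in probability as $n \to \infty$,
% and letting $\epsilon \to 0$ yields uniform convergence on $\Theta$.
```

The boundedness assumptions in the lemma supply the Lipschitz-type control needed for the first and third terms.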

A.2 Proof of Theorem 1


can be rewritten as

For , we have