Word2Vec is a special case of Kernel Correspondence Analysis and Kernels for Natural Language Processing
Abstract
We show that correspondence analysis (CA) is equivalent to defining a Gini index with an appropriately scaled one-hot encoding. Using this relation, we introduce a nonlinear kernel extension to CA. This extended CA yields known analyses for natural language via specialized kernels that use an appropriate contingency table. We propose a semi-supervised CA, which is a special case of the kernel extension to CA. Because CA requires excessive memory when applied to numerous categories, it has not been used for natural language processing. We address this problem by introducing delayed evaluation into randomized singular value decomposition. The memory-efficient CA is then applied to a word-vector representation task. We also propose a tail-cut kernel, which is an extension of the skip-gram approach within the kernel extension to CA. Our tail-cut kernel outperforms existing word-vector representation methods.
1 Introduction
Principal component analysis (PCA) is an unsupervised feature extractor. When applied to the chi-squared distances of categorical data, PCA becomes correspondence analysis (CA). CA can extract numeric vector features from categorical data without supervised labeling. The simplest numerical representation for categorical data is a histogram-based representation such as tf-idf or a "bag of words". Many applications use such simple representations. However, histogram-based representations cannot make use of information about correlations within the data. CA enables the representation of both histograms and correlations in data.
The most popular problem involving categorical data is natural language processing (NLP). However, CA has not been applied to NLP because most NLP problems involve a large number of categories; for example, the entire Wikipedia text comprises more than 10,000 different words. Because CA requires excessive memory when applied to numerous categories, it has not been used for NLP problems involving more than 10,000 categories.
CA is implemented by singular value decomposition (SVD) of a contingency table. In many categorical problems, the contingency table is sparsely populated. Randomized SVD [6] is an appropriate SVD method for sparse matrices. However, CA requires dense matrix computation even for a sparse contingency table, which makes very large demands on memory resources. This research aims to address this problem. We propose using a randomized SVD with delayed evaluation to avoid expanding the sparse matrix into a dense matrix. We refer to this process as the delayed sparse randomized SVD (DSSVD) algorithm. We show that CA with DSSVD can be applied to NLP problems.
Neural-network-based approaches are the most popular feature extractors used in NLP. Of these, word2vec [10] is well known. Usually, such an approach involves many parameters, which in most cases have no explicit meaning. These parameters have to be tuned by grid search or manual tuning, which is difficult in the absence of explicit meanings for the parameters. This parameter problem with neural-network-based approaches also gives rise to domain problems. For example, if word2vec is tuned for application to restaurant reviews, the tuning may not be appropriate for movie reviews. Moreover, in most cases the weight values in neural networks are initialized randomly, which means that the computed results differ between runs. For example, word-vector representations computed by word2vec will always differ, even when the same parameter values are used, because of the random initial values. This adds to the difficulty of parameter tuning.
Since CA is the PCA of a contingency table, it always computes the same result. From this viewpoint, the CA approach is better than neural-network-based approaches. Although the latter have these issues, they can approximate any nonlinear function, which means that they can be used for a wide variety of problems. However, because CA is a form of linear analysis, it is not directly applicable to nonlinear problems. To address this issue, this research introduces a nonlinear kernel extension to CA. We then show that this nonlinear CA is more accurate than recent neural-network-based approaches, focusing in particular on word-vector representation tasks. To distinguish linear from nonlinear CA, we refer to the linear CA as LCA.
2 CA
CA is a statistical visualization method for picturing the associations between the levels of a two-way contingency table. As an illustration, consider the contingency table shown in Table 1. This is well known as "Fisher's data" [4] and records the eye and hair color of people in Caithness, Scotland. The CA of these data yields the graphical display presented in Figure 1, which shows the correspondence between eye and hair color.
Table 1 shows the joint population distribution of the categorical variable for eye color:

x ∈ {blue, light, medium, dark},

and the categorical variable for hair color:

y ∈ {fair, red, medium, dark, black}.

The visualization is based on "one-hot encodings" and the "indicator matrices" of categorical variables. For example, a one-hot encoding and indicator matrix of the categorical variable x can be defined as:

u(blue) = (1, 0, 0, 0)^T, u(light) = (0, 1, 0, 0)^T, u(medium) = (0, 0, 1, 0)^T, u(dark) = (0, 0, 0, 1)^T, (1)

Z_x = [u_1 u_2 ... u_n]^T, (2)

where row i of the indicator matrix Z_x is the one-hot encoding of the i-th observation.
In the following, x and y denote categorical variables representing the row and column of a contingency table, respectively, and u_i and v_i denote the one-hot encodings of x and y for the i-th instance.
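As a concrete sketch (the variable names and toy data here are our own illustration, not the paper's), the indicator matrices and the contingency table they induce can be built as follows:

```python
import numpy as np

def one_hot(labels):
    """Return the indicator matrix whose i-th row is the one-hot
    encoding of the i-th categorical observation, plus the level order."""
    levels = sorted(set(labels))
    index = {c: k for k, c in enumerate(levels)}
    Z = np.zeros((len(labels), len(levels)))
    for i, c in enumerate(labels):
        Z[i, index[c]] = 1.0
    return Z, levels

# Toy sample: eye color (row variable) and hair color (column variable).
eye = ["blue", "light", "blue", "dark", "light"]
hair = ["fair", "red", "fair", "dark", "fair"]
U, eye_levels = one_hot(eye)    # indicator matrix of the row variable
V, hair_levels = one_hot(hair)  # indicator matrix of the column variable

# The contingency table is recovered as F = Z_x^T Z_y.
F = U.T @ V
print(F)
```

The product `U.T @ V` counts, for each pair of levels, how often they co-occur, which is exactly the contingency table of Table 1's kind.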
3 Covariance Based on Gini Index
Consider the following relation [16] about the variance of continuous data.
Lemma 1.
The variance of continuous data can be expressed as a sum of squared differences between individual instances:

s^2 = (1/n) sum_{i=1}^{n} (x_i - x̄)^2 (3)

    = (1/(2n^2)) sum_{i=1}^{n} sum_{j=1}^{n} (x_i - x_j)^2 (4)

where x_1, ..., x_n are continuous sample data, x̄ is the average value of the sample data, and s^2 is the variance of the continuous data.
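The identity of Lemma 1 is easy to verify numerically (the toy data are our own choosing):

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = len(x)

# Direct definition: mean squared deviation from the average.
var_direct = np.mean((x - x.mean()) ** 2)

# Pairwise form of Lemma 1: s^2 = (1 / (2 n^2)) sum_{i,j} (x_i - x_j)^2
var_pairwise = ((x[:, None] - x[None, :]) ** 2).sum() / (2 * n * n)

assert np.isclose(var_direct, var_pairwise)
print(var_direct)  # 4.0
```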
Using the same formulation for the sum of the differences between individual instances of categorical data gives a Gini index [5]:

gini(x) = (1/(2n^2)) sum_{i=1}^{n} sum_{j=1}^{n} d(c_i, c_j) (7)

where

d(c_i, c_j) = 0 if c_i = c_j, and 2 otherwise. (8)

Here, c_i is a categorical variable that takes one of a finite set of values. Rewriting this formulation with one-hot encoding gives:

gini(x) = (1/(2n^2)) sum_{i=1}^{n} sum_{j=1}^{n} ||u_i - u_j||^2 (9)

This is more similar to the continuous case:

gini(x) = (1/n) sum_{i=1}^{n} ||u_i - ū||^2 (10)

where ū is the average of the one-hot encodings.
Using this one-hot encoding, we can also define the covariance of categorical data. Consider the categorical variables x and y. If the given sample categorical data are

(x_1, y_1), (x_2, y_2), ..., (x_n, y_n) (11)

we can define the covariance of x and y:

cov(x, y) = (1/(2n^2)) sum_{i=1}^{n} sum_{j=1}^{n} (u_i - u_j)^T (v_i - v_j) (12)
Okada [12, 11] showed that this definition is invalid by considering the contingency table shown in Table 2. In that contingency table, x and y are highly correlated, and one particular instance weakens the correlation between x and y. However, the same instance increases the covariance in formulation (12). To avoid such an invalid increase, Okada defined the covariance using a rotated one-hot encoding [11].
Definition 1.
The covariance of categorical variables x and y is the maximized value:

cov(x, y) = max_R (1/(2n^2)) sum_{i=1}^{n} sum_{j=1}^{n} (u_i - u_j)^T R (v_i - v_j) (13)

where R is a rotation matrix that maximizes the covariance, and the vectors u_i and v_i are the one-hot encodings of x and y.
In terms of this definition, the problematic instance reduces the covariance. In this respect, this definition is better than (12). Expanding the maximization problem (13) gives a simplified form.
Lemma 2.
The maximization problem (13) is equivalent to

maximize_R tr(R^T G) subject to R^T R = I (14)

where

G = P - p_x p_y^T (15)

p_x = n_x / n, p_y = n_y / n (16)

P = F / n (17)

F is an a × b contingency table, with entries giving the frequency with which each value of the row categorical variable x occurs together with each value of the column categorical variable y. n_x denotes the vector of row marginals and n_y is the vector of column marginals. n is the total number of instances.
Proof.
Expanding the differences in (13) and collecting terms into a trace yields the stated form. ∎
We can solve the maximization problem (14) using SVD.
Theorem 1.
R* = U V^T is a local optimum of the maximization problem (14), where U and V are given by the SVD of G in (24) below.
Proof.
Local optima of the maximization problem are found by differentiating the Lagrangian:

L(R, Λ) = tr(R^T G) + tr(Λ (I - R^T R)) (22)

where Λ is a matrix of Lagrange multipliers. Differentiating this Lagrangian with respect to R gives the stationary condition:

G = R (Λ + Λ^T) (23)

This result shows that R^T G must be a symmetric matrix.
Consider the SVD:

G = U D(σ) V^T (24)

for the case a = b. Here, S = V D(σ) V^T is a symmetric matrix, and R* = U V^T is a rotation matrix with G = R* S. R* satisfies the stationary condition (23) and the constraint of Lemma 2. We can therefore conclude that R* is a local optimum of the problem of Lemma 2. ∎
Theorem 2.
When all singular values of G are positive, R* = U V^T is the global optimum for the maximization problem (14).
Proof.
Substituting R = R* Q into (14) gives:

tr(R^T G) = tr(Q^T R*^T G) = tr(Q^T V D(σ) V^T)

Note that Q is also a rotation matrix. Consider

tr(Q^T V D(σ) V^T) = sum_k (V^T Q^T V)_{kk} σ_k

where σ is the vector of the singular values of G. Because V^T Q^T V is a rotation matrix, its diagonal entries are at most 1. Then,

tr(R^T G) <= sum_k σ_k (25)

The case Q = I gives the upper limit: tr(R*^T G) = sum_k σ_k. ∎
In our experiments, we did not find a case for which G had a large negative singular value. In the following, we assume that R* = U V^T is the global optimum of the maximization problem (14).
If negative singular values appear, we can use the following theorem.
Theorem 3.
Consider the following optimization problem for a given matrix A:

maximize_R tr(R^T A) subject to R^T R = I, det(R) = 1

The global optimum of this optimization problem is:

R* = U D(1, ..., 1, det(U V^T)) V^T

Here,

A = U D(σ) V^T

is an SVD of the matrix A, where σ is the vector of the singular values for the matrix A.
Proof.
Consider the case for which the sign correction enters negatively. In such a case, the upper limit (25) becomes:

tr(R^T A) <= σ_1 + ... + σ_{m-1} + det(U V^T) σ_m (26)

The case R = R* gives the upper limit. ∎
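The SVD-based optimum can be checked numerically. The sketch below (our own illustration, using a random stand-in for G and the orthogonality constraint R^T R = I) confirms that R* = U V^T attains the bound sum_k σ_k and that random orthogonal matrices never exceed it:

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.standard_normal((4, 4))   # stand-in for G = P - p_x p_y^T

# SVD-based solution of  max_R tr(R^T G)  subject to  R^T R = I
U, sigma, Vt = np.linalg.svd(G)
R_star = U @ Vt

# tr(R*^T G) = tr(V U^T U D(sigma) V^T) = sum of the singular values
best = np.trace(R_star.T @ G)
assert np.isclose(best, sigma.sum())

# No other orthogonal matrix does better (von Neumann trace inequality).
for _ in range(200):
    Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))
    assert np.trace(Q.T @ G) <= best + 1e-9
```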
4 LCA
First, we introduce generalized singular value decomposition (GSVD).
Definition 2.
Generalized singular value decomposition (GSVD) of a given matrix A with diagonal weight matrices D(w_x) and D(w_y) is the decomposition:

A = N D(σ) M^T (27)

where

N = D(w_x)^{-1/2} U (28)

M = D(w_y)^{-1/2} V (29)

D(w) denotes the diagonal matrix whose diagonal entries are the components of the vector w. The vectors w_x and w_y are given weight vectors. U and V are given by ordinary SVD:

Ã = U D(σ) V^T (30)

where

Ã = D(w_x)^{1/2} A D(w_y)^{1/2} (31)

Note that this decomposition maintains the perpendicularity of the basis vectors in the decomposed space with respect to the weight matrices:

N^T D(w_x) N = I, M^T D(w_y) M = I (32)
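A sketch of this construction, assuming the standard reduction of GSVD to the ordinary SVD of the weighted matrix (31):

```python
import numpy as np

def gsvd(A, wx, wy):
    """GSVD of A under diagonal weights D(wx), D(wy):
    A = N D(sigma) M^T with N^T D(wx) N = I and M^T D(wy) M = I,
    computed via the ordinary SVD of D(wx)^(1/2) A D(wy)^(1/2)."""
    sx = np.sqrt(wx)
    sy = np.sqrt(wy)
    U, sigma, Vt = np.linalg.svd(sx[:, None] * A * sy[None, :],
                                 full_matrices=False)
    N = U / sx[:, None]     # N = D(wx)^(-1/2) U
    M = Vt.T / sy[:, None]  # M = D(wy)^(-1/2) V
    return N, sigma, M

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 4))
wx = rng.uniform(0.5, 2.0, size=5)
wy = rng.uniform(0.5, 2.0, size=4)
N, sigma, M = gsvd(A, wx, wy)

# The decomposition reproduces A and satisfies the weighted
# orthogonality conditions (32).
assert np.allclose(N @ np.diag(sigma) @ M.T, A)
assert np.allclose(N.T @ np.diag(wx) @ N, np.eye(4))
assert np.allclose(M.T @ np.diag(wy) @ M, np.eye(4))
```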
Using GSVD, we can define the well-known analysis for categorical data.¹

¹ http://forrest.psych.unc.edu/research/vistaframes/pdf/chap11.pdf
Definition 3.
The linear correspondence analysis (LCA) of a given contingency table F is the GSVD:

G = N D(σ) M^T (33)

with weight matrices D(p_x)^{-1} and D(p_y)^{-1}. Here,

G = P - p_x p_y^T (34)

p_x = n_x / n, p_y = n_y / n (35)

P = F / n (36)
Lemma 3.
LCA is equivalent to the maximization problem:

maximize_R tr(R^T S) subject to R^T R = I (37)

and has the solution:

R* = U V^T (38)

Here, U and V are given by ordinary SVD:

S = U D(σ) V^T (39)

where

S = D(p_x)^{-1/2} G D(p_y)^{-1/2} (40)
Proof.
LCA is the GSVD of G. This GSVD is the ordinary SVD of S = D(p_x)^{-1/2} G D(p_y)^{-1/2}. Applying Theorem 1 to the SVD of S gives the required result. ∎
(21) and (30) enable the SVD to be rewritten as the following maximization problem based on one-hot encoding.
Theorem 4.
LCA is equivalent to the maximization problem:

maximize_R (1/(2n^2)) sum_{i=1}^{n} sum_{j=1}^{n} (ũ_i - ũ_j)^T R (ṽ_i - ṽ_j)

subject to R^T R = I (41)

where

ũ_i = D(p_x)^{-1/2} u_i (42)

ṽ_i = D(p_y)^{-1/2} v_i (43)

are scaled one-hot encodings.
Proof.
Substituting the scaled encodings (42) and (43) into the expansion of Lemma 2 reduces the objective of (41) to tr(R^T S). ∎
This maximization problem defines the rotated Gini index using scaled one-hot encoding. We can therefore say that LCA is equivalent to defining a Gini index using scaled and rotated one-hot encoding.
5 Nonlinear Extension
We now consider extending the optimization problem (41) using one-hot encodings on nonlinearly mapped spaces.
Definition 4.
A nonlinear extension to CA can be expressed as:

maximize_R (1/(2n^2)) sum_{i=1}^{n} sum_{j=1}^{n} (φ_x(ũ_i) ⊖ φ_x(ũ_j))^T R (φ_y(ṽ_i) ⊖ φ_y(ṽ_j))

subject to R^T R = I (49)

where φ_x and φ_y are nonlinear mappings and ⊖ denotes a subtraction operator on the nonlinearly mapped spaces. The summation operator performs cumulative addition on the nonlinearly mapped spaces:

sum⊕_i z_i = z_1 ⊕ z_2 ⊕ ... ⊕ z_n

where ⊕ is an addition operator on the nonlinearly mapped spaces. We refer to this formulation (49) as kernel correspondence analysis (KCA).
To be able to use the kernel trick, we assume the following rules about the subtraction and addition operations.
Assumption 1.
(50)  
(51)  
(52)  
(53)  
(54)  
(55) 
Because φ_x, φ_y, ⊖, and ⊕ are nonlinear operators, these relations are not valid in general. However, moving left-hand-side operators to the right-hand side in these relations moves the nonlinear mappings outward in the expression. Moving them to the extreme right means that each mapping need be evaluated only once.
When the rules in Assumption 1 hold, expanding (49) gives the following theorem.
Theorem 5.
If the rules in Assumption 1 are valid, we can introduce kernel matrices K_x and K_y. Using these kernel matrices, the maximization problem (49) becomes:
subject to  (56) 
Note that this formulation requires the nonlinear mappings to be evaluated only once.
Specifying the operators and kernel matrices enables access to various known analyses for categorical data and NLP. Table 3 gives the relation between these specifications and the known methods.
Table 3: Known methods obtained by specifying the operators and kernel matrices.

Name: LCA; Gini index [12, 11]; SGNS [10, 8]; GloVe [13]
Table 4: Word-similarity results (Spearman's rank correlation) on the text8 corpus.

method      | Sim   | Rel   | MEN   | M.Turk | Rare  | S999
------------|-------|-------|-------|--------|-------|------
CBOW        | 0.388 | 0.438 | 0.383 | 0.579  | 0.050 | 0.075
SGNS        | 0.674 | 0.654 | 0.561 | 0.608  | 0.027 | 0.215
GloVe       | 0.431 | 0.466 | 0.421 | 0.508  | 0.118 | 0.096
fastText    | 0.655 | 0.609 | 0.636 | 0.623  | 0.059 | 0.223
tail-cut    | 0.762 | 0.667 | 0.682 | 0.649  | 0.121 | 0.212
LCA         | 0.749 | 0.680 | 0.671 | 0.668  | 0.127 | 0.218
LCA         | 0.741 | 0.657 | 0.672 | 0.640  | 0.135 | 0.211
SCA+MEN     | 0.743 | 0.665 | 0.770 | 0.636  | 0.136 | 0.210
SCA+M.Turk  | 0.741 | 0.658 | 0.672 | 0.798  | 0.136 | 0.211

red: best result; magenta: 2nd best
5.1 Semi-supervised CA
Consider the case where we wish to manually tune the distances between one-hot encodings using tuning ratio tables T_x and T_y.
(57) 
For this case, we can define the following problem.
Definition 5.
Semi-supervised correspondence analysis (SCA) can be expressed as:
subject to  
(58) 
where ∘ denotes the Hadamard product.
This problem is defined by combining (56) and (57). The tuning tables T_x and T_y can be regarded as supervised training data. However, this method can also exploit unsupervised training data, like PCA. We refer to this process as semi-supervised correspondence analysis (SCA).
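As a minimal illustration of the Hadamard-product tuning (the matrices below are invented toy values, and the placement of the tuning table follows our reading of the definition, so it is an assumption):

```python
import numpy as np

# Kernel (Gram) matrix between three row categories.
Kx = np.array([[1.0, 0.3, 0.1],
               [0.3, 1.0, 0.2],
               [0.1, 0.2, 1.0]])

# Hypothetical tuning ratio table: boost the association between
# categories 0 and 1 as a supervised hint; leave the rest unchanged.
Tx = np.ones_like(Kx)
Tx[0, 1] = Tx[1, 0] = 2.0

Kx_tuned = Tx * Kx   # Hadamard (elementwise) product
print(Kx_tuned[0, 1])  # 0.6
```

Only the selected entries are rescaled, so supervised knowledge enters the analysis without changing the rest of the kernel.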
6 Delayed Sparse Matrix
CA is the ordinary SVD:

S = D(p_x)^{-1/2} (P - p_x p_y^T) D(p_y)^{-1/2} = U D(σ) V^T (59)

Because p_x p_y^T is a dense matrix, S is also a dense matrix, even when the contingency table F is a sparse matrix. This is the reason why the CA approach makes such large demands on memory resources.
However, computing the dense matrix can be avoided by delayed evaluation. Consider multiplying (59) by an arbitrary matrix t on the right-hand side:

S t = D(p_x)^{-1/2} (P (D(p_y)^{-1/2} t) - p_x (p_y^T D(p_y)^{-1/2} t)) (60)

The left-hand-side multiplication can be expressed similarly:

t^T S = ((D(p_x)^{-1/2} t)^T P - ((D(p_x)^{-1/2} t)^T p_x) p_y^T) D(p_y)^{-1/2} (61)
Randomized SVD requires only multiplication by the matrix to be decomposed, as in the power method. We can therefore execute the randomized SVD using (60) and (61) without forming the expanded matrix (59). Because this scheme avoids computing the dense matrix, it reduces both computing time and memory requirements. We refer to this scheme as the delayed sparse randomized SVD (DSSVD) algorithm. A Python implementation of this CA is provided at https://github.com/niitsuma/delayedsparse/blob/master/delayedsparse/ca.py
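A sketch of the delayed evaluation, using SciPy's `LinearOperator` in place of the author's implementation (an assumption on our part): the products (60) and (61) are supplied as callbacks, so an iterative SVD never forms the dense matrix S:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator, svds

def ca_operator(F):
    """CA target S = D(px)^(-1/2) (P - px py^T) D(py)^(-1/2), P = F / n,
    exposed only through its matrix-vector products, so the dense
    rank-one term px py^T is never expanded."""
    F = sp.csr_matrix(np.asarray(F, dtype=float))
    n = F.sum()
    px = np.asarray(F.sum(axis=1)).ravel() / n
    py = np.asarray(F.sum(axis=0)).ravel() / n
    dx = 1.0 / np.sqrt(px)
    dy = 1.0 / np.sqrt(py)
    P = F / n

    def matvec(v):          # implements (60)
        w = dy * np.ravel(v)
        return dx * (P @ w - px * (py @ w))

    def rmatvec(u):         # implements (61)
        w = dx * np.ravel(u)
        return dy * (P.T @ w - py * (px @ w))

    return LinearOperator(F.shape, matvec=matvec, rmatvec=rmatvec,
                          dtype=float)

# Sanity check against the explicitly expanded dense matrix.
rng = np.random.default_rng(2)
F = rng.integers(1, 6, size=(6, 5)).astype(float)
op = ca_operator(F)

n = F.sum()
px = F.sum(axis=1) / n
py = F.sum(axis=0) / n
S = np.diag(px ** -0.5) @ (F / n - np.outer(px, py)) @ np.diag(py ** -0.5)

U, s, Vt = svds(op, k=3)                  # never forms S explicitly
sd = np.linalg.svd(S, compute_uv=False)   # dense reference
assert np.allclose(np.sort(s)[::-1], sd[:3], atol=1e-6)
```

For a real word-word contingency table, only the sparse P and two vectors are held in memory, which is what makes CA feasible at NLP scale.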
7 Word Representation
This research discusses the application of CA to word-vector representation tasks. Consider the following contingency table for some given training-text data:

F = [F_{w,c}] (62)

where F_{w,c} is the number of times that the word w appears in the context c.
Based on this table, Mikolov et al. [10] introduced vector representations of words, referred to as word2vec. The table is computed by using the skip-gram model. However, the skip-gram model requires random sampling, which gives different results for each computation. This research uses the following fixed representation instead.
Let n_k(w_1, w_2) denote the number of sub-sentences in which the word w_2 appears k words after the word w_1, with arbitrary words appearing between them. For example, consider the sentence:

"this is this is this is this is this."

In this sentence, the number of times "is" occurs three words after "this" is:

n_3(this, is) = 3

"is" also appears in other locations. Given an appropriate window size h, this notation can represent a relation similar to the skip-gram:

F = sum_{k=1}^{h} n_k (63)
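The counting scheme above can be sketched in a few lines (the function name and the exact-offset reading of the definition are our assumptions):

```python
from collections import Counter

sentence = "this is this is this is this is this".split()

def offset_counts(tokens, k):
    """Count, for every word pair (w1, w2), the positions where w2
    appears exactly k words after w1 (arbitrary words in between)."""
    return Counter((tokens[t], tokens[t + k])
                   for t in range(len(tokens) - k))

n3 = offset_counts(sentence, 3)
print(n3[("this", "is")])  # -> 3
```

This reproduces the worked example: "is" occurs three words after "this" exactly three times in the sentence.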
This representation cannot ignore noise relations when h is large. To ignore such noise, we introduce the following weighted sum:

(64)

where n(w) is the number of times that the word w appears in all the training text, and n is the total number of words in the given training text. The weighted sum can be introduced using a kernel extension similar to SCA. We refer to this extension as the "tail-cut kernel". We can compute the LCA of (62) and (63), and the KCA of (64).
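A minimal sketch of a weighted sum over offset tables follows. The geometric `decay` weight is a hypothetical stand-in: the paper's weights are built from word frequencies, which are not fully recoverable here.

```python
import numpy as np
from collections import Counter

def cooccurrence(tokens, k, vocab):
    """Offset-k co-occurrence table n_k over a fixed vocabulary."""
    idx = {w: i for i, w in enumerate(vocab)}
    F = np.zeros((len(vocab), len(vocab)))
    for t in range(len(tokens) - k):
        F[idx[tokens[t]], idx[tokens[t + k]]] += 1.0
    return F

def tail_cut_table(tokens, window, decay=0.5):
    """Weighted sum over offsets, down-weighting distant (noisier)
    co-occurrences; `decay` is a hypothetical stand-in for the
    paper's frequency-based weights."""
    vocab = sorted(set(tokens))
    table = sum(decay ** (k - 1) * cooccurrence(tokens, k, vocab)
                for k in range(1, window + 1))
    return table, vocab

tokens = "this is this is this is this is this".split()
F, vocab = tail_cut_table(tokens, window=3)
```

Nearby offsets dominate the sum while distant ones contribute only a tail, which is the down-weighting idea behind the tail-cut kernel.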
8 Experiments
This section compares various word-vector representation methods using the text8 corpus.²

² http://mattmahoney.net/dc/text8.zip
8.1 Delayed Sparse Randomized SVD
Figures 2 and 3 show the computing times and the required memory for LCA, respectively. The horizontal axis is the size of the training data. The initial section of the text8 corpus was used as the training data for the LCA. The experiments were carried out in a Gentoo Linux environment using an Intel i7-3770K 3.50 GHz processor. Note that the vertical axes have logarithmic scales.
The LCA was computed using the SVD in the NumPy library, the randomized SVD in the scikit-learn library, and the DSSVD. DSSVD was 100 times faster than the NumPy SVD and 10 times faster than the randomized SVD. The memory required for DSSVD was 10% of that required for the NumPy SVD and 20% of that required for the randomized SVD. When using the whole text8 corpus, the differences became even more pronounced. Because of the excessive memory requirements, using CA for NLP is impossible without DSSVD. The Python code for this experiment is provided at https://github.com/niitsuma/delayedsparse/blob/master/democa.sh
8.2 Word Representation
We evaluated the English word-vector representations by focusing on the similarity between words, using six test datasets.
Table 4 shows a comparison between methods for the whole text8 corpus. Evaluation with these six test datasets provides a ranking of similarity among words, and the evaluation values are Spearman's rank correlation coefficients of that ranking. For comparison, we show the results for skip-gram with negative sampling (SGNS) [10], continuous bag-of-words (CBOW) [10], GloVe [13], and fastText [2].
In most cases, the tail-cut kernel provided the best or almost-best results. The LCA also provided some of the best results. However, the LCA results were drastically affected by the window-size parameter and showed instability, whereas the tail-cut kernel provided stable results. For window sizes larger than 30, the tail-cut results change only insignificantly. This implies that the tail-cut kernel is relatively independent of the window-size parameter, thereby possibly reducing the number of parameters by one.
SCA based on LCA was also evaluated. The SCA used the MEN data and the M.Turk data as supervised training data. SCA outperformed LCA on much of the test data. These results demonstrate that SCA can work effectively. Although the word-vector representation task is unsupervised, SCA can exploit supervised data within it. Part of the code for these experiments is provided at https://github.com/niitsuma/wordca
9 Conclusion
We have proposed a memory-efficient CA method based on randomized SVD. The algorithm also drastically reduces the computation time. This efficient CA can be applied to word-vector representation tasks. The experimental results show that CA can outperform existing methods in the word-vector representation task. We have further proposed the tail-cut kernel, which is an extension of the skip-gram approach within KCA. The tail-cut kernel also outperformed existing word-vector representation methods.
References
 [1] Eneko Agirre, Enrique Alfonseca, Keith Hall, Jana Kravalova, Marius Paşca, and Aitor Soroa. A study on similarity and relatedness using distributional and wordnetbased approaches. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 19–27. Association for Computational Linguistics, 2009.
 [2] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606, 2016.
 [3] Elia Bruni, Gemma Boleda, Marco Baroni, and Nam Khanh Tran. Distributional semantics in technicolor. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, pages 136–145. Association for Computational Linguistics, July 2012.
 [4] R. A. Fisher. The precision of discriminant functions. Annals of Eugenics, 10:422–429, 1940.
 [5] C. W. Gini. Variability and mutability, contribution to the study of statistical distributions and relations. Studi Economico-Giuridici della R. Universita de Cagliari (1912). Reviewed in: Light, R. J. and Margolin, B. H. An analysis of variance for categorical data. J. American Statistical Association, 66:534–544, 1971.
 [6] N. Halko, P. G. Martinsson, and J. A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217–288, May 2011.
 [7] Felix Hill, Roi Reichart, and Anna Korhonen. Simlex999: Evaluating semantic models with genuine similarity estimation. Comput. Linguist., 41(4):665–695, December 2015.
 [8] Omer Levy and Yoav Goldberg. Neural word embedding as implicit matrix factorization. In Proceedings of the 27th International Conference on Neural Information Processing Systems, pages 2177–2185, 2014.
 [9] Thang Luong, Richard Socher, and Christopher Manning. Better word representations with recursive neural networks for morphology. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pages 104–113, August 2013.
 [10] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems, pages 3111–3119. 2013.
 [11] Hirotaka Niitsuma and Takashi Okada. Covariance and PCA for categorical variables. In Proceedings of the 9th PacificAsia Conference on Knowledge Discovery and Data Mining, pages 523–528, 2005.
 [12] T. Okada. A note on covariances for categorical data. In K.S. Leung, L.W. Chan, and H. Meng, editors, Intelligent Data Engineering and Automated Learning  IDEAL 2000, 2000.
 [13] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1532–1543, 2014.
 [14] Kira Radinsky, Eugene Agichtein, Evgeniy Gabrilovich, and Shaul Markovitch. A word at a time: Computing word relatedness using temporal semantic analysis. In Proceedings of the 20th International Conference on World Wide Web, pages 337–346, 2011.
 [15] Torsten Zesch, Christof Müller, and Iryna Gurevych. Using wiktionary for computing semantic relatedness. In Proceedings of the 23rd National Conference on Artificial Intelligence, pages 861–866, 2008.
 [16] Y. Zhang, H. Wu, and L. Cheng. Some new deformation formulas about variance and covariance. In Proceedings of International Conference on Modelling, Identification and Control, pages 987–992, June 2012.