Word2Vec is a special case of Kernel Correspondence Analysis and Kernels for Natural Language Processing
We show that correspondence analysis (CA) is equivalent to defining a Gini index with suitably scaled one-hot encodings. Using this relation, we introduce a nonlinear kernel extension to CA. This extended CA recovers known analyses for natural language via specialized kernels that use an appropriate contingency table. We propose a semi-supervised CA, which is a special case of the kernel extension to CA. Because CA requires excessive memory when applied to numerous categories, it has not been used for natural language processing. We address this problem by introducing delayed evaluation into randomized singular value decomposition. The memory-efficient CA is then applied to a word-vector representation task. We also propose a tail-cut kernel, which is an extension of the skip-gram approach within the kernel extension to CA. Our tail-cut kernel outperforms existing word-vector representation methods.
Principal component analysis (PCA) is an unsupervised feature-extraction method. When applied to the chi-squared distances of categorical data, PCA becomes correspondence analysis (CA). CA can extract numeric vector features from categorical data without supervised labeling. The simplest numerical representation for categorical data is a histogram-based representation such as tf-idf or a “bag of words”, and many applications use such simple representations. However, histogram-based representations cannot exploit information about correlations within the data. CA can represent both histograms and correlations in data.
The most popular problem domain involving categorical data is natural language processing (NLP). However, CA has not been applied to NLP because most NLP problems involve a large number of categories; for example, the entire Wikipedia text comprises more than 10,000 different words. Because CA requires excessive memory when applied to numerous categories, it has not been used for NLP problems involving more than 10,000 categories.
CA is implemented by singular value decomposition (SVD) of a contingency table. In many categorical problems, the contingency table is sparsely populated. Randomized SVD is an appropriate SVD method for sparse matrices. However, CA requires dense-matrix computation even for a sparse contingency table, which makes very large demands on memory resources. This research addresses this problem: we propose using a randomized SVD with delayed evaluation to avoid expanding the sparse matrix into a dense matrix. We refer to this process as the delayed sparse randomized SVD (DSSVD) algorithm. We show that CA with DSSVD can be applied to NLP problems.
Neural-network-based approaches are the most popular feature extractors used in NLP; word2vec is the best-known example. Such an approach usually involves many parameters, most of which have no explicit meaning. These parameters have to be tuned by grid search or manual tuning, which is difficult in the absence of explicit meanings for the parameters. This parameter problem also gives rise to domain problems: for example, if word2vec is tuned for application to restaurant reviews, the tuning may not be appropriate for movie reviews. Moreover, the weight values used in neural networks are usually initialized to random values, which means that the computed results differ from run to run. For example, word-vector representations produced by word2vec will always differ, even when the same parameter values are used, because of the random initial values. This adds to the difficulty of parameter tuning.
Since CA is the PCA of a contingency table, it always computes the same result. From this viewpoint, the CA approach is better than neural-network-based approaches. Although the latter have these issues, they can approximate any nonlinear function, which means that they can be used for a wide variety of problems. Because CA is a form of linear analysis, however, it is not directly applicable to nonlinear problems. To address this issue, this research introduces a nonlinear kernel extension to CA. We then show that this nonlinear CA approach is more accurate than recent neural-network-based approaches, focusing in particular on word-vector representation tasks. To distinguish linear from nonlinear CA, we refer to the linear CA as LCA.
CA is a statistical visualization method for picturing the associations between the levels of a two-way contingency table. As an illustration, consider the contingency table shown in Table 1. This is well known as “Fisher’s data”  and represents the eye and hair color of people in Caithness, Scotland. The CA of these data yields the graphical display presented in Figure 1, which shows the correspondence between eye and hair color.
Table 1 shows the joint population distribution of the categorical variable for eye color, taking values in {blue, light, medium, dark}, and the categorical variable for hair color, taking values in {fair, red, medium, dark, black}.
The visualization is based on “one-hot encodings” and the “indicator matrices” of categorical variables. For example, the one-hot encoding of the eye color “light” is the vector $(0, 1, 0, 0)^\top$, and the indicator matrix of a categorical variable stacks the one-hot encodings of its instances as rows.
In the following, $x$ and $y$ denote categorical variables representing the row and column of a contingency table, respectively. $\mathbf{x}_i$ and $\mathbf{y}_i$ denote the one-hot encodings of $x$ and $y$ for the $i$-th instance.
3 Covariance Based on Gini Index
Consider the following relation about the variance of continuous data.
The variance of continuous data can be expressed as the sum of the squared differences between individual instances:
$$\sigma^2 = \frac{1}{2n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}(x_i - x_j)^2$$
where $x_1, \dots, x_n$ are continuous sample data, $\bar{x} = \frac{1}{n}\sum_i x_i$ is the average value of the sample data, and $\sigma^2 = \frac{1}{n}\sum_i (x_i - \bar{x})^2$ is the variance of the continuous data.
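This identity can be checked numerically. A minimal sketch in Python (the function names are illustrative, not from the paper):

```python
# Numerical check: the population variance equals the half-averaged
# sum of squared pairwise differences.
def variance(xs):
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / n

def pairwise_variance(xs):
    n = len(xs)
    return sum((xi - xj) ** 2 for xi in xs for xj in xs) / (2 * n * n)

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
assert abs(variance(data) - pairwise_variance(data)) < 1e-12
print(variance(data))  # 4.0
```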
Using the same formulation about the sum of the differences between individual instances for categorical data gives a Gini index:
$$\mathrm{Gini}(x) = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} d(x_i, x_j), \qquad d(x_i, x_j) = \begin{cases}0 & (x_i = x_j)\\ 1 & (x_i \neq x_j)\end{cases}$$
Here, $x$ is a categorical variable that takes one of the values in $\{1, \dots, m\}$. Rewriting this formulation with the one-hot encoding $\mathbf{x}_i$ gives:
$$\mathrm{Gini}(x) = \frac{1}{2n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} \|\mathbf{x}_i - \mathbf{x}_j\|^2$$
This is more similar to the continuous case.
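The equivalence between the familiar frequency form of the Gini index, $1 - \sum_k p_k^2$, and the pairwise one-hot form can also be verified numerically (a sketch with illustrative names and data):

```python
# Check that 1 - sum_k p_k^2 equals (1/(2 n^2)) sum_ij ||x_i - x_j||^2
# for one-hot encodings x_i.
from collections import Counter

def gini_from_counts(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_from_one_hot(labels):
    cats = sorted(set(labels))
    onehot = {c: [1.0 if c == d else 0.0 for d in cats] for c in cats}
    n = len(labels)
    total = 0.0
    for a in labels:
        for b in labels:
            total += sum((p - q) ** 2 for p, q in zip(onehot[a], onehot[b]))
    return total / (2 * n * n)

labels = ["blue", "light", "light", "medium", "dark", "dark", "dark"]
assert abs(gini_from_counts(labels) - gini_from_one_hot(labels)) < 1e-12
```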
Using this one-hot encoding, we can also define the covariance of categorical data. Consider the categorical variables $x$ and $y$. If the given sample categorical data is $(x_1, y_1), \dots, (x_n, y_n)$,
we can define the covariance of $x$ and $y$:
$$\mathrm{Cov}(x, y) = \frac{1}{2n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} (\mathbf{x}_i - \mathbf{x}_j)^\top (\mathbf{y}_i - \mathbf{y}_j) \qquad (12)$$
Okada [12, 11] showed that this definition is invalid by considering the contingency table shown in Table 2. In that contingency table, $x$ and $y$ are highly correlated, and there is an instance that reduces the correlation between $x$ and $y$ but nevertheless increases the covariance in the formulation (12). To avoid such an invalid increase, Okada defined the covariance using a rotated one-hot encoding.
The covariance of the categorical variables $x$ and $y$ is the maximized value:
$$\mathrm{Cov}(x, y) = \max_{R}\ \frac{1}{2n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} (\mathbf{x}_i - \mathbf{x}_j)^\top R\,(\mathbf{y}_i - \mathbf{y}_j) \qquad (1)$$
where $R$ is a rotation matrix that maximizes the covariance. The vectors $\mathbf{x}_i$ and $\mathbf{y}_i$ are one-hot encodings of $x$ and $y$.
The maximization problem (1) is equivalent to
$$\max_{R}\ \mathrm{tr}\big(R\,V^\top\big), \qquad V = \frac{1}{n}N - \frac{1}{n^2}\,\mathbf{r}\,\mathbf{c}^\top \qquad (2)$$
$N$ is the contingency table, with entries giving the frequency with which the row categorical variable $x$ occurs together with the column categorical variable $y$. $\mathbf{r} = N\mathbf{1}$ denotes the vector of row marginals and $\mathbf{c} = N^\top\mathbf{1}$ is the vector of column marginals. $n = \mathbf{1}^\top N\,\mathbf{1}$ is the number of instances.
We can solve the maximization problem (2) using SVD.
$$R^\ast = U\,W^\top$$
is a local optimum of the maximization problem (2). Here,
$$V = U\,\mathrm{diag}(\boldsymbol{\sigma})\,W^\top$$
is an SVD of $V$.
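As a numerical sketch of this result (the 2×2 table below is illustrative, not from the paper), the rotation-maximized covariance can be computed directly from the SVD of $V$:

```python
import numpy as np

# Hypothetical 2x2 contingency table: x and y strongly agree.
N = np.array([[10.0, 1.0],
              [1.0, 10.0]])
n = N.sum()
r, c = N.sum(axis=1), N.sum(axis=0)     # marginal counts
V = N / n - np.outer(r, c) / n**2       # the matrix from problem (2)

U, s, Wt = np.linalg.svd(V)
R = U @ Wt                              # optimal rotation R* = U W^T
cov = np.trace(R @ V.T)                 # maximized trace = covariance
assert np.isclose(cov, s.sum())         # equals the sum of singular values
assert np.allclose(R.T @ R, np.eye(2))  # R is orthogonal
```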
Local optima of the maximization problem are given by differentiating the Lagrangian:
$$L = \mathrm{tr}\big(R\,V^\top\big) - \mathrm{tr}\big(\Lambda\,(R^\top R - I)\big)$$
where $\Lambda$ is a matrix of Lagrange multipliers. Differentiating this Lagrangian with respect to $R$ gives the stationary condition:
$$V = R\,(\Lambda + \Lambda^\top) \qquad (23)$$
This result shows that $R^\top V$ must be a symmetric matrix.
Consider the SVD:
$$V = U\,\mathrm{diag}(\boldsymbol{\sigma})\,W^\top$$
for the case $R = U\,W^\top$. Here, $R^\top V = W\,\mathrm{diag}(\boldsymbol{\sigma})\,W^\top$ is the symmetric matrix, and $R = U\,W^\top$ is the rotation matrix. $R = U\,W^\top$ satisfies the stationary condition (23) and the constraint of Lemma 2. We can therefore conclude that $R = U\,W^\top$ is a local optimum of the problem of Lemma 2.∎
When all the singular values of $V$ are positive, $R^\ast = U\,W^\top$ is the global optimum for the maximization problem (2).
Substituting $R$ into (2) gives:
$$\mathrm{tr}\big(R\,V^\top\big) = \mathrm{tr}\big(Q\,\mathrm{diag}(\boldsymbol{\sigma})\big), \qquad Q = U^\top R\,W$$
Note that $Q$ is also a rotation matrix. Consider
$$\mathrm{tr}\big(Q\,\mathrm{diag}(\boldsymbol{\sigma})\big) = \sum_k Q_{kk}\,\sigma_k$$
where $\boldsymbol{\sigma}$ is the vector of the singular values of $V$. Because $Q$ is a rotation matrix, $Q_{kk} \le 1$. Then,
$$\sum_k Q_{kk}\,\sigma_k \le \sum_k \sigma_k \qquad (25)$$
The case $Q = I$, that is $R = U\,W^\top$, gives the upper limit:
$$\mathrm{tr}\big(R\,V^\top\big) = \sum_k \sigma_k$$
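The bound can be checked numerically over random rotations (an illustrative sketch; the test matrix is arbitrary):

```python
import numpy as np

# For any rotation R, tr(R V^T) = tr(Q diag(sigma)) with Q = U^T R W,
# and this never exceeds sum(sigma), the value attained at R = U W^T.
rng = np.random.default_rng(0)
V = rng.standard_normal((4, 4))
U, s, Wt = np.linalg.svd(V)
best = s.sum()
assert np.isclose(np.trace((U @ Wt) @ V.T), best)  # optimum attained

for _ in range(200):
    Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))  # random orthogonal Q
    R = U @ Q @ Wt                                    # so that U^T R W = Q
    assert np.trace(R @ V.T) <= best + 1e-9           # never beats the bound
```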
In our experiments, we did not find a case for which $V$ had a large negative singular value. In the following, we assume that $R^\ast = U\,W^\top$ is the global optimum of the maximization problem (2).
If negative singular values appear, we can use the following theorem.
Consider the following optimization problem for a given matrix $A$:
$$\max_{R}\ \mathrm{tr}\big(R\,A^\top\big) \quad \text{subject to $R$ being a rotation matrix.}$$
The global optimum of this optimization problem is:
$$R^\ast = U\,\mathrm{diag}\big(1, \dots, 1, \det(U\,W^\top)\big)\,W^\top$$
$A = U\,\mathrm{diag}(\boldsymbol{\sigma})\,W^\top$ is an SVD of the matrix $A$, where $\boldsymbol{\sigma}$ is the vector of the singular values for the matrix $A$.
Consider the case for which some of the singular values are negative. In such a case, the upper limit (25) becomes:
$$\sum_k Q_{kk}\,\sigma_k \le \sum_k |\sigma_k|$$
The case $Q_{kk} = \mathrm{sign}(\sigma_k)$ gives the upper limit.
First, we introduce generalized singular value decomposition (GSVD).
Generalized singular value decomposition (GSVD) of a given matrix $A$ with diagonal weight matrices $\mathrm{diag}(\mathbf{w}_r)$ and $\mathrm{diag}(\mathbf{w}_c)$ is the decomposition:
$$A = X\,\mathrm{diag}(\boldsymbol{\sigma})\,Y^\top$$
$\mathrm{diag}(\mathbf{v})$ denotes the diagonal matrix whose diagonal entries are the components of the vector $\mathbf{v}$. The vectors $\mathbf{w}_r$ and $\mathbf{w}_c$ are given weight vectors. $X$ and $Y$ are given by ordinary SVD:
$$\mathrm{diag}(\mathbf{w}_r)^{1/2}\,A\,\mathrm{diag}(\mathbf{w}_c)^{1/2} = U\,\mathrm{diag}(\boldsymbol{\sigma})\,V^\top, \qquad X = \mathrm{diag}(\mathbf{w}_r)^{-1/2}\,U, \qquad Y = \mathrm{diag}(\mathbf{w}_c)^{-1/2}\,V$$
Note that this decomposition maintains the perpendicularity of the base vectors in the decomposed space with respect to the weight matrices:
$$X^\top\,\mathrm{diag}(\mathbf{w}_r)\,X = I, \qquad Y^\top\,\mathrm{diag}(\mathbf{w}_c)\,Y = I$$
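A minimal sketch of this GSVD in Python, assuming the reweighting construction above (the function and variable names are illustrative):

```python
import numpy as np

def gsvd(A, wr, wc):
    """GSVD: A = X diag(s) Y^T with X^T diag(wr) X = I and Y^T diag(wc) Y = I."""
    U, s, Vt = np.linalg.svd(np.diag(np.sqrt(wr)) @ A @ np.diag(np.sqrt(wc)),
                             full_matrices=False)
    X = np.diag(1.0 / np.sqrt(wr)) @ U
    Y = np.diag(1.0 / np.sqrt(wc)) @ Vt.T
    return X, s, Y

A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
wr = np.array([0.2, 0.3, 0.5])
wc = np.array([0.4, 0.6])
X, s, Y = gsvd(A, wr, wc)
assert np.allclose(X @ np.diag(s) @ Y.T, A)            # decomposition holds
assert np.allclose(X.T @ np.diag(wr) @ X, np.eye(2))   # weighted orthogonality
assert np.allclose(Y.T @ np.diag(wc) @ Y, np.eye(2))
```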
Using GSVD, we can define the well-known analysis for categorical data (http://forrest.psych.unc.edu/research/vista-frames/pdf/chap11.pdf).
The linear correspondence analysis (LCA) of a given contingency table $N$ is the GSVD:
$$P - \mathbf{r}\,\mathbf{c}^\top = X\,\mathrm{diag}(\boldsymbol{\sigma})\,Y^\top$$
with weight matrices $\mathrm{diag}(\mathbf{r})^{-1}$ and $\mathrm{diag}(\mathbf{c})^{-1}$. Here,
$$P = \frac{1}{n}N, \qquad \mathbf{r} = P\,\mathbf{1}, \qquad \mathbf{c} = P^\top\mathbf{1}$$
LCA is equivalent to the maximization problem:
$$\max_{R}\ \mathrm{tr}\Big(R\,\big[\mathrm{diag}(\mathbf{r})^{-1/2}\,(P - \mathbf{r}\,\mathbf{c}^\top)\,\mathrm{diag}(\mathbf{c})^{-1/2}\big]^\top\Big)$$
and has the solution:
$$R^\ast = U\,W^\top$$
Here, $U$ and $W$ are given by ordinary SVD:
$$\mathrm{diag}(\mathbf{r})^{-1/2}\,(P - \mathbf{r}\,\mathbf{c}^\top)\,\mathrm{diag}(\mathbf{c})^{-1/2} = U\,\mathrm{diag}(\boldsymbol{\sigma})\,W^\top$$
LCA is the GSVD of $P - \mathbf{r}\,\mathbf{c}^\top$. This GSVD is the SVD of $\mathrm{diag}(\mathbf{r})^{-1/2}\,(P - \mathbf{r}\,\mathbf{c}^\top)\,\mathrm{diag}(\mathbf{c})^{-1/2}$. Applying Theorem 1 to this SVD gives the required result. ∎
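Concretely, LCA reduces to an ordinary SVD of the standardized residuals. A compact sketch on Fisher's eye-hair table (the coordinate scaling follows the usual CA convention; this is not the paper's exact code):

```python
import numpy as np

# Fisher's eye (rows: blue, light, medium, dark) x hair
# (columns: fair, red, medium, dark, black) data for Caithness (Table 1).
N = np.array([[326.0, 38.0, 241.0, 110.0, 3.0],
              [688.0, 116.0, 584.0, 188.0, 4.0],
              [343.0, 84.0, 909.0, 412.0, 26.0],
              [98.0, 48.0, 403.0, 681.0, 85.0]])
P = N / N.sum()
r, c = P.sum(axis=1), P.sum(axis=0)
S = np.diag(r ** -0.5) @ (P - np.outer(r, c)) @ np.diag(c ** -0.5)
U, s, Vt = np.linalg.svd(S, full_matrices=False)
row_coords = np.diag(r ** -0.5) @ U[:, :2] * s[:2]     # 2-D eye-colour scores
col_coords = np.diag(c ** -0.5) @ Vt.T[:, :2] * s[:2]  # 2-D hair-colour scores
```

The two leading coordinate pairs are what Figure 1 plots.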
LCA is equivalent to the maximization problem:
$$\max_{R}\ \frac{1}{2n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} (\tilde{\mathbf{x}}_i - \tilde{\mathbf{x}}_j)^\top R\,(\tilde{\mathbf{y}}_i - \tilde{\mathbf{y}}_j) \qquad (41)$$
where
$$\tilde{\mathbf{x}}_i = \mathrm{diag}(\mathbf{r})^{-1/2}\,\mathbf{x}_i, \qquad \tilde{\mathbf{y}}_i = \mathrm{diag}(\mathbf{c})^{-1/2}\,\mathbf{y}_i$$
are scaled one-hot encodings.
This maximization problem defines the rotated Gini index using scaled one-hot encoding. We can therefore say that LCA is equivalent to defining a Gini index using scaled and rotated one-hot encoding.
5 Nonlinear extension
We now consider extending the optimization problem (41) by using one-hot encodings on a nonlinearly mapped space.
A nonlinear extension to CA can be expressed as:
$$\max_{R}\ \frac{1}{2n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} \big(\varphi(\mathbf{x}_i) \ominus \varphi(\mathbf{x}_j)\big)^\top R\,\big(\psi(\mathbf{y}_i) \ominus \psi(\mathbf{y}_j)\big) \qquad (49)$$
where $\varphi$ and $\psi$ are nonlinear mappings and $\ominus$ denotes subtraction operators on the nonlinearly mapped spaces. The summation operator $\bigoplus$ performs cumulative addition on the nonlinearly mapped spaces:
$$\bigoplus_{i} \mathbf{a}_i = \mathbf{a}_1 \oplus \mathbf{a}_2 \oplus \cdots \oplus \mathbf{a}_n$$
where $\oplus$ is an addition operator on the nonlinearly mapped spaces. We refer to the formulation (49) as kernel correspondence analysis (KCA).
To be able to use the kernel trick, we assume the following rules about the subtraction and addition operations.
Because $\ominus$, $\oplus$, $\varphi$, and $\psi$ are nonlinear operators, these relations are not valid in general. However, moving the left-hand-side operators to the right-hand side in these relations moves the cumulative summation operator outside the expression. Moving it to the extreme right means that it needs to be evaluated only once.
When the rules in Assumption 1 are valid, expanding (49) using these expansion rules gives the following theorem.
If the rules in Assumption 1 are valid, we can introduce kernel matrices $K_x$ and $K_y$. Using the kernel matrices, the maximization problem (49) becomes a trace maximization over these kernel matrices.
Note that this formulation requires the cumulative summation to be evaluated only once.
Specifying the operators and kernel matrices enables access to various known analyses for categorical data and NLP. Table 3 gives the relation between the specifications and known methods.
Table 3 includes, among other specializations, the Gini index [12, 11] and SGNS [10, 8].
5.1 Semi-supervised CA
Consider the case in which we wish to manually tune the distances between one-hot encodings using tuning ratio tables $T_r$ and $T_c$. For this case, we can define the following problem.
Semi-supervised correspondence analysis (SCA) can be expressed as:
where $\circ$ is the Hadamard product.
This problem is defined by combining the tuning tables with (57). The tuning tables $T_r$ and $T_c$ can be regarded as supervised training data. However, like PCA, this method can also be based on unsupervised training data. We refer to this process as semi-supervised correspondence analysis (SCA).
6 Delayed Sparse Matrix
CA is the ordinary SVD:
$$\mathrm{diag}(\mathbf{r})^{-1/2}\,(P - \mathbf{r}\,\mathbf{c}^\top)\,\mathrm{diag}(\mathbf{c})^{-1/2} = U\,\mathrm{diag}(\boldsymbol{\sigma})\,V^\top \qquad (59)$$
Because $\mathbf{r}\,\mathbf{c}^\top$ is a dense matrix, $P - \mathbf{r}\,\mathbf{c}^\top$ is also a dense matrix, even when $P$ is a sparse matrix. This is the reason why the CA approach makes such a demand on memory resources.
However, computing the dense matrix can be avoided by delayed evaluation. Consider multiplying by an arbitrary matrix $B$ on both the left-hand and right-hand sides of (59). The left-hand-side multiplication can be expressed as:
$$\mathrm{diag}(\mathbf{r})^{-1/2}\,(P - \mathbf{r}\,\mathbf{c}^\top)\,\mathrm{diag}(\mathbf{c})^{-1/2}\,B = \mathrm{diag}(\mathbf{r})^{-1/2}\,\big(P\,B' - \mathbf{r}\,(\mathbf{c}^\top B')\big), \qquad B' = \mathrm{diag}(\mathbf{c})^{-1/2}\,B$$
where $P\,B'$ is a sparse-matrix product and $\mathbf{r}\,(\mathbf{c}^\top B')$ is a rank-one correction. The right-hand-side multiplication can be expressed similarly.
Randomized SVD requires only multiplication operations on the matrix to be decomposed, as in the power method. We can therefore execute the randomized SVD using these delayed products without involving the expanded matrix (59). Because this scheme avoids computing the dense matrix, there is a reduction in both computing time and memory requirements. We refer to this scheme as the delayed sparse randomized SVD (DSSVD) algorithm. A Python implementation of this CA is available at https://github.com/niitsuma/delayedsparse/blob/master/delayedsparse/ca.py
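The delayed evaluation can be sketched with SciPy's `LinearOperator`, which supplies exactly the matrix-vector products an iterative SVD needs (this is an independent illustration, not the linked implementation; the names are ours):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator, svds

def ca_operator(N):
    """Lazy operator for diag(r)^-1/2 (P - r c^T) diag(c)^-1/2, P = N / n."""
    n = N.sum()
    P = N / n                                  # stays sparse
    r = np.asarray(P.sum(axis=1)).ravel()      # row marginals
    c = np.asarray(P.sum(axis=0)).ravel()      # column marginals
    ir, ic = r ** -0.5, c ** -0.5

    def matvec(x):
        y = ic * np.asarray(x).ravel()
        return ir * (P @ y - r * (c @ y))      # sparse product minus rank-1 term

    def rmatvec(x):
        y = ir * np.asarray(x).ravel()
        return ic * (P.T @ y - c * (r @ y))

    return LinearOperator(P.shape, matvec=matvec, rmatvec=rmatvec, dtype=float)

N = sp.random(200, 150, density=0.3, random_state=0, format="csr") * 100
U, s, Vt = svds(ca_operator(N), k=5)           # dense matrix never formed

# Sanity check against the explicitly formed dense matrix (feasible at this size).
Pd = (N / N.sum()).toarray()
rd, cd = Pd.sum(axis=1), Pd.sum(axis=0)
D = np.diag(rd ** -0.5) @ (Pd - np.outer(rd, cd)) @ np.diag(cd ** -0.5)
assert np.allclose(np.sort(s), np.sort(np.linalg.svd(D)[1])[-5:])
```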
7 Word Representation
This research discusses the application of CA to word-vector representation tasks. Consider the following contingency table for some given training-text data:
$$N_{wc} = \#(w, c)$$
where $\#(w, c)$ is the number of times that the word $w$ appears in the context $c$.
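For illustration, such a word-context table can be accumulated with a simple windowed scan over the tokens (a toy sketch; the function name is ours):

```python
# Count how often word c appears within a +-window of word w,
# producing the entries of the word-context contingency table.
from collections import defaultdict

def cooccurrence(tokens, window=2):
    counts = defaultdict(int)
    for i, w in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[(w, tokens[j])] += 1
    return counts

tokens = "the quick brown fox jumps over the lazy dog".split()
N = cooccurrence(tokens, window=2)
print(N[("quick", "brown")])  # 1
```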
Based on this table, Mikolov et al. introduced vector representations of words, referred to as word2vec, which is computed by using the skip-gram model. However, the skip-gram model requires random sampling, which gives different results for each computation. This research uses the following fixed representation instead.
This notation represents the number of subsentences in which word $w_2$ appears $d$ positions after word $w_1$, with arbitrary words in between. For example, consider the sentence:
“this is this is this is this is this.”
In this sentence, the number of times “is” occurs three words after “this” is 3. “is” also appears at other offsets.
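The counting in this example can be reproduced directly (an illustrative helper, not the paper's code):

```python
# Count occurrences of word w2 exactly d positions after w1 -- the fixed
# quantity used here in place of the skip-gram's random sampling.
def offset_count(tokens, w1, w2, d):
    return sum(1 for i in range(len(tokens) - d)
               if tokens[i] == w1 and tokens[i + d] == w2)

tokens = "this is this is this is this is this".split()
print(offset_count(tokens, "this", "is", 3))  # 3
```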
Given an appropriate window size $W$, this representation can capture a relation similar to the skip-gram. However, it cannot ignore noise relations when the offset $d$ is large. To ignore such noise, we introduce the following weighted sum:
Here, $\#(w)$ is the number of times that the word $w$ appears in all the training text, and $n$ is the total number of words in the given training text. The weighted sum can be introduced using a kernel extension similar to SCA. We refer to this extension as the “tail-cut kernel”. We can compute the LCA of the resulting co-occurrence tables and the KCA of the tail-cut kernel.
8 Experiments
This section compares various word-vector representation methods using the text8 corpus (http://mattmahoney.net/dc/text8.zip).
8.1 Delayed Sparse Randomized SVD
Figures 2 and 3 show the computing times and the required memory for LCA, respectively. The horizontal axis is the size of the training data. The initial section of the text8 corpus was used as the training data for the LCA. The experiments were carried out in a Gentoo Linux environment using an Intel i7-3770K 3.50 GHz processor. Note that the vertical axes have logarithmic scales.
The LCA was computed using the SVD in the NumPy library, the randomized SVD in the scikit-learn library, and DSSVD. DSSVD was 100 times faster than NumPy's SVD and 10 times faster than the randomized SVD. The memory required for DSSVD was 10% of that required for NumPy's SVD and 20% of that required for the randomized SVD. When using the whole text8 corpus, the differences became even more pronounced. Because of the excessive memory requirements, using CA for NLP is impossible without DSSVD. The Python code for this experiment is available at https://github.com/niitsuma/delayedsparse/blob/master/demo-ca.sh
8.2 Word Representation
We evaluated the English word-vector representation by focusing on the similarity between words using six test datasets.
Table 4 shows a comparison between methods for the whole text8 corpus. Evaluation with these six test datasets provides a ranking of similarity among words; the reported values are Spearman’s rank correlation coefficients for these rankings. For comparison, we show the results for skip-gram with negative sampling (SGNS), continuous bag-of-words (CBOW), GloVe, and fastText.
In most cases, the tail-cut kernel provided the best or nearly best results. The plain LCA also provided some of the best results. However, the LCA results were drastically affected by the window-size parameter and showed instability, whereas the tail-cut kernel provided stable results. For window sizes larger than 30, its results change only insignificantly. This implies that the tail-cut kernel is relatively independent of the window-size parameter, thereby possibly decreasing the number of parameters by one.
SCA based on LCA was also evaluated. The SCA used the MEN data and the M.Turk data as supervised training data. SCA outperformed LCA on much of the test data. These results demonstrate that SCA can work effectively: although the word-vector representation task is unsupervised learning, SCA can use supervised data within it. Part of the code for these experiments is available at https://github.com/niitsuma/wordca
9 Conclusion
We have proposed a memory-efficient CA method based on randomized SVD. The algorithm also drastically reduces the computation time. This efficient CA can be applied to the word-vector representation task, and the experimental results show that CA can outperform existing methods on this task. We have further proposed the tail-cut kernel, an extension of the skip-gram approach within KCA. The tail-cut kernel outperformed existing word-vector representation methods.
-  Eneko Agirre, Enrique Alfonseca, Keith Hall, Jana Kravalova, Marius Paşca, and Aitor Soroa. A study on similarity and relatedness using distributional and wordnet-based approaches. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 19–27. Association for Computational Linguistics, 2009.
-  Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606, 2016.
-  Elia Bruni, Gemma Boleda, Marco Baroni, and Nam Khanh Tran. Distributional semantics in technicolor. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, pages 136–145. Association for Computational Linguistics, July 2012.
-  R. A. Fisher. The precision of discriminant functions. Annals of Eugenics, 10:422–429, 1940.
-  C.W. Gini. Variability and mutability: contribution to the study of statistical distributions and relations. Studi Economico-Giuridici della R. Universita de Cagliari, 1912. Reviewed in: R.J. Light and B.H. Margolin. An analysis of variance for categorical data. J. American Statistical Association, 66:534–544, 1971.
-  N. Halko, P. G. Martinsson, and J. A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217–288, May 2011.
-  Felix Hill, Roi Reichart, and Anna Korhonen. Simlex-999: Evaluating semantic models with genuine similarity estimation. Comput. Linguist., 41(4):665–695, December 2015.
-  Omer Levy and Yoav Goldberg. Neural word embedding as implicit matrix factorization. In Proceedings of the 27th International Conference on Neural Information Processing Systems, pages 2177–2185, 2014.
-  Thang Luong, Richard Socher, and Christopher Manning. Better word representations with recursive neural networks for morphology. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pages 104–113, August 2013.
-  Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems, pages 3111–3119. 2013.
-  Hirotaka Niitsuma and Takashi Okada. Covariance and PCA for categorical variables. In Proceedings of the 9th Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 523–528, 2005.
-  T. Okada. A note on covariances for categorical data. In K.S. Leung, L.W. Chan, and H. Meng, editors, Intelligent Data Engineering and Automated Learning - IDEAL 2000, 2000.
-  Jeffrey Pennington, Richard Socher, and Christopher D. Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1532–1543, 2014.
-  Kira Radinsky, Eugene Agichtein, Evgeniy Gabrilovich, and Shaul Markovitch. A word at a time: Computing word relatedness using temporal semantic analysis. In Proceedings of the 20th International Conference on World Wide Web, pages 337–346, 2011.
-  Torsten Zesch, Christof Müller, and Iryna Gurevych. Using wiktionary for computing semantic relatedness. In Proceedings of the 23rd National Conference on Artificial Intelligence, pages 861–866, 2008.
-  Y. Zhang, H. Wu, and L. Cheng. Some new deformation formulas about variance and covariance. In Proceedings of International Conference on Modelling, Identification and Control, pages 987–992, June 2012.