Efficient Algorithms for t-distributed Stochastic Neighborhood Embedding
t-distributed Stochastic Neighborhood Embedding (t-SNE) is a method for dimensionality reduction and visualization that has become widely popular in recent years. Efficient implementations of t-SNE are available, but they scale poorly to datasets with hundreds of thousands to millions of high-dimensional data points. We present Fast Fourier Transform-accelerated Interpolation-based t-SNE (FIt-SNE), which dramatically accelerates the computation of t-SNE. The most time-consuming step of t-SNE is a convolution that we accelerate by interpolating onto an equispaced grid and subsequently using the fast Fourier transform to perform the convolution. We also optimize the computation of input similarities in high dimensions using multi-threaded approximate nearest neighbors. We further present a modification to t-SNE called “late exaggeration,” which allows for easier identification of clusters in t-SNE embeddings. Finally, for datasets that cannot be loaded into memory, we present out-of-core randomized principal component analysis (oocPCA), so that the top principal components of a dataset can be computed without ever fully loading the matrix, hence allowing t-SNE of large datasets to be computed on resource-limited machines.
In many fields, the visualization of large, high-dimensional datasets is essential. t-distributed Stochastic Neighborhood Embedding (t-SNE), introduced by van der Maaten and Hinton (2008), has become enormously popular in many fields, such as in the analysis of single-cell RNA-sequencing (scRNA-seq) data, where it is used to discover subpopulations among large numbers of cells in an unsupervised fashion. Unfortunately, even efficient methods for approximate t-SNE require many hours to embed datasets on the order of hundreds of thousands to millions of points, as often encountered in scRNA-seq and elsewhere. In this paper, we present Fast Fourier Transform-accelerated Interpolation-based t-SNE (FIt-SNE) for fast and accurate computation of t-SNE, essentially making it feasible to use t-SNE on datasets of this scale. Furthermore, we build on recent theoretical advances to more clearly separate clusters in t-SNE embeddings. Finally, we present an out-of-core implementation of randomized principal component analysis (oocPCA) so that users can embed datasets that are too large to load into memory.
1.1. t-distributed Stochastic Neighborhood Embedding
Given a $d$-dimensional dataset $X = (x_1, \dots, x_N) \subset \mathbb{R}^d$, t-SNE aims to compute the low-dimensional embedding $Y = (y_1, \dots, y_N) \subset \mathbb{R}^s$, where $s \ll d$, such that if two points $x_i$ and $x_j$ are close in the input space, then their corresponding points $y_i$ and $y_j$ are also close. Affinities $p_{ij}$ between points $x_i$ and $x_j$ in the input space are defined as
$$p_{ij} = \frac{p_{i|j} + p_{j|i}}{2N}, \qquad p_{j|i} = \frac{\exp\left(-\|x_i - x_j\|^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\|x_i - x_k\|^2 / 2\sigma_i^2\right)}, \qquad p_{i|i} = 0.$$
$\sigma_i$ is the bandwidth of the Gaussian kernel centered at $x_i$, and it is chosen by binary search such that the perplexity of $P_i$ matches a given value, where $P_i$ is the conditional distribution of all the other points given $x_i$. Similarly, the affinity $q_{ij}$ between points $y_i$ and $y_j$ in the embedding space is defined using the Cauchy kernel
$$q_{ij} = \frac{\left(1 + \|y_i - y_j\|^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \|y_k - y_l\|^2\right)^{-1}}.$$
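For concreteness, the bandwidth selection just described can be sketched as follows. This is a minimal, per-point illustration under our own naming (`conditional_probs`, `find_sigma`); production implementations restrict the search to a fixed neighbor set and vectorize across points.

```python
import numpy as np

def conditional_probs(dists_sq, sigma):
    """P_{j|i} for one row of squared distances (self-distance excluded)."""
    p = np.exp(-dists_sq / (2 * sigma**2))
    return p / p.sum()

def perplexity(p):
    """Perplexity = 2^{H(P)} with Shannon entropy H in bits."""
    h = -np.sum(p * np.log2(p + 1e-12))
    return 2.0**h

def find_sigma(dists_sq, target_perplexity, tol=1e-5, max_iter=100):
    """Binary search for the bandwidth matching the target perplexity."""
    lo, hi = 1e-10, 1e10
    sigma = 0.5 * (lo + hi)
    for _ in range(max_iter):
        sigma = 0.5 * (lo + hi)
        perp = perplexity(conditional_probs(dists_sq, sigma))
        if abs(perp - target_perplexity) < tol:
            break
        if perp > target_perplexity:
            hi = sigma   # distribution too flat: shrink bandwidth
        else:
            lo = sigma   # distribution too peaked: grow bandwidth
    return sigma
```

Since perplexity increases monotonically with the bandwidth, bisection converges quickly.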
t-SNE finds the points $y_1, \dots, y_N$ that minimize the Kullback–Leibler divergence between the joint distribution $P$ of points in the input space and the joint distribution $Q$ of the points in the embedding space,
$$C(Y) = \mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}.$$
Starting with a random initialization, the cost function is minimized by gradient descent, with the gradient (as derived by van der Maaten and Hinton (2008))
$$\frac{\partial C}{\partial y_i} = 4 \sum_{j \neq i} \left(p_{ij} - q_{ij}\right) q_{ij} Z \left(y_i - y_j\right),$$
where $Z$ is a global normalization constant,
$$Z = \sum_{k \neq l} \left(1 + \|y_k - y_l\|^2\right)^{-1}.$$
We split the gradient into two parts,
$$\frac{\partial C}{\partial y_i} = 4 \Big( \underbrace{\sum_{j \neq i} p_{ij} q_{ij} Z (y_i - y_j)}_{F_{attr,i}} - \underbrace{\sum_{j \neq i} q_{ij}^2 Z (y_i - y_j)}_{F_{rep,i}} \Big),$$
where the first sum, $F_{attr,i}$, corresponds to an attractive force between points and the second sum, $F_{rep,i}$, corresponds to a repulsive force.
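To make the split concrete, here is a direct (and deliberately exact, $\mathcal{O}(N^2)$) transcription of the gradient; the function name and array layout are our own, and this naive version serves only as a correctness reference.

```python
import numpy as np

def tsne_gradient(Y, P):
    """Exact O(N^2) t-SNE gradient: 4 * (F_attr - F_rep).

    Y : (N, s) current embedding; P : (N, N) symmetric input affinities
    (zero diagonal, summing to 1 over all off-diagonal pairs).
    """
    diff = Y[:, None, :] - Y[None, :, :]           # y_i - y_j
    w = 1.0 / (1.0 + np.sum(diff**2, axis=-1))     # unnormalized Cauchy kernel
    np.fill_diagonal(w, 0.0)
    Z = w.sum()                                    # global normalization
    Q = w / Z                                      # q_ij
    # note q_ij * Z = w_ij, so the attractive term is p_ij * w_ij * (y_i - y_j)
    attr = np.einsum('ij,ij,ijk->ik', P, w, diff)
    # and q_ij^2 * Z = q_ij * w_ij for the repulsive term
    rep = np.einsum('ij,ij,ijk->ik', Q, w, diff)
    return 4.0 * (attr - rep)
```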
The computation of the gradient at each step is an $N$-body simulation, where the position of each point is determined by the forces exerted on it by all other points. Exact computation of $N$-body simulations scales as $\mathcal{O}(N^2)$, making exact t-SNE computationally prohibitive for datasets with tens of thousands of points. Accordingly, van der Maaten (2014)’s popular implementation of t-SNE produces an approximate solution, and can be used on larger datasets. In that implementation, the input similarities $p_{ij}$ are approximated using only a fixed number of nearest neighbors of each point, as computed using vantage-point trees (Yianilos (1993)). Since the input similarities do not change, they can be precomputed, and hence do not dominate the computational time. On the other hand, the repulsive forces are approximated at each iteration using the Barnes–Hut algorithm (Barnes and Hut (1986)), a tree-based algorithm which scales as $\mathcal{O}(N \log N)$. Despite these accelerations, it can still take many hours to run t-SNE on large scRNA-seq datasets. Furthermore, given that t-SNE is often run many times with different initializations to find the best embedding, faster algorithms are needed. In this work, we present an approximate nearest neighbor based implementation for computing $F_{attr}$ and an interpolation-based, fast Fourier transform accelerated algorithm for computing $F_{rep}$, both of which are significantly faster than current methods.
1.2. Early exaggeration
van der Maaten (2014) and van der Maaten and Hinton (2008) note that as the number of points increases, the convergence rate slows down. To circumvent this problem, implementations of t-SNE multiply the $p_{ij}$ term by a constant $\alpha > 1$ during the first several hundred iterations of gradient descent:
$$\frac{\partial C}{\partial y_i} = 4 \Big( \alpha \sum_{j \neq i} p_{ij} q_{ij} Z (y_i - y_j) - \sum_{j \neq i} q_{ij}^2 Z (y_i - y_j) \Big).$$
This “early exaggeration” forces the points into tight clusters which can move more easily, and are hence less likely to get trapped in local minima. Linderman and Steinerberger (2017) showed that this early exaggeration phase is essential for convergence of the algorithm and that when the exaggeration coefficient $\alpha$ is set optimally, t-SNE is guaranteed to recover well-separated clusters. In Section §4 we show that late exaggeration (i.e. setting $\alpha > 1$ during the last several hundred iterations) is also useful, and can result in improved separation of clusters.
The organization of this paper is as follows: we first present and benchmark a fast Fourier transform accelerated, interpolation-based method for optimizing the computation of $F_{rep}$ in Section §2. Section §3 describes methods for accelerating the computation of the input similarities required for $F_{attr}$. Section §4 describes “late exaggeration” for improving the separation of clusters in t-SNE embeddings. Section §5 describes t-SNE Heatmaps, an application of 1-dimensional t-SNE to the visualization of single-cell RNA-sequencing data. Finally, Section §6 presents an implementation of out-of-core PCA for the analysis of datasets too large to fit in memory.
2. The repulsive forces
Suppose $y_1, \dots, y_N \in \mathbb{R}^s$ is an $s$-dimensional embedding of a collection of $d$-dimensional vectors $x_1, \dots, x_N \in \mathbb{R}^d$. At each step of gradient descent, the repulsive forces are given by
$$F_{rep,i} = \sum_{j \neq i} q_{ij}^2 Z (y_i - y_j) = \frac{1}{Z} \sum_{j \neq i} \frac{y_i - y_j}{\left(1 + \|y_i - y_j\|^2\right)^2},$$
where $Z = \sum_{k \neq l} (1 + \|y_k - y_l\|^2)^{-1}$. Evidently, the repulsive force between the vectors consists of $N(N-1)$ pairwise interactions, and were it computed directly, would require CPU-time scaling as $\mathcal{O}(N^2)$. Even for datasets consisting of a few thousand points, this cost becomes prohibitively expensive. Our approach enables the accurate computation of these pairwise interactions in $\mathcal{O}(N)$ time. Since the majority of applications of t-SNE are for two-dimensional embeddings (and in §5 we present an application of one-dimensional embeddings), in the following we focus our attention on the cases where $s = 1$ or $s = 2$. However, we note that our algorithm extends naturally to arbitrary dimensions. In such cases, though the constants in the computational cost will vary, our approach will still yield an algorithm with a CPU-time which scales as $\mathcal{O}(N)$.
We begin by observing that the repulsive forces defined in \cref{eq:frep} can be expressed as sums of the form
$$\phi(y_i) = \sum_{j=1}^{N} K(y_i, y_j)\, q_j,$$
where the kernel $K(y, z)$ is either
$$K_1(y, z) = \frac{1}{1 + \|y - z\|^2} \qquad \text{or} \qquad K_2(y, z) = \frac{1}{\left(1 + \|y - z\|^2\right)^2}$$
for $y, z \in \mathbb{R}^s$ (see Appendix). Note that both of the kernels $K_1$ and $K_2$ are smooth functions of $y, z$ for all $y, z \in \mathbb{R}^s$. The key idea of our approach is to use polynomial interpolants of the kernel $K$ in order to accelerate the evaluation of the $N$-body interactions defined in \cref{eq:nbody2}.
2.1. Mathematical Preliminaries
First, we demonstrate with a simple example how polynomial interpolation can be used to accelerate the computation of $N$-body interactions with a smooth kernel. Suppose that $y_1, \dots, y_N \in [a, b]$ and $x_1, \dots, x_M \in [c, d]$. Let $I_1$ and $I_2$ denote the intervals $[a, b]$ and $[c, d]$, respectively. Note that no assumptions are made regarding the relative locations of $I_1$ and $I_2$; in particular, the case $I_1 = I_2$ is also permitted.
Now consider the sums
$$\phi(x_i) = \sum_{j=1}^{N} K(x_i, y_j)\, q_j, \qquad i = 1, \dots, M.$$
Let $p$ be a positive integer. Suppose that $\tilde{y}_1, \dots, \tilde{y}_p$ are a collection of points on the interval $I_1$ and that $\tilde{x}_1, \dots, \tilde{x}_p$ are a collection of points on the interval $I_2$. Let $\tilde{K}_p(x, y)$ denote a bivariate polynomial interpolant of the kernel $K(x, y)$ satisfying
$$\tilde{K}_p(\tilde{x}_k, \tilde{y}_l) = K(\tilde{x}_k, \tilde{y}_l), \qquad k, l = 1, \dots, p.$$
A simple calculation shows that $\tilde{K}_p$ is given by
$$\tilde{K}_p(x, y) = \sum_{k=1}^{p} \sum_{l=1}^{p} K(\tilde{x}_k, \tilde{y}_l)\, L_k(x)\, L_l(y),$$
where $L_k$ and $L_l$ are the Lagrange polynomials
$$L_k(x) = \prod_{\substack{m=1 \\ m \neq k}}^{p} \frac{x - \tilde{x}_m}{\tilde{x}_k - \tilde{x}_m}, \qquad L_l(y) = \prod_{\substack{m=1 \\ m \neq l}}^{p} \frac{y - \tilde{y}_m}{\tilde{y}_l - \tilde{y}_m}.$$
In the following we will refer to the points $\tilde{x}_1, \dots, \tilde{x}_p$ and $\tilde{y}_1, \dots, \tilde{y}_p$ as interpolation points.
Let $\tilde{\phi}$ denote the approximation to $\phi$ obtained by replacing the kernel $K$ in \cref{eq:phi1} by its polynomial interpolant $\tilde{K}_p$, i.e.
$$\tilde{\phi}(x_i) = \sum_{j=1}^{N} \tilde{K}_p(x_i, y_j)\, q_j$$
for $i = 1, \dots, M$. Clearly the error in approximating $\phi$ via $\tilde{\phi}$ is bounded (up to a constant) by the error in approximating $K$ via $\tilde{K}_p$. In particular, if the polynomial interpolant satisfies the inequality
$$\left| K(x, y) - \tilde{K}_p(x, y) \right| < \varepsilon \qquad \text{for all } x \in I_2,\ y \in I_1,$$
then the error is given by
$$\left| \phi(x_i) - \tilde{\phi}(x_i) \right| \leq \varepsilon \sum_{j=1}^{N} |q_j|.$$
A direct computation of $\phi(x_i)$, $i = 1, \dots, M$, requires $\mathcal{O}(NM)$ operations. On the other hand, the values $\tilde{\phi}(x_i)$, $i = 1, \dots, M$, can be computed in $\mathcal{O}(p(N + M) + p^2)$ operations as follows. Using \cref{eq:kerinterp}, $\tilde{\phi}$ can be rewritten as
$$\tilde{\phi}(x_i) = \sum_{k=1}^{p} L_k(x_i) \sum_{l=1}^{p} K(\tilde{x}_k, \tilde{y}_l) \sum_{j=1}^{N} L_l(y_j)\, q_j$$
for $i = 1, \dots, M$. The values $\tilde{\phi}(x_i)$, $i = 1, \dots, M$, are computed in three steps.
Step 1: Compute the coefficients $w_l$ defined by the formula
$$w_l = \sum_{j=1}^{N} L_l(y_j)\, q_j$$
for each $l = 1, \dots, p$. This step requires $\mathcal{O}(Np)$ operations.
Step 2: Compute the values $v_k$ at the interpolation nodes defined by the formula
$$v_k = \sum_{l=1}^{p} K(\tilde{x}_k, \tilde{y}_l)\, w_l$$
for all $k = 1, \dots, p$. This step requires $\mathcal{O}(p^2)$ operations.
Step 3: Evaluate the potential $\tilde{\phi}$ using the formula
$$\tilde{\phi}(x_i) = \sum_{k=1}^{p} v_k\, L_k(x_i)$$
for all $i = 1, \dots, M$. This step requires $\mathcal{O}(Mp)$ operations.
See \cref{fig:fftcodes} for an illustrative figure of the above procedure.
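The three steps above can be sketched for a single pair of intervals as follows (a toy illustration with our own naming; the interpolation nodes here are simply equispaced over the ranges of the sources and targets):

```python
import numpy as np

def lagrange_basis(nodes, x):
    """L_k(x) for each node k: returns an array of shape (len(nodes), len(x))."""
    L = np.ones((len(nodes), len(x)))
    for k, xk in enumerate(nodes):
        for m, xm in enumerate(nodes):
            if m != k:
                L[k] *= (x - xm) / (xk - xm)
    return L

def interp_nbody(targets, sources, q, p, kernel):
    """Approximate phi(x_i) = sum_j K(x_i, y_j) q_j via p-point interpolation."""
    ty = np.linspace(sources.min(), sources.max(), p)  # nodes on I_1
    tx = np.linspace(targets.min(), targets.max(), p)  # nodes on I_2
    w = lagrange_basis(ty, sources) @ q                # Step 1: O(Np)
    v = kernel(tx[:, None], ty[None, :]) @ w           # Step 2: O(p^2)
    return lagrange_basis(tx, targets).T @ v           # Step 3: O(Mp)
```

For well-separated intervals and the smooth Cauchy kernel, even a modest number of nodes yields high accuracy.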
2.2. The fast algorithm
In this section, we present the main algorithm for the rapid evaluation of the repulsive forces \cref{eq:nbody2}. The central strategy is to use piecewise polynomial interpolants of the kernel $K$ with equispaced points, together with the procedure described in Section §2.1.
Specifically, suppose that the points $y_1, \dots, y_N$ are all contained in the interval $[y_{\min}, y_{\max}]$. We subdivide the interval $[y_{\min}, y_{\max}]$ into $N_{int}$ intervals $I_1, \dots, I_{N_{int}}$ of equal length $h = (y_{\max} - y_{\min})/N_{int}$. Let $\tilde{y}_{1,j}, \dots, \tilde{y}_{p,j}$ denote $p$ equispaced nodes on the interval $I_j$ given by
$$\tilde{y}_{k,j} = y_{\min} + (j - 1)\, h + \frac{2k - 1}{2p}\, h,$$
where $k = 1, \dots, p$, and $j = 1, \dots, N_{int}$.
The nodes $\tilde{y}_{k,j}$, $k = 1, \dots, p$, $j = 1, \dots, N_{int}$, defined in \cref{eq:equinodes}, are also equispaced on the whole interval $[y_{\min}, y_{\max}]$.
The interaction between any two intervals $I_\ell$ and $I_m$, i.e.
$$\tilde{\phi}(x_i) = \sum_{y_j \in I_m} \tilde{K}_p(x_i, y_j)\, q_j, \qquad x_i \in I_\ell,$$
can be accelerated via the algorithm discussed in \cref{subsec:repelprelim}. This procedure amounts to using a piecewise polynomial interpolant of the kernel $K$, as opposed to using an interpolant on the whole interval. We summarize the procedure below.
Step 1: For each interval $I_m$, $m = 1, \dots, N_{int}$, compute the coefficients $w_{l,m}$ defined by the formula
$$w_{l,m} = \sum_{y_j \in I_m} L_l(y_j)\, q_j$$
for each $l = 1, \dots, p$. This step requires $\mathcal{O}(Np)$ operations.
Step 2: Compute the values $v_{k,\ell}$ at the equispaced nodes defined by the formula
$$v_{k,\ell} = \sum_{m=1}^{N_{int}} \sum_{l=1}^{p} K(\tilde{y}_{k,\ell}, \tilde{y}_{l,m})\, w_{l,m}$$
for all $k = 1, \dots, p$ and $\ell = 1, \dots, N_{int}$. This step requires $\mathcal{O}(p^2 N_{int}^2)$ operations.
Step 3: For each interval $I_\ell$, $\ell = 1, \dots, N_{int}$, compute the potential $\tilde{\phi}(x_i)$ via the formula
$$\tilde{\phi}(x_i) = \sum_{k=1}^{p} v_{k,\ell}\, L_k(x_i)$$
for all points $x_i \in I_\ell$. This step requires $\mathcal{O}(Np)$ operations.
In this procedure, the functions $L_k$, $k = 1, \dots, p$, are the Lagrange polynomials corresponding to the $p$ equispaced interpolation nodes on the relevant interval.
In Step 2 of the above procedure, we are evaluating $N$-body interactions on $pN_{int}$ equispaced grid points. For notational convenience, we rewrite the sum \cref{eq:equikereval} as
$$v_i = \sum_{j=1}^{pN_{int}} K(z_i, z_j)\, w_j, \qquad i = 1, \dots, pN_{int},$$
where $z_1, \dots, z_{pN_{int}}$ denote the equispaced nodes. The kernels of interest ($K_1$ and $K_2$ defined in \cref{eq:kers}) are translationally invariant, i.e., the kernels satisfy $K(y, z) = K(y + \delta, z + \delta)$ for any $\delta \in \mathbb{R}$. The combination of using equispaced points, along with the translational invariance of the kernel, implies that the matrix associated with the evaluation of the sums \cref{eq:equikereval2} is Toeplitz. This computation can thus be accelerated via the fast Fourier transform (FFT), which reduces the computational complexity of evaluating the sums \cref{eq:equikereval2} from $\mathcal{O}(p^2 N_{int}^2)$ operations to $\mathcal{O}(p N_{int} \log(p N_{int}))$.
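The Toeplitz matrix-vector product at the heart of Step 2 can be sketched via the standard circulant embedding (function naming is ours):

```python
import numpy as np

def toeplitz_matvec(first_col, first_row, w):
    """Compute T @ w for a Toeplitz matrix T via circulant embedding and the FFT.

    T is defined by its first column and first row; the product costs
    O(n log n) instead of the O(n^2) of a dense matvec.
    """
    n = len(w)
    # Embed T in a circulant matrix of size 2n: its defining vector is the
    # first column, a zero, then the reversed tail of the first row.
    c = np.concatenate([first_col, [0.0], first_row[:0:-1]])
    fw = np.fft.fft(np.concatenate([w, np.zeros(n)]))
    return np.real(np.fft.ifft(np.fft.fft(c) * fw))[:n]
```

For the t-SNE kernels on equispaced nodes, `first_col` and `first_row` are simply the kernel evaluated against the first node.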
Algorithm 1 describes the fast algorithm for evaluating the repulsive forces \cref{eq:nbody2} in one dimension ($s = 1$), which has computational complexity $\mathcal{O}(Np + p N_{int} \log(p N_{int}))$.
Algorithm 1: Fast evaluation of the repulsive forces in one dimension
Input: Collection of points $y_1, \dots, y_N$, source strengths $q_1, \dots, q_N$, number of intervals $N_{int}$, number of interpolation points per interval $p$.
1. For each interval $I_j$, $j = 1, \dots, N_{int}$, form the equispaced nodes $\tilde{y}_{k,j}$, $k = 1, \dots, p$, given by \cref{eq:equinodes}.
2. For $m = 1, \dots, N_{int}$: compute the coefficients $w_{l,m}$ given by
$$w_{l,m} = \sum_{y_j \in I_m} L_l(y_j)\, q_j, \qquad l = 1, \dots, p.$$
3. Use the fast Fourier transform to compute the values of $v = Tw$, where $T$ is the Toeplitz matrix given by
$$T_{ij} = K(z_i, z_j), \qquad i, j = 1, \dots, p N_{int}.$$
4. For $\ell = 1, \dots, N_{int}$: compute $\tilde{\phi}$ at all points $x_i \in I_\ell$ via
$$\tilde{\phi}(x_i) = \sum_{k=1}^{p} v_{k,\ell}\, L_k(x_i).$$
2.3. Optimal choice of $p$ and $N_{int}$
Recall that the computational complexity of Algorithm 1 is $\mathcal{O}(Np + p N_{int} \log(p N_{int}))$. We remark that the choice of the parameters $p$ and $N_{int}$ depends solely on the specified tolerance $\varepsilon$ and is independent of the number of points $N$. Generally, increasing $p$ will reduce the number of intervals $N_{int}$ required to obtain the same accuracy in the computation. However, we observe that the reduction in $N_{int}$ for an increased $p$ is not advantageous from a computational perspective: as the number of points increases, the computational cost depends, to leading order, only on the product $p N_{int}$. Moreover, for the t-SNE kernels $K_1$ and $K_2$ defined in \cref{eq:kers}, it turns out that for a fixed accuracy the product $p N_{int}$ remains nearly constant for small $p$. Thus, it is optimal to use a small number of interpolation points per interval ($p = 3$ in our implementation) for all t-SNE calculations. In a more general setting, when higher accuracy is required, or for other translationally invariant kernels $K$, the number of nodes per interval $p$ and the total number of intervals $N_{int}$ can be optimized based on the accuracy required.
Special care must be taken when increasing $p$ in order to achieve higher accuracy, due to the Runge phenomenon associated with equispaced nodes. In fact, the Cauchy kernel arising in t-SNE is essentially the Runge function $1/(1 + x^2)$, the archetypical example of this phenomenon. Since we use only low-order piecewise polynomial interpolation ($p = 3$), we encounter no such difficulties.
2.4. Extension to two dimensions
The above algorithm naturally extends to two-dimensional embeddings ($s = 2$). In this case, we divide the computational square into a collection of squares with equal side length, and for polynomial interpolation, we use tensor-product equispaced nodes on each square. The matrix mapping the coefficients $w$ to the values $v$, which is of size $p^2 N_{int}^2 \times p^2 N_{int}^2$, is not a Toeplitz matrix; however, it can be embedded into a Toeplitz matrix of twice its size, to which the FFT again applies. The computational complexity of the algorithm analogous to Algorithm 1 for two-dimensional t-SNE remains linear in $N$.
In order to compare the time required to compute $F_{rep}$ using FFT-accelerated Interpolation-based t-SNE (FIt-SNE) and the Barnes–Hut (BH) implementation of t-SNE, we set $p = 3$, and chose $N_{int}$ large enough that the accuracy in 2D is comparable to that of the Barnes–Hut method (with $\theta = 0.5$, the default) for all iterations (Fig. 2). After the early exaggeration phase ($\alpha = 12$ for the first 250 iterations), the points expand abruptly, resulting in decreased accuracy.
The computation time for computing the gradient over a fixed number of iterations, with an increasing number of points, is shown in Fig. 3. For 1 million points, our method is 15 and 30 times faster than BH when embedding in 1D and 2D respectively, allowing t-SNE of large datasets on the order of millions of points.
3. The attractive forces
At each step of gradient descent, the attractive forces on the $i$th point,
$$F_{attr,i} = \sum_{j \neq i} p_{ij} q_{ij} Z (y_i - y_j),$$
attract it to other points in the embedding that are close in the original space. In practice, computing the interaction energies between all pairs of points is too expensive, and hence van der Maaten (2014) restricts the computation, for every point, to the interactions with its $k$ nearest neighbors. In that implementation, nearest neighbors are computed with vantage-point trees (Yianilos (1993)), which are highly effective in low dimensions, but are prohibitively expensive when embedding large, high-dimensional datasets.
A recent theoretical advance by Linderman et al. (2017) can be used to optimize this step: it suggests that connecting every point to its $k$ nearest neighbors is not more effective than connecting every point to 2 randomly chosen points out of its 100 nearest neighbors. The main reason is that this randomized procedure, when executed on point clouds lying on manifolds, creates expander graphs at the local scale, which represent the local geometry accurately at a slightly coarser level. In the purely discrete case, this relates to problems in random graphs first raised by Ulam and Erdős–Rényi, and we refer to Linderman et al. (2017) for details. This simple insight may allow for a massive speedup, since the number of interaction terms is much smaller. In practice, we use this result to justify replacing nearest neighbors with approximate nearest neighbors. Specifically, we compute approximate nearest neighbors using a randomized nearest neighbor method called ANNOY (Bernhardsson (2017)), as we expect the resulting “near neighbors” to capture the local geometry at least as effectively as the same number of exact nearest neighbors. We further accelerated this step by parallelizing the neighbor lookups. The resulting speedups over the vantage-point tree approach for computing $F_{attr}$ are shown in Fig. 4, as measured on a machine with 12 Intel Xeon E7540 CPUs clocked at 2.00GHz.
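The “random picks from a larger neighbor pool” idea can be sketched as follows; the brute-force exact search below stands in for an approximate index such as ANNOY, and all names are our own:

```python
import numpy as np

def random_near_neighbors(X, k_pool=100, k_keep=2, seed=0):
    """For each point, keep k_keep random picks from its k_pool nearest
    neighbors. The exact O(N^2) search here is for clarity only; in
    practice an approximate index (e.g. ANNOY) plays this role.
    """
    rng = np.random.default_rng(seed)
    d2 = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
    np.fill_diagonal(d2, np.inf)                       # exclude self
    pool = np.argsort(d2, axis=1)[:, :k_pool]          # exact k_pool-NN
    # pick k_keep distinct columns of the pool at random, per point
    idx = np.argsort(rng.random((len(X), k_pool)), axis=1)[:, :k_keep]
    return np.take_along_axis(pool, idx, axis=1)
```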
4. Early and Late Exaggeration
In the expression for the gradient, the sum of attractive and repulsive forces,
$$\frac{\partial C}{\partial y_i} = 4 \Big( \alpha \sum_{j \neq i} p_{ij} q_{ij} Z (y_i - y_j) - \sum_{j \neq i} q_{ij}^2 Z (y_i - y_j) \Big),$$
the numerical quantity $\alpha$ plays a substantial role, as it determines the strength of attraction between points that are similar (in the sense of pairs with $p_{ij}$ large). In early exaggeration, $\alpha > 1$ for the first several hundred iterations, after which it is set to $\alpha = 1$ (see van der Maaten and Hinton (2008)). One of the main results of Linderman and Steinerberger (2017) is that $\alpha$ plays a crucial role and that, when it is set large enough, t-SNE is guaranteed to separate well-clustered data and also to successfully embed various synthetic datasets (e.g. a Swiss roll) that were previously thought to be poorly embedded by t-SNE.
We present a novel variation called “late exaggeration,” which refers to setting $\alpha > 1$ for the last several hundred iterations. This approach seems to produce more easily interpretable visualizations: one recurring issue with t-SNE outputs (see Fig. 5) is that the arising structure, while clustered, has its clusters close to each other and does not allow for an easy identification of segments. By increasing $\alpha$ (and thus the attraction between points in the same cluster), the clusters contract and become more easily distinguishable.
5. t-SNE Heatmaps
The 2D t-SNE plot has become a staple of many scRNA-seq analyses, in which it is used to visualize clusters of cells, colored by the expression of interesting genes. Although this information is presented in 2D, users are most interested in which genes are associated with which clusters, not the 2D shape or relations of the clusters. In general, the location of clusters with respect to one another is meaningless, and their 2D shape is not interpretable and depends on the initialization (Wattenberg et al. (2016)). We hypothesize that 1D t-SNE would contain the same information as 2D t-SNE, and since it is much more compact, it would allow simultaneous visualization of the expression of hundreds of genes in a heatmap-like fashion. The general idea is shown in Fig. 6, where we embedded the 49k retinal cells of Macosko et al. (2015) using 1D and 2D t-SNE and assigned corresponding points the same color to show that the embeddings are equivalent.
This 1D t-SNE representation can be extended into a visualization we call “t-SNE Heatmaps.” The 1D t-SNE is first discretized into $b$ bins, and we sum the expression of each gene in each of the bins, such that each gene is represented by a vector in $\mathbb{R}^b$. A distance between genes is then defined as the Euclidean distance between these vectors. In practice, the user provides a set of genes of interest (GOI), and this gene set is then enriched with the genes that are closest to the genes of interest in this metric. Each of the $b$-vectors corresponding to this new gene set forms a row in a heatmap, which can be used to visualize the expression of hundreds of genes on the t-SNE embedding. We give a small example in Fig. 7, where three genes corresponding to known retinal subpopulations are each enriched with the four genes closest to them in the t-SNE metric and visualized in this heatmap format.
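The binning step underlying t-SNE Heatmaps can be sketched as follows (a minimal version with our own naming; the bin count is an illustrative choice):

```python
import numpy as np

def tsne_heatmap_rows(embedding_1d, expression, n_bins=50):
    """Bin a 1D t-SNE embedding and sum each gene's expression per bin.

    embedding_1d : (n_cells,) 1D t-SNE coordinates
    expression   : (n_genes, n_cells) expression matrix
    Returns an (n_genes, n_bins) matrix whose rows are heatmap rows; the
    Euclidean distances between rows define a metric on genes.
    """
    edges = np.linspace(embedding_1d.min(), embedding_1d.max(), n_bins + 1)
    bins = np.clip(np.digitize(embedding_1d, edges) - 1, 0, n_bins - 1)
    rows = np.zeros((expression.shape[0], n_bins))
    np.add.at(rows.T, bins, expression.T)   # scatter-add cells into their bins
    return rows
```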
In general, the rows need not correspond to individual genes; if a method for clustering genes is available, then t-SNE Heatmaps can be used to visualize how the cell clusters are associated with gene clusters. Overall, t-SNE Heatmaps uses the 1D t-SNE of the cells to define a distance on the genes, and could be similarly extended to define a distance on the cells. This would allow for t-SNE based iterative methods to be developed, similar to Mishne et al. (2017), where the embedding of the cells is used to improve the embedding of the genes, which is then used to improve the embedding of the cells, and so forth.
6. Out-of-Core PCA
The methods for t-SNE presented above allow for embedding of millions of points in only hours, but can only be used to reduce the dimensionality of datasets that fit in memory. For many large, high-dimensional datasets, specialized servers must be used simply to load the data. For instance, a single-cell RNA-seq dataset with a million cells, where the expression of 20,000 genes is measured for each cell, requires 160GB of memory, far exceeding the capacity of a standard personal computer. In order to allow for visualization and analysis of such datasets on resource-limited machines, we present an out-of-core implementation of randomized PCA, which can be used to compute the top few (e.g. 50) principal components of a dataset to high accuracy, without ever loading it in its entirety (Halko et al. (2011a)).
6.1. Randomized Methods for PCA
The goal of PCA is to approximate the matrix being analyzed (after mean centering of its columns) with a low-rank matrix. PCA is primarily useful when such an approximation makes sense; that is, when the matrix being analyzed is approximately low-rank. If the input matrix is low-rank, then by definition, its range is low-dimensional. As such, when the input matrix is applied to a small number of random vectors, the resulting vectors nearly span its range. This observation is the core idea behind randomized algorithms for PCA: applying the input matrix to a small number of random vectors results in vectors that approximate the range of the matrix. Then, simple linear algebra techniques can be used to compute the principal components. Notably, the only operations involving the large input matrix are matrix-vector multiplications, which are easily parallelized, and for which highly optimized implementations exist. Randomized algorithms have been rigorously proven to be remarkably accurate with extremely high probability (e.g. Halko et al. (2011b); Witten and Candes (2015)), because for a rank-$k$ matrix, slightly more than $k$ random vectors are sufficient for the probability of missing a significant part of the range to be negligible. The algorithm and its underlying theory are covered in detail in Halko et al. (2011b). An easy-to-use “black box” implementation of randomized PCA is available and described in Li et al. (2017), but it requires the entire matrix to be loaded in memory. We present an out-of-core implementation of PCA in R, oocPCA, allowing for the decomposition of matrices that cannot fit in memory.
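The in-core version of this idea fits in a few lines (a sketch under our own naming; mean centering is omitted for brevity, and the oversampling and iteration counts are illustrative defaults):

```python
import numpy as np

def randomized_pca(A, k, n_oversample=10, n_iter=2, seed=0):
    """Randomized top-k low-rank approximation (Halko et al., 2011b, style).

    Apply A to a few random vectors to capture its range, orthonormalize,
    then diagonalize the small projected matrix.
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    Q = A @ rng.normal(size=(n, k + n_oversample))   # sample the range of A
    for _ in range(n_iter):                          # power iterations sharpen
        Q = np.linalg.qr(Q)[0]                       # the spectral decay
        Q = A @ (A.T @ Q)
    Q = np.linalg.qr(Q)[0]                           # orthonormal range basis
    U_small, S, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
    return (Q @ U_small)[:, :k], S[:k], Vt[:k]
```

Note that the only operations touching the large matrix `A` are the products `A @ ...` and `A.T @ ...`, which is exactly what makes the out-of-core variant possible.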
Algorithm 2: Out-of-core randomized PCA
Input: Matrix $A$ of size $m \times n$ stored in slow memory; non-negative integers $k$ (rank of the approximation), $l$ (number of random vectors, $l > k$), $its$ (number of power iterations), and $b$ (number of rows per block).
Output: Orthonormal $U$ of size $m \times k$, non-negative diagonal matrix $\Sigma$ of size $k \times k$, and orthonormal $V$ of size $n \times k$, such that $A \approx U \Sigma V^{T}$.
1. Generate a uniform random matrix $\Omega$ of size $n \times l$.
2. Form $Q = A\Omega$ block-wise, $b$ rows at a time.
3. Renormalize $Q$ with LU factorization.
4. For $it = 1, \dots, its$: form $Q \leftarrow A (A^{T} Q)$ block-wise, $b$ rows at a time, and renormalize $Q$ with LU factorization.
5. Renormalize $Q$ with QR factorization.
6. Compute the SVD of the small matrix $Q^{T} A$ and assemble $U$, $\Sigma$, and $V$.
Our implementation is described in Algorithm 2. Given an $m \times n$ matrix of doubles $A$, stored in row-major format on the disk of a machine with $M$ bytes of available memory, the number of rows that can fit in memory is calculated as $b = \lfloor M / (8n) \rfloor$. The only operations performed using $A$ are matrix multiplications, which can be performed block-wise. Specifically, the matrix product $A\Omega$, where $\Omega$ is an $n \times l$ matrix stored in fast memory, can be computed by loading the first $b$ rows of $A$ and forming the inner product of each row with the columns of $\Omega$. The process continues with the remaining blocks of the matrix, essentially “filling in” the product with each new block. In this manner, left multiplication by $A$ can be computed without ever loading the full matrix into memory.
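The block-wise product described above can be sketched as follows (our own naming; any row-sliceable on-disk array, e.g. `np.memmap`, can play the role of $A$):

```python
import numpy as np

def blockwise_matmul(A_on_disk, Omega, block_rows):
    """Compute A @ Omega by streaming A in blocks of rows.

    A_on_disk can be any row-sliceable array-like (e.g. np.memmap), so the
    full matrix is never resident in memory at once.
    """
    m = A_on_disk.shape[0]
    out = np.empty((m, Omega.shape[1]))
    for start in range(0, m, block_rows):
        stop = min(start + block_rows, m)
        out[start:stop] = A_on_disk[start:stop] @ Omega  # fill in one block
    return out
```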
By simply replacing the matrix multiplications in Li et al. (2017)’s implementation with block-wise matrix multiplications, an out-of-core algorithm can be obtained. However, significant optimization is possible. The run-time of an out-of-core algorithm is almost entirely determined by disk access time; namely, the number of times the matrix must be loaded into memory. As suggested in Li et al. (2017), the renormalization step between the application of $A^{T}$ and $A$ is not necessary in most cases, and in the out-of-core setting it doubles the number of times $A$ must be loaded per power iteration. In our implementation, we remove this renormalization step and apply $A$ and $A^{T}$ in a single pass over the blocks, hence requiring that the matrix be loaded only once per iteration.
Our implementation is in C++ with an R (R Core Team (2017)) wrapper. For maximum optimization of the linear algebra operations, we use the highly parallelized Intel MKL for all BLAS functions (e.g. matrix multiplications). The R wrapper provides functions for PCA of matrices in CSV and in binary format. Furthermore, basic preprocessing steps, including transformations and mean centering of rows and/or columns, can also be performed prior to decomposition, so that the matrix need never be fully stored in memory.
We generated a random low-rank matrix of doubles whose storage alone would far exceed the memory capacity of a personal computer. Using oocPCA we can compute the top principal components of the matrix with much less memory (Fig. 8). By storing only 1GB of the matrix in memory at a time, and with all other parameters set to default, the top principal components of this matrix can be computed in 16 minutes, while attaining high approximation accuracy in the spectral norm.
7. Summary and Discussion
In this work, we present an implementation of t-SNE that allows for embedding of high-dimensional datasets with millions of points in only a few hours. Our implementation includes a “late exaggeration” feature, which can make it easier to identify clusters in t-SNE plots. Finally, we present an out-of-core algorithm for PCA, allowing for analysis and visualization of datasets that cannot fit into memory. A natural extension of the present work is the analysis of large scRNA-seq datasets, and we are currently applying this approach to analyze the 1.3 million mouse brain cells dataset of 10X Genomics (2016). Our methods allow for visualization of these datasets, without subsampling, on a standard personal computer in a reasonable amount of time.
8. Software Availability
FIt-SNE and oocPCA are both available at https://github.com/KlugerLab/. The FIt-SNE repository also contains a script for producing t-SNE Heatmap visualizations.
The authors would like to thank Vladimir Rokhlin and Mark Tygert for many useful discussions.
Appendix A Appendix
In Section §2 we noted that the repulsive forces defined in \cref{eq:frep} can be expressed as sums of the form
$$\phi(y_i) = \sum_{j=1}^{N} K(y_i, y_j)\, q_j,$$
where the kernel $K(y, z)$ is either
$$K_1(y, z) = \frac{1}{1 + \|y - z\|^2} \qquad \text{or} \qquad K_2(y, z) = \frac{1}{\left(1 + \|y - z\|^2\right)^2}$$
for $y, z \in \mathbb{R}^s$. In this appendix, we demonstrate this for $s = 2$. The following sums are computed at each step of gradient descent:
$$\phi_1(y_i) = \sum_{j=1}^{N} K_1(y_i, y_j), \qquad \phi_2(y_i) = \sum_{j=1}^{N} K_2(y_i, y_j),$$
$$\phi_3(y_i) = \sum_{j=1}^{N} K_2(y_i, y_j)\, y_j^{(1)}, \qquad \phi_4(y_i) = \sum_{j=1}^{N} K_2(y_i, y_j)\, y_j^{(2)},$$
where $y_j^{(1)}$ and $y_j^{(2)}$ denote the two components of $y_j$. At each step of gradient descent, the repulsive forces can then be expressed in terms of these sums as follows: the normalization constant is
$$Z = \sum_{i=1}^{N} \phi_1(y_i) - N,$$
and the components of the repulsive force on the $i$th point are
$$F_{rep,i}^{(1)} = \frac{y_i^{(1)}\, \phi_2(y_i) - \phi_3(y_i)}{Z}, \qquad F_{rep,i}^{(2)} = \frac{y_i^{(2)}\, \phi_2(y_i) - \phi_4(y_i)}{Z}.$$
- 10X Genomics (2016). Transcriptional profiling of 1.3 million brain cells with the chromium single cell 3’ solution. Application Note.
- Barnes, J. and Hut, P. (1986). A hierarchical O(N log N) force-calculation algorithm. Nature, 324(6096):446–449.
- Bernhardsson, E. (2017). Annoy: Approximate nearest neighbors in C++/Python optimized for memory usage and loading/saving to disk. https://github.com/spotify/annoy.
- Halko, N., Martinsson, P.-G., Shkolnisky, Y., and Tygert, M. (2011a). An algorithm for the principal component analysis of large data sets. SIAM Journal on Scientific Computing, 33(5):2580–2594.
- Halko, N., Martinsson, P.-G., and Tropp, J. A. (2011b). Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217–288.
- Li, H., Linderman, G. C., Szlam, A., Stanton, K. P., Kluger, Y., and Tygert, M. (2017). Algorithm 971: an implementation of a randomized algorithm for principal component analysis. ACM Transactions on Mathematical Software (TOMS), 43(3):28.
- Linderman, G. C., Mishne, G., Kluger, Y., and Steinerberger, S. (2017). Randomized near neighbor graphs, giant components, and applications in data science. arXiv preprint arXiv:1711.04712.
- Linderman, G. C. and Steinerberger, S. (2017). Clustering with t-SNE, provably. arXiv preprint arXiv:1706.02582.
- Macosko, E. Z., Basu, A., Satija, R., Nemesh, J., Shekhar, K., Goldman, M., Tirosh, I., Bialas, A. R., Kamitaki, N., Martersteck, E. M., et al. (2015). Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell, 161(5):1202–1214.
- Mishne, G., Talmon, R., Cohen, I., Coifman, R. R., and Kluger, Y. (2017). Data-driven tree transforms and metrics. IEEE Transactions on Signal and Information Processing over Networks.
- R Core Team (2017). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
- van der Maaten, L. (2014). Accelerating t-SNE using tree-based algorithms. Journal of Machine Learning Research, 15(1):3221–3245.
- van der Maaten, L. and Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605.
- Wattenberg, M., Viégas, F., and Johnson, I. (2016). How to use t-SNE effectively. Distill, 1(10):e2.
- Witten, R. and Candes, E. (2015). Randomized algorithms for low-rank matrix factorizations: sharp performance bounds. Algorithmica, 72(1):264–281.
- Yianilos, P. N. (1993). Data structures and algorithms for nearest neighbor search in general metric spaces. In SODA, volume 93, pages 311–321.