Linear tSNE Optimization for the Web
Understanding how data points are arranged in a high-dimensional space plays a crucial role in exploratory data analysis . In recent years, non-linear dimensionality reduction techniques, also known as manifold learning algorithms, became a powerful tools for mining knowledge from data, e.g., the presence of clusters at different scales. Manifold learning algorithms for data visualization reduce the dimensionality of the points to 2 or 3 dimensions while preserving some characteristic of the data such as the preservation of the local neighborhoods. The success of this approach is motivated by the fact that most of the real-world data satisfy the “manifold hypothesis”, i.e., it lies on relatively low-dimensional manifolds embedded in a high-dimensional space. The manifolds are typically mapped to a lower dimensional space and visualized and analyzed, for example, in a scatterplot.
The t-distributed Stochastic Neighbor Embedding (tSNE) algorithm  has been accepted as the state of the art for nonlinear dimensionality reduction applied to visual analysis of high-dimensional space in several application areas, such as life sciences [2, 4, 15] and machine learning model understanding and human-driven supervision [17, 21, 9]. tSNE can be logically separated in two computation modules; first it computes the similarities of the high-dimensional points as a joint probability distribution and, second, it minimizes the KullbackâLeibler divergence  of a similarly computed joint probability distribution that measures the closeness of the points in the low dimensional space. The memory and computational complexity of the algorithm is , where is the number of data points.
Given its popularity, research efforts have been spent on improving the computational and memory complexity of the algorithm. While many works focused on the improvement of the similarity computation [26, 20, 24, 19, 16], only limited effort have been spent in improving the minimization algorithm employed for the creation of the embedding [26, 16, 11]. The most notable of these improvements is the Barnes-Hut-SNE that makes use of an -body simulation approach  to approximate the repulsive forces between the data points. Despite the improvements, the minimization requires many minutes using a highly-optimized C++ implementation.
In this work we focus on the minimization of the objective function for the creation of the embedding. We observe that the heavy tail of the t-Student distribution used by tSNE makes the application of the -body simulation not particularly effective. To address this problem we propose a novel minimization approach that embraces this characteristic and we reformulate the gradient of the objective function as a function of scalar fields and tensor operations. Our technique has linear computational and memory complexity and, more importantly, is implemented in a GPGPU fashion. The latter allowed us to implement a version for the browser that minimizes the objective function for standard datasets in a matter of seconds.
The contribution of this work is twofold:
A linear complexity minimization of the tSNE objective function that makes use of the modern WebGL rendering pipeline. Specifically, we
approximate the repulsive forces between data points by drawing low-resolution textures and
we adopt a tensor-based computation of the objective function’s gradient.
An efficient implementation of our result that is released as part of Google’s TensorFlow.js library
The rest of the paper is structured as follows. In the next section, we provide a theoretical primer on the tSNE algorithm that is needed to understand the related work (Section 3) and our contributions (Section 4). In Section 5, we provide the details regarding our implementation, released within Google’s TensorFlow.js library.
In this section, we provide a short introduction to tSNE , which is needed to understand the related work and our contribution. tSNE interprets the overall distances between data points in the high-dimensional space as a symmetric joint probability distribution that encodes their similarities. Likewise a joint probability distribution is computed that describes the similarity in the low-dimensional space. The goal is to achieve a representation, referred to as embedding, in the low dimensional space, in which faithfully represents . This is achieved by optimizing the positions in the low-dimensional space to minimize the cost function given by the Kullback-Leibler () divergence between the joint-probability distributions and :
Given two data points and in the dataset , the probability models the similarity of these points in the high-dimensional space. To this extent, for each point, a Gaussian kernel is chosen, whose variance is defined according to the local density in the high-dimensional space and then is described as follows:
can be seen as a relative measure of similarity based on the local neighborhood of a data point . The perplexity value is a user-defined parameter that describes the effective number of neighbors considered for each data point. The value of is chosen such that for fixed and each :
A Student’s t-Distribution with one degree of freedom is used to compute the joint probability distribution in the low-dimensional space , where the positions of the data points should be optimized. Given two low-dimensional points and , the probability that describes their similarity is given by:
The gradient of the Kullback-Leibler divergence between and is used to minimize (see Eq. 1). It indicates the change in position of the low-dimensional points for each step of the gradient descent and is given by:
The gradient descent can be seen as an -body simulation , where each data-point exerts an attractive and a repulsive force on all other points ( and ).
3 Related Work
We now present the work that has been done to improve the tSNE computation of tSNE embeddings in term of quality and scalability. Van der Maaten proposed the Barnes-Hut-SNE (BH-SNE) , which reduces the complexity of the algorithm to for both the similarity computations and the objective function minimization. More specifically, in the BH-SNE approach the similarity computations are seen as a -nearest neighborhood graph computation problem, which is obtained using a Vantage-Point Tree . The minimization of the objective function is then seen as an -body simulation, which is solved by applying the Barnes-Hut algorithm .
Pezzotti et al.  observed that the computation of the -nearest neighborhood graph for high-dimensional spaces using the Vantage-Point Tree is affected by the curse of dimensionality, limiting the efficiency of the computation. To overcome this limitation, they proposed the Approximated-tSNE (A-tSNE) algorithm , where approximated -nearest neighborhood graphs are computed using a forest of randomized KD-trees . Moreover, A-tSNE adopts the novel Progressive Visual Analytics paradigm [23, 7], allowing the user to observe the evolution of the embedding during the minimization of the objective function. This solution does not only enable a user-driven early termination of the algorithm but also led to novel discoveries in cell differentiation pathways .
A similar observation was later made by Tang et al. that led to the development of the LargeVis technique . LargeVis uses random projection trees  followed by a kNN descent procedure  for the computation of the similarities and a different objective function that is minimized using a Stochastic Gradient Descent approach . Despite the improvements, both the A-tSNE and LargeVis tools require 15 to 20 minutes to compute the embedding of the MNIST dataset , a 784-dimensional dataset of 60k images of handwritten digits, that is often used as benchmark for manifold-learning algorithms. Better performance is achieved by the UMAP algorithm , which provides a different formulation of the dimensionality-reduction problem as a cross-entropy minimization between topological representations. Computationally, UMAP follows very closely LargeVis and adopts a kNN descent procedure  and Stochastic Gradient Descent minimization of the objective function.
A completely different approach is taken in the Hierarchical Stochastic Neighbor Embedding algorithm (HSNE) . HSNE efficiently builds a hierarchical representation of the manifolds and embeds only a subset of the initial data that represent an overview of the available manifolds. This approach can embed the MNIST dataset in less than 2 minutes. The user can “drill-in” the hierarchy by requesting more detailed embeddings that reveal smaller clusters of data points. HSNE is implemented in the Cytosplore  tool and led to the discovery of new cell populations  in large samples, i.e., containing more that 5 million cells. While HSNE produces better embeddings due to an easier minimization process, it has the downside that it does not produce a single embedding that depicts the complete dataset but it requires the user to actively explore the data and it generates embeddings on request.
The techniques presented so far do not take advantage of the target domain, in which the data is embedded. As a matter of fact, tSNE is mostly used for data visualization in 2-dimensional scatterplots, while the previously introduced techniques are general and can be used for higher dimensional spaces. Based on this observation, Kim et al. introduced the PixelSNE technique  that employs a -body simulation approach similar to the BH-SNE, but quantizes the embedding space to the pixels used for visualizing the embedding. However, PixelSNE requires to scale the number of used pixels with respect to the size of the dataset in order to achieve a good embedding quality due to the quantization of the embedding space.
In this work, we take advantage of the 2-dimensional domain in which the embedding resides and we propose a more efficient way to minimize the tSNE objective function. Contrary to PixelSNE we observe that, by quantizing only the 2-dimensional space for the computation of the repulsive forces presented in Equation 8, embeddings that are hardly distinguishable from those generated by the BH-SNE implementation are computed in a fraction of the time. Moreover, we this approach allows to develop a linear complexity GPGPU implementation that runs in the client side of the browser.
4 Linear complexity tSNE minimization
In this section, we present our approach to minimize the objective function, presented in Equation 1, by rewriting its gradient, presented in Equation 7. The computation of the gradient relies on a scalar field and a vector field that are computed in linear time on the GPU.
4.1 Gradient of the objective function
The gradient of the objective function has the same form as the previously introduced one:
with attractive and repulsive forces acting on every point . We denote the forces with a to distinguish them from their original counterparts. Our main contribution is to rewrite the computation of the gradient as a form of a scalar field and a vector field .
Intuitively, represents the density of the points in the embedding space, according to the t-Student distribution, and it is used to compute the normalization of the joint probability distribution . An example of the field is shown in Figure 1b. The vector field represents the directional repulsive force applied to the entire embedding space. An example of is presented in Figure 1c-d, where the horizontal and vertical components are visualized separately. In the next section, we will present how both and are computed with a complexity of and sampled in constant time. For now, we assume these fields given and we present how the gradient of the objective function are derived from these two fields, accelerating hereby their calculation drastically.
For the attractive forces, we adopt the restricted neighborhood contribution as presented in the Barnes-Hut-SNE technique . The rationale of this approach is that, by imposing a fixed perplexity to the Gaussian kernel, only a limited number of neighbors effectively apply an attractive force on any given point (see Equation 3 and 4). Therefore we limit the number of contributing points to a multiple of the value of perplexity, equal to three times the value of the chosen perplexity, effectively reducing the computational and memory complexity to , since where is the size of the neighborhood.
The normalization factor , as it was presented in Equation 6, has complexity . In our approach we compute in linear time by sampling the scalar field .
Note that and formulation is identical but, since we assume that is computed in linear time, computing has linear complexity. Moreover, since does not depend on the point , for which we are computing the gradient, it needs to be computed only once for all the points.
The repulsive force assumes even a simpler form
being the value of the vector field in the location identified by the coordinates normalized by . Similarly as for , has an equivalent formulation as but with computational and memory complexity equal to . So far, we assumed that and are computed in linear time and queried in constant time. In the next section we present how we achieve this result by using the WebGL rendering pipeline to compute an approximation of these fields.
4.2 Computation of the fields
In the previous section, we formulated the gradient of the objective function as dependent from a scalar field and a vector field . If the fields are evaluated independently, the complexity of the approach is due to the summation in Equations 10 and 11. We achieve a linear complexity by precomputing and approximating the fields on the GPU using textures of appropriate resolution. An example of the fields for the MNIST dataset  is given in Figure 1b-d.
A similar approach is used for Kernel Density Estimation  that has applications in visualization  and non-parametric clustering . In this setting, given a number of points, the goal is to estimate a 2-dimensional probability density function, from which the points were sampled. This is usually achieved by overlaying a Gaussian kernel, whose has to be estimated, on top of every data point.
Lampe et al.  were the first to propose a computation of the kernel density on the GPU for a visualization purpose. They observed that the Gaussian kernel used for estimating the density has a limited support, i.e., having value almost equal to zero if they are sufficiently far away from the origin. A good approximation of the density function is then achieved by drawing, instead of the points, little quads that are textured with a precomputed Gaussian kernel. By using additive blending available in OpenGL, i.e., by summing the values in every pixel, the resulting drawing corresponds to the desired density function.
If we analyze Equations 10 and 11, we can observe that every element in the summations for both and have a limited support, making it indeed very similar to the Kernel Density Estimation case discussed before. The drawn functions, however, are different and Figure 2 shows them for and . Therefore, we can compute the fields by drawing over a texture with a single additive drawing operation. Each point is drawn as a quad and colored with a floating-point RGB texture where each channel encodes one of the functions shown in Figure 2.
Contrary to the Kernel Density Estimation case, where the size of the quads changes according to the chosen for the Gaussian kernel, our functions have a fixed support in the embedding space. Therefore, given a certain embedding , the resolution of the texture influences the quality of the approximation but not the overall shape of the fields. To achieve linear complexity, we define the resolution of the target texture according to the size of the embedding. In this way, every data point updates the value of a constant number of pixels in the target texture, effectively leading to complexity for the computation of the fields.
Computing the value of and for a point corresponds to extracting the interpolated value in the textures that represents the fields. This operation is extremely fast on the GPU, as WebGL natively supports the bilinear interpolation of texture values. In the next section, we provide a more detailed overview of the computational pipeline as a number of tensor operations and custom drawing operations.
The randomly initialized tSNE embedding is stored in a 2-dimensional tensor. We then proceed to compute the repulsive forces and attractive forces , shown respectively in the lower and upper side of Figure 3. The attractive forces are computed in a custom shader that measures the sum of the contribution of every neighboring point in the high-dimensional space. The neighborhoods are encoded in the joint probability distribution that is stored in a WebGL texture. can be computed server-side, for example using an approximated -nearest-neighborhood algorithm [18, 6, 5] or by the Hierarchical-SNE technique . However, we provide a WebGL implementation of the kNN-Descent algorithm  and the computation of directly in the browser to enable a client-side only computational workflow.
The repulsive forces are computed using the approach presented in previous sections. In a custom shader, we draw for each point, whose location is defined by the value in the embedding tensor, a quad that is textured with the functions presented in Figure 2. The resulting 3-channel texture, an example of which is presented in Figure 1b-d, represents the scalar field and the vector field . For each embedding point , the values of and are stored in tensors and are computed by a custom WebGL shader that interpolates the value of the texture in the corresponding channel. The normalization factor is then obtained by summing all the elements in the tensor with the interpolated values of , an operation that is efficiently performed on the GPU by TensorFlow.js.
The remaining computational steps are computed as tensor operations. is obtained by dividing the interpolated values of by , and, by adding the attractive forces , the gradient of the objective function is obtained. The gradient is then added to the embedding, hence, modifying the position of the points according to their similarities. Our work is released as part of the TensorFlow.js library and can be found on GitHub at the following address: https://github.com/tensorflow/tfjs-tsne
In this work we presented a novel approach for the optimization of the objective function of the t-Distributed Stochastic Neighbor Embedding algorithm (tSNE) that scales to large datasets in the client side of the browser. Our approach relies on modern graphics hardware to efficiently compute the gradient of the objective function from a scalar field that represents the point density and and the directional repulsive forces in the embedding space. The implementation of the technique is based on the TensorFlow.js library and can be found on GitHub at the following address: https://github.com/tensorflow/tfjs-tsne. Examples that validate our approach can also be found on GitHub https://github.com/tensorflow/tfjs-tsne-examples.
As future work, we want to perform a systematic analysis of the results of our technique, both in terms of speed and quality of the results. More specifically we plan to perform a comparison with single-embedding techniques such as LargeVis , UMAP  and PixelSNE . Moreover, we want to integrate our technique in the Hierarchical-SNE technique  and in tools for the analysis of Deep Neural Networks such as the Embedding Projector 111https://projector.tensorflow.org, TensorBoard 222https://www.tensorflow.org/programmers_guide/summaries_and_tensorboard and DeepEyes . To conclude, we believe that having a scalable tSNE implementation that runs in the browsers open exciting possibilities for the development of new analytical systems.
-  S. J. Aarseth, Gravitational N-Body Simulations. Cambridge University Press, 2003, cambridge Books Online.
-  E.-a. D. Amir, K. L. Davis, M. D. Tadmor, E. F. Simonds, J. H. Levine, S. C. Bendall, D. K. Shenfeld, S. Krishnaswamy, G. P. Nolan, and D. Pe’er, “viSNE enables visualization of high dimensional single-cell data and reveals phenotypic heterogeneity of leukemia,” Nature biotechnology, vol. 31, no. 6, pp. 545–552, 2013.
-  J. Barnes and P. Hut, “A hierarchical o (n log n) force-calculation algorithm,” nature, vol. 324, p. 446, 1986.
-  B. Becher, A. Schlitzer, J. Chen, F. Mair, H. R. Sumatoh, K. W. W. Teng, D. Low, C. Ruedl, P. Riccardi-Castagnoli, and M. Poidinger, “High-dimensional analysis of the murine myeloid cell system,” Nature immunology, vol. 15, no. 12, pp. 1181–1189, 2014.
-  S. Dasgupta and Y. Freund, “Random projection trees and low dimensional manifolds,” in Proceedings of the fortieth annual ACM symposium on Theory of computing, 2008, pp. 537–546.
-  W. Dong, C. Moses, and K. Li, “Efficient k-nearest neighbor graph construction for generic similarity measures,” in Proceedings of the 20th international conference on World wide web. ACM, 2011, pp. 577–586.
-  J.-D. Fekete and R. Primet, “Progressive analytics: A computation paradigm for exploratory data analysis,” arXiv preprint arXiv:1607.05162, 2016.
-  T. Höllt, N. Pezzotti, V. van Unen, F. Koning, E. Eisemann, B. Lelieveldt, and A. Vilanova, “Cytosplore: Interactive immune cell phenotyping for large single-cell datasets,” in Computer Graphics Forum, vol. 35, 2016, pp. 171–180.
-  M. Kahng, P. Y. Andrews, A. Kalro, and D. H. P. Chau, “A cti v is: Visual exploration of industry-scale deep neural network models,” IEEE transactions on visualization and computer graphics, vol. 24, no. 1, pp. 88–97, 2018.
-  J. Kiefer and J. Wolfowitz, “Stochastic estimation of the maximum of a regression function,” The Annals of Mathematical Statistics, pp. 462–466, 1952.
-  M. Kim, M. Choi, S. Lee, J. Tang, H. Park, and J. Choo, “Pixelsne: Visualizing fast with just enough precision via pixel-aligned stochastic neighbor embedding,” arXiv preprint arXiv:1611.02568, 2016.
-  S. Kullback, Information theory and statistics. Courier Corporation, 1997.
-  O. D. Lampe and H. Hauser, “Interactive visualization of streaming data with kernel density estimation,” in Visualization Symposium (PacificVis), 2011 IEEE Pacific, 2011, pp. 171–178.
-  Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
-  N. Li, V. van Unen, T. Höllt, A. Thompson, J. van Bergen, N. Pezzotti, E. Eisemann, A. Vilanova, S. M. Chuva de Sousa Lopes, B. P. Lelieveldt, and F. Koning, “Mass cytometry reveals innate lymphoid cell differentiation pathways in the human fetal intestine,” Journal of Experimental Medicine, 2018.
-  L. McInnes and J. Healy, “Umap: Uniform manifold approximation and projection for dimension reduction,” arXiv preprint arXiv:1802.03426, 2018.
-  V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.
-  M. Muja and D. Lowe, “Scalable nearest neighbor algorithms for high dimensional data,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 11, pp. 2227–2240, 2014.
-  N. Pezzotti, T. Höllt, B. Lelieveldt, E. Eisemann, and A. Vilanova, “Hierarchical stochastic neighbor embedding,” in Computer Graphics Forum, vol. 35, 2016, pp. 21–30.
-  N. Pezzotti, B. Lelieveldt, L. van der Maaten, T. Hollt, E. Eisemann, and A. Vilanova, “Approximated and user steerable tsne for progressive visual analytics,” IEEE Transactions on Visualization and Computer Graphics, vol. PP, no. 99, pp. 1–1, 2016.
-  N. Pezzotti, T. Höllt, J. Van Gemert, B. P. Lelieveldt, E. Eisemann, and A. Vilanova, “Deepeyes: Progressive visual analytics for designing deep neural networks,” IEEE transactions on visualization and computer graphics, vol. 24, no. 1, pp. 98–108, 2018.
-  M. Rosenblatt, “Remarks on some nonparametric estimates of a density function,” The Annals of Mathematical Statistics, pp. 832–837, 1956.
-  C. Stolper, A. Perer, and D. Gotz, “Progressive visual analytics: User-driven visual exploration of in-progress analytics,” IEEE Transactions on Visualization and Computer Graphics, vol. 20, no. 12, pp. 1653–1662, 2014.
-  J. Tang, J. Liu, M. Zhang, and Q. Mei, “Visualizing large-scale and high-dimensional data,” in Proceedings of the 25th International Conference on World Wide Web, 2016, pp. 287–297.
-  J. W. Tukey, “The future of data analysis,” The Annals of Mathematical Statistics, pp. 1–67, 1962.
-  L. Van Der Maaten, “Accelerating t-sne using tree-based algorithms,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 3221–3245, 2014.
-  L. van der Maaten and G. Hinton, “Visualizing data using t-SNE,” Journal of Machine Learning Research, vol. 9, no. 2579-2605, p. 85, 2008.
-  V. van Unen, T. Hollt, N. Pezzotti, N. Li, M. J. T. Reinders, E. Eisemann, A. Vilanova, F. Koning, and B. P. F. Lelieveldt, “Interactive visual analysis of mass cytometry data by hierarchical stochastic neighbor embedding reveals rare cell types,” Nature Communications, vol. 8, 2017.
-  P. N. Yianilos, “Data structures and algorithms for nearest neighbor search in general metric spaces,” in Proceedings of the fourth annual ACM-SIAM Symposium on Discrete algorithms. Society for Industrial and Applied Mathematics, 1993, pp. 311–321.
Nicola Pezzotti received his Master of Science degree from the University of Brescia, Italy, in 2011. Previously he worked as a research fellow at University of Brescia and as a research engineer in Open Technologies. He is a PhD student at Delft University of Technology in the Computer Graphics and Visualization group. He joined as visiting researcher the AVIZ/INRIA team in 2017 and Google Research in 2018. His research interests include visual analytics, machine learning and high-performance computing.
Alexander Mordvintsev received his Computer Science Master degree from Saint-Petersburg ITMO University, Russia in 2010. He worked on a number of industrial projects focusing on simulation and computer vision tasks as a research engineer in Transas Group. In 2014 he joined Google Zurich. His research interests include interpretability of machine learned models, computer vision and graphics. In 2015 he developed the DeepDream algorithm.
Boudewijn P.F. Lelieveldt received a PhD in medical image analysis from the Leiden University in 1999. He is heading the Division of Image Processing (www.lkeb.nl) at the Leiden University Medical Center, and hold a Medical Delta professor chair of Biomedical Imaging at Leiden University and Delft University of Technology. His research interest includes dimensionality reduction methods, with application in complex biomedical datasets. He is a member of IEEE.
Thomas Höllt received the Diplom (MSc) from the University of Koblenz-Landau, Germany, in 2008, and the PhD in computer science from the King Abdullah University of Science and Technology, Saudi Arabia, in 2013. He is a Postdoctoral fellow at Delft University of Technology. His research interests include visualization, computer graphics and GPGPU. He is a member of IEEE and Eurographics.
Elmar Eisemann is a professor at Delft University of Technology, heading the Computer Graphics and Visualization group. Before, he was an associate professor at Telecom Paris-Tech and senior researcher in the Cluster of Excellence at MPII/Saarland University. His interests include real-time and perceptual rendering, alternative representations, shadow algorithms, global illumination, and GPU acceleration techniques. In 2011, he was honored with the Eurographics Young Researcher Award.
Anna Vilanova is associate professor at the Delft University of Technology in the Computer Graphics and Visualization group. Before, she was assistant professor at the Eindhoven University of Technology. She is leading a research group in the subject of multivalued image analysis and visualization. Her research interests include visual analytics, medical visualization, volume visualization, multivalued visualization, and medical image analysis.