PixelSNE: Visualizing Fast with Just Enough Precision via Pixel-Aligned Stochastic Neighbor Embedding

Abstract

Embedding and visualizing large-scale high-dimensional data in a two-dimensional space is an important problem, since such visualization can reveal deep insights into complex data. Most existing embedding approaches, however, run at an excessively high precision, ignoring the fact that, in the end, embedding outputs are mapped into coarse-grained pixel coordinates in a limited screen space. Motivated by this observation, and directly incorporating it into an embedding algorithm, we accelerate Barnes-Hut tree-based t-distributed stochastic neighbor embedding (BH-SNE), a state-of-the-art 2D embedding method, and propose a novel alternative called PixelSNE, a highly efficient, screen resolution-driven 2D embedding method whose computational complexity is linear in the number of data items. Our experimental results show that PixelSNE runs significantly faster than BH-SNE by a large margin while maintaining comparable embedding quality. Finally, the source code of our method is publicly available at https://github.com/awesome-davian/pixelsne.

t-SNE; Barnes-Hut SNE; Dimension reduction; 2D embedding; Scatterplot; Visualization

1 Introduction

Visualizing high-dimensional data in a two-dimensional (2D) or three-dimensional (3D)1 scatterplot is an effective approach that can reveal deep insights about data. Through such visualization, one can grasp the relationships among clusters as well as those among individual data items, e.g., similar or outlying clusters and/or data items.

To generate a scatterplot given high-dimensional data, one can apply various dimension reduction or low-dimensional embedding approaches including traditional methods (e.g., principal component analysis [Jolliffe (2002)] and multidimensional scaling [Kruskal (1964a), Kruskal (1964b)]) and recent manifold learning methods (e.g., isometric feature mapping [Tenenbaum et al. (2000)], locally linear embedding [Saul and Roweis (2003)], and Laplacian Eigenmaps [Belkin and Niyogi (2003)]).

These methods, however, do not properly handle the significant information loss caused by reducing a high dimensionality down to two or three. In response, an advanced dimension reduction technique called t-distributed stochastic neighbor embedding (t-SNE) [van der Maaten and Hinton (2008)] has been proposed, showing outstanding advantages in generating 2D scatterplots. A drawback of t-SNE is its significant computing time for a large number of data items, due to its computational complexity of $O(N^2)$, where $N$ represents the number of data items. Although various approximation techniques have been proposed to accelerate the algorithm, e.g., Barnes-Hut SNE (BH-SNE) [Van Der Maaten (2014)] with a complexity of $O(N \log N)$, it still takes much time to apply them to large-scale data.

To tackle this issue, this paper proposes a novel framework that can significantly accelerate 2D embedding algorithms in visualization applications. The proposed framework is motivated by the fact that most embedding approaches compute the low-dimensional coordinates with an excessive precision, e.g., a double precision with a 64-bit floating-point representation, compared to the limited precision that our screen space has. For instance, if our screen space has $B_x \times B_y$ pixels, then the resulting coordinates computed from an embedding algorithm will be mapped to $B_x$ and $B_y$ integer levels for the $x$- and $y$-coordinates, respectively. Moreover, when making sense of a 2D scatterplot, human perception does not often require a high precision from its results [Choo and Park (2013)].

Leveraging this idea of just enough precision for our screen and human perception in the above-mentioned state-of-the-art method, BH-SNE, we propose a significantly faster alternative called pixel-aligned SNE (PixelSNE), which shows more than a five-fold speedup over BH-SNE for the 421,161 data items of the news aggregator dataset. In detail, by lowering and matching the precision used in BH-SNE to that of the pixels in a screen space, PixelSNE directly optimizes the 2D coordinates in the screen space with a pre-specified resolution.

In this paper, we describe the mathematical and algorithmic details of how we utilize the idea of a pixel-based precision in BH-SNE and present extensive experimental results showing that PixelSNE runs significantly faster than BH-SNE by a large margin while maintaining a comparable embedding quality.

Our contributions can be summarized as follows:

1. We present a novel framework of a pixel-based precision driven by a screen space with a finite resolution.

2. We propose a highly-efficient 2D embedding approach called PixelSNE by leveraging our idea of a pixel-based precision in BH-SNE.

3. We perform extensive quantitative and qualitative analyses using various real-world datasets, showing the significant speedup of our proposed approach against BH-SNE along with a comparable quality of visualization results.

2 Related Work

Dimension reduction, or low-dimensional embedding, of high-dimensional data [Van der Maaten et al. (2009)] has long been an active research area. Typically, most of these methods attempt to generate low-dimensional coordinates that maximally preserve the pairwise distances/similarities given in a high-dimensional space. Such low-dimensional outputs generally serve two purposes: (1) generating new representations of the original data to improve the desired performance of a downstream target task, such as its accuracy and/or computational efficiency, and (2) visualizing high-dimensional data in a 2D scatterplot to provide humans with an in-depth understanding and interpretation of the data.

Widely-used dimension reduction methods for visualization applications include principal component analysis (PCA) [Jolliffe (2002)], multidimensional scaling [Kruskal (1964a), Kruskal (1964b)], Sammon mapping [Sammon (1969)], generative topographic mapping [Bishop et al. (1998)], and the self-organizing map [Kohonen (2001)]. While these traditional methods generally focus on preserving global relationships rather than local ones, a class of nonlinear, local dimension reduction techniques called manifold learning [Lee and Verleysen (2007)] has been actively studied, trying to recover an intrinsic curvilinear manifold from given high-dimensional data. Representative methods are isometric feature mapping [Tenenbaum et al. (2000)], locally linear embedding [Saul and Roweis (2003)], Laplacian Eigenmaps [Belkin and Niyogi (2003)], maximum variance unfolding [Weinberger and Saul (2006)], and the autoencoder [Hinton and Salakhutdinov (2006)].

Specifically focusing on visualization applications, a recent method, t-distributed stochastic neighbor embedding [van der Maaten and Hinton (2008)], which is built upon stochastic neighbor embedding [Hinton and Roweis (2002)], has shown its superiority in generating the 2D scatterplots that can reveal meaningful insights about data such as clusters and outliers. Since then, numerous approaches have been proposed to improve the visualization quality and its related performances in the 2D embedding results. For example, a neural network has been integrated with t-SNE to learn the parametric representation of 2D embedding [Maaten (2009)]. Rather than the Euclidean distance or its derived similarity information, other information types such as non-metric similarities [van der Maaten and Hinton (2012)] and relative ordering information about pairwise distances in the form of similarity triplets [van der Maaten and Weinberger (2012)] have been considered as the target information to preserve. Additionally, various other optimization criteria and their optimization approaches, such as elastic embedding [Carreira-Perpinán (2010)] and NeRV [Venna et al. (2010)], have been proposed.

The computational efficiency and scalability of 2D embedding approaches has also been widely studied. An accelerated t-SNE based on the approximation using the Barnes-Hut tree algorithm has been proposed [Van Der Maaten (2014)]. Gisbrecht et al. proposed a linear approximation of t-SNE [Gisbrecht et al. (2012)]. More recently, an approximate, but user-steerable t-SNE, which provides interactions with which a user can control the degree of approximation on user-specified areas, has also been studied [Pezzotti et al. (2016)]. In addition, a scalable 2D embedding technique called LargeVis [Tang et al. (2016)] significantly reduced the computing times with a linear time complexity in terms of the number of data items.

Even with such a plethora of 2D embedding approaches, to the best of our knowledge, none of the previous studies have directly exploited the limited precision of our screen space and human perception to develop highly efficient 2D embedding algorithms. Our novel framework of trading precision for algorithm efficiency, together with the proposed PixelSNE, which significantly improves the efficiency of BH-SNE, is one such example.

3 Preliminaries

Figure 1: Overview of the proposed PixelSNE in comparison with the original Barnes-Hut SNE (BH-SNE)

In this section, we introduce the problem formulation and the two existing methods, t-distributed stochastic neighbor embedding (t-SNE) and its efficient version called Barnes-Hut SNE (BH-SNE).

3.1 Problem Formulation

A 2D embedding method takes a set of high-dimensional data items, $X = [x_1, \ldots, x_N] \in \mathbb{R}^{D \times N}$, where $N$ is the number of data items and $D$ is the original dimensionality, and gives their 2D (low-dimensional) embedding, $Y = [y_1, \ldots, y_N] \in \mathbb{R}^{2 \times N}$, as an output. Given a screen of resolution $B_x \times B_y$, where the $x$- and $y$-axis resolutions $B_x$ and $B_y$ are positive integers, the scatterplot is generated by transforming the $y_i$'s to their corresponding (zero-based) pixel coordinates

(1) $p_i = \left( \left\lfloor \bar{y}_i(1) \right\rfloor,\ \left\lfloor \bar{y}_i(2) \right\rfloor \right) \quad \text{for } i = 1, \ldots, N,$

where $\bar{y}_i$ denotes $y_i$ linearly rescaled so that

(2) $0 \le \bar{y}_i(1) < B_x \quad \text{and} \quad 0 \le \bar{y}_i(2) < B_y.$
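As a concrete reading of Eqs. (1) and (2), the following is a minimal sketch; the function name and the storage layout are our assumptions, not part of the released code.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Maps 2D embedding coordinates, already rescaled to satisfy Eq. (2)
// (0 <= coordinate < B_x resp. B_y), to zero-based pixel coordinates
// as in Eq. (1). y is stored interleaved: y[2*i], y[2*i+1] hold point i.
std::vector<int> to_pixels(const std::vector<double>& y, int Bx, int By) {
    std::vector<int> p(y.size());
    for (size_t i = 0; i < y.size() / 2; ++i) {
        // floor() gives the pixel index; the clamp is only defensive,
        // since Eq. (2) already guarantees the range {0, ..., B - 1}.
        p[2 * i]     = std::min(Bx - 1, (int)std::floor(y[2 * i]));
        p[2 * i + 1] = std::min(By - 1, (int)std::floor(y[2 * i + 1]));
    }
    return p;
}
```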

3.2 t-Distributed Stochastic Neighbor Embedding (t-SNE)

t-SNE embeds high-dimensional data into a low-dimensional space by minimizing the differences between the joint probability distribution $P$ representing pairwise relationships in $X$ and its counterpart $Q$ in $Y$. In detail, t-SNE computes the Euclidean pairwise distance matrix of $X$ and then converts it to the high-dimensional joint probability matrix $P$ using a Gaussian kernel.

Next, given a randomly initialized $Y^{(0)}$, t-SNE computes the Euclidean pairwise distance matrix of $Y^{(t)}$ and then converts it to a low-dimensional joint probability matrix $Q^{(t)}$ using a Student's $t$-distribution. To be specific, the $(i,j)$-th component $q_{ij}^{(t)}$ of $Q^{(t)}$ at iteration $t$, which represents the similarity between $y_i^{(t)}$ and $y_j^{(t)}$ in a probabilistic sense, is computed as

(3) $q_{ij}^{(t)} = \frac{\left( 1 + \lVert y_i^{(t)} - y_j^{(t)} \rVert^2 \right)^{-1}}{Z^{(t)}},$

where $Z^{(t)} = \sum_{k \neq l} \left( 1 + \lVert y_k^{(t)} - y_l^{(t)} \rVert^2 \right)^{-1}$.

The objective function of t-SNE is defined using the Kullback-Leibler divergence between $P$ and $Q$ as

(4) $C = \mathrm{KL}\left(P \,\|\, Q\right) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}},$

and t-SNE iteratively performs gradient-descent updates on $Y$, where the gradient with respect to $y_i$ is computed [Van Der Maaten (2014)] as

(5) $\frac{\partial C}{\partial y_i} = 4 \left( \sum_{j \neq i} p_{ij} q_{ij} Z \left( y_i - y_j \right) - \sum_{j \neq i} q_{ij}^2 Z \left( y_i - y_j \right) \right),$

in which the first summation is called the attractive term and the second the repulsive term.

Note that every data point exerts an attractive and a repulsive force on every other point, based on the difference between the two pairwise joint probability matrices $P$ and $Q$.

3.3 Barnes-Hut-SNE (BH-SNE)

While t-SNE adopts a brute-force approach with an $O(N^2)$ computational complexity, considering all the pairwise relationships, BH-SNE adopts two different tree-based approximation methods to reduce this complexity. The first one, the vantage-point tree [Yianilos (1993)], approximately computes the pairwise distance matrix and then $P$ as sparse matrices by treating the similarities of all but the nearest neighbors as zeros (Fig. 1(a)). This approximation reduces the complexity of computing the distances and $P$ to $O(uN \log N)$, where $u$ is a pre-defined perplexity parameter and $u \ll N$. Accordingly, involving only the $O(uN)$ nonzero $p_{ij}$'s in the sparse matrix $P$, the complexity of computing the attractive term in Eq. (5) reduces to $O(uN)$.

When it comes to optimizing the low-dimensional coordinates (Fig. 1(b)), BH-SNE adopts the Barnes-Hut algorithm [Barnes and Hut (1986)] to compute the repulsive term in Eq. (5). When $\lVert y_i - y_j \rVert \approx \lVert y_i - y_k \rVert \gg \lVert y_j - y_k \rVert$, the forces of $y_j$ and $y_k$ exerted on $y_i$ are similar to each other when computing the gradient. Based on this observation, BH-SNE uses the Barnes-Hut algorithm to find a single representative point of multiple data points used for the gradient update. For example, if we set the representative data point of $y_j$ and $y_k$ as $y_{\mathrm{cell}}$, then the low-dimensional joint probability $q_{i,\mathrm{cell}}$ can be used to substitute each of $q_{ij}$ and $q_{ik}$.

To dynamically obtain the representative points $y_{\mathrm{cell}}$'s, BH-SNE constructs a quadtree at each iteration, recursively splitting the 2D region that contains the $y_i$'s into four pieces. Each node, which we call a cell $c$, contains the following information (see the sketch after this list):

1. the boundary of its corresponding 2D region, i.e.,

(6) $\left[ b_{x,\min}^c,\ b_{x,\max}^c \right] \times \left[ b_{y,\min}^c,\ b_{y,\max}^c \right],$

2. the set of $y_i$'s contained in $c$, i.e.,

(7) $Y_c = \left\{\, y_i \;:\; b_{x,\min}^c \le y_i(1) < b_{x,\max}^c \ \text{and}\ b_{y,\min}^c \le y_i(2) < b_{y,\max}^c \,\right\},$

and its cardinality $N_c = |Y_c|$, and

3. the center of mass, $y_{\mathrm{cell}}^c$, of the $y_i$'s in $c$, i.e.,

(8) $y_{\mathrm{cell}}^c = \frac{1}{N_c} \sum_{y_i \in Y_c} y_i.$
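The following is a minimal sketch, with assumed names and not the released implementation, of the per-cell information in Eqs. (6)-(8); the later sketches in Section 4 reuse this struct.

```cpp
// Per-cell information of the Barnes-Hut quadtree, Eqs. (6)-(8).
struct Cell {
    double x_min, x_max, y_min, y_max;  // boundary of the 2D region, Eq. (6)
    int    n_points;                    // cardinality N_c of Y_c, Eq. (7)
    double com_x, com_y;                // center of mass y_cell, Eq. (8)
    Cell*  child[4];                    // NW, NE, SW, SE quadrants (or nullptr)

    // Incrementally update the center of mass as a point is assigned to Y_c.
    void add_point(double x, double y) {
        com_x = (com_x * n_points + x) / (n_points + 1);
        com_y = (com_y * n_points + y) / (n_points + 1);
        ++n_points;
    }
};
```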

Given $Y^{(t)}$ at iteration $t$, BH-SNE starts by forming a root node $c_{\mathrm{root}}$, containing all the $y_i^{(t)}$'s, by setting $Y_{c_{\mathrm{root}}} = \{ y_1^{(t)}, \ldots, y_N^{(t)} \}$ in Eq. (7) and

(9) $b_{x,\min}^{c_{\mathrm{root}}} = \min_i y_i^{(t)}(1) - \epsilon, \qquad b_{x,\max}^{c_{\mathrm{root}}} = \max_i y_i^{(t)}(1) + \epsilon$
(10) $b_{y,\min}^{c_{\mathrm{root}}} = \min_i y_i^{(t)}(2) - \epsilon, \qquad b_{y,\max}^{c_{\mathrm{root}}} = \max_i y_i^{(t)}(2) + \epsilon,$

where $\epsilon$ is a small number in Eq. (6). BH-SNE then recursively splits a cell into four (equally-sized) quadrant cells located at “northwest”, “northeast”, “southwest”, and “southeast” by setting their boundaries accordingly, assigning the $y_i$'s to these child cells based on the boundary information in Eq. (7), and computing their centers of mass. While assigning the $y_i$'s one by one to the tree, the quadtree grows until each leaf node contains at most a single $y_i$.

Afterwards, when computing the gradient with respect to $y_i$ in Eq. (5) (Fig. 1(b-3)), BH-SNE traverses the quadtree via depth-first search to determine whether $y_{\mathrm{cell}}^c$ can work as the representative point of those $y_j$'s contained in cell $c$, based on the criterion

(11) $\frac{r_c}{\lVert y_i - y_{\mathrm{cell}}^c \rVert} < \theta,$

where $r_c$ represents the diagonal length of the region of $c$ and $\theta$ is a threshold parameter. Once a cell $c$ satisfying this criterion is found, the repulsive term in Eq. (5) for those points contained in $c$ is approximated as $N_c\, q_{i,\mathrm{cell}}^2 Z \left( y_i - y_{\mathrm{cell}}^c \right)$, thus reducing the computational complexity of computing Eq. (5) to $O(N \log N)$. A sketch of this traversal follows.
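To make the traversal concrete, here is a hedged sketch reusing the Cell struct above; all names are ours, and the actual BH-SNE code differs in bookkeeping details.

```cpp
#include <cmath>

// Accumulates the repulsive part of Eq. (5) for the point (yx, yy) by a
// depth-first traversal. u = (1 + ||y_i - y_cell||^2)^(-1) is the
// unnormalized similarity; dividing the accumulated force by the total
// Z_sum afterwards yields the q^2 Z (y_i - y_cell) terms.
void repulsive_force(const Cell* c, double yx, double yy,
                     double theta, double force[2], double& Z_sum) {
    if (c == nullptr || c->n_points == 0) return;
    double dx = yx - c->com_x, dy = yy - c->com_y;
    double dist2 = dx * dx + dy * dy;
    double diag  = std::hypot(c->x_max - c->x_min, c->y_max - c->y_min);
    bool   leaf  = (c->child[0] == nullptr);
    // Eq. (11): r_cell / ||y_i - y_cell|| < theta, i.e., the cell is small
    // relative to its distance, so y_cell represents all points it contains.
    if (leaf || diag * diag < theta * theta * dist2) {
        double u = 1.0 / (1.0 + dist2);
        Z_sum    += c->n_points * u;
        force[0] += c->n_points * u * u * dx;
        force[1] += c->n_points * u * u * dy;
    } else {
        for (int k = 0; k < 4; ++k)
            repulsive_force(c->child[k], yx, yy, theta, force, Z_sum);
    }
}
```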

4 Pixel-Aligned SNE (PixelSNE)

In this section, we present our PixelSNE, which significantly reduces the computational time of BH-SNE.

4.1 Pixel-Aligned Barnes-Hut Tree

A major novelty of PixelSNE originates from the fact that it directly considers the screen-space coordinates $\bar{y}_i$ (Eq. (1)) instead of $y_i$. That is, PixelSNE utilizes two properties of $\bar{y}_i$ during its optimization process: (1) the range of $\bar{y}_i$ remains fixed as $[0, B_x) \times [0, B_y)$ throughout the algorithm iterations (Eq. (2)) and (2) the precision of $\bar{y}_i$ is limited by the screen-space resolution. Utilizing these characteristics, we accelerate the Barnes-Hut tree construction as well as the assignment process of the $\bar{y}_i$'s to cells as follows. For clarity, we denote the accelerated quadtree algorithm of PixelSNE as a pixel-aligned quadtree (P-Quadtree), while we call the original quadtree algorithm of BH-SNE a data-driven quadtree (D-Quadtree).

One-time tree construction (instead of every iteration). Unlike BH-SNE, which builds a quadtree from scratch at every iteration, the above-mentioned properties of $\bar{y}_i$ allow PixelSNE to build P-Quadtree only once, before the main iterations of the gradient-descent update, and to recycle it throughout the iterations. That is, PixelSNE pre-computes the full information about (1) the boundary (Eq. (6)) of each cell and (2) its center of mass (Eq. (8)) as follows.

For the boundary information, we utilize the fact that, since $0 \le \bar{y}_i^{(t)}(1) < B_x$ and $0 \le \bar{y}_i^{(t)}(2) < B_y$ for every iteration $t$, Eqs. (9) and (10) boil down to

$b_{x,\min}^{c_{\mathrm{root}}} = 0, \quad b_{x,\max}^{c_{\mathrm{root}}} = B_x, \quad b_{y,\min}^{c_{\mathrm{root}}} = 0, \quad b_{y,\max}^{c_{\mathrm{root}}} = B_y,$

which is no longer dependent on the iteration $t$. This constant boundary of the root node makes those of all the child cells in P-Quadtree constant as well. This results in a fixed depth of P-Quadtree once the minimum size of a cell is determined, which will be discussed in detail later in this subsection. Based on this idea, PixelSNE pre-computes the boundaries of all the cells in P-Quadtree.

Next, since the boundary of each cell is already determined, we simply approximate the center of mass as the center of the cell's corresponding region, i.e.,

$y_{\mathrm{cell}}^c \approx \left( \frac{b_{x,\min}^c + b_{x,\max}^c}{2},\ \frac{b_{y,\min}^c + b_{y,\max}^c}{2} \right),$

which is also not dependent on the iteration $t$, as sketched below.
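A sketch of the resulting one-time construction, reusing the Cell struct above; for exposition it eagerly materializes the full tree, whereas an actual implementation can allocate cells lazily, and all names are our assumptions.

```cpp
// One-time, top-down construction of P-Quadtree: boundaries and (approximate)
// centers of mass are fixed before the gradient-descent iterations and reused
// at every iteration; only the point assignments change.
Cell* build_pquadtree(double x_min, double x_max,
                      double y_min, double y_max) {
    Cell* c = new Cell;
    c->x_min = x_min; c->x_max = x_max;
    c->y_min = y_min; c->y_max = y_max;
    c->n_points = 0;
    c->com_x = 0.5 * (x_min + x_max);   // center of mass approximated by
    c->com_y = 0.5 * (y_min + y_max);   // the center of the cell's region
    for (int k = 0; k < 4; ++k) c->child[k] = nullptr;
    // Keep splitting while the cell is larger than one pixel on either axis
    // (the stopping condition discussed below).
    if (x_max - x_min > 1.0 || y_max - y_min > 1.0) {
        double xm = 0.5 * (x_min + x_max), ym = 0.5 * (y_min + y_max);
        c->child[0] = build_pquadtree(x_min, xm, ym, y_max);  // northwest
        c->child[1] = build_pquadtree(xm, x_max, ym, y_max);  // northeast
        c->child[2] = build_pquadtree(x_min, xm, y_min, ym);  // southwest
        c->child[3] = build_pquadtree(xm, x_max, y_min, ym);  // southeast
    }
    return c;
}
// Called once with the constant root boundary, e.g.:
//   Cell* root = build_pquadtree(0.0, Bx, 0.0, By);
```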

Once the pre-computation of the above information (Fig. 1(c-1)) finishes, the iterative gradient optimization in PixelSNE simply assigns the $\bar{y}_i$'s to these pre-computed cells in P-Quadtree and updates $Y_c$ and $N_c$ (Fig. 1(c-2)). In contrast, BH-SNE recomputes all of these steps at every iteration (Figs. 1(b-1) and (b-2)), which acts as a critical inefficiency compared to PixelSNE.

Bounded tree depth based on screen resolution. Considering a typical screen resolution, BH-SNE performs an excessively precise computation. That is, when mapping D-Quadtree to the pixel grid of our screen, one can view BH-SNE as subdividing the pixels, the minimum unit of the screen, into much smaller cells until each leaf cell contains at most one data point (Fig. 1(T-1)).

On the contrary, the minimum size of a cell in P-Quadtree is bounded by the pixel size to avoid an excessive precision in computation (Fig. 1(T-2)). In detail, P-Quadtree keeps splitting a cell while the condition

$b_{x,\max}^c - b_{x,\min}^c > 1 \quad \text{or} \quad b_{y,\max}^c - b_{y,\min}^c > 1$

holds, which indicates that the cell length is larger than the unit pixel size for at least one of the two axes. As a result, the depth of P-Quadtree is bounded by

$\left\lceil \log_2 \left( \max\left( B_x, B_y \right) \right) \right\rceil,$

e.g., nine levels for a $512 \times 512$ resolution.

Such a bounded depth of P-Quadtree causes a leaf node to possibly contain multiple data points, and in the gradient-descent update step, we may not find a cell satisfying Eq. (11) even after reaching a leaf node in the depth-first search traversal. In this case, we simply stop the traversal and use the center of mass of the corresponding leaf node to approximately compute the gradient (Eq. (5)) for those points contained in the node.

Finally, let us clarify that we maintain the $\bar{y}_i$'s as double-precision numbers just as BH-SNE does; our idea of a limited precision is realized mainly through the two novel techniques described above.

Computational complexity. The depth of the Barnes-Hut tree acts as a critical factor in algorithm efficiency, since both the assignment of data points to cells and the approximation of the gradient (Eq. (5)) are performed via depth-first search traversal, whose computational complexity is linear in the tree depth. In the worst case, where a majority of data points are located in a relatively small region of the 2D space and a few outliers are located far from them, D-Quadtree can grow arbitrarily deep as it keeps subdividing until every leaf holds a single point, which is much larger than the depth of P-Quadtree, $\lceil \log_2(\max(B_x, B_y)) \rceil$. Owing to this characteristic, the overall computational complexity of PixelSNE is $O(N \log(\max(B_x, B_y)))$, which reduces to a linear computational complexity in terms of the number of data items given fixed screen resolutions $B_x$ and $B_y$.

Intuitively, BH-SNE traverses much finer-grained nodes to obtain an excessively precise differentiation among data points. On the other hand, PixelSNE assumes that as long as multiple data points are captured within a single pixel, they are close enough to be represented by their center of mass when visualized on a screen. This guarantees that the depth of P-Quadtree is limited by the screen resolution, instead of being influenced by the undesirable non-uniformity of the 2D embedding results during algorithm iterations.

4.2 Screen-Driven Scaling

In order for the outputs $\bar{y}_i$'s at every iteration to satisfy Eq. (2) after their gradient-descent update, we scale them as follows. Let us denote by $\bar{Y}^{(t)}$ the 2D coordinates computed at iteration $t$ and by $\hat{Y}^{(t+1)}$ the updated coordinates at the next iteration via a gradient-descent step. Note that $\bar{Y}^{(t)}$ satisfies Eq. (2) while $\hat{Y}^{(t+1)}$ does not. Hence, we normalize each of the 2D coordinates of $\hat{Y}^{(t+1)}$ and obtain $\bar{Y}^{(t+1)}$ as

$\bar{y}_i^{(t+1)}(d) = \frac{\hat{y}_i^{(t+1)}(d) - m_d^{(t+1)}}{M_d^{(t+1)} - m_d^{(t+1)} + \beta}\, B_d \quad \text{for } d \in \{x, y\},$

where $m_d^{(t+1)} = \min_j \hat{y}_j^{(t+1)}(d)$, $M_d^{(t+1)} = \max_j \hat{y}_j^{(t+1)}(d)$, and $\beta$ is a small positive constant. The reason for introducing $\beta$ is to have the integer-valued 2D pixel coordinates in Eq. (1) lie exactly between $0$ and $B_d - 1$ for $d \in \{x, y\}$. For example, for a $512 \times 512$ resolution, we impose the pixel coordinates at each iteration to have the range exactly from $0$ to $511$ for both the $x$- and $y$-coordinates. A sketch of this step follows.
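A hedged sketch of the screen-driven scaling step under the normalization above; the default value of beta is our assumption, as the paper only requires a small positive constant.

```cpp
#include <algorithm>
#include <vector>

// Rescales n interleaved 2D coordinates (y[2*i], y[2*i+1]) into [0, B_d)
// per axis after a gradient update, so that Eq. (2) holds again.
void rescale_to_screen(std::vector<double>& y, int n,
                       const double B[2], double beta = 0.5) {
    for (int d = 0; d < 2; ++d) {
        double lo = y[d], hi = y[d];
        for (int i = 1; i < n; ++i) {              // per-axis min and max
            lo = std::min(lo, y[2 * i + d]);
            hi = std::max(hi, y[2 * i + d]);
        }
        // beta > 0 keeps the maximum strictly below B_d, so the floored
        // pixel coordinates of Eq. (1) land in {0, ..., B_d - 1}.
        for (int i = 0; i < n; ++i)
            y[2 * i + d] = (y[2 * i + d] - lo) / (hi - lo + beta) * B[d];
    }
}
```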

This scaling step, however, affects the computation of the low-dimensional pairwise probability matrix $Q$, since it scales each pairwise Euclidean distance $\lVert y_i^{(t)} - y_j^{(t)} \rVert$ in Eq. (3), resulting in a different probability distribution from the case with no scaling. To compensate for this scaling effect in computing $Q$, we re-define $q_{ij}^{(t)}$ with respect to the scaled $\bar{y}_i^{(t)}$'s as

(12) $q_{ij}^{(t)} = \frac{\left( s^{(t)} + \lVert \bar{y}_i^{(t)} - \bar{y}_j^{(t)} \rVert^2 \right)^{-1}}{\sum_{k \neq l} \left( s^{(t)} + \lVert \bar{y}_k^{(t)} - \bar{y}_l^{(t)} \rVert^2 \right)^{-1}},$

where $s^{(t)}$ is the squared scaling factor accumulated over the iterations, defined recursively as

(13) $s^{(t)} = s^{(t-1)} \left( \frac{B_d}{M_d^{(t)} - m_d^{(t)} + \beta} \right)^2,$

where $s^{(0)}$ is determined by the initial scaling, since $Y^{(0)}$ is randomly initialized from a zero-mean Gaussian distribution. Note that the translation in the scaling step does not affect pairwise distances, so by introducing $s^{(t)}$ defined in this manner, PixelSNE uses Eq. (12), which is still equivalent to Eq. (3).

4.3 Accelerating the Construction of P

To accelerate the process of constructing the matrix $P$, we adopt a recently proposed, highly efficient algorithm for constructing an approximate $K$-nearest neighbor graph (K-NNG) [Tang et al. (2016)]. This algorithm builds on top of the classical state-of-the-art approach based on random projection trees but significantly improves its efficiency. It first uses a few random projection trees to construct an approximate K-NNG. Afterwards, based on the intuition that “the neighbors of my neighbors are also likely to be my neighbors,” the algorithm iteratively improves the accuracy of the K-NNG by exploring the neighbors of neighbors defined according to the current K-NNG, as sketched below. In our experiments, we show that this algorithm indeed significantly accelerates the process of constructing $P$.
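The following is a schematic sketch of the neighbor-of-neighbor refinement under our reading of [Tang et al. (2016)]; all names are ours, and the actual LargeVis implementation differs (e.g., it uses heaps and sampling rather than full sorts).

```cpp
#include <algorithm>
#include <set>
#include <vector>

using Points = std::vector<std::vector<double>>;

// Squared Euclidean distance between points a and b.
static double dist2(const Points& X, int a, int b) {
    double s = 0.0;
    for (size_t d = 0; d < X[a].size(); ++d)
        s += (X[a][d] - X[b][d]) * (X[a][d] - X[b][d]);
    return s;
}

// Refines an approximate K-NNG (e.g., from random projection trees):
// each iteration considers every point's neighbors-of-neighbors as new
// candidates and keeps the K closest.
void refine_knng(const Points& X, std::vector<std::vector<int>>& knn,
                 int K, int iters) {
    const int n = (int)X.size();
    for (int it = 0; it < iters; ++it) {
        for (int i = 0; i < n; ++i) {
            std::set<int> cand(knn[i].begin(), knn[i].end());
            for (int j : knn[i])          // "neighbors of my neighbors"
                for (int k : knn[j])
                    if (k != i) cand.insert(k);
            std::vector<int> c(cand.begin(), cand.end());
            std::sort(c.begin(), c.end(), [&](int a, int b) {
                return dist2(X, i, a) < dist2(X, i, b);
            });
            if ((int)c.size() > K) c.resize(K);
            knn[i] = c;                   // keep the K closest candidates
        }
    }
}
```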

4.4 Implementation Improvements

The implementation of PixelSNE is built upon that of BH-SNE publicly available at https://github.com/lvdmaaten/bhtsne. Other than the above algorithmic improvements, we made implementation improvements as follows.

First, the Barnes-Hut tree involves many computations of division by two (or powers of two). We replaced such computations with more efficient bitwise shift operations. Similarly, we replaced modulo-two operations, which are used when assigning data points to cells, with bitwise masking with 1. Finally, we replaced the ‘pow’ function from the ‘math’ library in C/C++, which was used for squared-distance computation, with a single multiplication, e.g., ‘d * d’ instead of ‘pow(d, 2)’. Although not dramatic, these implementation modifications rendered consistently faster computing times than the case without them. The snippet below illustrates the three substitutions.
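A minimal illustration of the three substitutions; function and variable names are ours, not those of the implementation. The shift-based division applies because pixel-aligned cell sizes are integral powers of two.

```cpp
#include <cmath>

// Division by two via a shift (valid for the integer-valued,
// pixel-aligned cell sizes, which are powers of two).
long long half_slow(long long w) { return w / 2; }
long long half_fast(long long w) { return w >> 1; }

// Modulo two via bitwise masking, used when assigning points to quadrants.
int quad_slow(int idx) { return idx % 2; }
int quad_fast(int idx) { return idx & 1; }

// Squared distance via a single multiplication instead of a library call.
double sq_slow(double d) { return std::pow(d, 2.0); }
double sq_fast(double d) { return d * d; }
```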

5 Experiments

In this section, we present quantitative analyses as well as qualitative visualization examples of PixelSNE in comparison with the original t-SNE and BH-SNE. All the experiments were conducted on a single desktop computer with an Intel Core i7-6700K processor, 32GB RAM, and Ubuntu 16.04.1 LTS installed.

| Method | 20News P | 20News Coord | 20News Total | FacExp P | FacExp Coord | FacExp Total | MNIST P | MNIST Coord | MNIST Total | NewsAgg P | NewsAgg Coord | NewsAgg Total | Yelp P | Yelp Coord | Yelp Total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| t-SNE | 2.43m (0.98s) | 36.87m (55.08s) | 39.31m (56.06s) | 6.48m (0.23s) | 84.94m (10.12s) | 91.41m (10.13s) | (-) | (-) | (-) | (-) | (-) | (-) | (-) | (-) | (-) |
| BH-SNE | 22.25s (1.55s) | 156.70s (6.62s) | 178.95s (8.16s) | 0.19m (0.84s) | 5.48m (40.73s) | 5.56m (39.89s) | 5.10m (10.84s) | 13.93m (2.27m) | 19.03m (2.45m) | 3.96h (7.05m) | 1.81h (4.04m) | 5.77h (11.45m) | 12.44h (19.29m) | 5.83h (17.93m) | 18.27h (37.22m) |
| PixelSNE-VP | 14.68s (1.46s) | 70.71s (1.58s) | 85.39s (3.03s) | 0.14m (0.66s) | 2.04m (3.52s) | 2.19m (4.16s) | 3.76m (16.48s) | 6.17m (11.92s) | 9.92m (28.37s) | 2.84h (5.49m) | 0.63h (1.04m) | 3.46h (6.52m) | 8.89h (14.26m) | 3.11h (7.47m) | 12.00h (21.71m) |
| PixelSNE-RP | 15.53s (0.49s) | 72.17s (1.44s) | 87.70s (1.93s) | 0.32m (1.44s) | 2.04m (3.57s) | 2.36m (4.99s) | 1.40m (2.22s) | 6.05m (7.81s) | 7.45m (9.86s) | 0.30h (0.43m) | 0.63h (2.08m) | 0.94h (2.50m) | 0.76h (0.97m) | 3.14h (10.23m) | 3.90h (11.20m) |

Table 1: Comparison of computing times among 2D embedding algorithms. 'P' denotes constructing the pairwise probability matrix P, 'Coord' optimizing the low-dimensional coordinates, and 'Total' their sum. The t-SNE results are excluded for large datasets due to its long computing time. The numbers in parentheses represent the standard deviation over three repetitions.

5.1 Experimental Setup

This section describes our experimental setup including datasets and compared methods.

Datasets

Our experiments used five real-world datasets: (1) MNIST digit images (MNIST), (2) facial expression images (FacExp), (3) 20 Newsgroups documents (20News), (4) the news aggregator dataset (NewsAgg), and (5) Yelp reviews (Yelp).

MNIST 2 dataset contains a set of 70,000 gray-scale handwritten digit images of $28 \times 28$ pixels, resulting in an input matrix $X \in \mathbb{R}^{784 \times 70000}$. Each image has a digit label among ten different digits.

FacExp 3 dataset contains a set of 28,709 gray-scale facial images of $48 \times 48$ pixels, with label information from seven different facial expression categories. For this dataset, we extracted the features of each image using a convolutional neural network. This network starts with four convolutional layers with 32, 64, 96, and 128 filters, respectively. Each convolutional layer is followed by a batch normalization layer, a rectified linear unit (ReLU), and a max-pooling layer. Two fully connected layers then follow: the first one with 256 output nodes followed by a ReLU, and the second with 256 output nodes followed by a softmax classifier for the seven facial expression categories. After training this model, we used the last hidden-layer output of 256 dimensions as the feature vector of each image, resulting in an input matrix $X \in \mathbb{R}^{256 \times 28709}$.

20News 4 dataset is a collection of 18,846 newsgroup posts on 20 different newsgroup categories. We used the label of each document as one of seven higher-level newsgroup categories. From each document, we extracted the embedding vectors of the words in it using pre-trained 300-dimensional word2vec embedding vectors5 and then averaged them to represent a single document, resulting in an input matrix $X \in \mathbb{R}^{300 \times 18846}$.

NewsAgg 6 dataset contains a collection of 421,161 news headlines on four different categories. We represent each headline with the averaged word vectors pre-processed as explained above. This results in an input matrix $X \in \mathbb{R}^{300 \times 421161}$.

Yelp 7 review dataset contains around 2.7 million reviews written by around 687,000 users. We pre-processed this dataset by removing rare keywords appearing in fewer than 20 documents as well as overly general keywords such as food, restaurant, etc. After forming a term-document matrix, we randomly selected one million reviews from it, resulting in an input matrix $X$ with one million columns. Afterwards, we extracted topic labels out of ten topics computed by nonnegative matrix factorization [Lee and Seung (1999)] as a topic modeling method.

For all datasets, we reduced the dimensionality, i.e., the number of rows of each input matrix, to 50 by applying PCA, which is a standard pre-processing step for t-SNE and BH-SNE.

Figure 2: Computing times vs. the number of data items on the NewsAgg dataset. The error bars represent the standard deviation over five repetitions. The t-SNE results are excluded for large data sizes due to its long computing time.

Compared Methods

We compared our methods against the original t-SNE and BH-SNE. For both methods, we used the publicly available code written by the original author.8 We used the default parameter values for both methods, e.g., the perplexity value of 50, the number of iterations of 1,000, and the threshold $\theta$ in Eq. (11) of 0.5.

For our PixelSNE, we used two different versions depending on the algorithm for constructing $P$: (1) the vantage-point tree (PixelSNE-VP) used originally in BH-SNE and (2) the random projection tree-based approach (PixelSNE-RP) used in LargeVis [Tang et al. (2016)] (Section 4.3). For the latter, we extracted the corresponding part, the $K$-nearest neighbor graph construction, from the publicly available code9 and integrated it with our implementation of PixelSNE, using its default parameter settings except for setting the number of threads to one to exclude the effect of parallelism. We set the number of iterations and the threshold $\theta$ in Eq. (11) to the same values as BH-SNE, i.e., 1,000 and 0.5, respectively. We set both of the screen resolution parameters $B_x$ and $B_y$ to 512 for the 20News and FacExp datasets. For the MNIST, NewsAgg, and Yelp datasets, both are set to 1024, 2048, and 8192, respectively.

Figure 3: Computing times of BH-SNE and PixelSNE-RP with different screen resolutions on the MNIST dataset, using (a) 30,000 and (b) 50,000 data items. The error bars represent the standard deviation over five repetitions.

5.2 Computing Time Analysis

Figure 4: Embedding quality comparisons between BH-SNE and PixelSNE-RP on (a) 20News, (b) MNIST, (c) FacExp, (d) NewsAgg, and (e) Yelp. We report the neighborhood precision (Eq. (14)) for (a) and (b) and the $k$-NN classification accuracy for (c), (d), and (e).

Table 1 shows the comparison of computing times of the compared algorithms on the different datasets. We report the computing times of the two sub-routines as well as their total: (1) constructing the original pairwise probability matrix $P$ (P), (2) optimizing the low-dimensional coordinates (Coord), and (3) the total computing time (Total). For fair comparisons, the minor improvements presented in Section 4.4 are applied to neither PixelSNE-VP nor PixelSNE-RP. In addition, due to its significant computing time, the results of t-SNE are excluded for the large datasets, i.e., the NewsAgg and Yelp datasets.

Comparison results. In all cases, PixelSNE-VP and PixelSNE-RP consistently outperform t-SNE and BH-SNE by a large margin. For example, for the Yelp dataset, BH-SNE took more than 18 hours, while PixelSNE-VP and PixelSNE-RP took 12 hours and less than 4 hours, respectively. For the part optimizing the low-dimensional coordinates (Coord), where we mainly applied our screen resolution-driven precision, PixelSNE-RP and PixelSNE-VP both show a significant performance boost over BH-SNE. For instance, PixelSNE-VP and PixelSNE-RP compute this part more than three times faster than BH-SNE for the NewsAgg dataset. When it comes to constructing $P$ (P), as the data size gets larger, e.g., the NewsAgg and Yelp datasets, PixelSNE-RP runs significantly faster than PixelSNE-VP due to the advantage of the random projection tree adopted in PixelSNE-RP (Section 4.3).

Scalability with respect to the data size. Next, Fig. 2 shows the computing times depending on the number of data items sampled from the NewsAgg dataset. As the data size gets larger, PixelSNE-RP runs increasingly faster than both BH-SNE and t-SNE, which shows the promising scalability of PixelSNE.

Effects of the precision parameter. Fig. 3 shows the computing time depending on the screen resolution parameter $B_x (= B_y)$ for 30,000 and 50,000 random samples from the MNIST dataset. Although PixelSNE-RP is consistently faster than BH-SNE, it tends to slow down as $B_x$ increases. Nonetheless, our method still runs much faster, e.g., around a two-fold speedup compared to BH-SNE, even with a fairly large value of $B_x$, i.e., 4,096, which corresponds to more than 16 million pixels.

Effects of implementation improvements. We compared the computing time between PixelSNE-RP and the improved PixelSNE-RP, which adopts the efficient operations proposed in Section 4.4. Table 2 shows that the improved version consistently brings around a 4% speed improvement over the original PixelSNE-RP for all the datasets. Interestingly, our improvements also reduce the variance of the computing times.

| Method | 20News | FacExp | MNIST | NewsAgg | Yelp |
|---|---|---|---|---|---|
| Improved PixelSNE-RP (Sec. 4.4) | 83.26s (0.19s) | 134.43s (0.21s) | 429.08s (9.05s) | 3194.86s (38.99s) | 13485.27s (51.1s) |
| PixelSNE-RP | 87.7s (1.93s) | 141.54s (4.99s) | 446.88s (9.86s) | 3370.36s (150.04s) | 14025.12s (671.84s) |

Table 2: Computing times of PixelSNE-RP with and without the improvements shown in Section 4.4. The numbers in parentheses represent the standard deviation over three repetitions.

5.3 Embedding Quality Analysis

| $B_x (= B_y)$ | $k$ = 1 | $k$ = 3 | $k$ = 5 | $k$ = 10 | $k$ = 20 | $k$ = 30 |
|---|---|---|---|---|---|---|
| 512 | .2401 (.0125) | .3207 (.0117) | .3475 (.0098) | .3719 (.0064) | .3916 (.0033) | .4237 (.0023) |
| 1024 | .2405 (.0084) | .3165 (.0077) | .3434 (.0061) | .3702 (.0037) | .3932 (.0023) | .4290 (.0011) |
| 2048 | .2463 (.0056) | .3216 (.0045) | .3481 (.0046) | .3750 (.0030) | .3977 (.0022) | .4328 (.0008) |
| 4096 | .2517 (.0054) | .3272 (.0052) | .3532 (.0046) | .3782 (.0027) | .4006 (.0018) | .4344 (.0009) |

Table 3: Neighborhood precision (Eq. (14)) of PixelSNE-RP on 10,000 samples of the MNIST dataset, varying the number of nearest neighbors $k$. The numbers in parentheses represent the standard deviation over ten repetitions.

Evaluation measures. To analyze the embedding quality, we adopted the following two measures:

Neighborhood precision measures how well the $k$ nearest neighbors in the original high-dimensional space are preserved among the $k$ nearest neighbors in the 2D embedding results. In detail, let us denote the $k$ nearest neighbors of data item $i$ in the high-dimensional space and those in the low-dimensional (2D) space as $N_k^h(i)$ and $N_k^l(i)$, respectively (a small code sketch is given after the second measure below). The neighborhood precision is computed as

(14) $\frac{1}{N} \sum_{i=1}^{N} \frac{\left| N_k^h(i) \cap N_k^l(i) \right|}{k}.$

$k$-NN classification accuracy measures the $k$-nearest neighbor classification accuracy based on the 2D embedding results along with their labels.
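A small sketch of the neighborhood precision in Eq. (14); the parameter names and the assumption that both inputs hold exactly $k$ neighbors per item are ours.

```cpp
#include <set>
#include <vector>

// nbr_high[i] and nbr_low[i] hold the k nearest neighbors of item i in
// the original high-dimensional space and in the 2D embedding, respectively.
double neighborhood_precision(const std::vector<std::vector<int>>& nbr_high,
                              const std::vector<std::vector<int>>& nbr_low) {
    double sum = 0.0;
    const size_t n = nbr_high.size();
    for (size_t i = 0; i < n; ++i) {
        std::set<int> high(nbr_high[i].begin(), nbr_high[i].end());
        int overlap = 0;
        for (int j : nbr_low[i]) overlap += (int)high.count(j);
        sum += (double)overlap / nbr_high[i].size();   // |intersection| / k
    }
    return sum / n;   // average over all N data items, Eq. (14)
}
```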

Method comparisons. Fig. 4 shows the comparison results between PixelSNE-RP and BH-SNE depending on different values of $k$. For the neighborhood precision measure, PixelSNE-RP achieves similar or even better performance compared to BH-SNE (Figs. 4(a) and (b)). We conjecture that this is because the random projection tree used in PixelSNE-RP (Section 4.3) finds more accurate nearest neighbors than the vantage-point tree used in BH-SNE [Tang et al. (2016)]. For the $k$-NN classification accuracy results shown in Figs. 4(c)-(e), the performance gap between the two is small, indicating that our method produces outputs of comparable quality to BH-SNE.

Figure 5: 2D embeddings of the Yelp dataset by (a) PixelSNE-RP (3h 54m) and (b) BH-SNE (18h 16m). The numbers in parentheses indicate the computing time.

Figure 6: 2D embedding of the NewsAgg dataset generated by PixelSNE-RP.
| $B_x (= B_y)$ | $k$ = 1 | $k$ = 3 | $k$ = 5 | $k$ = 10 | $k$ = 20 | $k$ = 30 |
|---|---|---|---|---|---|---|
| 512 | .9390 (.0012) | .9482 (.0023) | .9463 (.0023) | .9428 (.0023) | .9367 (.0033) | .9297 (.0045) |
| 1024 | .9387 (.0017) | .9489 (.0023) | .9475 (.0014) | .9450 (.0014) | .9400 (.0011) | .9335 (.0017) |
| 2048 | .9386 (.0020) | .9489 (.0018) | .9472 (.0011) | .9395 (.0020) | .9450 (.0014) | .9323 (.0015) |
| 4096 | .9380 (.0018) | .9484 (.0007) | .9471 (.0013) | .9437 (.0013) | .9382 (.0014) | .9324 (.0019) |

Table 4: $k$-NN classification accuracy of PixelSNE-RP on 10,000 samples of the MNIST dataset, varying the number of nearest neighbors $k$. The numbers in parentheses represent the standard deviation over ten repetitions.

Effects of the precision parameter. Tables 3 and 4 show the above two measures for PixelSNE-RP with respect to different values of the screen resolution $B_x (= B_y)$. Unlike the $k$-NN classification accuracy, which stays roughly the same regardless of the value of $B_x$, the neighborhood precision consistently increases as $B_x$ gets larger. However, the gap is not significant.

Cost value analysis. Finally, Table 5 compares the cost function value (Eq. (4)) after convergence. Considering the baseline cost value of the random initialization, both BH-SNE and PixelSNE-RP achieved a similar level of the algorithm objective. Also, throughout all the values of $B_x$ tested, this value remains almost the same between BH-SNE and PixelSNE-RP, which indicates that the screen resolution-based precision of PixelSNE has minimal to no impact on achieving the optimal cost value, consistent with the results in Tables 3 and 4.

|  | Random coordinates | BH-SNE | PixelSNE-RP ($B_x$ = 512) | ($B_x$ = 1024) | ($B_x$ = 2048) | ($B_x$ = 4096) |
|---|---|---|---|---|---|---|
| Cost value (Eq. (4)) | 91.580 (.000) | 1.815 (.0071) | 1.820 (.0303) | 1.865 (.0167) | 1.871 (.0113) | 1.868 (.0167) |

Table 5: Comparison of cost function values between BH-SNE and PixelSNE-RP with different screen resolutions $B_x (= B_y)$. The numbers in parentheses represent the standard deviation over ten repetitions.

5.4 Exploratory Analysis

Finally, we present visualization examples for the Yelp and NewsAgg datasets.10 Fig. 5 presents the visualizations generated by PixelSNE-RP and BH-SNE on the Yelp dataset. Different colors represent the ten different labels generated from topic modeling. Even though PixelSNE-RP ran 4.7 times faster than BH-SNE, the visualization results are comparable to each other.

Fig. 6 shows the visualization result on the NewsAgg dataset from PixelSNE-RP. Each data point, which corresponds to a news headline, is labeled with one of four categories, e.g., Business, Science and Technology, etc., denoted by different colors. Although all the headlines are categorized into only four subjects, Fig. 6 reveals additional sub-category information by closely embedding news with similar topics. Each topic is connected to its relevant area as shown in Fig. 6. For example, the headlines belonging to the Health category are roughly divided into three parts, as follows. The topical region “Infectious disease” on the left side consists of headlines with keywords such as “Ebola”, “West Nile virus”, or “Mers”, while the region connected with “Non-infectious disease” has headlines with keywords such as “Breast cancer”, “Obesity”, or “Alzheimer”. The topical area of “Drug” and “Tobacco” has news about health-related lifestyles, e.g., “Mid-life Drinking Problem Doubles Later Memory Issues”. Note that the topical regions of “Infectious disease” and “Non-infectious disease” are closely located compared to the other regions.

Another interesting observation is that news headlines, even from different categories, are closely located to each other as long as they share similar keywords. For example, the topic cluster of “Market” and “Product” in the lower-left region contains news from both Business and Science and Technology. However, it turns out that the news from the Science and Technology category has headlines about the stock prices of electronics companies or the sales of electronic products, which are highly related to the headlines about the stock market and company performance from the Business category.

6 Conclusions

In this paper, we presented a novel idea of exploiting a screen resolution-driven precision as a fundamental framework for significantly accelerating 2D embedding algorithms. Applying this idea to the state-of-the-art method Barnes-Hut SNE, we proposed a novel approach called PixelSNE as a highly efficient alternative.

In the future, we plan to apply our framework to other advanced algorithms such as LargeVis [Tang et al. (2016)]. We also plan to develop a parallel distributed version of PixelSNE to further improve the computational efficiency.

Footnotes

  1. This paper discusses only the 2D embedding case, but our proposed approach can be easily extended to the 3D case.
  2. http://cs.nyu.edu/~roweis/data.html
  3. https://goo.gl/W3Z8qZ
  4. http://qwone.com/~jason/20Newsgroups/
  5. https://github.com/mmihaltz/word2vec-GoogleNews-vectors
  6. https://www.kaggle.com/uciml/news-aggregator-dataset
  7. https://www.yelp.com/dataset_challenge
  8. https://lvdmaaten.github.io/tsne/
  9. https://github.com/lferry007/LargeVis
  10. High-resolution images and other comparison results are available at http://davian.korea.ac.kr/myfiles/jchoo/resource/kdd17pixelsne_appendix.pdf.

References

  1. Josh Barnes and Piet Hut. 1986. A hierarchical O(N log N) force-calculation algorithm. Nature 324, 6096 (1986), 446–449.
  2. Mikhail Belkin and Partha Niyogi. 2003. Laplacian Eigenmaps for Dimensionality Reduction and Data Representation. Neural Computation 15, 6 (2003), 1373–1396.
  3. Christopher M Bishop, Markus Svensén, and Christopher KI Williams. 1998. GTM: The generative topographic mapping. Neural computation 10, 1 (1998), 215–234.
  4. Miguel A Carreira-Perpinán. 2010. The Elastic Embedding Algorithm for Dimensionality Reduction.. In Proc. the International Conference on Machine Learning (ICML). 167–174.
  5. Jaegul Choo and Haesun Park. 2013. Customizing Computational Methods for Visual Analytics with Big Data. IEEE Computer Graphics and Applications (CG&A) 33, 4 (2013), 22–28.
  6. A. Gisbrecht, B. Mokbel, and B. Hammer. 2012. Linear basis-function t-SNE for fast nonlinear dimensionality reduction. In Proc. the International Joint Conference on Neural Networks (IJCNN). 1–8.
  7. Geoffrey E Hinton and Sam T Roweis. 2002. Stochastic neighbor embedding. In Advances in neural information processing systems. 833–840.
  8. G. E. Hinton and R. R. Salakhutdinov. 2006. Reducing the Dimensionality of Data with Neural Networks. Science 313, 5786 (2006), 504–507.
  9. Ian T. Jolliffe. 2002. Principal component analysis. Springer.
  10. T. Kohonen. 2001. Self-organizing maps. Springer.
  11. J. Kruskal. 1964a. Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika 29 (1964), 1–27. Issue 1.
  12. J. Kruskal. 1964b. Nonmetric multidimensional scaling: A numerical method. Psychometrika 29 (1964), 115–129. Issue 2.
  13. D. D. Lee and H. S. Seung. 1999. Learning the parts of objects by non-negative matrix factorization. Nature 401 (1999), 788–791.
  14. John A Lee and Michel Verleysen. 2007. Nonlinear dimensionality reduction. Springer Science & Business Media.
  15. Laurens Maaten. 2009. Learning a Parametric Embedding by Preserving Local Structure. In Proc. the International Conference on Artificial Intelligence and Statistics (AISTATS), Vol. 5. 384–391.
  16. N. Pezzotti, B. Lelieveldt, L. van der Maaten, T. Hollt, E. Eisemann, and A. Vilanova. 2016. Approximated and User Steerable tSNE for Progressive Visual Analytics. IEEE Transactions on Visualization and Computer Graphics (TVCG) PP, 99 (2016), 1–1.
  17. John W. Sammon Jr. 1969. A Nonlinear Mapping for Data Structure Analysis. IEEE Transactions on Computers C-18, 5 (May 1969), 401–409.
  18. L. K. Saul and S. T. Roweis. 2003. Think Globally, Fit Locally: Unsupervised Learning of Low Dimensional Manifolds. Journal of Machine Learning Research (JMLR) 4 (2003), 119–155.
  19. Jian Tang, Jingzhou Liu, Ming Zhang, and Qiaozhu Mei. 2016. Visualizing Large-scale and High-dimensional Data. In Proc. the International Conference on World Wide Web (WWW). 287–297.
  20. Joshua B. Tenenbaum, Vin de Silva, and John C. Langford. 2000. A Global Geometric Framework for Nonlinear Dimensionality Reduction. Science 290, 5500 (2000), 2319–2323. DOI:http://dx.doi.org/10.1126/science.290.5500.2319
  21. Laurens Van Der Maaten. 2014. Accelerating t-SNE using tree-based algorithms. Journal of machine learning research (JMLR) 15, 1 (2014), 3221–3245.
  22. L. van der Maaten and G. Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research (JMLR) 9 (2008), 2579–2605.
  23. Laurens van der Maaten and Geoffrey Hinton. 2012. Visualizing non-metric similarities in multiple maps. Machine Learning 87, 1 (2012), 33–55.
  24. LJP Van der Maaten, EO Postma, and HJ Van den Herik. 2009. Dimensionality reduction: A comparative review. Technical Report TiCC TR 2009-005 (2009).
  25. L. van der Maaten and K. Weinberger. 2012. Stochastic triplet embedding. In IEEE International Workshop on Machine Learning for Signal Processing. 1–6.
  26. Jarkko Venna, Jaakko Peltonen, Kristian Nybo, Helena Aidos, and Samuel Kaski. 2010. Information retrieval perspective to nonlinear dimensionality reduction for data visualization. Journal of Machine Learning Research (JMLR) 11, Feb (2010), 451–490.
  27. Kilian Weinberger and Lawrence Saul. 2006. Unsupervised Learning of Image Manifolds by Semidefinite Programming. International Journal of Computer Vision 70 (2006), 77–90. Issue 1.
  28. Peter N Yianilos. 1993. Data structures and algorithms for nearest neighbor search in general metric spaces. In Proc. the ACM-SIAM Symposium on Discrete Algorithms (SODA). 311–321.