# PixelSNE: Visualizing Fast with Just Enough Precision via Pixel-Aligned Stochastic Neighbor Embedding

## Abstract

Embedding and visualizing large-scale high-dimensional data in a two-dimensional space is an important problem, since such visualization can reveal deep insights from complex data. Most existing embedding approaches, however, run at an excessively high precision, ignoring the fact that, in the end, embedding outputs are mapped into coarse-grained pixel coordinates in a limited screen space. Motivated by this observation and directly considering it in an embedding algorithm, we accelerate Barnes-Hut tree-based t-distributed stochastic neighbor embedding (BH-SNE), known as a state-of-the-art 2D embedding method, and propose a novel alternative called PixelSNE, a highly efficient, screen resolution-driven 2D embedding method with a linear computational complexity in terms of the number of data items. Our experimental results show that PixelSNE runs significantly faster than BH-SNE by a large margin while maintaining comparable embedding quality. Finally, the source code of our method is publicly available at https://github.com/awesome-davian/pixelsne.


## 1 Introduction

Visualizing high-dimensional data in a two-dimensional (2D) or three-dimensional
(3D)^{1}

To generate a scatterplot given high-dimensional data, one can apply various dimension reduction or low-dimensional embedding approaches including traditional methods (e.g., principal component analysis [Jolliffe (2002)] and multidimensional scaling [Kruskal (1964a), Kruskal (1964b)]) and recent manifold learning methods (e.g., isometric feature mapping [Tenenbaum et al. (2000)], locally linear embedding [Saul and Roweis (2003)], and Laplacian Eigenmaps [Belkin and Niyogi (2003)]).

These methods, however, do not properly handle the significant information loss caused by reducing high dimensionality down to two or three. In response, an advanced dimension reduction technique called t-distributed stochastic neighbor embedding (t-SNE) [van der Maaten and Hinton (2008)] has been proposed, showing outstanding advantages in generating 2D scatterplots. A drawback of t-SNE is its significant computing time for a large number of data items, i.e., a computational complexity of $O(N^2)$, where $N$ represents the number of data items. Although various approximation techniques attempting to accelerate the algorithm have been proposed, e.g., Barnes-Hut SNE (BH-SNE) [Van Der Maaten (2014)] with a complexity of $O(N \log N)$, they still take much time on large-scale data.

To tackle this issue, this paper proposes a novel framework that can significantly accelerate 2D embedding algorithms in visualization applications. The proposed framework is motivated by the fact that most embedding approaches compute the low-dimensional coordinates with an excessive precision, e.g., double precision with a 64-bit floating-point representation, compared to the limited precision that our screen space has. For instance, given a screen with a fixed number of pixels along each axis, the coordinates computed by an embedding algorithm are eventually mapped to only that many integer levels of the $x$- and $y$-coordinates, respectively. Moreover, when making sense of a 2D scatterplot, human perception often does not require a high precision from its results [Choo and Park (2013)].

Applying this idea of just enough precision for our screen and human perception to the above-mentioned state-of-the-art method, BH-SNE, we propose a significantly faster alternative called pixel-aligned SNE (PixelSNE), which shows more than a fivefold speedup over BH-SNE on the 421,161 data items of the News aggregator dataset. In detail, by lowering the precision used in BH-SNE and matching it to that of pixels in a screen space, PixelSNE directly optimizes 2D coordinates in the screen space with a pre-given resolution.

In this paper, we describe the mathematical and algorithmic details of how we utilize the idea of a pixel-based precision in BH-SNE and present extensive experimental results showing that PixelSNE runs significantly faster than BH-SNE by a large margin while maintaining the embedding quality.

Generally, our contributions can be summarized as follows:

1. We present a novel framework of a pixel-based precision driven by a screen space with a finite resolution.

2. We propose a highly-efficient 2D embedding approach called PixelSNE by leveraging our idea of a pixel-based precision in BH-SNE.

3. We perform extensive quantitative and qualitative analyses using various real-world datasets, showing the significant speedup of our proposed approach against BH-SNE along with a comparable quality of visualization results.

## 2 Related Work

Dimension reduction or low-dimensional embedding of high-dimensional data [Van der Maaten et al. (2009)] has long been an active research area. Typically, most of these methods attempt to generate the low-dimensional coordinates that maximally preserve the pairwise distances/similarities given in a high-dimensional space. Such low-dimensional outputs generally work for two purposes: (1) generating the new representations of original data for improving the desired performance of a downstream target task, such as its accuracy and/or computational efficiency, and (2) visualizing high-dimensional data in a 2D scatterplot for providing humans with the in-depth understanding and interpretation about data.

Widely used dimension reduction methods for visualization applications include principal component analysis (PCA) [Jolliffe (2002)], multidimensional scaling [Kruskal (1964a), Kruskal (1964b)], Sammon mapping [Sammon (1969)], generative topographic mapping [Bishop et al. (1998)], and self-organizing map [Kohonen (2001)]. While these traditional methods generally focus on preserving global relationships rather than local ones, a class of nonlinear, local dimension reduction techniques called manifold learning [Lee and Verleysen (2007)] has been actively studied, trying to recover an intrinsic curvilinear manifold out of given high-dimensional data. Representative methods are isometric feature mapping [Tenenbaum et al. (2000)], locally linear embedding [Saul and Roweis (2003)], Laplacian Eigenmaps [Belkin and Niyogi (2003)], maximum variance unfolding [Weinberger and Saul (2006)], and autoencoders [Hinton and Salakhutdinov (2006)].

Specifically focusing on visualization applications, a recent method, t-distributed stochastic neighbor embedding [van der Maaten and Hinton (2008)], which is built upon stochastic neighbor embedding [Hinton and Roweis (2002)], has shown its superiority in generating the 2D scatterplots that can reveal meaningful insights about data such as clusters and outliers. Since then, numerous approaches have been proposed to improve the visualization quality and its related performances in the 2D embedding results. For example, a neural network has been integrated with t-SNE to learn the parametric representation of 2D embedding [Maaten (2009)]. Rather than the Euclidean distance or its derived similarity information, other information types such as non-metric similarities [van der Maaten and Hinton (2012)] and relative ordering information about pairwise distances in the form of similarity triplets [van der Maaten and Weinberger (2012)] have been considered as the target information to preserve. Additionally, various other optimization criteria and their optimization approaches, such as elastic embedding [Carreira-Perpinán (2010)] and NeRV [Venna et al. (2010)], have been proposed.

The computational efficiency and scalability of 2D embedding approaches have also been widely studied. An accelerated t-SNE based on the approximation using the Barnes-Hut tree algorithm has been proposed [Van Der Maaten (2014)]. Gisbrecht et al. proposed a linear approximation of t-SNE [Gisbrecht et al. (2012)]. More recently, an approximate but user-steerable t-SNE, which provides interactions with which a user can control the degree of approximation on user-specified areas, has also been studied [Pezzotti et al. (2016)]. In addition, a scalable 2D embedding technique called LargeVis [Tang et al. (2016)] significantly reduced the computing times with a linear time complexity in terms of the number of data items.

Even with such a plethora of 2D embedding approaches, to the best of our knowledge, none of the previous studies has directly exploited the limited precision of our screen space and human perception to develop highly efficient 2D embedding algorithms. Our novel framework of controlling the precision in return for algorithm efficiency, together with the proposed PixelSNE, which significantly improves the efficiency of BH-SNE, is one such example.

## 3 Preliminaries

In this section, we introduce the problem formulation and the two existing methods, t-distributed stochastic neighbor embedding (t-SNE) and its efficient version called Barnes-Hut SNE (BH-SNE).

### 3.1 Problem Formulation

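A 2D embedding method takes a set of high-dimensional data items, $\mathcal{X} = \{x_i \in \mathbb{R}^D\}_{i=1}^{N}$, where $N$ is the number of data items and $D$ is the original dimensionality, and gives their 2D (low-dimensional) embedding, $\mathcal{Y} = \{y_i \in \mathbb{R}^2\}_{i=1}^{N}$, as an output. Given a screen of resolution $r_1 \times r_2$, where $r_1$ and $r_2$, the $x$- and $y$-axis resolutions, respectively, are positive integers, the scatterplot is generated by transforming $\mathcal{Y}$ to their corresponding (zero-based) pixel coordinates $\mathcal{Z} = \{z_i\}_{i=1}^{N}$ with

$$z_{i,d} = \lfloor y_{i,d} \rfloor, \tag{1}$$

$$0 \le z_{i,d} \le r_d - 1 \quad \text{for } d \in \{1, 2\}. \tag{2}$$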

### 3.2 t-Distributed Stochastic Neighbor Embedding (t-SNE)

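t-SNE embeds high-dimensional data into a low-dimensional space by minimizing the differences between the joint probability distribution representing pairwise relationships in $\mathcal{X}$ and its counterpart in $\mathcal{Y}$. In detail, t-SNE computes the Euclidean pairwise distance matrix of $\mathcal{X}$ and then converts it to the high-dimensional joint probability matrix $P$ using a Gaussian kernel.

Next, given a randomly initialized $\mathcal{Y}$, t-SNE computes its Euclidean pairwise distance matrix and then converts it to a low-dimensional joint probability matrix $Q$ using a Student's $t$-distribution. To be specific, the $(i, j)$-th component $q_{ij}^{(t)}$ of $Q$ at iteration $t$, which represents the similarity between $y_i$ and $y_j$ in a probabilistic sense, is computed as

$$q_{ij}^{(t)} = \frac{\left(1 + \|y_i^{(t)} - y_j^{(t)}\|_2^2\right)^{-1}}{Z^{(t)}}, \tag{3}$$

where $Z^{(t)} = \sum_{k \neq l} \left(1 + \|y_k^{(t)} - y_l^{(t)}\|_2^2\right)^{-1}$.

The objective function of t-SNE is defined using the Kullback-Leibler divergence between $P$ and $Q$ as

$$C = \mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}, \tag{4}$$

and t-SNE iteratively performs a gradient-descent update on $y_i$, where the gradient with respect to $y_i$ is computed [Van Der Maaten (2014)] as

$$\frac{\partial C}{\partial y_i} = 4 \left( F_{\mathrm{attr}} - F_{\mathrm{rep}} \right) = 4 \left( \sum_{j \neq i} p_{ij} q_{ij} Z \, (y_i - y_j) - \sum_{j \neq i} q_{ij}^2 Z \, (y_i - y_j) \right), \tag{5}$$

with the iteration superscript $(t)$ omitted in Eqs. (4) and (5) for brevity.

Note that every data point exerts an attractive and a repulsive force on every other data point, based on the difference between the two pairwise joint probability matrices $P$ and $Q$.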
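As an illustration, the quantities in Eqs. (3)-(5) can be sketched in plain Python as follows. This is a minimal, exact $O(N^2)$ sketch and not the authors' implementation; the function name `tsne_q_and_grad` and the list-based data layout are assumptions made for this example, and the matrix `P` is assumed to be symmetric, to have a zero diagonal, and to sum to one.

```python
import math


def tsne_q_and_grad(P, Y):
    """Exact low-dimensional similarities, KL cost, and gradient of t-SNE.

    P: N x N list of lists of joint probabilities (hypothetical input layout).
    Y: list of N (x, y) tuples holding the current 2D embedding.
    """
    n = len(Y)
    # Unnormalized Student-t kernel w_ij = (1 + ||y_i - y_j||^2)^-1.
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                dx = Y[i][0] - Y[j][0]
                dy = Y[i][1] - Y[j][1]
                W[i][j] = 1.0 / (1.0 + dx * dx + dy * dy)
    Z = sum(sum(row) for row in W)            # normalization term
    Q = [[w / Z for w in row] for row in W]   # low-dimensional similarities

    # Kullback-Leibler divergence KL(P || Q); zero entries of P contribute 0.
    cost = sum(P[i][j] * math.log(P[i][j] / Q[i][j])
               for i in range(n) for j in range(n)
               if i != j and P[i][j] > 0.0)

    # Gradient: 4 * sum_j (p_ij - q_ij) * q_ij * Z * (y_i - y_j),
    # where q_ij * Z equals the unnormalized kernel w_ij.
    grad = []
    for i in range(n):
        gx = gy = 0.0
        for j in range(n):
            if i != j:
                coef = 4.0 * (P[i][j] - Q[i][j]) * W[i][j]
                gx += coef * (Y[i][0] - Y[j][0])
                gy += coef * (Y[i][1] - Y[j][1])
        grad.append((gx, gy))
    return Q, cost, grad
```

The double loop over all pairs makes the quadratic cost of the brute-force approach explicit; the $(p_{ij} - q_{ij})$ factor combines the attractive and the repulsive contributions in a single coefficient.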

### 3.3 Barnes-Hut-SNE (BH-SNE)

While t-SNE adopts a brute-force approach with the computational complexity of $O(N^2)$, considering all the pairwise relationships, BH-SNE adopts two different tree-based approximation methods to reduce this complexity. The first one, the vantage-point tree [Yianilos (1993)], approximately computes the high-dimensional pairwise distance matrix and then $P$ as sparse matrices by treating small pairwise similarities as zeros (Fig. 1(a)). This approximation reduces the complexity of computing these matrices to $O(uN \log N)$, where $u$ is a pre-defined perplexity parameter and $u \ll N$. Accordingly, involving only the nonzero $p_{ij}$'s in the sparse matrix $P$, the complexity of computing the attractive term $F_{\mathrm{attr}}$ in Eq. (5) reduces to $O(uN)$.

When it comes to optimizing the low-dimensional coordinates (Fig. 1(b)), BH-SNE adopts the Barnes-Hut algorithm [Barnes and Hut (1986)] to compute the repulsive term $F_{\mathrm{rep}}$ in Eq. (5). When $\|y_i - y_j\| \approx \|y_i - y_k\| \gg \|y_j - y_k\|$, the forces that $y_j$ and $y_k$ exert on $y_i$ are similar to each other when computing the gradient. Based on this observation, BH-SNE uses the Barnes-Hut algorithm to find a single representative point for multiple data points used in the gradient update. For example, if we set the representative data point of $y_j$ and $y_k$ as $y_c$, then the low-dimensional joint probability $q_{ic}$ between $y_i$ and $y_c$ can be used to substitute for each of $q_{ij}$ and $q_{ik}$.

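To dynamically obtain the representative points $y_c$'s, BH-SNE constructs a quadtree at each iteration, recursively splitting the 2D region that contains the $y_i$'s into four pieces. Each node, which we call a cell $c$, contains the following information:

1. the boundary of its corresponding 2D region, i.e.,

$$\mathcal{B}_c = \left\{ \left( b_{c,d}^{\min},\, b_{c,d}^{\max} \right) \right\}_{d \in \{1, 2\}}, \tag{6}$$

2. the set of $y_i$'s contained in $c$, i.e.,

$$\mathcal{Y}_c = \left\{ y_i \in \mathcal{Y} : b_{c,d}^{\min} \le y_{i,d} < b_{c,d}^{\max} \ \text{for } d \in \{1, 2\} \right\}, \tag{7}$$

and its cardinality, $N_c = |\mathcal{Y}_c|$, and

3. the center of mass, $y_c$, of the $y_i$'s in $c$, i.e.,

$$y_c = \frac{1}{N_c} \sum_{y_i \in \mathcal{Y}_c} y_i. \tag{8}$$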

Given $\mathcal{Y}^{(t)}$ at iteration $t$, BH-SNE starts by forming a root node $c_{\mathrm{root}}$, containing all the $y_i$'s, by setting $\mathcal{Y}_{c_{\mathrm{root}}} = \mathcal{Y}^{(t)}$ in Eq. (7) and

$$b_{c_{\mathrm{root}},d}^{\min} = \min_i y_{i,d}^{(t)}, \tag{9}$$

$$b_{c_{\mathrm{root}},d}^{\max} = \max_i y_{i,d}^{(t)} + \epsilon \tag{10}$$

in Eq. (6), where $\epsilon$ is a small number. BH-SNE then recursively splits a cell into four (equally sized) quadrant cells located at “northwest”, “northeast”, “southwest”, and “southeast” by setting their boundaries accordingly, assigning the $y_i$'s to these child cells based on the boundary information in Eq. (7), and computing their centers of mass. As the $y_i$'s are assigned one by one to the tree, the quadtree grows until each leaf node contains at most a single $y_i$.

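Afterwards, when computing the gradient with respect to $y_i$ in Eq. (5) (Fig. 1(b-3)), BH-SNE traverses the quadtree via depth-first search to determine whether the center of mass $y_c$ of a cell $c$ can work as the representative point of those contained in $c$ based on the criterion

$$\frac{\ell_c}{\|y_i - y_c\|_2^2} < \theta, \tag{11}$$

where $\ell_c$ represents the diagonal length of the region of $c$ and $\theta$ is a threshold parameter. Once a cell $c$ satisfying this criterion is found, the term $q_{ij}^2 Z \, (y_i - y_j)$ in Eq. (5) for those points $y_j$ contained in $c$ is approximated as $N_c \, q_{ic}^2 Z \, (y_i - y_c)$, where $N_c$ is the number of points in $c$ and $q_{ic}$ is the similarity between $y_i$ and $y_c$, thus reducing the computational complexity of the repulsive term $F_{\mathrm{rep}}$ in Eq. (5) to $O(N \log N)$.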
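The data-driven quadtree and the traversal described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the BH-SNE code: the names `Cell` and `approx_terms` are hypothetical, all points are assumed to be distinct and to lie inside the root cell, and the acceptance test uses the cell diagonal divided by the squared distance, which is our reading of the criterion in Eq. (11).

```python
import math


class Cell:
    """Quadtree cell: square region, point count, and center of mass."""

    def __init__(self, x0, y0, size):
        self.x0, self.y0, self.size = x0, y0, size
        self.n = 0                 # number of points inside the cell
        self.cx = self.cy = 0.0    # center of mass of those points
        self.point = None          # the single point held by a leaf
        self.children = None       # four child cells once the cell is split

    def _child_for(self, p):
        half = self.size / 2.0
        ix = 1 if p[0] >= self.x0 + half else 0
        iy = 1 if p[1] >= self.y0 + half else 0
        return self.children[iy * 2 + ix]

    def insert(self, p):
        # Update the running center of mass and the count.
        self.cx = (self.cx * self.n + p[0]) / (self.n + 1)
        self.cy = (self.cy * self.n + p[1]) / (self.n + 1)
        self.n += 1
        if self.n == 1:
            self.point = p
            return
        if self.children is None:
            # Split into four equally sized quadrants and push the old point down.
            half = self.size / 2.0
            self.children = [Cell(self.x0 + dx * half, self.y0 + dy * half, half)
                             for dy in (0, 1) for dx in (0, 1)]
            old, self.point = self.point, None
            self._child_for(old).insert(old)
        self._child_for(p).insert(p)


def approx_terms(cell, yi, theta):
    """Return (sum of w, x and y parts of sum of w^2 * (y_i - y_c)) over a cell,
    with w = (1 + ||y_i - y_c||^2)^-1, i.e., the unnormalized repulsive terms."""
    if cell.n == 0 or (cell.children is None and cell.point == yi):
        return 0.0, 0.0, 0.0       # empty cell, or the query point itself
    dx, dy = yi[0] - cell.cx, yi[1] - cell.cy
    d2 = dx * dx + dy * dy
    diag = cell.size * math.sqrt(2.0)
    if cell.children is None or (d2 > 0.0 and diag / d2 < theta):
        # Use the center of mass as the representative of all points in the cell.
        w = 1.0 / (1.0 + d2)
        m = cell.n * w
        return m, m * w * dx, m * w * dy
    z = fx = fy = 0.0
    for child in cell.children:
        cz, cfx, cfy = approx_terms(child, yi, theta)
        z, fx, fy = z + cz, fx + cfx, fy + cfy
    return z, fx, fy
```

With `theta = 0.0` the criterion never accepts an internal cell, so the traversal reaches every leaf and reproduces the exact sums; larger values trade accuracy for fewer visited cells.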

## 4 Pixel-Aligned SNE (PixelSNE)

In this section, we present our PixelSNE, which significantly reduces the computational time of BH-SNE.

### 4.1 Pixel-Aligned Barnes-Hut Tree

A major novelty of PixelSNE originates from the fact that it directly considers the screen-space (pixel) coordinates (Eq. (1)) instead of the raw embedding coordinates. That is, PixelSNE utilizes two properties of the screen-space coordinates during its optimization process: (1) their range remains fixed by the screen resolution along each axis throughout the algorithm iterations (Eq. (2)), and (2) their precision is limited by the screen-space resolution. Utilizing these characteristics, we accelerate the Barnes-Hut tree construction as well as the assignment of the $y_i$'s to cells as follows. For clarity, we call the accelerated quadtree algorithm of PixelSNE a pixel-aligned quadtree (P-Quadtree) and the original quadtree algorithm of BH-SNE a data-driven quadtree (D-Quadtree).

One-time tree construction (instead of every iteration). Unlike BH-SNE, which builds a quadtree from scratch at every iteration, the above-mentioned properties allow PixelSNE to build P-Quadtree just once before the main iterations of the gradient-descent update and to reuse it during the iterations. That is, PixelSNE pre-computes the full information about (1) the boundary (Eq. (6)) of each cell and (2) its center of mass (Eq. (8)) as follows.

For the boundary information, we utilize the fact that, since the coordinates along each axis $d \in \{1, 2\}$ span the same fixed range at every iteration (Eq. (2)), Eqs. (9) and (10) boil down to a root cell whose boundary along axis $d$ is the constant interval from $0$ to $r_d$, where $r_1 \times r_2$ is the screen resolution. This boundary is no longer dependent on the iteration $t$. The constant boundary of the root node makes those of all the child cells in P-Quadtree constant as well. This results in a fixed depth of P-Quadtree as long as the minimum size of the cell is determined, which will be discussed in detail later in this sub-section. Based on this idea, PixelSNE pre-computes the boundaries of all the cells in P-Quadtree.

Next, since the boundary of each cell is already determined, we simply approximate the center of mass as the center location of the cell's region, i.e., the midpoint of its boundary along each axis, which is also not dependent on the iteration $t$.

Once the pre-computation of the above information (Fig. 1(c-1)) finishes, the iterative gradient optimization in PixelSNE simply assigns the $y_i$'s to these pre-computed cells in P-Quadtree and updates the set of points contained in each cell and its cardinality (Fig. 1(c-2)). Note that, in contrast, BH-SNE recomputes all these steps at every iteration (Figs. 1(b-1) and (b-2)), which is its critical inefficiency compared to PixelSNE.

Bounded tree depth based on screen resolution. Considering a typical screen resolution, BH-SNE performs an excessively precise computation. That is, when mapping D-Quadtree onto the pixel grid of our screen, one can see that BH-SNE subdivides the pixels, the minimum unit of the screen, into much smaller cells until each leaf cell contains at most one data point (Fig. 1(T-1)).

In contrast, the minimum size of a cell in P-Quadtree is bounded by the pixel size to avoid an excessive precision in computation (Fig. 1(T-2)). In detail, P-Quadtree keeps splitting a cell as long as the cell length is larger than the unit pixel size for at least one of the two axes. Since every split halves the cell length along both axes, the depth of P-Quadtree is bounded by

$$\left\lceil \log_2 \max(r_1, r_2) \right\rceil,$$

where $r_1 \times r_2$ is the screen resolution.

Such a bounded depth of P-Quadtree may cause a leaf node to contain multiple data points, and in the gradient-descent update step, we may not find a cell satisfying Eq. (11) even after reaching a leaf node in the depth-first search traversal. In this case, we simply stop the traversal and use the center of mass of the corresponding leaf node to approximately compute the repulsive term in Eq. (5) for those points contained in the node.
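To make the pixel-aligned construction concrete, the sketch below assumes a square screen whose side is a power of two, so that the cell containing a point at any level can be read directly from the integer pixel coordinates. All function names are hypothetical, and the code illustrates the idea of fixed, pre-computable cells rather than the actual PixelSNE implementation.

```python
import math


def pquadtree_depth(r1, r2):
    """Number of halvings until every cell side is at most one pixel,
    i.e., ceil(log2(max(r1, r2))) computed with integer arithmetic."""
    return (max(r1, r2) - 1).bit_length()


def cell_index(y, depth, level):
    """Grid index (column, row) of the level-`level` cell containing point y.

    Assumes a square screen of side 2**depth; level 0 is the root and
    level `depth` consists of one-pixel leaf cells.
    """
    px, py = int(math.floor(y[0])), int(math.floor(y[1]))
    shift = depth - level
    return px >> shift, py >> shift


def cell_center(index, depth, level):
    """Center of a cell; it depends only on the grid, never on the data,
    so it can be computed once before the iterations start."""
    side = 1 << (depth - level)
    return (index[0] + 0.5) * side, (index[1] + 0.5) * side


def count_points(Y, depth):
    """Per-iteration step: assign points to the fixed cells and count them."""
    counts = [dict() for _ in range(depth + 1)]
    for y in Y:
        for level in range(depth + 1):
            idx = cell_index(y, depth, level)
            counts[level][idx] = counts[level].get(idx, 0) + 1
    return counts
```

For a 512 x 512 screen, `pquadtree_depth` gives 9, and only `count_points` has to be repeated at each iteration, since the boundaries and centers never change.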

Finally, let us clarify that we maintain the $y_i$'s as double-precision numbers just as BH-SNE does; our idea of a limited precision is realized mainly by the two techniques described above.

Computational complexity. The depth of the Barnes-Hut tree is a critical factor in algorithm efficiency, since both the assignment of data points to cells and the approximation of the repulsive term (Eq. (5)) are performed via depth-first search traversal, the computational complexity of which is linear in the tree depth. In the worst case, where a majority of data points are located in a relatively small region of the 2D space and a few outliers are located far from them, D-Quadtree can grow much deeper than P-Quadtree, whose depth depends only on the screen resolution. Owing to this characteristic, the overall computational complexity of PixelSNE is obtained as $O(N \log \max(r_1, r_2))$, which reduces to a linear computational complexity in terms of the number of data items $N$, given fixed screen resolutions $r_1$ and $r_2$.
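The worst case mentioned above can be reproduced with a small experiment. The sketch below, whose function names and example points are hypothetical, computes the depth that a data-driven quadtree needs in order to separate every point and compares it with the pixel-based bound for a 512 x 512 screen.

```python
def dquadtree_depth(points, x0, y0, size, depth=0):
    """Depth of a data-driven quadtree that keeps splitting a square cell
    until each leaf holds at most one point (points assumed distinct)."""
    if len(points) <= 1:
        return depth
    half = size / 2.0
    deepest = depth
    for dy in (0, 1):
        for dx in (0, 1):
            cx0, cy0 = x0 + dx * half, y0 + dy * half
            inside = [p for p in points
                      if cx0 <= p[0] < cx0 + half and cy0 <= p[1] < cy0 + half]
            deepest = max(deepest,
                          dquadtree_depth(inside, cx0, cy0, half, depth + 1))
    return deepest


def pixel_depth_bound(r):
    """Depth bound of a pixel-aligned quadtree on an r x r screen."""
    return (r - 1).bit_length()


# Two nearly coincident points plus one far-away outlier on a 512 x 512 screen.
points = [(100.25, 100.25), (100.25 + 2.0 ** -10, 100.25), (400.0, 400.0)]
data_driven = dquadtree_depth(points, 0.0, 0.0, 512.0)
pixel_aligned = pixel_depth_bound(512)
```

Here the two close points share a pixel, so the data-driven tree keeps subdividing far below the pixel size (depth 19 in this example), while the pixel-aligned tree stops at depth 9 regardless of how the points are distributed.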

Intuitively, BH-SNE traverses much finer-grained nodes to obtain the excessively precise differentiation among data points. On the other hand, PixelSNE assumes that as long as multiple data points are captured within a single pixel, they are close enough to be represented as the center of mass for their visualization in a screen. This guarantees the depth of P-Quadtree to be limited by the screen resolution, instead of being influenced by the undesirable non-uniformity of the 2D embedding results during algorithm iterations.

### 4.2 Screen-Driven Scaling

In order for the outputs ’s at every iteration to satisfy Eq. (2) after their gradient-descent update, we scale them as follows. Let us denote as the 2D coordinates computed at iteration and as its updated coordinates at the next iteration via a gradient-descent method. Note that satisfies Eq. (2) while does not. Hence, we normalize each of the 2D coordinates of and obtain as

where and is a small constant, e.g., . The reason for introducing is to have the integer-valued 2D pixel coordinates in Eq. (1) lie exactly between and for . For example, for a resolution, we impose the pixel coordinates at each iteration to exactly have the range from to and that from to for - and -coordinates, respectively.
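A minimal sketch of such a screen-driven scaling step is given below. It assumes a per-axis min-max normalization with a small constant `eps` in the denominator; the exact formula, the function name `scale_to_screen`, and the default value of `eps` are assumptions made for illustration and are not taken from the PixelSNE code.

```python
import math


def scale_to_screen(Y, r, eps=1e-6):
    """Rescale each axis of Y into [0, r[d]) and floor to pixel coordinates.

    Y: list of (x, y) tuples after a gradient-descent update.
    r: (r1, r2) screen resolution.
    eps: small constant (assumed value) that keeps the maximum strictly
         below r[d], so that the floored pixel index never reaches r[d].
    """
    lo = [min(y[d] for y in Y) for d in (0, 1)]
    hi = [max(y[d] for y in Y) for d in (0, 1)]
    scaled = [tuple(r[d] * (y[d] - lo[d]) / (hi[d] - lo[d] + eps)
                    for d in (0, 1))
              for y in Y]
    pixels = [(int(math.floor(s[0])), int(math.floor(s[1]))) for s in scaled]
    return scaled, pixels
```

Because of `eps`, the largest coordinate along axis `d` maps just below `r[d]`, so the resulting pixel indices span exactly `0` to `r[d] - 1`.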

This scaling step, however, affects the computation of the low-dimensional pairwise probability matrix $Q$, since it scales each pairwise Euclidean distance, $\|y_i - y_j\|_2$, in Eq. (3), resulting in a different probability distribution from the case with no scaling. To compensate for this scaling effect in computing $Q$, we re-define $q_{ij}$ with respect to the scaled $y_i$'s as

(12) |

where and is defined as

(13) |

where , since is randomly initialized from . By introducing defined in this manner, PixelSNE uses Eq. (12), which is still equivalent to Eq. (3).

### 4.3 Accelerating Construction of $P$

To accelerate the process of constructing the matrix $P$, we adopt a recently proposed, highly efficient algorithm for constructing the approximate $K$-nearest neighbor graph (K-NNG) [Tang et al. (2016)]. This algorithm builds on top of the classical state-of-the-art algorithm based on random-projection trees but significantly improves its efficiency. It first starts with a few random projection trees to construct an approximate K-NNG. Afterwards, based on the intuition that “the neighbors of my neighbors are also likely to be my neighbors,” the algorithm iteratively improves the accuracy of the K-NNG by exploring the neighbors of neighbors defined according to the current K-NNG. In our experiments, we show that this algorithm is indeed able to significantly accelerate the process of constructing $P$.
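The neighbor-exploring refinement can be sketched as follows. For brevity, the initial graph here is drawn at random as a stand-in for the random projection trees, and the names `random_knn` and `refine_knn` are hypothetical; this is an illustration of the idea, not the LargeVis implementation.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def random_knn(n, k, seed=0):
    """Crude initial graph: k random neighbors per point
    (stand-in for the random projection trees)."""
    rng = random.Random(seed)
    return [rng.sample([j for j in range(n) if j != i], k) for i in range(n)]


def refine_knn(X, knn, k, n_iter=3):
    """Improve an approximate k-NN graph by exploring neighbors of neighbors."""
    for _ in range(n_iter):
        new_knn = []
        for i in range(len(X)):
            # Candidates: current neighbors plus the neighbors of those neighbors.
            candidates = set(knn[i])
            for j in knn[i]:
                candidates.update(knn[j])
            candidates.discard(i)
            # Keep the k closest candidates.
            new_knn.append(
                sorted(candidates, key=lambda j: sq_dist(X[i], X[j]))[:k])
        knn = new_knn
    return knn
```

Since the current neighbors always remain among the candidates, each pass can only keep or shorten the distances from a point to its listed neighbors.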

### 4.4 Implementation Improvements

The implementation of PixelSNE is built upon that of BH-SNE publicly available at https://github.com/lvdmaaten/bhtsne. Other than the above algorithmic improvements, we made implementation improvements as follows.

First, the Barnes-Hut tree involves many divisions by two (or by powers of two). We replaced such computations with more efficient bitwise shift operations. Similarly, we replaced modulo-two operations, which are used in assigning data points to cells, with bitwise masking with 1. Finally, we replaced the `pow` function from the `math` library in C/C++, which is used for squared-distance computation, with a single multiplication, e.g., `x * x` instead of `pow(x, 2)`. Although not significant, these implementation modifications yielded consistently faster computing times than the case without them.
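The three replacements rest on simple identities, illustrated below in Python for consistency with the other sketches in this paper. The function names are hypothetical, the integer identities are stated for non-negative integers (halving corresponds to a right shift, while multiplying by a power of two corresponds to a left shift), and the snippet demonstrates equivalence only; the speed benefit applies to the C/C++ implementation.

```python
def halve(n):
    """Integer division by two via a right shift (non-negative n)."""
    return n >> 1


def is_odd(n):
    """Modulo two via masking the lowest bit."""
    return n & 1


def squared_distance(a, b):
    """Squared Euclidean distance in 2D using plain multiplications
    instead of a generic power function."""
    dx, dy = a[0] - b[0], a[1] - b[1]
    return dx * dx + dy * dy
```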

## 5 Experiments

In this section, we present the quantitative analyses as well as qualitative visualization examples of PixelSNE in comparison with original t-SNE and BH-SNE. All the experiments were conducted on a single desktop computer with Intel Core i7-6700K processors, 32GB RAM, and Ubuntu 16.04.1 LTS installed.

Table 1: Computing times of the compared algorithms on each dataset (P: constructing the pairwise probability matrix; Coord: optimizing the low-dimensional coordinates; Total: total computing time).

| Method | 20News P | 20News Coord | 20News Total | FacExp P | FacExp Coord | FacExp Total | MNIST P | MNIST Coord | MNIST Total | NewsAgg P | NewsAgg Coord | NewsAgg Total | Yelp P | Yelp Coord | Yelp Total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| t-SNE | 2.43m | 36.87m | 39.31m | 6.48m | 84.94m | 91.41m | (-) | (-) | (-) | (-) | (-) | (-) | (-) | (-) | (-) |
| | (0.98s) | (55.08s) | (56.06s) | (0.23s) | (10.12s) | (10.13s) | | | | | | | | | |
| BH-SNE | 22.25s | 156.70s | 178.95s | 0.19m | 5.48m | 5.56m | 5.10m | 13.93m | 19.03m | 3.96h | 1.81h | 5.77h | 12.44h | 5.83h | 18.27h |
| | (1.55s) | (6.62s) | (8.16s) | (0.84s) | (40.73s) | (39.89s) | (10.84s) | (2.27m) | (2.45m) | (7.05m) | (4.04m) | (11.45m) | (19.29m) | (17.93m) | (37.22m) |
| PixelSNE-VP | 14.68s | 70.71s | 85.39s | 0.14m | 2.04m | 2.19m | 3.76m | 6.17m | 9.92m | 2.84h | 0.63h | 3.46h | 8.89h | 3.11h | 12.00h |
| | (1.46s) | (1.58s) | (3.03s) | (0.66s) | (3.52s) | (4.16s) | (16.48s) | (11.92s) | (28.37s) | (5.49m) | (1.04m) | (6.52m) | (14.26m) | (7.47m) | (21.71m) |
| PixelSNE-RP | 15.53s | 72.17s | 87.70s | 0.32m | 2.04m | 2.36m | 1.40m | 6.05m | 7.45m | 0.30h | 0.63h | 0.94h | 0.76h | 3.14h | 3.90h |
| | (0.49s) | (1.44s) | (1.93s) | (1.44s) | (3.57s) | (4.99s) | (2.22s) | (7.81s) | (9.86s) | (0.43m) | (2.08m) | (2.50m) | (0.97m) | (10.23m) | (11.20m) |

### 5.1 Experimental Setup

This section describes our experimental setup including datasets and compared methods.

#### Datasets

Our experiments used five real-world datasets as follows: (1) MNIST digit images, (2) facial expression images (FacExp), (3) 20 Newsgroups documents (20News), (4) the News aggregator dataset (NewsAgg), and (5) Yelp reviews.

MNIST ^{2}

FacExp ^{3}

20News ^{4}^{5}

NewsAgg ^{6}

Yelp ^{7}

For all datasets, we reduced the dimensionality, i.e., the number of rows of each input matrix, to 50 by applying PCA, which is a standard pre-processing step of t-SNE and BH-SNE.

#### Compared Methods

We compared our methods against the original t-SNE and BH-SNE. For both methods, we used the publicly available code written by the original author.^{8}

For our PixelSNE, we used two different versions depending on the algorithm for constructing $P$: (1) the vantage-point tree (PixelSNE-VP) used originally in BH-SNE and (2) the random projection tree-based approach (PixelSNE-RP) used in LargeVis [Tang et al. (2016)] (Section 4.3). For the latter, we extracted the corresponding part, called $K$-nearest neighbor construction, from the publicly available code.^{9}

### 5.2 Computing Time Analysis

Table 1 compares the computing times of the various algorithms on the different datasets. We report the computing times of two sub-routines as well as their total: (1) constructing the original pairwise probability matrix $P$ (P), (2) optimizing the low-dimensional coordinates (Coord), and (3) the total computing time (Total). For fair comparisons, the minor improvements presented in Section 4.4 are applied to neither PixelSNE-VP nor PixelSNE-RP. In addition, due to its significant computing time, the results of t-SNE are excluded for large datasets such as the Yelp and NewsAgg datasets.

Comparison results. In terms of the total computing time, PixelSNE-VP and PixelSNE-RP consistently outperform t-SNE and BH-SNE by a large margin in all cases. For example, for the Yelp dataset, BH-SNE took more than 18 hours, while PixelSNE-VP and PixelSNE-RP took 12 hours and less than 4 hours, respectively. For the part of optimizing the low-dimensional coordinates (Coord), where we mainly applied our screen resolution-driven precision, PixelSNE-RP and PixelSNE-VP both show a significant performance boost over BH-SNE. For instance, PixelSNE-VP and PixelSNE-RP compute this part nearly three times faster than BH-SNE for the NewsAgg dataset (0.63 hours versus 1.81 hours). When it comes to the part of constructing $P$ (P), as the size of the data gets larger, e.g., for the NewsAgg and Yelp datasets, PixelSNE-RP runs significantly faster than PixelSNE-VP owing to the random projection tree adopted in PixelSNE-RP (Section 4.3).
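For reference, the total-time speedups over BH-SNE on the two largest datasets can be recomputed directly from the values reported in Table 1. The snippet below is only a worked calculation; the dictionary layout and the helper `speedup` are assumptions made for this example.

```python
# Total computing times in hours, copied from Table 1.
total_hours = {
    "NewsAgg": {"BH-SNE": 5.77, "PixelSNE-VP": 3.46, "PixelSNE-RP": 0.94},
    "Yelp": {"BH-SNE": 18.27, "PixelSNE-VP": 12.00, "PixelSNE-RP": 3.90},
}


def speedup(times, method, baseline="BH-SNE"):
    """Ratio of the baseline time to the time of the given method."""
    return times[baseline] / times[method]


speedups = {
    dataset: {method: speedup(times, method)
              for method in ("PixelSNE-VP", "PixelSNE-RP")}
    for dataset, times in total_hours.items()
}
```

For PixelSNE-RP, this gives a speedup of roughly 6.1 on NewsAgg and roughly 4.7 on Yelp.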

Scalability due to the data size. Next, Fig. 2 shows the computing times depending on the total number of data items, sampled from NewsAgg dataset. As the data size gets larger, our PixelSNE-RP runs increasingly faster than BH-SNE as well as t-SNE, which shows the promising scalability of PixelSNE.

Effects of the precision parameter. Fig. 3 shows the computing time depending on the screen resolution parameter $r$ for 30,000 and 50,000 random samples from the MNIST dataset. Although PixelSNE-RP is consistently faster than BH-SNE, its running time tends to grow as $r$ increases. Nonetheless, our method still runs much faster, e.g., with around a twofold speedup over BH-SNE, even with a fairly large value of $r$, i.e., 4,096, which corresponds to a screen of more than 16 million pixels.

Effects of implementation improvements. We compared the computing time between PixelSNE-RP and the improved PixelSNE-RP, which adopts the efficient operations proposed in Section 4.4. Table 2 shows that the improved version consistently brings a speed improvement of roughly 4-5% over the original PixelSNE-RP for all the datasets. Interestingly, our improvement also reduces the variance of the computing times.

Table 2: Total computing times of PixelSNE-RP with and without the implementation improvements of Section 4.4.

| Method | 20News | FacExp | MNIST | NewsAgg | Yelp |
|---|---|---|---|---|---|
| Improved PixelSNE-RP (Sec. 4.4) | 83.26s | 134.43s | 429.08s | 3194.86s | 13485.27s |
| | (0.19s) | (0.21s) | (9.05s) | (38.99s) | (51.1s) |
| PixelSNE-RP | 87.7s | 141.54s | 446.88s | 3370.36s | 14025.12s |
| | (1.93s) | (4.99s) | (9.86s) | (150.04s) | (671.84s) |

### 5.3 Embedding Quality Analysis

Table 3: Neighborhood precision of PixelSNE-RP for different screen resolutions $r$ and numbers of nearest neighbors $k$.

| $r$ | $k$ = 1 | $k$ = 3 | $k$ = 5 | $k$ = 10 | $k$ = 20 | $k$ = 30 |
|---|---|---|---|---|---|---|
| 512 | .2401 | .3207 | .3475 | .3719 | .3916 | .4237 |
| | (.0125) | (.0117) | (.0098) | (.0064) | (.0033) | (.0023) |
| 1024 | .2405 | .3165 | .3434 | .3702 | .3932 | .4290 |
| | (.0084) | (.0077) | (.0061) | (.0037) | (.0023) | (.0011) |
| 2048 | .2463 | .3216 | .3481 | .3750 | .3977 | .4328 |
| | (.0056) | (.0045) | (.0046) | (.0030) | (.0022) | (.0008) |
| 4096 | .2517 | .3272 | .3532 | .3782 | .4006 | .4344 |
| | (.0054) | (.0052) | (.0046) | (.0027) | (.0018) | (.0009) |

Evaluation measures. To analyze the embedding quality we adopted the two following measures:

Neighborhood precision measures how many of the original $k$ nearest neighbors in the high-dimensional space are captured among the $k$ nearest neighbors in the 2D embedding results. In detail, let us denote the $k$ nearest neighbors of data item $i$ in the high-dimensional space and those in the low-dimensional (2D) space as $\mathcal{N}_k(x_i)$ and $\mathcal{N}_k(y_i)$, respectively. The neighborhood precision is computed as

$$\frac{1}{Nk} \sum_{i=1}^{N} \left| \mathcal{N}_k(x_i) \cap \mathcal{N}_k(y_i) \right|. \tag{14}$$

$k$-NN classification accuracy measures the $k$-nearest neighbor classification accuracy based on the 2D embedding results along with their labels.
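Both measures can be sketched with a brute-force neighbor search as follows. The function names are hypothetical, and the leave-one-out protocol and the alphabetical tie-breaking in `knn_accuracy` are assumptions made for this illustration, since the exact evaluation protocol is not spelled out here.

```python
def knn_indices(points, i, k):
    """Indices of the k nearest neighbors of points[i] (brute force)."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: sum((a - b) ** 2
                                  for a, b in zip(points[i], points[j])))
    return others[:k]


def neighborhood_precision(X, Y, k):
    """Average fraction of the high-dimensional k nearest neighbors
    that also appear among the k nearest neighbors in the embedding."""
    n = len(X)
    hits = sum(len(set(knn_indices(X, i, k)) & set(knn_indices(Y, i, k)))
               for i in range(n))
    return hits / (n * k)


def knn_accuracy(Y, labels, k):
    """Leave-one-out k-NN classification accuracy on the 2D embedding
    (majority vote; ties broken by the smallest label)."""
    correct = 0
    for i in range(len(Y)):
        votes = [labels[j] for j in knn_indices(Y, i, k)]
        predicted = max(sorted(set(votes)), key=votes.count)
        correct += predicted == labels[i]
    return correct / len(Y)
```

Both functions return a value between 0 and 1; the brute-force search is quadratic in the number of items and is meant for small illustrative inputs only.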

Method comparisons. Fig. 4 shows the comparison results between PixelSNE-RP and BH-SNE for different $k$ values. For the neighborhood precision measure, PixelSNE-RP achieves similar or even better performance compared to BH-SNE (Figs. 4(a) and (b)). We conjecture that this is because the random projection tree used in PixelSNE-RP (Section 4.3) finds more accurate nearest neighbors than the vantage-point tree used in BH-SNE [Tang et al. (2016)]. For the $k$-NN classification accuracy results shown in Figs. 4(c)-(e), the performance gap between the two is small, indicating that our method has a comparable quality of outputs to BH-SNE.

Table 4: $k$-NN classification accuracy of PixelSNE-RP for different screen resolutions $r$ and numbers of nearest neighbors $k$.

| $r$ | $k$ = 1 | $k$ = 3 | $k$ = 5 | $k$ = 10 | $k$ = 20 | $k$ = 30 |
|---|---|---|---|---|---|---|
| 512 | .9390 | .9482 | .9463 | .9428 | .9367 | .9297 |
| | (.0012) | (.0023) | (.0023) | (.0023) | (.0033) | (.0045) |
| 1024 | .9387 | .9489 | .9475 | .9450 | .9400 | .9335 |
| | (.0017) | (.0023) | (.0014) | (.0014) | (.0011) | (.0017) |
| 2048 | .9386 | .9489 | .9472 | .9395 | .9450 | .9323 |
| | (.0020) | (.0018) | (.0011) | (.0020) | (.0014) | (.0015) |
| 4096 | .9380 | .9484 | .9471 | .9437 | .9382 | .9324 |
| | (.0018) | (.0007) | (.0013) | (.0013) | (.0014) | (.0019) |

Effects of the precision parameter. Tables 3 and 4 show the above two measures for PixelSNE-RP with respect to different values of the screen resolution $r$. Unlike the $k$-NN classification accuracy, which stays roughly the same regardless of the value of $r$, the neighborhood precision tends to increase as $r$ gets larger. However, the gap is not that significant.

Cost value analysis. Finally, Table 5 compares the cost function value (Eq. (4)) after convergence. Compared with the baseline cost value of the random initialization, both BH-SNE and PixelSNE-RP achieve a similar level of the algorithm objective. Also, across all the values of $r$ tested, this value remains almost the same between BH-SNE and PixelSNE-RP, which indicates that the screen resolution-based precision of PixelSNE has minimal to no impact on achieving the optimal cost value, consistent with the results in Tables 3 and 4.

Table 5: Cost function values (Eq. (4)) after convergence.

| | Random coordinates | BH-SNE | PixelSNE-RP (with varying $r$) | | | |
|---|---|---|---|---|---|---|
| Cost value (Eq. (4)) | 91.580 | 1.815 | 1.820 | 1.865 | 1.871 | 1.868 |
| | (.000) | (.0071) | (.0303) | (.0167) | (.0113) | (.0167) |

### 5.4 Exploratory Analysis

Finally, we presents visualization examples for Yelp and NewsAgg datasets.^{10}

Fig. 6 shows the visualization result on NewsAgg dataset from PixelSNE-RP. A data point, which corresponds to a news headline, is labeled with four different categories: Business, Science and Technology, etc., denoted by different color. Although all the headlines are categorized into only four subjects, Fig. 6 revealed the additional sub-category information by closely embedding news with the similar topics. Each topic is connected with the relevant area as shown in Fig. 6. For example, those news belonging to Health category are roughly divided into three parts, as follow. The topical region “Infectious disease” on the left side consists of the headline with the keywords such as “Ebola”, “West Nile virus” or “Mers” while the region connected with “Non-infectious disease” has headlines with the keywords of “Breast cancer”, “Obesity” or “Alzheimer”. The topical area of “Drug”, “Tobacco” has news about health-related life styles, e.g., “Mid-life Drinking Problem Doubles Later Memory Issues”. Note that the topical regions of “Infectious disease” and “Non-infectious disease” are closely located compared to the other regions.

Another interesting observation is that news headlines sharing similar keywords are located close to each other, even when they come from different categories. For example, the topic cluster of “Market” and “Product” in the lower-left region contains news from both Business and Science and Technology. It turns out that the Science and Technology news in this cluster has headlines about the stock prices of electronics companies or the sales of electronic products, which are closely related to the Business headlines about the stock market and company performance.

## 6 Conclusions

In this paper, we presented a novel idea of exploiting screen resolution-driven precision as the fundamental framework for significantly accelerating 2D embedding algorithms. Applying this idea to a state-of-the-art method called Barnes-Hut SNE, we proposed a novel approach called PixelSNE as a highly efficient alternative.

In the future, we plan to apply our framework to other advanced algorithms such as LargeVis [Tang et al. (2016)]. We also plan to develop a parallel distributed version of PixelSNE to further improve the computational efficiency.

### Footnotes

- This paper discusses only the 2D embedding case, but our proposed approach can be easily extended to the 3D case.
- http://cs.nyu.edu/~roweis/data.html
- https://goo.gl/W3Z8qZ
- http://qwone.com/~jason/20Newsgroups/
- https://github.com/mmihaltz/word2vec-GoogleNews-vectors
- https://www.kaggle.com/uciml/news-aggregator-dataset
- https://www.yelp.com/dataset_challenge
- https://lvdmaaten.github.io/tsne/
- https://github.com/lferry007/LargeVis
- High-resolution images and other comparison results are available at http://davian.korea.ac.kr/myfiles/jchoo/resource/kdd17pixelsne_appendix.pdf.

### References

- Josh Barnes and Piet Hut. 1986. A hierarchical O(N log N) force-calculation algorithm. Nature 324, 6096 (1986), 446–449.
- Mikhail Belkin and Partha Niyogi. 2003. Laplacian Eigenmaps for Dimensionality Reduction and Data Representation. Neural Computation 15, 6 (2003), 1373–1396.
- Christopher M Bishop, Markus Svensén, and Christopher KI Williams. 1998. GTM: The generative topographic mapping. Neural Computation 10, 1 (1998), 215–234.
- Miguel A Carreira-Perpiñán. 2010. The Elastic Embedding Algorithm for Dimensionality Reduction. In Proc. the International Conference on Machine Learning (ICML). 167–174.
- Jaegul Choo and Haesun Park. 2013. Customizing Computational Methods for Visual Analytics with Big Data. IEEE Computer Graphics and Applications (CG&A) 33, 4 (2013), 22–28.
- A. Gisbrecht, B. Mokbel, and B. Hammer. 2012. Linear basis-function t-SNE for fast nonlinear dimensionality reduction. In Proc. the International Joint Conference on Neural Networks (IJCNN). 1–8.
- Geoffrey E Hinton and Sam T Roweis. 2002. Stochastic neighbor embedding. In Advances in Neural Information Processing Systems. 833–840.
- G. E. Hinton and R. R. Salakhutdinov. 2006. Reducing the Dimensionality of Data with Neural Networks. Science 313, 5786 (2006), 504–507.
- Ian T. Jolliffe. 2002. Principal component analysis. Springer.
- T. Kohonen. 2001. Self-organizing maps. Springer.
- J. Kruskal. 1964a. Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika 29, 1 (1964), 1–27.
- J. Kruskal. 1964b. Nonmetric multidimensional scaling: A numerical method. Psychometrika 29, 2 (1964), 115–129.
- D. D. Lee and H. S. Seung. 1999. Learning the parts of objects by non-negative matrix factorization. Nature 401 (1999), 788–791.
- John A Lee and Michel Verleysen. 2007. Nonlinear dimensionality reduction. Springer Science & Business Media.
- Laurens van der Maaten. 2009. Learning a Parametric Embedding by Preserving Local Structure. In Proc. the International Conference on Artificial Intelligence and Statistics (AISTATS), Vol. 5. 384–391.
- N. Pezzotti, B. Lelieveldt, L. van der Maaten, T. Höllt, E. Eisemann, and A. Vilanova. 2016. Approximated and User Steerable tSNE for Progressive Visual Analytics. IEEE Transactions on Visualization and Computer Graphics (TVCG) PP, 99 (2016), 1–1.
- John W. Sammon, Jr. 1969. A Nonlinear Mapping for Data Structure Analysis. IEEE Transactions on Computers C-18, 5 (May 1969), 401–409.
- L. K. Saul and S. T. Roweis. 2003. Think Globally, Fit Locally: Unsupervised Learning of Low Dimensional Manifolds. Journal of Machine Learning Research (JMLR) 4 (2003), 119–155.
- Jian Tang, Jingzhou Liu, Ming Zhang, and Qiaozhu Mei. 2016. Visualizing Large-scale and High-dimensional Data. In Proc. the International Conference on World Wide Web (WWW). 287–297.
- Joshua B. Tenenbaum, Vin de Silva, and John C. Langford. 2000. A Global Geometric Framework for Nonlinear Dimensionality Reduction. Science 290, 5500 (2000), 2319–2323. DOI: http://dx.doi.org/10.1126/science.290.5500.2319
- Laurens van der Maaten. 2014. Accelerating t-SNE using tree-based algorithms. Journal of Machine Learning Research (JMLR) 15, 1 (2014), 3221–3245.
- L. van der Maaten and G. Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research (JMLR) 9 (2008), 2579–2605.
- Laurens van der Maaten and Geoffrey Hinton. 2012. Visualizing non-metric similarities in multiple maps. Machine Learning 87, 1 (2012), 33–55.
- LJP Van der Maaten, EO Postma, and HJ Van den Herik. 2009. Dimensionality reduction: A comparative review. Technical Report TiCC TR 2009-005 (2009).
- L. van der Maaten and K. Weinberger. 2012. Stochastic triplet embedding. In IEEE International Workshop on Machine Learning for Signal Processing. 1–6.
- Jarkko Venna, Jaakko Peltonen, Kristian Nybo, Helena Aidos, and Samuel Kaski. 2010. Information retrieval perspective to nonlinear dimensionality reduction for data visualization. Journal of Machine Learning Research (JMLR) 11, Feb (2010), 451–490.
- Kilian Weinberger and Lawrence Saul. 2006. Unsupervised Learning of Image Manifolds by Semidefinite Programming. International Journal of Computer Vision 70, 1 (2006), 77–90.
- Peter N Yianilos. 1993. Data structures and algorithms for nearest neighbor search in general metric spaces. In Proc. the ACM-SIAM Symposium on Discrete Algorithms (SODA). 311–321.