Function Preserving Projection for Scalable Exploration of High-Dimensional Data
We present function preserving projections (FPP), a scalable linear projection technique for discovering interpretable relationships in high-dimensional data. Conventional dimension reduction methods aim to maximally preserve the global and/or local geometric structure of a dataset. However, in practice one is often more interested in determining how one or multiple user-selected response function(s) can be explained by the data. To intuitively connect the responses to the data, FPP constructs 2D linear embeddings optimized to reveal interpretable yet potentially non-linear patterns of the response functions. More specifically, FPP is designed to (i) produce human-interpretable embeddings; (ii) capture non-linear relationships; (iii) allow the simultaneous use of multiple response functions; and (iv) scale to millions of samples. Using FPP on real-world datasets, one can obtain fundamentally new insights about high-dimensional relationships in large-scale data that could not be achieved using existing dimension reduction methods.
The rapid advances in both experimental and computational capabilities have resulted in a deluge of data being collected in all branches of science and engineering. Whether this data describes outputs of computer simulations or observations from experiments, it is imperative to uncover patterns and relationships in the resulting high-dimensional data to gain insights into the underlying phenomena. Existing solutions for high dimensional analysis struggle to balance the complexity of an approach with the need to interpret the results. For example, one may fit a linear model which is easily understood yet cannot recover non-linear patterns or deploy more general techniques, such as deep neural networks, which are more flexible but difficult to interpret. Similarly, simple visual encodings such as scatterplot matrices and parallel coordinates Inselberg and Dimsdale (1990); Carr et al. (1987) are easy to read but do not scale to large dimensions. Non-linear dimension reduction can sometimes address the dimensionality challenge at the cost of losing interpretability as axes lose their meaning. Furthermore, many existing visualization techniques for high-dimensional data do not scale gracefully to large datasets.
Instead, we need an approach that can deal with large data and non-linear relationships while remaining interpretable. Function preserving projections (FPP) achieve both objectives by focusing on specific aspects of data through the lens of user-selected response functions. Response function(s) typically correspond to one or multiple diagnostic measurements or simulation outputs, defined at each data point, and can provide a powerful window into underlying patterns in the data. More specifically, rather than aiming to preserve the entire neighborhood structure of high-dimensional data in hopes of finding interesting patterns, we deliberately search for the best linear projection, such that the chosen response functions create an interpretable pattern in the projected space. An important consequence of focusing on the projected response function is that FPP inherently ignores domain variables and structures not pertinent to the response and does not require the domain to have a low intrinsic dimension. The latter is a key property in many applications, most notably simulation ensembles, where the domain is defined as a uniform sampling of a high dimensional hypercube. In these cases, dimension reduction is futile as – by definition – there exists no low dimensional structure in the domain one could discover and thus many of the traditional techniques do not apply. We consider a pattern interpretable if it can be approximated by a chosen regressor and by adjusting the type and order of the regressor, i.e., polynomial vs. exponential, linear vs. non-linear, etc., we allow users to choose the complexity of the pattern they deem acceptable. The key insight driving the development of FPP is the fact that interpreting non-linear embeddings is challenging yet humans are highly skilled in understanding non-linear patterns. By restricting the initial project to a linear map FPP preserves the interpretability of the resulting plots, i.e., axis labels, while non-linear regressors enable us to discover noisy and non-linear relationships.
Conceptually, FPP is a dual approach to kernel machines in machine learning, which employ non-linear, and often infinite-dimensional mappings to enable the use of simple linear models for complex data. From a visual exploration standpoint, we argue that the use of linear models in D is not necessary since humans can still interpret more complex relationships. On the other hand, non-linear mappings of the data coordinates are not explainable, thereby making the subsequent analysis also highly opaque. Another crucial feature of FPP is that it supports much larger data than related techniques and enables unified analysis with multiple response functions, wherein a single D projection is identified that jointly preserves all responses. Finally, an often-overlooked challenge for view-finding or any pattern detection algorithm is that, in higher dimensions, it is challenging to qualitatively distinguish between meaningful structure and artifacts. This behavior can be directly attributed to the curse of dimensionality, where the sample sizes used are not sufficient to make meaningful inferences about the data. In the case of view-finding approaches like FPP, this manifests as overfitting, where one can almost always create a seemingly meaningful pattern given low enough sample counts and sufficiently high dimensions. To address this problem we argument the D embeddings with the equivalent of a -statistic that quantifies the likelihood of the given hypothesis (the observed pattern) to occur in a random function. This, for the first time, provides a quantitative and easy to interpret indicator on how reliable a given visualization is likely to be, which is crucial to confidently infer new insights. Using several case studies we show that FPP provides comprehensive insights that cannot be easily obtained using conventional techniques.
2 Related Work
Even though FPP produces linear projections, it is fundamentally different from existing linear dimension reduction techniques like principal component analysis (PCA) Jolliffe (2011). Dimension reduction, as currently understood, is typically aimed at preserving global (PCA) or local neighborhood structures (locality preserving projection He and Niyogi (2004)) of the high-dimensional point geometry. The resulting projection is then colored according to a response function in hopes of highlighting interesting relationships. However, when analyzing particular response functions the complete geometry of the point set may not be relevant and can even be detrimental by introducing “spurious” variations unrelated to the response of interest. Instead, FPP directly targets the response functions of interest and preserves only the aspects of the high dimensional geometry relevant to the problem.
In this context, FPP is similar to cross decomposition approaches, such as canonical component analysis (CCA) Hardoon et al. (2004) and partial least square (PLS) Chin et al. (1998). CCA aims to find the subspace that best aligns the domain and the range of a high-dimensional function. However, this produces a subspace that is at most equal to the minimal dimension of either the domain and the range. Consequently, for scalar functions, one can only find a 1D subspace, rather than a 2D projection, and for more than two response functions a secondary projection must be added for visualization which results in an unoptimized projection. Partial least square Chin et al. (1998) does not have this limitation but is restricted to a linear regressor making it difficult to identify even simple nonlinear patterns, i.e. a circle (see Section 4). Similarly, inverse sliced regression Li (1991), which utilize an inverse regression formulation to reduce the dimension of the input with respect to the response (function), is also limited to linear correlation structure.
For visualization of the nonlinear structure of the high-dimensional point, many nonlinear dimensionality reduction techniques Maaten and Hinton (2008); Kruskal (1964) have been proposed. However, while they can capture some intrinsic structure of a point sample it is difficult to connect the observed patterns back to the domain as the axes no longer have well-defined meaning and distances can be heavily distorted. Furthermore, similar to the linear projections these techniques do not consider the response function in creating the projection. As a result, a projection that explains the response well is due to a fortunate coincident rather than a deliberate design.
The technique most similar to FPP is projection pursuit regression (PPR) Friedman and Stuetzle (1981), which has been designed as a universal high-dimensional function approximator for regression tasks. The PPR fits a linear combination of multiple non-linear transformations of linear combinations of variables in the data. The non-linear mapping allows PPR to capture certain nonlinear patterns. Similar to FPP, the PPR formulation can also be considered as a dual of the kernel regression as explored in Donoho and Johnstone (1989). However, designed as a function approximator, the performance of PPR improves as we increase the number of nonlinear transformation components, which can lead to challenges in its interpretation. With FPP, we instead directly fit a non-linear model in the 2D projected space, which not only allows intuitive visualization but also simplified the optimization process that allows us to efficiently scale the proposed technique to extremely large sample size and dimensions.
As discussed in the introduction, the growing need for exploring large and complex high-dimensional dataset call for visualization tools that 1) produce interpretable embedding; 2) capable for capturing nonlinear pattern; 3) scalable to large sample size and high dimensionality. For interpretability, contrary to many high-dimensional data visualization approach (e.g., t-SNE or MDS) that employ a complex map from high-dimensional to 2D space, we focus on a simple linear transformation that produces a 2D embedding with well-defined axes. To capture the potentially nonlinear structure of the function, we frame the pattern discovery as a nonlinear 2D regression problem, where the choice of the regressor, i.e., the polynomial degree, provides direct control over the visual complexity a user expects or is willing to consider as salient structure. In the most basic form, we can consider the problem as a joint optimization of both dimensionality reduction and regression (see Figure 1) that can be formulated as follow:
For a given HD dataset of samples in dimensions, , FPP infers -dimensional embeddings , based on a response function defined at each data point . Here, and denote the input and the embedded spaces respectively. The response function space is defined as either , in case of continuous response functions, or as when assumes one of discrete values. This leads to the following general formulation of FPP:
In this formulation, denotes the mapping function, with parameters , between the embedded space and the response function, is a linear orthonormal projection applied to each data sample and is a scoring function used to evaluate the quality of mapping . In order to achieve interpretability, FPP relies on linear projections and for visualization purposes is typically fixed at . One can then capture non-linear relationships by allowing sufficient flexibility for the mapping function , i.e. using higher order regression models. For continuous , the examples below use polynomial regressors, where the polynomial degree directly controls the complexity of the inferred structure. However, other regressors could easily be integrated as well. As scoring function any one of the standard goodness of fit measure can be used, such as the mean squared error (MSE). In the case of classification when is discrete, is defined as a softmax classifier to predict the class labels. In these cases is defined as the cross entropy between true and predicted response values. Finally, Eq. (1) can be extended to multiple response functions as follows:
Here, our goal is to infer an unified projection that simultaneously recovers the response functions.
To solve Eq. (1) requires incorporating the constraint to ensure that the columns of are orthonormal and the projection constructs a valid linear subspace. More specifically, FPP leverages the popular deep learning framework TensorFlow Abadi et al. (2016) to implement a projected gradient descent (PGD) optimization. The linear projection is realized with a dense layer with weight matrix of size , and the orthonormality constraint is enforced by projecting estimated weights onto the Stiefel manifold, the set of all orthonormal matrices of the form , through singular value decomposition (SVD) Golub and Reinsch (1971). The embeddings from the linear projection, , are then used for predicting the response using a non-linear mapping . The detailed steps of FPP is summarized in Algorithm 1. Note that for small , i.e. the SVD step is computationally efficient and consequently FPP scales to tens of millions of samples and tens of thousands of dimensions. Our implementation can be found at https://github.com/LLNL/fpp.
For any pattern-finding scheme, it is imperative to evaluate the trustworthiness of the identified pattern. Since we are optimizing for low-dimensional patterns in a high-dimensional space, there are many potential opportunities for overfitting. We address this challenge from two perspectives: First, like all statistical problems, we can split the data into training and testing set, fit the projection using training data, and compare the result on both the training and testing set. If the identified pattern is due to a salient correlation we expect both projections to result in a very similar structure. Alternatively, we can consider the problem as a hypothesis test, by defining a confidence value (analog to a p-value) which describes how likely a pattern of similar strength can be found in random data. Such a test can also provide us with general guidelines on whether we should be concerned about potential overfitting at a given sample count and data dimension. As illustrated in Figure 2 (a), for the given 2D polynomial (degree 4) regressor, we show its score when fitted to randomly generated data of different dimension and size. As expected, overfitting is more likely to happen as the dimension increases or as sample count decreases. Subsequently, we utilize the score samples on random datasets to estimate a p-value for a given projection. As illustrated in Figure 2(b), we show p-values assuming the projection give a score of 0.5 for configurations. The colormap is clamped at , which reveal a clear line separating the significant and non-significant sides. Such an observation indicates that a constant factor exists between sample size and dimension size for finding a trustworthy pattern of the regression problem (in this case, the sample size should be at least ten times larger). We can also obtain the p-value estimation from the loss function value (instead of ), which can be used for both regression and classification scenarios. Polynomial regressor is effective for capturing the nonlinear and low-frequency pattern human can easily comprehend. However, the proposed framework does not limit to any particular 2D regressor, provided they are differentiable to utilize the existing SGD framework. Also, different types of regressors also impose different priors on the patterns. Therefore, we can utilize the selection of regressor and their setup, i.e., the degree of the polynomial regressor, as a tunable knob in the system to focus on different types of pattern or complexity. However, we should expect to see potentially different overfitting behaviors due to the choice of the model. For certain datasets with high extrinsic but low intrinsic dimension, one can also utilize random projections as a pre-process to lower the dimensionality of the problem. By its nature, random projections are highly unlikely to cause overfitting but the resulting lower dimensionality significantly reduces the overfitting risk for the subsequent FPP projection.
In this section, we demonstrate the applicability of the proposed method on synthetic data as well as on dataset from real-word applications for both regression and classification problems.
As a synthetic experiment, we define a single function on a uniformly sampled high-dimensional domain. Here, the function has a circular pattern (see the third column of the Figure 3) in a 2D subspace of a 5D and 30D domain respectively, where each dimension is generated from a uniform random distribution (between to ). In the top row of Figure 3 shows projections from the 5D dataset and the bottom row are from the 30D dataset. Due to the linear assumption, both partial least square (PLS) and sliced inverse regression (SIR) fail to capture the circular pattern. By focusing on the visual domain and relying on a nonlinear regressor (in this case, a polynomial regressor of degree 3), the proposed FPP approach can easily reveal the pattern in both the 5D and 30D domain. On the rightmost column, we illustrate the significance estimation (p-value) of the pattern captured by the proposed technique. The blue histogram shows the regression loss value distribution for a sample of 300 randomized functions (i.e., a random shuffle of the function values), whereas the red line marker indicates the loss value for the input function. This plot provides insights on whether we overfit to the data by illustrating how likely we will be able to find a pattern of similar strength in the random function. The clear separation between the randomized function regression loss distribution and the input function loss distribution lead to an estimated of p-value 0.0, which indicates a strong evidence against the hypothesis that the observed pattern is from spurious correlation.
In the synthetic dataset, we use FPP to produce a 2D embedding based on one function of interests. However, in many applications, we interest in the joint behavior of multiple scalar function, i.e., output properties of physical simulation ensembles. As discussed in Section 3, we can easily extend the single function formulation to multiple ones. In the following example, we give an example of a multi-function projection, where we produce a single 2D embedding that explains all major variation of the functions. The application is a physical simulation ensemble (1M samples) produced by a recently proposed semi-analytic simulation model Gaffney et al. (2014); Springer et al. (2013) for inertial confinement fusion 111https://github.com/rushilanirudh/icf-jag-cycleGAN. The simulator has a 5-dimensional input parameter space and produces several images of the implosion as well as 15 diagnostic scalar outputs. For this example, we want to understand what are the main driving factor for not one but all 15 scalar outputs. We can explore such a relationship by producing a simple 2D projection that would best capture the changes and pattern of the 15 scalars. As illustrated in Figure 4(a), by utilizing a degree-3 polynomial regressor for each function, we identify a single 2D projection (color by 15 different scalar values) that explains most of the variation of all scalar outputs in the simulation ensemble. The fitted model has a loss of 0.0374. To estimate the p-value, we obtain the mean and variance of the loss distribution on randomized function computed from 300 samples, which lead to an estimated p-value of 0.0. As it returns out, as shown in Figure 4(b), if we apply in-plane rotation, the two dominating directions corresponds to two of the input parameters. According to the physics, the other three parameters (out of five input parameters) are shape parameters, therefore, they will mostly impact the generated image instead of the scalars. This example provides a real word verification of the proposed ability to capture a shared configuration, which helps interpret a set of functions defined in the same domain.
In previous examples, we have demonstrated the effectiveness of FPP for projecting continuous, high-dimensional functions, i.e., for regression problems. For classification problems, i.e., find the 2D projection that best separate samples with different labels, we can simply replace the 2D regressor with a 2D classifier. Here, we utilize a 2D logistic regression classifier (with an additional nonlinear layer) to drive the selection of the linear projection. In Figure 5(a), we compare the proposed FPP with linear discriminant analysis (LDA) Fisher (1936) on the MNIST dataset 222http://yann.lecun.com/exdb/mnist/ that consists of 60K sample of 28 by 28 images. We can see the FPP can find a 2D linear projection that separates different digits’ images better than the LDA result. We also apply the method on a high-dimensional RNA sequence data 333https://archive.ics.uci.edu/ml/datasets/gene+expression+cancer+RNA-Seq, which has more than 20531 feature dimensions and 801 samples. Due to the extremely high-dimensional and the very low sample count, there is a high potential for overfitting. As discussed in previous examples, we can obtain the p-value , which give high confidence on the captured structure. Alternatively, we can validate the trustworthiness of both the FPP and LDA results by split the datasets into training and test and then evaluate the trained models on the test set. As shown in Figure 5(b), not only does FPP separate the class better than LDA, but it also produces more robust projection that generalized well to the test set, unlike the LDA projection which entirely fails on the test set. When examining the LDA projection matrix, we notice only 3 of the 20K dimensions are contributing to the projection, which leads to a degenerate projection for the test set.
The 2D loss function combined with the SGD based implementation allows FPP to scale to data sizes significantly beyond the ability of most traditional projection/dimensionality reduction approaches. In the following example, we use FPP to probe into feature representations of two popular deep learning architecture (AlexNet, ResNet) on the entire ImageNet challenge Deng et al. (2009) training dataset, which consists of more than 1.28 million images and 1000 classes. We use the last layer before the softmax as the feature representation for both networks resulting in a 2048- and 4096-dimensional feature space for ResNet and AlexNet respectively. Combined with a large number of samples this results in datasets of more than 40GB. Despite their massive size, we can generate projections for each of these 1.28M sample datasets within minutes (between 3 to 10 minutes). The detailed timing results and parameter setups for all our examples are in listed in Table 1. For the first five datasets, we compute the results on a laptop with an Intel Core i7-6820HQ processor (2.9GHz), whereas the last four are computed on a server (due to limited memory on the laptop) with an Intel Xeon E5-2695 processor (2.1GHz).
For the experiments, we first group the entire 1.28M images (1000 classes) into 5 coarse categories, namely, “living thing”, “natural object”, “food”, “artifact”, “misc”, where the first 4 categories consists 984 of the 1000 classes available in the imageNet challenge. We then project all the images feature representations with the categories as the label. As shown in Figure 6(a), we can see that the AlexNet’s representation has trouble distinguishing the purple samples (“misc” category) from the rest, whereas the ResNet’s representation, despite being lower-dimensional, can. For visualizing more detailed category separability, we generate 10 categories with more concrete and meaningful labels (i.e., “fish”, “bird”, “mammal”, “invertebrate”, “food”, “fruit”, “vehicle”, “appliance”, “tool”, “instrument”), which consists of 611k images of the 1.28M. As we can see in Figure 6(b), the feature representation of ResNet again seems to be able to better separate these categories compared to AlexNet’s representations, which may explain the gap in their predictive performances (AlexNet and ResNet have Top-5 errors of 20.91% and 6.44%, respectively).
|Circle (5D)||3000||5||1||0.14 (MB)||50||50||2.98|
|Circle (30D)||3000||30||1||0.74 (MB)||50||50||3.39|
|ICF simulator||1M||5||15||152.6 (MB)||1||200||24.7|
|ResNet ImageNet Sub||611K||2048||1||9.3 (GB)||10||200||184.5|
|ResNet ImageNet Full||1.28M||2048||1||19.6 (GB)||10||200||404.8|
|AlexNet ImageNet Sub||611K||4096||1||18.7 (GB)||10||200||283.7|
|AlexNet ImageNet Full||1.28M||4096||1||39.1 (GB)||10||200||640.1|
In this work, we introduce a novel class of linear projection methods for visualizing interesting and interpretable visual patterns of the function in 2D subspaces of the function domain. The combination of linear projection and nonlinear pattern searching schemes (i.e., a polynomial regressor, or a nonlinear classifier in 2D) allows us to exploit our innate ability to perceive complex (and potentially nonlinear) visual pattern in 2D while compensating our inability to comprehend nonlinear transformation by focus only on a linear transformation from high-dimensional space to 2D. The efficient formulation also allows us to easily scale the problem beyond million of samples and tens of thousands of dimensions that is unimaginable for most dimensionality reduction methods.
This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. Released under LLNL-JRNL-790959.
- Abadi et al.  Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
- Carr et al.  Daniel B Carr, Richard J Littlefield, WL Nicholson, and JS Littlefield. Scatterplot matrix techniques for large n. Journal of the American Statistical Association, 82(398):424–436, 1987.
- Chin et al.  Wynne W Chin et al. The partial least squares approach to structural equation modeling. Modern methods for business research, 295(2):295–336, 1998.
- Deng et al.  Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009.
- Donoho and Johnstone  David L Donoho and Iain M Johnstone. Projection-based approximation and a duality with kernel methods. The Annals of Statistics, pages 58–106, 1989.
- Fisher  Ronald A Fisher. The use of multiple measurements in taxonomic problems. Annals of eugenics, 7(2):179–188, 1936.
- Friedman and Stuetzle  Jerome H Friedman and Werner Stuetzle. Projection pursuit regression. Journal of the American statistical Association, 76(376):817–823, 1981.
- Gaffney et al.  Jim Gaffney, Paul Springer, and Gilbert Collins. Thermodynamic modeling of uncertainties in nif icf implosions due to underlying microphysics models. In APS Meeting Abstracts, 2014.
- Golub and Reinsch  Gene H Golub and Christian Reinsch. Singular value decomposition and least squares solutions. In Linear Algebra, pages 134–151. Springer, 1971.
- Hardoon et al.  David R Hardoon, Sandor Szedmak, and John Shawe-Taylor. Canonical correlation analysis: An overview with application to learning methods. Neural computation, 16(12):2639–2664, 2004.
- He and Niyogi  Xiaofei He and Partha Niyogi. Locality preserving projections. In Advances in neural information processing systems, pages 153–160, 2004.
- Inselberg and Dimsdale  Alfred Inselberg and Bernard Dimsdale. Parallel coordinates: a tool for visualizing multi-dimensional geometry. In Proceedings of the 1st conference on Visualization’90, pages 361–378. IEEE Computer Society Press, 1990.
- Jolliffe  Ian Jolliffe. Principal component analysis. Springer, 2011.
- Kruskal  Joseph B Kruskal. Nonmetric multidimensional scaling: a numerical method. Psychometrika, 29(2):115–129, 1964.
- Li  Ker-Chau Li. Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86(414):316–327, 1991.
- Maaten and Hinton  Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(Nov):2579–2605, 2008.
- Springer et al.  PT Springer, C Cerjan, R Betti, JA Caggiano, MJ Edwards, JA Frenje, V Yu Glebov, SH Glenzer, SM Glenn, N Izumi, et al. Integrated thermodynamic model for ignition target performance. In EPJ Web of Conferences, volume 59, page 04001. EDP Sciences, 2013.
Appendix A Hypothesis Testing Failure Case
As illustrated in Figure 2, when the ratio between data sample and dimension count is relatively large (i.e., more than 20), we often do not need to worry about overfitting the model to spurious correlation in the data. This observation is also supported by the examples in the result, where we always obtain very small p-values when the sample to dimension ratio is high. In the following example, we illustrate a scenario, where the model overfits to the data. We show that both the p-value and the train/test split help reveal the problem. Here we again look at the RNA-seq dataset (20531 dimensions, 801 samples). Based on the sample and dimension ratio, there is a high likelihood of overfitting for a regression problem. In this example, instead of using a 2D classifier (as in the original result), we treat the labels as multiple binary functions and solve it using the multi-function regression setup. As illustrated in Figure 7(a), we are able to separate the 5 classes even in the regression setup. However, when estimating the p-value from that projection, as shown in (c), we obtain an estimated value of 0.17, which indicates an untrustworthy result. The overfitting problem can also be detected by projecting the test dataset (b).
Appendix B Additional Details For the ImageNet Results
In the result section, we illustrate category separation of the full and subset of the imageNet dataset. However, due to the occlusion caused by a large number of samples in the 2D plots, we may not have a clear understanding of the distribution for each of the category. Here in Figure 8 and Figure 9, we show the class distribution details by decomposing each multi-category plot into multiple plots where only samples from one category are displayed.