Mapping the Similarities of Spectra: Global and Locally-biased Approaches to SDSS Galaxy Data

Mapping the Similarities of Spectra: Global and Locally-biased Approaches to SDSS Galaxy Data

David Lawlor Tamás Budavári Dept of Applied Mathematics and Statistics, The Johns Hopkins University Michael W. Mahoney

We apply a novel spectral graph technique, that of locally-biased semi-supervised eigenvectors, to study the diversity of galaxies. This technique permits us to characterize empirically the natural variations in observed spectra data, and we illustrate how this approach can be used in an exploratory manner to highlight both large-scale global as well as small-scale local structure in Sloan Digital Sky Survey (SDSS) data. In particular, we use this method in a way that simultaneously takes into account the measurements of spectral lines as well as the continuum shape. Unlike Principal Component Analysis, this method does not assume that the Euclidean distance between galaxy spectra is a good global measure of similarity between all spectra, but instead it only assumes that local difference information between similar spectra is reliable. Moreover, unlike other nonlinear dimensionality methods, this method can be used to characterize very finely both small-scale local as well as large-scale global properties of realistic noisy data. The power of the method is demonstrated on the SDSS Main Galaxy Sample by illustrating that the derived embeddings of spectra carry an unprecedented amount of information. By using a straightforward global or unsupervised variant of our method, we observe that the main features correlate strongly with star formation rate and that they clearly separate active galactic nuclei. In addition, computed parameters of the method can be used to describe line strengths and their interdependencies. By using a locally-biased or semi-supervised variant of our method, we are able to focus on typical variations around specific objects of astronomical interest. We present several examples illustrating that this approach can enable new discoveries in the data as well as a detailed understanding of very fine local structure that would otherwise be overwhelmed by large-scale noise and global trends in the data.

1 Introduction

The physical properties of the Universe and the internal mechanisms of galaxies are ultimately intertwined in astronomical observations. Characterizing the diversity of galaxies is vital not only for understanding their evolution but also to unravel the nature of dark energy in the context of our cosmological models. While today’s large-scale spectroscopic surveys provide a plethora of data, novel data analysis methods are needed to help extract astronomical insight from these data.

Current data analysis approaches in this area generally fall into one of two categories. In the first category, the observed spectra are fitted by semi-analytic models, e.g., [BC03], to infer model-based parameters. These parameters in turn provide a model-dependent physical coordinate system with absolute scales such as age or metallicity. Challenges for these methods typically include systematic biases due to imperfect models as well as correlated parameters. In the second category, one adopts a more empirical approach, where galaxies are analyzed in relation to other galaxies based on the original measurements, i.e., based on the observed spectra. A major challenge for these more empirical methods is the conceptual problem of how best to compare empirical spectra, e.g., which features of a spectrum are most important for identifying similarities between two spectra. The approach we describe in this paper falls into this second category, and it aims to address the fundamental issue of measuring similarity between galaxy spectra, both with respect to the large-scale global properties of the empirical data, as well as with respect to finer-scale local structure that might be overwhelmed by global data analysis tools that focus on global properties of the data.

A canonical example of a global data analysis tool that adopts this more empirical approach is Principal Component Analysis (PCA), which is widely-used to find the globally-dominant linear trends in the data. PCA was first applied to galaxies by [CSB95], who found that a significant fraction of the variance in the spectra can be captured by only three components. In other words, the analyzed spectra could be well approximated by a linear combination of three eigenspectra. The coefficients serve as summaries of the high-dimensional spectra, and in this coordinate system galaxies could be meaningfully compared to one another. PCA has been used in many research areas, including photometric redshift estimation [CBS99, BSC00], sky subtraction [WH05], as well as classification of galaxies and quasars [FHFC92, CS99, YCS04b, YCV04]. The Sloan Digital Sky Survey (SDSS) [YAA00] has adopted the method in its data reduction pipeline, and it automatically derives the first five eigencoefficients (called eCoeff_0eCoeff_4). These are considered the state of the art in describing the continuum shape of the spectra. Figure 1 shows the mixing angles and of the three leading eigencoefficients for the Main Galaxy Sample (MGS) in SDSS Data Release 7 [AAMA09]. These coordinates are defined as in [YCS04a] by


In the embedding illustrated in Figure 1, every point is a galaxy, and nearby points (i.e., galaxies or points that are near each other in the two-dimensional representation) have similar observations, by construction of the empirical coordinate system. In particular, “red and dead” galaxies appear at the top of the plot, while star-forming blue ones are on the lower right.

Figure 1: Embedding of the Main Galaxy Sample of the SDSS Data Release 7 on the mixing angles and of the first three eigen-coefficients.

There are, however, shortcomings of the simplified view of the data provided by PCA. First of all, a significant fraction of the galaxies are actually removed from or scattered out of the plot as a part of the usual data analysis pipeline used to generate the plot; we only really see the core of the distribution in a figure such as this one. In addition, the lack of structure in this visualization is surprising, especially considering the large amount of high-quality data and the wide range of galaxy types in the data. Finally, the interpretation of the axes is difficult, as they are linear combinations of actual data elements and not actual data elements. Extensions and variants of PCA have been proposed to overcome these challenges, including non-negative matrix factorization [LS99], the use of robust statistics [BWS09], and CX/CUR matrix decompositions [MD09, YMS14]. While these methods have alleviated some of the issues associated with PCA, the fundamental limitation of its assumed linear model remained.

Perhaps the biggest conceptual change in the area was introduced by [VC09], who applied the Locally Linear Embedding (LLE) method of [RS00]. This more sophisticated empirical approach attempts to identify and exploit local structure in the data, and thus it broke away from the straightforward global linear model underlying PCA. While there are other related nonlinear approaches [TDSL00, BN03], LLE in particular attempts to provide an angle-preserving mapping that assigns coordinates to galaxies such that each galaxy is approximately a linear combination of its nearest neighbors, with the same weights as in the observed space. The power and practical usefulness of LLE (as well as other related nonlinear methods [TDSL00, BN03, CL06]), however, is known to be severely diminished in many practical situations. The reasons for this are many: due to the difficulty of these methods in dealing with non-uniform point densities; since the global objective function used to enforce angle preservation or other neighborhood information can damage small-scale or local structure in the data; since these methods are quite sensitive to realistic noise in the data; and since these methods are very sensitive to the “details” of constructing the nearest neighbor graph, e.g., to the functional form of the nearby distance and the choice of parameters used to define nearness. (This is in spite of a large body of theory stating that in idealized situations these details do not matter.) In addition to exploiting the strong algorithmic and statistical theory underlying our main method [HM14, MOV12], dealing appropriately with these and other related practical graph construction issues will be central to our approach, and thus we postpone further discussion of it until Sections 2 and 3.

A third empirical approach that is worth mentioning is based completely on line measurements. Recall that the high resolution in wavelength often allows the identification and measurement of different spectral lines, and that it is common to plot spectra in terms of carefully-chosen line ratios. That is, while not usually described as an embedding method, the typical use of line measurements often involves embedding or mapping the data to a low-dimensional space. (The state-of-the-art in this area actually uses PCA as a pre-processing step to subtract the continuum, in order to measure better the line strength relative to this baseline [THK04].) In particular, the BPT diagrams [BPT81] plot different line ratios on a logarithmic scale, enabling, e.g., the classification of galaxies [BCW04, KGKH06]. For example, Figure 2 shows several BPT diagrams of the SDSS MGS. In Figure 2, the characteristic V-shape of the embedding on the ratios vs.  is clearly visible, despite significant scatter that is partly due to noisy measurements of the individual lines. Little to no structure is evident in Figure 2 and 2, whose -axes plot different line strengths. The insight conveyed by the BPT plots can be considered complementary to that of the PCA results, which is primary based on the continuum shape. Again, though, the lack of fine-scale structure in these visualizations is somewhat surprising, given the quality and diversity of the data.

Figure 2: Embeddings based on several pairs of BPT line-ratio diagrams of the SDSS DR7 MGS.

In this paper, we present a novel approach to studying galaxies that combines elements of several aforementioned techniques but that moves away from the limiting assumptions in their underlying mathematical models. Our method, which is an extension of semi-supervised eigenvectors [HM14] from locally-biased machine learning [MOV12], uses ideas from spectral graph theory; and we study the properties of the SDSS Main Galaxy Sample data, using both a traditional global variant as well as a more recently-developed local variant of these methods. The global variant reduces to a version of Laplacian eigenmaps [BN03] and related methods such as diffusion maps [CL06], while the local variant exploits recent work on local spectral graph partitioning to engineer locality into the basic global method, while at the same time preserving the strong algorithmic and statistical properties of the global variant [HM14, MOV12]. Among other things, the method efficiently handles the continuum shape and the spectral lines simultaneously, and the method can be used to explore the data in a qualitatively more refined manner in order to obtain better insights for galaxy studies. Since the method is unfamiliar in this application domain, in Section 2, we briefly review the approach; and in Section 3, we illustrate how several of the key “knobs” of the method behave on these data, as a function of data modeling design decisions. Then, in Section 4, we apply the global variant of the method to study the large-structure of the data; and in Section 5, we use the local variant of the method to study much finer-scale structure in the data. Finally, in Section 6, we present a brief conclusion.

2 Global and Local Spectral Embedding Methods

The method of locally-biased semi-supervised eigenvectors follows the general approach of recently-popular graph-based machine learning methods: take as input a collection of data points, define some sort of nearest-neighbor graph, and then use eigenvectors and eigenvalues of that graph to define features, parameterize data, perform classification or regression or clustering, etc. Prior methods that follow this general approach include LLE [RS00], ISOMAP [TDSL00], Laplacian eigenmaps [BN03] and the related diffusion maps [CL06], and so on. These methods are known as nonlinear dimensionality reduction methods since in certain idealized cases they can recover hypothesized nonlinear manifolds and since they provide greater statistical modeling flexibility than straightforward PCA. It is often more fruitful, however, to view them simply as constructing data-dependent kernels (in the machine learning sense of the word kernel [SS01]) and associated Reproducing Kernel Hilbert Spaces [HLMS04].

Among other things, this approach can be used to provide an embedding, i.e., a mapping, of the input data (typically in a very high dimensional Euclidean space) to a very low dimensional Euclidean space. The embedding is constructed to optimize an objective having to do with connectivity information among data points in an appropriately defined metric. For example, Laplacian eigenmaps and diffusion maps use entries of the leading eigenvectors of the Laplacian of the graph or of diffusion processes on the graph to define the embedding. Under certain assumptions, the Euclidean distance between data points in the embedded space equals a diffusion-based or resistance-based metric on the graph [BN03, CL06]. The embedding can then be used for various downstream machine learning and data analysis tasks. For example, these diffusion-based methods have been explored in the astronomical community, with applications such as photometric [FNL09] and spectroscopic [RFLS09] redshift estimation, and the statistical properties of diffusion embeddings have also been studied previously [LL06, LW10].

2.1 Constructing Data Graphs

Since our method of locally-biased semi-supervised eigenvectors takes as input a graph, we will start with a description of how data graphs can be constructed from rawer vector-based data, e.g., vectors of astronomical spectra. Say that we have a collection of points in . Importantly, since we view the method as constructing a data-dependent kernel, we do not assume that the data come from some intrinsically low-dimensional manifold. To construct a weighted graph on the data, where the vertices are the data points and the edges represent local connectivity information, we add an edge to the graph if is one of ’s nearest neighbors or if is one of ’s nearest neighbors. Note that this ensures the adjacency matrix of the graph is symmetric. We then weight each edge with a measure of local similarity given by


where is a parameter that controls the amount of “locality” present in the weight function. Alternatively, we can add an edge between every pair of points, with weight given by Eqn. (2), and then sparsify the matrix. The parameter can either be constant across all data points or it can be allowed to vary. A useful choice is to set , where is the distance of point to its nearest neighbor. Another adaptive choice of was proposed in [RZMC11]. Alternatively, one could choose to connect all pairs of points that are closer than some distance threshold. This is the parameterization that we adopt in the remainder of this paper, and we refer to it as “autotuning” the bandwidth . (The graph construction choices surrounding or or are problem-dependent, and it is known that graph-based machine learning methods are typically quite sensitive to them. We will discuss the sensitivity of our embeddings to in Section 3 below.)

For the application of interest in this paper, we note that our similarity measure in Eqn. (2) does not take into account the varying signal-to-noise ratio of wavelength bins. While we do have access to a per-wavelength uncertainty measure for each data point , we choose not to incorporate this information into the weight matrix, and instead we leave this important extension of our method as future work.

2.2 Locally-biased Semi-supervised Eigenvectors

Next, we will describe how, given a data graph, we can use global and local spectral methods to construct global and locally-biased embeddings of the nodes of that graph that can then be used for further downstream analysis. Recall that the usual eigenvectors and eigenvalues of the combinatorial or normalized Laplacian describe successively-orthogonalized directions of maximum variance, and they have strong connections with random walks and diffusion processes on the graph [Chu97]. Moreover, they are widely-used in machine learning in general and by nonlinear dimensionality reduction methods in particular. Here, we will review a methodology to construct so-called semi-supervised eigenvectors of a graph Laplacian [HM14], which is an example of a locally-biased machine learning method [MOV12]. With appropriate parameter settings, this methodology reduces to the usual global eigenvectors and diffusion-based methods, but with different parameter settings this methodology provides locally-biased analogues of these quantities.

To set notation, let be a connected undirected graph with vertices and edges, in which edge has weight In the following, will denote the adjacency matrix of , while will denote the diagonal degree matrix of , i.e., , the weighted degree of vertex . The combinatorial Laplacian of is defined as , and the normalized Laplacian of is defined as . The Laplacian is the symmetric matrix having quadratic form , for . This implies that is positive semidefinite and that the all-one vector is the eigenvector corresponding to the smallest eigenvalue . The generalized eigenvalues of are . We will use to denote smallest non-trivial eigenvector, i.e., the eigenvector corresponding to ; to denote the next eigenvector; and so on. We will overload notation to use to denote the smallest non-zero generalized eigenvalue of with respect to .

Given this notation, consider the left panel of Figure 3. The leading nontrivial global eigenvector of the normalized Laplacian (or, equivalently, the leading nontrivial generalized eigenvectors of ) is the solution to the problem GlobalSpectral. The next eigenvector is the solution to GlobalSpectral, augmented with the constraint that ; and in general the generalized eigenvector of is the solution to GlobalSpectral, augmented with the constraints that , for . It is these vectors that are widely-used in machine learning and data analysis, e.g., they are used to construct embeddings in Laplacian eigenmaps, and it is regularized variants of these vectors (computed from random walks) that are used to construct embeddings in diffusion maps. In particular, these eigenvectors can be used to define a low-dimensional embedding via


where the embedding dimension is a parameter to be chosen and the notation represents the element of the vector . (Since standard results from spectral graph theory show that the constant vector of all ones is an eigenvector with eigenvalue one [Chu97], the first eigenvector is not used in the embedding.) In the embedding space, the standard Euclidean distance between two points is proportional to the average length of a random walk starting at one point and reaching the second. In this sense, the embedding given by these eigenvectors preserves “connectivity” information about the original data. For further details on the theory of these methods, we refer the reader to [BN03, CL06].

GlobalSpectral s.t LocalSpectral s.t Generalized LocalSpectral s.t
Figure 3: Left: The usual GlobalSpectral partitioning problem; the vector achieving the optimal solution is , the leading nontrivial generalized eigenvector of with respect to . Middle: The LocalSpectral problem, which includes a locality constraint, was originally introduced in [MOV12]. Right: The Generalized LocalSpectral problem, which includes both the locality constraint and a more general orthogonality constraint, was originally introduced in [HM14]. The main algorithm for computing locally-biased semi-supervised eigenvectors will iteratively compute the solution to Generalized LocalSpectral for a sequence of constraint matrices [HM14].

While these embeddings provide an intuitive low-dimensional representation of high-dimensional data, they suffer from their fundamentally global nature. By this we mean that the embeddings are given by the eigenvectors of a suitably defined matrix, which can themselves be viewed as the solution of the global optimization problem GlobalSpectral, and as a consequence of the global orthogonality requirements, local structure and local heterogeneities tend to be lost in such an embedding. One (weak) way to tune this homogenizing effect is via the nearest-neighbor parameter ; another (stronger) way to recover local information is to enforce an additional constraint in the optimization program.

Motivated by this observation, we will define locally-biased analogues of these quantities in two steps: first, we will define the locally-biased analogue of the leading nontrivial eigenvector; and second, we will define the locally-biased analogue of the remainder of the eigenvectors. (The reason for this two-step process is that the leading nontrivial eigenvector and the remaining eignevectors have slightly different interpretations with respect to random walks; describing this is beyond the scope of this paper, but details can be found in [MOV12, HM14].) To do so, assume that we are given as input a (possibly weighted) data graph , an indicator vector of a small “seed set” of nodes, a correlation parameter , and a positive integer . (This will be the number of eigenvectors to be chosen, and thus it is different than the used to define the number of nearest neighbors.) Then, informally, we would like to construct vectors that satisfy the following bicriteria: first, each of these vectors is well-correlated with the input seed set; and second, those vectors describe successively-orthogonalized directions of maximum variance, in a manner analogous to the leading nontrivial global eigenvectors of the graph Laplacian. We emphasize that—in the same way as and are input to the usual global spectral methods—here the seed set of nodes, the integer , and the correlation parameter are also part of the input. They should be thought of as being available in a semi-supervised manner.

Given this, consider the middle panel of Figure 3. This presents LocalSpectral, an optimization problem that was introduced in [MOV12] which extends GlobalSpectral by including a constraint that the solution be well-correlated with an input seed set. We can assume (without loss of generality) that is properly normalized and orthogonalized so that and . Also, while can be a general unit vector orthogonal to , it may be helpful to think of as the indicator vector of one or more vertices in . The solution to LocalSpectral may be interpreted as a locally-biased version of the second eigenvector of the Laplacian. Importantly, it’s solution can be computed efficiently as the solution to a set of linear equations that generalize the popular diffusion-based Personalized PageRank procedure (and thus it can also be computed via a random walk process). In particular,


where is chosen to saturate the correlation constraint. By performing a sweep cut and appealing to a variant of Cheeger’s inequality, this locally-biased eigenvector can be used to perform locally-biased spectral graph partitioning [MOV12]. Observe that, for , this coincides with the usual GlobalSpectral objective, while for , this produces solutions that are biased toward the seed vector .

Finally, consider the right panel of Figure 3. For this Generalized LocalSpectral problem, in addition to the previous setup, we are also given an constraint matrix that may be assumed to be an orthogonal matrix. In words, this problem asks us to find a vector that minimizes the variance subject to several constraints: that is unit length; that is orthogonal to the span of ; and that is -well-correlated with the input seed set vector . To compute these locally-biased semi-supervised eigenvectors, we follow [HM14] and iteratively compute the solution to Generalized LocalSpectral, updating to contain the already-computed semi-supervised eigenvectors. In particular, to compute the first semi-supervised eigenvector, we let (i.e., the -dimensional all-ones vector, which is the trivial eigenvector , in which case is an matrix), in which case the first nontrivial semi-supervised eigenvector is given by Eqn. (4). To compute each subsequent semi-supervised eigenvector, we let the columns of consist of and the other semi-supervised eigenvectors found in each of the previous iterations. See [HM14] for details on how Generalized LocalSpectral can be solved efficiently in terms of a sequence of linear equations or constrained eigenvalue problems. We simply note that, in practice, the correct value of is unknown, and one performs a binary search over the range until the correlation constraint is saturated. Each stage of this binary search involves solving a system of linear equations, which can be done efficiently using iterative methods such as conjugate gradient.

We should note that, in this paper, we are interested in understanding local versus global properties of the data, and so we adopt an “exploratory” approach. For other downstream learning tasks, e.g., classification or regression, various model-selection methods can be used to select , , , , etc. [FHT01]. Extending the methodology of locally-biased semi-supervised eigenvectors to model selection and other related statistical questions, e.g., those considered in [RFLS09, VC09], is straightforward.

2.3 Regularization and Diffusion Interpretation

While the previous discussion has been in terms of constrained eigenvectors of the Laplacian, one can also describe these global and local spectral methods in terms of random walks, and this often aids intuition and computation. Consider, first, the usual global method. We can construct a matrix related to the transition matrix of a “lazy” random walk on the data as follows:


If both factors of acted on the right of , this would correspond to the transition matrix of a random walk on the data, where at each step the walker flips a fair coin: heads, he stays put; tails, he jumps to a random neighbor of the current point with probabilities proportional to the row of . Note that this transition matrix differs by a similarity transformation from as defined in Eqn. (5), given by . The two matrices thus share eigenvalues, and their eigenvectors differ by multiplication with a diagonal matrix (). Moreover, this eigensystem of is directly related to the eigensystem of the graph Laplacian  [Chu97]. The normalization we choose ensures that is symmetric, while the offset by (the “laziness” of the walk) ensures that is positive semidefinite. These properties are desirable for the numerical computation of the leading eigenvalues and corresponding eigenvectors using iterative methods.

We should also note that there is a natural “regularization” interpretation underlying our construction of our locally-biased semi-supervised eigenvectors and that this is intimately related to working with locally-biased random walks on the data graph . To see this, recall that the first step of our algorithm can be computed as the solution of a set of linear equations of the form of Eqn. (4), for some normalization constant and some . The quantity can be interpreted as a “regularized” version of the pseudoinverse of , where serves as the regularization parameter. This regularization interpretation has recently been made precise [MO11, PM11]. Alternatively, the quantity quantity is exactly a generalization of the usual PageRank vector that involves “random walkers” who uniformly (or non-uniformly, in the case of Personalized PageRank) “teleport” with a probability . As described in [MOV12], choosing corresponds precisely to choosing . In this case, the random walk interpretation is that one performs random walks consisting of a small number of steps, starting from a locally-biased seed node (as opposed to a large number of steps starting from an arbitrary seed node, which yields the usual global diffusion-based methods). For readers more comfortable with Laplacian eigenmaps or diffusion maps (or other diffusion-based machine learning methods), we simply note that the locally-biased semi-supervised eigenvectors framework can be used to provide locally-biased variants of these methods; we refer the interested reader to [MOV12, HM14] for details.

Given these connections, our computations of global diffusion embeddings were performed in MATLAB using a modified version of the DiffusionGeometry package of Bremer and Maggioni [CLL05, BM07]; and our computations of local embeddings were performed in MATLAB using the sseigs_demo package of Hansen and Mahoney [Han12, HM14].

3 Initial Illustration of the Method on SDSS Data

In this section, we describe the data we will consider as well as some of the most important “knobs” of the global and local spectral embedding methods described in Section 2. Details such as these are often relegated to an appendix or a methods section, but we place them here since the behavior of the method depends on these knobs in important ways.

3.1 The Main Galaxy Sample

The Main Galaxy Sample (MGS) has become a testbed for a wide range of astronomical studies. There are several reasons for this: due to the well-understood selection function, due to the large volume of high-quality data, and due to the prior systematic analyses that serve as a reference for new techniques. The definition of the MGS [SWL02] uses a single brightness limit of 17.77 Petrosian magnitudes. In the local (and recent) universe, this translates to a complete sample of galaxies that is dominated by red ellipticals. In our study, we use the entire rest frame wavelength range between 3450 Å and 8350 Å. Starting from the spectrophotometrically calibrated spectra, our only preprocessing consists of dealing with gaps in the wavelength coverage. The missing parts of the spectra are simply filled-in based on the best fit linear combination using eigenspectra from a prior PCA analysis following [YCS04b]. This simplifies the implementations of our method, but we note that our approach is also capable of dealing with incomplete data, e.g., similarly to gappy PCA [CCS95].

3.2 Effect of Using -NN or -NN Parameterization

An important aspect of our method involves the choice of nearest neighbors (NNs), and so we pause here to remark on the extremely heterogenous density of the sample, and how it affects our choice of embedding parameters.

In many applications, it is of interest, e.g., for reasons of interpretability, to use a constant bandwidth over all data points, i.e., choose NN based on an -NN threshold. A naive implementation of this would be computationally infeasible, even for moderately-large data, since the majority of galaxies lie in a very dense region of wavelength space. The matrices whose eigenvectors define our embeddings become very dense if we require galaxies within a fixed distance to be connected; and, indeed, such matrices would not fit in memory on the machines used.

One view of this heterogeneity is presented in Figure 4, which shows a histogram (in logarithmic scale) of the ratios of first to 32nd NNs over the entire data set. It is clear from the figure that the vast majority of galaxies lie in very dense regions of space (for which this ratio is close to one), while a non-negligible fraction lie in more sparsely populated regions (for which the ratio is less than one half.)

In this paper, we choose to autotune the bandwidths and fix the number of NN edges, in order to keep our computations tractable on the hardware at our disposal. We note that [CL06] proposed an alternative normalization of the diffusion operator of Eqn. (5) (which they term the “Beltrami” normalization) that removes the effect of the sampling density on the embeddings. Our computations confirmed the amelioration of this effect to some degree, but at the cost of increased runtime due to the slower decay of the spectrum associated with the Beltrami normalization. For this reason we keep the former normalization of Eqn. (5), termed “Markov” normalization, for the remainder of this paper.

Figure 4: Histograms of ratios of distances to first to 32nd nearest neighbors, over the full 517K spectra data set. The vast majority of galaxies lie in very dense regions of the parameter space (corresponding to ratios near unity), while a non-negligible fraction lie in very sparsely populated regions (the bumps near zero).

3.3 Effect of Global via Nearest-neighbors

Next, we discuss the choice of , the number of NN edges when using the -NN rule, on the constructed graphs and their embeddings. While the effects of the choice of can be highly problem-dependent, and while the choice of ultimately should be determined by a downstream astronomical model selection criteria (e.g., optimizing mean-squared error, or a precision-recall metric if one is performing classification, or obtaining qualitative insight into rare galaxy types of interest if one is interested in a more targeted goal), we have noticed several general trends of interest. First, if one chooses to be larger, then one tends to identify well the large-scale or global structure in the data, while washing out or homogenizing small-scale or local structure. Second, and relatedly, if one chooses to be smaller, then local or small-scale structure tends to be highlighted, but since the constructed graphs are much sparser and noisier, it is more difficult to draw sufficient statistical strength to identify as reliably global or large-scale structure.

To make these informal observations somewhat more quantitative, we describe next the uniformity properties of both the eigenvalues as well as the eigenvectors used in our embeddings. In Figure 5, we plot the decay of the top 101 eigenvalues of the lazy Markov operator with autotuned bandwidths, for values of ranging from to by powers of two. From the figure, we can see that as is increased and thus as more edges are added to the graph, the rate at which the eigenvalues decay increases, i.e., the matrix is more well-embeddable in a low-dimensional space. Next, in Figure 5, we plot the ratio of the largest eigenvector norm to the median eigenvector norm, as a function of the index of the eigenvectors. Specifically, if is the embedding of spectrum on eigenvectors 1 to , we plot the quantity


as a function of . This quantity is a measure of the non-uniformity of the distribution of mass on the eigenvectors, in a manner similar to leverage scores in linear dimension reduction [YMS14, CH09, MD09]. In general, the eigenvectors become more uniform with increasing , and local heterogeneities—as captured by localized eigenvectors—become less prominent with the inclusion of additional edges. In Table 1, we give the values of several measures of matrix complexity as is changed. We note in particular that as the number of nearest-neighbor edges increases, both the stable rank and the leverage scores decrease monotonically. This accords with the intuition mentioned previously that higher connectivity tends to “smooth out” the embeddings.

It is perhaps interesting to note that for these values of , the Markov matrix is not particularly well-approximated by a low-rank matrix. Nevertheless, the leading eigenvectors of do correlate well with physical intuition, and they do provide meaningful low-dimensional representations of the data. We should also note that these effects of increasingly-localized eigenvectors and slower eigenvalue decay as NN data graphs are made sparser has been observed previously in connection with the Nyström method in large-scale machine learning applications; see the statistics on the sparsified radial basis function kernels in the analysis of [GM16]. Our results are consistent with this prior work [GM16] that demonstrated that sparser graphs (within a parameterized family such as with radial basis function kernels or -NN graphs) are much less “nice” in the sense that their eigenvalues decay more slowly and that their eigenvectors are more localized.

Figure 5: (a) Top 101 eigenvalues of the lazy Markov operator with autotuned bandwidths, for (red) to (yellow). (b) Max-to-median ratio of eigenvector norms, as a function of embedding dimension for the lazy Markov operator with autotuned bandwidths, for (red) to (yellow).
% nnz

100th largest leverage score

2 0.000963 317866 1.0000 99.1685 99.9844 99.9754 604
4 0.001730 252304 0.9999 97.4220 99.9807 99.9726 347
8 0.003263 201239 0.9996 95.2012 99.9765 99.9698 182
16 0.006323 168355 0.9994 92.7084 99.9727 99.9674 138
32 0.012431 149618 0.9983 89.6513 99.9704 99.9661 92
64 0.024609 139635 0.9995 86.1339 99.9696 99.9657 65
128 0.048867 134492 0.9995 81.8590 99.9701 99.9661 44
256 0.097110 131889 0.9995 77.3375 99.9716 99.9669 30
512 0.192818 130582 0.9966 72.1517 99.9736 99.9682 18
1024 0.382121 129932 0.9990 67.0726 99.9759 99.9697 12
2048 0.755086 129610 0.9997 62.2042 99.9785 99.9714 9
Table 1: Several measures of matrix complexity as the nearest-neighbor parameter is varied. As is decreased, the graphs are sparser, and both eigenvector-based and eigenvalue-based “niceness” metrics for the matrices are much worse (in agreement with the results of [GM16]).

Finally, in Figure 6, we show the embeddings of the full data set on the third and fourth eigenvectors (we will consider other eigenvectors below) of the lazy Markov operator with autotuned bandwidths, for ranging from to by factors of four. The points are color-coded by the value of the second eigenvector. These plots hide density information (one example of which we will present below), but they illustrate that as increases, the data points become visually more compressed, with more linkages between different features of the point cloud. In particular, the red part of the embedding varies considerably as is adjusted, and small-scale local structure such as the cyan “heel” pointing downward and to the right for is lost for larger values of . In Section 4, we will argue that physical considerations imply that the connection between the dark blue and cyan points is an artifact of the method, and that the value seems to provide a more meaningful embedding; and in Section 5, we will describe in more detail how to obtain insight about the small-scale or local properties of this.

Figure 6: Embedding on third and fourth eigenvectors of the lazy Markov operator with autotuned bandwidths, for . Points are color-coded by the value of the second eigenvector.

4 Global Structure via Global Embeddings

In this section, we provide examples of how our method can be used as an exploratory tool to identify large-scale global structure in the SDSS data. This should provide some intuition regarding the physical interpretation of the leading non-trivial eigenvectors of the Markov operator , and it will illustrate the strength of our approach at highlighting properties of the data. In particular, in Section 4.1, we illustrate the density of galaxies in the embedding from several different angles; in Section 4.2, we provide an illustration of the data that highlights the continuum shape and line strength; in Section 4.3, we provide an illustration of the data that discriminates particularly well red elliptical galaxies; in Section 4.4, we illustrate the data embedded on the PCA mixing angles; in Section 4.5, we compare our method with the insight that can be obtained with the BPT diagrams; in Section 4.6, we compare the results of our method with the SDSS classification; and finally in Section 4.7, we discuss a peculiar aspect of one of the figures and explain it in terms of an artifact of the experimental measurements.

4.1 Density of Galaxies

The number density of galaxies on eigenvectors 2 through 5 of the lazy Markov embedding, with =32 and autotuned bandwidths is presented in Figure 7. Each of the subfigures presents the same data presented from a different angle—literally, since the data are being projected onto two dimensions of the embedding space. (We included all of these for comparison with previous and subsequent figures. For example, Figure 6 is taken from the same angle as Figure 7, and thus the value of means that Figure 6 corresponds to Figure 7.) Due to the widely varying nature of the density, the color is plotted in logarithmic scale. It is clear that the vast majority of galaxies lie on the orange to dark red “spine” in the lower portion of Figure 7, but this appears very different on different pairs of eigenvectors. For example, the high-density regions are somewhat spread out in Figure 7, but they are projected to almost the same point and thus not easily-discriminatable in Figures 7 and 7. Figure 7 also reveals two disjoint regions whose density are an order of magnitude greater than surrounding regions. As young galaxies evolve and burn up their fuel, they move from the left-most tip of the figure toward the high-density regions as they get redder. Yet this subfigure shows that the red galaxies on the right-most ridge are not a “well-connected” cluster, in the sense that there is a density minimum between the two peaks. This suggests that one can obtain improved quantitative insight about this class of galaxies by studying these embeddings using stellar population synthesis models, a topic which we leave to future work.

Figure 7: Density (in scale) of galaxies on eigenvectors 2 through 5 of the lazy, autotuned Markov operator: (a) eigenvectors (2,3); (b) eigenvectors (2,4); (c) eigenvectors (2,5); (d) eigenvectors (3,4); (e) eigenvectors (3,5); (f) eigenvectors (4,5). Plots on all pairs of eigenvectors 2 through 5 are presented as a point of reference for comparison with other figures.

4.2 Continuum Shape and Line Strength

Figure 8 illustrates how our method can be used to gain insight into the continuum shape and line strength. To do so, we present a view of the data set as embedded on the second and fourth eigenvectors of in Figure 8. (Thus, this perspective corresponds to Figure 7.) In this plot, color corresponds to the value of the second eigenvector. The boxes labeled A1–A5, R1–R5, and E1–E5 denote regions of embedding coordinates over which we calculated the mean spectra to increase the signal-to-noise ratio. The corresponding spectra are shown in Figure 8 through 8. These delineations were drawn in an ad-hoc manner, and are meant only to guide our interpretation of the embedding dimensions.

In Figure 8, in particular, we present the mean spectrum (in dark blue) and the standard deviation of the mean (in light blue) of the spectra in the boxes labeled R1–R5 in Figure 8. From these average spectra, we can clearly see the correlation of the second eigenvector with the shape of the continuum: positive values of correspond to red continuum shapes, while negative values correspond to blue continuum shapes. There is also a clear correlation with the strengths of emissions lines, with negative values corresponding to larger fluxes. As the continuum spectrum gets bluer, the spectral lines appear in the early-type galaxy indicating star formation. A natural continuation of this trend is in the E1–E5 boxes shown in the Figure 8, where the lines become overwhelmingly strong in the young galaxies. At first glance, we see a dramatic change in not only the lines but also their ratios. For example, the [O iii] (4959Å, 5007Å) lines grow significantly in comparison to the H  (6563Å). In Figure 8, we present the mean and standard deviation of spectra in boxes A1–A5 of Figure 8. These spectra trace a trend that is different from the main direction, but we see increasing line strengths and bluer spectra. This population of galaxies will be immediately obvious in Section 4.6, where we show the results of the SDSS classification [BCW04]: these are Active Galactic Nuclei (AGN).

Figure 8: (a) Subpopulations of galaxies sampled according to the embedding on eigenvectors 2 and 4 of the Markov operator. (b)-(d) Mean spectrum (in dark blue) and standard deviation (in light blue) for each of the subpopulations in (a).

4.3 A View into Red Ellipticals

Figure 9 illustrates how our method can be used to gain insight into the properties of red elliptical galaxies. In Figure 9 we present a view of the data set as embedded on the second and fifth eigenvectors of . (Thus, this perspective corresponds to Figure 7.) Here, we have chosen to explore this projection since discriminates red galaxies in the very dense region of spectrum space better than other low-order eigenvectors. As in Figure 8, color corresponds to the value of , and we have drawn bounding boxes in an ad-hoc manner.

In Figure 9, we present the mean and standard deviation of spectra in boxes RR1–RR5 from Figure 9. It is clear that all spectra in this area of the embedding share a red continuum shape, with small or absent emissions lines. We also note the increasing strength of H  with increasing values of . Comparing with Figure 7, it is clear that the density of galaxies also decreases with increasing , and in particular that box RR1 contains tens of thousands of galaxies. In Figure 9 we present the mean and standard deviation of spectra in boxes RG1–RG5. These correspond to a region of high density, as is evident in Figure 7. The transition from red to blue continuum shape is perhaps the most evident in this small region of the embedding space.

Finally, for completeness, in Figure 9 we present selected outlier spectra, labeled O1–O5 in Figure 9. Spectra O4 and O5 have been scaled by factors of 1/10 and 1/100, respectively, for legibility. It is common to want to identify outliers, either to clean up the data or in the hopes of identifying new phenomena. In this case, all of these spectra that were identified as outlying by our method appear to be artifacts or errors in the pre-processing, e.g., the gap correction. We note that each of these erroneous spectra appear separated from the remainder of the data, indicating the robustness and usefulness of the method for identifying outliers. These artifacts are truly exceptions and have been included for illustrative purposes.

Figure 9: (a) Subpopulations of red galaxies sampled according to the embedding on eigenvectors 2 and 5 of the Markov operator. (b)-(c) Mean spectrum (in dark blue) and standard deviation (in light blue) for each of the subpopulations in (a). (d) Selected outlier spectra; O4 has been scaled by a factor of 1/10 and O5 by a factor of 1/100 for legibility.

4.4 Relationship to eCoeff

Here, we compare our embeddings with those obtained via PCA, a dimension-reduction method that optimally preserves linear structure in high-dimensional data sets. These embeddings are computed in the SDSS pipeline and stored as ecoeff_i for i ranging from 0 to 4. (The data are not mean-centered in this computation, so ecoeff_0 corresponds to the projection onto the average spectrum.) Recall that the coefficients of the expansion on the eigenspectra have previously been used to classify galaxy types [YCS04a], among many other uses.

In Figure 10, we plot the galaxies in the embedding on the mixing angles and , where in this figure the opacity is proportional to the density and the coloring is determined by eigenvector 2 (in Figure 10) or eigenvector 5 (in Figure 10) of the lazy, autotuned, Markov operator. From Figure 10 it is clear that and are highly correlated. Given our results in the previous subsection, this is not surprising, since both measures mediate between red and blue continuum shapes. Not shown are figures showing that and discriminating among blue spectra, while Figure 10 displays how picks up in a more complicated way variation among red spectra.

Figure 10: Embedding of galaxies on the PCA mixing angles and , color-coded by eigenvectors 2 and 5 of the lazy, autotuned, Markov operator.

4.5 Relationship with BPT Diagrams

Here, we compare our embeddings with those based on certain emission line strength ratios, known as BPT diagrams. These line-ratio embeddings are based on the flux in four wavelength bins, corresponding to the , and emissions lines. In Figure 11, we show the embedding plot the galaxies in the embedding on the ratios


and as in Figure 10 the opacity is proportional to the density and the coloring is determined by the second eigenvector of the lazy, autotuned, Markov operator. These color versions of Figure 2 show that the apparent bifurcation in the BPT plots is really a continuous and gradual change. The most dominant component shown in color resolves the degeneracies in these plots and shows that the scatter is primarily due to the red galaxies. These plots hint at the possibility of a better classification algorithm that uses the coordinates provided by our method instead of just the BPT line measurements, and this is an obvious direction for future work.

Figure 11: BPT line-ratio diagrams, color-coded by second eigenvector of the lazy, autotuned, Markov operator.

4.6 Comparison to the SDSS Classification

Here, we augment our embeddings with the addition of the SDSS class labels as derived by [BCW04]. Our sample of about spectra are split into the following six categories: star-forming; low signal-to-noise star-forming; composite; active galactic nuclei (AGN); low signal-to-noise LINER; and unclassified. This state-of-the-art classification scheme goes beyond the BPT diagrams and uses a total of 7 lines to distinguish the separate classes. See Figure 12 for BPT line-ratio diagrams with galaxies color-coded by class label. (That is, the color-coding in this figure depends on these galaxy labels and is different than that used in previous figures.) The linear class boundaries through high-density regions of the data, as well as a quick comparison with Figure 11, highlights the arbitrariness of the class boundaries.

Figure 12: BPT line-ratio diagram of the SDSS DR7 MGS, with galaxies color-coded by class label.

In Figure 13, we present the embeddings on eigenvectors 2 through 5 of the lazy, autotuned Markov operator with . We have color-coded the points according to their type: blue for star-forming, cyan for low signal-to-noise star-forming, green for composite, magenta for AGN, and red for LINER. We have omitted unclassified spectra from these figures in order to make the embeddings more legible; and we have adjusted the transparency of the plotted points so that the density of spectra is visually evident. In Figure 13, one can discern a clear transition from star-forming to composite to LINER with increasing . This agrees with the earlier remarks regarding the correlation of with continuum shape. In addition, as noted previously, the AGN form a separate “spur” between the star-forming galaxies and LINERs. In Figure 13, we note the concentration of composite galaxies along the high-density spine evident in Figure 7 that was investigated in Figure 9. These spectra exhibit a clear transition from blue to red continuum shapes, which agrees with their position in our embeddings between star-forming and LINER galaxies. We also note the concentration of LINER and low signal-to-noise star-forming galaxies along the lower right rim of the embedding. (We will see in Section 5 that these two types are difficult to distinguish from one another.) In Figures 1313, we note the separation between the bulk of the spectra and those labeled as AGN. In particular, in Figure 13 one can discern a decision boundary for AGN in the purple axis behind the “fan” of composite and LINER spectra. Again, this is striking visual evidence of the arbitrariness of this class boundary.

Figure 13: Top four nontrivial global eigenvectors of the lazy, autotuned Markov operator, color-coded by type: blue for star-forming; cyan for low signal-to-noise star-forming; green for composite; magenta for active galactic nuclei (AGN); and red for LINER. Unclassified spectra are omitted from this figure. (a) Embedding on eigenvectors (2,3); (b) embedding on eigenvectors (2,4); (c) embedding on eigenvectors (2,5); (d) embedding on eigenvectors (3,4); (e) embedding on eigenvectors (3,5); (f) embedding on eigenvectors (4,5). For comparison, the ordering of the subfigures is the same as in Figure 7, but the different color-coding and handling of the density makes these subfigures look different.

4.7 Bimodality of the Blue Ridge in Figure 8

We conclude this section by noting that, upon closer inspection of Figure 8, the blue sequence of galaxies actually appear to have a bifurcation and there are two parallel ridges going through the the boxes E1–E3. To highlight and understand better this apparent bimodality, consider Figure 14, where we present multiple embeddings on two pairs of higher-order eigenvectors. In these panels, the separate trendline is clearly visible. The key insight, however, comes from the color scheme we used: it simply represents the redshift of each galaxy. That is, again, the color-coding in Figure 14 is incomparable with that of Figure 8, as here it is determined by the redshift of the galaxy. We see that the one of the strands contains the lowest redshift spectra with that belong to the largest galaxies on the sky, and one might wonder whether this is of astronomical significance. Upon examination, we have determined that this is an artifact: the SDSS photo pipeline is known to break apart large galaxies, and here we witness its power to pick out individual star-forming H ii regions, which the target selection identified as galaxies. Using the SkyServer Image Cutout services, we checked the thumbnail of these sources, and their morphology is in complete agreement. In some of the cases the enhanced star-formation appears to be induced by merging galaxies.

Figure 14: Embeddings of SDSS DR7 MGS on higher-order eigenvectors, 3rd versus 6th and 3rd versus 7th in 14 and 14, respectively, color-coded by redshift in logarithmic scale.

5 Local Structure via Local Embeddings

In this section, we provide examples of how our method can be used as an exploratory tool to identify small-scale local structure in the SDSS data. Previous work has shown performance gains using appropriately-seeded locally-biased semi-supervised eigenvector embeddings for tasks such as classifying handwritten digits and brain activity in fMRI images [HM14]. Thus, we expect that by seeding our local embeddings in an appropriate manner, we should be able to identify small-scale local properties of the data as well as to extract more information from a smaller number of features, especially as pertains to the classification of smaller classes of galaxies.

5.1 Zooming in Locally around Given Seeds

Here, we provide some intuition into the effect of the choice of seed vector on the local embedding introduced in Section 2. For this purpose, we work with the lazy, autotuned Markov operator with bandwidth , and we use the global embedding to provide meaningful local seeds. In Figure 15, we plot embedding of the galaxies on global eigenvectors 2 and 3, and color the points by the second global eigenvector. (This corresponds to Figure 7.) Also indicated with black outlines are two subsets of galaxies. These galaxies could be studied with global methods, as described in Section 4, but one might expect (and we have confirmed) that identifying such small-scale local structure with global eigenvectors—even if we permit the flexibility of multiple scales, multiple number of nearest neighbors, etc.—is extremely difficult and fragile. Instead, these galaxies can be used as seed vectors for a local embedding with our main method. In each case, the seed vector used in Generalized LocalSpectral weas taken to be the indicator vector of the subset identified in Figure 15, and the correlation parameter was set to 1/4 for each eigenvector. The seed sets were defined by manually choosing one data point of interest and taking its 100 nearest neighbors in the global embedding space.

In Figures 15 and 15, we plot the resulting local embeddings. That is, all the galaxy data points are plotted in the space spanned by the top two semi-supervised eigenvectors. For comparison with Figure 15, we color the data points by the second global eigenvector (and the data points comprising the seed vector are indicated by black outlines). In both Figure 15 and 15, galaxies similar to those in the seed vector are drawn away from the bulk of the embedding, offering a “zoomed-in” view of the local region of interest defined by the seed vector; and galaxies very dissimilar to those in the seed vector are given lower importance and clumped together. As with previous work that employed semi-supervised eigenvectors [HM14], these locally-biased embeddings can be used to visualize the data “near” the seed set of nodes, and they can be used to define features for use in downstream classification tasks.

Figure 15: (a) Embedding on global eigenvectors 2 and 3 of the lazy, autotuned Markov operator with , with two seed node sets highlighted in black. (b)–(c) Embedding on the leading two semi-supervised eigenvectors using seeds highlighted in (a), illustrating how semi-supervised eigenvectors can “zoom in” on particular parts of the data, down-weighting parts of the data far from the seed nodes. In all subfigures, points are color-coded by the second global eigenvector.

5.2 Locally-biased Learning

To illustrate how the locally-biased embeddings can be used to perform improved locally-biased classification, we constructed a set of five local embeddings, one per class, seeded with a number of positive and negative examples of one class. That is, for a class and number of examples , we choose random spectra of class and set the corresponding elements of to . We further choose random spectra from classes other than and set the corresponding elements of to . (We omit unclassified spectra from our sample, but we discuss these unclassified spectra below.) We expect that the local embedding with such a seed vector will better separate the class from the remaining spectra. For this study, we varied in the range , but for the sake of brevity we only present results with =640. We calculate the top local eigenvectors for each class, with uniform correlation parameter . (Varying these parameters could potentially lead to improvements, as has been shown previously.)

We then trained a five-class logistic regression model using our global and local embeddings as features. We grouped features together into sets of five, taking a fixed number of local eigenvectors per class. Thus, we tested one model using the top local eigenvector from each class, another using the top two local eigenvectors from each class, and so on. We compared these with models built using the name number of global eigenvectors, that is, the top 5, 10, and so on. For each number of features, we cross-validated the model using 10 folds with a 10%/90% train/test split. The choice to train on the smaller side of the fold was made due to computational considerations, and we did not see a significant degradation in accuracy when the training and testing sets were reversed.

The logistic regression model returns a vector of probabilities that a spectrum in the test set belongs to a given class. Using these probabilities, we constructed what we term “probability confusion matrices”, whose entry represents the average probability (over the test set) that a spectrum of type is assigned by the model to type . Here, we take the class labels described in Section 4.6 as ground truth (although it should be noted that these labels are themselves imperfect, as they are the output of a separate model based on emission line ratios [BCW04]).

In Figure 16, we present the probability confusion matrices for models constructed with 5, 10, and 15 global and local eigenvectors of the lazy, autotuned Markov operator with . In this figure, the top row corresponds to increasing number of global eigenvectors and the bottom to local eigenvectors. Comparing Figures 16 and 16, we note that with 5 global features, the model has difficulty correctly classifying AGNs, while using 5 local features leads to a significant increase in the correct classification of such spectra. Thus, for a fixed budget of features the local model outperforms the global version, as expected. Increasing the number of features, global or local, leads to improved performance on this rare galaxy type, and the local model consistently outperforms its global counterpart. (Clearly, the benefit of using local features is much less if we are interested in predicting very non-rare classes, as we have checked and confirmed.)

We also note that for all models shown, there is a large ambiguity between low signal-to-noise star-forming galaxies and LINERs. Indeed, for all cases the probability that a galaxy labeled LINER is classified as low signal-to-noise star-forming is greater than the probability that it is correctly labeled. This is not surprising in light of Figure 13, where significant mixing of these two types, colored cyan and red respectively, is visually evident. Finally, we tested the performance of the models as a function of the number of nearest neighbors in the underlying graph. In Figure 17 we present the per-class true positive rates and false discovery rates as a function of . Somewhat surprisingly, there is little dependence in either until , in which case we see a slight improvement in performance.

Figure 16: (a)–(c) Probability confusion matrices for five-class logistic regression using 5, 10, and 15 global eigenvectors of the graph; (d)–(f) using 5, 10, 15 local eigenvectors, with seed consisting of 640 positive and negative examples for each class. Results are cross-validated with 10 folds of 10/90 train/test splits.
Figure 17: (a) Per-class true positive rates with 45 global and local eigenvectors as a function of bandwidth ; (b) per-class false discovery rates as a function of .

5.3 Labeling Unclassified Spectra

For completeness, we include here a few comments about a use case that might have already occurred to the reader. There are a large number of galaxies for which there do not exist reliable ground truth labels into one of the classes we have been working with, and one might hope to define features—either global or locally-biased—to help with the classification of such unclassified spectra. For example, given what we have observed above, there is a gradual transition between different classes, and the use of locally-biased features might define a locally-biased geometry to help with this imputation task. We have examined this problem in some detail, and while this topic no doubt merits further attention, we have discovered that this is very challenging. The reason is that unclassified spectra (even when they are not obvious outliers casued by experimental artifacts) typically have properties that are very different than classified spectra. That is, they are unclassified for a good reason, i.e., since they are very different than spectra in one of the main classes, and in some sense they form their own (diverse) “other” cluster, whether viewed from a global or a locally-biased perspective.

6 Conclusion

We have presented a novel technique based on locally-biased semi-supervised eigenvectors to gain insight into the similarity of galaxies, and we have demonstrated the recovered nonlinear structure on the Main Galaxy Sample of the Sloan Digital Sky Survey. By constructing low-dimensional embeddings which respect local connectivity, we are able to visualize a range of astronomical phenomenon, e.g., the process of stellar evolution from hot, blue galaxies to cool, red ones. Unlike previously-used methods such as PCA or other recently-popular nonlinear dimensionality reduction methods, our method can focus on and highlight in a refined way local properties and/or global properties of the data. Depending on the choice of knobs of the method, the embedding maps we create contain all galaxies, and they can be used to identify either large-scale global structure in the data or small-scale local structure in the data.

The main parameters that we empirically derive clearly correspond to changes in the continuum shape and the strengths of spectral lines, and these help disambiguate traditional BPT plots and isolated AGNs. We observe that there are no disjoint groups of galaxies to indicate natural classes, but instead there are smooth continuous transitions between classes; and we observe that in many cases outliers (sometimes artifacts) are quite different than any other spectra in the data set. Moreover, the locally-biased versions of these embeddings (which use semi-supervised seed labels to construct semi-supervised eigenvectors) demonstrate that the method can be used to enhance the embeddings such that we can better focus on galaxies of interest, e.g., rare galaxy types that can be overwhelmed by global methods.

To summarize, our method of locally-biased semi-supervised eigenvectors may be viewed as a new type of computational microscope for astronomical data. It can not only reproduce known properties of the data and identify outliers and artifacts, but it can also be used to enhance subtle trends around selected galaxies to facilitate new discoveries. Obvious future work should focus on applying this new methodology to improve galaxy classification, to derive continuous spectral models, e.g., for photometric redshift estimators, and to understand better galaxy evolution with the help of stellar population synthesis models.


We would like to thank Ilse Ipsen for valuable discussions. This material was based upon work partially supported by the National Science Foundation under Grant DMS-1127914 to the Statistical and Applied Mathematical Sciences Institute. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation. MWM would also like to acknowledge the Defense Advanced Research Projects Agency and the Department of Energy for providing partial support for this work.


  • [AAMA09] K. Abazajian, J. Adelman-McCarthy, M. Agüeros, S. Allam, C. Prieto, D. An, K. Anderson, S. Anderson, J. Annis, N. Bahcall, et al., The seventh data release of the sloan digital sky survey, The Astrophysical Journal Supplement Series 182 (2009), no. 2, 543.
  • [BC03] G. Bruzual and S. Charlot, Stellar population synthesis at the resolution of 2003, Monthly Notices of the Royal Astronomical Society 344 (2003), 1000–1028.
  • [BCW04] J. Brinchmann, S. Charlot, S. D. M. White, C. Tremonti, G. Kauffmann, T. Heckman, and J. Brinkmann, The physical properties of star-forming galaxies in the low-redshift Universe, Monthly Notices of the Royal Astronomical Society 351 (2004), 1151–1179.
  • [BM07] J. Bremer and M. Maggioni, DiffusionGeometry,, 2006–2007.
  • [BN03] M. Belkin and P. Niyogi, Laplacian eigenmaps for dimensionality reduction and data representation, Neural Computation 15 (2003), no. 6, 1373–1396.
  • [BPT81] J. A. Baldwin, M. M. Phillips, and R. Terlevich, Classification parameters for the emission-line spectra of extragalactic objects, Publications of the Astronomical Society of the Pacific (1981), 5–19.
  • [BSC00] T. Budavári, A. S. Szalay, A. J. Connolly, I. Csabai, and M. Dickinson, Creating Spectral Templates from Multicolor Redshift Surveys, The Astronomical Journal 120 (2000), 1588–1598.
  • [BWS09] T. Budavári, V. Wild, A. Szalay, L. Dobos, and C. Yip, Reliable eigenspectra for new generation surveys, Monthly Notices of the Royal Astronomical Society 394 (2009), no. 3, 1496–1502.
  • [CBS99] A. J. Connolly, T. Budavári, A. S. Szalay, I. Csabai, and R. J. Brunner, An Orthogonal Approach to Photometric Redshifts, Photometric Redshifts and the Detection of High Redshift Galaxies (R. Weymann, L. Storrie-Lombardi, M. Sawicki, and R. Brunner, eds.), Astronomical Society of the Pacific Conference Series, vol. 191, 1999, p. 13.
  • [CCS95] A. J. Connolly, I. Csabai, A. S. Szalay, D. C. Koo, R. G. Kron, and J. A. Munn, Slicing Through Multicolor Space: Galaxy Redshifts from Broadband Photometry, The Astronomical Journal 110 (1995), 2655.
  • [CH09] S. Chatterjee and A. Hadi, Sensitivity analysis in linear regression, vol. 327, John Wiley & Sons, 2009.
  • [Chu97] F. Chung, Spectral graph theory, vol. 92, American Mathematical Society, 1997.
  • [CL06] R. Coifman and S. Lafon, Diffusion maps, Applied and Computational Harmonic Analysis 21 (2006), no. 1, 5–30.
  • [CLL05] R. Coifman, S. Lafon, A. Lee, M. Maggioni, B. Nadler, F. Warner, and S. Zucker, Geometric diffusions as a tool for harmonic analysis and structure definition of data. part i: Diffusion maps, PNAS 102 (2005), no. 21, 7426–7431.
  • [CS99] A. J. Connolly and A. S. Szalay, A Robust Classification of Galaxy Spectra: Dealing with Noisy and Incomplete Data, The Astronomical Journal 117 (1999), 2052–2062.
  • [CSB95] A. J. Connolly, A. S. Szalay, M. A. Bershady, A. L. Kinney, and D. Calzetti, Spectral Classification of Galaxies: an Orthogonal Approach, The Astronomical Journal 110 (1995), 1071.
  • [FHFC92] P. J. Francis, P. C. Hewett, C. B. Foltz, and F. H. Chaffee, An objective classification scheme for QSO spectra, Astrophysical Journal 398 (1992), 476–490.
  • [FHT01] J. Friedman, T. Hastie, and R. Tibshirani, The elements of statistical learning, vol. 1, Springer series in statistics, 2001.
  • [FNL09] P. Freeman, J. Newman, A. Lee, J. Richards, and C. Schafer, Photometric redshift estimation using spectral connectivity analysis, Monthly Notices of the Royal Astronomical Society 398 (2009), no. 4, 2012–2021.
  • [GM16] A. Gittens and M. W. Mahoney, Revisiting the Nyström method for improved large-scale machine learning, Journal of Machine Learning Research 17 (2016), no. 117, 1–65.
  • [Han12] T. J. Hansen, sseigs_demo,, 2012.
  • [HLMS04] J. Ham, D.D. Lee, S. Mika, and B. Schölkopf, A kernel view of the dimensionality reduction of manifolds, Proceedings of the 21st International Conference on Machine Learning, 2004.
  • [HM14] T. J. Hansen and M. W. Mahoney, Semi-supervised eigenvectors for large-scale locally-biased learning, Journal of Machine Learning Research 15 (2014), 3691–3734.
  • [KGKH06] L. J. Kewley, B. Groves, G. Kauffmann, and T. Heckman, The host galaxies and classification of active galactic nuclei, Monthly Notices of the Royal Astronomical Society 372 (2006), 961–976.
  • [LL06] S. Lafon and A. Lee, Diffusion maps and coarse-graining: A unified framework for dimensionality reduction, graph partitioning, and data set parameterization, IEEE Transactions on Pattern Analysis and Machine Intelligence 28 (2006), no. 9, 1393–1403.
  • [LS99] D. D. Lee and H. S. Seung, Learning the parts of objects by non-negative matrix factorization, Nature 401 (1999), no. 6755, 788–791.
  • [LW10] A. Lee and L. Wasserman, Spectral connectivity analysis, Journal of the American Statistical Association 105 (2010), no. 491, 1241–1255.
  • [MD09] M. W. Mahoney and P. Drineas, CUR matrix decompositions for improved data analysis, Proc. Natl. Acad. Sci. USA 106 (2009), 697–702.
  • [MO11] M. W. Mahoney and L. Orecchia, Implementing regularization implicitly via approximate eigenvector computation, Proceedings of the 28th International Conference on Machine Learning, 2011, pp. 121–128.
  • [MOV12] M. W. Mahoney, L. Orecchia, and N. Vishnoi, A local spectral method for graphs: With applications to improving graph partitions and exploring data graphs locally, The Journal of Machine Learning Research 13 (2012), no. 1, 2339–2365.
  • [PM11] P. O. Perry and M. W. Mahoney, Regularized Laplacian estimation and fast eigenvector approximation, Annual Advances in Neural Information Processing Systems 24: Proceedings of the 2011 Conference, 2011.
  • [RFLS09] J. Richards, P. Freeman, A. Lee, and C. Schafer, Exploiting low-dimensional structure in astronomical spectra, The Astrophysical Journal 691 (2009), no. 1, 32.
  • [RS00] S. Roweis and L. Saul, Nonlinear dimensionality reduction by locally linear embedding, Science 290 (2000), no. 5500, 2323–2326.
  • [RZMC11] M. Rohrdanz, W. Zheng, M. Maggioni, and C. Clementi, Determination of reaction coordinates via locally scaled diffusion map, The Journal of Chemical Physics 134 (2011), no. 12, 124116.
  • [SS01] B. Schölkopf and A. J. Smola, Learning with kernels: Support vector machines, regularization, optimization, and beyond, MIT Press, Cambridge, MA, USA, 2001.
  • [SWL02] M. A. Strauss, D. H. Weinberg, R. H. Lupton, V. K. Narayanan, J. Annis, M. Bernardi, M. Blanton, S. Burles, A. J. Connolly, J. Dalcanton, M. Doi, D. Eisenstein, J. A. Frieman, M. Fukugita, J. E. Gunn, Ž. Ivezić, S. Kent, R. S. J. Kim, G. R. Knapp, R. G. Kron, J. A. Munn, H. J. Newberg, R. C. Nichol, S. Okamura, T. R. Quinn, M. W. Richmond, D. J. Schlegel, K. Shimasaku, M. SubbaRao, A. S. Szalay, D. Vanden Berk, M. S. Vogeley, B. Yanny, N. Yasuda, D. G. York, and I. Zehavi, Spectroscopic Target Selection in the Sloan Digital Sky Survey: The Main Galaxy Sample, The Astronomical Journal 124 (2002), 1810–1824.
  • [TDSL00] J. B Tenenbaum, V. De Silva, and J. C. Langford, A global geometric framework for nonlinear dimensionality reduction, Science 290 (2000), no. 5500, 2319–2323.
  • [THK04] C. A. Tremonti, T. M. Heckman, G. Kauffmann, J. Brinchmann, S. Charlot, S. D. M. White, M. Seibert, E. W. Peng, D. J. Schlegel, A. Uomoto, M. Fukugita, and J. Brinkmann, The Origin of the Mass-Metallicity Relation: Insights from 53,000 Star-forming Galaxies in the Sloan Digital Sky Survey, Astrophysical Journal 613 (2004), 898–913.
  • [VC09] J. VanderPlas and A. Connolly, Reducing the dimensionality of data: locally linear embedding of sloan galaxy spectra, The Astronomical Journal 138 (2009), no. 5, 1365.
  • [WH05] V. Wild and P. C. Hewett, Peering through the OH forest: a new technique to remove residual sky features from Sloan Digital Sky Survey spectra, Monthly Notices of the Royal Astronomical Society 358 (2005), 1083–1099.
  • [YAA00] D. G. York, J. Adelman, J. E. Anderson, Jr., S. F. Anderson, J. Annis, N. A. Bahcall, J. A. Bakken, R. Barkhouser, S. Bastian, E. Berman, W. N. Boroski, S. Bracker, C. Briegel, J. W. Briggs, J. Brinkmann, R. Brunner, S. Burles, L. Carey, M. A. Carr, F. J. Castander, B. Chen, P. L. Colestock, A. J. Connolly, J. H. Crocker, I. Csabai, P. C. Czarapata, J. E. Davis, M. Doi, T. Dombeck, D. Eisenstein, N. Ellman, B. R. Elms, M. L. Evans, X. Fan, G. R. Federwitz, L. Fiscelli, S. Friedman, J. A. Frieman, M. Fukugita, B. Gillespie, J. E. Gunn, V. K. Gurbani, E. de Haas, M. Haldeman, F. H. Harris, J. Hayes, T. M. Heckman, G. S. Hennessy, R. B. Hindsley, S. Holm, D. J. Holmgren, C.-h. Huang, C. Hull, D. Husby, S.-I. Ichikawa, T. Ichikawa, Ž. Ivezić, S. Kent, R. S. J. Kim, E. Kinney, M. Klaene, A. N. Kleinman, S. Kleinman, G. R. Knapp, J. Korienek, R. G. Kron, P. Z. Kunszt, D. Q. Lamb, B. Lee, R. F. Leger, S. Limmongkol, C. Lindenmeyer, D. C. Long, C. Loomis, J. Loveday, R. Lucinio, R. H. Lupton, B. MacKinnon, E. J. Mannery, P. M. Mantsch, B. Margon, P. McGehee, T. A. McKay, A. Meiksin, A. Merelli, D. G. Monet, J. A. Munn, V. K. Narayanan, T. Nash, E. Neilsen, R. Neswold, H. J. Newberg, R. C. Nichol, T. Nicinski, M. Nonino, N. Okada, S. Okamura, J. P. Ostriker, R. Owen, A. G. Pauls, J. Peoples, R. L. Peterson, D. Petravick, J. R. Pier, A. Pope, R. Pordes, A. Prosapio, R. Rechenmacher, T. R. Quinn, G. T. Richards, M. W. Richmond, C. H. Rivetta, C. M. Rockosi, K. Ruthmansdorfer, D. Sandford, D. J. Schlegel, D. P. Schneider, M. Sekiguchi, G. Sergey, K. Shimasaku, W. A. Siegmund, S. Smee, J. A. Smith, S. Snedden, R. Stone, C. Stoughton, M. A. Strauss, C. Stubbs, M. SubbaRao, A. S. Szalay, I. Szapudi, G. P. Szokoly, A. R. Thakar, C. Tremonti, D. L. Tucker, A. Uomoto, D. Vanden Berk, M. S. Vogeley, P. Waddell, S.-i. Wang, M. Watanabe, D. H. Weinberg, B. Yanny, N. Yasuda, and SDSS Collaboration, The Sloan Digital Sky Survey: Technical Summary, The Astronomical Journal 120 (2000), 1579–1587.
  • [YCS04a] C.-W. Yip, A. Connolly, A. Szalay, T. Budavari, M. SubbaRao, J. Frieman, R. Nichol, A. Hopkins, D. York, and S. Okamura, Distributions of galaxy spectral types in the sloan digital sky survey, The Astronomical Journal 128 (2004), no. 2, 585.
  • [YCS04b] C. W. Yip, A. J. Connolly, A. S. Szalay, T. Budavári, M. SubbaRao, J. A. Frieman, R. C. Nichol, A. M. Hopkins, D. G. York, S. Okamura, J. Brinkmann, I. Csabai, A. R. Thakar, M. Fukugita, and Ž. Ivezić, Distributions of Galaxy Spectral Types in the Sloan Digital Sky Survey, The Astronomical Journal 128 (2004), 585–609.
  • [YCV04] C. W. Yip, A. J. Connolly, D. E. Vanden Berk, Z. Ma, J. A. Frieman, M. SubbaRao, A. S. Szalay, G. T. Richards, P. B. Hall, D. P. Schneider, A. M. Hopkins, J. Trump, and J. Brinkmann, Spectral Classification of Quasars in the Sloan Digital Sky Survey: Eigenspectra, Redshift, and Luminosity Effects, The Astronomical Journal 128 (2004), 2603–2630.
  • [YMS14] C.-W. Yip, M. W. Mahoney, A. Szalay, I. Csabai, T. Budavári, R. Wyse, and L. Dobos, Objective identification of informative wavelength regions in galaxy spectra, The Astronomical Journal 147 (2014), no. 5, 110.
Comments 0
Request Comment
You are adding the first comment!
How to quickly get a good reply:
  • Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
  • Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
  • Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minimum 40 characters and the title a minimum of 5 characters
Add comment
Loading ...
This is a comment super asjknd jkasnjk adsnkj
The feedback must be of minumum 40 characters
The feedback must be of minumum 40 characters

You are asking your first question!
How to quickly get a good answer:
  • Keep your question short and to the point
  • Check for grammar or spelling errors.
  • Phrase it like a question
Test description