
# Anomaly detection for machine learning redshifts applied to SDSS galaxies

Ben Hoyle, Markus Michael Rau, Kerstin Paech, Christopher Bonnett, Stella Seitz, Jochen Weller

Universitaets-Sternwarte, Fakultaet fuer Physik, Ludwig-Maximilians Universitaet Muenchen, Scheinerstr. 1, D-81679 Muenchen, Germany
Excellence Cluster Universe, Boltzmannstr. 2, D-85748 Garching, Germany
Institut de Física d’Altes Energies, Universitat Autònoma de Barcelona, E-08193 Bellaterra, Spain
Max Planck Institute for Extraterrestrial Physics, Giessenbachstr. 1, D-85748 Garching, Germany

E-mail: hoyleb@usm.uni-muenchen.de
###### Abstract

We present an analysis of anomaly detection for machine learning redshift estimation. Anomaly detection allows the removal of poor training examples, which can adversely influence redshift estimates. Anomalous training examples may be photometric galaxies with incorrect spectroscopic redshifts, or galaxies with one or more poorly measured photometric quantities.

We select 2.5 million ‘clean’ SDSS DR12 galaxies with reliable spectroscopic redshifts, and 6730 ‘anomalous’ galaxies with spectroscopic redshift measurements which are flagged as unreliable. We contaminate the clean base galaxy sample with galaxies with unreliable redshifts and attempt to recover the contaminating galaxies using the Elliptical Envelope technique. We then train four machine learning architectures for redshift analysis on both the contaminated sample and on the preprocessed ‘anomaly-removed’ sample, and measure redshift statistics on a clean validation sample generated without any preprocessing. We find an improvement on all measured statistics of up to 80% when training on the anomaly-removed sample as compared with training on the contaminated sample, for each of the machine learning routines explored. We further describe a method to estimate the contamination fraction of a base data sample.

###### keywords:
galaxies: distances and redshifts, catalogues, surveys.

## 1 Introduction

Photometric surveys can be maximally exploited for large scale structure analyses once galaxies have been identified and their positions on the sky and in redshift space have been measured. Measuring accurate spectroscopic redshifts is costly and time intensive, and is typically only performed for a small subsample of all galaxies. For this subsample of galaxies one may learn a mapping between the measured photometric properties, and the spectroscopic redshift. This mapping can then be applied to all photometrically identified galaxies to estimate redshifts. This is the basis of machine learning, and inherently assumes that the galaxies used to construct the mapping form an unbiased and uncontaminated sample of the final dataset.

Recent work by the current authors shows that if the base training sample is biased compared to the final sample, it may be augmented, e.g., by adding galaxies from simulations, to make the data sets appear more similar (Hoyle et al., 2015). The data augmentation process has been shown to improve the redshift estimate of the final test sample. In this paper we examine the problem of identifying poorly measured galaxy properties which contaminate the base training set. The contamination may be due to incorrectly measured spectroscopic redshifts, or unreliable photometric properties.

Photometric redshifts are also estimated by parametric techniques, for example from galaxy Spectral Energy Distribution (hereafter SED) templates. Some templates encode our knowledge of stellar population models which result in predictions for the evolution of galaxy magnitudes and colors. This parametric encoding of the complex stellar physics coupled with the uncertainty of the parameters of the stellar population models combine to produce redshift estimates which are little better than many non-parametric techniques (see e.g., Hildebrandt et al., 2010; Dahlen, 2013, for an overview of different techniques).

When a representative training sample is available, machine learning methods offer an alternative to template methods to estimate galaxy redshifts. The ‘machine architecture’ determines how to best manipulate the photometric galaxy input properties (or ‘features’) to produce a machine learning redshift. The machine attempts to learn the most effective manipulations to minimize the difference between the spectroscopic redshift and the machine learning redshift of the training sample.

The field of machine learning for photometric redshift analysis has been developing since Tagliaferri et al. (2003) used artificial Neural Networks (aNNs). A plethora of machine learning architectures, including tree based methods, have been applied to the problem of point prediction redshift estimation (see e.g. Sánchez et al., 2014, for a further list and routine comparisons), or to estimate the full redshift probability distribution function (hereafter pdf, Gerdes et al., 2010; Carrasco Kind & Brunner, 2013; Bonnett, 2015; Rau et al., 2015). Machine learning architectures have also had success in other fields of astronomy such as galaxy morphology identification, and star & quasar separation (see for example Lahav, 1997; Yeche et al., 2009).

It is often assumed that the training data contain neither galaxies with unreliable spectroscopic redshift estimates nor galaxies with incorrectly estimated photometric properties. However the contamination of a training sample can adversely affect the recovered machine learning redshifts. Cunha et al. (2014) use simulated spectra to show how the cosmological constraints for a weak lensing survey are degraded in the presence of even 1% of spectroscopic outliers in the training sample.

Previous work on outlier analysis has been confined to examining the properties of the machine learning redshifts after the system has been trained. For example, photometric redshift ‘outliers’ which actually sit in a different redshift bin than expected, can be identified by cross correlating data across bins (see e.g., Schneider et al., 2006; Bernstein & Huterer, 2010; McQuinn & White, 2013). Training data can also be carefully removed if the final machine learning redshift and the spectroscopic redshift are found to be very dissimilar (Cunha et al., 2014). More recently outlier detection has been performed on galaxies after a pdf has been obtained, by examining the pdf for multiple peaks or other irregularities (Carrasco Kind & Brunner, 2014a). All of these methods enable the construction of a cleaner final sample of galaxies. However the cleaned sample must first be carefully checked to ensure that a sample bias has not been introduced before being used for scientific analysis. In particular the final test sample must be made to be representative of the cleaned training sample.

In this paper we explore the effect of performing outlier analysis, or anomaly detection, on the training sample to identify discrepant photometric data, or unreliable spectroscopic redshifts before the sample is used to estimate a machine learning redshift. We then show how the removal of this anomalous data improves the machine learning redshift metrics for two very different groups of machine learning architectures.

This paper is organized as follows: in §2 we describe the data sample and the machine learning methods employed; in §3 we present the anomaly analysis and the improvement to the redshift estimates obtained using anomaly detection; and we conclude in §4.

## 2 Data and Machine Architectures

In this study we use observational data drawn from the final SDSS data release, and explore a selection of machine learning architectures for anomaly detection and machine learning redshift estimation.

### 2.1 Observational dataset

The observational data in this study are drawn from SDSS III Data Release 12 (Alam et al., 2015). The SDSS I-III uses a 2.5 meter telescope at Apache Point Observatory in New Mexico and has CCD wide field photometry in 5 bands (Gunn et al., 2006; Smith et al., 2002), and an expansive spectroscopic follow up program (Eisenstein, 2011) covering steradians of the northern sky. The SDSS collaboration has obtained more than 3 million galaxy spectra using dual fiber-fed spectrographs. An automated photometric pipeline performs object classification to a magnitude of and measures photometric properties of more than 100 million galaxies. The complete data sample, and many derived catalogs such as the photometric properties, are publicly available through the CasJobs server (Li & Thakar, 2008) at skyserver.sdss3.org/CasJobs.

The SDSS is well suited to the analyses presented in this paper due to the enormous number of photometrically selected galaxies with spectroscopic redshifts to use as training and test samples. In particular, if a galaxy spectrum is obtained and subsequently found to be erroneous by the processing pipeline, the flag ‘zWarning’ is set to be larger than 0. For the SDSS dataset the quality flag zWarning is a good estimator of the reliability of the spectroscopic redshift. This is not always true for other spectroscopic surveys or datasets, e.g. PRIMUS (see e.g. Bonnett et al. in prep.; Coil et al., 2011; Cool et al., 2013). Furthermore the SDSS galaxies which have unreliably measured redshifts are often followed up at a later date and new spectra are obtained. Many of these new spectra do not incur warnings during processing. It is exactly these cases which are utilized in this paper. Firstly we identify galaxies with at least one poorly measured spectrum and one well measured spectrum. Then we extract all occurrences of these galaxies from the base sample. We then assign the unreliably measured spectroscopic redshift to each such galaxy and contaminate the clean base sample with it. Finally we use machine learning to try to separate the galaxies which have unreliable spectroscopic redshifts from those with reliable redshifts.

We select all objects from CasJobs with both spectroscopic redshifts and photometric properties which are classified as galaxies by the photometric pipeline. This sample will also include some contamination from stars and quasars. In detail we run the MySQL query shown in the appendix. The query extracts 2.5M galaxies with a range of photometric and spectroscopic qualities. The data selection is very relaxed in terms of allowed measured errors in both photometric and spectroscopic properties. In §3.4.2 we perform a similar analysis to that which follows but impose a more stringent set of selection criteria. This query also obtains galaxies with multiple spectroscopic measurements and allows us to identify 76639 unique galaxies with ‘zWarning’ greater than 0. Of these galaxies, 9115 have both a poorly measured spectroscopic redshift above 0 and a well measured spectroscopic redshift with an error less than 0.001. We next select galaxies for which the difference between the poorly measured and well measured redshifts is greater than 0.01, resulting in 6734 galaxies of which 3502 are unique. We impose this selection because we do not expect the error on the machine learning redshift estimate to be below 0.01.
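The selection cuts described above can be sketched as a boolean mask. This is a minimal illustration, not the authors' pipeline: the function name `select_contaminants`, the array names, and the toy values are all hypothetical, and the 0.01 redshift difference is assumed from the expected machine learning redshift error quoted in the text.

```python
import numpy as np

def select_contaminants(z_bad, z_good, z_good_err,
                        max_good_err=0.001, min_delta_z=0.01):
    """Return a boolean mask of galaxies usable as contaminants.

    z_bad      : redshift from the spectrum flagged with zWarning > 0
    z_good     : redshift from the unflagged spectrum of the same galaxy
    z_good_err : reported error on z_good
    """
    z_bad = np.asarray(z_bad, dtype=float)
    z_good = np.asarray(z_good, dtype=float)
    z_good_err = np.asarray(z_good_err, dtype=float)

    has_positive_bad_z = z_bad > 0                    # unreliable redshift above 0
    good_is_precise = z_good_err < max_good_err       # reliable redshift error cut
    redshifts_disagree = np.abs(z_bad - z_good) > min_delta_z
    return has_positive_bad_z & good_is_precise & redshifts_disagree

# Toy values, invented for illustration only.
z_bad = np.array([0.90, 0.105, 0.00, 1.40])
z_good = np.array([0.30, 0.100, 0.25, 0.35])
z_good_err = np.array([0.0002, 0.0001, 0.0003, 0.0050])
mask = select_contaminants(z_bad, z_good, z_good_err)
```

Only the first toy galaxy passes all three cuts; the others fail on the redshift difference, the positivity of the unreliable redshift, and the error on the reliable redshift respectively.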

We use the SDSS k-correct (Blanton & Roweis, 2007) package to estimate the absolute R band magnitude of the 6734 galaxies assuming both the reliable and unreliable spectroscopic redshifts. We present the distribution of absolute R magnitude against redshift in the top panel of Fig. 1, and mark the reliable and unreliable data by the circle and starred data points respectively. The bottom panel of Fig. 1 shows the redshift distribution of the full galaxy sample by the solid grey line. The dashed orange line shows the reliable redshift distribution of those galaxies which have both a reliable and an unreliable redshift. We will remove these galaxies from the base sample. The joined dotted blue line shows the distribution of unreliable redshifts for those galaxies which we have just removed. We will use these galaxies with unreliable redshifts later to contaminate the base sample.

The top panel of Fig. 1 shows that both the redshift distribution and the absolute magnitude distribution of the galaxies with reliable and unreliable redshift estimates are very different. The unreliable spectroscopic redshift distribution is peaked at higher redshifts. The distribution of unreliable data is also peaked at brighter absolute magnitudes. This is because the photometrically measured apparent magnitudes are unchanged, and therefore the offset is correlated with redshift. The bottom panel shows that the sample of galaxies with both reliable and unreliable redshifts is representative of the full base sample when their reliable redshifts are used. As expected the unreliable redshift distribution appears to be very different from the redshift distribution of the base sample.

In the analysis which follows we construct two training samples. The first is drawn from the base sample of data with reliable redshifts, which has then been contaminated with anomalous data that has unreliable redshift estimates. The second training sample is the first sample after a preprocessing step to remove anomalous data. We describe the method to pre-process the data in §2.2. Finally we construct a validation, or ‘test’, sample which is not used during training. The validation sample is always drawn from a non-overlapping set of base data which have reliable redshifts. We describe the construction of these samples in more detail in §3.1.

In this work we have concentrated on the following eight features for outlier estimation: the spectroscopic redshift and its error, one band magnitude, four colors, and the Petrosian radius measured in a single band. Of course we will only use the photometric quantities when estimating redshifts. Previous work has shown that there are many other readily obtained photometric features which also have strong predictive power when estimating redshifts (Hoyle et al., 2015).

### 2.2 Anomaly identification

We use the Elliptical Envelope routine, a robust covariance-based outlier detector from the scikit-learn package (Pedregosa et al., 2011), as our base anomaly detector. We briefly describe the routine below, and refer the reader to Hubert & Debruyne (2010) for a review.

The Elliptical Envelope routine models the data as a high dimensional Gaussian distribution with possible covariances between feature dimensions. In short it attempts to find a boundary ellipse that contains most of the data. Any data outside of the ellipse is classified as anomalous. The Elliptical Envelope routine uses the FAST-Minimum Covariance Determinant (Rousseeuw & Driessen, 1999) to estimate the size and shape of the ellipse.

In detail, the FAST-Minimum Covariance Determinant routine selects non-overlapping subsamples of data and computes the mean $\vec{\mu}$ and covariance matrix $C$ of the features for each subsample. The Mahalanobis distance d is computed for each multidimensional data vector $\vec{x}$ in each subsample, and the data are ordered by ascending d. The Mahalanobis distance is defined by

$$ d_{\rm MH} = \sqrt{(\vec{x}-\vec{\mu})^{T}\, C^{-1}\, (\vec{x}-\vec{\mu})} \qquad (1) $$

which reduces to the Euclidean distance if the covariance matrix is the identity matrix, and the normalised Euclidean distance if the covariance matrix is diagonal. To summarise; the Mahalanobis distance measures how many ‘sigma’ a data point is from the mean of a distribution.
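The two limiting cases mentioned above can be checked numerically. The following is a small sketch using NumPy; the helper name `mahalanobis` and the toy vectors are illustrative choices, not part of the original analysis.

```python
import numpy as np

def mahalanobis(x, mean, cov):
    """Mahalanobis distance of each row of x from `mean` under covariance `cov`."""
    diff = np.atleast_2d(x) - mean
    # Solve C y = diff^T instead of forming the inverse matrix explicitly.
    solved = np.linalg.solve(cov, diff.T).T
    return np.sqrt(np.sum(diff * solved, axis=1))

x = np.array([[3.0, 4.0]])
mean = np.zeros(2)

# Identity covariance: the result is the ordinary Euclidean distance.
d_identity = mahalanobis(x, mean, np.eye(2))
# Diagonal covariance (variances 9 and 16): each axis is scaled by its sigma,
# so the point sits one sigma away along each axis.
d_diagonal = mahalanobis(x, mean, np.diag([9.0, 16.0]))
```

With the identity covariance the distance is 5, the Euclidean length of (3, 4); with variances 9 and 16 it is the square root of 2, i.e. one ‘sigma’ along each of the two axes added in quadrature.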

The FAST-Minimum Covariance Determinant method continues by selecting, from the original subsamples, new subsamples of data with small values of d. The mean, covariance, and the values of d of the subsamples are again computed. This procedure is iterated until the determinant of the covariance matrix converges. The covariance matrix with the smallest determinant from all subsamples defines an ellipse which encompasses a fraction of the original data. Data within the ellipse surface are labeled as ‘inliers’, and data outside of the ellipse are labeled as ‘outliers’ or anomalous, which may then be removed.
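The iteration described above can be illustrated with a simplified sketch. It implements only the repeated "keep the closest points and refit" step on a single random starting subset; the full FAST algorithm also uses many starting subsets and further refinements, which are omitted here. The function name `concentration_steps`, the `support_fraction` value, and the toy data are assumptions made for illustration.

```python
import numpy as np

def concentration_steps(data, support_fraction=0.75, max_iter=50, tol=1e-10, seed=0):
    """Repeatedly keep the h points with the smallest Mahalanobis distance and
    refit the mean and covariance, until the determinant stops decreasing."""
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    h = int(support_fraction * n)
    subset = rng.choice(n, size=h, replace=False)   # random starting subset
    previous_det = np.inf
    for _ in range(max_iter):
        mean = data[subset].mean(axis=0)
        cov = np.cov(data[subset], rowvar=False)
        det = np.linalg.det(cov)
        diff = data - mean
        # Squared Mahalanobis distance of every point under the current fit.
        d2 = np.einsum("ij,ij->i", diff, np.linalg.solve(cov, diff.T).T)
        if previous_det - det < tol:                # determinant has converged
            break
        previous_det = det
        subset = np.argsort(d2)[:h]                 # keep the h closest points
    return mean, cov, np.sqrt(d2)

# Toy data: 200 points from a unit Gaussian plus 10 displaced contaminants.
rng = np.random.default_rng(0)
inliers = rng.normal(size=(200, 2))
contaminants = rng.normal(loc=8.0, scale=0.3, size=(10, 2))
data = np.vstack([inliers, contaminants])
mean, cov, d = concentration_steps(data)
```

In this toy case the ten displaced points end up with the largest distances d, because the refitted ellipse is driven by the bulk of the data rather than by the contaminants.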

The hyper-parameter of the Elliptical Envelope routine is the contamination rate, which is the a priori assumed fractional contamination rate of the data sample. We explore this hyper-parameter in our subsequent analysis, but note that this parameter does not need to be known to high accuracy before using the routine. We further present a method to estimate the contamination fraction from the data in §3.3. The contamination rate hyper-parameter describes approximately how much of the data sample should sit outside of the enclosing high dimensional ellipse that contains the majority of the data.
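For orientation, a minimal usage sketch of the routine follows. In scikit-learn the class is named `EllipticEnvelope` and the hyper-parameter is its `contamination` argument; the toy data, the value 0.03, and the variable names are assumptions for illustration and do not reproduce the configuration used in this work.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

# Toy sample: 1000 'clean' points plus 30 displaced contaminants.
rng = np.random.default_rng(42)
clean = rng.normal(size=(1000, 3))
contaminants = rng.normal(loc=6.0, scale=0.5, size=(30, 3))
sample = np.vstack([clean, contaminants])

# `contamination` is the a priori guess of the contaminated fraction.
detector = EllipticEnvelope(contamination=0.03, random_state=0)
labels = detector.fit_predict(sample)     # +1 for inliers, -1 for outliers

cleaned = sample[labels == 1]             # anomaly-removed sample
flagged_fraction = np.mean(labels == -1)  # close to the requested contamination
```

The routine flags roughly the requested fraction of the data regardless of how many true contaminants are present, which is why an order of magnitude estimate of the contamination is useful before it is applied.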

### 2.3 Tree based methods

One of the machine learning architectures to estimate galaxy redshifts used in this work is the scikit-learn implementation of decision trees for regression (Breiman et al., 1984). The tree based machine learning architecture recursively partitions the input feature dimensions into an increasing number of bins. Each bin is chosen to minimize the scatter of the output feature, which for these purposes is the spectroscopic redshift. This results in data with very similar spectroscopic redshifts being within the same, or possibly nearby bins.
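The partitioning step can be illustrated for a single feature. The sketch below searches for the one split that minimizes the summed squared scatter of the output about the mean on each side; it is a toy illustration with invented values and a hypothetical helper name `best_split`, not the scikit-learn implementation.

```python
def best_split(feature, target):
    """Return (threshold, scatter) of the single split minimizing the summed
    squared deviation of `target` about the mean on each side of the split."""
    def scatter(values):
        if not values:
            return 0.0
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values)

    pairs = sorted(zip(feature, target))
    best = (None, float("inf"))
    for k in range(1, len(pairs)):
        if pairs[k - 1][0] == pairs[k][0]:
            continue  # cannot split between identical feature values
        threshold = 0.5 * (pairs[k - 1][0] + pairs[k][0])
        left = [t for _, t in pairs[:k]]
        right = [t for _, t in pairs[k:]]
        total = scatter(left) + scatter(right)
        if total < best[1]:
            best = (threshold, total)
    return best

# Toy example: a color-like feature and spectroscopic redshifts (invented values).
color = [0.2, 0.3, 0.4, 1.1, 1.2, 1.3]
z_spec = [0.10, 0.12, 0.11, 0.50, 0.52, 0.51]
threshold, total_scatter = best_split(color, z_spec)
```

The chosen threshold falls midway between the two groups of galaxies, so that each resulting bin contains galaxies with very similar redshifts. A full tree repeats this search recursively over all feature dimensions.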

The power of tree based methods is enhanced by combining many trees. One technique to do this is called Adaptive Boosting or Adaboost (Freund & Schapire, 1997; Drucker, 1997) which adds trees sequentially to generate an ensemble of trees. In the following we will refer to this technique as simply ‘Adaboost’. Adaboost weighs each new tree by its ability to predict redshifts correctly, and decides how new trees are grown such that redshift estimates are improved for the data with poorly estimated redshifts. For more details about combining trees with Adaboost we refer the reader to Hastie et al. (2001).

In this work we choose to fix the hyper-parameters of a single decision tree, the final number of trees, and the method of growing trees. We choose the number of data on each leaf node to be 10, and the number of trees to be 100. For Adaboost we select the linear loss function, but we find that using the exponential loss function does not change the results significantly. We choose the linear loss function because the exponential loss function has previously been shown to be sensitive to label noise in classification problems (Dietterich, 2000). We note that the best machine learning hyper-parameters are normally tuned using a cross validation sample, and that this tuning can have a large effect on the machine learning redshift predictions.
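A configuration along these lines might look as follows in scikit-learn. This is a sketch under stated assumptions: mapping "the number of data on each leaf node" onto `min_samples_leaf=10` is our reading rather than a documented choice, and the features and redshifts are synthetic placeholders.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-ins for photometric features and spectroscopic redshifts.
rng = np.random.default_rng(1)
features = rng.uniform(size=(500, 3))
z_spec = (0.3 * features[:, 0] + 0.1 * features[:, 1] ** 2
          + rng.normal(scale=0.01, size=500))

model = AdaBoostRegressor(
    DecisionTreeRegressor(min_samples_leaf=10),  # assumed mapping of '10 data per leaf'
    n_estimators=100,                            # number of trees
    loss="linear",                               # linear Adaboost loss
    random_state=0,
)
model.fit(features[:400], z_spec[:400])
z_ml = model.predict(features[400:])             # machine learning redshifts
```

The base tree is passed positionally because the keyword for it differs between scikit-learn versions.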

#### 2.3.1 Mean and median regression

We also explore a type of tree based machine learning architecture called Quantile regression, which can use the median value, as opposed to the mean value, when constructing the loss function for regression trees. We use the scikit-learn package called GradientBoostingRegressor (Friedman, 1999, 2001), which accepts a parameter to determine the type of loss function, for example ‘least squares’ corresponding to mean regression, and ‘quantile’ with a corresponding value of 50% for median regression. For both mean and median regression we again fix the hyper-parameters of the machine learning architecture to be the same as those of the section above, except for the choice of loss function.

The loss function $L$ is the quantity that the learning algorithm minimizes to find the best fitting model parameters. For trees the best fitting parameters can be the numerical values along the feature dimensions at which a split is chosen. The mean regression loss function is the least squares loss function given by

$$ L(u) = \frac{1}{N}\sum_{i=0}^{N} (y_i - u)^2 \qquad (2) $$

where the sum runs over each of the data on each leaf node on the tree. The least squares loss function is sensitive to outliers, and so we would expect it to be more affected by outliers in the training set. For median regression the loss function is given by

$$ L(u) = \frac{0.5}{N}\left( -\sum_{y_i < u} (y_i - u) + \sum_{y_i \geq u} (y_i - u) \right) \qquad (3) $$

which is less sensitive to outliers because of the linear dependence on the differences between the values $y_i$ and $u$. We compare the results of training these different architectures on contaminated, and outlier removed data samples in §3.4.1.
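The different sensitivity of the two loss functions can be seen on a toy leaf node. In this sketch the median loss is written as half the mean absolute deviation, which is the quantile loss at the 50% level; the function names and the values are invented for illustration.

```python
import statistics

def least_squares_loss(values, u):
    """Mean squared deviation of the leaf values from the prediction u."""
    return sum((y - u) ** 2 for y in values) / len(values)

def median_loss(values, u):
    """Quantile loss at the 50% level: half the mean absolute deviation."""
    return 0.5 * sum(abs(y - u) for y in values) / len(values)

# Redshifts on one leaf, before and after adding a single anomalous value.
clean = [0.10, 0.11, 0.12, 0.13, 0.14]
contaminated = clean + [2.0]

# The mean minimizes the least squares loss, the median minimizes the median loss.
mean_shift = statistics.mean(contaminated) - statistics.mean(clean)
median_shift = statistics.median(contaminated) - statistics.median(clean)
```

A single anomalous redshift moves the least squares prediction by about 0.31, while the median prediction moves by only 0.005.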

### 2.4 Self Organising Maps

Another popular machine learning architecture is the Self Organising Map (Kohonen, 1997, hereafter SOM), which has recently been used for redshift estimation (Geach, 2012). SOMs are also being used in combination with template fitting routines for photometric redshifts (Greisel et al. in prep.). We use the public implementation of a SOM called SOMz (Carrasco Kind & Brunner, 2014b; github.com/mgckind/MLZ), which we briefly describe below. We refer the reader to Carrasco Kind & Brunner (2014b) for more details. We choose to include SOMz in this paper because it represents a very different machine learning architecture from the tree based methods. Using both SOMs and trees indicates how generalisable the results found in this paper are.

SOMz combines neural networks with dimensionality reduction and similarity clustering. The SOMs are evolved from random starting weights such that training examples with similar high dimensional inputs appear clustered in a two dimensional space of pixels. The map evolution is unsupervised because it is performed by only examining the input features. Once the SOMz is stable, the training examples are again passed through the map, and the values of the output feature are combined to produce an output value for each pixel. New data are passed through the SOMz, and the pixel, or nearby pixels, with the largest activation values contribute to the predicted value returned.

In this work we choose to fix the hyper-parameters of the SOMz to have a spherical map geometry with 768 pixels, and we perform 100 training iterations. For a full analysis of the effect of using different hyper-parameters with SOMz see Carrasco Kind & Brunner (2014b). Again we mention that tuning these hyper-parameters can lead to a large improvement, however this is not the focus of this work.

## 3 Analysis and Results

We first introduce the anomaly detection method to both identify the inserted contaminating galaxies with unreliable redshifts, and then to build a cleaner training sample. We then provide a method to estimate the contamination fraction of a dataset. Finally we train separately on both the full contaminated sample, and the cleaned sample, and show the effect on the measured statistics of the machine learning redshift as calculated on an independent and single cross validation sample.

### 3.1 Anomaly identification

We examine the ability of the Elliptical Envelope method to correctly identify the galaxies with unreliable redshift estimates that we use to contaminate the training sample. We perform more than 250 sets of independent analyses for both the Adaboost and SOMz machine learning architectures. We initially select at random a number, between 100 and 6730, of galaxies with unreliable redshifts, and combine them with randomly selected galaxies with reliable redshifts from the base sample. We restrict to values 100k.

We then construct a training and cross validation sample from this combined sample and perform feature normalization on all of the features. Throughout this paper we ensure that the cross validation sample is only drawn from those galaxies with a reliable redshift estimate, because it would be irrelevant to try to predict the redshift of galaxies with unreliable redshift estimates. For this training sample we explore a range of values of the contamination hyper-parameter, corresponding to different initial ‘best guesses’ of the expected contamination fraction as used by the Elliptical Envelope routine.

For each value of the contamination hyper-parameter, the anomaly detection code produces a classification of either ‘inlier’ or ‘outlier’ for each galaxy. We determine the percentage of correctly identified outlier galaxies which have an unreliable redshift, and also the percentage of potentially incorrectly identified galaxies with a reliable redshift. We note that a galaxy with a reliable redshift may nevertheless be an outlier along a feature dimension other than the spectroscopic redshift.
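The two percentages can be computed directly from the known contamination labels and the detector output. The following sketch uses a hypothetical helper `detection_rates` and an invented toy sample of 4 unreliable and 10 reliable galaxies.

```python
import numpy as np

def detection_rates(is_unreliable, flagged_outlier):
    """Percentage of unreliable-redshift galaxies flagged as outliers, and
    percentage of reliable-redshift galaxies flagged as outliers."""
    is_unreliable = np.asarray(is_unreliable, dtype=bool)
    flagged_outlier = np.asarray(flagged_outlier, dtype=bool)
    recovered = 100.0 * np.mean(flagged_outlier[is_unreliable])
    reliable_removed = 100.0 * np.mean(flagged_outlier[~is_unreliable])
    return recovered, reliable_removed

# Toy labels: the first 4 galaxies carry unreliable redshifts.
is_unreliable = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=bool)
# Toy detector output: 3 of the 4 contaminants and 1 reliable galaxy are flagged.
flagged = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=bool)
recovered, reliable_removed = detection_rates(is_unreliable, flagged)
```

Here 75% of the contaminating galaxies are recovered, at the cost of removing 10% of the galaxies with reliable redshifts.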

In Fig. 2 we show the percentage of correctly identified galaxies with unreliable redshifts as a function of the contamination hyper-parameter. The dispersion of data at a fixed value of the hyper-parameter is due to the different randomly selected combined samples of random size. The dark lines show the mean of the distribution and the upper and lower shaded regions show the 68% spread of the data. The black error bar shows the actual range of contamination fractions that correspond to the number of galaxies with unreliable redshifts inserted into the base sample. Each of the 250 experiments has a different inserted contamination fraction.

We find that the fraction of data with unreliable redshifts which is classified as anomalous, or an outlier, is between one and two orders of magnitude larger than the corresponding fraction of data with reliable redshifts. This demonstrates the success of the Elliptical Envelope technique to identify data with unreliable redshift estimates.

We next explore which of the contaminating data with unreliable redshifts are classified as anomalous, and show projections through the data in Fig. 3. In the top panel of Fig. 3, the transparent circles show the redshift and apparent magnitude distribution of the base sample contaminated with unreliable redshifts. The blue stars show which of those galaxies are classified as being outliers using the Elliptical Envelope technique for a given value of the contamination hyper-parameter. The bottom panel of Fig. 3 concentrates on those contaminating galaxies with both a reliable and unreliable redshift. The panel shows the absolute difference between the reliable and unreliable redshifts for the contaminating galaxies which are not classified as outliers by the Elliptical Envelope technique, as a function of increasing contamination hyper-parameter.

The top panel of Fig. 3 shows that galaxies which occupy a region of redshift and band apparent magnitude space which is very different from the majority of other galaxies are classified as being anomalous. We also note that a small fraction of galaxies which occupy the same region of redshift and band apparent magnitude space as the majority of galaxies, is also classified as anomalous. This could be because the data is anomalous along one or more different feature dimensions, which is not easily viewed in this two dimensional projection. There are three distinct clouds of data with reliable redshifts in the top panel. These clouds of data correspond to the different observing phases of the SDSS.

The bottom panel of Fig. 3 shows that the number of galaxies with unreliable redshifts which are not classified as anomalous decreases as the contamination fraction hyper-parameter increases. We also note that the most extreme examples of galaxies with very anomalous unreliable redshifts are preferentially removed as the hyper-parameter increases. The sharp drop at the x-axis location of 0.01 is due to the construction of the contaminating data sample.

In the top panel of Fig. 3 there are distinct clouds of data in these feature projections. These are due to the different observing strategies of the SDSS. For example most of the faint, high redshift cloud was observed in SDSS III, while the lower redshift clouds were observed in SDSS I/II. We also perform outlier detection separately for these samples, and find the following similar trend in both samples: the fainter the galaxy is in apparent magnitude, the more likely it is to have an anomalously large unreliable redshift. This can be understood by the fainter galaxies being more difficult to observe spectroscopically and requiring larger integration times.

We have also explored the use of One Class Support Vector Machines (Cortes & Vapnik, 1995) as the machine learning anomaly detector, but do not find an improvement over the results using the Elliptical Envelope method. This suggests that a hyper dimensional ellipse provides a good model to enclose, and therefore identify, the non-anomalous data.

### 3.2 The distribution of data with anomalies removed

We explore how the distribution of galaxies changes as a function of the contamination hyper-parameter, as compared to the initial sample. We construct a sample of size 100k which is contaminated with 3k galaxies with unreliable redshifts.

We perform anomaly detection on the contaminated sample for different values of the contamination hyper-parameter. In Fig. 4 we show the distribution of spectroscopic redshift against apparent magnitude, for three different values of the hyper-parameter indicated in each panel. The combined sample in each case is shown by the solid lines, and the sample with anomalous outliers removed is shown by the thick dotted lines.

Fig. 4 shows that as the contamination hyper-parameter increases, the distributions of galaxies before and after anomaly removal become biased with respect to each other. For small values of the hyper-parameter the distributions are mostly unaffected. If there is no anomalous data, and the Elliptical Envelope routine is expecting a large fraction of contaminated data, then even clean data is removed; however if anomalous data is indeed present, then the routine will detect it. This behavior can also be seen in Fig. 2.

In the next section we derive a prescription to estimate the contamination fraction from a base data sample that may be contaminated.

### 3.3 Estimating the contamination fraction

We next provide a prescription to make an empirical initial estimate of the contamination fraction. We note that the Elliptical Envelope method is not very sensitive to the exact value of the contamination fraction, as shown in Fig. 2, and therefore we are interested in obtaining an order of magnitude estimate. We use the measured values of the Mahalanobis distance d to estimate the contamination rate.

To make this analysis more realistic we construct a base, and contaminated sample, with more stringent selection criteria on the allowed photometric and spectroscopic errors. We select galaxies which pass the following selection criteria: measured errors in bands between error and spectroscopic redshifts greater than 0 and spectroscopic redshift errors between error. This reduces the base sample with reliable redshifts to 2.1M and the sample with unreliable redshifts to 3017.

For this analysis we construct 250 datasets, which again contain a random amount of data with unreliable redshifts, and a random sample of base data with reliable redshifts. We use the Elliptical Envelope technique with a range of contamination fractions , to measure d of the data for each value of . We note that the dimensionality of the input feature space N, is 8, as described in §2.1. We then assign the class ‘outlier’ to data that satisfies . We find that the choice of provides a good estimation for the outlier fraction, and discuss the robustness of this value below.

Fig. 5 shows the fractional contamination rate of data with unreliable redshifts inserted into the base sample against the estimated contamination fraction using the Mahalanobis distance d. The error bars are inflated by a factor of 10, and show the 68% spread of results obtained with different values of the contamination hyper-parameter when using the Elliptical Envelope technique to measure d.

We note that a large range of values corresponding to also produce reasonable ‘order of magnitude’ estimates of the inserted contamination fraction. As an illustrative example we could compare this result to the case of a two dimensional Gaussian distribution of width ; this relationship is equivalent to assigning the classification of outlier to data that is more than away from the mean value.
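As a worked illustration of the two dimensional Gaussian case, offered as a sketch rather than as the derivation used here: for Gaussian data the squared Mahalanobis distance follows a chi-squared distribution with as many degrees of freedom as feature dimensions. In two dimensions the fraction of uncontaminated data lying beyond a threshold $k$ (a symbol introduced only for this illustration) then has a closed form,

$$ P(d_{\rm MH} > k) = \int_{k}^{\infty} r\, e^{-r^{2}/2}\, dr = e^{-k^{2}/2}, \qquad P(d_{\rm MH} > 2) \approx 0.135, \quad P(d_{\rm MH} > 3) \approx 0.011 . $$

A threshold on the Mahalanobis distance therefore fixes the fraction of clean Gaussian data expected outside the ellipse, and an excess above that expectation is one indication of contamination.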

### 3.4 Machine learning redshifts from anomaly removed training data

We next present the effect on the machine learning redshift if we train only on the training sample with anomalous data removed, instead of training on the full contaminated sample. We remove anomalous data using the Elliptical Envelope technique. We choose to use Adaboost and SOMz in independent sets of analyses.

In each set of analyses we first train on the contaminated training sample. We then use the Elliptical Envelope method with a fixed contamination fraction hyper-parameter to remove anomalous data, irrespective of whether they are drawn from the sample with reliable or unreliable redshift estimates. This produces a cleaned training set, on which we train independently. We refer to this as the ‘cleaned’ training sample in what follows.
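The cleaning step can be sketched as follows, again assuming scikit-learn's `EllipticEnvelope`. The helper name `clean_training_sample`, the choice of columns in `features` and the mock arrays are hypothetical and stand in for the real training catalogue.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope


def clean_training_sample(features, redshifts, contamination):
    """Drop objects flagged as outliers by an Elliptical Envelope."""
    envelope = EllipticEnvelope(contamination=contamination, random_state=0)
    labels = envelope.fit_predict(features)  # +1 for inliers, -1 for outliers
    keep = labels == 1
    return features[keep], redshifts[keep], keep


# Mock training sample (assumption): 1000 objects with 8 input features.
rng = np.random.default_rng(1)
features = rng.normal(size=(1000, 8))
redshifts = rng.uniform(0.0, 1.0, size=1000)

clean_features, clean_redshifts, keep = clean_training_sample(
    features, redshifts, contamination=0.05)
```

The objects are removed purely on the basis of their position in feature space, so the mask does not use any knowledge of which objects carry unreliable redshifts.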

We construct a cross-validation sample drawn from galaxies with reliable spectra. To make a fair comparison later, we do not modify the cross-validation sample at all, irrespective of the inlier or outlier classification of its galaxies. We then pass the same cross-validation sample through both learned systems, and obtain a machine learning redshift estimate for each galaxy.

We construct the redshift scaled residual vector $\Delta_z = (z_{\rm ML} - z_{\rm spec})/(1+z_{\rm spec})$, where $z_{\rm ML}$ is the machine learning redshift estimate and $z_{\rm spec}$ is the spectroscopic redshift. We measure the following metrics: the median value of $\Delta_z$, the values corresponding to the 68% and 95% spread of $\Delta_z$, and additionally the ‘outlier rate’, defined as the fraction of galaxies for which $|\Delta_z|$ exceeds a fixed threshold. Note that the outlier rate here has a different, albeit related, definition from the anomaly detection sections. We repeat this analysis for Adaboost and SOMz, and then repeat the entire analysis for a different value of the contamination hyper-parameter. We perform 250 sets of experiments, each with a randomly selected initial training sample of data with reliable and unreliable redshifts, and with a randomly selected cross-validation sample.
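A minimal sketch of these metrics is given below. It assumes that the 68% and 95% spreads are taken as half the width of the central percentile intervals, which is one common convention and may differ from the exact definition used here. The function name and the `outlier_cut` argument are hypothetical, and the cut value is supplied by the caller rather than fixed.

```python
import numpy as np


def redshift_metrics(z_ml, z_spec, outlier_cut):
    """Summary statistics of the scaled residual (z_ml - z_spec) / (1 + z_spec)."""
    z_ml = np.asarray(z_ml, dtype=float)
    z_spec = np.asarray(z_spec, dtype=float)
    delta = (z_ml - z_spec) / (1.0 + z_spec)
    p2_5, p16, p84, p97_5 = np.percentile(delta, [2.5, 16.0, 84.0, 97.5])
    return {
        "median": float(np.median(delta)),
        # Assumption: spread = half-width of the central 68% / 95% interval.
        "sigma_68": float(0.5 * (p84 - p16)),
        "sigma_95": float(0.5 * (p97_5 - p2_5)),
        "outlier_rate": float(np.mean(np.abs(delta) > outlier_cut)),
    }


# Toy usage with an illustrative cut of 0.15 (an assumption).
metrics = redshift_metrics([1.0, 1.2, 0.8, 2.0], [1.0, 1.0, 1.0, 1.0], 0.15)
```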

In Fig. 6 we show the percentage relative improvement on each of the measured statistics when training on the anomaly cleaned sample instead of the initial contaminated sample, as a function of the contamination hyper-parameter. In the left hand panel we show the results of the analysis with Adaboost, and in the right hand panel we show the results with SOMz. The lines and shaded regions again correspond to the median and 68% of the distribution.
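One plausible way to compute such a percentage relative improvement, and to summarise it over repeated experiments, is sketched below. The definition used (the reduction in the absolute value of a metric, relative to the contaminated case) is an assumption, as are the function names.

```python
import numpy as np


def percent_improvement(metric_contaminated, metric_cleaned):
    """Relative reduction (in percent) of |metric| after cleaning."""
    before = abs(metric_contaminated)
    after = abs(metric_cleaned)
    return 100.0 * (before - after) / before


def summarise(improvements):
    """Median and central 68% interval over a set of experiments."""
    low, median, high = np.percentile(improvements, [16.0, 50.0, 84.0])
    return float(median), (float(low), float(high))


# Toy usage: a metric that drops from 0.04 to 0.03 improves by 25%.
example = percent_improvement(0.04, 0.03)
median, band = summarise([10.0, 20.0, 30.0, 40.0, 50.0])
```

Positive values indicate that the cleaned training sample gives the better statistic, and negative values indicate a degradation.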

In both sets of analyses we find that very small values of the contamination hyper-parameter, corresponding to a removal of 1% of data with unreliable redshifts and 0.05% of data with reliable redshifts (see Fig. 2), yield a small improvement in the measured metrics at the level of a few percent or less. Increasing the hyper-parameter to 0.07, corresponding to a removal of 70% of unreliable data and 3% of reliable data, improves the metrics for both machine learning systems by between 20% and 80%. The metrics most affected by the removal of anomalous data are the median values and the tails of the distribution, namely the 95% spread and the outlier fraction. Fig. 6 shows that there is a slight to moderate decline in improvement of the metrics at larger values of the hyper-parameter. This degradation in improvement can be understood by examining the effect of a large contamination hyper-parameter on the resulting distributions of training galaxies, as seen in Fig. 4. For larger values the cleaned samples become less representative of the initial sample, and therefore the training and test sets become less representative of each other, and the machine learning mapping extends into the realm of extrapolation. Extrapolating outside of the training set leads to spurious and degrading results, as seen here.

Fig. 6 shows the relative improvement for each of the two machine learning techniques. We also perform a comparison between these two machine learning architectures and show the results in the top two rows of Table 1. We note that this is not the main objective of this work because similar comparisons have already been performed (e.g., Carrasco Kind & Brunner, 2014b). The table shows the machine learning architecture used and the effect of training on both the data sample that is contaminated with unreliable redshifts, and the data sample with outliers removed using the Elliptical Envelope technique. We show the measured statistics in the final four columns. The quoted values are the median values, at a fixed value of the contamination hyper-parameter, of the 250 samples that each have a different inserted contamination fraction. We note that Adaboost outperforms the SOMz algorithm on all metrics by a factor of when training on the contaminated samples, and is comparable with or outperforms the SOMz algorithm when training on the cleaned samples. We have chosen to show the results obtained for a contamination hyper-parameter value , but note the same behavior is found for all values of the hyper-parameter.

We note that both panels of Fig. 6 show improvement as the base sample is cleaned of contaminating data. This shows that the machine learning routines for which the improvement is the greatest, are the least robust techniques to use when presented with some fraction of anomalous training data. We further explore other techniques which are less susceptible to anomalous training data in §3.4.1.

During this work we assume that the cross-validation sample is not contaminated with anomalous data, which is true by construction. However, this may not be true of other data sets. In such cases one could perform anomaly detection on the training, cross-validation, and test sets to remove outliers from the full sample. If the same anomaly detection results were applied to a final test sample, this would result in a fair analysis. However, one would need to check that this preprocessed data is suitable for the final science application at hand. One further method would be to identify anomalous cross-validation data, and then investigate these data to understand why they have been so classified.

#### 3.4.1 Mean vs Median regression

We next explore mean regression and quantile (or median) regression. Quantile regression can use the median, as opposed to the mean value, when constructing the loss function for boosted regression trees. We expect median regression to be less strongly affected by contamination in the training data. For comparison with §3.2 using Adaboost, we construct very similar machine learning architectures using the same hyper-parameters and only vary the loss function. We continue by applying the same formalism as before: we first train on the contaminated data sample, then use the Elliptical Envelope method to remove outlier data, and finally retrain on the cleaned data sample. We show the results of using mean regression in the left hand panel of Fig. 7, and the results using median regression in the right hand panel. Again, the actual spread of the inserted contamination fraction using the galaxy sample with unreliable redshifts is shown by the black starred data point and error bar.
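The expected difference between the two loss functions can be illustrated with a toy example. The sketch below uses scikit-learn's `GradientBoostingRegressor` as a stand-in for boosted regression trees, which is an assumption and may differ from the configuration used in this work. The mock data, the 10% contamination and all hyper-parameter values are hypothetical.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Mock training data (assumption): target ~ feature, with 10% bad labels.
rng = np.random.default_rng(0)
x_train = rng.uniform(0.0, 1.0, size=(1000, 1))
z_train = x_train[:, 0] + rng.normal(scale=0.02, size=1000)
bad = rng.random(1000) < 0.1
z_train[bad] = rng.uniform(2.0, 3.0, size=bad.sum())  # contaminated labels

# Identical hyper-parameters; only the loss function differs.
shared = dict(n_estimators=100, max_depth=2, learning_rate=0.1, random_state=0)
mean_model = GradientBoostingRegressor(**shared)  # default squared-error loss
median_model = GradientBoostingRegressor(loss="quantile", alpha=0.5, **shared)
mean_model.fit(x_train, z_train)
median_model.fit(x_train, z_train)

# Compare both models against the uncontaminated relation.
x_test = np.linspace(0.05, 0.95, 50).reshape(-1, 1)
truth = x_test[:, 0]
mae_mean = float(np.mean(np.abs(mean_model.predict(x_test) - truth)))
mae_median = float(np.mean(np.abs(median_model.predict(x_test) - truth)))
```

In this toy setup the squared-error model is pulled towards the contaminated labels, while the median model stays close to the underlying relation, so `mae_median` is smaller than `mae_mean`.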

We find that both machine learning architectures show large improvement in the measured statistics and when the data sample is pre-cleaned using the Elliptical Envelope technique. This again shows how poorly the base routines perform on anomalous training data. As expected from the effect of outliers on the loss functions, we find that mean regression is more affected by contamination than median regression. We note that the dispersion measures and are very well controlled for the median regression architecture. We show the absolute values for each of the measured metrics in the third and fourth rows of Table 1. We again show the values of each of the measured statistics, averaged over the 250 samples, for a chosen value of the contamination hyper-parameter .

Comparing quantile and median regression with the SOMz and Adaboost routines is not the primary focus of this work (see e.g., Dietterich, 2000; Caruana & Niculescu-Mizil, 2005) but we note that Adaboost with decision trees for regression is the best performing machine learning architecture on all measured statistics. However the continued success of Adaboost with contaminated data appears to be in disagreement with studies that include a large fraction of label noise in classification tasks (Dietterich, 2000). This may be an artifact of the chosen datasets and how noise is added to the data.

#### 3.4.2 Anomaly detection using a cleaner galaxy sample

In the previous sections we use data samples with very relaxed selection criteria, which allow both photometric and spectroscopic data with large measured errors to be included in the base sample. We now examine the effect on the machine learning redshift if one chooses to use a base galaxy sample which has much more stringent limits on the allowed magnitude of both photometric and spectroscopic errors. We again select galaxies which pass the selection criteria described in §3.3.

We repeat the above analysis by again contaminating the base sample and using the Elliptical Envelope method to clean the sample, and then train Adaboost and SOMz for redshift analysis on both contaminated, and cleaned samples. We again find a similar distribution of improvements in the redshift metrics as a function of the contamination hyper-parameter , but with a slightly reduced amplitude. The improvement for Adaboost ranges from 15% for the outlier fraction, to 85% for the median value, and the improvement for SOMz ranges from 40% for , to 95% for the median value.

#### 3.4.3 Anomaly detection of non-contaminated galaxies

We also examine the effect on the machine learning redshift if one uses only the base galaxies with a reliable spectroscopic redshift, without the addition of galaxies with unreliable redshifts. We continue as before by determining inliers and outliers as a function of the hyper-parameter . In this section ‘anomalous data’ could mean that a photometric magnitude in a particular band is very different from other similar galaxies at that redshift.

We proceed by again separately training Adaboost and the SOMz on both the initial training set and the cleaned training set. We present the results of this analysis in Fig. 8. Note that the y-axis scale is different between panels, and we have not shown due to the large scatter seen on this metric, caused by being very small.

If we adopt a contamination fraction hyper-parameter of and remove anomalous data, we find a very slight improvement at the level of % using Adaboost and up to 4% using SOMz in the measured metrics. Note that the relative error on is unstable, although does remain small. As increases, the SOMz continue to benefit from a cleaned training sample, whereas Adaboost begins to degrade in its predictive power.

The degradation in the measured statistics seen at large values of in Fig. 8 can be attributed to the removal of representative training data as seen in Fig. 4. Recall that the validation set is a random sample from the uncontaminated data with reliable redshifts, and thus would more closely resemble the solid lines in Fig. 4. For increasing values of , the training and validation samples become more unrepresentative and a machine learning system would naturally degrade. We do note that SOMz appear to be less affected by small differences in the training and test data sets, but also degrade in predictive ability once the samples become very unrepresentative.

In the last two rows of Table 1 we quote the median values on each of the measured statistics from each of the samples when both training on, and further cleaning, these uncontaminated galaxy samples. We note that the effect of training on the further cleaned sample improves the measured statistics using SOMz by a few percent, but can degrade some of the measured statistics by a few percent when using the Adaboost algorithm with decision trees.

An interesting future application which is being explored by the authors is to trim the anomalous data and then apply data augmentation (see Hoyle et al., 2015) techniques to make the training and test samples again more representative.

## 4 Conclusions

Machine learning methods can be used to assign redshift estimates to photometrically selected galaxy catalogues if a representative training set with both photometric properties or ‘features’ and spectroscopic redshifts exists. Machine learning methods require that the base training sample which is used to learn the mapping between these quantities is representative of the final, or ‘test’, data sample. This requires that the training sample spans a similar input photometric feature space as the test sample, and does not contain anomalous data (e.g., galaxies with incorrect spectroscopic redshifts) otherwise an incorrect mapping will be learnt. In this work we examine the ability of machine learning architectures to identify and remove such anomalous data.

In contrast to previous work on outlier analysis which removes anomalous data after the machine learning redshift system has been trained (e.g., Schneider et al., 2006; Bernstein & Huterer, 2010; Carrasco Kind & Brunner, 2014a), the method presented here identifies anomalous data before the sample is used to estimate a redshift. The benefit of this approach is that this pre-cleaning can then be used to define a new input feature space which is much less complex than using the post processing methods. Our method makes it easier to construct a final sample of test galaxies.

The analysis in this paper uses a base sample of 2.5M galaxies drawn from the SDSS DR12 which have reliably measured spectroscopic redshifts, and some of which also have an unreliably measured spectroscopic redshift. We construct an ‘anomalous data sample’ by selecting galaxies that have a difference between the reliable and the unreliable redshift by more than 0.01, and proceed by assigning the unreliable redshift to the galaxy. We apply this selection because we do not expect the recovered photometric redshift to have an error below . We contaminate a base data sample with data from the anomalous sample, and then use machine learning to identify the anomalous data.

We choose the Elliptical Envelope routine (Rousseeuw & Driessen, 1999; Hubert & Debruyne, 2010) as the machine learning anomaly detector algorithm. The resulting ellipse encompasses a fraction of the data which are classified as ‘inliers’ and data outside of the ellipse are classified as ‘outliers’ or anomalous data. We explored an alternative machine learning architecture for anomaly detection called One Class Support Vector Machines (Cortes & Vapnik, 1995) and found that the Elliptical Envelope routine is more suitable for our dataset. This implies that the high dimensional data cloud is well described by a hyper-ellipse, rather than a hyper-surface with distinct regions of reliable and unreliable data which would be analysed more favourably using Support Vector Machines. There is one hyper-parameter of the Elliptical Envelope routine which is the a priori assumed contamination fraction of the data set. We describe a method to estimate this fraction using a rule-of-thumb relation between the distributions of Mahalanobis distances and the number of feature dimensions, but note the results are not very sensitive to the actual value assumed.

We show how the removal of this anomalous data improves the machine learning redshift metrics for two different groups of machine learning architectures. We choose to explore both decision tree based methods and artificial Neural Network based Self Organizing Maps. These very different architectures suggest that the results found here are generalisable, and not an artefact of the machine learning method chosen. We train the machine learning systems to estimate redshifts for a test sample separately on data from the base sample contaminated with unreliable redshift estimates, and with the cleaned base sample once anomalous data has been removed.

We find improvement in all of the explored metrics when training on the cleaned sample compared with training on the contaminated sample, when comparing each machine learning method with respect to itself. We also compare the results across machine learning architectures, and find that the best redshift estimation results are obtained using Decision Trees boosted using the AdaBoost routine (Freund & Schapire, 1997; Drucker, 1997). This result has been seen before by the authors (Hoyle et al., 2015); however, in that work the results are coupled with the enhanced ability of tree methods to use many tens, or hundreds, of input feature dimensions.

The SDSS data used in this work represents an optimal dataset because it covers a similar wavelength range in the photometry and spectroscopy. Many other surveys do not have this luxury. For example the Dark Energy Survey (The Dark Energy Survey Collaboration, 2005) has photometry with varying depth across the sky, and has spectra drawn from heterogeneous sources. Performing outlier detection with a heterogeneous spectroscopic sample would still be possible as long as the photometry were not varying in depth drastically, otherwise even reliable data could be flagged as anomalous. If this is not the case, one potential avenue could be to degrade the entire photometry, or large fractions of it, to a similar depth and again perform outlier detection as described in this work. Furthermore we note that the spectroscopic quality flags for the SDSS data are a good estimator of reliability. This is not always true for other datasets and spectroscopic surveys, e.g., the PRIMUS dataset appears to have unreliable redshift estimates for some of the most secure redshifts provided by their quality flags (Coil et al., 2011; Cool et al., 2013, see Bonnett et al., in prep.). However, one should still perform anomaly detection, even with a less reliable data sample, or one may be learning trends from spurious data.

In this work we have also assumed that the final test sample is not contaminated by data with unreliable spectroscopic redshifts. If such a sample could not be constructed, this would not necessarily remove the usefulness of the techniques presented in this paper. This is because a contaminated test sample would provide a similar detrimental effect to any training sample and so they would be penalised equally. The exception is the pathological case in which galaxies with very similar photometry, and also similar unreliable redshift values, inhabit both the training and test samples.

An interesting avenue of future research would be to perform outlier detection on a data sample to remove anomalous training data. This may reduce the feature parameter space such that the training sample is no longer representative of the test sample. One may then employ methods from data augmentation (see Hoyle et al., 2015), which enhance the training sample using third party data from models, simulations or other datasets to make the training sample again representative of the test sample. This would work if the augmented data sample spans a similar input feature space (i.e. has the same measured photometric properties) as the training and test samples.

As with all machine learning works, the results found here should be applied cautiously to new datasets. Similar analysis to that described here should be performed to check if there is indeed a problem with contaminating data. If so, then we have shown that the removal of this contaminating data can greatly improve the machine learning redshift point estimates.

## Appendix A CasJobs MySQL query

We obtain observational data from the SDSS using the following MySQL query which is run in the DR12 schema:

```sql
select s.specObjID, q.objid, q.ra, q.dec,
  s.z as specz, s.zerr as specz_err,
  q.dered_u, q.dered_g, q.dered_r, q.dered_i, q.dered_z,
  q.modelMagErr_u, q.modelMagErr_g, q.modelMagErr_r,
  q.modelMagErr_i, q.modelMagErr_z,
  s.sourceType as specType, q.type as photpType,
  s.zWarning
into mydb.specPhotoDR12v2 from SpecObjAll as s
join photoObjAll as q on s.bestobjid=q.objid
  and q.dered_g>0 and q.dered_r>0
  and q.dered_z>0 and q.type=3
```


## Acknowledgments

We thank the anonymous referee for comments and suggestions which have improved the paper. S. Seitz and M. M. Rau are supported by the Transregional Collaborative Research Centre TRR 33 - The Dark Universe and the DFG cluster of excellence “Origin and Structure of the Universe”. CB: Funding for this project was partially provided by the Spanish Ministerio de Economía y Competitividad (MINECO) under projects FPA2013-47986, and Centro de Excelencia Severo Ochoa SEV-2012-0234. Funding for the SDSS and SDSS-II has been provided by the Alfred P. Sloan Foundation, the Participating Institutions, the National Science Foundation, the U.S. Department of Energy, the National Aeronautics and Space Administration, the Japanese Monbukagakusho, the Max Planck Society, and the Higher Education Funding Council for England. The SDSS Web Site is http://www.sdss.org/.

## References

• Alam et al. (2015) Alam S., Albareti F. D., et al. 2015, ArXiv 1501.00963
• Bernstein & Huterer (2010) Bernstein G., Huterer D., 2010, MNRAS, 401, 1399
• Blanton & Roweis (2007) Blanton M. R., Roweis S., 2007, AJ, 133, 734
• Bonnett (2015) Bonnett C., 2015, MNRAS, 449, 1043
• Breiman et al. (1984) Breiman L., Friedman J. H., Olshen R. A., Stone C. J., 1984, Classification and Regression Trees. Wadsworth International Group, Belmont, CA
• Carrasco Kind & Brunner (2013) Carrasco Kind M., Brunner R. J., 2013, MNRAS, 432, 1483
• Carrasco Kind & Brunner (2014a) Carrasco Kind M., Brunner R. J., 2014a, MNRAS, 442, 3380
• Carrasco Kind & Brunner (2014b) Carrasco Kind M., Brunner R. J., 2014b, MNRAS, 438, 3409
• Caruana & Niculescu-Mizil (2005) Caruana R., Niculescu-Mizil A., 2005, in Proc. 23rd Intl. Conf. Machine Learning (ICML’06), An empirical comparison of supervised learning algorithms using different performance metrics. pp 161–168
• Coil et al. (2011) Coil A. L., Blanton M. R., Burles S. M., Cool R. J., Eisenstein D. J., Moustakas J., Wong K. C., Zhu G., Aird J., Bernstein R. A., Bolton A. S., Hogg D. W., 2011, ApJ, 741, 8
• Cool et al. (2013) Cool R. J., Moustakas J., Blanton M. R., Burles S. M., Coil A. L., Eisenstein D. J., Wong K. C., Zhu G., Aird J., Bernstein R. A., Bolton A. S., Hogg D. W., Mendez A. J., 2013, ApJ, 767, 118
• Cortes & Vapnik (1995) Cortes C., Vapnik V., 1995, Machine Learning, 20, 273
• Cunha et al. (2014) Cunha C. E., Huterer D., Lin H., Busha M. T., Wechsler R. H., 2014, MNRAS, 444, 129
• Dahlen (2013) Dahlen T., et al., 2013, ApJ, 775, 93
• Dietterich (2000) Dietterich T. G., 2000, Mach. Learn., 40, 139
• Drucker (1997) Drucker H., 1997, in Proceedings of the Fourteenth International Conference on Machine Learning ICML ’97, Improving regressors using boosting techniques. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, pp 107–115
• Eisenstein (2011) Eisenstein D. J., et al., 2011, AJ, 142, 72
• Freund & Schapire (1997) Freund Y., Schapire R. E., 1997, Journal of Computer and System Sciences, 55, 119
• Friedman (1999) Friedman J. H., 1999, Computational Statistics and Data Analysis, 38, 367
• Friedman (2001) Friedman J. H., 2001, Ann. Statist., 29, 1189
• Geach (2012) Geach J. E., 2012, Monthly Notices of the Royal Astronomical Society, 419, 2633
• Gerdes et al. (2010) Gerdes D. W., Sypniewski A. J., McKay T. A., Hao J., Weis M. R., Wechsler R. H., Busha M. T., 2010, ApJ, 715, 823
• Gunn et al. (2006) Gunn J. E., Siegmund W. A., Mannery E. J., Owen R. E., Hull C. L., Leger R. F., Carey L. N., Knapp G. R., York D. G., Boroski W. N., Kent S. M., Lupton R. H., Rockosi C. M., et al., 2006, AJ, 131, 2332
• Hastie et al. (2001) Hastie T., Tibshirani R., Friedman J., 2001, The Elements of Statistical Learning. Springer Series in Statistics, Springer New York Inc., New York, NY, USA
• Hildebrandt et al. (2010) Hildebrandt H., Arnouts S., Capak P., Moustakas L. A., Wolf C., Abdalla e. a., 2010, Astron. & Astrophys., 523, A31
• Hoyle et al. (2015) Hoyle B., Rau M. M., Bonnett C., Seitz S., Weller J., 2015, MNRAS, 450, 305
• Hoyle et al. (2015) Hoyle B., Rau M. M., Zitlau R., Seitz S., Weller J., 2015, MNRAS, 449, 1275
• Hubert & Debruyne (2010) Hubert M., Debruyne M., 2010, Wiley Interdisciplinary Reviews: Computational Statistics, 2, 36
• Kohonen (1997) Kohonen T., ed. 1997, Self-organizing Maps. Springer-Verlag New York, Inc., Secaucus, NJ, USA
• Lahav (1997) Lahav O., 1997, in Di Gesu V., Duff M. J. B., Heck A., Maccarone M. C., Scarsi L., Zimmerman H. U., eds, Data Analysis in Astronomy Artificial neural networks as a tool for galaxy classification.. pp 43–51
• Li & Thakar (2008) Li N., Thakar A. R., 2008, Computing in Science and Engineering, 10, 18
• McQuinn & White (2013) McQuinn M., White M., 2013, MNRAS, 433, 2857
• Pedregosa et al. (2011) Pedregosa F., et al., 2011, Journal of Machine Learning Research, 12, 2825
• Rau et al. (2015) Rau M. M., Seitz S., Brimioulle F., Frank E., Friedrich O., Gruen D., Hoyle B., 2015, ArXiv 1503.08215
• Rousseeuw & Driessen (1999) Rousseeuw P. J., Driessen K. V., 1999, Technometrics, 41, 212
• Sánchez et al. (2014) Sánchez C., Carrasco Kind M., Lin H., Miquel R., et al., 2014, MNRAS, 445, 1482
• Schneider et al. (2006) Schneider M., Knox L., Zhan H., Connolly A., 2006, ApJ, 651, 14
• Smith et al. (2002) Smith J. A., et al., 2002, AJ, 123, 2121
• Tagliaferri et al. (2003) Tagliaferri R., Longo G., Andreon S., Capozziello S., Donalek C., Giordano G., 2003, Lecture Notes in Computer Science, 2859, 226
• The Dark Energy Survey Collaboration (2005) The Dark Energy Survey Collaboration 2005, ArXiv 0510346
• Yeche et al. (2009) Yeche C., Petitjean P., Rich J., Aubourg E., Busca N., Hamilton J. ., Le Goff J. ., Paris I., Peirani S., Pichon C., Rollinde E., Vargas-Magana M., 2009, ArXiv 0910.3770