Kernel PCA for type Ia supernovae photometric classification
Abstract
The problem of supernova photometric identification will be extremely important for large surveys in the next decade. In this work, we propose the use of Kernel Principal Component Analysis (KPCA) combined with k = 1 nearest neighbour algorithm (1NN) as a framework for supernovae (SNe) photometric classification. The method does not rely on information about redshift or local environmental variables, so it is less sensitive to bias than its template fitting counterparts. The classification is entirely based on information within the spectroscopic confirmed sample and each new light curve is classified one at a time. This allows us to update the principal component (PC) parameter space if a new spectroscopic light curve is available while also avoids the need of redetermining it for each individual new classification. We applied the method to different instances of the Supernova Photometric Classification Challenge (SNPCC) data set. Our method provide good purity results in all data sample analysed, when SNR5. As a consequence, we can state that if a sample as the postSNPCC was available today, we would be able to classify of the initial data set with purity 90% (D+SNR3). Results from the original SNPCC sample, reported as a function of redshift, show that our method provides high purity (up to ), specially in the range of , when compared to results from the SNPCC, while maintaining a moderate figure of merit (). This makes our algorithm ideal for a first approach to an unlabelled data set or to be used as a complement in increasing the training sample for other algorithms. We also present results for SNe photometric classification using only premaximum epochs, obtaining 63% purity and 77% successful classification rates (SNR5). In a tougher scenario, considering only SNe with MLCS2k2 fit probability 0.1, we demonstrate that KPCA+1NN is able to improve the classification results up to (SNR3) purity without the need of redshift information. Results are sensitive to the information contained in each light curve, as a consequence, higher quality data points lead to higher successful classification rates. The method is flexible enough to be applied to other astrophysical transients, as long as a training and a test sample are provided.
keywords:
supernovae: general; methods: statistical; methods: data analysis1 Introduction
Since its discovery (Riess et al., 1998; Perlmutter et al., 1999), dark energy (DE) has become a big challenge in theoretical physics and cosmology. In order to improve our understanding about its nature, multiple observations are used to add better constraints over DE characteristics (e.g., Mantz et al., 2010; Blake et al., 2011; Plionis et al., 2011). In special, large samples of type Ia supernovae (SNe Ia) are being used to measure luminosity distances as a function of redshift in order to constraint cosmological parameters (e.g., Kessler et al., 2009; Ishida & de Souza, 2011; BenitezHerrera et al., 2012; Conley et al., 2011). As part of the efforts towards understanding DE, we expect many thousands of SNe candidates from large photometric surveys, such as the Large Synoptic Survey Telescope (LSST) (Tyson, 2002), SkyMapper (Schmidt et al., 2005) and the Dark Energy Survey (DES) (Wester & Dark Energy Survey Collaboration, 2005). However, with rapidly increasing available data, it is already impracticable to provide spectroscopical confirmation for all potential SNe Ia discovered in large field imaging surveys. After a great effort in allocating their resources for spectroscopic followup, the SuperNova Legacy Survey (SNLS)(Astier et al., 2006) and the Sloan Digital Sky Survey (SDSS)(York et al., 2000), were able to provide confirmation for almost half of their lightcurves. These constitute the major SNe Ia samples currently available, but it is very unlikely that their power of spectroscopic followup will continue to increase as it did in the last decade (Kessler et al., 2010). In this context, we do not have much choice left other than develop (or adapt) statistical and computational tools which allow us to perform classification on photometric data alone. Beyond that, such tools should ideally provide a quick and flexible framework, where information from new data may be smoothly added in the pipeline.
Trying to solve this puzzle, in the recent years a good diversity of techniques were applied to the problem of SNe photometric classification (Poznanski et al., 2002; Johnson & Crotts, 2006; Sullivan et al., 2006; Poznanski et al., 2007; Kuznetsova & Connolly, 2007; Kunz et al., 2007; Sako et al., 2008; Rodney & Tonry, 2009; Gong et al., 2010; Falck et al., 2010). Most of them use the idea of template fitting, so the classification is estimated by comparison between the unlabelled SN and a set of confirmed light curve templates. The method starts with the hypothesis that the new, unlabelled, light curve belongs to one of the categories in the template sample. The procedure then continues to determine which category best resembles the characteristic of this new object. It produced good results (Sako et al., 2008), but its final classification rates are highly sensitive to the characteristics of the template sample.
To overcome such difficulty, Newling et al. (2011); Sako et al. (2011) describe different techniques which address a posterior probability to each classification output. These algorithms produce not a specific type for each SN, but a probability of belonging to each one of the template classes. Such an improvement allow the user to impose selection cuts on posterior probability and, for example, use for cosmology only those SNe with a high probability of being Ia.
Another interesting approach proposed by Kunz et al. (2007), and further developed by Newling et al. (2012), takes a somewhat different path. Instead of separating between Ia and nonIa before the cosmological analysis, they use all the available data. However, the influence of each data point in determining the cosmological parameters is weighted according to their posterior probability (obtained from some classifier like that of Sako et al. (2011), for example). The method was able to identify the fiducial cosmological parameters in a simulated data set, although some bias still remains and worth further investigation.
Following a different line of thought, Richards et al. (2012) (hereafter R2012) proposes the use of diffusion maps to translate each light curve into a low dimensional parameter space. Such space is constructed using the entire sample and, after a suitable representation is found, the label of the spectroscopic sample is revealed. In the final step a random forest classification algorithm is used to assign a label to the photometric light curves, based on their low dimensional distribution when compared to the one from the spectroscopically confirmed SNe. Results were comparable to template fitting methods in a simulated data set, but it also showed large sensitivity to the representativeness between training and test samples.
More recently, Karpenka et al. (2012) presented a twostep algorithm where each light curve in the spectroscopic sample is first fitted to a parametric function. The values of parameters found are subsequently used in training a neural network (NN) algorithm. The NN is then applied to the photometric sample and, for each light curve, it returns the probability of being a Ia. Their classification results are overall not depending on redshift distribution and, as other analysis cited before, can be vary significantly depending on the training sample used.
In order to better understand and compare the state of art of photometric classification techniques, Kessler et al. (2010) released the SuperNova Photometric Classification Challenge (hereafter, SNPCC). It consisted of a blind sample of 20.000 SNe light curves, generated using the SuperNova ANAlysis^{1}^{1}1http://sdssdp62.fnal.gov/sdsssn/SNANAPUBLIC/ (SNANA) light curve simulator (Kessler et al., 2009), and designed to mimic data from the DES. Approximately 1000 of these were given with labels, so to represent a spectroscopically confirmed subsample. The participants were offered 2 instances of the data, with and without the host galaxy photometric redshift (photoz). Around a dozen entries were submitted to the Challenge and, although none of them obtained an outstanding result when compared to others, it provided a clear picture of what can be done currently and what we should require from future surveys in order to improve photometric classifications. There was also an instance of the data containing only observations before maximum, which aimed at choosing potential spectroscopic followup candidates. However, this data set did not received replies from the participants (Kessler et al., 2010). After the Challenge, the organizers released an updated version of the data, including all labels, bug fixes and other improvements found necessary during the competition^{2}^{2}2http://sdssdp62.fnal.gov/sdsssn/SIMGEN_PUBLIC/. The works of Newling et al. (2011), R2012 and Karpenka et al. (2012) present detailed results from applying their algorithm to this postSNPCC^{3}^{3}3Nomenclature taken from Newling et al. (2011). data.
Given the stimulating activity in the field of SNe photometric classification, and the urgency with which the problem imposes itself, our purpose here is to present an alternative method which optimizes purity in the final SNe Ia sample, in order to provide a statistically significant number of photometrically classified SNe Ia for cosmological analysis. Our algorithm uses a machine learning approach, similar in philosophy to the entry of R2012 submitted to the SNPCC. This class of statistical tools has already been applied to a variety of astronomical topics (for a recent review see Ball & Brunner (2010)).
We propose the use of Kernel Principal Component Analysis (hereafter, KPCA) as a tool to find a suitable low dimension representation of SNe light curves. In constructing this low dimensional space only the spectroscopically confirmed sample is used. Each unlabelled light curve is then projected into this space one at a time and a knearest neighbour (kNN) algorithm performs the classification. The procedure was applied to the postSNPCC data set using the entire light curves and also using only premaximum observations. In order to allow a more direct comparison with SNPCC results, we also applied the algorithm to the complete light curves in the original SNPCC data set^{4}^{4}4http://www.hep.anl.gov/SNchallenge/ DES_BLINDnoHOSTZ.tar.gz.
Our procedure returns purity levels higher than to top ranked methods reported in the SNPCC. The results are sensitive to the spectroscopic sample, but more on the quality of each individual observation than on representativeness between spectroscopic and photometric samples. Assuming that results can only be as good as the input data, we perform classification in subsamples of SNPCC and postSNPCC data based on signal to noise ratio (SNR) levels.
Considering only light curves with Multicolor Light Curve Shape (MLCS2k2) (Jha et al., 2007) fit probability, FitProb0.1, we demonstrate that our method is capable of increasing purity and successful classification rates even in a context with only light curves very similar between each other.
The paper is organized as follows: section 2 briefly describe linear PCA and its transition to the KPCA formalism. In section 3 we detailed the crossvalidation and kNN algorithm used for classification. In section 4 we present the guidelines to prepare the raw light curve data into a data vector suitable for KPCA. The results applied to a best case scenario simulation, to the postSNPCC data and comparison with MLCS2k2 fit probability results are shown in section 5. We report outcomes from applying our method to the original SNPCC data set in section 6. Finally, we discuss the results and future perspectives in section 7. Throughout the text, mainly in section 2, we refer to a few theorems and mathematical statements are made. Those which are most crucial for the development of the KPCA argument are briefly demonstrated in appendix A. A detailed description of results achieve using linear PCA+kNN algorithm is presented in appendix B. Appendix C shows classification rates as a function of redshift and SNR cuts and appendix D displays our achievements when no SNR cuts are applied. Graphical representation of results from SNPCC data set for all the tests we performed, which can be directly compared to those of Kessler et al. (2010) are displayed in appendix E. Complete summary tables reporting the number of data points in different subsamples of SNPCC and postSNPCC data and classification results mentioned in the text are shown in appendix F.
2 Principal Component Analysis
The main goal of PCA is to reduce an initial large number of variables to a smaller set of uncorrelated ones, called Principal Components (PCs). This set of PCs is capable of reproducing as much variance from the original variables as possible. Each of them can be viewed as a composite variable summarizing the original ones, and its eigenvalue indicates how successful this summary is. If all variables are highly correlated, one single PC is sufficient to describe the data. If the variables form two or more sets, and correlations are high within sets and low between sets, a second or third PC is needed to summarize the initial variables. PCA solutions with more than one PC are referred to as multidimensional solutions. In such cases, the PCs are ordered according to their eigenvalues. The first component is associated with the largest eigenvalue, and accounts for most of the variance, the second accounts for as much as possible of the remaining variance, and so on.
There are a few different ways which lead to the determination of PCs. Particularly, we have already shown that it is possible to derive the PCs beginning from a theoretical description of the likelihood function (e.g., Ishida & de Souza, 2011; Ishida et al., 2011).
In the present work we are interested in exploring the KPCA and, as a consequence, our description shall be based on dot products. In doing so, the connection between PCA and KPCA occurs almost smoothly. We follow closely Hofmann et al. (2008) and Max Welling’s notes A first encounter with Machine Learning^{5}^{5}5http://www.ics.uci.edu/welling/teaching/ICS273Afall11 /IntroMLBook.pdf, which the reader is refereed to for a more complete mathematical description of the steps shown here.
2.1 Linear PCA
We begin by defining a set of vectors , which contains our observational measurements. If is the vector of mean values of , let be the set of vectors which holds the centered observations,
(1) 
In order to find the PCs, we shall diagonalize the covariance matrix^{6}^{6}6The covariance matrix is traditionally defined as the expectation value of . For convenience, we shall address the term covariance matrix to the maximum likelihood estimate of the covariance matrix for a finite sample, given by equation (2)(Schölkopf et al., 1996). ^{7}^{7}7It is also possible to apply PCA to a correlation matrix. This is advised mainly when the data matrix is composed by measurements with different orders of magnitude and/or units. Since in our particular case all measurements are in the same units (fluxes) and normalized in advance, we shall use the covariance matrix. For a detail discussion on the pros and cons of each case, see Jollife (2002)  section 2.3.
(2) 
This can be accomplished by solving the eigenvalue equation
(3) 
where are the eigenvalues and the eigenvectors of the covariance matrix.
If we consider the set of eigenvectors of and the set of data points projections in , the elements of will be given by
(4) 
where is the matrix formed by the first PCs as columns. It is possible to show that the elements of will be uncorrelated, independently of the dimension chosen for matrix .
This is where the dimensionality reduction takes place. We can choose the number of PCs that will compose the matrix based on how much of the initial variance we are willing to reproduce in . At the same time, depending on the nature of our data, the spread of the points in the PCs space might also reveal some underlying information, as the existence of two classes of data points, for example.
The main goal of this work is to use PCA to project the data in a subspace where photometric data vectors associated with different supernova types can be separated. In order to do so, our first step is to show that it is possible to calculate the projected data points without the need of explicitly defining the eigenvectors . This will be important when we consider nonlinear correlations in the next subsection.
Given that all vectors must lie in the space spanned by the data vectors , we can show that (see appendix A)
(5) 
and as a consequence, instead of solving equation (3) we can also find the elements of by solving the projected equations^{8}^{8}8Equation 6 results from writing each eigenvector as a linear combination of the data vectors.
(6) 
This leads us to an eigenvalue equation in the form
(7) 
where
(8) 
and . Normalizing , we can also show that .
Finally, consider a test data vector . Its projections in the PCs space are given by
(9) 
where .
This demonstration was specifically designed to rely only on the matrix . Although, the classification we aim in this work is not possible in the linear regime. In order to be able to disentangle light curves from different supernovae, we need to perform PCA in a higher dimensional space, where the characteristics we are interested in are linearly correlated.
2.2 Kernel Principal Component Analysis
KPCA generalizes PCA by first mapping the data nonlinearly into a higher dimensional dot product space (hereafter, feature space):
(10) 
where is a nonlinear function and has arbitrary (usually very large) dimensionality.
The covariance matrix, , will be defined similarly as
(11) 
We assume that are centred in feature space. We shall come back to this point latter on.
Consider the eigenvector of and its eigenvalue. Using the same line of argument shown in the previous subsection, we can define an kernel matrix
(12) 
which allows us to compute the value of dot product in without having to carry out the map . The kernel function has to satisfy the Mercer’s theorem to ensure that it is possible to construct a mapping into a space where acts as a dot product^{9}^{9}9http://ni.cs.tuberlin.de/lehre/mimaterials/Mercer_theorem.pdf. The projection of a new test point, , is given by
(13) 
where is defined by the solutions to the eigenvalue equation .
Finally, it is important to stress that all the arguments shown in this subsection rely on the assumption that the data are centred in feature space. This is not a direct consequence of using instead of . Equation (1) is responsible for centring data vectors in , in order to perform centralization in , we need to construct the kernel matrix using . This can also be computed without any information about the function . It is shown in appendix A that the centred kernel matrix, , can be expressed in terms of the noncentered kernel matrix, , as
(14) 
where . The reader should be aware that we always refer to the centred kernel matrix . However, for the sake of simplicity, the tilde is not used in our notation.
At this point, we have the tools necessary to compute the centred kernel matrix based on dot products in input space. However, we still need to choose a form for the kernel function .
In the present work, for the sake of simplicity, we make an a priori choice of using a Gaussian kernel,
(15) 
where the value of is determined by a crossvalidation processes (see subsection 3.2). Although, it is important to emphasize that there is extensive literature on how to choose the appropriate kernel for each particular data set at hand (Lanckriet et al., 2004; Zang et al., 2006). To compare the analysis between different kernel choices is out of the scope of this work. As our goal is to focus on the KPCA procedure itself, we are using the standard kernel choice. An analysis of performances from different kernel choices within the KPCA framework should certain be topic of future research.
3 Classification
By virtue of what was presented so far, we have a set of centred data points, , and a kernel function, . This allows us to calculate the kernel matrix in feature space, , and its corresponding eigenvalues, . Using equation (9), we can obtain the projection of each data point in the eigenvectors of .
From now on, we will work in the space spanned by these eigenvectors. More precisely, we will look for a 2dimensional subspace of , which can optimize our ability to separate the projected data in 2 different classes (namely Ia and nonIa supernovae). We chose to keep this subspace bidimensional in order to avoid overfitting to the particular data set we are analysing.
The procedure describe before is now applied to two different instances of our data. A data set suitable for the analysis we present here must be composed of two subsamples. For one of them we have the appropriate label for each data point (we know which class they belong to), from now on this subset will be called training sample. For the other subsample (hereafter test sample) the labels are not available, and we want to classify them based on our previous knowledge about the training sample.
In a first moment, we will concentrate our efforts in the training sample. Its projections in a certain pair of PCs are calculated through equation (9). Given that labels of data in this sample are known, we can calculate projections in different PCs and determine which PC pair better translates the initial light curves into a separable point configuration.
3.1 The kNearest Neighbor algorithm
Our choice of which subspace of is more adequate for a specific data situation will be balanced by how well we can classify the training sample using the kNearest Neighbor algorithm (kNN).
kNN is one of the most simple classification algorithms and it has been proved efficient in low dimension parameter spaces, (, for a further discussion on kNN performance in higher dimensions see Beyer et al. (1999)). The method begins with the training sample organized as , where is the data vector and its label, and a definition of distance between 2 data vectors . Given a new unlabelled test point , the algorithm computes the distance between and all the other points in the training sample, , ordering them from lower to higher distance. The labels of the first data vectors (the ones closer to ) are counted as votes in the definition of . Finally, is set as the label with highest number of votes. Given this last voting characteristic, kNN is many times refereed to as a type of majority vote classifier (James, 1998).
Throughout our analysis, we used an Euclidean distance metric and order . As this is the first attempt in applying KPCA to the photometric problem, we chose to be bounded by the Bayes error rate (hereafter, BER). The BER is defined as the error rate resulting from the best possible classifier. It can be shown that, in the limit of large samples, the error rate of a nearest neighbour algorithm is never larger than BER (for a scratch of the proof see Ripley (1996), page 195). From now on, this will be refereed to as 1NN algorithm (nearest neighbour with ).
So far we described how to define a convenient 2dimensional space where our data points will be separated in Ia and nonIa populations (subsection 2.2) and a classification tool that allows us to add a label to a new, unlabelled data point (subsection 3.1). However, we still need to define which pair of PCs of the feature space better maps our data. This is done in the next subsection.
3.2 Crossvalidation
The main idea behind the crossvalidation procedure is to remove from the training sample a random set of data points, . The remaining part of the training sample is given as input in some classifier algorithm and used to classify the points in . In this way, we can measure the success rate of the classifier over different random choices of and also compare results from different classifiers given the same training and sets (for a complete review on crossvalidation methods see Arlot & Celisse (2010)).
The the number of points in is a free parameter and must be defined based on the clustering characteristics of the given data set. Here we chose the most classical exhaustive data splitting procedure, sometimes called Leave One Out (LOO) algorithm. As the name states, we construct subsamples , each one containing only one data point, . The training sample is then crossvalidated and the performance judged by the average number of correct classifications.
Data exhaustive algorithms like LOO have a larger variance in the final results, although, they are highly recommended for avoiding biases regarding local data clustering and some nonuniform geometrical distribution of data points in a given parameter space^{10}^{10}10http://www.public.asu.edu/ltang9/papers/encycrossvalidation.pdf.
3.2.1 The algorithm
In the context of KPCA, we used LOO and 1NN algorithms to decide the appropriate pair of PCs and value of (equation (15)) for each data set.
The next trick question to answer is: which PCs we should test with the algorithms described before? Obviously there is a high number of vectors in and it would not be possible to test all available pairs. Fortunately, we can make use of the fact that the firsts eigenvectors (those with larger eigenvalues) represent directions of greater data variance in feature space. Although we cannot visualize such vectors, it is easy to confirm that the magnitude of data points projections in become very similar to each other for higher . In other words, the smaller eigenvalues correspond to PCs carrying mostly noise, so their projections will, in average, be very similar, and meaningless (Schölkopf et al., 1996). For classification purposes, one expects that the PC pair tailored to provide geometrical separation of the data projection into classes will be among the PCs with higher eigenvalues.
For the case studied here, we restrict ourselves to testing the first 5 PCs in a first round and extend the search to other PCs only if the classification success rate do not monotonically decrease with the use of higher PCs. In the same line of thought, we start our search with in a grid with steps of 0.1 and make this interval wider only if the results do not converge after a first round of evaluations.
The crossvalidation algorithm we used is better summarized as:

Pick a PC pair, .

Define a grid of values for parameter , .

Pick a value from the above grid, .

Cross validate the training sample using the KCPA projections in the chosen PCs, 1NN and LOO algorithms.

Calculate the average classification success rate for .

Repeat steps (ii) to (v) 10 times. If the average number of successful classifications monotonically decreases in the upper and lower boundaries of , go to step (vii). If not, repeat steps (ii) to (vi) until they do.

Repeat steps (i) to (vi) for all pairs of .

If the average number of successful classifications monotonically decreases when using higher PCs, go to step (ix). Otherwise, consider and repeat steps (i) to (viii).

Choose for , values corresponding to the largest average number of successful classifications.
Once the crossvalidation is completed, we use the resulting parameter values to calculate the training sample projections in PC space. We can finally use 1NN algorithm to assign a label to each data point in the test sample. The final procedure of classifying the test sample is called KPCA+1NN algorithm throughout the text.
The framework described so far can be applied to any set of astrophysical objects, as long as we have a training and a test sample. The crossvalidation procedure is performed only in the training sample and each point in the test sample is classified at a time. This avoids running the whole machinery again every time one new point is added to the test sample, and prevent us from introducing misleading data as part of the features to be mapped by the PCs. However, the parameter space composed by the PC pair and value of can always be updated if we have at hand new data points whose types are known. Only then it is necessary to rerun the crossvalidation process.
From now on we focus on the problem of photometrically classifying SNe Ia as a practical example, although the exact same steps could be applied for any transient with observable light curves. In the next section, we describe how the light curve data should be prepare before we try to classify them.
3  +24  1  
3  
0  +15  1  
3  
10  0  1  
3  
3  +45  1  
3 
4 Light curve preparation
In case we have b different filters, the observational data available from the SN can be arranged as . Considering the filter, . In our notation, the correspond to the observation epoch (in MJD), is the measured flux at , is the error in flux measurement and is the total number of observation epochs in filter .
Our next task is to translate the time of each observation from MJD to the time since maximum brightness in a particular filter. Which filter shall be used as a reference does not have much influence in the final result. The ideal is to choose a band where the ability to determine the time of peak brightness is greater, and use that reference band for all SN in the sample. The time of maximum brightness in our reference band for the SN is addressed as . As a result, we obtain data points in a particular filter as , where .
We must also deal with the fact that, in a real situation, the input from observations consists in some nonuniform sampling of the light curve in various (most cases more than 3) different filters for each SNe. Although, it is necessary to translate such information into a grid equally spaced in time. This is done by using a cubic regression spline fit for each light curve. The spline fit was chosen based on its ability to fit nonuniform functions in a parameter independent manner. As a consequence, we have a smooth light curve function for each SNe and filter.
As a final step, we must keep the light curve functions within a reasonable range (so to avoid divergence in the exponent of equation (15) due to very bright or dim sources, for example). This is done through the normalization of the light curve functions by the maximum flux measured in all filters for a particular SN. In our notation corresponds to the normalized fitted light curve for the SN in filter . The use of the same normalization factor for all filters for a given SN ensures that the colour and shape of each light curves are preserved.
We now use the functions in order to construct our initial data matrix, , composed by rows and columns. Each row contains all information available for a single SN and each column contains the flux measurements in a specific observation epoch and filter. The difference in time since maximum brightness between 2 successive columns of is defined as and for the purposes of this work it is kept constant. However, we do address the analysis with different values for later on. The lowest and highest observation epoch since is referred to as and , respectively.
SIM1  SNPCC  

filter  
g  
r  
i  
z 
Throughout this work, we took the conservative approach of not extrapolating functions outside the time domain covered by the data. In other words, we only considered classifiable those SN which have at least one observation epoch and at least one epoch , in all available filters. The values of and must be chosen so to include the largest possible number of SNe and, at the same time, to probe an interval of the light curve which posses information enough to satisfy our classification purposes. We applied the algorithm considering values of and shown in table 1. The demand that this sampling must be fulfilled in all filters could be relaxed, leading to an interesting study about the importance and role of each frequency band. We leave that for a future work, focusing our efforts in data points for which information is available in all bands.
5 Application
5.1 Data sets
We applied the procedure described so far to different samples taken from the postSNPCC data set. The postSNPCC consisted of 20.000 SNe light curves, simulated according to DES specifications and using the SNANA light curve simulator. This large set is subdivided in 2 subsamples: a small spectroscopically confirmed one of 1103 lightcurves (training) and a photometric sample of 20216 light curves (test). The role of the training sample was to mimic, in SNe types, proportions and data quality, a spectroscopically confirmed subset available for a survey like DES. After the challenge results were released, the organizers made public an updated version of the simulated data set (postSNPCC), which was used in most of this work. This updated data set is quite different from the one used in the challenge itself (SNPCC), due to a few bug fixes and other improvements aimed to a more realistic representation of the data expected for DES. As a consequence, its results should not be compared to those of the SNPCC. A detailed analysis of our findings from the postSNPCC faced to others published after the challenge, which use the same data set (namely, Newling et al. (2011), R2012 and Karpenka et al. (2012)), is presented in section 7.
For the sake of completeness, we also present results from applying our method to the SNPCC sample. Although this sample contains the bugs mentioned before, it allow us to coherently compare our method to a broader range of alternatives. Detailed comparison of our results with those reported in Kessler et al. (2010) is presented in section 6.
Our first move is to check if KPCA can correctly classify SNe light curves in a bestcase scenario. In order to do so, we generated a high quality data set, hereafter SIM1. This set consists of 2206 SNe, composed by 2 subsamples (training and test), both with at least 3 observation epochs having SNR5 in all filters. SNe types and proportions in each subset are the same as those found in the postSNPCC training sample. As a consequence, the 2 subsamples in SIM1 are completely representative of one another. This was done to avoid classification problems found by other studies when the training sample is not representative of the test sample (e.g., Newling et al. (2011) and R2012). At this moment, the purpose of SIM1 is only to perform a consistency check for the KCPA and light curve preparation prescriptions described above.
In generating SIM1, we used the input SNANA files provided as part of the postSNPCC package, and ran the simulator until the required number of each SNe type passing selection cuts was reached.The kernel matrix was constructed considering and . After verifying that our algorithm was indeed effective in ideal conditions, we will focus on the analysis of the postSNPCC itself.
The Phillips relation for type Ia SN can be consider the first SNe Ia standardization procedure (Phillips, 1993). It establishes a correlation between the magnitude measured at maximum brightness and the magnitude measured 15 days after that (hereafter Phillips interval). For our purposes, this relation highlights a time interval in the light curve where important information are stored. However, at this point we cannot say if a data set sampled solely in this time interval can provide enough information. As a consequence, we considered 8 different subsets of postSNPCC data, whose parameters are described in Table 1. This requirements were imposed to training and test samples within a given data set.
to probe the light curve so to include the Phillips interval. and aim at testing the KPCA+1NN procedure in a region of the light curve that was not explored in the SNPCC: with points only before maximum. Although this kind of classification does not result in cosmological useful SNe Ia, it is very important in pointing candidates for spectroscopic followup (Kessler et al., 2010). and are tailored to include the second maxima in infrared bands expected to occur after 20 days since maximum brightness (Kasen, 2006).
In Table 1, we varied not only the maximum and minimum epoch of observation, but also considered different values for . The purpose of this analysis is to investigate if the classification results are sensitive to the step size between different columns of the kernel matrix. We expect this result to be correlated with data quality, since the interpolated functions are influenced by errors in flux measurements. To test this hypothesis, we applied the classification procedure to different subsamples of each data set, according to their SNR.
Finally, we only considered SNe with at least 3 observational epochs above a certain SNR threshold in each filter. As the spline fitted functions are supposed to get the overall behaviour of a smooth light curve, this selection cut assures that at least 3 of the points with higher weights in the spline fitting procedure correspond to good quality measurements. We also present results without a SNR selection cut, addressed as SNR0.
5.2 Results
In order to choose a filter as our reference band, we used the SNe Ia in the training sample of SIM1 and postSNPCC. As our primary goal is to correctly separate a sample containing only type Ia, our decision was based on the results from SNe Ia in the spectroscopic sample only. Interpolated light curve functions before normalization were used to determine the time of peak brightness in all bands. The residual between the simulated and derived date of maximum brightness, , in each band were computed for all SNe Ia in the training samples. This resulted in a distribution of points whose spread represents our ability (or lack of) in determining this parameter for each filter. The mean values and standard deviations encountered are shown in Table 2.
From this we realized that the band is the best choice for determining the time of peak brightness, since it has the less biased mean value with the smallest standard deviation. Such results agree with those found in R2012, also based on SNANA simulations, but with a different argument. All the results presented from now on were calculated using the time of peak brightness in band as reference.
The final classification results are reported in terms of efficiency (eff), purity (pur) and successful classification (SC) rates,
(16)  
(17)  
(18) 
where () is the number of successfully classified SNe Ia (nonIa), is the total number of SNe Ia, is the number of nonIa wrongly classified as Ia and is the total number of SNe which survived selection cuts.
Efficiency values are shown for two different normalizations: eff considers the total number of SNe Ia before any selection cuts, and eff was calculated using the total number of SNe Ia remaining after selection cuts.^{11}^{11}11By selection cuts we mean the SNR requirement for each subssample the time window constraints of described in table 1. The definition used in the SNPCC corresponds to eff, and aims at addressing the impact on final sample not only due to the classifier, but also to the selection cuts used. In our particular case, we chose to display values of eff in order to isolate the classification power of the algorithm itself. As stated before, our results are mainly influenced by the quality of each observation. Beyond that, we made a specific choice of not extrapolating the light curve where data is not present (table 1). As a consequence, we consider our selection cuts as a minimum amount of information necessary to coherently compare different light curves without the need of further ad hoc hypothesis. In this scenario, the use of eff gives a better idea on the classifier performance. However, when comparing with previous analysis from the literature, eff should be referred to. From now on, for all our results that can be compared to previous ones, both quantities are shown. ^{12}^{12}12For the sake of clarity, when both quantities are present (results that might be compared with others from the literature), outcomes normalized after selection cuts are shown in appendixes C and E.
By definition, eff measures our capacity in recognizing the SNe Ia, while pur measures the contamination from nonIa SNe in our final sample. SC values are presented in order to provide an overall picture of our classification results regarding nonIa as well.
In order to make our results easier to compare with other analysis from the literature, we also report them in terms of the figure of merit (FoM) and pseudopurity (ppur), used to rank classifiers in the SNPCC,
(19)  
(20) 
where is used to input a stronger penalty on nonIa contaminating the final SNe Ia sample. Following the SNPCC, we used . Given that FoM is a function of efficiency, we report values for FoM and FoM for total number of SNe after and before selection cuts, respectively.
5.2.1 Sim1
We must now prepare the light curves according the prescription described in section 4. We randomly selected one example of type Ia light curve in SIM1 to illustrate how the fitted functions behave given the data points. This is shown in the left panels of figure 1. The right panels show the light curve functions for different types of nonIa SNe. Panels from top to bottom run over the DES filters . In order to facilitate visualization, all curves were normalized as explained in section 4.
For the SIM1 data set, the crossvalidation procedure returns PCs 1 and 5 along with as the most appropriate parameters values. The final geometrical distribution of the training sample in such PCs parameter space, along with the classification results are shown in figure 2. In order to facilitate visualization, we show the Ia and nonIa SNe in the test sample in two different plots.
We can see that, in a best case scenario, KPCA+1NN algorithm is efficient enough to separate the two populations in feature space with a minimum loss in the number of SNe Ia (up to 94% eff) and almost no contamination from nonIa’s in the final sample (up to 99% pur).
5.2.2 PostSNPCC data
The analysis of the postSNPCC data was performed in different steps. We first separate a subsample which can be consider the analogous of SIM1 inside postSNPCC, with SNR5 (hereafter +SNR5). This data set results from imposing in postSNPCC data the same selection cuts applied to SIM1.
Using +SNR5, we obtained (80%) pur and SC of (94%) in the training (test) sample. The graphical representation of results from +SNR5 are shown in the upper and middle panels of Figure 3 and the redshift distribution of the diagnostic parameters are displayed in figures 4 and 22.
Analysing the geometrical distribution of training sample data points (blue circles and purple squares), the numerical results mentioned above become more clear. There is an obvious distinction between the preferential locus occupied by Ia and nonIa in this parameter space. However, besides the overlapping area where both species exist, and which was already present in SIM1, we can also spot some contamination of nonIa points inside the area occupied by Ia. Such “misplaced” nonIa probably gave rise to an important share of the wrong crossvalidation classification. In what follows, we described 2 different approaches aimed at suppressing the influence of these “problematic” data points.
The Untyped supernova
Let us focus in +SNR5 for a moment. Each data point in the training sample is characterized by the SN identification number, its coordinates in PC1 PC4 space, the true label and the label from crossvalidation. We identified all points who received a wrong label in the crossvalidation process and gathered them in a set . We considered these troubled points, in the sense that, although they are spectroscopically confirmed SNe, their light curve characteristics are not enough to fully distinguish them within the training sample.
Our first attempt was to remove all points from the training set before classifying the test sample. In doing so, we defined that a new unlabelled test point would be classified according to the region in the parameter space it occupies, since removing the troubled points defines a clear geometrical boundary between Ia and nonIa regions in PCs parameter space. This slightly increased our ratings, leading to 87% pur, 93% eff, and 96% SC rates.
Trying to get rid of the remaining contamination as much as possible, we consider the complete training sample with 3 different SNe types: Ia, nonIa and untyped SNe (). This allows us to take advantage of the information in the troubled points and identify light curves similar to them. An expected consequence of this choice is a decrease in efficiency, since some of the Ia in the test sample will be classified as . On the other hand, as the lost of SNe for the class happens to nonIa as well, the pur in our final Ia sample will increase, for +SNR to 91%.
The training set divided in 3 subsamples has its graphical representation shown in the bottom panel of figure 3. For all the cases described here (complete training, excluding from the training set and including as a classification type) the distribution of test points will not change, since only the training sample is affected.
We performed the classification for all samples described in table 1 imposing 3 different SNR cuts (namely SNR5, SNR3 and SNR0). A summary of our finding is detailed in table 5.
Figure 5 shows results for samples listed in the above mentioned table for the case where the class was included in the training sample as a third SNe type^{13}^{13}13This plot was constructed with the goal of maximizing SC, however, we also applied the crossvalidation process of section 3.2, aiming at maximum FoM and the results are pretty similar.. It is clear from this plot that pur and FoM results become more dependent on time sampling choices as SNR goes higher. The extreme cases being samples / (before maximum, worst results) and / (wider time sampling, better results).
Finally, we should emphasize that our analysis was based on the idea that information should be stored somewhere in the light curve function. If this is true, KPCA could easily be able to provide a direction of information clustering in some untouched feature space, which could be accessed through the data points projections in the PCs. That was the main reason why we started our analysis based on SNR requirements. Errors in flux measurements are direct correlated to the SNR of each observation, and higher errors lead to more oscillations in the light curve functions. In the extreme case were we used random number as components of an input data vector (which contains no information), its projections in PCs will always be located very close to the origin.
Results shown in Table 5 reflects this main idea. Requiring a SNR5 in to , we obtained pur, eff, and SC rates higher than 80% in all 4 cases. These samples contain approximately 5 times more nonIa than Ia SNe (see Table 6), which is close to what we expect in a real survey. Beyond that, we did not demand representativeness in redshift or SNe types between the test and training samples. The training sample inside the postSNPCC data have all the biases the organizers were able to predict and which come along with spectroscopic observational conditions (Kessler et al., 2010). The selection cuts we applied to SNR, in this context, can be seen as a simple procedure to extract the full potential of a given data set^{14}^{14}14We remind the reader that the SNR selection cuts are applied to both, training and test sample..
The results presented here are in agreement to those found by R2012, who applied a diffusion map and random forest algorithm to the same data set. Using the spectroscopic sample as given in the postSNPCC as a training set, they found 56%/48% for pur/eff values. Our analysis for +SNR0, which imposes no SNR selection cuts, returns 43% pur and 35% eff. In their scenario achieving higher purity, they report 90% pur and 8% eff from a redshift limited training sample (R2012, Table 6). For +SNR5, our method achieved 98% pur and 7% eff. However, we emphasize that while R2012 uses a different prescription for constructing the training sample, our results were reached using a subset of the spectroscopic sample as it is presented in the SNPCC.
Focusing in sample of R2012 and cad1+SNR5 of our method, the first feature to call attention is the exponential decay in our results for eff. It will be clear in what follows that this is a consequence of SNR cuts (figure 27). In this particular case, we imposed each filter should observe at least 3 epochs with SNR5 and, with higher redshift, SNe fulfilling this requirement become rare. Also, in the present analysis, we keep only SNe with observations in all available filters, which prevent us from classifying any object with 0.8 (see upper redshift end of our results in figure 4). Obviously these are not intrinsic characteristics of the method, or the data, but choices we made in order to keep results in a conservative perspective. Nevertheless, our values for eff are comparable to those of R2012 up to (see figure 10 of R2012).
As a consequence, despite the loss in efficiency for the reasons cited above, the local maximum in FoM achieved by both groups, us and R2012, are FoM, with our method providing higher results up to .
It was not our purpose to construct a different observation strategy, but instead, to show that if a photometric survey was able to provide a sample similar to postSNPCC today, it is possible to extract a photometric classified set containing approximately 15% of the entire sample (more than 2000), with SC90% . Beyond that, such results can be achieved with minimum astrophysical input and no a priori hypothesis about light curve shape, colour, SNe host environment or redshift.
Results from Linear PCA
Given the wide spread use of linear PCA in astronomy, we also verified how the standard linear version of PCA performs in the SNe photometric classification problem. The method described in section 2.1 was applied to the postSNPCC data. Once the PCs and projections were calculated, we used a crossvalidation algorithm similar to that presented in section 3.2. The main difference being that, in the linear case, there is no parameter to determine.
We present results for in appendix B. As expected, when no SNR cut is applied, linear and KPCA achieved similar rates of pur and eff (table 3). However, when data quality increases, linear PCA is not able to take advantage of the small details introduced in the lightcurve function. Results from linear PCA applied to +SNR5 and including U in training achieved maximum values of 73% pur, 56% eff, and 79% SC. Comparing tables 3 and 5, we find that using KPCA for such a case improves results of pur, eff, and SC by 25%, 50% and 15% respectively, over the linear PCA outcomes. The dependence of these results with redshift are displayed in figures 4 (for KPCA applied to SNR5) and 20 (for the linear case).
5.3 A tougher scenario
In order to make a harder test in the classification power of KPCA+1NN, we used MLCS2k2 light curve fitter within SNANA to exclude easily recognizable nonIa light curves from the test sample. Once the “obviously” nonIa are eliminated from the test sample, we were left with a data set containing light curves more similar between each other. If we are able to improve the MLCS2k2 successful classification rates within this subsample, we can be sure that the algorithm is doing more than just identifying very strange light curves. We shall see this is the case^{15}^{15}15We emphasized that the reader should not consider the classification results using our method and those based on MLCS2k2 in the same grounds. The procedure used to obtain FitProb values uses information about spectroscopic redshift, and as a consequence, it cannot be considered a photometric classification method. .
We begin by choosing a selection cut. For each light curve surviving this cut we calculated the fit probability of being a SNe Ia (FitProb) as implemented in SNANA. Those with FitProb0.1 were tagged as Ia and the remaining ones were classified as nonIa. Figure 6 shows the number of SN according to the calculated FitProb for 4 different selection cuts. Beyond the 3 SNR cuts mentioned previously, we also analysed the outcomes of those used by the SNANA cuts entry submitted to the SNPCC (Kessler et al., 2010). These are defined as: at least 1 observation epoch before maximum brightness, at least 1 epoch after +10 days, at least 1 epoch with SNR10 and filters {,} should have maximum SNR5. Panels also show results for pur, ppur, eff, eff, FoM and FoM obtained from classifying the entire samples according to FitProb. In this plot, it is evident that, no matter which selection cut we choose, there is a high concentration of SNe with FitProb0.1. This reflects the fact that such group of high quality nonIa light curves are most obviously different from standard SNe Ia, and was responsible for a significant part of our SC rates in the previous subsection. Analysing the efficiency values, we see that only of type Ia SNe are wrongly classified as nonIa according to the FitProb criteria.
We now separate only the SNe classified as Ia according to the FitProb criteria for each selection cut and consider these our entire test sample. After that, we recalculated the FitProb results and ran the KPCA+1NN classifier. For the SNANA cuts entry, no extra SNR cuts were applied. Results for different time windows are shown in figure 7. From this plot, it is evident that, when no SNR cuts are applied, both methods return very similar results for pur. The FoM obtained from the FitProb criteria is higher than those obtained with our method, due the their maximum efficiency in this context (all SNe tagged as Ia). The main difference appears when results for higher SNR are compared. For SNR, our method is able to increase pur results from pur to pur without using any kind of astrophysical information.
In order to have a better idea of how demanding the time sampling is on the SNe Ia sample which already passed the selection cuts, we show in the bottom panel of figure 7 the fraction of SNe Ia that fulfils such requirements. These results are quite similar and almost independent of selection cuts. For to around of SNe Ia were classifiable and for and around .
5.4 Premaximum observations
We also explored the ability of KPCA+1NN to classify light curves given only observation epochs before maximum brightness. A proposal that was submitted to the participants of the SNPCC but did not received any reply. Although such kind of analysis do not produce a SN sample useful for cosmology, it is extremely important in pointing candidates for spectroscopic followup.
In a first approach, the light curves were treated as described in section 4. Once the spline fitted functions and time of maximum were obtained, we constructed the data matrix, , with time sampling between 10 e 0 days since maximum brightness ( and in table 1). We emphasize that this scenario uses points after maximum in order to determine , but not in the construction of matrix . The more realistic situation, where the points after maximum are not used in any step of the process is also analysed bellow.
For +SNR5 and +SNR0, results are shown in figure 11 and 12, respectively. Figure 11 is similar to figure 3, in the sense that both present a clear separation between Ia and nonIa points in the training sample and the Ia in the test sample seem to obey that boundary (upper panels). On the other hand, when nonIa points from the test sample are superimposed, they occupy almost the entire populated region of the parameter space.
In figure 12 the situation changes completely. The effect mention previously, describing data vectors corresponding to low information content localized close to the origin in PC space, is translated into an overdensity of points in this area. Beyond that, we also see that the difference between the Ia and nonIa distributions are not that clear any more. There is a slightly tendency of the nonIa points agglomerate along the vertical axis, but this entire area is also occupied by Ia. The plot also states that the amount of relevant information contained in Ia input vectors is larger than that in nonIa, since the spread in the first is much larger than the second. Classification results for +SNR5 (+SNR0) achieved 61% (38%) pur, 73% (44%) eff and 83% (58%) SC^{16}^{16}16It is important to emphasize that, given the training sample contains much more nonIa than Ia, a 50% SC does not correspond to the outcomes of a random decision making process., which leads to a FoM of 0.25 (0.07).
We now turn to a more restrict situation. Although very promising, results for and were not obtained using strictly only premaximum data, since the entire light curve was used to determine (section 4). In order to analyse a more realistic scenario, we also studied the classification outcomes when points after maximum are removed from the process of determining .
For each light curve in the postSNPCC we took just epochs observed before the simulated time of maximum brightness^{17}^{17}17SNANA variable: SIM_PEAKMJD.. The spline fit was then applied to these data points and the time of maximum is defined by the rband as before. If in any other filter the last observed data point correspond to an earlier epoch them in band, we extrapolated the light curve function until it reaches . We performed classification for and in both cases was kept as . After the curves were obtained, we followed the construction of the data matrix and the KPCA1NN algorithm as explained before. In what follows, these data sample is tagged as .
Differences between the time of maximum brightness determined using the entire light curve and using only premaximum data are shown in figure 13. Classification results for +SNR5 are shown in figure 11 () and 17 () and numerical results for other cases are displayed in table 5. Comparison with results from +SNR5 (figure 11) shows that, although pur and efficiency remain almost unchanged, there is a larger number of nonIa classified as . The type SNe, in this case, acts like a barrier between Ia and nonIa regions, such that expanding nonIa cover area (adding data a little more noisy) makes them being classified as before pur levels are diminished. However, this barrier only works up to a certain point.
Classification results for (figure 17) and with (figure 17), both satisfying SNR5, reflect this point. The determination of the time of maximum brightness is the only difference between these two data sets, and yet, it is already enough to lower the classification results significantly. A feature that was not verified among the samples (figure 5). This demonstrates the importance of a correct determination of the time of maximum brightness. The redshift dependent results for these 2 instances of the data are displayed in figures 17 (+SNR5) and 17 (+SNR5, with ).
These results are very encouraging. It means that, in the context of future DES data, the algorithm can correctly classify approximately of the initial data sample using only premaximum data, if the entire data set was given at once. But in a real situation this can be improved. Suppose that initially, our training sample is composed by the spectroscopic SNe sample available today. As time goes by and premaximum light curves are observed, they are automatically classified. An example strategy would be to target with spectroscopic observations the light curves whose projections in PC feature space lay in the boundaries of the SNe Ia/nonIa regions. Once the SNe type is confirmed, it can be added to the training sample, improving future classification results.
6 SNPCC sample
In order to allow a direct comparison of our results with those reported in the SNPCC, we also applied the KPCA+1NN algorithm to the data set used in the competition. This consists of 20216 simulated light curves of which 1105 represent the spectroscopic sample. This data can be consider less likely to represent the future DES data, given that all bugs listed as fixed “after SNPhotoCC” in table 4 of Kessler et al. (2010) are still part of this sample. However, the application is instructive to have an idea of how our method performs when faced to other algorithms.
Results for FoM, eff, ppur (W=3) and pur (W=1) are shown in figure 27 for different SNPCC subsamples and SNR cuts. This should be compared to figure 5 of Kessler et al. (2010), which reports results from different classifiers without using host galaxy photometric redshift. A detailed analysis of the multiple panels in figure 27 is presented in appendix E.
Our findings from this sample can be summarized through the items bellow:

There is a weak dependence of the overall classification results with particular time sampling choices. The only eyecatching difference comes from time window including the second maximum in the infrared (D).

Results are highly dependent on SNR cuts, specially efficiency and consequently, FoM.

D+SNR5 achieved FoM 0.25 for 0.2 z 0.4. A result only achieved by 3 of the entries participating on the SNPCC (namely Sako, JEDIKDE and SNANA).

Our method achieved outstanding pur and ppur results for z0.2. In this redshift range, all samples with SNR5 reported pur values larger than 75%: a result that was not obtained by none of the SNPCC entries. Particularly, in 0.2z0.4, D+SNR5 obtained 94% pur 97%, while keeping a moderate FoM. The redshift dependence of these results are displayed in figure 18.
7 Conclusion
Current SNe surveys already have at hand much more SNe light curves than it is possible to spectroscopically confirm. This situation will increase tremendously in the next decade, which makes SNe Ia photometric identification a crucial issue. In this work, we propose the use of KPCA combined with k = 1 nearest neighbour algorithm (KPCA+1NN) as a framework for SNe photometric classification.
Lately, a large effort has been applied to the SNe photometric classification problem. An up to date compilation of those efforts is reported in Kessler et al. (2010), known as the SuperNova Photometric Classification Challenge (SNPCC). It consisted of a blind simulated light curve sample as expected for the Dark Energy Survey (DES) to be used as a test ground for different classifiers. Although there were some fundamental differences between the algorithms submitted, none of the entries performed obviously better than all the others. After the results were reported, the organizers made public an updated version of the simulated data (postSNPCC). Both samples, SNPCC and postSNPCC were analysed in this work.
Our method fit in the class of statistical inference algorithms, according to the SNPCC nomenclature. All calculations are done in the observer frame. There is no corrections due to reddening, local environment, redshift or observation conditions and all available spectroscopically confirmed data surviving quality selection cuts should be used to shape the PCs feature space. The dimensionality reduction is performed using only spectroscopically confirmed SNe (training sample) and each new unlabelled light curve (test sample) is classified one at a time. This allow us to avoid introducing noisy information from nonconfirmed SNe in the classifier training. The algorithm is built so that once a new spectroscopic light curve is available or we have total confidence in a photometric one, it can easily be included in the training process, but it is not necessary to redefine the PC feature space every time a new point is to be classified.
In designing our method, we prioritize purity in the final SNe Ia sample, once it is the most important characteristic of a data set to be use for cosmology. We also decided to take a conservative approach towards the unknown features of the data. As a consequence, no extrapolation on time or wavelength domain was used and we demanded that each SNe was observed in all available filters. As expected, these choices have a great impact in our efficiency results. However, we believe that the high purity levels achieved justifies our choices (figure 27), specially in a context where there are already observed light curves not being used for cosmology due to lack of classification (Sako et al., 2011).
We highlight that we chose not to include high complexity in the different steps along the process in order to keep focus in the KPCA performance. Although, as remarked before, there is plenty of room for improvement. For example, in choosing the kernel function, the nearest neighbour algorithm degree and studying more flexible selection cuts. Such developments are worth pursuing, but one should also be aware not to fine tuning the procedure too much, so the results will apply only to one specific data set. Quantifying the dependence of our results with such change of choices is out of the scope of this work.
Results presented in this work show that KPCA+1NN algorithm provide excellent purity in the final SNe Ia sample. Although a time window since maximum brightness needs to be defined, its width does not have a large impact in final classification results. On the other hand, SNR of each observation epoch plays a crucial role. As a consequence, our best results are mainly concentrated in the intermediate range, 0.2z. From the SNPCC sample analysis in these redshifts, our method returned FoM 0.25, using D+SNR5 (figure 18). A result only achieved by 3 of the entries participating on the SNPCC (namely Sako, JEDIKDE and SNANA).
We also found outstanding purity and pseudopurity results. All samples with SNR5 reported purity values larger than 75% for z0.2: a result that was not obtained by none of the SNPCC entries. Particularly, for 0.2 z 0.4, D+SNR5 obtained 94% pur 97%, while keeping a moderate FoM (figure 18).
Among the entries submitted to the SNPCC, only the InCA group used a similar approach, although by means of completely different techniques. The results they reported to the competition provide purity rates similar the ones we get for SNR.
We stress that, although the comparison with the SNPCC results is important, it cannot be considered exactly in the same grounds as our results. First because since they were built with different purposes (the SNPCC aimed at maximum FoM and our goal was to achieve the highest possible purity while maintaining a reasonable FoM), second because we were not time constrained as the groups taking the challenge and finally, we had access to the answer key before hand. Something the competitors did not have. However, a strictly direct comparison with other results in the literature is possible through the postSNPCC sample.
Recently, the InCA group made public a detailed analysis of the results achieved by their method when applied to the postSNPCC data set (Richards et al., 2012) (R2012). The two algorithms provided similar classification results. Both achieving local maximum of FoM around 0.5, with our method giving better results at lower and theirs at higher redshifts. Averaging over the entire redshift range, we achieve FoM of 0.06 and R2012 reported 0.35. R2012 also provides results with different spectroscopic samples, constructed by redistributing DES available followup time. In their result with highest purity, they reported 90% purity, 8% eff and 0.08 FoM using a redshift limited spectroscopic sample. Our method provides 96% purity, 6% eff and 0.06 FoM for +SNR5.
Karpenka et al. (2012) also present results from postSNPCC data. In their analysis, results from a parametric fit to the spectroscopic light curves are used to train a neural network which subsequently returns the probability of a new object being a Ia. Using 50% of the initial sample as a training set (10000 objects considered spectroscopically confirmed), they found 80% purity, 85% eff and 0.51 FoM.
It is important to emphasize that the results we report above were achieved using a subset of the spectroscopic sample as it is given within the postSNPCC data. This means that it is not necessary to tailor the spectroscopic sample a priori in order to get high purity results, making our method ideal as a first approach to a large photometric data set.
In order to test the algorithm in a more restrictive scenario, we present results obtained from the postSNPCC subsample with MultiColor Lightcurve Shape (MLCS2k2) fit probability, FitProb. This sample contains light curves very similar between each other, and represents a more difficult classification challenge than the complete SNPCC data. We show that our method is not able to do more than identifying the obviously nonIa light curves when no SNR cuts are applied. However, when we compare results from data samples with SNR3, KPCA+1NN can boost purity levels to independently of time window sampling.
Finally, we report the first attempt in classifying the postSNPCC data using only premaximum epochs. This study is very important in selecting candidates for spectroscopic follow up. Using only data between 10 and 0 days since maximum brightness, we obtained 63% purity, 71% eff, 77% SC and FoM of 0.26. This is a very enthusiastic result and reflects the vast room for improvement this kind of analysis may provide in different stages of the pipeline.
We stress that the application proposed here is merely an example of how the KPCAkNN algorithm might be applied in astronomy. Beyond the specific problem of SNe Ia photometric classification, the same procedure can be used to identify other expected transient sources and even to spot still nonobserved objects among a large and heterogeneous data set. The projection of such objects in PCs feature space would occupy a previously nonpopulated locus, what would give us a hint to further investigate that particular object. In the more ideal scenario, when synthetic light curves from a nonobserver object is available, a synthetic target can be included in the training sample, leading to a detection tailored according to our expectations. This provides still another advantage over template fitting techniques, which deserve further investigation.
From what was presented here, we conclude that the decision of choosing one method over the other is not a straightforward one, but must be balanced by the characteristics of the data available and our goal in classifying it. Given that SNe without spectroscopic confirmation is not a future issue of large surveys, but a problem that is already present in the SDSS data (Sako et al., 2011), KPCA+1NN algorithm proved to be the ideal choice to quickly increase the number of SNe Ia available for cosmology with minimum contamination. Alternatively, it can also be used as a complement to other techniques in helping to increase the number of SNe Ia in the training sample. Either way, we have enough evidence to trust the competitiveness of our algorithm within the current status of the SNe photometric classification field.
Acknowledgements
We thank Masaomi Tanaka, Naoki Yoshida, Takashi Moriya, Laerte Sodré Jr, Andrea Ferrara, Andrei Mesinger and Rick Kessler for fruitful discussions and suggestions. We also thank the anonymous referee for comments that highly improved the quality of the paper. The authors are happy to thank the Institute for the Physics and Mathematics of the Universe (IPMU), Kashiwa, Japan, Scuola Normale Superiore (SNS), Pisa, Italy, Centro Brasileiro de Pesquisas Fisicas (CBPF), Rio de Janeiro, Brazil and the Asia Pacific Center for Theoretical Physics (APCTP), Pohang, South Korea, for hosting during the development of this work. RSS thanks the Excellence Cluster Universe Institute, Garching, Germany, for hosting while this work was developed. The authors acknowledge financial support from the Brazilian financial agency FAPESP through grants number 2011/095253 (EEOI) and 2009/051764 (RSS). EEOI thanks the Brazilian agency CAPES for financial support (1313100). RSS thanks the Brazilian agency CNPq for financial support (200297/20104).
Appendix A Basic proofs
This appendix contain basic proofs for the statements used throughout the text. These are common to machine learning theory field, but may not be as such for the astronomy community. They follow closely Max Welling’s notes A first encounter with Machine Learning and Schölkopf et al. (1996), which the reader is advised to check for a comprehensible introduction to the basic concepts used here.

All the vectors in the eigenvector space lie in the space spanned by the data vectors contained in
Consider ,
(21) In other words, any eigenvector can be written as a linear combination of the vectors in and, as a consequence, must lie in the space spanned by them.

Determining equation (7)

Determination of
The norm of parameters is a consequence of the normalization of the eigenvectors in .Using equation (5),(25)

Obtaining and
We begin with the definition of the covariance matrix in feature space
(26) we have to find the eigenvalues, , and eigenvectors, , which satisfy
(27) Using item (ii) above, we have that all can be written as a linear combination of the ’s. This means that we are allowed to consider the equivalent equations
(28) with the prescription that
(29) Using equations (28) and (29),
(30) Calling
(31) leads to
(32) where is a column vector. As is symmetric,
(33) with . In order to obtain , we only need to diagonalize .

Centralization in feature space
Considered the centred vectors in feature space
(36) our goal now is to define the dot product matrix
(37) In a procedure similar to (v) above, we arrive at the eigenvalue equation
(38) which has eigenvectors and
(39) In this case, we do not have the centered data points represented by equation (36), so we need to write in terms of . In what follows, consider .
Appendix B Linear PCA
We present here the results we achieved from applying linear PCA to the postSNPCC data. The procedure for deriving the PCs are described in subsection 2.1. The 2 PCs that best separate Ia and nonIa data points were identified by using a crossvalidation algorithm similar to the one described in subsection 3.2. The only difference is that, in the linear case, there is no parameter to adjust. The outcomes for sample using different SNR cuts are displayed in table 3. The graphical representation of data points projections for the SNR5 case is shown in figure 19 and the redshift dependence of the classification results are displayed in figure 20.
Comparing results for +SNR5 when U class is included in the training, presented in Tables 3 and 5, the reader can verify that the using KPCA raises the efficiency levels from 56% to 84% and the purity levels from 73% to 91%. This corresponds to approximately 50% increase in efficiency and 25% increase in purity.
Training sample  Test sample  

crossvalidated  including U  
PC pair  eff  pur  SC  eff  pur  SC  FoM  
SNR5  13  77  80  86  56  73  79  0.27 
SNR3  13  85  83  87  64  63  78  0.23 
SNR0  13  84  84  84  48  32  50  0.07 
Appendix C Results for as a function of redshift and SNR cuts
Figure 21 shows how the classification results for (test sample) behave as a function of redshift and SNR selection cuts. Figure 22 shows SC, efficiency and FoM results normalized after election cuts.
Examining the topmiddle panel of figure 23, we see that eff also suffers in high redshift due to SNe classified as (thin lines). This was another choice we made in order to preserve purity. Although a few SNe Ia are lost to the class (which is bad for efficiency), so are nonIa that would easily be mistaken with SNe Ia (which is good for purity). This effect becomes clear if we compare figures 4 and 22 to figure 23. From these we see that eff gets from 89% (without type) to 84% (with type) but at the same time purity increased from 80% to 91%, staying above 75% for the entire redshift range.
Appendix D +SNR0 classifications
We present in figures 24, 26 and 26 the classification results for +SNR0. This is shown in order to facilitate comparison with other methods from the literature which do not apply SNR cuts. However, we emphasize that, for a given time sampling, this is the worst case scenario for our method. As shown in figure 21, the classification potential of the method is highly increased with better quality data (higher SNR).
Appendix E SNPCC complete results
Figure 27 shows detailed results obtained from the SNPCC sample for different time window samplings. It is composed by 4 big panels, each one containing plots for a diagnostic parameter, organized in 3 rows and 4 columns. The rows run through SNR5, SNR3 and SNR, from top to bottom. The leftmost column in each panel show results for SNR cuts only. Meaning that all SNe surviving the corresponding SNR cut were classified as Ia. Other columns represent , and , from left to right. Outcomes from , and are similar to the ones presented in the plot, so we decided not to show them.
The first thing to notice from this figure is that the time window sampling leads to small differences in the overall classification results. Obviously higher purity results comes from , the only subsample which includes the second maximum in the infrared, for SNe Ia in . However, discrepancies between results from different SNR cuts are much larger. This shows that, despite the need to define a time window, the specific choice is not crucial in the determination of final results.
The same argument does not hold for SNR selection cuts. We see the crucial role played by the quality of each observation, no matter which diagnostic we analyse. Although this effect is noticeable in all of them, it is more evident in outcomes from eff and FoM, due to reasons already discussed in section 5. Nevertheless, our method achieved FoM for . In this redshift range, only SNPCC entries Sako, JEDIKDE and SNANA cuts reported comparable results. The behaviour of our eff plots is almost opposite to what is reported from the SNPCC. In those, the efficiency is almost always very high, what frequently comes accompanied by a low purity result.
On the other hand, our results for purity and pseudopurity are very good, specially for redshifts within . For all subsamples with SNR5, we achieved purity values larger than 75% in this redshift range, a result that is not present in none of the entries in the SNPCC. Beyond that, +SNR5 gives good results for purity and pseudopurity for , confirming the importance of observing the second maximum in the infrared.
Appendix F Summary tables
We present bellow complete tables describing our results for different light curve time samplings and SNR cuts.
Training sample  
SIM1  SNR0  SNR3  SNR5  SNR0  SNR3  SNR5  SNR0  SNR3  SNR5  SNR0  SNR3  SNR5  
Ia  559  374  213  142  409  225  145  418  232  148  297  173  119  
nonIa  544  355  315  273  397  347  303  412  350  303  282  257  222  
total  1103  729  528  415  806  572  448  830  582  415  579  430  341  
Test sample  
SIM1  SNR0  SNR3  SNR5  SNR0  SNR3  SNR5  SNR0  SNR3  SNR5  SNR0  SNR3  SNR5  
Ia  559  3181  633  431  3480  666  453  3575  673  448  2525  520  354  
nonIa  544  11346  3716  1993  12255  3926  2100  12413  3900  2096  9340  3241  1776  
total  1103  14527  4349  2424  15735  4592  2553  15988  4573  2544  11865  3761  2130  
Test sample  Training Sample  
SNR0  SNR3  SNR5  SNR0  SNR3  SNR5  
Ia  444  238  153  3555  661  437  
nonIa  440  361  312  12544  3926  2125  
total  884  599  465  16099  4587  2562 
Test sample  
Training sample  complete  exclude U  include U  
cross validated  
data set  SNR  PCs  eff  pur  SC  eff  pur  SC  eff  pur  SC  eff  pur  SC  FoM  eff  FoM  
0.9  89  89  92  89  80  94  93  87  96  84  91  91  0.64  8  0.06  
0.9  87  86  89  83  56  88  89  64  91  76  67  83  0.32  11  0.04  
0.7  88  87  87  73  30  57  81  24  41  63  34  33  0.09  44  0.06  
0.4  90  90  92  90  77  94  93  80  95  86  83  91  0.54  8  0.05  
0.7  88  88  90  77  62  90  78  70  92  64  75  83  0.32  9  0.04  
0.4  86  88  87  65  32  63  87  27  46  32  30  32  0.06  46  0.04  
1.0  84  86  90  89  71  92  91  75  93  82  83  87  0.51  8  0.05  
1.0  85  87  89  84  52  86  88  58  89  77  67  82  0.31  11  0.05  
0.6  85  84  84  69  30  58  79  26  45  52  29  34  0.06  40  0.05  
0.3  85  90  92  87  78  93  88  86  95  82  88  91  0.58  8  0.06  
0.6  83  86  88  85  53  87  88  57  89  77  67  81  0.31  11  0.05  
1.7  85  85  85  60  26  53  62  25  51  50  25  44  0.05  38  0.04  
1.9  85  86  91  82  59  87  88  60 