Optimizing Automated Classification of Periodic Variable Stars in New Synoptic Surveys
Efficient and automated classification of periodic variable stars is becoming increasingly important as the scale of astronomical surveys grows. Several recent papers have used methods from machine learning and statistics to construct classifiers on databases of labeled, multi–epoch sources with the intention of using these classifiers to automatically infer the classes of unlabeled sources from new surveys. However, the same source observed with two different synoptic surveys will generally yield different derived metrics (features) from the light curve. Since such features are used in classifiers, this survey-dependent mismatch in feature space will typically lead to degraded classifier performance. In this paper we show how and why feature distributions change using OGLE and Hipparcos light curves. To overcome survey systematics, we apply a method, noisification, which attempts to empirically match distributions of features between the labeled sources used to construct the classifier and the unlabeled sources we wish to classify. Results from simulated and real–world light curves show that noisification can significantly improve classifier performance. In a three–class problem using light curves from Hipparcos and OGLE, noisification reduces the classifier error rate from 27.0% to 7.0%. We recommend that noisification be used for upcoming surveys such as Gaia and LSST and describe some of the promises and challenges of applying noisification to these surveys.
Classification of periodic variables is crucial for scientific knowledge discovery and efficient use of telescopic resources for source follow up (Eyer & Mowlavi, 2008; Walkowicz et al., 2009). As the size of synoptic surveys has grown, a greater and greater share of the classification process must become automated (Bloom & Richards, 2011). With Hipparcos, it was possible for astronomers to individually analyze and classify each of the 2712 periodic variables observed in the survey. Starting in 2013, Gaia is expected to discover 5 million classical periodic variables over the course of its 4–5-year mission (Eyer & Cuypers, 2000). LSST, for that matter, may collect on the order of a billion (Borne et al., 2007). Individual analysis and classification by hand of all periodic variables is no longer feasible.
The need for efficient and accurate source classification has motivated much recent work on applying statistical and machine learning methods to variable star data sets (e.g., Eyer & Blake 2005; Debosscher et al. 2007; Richards et al. 2011; Dubath et al. 2011). In these papers, classifiers were constructed using light curves from a variety of surveys, such as the Optical Gravitational Lensing Experiment (OGLE, Soszyński et al. 2011), Hipparcos (Perryman et al., 1997), The All-Sky Automated Survey (ASAS, Pojmanski et al. 2005), the COnvection, ROtation & planetary Transits survey (CoRoT, Auvergne et al. 2009), and the Geneva Extrasolar Planet Search. Often the intention of these studies is to develop classifiers with high accuracy in classifying sources from surveys other than those used to construct the classifier. For example, Blomme et al. (2011) trained a classifier on a mixture of Hipparcos, OGLE, and CoRoT sources and used it to classify sources from the Trans-atlantic Exoplanet Survey (TrES, O’Donovan et al. 2009) Lyr1 field. Dubath et al. (2011) and Eyer et al. (2008) view their work on classification of Hipparcos sources as a precursor to classification of yet–to–be collected Gaia light curves. Debosscher and collaborators trained a classifier on a mixture of OGLE and Hipparcos sources in attempts to classify CoRoT sources (Debosscher et al., 2007; Sarro & Debosscher J., 2008; Debosscher et al., 2009).
It is well known that systematic differences in cadence, observing region, flux noise, detection limits, and number of observed epochs per light curve exist among surveys. Even within surveys there is heterogeneity in these characteristics. Most statistical classifiers assume that the light curves of a known class used to construct the classifier, termed training data, and the light curves of unknown class which we wish to classify, termed unlabeled data, share the same characteristics. This is unlikely to be the case when training and unlabeled light curves come from different surveys, or when the best-quality light curves of sources from each class are used to classify poorly sampled light curves of unknown class from the same survey.
To illustrate how seriously survey mismatches can deteriorate classification performance, consider the three-class problem of separating Mira variables, Classical Cepheids, and Fundamental Mode RR Lyrae from the Hipparcos and OGLE surveys. From OGLE, we use V-band data. Note that OGLE is far better sampled in I-band than V-band. We use V-band to create a setting where one set of data is well sampled while the other set is poorly sampled. See Section 5.3 and Table 2 for more information on these sources.
For each light curve we compute dozens of metrics, termed features, that contain important information related to source class (e.g., frequency and amplitude; see Section 2 for details on feature selection and extraction). Using the Hipparcos light curves we construct a classifier using CART. (CART, Classification And Regression Trees, is a popular classifier that forms a sequence of nested binary partitions of feature space; see Breiman et al. (1984) for more on CART.) The resulting classifier uses only two features for separating classes: the amplitude of a best fit sinusoidal model and the 90th percentile of the slope between phase adjacent flux measurements after the light curve has been folded on twice the estimated period.
Figure 1a displays these two features for each Hipparcos source with grey lines denoting the class boundaries chosen by CART. Based on the Hipparcos light curves, this looks like an excellent classifier as each of the three regions of feature space selected by CART contains sources of only one class. However, examining a subset of the OGLE sources, Figure 1b, shows large class overlap on these two features. Here these two features do not separate OGLE sources well. The error rate measured by cross–validation on the Hipparcos sources was only 0.6% (see Section 2.4 for a definition of cross–validation). However, the misclassification rate on the OGLE sources is 30.0%.
Despite what the 30.0% error rate seems to imply, the problem of separating classes in OGLE is not inherently difficult. A CART classifier trained on the OGLE light curves has a cross–validated error rate of 1.3%. While there are many systematic differences between the Hipparcos and OGLE surveys, their radically different cadences and number of flux measurements per light curve appear to be driving the increase in misclassification rate. For example, both features in Figure 1 depend on the estimate of each source’s period; yet, over 25% of the RR Lyrae in OGLE have incorrectly estimated periods due to poor sampling in the V-band.
A natural question to ask is: If we had observed the Hipparcos sources at an OGLE cadence, what classifier would CART have constructed, and how would this have changed the error rate? In this paper we use noisification, a method which matches the cadence of training data and unlabeled data by inferring a continuous periodic function for each training light curve and then extracting flux measurements at the cadence and photometric error level present in the unlabeled light curves. The purpose of noisification is to automatically shift the distribution of features in the training data closer to the distribution of features in the unlabeled data so that a classifier can determine class boundaries as they exist in the unlabeled data. Versions of noisification were introduced in Starr et al. (2010) and Long et al. (2011). In this paper, we demonstrate that noisification improves classification accuracy on several simulated and real–world data sets. For instance, on the OGLE – Hipparcos three class problem we reduce the misclassification rate by 20.0 percentage points (from 27.0% to 7.0%). Performance increases are greatest when the training data is well sampled at a particular cadence while unlabeled light curves are either poorly time sampled or observed at a different cadence.
This paper is organized as follows. In Section 2 we briefly outline the statistical classification framework and show how it is applied in the context of periodic variables. In Section 3 we illustrate the problems that occur when training and unlabeled data come from different surveys. We present noisification, a method for overcoming differences related to number of flux measurements, cadence, and photometric error in Section 4. In Section 5 we apply noisification to several data sets. Finally in Section 6 we discuss possible uses of noisification for upcoming surveys.
2 Overview of Classification of Periodic Variables
Here we review a methodology for constructing, implementing, and evaluating statistical classifiers for periodic variables. This approach has been used in many recent works. For a more detailed review of the methodology see Debosscher et al. (2007) or Richards et al. (2011).
2.1 Constructing a Classifier
We start with a set of light curves of known class, termed training data, and a set of light curves of unknown class, termed unlabeled data. Our goal is to determine the classes for the unlabeled light curves using information present in the training data. Each light curve consists of a set of time, flux, and photometric error measurements. We compute functions of the time, flux, and photometric error, termed features. Features are chosen to contain information relevant for differentiating classes. The same set of features is computed for each light curve. A statistical classification method uses the training data to learn a relationship between features and class and produces a classifier $\hat{C}$. Given the features, $x$, for a light curve in the unlabeled set, $\hat{C}(x)$ is a prediction of its class.
2.2 Feature Set
We use a total of 62 features to describe each light curve. 50 of these features are described in Tables 4 and 5 of Richards et al. (2011). (We do not use pair_slope_trend, max_slope, or linear_trend.) We use 12 other features, described in Appendix A of this article. Many of the features that we use are obvious choices, e.g., frequency and amplitude. Most of our features, or features very similar to the ones here, have been used in recent work on classification of periodic variables (Kim et al., 2011; Dubath et al., 2011).
2.3 Choosing a Classifier
There are many statistical classification methods for constructing the classifier function $\hat{C}$. Some of the most popular include linear discriminant analysis (LDA), neural networks, support vector machines (SVMs), and Random Forests. In an earlier example we used CART. Each classification method has its own strengths and weaknesses. See Hastie et al. (2009) for an extensive discussion of classification methods. In this work we use the Random Forests classifier developed by Breiman (2001), Amit & Geman (1997), and Dietterich (2000). Random Forests has been used, with high levels of success, in recent studies of automated variable star classification (Richards et al., 2011; Dubath et al., 2011). Richards et al. (2011), in a side–by–side comparison of 10 different classifiers using OGLE and Hipparcos data, found that Random Forest had the lowest error rate.
2.4 Estimating Classifier Accuracy
Usually, researchers want an estimate of how accurate the classifier, $\hat{C}$, will be when presented with new, unlabeled data. Simply calculating the proportion of times $\hat{C}$ correctly classifies light curves in the training data is a poor estimate of classifier success, as this typically overestimates classifier performance on unlabeled data. Better assessment of classifier performance on unlabeled data is attained by using training–test set splits or cross–validation. With training–test set splits a fraction of the data, usually between 10% and 30%, is “held out” while the rest of the data is used to train the classifier. Subsequently, the held out observations are classified and the accuracy recorded. This number provides an estimate of how well the classifier will perform on unlabeled observations. In cross–validation, the training–test split is repeated many times, holding out a different set of observations at each iteration. The accuracy of the classifier is recorded at each iteration and then averaged. See Chapter 7 of Hastie et al. (2009) for more information on assessing classifier performance. Cross–validation has been the method of choice for evaluating classifier performance in many of the recent articles on classification of periodic variables.
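The cross–validation procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the results in this paper: the function names, the choice of five folds, and the nearest-centroid stand-in classifier are all assumptions made for the example (the paper itself uses CART and Random Forests).

```python
import random

def k_fold_error(features, labels, train_fn, k=5, seed=0):
    """Estimate the misclassification rate by k-fold cross-validation.

    train_fn(X, y) must return a function mapping one feature vector
    to a predicted class. Requires len(labels) >= k.
    """
    idx = list(range(len(labels)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint held-out sets
    fold_errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        clf = train_fn([features[i] for i in train],
                       [labels[i] for i in train])
        wrong = sum(clf(features[i]) != labels[i] for i in held_out)
        fold_errors.append(wrong / len(held_out))
    # Average the per-fold error rates, as described in the text.
    return sum(fold_errors) / k

def train_nearest_centroid(X, y):
    """Hypothetical stand-in classifier: predict the class with the nearest mean."""
    sums, counts = {}, {}
    for row, c in zip(X, y):
        acc = sums.setdefault(c, [0.0] * len(row))
        for j, v in enumerate(row):
            acc[j] += v
        counts[c] = counts.get(c, 0) + 1
    centroids = {c: [s / counts[c] for s in sums[c]] for c in sums}

    def classify(row):
        return min(centroids,
                   key=lambda c: sum((a - b) ** 2
                                     for a, b in zip(row, centroids[c])))
    return classify
```

Any classifier can be substituted for the stand-in, provided its training function has the same shape. Note that this estimate is only trustworthy when the held-out data resemble the data to be classified, which is the issue taken up in Section 3.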
3 Feature Distributions and Survey Systematics
The classification framework described above comes with assumptions and limitations. Of critical importance, statistical classification methods are only designed to produce accurate classifiers when the relationship between features and classes is the same in training and unlabeled data. This is formalized as follows. Let $y$ represent the class for a source with features $x$. Let $p_{tr}(y \mid x)$ be the probability of class given features in the training set and $p_{u}(y \mid x)$ be the probability of class given features for unlabeled data. Statistical classifiers are designed to have high accuracy when
$$p_{tr}(y \mid x) = p_{u}(y \mid x).$$
In the three class example in the introduction, we saw that this was not the case due, in part, to incorrect estimation of periods in the unlabeled (OGLE) light curves. Violating this assumption will also cause cross–validation to make incorrect predictions of classifier accuracy.
In this section we illustrate the complex connection between survey systematics and feature distributions. We show how this connection causes the assumption $p_{tr}(y \mid x) = p_{u}(y \mid x)$ to break, potentially leading to poor classifier performance on the unlabeled data.
3.1 Periodic Features
Nearly every study of classification of periodic variables has used period (or frequency) as a feature. Often in the training set, the period is correct for a large majority of sources due to the investigators selecting the highest quality light curves of each source class of interest. However, if periods are estimated incorrectly for the unlabeled data, then a classifier constructed on the training data may not capture the period–class relationship as it exists for the unlabeled data.
For example, it has been suggested that light curves from early Gaia data releases be labeled using classifiers trained on Hipparcos light curves (Eyer et al., 2008; Eyer et al., 2010). Figure 2a shows a density plot of the estimated frequency for three source classes in Hipparcos (sources used in Dubath et al. 2011), using light curves from the entire 3.5-year survey. The median number of flux measurements per light curve is 91. However, one year into Hipparcos the densities of the estimated frequency for these source classes look significantly different (Figure 2b). The median number of flux measurements per light curve is now 29. Thus, even if we assume that Gaia and Hipparcos have similar survey characteristics, a classifier built on the 3.5-year baseline Hipparcos training set will not accurately capture the frequency–class relationship as it exists in 1-year Gaia data. This is due to incorrect estimates of frequency for the 1-year length light curves. Since it is often the case that many features depend on frequency (e.g. Table 4 of Richards et al. (2011) and Section 4.5 of Dubath et al. (2011)), systematic differences in estimates of frequency can alter the distributions of many features.
3.2 Time-Ordered Flux Measurements
Several recent studies of classification of periodic variables have used features that depend on the time ordering of flux measurements. For example, Dubath et al. (2011) used point–to–point scatter (P2PS), the median of absolute differences between adjacent flux measurements divided by the median absolute difference of flux measurements around the median. Specifically, given some light curve x with time ordered flux measurements $m_1, \ldots, m_n$,
$$\mathrm{P2PS}(x) = \frac{\mathrm{median}\left(\{|m_{i+1} - m_i|\}_{i=1}^{n-1}\right)}{\mathrm{median}\left(\{|m_i - \mathrm{median}(\{m_j\}_{j=1}^{n})|\}_{i=1}^{n}\right)} \qquad (1)$$
where $\mathrm{median}(\cdot)$ denotes the median. While potentially useful for classification, the behavior of this feature is heavily dependent on the cadence of time sampling. To see this, consider a two class problem where class 1 is sine waves of amplitude 1 with period drawn uniformly at random between 0.25 days and 0.75 days and class 2 is sine waves of amplitude 1 where period is drawn uniformly at random between 2 days and 8 days. Say we observe 20 flux measurements for each source. Figure 3 shows the density of P2PS for 200 sources of each class with (a) 30 minutes, (b) 2 days, and (c) 10 days between successive flux measurements. At 30 minutes and 2 days the feature is useful for distinguishing classes, but in opposite directions. At 10 days the feature is no longer useful.
The process of how cadence and period produce the P2PS feature density is complex. For class 2 (2 day to 8 day periods) at 30 minute cadence, the flux measurements for each source are often monotonically increasing or decreasing, producing a small numerator relative to denominator in equation (1). When the cadence is large relative to the distribution of periods for the source class, the functional shape of the light curve determines the P2PS density. In Figure 3c where the cadence is longer than any possible period for either class, the two classes have the same density because they have the same functional shape (sine waves).
Note that this extreme sensitivity to cadence is not based on having 20 flux measurements per light curve. Running these simulations with 100 flux measurements per light curve produces densities of roughly the same shape. Rather, this example suggests how useful P2PS may be for distinguishing between classes in a setting where it may be difficult to determine a correct period (20 flux measurements per light curve), and how sensitive it is to systematic differences in cadence between training and unlabeled data.
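The two class simulation in this section is simple enough to reproduce. The sketch below computes P2PS as defined in the text (median absolute difference between adjacent fluxes, divided by the median absolute deviation of the fluxes about their median) for noise-free sine waves. The function names, the uniformly random starting phase, and the absence of photometric noise are assumptions made for illustration; the text does not specify them.

```python
import math
import random
from statistics import median

def p2ps(fluxes):
    """Point-to-point scatter of time-ordered flux measurements."""
    diffs = [abs(b - a) for a, b in zip(fluxes, fluxes[1:])]
    center = median(fluxes)
    spread = median(abs(m - center) for m in fluxes)
    return median(diffs) / spread

def simulate_p2ps(period_range, cadence_days, n_epochs=20, n_sources=200, seed=0):
    """P2PS for amplitude-1 sine waves sampled at a fixed cadence.

    Periods are drawn uniformly from period_range (days). The random
    starting phase is an assumption of this sketch.
    """
    rng = random.Random(seed)
    values = []
    for _ in range(n_sources):
        period = rng.uniform(*period_range)
        phase = rng.uniform(0.0, period)
        fluxes = [math.sin(2 * math.pi * (i * cadence_days + phase) / period)
                  for i in range(n_epochs)]
        values.append(p2ps(fluxes))
    return values

# Compare the two classes at the three cadences used in Figure 3.
for cadence in (30 / (24 * 60), 2.0, 10.0):
    class1 = simulate_p2ps((0.25, 0.75), cadence)
    class2 = simulate_p2ps((2.0, 8.0), cadence)
    print(f"cadence {cadence:.3f} d: "
          f"median P2PS class 1 = {median(class1):.3f}, "
          f"class 2 = {median(class2):.3f}")
```

Plotting the returned values as densities for each class and cadence gives a figure in the spirit of Figure 3, although the exact shapes will depend on the assumptions listed above.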
3.3 Time-Independent Features
Finally, some of the most useful features for periodic variable classification are simple functions of flux measurements such as estimated amplitude, standard deviation, and skew. Figure 4 shows how estimated amplitude of Miras differs in distribution between the Hipparcos and OGLE surveys. (The Hipparcos Miras were used in Debosscher et al. (2007). The OGLE sources are V-band data from the OGLE III Catalog of Variable Stars: http://ogledb.astrouw.edu.pl/~ogle/CVS/) In Hipparcos there are no Miras with amplitude greater than 3 mag while roughly 12% of Miras in OGLE have amplitude greater than 3 mag. The mode of the densities is different as well.
There are several possible causes for the difference in shape of these densities. The median difference between last observation time and first observation time for OGLE sources is 1902 days and 1142 days for Hipparcos. Since Miras vary in amplitude through each period, it is possible that OGLE is simply observing more periods and picking up on lower troughs and higher peaks than Hipparcos. Additionally, many OGLE sources have large mean photometric error (not shown), which may be driving up estimates of amplitude. Also, OGLE and Hipparcos sources were observed with different filters, possibly leading to biases in estimated amplitude.
It is also worth noting that the Hipparcos catalog light curves are themselves a composite of Selected sources chosen for their scientific interest before the mission and a set of Survey sources which represent a nearly complete sample to well defined magnitude limits (which depend on spectral type and galactic latitude). Figure 5 shows boxplots of amplitudes in Hipparcos for classes with over 50 sources, blocked into Survey and Selected. The Selected sources appear to have larger amplitudes on average than the Survey sources. A statistical classifier trained on this data will discover class boundaries for this mixture of Selected and Survey sources. However, if the unlabeled data resemble the Survey sources, these boundaries may not separate classes well.
4 Noisification
We have shown how differences in survey systematics can alter feature distributions and deteriorate classifier performance. These survey systematics exist between and within surveys. In this section we describe noisification, our solution to addressing training–unlabeled set differences. We use noisification to overcome differences in training–unlabeled feature distributions caused by differences in the number of flux measurements, cadence, and level of photometric error of light curves. Before introducing noisification we discuss a few recent works in the periodic variable classification literature that account for differences in training and unlabeled data and the extent to which they address distribution shifts discussed in Section 3.
4.1 Related Work
Two recent works, Richards et al. (2012) and Debosscher et al. (2009), have adapted classifiers to address training–unlabeled data set differences by adding unlabeled data to the training set. Richards et al. (2012) applied an active learning methodology to successfully improve classifier performance on ASAS unlabeled data using OGLE and Hipparcos training data. Debosscher et al. (2009) used a method similar to self-training (Nigam & Ghani, 2000) where after applying a classifier trained on Hipparcos and OGLE sources to CoRoT data, the most confidently labeled CoRoT sources were added to the training data. From this new training set, they constructed a classifier and used it to classify the remaining CoRoT sources.
Both active learning and self-training are designed to work when the feature densities in training and unlabeled data are different, but the feature–class relationship is the same. More formally, if $p_{tr}(x)$ and $p_{u}(x)$ are the feature densities in training and unlabeled data, then active learning and self-training are designed to address the setting where $p_{tr}(x) \neq p_{u}(x)$, not the setting where the probability of class $y$ given features $x$ differs, $p_{tr}(y \mid x) \neq p_{u}(y \mid x)$. However with our problem, differences in number of flux measurements, cadence, and photometric error induce different relationships between class and features. For instance, consider the P2PS cadence example in §3.2, Figure 3. If the left plot, (a), is the training data P2PS class densities and the center plot, (b), is the unlabeled P2PS class densities, then moving data from (b) to (a) (as is done with active learning and self-training) would produce class densities that are a mixture of (a) and (b). Training a classifier on a mixture of (a) and (b) densities is unlikely to produce a classifier that has high accuracy on data with the class densities in (b).
A method that comes closer to addressing class–feature distribution differences was used in Debosscher et al. (2009) to overcome aliasing in period estimation. There the authors found that the orbital frequency of the CoRoT mission caused spurious spectral peaks and induced incorrect period estimation for sources. Their solution was to disregard spectral peaks at the orbital frequency.
Effectively, Debosscher et al. (2009) asked the question “What would the value of this light curve’s period feature have been if it had been observed at a cadence matching the training data?” In their case, the answer is fairly straightforward. However, it is much less clear how to correct other features in a similar manner. If the unlabeled sources are observed for 10 days, then it is likely that estimates of amplitude are biased. But by how much? If the source is a Mira, then likely by a lot, but if the source is an RR Lyrae possibly not at all. So in order to correct amplitude estimates we need to know, or have some idea of, the class of the unlabeled source. But this returns to the goal of classification in the first place.
In Long et al. (2011) this approach was termed denoisification. For each unlabeled source the authors estimated a distribution across features representing uncertainty on what the feature values would have been if the source had been observed at the cadence, noise level, and number of flux measurements of the training data. This distribution was combined with a classifier constructed on training data in order to classify unlabeled sources. While denoisification was superior to not adjusting for training–unlabeled distribution differences, the method did not achieve as large performance increases as noisification.
Noisification overcomes training–unlabeled set differences by altering the training set so that the number of flux measurements, cadence, and photometric error match that of the unlabeled data. A classifier can then use this “noisified” training data to determine class boundaries as they exist for the unlabeled data. Noisification was introduced in Starr et al. (2010). Long et al. (2011) described a specific version of noisification appropriate for when training and unlabeled data have different numbers of flux measurements but are otherwise identical. Here we describe a far more general version of noisification which can be used across surveys when unlabeled sources have a systematically different number of flux measurements, cadence, and photometric error than the training data. Code written in Python and R is available for implementing noisification of light curves (code available here: http://stat.berkeley.edu/~jlong/noisification).
4.2 Implementation of Noisification
Given a set of training light curves, we first estimate a period for each. (Noisification assumes we have training sources that are of high enough quality that we can estimate periods accurately.) Next, we smooth the period folded light curves, turning each set of flux measurements into a continuous periodic function. Select a light curve $x$ from the training set, and then at random choose a light curve, $u$, from the unlabeled set. Let $g$ be the smooth periodic function associated with $x$. Let $t_{u,i}$, $m_{u,i}$, and $\sigma_{u,i}$ represent the time, flux and photometric error for epoch $i$ of light curve $u$. Say there are $n_u$ flux measurements for light curve $u$. We now extract $n_u$ flux measurements from the periodic function $g$ matching the cadence and photometric error present in $u$. Specifically, if we let $\tilde{t}_i$, $\tilde{m}_i$, and $\tilde{\sigma}_i$ be the time, flux, and photometric error of light curve $x$ noisified to light curve $u$, then for $i = 1, \ldots, n_u$ we have,
$$\tilde{t}_i = t_{u,i}, \qquad \tilde{m}_i = g(t_{u,i} + \delta) + \epsilon_i, \qquad \tilde{\sigma}_i = \sigma_{u,i}.$$
$\delta$ is a phase offset drawn uniformly at random between 0 and the period of $g$, $p$. This represents the fact that we are equally likely to start observing a source at any point in its phase. $\epsilon_i$ is the photometric error added to each flux measurement, with its scale set by $\sigma_{u,i}$.
The cadence and level of photometric error in this new, noisified version of light curve now match that of the unlabeled data. Repeat this process for every training light curve. Then derive features for the noisified training data, train a classifier on these observations, and classify the unlabeled light curves using this classifier. We call this process noisification because if our training data consists only of well-sampled light curves and our unlabeled data consists mainly of poorly sampled light curves, then the technique effectively adds noise to features in the training data to more closely match the characteristics of the unlabeled features. See Figure 6 for a concise description of the algorithm.
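The resampling step at the heart of the procedure can be sketched as follows. This is an illustrative sketch rather than the released noisification code: the function names and data layout are hypothetical, the smooth periodic function is passed in as an ordinary Python callable instead of being fit with a smoother, and the added photometric error is drawn from a zero-mean Gaussian whose standard deviation is the unlabeled light curve's reported error, which is an assumption.

```python
import math
import random

def noisify(g, period, times_u, errors_u, rng):
    """Resample a smooth periodic model at an unlabeled light curve's epochs.

    g        : callable, periodic model of a training light curve (flux vs. time)
    period   : period of g, in the same units as times_u
    times_u  : observation times of the chosen unlabeled light curve
    errors_u : photometric errors of the chosen unlabeled light curve
    Returns a list of (time, flux, error) tuples.
    """
    delta = rng.uniform(0.0, period)  # random phase offset
    noisified = []
    for t, sigma in zip(times_u, errors_u):
        noise = rng.gauss(0.0, sigma)  # Gaussian noise model is an assumption
        noisified.append((t, g(t + delta) + noise, sigma))
    return noisified

def noisify_training_set(models, unlabeled, seed=0):
    """Noisify each training model to one randomly chosen unlabeled light curve.

    models    : list of (g, period) pairs, one per training light curve
    unlabeled : list of (times, errors) pairs, one per unlabeled light curve
    """
    rng = random.Random(seed)
    out = []
    for g, period in models:
        times_u, errors_u = rng.choice(unlabeled)
        out.append(noisify(g, period, times_u, errors_u, rng))
    return out

# Example: a sine-wave "training" model noisified to a sparse cadence.
model = lambda t: math.sin(2 * math.pi * t / 0.5)
sparse = ([0.0, 1.3, 2.9, 7.4, 11.2], [0.05] * 5)
print(noisify_training_set([(model, 0.5)], [sparse]))
```

Features would then be derived from the returned light curves and used to train the classifier. Each training light curve is noisified once per run, so the resulting training observations remain independent.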
4.3 Remarks on Noisification
There are a few important points to note about this procedure. First, if the training and unlabeled data have the same cadence and photometric error, then smoothing the training light curves is not necessary. This would be the case, for example, if we had a set of training light curves of known class with many flux measurements ($\gtrsim 100$) from one survey and we wanted to classify an unlabeled set of poorly sampled light curves (about 30 flux measurements) of similar cadence and photometric error level from the same survey as the training data. Then we could simply take the training light curves, truncate them at 30 flux measurements, train a classifier on the truncated curves, and apply this classifier to the unlabeled light curves. This setting has the added benefit that no error will be introduced by smoothing the light curves. In this case the training sources do not need to be periodic.
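When no smoothing is needed, the procedure reduces to truncation. A minimal sketch, assuming each light curve is stored as a list of (time, flux, error) tuples (a hypothetical layout chosen for this example):

```python
def truncate_light_curve(light_curve, n_keep):
    """Keep the first n_keep epochs in time order, preserving the cadence."""
    ordered = sorted(light_curve, key=lambda epoch: epoch[0])
    return ordered[:n_keep]

def truncate_training_set(training_set, n_keep=30):
    """Truncate every training light curve to n_keep flux measurements."""
    return [truncate_light_curve(lc, n_keep) for lc in training_set]

# Example: a 100-epoch light curve cut down to its first 30 epochs.
well_sampled = [(float(i), 10.0 + 0.01 * i, 0.02) for i in range(100)]
short = truncate_light_curve(well_sampled, 30)
print(len(short), short[0][0], short[-1][0])
```

Because contiguous epochs are kept, the spacing between successive measurements in the truncated curve is the same as in the original.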
Secondly, the procedure as described is most appropriate if all of the unlabeled data have similar numbers of flux measurements, cadence and photometric error. If this is not the case, then we can repeat the procedure several times using different subsets of the unlabeled data which share similar properties. For example, if unlabeled light curves have either around 20 or around 70 flux measurements, then we could break the unlabeled data into two sets and classify each set using a separate run of the noisification procedure. The more subsets of the unlabeled data one uses, the closer the noisified training data gets to the unlabeled data. The tradeoff is computational burden. With $n$ training light curves and $m$ unlabeled light curves, noisifying to precisely match the properties of each unlabeled light curve requires deriving features for $nm$ light curves. In Section 5 we explore how much one can gain from dividing the unlabeled data into subsets.
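One simple way to form such subsets is to bin the unlabeled light curves by their number of flux measurements. The sketch below is an illustration only: the function names and the bin edge of 45 are hypothetical, and the cost function assumes one noisified copy of the training set is built per subset.

```python
def group_by_length(unlabeled, edges):
    """Split unlabeled light curves into bins by number of flux measurements.

    edges is an ascending list of bin boundaries; a light curve with n
    measurements goes to the bin whose index is the number of edges <= n.
    """
    groups = {i: [] for i in range(len(edges) + 1)}
    for lc in unlabeled:
        n = len(lc)
        bin_index = sum(n >= e for e in edges)
        groups[bin_index].append(lc)
    return groups

def feature_extractions(n_train, n_groups):
    """Light curves needing feature derivation: one noisified training set per group."""
    return n_train * n_groups

# Example: light curves with roughly 20 or roughly 70 measurements.
unlabeled = [[0.0] * n for n in (18, 22, 69, 71, 20)]
groups = group_by_length(unlabeled, edges=[45])
print({k: len(v) for k, v in groups.items()})
print(feature_extractions(500, len(groups)))
```

With two subsets and 500 training light curves this requires features for 1000 noisified light curves, compared with 500 times the number of unlabeled light curves when matching each unlabeled light curve individually.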
With noisification, the unlabeled light curve, $u$, at which to noisify training light curve $x$, the phase offset $\delta$, and the added photometric errors $\epsilon_i$ are all random. Thus, repeating the noisification process several times and obtaining several classifiers offers potential for improvement in classifier performance over running the process once. We study this in Section 5. While building several classifiers may be a good idea, it is important not to train a classifier using several noisified versions of the same light curve as the training data would no longer be independent. This can cause classifiers to overfit the data, hurting classifier performance.
Note that noisification is classifier independent. We use Random Forests in this work, but noisification can be used in conjunction with essentially any statistical classification method. Here we use Super Smoother for transforming training light curves into continuous periodic functions (Friedman, 1984). (Fortran code here: http://www-stat.stanford.edu/~jhf/ftp/progs/supsmu.f. We used automatic span selection (span) and a high frequency penalty. These choices were based on visual inspection of smoothing fits to light curves.) The method used for inferring continuous training curves is separate from the rest of the noisification process. Splines and Nadaraya-Watson methods are other possibilities. Splines are described in Section 5.4 of Hastie et al. (2009). See Hall (2008) for using Nadaraya-Watson with periodic variables.
Finally we stress that this implementation of noisification is limited to addressing differences between training and unlabeled sets caused by number of flux measurements, cadence, and photometric error. We do not correct for differences in feature distributions due to observing regions, detection limits, or filters.
5.1 Noisification within a Survey
Table 1
|Survey|Source Classes|F / LC|# Train|# Unlabeled|
|Simulated|RR Lyrae, Cepheid, β Persei,|200-200|500|500|
|OGLE|RR Lyrae DM, MM Cepheid, β Persei, β Lyrae, W Ursae Majoris|261-474|358|165|
Notes: In the case of the simulated data, the light curves were made to resemble these classes. F / LC gives the first and third quartiles of the number of flux measurements per light curve for training. We use every light curve of these classes analyzed in Richards et al. (2011).
To get a sense of how noisification performs in a controlled setting, we first test the method using training and unlabeled data from the same survey, but with systematically differing number of flux measurements. This resembles the real–life situation where well sampled light curves of known class are used as training data to classify poorly sampled curves of unknown class from the same survey. The cadence and levels of photometric error are assumed to match in the training and unlabeled data. We are also free from worrying about survey characteristics that noisification does not address. We perform two experiments, one using a simulated light curve data set and one using an OGLE light curve data set. (Here the OGLE curves are in I-band.) See Table 1 for data set information.
After splitting each data set into training and unlabeled sets, we downsample the light curves in the unlabeled data set to 10 through 100 flux measurements in multiples of 10. Now the unlabeled data sets resemble the training in every way except for the number of flux measurements per light curve. To each of the ten unlabeled data sets we apply four classifiers and compute classification accuracy on the unlabeled data sets. Figure 7 provides error rates for the four classifiers applied to the 10 unlabeled sets from (a) simulated and (b) OGLE. The four classifiers are:
naive (black circles): Random Forest constructed on the unaltered training data
unordered (red triangles): every training light curve is reduced to the number of flux measurements in the unlabeled set by choosing a random, non-contiguous set of epochs (cadence information is lost)
1x noisification (green plus): noisification without smoothing as described in Section 4
5x noisification (blue x): “1x noisification” repeated five times as discussed in Section 4
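The contrast between the “unordered” and noisified training sets can be sketched in a few lines. This is only an illustration, not the authors' code: it assumes that noisification without smoothing keeps a contiguous run of epochs (as the comparison with “unordered” suggests), and the function names and the nightly cadence are hypothetical.

```python
import random

def contiguous_subset(times, fluxes, n, rng):
    """Keep n consecutive epochs, so the spacing between retained epochs is unchanged."""
    if n > len(times):
        raise ValueError("cannot keep more epochs than the light curve has")
    start = rng.randrange(len(times) - n + 1)
    return times[start:start + n], fluxes[start:start + n]

def unordered_subset(times, fluxes, n, rng):
    """Keep n randomly chosen epochs, which changes the gaps between retained epochs."""
    if n > len(times):
        raise ValueError("cannot keep more epochs than the light curve has")
    keep = sorted(rng.sample(range(len(times)), n))
    return [times[i] for i in keep], [fluxes[i] for i in keep]

rng = random.Random(0)
times = [float(day) for day in range(200)]  # hypothetical nightly cadence (assumption)
fluxes = [0.0] * len(times)                 # flux values are irrelevant to this comparison

t_contig, _ = contiguous_subset(times, fluxes, 10, rng)
t_random, _ = unordered_subset(times, fluxes, 10, rng)
span_contig = t_contig[-1] - t_contig[0]    # 9 days for a nightly cadence
span_random = t_random[-1] - t_random[0]    # at least 9 days, usually far longer
```

Both subsets have the same number of flux measurements, but only the contiguous one covers the same time baseline that a 10-measurement unlabeled curve from this cadence would cover.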
The results in Figure 7 suggest that noisification can significantly increase classification performance when the unlabeled data is poorly sampled. With OGLE, “naive” misclassifies around 32% of light curves with 30 flux measurements while “5x noisification” misclassifies around 21%. Based on the difference between the “unordered” and “1x / 5x noisification” procedures, it appears that having a training cadence that matches the cadence of the unlabeled data can improve classification performance. We explore this in more detail later when training and unlabeled data come from surveys with different cadences. The “5x noisification” advantage over “1x noisification” is fairly modest. Repeatedly noisifying the training data and averaging the resulting classifiers reduces variance and leaves bias unchanged, so we see no way that using “5x noisification” instead of “1x noisification” could hurt classifier performance. For the remainder of the paper, noisification refers to “5x noisification.”
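One way to combine repeatedly noisified classifiers is to average their predicted class probabilities. The sketch below assumes that combination rule (the actual rule is specified in Section 4) and represents each classifier as a hypothetical callable that returns a dictionary of class probabilities.

```python
def average_class_probabilities(classifiers, features):
    """Average the class-probability dictionaries returned by several classifiers."""
    totals = {}
    for classify in classifiers:
        for label, prob in classify(features).items():
            totals[label] = totals.get(label, 0.0) + prob
    return {label: total / len(classifiers) for label, total in totals.items()}

def predict(classifiers, features):
    """Return the class with the highest averaged probability."""
    probs = average_class_probabilities(classifiers, features)
    return max(probs, key=probs.get)

# Two toy stand-ins for classifiers built on different noisified training sets.
clf_a = lambda features: {"Mira": 0.6, "RR Lyrae": 0.4}
clf_b = lambda features: {"Mira": 0.2, "RR Lyrae": 0.8}

averaged = average_class_probabilities([clf_a, clf_b], features=None)
label = predict([clf_a, clf_b], features=None)
```

Averaging leaves the expected prediction of a single noisified classifier unchanged while damping the randomness introduced by any one noisified draw.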
To investigate how noisified classifiers differ, we plot feature importances for the “1x noisification” classifier for 10 and 100 flux measurements for the OGLE data (see Figure 8). Random Forest feature importance measures were introduced by Breiman (2001) and have been used in recent studies of periodic variables to gain an understanding of which features Random Forests considers most highly when assigning a class to a light curve. See Dubath et al. (2011) Section 4.1 for a complete description of feature importance. Figure 8 shows that skew is very important for both classifiers. Notice that the 100 flux measurement classifier ranks several period based features as being important – scatter_res_raw, freq_signif, and freq1_harmonics_freq_0 – while the 10 flux measurement classifier does not. The interpretation is clear: when classifying light curves with 10 flux measurements, features that require a correct period will not be very useful. The process of noisifying light curves causes the classifier to recognize this and make use of class information present in other features.
In these two examples, light curves in the unlabeled data set always had one of 10 possible numbers of flux measurements (10, 20, ..., 100). The noisified light curves had exactly the same number of flux measurements as the unlabeled data. In practice, we will need to classify light curves with any number of flux measurements. It may be computationally challenging to construct noisified classifiers for every possible number of flux measurements. To test how sensitive error rates are to how light curves are noisified, we took the noisified classifiers for 10, 50, and 100 flux measurements and applied them across all 10 of the unlabeled data sets. Figure 9 shows the results for the (a) simulated and (b) OGLE data. We plot the error rates of these three classifiers along with the error rate of the classifier noisified to the number of flux measurements actually in the unlabeled data set (the “5x noisified” classifiers from Figure 7). The results show that for these examples the error rates are fairly insensitive to exactly how many flux measurements we use in the noisified classifier. For the OGLE data, the classifier noisified to 10 flux measurements performs well until unlabeled light curves have around 70 flux measurements. Additionally the 50–flux and 100–flux noisified classifiers perform well for unlabeled data sets with between 30 and 100 flux measurements.
5.2 Noisification with Smoothing
We now address the challenge of training a classifier on a survey with one cadence to classify light curves of a different cadence. In order to ensure that all differences between training and unlabeled data are due to issues addressed by noisification (number of flux measurements, cadence, photometric error) we use the simulated light curve prototypes from Section 5.1 for both training and unlabeled data sets. We sample the light curves at actual Hipparcos and OGLE light curve cadences used in previous studies (Richards et al., 2011; Debosscher et al., 2007).
Systematic differences exist between the OGLE and Hipparcos survey cadences. OGLE is a ground based survey with flux measurements taken at multiples of one day plus or minus a few hours. The sampling for these curves is quite regular with few large gaps. In contrast, Hipparcos light curves tend to be sampled in bursts, with several measurements over the course of less than a day followed by long gaps.
In practice, one data set (say, Hipparcos) would be used to train a classifier in order to classify sources in the other (say, OGLE). However, since these light curves are simulated and we have labels for both sets, we create training and unlabeled data sets at Hipparcos and OGLE cadences so we can study the challenge of constructing a classifier on Hipparcos for use on OGLE sources and vice versa. We begin by generating 1000 simulated light curves using the class templates from Section 5.1. For 500 of these curves we randomly select an OGLE cadence and sample flux measurements and photometric errors from this cadence. We then take these 500 curves and downsample them to have 10 through 100 flux measurements in multiples of 10. The original 500 curves cadenced to OGLE are the OGLE training set, and the downsampled curves are the 10 OGLE unlabeled data sets. We repeat this process for the other 500 simulated curves at Hipparcos cadences.
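Sampling a periodic template at a survey cadence can be illustrated as follows. This is a minimal sketch under stated assumptions: Gaussian photometric noise, a toy sinusoidal template, and made-up epochs and errors. The function name `resample_at_cadence` and all numerical values are hypothetical.

```python
import math
import random

def resample_at_cadence(model, period, epochs, errors, rng):
    """Evaluate a periodic model at the given epochs and add Gaussian photometric noise."""
    fluxes = []
    for t, sigma in zip(epochs, errors):
        phase = (t % period) / period          # phase in [0, 1)
        fluxes.append(model(phase) + rng.gauss(0.0, sigma))
    return fluxes

def sinusoid(phase):
    """Toy stand-in for a class template or smoothed light curve (magnitudes)."""
    return 12.0 + 0.5 * math.sin(2.0 * math.pi * phase)

rng = random.Random(1)
epochs = [0.0, 1.1, 2.0, 2.9, 4.1]             # assumed roughly nightly epochs, in days
errors = [0.02] * len(epochs)                  # assumed photometric errors, in magnitudes
noisy = resample_at_cadence(sinusoid, 0.6, epochs, errors, rng)
```

The same function applies whether `model` is a simulated template or a smoothed fit to an observed training curve; only the epochs and errors borrowed from the target survey change.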
In order to test the efficacy and necessity of various aspects of the noisification process, we apply several classifiers to each of the unlabeled data sets. Figure 10 shows the accuracy of these methods treating (a) OGLE and (b) Hipparcos as the unlabeled data. For the left plot with OGLE unlabeled light curves the classifiers are trained on:
ogle cadence naive (black circle): unaltered OGLE light curves
hipparcos cadence noisified (red triangle): Hipparcos light curves truncated to match length of unlabeled set, but not smoothed (cadence is different between training and unlabeled)
hipparcos smoothed to ogle – noisified (green plus): Hipparcos light curves after they have been smoothed, cadenced at OGLE, and truncated to match length of unlabeled curves
ogle cadence noisified (dark blue x): noisified OGLE light curves (cadence already matches unlabeled set so smoothing unnecessary)
hipparcos naive (light blue diamonds): unaltered Hipparcos light curves
Not addressing cadence, flux measurement, and photometric error mismatches by training on full length Hipparcos light curves leads to poor performance (light blue diamonds). Noisifying these Hipparcos sources by truncation improves performance (red triangles). However, we gain significantly by also correcting for cadence differences through smoothing (green plus). It is encouraging to see that by smoothing the Hipparcos training set and noisifying we can do as well as if we had started with OGLE cadence curves (dark blue x and green plus).
The right plot of Figure 10 displays the same information with Hipparcos as the unlabeled cadence. Note that the line markings have been changed to preserve the relationship of training set to unlabeled set. The overall picture is similar to the OGLE data, except that the error rates converge much more quickly. At 60 flux measurements there is little difference among any of the classifiers.
The difference in error rates between classifiers trained on data noisified to the cadence of the unlabeled data and those that are not suggests that, at low numbers of flux measurements, feature distributions are different for the OGLE and Hipparcos cadences. To investigate this, in Figure 11 we plot densities of amplitude for simulated light curves with 10 flux measurements at the OGLE and Hipparcos cadences. To keep things simple we show two class densities – Miras and not Miras. It is clear here that for the OGLE cadence amplitude is not a particularly useful feature for separating Miras from other sources, whereas for the Hipparcos cadence it is. Due to the regular sampling at one to two day intervals, 10 flux measurement OGLE curves have only captured part of a Mira period. Hence the amplitude of the source looks much smaller than it actually is. In contrast, the large gaps between flux measurements in Hipparcos cadences result in us observing a much larger piece of phase space and thus obtaining a better estimate of amplitude.
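The effect of cadence on apparent amplitude can be reproduced with a toy calculation. The sketch below assumes a noiseless sinusoid with a 300-day period as a stand-in for a Mira, and two made-up sets of 10 epochs: one nightly, one in bursts separated by long gaps. None of these values are taken from the actual survey cadences.

```python
import math

def observed_range(epochs, period, semi_amplitude=1.0):
    """Peak-to-peak range of a sinusoid sampled only at the given epochs (days)."""
    values = [semi_amplitude * math.sin(2.0 * math.pi * t / period) for t in epochs]
    return max(values) - min(values)

period = 300.0  # assumed Mira-like period in days
ogle_like = [float(day) for day in range(10)]  # 10 nightly epochs
hipparcos_like = [0.0, 0.1, 0.2, 40.0, 40.1, 95.0, 95.1, 170.0, 170.1, 230.0]

ogle_range = observed_range(ogle_like, period)            # small fraction of the true range
hipparcos_range = observed_range(hipparcos_like, period)  # close to the true range of 2.0
```

With these assumed epochs the nightly sampling recovers roughly a tenth of the true peak-to-peak range, while the bursty sampling recovers most of it, mirroring the behavior of the amplitude densities described above.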
5.3 Using Hipparcos to Classify OGLE
Now that we have studied noisification in some controlled settings, we test the method on the original problem proposed in Section 1. Recall that we are classifying Miras, RR Lyrae AB, and Classical Cepheids Fundamental Mode using light curves from Hipparcos as the training data and V-band OGLE as the unlabeled data. In Section 1 we saw that training a classifier on the Hipparcos curves and applying it directly to OGLE resulted in poor performance due, in part, to differences in number of flux measurements, cadence, and photometric error between the two data sets.
Table 2 highlights some important differences between the Hipparcos and V-band OGLE sources. See Udalski et al. (2008); Soszynski et al. (2008, 2009a, 2009b) for descriptions of OGLE III photometry and these three source classes. These OGLE III sources are available at http://ogledb.astrouw.edu.pl/~ogle/CVS/. We use all OGLE III sources from the LMC belonging to the three classes of interest.
There are systematically fewer flux measurements in OGLE than in Hipparcos. Unlike the previous example with I-band OGLE, the V-band OGLE curves here are fairly sparse: 25% of the flux measurements are spaced 16 or more days apart. Perhaps the most striking difference between surveys is in the class proportions. RR Lyrae AB make up 26.6% of light curves in Hipparcos and 84.1% of light curves in OGLE. This is most likely due to Hipparcos magnitude limits, which result in undersampling the intrinsically faint RR Lyrae AB relative to Miras and Classical Cepheids.
|Survey||# Sources||Class Probs.||F / LC||Time Diff||Error|
Class probs. is the class proportion of (Classical Cepheids F, RR Lyrae AB, Mira).
F / LC is the first and third quartiles of flux measurements per light curve for training.
Time Diff is the first and third quartiles of time differences in days between successive flux measurements.
Error is the first and third quartiles of estimated photometric error in magnitude for all flux measurements.
Light curves and classifications from Richards et al. (2011).
To classify the OGLE sources, we noisify all the Hipparcos light curves to OGLE cadence at 10 through 100 flux measurements in multiples of 10. We then construct classifiers on each of these sets, resulting in 10 noisified classifiers. Each OGLE light curve is classified using the classifier with the closest number of flux measurements. So for an unlabeled OGLE light curve with 27 flux measurements, we classify it using the noisified classifier constructed on the 30-flux measurement training set.
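The rule for choosing which noisified classifier to apply can be written as a one-line lookup. In this sketch the function name is hypothetical, and the tie-breaking rule (a curve with, say, 25 measurements goes to the smaller grid value) is an assumption, since the text does not specify one.

```python
def nearest_grid_size(n_flux, grid=range(10, 101, 10)):
    """Return the grid value closest to n_flux; ties go to the smaller value (assumption)."""
    return min(grid, key=lambda g: (abs(g - n_flux), g))

chosen = nearest_grid_size(27)  # the 30-flux-measurement classifier, as in the text
```

Light curves with fewer than 10 or more than 100 flux measurements fall back to the classifiers at the ends of the grid under this rule.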
Table 3 displays a confusion matrix for the classifier constructed on the unmodified Hipparcos light curves when it is applied to the OGLE light curves. Table 4 shows the error rate using the noisification procedure. The overall error rate drops from 27% to 7% as a result of using noisification. This is driven by the drop in error rate for RR Lyrae AB (31% error using unmodified classifier, 7% after noisification) and the prevalence of RR Lyrae AB in OGLE. The error rate for Classical Cepheids F actually increases from 2% to 10% while for Miras it is roughly the same.
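The role of class prevalence can be made explicit with a rough calculation. Writing the overall error as a prevalence-weighted sum of per-class error rates (the symbols $\pi_c$ and $e_c$ are introduced here for illustration only, and each $e_c$ is assumed to be the fraction of true class-$c$ sources that are misclassified), the RR Lyrae AB term can be evaluated from the rounded figures quoted in the text:

```latex
E = \sum_{c} \pi_c \, e_c ,
\qquad
\pi_{\mathrm{RR}} \, e_{\mathrm{RR}}^{\mathrm{unmodified}} \approx 0.841 \times 0.31 \approx 0.26 ,
\qquad
\pi_{\mathrm{RR}} \, e_{\mathrm{RR}}^{\mathrm{noisified}} \approx 0.841 \times 0.07 \approx 0.06 .
```

Because the quoted rates are rounded, these products are approximate. They nonetheless indicate that RR Lyrae AB account for nearly all of the overall error both before and after noisification, so the increase in Cepheid error has little effect on the total.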
Part of the reason why noisification increases the error rate for Classical Cepheids appears to be differences in the distribution of frequency caused by Hipparcos magnitude limits. Figure 12 displays frequency density in Hipparcos, OGLE light curves with 35-45 flux measurements, and Hipparcos noisified to 40 flux measurements for Cepheids (12a), RR Lyrae (12b), and Miras (12c). Noisification has barely changed the density for the Cepheid sources (the blue and orange densities almost exactly overlap). Visual inspection of OGLE periods revealed that they were correct. This suggests that the frequency distribution for Cepheids is fundamentally different in Hipparcos and OGLE. This is likely due to magnitude limits in Hipparcos and OGLE.
Lower frequency Cepheids are intrinsically brighter, so we can see them from farther away. These low frequency Cepheids are over-represented in Hipparcos. In contrast, OGLE is closer to a random sample of Cepheids in the Large Magellanic Cloud (LMC): if a Cepheid is there, OGLE sees it. Since this survey difference is not caused by number of flux measurements, cadence, or photometric error, the current implementation of noisification does not correct for it. Notice that in Figure 12b, the noisification procedure has shifted the distribution of RR Lyrae frequencies in Hipparcos to more closely match that in OGLE. Here much of the density mismatch was due to errors in frequency estimation caused by having few flux measurements. Noisification helps us overcome this survey difference.
Noisification is successful at matching other feature distributions. Figure 13 displays the densities of P2PS for each source class in (13a) Hipparcos, (13b) OGLE, and (13c) noisified Hipparcos. There is a great deal of difference between the Hipparcos and OGLE densities. However, the noisified Hipparcos source densities appear to closely match the densities of OGLE.
We have highlighted how differences between training and unlabeled light curves induce different feature distributions. We then showed how these shifts in distribution can cause high error rates, even on problems where the unlabeled data is well separated in feature space. Common methods to evaluate classifier performance, such as cross–validation, do not detect these shifts in distribution and may give a false impression of classifier quality as they only reveal how well a classifier performs on data that is similar to the training set.
We developed a methodology, noisification, for overcoming differences between training and unlabeled data sets. As implemented in this study, noisification addresses differences due to the number of flux measurements, cadence, and photometric error. On several simulated and real–world examples, noisification greatly improved classifier performance. In the Hipparcos training–OGLE unlabeled example, noisification reduced the misclassification rate by 20 percentage points, from 27% to 7%.
We hope these findings motivate practitioners to carefully consider differences between training and unlabeled data sets. In general, we recommend using training sets that match as closely as possible the unlabeled set of interest rather than training sets that are high signal–to–noise. As demonstrated in many examples, high signal–to–noise light curves often work poorly as training sets when the unlabeled light curves are of low quality. This is due to the classifier discovering class boundaries in feature space as they exist in the training set, not as they exist in the unlabeled set.
This study has made us skeptical of attempts to identify a single set of features that is generically sufficient for separating a set of classes of periodic variables. Useful features change depending on how sources are observed. The Random Forest importance plots (Figure 8) and the P2PS simulation (Subsection 3.2) illustrate this. When implementing noisification, we recommend starting with large feature sets, even including features that are not useful for separating classes in the training data. These features may become useful for separating classes once the light curves have been noisified.
While we have studied noisification in the context of classification, it could also be applied to other problems. For example, novelty detection and unsupervised learning (clustering) methods are likely to work poorly when training and unlabeled data sets have systematic differences. Noisifying light curves offers a way to overcome these differences.
Noisification may also be extended from what is implemented here to account for differences not related to number of flux measurements, cadence, and level of photometric error. For example, known censoring thresholds in the unlabeled data could be incorporated into the training data by removing, or marking as censored, flux measurements which would not have been observed in the unlabeled data set due to magnitude limits.
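As a rough illustration of the censoring idea, the sketch below drops training measurements that would be too faint to detect in the unlabeled survey. It is not part of the method as implemented in this study: the function name and the `faint_limit` parameter are hypothetical, and only the simpler "remove" option (not "mark as censored") is shown.

```python
def apply_magnitude_limit(times, mags, faint_limit):
    """Drop measurements fainter than the limit (larger magnitude means fainter)."""
    kept = [(t, m) for t, m in zip(times, mags) if m <= faint_limit]
    return [t for t, _ in kept], [m for _, m in kept]

times = [0.0, 1.0, 2.0, 3.0]
mags = [10.0, 13.5, 11.2, 12.0]
kept_times, kept_mags = apply_magnitude_limit(times, mags, faint_limit=12.0)
```

Features would then be recomputed on the censored training curves, in the same way as after any other noisification step.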
In the future, we will apply noisification to light curves from more surveys using larger, highly multi-class training sets. In parallel, we are developing a theoretical understanding of how noisification works and the problems for which it is most suitable. Of particular interest is how noisification performs when there are survey differences not addressed by noisification. This was the case with the Cepheid frequencies in the three–class Hipparcos–OGLE problem.
Upcoming surveys pose a challenge based on their size and their novelty. Not only will Gaia and LSST detect orders of magnitude more periodic variables than previous surveys, but the sources they collect will have different properties than any training data we currently have. Noisification offers the potential to bridge some of these differences, enabling us to optimize scientific discovery.
The authors would like to thank Laurent Eyer and Dan Starr for helpful comments and criticisms. The authors would like to acknowledge the generous support of a Cyber-Enabled Discovery and Innovation (CDI) grant (No. 0941742) from the National Science Foundation. This work was performed in the CDI-sponsored Center for Time Domain Informatics (http://cftd.info).
Appendix A Description of Features
We used 62 features in this work. Fifty of these features came from Tables 4 and 5 in Richards et al. (2011). We did not use the features pair_slope_trend, max_slope, or linear_trend from these tables. We used 12 additional features. Five are from Dubath et al. (2011): scatter_res_raw, medperc90_2p_p, p2p_scatter_2praw, P2PS (named P2p_scatter in Dubath et al. (2011)), and p2p_scatter_pfold_over_mad. The remaining seven are:
fold2P_slope_10percentile: 10th percentile of slopes between adjacent flux measurements after the light curve has been folded on twice the estimated period
fold2P_slope_90percentile: 90th percentile of slopes between adjacent flux measurements after the light curve has been folded on twice the estimated period
freq_frequency_ratio_21: ratio of the second to first frequency determined by Lomb-Scargle (from Table 4 in Richards et al. (2011))
freq_frequency_ratio_31: ratio of the third to first frequency determined by Lomb-Scargle (from Table 4 in Richards et al. (2011))
freq_amplitude_ratio_21: ratio of amplitude for frequency 2 to amplitude for frequency 1 (from Table 4 in Richards et al. (2011))
freq_amplitude_ratio_31: ratio of amplitude for frequency 3 to amplitude for frequency 1 (from Table 4 in Richards et al. (2011))
p2p_ssqr_diff_over_var: the sum of squared differences in successive flux measurements divided by the variance of the flux measurements (from Kim et al. (2011))
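The last of these features is simple enough to sketch directly from its definition. The version below is illustrative only: it assumes the flux measurements are already in time order and uses the population variance, since the definition above does not say whether the sample or population variance is intended.

```python
from statistics import pvariance

def p2p_ssqr_diff_over_var(fluxes):
    """Sum of squared successive differences divided by the variance of the fluxes."""
    if len(fluxes) < 2:
        raise ValueError("need at least two flux measurements")
    variance = pvariance(fluxes)  # population variance (assumption)
    if variance == 0:
        raise ValueError("feature is undefined for a constant light curve")
    ssqr = sum((b - a) ** 2 for a, b in zip(fluxes, fluxes[1:]))
    return ssqr / variance

jumpy = p2p_ssqr_diff_over_var([1.0, 2.0, 1.0, 2.0])   # large: values alternate
smooth = p2p_ssqr_diff_over_var([1.0, 2.0, 3.0, 4.0])  # smaller: values trend smoothly
```

Light curves whose successive measurements are strongly correlated give small values, while curves dominated by point-to-point scatter give large ones.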
References
- Amit & Geman (1997) Amit, Y., & Geman, D. 1997, Neural Computation, 9, 1545
- Auvergne et al. (2009) Auvergne, M., et al. 2009, Astronomy and Astrophysics, 506, 411
- Blomme et al. (2011) Blomme, J., et al. 2011, arXiv:1101.5038v1
- Bloom & Richards (2011) Bloom, J., & Richards, J. 2011, arXiv preprint arXiv:1104.3142
- Borne et al. (2007) Borne, K., Strauss, M., & Tyson, J. 2007, Bulletin of the American Astronomical Society, 39, 137
- Breiman (2001) Breiman, L. 2001, Machine Learning, 45, 5
- Breiman et al. (1984) Breiman, L., Friedman, J., Olshen, R., & Stone, C. 1984, Classification and Regression Trees (Wadsworth)
- Debosscher et al. (2007) Debosscher, J., Sarro, L., Aerts, C., Cuypers, J., Vandenbussche, B., Garrido, R., & Solano, E. 2007, Astronomy and Astrophysics, 475, 1159
- Debosscher et al. (2009) Debosscher, J., et al. 2009, Astronomy and Astrophysics, 506, 519
- Dietterich (2000) Dietterich, T. 2000, Machine Learning, 40, 139
- Dubath et al. (2011) Dubath, P., et al. 2011, Monthly Notices of the Royal Astronomical Society, 414, 2602
- Eyer & Blake (2005) Eyer, L., & Blake, C. 2005, Monthly Notices of the Royal Astronomical Society, 358, 30
- Eyer & Cuypers (2000) Eyer, L., & Cuypers, J. 2000, in IAU Colloq. 176: The Impact of Large-Scale Surveys on Pulsating Star Research, Vol. 203, 71–72
- Eyer & et al. (2010) Eyer, L., & et al. 2010, arXiv:1011.4527v1
- Eyer & Mowlavi (2008) Eyer, L., & Mowlavi, N. 2008, in Journal of Physics: Conference Series, Vol. 118, IOP Publishing, 012010
- Eyer et al. (2008) Eyer, L., et al. 2008, in American Institute of Physics Conference Series, Vol. 1082, American Institute of Physics Conference Series, ed. C. A. L. Bailer-Jones, 257–262
- Friedman (1984) Friedman, J. 1984, A variable span smoother, Technical report, Stanford University, Stanford, CA
- Hall (2008) Hall, P. 2008, COMPSTAT 2008, 3
- Hastie et al. (2009) Hastie, T., Tibshirani, R., & Friedman, J. 2009, The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Springer Verlag)
- Kim et al. (2011) Kim, D., Protopapas, P., Byun, Y., Alcock, C., & Khardon, R. 2011, arXiv preprint arXiv:1101.3316
- Long et al. (2011) Long, J., Bloom, J., El Karoui, N., Rice, J., & Richards, J. 2011, GREAT Conference Proceedings
- Nigam & Ghani (2000) Nigam, K., & Ghani, R. 2000, in Proceedings of the ninth international conference on Information and knowledge management, CIKM ’00 (New York, NY, USA: ACM), 86–93
- O’Donovan et al. (2009) O’Donovan, F. T., et al. 2009, in NASA/IPAC/NExScI Star and Exoplanet Database, TrES Lyr1 Catalog, 6
- Perryman et al. (1997) Perryman, M., et al. 1997, Astronomy and Astrophysics, 323, L49
- Pojmanski et al. (2005) Pojmanski, G., Pilecki, B., & Szczygiel, D. 2005, Acta Astronomica, 55, 275
- Richards et al. (2011) Richards, J., et al. 2011, The Astrophysical Journal, 733, 10
- Richards et al. (2012) Richards, J. W., et al. 2012, The Astrophysical Journal, 744, 192
- Sarro & Debosscher J. (2008) Sarro, L., & Debosscher J., A. C. 2008, arXiv:0806.3386v1
- Soszynski et al. (2008) Soszynski, I., et al. 2008, Acta Astronomica, 58, 163
- Soszynski et al. (2009a) —. 2009a, Acta Astronomica, 59, 1
- Soszynski et al. (2009b) —. 2009b, Acta Astronomica, 59, 239
- Soszyński et al. (2011) Soszyński, I., et al. 2011, Acta Astronomica, 61, 1
- Starr et al. (2010) Starr, D., Bloom, J., Brewer, J., Butler, N., & Klein, C. 2010, in Astronomical Data Analysis Software and Systems XIX, Vol. 434, 406
- Udalski et al. (2008) Udalski, A., Szymanski, M., Soszynski, I., & Poleski, R. 2008, Acta Astronomica, 58, 69
- Walkowicz et al. (2009) Walkowicz, L., et al. 2009, arXiv preprint arXiv:0902.3981