Deep-Learnt Classification of Light Curves
Astronomical light curves are sparse, gappy, and heteroscedastic. As a result, standard time-series methods regularly used for financial and similar datasets are of little help, and astronomers are usually left to devise their own techniques to classify light curves. A common approach is to derive statistical features from the time series and to use machine learning methods, generally supervised, to separate objects into a few standard classes. In this work, we transform the time series into two-dimensional light-curve representations in order to classify them using modern deep learning techniques. In particular, we show that classifiers based on convolutional neural networks work well for broad characterization and classification. We use labeled datasets of periodic variables from the CRTS survey and show how this opens the door to quick classification of diverse classes, with several exciting possible extensions.
Astronomy has always boasted big datasets. The data holdings are getting even larger due to surveys that observe hundreds of millions of sources hundreds of times. The observations form a time series of flux measurements called a light curve. The staple for discovery has been the flux variations of individual astronomical objects as noted through such light curves - that is where the science is. The large irregular gaps in observing cadence make classification challenging. Traditionally, statistical features have been derived from the light curves in order to do follow-up classification (see, e.g., [1, 2, 3]). The features include standard statistical measures like median, skew, and kurtosis, as well as specialized ones based on domain knowledge, such as the 'fading profile of a single-peaked fast transient'. The standard features carry no special power for classifying a varied set of objects. The designer features are better for specific classes, but carry with them a bias that does not necessarily translate to the classification of a wider set.
In earlier work we introduced a two-dimensional mapping of the light curves based on the changes in magnitude (dm) over the available time-differences (dt). In this work, we mold the mapping into an image format that is suitable as input for convolutional neural networks (CNNs or ConvNets). By bringing to bear the machinery of CNNs we are able to conjure a large number of features unimagined so far. We use labeled sets to train the CNN as a classifier and, following validation, we classify light curves from the Catalina Real-Time Transient Survey (CRTS; [6, 7, 4, 8, 9]).
The three CRTS surveys span 33,000 sq. degrees, encompassing light curves of close to half a billion sources. Of these, the 0.7m CSS telescope yields millions of light curves. The light curves span well over ten years, and are homogeneous in that all are collected in white light, without a filter, and with an asteroid-searching cadence of four images in 30 minutes. As is typical of astronomical objects in wide-area surveys, a vast majority of these sources are non-variable during the survey lifetime and within the typical magnitude error-bars for CSS. The remaining sources - variables - can be broadly classified as periodic and stochastic. The irregularly spaced, sparse light curves mean that often even the periodically variable sources do not seem obviously so. There is a third category, that of transients like supernovae and flaring stars, which exhibit enhanced activity over a short period and an otherwise quiescent and relatively flat (within error-bars) light curve.
Getting a large, uniform, well-labelled dataset is a challenge in itself. In order to keep the problem simple during early experiments, we start with a sample of periodic variables from the CRTS North (CRTS-N) survey. There are 17 classes represented in the sample. Ten of these have fewer than 500 members; we exclude them from our experiments for now and will include them in future studies. The numbers for the remaining seven classes are given in Table I. These classifications have been carried out by humans, mostly based on calibrated light curves, their phased versions after periods were determined, and some auxiliary information on the objects. Only a fraction of these have spectroscopic confirmation of the exact classification. As a result some misclassifications cannot be ruled out, especially between neighboring classes, and especially for objects that are fainter and/or have fewer observations.
III DMDT Mappings
A light curve consists of brightness variations as a function of time. Besides the time (expressed here in days, as MJD) and brightness (expressed here on the traditional inverse logarithmic scale, in mags), we also have information about the error in magnitudes.
For each pair of points in a light curve we determine the difference in magnitude (dm) and the difference in time (dt). This gives us p = n(n-1)/2 points for a light curve of length n (see Fig. 1). These points are then binned for a range of dm and dt values. The resulting binned representation is our mapping from the light curve. The bin boundaries are expressed in mags (for dm) and days (for dt). The bins are spaced in an approximately semi-logarithmic fashion so that frequent small magnitude changes are distributed over many bins, while the infrequent large magnitude changes can be combined together (see Fig. 2). Similarly, the more important rapid changes are well represented, and the slower changes are clubbed together. This is akin to histogram equalization, but not forced to be exact. In the case of dt, the bins take into consideration the CSS cadence of four images in 30 minutes. The intensity of each bin is normalized by p to account for the varying lengths of the original light curves, and stretched to fit between 0 and 255: I_b = ceil(255 n_b / p), where n_b is the number of points in the bin. Thus, a bin that does not include a single point has an intensity of zero, a bin that had at least one point has an intensity of at least 1, and the bin that had the maximum number of points has an intensity of at most 255. 255 is reached only when all points are in one bin - the typical maximum values we have seen in different classes are around 50. Unnormalized values go to several thousand, depending on the length of the light curve. The normalization also ensures that we can use our data to fine-tune other pre-learned networks. For inclusion in the training set, we require that the light curves contain at least 20 points.
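The mapping above can be sketched in a few lines of NumPy. This is a minimal illustration, not the production pipeline: the bin edges below are hypothetical placeholders in the same semi-logarithmic spirit, not the survey's actual boundaries.

```python
import numpy as np

def dmdt_image(t, m, dt_edges, dm_edges, max_val=255):
    """Map a light curve (times t, magnitudes m) to a dmdt-image.

    For every pair of observations with j > i we take dt = t_j - t_i and
    dm = m_j - m_i, bin the p = n(n-1)/2 pairs on the given edges, and
    rescale so an empty bin is 0, any occupied bin is >= 1, and 255 is
    reached only if all pairs fall in a single bin.
    """
    t = np.asarray(t, dtype=float)
    m = np.asarray(m, dtype=float)
    n = len(t)
    i, j = np.triu_indices(n, k=1)          # all pairs, j > i
    dt = t[j] - t[i]
    dm = m[j] - m[i]
    hist, _, _ = np.histogram2d(dm, dt, bins=[dm_edges, dt_edges])
    p = n * (n - 1) // 2                    # number of pairs
    img = np.ceil(max_val * hist / p)       # 0 for empty, >=1 if occupied
    return img.astype(np.uint8)

# Illustrative (not the paper's) semi-logarithmic bin edges; the finest
# dt bins echo the ~10-minute spacing of the 30-minute CSS cadence.
dm_edges = [-8, -5, -3, -2, -1, -0.5, -0.3, -0.2, -0.1, -0.05, -0.02, 0,
            0.02, 0.05, 0.1, 0.2, 0.3, 0.5, 1, 2, 3, 5, 8]
dt_edges = [0, 1/145, 2/145, 3/145, 0.1, 0.3, 1, 3, 10, 30, 100, 300,
            1000, 3000]
```

In practice one would also enforce the 20-point minimum before calling this on survey light curves.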
The representations - called dmdt-images hereafter - reflect the underlying structure from the variability of the source. The dmdt-images are translation independent as they consider only the differences in time. A light curve flipped in magnitude will provide a dmdt-image mirrored across the dm = 0 line. Thus the structure above and below the dm = 0 line is discriminating for sources. In a sense the dmdt-images are like a structure function without consideration of the error-bars. The error-bars tend to be heteroscedastic, but are broadly a function of magnitude and, unless an individual source varies a lot, tend to be similar. In particular, the way the dmdt-technique works, the error-bars for neighboring points in the mapped version are similar. Taking the error-bars into consideration would be equivalent to smoothing along the dm-axis. For now we ignore the error-bars (but see Section V).
Since the dmdt-image is a straightforward mapping of a light curve, and the CNN can bring out features hidden therein, the proposed method opens up at least two important avenues. (1) Implications for real-time classification of variables and transients: a sparse light curve of a recently discovered object will correspond to a dmdt-image that is just a sparse version of the dmdt-image that would be formed from the corresponding non-sparse light curve. Since some of the unique features are accentuated at discovery, the discriminative power would already be encapsulated in the dmdt-image. (2) Transfer learning: training based on dmdt-images from one survey can be used to classify dmdt-images from another survey. We demonstrate this on a set of variables from CRTS-S, a Southern set corresponding to the main training set used here [10], and corresponding PTF data [11].
There are three ways in which we experiment with the setup to improve performance: (1) changing the bins for optimality - this can be done based on the survey cadence, or based on the classes being considered; (2) changing the layers of the CNN depending on the number of classes, the size of the training sample, and possible ways in which imbalance between the classes is remedied (or not); and (3) modifying the light curve to dmdt-image mapping to bring out features in the classes being separated. We look at these in the next two sections.
IV Convolutional Networks
Recently, so-called deep learning techniques have become very popular in machine learning and various application domains. This is, in particular, the case for convolutional neural networks (CNNs), which are a special type of artificial neural network (ANN) [13, 14]. In astronomy, CNNs have been used for a few problems where structure in images is obvious (e.g., [15] uses them for classifying galaxies based on morphology and [16] for detecting supernovae). We provide a brief description of ANNs and CNNs here. More details can be found in [13, 14, 5].
An ANN is based on several layers. The overall input data (e.g., images) are provided to the input layer. The output of one layer serves as input for the next. The last layer forms the output of the network. For instance, in a multiclass setting, the output layer would contain one node per class. The layers between the input and the output layer are the hidden layers. For a standard ANN, the layers are fully connected, with each node of one layer connected to all nodes of the next.
A CNN is an extension of classical ANNs and consists of different types of layers. As before, we have an input and an output layer. Further, the last layers before the output layer are often standard fully connected layers, called dense layers. The layers before these dense layers, however, are usually very different from those of a standard ANN.
The most prominent types of layers are (a) convolutional layers, (b) pooling layers, and (c) dropout layers: A convolutional layer usually consists of a small set of filters (e.g., 3x3, 5x5, …), called kernels, and convolves every input image with each of those kernels. Typically several such kernels are used as filters in a given convolutional layer to match desired shapes in the input images. This gives rise, for each kernel, to a new representation of the input image. These representations are called feature maps. The convolutional layers are used with rectifiers to introduce non-linearity. A pooling layer decreases the number of parameters of the network by aggregating spatially-adjacent pixel values. One prominent type of pooling layers is max-pooling that replaces patches of an input feature map by the maximum value within each patch. While this reduces the sizes of the feature maps, it also makes the network more robust to small changes in the input data. Finally, a dropout layer randomly omits hidden units by setting their values to zero. Hence, the network cannot fully rely on them. This helps prevent overfitting. One usually resorts to one or more final dense layers based on a large number of nodes connected to the previous layer. Such layers are unlike the convolutional, pooling, and dropout layers, but the same as for traditional neural networks. It is the depth provided by these multiple layers, and the extensive mapping afforded by them that has given rise to the name deep learning.
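The three layer types just described can be illustrated with minimal NumPy implementations. This is a sketch of the operations themselves, not of any particular framework's API; real CNN layers add channels, strides, padding, and learned kernels.

```python
import numpy as np

def conv2d(img, kernel):
    """'Valid' 2-D convolution (cross-correlation, as commonly implemented):
    slide the kernel over the image and take a weighted sum at each
    position, producing one feature map per kernel."""
    kh, kw = kernel.shape
    H, W = img.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(img[r:r+kh, c:c+kw] * kernel)
    return out

def relu(x):
    """The usual rectifier non-linearity applied after a convolution."""
    return np.maximum(x, 0)

def max_pool(fmap, size=2):
    """Max-pooling: replace each size x size patch by its maximum,
    shrinking the feature map and adding robustness to small shifts."""
    H, W = fmap.shape
    H2, W2 = H // size, W // size
    return fmap[:H2*size, :W2*size].reshape(H2, size, W2, size).max(axis=(1, 3))

def dropout(x, p, rng):
    """Randomly zero units with probability p (training-time only),
    so the network cannot fully rely on any single unit."""
    mask = rng.random(x.shape) >= p
    return x * mask
```

A dense layer is then just a matrix multiplication plus bias over the flattened feature maps, as in a traditional ANN.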
For approaching the classification task at hand, we considered a multi-layer CNN instance, called the deep network. Selecting good layers and parameters is, as yet, more an art than a science. An interesting research direction is the use of Bayesian optimization for choosing the best network hyperparameters. As a first step towards this, we considered a far simpler shallow network and were pleasantly surprised by its equally good performance on the base all-class classifications compared to the performance of the deep network. The code detailing the structure of both networks is provided in Listing 1 and Listing 2, respectively. The listings include the type and size of layers, the number and size of kernels, and the dropout fractions. We have used the Theano framework (http://deeplearning.net/software/theano/) with Lasagne (https://lasagne.readthedocs.io/en/latest/) for our runs.
V Experimental Evaluation
Light curves were converted to dmdt-images, normalized as described in Sec. III, and used as inputs to the network. We used 500 training epochs for the deep network and 300 for the shallow network, with a held-out subset of samples reserved for testing in both configurations. We used a learning rate of 0.0002 and the Adaptive Momentum (ADAM) algorithm to train all our models. All our models have been trained on a single NVIDIA GeForce GTX 560 graphics processing unit (GPU). Training on the CRTS-N training set takes 42.3 seconds per epoch for the deep network, and less for the shallow network. Training RFs is one to two orders of magnitude faster - after computing the features; depending on how complex the features are, that computing time can vary a lot. We used Red Hat Linux release 6.6 and Python 2.7.2.
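The ADAM update used for training can be sketched as follows. The quadratic toy objective is our own illustration; the update rule and the learning rate of 0.0002 follow the text.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=2e-4, b1=0.9, b2=0.999, eps=1e-8):
    """One ADAM update: exponential moving averages of the gradient (m)
    and its square (v), bias-corrected, then a per-parameter scaled step."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad**2
    m_hat = m / (1 - b1**t)                 # bias correction
    v_hat = v / (1 - b2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# toy run: minimize f(w) = ||w||^2 (gradient 2w) at the paper's rate
w = np.array([1.0, -2.0])
m = v = np.zeros_like(w)
for t in range(1, 2001):
    w, m, v = adam_step(w, 2 * w, m, v, t)
```

In the actual runs this update is applied to the CNN weights using minibatch gradients computed by Theano.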
V-A Binary classification
We trained the network with pairs of classes as well as with all seven classes together. When used in binary mode we noticed poor performance whenever class 1 was involved (see Col. 2 of Table II). This is not unexpected, since class 1 contains two-thirds of all objects and, when paired with an individual class, easily overwhelms it. Class 13 reached an accuracy of 89% - the highest - against class 1. These are the long-period variables (LPVs), and the long-term structure is likely getting picked up. In general the separation of all other classes from class 13 was similarly far better than other binary comparisons. Except in a few cases, the binary dmdt-classifier did comparably to or better than the corresponding feature-based random forest (RF) classifier (see Table II and Sec. VI-A).
Binning variations and average accuracies (Table III): original binning (as outlined before), 83%; the three alternative binnings tried yielded average accuracies of 84.5%, 84.3%, and 82.7%.
V-B Multi-class classification
When used in the 7-class mode, the dmdt-images produced an average accuracy of 83%. This is remarkable in itself given the sparse nature of the data and the absence of any fine-tuning of the parameters. The performance is comparable with the feature-based RF classifier (see Table II, last row, and Figs. 6 and 7). Class 1 still dominates to an extent, but not as blatantly as in the binary cases. The setup still leaves a lot to be desired if one wishes to use it in real time on light curves containing far fewer points, and binary classification may be somewhat preferable there.
V-C Varying dmdt binning
The input image size of 23x24 is small for CNNs and the training is relatively quick. As a result one can consider finer binning in both dm and dt. On the other hand, the discriminating structure likely resides in a few smaller areas, and one could restrict the finer binning to those, using coarser bins elsewhere. While it is desirable to determine the binning for a given survey and the classes under consideration, such a systematic approach will need more extensive work. Here we report some quick experiments; the variations we tried and the corresponding results are summarized in Table III.
We notice that finer bins improve performance a little, and fewer bins do not seem to adversely affect it. Each experiment is time consuming, and we continue our efforts to fine-tune the parameters to get better results. As stated earlier, the current results are already usable. Exploring these variations further is an obvious avenue of advance.
V-D Background Subtraction
How can we improve the classification further? Individuals look at differences as well as similarities in order to classify objects. Consider spectra: given otherwise similar spectra, one asks if a particular line is too broad, or narrow, or missing, or extra-intense. We do the same when looking at light curves. With dmdt-images we have squared the number of points and distributed them over a rectangle. If we could remove an underlying, common background, the class membership might become more obvious. It is a non-trivial task at best. We consider a few possibilities for determining such a background.
A dmdt-image can be said to be made of three components: (1) a static background that results primarily from the cadence of the survey - no matter what kind of astronomical object there is, one will always find pairs of observations with large and small dt, with peaks at specific dt values depending on the survey cadence; (2) a more specific background related to class membership - the kind one would get from a densely sampled prototype of the class; and (3) something like an individual signature for each object.
We formulate the foreground-background separation problem by drawing parallels to video surveillance tasks. We interpret each dmdt-image as a video frame, vectorize it, and then stack up all the dmdt-images columnwise to create a matrix M. We then decompose this matrix into a sum of a low-rank matrix, L, and a sparse matrix, S. Each column of the low-rank matrix corresponds to the background of the corresponding dmdt-image, and each column of the sparse matrix corresponds to its foreground. L and S can be obtained by solving M = L + S, where L is forced to be low-rank and S is forced to be sparse.
We use the Robust PCA algorithm [17] for decomposing the matrix in this way. We use this particular method as it is one of the earliest provably correct non-convex algorithms, in contrast to other works that rely upon convex relaxations of the actual objective.
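A minimal sketch of the alternating low-rank/sparse decomposition follows - truncated SVD for L, hard thresholding for S, in the spirit of non-convex robust PCA. The fixed threshold and rank below are illustrative simplifications of the actual algorithm, which adapts both across iterations.

```python
import numpy as np

def rpca(M, rank=1, sparsity_thresh=0.5, n_iter=30):
    """Alternating-projection sketch of robust PCA: repeat
    L <- best rank-r approximation of (M - S)   (truncated SVD),
    S <- hard-threshold of (M - L)              (keep only large entries).
    Returns the low-rank background L and the sparse foreground S.
    """
    S = np.zeros_like(M)
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(M - S, full_matrices=False)
        L = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        R = M - L
        S = np.where(np.abs(R) > sparsity_thresh, R, 0.0)
    return L, S

# synthetic check: a rank-1 "background" plus one sparse "foreground" spike
L0 = np.outer(np.ones(5), np.arange(1.0, 7.0))
S0 = np.zeros((5, 6)); S0[2, 3] = 10.0
L_hat, S_hat = rpca(L0 + S0, rank=1, sparsity_thresh=1.0)
```

In our setting each column of M is a vectorized dmdt-image, so each column of L_hat plays the role of that image's background.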
We determine backgrounds to subtract before training in a few different ways. (1) For individual class backgrounds we consider dmdt-images of just that class to form the matrix M. Fig. 4 shows the class backgrounds. The class backgrounds are not like the median images of stacked class images (see Fig. 5). The difference partly springs from the non-uniform time-spans of individual light curves, as well as from differences in maximum brightness variations over different time-scales. We use the respective maximum background for each class. (2) For a pseudo-cadence background we consider all our objects together. We call this the pseudo-cadence background because all our objects in the current set are periodic variables. In order to not overwhelm this pseudo-background with a single class, we take 500 samples of each type (or all, if the training sample has fewer than 500). (3) As another possibility we ignored the class imbalance, took all training samples, and used the resulting background for subtraction from all training and testing samples. (4) For a true background we would need to consider dmdt-images for light curves of randomly selected objects (or perhaps a large number of standard stars - stars known not to vary). Since a vast majority of these objects will be constant within error-bars over the entire time-span, the corresponding dmdt-images will consist of a thin line along the midrib. After the backgrounds are subtracted, the CNN is trained on the foreground images.
Subtracting the class background provides far better results, as expected (see Table IV), since we use class information to subtract a specific background even during testing; in the real world we would not be privy to this information. It does, however, show that removing an appropriate background accentuates the differing features between classes. On the other hand, the removal of the pseudo-cadence background is somewhat worse than not removing any background. We mimicked cadence-background removal by blanking the middle 2, 4, or 6 rows of the dmdt-images, but that did not provide better results than not removing the background either. Clearly, better modeling of the background is required - a more time-consuming job that we will be taking up in the near future.
V-E Transfer Learning
One of the real powers of the dmdt-technique is its cross-survey applicability. We used models trained with the CRTS-N dmdt-images and tested them on CRTS-S dmdt-images with the same classes but no overlapping objects, and on PTF dmdt-images with a subset of the same objects as in the CRTS-N sample. CRTS-S uses the same asteroid-finding cadence as CRTS-N and also has an open filter. PTF used a more mixed cadence with a greater emphasis on looking for explosive events, including repeat cadences of 1, 3, and 5 nights. We used PTF data taken with a single filter. The results using both the shallow and deep CNNs are given in Table V. The numbers are not as good as with CRTS-N, but that is not unexpected. In fact, for many classes, especially for CRTS-S, they are better than one would naively expect. In the case of PTF, the survey cadence is very different, in addition to the aperture and wavelength range, and the results are somewhat worse. But the very fact that they are still usable, and definitely a good starting point, indicates the merit of the technique. With proper survey-based background subtraction the results should improve further. The implications for domain adaptation are obvious, especially with applicability to forthcoming surveys like ZTF and LSST.
We have shown how to transform light curves into simple dmdt-images for use with canned as well as simpler CNNs for out-of-the-box classification of objects, with performance comparable to random forests, and without having to resort to designing or extracting features, or other necessary evils like dimensionality reduction. The internal features the CNN uses need to be explored further using tools like deconvolutional networks; that will make the results interpretable and provide insights. We have also shown various paths to take in order to improve the results further, e.g., background subtraction and varying the bins. We have further demonstrated the application of the technique to transfer learning, and thereby to classifying objects from a completely different survey.
VI-A Comparison with RF
Random forests (RF) tend to be very versatile and difficult to beat in performance. Hence we compare the performance of our technique with that of random forests. We use the features given in Table VI for our RF setup. The output is shown in Table II. The features we have used are generic; designer features would provide better recall and precision for select classes.
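The feature-based RF baseline can be sketched as follows. The features below are generic stand-ins of the kind listed in Table VI (not the exact list), the toy light curves are fabricated for illustration, and scikit-learn is assumed for the classifier.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def lc_features(mags):
    """A few generic light-curve features: median, scatter, amplitude,
    and the fraction of points deviating beyond one standard deviation."""
    m = np.asarray(mags, dtype=float)
    return [np.median(m), np.std(m), m.max() - m.min(),
            np.mean(np.abs(m - np.median(m)) > np.std(m))]

rng = np.random.default_rng(42)
t = np.sort(rng.uniform(0, 100, 60))        # irregular observation times
X, y = [], []
for _ in range(100):
    # toy classes: large-amplitude sinusoids vs near-constant sources
    X.append(lc_features(15 + np.sin(t) + rng.normal(0, 0.05, t.size))); y.append(1)
    X.append(lc_features(15 + rng.normal(0, 0.05, t.size))); y.append(0)

rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X[:150], y[:150])
acc = rf.score(X[150:], y[150:])            # held-out accuracy
```

The real comparison of course uses the CRTS-N classes and the Table VI features rather than this separable toy problem.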
We find the performance of the shallow CNN comparable to that of unweighted RF. In a way this is remarkable since it is akin to providing the light curve almost in its raw format and getting a classification. For some classes the recall is low, possibly due to the sparseness of the light curves. RFs have provided better results when used with features based on non-sparse light curves [1, 18]. Combining the features with the CNN to form a deep-and-wide network will likely provide better performance than either.
VI-B Misclassified Sources
The confusion matrices (Figs. 6 and 7) from our various experiments showed that for some classes large fractions of objects were misclassified (see Fig. 8). We investigated the light curves for some of these sources in order to identify the source of the errors. In some cases it was a genuine error (a wrong label), indicating that the network was working well. In some other cases the misclassification was owing to a sparse light curve, indicating that in a handful of cases a small number of features may be tilting the classifications one way or another. In still other cases, the subclasses were just too close for the technique to tell apart from the dmdt-images alone (e.g., RR Lyrae of different types). In the case of the EW and EA classes we suspect that the technique may be teasing apart subclasses (e.g., based on separation) or a geometric dependence. For example, we noticed two distinct kinds of foregrounds (see Fig. 9), and this needs to be investigated further.
VI-C Future Work
We will explore various possibilities related to varying the CNN hyperparameters, improving background subtraction for more reliable classification, expanding to more classes and surveys, and identifying the misclassified sources. We will also experiment with making the technique more useful in real-time cases with far fewer data points. We did a couple of tests using error-bars to augment the smaller classes, but that did not work well; it needs to be explored further for reducing the imbalance between the classes. There is also the possibility of using generative networks to create large simulated samples for different classes, for understanding the features that really separate them. The number of possibilities is large - we invite others to explore them as well.
This work, and CRTS survey, was supported in part by the NSF grants AST-0909182, AST-1313422, AST-1413600, and AST-1518308, and by the Ajax Foundation. KS thanks IIT-Gandhinagar and the Caltech SURF program.
- J. W. Richards, D. L. Starr, N. R. Butler et al., “On Machine-learned Classification of Variable Stars with Sparse and Noisy Time-series Data,” ApJ, vol. 733, p. 10, May 2011.
- C. Donalek, A. Arun Kumar, S. G. Djorgovski et al., “Feature Selection Strategies for Classifying High Dimensional Astronomical Data Sets,” ArXiv e-prints, Oct. 2013.
- M. J. Graham, S. G. Djorgovski, A. J. Drake et al., “A novel variability-based method for quasar selection: evidence for a rest-frame 54 d characteristic time-scale,” MNRAS, vol. 439, pp. 703–718, Mar. 2014.
- A. A. Mahabal, S. G. Djorgovski, A. J. Drake et al., “Discovery, classification, and scientific exploration of transient events from the Catalina Real-time Transient Survey,” Bulletin of the Astronomical Society of India, vol. 39, pp. 387–408, Sep. 2011.
- Y. Lecun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 5 2015.
- S. G. Djorgovski, C. Donalek, A. Mahabal et al., “Towards an Automated Classification of Transient Events in Synoptic Sky Surveys,” ArXiv e-prints, Oct. 2011.
- A. J. Drake, S. G. Djorgovski, A. Mahabal et al., “First Results from the Catalina Real-Time Transient Survey,” ApJ, vol. 696, pp. 870–884, May 2009.
- A. A. Mahabal, C. Donalek, S. G. Djorgovski et al., “Real-Time Classification of Transient Events in Synoptic Sky Surveys,” in New Horizons in Time Domain Astronomy, ser. IAU Symposium, E. Griffin, R. Hanisch, and R. Seaman, Eds., vol. 285, Apr. 2012, pp. 355–357.
- S. G. Djorgovski, M. J. Graham, C. Donalek et al., “Real-Time Data Mining of Massive Data Streams from Synoptic Sky Surveys,” ArXiv e-prints, Jan. 2016.
- A. J. Drake, S. G. Djorgovski, M. Catelan et al., “The Catalina Surveys Southern periodic variable star catalogue,” MNRAS, vol. 469, pp. 3688–3712, Aug. 2017.
- N. M. Law, S. R. Kulkarni, R. G. Dekany et al., “The Palomar Transient Factory: System Overview, Performance, and First Results,” PASP, vol. 121, p. 1395, Dec. 2009.
- A. J. Drake, M. J. Graham, S. G. Djorgovski et al., “The Catalina Surveys Periodic Variable Star Catalog,” ApJS, vol. 213, p. 9, Jul. 2014.
- T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, ser. Springer Series in Statistics. New York, NY, USA: Springer New York Inc., 2001.
- K. P. Murphy, Machine Learning: A Probabilistic Perspective. The MIT Press, 2012.
- S. Dieleman, K. W. Willett, and J. Dambre, “Rotation-invariant convolutional neural networks for galaxy morphology prediction,” MNRAS, vol. 450, pp. 1441–1459, Jun. 2015.
- G. Cabrera-Vives, I. Reyes, F. Förster, P. A. Estévez, and J. Maureira, “Supernovae detection by using convolutional neural networks,” in 2016 International Joint Conference on Neural Networks, IJCNN 2016, Vancouver, BC, Canada, July 24-29, 2016, 2016, pp. 251–258.
- P. Netrapalli, U. N. Niranjan, S. Sanghavi, A. Anandkumar, and P. Jain, “Non-convex robust PCA,” CoRR, vol. abs/1410.7660, 2014.
- P. Dubath, L. Rimoldini, M. Süveges et al., “Random forest automated supervised classification of Hipparcos periodic variable stars,” MNRAS, vol. 414, pp. 2602–2617, Jul. 2011.