M-type stars recognized through hash learning

Recognition of M-type stars in the unclassified spectra of LAMOST DR5 using a hash learning method

Y.-X. Guo, A.-L. Luo, S. Zhang, B. Du, Y.-F. Wang, J.-J. Chen, F. Zuo, X. Kong, Y.-H. Hou
Key Laboratory of Optical Astronomy, National Astronomical Observatories, Chinese Academy of Sciences, Beijing 100012, China
University of Chinese Academy of Sciences, Beijing 100049, China
Department of Physics and Astronomy, University of Delaware, Newark, DE 19716, USA
Nanjing Institute of Astronomical Optics & Technology, National Astronomical Observatories, Chinese Academy of Sciences,
Nanjing 210042, China
E-mail: lal@nao.cas.cn
Accepted 2019 February 05. Received 2019 February 03; in original form 2018 July 08

Our study aims to recognize M-type stars that are classified as “UNKNOWN” owing to poor spectral quality in the Large Sky Area Multi-Object Fibre Spectroscopic Telescope (LAMOST) DR5 V1. A binary nonlinear hashing algorithm based on Multi-Layer Pseudo Inverse Learning (ML-PIL) is proposed to learn spectral features effectively for M-type star detection; it overcomes the poor-fitting problem of template matching, particularly for low S/N spectra. The key steps and the performance of the search scheme are presented. A positive dataset is obtained by clustering the existing M-type spectra to train the ML-PIL networks. With this new method, we find 11,410 M-type spectra among 642,178 “UNKNOWN” spectra and provide a supplemental catalogue. The supplemental objects and the M-type stars released in DR5 V1 together compose a whole M-type sample, which will be released to the public in the official DR5 in June 2019. All the M-type stars in the dataset are classified as giants or dwarfs by two suggested separators: 1) a color-color diagram from 2MASS; 2) the line indices CaOH versus CaH1. The separation is validated with the HRD derived from Gaia DR2. The magnetic activities and kinematics of the M dwarfs are also provided, using the EW of the Hα emission line and the astrometric data from Gaia DR2, respectively.

stars: late-type – stars: low-mass – techniques: spectroscopic

1 Introduction

M-type stars are becoming dominant targets for research on the structural evolution and kinematics of the local Milky Way. M giants and M dwarfs have similar spectral features, both with strong molecular characteristics. M giants are red-giant-branch (RGB) stars with low surface temperature and high luminosity in the late phase of stellar evolution. Their luminous nature allows us to use these stars as good tracers of the outer Galactic halo and distant substructures (Zhong et al., 2015). M dwarfs, low-mass main-sequence stars (Chabrier et al., 2000), are the dominant stellar constituent in the solar neighborhood and probably the Galaxy (Henry et al., 1994, 2006; Bochanski et al., 2010; Salpeter & Hoffman, 1995; Chabrier, 2003). They are very useful sources for probing the lower end of the Hertzsprung-Russell diagram (HRD), even down to the hydrogen-burning limit. Ever larger M dwarf samples enable us to deepen our understanding of their fundamental properties, as has already been achieved for massive stars. For example, several key interrelated astrophysical topics have been explored, including the precise relationship between mass and radius (Feiden & Chaboyer, 2014; Jackson & Jeffries, 2014; Han et al., 2017), the mass-luminosity relation (Henry & McCarthy, 1993; Delfosse et al., 2000; Torres et al., 2010; Benedict et al., 2016), rotation and angular momentum (Stassun et al., 2011; Houdebine et al., 2017), magnetic activity (Reiners, 2012; Feiden & Chaboyer, 2014; Yang et al., 2017), complex atmospheric parameters and dust settling in their atmospheres, and age dispersion within populations (Veyette et al., 2017; Bayo et al., 2017).

The largest spectroscopic databases of M-type stars come from multi-object spectroscopic surveys such as the Sloan Extension for Galactic Understanding and Evolution (SEGUE) (Yanny et al., 2009) and the LAMOST Experiment for Galactic Understanding and Exploration (LEGUE) (Newberg et al., 2012). Besides the formal data releases of the surveys, dedicated M dwarf catalogs have also been presented: an SDSS M dwarf catalogue including more than 70,000 stars (West et al., 2011), and two LAMOST M dwarf catalogues (Yi et al., 2014; Guo et al., 2015). However, because of the intrinsically low brightness of M dwarfs and the large distances of M giants, many low S/N M-type spectra have not been recognized in these surveys.

LAMOST DR5 V1 has released more than 9 million spectra, including about 640,000 “UNKNOWN” spectra (not classified by the LAMOST pipeline (Luo et al., 2015)). Some of these “UNKNOWN” spectra, mostly with low S/N, are valuable for astronomical research. For example, Huo et al. (2017) identified eight quasars from the LAMOST DR3 “UNKNOWN” spectra in the area of the Galactic anti-center. By applying a machine learning method, Li et al. (2018) recognized a total of 149 carbon stars that were misclassified as “UNKNOWN” in LAMOST DR4. Ren et al. (2018) published a catalog of white dwarf-main sequence binaries based on the DR5 V1 dataset, several of which were classified as “UNKNOWN” by the LAMOST pipeline.

The classification method of the LAMOST pipeline is based on template matching, in which each observed spectrum is compared with a set of templates to calculate chi-square values. The template with the smallest chi-square value indicates the class to which the object belongs. Sometimes the best-fitting template for a low S/N spectrum has too low a confidence, so the pipeline declines to classify the spectrum and labels its class as “UNKNOWN”. As an alternative to template matching, query-based machine learning methods are ‘similarity search’ algorithms that retrieve objects with a specific pattern from a database while ignoring irrelevant noise. Approximate Nearest Neighbor Search (ANNS) is a commonly used query method, and hash learning is one of the most widely used ANNS techniques (Wang et al., 2016a). The basic idea of hashing-based search is to learn a mapping from the high-dimensional raw data to compact binary codes (strings of 0s and 1s), and then to retrieve the nearest neighbors of the query pattern using the Hamming distance (the number of positions at which two binary codes differ) in the binary code space. Consequently, searching in the hash code space is extremely efficient in both time and memory.
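As a minimal illustration of the Hamming distance used for retrieval, the following Python sketch ranks a few binary codes by their distance to a query code. The 8-bit codes and the helper name `hamming_distances` are illustrative assumptions, not values or code from this work.

```python
import numpy as np

def hamming_distances(query_code, database_codes):
    """Number of differing bits between one code and each row of a code matrix."""
    query_code = np.asarray(query_code, dtype=np.uint8)
    database_codes = np.asarray(database_codes, dtype=np.uint8)
    return np.count_nonzero(database_codes != query_code, axis=1)

# Toy 8-bit codes (illustrative values only, not derived from real spectra).
database = np.array([
    [0, 1, 1, 0, 1, 0, 0, 1],
    [0, 1, 1, 0, 1, 0, 1, 1],
    [1, 0, 0, 1, 0, 1, 1, 0],
], dtype=np.uint8)
query = np.array([0, 1, 1, 0, 1, 0, 0, 1], dtype=np.uint8)

distances = hamming_distances(query, database)
ranking = np.argsort(distances, kind="stable")  # nearest neighbours first
```

Because the comparison is a bitwise operation on short codes, the cost per database entry is small compared with a chi-square fit over thousands of flux points.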

A schematic diagram of a hash learning search is shown in Fig. 1. Many hash methods, including Locality Sensitive Hashing (LSH) (Andoni & Indyk, 2006), Spectral Hashing (specH) (Weiss et al., 2008), Iterative Quantization (ITQ) (Gong & Lazebnik, 2011), and Spherical Hashing (SpH) (Heo et al., 2012), have been intensively studied and widely used in many different fields, and their advantages and weaknesses have also been investigated in depth (Bondugula, 2013; Wang et al., 2016b). In this paper, we employ Semantic Hashing (SH) (Salakhutdinov & Hinton, 2009) to construct a deep hash learning model that searches for the M-type spectral pattern by learning hidden binary features and reconstructing the input data. However, training such a deep generative model often requires many iterations. It is therefore extremely time-consuming for large amounts of data, and its parameters must be set repeatedly on the basis of experience rather than theory.

Figure 1: Principle schematic diagram of the hashing learning search. Members of two different classes in original space might be similar in the hashing code space through a hash mapping which can be a coding network obtained via deep learning.

The appearance of pseudo inverse learning (PIL) (Guo et al., 2017) shed new light on deep learning techniques, because PIL is a supervised learning algorithm for training a single hidden layer feedforward neural network that does not need to tune the hidden layer parameters once the number of hidden layer nodes is determined. The weight and bias vectors between the input layer and the hidden layer are randomly generated, and they are independent of the training samples and the specific application (Pal et al., 2015). In this study we build a multilayer PIL (ML-PIL) to carry out the hash learning process, so as to search for M-type stars in the “UNKNOWN” spectra of LAMOST DR5 V1.

Several methods exist to separate giants from dwarfs among M-type stars, based on colors, spectral indices, proper motions, and so on. Bessell & Brett (1988) proposed a color discrimination method. In their study, M giants and M dwarfs are distributed around different loci in a color-color diagram, mainly because of differences in the opacity of the molecular bands of H2O (Bessell et al., 1998). Because M giants have relatively larger distances and smaller proper motions, a reduced proper motion method was used to separate M giants from M dwarfs (Lépine & Gaidos, 2011). By comparing the spectra of M dwarfs and M giants, several gravity-sensitive molecular and atomic spectral indices were selected to determine the luminosity class (Mann et al., 2012). Recently, a new photometric method combining 2MASS and WISE photometry was used to recognize M giant spectra in the LAMOST dataset (Zhong et al., 2015). The strength ratio of the TiO band to the CaH band varies with surface gravity. Reid et al. (1995) defined the TiO5, CaH2, and CaH3 spectral indices, and Zhong et al. (2015) used these indices to distinguish M giants from M dwarfs. In addition, other diagnostics, such as those based on the Mg2 index, have been used for the separation (Covey et al., 2008). We compare different giant/dwarf separation schemes and suggest two additional separation indicators that classify more reliably.

The paper is organized as follows. Section 2 briefly introduces the spectral data used in the paper along with the spectral preprocessing. Section 3 presents the ML-PIL-based hash learning scheme, the construction of positive and negative samples, the model training and the performance evaluation of the method on real spectral data, and then the application of ML-PIL to searching for M-type stars in the LAMOST DR5 V1 “UNKNOWN” dataset. Section 4 compares different giant/dwarf separation schemes for M-type stars and suggests two useful indicators, followed by an investigation of the activity and kinematics of the whole M dwarf sample in DR5. The final section summarizes the work of this paper and outlines potential future work.

2 Data and preprocessing

LAMOST is a 4-m reflecting Schmidt telescope with a large field of view (FoV) of 5 degrees in diameter. It has 4,000 fibers mounted on its focal plane and 16 spectrographs with 32 CCD cameras, so that it can simultaneously observe up to 4,000 objects (Cui et al., 2012). The raw CCD data are reduced and analyzed by the LAMOST data pipelines, which consist of a 2D pipeline and a 1D pipeline. The primary functions of the 2D pipeline include bias calibration, flat-field correction, spectral extraction, sky subtraction, wavelength calibration, flux calibration, and combination of sub-exposures. The calibrated spectra from the 2D pipeline are then fed to the 1D pipeline, which performs spectral classification and parameter determination based on template matching and chi-square criteria (Luo et al., 2015).

By July 2017, LAMOST had completed its five-year regular survey. LAMOST DR5 V1 includes 9,017,844 spectra of stars, galaxies, quasars, and unrecognized objects. These spectra cover the wavelength range from 3690 to 9100 Å with a resolution of R ≈ 1800 at a wavelength of 5500 Å.

2.1 “UNKNOWN” data from LAMOST DR5

Among the 9 million spectra in LAMOST DR5 V1, 642,178 unrecognized spectra were labeled as “UNKNOWN”. During the classification process of the 1D pipeline, a spectrum is classified as “UNKNOWN” if the confidence of the classification result is lower than a given threshold, e.g., if the chi-square value of the best match is greater than a certain value, or if the target spectrum is almost equally similar to multiple dissimilar templates. These problems arise in the multi-template matching process, and we refer to them as the multi-template matching problems; they are mostly caused by low spectral S/N (see top panel of Fig. 2).
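The two rejection conditions described above can be sketched in Python as follows. This is not the LAMOST 1D pipeline code: the free template scaling, the reduced chi-square, and the thresholds `max_reduced_chi2` and `min_ratio` are hypothetical choices made only for illustration.

```python
import numpy as np

def classify_by_templates(flux, sigma, templates, labels,
                          max_reduced_chi2=2.0, min_ratio=1.1):
    """Return the best-matching label, or "UNKNOWN" when the match is not trusted.

    max_reduced_chi2 and min_ratio are hypothetical thresholds for this sketch.
    """
    flux = np.asarray(flux, dtype=float)
    weights = 1.0 / np.asarray(sigma, dtype=float) ** 2
    chi2 = []
    for template in np.asarray(templates, dtype=float):
        # Best-fit multiplicative scale of the template (weighted least squares).
        scale = np.sum(weights * flux * template) / np.sum(weights * template**2)
        residual = flux - scale * template
        chi2.append(np.sum(weights * residual**2) / (flux.size - 1))
    chi2 = np.array(chi2)
    best, second = np.argsort(chi2)[:2]
    # Condition 1: even the best template fits poorly.
    too_poor = chi2[best] > max_reduced_chi2
    # Condition 2: a template of a different class fits almost as well.
    ambiguous = (labels[best] != labels[second]
                 and chi2[second] < min_ratio * chi2[best])
    return "UNKNOWN" if (too_poor or ambiguous) else labels[best]

wave = np.linspace(0.0, 1.0, 200)
templates = np.array([1.0 + wave, 2.0 - wave])   # two toy template shapes
labels = ["A", "B"]
sigma = np.full(wave.size, 0.1)

clear_case = classify_by_templates(3.0 * templates[0], sigma, templates, labels)
ambiguous_case = classify_by_templates(np.full(wave.size, 1.5), np.ones(wave.size),
                                       templates, labels)
poor_case = classify_by_templates(2.0 + np.cos(20.0 * wave), sigma, templates, labels)
```

In the toy data, a flat spectrum with large errors is fitted equally well by both templates and is rejected as ambiguous, which mirrors how low S/N drives spectra into the “UNKNOWN” class.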

The lower panel of Fig. 2 shows part of the “UNKNOWN” objects in a color-color diagram. The M-type star candidates should be located in the upper right region of this panel. Owing to the intrinsically low luminosity of late-type M dwarfs, most of them have low S/N spectra, which are likely to be classified as “UNKNOWN”. To recognize M-type spectra efficiently among the 642,178 “UNKNOWN” spectra with an approach that is less sensitive to noise than the multi-template matching of the 1D pipeline, we choose an approximate nearest neighbor search method based on a deep learning model, which combines low-level features layer by layer to obtain more abstract high-level feature expressions and thereby discovers the inherent feature representation of complex data.

Figure 2: S/N (top panel) and color g-r vs r-i (bottom panel) distributions of the LAMOST “UNKNOWN” data. Most of the late-type M dwarfs are expected to lie in the upper right region of the color-color diagram.

2.2 Data preprocessing

LAMOST spectra cover the wavelength range from 3690 to 9100 Å. First, each spectrum is rebinned onto a common wavelength grid with a fixed step length. Then, we normalize the spectra by rescaling the fluxes to eliminate scale differences among the raw spectra. For a given spectral set denoted by $X = \{\mathbf{x}_1, \dots, \mathbf{x}_N\}$, the vector $\mathbf{x}_i = (x_{i1}, \dots, x_{in})$ represents a spectrum, in which $x_{ij}$ is the flux at a given wavelength. The normalization is performed as

$$x'_{ij} = (\mathrm{MAX} - \mathrm{MIN})\,\frac{x_{ij} - \min(\mathbf{x}_i)}{\max(\mathbf{x}_i) - \min(\mathbf{x}_i)} + \mathrm{MIN},$$

where MAX and MIN indicate the maximum and minimum values after the normalization; we use 1 and 0 for simplicity. The functions $\max(\cdot)$ and $\min(\cdot)$ return the maximum and minimum element of a given vector, respectively. For each spectrum $\mathbf{x}_i$, we obtain the normalized one, denoted as $\mathbf{x}'_i$.
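The rebinning and rescaling described above might look as follows in Python. The linear interpolation, the 1 Å grid, and the function name `rebin_and_normalize` are assumptions of this sketch; the text only specifies a fixed step length and a min-max rescaling to [0, 1].

```python
import numpy as np

def rebin_and_normalize(wavelength, flux, grid, new_min=0.0, new_max=1.0):
    """Interpolate a spectrum onto a common grid and min-max rescale its flux."""
    rebinned = np.interp(grid, wavelength, flux)  # linear interpolation (assumed)
    lo, hi = rebinned.min(), rebinned.max()
    if hi == lo:
        # A flat spectrum carries no shape information; map it to new_min.
        return np.full_like(rebinned, new_min)
    return (new_max - new_min) * (rebinned - lo) / (hi - lo) + new_min

# Hypothetical common grid: 3700 to 9000 Angstrom in 1 Angstrom steps.
grid = np.arange(3700.0, 9001.0, 1.0)
wavelength = np.linspace(3690.0, 9100.0, 3000)
flux = 50.0 + 10.0 * np.sin(wavelength / 500.0)   # synthetic flux, arbitrary units
normalized = rebin_and_normalize(wavelength, flux, grid)
```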

3 ML-PIL based hashing scheme and application

The ML-PIL-based hashing scheme can be divided into two stages: the training stage of the deep hash learning model and the ANNS query stage.

In the model training phase, we construct a deep hash learning model to project all the target data into a feature space, then we encode the final feature representations of the last hidden layer’s outputs into “fingerprints”. For a well-trained ML-PIL-based hash network, we can get the corresponding “fingerprints” using the query sample of Section 3.3 as input data. Similarly, we can get the “fingerprints” of the “UNKNOWN” spectra in DR5. We organize the description of this model construction into several subsections including framework of ML-PIL, hashing encoding scheme, positive sample through clustering, negative sample selection, model training and performance evaluation etc., from Section 3.1 to 3.5.

In the second stage, the ANNS query stage, for any given “query” we search for similar spectra in the “UNKNOWN” data by calculating their similarities. The similarity between the query sample and each “UNKNOWN” spectrum is calculated by measuring the Hamming distance in the feature code space. The smaller the distance between a sample and a coded query spectrum in the hash space, the more similar the sample is to that query spectrum. Section 3.6 illustrates this query stage.

3.1 ML-PIL framework

ML-PIL is a hierarchical network structure based on pseudo inverse learning, and it is stacked with several single hidden layer neural networks. For a given single hidden layer neural network in Fig. 3, we can get a single layer auto-encoder by training such that the output is approximately equal to input. By stacking several aforementioned single layer autoencoders, we can get the multilayer autoencoder. Once a multilayer autoencoder is trained, the binary hash code of any sample is obtained from the deepest hidden layer. However, these complex models require iterative parameter adjustments and hence are computationally expensive.

To overcome the computational complexity of the multilayer autoencoder, the PIL algorithm is introduced, exploiting its random orthogonal feature mapping to speed up learning. PIL is a supervised learning algorithm for training a single hidden layer feed-forward neural network (SLFN). The basic idea is to find a set of orthogonal vector bases using the nonlinear activation function, so that the output vectors of the hidden layer neurons are orthogonal. The weights of the output connections of the network are then approximately solved by calculating the pseudo inverse. The PIL algorithm uses only basic matrix operations to calculate the analytical solution of the optimization objective (Wang et al., 2016a, 2017). It needs neither iterative optimization nor parameter adjustment, so its efficiency is much higher than that of gradient-descent-based algorithms. Here, we give a detailed introduction to the PIL algorithm.

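Figure 3: The structure of a single hidden layer feed-forward neural network (SLFN).

In Fig. 3, suppose that $\{(\mathbf{x}_j, \mathbf{t}_j)\}_{j=1}^{N}$ denotes the sample set, where $\mathbf{x}_j$ is a spectrum with n dimensions and $\mathbf{t}_j$ is the target label corresponding to $\mathbf{x}_j$. The input is mapped to the L-dimensional PIL random feature space, and the network output is

$$\mathbf{o}_j = \sum_{i=1}^{L} \boldsymbol{\beta}_i\, g(\mathbf{w}_i \cdot \mathbf{x}_j + b_i), \qquad j = 1, \dots, N,$$

where $g$ is the activation function, $\mathbf{w}_i$ is the input weight vector of the $i$-th hidden node, $\boldsymbol{\beta} = [\boldsymbol{\beta}_1, \dots, \boldsymbol{\beta}_L]^{T}$ is the output weight matrix between the hidden nodes and the output nodes, and $b_i$ is the bias of the $i$-th hidden node. The aim is to find the optimal weight matrix $\boldsymbol{\beta}$ that minimizes the loss function

$$E = \sum_{j=1}^{N} \lVert \mathbf{o}_j - \mathbf{t}_j \rVert^{2}.$$

This problem can be expressed as

$$\mathbf{H}\boldsymbol{\beta} = \mathbf{T},$$

where $\mathbf{T} = [\mathbf{t}_1, \dots, \mathbf{t}_N]^{T}$ and $\mathbf{H}$ is the hidden layer node output. This nonlinear mapping is defined by

$$\mathbf{H} = \begin{bmatrix} g(\mathbf{w}_1 \cdot \mathbf{x}_1 + b_1) & \cdots & g(\mathbf{w}_L \cdot \mathbf{x}_1 + b_L) \\ \vdots & \ddots & \vdots \\ g(\mathbf{w}_1 \cdot \mathbf{x}_N + b_1) & \cdots & g(\mathbf{w}_L \cdot \mathbf{x}_N + b_L) \end{bmatrix}_{N \times L}.$$

The objective of the optimization can be converted to minimizing the loss function

$$\min_{\boldsymbol{\beta}} \lVert \mathbf{H}\boldsymbol{\beta} - \mathbf{T} \rVert^{2}.$$

In the PIL algorithm, once the biases and the input weights of the hidden layer are determined, the output matrix $\mathbf{H}$ of the hidden layer is uniquely determined. The training of the single hidden layer neural network can thus be transformed into solving a linear system. We can get the output weight from

$$\hat{\boldsymbol{\beta}} = \mathbf{H}^{\dagger}\mathbf{T},$$

where $\mathbf{H}^{\dagger}$ is the Moore-Penrose generalized inverse of the matrix $\mathbf{H}$.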
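The closed-form training step described above can be sketched with NumPy, whose `numpy.linalg.pinv` computes the Moore-Penrose pseudo inverse. The Gaussian random weights, the toy regression target, and the function names are assumptions of this sketch rather than the configuration used in this work; in particular, no orthogonalization of the random weights is attempted here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_pil_slfn(X, T, n_hidden, rng):
    """Fit a single-hidden-layer network with random, untuned input weights.

    X: (N, n) inputs, T: (N, m) targets. Returns (W, b, beta).
    """
    n_features = X.shape[1]
    W = rng.standard_normal((n_features, n_hidden))  # random input weights (assumed Gaussian)
    b = rng.standard_normal(n_hidden)                # random hidden biases
    H = sigmoid(X @ W + b)                           # hidden-layer output matrix, shape (N, L)
    beta = np.linalg.pinv(H) @ T                     # least-squares output weights
    return W, b, beta

def predict(X, W, b, beta):
    return sigmoid(X @ W + b) @ beta

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 3))
T = np.sin(X.sum(axis=1, keepdims=True))             # smooth toy target
W, b, beta = train_pil_slfn(X, T, n_hidden=50, rng=rng)
mse = float(np.mean((predict(X, W, b, beta) - T) ** 2))
```

The only fitted quantity is `beta`, obtained in a single linear-algebra call, which is why no iterative optimization is involved.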

PIL is modified as follows to obtain the PIL AutoEncoder (PIL-AE), which performs unsupervised feature representation: the input data are also used as the target output data. ML-PIL is built by stacking multiple PIL-AEs. Each PIL-AE is trained separately, and the hidden layer output of one PIL-AE is connected to the input of the next PIL-AE. The PIL-AEs trained layer by layer are then stacked into an ML-PIL (see Fig. 4). The output of the last hidden layer is used for the hash mapping.
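A compact sketch of this greedy layer-wise stacking is given below. The layer sizes, the weight scaling, and the helper names are hypothetical; only the structure (random encoder, pseudo-inverse decoder with the input as its own target, hidden output passed to the next layer) follows the description above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_pil_autoencoder(X, n_hidden, rng):
    """Train one autoencoder layer: random encoder, pseudo-inverse decoder."""
    W = rng.standard_normal((X.shape[1], n_hidden)) / np.sqrt(X.shape[1])
    b = rng.standard_normal(n_hidden)
    H = sigmoid(X @ W + b)
    beta = np.linalg.pinv(H) @ X      # decoder weights: the target is the input itself
    return W, b, beta

def train_ml_pil(X, layer_sizes, rng):
    """Greedy layer-wise training; each layer's hidden output feeds the next layer."""
    layers, features = [], X
    for n_hidden in layer_sizes:
        W, b, beta = train_pil_autoencoder(features, n_hidden, rng)
        layers.append((W, b, beta))
        features = sigmoid(features @ W + b)
    return layers, features

def encode(X, layers):
    """Map new data to the output of the last hidden layer."""
    features = X
    for W, b, _ in layers:
        features = sigmoid(features @ W + b)
    return features

rng = np.random.default_rng(1)
X = rng.uniform(0.0, 1.0, size=(100, 40))        # stand-in for normalized spectra
layers, H_last = train_ml_pil(X, layer_sizes=[32, 16, 8], rng=rng)
```

The same `encode` function would be applied to both the query spectra and the spectra being searched, so that their features live in the same space before binarization.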

Figure 4: Framework of ML-PIL. Each hidden layer is trained separately. The last layer of ML-PIL can conduct a baseline PIL classification or regression.

3.2 Proposed hashing scheme

As described in the previous subsection, the feature expression can be learned from the last hidden layer of ML-PIL. These features can be projected into the hash code space through hash mapping to obtain the “fingerprint”. The “fingerprint” is a binary number consisting of a series of 0s and 1s. A perfect hash mapping should have the following properties simultaneously: (1) Similar samples should be mapped to similar hash codes (usually called similarity preservation or coding consistency). (2) The hash codes should be “balanced” (usually called coding balance), which means that, for each bit in the code, half of the samples are mapped to 1 and the other half are mapped to 0 (or −1). (3) All bits should be independent of each other.

Fig. 5 illustrates the procedure of learning and hash coding for features. We define a threshold with which the features H are made binary. To be specific, we choose the median value of each feature dimension as the threshold. Then the feature values that are greater than the threshold are mapped to 1; otherwise, they are mapped to 0. By doing so, the learned binary codes are guaranteed to be “balanced”.
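The median thresholding can be written in a few lines of Python. The random features below merely stand in for the last-hidden-layer outputs, and the function names are illustrative assumptions.

```python
import numpy as np

def fit_thresholds(features):
    """Per-dimension median of the training features."""
    return np.median(features, axis=0)

def to_hash_codes(features, thresholds):
    """1 where a feature exceeds its threshold, otherwise 0."""
    return (features > thresholds).astype(np.uint8)

rng = np.random.default_rng(2)
H = rng.standard_normal((1000, 16))     # stand-in for last-hidden-layer outputs
thresholds = fit_thresholds(H)
codes = to_hash_codes(H, thresholds)
ones_fraction = codes.mean(axis=0)      # fraction of 1s in each bit
```

With an even number of samples and no tied feature values, exactly half of the samples exceed the median in every dimension, which is the balance property (2) above. Codes for new data would reuse the thresholds fitted on the training features.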

Figure 5: The procedure of feature learning and hash coding for a single hidden layer PIL-AE. The diagram marks the random weight matrix, the reconstruction of the input data, and the feature H learned through the PIL-AE.

3.3 Positive samples

The size of the training set for any machine learning (ML) algorithm depends on the complexity of the algorithm. For the ML-PIL-based hashing scheme, thousands of positive samples are needed to represent M-type spectra covering all subtypes, luminosity classes, and S/N levels, especially low S/N, since the “UNKNOWN” data have universally low S/N, as shown in the top panel of Fig. 2. Therefore, we cluster the released M-type stars in LAMOST DR5 V1 and select diverse positive samples from each cluster.

Before clustering, all the M-type spectra are shifted to the rest frame. Two machine learning methods are then adopted: Balanced Iterative Reducing and Clustering Using Hierarchies (BIRCH) (Zhang, 1999) and K-means (Arthur & Vassilvitskii, 2007). The BIRCH algorithm builds a tree called the Clustering Feature Tree (CFT) for the given data. It clusters the data points incrementally, keeps only a compact summary of the dataset in memory, and updates the clustering decisions when new data come in. The K-means algorithm clusters data by trying to separate the samples into n groups of equal variance, minimizing a criterion known as the inertia, or within-cluster sum of squares. It has been widely used in many different fields (Almeida & Prieto, 2013; Wei et al., 2014).

First, the BIRCH algorithm is adopted to cluster the 529,629 M dwarf spectra in LAMOST DR5 V1 into 50 groups. Principal Component Analysis (Jolliffe, 2002) is applied to reduce the dimensions of the spectra. Second, for each group, 20 sub-clusters are obtained using K-means. For M giants, 80 clusters are obtained using the K-means algorithm alone. Thus, we initially obtain 1,080 average spectra, one for each cluster center. After manual inspection, 23 defective spectra with flux gaps (see Fig. 6) are discarded, and 6,699 spectra are then randomly selected from the remaining 1,057 clusters. Together with 38 template spectra used in the LAMOST pipeline, 6,737 M-type positive spectra are ultimately assembled.
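The two-stage clustering can be sketched with scikit-learn, which provides `Birch`, `KMeans`, and `PCA`. The order of operations (PCA first, then BIRCH, then K-means within each group), the toy data, and all parameter values are assumptions of this sketch; the text does not specify them at this level of detail.

```python
import numpy as np
from sklearn.cluster import Birch, KMeans
from sklearn.decomposition import PCA

def two_stage_cluster_centres(spectra, n_groups, n_sub, n_components, seed=0):
    """PCA, then BIRCH groups, then K-means sub-clusters; returns mean spectra."""
    reduced = PCA(n_components=n_components, random_state=seed).fit_transform(spectra)
    group_labels = Birch(n_clusters=n_groups).fit_predict(reduced)
    centres = []
    for g in np.unique(group_labels):
        members = np.flatnonzero(group_labels == g)
        km = KMeans(n_clusters=n_sub, n_init=10, random_state=seed)
        sub_labels = km.fit_predict(reduced[members])
        for s in range(n_sub):
            # Average the original (not reduced) spectra of each sub-cluster.
            centres.append(spectra[members[sub_labels == s]].mean(axis=0))
    return np.array(centres)

# Toy data: three well-separated blobs standing in for groups of spectra.
rng = np.random.default_rng(3)
blobs = [rng.normal(loc=c, scale=0.3, size=(60, 20)) for c in (0.0, 10.0, 20.0)]
spectra = np.vstack(blobs)
centres = two_stage_cluster_centres(spectra, n_groups=3, n_sub=2, n_components=5)
```

With 50 groups and 20 sub-clusters per group, the same structure would yield the 1,000 dwarf cluster centers mentioned above.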

Figure 6: Examples of defective spectra eliminated from positive samples.

As shown in Table 1, the positive sample of 6,737 M-type spectra covers 10 M-type subclasses, both luminosity classes (dwarf and giant), and a wide range of S/N. Four typical positive samples are shown in Fig. 7: three high S/N spectra and one low S/N spectrum, including an early giant, a late giant, and two early dwarfs.

Spectral   Luminosity   Number of spectra in each S/N (b) range
type       class (a)     < 5    5-10   10-30    > 30
M0         g             107      59      61     337
           d             329     460     732     987
M1         g              37      10      16     104
           d             132     171     235     269
M2         g              69      12      23      83
           d             321     202     259     198
M3         g              87       8       6      10
           d             517     146     146     113
M4         g              28      10      29      73
           d              76       9       0       0
M5         g               5       7      14      31
           d               6       3       0       0
M6         g              14      13      17      47
           d               4       2       4       4
M7         g               8       4       5      34
           d               1       0       0       0
M8         g              11       1       3       4
           d               3       0       0       0
M9         g              13       0       1       0
           d               4       3       0       0

  (a) g denotes giant; d denotes dwarf.
  (b) S/N here refers to the r-band S/N value.

Table 1: Main information of the finally combined query library.
Figure 7: Example spectra in the positive samples with different subtypes, luminosity classes and S/Ns.

3.4 Negative samples

A total of 10,000 negative samples are randomly selected from the “UNKNOWN” spectra in LAMOST DR5 V1 and visually confirmed as non-M-type. Another 5,000 non-M-type spectra are randomly selected from the spectra with known classes in the data release and shifted to the rest frame. In total, we obtain 15,000 negative samples.

3.5 Model training and performance evaluation

The 21,737 positive and negative samples described above are used to train and validate the designed ML-PIL model. The ML-PIL model comprises three hidden layers, since Wang et al. (2017) demonstrated that more hidden layers do not improve performance much. The sigmoid activation function is selected for each hidden layer. The length of the hash code (“fingerprint”) derived from the features learned through ML-PIL affects the performance of the ANNS, so an appropriate code length should be chosen through performance evaluation.

We use “Accuracy” and “Recall” to evaluate the performance of the ML-PIL hashing search. The “Accuracy” is defined as

$$\mathrm{Accuracy} = \frac{TP}{TP + FP},$$

where TP denotes the number of true positive samples in the query result and FP is the number of false positive samples. This quantity is more commonly called precision.

The “Recall” is defined as

$$\mathrm{Recall} = \frac{TP}{TP + FN},$$

where FN denotes the number of false negative samples.
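As a small worked example, the two metrics can be evaluated along a ranked retrieval list, which is how one point of an Accuracy-Recall curve is obtained for each cutoff. The ranked list below is invented for illustration, and the function name is an assumption of this sketch.

```python
import numpy as np

def accuracy_recall_at_k(ranked_is_positive, n_positive_total, k):
    """Accuracy = TP / (TP + FP) and Recall = TP / (TP + FN) for the top-k results."""
    top = np.asarray(ranked_is_positive[:k], dtype=bool)
    tp = int(top.sum())                 # retrieved and truly positive
    fp = int(top.size - tp)             # retrieved but negative
    fn = n_positive_total - tp          # positive but not retrieved
    accuracy = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, recall

# Toy ranked list: True marks a genuine M-type spectrum (illustrative only).
ranked = [True, True, False, True, False, False, True, False]
n_pos = 4
curve = [accuracy_recall_at_k(ranked, n_pos, k) for k in range(1, len(ranked) + 1)]
```

For example, at a cutoff of four results the list contains three true positives and one false positive, so both metrics equal 0.75.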

We plot the Accuracy-Recall curves (Fig. 8) to evaluate the performance of the model with the code length set to 32, 64, 128, and 256 bits. Each value of “Accuracy” and “Recall” in Fig. 8 is the average over ten thousand ANNS results. We perform ten thousand ANNS runs so that each of the 21,737 samples can be selected, which yields an unbiased statistical result. The training set and validation set of each ANNS run are randomly selected. In Fig. 8, a larger area under the curve indicates better performance, that is, both “Accuracy” and “Recall” reach a higher level. The performance of the model improves as the code length increases, although beyond a certain length the performance becomes less sensitive to the code length. We ultimately choose 256 as the code length.

Figure 8: Accuracy-recall curves of different code lengths.

Finally, we obtain an effective ML-PIL hash learning model that has both high “Accuracy” and high “Recall”. In addition, training the ML-PIL framework takes 11.76 s, which is much less than the time required by traditional gradient-descent-based deep learning networks.

3.6 Application of ML-PIL based hash learning to recognize M-type spectra

We apply the ML-PIL hash learning method to search for M-type spectra in the “UNKNOWN” data of LAMOST DR5 V1, using the 6,737 query (positive) sample spectra described in Section 3.3. We first derive the hash codes of the query samples and of all “UNKNOWN” spectra through the ML-PIL hash model. Then, for each “query”, we calculate the similarity between the query sample and each “UNKNOWN” spectrum using the Hamming distance between their hash codes. The smaller the distance, the higher the similarity. Fig. 9 shows one example of the top 10 search results for a late-type M spectrum, and Fig. 10 shows another example in which the dissimilarity increases with the Hamming distance for an early-type M spectrum. For each of the 6,737 searches, the spectra ranked in the top 10% by similarity are kept.
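The retrieval step can be sketched as follows: for every query code, the nearest fraction of the database is kept, and the union over all queries forms the candidate set for inspection. The random codes, the tie-breaking by stable sort, and the function name are assumptions made for this illustration.

```python
import numpy as np

def candidate_union(query_codes, database_codes, keep_fraction=0.10):
    """Union of the nearest keep_fraction of the database for every query."""
    n_keep = max(1, int(np.ceil(keep_fraction * len(database_codes))))
    kept = set()
    for q in query_codes:
        d = np.count_nonzero(database_codes != q, axis=1)   # Hamming distances
        nearest = np.argsort(d, kind="stable")[:n_keep]
        kept.update(int(i) for i in nearest)
    return sorted(kept)

rng = np.random.default_rng(4)
database_codes = rng.integers(0, 2, size=(50, 32), dtype=np.uint8)
query_codes = database_codes[[3, 17]].copy()   # queries identical to two entries
candidates = candidate_union(query_codes, database_codes)
```

Taking the union means that a spectrum needs to be close to only one of the queries to become a candidate, which suits a query library that spans many subtypes and S/N levels.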

Figure 9: Example of the top 10 search results for a late-type M spectrum (red). The top 10 spectra (black) are sorted by decreasing similarities.
Figure 10: Retrieved spectra (blue) in different positions of the ranked list for an early-type M query (red), where the number after @ denotes the ranking.

We manually inspect the union of these 6,737 subsets and recognize 11,410 M-type spectra (11,156 objects), including 10,242 dwarf and 1,168 giant spectra, among the 642,178 “UNKNOWN” spectra in LAMOST DR5 V1. We compile a supplemental catalog and re-archive all these 11,410 spectra from the “UNKNOWN” category into the M-type star category in LAMOST DR5 V2, which will be officially released in June 2019. As in former LAMOST data releases, we measure the same parameters for these spectra in the catalog, including the indices of nine molecular bands, the equivalent width of Hα, the magnetic activity, and a metal-sensitive parameter (Yi et al., 2014; Guo et al., 2015). In addition, the catalog provides the spectral subtype of these spectra, determined using an improved Hammer package. The improvement to the original Hammer (Covey et al., 2007) was made by Yi et al. (2014), who incorporated three new indices to increase the classification correctness. In the catalog, each object also has a radial velocity, measured through cross-matching with dwarf templates, and a giant/dwarf flag, determined using the suggested methods described in Section 4.1. This supplemental catalog can be downloaded from the web site http://paperdata.china-vo.org/Guoyx/2018/M_etable.txt. Table 2 shows the first five rows of the catalog. Fig. 11 and Fig. 12 show the distributions of the spectral subtypes and the S/Ns of the 11,410 spectra, respectively.

Figure 11: Subtype distribution of the 11,410 M-type spectra.
Figure 12: S/N distribution of the 11,410 M-type spectra.
Figure 13: Comparison of the magnitude (r band) distributions of the M-type spectra released in LAMOST DR5 V1 and those recognized through ML-PIL hash learning. The magnitude distribution of the M-type spectra in DR5 V1 is shown in blue against the left vertical axis, while that from ML-PIL hash learning is shown in red against the right vertical axis.

The number distributions in mag_r space of the M-type spectra in LAMOST DR5 V1 and of the supplemental spectra are compared in Fig. 13. The supplemental M-type spectra are not only fainter but also contain a higher proportion of late types than the M-type spectra in LAMOST DR5 V1; this comparison is shown in Fig. 14. The total number of late M-type spectra (later than M5) recognized through ML-PIL-based hash learning is 569.

Adding the 11,410 M-type spectra from the “UNKNOWN” data in LAMOST DR5 V1 to the M dataset of LAMOST DR5 V1, which originally has 583,728 M-type spectra, we now possess a larger M star catalog for DR5 (defined as “ALL M” hereafter) with which to study the giant/dwarf separation and the magnetic activity in the discussion section.

Figure 14: Comparison of the subtype distributions of the M-type spectra released in LAMOST DR5 V1 and those recognized through ML-PIL hash learning. The subtype distribution of the M-type spectra in DR5 V1 is shown in blue against the left vertical axis, while that from ML-PIL hash learning is shown in red against the right vertical axis.

4 Discussion

4.1 Luminosity class indicators

We use the “ALL M” objects to check both the spectroscopic and the photometric criteria for separating M giants and dwarfs proposed by Zhong et al. (2015) (Zhong2015 for short), which are the CaH2+CaH3 versus TiO5 line index diagram and a color-color diagram combining 2MASS and WISE photometry, respectively. We suggest improved spectroscopic and photometric separators for M giants and dwarfs, which are the CaOH versus CaH1 line index diagram (middle panel of Fig. 15) and a 2MASS color-color diagram (top panel of Fig. 16). We use the HRD from Gaia DR2 to verify the suggested separation approach.

Accurate parallaxes and proper motions for the vast majority of the “ALL M” sample are obtained through cross-matching within 5 arcsec with Gaia DR2 (Gaia Collaboration et al., 2018a), which became available in April 2018. We build the HRDs by simply estimating the absolute magnitude in the G band for each individual star using $M_G = G + 5 + 5\log_{10}(\varpi/1000)$, with $\varpi$ the parallax in milliarcseconds (plus the extinction) (Gaia Collaboration et al., 2018b). This is valid when the relative uncertainty on the parallax is lower than about 20% (Luri et al., 2018).
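The absolute magnitude estimate and the parallax quality cut can be sketched as below. The function name, the NaN convention for rejected stars, and the optional extinction argument (subtracted as A_G) are assumptions of this sketch; the inputs are illustrative numbers, not catalogue values.

```python
import numpy as np

def absolute_g_magnitude(g_mag, parallax_mas, parallax_error_mas,
                         extinction_g=0.0, max_relative_error=0.2):
    """M_G = G + 5 log10(parallax in mas) - 10 - A_G; NaN for unreliable parallaxes."""
    g_mag = np.asarray(g_mag, dtype=float)
    parallax = np.asarray(parallax_mas, dtype=float)
    error = np.asarray(parallax_error_mas, dtype=float)
    # Keep only positive parallaxes with a small relative uncertainty.
    reliable = (parallax > 0) & (error <= max_relative_error * parallax)
    safe_parallax = np.where(reliable, parallax, np.nan)
    return g_mag + 5.0 * np.log10(safe_parallax) - 10.0 - extinction_g

m_g = absolute_g_magnitude(
    g_mag=[10.0, 15.0, 15.0, 12.0],
    parallax_mas=[100.0, 1.0, 1.0, -0.5],
    parallax_error_mas=[1.0, 0.1, 0.5, 0.1],
)
```

The expression used in the code is algebraically the same as $G + 5 + 5\log_{10}(\varpi/1000)$: a star at 10 pc (parallax 100 mas) has an absolute magnitude equal to its apparent magnitude, and stars with negative or imprecise parallaxes are returned as NaN.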

First, we choose the early M-type spectra to assess the validity of the luminosity discrimination based on spectral features. As shown in the HRD in the top panel of Fig. 15, the M giants (red) are located in the upper branch, while the M dwarfs (black) are clearly separated and located in the lower branch. The middle panel of Fig. 15, the CaOH versus CaH1 diagram, shows that the same giant population as in the top panel (red) lies in the upper branch of this diagram. In contrast, the criterion of Zhong2015 is weaker at discriminating M giants from dwarfs for early M-type spectra. The bottom panel of Fig. 15, the CaH2+CaH3 versus TiO5 diagram, shows that some giants overlap with dwarfs in the lower branch, where the dwarfs are located. This overlap means that the criterion in Zhong2015 leads to a small portion of M giants being misclassified as dwarfs.

Figure 15: The distribution of early M-type stars in the HRD (top panel), CaH1 versus CaOH (middle panel) and CaH2+CaH3 versus TiO5 (bottom panel) diagrams. Black and red dots denote the dwarfs and giants respectively. For the same subsample on the upper branch of the HRD, a clearer separation can be seen in the CaH1 versus CaOH diagram, whereas a small portion of giants are mixed into the dwarf branch in the CaH2+CaH3 versus TiO5 diagram.

Then, we examine the effectiveness of the versus criterion using the full “ALL M” sample; dwarfs and giants occupy different loci in the HRD at the bottom of Fig. 16. As shown in the top and middle panels of Fig. 16, both the criterion suggested in this paper and that of Zhong2015 can separate giants from dwarfs. In these two panels, dwarfs are shown in black, while giants classified by the criterion in this paper are shown in red and those classified by Zhong2015 are shown in blue. Comparing these two groups of giants in the HRD shown in the bottom panel of Fig. 16, part of the giant candidates (12%) selected by the Zhong2015 criterion, versus , are actually dwarfs lying in the main-sequence strip. It is clear that versus eliminates possible dwarf contamination among the giants more easily than the method given in Zhong2015.

Using both the spectral features and the 2MASS photometry, we label each M-type spectrum in the supplemental catalog as a giant or a dwarf. From this result, we conclude that even without 2MASS infrared data, M giants can still be separated efficiently from M dwarfs on the basis of spectral features.

Figure 16: The distribution of “ALL M” stars in the versus (top panel), versus (middle panel) and HRD (bottom panel) diagrams. Giants (red dots) determined in the versus diagram lie in the upper branch of the HRD, while giants (blue dots) determined in the versus diagram lie mainly in the upper strip of the HRD, with a small portion lying in the main-sequence strip.

Figure 17: Hα EWs (upper panel) and magnetic activity fraction (lower panel) as a function of spectral type. The red (blue) dots in the lower panel show the fraction of active (inactive) stars.

4.2 Magnetic activity and kinematics

Magnetic fields affect the chromospheric activity of M dwarfs, and Hα emission is an indicator of chromospheric activity. We investigate the magnetic activity of M dwarfs by measuring the equivalent width (EW) of Hα. When the S/N of the continuum around Hα is greater than 3, the M dwarf is classified as active if its Hα EW is greater than 1 Å and as inactive otherwise (Guo et al., 2015). If the S/N around Hα is less than 3, the activity of the M dwarf is not measured. The upper panel of Fig. 17 shows that the Hα EW increases toward later subtypes, while the lower panel shows that the mean fraction of active stars increases, and that of inactive stars decreases, toward later subtypes. This implies that later M dwarfs show both stronger magnetic activity and a higher active fraction.
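The activity criterion above amounts to a two-step decision rule, sketched here under stated assumptions: emission is taken to have a positive EW, the thresholds (S/N of 3, EW of 1) are the ones quoted in the text, a spectrum with S/N exactly equal to the threshold is treated as unmeasured, and the function names and record layout are hypothetical.

```python
def activity_flag(ew_halpha, snr_continuum, ew_threshold=1.0, snr_threshold=3.0):
    """Return 'active', 'inactive', or None (activity not measured)."""
    if snr_continuum <= snr_threshold:
        return None  # continuum S/N too low: activity is not measured
    return "active" if ew_halpha > ew_threshold else "inactive"

def active_fraction(records):
    """Fraction of active stars per subtype.

    records is an iterable of (subtype, ew_halpha, snr_continuum) tuples;
    spectra whose activity is not measured are excluded from both the
    numerator and the denominator.
    """
    counts = {}
    for subtype, ew, snr in records:
        flag = activity_flag(ew, snr)
        if flag is None:
            continue
        n_active, n_total = counts.get(subtype, (0, 0))
        counts[subtype] = (n_active + (flag == "active"), n_total + 1)
    return {subtype: a / t for subtype, (a, t) in counts.items()}
```

The inactive fraction per subtype is then one minus the active fraction, which is why the two curves in the lower panel of Fig. 17 mirror each other.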

We also investigate the velocities and velocity dispersions of both active and inactive M dwarfs. Combining radial velocities, distances and proper motions from Gaia DR2, we compute the heliocentric space motions (U, V, W) according to the method of Johnson et al. (1987). The 3D velocities are computed in a right-handed coordinate system, with positive U toward the Galactic center, positive V in the direction of Galactic rotation and positive W toward the north Galactic pole. The velocities are corrected for the solar motion (10, 5, 7 km s$^{-1}$) (Dehnen & Binney, 1998) with respect to the local standard of rest. These kinematic parameters are also provided in the supplemental catalog.
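A space-velocity computation of this kind can be sketched as follows. The sketch is not the authors' code: it assumes J2000/ICRS coordinates, the widely used J2000 equatorial-to-Galactic rotation matrix, a proper motion in right ascension that already includes the cos(dec) factor (the Gaia convention), and the rounded solar motion quoted in the text. The function name `space_velocity` and its parameters are hypothetical.

```python
import math

K = 4.74047  # km/s per (mas/yr)/mas, i.e. 1 AU/yr expressed in km/s

# Rotation from equatorial (J2000/ICRS) to Galactic Cartesian axes.
# Rows point toward the Galactic center, the direction of Galactic
# rotation, and the north Galactic pole respectively.
T = (
    (-0.0548755604, -0.8734370902, -0.4838350155),
    (0.4941094279, -0.4448296300, 0.7469822445),
    (-0.8676661490, -0.1980763734, 0.4559837762),
)

def space_velocity(ra_deg, dec_deg, parallax_mas, pmra, pmdec, rv,
                   solar_motion=(10.0, 5.0, 7.0)):
    """Return (U, V, W) in km/s, corrected to the local standard of rest.

    pmra and pmdec are in mas/yr, rv in km/s. Pass
    solar_motion=(0, 0, 0) to obtain heliocentric velocities.
    """
    ra, dec = math.radians(ra_deg), math.radians(dec_deg)
    v_ra = K * pmra / parallax_mas    # tangential velocity toward increasing RA
    v_dec = K * pmdec / parallax_mas  # tangential velocity toward increasing Dec
    # Unit vectors along the line of sight and the two tangential directions.
    r_hat = (math.cos(dec) * math.cos(ra), math.cos(dec) * math.sin(ra), math.sin(dec))
    ra_hat = (-math.sin(ra), math.cos(ra), 0.0)
    dec_hat = (-math.sin(dec) * math.cos(ra), -math.sin(dec) * math.sin(ra), math.cos(dec))
    v_eq = [rv * r + v_ra * a + v_dec * d for r, a, d in zip(r_hat, ra_hat, dec_hat)]
    helio = [sum(T[i][j] * v_eq[j] for j in range(3)) for i in range(3)]
    return tuple(h + s for h, s in zip(helio, solar_motion))
```

As a sanity check on the sign convention, a star seen in the direction of the Galactic center with a purely radial heliocentric velocity of +10 km/s should come out with U close to +10 and V and W close to zero before the solar-motion correction.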

The M dwarfs are binned in 100 pc increments of absolute vertical distance from the Galactic plane. The mean velocities and velocity dispersions as a function of absolute vertical distance for the active and inactive populations are shown in Fig. 18. The figure shows that the active M dwarfs have systematically lower velocity dispersion in the direction, while their mean velocities are higher in the and directions. The two populations are clearly separated, suggesting that the active M dwarfs belong to a kinematically younger population, which is consistent with Hawley et al. (2011). The mean velocities decline with increasing absolute vertical distance, whereas the velocity dispersions rise, for both the active and inactive populations. This result agrees well with the trend for the thin disk shown in Bochanski et al. (2007).
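The binning described above can be sketched as follows. The sketch assumes the dispersion is the plain (population) standard deviation within each bin, with no correction for measurement errors, which the text does not specify; the function name and return layout are hypothetical.

```python
import math
from collections import defaultdict

def bin_velocity_stats(z_pc, velocity, bin_width=100.0):
    """Mean and dispersion of one velocity component in bins of |z|.

    z_pc holds signed vertical distances from the Galactic plane in pc.
    Returns {bin_lower_edge_pc: (mean, dispersion, n_stars)}.
    """
    bins = defaultdict(list)
    for z, v in zip(z_pc, velocity):
        edge = math.floor(abs(z) / bin_width) * bin_width
        bins[edge].append(v)
    stats = {}
    for edge, values in sorted(bins.items()):
        n = len(values)
        mean = sum(values) / n
        dispersion = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
        stats[edge] = (mean, dispersion, n)
    return stats
```

Running this once per velocity component, separately for the active and inactive samples, yields the kind of curves compared in Fig. 18.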

Figure 18: Mean velocities (left) and velocity dispersions (right) as a function of absolute vertical distance from the Galactic plane in 100 pc bins for active (red asterisks) and inactive (blue triangles) M dwarfs.

Furthermore, although we find that the strength of the Hα emission line varies between multiple observations of some M dwarfs, we do not have enough data to draw any conclusion. Doing so requires analysis of other physical characteristics, such as flares, rotation, and their intrinsic relationships, using time-domain photometric and spectroscopic observations.

5 Summary

A binary nonlinear hashing algorithm based on ML-PIL is proposed to effectively learn the spectral features of M-type stars, in order to search for M-type stars missed because of failures of multi-template matching, particularly for low signal-to-noise ratio spectra. We construct a specific ML-PIL model for learning and searching, and build a positive sample by clustering known M-type spectra of both high and low S/N. After evaluating the performance of the model, we apply it to 642,178 “UNKNOWN” spectra in LAMOST DR5 V1, recognize 11,410 M-type spectra, and compile a catalog that supplements the M-type star catalog of LAMOST DR5 V1. For the recognized spectra, several useful quantities are calculated, including indices of molecular bands, magnetic activity and the metal-sensitive parameter ζ. Adding the M-type spectra recognized through ML-PIL to the M-type stars originally released in DR5 V1, we obtain a complete catalog of M-type stars in LAMOST DR5, which will be officially released in June 2019. Through cross-matching, the objects in common with Gaia DR2 are used to study giant/dwarf separators based on the 2MASS color indices and and LAMOST spectral line indices. We then propose two giant/dwarf separators and verify them with the HRD from Gaia DR2; with these separators we label the objects as dwarfs or giants and calculate kinematics for the M dwarfs. Given the good performance of the ML-PIL based hash learning algorithm and its successful application to the M-type spectra search, we believe it can effectively search for specific spectra, especially in low S/N data such as the LAMOST “UNKNOWN” dataset, which potentially still contains early-type stars besides M-type stars.


Acknowledgements

This research is supported by the Major State Basic Research Development Program of China (973 Program, No. 2014CB845700), the China Scholarship Council and the National Natural Science Foundation of China (Grant Nos. 11703053 and 11703051). This research has made use of LAMOST data. The Guo Shou Jing Telescope (the Large Sky Area Multi-Object Fiber Spectroscopic Telescope, LAMOST) is a National Major Scientific Project built by the Chinese Academy of Sciences. Funding for the project has been provided by the National Development and Reform Commission. This research also makes use of data products from the Two Micron All Sky Survey, which is a joint project of the University of Massachusetts and the Infrared Processing and Analysis Center/California Institute of Technology, funded by the National Aeronautics and Space Administration and the National Science Foundation.


References

  • Almeida & Prieto (2013) Almeida J. S., Prieto C. A., 2013, The Astrophysical Journal, 763, 50
  • Andoni & Indyk (2006) Andoni A., Indyk P., 2006, in IEEE Symposium on Foundations of Computer Science. pp 459–468
  • Arthur & Vassilvitskii (2007) Arthur D., Vassilvitskii S., 2007, in Eighteenth Acm-Siam Symposium on Discrete Algorithms, New Orleans, Louisiana. pp 1027–1035
  • Bayo et al. (2017) Bayo A., et al., 2017, MNRAS, 465, 760
  • Benedict et al. (2016) Benedict G. F., et al., 2016, AJ, 152, 141
  • Bessell & Brett (1988) Bessell M. S., Brett J. M., 1988, PASP, 100, 1134
  • Bessell et al. (1998) Bessell M. S., Castelli F., Plez B., 1998, A&A, 337, 321
  • Bochanski et al. (2007) Bochanski J. J., Munn J. A., Hawley S. L., West A. A., Covey K. R., Schneider D. P., 2007, AJ, 134, 2418
  • Bochanski et al. (2010) Bochanski J. J., Hawley S. L., Covey K. R., West A. A., Reid I. N., Golimowski D. A., Ivezić Ž., 2010, AJ, 139, 2679
  • Bondugula (2013) Bondugula S., 2013, Survey of Hashing Techniques for Compact Bit Representations of Images
  • Chabrier (2003) Chabrier G., 2003, PASP, 115, 763
  • Chabrier et al. (2000) Chabrier G., Baraffe I., Allard F., Hauschildt P., 2000, ApJ, 542, 464
  • Covey et al. (2007) Covey K. R., et al., 2007, AJ, 134, 2398
  • Covey et al. (2008) Covey K. R., et al., 2008, AJ, 136, 1778
  • Cui et al. (2012) Cui X.-Q., et al., 2012, Research in Astronomy and Astrophysics, 12, 1197
  • Dehnen & Binney (1998) Dehnen W., Binney J. J., 1998, MNRAS, 298, 387
  • Delfosse et al. (2000) Delfosse X., Forveille T., Ségransan D., Beuzit J.-L., Udry S., Perrier C., Mayor M., 2000, A&A, 364, 217
  • Feiden & Chaboyer (2014) Feiden G. A., Chaboyer B., 2014, ApJ, 789, 53
  • Gaia Collaboration et al. (2018a) Gaia Collaboration et al., 2018a, preprint, (arXiv:1804.09382)
  • Gaia Collaboration et al. (2018b) Gaia Collaboration et al., 2018b, A&A, 616, A10
  • Gong & Lazebnik (2011) Gong Y., Lazebnik S., 2011, in Computer Vision and Pattern Recognition. pp 817–824
  • Guo et al. (2015) Guo Y.-X., et al., 2015, Research in Astronomy and Astrophysics, 15, 1182
  • Guo et al. (2017) Guo P., Wang K., Xin X., 2017, in Smc.
  • Han et al. (2017) Han E., et al., 2017, AJ, 154, 100
  • Hawley et al. (2011) Hawley S. L., Bochanski J. J., West A. A., 2011, in Johns-Krull C., Browning M. K., West A. A., eds, Astronomical Society of the Pacific Conference Series Vol. 448, 16th Cambridge Workshop on Cool Stars, Stellar Systems, and the Sun. p. 1359 (arXiv:1012.3505)
  • Henry & McCarthy (1993) Henry T. J., McCarthy Jr. D. W., 1993, AJ, 106, 773
  • Henry et al. (1994) Henry T. J., Kirkpatrick J. D., Simons D. A., 1994, AJ, 108, 1437
  • Henry et al. (2006) Henry T. J., Jao W.-C., Subasavage J. P., Beaulieu T. D., Ianna P. A., Costa E., Méndez R. A., 2006, AJ, 132, 2360
  • Heo et al. (2012) Heo J. P., Lee Y., He J., Chang S. F., Yoon S. E., 2012, IEEE, 157, 2957
  • Houdebine et al. (2017) Houdebine E. R., Mullan D. J., Bercu B., Paletou F., Gebran M., 2017, ApJ, 837, 96
  • Huo et al. (2017) Huo Z.-Y., et al., 2017, Research in Astronomy and Astrophysics, 17, 032
  • Jackson & Jeffries (2014) Jackson R. J., Jeffries R. D., 2014, MNRAS, 441, 2111
  • Johnson et al. (1987) Johnson C. H., Horen D. J., Mahaux C., 1987, Phys. Rev. C, 36, 2252
  • Jolliffe (2002) Jolliffe I. T., 2002, Weather, 98
  • Lépine & Gaidos (2011) Lépine S., Gaidos E., 2011, AJ, 142, 138
  • Li et al. (2018) Li Y.-B., et al., 2018, ApJS, 234, 31
  • Luo et al. (2015) Luo A.-L., et al., 2015, Research in Astronomy and Astrophysics, 15, 1095
  • Luri et al. (2018) Luri X., et al., 2018, A&A, 616, A9
  • Mann et al. (2012) Mann A. W., Gaidos E., Lépine S., Hilton E. J., 2012, ApJ, 753, 90
  • Newberg et al. (2012) Newberg H. J., et al., 2012, in Aoki W., Ishigaki M., Suda T., Tsujimoto T., Arimoto N., eds, Astronomical Society of the Pacific Conference Series Vol. 458, Galactic Archaeology: Near-Field Cosmology and the Formation of the Milky Way. p. 405
  • Pal et al. (2015) Pal C., Hagiwara I., Kayaba N., Morishita S., 2015, Shock & Vibration, 3, 201
  • Reid et al. (1995) Reid I. N., Hawley S. L., Gizis J. E., 1995, AJ, 110, 1838
  • Reiners (2012) Reiners A., 2012, Living Reviews in Solar Physics, 9, 1
  • Ren et al. (2018) Ren J.-J., Rebassa-Mansergas A., Parsons S. G., Liu X.-W., Luo A.-L., Kong X., Zhang H.-T., 2018, MNRAS, 477, 4641
  • Salakhutdinov & Hinton (2009) Salakhutdinov R., Hinton G., 2009, International Journal of Approximate Reasoning, 50, 969
  • Salpeter & Hoffman (1995) Salpeter E. E., Hoffman G. L., 1995, ApJ, 441, 51
  • Stassun et al. (2011) Stassun K. G., et al., 2011, in Johns-Krull C., Browning M. K., West A. A., eds, Astronomical Society of the Pacific Conference Series Vol. 448, 16th Cambridge Workshop on Cool Stars, Stellar Systems, and the Sun. p. 505 (arXiv:1012.2580)
  • Torres et al. (2010) Torres G., Andersen J., Giménez A., 2010, A&ARv, 18, 67
  • Veyette et al. (2017) Veyette M. J., Muirhead P. S., Mann A. W., Brewer J. M., Allard F., Homeier D., 2017, ApJ, 851, 26
  • Wang et al. (2016a) Wang K., Guo P., Yin Q., Luo A. L., Xin X., 2016a, in International Joint Conference on Neural Networks. pp 3453–3460
  • Wang et al. (2016b) Wang J., Liu W., Kumar S., Chang S., 2016b, Proceedings of the IEEE, 104, 34
  • Wang et al. (2017) Wang K., Guo P., Luo A. L., Xin X., Duan F., 2017, in IEEE International Conference on Systems, Man, and Cybernetics. pp 002687–002692
  • Wei et al. (2014) Wei P., et al., 2014, AJ, 147, 101
  • Weiss et al. (2008) Weiss Y., Torralba A., Fergus R., 2008, in International Conference on Neural Information Processing Systems. pp 1753–1760
  • West et al. (2011) West A. A., et al., 2011, in Johns-Krull C., Browning M. K., West A. A., eds, Astronomical Society of the Pacific Conference Series Vol. 448, 16th Cambridge Workshop on Cool Stars, Stellar Systems, and the Sun. p. 1407 (arXiv:1012.3766)
  • Yang et al. (2017) Yang H., et al., 2017, ApJ, 849, 36
  • Yanny et al. (2009) Yanny B., et al., 2009, AJ, 137, 4377
  • Yi et al. (2014) Yi Z., et al., 2014, AJ, 147, 33
  • Zhang (1999) Zhang T., 1999, Acm Sigmod Record, 25, 103
  • Zhong et al. (2015) Zhong J., et al., 2015, Research in Astronomy and Astrophysics, 15, 1154
Table 2: Several example records from the online catalog.

designation          obsDate     mjd    planID  spID  fiberID  ra         dec        snrr  subClass  rv      ewHa   ewHaErr  TiO5  CaH2  CaH3
J005327.82+391733.4  2011-10-24  55858  M5901   7     186      13.365946  39.292626  4.03  M0        -48.56  1.01   0.217    0.88  0.87  1.05
J005251.32+384930.7  2011-10-24  55858  M5901   7     216      13.213834  38.825212  3.48  M0        -33.14  0.38   0.227    0.89  0.77  0.98
J005009.21+382629.5  2011-10-24  55858  M5901   7     244      12.538397  38.44155   3.15  M0        -51.63  1.44   0.333    0.76  0.74  0.91
J005223.75+405459.0  2011-10-24  55858  M5901   9     132      13.09896   40.916401  2.73  M2        441.6   -1.68  0.198    0.9   0.78  0.83
J003718.31+394912.3  2011-10-24  55858  M5901   10    172      9.3263115  39.820105  2.08  M0        35.45   0.49   0.191    0.81  0.53  0.76

TiO1  TiO2  TiO3  TiO4  CaH1  CaOH  TiO5Err  CaH2Err  CaH3Err  TiO1Err  TiO2Err  TiO3Err  TiO4Err  CaH1Err  CaOHErr  na    zeta
0.87  0.9   0.99  0.89  0.89  0.79  0.044    0.037    0.044    0.051    0.06     0.056    0.051    0.03     0.044    1.03  2.56
1     0.88  0.92  0.88  1     0.76  0.051    0.038    0.047    0.067    0.069    0.059    0.057    0.038    0.047    0.98  1.12
1     1.03  0.95  0.79  0.91  0.78  0.056    0.048    0.058    0.086    0.106    0.082    0.067    0.047    0.058    1.14  1.71
0.97  0.85  1.07  1.14  1.16  1.35  0.041    0.03     0.032    0.054    0.052    0.054    0.057    0.042    0.032    0.96  0.58
0.72  0.77  0.9   0.94  0.82  0.52  0.035    0.02     0.028    0.038    0.044    0.043    0.046    0.031    0.028    0.92  0.45

zetaerr  giant  parallax  parallax_error  pmra    pmra_error  pmdec   pmdec_error  distance  U      V       W       h_disc   l           b
0.93     0      -9999     -9999           -9999   -9999       -9999   -9999        -10       -9999  -9999   -9999   -12.29   132.040235  23.486443
0.62     0      0.6844    0.2827          5.783   0.553       2.885   0.319        1461.13   21.33  -40.31  31.97   -660.1   109.020028  -59.220041
0.658    0      2.0459    0.4649          -0.554  0.775       -1.613  0.712        488.78    -27.3  -40.17  17.96   480.95   93.931123   -61.426298
0.268    0      -9999     -9999           -9999   -9999       -9999   -9999        -9999     -9999  -9999   -9999   -9999    -9999       -9999
0.089    0      1.2663    0.1689          9.599   0.241       -8.694  0.299        789.7     38.95  -0.39   -45.72  -644.74  112.827489  -71.309437

  • The catalog has too many fields to be shown in one line, so they are split into several sub-tables; the rows of each sub-table are in the same order.

  • This table shows only the first few records of the complete catalog; more records can be found in the online catalog.
