Fractal dimension analysis for automatic morphological galaxy classification
In this report we present experimental results using the Hausdorff-Besicovitch fractal dimension for performing morphological galaxy classification. The fractal dimension is a topological, structural and spatial property that gives us information about the space where an object lives. We have calculated the fractal dimension value of the main types of galaxies: ellipticals, spirals and irregulars; and we use it as a feature for classifying them. Also, we have performed an image analysis process in order to standardize the galaxy images, and we have used principal component analysis to obtain the main attributes in the images. Galaxy classification was performed using machine learning algorithms: C4.5, k-nearest neighbors, random forest and support vector machines. Preliminary experimental results using 10-fold cross-validation show that fractal dimension helps to improve classification, with over 88 per cent accuracy for elliptical galaxies, 100 per cent accuracy for spiral galaxies and over 40 per cent for irregular galaxies.
pacs:galaxies, fractal dimension, machine learning
I Introduction
Astronomy has a long history of acquiring and analyzing enormous quantities of data. Like many other fields, this science has become very data-rich due to advances in telescope, detector, and computer technology. Recently, numerous digital sky surveys across a wide range of wavelengths have been producing very large image databases of astronomical objects. For example, the Large Synoptic Survey Telescope will produce billions of galaxy images Abell et al. (2009). Therefore there is a need to create robust and automated tools for processing astronomical data, particularly for the analysis of the morphology of celestial objects such as galaxies.
Edwin Hubble devised a formal galaxy classification approach in 1926, known as the Hubble tuning fork. In his classification scheme galaxies are classified based on their shape, and there are three basic types: elliptical galaxies, spiral galaxies, and irregular galaxies. The morphology of galaxies is an important issue in the large-scale study of the Universe. Galaxy classification is the first step towards a greater understanding of the origin and formation process of galaxies, and also of the evolution of the Universe. Visual classification of galaxies has traditionally been done by experts; however, it is not easy, because it requires skill and experience, and it is also time-consuming. Automatic classifiers, on the other hand, can analyze thousands of images in seconds and are more impartial than humans, i.e. they are not subject to the conscious and unconscious prejudices which affect humans when looking at galaxy images Ball et al. (2004).
Several approaches have been proposed for automatic image analysis and galaxy classification using machine learning and computer vision techniques. Early approaches used artificial neural networks Ball et al. (2004); De La Calleja and Fuentes (2004); Goderya and Lolling (2002); Sodré et al. (1992), decision trees Marin et al. (2013); Owens et al. (1996), instance-based methods Shamir (2009), and kernel methods Freed and Lee (2015), among others. Recently, some significant works have been presented using new approaches, for example sparse representation and dictionary learning Diaz-Hernandez et al. (2016), and deep neural networks Dieleman et al. (2015).
Classifying galaxies can help to locate particular kinds of galaxies more precisely and to justify studying them in depth. Adding elements that describe the objects of the universe serves different interests, for example the description of the geometrical nature of space-time singularities and of the physics which takes place there Penrose (2002), the study of astrophysical jets associated with outflows originating from accretion processes in star-forming regions or galaxy clusters Beall (May 2015), or the discovery of new luminous objects.
Generally, automated galaxy classification is performed as follows: extract relevant information from a galaxy image, encode it as efficiently as possible, and compare one galaxy encoding with a database of similarly encoded images. This research proposes the use of fractal dimension analysis to obtain additional information in order to determine the type of galaxy. Fractal dimension is present in many objects in nature, in structures generated by mathematical algorithms, in spatial interactions among populations, in the distributions of particles in amorphous solids, and in particle configurations created by computer simulations De la Calleja et al. (2016). Thus, we can measure the fractal characteristics of a wide variety of two-dimensional digital images, such as galaxy images. We have also performed an image analysis stage, which we introduced in earlier work De La Calleja and Fuentes (2004). In this stage we standardized the galaxy images, that is, we rotated, centered and cropped them. In addition, we use principal component analysis (PCA) to reduce the data and find features that characterize them. Finally, we have used machine learning algorithms to classify the three main types of galaxies; in particular, we used C4.5, k-nearest neighbors, random forest and support vector machines.
The remainder of this paper is organized as follows. The next section provides a theoretical background on the fractal dimension. Section III describes the adopted Hubble galaxy classification scheme. Section IV introduces the proposed galaxy classification methodology. Section V reports experimental results on real galaxy images and Section VI presents a discussion of results. Finally, Section VII outlines conclusions and future work directions.
II Fractal dimension
The fractal dimension is a topological, structural and spatial property that gives us information about the space where an object lives. The image of a galaxy is composed of many objects that appear to be close to each other, although this is not true: the distance between them can be billions of kilometers. Nevertheless, it is possible to classify the galaxies by taking into account the objects projected onto a two-dimensional image.
Fractality is present in many physical systems, from the distribution of particles at mesoscopic scales Weitz and Oliveria (1984) to macroscopic scales, such as the famous fractality of the coast of Britain Mandelbrot (1967). However, one of the basic discussions concerns the connectivity factor. According to the basic definition of a fractal, this property appears in connected systems. How can we extrapolate connectivity to galaxies? The superposition of stars gives us the connectivity that we require to obtain our structural parameter. The fractality property thus emerges in galaxies when large scales are taken into consideration, even where connectivity is apparently not inherent.
The local fractal dimension Halsey et al. (1986); Hentschel and Procaccia (1983); Ott (1993); Feigenbaum et al. (1986) is calculated following the standard procedure De la Calleja et al. (2016) using the generalized box counting dimension, defined as:
$$D_q = \frac{1}{1-q}\,\lim_{\epsilon \to 0}\frac{\ln I(q,\epsilon)}{\ln(1/\epsilon)}, \qquad I(q,\epsilon) = \sum_{i=1}^{N(\epsilon)} p_i^{\,q},$$
where $\epsilon$ is the size of the box, which takes successively smaller values of length until the minimum value is reached, $p_i$ is the fraction of the image measure contained in the $i$-th of the $N(\epsilon)$ boxes, and $q$ is a parameter which gives the width of the spectrum; when $q = 0$ the generalized fractal dimension represents the classic fractal dimension. We calculate the fractal dimension of each image from the multi-fractal spectrum obtained with the plugin FracLac for ImageJ Ferreira and Rasband (2013); Chhabra and Jensen (1989); Chhabra et al. (1989), which gives us the generalized fractal dimension using the gray scale differential option and the default sampling sizes.
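As a rough illustration of the classic box-counting case, the following sketch counts the occupied boxes of a binary pixel set at several box sizes and fits the slope of the log-log relation. It is not the gray scale differential method of FracLac used in this work; the function name `box_counting_dimension`, the chosen box sizes and the synthetic shapes are assumptions made only for illustration.

```python
import math

def box_counting_dimension(pixels, sizes):
    """Estimate the box-counting dimension of a set of foreground pixels.

    pixels: iterable of (row, col) coordinates of foreground pixels.
    sizes:  box side lengths (in pixels) to test.
    """
    pixels = list(pixels)
    xs, ys = [], []
    for eps in sizes:
        # A box is occupied if at least one foreground pixel falls in it.
        occupied = {(r // eps, c // eps) for r, c in pixels}
        xs.append(math.log(1.0 / eps))
        ys.append(math.log(len(occupied)))
    # Least-squares slope of log N(eps) versus log(1/eps).
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

# Two synthetic shapes on a 64x64 grid (not galaxy data).
filled_square = [(r, c) for r in range(64) for c in range(64)]
straight_line = [(32, c) for c in range(64)]
d_square = box_counting_dimension(filled_square, [1, 2, 4, 8, 16])
d_line = box_counting_dimension(straight_line, [1, 2, 4, 8, 16])
```

For these two shapes the estimate recovers the expected integer dimensions (2 for the filled square, 1 for the line); real galaxy images fall in between and require the gray-level information that this simplified version ignores.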
III The Hubble tuning fork scheme
Galaxies are large systems of stars and clouds of gas and dust, all held together by gravity. Galaxies have many different characteristics, but the easiest way to classify them is by their shape; Edwin Hubble devised a basic method for classifying them in this way Ball (2002). In his classification scheme, there are three main types of galaxies: Ellipticals, Spirals and Irregulars (Figure 1). The different shapes of galaxies tell us something about their properties, such as luminous mass, diameter and interstellar material, among others. Elliptical galaxies have the shape of an ellipsoid. Spiral galaxies are divided into ordinary and barred: ordinary spirals have an approximately spherical nucleus, while barred spirals have an elongated nucleus that looks like a bar. Finally, irregular galaxies do not have an obvious elliptical or spiral shape. In our study we have classified the three main types, using a data set of 131 images as shown in Table 1.
Table 1: Number of galaxies of each type in the data set (columns: Type, Number of galaxies).
IV The classification process
The classification process consists of three main stages: image analysis, feature extraction and automated classification. In the image analysis stage the galaxy images are rotated, centered and cropped. After that, we use principal component analysis to reduce the dimensionality of the data and to find a set of features; in addition, we calculate the fractal dimension of the galaxy images. Finally, these features are the input parameters for machine learning algorithms in order to classify galaxies. The next subsections describe each stage in detail.
iv.1 Image analysis
Galaxy images are generally of different sizes and color formats, and most of the time the galaxy contained in the image is not at the center. Therefore, the aim of this stage is to create images invariant to color, position, orientation and size. We introduced this image analysis process in earlier work De La Calleja and Fuentes (2004); thus, we only present a brief description of this stage here.
First, the galaxy is found in the image by applying a threshold, that is, from the original image $I$ we generate a binary image $B$ that marks the pixels that make up the galaxy. Then we obtain $\bar{i}$ and $\bar{j}$, the center row and column of the galaxy in the image, respectively. Next we obtain the covariance matrix of the points in the galaxy image, given by
$$C = \sum_{i}\sum_{j} B(i,j)\,[\,i-\bar{i},\; j-\bar{j}\,]^{T}\,[\,i-\bar{i},\; j-\bar{j}\,].$$
The galaxy’s main axis is given by the first eigenvector of $C$. Then the image is rotated so that the main axis is horizontal. After that the image is cropped, eliminating the columns that contain only background (black) pixels. Finally, we stretch and standardize the images to a size of 128 × 128 pixels. Figure 2 shows examples of original and standardized images for an elliptical galaxy, a spiral galaxy, and an irregular galaxy.
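The thresholding, centering and main-axis steps described above can be sketched as follows. This is a minimal illustration and not the original implementation: the function name `main_axis_angle`, the threshold value and the toy image are assumptions, and the rotation, cropping and resizing steps are left out.

```python
import math

def main_axis_angle(image, threshold):
    """Return (center_row, center_col, angle) of the thresholded object.

    image: list of rows of gray levels; threshold: assumed cut-off value.
    angle is the orientation of the principal axis in radians, measured
    from the column (horizontal) direction towards increasing rows.
    """
    # Binary mask: pixels brighter than the threshold belong to the galaxy.
    points = [(i, j) for i, row in enumerate(image)
              for j, value in enumerate(row) if value > threshold]
    n = len(points)
    ci = sum(i for i, _ in points) / n
    cj = sum(j for _, j in points) / n
    # Entries of the 2x2 covariance (scatter) matrix of the pixel coordinates.
    s_ii = sum((i - ci) ** 2 for i, _ in points)
    s_jj = sum((j - cj) ** 2 for _, j in points)
    s_ij = sum((i - ci) * (j - cj) for i, j in points)
    # Closed-form direction of the leading eigenvector of a symmetric 2x2 matrix.
    angle = 0.5 * math.atan2(2.0 * s_ij, s_jj - s_ii)
    return ci, cj, angle

# Toy image: a bright horizontal bar on a dark background (not galaxy data).
bar = [[0] * 9 for _ in range(5)]
for j in range(1, 8):
    bar[2][j] = 10
center_row, center_col, angle = main_axis_angle(bar, threshold=5)
```

Rotating the image by `-angle` about (`center_row`, `center_col`) would then bring the main axis to the horizontal, after which the cropping and resizing steps can be applied.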
iv.2 Feature extraction
A general idea for galaxy classification is to extract relevant information (features) from a galaxy image, encode it as efficiently as possible, and compare one galaxy encoding with a database of similarly encoded images. However, one of the difficulties when performing this task is to find a set of characteristics that describe the galaxies as well as possible. Generally, the galaxy image is transformed into a vector of feature values. In this study, we have used principal component analysis (PCA) to find this set of relevant features; in addition, we have calculated the fractal dimension of the galaxy images.
iv.2.1 Principal component analysis
The basic idea in PCA is to find the components (the eigenvectors) of the covariance matrix of the set of objects, so that they explain the maximum amount of variance possible by linearly transformed components. These eigenvectors can be thought of as a set of features which together characterize the variation among the objects Turk and Pentland (1991), in this case the galaxies.
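The formulation of standard PCA is as follows. Consider a set of $M$ objects $\Gamma_1, \Gamma_2, \ldots, \Gamma_M$, where the mean object of the set is defined by
$$\Psi = \frac{1}{M}\sum_{n=1}^{M}\Gamma_n.$$
Each object differs from the mean by the vector
$$\Phi_i = \Gamma_i - \Psi.$$
Principal component analysis then seeks a set of $M$ orthogonal vectors $u_k$ and their associated eigenvalues $\lambda_k$ which best describe the distribution of the data. The vectors $u_k$ and scalars $\lambda_k$ are the eigenvectors and eigenvalues, respectively, of the covariance matrix
$$C = \frac{1}{M}\sum_{n=1}^{M}\Phi_n\Phi_n^{T} = \frac{1}{M}AA^{T},$$
where the matrix $A = [\Phi_1\ \Phi_2\ \ldots\ \Phi_M]$.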
The associated eigenvalues allow us to rank the eigenvectors (features) according to their usefulness in characterizing the variation among the objects. In our study, we have used 8 and 12 principal components, which represent about of the information in original and standardized images, respectively; and 21 and 29 principal components, which represent about of the information in the same way (Figure 3).
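A minimal sketch of this feature extraction step is given below, assuming NumPy and synthetic random vectors in place of galaxy images. The names `pca_fit` and `pca_project` are hypothetical, and the singular value decomposition is used here as a convenient way to obtain the eigenvectors of the covariance matrix; it is not claimed to be the procedure used in the experiments.

```python
import numpy as np

def pca_fit(objects, n_components):
    """objects: array of shape (M, d), one flattened image per row."""
    mean = objects.mean(axis=0)               # mean object
    centered = objects - mean                 # each row is object minus mean
    # SVD of the centered data yields the eigenvectors of the covariance
    # matrix without forming the d x d matrix explicitly.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    eigenvalues = singular_values ** 2 / objects.shape[0]
    return mean, vt[:n_components], eigenvalues

def pca_project(objects, mean, components):
    """Coordinates of each object in the principal-component basis."""
    return (objects - mean) @ components.T

rng = np.random.default_rng(0)
images = rng.random((20, 64))                 # 20 synthetic 8x8 "images"
mean, components, eigenvalues = pca_fit(images, n_components=8)
features = pca_project(images, mean, components)
# Fraction of the total variance retained by the first 8 components.
explained = eigenvalues[:8].sum() / eigenvalues.sum()
```

The rows of `features` are the projections that would be passed to a classifier, and `explained` is the kind of quantity used to decide how many components to keep.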
iv.2.2 Fractal dimension
We obtain the generalized fractal dimension of different samples of the three main types of galaxies. Fig. 4(a) presents the fractal dimension of elliptical galaxies; most of the values of fractal dimension are around . The behavior of spiral galaxies, presented in Fig. 4(b), looks a bit different, because most of the values fall between . Finally, Fig. 4(c) shows that irregular galaxies obtained values around . The ranges of fractal dimension of these three types of galaxies are evidently different. We can also observe that the vast majority of spiral galaxies lie in a very similar range; irregular galaxies also exhibit this behavior, while elliptical galaxies vary a little more.
iv.3 Automated classification
In order to classify the galaxy images we used the following machine learning algorithms: C4.5, k-nearest neighbors, random forest and support vector machines. Each algorithm takes as input the projection of the images onto the principal components, and also the value of the fractal dimension. Then the algorithms classify the images according to the three main types of the Hubble sequence of galaxies. Next we give a brief description of the machine learning algorithms used in this study.
iv.3.1 C4.5
C4.5 is an extension of Quinlan’s earlier ID3 algorithm used to generate a decision tree. This algorithm operates by recursively splitting a training set based on feature values to produce a tree such that each example can end up in only one leaf. An initial feature is chosen as the root of the tree, and the examples are split among branches based on the feature value for each example. If the values are continuous, then each branch takes a certain range of values. Then a new feature is chosen, and the process is repeated for the remaining examples. Finally, the tree is converted to an equivalent rule set, which is pruned. For a deeper introduction to this algorithm we refer the reader to Quinlan (1986).
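To make the splitting step concrete, the sketch below scores a candidate threshold on a continuous feature with the gain ratio, the criterion commonly associated with C4.5. The feature values and labels are invented for illustration, and the helper names `entropy` and `gain_ratio` are not taken from any particular implementation.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def gain_ratio(values, labels, threshold):
    """Gain ratio of the binary split `value <= threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:
        return 0.0  # the split does not separate anything
    total = len(labels)
    weights = [len(left) / total, len(right) / total]
    # Information gain: entropy before the split minus weighted entropy after.
    gain = entropy(labels) - (weights[0] * entropy(left)
                              + weights[1] * entropy(right))
    # Split information penalizes very unbalanced partitions.
    split_info = -sum(w * math.log2(w) for w in weights)
    return gain / split_info

# Hypothetical feature values for six examples of two classes.
values = [1.0, 1.1, 1.2, 1.8, 1.9, 2.0]
labels = ["E", "E", "E", "S", "S", "S"]
best = gain_ratio(values, labels, threshold=1.5)
```

A tree learner would evaluate such a score for every feature and candidate threshold, pick the best one for the current node, and recurse on the two resulting subsets.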
iv.3.2 k-nearest neighbors
K-nearest neighbors (k-nn) belongs to the family of instance-based learning methods. These algorithms simply store all the available training data, and when a new query instance is encountered, they find the training examples most similar to the query and use them to classify it. The nearest neighbors of a query are defined in terms of the standard Euclidean distance. For details of this method we refer the reader to Mitchell (1997).
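A small sketch of a distance-weighted k-nn vote follows. The inverse-distance weighting is one common choice and is an assumption here, as are the function name `knn_predict` and the two-dimensional toy points.

```python
import math
from collections import defaultdict

def knn_predict(train_x, train_y, query, k=3):
    """Classify `query` by a distance-weighted vote of its k nearest neighbors."""
    distances = sorted(
        (math.dist(x, query), y) for x, y in zip(train_x, train_y)
    )
    votes = defaultdict(float)
    for distance, label in distances[:k]:
        # Inverse-distance weighting; the small constant avoids division by zero.
        votes[label] += 1.0 / (distance + 1e-12)
    return max(votes, key=votes.get)

# Toy training set with two well-separated groups (not galaxy features).
train_x = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = ["elliptical"] * 3 + ["spiral"] * 3
first = knn_predict(train_x, train_y, (0.2, 0.1))
second = knn_predict(train_x, train_y, (5.5, 5.5))
```

Since nothing is learned ahead of time, all the cost is paid at query time, when the distances to every stored example are computed.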
iv.3.3 Random forest
A random forest (RF) is a classifier consisting of a collection of individual tree classifiers. Each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. Basically, a random forest is an ensemble of unpruned trees, induced from bootstrap samples of the training data, using random feature selection in the tree induction process. Prediction is done by majority vote over the predictions of the ensemble of trees. Details about this method can be found in Breiman (2001).
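The three ingredients named above (bootstrap samples, random feature selection and majority voting) can be sketched as follows. To keep the example short, each tree is replaced by a one-level stump that thresholds a single randomly chosen feature at its mean; this is a deliberate simplification and not the unpruned trees of an actual random forest. All names and data are hypothetical.

```python
import random
from collections import Counter

def fit_stump(rows, labels, feature):
    """One-level tree on a single feature: threshold at the feature mean."""
    threshold = sum(r[feature] for r in rows) / len(rows)
    left = [y for r, y in zip(rows, labels) if r[feature] <= threshold]
    right = [y for r, y in zip(rows, labels) if r[feature] > threshold]
    majority = Counter(labels).most_common(1)[0][0]
    left_label = Counter(left).most_common(1)[0][0] if left else majority
    right_label = Counter(right).most_common(1)[0][0] if right else majority
    return feature, threshold, left_label, right_label

def fit_forest(rows, labels, n_trees, rng):
    forest = []
    for _ in range(n_trees):
        # Bootstrap sample: draw len(rows) examples with replacement.
        idx = [rng.randrange(len(rows)) for _ in rows]
        # Random feature selection for this tree.
        feature = rng.randrange(len(rows[0]))
        forest.append(fit_stump([rows[i] for i in idx],
                                [labels[i] for i in idx], feature))
    return forest

def predict_forest(forest, row):
    """Majority vote over the predictions of all stumps."""
    votes = [left if row[f] <= t else right for f, t, left, right in forest]
    return Counter(votes).most_common(1)[0][0]

# Made-up two-feature examples of two classes.
rows = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (1.0, 0.9), (0.9, 1.1), (1.2, 1.0)]
labels = ["elliptical"] * 3 + ["spiral"] * 3
forest = fit_forest(rows, labels, n_trees=25, rng=random.Random(0))
prediction = predict_forest(forest, (0.1, 0.1))
```

Because each stump sees a different bootstrap sample and feature, individual stumps can disagree; the vote is what makes the ensemble more stable than any single member.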
iv.3.4 Support vector machines
Support Vector Machines (SVMs) Vapnik (1995) are based on the Structural Risk Minimization principle from computational learning theory. This principle provides a formal mechanism to select a hypothesis from a hypothesis space for learning from finite training data sets. The aim of SVMs is to compute the hyperplane that best separates a set of training examples. Two cases are analyzed: the linearly separable case and the non-linearly separable case. In the first case we look for the optimal hyperplane in the set of hyperplanes separating the given training examples. The optimal hyperplane maximizes the sum of the distances to the closest positive and negative training examples (considering only two classes). The second case is solved by mapping the training examples to a high-dimensional feature space using kernel functions. In this space the decision boundary is linear and we can apply the first case. There are several kernels, such as polynomial, radial basis functions, neural networks, Fourier series, and splines, among others, which are chosen depending on the application.
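The role of the kernel can be illustrated with a degree-two polynomial kernel on two-dimensional inputs: evaluating the kernel gives the same number as an inner product in an explicit higher-dimensional feature space, without ever building that space. The homogeneous form used below is an assumption, since the exact kernel parametrization is not specified here, and the function names are hypothetical.

```python
import math

def poly_kernel(x, z, degree=2):
    """Homogeneous polynomial kernel K(x, z) = (x . z) ** degree."""
    return sum(a * b for a, b in zip(x, z)) ** degree

def feature_map_2d(x):
    """Explicit feature map matching the degree-2 kernel on 2-D inputs."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

x, z = (1.0, 2.0), (3.0, -1.0)
kernel_value = poly_kernel(x, z)
explicit_value = sum(a * b for a, b in zip(feature_map_2d(x), feature_map_2d(z)))
```

Both quantities agree up to rounding, which is why a linear separator in the mapped space corresponds to a quadratic decision boundary in the original features.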
V Experimental Results
The data set consisted of 131 images of galaxies (see Table 1). Most of them were taken from the NGC catalog on the web page of the Astronomical Society of the Pacific, and their classification was taken from the interactive NGC catalog online at www.seds.org.
The experiments were carried out using Weka, a software package that implements machine learning algorithms for data mining tasks Frank et al. (2016). In order to measure the overall accuracy of the machine learning algorithms, we used 10-fold cross-validation for all the experiments; that is, the original data set is randomly divided into ten equally sized subsets and ten experiments are performed, using in each experiment one of the subsets for testing and the other nine for training. As previously mentioned, we experimented with the following machine learning algorithms: decision trees, k-nearest neighbors, random forest and support vector machines. For decision trees and random forest we used the default parameters. For k-nn we used three neighbors with weighted distance, and for support vector machines we used a two-degree polynomial kernel.
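The fold construction can be sketched in a few lines. This is an illustrative stand-in and not Weka's implementation: the names `k_fold_indices` and `cross_validate`, the seed, and the `evaluate` callback (which would train on one index list and return the accuracy on the other) are all assumptions.

```python
import random

def k_fold_indices(n_samples, k, seed):
    """Shuffle the sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(n_samples, k, seed, evaluate):
    """Average the score returned by `evaluate(train_idx, test_idx)` over folds."""
    folds = k_fold_indices(n_samples, k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        # All folds except the i-th one form the training set.
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        scores.append(evaluate(train_idx, test_idx))
    return sum(scores) / k

# 131 samples split into 10 folds, as in the experimental setup.
folds = k_fold_indices(131, 10, seed=0)
fold_sizes = sorted(len(fold) for fold in folds)
```

With 131 images the folds cannot be exactly equal, so one fold holds 14 images and the remaining nine hold 13 each. Repeating the procedure with different seeds and averaging gives the multi-run estimate.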
Tables 2 and 3 show the accuracy for each machine learning algorithm using the original images and the standardized ones, respectively. These results were obtained by averaging the results of five runs of 10-fold cross-validation for each algorithm. As we can observe from Table 2, the best results were obtained by 3-nearest neighbors, with 81.82 per cent accuracy using only PCs, and 81.67 per cent accuracy using PCs plus the fractal dimension value. On the other hand, we can see from Table 3, that random forest obtained the best results with 85.95 and 86.71 per cent accuracy, using PCs and PCs plus the fractal dimension value, respectively.
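Table 2: Accuracy of each machine learning algorithm using the original images (columns: PCs, PCs + FDV).
Table 3: Accuracy of each machine learning algorithm using the standardized images (columns: PCs, PCs + FDV).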
In Tables 4 and 5 we present the accuracy for the algorithms using only one feature, that is, 1 principal component or the fractal dimension value. Also we show the results using 1 PC plus the fractal dimension value. From these Tables we can observe that random forest obtained five of the best results, while C4.5 obtained the other one. In addition we can see that, on average, classification using the fractal dimension value is better than using 1 principal component, considering standardized images.
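Table 4: Accuracy of the algorithms using only one feature (columns: Algorithm, 1 PC, FDV, 1 PC + FDV).
Table 5: Accuracy of the algorithms using only one feature (columns: Algorithm, 1 PC, FDV, 1 PC + FDV).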
VI Discussion
The difference in the fractal dimension values between the three types of galaxies analyzed is presented in Fig. 5. The characterization by fractal dimension gives the classification of complicated galaxy images a mathematical basis for distinguishing between types. The results presented in Tables 2 and 3 show that the best results are obtained when standardized images and fractal dimension are used, particularly using random forest with 29 PCs plus the fractal dimension value. In addition, we can observe that support vector machines was the algorithm with the greatest increase in accuracy when using PCs plus the fractal dimension value, that is, from 73.27 to 85.29 per cent accuracy using 29 and 30 features, respectively. In fact, on average across the different classifiers, the performance improved by more than 4 per cent when including the fractal dimension as a feature.
In Tables 6, 7 and 8 we present the confusion matrix for the best algorithm to classify elliptical, spiral and irregular galaxies, respectively. From these results we can see that 3-nearest neighbors was the best algorithm to classify elliptical galaxies, with 88.9 per cent accuracy; random forest classified the spiral galaxies with 100 per cent accuracy; while C4.5 was the best algorithm to classify irregular galaxies, with 40 per cent accuracy. We can also observe that none of the best results for elliptical and spiral galaxies classified irregular galaxies correctly. On the other hand, when irregular galaxies are classified correctly, the accuracy for elliptical galaxies decreases significantly, while the accuracy for spiral galaxies remains at about 87 per cent.
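Table 6: Confusion matrix for the best algorithm to classify elliptical galaxies (columns: Galaxy type, Elliptical, Spiral, Irregular, Accuracy per type).
Table 7: Confusion matrix for the best algorithm to classify spiral galaxies (columns: Galaxy type, Elliptical, Spiral, Irregular, Accuracy per type).
Table 8: Confusion matrix for the best algorithm to classify irregular galaxies (columns: Galaxy type, Elliptical, Spiral, Irregular, Accuracy per type).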
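The per-type accuracy reported alongside a confusion matrix is the diagonal entry of each row divided by the row total. The short sketch below shows the computation on invented counts; the numbers are placeholders and do not correspond to the results of this study.

```python
def per_type_accuracy(confusion, labels):
    """Fraction of each true class (row) that was predicted correctly."""
    return {
        label: row[i] / sum(row)
        for i, (label, row) in enumerate(zip(labels, confusion))
    }

labels = ["elliptical", "spiral", "irregular"]
# Rows are true types, columns are predicted types; counts are made up.
confusion = [
    [8, 2, 0],
    [1, 9, 0],
    [1, 3, 1],
]
accuracy = per_type_accuracy(confusion, labels)
# Overall accuracy: correctly classified examples over all examples.
overall = sum(confusion[i][i] for i in range(3)) / sum(map(sum, confusion))
```

A small class with few correct predictions, such as the third row here, can have a low per-type accuracy while barely affecting the overall figure, which is why both are worth reporting.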
VII Conclusions
In this paper we have applied, for the first time, fractal dimension analysis to perform automatic morphological galaxy classification. The fractal dimension analysis helps to distinguish the three main types of galaxies: elliptical, spiral and irregular. We found evident differences between the ranges of fractal dimension of the three groups of galaxies. By using the fractal dimension value as an additional attribute, it is possible to improve the classification accuracy, despite using a small set of images. The best results were obtained by 3-nearest neighbors and random forest using standardized images with PCs and the fractal dimension value. Future work includes repeating the experiments using a larger data set of galaxy images, and creating ensembles of classifiers to identify each type of galaxy separately.
EMCM acknowledges the financial support of CONACyT.
- Abell et al. (2009) P. A. Abell, J. Allison and S. F. Anderson, 2009.
- Ball et al. (2004) N. M. Ball, J. Loveday, M. Fukugita, O. Nakamura, S. Okamura, J. Brinkmann and R. J. Brunner, Monthly Notices of the Royal Astronomical Society, 2004, 348, 1038–1046.
- De La Calleja and Fuentes (2004) J. De La Calleja and O. Fuentes, Monthly Notices of the Royal Astronomical Society, 2004, 349, 87–93.
- Goderya and Lolling (2002) S. N. Goderya and S. M. Lolling, Astrophysics and Space Science, 2002, 279, 377–387.
- Sodré et al. (1992) L. Sodré, L. O. Storrie-Lombardi, M. C. and L. J. Storrie-Lombardi, Monthly Notices of the Royal Astronomical Society, 1992, 259,.
- Marin et al. (2013) M. A. Marin, L. E. Sucar, J. A. Gonzalez and R. Diaz, In: FLAIRS Conference, 2013.
- Owens et al. (1996) E. Owens, R. Griffiths and K. Ratnatunga, Monthly Notices of the Royal Astronomical Society, 1996, 281, 153–157.
- Shamir (2009) L. Shamir, Monthly Notices of the Royal Astronomical Society, 2009, 399, 1367–1372.
- Freed and Lee (2015) M. Freed and J. Lee, Journal of Data Analysis and Information Processing, 2015, 3,.
- Diaz-Hernandez et al. (2016) R. Diaz-Hernandez, A. Ortiz-Esquivel, H. Peregrina-Barreto, L. Altamirano-Robles and J. Gonzalez-Bernal, Experimental Astronomy, 2016, 41, 409–426.
- Dieleman et al. (2015) S. Dieleman, K. Willett and J. Dambre, Monthly Notices of the Royal Astronomical Society, 2015, 450, 1441–1459.
- Penrose (2002) R. Penrose, Gen. Relat. and Gravit., 2002, 34,.
- Beall (May 2015) J. H. Beall, Proceedings of Science. Bibcode:2015mbhe.confE..58B. Retrieved 19 February 2017, May 2015.
- De la Calleja et al. (2016) E. M. De la Calleja, F. Cervantes and J. De la Calleja, Annals of Physics, 2016, 371, 313–322.
- Weitz and Oliveria (1984) D. A. Weitz and M. Oliveria, Phys. Rev. Lett., 1984, 52, 1433.
- Mandelbrot (1967) B. B. Mandelbrot, Science, 1967, 156, 636–638.
- Halsey et al. (1986) T. C. Halsey, M. H. Jensen, L. P. Kadanoff, I. Procaccia and N. I. Shraiman, Phys. Rev. A., 1986, 33,.
- Hentschel and Procaccia (1983) G. E. Hentschel and I. Procaccia, Physica D, 1983, 8, 435–444.
- Ott (1993) E. Ott, 1993.
- Feigenbaum et al. (1986) M. J. Feigenbaum, M. H. Jensen and I. Procaccia, Phys. Rev. Lett., 1986, 57, 1503.
- Ferreira and Rasband (2013) T. Ferreira and W. Rasband, 2013.
- Chhabra and Jensen (1989) A. Chhabra and R. V. Jensen, Phys. Rev. Lett., 1989, 62, 1327.
- Chhabra et al. (1989) A. B. Chhabra, C. Meneveau, R. V. Jensen and K. R. Sreenivasan, Phys. Rev. A., 1989, 40,.
- Ball (2002) N. Ball, Master’s thesis, University of Sussex, 2002.
- Turk and Pentland (1991) M. A. Turk and A. P. Pentland, Proc. IEEE Conference on Computer Vision and Pattern Recognition, 1991, 586–591.
- Quinlan (1986) J. R. Quinlan, Machine Learning, 1986, 1, 81–106.
- Mitchell (1997) T. Mitchell, Machine learning, McGrawHill, 14th edn., 1997.
- Breiman (2001) L. Breiman, Machine Learning, 2001, 45, 5–32.
- Vapnik (1995) V. Vapnik, The nature of statistical learning theory, Springer, New York, 1st edn., 1995.
- Frank et al. (2016) E. Frank, M. Hall and I. Witten, The WEKA Workbench. Online Appendix for “Data Mining: Practical Machine Learning Tools and Techniques”, Morgan Kaufmann, 4th edn., 2016.