Bayesian Classification of Astronomical Objects — and what is behind it
We present a Bayesian method for the identification and classification of objects from sets of astronomical catalogs, given a predefined classification scheme. Identification refers here to the association of entries in different catalogs with a single object, and classification refers to the matching of the associated data set to a model selected from a set of parametrized models of different complexity. By virtue of Bayes’ theorem, we can combine both tasks efficiently, which allows a largely automated yet reliable generation of classified astronomical catalogs. One problem for the Bayesian approach is the handling of exceptions, for which no likelihoods can be specified. We present and discuss a simple and practical solution to this problem, emphasizing the role of the “evidence” term in Bayes’ theorem for the identification of exceptions. Comparing the practice and logic of Bayesian classification to Bayesian inference, we finally note some interesting links to concepts of the philosophy of science.
PACS: 95.80.+p, 95.75.Pq, 02.50.Tt, 01.70.+w
Astrophysics Dept./IMAPP, Radboud University, P.O. Box 9010, 6500 GL Nijmegen, the Netherlands; also at Max-Planck-Institute for Astrophysics, Karl-Schwarzschild-Str. 1, 85748 Garching, Germany
The identification and classification of objects from a set of independent catalogs is a key task for making astronomical data usable for scientific analysis. The standard approach is to solve this problem step by step using hierarchical “best-match” algorithms, as exemplified in the cross-identification of radio sources [Vollmer et al.(2005)] from the VizieR database of astronomical catalogs [Ochsenbein et al.(2000)]. Although such algorithms are fast and efficient in low-level applications, they are limited in dealing with ambiguities and in considering object classes with different levels of complexity. This is illustrated in the recent production of the band-merged version [Chen et al.(2013a)] of the Planck Early Release Compact Source Catalog (ERCSC) [Planck Collaboration(2011)], and the variability classification of ERCSC objects using WMAP data [Chen et al.(2013b)].
Motivated by this, we present here a Bayesian approach to object identification and classification, based on data from a set of astronomical catalogs taken, e.g., at different frequencies or by different observatories. In this method, we consider not only positional coincidence between catalog entries, but also the properties of known object classes, and use both as criteria for the identification and simultaneous classification of objects. The current paper focuses on the mathematical basics of this method, with some typical choices for priors and likelihoods needed for catalog generation. Applications to data, and a more detailed comparison with standard approaches will follow in future work.
2 Bayesian Association and Classification
2.1 Terms and Definitions
We understand probability in the Bayesian sense as an operator which assigns a value of plausibility, , to a statement , and we introduce the information Hamiltonian [Enßlin et al.(2009)] for a condition on as . Data (factual information) shall be denoted by blackboard-bold symbols (e.g., ), models (abstract beliefs) by calligraphic symbols (e.g., ). We denote a set of logically independent statements as and define for any condition
A set of mutually exclusive statements shall be denoted with with the definition
If a set of mutually exclusive statements is exhaustive, we call it a complete set of alternatives , with . The operator gives the number of elements of a set with ; a set containing a zero-indexed element is denoted .
Structure of Data and Associations
From a set of positions taken from a highly reliable seed catalog we select within a radius potentially associated data from independent target catalogs, where the seed catalog may be included as the zero-indexed element, . The entries of each target catalog form a complete set of alternatives, , where stands for the non-observation in the catalog with noise level and signal-to-noise limit , together with data entries . denote the positional distance of a data entry to the nominal seed coordinates and its error, while contain physical parameters and their errors. Finally, we define an association as a mapping determining one entry of each catalog , and denote . Obviously, associations form a complete set of alternatives, .
Models, Parameters and the Classification Scheme
Classification is based on a set of mutually exclusive models , each providing a physical description of a known object class as a set of functions that can be compared to the data values . is a physical quantity mapping a model prediction on a particular catalog (e.g., nominal frequency), and is a vector in the model parameter space of dimension . The (prior) probability assigned to a model is understood as a marginalization over the model parameter space, i.e., , where is called the parameter p.d.f. of .
A priori, we cannot assume that is exhaustive. This would mean , which poses a problem for the proper normalization of Bayesian posterior probabilities. We therefore introduce the classification scheme as a set of conditions that allows us to treat as an exhaustive set, and write and . In a more general sense, can be understood as the framework of factual information (data), beliefs (theories and ancillary hypotheses) and decisions (e.g., how to classify objects), which enables us to define and delimit our set of models .
2.2 Application of Bayes’ Theorem
Separating Association and Classification
The posterior probability for a candidate object can be written under application of the product rule as
The posterior probability of an association depends only on the set of coordinates which we denote by , and we can omit in the condition of this term. By application of Bayes’ theorem, both terms can be separately transformed as
As both and form complete sets of alternatives we obtain the evidence terms
where we have written the model likelihoods explicitly as integrals over their constrained parameter p.d.f.s , fulfilling .
Priors and Likelihoods of Association
Associations by themselves are just abstract combinations of numbers. Without referring to data or physical models, we have to assign the same value for all , except for some which can be excluded with certainty (e.g., if ). As a constant prior cancels in Eqs. 5 and 7, we can set in all terms with .
The likelihood of an association is determined by two contributions: (a) the probability that an associated data point is observed at an effective distance within an effective accuracy , and (b) the confusion probability of finding a given number of unrelated data points in a catalog of mean source density within a radius , i.e., the Poisson probability
The effective distance and accuracy account for the seed position error by defining and , thus . It is then straightforward to see [footnote 1: Likelihoods containing contributions from both data point associations (Gaussian p.d.f.s) and non-observations (probabilities) require conditions that normalize the relative contribution of both kinds of terms. In this is done by requiring for matching coordinates measured with arbitrary precision, . See also footnote 3.] that
The seed catalog does not contribute to this term as its association is logically implied.
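The two contributions to the association likelihood described above can be sketched numerically. The following is a minimal illustration, not the formula of this paper: the function names are hypothetical, the confusion term is the standard Poisson probability with mean equal to source density times search area, and the positional term is assumed to be a simple one-dimensional Gaussian in the effective distance.

```python
import math


def poisson_confusion(k, density, radius):
    """Probability of finding exactly k unrelated sources within `radius`,
    for a catalog with mean surface density `density` (sources per unit area)."""
    mean = density * math.pi * radius ** 2  # expected number of chance sources
    return mean ** k * math.exp(-mean) / math.factorial(k)


def positional_density(distance, accuracy):
    """ASSUMPTION: Gaussian p.d.f. of the effective distance, with the
    effective accuracy as its standard deviation."""
    norm = accuracy * math.sqrt(2.0 * math.pi)
    return math.exp(-0.5 * (distance / accuracy) ** 2) / norm


# Example: 0.2 sources per square arcmin, 1.5 arcmin search radius.
p_no_confusion = poisson_confusion(0, density=0.2, radius=1.5)
p_one_confusing = poisson_confusion(1, density=0.2, radius=1.5)
```

In a full association likelihood, one such pair of factors would enter for each target catalog, with the seed catalog omitted.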
Priors and Likelihoods of Classification
Priors in classification are given by the model parameter p.d.f.s, the functional shape of which is part of the model generation and will not be discussed here. For the normalization of the prior p.d.f.s, relative abundances of known object classes from previous classifications can be used.
The classification likelihood is the probability of the data points and non-observations in to match with the model prediction, and we can write
The second term considers the contribution of assumed non-observations, and
is the usual “goodness-of-fit” measure for the data points in . [Footnote 2: Here we require that for the fiducial case , the probability of a data point to be consistent with the model prediction is equal to the probability of a non-observation.] [Footnote 3: We use Gaussian p.d.f.s because we assumed in our data structure that only one error parameter is given for each position or quantity. If more detailed error information is available, the definition of the corresponding likelihoods has to be adapted.]
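As an illustration of the goodness-of-fit measure for Gaussian errors, the sketch below evaluates a chi-square sum for a set of data points against a model prediction. The power-law model, the parameter names and the numbers are hypothetical and only serve as an example; the contribution of non-observations, which depends on the normalization condition of footnote 2, is not included.

```python
import math


def power_law(nu, theta):
    """Hypothetical model class: value = amplitude * nu ** index."""
    amplitude, index = theta
    return amplitude * nu ** index


def chi_square(nus, values, errors, model, theta):
    """Sum of squared, error-normalized residuals between data and model."""
    return sum(((s - model(nu, theta)) / err) ** 2
               for nu, s, err in zip(nus, values, errors))


def gaussian_log_likelihood(nus, values, errors, model, theta):
    """Log-likelihood of the data points for independent Gaussian errors."""
    chi2 = chi_square(nus, values, errors, model, theta)
    log_norm = sum(math.log(err * math.sqrt(2.0 * math.pi)) for err in errors)
    return -0.5 * chi2 - log_norm


# Example with made-up catalog values at three nominal frequencies.
nus = [30.0, 44.0, 70.0]
values = [0.37, 0.31, 0.23]
errors = [0.02, 0.03, 0.02]
chi2 = chi_square(nus, values, errors, power_law, theta=(2.0, -0.5))
```

In the Bayesian scheme this quantity is not minimized over the parameters; it enters the likelihood that is then integrated over the parameter p.d.f. of the model.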
2.3 Classifying Objects
Definition and Properties of Confidence
Following Jaynes [E.T. Jaynes(2003), § 4] we use the logarithm of the odds ratio to compare our classifications and define the confidence of a candidate object as
The object of choice would then be the candidate object with maximum confidence, , and we denote the corresponding indices as and . Analogously, we can define the confidence of a data association as
and denote value and index of the maximum as and , respectively.
Eq. 3 implies that for all . Because of , we can have for only one combination . As this implies , it can be taken as a condition for a unique and consistent object choice, preferring one over all others. Conversely, we cannot conclude from that , nor can we conclude that .
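The properties of the confidence discussed above can be checked with a small numerical sketch. It assumes that the confidence is the natural logarithm of the odds, ln[p/(1-p)], measured in nat, and that the posterior of an association is obtained by summing the object posteriors over all models; the labels and weights are made up for illustration.

```python
import math


def posteriors(weights):
    """Normalize prior-times-likelihood weights over a complete set of alternatives."""
    total = sum(weights.values())
    return {key: w / total for key, w in weights.items()}


def confidence(p):
    """Log odds in nat (ASSUMED definition): ln(p / (1 - p))."""
    if p <= 0.0:
        return -math.inf
    if p >= 1.0:
        return math.inf
    return math.log(p / (1.0 - p))


# Hypothetical candidate objects: (association, model) -> prior * likelihood.
weights = {
    ("assoc_0", "model_a"): 6.0e-3,
    ("assoc_0", "model_b"): 2.5e-3,
    ("assoc_1", "model_a"): 1.5e-3,
}
object_post = posteriors(weights)
object_conf = {key: confidence(p) for key, p in object_post.items()}

# Association posterior: marginalize the object posterior over the models.
assoc_post = {}
for (assoc, _model), p in object_post.items():
    assoc_post[assoc] = assoc_post.get(assoc, 0.0) + p
assoc_conf = {assoc: confidence(p) for assoc, p in assoc_post.items()}

best_object = max(object_conf, key=object_conf.get)
```

Since the posteriors of a complete set of alternatives sum to one, at most one candidate can have a posterior above 1/2, so at most one candidate object can have positive confidence. In this example the confidence of the best association is larger than that of the best object, because the association collects the posterior of both models.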
Based on the discussion above, we can define for each object a quality rating, defining potential actions to be taken for catalog validation and verification. The most basic scheme would contain four ratings as follows.
Rating A selects clear cases, and is given for nat, corresponding to a rejection probability of the best alternative of (db). No or little human inspection is necessary in these cases, and results of rejected object associations can be deleted. Rating B would be applied for , and indicates likely cases, while rating C would be applied to potentially ambiguous cases with while . Both require human inspection at different levels, and all results with should be kept for validation. Finally, a rating D () identifies objects which would normally be rejected in catalog generation, but which may still be interesting to look at for research purposes. Of course, this scheme may be adapted to the needs of reliability, and it may make sense to split up rating A using a sequence of increasing .
2.4 Odd Objects
We are left with a problem: Assume there is an object which does not fit into any of our model classes. How would it appear in our classification?
Obviously, for such objects all integrals in the sum of Eq. 6 would become very small, so would become very small, even if the data association has a high confidence. We therefore introduce the counter-evidence for an associated object to fit into the classification scheme as
and . Thus, can be seen as the information Hamiltonian of the classification scheme, taken for the association for which it becomes minimal. Large values for are an indication of classification exceptions. Following the US performance artist Laurie Anderson [footnote 4: Laurie Anderson, United States Live, Warner Bros. (1983)], we call such cases odd objects: while exceptions are usually expected to be results of instrumental errors or defects in the target catalogs (“just odd”), they could also indicate the discovery of a new, unexpected object type (“useful”).
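One possible reading of the counter-evidence can be sketched as follows. The sketch assumes that, for each association, the evidence is the prior-weighted sum of the parameter-marginalized model likelihoods, and that the counter-evidence is the negative logarithm of this evidence for the association where it is smallest. The function name and all numbers are hypothetical.

```python
import math


def counter_evidence(model_priors, marginal_likelihoods):
    """Minimum over associations of -ln(evidence).

    marginal_likelihoods[i][k] is the (already parameter-marginalized)
    likelihood of model k for association i; model_priors[k] is the total
    prior of model k within the classification scheme.
    """
    best = math.inf
    for row in marginal_likelihoods:
        evidence = sum(p * like for p, like in zip(model_priors, row))
        hamiltonian = -math.log(evidence) if evidence > 0.0 else math.inf
        best = min(best, hamiltonian)
    return best


priors = [0.7, 0.3]
typical = counter_evidence(priors, [[1e-2, 1e-3], [1e-4, 1e-5]])
odd = counter_evidence(priors, [[1e-9, 1e-10], [1e-12, 1e-11]])
```

An object for which no model yields an appreciable likelihood under any association obtains a much larger value than a typical object, which is the signature of a classification exception described above.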
Introducing an exception class
That our method mingles candidates for rejection with candidates for discovery is a defect obtained by forcing the condition onto an, in principle arbitrary, classification scheme . To overcome this problem, we introduce an odd-object class defined by a single parameter
for all . As is the logical complement of the set , we have without conditions, and we obtain the total evidence
This implies , and we define the confidence for an object to be odd as
Moreover, objects of classes need to fulfill for some in order to receive a rating B or above, while there was no such limit in . To prevent odd objects from being accidentally considered “clear cases” if for all , we introduce a sub-rating Ao A for , which requires human inspection.
3 Discussion and Philosophical Epilogue
3.1 Benefits of Bayesian Classification
Models and Priors: Experience vs. Bias
In Bayesian classification we use models and priors, which are usually suspected in catalog generation to introduce bias. Shouldn’t we use only the information contained in the given data set in order to be objective? Our Bayesian answer is: No, we shouldn’t, and in fact, we never do. In general, it is the advantage of Bayesian methods to clearly state our priors, while orthodox methods often hide the prior assumptions used. For the special case of catalog generation, this means that we always have additional data available, usually in a complex and incoherent form, and also widely accepted models describing the nature of our potential objects, and these data and models are used by “experienced astronomers” in the process called catalog validation. All we do by introducing models and priors is to automate part of this experience, i.e., provide a condensed description of our prior knowledge and beliefs to the classification procedure. Our quality rating ensures that this affects only the trivial, routine tasks of validation, and prevents potentially interesting alternatives to the best assignments from being prematurely dropped (e.g., cases with ).
Beyond Best Fits: Robustness and Model Complexity
Our method exhibits a fundamental aspect of Bayesian classification: Model parameters are not optimized as in “best-fit” approaches, but marginalized in Eqs. 4 and 6. We emphasize that this is implied by plausibility logic: The question is not which model can produce an optimal fit to the data for some parameter choice, but which model explains the data in the most natural way, given prior expectations for its parameters.
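The difference between optimizing and marginalizing a parameter can be made concrete with a toy calculation. The sketch below is an illustration under simple assumptions, not part of the method as published: a single parameter with a uniform prior on a fixed interval, and two hypothetical likelihood shapes, one sharply peaked with a perfect best fit and one broad with a moderate best fit.

```python
import math


def marginal_likelihood(likelihood, lower, upper, steps=20000):
    """Average of `likelihood` over a uniform prior on [lower, upper],
    computed with the trapezoidal rule."""
    h = (upper - lower) / steps
    total = 0.5 * (likelihood(lower) + likelihood(upper))
    for i in range(1, steps):
        total += likelihood(lower + i * h)
    return total * h / (upper - lower)


def make_likelihood(best_value, width, peak):
    """Hypothetical likelihood: Gaussian-shaped in the parameter."""
    def likelihood(theta):
        return peak * math.exp(-0.5 * ((theta - best_value) / width) ** 2)
    return likelihood


# Fine-tuned parameter: perfect fit, but only in a tiny part of the prior range.
fine_tuned = make_likelihood(best_value=0.3, width=0.01, peak=1.0)
# Robust parameter: moderate fit over a large part of the prior range.
robust = make_likelihood(best_value=0.3, width=2.0, peak=0.5)

best_fit_fine, best_fit_robust = fine_tuned(0.3), robust(0.3)
marg_fine = marginal_likelihood(fine_tuned, -5.0, 5.0)
marg_robust = marginal_likelihood(robust, -5.0, 5.0)
```

A best-fit comparison prefers the fine-tuned case, while the marginalized likelihood prefers the robust one by roughly two orders of magnitude. For the narrow peak, the marginal likelihood is approximately the peak value times the ratio of the peak width (times the square root of 2π) to the prior width, which is the kind of factor discussed in the remainder of this section.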
To discuss this in more detail, let us consider one parameter dimension of the model parameter space , and assume that , with for and otherwise. Moreover, we assume that for we achieve integrated over all parameter dimensions except . Varying may decrease to a value (i.e., increase the likelihood) in some regime , while it decreases the likelihood () in some other regime . Everywhere else we assume . Defining
we immediately obtain for the change in confidence caused by parameter
A significant increase of the model confidence is only obtained if , i.e., if a significant net improvement of the fit quality averaged over the “prior mass” of the parameter is achieved. We shall call parameters with this property robust, while parameters with shall be called fragile.
The factors are equivalent to the Ockham factors defined by Jaynes [E.T. Jaynes(2003), § 20], referring to the principle of simplicity known as Ockham’s razor. However, Eq. 19 shows that Bayesian logic does not lead to a flat penalization of model complexity; rather, a parameter which does not affect the fit quality () does not affect the model confidence. It therefore seems more appropriate to say that Bayesian logic penalizes fine tuning, i.e., the introduction of fragile parameters with few prior constraints for the mere purpose of improving the “best fit” for some particular choice of parameter values. [Footnote 5: In his discussion of this topic on p. 605-607 of his book [E.T. Jaynes(2003)], Jaynes implicitly assumes that the likelihood is significantly different from zero only within . If a moderately good match has been achieved without the parameter , this is equivalent to setting and in Eq. 19, yielding . Now the Ockham factor indeed penalizes the model complexity as it requires for significant improvement of confidence (note that ).]
Bayesian Learning: Updating the Classification Scheme
Classification is naturally applied to a large number of objects , which allows us to use posterior number distributions to iteratively update all prior assumptions we have entered. In particular, total model priors can be updated as
where denotes the set of all A-rated objects [in model class ]. In the same way, updates can be applied to the shape of prior p.d.f.s of the models, if these are determined by empirical parameters.
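A simple way to picture this update is to count A-rated objects per model class. The sketch below assumes that the updated total prior of a model is the fraction of all A-rated objects assigned to it; this is an assumption made for illustration, as is the data layout with (model, rating) pairs, and it does not treat the odd-object class separately.

```python
from collections import Counter


def update_model_priors(classified, model_names):
    """Return updated total priors from a list of (model_name, rating) pairs.

    ASSUMPTION: the new prior of a model is the number of A-rated objects
    in its class divided by the number of all A-rated objects.
    """
    counts = Counter(model for model, rating in classified if rating == "A")
    total = sum(counts[name] for name in model_names)
    if total == 0:
        raise ValueError("no A-rated objects; priors cannot be updated")
    return {name: counts[name] / total for name in model_names}


# Hypothetical classification result of one iteration.
classified = [
    ("synchrotron", "A"), ("synchrotron", "A"), ("synchrotron", "B"),
    ("thermal_dust", "A"), ("thermal_dust", "C"), ("synchrotron", "A"),
]
new_priors = update_model_priors(classified, ["synchrotron", "thermal_dust"])
```

Note that a class without any A-rated object would obtain a zero prior in this naive form and could never recover in later iterations, so a practical implementation would probably need a floor or a smoothing term.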
The most important parameter for posterior updates is hereby the odd object threshold . If we consider determined by Eq. 20 as a function of and call it , we note that and for . If classification exceptions hide a class of undiscovered objects with particular properties, we would expect that they are grouped around some large value of a , while all objects fitting into the classification scheme have small values of . In between, we expect a range where remains approximately constant, and a good choice of for separating the two populations is then found by maximizing
within the range of where . Once is found, we can update all model priors by Eq. 20.
In principle, every update is a redefinition of the classification scheme , and the goal of our iterative process is to find a converging chain of updates , until a self-consistent result is obtained. If this does not succeed, our conclusion might be that the classification task is ill-defined, and we may replace our classification scheme with an entirely different , containing other models to define object classes.
3.2 Classification and Inference
Interpretation Schemes and Anomalies
With these considerations we make the link from Bayesian classification to Bayesian inference. There, we confront a set of models or theories — we call it the interpretation scheme — with a series of data sets , which we now call tests of the interpretation scheme, expecting that subsequent tests will lead to a more and more reliable estimation of the free parameters in our model space. Occasionally, however, results of experiments will not fit at all into the picture (), and we then call them anomalies. Normally, we will cope with anomalies by successively extending the parameter spaces of models (), but if anomalies become rampant, we will have to doubt the validity of our interpretation scheme as a whole. This may lead us to replace it with a new scheme involving entirely new theories (), involving a reinterpretation of all data sets observed so far.
The Course of Science in a Bayesian View
The gentle reader may have noticed that our interpretation scheme is what Thomas Kuhn has called a paradigm [Thomas S. Kuhn(1962)]. In a Bayesian language, it is that part of our “web of beliefs” which is kept unchanged in technical applications, slowly modified in the normal course of science, but questioned and eventually overthrown when confronted with overwhelming anomalies. We have identified the counter-evidence as a measure to monitor such developments.
We may write for an interpretation scheme continuously modified over time, and define as its average counter-evidence. can then be identified with Imre Lakatos’ concept of a research programme [Imre Lakatos(1978)], and the sign of would indicate whether it is “progressing” () or “degenerating” (). Degeneration of a research programme, or the decline of a paradigm, is caused not only by experimental anomalies, but also by fragile parameters introduced to cope with them. At the end of the road, we may enter what Kuhn calls a scientific revolution, the incommensurable paradigm shift , by which all known data obtain a new meaning [Thomas S. Kuhn(1962)]. A further exploration of these topics would be beyond the scope of this paper, but it is intriguing to note how Bayesian methods allow a quantitative understanding of concepts in the philosophy of science which are otherwise considered irrational.
Acknowledgements. The author thanks Tim Pearson, Torsten Enßlin and the anonymous referees for comments and discussions.
- [Vollmer et al.(2005)] B. Vollmer, E. Davoust, P. Dubois, F. Genova, F. Ochsenbein, and W. van Driel, Astronomy & Astrophysics 431, 1177–1187 (2005).
- [Ochsenbein et al.(2000)] F. Ochsenbein, P. Bauer, and J. Marcout, Astronomy & Astrophysics Supplement 143, 23–32 (2000).
- [Chen et al.(2013a)] X. Chen, R. Chary, T. J. Pearson, P. McGehee, and G. Helou, The Bandmerged Planck Early Release Compact Source Catalog, to be submitted (2013a).
- [Planck Collaboration(2011)] Planck Collaboration, Astronomy & Astrophysics 536, A7 (2011).
- [Chen et al.(2013b)] X. Chen, J. P. Rachen, M. López-Caniego, C. Dickinson, T. J. Pearson, L. Fuhrmann, T. P. Krichbaum, and B. Partridge, Astronomy & Astrophysics (2013b), in press.
- [Enßlin et al.(2009)] T. A. Enßlin, M. Frommert, and F. S. Kitaura, Phys. Rev. D 80, 105005 (2009), also these proceedings.
- [E.T. Jaynes(2003)] E.T. Jaynes, Probability Theory: The Logic of Science, Cambridge University Press, 2003.
- [Thomas S. Kuhn(1962)] Thomas S. Kuhn, The Structure of Scientific Revolutions, The University of Chicago Press, 1962.
- [Imre Lakatos(1978)] Imre Lakatos, The Methodology of Scientific Research Programmes, Cambridge Univ. Press, 1978.