Automatic classification of sources in large astronomical catalogs

Abstract. In this paper we address two questions related to data analysis in large astronomical datasets and demonstrate how machine learning techniques can answer them. The first question is how to efficiently find previously unknown or rare objects that can be expected to exist in big data samples. Using the largest existing extragalactic all-sky survey, provided by the WISE satellite, we demonstrate that, surprisingly, supervised classification methods can help. The second question is: given a sufficiently large data sample, how can we look for new optimal classification schemes, possibly finding new, previously unknown classes and subclasses of sources? Based on the cutting-edge VIPERS galaxy catalog at redshift z > 0.5, we demonstrate that unsupervised classification methods can give unexpected but physically well-motivated results.


Introduction
Astronomy is currently dealing with increasingly large data samples, and even bigger datasets, such as those from the Large Synoptic Survey Telescope (LSST), are coming soon. Such data open new possibilities, both for discovering previously unknown rare classes of objects and for understanding the global properties of known sources, as they provide samples that allow a much more refined statistical treatment. This new situation in astronomical data, coupled with the increasing capabilities of modern computers, has led over the last few years to a surge in machine learning and data mining methods applied in astronomy.
Very broadly speaking, machine learning based classification methods can be divided into two main families: supervised methods, where we know a priori what sources we expect to find and we use some known datasets to train algorithms to look for them, and unsupervised methods which look for separate clusters or groups in the data based on similarity of their properties in a given feature space. In this paper, we present two examples of applications of semi-supervised (a hybrid between supervised and unsupervised) and unsupervised machine learning techniques to search for new unknown classes of sources and new classification schemes for known (and unknown) sources.

Figure 1. Left panel: infrared color-color W1 − W2 vs W2 − W3 distribution of AllWISE sources: in addition to objects following the pattern of known galaxies, stars and AGN, a large number of anomalous sources was found. Right panel: histogram of the visible-infrared r − W2 color of ∼7,000 anomalous objects identified as genuine astrophysical sources with counterparts in the photometric part of the SDSS catalog; the preliminary analysis indicates that they form two groups of dusty (redder) and non-dusty (bluer) AGN.

Supervised methods in search for novel sources
The largest all-sky extragalactic catalog existing at present is based on the near- to mid-infrared data gathered by the Wide-field Infrared Survey Explorer (WISE, Wright et al. 2010) in four passbands (W1, W2, W3, W4) centered at 3.4, 4.6, 12 and 23 µm, respectively. Its AllWISE data release contains over 747 million sources. We can reasonably expect such a wealth of data to contain new, rare and previously unclassified categories of objects. The question is how to find them.
Solarz et al. (2017) applied a semi-supervised machine learning method, called one-class support vector machines (OCSVM, Schölkopf et al. 2000), in order to separate all known classes of objects from those which do not fit any known pattern in the considered parameter space. A model of the expected data was trained on 2.6 million sources present in the spectroscopic SDSS DR13 database and also found in AllWISE. As shown in the left panel of Fig. 1, among the AllWISE data with no spectroscopic SDSS counterparts, the OCSVM algorithm found a large number of sources fitting well the pattern corresponding to known stars, galaxies and quasars. However, it also identified two big groups of anomalous sources. The larger of these two groups was found to contain mainly artifacts, such as objects with spurious photometry due to blending. It should be stressed that these artifacts had not been detected previously by any other cleaning algorithm. Thus, the OCSVM method proved to be an efficient tool for cleaning large catalogs. Even more importantly, the second, smaller group of anomalies appeared to contain ∼40,000 real sources of genuine astrophysical interest, among them a sample of heavily reddened Active Galactic Nuclei (AGN) / quasar candidates distributed uniformly over the sky and largely absent from other WISE-based AGN catalogs. The r − W2 color distribution of the ∼7,000 of these sources that have counterparts in the photometric part of the SDSS survey is presented in the right panel of Fig. 1; the two groups of "red" and "blue" sources presumably correspond to obscured and unobscured AGN, respectively. Follow-up observations of these sources are now ongoing.
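The idea of training a one-class model on known sources and flagging everything that falls outside the learned pattern can be illustrated with a minimal sketch. It uses scikit-learn's OneClassSVM on synthetic data; the two features (stand-ins for the W1 − W2 and W2 − W3 colors), the sample sizes, and the kernel, gamma and nu settings are assumptions chosen for illustration and are not the configuration of Solarz et al. (2017).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Hypothetical "known" training sample: two colors per source,
# drawn from a single compact locus (synthetic, not real WISE data).
known = rng.normal(loc=[0.3, 2.5], scale=[0.15, 0.5], size=(2000, 2))

# Hypothetical unlabeled sample: mostly the same locus, plus 20 sources
# placed far away from it to play the role of anomalies.
typical = rng.normal(loc=[0.3, 2.5], scale=[0.15, 0.5], size=(500, 2))
odd = rng.normal(loc=[2.5, 6.0], scale=[0.1, 0.1], size=(20, 2))
unlabeled = np.vstack([typical, odd])

# Standardize features using the training sample only.
scaler = StandardScaler().fit(known)

# nu is an upper bound on the fraction of training points treated as
# outliers; the value here is an illustrative assumption.
model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.01)
model.fit(scaler.transform(known))

# predict returns +1 for sources matching the known pattern, -1 otherwise.
labels = model.predict(scaler.transform(unlabeled))
anomalies = unlabeled[labels == -1]
n_odd_flagged = int(np.sum(labels[-20:] == -1))
```

In a real application the anomalies array would still mix artifacts with genuine sources, so a separate inspection step, as described above, remains necessary.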

Unsupervised methods in search for new classification schemes
The VIMOS Public Extragalactic Redshift Survey (VIPERS; Scodeggio et al. 2018) contains ∼90,000 spectroscopically measured galaxies at z > 0.5, covering a large volume of ∼5 × 10⁷ h⁻³ Mpc³ with an effective spectroscopic sampling rate of ∼46%, which makes it a state-of-the-art equivalent of local surveys, but at z ∼ 1.
Siudek et al. (2018a) applied a Fisher Expectation-Maximization (FEM) unsupervised algorithm (Bouveyron & Brunet 2018) to classify VIPERS galaxies in a parameter space of 12 rest-frame magnitudes and spectroscopic redshift. The FEM algorithm automatically distinguished 12 classes: 11 classes of galaxies at redshifts between 0.5 and 1.2, and an additional class of outliers consisting mostly of broad-line AGN. The first broad division was into red (passive), green (intermediate), and blue (star-forming) galaxy populations. A further subdivision yielded three red, three green, and five blue galaxy classes. A joint analysis based on standard statistical criteria (BIC, AIC, ICL), a flow chart of the divisions performed by the algorithm in consecutive steps, and a posteriori checks of the physical properties of galaxies in the resulting FEM classes led to the conclusion that the 11 galaxy classes indeed reflect different galaxy subpopulations, and that the transition of galaxy properties between subsequent classes is not continuous. In other words, the FEM classes can be treated as physically different subcategories of galaxies. The differences between classes are seen not only in galaxy rest-frame magnitudes or colors, but also in properties which were not used for classification, either directly or indirectly, such as spectral lines, shapes or sizes. For example, one of the classes of passive galaxies contains galaxies significantly more compact than those in the two other red classes. As shown in Fig. 2, the FEM classes follow the sequence from the earliest to the latest types, which is reflected both in their colors (left panel) and in their physical and spectroscopic properties (right panel).
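The role of a statistical criterion such as BIC in choosing the number of classes can be sketched with a simpler, swapped-in technique. The example below fits an ordinary Gaussian mixture model with scikit-learn's GaussianMixture, which is not the Fisher-EM algorithm used by Siudek et al. (2018a), to a synthetic two-dimensional feature set. The data, the range of candidate class numbers, and all variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)

# Hypothetical feature space: three well-separated synthetic groups
# (a stand-in for, e.g., red, green and blue populations).
centers = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])
features = np.vstack([rng.normal(c, 0.5, size=(300, 2)) for c in centers])

# Fit mixtures with 1 to 6 components and record the BIC of each fit.
candidates = range(1, 7)
models = [
    GaussianMixture(n_components=k, n_init=3, random_state=0).fit(features)
    for k in candidates
]
bic = [m.bic(features) for m in models]

# The model with the lowest BIC is taken as the preferred classification.
best = models[int(np.argmin(bic))]

# Membership probabilities: one row per object, one column per class.
proba = best.predict_proba(features)
ranked = np.sort(proba, axis=1)
first_best = ranked[:, -1]   # probability of the assigned class
second_best = ranked[:, -2]  # probability of the runner-up class
```

Comparing first_best with second_best for each object gives a simple measure of how unambiguous a class assignment is.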
Can such a fine classification be reproduced based on photometric data only? The question is timely, given the huge photometric sky surveys expected in the near future, such as the LSST. [...] corresponding rest-frame magnitudes based on these photometric redshifts and apparent magnitudes. Under these conditions, the FEM algorithm effectively distinguished the three main classes of red, intermediate, and blue galaxies, and the agreement with the corresponding classes based on the spectroscopic data was very high: 92%, 84% and 96%, respectively. Moreover, most of the subclasses were also successfully recovered. The only exception was one of the star-forming classes, containing dusty star-forming galaxies, which in the photometric classification merged with other groups of star-forming galaxies. This class thus proved to be the most sensitive to the accuracy with which the rest-frame color properties of galaxies are recovered.
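A per-class agreement of this kind could be computed as in the sketch below. It assumes one plausible definition, which the text above does not spell out: for each spectroscopic class, the fraction of its galaxies assigned to the same class by the photometric run. It also assumes that the class labels of the two runs have already been matched. The function name and the toy labels are hypothetical.

```python
import numpy as np

def per_class_agreement(reference, candidate):
    """For each class in `reference`, return the fraction of its members
    that carry the same label in `candidate`."""
    reference = np.asarray(reference)
    candidate = np.asarray(candidate)
    return {
        str(c): float(np.mean(candidate[reference == c] == c))
        for c in np.unique(reference)
    }

# Toy labels for eight galaxies (illustrative only, not VIPERS data).
spec = np.array(["red", "red", "green", "blue", "blue", "blue", "green", "red"])
phot = np.array(["red", "red", "green", "blue", "green", "blue", "blue", "red"])

agreement = per_class_agreement(spec, phot)
```

With these toy labels, all three red galaxies keep their class, one of the two green galaxies does, and two of the three blue galaxies do.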

Summary
As we have shown, both supervised and unsupervised machine learning methods hold great promise for understanding source properties in future sky surveys. Combining these methods will provide a powerful tool for future data analysis. It allows first for cleaning the data and then for finding novel sources which do not correspond to any known pattern: a semi-supervised algorithm such as OCSVM can add flexibility and reliability to automated source separation procedures. In the next step, unsupervised methods like FEM can be used for efficient and robust classification of already known and, even more importantly, novel samples of sources.

Discussion
Fumi Egusa: How are the different classes different? Are they really discrete groups, or part of a continuous, smooth change in properties?
Agnieszka Pollo: Different classes selected by the FEM algorithm seem to be indeed well separated in the multidimensional feature space. In particular, most galaxies have very clear class assignment with high probability of belonging to a given class and low second best class membership probability; different classes gather galaxies of different physical properties: for instance, one of the red classes gathers galaxies with UV upturn; one of the intermediate classes contains very dusty objects. Thus, the change in properties between different classes is not smooth.
Fumi Egusa: Comparison with z = 0 would be interesting.
Agnieszka Pollo: Yes, indeed, and we are working on such a comparison right now.