In statistics, the situation may arise where we must classify an object as belonging to group A or group B. When we have labeled training data for each class of object, the problem is fairly straightforward: we can use binary classification algorithms to predict the class to which a new object belongs. When we have unlabeled training data, we turn to clustering algorithms. So far so good, but how do we solve problems in which our training data contains labeled objects for only one class, and the rest are objects of an unknown class? Suddenly, the problem isn't so simple. To make matters worse, not even the trusty SKLearn Estimator Cheatsheet provides an answer.
I asked myself this very question when attempting to construct a model that would estimate the probability that a star beyond our solar system contains an exoplanet in its orbit. The NASA Exoplanet Archive contains a treasure trove of information detailing different star systems and exoplanets. Within the Kepler Stellar Dataset, astronomers have determined that many stars do in fact have exoplanets in their orbit. For the other stars, however, whether they host any orbiting planets is unknown. In this case, we have a data sample in which one class is labeled (star contains an exoplanet), and everything else is unlabeled (star may or may not contain an exoplanet). We have no labels for the case that a star does not have an exoplanet because it is extremely difficult, if not impossible, to say for certain that a star does not have any planets in its orbit. The objective is to construct and train a model that estimates the probability that a new observed test star contains an exoplanet in its orbit based on that test star's similarity to stars that are known to contain exoplanets. What options do we have?
The scenario I've outlined is known as one-class classification. The scientific literature offers numerous interpretations and applications, but I will touch on some of the more popular concepts here.
- PU learning: A binary classifier is learned in a semi-supervised way from only positive (P) and unlabeled (U) data.
- Novelty and outlier detection: Decide whether a new observation belongs to the same distribution as existing observations (an inlier) or should be considered different (an outlier).
- One-class SVM: An SVM approach to one-class classification.
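To make the PU-learning idea concrete, one classic recipe (due to Elkan & Noto, 2008) trains an ordinary classifier to separate labeled-positive from unlabeled examples, then rescales its scores by an estimate of how likely a true positive is to be labeled at all. The sketch below runs it on synthetic data; the cluster locations, sample sizes, and choice of logistic regression are all illustrative assumptions, not a definitive implementation:

```python
# Minimal Elkan & Noto-style PU-learning sketch on synthetic data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic 2-D data: positives cluster near +2, negatives near -2.
X_pos = rng.normal(loc=2.0, scale=1.0, size=(200, 2))
X_neg = rng.normal(loc=-2.0, scale=1.0, size=(200, 2))

# Only half of the positives carry a label; everything else is unlabeled.
X = np.vstack([X_pos[:100], X_pos[100:], X_neg])
s = np.concatenate([np.ones(100), np.zeros(300)])  # s = 1 iff labeled

# Hold out part of the data so we can estimate c = P(s=1 | y=1).
X_train, X_hold, s_train, s_hold = train_test_split(
    X, s, test_size=0.25, stratify=s, random_state=0)

# Step 1: classify labeled vs. unlabeled with any probabilistic model.
clf = LogisticRegression().fit(X_train, s_train)

# Step 2: the mean score on held-out *labeled* positives estimates c,
# the probability that a true positive was labeled at all.
c = clf.predict_proba(X_hold[s_hold == 1])[:, 1].mean()

# Step 3: divide by c to turn P(labeled | x) into P(truly positive | x).
x_new = np.array([[2.5, 2.5]])
p_labeled = clf.predict_proba(x_new)[:, 1]
p_positive = np.minimum(p_labeled / c, 1.0)
```

Because c is below 1 whenever some positives go unlabeled, the corrected probability is always at least as large as the raw classifier score, which is exactly the adjustment the exoplanet setting calls for.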
There are various ways to implement one-class classifiers in Python. At this point, I will defer to individuals who are much better equipped to discuss the implementation details of such models.
- Roy Wright provides a detailed breakdown of PU learning and derives his own custom Python models.
- D.M. Tax develops a one-class SVM.
- SKLearn provides a one-class SVM method.
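To give a flavor of the SKLearn route, here is a minimal, self-contained sketch of `OneClassSVM` fit on synthetic data from a single class; the data, kernel, and `nu` value are illustrative choices, not recommendations:

```python
# Minimal one-class SVM sketch: fit on one class, flag outliers.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(42)

# Training data: 500 points from the single "known" class.
X_train = rng.normal(loc=0.0, scale=1.0, size=(500, 2))

# nu upper-bounds the fraction of training errors and
# lower-bounds the fraction of support vectors.
clf = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1).fit(X_train)

# predict() returns +1 for inliers and -1 for outliers.
near = clf.predict([[0.0, 0.0]])  # inside the training cloud
far = clf.predict([[6.0, 6.0]])   # far outside it
```

In the exoplanet setting, the analogue would be fitting on feature vectors of stars known to host planets, then using `decision_function` scores on new stars as a measure of similarity to the known-planet population.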
I hope at the very least the resources I've provided above leave you better equipped to solve your own one-class classification problems. This is the third in a series of blog posts written for the ChiPy Mentorship Program. As part of my project, I am attempting to train models that estimate the probability that a star has exoplanets, or habitable planets, in its orbit. Recognizing this as a one-class classification problem was an important step forward for the project. My next steps are to implement a one-class SVM and then analyze time-series data for various stars to try to identify the minute dimming events that indicate an orbiting planet.