<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Steven Bruno</title>
    <description>The latest articles on DEV Community by Steven Bruno (@stevenbruno).</description>
    <link>https://dev.to/stevenbruno</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F103055%2F7c1ccf5a-7f66-4e62-a2a6-4f9e6f9945bc.jpg</url>
      <title>DEV Community: Steven Bruno</title>
      <link>https://dev.to/stevenbruno</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/stevenbruno"/>
    <language>en</language>
    <item>
      <title>An Introduction to One-class Classification</title>
      <dc:creator>Steven Bruno</dc:creator>
      <pubDate>Fri, 19 Apr 2019 04:42:49 +0000</pubDate>
      <link>https://dev.to/stevenbruno/an-introduction-to-one-class-classification-2m5c</link>
      <guid>https://dev.to/stevenbruno/an-introduction-to-one-class-classification-2m5c</guid>
      <description>&lt;h4&gt;
  
  
  The Problem Statement
&lt;/h4&gt;

&lt;p&gt;In statistics, the situation may arise where we must classify an object as belonging to group A or group B. When we have labeled training data for each class of object, the problem is fairly straightforward - we can utilize binary classification algorithms to predict the class to which a new object belongs. When we have unlabeled training data, we turn to clustering algorithms. So far so good, but how do we solve problems in which our training data only contains labeled objects for one class, and the rest are objects of an unknown class? Suddenly, the problem isn't so simple. To make matters worse, not even the trusty SKLearn Estimator Cheatsheet provides an answer.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy7suiwd93m5a6vfygvkh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy7suiwd93m5a6vfygvkh.png" alt="estimator" width="800" height="498"&gt;&lt;/a&gt;What about semi-labeled data?&lt;/p&gt;
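&lt;p&gt;To make the setup concrete, here is a minimal sketch (with made-up feature values) of what semi-labeled data looks like: one class carries labels, and everything else is unlabeled, so neither a binary classifier nor a plain clustering algorithm matches the shape of the problem.&lt;/p&gt;

```python
import numpy as np

# Feature vectors for objects whose class we know (the positive class)...
X_pos = np.array([[1.0, 2.1], [0.9, 1.8], [1.2, 2.0]])

# ...and for objects whose class is unknown (unlabeled).
X_unlabeled = np.array([[0.8, 2.2], [5.0, 9.0], [1.1, 1.9]])

# A binary classifier needs labels for BOTH classes; here one side
# of the labeling simply does not exist.
y_pos = np.ones(len(X_pos))                      # the only labels we have
y_unlabeled = np.full(len(X_unlabeled), np.nan)  # class unknown
```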

&lt;h4&gt;
  
  
  Background
&lt;/h4&gt;

&lt;p&gt;I asked myself this very question when attempting to construct a model that would estimate the probability that a star beyond our solar system contains an exoplanet in its orbit. The NASA Exoplanet Archive contains a treasure trove of information detailing different star systems and exoplanets. Within the Kepler Stellar Dataset, astronomers have determined that many stars do in fact have exoplanets in their orbit. For the other stars, however, whether they host any orbiting planets is unknown. In this case, we have a data sample in which one class is labeled (star contains an exoplanet), and everything else is unlabeled (star may or may not contain an exoplanet). We have no labels for the case that a star does not have an exoplanet because it is extremely difficult, if not impossible, to say for certain that a star does not have any planets in its orbit. The objective is to construct and train a model that estimates the probability that a new observed test star contains an exoplanet in its orbit based on that test star's similarity to stars that are known to contain exoplanets. What options do we have?&lt;/p&gt;

&lt;h4&gt;
  
  
  The Solution (skip to here for tl;dr)
&lt;/h4&gt;

&lt;p&gt;The scenario I've outlined is what is known as &lt;a href="https://en.wikipedia.org/wiki/One-class_classification" rel="noopener noreferrer"&gt;One-class classification&lt;/a&gt;. There are numerous interpretations and applications outlined throughout scientific literature, but I will touch on some of the more popular concepts here.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PU Learning&lt;/strong&gt; A binary classifier is learned in a semi-supervised way from positive (P) and unlabeled (U) data only. &lt;a href="https://roywright.me/2017/11/16/positive-unlabeled-learning/" rel="noopener noreferrer"&gt;Learn more&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Novelty and Outlier Detection&lt;/strong&gt; Decide whether a new observation belongs to the same distribution as existing observations (it is an inlier), or should be considered as different (it is an outlier). &lt;a href="https://scikit-learn.org/stable/modules/outlier_detection.html#outlier-detection" rel="noopener noreferrer"&gt;Learn more&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;One Class SVM&lt;/strong&gt; An SVM approach to one-class classification. &lt;a href="http://users.cecs.anu.edu.au/~williams/papers/P126.pdf" rel="noopener noreferrer"&gt;Learn more&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
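&lt;p&gt;As a quick, hedged illustration of the novelty-and-outlier-detection flavor (synthetic data; scikit-learn offers several such estimators, IsolationForest among them):&lt;/p&gt;

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)

# Train only on "normal" observations: our single known class.
X_train = 0.3 * rng.randn(100, 2)

# New observations: five near the training distribution, two far away.
X_test = np.vstack([0.3 * rng.randn(5, 2), [[4.0, 4.0], [-4.0, 4.0]]])

clf = IsolationForest(random_state=42).fit(X_train)
pred = clf.predict(X_test)  # +1 means inlier, -1 means outlier
print(pred)
```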

&lt;h4&gt;
  
  
  Python Resources
&lt;/h4&gt;

&lt;p&gt;There are various ways to implement one-class classifiers in Python. At this point, I will defer to individuals who are much better equipped to discuss the implementation details of such models.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Roy Wright provides a &lt;a href="https://roywright.me/2017/11/16/positive-unlabeled-learning/" rel="noopener noreferrer"&gt;detailed breakdown of PU learning&lt;/a&gt; and derives his own custom Python models.&lt;/li&gt;
&lt;li&gt;D.M. Tax develops a &lt;a href="http://homepage.tudelft.nl/n9d04/thesis.pdf" rel="noopener noreferrer"&gt;one-class SVM&lt;/a&gt; in his doctoral thesis.&lt;/li&gt;
&lt;li&gt;SKLearn provides a &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.svm.OneClassSVM.html#sklearn.svm.OneClassSVM" rel="noopener noreferrer"&gt;OneClassSVM estimator&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
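&lt;p&gt;For a flavor of the scikit-learn route, a minimal sketch: fit OneClassSVM on the labeled class alone, then ask whether a new observation resembles it. The data here is synthetic, and the nu and gamma parameters would need real tuning for actual stellar features.&lt;/p&gt;

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)

# Pretend these are feature vectors for stars KNOWN to host exoplanets.
X_known = rng.normal(loc=0.0, scale=0.5, size=(200, 3))

# nu upper-bounds the fraction of training points treated as outliers.
model = OneClassSVM(kernel="rbf", nu=0.1, gamma="scale").fit(X_known)

# predict() returns +1 for "similar to the known class", -1 otherwise.
near_star = np.array([[0.1, -0.2, 0.3]])
distant_star = np.array([[5.0, 5.0, 5.0]])
print(model.predict(near_star), model.predict(distant_star))
```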

&lt;h4&gt;
  
  
  Conclusion
&lt;/h4&gt;

&lt;p&gt;I hope at the very least the resources I've provided above leave you better equipped to solve your own one-class classification problems. This is the third in a series of blog posts written for the &lt;a href="https://chipymentor.org/" rel="noopener noreferrer"&gt;ChiPy Mentorship Program&lt;/a&gt;. As part of my project, I am attempting to train models that will estimate the probability that a star has exoplanets in its orbit or habitable planets in its orbit. Identifying my problem as a one-class classification problem was an important leap in the progress of this project. My next steps are to implement a one-class SVM then analyze time-series data for various stars to try to identify &lt;a href="http://www.planetary.org/explore/space-topics/exoplanets/transit-photometry.html" rel="noopener noreferrer"&gt;minute dimming events&lt;/a&gt; that are indicative of an orbiting planet.&lt;/p&gt;

</description>
      <category>python</category>
      <category>machinelearning</category>
      <category>classification</category>
    </item>
    <item>
      <title>How to Explore NASA Exoplanet Archive Data with Python and Pandas</title>
      <dc:creator>Steven Bruno</dc:creator>
      <pubDate>Fri, 22 Mar 2019 04:40:04 +0000</pubDate>
      <link>https://dev.to/stevenbruno/how-to-explore-nasa-exoplanet-archive-data-with-python-and-pandas-351g</link>
      <guid>https://dev.to/stevenbruno/how-to-explore-nasa-exoplanet-archive-data-with-python-and-pandas-351g</guid>
      <description>&lt;h3&gt;
  
  
  &lt;strong&gt;Intro&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://exoplanetarchive.ipac.caltech.edu/index.html" rel="noopener noreferrer"&gt;NASA Exoplanet Archive&lt;/a&gt; hosts large data sets that feature exoplanets and their host stars. The site can be a little overwhelming for newcomers, however, as it is large and complex. The purpose of this post is to teach aspiring Exoplanet Archive data explorers how to navigate the site, download datasets, and begin manipulating the data with Pandas in Python. To frame the discussion, I will walk through the steps as I have recently performed them for my data science project in the &lt;a href="https://chipymentor.org/" rel="noopener noreferrer"&gt;ChiPy Mentorship Program&lt;/a&gt;. &lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Step 1&lt;/strong&gt; | Find your Dataset
&lt;/h3&gt;

&lt;p&gt;The majority of the data sets can be found by navigating to the &lt;em&gt;Data&lt;/em&gt; tab in the primary site navigation. Within, you will notice four main sections - Confirmed Planets, Kepler, Transit Surveys, and Other. For this example, let's work with the Composite Planet Data from within the confirmed exoplanets section. Once you open up the data set, you'll be presented with a tabular interface that you can scroll through directly in your browser. To learn more about individual columns or the data as a whole, scan through the provided &lt;em&gt;View Documentation&lt;/em&gt; links.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwtc42n8djjtst407ivzo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwtc42n8djjtst407ivzo.png" alt="Data tab"&gt;&lt;/a&gt;The available data sets&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiuczm7eqtejj7t2gbcjl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiuczm7eqtejj7t2gbcjl.png" alt="View Documentation"&gt;&lt;/a&gt;View Documentation options&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Step 2&lt;/strong&gt; | Filter and Download
&lt;/h3&gt;

&lt;p&gt;While you could certainly do all of your data cleaning with Pandas alone, you can get a head start by using the tools built into the archive webpage. Click on the Select Columns button, then select the columns you are interested in. You can filter the data further by searching for row entries and selecting or deselecting rows as you please. Each change you make will be reflected in your file download. Once ready, click the download button, and download your data in CSV format. If you tried to read the CSV into Pandas at this point, you would encounter some errors, as the download brought along some unwanted baggage. Open your recently downloaded CSV in a text editor, and delete everything up to the label for your first column. In my case, I will delete everything up to fpl_hostname. Save the updated file as a CSV.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl96mmuto81pk1obmbr11.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl96mmuto81pk1obmbr11.png" alt="Select Columns"&gt;&lt;/a&gt;The Select Columns button&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff2eplo6fhoen927acwqm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff2eplo6fhoen927acwqm.png" alt="Column Controls"&gt;&lt;/a&gt;Column Controls&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkebmwy0ivfsg9b0nz3lq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkebmwy0ivfsg9b0nz3lq.png" alt="Delete the baggage"&gt;&lt;/a&gt;Delete the extraneous information&lt;/p&gt;
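&lt;p&gt;If editing the file by hand feels error-prone, Pandas may be able to skip the baggage for you: when the leading metadata lines are all prefixed with #, as the archive's downloads typically are, read_csv's comment parameter drops them during parsing. Verify this against your own file before relying on it.&lt;/p&gt;

```python
import io
import pandas as pd

# A stand-in for a freshly downloaded file: metadata lines prefixed
# with '#', then the real header row and data.
raw = io.StringIO(
    "# COLUMN fpl_hostname: Host Name\n"
    "# COLUMN fpl_name:     Planet Name\n"
    "fpl_hostname,fpl_name\n"
    "11 Com,11 Com b\n"
)

# comment='#' tells read_csv to ignore those lines entirely,
# so parsing starts at the real header.
df = pd.read_csv(raw, comment="#")
print(df.columns.tolist())  # ['fpl_hostname', 'fpl_name']
```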

&lt;h3&gt;
  
  
  &lt;strong&gt;Step 3&lt;/strong&gt; | Read File into Pandas Dataframe
&lt;/h3&gt;

&lt;p&gt;Now the fun part. I am assuming you have Python 3 and &lt;a href="https://pandas.pydata.org/" rel="noopener noreferrer"&gt;Pandas&lt;/a&gt; installed already, but if you don't, go ahead and do that. I like to work with the data in Jupyter notebooks, so &lt;a href="https://jupyter.org/install" rel="noopener noreferrer"&gt;install Jupyter Notebook&lt;/a&gt; if you haven't done so already. Create a new Jupyter notebook in the same directory where you are keeping your filtered CSV. In the notebook, begin by importing Pandas. Then use Pandas' built-in read_csv() function to read your file into a new dataframe. Once done, run df.head() to make sure your dataframe looks good.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;import pandas as pd&lt;/code&gt;&lt;br&gt;
&lt;code&gt;df = pd.read_csv('name_of_filtered_csv')&lt;/code&gt;&lt;br&gt;
&lt;code&gt;df.head()&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3l93rzsesw7lx5vt5alg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3l93rzsesw7lx5vt5alg.png" alt="Initial functions"&gt;&lt;/a&gt;Import your data&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Step 4&lt;/strong&gt; | Clean your Data
&lt;/h3&gt;

&lt;p&gt;So far so good! At this point, you have a fully functioning exoplanet dataset in a Pandas dataframe. You may notice that the column headers aren't exactly intuitive. To remedy this, we will rename them with the df.rename() function. Its columns parameter expects a dictionary whose keys are the old column names and whose values are the new column names. We can set up that dictionary and use it in the rename function like so:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;column_dict = {&lt;br&gt;
    'fpl_hostname':'Host Name', &lt;br&gt;
    'fpl_letter':'Planet Letter', &lt;br&gt;
    'fpl_name':'Planet Name', &lt;br&gt;
    'fpl_controvflag':'Controversial Flag'&lt;br&gt;
}&lt;/code&gt;&lt;br&gt;
&lt;code&gt;df.rename(columns=column_dict, inplace=True)&lt;/code&gt;&lt;br&gt;
&lt;code&gt;df.head()&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk4ew7wxrpgginfu4pu1i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk4ew7wxrpgginfu4pu1i.png" alt="Renaming columns"&gt;&lt;/a&gt;Renaming columns&lt;/p&gt;

&lt;p&gt;If done correctly, you'll see the names of your columns modified to the new values. But what if we have more than just a few columns to rename? What if there are dozens, or hundreds? One way to address this is by extracting the HTML table of planet parameters accessible from the &lt;em&gt;View Documentation&lt;/em&gt; &amp;gt; &lt;em&gt;View Data Column Definitions&lt;/em&gt; link in the web interface for your data set. There, you'll see a table of column names, labels, and a few other fields. The first two columns would work perfectly as keys (old column names) and values (new column names) for your renaming dictionary. You could capture the table using &lt;a href="https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.read_html.html" rel="noopener noreferrer"&gt;pd.read_html()&lt;/a&gt;, create a dictionary from the first two columns using &lt;a href="https://stackoverflow.com/questions/17426292/what-is-the-most-efficient-way-to-create-a-dictionary-of-two-pandas-dataframe-co" rel="noopener noreferrer"&gt;this method&lt;/a&gt;, and then use df.rename() to finish renaming your original dataframe's column labels. Good luck!&lt;/p&gt;
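&lt;p&gt;A sketch of that pipeline, under the assumption that the documentation table's first two columns hold the database names and display labels. A small stand-in DataFrame plays the role of the pd.read_html() result here, since the real page would need to be fetched.&lt;/p&gt;

```python
import pandas as pd

# In practice: defs = pd.read_html(docs_url)[0], since read_html returns
# a list of DataFrames, one per HTML table on the page. This stand-in
# mimics the first table's first two columns.
defs = pd.DataFrame({
    "Database Column Name": ["fpl_hostname", "fpl_name"],
    "Table Label": ["Host Name", "Planet Name"],
})

# Zip the first two columns into an {old_name: new_name} dictionary.
column_dict = dict(zip(defs.iloc[:, 0], defs.iloc[:, 1]))

df = pd.DataFrame({"fpl_hostname": ["11 Com"], "fpl_name": ["11 Com b"]})
df = df.rename(columns=column_dict)
print(df.columns.tolist())  # ['Host Name', 'Planet Name']
```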

&lt;h5&gt;
  
  
  Final thoughts
&lt;/h5&gt;

&lt;p&gt;This has been an introduction to extracting data sets from the NASA Exoplanet Archive using Pandas and Python. It is the second post in a blog trilogy I am writing for the &lt;a href="https://chipymentor.org/" rel="noopener noreferrer"&gt;ChiPy Mentorship Program&lt;/a&gt;. My next steps are to use methods in the sklearn.neighbors module to attempt to classify the likelihood that a star has an orbiting exoplanet based on stellar parameters. I have performed all of the steps outlined above, and they are a great way to get started with creating model-ready dataframes from the exoplanet archive, but certainly not the only way. To learn about more methods for extracting data from the archive, check out the &lt;a href="https://exoplanetarchive.ipac.caltech.edu/docs/tools.html" rel="noopener noreferrer"&gt;Tools&lt;/a&gt; section of the site.&lt;/p&gt;

</description>
      <category>python</category>
      <category>pandas</category>
      <category>data</category>
      <category>space</category>
    </item>
    <item>
      <title>How to Start Your First Data Science Project</title>
      <dc:creator>Steven Bruno</dc:creator>
      <pubDate>Fri, 22 Feb 2019 05:46:23 +0000</pubDate>
      <link>https://dev.to/stevenbruno/how-to-start-your-first-data-science-project-4i8n</link>
      <guid>https://dev.to/stevenbruno/how-to-start-your-first-data-science-project-4i8n</guid>
      <description>&lt;p&gt;In recent years, data science has seen an influx of researchers, aspiring professionals, and enthusiasts. While many of the tools and techniques have been around for years, if not decades, the industry has really taken off in the past 5 years. &lt;a href="https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century"&gt;Good publicity&lt;/a&gt;, increased technological capabilities, and high pay prospects  have combined to form the perfect recipe for a burgeoning field.&lt;/p&gt;

&lt;p&gt;The space is so enticing that even I have decided to test the waters. In the coming months, I will be creating my first data science project using python as part of my involvement in the &lt;a href="https://www.chipy.org/"&gt;Chicago Python user group&lt;/a&gt;  mentorship program.&lt;/p&gt;

&lt;p&gt;If you're in my position, you may be wondering how to get started. With the help of individuals much smarter than I am, I've crafted a five-step sequence. Without further ado:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Step 1&lt;/strong&gt; | Reconsider your intentions
&lt;/h3&gt;

&lt;p&gt;It's easy to get sucked into the glitz and glamour of the "data scientist" role. The title itself, however, can mean 100 different things to 100 different employers. If you are an aspiring professional data scientist, know that the title is suffering a bit of an identity crisis. Assuming you've done your research, aren't just in it for the alluring pay, and are still interested, then continue to step two.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Step 2&lt;/strong&gt; | Gain a basic familiarity with the tools and techniques
&lt;/h3&gt;

&lt;p&gt;Data Science is essentially programming and statistics. The two most popular languages at the moment for the practice are Python and R. It would not be a bad idea to take an online data science course and pick up some fundamental statistics before you begin. This will also help you ascertain whether the type of work is even something you would be interested in. I've scanned countless recommendations from online communities over the past year and these courses seem to receive a lot of good feedback. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.udemy.com/python-for-data-science-and-machine-learning-bootcamp/"&gt;Python for Data Science and Machine Learning Bootcamp&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://lagunita.stanford.edu/courses/HumanitiesSciences/StatLearning/Winter2016/about"&gt;Stanford Statistical Learning&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.coursera.org/learn/machine-learning"&gt;Machine Learning from Andrew Ng&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I have worked through a decent portion of the Jose Portilla Udemy course in preparation for my project, and I find the instruction to be outstanding.&lt;br&gt;
A word of caution: Do not get sucked into the trap of the infinite tutorial loop. Just get the basics down then try to tackle your own project as soon as you can. While the courses listed above are great resources, you will learn a tremendous amount by solving the problems that arise in your own project. You will also have something cool to show for it. &lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Step 3&lt;/strong&gt; | Choose a topic you're interested in
&lt;/h3&gt;

&lt;p&gt;This one is pretty self explanatory. If you're struggling to think of a project idea, just consider your own interests. From &lt;a href="https://www.kaggle.com/moltean/fruits"&gt;apples&lt;/a&gt; to &lt;a href="https://www.kaggle.com/uciml/zoo-animal-classification"&gt;zoos&lt;/a&gt;, if you can think of a subject, there's a good chance data exists for it. The best project idea is the one you actually stick to, so choosing something that excites you is in your best interest.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Step 4&lt;/strong&gt; | Home in on the data
&lt;/h3&gt;

&lt;p&gt;Got your topic? Great, now it's time to select your dataset(s). &lt;a href="https://www.kaggle.com/"&gt;Kaggle&lt;/a&gt; is an amazing resource, as is &lt;a href="https://toolbox.google.com/datasetsearch"&gt;Google's dataset search&lt;/a&gt;. As for me, I like astronomy, so I'm looking at the &lt;a href="https://exoplanetarchive.ipac.caltech.edu/"&gt;NASA Exoplanet Archive&lt;/a&gt; and beginning to envision the sorts of relationships and models that can potentially be drawn out. &lt;em&gt;Side Note&lt;/em&gt; - web developers, please make data scientists' lives easier by &lt;a href="https://productforums.google.com/forum/#!topic/webmasters/nPq4BW6iPIA"&gt;allowing&lt;/a&gt; Google to find and publicize your data with their search tool. (Unless you are anti-Google, which is okay too.) &lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Step 5&lt;/strong&gt; | Dive into your data!
&lt;/h3&gt;

&lt;p&gt;At this point, you've got the tools, techniques, and data to get started. Get stuck? There are plenty of online communities willing to help out. Good luck!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Author's Note | This has been the first in a series of blog posts for the &lt;a href="https://chipymentor.org/"&gt;ChiPy mentorship program&lt;/a&gt;. My next steps are to extract and clean my data, so keep an eye out for my next post if you are interested in my progress.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>python</category>
      <category>ai</category>
      <category>pandas</category>
    </item>
  </channel>
</rss>
