How to Explore NASA Exoplanet Archive Data with Python and Pandas

#python #pandas #data #space

Intro

The NASA Exoplanet Archive hosts large data sets that feature exoplanets and their host stars. The site can be a little overwhelming for newcomers, however, as it is large and complex. The purpose of this post is to teach aspiring Exoplanet Archive data explorers how to navigate the site, download datasets, and begin manipulating the data with Pandas in Python. To frame the discussion, I will walk through the steps as I have recently performed them for my data science project in the ChiPy Mentorship Program.

Step 1 | Find your Dataset

Most of the data sets can be found under the Data tab in the primary site navigation. There you will notice four main sections: Confirmed Planets, Kepler, Transit Surveys, and Other. For this example, let's work with the Composite Planet Data from the Confirmed Planets section. Once you open the data set, you'll be presented with a tabular interface that you can scroll through directly in your browser. To learn more about individual columns or the data as a whole, scan through the provided View Documentation links.

Step 2 | Filter and Download

While you could certainly do all of your data cleaning with Pandas alone, you can get a head start by using the tools built into the archive webpage. Click the Select Columns button, then select the columns you are interested in. You can further filter the data by searching for row entries and selecting and deselecting rows as you please. Each change you make will be reflected in your file download. Once ready, click the download button and download your data in CSV format. If you tried to read the CSV in Pandas at this point, you would encounter some errors, because the download includes extra header text above the actual table. Open your recently downloaded CSV in a text editor and delete everything up to the label for your first column. In my case, I will delete everything up to fpl_hostname. Save the updated file as a CSV.

[Image: Delete the extraneous information]
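If you would rather not edit the file by hand, Pandas can often skip the extra header text for you. The sketch below assumes the unwanted lines at the top of the download each begin with a '#' character; check your own file before relying on that. The in-memory text and its values are invented stand-ins for a real download.

```python
import io

import pandas as pd

# Stand-in for a downloaded file. The '#' preamble is an assumption about
# the download format, and the data row is made up for illustration.
raw_download = io.StringIO(
    "# This file was produced by the archive\n"
    "# COLUMN fpl_hostname: Host Name\n"
    "#\n"
    "fpl_hostname,fpl_letter,fpl_name\n"
    "Example Star,b,Example Star b\n"
)

# comment="#" tells read_csv to ignore everything from a '#' to the end of
# the line, so fully commented lines are skipped before the header is read.
df_raw = pd.read_csv(raw_download, comment="#")
print(df_raw.columns.tolist())
```

One caveat with this approach: a '#' inside a data value would also cut that line short, so the manual edit described above remains the safer choice if your columns might contain that character.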

Step 3 | Read File into Pandas Dataframe

Now the fun part. I am assuming you have Python 3 and Pandas installed already, but if you don't, go ahead and do that. I like to work with the data in Jupyter notebooks, so install Jupyter Notebook if you haven't done so already. Create a new Jupyter notebook in the same directory where you are keeping your filtered data CSV. In the notebook, begin by importing Pandas (note that the package name is lowercase in the import statement). Then use the built-in Pandas read_csv() function to read your file into a new dataframe. Once done, run df.head() to make sure your dataframe looks good.

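import pandas as pd
# Replace the placeholder below with the name of your edited CSV file
df = pd.read_csv('name_of_filtered_csv')
# Show the first five rows as a quick sanity check
df.head()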
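Beyond df.head(), a few other quick checks can help confirm the file loaded the way you expect. Here is a minimal sketch using a toy dataframe in place of the real one; the values are invented, and only the column names come from this post.

```python
import pandas as pd

# Toy stand-in for the archive dataframe; values are invented for illustration.
df_demo = pd.DataFrame({
    "fpl_hostname": ["Star A", "Star B", "Star C"],
    "fpl_letter": ["b", "c", "b"],
    "fpl_controvflag": [0, 0, None],
})

print(df_demo.shape)   # (number of rows, number of columns)
print(df_demo.dtypes)  # the data type pandas inferred for each column

# Count missing values per column; archive tables often have gaps.
missing = df_demo.isna().sum()
print(missing)
```

On your own data, run the same calls against df instead of df_demo.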

Step 4 | Clean your Data

So far so good! At this point, you have a fully functioning exoplanet dataset in a Pandas dataframe. You may notice that the column headers aren't exactly intuitive. To remedy this, we will rename them using the df.rename() function. As a parameter, it expects a dictionary where the keys are old column names and the values are the new column names. We can set up that dictionary and use it in the rename function like so:

column_dict = { 'fpl_hostname':'Host Name', 'fpl_letter':'Planet Letter', 'fpl_name':'Planet Name', 'fpl_controvflag':'Controversial Flag' }
df.rename(columns=column_dict, inplace=True)
df.head()
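One thing to watch for: by default, rename() quietly skips any dictionary key that does not match a column, so a typo in an old column name goes unnoticed. The sketch below, using a toy dataframe and a deliberately misspelled key, shows how the errors="raise" option surfaces the problem instead.

```python
import pandas as pd

# Toy dataframe; the value rows are invented for illustration.
df_demo = pd.DataFrame({"fpl_hostname": ["Star A"], "fpl_letter": ["b"]})

# 'fpl_leter' is misspelled on purpose.
mapping = {"fpl_hostname": "Host Name", "fpl_leter": "Planet Letter"}

# Default behavior: the unmatched key is ignored and no error is raised.
renamed = df_demo.rename(columns=mapping)
print(renamed.columns.tolist())

# errors="raise" turns an unmatched key into a KeyError.
try:
    df_demo.rename(columns=mapping, errors="raise")
    typo_caught = False
except KeyError:
    typo_caught = True
print(typo_caught)
```

This version also returns a new dataframe rather than using inplace=True, which leaves the original untouched while you experiment.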

If done correctly, you'll see the names of your columns changed to the new values. But what if we have more than just a few columns to rename? What if there are dozens, or hundreds? One way to address this is by extracting the HTML table of planet parameters, accessible from the View Documentation > View Data Column Definitions link in the web interface for your data set. There, you'll see a table of column names, labels, and some other columns. The first two columns would work perfectly as keys (old column names) and values (new column names) for your dataframe. To approach this problem, you could capture the table using pd.read_html(), create a dictionary from its first two columns, and then use the df.rename() function to finish renaming your original dataframe column labels. Good luck!
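The approach described above could be sketched roughly as follows. The helper names load_definitions and build_rename_map are hypothetical, and the code assumes the column definitions are the first table on the documentation page, which you should verify for your data set. To keep the example runnable offline, a small hand-built dataframe stands in for the scraped table.

```python
import pandas as pd


def load_definitions(source):
    """Return the first HTML table found at source (hypothetical helper).

    source can be a URL, a file path, or a file-like object. pd.read_html
    returns a list of dataframes, one per <table>, and needs an HTML parser
    such as lxml installed. Taking index 0 is an assumption about the page.
    """
    return pd.read_html(source)[0]


def build_rename_map(definitions):
    """Map the first column (old names) to the second column (labels)."""
    names = definitions.iloc[:, 0].astype(str).str.strip()
    labels = definitions.iloc[:, 1].astype(str).str.strip()
    return dict(zip(names, labels))


# Hand-built stand-in for the scraped definitions table; the header names
# and the third column are placeholders, not the archive's actual layout.
definitions_demo = pd.DataFrame({
    "Name": ["fpl_hostname", "fpl_letter"],
    "Label": ["Host Name", "Planet Letter"],
    "Description": ["placeholder", "placeholder"],
})

rename_map = build_rename_map(definitions_demo)

# Toy dataframe using the original column names.
df_demo = pd.DataFrame({"fpl_hostname": ["Star A"], "fpl_letter": ["b"]})
df_demo = df_demo.rename(columns=rename_map)
print(df_demo.columns.tolist())
```

With a real page, you would pass the documentation URL or a saved copy of the page to load_definitions and feed the result to build_rename_map.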

Final thoughts

This has been an introduction to extracting data sets from the NASA Exoplanet Archive and loading them with Pandas and Python. It is the second post in a blog trilogy I am writing for the ChiPy Mentorship Program. My next step is to use methods in the sklearn.neighbors module to try to classify whether a star is likely to have an orbiting exoplanet, based on stellar parameters. I have performed all the steps outlined above, and they are a great way to get started with creating model-ready dataframes from the exoplanet archive, but certainly not the only way. To learn about more methods for extracting data from the archive, check out the Tools section within the site.