Lorna Jane Mitchell

Posted on Feb 18, 2021 • Originally published at aiven.io on Feb 17, 2021

Discover exoplanets with PostgreSQL sample data

#database #data #postgres #science

PostgreSQL is and remains one of our most popular and strongly growing storage platforms. Other storage technologies come and go but modern Postgres is a solid choice for so many applications. In this article we'll take a look at how to get started with Aiven for PostgreSQL, populate it with data and start playing.

When you spin up your first Aiven for PostgreSQL, you'll want to take some time to play with the features ... but there's a problem. Your new shiny database is empty.

Finding and using some open datasets is a great way to fill this gap. One option is go try the Kaggle platform. It's a place to find open data, advice about data science, and some competitions you can participate in to hone your skills. There's quite a selection of datasets to choose from, but today I'll be using the exoplanets data from the Kepler mission. You'll need a (free) account to log in and download the data. Go ahead and extract the zip file, I'm using cumulative.csv for the example in this post.

Get Started with Aiven

If you are not already an Aiven user, you can sign up for an Aiven account to follow the steps in this post - we'll wait right here!

We will also be using the Aiven CLI. This tool requires Python 3.6 or later, and can be installed from PyPI:

pip install aiven-client

You also need to authenticate your Aiven account against the CLI tool. Run the following command, substituting your own details:

avn user login <email@example.com>

You now have everything you need to create an Aiven database in the cloud.

Create PostgreSQL Service

A good first step is to create a project to keep the services in. All it needs is a name:

avn project create exoplanets

Aiven offers many options when creating services but to get us going quickly, I'll use the newest postgres available and the smallest package, called "hobbyist". One of the most fun things though is being able to choose any cloud platform you like! Take a moment to check the list and copy the CLOUD_NAME field of your favorite:

avn cloud list

I chose google-europe-west1 for my example. Here is the command to run to create the Postgres database (remember to swap in the cloud of your choice):

avn service create -t pg -p hobbyist --cloud google-europe-west1 pg-exoplanets

The node takes a few minutes to be ready, but the Aiven CLI has a handy "wait" command that doesn't return until the service is ready to talk to us. This is less critical for running the commands by hand as I have here, but it's super useful when your CI system is spinning up the data platforms by itself!

avn service wait pg-exoplanets

When the command returns, my PostgreSQL cluster is ready to use. Let's create a database to hold the sample data; the command below creates one named "exoplanets":

avn service database-create --dbname exoplanets pg-exoplanets

Now I have a shiny new database... but it's all sad and empty. Let's look at the sample data and get it imported.

Adding CSV Data to PostgreSQL

PostgreSQL has built-in support for importing CSV data into an existing table, but I don't have the table structure, just a CSV. Luckily there's a tool for that - ddlgenerator is another Python commandline tool.

Here's how to install the ddlgenerator tool and then generate the CREATE TABLE statement from cumulative.csv that I downloaded earlier:

pip install ddlgenerator
ddlgenerator postgres cumulative.csv > create.sql

Have a look inside the file and you will see that it has the structure I need to explain to PostgreSQL how to hold the data. The avn service cli command will give us a psql prompt on the new database:

avn service cli pg-exoplanets

From within psql I can connect to the "exoplanets" database, and then run the SQL file to create the table structure:

\c exoplanets
\i create.sql

Adding the final piece to the puzzle, and still from the psql prompt, the next command brings in the CSV data:

\copy cumulative from data/cumulative.csv csv header

Nice work! The cumulative table should now have some data for you to play with!

Dreaming of Exoplanets

Now you have a database full of measurements of exoplanets taken by the Kepler Space Telescope. If you're not already familiar with the project, the NASA mission page is worth a read. The mission went into a second phase when one of the controls failed, which serves to remind us that engineering systems we can see and touch, or at least ssh into, is much easier gig than operating in space!

You can explore the dataset, which describes observations and compares the Kepler assessment of each exoplanet with its official status in the pre-existing literature. For example, try this to see the false-positives identified by Kepler:

select kepler_name, koi_pdisposition from cumulative where koi_disposition = 'CONFIRMED' and koi_pdisposition = 'FALSE POSITIVE';

You can also connect this data to other tools to use the dataset further. Either grab the connection details from the web console, or use jq with avn for a one-liner:

avn service get pg-exoplanets --json | jq ".service_uri"

What's Next?

Good cloud experimentation practice suggests that if you've finished with your exoplanets database, you can delete it:

avn service terminate pg-exoplanets

For even more fun and learning, how about one of these resources:

Kaggle Open Datasets in case you don't fancy exoplanets, there are some excellent alternatives here
In our documentation you can find instructions for migrating your existing PostgreSQL to Aiven
More about the Aiven CLI, avn
Need more PostgreSQL? Check out our introduction to PostgreSQL post

DEV Community