PostgreSQL is and remains one of our most popular and strongly growing storage platforms. Other storage technologies come and go but modern Postgres is a solid choice for so many applications. In this article we'll take a look at how to get started with Aiven for PostgreSQL, populate it with data and start playing.
When you spin up your first Aiven for PostgreSQL, you'll want to take some time to play with the features ... but there's a problem. Your new shiny database is empty.
Finding and using some open datasets is a great way to fill this gap. One option is go try the Kaggle platform. It's a place to find open data, advice about data science, and some competitions you can participate in to hone your skills. There's quite a selection of datasets to choose from, but today I'll be using the exoplanets data from the Kepler mission. You'll need a (free) account to log in and download the data. Go ahead and extract the zip file, I'm using
cumulative.csv for the example in this post.
If you are not already an Aiven user, you can sign up for an Aiven account to follow the steps in this post - we'll wait right here!
We will also be using the Aiven CLI. This tool requires Python 3.6 or later, and can be installed from PyPI:
pip install aiven-client
You also need to authenticate your Aiven account against the CLI tool. Run the following command, substituting your own details:
avn user login <firstname.lastname@example.org>
You now have everything you need to create an Aiven database in the cloud.
A good first step is to create a project to keep the services in. All it needs is a name:
avn project create exoplanets
Aiven offers many options when creating services but to get us going quickly, I'll use the newest postgres available and the smallest package, called "hobbyist". One of the most fun things though is being able to choose any cloud platform you like! Take a moment to check the list and copy the
CLOUD_NAME field of your favorite:
avn cloud list
google-europe-west1 for my example. Here is the command to run to create the Postgres database (remember to swap in the cloud of your choice):
avn service create -t pg -p hobbyist --cloud google-europe-west1 pg-exoplanets
The node takes a few minutes to be ready, but the Aiven CLI has a handy "wait" command that doesn't return until the service is ready to talk to us. This is less critical for running the commands by hand as I have here, but it's super useful when your CI system is spinning up the data platforms by itself!
avn service wait pg-exoplanets
When the command returns, my PostgreSQL cluster is ready to use. Let's create a database to hold the sample data; the command below creates one named "exoplanets":
avn service database-create --dbname exoplanets pg-exoplanets
Now I have a shiny new database... but it's all sad and empty. Let's look at the sample data and get it imported.
PostgreSQL has built-in support for importing CSV data into an existing table, but I don't have the table structure, just a CSV. Luckily there's a tool for that - ddlgenerator is another Python commandline tool.
Here's how to install the
ddlgenerator tool and then generate the
CREATE TABLE statement from
cumulative.csv that I downloaded earlier:
pip install ddlgenerator ddlgenerator postgres cumulative.csv > create.sql
Have a look inside the file and you will see that it has the structure I need to explain to PostgreSQL how to hold the data. The
avn service cli command will give us a
psql prompt on the new database:
avn service cli pg-exoplanets
psql I can connect to the "exoplanets" database, and then run the SQL file to create the table structure:
\c exoplanets \i create.sql
Adding the final piece to the puzzle, and still from the
psql prompt, the next command brings in the CSV data:
\copy cumulative from data/cumulative.csv csv header
Nice work! The
cumulative table should now have some data for you to play with!
Now you have a database full of measurements of exoplanets taken by the Kepler Space Telescope. If you're not already familiar with the project, the NASA mission page is worth a read. The mission went into a second phase when one of the controls failed, which serves to remind us that engineering systems we can see and touch, or at least ssh into, is much easier gig than operating in space!
You can explore the dataset, which describes observations and compares the Kepler assessment of each exoplanet with its official status in the pre-existing literature. For example, try this to see the false-positives identified by Kepler:
select kepler_name, koi_pdisposition from cumulative where koi_disposition = 'CONFIRMED' and koi_pdisposition = 'FALSE POSITIVE';
You can also connect this data to other tools to use the dataset further. Either grab the connection details from the web console, or use jq with
avn for a one-liner:
avn service get pg-exoplanets --json | jq ".service_uri"
Good cloud experimentation practice suggests that if you've finished with your exoplanets database, you can delete it:
avn service terminate pg-exoplanets
For even more fun and learning, how about one of these resources:
- Kaggle Open Datasets in case you don't fancy exoplanets, there are some excellent alternatives here
- In our documentation you can find instructions for migrating your existing PostgreSQL to Aiven
- More about the Aiven CLI,
- Need more PostgreSQL? Check out our introduction to PostgreSQL post