Normalize nested JSON objects with pandas

ernestine-m — Mon, 03 Aug 2020 13:14:00 +0000

Ever since I started my job as a data analyst, I have heard from many different people that the most time-consuming task in data science is cleaning the data. After a little more than a month in this new job, I can only agree. However, Python's pandas library makes it smoother than I expected.

A little about pandas

Pandas is an open source data analysis library that allows for intuitive data manipulation. It's based on two primary data structures:

The series

It's a one-dimensional array capable of holding any type of data or Python objects. I like to think of it as a column in Excel.
Series are by default indexed with integers (0 to n-1) but we can also define our own index.
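
To make the two kinds of index concrete, here is a minimal sketch. The values and labels are made up for illustration.

```python
import pandas as pd

# Default index: integers starting at 0
prices = pd.Series([1.5, 2.0, 3.25])
print(prices.index.tolist())   # [0, 1, 2]

# Custom index: labels of our choosing
labeled = pd.Series([1.5, 2.0, 3.25], index=["apple", "pear", "kiwi"])
print(labeled["pear"])         # 2.0
```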

The dataframe

It's a 2-dimensional labeled data structure with columns of potentially different types. I like to think of it as different series put together (or as a spreadsheet in Excel). Dataframes are the most commonly used data type in pandas.
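
As a small sketch of the "series put together" idea, the example below builds a dataframe from two series. The column names and values are invented for illustration.

```python
import pandas as pd

names = pd.Series(["Ada", "Grace"])
ages = pd.Series([36, 45])

# Each Series becomes a column; columns can hold different types
people = pd.DataFrame({"name": names, "age": ages})
print(people.shape)            # (2, 2)
print(people["age"].tolist())  # [36, 45]
```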

The "10 minutes to pandas" article in the documentation explains everything you need to know to start with pandas!

Surprise! It's JSON nested objects...

It was not a good surprise. I had retrieved 178 pages of data from an API (I talk about this here) and I thought I had to write some code for each nested field I was interested in.
Indeed, my data looked like a shelf of Russian dolls, some of them containing smaller dolls, and some of them not.

The data
Nested JSON object structure
I was only interested in keys that were at different levels in the JSON. This seemed like long and tedious work.

The solution: pandas.json_normalize

Pandas offers a function to easily flatten nested JSON objects and select the keys we care about in 3 simple steps:

Make a Python list of the keys we care about. We can access nested objects with dot notation
Pass the deserialized JSON object to the json_normalize function
Filter the resulting dataframe with the list of keys

And voilà!
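
The three steps can be sketched on a tiny example. The record below and its key names are hypothetical and only stand in for the real API data.

```python
import pandas as pd

# Hypothetical record, already parsed from JSON into a Python dict
record = {
    "id": 1,
    "user": {"name": "Ada", "address": {"city": "London"}},
    "tags": ["a", "b"],
}

# 1. Keys we care about; nested keys use dot notation
fields = ["id", "user.name", "user.address.city"]

# 2. Flatten the nested object into a one-row dataframe
flat = pd.json_normalize(record)

# 3. Keep only the selected columns
result = flat[fields]
print(result)
```

Keys that are not listed, such as "tags" here, are simply left out of the result.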

Since I had multiple files to clean that way, I wrote a function to automate the process throughout my code:

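import pandas as pd

FIELDS = ['list of keys I care about']  # placeholder: column names in dot notation

def clean_data(data):
    # Flatten each record and keep only the columns listed in FIELDS.
    # As in my original loop, the first element (data[0]) is skipped.
    frames = [pd.json_normalize(record)[FIELDS] for record in data[1:]]
    if not frames:
        return pd.DataFrame()
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    return pd.concat(frames)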

This function allowed me to clean the data I had retrieved and prepare clear dataframes for analysis in just a couple lines of code! 🙌

What's an API and how to access one using Python?

ernestine-m — Tue, 21 Jul 2020 14:36:48 +0000

Last month, I was given my very first task at work as a beginner in data science: retrieve data from an API that uses the OAuth2 authorization protocol. In hindsight, that seems like a very basic task, but I had trouble finding a beginner-friendly how-to online. This article is a short breakdown of the steps needed to communicate with an API using Python 3.

What is an API?

The textbook definition of an API (or Application Programming Interface) is "a set of functions and procedures allowing the creation of applications that access the features or data of an operating system, application, or other service."

To put it simply, an API is the messenger between a client and a server and allows us to retrieve data. It can be compared to a waiter in a restaurant who takes our order, transmits it to cooks in the kitchen, then delivers our food back to us.

A very helpful 3 minute explanation

Different architectural styles can be used to build an API, but the most common one is representational state transfer (REST), which allows for interoperability between computer systems on the internet. A RESTful API, or REST API, uses standard HTTP methods to communicate:

GET to retrieve a resource/data
PUT to change the state of a resource or update it
POST to create a resource
DELETE to remove a resource
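
As a rough illustration, the sketch below builds one request per method with the requests library, without sending anything. The endpoint and payloads are hypothetical.

```python
import requests

BASE = "https://api.example.com/books"  # hypothetical endpoint

# Build (but do not send) one request per HTTP method
prepared = [
    requests.Request("GET", f"{BASE}/1").prepare(),                               # retrieve
    requests.Request("PUT", f"{BASE}/1", json={"title": "New title"}).prepare(),  # update
    requests.Request("POST", BASE, json={"title": "A new book"}).prepare(),       # create
    requests.Request("DELETE", f"{BASE}/1").prepare(),                            # remove
]

for req in prepared:
    print(req.method, req.url)
```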

What is OAuth2?

In order to access a protected API, you need authorization. The most common standard is called OAuth and is used by most big tech companies. OAuth allows access tokens to be issued to third-party clients by an authorization server, with the approval of the resource owner. The third party then uses the access token to access the protected resources hosted by the resource server.

So, how does it work?

The workflow I had to use for this task was client_credentials, which consists of 2 steps:

Step 1: Request an access token with the information given by the resource owner

In order to communicate with APIs, Python has a very useful HTTP library called requests that allows us to retrieve data in a very simple way. There's no need to manually add query strings to URLs, or to form-encode POST data.
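
To see what requests does for us, the sketch below prepares a request without sending it and prints the result. The URL and parameter values are hypothetical.

```python
import requests

# Hypothetical URL and values; the request is prepared but never sent
req = requests.Request(
    "POST",
    "https://api.example.com/search",
    params={"page": 2, "q": "pandas"},          # becomes the query string
    data={"grant_type": "client_credentials"},  # becomes a form-encoded body
).prepare()

print(req.url)                      # https://api.example.com/search?page=2&q=pandas
print(req.body)                     # grant_type=client_credentials
print(req.headers["Content-Type"])  # application/x-www-form-urlencoded
```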

The code I wrote

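import requests

# Every value below is a placeholder to replace with the values for your API
values = {
    'grant_type': 'client_credentials',
    'client_id': 'given by the resource owner',
    'client_secret': 'given by the resource owner',
    'scope': 'specified in the API documentation',
}
headers = {
    'Authorization': 'given by the resource owner',
}
# 'host' stands for the API's base URL, scheme included
r = requests.post('host/oauth2/token', data=values, headers=headers)
r.raise_for_status()  # stop here if the server rejected the request

# The JSON response contains the access token used in step 2
print(r.json())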

This code gives us back an access token that allows us to move to step 2.

Step 2: Retrieve the data by using the access token that's been issued

For this step, I used Postman, a collaborative platform for API development that also allows us to send requests. This tool is useful for beginners as it auto-generates headers. The only one I had to add was a Range header, because the API results were paginated.
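
The same request could also be built in Python. This is only a sketch under assumptions: it supposes the API expects the token as a Bearer token (a common OAuth2 convention) and reads a Range header whose exact format is defined by the API. The URL, token and range value are placeholders.

```python
import requests

access_token = "token-from-step-1"  # placeholder for the token issued in step 1

headers = {
    "Authorization": f"Bearer {access_token}",  # assumption: Bearer scheme
    "Range": "0-99",                            # assumption: format is API-specific
}

# Prepared but not sent, since the URL is hypothetical
req = requests.Request(
    "GET", "https://api.example.com/data", headers=headers
).prepare()
print(req.headers["Authorization"])
```

With a real URL, requests.get(url, headers=headers) would send the same request in one call.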

Paginated?

Yes, just like books, APIs can be paginated. Since databases can contain millions or billions of records, requesting all of them at once could overload the server. Pagination prevents this by limiting the amount of data returned by each request. There are 3 main types of pagination:

Offset-based pagination
Keyset pagination
Seek pagination
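
To give an idea of the difference between the first two, here is a simplified sketch that uses an in-memory list as a stand-in for a database table. The function names are made up for illustration, and real APIs implement this on the server side.

```python
# In-memory stand-in for a database table, sorted by id
ROWS = [{"id": i} for i in range(1, 11)]

def offset_page(offset, limit):
    """Offset-based: skip `offset` rows, then return up to `limit` rows."""
    return ROWS[offset:offset + limit]

def keyset_page(after_id, limit):
    """Keyset: return up to `limit` rows whose id is greater than the last id seen."""
    return [row for row in ROWS if row["id"] > after_id][:limit]

# Walk through all rows with keyset pagination, 4 rows at a time
seen, last_id = [], 0
while True:
    page = keyset_page(last_id, 4)
    if not page:
        break
    seen.extend(row["id"] for row in page)
    last_id = page[-1]["id"]

print(offset_page(4, 4))  # rows with ids 5 to 8
print(seen)               # ids 1 to 10, collected over three pages
```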

This article goes into greater detail about each one of these methods!

DEV Community: ernestine-m