Bismarck Cerda

Posted on Feb 10, 2022

Reading a CSV dataset and exploring the data

#python #pandas #tips

First of all, we are going to use the Meteorite_Landings.csv dataset from NASA.

Once downloaded, let's jump into the code! 🚀

import pandas as pd

# Load the CSV file into a dataframe
df_meteorites = pd.read_csv('dir/to/csv_file')

Pandas gives us different methods to understand the dataset. For example, we can see the shape of the dataframe as a (rows, columns) tuple ↔️

df_meteorites.shape

(45716, 10)

Using head() will show us the first 5 rows of the dataframe 💆‍♂️

df_meteorites.head()

You can also pass a number to head(n) to choose how many rows you want to see

df_meteorites.head(15)

We can see the last rows as well using tail(). It works the same way as head()

df_meteorites.tail()

df_meteorites.tail(15)

And now my favorite way to see the data is by using sample(n)

df_meteorites.sample(20)

This method returns a random sample of rows from our dataframe. I find it useful for getting a feel for the different kinds of values the dataframe contains, since the first and last rows are not always representative.

Let's see two powerful methods to get information about the dataframe and describe it 👀
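Because sample() picks different rows on every call, it can help to make the result repeatable. Here is a minimal sketch, assuming a tiny made-up dataframe (toy_df and its columns are hypothetical stand-ins, not the NASA data), that uses the random_state parameter to get the same rows each time:

```python
import pandas as pd

# Hypothetical miniature dataframe, used only for illustration
toy_df = pd.DataFrame({"id": range(10), "name": [f"m{i}" for i in range(10)]})

# Passing the same random_state yields the same random rows on every call
first = toy_df.sample(3, random_state=42)
second = toy_df.sample(3, random_state=42)

print(first)
print(first.equals(second))
```

This is handy when you want to share a notebook and have readers see the same sample you did.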

df_meteorites.info()

Simple as that, we can already work out how many NaN values each column has by comparing non-null counts. For example, id has 45716 non-null values, which matches the total number of rows we got from shape, while year has 45425. Subtracting the two, 45716 - 45425, tells us that there are 291 NaN values in the year column.

That was just one example of how to use the output of the info() method. It's a world of possibilities.

We can also see the data type of each column, and here is a tip for converting each column to the most suitable data type 🧐
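The same count can also be obtained directly instead of subtracting by hand. Below is a minimal sketch, assuming a small made-up dataframe (toy_df is hypothetical, only the id and year column names come from the dataset), that compares isna().sum() with the manual subtraction:

```python
import pandas as pd

# Hypothetical miniature stand-in for the meteorite data
toy_df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "year": [1880.0, None, 2013.0, None],  # None becomes NaN in a float column
})

# Number of missing values in every column at once
missing_per_column = toy_df.isna().sum()

# The manual approach: total rows minus the non-null count of one column
manual_missing = len(toy_df) - toy_df["year"].count()

print(missing_per_column)
print(manual_missing)
```

On the real dataframe, df_meteorites.isna().sum() should give the per-column NaN counts in one call.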

df_meteorites.convert_dtypes().dtypes
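Note that convert_dtypes() returns a new dataframe and leaves the original untouched, so the result has to be assigned if you want to keep it. A small sketch, assuming a made-up dataframe (toy_df and its values are hypothetical), that compares the dtypes before and after:

```python
import pandas as pd

# Hypothetical miniature dataframe with a missing value in two columns
toy_df = pd.DataFrame({
    "name": ["Aachen", "Aarhus", None],
    "mass": [21.5, 720.0, None],
    "id": [1, 2, 3],
})

before = toy_df.dtypes

# convert_dtypes() does not modify toy_df, so keep the returned dataframe
converted_df = toy_df.convert_dtypes()
after = converted_df.dtypes

print(before)
print(after)
```

The converted columns use pandas' nullable types (for example Int64 and Float64), which represent missing values as pd.NA.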

The last method that I'd like to show you is describe(), but first here's a tip: set the float display format before using describe(), because it shows statistics such as ➡️ standard deviation, mean, and max value for each numeric field, and those are easier to read with thousands separators and two decimals.

To do that, we'll use the following line

pd.options.display.float_format = '{:,.2f}'.format

Now we can describe the data

df_meteorites.describe(include='all')
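Setting pd.options.display.float_format changes the display for the whole session. If you only want the formatting for one output, a possible alternative is pd.option_context, which restores the previous setting afterwards. A minimal sketch, assuming a made-up dataframe (toy_df and its values are hypothetical):

```python
import pandas as pd

# Hypothetical miniature dataframe with one numeric column
toy_df = pd.DataFrame({"mass": [21.0, 720.0, 107000.0]})

# The float format applies only inside the with block
with pd.option_context("display.float_format", "{:,.2f}".format):
    formatted = toy_df.describe().to_string()

print(formatted)
```

Also keep in mind that include='all' adds the non-numeric columns to the summary, with rows such as unique, top and freq that are NaN for numeric fields.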

Once we understand the data, we can start querying the dataframe. Here is an example using the query() method.

We want to see all the meteorites from 2013 that were found

df_meteorites.query('year == 2013 and fall == "Found"')
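For comparison, the same filter can be written with a boolean mask, and query() can reference a Python variable with the @ prefix. A minimal sketch, assuming a made-up dataframe (toy_df, its rows and target_year are hypothetical, only the year and fall column names come from the dataset):

```python
import pandas as pd

# Hypothetical miniature stand-in for the meteorite data
toy_df = pd.DataFrame({
    "name": ["A", "B", "C"],
    "year": [2013.0, 2013.0, 1880.0],
    "fall": ["Found", "Fell", "Found"],
})

target_year = 2013

# query() with a local variable referenced through @
by_query = toy_df.query('year == @target_year and fall == "Found"')

# Equivalent boolean mask; each condition needs its own parentheses
by_mask = toy_df[(toy_df["year"] == target_year) & (toy_df["fall"] == "Found")]

print(by_query)
print(by_query.equals(by_mask))
```

Which style to use is mostly a matter of taste: query() tends to read better for longer conditions, while masks work with any column name, including ones with spaces or symbols.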

If you want to add some other tips or feedback, I would be more than grateful. Thanks for reading my first post, see you around ✌️

DEV Community
