DEV Community

Bismarck Cerda

Reading a CSV dataset and exploring the data

First of all, we are going to use the Meteorite_Landings.csv dataset from NASA.

Once downloaded, let's jump into the code! 🚀

import pandas as pd

# Load the CSV file into our dataframe
df_meteorites = pd.read_csv('dir/to/csv_file')

Pandas gives us several methods for understanding the dataset. For example, we can check the shape of the dataframe (rows, columns) ↔️

df_meteorites.shape

(45716, 10)

Using head() will show us the first 5 rows of the dataframe 💆‍♂️

df_meteorites.head()

You can also pass head(n) the number of rows you want to see

df_meteorites.head(15)

We can see the last rows as well using tail(), which works the same way as head()

df_meteorites.tail()

df_meteorites.tail(15)

And now my favorite way to look at the data: sample(n)

df_meteorites.sample(20)

This method returns a random sample of rows from our dataframe. I find it useful for getting a representative look at the different kinds of values the dataframe contains.
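
Because sample() is random, each call returns different rows. If you want the same sample every time (for example, to share a notebook), you can pass random_state. Here is a minimal sketch; df_demo and its values are made up for illustration and stand in for df_meteorites.

```python
import pandas as pd

# Hypothetical stand-in for df_meteorites; the values are invented for this example.
df_demo = pd.DataFrame({
    "name": [f"meteorite_{i}" for i in range(100)],
    "mass": [float(i) for i in range(100)],
})

# random_state fixes the seed, so both calls pick the same rows.
first_draw = df_demo.sample(5, random_state=42)
second_draw = df_demo.sample(5, random_state=42)

print(first_draw)
print(first_draw.equals(second_draw))  # True
```

Without random_state, the two draws would usually differ.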

Let's look at two powerful methods to get information about the dataframe and describe it 👀

df_meteorites.info()

Just like that, we can already work out how many NaN values each column has by comparing non-null counts. The id column has 45716 non-null values, which matches the total number of rows we got from shape, while year has 45425. Subtracting the two, 45716 - 45425, tells us the year column has 291 NaN values.
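
The same subtraction can be done in code instead of by hand. This is a small sketch on a made-up frame (df_demo is hypothetical, not the real dataset); isna().sum() gives the per-column NaN count directly.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame; only the column names mirror the dataset.
df_demo = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "year": [1880.0, np.nan, 2013.0, np.nan],
})

# Total rows minus non-null count per column: the subtraction done by hand above.
missing_by_subtraction = len(df_demo) - df_demo.count()

# isna().sum() counts the NaN values per column directly.
missing_direct = df_demo.isna().sum()

print(missing_direct)
```

On the real dataframe, df_meteorites.isna().sum() should give the NaN count for every column at once.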

That was just one example of how to use the output of the info() method. It's a world of possibilities.

We can also see the data type of each column, and here is a tip for converting each column to a more suitable data type 🧐

df_meteorites.convert_dtypes().dtypes
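
Note that convert_dtypes() returns a new dataframe and leaves the original untouched, so you need to assign the result if you want to keep the converted types. A minimal sketch on a made-up frame (df_demo and its values are hypothetical):

```python
import pandas as pd

# Hypothetical frame: whole-number floats with a missing value, plus a text column.
df_demo = pd.DataFrame({
    "name": ["Aachen", "Aarhus", None],
    "id": [1.0, 2.0, None],
})

# The missing value forces `id` to be stored as float64 at first.
print(df_demo.dtypes)

# convert_dtypes() picks nullable dtypes; assign the result to keep it.
converted = df_demo.convert_dtypes()
print(converted.dtypes)
```

Here the id column goes from float64 to the nullable Int64 type, since all of its non-missing values are whole numbers.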

The last method I'd like to show you is describe(). Before that, here's a tip: set a float format before calling describe(), because its output includes statistics such as ➡️ standard deviation, mean, and max value for each numeric field, and these are easier to read with a fixed number of decimals.

To do that, we'll use the following line:

pd.options.display.float_format = '{:,.2f}'.format
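
To see what that format string does on its own, here is a tiny sketch that uses plain Python, independent of pandas. The name fmt is just for illustration.

```python
# The same callable we hand to pandas: comma thousands separator, two decimals.
fmt = '{:,.2f}'.format

print(fmt(1234567.891))  # 1,234,567.89
print(fmt(0.5))          # 0.50
```

Pandas calls this function on each float it displays, so it only changes how numbers are shown, not the stored values.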

Now we can describe the data

df_meteorites.describe(include='all')

Once we understand the data, we can query the dataframe. Here is an example using the query() method.

We want to see all the meteorites from 2013 that were found:

df_meteorites.query('year == 2013 and fall == "Found"')
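
If you prefer plain boolean indexing, the same filter can be written as a mask. This sketch uses a made-up frame (df_demo is hypothetical; only the column names year and fall come from the query above) to show that both forms select the same rows.

```python
import pandas as pd

# Hypothetical rows; only the column names mirror the dataset.
df_demo = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "year": [2013, 2013, 1999, 2013],
    "fall": ["Found", "Fell", "Found", "Found"],
})

# The query() form, as used above.
via_query = df_demo.query('year == 2013 and fall == "Found"')

# The equivalent boolean mask: note the parentheses and & instead of `and`.
via_mask = df_demo[(df_demo["year"] == 2013) & (df_demo["fall"] == "Found")]

print(via_query)
print(via_query.equals(via_mask))  # True
```

query() tends to read better for longer conditions, while the mask form avoids putting the condition inside a string.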

If you want to add other tips or feedback, I would be more than grateful. Thanks for reading my first post, see you around ✌️