jayson kibet

Posted on Jun 19

Pandas and Visualizations Using Matplotlib and Seaborn

#python #datascience #tutorial #analytics

Introduction

I still remember the first time someone handed me a CSV with 2,000 rows and said, "Find some insights in there." I had no idea where to start. Pandas is what eventually made that feel less like a chore and more like a conversation with the data. You ask it questions, it answers. And once you've cleaned things up, matplotlib and seaborn are how you actually show what you found, instead of just describing it in a paragraph nobody wants to read.
I'm going to walk through these tools using a dataset I worked with recently: rental properties around Nairobi. Not because it's the most exciting dataset in the world, but because it's exactly the kind of messy, real-world thing you'll actually run into.

Foundation of Pandas

Pandas really only gives you two things to think about:
Series - one column, on its own.
DataFrame (df) - the full table, rows and columns together. This is where you'll live 95% of the time.
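
To make the difference concrete, here is a minimal sketch. The column names and values are invented for illustration and are not taken from the housing dataset.

```python
import pandas as pd

# Hypothetical sample data (invented values, not the author's dataset).
df = pd.DataFrame({
    "estate": ["Kilimani", "Westlands", "Ruaka"],
    "rent_kes": [85000, 120000, 35000],
})

rent = df["rent_kes"]        # selecting a single column returns a Series

print(type(df).__name__)     # DataFrame
print(type(rent).__name__)   # Series
print(df.shape)              # (3, 2): 3 rows, 2 columns
```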

Getting your data

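It doesn't matter where the data is coming from: pandas reads CSVs, Excel files, JSON and even a live database connection, all in roughly the same one-liner.
The pattern is always a pd.read_* function (pd.read_csv, pd.read_excel, pd.read_json, pd.read_sql) that takes a source and hands you back a DataFrame.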
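
As a rough sketch of that pattern, the example below reads a CSV from an in-memory string so it runs without any file on disk. The file names, the table name and the `connection` object in the comments are placeholders I made up, not anything from the original post.

```python
import io
import pandas as pd

# An in-memory stand-in for a CSV file (invented values).
csv_text = "estate,rent_kes\nKilimani,85000\nRuaka,35000\n"
df = pd.read_csv(io.StringIO(csv_text))
print(df)

# The other readers follow the same shape (all names below are placeholders):
# df = pd.read_excel("listings.xlsx")
# df = pd.read_json("listings.json")
# df = pd.read_sql("SELECT * FROM listings", connection)
```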

For my housing data, you go to File Explorer, find the folder where you downloaded and saved the CSV file, copy the file path, then head back to your Python environment. Before pasting the path, type the letter r right after the opening parenthesis and just before the opening quote. That turns the path into a raw string, so Python does not treat the backslashes in a Windows path as escape characters (\n as a newline, \t as a tab, and so on), which would otherwise cause an error or point to the wrong file.
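
Here is a small sketch of what the r prefix actually changes. The path is hypothetical and chosen only because it contains \n and \t, so the file is never opened here.

```python
# Hypothetical Windows path, for illustration only.
plain_path = "C:\new_data\table.csv"   # \n and \t are read as newline and tab
raw_path = r"C:\new_data\table.csv"    # raw string: backslashes stay as typed

print(repr(plain_path))
print(repr(raw_path))

# With a real file you would then write something like:
# df = pd.read_csv(raw_path)
# Forward slashes also work on Windows: "C:/new_data/table.csv"
```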

df.head() shows the first 5 rows by default, but you can also put the number of rows you want inside the parentheses, for example df.head(20) for the first 20 rows.
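
A quick sketch of how the row count behaves, using a made-up 30-row frame (the `listing_id` column is hypothetical):

```python
import pandas as pd

# Hypothetical frame with 30 rows, just to have something to preview.
df = pd.DataFrame({"listing_id": range(1, 31)})

print(len(df.head()))     # 5
print(len(df.head(20)))   # 20
print(len(df.tail(3)))    # 3 (tail() is the same idea from the bottom)
```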

Examine your dataset

This is the step I see people skip, and it always costs them later. Before you clean or filter or do anything to the data, just look at it. Get a feel for what's in there.

df.describe() gives you the numeric summary - count, mean, standard deviation, min, max and the quartiles. Throw a .T (transpose) on the end so it's readable when you've got a lot of columns.
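
A minimal sketch of that first look, on invented data. The df.info() call is my own addition for checking dtypes and non-null counts; the post itself only names describe().

```python
import pandas as pd

# Hypothetical sample data (invented values).
df = pd.DataFrame({
    "estate": ["Ruaka", "Kilimani", "Westlands", "Ruaka"],
    "bedrooms": [1, 2, 3, 2],
    "rent_kes": [35000, 85000, 120000, 60000],
})

df.info()                    # column names, dtypes and non-null counts
summary = df.describe().T    # transposed: one row per numeric column
print(summary)
```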

Missing Values and Duplicates

Every dataset I've ever worked with has had gaps somewhere. It's basically a guarantee. The real question is what to do about it and honestly, that depends entirely on how much is missing. It's also worth checking for duplicates. My data, for example, has none.
Datasets pick up duplicate rows more often than you'd think - a form gets submitted twice, two sources get merged, that kind of thing.

I usually go by something like this:
If nulls are 0–5%: drop those rows or fill with mean/median/mode
If nulls are 5–40%: fill with mean/median/mode
Above 40%: honestly, think about dropping the whole column
Here's the intuition behind the thresholds, because the numbers alone don't explain much. If a column's only missing 3% of its values, filling those gaps with the mean barely moves the needle on the overall picture. But if 60% of a column is empty and you fill it all with one number, you're not really "filling in gaps" anymore - you're mostly just making the column up. At that point it's usually more honest to drop it.
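
The rule of thumb above could be sketched like this. The data is invented, and filling with the median is just one of the options listed, picked here for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one gap and one repeated row.
df = pd.DataFrame({
    "estate": ["Kilimani", "Ruaka", "Karen", "Ruaka"],
    "rent_kes": [85000.0, 35000.0, np.nan, 35000.0],
})

null_pct = df.isna().mean() * 100        # percent missing per column
n_duplicates = df.duplicated().sum()     # fully repeated rows
print(null_pct)
print(n_duplicates)

# Drop columns that are more than 40% empty (none qualify here).
df = df.drop(columns=null_pct[null_pct > 40].index)

# Remove repeated rows, then fill the remaining gap with the median.
clean = df.drop_duplicates().reset_index(drop=True)
clean["rent_kes"] = clean["rent_kes"].fillna(clean["rent_kes"].median())
print(clean)
```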

Picking Rows and Columns

loc and iloc confuse almost everyone at first, so here's the short version. loc cares about the row's label - whatever the index actually says, whether that's a number, a name or a date. iloc only cares about position - 1st, 2nd, 3rd row (counted from 0) - no matter what the index looks like underneath. Once that distinction clicks, the two stop being confusing.
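
A small sketch of the label-versus-position idea, using a made-up frame whose index holds estate names instead of 0, 1, 2:

```python
import pandas as pd

# Hypothetical data; the index holds labels rather than default integers.
df = pd.DataFrame(
    {"rent_kes": [85000, 120000, 35000]},
    index=["Kilimani", "Westlands", "Ruaka"],
)

by_label = df.loc["Westlands", "rent_kes"]   # row labelled "Westlands"
by_position = df.iloc[1, 0]                  # second row, first column
print(by_label, by_position)

# Slices differ too: loc includes the end label, iloc excludes the end position.
print(len(df.loc["Kilimani":"Westlands"]))   # 2
print(len(df.iloc[0:1]))                     # 1
```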

Filtering

Just like in Excel, SQL and BI tools, the idea of filtering stays the same. In Python, "equals" is written with two equals signs (==), because a single = is assignment. "Not equal" is written as !=, just like in SQL.

Tip: always wrap each condition in its own parentheses. & and | bind tighter than comparisons like == and > (and & binds tighter than |), so skipping the parentheses can raise an error or quietly give you the wrong answer.
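
Here is a short sketch of those operators in action. The columns and values are invented for illustration.

```python
import pandas as pd

# Hypothetical sample data (invented values).
df = pd.DataFrame({
    "estate": ["Kilimani", "Westlands", "Ruaka", "Karen", "Ruaka"],
    "bedrooms": [2, 3, 1, 4, 2],
    "rent_kes": [85000, 120000, 35000, 250000, 45000],
})

two_bed = df[df["bedrooms"] == 2]        # equals
not_karen = df[df["estate"] != "Karen"]  # not equal

# AND: each condition sits in its own parentheses.
affordable = df[(df["bedrooms"] >= 2) & (df["rent_kes"] < 100000)]

# OR works the same way.
either = df[(df["estate"] == "Karen") | (df["rent_kes"] < 40000)]

print(affordable)
```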

New Columns & Renaming

Most new columns you'll create are just simple math on columns you already have, like converting currency or units. When the logic is more than basic arithmetic, apply() with a lambda or np.where() will usually get the job done.
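
A minimal sketch of the three approaches mentioned above. The data, the new column names and the exchange rate are all assumptions made up for the example; the rate in particular is not a real quote.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data (invented values).
df = pd.DataFrame({
    "bedrooms": [1, 2, 3],
    "rent_kes": [35000, 85000, 120000],
})

KES_PER_USD = 130  # assumed rate, for illustration only

# Simple math on existing columns.
df["rent_usd"] = df["rent_kes"] / KES_PER_USD
df["rent_per_bedroom"] = df["rent_kes"] / df["bedrooms"]

# np.where: a vectorised if/else.
df["tier"] = np.where(df["rent_kes"] >= 80000, "premium", "standard")

# apply with a lambda: arbitrary per-value logic.
df["layout"] = df["bedrooms"].apply(lambda n: "single" if n <= 1 else "multi")

print(df)
```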

Renaming is even simpler. You pass rename() a dictionary that maps each old name to its new name:
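
For example, a sketch with made-up column names:

```python
import pandas as pd

# Hypothetical frame with awkward column names.
df = pd.DataFrame({
    "Estate Name": ["Kilimani", "Ruaka"],
    "Rent (KES)": [85000, 35000],
})

# Keys are the old names, values are the new ones.
df = df.rename(columns={"Estate Name": "estate", "Rent (KES)": "rent_kes"})
print(df.columns.tolist())   # ['estate', 'rent_kes']
```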

Sorting

Sorting is either ascending or descending. sort_values() sorts ascending by default, and passing ascending=False flips it.
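
A quick sketch on invented data, including a sort on two columns at once:

```python
import pandas as pd

# Hypothetical sample data (invented values).
df = pd.DataFrame({
    "estate": ["Ruaka", "Kilimani", "Ruaka", "Karen"],
    "rent_kes": [35000, 85000, 45000, 250000],
})

cheapest_first = df.sort_values("rent_kes")                    # ascending
priciest_first = df.sort_values("rent_kes", ascending=False)   # descending

# Estate A to Z, then the highest rent first within each estate.
by_estate = df.sort_values(["estate", "rent_kes"], ascending=[True, False])
print(by_estate)
```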

Grouping (Where Pandas Really Starts Paying Off)

This is the part that, once it clicks, changes how you think about data entirely. groupby answers "what does this look like per category" without you writing a single loop.
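
A minimal sketch of "per category" questions, using invented rents for two estates:

```python
import pandas as pd

# Hypothetical sample data (invented values).
df = pd.DataFrame({
    "estate": ["Ruaka", "Ruaka", "Kilimani", "Kilimani"],
    "rent_kes": [35000, 45000, 85000, 95000],
})

# One number per estate.
avg_rent = df.groupby("estate")["rent_kes"].mean()
print(avg_rent)

# Several summaries per estate in one go.
summary = df.groupby("estate")["rent_kes"].agg(["count", "mean", "max"])
print(summary)
```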

If all you want is "how often does each value show up," value_counts() is the shortcut, and it's quicker to type than a full groupby:
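
For instance, on a made-up estate column:

```python
import pandas as pd

# Hypothetical sample data (invented values).
df = pd.DataFrame({"estate": ["Ruaka", "Kilimani", "Ruaka", "Karen", "Ruaka"]})

counts = df["estate"].value_counts()                 # most frequent first
shares = df["estate"].value_counts(normalize=True)   # proportions instead of counts
print(counts)
print(shares)
```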

Once groupby feels natural, it's worth digging into pivot tables too (pd.pivot_table()). They scratch a similar itch but let you summarize across two dimensions at once - rows and columns - which is great for something like "average rent by estate, broken down further by property type," all in a single table.
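
That "average rent by estate and property type" table could look something like this sketch. The data and column names are invented.

```python
import pandas as pd

# Hypothetical sample data (invented values).
df = pd.DataFrame({
    "estate": ["Kilimani", "Kilimani", "Kilimani", "Ruaka"],
    "property_type": ["Apartment", "Apartment", "House", "Apartment"],
    "rent_kes": [85000, 95000, 150000, 35000],
})

table = pd.pivot_table(
    df,
    values="rent_kes",        # what to summarise
    index="estate",           # rows
    columns="property_type",  # columns
    aggfunc="mean",           # how to summarise
)
print(table)   # combinations with no listings show up as NaN
```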

Cleaning Up Text

Real-world text data is messy almost by default - inconsistent capitalization, stray spaces, the occasional typo. The .str accessor handles most of the cleanup. Below is a simple syntax:
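
A minimal sketch of a typical cleanup chain. The messy values are invented to show the three problems named above.

```python
import pandas as pd

# Hypothetical messy text: stray spaces, mixed case and one typo.
df = pd.DataFrame({"estate": ["  kilimani", "KILIMANI ", "Kilimani", "west lands"]})

df["estate"] = (
    df["estate"]
    .str.strip()                                          # trim leading/trailing spaces
    .str.title()                                          # consistent capitalisation
    .str.replace("West Lands", "Westlands", regex=False)  # fix a known typo
)
print(df["estate"].unique())
```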

Creating visuals

For me, this is the fun part of being a data analyst.
Tables only get you so far. There's a point where a chart shows you something - a trend, a weird outlier, a relationship - that you'd never notice scrolling through rows of numbers. That's where matplotlib and seaborn come in.
Matplotlib is the engine underneath everything.
Seaborn sits on top and just makes the defaults look a lot better with a lot less effort.

You start by importing them:
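
The usual imports look like this. The plt and sns aliases are community convention rather than a requirement, and the set_theme() call is an optional extra I added.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Optional: apply seaborn's styling to every plot that follows.
sns.set_theme(style="whitegrid")
```

In a script you call plt.show() after building a chart to display it; notebooks usually render the figure on their own.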

1. Count Plots - how many of each thing do we have?
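
A minimal sketch on invented data; the `property_type` column is hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical sample data (invented values).
df = pd.DataFrame({
    "property_type": ["Apartment", "House", "Apartment", "Studio", "Apartment"],
})

fig, ax = plt.subplots(figsize=(6, 4))
sns.countplot(data=df, x="property_type", ax=ax)   # one bar per category
ax.set_title("Listings by property type")
ax.set_xlabel("Property type")
ax.set_ylabel("Number of listings")
fig.tight_layout()
```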

2. Bar Plots - comparing an average across groups
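
In this sketch seaborn does the averaging itself: barplot() plots the mean of y for each x category by default and draws a small error bar on top. The data is invented.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical sample data (invented values).
df = pd.DataFrame({
    "estate": ["Kilimani", "Kilimani", "Ruaka", "Ruaka"],
    "rent_kes": [85000, 95000, 35000, 45000],
})

fig, ax = plt.subplots(figsize=(6, 4))
sns.barplot(data=df, x="estate", y="rent_kes", ax=ax)   # bar height = mean rent
ax.set_title("Average rent by estate")
ax.set_xlabel("Estate")
ax.set_ylabel("Average rent (KES)")
fig.tight_layout()
```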

You don't have to let seaborn do the aggregating for you. Sometimes it's nicer to groupby first and plot the result, especially if you want the bars sorted by value instead of alphabetically:
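
One way to sketch that, on invented data: aggregate and sort in pandas, then hand the small result table to seaborn.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical sample data (invented values).
df = pd.DataFrame({
    "estate": ["Kilimani", "Ruaka", "Karen", "Kilimani", "Ruaka"],
    "rent_kes": [85000, 35000, 250000, 95000, 45000],
})

# Aggregate first, sort by value, and turn the result back into a DataFrame.
avg_rent = (
    df.groupby("estate")["rent_kes"]
    .mean()
    .sort_values(ascending=False)
    .reset_index()
)

fig, ax = plt.subplots(figsize=(6, 4))
sns.barplot(data=avg_rent, x="estate", y="rent_kes", ax=ax)   # bars follow row order
ax.set_ylabel("Average rent (KES)")
fig.tight_layout()
```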

3. Histograms - what does the spread of a number actually look like?

4. Box Plots - median, spread and outliers in one shape

The box itself covers the middle 50% of your data (25th to 75th percentile), the line inside is the median, and the whiskers stretch out to the furthest points that sit within about 1.5× that range (the interquartile range) beyond each edge of the box. Anything past the whiskers shows up as its own little dot - that's your outlier.
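
The sketch below works out the same 1.5× rule by hand in pandas and then draws the box plot. The rents are invented, with one deliberately extreme value.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical sample data with one very expensive listing.
df = pd.DataFrame({
    "rent_kes": [30000, 35000, 40000, 45000, 50000, 55000, 60000, 250000],
})

rent = df["rent_kes"]
q1, q3 = rent.quantile(0.25), rent.quantile(0.75)
iqr = q3 - q1                      # height of the box
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values beyond the fences are what the plot draws as separate dots.
outliers = rent[(rent < lower_fence) | (rent > upper_fence)]
print(outliers.tolist())           # [250000]

fig, ax = plt.subplots(figsize=(6, 3))
sns.boxplot(data=df, x="rent_kes", ax=ax)
ax.set_xlabel("Monthly rent (KES)")
fig.tight_layout()
```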

5. Scatter Plots
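
A minimal sketch of a scatter plot on invented data, with points coloured by a hypothetical `estate` column:

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical sample data (invented values).
df = pd.DataFrame({
    "bedrooms": [1, 2, 2, 3, 3, 4],
    "rent_kes": [35000, 45000, 85000, 60000, 120000, 250000],
    "estate": ["Ruaka", "Ruaka", "Kilimani", "Ruaka", "Kilimani", "Karen"],
})

fig, ax = plt.subplots(figsize=(6, 4))
sns.scatterplot(data=df, x="bedrooms", y="rent_kes", hue="estate", ax=ax)
ax.set_xlabel("Bedrooms")
ax.set_ylabel("Monthly rent (KES)")
fig.tight_layout()
```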

6. Line Plots - trends, usually over time

7. Pie Charts - proportions

Funny enough, seaborn doesn't actually have a pie chart function - for this one, you drop back down to plain matplotlib:
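
A minimal sketch using matplotlib's Axes.pie() on invented counts:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample data (invented values).
df = pd.DataFrame({
    "property_type": ["Apartment", "House", "Apartment", "Studio", "Apartment"],
})

counts = df["property_type"].value_counts()

fig, ax = plt.subplots(figsize=(5, 5))
ax.pie(
    counts.values,
    labels=counts.index.tolist(),
    autopct="%1.0f%%",    # print each slice's percentage
    startangle=90,
)
ax.set_title("Share of listings by property type")
fig.tight_layout()
```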

Conclusion

Pandas plus matplotlib and seaborn cover most of what you actually need to explore a dataset: load it in, get a feel for it, fix what's broken, slice it however your question demands and then turn it into something you can actually look at and understand. None of the individual pieces are hard - the real skill, the thing that takes practice, is knowing which tool fits the question in front of you.
I hope you found this guide useful. Ciao.

DEV Community