Joseous Ng'ash

Posted on Jun 20

Pandas and Data Visualization Using Matplotlib and Seaborn

#datascience #dataengineering #analytics #deeplearning

This is a new chapter in my journey of learning data analytics and data science. The focus now is on Pandas, a Python library, alongside Matplotlib and Seaborn for data visualization.
I am writing this article to guide beginners who are starting out in, or already working toward, a career in data analytics and data science.

Introduction to Python Data Analysis

In the modern world, data has become one of the most valuable assets. Businesses, healthcare institutions, financial institutions and even social media platforms rely heavily on data to make informed decisions.
Raw data on its own is hard to use, and that is where data analysis and visualization become important.

Python is one of the most widely used programming languages for data analysis because of its simplicity and powerful libraries.
Commonly used libraries are Pandas, Matplotlib and Seaborn.
Pandas: Helps in cleaning, organizing and analyzing data.
Matplotlib and Seaborn: Used to create visual representations of data.

This article introduces Pandas and explains how Matplotlib and Seaborn can be used for effective data visualization in a beginner-friendly way.

What is Pandas and Why it Matters

Pandas is an open-source Python library used for data manipulation and analysis. It provides simple and efficient tools for working with structured data, e.g. CSV files, spreadsheets and database tables.

The main data structures in Pandas are:

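Series: A one-dimensional, array-like structure that holds a single column of values.

import pandas as pd
age = [18, 20, 23, 40, 50, 24]

# pd.Series builds a one-dimensional structure; name labels the values
series = pd.Series(age, name="Age")
series

#Output: 
0    18
1    20
2    23
3    40
4    50
5    24
Name: Age, dtype: int64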

DataFrame: A two-dimensional table, similar to an Excel spreadsheet or SQL table.

students = {"Name":["Mark", "John", "Nancy"],
            "Grade":["A", "B", "C"],
            "Course":["Data Science", "Data Engineering", "Data Analytics"]}

student_grades = pd.DataFrame(students)
student_grades

#Output: 
    Name Grade            Course
0   Mark     A      Data Science
1   John     B  Data Engineering
2  Nancy     C    Data Analytics
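The two structures are closely related. As a small illustrative sketch (the variable name grade_column is made up for this example), selecting a single column from a DataFrame gives back a Series:

```python
import pandas as pd

students = {
    "Name": ["Mark", "John", "Nancy"],
    "Grade": ["A", "B", "C"],
}
student_grades = pd.DataFrame(students)

# Selecting one column by name returns a Series
grade_column = student_grades["Grade"]

print(type(grade_column).__name__)   # Series
print(student_grades.shape)          # (3, 2): 3 rows and 2 columns
```

In other words, a DataFrame can be thought of as several Series that share the same index.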

Pandas matters because it simplifies complex tasks such as:

Performing calculations and statistical analysis
Filtering and sorting information
Cleaning missing or incorrect data
Reading datasets from files

Before using Pandas, it must be installed. Run the following command:

pip install pandas

Then import it into your Python script:

import pandas as pd

Reading Data
Pandas can easily read files such as CSV and Excel.
Example:

import pandas as pd

# Load the CSV file into a DataFrame named data, the name used in the rest of this article
data = pd.read_csv("students.csv")
print(data.head())

The head() method displays the first five rows of the dataset by default.
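If you do not have a students.csv file at hand, the sketch below reads the same kind of data from an in-memory string instead. The column names and values are invented purely for illustration:

```python
import io
import pandas as pd

# Stand-in for the contents of a CSV file (made-up sample data)
csv_text = """Name,Age,Score
Mark,18,85
John,20,62
Nancy,23,91
"""

# read_csv accepts a file path or any file-like object
data = pd.read_csv(io.StringIO(csv_text))

print(data.head(2))   # pass a number to head() to change how many rows are shown
print(data.shape)     # (3, 3)
```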

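Checking Data Information
When you want to understand the structure of your data:

data.info()
print(data.describe())

info() prints the column names, data types and the number of non-null values in each column, which reveals where values are missing. It prints its summary directly, so it does not need print().
describe() provides statistical summaries of the numeric columns, such as the count, mean, minimum and maximum values.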
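As a rough sketch of what these two methods report, here they are on a tiny made-up dataset with one missing age:

```python
import pandas as pd

# Made-up sample data; None marks a missing age
data = pd.DataFrame({
    "Name": ["Mark", "John", "Nancy", "Peter"],
    "Age": [18, 20, None, 40],
    "Score": [85, 62, 91, 70],
})

data.info()   # the Age column shows 3 non-null values out of 4

summary = data.describe()              # covers the numeric columns only
print(summary.loc["mean", "Score"])    # 77.0
print(summary.loc["count", "Age"])     # 3.0, because the missing value is not counted
```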

Handling Missing Values
Missing values can affect the analysis results.
Checking for missing data:

print(data.isnull().sum())

Removing missing values:

data = data.dropna()

Filling missing values:

data["Age"] = data["Age"].fillna(data["Age"].mean())

This replaces missing age values with the average age.
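To see the difference between dropping and filling, here is a minimal sketch on invented data. It works on copies so the two approaches can be compared side by side:

```python
import pandas as pd

# Made-up sample data with two missing ages
data = pd.DataFrame({
    "Name": ["Mark", "John", "Nancy", "Peter"],
    "Age": [18.0, None, 24.0, None],
})

print(data["Age"].isnull().sum())      # 2

# Option 1: drop every row that has a missing value
dropped = data.dropna()
print(len(dropped))                    # 2

# Option 2: keep all rows and fill the gaps with the mean of the known ages
filled = data.copy()
filled["Age"] = filled["Age"].fillna(filled["Age"].mean())
print(filled["Age"].tolist())          # [18.0, 21.0, 24.0, 21.0]
```

Dropping is simpler but loses rows, while filling keeps every row at the cost of introducing estimated values.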

Filtering and Sorting Data
Filtering lets you select only the rows that meet a condition.
Example:

high_scores = data[data["Score"] > 70] 
print(high_scores)

Sorting data:

sorted_data = data.sort_values(by="Score", ascending=False) 

print(sorted_data)

These operations help organize data for better understanding and reporting.
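Filtering and sorting are often combined. The following sketch uses a small invented dataset and also shows one way to filter on two conditions at once:

```python
import pandas as pd

# Made-up sample data
data = pd.DataFrame({
    "Student": ["John", "Mary", "Peter", "James"],
    "Score": [85, 90, 65, 88],
})

# Keep scores above 70, then list the highest first
high_scores = data[data["Score"] > 70]
top_first = high_scores.sort_values(by="Score", ascending=False)
print(top_first["Student"].tolist())   # ['Mary', 'James', 'John']

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
mid_range = data[(data["Score"] > 70) & (data["Score"] < 90)]
print(mid_range["Student"].tolist())   # ['John', 'James']
```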

Data Visualization Fundamentals

Data visualization is the process of representing data graphically using charts, graphs and plots. Visualization makes it easier to identify patterns, trends and relationships in data.

For example:

Scatter plots show relationships between variables
Bar charts compare categories
Pie charts display proportions
Line charts show trends over time

Visualizations help people understand large datasets because humans interpret visuals faster than raw numbers.

Python provides powerful visualization libraries, with Matplotlib and Seaborn being among the most widely used.

Using Matplotlib for Charts

Matplotlib is one of the oldest and most flexible visualization libraries in Python. It provides full control over chart customization.

To install Matplotlib:

pip install matplotlib

Import it:

import matplotlib.pyplot as plt

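Creating a Line Chart
A line chart is used to show trends.
Example:

import pandas as pd
import matplotlib.pyplot as plt 

# housing_df is assumed to be a DataFrame that has already been loaded,
# with numeric "bathrooms" and "bedrooms" columns
# Average number of bedrooms for each bathroom count
avg_bedrooms = housing_df.groupby("bathrooms")["bedrooms"].mean()

plt.figure(figsize=(6, 3))

plt.plot(avg_bedrooms.index, avg_bedrooms.values, marker="o")
plt.title("Bathrooms vs Bedrooms")
plt.xlabel("Bathrooms")
plt.ylabel("Average Bedrooms")
plt.show()

The chart shows how the average number of bedrooms changes with the number of bathrooms.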
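Line charts are most often used for trends over time. The sketch below is a self-contained version with invented monthly figures, so it runs without any dataset. It uses Matplotlib's fig, ax style, which is an alternative to calling plt functions directly, and saves the chart to a file (the file name is arbitrary):

```python
import matplotlib.pyplot as plt

# Invented values, for illustration only
months = ["Jan", "Feb", "Mar", "Apr", "May"]
avg_rent = [30000, 31000, 30500, 32000, 33500]

fig, ax = plt.subplots(figsize=(6, 3))
ax.plot(months, avg_rent, marker="o")   # one point per month, joined by a line
ax.set_title("Average Monthly Rent Over Time")
ax.set_xlabel("Month")
ax.set_ylabel("Average rent")
fig.tight_layout()
fig.savefig("rent_trend.png")           # in a notebook, plt.show() displays it instead
```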

Creating a Bar Chart
Bar charts compare categories.
Example:

# Average satisfaction score by property type
avg_satisfaction_by_prop = housing_df.groupby("property_type")["satisfaction_score"].mean().sort_values(ascending=False).reset_index()

# Plot
plt.figure(figsize=(6, 3))

plt.bar(avg_satisfaction_by_prop["property_type"], avg_satisfaction_by_prop["satisfaction_score"])
plt.title("Average Satisfaction Score by Property Type")
plt.xlabel("Property Type")
plt.ylabel("Average Satisfaction Score")
plt.show()

The chart compares the average satisfaction score for each property type.

Creating a Pie Chart
Pie charts represent proportions of a whole as percentages.
Example:

furnishing_counts = housing_df["furnishing"].value_counts()

# One offset per slice, so this works for any number of categories
explode = [0.05] * len(furnishing_counts)

plt.figure(figsize=(6, 6))

plt.pie(furnishing_counts, explode=explode, labels=furnishing_counts.index, autopct="%1.1f%%")
plt.title("Distribution of Furnishing Status")
plt.show()


Matplotlib is highly customizable and allows users to change colors, labels, chart sizes, and grid styles.
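As a rough illustration of those options, the sketch below sets the chart size, bar colors, labels and grid style on a bar chart. The categories and scores are invented for the example:

```python
import matplotlib.pyplot as plt

# Invented categories and values, for illustration only
property_types = ["Apartment", "Bungalow", "Studio"]
avg_scores = [7.2, 8.1, 6.5]

fig, ax = plt.subplots(figsize=(6, 3))                                 # chart size
ax.bar(property_types, avg_scores, color="teal", edgecolor="black")   # colors
ax.set_title("Average Satisfaction Score by Property Type")           # labels
ax.set_xlabel("Property Type")
ax.set_ylabel("Average Satisfaction Score")
ax.grid(axis="y", linestyle="--", alpha=0.5)                           # grid style
ax.set_axisbelow(True)                                                 # grid lines behind the bars
fig.tight_layout()
fig.savefig("custom_bar.png")
```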

Using Seaborn for Statistical Visualizations

Seaborn is a Python library built on top of Matplotlib. It provides more attractive and advanced statistical visualizations with less code.

Install Seaborn:

pip install seaborn

Import it:

import seaborn as sns

Seaborn works smoothly with Pandas DataFrames.

Example dataset:

import pandas as pd 
data = { "Student": ["John", "Mary", "Peter", "James"], "Score": [85, 90, 78, 88] }
df = pd.DataFrame(data)
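Because Seaborn accepts a DataFrame and column names directly, this small dataset can be plotted in a single call. A minimal sketch (the output file name is arbitrary):

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

data = {"Student": ["John", "Mary", "Peter", "James"], "Score": [85, 90, 78, 88]}
df = pd.DataFrame(data)

fig, ax = plt.subplots(figsize=(6, 3))
# Seaborn reads the columns by name and labels the axes from them
sns.barplot(data=df, x="Student", y="Score", ax=ax)
ax.set_title("Score by Student")
fig.tight_layout()
fig.savefig("student_scores.png")
```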

Bar Plot

# Average monthly rent by property type
plt.figure(figsize=(6, 3))

sns.barplot(data=housing_df, x = "property_type", y = "monthly_rent_kes", estimator = "mean", palette = "bright")
plt.title("Average monthly rent by property type")
plt.xlabel("Property Type")
plt.ylabel("Average monthly rent")
plt.xticks(rotation=45)
plt.show()

Seaborn's built-in palettes and themes give charts a more polished look than Matplotlib's defaults with very little code.

Histogram
Histograms show data distribution.
Example:

# What is the distribution of monthly rent
plt.figure(figsize = (6, 3))

sns.histplot(data=housing_df, x = "monthly_rent_kes")
plt.title("Distribution of monthly rent")
plt.xlabel("Monthly rent")
plt.ylabel("Number of properties")
plt.show()

This helps show how monthly rent is distributed, for example which rent ranges are most common.

Scatter Plot
A scatter plot reveals relationships between two variables.
Example:

# Plotting relationship between bedrooms and bathrooms

plt.figure(figsize=(6, 3))

sns.scatterplot(data=housing_df, x="bathrooms", y="bedrooms")
plt.title("Bathrooms vs Bedrooms")
plt.xlabel("Bathrooms")
plt.ylabel("Bedrooms")
plt.show()

This chart helps to show the relationship between bathrooms and bedrooms.

Heatmap
Heatmaps show the strength of the relationships between numerical variables.
Example:

# Correlation analysis

# Select the numeric columns, since correlation is only defined for numbers
numerical_columns = housing_df.select_dtypes(include="number").columns
correlation = housing_df[numerical_columns].corr()

plt.figure(figsize=(6,6))

sns.heatmap(correlation, annot=True, fmt=".2f")
plt.title("Correlation Heatmap for Numerical Variables")
plt.show()

They help identify strong or weak relationships in a dataset.
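To make the numbers in a heatmap easier to read, here is a sketch on invented data where the relationships are known in advance: rent rises in step with bedrooms (correlation of 1), and distance falls in step with bedrooms (correlation of -1). The column names are made up for this example:

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Invented sample data, chosen so the correlations are easy to predict
homes = pd.DataFrame({
    "bedrooms": [1, 2, 3, 4, 5],
    "monthly_rent": [20000, 30000, 40000, 50000, 60000],   # rises with bedrooms
    "distance_km": [10, 8, 6, 4, 2],                       # falls as bedrooms rise
})

correlation = homes.corr()
print(correlation.round(2))

fig, ax = plt.subplots(figsize=(5, 4))
# vmin and vmax pin the color scale to the full correlation range of -1 to 1
sns.heatmap(correlation, annot=True, fmt=".2f", vmin=-1, vmax=1, cmap="coolwarm", ax=ax)
ax.set_title("Correlation Heatmap (sample data)")
fig.tight_layout()
fig.savefig("sample_heatmap.png")
```

Values close to 1 or -1 indicate a strong relationship, and values near 0 indicate a weak one.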


Matplotlib is suitable when detailed customization is required, while Seaborn is ideal for creating visually appealing statistical charts quickly.
In practice, most analysts use both libraries together, which works well because Seaborn is built on top of Matplotlib.

Best Practices and Common Mistakes

Best Practices:

Always clean data before analysis
Use a chart type that suits the data
Add labels and titles to charts
Keep visualizations simple and easy to read
Check for missing or duplicate data

Common Mistakes:

Forgetting labels or legends
Ignoring missing values
Using the wrong chart type for the data
Overcrowding charts with too much information

A clear visualization communicates what the dataset shows effectively and without confusion.
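The advice to check for missing or duplicate data can be turned into a quick routine to run before any analysis. A minimal sketch on invented data:

```python
import pandas as pd

# Made-up sample data with one repeated row and one missing score
data = pd.DataFrame({
    "Student": ["John", "Mary", "Mary", "James"],
    "Score": [85, 90, 90, None],
})

print(data.duplicated().sum())     # 1 fully repeated row
print(data.isnull().sum().sum())   # 1 missing value in total

# Remove the repeated row, then drop the row with the missing score
clean = data.drop_duplicates().dropna()
print(len(clean))                  # 2
```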

Conclusion

Pandas, Matplotlib and Seaborn together give analysts what they need to create clear and understandable visualizations.

Pandas simplifies data cleaning, manipulation, and analysis.
Matplotlib and Seaborn transform raw numbers into meaningful visual insights.

For beginners in data analysis or data science, learning these libraries is essential because they are widely used in industries like finance, healthcare, marketing and business intelligence.
Mastering these tools is an important step toward becoming a skilled analyst or data scientist.
