Introduction
In the world of data science, raw data is rarely ready for analysis. Before building machine learning models or creating dashboards, it’s important to take a step back and understand the data itself. This process is called Exploratory Data Analysis (EDA) — a critical phase where we dive into the dataset, uncover insights, detect anomalies, and prepare it for modeling.
EDA is often described as “letting the data speak”. It blends statistics, visualization, and intuition to answer fundamental questions:
- What does my dataset look like?
- Are there patterns or trends?
- Are there missing values or outliers?
- Which features matter most?
Why is EDA Important?
Skipping EDA is like trying to solve a puzzle without first looking at all the pieces. A good EDA will:
- Reveal data quality issues (missing values, duplicates, errors).
- Provide statistical summaries for better understanding.
- Uncover relationships between variables.
- Highlight outliers or anomalies that could mislead models.
- Guide the feature engineering process.
EDA is not just preparation; it’s the foundation of data-driven decision-making.
The EDA Workflow
Here’s a step-by-step framework you can follow in any project:
1. Load and Inspect the Data
Start with a first look. Check dimensions, column names, data types, and missing values.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
data = pd.read_csv('/content/dataset.csv.csv')
The first step in any analysis is importing the right Python libraries. These libraries provide specialized functions that make data exploration efficient and powerful.
Pandas is essential for data manipulation and handling structured data in the form of dataframes.
NumPy adds numerical operations and array support, making calculations fast and efficient.
Matplotlib and Seaborn enable data visualization, which is at the heart of EDA. While Matplotlib offers low-level plotting control, Seaborn makes it easier to produce attractive and insightful charts.
Without these libraries, EDA would be tedious and error-prone. They save time, reduce complexity, and ensure reproducibility, all of which are crucial traits of professional data analysis.
Once the data is loaded, the first task is to understand its structure. This means checking the number of rows and columns, identifying data types, and scanning for missing values. At this stage, you also get descriptive statistics like means, medians, and standard deviations for numerical variables.
This step is important because it defines the scope of your analysis. For example, knowing whether you have 1,000 rows or 1 million rows will affect which techniques and algorithms you use later. Similarly, identifying categorical versus numerical data ensures you apply the right statistical methods and visualizations.
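As a quick sketch, the following Pandas calls cover that first structural pass (assuming the dataframe is named data, as in the loading step above):
# number of rows and columns
data.shape
# column names, data types, and non-null counts
data.info()
# descriptive statistics for numerical columns
data.describe()
# preview the first few rows
data.head()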
2. Clean the Data
Real-world datasets are messy. They often contain missing values, duplicates, incorrect data types, or even inconsistencies like spelling errors. Cleaning the data is therefore a non-negotiable step.
By handling missing values, removing duplicates, and correcting formats (such as converting text to dates), you ensure that your analysis is reliable. If this step is skipped, any conclusions drawn later may be misleading. In other words, clean data equals trustworthy insights.
# missing values
data.isnull().sum()
The code above is used in data cleaning to check for missing values.
# typical cleaning operations (illustrative column names, on a dataframe named df)
df = df.drop_duplicates()                          # remove duplicate rows
df['date'] = pd.to_datetime(df['date'])            # convert text to dates
df['age'] = df['age'].fillna(df['age'].median())   # fill missing ages with the median
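For inconsistencies such as stray whitespace or mixed capitalisation in text columns, a simple normalisation pass helps; this is a minimal sketch using a hypothetical city column:
# normalise a text column: trim whitespace and unify capitalisation ('city' is a hypothetical column)
df['city'] = df['city'].str.strip().str.title()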
3. Univariate Analysis – Looking at One Variable at a Time
# Distribution of price for listings under $1,500
df = data[data['price'] < 1500].copy()
sns.boxplot(data=df, x='price')
Univariate analysis focuses on a single variable, whether it is numerical (e.g., income, age) or categorical (e.g., gender, city). For numerical data, histograms and boxplots help reveal the distribution, central tendency, and presence of outliers. For categorical data, count plots highlight the frequency of each category.
This step is important because it allows you to spot potential problems early. For example, a skewed income distribution might suggest the need for transformation, while a heavily imbalanced category might call for sampling techniques in later modeling.
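As a rough sketch, a histogram and a count plot cover both cases, using the price and neighbourhood_group columns that appear later in this post:
# distribution of a numerical variable
sns.histplot(data=df, x='price', bins=50)
plt.show()
# frequency of each category
sns.countplot(data=df, x='neighbourhood_group')
plt.show()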
4. Bivariate Analysis – Understanding Relationships
# numerical vs numerical (illustrative columns)
sns.scatterplot(x='income', y='spending_score', data=df)
# categorical vs numerical (illustrative columns)
sns.boxplot(x='gender', y='income', data=df)
Study relationships between two variables:
- Numerical vs Numerical → scatterplots, correlation heatmaps
- Categorical vs Numerical → boxplots, barplots
While univariate analysis looks at one variable at a time, bivariate analysis explores how two variables relate to each other. This step matters because real-world insights often emerge from relationships rather than isolated variables; without it, hidden connections may remain undiscovered.
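The correlation heatmap mentioned above can be sketched like this, restricted to numeric columns (the exact columns depend on your dataset):
# correlation heatmap for numerical columns
corr = df.select_dtypes(include='number').corr()
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.show()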
5. Multivariate Analysis – The Bigger Picture
Multivariate analysis takes exploration further by analyzing three or more variables together. Techniques such as group-by operations, pairplots, and pivot tables help uncover complex patterns.
This is important because many phenomena in data are multi-dimensional. For example, understanding how gender, age group, and income interact together provides richer insights than studying them individually. In business terms, this can translate to better strategies for specific customer segments.
Look at interactions among three or more variables using grouping, aggregations, or pairplots.
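Here is a minimal sketch of those techniques, assuming the price, beds, and neighbourhood_group columns used elsewhere in this post:
# average price by neighbourhood group and number of beds
df.groupby(['neighbourhood_group', 'beds'])['price'].mean().unstack()
# the same view as a pivot table
pd.pivot_table(df, values='price', index='neighbourhood_group', columns='beds', aggfunc='mean')
# pairwise relationships among numerical variables, coloured by group
sns.pairplot(df[['price', 'beds', 'neighbourhood_group']], hue='neighbourhood_group')
plt.show()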
6. Detecting Outliers – Guarding Against Misleading Data
Outliers are extreme values that deviate significantly from the rest of the dataset. They can be genuine anomalies (such as fraud in financial transactions) or errors (like a wrongly entered value). Detecting them through boxplots, Z-scores, or the IQR method ensures that your models are not biased or skewed by these unusual points.
The importance of this step lies in the fact that outliers can distort results. For example, a single billionaire in a dataset of average earners will disproportionately inflate mean income values. Handling outliers ensures more robust and accurate insights.
Outliers can bias models if not handled properly.
# Remove extreme prices with a simple threshold, then inspect the distribution
df = data[data['price'] < 1500].copy()
sns.boxplot(data=df, x='price')
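The fixed cut-off of 1500 above is a judgement call; the IQR rule mentioned earlier gives a data-driven alternative, sketched here:
# IQR method: flag prices outside 1.5 * IQR of the middle 50%
q1 = data['price'].quantile(0.25)
q3 = data['price'].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = data[(data['price'] < lower) | (data['price'] > upper)]
print(f"{len(outliers)} potential outliers detected")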
7. Feature Engineering – Creating Value from Data
Feature engineering involves creating new variables from existing ones. For example, transforming age into age groups, deriving ratios, or extracting month and year from a date column.
This step is vital because models are only as good as the features they are trained on. Thoughtful feature engineering can significantly improve predictive power, while also making results easier to interpret for non-technical audiences.
Create new features that add value, such as categories, ratios, or time-based features.
# price per bed
df["price per bed"] = df["price"]/df["beds"]
# checking the new column created
df.head()
# average price per bed
df.groupby(by='neighbourhood_group')['price per bed'].mean()
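Binning a numerical column into categories, as mentioned above, can be sketched with pd.cut; the bin edges and labels here are purely illustrative:
# bucket prices into categories (bin edges and labels are illustrative)
df['price_band'] = pd.cut(df['price'], bins=[0, 100, 300, 1500], labels=['budget', 'mid-range', 'premium'])
df['price_band'].value_counts()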
Deliverables of EDA
At the end of EDA, you should have:
- A clean, structured dataset ready for modeling.
- Visual insights that explain patterns in the data.
- A set of engineered features that can improve predictive performance.
- A summary report that communicates findings clearly.
These insights help decide the next step, whether it’s building a predictive model, clustering customers, or creating dashboards.
Conclusion
Exploratory Data Analysis (EDA) is the most crucial step in any data project. It sets the tone for how well your models perform and how accurate your insights will be. By combining statistics, data visualization, and domain knowledge, EDA transforms raw datasets into a foundation for deeper analysis.
**Whenever you start a new project, remember: don’t rush to model. Let the data tell its story first.**
Below is a link to my EDA project:
https://github.com/JosephHinga/Airbnb-listing-New-York