Akhilesh

Posted on Apr 28

Real Project: Analyze a Dataset End to End

#ai #beginners #productivity #python

Every post so far taught you one skill at a time.

Load data. Clean data. Filter it. Group it. Visualize it. One concept, one post, one isolated practice exercise.

Real work does not look like that.

Real work looks like this: you get a dataset, you have a question, and you spend the next few hours loading, cleaning, questioning, visualizing, re-cleaning, finding something unexpected, investigating it, losing the thread, finding it again, and eventually producing an analysis that actually says something true about the data.

This post is that experience. One dataset. One real question. Every skill from Phase 3 used in service of finding an actual answer.

No isolated exercises. No pre-cleaned data. No hand-holding on what you will find.

The Dataset: Netflix Movies and TV Shows

Download it from Kaggle. Search "Netflix Movies and TV Shows dataset Kaggle" and download the netflix_titles.csv file. It is free, about 8000 rows, and has enough structure to ask interesting questions.

If you want to follow along right now without downloading, we will create a simplified version:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from collections import Counter
import warnings
warnings.filterwarnings("ignore")

sns.set_theme(style="darkgrid", palette="husl")
np.random.seed(42)

The Question

Before touching the data, write down what you want to know.

Primary question: How has Netflix's content strategy changed over time? Are they producing more original movies or shifting toward TV shows? Which countries produce the most content on Netflix? What genres dominate?

This question is specific enough to guide your analysis but open enough that you will discover things you did not expect. That balance is what makes a good EDA question.

Step 1: Load and First Look

df = pd.read_csv("netflix_titles.csv")

print(f"Shape: {df.shape}")
print(f"\nColumns: {df.columns.tolist()}")
print(f"\nDtypes:\n{df.dtypes}")
print(f"\nFirst 3 rows:")
print(df.head(3))
print(f"\nMissing values:")
print(df.isnull().sum().sort_values(ascending=False))

Output you should see (approximate):

Shape: (8807, 12)

Columns: ['show_id', 'type', 'title', 'director', 'cast', 'country',
          'date_added', 'release_year', 'rating', 'duration',
          'listed_in', 'description']

Missing values:
director        2634
cast             825
country          831
date_added        10
rating             4
...

2634 missing directors. That is 30% of the dataset. Dropping these rows would remove too much. The director column is not critical for most of our questions so we will fill with "Unknown" or work around it.

Notice listed_in contains genres as a comma-separated string in one column. That will need special handling.

Step 2: Clean the Data

df["date_added"] = pd.to_datetime(df["date_added"].str.strip(), errors="coerce")

df["year_added"] = df["date_added"].dt.year
df["month_added"] = df["date_added"].dt.month

df["director"] = df["director"].fillna("Unknown")
df["cast"]     = df["cast"].fillna("Unknown")
df["country"]  = df["country"].fillna("Unknown")

df["country_primary"] = df["country"].apply(
    lambda x: x.split(",")[0].strip() if x != "Unknown" else "Unknown"
)

df["duration_value"] = df["duration"].str.extract(r"(\d+)").astype(float)
df["duration_type"]  = df["duration"].str.extract(r"([a-zA-Z]+)")

df = df[df["release_year"] >= 1950]

print(f"Clean shape: {df.shape}")
print(f"\nDate range: {df['year_added'].min():.0f} to {df['year_added'].max():.0f}")
print(f"\nContent types:\n{df['type'].value_counts()}")

Three things happened here that are worth understanding.

The date_added column had whitespace. .str.strip() before conversion prevents parse failures on "January 1, 2020 " with a trailing space.

Country is a comma-separated list. "United States, India" means the show was co-produced. Taking only the first country is a simplification but makes analysis possible. Always document simplifications like this.

Duration is a mixed column. "90 min" for movies and "2 Seasons" for TV shows. Splitting into value and type lets you analyze them separately.

Step 3: Movies vs TV Shows Over Time

yearly_type = df.groupby(["year_added", "type"]).size().reset_index(name="count")
yearly_type = yearly_type[yearly_type["year_added"].between(2010, 2021)]

fig, axes = plt.subplots(1, 2, figsize=(15, 5))

for content_type, color in zip(["Movie", "TV Show"], ["steelblue", "coral"]):
    data = yearly_type[yearly_type["type"] == content_type]
    axes[0].plot(data["year_added"], data["count"],
                 marker="o", label=content_type, color=color, linewidth=2.5)

axes[0].set_title("Content Added to Netflix Per Year", fontsize=13)
axes[0].set_xlabel("Year")
axes[0].set_ylabel("Number of Titles")
axes[0].legend()
axes[0].grid(True, alpha=0.3)

ratio = yearly_type.pivot(index="year_added", columns="type", values="count").fillna(0)
ratio["tv_ratio"] = ratio.get("TV Show", 0) / (ratio.get("Movie", 0) + ratio.get("TV Show", 0))
ratio = ratio.reset_index()

axes[1].bar(ratio["year_added"], ratio["tv_ratio"], color="coral", alpha=0.8)
axes[1].axhline(y=0.5, color="black", linestyle="--", alpha=0.5, label="50% line")
axes[1].set_title("TV Show Share of All Content Added", fontsize=13)
axes[1].set_xlabel("Year")
axes[1].set_ylabel("TV Show Proportion")
axes[1].legend()
axes[1].grid(True, alpha=0.3, axis="y")

plt.tight_layout()
plt.savefig("content_over_time.png", dpi=150)
plt.show()

What this chart will tell you: Netflix added content aggressively from 2015 to 2019. Then 2020 shows a dip (COVID production shutdowns). The TV show proportion has been increasing over time. Netflix is shifting from being a movie platform to a TV show platform.

That is a real business insight from clean data and two charts.

Step 4: Geographic Analysis

country_counts = df[df["country_primary"] != "Unknown"]["country_primary"].value_counts().head(15)

fig = px.bar(
    x=country_counts.values,
    y=country_counts.index,
    orientation="h",
    color=country_counts.values,
    color_continuous_scale="Blues",
    title="Top 15 Countries Producing Netflix Content",
    labels={"x": "Number of Titles", "y": "Country"}
)

fig.update_layout(
    template="plotly_dark",
    showlegend=False,
    yaxis=dict(autorange="reversed"),
    height=500
)

fig.show()
fig.write_html("country_content.html")

us_india_uk = df[df["country_primary"].isin(["United States", "India", "United Kingdom"])]
country_type = us_india_uk.groupby(["country_primary", "type"]).size().reset_index(name="count")

fig2, ax = plt.subplots(figsize=(10, 5))
country_type_pivot = country_type.pivot(index="country_primary", columns="type", values="count")
country_type_pivot.plot(kind="bar", ax=ax, color=["steelblue", "coral"])
ax.set_title("Movies vs TV Shows: US, India, UK")
ax.set_xlabel("")
ax.set_ylabel("Number of Titles")
ax.legend(title="Type")
ax.tick_params(axis="x", rotation=0)
plt.tight_layout()
plt.savefig("country_type_comparison.png", dpi=150)
plt.show()

India adds mostly movies. The US adds both movies and shows at scale. UK leans toward shows. Different countries have different content strategies on the platform.

Step 5: Genre Analysis

The genre column is a mess. Each row has multiple genres in one string.

all_genres = []
for genres in df["listed_in"].dropna():
    for genre in genres.split(","):
        all_genres.append(genre.strip())

genre_counts = pd.Series(Counter(all_genres)).sort_values(ascending=False).head(15)

fig, ax = plt.subplots(figsize=(10, 7))
genre_counts.sort_values().plot(kind="barh", ax=ax, color="steelblue", alpha=0.85)
ax.set_title("Top 15 Genres on Netflix", fontsize=14)
ax.set_xlabel("Number of Titles")
ax.grid(True, alpha=0.3, axis="x")

for i, (val, name) in enumerate(zip(genre_counts.sort_values().values,
                                     genre_counts.sort_values().index)):
    ax.text(val + 20, i, str(val), va="center", fontsize=9)

plt.tight_layout()
plt.savefig("genre_analysis.png", dpi=150)
plt.show()

print("\nTop 5 genres for Movies only:")
movies = df[df["type"] == "Movie"]
movie_genres = []
for genres in movies["listed_in"].dropna():
    for genre in genres.split(","):
        movie_genres.append(genre.strip())
print(pd.Series(Counter(movie_genres)).sort_values(ascending=False).head(5))

print("\nTop 5 genres for TV Shows only:")
shows = df[df["type"] == "TV Show"]
show_genres = []
for genres in shows["listed_in"].dropna():
    for genre in genres.split(","):
        show_genres.append(genre.strip())
print(pd.Series(Counter(show_genres)).sort_values(ascending=False).head(5))

International Movies and Dramas dominate. TV shows lean heavily toward international content while movies have more variety. Netflix is a global platform first, not just a Hollywood platform.

Step 6: Duration Analysis

movies_dur = df[(df["type"] == "Movie") & (df["duration_value"].notna())]
shows_dur  = df[(df["type"] == "TV Show") & (df["duration_value"].notna())]

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].hist(movies_dur["duration_value"], bins=40,
             color="steelblue", edgecolor="white", alpha=0.8)
axes[0].axvline(movies_dur["duration_value"].mean(), color="red",
                linestyle="--", label=f"Mean: {movies_dur['duration_value'].mean():.0f} min")
axes[0].axvline(movies_dur["duration_value"].median(), color="orange",
                linestyle="--", label=f"Median: {movies_dur['duration_value'].median():.0f} min")
axes[0].set_title("Movie Duration Distribution")
axes[0].set_xlabel("Duration (minutes)")
axes[0].legend()

season_counts = shows_dur["duration_value"].value_counts().sort_index()
axes[1].bar(season_counts.index[:8], season_counts.values[:8],
            color="coral", alpha=0.85, edgecolor="white")
axes[1].set_title("TV Show Season Count Distribution")
axes[1].set_xlabel("Number of Seasons")
axes[1].set_ylabel("Number of Shows")

plt.tight_layout()
plt.savefig("duration_analysis.png", dpi=150)
plt.show()

print(f"Movie duration: mean={movies_dur['duration_value'].mean():.0f} min, "
      f"median={movies_dur['duration_value'].median():.0f} min")
print(f"Most TV shows have: {shows_dur['duration_value'].mode()[0]:.0f} season(s)")

Most movies are between 80 and 120 minutes. Most TV shows have only one season. Netflix cancels aggressively. Shows that survive past season 2 are a minority.

Step 7: Rating Analysis

rating_order = ["G", "TV-Y", "TV-Y7", "TV-Y7-FV", "TV-G",
                "PG", "TV-PG", "PG-13", "TV-14", "R", "TV-MA", "NC-17"]

ratings = df["rating"].value_counts()
ratings = ratings[ratings.index.isin(rating_order)]
ratings = ratings.reindex([r for r in rating_order if r in ratings.index])

fig, ax = plt.subplots(figsize=(12, 5))
colors = ["#2ecc71" if r in ["G","TV-Y","TV-Y7","TV-Y7-FV","TV-G"]
          else "#f39c12" if r in ["PG","TV-PG","PG-13"]
          else "#e74c3c" for r in ratings.index]

bars = ax.bar(ratings.index, ratings.values, color=colors, edgecolor="white")
ax.set_title("Content Distribution by Rating", fontsize=14)
ax.set_ylabel("Number of Titles")

for bar, val in zip(bars, ratings.values):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 10,
            str(val), ha="center", va="bottom", fontsize=8)

from matplotlib.patches import Patch
legend_elements = [
    Patch(facecolor="#2ecc71", label="Kids (G to TV-G)"),
    Patch(facecolor="#f39c12", label="Teen (PG to PG-13)"),
    Patch(facecolor="#e74c3c", label="Adult (TV-14 to NC-17)")
]
ax.legend(handles=legend_elements)
ax.grid(True, alpha=0.3, axis="y")

plt.tight_layout()
plt.savefig("ratings_analysis.png", dpi=150)
plt.show()

TV-MA and TV-14 dominate. Netflix is primarily an adult platform, not a family one. This has implications for recommendations and content strategy.

Step 8: The Final Report

total       = len(df)
movies_pct  = (df["type"] == "Movie").mean() * 100
shows_pct   = (df["type"] == "TV Show").mean() * 100
top_country = df[df["country_primary"] != "Unknown"]["country_primary"].value_counts().index[0]
top_genre   = pd.Series(all_genres).value_counts().index[0]
avg_movie   = movies_dur["duration_value"].mean()
peak_year   = df["year_added"].value_counts().index[0]

report = f"""
NETFLIX CONTENT ANALYSIS: FINAL REPORT
========================================
Dataset: {total:,} titles analyzed

CONTENT MIX:
  Movies:   {movies_pct:.1f}%
  TV Shows: {shows_pct:.1f}%

GROWTH:
  Peak content addition year: {peak_year:.0f}
  Netflix has consistently grown its library with a recent
  shift toward TV shows as a percentage of new additions.

GEOGRAPHY:
  Top producing country: {top_country}
  International content represents a significant share,
  confirming Netflix's global-first strategy.

GENRES:
  Most common genre: {top_genre}
  International content dominates, with Dramas and Comedies
  being the most common genre combinations.

DURATION:
  Average movie runtime: {avg_movie:.0f} minutes
  Most TV shows run for only 1 season.
  Netflix cancels most shows before season 2.

AUDIENCE:
  The platform skews heavily toward adult content (TV-MA, TV-14).
  Family and kids content represents a small minority.

KEY INSIGHTS FOR STAKEHOLDERS:
  1. Netflix is transitioning from a movie platform to a TV platform.
  2. International content is not supplementary, it is core to the strategy.
  3. Short-run shows (1 season) dominate. Original production is volatile.
  4. Adult audiences are the primary target demographic.
========================================
"""

print(report)

with open("netflix_analysis_report.txt", "w") as f:
    f.write(report)

print("Report saved.")

What Real EDA Looks Like vs What Tutorials Show

Tutorials show clean datasets, obvious patterns, and predictable conclusions.

Real EDA involves spending twenty minutes wondering why a column has values you do not recognize. It involves realizing your date parsing was wrong and re-running three cells. It involves finding a pattern, getting excited, then realizing it was caused by missing data not filling properly.

The Netflix dataset has all of this. Director is missing for a third of the content. Country sometimes lists four countries separated by commas. Ratings have inconsistencies between movies and shows. Duration is a mixed-type column that requires splitting.

That friction is the job. The clean analysis you produce at the end was built on dozens of small decisions about how to handle messy reality. Every decision is a chance to introduce bias or lose signal. This is why EDA takes longer than people expect, and why it matters more than people admit.

A Blog That Gets This Right

Towards Data Science published a piece by Susan Li called "Exploratory Data Analysis with Python" using real Airbnb data that shows exactly this kind of messy, iterative process. Her approach of documenting every cleaning decision alongside the analysis has influenced how a lot of practitioners work. Search "Susan Li exploratory data analysis Python Towards Data Science."

Phase 3 Complete

Thirteen posts. From NumPy arrays to a full end-to-end analysis on a real dataset.

You can load data from any source. You can clean it systematically. You can filter, group, merge, and aggregate. You can visualize with Matplotlib, Seaborn, and Plotly. You can run a complete EDA and produce a report that says something true.

These skills are what data scientists spend most of their time doing. Models are 20% of the job. Data work is 80%.

DEV Community