
Shuvendu Parida

🎬 Building an End-to-End Data Quality & Analytics Pipeline on AWS (IMDB Movie Dataset)

👋 Hi everyone! This is my first post on Dev.to 🎉
I’m super excited to share my journey of building an ETL pipeline with data quality checks and dashboards using AWS services.

The dataset I used? 👉 The famous IMDB Movie Dataset 🎥

📌 Why I Built This

Data pipelines are not just about moving data from point A to B.
They need to handle:
✅ Cleaning messy data
✅ Validating data quality
✅ Automating workflows
✅ Delivering insights via dashboards

So, I challenged myself to design a cloud-native solution on AWS to cover all of these steps.

*πŸ—οΈ Architecture Overview
*

Here’s the high-level design of my pipeline:
Amazon S3 – Stores the raw IMDB dataset.
AWS Glue Crawler – Automatically detects schema and stores metadata in Glue Data Catalog.
AWS Glue Job – Cleans and transforms data:
  - Converts Released_Year into integer/date
  - Extracts runtime in minutes
  - Cleans the Gross revenue field into numeric
Amazon Athena – Runs SQL-based data quality checks.
AWS Lambda – Automates DQ checks, executes Athena queries, and handles logic.
Amazon SNS – Sends email notifications if bad data is found.
Amazon EventBridge – Schedules and orchestrates pipeline runs.
Amazon QuickSight – Creates interactive dashboards for insights.
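For intuition, the three Glue cleaning rules above can be sketched as plain Python (in the actual job they run as PySpark column expressions; the function names here are my own, only the column semantics come from the dataset):

```python
import re

def clean_released_year(raw):
    """Cast Released_Year to an integer, dropping non-numeric values like 'PG'."""
    raw = (raw or "").strip()
    return int(raw) if raw.isdigit() else None

def clean_runtime(raw):
    """Extract the minute count from strings like '142 min'."""
    match = re.match(r"(\d+)", (raw or "").strip())
    return int(match.group(1)) if match else None

def clean_gross(raw):
    """Strip '$' and ',' from the Gross field and cast to a number."""
    digits = re.sub(r"[$,]", "", raw or "")
    return int(digits) if digits.isdigit() else None
```

So `clean_gross("$28,341,469")` yields `28341469`, and values that can’t be parsed come back as `None` instead of crashing the job.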

🛠️ Tools & Services Used

AWS Glue (ETL + Crawler)
AWS Lambda (Serverless automation)
Amazon Athena (SQL over S3)
Amazon S3 (Data lake storage)
Amazon SNS (Notifications)
Amazon EventBridge (Scheduler)
Amazon QuickSight (BI dashboards)
Python (ETL scripts + Lambda)
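As a concrete example of the EventBridge scheduling piece, a daily trigger for the DQ Lambda could be wired up like this (the rule name, function name, and ARNs are placeholders, not values from the repo):

```shell
# Create a rule that fires once a day
aws events put-rule \
  --name imdb-dq-daily \
  --schedule-expression "rate(1 day)"

# Point the rule at the DQ Lambda (replace the ARN with yours)
aws events put-targets \
  --rule imdb-dq-daily \
  --targets "Id"="dq-lambda","Arn"="arn:aws:lambda:us-east-1:123456789012:function:imdb-dq-check"

# Allow EventBridge to invoke the function
aws lambda add-permission \
  --function-name imdb-dq-check \
  --statement-id eventbridge-invoke \
  --action lambda:InvokeFunction \
  --principal events.amazonaws.com \
  --source-arn arn:aws:events:us-east-1:123456789012:rule/imdb-dq-daily
```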

⚙️ How the Pipeline Works

1. Ingest Data
The raw CSV of IMDB movies is uploaded to S3.

2. Discover Schema
The Glue Crawler scans the file → adds the table to the Glue Data Catalog.

3. ETL with Glue
My Glue script (glue_job_script.py) transforms the data:

```python
from pyspark.sql.functions import regexp_replace

# Example transformation for the Gross column
df = df.withColumn(
    "Gross_Cleaned",
    regexp_replace("Gross", "[$,]", "").cast("long")
)
```

4. Data Quality Checks
Lambda runs queries in Athena, e.g.:

```sql
SELECT Series_Title, Released_Year, IMDB_Rating
FROM imdb_movies_rating
WHERE IMDB_Rating IS NULL OR Released_Year = '';
```

👉 If bad data is found → SNS sends me an email alert.
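The Lambda side of this DQ step can be sketched as below; the database name, query-results bucket, and topic ARN are placeholders of my own, not values from the repo:

```python
import time

def build_alert(bad_count):
    """Format the SNS alert body; None means the data passed the check."""
    if bad_count == 0:
        return None
    return (f"⚠️ Data Quality Alert: {bad_count} records found "
            "with missing IMDB ratings.")

def lambda_handler(event, context):
    # boto3 ships with the Lambda runtime; importing it here keeps the
    # pure helper above testable without AWS dependencies installed.
    import boto3

    athena = boto3.client("athena")
    sns = boto3.client("sns")

    # Kick off the DQ query from the post
    qid = athena.start_query_execution(
        QueryString=(
            "SELECT Series_Title, Released_Year, IMDB_Rating "
            "FROM imdb_movies_rating "
            "WHERE IMDB_Rating IS NULL OR Released_Year = ''"
        ),
        QueryExecutionContext={"Database": "imdb_db"},  # placeholder database
        ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
    )["QueryExecutionId"]

    # Poll until Athena finishes
    while True:
        state = athena.get_query_execution(QueryExecutionId=qid)[
            "QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)
    if state != "SUCCEEDED":
        raise RuntimeError(f"DQ query ended in state {state}")

    rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
    bad_count = len(rows) - 1  # the first row holds the column headers

    message = build_alert(bad_count)
    if message:
        sns.publish(
            TopicArn="arn:aws:sns:us-east-1:123456789012:imdb-dq-alerts",  # placeholder
            Subject="IMDB Data Quality Alert",
            Message=message,
        )
    return {"bad_records": bad_count}
```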

5. Visualization
The final curated dataset is connected to QuickSight dashboards.

📊 Dashboards

Here’s a glimpse of the dashboards I built with QuickSight:
⭐ Top Rated Movies
🎭 Genre Distribution
📈 Ratings Trend Over Years
🎬 Runtime vs IMDB Rating

📬 Notifications
No pipeline is complete without monitoring!
Whenever a data quality issue is detected, I get an email from SNS like:
"⚠️ Data Quality Alert: 12 records found with missing IMDB ratings."

📂 Project Repo

I’ve open-sourced my project on GitHub 👉 ETL-Movie-Data-Analysis
The repo includes:
Glue ETL script
Lambda function for DQ checks
Infrastructure configs (EventBridge, SNS, Glue Crawler)
QuickSight dashboard notes

🌟 Lessons Learned

AWS Glue is powerful, but schema mismatches can cause headaches 😅
QuickSight becomes much easier once data types are cleaned properly.
Automating DQ checks with Lambda + Athena is a game changer for reliability.

Documentation & version control (GitHub) are just as important as building the pipeline itself.

🚀 Next Steps

Add CI/CD with GitHub Actions
Use AWS Deequ or Great Expectations for richer data quality checks
Explore exporting QuickSight dashboards as templates

👨‍💻 About Me

I’m Shuvendu Parida – passionate about Data Engineering, Cloud, and Analytics.
This was my first Dev.to post 🙌 and I’d love your feedback!

👉 GitHub

💡 Thanks for reading!
If you found this useful, drop a ❤️ or comment so I can improve in my next posts.
