
Shuvendu Parida

🎬 Building an End-to-End Data Quality & Analytics Pipeline on AWS (IMDB Movie Dataset)

👋 Hi everyone! This is my first post on Dev.to 🎉
I’m super excited to share my journey of building an ETL pipeline with data quality checks and dashboards using AWS services.

The dataset I used? 👉 The famous IMDB Movie Dataset 🎥

📌 Why I Built This

Data pipelines are not just about moving data from point A to B.
They need to handle:
✅ Cleaning messy data
✅ Validating data quality
✅ Automating workflows
✅ Delivering insights via dashboards

So, I challenged myself to design a cloud-native solution on AWS to cover all of these steps.

*πŸ—οΈ Architecture Overview
*

Here’s the high-level design of my pipeline:
Amazon S3 – Stores the raw IMDB dataset.
AWS Glue Crawler – Automatically detects schema and stores metadata in Glue Data Catalog.
AWS Glue Job – Cleans and transforms data:
  - Converts Released_Year into integer/date
  - Extracts runtime in minutes
  - Cleans the Gross revenue field into numeric
Amazon Athena – Runs SQL-based data quality checks.
AWS Lambda – Automates DQ checks, executes Athena queries, and handles logic.
Amazon SNS – Sends email notifications if bad data is found.
Amazon EventBridge – Schedules and orchestrates pipeline runs.
Amazon QuickSight – Creates interactive dashboards for insights.
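For intuition, the three Glue cleaning rules above can be sketched as plain Python (in the actual job they run as PySpark column expressions; the function names here are my own, only the column semantics come from the dataset):

```python
import re

def clean_released_year(raw):
    """Cast Released_Year to an integer, dropping non-numeric values like 'PG'."""
    raw = (raw or "").strip()
    return int(raw) if raw.isdigit() else None

def clean_runtime(raw):
    """Extract the minute count from strings like '142 min'."""
    match = re.match(r"(\d+)", (raw or "").strip())
    return int(match.group(1)) if match else None

def clean_gross(raw):
    """Strip '$' and ',' from the Gross field and cast to a number."""
    digits = re.sub(r"[$,]", "", raw or "")
    return int(digits) if digits.isdigit() else None
```

So `clean_gross("$28,341,469")` yields `28341469`, and values that can’t be parsed come back as `None` instead of crashing the job.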

🛠️ Tools & Services Used

AWS Glue (ETL + Crawler)
AWS Lambda (Serverless automation)
Amazon Athena (SQL over S3)
Amazon S3 (Data lake storage)
Amazon SNS (Notifications)
Amazon EventBridge (Scheduler)
Amazon QuickSight (BI dashboards)
Python (ETL scripts + Lambda)
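As a concrete example of the EventBridge scheduling piece, a daily trigger for the DQ Lambda could be wired up like this (the rule name, function name, and ARNs are placeholders, not values from the repo):

```shell
# Create a rule that fires once a day
aws events put-rule \
  --name imdb-dq-daily \
  --schedule-expression "rate(1 day)"

# Point the rule at the DQ Lambda (replace the ARN with yours)
aws events put-targets \
  --rule imdb-dq-daily \
  --targets "Id"="dq-lambda","Arn"="arn:aws:lambda:us-east-1:123456789012:function:imdb-dq-check"

# Allow EventBridge to invoke the function
aws lambda add-permission \
  --function-name imdb-dq-check \
  --statement-id eventbridge-invoke \
  --action lambda:InvokeFunction \
  --principal events.amazonaws.com \
  --source-arn arn:aws:events:us-east-1:123456789012:rule/imdb-dq-daily
```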

⚙️ How the Pipeline Works

1. Ingest Data
The raw CSV of IMDB movies is uploaded to S3.

2. Discover Schema
The Glue Crawler scans the file → adds the table to the Glue Data Catalog.

3. ETL with Glue
My Glue script (glue_job_script.py) transforms the data:

```python
from pyspark.sql.functions import regexp_replace

# Example transformation for the Gross column
df = df.withColumn(
    "Gross_Cleaned",
    regexp_replace("Gross", "[$,]", "").cast("long")
)
```

4. Data Quality Checks
Lambda runs queries in Athena, e.g.:

```sql
SELECT Series_Title, Released_Year, IMDB_Rating
FROM imdb_movies_rating
WHERE IMDB_Rating IS NULL OR Released_Year = '';
```

👉 If bad data is found → SNS sends me an email alert.
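The Lambda side of this DQ step can be sketched as below; the database name, query-results bucket, and topic ARN are placeholders of my own, not values from the repo:

```python
import time

def build_alert(bad_count):
    """Format the SNS alert body; None means the data passed the check."""
    if bad_count == 0:
        return None
    return (f"⚠️ Data Quality Alert: {bad_count} records found "
            "with missing IMDB ratings.")

def lambda_handler(event, context):
    # boto3 ships with the Lambda runtime; importing it here keeps the
    # pure helper above testable without AWS dependencies installed.
    import boto3

    athena = boto3.client("athena")
    sns = boto3.client("sns")

    # Kick off the DQ query from the post
    qid = athena.start_query_execution(
        QueryString=(
            "SELECT Series_Title, Released_Year, IMDB_Rating "
            "FROM imdb_movies_rating "
            "WHERE IMDB_Rating IS NULL OR Released_Year = ''"
        ),
        QueryExecutionContext={"Database": "imdb_db"},  # placeholder database
        ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
    )["QueryExecutionId"]

    # Poll until Athena finishes
    while True:
        state = athena.get_query_execution(QueryExecutionId=qid)[
            "QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)
    if state != "SUCCEEDED":
        raise RuntimeError(f"DQ query ended in state {state}")

    rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
    bad_count = len(rows) - 1  # the first row holds the column headers

    message = build_alert(bad_count)
    if message:
        sns.publish(
            TopicArn="arn:aws:sns:us-east-1:123456789012:imdb-dq-alerts",  # placeholder
            Subject="IMDB Data Quality Alert",
            Message=message,
        )
    return {"bad_records": bad_count}
```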

5. Visualization
The final curated dataset is connected to QuickSight dashboards.

📊 Dashboards

Here’s a glimpse of the dashboards I built with QuickSight:
⭐ Top Rated Movies
🎭 Genre Distribution
📈 Ratings Trend Over Years
🎬 Runtime vs IMDB Rating

📬 Notifications
No pipeline is complete without monitoring!
Whenever a data quality issue is detected, I get an email from SNS like:
"⚠️ Data Quality Alert: 12 records found with missing IMDB ratings."

📂 Project Repo

I’ve open-sourced my project on GitHub 👉 ETL-Movie-Data-Analysis
The repo includes:
Glue ETL script
Lambda function for DQ checks
Infrastructure configs (EventBridge, SNS, Glue Crawler)
QuickSight dashboard notes

🌟 Lessons Learned

AWS Glue is powerful, but schema mismatches can cause headaches 😅
QuickSight becomes much easier once data types are cleaned properly.
Automating DQ checks with Lambda + Athena is a game changer for reliability.

Documentation & version control (GitHub) are just as important as building the pipeline itself.

🚀 Next Steps

Add CI/CD with GitHub Actions
Use AWS Deequ or Great Expectations for richer data quality checks
Explore exporting QuickSight dashboards as templates

👨‍💻 About Me

I’m Shuvendu Parida – passionate about Data Engineering, Cloud, and Analytics.
This was my first Dev.to post 🙌 and I’d love your feedback!

👉 GitHub

💡 Thanks for reading!
If you found this useful, drop a ❤️ or comment so I can improve in my next posts.
