Hi everyone! This is my first post on Dev.to.
I'm super excited to share my journey of building an ETL pipeline with data quality checks and dashboards using AWS services.
The dataset I used? The famous IMDB Movie Dataset.
Why I Built This
Data pipelines are not just about moving data from point A to B.
They need to handle:
Cleaning messy data
Validating data quality
Automating workflows
Delivering insights via dashboards
So, I challenged myself to design a cloud-native solution on AWS to cover all of these steps.
Architecture Overview
Here's the high-level design of my pipeline:
Amazon S3 → Stores the raw IMDB dataset.
AWS Glue Crawler → Automatically detects the schema and stores metadata in the Glue Data Catalog.
AWS Glue Job → Cleans and transforms the data:
Converts Released_Year into an integer/date
Extracts the runtime in minutes
Converts the Gross revenue field into a numeric value
Amazon Athena → Runs SQL-based data quality checks.
AWS Lambda → Automates the DQ checks, executes the Athena queries, and handles the alerting logic.
Amazon SNS → Sends email notifications if bad data is found.
Amazon EventBridge → Schedules and orchestrates pipeline runs.
Amazon QuickSight → Creates interactive dashboards for insights.
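To make the scheduling piece of this design concrete, here is a minimal sketch of how an EventBridge rule could be pointed at the DQ Lambda with boto3. The rule name, the Lambda ARN, and the daily 06:00 UTC schedule are assumptions for illustration and are not taken from the repo.

```python
def schedule_dq_lambda(events, rule_name, lambda_arn,
                       schedule="cron(0 6 * * ? *)"):
    """Create (or update) a scheduled EventBridge rule that targets a Lambda.

    `events` is a boto3 EventBridge client, e.g. boto3.client("events").
    The default schedule (06:00 UTC every day) is an assumed value.
    """
    rule_arn = events.put_rule(
        Name=rule_name,
        ScheduleExpression=schedule,
        State="ENABLED",
    )["RuleArn"]

    # "dq-check-lambda" is an arbitrary target ID chosen for this sketch.
    events.put_targets(
        Rule=rule_name,
        Targets=[{"Id": "dq-check-lambda", "Arn": lambda_arn}],
    )
    return rule_arn
```

The Lambda function also needs a resource-based permission that allows events.amazonaws.com to invoke it, otherwise the rule fires but the function never runs.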
Tools & Services Used
AWS Glue (ETL + Crawler)
AWS Lambda (Serverless automation)
Amazon Athena (SQL over S3)
Amazon S3 (Data lake storage)
Amazon SNS (Notifications)
Amazon EventBridge (Scheduler)
Amazon QuickSight (BI dashboards)
Python (ETL scripts + Lambda)
How the Pipeline Works
1. Ingest Data
The raw CSV of IMDB movies is uploaded to S3.
2. Discover Schema
The Glue Crawler scans the file → adds the table to the Glue Data Catalog.
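If you want to kick off the crawler from code rather than the console, a rough sketch with boto3 could look like the following. The crawler name is whatever you configured in Glue, and the polling interval is an assumption.

```python
import time


def run_crawler(glue, crawler_name, poll_seconds=15.0, max_polls=80):
    """Start a Glue crawler and wait until it is back in the READY state.

    `glue` is a boto3 Glue client, e.g. boto3.client("glue").
    """
    glue.start_crawler(Name=crawler_name)
    for _ in range(max_polls):
        state = glue.get_crawler(Name=crawler_name)["Crawler"]["State"]
        if state == "READY":  # the other states are RUNNING and STOPPING
            return
        time.sleep(poll_seconds)
    raise TimeoutError(f"Crawler {crawler_name} did not finish in time")
```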
3. ETL with Glue
My Glue script (glue_job_script.py) transforms the data:
from pyspark.sql.functions import regexp_replace

# Example transformation for the Gross column: strip "$" and "," and cast to long
df = df.withColumn("Gross_Cleaned",
                   regexp_replace("Gross", "[$,]", "").cast("long"))
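To show what the three cleaning rules do on individual values, here is a plain-Python sketch that runs without Spark. It is not the Glue script from the repo. The Runtime column name, the "142 min" runtime format, and the Runtime_Minutes output name are assumptions about the raw CSV.

```python
import re
from typing import Optional


def parse_year(raw: str) -> Optional[int]:
    """Return the release year as an int, or None if it is not a 4-digit year."""
    raw = (raw or "").strip()
    return int(raw) if re.fullmatch(r"\d{4}", raw) else None


def parse_runtime_minutes(raw: str) -> Optional[int]:
    """Extract the leading number from strings such as '142 min'."""
    match = re.match(r"\s*(\d+)", raw or "")
    return int(match.group(1)) if match else None


def parse_gross(raw: str) -> Optional[int]:
    """Strip '$' and ',' from a revenue string and convert it to an int."""
    cleaned = re.sub(r"[$,]", "", raw or "").strip()
    return int(cleaned) if re.fullmatch(r"\d+", cleaned) else None


def clean_row(row: dict) -> dict:
    """Apply all three cleaning rules to one raw CSV row."""
    return {
        **row,
        "Released_Year": parse_year(row.get("Released_Year", "")),
        "Runtime_Minutes": parse_runtime_minutes(row.get("Runtime", "")),
        "Gross_Cleaned": parse_gross(row.get("Gross", "")),
    }
```

Values that cannot be parsed come back as None instead of raising, so they stay visible to the data quality checks later in the pipeline.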
4. Data Quality Checks
Lambda runs queries in Athena, e.g.:
SELECT Series_Title, Released_Year, IMDB_Rating
FROM imdb_movies_rating
WHERE IMDB_Rating IS NULL OR Released_Year = '';
If bad data is found → SNS sends me an email alert.
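The Lambda side of this step could be sketched as below: run a COUNT(*) version of the query in Athena, wait for it to finish, and publish to SNS only when bad rows exist. The database name, the S3 output location, and the topic ARN are hypothetical placeholders, and the actual Lambda in the repo may be structured differently.

```python
import time

# Hypothetical placeholders: replace with your own database, bucket, and topic.
DATABASE = "imdb_db"
OUTPUT_LOCATION = "s3://my-athena-results/dq/"
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:dq-alerts"

DQ_QUERY = """
SELECT COUNT(*) AS bad_rows
FROM imdb_movies_rating
WHERE IMDB_Rating IS NULL OR Released_Year = ''
"""


def count_bad_rows(athena, poll_seconds=1.0, max_polls=60):
    """Run DQ_QUERY in Athena and return the number of failing rows."""
    query_id = athena.start_query_execution(
        QueryString=DQ_QUERY,
        QueryExecutionContext={"Database": DATABASE},
        ResultConfiguration={"OutputLocation": OUTPUT_LOCATION},
    )["QueryExecutionId"]

    for _ in range(max_polls):
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state == "SUCCEEDED":
            break
        if state in ("FAILED", "CANCELLED"):
            raise RuntimeError(f"Athena query ended in state {state}")
        time.sleep(poll_seconds)
    else:
        raise TimeoutError("Athena query did not finish in time")

    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    # Row 0 is the header row; row 1 holds the COUNT(*) value.
    return int(rows[1]["Data"][0]["VarCharValue"])


def run_dq_check(athena, sns):
    """Publish an SNS alert if the DQ query finds any bad rows."""
    bad = count_bad_rows(athena)
    if bad > 0:
        sns.publish(
            TopicArn=TOPIC_ARN,
            Subject="Data Quality Alert",
            Message=f"Data Quality Alert: {bad} records found with missing values.",
        )
    return bad


def lambda_handler(event, context):
    import boto3  # imported here so the helpers above can be tested without AWS
    return {"bad_rows": run_dq_check(boto3.client("athena"), boto3.client("sns"))}
```

Passing the clients in as arguments keeps the check logic testable with simple stand-in objects, which is handy before wiring it to real AWS resources.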
5. Visualization
The final curated dataset is connected to QuickSight dashboards.
Dashboards
Here's a glimpse of the dashboards I built with QuickSight:
Top Rated Movies
Genre Distribution
Ratings Trend Over Years
Runtime vs IMDB Rating
Notifications
No pipeline is complete without monitoring!
Whenever a data quality issue is detected, I get an email from SNS like:
"⚠️ Data Quality Alert: 12 records found with missing IMDB ratings."
Project Repo
I've open-sourced my project on GitHub: ETL-Movie-Data-Analysis
The repo includes:
Glue ETL script
Lambda function for DQ checks
Infrastructure configs (EventBridge, SNS, Glue Crawler)
QuickSight dashboard notes
Lessons Learned
AWS Glue is powerful, but schema mismatches can cause headaches.
QuickSight becomes much easier once data types are cleaned properly.
Automating DQ checks with Lambda + Athena is a game changer for reliability.
Documentation & version control (GitHub) are just as important as building the pipeline itself.
Next Steps
Add CI/CD with GitHub Actions
Use AWS Deequ or Great Expectations for richer data quality checks
Explore exporting QuickSight dashboards as templates
About Me
I'm Shuvendu Parida, and I'm passionate about Data Engineering, Cloud, and Analytics.
This was my first Dev.to post, and I'd love your feedback!
GitHub
Thanks for reading!
If you found this useful, drop a ❤️ or a comment so I can improve in my next posts.
