<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Shuvendu Parida</title>
    <description>The latest articles on DEV Community by Shuvendu Parida (@shuvendu_parida_4494624b3).</description>
    <link>https://dev.to/shuvendu_parida_4494624b3</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3494381%2F73406fe8-fe62-45ee-b79d-8acd08458b1e.png</url>
      <title>DEV Community: Shuvendu Parida</title>
      <link>https://dev.to/shuvendu_parida_4494624b3</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/shuvendu_parida_4494624b3"/>
    <language>en</language>
    <item>
      <title>🎬 Building an End-to-End Data Quality &amp; Analytics Pipeline on AWS (IMDB Movie Dataset)</title>
      <dc:creator>Shuvendu Parida</dc:creator>
      <pubDate>Thu, 11 Sep 2025 08:45:01 +0000</pubDate>
      <link>https://dev.to/shuvendu_parida_4494624b3/building-an-end-to-end-data-quality-analytics-pipeline-on-aws-imdb-movie-dataset-k3</link>
      <guid>https://dev.to/shuvendu_parida_4494624b3/building-an-end-to-end-data-quality-analytics-pipeline-on-aws-imdb-movie-dataset-k3</guid>
      <description>&lt;p&gt;👋 Hi everyone! This is my first post on Dev.to 🎉&lt;br&gt;
I’m super excited to share my journey of building an ETL pipeline with data quality checks and dashboards using AWS services.&lt;/p&gt;

&lt;p&gt;The dataset I used? 👉 The famous IMDB Movie Dataset 🎥&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;📌 Why I Built This&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Data pipelines are not just about moving data from point A to B.&lt;br&gt;
They need to handle:&lt;br&gt;
✅ Cleaning messy data&lt;br&gt;
✅ Validating data quality&lt;br&gt;
✅ Automating workflows&lt;br&gt;
✅ Delivering insights via dashboards&lt;/p&gt;

&lt;p&gt;So, I challenged myself to design a cloud-native solution on AWS to cover all of these steps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🏗️ Architecture Overview&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4anxqw3v4w3hintf9f83.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4anxqw3v4w3hintf9f83.png" alt=" " width="587" height="544"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here’s the high-level design of my pipeline:&lt;br&gt;
Amazon S3 – Stores the raw IMDB dataset.&lt;br&gt;
AWS Glue Crawler – Automatically detects schema and stores metadata in Glue Data Catalog.&lt;br&gt;
AWS Glue Job – Cleans and transforms the data:&lt;br&gt;
Converts Released_Year into an integer/date&lt;br&gt;
Extracts the runtime in minutes&lt;br&gt;
Cleans the Gross revenue field into a numeric type&lt;br&gt;
Amazon Athena – Runs SQL-based data quality checks.&lt;br&gt;
AWS Lambda – Automates DQ checks, executes Athena queries, and handles logic.&lt;br&gt;
Amazon SNS – Sends email notifications if bad data is found.&lt;br&gt;
Amazon EventBridge – Schedules and orchestrates pipeline runs.&lt;br&gt;
Amazon QuickSight – Creates interactive dashboards for insights.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🛠️ Tools &amp;amp; Services Used&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;AWS Glue (ETL + Crawler)&lt;br&gt;
AWS Lambda (Serverless automation)&lt;br&gt;
Amazon Athena (SQL over S3)&lt;br&gt;
Amazon S3 (Data lake storage)&lt;br&gt;
Amazon SNS (Notifications)&lt;br&gt;
Amazon EventBridge (Scheduler)&lt;br&gt;
Amazon QuickSight (BI dashboards)&lt;br&gt;
Python (ETL scripts + Lambda)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;⚙️ How the Pipeline Works&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;1. Ingest Data&lt;br&gt;
Raw CSV of IMDB movies is uploaded to S3.&lt;/p&gt;
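&lt;p&gt;The upload can be scripted with boto3; here is a minimal sketch (the bucket layout and &lt;code&gt;raw/imdb/&lt;/code&gt; key prefix are illustrative choices of mine, not taken from the repo):&lt;/p&gt;

```python
import os

RAW_PREFIX = "raw/imdb/"  # hypothetical data-lake layout


def raw_object_key(local_path: str) -> str:
    """Derive the S3 object key for a raw file from its local filename."""
    return RAW_PREFIX + os.path.basename(local_path)


def upload_raw_dataset(local_path: str, bucket: str) -> str:
    """Upload the raw IMDB CSV to the data-lake bucket and return its key."""
    import boto3  # imported lazily so the sketch loads without AWS deps
    s3 = boto3.client("s3")
    key = raw_object_key(local_path)
    s3.upload_file(local_path, bucket, key)
    return key
```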

&lt;p&gt;2. Discover Schema&lt;br&gt;
Glue Crawler scans the file → adds table to Glue Data Catalog.&lt;/p&gt;
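&lt;p&gt;Kicking off the crawler can itself be automated; a hedged boto3 sketch (the crawler name is a placeholder — in my setup EventBridge handles the actual scheduling):&lt;/p&gt;

```python
import time


def crawl_finished(state: str) -> bool:
    """A Glue crawler is done when it returns to the READY state."""
    return state == "READY"


def run_crawler(name: str = "imdb-movies-crawler") -> None:  # placeholder name
    """Start the Glue crawler and wait until it returns to READY."""
    import boto3  # imported lazily so the sketch loads without AWS deps
    glue = boto3.client("glue")
    glue.start_crawler(Name=name)
    while not crawl_finished(glue.get_crawler(Name=name)["Crawler"]["State"]):
        time.sleep(10)
```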

&lt;p&gt;3. ETL with Glue&lt;br&gt;
My Glue script (glue_job_script.py) transforms the data:&lt;br&gt;
    # Example transformation for the Gross column&lt;br&gt;
    df = df.withColumn("Gross_Cleaned",&lt;br&gt;
        regexp_replace("Gross", "[$,]", "").cast("long"))&lt;/p&gt;
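&lt;p&gt;The same cleaning rules can be expressed in plain Python, which makes them easy to unit-test before wiring them into the Glue job (the function names here are my own, not from glue_job_script.py):&lt;/p&gt;

```python
import re


def clean_gross(raw):
    """Strip '$' and ',' from a gross-revenue string and cast to int."""
    if raw is None or raw.strip("$, ") == "":
        return None
    return int(re.sub(r"[$,]", "", raw))


def runtime_minutes(raw):
    """Extract the minute count from a string like '142 min'."""
    match = re.search(r"\d+", raw or "")
    return int(match.group()) if match else None


def released_year(raw):
    """Cast Released_Year to int, returning None for bad values like 'PG'."""
    try:
        return int(raw)
    except (TypeError, ValueError):
        return None
```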

&lt;p&gt;4. Data Quality Checks&lt;br&gt;
Lambda runs queries in Athena, e.g.:&lt;br&gt;
    SELECT Series_Title, Released_Year, IMDB_Rating&lt;br&gt;
    FROM imdb_movies_rating&lt;br&gt;
    WHERE IMDB_Rating IS NULL OR Released_Year = '';&lt;br&gt;
👉 If bad data is found → SNS sends me an email alert.&lt;/p&gt;
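&lt;p&gt;A trimmed sketch of how such a Lambda can run the check end to end (the database name, results bucket, and DQ_TOPIC_ARN environment variable are placeholders, not the actual values from my function):&lt;/p&gt;

```python
import os
import time

DQ_QUERY = """
SELECT Series_Title, Released_Year, IMDB_Rating
FROM imdb_movies_rating
WHERE IMDB_Rating IS NULL OR Released_Year = ''
"""


def alert_message(bad_count: int) -> str:
    """Build the SNS alert text for a given number of failing records."""
    return f"⚠️ Data Quality Alert: {bad_count} records found with missing IMDB ratings."


def handler(event, context):
    """Run the DQ query in Athena; publish an SNS alert if bad rows exist."""
    import boto3  # imported lazily so the sketch loads without AWS deps
    athena = boto3.client("athena")
    qid = athena.start_query_execution(
        QueryString=DQ_QUERY,
        QueryExecutionContext={"Database": "imdb_db"},  # placeholder database
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # placeholder bucket
    )["QueryExecutionId"]
    while True:
        state = athena.get_query_execution(
            QueryExecutionId=qid
        )["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(2)
    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena DQ query ended in state {state}")
    rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
    bad_count = len(rows) - 1  # first row is the column header
    if bad_count > 0:
        boto3.client("sns").publish(
            TopicArn=os.environ["DQ_TOPIC_ARN"],
            Message=alert_message(bad_count),
        )
    return {"bad_records": bad_count}
```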

&lt;p&gt;5. Visualization&lt;br&gt;
The final curated dataset is connected to QuickSight dashboards.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;📊 Dashboards&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here’s a glimpse of the dashboards I built with QuickSight:&lt;br&gt;
⭐ Top Rated Movies&lt;br&gt;
🎭 Genre Distribution&lt;br&gt;
📈 Ratings Trend Over Years&lt;br&gt;
🎬 Runtime vs IMDB Rating&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;📬 Notifications&lt;/strong&gt;&lt;br&gt;
No pipeline is complete without monitoring!&lt;br&gt;
Whenever a data quality issue is detected, I get an email from SNS like:&lt;br&gt;
"⚠️ Data Quality Alert: 12 records found with missing IMDB ratings."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;📂 Project Repo&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I’ve open-sourced my project on GitHub 👉 ETL-Movie-Data-Analysis&lt;br&gt;
The repo includes:&lt;br&gt;
Glue ETL script&lt;br&gt;
Lambda function for DQ checks&lt;br&gt;
Infrastructure configs (EventBridge, SNS, Glue Crawler)&lt;br&gt;
QuickSight dashboard notes&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🌟 Lessons Learned&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;AWS Glue is powerful, but schema mismatches can cause headaches 😅&lt;br&gt;
QuickSight becomes much easier once data types are cleaned properly.&lt;br&gt;
Automating DQ checks with Lambda + Athena is a game changer for reliability.&lt;/p&gt;

&lt;p&gt;Documentation &amp;amp; version control (GitHub) are just as important as building the pipeline itself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🚀 Next Steps&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Add CI/CD with GitHub Actions&lt;br&gt;
Use AWS Deequ or Great Expectations for richer data quality checks&lt;br&gt;
Explore exporting QuickSight dashboards as templates&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;👨‍💻 About Me&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I’m Shuvendu Parida – passionate about Data Engineering, Cloud, and Analytics.&lt;br&gt;
This was my first Dev.to post 🙌 and I’d love your feedback!&lt;/p&gt;

&lt;p&gt;👉 GitHub&lt;/p&gt;

&lt;p&gt;💡 Thanks for reading!&lt;br&gt;
If you found this useful, drop a ❤️ or comment so I can improve in my next posts.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
