DEV Community

ajithmanmu
ajithmanmu

Posted on

How I Built a Serverless Data Analytics Pipeline for Customer Churn with S3, Glue, Athena, and QuickSight

I wanted to explore how AWS services can be combined into a simple data pipeline that not only processes customer churn data, but also highlights the kind of insights companies rely on to drive retention and revenue growth.

For this project, I used the Telco Customer Churn dataset from Kaggle. The goal was to take raw CSV data, process it into a query-optimized format, and power dashboards that surface churn KPIs.


Architecture

Here’s the high-level design:

architecture

  • Amazon S3 — Stores raw Kaggle CSV input and processed Parquet output.
  • AWS Glue — Crawler to catalog schemas + ETL job to convert CSV into Parquet and partition the data.
  • Amazon Athena — Runs SQL queries and views over the processed data.
  • Amazon QuickSight — Dashboards to visualize churn KPIs like churn %, revenue loss, and segmentation.
  • Amazon EventBridge (optional) — Triggers Glue ETL jobs on a schedule.
  • Terraform — Infrastructure as Code for reproducible setup.

You can find the full implementation here: GitHub Repo.


Walkthrough

  1. Data Ingestion The raw Telco churn dataset was uploaded into an S3 bucket. To keep data organized, I added key prefixes such as ingest_date=YYYY-MM-DD/. This structure makes it easier for Glue Crawlers to detect and register new data.

s3

  1. Schema Discovery & ETL Glue Crawlers scanned the raw bucket and registered the schema in the Glue Data Catalog. A Glue ETL job then converted the CSV files into Parquet and wrote the results to a processed S3 bucket with partitions. This format makes queries faster and more cost-efficient.

tables

gluejob

  1. Partitioning Strategy

    Partitioning turned out to be a critical design choice:

    • Avoid high-cardinality keys that generate too many small files.
    • Place date partitions last so queries can easily filter recent data.
    • Hive (the engine behind Athena) processes partitions from left to right, so ordering matters.
  2. Athena Queries

    With data processed and partitioned, Athena queries became much more efficient. I created views for:

    • Overall churn percentage
    • Churn by contract type (month-to-month vs annual)
    • Revenue lost from churners
    • Tenure vs churn patterns

athena

workflow

  1. Visualization QuickSight connected directly to Athena, enabling dashboards with filters and visuals for churn % by demographics, add-on services, and contract types. This provided clear insights into which customers were most at risk.

Security

Even though this was a demo project, I applied security best practices:

  • IAM roles scoped with least privilege
  • S3 encryption (SSE-S3) for data at rest
  • Dedicated Glue and Athena execution roles
  • Restricted access to QuickSight dashboards

Conclusion

This pipeline shows how AWS services can be combined to build a self-service analytics solution with no servers to manage. Starting from raw CSVs, I was able to generate Parquet data, run queries in Athena, and visualize churn insights in QuickSight.

The next step for me is extending the pipeline with Amazon Bedrock. By creating a Knowledge Base and Bedrock Agent, I’ll enable natural-language questions like “What’s the churn rate for two-year contracts vs month-to-month?” and have the agent execute the Athena queries under the hood.


Learnings

Some of the key lessons from this build:

  • Adding ingest_date prefixes in S3 simplified partitioning and Glue Crawling.
  • Partitioning design is critical: avoid high-cardinality keys, put date last, and understand Hive’s left-to-right partition evaluation.
  • Encountered a HIVE_BAD_DATA error — a good reminder that Hive is running under the hood of Athena (flashback to Big Data classes!).
  • Parquet format greatly improved query speed and reduced cost.
  • Used Amazon Q Developer with the Diagram MCP server to auto-generate the architecture diagram — which made documentation far easier.

Top comments (1)

Collapse
 
aaron_rose_0787cc8b4775a0 profile image
Aaron Rose

nice one, ajithmanmu! 💯