I wanted to explore how AWS services can be combined into a simple data pipeline that not only processes customer churn data, but also highlights the kind of insights companies rely on to drive retention and revenue growth.
For this project, I used the Telco Customer Churn dataset from Kaggle. The goal was to take raw CSV data, process it into a query-optimized format, and power dashboards that surface churn KPIs.
Architecture
Here’s the high-level design:
- Amazon S3 — Stores raw Kaggle CSV input and processed Parquet output.
- AWS Glue — Crawler to catalog schemas + ETL job to convert CSV into Parquet and partition the data.
- Amazon Athena — Runs SQL queries and views over the processed data.
- Amazon QuickSight — Dashboards to visualize churn KPIs like churn %, revenue loss, and segmentation.
- Amazon EventBridge (optional) — Triggers Glue ETL jobs on a schedule.
- Terraform — Infrastructure as Code for reproducible setup.
You can find the full implementation here: GitHub Repo.
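The optional scheduling component isn't covered in the walkthrough below, so here is a minimal sketch of one way to wire it up: an EventBridge Scheduler schedule that calls Glue's StartJobRun once a day. The schedule name, IAM role ARN, and job name are placeholders, not values from the repo.

```python
# Sketch (placeholder names): schedule the Glue ETL job daily via EventBridge Scheduler.
import json
import boto3

scheduler = boto3.client("scheduler")
scheduler.create_schedule(
    Name="telco-churn-nightly-etl",            # placeholder schedule name
    ScheduleExpression="rate(1 day)",
    FlexibleTimeWindow={"Mode": "OFF"},
    Target={
        # Universal target that invokes glue:StartJobRun on a schedule
        "Arn": "arn:aws:scheduler:::aws-sdk:glue:startJobRun",
        "RoleArn": "arn:aws:iam::123456789012:role/scheduler-glue-invoke",  # placeholder role
        "Input": json.dumps({"JobName": "telco-churn-etl"}),                # placeholder job name
    },
)
```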
Walkthrough
Data Ingestion
The raw Telco churn dataset was uploaded into an S3 bucket. To keep data organized, I added key prefixes such as `ingest_date=YYYY-MM-DD/`. This structure makes it easier for Glue Crawlers to detect and register new data.
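For illustration, here is a minimal boto3 sketch of that ingestion step; the bucket name and file names are placeholders, not the ones from the repo.

```python
# Sketch (placeholder names): upload the raw Kaggle CSV under an
# ingest_date=YYYY-MM-DD/ prefix so the Glue Crawler can pick it up.
from datetime import date
import boto3

s3 = boto3.client("s3")
raw_bucket = "telco-churn-raw"                      # placeholder raw-zone bucket
prefix = f"ingest_date={date.today().isoformat()}/"  # e.g. ingest_date=2024-01-15/

s3.upload_file(
    Filename="telco_customer_churn.csv",  # local copy of the Kaggle CSV
    Bucket=raw_bucket,
    Key=prefix + "telco_customer_churn.csv",
)
```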
Schema Discovery & ETL
Glue Crawlers scanned the raw bucket and registered the schema in the Glue Data Catalog. A Glue ETL job then converted the CSV files into Parquet and wrote the results to a processed S3 bucket with partitions. This format makes queries faster and more cost-efficient.
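A Glue job that does this conversion could look roughly like the sketch below; the database, table, and bucket names are assumptions rather than the repo's actual values.

```python
# Sketch (placeholder names): Glue ETL job that reads the crawled CSV table
# and writes partitioned Parquet to the processed bucket.
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the raw CSV table registered by the crawler.
raw = glue_context.create_dynamic_frame.from_catalog(
    database="telco_churn_db",      # placeholder database
    table_name="raw_telco_churn",   # placeholder table
)

# Write Parquet to the processed bucket, partitioned by ingest_date.
glue_context.write_dynamic_frame.from_options(
    frame=raw,
    connection_type="s3",
    connection_options={
        "path": "s3://telco-churn-processed/telco/",  # placeholder bucket
        "partitionKeys": ["ingest_date"],
    },
    format="parquet",
)

job.commit()
```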
Partitioning Strategy
Partitioning turned out to be a critical design choice:
- Avoid high-cardinality keys that generate too many small files.
- Place date partitions last so queries can easily filter recent data.
- Athena uses Hive-style partitioning, where partition keys map onto the S3 path from left to right, so ordering matters.
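As a quick sanity check, listing the processed bucket shows the resulting Hive-style layout. The bucket, prefix, and the non-date partition key below are purely illustrative assumptions.

```python
# Sketch (placeholder names): inspect the processed bucket to confirm the layout.
# Expected keys look roughly like:
#   telco/contract=Month-to-month/ingest_date=2024-01-15/part-0000.snappy.parquet
# i.e. low-cardinality keys first, the date partition last.
import boto3

s3 = boto3.client("s3")
resp = s3.list_objects_v2(Bucket="telco-churn-processed", Prefix="telco/", MaxKeys=10)
for obj in resp.get("Contents", []):
    print(obj["Key"])
```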
Athena Queries
With data processed and partitioned, Athena queries became much more efficient. I created views for:
- Overall churn percentage
- Churn by contract type (month-to-month vs annual)
- Revenue lost from churners
- Tenure vs churn patterns
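To give a flavor of these views, here is a sketch of the overall-churn-percentage query submitted through Athena with boto3. The database, table, and output-location names are assumptions; the column names come from the Kaggle dataset (Glue crawlers typically lowercase them).

```python
# Sketch (placeholder names): run the overall churn % query via Athena.
import boto3

athena = boto3.client("athena")

query = """
SELECT
  ROUND(100.0 * SUM(CASE WHEN churn = 'Yes' THEN 1 ELSE 0 END) / COUNT(*), 2)
    AS churn_pct
FROM telco_churn_db.processed_telco_churn
"""

resp = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "telco_churn_db"},
    ResultConfiguration={"OutputLocation": "s3://telco-churn-athena-results/"},
)
print(resp["QueryExecutionId"])
```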
Visualization
QuickSight connected directly to Athena, enabling dashboards with filters and visuals for churn % by demographics, add-on services, and contract types. This provided clear insights into which customers were most at risk.
Security
Even though this was a demo project, I applied security best practices:
- IAM roles scoped with least privilege
- S3 encryption (SSE-S3) for data at rest
- Dedicated Glue and Athena execution roles
- Restricted access to QuickSight dashboards
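As one concrete example of these practices, default SSE-S3 encryption on a bucket can be enabled like this; in the project this is managed through Terraform, and the bucket name here is a placeholder.

```python
# Sketch (placeholder bucket name): enforce default SSE-S3 encryption at rest.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_encryption(
    Bucket="telco-churn-raw",
    ServerSideEncryptionConfiguration={
        "Rules": [
            {"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}
        ]
    },
)
```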
Conclusion
This pipeline shows how AWS services can be combined to build a self-service analytics solution with no servers to manage. Starting from raw CSVs, I was able to generate Parquet data, run queries in Athena, and visualize churn insights in QuickSight.
The next step for me is extending the pipeline with Amazon Bedrock. By creating a Knowledge Base and Bedrock Agent, I’ll enable natural-language questions like “What’s the churn rate for two-year contracts vs month-to-month?” and have the agent execute the Athena queries under the hood.
Learnings
Some of the key lessons from this build:
- Adding ingest_date prefixes in S3 simplified partitioning and Glue Crawling.
- Partitioning design is critical: avoid high-cardinality keys, put date last, and understand Hive’s left-to-right partition evaluation.
- Encountered a `HIVE_BAD_DATA` error — a good reminder that Hive-compatible machinery sits under the hood of Athena (flashback to Big Data classes!).
- Parquet format greatly improved query speed and reduced cost.
- Used Amazon Q Developer with the Diagram MCP server to auto-generate the architecture diagram — which made documentation far easier.