I wanted to explore how AWS services can be combined into a simple data pipeline that not only processes customer churn data, but also highlights the kind of insights companies rely on to drive retention and revenue growth.
For this project, I used the Telco Customer Churn dataset from Kaggle. The goal was to take raw CSV data, process it into a query-optimized format, and power dashboards that surface churn KPIs.
Architecture
Here’s the high-level design:
- Amazon S3 — Stores raw Kaggle CSV input and processed Parquet output.
- AWS Glue — Crawler to catalog schemas + ETL job to convert CSV into Parquet and partition the data.
- Amazon Athena — Runs SQL queries and views over the processed data.
- Amazon QuickSight — Dashboards to visualize churn KPIs like churn %, revenue loss, and segmentation.
- Amazon EventBridge (optional) — Triggers Glue ETL jobs on a schedule.
- Terraform — Infrastructure as Code for reproducible setup.
You can find the full implementation here: GitHub Repo.
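The optional scheduling component isn't covered in the walkthrough below, so here is a minimal sketch of one way to wire it up: an EventBridge Scheduler schedule that calls Glue's StartJobRun once a day. The schedule name, IAM role ARN, and job name are placeholders, not values from the repo.

```python
# Sketch (placeholder names): schedule the Glue ETL job daily via EventBridge Scheduler.
import json
import boto3

scheduler = boto3.client("scheduler")
scheduler.create_schedule(
    Name="telco-churn-nightly-etl",            # placeholder schedule name
    ScheduleExpression="rate(1 day)",
    FlexibleTimeWindow={"Mode": "OFF"},
    Target={
        # Universal target that invokes glue:StartJobRun on a schedule
        "Arn": "arn:aws:scheduler:::aws-sdk:glue:startJobRun",
        "RoleArn": "arn:aws:iam::123456789012:role/scheduler-glue-invoke",  # placeholder role
        "Input": json.dumps({"JobName": "telco-churn-etl"}),                # placeholder job name
    },
)
```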
Walkthrough
Data Ingestion
The raw Telco churn dataset was uploaded into an S3 bucket. To keep data organized, I added key prefixes such as `ingest_date=YYYY-MM-DD/`. This structure makes it easier for Glue Crawlers to detect and register new data.
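For illustration, here is a minimal boto3 sketch of that ingestion step; the bucket name and file names are placeholders, not the ones from the repo.

```python
# Sketch (placeholder names): upload the raw Kaggle CSV under an
# ingest_date=YYYY-MM-DD/ prefix so the Glue Crawler can pick it up.
from datetime import date
import boto3

s3 = boto3.client("s3")
raw_bucket = "telco-churn-raw"                      # placeholder raw-zone bucket
prefix = f"ingest_date={date.today().isoformat()}/"  # e.g. ingest_date=2024-01-15/

s3.upload_file(
    Filename="telco_customer_churn.csv",  # local copy of the Kaggle CSV
    Bucket=raw_bucket,
    Key=prefix + "telco_customer_churn.csv",
)
```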
Schema Discovery & ETL
Glue Crawlers scanned the raw bucket and registered the schema in the Glue Data Catalog. A Glue ETL job then converted the CSV files into Parquet and wrote the results to a processed S3 bucket with partitions. This format makes queries faster and more cost-efficient.
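A Glue job that does this conversion could look roughly like the sketch below; the database, table, and bucket names are assumptions rather than the repo's actual values.

```python
# Sketch (placeholder names): Glue ETL job that reads the crawled CSV table
# and writes partitioned Parquet to the processed bucket.
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the raw CSV table registered by the crawler.
raw = glue_context.create_dynamic_frame.from_catalog(
    database="telco_churn_db",      # placeholder database
    table_name="raw_telco_churn",   # placeholder table
)

# Write Parquet to the processed bucket, partitioned by ingest_date.
glue_context.write_dynamic_frame.from_options(
    frame=raw,
    connection_type="s3",
    connection_options={
        "path": "s3://telco-churn-processed/telco/",  # placeholder bucket
        "partitionKeys": ["ingest_date"],
    },
    format="parquet",
)

job.commit()
```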
Partitioning Strategy
Partitioning turned out to be a critical design choice:
- Avoid high-cardinality keys that generate too many small files.
- Place date partitions last so queries can easily filter recent data.
- Athena uses Hive-style partitioning, where partition keys map onto the S3 path from left to right, so ordering matters.
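As a quick sanity check, listing the processed bucket shows the resulting Hive-style layout. The bucket, prefix, and the non-date partition key below are purely illustrative assumptions.

```python
# Sketch (placeholder names): inspect the processed bucket to confirm the layout.
# Expected keys look roughly like:
#   telco/contract=Month-to-month/ingest_date=2024-01-15/part-0000.snappy.parquet
# i.e. low-cardinality keys first, the date partition last.
import boto3

s3 = boto3.client("s3")
resp = s3.list_objects_v2(Bucket="telco-churn-processed", Prefix="telco/", MaxKeys=10)
for obj in resp.get("Contents", []):
    print(obj["Key"])
```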
Athena Queries
With data processed and partitioned, Athena queries became much more efficient. I created views for:
- Overall churn percentage
- Churn by contract type (month-to-month vs annual)
- Revenue lost from churners
- Tenure vs churn patterns
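To give a flavor of these views, here is a sketch of the overall-churn-percentage query submitted through Athena with boto3. The database, table, and output-location names are assumptions; the column names come from the Kaggle dataset (Glue crawlers typically lowercase them).

```python
# Sketch (placeholder names): run the overall churn % query via Athena.
import boto3

athena = boto3.client("athena")

query = """
SELECT
  ROUND(100.0 * SUM(CASE WHEN churn = 'Yes' THEN 1 ELSE 0 END) / COUNT(*), 2)
    AS churn_pct
FROM telco_churn_db.processed_telco_churn
"""

resp = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "telco_churn_db"},
    ResultConfiguration={"OutputLocation": "s3://telco-churn-athena-results/"},
)
print(resp["QueryExecutionId"])
```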
Visualization
QuickSight connected directly to Athena, enabling dashboards with filters and visuals for churn % by demographics, add-on services, and contract types. This provided clear insights into which customers were most at risk.
Security
Even though this was a demo project, I applied security best practices:
- IAM roles scoped with least privilege
- S3 encryption (SSE-S3) for data at rest
- Dedicated Glue and Athena execution roles
- Restricted access to QuickSight dashboards
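As one concrete example of these practices, default SSE-S3 encryption on a bucket can be enabled like this; in the project this is managed through Terraform, and the bucket name here is a placeholder.

```python
# Sketch (placeholder bucket name): enforce default SSE-S3 encryption at rest.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_encryption(
    Bucket="telco-churn-raw",
    ServerSideEncryptionConfiguration={
        "Rules": [
            {"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}
        ]
    },
)
```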
Conclusion
This pipeline shows how AWS services can be combined to build a self-service analytics solution with no servers to manage. Starting from raw CSVs, I was able to generate Parquet data, run queries in Athena, and visualize churn insights in QuickSight.
The next step for me is extending the pipeline with Amazon Bedrock. By creating a Knowledge Base and Bedrock Agent, I’ll enable natural-language questions like “What’s the churn rate for two-year contracts vs month-to-month?” and have the agent execute the Athena queries under the hood.
Learnings
Some of the key lessons from this build:
- Adding ingest_date prefixes in S3 simplified partitioning and Glue Crawling.
- Partitioning design is critical: avoid high-cardinality keys, put date last, and understand Hive’s left-to-right partition evaluation.
- Encountered a `HIVE_BAD_DATA` error — a good reminder that Hive-compatible machinery sits under the hood of Athena (flashback to Big Data classes!).
- Parquet format greatly improved query speed and reduced cost.
- Used Amazon Q Developer with the Diagram MCP server to auto-generate the architecture diagram — which made documentation far easier.