DEV Community

Naveen Ayalla


Building a PySpark and AWS Glue ETL Pipeline for Search Keyword Revenue Analysis

I published a public data engineering project that demonstrates a cloud-based ETL pipeline for analyzing the revenue attributed to search keywords in web analytics data.

The project uses PySpark, AWS Glue, Amazon S3, and Terraform to process hit-level web analytics data, extract external search engine domains and keywords, parse revenue, and generate a sorted reporting output.
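To make the domain and keyword extraction step concrete, here is a minimal sketch in plain Python (not PySpark, so it runs without a cluster). It is not the repository's code: the function name `extract_search_hit`, the `own_domain` parameter, and the list of query parameter names are assumptions chosen for illustration.

```python
from urllib.parse import urlparse, parse_qs

# Assumption: query parameters that commonly carry the search term.
# The real project may recognize a different set.
KEYWORD_PARAMS = ("q", "p", "query")


def extract_search_hit(referrer, own_domain="example.com"):
    """Return (domain, keyword) for an external search referrer, else None.

    `own_domain` is a hypothetical placeholder for the site being analyzed;
    referrers from that domain are treated as internal and skipped.
    """
    parsed = urlparse(referrer or "")
    host = (parsed.hostname or "").lower()

    # Skip empty referrers and internal navigation.
    if not host or host == own_domain or host.endswith("." + own_domain):
        return None

    query = parse_qs(parsed.query)
    for name in KEYWORD_PARAMS:
        values = query.get(name)
        if values and values[0].strip():
            # Drop a leading "www." and normalize the keyword's case.
            domain = host[4:] if host.startswith("www.") else host
            return domain, values[0].strip().lower()

    # External referrer without a recognizable search term.
    return None
```

In a PySpark job, logic like this would typically be expressed with built-in column functions or wrapped in a UDF and applied to the referrer column.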

Key concepts covered:

Batch ETL pipeline design
PySpark transformations
AWS Glue job configuration
S3 input and output workflow
Revenue aggregation logic
Terraform infrastructure examples
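The revenue aggregation idea can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the project's implementation: the input is assumed to be `(domain, keyword, raw_revenue)` tuples, blank or malformed revenue is assumed to count as zero, and the report is assumed to be sorted by revenue in descending order. The names `parse_revenue` and `aggregate_revenue` are hypothetical.

```python
from collections import defaultdict
from decimal import Decimal, InvalidOperation


def parse_revenue(raw):
    """Parse a revenue string; blank or malformed values count as zero (assumption)."""
    text = (raw or "").strip()
    if not text:
        return Decimal("0")
    try:
        return Decimal(text)
    except InvalidOperation:
        return Decimal("0")


def aggregate_revenue(rows):
    """Sum revenue per (domain, keyword) and sort by total, highest first.

    `rows` is an iterable of (domain, keyword, raw_revenue) tuples.
    """
    totals = defaultdict(Decimal)  # Decimal() starts each total at 0
    for domain, keyword, raw in rows:
        totals[(domain, keyword)] += parse_revenue(raw)

    report = [(d, k, total) for (d, k), total in totals.items()]
    # Assumption: the reporting output is ordered by revenue, descending.
    report.sort(key=lambda row: row[2], reverse=True)
    return report


if __name__ == "__main__":
    sample = [
        ("google.com", "ipod", "290"),
        ("bing.com", "zune", "250"),
        ("google.com", "ipod", "190.50"),
    ]
    for domain, keyword, total in aggregate_revenue(sample):
        print(f"{domain}\t{keyword}\t{total}")
```

`Decimal` is used instead of `float` to avoid rounding drift when summing currency values. The PySpark equivalent would be a `groupBy` on domain and keyword followed by a sum and an `orderBy`.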

This is a generic open-source portfolio project and does not include proprietary or company-provided data.

GitHub: https://github.com/naveenayalla1-CS50/search-keyword-performance-revenue

Feedback from data engineers and cloud data practitioners is welcome.
