# Part 14 - Cloud Deployment and Lessons Learned ☁️
This final part picks up where the local deployment story left off and closes the loop with the cloud architecture defined in `terraform/main.tf` and `terraform/user_data.sh.tftpl`.
## What Terraform provisions
The Terraform layer creates the cloud resources needed to run the project in AWS:
- an S3 bucket for data lake storage,
- an EC2 instance for Airflow and Superset,
- an EMR Serverless application for Spark,
- IAM roles and policies,
- and SSM parameters that publish runtime configuration.
That is enough to reproduce the same pipeline outside a local Docker environment.
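The shape of that Terraform layer can be sketched roughly as follows. Every resource name, bucket name, instance type, and parameter path below is an illustrative assumption, not a value taken from the repository's `main.tf`:

```hcl
variable "repo_url" { type = string } # hypothetical input variables
variable "ami_id"   { type = string }

# Data lake bucket for the raw and processed zones (name is illustrative)
resource "aws_s3_bucket" "data_lake" {
  bucket = "crypto-pipeline-data-lake"
}

# EMR Serverless application that runs the Spark batch jobs
resource "aws_emrserverless_application" "spark" {
  name          = "pipeline-spark"
  release_label = "emr-7.0.0"
  type          = "SPARK"
}

# EC2 instance that hosts Airflow and Superset via Docker Compose;
# the bootstrap script is rendered from the user data template
resource "aws_instance" "airflow" {
  ami           = var.ami_id
  instance_type = "t3.large"
  user_data = templatefile("${path.module}/user_data.sh.tftpl", {
    repo_url = var.repo_url
  })
}

# SSM parameter publishing runtime configuration to the instance
resource "aws_ssm_parameter" "bucket_name" {
  name  = "/pipeline/data_lake_bucket"
  type  = "String"
  value = aws_s3_bucket.data_lake.id
}
```

Publishing the bucket name through SSM, rather than hard-coding it, is what lets the same pipeline code discover its configuration in either environment.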
## Why the EC2 bootstrap matters
The user data template clones the repository, writes the environment file, and starts Docker Compose on the instance. That keeps the cloud setup aligned with the local development workflow.
The result is a single codebase that can run in two environments with minimal friction.
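A minimal sketch of what a template like `user_data.sh.tftpl` does on first boot; the template variable names (`repo_url`, `env_file_content`), the package manager, and the paths are assumptions, not the repository's actual script:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Install and start Docker (assumes an Amazon Linux 2023 AMI)
dnf install -y git docker
systemctl enable --now docker

# Clone the project repository rendered in by Terraform
git clone "${repo_url}" /opt/pipeline
cd /opt/pipeline

# Write the environment file; Terraform's templatefile() substitutes
# ${env_file_content} before the script ever reaches the instance
cat > .env <<EOF
${env_file_content}
EOF

# Start the same Compose stack used for local development
docker compose up -d
```

Because the instance runs the identical `docker compose up` as a laptop does, there is no separate "cloud" code path to maintain.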
## What this project teaches
This repository is a good Zoomcamp final project because it demonstrates the main ideas of modern data engineering:
- ingestion from external APIs,
- raw zone storage,
- batch transformation with Spark,
- warehouse loading,
- dbt modeling,
- dashboard automation,
- and cloud infrastructure as code.
## A few lessons from the implementation
A few practical lessons stand out:
- keep environment-specific logic in one config layer,
- isolate API clients from orchestration code,
- use partitioned storage for time-based data,
- model analytics tables in dbt instead of ad hoc SQL,
- and automate the dashboard so the final result can be reproduced.
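The partitioned-storage lesson is worth a concrete illustration. The helper below is a hypothetical sketch, not code from the project: it builds a Hive-style `year=/month=/day=` key so that engines like Spark or Athena can prune files when a query filters on a date range. The `prices` source name is also an assumption.

```python
from datetime import date

def raw_zone_key(source: str, day: date) -> str:
    """Build a Hive-style partitioned object key for time-based raw data."""
    return (
        f"raw/{source}/"
        f"year={day.year}/month={day.month:02d}/day={day.day:02d}/"
        f"{source}_{day.isoformat()}.json"
    )

key = raw_zone_key("prices", date(2024, 3, 15))
print(key)
# raw/prices/year=2024/month=03/day=15/prices_2024-03-15.json
```

A daily ingestion DAG that writes to keys like this gets idempotent re-runs almost for free: re-processing one day only touches one partition.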
## Closing note
This final article closes the tutorial series. If someone reads the 14 parts in order, they should be able to understand the entire project from raw data collection to dashboard delivery.
The series is now ready to publish as a continuous learning path, and the data-engineering-zoomcamp tag appears in every article so the set stays grouped together.
Tag: #dataengineeringzoomcamp