Cloud Data Platform Blueprint
A complete, layered architecture for building a modern cloud data platform, from ingestion to serving. This blueprint provides Terraform modules, pipeline configurations, and governance policies for every layer of the data stack: batch and streaming ingestion, lakehouse storage with a medallion architecture, processing with Spark and serverless services, serving through APIs and BI tools, and governance via catalog, lineage, and access controls. Designed for data teams building a platform from scratch or modernizing a legacy warehouse.
Key Features
- Medallion Architecture — Bronze, Silver, Gold layers with clear data quality contracts at each stage
- Multi-Source Ingestion — Templates for database CDC, API polling, file drops, and streaming event capture
- Lakehouse Storage — Delta Lake / Iceberg table format configurations with partitioning and compaction strategies
- Processing Patterns — Spark jobs, serverless ETL (Glue/Data Factory), and streaming (Kinesis/Event Hubs) blueprints
- Serving Layer — Pre-built configurations for BI tool connections, REST APIs, and materialized views
- Data Governance — Data catalog setup, column-level access controls, PII tagging, and lineage tracking
- Infrastructure as Code — Full Terraform modules for provisioning storage, compute, and networking
- Cost Controls — Auto-scaling policies, lifecycle rules, and storage tiering to manage platform costs
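The medallion layers above map naturally onto per-layer object-store locations partitioned by ingestion date. A minimal sketch, assuming the bucket names from the sample configuration later in this page (`layer_path` and `LAYER_BUCKETS` are hypothetical helpers, not part of the blueprint's code):

```python
from datetime import date

# Hypothetical helper: build object-store paths per medallion layer,
# partitioned by ingestion date rather than business date.
LAYER_BUCKETS = {
    "bronze": "acme-bronze-production",
    "silver": "acme-silver-production",
    "gold": "acme-gold-production",
}

def layer_path(layer: str, dataset: str, ingested: date) -> str:
    """Return an s3:// path like .../orders/_ingested_date=2024-01-15/."""
    if layer not in LAYER_BUCKETS:
        raise ValueError(f"unknown layer: {layer}")
    return (
        f"s3://{LAYER_BUCKETS[layer]}/{dataset}/"
        f"_ingested_date={ingested.isoformat()}/"
    )

print(layer_path("bronze", "orders", date(2024, 1, 15)))
# s3://acme-bronze-production/orders/_ingested_date=2024-01-15/
```

Keeping the partition key as ingestion date means late-arriving records land in today's partition instead of rewriting a historical one.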
Quick Start
```bash
# Deploy the storage foundation (S3/ADLS + Delta Lake)
cd src/terraform/storage
cp terraform.tfvars.example terraform.tfvars
# Edit terraform.tfvars with your account details
terraform init
terraform plan -out=plan.out
terraform apply plan.out

# Deploy the ingestion layer (assumes the storage module writes its outputs
# to outputs.json, e.g. via a local_file resource)
cd ../ingestion
terraform init
terraform apply -var-file="../storage/outputs.json"
```
Architecture
┌──────────────────────────────────────────────────────────────┐
│ Cloud Data Platform │
│ │
│ Sources Ingestion Storage (Lakehouse) │
│ ┌────────┐ ┌─────────┐ ┌──────────────────┐ │
│ │Database│──CDC────►│ │ │ Bronze (Raw) │ │
│ │ APIs │──Poll───►│ Ingest │─────►│ Silver (Clean) │ │
│ │ Files │──Drop───►│ Layer │ │ Gold (Business) │ │
│ │Streams │──Push───►│ │ └────────┬─────────┘ │
│ └────────┘ └─────────┘ │ │
│ │ │
│ Processing Serving │ │
│ ┌──────────────────┐ ┌────────────┐ │ │
│   │ Spark / Glue     │────────►│ BI Tools   │◄──┘          │
│ │ Stream Processing│ │ REST APIs │ │
│ │ dbt Models │────────►│ ML Feature │ │
│ └──────────────────┘ │ Store │ │
│ └────────────┘ │
│ Governance │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Data Catalog │ Lineage │ Access Control │ Quality │ │
│ └──────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────┘
Usage Examples
Bronze Layer — S3 Bucket with Lifecycle Rules
```hcl
# src/terraform/storage/bronze.tf
resource "aws_s3_bucket" "bronze" {
  bucket = "${var.project_prefix}-bronze-${var.environment}"

  tags = {
    Layer       = "bronze"
    Environment = var.environment
    ManagedBy   = "terraform"
  }
}

resource "aws_s3_bucket_lifecycle_configuration" "bronze_lifecycle" {
  bucket = aws_s3_bucket.bronze.id

  rule {
    id     = "transition-to-ia"
    status = "Enabled"

    transition {
      days          = 30
      storage_class = "STANDARD_IA"
    }

    transition {
      days          = 90
      storage_class = "GLACIER"
    }

    expiration {
      days = 365 # Raw data retained for 1 year
    }
  }
}
```
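For reasoning about what these lifecycle rules cost over time, the tier transitions can be modeled in plain Python. This is a sketch mirroring the rules above (30 days to STANDARD_IA, 90 to GLACIER, expiry at 365), not an AWS API call; `storage_class_at` is a hypothetical helper:

```python
# Mirror of the bronze lifecycle rules: returns the storage class an object
# of a given age would occupy, or "EXPIRED" once past the retention window.
def storage_class_at(age_days: int) -> str:
    if age_days >= 365:
        return "EXPIRED"
    if age_days >= 90:
        return "GLACIER"
    if age_days >= 30:
        return "STANDARD_IA"
    return "STANDARD"

print(storage_class_at(10))   # STANDARD
print(storage_class_at(45))   # STANDARD_IA
print(storage_class_at(200))  # GLACIER
```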
Silver Layer — Data Quality Checks
```python
# src/processing/silver_transform.py
from dataclasses import dataclass
from enum import Enum


class QualityLevel(Enum):
    PASS = "pass"
    WARN = "warn"
    FAIL = "fail"


@dataclass
class QualityResult:
    check_name: str
    level: QualityLevel
    metric_value: float
    threshold: float


def check_completeness(
    records: list[dict], required_fields: list[str]
) -> QualityResult:
    """Verify all required fields are present and non-null."""
    total = len(records)
    if total == 0:
        return QualityResult("completeness", QualityLevel.FAIL, 0.0, 0.95)
    complete = sum(
        1 for r in records
        if all(r.get(f) is not None for f in required_fields)
    )
    ratio = complete / total
    level = (
        QualityLevel.PASS if ratio >= 0.95
        else QualityLevel.WARN if ratio >= 0.80
        else QualityLevel.FAIL
    )
    return QualityResult("completeness", level, ratio, 0.95)
```
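A quick usage sketch of the completeness check on a small batch. The definitions from `silver_transform.py` are repeated here so the snippet runs standalone; the sample records are invented for illustration:

```python
from dataclasses import dataclass
from enum import Enum

class QualityLevel(Enum):
    PASS = "pass"
    WARN = "warn"
    FAIL = "fail"

@dataclass
class QualityResult:
    check_name: str
    level: QualityLevel
    metric_value: float
    threshold: float

def check_completeness(records, required_fields):
    total = len(records)
    if total == 0:
        return QualityResult("completeness", QualityLevel.FAIL, 0.0, 0.95)
    complete = sum(
        1 for r in records if all(r.get(f) is not None for f in required_fields)
    )
    ratio = complete / total
    level = (
        QualityLevel.PASS if ratio >= 0.95
        else QualityLevel.WARN if ratio >= 0.80
        else QualityLevel.FAIL
    )
    return QualityResult("completeness", level, ratio, 0.95)

# 4 of 5 records have both required fields: ratio 0.8, which lands on WARN.
records = [
    {"order_id": 1, "amount": 10.0},
    {"order_id": 2, "amount": 5.5},
    {"order_id": 3, "amount": None},  # incomplete record
    {"order_id": 4, "amount": 2.0},
    {"order_id": 5, "amount": 7.25},
]
result = check_completeness(records, ["order_id", "amount"])
print(result.level.value, result.metric_value)  # warn 0.8
```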
Ingestion — CDC Configuration
```yaml
# configs/ingestion/cdc-source.yaml
source:
  type: postgresql
  host: db.internal.example.com
  port: 5432
  database: orders_db
  tables:
    - schema: public
      table: orders
      primary_key: order_id
      mode: cdc            # Change Data Capture via logical replication
    - schema: public
      table: customers
      primary_key: customer_id
      mode: full_refresh   # Small dimension table — full load daily

target:
  type: s3
  bucket: acme-bronze-production
  prefix: cdc/orders_db/
  format: parquet
  partition_by: [_ingested_date]
```
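Misconfigured table entries (an unknown mode, a missing primary key) are cheap to catch before a connector starts. A minimal validation sketch over the parsed config; the dict literal stands in for what a YAML loader would return, and `validate_tables` is a hypothetical helper, not part of the blueprint:

```python
# Validate the `tables` section of a parsed CDC source config.
VALID_MODES = {"cdc", "full_refresh"}

def validate_tables(tables: list[dict]) -> list[str]:
    """Return a list of human-readable config errors (empty if valid)."""
    errors = []
    for t in tables:
        if t.get("mode") not in VALID_MODES:
            errors.append(f"{t.get('table')}: unknown mode {t.get('mode')!r}")
        if not t.get("primary_key"):
            errors.append(f"{t.get('table')}: missing primary_key")
    return errors

tables = [
    {"schema": "public", "table": "orders",
     "primary_key": "order_id", "mode": "cdc"},
    {"schema": "public", "table": "customers",
     "primary_key": "customer_id", "mode": "full_refresh"},
]
print(validate_tables(tables))  # []
```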
Configuration
```yaml
# configs/platform-config.yaml
project_prefix: acme-data
environment: production
region: us-east-1

storage:
  bronze_bucket: acme-bronze-production
  silver_bucket: acme-silver-production
  gold_bucket: acme-gold-production
  format: delta            # delta or iceberg
  encryption: aws:kms

processing:
  engine: spark            # spark, glue, or databricks
  cluster_size: medium     # small=2 workers, medium=5, large=10
  autoscale: true
  max_workers: 20

governance:
  catalog: glue_catalog    # glue_catalog or unity_catalog
  pii_detection: true      # Automated PII column tagging
  retention_days:
    bronze: 365
    silver: 730
    gold: 1825
```
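How `cluster_size`, `autoscale`, and `max_workers` combine into actual worker bounds is worth pinning down. A sketch under the sizing noted in the comment above (small=2, medium=5, large=10); `worker_bounds` is an illustrative helper, not blueprint code:

```python
# Resolve (min_workers, max_workers) from the processing config:
# the named size sets the baseline; autoscale unlocks max_workers.
SIZE_TO_WORKERS = {"small": 2, "medium": 5, "large": 10}

def worker_bounds(cluster_size: str, autoscale: bool,
                  max_workers: int) -> tuple[int, int]:
    base = SIZE_TO_WORKERS[cluster_size]
    return (base, max_workers if autoscale else base)

print(worker_bounds("medium", True, 20))   # (5, 20)
print(worker_bounds("medium", False, 20))  # (5, 5)
```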
Best Practices
- Treat bronze as immutable — Never modify raw data; reprocess from bronze into silver if logic changes
- Schema-on-read for bronze, schema-on-write for silver — Accept any shape into bronze, enforce contracts at silver
- Partition by ingestion date, not business date — Avoids late-arriving data overwriting partitions
- Separate compute from storage — Use object storage so clusters can scale independently
- Implement data quality gates — Failed quality checks should halt downstream processing, not silently propagate
- Version your transformations — Use dbt or similar tools for version-controlled, testable SQL transformations
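The quality-gate practice above can be sketched as a small enforcement step: a FAIL result raises and halts the pipeline, while WARNs are surfaced for logging. This reuses the `QualityLevel` enum from the silver example (redefined here so the sketch is self-contained); `enforce_gate` and `QualityGateError` are illustrative names:

```python
from enum import Enum

class QualityLevel(Enum):
    PASS = "pass"
    WARN = "warn"
    FAIL = "fail"

class QualityGateError(RuntimeError):
    """Raised when any quality check fails; stops downstream processing."""

def enforce_gate(results: list[tuple[str, QualityLevel]]) -> list[tuple[str, QualityLevel]]:
    """Raise on any FAIL; return the WARN results so they can be logged."""
    failures = [name for name, level in results if level is QualityLevel.FAIL]
    if failures:
        raise QualityGateError(f"quality gate failed: {failures}")
    return [(name, level) for name, level in results if level is QualityLevel.WARN]

warnings = enforce_gate([
    ("completeness", QualityLevel.WARN),
    ("schema_match", QualityLevel.PASS),
])
print(warnings)  # [('completeness', <QualityLevel.WARN: 'warn'>)]
```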
Troubleshooting
| Issue | Cause | Fix |
|---|---|---|
| Spark job OOMs on silver transform | Skewed join keys creating large partitions | Add repartition() before joins or enable AQE |
| Delta table query is slow | Too many small files from frequent writes | Run OPTIMIZE (compaction) and VACUUM on affected tables |
| CDC slot growing indefinitely | Consumer not acknowledging WAL positions | Check CDC connector health; drop and recreate slot if abandoned |
| S3 access denied on cross-account | Bucket policy missing cross-account principal | Add consuming account's role ARN to the bucket policy |
This is 1 of 11 resources in the Cloud Architecture Pro toolkit. Get the complete Cloud Data Platform Blueprint with all files, templates, and documentation for $49.
Or grab the entire Cloud Architecture Pro bundle (11 products) for $149 — save 30%.