Thesius Code

Posted on • Originally published at datanest-stores.pages.dev

Cloud Data Platform Blueprint

A complete, layered architecture for building a modern cloud data platform from ingestion to serving. This blueprint provides Terraform modules, pipeline configurations, and governance policies for each layer of the data stack — covering batch and streaming ingestion, lakehouse storage with medallion architecture, processing with Spark and serverless, serving through APIs and BI tools, and governance with catalog, lineage, and access controls. Designed for data teams building their platform from scratch or modernizing a legacy warehouse.

Key Features

  • Medallion Architecture — Bronze, Silver, Gold layers with clear data quality contracts at each stage
  • Multi-Source Ingestion — Templates for database CDC, API polling, file drops, and streaming event capture
  • Lakehouse Storage — Delta Lake / Iceberg table format configurations with partitioning and compaction strategies
  • Processing Patterns — Spark jobs, serverless ETL (Glue/Data Factory), and streaming (Kinesis/Event Hubs) blueprints
  • Serving Layer — Pre-built configurations for BI tool connections, REST APIs, and materialized views
  • Data Governance — Data catalog setup, column-level access controls, PII tagging, and lineage tracking
  • Infrastructure as Code — Full Terraform modules for provisioning storage, compute, and networking
  • Cost Controls — Auto-scaling policies, lifecycle rules, and storage tiering to manage platform costs

Quick Start

# Deploy the storage foundation (S3/ADLS + Delta Lake)
cd src/terraform/storage
cp terraform.tfvars.example terraform.tfvars
# Edit terraform.tfvars with your account details

terraform init
terraform plan -out=plan.out
terraform apply plan.out

# Deploy the ingestion layer
cd ../ingestion
terraform init
terraform apply -var-file="../storage/outputs.json"

Architecture

┌──────────────────────────────────────────────────────────────┐
│                    Cloud Data Platform                        │
│                                                              │
│  Sources              Ingestion        Storage (Lakehouse)   │
│  ┌────────┐          ┌─────────┐      ┌──────────────────┐  │
│  │Database│──CDC────►│         │      │  Bronze (Raw)    │  │
│  │  APIs  │──Poll───►│ Ingest  │─────►│  Silver (Clean)  │  │
│  │ Files  │──Drop───►│ Layer   │      │  Gold (Business) │  │
│  │Streams │──Push───►│         │      └────────┬─────────┘  │
│  └────────┘          └─────────┘               │             │
│                                                │             │
│  Processing                    Serving         │             │
│  ┌──────────────────┐         ┌────────────┐   │             │
│  │ Spark / Glue     │◄────────┤ BI Tools   │◄──┘             │
│  │ Stream Processing│         │ REST APIs  │                 │
│  │ dbt Models       │────────►│ ML Feature │                 │
│  └──────────────────┘         │ Store      │                 │
│                               └────────────┘                 │
│  Governance                                                  │
│  ┌──────────────────────────────────────────────────────┐    │
│  │ Data Catalog │ Lineage │ Access Control │ Quality    │    │
│  └──────────────────────────────────────────────────────┘    │
└──────────────────────────────────────────────────────────────┘

Usage Examples

Bronze Layer — S3 Bucket with Lifecycle Rules

# src/terraform/storage/bronze.tf
resource "aws_s3_bucket" "bronze" {
  bucket = "${var.project_prefix}-bronze-${var.environment}"
  tags = {
    Layer       = "bronze"
    Environment = var.environment
    ManagedBy   = "terraform"
  }
}

resource "aws_s3_bucket_lifecycle_configuration" "bronze_lifecycle" {
  bucket = aws_s3_bucket.bronze.id
  rule {
    id     = "transition-to-ia"
    status = "Enabled"
    filter {}   # Required in AWS provider v4+; empty filter applies the rule to all objects
    transition {
      days          = 30
      storage_class = "STANDARD_IA"
    }
    transition {
      days          = 90
      storage_class = "GLACIER"
    }
    expiration {
      days = 365   # Raw data retained for 1 year
    }
  }
}
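The lifecycle rules above can be sanity-checked with a small helper that maps an object's age to the storage class those rules would place it in. This is just a sketch mirroring the 30/90/365-day thresholds from the Terraform, not a call to any AWS API:

```python
def bronze_storage_class(age_days: int) -> str:
    """Where does an object of this age live, per the bronze
    lifecycle rules defined in the Terraform above?"""
    if age_days >= 365:
        return "EXPIRED"        # expiration rule deletes it
    if age_days >= 90:
        return "GLACIER"
    if age_days >= 30:
        return "STANDARD_IA"
    return "STANDARD"
```

For example, a 45-day-old object has already transitioned to STANDARD_IA but not yet to GLACIER.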

Silver Layer — Data Quality Checks

# src/processing/silver_transform.py
from dataclasses import dataclass
from enum import Enum

class QualityLevel(Enum):
    PASS = "pass"
    WARN = "warn"
    FAIL = "fail"

@dataclass
class QualityResult:
    check_name: str
    level: QualityLevel
    metric_value: float
    threshold: float

def check_completeness(
    records: list[dict], required_fields: list[str]
) -> QualityResult:
    """Verify all required fields are present and non-null."""
    total = len(records)
    if total == 0:
        return QualityResult("completeness", QualityLevel.FAIL, 0.0, 0.95)
    complete = sum(
        1 for r in records
        if all(r.get(f) is not None for f in required_fields)
    )
    ratio = complete / total
    level = (
        QualityLevel.PASS if ratio >= 0.95
        else QualityLevel.WARN if ratio >= 0.80
        else QualityLevel.FAIL
    )
    return QualityResult("completeness", level, ratio, 0.95)
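The same pattern extends naturally to other checks. As an illustration, here is a hypothetical companion check for primary-key uniqueness, restating the result types so the snippet stands alone (the 0.99 warn threshold is an assumption, not part of the blueprint):

```python
from dataclasses import dataclass
from enum import Enum

class QualityLevel(Enum):
    PASS = "pass"
    WARN = "warn"
    FAIL = "fail"

@dataclass
class QualityResult:
    check_name: str
    level: QualityLevel
    metric_value: float
    threshold: float

def check_uniqueness(records: list[dict], key_field: str) -> QualityResult:
    """Verify the primary key is unique before promoting to silver."""
    total = len(records)
    if total == 0:
        return QualityResult("uniqueness", QualityLevel.FAIL, 0.0, 1.0)
    unique = len({r.get(key_field) for r in records})
    ratio = unique / total
    level = (
        QualityLevel.PASS if ratio == 1.0
        else QualityLevel.WARN if ratio >= 0.99
        else QualityLevel.FAIL
    )
    return QualityResult("uniqueness", level, ratio, 1.0)
```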

Ingestion — CDC Configuration

# configs/ingestion/cdc-source.yaml
source:
  type: postgresql
  host: db.internal.example.com
  port: 5432
  database: orders_db
  tables:
    - schema: public
      table: orders
      primary_key: order_id
      mode: cdc           # Change Data Capture via logical replication
    - schema: public
      table: customers
      primary_key: customer_id
      mode: full_refresh   # Small dimension table — full load daily

target:
  type: s3
  bucket: acme-bronze-production
  prefix: cdc/orders_db/
  format: parquet
  partition_by: [_ingested_date]
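Before an ingestion run, each table entry in a config like this can be validated up front. A minimal sketch (the function and its rules are illustrative, not part of the blueprint's ingestion code):

```python
REQUIRED_FIELDS = ("schema", "table", "primary_key", "mode")
VALID_MODES = ("cdc", "full_refresh")

def validate_table_config(table: dict) -> list[str]:
    """Return a list of config errors; empty list means valid."""
    errors = []
    for field in REQUIRED_FIELDS:
        if field not in table:
            errors.append(f"missing field: {field}")
    if "mode" in table and table["mode"] not in VALID_MODES:
        errors.append(f"unknown mode: {table['mode']!r}")
    return errors
```

Failing fast on a typo like `mode: cdc_replication` is much cheaper than discovering it mid-pipeline.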

Configuration

# configs/platform-config.yaml
project_prefix: acme-data
environment: production
region: us-east-1

storage:
  bronze_bucket: acme-bronze-production
  silver_bucket: acme-silver-production
  gold_bucket: acme-gold-production
  format: delta                    # delta or iceberg
  encryption: aws:kms

processing:
  engine: spark                    # spark, glue, or databricks
  cluster_size: medium             # small=2 workers, medium=5, large=10
  autoscale: true
  max_workers: 20

governance:
  catalog: glue_catalog            # glue_catalog or unity_catalog
  pii_detection: true              # Automated PII column tagging
  retention_days:
    bronze: 365
    silver: 730
    gold: 1825
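The `cluster_size` comment above implies a worker mapping (small=2, medium=5, large=10). A hypothetical helper showing how `autoscale` and `max_workers` might combine with that mapping:

```python
WORKER_BASELINE = {"small": 2, "medium": 5, "large": 10}

def resolve_workers(cluster_size: str, autoscale: bool, max_workers: int) -> tuple[int, int]:
    """Return (min_workers, max_workers) for the processing cluster.

    With autoscale on, the cluster scales between the size's baseline
    and the configured ceiling; otherwise it stays fixed at baseline.
    """
    base = WORKER_BASELINE[cluster_size]
    return (base, max_workers if autoscale else base)
```

With the production config above (`medium`, `autoscale: true`, `max_workers: 20`), this resolves to a range of 5 to 20 workers.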

Best Practices

  • Treat bronze as immutable — Never modify raw data; reprocess from bronze into silver if logic changes
  • Schema-on-read for bronze, schema-on-write for silver — Accept any shape into bronze, enforce contracts at silver
  • Partition by ingestion date, not business date — Avoids late-arriving data overwriting partitions
  • Separate compute from storage — Use object storage so clusters can scale independently
  • Implement data quality gates — Failed quality checks should halt downstream processing, not silently propagate
  • Version your transformations — Use dbt or similar tools for version-controlled, testable SQL transformations
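The quality-gate practice above can be sketched as a guard that runs before any downstream step. The exception type and the `(check_name, level)` pair format are illustrative; in a real deployment this would fail the task in your orchestrator rather than raise inline:

```python
class QualityGateError(Exception):
    """Raised when a dataset fails its quality contract."""

def enforce_quality_gate(results: list[tuple[str, str]]) -> list[str]:
    """Halt the pipeline on any FAIL; surface WARNs without stopping.

    `results` holds (check_name, level) pairs, where level is one of
    "pass", "warn", "fail" — matching the QualityLevel values used in
    the silver-layer checks earlier in this post.
    """
    failures = [name for name, level in results if level == "fail"]
    warnings = [name for name, level in results if level == "warn"]
    if failures:
        # Stop here so bad data never silently propagates downstream
        raise QualityGateError(f"quality gate failed: {failures}")
    return warnings  # caller logs these but proceeds
```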

Troubleshooting

| Issue | Cause | Fix |
| --- | --- | --- |
| Spark job OOMs on silver transform | Skewed join keys creating large partitions | Add repartition() before joins or enable AQE |
| Delta table query is slow | Too many small files from frequent writes | Run OPTIMIZE (compaction) and VACUUM on affected tables |
| CDC slot growing indefinitely | Consumer not acknowledging WAL positions | Check CDC connector health; drop and recreate slot if abandoned |
| S3 access denied on cross-account | Bucket policy missing cross-account principal | Add the consuming account's role ARN to the bucket policy |
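For the "too many small files" row, a rough heuristic can decide when to schedule compaction. The thresholds here are illustrative, not Delta Lake defaults; real deployments would also weigh partition-level counts and write frequency:

```python
def needs_compaction(
    file_sizes_bytes: list[int],
    small_file_mb: int = 16,
    small_ratio: float = 0.5,
) -> bool:
    """Suggest running OPTIMIZE when most of a table's files are small.

    Flags the table when more than `small_ratio` of its data files are
    under `small_file_mb` megabytes.
    """
    if not file_sizes_bytes:
        return False
    cutoff = small_file_mb * 1024 * 1024
    small = sum(1 for size in file_sizes_bytes if size < cutoff)
    return small / len(file_sizes_bytes) > small_ratio
```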

This is 1 of 11 resources in the Cloud Architecture Pro toolkit. Get the complete Cloud Data Platform Blueprint with all files, templates, and documentation for $49.

Get the Full Kit →

Or grab the entire Cloud Architecture Pro bundle (11 products) for $149 — save 30%.

Get the Complete Bundle →
