DEV Community

Thesius Code

Posted on • Originally published at datanest-stores.pages.dev

Data Lakehouse Architecture on Azure: Module 5: Integration Patterns

Module 5: Integration Patterns

Learning Objectives

By the end of this module, you will be able to:

  1. Select the right integration pattern (batch, streaming, event-driven) based on latency, throughput, and cost requirements
  2. Design Change Data Capture (CDC) pipelines using Debezium, Azure CDC, and Delta Lake CDF
  3. Architect API-first data products that expose curated datasets to consumers
  4. Implement Delta Sharing for cross-organization data exchange without data duplication
  5. Design reverse ETL pipelines to push lakehouse data back into operational systems
  6. Plan a data marketplace that enables self-service data discovery and consumption

1. Integration Pattern Selection

1.1 The Integration Spectrum

Data integration is not a binary batch-or-streaming decision. Modern architectures
place workloads along a latency continuum.

  INTEGRATION LATENCY SPECTRUM

  ◄─────────────────────────────────────────────────────────────────►
  BATCH                 MICRO-BATCH           STREAMING        EVENT-DRIVEN
  (hours-days)          (seconds-minutes)     (milliseconds)   (sub-ms)

  ┌────────────┐       ┌────────────┐       ┌──────────┐      ┌───────────┐
  │ ADF        │       │ Spark      │       │ Kafka    │      │ Event     │
  │ Pipelines  │       │ Structured │       │ Streams  │      │ Grid      │
  │ COPY INTO  │       │ Streaming  │       │ Flink    │      │ Functions │
  │ Sqoop      │       │ Auto       │       │ Event    │      │ Logic     │
  │            │       │ Loader     │       │ Hubs     │      │ Apps      │
  └────────────┘       └────────────┘       └──────────┘      └───────────┘

  Cost: $              Cost: $$              Cost: $$$         Cost: $$$$
  Complexity: Low      Complexity: Medium    Complexity: High  Complexity: High
  Freshness: Hours     Freshness: Minutes    Freshness: Secs   Freshness: Real-time

1.2 Decision Framework

Use the following matrix to select the right integration pattern:

| Criterion | Batch | Micro-Batch | Streaming | Event-Driven |
|---|---|---|---|---|
| Latency requirement | > 1 hour acceptable | 1-15 minutes | < 1 minute | Sub-second |
| Data volume | Very large (TB+) | Medium-large | Medium | Small-medium per event |
| Source type | Files, DB snapshots | DB logs, files | Message queues | Webhooks, triggers |
| Schema stability | High | Medium | Medium-low | Low |
| Cost sensitivity | High (optimize cost) | Medium | Low (optimize latency) | Low |
| Ordering guarantees | Not critical | Partition-level | Partition-level | Per-entity |
| Exactly-once need | Easy to achieve | Checkpointed | Requires design | Idempotency required |
| Typical Azure services | ADF, COPY INTO | Structured Streaming | Event Hubs + Spark | Event Grid + Functions |
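As a rough illustration, the matrix can be encoded as a small selection helper. The thresholds below mirror the table's rules of thumb and are illustrative, not prescriptive; the function and parameter names are our own.

```python
# Illustrative pattern selector mirroring the decision matrix above.
# Thresholds are the table's rules of thumb, not hard limits.

def choose_pattern(latency_sla_seconds: float,
                   daily_volume_gb: float,
                   cost_sensitive: bool) -> str:
    """Pick an integration pattern from SLA, volume, and cost posture."""
    if latency_sla_seconds < 1:
        return "event-driven"       # sub-second: Event Grid + Functions
    if latency_sla_seconds < 60:
        return "streaming"          # < 1 minute: Event Hubs + Spark
    if latency_sla_seconds <= 15 * 60:
        return "micro-batch"        # 1-15 minutes: Structured Streaming
    # Over an hour of tolerated latency: batch, especially when cost
    # sensitivity is high or volumes are very large (TB+ favors bulk loads).
    if cost_sensitive or daily_volume_gb >= 1000:
        return "batch"              # ADF, COPY INTO
    return "micro-batch"            # reasonable middle ground

print(choose_pattern(latency_sla_seconds=4 * 3600,
                     daily_volume_gb=2000,
                     cost_sensitive=True))   # batch
```

In practice a real workload scores across several criteria at once; treat the helper as a starting point for the conversation, not a rule engine.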

1.3 Hybrid Pattern: Lambda vs Kappa

Most real architectures use hybrid patterns. Here are two foundational approaches:

  LAMBDA ARCHITECTURE

  ┌─────────────┐
  │ Data Sources │
  └──────┬──────┘
         │
    ┌────┴────┐
    │         │
    ▼         ▼
  ┌─────┐  ┌─────────┐
  │Batch│  │Speed    │
  │Layer│  │Layer    │
  │(ADF)│  │(Stream.)│
  └──┬──┘  └───┬─────┘
     │         │
     ▼         ▼
  ┌────────────────┐
  │  Serving Layer  │
  │  (Delta Tables) │
  └────────────────┘

  ✓ Accuracy via batch reprocessing
  ✗ Two codepaths = double maintenance
  KAPPA ARCHITECTURE

  ┌─────────────┐
  │ Data Sources │
  └──────┬──────┘
         │
         ▼
  ┌─────────────────┐
  │  Unified Stream  │
  │  Processing      │
  │  (Structured     │
  │   Streaming)     │
  └────────┬────────┘
           │
           ▼
  ┌─────────────────┐
  │  Delta Tables    │
  │  (Serving)       │
  └─────────────────┘

  ✓ Single codepath
  ✓ Simpler operations
  ✗ Reprocessing is expensive
  ✗ Complex aggregations harder

Recommendation for Databricks: Prefer Kappa with Delta Lake. Structured Streaming
with Delta as both sink and source eliminates the need for separate batch and speed
layers. Use Delta time travel for reprocessing when needed.
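The single-codepath idea can be sketched in miniature: one upsert function serves both live processing and historical replay (the role Delta time travel plays in reprocessing). A dict stands in for the Delta serving table, and the event shape and field names are illustrative.

```python
# Toy Kappa illustration: ONE function handles both the live stream and a
# historical replay. In a real pipeline this is Structured Streaming
# applying MERGEs to a Delta table.

def apply_event(table: dict, event: dict) -> None:
    """Upsert or delete one change event. Idempotent per (key, version)."""
    key, version = event["key"], event["version"]
    current = table.get(key)
    # Skip events we have already applied (idempotent replay).
    if current is not None and current["version"] >= version:
        return
    if event["op"] == "delete":
        table.pop(key, None)
    else:
        table[key] = {"version": version, **event["data"]}

def process(table: dict, events) -> dict:
    for e in events:
        apply_event(table, e)
    return table

history = [
    {"key": "c1", "version": 1, "op": "upsert", "data": {"tier": "bronze"}},
    {"key": "c1", "version": 2, "op": "upsert", "data": {"tier": "gold"}},
    {"key": "c2", "version": 3, "op": "upsert", "data": {"tier": "silver"}},
    {"key": "c2", "version": 4, "op": "delete", "data": None},
]

# Live processing and full reprocessing share the same code path.
live = process({}, history)
replayed = process({}, history)   # replay from the beginning
assert live == replayed == {"c1": {"version": 2, "tier": "gold"}}
```

The Lambda equivalent would need this logic written twice, once for the batch layer and once for the speed layer, which is exactly the maintenance burden Kappa avoids.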


2. Change Data Capture (CDC) Patterns

2.1 CDC Architecture Overview

CDC captures row-level changes (INSERT, UPDATE, DELETE) from source databases and
propagates them to the lakehouse in near real-time.

  CDC PIPELINE ARCHITECTURE

  ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐
  │ Source DB │───►│ CDC Tool │───►│ Message  │───►│ Lakehouse│
  │ (SQL     │    │(Debezium │    │ Queue    │    │ (Delta   │
  │  Server, │    │ Azure CDC│    │(Kafka /  │    │  Tables) │
  │  Postgres│    │ Oracle   │    │ Event    │    │          │
  │  MySQL)  │    │ GoldenGt)│    │ Hubs)    │    │          │
  └──────────┘    └──────────┘    └──────────┘    └──────────┘
       │                │               │               │
   Transaction      Read log /      JSON/Avro       MERGE INTO
   log (WAL)        Query-based     CDC events      with Delta
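Downstream of the queue, the consumer applies each change event to state. Debezium wraps every change in an envelope with `op` (`c` = create, `u` = update, `d` = delete, `r` = snapshot read) plus `before`/`after` row images. A minimal, illustrative handler (the `id` primary key and the dict store are assumptions; a real consumer would MERGE into Delta):

```python
import json

# Minimal handler for a Debezium-style CDC envelope.
# Op codes are Debezium's: c = create, u = update, d = delete, r = snapshot read.
def apply_cdc_event(store: dict, raw: str) -> None:
    event = json.loads(raw)
    op = event["op"]
    before, after = event.get("before"), event.get("after")
    if op in ("c", "u", "r"):
        store[after["id"]] = after          # upsert the post-image
    elif op == "d":
        store.pop(before["id"], None)       # delete uses the pre-image

store = {}
events = [
    '{"op": "c", "before": null, "after": {"id": 1, "name": "Ada"}}',
    '{"op": "u", "before": {"id": 1, "name": "Ada"}, "after": {"id": 1, "name": "Ada L."}}',
    '{"op": "d", "before": {"id": 1, "name": "Ada L."}, "after": null}',
]
for raw in events:
    apply_cdc_event(store, raw)
print(store)   # {} -- created, updated, then deleted
```

Note that deletes carry only a `before` image, which is why log-based CDC can propagate them while query-based polling cannot.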

2.2 CDC Tool Comparison

| Feature | Debezium | Azure SQL CDC | Oracle GoldenGate | Fivetran / Airbyte |
|---|---|---|---|---|
| Source support | Broad (Postgres, MySQL, SQL Server, Oracle, MongoDB) | SQL Server, Azure SQL | Oracle, SQL Server | 300+ connectors |
| Capture method | Log-based (WAL) | Polling + CT tables | Log-based (redo log) | Mixed (log + query) |
| Latency | Seconds | Minutes | Seconds | Minutes-hours |
| Schema evolution | Automatic detection | Manual | Manual | Automatic |
| Managed service | Self-hosted or Confluent Cloud | Built into Azure SQL | Oracle Cloud | SaaS |
| Cost model | Open source + infra | Free (SQL feature) | License per CPU | Per row synced |
| Delete detection | Native | Via tombstone | Native | Depends on connector |

2.3 Implementing CDC with Delta Lake

Delta Lake's Change Data Feed (CDF) complements source CDC by tracking changes
within the lakehouse itself.

# Enable CDF on a Delta table
spark.sql("""
    ALTER TABLE silver.customers
    SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
""")

# Read changes between two table versions
changes_df = (
    spark.read
    .format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 5)
    .option("endingVersion", 10)
    .table("silver.customers")
)

# CDF adds these metadata columns:
# _change_type: insert, update_preimage, update_postimage, delete
# _commit_version: Delta version number
# _commit_timestamp: when the change was committed

# Apply changes to the gold table using MERGE. Two details matter:
#   1. Keep only the latest change per key: a key updated in several
#      versions of the range would match multiple source rows and make
#      the MERGE fail.
#   2. Drop the CDF metadata columns so UPDATE SET * / INSERT * line up
#      with the target schema.
changes_df.createOrReplaceTempView("customer_changes")

spark.sql("""
    MERGE INTO gold.dim_customer t
    USING (
        SELECT * EXCEPT (rn, _change_type, _commit_version, _commit_timestamp)
        FROM (
            SELECT *,
                   ROW_NUMBER() OVER (
                       PARTITION BY customer_id
                       ORDER BY _commit_version DESC
                   ) AS rn
            FROM customer_changes
            WHERE _change_type IN ('insert', 'update_postimage')
        )
        WHERE rn = 1
    ) s
    ON t.customer_id = s.customer_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")

# Handle deletes separately
spark.sql("""
    DELETE FROM gold.dim_customer
    WHERE customer_id IN (
        SELECT customer_id FROM customer_changes
        WHERE _change_type = 'delete'
    )
""")

2.4 CDC Anti-Patterns

| Anti-Pattern | Problem | Solution |
|---|---|---|
| Polling-based CDC on large tables | Misses deletes, high source load | Use log-based CDC (Debezium) |
| No ordering guarantees | Out-of-order events corrupt state | Partition by entity key, process in order |
| Ignoring schema changes | Pipeline breaks on ALTER TABLE | Debezium schema registry + schema evolution |
| Full snapshots as "CDC" | Wasteful, slow, misses interim changes | True log-based CDC |
| No dead letter queue | Poison messages block pipeline | Route failures to DLQ for investigation |
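The dead-letter-queue point deserves a sketch: failures are quarantined with enough context to investigate instead of halting everything behind them. The in-memory lists below stand in for an Event Hubs consumer and a DLQ topic.

```python
import json

# Sketch of DLQ routing: a poison message is set aside with error context
# rather than blocking the messages behind it.
def process_batch(raw_events, handle):
    processed, dead_letters = [], []
    for raw in raw_events:
        try:
            processed.append(handle(json.loads(raw)))
        except Exception as exc:
            # Route the failure aside with enough context to investigate.
            dead_letters.append({"payload": raw, "error": repr(exc)})
    return processed, dead_letters

ok, dlq = process_batch(
    ['{"id": 1}', "not json", '{"id": 2}'],
    handle=lambda e: e["id"],
)
print(ok)        # [1, 2]
print(len(dlq))  # 1
```

A production version would publish `dead_letters` to a separate topic or storage path and alert when its rate crosses a threshold.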

3. API-First Data Products

3.1 Data Product API Architecture

Data products exposed via APIs enable real-time access to curated lakehouse data
for operational systems.

  API-FIRST DATA PRODUCT ARCHITECTURE

  ┌─────────────────────────────────────────────────────┐
  │                  DATA PRODUCT                        │
  │  ┌──────────┐  ┌──────────┐  ┌──────────────────┐  │
  │  │ Gold     │  │ Caching  │  │ API Layer        │  │
  │  │ Delta    │──│ Layer    │──│ (REST / GraphQL) │  │
  │  │ Tables   │  │(Redis /  │  │                  │  │
  │  │          │  │ Cosmos)  │  │ - Rate limiting  │  │
  │  └──────────┘  └──────────┘  │ - Auth (OAuth2)  │  │
  │                              │ - Versioning     │  │
  │                              │ - Schema docs    │  │
  │  ┌──────────────────────┐   │ - SLA monitoring  │  │
  │  │ Contract / Schema    │   └──────────────────┘  │
  │  │ (OpenAPI / AsyncAPI) │                          │
  │  └──────────────────────┘                          │
  └─────────────────────────────────────────────────────┘
           │              │              │
           ▼              ▼              ▼
      ┌────────┐    ┌────────┐    ┌────────┐
      │Web Apps│    │Mobile  │    │ML      │
      │        │    │Apps    │    │Services│
      └────────┘    └────────┘    └────────┘

3.2 API vs Delta Sharing vs Direct Query

| Access Pattern | API (REST/GraphQL) | Delta Sharing | Direct SQL (JDBC/ODBC) |
|---|---|---|---|
| Latency | Milliseconds (cached) | Seconds-minutes | Seconds |
| Data volume | Small (rows/entities) | Large (tables/partitions) | Medium |
| Consumer type | Applications | Data platforms | BI tools, notebooks |
| Auth model | OAuth2 / API keys | Bearer tokens | Service principal |
| Schema contract | OpenAPI / GraphQL schema | Delta schema | SQL schema |
| Rate limiting | Yes (API gateway) | Provider-controlled | Connection pool |
| Best for | Operational reads | Cross-org analytics sharing | Ad-hoc analysis |

3.3 API Implementation Pattern on Azure

  ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐
  │ Gold     │───►│ Azure    │───►│ API      │───►│ Consumer │
  │ Delta    │    │ Cosmos DB│    │ Mgmt     │    │ App      │
  │ Table    │    │ (cache)  │    │ (APIM)   │    │          │
  └──────────┘    └──────────┘    └──────────┘    └──────────┘
       │                │               │               │
   Structured      Cosmos Change     OAuth2 +       OpenAPI
   Streaming       Feed for          Rate limit     contract
   to Cosmos       invalidation      + logging      + SDK

Key design decisions:

  • Caching layer: Use Cosmos DB or Redis for sub-10ms reads
  • Sync strategy: Structured Streaming from Delta → Cosmos for near real-time sync
  • Cache invalidation: Cosmos Change Feed or Delta CDF triggers cache refresh
  • API gateway: Azure API Management for auth, throttling, versioning, analytics
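The caching decision follows the classic cache-aside (read-through) pattern. A sketch with a dict and TTL standing in for Redis/Cosmos; `fetch_from_gold` is a hypothetical loader, not a real API:

```python
import time

# Cache-aside read path: check the cache, fall back to the serving store,
# then populate the cache with a TTL. The dict stands in for Redis/Cosmos.
CACHE: dict = {}
TTL_SECONDS = 60

def fetch_from_gold(customer_id: str) -> dict:
    # Hypothetical loader; in practice a query against the serving store
    # kept in sync by Structured Streaming from the gold Delta table.
    return {"customer_id": customer_id, "tier": "gold"}

def get_customer(customer_id: str) -> dict:
    entry = CACHE.get(customer_id)
    if entry and entry["expires_at"] > time.monotonic():
        return entry["value"]                  # cache hit: sub-10ms path
    value = fetch_from_gold(customer_id)       # cache miss: read through
    CACHE[customer_id] = {"value": value,
                          "expires_at": time.monotonic() + TTL_SECONDS}
    return value

print(get_customer("c-42"))   # miss: loads and caches
print(get_customer("c-42"))   # hit: served from cache
```

The TTL here is a fallback; the event-driven invalidation described above (Cosmos Change Feed or Delta CDF) keeps entries fresh without waiting for expiry.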

4. Delta Sharing

4.1 Delta Sharing Architecture

Delta Sharing is an open protocol for secure data sharing across organizations
without copying data.

  DELTA SHARING — OPEN PROTOCOL

  PROVIDER SIDE                           RECIPIENT SIDE
  ┌─────────────────────┐               ┌─────────────────────┐
  │ Unity Catalog        │               │ Any Platform         │
  │                      │               │                      │
  │ ┌─────────────┐     │   HTTPS/REST  │ ┌─────────────┐     │
  │ │ Delta Tables│     │◄─────────────►│ │ Delta       │     │
  │ │ (ADLS Gen2) │     │               │ │ Sharing     │     │
  │ └─────────────┘     │   Pre-signed  │ │ Client      │     │
  │                      │   URLs        │ └─────────────┘     │
  │ ┌─────────────┐     │               │                      │
  │ │ Sharing     │     │               │ Supported:           │
  │ │ Server      │     │               │ - Databricks         │
  │ │ (built-in)  │     │               │ - Spark (OSS)        │
  │ └─────────────┘     │               │ - pandas             │
  │                      │               │ - Power BI           │
  │ Access control:      │               │ - Tableau            │
  │ - Recipient mgmt    │               │ - dbt                │
  │ - Token rotation    │               │ - Any REST client    │
  │ - Audit logging     │               │                      │
  └─────────────────────┘               └─────────────────────┘

4.2 Setting Up Delta Sharing

-- PROVIDER: Create a share
CREATE SHARE IF NOT EXISTS customer_analytics;

-- Add tables to the share
ALTER SHARE customer_analytics
ADD TABLE gold.dim_customer
  COMMENT 'Customer dimension - refreshed daily'
  PARTITION (region = 'EU');

ALTER SHARE customer_analytics
ADD TABLE gold.monthly_revenue
  COMMENT 'Monthly revenue aggregates';

-- Create recipient
CREATE RECIPIENT acme_corp
  USING ID 'acme-databricks-workspace-id';

-- Or for non-Databricks recipients, generate activation link
CREATE RECIPIENT external_partner;
-- Download credentials file for the recipient

-- Grant access
GRANT SELECT ON SHARE customer_analytics TO RECIPIENT acme_corp;

-- Audit: who accessed what
SELECT * FROM system.access.audit
WHERE action_name = 'deltaSharingQuery'
ORDER BY event_time DESC;
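On the recipient side, the open-source `delta-sharing` Python client addresses a shared table as `<profile-file>#<share>.<schema>.<table>`. A sketch using the share from the example above; the schema name and profile path are assumptions, and `load_as_pandas` needs the credentials file downloaded from the activation link:

```python
# Recipient-side sketch using the open-source delta-sharing Python client
# (pip install delta-sharing). A shared table is addressed as
# "<profile-file>#<share>.<schema>.<table>".

def table_url(profile_path: str, share: str, schema: str, table: str) -> str:
    """Build the delta-sharing table URL from a credentials profile."""
    return f"{profile_path}#{share}.{schema}.{table}"

def read_shared_table(profile_path: str):
    # Requires the credentials file from the provider's activation link.
    import delta_sharing
    url = table_url(profile_path, "customer_analytics", "gold", "dim_customer")
    return delta_sharing.load_as_pandas(url)

print(table_url("config.share", "customer_analytics", "gold", "dim_customer"))
# config.share#customer_analytics.gold.dim_customer
```

Because the client fetches data through pre-signed URLs, the recipient never needs credentials to the provider's storage account.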

4.3 Delta Sharing vs Alternatives

| Feature | Delta Sharing | Snowflake Data Sharing | S3/ADLS Direct Access | API |
|---|---|---|---|---|
| Protocol | Open standard | Proprietary | Cloud IAM | HTTP |
| Data copy | None (pre-signed URLs) | None (account refs) | None | Serialized |
| Cross-platform | Yes (any client) | Snowflake only | Same cloud | Any |
| Cross-cloud | Yes | Yes (with replication) | No | Yes |
| Governance | Unity Catalog | Snowflake governance | Manual IAM | Custom |
| Cost to recipient | Egress only | Compute only | Egress only | Compute + egress |
| Live tables | Yes (CDF support) | Yes | No (snapshots) | Yes |

5. Reverse ETL

5.1 The Reverse ETL Pattern

Reverse ETL pushes curated lakehouse data back into operational systems (CRM, marketing
platforms, support tools) to activate analytics insights.

  TRADITIONAL ETL vs REVERSE ETL

  TRADITIONAL ETL:
  ┌──────────┐    ┌──────────┐    ┌──────────┐
  │Salesforce│───►│ Lakehouse│───►│ Dashboard│
  │HubSpot   │    │ (Gold)   │    │ (BI)     │
  │Zendesk   │    │          │    │          │
  └──────────┘    └──────────┘    └──────────┘

  REVERSE ETL:
  ┌──────────┐    ┌──────────┐    ┌──────────┐
  │Salesforce│◄───│ Lakehouse│◄───│Raw Sources│
  │HubSpot   │    │ (Gold)   │    │          │
  │Zendesk   │    │          │    │          │
  └──────────┘    └──────────┘    └──────────┘
        ▲               │
        │               │
     Enriched        ML scores,
     profiles,       segments,
     lead scores     predictions
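A minimal reverse-ETL sync loop reads a gold segment, batches it, and pushes each batch to the operational API with retries. `post_to_crm` is a hypothetical stand-in for the destination SDK call (e.g., a Salesforce bulk upsert); names and batch sizes are illustrative.

```python
import time

# Reverse-ETL push sketch: batch gold-table rows into an operational API.
# post_to_crm stands in for the real destination SDK call.

def post_to_crm(batch):
    """Pretend API call; a real one would POST and raise on 429/5xx."""
    return {"updated": len(batch)}

def sync_segment(rows, batch_size=200, max_retries=3):
    total = 0
    for i in range(0, len(rows), batch_size):
        batch = rows[i:i + batch_size]
        for attempt in range(max_retries):
            try:
                total += post_to_crm(batch)["updated"]
                break
            except Exception:
                if attempt == max_retries - 1:
                    raise                    # exhausted retries: surface it
                time.sleep(2 ** attempt)     # exponential backoff
    return total

rows = [{"customer_id": n, "churn_score": 0.1 * n} for n in range(450)]
print(sync_segment(rows))   # 450
```

Batching and backoff matter here because CRM and marketing APIs are rate-limited; the lakehouse can easily produce rows faster than the destination will accept them.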

5.2 Implementation Approaches

... [content trimmed for length — full version in the complete kit]


This is 1 of 5 resources in the Datanest Academy toolkit. Get the complete [Data Lakehouse Architecture on Azure] with all files, templates, and documentation for $199.

Get the Full Kit →

Or grab the entire Datanest Academy bundle (5 products) for $XXX — save XX%.

Get the Complete Bundle →

