DEV Community

Thesius Code

Posted on • Originally published at datanest-stores.pages.dev

Data Lakehouse Architecture on Azure: Module 5: Integration Patterns

Module 5: Integration Patterns

Learning Objectives

By the end of this module, you will be able to:

  1. Select the right integration pattern (batch, streaming, event-driven) based on latency, throughput, and cost requirements
  2. Design Change Data Capture (CDC) pipelines using Debezium, Azure CDC, and Delta Lake CDF
  3. Architect API-first data products that expose curated datasets to consumers
  4. Implement Delta Sharing for cross-organization data exchange without data duplication
  5. Design reverse ETL pipelines to push lakehouse data back into operational systems
  6. Plan a data marketplace that enables self-service data discovery and consumption

1. Integration Pattern Selection

1.1 The Integration Spectrum

Data integration is not a binary batch-or-streaming decision. Modern architectures
place workloads along a latency continuum.

  INTEGRATION LATENCY SPECTRUM

  ◄─────────────────────────────────────────────────────────────────►
  BATCH                 MICRO-BATCH           STREAMING        EVENT-DRIVEN
  (hours-days)          (seconds-minutes)     (milliseconds)   (sub-ms)

  ┌────────────┐       ┌────────────┐       ┌──────────┐      ┌───────────┐
  │ ADF        │       │ Spark      │       │ Kafka    │      │ Event     │
  │ Pipelines  │       │ Structured │       │ Streams  │      │ Grid      │
  │ COPY INTO  │       │ Streaming  │       │ Flink    │      │ Functions │
  │ Sqoop      │       │ Auto       │       │ Event    │      │ Logic     │
  │            │       │ Loader     │       │ Hubs     │      │ Apps      │
  └────────────┘       └────────────┘       └──────────┘      └───────────┘

  Cost: $              Cost: $$              Cost: $$$         Cost: $$$$
  Complexity: Low      Complexity: Medium    Complexity: High  Complexity: High
  Freshness: Hours     Freshness: Minutes    Freshness: Secs   Freshness: Real-time

1.2 Decision Framework

Use the following matrix to select the right integration pattern:

| Criterion | Batch | Micro-Batch | Streaming | Event-Driven |
|---|---|---|---|---|
| Latency requirement | > 1 hour acceptable | 1-15 minutes | < 1 minute | Sub-second |
| Data volume | Very large (TB+) | Medium-large | Medium | Small-medium per event |
| Source type | Files, DB snapshots | DB logs, files | Message queues | Webhooks, triggers |
| Schema stability | High | Medium | Medium-low | Low |
| Cost sensitivity | High (optimize cost) | Medium | Low (optimize latency) | Low |
| Ordering guarantees | Not critical | Partition-level | Partition-level | Per-entity |
| Exactly-once need | Easy to achieve | Checkpointed | Requires design | Idempotency required |
| Typical Azure services | ADF, COPY INTO | Structured Streaming | Event Hubs + Spark | Event Grid + Functions |
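As a rough illustration, the matrix can be encoded as a small selection helper. The thresholds below mirror the table's rules of thumb and are illustrative, not prescriptive; the function and parameter names are our own.

```python
# Illustrative pattern selector mirroring the decision matrix above.
# Thresholds are the table's rules of thumb, not hard limits.

def choose_pattern(latency_sla_seconds: float,
                   daily_volume_gb: float,
                   cost_sensitive: bool) -> str:
    """Pick an integration pattern from SLA, volume, and cost posture."""
    if latency_sla_seconds < 1:
        return "event-driven"       # sub-second: Event Grid + Functions
    if latency_sla_seconds < 60:
        return "streaming"          # < 1 minute: Event Hubs + Spark
    if latency_sla_seconds <= 15 * 60:
        return "micro-batch"        # 1-15 minutes: Structured Streaming
    # Over an hour of tolerated latency: batch, especially when cost
    # sensitivity is high or volumes are very large (TB+ favors bulk loads).
    if cost_sensitive or daily_volume_gb >= 1000:
        return "batch"              # ADF, COPY INTO
    return "micro-batch"            # reasonable middle ground

print(choose_pattern(latency_sla_seconds=4 * 3600,
                     daily_volume_gb=2000,
                     cost_sensitive=True))   # batch
```

In practice a real workload scores across several criteria at once; treat the helper as a starting point for the conversation, not a rule engine.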

1.3 Hybrid Pattern: Lambda vs Kappa

Most real architectures use hybrid patterns. Here are two foundational approaches:

  LAMBDA ARCHITECTURE

  ┌─────────────┐
  │ Data Sources │
  └──────┬──────┘
         │
    ┌────┴────┐
    │         │
    ▼         ▼
  ┌─────┐  ┌─────────┐
  │Batch│  │Speed    │
  │Layer│  │Layer    │
  │(ADF)│  │(Stream.)│
  └──┬──┘  └───┬─────┘
     │         │
     ▼         ▼
  ┌────────────────┐
  │  Serving Layer  │
  │  (Delta Tables) │
  └────────────────┘

  ✓ Accuracy via batch reprocessing
  ✗ Two codepaths = double maintenance
  KAPPA ARCHITECTURE

  ┌─────────────┐
  │ Data Sources │
  └──────┬──────┘
         │
         ▼
  ┌─────────────────┐
  │  Unified Stream  │
  │  Processing      │
  │  (Structured     │
  │   Streaming)     │
  └────────┬────────┘
           │
           ▼
  ┌─────────────────┐
  │  Delta Tables    │
  │  (Serving)       │
  └─────────────────┘

  ✓ Single codepath
  ✓ Simpler operations
  ✗ Reprocessing is expensive
  ✗ Complex aggregations harder

Recommendation for Databricks: Prefer Kappa with Delta Lake. Structured Streaming
with Delta as both sink and source eliminates the need for separate batch and speed
layers. Use Delta time travel for reprocessing when needed.
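The single-codepath idea can be sketched in miniature: one upsert function serves both live processing and historical replay (the role Delta time travel plays in reprocessing). A dict stands in for the Delta serving table, and the event shape and field names are illustrative.

```python
# Toy Kappa illustration: ONE function handles both the live stream and a
# historical replay. In a real pipeline this is Structured Streaming
# applying MERGEs to a Delta table.

def apply_event(table: dict, event: dict) -> None:
    """Upsert or delete one change event. Idempotent per (key, version)."""
    key, version = event["key"], event["version"]
    current = table.get(key)
    # Skip events we have already applied (idempotent replay).
    if current is not None and current["version"] >= version:
        return
    if event["op"] == "delete":
        table.pop(key, None)
    else:
        table[key] = {"version": version, **event["data"]}

def process(table: dict, events) -> dict:
    for e in events:
        apply_event(table, e)
    return table

history = [
    {"key": "c1", "version": 1, "op": "upsert", "data": {"tier": "bronze"}},
    {"key": "c1", "version": 2, "op": "upsert", "data": {"tier": "gold"}},
    {"key": "c2", "version": 3, "op": "upsert", "data": {"tier": "silver"}},
    {"key": "c2", "version": 4, "op": "delete", "data": None},
]

# Live processing and full reprocessing share the same code path.
live = process({}, history)
replayed = process({}, history)   # replay from the beginning
assert live == replayed == {"c1": {"version": 2, "tier": "gold"}}
```

The Lambda equivalent would need this logic written twice, once for the batch layer and once for the speed layer, which is exactly the maintenance burden Kappa avoids.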


2. Change Data Capture (CDC) Patterns

2.1 CDC Architecture Overview

CDC captures row-level changes (INSERT, UPDATE, DELETE) from source databases and
propagates them to the lakehouse in near real-time.

  CDC PIPELINE ARCHITECTURE

  ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐
  │ Source DB │───►│ CDC Tool │───►│ Message  │───►│ Lakehouse│
  │ (SQL     │    │(Debezium │    │ Queue    │    │ (Delta   │
  │  Server, │    │ Azure CDC│    │(Kafka /  │    │  Tables) │
  │  Postgres│    │ Oracle   │    │ Event    │    │          │
  │  MySQL)  │    │ GoldenGt)│    │ Hubs)    │    │          │
  └──────────┘    └──────────┘    └──────────┘    └──────────┘
       │                │               │               │
   Transaction      Read log /      JSON/Avro       MERGE INTO
   log (WAL)        Query-based     CDC events      with Delta
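Downstream of the queue, the consumer applies each change event to state. Debezium wraps every change in an envelope with `op` (`c` = create, `u` = update, `d` = delete, `r` = snapshot read) plus `before`/`after` row images. A minimal, illustrative handler (the `id` primary key and the dict store are assumptions; a real consumer would MERGE into Delta):

```python
import json

# Minimal handler for a Debezium-style CDC envelope.
# Op codes are Debezium's: c = create, u = update, d = delete, r = snapshot read.
def apply_cdc_event(store: dict, raw: str) -> None:
    event = json.loads(raw)
    op = event["op"]
    before, after = event.get("before"), event.get("after")
    if op in ("c", "u", "r"):
        store[after["id"]] = after          # upsert the post-image
    elif op == "d":
        store.pop(before["id"], None)       # delete uses the pre-image

store = {}
events = [
    '{"op": "c", "before": null, "after": {"id": 1, "name": "Ada"}}',
    '{"op": "u", "before": {"id": 1, "name": "Ada"}, "after": {"id": 1, "name": "Ada L."}}',
    '{"op": "d", "before": {"id": 1, "name": "Ada L."}, "after": null}',
]
for raw in events:
    apply_cdc_event(store, raw)
print(store)   # {} -- created, updated, then deleted
```

Note that deletes carry only a `before` image, which is why log-based CDC can propagate them while query-based polling cannot.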

2.2 CDC Tool Comparison

| Feature | Debezium | Azure SQL CDC | Oracle GoldenGate | Fivetran / Airbyte |
|---|---|---|---|---|
| Source support | Broad (Postgres, MySQL, SQL Server, Oracle, MongoDB) | SQL Server, Azure SQL | Oracle, SQL Server | 300+ connectors |
| Capture method | Log-based (WAL) | Polling + CT tables | Log-based (redo log) | Mixed (log + query) |
| Latency | Seconds | Minutes | Seconds | Minutes-hours |
| Schema evolution | Automatic detection | Manual | Manual | Automatic |
| Managed service | Self-hosted or Confluent Cloud | Built into Azure SQL | Oracle Cloud | SaaS |
| Cost model | Open source + infra | Free (SQL feature) | License per CPU | Per row synced |
| Delete detection | Native | Via tombstone | Native | Depends on connector |

2.3 Implementing CDC with Delta Lake

Delta Lake's Change Data Feed (CDF) complements source CDC by tracking changes
within the lakehouse itself.

# Enable CDF on a Delta table
spark.sql("""
    ALTER TABLE silver.customers
    SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
""")

# Read changes between two table versions
changes_df = (
    spark.read
    .format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 5)
    .option("endingVersion", 10)
    .table("silver.customers")
)

# CDF adds these metadata columns:
# _change_type: insert, update_preimage, update_postimage, delete
# _commit_version: Delta version number
# _commit_timestamp: when the change was committed

# Apply changes to the gold table using MERGE. Two details matter:
#   1. Keep only the latest change per key: a key updated in several
#      versions of the range would match multiple source rows and make
#      the MERGE fail.
#   2. Drop the CDF metadata columns so UPDATE SET * / INSERT * line up
#      with the target schema.
changes_df.createOrReplaceTempView("customer_changes")

spark.sql("""
    MERGE INTO gold.dim_customer t
    USING (
        SELECT * EXCEPT (rn, _change_type, _commit_version, _commit_timestamp)
        FROM (
            SELECT *,
                   ROW_NUMBER() OVER (
                       PARTITION BY customer_id
                       ORDER BY _commit_version DESC
                   ) AS rn
            FROM customer_changes
            WHERE _change_type IN ('insert', 'update_postimage')
        )
        WHERE rn = 1
    ) s
    ON t.customer_id = s.customer_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")

# Handle deletes separately
spark.sql("""
    DELETE FROM gold.dim_customer
    WHERE customer_id IN (
        SELECT customer_id FROM customer_changes
        WHERE _change_type = 'delete'
    )
""")

2.4 CDC Anti-Patterns

| Anti-Pattern | Problem | Solution |
|---|---|---|
| Polling-based CDC on large tables | Misses deletes, high source load | Use log-based CDC (Debezium) |
| No ordering guarantees | Out-of-order events corrupt state | Partition by entity key, process in order |
| Ignoring schema changes | Pipeline breaks on ALTER TABLE | Debezium schema registry + schema evolution |
| Full snapshots as "CDC" | Wasteful, slow, misses interim changes | True log-based CDC |
| No dead letter queue | Poison messages block pipeline | Route failures to DLQ for investigation |
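The dead-letter-queue point deserves a sketch: failures are quarantined with enough context to investigate instead of halting everything behind them. The in-memory lists below stand in for an Event Hubs consumer and a DLQ topic.

```python
import json

# Sketch of DLQ routing: a poison message is set aside with error context
# rather than blocking the messages behind it.
def process_batch(raw_events, handle):
    processed, dead_letters = [], []
    for raw in raw_events:
        try:
            processed.append(handle(json.loads(raw)))
        except Exception as exc:
            # Route the failure aside with enough context to investigate.
            dead_letters.append({"payload": raw, "error": repr(exc)})
    return processed, dead_letters

ok, dlq = process_batch(
    ['{"id": 1}', "not json", '{"id": 2}'],
    handle=lambda e: e["id"],
)
print(ok)        # [1, 2]
print(len(dlq))  # 1
```

A production version would publish `dead_letters` to a separate topic or storage path and alert when its rate crosses a threshold.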

3. API-First Data Products

3.1 Data Product API Architecture

Data products exposed via APIs enable real-time access to curated lakehouse data
for operational systems.

  API-FIRST DATA PRODUCT ARCHITECTURE

  ┌─────────────────────────────────────────────────────┐
  │                  DATA PRODUCT                        │
  │  ┌──────────┐  ┌──────────┐  ┌──────────────────┐  │
  │  │ Gold     │  │ Caching  │  │ API Layer        │  │
  │  │ Delta    │──│ Layer    │──│ (REST / GraphQL) │  │
  │  │ Tables   │  │(Redis /  │  │                  │  │
  │  │          │  │ Cosmos)  │  │ - Rate limiting  │  │
  │  └──────────┘  └──────────┘  │ - Auth (OAuth2)  │  │
  │                              │ - Versioning     │  │
  │                              │ - Schema docs    │  │
  │  ┌──────────────────────┐   │ - SLA monitoring  │  │
  │  │ Contract / Schema    │   └──────────────────┘  │
  │  │ (OpenAPI / AsyncAPI) │                          │
  │  └──────────────────────┘                          │
  └─────────────────────────────────────────────────────┘
           │              │              │
           ▼              ▼              ▼
      ┌────────┐    ┌────────┐    ┌────────┐
      │Web Apps│    │Mobile  │    │ML      │
      │        │    │Apps    │    │Services│
      └────────┘    └────────┘    └────────┘

3.2 API vs Delta Sharing vs Direct Query

| Access Pattern | API (REST/GraphQL) | Delta Sharing | Direct SQL (JDBC/ODBC) |
|---|---|---|---|
| Latency | Milliseconds (cached) | Seconds-minutes | Seconds |
| Data volume | Small (rows/entities) | Large (tables/partitions) | Medium |
| Consumer type | Applications | Data platforms | BI tools, notebooks |
| Auth model | OAuth2 / API keys | Bearer tokens | Service principal |
| Schema contract | OpenAPI / GraphQL schema | Delta schema | SQL schema |
| Rate limiting | Yes (API gateway) | Provider-controlled | Connection pool |
| Best for | Operational reads | Cross-org analytics sharing | Ad-hoc analysis |

3.3 API Implementation Pattern on Azure

  ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐
  │ Gold     │───►│ Azure    │───►│ API      │───►│ Consumer │
  │ Delta    │    │ Cosmos DB│    │ Mgmt     │    │ App      │
  │ Table    │    │ (cache)  │    │ (APIM)   │    │          │
  └──────────┘    └──────────┘    └──────────┘    └──────────┘
       │                │               │               │
   Structured      Cosmos Change     OAuth2 +       OpenAPI
   Streaming       Feed for          Rate limit     contract
   to Cosmos       invalidation      + logging      + SDK

Key design decisions:

  • Caching layer: Use Cosmos DB or Redis for sub-10ms reads
  • Sync strategy: Structured Streaming from Delta → Cosmos for near real-time sync
  • Cache invalidation: Cosmos Change Feed or Delta CDF triggers cache refresh
  • API gateway: Azure API Management for auth, throttling, versioning, analytics
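The caching decision follows the classic cache-aside (read-through) pattern. A sketch with a dict and TTL standing in for Redis/Cosmos; `fetch_from_gold` is a hypothetical loader, not a real API:

```python
import time

# Cache-aside read path: check the cache, fall back to the serving store,
# then populate the cache with a TTL. The dict stands in for Redis/Cosmos.
CACHE: dict = {}
TTL_SECONDS = 60

def fetch_from_gold(customer_id: str) -> dict:
    # Hypothetical loader; in practice a query against the serving store
    # kept in sync by Structured Streaming from the gold Delta table.
    return {"customer_id": customer_id, "tier": "gold"}

def get_customer(customer_id: str) -> dict:
    entry = CACHE.get(customer_id)
    if entry and entry["expires_at"] > time.monotonic():
        return entry["value"]                  # cache hit: sub-10ms path
    value = fetch_from_gold(customer_id)       # cache miss: read through
    CACHE[customer_id] = {"value": value,
                          "expires_at": time.monotonic() + TTL_SECONDS}
    return value

print(get_customer("c-42"))   # miss: loads and caches
print(get_customer("c-42"))   # hit: served from cache
```

The TTL here is a fallback; the event-driven invalidation described above (Cosmos Change Feed or Delta CDF) keeps entries fresh without waiting for expiry.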

4. Delta Sharing

4.1 Delta Sharing Architecture

Delta Sharing is an open protocol for secure data sharing across organizations
without copying data.

  DELTA SHARING — OPEN PROTOCOL

  PROVIDER SIDE                           RECIPIENT SIDE
  ┌─────────────────────┐               ┌─────────────────────┐
  │ Unity Catalog        │               │ Any Platform         │
  │                      │               │                      │
  │ ┌─────────────┐     │   HTTPS/REST  │ ┌─────────────┐     │
  │ │ Delta Tables│     │◄─────────────►│ │ Delta       │     │
  │ │ (ADLS Gen2) │     │               │ │ Sharing     │     │
  │ └─────────────┘     │   Pre-signed  │ │ Client      │     │
  │                      │   URLs        │ └─────────────┘     │
  │ ┌─────────────┐     │               │                      │
  │ │ Sharing     │     │               │ Supported:           │
  │ │ Server      │     │               │ - Databricks         │
  │ │ (built-in)  │     │               │ - Spark (OSS)        │
  │ └─────────────┘     │               │ - pandas             │
  │                      │               │ - Power BI           │
  │ Access control:      │               │ - Tableau            │
  │ - Recipient mgmt    │               │ - dbt                │
  │ - Token rotation    │               │ - Any REST client    │
  │ - Audit logging     │               │                      │
  └─────────────────────┘               └─────────────────────┘

4.2 Setting Up Delta Sharing

-- PROVIDER: Create a share
CREATE SHARE IF NOT EXISTS customer_analytics;

-- Add tables to the share
ALTER SHARE customer_analytics
ADD TABLE gold.dim_customer
  COMMENT 'Customer dimension - refreshed daily'
  PARTITION (region = 'EU');

ALTER SHARE customer_analytics
ADD TABLE gold.monthly_revenue
  COMMENT 'Monthly revenue aggregates';

-- Create recipient
CREATE RECIPIENT acme_corp
  USING ID 'acme-databricks-workspace-id';

-- Or for non-Databricks recipients, generate activation link
CREATE RECIPIENT external_partner;
-- Download credentials file for the recipient

-- Grant access
GRANT SELECT ON SHARE customer_analytics TO RECIPIENT acme_corp;

-- Audit: who accessed what
SELECT * FROM system.access.audit
WHERE action_name = 'deltaSharingQuery'
ORDER BY event_time DESC;
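On the recipient side, the open-source `delta-sharing` Python client addresses a shared table as `<profile-file>#<share>.<schema>.<table>`. A sketch using the share from the example above; the schema name and profile path are assumptions, and `load_as_pandas` needs the credentials file downloaded from the activation link:

```python
# Recipient-side sketch using the open-source delta-sharing Python client
# (pip install delta-sharing). A shared table is addressed as
# "<profile-file>#<share>.<schema>.<table>".

def table_url(profile_path: str, share: str, schema: str, table: str) -> str:
    """Build the delta-sharing table URL from a credentials profile."""
    return f"{profile_path}#{share}.{schema}.{table}"

def read_shared_table(profile_path: str):
    # Requires the credentials file from the provider's activation link.
    import delta_sharing
    url = table_url(profile_path, "customer_analytics", "gold", "dim_customer")
    return delta_sharing.load_as_pandas(url)

print(table_url("config.share", "customer_analytics", "gold", "dim_customer"))
# config.share#customer_analytics.gold.dim_customer
```

Because the client fetches data through pre-signed URLs, the recipient never needs credentials to the provider's storage account.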

4.3 Delta Sharing vs Alternatives

| Feature | Delta Sharing | Snowflake Data Sharing | S3/ADLS Direct Access | API |
|---|---|---|---|---|
| Protocol | Open standard | Proprietary | Cloud IAM | HTTP |
| Data copy | None (pre-signed URLs) | None (account refs) | None | Serialized |
| Cross-platform | Yes (any client) | Snowflake only | Same cloud | Any |
| Cross-cloud | Yes | Yes (with replication) | No | Yes |
| Governance | Unity Catalog | Snowflake governance | Manual IAM | Custom |
| Cost to recipient | Egress only | Compute only | Egress only | Compute + egress |
| Live tables | Yes (CDF support) | Yes | No (snapshots) | Yes |

5. Reverse ETL

5.1 The Reverse ETL Pattern

Reverse ETL pushes curated lakehouse data back into operational systems (CRM, marketing
platforms, support tools) to activate analytics insights.

  TRADITIONAL ETL vs REVERSE ETL

  TRADITIONAL ETL:
  ┌──────────┐    ┌──────────┐    ┌──────────┐
  │Salesforce│───►│ Lakehouse│───►│ Dashboard│
  │HubSpot   │    │ (Gold)   │    │ (BI)     │
  │Zendesk   │    │          │    │          │
  └──────────┘    └──────────┘    └──────────┘

  REVERSE ETL:
  ┌──────────┐    ┌──────────┐    ┌──────────┐
  │Salesforce│◄───│ Lakehouse│◄───│Raw Sources│
  │HubSpot   │    │ (Gold)   │    │          │
  │Zendesk   │    │          │    │          │
  └──────────┘    └──────────┘    └──────────┘
        ▲               │
        │               │
     Enriched        ML scores,
     profiles,       segments,
     lead scores     predictions
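A minimal reverse-ETL sync loop reads a gold segment, batches it, and pushes each batch to the operational API with retries. `post_to_crm` is a hypothetical stand-in for the destination SDK call (e.g., a Salesforce bulk upsert); names and batch sizes are illustrative.

```python
import time

# Reverse-ETL push sketch: batch gold-table rows into an operational API.
# post_to_crm stands in for the real destination SDK call.

def post_to_crm(batch):
    """Pretend API call; a real one would POST and raise on 429/5xx."""
    return {"updated": len(batch)}

def sync_segment(rows, batch_size=200, max_retries=3):
    total = 0
    for i in range(0, len(rows), batch_size):
        batch = rows[i:i + batch_size]
        for attempt in range(max_retries):
            try:
                total += post_to_crm(batch)["updated"]
                break
            except Exception:
                if attempt == max_retries - 1:
                    raise                    # exhausted retries: surface it
                time.sleep(2 ** attempt)     # exponential backoff
    return total

rows = [{"customer_id": n, "churn_score": 0.1 * n} for n in range(450)]
print(sync_segment(rows))   # 450
```

Batching and backoff matter here because CRM and marketing APIs are rate-limited; the lakehouse can easily produce rows faster than the destination will accept them.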

5.2 Implementation Approaches

... [content trimmed for length — full version in the complete kit]


This is 1 of 5 resources in the Datanest Academy toolkit. Get the complete [Data Lakehouse Architecture on Azure] with all files, templates, and documentation for $199.

Get the Full Kit →

Or grab the entire Datanest Academy bundle (5 products) for $XXX — save XX%.

Get the Complete Bundle →

