Module 5: Integration Patterns
Learning Objectives
By the end of this module, you will be able to:
- Select the right integration pattern (batch, streaming, event-driven) based on latency, throughput, and cost requirements
- Design Change Data Capture (CDC) pipelines using Debezium, Azure CDC, and Delta Lake CDF
- Architect API-first data products that expose curated datasets to consumers
- Implement Delta Sharing for cross-organization data exchange without data duplication
- Design reverse ETL pipelines to push lakehouse data back into operational systems
- Plan a data marketplace that enables self-service data discovery and consumption
1. Integration Pattern Selection
1.1 The Integration Spectrum
Data integration is not a binary batch-or-streaming decision. Modern architectures
place workloads along a latency continuum.
INTEGRATION LATENCY SPECTRUM
◄─────────────────────────────────────────────────────────────────►
  BATCH            MICRO-BATCH        STREAMING        EVENT-DRIVEN
  (hours-days)     (seconds-minutes)  (milliseconds)   (sub-ms)
┌─────────────┐  ┌──────────────┐  ┌─────────────┐  ┌─────────────┐
│ ADF         │  │ Spark        │  │ Kafka       │  │ Event Grid  │
│ Pipelines   │  │ Structured   │  │ Streams     │  │ Functions   │
│ COPY INTO   │  │ Streaming    │  │ Flink       │  │ Logic Apps  │
│ Sqoop       │  │ Auto Loader  │  │ Event Hubs  │  │             │
└─────────────┘  └──────────────┘  └─────────────┘  └─────────────┘
Cost: $          Cost: $$          Cost: $$$         Cost: $$$$
Complexity: Low  Complexity: Med   Complexity: High  Complexity: High
Freshness: Hours Freshness: Mins   Freshness: Secs   Freshness: Real-time
1.2 Decision Framework
Use the following matrix to select the right integration pattern:
| Criterion | Batch | Micro-Batch | Streaming | Event-Driven |
|---|---|---|---|---|
| Latency requirement | > 1 hour acceptable | 1-15 minutes | < 1 minute | Sub-second |
| Data volume | Very large (TB+) | Medium-large | Medium | Small-medium per event |
| Source type | Files, DB snapshots | DB logs, files | Message queues | Webhooks, triggers |
| Schema stability | High | Medium | Medium-low | Low |
| Cost sensitivity | High (optimize cost) | Medium | Low (optimize latency) | Low |
| Ordering guarantees | Not critical | Partition-level | Partition-level | Per-entity |
| Exactly-once need | Easy to achieve | Checkpointed | Requires design | Idempotency required |
| Typical Azure services | ADF, COPY INTO | Structured Streaming | Event Hubs + Spark | Event Grid + Functions |
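The matrix can be condensed into a small helper for first-pass triage. This is only a sketch of the table's thresholds — the function name and cut-offs are illustrative, not a standard API:

```python
def choose_integration_pattern(max_latency_s: float,
                               daily_volume_gb: float,
                               cost_sensitive: bool) -> str:
    """Illustrative encoding of the decision matrix above.

    Thresholds mirror the table: batch tolerates > 1 hour,
    micro-batch 1-15 minutes, streaming < 1 minute,
    event-driven sub-second.
    """
    if max_latency_s < 1:
        return "event-driven"      # Event Grid + Functions
    if max_latency_s < 60:
        return "streaming"         # Event Hubs + Spark
    if max_latency_s <= 15 * 60:
        return "micro-batch"       # Structured Streaming / Auto Loader
    # Latency is relaxed: prefer batch when volume is large or cost
    # pressure is high, since batch is the cheapest to operate.
    if daily_volume_gb >= 1000 or cost_sensitive:
        return "batch"             # ADF, COPY INTO
    return "micro-batch"
```

For a nightly terabyte-scale load with a cost-conscious team, `choose_integration_pattern(4 * 3600, 2000, True)` lands on `"batch"`, matching the matrix's first column.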
1.3 Hybrid Pattern: Lambda vs Kappa
Most real architectures use hybrid patterns. Here are two foundational approaches:
LAMBDA ARCHITECTURE
      ┌──────────────┐
      │ Data Sources │
      └──────┬───────┘
        ┌────┴────┐
        ▼         ▼
   ┌─────────┐ ┌─────────┐
   │ Batch   │ │ Speed   │
   │ Layer   │ │ Layer   │
   │ (ADF)   │ │ (Strm.) │
   └────┬────┘ └────┬────┘
        ▼           ▼
   ┌────────────────────┐
   │   Serving Layer    │
   │   (Delta Tables)   │
   └────────────────────┘
✓ Accuracy via batch reprocessing
✗ Two codepaths = double maintenance
KAPPA ARCHITECTURE
      ┌──────────────┐
      │ Data Sources │
      └──────┬───────┘
             │
             ▼
    ┌─────────────────┐
    │ Unified Stream  │
    │ Processing      │
    │ (Structured     │
    │  Streaming)     │
    └────────┬────────┘
             │
             ▼
    ┌─────────────────┐
    │  Delta Tables   │
    │   (Serving)     │
    └─────────────────┘
✓ Single codepath
✓ Simpler operations
✗ Reprocessing is expensive
✗ Complex aggregations harder
Recommendation for Databricks: Prefer Kappa with Delta Lake. Structured Streaming
with Delta as both sink and source eliminates the need for separate batch and speed
layers. Use Delta time travel for reprocessing when needed.
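The "single codepath" claim can be illustrated with a toy, Spark-free simulation: one processing function serves both the live run and replays from an earlier version of the log, which is the role Delta time travel and the `startingVersion` option play in a real pipeline. The log and processor below are deliberately simplified stand-ins, not Spark APIs:

```python
# Toy stand-in for a Delta table as a versioned, replayable log.
# Each commit is (version, records); Structured Streaming's
# startingVersion option plays the same role as `from_version` here.
log = [
    (1, [("acme", 100)]),
    (2, [("acme", 250), ("globex", 75)]),
    (3, [("globex", 90)]),
]

def process(records, state):
    """Single codepath: the same logic serves live and replay runs."""
    for key, amount in records:
        state[key] = amount          # last-value-wins upsert
    return state

def run(from_version=1):
    """Replay the log from a version -- Kappa-style reprocessing."""
    state = {}
    for version, records in log:
        if version >= from_version:
            process(records, state)
    return state

live = run()                  # full run over all commits
repro = run(from_version=2)   # reprocess only recent commits
```

Because `process` is deterministic and upserts are last-value-wins, replaying from version 2 converges to the same state as the full run — the property that lets Kappa drop the separate batch layer.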
2. Change Data Capture (CDC) Patterns
2.1 CDC Architecture Overview
CDC captures row-level changes (INSERT, UPDATE, DELETE) from source databases and
propagates them to the lakehouse in near real-time.
CDC PIPELINE ARCHITECTURE
┌───────────┐   ┌─────────────┐   ┌───────────┐   ┌───────────┐
│ Source DB │──►│ CDC Tool    │──►│ Message   │──►│ Lakehouse │
│ (SQL      │   │ (Debezium,  │   │ Queue     │   │ (Delta    │
│  Server,  │   │  Azure CDC, │   │ (Kafka /  │   │  Tables)  │
│  Postgres,│   │  Oracle     │   │  Event    │   │           │
│  MySQL)   │   │  GoldenGate)│   │  Hubs)    │   │           │
└───────────┘   └─────────────┘   └───────────┘   └───────────┘
      │                │                │               │
 Transaction     Log read or       JSON/Avro       MERGE INTO
 log (WAL)       query-based CDC   events          Delta tables
2.2 CDC Tool Comparison
| Feature | Debezium | Azure SQL CDC | Oracle GoldenGate | Fivetran / Airbyte |
|---|---|---|---|---|
| Source support | Broad (Postgres, MySQL, SQL Server, Oracle, MongoDB) | SQL Server, Azure SQL | Oracle, SQL Server | 300+ connectors |
| Capture method | Log-based (WAL) | Polling + CT tables | Log-based (redo log) | Mixed (log + query) |
| Latency | Seconds | Minutes | Seconds | Minutes-hours |
| Schema evolution | Automatic detection | Manual | Manual | Automatic |
| Managed service | Self-hosted or Confluent Cloud | Built into Azure SQL | Oracle Cloud | SaaS |
| Cost model | Open source + infra | Free (SQL feature) | License per CPU | Per row synced |
| Delete detection | Native | Via tombstone | Native | Depends on connector |
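For log-based tools in the Debezium family, each change arrives as an envelope carrying `op`, `before`, and `after` fields. A minimal sketch of normalizing such an envelope before it is merged into Delta — note that real payloads may additionally be wrapped in a `payload` object depending on converter settings, so treat this as illustrative:

```python
import json

def to_change(event_json: str) -> dict:
    """Map a Debezium-style envelope to a lakehouse-friendly change row.

    Assumes the common envelope fields `op`, `before`, `after`:
    op 'c' = create, 'u' = update, 'd' = delete, 'r' = snapshot read.
    """
    e = json.loads(event_json)
    op = e["op"]
    if op == "d":
        # Deletes carry the last row image in `before`; `after` is null
        return {"change_type": "delete", "row": e["before"]}
    change_type = {"c": "insert", "r": "insert", "u": "update"}[op]
    return {"change_type": change_type, "row": e["after"]}

event = ('{"op": "u", "before": {"id": 7, "tier": "free"}, '
         '"after": {"id": 7, "tier": "pro"}}')
change = to_change(event)
```

This is also where delete detection (last row of the comparison table) becomes concrete: a query-based tool simply never emits `op = 'd'`, which is why log-based capture is preferred when deletes matter.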
2.3 Implementing CDC with Delta Lake
Delta Lake's Change Data Feed (CDF) complements source CDC by tracking changes
within the lakehouse itself.
# Enable CDF on a Delta table
spark.sql("""
ALTER TABLE silver.customers
SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
""")
# Read changes since a specific version
changes_df = (
spark.read
.format("delta")
.option("readChangeFeed", "true")
.option("startingVersion", 5)
.option("endingVersion", 10)
.table("silver.customers")
)
# CDF adds these metadata columns:
# _change_type: insert, update_preimage, update_postimage, delete
# _commit_version: Delta version number
# _commit_timestamp: When the change was committed
# Apply changes to the gold table using MERGE.
# Keep only the LATEST change per key: a key may change several times
# within the version range, and MERGE allows at most one source row
# per target row. Also drop the CDF metadata columns so that
# UPDATE SET * / INSERT * match the target schema.
changes_df.createOrReplaceTempView("customer_changes")
spark.sql("""
MERGE INTO gold.dim_customer t
USING (
    SELECT * EXCEPT (_change_type, _commit_version, _commit_timestamp, rn)
    FROM (
        SELECT *, ROW_NUMBER() OVER (
            PARTITION BY customer_id ORDER BY _commit_version DESC
        ) AS rn
        FROM customer_changes
        WHERE _change_type IN ('insert', 'update_postimage')
    )
    WHERE rn = 1
) s
ON t.customer_id = s.customer_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
""")
# Handle deletes separately -- remove only keys whose LATEST change
# is a delete, so a delete followed by a re-insert within the same
# version range is not dropped by mistake
spark.sql("""
DELETE FROM gold.dim_customer
WHERE customer_id IN (
    SELECT customer_id FROM (
        SELECT customer_id, _change_type, ROW_NUMBER() OVER (
            PARTITION BY customer_id ORDER BY _commit_version DESC
        ) AS rn
        FROM customer_changes
    )
    WHERE rn = 1 AND _change_type = 'delete'
)
""")
2.4 CDC Anti-Patterns
| Anti-Pattern | Problem | Solution |
|---|---|---|
| Polling-based CDC on large tables | Misses deletes, high source load | Use log-based CDC (Debezium) |
| No ordering guarantees | Out-of-order events corrupt state | Partition by entity key, process in order |
| Ignoring schema changes | Pipeline breaks on ALTER TABLE | Debezium schema registry + schema evolution |
| Full snapshots as "CDC" | Wasteful, slow, misses interim changes | True log-based CDC |
| No dead letter queue | Poison messages block pipeline | Route failures to DLQ for investigation |
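The dead letter queue row deserves a concrete shape. A minimal sketch of the routing logic — in a real pipeline `handler` would be your parse/merge step inside `foreachBatch`, and `dead_letters` would land in a DLQ topic or Delta table rather than a Python list:

```python
import json

def process_batch(messages, handler):
    """Route poison messages to a dead letter queue instead of
    letting one bad payload fail the whole batch."""
    processed, dead_letters = [], []
    for msg in messages:
        try:
            processed.append(handler(msg))
        except Exception as exc:
            # Keep the offending payload and the reason for triage
            dead_letters.append({"message": msg, "error": str(exc)})
    return processed, dead_letters

# Example: a poison message no longer blocks the good ones
ok, dlq = process_batch(['{"id": 1}', 'not json', '{"id": 2}'],
                        lambda m: json.loads(m)["id"])
```

The design choice here is batch-level progress with per-message isolation: throughput is preserved, and the DLQ retains enough context (payload plus error) for offline investigation and replay.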
3. API-First Data Products
3.1 Data Product API Architecture
Data products exposed via APIs enable real-time access to curated lakehouse data
for operational systems.
API-FIRST DATA PRODUCT ARCHITECTURE
┌─────────────────────────────────────────────────────┐
│ DATA PRODUCT │
│ ┌──────────┐ ┌──────────┐ ┌──────────────────┐ │
│ │ Gold │ │ Caching │ │ API Layer │ │
│ │ Delta │──│ Layer │──│ (REST / GraphQL) │ │
│ │ Tables │ │(Redis / │ │ │ │
│ │ │ │ Cosmos) │ │ - Rate limiting │ │
│ └──────────┘ └──────────┘ │ - Auth (OAuth2) │ │
│ │ - Versioning │ │
│ │ - Schema docs │ │
│ ┌──────────────────────┐ │ - SLA monitoring │ │
│ │ Contract / Schema │ └──────────────────┘ │
│ │ (OpenAPI / AsyncAPI) │ │
│ └──────────────────────┘ │
└─────────────────────────────────────────────────────┘
│ │ │
▼ ▼ ▼
┌────────┐ ┌────────┐ ┌────────┐
│Web Apps│ │Mobile │ │ML │
│ │ │Apps │ │Services│
└────────┘ └────────┘ └────────┘
3.2 API vs Delta Sharing vs Direct Query
| Access Pattern | API (REST/GraphQL) | Delta Sharing | Direct SQL (JDBC/ODBC) |
|---|---|---|---|
| Latency | Milliseconds (cached) | Seconds-minutes | Seconds |
| Data volume | Small (rows/entities) | Large (tables/partitions) | Medium |
| Consumer type | Applications | Data platforms | BI tools, notebooks |
| Auth model | OAuth2 / API keys | Bearer tokens | Service principal |
| Schema contract | OpenAPI / GraphQL schema | Delta schema | SQL schema |
| Rate limiting | Yes (API gateway) | Provider-controlled | Connection pool |
| Best for | Operational reads | Cross-org analytics sharing | Ad-hoc analysis |
3.3 API Implementation Pattern on Azure
┌──────────┐   ┌───────────┐   ┌──────────┐   ┌──────────┐
│ Gold     │──►│ Azure     │──►│ API      │──►│ Consumer │
│ Delta    │   │ Cosmos DB │   │ Mgmt     │   │ App      │
│ Table    │   │ (cache)   │   │ (APIM)   │   │          │
└──────────┘   └───────────┘   └──────────┘   └──────────┘
      │              │               │              │
 Structured     Cosmos Change   OAuth2 +       OpenAPI
 Streaming      Feed for cache  rate limit     contract
 to Cosmos      invalidation    + logging      + SDK
Key design decisions:
- Caching layer: Use Cosmos DB or Redis for sub-10ms reads
- Sync strategy: Structured Streaming from Delta → Cosmos for near real-time sync
- Cache invalidation: Cosmos Change Feed or Delta CDF triggers cache refresh
- API gateway: Azure API Management for auth, throttling, versioning, analytics
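The caching decisions above can be sketched as a read-through cache with CDF-driven invalidation. The in-memory dict is a stand-in for Cosmos DB or Redis, and the class and method names are ours, not a library API:

```python
import time

class ReadThroughCache:
    """Sketch of the caching layer: serve API reads from a fast
    store and fall back to the gold table on a miss. `invalidate`
    is what a CDF-driven refresh job would call when a row changes."""

    def __init__(self, load_from_gold, ttl_s=60):
        self._load = load_from_gold      # slow path: read from Delta
        self._ttl = ttl_s
        self._store = {}                 # key -> (value, expires_at)

    def get(self, key):
        hit = self._store.get(key)
        if hit and hit[1] > time.monotonic():
            return hit[0]                # fast path: cache hit
        value = self._load(key)
        self._store[key] = (value, time.monotonic() + self._ttl)
        return value

    def invalidate(self, key):
        """Evict a stale entry so the next read refreshes it."""
        self._store.pop(key, None)

calls = []
def load_from_gold(key):
    calls.append(key)                    # track slow-path reads
    return {"customer_id": key, "tier": "pro"}

cache = ReadThroughCache(load_from_gold, ttl_s=60)
cache.get("c1")    # miss: reads gold
cache.get("c1")    # hit: served from cache, no second read
```

The TTL is a safety net, not the primary freshness mechanism: event-driven invalidation (Delta CDF or Cosmos Change Feed) keeps the cache current, while the TTL bounds staleness if an invalidation event is missed.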
4. Delta Sharing
4.1 Delta Sharing Architecture
Delta Sharing is an open protocol for secure data sharing across organizations
without copying data.
DELTA SHARING — OPEN PROTOCOL
PROVIDER SIDE RECIPIENT SIDE
┌─────────────────────┐ ┌─────────────────────┐
│ Unity Catalog │ │ Any Platform │
│ │ │ │
│ ┌─────────────┐ │ HTTPS/REST │ ┌─────────────┐ │
│ │ Delta Tables│ │◄─────────────►│ │ Delta │ │
│ │ (ADLS Gen2) │ │ │ │ Sharing │ │
│ └─────────────┘ │ Pre-signed │ │ Client │ │
│ │ URLs │ └─────────────┘ │
│ ┌─────────────┐ │ │ │
│ │ Sharing │ │ │ Supported: │
│ │ Server │ │ │ - Databricks │
│ │ (built-in) │ │ │ - Spark (OSS) │
│ └─────────────┘ │ │ - pandas │
│ │ │ - Power BI │
│ Access control: │ │ - Tableau │
│ - Recipient mgmt │ │ - dbt │
│ - Token rotation │ │ - Any REST client │
│ - Audit logging │ │ │
└─────────────────────┘ └─────────────────────┘
4.2 Setting Up Delta Sharing
-- PROVIDER: Create a share
CREATE SHARE IF NOT EXISTS customer_analytics;
-- Add tables to the share
ALTER SHARE customer_analytics
ADD TABLE gold.dim_customer
COMMENT 'Customer dimension - refreshed daily'
PARTITION (region = 'EU');
ALTER SHARE customer_analytics
ADD TABLE gold.monthly_revenue
COMMENT 'Monthly revenue aggregates';
-- Create recipient
CREATE RECIPIENT acme_corp
USING ID 'acme-databricks-workspace-id';
-- Or for non-Databricks recipients, generate activation link
CREATE RECIPIENT external_partner;
-- Download credentials file for the recipient
-- Grant access
GRANT SELECT ON SHARE customer_analytics TO RECIPIENT acme_corp;
-- Audit: who accessed what
SELECT * FROM system.access.audit
WHERE action_name = 'deltaSharingQuery'
ORDER BY event_time DESC;
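On the recipient side, a non-Databricks consumer authenticates with a profile file downloaded from the activation link. The field names below follow the open Delta Sharing protocol; the endpoint and token values are placeholders, not real credentials:

```python
import json

# Illustrative recipient-side profile (commonly saved as config.share)
profile = {
    "shareCredentialsVersion": 1,
    "endpoint": "https://sharing.example.com/delta-sharing/",
    "bearerToken": "<token-from-activation-link>",
}
with open("config.share", "w") as f:
    json.dump(profile, f)

# With the open-source `delta-sharing` client, tables are addressed
# as <profile>#<share>.<schema>.<table> (requires network access):
#   import delta_sharing
#   df = delta_sharing.load_as_pandas(
#       "config.share#customer_analytics.gold.dim_customer")
```

Note the address components map directly onto the provider-side SQL above: the share created with `CREATE SHARE` and the schema-qualified table added with `ALTER SHARE ... ADD TABLE`.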
4.3 Delta Sharing vs Alternatives
| Feature | Delta Sharing | Snowflake Data Sharing | S3/ADLS Direct Access | API |
|---|---|---|---|---|
| Protocol | Open standard | Proprietary | Cloud IAM | HTTP |
| Data copy | None (pre-signed URLs) | None (account refs) | None | Serialized |
| Cross-platform | Yes (any client) | Snowflake only | Same cloud | Any |
| Cross-cloud | Yes | Yes (with replication) | No | Yes |
| Governance | Unity Catalog | Snowflake governance | Manual IAM | Custom |
| Cost to recipient | Egress only | Compute only | Egress only | Compute + egress |
| Live tables | Yes (CDF support) | Yes | No (snapshots) | Yes |
5. Reverse ETL
5.1 The Reverse ETL Pattern
Reverse ETL pushes curated lakehouse data back into operational systems (CRM, marketing
platforms, support tools) to activate analytics insights.
TRADITIONAL ETL vs REVERSE ETL
TRADITIONAL ETL:
┌──────────┐ ┌──────────┐ ┌──────────┐
│Salesforce│───►│ Lakehouse│───►│ Dashboard│
│HubSpot │ │ (Gold) │ │ (BI) │
│Zendesk │ │ │ │ │
└──────────┘ └──────────┘ └──────────┘
REVERSE ETL:
┌──────────┐   ┌──────────┐   ┌───────────┐
│Salesforce│◄──│ Lakehouse│◄──│Raw Sources│
│HubSpot   │   │ (Gold)   │   │           │
│Zendesk   │   │          │   │           │
└──────────┘   └──────────┘   └───────────┘
      ▲             │
      │             │
  Enriched      ML scores,
  profiles,     segments,
  lead scores   predictions
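The pattern's key operational property is idempotency: compare the curated gold snapshot against the operational system's current state and push only the difference, so re-running the job never duplicates work. A minimal sketch of that planning step (the actual push to a vendor API, with batching and retries, is out of scope):

```python
def plan_sync(lakehouse_rows, crm_rows, key="customer_id"):
    """Diff the gold snapshot against the CRM's current state.

    Returns (upserts, deletes): rows to push because they are new
    or changed, and keys to remove because they left the snapshot.
    Re-running on unchanged inputs yields an empty plan.
    """
    crm = {r[key]: r for r in crm_rows}
    upserts = [r for r in lakehouse_rows
               if crm.get(r[key]) != r]          # new or changed rows
    lake_keys = {r[key] for r in lakehouse_rows}
    deletes = [k for k in crm if k not in lake_keys]
    return upserts, deletes

lake = [{"customer_id": 1, "segment": "vip"},
        {"customer_id": 2, "segment": "churn-risk"}]
crm = [{"customer_id": 1, "segment": "vip"},
       {"customer_id": 3, "segment": "stale"}]
upserts, deletes = plan_sync(lake, crm)
```

Diff-based planning also keeps vendor API usage (and per-row pricing, where it applies) proportional to what actually changed rather than to the full table size.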
5.2 Implementation Approaches
... [content trimmed for length — full version in the complete kit]
This is 1 of 5 resources in the Datanest Academy toolkit. Get the complete [Data Lakehouse Architecture on Azure] with all files, templates, and documentation for $199.
Or grab the entire Datanest Academy bundle (5 products) for $XXX — save XX%.