A senior engineer's guide to designing scalable, resilient data platforms
🎯 Introduction: The Modern Data Platform Challenge
In today's data-driven world, building systems that can handle real-time transactions, process massive data volumes, and deliver actionable insights isn't just a nice-to-have—it's table stakes. Whether you're architecting a fintech platform processing millions of transactions daily, an e-commerce system tracking user behavior, or a logistics network coordinating shipments, you need an architecture that can scale horizontally, recover from failures gracefully, and evolve without requiring a complete rewrite.
This article walks through the architecture of a modern data platform, explaining how Event-Driven Architecture (EDA), messaging queues, batch processing, ETL, and ELT work together to create a system that's both operationally robust and analytically powerful. We'll use a fintech transaction processing platform as our reference implementation—a domain where data integrity, real-time processing, and regulatory compliance aren't optional.
By the end, you'll understand not just what these components do individually, but how they integrate to solve real engineering problems at scale.
🏗️ The Foundation: Event-Driven Architecture (EDA)
What Is EDA and Why Does It Matter?
Event-Driven Architecture is a design paradigm where system components communicate by producing and consuming events—immutable facts about state changes. Unlike request-response patterns where services wait for answers, EDA enables asynchronous, decoupled communication.
Core Principles:
- Events are facts, not commands: An event like TransactionCompleted describes what happened, not what should happen next
- Temporal decoupling: Producers and consumers don't need to be online simultaneously
- Spatial decoupling: Services don't need to know about each other's existence
- Immutability: Events represent historical facts that cannot be changed
Real-World Benefits in Fintech
In a transaction processing system, EDA solves several critical problems:
Scalability: When a payment service processes a transaction, it emits an event and moves on. Downstream services (fraud detection, accounting, notifications) consume at their own pace.
Resilience: If the notification service is down, events queue up and get processed when it recovers. No transactions are lost.
Auditability: Every state change is captured as an immutable event, creating a complete audit trail—essential for regulatory compliance.
Flexibility: Adding new consumers (e.g., a new ML-based fraud detection model) doesn't require changing existing services.
Sample Event Structure
{
  "eventId": "evt_7x9k2m4n",
  "eventType": "transaction.completed",
  "timestamp": "2025-10-07T14:32:15.234Z",
  "version": "1.0",
  "data": {
    "transactionId": "txn_abc123",
    "accountId": "acc_xyz789",
    "amount": 1250.00,
    "currency": "USD",
    "merchantId": "mch_def456",
    "paymentMethod": "credit_card",
    "status": "completed",
    "metadata": {
      "ipAddress": "203.0.113.42",
      "deviceId": "dev_mobile_001",
      "location": "US-CA-SF"
    }
  }
}
This structure includes everything downstream consumers need while remaining self-contained and versioned for schema evolution.
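As a small sketch of how a downstream consumer might use that version field, dispatching on the major version lets old and new payload shapes coexist; the handler functions here are purely illustrative, not part of the platform above.
# Version-aware dispatch: old and new payload shapes are handled side by side,
# so producers can evolve the schema without breaking existing consumers.
def handle_transaction_event(event: dict) -> None:
    version = event.get("version", "1.0")
    if version.startswith("1."):
        process_v1(event["data"])    # illustrative handler for the shape shown above
    elif version.startswith("2."):
        process_v2(event["data"])    # illustrative handler for a hypothetical future shape
    else:
        raise ValueError(f"Unsupported event version: {version}")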
📮 The Nervous System: Messaging Queues
Kafka: The Event Backbone
While there are many messaging systems (RabbitMQ, AWS SQS, Google Pub/Sub), Apache Kafka has become the de facto standard for event-driven architectures at scale. Here's why:
Key Characteristics:
- Distributed commit log: Events are persisted to disk, not just held in memory
- High throughput: Handles millions of messages per second with horizontal scaling
- Replay capability: Consumers can rewind and reprocess events from any point (see the replay sketch after this list)
- Ordering guarantees: Within a partition, events maintain strict ordering
- Durability: Configurable replication ensures no data loss
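To make the replay capability concrete, here is a minimal sketch assuming the confluent-kafka Python client, a local broker, and a transactions.raw topic with three partitions (all assumptions; adjust to your cluster). It rewinds a consumer group to a point in time and reprocesses from there.
# Replay from a timestamp: ask the broker which offsets correspond to a point in
# time, seek there, and reconsume. This is the "rewind and reprocess" capability.
from confluent_kafka import Consumer, TopicPartition

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",   # assumption: local broker
    "group.id": "replay-demo",
    "auto.offset.reset": "earliest",
    "enable.auto.commit": False,
})

replay_from_ms = 1759845600000  # example timestamp in ms since epoch, purely illustrative
partitions = [TopicPartition("transactions.raw", p, replay_from_ms) for p in range(3)]
offsets = consumer.offsets_for_times(partitions, timeout=10.0)
consumer.assign(offsets)

while True:  # runs until interrupted; a real job would have a stop condition
    msg = consumer.poll(1.0)
    if msg is None:
        continue
    if msg.error():
        print(f"Consumer error: {msg.error()}")
        continue
    reprocess(msg.value())  # hypothetical reprocessing function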
Topic Design in Practice
In our fintech platform, we organize events into topics based on bounded contexts:
transactions.raw → All transaction events (unvalidated)
transactions.validated → Validated, enriched transactions
transactions.failed → Failed transactions for retry/investigation
payments.completed → Successful payment confirmations
fraud.alerts → Suspicious activity detected
accounting.journal → Double-entry bookkeeping events
Partitioning Strategy:
Partition Key: accountId
Purpose: Ensures all events for an account go to same partition
Benefit: Maintains ordering per account, enables parallel processing
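As a rough illustration of why keying by accountId preserves per-account ordering: Kafka's default partitioner hashes the key bytes (murmur2) modulo the partition count, so the same key always maps to the same partition. The hash below is only a stand-in for that behavior.
# Simplified illustration of key-based partition assignment; hashlib.md5 here is a
# stand-in for Kafka's murmur2 partitioner, used only to show determinism.
import hashlib

NUM_PARTITIONS = 12  # assumption for illustration

def partition_for(account_id: str) -> int:
    digest = hashlib.md5(account_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

# Every event for acc_xyz789 lands on the same partition, so its events stay ordered,
# while different accounts spread across partitions and can be processed in parallel.
print(partition_for("acc_xyz789"))  # deterministic for this key
print(partition_for("acc_abc123"))  # likely a different partition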
Producer Pattern
@Service
@Slf4j
@RequiredArgsConstructor
public class TransactionEventProducer {

    private final KafkaTemplate<String, TransactionEvent> kafkaTemplate;

    public void publishTransaction(Transaction txn) {
        TransactionEvent event = TransactionEvent.builder()
                .eventId(UUID.randomUUID().toString())
                .eventType("transaction.completed")
                .timestamp(Instant.now())
                .data(txn)
                .build();

        // Partition by accountId so all events for an account stay ordered.
        // ListenableFuture callback (Spring Kafka 2.x); use whenComplete on 3.x.
        kafkaTemplate.send("transactions.raw", txn.getAccountId(), event)
                .addCallback(
                        success -> log.info("Event published: {}", event.getEventId()),
                        failure -> handlePublishFailure(event, failure)
                );
    }
}
Consumer Pattern
@Service
@RequiredArgsConstructor
public class FraudDetectionConsumer {

    private final FraudEngine fraudEngine;
    private final DlqPublisher dlqPublisher;

    @KafkaListener(
        topics = "transactions.validated",
        groupId = "fraud-detection-service"
    )
    public void processTransaction(
            @Payload TransactionEvent event,
            @Header(KafkaHeaders.RECEIVED_PARTITION_ID) int partition,
            @Header(KafkaHeaders.OFFSET) long offset
    ) {
        try {
            FraudScore score = fraudEngine.analyze(event.getData());
            if (score.isHighRisk()) {
                publishFraudAlert(event, score);
            }
            // Commit offset only after successful processing
        } catch (Exception e) {
            // Dead letter queue for failed processing
            dlqPublisher.send(event, e);
        }
    }
}
🔄 ETL: Extract, Transform, Load
The Traditional Data Integration Pattern
ETL represents the classic approach to data integration: extract data from sources, transform it in-flight according to business rules, and load clean, validated data into the target system.
When ETL Shines
1. Data Quality Requirements
In fintech, you can't afford dirty data in your data warehouse. ETL validates and cleanses before loading:
def transform_transaction(raw_event):
    """
    ETL transformation logic - validate, enrich, standardize
    """
    # Extract
    txn = extract_from_kafka(raw_event)

    # Transform
    validated = {
        'transaction_id': txn['transactionId'],
        'amount': validate_amount(txn['amount']),
        'currency': standardize_currency(txn['currency']),
        'merchant_id': validate_merchant(txn['merchantId']),
        'merchant_category': enrich_merchant_category(txn['merchantId']),
        'risk_score': calculate_risk_score(txn),
        'is_international': detect_international(txn),
        'processed_at': datetime.utcnow(),
        'data_quality_flags': run_quality_checks(txn)
    }

    # Only load if validation passes
    if validated['data_quality_flags']['is_valid']:
        # Load
        load_to_warehouse(validated)
    else:
        send_to_quarantine(txn, validated['data_quality_flags'])

    return validated
2. Complex Business Logic
When transformation requires multiple data sources:
-- ETL example: Enrich transactions with customer tier and spending limits
INSERT INTO warehouse.enriched_transactions
SELECT
t.transaction_id,
t.amount,
t.currency,
c.customer_tier,
c.monthly_limit,
CASE
WHEN t.amount > c.monthly_limit * 0.8 THEN 'high_usage'
WHEN t.amount > c.monthly_limit * 0.5 THEN 'medium_usage'
ELSE 'normal'
END as usage_category,
m.merchant_name,
m.mcc_code,
mcc.category_description
FROM staging.raw_transactions t
JOIN dim.customers c ON t.account_id = c.account_id
JOIN dim.merchants m ON t.merchant_id = m.merchant_id
JOIN dim.mcc_codes mcc ON m.mcc_code = mcc.mcc_code
WHERE t.status = 'completed'
AND t.processed = false;
3. Schema Enforcement
ETL enforces schema at write time, preventing invalid data from polluting your warehouse:
# Schema validation before load
TRANSACTION_SCHEMA = {
    "transaction_id": {"type": "string", "pattern": "^txn_[a-z0-9]+$"},
    "amount": {"type": "number", "minimum": 0.01, "maximum": 1000000},
    "currency": {"type": "string", "enum": ["USD", "EUR", "GBP"]},
    "timestamp": {"type": "datetime", "format": "iso8601"}
}

def validate_and_load(event):
    if not validate_schema(event, TRANSACTION_SCHEMA):
        raise ValidationError("Schema validation failed")
    load_to_warehouse(event)
ETL Pipeline Architecture
Event Source (Kafka)
↓
Extractor
(Kafka Consumer)
↓
Transformation Layer
- Validation
- Enrichment
- Business Rules
- Data Quality Checks
↓
Loader
(Batch Insert)
↓
Data Warehouse
(Snowflake/Redshift)
↓
BI Dashboards
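To make that flow concrete, here is a minimal micro-batch sketch assuming the confluent-kafka Python client and a local broker; transform_record and bulk_insert_to_warehouse are hypothetical placeholders for the validation/enrichment and warehouse-load steps, not real APIs of this platform.
# Minimal micro-batch ETL loop: extract from Kafka, transform in flight,
# load validated rows to the warehouse in batches, then commit offsets.
import json
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",   # assumption: local broker
    "group.id": "etl-loader",
    "enable.auto.commit": False,
})
consumer.subscribe(["transactions.raw"])

BATCH_SIZE = 500
batch = []

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue

    row = transform_record(json.loads(msg.value()))   # hypothetical validate/enrich step
    if row is not None:                               # None = failed validation, quarantined elsewhere
        batch.append(row)

    if len(batch) >= BATCH_SIZE:
        bulk_insert_to_warehouse(batch)               # hypothetical batch loader
        consumer.commit(asynchronous=False)           # commit only after a successful load
        batch = []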
🔀 ELT: Extract, Load, Transform
The Modern Cloud Data Warehouse Approach
ELT flips the script: load raw data first, transform later. This approach has gained popularity with cloud data warehouses that offer massive compute power and storage.
When ELT Excels
1. Schema-on-Read Flexibility
Load everything now, decide what you need later:
-- ELT Stage 1: Load raw JSON directly
CREATE TABLE raw.transaction_events (
event_id STRING,
raw_payload VARIANT, -- JSON blob in Snowflake
ingested_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP()
);
-- ELT Stage 2: Transform when querying
CREATE VIEW analytics.transaction_summary AS
SELECT
raw_payload:data:transactionId::STRING as transaction_id,
raw_payload:data:amount::NUMBER as amount,
raw_payload:data:currency::STRING as currency,
raw_payload:data:metadata:location::STRING as location,
raw_payload:timestamp::TIMESTAMP as event_time
FROM raw.transaction_events
WHERE raw_payload:eventType::STRING = 'transaction.completed';
2. Audit Trail and Reprocessing
Keep raw data immutable, transform multiple times:
-- Original transformation (deployed week 1)
CREATE TABLE analytics.transactions_v1 AS
SELECT
transaction_id,
amount,
currency
FROM raw.transaction_events;
-- Updated business logic (deployed week 4)
CREATE TABLE analytics.transactions_v2 AS
SELECT
transaction_id,
amount,
currency,
CASE
WHEN amount > 10000 THEN 'large'
WHEN amount > 1000 THEN 'medium'
ELSE 'small'
END as transaction_size -- New logic
FROM raw.transaction_events; -- Reprocess from raw
3. Exploratory Analytics
Data scientists can explore raw data without waiting for ETL pipelines:
from pyspark.sql.functions import col, hour, dayofweek

# Data scientist exploring raw events
df = spark.read.json("s3://data-lake/transactions/raw/")

# Ad-hoc transformation for ML feature engineering
features = df.select(
    col("data.transactionId").alias("txn_id"),
    col("data.amount").alias("amount"),
    hour(col("timestamp")).alias("hour_of_day"),
    dayofweek(col("timestamp")).alias("day_of_week"),
    col("data.metadata.deviceId").alias("device")
).filter(
    col("eventType") == "transaction.completed"
)

# No ETL pipeline needed for experimentation
ELT Pipeline Architecture
Event Source (Kafka)
↓
Extractor
(Kafka Connect)
↓
Data Lake (S3)
[Raw JSON/Parquet]
↓
Cloud Warehouse
(Snowflake/BigQuery)
[Raw Schema]
↓
dbt Transformations
(SQL-based)
↓
Analytics Tables
(Curated Views)
↓
BI & ML Consumers
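For the load-raw-first step, here is a minimal sketch under stated assumptions (pandas with pyarrow and s3fs installed; the bucket path is illustrative) that lands unmodified events as date-partitioned Parquet and defers all transformation to the warehouse.
# Land raw events as date-partitioned Parquet in the lake; no validation, no business
# logic -- transformation happens later in the warehouse (the "T" in ELT).
import json
import pandas as pd

def land_raw_events(raw_messages: list[bytes],
                    lake_path: str = "s3://data-lake/transactions/raw/") -> None:
    events = [json.loads(m) for m in raw_messages]
    df = pd.DataFrame({
        "event_id": [e["eventId"] for e in events],
        "event_type": [e["eventType"] for e in events],
        "event_date": [e["timestamp"][:10] for e in events],  # partition column
        "raw_payload": [json.dumps(e) for e in events],        # keep the full JSON untouched
    })
    # Requires pyarrow + s3fs; partitioning by date enables cheap pruning later
    df.to_parquet(lake_path, engine="pyarrow", partition_cols=["event_date"], index=False)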
⚖️ ETL vs ELT: Making the Right Choice
Comparison Matrix
Dimension | ETL | ELT |
---|---|---|
Transformation Location | Before loading (in pipeline) | After loading (in warehouse) |
Data Quality | Enforced at ingestion | Enforced at read/query time |
Schema Management | Schema-on-write | Schema-on-read |
Storage Cost | Lower (only valid data stored) | Higher (all raw data stored) |
Compute Cost | Higher (dedicated pipeline infra) | Lower (leverage warehouse compute) |
Flexibility | Lower (changes require pipeline updates) | Higher (transform without reingestion) |
Time to Insight | Slower (validation delays) | Faster (load immediately) |
Audit Trail | Transformed data only | Complete raw history |
Best For | Regulated industries, strict quality needs | Exploratory analytics, rapid iteration |
Decision Framework
Choose ETL when:
- Data quality is non-negotiable (finance, healthcare)
- Transformation logic is stable and well-defined
- Reducing storage costs is critical
- Compliance requires validated data only
- Target system has limited compute (legacy warehouse)
Choose ELT when:
- Using modern cloud data warehouses
- Requirements change frequently
- Need to support exploratory analytics
- Want complete audit trail of raw data
- Data scientists need flexible access
- Time-to-insight matters more than storage cost
Hybrid Approach: The Best of Both Worlds
Most mature platforms use both:
        Real-time Events
               ↓
             Kafka
               ↓
        ┌──────┴──────┐
        ↓             ↓
       ETL           ELT
   (Critical)   (Exploratory)
        ↓             ↓
  Operational     Data Lake
   Warehouse          ↓
        ↓         Analytics
  BI Dashboards   Warehouse
                      ↓
               ML/Data Science
Example allocation:
- ETL: Financial reports, compliance dashboards, real-time fraud detection
- ELT: Customer behavior analysis, A/B testing, ML model training
🔄 Batch Processing: The Scheduled Workhorse
The Role of Batch Jobs
While event streams handle real-time data, batch processing handles periodic aggregations, reconciliations, and complex analytics that don't need to be immediate.
Common Batch Processing Patterns
1. Daily Aggregations
# Airflow DAG for daily transaction summary
from datetime import datetime
from airflow.decorators import dag, task
from pyspark.sql.functions import count, sum, avg, max, countDistinct  # shadows builtins; fine for this DAG

@dag(
    schedule_interval="0 2 * * *",  # 2 AM daily
    start_date=datetime(2025, 1, 1),
    catchup=False
)
def daily_transaction_summary():

    @task
    def extract_daily_transactions():
        """Extract yesterday's transactions"""
        return spark.read.parquet(
            f"s3://data-lake/transactions/date={yesterday}"
        )

    @task
    def transform_and_aggregate(df):
        """Calculate daily metrics"""
        return df.groupBy("account_id", "currency").agg(
            count("transaction_id").alias("txn_count"),
            sum("amount").alias("total_amount"),
            avg("amount").alias("avg_amount"),
            max("amount").alias("max_amount"),
            countDistinct("merchant_id").alias("unique_merchants")
        )

    @task
    def load_to_warehouse(summary_df):
        """Load to analytics warehouse"""
        summary_df.write.mode("overwrite").insertInto(
            "analytics.daily_transaction_summary"
        )

    # Pipeline
    raw = extract_daily_transactions()
    summary = transform_and_aggregate(raw)
    load_to_warehouse(summary)

dag = daily_transaction_summary()
2. End-of-Month Reconciliation
def monthly_reconciliation(month, year):
    """
    Reconcile transactions against bank statements
    Critical for financial accuracy
    """
    # Extract all transactions for the month
    transactions = get_transactions(month, year)

    # Extract bank statements
    bank_statements = get_bank_statements(month, year)

    # Match and identify discrepancies
    matched, unmatched = reconcile(transactions, bank_statements)

    # Generate report
    report = {
        'total_transactions': len(transactions),
        'matched': len(matched),
        'unmatched': len(unmatched),
        'discrepancy_amount': sum(u['amount'] for u in unmatched),
        'reconciliation_rate': len(matched) / len(transactions) * 100
    }

    # Alert if reconciliation rate < 99.9%
    if report['reconciliation_rate'] < 99.9:
        send_alert(report)

    return report
3. ML Model Training
@task
def train_fraud_detection_model():
    """
    Weekly batch job to retrain fraud model
    """
    # Extract last 90 days of labeled transactions (collect to pandas for training)
    df = spark.sql("""
        SELECT
            amount,
            merchant_category,
            hour_of_day,
            is_international,
            velocity_24h,
            is_fraud  -- labeled by fraud analysts
        FROM analytics.labeled_transactions
        WHERE processed_date >= current_date - 90
    """).toPandas()

    # Feature engineering
    features = prepare_features(df)

    # Train model
    model = XGBClassifier()
    model.fit(features, df['is_fraud'])

    # Evaluate on a held-out set and deploy only if better than the current model
    if model.score(X_holdout, y_holdout) > current_model.score(X_holdout, y_holdout):
        deploy_model(model, version=f"v{datetime.now().strftime('%Y%m%d')}")
🏛️ Complete System Architecture
End-to-End Data Flow
┌─────────────────────────────────────────────────────────────────┐
│ PRODUCTION SERVICES │
│ (Payment API, Account Service, Merchant Service) │
└────────────────────────┬────────────────────────────────────────┘
│ Emit Events
↓
┌─────────────────────────────────────────────────────────────────┐
│ KAFKA CLUSTER │
│ Topics: transactions.raw, transactions.validated, │
│ fraud.alerts, accounting.journal │
└──────┬──────────────────────────┬───────────────────────────────┘
│ │
│ Real-time │ Batch
↓ ↓
┌─────────────────┐ ┌─────────────────┐
│ STREAM │ │ KAFKA │
│ PROCESSORS │ │ CONNECT │
│ │ │ (CDC/S3 Sink) │
│ - Validation │ └────────┬────────┘
│ - Enrichment │ │
│ - Fraud Check │ ↓
│ - Routing │ ┌─────────────────┐
└────────┬────────┘ │ DATA LAKE │
│ │ (S3/Parquet) │
│ └────────┬────────┘
│ │
↓ ↓
┌─────────────────┐ ┌─────────────────┐
│ OPERATIONAL │ │ ANALYTICS │
│ WAREHOUSE │ │ WAREHOUSE │
│ (PostgreSQL) │ │ (Snowflake) │
│ │ │ │
│ ETL Pipeline │ │ ELT Pipeline │
│ - Validated │ │ - Raw Load │
│ - Enriched │ │ - dbt Transform │
│ - Aggregated │ │ - Star Schema │
└────────┬────────┘ └────────┬────────┘
│ │
↓ ↓
┌─────────────────────────────────────────┐
│ CONSUMPTION LAYER │
│ │
│ ┌──────────┐ ┌──────────┐ ┌────────┐│
│ │ BI Tools │ │ ML/AI │ │ APIs ││
│ │ (Tableau)│ │ Platform │ │ ││
│ └──────────┘ └──────────┘ └────────┘│
└─────────────────────────────────────────┘
┌──────────────────────┐
│ BATCH PROCESSING │
│ (Airflow/Spark) │
│ │
│ - Daily Aggregations │
│ - Reconciliation │
│ - Model Training │
└──────────────────────┘
Data Flow Example: Transaction Journey
Let's trace a single transaction through the entire system:
T+0ms: Transaction Created
POST /api/v1/transactions
{
"accountId": "acc_xyz789",
"amount": 1250.00,
"currency": "USD",
"merchantId": "mch_def456"
}
T+5ms: Event Published to Kafka
{
"eventType": "transaction.completed",
"data": { /* transaction details */ }
}
// Published to: transactions.raw
T+10ms: Stream Processor Validates
# Stream processor (illustrative pseudocode; Kafka Streams itself is a Java library)
validated_txn = validate_transaction(raw_event)
if validated_txn.is_valid:
    publish("transactions.validated", validated_txn)
else:
    publish("transactions.failed", raw_event)
T+50ms: Multiple Consumers React
- Fraud Detection: Analyzes and scores transaction
- Accounting: Creates double-entry journal entries
- Notification: Sends confirmation to customer
- Analytics: Writes to operational warehouse (ETL)
T+2min: Loaded to Data Lake
Kafka Connect → S3 Sink
s3://data-lake/transactions/date=2025-10-07/hour=14/
part-00001.parquet (raw ELT)
T+1hr: Batch Aggregation
-- Hourly rollup job
INSERT INTO analytics.hourly_transaction_summary
SELECT
date_trunc('hour', event_time) as hour,
currency,
count(*) as txn_count,
sum(amount) as total_volume
FROM raw.transaction_events
WHERE event_time >= current_hour
GROUP BY 1, 2;
T+24hrs: Overnight Processing
1. Daily reconciliation batch job
2. Fraud model retraining
3. Customer spending report generation
4. Data quality checks
5. Compliance report export
🎯 Engineering Best Practices
1. Idempotency: The Golden Rule
Every event consumer must be idempotent—processing the same event multiple times produces the same result:
@Transactional
public void processPayment(TransactionEvent event) {
    // Check if already processed
    if (paymentRepository.existsByEventId(event.getEventId())) {
        log.info("Event {} already processed, skipping", event.getEventId());
        return;
    }

    // Process and record event ID
    Payment payment = createPayment(event);
    paymentRepository.save(payment);

    // Store event ID to prevent reprocessing
    processedEventRepository.save(
        new ProcessedEvent(event.getEventId(), Instant.now())
    );
}
2. Schema Registry: Contract-First Development
Use a schema registry (like Confluent Schema Registry) to enforce contracts:
{
  "type": "record",
  "name": "TransactionEvent",
  "namespace": "com.fintech.events",
  "version": "1.0",
  "fields": [
    {"name": "eventId", "type": "string"},
    {"name": "eventType", "type": "string"},
    {"name": "timestamp", "type": {"type": "long", "logicalType": "timestamp-millis"}},
    {"name": "data", "type": {
      "type": "record",
      "name": "TransactionData",
      "fields": [
        {"name": "transactionId", "type": "string"},
        {"name": "amount", "type": "double"},
        {"name": "currency", "type": "string"}
      ]
    }}
  ]
}
3. Dead Letter Queues: Graceful Failure Handling
@KafkaListener(topics = "transactions.validated")
public void processTransaction(TransactionEvent event) {
    try {
        businessLogic.process(event);
    } catch (RetryableException e) {
        // Temporary failure - will retry
        throw e;
    } catch (NonRetryableException e) {
        // Permanent failure - send to DLQ
        dlqPublisher.send("transactions.dlq", event, e);
        log.error("Event {} moved to DLQ: {}", event.getEventId(), e.getMessage());
    }
}
4. Observability: Know Your System
Key Metrics to Track:
- Event lag: Time between production and consumption
- Processing rate: Events per second per consumer
- Error rate: Failed events / total events
- DLQ size: Events requiring manual intervention
- Pipeline latency: End-to-end time for data flows
- Data quality: Percentage of records passing validation
Implementation:
@Timed(value = "transaction.processing.time")
@Counted(value = "transaction.processed")
public void processTransaction(TransactionEvent event) {
    Instant start = Instant.now();
    try {
        doProcessing(event);
        meterRegistry.counter("transaction.success").increment();
    } catch (Exception e) {
        meterRegistry.counter("transaction.failure").increment();
        throw e;
    } finally {
        long duration = Duration.between(start, Instant.now()).toMillis();
        meterRegistry.timer("transaction.latency").record(duration, TimeUnit.MILLISECONDS);
    }
}
5. Data Quality Gates
from pyspark.sql.functions import col, max as spark_max

def validate_data_quality(df):
    """
    Data quality checks before promoting to production tables
    """
    checks = {
        'null_check': df.filter(col("transaction_id").isNull()).count() == 0,
        'duplicate_check': df.count() == df.select("transaction_id").distinct().count(),
        'range_check': df.filter(col("amount") <= 0).count() == 0,
        'referential_integrity': check_foreign_keys(df),
        'freshness_check': df.agg(spark_max("timestamp")).collect()[0][0] >= yesterday
    }

    if not all(checks.values()):
        raise DataQualityException(f"Quality checks failed: {checks}")

    return df
6. Backward Compatibility
When evolving schemas:
✅ DO: Add optional fields (see the tolerant-reader sketch after this list)
✅ DO: Deprecate fields gradually
✅ DO: Maintain version numbers
❌ DON'T: Remove required fields
❌ DON'T: Change field types
❌ DON'T: Break existing consumers
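In practice, "add optional fields" means consumers read new fields with a default so that older events keep working. A small illustrative sketch (the channel field and its default are assumptions, not part of the event schema above):
# Tolerant reader: a consumer written against schema v1.1 still accepts v1.0 events
# by falling back to a default for the field that v1.0 does not carry.
def read_transaction(event: dict) -> dict:
    data = event["data"]
    return {
        "transaction_id": data["transactionId"],   # required in every version
        "amount": data["amount"],
        "currency": data["currency"],
        # "channel" is a hypothetical optional field added in v1.1; older events
        # simply do not have it, so default rather than fail.
        "channel": data.get("channel", "unknown"),
    }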
7. Cost Optimization
# Partition pruning in the data lake
df = spark.read.parquet("s3://data-lake/transactions/") \
    .filter("date >= '2025-10-01'")  # Only scan October partitions

# Compression and partitioning for storage efficiency
df.write \
    .option("compression", "snappy") \
    .partitionBy("date", "currency") \
    .parquet("s3://data-lake/transactions/")

And in the warehouse, materialized views for frequent queries:

CREATE MATERIALIZED VIEW daily_summary AS
SELECT date, currency, sum(amount) AS total_amount
FROM transactions
GROUP BY 1, 2;
📊 Benefits of This Architecture
1. Scalability
Each component scales independently. High transaction volume? Add Kafka partitions and consumers. Complex analytics? Scale warehouse compute.
2. Resilience
Failed services don't lose data—events queue up. Failed pipelines can replay from any point. Multiple replicas ensure availability.
3. Flexibility
New consumers plug in without changing producers. New transformations don't require data reingestion. Business logic changes quickly.
4. Auditability
Immutable event log provides complete history. Every state change is traceable. Compliance requirements are built-in.
5. Cost Efficiency
Pay only for what you use. Cloud warehouses scale compute independently of storage. Batch jobs run on spot instances.
6. Time to Insight
Real-time streams for operational decisions. Batch jobs for deep analytics. Self-service analytics without engineering bottlenecks.
🚀 Conclusion: Building for the Future
Modern data systems aren't about choosing between real-time and batch, or ETL and ELT—they're about orchestrating these patterns to solve different problems optimally. Event-driven architecture provides the foundation for decoupled, scalable systems. Messaging queues enable reliable, asynchronous communication. ETL enforces quality where it matters. ELT provides flexibility where it's needed. Batch processing handles complex aggregations efficiently.
The fintech platform we've explored demonstrates how these pieces integrate into a cohesive whole: transactions flow through Kafka, get validated in real-time, land in both operational and analytical stores, get enriched through batch processing, and ultimately power everything from fraud detection to customer insights.
As you design your own data platforms, remember: architecture is about trade-offs, not absolutes. Understand your constraints—latency requirements, data quality needs, team capabilities, cost budgets—and choose the patterns that best fit your context. Start simple, measure everything, and evolve as you learn.
The best data platform is one that serves your business needs today while remaining flexible enough to adapt tomorrow.
Key Takeaways:
✅ Event-driven architecture enables loose coupling and independent scaling
✅ Message queues provide durability, ordering, and replay capabilities
✅ ETL enforces quality at write time; ELT optimizes for flexibility
✅ Batch processing complements real-time streams for complex analytics
✅ Idempotency, observability, and data quality are non-negotiable
✅ Hybrid approaches leverage the strengths of multiple patterns
✅ Architecture decisions should be driven by business requirements, not trends
Happy building, and may your data pipelines never have backpressure at 3 AM. 🌙