TL;DR: Inherited a streaming architecture with 47 Kafka topics. Only 3 needed sub-second latency. The other 44 were batch operations disguised as streaming. $22k/month down to $3k/month. Same business outcomes. The lesson: your SLA should drive your architecture, not your resume.
The Problem: Real-Time Cargo Cult
The meeting was in a fancy conference room.
"We built a real-time data architecture. 12 engineers. Kafka clusters. Stream processors. Microservices. We're cutting edge."
They showed me a Grafana dashboard with impressive metrics:
- 500k events/second throughput
- 99.9% uptime
- Sub-millisecond latency
I asked a dangerous question: "What are we doing with all this?"
They showed me 47 Kafka topics.
I asked another question: "Which of these need real-time processing?"
Long pause.
They identified 3.
The other 44 were consumed every hour, once per day, or weekly. Dashboards refreshed every 15 minutes. Business decisions made in weekly meetings.
47 topics. 3 needed real-time. 94% of the infrastructure was serving batch jobs with extra steps.
The Uncomfortable Truth: Streaming Isn't About Speed
It's about timing requirements.
Real-time means: "I need this decision now. Not in 15 minutes. Not in an hour. Now."
Examples that are actually real-time:
- Fraud detection (block transaction in milliseconds)
- Autonomous vehicle navigation (react in milliseconds)
- High-frequency trading (execute in microseconds)
- Live recommendation (personalize page in 100ms)
Examples that pretend to be real-time:
- ✗ "Real-time dashboard" refreshed every 5 minutes (15 minutes would be fine)
- ✗ "Real-time inventory" (hourly updates meet the business need)
- ✗ "Live analytics" (weekly reporting is the actual requirement)
- ✗ "Real-time notifications" (a next-morning email works)
The Analysis: What I Actually Found
The Brutal Breakdown
```python
import pandas as pd

# A sample of the audit (10 of the 47 topics shown)
kafka_topics = {
    'Topic Name': [
        'payments_fraud_detection',
        'user_activity_realtime',
        'transaction_alerts',
        'dashboard_events_1h',
        'dashboard_events_6h',
        'daily_reports',
        'inventory_sync',
        'email_digest',
        'weekly_summary',
        'batch_jobs',
        # ... 37 more
    ],
    'Actual_Consumption_Interval': [
        '100ms (true real-time)',
        '200ms (true real-time)',
        '500ms (true real-time)',
        '15 minutes (batch)',
        '1 hour (batch)',
        '1 day (batch)',
        '1 hour (batch)',
        '8 hours (batch)',
        '1 week (batch)',
        '1 day (batch)',
    ],
    'Events_Per_Day': [
        1_000_000,
        5_000_000,
        2_000_000,
        500_000,
        300_000,
        1_000_000,
        100_000,
        50_000,
        10_000,
        5_000_000,
    ],
    'Actual_Need': [
        'Streaming (Kafka, Kinesis, Pulsar)',
        'Streaming (Kafka, Kinesis, Pulsar)',
        'Streaming (Kafka, Kinesis, Pulsar)',
        'Micro-batch (5-minute intervals)',
        'Batch (hourly job)',
        'Batch (daily job)',
        'Batch (hourly job)',
        'Batch (daily job)',
        'Batch (weekly job)',
        'Batch (daily job)',
    ],
}
df_topics = pd.DataFrame(kafka_topics)

# Counts below use the full 47-topic audit, not just the 10-topic sample above
total_topics = 47
real_time_count = 3  # topics whose Actual_Need is genuinely streaming
batch_count = total_topics - real_time_count
print(f"Total Kafka Topics: {total_topics}")
print(f"Actually Need Streaming: {real_time_count} ({real_time_count / total_topics:.1%})")
print(f"Could Use Batch/Micro-Batch: {batch_count} ({batch_count / total_topics:.1%})")
print("\nInfrastructure Cost Breakdown:")
print("  Kafka + stream processors: $20,000/month")
print("  Serving only 3 real use cases: ~$600/month")
print("  Serving 44 batch use cases: ~$19,400/month")
```
Output:
Total Kafka Topics: 47
Actually Need Streaming: 3 (6.4%)
Could Use Batch/Micro-Batch: 44 (93.6%)
Infrastructure Cost Breakdown:
Kafka + stream processors: $20,000/month
Serving only 3 real use cases: ~$600/month
Serving 44 batch use cases: ~$19,400/month
The Cost Reality
Real-time use cases (3 topics):
├── Fraud detection
├── User activity tracking
├── Transaction alerts
└── Cost: $600/month (AWS Kinesis or Kafka cluster)
Batch use cases (44 topics):
├── Dashboards (15-minute refresh)
├── Daily reports
├── Inventory sync
├── Email digests
├── Weekly summaries
└── Cost: $19,400/month (unnecessary streaming infrastructure)
Total: $20,000/month
Cost efficiency: 3% (only $600 of the $20,000 serves a genuine streaming need)
Overhead: 97% (the remaining spend serves batch use cases inefficiently)
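The overhead is easier to feel as cost per served use case. A quick back-of-the-envelope check, using the monthly figures from the breakdown above:

```python
# Monthly figures from the cost breakdown above
streaming_cost, streaming_cases = 600, 3
batch_cost, batch_cases = 19_400, 44

cost_per_streaming_case = streaming_cost / streaming_cases  # the 3 true real-time cases
cost_per_batch_case = batch_cost / batch_cases              # the 44 batch cases on Kafka

print(f"Streaming: ${cost_per_streaming_case:.0f}/use case/month")
print(f"Batch-on-streaming: ${cost_per_batch_case:.0f}/use case/month")
```

The batch workloads, which had the loosest latency requirements, cost more than twice as much per use case as the genuine streaming ones.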
The Framework: When Do You Actually Need Streaming?
This decision applies to all streaming tools (Kafka, Kinesis, RabbitMQ, Pulsar, Google Pub/Sub, etc.).
The tool doesn't matter. Your SLA does.
Decision Tree (Tool-Agnostic)
```python
def choose_architecture(requirement):
    """
    Tool-agnostic decision framework.
    Based on actual latency requirements, not hype.
    """
    # Question 1: What's your actual latency requirement?
    if requirement['latency_requirement_ms'] < 1_000:  # < 1 second
        # 1,000+ events/sec matches the "high-volume" checklist later on
        if requirement['data_volume_per_second'] > 1_000:
            return {
                'approach': 'Streaming (event-driven)',
                'tools': ['Kafka', 'Kinesis', 'Pulsar', 'RabbitMQ'],
                'cost_estimate': 'High',
                'complexity': 'High',
                'reasoning': 'High volume + low latency = need streaming',
            }
        else:
            return {
                'approach': 'Streaming (event-driven)',
                'tools': ['RabbitMQ', 'Redis Streams', 'Kinesis'],
                'cost_estimate': 'Medium',
                'complexity': 'Medium',
                'reasoning': 'Low latency required, volume manageable',
            }
    elif requirement['latency_requirement_ms'] < 3_600_000:  # < 1 hour
        return {
            'approach': 'Micro-batch (scheduled jobs every N minutes)',
            'tools': ['Spark', 'Airflow + Python', 'dbt scheduled runs'],
            'cost_estimate': 'Low-Medium',
            'complexity': 'Low',
            'reasoning': 'Sub-hour latency is acceptable, use micro-batch',
        }
    else:
        return {
            'approach': 'Batch (scheduled jobs)',
            'tools': ['Airflow', 'dbt', 'Spark', 'SQL scheduled jobs'],
            'cost_estimate': 'Low',
            'complexity': 'Low',
            'reasoning': 'Hourly or daily latency is fine, use batch',
        }

# Test the framework
use_cases = [
    {
        'name': 'Fraud Detection',
        'latency_requirement_ms': 100,
        'data_volume_per_second': 5_000,
    },
    {
        'name': 'Dashboard (15-min refresh)',
        'latency_requirement_ms': 900_000,  # 15 minutes
        'data_volume_per_second': 100,
    },
    {
        'name': 'Hourly Inventory Sync',
        'latency_requirement_ms': 3_600_000,  # 1 hour
        'data_volume_per_second': 50,
    },
]

for case in use_cases:
    decision = choose_architecture(case)
    print(f"\n{case['name']}:")
    print(f"  Approach: {decision['approach']}")
    print(f"  Recommended tools: {', '.join(decision['tools'])}")
    print(f"  Cost: {decision['cost_estimate']}")
    print(f"  Complexity: {decision['complexity']}")
```
Output:
Fraud Detection:
Approach: Streaming (event-driven)
Recommended tools: Kafka, Kinesis, Pulsar, RabbitMQ
Cost: High
Complexity: High
Dashboard (15-min refresh):
Approach: Micro-batch (scheduled jobs every N minutes)
Recommended tools: Spark, Airflow + Python, dbt scheduled runs
Cost: Low-Medium
Complexity: Low
Hourly Inventory Sync:
Approach: Batch (scheduled jobs)
Recommended tools: Airflow, dbt, Spark, SQL scheduled jobs
Cost: Low
Complexity: Low
The Solution: Right-Sizing the Architecture
Let me show you what we actually did:
Step 1: Assess Every Topic
```python
# Audit every topic
topic_assessment = pd.DataFrame({
    'topic_name': [
        'payments_fraud_detection',
        'user_activity',
        'transaction_alerts',
        'dashboard_1h',
        'dashboard_6h',
        'daily_reports',
        'inventory_sync',
        'email_digest',
        'weekly_summary',
    ],
    'current_tool': ['Kafka'] * 9,
    'actual_latency_requirement': [
        '100ms',
        '200ms',
        '500ms',
        '15 minutes',
        '1 hour',
        '1 day',
        '1 hour',
        '8 hours',
        '1 week',
    ],
    'current_cost_per_month': [
        400,
        300,
        300,
        2000,
        1500,
        3000,
        2500,
        2000,
        1500,
    ],
    'recommended_tool': [
        'Kafka (keep)',
        'Kafka (keep)',
        'RabbitMQ (keep - lower cost)',
        'Airflow + micro-batch',
        'dbt scheduled run',
        'Airflow batch job',
        'dbt scheduled run',
        'Airflow batch job',
        'Airflow batch job',
    ],
    'recommended_cost_per_month': [
        400,
        300,
        150,
        200,
        100,
        50,
        100,
        50,
        50,
    ],
})

print("Topic-by-Topic Assessment:")
print(topic_assessment.to_string(index=False))

print("\n\nCost Summary:")
current_total = topic_assessment['current_cost_per_month'].sum()
recommended_total = topic_assessment['recommended_cost_per_month'].sum()
savings = current_total - recommended_total
savings_percent = (savings / current_total) * 100
print(f"Current Total: ${current_total:,}/month")
print(f"Recommended Total: ${recommended_total:,}/month")
print(f"Savings: ${savings:,}/month ({savings_percent:.0f}%)")
```
Output:
Topic-by-Topic Assessment:
topic_name current_tool actual_latency_requirement current_cost recommended_tool recommended_cost
payments_fraud_detection Kafka 100ms 400 Kafka (keep) 400
user_activity Kafka 200ms 300 Kafka (keep) 300
transaction_alerts Kafka 500ms 300 RabbitMQ (keep) 150
dashboard_1h Kafka 15 minutes 2000 Airflow + micro-batch 200
dashboard_6h Kafka 1 hour 1500 dbt scheduled run 100
daily_reports Kafka 1 day 3000 Airflow batch job 50
inventory_sync Kafka 1 hour 2500 dbt scheduled run 100
email_digest Kafka 8 hours 2000 Airflow batch job 50
weekly_summary Kafka 1 week 1500 Airflow batch job 50
Cost Summary:
Current Total: $13,500/month (subset of 47 topics)
Recommended Total: $1,400/month
Savings: $12,100/month (90% reduction)
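With per-topic numbers in hand, migration order falls out of the same table: rank the batch-on-Kafka topics by monthly savings and move the biggest wins first. A sketch using the six batch topics from the assessment above:

```python
# Rank the batch-on-Kafka topics by monthly savings to prioritize migration
topics = [
    # (name, current $/mo, recommended $/mo) -- from the assessment above
    ('daily_reports',  3000,  50),
    ('inventory_sync', 2500, 100),
    ('dashboard_1h',   2000, 200),
    ('email_digest',   2000,  50),
    ('dashboard_6h',   1500, 100),
    ('weekly_summary', 1500,  50),
]

# Biggest monthly saving first
by_savings = sorted(topics, key=lambda t: t[1] - t[2], reverse=True)
for name, current, recommended in by_savings:
    print(f"{name:<16} save ${current - recommended:,}/month")
```

Migrating `daily_reports` alone recovers $2,950/month; doing the top three first captures most of the total savings before touching anything else.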
Step 2: Implement the Right Tool for Each Use Case
Real-Time (Keep Streaming):
```python
# Fraud detection: needs true streaming
# Can use: Kafka, Kinesis, Pulsar, or even RabbitMQ
import json

from kafka import KafkaConsumer  # kafka-python package

# Fraud detection consumer
consumer = KafkaConsumer(
    'payments_fraud_detection',
    bootstrap_servers=['localhost:9092'],
    value_deserializer=lambda m: json.loads(m.decode('utf-8')),
    auto_offset_reset='latest',
    group_id='fraud_detection_group',
)

def detect_fraud_realtime():
    """Process payments in real time."""
    for message in consumer:
        payment = message.value
        # Real-time fraud check (milliseconds matter);
        # is_fraudulent, alert_security_team, and block_transaction
        # are your business logic, not shown here
        if is_fraudulent(payment):
            alert_security_team(payment)
            block_transaction(payment['id'])

# Could also use:
# - AWS Kinesis (managed, scales automatically)
# - RabbitMQ (open source, lower cost)
# - Apache Pulsar (geo-replication, advanced)
# - Google Pub/Sub (managed, integrates with GCP)
```
Batch (Use Airflow or Scheduled SQL):
```python
# Dashboard refresh: 15-minute intervals is fine
# Don't use streaming, use scheduled batch
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

default_args = {
    'owner': 'data-team',
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG(
    'dashboard_refresh_15min',
    default_args=default_args,
    schedule_interval='*/15 * * * *',  # Every 15 minutes
    start_date=datetime(2024, 1, 1),
    catchup=False,
)

def refresh_dashboard_data():
    """
    Refresh dashboard every 15 minutes.
    This is micro-batch, not streaming.
    Much simpler than Kafka.
    """
    # read_data_from_source, transform_for_dashboard, and
    # write_to_warehouse are your own pipeline helpers
    # Read last 15 minutes of data
    df = read_data_from_source(
        start_time=datetime.now() - timedelta(minutes=15),
        end_time=datetime.now(),
    )
    # Transform
    df_transformed = transform_for_dashboard(df)
    # Load to warehouse
    write_to_warehouse(df_transformed)

task = PythonOperator(
    task_id='refresh_dashboard',
    python_callable=refresh_dashboard_data,
    dag=dag,
)

# Cost: ~$100-200/month (Airflow cluster)
# Complexity: Much lower than Kafka
# Maintenance: Simpler
```
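One detail worth getting right in micro-batch jobs: derive window boundaries from the schedule, not from `datetime.now()`, or a late-starting or re-run task will skip or double-count rows. A minimal sketch (the floor-to-interval helper is my own, not part of Airflow):

```python
from datetime import datetime, timedelta

def window_bounds(run_time: datetime, interval_minutes: int = 15):
    """Return the previous complete [start, end) window for a scheduled run.

    Flooring to the interval makes re-runs idempotent: the same run_time
    always produces the same window, unlike datetime.now()-based windows.
    """
    floored = run_time.replace(
        minute=(run_time.minute // interval_minutes) * interval_minutes,
        second=0,
        microsecond=0,
    )
    return floored - timedelta(minutes=interval_minutes), floored

# A run that starts late (10:07:33) still processes the 09:45-10:00 window
start, end = window_bounds(datetime(2024, 1, 1, 10, 7, 33))
print(start, end)
```

In Airflow specifically you would typically use the task's logical date from the template context instead of wall-clock time, but the flooring idea is the same.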
Or Use dbt for Scheduled SQL Transforms:
```sql
-- models/dashboard_data.sql
-- Scheduled to run every 15 minutes by dbt Cloud
-- Note: DATE_TRUNC doesn't accept '15 minutes'; 15-minute buckets need
-- date_bin (Postgres 14+), TIME_SLICE (Snowflake), or TIMESTAMP_BUCKET (BigQuery)
SELECT
    date_bin('15 minutes', event_timestamp, TIMESTAMP '2024-01-01') AS bucket,
    event_type,
    COUNT(*) AS event_count,
    SUM(value) AS total_value,
    AVG(value) AS avg_value
FROM raw_events
WHERE event_timestamp >= CURRENT_TIMESTAMP - INTERVAL '15 minutes'
GROUP BY 1, 2
```
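The same 15-minute bucketing can be sanity-checked locally with pandas before wiring it into dbt (synthetic data, illustrative only):

```python
import pandas as pd

# Synthetic events: one every 4 minutes over an hour
events = pd.DataFrame({
    'event_timestamp': pd.date_range('2024-01-01 09:00', periods=15, freq='4min'),
    'event_type': ['click'] * 15,
    'value': range(15),
})

# Floor each timestamp to its 15-minute bucket, then aggregate --
# the pandas equivalent of the scheduled SQL transform
summary = (
    events
    .assign(bucket=events['event_timestamp'].dt.floor('15min'))
    .groupby(['bucket', 'event_type'])
    .agg(event_count=('value', 'size'), total_value=('value', 'sum'))
    .reset_index()
)
print(summary)
```

With events every 4 minutes, the 09:00, 09:15, and 09:30 buckets each collect 4 events and the trailing 09:45 bucket collects 3, which is easy to eyeball against the SQL's output.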
The Comparison: All Streaming Tool Options
Here's the key insight: the tool doesn't matter. Your requirement does.
Once you understand your SLA, you can choose:
```python
streaming_tools = pd.DataFrame({
    'Tool': [
        'Kafka',
        'AWS Kinesis',
        'RabbitMQ',
        'Apache Pulsar',
        'Google Pub/Sub',
        'Redis Streams',
        'Azure Event Hubs',
    ],
    'Latency': [
        '<10ms',
        '<100ms',
        '<50ms',
        '<5ms',
        '<50ms',
        '<5ms',
        '<1s',
    ],
    'Throughput': [
        'Massive (millions/sec)',
        'Massive (millions/sec)',
        'High (100k/sec)',
        'Massive (millions/sec)',
        'Massive (millions/sec)',
        'High (100k/sec)',
        'Massive (millions/sec)',
    ],
    'Cost_Per_Month': [
        '$3-10k (self-hosted) / $5k+ (managed)',
        '$2-5k',
        '$100-500 (self-hosted)',
        '$2-8k (self-hosted) / $5k+ (managed)',
        '$1-3k',
        '$200-800',
        '$2-8k',
    ],
    'Operational_Burden': [
        'High',
        'Low (managed)',
        'Medium',
        'High',
        'Low (managed)',
        'Low',
        'Low (managed)',
    ],
    'When_To_Use': [
        'Massive scale, on-premises requirement',
        'AWS ecosystem, high volume',
        'Cost-conscious, lower volume',
        'Advanced features, massive scale',
        'GCP ecosystem',
        'Low latency, moderate volume',
        'Azure ecosystem',
    ],
})
print(streaming_tools.to_string(index=False))
```
Output:
Tool Latency Throughput Cost_Per_Month Operational_Burden When_To_Use
Kafka <10ms Massive (millions/sec) $3-10k / $5k+ High Massive scale
AWS Kinesis <100ms Massive (millions/sec) $2-5k Low (managed) AWS ecosystem
RabbitMQ <50ms High (100k/sec) $100-500 Medium Cost-conscious
Apache Pulsar <5ms Massive (millions/sec) $2-8k / $5k+ High Advanced features
Google Pub/Sub <50ms Massive (millions/sec) $1-3k Low (managed) GCP ecosystem
Redis Streams <5ms High (100k/sec) $200-800 Low Low latency, moderate
Azure Event Hubs <1s Massive (millions/sec) $2-8k Low (managed) Azure ecosystem
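A comparison table is more useful when it answers a question instead of just displaying options. A small filter sketch, reusing a slimmed-down version of the frame above (the numeric latency column is my own parsing of the '<10ms'-style strings):

```python
import pandas as pd

# Slimmed-down copy of the comparison table, with latency parsed to numbers
streaming_tools = pd.DataFrame({
    'Tool': ['Kafka', 'AWS Kinesis', 'RabbitMQ', 'Apache Pulsar',
             'Google Pub/Sub', 'Redis Streams', 'Azure Event Hubs'],
    'Latency_ms_upper': [10, 100, 50, 5, 50, 5, 1000],
    'Operational_Burden': ['High', 'Low (managed)', 'Medium', 'High',
                           'Low (managed)', 'Low', 'Low (managed)'],
})

# Example question: "managed services that stay under a 100ms budget"
managed_fast = streaming_tools[
    streaming_tools['Operational_Burden'].str.contains('managed')
    & (streaming_tools['Latency_ms_upper'] <= 100)
]
print(managed_fast['Tool'].tolist())
```

Swap the predicate for whatever your constraint actually is (cost ceiling, ecosystem, latency); the point is to query the requirement, not browse the hype.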
The point: Pick the right tool for the right job. Not because it's trendy. Not because someone built it. Because it matches your actual requirement.
The Before/After: Real Metrics
Before: Over-Engineered
Architecture:
├── 47 Kafka topics
├── 5 stream processors (Flink, Spark Streaming)
├── Multiple consumer groups
├── 12 on-call engineers
├── Complex deployment pipelines
└── Kafka cluster + 3 Zookeeper nodes
Costs:
├── Kafka infrastructure: $8,000/month
├── Stream processors: $7,000/month
├── Cloud resources: $4,000/month
├── Monitoring + alerting: $2,000/month
└── Engineering time (on-call): $1,000/month
Total: $22,000/month
Incidents (per month):
├── Message delays: 3-4
├── Consumer lag issues: 2-3
├── Zookeeper problems: 1-2
├── False alerts: 5-7
└── Mean time to resolution: 90 minutes
Total: ~10 incidents/month
After: Right-Sized
Architecture:
├── 3 Kafka topics (fraud, activity, alerts)
├── 1 managed Kinesis stream (better cost/performance)
├── Airflow for 15-minute micro-batches
├── dbt for daily scheduled SQL transforms
├── 2 on-call engineers
└── Simple deployment with infrastructure-as-code
Costs:
├── Kafka/Kinesis infrastructure: $1,500/month
├── Airflow managed service: $300/month
├── dbt Cloud: $200/month
├── Cloud resources: $500/month
├── Monitoring + alerting: $300/month
└── Engineering time (on-call): $200/month
Total: $3,000/month
Incidents (per month):
├── Message delays: 0 (not needed for batch)
├── Consumer lag issues: 0
├── Zookeeper problems: 0
├── False alerts: 1
└── Mean time to resolution: 15 minutes
Total: ~1 incident/month (mostly resolved automatically)
Comparison:
| Metric | Before | After | Change |
|---|---|---|---|
| Monthly cost | $22,000 | $3,000 | -86% |
| Complexity | Very High | Low | Simpler |
| On-call burden | High | Low | Less stress |
| Incidents/month | ~10 | ~1 | -90% |
| MTTR (minutes) | 90 | 15 | -83% |
| Same business value? | Yes | Yes | ✓ |
The Bigger Picture: This Isn't Just Kafka
This concept applies everywhere:
Problem: Over-engineering for the wrong reason
├── Using Kafka when you need batch
├── Using Spark when you need SQL
├── Using Kubernetes when you need a single server
├── Using 12 microservices when you need a monolith
├── Using ML models when statistics work
└── Using cloud when on-premise is cheaper
Solution: Let requirements drive architecture
├── Know your SLA
├── Know your data volume
├── Know your team's skills
├── Know the actual business need
└── Match the tool to the requirement, not your resume
Decision Framework: Should You Use Streaming?
Ask yourself (honestly):
```python
streaming_necessity_check = {
    'Do you need sub-second latency?': None,
    'Is this latency critical to business?': None,
    'Do you have high-volume data (1000+ events/sec)?': None,
    'Will streaming save significant costs vs batch?': None,
    'Does your team understand streaming systems?': None,
    'Can you maintain a streaming system 24/7?': None,
}

def evaluate(answers):
    yes_count = sum(1 for v in answers.values() if v is True)
    if yes_count >= 5:
        return "Use streaming (Kafka, Kinesis, Pulsar, RabbitMQ, etc.)"
    elif yes_count >= 3:
        return "Consider streaming, but evaluate batch alternatives"
    else:
        return "Use batch (Airflow, dbt, scheduled SQL)"

# Fraud detection use case
fraud_answers = {
    'Do you need sub-second latency?': True,                  # YES (100ms)
    'Is this latency critical to business?': True,            # YES (transactions blocked)
    'Do you have high-volume data (1000+ events/sec)?': True, # YES (5k/sec)
    'Will streaming save significant costs vs batch?': False, # NO (cost similar)
    'Does your team understand streaming systems?': True,     # YES
    'Can you maintain a streaming system 24/7?': True,        # YES
}
result = evaluate(fraud_answers)
print(f"Fraud Detection: {result}")
# Output: Use streaming (Kafka, Kinesis, Pulsar, RabbitMQ, etc.)

# Dashboard refresh use case
dashboard_answers = {
    'Do you need sub-second latency?': False,                  # NO (15 minutes OK)
    'Is this latency critical to business?': False,            # NO
    'Do you have high-volume data (1000+ events/sec)?': False, # NO (100/sec)
    'Will streaming save significant costs vs batch?': False,  # NO (batch is cheaper)
    'Does your team understand streaming systems?': True,      # YES
    'Can you maintain a streaming system 24/7?': True,         # YES
}
result = evaluate(dashboard_answers)
print(f"Dashboard Refresh: {result}")
# Output: Use batch (Airflow, dbt, scheduled SQL)
```
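The interesting branch is the middle one, which neither case above hits. A hypothetical borderline use case (the answers are illustrative, not from the original audit) lands there, with `evaluate` restated from the checklist above:

```python
def evaluate(answers):
    # Restated from the checklist above
    yes_count = sum(1 for v in answers.values() if v is True)
    if yes_count >= 5:
        return "Use streaming (Kafka, Kinesis, Pulsar, RabbitMQ, etc.)"
    elif yes_count >= 3:
        return "Consider streaming, but evaluate batch alternatives"
    else:
        return "Use batch (Airflow, dbt, scheduled SQL)"

# Hypothetical borderline case: live order tracking
order_tracking_answers = {
    'Do you need sub-second latency?': True,                   # users expect quick updates
    'Is this latency critical to business?': False,            # a 1-minute delay is survivable
    'Do you have high-volume data (1000+ events/sec)?': True,  # high order volume
    'Will streaming save significant costs vs batch?': False,  # batch is cheaper
    'Does your team understand streaming systems?': True,      # yes
    'Can you maintain a streaming system 24/7?': False,        # no dedicated on-call
}
print(f"Order Tracking: {evaluate(order_tracking_answers)}")
```

Three "yes" answers put it in the "consider streaming, but evaluate batch alternatives" bucket: the honest answer for most systems that feel real-time but could survive a one-minute lag.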
Key Lessons
1. Real-time is an SLA, not an architecture.
   - If you don't need it, you're just adding complexity
   - Batch with 5-minute or 1-hour latency is often "real-time enough"
2. The tool doesn't matter. Your requirement does.
   - Kafka, Kinesis, RabbitMQ, Pulsar: they're all fine
   - Pick based on your actual need, not hype or resume-building
3. Most streaming use cases are actually batch.
   - Dashboard refreshing isn't streaming
   - Daily reports aren't streaming
   - Weekly summaries aren't streaming
   - These are batch operations with extra infrastructure costs
4. Operational burden scales with complexity.
   - Every extra system is something that can break
   - Every extra on-call page is engineer unhappiness
   - Simple batch jobs are more stable than distributed streaming systems
5. Let requirements drive architecture.
   - Not career progression
   - Not what you learned in a course
   - Not what the trendy companies use
   - What does the business actually need?
The Uncomfortable Question
When was the last time you asked: "Does this actually need to be real-time?"
Not:
- "Can we make this real-time?"
- "What's the coolest technology to use?"
- "What would look good in our architecture diagram?"
But:
- "Does the business actually need sub-second latency?"
- "What latency would they actually accept?"
- "What's the simplest system that meets this requirement?"
If you asked that question and changed your architecture to match, you'd probably:
- Reduce costs by 50-80%
- Reduce incident rate by 70-90%
- Reduce operational burden significantly
- Still deliver the same business value
Final Thought
The best architecture is not the most advanced. It's the simplest one that meets the requirement.
Kafka is incredible at what it does. RabbitMQ is great. Kinesis is solid. Pulsar is powerful.
But they're all overkill for batch jobs. And that's what most of your "real-time" systems actually are.
What would your architecture look like if you asked your business: "What latency actually matters?" first?