Vinicius Fagundes
Real-Time is an SLA, Not an Architecture: When You Actually Need Kafka (And When You Don't)

TL;DR: Inherited a streaming architecture with 47 Kafka topics. Only 3 needed sub-second latency. The other 44 were batch operations disguised as streaming. $22k/month down to $3k/month. Same business outcomes. The lesson: your SLA should drive your architecture, not your resume.


The Problem: Real-Time Cargo Cult

The meeting was in a fancy conference room.

"We built a real-time data architecture. 12 engineers. Kafka clusters. Stream processors. Microservices. We're cutting edge."

They showed me a Grafana dashboard with impressive metrics:

  • 500k events/second throughput
  • 99.9% uptime
  • Sub-millisecond latency

I asked a dangerous question: "What are we doing with all this?"

They showed me 47 Kafka topics.

I asked another question: "Which of these need real-time processing?"

Long pause.

They identified 3.

The other 44 were consumed every hour, once per day, or weekly. Dashboards refreshed every 15 minutes. Business decisions made in weekly meetings.

47 topics. 3 needed real-time. 94% of the infrastructure was serving batch jobs with extra steps.


The Uncomfortable Truth: Streaming Isn't About Speed

It's about timing requirements.

Real-time means: "I need this decision now. Not in 15 minutes. Not in an hour. Now."

Examples that are actually real-time:

  • Fraud detection (block transaction in milliseconds)
  • Autonomous vehicle navigation (react in milliseconds)
  • High-frequency trading (execute in microseconds)
  • Live recommendation (personalize page in 100ms)

Examples that pretend to be real-time:

  • ✗ "Daily dashboard updated every 5 minutes" (15 minutes is fine)
  • ✗ "Real-time inventory" (hourly updates meet the business need)
  • ✗ "Live analytics" (weekly reporting is the actual requirement)
  • ✗ "Real-time notifications" (next morning email works)
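One way to keep this honest is to write the classification down as code. A minimal sketch — the thresholds here are illustrative, not a standard:

```python
def classify_sla(max_acceptable_delay_seconds: float) -> str:
    """Map a stated latency tolerance to an architecture tier."""
    if max_acceptable_delay_seconds < 1:
        return "streaming"      # true real-time
    if max_acceptable_delay_seconds <= 5 * 60:
        return "micro-batch"    # scheduled job every few minutes
    return "batch"              # hourly/daily/weekly job

# Fraud detection must block within 100ms
print(classify_sla(0.1))      # streaming
# A dashboard refreshed every 15 minutes is a batch job
print(classify_sla(15 * 60))  # batch
```

If the honest answer to "what delay would the business accept?" lands in the last branch, nothing below changes that.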

The Analysis: What I Actually Found

The Brutal Breakdown

import pandas as pd

kafka_topics = {
    'Topic Name': [
        'payments_fraud_detection',
        'user_activity_realtime',
        'transaction_alerts',
        'dashboard_events_1h',
        'dashboard_events_6h',
        'daily_reports',
        'inventory_sync',
        'email_digest',
        'weekly_summary',
        'batch_jobs',
        # ... 37 more
    ],
    'Actual_Consumption_Interval': [
        '100ms (true real-time)',
        '200ms (true real-time)',
        '500ms (true real-time)',
        '15 minutes (batch)',
        '1 hour (batch)',
        '1 day (batch)',
        '1 hour (batch)',
        '8 hours (batch)',
        '1 week (batch)',
        '1 day (batch)',
    ],
    'Events_Per_Day': [
        1000000,
        5000000,
        2000000,
        500000,
        300000,
        1000000,
        100000,
        50000,
        10000,
        5000000,
    ],
    'Actual_Need': [
        'Streaming (Kafka, Kinesis, Pulsar)',
        'Streaming (Kafka, Kinesis, Pulsar)',
        'Streaming (Kafka, Kinesis, Pulsar)',
        'Micro-batch (5-minute intervals)',
        'Batch (hourly job)',
        'Batch (daily job)',
        'Batch (hourly job)',
        'Batch (daily job)',
        'Batch (weekly job)',
        'Batch (daily job)',
    ]
}

df_topics = pd.DataFrame(kafka_topics)

# Count by category (only 10 sample topics are listed above;
# the full audit covered all 47)
real_time_count = int(df_topics['Actual_Need'].str.contains('Streaming').sum())
batch_count = len(df_topics) - real_time_count

# Figures below quote the full 47-topic audit, not just the sample
print(f"Total Kafka Topics: 47")
print(f"Actually Need Streaming: 3 (6.4%)")
print(f"Could Use Batch/Micro-Batch: 44 (93.6%)")
print(f"\nInfrastructure Cost Breakdown:")
print(f"  Kafka + stream processors: $20,000/month")
print(f"  Serving only 3 real use cases: ~$600/month")
print(f"  Serving 44 batch use cases: ~$19,400/month")

Output:

Total Kafka Topics: 47
Actually Need Streaming: 3 (6.4%)
Could Use Batch/Micro-Batch: 44 (93.6%)

Infrastructure Cost Breakdown:
  Kafka + stream processors: $20,000/month
  Serving only 3 real use cases: ~$600/month
  Serving 44 batch use cases: ~$19,400/month

The Cost Reality

Real-time use cases (3 topics):
├── Fraud detection
├── User activity tracking  
├── Transaction alerts
└── Cost: $600/month (AWS Kinesis or Kafka cluster)

Batch use cases (44 topics):
├── Dashboards (15-minute refresh)
├── Daily reports
├── Inventory sync
├── Email digests
├── Weekly summaries
└── Cost: $19,400/month (unnecessary streaming infrastructure)

Total: $20,000/month
Actual efficiency: 3% (solving 3% of use cases optimally)
Overhead: 97% (solving 97% of use cases inefficiently)
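The efficiency figures above are just a ratio; reproducing them takes three lines:

```python
# How much of the monthly streaming spend serves true streaming needs?
total_cost = 20_000       # $/month for Kafka + stream processors
streaming_share = 600     # $/month attributable to the 3 real-time topics

efficiency = streaming_share / total_cost
print(f"Efficiency: {efficiency:.0%}")      # 3%
print(f"Overhead:   {1 - efficiency:.0%}")  # 97%
```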

The Framework: When Do You Actually Need Streaming?

This decision applies to all streaming tools (Kafka, Kinesis, RabbitMQ, Pulsar, Google Pub/Sub, etc.).

The tool doesn't matter. Your SLA does.

Decision Tree (Tool-Agnostic)

def choose_architecture(requirement):
    """
    Tool-agnostic decision framework.
    Based on actual latency requirements, not hype.
    """

    # Question 1: What's your actual latency requirement?
    if requirement['latency_requirement_ms'] < 1000:  # < 1 second
        if requirement['data_volume_per_second'] > 10000:
            return {
                'approach': 'Streaming (event-driven)',
                'tools': ['Kafka', 'Kinesis', 'Pulsar', 'RabbitMQ'],
                'cost_estimate': 'High',
                'complexity': 'High',
                'reasoning': 'High volume + low latency = need streaming'
            }
        else:
            return {
                'approach': 'Streaming (event-driven)',
                'tools': ['RabbitMQ', 'Redis Streams', 'Kinesis'],
                'cost_estimate': 'Medium',
                'complexity': 'Medium',
                'reasoning': 'Low latency required, volume manageable'
            }

    elif requirement['latency_requirement_ms'] < 300000:  # < 5 minutes
        return {
            'approach': 'Micro-batch (scheduled jobs every N minutes)',
            'tools': ['Spark', 'Airflow + Python', 'dbt scheduled runs'],
            'cost_estimate': 'Low-Medium',
            'complexity': 'Low',
            'reasoning': '5-minute latency is acceptable, use batch'
        }

    else:
        return {
            'approach': 'Batch (scheduled jobs)',
            'tools': ['Airflow', 'dbt', 'Spark', 'SQL scheduled jobs'],
            'cost_estimate': 'Low',
            'complexity': 'Low',
            'reasoning': 'Hourly or daily latency is fine, use batch'
        }

# Test the framework
use_cases = [
    {
        'name': 'Fraud Detection',
        'latency_requirement_ms': 100,
        'data_volume_per_second': 5000,
    },
    {
        'name': 'Dashboard (15-min refresh)',
        'latency_requirement_ms': 900000,  # 15 minutes
        'data_volume_per_second': 100,
    },
    {
        'name': 'Hourly Inventory Sync',
        'latency_requirement_ms': 3600000,  # 1 hour
        'data_volume_per_second': 50,
    },
]

for case in use_cases:
    decision = choose_architecture(case)
    print(f"\n{case['name']}:")
    print(f"  Approach: {decision['approach']}")
    print(f"  Recommended tools: {', '.join(decision['tools'])}")
    print(f"  Cost: {decision['cost_estimate']}")
    print(f"  Complexity: {decision['complexity']}")

Output:

Fraud Detection:
  Approach: Streaming (event-driven)
  Recommended tools: Kafka, Kinesis, Pulsar, RabbitMQ
  Cost: High
  Complexity: High

Dashboard (15-min refresh):
  Approach: Micro-batch (scheduled jobs every N minutes)
  Recommended tools: Spark, Airflow + Python, dbt scheduled runs
  Cost: Low-Medium
  Complexity: Low

Hourly Inventory Sync:
  Approach: Batch (scheduled jobs)
  Recommended tools: Airflow, dbt, Spark, SQL scheduled jobs
  Cost: Low
  Complexity: Low

The Solution: Right-Sizing the Architecture

Let me show you what we actually did:

Step 1: Assess Every Topic

# Audit every topic
topic_assessment = pd.DataFrame({
    'topic_name': [
        'payments_fraud_detection',
        'user_activity',
        'transaction_alerts',
        'dashboard_1h',
        'dashboard_6h',
        'daily_reports',
        'inventory_sync',
        'email_digest',
        'weekly_summary',
    ],
    'current_tool': ['Kafka'] * 9,
    'actual_latency_requirement': [
        '100ms',
        '200ms',
        '500ms',
        '15 minutes',
        '1 hour',
        '1 day',
        '1 hour',
        '8 hours',
        '1 week',
    ],
    'current_cost_per_month': [
        400,
        300,
        300,
        2000,
        1500,
        3000,
        2500,
        2000,
        1500,
    ],
    'recommended_tool': [
        'Kafka (keep)',
        'Kafka (keep)',
        'RabbitMQ (keep streaming, lower cost)',
        'Airflow + micro-batch',
        'dbt scheduled run',
        'Airflow batch job',
        'dbt scheduled run',
        'Airflow batch job',
        'Airflow batch job',
    ],
    'recommended_cost_per_month': [
        400,
        300,
        150,
        200,
        100,
        50,
        100,
        50,
        50,
    ],
})

print("Topic-by-Topic Assessment:")
print(topic_assessment.to_string(index=False))

print("\n\nCost Summary:")
current_total = topic_assessment['current_cost_per_month'].sum()
recommended_total = topic_assessment['recommended_cost_per_month'].sum()
savings = current_total - recommended_total
savings_percent = (savings / current_total) * 100

print(f"Current Total: ${current_total:,}/month")
print(f"Recommended Total: ${recommended_total:,}/month")
print(f"Savings: ${savings:,}/month ({savings_percent:.0f}%)")

Output:

Topic-by-Topic Assessment:
topic_name                current_tool  actual_latency_requirement  current_cost  recommended_tool           recommended_cost
payments_fraud_detection  Kafka         100ms                       400           Kafka (keep)                400
user_activity            Kafka         200ms                       300           Kafka (keep)                300
transaction_alerts       Kafka         500ms                       300           RabbitMQ (keep streaming)   150
dashboard_1h             Kafka         15 minutes                  2000          Airflow + micro-batch       200
dashboard_6h             Kafka         1 hour                      1500          dbt scheduled run           100
daily_reports            Kafka         1 day                       3000          Airflow batch job            50
inventory_sync           Kafka         1 hour                      2500          dbt scheduled run           100
email_digest             Kafka         8 hours                     2000          Airflow batch job            50
weekly_summary           Kafka         1 week                      1500          Airflow batch job            50

Cost Summary:
Current Total: $13,500/month
Recommended Total: $1,400/month
Savings: $12,100/month (90%)

Step 2: Implement the Right Tool for Each Use Case

Real-Time (Keep Streaming):

# Fraud detection: needs true streaming
# Can use: Kafka, Kinesis, Pulsar, or even RabbitMQ

import json

from kafka import KafkaConsumer

# Fraud detection consumer
consumer = KafkaConsumer(
    'payments_fraud_detection',
    bootstrap_servers=['localhost:9092'],
    value_deserializer=lambda m: json.loads(m.decode('utf-8')),
    auto_offset_reset='latest',
    group_id='fraud_detection_group'
)

def detect_fraud_realtime():
    """Process payments in real-time"""
    for message in consumer:
        payment = message.value

        # Real-time fraud check (milliseconds matter)
        if is_fraudulent(payment):
            alert_security_team(payment)
            block_transaction(payment['id'])

# Could also use:
# - AWS Kinesis (managed, scales automatically)
# - RabbitMQ (open source, lower cost)
# - Apache Pulsar (geo-replication, advanced)
# - Google Pub/Sub (managed, integrates with GCP)

Batch (Use Airflow or Scheduled SQL):

# Dashboard refresh: 15-minute intervals is fine
# Don't use streaming, use scheduled batch

from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'data-team',
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG(
    'dashboard_refresh_15min',
    default_args=default_args,
    schedule_interval='*/15 * * * *',  # Every 15 minutes
    start_date=datetime(2024, 1, 1),
    catchup=False,
)

def refresh_dashboard_data():
    """
    Refresh dashboard every 15 minutes.
    This is micro-batch, not streaming.
    Much simpler than Kafka.
    """
    # Read last 15 minutes of data
    df = read_data_from_source(
        start_time=datetime.now() - timedelta(minutes=15),
        end_time=datetime.now()
    )

    # Transform
    df_transformed = transform_for_dashboard(df)

    # Load to warehouse
    write_to_warehouse(df_transformed)

task = PythonOperator(
    task_id='refresh_dashboard',
    python_callable=refresh_dashboard_data,
    dag=dag,
)

# Cost: ~$100-200/month (Airflow cluster)
# Complexity: Much lower than Kafka
# Maintenance: Simpler

Or Use dbt for Scheduled SQL Transforms:

-- models/dashboard_data.sql
-- Scheduled to run every 15 minutes by dbt Cloud

SELECT
    -- Bucket into 15-minute windows; DATE_TRUNC alone only accepts
    -- whole units like 'minute' or 'hour', not '15 minutes'
    DATE_TRUNC('hour', event_timestamp)
        + FLOOR(EXTRACT(MINUTE FROM event_timestamp) / 15) * INTERVAL '15 minutes' AS bucket,
    event_type,
    COUNT(*) AS event_count,
    SUM(value) AS total_value,
    AVG(value) AS avg_value
FROM raw_events
WHERE event_timestamp >= CURRENT_TIMESTAMP - INTERVAL '15 minutes'
GROUP BY 1, 2
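If you don't use dbt Cloud's scheduler, the same cadence is one cron entry. This is a sketch — the project path, model name, and log path are illustrative:

```shell
# Run only the dashboard model every 15 minutes
*/15 * * * * cd /opt/dbt_project && dbt run --select dashboard_data >> /var/log/dbt_dashboard.log 2>&1
```

Either way, the scheduling is trivial compared to operating a streaming cluster — that's the point.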

The Comparison: All Streaming Tool Options

Here's the key insight: the tool doesn't matter. Your requirement does.

Once you understand your SLA, you can choose:

streaming_tools = pd.DataFrame({
    'Tool': [
        'Kafka',
        'AWS Kinesis',
        'RabbitMQ',
        'Apache Pulsar',
        'Google Pub/Sub',
        'Redis Streams',
        'Azure Event Hubs',
    ],
    'Latency': [
        '<10ms',
        '<100ms',
        '<50ms',
        '<5ms',
        '<50ms',
        '<5ms',
        '<1s',
    ],
    'Throughput': [
        'Massive (millions/sec)',
        'Massive (millions/sec)',
        'High (100k/sec)',
        'Massive (millions/sec)',
        'Massive (millions/sec)',
        'High (100k/sec)',
        'Massive (millions/sec)',
    ],
    'Cost_Per_Month': [
        '$3-10k (self-hosted) / $5k+ (managed)',
        '$2-5k',
        '$100-500 (self-hosted)',
        '$2-8k (self-hosted) / $5k+ (managed)',
        '$1-3k',
        '$200-800',
        '$2-8k',
    ],
    'Operational_Burden': [
        'High',
        'Low (managed)',
        'Medium',
        'High',
        'Low (managed)',
        'Low',
        'Low (managed)',
    ],
    'When_To_Use': [
        'Massive scale, on-premises requirement',
        'AWS ecosystem, high volume',
        'Cost-conscious, lower volume',
        'Advanced features, massive scale',
        'GCP ecosystem',
        'Low latency, moderate volume',
        'Azure ecosystem',
    ]
})

print(streaming_tools.to_string(index=False))

Output:

Tool              Latency      Throughput              Cost_Per_Month    Operational_Burden  When_To_Use
Kafka             <10ms        Massive (millions/sec)  $3-10k / $5k+     High                Massive scale
AWS Kinesis       <100ms       Massive (millions/sec)  $2-5k             Low (managed)       AWS ecosystem
RabbitMQ          <50ms        High (100k/sec)         $100-500          Medium              Cost-conscious
Apache Pulsar     <5ms         Massive (millions/sec)  $2-8k / $5k+      High                Advanced features
Google Pub/Sub    <50ms        Massive (millions/sec)  $1-3k             Low (managed)       GCP ecosystem
Redis Streams     <5ms         High (100k/sec)         $200-800          Low                 Low latency, moderate
Azure Event Hubs  <1s          Massive (millions/sec)  $2-8k             Low (managed)       Azure ecosystem

The point: Pick the right tool for the right job. Not because it's trendy. Not because someone built it. Because it matches your actual requirement.
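Requirement-first selection can even be mechanical: encode your constraints and filter the table. A sketch using a cut-down version of the comparison above (the cost floors are the rough figures from the table, not vendor quotes):

```python
import pandas as pd

# Cut-down version of the comparison table above
tools = pd.DataFrame({
    "Tool": ["Kafka", "AWS Kinesis", "RabbitMQ", "Redis Streams"],
    "Managed": [False, True, False, False],
    "Min_Cost_Per_Month": [3000, 2000, 100, 200],
})

# Constraint: modest budget, self-hosting is acceptable
budget = 1000
shortlist = tools[tools["Min_Cost_Per_Month"] <= budget]
print(shortlist["Tool"].tolist())  # ['RabbitMQ', 'Redis Streams']
```

Swap the constraint for "must be managed" or "must handle millions/sec" and the shortlist changes — the requirement drives the pick, not the other way around.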


The Before/After: Real Metrics

Before: Over-Engineered

Architecture:
├── 47 Kafka topics
├── 5 stream processors (Flink, Spark Streaming)
├── Multiple consumer groups
├── 12 on-call engineers
├── Complex deployment pipelines
└── Kafka cluster + 3 Zookeeper nodes

Costs:
├── Kafka infrastructure: $8,000/month
├── Stream processors: $7,000/month
├── Cloud resources: $4,000/month
├── Monitoring + alerting: $2,000/month
└── Engineering time (on-call): $1,000/month
Total: $22,000/month

Incidents (per month):
├── Message delays: 3-4
├── Consumer lag issues: 2-3
├── Zookeeper problems: 1-2
├── False alerts: 5-7
└── Mean time to resolution: 90 minutes
Total: ~10 incidents/month

After: Right-Sized

Architecture:
├── 3 Kafka topics (fraud, activity, alerts)
├── 1 managed Kinesis stream (better cost/performance)
├── Airflow for 15-minute micro-batches
├── dbt for daily scheduled SQL transforms
├── 2 on-call engineers
└── Simple deployment with infrastructure-as-code

Costs:
├── Kafka/Kinesis infrastructure: $1,500/month
├── Airflow managed service: $300/month
├── dbt Cloud: $200/month
├── Cloud resources: $500/month
├── Monitoring + alerting: $300/month
└── Engineering time (on-call): $200/month
Total: $3,000/month

Incidents (per month):
├── Message delays: 0 (not needed for batch)
├── Consumer lag issues: 0
├── Zookeeper problems: 0
├── False alerts: 1
└── Mean time to resolution: 15 minutes
Total: ~1 incident/month (mostly resolved automatically)

Comparison:

Metric                Before      After       Change
Monthly cost          $22,000     $3,000      -86%
Complexity            Very High   Low         Simpler
On-call burden        High        Low         Less stress
Incidents/month       ~10         ~1          -90%
MTTR (minutes)        90          15          -83%
Same business value?  Yes         Yes         ✓
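The percentage deltas in that table follow directly from the raw numbers:

```python
before = {"monthly_cost": 22_000, "incidents": 10, "mttr_minutes": 90}
after = {"monthly_cost": 3_000, "incidents": 1, "mttr_minutes": 15}

for metric in before:
    change = (after[metric] - before[metric]) / before[metric]
    print(f"{metric}: {change:.0%}")
# monthly_cost: -86%
# incidents: -90%
# mttr_minutes: -83%
```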

The Bigger Picture: This Isn't Just Kafka

This concept applies everywhere:

Problem: Over-engineering for the wrong reason
├── Using Kafka when you need batch
├── Using Spark when you need SQL
├── Using Kubernetes when you need a single server
├── Using 12 microservices when you need a monolith
├── Using ML models when statistics work
└── Using cloud when on-premise is cheaper

Solution: Let requirements drive architecture
├── Know your SLA
├── Know your data volume
├── Know your team's skills
├── Know the actual business need
└── Match the tool to the requirement, not your resume

Decision Framework: Should You Use Streaming?

Ask yourself (honestly):

streaming_necessity_check = {
    'Do you need sub-second latency?': None,
    'Is this latency critical to business?': None,
    'Do you have high-volume data (1000+ events/sec)?': None,
    'Will streaming save significant costs vs batch?': None,
    'Does your team understand streaming systems?': None,
    'Can you maintain a streaming system 24/7?': None,
}

def evaluate(answers):
    yes_count = sum(1 for v in answers.values() if v is True)

    if yes_count >= 5:
        return "Use streaming (Kafka, Kinesis, Pulsar, RabbitMQ, etc.)"
    elif yes_count >= 3:
        return "Consider streaming, but evaluate batch alternatives"
    else:
        return "Use batch (Airflow, dbt, scheduled SQL)"

# Fraud detection use case
fraud_answers = {
    'Do you need sub-second latency?': True,  # YES (100ms)
    'Is this latency critical to business?': True,  # YES (transactions blocked)
    'Do you have high-volume data (1000+ events/sec)?': True,  # YES (5k/sec)
    'Will streaming save significant costs vs batch?': False,  # NO (cost similar)
    'Does your team understand streaming systems?': True,  # YES
    'Can you maintain a streaming system 24/7?': True,  # YES
}

result = evaluate(fraud_answers)
print(f"Fraud Detection: {result}")
# Output: Use streaming (Kafka, Kinesis, Pulsar, RabbitMQ, etc.)

# Dashboard refresh use case
dashboard_answers = {
    'Do you need sub-second latency?': False,  # NO (15 minutes OK)
    'Is this latency critical to business?': False,  # NO
    'Do you have high-volume data (1000+ events/sec)?': False,  # NO (100/sec)
    'Will streaming save significant costs vs batch?': False,  # NO (batch is cheaper)
    'Does your team understand streaming systems?': True,  # YES
    'Can you maintain a streaming system 24/7?': True,  # YES
}

result = evaluate(dashboard_answers)
print(f"Dashboard Refresh: {result}")
# Output: Use batch (Airflow, dbt, scheduled SQL)

Key Lessons

  1. Real-time is an SLA, not an architecture.

    • If you don't need it, you're just adding complexity
    • Batch with 5-minute or 1-hour latency is often "real-time enough"
  2. The tool doesn't matter. Your requirement does.

    • Kafka, Kinesis, RabbitMQ, Pulsar—they're all fine
    • Pick based on your actual need, not hype or resume-building
  3. Most streaming use cases are actually batch.

    • Dashboard refreshing isn't streaming
    • Daily reports aren't streaming
    • Weekly summaries aren't streaming
    • These are batch operations with extra infrastructure costs
  4. Operational burden scales with complexity.

    • Every extra system is something that can break
    • Every extra on-call page is engineer unhappiness
    • Simple batch jobs are more stable than distributed streaming systems
  5. Let requirements drive architecture.

    • Not career progression
    • Not what you learned in a course
    • Not what the trendy companies use
    • What does the business actually need?

The Uncomfortable Question

When was the last time you asked: "Does this actually need to be real-time?"

Not:

  • "Can we make this real-time?"
  • "What's the coolest technology to use?"
  • "What would look good in our architecture diagram?"

But:

  • "Does the business actually need sub-second latency?"
  • "What latency would they actually accept?"
  • "What's the simplest system that meets this requirement?"

If you asked that question and changed your architecture to match, you'd probably:

  • Reduce costs by 50-80%
  • Reduce incident rate by 70-90%
  • Reduce operational burden significantly
  • Still deliver the same business value

Final Thought

The best architecture is not the most advanced. It's the simplest one that meets the requirement.

Kafka is incredible at what it does. RabbitMQ is great. Kinesis is solid. Pulsar is powerful.

But they're all overkill for batch jobs. And that's what most of your "real-time" systems actually are.

What would your architecture look like if you asked your business: "What latency actually matters?" first?
