DEV Community

Rizwan Saleem
Rizwan Saleem

Posted on

Building a Resilient Event-Driven Architecture with Apache Kafka and Stream Processing

Building a Resilient Event-Driven Architecture with Apache Kafka and Stream Processing

Building a Resilient Event-Driven Architecture with Apache Kafka and Stream Processing

In modern distributed systems, the ability to process data in real time while maintaining system resilience is critical. An event-driven architecture (EDA) decouples services through asynchronous messaging, allowing systems to scale independently and recover gracefully from failures. This guide walks you through designing and implementing a production-ready event-driven system using Apache Kafka for messaging and Apache Flink for stream processing.

Why Event-Driven Architecture?

Traditional request-response architectures tightly couple services, creating brittle systems where failures cascade. Event-driven architecture solves this by:

  • Decoupling: Services communicate through events, not direct calls
  • Scalability: Independent scaling of producers and consumers
  • Resilience: Failures don't cascade; events persist in brokers
  • Replayability: Historical events can be reprocessed for recovery or new features

System Overview

We'll build a real-time order processing system with these components:

Order Service → Kafka → Stream Processor → Kafka → Notification Service + Analytics DB
Enter fullscreen mode Exit fullscreen mode

Architecture Diagram

┌─────────────┐    events    ┌──────────┐   processed   ┌──────────────┐
│  Order      │ ────────────►│  Kafka   │ ─────────────►│  Flink       │
│  Service    │              │  Cluster │               │  Stream      │
└─────────────┘              └──────────┘               │  Processor   │
                                                        └──────┬───────┘
                                                               │
                    ┌──────────────────────────────────────────┘
                    │
         ┌──────────▼──────────┐    ┌──────────────────┐
         │  Notification       │    │  Analytics       │
         │  Service            │    │  Database        │
         └─────────────────────┘    └──────────────────┘
Enter fullscreen mode Exit fullscreen mode

Step 1: Setting Up Apache Kafka

Installing Kafka Locally

Download and extract Kafka:

wget https://archive.apache.org/dist/kafka/3.6.0/kafka_2.13-3.6.0.tgz
tar -xzf kafka_2.13-3.6.0.tgz
cd kafka_2.13-3.6.0
Enter fullscreen mode Exit fullscreen mode

Start Zookeeper (required for Kafka 3.x):

bin/zookeeper-server-start.sh config/zookeeper.properties
Enter fullscreen mode Exit fullscreen mode

Start Kafka broker in a new terminal:

bin/kafka-server-start.sh config/server.properties
Enter fullscreen mode Exit fullscreen mode

Creating Topics

Create three topics for our pipeline:

### Raw order events
bin/kafka-topics.sh create \
  bootstrap-server localhost:9092 \
  topic orders_raw \
  partitions 3 \
  replication-factor 1 \
  config retention.ms=604800000

### Processed orders
bin/kafka-topics.sh create \
  bootstrap-server localhost:9092 \
  topic orders_processed \
  partitions 3 \
  replication-factor 1

### Dead letter queue for failed processing
bin/kafka-topics.sh create \
  bootstrap-server localhost:9092 \
  topic orders_dlq \
  partitions 1 \
  replication-factor 1
Enter fullscreen mode Exit fullscreen mode

Key topic configurations:

  • Partitions: Determine parallelism (3 allows 3 concurrent consumers)
  • Retention: 7 days for raw orders allows replay
  • Replication-factor: 1 for local dev; use 3+ in production

Step 2: Building the Order Producer Service

Create a Node.js service that publishes order events:

// order-service/index.js
const { Kafka } = require('kafkajs');

const kafka = new Kafka({
  clientId: 'order-service',
  brokers: ['localhost:9092']
});

const producer = kafka.producer();

async function publishOrderEvent(order) {
  const event = {
    event_id: crypto.randomUUID(),
    event_type: 'ORDER_CREATED',
    event_timestamp: new Date().toISOString(),
    data: {
      order_id: order.id,
      user_id: order.userId,
      items: order.items,
      total_amount: order.total,
      currency: order.currency
    }
  };

  await producer.send({
    topic: 'orders_raw',
    messages: [{
      key: order.userId,
      value: JSON.stringify(event)
    }]
  });

  console.log(`Order ${order.id} published to Kafka`);
}

async function start() {
  await producer.connect();

  // Simulate order creation
  setInterval(() => {
    const order = {
      id: `order_${Date.now()}`,
      userId: `user_${Math.floor(Math.random() * 100)}`,
      items: [{ product: 'laptop', quantity: 1, price: 999.99 }],
      total: 999.99,
      currency: 'USD'
    };
    publishOrderEvent(order);
  }, 2000);
}

start().catch(console.error);
Enter fullscreen mode Exit fullscreen mode

Install dependencies:

npm init -y
npm install kafkajs crypto
Enter fullscreen mode Exit fullscreen mode

Key Producer Patterns

Partitioning by key: Using userId ensures all orders from one user go to the same partition, maintaining order .

Error handling: Always implement retry logic with exponential backoff:

async function publishWithRetry(order, maxRetries = 3) {
  for (let i = 0; i < maxRetries; i++) {
    try {
      await publishOrderEvent(order);
      return;
    } catch (error) {
      if (i === maxRetries - 1) throw error;
      await new Promise(r => setTimeout(r, Math.pow(2, i) * 1000));
    }
  }
}
Enter fullscreen mode Exit fullscreen mode

Step 3: Stream Processing with Apache Flink

Setting Up Flink

Download Flink:

wget https://archive.apache.org/dist/flink/flink-1.17.1/flink-1.17.1-bin-scala_2.12.tgz
tar -xzf flink-1.17.1-bin-scala_2.12.tgz
cd flink-1.17.1
./bin/start-cluster.sh
Enter fullscreen mode Exit fullscreen mode

Access the Flink dashboard at http://localhost:8081.

Creating the Stream Processor (Java)

// OrderProcessor.java
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;

public class OrderProcessor {
    public static class OrderEvent {
        public String event_id;
        public String event_type;
        public String event_timestamp;
        public OrderData data;
    }
    public static class OrderData {
        public String order_id;
        public String user_id;
        public Item[] items;
        public double total_amount;
        public String currency;
    }
    public static class ProcessedOrder {
        public String order_id;
        public String user_id;
        public double total_amount;
        public String status;
        public String processed_at;
        public boolean high_value;
    }
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(3);
        // Kafka source configuration
        KafkaSource<String> source = KafkaSource.<String>builder()
            .setBootstrapServers("localhost:9092")
            .setTopics("orders_raw")
            .setGroupId("flink-order-processor")
            .setStartingOffsets(OffsetsInitializer.earliest())
            .setValueOnlyDeserializer(new SimpleStringSchema())
            .build();
        DataStream<String> rawStream = env.fromSource(source, WatermarkStrategy.noWatermarks(), "Kafka Source");
        // Process orders: validate, enrich, transform
        DataStream<ProcessedOrder> processedStream = rawStream
            .map((MapFunction<String, ProcessedOrder>) json -> {
                // Parse JSON (use Jackson/Gson in production)
                OrderEvent event = parseJson(json, OrderEvent.class);
                ProcessedOrder order = new ProcessedOrder();
                order.order_id = event.data.order_id;
                order.user_id = event.data.user_id;
                order.total_amount = event.data.total_amount;
                order.status = "PROCESSED";
                order.processed_at = java.time.Instant.now().toString();
                order.high_value = event.data.total_amount > 500;
                return order;
            })
            .name("Order Transformation");
        // Sink to processed orders topic
        KafkaSink<String> sink = KafkaSink.<String>builder()
            .setBootstrapServers("localhost:9092")
            .setRecordSerializer((record, value) -> {
                record.value(value);
                record.key("default");
            })
            .build();
        processedStream.sinkTo(sink)
            .name("Kafka Sink")
            .setParallelism(3);
        env.execute("Order Processing Pipeline");
    }
    private static <T> T parseJson(String json, Class<T> clazz) {
        // Implementation using Jackson ObjectMapper
        // Returns parsed object
        return null;
    }
}
Enter fullscreen mode Exit fullscreen mode

Building and Deploying

### Build JAR with Maven
mvn clean package

### Submit to Flink cluster
./bin/flink run -c OrderProcessor target/order-processor-1.0.jar
Enter fullscreen mode Exit fullscreen mode

Stream Processing Patterns Implemented

Stateful processing: Flink maintains state for aggregations and windowing

Exactly-once semantics: Kafka + Flink combination guarantees each event processed once

Error handling with DLQ: Route failed records to dead letter queue:

DataStream<String> errorStream = processedStream
    .filter(order -> order == null)
    .map(error -> "{\"error\": \"processing_failed\", \"original\": " + originalJson + "}")
    .sinkTo(dlqSink);
Enter fullscreen mode Exit fullscreen mode

Step 4: Consumer Services

Notification Service (Python)

### notification-service/main.py
from confluent_kafka import Consumer, KafkaException
import json
import requests

conf = {
    'bootstrap.servers': 'localhost:9092',
    'group.id': 'notification-service',
    'auto.offset.reset': 'earliest'
}

consumer = Consumer(conf)
consumer.subscribe(['orders_processed'])

def send_notification(order):
    """Send email/SMS notification"""
    if order['high_value']:
        # Send priority notification
        requests.post('https://api.notifications.com/priority', json={
            'user_id': order['user_id'],
            'order_id': order['order_id'],
            'amount': order['total_amount'],
            'channel': ['email', 'sms']
        })
    else:
        # Standard notification
        requests.post('https://api.notifications.com/standard', json={
            'user_id': order['user_id'],
            'order_id': order['order_id']
        })

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None:
            continue
        if msg.error():
            raise KafkaException(msg.error())
        order = json.loads(msg.value().decode('utf-8'))
        send_notification(order)
        print(f"Notification sent for order {order['order_id']}")
except KeyboardInterrupt:
    pass
finally:
    consumer.close()
Enter fullscreen mode Exit fullscreen mode

Install dependencies:

pip install confluent-kafka requests
Enter fullscreen mode Exit fullscreen mode

Consumer Group Best Practices

  • Group ID: Each service type uses unique group ID for independent consumption
  • Offset management: Auto-commit enabled; disable for manual control in critical paths
  • Parallelism: Number of consumers ≤ number of partitions

Step 5: Monitoring and Observability

Kafka Metrics dashboard

Track these critical metrics:

### monitoring/kafka_metrics.py
import requests
import time

KAFKA_JMX_PORT = 9999

def get_consumer_lag(topic, group_id):
    """Get lag for consumer group"""
    # Use Kafka REST proxy or JMX
    lag = calculate_lag(topic, group_id)
    return lag

def monitor_pipeline():
    while True:
        lag = get_consumer_lag('orders_processed', 'notification-service')
        if lag > 1000:
            alert(f"High consumer lag: {lag}")
        time.sleep(30)
Enter fullscreen mode Exit fullscreen mode

Flink Monitoring

Access Flink web UI at http://localhost:8081 to monitor:

  • Checkpoint status and duration
  • Backpressure indicators
  • Task throughput and latency

Distributed Tracing

Integrate OpenTelemetry for end-to-end tracing:

// Add to order-service
const { NodeTracerProvider } = require('@opentelemetry/node');
const { SimpleSpanProcessor } = require('@opentelemetry/tracing');
const { JaegerExporter } = require('@opentelemetry/exporter-jaeger');

const provider = new NodeTracerProvider();
const exporter = new JaegerExporter({
  endpoint: 'http://localhost:14268/api/traces'
});
provider.addSpanProcessor(new SimpleSpanProcessor(exporter));
provider.register();
Enter fullscreen mode Exit fullscreen mode

Step 6: Production Deployment Checklist

Infrastructure Requirements

| Component | Minimum | Production |
|-|-||
| Kafka brokers | 1 | 3+ (odd number) |
| Zookeeper nodes | 1 | 3+ |
| Flink TaskManagers | 1 | 5+ with HA |
| Replication factor | 1 | 3 |
| Partitions per topic | 1 | 12+ (scale based on throughput) |

Key Production Configurations

Kafka broker configuration (server.properties):

num.partitions=12
default.replication.factor=3
min.insync.replicas=2
log.retention.hours=168
auto.create.topics.enable=false
Enter fullscreen mode Exit fullscreen mode

Flink high-availability:

### flink-conf.yaml
high-availability: zookeeper
high-availability.zookeeper.quorum: zk1:2181,zk2:2181,zk3:2181
high-availability.storageDir: s3://flink-ha/
state.backend: rocksdb
state.checkpoints.dir: s3://flink-checkpoints/
Enter fullscreen mode Exit fullscreen mode

Security Hardening

  1. Enable SSL/TLS for all broker communication
  2. SASL authentication (SCRAM or OAuth2)
  3. Authorization with ACLs per topic/group
  4. Network segmentation - isolate broker network
### Create ACL for consumer group
bin/kafka-acls.sh authorizer-properties zookeeper.connect=localhost:2181 \
  add allow-principal User:notification-service \
  operation Read topic orders_processed \
  group notification-service
Enter fullscreen mode Exit fullscreen mode

Step 7: Testing Strategies

Local Testing with TestContainers

// OrderProcessorTest.java
@Testcontainers
public class OrderProcessorTest {
    @Container
    static KafkaContainer kafka = new KafkaContainer(
        DockerImageName.parse("confluentinc/cp-kafka:7.4.0")
    );
    @Test
    public void testOrderProcessing() throws Exception {
        // Produce test event
        produceTestEvent();
        // Verify processed output
        ProcessedOrder result = consumeProcessedEvent();
        assertThat(result.getStatus()).isEqualTo("PROCESSED");
        assertThat(result.isHighValue()).isTrue();
    }
}
Enter fullscreen mode Exit fullscreen mode

Chaos Testing

Simulate failures to validate resilience:

### Kill a Kafka broker
docker stop kafka-broker-2

### Verify system continues processing
### Checkpoint recovery should occur automatically
Enter fullscreen mode Exit fullscreen mode

Common Pitfalls and Solutions

Pitfall 1: Partition Key Misconfiguration

Problem: All events go to one partition, limiting parallelism

Solution: Choose partition key based on cardinality needs:

// BAD: All events to same partition
{ key: 'constant', value: event }

// GOOD: User-based partitioning maintains order per user
{ key: userId, value: event }

// GOOD: Round-robin for unordered events
{ key: null, value: event }
Enter fullscreen mode Exit fullscreen mode

Pitfall 2: Ignoring Backpressure

Problem: Consumer can't keep up, memory explodes

Solution: Implement backpressure handling:

// Flink automatically handles backpressure
// But monitor and scale consumers accordingly
env.setParallelism(calculateRequiredParallelism());
Enter fullscreen mode Exit fullscreen mode

Pitfall 3: No Dead Letter Queue

Problem: Poison messages block entire partition

Solution: Always implement DLQ pattern:

try:
    process(message)
except Exception as e:
    produce_to_dlq(message, str(e))
    # Don't re-raise; acknowledge message
Enter fullscreen mode Exit fullscreen mode

Performance Optimization Tips

Kafka Tuning

### Producer tuning
batch.size=32768
linger.ms=20
compression.type=lz4

### Consumer tuning
fetch.min.bytes=1048576
fetch.max.wait.ms=500
max.poll.records=1000
Enter fullscreen mode Exit fullscreen mode

Flink Tuning

// Enable incremental checkpoints
env.enableCheckpointing(60000); // 1 minute
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(30000);
env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);

// Configure state backend
env.setStateBackend(new RocksDBStateBackend("hdfs://path/to/backups"));
Enter fullscreen mode Exit fullscreen mode

Next Steps and Extensions

Add Real-time Analytics

Extend the pipeline with windowed aggregations:

// 5-minute rolling window of order totals
processedStream
    .keyBy(order -> order.user_id)
    .window(TumblingProcessingTimeWindows.of(Time.minutes(5)))
    .sum("total_amount")
    .sinkTo(analyticsSink);
Enter fullscreen mode Exit fullscreen mode

Implement Event Sourcing

Store all state changes as events for complete audit trail and time-travel debugging.

Add Schema Registry

Use Confluent Schema Registry for schema evolution and compatibility checking:

const { KafkaSchemaRegistry } = require('@ konveyor/schema-registry');

const schemaRegistry = new KafkaSchemaRegistry({
  url: 'http://localhost:8081'
});

await schemaRegistry.register('orders_raw', orderSchema);
Enter fullscreen mode Exit fullscreen mode

Conclusion

Building a resilient event-driven architecture requires careful attention to messaging patterns, state management, and operational excellence. By combining Apache Kafka's durable messaging with Flink's powerful stream processing, you create systems that are:

  • Scalable: Handle millions of events per second
  • Resilient: Survive component failures without data loss
  • Maintainable: Clear separation of concerns through event contracts
  • Observable: Complete visibility into data flow

Start small with a single pipeline, establish monitoring and testing practices, then gradually expand to more complex event-driven patterns. The investment in proper event-driven architecture pays dividends in system reliability and team velocity.

Key takeaways:

  1. Always design for failure-implement DLQs and retry patterns
  2. Monitor consumer lag and backpressure continuously
  3. Use partition keys wisely to balance order and parallelism
  4. Test chaos scenarios regularly to validate resilience
  5. Plan for schema evolution from day one

Ready to implement? Start by setting up local Kafka and Flink, then build your first producer-consumer pipeline following the patterns in this guide.
Need help implementing event-driven architecture for your specific use case? Share your challenges and I'll provide targeted guidance.

-

Rizwan Saleem | https://rizwansaleem.co

Sources

Top comments (0)