Event-driven microservices communicate through messages rather than direct HTTP calls. Instead of Order Service calling Payment Service directly, Order Service publishes an "OrderCreated" event to Kafka. Payment Service consumes that event and processes the payment. This decouples services—Order Service doesn't need to know Payment Service exists or wait for it to respond.
The monitoring challenge appears when something breaks. A customer reports their order didn't ship. Which service failed? Did the event get published? Did all consumers receive it? Did processing fail silently? Traditional HTTP-based tracing doesn't work because there's no synchronous call chain to follow.
Kafka Architecture
Apache Kafka stores events in topics. Producers write events to topics. Consumers read events from topics. Each topic splits into partitions for parallel processing. Kafka replicates partitions across brokers for fault tolerance.
A typical microservices setup has one Kafka cluster with multiple topics organized by business domain. An e-commerce platform might have orders, payments, inventory, and shipping topics. Services produce events to relevant topics and consume from topics they care about.
Consumer groups enable parallel processing. Multiple consumer instances in the same group share partition processing. Kafka guarantees each partition is consumed by only one instance in a group, but different groups can consume the same events independently.
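A consumer group is just a client setting. As a rough sketch (assuming Spring Kafka, a broker at localhost:9092, and JSON-encoded Order events), every Payment Service instance configured with the same group.id splits the orders partitions with its peers, while other groups read the same events independently:
@Configuration
public class KafkaConsumerConfig {
    @Bean
    public ConsumerFactory<String, Order> consumerFactory() {
        Map<String, Object> config = new HashMap<>();
        config.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // Every instance started with this group.id shares the topic's partitions
        config.put(ConsumerConfig.GROUP_ID_CONFIG, "payment-service");
        return new DefaultKafkaConsumerFactory<>(config,
            new StringDeserializer(),
            new JsonDeserializer<>(Order.class));
    }
}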
Event Flow Observability
The core challenge with event-driven systems is tracking event flow across async boundaries. When Order Service publishes an event, that event might trigger actions in five different services. If one service fails to process the event, how do you detect it?
Trace Context Propagation
OpenTelemetry solves this by propagating trace context through Kafka messages. The producer injects trace context into message headers. Consumers extract that context and continue the trace.
// Producer side - inject trace context
public void publishOrder(Order order) {
ProducerRecord<String, Order> record =
new ProducerRecord<>("orders", order.getId(), order);
// OpenTelemetry automatically injects trace context into headers
kafkaProducer.send(record);
}
// Consumer side - extract trace context
@KafkaListener(topics = "orders")
public void processOrder(ConsumerRecord<String, Order> record) {
// OpenTelemetry automatically extracts trace context from headers
// and continues the trace
Order order = record.value();
paymentService.processPayment(order);
}
With trace context propagation, you see the complete event journey. The trace shows when the producer published the event, when each consumer received it, how long processing took, and any errors that occurred.
Event Correlation
Events often trigger chains of other events. OrderCreated might trigger InventoryReserved, which triggers ShippingScheduled. Trace context links these events into a single distributed trace.
When debugging a failed order, you query for traces containing that order ID. The trace shows every event in the chain, which services processed them, and where the chain broke.
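One way to make that query reliable is to stamp the order ID on every span in the chain. A minimal sketch using OpenTelemetry baggage (the order.id attribute name and publish method are illustrative, and baggage propagation must be enabled in your OpenTelemetry configuration):
public void publishOrderCreated(Order order) {
    // Carry the order ID in baggage so downstream consumers can copy it
    // onto their own spans, making every trace in the chain searchable by order ID.
    Baggage baggage = Baggage.current().toBuilder()
        .put("order.id", order.getId())
        .build();
    try (Scope scope = baggage.makeCurrent()) {
        Span.current().setAttribute("order.id", order.getId());
        kafkaProducer.send(new ProducerRecord<>("orders", order.getId(), order));
    }
}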
Kafka Metrics Monitoring
Kafka exposes metrics through JMX (Java Management Extensions). These metrics cover broker health, topic performance, and consumer behavior.
Broker metrics track cluster health. Monitor UnderReplicatedPartitions to detect replication issues. Track OfflinePartitionsCount to catch partition failures. Watch RequestHandlerAvgIdlePercent to identify broker overload.
Topic metrics show event throughput. MessagesInPerSec indicates producer activity. BytesInPerSec and BytesOutPerSec track network usage. Sharp drops suggest producer failures. Sharp spikes might indicate retry storms.
Consumer metrics reveal processing health. records-lag-max shows how far behind consumers are. Growing lag means consumers can't keep up with event volume. fetch-rate indicates consumer throughput. Low fetch rates suggest processing bottlenecks.
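If you use Micrometer, these client-side metrics can be bridged into your own registry with KafkaClientMetrics; a sketch, assuming you construct the consumer yourself rather than letting a framework manage it:
// Bind the consumer's built-in metrics (records-lag-max, fetch-rate, ...) to Micrometer
KafkaConsumer<String, Order> consumer = new KafkaConsumer<>(consumerProperties);
KafkaClientMetrics clientMetrics = new KafkaClientMetrics(consumer);
clientMetrics.bindTo(meterRegistry);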
Consumer Lag
Consumer lag measures how far behind a consumer is from the latest message in a partition. Lag of 1000 means the consumer is 1000 messages behind the producer.
Small lag is normal—consumers process events slightly after production. Growing lag indicates problems. The consumer might be processing too slowly, crashing repeatedly, or stuck on a bad message.
Monitor lag per partition and per consumer group:
@Component
public class ConsumerLagMonitor {
    private final MeterRegistry registry;
    private final AdminClient adminClient;
    // Strongly-held gauge values so the registered gauges keep reporting fresh lag
    private final Map<TopicPartition, AtomicLong> lagByPartition = new ConcurrentHashMap<>();

    public ConsumerLagMonitor(MeterRegistry registry, AdminClient adminClient) {
        this.registry = registry;
        this.adminClient = adminClient;
    }

    @Scheduled(fixedRate = 30000)
    public void recordLag() throws Exception {
        // Offsets committed by the consumer group
        Map<TopicPartition, OffsetAndMetadata> committed = adminClient
            .listConsumerGroupOffsets("payment-service")
            .partitionsToOffsetAndMetadata().get();

        // Latest offsets for the same partitions
        Map<TopicPartition, OffsetSpec> latestSpec = committed.keySet().stream()
            .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest()));
        Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> endOffsets =
            adminClient.listOffsets(latestSpec).all().get();

        committed.forEach((partition, offsetMetadata) -> {
            long lag = endOffsets.get(partition).offset() - offsetMetadata.offset();
            lagByPartition.computeIfAbsent(partition, tp ->
                registry.gauge("kafka.consumer.lag",
                    Tags.of("topic", tp.topic(),
                            "partition", String.valueOf(tp.partition())),
                    new AtomicLong()))
                .set(lag);
        });
    }
}
Set alerts on lag thresholds. Alert when lag exceeds 10,000 messages or grows by 50% in 5 minutes. This catches issues before they impact users.
Dead Letter Queues
Dead Letter Queues (DLQs) store events that consumers can't process. When a consumer encounters a poison message that causes crashes or exceeds retry attempts, it publishes that message to a DLQ instead of blocking the partition.
@Service
public class OrderConsumer {
    private final KafkaTemplate<String, Order> kafkaTemplate;

    public OrderConsumer(KafkaTemplate<String, Order> kafkaTemplate) {
        this.kafkaTemplate = kafkaTemplate;
    }

    @KafkaListener(topics = "orders")
    public void processOrder(ConsumerRecord<String, Order> record) {
        try {
            Order order = record.value();
            processOrderLogic(order);
        } catch (RetryableException e) {
            // Rethrow so Spring Kafka's error handler retries the record
            throw e;
        } catch (Exception e) {
            // Non-retryable error - send to DLQ so the partition isn't blocked
            kafkaTemplate.send("orders.dlq", record.key(), record.value());
            // Log with trace context for debugging
            Span span = Span.current();
            span.addEvent("sent_to_dlq",
                Attributes.of(
                    AttributeKey.stringKey("error"), e.getMessage(),
                    AttributeKey.stringKey("order_id"), record.key()));
        }
    }
}
Monitor DLQ growth. A growing DLQ indicates systematic problems—bad data format, downstream service failures, or logic bugs. Set up alerts when DLQ message count increases.
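A lightweight way to track DLQ growth (assuming the orders.dlq topic from the snippet above and a Micrometer registry) is a dedicated listener that only counts:
@Component
public class DlqMonitor {
    private final Counter dlqCounter;

    public DlqMonitor(MeterRegistry registry) {
        this.dlqCounter = Counter.builder("orders.dlq.messages")
            .description("Messages arriving in the orders dead letter queue")
            .register(registry);
    }

    // Separate consumer group so counting doesn't interfere with reprocessing
    @KafkaListener(topics = "orders.dlq", groupId = "dlq-monitor")
    public void onDlqMessage(ConsumerRecord<String, Order> record) {
        dlqCounter.increment();
    }
}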
Event Schema Management
Event schemas define the structure of messages. Services that produce and consume events must agree on schemas. Schema evolution lets you change event structure without breaking consumers.
Schema Registry stores event schemas and enforces compatibility rules. When a producer publishes an event, Schema Registry validates it against the registered schema. This prevents producers from sending malformed events.
@Configuration
public class KafkaProducerConfig {
@Bean
public ProducerFactory<String, Order> producerFactory() {
Map<String, Object> config = new HashMap<>();
config.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
config.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
StringSerializer.class);
config.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
KafkaAvroSerializer.class);
config.put("schema.registry.url", "http://localhost:8081");
return new DefaultKafkaProducerFactory<>(config);
}
}
Track schema validation failures in metrics. Failed validations indicate incompatible changes or buggy producers. Monitor schema.registry.requests to ensure registry availability.
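On the producer side, schema-compatibility problems typically surface as serialization errors on send. A sketch that counts them (the metric name and kafkaTemplate/meterRegistry fields are illustrative):
public void publishOrder(Order order) {
    try {
        kafkaTemplate.send("orders", order.getId(), order);
    } catch (SerializationException e) {
        // The Avro serializer rejects events that don't match the registered schema
        meterRegistry.counter("schema.validation.failures", "topic", "orders").increment();
        throw e;
    }
}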
Monitoring Event Processing
Each consumer should emit metrics about event processing. Track processing duration, success rate, and error types.
@Service
public class PaymentProcessor {
private final Timer processingTimer;
private final Counter successCounter;
private final Counter errorCounter;
public PaymentProcessor(MeterRegistry registry) {
this.processingTimer = Timer.builder("payment.processing.duration")
.register(registry);
this.successCounter = Counter.builder("payment.processing.success")
.register(registry);
this.errorCounter = Counter.builder("payment.processing.errors")
.tag("type", "unknown")
.register(registry);
}
public void processPayment(Order order) {
processingTimer.record(() -> {
try {
executePayment(order);
successCounter.increment();
} catch (PaymentDeclinedException e) {
errorCounter.increment();
throw e;
}
});
}
}
Graph these metrics alongside Kafka metrics. Correlate consumer lag spikes with processing duration increases or error rate jumps. This reveals whether lag comes from slow processing or actual failures.
Debugging Event Flows
When investigating issues, start with traces. Search for traces containing the problematic event ID or customer ID. The trace shows the complete event flow—which services received the event, processing times, and any errors.
If the event never reached a consumer, check producer metrics. Did the event get published? Check topic metrics for the timestamp. If published but not consumed, check consumer lag. Is the consumer running? Is it stuck on an earlier message?
For events that reached consumers but failed processing, examine error logs within the trace context. OpenTelemetry correlates logs with traces, letting you see exact error messages for failed event processing.
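Even before full log correlation is wired up, you can put the active trace ID into error logs so a log line leads straight to its trace; a small sketch (the logger and helper method are illustrative):
private static final Logger log = LoggerFactory.getLogger(PaymentProcessor.class);

private void logProcessingFailure(Order order, Exception e) {
    // The trace ID printed here can be pasted straight into a trace search
    String traceId = Span.current().getSpanContext().getTraceId();
    log.error("Processing failed for order {}, traceId={}", order.getId(), traceId, e);
}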
Kafka-Specific Instrumentation
OpenTelemetry automatically instruments Kafka clients, capturing producer sends and consumer receives as spans. For custom business logic, add manual instrumentation:
@Service
public class InventoryService {
    private final Tracer tracer;

    public InventoryService(Tracer tracer) {
        this.tracer = tracer;
    }

    @KafkaListener(topics = "orders")
    public void reserveInventory(ConsumerRecord<String, Order> record) {
        Order order = record.value();
        Span span = tracer.spanBuilder("reserve-inventory")
            .setAttribute("order.id", record.key())
            .setAttribute("order.items", order.getItems().size())
            .startSpan();
        try (Scope scope = span.makeCurrent()) {
            checkAvailability(order);
            span.addEvent("availability_checked");
            reserveStock(order);
            span.addEvent("stock_reserved");
            publishInventoryReserved(order);
        } catch (OutOfStockException e) {
            span.recordException(e);
            span.setStatus(StatusCode.ERROR, "Out of stock");
            publishInventoryUnavailable(order);
        } finally {
            span.end();
        }
    }
}
This creates detailed traces showing internal service operations, not just Kafka boundaries.
Health Checks for Event Consumers
Consumer health checks verify consumers are actively processing events. Check consumer group membership, partition assignments, and last processing timestamp.
@Component
public class KafkaConsumerHealthIndicator implements HealthIndicator {
    private final KafkaListenerEndpointRegistry registry;
    private volatile Instant lastProcessedTime = Instant.now();

    public KafkaConsumerHealthIndicator(KafkaListenerEndpointRegistry registry) {
        this.registry = registry;
    }

    @Override
    public Health health() {
        if (Duration.between(lastProcessedTime, Instant.now())
                .compareTo(Duration.ofMinutes(5)) > 0) {
            return Health.down()
                .withDetail("lastProcessed", lastProcessedTime)
                .withDetail("reason", "No events processed in 5 minutes")
                .build();
        }
        boolean allRunning = registry.getAllListenerContainers()
            .stream()
            .allMatch(MessageListenerContainer::isRunning);
        if (!allRunning) {
            return Health.down()
                .withDetail("reason", "Some consumers not running")
                .build();
        }
        return Health.up().build();
    }

    @EventListener
    public void onMessageProcessed(MessageProcessedEvent event) {
        lastProcessedTime = Instant.now();
    }
}
Kubernetes readiness probes use these health checks to stop routing traffic to unhealthy consumers.
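A readiness probe pointing at that health endpoint might look like the following sketch (the Actuator path, port, and timings are assumptions, and the indicator must be included in the readiness group):
readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 8080
  initialDelaySeconds: 20
  periodSeconds: 10
  failureThreshold: 3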
OpenTelemetry Integration
OpenTelemetry provides unified instrumentation for Kafka-based microservices. Add the OpenTelemetry Kafka instrumentation library:
<dependency>
<groupId>io.opentelemetry.instrumentation</groupId>
<artifactId>opentelemetry-kafka-clients-2.6</artifactId>
</dependency>
Configure the exporter to send telemetry data:
otel.service.name=payment-service
otel.traces.exporter=otlp
otel.metrics.exporter=otlp
otel.exporter.otlp.endpoint=http://localhost:4317
Uptrace receives this telemetry and correlates traces across Kafka boundaries. When you search for an order ID, you see the complete event flow through all microservices—producer, Kafka, and all consumers—in a single trace.
For Java microservices using Spring Boot, check out Spring Boot microservices monitoring. For containerized deployments, see Kubernetes microservices monitoring.
Monitoring Best Practices
Track event processing duration at p50, p95, and p99 percentiles. Averages hide outliers that cause real user impact. Alert on p99 latency increases, not just averages.
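With Micrometer, percentiles are opt-in on the timer itself; a sketch extending the payment timer from earlier:
this.processingTimer = Timer.builder("payment.processing.duration")
    // Publish tail latency (p50/p95/p99) alongside the mean
    .publishPercentiles(0.5, 0.95, 0.99)
    .publishPercentileHistogram()
    .register(registry);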
Monitor consumer lag per partition, not just per topic. One slow partition can block processing while others remain healthy. Identify which partitions accumulate lag to find root causes.
Set up dashboards showing event flow rates through the system. Graph messages produced to each topic, messages consumed from each topic, and processing duration per service. Visualizing these together reveals bottlenecks immediately.
Implement circuit breakers for downstream dependencies. When a service that consumers call fails, the circuit breaker stops consumers from getting stuck retrying the failed calls, so consumer lag doesn't build up behind a broken dependency.
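As one possible shape for this (using Resilience4j, which is not part of the setup above; the payment gateway call is illustrative):
private final CircuitBreaker paymentGatewayBreaker =
    CircuitBreaker.ofDefaults("payment-gateway");

public void processPayment(Order order) {
    try {
        // After repeated failures the breaker opens and calls fail fast,
        // so the consumer keeps draining its partition instead of hanging on retries
        paymentGatewayBreaker.executeRunnable(() -> paymentGateway.charge(order));
    } catch (CallNotPermittedException e) {
        // Circuit is open: park the event in the DLQ rather than blocking the partition
        kafkaTemplate.send("orders.dlq", order.getId(), order);
    }
}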
Getting Started
Start by adding OpenTelemetry instrumentation to your Kafka producers and consumers. This gives you basic trace propagation without code changes beyond configuration.
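The lowest-effort path is the OpenTelemetry Java agent, which instruments Kafka clients without touching application code; roughly (paths and ports are placeholders):
java -javaagent:/path/to/opentelemetry-javaagent.jar \
     -Dotel.service.name=payment-service \
     -Dotel.exporter.otlp.endpoint=http://localhost:4317 \
     -jar payment-service.jar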
Export Kafka JMX metrics to your monitoring system. Focus on consumer lag, broker health, and topic throughput. Set initial alerting thresholds conservatively and adjust based on observed patterns.
Implement Dead Letter Queues for non-retryable errors. Monitor DLQ growth to catch systematic issues early.
Add custom spans for business operations within consumers. This shows what your service actually does with events, not just that it received them.
Ready to monitor your event-driven microservices? Start with Uptrace for complete visibility across Kafka, producers, and consumers.