Mohammed Saifulhuq

Posted on Jun 1

Kafka DLQ Messages Stuck After Schema Change? Here's Why and How to Fix It

#kafka #java #opensource #distributedsystems

I asked 4 senior Kafka engineers this question on Reddit. Nobody named a tool. So I built one.

You deployed a hotfix at 9:45 PM.

The fix was correct. The status field had been accepting invalid values — "ok", "done", "finished" — from different teams. Converting it to a strict Enum with four valid values was the right call: PENDING, PROCESSING, COMPLETED, FAILED.

Nobody checked the DLQ first.

There were 23,000 payment events sitting in payments.dlq from a consumer failure that afternoon. Your new V2 consumer tries to process them. Each one fails immediately:

com.fasterxml.jackson.databind.exc.InvalidFormatException: 
Cannot deserialize value of type `PaymentStatus` from String "pending": 
not one of the values accepted for Enum class: 
[PENDING, PROCESSING, COMPLETED, FAILED]

PagerDuty fires at 2:00 AM.

You cannot roll back — the bug fix must stay. You cannot delete the messages — they are real payment transactions. You cannot redrive them as-is — the V2 consumer will reject every one.

You need to transform the messages first. And there is no tool for this.

Why Schema Registry Doesn't Solve This

The first suggestion you'll get is "just use Schema Registry."

Schema Registry is a prevention tool. It sits between your producer and Kafka and enforces compatibility for new messages being produced.

It does nothing for messages already sitting in your DLQ.

Those 23,000 messages are stored as bytes inside Kafka. Schema Registry cannot see them. Cannot transform them. Cannot validate them. Cannot redrive them.

This is a recovery problem, not a prevention problem. They are different.

What Senior Engineers Actually Do (This Is The Problem)

I posted this exact scenario on r/apachekafka and asked how teams handle it. Four experienced engineers replied. Not one named a tool.

BroBroMate: "Something I've done in the past is just throw a quick Kafka Streams app up to do mass transformations... easy to unit test before rolling out."

KTCrisis: "You could spin up a v2 topic and keep a specific v1 consumer around just to drain the DLQ. But it adds a new topic and a dedicated consumer for a one-shot issue."

Every solution is one of three things:

Build a temporary Kafka Streams application, use it once, delete it
Keep zombie v1 consumer infrastructure alive until the queue drains
Write a throwaway Python/Java script with no tests, no validation, no audit trail

Time cost: 3–8 hours per incident. At 2am. Every single time.

The Existing Tools and Their Gaps

Before building anything, I looked at what exists:

Tool	Browse DLQ	Transform Schema	Idempotent Redrive
DLQMan (irori-ab)	✓	✗ No	✓ lifecycle
Confluent Control Center	✓	✗ No	Partial
kafka-rewind-tools	✗	✗ No	✗
Custom Kafka Streams	Manual	Manual	Manual

The transformation step — converting broken message format to valid format before redriving — exists nowhere as a productized tool.

Building DLQ Revive

I spent 5 weeks building DLQ Revive — an open-source Kafka Dead Letter Queue mutation and redrive engine.

Here are the four decisions that matter most.

Decision 1: assign() + seek(), Never subscribe()

This is the most critical Kafka safety decision in the codebase.

// ❌ WRONG - what most tutorials show
consumer.subscribe(List.of("payments.dlq")); 
// This JOINS your consumer group.
// Kafka can trigger rebalancing and assign this partition
// to your read-only viewer tool instead of your production consumer.
// Your debug tool just took down your live pipeline.

// ✅ CORRECT - what DLQ Revive does
TopicPartition tp = new TopicPartition("payments.dlq", partition);
consumer.assign(List.of(tp));
consumer.seek(tp, fromOffset);
// Direct partition access. No group membership.
// No rebalance risk. Production consumer untouched.
// NEVER calls commitSync() in view mode.

subscribe() participates in Kafka's consumer group protocol. When you subscribe, Kafka's group coordinator can reassign partitions at any time during a rebalance. Your "read-only" DLQ viewer can suddenly become the assigned consumer for a production partition — reading and potentially skipping messages your application needs.

assign() + seek() bypasses all of this. You specify exactly which partition and offset. You read exactly limit records and stop. The rest of the cluster has no idea you exist.

Decision 2: JSONata for Transformation, Not Groovy

The transformation engine was the most debated architectural decision.

The obvious first choice was Groovy — it's powerful, familiar to Java developers, and can handle any transformation logic.

I rejected it immediately.

User-submitted Groovy executes arbitrary Java code on your backend. A careless engineer could write:

// This is a valid Groovy transformation that will be executed:
Runtime.getRuntime().exec("rm -rf /")
// Or extract your AWS credentials:
System.getenv("AWS_SECRET_ACCESS_KEY")

This is an RCE vulnerability built into the product's core feature.

JSONata is purely declarative — a JSON-to-JSON mapping language with no access to the file system, network, or system calls. For our String to Enum scenario:

{
  "orderId": orderId,
  "amount": amount,
  "currency": currency,
  "status": $uppercase(status),
  "processedAt": $now()
}

One expression. Zero RCE surface. Applied to all 23,000 messages identically.

Decision 3: Idempotency at the Kafka Offset Level

What happens if your application crashes halfway through redriving 10,000 messages?

Without idempotency: it restarts, processes the first 5,000 messages again, double-charges 5,000 customers.

DLQ Revive records every message before producing it:

-- Table created on startup:
CREATE TABLE redrive_log (
  id BIGSERIAL PRIMARY KEY,
  topic VARCHAR(255),
  partition INT,
  offset BIGINT,
  redriven_at TIMESTAMP,
  redriven_by VARCHAR(100),
  UNIQUE(topic, partition, offset)  -- ← this is the guarantee
);

// Before producing ANY message:
boolean alreadyRedriven = jdbcTemplate.queryForObject(
    "SELECT COUNT(*) FROM redrive_log WHERE topic=? AND partition=? AND offset=?",
    Integer.class, topic, partition, offset
) > 0;

if (alreadyRedriven) {
    skippedCount++;
    continue;  // Skip. Already processed.
}

// INSERT before produce — not after:
jdbcTemplate.update(
    "INSERT INTO redrive_log (topic, partition, offset, redriven_at) VALUES (?,?,?,NOW())",
    topic, partition, offset
);

kafkaTemplate.send(targetTopic, transformedMessage).get();

The UNIQUE constraint makes this bulletproof at the database level. Even if two concurrent requests try to redrive the same offset, only one INSERT succeeds. The second gets a DataIntegrityViolationException and skips the message.

Pod restart during a 10k bulk redrive cannot double-process a single message.

Decision 4: Cursor-Based Pagination at the Kafka Offset Level

Standard REST pagination (page 1, page 2) breaks for Kafka because page boundaries change as messages are consumed.

DLQ Revive uses cursor-based pagination at the offset level:

GET /dlq/payments.dlq/messages?partition=0&fromOffset=0&limit=50
→ Returns messages at offsets 0-49

GET /dlq/payments.dlq/messages?partition=0&fromOffset=50&limit=50  
→ Returns messages at offsets 50-99

The cursor is the Kafka offset itself. Stable, consistent, and memory-safe regardless of topic size. A DLQ with 500,000 messages loads exactly 50 at a time. No JVM heap exhaustion.

Quick Start

git clone https://github.com/Saifulhuq01/dlq-revive.git
cd dlq-revive
docker compose -f docker/docker-compose.yml up -d
# Dashboard at http://localhost:4200

Starts Kafka, Zookeeper, PostgreSQL, Spring Boot backend, and Angular dashboard together.

Stack: Java 17, Spring Boot 3.x, Spring Kafka, Apache Kafka 3.x, Angular 13+, PostgreSQL 15, JSONata, Docker Compose, GitHub Actions CI.

License: MIT. Self-hosted. Free core forever.

The 2am Scenario, Resolved

Back to those 23,000 stuck payment events:

Without DLQ Revive: Engineer wakes at 2am, spends 4 hours writing a throwaway transformer script, runs it with no validation, hopes it works, deletes the script. Next incident: start over.

With DLQ Revive: Engineer opens browser, connects to Kafka broker, sees 23,000 stuck messages paginated safely, writes one JSONata expression ("status": $uppercase(status)), previews the transformation on 5 sample messages, validates all 23,000, clicks redrive. PostgreSQL audit trail records every message processed. Done in 15 minutes. The expression is saved as a template for next time.

What's Missing (Honest)

The current open-source version handles plain JSON DLQ topics. Avro and Protobuf deserialization (reading the magic byte + Schema ID prefix and fetching schema from Confluent Schema Registry) is on the roadmap for v1.1.

The free tier is limited to 100 messages per redrive session. For bulk redrives, a cloud tier is in progress.

Try It, Break It, Tell Me What's Wrong

If you've dealt with DLQ schema incompatibility in production — or if you think my Kafka consumer approach has an edge case I've missed — I genuinely want to know.

GitHub: github.com/Saifulhuq01/dlq-revive

Drop a comment here or open a GitHub issue. The idempotency design and the JSONata sandbox are the two areas I'm most paranoid about.

Mohammed Saifulhuq — Apache Fineract contributor (SQL injection patch, CI/CD hardening), building DLQ Revive