I wasn't asked to read the dev code.
My job was automation testing. To write stable automated tests, I needed to understand the stable flow. There was no documentation. There was no handover. So I did what most testers never do — I opened the dev commits and started reading. Every. Single. Line.
Four days later, I found the bug.
This is that story.
Why I Went Into the Code
As an automation tester, my job looks straightforward from the outside: write scripts, run them, report results. But here's what nobody tells you — you cannot automate what you don't understand.
The system I was working on used Apache Kafka for messaging. Messages went in, consumers picked them up, things happened downstream. Simple enough in theory. But before I could write a single test, I needed to know: what is the stable, expected flow?
There was no spec. There was no stable environment. There was just the codebase.
So I started at the commits.
Four Days in the Code
I went through every recent dev commit related to the Kafka configuration. Not skimming — reading. Understanding. Following the thread of what each change was trying to do.
On day four, I found it.
Two Kafka configuration beans existed in the codebase:
- `defaultKafkaSetting`
- `neverCenterSetting`
Each had its own group-id. Different group IDs. Both configurations were present and active at the same time.
```yaml
# defaultKafkaSetting
group-id: consumer-group-default

# neverCenterSetting
group-id: consumer-group-nevercenter
```
In a Kafka system, the consumer group ID determines which consumers receive which messages. It is not a cosmetic setting. It is not a preference. It is the routing logic.
With two conflicting configs both loading, the system could not consistently determine which consumer group it belonged to. Messages were going to the wrong consumer group. Not sometimes. Systematically.
And there was no error. No exception. No alert. Kafka doesn't throw errors for this — it just delivers to whoever is listening.
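To see why nothing complains, here is a toy model of the failure mode. This is not real Kafka and not the project's code; the message content is invented, and only the two group names from the story are reused. The point it illustrates is real, though: a broker delivers to whatever group-id a consumer presents, so joining under the wrong group is not an error.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Toy broker mirroring one Kafka property: each consumer group gets its
// own copy of every message, and any group-id a consumer presents is
// accepted without validation. Nothing is ever thrown or logged.
public class SilentMisrouting {
    // One pending-message queue per consumer group.
    static final Map<String, Queue<String>> groupQueues = new HashMap<>();

    static void produce(String message) {
        // Fan out a copy of the message to every known consumer group.
        groupQueues.values().forEach(q -> q.add(message));
    }

    static String poll(String groupId) {
        // A consumer identifies itself only by group-id. An unknown id
        // simply starts a fresh, empty group -- no error, no warning.
        return groupQueues.computeIfAbsent(groupId, g -> new ArrayDeque<>()).poll();
    }

    public static void main(String[] args) {
        // Both configs active: both groups exist on the broker.
        groupQueues.put("consumer-group-default", new ArrayDeque<>());
        groupQueues.put("consumer-group-nevercenter", new ArrayDeque<>());

        produce("payment-received#1001");

        // The service was supposed to poll as consumer-group-default,
        // but the stray config makes it join as consumer-group-nevercenter.
        String got = poll("consumer-group-nevercenter");
        System.out.println("wrong group received: " + got);
        System.out.println("intended group backlog: "
                + groupQueues.get("consumer-group-default").size());
    }
}
```

The wrong group "successfully" consumes the message, while the intended group's copy sits untouched. Every call above returns normally, which is exactly what the infrastructure logs showed.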
The Bug That Looks Like Normal Behaviour
This is what made it so hard to find — and so easy to miss.
If a service crashes, you see a stack trace. If an API returns 500, you see a log. But if Kafka silently delivers messages to the wrong consumer group, the system appears to be working. Data moves. Consumers respond. Everything looks fine.
Until it isn't.
Think of it like two postal addresses on the same envelope. The letter gets delivered — just not always to the right door. And the postal service doesn't call you to say there was a problem. It just delivers.
That's what the dual profile was doing. Silent misrouting. No complaints from the infrastructure. Just wrong behaviour hiding in config.
Why This Matters More Than It Looks
Let me be specific about what silent misrouting means in a production system.
When a Kafka consumer reads a message, it commits an offset — a marker that says "I have processed up to here." If the wrong consumer group is reading those messages, the offset is being committed against the wrong group. The correct consumer group has never seen those messages. It doesn't know they exist.
Those messages are not lost — but they might as well be. They sit unconsumed, invisible to the system that was supposed to act on them. Depending on your retention policy, Kafka will eventually delete them. Gone. Unrecoverable.
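The offset mechanics above can be sketched the same way. Again a toy model, not real Kafka: offsets are committed per group, so the wrong group showing zero lag tells you nothing about the group that was supposed to read.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of per-group offsets. The wrong group reads everything and
// commits; the intended group's offset never moves, so its backlog
// grows silently until retention deletes the messages underneath it.
public class OffsetDrift {
    static final List<String> log = new ArrayList<>();             // the topic
    static final Map<String, Integer> committed = new HashMap<>(); // group -> offset

    static void consumeAll(String groupId) {
        // Read from this group's committed offset to the end, then commit.
        int from = committed.getOrDefault(groupId, 0);
        for (int i = from; i < log.size(); i++) {
            // ...process log.get(i) under groupId...
        }
        committed.put(groupId, log.size());
    }

    static int lag(String groupId) {
        // Unread messages for this group. Zero means "fully caught up".
        return log.size() - committed.getOrDefault(groupId, 0);
    }

    public static void main(String[] args) {
        log.add("invoice#1");
        log.add("invoice#2");

        consumeAll("consumer-group-nevercenter"); // the wrong group reads

        System.out.println(lag("consumer-group-nevercenter")); // looks healthy
        System.out.println(lag("consumer-group-default"));     // silent backlog
        // When retention expires, those backlogged entries are deleted
        // from the log. The lag disappears -- and so does the data.
    }
}
```

Monitoring that only watches the active group's lag would report a perfectly healthy pipeline here.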
In any large enterprise system processing critical business events at scale, missing messages are not a technical footnote. They are missing transactions. Missing records. Missing responses that the business depended on.
The larger the system, the larger the cost of that going undetected into production.
Why GitHub Copilot Didn't Find This
I want to address something people always ask when a bug like this comes up: "Couldn't AI have caught that?"
Short answer: No. And understanding why matters more than the answer itself.
GitHub Copilot is a code completion tool. It reads what's in front of it and suggests what comes next. It is genuinely impressive at that. But this bug didn't live in one file. It lived in the space between two configuration beans, across Spring profiles, in the context of what the business expected those Kafka messages to do.
Copilot saw defaultKafkaSetting — valid config. It saw neverCenterSetting — also valid config. Both are syntactically correct Java. Neither file has an error. There is nothing for a code assistant to flag.
What it couldn't do is what I did: spend four days understanding why this system existed, what these messages were supposed to trigger, and where they were actually going. That question — "does this behaviour match the business intent?" — is not a code question. It's a systems understanding question.
AI tools are trained on code patterns. They are not trained on your architecture, your business rules, or the conversation your team had six months ago about why neverCenterSetting was created in the first place.
That context lived in the commits. And only a human who cared enough to read them would find it.
When the Scramble Came
When I raised it, the dev team investigated. What followed was exactly the kind of scramble you don't want to see — everyone pulling logs, checking offsets, tracing message paths. Confusion. Urgency. Back and forth.
The dev team fixed it. The correct group-id was enforced. The system stabilised.
What This Taught Me About Testing
Config is code.
Most testers never look at it. It lives in YAML files and Spring profiles and environment variables, and it feels like someone else's territory. Infrastructure. DevOps. Not QA.
But configuration controls behaviour just as much as code does. A wrong group-id doesn't care whether it came from a logic error or a config oversight. The impact is the same.
After this experience, I added config review to my personal pre-automation checklist:
- Check for duplicate bean definitions — especially in Spring Boot apps with multiple profiles
- Verify consumer group IDs across environments — dev, staging, and prod should be deliberate and distinct
- Trace at least one message end-to-end manually before writing any automated flow
- Read recent commits before assuming stability — the "stable flow" you're automating against might have changed last Tuesday
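The second item on that checklist can even be automated. Here is a minimal sketch of such a check; the environment names and group-ids are invented, and in practice you would load the real values from each environment's properties files rather than hard-coding them.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Pre-automation sanity check: every environment must resolve to
// exactly one non-empty consumer group-id, and no two environments
// may share one (otherwise one environment steals another's messages).
public class GroupIdAudit {
    static boolean groupIdsAreDeliberate(Map<String, String> groupIdByEnv) {
        Set<String> seen = new HashSet<>();
        for (Map.Entry<String, String> e : groupIdByEnv.entrySet()) {
            String id = e.getValue();
            if (id == null || id.isBlank()) return false; // missing config
            if (!seen.add(id)) return false;              // two envs share a group
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, String> ok = new HashMap<>();
        ok.put("dev", "orders-dev");
        ok.put("staging", "orders-staging");
        ok.put("prod", "orders-prod");

        Map<String, String> clash = new HashMap<>(ok);
        clash.put("staging", "orders-prod"); // staging would read prod's messages

        System.out.println(groupIdsAreDeliberate(ok));    // passes
        System.out.println(groupIdsAreDeliberate(clash)); // fails
    }
}
```

Running a check like this in CI costs seconds. The four-day hunt it can prevent costs rather more.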
What Testers Should Take From This
If you're an automation tester working on event-driven or messaging systems, here's my honest advice:
Go into the code. Not to become a developer — but to understand what you're testing. The test cases aren't in the ticket. They're in the behaviour. And the behaviour is in the code.
You will find things developers miss. Not because you're smarter — but because you're looking at it differently. You're asking "what is this system supposed to do?" while they're asking "does my feature work?" Both questions matter. Only one of them looks at the whole picture.
I spent four days reading commits for a flow I needed to automate. I found a bug that caused system-wide chaos. I’d do it again tomorrow.
One Last Thing
If you're a tester reading this and you've had something similar happen — a finding that got dismissed, minimised, or overlooked — I want you to know: the finding still counts.
Write it down. Blog about it. Put it in your portfolio. Tell it in interviews.
The story of how you found it, and what you understood that others didn't, is exactly what separates a tester who runs scripts from a tester who understands systems.
Because it's not just that the finding counts. Understanding a system deeply enough to catch what it hides is the real work.

That's called breaking the code.
Enterprise QA automation specialist. I build frameworks. I find what most miss. Breaking the Code.
Tags: Kafka · Software Testing · QA · Java · GitHub Copilot · DevOps · Automation Testing