Some bugs crash loudly. The more interesting ones just hang: no panic, no error, no stack trace, just a goroutine quietly spinning forever on input that looks completely ordinary. I ran into one of those recently while reading through KubeStellar, a CNCF Sandbox project for multi-cluster configuration management. Here's the hunt, the root cause, and the one-line mistake behind it.
What the function was supposed to do
Deep in KubeStellar's status-combination code (pkg/status/combinedstatus-resolution.go) there's a small helper called objectIsQueried. Its job is simple: given a query string and an object name, decide whether the object appears in the query as a whole word — not as a fragment buried inside a longer identifier. So "foo" should match "select foo from bar" but not "foobar".
To do that, it walks the string looking for each occurrence of the object, and for each one asks a second helper, isWholeWord, whether that position is bounded by non-alphanumeric characters. Reasonable design. Here's the original loop:
func objectIsQueried(query *string, obj string) bool {
	idx := 0
	for {
		idx = strings.Index((*query)[idx:], obj) // search from idx onward
		if idx == -1 {
			return false
		}
		if isWholeWord(query, idx, len(obj)) {
			return true
		}
		idx += len(obj)
	}
}
Read it quickly and it looks fine. That's exactly what made it dangerous.
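The post never shows isWholeWord itself, so here is a minimal sketch of what such a helper might look like. This is a hypothetical reconstruction, not KubeStellar's code: the name isAlnum, the ASCII-only assumption, and the exact boundary rules are mine. It only illustrates the contract described above, namely that the position passed in must be an absolute index into the original string.

```go
package main

import "fmt"

// isAlnum reports whether b is an ASCII letter or digit.
// Assumption: queries are ASCII, so bytes can stand in for characters.
func isAlnum(b byte) bool {
	return (b >= 'a' && b <= 'z') || (b >= 'A' && b <= 'Z') || (b >= '0' && b <= '9')
}

// isWholeWord is a hypothetical reconstruction of the helper.
// It reports whether (*s)[start:start+length] is bounded on both sides
// by a non-alphanumeric byte or by an end of the string.
// start must be an absolute index into *s.
func isWholeWord(s *string, start, length int) bool {
	end := start + length
	if start < 0 || end > len(*s) {
		return false
	}
	if start > 0 && isAlnum((*s)[start-1]) {
		return false // glued to a preceding letter or digit
	}
	if end < len(*s) && isAlnum((*s)[end]) {
		return false // glued to a following letter or digit
	}
	return true
}

func main() {
	q := "select foo from foobar"
	fmt.Println(isWholeWord(&q, 7, 3))  // standalone "foo": true
	fmt.Println(isWholeWord(&q, 16, 3)) // "foo" inside "foobar": false
}
```

Under these assumptions the helper is only as good as the index it receives, which is where the next section picks up.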
The trap: relative vs. absolute indices
The bug lives entirely in one line:
idx = strings.Index((*query)[idx:], obj)
strings.Index returns the offset of obj relative to the slice it was given — and the slice here is (*query)[idx:], not the whole string. But the code then turns around and uses that relative offset as if it were an absolute position in the original query, both when calling isWholeWord(query, idx, ...) and when advancing idx += len(obj).
On the first pass the search starts at index 0, so relative and absolute agree, and if that first occurrence settles the answer, everything works. On any later pass the search slice starts somewhere in the middle, the two diverge, and the loop carries on with a wrong index that can point back into territory it already searched.
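To make the relative-versus-absolute gap concrete, here is a small standalone sketch. The helper name relAndAbs is made up for illustration; the only library call is strings.Index.

```go
package main

import (
	"fmt"
	"strings"
)

// relAndAbs is a hypothetical helper for illustration only. It searches
// s[base:] and returns both the relative offset strings.Index reports and
// the absolute offset in s. Both are -1 when sub is not found.
func relAndAbs(s, sub string, base int) (rel, abs int) {
	rel = strings.Index(s[base:], sub)
	if rel == -1 {
		return -1, -1
	}
	return rel, base + rel
}

func main() {
	q := "foobar foo"
	rel, abs := relAndAbs(q, "foo", 3)
	fmt.Println(rel, abs)             // 4 7
	fmt.Printf("%q\n", q[rel:rel+3]) // "ar " is what a relative index lands on
	fmt.Printf("%q\n", q[abs:abs+3]) // "foo" is the real match
}
```

The two numbers differ by exactly base, which is the quantity the original loop forgot to add back.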
Watching it spin
The clearest way to see the failure is to trace the input query = "foobar foo", obj = "foo":
- idx = 0. strings.Index("foobar foo", "foo") → 0. isWholeWord at position 0? The character after "foo" is b, which is alphanumeric, so the match is embedded in "foobar" and is not a whole word. false. Advance: idx += 3 → idx = 3.
- idx = 3. Now we search the slice "bar foo". strings.Index("bar foo", "foo") → 4, relative to that slice. The code stores 4 as if it were absolute. Absolute position 4 is the wrong place to check (the real match is at 3 + 4 = 7), and isWholeWord returns false. Advance: idx += 3 → idx = 7.
- idx = 7. Search the slice "foo". strings.Index("foo", "foo") → 0, relative. Stored as absolute 0. We're back to checking "foobar" again. false. Advance → idx = 3.
And we're looping: 3 → 7 → 3 → 7, forever. The function never returns. The exact trigger is "the first occurrence of the object is embedded in a longer word, and a later occurrence is standalone" — an input that's not exotic at all once real query strings are involved.
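The trace above can be replayed safely with an iteration cap. The sketch below copies the original loop's index arithmetic into a hypothetical traceBuggyLoop function that gives up after maxIter passes and records idx after each advance. The isWholeWord here is my own stand-in (ASCII-only, an assumption), since the real helper isn't shown in this post.

```go
package main

import (
	"fmt"
	"strings"
)

// isAlnum and isWholeWord are hypothetical stand-ins for the real helper.
func isAlnum(b byte) bool {
	return (b >= 'a' && b <= 'z') || (b >= 'A' && b <= 'Z') || (b >= '0' && b <= '9')
}

func isWholeWord(s *string, start, length int) bool {
	end := start + length
	if start < 0 || end > len(*s) {
		return false
	}
	if start > 0 && isAlnum((*s)[start-1]) {
		return false
	}
	return end == len(*s) || !isAlnum((*s)[end])
}

// traceBuggyLoop replays the original (buggy) loop, but stops after maxIter
// iterations. It returns the value of idx after each advance and whether
// the loop returned on its own before hitting the cap.
func traceBuggyLoop(query *string, obj string, maxIter int) (trace []int, returned bool) {
	idx := 0
	for i := 0; i < maxIter; i++ {
		idx = strings.Index((*query)[idx:], obj) // relative offset, stored as absolute
		if idx == -1 {
			return trace, true
		}
		if isWholeWord(query, idx, len(obj)) {
			return trace, true
		}
		idx += len(obj)
		trace = append(trace, idx)
	}
	return trace, false
}

func main() {
	q := "foobar foo"
	trace, returned := traceBuggyLoop(&q, "foo", 6)
	fmt.Println(trace, returned) // [3 7 3 7 3 7] false
}
```

Without the cap, the same arithmetic never leaves the 3, 7 cycle.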
The fix
The fix is to keep the two notions of position separate: a base that tracks where the current search slice begins in absolute terms, and a rel that's the relative result from strings.Index. The absolute position is just base + rel.
func objectIsQueried(query *string, obj string) bool {
	base := 0
	for {
		// strings.Index returns a relative offset within (*query)[base:], so we
		// must add base to get the absolute position before passing it to
		// isWholeWord (which operates on the original string).
		rel := strings.Index((*query)[base:], obj)
		if rel == -1 {
			return false
		}
		abs := base + rel
		if isWholeWord(query, abs, len(obj)) {
			return true
		}
		base = abs + len(obj)
	}
}
Nine lines added, five removed. The slice-based search is still efficient (slicing a string in Go shares the underlying storage, so nothing is copied), but now every index handed to isWholeWord is a true absolute position, and for a non-empty obj, base only ever moves forward. No more oscillation.
Pinning it with a regression test
A fix you can't prove is a fix you'll lose. The change that actually matters long-term is the regression test, which encodes the exact input that used to hang so nobody can quietly reintroduce the bug later:
// Key regression case: first occurrence embedded, second standalone.
// The original code looped forever on this input.
{"foobar foo", "foo", true},
alongside the ordinary cases ("foo" in "select foo from bar" → true, "foobar" → false, punctuation boundaries like "(foo)" → true, empty query → false), plus a separate table-driven test for isWholeWord itself. The point isn't coverage for its own sake — it's that the one input that broke the function is now a named, permanent test case.
Three things I took away
- When you slice before searching, every offset is relative. strings.Index(s[base:], x) does not return a position in s. This is the kind of mistake that's invisible on a quick read and obvious the moment you trace a concrete input, which is the whole argument for tracing concrete inputs.
- A hang is a bug, even without a crash. No panic fired here. The only symptom would be a wedged goroutine. Code that loops on an external Index result should always have a strictly-monotonic advance you can point to; here, proving base always increases is the proof the loop terminates.
- Reading unfamiliar code is a skill worth practicing in public. I found this by reading, not by hitting it in production. Open source is one of the few places you can do that on a real, used codebase and have your fix reviewed by people who maintain it.
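One way to make the first two takeaways hard to get wrong is to hide the base + rel arithmetic behind a helper that only ever speaks in absolute positions. The sketch below is my own illustration: indexFrom and allIndexes are hypothetical names, not part of the strings package or of KubeStellar. The empty-needle guard is also my addition; with an empty needle, abs + len(sub) equals abs, so the advance would not be strict.

```go
package main

import (
	"fmt"
	"strings"
)

// indexFrom (hypothetical) finds sub in s at or after position from and
// returns an absolute index into s, or -1 if there is no such occurrence.
func indexFrom(s, sub string, from int) int {
	if from < 0 || from > len(s) {
		return -1
	}
	rel := strings.Index(s[from:], sub)
	if rel == -1 {
		return -1
	}
	return from + rel
}

// allIndexes (hypothetical) returns the absolute start of every
// non-overlapping occurrence of sub in s.
func allIndexes(s, sub string) []int {
	if sub == "" {
		return nil // an empty needle would never advance base
	}
	var out []int
	base := 0
	for {
		abs := indexFrom(s, sub, base)
		if abs == -1 {
			return out
		}
		out = append(out, abs)
		base = abs + len(sub) // strictly larger than the previous base
	}
}

func main() {
	fmt.Println(allIndexes("foobar foo", "foo")) // [0 7]
}
```

Because abs is never smaller than base and len(sub) is at least 1, base grows on every pass, which is the termination argument in code form.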
The fix is up as a proposed change against KubeStellar — issue #3848 for the report, PR #3849 for the fix and tests. If you spend time in Go codebases, the relative-vs-absolute slice trap is worth keeping in your peripheral vision; it hides well.
Written by Vignan Nallani. Found a bug like this, or want to talk through how you'd have traced it differently? I'm always up for it.