- Book: The Complete Guide to Go Programming
- Also by me: Thinking in Go (2-book series) — Complete Guide to Go Programming + Hexagonal Architecture in Go
- My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
- Me: xgabriel.com | GitHub
A retry test in a Go service maintained by a team I talked to
flaked once every two weeks. The fix they kept applying was
longer sleeps. 100 ms became 200 ms. 200 ms became 500 ms. The
test suite drifted from a two-minute run to nine minutes, and
the retry test still flaked.
The test exercised a retry loop. Call a flaky downstream, back
off, call again, return after three tries or after two seconds
total. The implementation used time.AfterFunc and a buffered
channel. The test used time.Sleep in the assertion path. On
a quiet runner it passed in 80 ms. On a busy runner with
neighbouring containers, the sleep would land late, the retry
budget would tick over, and the assertion would read the wrong
state.
The bug was not in the retry code. The bug was that real time
is shared with the host. No amount of slack makes a
wall-clock-driven test deterministic on a contended machine.
Go 1.24 shipped the experimental
fix for this under GOEXPERIMENT=synctest. Go 1.25
promoted it to a stable package: testing/synctest. The shape
is small. You wrap a test in a bubble. Inside the bubble,
time does not pass on its own. It advances only when every
goroutine in the bubble is durably blocked, waiting for
something. The clock then jumps directly to the next wakeup.
Sleeps cost no wall time. No flakes.
The migration is small. Here is what it looks like.
What testing/synctest actually is
The package is two functions.
package synctest
// Test starts a new bubble and runs f inside it. Any goroutine
// started from inside the bubble is part of it.
func Test(t *testing.T, f func(t *testing.T))
// Wait blocks until every other goroutine in the caller's
// bubble is durably blocked. Then it returns.
func Wait()
Inside a bubble, time.Now, time.Sleep, time.NewTimer,
time.AfterFunc, time.Tick, context.WithTimeout,
context.WithDeadline, and the channel operations they depend
on all use a fake clock that the runtime owns. The fake clock
starts at midnight UTC on 2000-01-01. It advances only when
every goroutine in the bubble is durably blocked: waiting on
a channel send/receive, a sleep or timer, sync.Cond.Wait,
sync.WaitGroup.Wait, or a select over bubbled channels. When
that condition holds and the next wakeup is a timer the bubble
owns, the clock jumps directly to that timer's deadline. The
goroutine wakes up. Real wall time has not moved.
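A minimal sketch of those semantics. Nothing here is specific
to retry code; the start time and the jump-to-deadline
behaviour are both documented:

func TestFakeClock(t *testing.T) {
	synctest.Test(t, func(t *testing.T) {
		// Inside the bubble the clock starts at 2000-01-01 00:00:00 UTC.
		start := time.Now()

		// Returns after microseconds of wall time: once every bubble
		// goroutine is durably blocked, the clock jumps the full hour.
		time.Sleep(time.Hour)

		if got := time.Since(start); got != time.Hour {
			t.Fatalf("elapsed = %v, want exactly 1h", got)
		}
	})
}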
Outside the bubble, time works as it always has.
This is a common technique for testing time-dependent code: a
deterministic clock that only ticks when the program asks. The
difference is that you no longer need to thread a Clock
interface through your production code to get it. The bubble
runs your real production code, with the real time package,
and the runtime swaps the clock under it.
For the rest of the post I will use the Go 1.25 stable API,
synctest.Test(t, func(t *testing.T) { ... }). If you are still
on Go 1.24 with GOEXPERIMENT=synctest, the function is
synctest.Run(func() { ... }) and you do not get a *testing.T
inside. The semantics are otherwise the same. The migration to
1.25 is a sed-and-rewrap.
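The rewrap, side by side (a sketch, body elided):

// Go 1.24, GOEXPERIMENT=synctest
synctest.Run(func() {
	// test body; no bubble-scoped *testing.T
})

// Go 1.25
synctest.Test(t, func(t *testing.T) {
	// same body, now with a bubble-scoped t
})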
The flaky retry test, before
Here is the kind of code the team had. A retry helper that calls
do up to three times with exponential backoff, capped at a
deadline:
package retry

import (
	"context"
	"errors"
	"time"
)

type Result struct {
	Value    string
	Attempts int
}

// Do calls fn up to maxTries times with backoff starting at
// base and doubling. It honours ctx.
func Do(
	ctx context.Context,
	maxTries int,
	base time.Duration,
	fn func(context.Context) (string, error),
) (Result, error) {
	var last error
	for i := 0; i < maxTries; i++ {
		v, err := fn(ctx)
		if err == nil {
			return Result{Value: v, Attempts: i + 1}, nil
		}
		last = err
		if i == maxTries-1 {
			break // no point backing off after the final attempt
		}
		wait := base << i
		select {
		case <-ctx.Done():
			return Result{Attempts: i + 1}, ctx.Err()
		case <-time.After(wait):
		}
	}
	return Result{Attempts: maxTries},
		errors.New("retry: exhausted: " + last.Error())
}
The flaky test:
package retry

import (
	"context"
	"errors"
	"testing"
	"time"
)

func TestDo_RetriesUntilDeadline(t *testing.T) {
	ctx, cancel := context.WithTimeout(
		context.Background(), 350*time.Millisecond)
	defer cancel()

	calls := 0
	fn := func(ctx context.Context) (string, error) {
		calls++
		return "", errors.New("boom")
	}

	start := time.Now()
	_, err := Do(ctx, 5, 100*time.Millisecond, fn)
	elapsed := time.Since(start)

	if err == nil {
		t.Fatal("expected error, got nil")
	}
	// range assertion — the team gave up on pinning the count
	if calls < 2 || calls > 3 {
		t.Fatalf("calls = %d, want 2 or 3", calls)
	}
	if elapsed < 300*time.Millisecond {
		t.Fatalf("elapsed = %v, want >= 300ms", elapsed)
	}
}
The test takes 350 ms of real wall time per run, and there are
six like it. The calls assertion is a range because the retry
loop and the context deadline race each other and the answer
changes per host. And on a noisy CI host, the deadline fires
before the second backoff returns and calls comes back as 1.
The test is tied to the runner's ability to schedule goroutines
on time, which is exactly the property CI machines stop giving
you under load.
The same test, with synctest.Test
Here is the rewrite. Same production code. No Clock interface,
no dependency injection, no library:
package retry

import (
	"context"
	"errors"
	"testing"
	"testing/synctest"
	"time"
)

func TestDo_RetriesUntilDeadline(t *testing.T) {
	synctest.Test(t, func(t *testing.T) {
		ctx, cancel := context.WithTimeout(
			context.Background(), 350*time.Millisecond)
		defer cancel()

		calls := 0
		fn := func(ctx context.Context) (string, error) {
			calls++
			return "", errors.New("boom")
		}

		start := time.Now()
		_, err := Do(ctx, 5, 100*time.Millisecond, fn)
		elapsed := time.Since(start)

		if err == nil {
			t.Fatal("expected error, got nil")
		}
		if calls != 3 {
			t.Fatalf("calls = %d, want 3", calls)
		}
		// fake clock advances exactly to the deadline
		if elapsed != 350*time.Millisecond {
			t.Fatalf("elapsed = %v, want exactly 350ms", elapsed)
		}
	})
}
The body is almost identical. The test changed; retry.Do did
not.
The wrapping synctest.Test(t, func(t *testing.T) { ... })
opens the bubble. Everything inside runs against the fake clock.
The context.WithTimeout, the time.After inside Do, and the
time.Now() calls all share that clock.
The assertions tightened. calls is exactly 3. Call 1 fails,
backoff is 100 ms, the clock sits at 100 ms. Call 2 fails,
backoff is 200 ms, the clock sits at 300 ms. Call 3 fails,
backoff would be 400 ms but the context expires at 350 ms
first, so the deadline branch of the select fires. The test
sees three calls, not "two or three". elapsed is exactly
350 ms because that is the deadline the context.WithTimeout
was given and the fake clock jumped to it.
The wall-time cost of the whole test, now decoupled from real
time, is microseconds. The package's entire suite of retry
tests finishes in less wall time than a single 100 ms backoff
took in the old version.
When you need synctest.Wait
synctest.Test is enough for tests that drive everything from
the goroutine the test runs on. It is not enough when production
code spawns a background goroutine and the test wants to assert
the state after that goroutine has done its work.
A debounce helper makes this concrete:
package debounce

import (
	"sync"
	"time"
)

// New returns a debounced function that calls fn once d has
// passed since the last call. A call within d resets the
// timer.
func New(d time.Duration, fn func()) func() {
	var (
		mu    sync.Mutex
		timer *time.Timer
	)
	return func() {
		mu.Lock()
		defer mu.Unlock()
		if timer != nil {
			timer.Stop()
		}
		timer = time.AfterFunc(d, fn)
	}
}
A test that wants to assert "after the debounce window, fn ran
exactly once":
func TestDebounce_FiresOnce(t *testing.T) {
	synctest.Test(t, func(t *testing.T) {
		var calls int
		var mu sync.Mutex
		d := debounce.New(100*time.Millisecond, func() {
			mu.Lock()
			calls++
			mu.Unlock()
		})

		d()
		d()
		d()

		// advance the fake clock to the end of the 100 ms window
		time.Sleep(100 * time.Millisecond)
		synctest.Wait()

		mu.Lock()
		defer mu.Unlock()
		if calls != 1 {
			t.Fatalf("calls = %d, want 1", calls)
		}
	})
}
The time.Sleep(100ms) is the test goroutine telling the
bubble "advance the fake clock to the end of the 100 ms
debounce window". The clock jumps, the time.AfterFunc
deadline is reached, and the runtime schedules the callback
on a fresh goroutine inside the bubble.
That callback runs concurrently with the test goroutine.
Without synctest.Wait, the test could read calls before the
callback has had a chance to increment it. synctest.Wait
blocks the test until every goroutine in the bubble (including
the one the runtime spawned for time.AfterFunc) is durably
blocked on something. By the time Wait returns, the callback
has either finished or it is blocked again. Either way, the
assertion is safe.
Rule of thumb: every time your production code spawns a
goroutine the test cares about, follow the time-advancing call
with synctest.Wait. If the test only drives one goroutine,
you do not need it. (Sleeping past the deadline, say 200 ms
here, would also serialise things, because the clock cannot
move past 100 ms until the callback goroutine finishes or
blocks. Sleeping exactly to the deadline is what makes the
race, and the need for Wait, visible.)
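The pattern, compressed (a sketch; window stands in for
whatever duration your code waits):

// inside synctest.Test
time.Sleep(window) // advance the fake clock to the deadline
synctest.Wait()    // let spawned goroutines run until they block or exit
// assert on shared state here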
"Durably" means blocked on something that another goroutine in
the bubble must do to wake it up: channel ops, mutexes,
time.Sleep and timers in the bubble's clock, select. The
runtime can see all of those. What it cannot see is a goroutine
spinning on a CPU loop, a goroutine blocked on a syscall the
runtime does not own (a real network read, a real file read),
or a goroutine blocked on a channel send/receive whose other
side lives outside the bubble. None of those are durable from
the bubble's view. If your code does any of them while the test
is waiting for time to pass, the bubble deadlocks.
In practice, that means two things. Don't run real network I/O
inside synctest.Test; inject a fake transport or an in-memory
connection such as net.Pipe (created inside the bubble),
because the bubble's clock will not advance while a goroutine
is parked in a read syscall on a real socket. httptest.NewServer
still listens on a real loopback socket, so keep it outside
bubbles. And don't share channels across the bubble boundary.
If a goroutine inside the bubble blocks on a channel created
outside it, that goroutine is not durably blocked from the
bubble's view (only an outside goroutine can wake it), so the
clock cannot advance and synctest.Wait will not return. Going
the other way is worse: operating on a bubble-created channel
from outside the bubble panics.
Both of these are easier to spot once you know what the runtime
is checking. The package docs spell it out in the synctest
godoc.
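A fake transport that is safe inside a bubble takes a few
lines. A sketch; roundTripFunc is a hypothetical helper, not
part of the standard library:

import (
	"io"
	"net/http"
	"strings"
	"time"
)

// roundTripFunc adapts a plain function into an http.RoundTripper,
// so requests never touch a real socket.
type roundTripFunc func(*http.Request) (*http.Response, error)

func (f roundTripFunc) RoundTrip(r *http.Request) (*http.Response, error) {
	return f(r)
}

// The client's per-request timeout timer is created when the
// request is sent; sent inside a bubble, it runs on the fake clock.
var client = &http.Client{
	Timeout: 2 * time.Second,
	Transport: roundTripFunc(func(r *http.Request) (*http.Response, error) {
		return &http.Response{
			StatusCode: 200,
			Header:     make(http.Header),
			Body:       io.NopCloser(strings.NewReader("ok")),
		}, nil
	}),
}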
Versus clockwork, quartz, and benbjohnson/clock
Before testing/synctest existed, three libraries owned this
problem: benbjohnson/clock is the original, jonboulle/clockwork
is widely used, and coder/quartz is Coder's rework of the idea
with a richer API.
All three solve the same problem the same way: define a Clock
interface (Now(), Sleep(), NewTimer(), AfterFunc()),
inject it into every type that needs time, and provide a fake
clock for tests with methods like Advance(d) and BlockUntil(n).
The design constraint is the dependency injection. Every package
that touches time has to take a Clock parameter. Every test has
to construct a FakeClock and pass it down. Every helper that
calls time.Now() directly is a hole in the abstraction.
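What that constraint looks like in code. The names here are
illustrative, not any one library's exact API:

// The interface every time-touching type has to accept.
type Clock interface {
	Now() time.Time
	After(d time.Duration) <-chan time.Time
}

// realClock forwards to the real time package in production.
type realClock struct{}

func (realClock) Now() time.Time                         { return time.Now() }
func (realClock) After(d time.Duration) <-chan time.Time { return time.After(d) }

// Every constructor grows a parameter, forever.
type Poller struct {
	clock    Clock
	interval time.Duration
}

func NewPoller(c Clock, interval time.Duration) *Poller {
	return &Poller{clock: c, interval: interval}
}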
testing/synctest removes the constraint. The production code
still calls time.Now() and time.Sleep() directly. The test
wraps the test body in synctest.Test. The runtime swaps the
clock for that goroutine subtree. The diff in production code
to make a package testable is zero.
That said, the clock libraries still earn their place in two
spots:
- Tests that need to advance time manually. synctest advances
the fake clock to the next wakeup automatically. The clock
libraries let you call clock.Advance(50 * time.Millisecond) to
stop the clock midway between two timer deadlines and inspect
intermediate state. Inside a bubble you can approximate that
with a time.Sleep from the test goroutine, but the libraries
make the intermediate stop explicit.
- Production-time clock injection for non-test reasons. If your
code already takes a Clock because you swap in a monotonic-only
clock or a TSO clock for distributed consensus, you keep the
interface. synctest does not replace that.
For the bulk of "this test sleeps for too long and flakes
sometimes", testing/synctest deletes the dependency and the
flake.
What to do with this on Monday
Pick one flaky time-driven test in your suite. The retry test,
the debounce test, the rate-limiter test, the cache-expiry
test, the websocket-keepalive test. There is at least one.
Wrap its body in synctest.Test. Delete the slack from any
time.Sleep calls in the assertion path; replace each with the
exact duration you want to advance. Add synctest.Wait after
each time-advancing call if production code spawns goroutines.
Run it. The test will either pass in microseconds and pin the
exact calls count you used to round, or it will deadlock
because something inside is reaching outside the bubble. The
deadlock is informative. It tells you which goroutine is
talking to the real world and forcing the slack you were
papering over.
Convert one this week. The next will be easier.
If this was useful
The Complete Guide to Go Programming
covers the runtime layer that makes testing/synctest work —
goroutine scheduling, the timer heap, the durably-blocked
distinction, and how the time package interacts with the
scheduler. It is part of Thinking in Go, the 2-book series —
paired with Hexagonal Architecture in Go
for keeping the time-dependent parts of a service isolated
behind ports so the bubble has clean boundaries to draw.
