Serif COLAKEL
Deterministic Testing of Concurrent Go Code

How to Test Goroutines Without Flaky CI Pipelines

Writing concurrent code in Go is easy.

Writing deterministic, reliable tests for concurrent code is not.

If you’ve worked on real production systems, you’ve probably seen this:

  • Tests pass locally.
  • CI fails randomly.
  • Increasing time.Sleep “fixes” the problem.
  • Flaky tests get ignored.

That’s not a tooling issue.
That’s a design issue.

In this article, we’ll go deep into:

  • Why time.Sleep makes your tests unreliable
  • How to control goroutine lifecycles properly
  • How to eliminate time-based nondeterminism
  • How to design concurrent components that are testable
  • Real production patterns for stable CI pipelines

This is not a beginner tutorial.
This is about writing concurrency that survives production.


1. The Root Problem: Time-Based Assumptions

Let’s start with a common anti-pattern.

func TestProcessor(t *testing.T) {
    p := NewProcessor()
    go p.Start()

    p.Enqueue(Task{ID: 42})

    time.Sleep(100 * time.Millisecond)

    if !p.HasProcessed(42) {
        t.Fatal("task not processed")
    }
}

Why is this bad?

  • 100ms might not be enough in CI.
  • On a loaded machine, scheduling might delay the goroutine.
  • It makes your test slower than necessary.
  • It hides race conditions.

This test is not deterministic.
It depends on scheduler timing.

Rule #1: Tests should wait on events, not time.


2. Event-Driven Synchronization Instead of Sleep

Let’s redesign the processor to make it testable.

Step 1: Expose completion as an event

type Task struct {
    ID int
}

type Processor struct {
    tasks chan Task
    done  chan int
}

func NewProcessor() *Processor {
    return &Processor{
        tasks: make(chan Task),
        done:  make(chan int),
    }
}

func (p *Processor) Start() {
    for task := range p.tasks {
        p.process(task)
    }
}

func (p *Processor) process(task Task) {
    // real work would happen here; note that this send blocks
    // until a consumer (here, the test) receives the event
    p.done <- task.ID
}

func (p *Processor) Enqueue(task Task) {
    p.tasks <- task
}

Now the test becomes deterministic:

func TestProcessor(t *testing.T) {
    p := NewProcessor()
    go p.Start()

    p.Enqueue(Task{ID: 42})

    select {
    case id := <-p.done:
        if id != 42 {
            t.Fatalf("unexpected task id: %d", id)
        }
    case <-time.After(time.Second):
        t.Fatal("timeout waiting for task completion")
    }
}

Notice:

  • No time.Sleep
  • No guessing
  • The test reacts to an actual event

The time.After is only there to fail fast — not to “wait long enough”.


3. Goroutine Lifecycle Control (Preventing Test Leaks)

A more subtle production issue:

Your test finishes.
The goroutine keeps running.

That leads to:

  • Data races
  • Cross-test interference
  • Random failures
  • Resource leaks

Let’s fix that properly.

Structured Lifecycle Pattern

type Worker struct {
    ctx    context.Context
    cancel context.CancelFunc
    wg     sync.WaitGroup
}

Constructor:

func NewWorker(parent context.Context) *Worker {
    ctx, cancel := context.WithCancel(parent)

    return &Worker{
        ctx:    ctx,
        cancel: cancel,
    }
}

Start method:

func (w *Worker) Start() {
    w.wg.Add(1)

    go func() {
        defer w.wg.Done()

        for {
            select {
            case <-w.ctx.Done():
                return
            default:
                // stand-in for real work: this sleep is the
                // workload itself, not test synchronization
                time.Sleep(10 * time.Millisecond)
            }
        }
    }()
}

Stop method:

func (w *Worker) Stop() {
    w.cancel()
    w.wg.Wait()
}

Now the test:

func TestWorkerLifecycle(t *testing.T) {
    ctx := context.Background()
    w := NewWorker(ctx)

    w.Start()
    w.Stop() // returns only after the worker goroutine has exited
}

This guarantees:

  • No goroutine leaks
  • Deterministic shutdown
  • Clean test isolation

In production systems, this pattern is non-negotiable.


4. Time-Dependent Logic Is a Hidden Enemy

Let’s look at a realistic example.

type RetryManager struct {
    lastAttempt time.Time
}

func (r *RetryManager) ShouldRetry() bool {
    return time.Since(r.lastAttempt) > 5*time.Second
}

How do you test this?

You can’t reliably test time-based logic using real time without sleeps.

And if you use sleeps, your test becomes:

  • Slow
  • Flaky
  • Environment-dependent

5. Clock Abstraction Pattern

In production systems, we abstract time.

Step 1: Define a Clock interface

type Clock interface {
    Now() time.Time
}

Step 2: Real implementation

type RealClock struct{}

func (RealClock) Now() time.Time {
    return time.Now()
}

Step 3: Fake clock for tests

type FakeClock struct {
    mu      sync.Mutex
    current time.Time
}

func NewFakeClock(start time.Time) *FakeClock {
    return &FakeClock{current: start}
}

func (f *FakeClock) Now() time.Time {
    f.mu.Lock()
    defer f.mu.Unlock()
    return f.current
}

func (f *FakeClock) Advance(d time.Duration) {
    f.mu.Lock()
    f.current = f.current.Add(d)
    f.mu.Unlock()
}

Now redesign the component:

type RetryManager struct {
    clock       Clock
    lastAttempt time.Time
}

func NewRetryManager(clock Clock) *RetryManager {
    return &RetryManager{
        clock:       clock,
        lastAttempt: clock.Now(),
    }
}

func (r *RetryManager) ShouldRetry() bool {
    return r.clock.Now().Sub(r.lastAttempt) > 5*time.Second
}

Deterministic test:

func TestRetryManager(t *testing.T) {
    fake := NewFakeClock(time.Now())
    manager := NewRetryManager(fake)

    if manager.ShouldRetry() {
        t.Fatal("should not retry yet")
    }

    fake.Advance(6 * time.Second)

    if !manager.ShouldRetry() {
        t.Fatal("should retry after time advance")
    }
}

No sleep.
No flakiness.
100% deterministic.

This pattern is extremely valuable in:

  • Retry systems
  • Circuit breakers
  • Rate limiters
  • Cache expiration logic
  • Background schedulers

6. Testing Concurrent State Safely

Another production pattern: shared state.

Bad example:

type Counter struct {
    value int
}

func (c *Counter) Inc() {
    c.value++
}

Concurrent test:

func TestCounter(t *testing.T) {
    c := &Counter{}
    for i := 0; i < 1000; i++ {
        go c.Inc()
    }

    time.Sleep(100 * time.Millisecond)

    if c.value != 1000 {
        t.Fatalf("expected 1000, got %d", c.value)
    }
}

This is broken in multiple ways:

  • Race condition
  • No synchronization
  • Sleep-based waiting

Correct implementation:

type Counter struct {
    mu    sync.Mutex
    value int
}

func (c *Counter) Inc() {
    c.mu.Lock()
    c.value++
    c.mu.Unlock()
}

func (c *Counter) Value() int {
    c.mu.Lock()
    defer c.mu.Unlock()
    return c.value
}

Deterministic test:

func TestCounter(t *testing.T) {
    c := &Counter{}
    var wg sync.WaitGroup

    for i := 0; i < 1000; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            c.Inc()
        }()
    }

    wg.Wait()

    if c.Value() != 1000 {
        t.Fatalf("expected 1000, got %d", c.Value())
    }
}

We wait on completion — not on time.


7. Always Run With the Race Detector

In CI:

go test -race ./...

The race detector:

  • Finds shared memory violations
  • Catches hidden concurrency bugs
  • Prevents production incidents

Flaky test + race warning = design issue.

Don’t ignore it.


8. Production Lessons Learned

From real systems:

  • Retry mechanisms caused flaky tests due to real timers.
  • Background workers leaked goroutines across tests.
  • Tests were slow because of accumulated sleeps.
  • CI flakiness reduced confidence in releases.

After introducing:

  • Event-based synchronization
  • Context-driven lifecycles
  • Clock abstraction
  • WaitGroup-based coordination

Results:

  • Flakiness dropped to zero
  • Test execution time reduced significantly
  • Confidence in concurrent systems increased

Concurrency is not the hard part.

Testing concurrency properly is.


Final Takeaways

If your concurrent tests:

  • Use time.Sleep
  • Depend on real wall-clock time
  • Don’t control goroutine shutdown
  • Ignore race detector warnings

You’re building nondeterminism into your system.

Production-grade Go systems require:

  • Explicit lifecycle control
  • Event-driven synchronization
  • Time abstraction
  • Deterministic state verification

That’s what separates toy concurrency from production concurrency.
